flowchart LR
S["Text Input"] --> D["Detect <br> grepl / str_detect"]
S --> X["Extract <br> regmatches / str_extract"]
S --> R["Replace <br> sub / gsub / str_replace"]
S --> SP["Split <br> strsplit / str_split"]
D --> RE["Regular Expressions"]
X --> RE
R --> RE
SP --> RE
style S fill:#e3f2fd,stroke:#1976D2
style D fill:#fff3e0,stroke:#F57C00
style X fill:#fff3e0,stroke:#F57C00
style R fill:#fff3e0,stroke:#F57C00
style SP fill:#fff3e0,stroke:#F57C00
style RE fill:#f3e5f5,stroke:#8E24AA
17 Advanced String Operations and Functions
This chapter picks up where Chapter 16 left off and takes you into pattern-based string work. You will meet the base-R pattern-matching family (grep(), grepl(), regmatches(), sub(), gsub()) and their friendly counterparts in the stringr package (str_detect(), str_extract(), str_replace(), str_split()). You will learn just enough regular expressions to clean, extract, and validate real-world text (character classes, quantifiers, anchors, groups, and alternation), with small, runnable examples for each idea. You will see how to format numbers and dates as strings, how to handle encoding hazards, and the difference between “fixed” and “regex” matching. By the end of this chapter you will be able to pull structured information out of messy free text and bend it into the shape your analysis needs.
17.1 Why Regular Expressions?
So far you have searched for exact strings with startsWith() and endsWith(), and sliced by position with substr(). Real text is rarely that regular. A regular expression (regex) is a tiny language for describing patterns such as “a word of three digits”, “any character followed by a dot”, or “an email-looking token”, which R’s pattern functions then find, extract, or replace.
Most practical regex work uses a small subset of the full language: a handful of character classes, three or four quantifiers, and two anchors. Start with that subset; only reach for look-arounds and non-greedy matching when the simple tools fall short.
17.2 Regex in Twenty Minutes
| Symbol | Matches |
|---|---|
| . | Any single character except a newline. |
| \\d | A digit 0-9. |
| \\D | Anything that is not a digit. |
| \\s | A whitespace character (space, tab, newline). |
| \\S | Anything that is not whitespace. |
| \\w | A word character (letters, digits, underscore). |
| \\W | Anything that is not a word character. |
| [abc] | Any of a, b, or c. |
| [^abc] | Anything except a, b, or c. |
| [a-z], [A-Z], [0-9] | Ranges. |
| Quantifier | Meaning |
|---|---|
| * | Zero or more. |
| + | One or more. |
| ? | Zero or one (optional). |
| {n} | Exactly n times. |
| {n,m} | Between n and m times. |
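A short sketch can make the character classes and quantifiers above concrete. The sample strings are hypothetical and chosen only to exercise each symbol; the calls use base R's grepl(), which is covered in the next section.

```r
# Hypothetical sample strings, chosen only to exercise the symbols above.
x <- c("abc", "a1", "2024", "id_42", "  ")

has_digit   <- grepl("\\d", x)        # contains at least one digit
only_lower  <- grepl("^[a-z]+$", x)   # one or more lower-case letters, nothing else
four_digits <- grepl("^\\d{4}$", x)   # exactly four digits
only_word   <- grepl("^\\w+$", x)     # word characters only
blank       <- grepl("^\\s*$", x)     # zero or more whitespace characters only

has_digit    # FALSE  TRUE  TRUE  TRUE FALSE
four_digits  # FALSE FALSE  TRUE FALSE FALSE
```

Changing a quantifier (say, {4} to {2,4}) and re-running is a quick way to see how each one widens or narrows a match.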
| Anchor / Group | Meaning |
|---|---|
| ^ | Start of the string. |
| $ | End of the string. |
| (abc) | Grouping: treats abc as one unit and captures it for back-references. |
| a\|b | Either a or b. |
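Anchors, groups, and alternation are easiest to read side by side. The following is a minimal sketch with made-up file names; nothing here comes from a real dataset.

```r
# Hypothetical file names used only for illustration.
files <- c("report.csv", "csv_notes.txt", "data.CSV", "image.png")

ends_csv   <- grepl("\\.csv$", files)        # $ pins the match to the end
starts_csv <- grepl("^csv", files)           # ^ pins the match to the start
csv_or_txt <- grepl("\\.(csv|txt)$", files)  # alternation inside a group

# A group followed by a quantifier repeats the whole unit.
twice_ab <- grepl("^(ab){2}$", c("abab", "ab", "ababab"))

ends_csv    # TRUE FALSE FALSE FALSE
csv_or_txt  # TRUE  TRUE FALSE FALSE
twice_ab    # TRUE FALSE FALSE
```

Note that "data.CSV" is not matched: patterns are case-sensitive unless you ask otherwise (for example with ignore.case = TRUE in grepl()).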
R string literals themselves use \ as an escape character, so you have to write \\d in the source code to produce the regex \d. The pattern \. becomes "\\." in an R string. This is the single biggest source of “but my regex works on the website!” surprises.
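To see the escaping rule in action, it helps to print what R actually stores. The sketch below assumes R 4.0.0 or later for the raw-string syntax r"(...)"; on older versions only the double-backslash form is available.

```r
# In R source you type two backslashes; the stored string holds one.
pattern <- "\\d+\\.\\d+"
writeLines(pattern)          # prints \d+\.\d+
n_chars <- nchar(pattern)    # 8

# Assumption: R >= 4.0.0, where raw strings avoid the doubling.
raw_pattern <- r"(\d+\.\d+)"
same <- identical(pattern, raw_pattern)   # TRUE

sentence <- "pi is about 3.14"
hit <- regmatches(sentence, regexpr(pattern, sentence))   # "3.14"
```

writeLines() is a handy debugging step: it shows the pattern the regex engine receives, which is what online regex testers expect you to paste in.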
17.3 Detecting Matches
grepl() and grep() in Base R
grepl() returns a logical vector the same length as the input. grep() returns the indices of the matching elements (or, with value = TRUE, the matching strings themselves).
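As a minimal sketch of the difference, with a hypothetical vector of fruit names:

```r
# Hypothetical input vector.
fruits <- c("apple", "banana", "cherry", "avocado")

has_an    <- grepl("an", fruits)               # one TRUE/FALSE per element
a_index   <- grep("^a", fruits)                # indices of matching elements
a_values  <- grep("^a", fruits, value = TRUE)  # the matching strings

has_an    # FALSE  TRUE FALSE FALSE
a_index   # 1 4
a_values  # "apple"   "avocado"

# A logical vector is convenient for subsetting.
fruits[has_an]   # "banana"
```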
stringr::str_detect(), the Tidyverse Twin
stringr functions always take the string first and the pattern second, always vectorise cleanly over both, and have consistent names. str_detect() is the tidyverse answer to grepl().
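A small sketch of str_detect() next to grepl(), assuming the stringr package is installed. The invoice codes are hypothetical.

```r
library(stringr)  # assumption: stringr is installed

# Hypothetical codes, including a missing value.
codes <- c("INV-001", "ord-17", "INV-2024", NA)

tidy_hits <- str_detect(codes, "^INV-\\d+$")  # string first, pattern second
base_hits <- grepl("^INV-\\d+$", codes)       # pattern first, string second

tidy_hits  # TRUE FALSE  TRUE    NA
base_hits  # TRUE FALSE  TRUE FALSE

# Vectorised over the pattern as well: element i is tested against pattern i.
pairwise <- str_detect(c("a1", "b2"), c("\\d", "^a"))   # TRUE FALSE
```

One practical difference shows up in the output: str_detect() propagates NA, while grepl() reports FALSE for a missing value.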
17.4 Extracting Matches
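regmatches() with regexpr() returns the first match from each string that has one (strings without a match are dropped); with gregexpr() it returns all matches, as a list with one element per string. In stringr, str_extract() and str_extract_all() do the same, more readably.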
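The base-R pairing can be sketched as follows, using hypothetical notes that contain zero, one, or two numbers each.

```r
# Hypothetical free-text notes.
notes <- c("order 12 shipped, 3 pending", "no numbers here", "call 555 now")

# First match per string; elements without a match are dropped.
first_nums <- regmatches(notes, regexpr("\\d+", notes))

# All matches per string; the result is a list as long as the input.
all_nums <- regmatches(notes, gregexpr("\\d+", notes))

first_nums       # "12"  "555"
all_nums[[1]]    # "12" "3"
all_nums[[2]]    # character(0)
```

Because the first form drops non-matching elements, its result can be shorter than the input. str_extract() avoids that by returning NA for strings without a match.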
Parentheses ( ... ) create groups you can later refer to. str_match() (and base R’s regmatches() with regexec()) returns the whole match plus each group.
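A minimal base-R sketch of capture groups, assuming hypothetical ISO-style date strings:

```r
# Hypothetical inputs: one well-formed date, one string that will not match.
dates <- c("2024-03-15", "not a date")

# Three groups: year, month, day.
date_pattern <- "(\\d{4})-(\\d{2})-(\\d{2})"
groups <- regmatches(dates, regexec(date_pattern, dates))

groups[[1]]      # "2024-03-15" "2024" "03" "15"
groups[[2]]      # character(0)

year <- groups[[1]][2]   # first element is the whole match, so group 1 is at index 2
```

stringr's str_match() returns the same information as a character matrix, with one row per input string and NA in the rows that do not match.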
17.5 Replacing Matches
sub() vs gsub() in Base R
sub() replaces the first match; gsub() replaces all matches. Both are vectorised.
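The difference is easiest to see on a string with several matches. The inputs below are hypothetical.

```r
# Hypothetical inputs.
x <- c("a-b-c", "no dashes")

first_only <- sub("-", "+", x)    # only the first dash in each string
every_one  <- gsub("-", "+", x)   # every dash

first_only   # "a+b-c"     "no dashes"
every_one    # "a+b+c"     "no dashes"

# A common cleanup: collapse runs of whitespace into a single space.
squeezed <- gsub("\\s+", " ", "too    many   spaces")   # "too many spaces"
```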
stringr::str_replace() and str_replace_all()
str_replace() replaces the first match and str_replace_all() replaces all matches.
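The stringr pair behaves the same way, with the string as the first argument. This sketch assumes stringr is installed and uses hypothetical date strings.

```r
library(stringr)  # assumption: stringr is installed

# Hypothetical date strings.
d <- c("2024-01-05", "1999-12-31")

first_only <- str_replace(d, "-", "/")       # first match per string
every_one  <- str_replace_all(d, "-", "/")   # all matches

first_only   # "2024/01-05" "1999/12-31"
every_one    # "2024/01/05" "1999/12/31"
```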
Parentheses in the pattern capture groups that you can reference in the replacement string as \\1, \\2, and so on.
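Back-references are handy for reordering pieces of a string. A minimal sketch with hypothetical names and a date:

```r
# Hypothetical "Last, First" names.
people <- c("Lovelace, Ada", "Turing, Alan")

# Group 1 is the surname, group 2 the given name; swap them.
swapped <- sub("^(\\w+), (\\w+)$", "\\2 \\1", people)
swapped   # "Ada Lovelace" "Alan Turing"

# Reorder year-month-day into day/month/year.
dmy <- gsub("(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1", "2024-03-15")
dmy       # "15/03/2024"
```

The same replacement strings work unchanged in str_replace() and str_replace_all().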
17.6 Splitting with a Pattern
strsplit() and str_split()
Both take a regex as the separator. Use fixed = TRUE (base) or stringr::fixed() when you really want a literal split on a special character such as a dot (.).
. Is a Regex Metacharacter
. matches any character, not only a literal period. To split on a literal dot, either escape it (\\.) or use fixed matching.
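The pitfall and its two fixes can be sketched with a hypothetical version string:

```r
# Hypothetical version string.
v <- "1.2.10"

# Unescaped dot: every character counts as a separator, so only empty pieces remain.
wrong <- strsplit(v, ".")[[1]]

# Two ways to split on a literal dot.
escaped <- strsplit(v, "\\.")[[1]]            # "1"  "2"  "10"
literal <- strsplit(v, ".", fixed = TRUE)[[1]] # "1"  "2"  "10"

# A genuine regex separator: comma or semicolon, then optional whitespace.
fields <- strsplit("a, b;c", "[,;]\\s*")[[1]]  # "a" "b" "c"
```

strsplit() always returns a list, one element per input string, which is why each call above ends in [[1]].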
17.7 “Fixed” vs “Regex” Matching
The base-R pattern functions (grep(), grepl(), sub(), gsub(), regexpr(), gregexpr(), and strsplit()) take a fixed = TRUE argument that treats the pattern as a plain string, not a regex. In stringr, use the fixed("...") helper in place of a bare pattern. Choose fixed matching when your pattern contains regex metacharacters (such as ., +, *, (, [, ?, or \) and you want them taken literally.
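A short sketch shows how the same pattern text behaves under each mode. The inputs are hypothetical.

```r
# Hypothetical inputs.
x <- c("a+b", "aab", "a.b")

# As a regex, "a+b" means "one or more a, then b".
regex_hits <- grepl("a+b", x)                 # FALSE  TRUE FALSE

# As a fixed string, it means the three characters a, +, b.
fixed_hits <- grepl("a+b", x, fixed = TRUE)   #  TRUE FALSE FALSE

# The same contrast for replacement.
regex_sub <- sub(".", "!", "a.b")                 # "!.b"  (dot matched the first character)
fixed_sub <- sub(".", "!", "a.b", fixed = TRUE)   # "a!b"  (dot matched the literal dot)
```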
17.8 Padding and Alignment
formatC() and stringr::str_pad()
Neat fixed-width output is a common need for invoice numbers, report columns, and console-based progress indicators.
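The following sketch pads hypothetical invoice numbers and column labels with both tools. The stringr::str_pad() calls assume the stringr package is installed.

```r
# Hypothetical invoice numbers (integers).
ids <- c(7L, 42L, 1234L)

# Zero-pad to six characters with formatC().
zero_padded <- formatC(ids, width = 6, flag = "0", format = "d")
zero_padded   # "000007" "000042" "001234"

# Fixed number of decimals, right-aligned in eight characters.
amount <- formatC(3.14159, digits = 2, format = "f", width = 8)
amount        # "    3.14"

# Assumption: stringr is installed. str_pad() works on character input.
padded_left  <- stringr::str_pad(c("7", "42"), width = 6, side = "left", pad = "0")
padded_right <- stringr::str_pad(c("Item", "Qty"), width = 8, side = "right")
padded_left   # "000007" "000042"
padded_right  # "Item    " "Qty     "
```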
17.9 A Small Regex Cookbook
| Task | Pattern | Notes |
|---|---|---|
| Email (rough) | ^\\S+@\\S+\\.\\S+$ | Fine for basic validation; the real RFC is much stricter. |
| Indian phone (rough) | ^(\\+91[- ]?)?[6-9]\\d{9}$ | Optional +91 prefix, 10 digits starting 6-9. |
| 4-digit year | \\b\\d{4}\\b | Word boundaries stop it from matching inside longer numbers. |
| Integer | ^-?\\d+$ | Optional minus, then digits only. |
| Decimal | ^-?\\d+(\\.\\d+)?$ | Optional minus, integer part, optional decimal part. |
| Leading / trailing whitespace | ^\\s+\|\\s+$ | Use instead of trimws() when you need to report where the whitespace was. |
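To check that the cookbook patterns behave as described, here is a minimal sketch that wraps a few of them in small helper functions. The helper names and all test inputs are hypothetical.

```r
# Hypothetical helpers wrapping the cookbook patterns.
is_integer_text <- function(x) grepl("^-?\\d+$", x)
is_decimal_text <- function(x) grepl("^-?\\d+(\\.\\d+)?$", x)
is_email_rough  <- function(x) grepl("^\\S+@\\S+\\.\\S+$", x)
is_phone_rough  <- function(x) grepl("^(\\+91[- ]?)?[6-9]\\d{9}$", x)

ints   <- is_integer_text(c("42", "-7", "4.2", ""))        # TRUE  TRUE FALSE FALSE
decs   <- is_decimal_text(c("42", "-7.5", "7.", ".5"))     # TRUE  TRUE FALSE FALSE
emails <- is_email_rough(c("a@b.co", "a@b", "a b@c.d"))    # TRUE FALSE FALSE
phones <- is_phone_rough(c("+91 9876543210", "9876543210", "1234567890"))  # TRUE TRUE FALSE

# The year pattern is unanchored, so use it for extraction rather than validation.
text  <- "Founded in 1998, ticket 123456"
years <- regmatches(text, gregexpr("\\b\\d{4}\\b", text))[[1]]   # "1998"

# Stripping leading and trailing whitespace with the last pattern.
trimmed <- gsub("^\\s+|\\s+$", "", "  padded  ")                 # "padded"
```

Keeping each pattern behind a named helper makes it easier to tighten the pattern later without touching the calling code.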
17.10 A Worked Example: Parsing Free-Text Transactions
Every technique from the chapter is at work: str_detect() to validate each line, str_match() with capture groups to tear the record apart, gsub() to strip commas, and a final explicit cast to numeric.
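As a minimal sketch of such a pipeline, assume hypothetical transaction lines of the form “DATE PAYEE Rs AMOUNT”. The input lines, the line format, and the variable names are all assumptions made for illustration, and stringr is assumed to be installed.

```r
library(stringr)  # assumption: stringr is installed

# Hypothetical raw lines; the "DATE PAYEE Rs AMOUNT" layout is an assumption.
raw <- c(
  "2024-03-15 Grocery Mart Rs 1,250.50",
  "2024-03-16 Metro Card Rs 500",
  "garbled line without amount"
)

# Three capture groups: date, payee, amount (digits, commas, and dots).
txn_pattern <- "^(\\d{4}-\\d{2}-\\d{2}) (.+) Rs ([0-9,.]+)$"

# 1. Validate each line.
valid <- str_detect(raw, txn_pattern)

# 2. Tear the valid records apart; column 1 is the whole match.
parts <- str_match(raw[valid], txn_pattern)
txn_date <- parts[, 2]
payee    <- parts[, 3]

# 3. Strip thousands separators, then cast explicitly to numeric.
amount <- as.numeric(gsub(",", "", parts[, 4], fixed = TRUE))

valid    # TRUE  TRUE FALSE
payee    # "Grocery Mart" "Metro Card"
amount   # 1250.5  500.0
```

Validating before extracting means the malformed third line is reported by valid rather than silently turning into a row of NA values.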
17.11 Summary
| Concept | Key Takeaway |
|---|---|
| Regex is a pattern language | Describe shapes of text, not exact strings. |
| Double backslashes | Write \\d, \\s, \\. in R source; R strings consume one backslash. |
| Detect | grepl() / str_detect() return logical vectors. |
| Extract | regmatches() / str_extract() / str_match() pull matches and groups. |
| Replace | sub() / gsub() / str_replace[_all](); use \\1 for back-references. |
| Split | strsplit() / str_split() accept regex; beware the unescaped dot (.). |
| Fixed vs regex | Use fixed = TRUE or fixed(...) for literal matches. |
| Padding | formatC() / str_pad() produce fixed-width strings. |
| Cookbook patterns | A handful of small patterns covers most everyday cleanup. |
Regex is one of those skills that pays back every hour you invest. Start small (detect first, then extract) and grow your own cookbook of patterns that cover your domain. Use stringr when consistency and clean names matter, and base R when you want zero dependencies. In the next chapter you will move from strings to the workhorse 2-D container for mixed-type data: the data frame.