17  Advanced String Operations and Functions

Note: What This Chapter Covers

This chapter picks up where Chapter 16 left off and takes you into pattern-based string work. You will meet the base-R pattern-matching family (grep(), grepl(), regmatches(), sub(), gsub()) and their friendly counterparts in the stringr package (str_detect(), str_extract(), str_replace(), str_split()). You will learn just enough of the regular-expression language (character classes, quantifiers, anchors, groups, and alternation) to clean, extract, and validate real-world text, with small, runnable examples for each idea. You will see how to format numbers and dates as strings, how to handle encoding hazards, and the difference between “fixed” and “regex” matching. By the end of this chapter you will be able to pull structured information out of messy free text and bend it into the shape your analysis needs.

flowchart LR
    S["Text Input"] --> D["Detect <br> grepl / str_detect"]
    S --> X["Extract <br> regmatches / str_extract"]
    S --> R["Replace <br> sub / gsub / str_replace"]
    S --> SP["Split <br> strsplit / str_split"]
    D --> RE["Regular Expressions"]
    X --> RE
    R --> RE
    SP --> RE
    style S fill:#e3f2fd,stroke:#1976D2
    style D fill:#fff3e0,stroke:#F57C00
    style X fill:#fff3e0,stroke:#F57C00
    style R fill:#fff3e0,stroke:#F57C00
    style SP fill:#fff3e0,stroke:#F57C00
    style RE fill:#f3e5f5,stroke:#8E24AA


17.1 Why Regular Expressions?

Note: Core Concept: Describing Patterns, Not Literals

So far you have searched for exact strings with startsWith() and endsWith(), and sliced by position with substr(). Real text is rarely that regular. A regular expression (regex) is a tiny language for describing patterns (“a run of three digits”, “any character followed by a dot”, “an email-looking token”), which R’s pattern functions then find, extract, or replace.

Tip: Expert Insight: 80 Percent of Regex in Half a Page

Most practical regex work uses a small subset of the full language: a handful of character classes, three or four quantifiers, and two anchors. Start with that subset; only reach for look-arounds and non-greedy matching when the simple tools fall short.


17.2 Regex in Twenty Minutes

Note: The Building Blocks

Symbol               Matches
.                    Any single character except a newline.
\\d                  A digit 0-9.
\\D                  Anything that is not a digit.
\\s                  A whitespace character (space, tab, newline).
\\S                  Anything that is not whitespace.
\\w                  A word character (letters, digits, underscore).
\\W                  Anything that is not a word character.
[abc]                Any of a, b, or c.
[^abc]               Anything except a, b, or c.
[a-z], [A-Z], [0-9]  Ranges.

Quantifier  Meaning
*           Zero or more.
+           One or more.
?           Zero or one (optional).
{n}         Exactly n times.
{n,m}       Between n and m times.

Anchor / Group  Meaning
^               Start of the string.
$               End of the string.
(abc)           Grouping: treats abc as one unit and captures it for back-references.
a|b             Either a or b.
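The building blocks above can be tried out in a few lines. This is a minimal sketch using grepl() (covered in Section 17.3); the sample vector x is made up purely for illustration.

```r
# Made-up sample strings, chosen only to exercise the building blocks.
x <- c("cat", "cart", "ct", "Room 101", "a1b22")

has_digit   <- grepl("\\d", x)           # character class: any digit
c_any_t     <- grepl("^c.*t$", x)        # * allows zero characters between c and t
c_some_t    <- grepl("^c.+t$", x)        # + needs at least one character
three_digit <- grepl("\\d{3}", x)        # {3}: three digits in a row
cat_or_cart <- grepl("^(cat|cart)$", x)  # group plus alternation, anchored at both ends

print(has_digit)     # FALSE FALSE FALSE  TRUE  TRUE
print(c_any_t)       #  TRUE  TRUE  TRUE FALSE FALSE
print(c_some_t)      #  TRUE  TRUE FALSE FALSE FALSE
print(three_digit)   # FALSE FALSE FALSE  TRUE FALSE
print(cat_or_cart)   #  TRUE  TRUE FALSE FALSE FALSE
```

Comparing the second and third results shows the practical difference between * and +: only * accepts "ct", where nothing sits between the two letters.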

Warning: Double Backslashes Inside R Strings

R string literals themselves use \ as an escape character, so you have to write \\d in the source code to produce the regex \d. The pattern \. becomes "\\." in an R string. This is the single biggest source of “but my regex works on the website!” surprises.
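A short sketch makes the escaping rule concrete. It also shows raw strings, which assume R 4.0.0 or later; on older versions, leave out the last two lines.

```r
# "\\d" in source code is a two-character string: a backslash and a d.
pattern <- "\\d"
print(nchar(pattern))   # 2
cat(pattern, "\n")      # shows \d, which is what the regex engine receives

# The regex \. (a literal dot) is written "\\." in R source.
has_dot <- grepl("\\.", c("file.txt", "README"))
print(has_dot)          # TRUE FALSE

# Raw strings (assumes R >= 4.0.0) avoid the doubling altogether.
raw_pattern <- r"(\d)"
print(identical(pattern, raw_pattern))   # TRUE
```

When a pattern copied from a regex-testing website fails in R, printing it with cat() is a quick way to see what the engine actually receives.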


17.3 Detecting Matches

Note: grepl() and grep() in Base R

grepl() returns a logical vector the same length as the input. grep() returns the positions (or, with value = TRUE, the matching strings themselves).
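A minimal sketch of the three return shapes, using an invented vector of file names:

```r
# Hypothetical file names, for illustration only.
files <- c("report.csv", "notes.txt", "data_2023.csv", "README")

is_csv    <- grepl("\\.csv$", files)               # one TRUE/FALSE per element
csv_pos   <- grep("\\.csv$", files)                # positions of the matches
csv_names <- grep("\\.csv$", files, value = TRUE)  # the matching strings

print(is_csv)      # TRUE FALSE  TRUE FALSE
print(csv_pos)     # 1 3
print(csv_names)   # "report.csv"    "data_2023.csv"

# A logical vector works for subsetting and counting.
print(files[is_csv])   # same strings as csv_names
print(sum(is_csv))     # 2
```

As a rule of thumb, grepl() suits filtering and counting because its result lines up element-for-element with the input, while grep() is handy when you need the positions themselves.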

Note: stringr::str_detect(), the Tidyverse Twin

stringr functions always take the string first and the pattern second, always vectorise cleanly over both, and have consistent names. str_detect() is the tidyverse answer to grepl().
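The sketch below assumes the stringr package is installed; the requireNamespace() guard is there only so the snippet does nothing, instead of failing, when it is not. The file names are invented.

```r
# Assumption: stringr is installed; the guard skips the demo otherwise.
if (requireNamespace("stringr", quietly = TRUE)) {
  files <- c("report.csv", "notes.txt", "data_2023.csv", "README")

  # String first, pattern second.
  is_csv <- stringr::str_detect(files, "\\.csv$")
  print(is_csv)    # TRUE FALSE  TRUE FALSE

  # Vectorised over the pattern too: one string checked against three patterns.
  checks <- stringr::str_detect("apple", c("^a", "e$", "z"))
  print(checks)    # TRUE  TRUE FALSE
}
```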


17.4 Extracting Matches

Note: Pulling the Pattern Itself Out

Combined with regmatches(), regexpr() gives the first match in each string and gregexpr() gives all matches. In stringr, str_extract() and str_extract_all() do the same, more readably.
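A minimal base-R sketch on invented strings. Note one behaviour that is easy to miss: with regexpr(), regmatches() silently drops the elements that did not match, so the result can be shorter than the input.

```r
# Hypothetical free-text notes, for illustration only.
notes <- c("Order 66 shipped, 3 items", "no numbers here", "Call 911")

# First match per string; non-matching elements are dropped.
first_num <- regmatches(notes, regexpr("\\d+", notes))
print(first_num)   # "66"  "911"

# All matches per string; the result is a list with one entry per input.
all_nums <- regmatches(notes, gregexpr("\\d+", notes))
print(all_nums[[1]])          # "66" "3"
print(length(all_nums[[2]]))  # 0, because the second note has no digits
```

If you need the result to stay aligned with the input, stringr::str_extract() is often the more convenient choice, since it returns NA for strings without a match.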

Note: Capture Groups

Parentheses ( ... ) create groups you can later refer to. str_match() (and base R’s regmatches() with regexec()) returns the whole match plus each group.
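Here is a sketch of the base-R route, regmatches() with regexec(), on made-up date strings. The variable names are illustrative.

```r
# Hypothetical inputs: two ISO-style dates and one string that is not a date.
dates <- c("2024-03-15", "not a date", "1999-12-31")

# Three capture groups: year, month, day.
m     <- regexec("(\\d{4})-(\\d{2})-(\\d{2})", dates)
parts <- regmatches(dates, m)

print(parts[[1]])          # "2024-03-15" "2024" "03" "15"
print(length(parts[[2]]))  # 0, because there was no match

# Stack the successful matches into a matrix: column 1 is the whole match,
# columns 2 to 4 are the groups.
matched <- lengths(parts) > 0
mat     <- do.call(rbind, parts[matched])
years   <- mat[, 2]
print(years)               # "2024" "1999"
```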


17.5 Replacing Matches

Note: sub() vs gsub() in Base R

sub() replaces the first match; gsub() replaces all matches. Both are vectorised.
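The difference is easiest to see side by side. A minimal sketch on invented strings:

```r
# Made-up inputs; the last one contains no hyphen at all.
x <- c("a-b-c", "2024-03-15", "none")

first_only <- sub("-", "/", x)    # only the first hyphen in each string
every_one  <- gsub("-", "/", x)   # every hyphen

print(first_only)   # "a/b-c"      "2024/03-15" "none"
print(every_one)    # "a/b/c"      "2024/03/15" "none"

# A common cleanup: collapse runs of whitespace into a single space.
squeezed <- gsub("\\s+", " ", "too    many   spaces")
print(squeezed)     # "too many spaces"
```

Strings without a match come back unchanged, which is why both functions are safe to apply to a whole vector.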

Note: stringr::str_replace() and str_replace_all()

str_replace() replaces the first match and str_replace_all() replaces all matches.
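The same contrast in stringr, as a sketch that assumes the package is installed (the guard simply skips the demo if it is not):

```r
# Assumption: stringr is installed; the guard skips the demo otherwise.
if (requireNamespace("stringr", quietly = TRUE)) {
  x <- c("a-b-c", "2024-03-15", "none")

  first_only <- stringr::str_replace(x, "-", "/")      # first match only
  every_one  <- stringr::str_replace_all(x, "-", "/")  # all matches

  print(first_only)   # "a/b-c"      "2024/03-15" "none"
  print(every_one)    # "a/b/c"      "2024/03/15" "none"
}
```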

Note: Using Back-References in the Replacement

Parentheses in the pattern capture groups that you can reference in the replacement string as \\1, \\2, and so on.
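Two small sketches of back-references in base R. The names and the date are invented.

```r
# Swap "Surname, Given" into "Given Surname".
people  <- c("Lovelace, Ada", "Turing, Alan")
swapped <- sub("^(\\w+), (\\w+)$", "\\2 \\1", people)
print(swapped)   # "Ada Lovelace" "Alan Turing"

# Reorder an ISO date (year-month-day) into day/month/year.
reordered <- gsub("(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1", "Due 2024-03-15")
print(reordered) # "Due 15/03/2024"
```

Groups are numbered by the position of their opening parenthesis, counting from the left.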


17.6 Splitting with a Pattern

Note: strsplit() and str_split()

Both take a regex as the separator. Use fixed = TRUE (base) or stringr::fixed() when you really want a literal split on a special character such as the dot (.).

Warning: Common Mistake: Forgetting That . Is a Regex Metacharacter

. matches any character, not only a literal period. To split on a literal dot, either escape it (\\.) or use fixed matching.
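The following base-R sketch shows the mistake and both fixes on an invented version string, plus a split on a character class.

```r
version <- "1.4.2"   # hypothetical version string

# Unescaped dot: every character counts as a separator, so only
# empty strings come back.
unescaped <- strsplit(version, ".")[[1]]
print(unescaped)

# Fix 1: escape the dot. Fix 2: ask for fixed (literal) matching.
escaped <- strsplit(version, "\\.")[[1]]
literal <- strsplit(version, ".", fixed = TRUE)[[1]]
print(escaped)   # "1" "4" "2"
print(literal)   # "1" "4" "2"

# A character class splits on any one of several separators.
stamp_parts <- strsplit("2024-03-15 10:30", "[- :]")[[1]]
print(stamp_parts)   # "2024" "03" "15" "10" "30"
```

strsplit() always returns a list with one element per input string, which is why each call above ends in [[1]].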


17.7 “Fixed” vs “Regex” Matching

Note: When You Do Not Want Pattern Semantics

The base-R pattern functions (grep(), grepl(), sub(), gsub(), regexpr(), gregexpr(), strsplit()) accept fixed = TRUE, which treats the pattern as a plain string, not a regex. In stringr, wrap the pattern in the fixed("...") helper instead of passing it bare. Choose fixed matching when your pattern contains regex metacharacters (such as ., +, *, (, [, ?, \) and you want them taken literally.
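A minimal sketch of how the same pattern text behaves under the two modes. The inputs are made up.

```r
x <- c("a+b", "aab", "3.14")   # invented test strings

# As a regex, a+b means "one or more a, then b".
regex_hit <- grepl("a+b", x)
# As a fixed string, it means the three characters a, +, b.
fixed_hit <- grepl("a+b", x, fixed = TRUE)

print(regex_hit)   # FALSE  TRUE FALSE
print(fixed_hit)   #  TRUE FALSE FALSE

# The same trap with replacement: a regex dot deletes everything.
all_gone  <- gsub(".", "", "3.14")
dots_gone <- gsub(".", "", "3.14", fixed = TRUE)
print(all_gone)    # ""
print(dots_gone)   # "314"
```

Notice that the regex and fixed results for "a+b" are exact opposites on the first two inputs, so choosing the wrong mode here gives answers that are wrong without any error or warning.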


17.8 Padding and Alignment

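Note: formatC() and stringr::str_pad()

Neat fixed-width output is a common need: invoice numbers, report columns, console progress indicators. In base R, formatC() pads to a given width, and its flag argument selects zero-padding or left alignment; stringr::str_pad() pads a string to a width on the left, right, or both sides.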
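A short sketch of both tools. The invoice numbers and item names are invented, and the str_pad() line runs only if stringr happens to be installed.

```r
# Zero-padded invoice numbers (hypothetical IDs).
invoice_ids <- c(7L, 42L, 1234L)
padded_ids  <- formatC(invoice_ids, width = 6, format = "d", flag = "0")
print(padded_ids)      # "000007" "000042" "001234"

# Text columns: right-aligned by default, left-aligned with flag = "-".
items         <- c("apple", "fig")
right_aligned <- formatC(items, width = 8)
left_aligned  <- formatC(items, width = 8, flag = "-")
print(right_aligned)   # "   apple" "     fig"
print(left_aligned)    # "apple   " "fig     "

# stringr equivalent of the zero-padding above (assumes stringr is installed).
if (requireNamespace("stringr", quietly = TRUE)) {
  print(stringr::str_pad(c("7", "42"), width = 6, side = "left", pad = "0"))
}
```

The format = "d" argument tells formatC() to treat the values as integers; without it, large or fractional numbers may be formatted in a different style.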


17.9 A Small Regex Cookbook

Note: Patterns You Will Reuse

Task                           Pattern                     Notes
Email (rough)                  ^\\S+@\\S+\\.\\S+$          Fine for basic validation; the full RFC 5322 address grammar is far more involved.
Indian phone (rough)           ^(\\+91[- ]?)?[6-9]\\d{9}$  Optional +91 prefix, then 10 digits starting with 6-9.
4-digit year                   \\b\\d{4}\\b                Word boundaries stop it from matching inside longer numbers.
Integer                        ^-?\\d+$                    Optional minus, then digits only.
Decimal                        ^-?\\d+(\\.\\d+)?$          Optional minus, integer part, optional decimal part.
Leading / trailing whitespace  ^\\s+|\\s+$                 Use instead of trimws() when you need to report where the whitespace was.
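It is worth testing each cookbook pattern against a few inputs you expect to pass and a few you expect to fail. The sketch below does that in base R; every test string is invented.

```r
email   <- "^\\S+@\\S+\\.\\S+$"
phone   <- "^(\\+91[- ]?)?[6-9]\\d{9}$"
year    <- "\\b\\d{4}\\b"
integer <- "^-?\\d+$"
decimal <- "^-?\\d+(\\.\\d+)?$"
edge_ws <- "^\\s+|\\s+$"

email_ok <- grepl(email, c("ada@example.com", "not-an-email", "a@b"))
print(email_ok)      # TRUE FALSE FALSE

phone_ok <- grepl(phone, c("9876543210", "+91 9876543210", "1234567890", "98765"))
print(phone_ok)      # TRUE  TRUE FALSE FALSE

# The word boundaries keep 123456 from contributing a false "year".
text  <- "Founded in 1998, ID 123456, revised 2021"
years <- regmatches(text, gregexpr(year, text))[[1]]
print(years)         # "1998" "2021"

integer_ok <- grepl(integer, c("42", "-7", "3.5", ""))
decimal_ok <- grepl(decimal, c("42", "-7", "3.5", ".5", "1."))
print(integer_ok)    # TRUE  TRUE FALSE FALSE
print(decimal_ok)    # TRUE  TRUE  TRUE FALSE FALSE

# Detect stray edge whitespace, then remove it.
has_edge_ws <- grepl(edge_ws, c("  padded", "clean", "trailing "))
trimmed     <- gsub(edge_ws, "", "  both  ")
print(has_edge_ws)   # TRUE FALSE  TRUE
print(trimmed)       # "both"
```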

17.10 A Worked Example: Parsing Free-Text Transactions

Note: Extracting Structure from a Prose Log

Every technique from the chapter is at work: str_detect() to validate each line, str_match() with capture groups to tear the record apart, gsub() to strip commas, and a final explicit cast to numeric.
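The pipeline can be sketched as follows. This is not the chapter's original listing: the log lines, the pattern, and all variable names are invented for illustration, and the sketch swaps in the base-R equivalents grepl() and regmatches() with regexec() for str_detect() and str_match() so that it runs without any packages.

```r
# Hypothetical log lines, invented for this sketch.
log_lines <- c(
  "2024-03-01: Paid Rs. 1,250.50 to Sharma Traders",
  "2024-03-02: Received Rs. 12,000 from Meera Textiles",
  "Note: office closed on Sunday",
  "2024-03-04: Paid Rs. 980 to City Power"
)

# Groups: 1 date, 2 direction, 3 amount, 4 to/from, 5 counterparty.
pattern <- paste0(
  "^(\\d{4}-\\d{2}-\\d{2}): (Paid|Received) Rs\\. ",
  "([0-9][0-9,.]*) (to|from) (.+)$"
)

# Step 1: validate each line (base-R stand-in for str_detect()).
is_record <- grepl(pattern, log_lines)
records   <- log_lines[is_record]

# Step 2: tear each record apart with capture groups
# (base-R stand-in for str_match()). Column 1 is the whole match.
pieces <- regmatches(records, regexec(pattern, records))
fields <- do.call(rbind, pieces)

# Step 3: strip the commas, then cast explicitly.
dates   <- as.Date(fields[, 2])
kinds   <- fields[, 3]
amounts <- as.numeric(gsub(",", "", fields[, 4], fixed = TRUE))
parties <- fields[, 6]

print(is_record)   # TRUE  TRUE FALSE  TRUE
print(amounts)     # 1250.5 12000.0   980.0
print(parties)     # "Sharma Traders" "Meera Textiles" "City Power"
```

Validating before extracting means a stray line such as the "Note:" entry is set aside up front and never becomes a row of missing values further down.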


17.11 Summary

Note: Key Concepts at a Glance

Concept                      Key Takeaway
Regex is a pattern language  Describe shapes of text, not exact strings.
Double backslashes           Write \\d, \\s, \\. in R source; the string parser consumes one backslash.
Detect                       grepl() / str_detect() return logical vectors.
Extract                      regmatches() / str_extract() / str_match() pull matches and groups.
Replace                      sub() / gsub() / str_replace[_all](); use \\1 for back-references.
Split                        strsplit() / str_split() accept regex; beware the unescaped dot (.).
Fixed vs regex               Use fixed = TRUE or fixed(...) for literal matches.
Padding                      formatC() / str_pad() produce fixed-width strings.
Cookbook patterns            A handful of small patterns covers most everyday cleanup.

Tip: Applying This in Practice

Regex is one of those skills that pays back every hour you invest. Start small (detect first, then extract) and grow your own cookbook of patterns that cover your domain. Use stringr when consistency and clean names matter, and base R when you want zero dependencies. In the next chapter you will move from strings to the workhorse 2-D container for mixed-type data: the data frame.