17 Advanced String Operations and Functions

What This Chapter Covers

This chapter picks up where Chapter 16 left off and takes you into pattern-based string work. You will meet the base-R pattern-matching family (grep(), grepl(), regmatches(), sub(), gsub()) and their friendly counterparts in the stringr package (str_detect(), str_extract(), str_replace(), str_split()). You will learn just enough regular expressions to clean, extract, and validate real-world text: character classes, quantifiers, anchors, groups, and alternation, with small, runnable examples for each idea. You will see how to format numbers and dates as strings, how to handle encoding hazards, and how “fixed” matching differs from “regex” matching. By the end of this chapter you will be able to pull structured information out of messy free text and bend it into the shape your analysis needs.

flowchart LR
    S["Text Input"] --> D["Detect <br> grepl / str_detect"]
    S --> X["Extract <br> regmatches / str_extract"]
    S --> R["Replace <br> sub / gsub / str_replace"]
    S --> SP["Split <br> strsplit / str_split"]
    D --> RE["Regular Expressions"]
    X --> RE
    R --> RE
    SP --> RE
    classDef default fill:#2e4057,color:#ffffff,stroke:#ff9933,stroke-width:3px,rx:10px,ry:10px;

17.1 Why Regular Expressions?

Core Concept: Describing Patterns, Not Literals

So far you have searched for exact strings with startsWith() and endsWith(), and sliced by position with substr(). Real text is rarely that regular. A regular expression (regex) is a tiny language for describing patterns such as “a run of three digits”, “any character followed by a dot”, or “an email-looking token”, which R’s pattern functions then find, extract, or replace.

Expert Insight: 80 Percent of Regex in Half a Page

Most practical regex work uses a small subset of the full language: a handful of character classes, three or four quantifiers, and two anchors. Start with that subset; only reach for look-arounds and non-greedy matching when the simple tools fall short.

17.2 Regex in Twenty Minutes

The Building Blocks

Symbol	Matches
`.`	Any single character. stringr and `perl = TRUE` exclude newlines by default.
`\\d`	A digit 0-9.
`\\D`	Anything that is not a digit.
`\\s`	A whitespace character (space, tab, newline).
`\\S`	Anything that is not whitespace.
`\\w`	A word character (letters, digits, underscore).
`\\W`	Anything that is not a word character.
`[abc]`	Any of `a`, `b`, or `c`.
`[^abc]`	Anything except `a`, `b`, or `c`.
`[a-z]`, `[A-Z]`, `[0-9]`	Ranges.

Quantifier	Meaning
`*`	Zero or more.
`+`	One or more.
`?`	Zero or one (optional).
`{n}`	Exactly `n` times.
`{n,m}`	Between `n` and `m` times.

Anchor / Group	Meaning
`^`	Start of the string.
`$`	End of the string.
`(abc)`	Grouping: treats `abc` as one unit and captures it for back-references.
`a|b`	Either `a` or `b`.
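The tables above can be exercised with a few calls to grepl(). This is a minimal sketch using base R only; the sample strings in `x` are hypothetical and were chosen so that each pattern accepts a different subset.

```r
# Hypothetical sample strings, chosen only to exercise the symbols above.
x <- c("cat", "cut", "c4t", "ct", "coat", "scatter")

any_char  <- grepl("c.t", x)           # one arbitrary character between c and t
class_au  <- grepl("c[au]t", x)        # only a or u between c and t
one_digit <- grepl("c\\dt", x)         # a digit between c and t
optional  <- grepl("^c.?t$", x)        # anchored: c, at most one character, t
two_lower <- grepl("^c[a-z]{2}t$", x)  # anchored: c, exactly two lowercase letters, t
either    <- grepl("^(cat|cut)$", x)   # anchored alternation inside a group

# Show the strings each pattern accepts.
x[any_char]    # "cat" "cut" "c4t" "scatter"
x[optional]    # "cat" "cut" "c4t" "ct"
x[two_lower]   # "coat"
```

Note how the unanchored patterns accept "scatter" because a match anywhere in the string counts, while the anchored ones must account for the whole string.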

Double Backslashes Inside R Strings

R string literals themselves use \ as an escape character, so you have to write \\d in the source code to produce the regex \d. The pattern \. becomes "\\." in an R string. This is the single biggest source of “but my regex works on the website!” surprises.
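A short sketch makes the two layers of escaping visible. It assumes R 4.0.0 or later for the raw-string syntax at the end; everything else works in older versions too.

```r
pattern <- "\\d+"   # what you type in R source code

n_chars <- nchar(pattern)   # 3: a backslash, the letter d, and a plus sign
cat(pattern, "\n")          # prints what the regex engine receives: \d+

# Raw strings (R >= 4.0.0) let you write the regex without doubling.
raw_pattern <- r"(\d+)"
same <- identical(pattern, raw_pattern)   # TRUE

# A literal dot needs a regex-level escape, hence two backslashes in source.
has_dot <- grepl("\\.", c("v1.2", "v12"))   # TRUE FALSE
```

Printing a pattern with cat() is a quick way to check what the regex engine will actually see when a pattern copied from elsewhere misbehaves.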

17.3 Detecting Matches

grepl() and grep() in Base R

grepl() returns a logical vector the same length as the input. grep() returns the positions (or, with value = TRUE, the matching strings themselves).
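The three return shapes are easiest to compare side by side. The product codes below are hypothetical.

```r
# Hypothetical product codes for illustration.
codes <- c("AB-101", "xy-202", "AB-9", "CD-303")

has_ab <- grepl("^AB-", codes)                # logical, one value per element
ab_pos <- grep("^AB-", codes)                 # integer positions of the matches
ab_val <- grep("^AB-", codes, value = TRUE)   # the matching strings themselves

has_ab   # TRUE FALSE TRUE FALSE
ab_pos   # 1 3
ab_val   # "AB-101" "AB-9"

# ignore.case = TRUE relaxes letter case for either function.
any_xy <- grepl("^XY-", codes, ignore.case = TRUE)   # FALSE TRUE FALSE FALSE
```

Because grepl() keeps the input length, it is usually the better fit for subsetting and for use inside ifelse(); grep() is handy when you want indices or the values directly.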


stringr::str_detect(), the Tidyverse Twin

stringr functions always take the string first and the pattern second, always vectorise cleanly over both, and have consistent names. str_detect() is the tidyverse answer to grepl().
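A minimal sketch of str_detect(), assuming the stringr package is installed. The codes are hypothetical, and the NA is included to show one place where the behaviour differs from grepl().

```r
library(stringr)

# Hypothetical product codes, with one missing value.
codes <- c("AB-101", "xy-202", "AB-9", NA)

strict <- str_detect(codes, "^AB-\\d{3}$")         # TRUE FALSE FALSE NA
not_ab <- str_detect(codes, "^AB-", negate = TRUE) # FALSE TRUE FALSE NA

# Vectorised over the pattern as well: one string, two patterns.
two_patterns <- str_detect("AB-101", c("^AB", "^xy"))   # TRUE FALSE
```

str_detect() propagates NA, whereas grepl() reports FALSE for a missing input, so the two are not always drop-in replacements for each other.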


17.4 Extracting Matches

Pulling the Pattern Itself Out

regmatches() paired with regexpr() returns the first match in each string that has one; paired with gregexpr(), it returns all matches. In stringr, str_extract() and str_extract_all() do the same, more readably.
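The following sketch pulls numbers out of some hypothetical notes with both toolkits. It assumes stringr is installed.

```r
library(stringr)

# Hypothetical free-text notes.
notes <- c("Order 66 shipped with 3 items", "No numbers here", "Room 101")

# Base R, first match only. Strings without a match are dropped from the result.
first_base <- regmatches(notes, regexpr("\\d+", notes))    # "66" "101"

# Base R, all matches: a list with one character vector per input string.
all_base <- regmatches(notes, gregexpr("\\d+", notes))

# stringr keeps the input length and marks missing matches with NA.
first_str <- str_extract(notes, "\\d+")                    # "66" NA "101"
all_str   <- str_extract_all(notes, "\\d+")

all_base[[1]]   # "66" "3"
```

The length difference matters in practice: str_extract() can be assigned straight into a data frame column, while the regmatches() result may be shorter than the input.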


Capture Groups

Parentheses ( ... ) create groups you can later refer to. str_match() (and base R’s regmatches() with regexec()) returns the whole match plus each group.
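As an illustration, the sketch below splits ISO-style dates into year, month, and day. The input vector is hypothetical and stringr is assumed to be installed.

```r
library(stringr)

# Hypothetical inputs; the last one deliberately does not match.
dates   <- c("2024-03-15", "1999-12-31", "not a date")
pattern <- "(\\d{4})-(\\d{2})-(\\d{2})"

# str_match() returns a character matrix: column 1 is the whole match,
# columns 2 to 4 are the three groups. Non-matching rows are all NA.
m     <- str_match(dates, pattern)
years <- unname(m[, 2])    # "2024" "1999" NA

# Base R equivalent: a list with one character vector per input string.
base_m <- regmatches(dates, regexec(pattern, dates))
base_m[[1]]   # "2024-03-15" "2024" "03" "15"
```

The matrix form is convenient when every record has the same groups, because each group becomes a column you can index directly.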


17.5 Replacing Matches

sub() vs gsub() in Base R

sub() replaces the first match; gsub() replaces all matches. Both are vectorised.
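A quick comparison on a hypothetical vector shows the difference.

```r
# Hypothetical strings for illustration.
x <- c("a-b-c", "no dashes", "1-2")

first_only <- sub("-", "+", x)    # "a+b-c" "no dashes" "1+2"
every_one  <- gsub("-", "+", x)   # "a+b+c" "no dashes" "1+2"

# A common clean-up: collapse runs of whitespace into a single space.
squeezed <- gsub("\\s+", " ", "too    many   spaces")   # "too many spaces"
```

Strings without a match pass through unchanged, so both functions are safe to apply to a whole column.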


stringr::str_replace() and str_replace_all()

str_replace() replaces the first match and str_replace_all() replaces all matches.
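The stringr pair behaves like sub() and gsub(), with the string as the first argument. A minimal sketch, assuming stringr is installed and using hypothetical inputs:

```r
library(stringr)

# Hypothetical strings, with one missing value.
x <- c("a-b-c", "no dashes", NA)

first_only <- str_replace(x, "-", "+")       # "a+b-c" "no dashes" NA
every_one  <- str_replace_all(x, "-", "+")   # "a+b+c" "no dashes" NA

# str_replace_all() also accepts a named vector of pattern = replacement pairs.
digits <- str_replace_all("one two", c("one" = "1", "two" = "2"))   # "1 2"
```

The named-vector form is a compact way to apply several substitutions in a single call.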


Using Back-References in the Replacement

Parentheses in the pattern capture groups that you can reference in the replacement string as \\1, \\2, and so on.
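Two typical uses are reordering the parts of a name and rearranging a date. The inputs below are hypothetical, and the stringr call assumes the package is installed.

```r
library(stringr)

# Hypothetical "Surname, Given" entries.
full_names <- c("Sharma, Priya", "Rao, Anil")

# Group 1 is the surname, group 2 the given name; the replacement swaps them.
swapped_base <- gsub("^(\\w+),\\s*(\\w+)$", "\\2 \\1", full_names)
swapped_str  <- str_replace(full_names, "^(\\w+),\\s*(\\w+)$", "\\2 \\1")
swapped_base   # "Priya Sharma" "Anil Rao"

# Rearranging yyyy-mm-dd into dd/mm/yyyy.
dmy <- sub("(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1", "2024-03-15")   # "15/03/2024"
```

Groups are numbered by the position of their opening parenthesis, counting from the left.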


17.6 Splitting with a Pattern

strsplit() and str_split()

Both take a regex as the separator. Use fixed = TRUE (base) or stringr::fixed() when you really want a literal split on a special character such as a dot (.).
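The sketch below splits a messy, hypothetical list on commas or semicolons with optional surrounding whitespace, and then splits on a literal pipe using fixed matching. It assumes stringr is installed.

```r
library(stringr)

# Hypothetical input with inconsistent separators.
line <- "apple, banana ;cherry,  date"

# Regex separator: optional whitespace, a comma or semicolon, optional whitespace.
# Both functions return a list with one element per input string.
parts_base <- strsplit(line, "\\s*[,;]\\s*")[[1]]
parts_str  <- str_split(line, "\\s*[,;]\\s*")[[1]]
parts_base   # "apple" "banana" "cherry" "date"

# Literal separator: the pipe is a regex metacharacter, so match it as fixed text.
cells_base <- strsplit("a|b|c", "|", fixed = TRUE)[[1]]   # "a" "b" "c"
cells_str  <- str_split("a|b|c", fixed("|"))[[1]]
```

Folding the optional whitespace into the separator saves a separate trimming step afterwards.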


Common Mistake: Forgetting That . Is a Regex Metacharacter

. matches any character, not only a literal period. To split on a literal dot, either escape it (\\.) or use fixed matching.
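The mistake and both fixes, shown on a hypothetical file name with base R only:

```r
# Hypothetical file name.
file <- "report.final.csv"

# Wrong: an unescaped dot treats every character as a separator,
# so nothing but empty strings is left.
wrong <- strsplit(file, ".")[[1]]

# Right: escape the dot, or switch to fixed matching.
escaped <- strsplit(file, "\\.")[[1]]              # "report" "final" "csv"
literal <- strsplit(file, ".", fixed = TRUE)[[1]]  # "report" "final" "csv"

all(wrong == "")   # TRUE
```

The wrong version raises no error or warning, which is why this slip tends to surface late, far from the line that caused it.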


17.7 “Fixed” vs “Regex” Matching

When You Do Not Want Pattern Semantics

The base-R pattern functions (grep(), grepl(), sub(), gsub(), regexpr(), gregexpr(), and strsplit()) accept fixed = TRUE, which treats the pattern as a plain string, not a regex. In stringr, wrap the pattern in the fixed("...") helper instead of passing it bare. Choose fixed matching when your pattern contains regex metacharacters (., +, *, (, [, ?, \) and you want them taken literally.
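To see the difference, search some hypothetical price strings for the text "5.00". The stringr call assumes the package is installed.

```r
library(stringr)

# Hypothetical strings containing regex metacharacters.
prices <- c("cost: $5.00", "cost: 5000", "price (net)")

# As a regex, the dot matches any character, so "5000" also qualifies.
as_regex <- grepl("5.00", prices)                 # TRUE TRUE FALSE

# As fixed text, only the literal characters 5 . 0 0 match.
as_fixed <- grepl("5.00", prices, fixed = TRUE)   # TRUE FALSE FALSE

# Parentheses would open a group in a regex; fixed() takes them literally.
has_net <- str_detect(prices, fixed("(net)"))     # FALSE FALSE TRUE

# Replacing a literal dollar sign without escaping it.
relabelled <- gsub("$", "USD ", prices[1], fixed = TRUE)   # "cost: USD 5.00"
```

Fixed matching is also a sensible default whenever the pattern comes from user input or from data, since you cannot know in advance which metacharacters it contains.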


17.8 Padding and Alignment

formatC() and stringr::str_pad()

Neat fixed-width output is a common need, for invoice numbers, report columns, or console-based progress indicators.
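A minimal sketch of both functions follows. The invoice numbers and column labels are hypothetical, and stringr is assumed to be installed.

```r
library(stringr)

# Hypothetical invoice numbers to be zero-padded to six digits.
ids <- c(7, 42, 1234)

padded_base <- formatC(ids, width = 6, flag = "0", format = "d")
padded_str  <- str_pad(as.character(ids), width = 6, side = "left", pad = "0")
padded_base   # "000007" "000042" "001234"

# Text columns: formatC() right-aligns by default; flag = "-" left-aligns.
labels     <- c("Item", "Quantity")
right_col  <- formatC(labels, width = 10)
left_col   <- formatC(labels, width = 10, flag = "-")
left_col_s <- str_pad(labels, width = 10, side = "right")

# Numbers with a fixed number of decimals in a fixed-width field.
amount <- formatC(3.14159, digits = 2, format = "f", width = 8)   # "    3.14"
```

formatC() is the natural choice when the input is numeric and you also want to control decimals; str_pad() works on strings and lets you choose any single padding character.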


17.9 A Small Regex Cookbook

Patterns You Will Reuse

Task	Pattern	Notes
Email (rough)	`^\\S+@\\S+\\.\\S+$`	Fine for a rough check; real address syntax (RFC 5322) is far more involved.
Indian phone (rough)	`^(\\+91[- ]?)?[6-9]\\d{9}$`	Optional +91 prefix, 10 digits starting 6-9.
4-digit year	`\\b\\d{4}\\b`	Word boundaries stop it from matching inside longer numbers.
Integer	`^-?\\d+$`	Optional minus, then digits only.
Decimal	`^-?\\d+(\\.\\d+)?$`	Optional minus, integer part, optional decimal part.
Leading / trailing whitespace	`^\\s+|\\s+$`	Use instead of `trimws()` when you need to report where the whitespace was.
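The cookbook patterns can be tried against a few inputs with base R alone. All test strings below are hypothetical.

```r
email_re   <- "^\\S+@\\S+\\.\\S+$"
phone_re   <- "^(\\+91[- ]?)?[6-9]\\d{9}$"
year_re    <- "\\b\\d{4}\\b"
integer_re <- "^-?\\d+$"
decimal_re <- "^-?\\d+(\\.\\d+)?$"
trim_re    <- "^\\s+|\\s+$"

# Hypothetical inputs, one call per recipe.
email_ok   <- grepl(email_re, c("a@b.co", "no-at-sign.com"))              # TRUE FALSE
phone_ok   <- grepl(phone_re, c("+91 9876543210", "9876543210", "1234567890"))
integer_ok <- grepl(integer_re, c("-42", "4.2"))                          # TRUE FALSE
decimal_ok <- grepl(decimal_re, c("-4.2", "4.", "7"))                     # TRUE FALSE TRUE

# The word boundaries skip the six-digit number and find only the year.
sentence <- "Founded in 1998, about 120000 users"
year     <- regmatches(sentence, regexpr(year_re, sentence))              # "1998"

# Strip leading and trailing whitespace with the same pattern.
trimmed <- gsub(trim_re, "", "  padded  ")                                # "padded"
```

Testing each pattern against at least one input that should fail is as important as testing one that should pass; the third phone number and the string "4." play that role here.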


17.10 A Worked Example: Parsing Free-Text Transactions

Extracting Structure from a Prose Log


Every technique from the chapter is at work: str_detect() to validate each line, str_match() with capture groups to tear the record apart, gsub() to strip commas, and a final explicit cast to numeric.
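The combination described above can be sketched as follows. The log lines, the record shape ("Paid Rs amount to name on date"), and the column names are assumptions made for illustration only, and stringr is assumed to be installed.

```r
library(stringr)

# Hypothetical log lines; the record layout is an assumption for this sketch.
log <- c(
  "Paid Rs 1,250.50 to Asha on 2024-03-01",
  "Paid Rs 980 to Vikram on 2024-03-02",
  "note: call the bank tomorrow",
  "Paid Rs 12,00,000 to Meera on 2024-03-05"
)

# Three groups: amount (digits and commas, optional decimals), payee, ISO date.
pattern <- "^Paid Rs ([\\d,]+(?:\\.\\d+)?) to (\\w+) on (\\d{4}-\\d{2}-\\d{2})$"

# 1. Validate: keep only the lines that have the expected shape.
valid <- str_detect(log, pattern)

# 2. Tear each valid record apart with capture groups.
parts <- str_match(log[valid], pattern)

# 3. Strip the thousands separators and convert to proper types.
transactions <- data.frame(
  payee  = parts[, 3],
  amount = as.numeric(gsub(",", "", parts[, 2])),
  date   = as.Date(parts[, 4]),
  stringsAsFactors = FALSE
)

transactions
```

Validating before extracting keeps the free-text note on the third line out of the result instead of turning it into a row of NA values.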

Summary

Concept	Description
Regex Foundations
Regular Expressions	Patterns describe sets of strings rather than literal text
Regex Building Blocks	Anchors, character classes, quantifiers, and groups
Double Backslashes in R	R string literals need \\ to express a single backslash in a regex
Detect, Extract, Replace, Split
grepl() and grep()	Base R helpers to detect and locate matches in vectors
stringr::str_detect()	Tidyverse twin with consistent argument order and behaviour
Extracting Matches	regmatches(), str_extract(), str_match() pull matches out
Replacing Matches	sub() replaces first; gsub() replaces all
Splitting Strings	strsplit() and str_split() break strings into parts