16 Strings: Creation, Concatenation, and Substrings

What This Chapter Covers

This chapter is a thorough introduction to working with text in R. You will learn how character values are created and stored in character vectors, the two styles of quoting (" and '), the common escape sequences (\n, \t, \", \\), and the length metrics length() and nchar(), which do not mean the same thing. You will see the four everyday concatenation tools (paste(), paste0(), cat(), sprintf()), the substring functions substr() and substring(), the prefix and suffix tests startsWith() and endsWith(), how to change case with toupper(), tolower(), and tools::toTitleCase(), how to strip whitespace with trimws(), and how to split strings with strsplit(). The chapter sticks to base R; Chapter 17 covers the richer stringr toolkit and regular expressions. By the end of this chapter you will be able to build, inspect, slice, and clean text.

flowchart LR
    T["Text in R"] --> CR["Create <br> 'hello' / character(n) / readLines()"]
    T --> CO["Concatenate <br> paste / paste0 / cat / sprintf"]
    T --> SU["Substring <br> substr / substring / startsWith / endsWith"]
    T --> CL["Clean <br> toupper / tolower / trimws"]
    classDef default fill:#003366,color:#ffffff,stroke:#ffcc00,stroke-width:3px,rx:10px,ry:10px;

16.1 Character Values and Character Vectors

Core Concept: Every String Is a One-Element Character Vector

R has no separate “character” and “string” types. A single letter and a full paragraph are both character vectors of length 1. Multiple strings sit side by side in a longer character vector.
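A minimal sketch of the idea, using made-up values (the variable names are illustrative only):

```r
# A single letter and a full sentence are both character vectors of length 1.
letter   <- "a"
sentence <- "R treats every string as a vector."

class(letter)       # "character"
length(letter)      # 1
length(sentence)    # 1

# Several strings side by side form a longer character vector.
fruits <- c("apple", "banana", "cherry")
length(fruits)        # 3
is.character(fruits)  # TRUE

# character(n) creates a vector of n empty strings.
blank <- character(2)
blank                 # "" ""
```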

Common Mistake: Confusing length() and nchar()

length(x) reports how many strings the vector holds. nchar(x) reports how many characters are inside each string. They answer different questions and are the source of a classic beginner bug.
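The difference is easiest to see side by side. The following sketch uses illustrative values:

```r
words <- c("data", "science")

length(words)   # 2: the vector holds two strings
nchar(words)    # 4 7: characters inside each string

# The classic bug: using length() to measure a single string.
title <- "statistics"
length(title)   # 1, not 10
nchar(title)    # 10
```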

16.2 Quotes and Escape Sequences

Double vs Single Quotes

Both styles delimit a string; pick whichever avoids escaping. The R community and tidyverse style guide prefer double quotes by default.
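As a small illustration (the sentences are made up), choosing the other quote style avoids escaping, and the escaped form produces the same string:

```r
a <- "She said 'hello'"      # single quotes inside double quotes: no escaping
b <- 'She said "hello"'      # double quotes inside single quotes: no escaping
b_escaped <- "She said \"hello\""  # same text as b, written with escapes

identical(b, b_escaped)      # TRUE
```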

Common Escape Sequences

Sequence	Meaning
`\n`	Newline.
`\t`	Tab.
`\\`	A literal backslash.
`\"` / `\'`	A literal double / single quote inside the matching delimiter.
`\u00e9`	A Unicode code point (here, é).
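A short sketch of these sequences in use. The file path is a hypothetical example; note that each escape sequence counts as a single character:

```r
# Each escape sequence is one character once the string is parsed.
nchar("\n")    # 1
nchar("\t")    # 1
nchar("\\")    # 1

# A Windows-style path needs doubled backslashes in source code.
path <- "C:\\data\\file.csv"
cat(path, "\n")        # C:\data\file.csv
nchar(path)            # 16

# Tab-separated columns and an escaped quote.
cat("name\tscore\n")
cat("He said \"hi\"\n")
```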

Expert Insight: print() Shows Literals, cat() Renders Them

print() displays a string as R code: wrapped in quotes, with special characters shown as escape sequences. cat() and message() write the characters themselves, so \n appears as an actual line break and \t as a tab. This trips up beginners who think their \n isn’t working; the answer is usually that they are looking at print() output.
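To see the difference directly, the sketch below sends the same string through both functions. capture.output() is used here only to count the lines each one produces:

```r
msg <- "line one\nline two"

print(msg)            # [1] "line one\nline two"
cat(msg, "\n", sep = "")
# line one
# line two

# Counting output lines makes the difference explicit.
printed <- capture.output(print(msg))
written <- capture.output(cat(msg, "\n", sep = ""))
length(printed)       # 1
length(written)       # 2
```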

16.3 Building Strings Programmatically

paste() and paste0()

paste() joins pieces with a separator (default one space); paste0() is the same with sep = "". Both are vectorised and recycle a length-1 argument against a longer one.
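A brief sketch with illustrative inputs, showing the default separator, a custom separator, and recycling of a length-1 argument:

```r
greeting <- paste("Hello", "world")        # "Hello world"
dashed   <- paste("a", "b", sep = "-")     # "a-b"

# paste0() uses no separator; "file" and ".csv" are recycled against 1:3.
files  <- paste0("file", 1:3, ".csv")      # "file1.csv" "file2.csv" "file3.csv"

# A length-1 string recycled against a longer vector.
labels <- paste("Sample", c("A", "B", "C"))  # "Sample A" "Sample B" "Sample C"
```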

paste() With collapse

paste() also takes a collapse argument that joins the elements of the result into a single string after the element-wise concatenation. This is the right way to turn a character vector into one line of text.
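For instance, with a made-up vector of colours:

```r
parts <- c("red", "green", "blue")

# Without collapse the result is still three strings.
length(paste(parts))                       # 3

# With collapse the elements are joined into one string.
one_line <- paste(parts, collapse = ", ")  # "red, green, blue"

# Element-wise pasting happens first, then the collapse.
formula_rhs <- paste0("x", 1:3, collapse = " + ")  # "x1 + x2 + x3"
```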

sprintf() for Precise Formatting

sprintf() uses C-style format specifiers for controlled width, padding, and decimals. You saw it in Chapter 5; it is the right tool whenever the exact visual layout matters.
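A few common format specifiers, sketched with illustrative values (the names and scores are invented):

```r
two_dp  <- sprintf("%.2f", pi)           # "3.14"  two decimals
padded  <- sprintf("%5d", 42L)           # "   42" right-aligned in width 5
zeroed  <- sprintf("%03d", 7L)           # "007"   zero-padded to width 3
left    <- sprintf("%-10s|", "left")     # "left      |" left-aligned in width 10

# sprintf() is vectorised; %% prints a literal percent sign.
report <- sprintf("%s scored %.1f%%", c("Asha", "Ben"), c(91.256, 78.5))
report   # "Asha scored 91.3%" "Ben scored 78.5%"
```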

Best Practice: Pick the Right Tool

Reach for paste() when joining strings with a delimiter, paste0() when building file names or URLs, and sprintf() whenever decimals, widths, or padding matter. Reserve cat() for writing a message to the console rather than building a string.

16.4 Length Metrics

Everyday Measurements
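A minimal sketch of the measurements you are likely to need day to day, assuming a small invented vector of city names:

```r
cities <- c("Pune", "New Delhi", "", "Chennai")

length(cities)        # 4: number of strings, including the empty one
nchar(cities)         # 4 9 0 7: characters in each string
sum(nchar(cities))    # 20: total characters across the vector

# The longest entry.
longest <- cities[which.max(nchar(cities))]
longest               # "New Delhi"
```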

Unicode, Bytes, and nchar()

nchar() reports characters by default, not bytes. In UTF-8, a string with accented letters or emoji takes more bytes than characters. You can ask for a byte count with nchar(x, type = "bytes") when it matters for file size estimates.
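The sketch below assumes the string is stored as UTF-8, which enc2utf8() makes explicit; in that encoding é occupies two bytes:

```r
# "café" written with a Unicode escape so the source stays ASCII.
word <- enc2utf8("caf\u00e9")

nchar(word)                   # 4 characters
nchar(word, type = "bytes")   # 5 bytes in UTF-8

# Plain ASCII text has one byte per character.
nchar("cafe") == nchar("cafe", type = "bytes")   # TRUE
```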

16.5 Extracting Substrings

substr(): Positions as Start and Stop

substr(x, start, stop) returns the slice from position start to position stop, inclusive on both ends, 1-indexed. It is vectorised.
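A short sketch; the identifier format in ids is hypothetical and chosen only so that the fields sit at fixed positions:

```r
head_part <- substr("statistics", 1, 4)    # "stat"
tail_part <- substr("statistics", 5, 10)   # "istics"

# Vectorised over x: the same positions are applied to every string.
ids <- c("IN-2021-001", "US-2022-417")
country <- substr(ids, 1, 2)               # "IN" "US"
year    <- substr(ids, 4, 7)               # "2021" "2022"
```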

substring(): Allows Multiple Start Points

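substring() is the older cousin. Its arguments are named first and last, with last defaulting to 1000000L, so substring(x, first) returns everything from first to the end of the string. Usefully, it also supports multiple start positions at once, producing several substrings from a single input string.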
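A brief sketch on an illustrative six-letter string:

```r
x <- "abcdef"

# Only a start position: everything up to the end of the string.
rest <- substring(x, 4)            # "def"

# Several start and end positions against one input string.
windows <- substring(x, 1:3, 3:5)  # "abc" "bcd" "cde"

# A common idiom: split a string into its individual characters.
chars <- substring(x, 1:6, 1:6)    # "a" "b" "c" "d" "e" "f"
```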

Replacing Substrings with substr<-

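Assigning to substr(x, start, stop) replaces the slice in place and never changes the total length of the string. Ideally the replacement is the same length as the slice. If it is shorter, only that many characters are overwritten and the rest of the slice is left as it was; if it is longer, only its first characters are used.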
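The three cases can be checked on a small illustrative string:

```r
# Replacement has the same length as the slice (positions 2 to 4).
x <- "abcdef"
substr(x, 2, 4) <- "XYZ"
x    # "aXYZef"

# Shorter replacement: only one character is overwritten.
y <- "abcdef"
substr(y, 2, 4) <- "Z"
y    # "aZcdef"

# Longer replacement: only its first three characters are used.
z <- "abcdef"
substr(z, 2, 4) <- "WXYZ"
z    # "aWXYef"
```

In all three cases the string keeps its original six characters.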

16.6 Prefix, Suffix, and Case

startsWith() and endsWith() for Prefix and Suffix Tests

Both functions are vectorised and return a logical vector.
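A minimal sketch using hypothetical file names; the logical result can be used directly for subsetting:

```r
files <- c("report_2023.csv", "report_2024.xlsx", "notes.txt")

is_report <- startsWith(files, "report")   # TRUE TRUE FALSE
is_csv    <- endsWith(files, ".csv")       # TRUE FALSE FALSE

# Keep only the CSV files.
csv_files <- files[is_csv]                 # "report_2023.csv"
```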

Changing Case

Four functions cover the everyday cases.

Function	Purpose
`toupper(x)`	Convert to uppercase.
`tolower(x)`	Convert to lowercase.
`tools::toTitleCase(x)`	Title Case.
`chartr(old, new, x)`	Fixed-position character translation (1-to-1).
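Each function from the table, sketched on small illustrative inputs:

```r
upper <- toupper("new delhi")               # "NEW DELHI"
lower <- tolower("NEW DELHI")               # "new delhi"
title <- tools::toTitleCase("new delhi")    # "New Delhi"

# chartr() maps each character of old to the character at the same
# position in new: a -> x, b -> y, c -> z.
swapped <- chartr("abc", "xyz", "aabbcc")   # "xxyyzz"
```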

Expert Insight: Normalise Case at the Boundary

Before any string comparison or grouping, decide on a canonical case (often lowercase) and apply it once at the point where the data enters your program. “New Delhi”, “new delhi”, and “NEW DELHI” are three different values until you normalise them.
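A small sketch of the effect, using the three spellings from the paragraph above:

```r
cities <- c("New Delhi", "new delhi", "NEW DELHI")

# Before normalising, R sees three distinct values.
length(unique(cities))      # 3

# After choosing lowercase as the canonical form, there is one.
canonical <- tolower(cities)
length(unique(canonical))   # 1
```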

16.7 Removing Whitespace

trimws() Strips Leading and Trailing Spaces

trimws() has three modes, chosen with the which argument: both ends ("both", the default), only the left ("left"), or only the right ("right").
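The three modes on an illustrative string with two spaces at each end:

```r
x <- "  padded text  "

both  <- trimws(x)                     # "padded text"
left  <- trimws(x, which = "left")     # "padded text  "
right <- trimws(x, which = "right")    # "  padded text"
```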

Common Mistake: Forgetting That trimws() Only Touches the Ends

Double spaces in the middle of a string survive trimws(). If you need to collapse them, use gsub("\\s+", " ", x) (covered in Chapter 17).
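A quick sketch of the difference, using a made-up string with runs of spaces at the ends and in the middle:

```r
messy <- "  too   many    spaces  "

# trimws() fixes the ends but leaves the interior runs alone.
ends_only <- trimws(messy)                     # "too   many    spaces"

# gsub() then collapses each run of whitespace to a single space.
tidy <- gsub("\\s+", " ", ends_only)           # "too many spaces"
```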

16.8 Splitting Strings with strsplit()

From One String to Several

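strsplit(x, split) breaks each string on the separator and returns a list with one element per input string. The list structure reflects that different strings may split into different numbers of pieces. By default split is interpreted as a regular expression; pass fixed = TRUE to match it literally.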
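A minimal sketch with illustrative inputs. The last call uses fixed = TRUE so that the dot is matched literally rather than as a regular-expression wildcard:

```r
x <- c("a,b,c", "d,e")
pieces <- strsplit(x, ",")

class(pieces)      # "list"
lengths(pieces)    # 3 2: each input splits into a different number of pieces
pieces[[1]]        # "a" "b" "c"
pieces[[2]]        # "d" "e"

# Splitting on a literal dot.
dotted <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]   # "a" "b" "c"
```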

Best Practice: strsplit() Is for Unknown-Length Cases

If you know every string splits into the same number of fields (for example, well-formed fixed-column text), do.call(rbind, strsplit(...)) gives you a matrix. For real CSV data, use read.csv() or readr::read_csv() rather than hand-splitting strings.
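As a sketch of the matrix idiom, assuming hypothetical record identifiers that always contain exactly three dash-separated fields:

```r
records <- c("IN-2021-001", "US-2022-417", "UK-2023-088")

# Each string splits into three pieces, so rbind() stacks them cleanly.
fields <- do.call(rbind, strsplit(records, "-", fixed = TRUE))

dim(fields)     # 3 3: one row per record, one column per field
fields[, 1]     # "IN" "US" "UK"
fields[, 2]     # "2021" "2022" "2023"
```

If the strings split into different numbers of pieces, rbind() recycles the shorter ones with a warning, which is why this idiom only suits well-formed input.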

16.9 A Worked Example: Tidying a Messy Name List

Putting the Chapter Together
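A minimal sketch of such a tidy-up, assuming a small invented vector of names with stray whitespace and inconsistent case (the data and variable names are illustrative only):

```r
raw_names <- c("  ada LOVELACE ", "ALAN turing", " grace hopper  ")

# 1. Drop whitespace at the edges.
trimmed <- trimws(raw_names)

# 2. Normalise case: lowercase first, then title case.
clean <- tools::toTitleCase(tolower(trimmed))
clean   # "Ada Lovelace" "Alan Turing" "Grace Hopper"

# 3. Break each name into words (one or more spaces as the separator).
words <- strsplit(clean, " +")

# 4. Take the first character of each word and glue the initials together.
initials <- vapply(
  words,
  function(w) paste0(substr(w, 1, 1), collapse = ""),
  character(1)
)
initials   # "AL" "AT" "GH"

data.frame(name = clean, initials = initials)
```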

Every core technique of the chapter appears: trimws() to drop edge whitespace, case normalisation with tolower() and toTitleCase(), strsplit() to break each name into words, substr() to pull first characters, and paste0() with collapse to assemble the initials.

Summary

Concept	Description
Strings as Vectors
String as Character Vector	Every string is a one-element character vector in R
length() vs nchar()	length() counts vector elements; nchar() counts characters within strings
Quotes and Escapes	Use \n, \t, \\ inside double or single quotes
Building and Slicing
paste() and paste0()	Concatenate with separator (paste) or no separator (paste0)
paste() with collapse	Collapse argument joins multiple strings into one
sprintf()	Format numbers and strings with C-style format specifiers
substr() and substring()	Extract or replace substrings by character position
Case and Trimming	tolower(), toupper(), trimws() for case and whitespace handling