19 Advanced Data Frame Operations with dplyr
dplyr reframes data manipulation around a small, verb-shaped grammar: pick columns (select), keep rows (filter), add columns (mutate), sort (arrange), collapse (summarise), and split-then-summarise (group_by). Connect the verbs with the pipe |> and most analyses read top-to-bottom like plain English. This chapter also covers joins between two frames and closes with a worked example on quarterly sales.
19.1 Why dplyr?
The base R style from the previous chapter works, but it repeats the data frame name over and over:
sales$h1 <- sales$q1 + sales$q2
sales <- sales[order(-sales$h1), ]
sales <- sales[sales$h1 > 200, ]dplyr rewrites the same pipeline as a series of verbs applied to one object:
sales |>
mutate(h1 = q1 + q2) |>
arrange(desc(h1)) |>
filter(h1 > 200)Each line says exactly one thing. That readability is the whole value proposition.
19.2 Setup
dplyr is part of the tidyverse and runs in webr.
19.3 The pipe |>
The pipe passes the value on its left as the first argument of the function on its right. These two expressions are identical:
R 4.1 introduced the native pipe |>. The tidyverse pipe %>% from the magrittr package behaves almost identically and is still common in older code. Use |> in new work.
19.4 select(), pick columns
Helpers like starts_with(), ends_with(), contains(), and where(is.numeric) are worth the five minutes it takes to learn them.
19.5 filter(), keep rows
filter() takes one or more logical conditions. A comma between conditions means AND.
Compared with sales[sales$revenue > 100 & sales$region != "West", ], the dplyr form is shorter and doesn’t repeat sales$ inside the expression.
19.6 mutate(), add or change columns
mutate() creates new columns in place, the original frame is unchanged unless you assign the result. Columns can reference other columns created earlier in the same mutate() call.
transmute()
transmute() is mutate() + select(), it keeps only the columns you build.
19.7 arrange(), sort rows
Compared to base R’s order(-sales$revenue), the desc() helper reads naturally.
19.8 summarise(), collapse to one row
summarise() reduces many rows to one. It is most useful in combination with group_by().
n() is a dplyr helper that counts the rows in the current group.
19.9 group_by(), split–apply–combine
Group, then summarise. This is the pattern that does 80% of descriptive analytics.
Two grouping columns produce a per-combination summary:
The .groups = "drop" argument removes the grouping from the result so downstream code doesn’t have to think about it.
19.10 Counting
count() is a convenience for “how many rows per group?”
19.11 Joining two data frames
When information lives in two tables, a join brings them together on a shared key.
| Join | Keeps rows from |
|---|---|
left_join(x, y) |
all of x, matching from y |
right_join(x, y) |
all of y, matching from x |
inner_join(x, y) |
only keys present in both |
full_join(x, y) |
all keys from either side |
99% of day-to-day analyses use left_join(), “add these lookup columns to my main table.”
19.12 Worked example, quarterly sales report
Starting from the same sales frame, produce: per-region half-year totals, a growth percentage from Q1 to Q2, and a ranked league table.
One readable pipeline per transformation, one final arrange(), the full descriptive report in a dozen lines.
19.13 Summary
| Verb | Purpose |
|---|---|
select() |
choose columns |
filter() |
keep rows |
mutate() |
add / change columns |
arrange() |
sort rows |
summarise() |
collapse rows |
group_by() |
group before summarise / mutate |
count() |
quick frequency table |
left_join() |
merge with a lookup |
dplyr does not do anything base R can’t, it makes the common pipelines readable enough that you can show them to a colleague and have them understood at a glance. Combined with the pipe |>, it is the default tool for descriptive analytics in modern R.