19  Advanced Data Frame Operations with dplyr

NoteWhat this chapter covers

dplyr reframes data manipulation around a small, verb-shaped grammar: pick columns (select), keep rows (filter), add columns (mutate), sort (arrange), collapse (summarise), and split-then-summarise (group_by). Connect the verbs with the pipe |> and most analyses read top-to-bottom like plain English. This chapter also covers joins between two frames and closes with a worked example on quarterly sales.

19.1 Why dplyr?

The base R style from the previous chapter works, but it repeats the data frame name over and over:

sales$h1 <- sales$q1 + sales$q2
sales <- sales[order(-sales$h1), ]
sales <- sales[sales$h1 > 200, ]

dplyr rewrites the same pipeline as a series of verbs applied to one object:

sales |>
  mutate(h1 = q1 + q2) |>
  arrange(desc(h1)) |>
  filter(h1 > 200)

Each line says exactly one thing. That readability is the whole value proposition.

19.2 Setup

dplyr is part of the tidyverse and runs in webr.

19.3 The pipe |>

The pipe passes the value on its left as the first argument of the function on its right. These two expressions are identical:

R 4.1 introduced the native pipe |>. The tidyverse pipe %>% from the magrittr package behaves almost identically and is still common in older code. Use |> in new work.

19.4 select(), pick columns

Helpers like starts_with(), ends_with(), contains(), and where(is.numeric) are worth the five minutes it takes to learn them.

19.5 filter(), keep rows

filter() takes one or more logical conditions. A comma between conditions means AND.

Compared with sales[sales$revenue > 100 & sales$region != "West", ], the dplyr form is shorter and doesn’t repeat sales$ inside the expression.

19.6 mutate(), add or change columns

mutate() creates new columns in place, the original frame is unchanged unless you assign the result. Columns can reference other columns created earlier in the same mutate() call.

Tiptransmute()

transmute() is mutate() + select(), it keeps only the columns you build.

19.7 arrange(), sort rows

Compared to base R’s order(-sales$revenue), the desc() helper reads naturally.

19.8 summarise(), collapse to one row

summarise() reduces many rows to one. It is most useful in combination with group_by().

n() is a dplyr helper that counts the rows in the current group.

19.9 group_by(), split–apply–combine

Group, then summarise. This is the pattern that does 80% of descriptive analytics.

Two grouping columns produce a per-combination summary:

The .groups = "drop" argument removes the grouping from the result so downstream code doesn’t have to think about it.

19.10 Counting

count() is a convenience for “how many rows per group?”

19.11 Joining two data frames

When information lives in two tables, a join brings them together on a shared key.

Join Keeps rows from
left_join(x, y) all of x, matching from y
right_join(x, y) all of y, matching from x
inner_join(x, y) only keys present in both
full_join(x, y) all keys from either side

99% of day-to-day analyses use left_join(), “add these lookup columns to my main table.”

19.12 Worked example, quarterly sales report

Starting from the same sales frame, produce: per-region half-year totals, a growth percentage from Q1 to Q2, and a ranked league table.

One readable pipeline per transformation, one final arrange(), the full descriptive report in a dozen lines.

19.13 Summary

Verb Purpose
select() choose columns
filter() keep rows
mutate() add / change columns
arrange() sort rows
summarise() collapse rows
group_by() group before summarise / mutate
count() quick frequency table
left_join() merge with a lookup

dplyr does not do anything base R can’t, it makes the common pipelines readable enough that you can show them to a colleague and have them understood at a glance. Combined with the pipe |>, it is the default tool for descriptive analytics in modern R.