19 Advanced Data Frame Operations with dplyr

What this chapter covers

dplyr reframes data manipulation around a small, verb-shaped grammar: pick columns (select), keep rows (filter), add columns (mutate), sort (arrange), collapse (summarise), and split-then-summarise (group_by). Connect the verbs with the pipe |> and most analyses read top-to-bottom like plain English. This chapter also covers joins between two frames and closes with a worked example on quarterly sales.

19.1 Why dplyr?

The base R style from the previous chapter works, but it repeats the data frame name over and over:

sales$h1 <- sales$q1 + sales$q2
sales <- sales[order(-sales$h1), ]
sales <- sales[sales$h1 > 200, ]

dplyr rewrites the same pipeline as a series of verbs applied to one object:

sales |>
  mutate(h1 = q1 + q2) |>
  arrange(desc(h1)) |>
  filter(h1 > 200)

Each line says exactly one thing. That readability is the whole value proposition.

19.2 Setup

dplyr is part of the tidyverse and runs in webr.

Try here

19.3 The pipe `|>`

The pipe passes the value on its left as the first argument of the function on its right. These two expressions are identical:

Try here

R 4.1 introduced the native pipe |>. The tidyverse pipe %>% from the magrittr package behaves almost identically and is still common in older code. Use |> in new work.

19.4 select(), pick columns

Try here

Helpers like starts_with(), ends_with(), contains(), and where(is.numeric) are worth the five minutes it takes to learn them.

19.5 filter(), keep rows

filter() takes one or more logical conditions. A comma between conditions means AND.

Try here

Compared with sales[sales$revenue > 100 & sales$region != "West", ], the dplyr form is shorter and doesn’t repeat sales$ inside the expression.

19.6 mutate(), add or change columns

Try here

mutate() creates new columns in place, the original frame is unchanged unless you assign the result. Columns can reference other columns created earlier in the same mutate() call.

transmute()

transmute() is mutate() + select(), it keeps only the columns you build.

19.7 arrange(), sort rows

Try here

Compared to base R’s order(-sales$revenue), the desc() helper reads naturally.

19.8 summarise(), collapse to one row

summarise() reduces many rows to one. It is most useful in combination with group_by().

Try here

n() is a dplyr helper that counts the rows in the current group.

19.9 group_by(), split–apply–combine

Group, then summarise. This is the pattern that does 80% of descriptive analytics.

Try here

Two grouping columns produce a per-combination summary:

Try here

The .groups = "drop" argument removes the grouping from the result so downstream code doesn’t have to think about it.

19.10 Counting

count() is a convenience for “how many rows per group?”

Try here

19.11 Joining two data frames

When information lives in two tables, a join brings them together on a shared key.

Try here

Join	Keeps rows from
`left_join(x, y)`	all of `x`, matching from `y`
`right_join(x, y)`	all of `y`, matching from `x`
`inner_join(x, y)`	only keys present in both
`full_join(x, y)`	all keys from either side

99% of day-to-day analyses use left_join(), “add these lookup columns to my main table.”

19.12 Worked example, quarterly sales report

Starting from the same sales frame, produce: per-region half-year totals, a growth percentage from Q1 to Q2, and a ranked league table.

Try here

One readable pipeline per transformation, one final arrange(), the full descriptive report in a dozen lines.

Summary

Concept	Description
Pipe and Column Verbs
The Pipe \|>	Chains operations left-to-right, output of one becomes input of next
select()	Pick columns by name, range, or helper such as starts_with()
filter()	Keep rows that satisfy a condition
mutate()	Add or modify columns based on existing ones
Sort, Summarise, Group
arrange()	Sort rows by one or more columns, ascending or descending
summarise()	Collapse a data frame to one row of summary statistics
group_by()	Split-apply-combine — group rows then summarise within groups
Counting with count()	n() and count() shortcuts for tabulating frequencies