20  Factors: Creation, Levels, and Reordering

Note: What this chapter covers

A factor is R’s data type for categorical variables: values drawn from a fixed, known set of categories such as "Low" / "Medium" / "High" or "Pass" / "Fail". Internally, a factor stores integer codes plus a lookup table of labels called levels, which makes it memory-efficient and lets it carry an explicit category order. This chapter covers creating factors, inspecting and renaming levels, ordered factors, and changing the level order so summaries and plots come out the way you want.

20.1 Why a separate type for categories?

A column of grades stored as text is just letters with no inherent order; "A", "B", "C" sort alphabetically only by accident. A factor lets you state explicitly: “these are the only valid categories, and this is their order.” That information then drives:

  • which categories summary() and table() count (including those with zero observations),
  • the order categories appear in plots and group-by output,
  • modelling code that needs categorical predictors with a known reference level.
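
A minimal sketch of the first point, assuming a hypothetical vector of grades in which "C" is declared as a valid category but never observed:

```r
# Hypothetical grades; "C" is a declared level with no observations.
grades <- factor(c("B", "A", "A", "B"), levels = c("A", "B", "C"))

grades          # values print without quotes, followed by a Levels: line
table(grades)   # the empty category "C" is still counted, with a count of 0
```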

When a factor is printed, the values appear without quotes and a Levels: line shows the category list.

20.2 Creating factors

factor() accepts a vector and infers the levels by sorting its unique values (alphabetically, for character data).
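
As a small illustration of the default, assuming a hypothetical vector of shirt sizes:

```r
# Hypothetical data: shirt sizes stored as plain text.
sizes <- c("Medium", "Small", "Large", "Small")

f <- factor(sizes)
levels(f)   # "Large" "Medium" "Small": sorted alphabetically, not by size
```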

That alphabetical default is rarely what you want. Specify levels = to take control.
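
A sketch of the same idea with an explicit level vector, again using hypothetical shirt sizes:

```r
# Hypothetical data: shirt sizes stored as plain text.
sizes <- c("Medium", "Small", "Large", "Small")

# State the categories in the order that makes sense for the data.
f <- factor(sizes, levels = c("Small", "Medium", "Large"))
levels(f)   # "Small" "Medium" "Large"
table(f)    # counts are reported in level order
```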

Pass labels = to rename categories at the same time:
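
For instance, assuming the raw data uses hypothetical one-letter codes:

```r
# Hypothetical data: sizes recorded as short codes.
codes <- c("S", "M", "L", "S")

# levels = names the values found in the data;
# labels = gives the display name for each level, in the same order.
f <- factor(codes,
            levels = c("S", "M", "L"),
            labels = c("Small", "Medium", "Large"))
f
```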

Warning: Values not in levels become NA

If a value in your data isn’t listed in levels, R silently turns it into NA. Use unique() first to be sure your level list covers everything.
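
A minimal sketch of this behaviour, assuming a hypothetical vector of shirt sizes in which "XL" is missing from the level list:

```r
# Hypothetical data: "XL" is present in the data but not in the level list.
sizes <- c("Small", "Medium", "XL", "Large")
known <- c("Small", "Medium", "Large")

# Check first: which values would be lost?
setdiff(unique(sizes), known)   # "XL"

f <- factor(sizes, levels = known)
f   # the third element prints as <NA>; no warning is raised
```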

An unlisted value such as "XL" becomes <NA>, which is a useful safety net when you want to flag unknown categories but a trap if you didn’t intend it.

20.3 Inspecting factors

as.integer() exposes the factor’s secret: each value is really an integer pointing into the levels vector. That is why factors are so cheap and so fast for grouping operations.
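
The inspection functions can be sketched as follows, assuming a hypothetical three-level factor:

```r
# Hypothetical factor with an explicit level order.
f <- factor(c("Low", "High", "Medium", "Low"),
            levels = c("Low", "Medium", "High"))

levels(f)       # the label table: "Low" "Medium" "High"
nlevels(f)      # 3
table(f)        # counts per level, in level order
as.integer(f)   # 1 3 2 1: positions in levels(f)

# The labels can be rebuilt from the integer codes.
levels(f)[as.integer(f)]
```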

20.4 Adding, renaming, and dropping levels

Renaming all levels at once:
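
A short sketch, assuming hypothetical abbreviated labels that should be spelled out:

```r
# Hypothetical factor with terse labels.
f <- factor(c("lo", "hi", "mid", "lo"), levels = c("lo", "mid", "hi"))

# Assign new labels in the same order as the existing levels.
levels(f) <- c("Low", "Medium", "High")
f
```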

Adding a level that may not appear in the data yet (useful for plots that should always reserve space for an empty category):
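
One way to do this in base R, sketched with a hypothetical "Critical" category that has no observations yet:

```r
# Hypothetical severity factor with two observed categories.
f <- factor(c("Low", "High", "Low"), levels = c("Low", "High"))

# Append a new level; existing values are untouched.
levels(f) <- c(levels(f), "Critical")

table(f)   # "Critical" appears with a count of 0
```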

Dropping unused levels after a filter:
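
A minimal sketch, assuming a hypothetical factor from which one category is filtered out:

```r
# Hypothetical factor with three categories.
f <- factor(c("A", "B", "C", "A"))

kept <- f[f != "C"]      # filter out the "C" observations
levels(kept)             # "A" "B" "C": the empty level is still there

kept <- droplevels(kept)
levels(kept)             # "A" "B"
```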

droplevels() is the cleanup function to know.

20.5 Ordered factors

Some categorical variables have a natural order: Low < Medium < High, or grades D < C < B < A. ordered = TRUE records that order so comparisons work.
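
A sketch of what that enables, assuming a hypothetical severity scale:

```r
# Hypothetical severity ratings on an ordered scale.
sev <- factor(c("Low", "High", "Medium", "Low"),
              levels = c("Low", "Medium", "High"),
              ordered = TRUE)

sev                     # Levels: Low < Medium < High
sev < "High"            # TRUE FALSE TRUE TRUE
max(sev)                # High
sev[sev >= "Medium"]    # keeps only the Medium and High observations
```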

Use ordered factors for measurement scales and severity grades, but stick with regular factors for categories that have no inherent order (region, department, colour). Ordered factors change how some modelling functions treat the variable, so don’t reach for them by default.

20.6 Reordering levels

The level order, not the alphabetical order of the labels, drives every summary and plot. There are three common ways to change it.

By hand: restate the level vector explicitly with factor(..., levels = ...):
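
For example, assuming a hypothetical factor that was created with the alphabetical default:

```r
# Hypothetical factor created with default (alphabetical) levels.
f <- factor(c("Medium", "Low", "High", "Low"))
levels(f)   # "High" "Low" "Medium"

# Re-create it with the intended order; the values themselves do not change.
f <- factor(f, levels = c("Low", "Medium", "High"))
levels(f)   # "Low" "Medium" "High"
```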

Move one level to the front with relevel(), which is handy for setting a regression reference category:
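
A small sketch, assuming a hypothetical treatment variable where "Placebo" should be the reference:

```r
# Hypothetical treatment groups; default levels are alphabetical.
treatment <- factor(c("DrugA", "Placebo", "DrugB", "Placebo"))
levels(treatment)   # "DrugA" "DrugB" "Placebo"

# Make "Placebo" the first (reference) level; the others keep their order.
treatment <- relevel(treatment, ref = "Placebo")
levels(treatment)   # "Placebo" "DrugA" "DrugB"
```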

By another variable’s value with forcats::fct_reorder(), the easiest way to make a bar chart sort by height. forcats is part of the tidyverse and runs in webr.
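
A minimal sketch, assuming forcats is installed and using a hypothetical table of revenue by region:

```r
# Hypothetical data: one revenue figure per region.
sales <- data.frame(
  region  = c("North", "South", "East", "West"),
  revenue = c(120, 340, 90, 210)
)

# Order the region levels by revenue (ascending by default).
sales$region <- forcats::fct_reorder(sales$region, sales$revenue)
levels(sales$region)   # "East" "North" "West" "South"
```

A bar chart of revenue by region would then draw its bars in this level order; passing .desc = TRUE reverses it.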

Two more forcats helpers worth memorising:

  • fct_infreq(x): order levels by how often each appears (most common first).
  • fct_relevel(x, "Foo", "Bar"): push the named levels to the front, in the order given.
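
Both helpers can be sketched on a hypothetical factor, assuming forcats is installed:

```r
# Hypothetical factor: "b" appears 3 times, "c" twice, "a" once.
x <- factor(c("b", "a", "c", "b", "c", "b"))
levels(x)                                    # "a" "b" "c"

levels(forcats::fct_infreq(x))               # "b" "c" "a"
levels(forcats::fct_relevel(x, "c", "b"))    # "c" "b" "a"
```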

20.7 Worked example: student performance

A small dataset of student grades. We want a frequency table with categories in pedagogical order (F < D < C < B < A), not alphabetical, plus the top performer.
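
A sketch of one way to do this. The student names, scores, and grade boundaries below are hypothetical, chosen only for illustration:

```r
# Hypothetical data: names and scores are made up.
students <- data.frame(
  name  = c("Asha", "Ben", "Chen", "Dana", "Eli", "Fay"),
  score = c(91, 78, 64, 55, 85, 72)
)

# Bin the scores into ordered grades (assumed cut-offs: 60, 70, 80, 90).
students$grade <- cut(
  students$score,
  breaks = c(0, 60, 70, 80, 90, 100),
  labels = c("F", "D", "C", "B", "A"),
  right = FALSE,            # intervals are [lower, upper)
  include.lowest = TRUE,    # a score of exactly 100 still lands in the top bin
  ordered_result = TRUE     # return an ordered factor: F < D < C < B < A
)

table(students$grade)       # counts in pedagogical order, not alphabetical

# Top performer(s): rows whose grade equals the highest grade observed.
students[students$grade == max(students$grade), ]
```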

Two things to notice. First, cut() is the workhorse for turning a numeric variable into a factor with custom bins. Second, because grade is an ordered factor, max() and == give meaningful answers, which is exactly what ordered factors are for.

20.8 Summary

| Task | Function |
| --- | --- |
| Create a factor | factor(x), factor(x, levels = …) |
| Ordered category | factor(x, levels = …, ordered = TRUE) |
| Bin a numeric → factor | cut() |
| Inspect | levels(), nlevels(), table() |
| Rename all levels | levels(f) <- … |
| Drop unused levels | droplevels() |
| Reorder by hand | factor(f, levels = …) |
| Change reference level | relevel(f, ref = …) |
| Reorder by another variable | forcats::fct_reorder() |
| Reorder by frequency | forcats::fct_infreq() |

Factors are R’s quiet workhorses for categorical data: efficient, order-aware, and central to every grouped summary, contingency table, and modelling formula you will write from here on.