8  Role and Purpose of Data Structures in R

Note: What This Chapter Covers

This chapter is a map of R’s core data structures and the reasons each one exists. You will see that R is unusual among programming languages in having data-analysis concepts (vectors, factors, data frames) built into the language itself. You will learn the two dimensions along which every structure is classified: its shape (1-D, 2-D, or n-D) and its content, that is, whether it is homogeneous (all elements of one type) or heterogeneous (mixed types). You will meet each of the six everyday structures (vector, list, matrix, array, data frame, and factor) with a one-line definition, a typical use case, and a small example. You will also see a decision diagram that tells you which structure to reach for when a new kind of data shows up in your project. The individual structures are covered in depth in later chapters; this chapter exists so you know the neighbourhood before you move into any of the houses.

```mermaid
flowchart TD
    DS["R Data Structures"] --> D1["1-Dimensional"]
    DS --> D2["2-Dimensional"]
    DS --> DN["n-Dimensional"]

    D1 --> VEC["Vector <br> (homogeneous)"]
    D1 --> LST["List <br> (heterogeneous)"]
    D1 --> FAC["Factor <br> (categorical)"]

    D2 --> MAT["Matrix <br> (homogeneous)"]
    D2 --> DF["Data Frame <br> (heterogeneous)"]

    DN --> ARR["Array <br> (homogeneous, any #dims)"]

    style DS fill:#e3f2fd,stroke:#1976D2
    style D1 fill:#fff3e0,stroke:#F57C00
    style D2 fill:#fff3e0,stroke:#F57C00
    style DN fill:#fff3e0,stroke:#F57C00
    style VEC fill:#e8f5e9,stroke:#388E3C
    style LST fill:#f3e5f5,stroke:#8E24AA
    style FAC fill:#e8f5e9,stroke:#388E3C
    style MAT fill:#e8f5e9,stroke:#388E3C
    style DF fill:#f3e5f5,stroke:#8E24AA
    style ARR fill:#e8f5e9,stroke:#388E3C
```


8.1 Why Data Structures Matter

Note: Core Concept: Structure Shapes Behaviour

A data structure is the shape you give a collection of values so that the language can act on them efficiently. The same thousand numbers can be stored as a vector, a matrix, or a data frame column; each choice opens and closes particular operations. Arithmetic across columns is easy with a matrix but awkward with a list; mixing numbers and strings is natural in a data frame but not possible in a matrix, which coerces every cell to a single type. Picking the right structure up front is half of writing clean R code.
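
As a small illustration of that point, the sketch below stores the same values three ways and shows one operation that each choice makes easy or hard. The variable names and the twelve sample values are made up for the example.

```r
values <- 1:12                            # twelve integers

as_vector <- values                       # 1-D, no dim attribute
as_matrix <- matrix(values, nrow = 3)     # 3 x 4 grid, filled column by column
as_column <- data.frame(value = values)   # one column in a 12-row table

length(as_vector)    # 12
dim(as_matrix)       # 3 4
dim(as_column)       # 12 1

# Column-wise arithmetic is a one-liner on the matrix
colSums(as_matrix)   # 6 15 24 33

# One character value forces every cell of a matrix to character
mixed <- matrix(c(1, 2, "three", 4), nrow = 2)
typeof(mixed)        # "character"
```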

Tip: Expert Insight: R Was Designed Around Data, Not Objects

Most languages start from generic objects (records, classes) and bolt data-analysis features on top. R starts from the data-analysis features. Vectors, factors, and data frames are first-class citizens of the language, and the consequence is that idiomatic R reads like applied statistics rather than like software engineering. Understanding the structures is therefore understanding R.


8.2 The Two Dimensions of Classification

Note: Homogeneous vs Heterogeneous, 1-D vs 2-D vs n-D

Every core R data structure can be placed in a two-by-three grid. One axis is shape: how many dimensions the structure has. The other axis is content: whether every element must share a type (homogeneous) or whether types can mix (heterogeneous).

| Shape | Homogeneous | Heterogeneous |
|---|---|---|
| 1-Dimensional | Vector | List |
| 2-Dimensional | Matrix | Data Frame |
| n-Dimensional | Array | (rare, not built in) |

Factor is a special case: it is one-dimensional and holds categorical labels, not arbitrary text. It is stored internally as integer codes with a lookup table, which makes it memory-efficient for repeated categories.
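
One way to make the grid concrete is to ask R where an existing object falls in it. The function below, `describe_structure()`, is a hypothetical helper written for this sketch, not part of R. It assumes ordinary atomic matrices and arrays, and it checks the more specific structures first because a data frame is also a list and a factor is also built on an integer vector.

```r
# Hypothetical helper: place an object in the shape/content grid.
describe_structure <- function(x) {
  if (is.data.frame(x)) return("data frame: 2-D, heterogeneous")
  if (is.factor(x))     return("factor: 1-D, categorical")
  if (is.matrix(x))     return("matrix: 2-D, homogeneous")
  if (is.array(x))      return("array: n-D, homogeneous")
  if (is.list(x))       return("list: 1-D, heterogeneous")
  if (is.atomic(x))     return("vector: 1-D, homogeneous")
  "not one of the six core structures"
}

describe_structure(1:3)                          # vector
describe_structure(list(1, "a"))                 # list
describe_structure(matrix(1:4, nrow = 2))        # matrix
describe_structure(array(1:8, dim = c(2, 2, 2))) # array
describe_structure(mtcars)                       # data frame
describe_structure(factor(c("a", "b")))          # factor
```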

Warning: Common Confusion: “Everything Is a Vector, Right?”

R is sometimes described as “a language of vectors”. That is true in the sense that a single number is really a length-one vector, and, technically, lists are generic vectors while matrices and arrays are atomic vectors with a dim attribute. It does not mean that every container behaves like a plain vector: lists, matrices, arrays, factors, and data frames each report their own class, print differently, and follow their own rules for subsetting.
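
R's own predicates show where the lines fall. Note that `is.vector()` answers a narrower technical question than the everyday sense of "vector": it returns `TRUE` for anything with vector storage and no attributes other than names, which includes lists but excludes matrices. The values below are arbitrary.

```r
x <- 5
is.vector(x)     # TRUE: a single number is a length-one vector
length(x)        # 1

lst <- list(1, "a")
is.vector(lst)   # TRUE: a list is a "generic" vector
is.atomic(lst)   # FALSE: its elements need not share a type

m <- matrix(1:4, nrow = 2)
is.atomic(m)     # TRUE: same storage as an integer vector
is.vector(m)     # FALSE: the dim attribute changes how it behaves
dim(m)           # 2 2
```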


8.3 Vectors

Note: Homogeneous, One-Dimensional, the Default Container

A vector is the simplest and most common structure in R. Every element has the same atomic type, and the whole collection is one-dimensional. You create one with c() (combine) or with a sequence generator like 1:10 or seq().
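
A minimal sketch of the three creation routes just mentioned, with made-up values. The last line shows what "same atomic type" means in practice: when types are mixed inside `c()`, R coerces them to a common type instead of raising an error.

```r
heights <- c(1.62, 1.75, 1.80)    # combine values into a double vector
ids     <- 1:10                   # integer sequence 1, 2, ..., 10
grid    <- seq(0, 1, by = 0.25)   # 0.00 0.25 0.50 0.75 1.00

typeof(heights)   # "double"
typeof(ids)       # "integer"
length(grid)      # 5

# Mixed input is coerced to one type (here: character)
mixed <- c(1, "two", TRUE)
mixed             # "1" "two" "TRUE"
```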

Note: When to Reach for a Vector

Use a vector when you have a single column of data, a set of coordinates, a set of names, or any sequence of values of one type. Most of R’s built-in functions accept vector input and return vector output, which is why vectors are the default choice.
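
The vector-in, vector-out behaviour can be sketched as follows. The temperatures and names are invented for the example; the functions are all base R.

```r
temps_c <- c(12.5, 18.0, 21.3)

# Arithmetic applies to every element at once, no loop needed
temps_f <- temps_c * 9 / 5 + 32
temps_f                     # 54.50 64.40 70.34

# Comparisons and string functions are vectorised too
temps_c > 15                # FALSE TRUE TRUE
nchar(c("Ada", "Grace"))    # 3 5

# Summary functions take a vector and return a length-one vector
mean(temps_c)
```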


8.4 Lists

Note: Heterogeneous, One-Dimensional, the Universal Container

A list is one-dimensional like a vector, but each element can be of any type, including another list. Lists are how R stores anything whose pieces do not all share a type: the result of a statistical test, the settings for a plot, or an arbitrary nested configuration.
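
A short sketch of a list holding mixed and nested pieces. The `experiment` object and its contents are invented; `t.test()` is used only to show that a familiar statistical result is itself a list.

```r
experiment <- list(
  name     = "pilot study",                           # character
  scores   = c(4.1, 3.8, 5.0),                        # numeric vector
  complete = TRUE,                                    # logical
  settings = list(alpha = 0.05, method = "two-sided") # nested list
)

experiment$scores           # extract one element by name
experiment[["name"]]        # same idea with double brackets
experiment$settings$alpha   # 0.05, reached through the nested list

# The result of a statistical test is a list with named pieces
tt <- t.test(c(5.1, 4.9, 5.6, 5.8, 6.0))
is.list(tt)                 # TRUE
names(tt)                   # includes "statistic" and "p.value"
```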

Note: When to Reach for a List

Use a list when the pieces of your data do not fit neatly into a single type, when you want to return multiple objects from a function, or when you need a hierarchical (nested) structure. Lists are also the underlying representation of many R objects, including data frames themselves.
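
Two of these points can be sketched briefly: returning several objects from one function, and confirming that a data frame is a list underneath. `describe_sample()` is a hypothetical helper written for this example.

```r
# Hypothetical helper: return several results in one list
describe_sample <- function(x) {
  list(n = length(x), mean = mean(x), range = range(x))
}

out <- describe_sample(c(2, 4, 9))
out$n       # 3
out$mean    # 5
out$range   # 2 9

# A data frame is a list with one element per column
df <- data.frame(id = 1:2, label = c("a", "b"))
is.list(df)   # TRUE
length(df)    # 2, the number of columns
```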


8.5 Matrices

Note: Homogeneous, Two-Dimensional, Organised into Rows and Columns

A matrix is a two-dimensional vector: every cell has the same atomic type, and the whole thing is laid out in a rectangle. Matrices support linear algebra, element-wise arithmetic, and fast row- or column-wise operations.
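
The operations named above can be sketched on a small matrix. The values are arbitrary; note that `matrix()` fills column by column unless told otherwise.

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns
m
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

m * 10         # element-wise arithmetic
rowSums(m)     # 9 12
colMeans(m)    # 1.5 3.5 5.5

# Linear algebra: (2 x 3) times (3 x 2) gives a 2 x 2 product
m %*% t(m)
```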

Note: When to Reach for a Matrix

Use a matrix when the data is naturally rectangular and every value has the same type: distances between cities, correlation coefficients, covariance structures, grayscale images. Use a data frame instead when the columns have different types.
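
A correlation matrix is a typical case. The sketch below uses the built-in `mtcars` data set; the choice of three columns is arbitrary.

```r
# Correlations between three numeric variables from a built-in data set
cm <- cor(mtcars[, c("mpg", "wt", "hp")])

is.matrix(cm)      # TRUE
dim(cm)            # 3 3
isSymmetric(cm)    # TRUE: cor(a, b) equals cor(b, a)
diag(cm)           # each variable correlates perfectly with itself

# Look up a single coefficient by row and column name
cm["mpg", "wt"]
```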


8.6 Data Frames

Note: Heterogeneous, Two-Dimensional, the Workhorse of Analysis

A data frame looks like a table in a spreadsheet: rows are observations, columns are variables, and different columns can hold different types. Internally it is a list of equal-length vectors, one per column.
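
A minimal sketch with an invented `survey` table. `stringsAsFactors = FALSE` is written out so that the `name` column stays character on R versions older than 4.0, where the default was different.

```r
survey <- data.frame(
  id     = 1:3,
  name   = c("Ana", "Ben", "Chloe"),
  score  = c(88.5, 92.0, 79.5),
  passed = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

dim(survey)             # 3 4: three observations, four variables
sapply(survey, class)   # one type per column: integer, character, numeric, logical
is.list(survey)         # TRUE: a list of equal-length columns

survey$score            # pull out one column as a vector
survey[survey$passed, ] # keep only the rows where passed is TRUE
```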

Note: When to Reach for a Data Frame

Use a data frame for essentially every real-world tabular data set: survey responses, sales records, experimental results, anything that came out of a CSV or a database query. It is what most modelling and plotting functions expect to receive, typically through a data argument.
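
The CSV-to-model path can be sketched in a few lines. The CSV content is invented and read from a string via the `text` argument of `read.csv()` so the example needs no file on disk; with only three rows, the fitted model is purely illustrative.

```r
csv_text <- "region,units,price
north,10,2.5
south,14,2.4
east,9,2.7"

# read.csv() returns a data frame
sales <- read.csv(text = csv_text)
str(sales)

# Modelling functions take the data frame through their data argument
fit <- lm(units ~ price, data = sales)
coef(fit)
```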


8.7 Factors

Note: Categorical Labels with a Fixed Set of Levels

A factor is R’s built-in type for categorical data. It looks like a character vector, but under the hood it stores small integer codes and a separate table of allowed values (the “levels”). Statistical models, plots, and tables all treat factors differently from plain character vectors; for example, they know the full set of possible values even if some are absent from the current data.
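
The integer codes and the levels table can be inspected directly. The blood-group data below is made up; the level "AB" is deliberately absent from the observations to show that tabulation still reports it.

```r
blood <- factor(c("A", "O", "O", "B"),
                levels = c("A", "B", "AB", "O"))

levels(blood)       # "A" "B" "AB" "O": the table of allowed values
as.integer(blood)   # 1 4 4 2: the stored codes
typeof(blood)       # "integer"

# table() reports every level, including the one with no observations
table(blood)
# blood
#  A  B AB  O
#  1  1  0  2
```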

Note: When to Reach for a Factor

Use a factor when a variable has a small, known set of allowed values (gender, treatment group, grade band, region code). Keep it as a plain character vector when the set of values is open-ended or very large (free-text comments, user IDs).
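
One practical consequence of a known set of allowed values is that a factor can flag anything outside it. In this sketch the `allowed` vector and the misspelt "nroth" are invented for illustration.

```r
allowed <- c("north", "south", "east", "west")

# Values that are not among the levels become NA
region <- factor(c("north", "south", "nroth"), levels = allowed)
is.na(region)     # FALSE FALSE TRUE: the typo is caught
nlevels(region)   # 4

# Open-ended text has no fixed set of values, so it stays character
comments <- c("great service", "arrived late", "would buy again")
is.character(comments)   # TRUE
```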


8.8 Arrays

Note: Homogeneous, n-Dimensional

An array is the general case of a matrix. A matrix is a 2-D array; an array can have any number of dimensions. The cells all share one atomic type.
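
A sketch of a three-dimensional array and its relationship to a matrix. The dimensions (2 by 3 by 4) and the fill values are arbitrary.

```r
cube <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 layers

dim(cube)        # 2 3 4
cube[1, 2, 3]    # 15: row 1, column 2, layer 3
cube[, , 1]      # the first layer, an ordinary 2 x 3 matrix

# A matrix is simply the two-dimensional case
m <- matrix(1:6, nrow = 2)
is.array(m)      # TRUE
length(dim(m))   # 2
```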

Note: When to Reach for an Array

Arrays are less common in everyday data analysis than the other structures. They matter when you need genuinely higher-dimensional numeric data: a time-by-region-by-measure cube in economics, a three-way contingency table in statistics, image stacks in bioinformatics.
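
A three-way contingency table is easy to produce from built-in data. The sketch below cross-tabulates three columns of `mtcars`; the choice of variables is arbitrary.

```r
# Count cars by cylinders, gears, and transmission type
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear, am = mtcars$am)

dim(tab)             # 3 3 2: three axes
is.array(tab)        # TRUE: a table with three factors is a 3-D array

# Collapse over the first two axes to get totals along the third
apply(tab, 3, sum)

# Every car is counted exactly once
sum(tab) == nrow(mtcars)   # TRUE
```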


8.9 A Decision Diagram

Note: Which Structure Should I Use?

| If your data is … | Use |
|---|---|
| A single sequence of values, all the same type. | Vector |
| A collection of values with mixed types, possibly nested. | List |
| A rectangle of values, all the same type (numeric grid, image). | Matrix |
| A rectangle with columns of different types (typical table data). | Data Frame |
| A small, fixed set of category labels. | Factor |
| Higher-dimensional numeric data (three or more axes). | Array |

Tip: Expert Insight: Data Frames Solve 80 Percent of Real Problems

Most of what you will ever do in R happens inside a data frame. It is the common format for CSVs, for database results, for the tidyverse, for ggplot2, and for almost every modelling function. The other structures exist to support data frames internally or to handle the narrower cases where a data frame is the wrong fit.


8.10 A Quick Tour in Code

Note: Seeing All Six Structures Side by Side
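
A minimal sketch of such a tour, using made-up values. Each of the six structures is created in one line, then `class()` and `dim()` are used to compare them.

```r
v  <- c(1.5, 2.5, 3.5)                                  # vector
l  <- list(id = 1L, tags = c("a", "b"), ok = TRUE)      # list
m  <- matrix(1:6, nrow = 2)                             # matrix
a  <- array(1:24, dim = c(2, 3, 4))                     # array
df <- data.frame(x = 1:3, y = c("a", "b", "c"))         # data frame
f  <- factor(c("low", "high", "low"),
             levels = c("low", "high"))                 # factor

structures <- list(vector = v, list = l, matrix = m,
                   array = a, data_frame = df, factor = f)

# First class label of each object
sapply(structures, function(x) class(x)[1])

# Number of dimensions: 0 means no dim attribute (a 1-D structure)
sapply(structures, function(x) length(dim(x)))
```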

8.11 Summary

Note: Key Concepts at a Glance

| Concept | Key Takeaway |
|---|---|
| Structure choice shapes code | Picking the right container makes downstream code simple and fast. |
| Two axes of classification | Shape (1-D / 2-D / n-D) and content (homogeneous / heterogeneous). |
| Vector | Homogeneous, 1-D; the default container and the argument type for most functions. |
| List | Heterogeneous, 1-D; the universal container; returned by many modelling functions. |
| Matrix | Homogeneous, 2-D; ideal for numeric grids and linear algebra. |
| Data frame | Heterogeneous, 2-D; the workhorse of real-world analysis. |
| Factor | Categorical 1-D; small, fixed set of labels with level order. |
| Array | Homogeneous, n-D; for truly higher-dimensional numeric data. |

Tip: Applying This in Practice

Whenever you face a new data set or a new calculation, ask two questions before writing any code: what shape does this have (flat sequence, table, grid, stack), and do all of its values share a type? Those two answers pick the structure for you. The next chapter dives into the most important of them, vectors, and shows every way they can be created, sequenced, sized, and named.
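
The two questions can be written down as a tiny lookup. `suggest_structure()` is a hypothetical helper invented for this sketch; it only restates the classification grid and leaves out the factor, which is chosen by meaning (categorical labels) and not by shape.

```r
# Hypothetical helper: map the two answers to a structure name.
#   n_dims    - number of dimensions the data naturally has (1, 2, 3, ...)
#   same_type - TRUE if every value shares one type
suggest_structure <- function(n_dims, same_type) {
  stopifnot(n_dims >= 1)
  if (n_dims == 1) return(if (same_type) "vector" else "list")
  if (n_dims == 2) return(if (same_type) "matrix" else "data frame")
  if (same_type) "array" else "nested list (no standard n-D mixed-type structure)"
}

suggest_structure(1, TRUE)    # "vector"
suggest_structure(1, FALSE)   # "list"
suggest_structure(2, TRUE)    # "matrix"
suggest_structure(2, FALSE)   # "data frame"
suggest_structure(3, TRUE)    # "array"
```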