1  Introduction to R

Note: What This Chapter Covers

This chapter introduces R as a language and as an environment for statistical computing and data analysis. You will learn what R is, where it came from, why it has become one of the most widely used tools in data science, and what role the wider R ecosystem plays when you sit down to work on a real problem. By the end of this chapter, you will have a clear mental model of what R is doing under the hood and why each of the chapters that follow is structured the way it is.

flowchart LR
    A["R Language <br> (Core Engine)"] --> B["R Environment <br> (Console + Workspace)"]
    B --> C["Packages <br> (CRAN, Bioconductor)"]
    C --> D["Your Analysis <br> (Scripts, Reports, Apps)"]
    style A fill:#e3f2fd,stroke:#1976D2
    style B fill:#fff3e0,stroke:#F57C00
    style C fill:#f3e5f5,stroke:#8E24AA
    style D fill:#e8f5e9,stroke:#388E3C

1.1 What is R?

Note: Core Concept: R as a Language and an Environment

R is a free, open-source programming language and software environment purpose-built for statistical computing, data analysis, and graphics. Two points in that definition deserve emphasis. First, R is a fully featured programming language: you can write functions, build packages, and automate complex workflows just as you would in Python or any general-purpose language. Second, R is an interactive environment: you can type a command at the console and immediately see the result, which is why it has suited the exploratory, iterative rhythm of data analysis since its earliest days.
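Both points can be seen in a few lines of base R. The sketch below is illustrative only: the function name `standardise` and the sample values are invented for this example and do not come from any package.

```r
# R as a language: define a small reusable function.
# It centres a numeric vector on its mean and scales it by its standard deviation.
standardise <- function(x) {
  (x - mean(x)) / sd(x)
}

# R as an environment: at the console, each expression is evaluated
# and its value is printed straight away.
heights <- c(150, 160, 170, 180, 190)  # illustrative values, not real data
mean(heights)                          # 170
z <- standardise(heights)
round(z, 2)
```

Typing these lines one at a time at the console is the quickest way to get a feel for the interactive rhythm described above.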

Note: A Short History of R

R was created in the early 1990s by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and was released as free software under the GNU General Public License (GPL) in 1995. It is an open-source implementation of the S language, which John Chambers and colleagues at Bell Laboratories began developing in 1976. Because R is largely source-compatible with S, it inherited a mature, carefully designed language for data analysis, while the GPL licence opened that language up to a global community of statisticians, researchers, and developers. Today R is maintained by the R Core Team and supported by the R Foundation.

| Year | Milestone |
|---|---|
| 1976 | S language created at Bell Labs (Chambers and colleagues) |
| 1993 | Initial work on R begins at University of Auckland |
| 1995 | R released as free software under GNU GPL |
| 1997 | CRAN (Comprehensive R Archive Network) established |
| 2000 | R version 1.0.0 released |
| 2011 | RStudio IDE released, dramatically improving accessibility |
| Present | Thousands of contributed packages, used in academia, industry, and government |

Tip: Why the History Matters for You

Knowing that R grew out of S matters practically: most of the elegant, vectorised syntax you are about to learn was designed by statisticians for statisticians. Operations that feel verbose or awkward in general-purpose languages, such as applying a transformation to every element of a column or fitting a regression model in one line, are concise in R because the language was built around these exact tasks.
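As a rough illustration of both examples, the sketch below uses `mtcars`, a small data set that ships with R. The variable names `weight_kg` and `fit` are chosen for this example only.

```r
# Vectorised arithmetic: transform every element of a column without a loop.
# In mtcars, wt is the car's weight in units of 1000 lb.
weight_kg <- mtcars$wt * 1000 * 0.45359237

# One line fits a linear regression of fuel economy (mpg) on weight (wt).
fit <- lm(mpg ~ wt, data = mtcars)

# Inspect the estimated intercept and slope.
coef(fit)
```

The multiplication is applied to all 32 weights at once, and `lm()` takes a model formula directly, so neither step needs an explicit loop or any hand-built matrices.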


1.2 Why Choose R for Data Analysis?

Note: Core Concept: R’s Comparative Strengths

R’s dominant position in statistical computing is not an accident. It reflects a tight alignment between the language design, the surrounding ecosystem, and the daily work of analysts and researchers. The strengths below are the ones you will feel most immediately in your own work.

| Strength | What It Means in Practice |
|---|---|
| Free and open source | Zero licensing cost, full transparency, and the right to inspect, modify, and distribute the source code. |
| Vectorised operations | You apply operations to entire vectors or columns at once, without writing explicit loops. |
| Rich statistical library | Thousands of packages implement classical and modern statistical methods, often authored by the researchers who developed them. |
| World-class graphics | Base R graphics, lattice, and especially ggplot2 produce publication-quality visualisations. |
| Reproducible research | R Markdown and Quarto let you weave code, results, and narrative into a single document. |
| Active community | CRAN, R-bloggers, Stack Overflow, and Posit Community provide help and example code for almost every problem. |

Note: How R Compares to Other Tools

Choosing between tools is a question of fit, not a popularity contest. The table below summarises how R compares with the tools you are most likely to encounter in a business or academic setting.

| Dimension | R | Python | SAS / SPSS | Excel |
|---|---|---|---|---|
| Cost | Free | Free | Commercial licence | Commercial licence |
| Primary focus | Statistics, visualisation, reporting | General-purpose programming, ML, web | Enterprise statistics, regulated industries | Ad-hoc spreadsheet analysis |
| Learning curve | Moderate for analysts, syntax feels statistical | Moderate, syntax feels general-purpose | Low in GUI mode, higher in code | Low initially, very high at scale |
| Data size | Works in memory by default; data.table speeds up large in-memory data, and arrow handles larger-than-memory datasets | Similar, with strong big-data connectors | Strong at large, on-disk data | Slows markedly beyond a few hundred thousand rows; hard cap of 1,048,576 rows per worksheet |
| Reproducibility | Very strong via R Markdown and Quarto | Strong via Jupyter and scripts | Moderate | Weak |
| Community packages | CRAN (around 20,000 packages) | PyPI (general-purpose) | Vendor ecosystem | Add-ins |

Tip: Expert Insight: R and Python Are Complements, Not Rivals

In professional practice, data teams routinely use R for statistical modelling, visualisation, and reporting, while using Python for production machine learning pipelines, web applications, and deep learning. The reticulate package even lets you call Python from R and pass objects between them. Treat the R versus Python question as a false dichotomy; the practitioner’s question is always “which language is the right tool for this specific task?”

Warning: Common Misconception: “R Is Only for Statisticians”

New learners often assume R is a niche tool for academic statisticians. In reality, R is used in banks for credit risk modelling, in pharmaceutical companies for clinical trial analysis, in newsrooms for data journalism, and in tech firms for A/B testing and experimentation. If your work involves data, the odds are good that someone in your field is already using R productively.


1.3 The R Ecosystem

Note: Core Concept: The Package System

The R language itself is intentionally small. Most of R’s power comes from packages, which are collections of R functions, data, and documentation that extend the base language for specific tasks. You install a package once, load it into your current session with library(), and then use its functions as if they were part of R itself.
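The install-once, load-per-session pattern looks roughly like this. To keep the sketch runnable without a network connection, the installation line is commented out and the package being loaded is `tools`, which ships with every R installation; the file name is made up.

```r
# One-time installation from CRAN (needs network access, so commented out here):
# install.packages("dplyr")

# Attach a package for the current session.
library(tools)

# Its functions can now be called as if they were built in.
ext <- file_ext("report.qmd")   # hypothetical file name
ext                             # "qmd"

# Alternatively, call a single function without attaching its package, using `::`.
utils::head(letters, 3)
```

The `package::function()` form is handy in scripts because it makes clear where each function comes from.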

Note: The Three Pillars of the R Package Ecosystem

| Repository | Purpose | Typical Users |
|---|---|---|
| CRAN | The Comprehensive R Archive Network, the official and largest repository with around 20,000 packages covering statistics, data handling, and visualisation. | Most R users, most of the time. |
| Bioconductor | A curated repository of about 2,000 packages for genomics, proteomics, and computational biology. | Researchers in life sciences and bioinformatics. |
| GitHub | Unofficial but widely used for in-development and cutting-edge packages, installed via remotes or devtools. | Package developers and users who need features not yet on CRAN. |

Tip: The tidyverse in One Sentence

Of the thousands of packages available, one collection deserves special mention at the start of your R journey. The tidyverse is a coordinated set of packages, including dplyr, ggplot2, tidyr, and readr, that share a consistent philosophy and grammar for data work. Learning tidyverse syntax early will pay dividends in every subsequent chapter of this book.

Warning: Package Sprawl Is a Real Cost

Because installing packages is so easy, it is tempting to reach for a new one the moment a problem feels slightly novel. In a production or reproducible-research setting, every package you add is a future maintenance cost: versions change, dependencies conflict, and authors abandon projects. The professional discipline is to prefer base R and well-established packages, and to pin package versions using tools such as renv when a project needs to remain reproducible over time.
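A typical renv workflow is shown in the comments below. The calls are commented out because they assume renv is installed and that you are working inside a project directory. The runnable part is only a base-R stand-in that records version numbers; it is not a substitute for a lockfile, and the package list `pkgs` is hypothetical.

```r
# With renv installed, a typical workflow is:
# renv::init()      # create a project-local library and a lockfile
# renv::snapshot()  # record the exact package versions in renv.lock
# renv::restore()   # reinstall those versions on another machine

# Base-R stand-in: note the R version and the versions of the
# packages a project relies on.
pkgs <- c("stats", "utils", "tools")   # hypothetical list for illustration
versions <- vapply(
  pkgs,
  function(p) as.character(packageVersion(p)),
  character(1)
)
r_version <- R.version.string

r_version
versions
```

Even this minimal record makes it easier to explain why an analysis behaves differently a year later.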


1.4 What You Can Do with R

Note: Core Concept: The Five Common Workloads

In practice, most R work falls into one of five overlapping categories. Every chapter that follows in this book builds skills that feed into one or more of them.

| Workload | Typical Packages | Example Task |
|---|---|---|
| Data import and cleaning | readr, readxl, janitor, tidyr | Load a messy Excel file from a client and produce a clean, analysis-ready data frame. |
| Statistical modelling | stats (base), lme4, survival, rstanarm | Fit a logistic regression to predict customer churn and interpret the coefficients. |
| Data visualisation | ggplot2, plotly, leaflet | Build a multi-panel chart comparing sales across regions and quarters. |
| Reproducible reporting | rmarkdown, quarto, knitr | Produce a monthly PDF report that re-runs automatically when new data arrives. |
| Interactive applications | shiny, flexdashboard | Build a web dashboard that lets non-technical colleagues explore results interactively. |

Note: A First Glimpse of R Code

You have not learned any R syntax yet, so the code below is meant to be read rather than typed. Notice how few lines are needed to load data, summarise it, and draw a chart. This conciseness is what vectorised, statistics-oriented language design delivers.
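The chapter's own listing does not appear in this text, so the sketch below stands in for it. It assumes the dplyr and ggplot2 packages are installed and that you are running R 4.1 or later (for the native pipe `|>`). The built-in `mtcars` data set and the names `cars`, `mpg_by_cyl`, and `p` are choices made for this illustration.

```r
library(dplyr)     # grouped summaries
library(ggplot2)   # plotting

# Load: mtcars ships with R and holds fuel-economy data for 32 car models.
cars <- mtcars

# Summarise: average miles per gallon and number of cars per cylinder count.
mpg_by_cyl <- cars |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg), n = n())

mpg_by_cyl

# Draw: a bar chart of the grouped means.
p <- ggplot(mpg_by_cyl, aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Mean miles per gallon")

print(p)
```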

Tip: Treat Every Chapter as a Building Block

The code above uses vectors, data frames, the pipe operator, grouped summaries, and ggplot2, which are concepts spread across the next several chapters. Do not worry about understanding every line now. Return to this example after finishing Module 3 and you will find it reads almost like English.


1.5 How This Book Is Organised

Note: Reading Path for Different Backgrounds

The book is structured in four modules that map directly to the course learning outcomes. Depending on your background, you may benefit from reading them in slightly different orders.

| Background | Suggested Path |
|---|---|
| Complete beginner | Read linearly, chapters 1 to 26. |
| Familiar with another language | Skim Module 1, focus on Modules 2 and 3 for R-specific idioms, then work carefully through Module 4. |
| Coming from Excel or SPSS | Read Module 1 carefully, spend extra time on Module 2 (data structures) and Module 3 (data frames), which are the biggest mental shift. |
| Coming from Python or pandas | Focus on Module 2 (vectors behave differently from NumPy arrays) and Module 3 (dplyr idioms differ from pandas). |
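For readers arriving from Python, a few of the differences in vector behaviour can be previewed in base R. This is a small sketch with made-up values, not a full comparison; Module 2 covers the details.

```r
x <- c(10, 20, 30, 40)

first    <- x[1]         # indexing starts at 1, not 0
dropped  <- x[-1]        # a negative index drops that element; it does not count from the end
middle   <- x[2:3]       # ranges include both endpoints
recycled <- x + c(1, 2)  # the shorter vector is recycled to match the longer one

first      # 10
dropped    # 20 30 40
middle     # 20 30
recycled   # 11 22 31 42
```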

flowchart TD
    M1["Module 1 <br> Introduction to R Programming"] --> M2["Module 2 <br> Data Structures and Operations"]
    M2 --> M3["Module 3 <br> Descriptive Analytics with R"]
    M3 --> M4["Module 4 <br> Conditional Statements, Loops, and Functions"]
    style M1 fill:#e3f2fd,stroke:#1976D2
    style M2 fill:#fff3e0,stroke:#F57C00
    style M3 fill:#f3e5f5,stroke:#8E24AA
    style M4 fill:#e8f5e9,stroke:#388E3C


1.6 Summary

Note: Key Concepts at a Glance

| Concept | Key Takeaway |
|---|---|
| What R is | A free, open-source language and interactive environment for statistical computing, graphics, and data analysis. |
| Where R came from | Created by Ihaka and Gentleman in the 1990s as an open-source implementation of the S language from Bell Labs. |
| Why R is chosen | Vectorised syntax, rich statistical library, world-class graphics, reproducible reporting, and an active community. |
| The R ecosystem | A small base language extended by around 20,000 CRAN packages, with Bioconductor and GitHub adding specialised and bleeding-edge work. |
| The tidyverse | A coordinated set of packages with a consistent grammar for modern data analysis, worth learning early. |
| What R is used for | Import and cleaning, statistical modelling, visualisation, reproducible reporting, and interactive applications. |

Tip: Applying This in Practice

When you sit down to a new problem in R, ask three questions before writing any code. First, what is the smallest dataset that captures the question? R rewards you for working iteratively with small samples. Second, does a well-established package already solve this? Reaching for a mature package is almost always faster and safer than writing the routine yourself. Third, how will you make the work reproducible? The combination of scripts, R Markdown or Quarto documents, and version control is what separates a one-off analysis from durable professional output. Keep these three questions in mind as you move into the next chapter, where we examine the specific purpose and advantages that make R a compelling choice for your own work.
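The first and third questions can be combined in a couple of lines. The sketch below draws a small, repeatable sample from the built-in `mtcars` data set; the seed value and the sample size of 10 are arbitrary choices for illustration.

```r
# Fix the random seed so the same rows are drawn on every run.
set.seed(2024)

# Work with a small sample while you explore, then scale up to the full data.
small <- mtcars[sample(nrow(mtcars), 10), ]

dim(small)           # 10 rows, 11 columns
summary(small$mpg)
```

Saving lines like these in a script, rather than typing them once at the console, is the first step towards the reproducible habits described above.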