```mermaid
flowchart LR
A["R Language <br> (Core Engine)"] --> B["R Environment <br> (Console + Workspace)"]
B --> C["Packages <br> (CRAN, Bioconductor)"]
C --> D["Your Analysis <br> (Scripts, Reports, Apps)"]
style A fill:#e3f2fd,stroke:#1976D2
style B fill:#fff3e0,stroke:#F57C00
style C fill:#f3e5f5,stroke:#8E24AA
style D fill:#e8f5e9,stroke:#388E3C
```
# 1 Introduction to R
This chapter introduces R as a language and as an environment for statistical computing and data analysis. You will learn what R is, where it came from, why it has become one of the most widely used tools in data science, and what role the wider R ecosystem plays when you sit down to work on a real problem. By the end of this chapter, you will have a clear mental model of what R is doing under the hood and why each of the chapters that follow is structured the way it is.
## 1.1 What is R?
R is a free, open-source programming language and software environment purpose-built for statistical computing, data analysis, and graphics. Two points in that definition deserve emphasis. First, R is a fully featured programming language, which means you can write functions, build packages, and automate complex workflows just as you would in Python or any general-purpose language. Second, R is an interactive environment, meaning you can type a command at the console and immediately see the result, which is why it has been so well suited to the exploratory, iterative rhythm of data analysis since its earliest days.
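For example, a minimal console session looks like this (the numbers are arbitrary; each line is evaluated the moment you press Enter):

```r
# Each line typed at the R console is evaluated immediately
x <- c(4, 8, 15, 16, 23, 42)  # create a numeric vector
mean(x)                        # prints 18
sd(x)                          # prints the sample standard deviation
```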
R was created in the early 1990s by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and was first released publicly in 1995. It is an open-source implementation of the S language, which was developed by John Chambers and colleagues at Bell Laboratories beginning in 1976. Because R is largely source-compatible with S, it inherited a mature, carefully designed language for data analysis, while the open-source licensing under GPL opened that language up to a global community of statisticians, researchers, and developers. Today R is maintained by the R Core Team and supported by the R Foundation.
| Year | Milestone |
|---|---|
| 1976 | S language created at Bell Labs (Chambers and colleagues) |
| 1993 | Initial work on R begins at University of Auckland |
| 1995 | R released as free software under GNU GPL |
| 1997 | CRAN (Comprehensive R Archive Network) established |
| 2000 | R version 1.0.0 released |
| 2011 | RStudio IDE released, dramatically improving accessibility |
| Present | Thousands of contributed packages, used in academia, industry, and government |
Knowing that R grew out of S matters practically: most of the elegant, vectorised syntax you are about to learn was designed by statisticians for statisticians. Operations that feel verbose or awkward in general-purpose languages, such as applying a transformation to every element of a column or fitting a regression model in one line, are concise in R because the language was built around these exact tasks.
## 1.2 Why Choose R for Data Analysis?
R’s dominant position in statistical computing is not an accident. It reflects a tight alignment between the language design, the surrounding ecosystem, and the daily work of analysts and researchers. The strengths below are the ones you will feel most immediately in your own work.
| Strength | What It Means in Practice |
|---|---|
| Free and open source | Zero licensing cost, full transparency, and a right to inspect, modify, and distribute the source code. |
| Vectorised operations | You apply operations to entire vectors or columns at once, without writing explicit loops. |
| Rich statistical library | Thousands of packages implement classical and modern statistical methods, often authored by the researchers who developed them. |
| World-class graphics | Base R graphics, lattice, and especially ggplot2 produce publication-quality visualisations. |
| Reproducible research | R Markdown and Quarto let you weave code, results, and narrative into a single document. |
| Active community | CRAN, R-bloggers, Stack Overflow, and Posit Community provide help and example code for almost every problem. |
Choosing between tools is a question of fit, not a popularity contest. The table below summarises how R compares with the tools you are most likely to encounter in a business or academic setting.
| Dimension | R | Python | SAS / SPSS | Excel |
|---|---|---|---|---|
| Cost | Free | Free | Commercial licence | Commercial licence |
| Primary focus | Statistics, visualisation, reporting | General-purpose programming, ML, web | Enterprise statistics, regulated industries | Ad-hoc spreadsheet analysis |
| Learning curve | Moderate for analysts, syntax feels statistical | Moderate, syntax feels general-purpose | Low in GUI mode, higher in code | Low initially, very high at scale |
| Data size | Handles datasets up to memory limits well, larger via arrow or data.table | Similar, with strong big-data connectors | Strong at large, on-disk data | Breaks down beyond a few hundred thousand rows |
| Reproducibility | Very strong via R Markdown and Quarto | Strong via Jupyter and scripts | Moderate | Weak |
| Community packages | CRAN (around 20,000 packages) | PyPI (general-purpose) | Vendor ecosystem | Add-ins |
In professional practice, data teams routinely use R for statistical modelling, visualisation, and reporting, while using Python for production machine learning pipelines, web applications, and deep learning. The reticulate package even lets you call Python from R and pass objects between them. Treat the R versus Python question as a false dichotomy; the practitioner’s question is always “which language is the right tool for this specific task?”
New learners often assume R is a niche tool for academic statisticians. In reality, R is used in banks for credit risk modelling, in pharmaceutical companies for clinical trial analysis, in newsrooms for data journalism, and in tech firms for A/B testing and experimentation. If your work involves data, the odds are good that someone in your field is already using R productively.
## 1.3 The R Ecosystem
The R language itself is intentionally small. Most of R’s power comes from packages, which are collections of R functions, data, and documentation that extend the base language for specific tasks. You install a package once, load it into your current session with library(), and then use its functions as if they were part of R itself.
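The pattern looks like this. The install line is commented out so the snippet can be re-run without contacting CRAN, and `parallel` is used for the `library()` call because it ships with every R installation (the package names here are only examples):

```r
# Install once per machine (downloads the package from CRAN):
# install.packages("janitor")

# Load a package into the current session with library(); after this,
# its functions are available without any prefix
library(parallel)
detectCores()   # a function provided by the parallel package
```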
| Repository | Purpose | Typical Users |
|---|---|---|
| CRAN | The Comprehensive R Archive Network, the official and largest repository with around 20,000 packages covering statistics, data handling, and visualisation. | Most R users, most of the time. |
| Bioconductor | A curated repository of about 2,000 packages for genomics, proteomics, and computational biology. | Researchers in life sciences and bioinformatics. |
| GitHub | Unofficial but widely used for in-development and cutting-edge packages, installed via remotes or devtools. | Package developers and users who need features not yet on CRAN. |
Of the thousands of packages available, one collection deserves special mention at the start of your R journey. The tidyverse is a coordinated set of packages, including dplyr, ggplot2, tidyr, and readr, that share a consistent philosophy and grammar for data work. Learning tidyverse syntax early will pay dividends in every subsequent chapter of this book.
Because installing packages is so easy, it is tempting to reach for a new one the moment a problem feels slightly novel. In a production or reproducible-research setting, every package you add is a future maintenance cost: versions change, dependencies conflict, and authors abandon projects. The professional discipline is to prefer base R and well-established packages, and to pin package versions using tools such as renv when a project needs to remain reproducible over time.
## 1.4 What You Can Do with R
In practice, most R work falls into one of five overlapping categories. Every chapter that follows in this book builds skills that feed into one or more of them.
| Workload | Typical Packages | Example Task |
|---|---|---|
| Data import and cleaning | readr, readxl, janitor, tidyr | Load a messy Excel file from a client and produce a clean, analysis-ready data frame. |
| Statistical modelling | stats (base), lme4, survival, rstanarm | Fit a logistic regression to predict customer churn and interpret the coefficients. |
| Data visualisation | ggplot2, plotly, leaflet | Build a multi-panel chart comparing sales across regions and quarters. |
| Reproducible reporting | rmarkdown, quarto, knitr | Produce a monthly PDF report that re-runs automatically when new data arrives. |
| Interactive applications | shiny, flexdashboard | Build a web dashboard that lets non-technical colleagues explore results interactively. |
You have not learned any R syntax yet, so the code below is meant to be read rather than typed. Notice how few lines are needed to load data, summarise it, and draw a chart. This conciseness is what vectorised, statistics-oriented language design delivers.
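A sketch of such an analysis, using the built-in `mtcars` dataset (the grouping variable and the summary computed here are illustrative choices):

```r
library(dplyr)     # grouped data manipulation
library(ggplot2)   # grammar-of-graphics plotting

mtcars |>                                       # a built-in data frame
  group_by(cyl) |>                              # group cars by cylinder count
  summarise(mean_mpg = mean(mpg)) |>            # one summary row per group
  ggplot(aes(x = factor(cyl), y = mean_mpg)) +  # map groups to the x axis
  geom_col() +                                  # draw one bar per group
  labs(x = "Cylinders", y = "Mean miles per gallon")
```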
The code above uses vectors, data frames, the pipe operator, grouped summaries, and ggplot2, which are concepts spread across the next several chapters. Do not worry about understanding every line now. Return to this example after finishing Module 3 and you will find it reads almost like English.
## 1.5 How This Book Is Organised
The book is structured in four modules that map directly to the course learning outcomes. Depending on your background, you may benefit from reading them in slightly different orders.
| Background | Suggested Path |
|---|---|
| Complete beginner | Read linearly, chapters 1 to 26. |
| Familiar with another language | Skim Module 1, focus on Modules 2 and 3 for R-specific idioms, then work carefully through Module 4. |
| Coming from Excel or SPSS | Read Module 1 carefully, spend extra time on Module 2 (data structures) and Module 3 (data frames), which are the biggest mental shift. |
| Coming from Python or pandas | Focus on Module 2 (vectors behave differently from NumPy arrays) and Module 3 (dplyr idioms differ from pandas). |
```mermaid
flowchart TD
M1["Module 1 <br> Introduction to R Programming"] --> M2["Module 2 <br> Data Structures and Operations"]
M2 --> M3["Module 3 <br> Descriptive Analytics with R"]
M3 --> M4["Module 4 <br> Conditional Statements, Loops, and Functions"]
style M1 fill:#e3f2fd,stroke:#1976D2
style M2 fill:#fff3e0,stroke:#F57C00
style M3 fill:#f3e5f5,stroke:#8E24AA
style M4 fill:#e8f5e9,stroke:#388E3C
```
## 1.6 Summary
| Concept | Key Takeaway |
|---|---|
| What R is | A free, open-source language and interactive environment for statistical computing, graphics, and data analysis. |
| Where R came from | Created by Ihaka and Gentleman in the 1990s as an open-source implementation of the S language from Bell Labs. |
| Why R is chosen | Vectorised syntax, rich statistical library, world-class graphics, reproducible reporting, and an active community. |
| The R ecosystem | A small base language extended by around 20,000 CRAN packages, with Bioconductor and GitHub adding specialised and bleeding-edge work. |
| The tidyverse | A coordinated set of packages with a consistent grammar for modern data analysis, worth learning early. |
| What R is used for | Import and cleaning, statistical modelling, visualisation, reproducible reporting, and interactive applications. |
When you sit down to a new problem in R, the most productive mindset is to ask three questions before writing any code.

1. **What is the smallest dataset that captures the question?** R rewards you for working iteratively with small samples.
2. **Does a well-established package already solve this?** Reaching for a mature package is almost always faster and safer than writing the routine yourself.
3. **How will you make the work reproducible?** The combination of scripts, R Markdown or Quarto documents, and version control is what separates a one-off analysis from durable professional output.

Keep these three questions in mind as you move into the next chapter, where we examine the specific purpose and advantages that make R a compelling choice for your own work.