```mermaid
flowchart LR
A["R Language <br> (Core Engine)"] --> B["R Environment <br> (Console + Workspace)"]
B --> C["Packages <br> (CRAN, Bioconductor)"]
C --> D["Your Analysis <br> (Scripts, Reports, Apps)"]
style A fill:#e3f2fd,stroke:#1976D2
style B fill:#fff3e0,stroke:#F57C00
style C fill:#f3e5f5,stroke:#8E24AA
style D fill:#e8f5e9,stroke:#388E3C
```
# 1 Introduction to R
This chapter introduces R as a language and as an environment for statistical computing and data analysis. You will learn what R is, where it came from, why it has become one of the most widely used tools in data science, and what role the wider R ecosystem plays when you sit down to work on a real problem. By the end of this chapter, you will have a clear mental model of what R is doing under the hood and why each of the chapters that follow is structured the way it is.
## 1.1 What is R?
R is a free, open-source programming language and software environment purpose-built for statistical computing, data analysis, and graphics. Two points in that definition deserve emphasis. First, R is a fully featured programming language, which means you can write functions, build packages, and automate complex workflows just as you would in Python or any general-purpose language. Second, R is an interactive environment, meaning you can type a command at the console and immediately see the result, which is why it has been so well suited to the exploratory, iterative rhythm of data analysis since its earliest days.
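For example, a minimal console session looks like this (the numbers are arbitrary; each line is evaluated the moment you press Enter):

```r
# Each line typed at the R console is evaluated immediately
x <- c(4, 8, 15, 16, 23, 42)  # create a numeric vector
mean(x)                        # prints 18
sd(x)                          # prints the sample standard deviation
```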
R was created in the early 1990s by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and was first released publicly in 1995. It is an open-source implementation of the S language, which was developed by John Chambers and colleagues at Bell Laboratories beginning in 1976. Because R is largely source-compatible with S, it inherited a mature, carefully designed language for data analysis, while the open-source licensing under GPL opened that language up to a global community of statisticians, researchers, and developers. Today R is maintained by the R Core Team and supported by the R Foundation.
| Year | Milestone |
|---|---|
| 1976 | S language created at Bell Labs (Chambers and colleagues) |
| 1993 | Initial work on R begins at University of Auckland |
| 1995 | R released as free software under GNU GPL |
| 1997 | CRAN (Comprehensive R Archive Network) established |
| 2000 | R version 1.0.0 released |
| 2011 | RStudio IDE released, dramatically improving accessibility |
| Present | Thousands of contributed packages, used in academia, industry, and government |
Knowing that R grew out of S matters practically: most of the elegant, vectorised syntax you are about to learn was designed by statisticians for statisticians. Operations that feel verbose or awkward in general-purpose languages, such as applying a transformation to every element of a column or fitting a regression model in one line, are concise in R because the language was built around these exact tasks.
## 1.2 Why Choose R for Data Analysis?
R’s dominant position in statistical computing is not an accident. It reflects a tight alignment between the language design, the surrounding ecosystem, and the daily work of analysts and researchers. The strengths below are the ones you will feel most immediately in your own work.
| Strength | What It Means in Practice |
|---|---|
| Free and open source | Zero licensing cost, full transparency, and a right to inspect, modify, and distribute the source code. |
| Vectorised operations | You apply operations to entire vectors or columns at once, without writing explicit loops. |
| Rich statistical library | Thousands of packages implement classical and modern statistical methods, often authored by the researchers who developed them. |
| World-class graphics | Base R graphics, lattice, and especially ggplot2 produce publication-quality visualisations. |
| Reproducible research | R Markdown and Quarto let you weave code, results, and narrative into a single document. |
| Active community | CRAN, R-bloggers, Stack Overflow, and Posit Community provide help and example code for almost every problem. |
Choosing between tools is a question of fit, not a popularity contest. The table below summarises how R compares with the tools you are most likely to encounter in a business or academic setting.
| Dimension | R | Python | SAS / SPSS | Excel |
|---|---|---|---|---|
| Cost | Free | Free | Commercial licence | Commercial licence |
| Primary focus | Statistics, visualisation, reporting | General-purpose programming, ML, web | Enterprise statistics, regulated industries | Ad-hoc spreadsheet analysis |
| Learning curve | Moderate for analysts, syntax feels statistical | Moderate, syntax feels general-purpose | Low in GUI mode, higher in code | Low initially, very high at scale |
| Data size | Handles datasets up to memory limits well, larger via arrow or data.table | Similar, with strong big-data connectors | Strong at large, on-disk data | Breaks down beyond a few hundred thousand rows |
| Reproducibility | Very strong via R Markdown and Quarto | Strong via Jupyter and scripts | Moderate | Weak |
| Community packages | CRAN (around 20,000 packages) | PyPI (general-purpose) | Vendor ecosystem | Add-ins |
In professional practice, data teams routinely use R for statistical modelling, visualisation, and reporting, while using Python for production machine learning pipelines, web applications, and deep learning. The reticulate package even lets you call Python from R and pass objects between them. Treat the R versus Python question as a false dichotomy; the practitioner’s question is always “which language is the right tool for this specific task?”
New learners often assume R is a niche tool for academic statisticians. In reality, R is used in banks for credit risk modelling, in pharmaceutical companies for clinical trial analysis, in newsrooms for data journalism, and in tech firms for A/B testing and experimentation. If your work involves data, the odds are good that someone in your field is already using R productively.
## 1.3 The R Ecosystem
The R language itself is intentionally small. Most of R’s power comes from packages, which are collections of R functions, data, and documentation that extend the base language for specific tasks. You install a package once, load it into your current session with library(), and then use its functions as if they were part of R itself.
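The pattern looks like this. The install line is commented out so the snippet can be re-run without contacting CRAN, and `parallel` is used for the `library()` call because it ships with every R installation (the package names here are only examples):

```r
# Install once per machine (downloads the package from CRAN):
# install.packages("janitor")

# Load a package into the current session with library(); after this,
# its functions are available without any prefix
library(parallel)
detectCores()   # a function provided by the parallel package
```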
| Repository | Purpose | Typical Users |
|---|---|---|
| CRAN | The Comprehensive R Archive Network, the official and largest repository with around 20,000 packages covering statistics, data handling, and visualisation. | Most R users, most of the time. |
| Bioconductor | A curated repository of about 2,000 packages for genomics, proteomics, and computational biology. | Researchers in life sciences and bioinformatics. |
| GitHub | Unofficial but widely used for in-development and cutting-edge packages, installed via remotes or devtools. | Package developers and users who need features not yet on CRAN. |
Of the thousands of packages available, one collection deserves special mention at the start of your R journey. The tidyverse is a coordinated set of packages, including dplyr, ggplot2, tidyr, and readr, that share a consistent philosophy and grammar for data work. Learning tidyverse syntax early will pay dividends in every subsequent chapter of this book.
Because installing packages is so easy, it is tempting to reach for a new one the moment a problem feels slightly novel. In a production or reproducible-research setting, every package you add is a future maintenance cost: versions change, dependencies conflict, and authors abandon projects. The professional discipline is to prefer base R and well-established packages, and to pin package versions using tools such as renv when a project needs to remain reproducible over time.
## 1.4 What You Can Do with R
In practice, most R work falls into one of five overlapping categories. Every chapter that follows in this book builds skills that feed into one or more of them.
| Workload | Typical Packages | Example Task |
|---|---|---|
| Data import and cleaning | readr, readxl, janitor, tidyr | Load a messy Excel file from a client and produce a clean, analysis-ready data frame. |
| Statistical modelling | stats (base), lme4, survival, rstanarm | Fit a logistic regression to predict customer churn and interpret the coefficients. |
| Data visualisation | ggplot2, plotly, leaflet | Build a multi-panel chart comparing sales across regions and quarters. |
| Reproducible reporting | rmarkdown, quarto, knitr | Produce a monthly PDF report that re-runs automatically when new data arrives. |
| Interactive applications | shiny, flexdashboard | Build a web dashboard that lets non-technical colleagues explore results interactively. |
You have not learned any R syntax yet, so the code below is meant to be read rather than typed. Notice how few lines are needed to load data, summarise it, and draw a chart. This conciseness is what vectorised, statistics-oriented language design delivers.
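A sketch of such an analysis, using the built-in `mtcars` dataset (the grouping variable and the summary computed here are illustrative choices):

```r
library(dplyr)     # grouped data manipulation
library(ggplot2)   # grammar-of-graphics plotting

mtcars |>                                       # a built-in data frame
  group_by(cyl) |>                              # group cars by cylinder count
  summarise(mean_mpg = mean(mpg)) |>            # one summary row per group
  ggplot(aes(x = factor(cyl), y = mean_mpg)) +  # map groups to the x axis
  geom_col() +                                  # draw one bar per group
  labs(x = "Cylinders", y = "Mean miles per gallon")
```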
The code above uses vectors, data frames, the pipe operator, grouped summaries, and ggplot2, which are concepts spread across the next several chapters. Do not worry about understanding every line now. Return to this example after finishing Module 3 and you will find it reads almost like English.
## 1.5 How This Book Is Organised
The book is structured in four modules that map directly to the course learning outcomes. Depending on your background, you may benefit from reading them in slightly different orders.
| Background | Suggested Path |
|---|---|
| Complete beginner | Read linearly, chapters 1 to 26. |
| Familiar with another language | Skim Module 1, focus on Modules 2 and 3 for R-specific idioms, then work carefully through Module 4. |
| Coming from Excel or SPSS | Read Module 1 carefully, spend extra time on Module 2 (data structures) and Module 3 (data frames), which are the biggest mental shift. |
| Coming from Python or pandas | Focus on Module 2 (vectors behave differently from NumPy arrays) and Module 3 (dplyr idioms differ from pandas). |
```mermaid
flowchart TD
M1["Module 1 <br> Introduction to R Programming"] --> M2["Module 2 <br> Data Structures and Operations"]
M2 --> M3["Module 3 <br> Descriptive Analytics with R"]
M3 --> M4["Module 4 <br> Conditional Statements, Loops, and Functions"]
style M1 fill:#e3f2fd,stroke:#1976D2
style M2 fill:#fff3e0,stroke:#F57C00
style M3 fill:#f3e5f5,stroke:#8E24AA
style M4 fill:#e8f5e9,stroke:#388E3C
```
## 1.6 Summary
| Concept | Key Takeaway |
|---|---|
| What R is | A free, open-source language and interactive environment for statistical computing, graphics, and data analysis. |
| Where R came from | Created by Ihaka and Gentleman in the 1990s as an open-source implementation of the S language from Bell Labs. |
| Why R is chosen | Vectorised syntax, rich statistical library, world-class graphics, reproducible reporting, and an active community. |
| The R ecosystem | A small base language extended by around 20,000 CRAN packages, with Bioconductor and GitHub adding specialised and bleeding-edge work. |
| The tidyverse | A coordinated set of packages with a consistent grammar for modern data analysis, worth learning early. |
| What R is used for | Import and cleaning, statistical modelling, visualisation, reproducible reporting, and interactive applications. |
When you sit down to a new problem in R, the most productive mindset is to ask three questions before writing any code.

1. **What is the smallest dataset that captures the question?** R rewards you for working iteratively with small samples.
2. **Does a well-established package already solve this?** Reaching for a mature package is almost always faster and safer than writing the routine yourself.
3. **How will you make the work reproducible?** The combination of scripts, R Markdown or Quarto documents, and version control is what separates a one-off analysis from durable professional output.

Keep these three questions in mind as you move into the next chapter, where we examine the specific purpose and advantages that make R a compelling choice for your own work.