-
Notifications
You must be signed in to change notification settings - Fork 682
/
introduction.qmd
232 lines (175 loc) · 15.7 KB
/
introduction.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
\mainmatter
# Introduction {#sec-introduction}
```{r}
#| echo: false
#| message: false
#| results: asis
source("common.R")
status("drafting")
```
## Welcome to ggplot2
ggplot2 is an R package for producing statistical, or data, graphics.
Unlike most other graphics packages, ggplot2 has an underlying grammar, based on the Grammar of Graphics [@wilkinson:2006], that allows you to compose graphs by combining independent components.
This makes ggplot2 powerful.
Rather than being limited to sets of pre-defined graphics, you can create novel graphics that are tailored to your specific problem.
While the idea of having to learn a grammar may sound overwhelming, ggplot2 is actually easy to learn: there is a simple set of core principles and there are very few special cases.
The hard part is that it may take a little time to forget all the preconceptions that you bring over from using other graphics tools.
ggplot2 provides beautiful, hassle-free plots that take care of fiddly details like drawing legends.
In fact, its carefully chosen defaults mean that you can produce publication-quality graphics in seconds.
However, if you do have special formatting requirements, ggplot2's comprehensive theming system makes it easy to do what you want.
Ultimately, this means that rather than spending your time making your graph look pretty, you can instead focus on creating the graph that best reveals the message in your data.
ggplot2 is designed to work iteratively.
You start with a layer that shows the raw data.
Then you add layers of annotations and statistical summaries.
This allows you to produce graphics using the same structured thinking that you would use to design an analysis.
This reduces the distance between the plot in your head and the one on the page.
This is especially helpful for students who have not yet developed the structured approach to analysis used by experts.
Learning the grammar will not only help you create graphics that you're familiar with, but will also help you to create newer, better graphics.
Without a grammar, there is no underlying theory, so most graphics packages are just a big collection of special cases.
For example, in base R, if you design a new graphic, it's composed of raw plot elements like lines and points so it's hard to design new components that combine with existing plots.
In ggplot2, the expressions used to create a new graphic are composed of higher-level elements, like representations of the raw data and statistical transformations, that can easily be combined with new datasets and other plots.
This book provides a hands-on introduction to ggplot2 with lots of example code and graphics.
It also explains the grammar on which ggplot2 is based.
Like other formal systems, ggplot2 is useful even when you don't understand the underlying model.
However, the more you learn about it, the more effectively you'll be able to use ggplot2.
This book will introduce you to ggplot2 assuming that you're a novice, unfamiliar with the grammar; teach you the basics so that you can re-create plots you are already familiar with; show you how to use the grammar to create new types of graphics; and eventually turn you into an expert who can build new components to extend the grammar.
## What is the grammar of graphics?
@wilkinson:2006 created the grammar of graphics to describe the fundamental features that underlie all statistical graphics.
The grammar of graphics is an answer to the question of what is a statistical graphic?
ggplot2 [@wickham:2007d] builds on Wilkinson's grammar by focussing on the primacy of layers and adapting it for use in R.
In brief, the grammar tells us that a graphic maps the data to the aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars).
The plot may also include statistical transformations of the data and information about the plot's coordinate system.
Facetting can be used to plot for different subsets of the data.
The combination of these independent components are what make up a graphic.
As the book progresses, the formal grammar will be explained in greater detail.
The first description of the components follows below.
It introduces some of the terminology that will be used throughout the book and outlines the basic function of each component.
Don't worry if it doesn't make sense right away: you'll have many more opportunities to learn about the components and how they work together.
All plots are composed of the **data**, the information you want to visualise, and a **mapping**, the description of how the data's variables are mapped to aesthetic attributes.
There are five mapping components:
- A **layer** is a collection of geometric elements and statistical transformations.
Geometric elements, **geom**s for short, represent what you actually see in the plot: points, lines, polygons, etc.
Statistical transformations, **stat**s for short, summarise the data: for example, binning and counting observations to create a histogram, or fitting a linear model.
- **Scale**s map values in the data space to values in the aesthetic space.
This includes the use of colour, shape or size.
Scales also draw the legend and axes, which make it possible to read the original data values from the plot (an inverse mapping).
- A **coord**, or coordinate system, describes how data coordinates are mapped to the plane of the graphic.
It also provides axes and gridlines to help read the graph.
We normally use the Cartesian coordinate system, but a number of others are available, including polar coordinates and map projections.
- A **facet** specifies how to break up and display subsets of data as small multiples.
This is also known as conditioning or latticing/trellising.
- A **theme** controls the finer points of display, like the font size and background colour.
While the defaults in ggplot2 have been chosen with care, you may need to consult other references to create an attractive plot.
A good starting place is Tufte's early works [@tufte:1990; @tufte:1997; @tufte:2001].
It's also important to note what the grammar doesn't do:
- It doesn't suggest which graphics to use.
While this book endeavours to promote a sensible process for producing plots, the focus is on how to produce the plots you want, not on which plot to produce.
For more advice on choosing or creating plots to answer the question you're interested in, you may want to consult @robbins:2004, @cleveland:1993, @chambers:1983, and @tukey:1977.
- It doesn't describe interactive graphics, only static ones.
There is essentially no difference between displaying ggplot2 graphs on a computer screen and printing them on a piece of paper.
## How does ggplot2 fit in with other R graphics?
There are a number of other graphics systems available in R: base graphics, grid graphics and trellis/lattice graphics.
How does ggplot2 differ from them?
- Base graphics were written by Ross Ihaka based on experience implementing the S graphics driver and partly looking at @chambers:1983.
Base graphics has a pen on paper model: you can only draw on top of the plot, you cannot modify or delete existing content.
There is no (user accessible) representation of the graphics, apart from their appearance on the screen.
Base graphics includes both tools for drawing primitives and entire plots.
Base graphics functions are generally fast, but have limited scope.
If you've created a single scatterplot, or histogram, or a set of boxplots in the past, you've probably used base graphics.
\index{Base graphics}
- The development of "grid" graphics, a much richer system of graphical primitives, started in 2000.
Grid is developed by Paul Murrell, growing out of his PhD work [@murrell:1998].
Grid grobs (graphical objects) can be represented independently of the plot and modified later.
A system of viewports (each containing its own coordinate system) makes it easier to lay out complex graphics.
Grid provides drawing primitives, but no tools for producing statistical graphics.
\index{grid}
- The lattice package, developed by Deepayan Sarkar, uses grid graphics to implement the trellis graphics system of @cleveland:1993 and is a considerable improvement over base graphics.
You can easily produce conditioned plots and some plotting details (e.g., legends) are taken care of automatically.
However, lattice graphics lacks a formal model, which can make it hard to extend.
Lattice graphics are explained in depth in @sarkar:2008.
\index{lattice}
- ggplot2, started in 2005, is an attempt to take the good things about base and lattice graphics and improve on them with a strong underlying model which supports the production of any kind of statistical graphic, based on the principles outlined above.
The solid underlying model of ggplot2 makes it easy to describe a wide range of graphics with a compact syntax, and independent components make extension easy.
Like lattice, ggplot2 uses grid to draw the graphics, which means you can exercise much low-level control over the appearance of the plot.
- htmlwidgets, <http://www.htmlwidgets.org>, provides a common framework for accessing web visualisation tools from R.
Packages built on top of htmlwidgets include leaflet (<https://rstudio.github.io/leaflet/>, maps), dygraph (<http://rstudio.github.io/dygraphs/>, time series) and networkD3 (<http://christophergandrud.github.io/networkD3/>, networks).
\index{htmlwidgets}
- plotly, <https://plotly-r.com>, is a popular javascript visualisation toolkit with an R interface.
It's a great tool if you want to make interactive graphics for HTML documents, and even comes with a `ggplotly()` function that can convert many ggplot2 graphics into their interactive equivalents.
Many other R packages, such as vcd [@meyer:2006], plotrix [@plotrix] and gplots [@gplots], implement specialist graphics, but no others provide a framework for producing statistical graphics.
A comprehensive list of all graphical tools available in other packages can be found in the graphics task view at <http://cran.r-project.org/web/views/Graphics.html>.
## About this book
The first chapter, @sec-getting-started, describes how to quickly get started using ggplot2 to make useful graphics.
This chapter introduces several important ggplot2 concepts: geoms, aesthetic mappings and facetting.
@sec-individual-geoms to @sec-arranging-plots explore how to use the basic toolbox to solve a wide range of visualisation problems that you're likely to encounter in practice.
Then @sec-scale-position to @sec-scale-other show you how to control the most important scales, allowing you to tweak the details of axes and legends.
In "The Grammar" we describe the layered grammar of graphics which underlies ggplot2.
The theory is illustrated in @sec-layers which demonstrates how to add additional layers to your plot, exercising full control over the geoms and stats used within them.
Understanding how scales work is crucial for fine-tuning the perceptual properties of your plot.
Customising scales gives fine control over the exact appearance of the plot and helps to support the story that you are telling.
@sec-scale-position, @sec-scale-colour and @sec-scale-other will show you what scales are available, how to adjust their parameters, and how to control the appearance of axes and legends.
Coordinate systems and facetting control the position of elements of the plot.
These are described in @sec-position.
Faceting is a very powerful graphical tool as it allows you to rapidly compare different subsets of your data.
Different coordinate systems are less commonly needed, but are very important for certain types of data.
To polish your plots for publication, you will need to learn about the tools described in @sec-polishing.
There you will learn about how to control the theming system of ggplot2 and how to save plots to disk.
## Prerequisites {#sec-prerequisites}
\index{Installation}
Before we continue, make sure you have all the software you need for this book:
- **R**: If you don't have R installed already, you may be reading the wrong book; we assume a basic familiarity with R throughout this book.
If you'd like to learn how to use R, we'd recommend [*R for Data Science*](https://r4ds.had.co.nz/) which is designed to get you up and running with R with a minimum of fuss.
- **RStudio**: RStudio is a free and open source integrated development environment (IDE) for R.
While you can write and use R code with any R environment (including R GUI and [ESS](http://ess.r-project.org)), RStudio has some nice features specifically for authoring and debugging your code.
We recommend giving it a try, but it's not required to be successful with ggplot2 or with this book.
You can download RStudio Desktop from <https://posit.co/download/rstudio-desktop>
- **R packages**: This book uses a bunch of R packages.
You can install them all at once by running:
```{r}
#| echo: false
#| cache: false
deps <- desc::desc_get_deps()
pkgs <- sort(deps$package[deps$type == "Imports"])
pkgs2 <- strwrap(paste(encodeString(pkgs, quote = '"'), collapse = ", "), exdent = 2)
install <- paste0(
"install.packages(c(\n ",
paste(pkgs2, "\n", collapse = ""),
"))"
)
```
```{r}
#| code: !expr install
#| eval: false
```
## Other resources {#sec-other-resources}
This book teaches you the elements of ggplot2's grammar and how they fit together, but it does not document every function in complete detail.
You will need additional documentation as your use of ggplot2 becomes more complex and varied.
The best resource for specific details of ggplot2 functions and their arguments will always be the built-in documentation.
This is accessible online, <https://ggplot2.tidyverse.org/reference/index.html>, and from within R using the usual help syntax.
The advantage of the online documentation is that you can see all the example plots and navigate between topics more easily.
If you use ggplot2 regularly, it's a good idea to sign up for the ggplot2 mailing list, <http://groups.google.com/group/ggplot2>.
The list has relatively low traffic and is very friendly to new users.
Another useful resource is stackoverflow, <https://stackoverflow.com>.
There is an active ggplot2 community on stackoverflow, and many common questions have already been asked and answered.
In either place, you're much more likely to get help if you create a minimal reproducible example.
The [reprex](https://github.com/jennybc/reprex) package by Jenny Bryan provides a convenient way to do this, and also include advice on creating a good example.
The more information you provide, the easier it is for the community to help you.
The number of functions in ggplot2 can be overwhelming, but RStudio provides some great cheatsheets to jog your memory at <https://posit.co/resources/cheatsheets/>.
Finally, the complete source code for the book is available online at <https://github.com/hadley/ggplot2-book>.
This contains the complete text for the book, as well as all the code and data needed to recreate all the plots.
## Colophon
This book was written in [RStudio](https://posit.co/products/open-source/rstudio/) using [bookdown](http://bookdown.org/).
The [website](http://ggplot2-book.org/) is hosted with [netlify](http://netlify.com/), and automatically updated after every commit by [Github Actions](https://github.com/features/actions).
The complete source is available from [GitHub](https://github.com/hadley/ggplot2-book).
This version of the book was built with `r R.version.string` and the following packages:
```{r}
#| echo: false
#| results: asis
pkgs <- sessioninfo::package_info(pkgs, dependencies = FALSE)
df <- tibble(
package = pkgs$package,
version = pkgs$ondiskversion,
source = gsub("@", "\\\\@", pkgs$source)
)
knitr::kable(df, format = "markdown")
```