forked from hadley/ggplot2-book
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathprogramming.rmd
413 lines (312 loc) · 17.2 KB
/
programming.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
---
bibliography: references.bib
---
```{r include = FALSE}
chapter <- "programming"
source("common.R")
```
# Programming with ggplot2 {#cha:programming}
## Introduction
A major requirement of a good data analysis is flexibility. If your data changes, or you discover something that makes you rethink your basic assumptions, you need to be able to easily change many plots at once. The main inhibitor of flexibility is code duplication. If you have the same plotting statement repeated over and over again, you'll have to make the same change in many different places. Often just the thought of making all those changes is exhausting! This chapter will help you overcome that problem by showing you how to program with ggplot2. \index{Programming}
To make your code more flexible, you need to reduce duplicated code by writing functions. When you notice you're doing the same thing over and over again, think about how you might generalise it and turn it into a function. If you're not that familiar with how functions work in R, you might want to brush up your knowledge at <http://adv-r.had.co.nz/Functions.html>.
In this chapter I'll show how to write functions that create:
* A single ggplot2 component.
* Multiple ggplot2 components.
* A complete plot.
And then I'll finish off with a brief illustration of how you can apply functional programming techniques to ggplot2 objects.
You might also find the [cowplot](https://github.com/wilkelab/cowplot) and [ggthemes](https://github.com/jrnold/ggthemes) packages helpful. As well as providing reusable components that help you directly, you can also read the source code of the packages to figure out how they work.
## Single components
Each component of a ggplot plot is an object. Most of the time you create the component and immediately add it to a plot, but you don't have to. Instead, you can save any component to a variable (giving it a name), and then add it to multiple plots:
`r columns(2, 1, 0.75)`
```{r layer9}
bestfit <- geom_smooth(
method = "lm",
se = FALSE,
colour = alpha("steelblue", 0.5),
size = 2
)
ggplot(mpg, aes(cty, hwy)) +
geom_point() +
bestfit
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
bestfit
```
That's a great way to reduce simple types of duplication (it's much better than copying-and-pasting!), but requires that the component be exactly the same each time. If you need more flexibility, you can wrap these reusable snippets in a function. For example, we could extend our `bestfit` object to a more general function for adding lines of best fit to a plot. The following code creates a `geom_lm()` with three parameters: the model `formula`, the line `colour` and the line `size`:
```{r geom-lm}
geom_lm <- function(formula = y ~ x, colour = alpha("steelblue", 0.5),
size = 2, ...) {
geom_smooth(formula = formula, se = FALSE, method = "lm", colour = colour,
size = size, ...)
}
ggplot(mpg, aes(displ, 1 / hwy)) +
geom_point() +
geom_lm()
ggplot(mpg, aes(displ, 1 / hwy)) +
geom_point() +
geom_lm(y ~ poly(x, 2), size = 1, colour = "red")
```
Pay close attention to the use of "`...`". When included in the function definition "`...`" allows a function to accept arbitrary additional arguments. Inside the function, you can then use "`...`" to pass those arguments on to another function. Here we pass "`...`" onto `geom_smooth()` so the user can still modify all the other arguments we haven't explicitly overridden. When you write your own component functions, it's a good idea to always use "`...`" in this way. \indexc{...}
Finally, note that you can only _add_ components to a plot; you can't modify or remove existing objects.
### Exercises
1. Create an object that represents a pink histogram with 100 bins.
1. Create an object that represents a fill scale with the Blues
ColorBrewer palette.
1. Read the source code for `theme_grey()`. What are its arguments?
How does it work?
1. Create `scale_colour_wesanderson()`. It should have a parameter to pick
the palette from the wesanderson package, and create either a continuous
or discrete scale.
## Multiple components
It's not always possible to achieve your goals with a single component. Fortunately, ggplot2 has a convenient way of adding multiple components to a plot in one step with a list. The following function adds two layers: one to show the mean, and one to show its standard error:
`r columns(2, 2/3, 0.75)`
```{r geom-mean-1, warning = FALSE}
geom_mean <- function() {
list(
stat_summary(fun.y = "mean", geom = "bar", fill = "grey70"),
stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = 0.4)
)
}
ggplot(mpg, aes(class, cty)) + geom_mean()
ggplot(mpg, aes(drv, cty)) + geom_mean()
```
If the list contains any `NULL` elements, they're ignored. This makes it easy to conditionally add components:
```{r geom-mean-2, warning = FALSE}
geom_mean <- function(se = TRUE) {
list(
stat_summary(fun.y = "mean", geom = "bar", fill = "grey70"),
if (se)
stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = 0.4)
)
}
ggplot(mpg, aes(drv, cty)) + geom_mean()
ggplot(mpg, aes(drv, cty)) + geom_mean(se = FALSE)
```
### Plot components
You're not just limited to adding layers in this way. You can also include any of the following object types in the list:
* A data.frame, which will override the default dataset associated with the
plot. (If you add a data frame by itself, you'll need to use `%+%`,
but this is not necessary if the data frame is in a list.)
* An `aes()` object, which will be combined with the existing default
aesthetic mapping.
* Scales, which override existing scales, with a warning if they've already
been set by the user.
* Coordinate systems and facetting specification, which override the existing
settings.
* Theme components, which override the specified components.
### Annotation
It's often useful to add standard annotations to a plot. In this case, your function will also set the data in the layer function, rather than inheriting it from the plot. There are two other options that you should set when you do this. These ensure that the layer is self-contained: \index{Annotation!functions}
* `inherit.aes = FALSE` prevents the layer from inheriting aesthetics from the
parent plot. This ensures your annotation works regardless of what else is on
the plot. \indexc{inherit.aes}
* `show.legend = FALSE` ensures that your annotation won't appear in the
legend. \indexc{show.legend}
One example of this technique is the `borders()` function built into ggplot2. It's designed to add map borders from one of the datasets in the maps package: \indexf{borders}
```{r}
borders <- function(database = "world", regions = ".", fill = NA,
colour = "grey50", ...) {
df <- map_data(database, regions)
geom_polygon(
aes_(~lat, ~long, group = ~group),
data = df, fill = fill, colour = colour, ...,
inherit.aes = FALSE, show.legend = FALSE
)
}
```
### Additional arguments
If you want to pass additional arguments to the components in your function, `...` is no good: there's no way to direct different arguments to different components. Instead, you'll need to think about how you want your function to work, balancing the benefits of having one function that does it all vs. the cost of having a complex function that's harder to understand. \indexc{...}
To get you started, here's one approach using `modifyList()` and `do.call()`: \indexf{modifyList} \indexf{do.call}
```{r}
geom_mean <- function(..., bar.params = list(), errorbar.params = list()) {
params <- list(...)
bar.params <- modifyList(params, bar.params)
errorbar.params <- modifyList(params, errorbar.params)
bar <- do.call("stat_summary", modifyList(
list(fun.y = "mean", geom = "bar", fill = "grey70"),
bar.params)
)
errorbar <- do.call("stat_summary", modifyList(
list(fun.data = "mean_cl_normal", geom = "errorbar", width = 0.4),
errorbar.params)
)
list(bar, errorbar)
}
ggplot(mpg, aes(class, cty)) +
geom_mean(
colour = "steelblue",
errorbar.params = list(width = 0.5, size = 1)
)
ggplot(mpg, aes(class, cty)) +
geom_mean(
bar.params = list(fill = "steelblue"),
errorbar.params = list(colour = "blue")
)
```
If you need more complex behaviour, it might be easier to create a custom geom or stat. You can learn about that in the extending ggplot2 vignette included with the package. Read it by running `vignette("extending-ggplot2")`.
### Exercises
1. To make the best use of space, many examples in this book hide the axes
labels and legend. I've just copied-and-pasted the same code into multiple
places, but it would make more sense to create a reusable function.
What would that function look like?
1. Extend the `borders()` function to also add `coord_quickmap()` to the
plot.
1. Look through your own code. What combinations of geoms or scales do you
use all the time? How could you extract the pattern into a reusable
function?
## Plot functions {#sec:functions}
Creating small reusable components is most in line with the ggplot2 spirit: you can recombine them flexibly to create whatever plot you want. But sometimes you're creating the same plot over and over again, and you don't need that flexibility. Instead of creating components, you might want to write a function that takes data and parameters and returns a complete plot. \index{Plot functions}
For example, you could wrap up the complete code needed to make a piechart:
`r columns(1, 2 / 3, 0.5)`
```{r}
piechart <- function(data, mapping) {
ggplot(data, mapping) +
geom_bar(width = 1) +
coord_polar(theta = "y") +
xlab(NULL) +
ylab(NULL)
}
piechart(mpg, aes(factor(1), fill = class))
```
This is much less flexible than the component based approach, but equally, it's much more concise. Note that I was careful to return the plot object, rather than printing it. That makes it possible add on other ggplot2 components.
You can take a similar approach to drawing parallel coordinates plots (PCPs). PCPs require a transformation of the data, so I recommend writing two functions: one that does the transformation and one that generates the plot. Keeping these two pieces separate makes life much easier if you later want to reuse the same transformation for a different visualisation. \index{Parallel coordinate plots}
`r columns(2, 2/3)`
```{r pcp_data}
pcp_data <- function(df) {
is_numeric <- vapply(df, is.numeric, logical(1))
# Rescale numeric columns
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
df[is_numeric] <- lapply(df[is_numeric], rescale01)
# Add row identifier
df$.row <- rownames(df)
# Treat numerics as value (aka measure) variables
# gather_ is the standard-evaluation version of gather, and
# is usually easier to program with.
tidyr::gather_(df, "variable", "value", names(df)[is_numeric])
}
pcp <- function(df, ...) {
df <- pcp_data(df)
ggplot(df, aes(variable, value, group = .row)) + geom_line(...)
}
pcp(mpg)
pcp(mpg, aes(colour = drv))
```
A complete exploration of this idea is `qplot()`, which provides a fairly deep wrapper around the most common `ggplot()` options. I recommend studying the source code if you want to see how far these basic techniques can take you. \indexf{qplot}
### Indirectly referring to variables
The `piechart()` function above is a little unappealing because it requires the user to know the exact `aes()` specification that generates a pie chart. It would be more convenient if the user could simply specify the name of the variable to plot. To do that you'll need to learn a bit more about how `aes()` works.
`aes()` uses non-standard evaluation: rather than looking at the values of its arguments, it looks at their expressions. This makes it difficult to work with programmatically as there's no way to store the name of a variable in an object and then refer to it later:
```{r}
x_var <- "displ"
aes(x_var)
```
Instead we need to use `aes_()`, which uses regular evaluation. There are two basic ways to create a mapping with `aes_()`: \indexf{aes\_}
* Using a _quoted call_, created by `quote()`, `substitute()`, `as.name()`,
or `parse()`. \indexf{quote} \indexf{substitute} \indexf{parse} \indexf{as.name}
```{r}
aes_(quote(displ))
aes_(as.name(x_var))
aes_(parse(text = x_var)[[1]])
f <- function(x_var) {
aes_(substitute(x_var))
}
f(displ)
```
The difference between `as.name()` and `parse()` is subtle. If `x_var`
is "a + b", `as.name()` will turn it into a variable called `` `a + b` ``,
`parse()` will turn it into the function call `a + b`. (If this is
confusing, <http://adv-r.had.co.nz/Expressions.html> might help).
* Using a formula, created with `~`. \indexc{\textasciitilde}
```{r}
aes_(~displ)
```
`aes_()` gives us three options for how a user can supply variables: as a string, as a formula, or as a bare expression. These three options are illustrated below
`r columns(3, 1)`
```{r}
piechart1 <- function(data, var, ...) {
piechart(data, aes_(~factor(1), fill = as.name(var)))
}
piechart1(mpg, "class") + theme(legend.position = "none")
piechart2 <- function(data, var, ...) {
piechart(data, aes_(~factor(1), fill = var))
}
piechart2(mpg, ~class) + theme(legend.position = "none")
piechart3 <- function(data, var, ...) {
piechart(data, aes_(~factor(1), fill = substitute(var)))
}
piechart3(mpg, class) + theme(legend.position = "none")
```
There's another advantage to `aes_()` over `aes()` if you're writing ggplot2 plots inside a package: using `aes_(~x, ~y)` instead of `aes(x, y)` avoids the global variables NOTE in `R CMD check`. \index{Global variables}
### The plot environment
As you create more sophisticated plotting functions, you'll need to understand a bit more about ggplot2's scoping rules. ggplot2 was written well before I understood the full intricacies of non-standard evaluation, so it has a rather simple scoping system. If a variable is not found in the `data`, it is looked for in _the_ plot environment. There is only one environment for a plot (not one for each layer), and it is the environment in which `ggplot()` is called from (i.e. the `parent.frame()`). \index{Environments} \indexf{parent.frame}
This means that the following function won't work because `n` is not stored in an environment accessible when the expressions in `aes()` are evaluated.
```{r, error = TRUE, fig.show = "hide"}
f <- function() {
n <- 10
geom_line(aes(x / n))
}
df <- data.frame(x = 1:3, y = 1:3)
ggplot(df, aes(x, y)) + f()
```
Note that this is only a problem with the `mapping` argument. All other arguments are evaluated immediately so their values (not a reference to a name) are stored in the plot object. This means the following function will work:
```{r, fig.show = "hide"}
f <- function() {
colour <- "blue"
geom_line(colour = colour)
}
ggplot(df, aes(x, y)) + f()
```
If you need to use a different environment for the plot, you can specify it with the `environment` argument to `ggplot()`. You'll need to do this if you're creating a plot function that takes user provided data. See `qplot()` for an example.
### Exercises
1. Create a `distribution()` function specially designed for visualising
continuous distributions. Allow the user to supply a dataset and the
name of a variable to visualise. Let them choose between histograms,
frequency polygons, and density plots. What other arguments might you
want to include?
1. What additional arguments should `pcp()` take? What are the downsides
of how `...` is used in the current code?
1. Advanced: why doesn't this code work? How can you fix it?
```{r, error = TRUE, fig.show = "hide"}
f <- function() {
levs <- c("2seater", "compact", "midsize", "minivan", "pickup",
"subcompact", "suv")
piechart3(mpg, factor(class, levels = levs))
}
f()
```
## Functional programming
Since ggplot2 objects are just regular R objects, you can put them in a list. This means you can apply all of R's great functional programming tools. For example, if you wanted to add different geoms to the same base plot, you could put them in a list and use `lapply()`. \index{Functional programming} \indexf{lapply}
`r columns(3, 1, 1)`
```{r}
geoms <- list(
geom_point(),
geom_boxplot(aes(group = cut_width(displ, 1))),
list(geom_point(), geom_smooth())
)
p <- ggplot(mpg, aes(displ, hwy))
lapply(geoms, function(g) p + g)
```
If you're not familiar with functional programming, read through <http://adv-r.had.co.nz/Functional-programming.html> and think about how you might apply the techniques to your duplicated plotting code.
### Exercises
1. How could you add a `geom_point()` layer to each element of the following
list?
```{r, eval = FALSE}
plots <- list(
ggplot(mpg, aes(displ, hwy)),
ggplot(diamonds, aes(carat, price)),
ggplot(faithfuld, aes(waiting, eruptions, size = density))
)
```
1. What does the following function do? What's a better name for it?
```{r, eval = FALSE}
mystery <- function(...) {
Reduce(`+`, list(...), accumulate = TRUE)
}
mystery(
ggplot(mpg, aes(displ, hwy)) + geom_point(),
geom_smooth(),
xlab(NULL),
ylab(NULL)
)
```