# Collective geoms {#sec-collective-geoms}
```{r}
#| echo: false
#| message: false
#| results: asis
source("common.R")
status("drafting")
```
Geoms can be roughly divided into individual and collective geoms.
An **individual** geom draws a distinct graphical object for each observation (row).
For example, the point geom draws one point per row.
A **collective** geom displays multiple observations with one geometric object.
This may be a result of a statistical summary, like a boxplot, or may be fundamental to the display of the geom, like a polygon.
Lines and paths fall somewhere in between: each line is composed of a set of straight segments, but each segment represents two points.
How do we control the assignment of observations to graphical elements?
This is the job of the `group` aesthetic.
\index{Grouping} \indexc{group} \index{Geoms!collective}
By default, the `group` aesthetic is mapped to the interaction of all discrete variables in the plot.
This often partitions the data correctly, but when it does not, or when no discrete variable is used in a plot, you'll need to explicitly define the grouping structure by mapping `group` to a variable that has a different value for each group.
There are three common cases where the default is not enough, and we will consider each one below.
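
To see how ggplot2 has grouped a layer, you can inspect the `group` column that `layer_data()` returns for a built plot.
The following is a minimal sketch, not one of the three cases: it maps two discrete variables from `mpg` (`drv` and `fl`, chosen here purely for illustration) and compares the number of groups with the number of `drv`/`fl` combinations that occur in the data.

```{r}
#| label: group-default-check
library(ggplot2)

# Two discrete variables (drv, fl) are mapped, so the default group should be
# their interaction.
p <- ggplot(mpg, aes(displ, hwy, colour = drv, shape = fl)) +
  geom_point()

# layer_data() returns the data computed for a layer, including `group`
n_groups <- length(unique(layer_data(p)$group))

# Number of drv/fl combinations that actually occur in the data
n_combos <- nrow(unique(mpg[c("drv", "fl")]))

c(groups = n_groups, combinations = n_combos)
```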
In the following examples, we will use a simple longitudinal dataset, `Oxboys`, from the nlme package.
It records the heights (`height`) and centered ages (`age`) of 26 boys (`Subject`), measured on nine occasions (`Occasion`).
`Subject` and `Occasion` are stored as ordered factors.
\index{nlme} \index{Data!Oxboys@\texttt{Oxboys}}
```{r}
#| label: oxboys
data(Oxboys, package = "nlme")
head(Oxboys)
```
## Multiple groups, one aesthetic
In many situations, you want to separate your data into groups, but render them in the same way.
In other words, you want to be able to distinguish individual subjects, but not identify them.
This is common in longitudinal studies with many subjects, where the plots are often descriptively called spaghetti plots.
For example, the following plot shows the growth trajectory for each boy (each `Subject`): \index{Data!longitudinal} \indexf{geom\_line}
```{r}
#| label: oxboys-line
ggplot(Oxboys, aes(age, height, group = Subject)) +
  geom_point() +
  geom_line()
```
If you omit the grouping variable or specify it incorrectly, you'll get a characteristic sawtooth appearance:
```{r}
#| label: oxboys-line-bad
ggplot(Oxboys, aes(age, height)) +
  geom_point() +
  geom_line()
```
If a group isn't defined by a single variable, but instead by a combination of multiple variables, use `interaction()` to combine them, e.g. `aes(group = interaction(school_id, student_id))`.
\indexf{interaction}
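
As a small sketch of this, assume a hypothetical data frame `scores` in which `student_id` is only unique within `school_id`.
The data, column names, and values are invented for illustration.
Grouping by `student_id` alone would join students from different schools into one line, whereas `interaction()` gives one line per student:

```{r}
#| label: interaction-group
library(ggplot2)

# Hypothetical data: student ids 1 and 2 are reused in both schools
scores <- data.frame(
  school_id  = rep(c("A", "B"), each = 6),
  student_id = rep(rep(1:2, each = 3), times = 2),
  term       = rep(1:3, times = 4),
  score      = c(60, 65, 70, 80, 78, 82, 55, 60, 58, 90, 92, 95)
)

by_student <- ggplot(scores, aes(term, score, group = student_id)) +
  geom_line()
by_both <- ggplot(
  scores,
  aes(term, score, group = interaction(school_id, student_id))
) +
  geom_line()

# Count the lines each plot would draw
n_wrong <- length(unique(layer_data(by_student)$group))
n_lines <- length(unique(layer_data(by_both)$group))
c(student_id_only = n_wrong, interaction = n_lines)
```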

## Different groups on different layers
Sometimes we want to plot summaries that use different levels of aggregation: one layer might display individuals, while another displays an overall summary.
Building on the previous example, suppose we want to add a single smooth line, showing the overall trend for *all* boys.
If we use the same grouping in both layers, we get one smooth per boy: \indexf{geom\_smooth}
```{r}
#| label: layer18
ggplot(Oxboys, aes(age, height, group = Subject)) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE)
```
This is not what we wanted; we have inadvertently added a smoothed line for each boy.
Grouping controls both the display of the geoms, and the operation of the stats: one statistical transformation is run for each group.
Instead of setting the grouping aesthetic in `ggplot()`, where it will apply to all layers, we set it in `geom_line()` so it applies only to the lines.
There are no discrete variables in the plot so the default grouping variable will be a constant and we get one smooth:
```{r}
#| label: layer19
ggplot(Oxboys, aes(age, height)) +
  geom_line(aes(group = Subject)) +
  geom_smooth(method = "lm", linewidth = 2, se = FALSE)
```
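One way to confirm that the stat runs once per group is to count the groups in the smooth layer of each plot.
This is a sketch for checking your own plots: it assumes the smooth is the second layer (hence `layer_data(p, 2)`), and passes `formula = y ~ x` only to silence the message about the default formula.
With `group = Subject` set in `ggplot()` there should be 26 fits, one per boy; with the grouping moved into `geom_line()` there should be one.

```{r}
#| label: smooth-group-count
library(ggplot2)
data(Oxboys, package = "nlme")

per_boy <- ggplot(Oxboys, aes(age, height, group = Subject)) +
  geom_line() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

overall <- ggplot(Oxboys, aes(age, height)) +
  geom_line(aes(group = Subject)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Layer 2 is the smooth; each distinct group is one fitted line
n_fits_per_boy <- length(unique(layer_data(per_boy, 2)$group))
n_fits_overall <- length(unique(layer_data(overall, 2)$group))
c(per_boy = n_fits_per_boy, overall = n_fits_overall)
```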
## Overriding the default grouping
Some plots have a discrete x scale, but you still want to draw lines connecting *across* groups.
This is the strategy used in interaction plots, profile plots, and parallel coordinate plots, among others.
For example, imagine we've drawn boxplots of height at each measurement occasion: \indexf{geom\_boxplot}
```{r}
#| label: oxbox
ggplot(Oxboys, aes(Occasion, height)) +
  geom_boxplot()
```
There is one discrete variable in this plot, `Occasion`, so we get one boxplot for each unique x value.
Now we want to overlay lines that connect each individual boy.
Simply adding `geom_line()` does not work: the lines are drawn within each occasion, not across each subject:
```{r}
#| label: oxbox-line-bad
ggplot(Oxboys, aes(Occasion, height)) +
  geom_boxplot() +
  geom_line(colour = "#3366FF", alpha = 0.5)
```
To get the plot we want, we need to override the grouping to say we want one line per boy:
```{r}
#| label: oxbox-line
ggplot(Oxboys, aes(Occasion, height)) +
  geom_boxplot() +
  geom_line(aes(group = Subject), colour = "#3366FF", alpha = 0.5)
```
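Counting the groups in the line layer shows why the two plots differ.
In this sketch (the object names are introduced here for illustration), the default grouping yields one group per `Occasion`, so lines are drawn within each of the nine occasions, while the override yields one group per `Subject`:

```{r}
#| label: oxbox-group-count
library(ggplot2)
data(Oxboys, package = "nlme")

default_groups <- ggplot(Oxboys, aes(Occasion, height)) +
  geom_boxplot() +
  geom_line()
by_subject <- ggplot(Oxboys, aes(Occasion, height)) +
  geom_boxplot() +
  geom_line(aes(group = Subject))

# Layer 2 is the line layer in both plots
n_default <- length(unique(layer_data(default_groups, 2)$group))
n_subject <- length(unique(layer_data(by_subject, 2)$group))
c(default = n_default, by_subject = n_subject)
```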
## Matching aesthetics to graphic objects {#sec-matching}
A final important issue with collective geoms is how the aesthetics of the individual observations are mapped to the aesthetics of the complete entity.
What happens when different aesthetics are mapped to a single geometric element?
\index{Aesthetics!matching to geoms}
In ggplot2, this is handled differently for different collective geoms.
Lines and paths operate on a "first value" principle: each segment is defined by two observations, and ggplot2 applies the aesthetic value (e.g., colour) associated with the *first* observation when drawing the segment.
That is, the first segment takes its aesthetics from the first observation, the second segment from the second observation, and so on.
The aesthetic value for the last observation is not used:
```{r}
#| layout-ncol: 2
#| fig-width: 4
df <- data.frame(x = 1:3, y = 1:3, colour = c(1, 3, 5))

ggplot(df, aes(x, y, colour = factor(colour))) +
  geom_line(aes(group = 1), linewidth = 2) +
  geom_point(size = 5)

ggplot(df, aes(x, y, colour = colour)) +
  geom_line(aes(group = 1), linewidth = 2) +
  geom_point(size = 5)
```
On the left --- where colour is discrete --- the first point and first line segment are red, the second point and second line segment are green, and the final point (with no corresponding segment) is blue.
On the right --- where colour is continuous --- the same principle is applied to the three different shades of blue.
Notice that even though the colour variable is continuous, ggplot2 does not smoothly blend from one aesthetic value to another.
If this is the behaviour you want, you can perform the linear interpolation yourself:
```{r}
#| label: matching-lines2
xgrid <- with(df, seq(min(x), max(x), length.out = 50))
interp <- data.frame(
  x = xgrid,
  y = approx(df$x, df$y, xout = xgrid)$y,
  colour = approx(df$x, df$colour, xout = xgrid)$y
)
ggplot(interp, aes(x, y, colour = colour)) +
  geom_line(linewidth = 2) +
  geom_point(data = df, size = 5)
```
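Returning to the first-value rule described at the start of this section, the sketch below makes it concrete by building the segments by hand.
The `segments` data frame and its `xend` and `yend` columns are names introduced here for illustration, and drawing with `geom_segment()` only mimics the visible result of `geom_line()`; it is not a description of how ggplot2 implements lines internally.
Three observations give two segments, each taking the colour of its starting observation, so the last colour value is never used:

```{r}
#| label: matching-lines-segments
library(ggplot2)

df <- data.frame(x = 1:3, y = 1:3, colour = c(1, 3, 5))

# Segment i runs from observation i to observation i + 1 and takes the
# colour of observation i
segments <- data.frame(
  x      = head(df$x, -1),
  y      = head(df$y, -1),
  xend   = tail(df$x, -1),
  yend   = tail(df$y, -1),
  colour = head(df$colour, -1)
)
segments

ggplot(df, aes(x, y, colour = colour)) +
  geom_segment(aes(xend = xend, yend = yend), data = segments, linewidth = 2) +
  geom_point(size = 5)
```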
An additional limitation for paths and lines is worth noting: the line type must be constant over each individual line.
In R there is no way to draw a single line with a varying line type.
\indexf{geom\_line} \indexf{geom\_path}
What about other collective geoms, such as polygons?
Most collective geoms are more complicated than lines and paths, and a single geometric object can map onto many observations.
In such cases it is not obvious how the aesthetics of individual observations should be combined.
For instance, how would you colour a polygon that had a different fill colour for each point on its border?
Because of this ambiguity, ggplot2 adopts a simple rule: the aesthetics from the individual components are used only if they are all the same.
If the aesthetics differ for each component, ggplot2 uses a default value instead.
\indexf{geom\_polygon}
These issues are most relevant when mapping aesthetics to continuous variables.
For discrete variables, the default behaviour of ggplot2 is to treat the variable as part of the group aesthetic, as described above.
This has the effect of splitting the collective geom into smaller pieces.
This works particularly well for bar and area plots, because stacking the individual pieces produces the same shape as the original ungrouped data:
```{r}
#| label: bar-split-disc
#| layout-ncol: 2
#| fig-width: 4
ggplot(mpg, aes(class)) +
  geom_bar()

ggplot(mpg, aes(class, fill = drv)) +
  geom_bar()
```
If you try to map the fill aesthetic to a continuous variable (e.g., `hwy`) in the same way, it doesn't work.
The default grouping will only be based on `class`, so each bar is now associated with multiple colours (depending on the value of `hwy` for the observations in each class).
Because a bar can only display one colour, ggplot2 reverts to the default grey in this case.
To show multiple colours, we need multiple bars for each `class`, which we can get by overriding the grouping:
```{r}
#| label: bar-split-cont
#| layout-ncol: 2
#| fig-width: 4
ggplot(mpg, aes(class, fill = hwy)) +
  geom_bar()

ggplot(mpg, aes(class, fill = hwy, group = hwy)) +
  geom_bar()
```
In the plot on the right, the "shaded bars" for each `class` have been constructed by stacking many distinct bars on top of each other, each filled with a different shade based on the value of `hwy`.
Note that when you do this, the bars are stacked in the order defined by the grouping variable (in this example `hwy`).
If you need fine control over this behaviour, you'll need to create a factor with levels ordered as needed.
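
To see how many pieces the grouped version is built from, you can count the rows of the computed layer data, since each row corresponds to one rectangle.
This is a sketch under that assumption; the object names are introduced here for illustration.
The count should match the number of distinct `class`/`hwy` pairs in `mpg`:

```{r}
#| label: bar-split-count
library(ggplot2)

grouped <- ggplot(mpg, aes(class, fill = hwy, group = hwy)) +
  geom_bar()

# One row of layer data per rectangle drawn
n_rects <- nrow(layer_data(grouped))

# Distinct class/hwy pairs in the raw data
n_pairs <- nrow(unique(mpg[c("class", "hwy")]))

c(rectangles = n_rects, class_hwy_pairs = n_pairs)
```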

## Exercises

1.  Draw a boxplot of `hwy` for each value of `cyl`, without turning `cyl` into a factor.
    What extra aesthetic do you need to set?

2.  Modify the following plot so that you get one boxplot per integer value of `displ`.

    ```{r}
    #| eval: false
    ggplot(mpg, aes(displ, cty)) +
      geom_boxplot()
    ```

3.  When illustrating the difference between mapping continuous and discrete colours to a line, the discrete example needed `aes(group = 1)`.
    Why?
    What happens if that is omitted?
    What's the difference between `aes(group = 1)` and `aes(group = 2)`?
    Why?

4.  How many bars are in each of the following plots?

    ```{r}
    #| eval: false
    ggplot(mpg, aes(drv)) +
      geom_bar()

    ggplot(mpg, aes(drv, fill = hwy, group = hwy)) +
      geom_bar()

    library(dplyr)
    mpg2 <- mpg %>% arrange(hwy) %>% mutate(id = seq_along(hwy))
    ggplot(mpg2, aes(drv, fill = hwy, group = id)) +
      geom_bar()
    ```

    (Hint: try adding an outline around each bar with `colour = "white"`)

5.  Install the babynames package.
    It contains data about the popularity of baby names in the US.
    Run the following code and fix the resulting graph.
    Why does this graph make us unhappy?

    ```{r}
    #| eval: false
    library(babynames)
    hadley <- dplyr::filter(babynames, name == "Hadley")
    ggplot(hadley, aes(year, n)) +
      geom_line()
    ```