---
title: Data transformation
output: oldbookdown::html_chapter
bibliography: references.bib
---
```{r data, include = FALSE}
chapter <- "data-manip"
source("common.R")
```
# Data transformation {#cha:dplyr}
## Introduction
Tidy data is important, but it's not the end of the road. Often you won't have quite the right variables, or your data might need a little aggregation before you visualise it. This chapter will show you how to solve these problems (and more!) with the __dplyr__ package. \index{Data!manipulating} \index{dplyr} \index{Grammar!of data manipulation}
The goal of dplyr is to provide verbs (functions) that help you solve the most common 95% of data manipulation problems. dplyr is similar to ggplot2, but instead of providing a grammar of graphics, it provides a grammar of data manipulation. Like ggplot2, dplyr doesn't just give you functions: it also helps you think about data manipulation. In particular, dplyr helps by constraining you: instead of struggling to decide which of the thousands of available functions might help, you can pick from a handful that are designed to be very likely to be helpful. In this chapter you'll learn four of the most important dplyr verbs:
* `filter()`
* `mutate()`
* `group_by()` & `summarise()`
These verbs are easy to learn because they all work the same way: they take a data frame as the first argument, and return a modified data frame. The other arguments control the details of the transformation, and are always interpreted in the context of the data frame so you can refer to variables directly. I'll also explain each verb in the same way: I'll show you a motivating example using the `diamonds` data, give you more details about how the function works, and finish up with some exercises so you can practice your skills.
You'll also learn how to create data transformation pipelines using `%>%`. `%>%` plays a similar role to `+` in ggplot2: it allows you to solve complex problems by combining small pieces that are easily understood in isolation.
This chapter only scratches the surface of dplyr's capabilities but it should be enough to help you with visualisation problems. You can learn more by using the resources discussed at the end of the chapter.
## Filter observations
It's common to only want to explore one part of a dataset. A great data analysis strategy is to start with just one observation unit (one person, one city, etc), and understand how it works before attempting to generalise the conclusion to others. This is a great technique if you ever feel overwhelmed by an analysis: zoom down to a small subset, master it, and then zoom back out, to apply your conclusions to the full dataset. \indexf{filter}
Filtering is also useful for extracting outliers. Generally, you don't want to just throw outliers away, as they're often highly revealing, but it's useful to think about partitioning the data into the common and the unusual. You summarise the common to look at the broad trends; you examine the outliers individually to see if you can figure out what's going on.
For example, look at this plot that shows how the x and y dimensions of the diamonds are related:
`r columns(1, 2/3)`
```{r diamonds-x-y}
ggplot(diamonds, aes(x, y)) +
geom_bin2d()
```
There are around 50,000 points in this dataset: most of them lie along the diagonal, but there are a handful of outliers. One clear set of incorrect values is the diamonds with zero dimensions. We can use `filter()` to pull them out:
```{r diamonds-0-0}
filter(diamonds, x == 0 | y == 0)
```
This is equivalent to the base R code `diamonds[diamonds$x == 0 | diamonds$y == 0, ]`, but is more concise because `filter()` knows to look for the bare `x` in the data frame.
(If you've used `subset()` before, you'll notice that it has very similar behaviour. The biggest difference is that `subset()` can select both observations and variables, where in dplyr, `filter()` works exclusively with observations and `select()` with variables. There are some other subtle differences, but the main advantage to using `filter()` is that it behaves identically to the other dplyr verbs and it tends to be a bit faster than `subset()`.)
In a real analysis, you'd look at the outliers in more detail to see if you can find the root cause of the data quality problem. In this case, we're just going to throw them out and focus on what remains. To save some typing, we can provide multiple arguments to `filter()`, which it combines with logical AND.
```{r diamonds-ok}
diamonds_ok <- filter(diamonds, x > 0, y > 0, y < 20)
ggplot(diamonds_ok, aes(x, y)) +
geom_bin2d() +
geom_abline(slope = 1, colour = "white", size = 1, alpha = 0.5)
```
This plot is now more informative - we can see a very strong relationship between `x` and `y`. I've added the reference line to make it clear that for most diamonds, `x` and `y` are very similar. However, this plot still has problems:
* The plot is mostly empty, because most of the data lies along the diagonal.
* There are some clear bivariate outliers, but it's hard to select them with
a simple filter.
We'll solve both of these problems in the next section by adding a new variable that's a transformation of `x` and `y`. But before we continue on to that, let's talk more about the details of `filter()`.
### Useful tools
The first argument to `filter()` is a data frame. The second and subsequent arguments must be logical vectors: `filter()` selects every row where all the logical expressions are `TRUE`. The logical vectors must always be the same length as the number of rows in the data frame: if not, you'll get an error. Typically you create the logical vector with the comparison operators:
* `x == y`: x and y are equal.
* `x != y`: x and y are not equal.
* `x %in% c("a", "b", "c")`: `x` is one of the values in the right hand side.
* `x > y`, `x >= y`, `x < y`, `x <= y`: greater than, greater than or equal to,
less than, less than or equal to.
And combine them with logical operators:
* `!x` (pronounced "not x"), flips `TRUE` and `FALSE` so it keeps all the values
where `x` is `FALSE`.
* `x & y`: `TRUE` if both `x` and `y` are `TRUE`.
* `x | y`: `TRUE` if either `x` or `y` (or both) is `TRUE`.
* `xor(x, y)`: `TRUE` if either `x` or `y` is `TRUE`, but not both
  (exclusive or).
Most real queries involve some combination of both:
* Price less than $500: `price < 500`
* Size between 1 and 2 carats: `carat >= 1 & carat < 2`
* Cut is ideal or premium: `cut == "Premium" | cut == "Ideal"`, or
`cut %in% c("Premium", "Ideal")` (note that R is case sensitive)
* Worst colour, cut and clarity: `cut == "Fair" & color == "J" & clarity == "SI2"`
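Putting a couple of these together in a single `filter()` call (a minimal sketch; recall that multiple arguments are combined with AND):
```{r filter-combined, results = "hide"}
# Ideal or premium diamonds between 1 and 2 carats
filter(diamonds, cut %in% c("Premium", "Ideal"), carat >= 1, carat < 2)
```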
You can also use functions in the filtering expression:
* Size is between 1 and 2 carats: `floor(carat) == 1`
* An average dimension greater than 3: `(x + y + z) / 3 > 3`
This is useful for simple expressions, but as things get more complicated it's better to create a new variable first so you can check that you've done the computation correctly before doing the subsetting. You'll learn how to do that in the next section.
The rules for `NA` are a bit trickier, so I'll explain them next.
### Missing values
`NA`, R's missing value indicator, can be frustrating to work with. R's underlying philosophy is to force you to recognise that you have missing values, and make a deliberate choice to deal with them: missing values never silently go missing. This is a pain because you almost always want to just get rid of them, but it's a good principle to force you to think about the correct option. \indexc{NA} \index{Missing values}
The most important thing to understand about missing values is that they are infectious: with few exceptions, the result of any operation that includes a missing value will be a missing value. This happens because `NA` represents an unknown value, and there are few operations that turn an unknown value into a known value.
```{r na}
x <- c(1, NA, 2)
x == 1
x > 2
x + 10
```
When you first learn R, you might be tempted to find missing values using `==`:
```{r na-comparison}
x == NA
x != NA
```
But that doesn't work! A little thought reveals why: there's no reason why two unknown values should be the same. Instead, use `is.na(x)` to determine if a value is missing: \indexf{is.na}
```{r is-na}
is.na(x)
```
`filter()` only includes observations where all arguments are `TRUE`, so `NA` values are automatically dropped. If you want to include missing values, be explicit: `x > 10 | is.na(x)`. In other parts of R, you'll sometimes need to convert missing values into `FALSE`; you can do that with `x > 10 & !is.na(x)`.
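To see this behaviour concretely, here's a toy example (the data frame and values are made up for illustration):
```{r filter-na-demo}
df <- data.frame(val = c(5, NA, 15))
filter(df, val > 10)               # the NA row is automatically dropped
filter(df, val > 10 | is.na(val))  # explicitly keep the missing values
```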
### Exercises
1. Practice your filtering skills by finding all the diamonds that:
    * Have equal x and y dimensions.
    * Have a depth between 55 and 70.
    * Have a carat smaller than the median carat.
    * Cost more than $10,000 per carat.
    * Are of good or better quality.
1. Fill in the question marks in this table:
Expression | `TRUE` | `FALSE` | `NA`
-----------------|----------|---------|--------
`x` | x | |
? | | x |
`is.na(x)` | | | x
`!is.na(x)` | ? | ? | ?
? | x | | x
? | | x | x
1. Repeat the analysis of outlying values with `z`. Compared to `x` and `y`,
how would you characterise the relationship of `x` and `z`, or `y` and
`z`?
1. Install the __ggplot2movies__ package and look at the movies that have a missing budget.
How are they different from the movies with a budget? (Hint: try a frequency polygon plus `colour = is.na(budget)`.)
1. What is `NA & FALSE` and `NA | TRUE`? Why? Why doesn't `NA * 0` equal zero?
What number times zero does not equal 0? What do you expect `NA ^ 0` to
equal? Why?
## Create new variables {#mutate}
To better explore the relationship between `x` and `y`, it's useful to "rotate" the plot so that the data is flat, not diagonal. We can do that by creating two new variables: one that represents the difference between `x` and `y` (which in this context represents the symmetry of the diamond) and one that represents its size (the length of the diagonal). \indexf{mutate} \index{Data!creating new variables}
To create new variables use `mutate()`. Like `filter()` it takes a data frame as its first argument and returns a data frame. Its second and subsequent arguments are named expressions that generate new variables. Like `filter()` you can refer to variables just by their name, you don't need to also include the name of the dataset.
```{r mutate1}
diamonds_ok2 <- mutate(diamonds_ok,
sym = x - y,
size = sqrt(x ^ 2 + y ^ 2)
)
diamonds_ok2
ggplot(diamonds_ok2, aes(size, sym)) +
stat_bin2d()
```
This plot has two advantages: we can more easily see the pattern followed by most diamonds, and we can easily select outliers. Here, it doesn't seem important whether the outliers are positive (i.e. `x` is bigger than `y`) or negative (i.e. `y` is bigger than `x`). So we can use the absolute value of the symmetry variable to pull out the outliers. Based on the plot, and a little experimentation, I came up with a threshold of 0.20. We'll check out the results with a histogram.
`r columns(2, 1)`
```{r sym-hist}
ggplot(diamonds_ok2, aes(abs(sym))) +
geom_histogram(binwidth = 0.10)
diamonds_ok3 <- filter(diamonds_ok2, abs(sym) < 0.20)
ggplot(diamonds_ok3, aes(abs(sym))) +
geom_histogram(binwidth = 0.01)
```
That's an interesting histogram! While most diamonds are close to being symmetric there are very few that are perfectly symmetric (i.e. `x == y`).
### Useful tools
Typically, transformations will be suggested by your domain knowledge. However, there are a few transformations that are useful in a surprisingly wide range of circumstances.
* Log-transformations are often useful. They turn multiplicative relationships
  into additive relationships; they compress data that varies over orders of
  magnitude; they convert power relationships to linear relationships.
  See examples at <http://stats.stackexchange.com/questions/27951>
* Relative difference: If you're interested in the relative difference between
  two variables, use `log(x / y)`. It's better than `x / y` because it's
  symmetric: if `x < y`, `x / y` takes values in [0, 1), but if `x > y`, it
  takes values in (1, Inf). There's a quick numeric check of this symmetry
  after the list. See @tornqvist:1985 for more details. \indexf{log}
* Sometimes integrating or differentiating might make the data more
  interpretable: if you have distance and time, would speed or acceleration
  be more useful (or vice versa)? (Note that integration makes data more
  smooth; differentiation makes it less smooth.)
* Partition a number into magnitude and direction with `abs(x)` and `sign(x)`.
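Here's the promised check of the log-ratio symmetry (toy numbers, unrelated to the diamonds data):
```{r log-ratio-symmetry}
x <- 2
y <- 8
c(x / y, y / x)            # raw ratios: 0.25 and 4, not symmetric
c(log(x / y), log(y / x))  # log-ratios: -1.39 and 1.39, symmetric
```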
There are also a few useful ways to transform pairs of variables:
* Partitioning into overall size and difference is often useful, as seen
above.
* If you see a strong trend, using a model to partition it into pattern and
  residuals is often useful. You'll learn more about that in the next chapter.
* Sometimes it's useful to change positions to polar coordinates (or vice
versa): distance (`sqrt(x^2 + y^2)`) and angle (`atan2(y, x)`).
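As a minimal sketch of that last transformation, applied to the cleaned diamonds data from above:
```{r polar-sketch, results = "hide"}
mutate(diamonds_ok,
  dist  = sqrt(x ^ 2 + y ^ 2),  # distance from the origin
  angle = atan2(y, x)           # angle in radians
)
```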
### Exercises
1. Practice your variable creation skills by creating the following new
variables:
* The approximate volume of the diamond (using x, y, and z).
* The approximate density of the diamond.
* The price per carat.
* Log transformation of carat and price.
1. How can you improve the data density of
   `ggplot(diamonds, aes(x, z)) + stat_bin2d()`?
   What transformation makes it easier to extract outliers?
1. The depth variable is just the height of the diamond (`z`) divided by its
   average width (the mean of `x` and `y`), multiplied by 100 and rounded to
   the nearest integer. Compute the depth yourself and compare it to the
   existing depth variable. Summarise your findings with a plot.
1. Compare the distribution of symmetry for diamonds with $x > y$ vs. $x < y$.
## Group-wise summaries {#sec:summarise}
Many insightful visualisations require that you reduce the full dataset down to a meaningful summary. ggplot2 provides a number of geoms that will do summaries for you. But it's often useful to do summaries by hand: that gives you more flexibility and you can use the summaries for other purposes. \indexf{group\_by} \indexf{summarise}
dplyr does summaries in two steps:
1. Define the grouping variables with `group_by()`.
2. Describe how to reduce each group to a single row with `summarise()`.
For example, to look at the average price per clarity, we first group by clarity, then summarise:
`r columns(1, 1 / 2)`
```{r price-by-clarity}
by_clarity <- group_by(diamonds, clarity)
sum_clarity <- summarise(by_clarity, price = mean(price))
sum_clarity
ggplot(sum_clarity, aes(clarity, price)) +
geom_line(aes(group = 1), colour = "grey80") +
geom_point(size = 2)
```
You might be surprised by this pattern: why do diamonds with better clarity have lower prices? We'll see why this is the case and what to do about it in [removing trend](#sub:trend).
Supply additional variables to `group_by()` to create groups based on more than one variable. The next example shows how we can compute (by hand) a frequency polygon that shows how cut and depth interact. The special summary function `n()` counts the number of observations in each group.
```{r freqpoly-by-hand}
cut_depth <- summarise(group_by(diamonds, cut, depth), n = n())
cut_depth <- filter(cut_depth, depth > 55, depth < 70)
cut_depth
ggplot(cut_depth, aes(depth, n, colour = cut)) +
geom_line()
```
`summarise()` strips off one level of grouping, so `cut_depth` is still grouped by cut. That means we can use a grouped `mutate()` to convert counts to proportions, which makes it easier to compare across the cuts.
```{r freqpoly-scaled}
cut_depth <- mutate(cut_depth, prop = n / sum(n))
ggplot(cut_depth, aes(depth, prop, colour = cut)) +
geom_line()
```
### Useful tools
`summarise()` needs to be used with functions that take a vector of $n$ values and always return a single value. Those functions include:
* Counts: `n()`, `n_distinct(x)`.
* Middle: `mean(x)`, `median(x)`.
* Spread: `sd(x)`, `mad(x)`, `IQR(x)`.
* Extremes: `quantile(x)`, `min(x)`, `max(x)`.
* Positions: `first(x)`, `last(x)`, `nth(x, 2)`.
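A sketch that combines a few of these in a single `summarise()` call (the choice of variables is arbitrary):
```{r summary-sketch}
summarise(group_by(diamonds, cut),
  n         = n(),
  n_clarity = n_distinct(clarity),
  median    = median(price),
  iqr       = IQR(price)
)
```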
Another extremely useful technique is to use `sum()` or `mean()` with a logical vector. When a logical vector is treated as numeric, `TRUE` becomes 1 and `FALSE` becomes 0. This means that `sum()` tells you the number of `TRUE`s, and `mean()` tells you the proportion of `TRUE`s. For example, the following code counts the number of diamonds with carat greater than or equal to 4, and the proportion of diamonds that cost less than $1000.
```{r}
summarise(diamonds,
n_big = sum(carat >= 4),
prop_cheap = mean(price < 1000)
)
```
Most summary functions have an `na.rm` argument: `na.rm = TRUE` tells the summary function to remove any missing values prior to summarising. This is a convenient shortcut: rather than removing the missing values and then summarising, you can do it in one step.
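For example, with a toy vector:
```{r na-rm-demo}
x <- c(1, 2, NA)
mean(x)               # NA, because missing values are infectious
mean(x, na.rm = TRUE) # 1.5, the mean of the non-missing values
```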
### Statistical considerations
When summarising with the mean or median, it's always a good idea to include a count and a measure of spread. This helps you calibrate your assessments - if you don't include them you're likely to think that the data is less variable than it really is, and potentially draw unwarranted conclusions.
The following example extends our previous summary of the average price by clarity to also include the number of observations in each group, and the upper and lower quartiles. It suggests the mean might be a bad summary for this data - the distributions of price are so highly skewed that the mean is higher than the upper quartile for some of the groups!
`r columns(1, 2/3)`
```{r }
by_clarity <- diamonds %>%
group_by(clarity) %>%
summarise(
n = n(),
mean = mean(price),
lq = quantile(price, 0.25),
uq = quantile(price, 0.75)
)
by_clarity
ggplot(by_clarity, aes(clarity, mean)) +
geom_linerange(aes(ymin = lq, ymax = uq)) +
geom_line(aes(group = 1), colour = "grey50") +
geom_point(aes(size = n))
```
Another example of this comes from baseball. Let's take the MLB batting data from the Lahman package and calculate the batting average: the number of hits divided by the number of at bats. Who's the best batter according to this metric?
`r columns(1, 2/3)`
```{r}
data(Batting, package = "Lahman")
batters <- filter(Batting, AB > 0)
per_player <- group_by(batters, playerID)
ba <- summarise(per_player,
ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE)
)
ggplot(ba, aes(ba)) +
geom_histogram(binwidth = 0.01)
```
Wow, there are a lot of players who can hit the ball every single time! Would you want them on your fantasy baseball team? Let's double check that they're really that good by also showing the total number of at bats:
```{r}
ba <- summarise(per_player,
ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
ab = sum(AB, na.rm = TRUE)
)
ggplot(ba, aes(ab, ba)) +
geom_bin2d(bins = 100) +
geom_smooth()
```
The highest batting averages occur for the players with the smallest number of at bats - it's not hard to hit the ball every time if you've only had two pitches. We can make the pattern a little more clear by getting rid of the players with fewer than 10 at bats.
```{r}
ggplot(filter(ba, ab >= 10), aes(ab, ba)) +
geom_bin2d() +
geom_smooth()
```
You'll often see a similar pattern whenever you plot number of observations vs. an average. Be aware!
### Exercises
1. For each year in the `ggplot2movies::movies` data determine the percent
of movies with missing budgets. Visualise the result.
1. How does the average length of a movie change over time? Display your
answer with a plot, including some display of uncertainty.
1. For each combination of diamond quality (e.g. cut, colour and clarity),
count the number of diamonds, the average price and the average size.
Visualise the results.
1. Compute a histogram of carat by "hand" using a binwidth of 0.1.
Display the results with `geom_bar(stat = "identity")`.
(Hint: you might need to create a new variable first).
1. In the baseball example, the batting average seems to increase as the
number of at bats increases. Why?
## Transformation pipelines
Most real analyses require you to string together multiple `mutate()`s, `filter()`s, `group_by()`s, and `summarise()`s. For example, above, we created a frequency polygon by hand with a combination of all four verbs: \indexc{\%>\%}
```{r pipe1, results = "hide"}
# By using intermediate values
cut_depth <- group_by(diamonds, cut, depth)
cut_depth <- summarise(cut_depth, n = n())
cut_depth <- filter(cut_depth, depth > 55, depth < 70)
cut_depth <- mutate(cut_depth, prop = n / sum(n))
```
This sequence of operations is a bit painful because we repeated the name of the data frame many times. An alternative is just to do it with one sequence of function calls:
```{r pipe2, results = "hide"}
# By "composing" functions
mutate(
filter(
summarise(
group_by(
diamonds,
cut,
depth
),
n = n()
),
depth > 55,
depth < 70
),
prop = n / sum(n)
)
```
But this is also hard to read because the sequence of operations is inside out, and the arguments to each function can be quite far apart. dplyr provides an alternative approach with the __pipe__, `%>%`. With the pipe, we can write the above sequence of operations as:
```{r pipe3, results = "hide"}
cut_depth <- diamonds %>%
group_by(cut, depth) %>%
summarise(n = n()) %>%
filter(depth > 55, depth < 70) %>%
mutate(prop = n / sum(n))
```
This makes it easier to understand what's going on as we can read it almost like an English sentence: first group, then summarise, then filter, then mutate. In fact, the best way to pronounce `%>%` when reading a sequence of code is as "then". `%>%` comes from the magrittr package, by Stefan Milton Bache. It provides a number of other tools that dplyr doesn't expose by default, so I highly recommend that you check out the [magrittr website](https://github.com/smbache/magrittr). \index{magrittr}
`%>%` works by taking the thing on the left hand side (LHS) and supplying it as the first argument to the function on the right hand side (RHS). Each of these pairs of calls is equivalent:
```{r pipe-ex, eval = FALSE}
f(x, y)
# is the same as
x %>% f(y)
g(f(x, y), z)
# is the same as
x %>% f(y) %>% g(z)
```
### Exercises
1. Translate each of the examples in this chapter to use the pipe.
1. What does the following pipe do?
```{r, eval = FALSE}
library(magrittr)
x <- runif(100)
x %>%
subtract(mean(.)) %>%
raise_to_power(2) %>%
mean() %>%
sqrt()
```
1. Which player in the `Batting` dataset has had the most consistently
good performance over the course of their career?
## Learning more
dplyr provides a number of other verbs that are less useful for visualisation, but important to know about in general (there's a short sketch after the list):
* `arrange()` orders observations according to variable(s). This is most useful
when you're looking at the data from the console. It can also be useful
for visualisations if you want to control which points are plotted on top.
* `select()` picks variables based on their names. Useful when you have
many variables and want to focus on just a few for analysis.
* `rename()` allows you to change the name of variables.
* Grouped mutates and filters are also useful, but more advanced. See
`vignette("window-functions", package = "dplyr")` for more details.
* There are a number of verbs designed to work with two tables of data
at a time. These include SQL joins (like the base `merge()` function)
and set operations. Learn more about them in `vignette("two-table", package = "dplyr")`.
* dplyr can work directly with data stored in a database - you use the same
R code as you do for local data and dplyr generates SQL to send to the
database. See `vignette("databases", package = "dplyr")` for the details.
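A minimal sketch of the first three verbs in action (see the vignettes above for the rest):
```{r other-verbs, results = "hide"}
arrange(diamonds, desc(price))       # most expensive diamonds first
select(diamonds, carat, cut, price)  # keep just a few variables
rename(diamonds, colour = color)     # new name on the left
```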
Finally, RStudio provides a handy dplyr cheatsheet that will help jog your memory when you're wondering which function to use. Get it from <http://rstudio.com/cheatsheets>.
## References