-
Notifications
You must be signed in to change notification settings - Fork 0
/
Chapters_6_to_8.Rmd
473 lines (303 loc) · 15 KB
/
Chapters_6_to_8.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
---
title: "Chapters 6 to 8"
author: "Laura"
date: "11/12/2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
library(tidyverse); library(skimr); library(nycflights13); library(GGally); library(ggstance); library(lvplot); library(hexbin); library(modelr)
```
## Notes for Chapter 6: Workflow: scripts
### 6.1 Running code
Helpful keyboard shortcuts are:
* `Cmd/Ctrl + Enter` will run the complete command where your cursor is standing and take the cursor to the next command
* `Cmd/Ctrl + Shift + S` will execute the whole script
## Notes for Chapter 7: Exploratory Data Analysis
EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data.
To do data cleaning, you’ll need to deploy all the tools of EDA: visualisation, transformation, and modelling.
### 7.2 Questions
> “There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox
> “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey
There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:
1. What type of variation occurs within my variables?
2. What type of covariation occurs between my variables?
Tabular data is _tidy_ if each _value_ is placed in its own “cell”, each _variable_ in its own column, and each _observation_ in its own row.
Things to do with a brand new dataset:
1. Visualize variable distributions
2. Examine typical values
3. Examine unusual values
4. Missing values
5. Covariation
6. Patterns and models
#### 7.3.1 Visualising distributions
```{r ch731}
diamonds %>%
ggplot() +
geom_bar(aes(x = cut))
diamonds %>%
count(cut)
diamonds %>%
ggplot() +
geom_histogram(aes(carat), binwidth = 0.5)
diamonds %>%
count(cut_width(carat, 0.5))
# if you want to go more tidyversy it can be a nightmare
diamonds$carat %>%
cut_width(0.5) %>%
as_tibble() %>%
count(value)
# without the warning it should be
diamonds$carat %>%
cut_width(0.5) %>%
enframe() %>%
count(value)
smaller <- diamonds %>%
filter(carat < 3)
smaller %>%
ggplot(aes(carat)) +
geom_histogram() +
geom_freqpoly(aes(color = reorder(cut, carat, FUN = median )))
```
#### 7.3.3 Unusual values
```{r ch733}
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
```
To make it easy to see the unusual values, we need to zoom to small values of the y-axis with `coord_cartesian()`. `coord_cartesian()` also has an `xlim()` argument for when you need to zoom into the x-axis. ggplot2 also has `xlim()` and `ylim()` functions that work slightly differently: *they throw away the data outside the limits*.
### 7.3.4 Exercises
Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
```{r ex7341}
diamonds %>% skim()
diamonds %>%
ggplot() +
geom_histogram(aes(x), binwidth = 0.01)
diamonds %>%
ggplot() +
geom_histogram(aes(x), binwidth = 0.1) +
coord_cartesian(ylim = c(0, 50))
diamonds %>%
filter(x < 3 | x > 9.5) %>%
select(price, x, y, z) %>%
arrange(x)
```
Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
```{r ex7342}
diamonds %>%
ggplot() +
geom_histogram(aes(price), binwidth = 10) +
coord_cartesian(ylim = c(0,10))
```
How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
```{r ex7343}
(carat_099 <- diamonds %>%
filter(carat == 0.99) %>%
count())
(carat_1 <- diamonds %>%
filter(carat == 1) %>%
count())
```
Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
```{r ex7344}
diamonds %>%
ggplot() +
geom_histogram(aes(x), binwidth = 0.10) +
coord_cartesian(ylim = c(0, 250))
diamonds %>%
ggplot() +
geom_histogram(aes(x)) +
coord_cartesian(ylim = c(0, 250))
diamonds %>%
ggplot() +
geom_histogram(aes(x), binwidth = 0.10) +
ylim(c(0,250))
diamonds %>%
ggplot() +
geom_histogram(aes(x))
diamonds %>%
ggplot() +
geom_histogram(aes(x)) +
ylim(c(0,250))
```
* `coord_cartesian` zooms in the histogram _after_ the histogram was calculated. Instead, `ylim()` or `xlim()` select the data on which the histogram will be calculated first and then the histogram is put together. That is why histograms with the `binwidth` option behave so differently when any of the `lim` functions are used.
## 7.4 Missing values
```{r ch74}
diamonds2 <- diamonds %>%
filter(between(y, 3, 20))
diamonds2 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
# an option to ifelse is case_when() - example from ?case_when
starwars %>%
select(name:mass, gender, species) %>%
mutate(
type = case_when(
height > 200 | mass > 200 ~ "large",
species == "Droid" ~ "robot",
TRUE ~ "other"
)
)
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point()
nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(sched_dep_time)) +
geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point(na.rm = TRUE)
```
#### 7.4.1 Exercises
What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?
```{r ex7411}
glimpse(diamonds2)
skim(diamonds2)
diamonds3 <- diamonds2 %>%
mutate(
color = if_else(color == "E" & price == 326, NA_character_ , as.character(color)),
price = if_else(color == "E" & price == 326, NA_integer_ , price))
skim(diamonds3)
diamonds3 %>% ggplot() +
geom_bar(aes(color))
ggplot(diamonds3) +
geom_histogram(aes(price))
```
What does na.rm = TRUE do in mean() and sum()?
```{r ex7412}
sum(c(1, 2, 3, NA))
sum(c(1, 2, 3, NA), na.rm = TRUE)
mean(c(1, 2, 3, NA))
mean(c(1, 2, 3, NA), na.rm = TRUE)
```
## 7.5 Covariation
```{r ch75}
ggplot(diamonds, aes(price, ..density..)) +
geom_histogram()
ggplot(diamonds, aes(cut, price)) +
geom_boxplot() +
geom_jitter(alpha = 0.01)
ggplot(diamonds, aes(cut, price)) +
geom_violin() +
geom_jitter(alpha = 0.01)
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
bpflip <- ggplot(mpg) +
geom_boxplot(aes(reorder(class, hwy, FUN = median), hwy)) +
coord_flip()
bpflip + xlab("Car class")
```
#### 7.5.1.1 Exercises
* Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.
```{r ex75111}
flights <- flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time_h = sched_hour + sched_min / 60
)
ggplot(flights) +
geom_boxplot(aes(cancelled, sched_dep_time_h))
ggplot(flights) +
geom_jitter(aes(cancelled, sched_dep_time_h))
ggplot(flights) +
geom_violin(aes(cancelled, sched_dep_time_h))
ggplot(flights) +
geom_freqpoly(aes(sched_dep_time_h, y = ..density.., color = cancelled))
```
* What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
```{r ex75112}
# GGally::ggpairs allows to inspect the bivariate relationship of several variables at a time
diamonds %>% ggpairs(columns = c(1, 5, 7:10))
ggplot(diamonds) +
geom_boxplot(aes(cut, carat))
ggplot(diamonds) +
geom_violin(aes(cut, carat))
ggplot(diamonds) +
geom_histogram(aes(carat, y = ..density..)) +
facet_wrap(~ cut, ncol = 1)
```
Carat seems the most important variable for predicting the price of a diamond. We have a lot more ideal and premium cut diamonds of low carat than very good or lower cuts diamonds. The higher the carat, the higher the price. That explains that diamonds with a better cut appear to be cheaper, they are of lower carat on average than worse cut diamonds.
* Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?
```{r ex75113}
ggplot(diamonds) +
geom_boxplot(aes(cut, carat)) +
coord_flip()
# note that for geom_boxploth to run well you have to invert the order of the aes arguments
ggplot(diamonds) +
geom_boxploth(aes(carat, cut))
```
I don't see any difference between the outputs of using `coord_flip` to `geom_boxploth`. But I had to invert the order of the arguments in `geom_boxploth` in order for it to run correctly.
* One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?
* Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?
* If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.
### 7.5.2 Two categorical variables
```{r ch7521}
ggplot(diamonds) +
geom_count(aes(cut, color))
diamonds %>%
count(color, cut)
diamonds %>%
count(color, cut) %>%
ggplot(aes(cut, color, fill = n)) +
geom_tile()
```
If the categorical variables are unordered, you might want to use the `seriation` package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the `d3heatmap` or `heatmaply` packages, which create interactive plots.
#### 7.5.2.1 Exercises
* How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?
* Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?
* Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?
### 7.5.3 Two continuous variables
But using transparency can be challenging for very large datasets. Another solution is to use bin. Previously you used `geom_histogram()` and `geom_freqpoly()` to bin in one dimension. Now you’ll learn how to use `geom_bin2d()` and `geom_hex()` to bin in two dimensions.
```{r ch7531}
ggplot(smaller) +
geom_bin2d(aes(carat, price))
ggplot(diamonds) +
geom_bin2d(aes(carat, price))
# install.packages("hexbin")
ggplot(smaller) +
geom_hex(aes(carat, price))
```
Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualising the combination of a categorical and a continuous variable that you learned about. For example, you could bin carat and then for each group, display a boxplot:
```{r ch7532}
ggplot(smaller, aes(carat, price)) +
geom_boxplot(aes(group = cut_width(carat, 0.1)))
# the option varwidth let us see that each boxplot is representing a different number of obs
ggplot(smaller, aes(carat, price)) +
geom_boxplot(aes(group = cut_width(carat, 0.1)), varwidth = TRUE)
# another option is to fix the number of points per bin and let the width of the box plot to reflect how much of the x axis is representing. I think the previous plot is more intuitive to understand
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
```
#### 7.5.3.1 Exercises
* Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?
* Visualise the distribution of carat, partitioned by price.
* How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?
* Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat, and price.
* Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.
## 7.6 Patterns and models
If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.
Models are a tool for extracting patterns out of data. For example, consider the diamonds data. It’s hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It’s possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain. The following code fits a model that predicts price from carat and then computes the residuals (the difference between the predicted value and the actual value). The residuals give us a view of the price of the diamond, once the effect of carat has been removed.
```{r ch761}
mod <- lm(log(price) ~ log(carat), data = diamonds)
diamonds2 <- diamonds %>%
add_residuals(mod) %>%
mutate(resid = exp(resid))
ggplot(diamonds2) +
geom_point(aes(x = carat, y = resid))
ggplot(diamonds2) +
geom_boxplot(aes(cut, resid))
```
## 7.8 Learning more
Another useful resource is the (R Graphics Cookbook by Winston Chang[http://www.cookbook-r.com/Graphs/].
## Notes for Chapter 8:
There is a great pair of keyboard shortcuts that will work together to make sure you’ve captured the important parts of your code in the editor:
1. Press Cmd/Ctrl + Shift + F10 to restart RStudio.
2. Press Cmd/Ctrl + Shift + S to rerun the current script.
## Wishlist:
Using purrr for a histogram with varying bandwidths.
Using purrr to load several packages at one time?