---
title: "Chapter 3 - Data visualisation"
author: "Salman Mohammed"
date: "February 7, 2017"
output: html_document
---
#### Importing the necessary libraries
```{r results='hide', message=FALSE, warning=FALSE}
library(tidyverse)
```
#### First look at the dataset
```{r}
mpg
```
#### A car's fuel efficiency vs. engine size
We want to see the relationship between a car's engine size (displ) and its fuel efficiency on the highway (hwy). A car with low fuel efficiency consumes more fuel than a car with high fuel efficiency when both travel the same distance.
```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
```
The plot shows a downward trend, which indicates a negative relationship between the two variables. This supports the hypothesis that cars with bigger engines are less fuel efficient.
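As a rough numeric check on the trend, the sample correlation between the two variables can be computed. This is a minimal sketch: Pearson correlation is a summary chosen here for illustration, not something the chapter uses, and it only captures the linear part of the relationship. The name displ_hwy_cor is illustrative.

```{r}
library(ggplot2)  # provides the mpg dataset

# Pearson correlation between engine displacement and highway mileage
displ_hwy_cor <- cor(mpg$displ, mpg$hwy)
displ_hwy_cor
```

A negative value agrees with the downward trend in the scatterplot.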
#### 3.2.4 - Exercises
1. Run ggplot(data = mpg). What do you see?
- An empty plot: ggplot() creates the coordinate system, but no layers have been added yet.
```{r}
ggplot(data = mpg)
```
2. How many rows are in mtcars? How many columns?
```{r}
nrow(mtcars)
ncol(mtcars)
```
3. What does the drv variable describe? Read the help for ?mpg to find out.
- drv describes the type of drive train. The options are f = front-wheel drive, r = rear-wheel drive, 4 = 4wd
4. Make a scatterplot of hwy vs cyl.
- hwy is the highway miles per gallon
- cyl is the number of cylinders
```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = hwy, y = cyl))
```
5. What happens if you make a scatterplot of class vs drv?
- class is the 'type' of car
```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = class, y = drv))
```
6. Why is the plot not useful?
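- Both variables are categorical, so there are only a few possible combinations of class and drv. Every car with the same combination is drawn at exactly the same position, so the points overlap and the plot hides how many cars each combination contains.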
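To see what the overlapping points hide, the number of cars behind each position can be counted. This is a sketch under the assumption that dplyr's count() and ggplot2's geom_count() are acceptable here; neither is used in the exercise itself, and the name class_drv_counts is illustrative.

```{r}
library(ggplot2)  # mpg dataset and geom_count()
library(dplyr)    # count()

# Number of cars behind each overplotted point
class_drv_counts <- count(mpg, class, drv)
class_drv_counts

# One way to make the counts visible: size each point by its count
ggplot(data = mpg) +
  geom_count(mapping = aes(x = class, y = drv))
```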
#### Incorporating another variable in the plot with color/size/alpha/shape
```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class))
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, size = class))
## alpha controls the transparency of the points
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, shape = class))
```
#### 3.3.1 - Exercises
1. What’s gone wrong with this code? Why are the points not blue?
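- The exercise's code puts color = "blue" inside aes(), so ggplot2 treats "blue" as a data value to map to the colour scale rather than as a literal colour. To set the colour manually, pass it as an argument to geom_point(), outside aes():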
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
```
2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?
- Categorical - manufacturer, model, trans, drv, fl, class
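- Continuous - displ, year, cyl, cty, hwy (year and cyl are stored as integers and take only a few distinct values, so they can also be treated as discrete)
- When you print mpg, the type appears under each column name: categorical variables are type chr, whereas continuous variables are type dbl or int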
```{r}
mpg
```
3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = cyl))
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = cyl))
## shape cannot be mapped to a continuous variable; ggplot2 raises an error
```
4. What happens if you map the same variable to multiple aesthetics?
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = cyl, color = cyl))
```
5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
- Stroke adjusts the thickness of the border for shapes whose border and interior can be coloured separately, that is, the filled shapes 21-25.
```{r}
# For shapes that have a border (like 21), you can colour the inside and
# outside separately. Use the stroke aesthetic to modify the width of the
# border
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), shape = 21, stroke = 3)
```
6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?
- ggplot2 evaluates the expression for every row and treats the result as a temporary variable. Here, the new variable is TRUE if the engine displacement is less than 5 and FALSE if it is greater than or equal to 5, so the points are coloured by those two values.
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = displ < 5))
```
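The temporary variable from the last exercise can be made explicit by computing it before plotting. This sketch is only meant to show that the two approaches give the same grouping. The names small_engine and mpg_flagged are hypothetical, and base R's transform() is used for convenience.

```{r}
library(ggplot2)  # provides the mpg dataset

# The expression inside aes() yields one logical value per row
small_engine <- mpg$displ < 5
table(small_engine)

# Equivalent plot with the variable stored as a column first
mpg_flagged <- transform(mpg, small_engine = displ < 5)
ggplot(data = mpg_flagged) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = small_engine))
```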
#### Facets
Facets are particularly useful for categorical variables: they split your plot into subplots that each display one subset of the data.
To facet your plot by a single variable, use facet_wrap().
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)
```
To facet your plot on the combination of two variables, add facet_grid() to your plot call.
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ class)
```
#### 3.5.1 Exercises
1. What happens if you facet on a continuous variable?
- The plot will not make much sense. ggplot2 treats each unique value of the continuous variable as a level and draws a separate facet for it. With many unique values, this produces a large number of tiny panels that are slow to render and hard to read.
2. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
- The empty plots mean that there are no points in our dataset that have that combination of drv and cyl values. For example, there are no 4-wheel drive cars in our dataset with 5 cylinders.
3. What plots does the following code make? What does . do?
- The . acts as a placeholder for no variable. In facet_grid(), this results in a plot faceted on a single dimension (1 by N or N by 1) rather than an N by M grid.
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)
```
4. What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
```
- Faceting splits the data into separate panels, which makes trends within each class easier to see. The disadvantage is that comparing values across panels is harder, so the overall relationship is less obvious. The color aesthetic works well when the dataset is small, but with a larger dataset the points overlap and the colours become hard to tell apart. Jittering does not help much in that case because the colours still blend together, so faceting is usually the better choice.
5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol variables?
- nrow sets how many rows the faceted plot will have.
- ncol sets how many columns the faceted plot will have.
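- Other layout options include as.table, which controls whether the panels are laid out like a table (highest values at the bottom right) or like a plot (highest values at the top right), and dir, which controls whether the panels are filled horizontally ("h") or vertically ("v").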
- facet_grid forms a matrix of panels defined by row and column facetting variables.
- facet_grid does not have nrow and ncol because those values are obtained automatically from the levels of the variables.
6. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?
- Putting the variable with more levels in the columns extends the plot horizontally, which suits widescreen monitors because they have more horizontal viewing space. Putting it in the rows would stack many panels vertically, so each panel would be compressed and harder to read.
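For the empty cells in exercise 2, the combinations of drv and cyl can be checked directly against the data. This is a sketch using base R's xtabs(), which the chapter itself does not use; the names drv_cyl_table and empty_cells are illustrative.

```{r}
library(ggplot2)  # provides the mpg dataset

# Cross-tabulate drive train against cylinder count
drv_cyl_table <- xtabs(~ drv + cyl, data = mpg)
drv_cyl_table

# Combinations with zero cars correspond to the empty facets
empty_cells <- subset(as.data.frame(drv_cyl_table), Freq == 0)
empty_cells
```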
#### 3.6 Geometric objects
A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom. As the plots below show, you can use different geoms to plot the same data.
```{r}
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
```
To display multiple geoms in the same plot, add multiple geom functions to ggplot():
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
```
This is a cleaner way of doing the above with less duplication:
```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()
```
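The same idea extends to the data argument: a layer can override the global data just as it can override the global mapping. A minimal sketch, in which the choice of the subcompact class and the name subcompact_cars are only for illustration:

```{r}
library(ggplot2)
library(dplyr)

# Subset used only by the smooth layer
subcompact_cars <- filter(mpg, class == "subcompact")

# Points use all of mpg; the smooth line is fitted to the subset only
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(data = subcompact_cars, se = FALSE)
```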
#### 3.6.1 Exercises
1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
- Line chart - geom_line()
- Boxplot - geom_boxplot()
- Histogram - geom_histogram()
- Area chart - geom_area()
```{r}
## line chart - two variables: continuous x, continuous y
ggplot(data = mpg) + geom_line(mapping = aes(x = displ, y = hwy))
## box plot - two variables: discrete x, continuous y
ggplot(data = mpg) + geom_boxplot(mapping = aes(x = class, y = hwy))
## histogram - one variable: continuous
ggplot(data = mpg) + geom_histogram(mapping = aes(x = hwy), bins = 20)
## area chart - one variable: continuous
ggplot(data = mpg) + geom_area(stat = "bin", mapping = aes(x = hwy), bins = 20)
```
2. Run this code:
```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point(show.legend = FALSE) +
  geom_smooth(se = FALSE, show.legend = FALSE)
```
3. What does show.legend = FALSE do?
- It removes the legend for that layer. The aesthetics are still mapped and plotted, but the key is not drawn on the graph.
```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point(show.legend = FALSE) +
  geom_smooth(se = FALSE, show.legend = FALSE)
```
4. What does the se argument to geom_smooth() do?
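- se controls whether a confidence band is drawn around the smoothed line. With se = TRUE (the default), geom_smooth() shades the confidence interval, at the 95% level unless level is changed. With se = FALSE, only the line is drawn.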
5. Will these two graphs look different? Why/why not?
```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
```
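- They look the same. Both layers use the same data and mappings in each version; the first plot declares them once in ggplot() and the layers inherit them, which avoids the duplicated code.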
6. Recreate the R code necessary to generate the following graphs.
```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(se = FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(aes(group = drv), se = FALSE) +
  geom_point()

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth(se = FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth(aes(linetype = drv), se = FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(size = 4, colour = "white") +
  geom_point(aes(colour = drv))
```
#### 3.7 Statistical transformations
The diamonds dataset comes in ggplot2 and contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond. The following chart displays the total number of diamonds in the diamonds dataset, grouped by cut. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.
```{r}
# default stat = "count"
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
```
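To see what the default stat does, the counts can be computed by hand and plotted as they are. This is a sketch that assumes dplyr's count() is an acceptable stand-in for what stat_count() computes; the name cut_counts is illustrative.

```{r}
library(ggplot2)
library(dplyr)

# One row per cut, with the number of diamonds in column n
cut_counts <- count(diamonds, cut)
cut_counts

# geom_col() plots the precomputed heights without any further transformation
ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = n))
```

The resulting bars should match the heights in the geom_bar() chart above.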
There are three reasons you might need to use a stat explicitly:
1. You might want to override the default stat.
```{r}
demo <- tribble(
  ~a,      ~b,
  "bar_1", 20,
  "bar_2", 30,
  "bar_3", 40
)
# using stat = "identity"
ggplot(data = demo) +
  geom_bar(mapping = aes(x = a, y = b), stat = "identity")
```
2. You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportion, rather than count. To find the variables computed by the stat, look for the help section titled “computed variables”.
```{r}
# after_stat(prop) replaces the older ..prop.. notation (ggplot2 >= 3.3.0)
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
```
3. You might want to draw greater attention to the statistical transformation in your code. For example, you might use stat_summary(), which summarises the y values for each unique x value, to draw attention to the summary that you’re computing:
```{r}
# fun.min/fun.max/fun replace the deprecated fun.ymin/fun.ymax/fun.y (ggplot2 >= 3.3.0)
ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )
```
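The summary that stat_summary() draws can also be computed explicitly, which makes it easier to check the plotted ranges. This is a minimal sketch using dplyr; the column names depth_min, depth_median, and depth_max are made up for this example.

```{r}
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Minimum, median, and maximum depth for each cut
depth_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(
    depth_min = min(depth),
    depth_median = median(depth),
    depth_max = max(depth)
  )
depth_summary
```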
#### 3.7.1 Exercises
1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
- Default geom is "pointrange"
```{r}
ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )
```
2. What does geom_col() do? How is it different to geom_bar()?
- geom_bar() uses the stat_count() statistical transformation to draw the bar graph. geom_col() assumes the values have already been transformed to the appropriate values. geom_bar(stat = "identity") and geom_col() are equivalent.
3. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
4. What variables does stat_smooth() compute? What parameters control its behaviour?
stat_smooth() calculates four variables:
- y - predicted value
- ymin - lower pointwise confidence interval around the mean
- ymax - upper pointwise confidence interval around the mean
- se - standard error
See ?stat_smooth for more details on the specific parameters. Most importantly, method controls the smoothing method to be employed, se determines whether the confidence interval is plotted, and level sets the confidence level to use.
5. In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?
```{r}
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))
```
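If we do not set group = 1, geom_bar() treats each level of cut as its own group, so prop is computed within each group and is always 1. Setting group = 1 puts all observations in one group, so each bar shows that cut's share of the whole dataset, which fixes the first graph. In the second graph the bars are also split by color, and group = 1 would merge those segments, so the counts are normalised by the overall total instead:
```{r}
# count / sum(count) divides each segment by the total number of diamonds
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(count / sum(count))))
```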
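As a cross-check on what group = 1 computes, the proportions can be calculated directly from the data. This sketch uses base R's table() and prop.table(), which the chapter does not use; the name cut_props is only illustrative.

```{r}
library(ggplot2)  # provides the diamonds dataset

# Share of each cut in the whole dataset
cut_props <- prop.table(table(diamonds$cut))
cut_props
```

The values sum to 1, and they should match the bar heights of the proportion bar chart drawn with group = 1.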
#### 3.8 Position adjustments