---
output: html_document
editor_options:
chunk_output_type: console
---
# Applications: Explore {#sec-explore-applications}
```{r}
#| include: false
source("_common.R")
```
\vspace{10mm}
## Case study: Effective communication of exploratory results {#sec-case-study-effective-comms}
Graphs can powerfully communicate ideas directly and quickly.
We all know, after all, that "a picture is worth 1000 words." Unfortunately, however, there are times when an image conveys a message which is inaccurate or misleading.
This chapter focuses on how graphs can best be utilized to present data accurately and effectively.
Like data modeling, creative visualization is something of an art.
However, even an art has recommended guiding principles.
We provide a few best practices for creating data visualizations.
### Keep it simple
When creating a graphic, keep in mind what it is that you'd like your reader to see.
Colors should be used to group items or differentiate levels in meaningful ways.
Colors can be distracting when they are only used to brighten up the plot.
Consider a manufacturing company that has summarized its costs into five different categories.
In the two graphics provided in @fig-pie-to-bar, notice that the magnitudes in the pie chart in @fig-pie-to-bar-1 are difficult for the eye to compare.
That is, can your eye tell how different "Buildings and administration" is from "Workplace materials" when looking at the slices of pie?
Additionally, the colors in the pie chart do not mean anything and are therefore distracting.
Lastly, the three-dimensional aspect of the image does not improve the reader's ability to understand the data presented.
As an alternative, a bar plot is provided in @fig-pie-to-bar-2.
Notice how much easier it is to identify the magnitude of the differences across categories while not being distracted by other aspects of the image.
Typically, a bar plot will be easier for the reader to digest than a pie chart, especially if the categorical data being plotted has more than just a few levels.
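
To make the comparison concrete, the short base R sketch below sorts the five cost shares and computes the gap between the two categories singled out above. The shares are copied from this chapter's `expenses` table; the object names `shares`, `shares_sorted`, and `gap` are ours and are used only for illustration.

```r
# Cost shares copied from the chapter's `expenses` table.
shares <- c(
  "Cutting tools" = 0.03,
  "Buildings and Administration" = 0.22,
  "Labor" = 0.31,
  "Machinery" = 0.27,
  "Workplace materials" = 0.17
)

# Sort from largest to smallest, the ordering a frequency-sorted bar plot uses.
shares_sorted <- sort(shares, decreasing = TRUE)
print(round(shares_sorted * 100))

# Gap between the two categories compared in the text, in percentage points.
gap <- (shares[["Buildings and Administration"]] -
  shares[["Workplace materials"]]) * 100
print(round(gap))
```

A difference of this size is hard to judge from the angles of pie slices, but it can be read directly from bar lengths on a common scale.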
\clearpage
```{r}
expenses <- tribble(
~category, ~value,
"Cutting tools" , 0.03,
"Buildings and Administration" , 0.22,
"Labor" , 0.31,
"Machinery" , 0.27,
"Workplace materials" , 0.17
) |>
mutate(value = value * 100) |>
uncount(weights = value)
```
```{r}
#| label: fig-pie-to-bar
#| fig-cap: Same information displayed with two very different visualizations.
#| fig-subcap:
#| - A three-dimensional pie chart.
#| - A bar plot.
#| fig-alt: |
#|   A three dimensional pie chart and a bar plot. Both plots show that the
#|   biggest sources of cost are labor, machinery, and buildings & administration,
#|   in that order. Very little of the costs are due to cutting tools.
#| fig-width: 3.5
#| layout: [[40,60]]
knitr::include_graphics("images/pie-3d.jpg")
ggplot(expenses, aes(x = fct_infreq(category))) +
geom_bar() +
theme_minimal() +
coord_flip() +
labs(x = NULL, y = NULL)
```
### Use color to draw attention
There are many reasons why you might choose to add **color** to your plots.
An important principle to keep in mind is to use color to draw attention.
Of course, you should still think about how visually pleasing your visualization is, and adding color simply to make it visually pleasing, without drawing attention to a particular feature, might be fine.
However, you should be critical of default coloring and explicitly decide whether to include color and how.
Notice that in @fig-red-bar-2 the coloring is done in such a way to draw the reader's attention to one particular piece of information.
The default coloring in @fig-red-bar-1 can be distracting and makes the reader question, for example, is there something similar about the red and purple bars?
Also note that not everyone sees color the same way.
It's often useful to combine color with one more feature (e.g., pattern) so that you can refer to the features you're drawing attention to in multiple ways, as shown in @fig-red-bar-3.
\vspace{-2mm}
```{r}
#| label: fig-red-bar
#| fig-cap: |
#|   Three bar charts visualizing the same information with different coloring
#|   to highlight different aspects.
#| fig-subcap:
#| - Default coloring does nothing for the understanding of the data.
#| - Color draws attention directly to the bar on Buildings and Administration.
#| - Color and linetype draw attention directly to the bar on Buildings and Administration.
#| fig-alt: |
#|   Three bar charts visualizing the same information with different coloring
#|   to highlight different aspects. First plot colors each bar (Cutting tools,
#|   Workplace materials, Buildings and Administration, Machinery, and Labor)
#|   differently, while in the second and third plots Buildings and
#|   Administration is highlighted in red and the rest of the bars are grey.
#| fig-width: 4.0
#| fig-asp: 0.57
#| out-width: 100%
#| layout: [[48,-4,48],[48,-4,48]]
ggplot(expenses, aes(y = fct_infreq(category), fill = category)) +
geom_bar(show.legend = FALSE) +
scale_fill_openintro("five") +
labs(x = NULL, y = NULL) +
scale_y_discrete(labels = label_wrap(14))
expenses |>
mutate(highlight = if_else(category == "Buildings and Administration", "yes", "no")) |>
ggplot(aes(y = fct_infreq(category), fill = highlight)) +
geom_bar(show.legend = FALSE) +
scale_fill_manual(values = c(IMSCOL["lgray", "full"], IMSCOL["red", "full"])) +
labs(x = NULL, y = NULL) +
scale_y_discrete(labels = label_wrap(14))
expenses |>
mutate(highlight = if_else(category == "Buildings and Administration", "yes", "no")) |>
ggplot(aes(y = fct_infreq(category), fill = highlight, color = highlight, linetype = highlight, size = highlight)) +
geom_bar(show.legend = FALSE) +
scale_fill_manual(values = c(IMSCOL["lgray", "full"], IMSCOL["red", "full"])) +
labs(x = NULL, y = NULL) +
scale_color_manual(values = c(IMSCOL["lgray", "full"], "white")) +
scale_size_manual(values = c(0, 0.8)) +
scale_y_discrete(labels = label_wrap(14))
```
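
The highlighting idea can be sketched without any plotting code: flag the one category you want the reader to notice, then map the flag to colors by name. The sketch below uses base R only. The color names `"gray70"` and `"red"` are stand-ins we chose for illustration (they are not the palette used for the figures in this book), and `highlight_colors` and `bar_colors` are hypothetical names.

```r
categories <- c(
  "Cutting tools", "Buildings and Administration", "Labor",
  "Machinery", "Workplace materials"
)

# Flag the single category that should stand out.
highlight <- ifelse(categories == "Buildings and Administration", "yes", "no")

# A named vector makes the flag-to-color mapping explicit.
# The colors are illustrative stand-ins, not the book's palette.
highlight_colors <- c(no = "gray70", yes = "red")

# Look up one color per bar by name.
bar_colors <- unname(highlight_colors[highlight])
print(data.frame(category = categories, highlight = highlight, color = bar_colors))
```

In ggplot2, a named vector like `highlight_colors` can be passed to `scale_fill_manual(values = ...)`, which matches colors to levels by name rather than by position.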
\clearpage
### Tell a story
For many graphs, an important aspect is the inclusion of information which is not provided in the dataset that is being plotted.
The external information serves to contextualize the data and helps communicate the narrative of the research.
In @fig-duke-hires, the graph on the right is **annotated** with information about the start of the university's fiscal year which contextualizes the information provided by the data.
Sometimes the additional information may be a diagonal line given by $y = x$.
Points above the line quickly show the reader which values have a $y$ coordinate larger than the $x$ coordinate; points below the line show the opposite.
```{r}
#| label: fig-duke-hires
#| fig-cap: |
#|   Time series plot showing monthly Duke University hiring trends
#|   over five calendar years.
#| fig-alt: |
#|   Time series plot showing monthly Duke University hiring trends
#|   over five calendar years. Hiring spikes in July.
#| fig-subcap:
#| - Colored by year
#| - Same color for all years, annotation summarizing trend
#| out-width: 100%
#| fig-asp: 0.3
#| fig-width: 10
duke_hires_raw <- read_csv("data/duke-hires.csv")
set.seed(1234)
duke_hires <- duke_hires_raw |>
mutate(
`2011` = `2010` + round(runif(12, min = -30, max = 15)),
`2011` = `2010` + round(runif(12, min = -12, max = 10) * 1.02),
`2012` = `2010` + round(runif(12, min = -11, max = 31) * 0.95),
`2013` = `2010` + round(runif(12, min = -37, max = 20) * 1.1),
`2014` = `2010` + round(runif(12, min = -10, max = 42) * 0.8),
`2015` = `2010` + round(runif(12, min = -23, max = 18) * 1.5),
`2012` = if_else(month == 6, 600, `2012`),
`2013` = if_else(month == 6, 480, `2013`),
`2014` = if_else(month == 6, 550, `2014`),
`2015` = if_else(month == 6, 430, `2015`),
`2012` = if_else(month == 8, 600 + 10, `2012`),
`2013` = if_else(month == 8, 480 + 80, `2013`),
`2014` = if_else(month == 8, 550 + 90, `2014`),
`2015` = if_else(month == 8, 430 + 140, `2015`),
`2015` = if_else(month == 7, 800, `2015`)
) |>
pivot_longer(
cols = !month,
names_to = "year",
values_to = "n"
)
ggplot(duke_hires, aes(x = month, y = n, color = year)) +
geom_line(linewidth = 1) +
scale_color_openintro("five") +
scale_x_continuous(breaks = 1:12, labels = month.abb, minor_breaks = NULL) +
scale_y_continuous(breaks = seq(0, 800, 100), minor_breaks = NULL) +
labs(
title = "Duke hires by month",
y = NULL, x = NULL, color = NULL
) +
theme(
legend.position = c(0.12, 0.65),
legend.background = element_rect(color = "gray")
)
ggplot(duke_hires, aes(x = month, y = n, group = year)) +
geom_segment(x = 7, xend = 7, y = 0, yend = 800) +
geom_line(linewidth = 0.8, color = IMSCOL["blue", "f2"]) +
scale_x_continuous(breaks = 1:12, labels = month.abb, minor_breaks = NULL) +
scale_y_continuous(breaks = seq(0, 800, 200), minor_breaks = NULL) +
labs(
title = "Duke hires by month",
y = NULL, x = NULL
) +
annotate(
"label",
x = 7, y = 250,
label = "Hires always peak in\nJuly, which is the start\nof Duke's fiscal year.",
hjust = -0.1
)
```
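
The $y = x$ reference line mentioned above encodes a simple comparison, which the base R sketch below spells out. The `x` and `y` values are made up purely for illustration and do not come from any dataset in this book.

```r
# Hypothetical paired measurements, for illustration only.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 4, 3)

# Classify each point by where it falls relative to the line y = x.
position <- ifelse(y > x, "above", ifelse(y < x, "below", "on the line"))
print(data.frame(x = x, y = y, position = position))
```

In a ggplot2 scatterplot, such a line could be drawn with `geom_abline(slope = 1, intercept = 0)`, so the reader gets this classification at a glance instead of comparing coordinates one point at a time.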
### Order matters
Most software programs have built-in methods for some of the plot details -- some order levels alphabetically, some provide functionality for arranging them in a custom order.
As seen in @fig-brexit-bars-1, the alphabetical ordering isn't particularly meaningful for describing the data.
Sometimes it makes sense to **order** the bars from tallest to shortest (or vice versa), as shown in @fig-brexit-bars-2.
But in this case, the best ordering is probably the one in which the questions were asked, as shown in @fig-brexit-bars-3.
An ordering which does not make sense in the context of the problem (e.g., alphabetical here) can mislead the reader, who might take a quick glance at the axes and not read the bar labels carefully.
In September 2019, a YouGov survey asked 1,639 adults in Great Britain the following question[^06-explore-applications-1]:
[^06-explore-applications-1]: Source: [YouGov Survey Results](https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/x0msmggx08/YouGov%20-%20Brexit%20and%202019%20election.pdf), retrieved Oct 7, 2019.
\vspace{3mm}
> How well or badly do you think the government are doing at handling Britain's exit from the European Union?
>
> - Very well
> - Fairly well
> - Fairly badly
> - Very badly
> - Don't know
\clearpage
```{r}
brexit <- tibble(
opinion = c(
rep("Very well", 123), rep("Fairly well", 219), rep("Fairly badly", 332),
rep("Very badly", 818), rep("Don't know", 151)
),
region = c(
rep("london", 10), rep("rest_of_south", 49), rep("midlands_wales", 32), rep("north", 24), rep("scot", 8),
rep("london", 18), rep("rest_of_south", 93), rep("midlands_wales", 50), rep("north", 48), rep("scot", 10),
rep("london", 33), rep("rest_of_south", 126), rep("midlands_wales", 74), rep("north", 76), rep("scot", 23),
rep("london", 118), rep("rest_of_south", 246), rep("midlands_wales", 152), rep("north", 212), rep("scot", 90),
rep("london", 16), rep("rest_of_south", 38), rep("midlands_wales", 46), rep("north", 40), rep("scot", 11)
)
)
```
```{r}
#| label: fig-brexit-bars
#| fig-cap: |
#|   Three bar charts visualizing the same information with different arrangements of levels.
#| fig-subcap:
#| - Alphabetic order
#| - Frequency order
#| - Same order as presented in the survey question
#| fig-alt: |
#|   Three bar plots with the bars (Very well, Fairly well, Fairly badly,
#|   Very badly, Don't know) arranged differently in each plot. In the first
#|   plot they're in alphabetical order, in the second in frequency order
#|   (highest Very badly to lowest Very well), and in the third plot in the
#|   same order as presented in the survey question.
#| fig-width: 4.0
#| out-width: 100%
#| layout-ncol: 2
#| fig-asp: 0.5
ggplot(brexit, aes(y = fct_rev(opinion))) +
geom_bar() +
labs(y = "Opinion", x = "Count")
ggplot(brexit, aes(y = fct_rev(fct_infreq(opinion)))) +
geom_bar() +
labs(y = "Opinion", x = "Count")
brexit |>
mutate(opinion = fct_relevel(opinion, "Very well", "Fairly well", "Fairly badly", "Very badly", "Don't know")) |>
ggplot(aes(y = opinion)) +
geom_bar() +
labs(y = "Opinion", x = "Count")
```
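
The three orderings compared above correspond to three ways of setting the levels of a factor. The sketch below is a base R illustration using the opinion counts from this chapter's `brexit` data; the figures themselves use forcats functions, so treat this as an equivalent sketch rather than the code behind the plots. The object names are ours.

```r
# Opinion counts, listed in the order the survey presented the options.
counts <- c(
  "Very well" = 123, "Fairly well" = 219, "Fairly badly" = 332,
  "Very badly" = 818, "Don't know" = 151
)
opinion <- rep(names(counts), times = counts)

# Default: factor() sorts the levels alphabetically.
alphabetical <- levels(factor(opinion))

# Frequency order: most common response first.
by_frequency <- names(sort(counts, decreasing = TRUE))

# Survey order: state the levels explicitly.
survey_order <- levels(factor(opinion, levels = names(counts)))

print(alphabetical)
print(by_frequency)
print(survey_order)
```

Only the last version keeps "Very well" through "Very badly" in their natural sequence, which is why it reads most easily for an ordinal survey question.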
### Make the labels as easy to read as possible
The Brexit survey results were additionally broken down by region in Great Britain.
The stacked bar plot allows for comparison of Brexit opinion across the five regions.
In @fig-brexit-region-1 the bars are vertical and in @fig-brexit-region-2 they are horizontal.
While the quantitative information in the two graphics is identical, flipping the graph and creating horizontal bars provides more space for the **axis labels**.
The easier the categories are to read, the more the reader will learn from the visualization.
Remember, the goal is to convey as much information as possible in a succinct and clear manner.
```{r}
brexit <- brexit |>
mutate(
region = fct_relevel(
region,
"london", "rest_of_south", "midlands_wales", "north", "scot"
),
region = fct_recode(
region,
London = "london",
`Rest of South` = "rest_of_south",
`Midlands / Wales` = "midlands_wales",
North = "north",
Scotland = "scot"
)
) |>
mutate(Opinion = fct_relevel(opinion, c("Very well", "Fairly well", "Fairly badly", "Very badly", "Don't know")))
```
::: {#fig-brexit-region layout="[[100], [100]]"}
```{r}
#| label: fig-brexit-region-1
#| fig-cap: Vertical bars across the x-axis.
#| fig-alt: |
#|   Stacked bar plot of region and opinion, where vertical bars are on
#|   the x-axis.
#| fig-asp: 0.25
#| out-width: 100%
ggplot(brexit, aes(x = region, fill = Opinion)) +
geom_bar(show.legend = FALSE) +
scale_fill_openintro("five") +
labs(x = "Region", y = "Count")
```
```{r}
#| label: fig-brexit-region-2
#| fig-cap: Horizontal bars across the y-axis.
#| fig-alt: |
#|   Stacked bar plot of region and opinion, where horizontal bars are on
#|   the y-axis.
#| fig-asp: 0.3
#| out-width: 100%
ggplot(brexit, aes(y = region, fill = Opinion)) +
geom_bar() +
labs(x = "Count", y = NULL) +
scale_fill_openintro("five") +
theme(legend.position = "bottom")
```
Stacked bar plots. Horizontal orientation makes the region labels easier to read.
:::
\clearpage
### Pick a purpose
Every graphical decision should be made with a **purpose**.
As previously mentioned, sticking with default options is not always best for conveying the narrative of your data story.
Stacked bar plots tell one part of a story.
Depending on your research question, they may not tell the part of the story most important to the research.
@fig-seg-three-ways provides three different ways of representing the same information.
If the most important comparison across regions is proportion, you might prefer @fig-seg-three-ways-1. If the most important comparison across regions also considers the total number of individuals in the region, you might prefer @fig-seg-three-ways-2. If a separate bar plot for each region makes the point you'd like, use @fig-seg-three-ways-3, which has been **faceted** by region.
@fig-seg-three-ways-3 also provides full titles and a succinct URL with the data source.
Other deliberate decisions to consider include using informative labels and avoiding redundancy.
::: {#fig-seg-three-ways layout="[[100], [100], [100]]"}
```{r}
#| label: fig-seg-three-ways-1
#| fig-cap: Stacked bar plot of region and opinion, showing percentages.
#| fig-alt: Stacked bar plot of region and opinion, showing percentages.
#| fig-asp: 0.3
#| out-width: 100%
ggplot(brexit, aes(y = region, fill = Opinion)) +
geom_bar(position = "fill") +
labs(x = "Percent", y = NULL) +
scale_fill_openintro("five") +
scale_x_continuous(labels = label_percent()) +
theme(legend.position = "top")
```
```{r}
#| label: fig-seg-three-ways-2
#| fig-cap: Stacked bar plot of region and opinion, showing counts.
#| fig-alt: Stacked bar plot of region and opinion, showing counts.
#| fig-asp: 0.3
#| out-width: 100%
ggplot(brexit, aes(y = region, fill = Opinion)) +
geom_bar() +
labs(x = "Count", y = NULL) +
scale_fill_openintro("five") +
theme(legend.position = "top")
```
```{r}
#| label: fig-seg-three-ways-3
#| fig-cap: Faceted bar plot of region and opinion, showing counts.
#| fig-alt: Faceted bar plot of region and opinion, showing counts.
#| fig-asp: 0.4
#| out-width: 100%
#| fig-width: 10
ggplot(brexit, aes(y = Opinion)) +
geom_bar() +
facet_grid(. ~ region, labeller = label_wrap_gen(width = 12)) +
scale_x_continuous(breaks = c(0, 100, 200)) +
labs(
title = "How well or badly do you think the government are doing at handling Britain's exit\nfrom the European Union?",
subtitle = "YouGov Survey Results, 2-3 September 2019",
caption = "Source: bit.ly/2lCJZVg",
x = NULL, y = NULL
) +
theme(plot.title.position = "plot")
```
Three different representations of two variables from the survey, region and opinion.
:::
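
The choice between counts and proportions can change which region stands out. The base R sketch below tabulates the same region-by-opinion counts used to build this chapter's `brexit` data and compares the two summaries for the "Very badly" response. The object names (`counts`, `props`, `most_by_count`, `most_by_share`) are ours and are used only for illustration.

```r
# Counts of each opinion (rows) by region (columns), taken from the
# construction of the chapter's `brexit` data.
counts <- matrix(
  c(
     10,  49,  32,  24,  8,
     18,  93,  50,  48, 10,
     33, 126,  74,  76, 23,
    118, 246, 152, 212, 90,
     16,  38,  46,  40, 11
  ),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    opinion = c("Very well", "Fairly well", "Fairly badly", "Very badly", "Don't know"),
    region = c("London", "Rest of South", "Midlands / Wales", "North", "Scotland")
  )
)

# Within-region proportions: each column sums to 1, which is the
# normalization a filled (100%) stacked bar plot applies.
props <- prop.table(counts, margin = 2)

# The region that stands out depends on the measure.
most_by_count <- names(which.max(counts["Very badly", ]))
most_by_share <- names(which.max(props["Very badly", ]))
print(most_by_count)
print(most_by_share)
```

With these counts, Rest of South has the most "Very badly" responses, while Scotland has the largest share of them. Neither summary is wrong; the research question determines which one the plot should emphasize.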
\clearpage
### Select meaningful colors
One last consideration when building graphs is the choice of colors.
Default or rainbow colors are not always the choice that best distinguishes the levels of your variables.
Much research has been done to find color combinations which are distinct and which are clear for differently sighted individuals.
The cividis scale works well with ordinal data [@Nunez:2018].
@fig-brexit-viridis shows the same plot with two different color themes.
\vspace{5mm}
```{r}
#| label: fig-brexit-viridis
#| fig-cap: Identical bar plots with two different coloring options.
#| fig-subcap:
#| - Default color scale
#| - Cividis scale
#| fig-alt: Identical bar plots with two different coloring options.
#| fig-asp: 0.4
#| out-width: 100%
p <- ggplot(brexit, aes(y = region, fill = Opinion)) +
geom_bar(position = "fill") +
labs(
title = "How well or badly do you think the government are doing at handling Britain's exit\nfrom the European Union?",
subtitle = "YouGov Survey Results, 2-3 September 2019",
caption = "Source: bit.ly/2lCJZVg",
x = NULL, y = NULL, fill = NULL
) +
theme(plot.title.position = "plot")
p +
scale_fill_openintro("five")
p +
scale_fill_viridis_d(option = "E")
```
\vspace{5mm}
In this chapter different representations are contrasted to demonstrate best practices in creating graphs.
The fundamental principle is that your graph should provide maximal information succinctly and clearly.
Labels should be clear and oriented horizontally for the reader.
Don't forget titles and, if possible, include the source of the data.
\clearpage
## Interactive R tutorials {#sec-explore-tutorials}
Navigate the concepts you've learned in this part in R using the following self-paced tutorials.
All you need is your browser to get started!
::: {.alltutorials data-latex=""}
[Tutorial 2: Exploratory data analysis](https://openintrostat.github.io/ims-tutorials/02-explore/)
::: {.content-hidden unless-format="pdf"}
<https://openintrostat.github.io/ims-tutorials/02-explore>
:::
:::
::: {.singletutorial data-latex=""}
[Tutorial 2 - Lesson 1: Visualizing categorical data](https://openintro.shinyapps.io/ims-02-explore-01/)
::: {.content-hidden unless-format="pdf"}
<https://openintro.shinyapps.io/ims-02-explore-01>
:::
:::
::: {.singletutorial data-latex=""}
[Tutorial 2 - Lesson 2: Visualizing numerical data](https://openintro.shinyapps.io/ims-02-explore-02/)
::: {.content-hidden unless-format="pdf"}
<https://openintro.shinyapps.io/ims-02-explore-02>
:::
:::
::: {.singletutorial data-latex=""}
[Tutorial 2 - Lesson 3: Summarizing with statistics](https://openintro.shinyapps.io/ims-02-explore-03/)
::: {.content-hidden unless-format="pdf"}
<https://openintro.shinyapps.io/ims-02-explore-03>
:::
:::
::: {.singletutorial data-latex=""}
[Tutorial 2 - Lesson 4: Case study](https://openintro.shinyapps.io/ims-02-explore-04/)
::: {.content-hidden unless-format="pdf"}
<https://openintro.shinyapps.io/ims-02-explore-04>
:::
:::
::: {.content-hidden unless-format="pdf"}
You can also access the full list of tutorials supporting this book at <https://openintrostat.github.io/ims-tutorials>.
:::
::: {.content-visible when-format="html"}
You can also access the full list of tutorials supporting this book [here](https://openintrostat.github.io/ims-tutorials).
:::
## R labs {#sec-explore-labs}
Further apply the concepts you've learned in this part in R with computational labs that walk you through a data analysis case study.
::: {.singlelab data-latex=""}
[Intro to data - Flight delays](https://www.openintro.org/go?id=ims-r-lab-intro-to-data)\
::: {.content-hidden unless-format="pdf"}
<https://www.openintro.org/go?id=ims-r-lab-intro-to-data>
:::
:::
::: {.content-hidden unless-format="pdf"}
You can also access the full list of labs supporting this book at <https://www.openintro.org/go?id=ims-r-labs>.
:::
::: {.content-visible when-format="html"}
You can also access the full list of labs supporting this book [here](https://www.openintro.org/go?id=ims-r-labs).
:::