---
engine: knitr
---
# Graphs, tables, and maps {#sec-static-communication}
::: {.callout-note}
Chapman and Hall/CRC published this book in July 2023. You can purchase that [here](https://www.routledge.com/Telling-Stories-with-Data-With-Applications-in-R/Alexander/p/book/9781032134772). This online version has some updates to what was printed.
:::
**Prerequisites**
- Read *R for Data Science*, [@r4ds]
- Focus on Chapter 1 "Data visualization", which provides an overview of `ggplot2`.
- Read *Data Visualization: A Practical Introduction*, [@healyviz]
- Focus on Chapter 3 "Make a plot", which provides an overview of `ggplot2` with different emphasis.
- Watch *The Glamour of Graphics*, [@chase2020]
- This video details ideas for how to improve a plot made with `ggplot2`.
- Read *Testing Statistical Charts: What Makes a Good Graph?*, [@vanderplas2020testing]
- This article details best practice for making graphs.
- Read *Data Feminism*, [@datafeminism2020]
- Focus on Chapter 3 "On Rational, Scientific, Objective Viewpoints from Mythical, Imaginary, Impossible Standpoints", which provides examples of why data needs to be considered within context.
- Read *Historical development of the graphical representation of statistical data*, [@funkhouser1937historical]
- Focus on Chapter 2 "The Origin of the Graphic Method", which discusses how various graphs developed.
- Read *Remove the legend to become one*, [@removethelegend]
- Goes through the process of gradually improving a graph. It is all interesting, but the graphs aspect begins with "What does this have to do with line graphs?".
- Read *Geocomputation with R*, Chapter 2 "Geographic data in R", [@lovelace2019geocomputation]
- This chapter provides an overview of mapping in `R`.
- Read *Mastering Shiny*, Chapter 1 "Your first Shiny app", [@wickham2021mastering]
- This chapter provides a self-contained example of a Shiny app.
**Key concepts and skills**
- Visualization is one way to get a sense of our data and to communicate this to the reader. Plotting the observations in a dataset is important.
- We need to be comfortable with a variety of graph types, including: bar charts, scatterplots, line plots, and histograms. We can even consider a map to be a type of graph, especially after geocoding our data.
- We should also summarize data using tables. Typical use cases for this include showing part of a dataset, summary statistics, and regression results.
**Software and packages**
- `babynames` [@citebabynames]
- Base R [@citeR]
- `carData` [@carData]
- `datasauRus` [@citedatasauRus]
- `ggmap` [@KahleWickham2013]
- `janitor` [@janitor]
- `knitr` [@citeknitr]
- `leaflet` [@ChengKarambelkarXie2017]
- `mapdeck` [@citemapdeck]
- `maps` [@citemaps]
- `mapproj` [@mapproj]
- `modelsummary` [@citemodelsummary]
- `opendatatoronto` [@citeSharla]
- `patchwork` [@citepatchwork]
- `shiny` [@citeshiny]
- `tidygeocoder` [@tidygeocoder]
- `tidyverse` [@tidyverse]
- `tinytable` [@tinytable]
- `troopdata` [@troopdata]
- `usethis` [@usethis]
- `WDI` [@WDI]
```{r}
#| message: false
#| warning: false
library(babynames)
library(carData)
library(datasauRus)
library(ggmap)
library(janitor)
library(knitr)
library(leaflet)
library(mapdeck)
library(maps)
library(mapproj)
library(modelsummary)
library(opendatatoronto)
library(patchwork)
library(tidygeocoder)
library(tidyverse)
library(tinytable)
library(troopdata)
library(shiny)
library(usethis)
library(WDI)
```
## Introduction
When telling stories with data, we would like the data to do much of the work of convincing our reader. The paper is the medium, and the data are the message. To that end, we want to show our reader the data that allowed us to come to our understanding of the story. We use graphs, tables, and maps to help achieve this.
Try to show the observations that underpin our analysis. For instance, if your dataset consists of 2,500 responses to a survey, then at some point in the paper you should have one or more plots that contain each of the 2,500 observations, for every variable of interest. To do this we build graphs using `ggplot2`, which is part of the core `tidyverse` and so does not have to be installed or loaded separately. In this chapter we go through a variety of different options including bar charts, scatterplots, line plots, and histograms.
In contrast to the role of graphs, which is to show each observation, the role of tables is typically to show an extract of the dataset or to convey summary statistics or regression results. We will build tables primarily using `knitr`. Later we will use `modelsummary` to build tables related to regression output.
Finally, we cover maps as a variant of graphs that are used to show a particular type of data. We will build static maps using `ggmap` after having obtained geocoded data using `tidygeocoder`.
## Graphs
> A world turning to a saner and richer civilization will be a world turning to charts.
>
> @karsetn [p. 684]
Graphs\index{graphs} are a critical aspect of compelling data stories. They allow us to see both broad patterns and details [@elementsofgraphingdata, p. 5]. Graphs enable a familiarity with our data that is hard to get from any other method. Every variable of interest should be graphed.
The most important objective of a graph is to convey as much of the actual data, and its context, as possible. In a way, graphing is an information encoding process where we construct a deliberate representation to convey information to our audience. The audience must decode that representation. The success of our graph depends on how much information is lost in this process so the decoding is a critical aspect [@elementsofgraphingdata, p. 221]. This means that we must focus on creating effective graphs that are suitable for our specific audience.
To see why graphing the actual data is important\index{graphs!importance of}, after installing and loading `datasauRus` consider the `datasaurus_dozen` dataset.
```{r}
datasaurus_dozen
```
The dataset consists of values for "x" and "y", which should be plotted on the x-axis and y-axis, respectively. There are 13 different values in the variable "dataset" including: "dino", "star", "away", and "bullseye". We focus on those four and generate summary statistics for each (@tbl-datasaurussummarystats).
```{r}
#| label: tbl-datasaurussummarystats
#| tbl-cap: "Mean and standard deviation for four datasauRus datasets"
# Based on: https://juliasilge.com/blog/datasaurus-multiclass/
datasaurus_dozen |>
  filter(dataset %in% c("dino", "star", "away", "bullseye")) |>
  summarise(across(c(x, y), list(mean = mean, sd = sd)),
            .by = dataset) |>
  tt() |>
  style_tt(j = 2:5, align = "r") |>
  format_tt(digits = 1, num_fmt = "decimal") |>
  setNames(c("Dataset", "x mean", "x sd", "y mean", "y sd"))
```
Notice that the summary statistics are similar (@tbl-datasaurussummarystats). Despite this, the datasets turn out to be very different beasts. This becomes clear when we plot the data (@fig-datasaurusgraph).
```{r}
#| eval: true
#| fig-cap: "Graph of four datasauRus datasets"
#| label: fig-datasaurusgraph
#| warning: false
#| echo: true
datasaurus_dozen |>
filter(dataset %in% c("dino", "star", "away", "bullseye")) |>
ggplot(aes(x = x, y = y, colour = dataset)) +
geom_point() +
theme_minimal() +
facet_wrap(vars(dataset), nrow = 2, ncol = 2) +
labs(color = "Dataset")
```
We get a similar lesson---always plot your data---from "Anscombe's Quartet", created by the twentieth century statistician Frank Anscombe. The key takeaway is that it is important to plot the actual data and not rely solely on summary statistics.\index{graphs!not relying on summary statistics}
```{r}
head(anscombe)
```
::: {.content-visible when-format="pdf"}
Anscombe's Quartet consists of eleven observations for four different datasets, with x and y values for each observation. We need to manipulate this dataset with `pivot_longer()` to get it into the "tidy" format discussed in the ["R Essentials" Online Appendix](https://tellingstorieswithdata.com/20-r_essentials.html).
:::
::: {.content-visible unless-format="pdf"}
Anscombe's Quartet consists of eleven observations for four different datasets, with x and y values for each observation. We need to manipulate this dataset with `pivot_longer()` to get it into the "tidy" format discussed in [Online Appendix -@sec-r-essentials].
:::
```{r}
# From: https://www.njtierney.com/post/2020/06/01/tidy-anscombe/
# And the pivot_longer() vignette.
tidy_anscombe <-
  anscombe |>
  pivot_longer(
    everything(),
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
  )
```
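To see what the `.value` sentinel and `names_pattern` are doing, it can help to try the same call on a tiny made-up tibble. The following is a minimal sketch; `toy_wide` and its values are invented for illustration and are not part of `anscombe`.

```{r}
library(tidyr)
library(tibble)

# Hypothetical wide data: two sets, each with an x column and a y column
toy_wide <- tibble(x1 = c(1, 2), x2 = c(3, 4), y1 = c(5, 6), y2 = c(7, 8))

# "(.)(.)" splits each column name into two one-character pieces:
# the first piece (".value") becomes the new column name (x or y),
# the second piece becomes the value of "set" (1 or 2)
toy_long <-
  toy_wide |>
  pivot_longer(
    everything(),
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
  )

toy_long
```

Each original row contributes one row per set, so the two-row toy input becomes four rows with the columns `set`, `x`, and `y`.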
We can first create summary statistics (@tbl-anscombesummarystats) and then plot the data (@fig-anscombegraph). This again illustrates the importance of graphing the actual data, rather than relying on summary statistics.
```{r}
#| label: tbl-anscombesummarystats
#| message: false
#| tbl-cap: "Mean and standard deviation for Anscombe's quartet"
tidy_anscombe |>
summarise(
across(c(x, y), list(mean = mean, sd = sd)),
.by = set
) |>
tt() |>
style_tt(j = 2:5, align = "r") |>
format_tt(digits = 1, num_fmt = "decimal") |>
setNames(c("Dataset", "x mean", "x sd", "y mean", "y sd"))
```
```{r}
#| eval: true
#| fig-cap: "Recreation of Anscombe's Quartet"
#| label: fig-anscombegraph
#| warning: false
#| echo: true
tidy_anscombe |>
ggplot(aes(x = x, y = y, colour = set)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
theme_minimal() +
facet_wrap(vars(set), nrow = 2, ncol = 2) +
labs(colour = "Dataset") +
theme(legend.position = "bottom")
```
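The same point can be made with another summary statistic. As a rough sketch, using only base R and the wide `anscombe` data frame, we could compute the correlation between x and y within each of the four sets. The name `anscombe_correlations` is our own choice for illustration.

```{r}
# Correlation between x and y within each of the four Anscombe sets,
# using the wide `anscombe` data frame that ships with base R
anscombe_correlations <-
  sapply(1:4, function(i) {
    cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
  })

round(anscombe_correlations, 3)
```

The four correlations agree to three decimal places, even though the graphs show that the relationships are quite different.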
### Bar charts
We typically use a bar chart\index{graphs!bar chart} when we have a categorical variable that we want to focus on. We saw an example of this in @sec-fire-hose when we constructed a graph of the number of occupied beds. The geometric object---a "geom"---that we primarily use is `geom_bar()`, but there are many variants to cater for specific situations. To illustrate the use of bar charts, we use a dataset from the 1997-2001 British Election Panel Study that was put together by @fox2006effect and made available with `BEPS`, after installing and loading `carData`.\index{gender!British Election Panel Study}\index{British Election Panel Study}
```{r}
beps <-
  BEPS |>
  as_tibble() |>
  clean_names() |>
  select(age, vote, gender, political_knowledge)
```
The dataset consists of which party the respondent supports, along with various demographic, economic, and political variables. In particular, we have the age of the respondent. We begin by creating age-groups from the ages, and making a bar chart showing the frequency of each age-group using `geom_bar()` (@fig-bepfitst-1).
```{r}
beps <-
  beps |>
  mutate(
    age_group =
      case_when(
        age < 35 ~ "<35",
        age < 50 ~ "35-49",
        age < 65 ~ "50-64",
        age < 80 ~ "65-79",
        age < 100 ~ "80-99"
      ),
    age_group =
      factor(age_group, levels = c("<35", "35-49", "50-64", "65-79", "80-99"))
  )
```
```{r}
#| label: fig-bepfitst
#| eval: true
#| fig-cap: "Distribution of age-groups in the 1997-2001 British Election Panel Study"
#| echo: true
#| fig-subcap: ["Using `geom_bar()`", "Using `count()` and `geom_col()`"]
#| layout-ncol: 2
beps |>
ggplot(mapping = aes(x = age_group)) +
geom_bar() +
theme_minimal() +
labs(x = "Age group", y = "Number of observations")
beps |>
count(age_group) |>
ggplot(mapping = aes(x = age_group, y = n)) +
geom_col() +
theme_minimal() +
labs(x = "Age group", y = "Number of observations")
```
The default axis label used by `ggplot2` is the name of the relevant variable, so it is often useful to add more detail. We do this using `labs()` by specifying an aesthetic, such as `x` or `y`, and the label that we want for it. In the case of @fig-bepfitst-1 we have specified labels for the x-axis and y-axis.
By default, `geom_bar()` creates a count of the number of times each age-group appears in the dataset. It does this because the default statistical transformation---a "stat"---for `geom_bar()` is "count", which saves us from having to create that statistic ourselves. But if we had already constructed a count (for instance, with `beps |> count(age_group)`), then we could specify a variable for the y-axis and then use `geom_col()` (@fig-bepfitst-2).
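One way to check that the two approaches draw the same bars is to compare the heights that each layer computes. This is a small sketch on made-up data (the `toy` tibble is hypothetical, not the BEPS data), and it uses `layer_data()` from `ggplot2` to extract what each layer would draw.

```{r}
library(dplyr)
library(ggplot2)

# Made-up data: six respondents across three age-groups
toy <- tibble::tibble(
  age_group = factor(
    c("<35", "35-49", "35-49", "50-64", "50-64", "50-64"),
    levels = c("<35", "35-49", "50-64")
  )
)

# geom_bar() counts the rows in each group for us
bar_plot <- ggplot(toy, aes(x = age_group)) + geom_bar()

# count() creates the column n, which geom_col() uses as the bar height
col_plot <-
  toy |>
  count(age_group) |>
  ggplot(aes(x = age_group, y = n)) +
  geom_col()

# layer_data() returns the data that each layer actually draws
bar_heights <- layer_data(bar_plot)$y
col_heights <- layer_data(col_plot)$y
all(bar_heights == col_heights)
```

The choice between the two then mostly comes down to whether the counts already exist in the dataset.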
We may also like to consider various groupings of the data to get a different insight. For instance, we can use color to look at which party the respondent supports, by age-group (@fig-bepsecond-1).
```{r}
#| echo: true
#| eval: true
#| fig-cap: "Distribution of age-group, and vote preference, in the 1997-2001 British Election Panel Study"
#| label: fig-bepsecond
#| fig-subcap: ["Using `geom_bar()`", "Using `geom_bar()` with dodge2"]
#| layout-ncol: 2
beps |>
ggplot(mapping = aes(x = age_group, fill = vote)) +
geom_bar() +
labs(x = "Age group", y = "Number of observations", fill = "Vote") +
theme(legend.position = "bottom")
beps |>
ggplot(mapping = aes(x = age_group, fill = vote)) +
geom_bar(position = "dodge2") +
labs(x = "Age group", y = "Number of observations", fill = "Vote") +
theme(legend.position = "bottom")
```
By default, these different groups are stacked, but they can be placed side by side with `position = "dodge2"` (@fig-bepsecond-2). (Using "dodge2" rather than "dodge" adds a little space between the bars.)
#### Themes
At this point, we may like to address the general look of the graph. There are various themes that are built into `ggplot2`. These include: `theme_bw()`, `theme_classic()`, `theme_dark()`, and `theme_minimal()`. A full list is available in the `ggplot2` [cheat sheet](https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf). We can use these themes by adding them as a layer (@fig-bepthemes). We could also install more themes from other packages, including `ggthemes` [@ggthemes], and `hrbrthemes` [@hrbrthemes]. We could even build our own!
```{r}
#| echo: true
#| eval: true
#| fig-cap: "Distribution of age-groups, and vote preference, in the 1997-2001 British Election Panel Study, illustrating different themes and the use of `patchwork`"
#| label: fig-bepthemes
#| warning: false
theme_bw <-
beps |>
ggplot(mapping = aes(x = age_group)) +
geom_bar(position = "dodge") +
theme_bw()
theme_classic <-
beps |>
ggplot(mapping = aes(x = age_group)) +
geom_bar(position = "dodge") +
theme_classic()
theme_dark <-
beps |>
ggplot(mapping = aes(x = age_group)) +
geom_bar(position = "dodge") +
theme_dark()
theme_minimal <-
beps |>
ggplot(mapping = aes(x = age_group)) +
geom_bar(position = "dodge") +
theme_minimal()
(theme_bw + theme_classic) / (theme_dark + theme_minimal)
```
In @fig-bepthemes we use `patchwork` to bring together multiple graphs. To do this, after installing and loading the package, we assign each graph to a variable. We then use "+" to signal which graphs should be next to each other, "/" to signal which should be on top, and brackets to indicate precedence.
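The operators are easier to see with a smaller layout. The following sketch uses the `mpg` dataset that comes with `ggplot2` as a stand-in; the plots `p1`, `p2`, and `p3` are placeholders of our own.

```{r}
library(ggplot2)
library(patchwork)

# Three small placeholder plots using the `mpg` data that comes with ggplot2
p1 <- ggplot(mpg, aes(x = class)) + geom_bar()
p2 <- ggplot(mpg, aes(x = drv)) + geom_bar()
p3 <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# p1 and p2 sit side by side on the top row; p3 spans the bottom row
combined <- (p1 + p2) / p3
combined
```

Without the brackets, "/" would be applied before "+" because of the usual operator precedence in R, and so the layout would change.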
#### Facets
We use facets\index{graphs!facets} to show variation, based on one or more variables [@grammarofgraphics, p. 219]. Facets are especially useful when we have already used color to highlight variation in some other variable. For instance, we may be interested to explain vote, by age and gender (@fig-facets). We rotate the x-axis with `guides(x = guide_axis(angle = 90))` to avoid overlapping. We also change the position of the legend with `theme(legend.position = "bottom")`.
```{r}
#| echo: true
#| eval: true
#| fig-cap: "Distribution of age-group by gender, and vote preference, in the 1997-2001 British Election Panel Study"
#| label: fig-facets
#| warning: false
beps |>
ggplot(mapping = aes(x = age_group, fill = gender)) +
geom_bar() +
theme_minimal() +
labs(
x = "Age-group of respondent",
y = "Number of respondents",
fill = "Gender"
) +
facet_wrap(vars(vote)) +
guides(x = guide_axis(angle = 90)) +
theme(legend.position = "bottom")
```
We could change `facet_wrap()` to wrap vertically instead of horizontally with `dir = "v"`. Alternatively, we could specify a few rows, say `nrow = 2`, or a number of columns, say `ncol = 2`.
By default, all facets will have the same x-axis and y-axis. We could allow each facet to have different scales on both axes with `scales = "free"`, on just the x-axis with `scales = "free_x"`, or on just the y-axis with `scales = "free_y"` (@fig-facetsfancy).
```{r}
#| echo: true
#| eval: true
#| fig-cap: "Distribution of age-group by gender, and vote preference, in the 1997-2001 British Election Panel Study"
#| label: fig-facetsfancy
#| warning: false
beps |>
ggplot(mapping = aes(x = age_group, fill = gender)) +
geom_bar() +
theme_minimal() +
labs(
x = "Age-group of respondent",
y = "Number of respondents",
fill = "Gender"
) +
facet_wrap(vars(vote), scales = "free") +
guides(x = guide_axis(angle = 90)) +
theme(legend.position = "bottom")
```
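The layout and scale options can be combined. As a sketch, again using the `mpg` dataset from `ggplot2` as a stand-in rather than the BEPS data, we could wrap vertically, ask for two rows, and free only the y-axis.

```{r}
library(ggplot2)

# `mpg` ships with ggplot2; `drv` has three levels, so there are three facets
faceted <-
  ggplot(mpg, aes(x = class)) +
  geom_bar() +
  facet_wrap(
    vars(drv),
    dir = "v",         # fill the layout column by column
    nrow = 2,          # use two rows, so three facets need two columns
    scales = "free_y"  # each facet gets its own y-axis; the x-axis is shared
  ) +
  guides(x = guide_axis(angle = 90))

faceted
```

Freeing an axis makes it easier to see variation within each facet, but harder to compare magnitudes between facets, so it is worth mentioning in the caption or text when we do this.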
Finally, we can change the labels of the facets using `labeller()` (@fig-facetsfancylabels).
```{r}
#| echo: true
#| eval: true
#| fig-cap: "Distribution of age-group by political knowledge, and vote preference, in the 1997-2001 British Election Panel Study"
#| label: fig-facetsfancylabels
#| warning: false
new_labels <-
c("0" = "No knowledge", "1" = "Low knowledge",
"2" = "Moderate knowledge", "3" = "High knowledge")
beps |>
ggplot(mapping = aes(x = age_group, fill = vote)) +
geom_bar() +
theme_minimal() +
labs(
x = "Age-group of respondent",
y = "Number of respondents",
fill = "Voted for"
) +
facet_wrap(
vars(political_knowledge),
scales = "free",
labeller = labeller(political_knowledge = new_labels)
) +
guides(x = guide_axis(angle = 90)) +
theme(legend.position = "bottom")
```
We now have three ways to combine multiple graphs: sub-figures, facets, and `patchwork`. They are useful in different circumstances:
- sub-figures---which we covered in @sec-reproducible-workflows---for when we are considering different variables;
- facets for when we are considering a categorical variable; and
- `patchwork` for when we are interested in bringing together entirely different graphs.
#### Colors
We now turn to the colors\index{graphs!color} used in the graph. There are a variety of different ways to change the colors. The many palettes available from `RColorBrewer` [@RColorBrewer] can be specified using `scale_fill_brewer()`. In the case of `viridis` [@viridis] we can specify the palettes using `scale_fill_viridis_d()`. Additionally, `viridis` is particularly focused on color-blind-friendly palettes (@fig-usecolor). Neither `RColorBrewer` nor `viridis` needs to be explicitly installed or loaded because `ggplot2`, which is part of the `tidyverse`, takes care of that for us.
::: callout-note
## Shoulders of giants
The name of the "brewer" palette refers to Cindy Brewer\index{Brewer, Cindy} [@brewerisarealperson]. After earning a PhD in Geography from Michigan State University in 1991, she joined San Diego State University as an assistant professor, moving to Pennsylvania State University in 1994, where she was promoted to full professor in 2007. One of her best-known books is *Designing Better Maps: A Guide for GIS Users* [@brewerbook]. In 2019 she became only the ninth person to have been awarded the O. M. Miller Cartographic Medal since it was established in 1968.\index{O. M. Miller Cartographic Medal}
:::
```{r}
#| echo: true
#| eval: true
#| message: false
#| warning: false
#| fig-cap: "Distribution of age-group and vote preference, in the 1997-2001 British Election Panel Study, illustrating different colors"
#| label: fig-usecolor
#| fig-subcap: ["Brewer palette 'Blues'", "Brewer palette 'Set1'", "Viridis palette default", "Viridis palette 'magma'"]
#| layout-ncol: 2
# Panel (a)
beps |>
ggplot(mapping = aes(x = age_group, fill = vote)) +
geom_bar() +
theme_minimal() +
labs(x = "Age-group", y = "Number", fill = "Voted for") +
theme(legend.position = "bottom") +
scale_fill_brewer(palette = "Blues")
# Panel (b)
beps |>
ggplot(mapping = aes(x = age_group, fill = vote)) +
geom_bar() +
theme_minimal() +
labs(x = "Age-group", y = "Number", fill = "Voted for") +
theme(legend.position = "bottom") +
scale_fill_brewer(palette = "Set1")
# Panel (c)
beps |>
ggplot(mapping = aes(x = age_group, fill = vote)) +
geom_bar() +
theme_minimal() +
labs(x = "Age-group", y = "Number", fill = "Voted for") +
theme(legend.position = "bottom") +
scale_fill_viridis_d()
# Panel (d)
beps |>
ggplot(mapping = aes(x = age_group, fill = vote)) +
geom_bar() +
theme_minimal() +
labs(x = "Age-group", y = "Number", fill = "Voted for") +
theme(legend.position = "bottom") +
scale_fill_viridis_d(option = "magma")
```
In addition to using pre-built palettes, we could build our own palette. That said, color is something to be considered with care. It should be used to increase the amount of information that is communicated [@elementsofgraphingdata]. Color should not be added to graphs unnecessarily---that is to say, it should play some role. Typically, that role is to distinguish different groups, which implies making the colors dissimilar. Color may also be appropriate if there is some relationship between the color and the variable.
For instance, if making a graph of the price of mangoes and raspberries, then it could help the reader decode the information if the colors were yellow and red, respectively [@franconeri2021science, p. 121].
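A minimal sketch of that idea uses `scale_fill_manual()` with a named vector of colors. The prices here are made up purely for illustration, and the names `fruit_prices` and `fruit_colors` are our own.

```{r}
library(ggplot2)

# Made-up prices, purely for illustration
fruit_prices <- data.frame(
  fruit = c("Mango", "Raspberry"),
  price = c(2.5, 4.0)
)

# A named vector ties each group to a specific color
fruit_colors <- c("Mango" = "gold", "Raspberry" = "firebrick")

fruit_plot <-
  ggplot(fruit_prices, aes(x = fruit, y = price, fill = fruit)) +
  geom_col() +
  theme_minimal() +
  labs(x = "Fruit", y = "Price (made-up units)", fill = "Fruit") +
  scale_fill_manual(values = fruit_colors)

fruit_plot
```

Naming the elements of the vector means the mapping does not depend on the order in which the groups happen to appear.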
### Scatterplots
We are often interested in the relationship between two numeric or continuous variables. We can use scatterplots\index{graphs!scatterplot} to show this. A scatterplot may not always be the best choice, but it is rarely a bad one [@weissgerber2015beyond]. Some consider it the most versatile and useful graph option [@historyofdataviz, p. 121]. To illustrate scatterplots, we install and load `WDI` and then use that to download some economic indicators from the World Bank\index{World Bank!economic data}. In particular, we use `WDIsearch()` to find the unique key that we need to pass to `WDI()` to facilitate the download.
:::{.callout-note}
## Oh, you think we have good data on that!
From @EssentialMacroAggregates [p. 15], Gross Domestic Product (GDP) "combines in a single figure, and with no double counting, all the output (or production) carried out by all the firms, non-profit institutions, government bodies and households in a given country during a given period, regardless of the type of goods and services produced, provided that the production takes place within the country's economic territory." \index{Gross Domestic Product (GDP)} The modern concept was developed by the twentieth-century economist Simon Kuznets and is widely used and reported. There is a certain comfort in having a definitive and concrete single number to describe something as complicated as the economic activity of a country. It is useful and informative that we have such summary statistics. But as with any summary statistic, its strength is also its weakness. A single number necessarily loses information about constituent components, and disaggregated differences can be important [@Moyer2020Measuring]. It highlights short-term economic progress over longer-term improvements. And "the quantitative definiteness of the estimates makes it easy to forget their dependence upon imperfect data and the consequently wide margins of possible error to which both totals and components are liable" [@NationalIncomeAndItsComposition, p. xxvi]. Summary measures of economic performance show only one side of a country's economy. While there are many strengths, there are also well-known areas where GDP is weak.
:::
```{r}
#| echo: true
#| eval: false
WDIsearch("gdp growth")
WDIsearch("inflation")
WDIsearch("population, total")
WDIsearch("Unemployment, total")
```
```{r}
#| echo: true
#| eval: false
world_bank_data <-
WDI(
indicator =
c("FP.CPI.TOTL.ZG", "NY.GDP.MKTP.KD.ZG", "SP.POP.TOTL","SL.UEM.TOTL.NE.ZS"),
country = c("AU", "ET", "IN", "US")
)
```
```{r}
#| echo: false
#| eval: false
# INTERNAL
write_csv(world_bank_data, "inputs/data/world_bank_data.csv")
```
```{r}
#| eval: true
#| warning: false
#| echo: false
# INTERNAL
world_bank_data <-
read_csv(
"inputs/data/world_bank_data.csv",
show_col_types = FALSE
)
```
We may like to change the variable names to be more meaningful, and only keep those that we need.
```{r}
#| echo: true
#| eval: true
world_bank_data <-
world_bank_data |>
rename(
inflation = FP.CPI.TOTL.ZG,
gdp_growth = NY.GDP.MKTP.KD.ZG,
population = SP.POP.TOTL,
unem_rate = SL.UEM.TOTL.NE.ZS
) |>
select(country, year, inflation, gdp_growth, population, unem_rate)
head(world_bank_data)
```
To get started we can use `geom_point()` to make a scatterplot showing GDP growth and inflation, by country (@fig-scattorplot-1).
```{r}
#| warning: false
#| label: fig-scattorplot
#| fig-cap: "Relationship between inflation and GDP growth for Australia, Ethiopia, India, and the United States"
#| fig-subcap: ["Default settings", "With the addition of a theme and labels"]
#| layout-ncol: 2
# Panel (a)
world_bank_data |>
ggplot(mapping = aes(x = gdp_growth, y = inflation, color = country)) +
geom_point()
# Panel (b)
world_bank_data |>
ggplot(mapping = aes(x = gdp_growth, y = inflation, color = country)) +
geom_point() +
theme_minimal() +
labs(x = "GDP growth", y = "Inflation", color = "Country")
```
As with bar charts, we can change the theme, and update the labels (@fig-scattorplot-2).
For scatterplots we use "color" rather than the "fill" that we used for bar charts, because scatterplots use dots rather than bars. This also slightly affects how we change the palette (@fig-scatterplotnicercolor). That said, with particular types of dots, for instance `shape = 21`, it is possible to have both `fill` and `color` aesthetics.
```{r}
#| echo: true
#| eval: true
#| message: false
#| warning: false
#| label: fig-scatterplotnicercolor
#| fig-cap: "Relationship between inflation and GDP growth for Australia, Ethiopia, India, and the United States"
#| fig-subcap: ["Brewer palette 'Blues'", "Brewer palette 'Set1'", "Viridis palette default", "Viridis palette 'magma'"]
#| layout-ncol: 2
# Panel (a)
world_bank_data |>
ggplot(aes(x = gdp_growth, y = inflation, color = country)) +
geom_point() +
theme_minimal() +
labs(x = "GDP growth", y = "Inflation", color = "Country") +
theme(legend.position = "bottom") +
scale_color_brewer(palette = "Blues")
# Panel (b)
world_bank_data |>
ggplot(aes(x = gdp_growth, y = inflation, color = country)) +
geom_point() +
theme_minimal() +
labs(x = "GDP growth", y = "Inflation", color = "Country") +
theme(legend.position = "bottom") +
scale_color_brewer(palette = "Set1")
# Panel (c)
world_bank_data |>
ggplot(aes(x = gdp_growth, y = inflation, color = country)) +
geom_point() +
theme_minimal() +
labs(x = "GDP growth", y = "Inflation", color = "Country") +
theme(legend.position = "bottom") +
scale_colour_viridis_d()
# Panel (d)
world_bank_data |>
ggplot(aes(x = gdp_growth, y = inflation, color = country)) +
geom_point() +
theme_minimal() +
labs(x = "GDP growth", y = "Inflation", color = "Country") +
theme(legend.position = "bottom") +
scale_colour_viridis_d(option = "magma")
```
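For dots that accept both aesthetics, such as `shape = 21`, a small sketch might look like the following. The data are made up for illustration (`toy_points` is hypothetical), with "fill" mapped to the group and "color" fixed for the outline.

```{r}
library(ggplot2)

# Made-up values, purely for illustration
toy_points <- data.frame(
  gdp_growth = c(1, 2, 3, 4),
  inflation = c(2, 1, 4, 3),
  country = c("A", "A", "B", "B")
)

# shape = 21 is a circle with a separate outline ("color") and interior ("fill")
outlined_points <-
  ggplot(toy_points, aes(x = gdp_growth, y = inflation, fill = country)) +
  geom_point(shape = 21, color = "black", size = 3) +
  theme_minimal() +
  labs(x = "GDP growth", y = "Inflation", fill = "Country") +
  scale_fill_brewer(palette = "Set1")

outlined_points
```

Because the group is now mapped to "fill", we change the palette with `scale_fill_brewer()` rather than `scale_color_brewer()`.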
The points of a scatterplot sometimes overlap. We can address this situation in a variety of ways (@fig-alphajitter):
1) Adding a degree of transparency\index{graphs!transparency} to our dots with "alpha" (@fig-alphajitter-1).\index{graphs!alpha} The value for "alpha" can vary between 0, which is fully transparent, and 1, which is completely opaque.
2) Adding a small amount of noise, which slightly moves the points, using `geom_jitter()` (@fig-alphajitter-2).\index{graphs!jitter} By default, the movement is uniform in both directions, but we can specify which direction movement occurs with "width" or "height". The decision between these two options turns on the degree to which accuracy matters, and the number of points: it is often useful to use `geom_jitter()` when you want to highlight the relative density of points and not necessarily the exact value of individual points. When using `geom_jitter()` it is a good idea to set a seed, as introduced in @sec-fire-hose, for reproducibility.
```{r}
#| fig-cap: "Relationship between inflation and GDP growth for Australia, Ethiopia, India, and the United States"
#| label: fig-alphajitter
#| warning: false
#| fig-subcap: ["Changing the alpha setting", "Using jitter"]
#| layout-ncol: 2
set.seed(853)
# Panel (a)
world_bank_data |>
ggplot(aes(x = gdp_growth, y = inflation, color = country)) +
geom_point(alpha = 0.5) +
theme_minimal() +
labs(x = "GDP growth", y = "Inflation", color = "Country")
# Panel (b)
world_bank_data |>
ggplot(aes(x = gdp_growth, y = inflation, color = country)) +
geom_jitter(width = 1, height = 1) +
theme_minimal() +
labs(x = "GDP growth", y = "Inflation", color = "Country")
```
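As a minimal sketch of restricting jitter to one direction, the following uses a small made-up dataset (`toy` is hypothetical, not the World Bank data) and moves points horizontally only by setting "height" to zero. It also fixes the seed inside `position_jitter()`, which is an alternative to calling `set.seed()` beforehand, so that the jittered positions should be the same each time the graph is built.

```r
library(ggplot2)

# Hypothetical toy data in which many points share the same coordinates
toy <- data.frame(
  x = rep(1:3, each = 20),
  y = rep(1:2, times = 30)
)

# Move points left and right by up to 0.2, but not up and down
jittered_plot <-
  ggplot(toy, aes(x = x, y = y)) +
  geom_point(
    alpha = 0.5,
    position = position_jitter(width = 0.2, height = 0, seed = 853)
  ) +
  theme_minimal()

# layer_data() returns the positions that are actually drawn
jittered_positions <- layer_data(jittered_plot)

range(jittered_positions$x)
unique(jittered_positions$y)
```

Because "height" is zero, the vertical values are left as they were, which matters when the exact value on that axis should stay readable.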
We often use scatterplots to illustrate a relationship between two continuous variables.\index{graphs!continuous variables} It can be useful to add a "summary" line using `geom_smooth()` (@fig-scattorplottwo).\index{graphs!best fit} We can specify the relationship using "method", change the color with "color", and add or remove standard errors with "se".
A commonly used "method" is `lm`, which computes and plots a simple linear regression line similar to using the `lm()` function. Using `geom_smooth()` adds a layer to the graph, and so it inherits aesthetics from `ggplot()`. For instance, that is why we have one line for each country in @fig-scattorplottwo-1 and @fig-scattorplottwo-2. We could override that by specifying a particular color (@fig-scattorplottwo-3). There are situations where other types of fitted lines, such as splines, might be preferred.
```{r}
#| message: false
#| warning: false
#| fig-cap: "Relationship between inflation and GDP growth for Australia, Ethiopia, India, and the United States"
#| label: fig-scattorplottwo
#| fig-subcap: ["Default line of best fit", "Specifying a linear relationship", "Specifying only one color"]
#| layout-ncol: 2
# Panel (a)
world_bank_data |>
ggplot(aes(x = gdp_growth, y = inflation, color = country)) +
geom_jitter() +
geom_smooth() +
theme_minimal() +
labs(x = "GDP growth", y = "Inflation", color = "Country")
# Panel (b)
world_bank_data |>
ggplot(aes(x = gdp_growth, y = inflation, color = country)) +
geom_jitter() +
geom_smooth(method = lm, se = FALSE) +
theme_minimal() +
labs(x = "GDP growth", y = "Inflation", color = "Country")
# Panel (c)
world_bank_data |>
ggplot(aes(x = gdp_growth, y = inflation, color = country)) +
geom_jitter() +
geom_smooth(method = lm, color = "black", se = FALSE) +
theme_minimal() +
labs(x = "GDP growth", y = "Inflation", color = "Country")
```
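To make the link between `geom_smooth(method = "lm")` and `lm()` concrete, the following sketch fits both to simulated data and compares them. The dataset `simulated_data` and its coefficients are made up for illustration, and `formula = y ~ x` is written out explicitly rather than relying on the default.

```r
library(ggplot2)

set.seed(853)

# Hypothetical data with a known linear relationship plus noise
simulated_data <- data.frame(x = runif(n = 100, min = 0, max = 10))
simulated_data$y <- 2 + 0.5 * simulated_data$x + rnorm(n = 100)

# Fit the regression directly
direct_fit <- lm(y ~ x, data = simulated_data)

# Ask geom_smooth() for the same kind of line
smooth_plot <-
  ggplot(simulated_data, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  theme_minimal()

# The second layer holds the points that make up the fitted line
fitted_line <- layer_data(smooth_plot, 2)

# Predictions from lm() at the same x values
lm_predictions <-
  predict(direct_fit, newdata = data.frame(x = fitted_line$x))

all.equal(unname(lm_predictions), fitted_line$y)
```

If the two agree, the line drawn by `geom_smooth()` is the regression line we would get from `lm()`, evaluated over a grid of x values.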
### Line plots
We can use a line plot\index{graphs!line plot} when we have variables that should be joined together, for instance, an economic time series. We will continue with the dataset from the World Bank\index{World Bank} and focus on GDP\index{Gross Domestic Product!United States}\index{United States!Gross Domestic Product} growth in the United States using `geom_line()` (@fig-lineplot-1). The source of the data can be added to the graph using "caption" within `labs()`.
```{r}
#| fig-cap: "United States GDP growth (1961-2020)"
#| label: fig-lineplot
#| warning: false
#| layout-ncol: 2
#| fig-subcap: ["Using a line plot", "Using a stairstep line plot"]
# Panel (a)
world_bank_data |>
filter(country == "United States") |>
ggplot(mapping = aes(x = year, y = gdp_growth)) +
geom_line() +
theme_minimal() +
labs(x = "Year", y = "GDP growth", caption = "Data source: World Bank.")
# Panel (b)
world_bank_data |>
filter(country == "United States") |>
ggplot(mapping = aes(x = year, y = gdp_growth)) +
geom_step() +
theme_minimal() +
labs(x = "Year", y = "GDP growth", caption = "Data source: World Bank.")
```
We can use `geom_step()`, a slight variant of `geom_line()`, to focus attention on the change from year to year (@fig-lineplot-2).
The Phillips curve\index{Phillips curve} is the name given to a plot of the relationship between unemployment and inflation over time. An inverse relationship is sometimes found in the data, for instance in the United Kingdom between 1861 and 1957 [@phillips1958relation]. We have a variety of ways to investigate this relationship in our data, including:
::: {.content-visible when-format="pdf"}
1) Adding a second line to our graph. For instance, we could add inflation (@fig-notphillips-1). This requires us to use `pivot_longer()`, which is discussed in the ["R Essentials" Online Appendix](https://tellingstorieswithdata.com/20-r_essentials.html), to ensure that the data are in a tidy format.
2) Using `geom_path()` to link values in the order they appear in the dataset. In @fig-notphillips-2 we show a Phillips curve for the United States between 1960 and 2020. @fig-notphillips-2 does not appear to show any clear relationship between unemployment and inflation.
:::
::: {.content-visible unless-format="pdf"}
1) Adding a second line to our graph. For instance, we could add inflation (@fig-notphillips-1). This requires us to use `pivot_longer()`, which is discussed in [Online Appendix -@sec-r-essentials], to ensure that the data are in a tidy format.
2) Using `geom_path()` to link values in the order they appear in the dataset. In @fig-notphillips-2 we show a Phillips curve for the United States between 1960 and 2020. @fig-notphillips-2 does not appear to show any clear relationship between unemployment and inflation.
:::
```{r}
#| fig-cap: "Unemployment and inflation for the United States (1960-2020)"
#| label: fig-notphillips
#| layout-ncol: 2
#| fig-subcap: ["Comparing the two time series over time", "Plotting the two time series against each other"]
#| warning: false
world_bank_data |>
filter(country == "United States") |>
select(-population, -gdp_growth) |>
pivot_longer(
cols = c("inflation", "unem_rate"),
names_to = "series",
values_to = "value"
) |>
ggplot(mapping = aes(x = year, y = value, color = series)) +
geom_line() +
theme_minimal() +
labs(
x = "Year", y = "Value", color = "Economic indicator",
caption = "Data source: World Bank."
) +
scale_color_brewer(palette = "Set1", labels = c("Inflation", "Unemployment")) +
theme(legend.position = "bottom")
world_bank_data |>
filter(country == "United States") |>
ggplot(mapping = aes(x = unem_rate, y = inflation)) +
geom_path() +
theme_minimal() +
labs(
x = "Unemployment rate", y = "Inflation",
caption = "Data source: World Bank."
)
```
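To see what `pivot_longer()` does to the shape of the data before plotting, consider this small sketch. The values in `wide_data` are invented purely for illustration and are not World Bank figures.

```r
library(tibble)
library(tidyr)

# Hypothetical values, invented only to show the change in shape
wide_data <- tibble(
  year = c(2018, 2019),
  inflation = c(2.0, 1.5),
  unem_rate = c(4.0, 3.5)
)

# One row per year and series, rather than one row per year
long_data <-
  wide_data |>
  pivot_longer(
    cols = c("inflation", "unem_rate"),
    names_to = "series",
    values_to = "value"
  )

long_data
```

The two rows become four, and the new "series" column is what allows `color = series` to draw one line for each indicator.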
### Histograms
A histogram\index{graphs!histogram} is useful to show the shape of the distribution of a continuous variable. The full range of the data values is split into intervals called "bins" and the histogram counts how many observations fall into each bin. In @fig-hisogramone we examine the distribution of GDP growth in Ethiopia.
```{r}
#| fig-cap: "Distribution of GDP growth in Ethiopia (1960-2020)"
#| label: fig-hisogramone
#| message: false
#| warning: false
world_bank_data |>
filter(country == "Ethiopia") |>
ggplot(aes(x = gdp_growth)) +
geom_histogram() +
theme_minimal() +
labs(
x = "GDP growth",
y = "Number of occurrences",
caption = "Data source: World Bank."
)
```
The key component that determines the shape of a histogram is the number of bins. This can be specified in one of two ways (@fig-hisogrambins):
1) specifying the number of "bins" to include; or
2) specifying their "binwidth".
```{r}
#| message: false
#| warning: false
#| fig-cap: "Distribution of GDP growth in Ethiopia (1960-2020)"
#| label: fig-hisogrambins
#| fig-subcap: ["Five bins", "20 bins", "Binwidth of two", "Binwidth of five"]
#| layout-ncol: 2
# Panel (a)
world_bank_data |>
filter(country == "Ethiopia") |>
ggplot(aes(x = gdp_growth)) +
geom_histogram(bins = 5) +
theme_minimal() +
labs(
x = "GDP growth",
y = "Number of occurrences"
)
# Panel (b)
world_bank_data |>
filter(country == "Ethiopia") |>
ggplot(aes(x = gdp_growth)) +
geom_histogram(bins = 20) +
theme_minimal() +
labs(
x = "GDP growth",
y = "Number of occurrences"
)
# Panel (c)
world_bank_data |>
filter(country == "Ethiopia") |>
ggplot(aes(x = gdp_growth)) +
geom_histogram(binwidth = 2) +
theme_minimal() +
labs(
x = "GDP growth",
y = "Number of occurrences"
)
# Panel (d)
world_bank_data |>
filter(country == "Ethiopia") |>
ggplot(aes(x = gdp_growth)) +
geom_histogram(binwidth = 5) +
theme_minimal() +
labs(
x = "GDP growth",
y = "Number of occurrences"
)
```
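The counting that sits behind a histogram can be sketched in base R. This uses simulated values as a hypothetical stand-in for growth rates, and builds the breaks by hand for a binwidth of two. `geom_histogram()` may place its bin edges differently, so the counts need not match a `ggplot2` histogram exactly.

```r
set.seed(853)

# Hypothetical stand-in for 60 years of growth rates, not the World Bank data
simulated_growth <- rnorm(n = 60, mean = 5, sd = 4)

# Breaks that are two units apart and cover the full range of the data
binwidth <- 2
breaks <- seq(
  from = floor(min(simulated_growth) / binwidth) * binwidth,
  to = ceiling(max(simulated_growth) / binwidth) * binwidth,
  by = binwidth
)

# hist() with plot = FALSE returns the counts without drawing anything
bin_counts <- hist(simulated_growth, breaks = breaks, plot = FALSE)$counts

length(bin_counts)
sum(bin_counts)
```

Choosing a binwidth fixes the number of bins once the range of the data is known, and vice versa, which is why only one of "bins" or "binwidth" needs to be specified.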
Histograms\index{graphs!histograms} can be thought of as locally averaging data, and the number of bins affects how much of this occurs. When there are only two bins there is considerable smoothing, but we lose a lot of accuracy. Too few bins result in more bias, while too many bins result in more variance [@wasserman, p. 303]. Our decision as to the number of bins, or their width, is concerned with trying to balance bias and variance. This will depend on a variety of concerns including the subject matter and the goal [@elementsofgraphingdata, p. 135]. This is one of the reasons that @Denby2009 consider histograms to be especially valuable as exploratory tools.
Finally, while we can use "fill" to distinguish between different types of observations, it can get quite messy. It is usually better to:
1. trace the outline of the distribution with `geom_freqpoly()` (@fig-different-obs-1);
2. build a stack of dots with `geom_dotplot()` (@fig-different-obs-2); or
3. add transparency, especially if the differences are more stark (@fig-different-obs-3).
```{r}
#| fig-cap: "Distribution of GDP growth across various countries (1960-2020)"
#| label: fig-different-obs
#| message: false
#| warning: false
#| layout-ncol: 2
#| fig-subcap: ["Tracing the outline", "Using dots", "Adding transparency"]
# Panel (a)
world_bank_data |>
ggplot(aes(x = gdp_growth, color = country)) +
geom_freqpoly() +
theme_minimal() +
labs(
x = "GDP growth", y = "Number of occurrences",
color = "Country",
caption = "Data source: World Bank."
) +
scale_color_brewer(palette = "Set1")
# Panel (b)
world_bank_data |>
ggplot(aes(x = gdp_growth, group = country, fill = country)) +
geom_dotplot(method = "histodot") +
theme_minimal() +
labs(
x = "GDP growth", y = "Number of occurrences",
fill = "Country",
caption = "Data source: World Bank."
) +
scale_fill_brewer(palette = "Set1")
# Panel (c)
world_bank_data |>
filter(country %in% c("India", "United States")) |>
ggplot(mapping = aes(x = gdp_growth, fill = country)) +
geom_histogram(alpha = 0.5, position = "identity") +
theme_minimal() +
labs(
x = "GDP growth", y = "Number of occurrences",
fill = "Country",
caption = "Data source: World Bank."
) +
scale_fill_brewer(palette = "Set1")
```
An interesting alternative to a histogram is the empirical cumulative distribution function (ECDF).\index{graphs!ECDF} The choice between this and a histogram tends to be audience-specific. It may not be appropriate for less-sophisticated audiences, but if the audience is quantitatively comfortable, then it can be a great choice because it does less smoothing than a histogram. We can build an ECDF with `stat_ecdf()`. For instance, @fig-ecdfismyfavohidonthavefavs shows the ECDF counterpart to @fig-hisogramone, extended to all four countries.
```{r}
#| fig-cap: "Distribution of GDP growth in four countries (1960-2020)"
#| label: fig-ecdfismyfavohidonthavefavs
#| warning: false
world_bank_data |>
ggplot(mapping = aes(x = gdp_growth, color = country)) +
stat_ecdf(geom = "point") +
theme_minimal() +
labs(
x = "GDP growth", y = "Proportion", color = "Country",
caption = "Data source: World Bank."
) +
theme(legend.position = "bottom")
```
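What an ECDF reports can be checked directly with `ecdf()` from base R. The values in `growth` are hypothetical and chosen only so that the arithmetic is easy to follow.

```r
# Hypothetical growth rates, invented for illustration
growth <- c(-1.2, 0.5, 2.1, 2.1, 3.4, 4.0, 5.6, 7.3)

# ecdf() returns a function that gives the proportion of observations
# at or below a value
growth_ecdf <- ecdf(growth)

# Four of the eight values are at or below 2.1
growth_ecdf(2.1)

# Every value is at or below the maximum
growth_ecdf(max(growth))
```

Each observation contributes its own step, with no binning involved, which is the sense in which an ECDF does less smoothing than a histogram.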
### Boxplots
A boxplot\index{graphs!boxplot} typically shows five aspects: 1) the median, 2) the 25th percentile, and 3) the 75th percentile. The fourth and fifth elements differ depending on specifics. One option is the minimum and maximum values. Another option is to determine the difference between the 75th and 25th percentiles, which is the interquartile range (IQR). The fourth and fifth elements are then the most extreme observations that lie within $1.5\times\mbox{IQR}$ of the 25th and 75th percentiles. That latter approach is used, by default, in `geom_boxplot()` from `ggplot2`. @chartingstatistics [p. 166] introduced the notion of a chart that focused on the range and various summary statistics including the median, while @tukeyeda focused on which summary statistics to show and popularized it [@anotherhadleyreferencelol].
One reason for using graphs is that they help us understand and embrace how complex our data are, rather than trying to hide and smooth that complexity away [@armstrongembracecomplexity]. One appropriate use case for boxplots is to compare the summary statistics of many variables at once, such as in @Bethlehem2022. But boxplots alone are rarely the best choice because they hide the distribution of data, rather than show it. The same boxplot can apply to very different distributions. To see this, consider two types of simulated data from the beta distribution.\index{simulation!beta distribution} The first contains draws from two beta distributions:\index{distribution!beta} one that is right skewed and another that is left skewed. The second contains draws from a beta distribution with no skew, noting that $\mbox{Beta}(1, 1)$ is equivalent to $\mbox{Uniform}(0, 1)$.
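The IQR-based option can be sketched in base R as follows. The values in `observations` are hypothetical, chosen so that one of them sits far from the rest, and the quartiles use the default definition in `quantile()`, which should correspond to what `geom_boxplot()` draws.

```r
# Hypothetical observations, invented so that one value is far from the rest
observations <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

quartiles <- quantile(observations, probs = c(0.25, 0.5, 0.75), names = FALSE)
iqr <- quartiles[3] - quartiles[1]

# Limits beyond which observations are drawn as individual points
lower_limit <- quartiles[1] - 1.5 * iqr
upper_limit <- quartiles[3] + 1.5 * iqr

# The whiskers end at the most extreme observations inside those limits
lower_whisker <- min(observations[observations >= lower_limit])
upper_whisker <- max(observations[observations <= upper_limit])
outliers <-
  observations[observations < lower_limit | observations > upper_limit]

c(lower_whisker, quartiles, upper_whisker)
outliers
```

Here the quartiles are 3.25, 5.5, and 7.75, so the IQR is 4.5 and the limits are -3.5 and 14.5. The whiskers therefore end at 1 and 9, and the value of 30 would be drawn as a separate point.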
```{r}
set.seed(853)
number_of_draws <- 10000
both_left_and_right_skew <-
c(
rbeta(number_of_draws / 2, 5, 2),
rbeta(number_of_draws / 2, 2, 5)
)
no_skew <-
rbeta(number_of_draws, 1, 1)
beta_distributions <-
tibble(
observation = c(both_left_and_right_skew, no_skew),
source = c(
rep("Left and right skew", number_of_draws),
rep("No skew", number_of_draws)
)
)
```
We can first compare the boxplots of the two series (@fig-boxplotfirst-1). But if we plot the actual data then we can see how different they are (@fig-boxplotfirst-2).
```{r}
#| label: fig-boxplotfirst
#| message: false
#| warning: false
#| layout-ncol: 2
#| fig-cap: "Data drawn from beta distributions with different parameters"
#| fig-subcap: ["Illustrated with a boxplot", "Actual data"]
beta_distributions |>
ggplot(aes(x = source, y = observation)) +
geom_boxplot() +
theme_classic()
beta_distributions |>
ggplot(aes(x = observation, color = source)) +
geom_freqpoly(binwidth = 0.05) +
theme_classic() +
theme(legend.position = "bottom")
```
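The similarity of the two boxplots can also be checked numerically. As a sketch, the following repeats the simulation with the same seed and compares the quartiles of the two series, which are the statistics that determine each box.

```r
set.seed(853)

number_of_draws <- 10000

both_left_and_right_skew <-
  c(
    rbeta(number_of_draws / 2, 5, 2),
    rbeta(number_of_draws / 2, 2, 5)
  )

no_skew <-
  rbeta(number_of_draws, 1, 1)

# Quartiles of each series, one row per series
quartiles_of_each <-
  rbind(
    "Left and right skew" =
      quantile(both_left_and_right_skew, probs = c(0.25, 0.5, 0.75)),
    "No skew" =
      quantile(no_skew, probs = c(0.25, 0.5, 0.75))
  )

round(quartiles_of_each, 2)
```

The quartiles of the two series should be close to each other, even though one distribution has two peaks and the other is flat. The boxplot has no way of showing that difference.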
One way forward, if a boxplot is to be used, is to include the actual data as a layer on top of the boxplot.\index{graphs!boxplot} For instance, in @fig-bloxplotandoverlay we show the distribution of inflation across the four countries. The reason that this works well is that it shows the actual observations, as well as the summary statistics.
```{r}
#| fig-cap: "Distribution of inflation data for four countries (1960-2020)"
#| label: fig-bloxplotandoverlay
#| message: false
#| warning: false
world_bank_data |>
ggplot(mapping = aes(x = country, y = inflation)) +
geom_boxplot() +
geom_jitter(alpha = 0.3, width = 0.15, height = 0) +
theme_minimal() +
labs(
x = "Country",
y = "Inflation",
caption = "Data source: World Bank."
)
```
### Interactive graphs