-
Notifications
You must be signed in to change notification settings - Fork 0
/
blog_diversity.Rmd
238 lines (185 loc) · 8.74 KB
/
blog_diversity.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
---
title: "Playing with {ggplot2} extensions"
author: "Pierrette Lo"
date: "6/7/2020"
output:
html_document:
keep_md: true
---
Recently, a friend asked me to make a simple data visualization for her. The dataset was tiny and, to be honest, not super interesting (a very simple survey of [not very much] diversity among her department's leadership and overall membership). But the nice thing about a simple dataset is that for once I could spend less time on data cleaning and more time playing with aesthetics.
Here are the libraries I used:
```{r setup, message=FALSE, warning=FALSE}
library(tidyverse)
library(readxl)
library(ggalt)
library(patchwork)
library(ggtext)
```
### Data Wrangling
The data I was given was an Excel sheet that looked like this:
```{r, fig.retina=3, out.width="100%", echo=F}
knitr::include_graphics("data_screencap.PNG")
```
I started by doing a bit of data cleanup in Excel. If the dataset had been larger, I might have tried using {readxl} to clean it up in R, but in this case it took about 30 seconds to do it in Excel.
I separated each table onto a different tab, added a "personnel" header (for "Overall" vs "Leadership" categories), and corrected the typo in Race ("Indian~~a~~").
I next used the {readxl} package to import the Excel data...
```{r read_data, results='hide'}
# specify path of original data
path <- "blog_data.xlsx"
# read in all sheets as a named list
data <- path %>%
excel_sheets() %>%
set_names() %>%
map(read_xlsx, path = path)
# split list into separate dataframes
list2env(data, .GlobalEnv)
```
...And now I have three little dataframes.
```{r show_tables, echo=F, warning =F}
ethnicity %>%
knitr::kable(caption = "ethnicity") %>%
kableExtra::kable_styling(full_width = F, position = "left")
gender %>%
knitr::kable(caption = "gender") %>%
kableExtra::kable_styling(full_width = F, position = "left")
race %>%
knitr::kable(caption = "race") %>%
kableExtra::kable_styling(full_width = F, position = "left")
```
Next up: tidying each dataframe (yes, I copied and pasted more than twice and therefore should have written some functions, but again I was in a hurry to get to the fun part).
Here's what I did for the `ethnicity` dataframe, which I repeated for `gender` and `race`.
```{r tidy_eth, results='hide', warning=F}
ethnicity <- ethnicity %>%
# convert all columns except personnel to numeric
mutate_at(vars(-personnel), as.numeric) %>%
# make it tidy (i.e. long) format
pivot_longer(-personnel, names_to = "ethnicity", values_to = "percent") %>%
# convert decimals to percentages; convert `ethnicity` and `personnel` to factors
mutate(percent = percent * 100,
ethnicity = as.factor(str_replace(ethnicity, "Unsp", "Unspecified")),
personnel = as.factor(personnel)) %>%
# replace NAs with 0 (after confirming with my friend that this was the intent)
replace_na(list(percent = 0))
```
```{r, tidy_race_gen, results='hide', echo=F, warning=F}
race <- race %>%
mutate_at(vars(-personnel), as.numeric) %>%
pivot_longer(-personnel, names_to = "race", values_to = "percent") %>%
mutate(percent = percent * 100,
race = as.factor(str_replace(race, "Unsp", "Unspecified")),
personnel = as.factor(personnel)) %>%
replace_na(list(percent = 0))
gender <- gender %>%
mutate_at(vars(-personnel), as.numeric) %>%
pivot_longer(-personnel, names_to = "gender", values_to = "percent") %>%
mutate(percent = percent * 100,
gender = as.factor(str_replace(gender, "Unsp", "Unspecified")),
personnel = as.factor(personnel)) %>%
replace_na(list(percent = 0))
```
```{r show_tidied_eth, echo=F}
ethnicity %>%
knitr::kable(caption = "ethnicity") %>%
kableExtra::kable_styling(full_width = F)
```
### Data Visualization
Now for the fun stuff! There's a whole universe of {ggplot2} extensions, many (but not all) of which are listed [here](https://exts.ggplot2.tidyverse.org/).
I picked a few that I had been wanting to play with: [{bbplot}](https://bbc.github.io/rcookbook/) for theme, [{ggalt}](https://github.com/hrbrmstr/ggalt) for dumbbell plots, [{patchwork}](https://patchwork.data-imaginist.com/) to arrange plots, and [{ggtext}](https://wilkelab.org/ggtext/) for HTML text styling.
I started by setting up a custom theme for my plots -- largely borrowed from the BBC's [{bbplot}](https://bbc.github.io/rcookbook/#how_to_create_bbc_style_graphics) package.
The preset theme can be applied directly as a `ggplot` layer using `bbplot::bbc_style()`, but I made some tweaks and saved it as `my_theme`.
```{r}
my_colors <- c("#FAAB18", "#1380A1")
my_theme <- theme_light() +
theme(axis.ticks = element_blank(),
axis.line = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_line(color = "#cbcbcb"),
panel.grid.major.x = element_blank(),
panel.background = element_blank(),
panel.border = element_blank())
theme_set(my_theme)
```
Now I create each bar plot (for Gender, Ethnicity, and Race) separately.
* Reorder gender by percent
* Set y axis 0-100 so all plots have the same range
* Use custom colors (from `bbplot::bbc_style`)
* Use `color` in bars (in addition to `fill`) so 0 shows as a line
```{r}
p1 <- ethnicity %>%
mutate(ethnicity = fct_reorder(ethnicity, percent, na.rm = T)) %>%
ggplot(aes(x = ethnicity, y = percent, fill = personnel, color = personnel)) +
geom_col(position = "dodge") +
coord_flip(ylim = c(0, 100)) +
ggtitle("Ethnicity") +
xlab(NULL) +
ylab(NULL) +
scale_fill_manual(values = my_colors) +
scale_color_manual(values = my_colors)
p2 <- gender %>%
mutate(gender = fct_reorder(gender, percent)) %>%
ggplot(aes(x = gender, y = percent, fill = personnel, color = personnel)) +
geom_col(position = "dodge") +
coord_flip(ylim = c(0, 100)) +
ggtitle("Gender") +
xlab(NULL) +
ylab(NULL) +
scale_fill_manual(values = my_colors) +
scale_color_manual(values = my_colors)
p3 <- race %>%
mutate(race = fct_reorder(race, percent, na.rm = T)) %>%
ggplot(aes(x = race, y = percent, fill = personnel, color = personnel)) +
geom_col(position = "dodge") +
coord_flip(ylim = c(0, 100)) +
ggtitle("Race") +
xlab(NULL) +
ylab(NULL) +
scale_fill_manual(values = my_colors) +
scale_color_manual(values = my_colors)
```
Then I use {patchwork} to stitch them together, and {ggtext} to add color to the title in lieu of a legend.
* "Collect" guides so legends from each plot are treated the same (ie. deleted)
* Use {ggtext} `element_textbox_simple` or `element_markdown` to allow html in title
```{r}
p3 + (p1 / p2) +
plot_layout(guides = "collect") +
plot_annotation(title = "<span style='font-size:18pt'>Diversity in Department <b style='color:#FAAB18;'>Leadership</b> vs <b style='color:#1380A1;'>Overall</b></span>",
subtitle = "Percentages of personnel in each category are shown",
theme = theme(plot.title = element_markdown(lineheight = 1.1))) &
theme(legend.position = "none")
```
```{r, include=FALSE, eval=FALSE}
ggsave("blog_diversity_bars.png", dpi = 600, height = 7, width = 10)
```
I also repeated the above, but with Race shown in a dumbbell plot:
```{r}
p4 <- race %>%
mutate(race = fct_reorder(race, percent, na.rm = T)) %>%
pivot_wider(names_from = personnel, values_from = percent) %>%
ggplot() +
geom_dumbbell(aes(x = `Dept Overall`, xend = `Dept Leadership`, y = race),
size = 3,
colour = "#dddddd",
colour_x = "#1380A1",
colour_xend = "#FAAB18",
show.legend = F) +
coord_cartesian(xlim = c(0,100)) +
ggtitle("Race") +
xlab(NULL) +
ylab(NULL)
```
And here's the patchwork:
```{r}
p4 + (p1 / p2) +
plot_layout(guides = "collect") +
plot_annotation(title = "<span style='font-size:18pt'>Diversity in Dept <b style='color:#FAAB18;'>Leadership</b> vs <b style='color:#1380A1;'>Overall</b></span>",
subtitle = "Percentages of personnel in each category are shown",
theme = theme(plot.title = element_markdown(lineheight = 1.1))) &
theme(legend.position = "none")
```
```{r, include=FALSE, eval=FALSE}
ggsave("blog_diversity_dumbbell.png", dpi = 600, height = 10, width = 10)
```
### BONUS TIP!
Thanks to [this](https://jozef.io/r909-rmarkdown-tips/) helpful post, I discovered that you can use xaringan's Infinite Moon Reader to get live previews of RMarkdown documents (not just xaringan slides!).
After installing [{xaringan}](https://github.com/yihui/xaringan), you can either run `xaringan:::inf_mr()` or select "Infinite Moon Reader" from the RStudio Addins drop-down menu.
The preview will appear in the RStudio Viewer pane, and it will refresh every time you save changes to your Rmd. So much better than knitting every time you want to check your formatting!