-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathData_transformation.Rmd
641 lines (409 loc) · 21.7 KB
/
Data_transformation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
# Data transformation with `dplyr`
This section focusses on transforming rectangular datasets.
The `dplyr` verbs and concepts covered in this chapter are also covered in this video by Garrett Grolemund (a co-author of _[R for Data Science](https://r4ds.had.co.nz/)_ with Hadley Wickham).
```{r, echo = FALSE}
knitr::include_url("https://www.youtube.com/embed/y9KJmUGc8SE")
```
## Set up
Load your packages first. This chapter just uses the packages contained in the `tidyverse`:
```{r load_packages, message = FALSE}
library(tidyverse)
```
The `sa3_income` dataset will be used for all key examples in this chapter.^[This is a tidied version of the [ABS Employee income by occupation and sex, 2010-11 to 2016-16](https://www.abs.gov.au/AUSSTATS/[email protected]/DetailsPage/6524.0.55.0022011-2016?OpenDocument) dataset.] It is a long dataset from the ABS that contains the average income and number of workers by Statistical Area 3, occupation and sex between 2011 and 2016.
If you haven't already, download the `sa3_income.csv` file to your own `data` folder:
```{r download_data, message = FALSE}
download.file(url = "https://raw.githubusercontent.com/grattan/R_at_Grattan/master/data/sa3_income.csv",
destfile = "data/sa3_income.csv")
```
Then read it using the `read_csv` function:
```{r read_data}
sa3_income <- read_csv("data/sa3_income.csv")
```
```{r remove_na, include=FALSE}
sa3_income <- sa3_income %>%
filter(!is.na(average_income)) %>%
select(year,
sa3_name,
state,
gender,
income = average_income,
workers) %>%
group_by(year, sa3_name, state, gender) %>%
summarise(income = mean(income),
workers = sum(workers)) %>%
ungroup() %>%
arrange(gender, state, sa3_name, year)
```
```{r}
head(sa3_income)
```
## The pipe: `%>%`
You will almost always want to perform more than one of the operations described below on your dataset. One way to perform multiple operations, one after the other, is to 'nest' them inside. This nesting will be _painfully_ familiar to Excel users.
Consider an example of baking and eating a cake.^[XXX cannot remember the source for this example; probably Hadley? Jenny Bryan? Maybe somenone else?] You take the ingredients, combine them, then mix, then bake, and then eat them. In a nested formula, this process looks like:
```{r, eval = FALSE}
eat(bake(mix(combine(ingredients))))
```
In a nested formula, you need to start in the _middle_ and work your way out. This means anyone reading your code -- including you in the future! -- needs to start in the middle and work their way out. But because we're used to left-right reading, we're not particularly good at naturally interpreting nested functions like this one.
This is where the 'pipe' can help. The pipe operator `%>%` (keyboard shortcut: `cmd + shift + m`) takes an argument on the left and 'pipes' it into the function on the right. Each time you see `%>%`, you can read it as 'and then'.
So the you could express the baking example as:
```{r, eval = FALSE}
ingredients %>% # and then
combine() %>% # and then
mix() %>% # and then
bake() %>% # and then
eat() # yum!
```
Which reads as:
> take the `ingredients`, then `combine`, then `mix`, then `bake`, then `eat` them.
This does the same thing as `eat(bake(mix(combine(ingredients))))`. But it's much nicer and more natural to read, and to _write_.
Another example: the function `paste` takes arguments and combines them together into a single string. So you could use the pipe to:
```{r}
"hello" %>% paste("dear", "reader")
```
which is the same as
```{r}
paste("hello", "dear", "reader")
```
Or you could define a vector of numbers and pass^['pass' can also be used to mean 'pipe'.] them to the `sum()` function:
```{r}
my_numbers <- c(1, 2, 3, 5, 8, 13)
my_numbers %>% sum()
```
Or you could skip the intermediate step altogether:
```{r}
c(1, 2, 3, 5, 8, 13) %>%
sum()
```
This is the same as:
```{r}
sum(c(1, 2, 3, 5, 8, 13))
```
The benefits of piping become more clear when you want to perform a few sequential operations on a dataset. For example, you might want to `filter` the observations in the `sa3_income` data to only `NSW`, before you `group_by` `gender` and `summarise` the `income` of these grops (these functions are explained in detail below). All of these functions take 'data' as the first argument, and are designed to be used with pipes.
Like the income differential it shows, writing this process as a nested function is outrageous and hard to read:
```{r}
summarise((group_by(filter(sa3_income, state == "NSW"), gender)), av_mean_income = mean(income))
```
The original common way to avoid this unseemly nesting in `R` was to assign each 'step' its own object, which is definitely clearer:
```{r}
data1 <- filter(sa3_income, state == "NSW")
data2 <- group_by(data1, gender)
data3 <- summarise(data2, av_mean_income = mean(income))
data3
```
And using pipes make the steps clearer still:
1. take the `sa3_income` data, and then %>%
2. `filter` it to only NSW, and then %>%
3. `group` it by gender, and then %>%
4. `summarise` it
```{r}
sa3_income %>% # and then
filter(state == "NSW") %>% # and then
group_by(gender) %>% # and then
summarise(av_mean_income = mean(income))
```
## Select variables with `select()`
The `select` function takes a dataset and **keeps** or **drops** variables (columns) that are specified.
For example, look at the variables that are in the `sa3_income` dataset (using the `names()` function):
```{r}
names(sa3_income)
```
If you wanted to keep just the `state` and `income` variables, you could take the `sa3_income` dataset and select just those variables:
```{r}
sa3_income %>%
select(state, income)
```
Or you could use `-` (minus) to remove the `state` and `sa3_name` variables:^[This is the same as **keeping everything except** the `state` and `sa3_name` variables.]
```{r}
sa3_income %>%
select(-state, -sa3_name)
```
### Selecting groups of variables
Sometimes it can be useful to keep or drop variables with names that have a certain characteristic; they begin with some text string, or end with one, or contain one, or have some other pattern altogether.
You can use patterns and ['select helpers'](https://tidyselect.r-lib.org/reference/select_helpers.html)^[Explained in useful detail by the Tidyverse people at https://tidyselect.r-lib.org/reference/select_helpers.html]
from the Tidyverse to help deal with these sets of variables.
For example, if you want to keep just the SA3 and state variables -- ie the variables that start with `"s"` -- you could:
```{r}
sa3_income %>%
select(starts_with("s"))
```
Or, instead, if you wanted to keep just the variables that contain `"er"`, you could:
```{r}
sa3_income %>%
select(contains("er"))
```
And if you wanted to keep **both** the `"s"` variables and the `"er"` variables, you could:
```{r}
sa3_income %>%
select(starts_with("s"), contains("er"), )
```
The full list of these handy select functions are provided with the `?tidyselect::select_helpers` documentation, listed below:
- `starts_with()`: Starts with a prefix.
- `ends_with()`: Ends with a suffix.
- `contains()`: Contains a literal string.
- `matches()`: Matches a regular expression.
- `num_range()`: Matches a numerical range like x01, x02, x03.
- `one_of()`: Matches variable names in a character vector.
- `everything()`: Matches all variables.
- `last_col()`: Select last variable, possibly with an offset.
## Filter with `filter()`
The `filter` function takes a dataset and **keeps** observations (rows) that meet the **conditions**.
`filter` has one required first argument -- the data -- and then as many 'conditions' as you want to provide.
### Conditions; logical operations; `TRUE` or `FALSE`
The **conditions** are logical operations, meaning they are a statement that return either `TRUE` or `FALSE` in the computer's mind.^[Computers' mind, more likely.]
We know, for instance, that `12` is equal to `12` and that `1 + 2` does not equal `12`. Which means if we type `12 == 12` or `1 + 2 == 12` into the console it should give `FALSE`:
```{r}
12 == 12
1 + 2 == 12
```
Or, we can see if `1 + 2` is equal `5` or `9` or `3` by providing a vector of those numbers:
```{r}
1 + 2 == c(5, 9, 3)
```
This works for character strings, too:
```{r}
"apple" == c("orange", "apple", 7)
```
A lot of what we do in 'data science' is based on these `TRUE` and `FALSE` conditions.
### Filtering data with `filter`
Turning back to the `sa3_income` data, if you just wanted to see observations people in `NT`:
```{r}
sa3_income %>%
filter(state == "NT")
```
Or you might just want to look at high-income (`income > 70,000`) areas from Victoria in 2015:
```{r}
sa3_income %>%
filter(state == "Vic",
income > 70000,
year == 2015)
```
Each of the commas in the `filter` function represent an 'and' `&`. So you can read the steps above as:
> take the `sa3_income` data and filter to keep only the observations that are from Victoria`,` and that have a average income above 70k`,` and are from the year 2015.
Sometimes you might want to relax a little, keeping observations from one category **or** another. You can do this with the **or** symbol: `|`^[On the keyboard: `shift` + `backslash`]
```{r}
sa3_income %>%
filter(state == "Vic" | state == "Tas",
income > 100000,
year == 2015 | year == 2016)
```
Which reads:
> take the `sa3_income` data and filter to keep only the observations that are from Victoria OR NSW, and that have a average income above 100k, and are from the year 2015 OR 2016.
### Grouped filtering with `group_by()`
The `group_by` function groups a dataset by given variables. This effectively generates one dataset per group within your main dataset. Any function you then apply -- like `filter()` -- will be applied to _each_ of the grouped datasets.
For example, you could filter the `sa3_income` dataset to keep just the observation with the highest average income:
```{r}
sa3_income %>%
filter(income == max(income))
```
To keep the observations that have the highest average incomes _in each state_, you can `group_by` state, then `filter`:^[Wow they are all men!]
```{r}
sa3_income %>%
group_by(state) %>%
filter(income == max(income))
```
From the description of the tibble above, you can learn that your data has 8 unique groups of state:
`## # Groups: state [8]`
Or you could keep the observations with the _lowest_ average incomes in _each state and year_:^[Wow they are all women!]
```{r}
sa3_income %>%
group_by(state, year) %>%
filter(income == min(income))
```
The dataset remains grouped after your function(s). To explicitly 'ungroup' your data, add the `ungroup` function to your chain (the 'Groups' note has disappeared in the below):
```{r}
sa3_income %>%
group_by(state, year) %>%
filter(income == min(income)) %>%
ungroup()
```
## Edit and add new variables with `mutate()`
To add new variables to your dataset, use the `mutate` function. Like all `dplyr` verbs, the first argument to `mutate` is your data. Then define variables using a `new_variable_name = x` format, where `x` can be a single number or character string, or simple operation or function using current variables.
To add a new variable to the `sa3_income` dataset that shows the log the number of workers:
```{r}
sa3_income %>%
mutate(log_workers = log(workers))
```
To edit a variable, redefine it in `mutate`. For example, if you wanted to take the last two digits of year:
```{r}
sa3_income %>%
mutate(year = as.integer(year - 2000))
```
### Using `if_else()` or `case_when()`
Sometimes you want to create a new variable based on some sort of condition. Like, if the number of workers in an `sa3` is more than `2,000`, set the new `many_workers` variable to `TRUE`, and set it to `FALSE` otherwise.
This kind of operation can be thought of as `if_else`: `if` (some condition), do this, otherwise do that.
That's what the `if_else()` function does. It takes three arguments: a condition, a value if that condition is true, and a value if that condition is false.
You can use the `if_else()` function when you are creating new variables in a `mutate` command:
```{r}
sa3_income %>%
mutate(many_workers = if_else(workers > 2000, "Many workers", "Not many workers"))
```
Which reads:
> Take the `sa3_income` dataset, and then add a variable that says 'Many workers' if there are more than 2,000 workers, and 'Not many workers' if there are fewer-or-equal than 2,000 workers.
With the `if_else` function, you take one conditional statement and return something based on that. But **often** you don't want to be so binary; you want to do this if this is true, that if that is true, and the other if the other is true, etc.
This could be done by nesting `if_else` statements:
```{r}
sa3_income %>%
mutate(worker_group = if_else(workers > 2000, "More than 2000 workers",
if_else(workers > 1000, "1000-2000 workers",
if_else(workers > 500, "500-1000 workers",
"500 workers or less"))))
```
But that syntax can be a bit difficult to read. You can do this in a clearer way using `case_when`:
```{r}
sa3_income %>%
mutate(worker_group = case_when(
workers > 20000 ~ "More than 20,000 workers",
workers > 10000 ~ "More than 10,000 workers",
workers > 5000 ~ "More than 5,000 workers",
workers <= 5000 ~ "5,000 or fewer workers"
))
```
The `case_when` function takes the first condition (LHS) and applies some value (RHS) if it is true. It then moves to the next condition, and so on. Once an observation has been classified -- eg an observation has more than 20,000 workers -- it is ignored in proceeding conditions.
Ending a `case_when` statement with `TRUE ~ [some value]` is a catch all, and will apply the RHS `[some value]` to any observations that did not meet an explicit condition. For example, you could end the worker classification with:
```{r}
sa3_income %>%
mutate(worker_group = case_when(
workers > 20000 ~ "More than 20,000 workers",
workers > 10000 ~ "More than 10,000 workers",
workers > 5000 ~ "More than 5,000 workers",
TRUE ~ "5,000 or fewer workers"
))
```
Meaning, for any observation that did not have workers more than 20,000 or more than 10,000 or more than 5,000, assign the value `"5,000 or fewer workers"`.
Observations that do not meet a condition will be set to `NA`:
```{r}
sa3_income %>%
mutate(worker_group = case_when(
workers > 10e6 ~ "More than 10 million workers"
))
```
Like any `if` or `if_else`, you can provide more than one condition to your conditional statement:
```{r}
sa3_income %>%
mutate(women_group = case_when(
gender == "Women" & workers > 20000 ~ "More than 20,000 women",
gender == "Women" & workers > 10000 ~ "More than 10,000 women",
gender == "Women" & workers > 5000 ~ "More than 5,000 women",
gender == "Women" ~ "5,000 or fewer women",
TRUE ~ "Men"
))
```
### Grouped mutates with `group_by()`
Like filtering, you can add or edit variables on grouped data. For example, you could get the average number of workers in each SA3 over the 6 years:
```{r}
sa3_income %>%
group_by(sa3_name, gender) %>%
mutate(av_workers = mean(workers))
```
Above, the `mean()` function is applied separately to each unique group of `sa3_name` and `gender`, taking one average for women in Queanbeyan, one average for men in Queanbeyan, and so on.
Grouping a dataset does not prohibit operations that don't utilise the grouping. For example, you could get each year's workers relative to the SA3/gender average in the same call to `mutate`:
```{r}
sa3_income %>%
group_by(sa3_name, gender) %>%
mutate(av_workers = mean(workers),
worker_diff = workers / av_workers)
```
See that the data remains grouped after the `mutate`. You can explicitly `ungroup()` afterwards:
```{r}
sa3_income %>%
group_by(sa3_name, gender) %>%
mutate(av_workers = mean(workers),
worker_diff = workers / av_workers) %>%
ungroup()
```
## Summarise data with `summarise()`
Summarising is a useful way to assess and present data. The `summarise` function collapses your data down into a single row, performing the operation(s) you provide:
```{r}
sa3_income %>%
summarise(mean_income = mean(income),
total_workers = sum(workers)) # this is a silly statistic
```
Summarising is usually only useful when combined with `group_by`.
### Grouped summaries with `group_by()`
Grouped summaries can help change the _detail_ of your data. In the original `sa3_income` data, there is a unique `workers` observation for each year, SA3 and gender. If you wanted to aggregate that information up see the total number of workers for each year and SA3:
```{r}
sa3_income %>%
group_by(year, sa3_name) %>%
summarise(workers = sum(workers))
```
After the `summarise` function, the dataset grouping remains but is reduced by one -- so the right-hand-side grouping is lost. This enables a common combination to find a proportion of a group. For example, if you
**Common functions to use with summarise**
Grouped summaries generate summary statistics for grouped data. It uses the same `summarise` function, but is preceded with a `group_by`. For example, if you want to find the average income for women and men:
```{r}
sa3_income %>%
group_by(gender) %>%
summarise(mean_income = mean(income))
```
Or the total workers in each year and state by gender:
```{r}
sa3_income %>%
group_by(year, state, gender) %>%
summarise(workers = sum(workers))
```
## Arrange with `arrange()`
'doesn't add or subtract to your data'
Sorting data in one way or another can be useful. Use the `arrange` function to sort data by the provided variable(s). Like with `select`, you can use the minus sign `-` to reverse the order.
For example, to find the areas in 2016 with the **least** workers:
```{r}
sa3_income %>%
filter(year == 2016) %>%
arrange(workers)
```
You can provide more than one variable. To sort the data first by `state` and, within each state, by the most workers (ie sorting by negative workers):
```{r}
sa3_income %>%
filter(year == 2016) %>%
arrange(state, -workers)
```
### `lead` and `lag` functions with `arrange`
Having your data arranged in the way you want lets you use the `lead` (looking forward) and `lag` (looking backward) functions.
Both the `lead` and `lag` functions take a variable as their only requried argument. The default number of lags or leads is `1`, and this can be changed with the second argument. For example:
```{r}
sa3_income %>%
mutate(last_workers = lag(workers))
```
If you wanted to see the growth rate of income over time, you could `arrange` then `group_by` your data before creating an `income_growth` variable that is `income / lag(income)`.
```{r}
sa3_income %>%
arrange(sa3_name, gender, year) %>%
group_by(sa3_name, gender) %>%
mutate(income_growth = income / lag(income) - 1)
```
## Putting it all together
You will often use a combination of the above `dplyr` functions to get your data into shape.
For example, say you want to get the total workers and total income in each state and year by gender. You could start with the `sa3_income` dataset, and then filter to year 2016, then create a new variable equal to `workers * income`, then group by year, state and gender before you summarise to get the statistics you want. With pipes, it could look something like:
```{r}
sa3_income %>%
filter(year == 2016) %>%
mutate(total_income = workers * income) %>%
group_by(year, state, gender) %>%
summarise(total_workers = sum(workers),
mean_income = mean(income),
total_income = sum(total_income))
```
Or say you want to see the annual growth rate of female workers in the SA3 with the highest female income. You could filter to keep women, and then group by SA3, then get the highest income for each of SA3, then ungroup and filter to keep only the SA3 with the highest income, then arrange by year and get the annual worker growth:
```{r}
sa3_income %>%
filter(gender == "Women") %>%
group_by(sa3_name) %>%
mutate(highest_income = max(income)) %>%
ungroup() %>%
filter(highest_income == max(highest_income)) %>%
arrange(year) %>%
mutate(worker_growth = workers / lag(workers) - 1)
```
## Joining datasets with `left_join()`
Joining one dataset with another is incredibly useful and can be a difficult concept to grasp. The concept of joining one dataset to another is well introduced in [Chapter 13 of R for Data Science](https://r4ds.had.co.nz/relational-data.html):
> It's rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called **relational data** because it is the relations, not just the individual datasets, that are important.
The `dplyr` package ['Join two tbls together'](https://dplyr.tidyverse.org/reference/join.html) page provides a comprehensive summary of all join types. We will explore the key use of joins in our line of work -- `left_join` -- below.
A 'left' join takes your main dataset, and adds variables from a new dataset based on a matching condition **that's unhelpful, fix**. If an observation in the new dataset is not found in the main dataset, it is ignored.
It is probably easier to show this with an example. Say that we had the income percentiles of each SA3 in each year from a different data source:
```{r, include=FALSE}
sa3_percentiles <- read_csv("data/sa3_income.csv") %>%
group_by(sa3_name, year) %>%
summarise(sa3_income_percentile = mean(sa3_income_percentile))
write_csv(sa3_percentiles, "data/sa3_percentiles.csv")
```
```{r}
sa3_percentiles <- read_csv("data/sa3_percentiles.csv")
```