Data_transformation.Rmd

# Data transformation with `dplyr`

This section focusses on transforming rectangular datasets. 


The `dplyr` verbs and concepts covered in this chapter are also covered in this video by Garrett Grolemund (a co-author of _[R for Data Science](https://r4ds.had.co.nz/)_ with Hadley Wickham). 

```{r, echo = FALSE}
knitr::include_url("https://www.youtube.com/embed/y9KJmUGc8SE")
```


## Set up

Load your packages first. This chapter just uses the packages contained in the `tidyverse`:

```{r load_packages, message = FALSE}
library(tidyverse)
```


The `sa3_income` dataset will be used for all key examples in this chapter.^[This is a tidied version of the [ABS Employee income by occupation and sex, 2010-11 to 2016-16](https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/6524.0.55.0022011-2016?OpenDocument) dataset.] It is a long dataset from the ABS that contains the average income and number of workers by Statistical Area 3, occupation and sex between 2011 and 2016.

If you haven't already, download the `sa3_income.csv` file to your own `data` folder:

```{r download_data, message = FALSE}
download.file(url = "https://raw.githubusercontent.com/grattan/R_at_Grattan/master/data/sa3_income.csv",
              destfile = "data/sa3_income.csv")
```

Then read it using the `read_csv` function:

```{r read_data}
sa3_income <- read_csv("data/sa3_income.csv")
```

```{r remove_na, include=FALSE}
sa3_income <- sa3_income %>% 
  filter(!is.na(average_income)) %>% 
  select(year, 
         sa3_name, 
         state, 
         gender, 
         income = average_income, 
         workers) %>% 
  group_by(year, sa3_name, state, gender) %>% 
  summarise(income = mean(income),
            workers = sum(workers)) %>% 
  ungroup() %>% 
  arrange(gender, state, sa3_name, year)
```

```{r}
head(sa3_income)
```


## The pipe: `%>%`

You will almost always want to perform more than one of the operations described below on your dataset. One way to perform multiple operations, one after the other, is to 'nest' them inside. This nesting will be _painfully_ familiar to Excel users.

Consider an example of baking and eating a cake.^[XXX cannot remember the source for this example; probably Hadley? Jenny Bryan? Maybe somenone else?] You take the ingredients, combine them, then mix, then bake, and then eat them. In a nested formula, this process looks like:

```{r, eval = FALSE}
eat(bake(mix(combine(ingredients))))
```

In a nested formula, you need to start in the _middle_ and work your way out. This means anyone reading your code -- including you in the future! -- needs to start in the middle and work their way out. But because we're used to left-right reading, we're not particularly good at naturally interpreting nested functions like this one.

This is where the 'pipe' can help. The pipe operator `%>%` (keyboard shortcut: `cmd + shift + m`)  takes an argument on the left and 'pipes' it into the function on the right. Each time you see `%>%`, you can read it as 'and then'. 

So the you could express the baking example as:

```{r, eval = FALSE}
ingredients %>% # and then
  combine() %>% # and then
  mix() %>% # and then
  bake() %>% # and then
  eat() # yum!
```

Which reads as:

> take the `ingredients`, then `combine`, then `mix`, then `bake`, then `eat` them.

This does the same thing as `eat(bake(mix(combine(ingredients))))`. But it's much nicer and more natural to read, and to _write_.

Another example: the function `paste` takes arguments and combines them together into a single string. So you could use the pipe to:

```{r}
"hello" %>% paste("dear", "reader")
```

which is the same as

```{r}
paste("hello", "dear", "reader")
```


Or you could define a vector of numbers and pass^['pass' can also be used to mean 'pipe'.] them to the `sum()` function:

```{r}
my_numbers <- c(1, 2, 3, 5, 8, 13)

my_numbers %>% sum()
```

Or you could skip the intermediate step altogether:
```{r}
c(1, 2, 3, 5, 8, 13) %>% 
  sum()
```

This is the same as:

```{r}
sum(c(1, 2, 3, 5, 8, 13))
```


The benefits of piping become more clear when you want to perform a few sequential operations on a dataset. For example, you might want to `filter` the observations in the `sa3_income` data to only `NSW`, before you `group_by` `gender` and `summarise` the `income` of these grops (these functions are explained in detail below). All of these functions take 'data' as the first argument, and are designed to be used with pipes.

Like the income differential it shows, writing this process as a nested function is outrageous and hard to read:

```{r}
summarise((group_by(filter(sa3_income, state == "NSW"), gender)), av_mean_income = mean(income))
```

The original common way to avoid this unseemly nesting in `R` was to assign each 'step' its own object, which is definitely clearer:

```{r}
data1 <- filter(sa3_income, state == "NSW")
data2 <- group_by(data1, gender)
data3 <- summarise(data2, av_mean_income = mean(income))
data3
```

And using pipes make the steps clearer still: 

1. take the `sa3_income` data, and then %>% 
2. `filter` it to only NSW, and then %>% 
3. `group` it by gender, and then %>% 
4. `summarise` it

```{r}
sa3_income %>%  # and then
  filter(state == "NSW") %>% # and then 
  group_by(gender) %>% # and then
  summarise(av_mean_income = mean(income))
```

 
## Select variables with `select()`

The `select` function takes a dataset and **keeps** or **drops** variables (columns) that are specified.

For example, look at the variables that are in the `sa3_income` dataset (using the `names()` function):

```{r}
names(sa3_income)
```

If you wanted to keep just the `state` and `income` variables, you could take the `sa3_income` dataset and select just those variables:

```{r}
sa3_income %>% 
  select(state, income)
```

Or you could use `-` (minus) to remove the `state` and `sa3_name` variables:^[This is the same as **keeping everything except** the `state` and `sa3_name` variables.]

```{r}
sa3_income %>% 
  select(-state, -sa3_name)
```

### Selecting groups of variables

Sometimes it can be useful to keep or drop variables with names that have a certain characteristic; they begin with some text string, or end with one, or contain one, or have some other pattern altogether. 

You can use patterns and ['select helpers'](https://tidyselect.r-lib.org/reference/select_helpers.html)^[Explained in useful detail by the Tidyverse people at https://tidyselect.r-lib.org/reference/select_helpers.html] 
from the Tidyverse to help deal with these sets of variables.

For example, if you want to keep just the SA3 and state variables -- ie the variables that start with `"s"` -- you could: 

```{r}
sa3_income %>% 
  select(starts_with("s"))
```

Or, instead, if you wanted to keep just the variables that contain `"er"`, you could:

```{r}
sa3_income %>% 
  select(contains("er"))
```

And if you wanted to keep **both** the `"s"` variables and the `"er"` variables, you could:

```{r}
sa3_income %>% 
  select(starts_with("s"), contains("er"), )
```

The full list of these handy select functions are provided with the `?tidyselect::select_helpers` documentation, listed below:

- `starts_with()`: Starts with a prefix.
- `ends_with()`: Ends with a suffix.
- `contains()`: Contains a literal string.
- `matches()`: Matches a regular expression.
- `num_range()`: Matches a numerical range like x01, x02, x03.
- `one_of()`: Matches variable names in a character vector.
- `everything()`: Matches all variables.
- `last_col()`: Select last variable, possibly with an offset.


## Filter with `filter()`

The `filter` function takes a dataset and **keeps** observations (rows) that meet the **conditions**. 

`filter` has one required first argument -- the data -- and then as many 'conditions' as you want to provide. 


### Conditions; logical operations; `TRUE` or `FALSE`

The **conditions** are logical operations, meaning they are a statement that return either `TRUE` or `FALSE` in the computer's mind.^[Computers' mind, more likely.] 

We know, for instance, that `12` is equal to `12` and that `1 + 2` does not equal `12`. Which means if we type `12 == 12` or `1 + 2 == 12` into the console it should give `FALSE`:

```{r}
12 == 12
1 + 2 == 12
```

Or, we can see if `1 + 2` is equal `5` or `9` or `3` by providing a vector of those numbers:

```{r}
1 + 2 == c(5, 9, 3)
```

This works for character strings, too:

```{r}
"apple" == c("orange", "apple", 7)
```


A lot of what we do in 'data science' is based on these `TRUE` and `FALSE` conditions. 

### Filtering data with `filter`

Turning back to the `sa3_income` data, if you just wanted to see observations people in `NT`:

```{r}
sa3_income %>% 
  filter(state == "NT")
```

Or you might just want to look at high-income (`income > 70,000`) areas from Victoria in 2015:

```{r}
sa3_income %>% 
  filter(state == "Vic",
         income > 70000,
         year == 2015)
```


Each of the commas in the `filter` function represent an 'and' `&`. So you can read the steps above as: 

> take the `sa3_income` data and filter to keep only the observations that are from Victoria`,` and that have a average income above 70k`,` and are from the year 2015.


Sometimes you might want to relax a little, keeping observations from one category **or** another. You can do this with the **or** symbol: `|`^[On the keyboard: `shift` + `backslash`]

```{r}
sa3_income %>% 
  filter(state == "Vic" | state == "Tas",
         income > 100000,
         year == 2015 | year == 2016)

```

Which reads:

> take the `sa3_income` data and filter to keep only the observations that are from Victoria OR NSW, and that have a average income above 100k, and are from the year 2015 OR 2016.


### Grouped filtering with `group_by()` 

The `group_by` function groups a dataset by given variables. This effectively generates one dataset per group within your main dataset. Any function you then apply -- like `filter()` -- will be applied to _each_ of the grouped datasets. 

For example, you could filter the `sa3_income` dataset to keep just the observation with the highest average income:

```{r}
sa3_income %>% 
  filter(income == max(income))
```

To keep the observations that have the highest average incomes _in each state_, you can `group_by` state, then `filter`:^[Wow they are all men!]

```{r}
sa3_income %>% 
  group_by(state) %>% 
  filter(income == max(income))
```

From the description of the tibble above, you can learn that your data has 8 unique groups of state: 

`## # Groups:       state [8]`

Or you could keep the observations with the _lowest_ average incomes in _each state and year_:^[Wow they are all women!]

```{r}
sa3_income %>% 
  group_by(state, year) %>% 
  filter(income == min(income))
```

The dataset remains grouped after your function(s). To explicitly 'ungroup' your data, add the `ungroup` function to your chain (the 'Groups' note has disappeared in the below):

```{r}
sa3_income %>% 
  group_by(state, year) %>% 
  filter(income == min(income)) %>% 
  ungroup()
```


## Edit and add new variables with `mutate()`

To add new variables to your dataset, use the `mutate` function. Like all `dplyr` verbs, the first argument to `mutate` is your data. Then define variables using a `new_variable_name = x` format, where `x` can be a single number or character string, or simple operation or function using current variables. 


To add a new variable to the `sa3_income` dataset that shows the log the number of workers:

```{r}
sa3_income %>% 
  mutate(log_workers = log(workers))
```

To edit a variable, redefine it in `mutate`. For example, if you wanted to take the last two digits of year:

```{r}
sa3_income %>% 
  mutate(year = as.integer(year - 2000))
```


### Using `if_else()` or `case_when()` 


Sometimes you want to create a new variable based on some sort of condition. Like, if the number of workers in an `sa3` is more than `2,000`, set the new `many_workers` variable to `TRUE`, and set it to `FALSE` otherwise. 

This kind of operation can be thought of as `if_else`: `if` (some condition), do this, otherwise do that. 

That's what the `if_else()` function does. It takes three arguments: a condition, a value if that condition is true, and a value if that condition is false.

You can use the `if_else()` function when you are creating new variables in a `mutate` command:

```{r}
sa3_income %>% 
  mutate(many_workers = if_else(workers > 2000, "Many workers", "Not many workers"))
```

Which reads:

> Take the `sa3_income` dataset, and then add a variable that says 'Many workers' if there are more than 2,000 workers, and 'Not many workers' if there are fewer-or-equal than 2,000 workers.

With the `if_else` function, you take one conditional statement and return something based on that. But **often** you don't want to be so binary; you want to do this if this is true, that if that is true, and the other if the other is true, etc. 


This could be done by nesting `if_else` statements:

```{r}
sa3_income %>% 
  mutate(worker_group = if_else(workers > 2000, "More than 2000 workers", 
                                if_else(workers > 1000, "1000-2000 workers",
                                        if_else(workers > 500, "500-1000 workers",
                                                "500 workers or less"))))
```

But that syntax can be a bit difficult to read. You can do this in a clearer way using `case_when`:

```{r}
sa3_income %>% 
  mutate(worker_group = case_when(
    workers > 20000 ~ "More than 20,000 workers",
    workers > 10000 ~ "More than 10,000 workers",
    workers >  5000 ~ "More than 5,000 workers",
    workers <= 5000 ~ "5,000 or fewer workers"
  ))
  
```

The `case_when` function takes the first condition (LHS) and applies some value (RHS) if it is true. It then moves to the next condition, and so on. Once an observation has been classified -- eg an observation has more than 20,000 workers -- it is ignored in proceeding conditions. 

Ending a `case_when` statement with `TRUE ~ [some value]` is a catch all, and will apply the RHS `[some value]` to any observations that did not meet an explicit condition. For example, you could end the worker classification with:

```{r}
sa3_income %>% 
  mutate(worker_group = case_when(
    workers > 20000 ~ "More than 20,000 workers",
    workers > 10000 ~ "More than 10,000 workers",
    workers >  5000 ~ "More than 5,000 workers",
    TRUE ~ "5,000 or fewer workers"
  ))
```

Meaning, for any observation that did not have workers more than 20,000 or more than 10,000 or more than 5,000, assign the value `"5,000 or fewer workers"`.

Observations that do not meet a condition will be set to `NA`:

```{r}
sa3_income %>% 
  mutate(worker_group = case_when(
    workers > 10e6 ~ "More than 10 million workers"
  ))
```


Like any `if` or `if_else`, you can provide more than one condition to your conditional statement:

```{r}
sa3_income %>% 
  mutate(women_group = case_when(
    gender == "Women" & workers > 20000 ~ "More than 20,000 women",
    gender == "Women" & workers > 10000 ~ "More than 10,000 women",
    gender == "Women" & workers >  5000 ~ "More than 5,000 women",
    gender == "Women"                  ~ "5,000 or fewer women",
    TRUE ~ "Men"
  ))
```


### Grouped mutates with `group_by()` 

Like filtering, you can add or edit variables on grouped data. For example, you could get the average number of workers in each SA3 over the 6 years:

```{r}
sa3_income %>% 
  group_by(sa3_name, gender) %>% 
  mutate(av_workers = mean(workers))
```

Above, the `mean()` function is applied separately to each unique group of `sa3_name` and `gender`, taking one average for women in Queanbeyan, one average for men in Queanbeyan, and so on. 

Grouping a dataset does not prohibit operations that don't utilise the grouping. For example, you could get each year's workers relative to the SA3/gender average in the same call to `mutate`:

```{r}
sa3_income %>% 
  group_by(sa3_name, gender) %>% 
  mutate(av_workers = mean(workers),
         worker_diff = workers / av_workers)
```


See that the data remains grouped after the `mutate`. You can explicitly `ungroup()` afterwards:

```{r}
sa3_income %>% 
  group_by(sa3_name, gender) %>% 
  mutate(av_workers = mean(workers),
         worker_diff = workers / av_workers) %>% 
  ungroup()
```


## Summarise data with `summarise()`

Summarising is a useful way to assess and present data. The `summarise` function collapses your data down into a single row, performing the operation(s) you provide:


```{r}
sa3_income %>% 
  summarise(mean_income = mean(income),
            total_workers = sum(workers))  # this is a silly statistic
```


Summarising is usually only useful when combined with `group_by`.


### Grouped summaries with `group_by()` 

Grouped summaries can help change the _detail_ of your data. In the original `sa3_income` data, there is a unique `workers` observation for each year, SA3 and gender. If you wanted to aggregate that information up see the total number of workers for each year and SA3:

```{r}
sa3_income %>% 
  group_by(year, sa3_name) %>% 
  summarise(workers = sum(workers))
```

After the `summarise` function, the dataset grouping remains but is reduced by one -- so the right-hand-side grouping is lost. This enables a common combination to find a proportion of a group. For example, if you 

**Common functions to use with summarise**

Grouped summaries generate summary statistics for grouped data. It uses the same `summarise` function, but is preceded with a `group_by`. For example, if you want to find the average income for women and men:

```{r}
sa3_income %>% 
  group_by(gender) %>% 
  summarise(mean_income = mean(income))
```

Or the total workers in each year and state by gender:

```{r}
sa3_income %>% 
  group_by(year, state, gender) %>% 
  summarise(workers = sum(workers))
```


## Arrange with `arrange()`

'doesn't add or subtract to your data'

Sorting data in one way or another can be useful. Use the `arrange` function to sort data by the provided variable(s). Like with `select`, you can use the minus sign `-` to reverse the order. 

For example, to find the areas in 2016 with the **least** workers:

```{r}
sa3_income %>%
  filter(year == 2016) %>% 
  arrange(workers)
```

You can provide more than one variable. To sort the data first by `state` and, within each state, by the most workers (ie sorting by negative workers):

```{r}
sa3_income %>%
  filter(year == 2016) %>% 
  arrange(state, -workers)
```

### `lead` and `lag` functions with `arrange`

Having your data arranged in the way you want lets you use the `lead` (looking forward) and `lag` (looking backward) functions. 

Both the `lead` and `lag` functions take a variable as their only requried argument. The default number of lags or leads is `1`, and this can be changed with the second argument. For example:

```{r}
sa3_income %>%
  mutate(last_workers = lag(workers))
```

If you wanted to see the growth rate of income over time, you could `arrange` then `group_by` your data before creating an `income_growth` variable that is `income / lag(income)`.


```{r}
sa3_income %>%
  arrange(sa3_name, gender, year) %>% 
  group_by(sa3_name, gender) %>% 
  mutate(income_growth = income / lag(income) - 1)
```


## Putting it all together

You will often use a combination of the above `dplyr` functions to get your data into shape. 

For example, say you want to get the total workers and total income in each state and year by gender. You could start with the `sa3_income` dataset, and then filter to year 2016, then create a new variable equal to `workers * income`, then group by year, state and gender before you summarise to get the statistics you want. With pipes, it could look something like:

```{r}
sa3_income %>% 
  filter(year == 2016) %>% 
  mutate(total_income = workers * income) %>% 
  group_by(year, state, gender) %>% 
  summarise(total_workers = sum(workers),
            mean_income = mean(income),
            total_income = sum(total_income))
```

Or say you want to see the annual growth rate of female workers in the SA3 with the highest female income. You could filter to keep women, and then group by SA3, then get the highest income for each of SA3, then ungroup and filter to keep only the SA3 with the highest income, then arrange by year and get the annual worker growth:

```{r}
sa3_income %>% 
  filter(gender == "Women") %>% 
  group_by(sa3_name) %>% 
  mutate(highest_income = max(income)) %>% 
  ungroup() %>% 
  filter(highest_income == max(highest_income)) %>% 
  arrange(year) %>% 
  mutate(worker_growth = workers / lag(workers) - 1)
```


## Joining datasets with `left_join()` 

Joining one dataset with another is incredibly useful and can be a difficult concept to grasp. The concept of joining one dataset to another is well introduced in [Chapter 13 of R for Data Science](https://r4ds.had.co.nz/relational-data.html): 

> It's rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called **relational data** because it is the relations, not just the individual datasets, that are important.

The `dplyr` package ['Join two tbls together'](https://dplyr.tidyverse.org/reference/join.html) page provides a comprehensive summary of all join types. We will explore the key use of joins in our line of work -- `left_join` -- below. 

A 'left' join takes your main dataset, and adds variables from a new dataset based on a matching condition **that's unhelpful, fix**. If an observation in the new dataset is not found in the main dataset, it is ignored. 

It is probably easier to show this with an example. Say that we had the income percentiles of each SA3 in each year from a different data source:

```{r, include=FALSE}
sa3_percentiles <- read_csv("data/sa3_income.csv") %>% 
  group_by(sa3_name, year) %>% 
  summarise(sa3_income_percentile = mean(sa3_income_percentile))

write_csv(sa3_percentiles, "data/sa3_percentiles.csv")
```

```{r}
sa3_percentiles <- read_csv("data/sa3_percentiles.csv")
```