week05/data_wrangling_w_tidyr.Rmd

---
title: "Data wrangling with tidyr"
output: html_document
---

```{r setup, include=FALSE}
# load libraries
library(tidyverse)
library(conflicted)
library(viridis)

# configure knit settings
knitr::opts_chunk$set(echo = TRUE, fig.width = 6, fig.height = 4)

# resolve package conflicts
filter <- dplyr::filter
select <- dplyr::select
```

# Data Wrangling with tidyr

The tidyr package has some built in data-sets specifically for practicing tidying, so they'll be used for the demonstration today.

## Tidy Data

The first example data table in tidyr is the cleverly named "table1". This table is tidy! The other tableNs in this dataset are untidy versions of this one to practice on.

```{r}
table1
```

As you can see when you look at table1, each column is a variable and contains only values of the same type belonging to that variable. Going foward, we'll be tidying messed up versions of table1

## tidyr functions

### `spread()`

If you look at table 2 in the chunk below, it's not tidy because the column 'type' contains two variables, cases and population. Cases and population are separate measurements, so they should be their own columns. To make more columns from existing columns, use `spread()`. 

```{r}
table2

table2 %>% spread(type, count)
```

### `gather()`

Looking at table4a in the chunk below, it's not tidy because the columns 1999 and 2000 are not variable descriptions, they're data (the year the measurement was made in). We need the column names to be values in a column. To do that we'll use gather.

```{r}
table4a

# you can gather by specifying column names
table4a %>% gather(year, cases, `1999`:`2000`)

# also with column numbers
table4a %>% gather(year, cases, 2:3)
```

### `separate()`

In the chunk below, table3 is not tidy because it has multiple values separate by a backslash in the rate column. To split one column into two or more columns, use `separate()`

```{r}
table3

# separate example
table3 %>% separate(rate, into = c('cases', 'population'), sep = '/')
```

`separate()` also lets you specify an interger position to split at

```{r}
# just so you know, you can also separate by specifying a position to separate after
table3 %>% separate(rate, into = c('cases', 'population'), sep = 3)
```

### `unite()`

In the chunk below, table5 is not tidy because the date is split between two columns, century and year. To combine columns, use the opposite of `separate()`, `unite()`

```{r}
table5

table5 %>% unite(year, c('century', 'year'), sep = '')
```

You may have noticed that the rate column in table5 is the same as the rate column in table3 and contains two values. Chain together some of the tidyr functions above to completely tidy table5.

```{r}
table5 %>% 
  unite(year, c('century', 'year'), sep = '') %>% 
  separate(rate, into = c('cases', 'population'), sep = '/')
```

## Combining Tables

Binds and joins are in the dplyr package, but fit in better with data wrangling, so they were saved for this week.

### Binds

Attach tables together. In order to bind tables, the tables being bound together have to have the same number of rows OR the same number of columns (depending if you're binding by rows or columns).

#### `rbind()` / `bind_rows()`

Attach tables together by adding additional rows. When you bind by rows, the second table is added to the bottom of the first table. 

```{r}
table4a
table4b

# base R
rbind(table4a, table4b)

# tidyverse, dplyr
bind_rows(table4a, table4b)
```

The difference between `rbind()` and `bind_rows()` is that if you have a different number of columns, `rbind()` will refuse to work and return an error, while `bind_rows()` will keep all columns from all tables and add NAs where there are no values.

#### `cbind()` / `bind_cols()`

Attach tables by adding additional columns.

```{r}
table4a
table4b

# base R
cbind(table4a, table4b)

# tidyverse, dplyr
bind_cols(table4a, table4b)
```

### Joins

Joins combine two tables together by common variables. The difference between binds and joins is that binds are intended to combine tables together that have exactly the same information, like for different samples, while joins are intended to combine together tables with complimentary but different information by common variables, like samples with their metadata. The `*_join()` functions combine either by the columns you specify or by auto-detecting columns with the same name. Mutating joins add additional columns, while filtering joins filter a table by another table.

We'll use the band_* tables from the dplyr package to practice joins. As you can see when you run the tables in the chunk below, the tables have information in common, musician names, but complimentary information, what band they belong to in one table and what instruments they play in the others.

```{r}
band_members
band_instruments
band_instruments2
```

The descriptions below refer to x and y in the join. x is the first table listed in the join and y is the second. The order tables are given in joins is important!

#### Mutating Joins

Mutating joins add columns onto a table from another table.

##### `inner_join()`

Joins tables by returning all rows where x and y equal each other (in the join columns) and all columns from both x and y.

```{r}
band_members
band_instruments
inner_join(band_members, band_instruments, by = c('name'))
```

##### `full_join()`

The opposite of `inner_join()`, `full_join()` joins tables by returning all rows all columns from both x and y. If there is no matching value between x and y, an NA will be returned.

```{r}
band_members
band_instruments
full_join(band_members, band_instruments, by = c('name'))
```

##### `left_join()`

Joins tables by return all rows from x and all columns from x and y. If there is no matching value in y for x, an NA will be returned.

```{r}
band_members
band_instruments
left_join(band_members, band_instruments, by = c('name'))
```

##### `right_join()`

Joins tables by the opposite of `left_join()`, returning all rows from y and all columns from x and y. If there is no matching value in x for y, an NA will be returned.

```{r}
band_members
band_instruments
right_join(band_members, band_instruments, by = c('name'))
```

#### Filtering Joins

Filtering joins filter a table by another table.

#### `semi_join()`

Filters x by returning all rows from x that match values in y.

```{r}
band_members
band_instruments
semi_join(band_members, band_instruments, by = c('name'))
```

#### `anti_join()`

The opposite of `semi_join()`, `anti_join()` returns rows from x that DON'T match a value in y

```{r}
band_members
band_instruments
anti_join(band_members, band_instruments, by = c('name'))
```

#### What if my column names don't match?

Name and artist are the same variable, who the person is, in the tables below.

```{r}
band_members
band_instruments2
```

To specify the join by two differently named columns, specify that they equal each other. You MUST list the column names in the order the tables are given to the join. 'name' comes first below, because that's the column you want to join by in the first table.

```{r}
full_join(band_members, band_instruments2, by = c('name' = 'artist'))
```

### Dealing with NAs

If there's no value in a table, or R can't figure out what's supposed to be there, it will give you an `NA` as seen in the Star Wars table below

```{r}
starwars
```

NAs are bad not only because you're missing data, but also because many functions in R either refuse to run or return weird answers if there are NAs in the data. 

```{r}
mean(starwars$birth_year)
```

#### `na.omit()`

One strategy is just to drop all rows that contain NAs

```{r}
starwars
starwars %>% na.omit()
```

#### `replace_na()`

Sometimes you might want to replace NAs with a value though, which you can do using tidyr

```{r}
starwars
starwars %>% replace_na(list(hair_color = 'none', birth_year = 'unknown', gender = 'n/a'))
```

## Putting it together

```{r}
# what the tidy table should look like
table1

# the tables we need to tidy into one table
table4a
table4b
```

```{r}
# tidy individual tables
table4a %>% gather(year, cases, `1999`:`2000`) -> table4a_tidy
table4b %>% gather(year, population, `1999`:`2000`) -> table4b_tidy

# combine
left_join(table4a_tidy, table4b_tidy, by = c('country', 'year'))
```