12-joins.Rmd

---
layout: page
title: R for reproducible scientific analysis
subtitle: Multi-Table Joins
minutes: ?20
---

```{r, include=FALSE}
source("tools/chunk-options.R")
opts_chunk$set(warning = FALSE, message = FALSE)
# Silently load in the data so the rest of the lesson works
library(tidyverse)
gapminder <- read_csv("data/gapminder-FiveYearData.csv")
```

> ## Learning objectives {.objectives}
>
> - Focus on the third tidy data principle
>     - Each variable forms a column.
>     - Each observation forms a row.
>     - Each type of observational unit forms a table.
> - By able to use `dplyr`'s join functions to merge tables
>

### Joins

The third tidy data maxim states that each observation type gets its own table. The idea of multiple tables within a dataset will be familiar to anyone who has worked with a relational database but may seem foreign to those who have not. 

The idea is this: Suppose we conduct a behavioral experiment that puts individuals in groups, and we measure both individual- and group-level variables. We should have a table for the individual-level variables and a separate table for the group-level variables. Then, should we need to merge them, we can do so using the `join` functions of `dplyr`. 

The join functions are nicely illustrated in RStudio's [Data wrangling cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf). Each function takes two data.frames and, optionally, the name(s) of columns on which to match. If no column names are provided, the functions match on all shared column names.

The different join functions control what happens to rows that exist in one table but not the other. 

- `left_join` keeps all the entries that are present in the left (first) table and excludes any that are only in the right table. 
- `right_join` keeps all the entries that are present in the right table and excludes any that are only in the left table.
- `inner_join` keeps only the entries that are present in both tables. `inner_join` is the only function that guarantees you won't generate any missing entries.
- `full_join` keeps all of the entries in both tables, regardless of whether or not they appear in the other table.

![dplyr joins, via RStudio](fig/dplyr-joins.png)

We will practice on our continents data.frame from module 2 and the gapminder data.frame. Note how these are tidy data: We have observations at the level of continent and at the level of country, so they go in different tables. The continent column in the gapminder data.frame allows us to link them now. If continents data.frame isn't in your Environment, load it and recall what it consists of:

```{r}
load('data/continents.RDA')
continents
```

We can join the two data.frames using any of the `dplyr` functions. We will pass the results to `str` to avoid printing more than we can read, and to get more high-level information on the resulting data.frames.

```{r}
left_join(gapminder, continents) 

right_join(gapminder, continents)
```

These operations produce slightly different results, either 1704 or 1705 observations. Can you figure out why? Antarctica contains no countries so doesn't appear in the gapminder data.frame. When we use `left_join` it gets filtered from the results, but when we use `right_join` it appears, with missing values for all of the country-level variables:
  
```{r}
right_join(gapminder, continents) %>% 
  filter(continent == "Antarctica")
```

There's another problem in this data.frame -- it has two population measures, one by continent and one by country and it's not clear which is which! Let's rename a couple of columns.

```{r}
right_join(gapminder, continents) %>% 
  rename(country_pop = pop, continent_pop = population)
```


> #### Challenge -- Putting the pieces together {.challenge}
> 
> A colleague suggests that the more land area an individual has, the greater their gdp will be and that this relationship will be observable at any scale of observation. You chuckle and mutter "Not at the continental scale," but your colleague insists. Test your colleague's hypothesis by:
> 
> - Calculating the total GDP of each continent, 
>       - Hint: Use `dplyr`'s `group_by` and `summarize`
> - Joining the resulting data.frame to the `continents` data.frame, 
> - Calculating the per-capita GDP for each continent, and 
> - Plotting per-capita gdp versus population density. 
>


## Challenge solutions

> #### Solution to Challenge -- Putting the pieces together {.challenge}
>
>
> ```{r, Putting the pieces together - solution}
> library(ggplot2)
> 
> # Calculate country-level GDP
> mutate(gapminder, GDP = gdpPercap * pop) %>%  
> # Group by continent
>     group_by(continent) %>%  
> # Calculate continent-level GDP
>     summarize(cont_gdp = sum(GDP)) %>%  
> # Join the continent-GDP data.frame to the continents data.frame
>     left_join(continents) %>%  
> # Calculate continent-level per-capita GDP
>     mutate(per_cap = cont_gdp / population) %>%  
> # Plot gdp versus land area
>     ggplot(aes(x = area_km2, y = per_cap)) +  
> # Draw points
>     geom_point() +  
> # And label them
>     geom_text(aes(label = continent), nudge_y = 5e3)  
> ```
>