---
title: "Chapters 6 to 8"
author: "Laura"
date: "11/12/2019"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE)

library(tidyverse); library(skimr); library(nycflights13); library(GGally); library(ggstance); library(lvplot); library(hexbin); library(modelr)

```

## Notes for Chapter 6: Workflow: scripts 

### 6.1 Running code

Helpful keyboard shortcuts are:

* `Cmd/Ctrl + Enter` runs the complete command under the cursor and moves the cursor to the next command

* `Cmd/Ctrl + Shift + S` will execute the whole script


## Notes for Chapter 7: Exploratory Data Analysis

EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data.

To do data cleaning, you’ll need to deploy all the tools of EDA: visualisation, transformation, and modelling.

### 7.2 Questions

> “There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox

> “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey

There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:

1. What type of variation occurs within my variables?

2. What type of covariation occurs between my variables?

Tabular data is _tidy_ if each _value_ is placed in its own “cell”, each _variable_ in its own column, and each _observation_ in its own row.

Things to do with a brand new dataset:

1. Visualize variable distributions
2. Examine typical values
3. Examine unusual values
4. Handle missing values
5. Explore covariation
6. Look for patterns and fit models

#### 7.3.1 Visualising distributions

```{r ch731}
diamonds %>% 
  ggplot() +
    geom_bar(aes(x = cut))
    
diamonds %>% 
  count(cut)

diamonds %>% 
  ggplot() +
    geom_histogram(aes(carat), binwidth = 0.5)

diamonds %>% 
  count(cut_width(carat, 0.5))

# a more pipe-heavy version is awkward: as_tibble() on a bare vector triggers a warning

diamonds$carat %>% 
  cut_width(0.5) %>% 
  as_tibble() %>%  
  count(value)

# enframe() does the same job without the warning

diamonds$carat %>% 
  cut_width(0.5) %>% 
  enframe() %>%  
  count(value)

smaller <- diamonds %>% 
  filter(carat < 3)

smaller %>% 
    ggplot(aes(carat)) +
    geom_histogram() +
    geom_freqpoly(aes(color = reorder(cut, carat, FUN = median )))


```

#### 7.3.3 Unusual values


```{r ch733}
ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

```

To make the unusual values easy to see, we need to zoom in to small values of the y-axis with `coord_cartesian()`. `coord_cartesian()` also has an `xlim` argument for when you need to zoom into the x-axis. ggplot2 also has `xlim()` and `ylim()` functions that work slightly differently: *they throw away the data outside the limits*.

#### 7.3.4 Exercises

Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

```{r ex7341}

diamonds %>% skim()

diamonds %>% 
  ggplot() +
    geom_histogram(aes(x), binwidth = 0.01) 

diamonds %>% 
  ggplot() +
    geom_histogram(aes(x), binwidth = 0.1) +
    coord_cartesian(ylim = c(0, 50))

diamonds %>% 
  filter(x < 3 | x > 9.5) %>% 
  select(price, x, y, z) %>% 
  arrange(x)


```


Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

```{r ex7342}

diamonds %>% 
  ggplot() +
    geom_histogram(aes(price), binwidth = 10) +
    coord_cartesian(ylim = c(0,10))

```


How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

```{r ex7343}

(carat_099 <- diamonds %>%
  filter(carat == 0.99) %>% 
  count())

(carat_1 <- diamonds %>%
  filter(carat == 1) %>% 
  count())


```
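
One way to look into the difference, sketched here under the assumption that the exercise is about rounding, is to count the diamonds at each recorded carat value close to 1. If far more stones sit at exactly 1 than just below it, that would be consistent with weights being rounded up, or stones being cut to reach the round number. The object name `carat_near_1` is made up for this sketch.

```{r ex7343-sketch}
library(dplyr)
library(ggplot2)  # provides the diamonds data

# count diamonds at each recorded carat value between 0.9 and 1.1
carat_near_1 <- diamonds %>% 
  filter(carat >= 0.9, carat <= 1.1) %>% 
  count(carat)

carat_near_1
```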


Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

```{r ex7344}

diamonds %>% 
  ggplot() +
    geom_histogram(aes(x), binwidth = 0.10) +
    coord_cartesian(ylim = c(0, 250))

diamonds %>% 
  ggplot() +
    geom_histogram(aes(x)) +
    coord_cartesian(ylim = c(0, 250))


diamonds %>% 
  ggplot() +
    geom_histogram(aes(x), binwidth = 0.10) +
    ylim(c(0,250))

diamonds %>% 
  ggplot() +
    geom_histogram(aes(x))

diamonds %>% 
  ggplot() +
    geom_histogram(aes(x)) +
    ylim(c(0,250))
```

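* `coord_cartesian()` zooms in on the histogram _after_ the bins have been calculated, so the bars themselves do not change. `xlim()` and `ylim()` instead set scale limits and discard whatever falls outside them: `xlim()` drops observations before binning, so the bins are computed from the remaining data only, and `ylim()` drops any bar whose count falls outside the limits. That is why the same histogram can look so different when one of the `lim` functions is used; with `xlim()` and no `binwidth`, even the default 30 bins are recomputed over the narrower range.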
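
A rough way to check this is to compare how many observations end up in the bins under each approach. The sketch below relies on ggplot2's `layer_data()` helper, which returns the data computed for a layer; the limits 5 and 7 and the object names are arbitrary choices for illustration.

```{r ex7344-sketch}
library(ggplot2)

# zooming: every diamond is still binned, only the visible window changes
p_zoom <- ggplot(diamonds, aes(x)) +
  geom_histogram(binwidth = 0.1) +
  coord_cartesian(xlim = c(5, 7))

# scale limits: diamonds with x outside the limits are dropped before binning
# (ggplot2 warns about the removed rows)
p_lim <- ggplot(diamonds, aes(x)) +
  geom_histogram(binwidth = 0.1) +
  xlim(5, 7)

# total number of observations that were counted in the bins
n_zoom <- sum(layer_data(p_zoom)$count)
n_lim  <- sum(layer_data(p_lim)$count)

c(zoom = n_zoom, lim = n_lim, all = nrow(diamonds))
```

If the two approaches were equivalent, the first two numbers would match.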

### 7.4 Missing values

```{r ch74}

diamonds2 <- diamonds %>% 
  filter(between(y, 3, 20))

diamonds2 <- diamonds %>% 
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

# an alternative to ifelse() is case_when(); example from ?case_when
starwars %>%
  select(name:mass, gender, species) %>%
  mutate(
    type = case_when(
      height > 200 | mass > 200 ~ "large",
      species == "Droid"        ~ "robot",
      TRUE                      ~ "other"
    )
  )

ggplot(data = diamonds2, mapping = aes(x = x, y = y)) + 
  geom_point()

nycflights13::flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>% 
  ggplot(mapping = aes(sched_dep_time)) + 
    geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)


ggplot(data = diamonds2, mapping = aes(x = x, y = y)) + 
  geom_point(na.rm = TRUE)
```

#### 7.4.1 Exercises

What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

```{r ex7411}

glimpse(diamonds2)
skim(diamonds2)

diamonds3 <- diamonds2 %>% 
  mutate(
    # flag the rows first, so both columns are tested against the original color
    to_na = color == "E" & price == 326,
    color = if_else(to_na, NA_character_, as.character(color)),
    price = if_else(to_na, NA_integer_, price)
  ) %>% 
  select(-to_na)

skim(diamonds3)

diamonds3 %>% ggplot() +
  geom_bar(aes(color))

ggplot(diamonds3) +
  geom_histogram(aes(price))

```

The histogram drops the rows with a missing `price` (with a warning), because a missing value cannot be placed in a numeric bin. The bar chart keeps the missing `color` values and draws them as a separate `NA` bar, because `NA` can be treated as one more category.

What does na.rm = TRUE do in mean() and sum()?

```{r ex7412}

sum(c(1, 2, 3, NA))
sum(c(1, 2, 3, NA), na.rm = TRUE)

mean(c(1, 2, 3, NA))
mean(c(1, 2, 3, NA), na.rm = TRUE)
```

`na.rm = TRUE` removes the missing values before the calculation; without it, a single `NA` makes the result `NA`.

### 7.5 Covariation

```{r ch75}

ggplot(diamonds, aes(price, after_stat(density))) +
  geom_histogram()

ggplot(diamonds, aes(cut, price)) + 
  geom_boxplot() +
  geom_jitter(alpha = 0.01)

ggplot(diamonds, aes(cut, price)) + 
  geom_violin() +
  geom_jitter(alpha = 0.01)

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

bpflip <- ggplot(mpg) +
  geom_boxplot(aes(reorder(class, hwy, FUN = median), hwy)) +
  coord_flip()
  
bpflip +  xlab("Car class")

```

#### 7.5.1.1 Exercises

* Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.

```{r ex75111}
flights <- flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time_h = sched_hour + sched_min / 60
  )

ggplot(flights) +
  geom_boxplot(aes(cancelled, sched_dep_time_h))

ggplot(flights) +
  geom_jitter(aes(cancelled, sched_dep_time_h))

ggplot(flights) +
  geom_violin(aes(cancelled, sched_dep_time_h))

ggplot(flights) +
  geom_freqpoly(aes(sched_dep_time_h, y = after_stat(density), color = cancelled))

```

* What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

```{r ex75112}
# GGally::ggpairs() lets you inspect the pairwise relationships among several variables at once
diamonds %>% ggpairs(columns = c(1, 5, 7:10))

ggplot(diamonds) +
  geom_boxplot(aes(cut, carat))

ggplot(diamonds) +
  geom_violin(aes(cut, carat))

ggplot(diamonds) +
  geom_histogram(aes(carat, y = after_stat(density))) +
  facet_wrap(~ cut, ncol = 1)


```

Carat appears to be the most important variable for predicting the price of a diamond: the higher the carat, the higher the price. Carat is also related to cut: the better cuts, Ideal in particular, contain many more low-carat diamonds than the lower cuts. That explains why diamonds with a better cut appear to be cheaper: on average they are of lower carat than worse-cut diamonds.
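
A quick numeric check of this reasoning, as a sketch rather than a full answer, is to summarise carat and price by cut. The summary column names are made up for this example.

```{r ex75112-sketch}
library(dplyr)
library(ggplot2)  # provides the diamonds data

# typical size and price of the diamonds in each cut category
cut_summary <- diamonds %>% 
  group_by(cut) %>% 
  summarise(
    n = n(),
    median_carat = median(carat),
    median_price = median(price)
  )

cut_summary
```

If the explanation holds, the cuts with the smaller median carat should also tend to show the lower median price.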

* Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?

```{r ex75113}
ggplot(diamonds) +
  geom_boxplot(aes(cut, carat)) +
  coord_flip()

# note: geom_boxploth() needs the aesthetics swapped (continuous variable on x, categorical on y)
ggplot(diamonds) +
  geom_boxploth(aes(carat, cut))

```
I don't see any difference between the output of `coord_flip()` and that of `geom_boxploth()`. The only change is that `geom_boxploth()` needs the aesthetics in the opposite order (continuous variable on x, categorical variable on y) to run correctly.


* One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?

* Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?

* If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.

#### 7.5.2 Two categorical variables

```{r ch7521}

ggplot(diamonds) +
  geom_count(aes(cut, color))

diamonds %>% 
  count(color, cut)


diamonds %>% 
  count(color, cut) %>% 
  ggplot(aes(cut, color, fill = n)) +
    geom_tile()
```
If the categorical variables are unordered, you might want to use the `seriation` package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the `d3heatmap` or `heatmaply` packages, which create interactive plots.

#### 7.5.2.1 Exercises

* How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?

* Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?

* Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?
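
For the first exercise, one possible approach (a sketch, not the only answer) is to convert the counts to proportions within each colour, so that the tiles of every colour sum to 1. The names `cut_within_color` and `prop` are introduced here for illustration.

```{r ex75211-sketch}
library(dplyr)
library(ggplot2)

cut_within_color <- diamonds %>% 
  count(color, cut) %>% 
  group_by(color) %>% 
  mutate(prop = n / sum(n)) %>%   # share of each cut within a colour
  ungroup()

ggplot(cut_within_color, aes(color, cut, fill = prop)) +
  geom_tile()
```

Grouping by `cut` instead of `color` would show the distribution of colour within cut.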

#### 7.5.3 Two continuous variables

Scatterplots become less useful as the dataset grows, because the points begin to overplot. Adding transparency with the `alpha` aesthetic helps, but it can be challenging for very large datasets. Another solution is to bin. Previously you used `geom_histogram()` and `geom_freqpoly()` to bin in one dimension. Now you’ll learn how to use `geom_bin2d()` and `geom_hex()` to bin in two dimensions.

```{r ch7531}

ggplot(smaller) +
  geom_bin2d(aes(carat, price))

ggplot(diamonds) +
  geom_bin2d(aes(carat, price))

# geom_hex() needs the hexbin package: install.packages("hexbin")
ggplot(smaller) +
  geom_hex(aes(carat, price))
```

Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualising the combination of a categorical and a continuous variable that you learned about. For example, you could bin carat and then for each group, display a boxplot:

```{r ch7532}

ggplot(smaller, aes(carat, price)) + 
  geom_boxplot(aes(group = cut_width(carat, 0.1)))

# varwidth = TRUE scales each box's width with the number of observations it contains
ggplot(smaller, aes(carat, price)) + 
  geom_boxplot(aes(group = cut_width(carat, 0.1)), varwidth = TRUE)

# another option is to put roughly the same number of points in each bin, so that the width
# of each box shows how much of the x-axis it covers; I find the previous plot more intuitive
ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_number(carat, 20)))

```

#### 7.5.3.1 Exercises

* Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?

* Visualise the distribution of carat, partitioned by price.

* How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?

* Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat, and price.

* Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.
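
The plot this last exercise refers to is not reproduced in these notes. As far as I recall, the book's version is a scatterplot of `x` against `y`, zoomed in to the bulk of the data; a sketch along those lines (the object name `xy_scatter` is my own):

```{r ex75315-sketch}
library(ggplot2)

# scatterplot of the x and y dimensions, zoomed in to the main cloud of points
xy_scatter <- ggplot(diamonds) +
  geom_point(aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

xy_scatter
```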

### 7.6 Patterns and models

If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.

Models are a tool for extracting patterns out of data. For example, consider the diamonds data. It’s hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It’s possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain. The following code fits a model that predicts price from carat and then computes the residuals (the difference between the predicted value and the actual value). The residuals give us a view of the price of the diamond, once the effect of carat has been removed.

```{r ch761}
mod <- lm(log(price) ~ log(carat), data = diamonds)

diamonds2 <- diamonds %>% 
  add_residuals(mod) %>% 
  mutate(resid = exp(resid))

ggplot(diamonds2) + 
  geom_point(aes(x = carat, y = resid))

ggplot(diamonds2) + 
  geom_boxplot(aes(cut, resid))

```
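
To make the residual step concrete, here is a small sketch that uses base R's `predict()` and `residuals()` instead of `add_residuals()`. It checks that the residual is the observed log price minus the predicted log price, so exponentiating it gives the ratio of the actual price to the price predicted from carat alone. The object names (`mod_sketch`, `pred_log`, `resid_log`, `price_ratio`) are chosen for illustration.

```{r ch761-sketch}
library(ggplot2)  # provides the diamonds data

mod_sketch <- lm(log(price) ~ log(carat), data = diamonds)

pred_log  <- predict(mod_sketch, newdata = diamonds)  # predicted log(price)
resid_log <- log(diamonds$price) - pred_log           # residual on the log scale

# same values as the residuals stored in the fitted model
all.equal(unname(resid_log), unname(residuals(mod_sketch)))

# exp(residual) is the actual price divided by the price predicted from carat
price_ratio <- diamonds$price / exp(pred_log)
all.equal(unname(exp(resid_log)), unname(price_ratio))
```

A value of `exp(resid)` above 1 therefore marks a diamond that costs more than its carat alone would suggest, and a value below 1 one that costs less.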

### 7.8 Learning more

Another useful resource is the [R Graphics Cookbook by Winston Chang](http://www.cookbook-r.com/Graphs/).

## Notes for Chapter 8: Workflow: projects

There is a great pair of keyboard shortcuts that will work together to make sure you’ve captured the important parts of your code in the editor:

1. Press Cmd/Ctrl + Shift + F10 to restart the R session.
2. Press Cmd/Ctrl + Shift + S to rerun the current script.


## Wishlist:

* Using purrr for a histogram with varying binwidths.
* Using purrr to load several packages at one time?
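
A possible starting point for both items, offered as a sketch: `purrr::map()` can build one histogram per binwidth, and `purrr::walk()` can call `library()` on a vector of package names. The binwidths, the package names and the object names below are placeholders.

```{r wishlist-sketch}
library(ggplot2)
library(purrr)

# one histogram of carat per binwidth
binwidths <- c(0.01, 0.1, 0.5)
carat_hists <- map(binwidths, function(bw) {
  ggplot(diamonds, aes(carat)) +
    geom_histogram(binwidth = bw) +
    ggtitle(paste("binwidth =", bw))
})
walk(carat_hists, print)

# attach several packages in one go; character.only = TRUE lets library()
# accept the package name as a string
pkgs <- c("dplyr", "tidyr")
walk(pkgs, library, character.only = TRUE)
```

`walk()` is used instead of `map()` where only the side effect matters (printing a plot, attaching a package) and the return value is not needed.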