2_MLPS_R_visualization.Rmd

---
title: "2_MLPS_R_visualization"
author: "Zhe Zhang (TA - Heinz CMU PhD)"
date: "1/19/2017"
output: 
  html_document:
    css: '~/Dropbox/avenir-white.css'
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = F, error = F)
```

## Lecture 2: Business & Data Understanding

Key specific tasks we covered in this lecture are around exploratory data analysis.
* dataset attribute types
* working with categorical variables (`factor` variables in R)
* data matrix styles (long vs wide)
* checking for data quality 

* temporal line plot
* bar chart
* histogram and 1-dimensional density visualization (and bin sizes)
* 2-dimensional histograms
* kernel density estimation
* median, quantile, quartile, interquartile range
* box plots
* scatter plots (many plots, overplotting, correlation graphs)
* parallel coordinates
* radar plots

## Dataset Actions

```{r}
# use the flights dataset
require(tidyverse)

# use `glimpse()` or `str()` to see a dataset's attributes
glimpse(iris)
glimpse(mtcars)
```

When your data has categorical values, if dataset has already been cleaned, you may not need to do much with it except be aware of it. When there are issues or things you'd like to change about your categorical variables, use the `forcats` package. Also, it can help reduce errors to not use factors in the first place, the `tidyverse` packages should work just as well with text/character variables versus factors. To convert a column to a factor use `{r} df$factor_column <- as.factor(df$factor_column)`.

To convert a dataset between a "wide" format and a "long" format, use the `gather` and `spread` functions available in the tidyr package. See more here: <http://r4ds.had.co.nz/tidy-data.html#tidy-data-1>

```{r}
library(tidyverse)

table4a

# renaming to avoid the weird columns that start with numbers
renamed_table4a <- table4a %>% rename('yr_1999' = `1999`, 'yr_2000' = `2000`)

# gather the data to make it long and make each row an observation
renamed_table4a %>% 
  gather(yr_1999, yr_2000, key = "year", value = "cases")
```

See here <http://r4ds.had.co.nz/tidy-data.html#spreading> for more information on making data from long to wide.

To check for data quality and potential missing variables, two keys options are summarization and visualization (and both combined). We discuss visualization below. This will not catch all potential errors, but is a way to start. This is especially useful when you have subgroups or categorical variables from which to compare. This chapter covers a lot of the EDA that we talked about in class in R: <http://r4ds.had.co.nz/exploratory-data-analysis.html>

```{r}
library(dplyr)
library(nycflights13)

by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))

by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")

# It looks like delays increase with distance up to ~750 miles 
# and then decrease. Maybe as flights get longer there's more 
# ability to make up delays in the air?
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)

# cite: R4DS
```


## Visualization Actions

The R4DS has more examples and a thorough discussion of both visualization and especially, important, how to use visualization for exploratory analysis, which is focus of the early part of this course. I highly recommend reading through these two sections and looking at the examples. If you're not already comfortable with the pieces, doing so will be quite helpful for all your work in this class and going forward with R and analysis. I will only briefly touch on the specific commands below.

* <http://r4ds.had.co.nz/data-visualisation.html>
* <http://r4ds.had.co.nz/exploratory-data-analysis.html>

We recommend the use of the `ggplot2` package in this class. It has a consistent and clear syntax that helps you think about what data you're interested in visualizing. The R4DS book has a deeper explanation.

Basic graphics in `ggplot2` are a combination of two pieces:

* `ggplot(data = your_data)`
* `geom_SOME_VIZ_TYPE(mapping = aes(x = x_var, y = y_var, color = color_var))`

The `ggplot()` function tells which dataset to use. Then, we can think about each of the following pieces as adding layers to an empty image. You can add a line plot layer, a point layer, a histogram layer, etc. Each layer type will have different arguments, of at least 1 data variable to plot (e.g. histogram). Sometimes, we can make our layers capture additional information by tweaking the layer's aesthetics (color, size, shape, linetype) by giving them additional variables.

If you want color, size, shape, or linetype to not **vary** based on a data variable, place that option outside of the `aes()` (aesthetics) function call.

Sometimes, you will see the `ggplot()` function contain the aesthetics (as shown in examples below). This means that these will be the default aesthetics that will be passed to all geom_layers. This is useful when we plot several layers together (e.g., a line plot and a point plot together).

```{r}
# More specific examples of commands incoming
#   mostly based around ggplot2 package
library(tidyverse)
library(ggplot2) # redundant, but here for emphasis

# we'll use the mpg dataset
head(mpg)

# line plot
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_line()

# sometimes we want to group on different categorical variables
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, group = cyl)) +
  geom_line()
# but we can't tell the lines apart without color
ggplot(data = mpg, 
       mapping = aes(x = displ, y = hwy,
                     group = cyl, color = cyl)) +
  geom_line()
# ggplot is treating CYL as a continuous variable;
#   we want to avoid this
# but we can't tell the lines apart without color
ggplot(data = mpg, 
       mapping = aes(x = displ, y = hwy,
                     group = cyl, color = as.character(cyl))) +
  geom_line()

# bar chart (only uses one variable because it counts automatically)
#   IF you want to supply your own count instead, use
#   geom_bar(stat = "identity")
ggplot(mpg, aes(x = displ)) +
  geom_bar()

# histogram and 1-dimensional density visualization (and bin sizes)
ggplot(mpg, aes(x = hwy)) +
  geom_histogram()

ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2.5)

# 2-dimensional histograms/density
ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_density2d() +
  geom_point()

# kernel density estimation
#   we use the built in smoother in ggplot for now
#   GEOM_SMOOTH()
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy),
              method = "loess", span = 0.5)

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy),
              method = "loess", span = 2)

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy),
              method = "lm")

# median, quantile, quartile, interquartile range
# use ecdf() function for these attributes
#   the empirical cumulative density function
ecdf_hwy_mpg <- ecdf(mpg$hwy)
# median
quantile(ecdf_hwy_mpg, 0.5)
quantile(ecdf_hwy_mpg, seq(0.25, 0.75, by = 0.25))

# box plots
ggplot(mpg, aes(x = class, y = hwy, color = class)) +
  geom_boxplot()

# scatter plots (many plots, overplotting, correlation graphs)
ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point()

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + facet_wrap(~cyl)

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + facet_wrap(~class)

# parallel coordinates
#   for some
long_iris_df <- iris %>% mutate(obs_id = row_number()) %>%
  gather(key = measurement, value = value, -obs_id, -Species)
ggplot(long_iris_df, 
       aes(x = measurement, y = value, 
           group = obs_id, color = Species)) +
  geom_line()

# radar plots
#   if you're interested, check out this ggplot2 extension
# http://www.ggplot2-exts.org/ggradar.html
```

Notes from the above examples:

* in the parallel coordinates example, note that we could not have plotted this without making the iris data a long dataset, where each measurement of an observation gets one row. `ggplot` enforces this structure because it makes it clear what we're studying. In this case, we're interested in comparing the different measurements against each other, so measurement type should be a variable (hence, row), rather than a particular description of a measurement (i.e. column). This follows the tidy data philosophy (see R4DS for more).
* when doing line plots, it can be important to specify how we're connecting/grouping our data into a line. Use the `group` aesthetic to guide this and think through this.