chap3-data-visualization.Rmd

---
title: "Chapter 3 -  Data visualisation"
author: "Salman Mohammed"
date: "February 7, 2017"
output: html_document
---

#### Importing the necessary libraries
```{r results='hide', message=FALSE, warning=FALSE}
library(tidyverse)
```


#### First look at the dataset
```{r}
mpg
```


#### A car's fuel efficiency vs. engine size
We want to see the relationship between a car's engine size (displ) and it's fuel efficiency
on the highway (hwy). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.

```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
```

In this plot, we can see a downward trend that explains a negative relationship between the two variables, so our hypothesis is that cars with bigger engines are less fuel efficient.

#### 3.2.4 - Exercises

1. Run ggplot(data = mpg) what do you see?
```{r}
ggplot(data = mpg)
```

2. How many rows are in mtcars? How many columns?
```{r}
nrow(mtcars)
ncol(mtcars)
```

3. What does the drv variable describe? Read the help for ?mpg to find out.
- drv describes the transmission system of the vehicle. The options are f = front-wheel drive, r = rear wheel drive, 4 = 4wd

4. Make a scatterplot of hwy vs cyl.
- hwy is the highway miles per gallon
- cyl is the number of cylinders
```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = hwy, y =cyl))
```

5. What happens if you make a scatterplot of class vs drv. - class is the 'type' of car
```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = class, y =drv))
```

6. Why is the plot not useful?
- The variables are both categorical, so the points on the plot overlap with one another.

#### Incorporating another variable in the plot with color/size/alpha/shape

```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class))

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, size = class))

## alpha controls the transparency of the points
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, alpha = class))

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, shape = class))
```


#### 3.3.1 - Exercises

1. What’s gone wrong with this code? Why are the points not blue?
- The color needs to be defined outside of aes()
```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
```

2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?
- Categorical - manufacturer, model, trans, drv, fl, class
- Continuous - displ, cyl, cty, hwy
- Categorical variables are type chr, whereas continuous variables are type dbl or int
```{r}
mpg
```

3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = cyl))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = cyl))

## shape cannot be applied to continuos variable
```
4. What happens if you map the same variable to multiple aesthetics?
```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = cyl, color = cyl))
```

5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
- Stroke adjusts the thickness of the border for shapes that can take on different colors both inside and outside. It only works for shapes 21-24.
```{r}
# For shapes that have a border (like 21), you can colour the inside and
# outside separately. Use the stroke aesthetic to modify the width of the
# border 
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), shape = 21, stroke = 3)
```

6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?
- R executes the code and creates a temporary variable containing the results of the operation. Here, the new variable takes on a value of TRUE if the engine displacement is less than 5 or FALSE if the engine displacement is more than or equal to 5.
```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, colour = displ < 5))
```

#### Facets

Facets are particularly useful for categorical variables. Split your plot into facets, subplots that each display one subset of the data.
To facet your plot by a single variable, use facet_wrap().

```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)
```

To facet your plot on the combination of two variables, add facet_grid() to your plot call.

```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ class)
```

#### 3.5.1 Exercises

1. What happens if you facet on a continuous variable?
- Your graph will not make much sense. R will try to draw a separate facet for each unique value of the continuous variable. If you have too many unique values, you may crash R.

2. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
- The empty plots mean that there are no points in our dataset that have that combination of drv and cyl values. For example, there are no 4-wheel drive cars in our dataset with 5 cylinders.

3. What plots does the following code make? What does . do?
- The . acts a placeholder for no variable. In facet_grid(), this results in a plot faceted on a single dimension (1 by NN or NN by 1) rather than an NN by NN grid.
```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)
```

4. What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?
```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
```
Faceting splits the data into separate grids and better visualizes trends within each individual facet. The disadvantage is that by doing so, it is harder to visualize the overall relationship across facets. The color aesthetic is fine when your dataset is small, but with larger datasets points may begin to overlap with one another. In this situation with a colored plot, jittering may not be sufficient because of the additional color aesthetic.

5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol variables?
- nrow sets how many rows the faceted plot will have.
- ncol sets how many columns the faceted plot will have.
- as.table determines the starting facet to begin filling the plot, and dir determines the starting direction for filling in the plot (horizontal or vertical).
- facet_grid forms a matrix of panels defined by row and column facetting variables.
- facet_grid does not have nrow and ncol because those values are obtained automatically from the levels of the variables.

6. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?
- This will put more columns (extend it vertically) and make the plot wider and this makes more sense with widescreen monitors - more viewing space. If you extend it horizontally, the plot will be compressed and harder to view.

#### 3.6 Geometric objects

A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom. As we see above, you can use different geoms to plot the same data.

```{r}
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
```

To display multiple geoms in the same plot, add multiple geom functions to ggplot():
```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
```

This is a cleaner way of doing the above with less duplication:
```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth()
```

#### 3.6.1 Exercises

1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
- Line chart - geom_line()
- Boxplot - geom_boxplot()
- Histogram - geom_histogram()
- Area chart - geom_area()
```{r}
## line chart - two variables: continuous x, continuous x
ggplot(data = mpg) + geom_line(mapping = aes(x = displ, y = hwy))

## box plot - two variables: discrete x, continuous x
ggplot(data = mpg) + geom_boxplot(mapping = aes(x = class, y = hwy))

  ## histogram - one variable: continuous
ggplot(data = mpg) + geom_histogram(mapping = aes(x = hwy), bins = 20)

## area chart - one variable: continuous
ggplot(data = mpg) + geom_area(stat = "bin", mapping = aes(x = hwy), bins = 20)
```

2. Run this code:
```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point(show.legend = FALSE) + 
  geom_smooth(se = FALSE, show.legend = FALSE)
```

3. What does show.legend = FALSE do?
It removes the legend. The aesthetics are still mapped and plotted, but the key is removed from the graph.
```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point(show.legend = FALSE) + 
  geom_smooth(se = FALSE, show.legend = FALSE)
```

4. What does the se argument to geom_smooth() do?
- When se is set to TRUE, the smoothed out line shows a shaded region for the confidence interval.

5. Will these two graphs look different? Why/why not?
```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
```
They look the same but the first plot contains less duplicate code.

6. Recreate the R code necessary to generate the following graphs.
```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(aes(group = drv), se = FALSE) +
  geom_point()

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(aes(color = drv)) + 
  geom_smooth(se = FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(aes(color = drv)) +
  geom_smooth(aes(linetype = drv), se = FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(size = 4, colour = "white") + 
  geom_point(aes(colour = drv))
```

#### 3.7 Statistical transformations

The diamonds dataset comes in ggplot2 and contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond. The following chart displays the total number of diamonds in the diamonds dataset, grouped by cut. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.
```{r}
# default stat = "count"
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))
```

There are three reasons you might need to use a stat explicitly:
1. You might want to override the default stat.
```{r}
demo <- tribble(
  ~a,      ~b,
  "bar_1", 20,
  "bar_2", 30,
  "bar_3", 40
)

# using stat = "identity"
ggplot(data = demo) +
  geom_bar(mapping = aes(x = a, y = b), stat = "identity")
```

2. You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportion, rather than count. To find the variables computed by the stat, look for the help section titled “computed variables”.
```{r}
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
```

3. You might want to draw greater attention to the statistical transformation in your code. For example, you might use stat_summary(), which summarises the y values for each unique x value, to draw attention to the summary that you’re computing:
```{r}
ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )
```

#### 3.7.1 Exercises

1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
- Default geom is "pointrange"
```{r}
ggplot(data = diamonds) +
  geom_pointrange(mapping = aes(x = cut, y = depth),
                  stat = "summary",
                  fun.ymin = min,
                  fun.ymax = max,
                  fun.y = median)
```


2. What does geom_col() do? How is it different to geom_bar()?
- geom_bar() uses the stat_count() statistical transformation to draw the bar graph. geom_col() assumes the values have already been transformed to the appropriate values.  geom_bar(stat = "identity") and geom_col() are equivalent.

3. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

4. What variables does stat_smooth() compute? What parameters control its behaviour?
stat_smooth() calculates four variables:
- y - predicted value
- ymin - lower pointwise confidence interval around the mean
- ymax - upper pointwise confidence interval around the mean
- se - standard error
See ?stat_smooth for more details on the specific parameters. Most importantly, method controls the smoothing method to be employed, se determines whether confidence interval should be plotted, and  level determines the level of confidence interval to use.

5. In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?
```{r}
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop..))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))
```
If we fail to set group = 1, the proportions for each cut are calculated using the complete dataset, rather than each subset of cut. Instead, we want the graphs to look like this:
```{r}
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = ..prop.., group = 1))
```

#### 3.8 Position adjustments