---
title: "2_MLPS_R_visualization"
author: "Zhe Zhang (TA - Heinz CMU PhD)"
date: "1/19/2017"
output:
  html_document:
    css: '~/Dropbox/avenir-white.css'
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, error = FALSE)
```
## Lecture 2: Business & Data Understanding
The key tasks covered in this lecture center on exploratory data analysis:
* dataset attribute types
* working with categorical variables (`factor` variables in R)
* data matrix styles (long vs wide)
* checking for data quality
* temporal line plot
* bar chart
* histogram and 1-dimensional density visualization (and bin sizes)
* 2-dimensional histograms
* kernel density estimation
* median, quantile, quartile, interquartile range
* box plots
* scatter plots (many plots, overplotting, correlation graphs)
* parallel coordinates
* radar plots
## Dataset Actions
```{r}
# use the built-in iris and mtcars datasets for a first look
library(tidyverse)
# use `glimpse()` or `str()` to see a dataset's attributes
glimpse(iris)
glimpse(mtcars)
```
When your data has categorical values and the dataset has already been cleaned, you may not need to do much beyond being aware of them. When there are issues or things you'd like to change about your categorical variables, use the `forcats` package. It can also reduce errors to avoid factors in the first place: the `tidyverse` packages should work just as well with text/character variables as with factors. To convert a column to a factor, use `df$factor_column <- as.factor(df$factor_column)`.
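As a minimal sketch of the kind of factor clean-up `forcats` supports, the chunk below reorders and renames the levels of a small factor. The `sizes` vector is a made-up example, not course data, and the chunk assumes the `forcats` package is installed.

```{r}
library(forcats)

# hypothetical example data (not from the course datasets)
sizes <- c("small", "large", "medium", "small", "large", "small")

# as.factor() sorts the levels alphabetically
f <- as.factor(sizes)
levels(f)

# fct_relevel() puts the levels in an explicit, meaningful order
f_ordered <- fct_relevel(f, "small", "medium", "large")
levels(f_ordered)

# fct_infreq() orders the levels from most to least frequent
f_by_freq <- fct_infreq(f)
levels(f_by_freq)

# fct_recode() renames levels using new_name = "old_name"
fct_recode(f, S = "small", M = "medium", L = "large")
```

Level order matters mostly for plotting and modeling, since `ggplot2` draws discrete axes and legends in level order.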
To convert a dataset between a "wide" format and a "long" format, use the `gather()` and `spread()` functions in the `tidyr` package. See more here: <http://r4ds.had.co.nz/tidy-data.html#tidy-data-1>
```{r}
library(tidyverse)
table4a
# renaming to avoid the weird columns that start with numbers
renamed_table4a <- table4a %>% rename('yr_1999' = `1999`, 'yr_2000' = `2000`)
# gather the data to make it long and make each row an observation
renamed_table4a %>%
gather(yr_1999, yr_2000, key = "year", value = "cases")
```
See <http://r4ds.had.co.nz/tidy-data.html#spreading> for more information on converting data from long to wide.
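For the long-to-wide direction, a short sketch using `table2`, an example dataset that ships with `tidyr`. The second call uses `pivot_wider()`, which newer `tidyr` releases provide as the successor to `spread()`; it assumes `tidyr` 1.0.0 or later is installed.

```{r}
library(tidyr)

# table2 has one row per country, year, and type ("cases" or "population")
table2

# spread the `type` column into one column per type, filled from `count`
wide_table2 <- spread(table2, key = type, value = count)
wide_table2

# the same reshaping with the newer interface (assumes tidyr >= 1.0.0)
wide_table2_pivot <- pivot_wider(table2, names_from = type, values_from = count)
wide_table2_pivot
```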
To check for data quality and potential missing values, two key options are summarization and visualization (or both combined). We discuss visualization below. These checks will not catch every potential error, but they are a good way to start, and they are especially useful when you have subgroups or categorical variables to compare across. This R4DS chapter covers much of the EDA that we talked about in class: <http://r4ds.had.co.nz/exploratory-data-analysis.html>
```{r}
library(dplyr)
library(nycflights13)
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")
# It looks like delays increase with distance up to ~750 miles
# and then decrease. Maybe as flights get longer there's more
# ability to make up delays in the air?
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
# cite: R4DS
```
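The summarization side of a data-quality check can be sketched as follows: count missing values per column, then compare the share of missing values across a categorical variable. The choice of `carrier` as the subgroup is only an illustration.

```{r}
library(dplyr)
library(nycflights13)

# number of missing values in each column; show only columns that have any
na_counts <- colSums(is.na(flights))
na_counts[na_counts > 0]

# share of flights with a missing departure delay, by carrier
missing_by_carrier <- flights %>%
  group_by(carrier) %>%
  summarise(
    n_flights = n(),
    share_missing_dep_delay = mean(is.na(dep_delay))
  ) %>%
  arrange(desc(share_missing_dep_delay))
missing_by_carrier
```

Large differences between subgroups are a hint that values are not missing at random and deserve a closer look.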
## Visualization Actions
R4DS has more examples and a thorough discussion of both visualization and, importantly, how to use visualization for exploratory analysis, which is the focus of the early part of this course. I highly recommend reading through these two sections and looking at the examples. If you're not already comfortable with the pieces, doing so will be quite helpful for all your work in this class and going forward with R and analysis. I will only briefly touch on the specific commands below.
* <http://r4ds.had.co.nz/data-visualisation.html>
* <http://r4ds.had.co.nz/exploratory-data-analysis.html>
We recommend the use of the `ggplot2` package in this class. It has a consistent and clear syntax that helps you think about what data you're interested in visualizing. The R4DS book has a deeper explanation.
Basic graphics in `ggplot2` are a combination of two pieces:
* `ggplot(data = your_data)`
* `geom_SOME_VIZ_TYPE(mapping = aes(x = x_var, y = y_var, color = color_var))`
The `ggplot()` function specifies which dataset to use. We can then think of each subsequent piece as adding a layer to an empty image: a line plot layer, a point layer, a histogram layer, etc. Each layer type takes different arguments and needs at least one data variable to plot (a histogram, for example, needs only `x`). We can also make a layer capture additional information by mapping further variables to its aesthetics (color, size, shape, linetype).
If you want color, size, shape, or linetype to not **vary** based on a data variable, place that option outside of the `aes()` (aesthetics) function call.
Sometimes, you will see the `ggplot()` function contain the aesthetics (as shown in the examples below). These then become the default aesthetics passed to every `geom_*` layer. This is useful when we plot several layers together (e.g., a line plot and a point plot together).
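A small sketch of the two ideas above, using the `mpg` dataset from `ggplot2`. The choice of `drv` as the color variable and `"steelblue"` as the fixed color are illustrative only.

```{r}
library(ggplot2)

# aesthetics set in ggplot() are the defaults for both layers below;
# color is inside aes(), so it varies with the drv variable
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_line() +
  geom_point()
p_mapped

# color and size are outside aes(), so every point gets the same fixed value
p_fixed <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "steelblue", size = 2)
p_fixed
```

A common mistake is writing `aes(color = "steelblue")`, which treats the string as a one-level data variable and adds a legend instead of setting the color.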
```{r}
# More specific examples of commands incoming
# mostly based around ggplot2 package
library(tidyverse)
library(ggplot2) # redundant, but here for emphasis
# we'll use the mpg dataset
head(mpg)
# line plot
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_line()
# sometimes we want to group on different categorical variables
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, group = cyl)) +
geom_line()
# but we can't tell the lines apart without color
ggplot(data = mpg,
mapping = aes(x = displ, y = hwy,
group = cyl, color = cyl)) +
geom_line()
# ggplot is treating cyl as a continuous variable;
# to avoid this, convert it to a discrete (character) variable
# so that each group gets its own distinct color
ggplot(data = mpg,
mapping = aes(x = displ, y = hwy,
group = cyl, color = as.character(cyl))) +
geom_line()
# bar chart (only uses one variable because it counts automatically)
# if you want to supply your own counts instead, use
# geom_bar(stat = "identity") or, equivalently, geom_col()
ggplot(mpg, aes(x = displ)) +
geom_bar()
# histogram and 1-dimensional density visualization (and bin sizes)
ggplot(mpg, aes(x = hwy)) +
geom_histogram()
ggplot(mpg, aes(x = hwy)) +
geom_histogram(binwidth = 2.5)
# 2-dimensional density contours (for a 2-dimensional histogram, use geom_bin2d())
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_density2d() +
geom_point()
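# smoothing (related to kernel density estimation)
# geom_smooth() fits a smoothed trend line, here with loess (local regression);
# a larger span gives a smoother curve, and method = "lm" fits a straight line
# for a kernel density estimate of a single variable, see geom_density()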
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy),
method = "loess", span = 0.5)
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy),
method = "loess", span = 2)
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy),
method = "lm")
# median, quantile, quartile, interquartile range
# one option is ecdf(), the empirical cumulative distribution function
ecdf_hwy_mpg <- ecdf(mpg$hwy)
# median
quantile(ecdf_hwy_mpg, 0.5)
# quartiles (25%, 50%, 75%); the interquartile range is the 75% value minus the 25% value
quantile(ecdf_hwy_mpg, seq(0.25, 0.75, by = 0.25))
# box plots
ggplot(mpg, aes(x = class, y = hwy, color = class)) +
geom_boxplot()
# scatter plots (many plots, overplotting, correlation graphs)
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() + facet_wrap(~cyl)
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() + facet_wrap(~class)
# parallel coordinates
# reshape to long format first, so each measurement of an observation is one row
long_iris_df <- iris %>% mutate(obs_id = row_number()) %>%
gather(key = measurement, value = value, -obs_id, -Species)
ggplot(long_iris_df,
aes(x = measurement, y = value,
group = obs_id, color = Species)) +
geom_line()
# radar plots
# if you're interested, check out this ggplot2 extension
# http://www.ggplot2-exts.org/ggradar.html
```
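For the one-dimensional distribution of `hwy`, the sketch below shows a kernel density estimate drawn with `geom_density()` and the quartiles computed directly from the vector. The `adjust` values are arbitrary choices meant only to show the effect of bandwidth, in the same way `binwidth` affects a histogram.

```{r}
library(ggplot2)

# kernel density estimate of highway mileage;
# adjust multiplies the default bandwidth (smaller = wigglier curve)
ggplot(mpg, aes(x = hwy)) +
  geom_density(adjust = 0.5)

# a wider bandwidth gives a smoother curve
ggplot(mpg, aes(x = hwy)) +
  geom_density(adjust = 2)

# quartiles and interquartile range computed directly, without ecdf()
hwy_quartiles <- quantile(mpg$hwy, probs = c(0.25, 0.5, 0.75))
hwy_quartiles
hwy_iqr <- IQR(mpg$hwy)
hwy_iqr
```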
Notes from the above examples:
* in the parallel coordinates example, note that we could not have plotted this without making the iris data a long dataset, where each measurement of an observation gets one row. `ggplot2` expects this structure because it makes it clear what we're studying. In this case, we're interested in comparing the different measurements against each other, so measurement type should be a variable (a `measurement` column, with one row per measured value) rather than being encoded in the column names. This follows the tidy data philosophy (see R4DS for more).
* when doing line plots, it can be important to specify how the data points are connected into lines. Use the `group` aesthetic to control this, and think through which rows belong on the same line.
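
As a sketch of the `group` aesthetic in a temporal line plot, the chunk below draws one line per origin airport from the `flights` data. Summarising by month and origin is an illustrative choice, not something covered in the lecture.

```{r}
library(dplyr)
library(ggplot2)
library(nycflights13)

# mean departure delay for each origin airport and month
monthly_delay <- flights %>%
  group_by(origin, month) %>%
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ungroup()

# group = origin connects the points within each airport into a separate line;
# without it, a single line would zigzag between the airports within each month
ggplot(monthly_delay,
       aes(x = month, y = mean_dep_delay, group = origin, color = origin)) +
  geom_line() +
  geom_point()
```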