-
Notifications
You must be signed in to change notification settings - Fork 3
/
linear_modeling_and_broom.Rmd
280 lines (181 loc) · 6.95 KB
/
linear_modeling_and_broom.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
---
output: html_document
---
```{r setup, include=FALSE}
# load libraries
library(tidyverse)
library(conflicted)
library(viridis)
library(broom)
library(magrittr)
library(nycflights13)
# configure knit settings
knitr::opts_chunk$set(echo = TRUE, fig.width = 6, fig.height = 4)
# resolve package conflicts
filter <- dplyr::filter
select <- dplyr::select
```
# Linear modeling and `broom`
## Tests
### correlation
Correlation gives you the direction and strength of the linear association between two numeric variables. What can you use it to test?
- Is the expression of this gene associated with the expression of another gene?
- Is the expression of this gene associated with the dose of the drug?
- Is the level of this molecule associated with a disease state?
<br>
#### Plot first
Before deciding whether to test something, you should always plot first! Visualizing the variables helps you decided what test/action to take next.
```{r}
ggplot(iris, aes(x = Sepal.Length, y = Petal.Width)) +
geom_point() +
labs(x = 'Sepal Length (cm)', y = 'Petal Width (cm)') +
theme_classic() +
theme(axis.text = element_text(size = 14))
```
The variables look linearly related, so calculating the correlation between them might tell us something.
<br>
#### `cor()`
`cor()` gives you the correlation (the r value) between two numeric values.
```{r}
# correlated
cor(iris$Sepal.Length, iris$Petal.Width)
# not correlated
cor(iris$Sepal.Length, iris$Sepal.Width)
```
<br>
#### Side note: the `%$%` pipe
Remember how some tests we couldn't pipe into? The solution is the `%$%` pipe!!! Basically, behind the scenes it translates the column names to base R, so functions can take them. It's in the magrittr package (which is installed along with the tidyverse), so you have to load the library before using the `%$%` pipe. The library is in the setup chunk at the top of the document, so if you haven't already, make sure to run that chunk before continuing
```{r}
# tidy t test example
iris %$% t.test(Sepal.Length, Petal.Width) %>% tidy()
# and use it to pipe into cor()
iris %$% cor(Sepal.Length, Petal.Width)
```
<br>
#### `cor.test()`
Similar to a t test, there's a correlation test. In addition to giving you the correlation between two variables, it gives you a p-value, confidence interval, and a few other statistics.
```{r}
# correlated
cor.test(iris$Sepal.Length, iris$Petal.Width)
# not correlated
cor.test(iris$Sepal.Length, iris$Sepal.Width)
# tidy
iris %$% cor.test(Sepal.Length, Petal.Width) %>% tidy()
```
<br>
### linear model / linear regression
Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data (by minimizing residuals). Then you use can use it for making further predictions.
<br>
#### Plot first
Before deciding whether to test something, you should always plot first! Visualizing the variables helps you decided what test/action to take next.
```{r}
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point(size = 3) +
labs(x = 'Weight (1,000 lbs)', y = 'Miles / Gallon') +
theme_classic() +
theme(axis.text = element_text(size = 14))
```
Again, the variables appear linearly related, so a linear model might be interesting here.
<br>
#### model
`lm()` calculates the linear regression/linear model/line of best fit. It gives you the slope and y-intercept of the lines, the error, the test statistic, and the p-value
```{r}
lm(mpg ~ wt, data = mtcars) %>% tidy()
# can pipe into lm()
mtcars %>% lm(mpg ~ wt, data = .) %>% tidy()
```
<br>
#### plot again
Now, you can add the regression line to the scatterplot.
```{r}
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point(size = 3) +
geom_abline(slope = -5.34, intercept = 37.3, linetype = 'dashed', size = 1, color = 'red') +
labs(x = 'Weight (1,000 lbs)', y = 'Miles / Gallon') +
theme_classic() +
theme(axis.text = element_text(size = 14))
```
<br>
#### predict
You can use the slope and intercept from the regression line to make predictions. For example, what mpg would a car weighing 2,500 lbs have?
```{r}
2.5 * -5.34 + 37.3
```
<br>
### ANOVA
#### Plot first
Before deciding whether to test something, you should always plot first! Visualizing the variables helps you decided what test/action to take next.
```{r}
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
geom_point(size = 3) +
labs(x = 'Sepal Length (cm)', y = 'Sepal Width (cm)') +
theme_classic() +
theme(axis.text = element_text(size = 14))
```
<br>
Sometimes it doesn't look interesting at first and you need to plot multiple times.
```{r}
# with color by species
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
geom_point(size = 3) +
labs(x = 'Sepal Length (cm)', y = 'Sepal Width (cm)') +
theme_classic() +
theme(axis.text = element_text(size = 14))
```
<br>
#### test
Analysis of variance (ANOVA) is (almost) the same as a pairwise t test; it compares means among three or more groups, but minimizes the error due to multiple testing. Use `aov()` for an ANOVA
```{r}
aov(Sepal.Length ~ Species, data = iris) %>% tidy()
```
<br>
#### `TukeyHSD()`
Just knowing that there is a difference between species isn't super interesting, so after ANOVA, you can use a Tukey test for pairwise comparisons of the groups using `TukeyHSD()`
```{r}
aov(Sepal.Length ~ Species, data = iris) %>% TukeyHSD() %>% tidy()
```
<br>
One thing you have to watch out for with `TukeyHSD()`, is that the variable you're doing comparisons over has to be a character or a factor. Notice the error we get when we try to do a Tukey test over the months in the `flights` dataset.
```{r}
flights
aov(dep_time ~ month, data = flights) %>% TukeyHSD()
```
<br>
Fix the problem by wrapping the numeric column that's not really numeric in either `as.factor()` or `as.character()`
```{r}
# as.factor()
aov(dep_time ~ as.factor(month), data = flights) %>% TukeyHSD() %>% tidy()
# as.character()
aov(dep_time ~ as.character(month), data = flights) %>% TukeyHSD() %>% tidy()
```
<br><br>
## `broom`
The `broom` package is for tidying up base R statistical tests and models. We've touched on the `tidy()` function from the package before, but it has two more useful functions.
<br>
### `glance()`
`glance()` gives you the model parameters. For the most part, you don't need to worry about these (but it's nice to known how to get them easily if you need them)
```{r}
aov(Sepal.Length ~ Species, data = iris) %>% glance()
```
<br>
Sometimes `tidy()` and `glance()` give the same output
```{r}
# t test
t.test(extra ~ group, data = sleep) %>% tidy()
t.test(extra ~ group, data = sleep) %>% glance()
```
<br>
### `augment()`
`augment()` adds information from your test/model back to a data table
```{r}
lm(Sepal.Width ~ Species, data = iris) %>% augment(iris) -> iris_lm
iris_lm
```
<br>
You can then do something with the original data and the model parameters together.
```{r}
ggplot(iris_lm, aes(x = Species, y = .resid, fill = Species)) +
geom_violin() +
theme_classic()
```
<br><br>