# Collinearity
```{r, include = FALSE}
knitr::opts_chunk$set(cache = TRUE, autodep = TRUE, fig.align = "center")
```
> "If I look confused it is because I am thinking."
>
> --- **Samuel Goldwyn**
After reading this chapter you will be able to:
- Identify collinearity in regression.
- Understand the effect of collinearity on regression models.
## Exact Collinearity
Let's create a dataset where one of the predictors, $x_3$, is a linear combination of the other predictors.
```{r}
gen_exact_collin_data = function(num_samples = 100) {
x1 = rnorm(n = num_samples, mean = 80, sd = 10)
x2 = rnorm(n = num_samples, mean = 70, sd = 5)
x3 = 2 * x1 + 4 * x2 + 3
y = 3 + x1 + x2 + rnorm(n = num_samples, mean = 0, sd = 1)
data.frame(y, x1, x2, x3)
}
```
Notice that, given the way we are generating this data, the response $y$ only really depends on $x_1$ and $x_2$.
```{r}
set.seed(42)
exact_collin_data = gen_exact_collin_data()
head(exact_collin_data)
```
What happens when we attempt to fit a regression model in `R` using all of the predictors?
```{r}
exact_collin_fit = lm(y ~ x1 + x2 + x3, data = exact_collin_data)
summary(exact_collin_fit)
```
We see that `R` simply decides to exclude a variable. Why is this happening?
```{r, eval = FALSE}
X = cbind(1, as.matrix(exact_collin_data[,-1]))
solve(t(X) %*% X)
```
If we attempt to find $\boldsymbol{\hat{\beta}}$ using $\left( \boldsymbol{X}^T \boldsymbol{X} \right)^{-1}$, we see that this is not possible because the columns of $\boldsymbol{X}$ are linearly dependent. (The previous chunk is not evaluated, since it produces an error.)
When this happens, we say there is **exact collinearity** in the dataset.
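Although we cannot invert $\boldsymbol{X}^T \boldsymbol{X}$, we can verify the linear dependence directly by checking the rank of $\boldsymbol{X}$. (This check is an added sketch, not part of the original analysis; `qr()` is base `R`.)
```{r}
# Added check: the rank of X is less than its number of columns,
# confirming that the columns are linearly dependent.
X = cbind(1, as.matrix(exact_collin_data[, -1]))
qr(X)$rank
ncol(X)
```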
As a result of this issue, `R` essentially chose to fit the model `y ~ x1 + x2`. However, notice that two other models would accomplish exactly the same fit.
```{r}
fit1 = lm(y ~ x1 + x2, data = exact_collin_data)
fit2 = lm(y ~ x1 + x3, data = exact_collin_data)
fit3 = lm(y ~ x2 + x3, data = exact_collin_data)
```
We see that the fitted values for each of the three models are exactly the same. This is a result of $x_3$ containing all of the information from $x_1$ and $x_2$. As long as one of $x_1$ or $x_2$ is included in the model, $x_3$ can be used to recover the information from the variable not included.
```{r}
all.equal(fitted(fit1), fitted(fit2))
all.equal(fitted(fit2), fitted(fit3))
```
While their fitted values are all the same, their estimated coefficients are wildly different. The sign of $x_2$ is switched in two of the models! So, while only `fit1` properly *explains* the relationship between the variables, `fit2` and `fit3` still *predict* as well as `fit1`, despite their coefficients having little to no meaning, a concept we will return to later.
```{r}
coef(fit1)
coef(fit2)
coef(fit3)
```
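To see why the fits agree, substitute $x_3 = 2 x_1 + 4 x_2 + 3$ into the fitted equation for `fit2`: if $\hat{y} = a_0 + a_1 x_1 + a_3 x_3$, then $\hat{y} = (a_0 + 3 a_3) + (a_1 + 2 a_3) x_1 + 4 a_3 x_2$. The following sketch (an addition, not in the original text) recovers the coefficients of `fit1` from those of `fit2`.
```{r}
# Added sketch: recover fit1's coefficients from fit2's by substituting
# x3 = 2 * x1 + 4 * x2 + 3 into the fitted equation.
a = coef(fit2)
c(unname(a[1] + 3 * a[3]), unname(a[2] + 2 * a[3]), unname(4 * a[3]))
coef(fit1)
```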
## Collinearity
Exact collinearity is an extreme example of **collinearity**, which occurs in multiple regression when predictor variables are highly correlated. Collinearity is often called *multicollinearity*, since it is a phenomenon that really only occurs in multiple regression.
Looking at the `seatpos` dataset from the `faraway` package, we will see an example of this concept. The predictors in this dataset are various attributes of car drivers, such as their height, weight and age. The response variable `hipcenter` measures the "horizontal distance of the midpoint of the hips from a fixed location in the car in mm." Essentially, it measures the position of the seat for a given driver. This is potentially useful information for car manufacturers considering comfort and safety when designing vehicles.
We will attempt to fit a model that predicts `hipcenter`. Two predictor variables are immediately interesting to us: `HtShoes` and `Ht`. We certainly expect a person's height to be highly correlated to their height when wearing shoes. We'll pay special attention to these two variables when fitting models.
```{r, fig.height=8, fig.width=8}
library(faraway)
pairs(seatpos, col = "dodgerblue")
round(cor(seatpos), 2)
```
After loading the `faraway` package, we do some quick checks of correlation between the predictors. Visually, we can do this with the `pairs()` function, which plots all possible scatterplots between pairs of variables in the dataset.
We can also do this numerically with the `cor()` function, which, when applied to a dataset, returns all pairwise correlations. Notice this is a symmetric matrix. Recall that correlation measures the strength and direction of the linear relationship between two variables. The correlation between `Ht` and `HtShoes` is extremely high: so high that, rounded to two decimal places, it appears to be $1$!
Unlike exact collinearity, here we can still fit a model with all of the predictors, but what effect does this have?
```{r}
hip_model = lm(hipcenter ~ ., data = seatpos)
summary(hip_model)
```
One of the first things we should notice is that the $F$-test tells us the regression is significant; however, no individual predictor is. Another interesting result is the opposite signs of the coefficients for `Ht` and `HtShoes`. This should seem rather counter-intuitive: increasing `Ht` increases `hipcenter`, but increasing `HtShoes` decreases `hipcenter`?
This happens as a result of the predictors being highly correlated. For example, the `HtShoes` variable explains a large amount of the variation in `Ht`. When they are both in the model, their individual effects on the response are lessened, but together they still explain a large portion of the variation of `hipcenter`.
We define $R_j^2$ to be the proportion of observed variation in the $j$-th predictor explained by the other predictors. In other words, $R_j^2$ is the multiple R-squared for the regression of $x_j$ on the other predictors.
```{r}
ht_shoes_model = lm(HtShoes ~ . - hipcenter, data = seatpos)
summary(ht_shoes_model)$r.squared
```
Here we see that the other predictors explain $99.67\%$ of the variation in `HtShoes`. When fitting this model, we removed `hipcenter` since it is not a predictor.
### Variance Inflation Factor
Now note that the variance of $\hat{\beta_j}$ can be written as
$$
\text{Var}(\hat{\beta_j}) = \sigma^2 C_{jj} = \sigma^2 \left( \frac{1}{1 - R_j^2} \right) \frac{1}{S_{x_j x_j}}
$$
where
$$
S_{x_j x_j} = \sum(x_{ij}-\bar{x}_j)^2.
$$
This gives us a way to understand how collinearity affects our regression estimates.
We will call
$$
\frac{1}{1 - R_j^2}
$$
the **variance inflation factor.** The variance inflation factor quantifies the effect of collinearity on the variance of our regression estimates. When $R_j^2$ is large, that is close to 1, $x_j$ is well explained by the other predictors. With a large $R_j^2$ the variance inflation factor becomes large. This tells us that when $x_j$ is highly correlated with other predictors, our estimate of $\beta_j$ is highly variable.
The `vif` function from the `faraway` package calculates the VIFs for each of the predictors of a model.
```{r}
vif(hip_model)
```
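As a sanity check (an added sketch, not in the original text), the VIF for `HtShoes` can also be computed directly from the $R_j^2$ we found above, using the `ht_shoes_model` fit earlier.
```{r}
# Added check: VIF computed "by hand" from R_j^2; this should match
# the HtShoes value reported by vif(hip_model) above.
1 / (1 - summary(ht_shoes_model)$r.squared)
```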
In practice it is common to say that any VIF greater than $5$ is cause for concern. So, in this example, we see there is a huge multicollinearity issue, as many of the predictors have a VIF greater than $5$.
Let's further investigate how the presence of collinearity actually affects a model. If we add a moderate amount of noise to the data, we see that the estimates of the coefficients change drastically. This is a rather undesirable effect. Adding random noise should not affect the coefficients of a model.
```{r}
set.seed(1337)
noise = rnorm(n = nrow(seatpos), mean = 0, sd = 5)
hip_model_noise = lm(hipcenter + noise ~ ., data = seatpos)
```
Adding the noise had such a large effect that the sign of the coefficient for `Ht` changed.
```{r}
coef(hip_model)
coef(hip_model_noise)
```
This tells us that a model with collinearity is bad at explaining the relationship between the response and the predictors. We cannot even be confident in the direction of the relationship. However, does collinearity affect prediction?
```{r}
plot(fitted(hip_model), fitted(hip_model_noise), col = "dodgerblue", pch = 20,
xlab = "Predicted, Without Noise", ylab = "Predicted, With Noise", cex = 1.5)
abline(a = 0, b = 1, col = "darkorange", lwd = 2)
```
Plotting the predicted values from the two models against each other, we see that they are actually rather similar.
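We can also quantify this similarity numerically; this added check simply correlates the two sets of fitted values.
```{r}
# Added check: the fitted values of the two models are nearly identical,
# even though their coefficients differ noticeably.
cor(fitted(hip_model), fitted(hip_model_noise))
```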
Let's now look at a smaller model.
```{r}
hip_model_small = lm(hipcenter ~ Age + Arm + Ht, data = seatpos)
summary(hip_model_small)
vif(hip_model_small)
```
Immediately we see that multicollinearity isn't an issue here.
```{r}
anova(hip_model_small, hip_model)
```
Also notice that using an $F$-test to compare the two models, we would prefer the smaller model.
We now investigate the effect of adding another variable to this smaller model. Specifically we want to look at adding the variable `HtShoes`. So now our possible predictors are `HtShoes`, `Age`, `Arm`, and `Ht`. Our response is still `hipcenter`.
To quantify this effect we will look at a **variable added plot** and a **partial correlation coefficient**. For both of these, we will look at the residuals of two models:
- Regressing the response (`hipcenter`) against all of the predictors except the predictor of interest (`HtShoes`).
- Regressing the predictor of interest (`HtShoes`) against the other predictors (`Age`, `Arm`, and `Ht`).
```{r}
ht_shoes_model_small = lm(HtShoes ~ Age + Arm + Ht, data = seatpos)
```
So now, the residuals of `hip_model_small` give us the variation of `hipcenter` that is *unexplained* by `Age`, `Arm`, and `Ht`. Similarly, the residuals of `ht_shoes_model_small` give us the variation of `HtShoes` unexplained by `Age`, `Arm`, and `Ht`.
The correlation of these two residuals gives us the **partial correlation coefficient** of `HtShoes` and `hipcenter` with the effects of `Age`, `Arm`, and `Ht` removed.
```{r}
cor(resid(ht_shoes_model_small), resid(hip_model_small))
```
Since this value is small, close to zero, it means that the variation of `hipcenter` that is unexplained by `Age`, `Arm`, and `Ht` shows very little correlation with the variation of `HtShoes` that is not explained by `Age`, `Arm`, and `Ht`. Thus adding `HtShoes` to the model would likely be of little benefit.
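As an added cross-check (not in the original text), we can fit the model with `HtShoes` included and inspect its individual $t$-test, which tells the same story.
```{r}
# Added sketch: the t-test for HtShoes in the expanded model agrees with
# the small partial correlation, suggesting little benefit from adding it.
hip_model_plus = lm(hipcenter ~ Age + Arm + Ht + HtShoes, data = seatpos)
summary(hip_model_plus)$coefficients["HtShoes", ]
```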
Similarly, a **variable added plot** visualizes these two sets of residuals against each other. It is also helpful to regress the residuals of the response against the residuals of the predictor and add the regression line to the plot.
```{r}
plot(resid(hip_model_small) ~ resid(ht_shoes_model_small),
col = "dodgerblue", pch = 20,
xlab = "Residuals, Added Predictor",
ylab = "Residuals, Original Model",
main = "Variable Added Plot")
abline(h = 0, lty = 2)
abline(v = 0, lty = 2)
abline(lm(resid(hip_model_small) ~ resid(ht_shoes_model_small)),
col = "darkorange", lwd = 2)
```
Here the variable added plot shows almost no linear relationship. This tells us that adding `HtShoes` to the model would probably not be worthwhile. Since its variation is largely explained by the other predictors, adding it will not do much to improve the model. However, it will increase the variation of the estimates and make the model much harder to interpret.
Had there been a strong linear relationship here, thus a large partial correlation coefficient, it would likely have been useful to add the additional predictor to the model.
This trade-off is mostly true in general. As a model gets more predictors, errors will get smaller and its *prediction* will be better, but it will be harder to interpret. This is why, if we are interested in *explaining* the relationship between the predictors and the response, we often want a model that fits well, but with a small number of predictors with little correlation.
Next chapter we will learn about methods to find models that both fit well and have a small number of predictors. We will also discuss *overfitting*: although adding additional predictors will always make errors smaller, sometimes we will be "fitting the noise" and such a model will not generalize well to additional observations.
## Simulation
Here we simulate example data with and without collinearity. We will note the difference in the distribution of the estimates of the $\beta$ parameters, in particular their variance. However, we will also notice the similarity in their $MSE$.
We will use the model,
\[
Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon
\]
where $\epsilon \sim N(\mu = 0, \sigma^2 = 25)$ and the $\beta$ coefficients are defined below.
```{r}
set.seed(42)
beta_0 = 7
beta_1 = 3
beta_2 = 4
sigma = 5
```
We will use a sample size of 10, and 2500 simulations for both situations.
```{r}
sample_size = 10
num_sim = 2500
```
We'll first consider the situation with a collinearity issue, so we manually create the two predictor variables.
```{r}
x1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
x2 = c(1, 2, 3, 4, 5, 7, 6, 10, 9, 8)
```
```{r}
c(sd(x1), sd(x2))
cor(x1, x2)
```
Notice that they have extremely high correlation.
```{r}
true_line_bad = beta_0 + beta_1 * x1 + beta_2 * x2
beta_hat_bad = matrix(0, num_sim, 2)
mse_bad = rep(0, num_sim)
```
We perform the simulation 2500 times, each time fitting a regression model, and storing the estimated coefficients and the MSE.
```{r}
for (s in 1:num_sim) {
y = true_line_bad + rnorm(n = sample_size, mean = 0, sd = sigma)
reg_out = lm(y ~ x1 + x2)
beta_hat_bad[s, ] = coef(reg_out)[-1]
mse_bad[s] = mean(resid(reg_out) ^ 2)
}
```
Now we move to the situation without a collinearity issue, so we again manually create the two predictor variables.
```{r}
z1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
z2 = c(9, 2, 7, 4, 5, 6, 3, 8, 1, 10)
```
Notice that the standard deviations of each are the same as before, however, now the correlation is extremely close to 0.
```{r}
c(sd(z1), sd(z2))
cor(z1, z2)
```
```{r}
true_line_good = beta_0 + beta_1 * z1 + beta_2 * z2
beta_hat_good = matrix(0, num_sim, 2)
mse_good = rep(0, num_sim)
```
We then perform simulations and store the same results.
```{r}
for (s in 1:num_sim) {
y = true_line_good + rnorm(n = sample_size, mean = 0, sd = sigma)
reg_out = lm(y ~ z1 + z2)
beta_hat_good[s, ] = coef(reg_out)[-1]
mse_good[s] = mean(resid(reg_out) ^ 2)
}
```
We'll now investigate the differences.
```{r, fig.width = 12, fig.height = 6}
par(mfrow = c(1, 2))
hist(beta_hat_bad[, 1],
col = "darkorange",
border = "dodgerblue",
main = expression("Histogram of " *hat(beta)[1]* " with Collinearity"),
xlab = expression(hat(beta)[1]),
breaks = 20)
hist(beta_hat_good[, 1],
col = "darkorange",
border = "dodgerblue",
main = expression("Histogram of " *hat(beta)[1]* " without Collinearity"),
xlab = expression(hat(beta)[1]),
breaks = 20)
```
First, for $\beta_1$, which has a true value of $3$, we see that both with and without collinearity, the simulated values are centered near $3$.
```{r}
mean(beta_hat_bad[, 1])
mean(beta_hat_good[, 1])
```
The way the predictors were created, the $S_{x_j x_j}$ portion of the variance is the same for the predictors in both cases, but the variance is still much larger in the simulations performed with collinearity. The variance is so large in the collinear case that sometimes the estimated coefficient for $\beta_1$ is negative!
```{r}
sd(beta_hat_bad[, 1])
sd(beta_hat_good[, 1])
```
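Since $S_{x_j x_j}$ and $\sigma^2$ are the same in both setups, the variance formula says the ratio of the standard deviations should be driven by the variance inflation factors alone. This added sketch (not in the original text) compares the theoretical ratio to the simulated one.
```{r}
# Added check: with two predictors, R_j^2 is just the squared correlation,
# so the theoretical sd ratio is sqrt(VIF_bad / VIF_good).
vif_bad  = 1 / (1 - cor(x1, x2) ^ 2)
vif_good = 1 / (1 - cor(z1, z2) ^ 2)
sqrt(vif_bad / vif_good)
sd(beta_hat_bad[, 1]) / sd(beta_hat_good[, 1])
```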
```{r, fig.width = 12, fig.height = 6}
par(mfrow = c(1, 2))
hist(beta_hat_bad[, 2],
col = "darkorange",
border = "dodgerblue",
main = expression("Histogram of " *hat(beta)[2]* " with Collinearity"),
xlab = expression(hat(beta)[2]),
breaks = 20)
hist(beta_hat_good[, 2],
col = "darkorange",
border = "dodgerblue",
main = expression("Histogram of " *hat(beta)[2]* " without Collinearity"),
xlab = expression(hat(beta)[2]),
breaks = 20)
```
We see the same issues with $\beta_2$. On average the estimates are correct, but the variance is again much larger with collinearity.
```{r}
mean(beta_hat_bad[, 2])
mean(beta_hat_good[, 2])
```
```{r}
sd(beta_hat_bad[, 2])
sd(beta_hat_good[, 2])
```
```{r, fig.width = 12, fig.height = 6}
par(mfrow = c(1, 2))
hist(mse_bad,
col = "darkorange",
border = "dodgerblue",
main = "MSE, with Collinearity",
xlab = "MSE")
hist(mse_good,
col = "darkorange",
border = "dodgerblue",
main = "MSE, without Collinearity",
xlab = "MSE")
```
Interestingly, in both cases, the MSE is roughly the same on average. Again, this is because collinearity affects a model's ability to *explain*, but not its ability to *predict*.
```{r}
mean(mse_bad)
mean(mse_good)
```
## `R` Markdown
The `R` Markdown file for this chapter can be found here:
- [`collinearity.Rmd`](collinearity.Rmd){target="_blank"}
The file was created using `R` version ``r paste0(version$major, "." ,version$minor)``.