# Multiple Linear Regression
```{r, include = FALSE}
knitr::opts_chunk$set(cache = TRUE, autodep = TRUE, fig.align = "center")
```
> "Life is really simple, but we insist on making it complicated."
>
> --- **Confucius**
After reading this chapter you will be able to:
- Construct and interpret linear regression models with more than one predictor.
- Understand how regression models are derived using matrices.
- Create interval estimates and perform hypothesis tests for multiple regression parameters.
- Formulate and interpret interval estimates for the mean response under various conditions.
- Compare nested models using an ANOVA F-Test.
In the last two chapters we saw how to fit a model that assumed a linear relationship between a response variable and a single predictor variable. Specifically, we defined the simple linear regression model,
\[
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i
\]
where $\epsilon_i \sim N(0, \sigma^2)$.
However, it is rarely the case that a dataset will have a single predictor variable. It is also rarely the case that a response variable will only depend on a single variable. So in this chapter, we will extend our current linear model to allow a response to depend on *multiple* predictors.
```{r}
# read the data from the web
autompg = read.table(
  "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
  quote = "\"",
  comment.char = "",
  stringsAsFactors = FALSE)
# give the dataframe headers
colnames(autompg) = c("mpg", "cyl", "disp", "hp", "wt", "acc", "year", "origin", "name")
# remove missing data, which is stored as "?"
autompg = subset(autompg, autompg$hp != "?")
# remove the plymouth reliant, as it causes some issues
autompg = subset(autompg, autompg$name != "plymouth reliant")
# give the dataset row names, based on the engine, year and name
rownames(autompg) = paste(autompg$cyl, "cylinder", autompg$year, autompg$name)
# remove the variable for name, as well as origin
autompg = subset(autompg, select = c("mpg", "cyl", "disp", "hp", "wt", "acc", "year"))
# change horsepower from character to numeric
autompg$hp = as.numeric(autompg$hp)
# check final structure of data
str(autompg)
```
```{r, echo = FALSE}
# create local csv for backup
write.csv(autompg, "data/autompg.csv")
```
We will once again discuss a dataset with information about cars. [This dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data){target="_blank"}, which can be found at the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Auto+MPG){target="_blank"}, contains a response variable `mpg` which stores the city fuel efficiency of cars, as well as several predictor variables for the attributes of the vehicles. In the chunk above, we load the data and perform some basic tidying before moving on to the analysis.
For now we will focus on using two variables, `wt` and `year`, as predictor variables. That is, we would like to model the fuel efficiency (`mpg`) of a car as a function of its weight (`wt`) and model year (`year`). To do so, we will define the following linear model,
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i, \qquad i = 1, 2, \ldots, n
\]
where $\epsilon_i \sim N(0, \sigma^2)$. In this notation we will define:
- $x_{i1}$ as the weight (`wt`) of the $i$th car.
- $x_{i2}$ as the model year (`year`) of the $i$th car.
The picture below will visualize what we would like to accomplish. The data points $(x_{i1}, x_{i2}, y_i)$ now exist in 3-dimensional space, so instead of fitting a line to the data, we will fit a plane. (We'll soon move to higher dimensions, so this will be the last example that is easy to visualize and think about this way.)
```{r, echo=FALSE, fig.height=6, fig.width=6, message=FALSE, warning=FALSE}
library("plot3D")
x = autompg$wt
y = autompg$year
z = autompg$mpg
fit <- lm(z ~ x + y)
grid.lines = 25
x.pred = seq(min(x), max(x), length.out = grid.lines)
y.pred = seq(min(y), max(y), length.out = grid.lines)
xy = expand.grid(x = x.pred, y = y.pred)
z.pred = matrix(predict(fit, newdata = xy),
                nrow = grid.lines, ncol = grid.lines)
fitpoints = predict(fit)
scatter3D(x, y, z, pch = 19, cex = 2, col = gg.col(1000), lighting = TRUE,
          theta = 25, phi = 45, ticktype = "detailed",
          xlab = "wt", ylab = "year", zlab = "mpg", zlim = c(0, 40), clim = c(0, 40),
          surf = list(x = x.pred, y = y.pred, z = z.pred,
                      facets = NA, fit = fitpoints), main = "")
```
How do we find such a plane? Well, we would like a plane that is as close as possible to the data points. That is, we would like it to minimize the errors it is making. How will we define these errors? Squared distance of course! So, we would like to minimize
\[
f(\beta_0, \beta_1, \beta_2) = \sum_{i = 1}^{n}(y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}))^2
\]
with respect to $\beta_0$, $\beta_1$, and $\beta_2$. How do we do so? It is another straightforward multivariate calculus problem; all we have done since last time is add an extra variable. So again, we take a derivative with respect to each of $\beta_0$, $\beta_1$, and $\beta_2$, set them equal to zero, and solve the resulting system of equations. That is,
\[
\begin{aligned}
\frac{\partial f}{\partial \beta_0} &= 0 \\
\frac{\partial f}{\partial \beta_1} &= 0 \\
\frac{\partial f}{\partial \beta_2} &= 0
\end{aligned}
\]
After doing so, we will once again obtain the **normal equations.**
\[
\begin{aligned}
n \beta_0 + \beta_1 \sum_{i = 1}^{n} x_{i1} + \beta_2 \sum_{i = 1}^{n} x_{i2} &= \sum_{i = 1}^{n} y_i \\
\beta_0 \sum_{i = 1}^{n} x_{i1} + \beta_1 \sum_{i = 1}^{n} x_{i1}^2 + \beta_2 \sum_{i = 1}^{n} x_{i1}x_{i2} &= \sum_{i = 1}^{n} x_{i1}y_i \\
\beta_0 \sum_{i = 1}^{n} x_{i2} + \beta_1 \sum_{i = 1}^{n} x_{i1}x_{i2} + \beta_2 \sum_{i = 1}^{n} x_{i2}^2 &= \sum_{i = 1}^{n} x_{i2}y_i
\end{aligned}
\]
We now have three equations and three variables, which we could solve, or we could simply let `R` solve for us.
```{r}
mpg_model = lm(mpg ~ wt + year, data = autompg)
coef(mpg_model)
```
\[
\hat{y} = `r coef(mpg_model)[1]` + `r coef(mpg_model)[2]` x_1 + `r coef(mpg_model)[3]` x_2
\]
Here we have once again fit our model using `lm()`; however, we have introduced a new syntactical element. The formula `mpg ~ wt + year` now reads: "model the response variable `mpg` as a linear function of `wt` and `year`." That is, it will estimate an intercept, as well as slope coefficients for `wt` and `year`. We then extract these as we have done before using `coef()`.
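As a quick sanity check on the derivation, we can also build the $3 \times 3$ system of normal equations from the data and solve it directly. This is only a sketch for verification (the names `A` and `b` below are our own), not how `lm()` works internally.
```{r}
# left-hand side and right-hand side of the normal equations
x1 = autompg$wt
x2 = autompg$year
A = rbind(c(nrow(autompg), sum(x1),      sum(x2)),
          c(sum(x1),       sum(x1 ^ 2),  sum(x1 * x2)),
          c(sum(x2),       sum(x1 * x2), sum(x2 ^ 2)))
b = c(sum(autompg$mpg), sum(x1 * autompg$mpg), sum(x2 * autompg$mpg))
# solving the system should reproduce coef(mpg_model)
solve(A, b)
```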
In the multiple linear regression setting, some of the interpretations of the coefficients change slightly.
Here, $\hat{\beta}_0 = `r coef(mpg_model)[1]`$ is our estimate for $\beta_0$, the mean miles per gallon for a car that weighs 0 pounds and was built in 1900. We see our estimate here is negative, which is a physical impossibility. However, this isn't unexpected, as we shouldn't expect our model to be accurate for cars from 1900 which weigh 0 pounds. (Because they never existed!) This isn't much of a change from SLR. That is, $\beta_0$ is still simply the mean when all of the predictors are 0.
The interpretation of the coefficients in front of our predictors is slightly different than before. For example, $\hat{\beta}_1 = `r coef(mpg_model)[2]`$ is our estimate for $\beta_1$, the average change in miles per gallon for an increase in weight ($x_{1}$) of one pound **for a car of a certain model year**, that is, for a fixed value of $x_{2}$. Note that this coefficient is actually the same for any given value of $x_{2}$. Later, we will look at models that allow for a different change in mean response for different values of $x_{2}$. Also note that this estimate is negative, which we would expect since, in general, fuel efficiency decreases for larger vehicles. Recall that in the multiple linear regression setting, this interpretation is dependent on a fixed value for $x_{2}$, that is, "for a car of a certain model year." It is possible that the indirect relationship between fuel efficiency and weight does not hold when an additional factor, say year, is included, and thus we could have the sign of our coefficient flipped.
Lastly, $\hat{\beta}_2 = `r coef(mpg_model)[3]`$ is our estimate for $\beta_2$, the average change in miles per gallon for a one-year increase in model year ($x_{2}$) for a car of a certain weight, that is, for a fixed value of $x_{1}$. It is not surprising that the estimate is positive. We expect that as time passes and the years march on, technology would improve so that a car of a specific weight would get better mileage now as compared to its predecessors. And yet, the coefficient could have been negative because we are also including weight as a variable, and not strictly as a fixed value.
## Matrix Approach to Regression
In our above example we used two predictor variables, but it will only take a little more work to allow for an arbitrary number of predictor variables and derive their coefficient estimates. We can consider the model,
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i(p-1)} + \epsilon_i, \qquad i = 1, 2, \ldots, n
\]
where $\epsilon_i \sim N(0, \sigma^2)$. In this model, there are $p - 1$ predictor variables, $x_1, x_2, \cdots, x_{p-1}$. There are a total of $p$ $\beta$-parameters and a single parameter $\sigma^2$ for the variance of the errors. (It should be noted that almost as often, authors will use $p$ as the number of predictors, making the total number of $\beta$ parameters $p+1$. This is always something you should be aware of when reading about multiple regression. There is not a standard that is used most often.)
If we were to stack together the $n$ linear equations that represent each $Y_i$ into a column vector, we get the following.
\[
\begin{bmatrix}
Y_1 \\
Y_2 \\
\vdots\\
Y_n \\
\end{bmatrix}
=
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1(p-1)} \\
1 & x_{21} & x_{22} & \cdots & x_{2(p-1)} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{n(p-1)} \\
\end{bmatrix}
\begin{bmatrix}
\beta_0 \\
\beta_1 \\
\beta_2 \\
\vdots \\
\beta_{p-1} \\
\end{bmatrix}
+
\begin{bmatrix}
\epsilon_1 \\
\epsilon_2 \\
\vdots\\
\epsilon_n \\
\end{bmatrix}
\]
We can write this model much more compactly as
\[
Y = X \beta + \epsilon
\]
where
\[
Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots\\ Y_n \end{bmatrix}, \quad
X = \begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1(p-1)} \\
1 & x_{21} & x_{22} & \cdots & x_{2(p-1)} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{n(p-1)} \\
\end{bmatrix}, \quad
\beta = \begin{bmatrix}
\beta_0 \\
\beta_1 \\
\beta_2 \\
\vdots \\
\beta_{p-1} \\
\end{bmatrix}, \quad
\epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots\\ \epsilon_n \end{bmatrix}
\]
So now with data,
\[
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots\\ y_n \end{bmatrix}
\]
Just as before, we can estimate $\beta$ by minimizing,
\[
f(\beta_0, \beta_1, \beta_2, \cdots, \beta_{p-1}) = \sum_{i = 1}^{n}(y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i(p-1)}))^2,
\]
which requires taking $p$ partial derivatives and setting each equal to zero, resulting in the following **normal equations**.
\[
\begin{bmatrix}
n & \sum_{i = 1}^{n} x_{i1} & \sum_{i = 1}^{n} x_{i2} & \cdots & \sum_{i = 1}^{n} x_{i(p-1)} \\
\sum_{i = 1}^{n} x_{i1} & \sum_{i = 1}^{n} x_{i1}^2 & \sum_{i = 1}^{n} x_{i1}x_{i2} & \cdots & \sum_{i = 1}^{n} x_{i1}x_{i(p-1)} \\
\vdots & \vdots & \vdots & & \vdots \\
\sum_{i = 1}^{n} x_{i(p-1)} & \sum_{i = 1}^{n} x_{i(p-1)}x_{i1} & \sum_{i = 1}^{n} x_{i(p-1)}x_{i2} & \cdots & \sum_{i = 1}^{n} x_{i(p-1)}^2 \\
\end{bmatrix}
\begin{bmatrix}
\beta_0 \\
\beta_1 \\
\vdots \\
\beta_{p-1} \\
\end{bmatrix}
=
\begin{bmatrix}
\sum_{i = 1}^{n} y_i \\
\sum_{i = 1}^{n} x_{i1}y_i \\
\vdots \\
\sum_{i = 1}^{n} x_{i(p-1)}y_i \\
\end{bmatrix}
\]
The normal equations can be written much more succinctly in matrix notation,
\[
X^\top X \beta = X^\top y.
\]
We can then solve this expression by multiplying both sides by the inverse of $X^\top X$, which exists, provided the columns of $X$ are linearly independent. Then as always, we denote our solution with a hat.
\[
\hat{\beta} = \left( X^\top X \right)^{-1}X^\top y
\]
To verify that this is what `R` has done for us in the case of two predictors, we create an $X$ matrix. Note that the first column is all 1s, and the remaining columns contain the data.
```{r}
n = nrow(autompg)
p = length(coef(mpg_model))
X = cbind(rep(1, n), autompg$wt, autompg$year)
y = autompg$mpg
(beta_hat = solve(t(X) %*% X) %*% t(X) %*% y)
coef(mpg_model)
```
\[
\hat{\beta} = \begin{bmatrix}
`r beta_hat[1]` \\
`r beta_hat[2]` \\
`r beta_hat[3]` \\
\end{bmatrix}
\]
In our new notation, the fitted values can be written
\[
\hat{y} = X \hat{\beta}.
\]
\[
\hat{y} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots\\ \hat{y}_n \end{bmatrix}
\]
Then, we can create a vector for the residual values,
\[
e
= \begin{bmatrix} e_1 \\ e_2 \\ \vdots\\ e_n \end{bmatrix}
= \begin{bmatrix} y_1 \\ y_2 \\ \vdots\\ y_n \end{bmatrix} - \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots\\ \hat{y}_n \end{bmatrix}.
\]
And lastly, we can update our estimate for $\sigma^2$.
\[
s_e^2 = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n - p} = \frac{e^\top e}{n-p}
\]
Recall, we like this estimate because it is unbiased, that is,
\[
\text{E}[s_e^2] = \sigma^2
\]
Note that the change from the SLR estimate is in the denominator: we now divide by $n - p$ instead of $n - 2$. In fact, in the case of SLR there are two $\beta$ parameters, so $p = 2$ and the two expressions agree.
Also note that if we fit the model $Y_i = \beta + \epsilon_i$, then $\hat{y} = \bar{y}$, $p = 1$, and $s_e^2$ becomes
\[
s_e^2 = \frac{\sum_{i=1}^n (y_i - \bar{y})^2}{n - 1}
\]
which is likely the very first sample variance you saw in a mathematical statistics class. The reasoning for the $n - 1$ is the same: we estimated one parameter, so we lose one degree of freedom. Now, in general, we are estimating $p$ parameters, the $\beta$ parameters, so we lose $p$ degrees of freedom.
Also, recall that most often we will be interested in $s_e$, the residual standard error as `R` calls it,
\[
s_e = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n - p}}.
\]
In `R`, we could directly access $s_e$ for a fitted model, as we have seen before.
```{r}
summary(mpg_model)$sigma
```
And we can now verify that our math above is indeed calculating the same quantities.
```{r}
y_hat = X %*% solve(t(X) %*% X) %*% t(X) %*% y
e = y - y_hat
sqrt(t(e) %*% e / (n - p))
sqrt(sum((y - y_hat) ^ 2) / (n - p))
```
## Sampling Distribution
As we can see in the output below, the results of calling `summary()` are similar to SLR, but there are some differences, most obviously a new row for the added predictor variable.
```{r}
summary(mpg_model)
```
To understand these differences in detail, we will need to first obtain the sampling distribution of $\hat{\beta}$.
The derivation of the sampling distribution of $\hat{\beta}$ involves the multivariate normal distribution. [These brief notes from semesters past](notes/mvn.pdf) give a basic overview. These are simply for your information, as we will not present the derivation in full here.
Our goal now is to obtain the distribution of the $\hat{\beta}$ vector,
\[
\hat{\beta} = \begin{bmatrix}
\hat{\beta}_0 \\
\hat{\beta}_1 \\
\hat{\beta}_2 \\
\vdots \\
\hat{\beta}_{p-1} \end{bmatrix}
\]
Recall that, when discussing sampling distributions, we consider $\hat{\beta}$ to be a random vector; thus we use $Y$ instead of the observed data vector $y$.
\[
\hat{\beta} = \left( X^\top X \right)^{-1}X^\top Y
\]
Then it is a consequence of the multivariate normal distribution that,
\[
\hat{\beta} \sim N\left(\beta, \sigma^2 \left(X^\top X\right)^{-1} \right).
\]
We then have
\[
\text{E}[\hat{\beta}] = \beta
\]
and for any $\hat{\beta}_j$ we have
\[
\text{E}[\hat{\beta}_j] = \beta_j.
\]
We also have
\[
\text{Var}[\hat{\beta}] = \sigma^2 \left( X^\top X \right)^{-1}
\]
and for any $\hat{\beta}_j$ we have
\[
\text{Var}[\hat{\beta}_j] = \sigma^2 C_{jj}
\]
where
\[
C = \left(X^\top X\right)^{-1}
\]
and the elements of $C$ are denoted
\[
C = \begin{bmatrix}
C_{00} & C_{01} & C_{02} & \cdots & C_{0(p-1)} \\
C_{10} & C_{11} & C_{12} & \cdots & C_{1(p-1)} \\
C_{20} & C_{21} & C_{22} & \cdots & C_{2(p-1)} \\
\vdots & \vdots & \vdots & & \vdots \\
C_{(p-1)0} & C_{(p-1)1} & C_{(p-1)2} & \cdots & C_{(p-1)(p-1)} \\
\end{bmatrix}.
\]
Essentially, the diagonal elements $C_{jj}$ are the quantities we need to obtain the variance of each individual $\hat{\beta}_j$.
Then the standard error for the $\hat{\beta}$ vector is given by
\[
\text{SE}[\hat{\beta}] = s_e \sqrt{\left( X^\top X \right)^{-1}}
\]
and for a particular $\hat{\beta}_j$
\[
\text{SE}[\hat{\beta}_j] = s_e \sqrt{C_{jj}}.
\]
Lastly, each of the $\hat{\beta}_j$ follows a normal distribution,
\[
\hat{\beta}_j \sim N\left(\beta_j, \sigma^2 C_{jj} \right),
\]
thus
\[
\frac{\hat{\beta}_j - \beta_j}{s_e \sqrt{C_{jj}}} \sim t_{n-p}.
\]
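As a quick numerical check of these results, we can compute $s_e \sqrt{C_{jj}}$ directly, reusing the `X` matrix and `mpg_model` from before, and compare to the standard errors that `summary()` reports. This is only a sketch of the bookkeeping; `R` has already done it for us.
```{r}
C = solve(t(X) %*% X)
s_e = summary(mpg_model)$sigma
# standard errors from the formula, s_e * sqrt(C_jj)
s_e * sqrt(diag(C))
# standard errors reported by R
summary(mpg_model)$coefficients[, "Std. Error"]
```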
Now that we have the necessary distributional results, we can move on to perform tests and make interval estimates.
### Single Parameter Tests
The first test we will see is a test for a single $\beta_j$.
\[
H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0
\]
Again, the test statistic takes the form
\[
\text{TS} = \frac{\text{EST} - \text{HYP}}{\text{SE}}.
\]
In particular,
\[
t = \frac{\hat{\beta}_j - \beta_j}{\text{SE}[\hat{\beta}_j]} = \frac{\hat{\beta}_j-0}{s_e\sqrt{C_{jj}}},
\]
which, under the null hypothesis, follows a $t$ distribution with $n - p$ degrees of freedom.
Recall our model for `mpg`,
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i, \qquad i = 1, 2, \ldots, n
\]
where $\epsilon_i \sim N(0, \sigma^2)$.
- $x_{i1}$ as the weight (`wt`) of the $i$th car.
- $x_{i2}$ as the model year (`year`) of the $i$th car.
Then the test
\[
H_0: \beta_1 = 0 \quad \text{vs} \quad H_1: \beta_1 \neq 0
\]
can be found in the `summary()` output, in particular:
```{r}
summary(mpg_model)$coef
```
The estimate (`Estimate`), standard error (`Std. Error`), test statistic (`t value`), and p-value (`Pr(>|t|)`) for this test are displayed in the second row, labeled `wt`. Remember that the p-value given here is specifically for a two-sided test, where the hypothesized value is 0.
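As a sketch of where this row comes from, we can rebuild the test statistic and two-sided p-value for `wt` by hand, reusing `X`, `n`, `p`, and `mpg_model` from before.
```{r}
beta_1_hat = coef(mpg_model)["wt"]
se_beta_1  = summary(mpg_model)$sigma * sqrt(diag(solve(t(X) %*% X)))[2]
# test statistic, hypothesizing beta_1 = 0
(t_stat = beta_1_hat / se_beta_1)
# two-sided p-value from a t distribution with n - p degrees of freedom
2 * pt(abs(t_stat), df = n - p, lower.tail = FALSE)
```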
Also note that in this case, by hypothesizing that $\beta_1 = 0$, the null and alternative essentially specify two different models:
- $H_0$: $Y = \beta_0 + \beta_2 x_{2} + \epsilon$
- $H_1$: $Y = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \epsilon$
This is important. We are not simply testing whether or not there is a relationship between weight and fuel efficiency. We are testing if there is a relationship between weight and fuel efficiency, given that a term for year is in the model. (Note, we dropped some indexing here, for readability.)
### Confidence Intervals
Since $\hat{\beta}_j$ is our estimate for $\beta_j$ and we have
\[
\text{E}[\hat{\beta}_j] = \beta_j
\]
as well as the standard error,
\[
\text{SE}[\hat{\beta}_j] = s_e\sqrt{C_{jj}}
\]
and the sampling distribution of $\hat{\beta}_j$ is Normal, then we can easily construct confidence intervals for each of the $\hat{\beta}_j$.
\[
\hat{\beta}_j \pm t_{\alpha/2, n - p} \cdot s_e\sqrt{C_{jj}}
\]
We can find these in `R` using the same method as before. Now there will simply be additional rows for the additional $\beta$.
```{r}
confint(mpg_model, level = 0.99)
```
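For instance, a short sketch reproduces the 99% interval for $\beta_2$ (the `year` row) from its estimate, its standard error, and the appropriate critical value.
```{r}
est  = coef(mpg_model)["year"]
se   = summary(mpg_model)$coefficients["year", "Std. Error"]
crit = qt(0.995, df = length(resid(mpg_model)) - length(coef(mpg_model)))
c(est - crit * se, est + crit * se)
```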
### Confidence Intervals for Mean Response
As we saw in SLR, we can create confidence intervals for the mean response, that is, an interval estimate for $\text{E}[Y \mid X = x]$. In SLR, the mean of $Y$ was only dependent on a single value $x$. Now, in multiple regression, $\text{E}[Y \mid X = x]$ is dependent on the value of each of the predictors, so we define the vector $x_0$ to be,
\[
x_{0} = \begin{bmatrix}
1 \\
x_{01} \\
x_{02} \\
\vdots \\
x_{0(p-1)} \\
\end{bmatrix}.
\]
Then our estimate of $\text{E}[Y \mid X = x_0]$ for a set of values $x_0$ is given by
\[
\begin{aligned}
\hat{y}(x_0) &= x_{0}^\top\hat{\beta} \\
&= \hat{\beta}_0 + \hat{\beta}_1 x_{01} + \hat{\beta}_2 x_{02} + \cdots + \hat{\beta}_{p-1} x_{0(p-1)}.
\end{aligned}
\]
As with SLR, this is an unbiased estimate.
\[
\begin{aligned}
\text{E}[\hat{y}(x_0)] &= x_{0}^\top\beta \\
&= \beta_0 + \beta_1 x_{01} + \beta_2 x_{02} + \cdots + \beta_{p-1} x_{0(p-1)}
\end{aligned}
\]
To make an interval estimate, we will also need its standard error.
\[
\text{SE}[\hat{y}(x_0)] = s_e \sqrt{x_{0}^\top\left(X^\top X\right)^{-1}x_{0}}
\]
Putting it all together, we obtain a confidence interval for the mean response.
\[
\hat{y}(x_0) \pm t_{\alpha/2, n - p} \cdot s_e \sqrt{x_{0}^\top\left(X^\top X\right)^{-1}x_{0}}
\]
The math has changed a bit, but the process in `R` remains almost identical. Here, we create a data frame for two additional cars: one that weighs 3500 pounds and was produced in 1976, and a second that weighs 5000 pounds and was produced in 1981.
```{r}
new_cars = data.frame(wt = c(3500, 5000), year = c(76, 81))
new_cars
```
We can then use the `predict()` function with `interval = "confidence"` to obtain intervals for the mean fuel efficiency for both new cars. Again, it is important to make the data passed to `newdata` a data frame, so that `R` knows which values are for which variables.
```{r}
predict(mpg_model, newdata = new_cars, interval = "confidence", level = 0.99)
```
`R` then reports the estimate $\hat{y}(x_0)$ (`fit`) for each, as well as the lower (`lwr`) and upper (`upr`) bounds for the interval at a desired level (99%).
A word of caution here: one of these estimates is good while one is suspect.
```{r}
new_cars$wt
range(autompg$wt)
```
Note that both of the weights of the new cars are within the range of observed values.
```{r}
new_cars$year
range(autompg$year)
```
As are the years of each of the new cars.
```{r}
plot(year ~ wt, data = autompg, pch = 20, col = "dodgerblue", cex = 1.5)
points(new_cars, col = "darkorange", cex = 3, pch = "X")
```
However, we have to consider weight and year together now. And based on the above plot, one of the new cars is within the "blob" of observed values, while the other, the car from 1981 weighing 5000 pounds, is noticeably outside of the observed values. This is a hidden extrapolation which you should be aware of when using multiple regression.
Shifting gears back to the new data pair that can be reasonably estimated, we do a quick verification of some of the mathematics in `R`.
```{r}
x0 = c(1, 3500, 76)
x0 %*% beta_hat
```
\[
x_{0} = \begin{bmatrix}
1 \\
3500 \\
76 \\
\end{bmatrix}
\]
\[
\hat{\beta} = \begin{bmatrix}
`r beta_hat[1]` \\
`r beta_hat[2]` \\
`r beta_hat[3]` \\
\end{bmatrix}
\]
\[
\hat{y}(x_0) = x_{0}^\top\hat{\beta} =
\begin{bmatrix}
1 &
3500 &
76 \\
\end{bmatrix}
\begin{bmatrix}
`r beta_hat[1]` \\
`r beta_hat[2]` \\
`r beta_hat[3]` \\
\end{bmatrix}= `r x0 %*% beta_hat`
\]
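Continuing the verification, a short sketch reproduces the 99% confidence interval that `predict()` reported for this car, reusing `X`, `n`, `p`, and `beta_hat` from before.
```{r}
s_e     = summary(mpg_model)$sigma
se_mean = s_e * sqrt(t(x0) %*% solve(t(X) %*% X) %*% x0)
crit    = qt(0.995, df = n - p)
c(x0 %*% beta_hat - crit * se_mean, x0 %*% beta_hat + crit * se_mean)
```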
Also note that, using a particular value for $x_0$, we can essentially extract certain $\hat{\beta}_j$ values.
```{r}
beta_hat
x0 = c(0, 0, 1)
x0 %*% beta_hat
```
With this in mind, confidence intervals for the individual $\hat{\beta}_j$ are actually a special case of a confidence interval for mean response.
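For example, a brief sketch: applying the mean response interval machinery to $x_0 = (0, 0, 1)$ from the chunk above recovers the interval for $\beta_2$ that `confint()` gave earlier.
```{r}
# mean response interval machinery applied to x0 = c(0, 0, 1)
s_e  = summary(mpg_model)$sigma
se_0 = s_e * sqrt(t(x0) %*% solve(t(X) %*% X) %*% x0)
crit = qt(0.995, df = n - p)
# this matches the `year` row of confint(mpg_model, level = 0.99)
c(x0 %*% beta_hat - crit * se_0, x0 %*% beta_hat + crit * se_0)
```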
### Prediction Intervals
As with SLR, creating prediction intervals involves one slight change to the standard error to account for the fact that we are now considering an observation, instead of a mean.
Here we use $\hat{y}(x_0)$ to estimate $Y_0$, a new observation of $Y$ at the predictor vector $x_0$.
\[
\begin{aligned}
\hat{y}(x_0) &= x_{0}^\top\hat{\beta} \\
&= \hat{\beta}_0 + \hat{\beta}_1 x_{01} + \hat{\beta}_2 x_{02} + \cdots + \hat{\beta}_{p-1} x_{0(p-1)}
\end{aligned}
\]
\[
\begin{aligned}
\text{E}[\hat{y}(x_0)] &= x_{0}^\top\beta \\
&= \beta_0 + \beta_1 x_{01} + \beta_2 x_{02} + \cdots + \beta_{p-1} x_{0(p-1)}
\end{aligned}
\]
As we did with SLR, we need to account for the additional variability of an observation about its mean.
\[
\text{SE}[\hat{y}(x_0) + \epsilon] = s_e \sqrt{1 + x_{0}^\top\left(X^\top X\right)^{-1}x_{0}}
\]
Then we arrive at our updated prediction interval for MLR.
\[
\hat{y}(x_0) \pm t_{\alpha/2, n - p} \cdot s_e \sqrt{1 + x_{0}^\top\left(X^\top X\right)^{-1}x_{0}}
\]
```{r}
new_cars
predict(mpg_model, newdata = new_cars, interval = "prediction", level = 0.99)
```
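As a sketch, the first of these intervals can be reproduced with the same building blocks as before; the only change is the extra `1 +` under the square root.
```{r}
x0 = c(1, 3500, 76)  # the 3500 pound car produced in 1976
s_e     = summary(mpg_model)$sigma
se_pred = s_e * sqrt(1 + t(x0) %*% solve(t(X) %*% X) %*% x0)
crit    = qt(0.995, df = n - p)
c(x0 %*% beta_hat - crit * se_pred, x0 %*% beta_hat + crit * se_pred)
```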
## Significance of Regression
The decomposition of variation that we saw in SLR still holds for MLR.
\[
\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2
\]
That is,
\[
\text{SST} = \text{SSE} + \text{SSReg}.
\]
This means that we can still calculate $R^2$ in the same manner as before, which `R` continues to do automatically.
```{r}
summary(mpg_model)$r.squared
```
The interpretation changes slightly as compared to SLR. In this MLR case, we say that $`r round(100*summary(mpg_model)$r.squared, 2)`\%$ of the observed variation in miles per gallon is explained by the linear relationship with the two predictor variables, weight and year.
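As a quick sketch, the same value can be recovered directly from the decomposition of variation.
```{r}
# R^2 = 1 - SSE / SST
sse = sum(resid(mpg_model) ^ 2)
sst = sum((autompg$mpg - mean(autompg$mpg)) ^ 2)
1 - sse / sst
```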
In multiple regression, the significance of regression test is
\[
H_0: \beta_1 = \beta_2 = \cdots = \beta_{p - 1} = 0.
\]
Here, we see that the null hypothesis sets all of the $\beta_j$ equal to 0, *except* the intercept, $\beta_0$. We could then say that the null model, or "model under the null hypothesis" is
\[
Y_i = \beta_0 + \epsilon_i.
\]
This is a model where the *regression* is insignificant. **None** of the predictors have a significant linear relationship with the response. Notationally, we will denote the fitted values of this model as $\hat{y}_{0i}$, which in this case happens to be:
\[
\hat{y}_{0i} = \bar{y}.
\]
The alternative hypothesis here is that at least one of the $\beta_j$ from the null hypothesis is not 0.
\[
H_1: \text{At least one of } \beta_j \neq 0, j = 1, 2, \cdots, (p-1)
\]
We could then say that the full model, or "model under the alternative hypothesis" is
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{(p-1)} x_{i(p-1)} + \epsilon_i
\]
This is a model where the regression is significant. **At least one** of the predictors has a significant linear relationship with the response. There is some linear relationship between $y$ and the predictors, $x_1, x_2, \ldots, x_{p - 1}$.
We will denote the fitted values of this model as $\hat{y}_{1i}$.
To develop the $F$ test for the significance of the regression, we will arrange the variance decomposition into an ANOVA table.
| Source | Sum of Squares | Degrees of Freedom | Mean Square | $F$ |
|------------|-----------------|------------|-----------|-----------|
| Regression | $\sum_{i=1}^{n}(\hat{y}_{1i} - \bar{y})^2$ | $p - 1$ | $\text{SSReg} / (p - 1)$ | $\text{MSReg} / \text{MSE}$ |
| Error | $\sum_{i=1}^{n}(y_i - \hat{y}_{1i})^2$ | $n - p$ | $\text{SSE} / (n - p)$ | |
| Total | $\sum_{i=1}^{n}(y_i - \bar{y})^2$ | $n - 1$ | | |
In summary, the $F$ statistic is
\[
F = \frac{\sum_{i=1}^{n}(\hat{y}_{1i} - \bar{y})^2 / (p - 1)}{\sum_{i=1}^{n}(y_i - \hat{y}_{1i})^2 / (n - p)},
\]
and the p-value is calculated as
\[
P(F_{p-1, n-p} > F)
\]
since we reject for large values of $F$. A large value of the statistic corresponds to a large portion of the variance being explained by the regression. Here $F_{p-1, n-p}$ represents a random variable which follows an $F$ distribution with $p - 1$ and $n - p$ degrees of freedom.
To perform this test in `R`, we first explicitly specify the two models in `R` and save the results in different variables. We then use `anova()` to compare the two models, giving `anova()` the null model first and the alternative (full) model second. (Specifying the full model first will result in the same p-value, but some nonsensical intermediate values.)
In this case,
- $H_0$: $Y_i = \beta_0 + \epsilon_i$
- $H_1$: $Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$
That is, in the null model, we use neither of the predictors, whereas in the full (alternative) model, at least one of the predictors is useful.
```{r}
null_mpg_model = lm(mpg ~ 1, data = autompg)
full_mpg_model = lm(mpg ~ wt + year, data = autompg)
anova(null_mpg_model, full_mpg_model)
```
First, notice that `R` does not display the results in the same manner as the table above. More important than the layout of the table are its contents. We see that the value of the $F$ statistic is `r round(anova(null_mpg_model, full_mpg_model)$F[2], 3)`, and the p-value is extremely low, so we reject the null hypothesis at any reasonable $\alpha$ and say that the regression is significant. At least one of `wt` or `year` has a useful linear relationship with `mpg`.
```{r}
summary(mpg_model)
```
Notice that the value reported in the row for `F-statistic` is indeed the $F$ test statistic for the significance of regression test, and additionally it reports the two relevant degrees of freedom.
Also, note that none of the individual $t$-tests are equivalent to the $F$-test as they were in SLR. This equivalence only holds for SLR because the individual test for $\beta_1$ is the same as testing for all non-intercept parameters, since there is only one.
We can also verify the sums of squares and degrees of freedom directly in `R`. You should match these to the `anova()` output above, and use them to connect `R`'s output to the written table.
```{r}
# SSReg
sum((fitted(full_mpg_model) - fitted(null_mpg_model)) ^ 2)
# SSE
sum(resid(full_mpg_model) ^ 2)
# SST
sum(resid(null_mpg_model) ^ 2)
# Degrees of Freedom: Regression
length(coef(full_mpg_model)) - length(coef(null_mpg_model))
# Degrees of Freedom: Error
length(resid(full_mpg_model)) - length(coef(full_mpg_model))
# Degrees of Freedom: Total
length(resid(null_mpg_model)) - length(coef(null_mpg_model))
```
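Putting these pieces together, a short sketch rebuilds the $F$ statistic and its p-value from the quantities we just computed.
```{r}
ss_reg = sum((fitted(full_mpg_model) - fitted(null_mpg_model)) ^ 2)
ss_err = sum(resid(full_mpg_model) ^ 2)
df_reg = length(coef(full_mpg_model)) - length(coef(null_mpg_model))
df_err = length(resid(full_mpg_model)) - length(coef(full_mpg_model))
(f_stat = (ss_reg / df_reg) / (ss_err / df_err))
pf(f_stat, df1 = df_reg, df2 = df_err, lower.tail = FALSE)
```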
## Nested Models
The significance of regression test is actually a special case of testing what we will call **nested models**. More generally, we can compare two models, where one model is "nested" inside the other, meaning the smaller model contains a subset of the predictors from the larger model and no additional predictors.
Consider the following full model,
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{(p-1)} x_{i(p-1)} + \epsilon_i
\]
This model has $p - 1$ predictors, for a total of $p$ $\beta$-parameters. We will denote the fitted values of this model as $\hat{y}_{1i}$.
Let the null model be
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{(q-1)} x_{i(q-1)} + \epsilon_i
\]
where $q < p$. This model has $q - 1$ predictors, for a total of $q$ $\beta$-parameters. We will denote the fitted values of this model as $\hat{y}_{0i}$.
The difference between these two models can be codified by the null hypothesis of a test.
\[
H_0: \beta_q = \beta_{q+1} = \cdots = \beta_{p - 1} = 0.
\]
Specifically, the $\beta$-parameters from the full model that are not in the null model are zero. The resulting model, which is nested, is the null model.
We can then perform this test using an $F$-test, which is the result of the following ANOVA table.
| Source | Sum of Squares | Degrees of Freedom | Mean Square | $F$ |
|--------|-----------------|------------|-----------|-----------|
| Diff | $\sum_{i=1}^{n}(\hat{y}_{1i} - \hat{y}_{0i})^2$ | $p - q$ | $\text{SSD} / (p - q)$ | $\text{MSD} / \text{MSE}$ |
| Full | $\sum_{i=1}^{n}(y_i - \hat{y}_{1i})^2$ | $n - p$ | $\text{SSE} / (n - p)$ | |
| Null | $\sum_{i=1}^{n}(y_i - \hat{y}_{0i})^2$ | $n - q$ | | |
\[
F = \frac{\sum_{i=1}^{n}(\hat{y}_{1i} - \hat{y}_{0i})^2 / (p - q)}{\sum_{i=1}^{n}(y_i - \hat{y}_{1i})^2 / (n - p)}.
\]
Notice that the row for "Diff" compares the sum of the squared differences of the fitted values. The degrees of freedom is then the difference of the number of $\beta$-parameters estimated between the two models.
For example, the `autompg` dataset has a number of additional variables that we have yet to use.
```{r}
names(autompg)
```
We'll continue to use `mpg` as the response, but now we will consider two different models.
- Full: `mpg ~ wt + year + cyl + disp + hp + acc`
- Null: `mpg ~ wt + year`
Note that these are nested models, as the null model contains a subset of the predictors from the full model, and no additional predictors. Both models have an intercept $\beta_0$ as well as a coefficient in front of each of the predictors. We could then write the null hypothesis for comparing these two models as,
\[
H_0: \beta_{\texttt{cyl}} = \beta_{\texttt{disp}} = \beta_{\texttt{hp}} = \beta_{\texttt{acc}} = 0
\]
The alternative is simply that at least one of the $\beta_{j}$ from the null is not 0.
To perform this test in `R` we first define both models, then give them to the `anova()` command.
```{r}
null_mpg_model = lm(mpg ~ wt + year, data = autompg)
#full_mpg_model = lm(mpg ~ wt + year + cyl + disp + hp + acc, data = autompg)
full_mpg_model = lm(mpg ~ ., data = autompg)
anova(null_mpg_model, full_mpg_model)
```
Here we have used the formula `mpg ~ .` to define the full model. This is the same as the commented-out line. Specifically, this is a common shortcut in `R` which reads, "model `mpg` as the response with each of the remaining variables in the data frame as predictors."
Here we see that the value of the $F$ statistic is `r round(anova(null_mpg_model, full_mpg_model)$F[2],3)`, and the p-value is very large, so we fail to reject the null hypothesis at any reasonable $\alpha$ and say that none of `cyl`, `disp`, `hp`, and `acc` are significant with `wt` and `year` already in the model.
Again, we verify the sums of squares and degrees of freedom directly in `R`. You should match these to the `anova()` output above, and use them to connect `R`'s output to the written table.
```{r}
# SSDiff
sum((fitted(full_mpg_model) - fitted(null_mpg_model)) ^ 2)
# SSE (For Full)
sum(resid(full_mpg_model) ^ 2)
# SSE (For Null)
sum(resid(null_mpg_model) ^ 2)
# Degrees of Freedom: Diff
length(coef(full_mpg_model)) - length(coef(null_mpg_model))
# Degrees of Freedom: Full
length(resid(full_mpg_model)) - length(coef(full_mpg_model))
# Degrees of Freedom: Null
length(resid(null_mpg_model)) - length(coef(null_mpg_model))
```
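The same $F$ statistic and p-value can also be pulled from the stored `anova()` result; here is a brief sketch using the standard column names of the returned table.
```{r}
nested_anova = anova(null_mpg_model, full_mpg_model)
# mean square of the difference divided by the mean square error of the full model
(f_stat = (nested_anova[2, "Sum of Sq"] / nested_anova[2, "Df"]) /
          (nested_anova[2, "RSS"] / nested_anova[2, "Res.Df"]))
pf(f_stat, df1 = nested_anova[2, "Df"], df2 = nested_anova[2, "Res.Df"],
   lower.tail = FALSE)
```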
## Simulation
Since we ignored the derivation of certain results, we will again use simulation to convince ourselves of some of the above results. In particular, we will simulate samples of size `n = 100` from the model
\[
Y_i = 5 - 2 x_{i1} + 6 x_{i2} + \epsilon_i, \qquad i = 1, 2, \ldots, n
\]
where $\epsilon_i \sim N(0, \sigma^2 = 16)$. Here we have two predictors, so $p = 3$.
```{r}
set.seed(1337)
n = 100 # sample size
p = 3
beta_0 = 5
beta_1 = -2
beta_2 = 6
sigma = 4
```
As is the norm with regression, the $x$ values are considered fixed and known quantities, so we will simulate those first, and they remain the same for the rest of the simulation study. Also note that we create an `x0` which is all `1`s, which we need to create our `X` matrix. If you look at the matrix formulation of regression, this unit vector of all `1`s is a "predictor" that puts the intercept into the model. We also calculate the `C` matrix for later use.
```{r}
x0 = rep(1, n)
x1 = sample(seq(1, 10, length = n))
x2 = sample(seq(1, 10, length = n))
X = cbind(x0, x1, x2)
C = solve(t(X) %*% X)
```
We then simulate the response according to the model above. Lastly, we place the two predictors and response into a data frame. Note that we do **not** place `x0` in the data frame. This is a result of `R` adding an intercept by default.
```{r}
eps = rnorm(n, mean = 0, sd = sigma)
y = beta_0 + beta_1 * x1 + beta_2 * x2 + eps
sim_data = data.frame(x1, x2, y)
```
Plotting this data and fitting the regression produces the following plot.
```{r, echo=FALSE, fig.height=6, fig.width=6, message=FALSE, warning=FALSE}
# make this use data.frame? or, simply hide this?
fit = lm(y ~ x1 + x2)
grid.lines = 25
x1.pred = seq(min(x1), max(x1), length.out = grid.lines)
x2.pred = seq(min(x2), max(x2), length.out = grid.lines)
x1x2 = expand.grid(x1 = x1.pred, x2 = x2.pred)
y.pred = matrix(predict(fit, newdata = x1x2),
                nrow = grid.lines, ncol = grid.lines)
# fitted points for droplines to surface
fitpoints = predict(fit)
# scatter plot with regression plane
scatter3D(x1, x2, y, pch = 20, cex = 2, col = gg.col(1000), lighting = TRUE,
          theta = 45, phi = 15, ticktype = "detailed",
          zlim = c(min(y.pred), max(y.pred)), clim = c(min(y.pred), max(y.pred)),
          xlab = "x1", ylab = "x2", zlab = "y",
          surf = list(x = x1.pred, y = x2.pred, z = y.pred,
                      facets = NA, fit = fitpoints), main = "")
```
We then calculate
\[
\hat{\beta} = \left( X^\top X \right)^{-1}X^\top y.
\]
```{r}
(beta_hat = C %*% t(X) %*% y)
```
Notice that these values are the same as the coefficients found using `lm()` in `R`.
```{r}
coef(lm(y ~ x1 + x2, data = sim_data))
```
Also, these values are close to what we would expect.
```{r}
c(beta_0, beta_1, beta_2)
```
We then calculate the fitted values in order to calculate $s_e$, which we see is the same as the `sigma` returned by `summary()`.
```{r}
y_hat = X %*% beta_hat
(s_e = sqrt(sum((y - y_hat) ^ 2) / (n - p)))
summary(lm(y ~ x1 + x2, data = sim_data))$sigma
```
So far so good. Everything checks out. Now we will finally simulate from this model repeatedly in order to obtain an empirical distribution of $\hat{\beta}_2$.
We expect $\hat{\beta}_2$ to follow a normal distribution,
\[
\hat{\beta}_2 \sim N\left(\beta_2, \sigma^2 C_{22} \right).
\]