-
Notifications
You must be signed in to change notification settings - Fork 176
/
inf-model-slr.qmd
1350 lines (1130 loc) · 64.3 KB
/
inf-model-slr.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Inference for linear regression with a single predictor {#sec-inf-model-slr}
```{r}
#| include: false
source("_common.R")
```
\chaptermark{Inference for regression with a single predictor}
\vspace{-3mm}
::: {.chapterintro data-latex=""}
We now bring together ideas of inferential analyses with the descriptive models seen in [Chapter -@sec-model-slr].
In particular, we will use the least squares regression line to test whether there is a relationship between two continuous variables.
Additionally, we will build confidence intervals which quantify the slope of the linear regression line.
The setting is now focused on predicting a numeric response variable (for linear models) or a binary response variable (for logistic models), we continue to ask questions about the variability of the model from sample to sample.
The sampling variability will inform the conclusions about the population that can be drawn.
Many of the inferential ideas are remarkably similar to those covered in previous chapters.
The technical conditions for linear models are typically assessed graphically, although independence of observations continues to be of utmost importance.
We encourage the reader to think broadly about the models at hand without putting too much dependence on the exact p-values that are reported from the statistical software.
Inference on models with multiple explanatory variables can suffer from data snooping which result in false positive claims.
We provide some guidance and hope the reader will further their statistical learning after working through the material in this text.
:::
```{r}
#| include: false
terms_chp_24 <- c("inference with single precictor regression")
```
\vspace{-7mm}
## Case study: Sandwich store
### Observed data
We start the chapter with a hypothetical example describing the linear relationship between dollars spent advertising for a chain sandwich restaurant and monthly revenue.
The hypothetical example serves the purpose of illustrating how a linear model varies from sample to sample.
Because we have made up the example and the data (and the entire population), we can take many many samples from the population to visualize the variability.
Note that in real life, we always have exactly one sample (that is, one dataset), and through the inference process, we imagine what might have happened had we taken a different sample.
The change from sample to sample leads to an understanding of how the single observed dataset is different from the population of values, which is typically the fundamental goal of inference.
\clearpage
Consider the following hypothetical population of all of the sandwich stores of a particular chain seen in @fig-sandpop.
In this made-up world, the CEO actually has all the relevant data, which is why they can plot it here.
The CEO is omniscient and can write down the population model which describes the true population relationship between the advertising dollars and revenue.
There appears to be a linear relationship between advertising dollars and revenue (both in \$1,000).
```{r}
#| label: fig-sandpop
#| fig-cap: |
#| Revenue as a linear model of advertising dollars for a population of sandwich
#| stores, in thousands of dollars.
#| fig-alt: |
#| Scatterplot with advertising amount on the x-axis and revenue on the y-axis.
#| A linear model is superimposed. The points show a reasonably strong and
#| positive linear trend.
set.seed(4747)
popsize <- 1000
ad <- rnorm(popsize, 4, 1)
rev <- 12 + 4.7 * ad + rnorm(popsize, 0, 8)
sandwich <- tibble(ad, rev)
ggplot(sandwich, aes(x = ad, y = rev)) +
geom_point(alpha = 0.5, size = 2) +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
scale_x_continuous(labels = label_dollar(suffix = "K")) +
scale_y_continuous(labels = label_dollar(suffix = "K")) +
labs(
x = "Amount spent on advertising",
y = "Revenue",
title = "Chain sandwich store",
subtitle = "Hypothetical population"
) +
coord_cartesian(ylim = c(0, 65))
```
You may remember from @sec-model-slr that the population model is: $$y = \beta_0 + \beta_1 x + \varepsilon.$$
Again, the omniscient CEO (with the full population information) can write down the true population model as: $$\texttt{expected revenue} = 11.23 + 4.8 \times \texttt{advertising}.$$
### Variability of the statistic
Unfortunately, in our scenario, the CEO is not willing to part with the full set of data, but they will allow potential franchise buyers to see a small sample of the data in order to help the potential buyer decide whether set up a new franchise.
The CEO is willing to give each potential franchise buyer a random sample of data from 20 stores.
As with any numerical characteristic which describes a subset of the population, the estimated slope of a sample will vary from sample to sample.
Consider the linear model which describes revenue (in \$1,000) based on advertising dollars (in \$1,000).
The least squares regression model uses the data to find a sample linear fit: $$\hat{y} = b_0 + b_1 x.$$
Two random samples of 20 stores shows different least squares regression lines in @fig-sand-samp-1 and @fig-sand-samp-2, depending on which observations are selected.
Both trends are similar to those seen in @fig-sandpop, which describes the population.
```{r}
set.seed(470)
sandwich2 <- sandwich |>
sample_n(size = 20)
sandwich3 <- sandwich |>
sample_n(size = 20)
sandwich_many <- sandwich |>
rep_sample_n(size = 20, replace = FALSE, reps = 50)
```
```{r}
#| label: fig-sand-samp
#| fig-cap: |
#| Two random samples of 20 stores from the entire population. A linear
#| trend between advertising and revenue is observed in both.
#| fig-subcap:
#| - First sample.
#| - Second sample.
#| fig-alt: |
#| For two random samples of 20 stores, scatterplots with advertising amount on
#| the x-axis and revenue on the y-axis. Linear models are superimposed.
#| The points in each plot show reasonably strong and positive linear trends.
#| layout-ncol: 2
#| out-width: 100%
#| fig-width: 5
ggplot(sandwich2, aes(x = ad, y = rev)) +
geom_point(size = 3, fill = IMSCOL["green", "full"], color = "#FFFFFF", shape = 22) +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE, color = IMSCOL["green", "full"]) +
scale_x_continuous(labels = label_dollar(suffix = "K")) +
scale_y_continuous(labels = label_dollar(suffix = "K")) +
labs(
x = "Amount spent on advertising",
y = "Revenue",
title = "Chain sandwich store",
subtitle = "Random sample of 20 stores"
) +
coord_cartesian(ylim = c(0, 65))
ggplot(sandwich3, aes(x = ad, y = rev)) +
geom_point(size = 3, fill = IMSCOL["gray", "full"], color = "#FFFFFF", shape = 23) +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE, color = IMSCOL["gray", "full"]) +
scale_x_continuous(labels = label_dollar(suffix = "K")) +
scale_y_continuous(labels = label_dollar(suffix = "K")) +
labs(
x = "Amount spent on advertising",
y = "Revenue",
title = "Chain sandwich store",
subtitle = "Another random sample of 20 stores"
) +
coord_cartesian(ylim = c(0, 65))
```
\vspace{-5mm}
@fig-sand-samp12 shows the two samples and the least squares regressions from @fig-sand-samp on the same plot.
We can see that the two lines are different.
That is, there is **variability** in the regression line from sample to sample.
The concept of the sampling variability is something you've seen before, but in this lesson, you will focus on the variability of the line often measured through the variability of a single statistic: **the slope of the line**.
```{r}
#| include: false
terms_chp_24 <- c(terms_chp_24, "variability of the slope")
```
```{r}
#| label: fig-sand-samp12
#| fig-cap: |
#| The linear models from the two random samples are quite similar,
#| but not exactly the same.
#| fig-alt: |
#| For two different random samples, superimposed onto the same plot,
#| scatterplot with advertising amount on the x-axis and revenue on the y-axis.
#| Two linear models are plotted to demonstrate that the lines are very similar,
#| yet they are not the same.
#| fig-asp: 0.5
ggplot() +
geom_point(data = sandwich2, aes(x = ad, y = rev),
size = 3, , shape = 22,
fill = IMSCOL["green", "full"], color = "#FFFFFF") +
geom_smooth(data = sandwich2, aes(x = ad, y = rev),
method = "lm", se = FALSE, fullrange = TRUE,
color = IMSCOL["green", "full"]) +
geom_point(data = sandwich3, aes(x = ad, y = rev),
size = 3, , shape = 23,
fill = IMSCOL["gray", "full"], color = "#FFFFFF") +
geom_smooth(data = sandwich3, aes(x = ad, y = rev),
method = "lm", se = FALSE, fullrange = TRUE,
color = IMSCOL["gray", "full"]) +
scale_x_continuous(labels = label_dollar(suffix = "K")) +
scale_y_continuous(labels = label_dollar(suffix = "K")) +
labs(
x = "Amount spent on advertising",
y = "Revenue",
title = "Chain sandwich store",
subtitle = "Two random samples of 20 stores"
) +
coord_cartesian(ylim = c(0, 65))
```
@fig-slopes shows least squares lines fit to many more random samples of 20 from the population.
\vspace{-5mm}
```{r}
#| label: fig-slopes
#| fig-cap: |
#| If repeated samples of size 20 are taken from the entire population, each
#| linear model will be slightly different. The red line provides the linear fit to
#| the entire population.
#| fig-alt: |
#| An x-y coordinate system with least squares regression lines from many
#| random samples of size 20 (no points are plotted). The lines vary around the
#| true population line. On the x-axis is advertising amount; on the y-axis
#| is revenue.
#| fig-asp: 0.5
ggplot() +
geom_smooth(
data = sandwich_many, aes(x = ad, y = rev, group = replicate),
method = "lm", se = FALSE, fullrange = TRUE,
color = IMSCOL["gray", "f2"], size = 0.5
) +
geom_smooth(
data = sandwich, aes(x = ad, y = rev), method = "lm", se = FALSE,
fullrange = TRUE, color = IMSCOL["red", "full"]
) +
scale_x_continuous(labels = label_dollar(suffix = "K")) +
scale_y_continuous(labels = label_dollar(suffix = "K")) +
labs(
x = "Amount spent on advertising",
y = "Revenue",
title = "Chain sandwich store",
subtitle = "Many random samples of 20 stores"
) +
coord_cartesian(ylim = c(0, 65))
```
\clearpage
You might notice in @fig-slopes that the $\hat{y}$ values given by the lines are much more consistent in the middle of the dataset than at the ends.
The reason is that the data itself anchors the lines in such a way that the line must pass through the center of the data cloud.
The effect of the fan-shaped lines is that predicted revenue for advertising close to \$4,000 will be much more precise than the revenue predictions made for \$1,000 or \$7,000 of advertising.
The distribution of slopes (for samples of size $n=20$) can be seen in @fig-sand20lm.
```{r}
#| label: fig-sand20lm
#| fig-cap: |
#| Variability of slope estimates from many different samples of stores,
#| each of size 20.
#| fig-alt: |
#| Histogram of the slope values from many random samples of size 20.
#| The slope estimates vary from about 2.1 to 8. The histogram is reasonably
#| bell-shaped.
sandwich_many_lm <- sandwich_many |>
group_by(replicate) |>
do(tidy(lm(rev ~ ad, data = .))) |>
filter(term == "ad")
ggplot(sandwich_many_lm, aes(x = estimate)) +
geom_histogram(binwidth = 0.5) +
labs(
x = "Slope estimate",
y = "Count",
title = "Chain sandwich store",
subtitle = "Many random samples of 20 stores"
)
```
Recall, the example described in this introduction is hypothetical.
That is, we created an entire population in order demonstrate how the slope of a line would vary from sample to sample.
The tools in this textbook are designed to evaluate only one single sample of data.
With actual studies, we do not have repeated samples, so we are not able to use repeated samples to visualize the variability in slopes.
We have seen variability in samples throughout this text, so it should not come as a surprise that different samples will produce different linear models.
However, it is nice to visually consider the linear models produced by different slopes.
Additionally, as with measuring the variability of previous statistics (e.g., $\overline{X}_1 - \overline{X}_2$ or $\hat{p}_1 - \hat{p}_2$), the histogram of the sample statistics can provide information related to inferential considerations.
In the following sections, the distribution (i.e., histogram) of $b_1$ (the estimated slope coefficient) will be constructed in the same three ways that, by now, may be familiar to you.
First (in @sec-randslope), the distribution of $b_1$ when $\beta_1 = 0$ is constructed by randomizing (permuting) the response variable.
Next (in @sec-bootbeta1), we can bootstrap the data by taking random samples of size $n$ from the original dataset.
And last (in @sec-mathslope), we use mathematical tools to describe the variability using the $t$-distribution that was first encountered in @sec-one-mean-math.
\clearpage
## Randomization test for the slope {#sec-randslope}
\index{randomization test for the slope}
Consider data on 100 randomly selected births gathered originally from the US Department of Health and Human Services.
Some of the variables are plotted in @fig-babyweight.
The scientific research interest at hand will be in determining the linear relationship between weight of baby at birth (in lbs) and number of weeks of gestation.
The dataset is quite rich and deserves exploring, but for this example, we will focus only on the weight of the baby.
::: {.data data-latex=""}
The [`births14`](http://openintrostat.github.io/openintro/reference/births14.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
We will work with a random sample of 100 observations from these data.
:::
```{r}
#| include: false
terms_chp_24 <- c(terms_chp_24, "randomization test for the slope")
```
```{r}
#| label: fig-babyweight
#| fig-cap: |
#| Weight of baby at birth (in lbs) as plotted by four other birth variables
#| (mother's weight gain, mother's age, number of hospital visits, and weeks
#| gestation).
#| fig-alt: |
#| Four different scatterplots, all with weight of baby on the y-axis.
#| On the x-axis are weight gained by mother, mother's age, number of hospital
#| visits, and weeks gestation. Weeks gestation and weight of baby show the
#| strongest linear relationship (which is positive).
set.seed(47)
births14_100 <- births14 |>
drop_na() |>
sample_n(100)
p_gained <- ggplot(births14_100) +
geom_jitter(aes(y = weight, x = gained)) +
xlab("Weight gained by mother") +
ylab("Weight of baby")
p_mage <- ggplot(births14_100) +
geom_jitter(aes(y = weight, x = mage)) +
xlab("Mother's age") +
ylab("Weight of baby")
p_visits <- ggplot(births14_100) +
geom_jitter(aes(y = weight, x = visits)) +
xlab("Number of hospital visits during pregnancy") +
ylab("Weight of baby")
p_weeks <- ggplot(births14_100) +
geom_jitter(aes(x = weeks, y = weight)) +
xlab("Weeks of gestation") +
ylab("Weight of baby")
p_gained + p_mage + p_visits + p_weeks +
plot_layout(ncol = 2)
```
As you have seen previously, statistical inference typically relies on setting a null hypothesis which is hoped to be subsequently rejected.
In the linear model setting, we might hope to have a linear relationship between `weeks` and `weight` in settings where `weeks` gestation is known and `weight` of baby needs to be predicted.
The relevant hypotheses for the linear model setting can be written in terms of the population slope parameter.
Here the population refers to a larger population of births in the US.
- $H_0: \beta_1= 0$, there is no linear relationship between `weight` and `weeks`.
- $H_A: \beta_1 \ne 0$, there is some linear relationship between `weight` and `weeks`.
Recall that for the randomization test, we permute one variable to eliminate any existing relationship between the variables.
That is, we set the null hypothesis to be true, and we measure the natural variability in the data due to sampling but **not** due to variables being correlated.
@fig-permweightScatter-1 shows the observed data and @fig-permweightScatter-2 shows one permutation of the `weight` variable.
The careful observer can see that each of the observed values for `weight` (and for `weeks`) exist in both the original data plot as well as the permuted `weight` plot, but the `weight` and `weeks` gestation are no longer matched for a given birth.
That is, each `weight` value is randomly assigned to a new `weeks` gestation.
By repeatedly permuting the response variable, any pattern in the linear model that is observed is due only to random chance (and not an underlying relationship).
The randomization test compares the slopes calculated from the permuted response variable with the observed slope.
If the observed slope is inconsistent with the slopes from permuting, we can conclude that there is some underlying relationship (and that the slope is not merely due to random chance).
```{r}
#| label: fig-permweightScatter
#| fig-cap: |
#| Permutation removes the linear relationship between `weight` and `weeks`.
#| Repeated permutations allow for quantifying the variability in the slope
#| under the condition that there is no linear relationship (i.e., that the
#| null hypothesis is true).
#| fig-subcap:
#| - Original data.
#| - Permuted data.
#| fig-alt: |
#| Two scatterplots, both with length of gestation on the x-axis and weight
#| of baby on the y-axis. The left panel is the original data. The right panel
#| is data where the weight of the baby has been permuted across the
#| observations.
#| out-width: 100%
#| layout-ncol: 2
#| fig-width: 5
set.seed(4747)
ggplot(births14_100) +
geom_point(aes(x = weeks, y = weight)) +
labs(
x = "Length of gestation (weeks)",
y = "Weight of baby (lbs)",
title = "Original data"
)
ggplot(births14_100) +
geom_point(aes(x = weeks, y = sample(weight))) +
labs(
x = "Length of gestation (weeks)",
y = "Permuted weight of baby (lbs)",
title = "Permuted data"
)
```
### Observed data
We will continue to use the births data to investigate the linear relationship between `weight` and `weeks` gestation.
Note that the least squares model (see @sec-model-slr) describing the relationship is given in @tbl-ls-births.
The columns in @tbl-ls-births are further described in @sec-mathslope.
```{r}
#| label: tbl-ls-births
#| tbl-cap: |
#| The least squares estimates of the intercept and slope are given in
#| the estimate column. The observed slope is 0.335.
#| tbl-pos: H
lm(weight ~ weeks, data = births14_100) |>
tidy() |>
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) |>
kbl(linesep = "", booktabs = TRUE,
digits = 2, align = "lrrrr") |>
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")) |>
column_spec(1, width = "10em", monospace = TRUE) |>
column_spec(2:5, width = "5em")
```
### Variability of the statistic
After permuting the data, the least squares estimate of the line can be computed.
Repeated permutations and slope calculations describe the variability in the line (i.e., in the slope) due only to the natural variability and not due to a relationship between `weight` and `weeks` gestation.
@fig-permweekslm shows two different permutations of `weight` and the resulting linear models.
```{r}
#| label: fig-permweekslm
#| fig-cap: |
#| Two permutations of `weight` with slightly different least
#| squares regression lines.
#| fig-subcap:
#| - First permutation.
#| - Second permutation.
#| fig-alt: |
#| Two scatterplots, both with length of gestation on the x-axis and
#| weight of baby on the y-axis. Each plot includes data where the weight of
#| the baby has been permuted across the observations. The two different
#| permutations produce slightly different least squares regression lines.
#| out-width: 100%
#| layout-ncol: 2
#| fig-width: 5
set.seed(470)
ggplot(births14_100, aes(x = weeks, y = sample(weight))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
labs(
y = "Permuted weight of baby (lbs)",
x = "Length of gestation (weeks)",
title = "First permutation of weight"
)
ggplot(births14_100, aes(x = weeks, y = sample(weight))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
labs(
y = "Permuted weight of baby (lbs)",
x = "Length of gestation (weeks)",
title = "Second permutation of weight"
)
```
As you can see, sometimes the slope of the permuted data is positive, sometimes it is negative.
Because the randomization happens under the condition of no underlying relationship (because the response variable is completely mixed with the explanatory variable), we expect to see the center of the randomized slope distribution to be zero.
### Observed statistic vs. null statistics
```{r}
#| label: fig-nulldistBirths
#| fig-cap: |
#| Histogram of slopes given different permutations of the `weight` variable.
#| The vertical red line is at the observed value of the slope, 0.335.
#| fig-alt: |
#| Histogram of slopes describing the linear model from permuted weight
#| regressed on weeks gestation. The permuted slopes range from -0.15 to +0.15
#| and are nowhere near the observed slope value of 0.335.
#| fig-asp: 0.48
perm_slope <- births14_100 |>
specify(weight ~ weeks) |>
hypothesize(null = "independence") |>
generate(reps = 1000, type = "permute") |>
calculate(stat = "slope")
obs_slope <- births14_100 |>
specify(weight ~ weeks) |>
calculate(stat = "slope") |>
pull()
ggplot(data = perm_slope, aes(x = stat)) +
geom_histogram() +
geom_vline(xintercept = obs_slope, color = IMSCOL["red", "full"]) +
labs(x = "Randomly generated slopes", y = "Count")
```
\vspace{-2mm}
As we can see from @fig-nulldistBirths, a slope estimate as extreme as the observed slope estimate (the red line) never happened in many repeated permutations of the `weight` variable.
That is, if indeed there were no linear relationship between `weight` and `weeks`, the natural variability of the slopes would produce estimates between approximately -0.15 and +0.15.
We reject the null hypothesis.
Therefore, we believe that the slope observed on the original data is not just due to natural variability and indeed, there is a linear relationship between `weight` of baby and `weeks` gestation for births in the US.
\vspace{-2mm}
## Bootstrap confidence interval for the slope {#sec-bootbeta1}
\index{bootstrap CI for the slope}
As we have seen in previous chapters, we can use bootstrapping to estimate the sampling distribution of the statistic of interest (here, the slope) without the null assumption of no relationship (which was the condition in the randomization test).
Because interest is now in creating a CI, there is no null hypothesis, so there won't be any reason to permute either of the variables.
```{r}
#| include: false
terms_chp_24 <- c(terms_chp_24, "bootstrap CI for the slope")
```
### Observed data
Returning to the births data, we may want to consider the relationship between `mage` (mother's age) and `weight`.
Is `mage` a good predictor of `weight`?
And if so, what is the relationship?
That is, what is the slope that models average `weight` of baby as a function of `mage` (mother's age)?
The linear model regressing `weight` on `mage` is provided in @tbl-ls-births-mage.
```{r}
#| echo: false
set.seed(4747)
births4 <- births14_100 |>
sample_n(size = 100, replace = TRUE)
births5 <- births14_100 |>
sample_n(size = 100, replace = TRUE)
births_many_BS <- births14_100 |>
rep_sample_n(size = 100, replace = TRUE, reps = 50)
```
\vspace{-3mm}
```{r}
#| label: fig-magePlot
#| fig-cap: |
#| Using the original data, the weight of baby as a linear model of
#| mother's age. Notice that the relationship between mother's age and weight
#| of baby is not as strong as the relationship we saw previously between
#| weeks gestation and weight of baby.
#| fig-alt: |
#| Scatterplot with mother's age on the x-axis and baby's weight on the
#| y-axis. A linear model is superimposed. The points show a weak positive
#| linear trend.
#| fig-asp: 0.5
ggplot(births14_100) +
geom_point(aes(x = mage, y = weight)) +
geom_smooth(aes(x = mage, y = weight),
method = "lm", se = FALSE,
fullrange = TRUE,
) +
labs(x = "Mother's age", y = "Weight of baby")
```
```{r}
#| label: tbl-ls-births-mage
#| tbl-cap: |
#| The least squares estimates of the intercept and slope are given in
#| the estimate column. The observed slope is 0.036.
#| tbl-pos: H
lm(weight ~ mage, data = births14_100) |>
tidy() |>
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) |>
kbl(linesep = "", booktabs = TRUE,
digits = 2, align = "lrrrr") |>
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")) |>
column_spec(1, width = "10em", monospace = TRUE) |>
column_spec(2:5, width = "5em")
```
### Variability of the statistic
Because the focus here is *not* on a null distribution, we sample with replacement $n = 100$ observations from the original dataset.
Recall that with bootstrapping the resample always has the same number of observations as the original dataset in order to mimic the process of taking a sample from the population.
When sampling in the linear model case, consider each observation to be a single dot.
If the dot is resampled, both the `weight` and the `mage` measurement are observed.
The measurements are linked to the dot (i.e., to the birth in the sample).
```{r}
#| label: fig-birth2BS
#| fig-cap: |
#| Original and one bootstrap sample of the births data. It is difficult to
#| differentiate between the two plots, as (within a single bootstrap sample)
#| the observations which have been resampled twice are plotted as points on
#| top of one another. The red circles represent points in the original data
#| which were not included in the bootstrap sample. The blue circles represent
#| a data point that was repeatedly resampled (and is therefore darker) in the
#| bootstrap sample. The green circles represent a particular structure to the
#| data which is observed in both the original and bootstrap samples.
#| fig-subcap:
#| - Original data.
#| - Bootstrapped data.
#| fig-alt: |
#| Two scatterplots, both with mother's age on the x-axis and baby's weight
#| on the y-axis. The left plot is the original data. The right plot is the
#| bootstrapped data. Comparing the bootstrapped points to the original points,
#| we can see that some observations were sampled more than once, and some
#| observations were not selected for the bootstrap sample at all.
#| out-width: 100%
#| layout-ncol: 2
#| fig-width: 5
ggplot(births14_100) +
geom_point(aes(x = mage, y = weight), alpha = 0.4) +
geom_smooth(aes(x = mage, y = weight),
method = "lm", se = FALSE,
fullrange = TRUE
) +
xlab("mother's age") +
ylab("weight of baby") +
ggtitle("Original data") +
geom_point(x = 29, y = 9, color = IMSCOL["green", "full"], pch = 1, size = 8) +
geom_point(x = 20, y = 6, color = IMSCOL["red", "full"], pch = 1, size = 8) +
geom_point(x = 39, y = 9, pch = 1, size = 8)
ggplot(births5) +
geom_point(aes(x = mage, y = weight), alpha = 0.4) +
geom_smooth(aes(x = mage, y = weight),
method = "lm", se = FALSE,
fullrange = TRUE
) +
xlab("mother's age") +
ylab("weight of baby") +
ggtitle("Bootstrap sample") +
geom_point(x = 29, y = 9, color = IMSCOL["green", "full"], pch = 1, size = 8) +
geom_point(x = 20, y = 6, color = IMSCOL["red", "full"], pch = 1, size = 8) +
geom_point(x = 39, y = 9, pch = 1, size = 8)
```
@fig-birth2BS-1 shows the original data as compared with a single bootstrap sample in @fig-birth2BS-2, resulting in (slightly) different linear models.
The red circles represent points in the original data which were not included in the bootstrap sample.
The blue circles represent a point that was repeatedly resampled (and is therefore darker) in the bootstrap sample.
The green circles represent a particular structure to the data which is observed in both the original and bootstrap samples.
By repeatedly resampling, we can see dozens of bootstrapped slopes on the same plot in @fig-birthBS.
```{r}
#| label: fig-birthBS
#| fig-cap: |
#| Repeated bootstrap resamples of size 100 are taken from the original data. Each
#| of the bootstrapped linear models is slightly different.
#| fig-alt: |
#| An x-y coordinate system with least squares regression lines from many
#| bootstrap samples (no points are plotted). The lines vary around the observed
#| population line. On the x-axis is mother's age; on the y-axis is baby's weight.
ggplot(births_many_BS, aes(x = mage, y = weight, group = replicate)) +
geom_smooth(method = "lm", se = FALSE, color = IMSCOL["blue", "f2"], fullrange = TRUE) +
labs(
x = "Mother's age",
y = "Weight of baby"
)
```
Recall that in order to create a confidence interval for the slope, we need to find the range of values that the statistic (here the slope) takes on from different bootstrap samples.
@fig-mageBSslopes is a histogram of the relevant bootstrapped slopes.
We can see that a 95% bootstrap percentile interval for the true population slope is given by (-0.01, 0.081).
We are 95% confident that for the model describing the population of births, predicting weight of baby from mother's age, a one unit increase in `mage` (in years) is associated with an increase in predicted average baby `weight` of between -0.01 and 0.081 pounds. Notice that the CI contains zero, so the true relationship *might* be null!
```{r}
#| label: fig-mageBSslopes
#| fig-cap: |
#| The original births data on baby's weight and mother's age is
#| bootstrapped 1,000 times. The histogram provides a sense for the
#| variability of the slope of the linear model from sample to sample.
#| fig-alt: |
#| Histogram of the slopes computed from many bootstrapped samples.
#| The bootstrap samples range from -0.05 (with the 2.5 percentile at -0.01) to
#| +0.1 (with the 97.5 percentile at 0.081). The bootstrapped slopes form a
#| histogram that is reasonably symmetric and bell-shaped.
set.seed(47)
births_many_BS_1000 <- births14_100 |>
rep_sample_n(size = 100, replace = TRUE, reps = 1000)
births_many_lm_BS <- births_many_BS_1000 |>
group_by(replicate) |>
do(tidy(lm(weight ~ mage, data = .))) |>
filter(term == "mage")
lower <- round(quantile(births_many_lm_BS$estimate, 0.025), 3)
upper <- round(quantile(births_many_lm_BS$estimate, 0.975), 3)
ggplot(births_many_lm_BS, aes(x = estimate)) +
geom_histogram() +
annotate("segment", x = lower, xend = lower, y = 0, yend = 25, linetype = "dashed") +
annotate("segment", x = upper, xend = upper, y = 0, yend = 25, linetype = "dashed") +
annotate("text", x = lower, y = 35, label = "2.5 percentile\n-0.01", size = 5) +
annotate("text", x = upper, y = 35, label = "97.5 percentile\n0.081", size = 5) +
labs(
x = "Bootstrapped values of the slope for predicting weight of baby from mother's age",
y = NULL
) +
theme(axis.text.y = element_blank())
```
::: {.workedexample data-latex=""}
Using @fig-mageBSslopes, calculate the bootstrap estimate for the standard error of the slope.
Using the bootstrap standard error, find a 95% bootstrap SE confidence interval for the true population slope, and interpret the interval in context.
------------------------------------------------------------------------
Notice that most of the bootstrapped slopes fall between -0.01 and +0.08 (a range of 0.09).
Using the empirical rule (that with bell-shaped distributions, most observations are within two standard errors of the center), the standard error of the slopes is approximately 0.0225.
The critical value for a 95% confidence interval is $z^\star = 1.96$ which leads to a confidence interval of $b_1 \pm 1.96 \cdot SE \rightarrow 0.036 \pm 1.96 \cdot 0.0225 \rightarrow (-0.0081, 0.0801).$ The bootstrap SE confidence interval is almost identical to the bootstrap percentile interval.
In context, we are 95% confident that for the model describing the population of births, predicting weight of baby from mother's age, a one unit increase in `mage` (in years) is associated with an increase in predicted average baby `weight` of between -0.0081 and 0.0801 pounds.
:::
## Mathematical model for testing the slope {#sec-mathslope}
When certain technical conditions apply, it is convenient to use mathematical approximations to test and estimate the slope parameter.
The approximations will build on the t-distribution which was described in @sec-inference-one-mean.
The mathematical model is often correct and is usually easy to implement computationally.
The validity of the technical conditions will be considered in detail in @sec-tech-cond-linmod.
In this section, we discuss uncertainty in the estimates of the slope and y-intercept for a regression line.
Just as we identified standard errors for point estimates in previous chapters, we start by discussing standard errors for the slope and y-intercept estimates.
### Observed data
**Midterm elections and unemployment**
Elections for members of the United States House of Representatives occur every two years, coinciding every four years with the U.S.
Presidential election.
The set of House elections occurring during the middle of a Presidential term are called midterm elections.
In America's two-party system (the vast majority of House members through history have been either Republicans or Democrats), one political theory suggests the higher the unemployment rate, the worse the President's party will do in the midterm elections.
In 2020 there were 232 Democrats, 198 Republicans, and 1 Libertarian in the House.
To assess the validity of the claim related to unemployment and voting patterns, we can compile historical data and look for a connection.
We consider every midterm election from 1898 to 2018, with the exception of the elections during the Great Depression.
The House of Representatives is made up of 435 voting members.
::: {.data data-latex=""}
The [`midterms_house`](http://openintrostat.github.io/openintro/reference/midterms_house.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
:::
@fig-unemploymentAndChangeInHouse shows these data and the least-squares regression line:
$$
\begin{aligned}
&\texttt{percent change in House seats for President's party} \\
&\qquad\qquad= -7.36 - 0.89 \times \texttt{(unemployment rate)}
\end{aligned}
$$
We consider the percent change in the number of seats of the President's party (e.g., percent change in the number of seats for Republicans in 2018) against the unemployment rate.
Examining the data, there are no clear deviations from linearity or substantial outliers (see @sec-resids for a discussion on using residuals to visualize how well a linear model fits the data).
While the data are collected sequentially, a separate analysis was used to check for any apparent correlation between successive observations; no such correlation was found.
```{r}
#| label: fig-unemploymentAndChangeInHouse
#| fig-cap: |
#| The percent change in House seats for the President's party in each election
#| from 1898 to 2010 plotted against the unemployment rate. The two points for the
#| Great Depression have been removed, and a least squares regression line has been
#| fit to the data.
#| fig-alt: |
#| Scatterplot with percent unemployed on the x-axis and percent change in
#| House seats for the President's party on the y-axis. Each point represents a
#| different President's midterm and is colored according to their political party
#| (Democrat or Republican). The relationship is moderate and negative.
#| out-width: 100%
#| fig-asp: 0.5
years_to_label <- c(2019, 2003, 1995, 1983, 2011, 1899)
midterms_house_with_labels <- midterms_house |>
mutate(
label_top = word(potus, -1),
label_bottom = year - 1
)
midterms_house_with_labels |>
filter(!(year %in% c(1935, 1939))) |>
ggplot(aes(x = unemp, y = house_change)) +
geom_point(aes(color = party, shape = party), size = 3, alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
scale_color_openintro() +
scale_shape_manual(values = c(16, 17)) +
scale_x_continuous(labels = label_percent(scale = 1, accuracy = 1), limits = c(3.5, 12.1), breaks = c(4, 8, 12)) +
scale_y_continuous(labels = label_percent(scale = 1, accuracy = 1), limits = c(-31, 11), breaks = c(-30, -20, -10, 0, 10)) +
labs(
x = "Percent unemployed",
y = "Percent change in House seats\nfor the President's party",
color = "Party", shape = "Party"
) +
geom_text(data = midterms_house_with_labels |> filter(year %in% years_to_label),
aes(x = unemp, y = house_change + 4, label = paste0(label_top, "\n", label_bottom)), color = IMSCOL["black", "full"]
) +
theme(
legend.position = c(0.8, 0.8),
legend.background = element_rect(color = "white")
)
```
::: {.guidedpractice data-latex=""}
The data for the Great Depression (1934 and 1938) were removed because the unemployment rate was 21% and 18%, respectively.
Do you agree that they should be removed for this investigation?
Why or why not?[^24-inf-model-slr-1]
:::
[^24-inf-model-slr-1]: The answer to this question relies on the idea that statistical data analysis is somewhat of an art.
That is, in many situations, there is no "right" answer.
As you do more and more analyses on your own, you will come to recognize the nuanced understanding which is needed for a particular dataset.
In terms of the Great Depression, we will provide two contrasting considerations.
Each of these points would have very high leverage on any least-squares regression line, and years with such high unemployment may not help us understand what would happen in other years where the unemployment is only modestly high.
On the other hand, the Depression years are exceptional cases, and we would be discarding important information if we exclude them from a final analysis.
There is a negative slope in the line shown in @fig-unemploymentAndChangeInHouse.
However, this slope (and the y-intercept) are only estimates of the parameter values.
We might wonder, is this convincing evidence that the "true" linear model has a negative slope?
That is, do the data provide strong evidence that the political theory is accurate, where the unemployment rate is a useful predictor of the midterm election?
We can frame this investigation into a statistical hypothesis test:
- $H_0$: $\beta_1 = 0$. The true linear model has slope zero.
- $H_A$: $\beta_1 \neq 0$. The true linear model has a slope different than zero. The unemployment is predictive of whether the President's party wins or loses seats in the House of Representatives.
We would reject $H_0$ in favor of $H_A$ if the data provide strong evidence that the true slope parameter is different than zero.
To assess the hypotheses, we identify a standard error for the estimate, compute an appropriate test statistic, and identify the p-value.
\clearpage
### Variability of the statistic
Just like other point estimates we have seen before, we can compute a standard error and test statistic for $b_1$.
We will generally label the test statistic using a $T$, since it follows the $t$-distribution.
\index{T score!slope}
We will rely on statistical software to compute the standard error and leave the explanation of how this standard error is determined to a second or third statistics course.
@tbl-midtermUnempRegTable shows software output for the least squares regression line in @fig-unemploymentAndChangeInHouse.
The row labeled `unemp` includes all relevant information about the slope estimate (i.e., the coefficient of the unemployment variable, the related SE, the T statistic, and the corresponding p-value).
```{r}
#| include: false
terms_chp_24 <- c(terms_chp_24, "t-distribution for slope")
```
```{r}
#| label: tbl-midtermUnempRegTable
#| tbl-cap: |
#| Output from statistical software for the regression line modeling the
#| midterm election losses for the President's party as a response to
#| unemployment.
#| tbl-pos: H
d <- midterms_house
th <- !d$year %in% c(1935, 1939)
lm(house_change ~ unemp, d[th, ]) |>
tidy() |>
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) |>
kbl(linesep = "", booktabs = TRUE,
digits = 2, align = "lrrrr") |>
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")) |>
column_spec(1, width = "10em", monospace = TRUE) |>
column_spec(2:5, width = "5em")
```
::: {.workedexample data-latex=""}
What do the first and second columns of @tbl-midtermUnempRegTable represent?
------------------------------------------------------------------------
The entries in the first column represent the least squares estimates, $b_0$ and $b_1$, and the values in the second column correspond to the standard errors of each estimate.
Using the estimates, we could write the equation for the least square regression line as
$$ \hat{y} = -7.36 - 0.89 x $$
where $\hat{y}$ in this case represents the predicted change in the number of seats for the president's party, and $x$ represents the unemployment rate.
:::
We previously used a $t$-test statistic for hypothesis testing in the context of numerical data.
Regression is very similar.
In the hypotheses we consider, the null value for the slope is 0, so we can compute the test statistic using the T score formula:
$$
T \ = \ \frac{\text{estimate} - \text{null value}}{\text{SE}} = \ \frac{-0.89 - 0}{0.835} = \ -1.07
$$
The T score we calculated corresponds to the third column of @tbl-midtermUnempRegTable.
::: {.workedexample data-latex=""}
Use @tbl-midtermUnempRegTable to determine the p-value for the hypothesis test.
------------------------------------------------------------------------
The last column of the table gives the p-value for the two-sided hypothesis test for the coefficient of the unemployment rate 0.2961 That is, the data do not provide convincing evidence that a higher unemployment rate has any correspondence with smaller or larger losses for the President's party in the House of Representatives in midterm elections. If there was no linear relationship between the two variables (i.e., if $\beta_1 = 0)$, then we would expect to see linear models as or more extreme that the observed model roughly 30% of the time.
:::
\clearpage
### Observed statistic vs. null statistics
As the final step in a mathematical hypothesis test for the slope, we use the information provided to make a conclusion about whether the data could have come from a population where the true slope was zero (i.e., $\beta_1 = 0$).
Before evaluating the formal hypothesis claim, sometimes it is important to check your intuition.
Based on everything we have seen in the examples above describing the variability of a line from sample to sample, ask yourself if the linear relationship given by the data could have come from a population in which the slope was truly zero.
::: {.workedexample data-latex=""}
Examine @fig-elmhurstScatterWLine, which relates the Elmhurst College aid and student family income.
Are you convinced that the slope is discernibly different from zero?
That is, do you think a formal hypothesis test would reject the claim that the true slope of the line should be zero?
------------------------------------------------------------------------
While the relationship between the variables is not perfect, there is an evident decreasing trend in the data.
Such a distinct trend suggests that the hypothesis test will reject the null claim that the slope is zero.
:::
::: {.data data-latex=""}
The [`elmhurst`](http://openintrostat.github.io/openintro/reference/elmhurst.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
:::
The tools in this section help you go beyond a visual interpretation of the linear relationship toward a formal mathematical claim about whether the slope estimate is meaningfully different from 0 to suggest that the true population slope is different from 0.
```{r}
#| label: tbl-rOutputForIncomeAidLSRLineInInferenceSection
#| tbl-cap: |
#| Summary of least squares fit for the Elmhurst College data, where we
#| are predicting the gift aid by the university based on the family
#| income of students.
#| tbl-pos: H
elmhurst |>
mutate(
gift_aid = gift_aid * 1000,
family_income = family_income * 1000
) |>
lm(gift_aid ~ family_income, data = _) |>
tidy() |>
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) |>
kbl(linesep = "", booktabs = TRUE,
digits = 2, align = "lrrrr") |>
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")) |>
column_spec(1, width = "10em", monospace = TRUE) |>
column_spec(2:5, width = "5em")
```
\vspace{-5mm}
::: {.guidedpractice data-latex=""}
@tbl-rOutputForIncomeAidLSRLineInInferenceSection shows statistical software output from fitting the least squares regression line shown in @fig-elmhurstScatterWLine.
Use the output to formally evaluate the following hypotheses.[^24-inf-model-slr-2]
- $H_0$: The true coefficient for family income is zero.
- $H_A$: The true coefficient for family income is not zero.
:::
[^24-inf-model-slr-2]: We look in the second row corresponding to the family income variable.
We see the point estimate of the slope of the line is -0.0431, the standard error of this estimate is 0.0108, and the $t$-test statistic is $T = -3.98$.
The p-value corresponds exactly to the two-sided test we are interested in: 0.0002.
The p-value is so small that we reject the null hypothesis and conclude that family income and financial aid at Elmhurst College for freshman entering in the year 2011 are negatively correlated and the true slope parameter is indeed less than 0, just as we believed in our analysis of @fig-elmhurstScatterWLine.
\vspace{-5mm}
::: {.important data-latex=""}
**Inference for regression.**
We usually rely on statistical software to identify point estimates, standard errors, test statistics, and p-values in practice.
However, be aware that software will not generally check whether the method is appropriate, meaning we must still verify conditions are met.
See @sec-tech-cond-linmod.
:::
\clearpage
## Mathematical model, interval for the slope
Similar to how we can conduct a hypothesis test for a model coefficient using regression output, we can also construct confidence intervals for the slope and intercept coefficients.
::: {.important data-latex=""}
**Confidence intervals for coefficients.**
Confidence intervals for model coefficients (e.g., the intercept or the slope) can be computed using the $t$-distribution:
$$ b_i \ \pm\ t_{df}^{\star} \times SE_{b_{i}} $$
where $t_{df}^{\star}$ is the appropriate $t^{\star}$ cutoff corresponding to the confidence level with the model's degrees of freedom, $df = n - 2$.
:::
::: {.workedexample data-latex=""}
Compute the 95% confidence interval for the coefficient using the regression output from @tbl-rOutputForIncomeAidLSRLineInInferenceSection.
------------------------------------------------------------------------
The point estimate is -0.0431 and the standard error is $SE = 0.0108$.
When constructing a confidence interval for a model coefficient, we generally use a $t$-distribution.
The degrees of freedom for the distribution are noted in the regression output, $df = 48$, allowing us to identify $t_{48}^{\star} = 2.01$ for use in the confidence interval.
We can now construct the confidence interval in the usual way:
$$
\begin{aligned}
\text{point estimate} &\pm t_{48}^{\star} \times SE \\
-0.0431 &\pm 2.01 \times 0.0108 \\
(-0.0648 &, -0.0214)
\end{aligned}
$$
We are 95% confident that for an additional one unit (i.e., $1000 increase) in family income, the university's gift aid is predicted to decrease on average by \$21.40 to \$64.80.
:::
On the topic of intervals in this book, we have focused exclusively on confidence intervals for model parameters.
However, there are other types of intervals that may be of interest (and are outside the scope of this book), including prediction intervals for a response value and confidence intervals for a mean response value in the context of regression.
\clearpage