---
engine: knitr
---
# Generalized linear models {#sec-its-just-a-generalized-linear-model}
::: {.callout-note}
Chapman and Hall/CRC published this book in July 2023. You can purchase that [here](https://www.routledge.com/Telling-Stories-with-Data-With-Applications-in-R/Alexander/p/book/9781032134772).
This online version has some updates to what was printed. An online version that matches the print version is available [here](https://rohanalexander.github.io/telling_stories-published/).
:::
**Prerequisites**
- Read *Regression and Other Stories*, [@gelmanhillvehtari2020]
  - Focus on Chapters 13 "Logistic regression" and 15 "Other generalized linear models", which provide a detailed guide to generalized linear models.
- Read *An Introduction to Statistical Learning with Applications in R*, [@islr]
  - Focus on Chapter 4 "Classification", which is a complementary treatment of generalized linear models from a different perspective.
- Read *We Gave Four Good Pollsters the Same Raw Data. They Had Four Different Results*, [@cohn2016]
  - Details a situation in which different modeling choices, given the same dataset, result in different forecasts.
**Key concepts and skills**
- Linear regression can be generalized for alternative types of outcome variables.
- Logistic regression can be used when we have a binary outcome variable.
- Poisson regression can be used when we have an integer count outcome variable. A variant---negative binomial regression---is often also considered because the assumptions are less onerous.
- Multilevel modeling is an approach that can allow us to make better use of our data.
**Software and packages**
- Base R [@citeR]
- `boot` [@boot; @bootii]
- `broom.mixed` [@mixedbroom]
- `collapse` [@collapse]
- `dataverse` [@dataverse]
- `gutenbergr` [@gutenbergr]
- `janitor` [@janitor]
- `marginaleffects` [@marginaleffects]
- `modelsummary` [@citemodelsummary]
- `rstanarm` [@citerstanarm]
- `tidybayes` [@citetidybayes]
- `tidyverse` [@tidyverse]
- `tinytable` [@tinytable]
```{r}
#| message: false
#| warning: false
library(boot)
library(broom.mixed)
library(collapse)
library(dataverse)
library(gutenbergr)
library(janitor)
library(marginaleffects)
library(modelsummary)
library(rstanarm)
library(tidybayes)
library(tidyverse)
library(tinytable)
```
## Introduction
Linear models, covered in @sec-its-just-a-linear-model, have evolved substantially over the past century.\index{statistics!history of} Francis Galton,\index{Galton, Francis} mentioned in @sec-hunt-data, and others of his generation used linear regression in earnest in the late 1800s and early 1900s. Binary outcomes quickly became of interest and needed special treatment, leading to the development and wide adoption of logistic regression and similar methods in the mid-1900s [@cramer2002origins]. The generalized linear model framework came into being, in a formal sense, in the 1970s with @nelder1972generalized. Generalized linear models (GLMs)\index{linear models!generalized} broaden the types of outcomes that are allowed. We still model outcomes as a linear function, but we are less constrained. The outcome can be anything in the exponential family, and popular choices include the logistic distribution and the Poisson distribution. For the sake of a complete story, although it is beyond the scope of this book, a further generalization of GLMs is generalized additive models (GAMs), where we broaden the structure of the explanatory side. We still explain the outcome variable as an additive function of various bits and pieces, but those bits and pieces can themselves be functions. This framework was proposed in the 1990s by @hastie1990generalized.
In terms of generalized linear models, in this chapter we consider logistic, Poisson, and negative binomial regression. But we also explore a variant that is relevant to both linear models and generalized linear models: multilevel modeling. This is when we take advantage of some type of grouping that exists within our dataset.
## Logistic regression
Linear regression\index{regression!logistic}\index{logistic regression} is a useful way to better understand our data. But it assumes a continuous outcome variable that can take any number on the real line. We would like some way to use this same machinery when we cannot satisfy this condition. We turn to logistic and Poisson regression for binary and count outcome variables, respectively. They are still linear models, because the predictor variables enter in a linear fashion.
Logistic regression\index{logistic regression}, and its close variants, are useful in a variety of settings, from elections [@wang2015forecasting] through to horse racing [@chellel2018gambler; @boltonruth]. We use logistic regression when the outcome variable is a binary outcome, such as 0 or 1, or "yes" or "no". Although the presence of a binary outcome variable\index{binary outcome} may sound limiting, there are a lot of circumstances in which the outcome either naturally falls into this situation or can be adjusted into it. For instance, win or lose, available or not available, support or not.
The foundation of this is the Bernoulli distribution\index{distribution!Bernoulli}. There is a certain probability, $p$, of outcome "1" and the remainder, $1-p$, for outcome "0". We can use `rbinom()` with one trial ("size = 1") to simulate data from the Bernoulli distribution.\index{simulation}
```{r}
#| message: false
#| warning: false
set.seed(853)

bernoulli_example <-
  tibble(draws = rbinom(n = 20, size = 1, prob = 0.1))

bernoulli_example |> pull(draws)
```
One reason to use logistic regression\index{logistic regression} is that we will be modeling a probability, hence it will be bounded between 0 and 1. With linear regression we may end up with values outside this. The foundation of logistic regression is the logit function\index{logit function}:
$$
\mbox{logit}(x) = \log\left(\frac{x}{1-x}\right).
$$
This maps values between 0 and 1 onto the real line. For instance, `logit(0.1) = -2.2`, `logit(0.5) = 0`, and `logit(0.9) = 2.2` (@fig-heyitslogit). We call this the "link function". It relates the distribution of interest in a generalized linear model to the machinery we use in linear models.
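These values can be checked directly: in base R, `qlogis()` is the logit function (the quantile function of the logistic distribution) and `plogis()` is its inverse.

```{r}
# In base R, qlogis() is the logit and plogis() is its inverse
qlogis(c(0.1, 0.5, 0.9))
# approximately -2.2, 0, and 2.2

plogis(qlogis(0.25))
# recovers 0.25
```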
```{r}
#| eval: true
#| include: true
#| echo: false
#| fig-cap: "Example of the logit function for values between 0 and 1"
#| label: fig-heyitslogit
#| message: false
#| warning: false
tibble(values = seq(from = 0, to = 1, by = 0.001),
logit = logit(values)) |>
ggplot(aes(x = values, y = logit)) +
geom_line() +
theme_classic() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(x = "Values of x",
y = "logit(x)")
```
### Simulated example: weekday or weekend
To illustrate logistic regression\index{logistic regression}, we will simulate data on whether it is a weekday or weekend, based on the number of cars on the road.\index{simulation} We will assume that on weekdays the road is busier.\index{distribution!Normal}
```{r}
#| message: false
#| warning: false
set.seed(853)

week_or_weekday <-
  tibble(
    num_cars = sample.int(n = 100, size = 1000, replace = TRUE),
    noise = rnorm(n = 1000, mean = 0, sd = 10),
    is_weekday = if_else(num_cars + noise > 50, 1, 0)
  ) |>
  select(-noise)

week_or_weekday
```
```{r}
#| eval: false
#| include: false
arrow::write_parquet(x = week_or_weekday,
sink = "outputs/data/week_or_weekday.parquet")
```
We can use `glm()` from base R to do a quick estimation.\index{logistic regression!Base R} In this case we will try to work out whether it is a weekday or weekend, based on the number of cars we can see. We are interested in estimating @eq-logisticexample:
$$
\mbox{Pr}(y_i=1) = \mbox{logit}^{-1}\left(\beta_0+\beta_1 x_i\right)
$$ {#eq-logisticexample}
where $y_i$ is whether it is a weekday and $x_i$ is the number of cars on the road.
```{r}
week_or_weekday_model <-
  glm(
    is_weekday ~ num_cars,
    data = week_or_weekday,
    family = "binomial"
  )

summary(week_or_weekday_model)
```
The estimated coefficient on the number of cars is 0.19. The interpretation of coefficients in logistic regression\index{logistic regression!interpretation} is more complicated than in linear regression, as they relate to changes in the log-odds of the binary outcome. For instance, the estimate of 0.19 is the average change in the log-odds of it being a weekday associated with observing one extra car on the road. The coefficient is positive, which means an increase. Because the relationship is non-linear, the effect of a particular change depends on the baseline level of the observation. That is, an increase of 0.19 log-odds has a larger impact on the probability when the baseline log-odds are 0 than when they are 2.
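To make this non-linearity concrete, we can use `plogis()`, the inverse logit in base R, to convert the same 0.19 log-odds increase into a change in probability at two different baselines.

```{r}
# The same 0.19 increase in log-odds shifts the probability by different amounts
plogis(0 + 0.19) - plogis(0) # from a baseline of 0 log-odds: about 0.047
plogis(2 + 0.19) - plogis(2) # from a baseline of 2 log-odds: about 0.019
```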
We can translate our estimate into the probability of it being a weekday, for a given number of cars. We can add the implied probability that it is a weekday for each observation using `predictions()` from `marginaleffects`.
```{r}
week_or_weekday_predictions <-
  predictions(week_or_weekday_model) |>
  as_tibble()

week_or_weekday_predictions
```
And we can then graph the probability that our model implies, for each observation, of it being a weekday (@fig-dayornightprobs). This is a nice opportunity to consider a few different ways of illustrating the fit. While it is common to use a scatterplot (@fig-dayornightprobs-1), this is also an opportunity to use an ECDF (@fig-dayornightprobs-2).
```{r}
#| eval: true
#| fig-cap: "Logistic regression probability results with simulated data of whether it is a weekday or weekend based on the number of cars that are around"
#| include: true
#| label: fig-dayornightprobs
#| message: false
#| warning: false
#| fig-subcap: ["Illustrating the fit with a scatterplot", "Illustrating the fit with an ECDF"]
#| layout-ncol: 2
# Panel (a)
week_or_weekday_predictions |>
  mutate(is_weekday = factor(is_weekday)) |>
  ggplot(aes(x = num_cars, y = estimate, color = is_weekday)) +
  geom_jitter(width = 0.01, height = 0.01, alpha = 0.3) +
  labs(
    x = "Number of cars that were seen",
    y = "Estimated probability it is a weekday",
    color = "Was actually weekday"
  ) +
  theme_classic() +
  scale_color_brewer(palette = "Set1") +
  theme(legend.position = "bottom")

# Panel (b)
week_or_weekday_predictions |>
  mutate(is_weekday = factor(is_weekday)) |>
  ggplot(aes(x = num_cars, y = estimate, color = is_weekday)) +
  stat_ecdf(geom = "point", alpha = 0.75) +
  labs(
    x = "Number of cars that were seen",
    y = "Estimated probability it is a weekday",
    color = "Actually weekday"
  ) +
  theme_classic() +
  scale_color_brewer(palette = "Set1") +
  theme(legend.position = "bottom")
```
The marginal effect\index{marginal effects} at each observation is of interest because it provides a sense of how this probability is changing. It enables us to say that at the median (which in this case is if we were to see 50 cars) the probability of it being a weekday increases by almost five per cent if we were to see another car (@tbl-marginaleffectcar).
```{r}
#| label: tbl-marginaleffectcar
#| tbl-cap: "Marginal effect of another car on the probability that it is a weekday, at the median"
slopes(week_or_weekday_model, newdata = "median") |>
  select(term, estimate, std.error) |>
  tt() |>
  style_tt(j = 1:3, align = "lrr") |>
  format_tt(digits = 3, num_mark_big = ",", num_fmt = "decimal") |>
  setNames(c("Term", "Estimate", "Standard error"))
```
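The "almost five per cent" figure can be derived by hand. Differentiating @eq-logisticexample with respect to $x_i$ gives the slope at any particular point:

$$
\frac{\partial\, \mbox{Pr}(y_i=1)}{\partial x_i} = \beta_1 \mbox{Pr}(y_i=1)\left(1 - \mbox{Pr}(y_i=1)\right).
$$

At the median the estimated probability is close to 0.5, so the slope is roughly $0.19 \times 0.5 \times 0.5 \approx 0.05$.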
To more thoroughly examine the situation we might want to build a Bayesian model using `rstanarm`.\index{Bayesian!logistic regression}\index{logistic regression!Bayesian} As in @sec-its-just-a-linear-model we will specify priors for our model, but these will just be the default priors that `rstanarm` uses:
$$
\begin{aligned}
y_i|\pi_i & \sim \mbox{Bern}(\pi_i) \\
\mbox{logit}(\pi_i) & = \beta_0+\beta_1 x_i \\
\beta_0 & \sim \mbox{Normal}(0, 2.5)\\
\beta_1 & \sim \mbox{Normal}(0, 2.5)
\end{aligned}
$$
where $y_i$ is whether it is a weekday (actually 0 or 1), $x_i$ is the number of cars on the road, and $\pi_i$ is the probability that observation $i$ is a weekday.
```{r}
#| eval: false
#| echo: true
#| message: false
#| warning: false
week_or_weekday_rstanarm <-
  stan_glm(
    is_weekday ~ num_cars,
    data = week_or_weekday,
    family = binomial(link = "logit"),
    prior = normal(location = 0, scale = 2.5, autoscale = TRUE),
    prior_intercept = normal(location = 0, scale = 2.5, autoscale = TRUE),
    seed = 853
  )

saveRDS(
  week_or_weekday_rstanarm,
  file = "week_or_weekday_rstanarm.rds"
)
```
```{r}
#| eval: false
#| include: false
#| message: false
#| warning: false
# INTERNAL
saveRDS(
week_or_weekday_rstanarm,
file = "outputs/model/week_or_weekday_rstanarm.rds"
)
```
```{r}
#| eval: true
#| include: false
#| message: false
#| warning: false
week_or_weekday_rstanarm <-
readRDS(file = "outputs/model/week_or_weekday_rstanarm.rds")
```
The results of our Bayesian model are similar to the quick model we built using base R (@tbl-modelsummarylogistic).
```{r}
#| label: tbl-modelsummarylogistic
#| tbl-cap: "Explaining whether it is a weekday or weekend, based on the number of cars on the road"
#| message: false
#| warning: false
modelsummary(
  list(
    "Weekday or weekend" = week_or_weekday_rstanarm
  )
)
```
@tbl-modelsummarylogistic makes it clear that each of the approaches is similar in this case. They agree on the direction of the effect of seeing an extra car on the probability of it being a weekday. Even the magnitude of the effect is estimated to be similar.
### Political support in the United States
One area where logistic regression is often used is political polling\index{United States!political polling}. In many cases voting implies the need for one preference ranking, and so issues are reduced, whether appropriately or not, to "support" or "not support".\index{elections!US 2020 Presidential Election}
As a reminder, the workflow we advocate in this book is:\index{workflow}
$$\mbox{Plan} \rightarrow \mbox{Simulate} \rightarrow \mbox{Acquire} \rightarrow \mbox{Explore} \rightarrow \mbox{Share}$$
While the focus here is the exploration of data using models, we still need to do the other aspects. We begin by planning. In this case, we are interested in US political support. In particular we are interested in whether we can forecast who a respondent is likely to vote for, based only on knowing their highest level of education and gender. That means we are interested in a dataset with variables for who an individual voted for, and some of their characteristics, such as gender and education. A quick sketch of such a dataset is @fig-uspoliticalsupportsketch. We would like our model to average over these points. A quick sketch is @fig-uspoliticalsupportmodel.
::: {#fig-uspoliticalsupport layout-ncol=2 layout-valign="bottom"}
![Quick sketch of a dataset that could be used to examine US political support](figures/IMG_2054.png){#fig-uspoliticalsupportsketch}
![Quick sketch of what we expect from the analysis before finalizing either the data or the analysis](figures/IMG_2055.png){#fig-uspoliticalsupportmodel}
Sketches of the expected dataset and analysis focus and clarify our thinking even if they will be updated later
:::
We will simulate a dataset where the chance that a person supports Biden depends on their gender and education.\index{simulation}
```{r}
set.seed(853)

num_obs <- 1000

us_political_preferences <- tibble(
  education = sample(0:4, size = num_obs, replace = TRUE),
  gender = sample(0:1, size = num_obs, replace = TRUE),
  support_prob = ((education + gender) / 5)
) |>
  mutate(
    supports_biden = if_else(runif(n = num_obs) < support_prob, "yes", "no"),
    education = case_when(
      education == 0 ~ "< High school",
      education == 1 ~ "High school",
      education == 2 ~ "Some college",
      education == 3 ~ "College",
      education == 4 ~ "Post-grad"
    ),
    gender = if_else(gender == 0, "Male", "Female")
  ) |>
  select(supports_biden, gender, education)
```
For the actual data we can use the 2020 Cooperative Election Study\index{Cooperative Election Study} (CES) [@cooperativeelectionstudyus]. This is a long-standing annual survey of US political opinion. In 2020, there were 61,000 respondents who completed the post-election survey. The sampling methodology, detailed in @guidetothe2020ces [p. 13], relies on matching and is an accepted approach that balances sampling concerns and cost.
We can access the CES using `get_dataframe_by_name()` after installing and loading `dataverse`. This approach was introduced in @sec-gather-data and @sec-store-and-share. We save the data that are of interest to us, and then refer to that saved dataset.
```{r}
#| echo: true
#| eval: false
ces2020 <-
  get_dataframe_by_name(
    filename = "CES20_Common_OUTPUT_vv.csv",
    dataset = "10.7910/DVN/E9N6PH",
    server = "dataverse.harvard.edu",
    .f = read_csv
  ) |>
  select(votereg, CC20_410, gender, educ)

write_csv(ces2020, "ces2020.csv")
```
```{r}
#| echo: false
#| eval: false
# INTERNAL
write_csv(ces2020, "inputs/data/ces2020.csv")
```
```{r}
#| echo: true
#| eval: false
ces2020 <-
  read_csv(
    "ces2020.csv",
    col_types =
      cols(
        "votereg" = col_integer(),
        "CC20_410" = col_integer(),
        "gender" = col_integer(),
        "educ" = col_integer()
      )
  )

ces2020
```
```{r}
#| echo: false
#| eval: true
# INTERNAL
ces2020 <-
read_csv(
"inputs/data/ces2020.csv",
col_types =
cols(
"votereg" = col_integer(),
"CC20_410" = col_integer(),
"gender" = col_integer(),
"educ" = col_integer()
)
)
ces2020
```
When we look at the actual data, there are concerns that we did not anticipate in our sketches, so we use the codebook to investigate more thoroughly. We only want respondents who are registered to vote, and we are only interested in those who voted for either Biden or Trump. The codebook tells us that when the variable "CC20_410" is 1 the respondent supported Biden, and when it is 2 they supported Trump. We can filter to only those respondents and then add more informative labels. The CES provides genders of "male" and "female": when the variable "gender" is 1 the respondent is male, and when it is 2 the respondent is female. Finally, the codebook tells us that "educ" is a variable from 1 to 6, in increasing levels of education.
```{r}
ces2020 <-
  ces2020 |>
  filter(votereg == 1,
         CC20_410 %in% c(1, 2)) |>
  mutate(
    voted_for = if_else(CC20_410 == 1, "Biden", "Trump"),
    voted_for = as_factor(voted_for),
    gender = if_else(gender == 1, "Male", "Female"),
    education = case_when(
      educ == 1 ~ "No HS",
      educ == 2 ~ "High school graduate",
      educ == 3 ~ "Some college",
      educ == 4 ~ "2-year",
      educ == 5 ~ "4-year",
      educ == 6 ~ "Post-grad"
    ),
    education = factor(
      education,
      levels = c(
        "No HS",
        "High school graduate",
        "Some college",
        "2-year",
        "4-year",
        "Post-grad"
      )
    )
  ) |>
  select(voted_for, gender, education)
```
```{r}
#| eval: false
#| include: false
arrow::write_parquet(x = ces2020,
sink = "outputs/data/ces2020.parquet")
```
In the end we are left with 43,554 respondents (@fig-cesissogooditslikecheating).
```{r}
#| eval: true
#| echo: true
#| message: false
#| warning: false
#| fig-cap: "The distribution of presidential preferences, by gender, and highest education"
#| label: fig-cesissogooditslikecheating
ces2020 |>
  ggplot(aes(x = education, fill = voted_for)) +
  stat_count(position = "dodge") +
  facet_wrap(facets = vars(gender)) +
  theme_minimal() +
  labs(
    x = "Highest education",
    y = "Number of respondents",
    fill = "Voted for"
  ) +
  coord_flip() +
  scale_fill_brewer(palette = "Set1") +
  theme(legend.position = "bottom")
```
The model that we are interested in is:
$$
\begin{aligned}
y_i|\pi_i & \sim \mbox{Bern}(\pi_i) \\
\mbox{logit}(\pi_i) & = \beta_0+\beta_1 \times \mbox{gender}_i + \beta_2 \times \mbox{education}_i \\
\beta_0 & \sim \mbox{Normal}(0, 2.5)\\
\beta_1 & \sim \mbox{Normal}(0, 2.5)\\
\beta_2 & \sim \mbox{Normal}(0, 2.5)
\end{aligned}
$$
where $y_i$ is the political preference of the respondent, equal to 1 if Biden and 0 if Trump, $\mbox{gender}_i$ is the gender of the respondent, and $\mbox{education}_i$ is the education of the respondent. We could estimate the parameters using `stan_glm()`. Note that this model specification is generally accepted shorthand: in practice, `rstanarm` converts categorical variables into a series of indicator variables, and multiple coefficients are estimated. In the interest of run-time we will randomly sample 1,000 observations and fit the model on that, rather than the full dataset.
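To see what this conversion looks like, we can apply `model.matrix()` from base R, which implements the same formula machinery, to a few rows of made-up data (the values here are purely illustrative).

```{r}
# model.matrix() shows the indicator variables that a formula implies
illustrative_data <- data.frame(
  gender = c("Female", "Male", "Male"),
  education = c("No HS", "Post-grad", "No HS")
)

model.matrix(~ gender + education, data = illustrative_data)
```

Each level of a categorical variable, other than the baseline level, gets its own indicator column and hence its own coefficient.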
```{r}
#| eval: false
#| echo: true
#| message: false
#| warning: false
set.seed(853)

ces2020_reduced <-
  ces2020 |>
  slice_sample(n = 1000)

political_preferences <-
  stan_glm(
    voted_for ~ gender + education,
    data = ces2020_reduced,
    family = binomial(link = "logit"),
    prior = normal(location = 0, scale = 2.5, autoscale = TRUE),
    prior_intercept =
      normal(location = 0, scale = 2.5, autoscale = TRUE),
    seed = 853
  )

saveRDS(
  political_preferences,
  file = "political_preferences.rds"
)
```
```{r}
#| eval: false
#| echo: false
# INTERNAL
saveRDS(
political_preferences,
file = "outputs/model/political_preferences.rds"
)
```
```{r}
#| echo: true
#| eval: false
#| message: false
#| warning: false
political_preferences <-
readRDS(file = "political_preferences.rds")
```
```{r}
#| eval: true
#| include: false
#| message: false
#| warning: false
political_preferences <-
readRDS(file = "outputs/model/political_preferences.rds")
```
The results of our model are interesting. They suggest males were less likely to vote for Biden, and that there is a considerable effect of education (@tbl-modelsummarylogisticpolitical).
```{r}
#| label: tbl-modelsummarylogisticpolitical
#| tbl-cap: "Whether a respondent is likely to vote for Biden based on their gender and education"
#| message: false
#| warning: false
modelsummary(
  list(
    "Support Biden" = political_preferences
  ),
  statistic = "mad"
)
```
It can be useful to plot the credible intervals for these predictors (@fig-modelplotlogisticpolitical). This might be especially useful in an appendix.
```{r}
#| label: fig-modelplotlogisticpolitical
#| fig-cap: "Credible intervals for predictors of support for Biden"
modelplot(political_preferences, conf_level = 0.9) +
  labs(x = "90 per cent credible interval")
```
## Poisson regression
When we have count data we should initially think to take advantage of the Poisson distribution.\index{distribution!Poisson}\index{regression!Poisson} One application of Poisson regression is modeling the outcomes of sports. For instance @Burch2023 builds a Poisson model of hockey outcomes, following @Baio2010 who build a Poisson model of football outcomes.
The Poisson distribution is governed by one parameter, $\lambda$, which distributes probability over the non-negative integers and governs the shape of the distribution. The Poisson distribution has the interesting feature that its mean equals its variance: as the mean increases, so does the variance. The Poisson probability mass function is [@pitman, p. 121]:
$$P_{\lambda}(k) = e^{-\lambda}\lambda^k/k!\mbox{, for }k=0,1,2,\dots$$
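We can check this formula against `dpois()`, the Poisson probability mass function in base R. For instance, the probability of observing $k=2$ when $\lambda=3$:

```{r}
# From the formula directly
exp(-3) * 3^2 / factorial(2)

# From base R; both are approximately 0.224
dpois(x = 2, lambda = 3)
```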
We can simulate $n=20$ draws from the Poisson distribution\index{distribution!Poisson} with `rpois()`, where $\lambda$ is equal to three.\index{simulation}
```{r}
rpois(n = 20, lambda = 3)
```
We can also look at what happens to the distribution as we change the value of $\lambda$ (@fig-poissondistributiontakingshape).\index{distribution!Poisson}
```{r}
#| eval: true
#| include: true
#| echo: false
#| message: false
#| warning: false
#| fig-cap: "The Poisson distribution is governed by the value of the mean, which is the same as its variance"
#| label: fig-poissondistributiontakingshape
set.seed(853)
number_of_each <- 100
lambdas <- c(0, 1, 2, 4, 7, 10, 15, 25, 50)
poisson_takes_shape <-
map(lambdas, ~ tibble(lambda = rep(.x, number_of_each),
draw = rpois(n = number_of_each, lambda = .x))) |>
list_rbind()
poisson_takes_shape <- poisson_takes_shape |>
mutate(lambda = paste("lambda =", lambda),
lambda = factor(lambda, levels = paste("lambda =", lambdas)))
ggplot(poisson_takes_shape, aes(x = draw)) +
geom_density() +
facet_wrap(vars(lambda), scales = "free_y") +
theme_minimal() +
labs(x = "Integer", y = "Density")
```
### Simulated example: number of As by department
To illustrate the situation, we could simulate data about the number of As that are awarded in each university course.\index{simulation} In this simulated example, we consider three departments, each of which has many courses. Each course will award a different number of As.\index{distribution!Poisson}
```{r}
set.seed(853)

class_size <- 26

count_of_A <-
  tibble(
    # From Chris DuBois: https://stackoverflow.com/a/1439843
    department =
      c(rep.int("1", 26), rep.int("2", 26), rep.int("3", 26)),
    course = c(
      paste0("DEP_1_", letters),
      paste0("DEP_2_", letters),
      paste0("DEP_3_", letters)
    ),
    number_of_As = c(
      rpois(n = class_size, lambda = 5),
      rpois(n = class_size, lambda = 10),
      rpois(n = class_size, lambda = 20)
    )
  )
```
```{r}
#| eval: false
#| include: false
arrow::write_parquet(x = count_of_A,
sink = "outputs/data/count_of_A.parquet")
```
```{r}
#| echo: true
#| eval: true
#| message: false
#| warning: false
#| fig-cap: "Simulated number of As in various classes across three departments"
#| label: fig-simgradesdepartments
count_of_A |>
  ggplot(aes(x = number_of_As)) +
  geom_histogram(aes(fill = department), position = "dodge") +
  labs(
    x = "Number of As awarded",
    y = "Number of classes",
    fill = "Department"
  ) +
  theme_classic() +
  scale_fill_brewer(palette = "Set1") +
  theme(legend.position = "bottom")
```
Our simulated dataset has the number of As awarded by courses, which are structured within departments (@fig-simgradesdepartments). In @sec-multilevel-regression-with-post-stratification, we will take advantage of this course-level structure, but for now we ignore it and focus on the differences between departments.
The model that we are interested in estimating is:
$$
\begin{aligned}
y_i|\lambda_i &\sim \mbox{Poisson}(\lambda_i)\\
\log(\lambda_i) & = \beta_0 + \beta_1 \times \mbox{department}_i
\end{aligned}
$$
where $y_i$ is the number of A grades awarded, and we are interested in how this differs by department.
We can use `glm()` from base R to get a quick sense of the data. This function is quite general, and we specify Poisson regression by setting the "family" parameter. The estimates are contained in the first column of @tbl-modelsummarypoisson.
```{r}
grades_base <-
  glm(
    number_of_As ~ department,
    data = count_of_A,
    family = "poisson"
  )

summary(grades_base)
```
As with logistic regression, the interpretation of the coefficients from Poisson regression\index{regression!Poisson} can be difficult. The coefficient on "department2" is the log of the ratio of expected counts in department 2 relative to department 1. We expect $e^{0.883} \approx 2.4$ times as many A grades in department 2, and $e^{1.703} \approx 5.5$ times as many in department 3, compared with department 1 (@tbl-modelsummarypoisson).
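Rather than computing these by hand, we can exponentiate the estimated coefficients directly.

```{r}
# The coefficients are on the log scale; exponentiating gives
# multiplicative effects relative to department 1
exp(coef(grades_base))
```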
We could build a Bayesian model and estimate it with `rstanarm` (@tbl-modelsummarypoisson).
$$
\begin{aligned}
y_i|\lambda_i &\sim \mbox{Poisson}(\lambda_i)\\
\log(\lambda_i) & = \beta_0 + \beta_1 \times\mbox{department}_i\\
\beta_0 & \sim \mbox{Normal}(0, 2.5)\\
\beta_1 & \sim \mbox{Normal}(0, 2.5)
\end{aligned}
$$
where $y_i$ is the number of As awarded.
```{r}
#| include: true
#| message: false
#| warning: false
#| eval: false
grades_rstanarm <-
  stan_glm(
    number_of_As ~ department,
    data = count_of_A,
    family = poisson(link = "log"),
    prior = normal(location = 0, scale = 2.5, autoscale = TRUE),
    prior_intercept = normal(location = 0, scale = 2.5, autoscale = TRUE),
    seed = 853
  )

saveRDS(
  grades_rstanarm,
  file = "grades_rstanarm.rds"
)
```
```{r}
#| eval: false
#| include: false
#| message: false
#| warning: false
# INTERNAL
saveRDS(
grades_rstanarm,
file = "outputs/model/grades_rstanarm.rds"
)
```
```{r}
#| eval: true
#| include: false
#| message: false
#| warning: false
grades_rstanarm <-
readRDS(file = "outputs/model/grades_rstanarm.rds")
```
The results are in @tbl-modelsummarypoisson.
```{r}
#| label: tbl-modelsummarypoisson
#| tbl-cap: "Examining the number of A grades given in different departments"
modelsummary(
list(
"Number of As" = grades_rstanarm
)
)
```
As with logistic regression, we can use `slopes()` from `marginaleffects` to help with interpreting these results.\index{marginal effects} It may be useful to consider how we expect the number of A grades to change as we go from one department to another. @tbl-marginaleffectspoisson suggests that in our dataset, classes in Department 2 tend to have around five additional A grades, compared with Department 1, and that classes in Department 3 tend to have around 17 more A grades, compared with Department 1.
```{r}
#| label: tbl-marginaleffectspoisson
#| tbl-cap: "The estimated difference in the number of A grades awarded at each department"
slopes(grades_rstanarm) |>
select(contrast, estimate, conf.low, conf.high) |>
unique() |>
tt() |>
style_tt(j = 1:4, align = "lrrr") |>
format_tt(digits = 2, num_mark_big = ",", num_fmt = "decimal") |>
setNames(c("Compare department", "Estimate", "2.5%", "97.5%"))
```
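To connect these contrasts back to the coefficients, we can reconstruct the expected counts by hand (an illustrative sketch; it assumes the coefficient names `department2` and `department3` that R creates for a factor with department 1 as the base level):

```{r}
#| eval: false
b <- coef(grades_rstanarm)

# Expected number of A grades in each department, on the count scale
lambda_1 <- exp(b["(Intercept)"])
lambda_2 <- exp(b["(Intercept)"] + b["department2"])
lambda_3 <- exp(b["(Intercept)"] + b["department3"])

# Differences relative to department 1, comparable to the
# estimates from slopes()
lambda_2 - lambda_1
lambda_3 - lambda_1
```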
### Letters used in *Jane Eyre*
In an earlier age, @edgeworth1885methods made counts of the dactyls in Virgil's *Aeneid* (@Stigler1978 [p. 301] provides helpful background and the dataset is available using `Dactyl` from `HistData` [@HistData]). Inspired by this, we could use `gutenbergr` to get the text of *Jane Eyre* by Charlotte Brontë\index{Brontë, Charlotte!Jane Eyre}.\index{text!analysis} (Recall that in @sec-gather-data we converted PDFs of *Jane Eyre* into a dataset.) We could then consider the first ten lines of each chapter, count the number of words, and count the number of times either "E" or "e" appears. We are interested in whether the number of e/Es increases as more words are used. If not, it could suggest that the distribution of e/Es is not consistent, which could be of interest to linguists.\index{regression!Poisson}
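As a quick sanity check of the counting approach, `str_count()` from `stringr` with the pattern "e|E" counts both cases (a minimal example):

```{r}
library(stringr)

# "Jane Eyre" contains two lower-case e's and one upper-case E
str_count("Jane Eyre", "e|E")
```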
Following the workflow advocated in this book, we first sketch our dataset and model.\index{workflow} A quick sketch of what the dataset could look like is @fig-letterssketch, and a quick sketch of our model is @fig-lettersmodel.
::: {#fig-letterss layout-ncol=2 layout-valign="bottom" layout="[[50,10,50]]"}
![Planned counts, by line and chapter, in *Jane Eyre*](figures/IMG_2056.png){#fig-letterssketch}
![Expected relationship between count of e/Es and number of words in the line](figures/IMG_2075.png){#fig-lettersmodel}
Sketches of the expected dataset and analysis force us to consider what we are interested in
:::
We simulate a dataset of how the number of e/Es could be distributed following the Poisson distribution (@fig-simenum).\index{simulation}\index{distribution!Poisson}\index{distribution!uniform}
```{r}
#| echo: true
#| eval: true
#| message: false
#| warning: false
#| fig-cap: "Simulated counts of e/Es"
#| label: fig-simenum
count_of_e_simulation <-
tibble(
chapter = c(rep(1, 10), rep(2, 10), rep(3, 10)),
line = rep(1:10, 3),
number_words_in_line = runif(min = 0, max = 15, n = 30) |> round(0),
number_e = rpois(n = 30, lambda = 10)
)
count_of_e_simulation |>
ggplot(aes(y = number_e, x = number_words_in_line)) +
geom_point() +
labs(
x = "Number of words in line",
y = "Number of e/Es in the first ten lines"
) +
theme_classic() +
scale_fill_brewer(palette = "Set1")
```
We can now gather and prepare our data. We download the text of the book from Project Gutenberg using `gutenberg_download()` from `gutenbergr`.\index{text!gathering}\index{Project Gutenberg}
```{r}
#| eval: false
#| echo: true
gutenberg_id_of_janeeyre <- 1260
jane_eyre <-
gutenberg_download(
gutenberg_id = gutenberg_id_of_janeeyre,
mirror = "https://gutenberg.pglaf.org/"
)
jane_eyre
write_csv(jane_eyre, "jane_eyre.csv")
```
Having downloaded it once, we use our local copy to avoid overly imposing on the Project Gutenberg\index{Project Gutenberg} servers.
```{r}
#| eval: false
#| echo: false
# INTERNAL
write_csv(jane_eyre, "inputs/jane_eyre.csv")
```
```{r}
#| eval: false
#| echo: true
jane_eyre <- read_csv(
"jane_eyre.csv",
col_types = cols(
gutenberg_id = col_integer(),
text = col_character()
)
)
jane_eyre
```
```{r}
#| eval: true
#| echo: false
# INTERNAL
jane_eyre <- read_csv(
"inputs/jane_eyre.csv",
col_types = cols(
gutenberg_id = col_integer(),
text = col_character()
)
)
jane_eyre
```
We are interested in only those lines that have content, so we remove those empty lines that are just there for spacing.\index{text!cleaning} Then we can create counts of the number of e/Es in that line, for the first ten lines of each chapter. For instance, we can look at the first few lines and see that there are five e/Es in the first line and eight in the second.
```{r}
jane_eyre_reduced <-
jane_eyre |>
filter(!is.na(text)) |> # Remove empty lines
mutate(chapter = if_else(str_detect(text, "CHAPTER") == TRUE,
text,
NA_character_)) |> # Find start of chapter
fill(chapter, .direction = "down") |>
mutate(chapter_line = row_number(),
.by = chapter) |> # Add line number to each chapter
filter(!is.na(chapter),
chapter_line %in% c(2:11)) |> # Remove "CHAPTER I" etc
select(text, chapter) |>
mutate(
chapter = str_remove(chapter, "CHAPTER "),
chapter = str_remove(chapter, "—CONCLUSION"),
chapter = as.integer(as.roman(chapter))
) |> # Change chapters to integers
mutate(count_e = str_count(text, "e|E"),
word_count = str_count(text, "\\w+")
# From: https://stackoverflow.com/a/38058033
)
```
```{r}
jane_eyre_reduced |>
select(chapter, word_count, count_e, text) |>
head()
```
We can verify that the mean and variance of the number of e/Es are roughly similar by plotting all of the data (@fig-janeecounts). The mean, in pink, is 6.7, and the variance, in blue, is 6.2. While they are not identical, they are similar. We include the diagonal in @fig-janeecounts-2 to help with thinking about the data. If the data were on the $y=x$ line, then on average there would be one e/E per word. Given the mass of points below that line, we expect that on average there is less than one e/E per word.
```{r}
#| echo: true
#| eval: true
#| fig-cap: "Number of e/Es in the first ten lines of each chapter in Jane Eyre"
#| label: fig-janeecounts
#| message: false
#| warning: false
#| layout-ncol: 2
#| fig-subcap: ["Distribution of the number of e/Es", "Comparison of the number of e/Es in the line and the number of words in the line"]
mean_e <- mean(jane_eyre_reduced$count_e)
variance_e <- var(jane_eyre_reduced$count_e)
jane_eyre_reduced |>
ggplot(aes(x = count_e)) +
geom_histogram() +
geom_vline(xintercept = mean_e,
linetype = "dashed",
color = "#C64191") +
geom_vline(xintercept = variance_e,
linetype = "dashed",
color = "#0ABAB5") +
theme_minimal() +
labs(
y = "Count",
    x = "Number of e/Es per line for first ten lines"
)
jane_eyre_reduced |>
ggplot(aes(x = word_count, y = count_e)) +
geom_jitter(alpha = 0.5) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
theme_minimal() +
labs(
x = "Number of words in the line",
y = "Number of e/Es in the line"
)
```
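That the mean and variance are close is a property we would expect under the Poisson distribution, where both equal $\lambda$. We can illustrate this by simulating draws at the observed mean (an illustrative sketch; the $\lambda$ of 6.7 is just the sample mean from above):

```{r}
set.seed(853)

# For a Poisson distribution the mean and variance are both lambda,
# so simulated draws should have similar sample mean and variance
simulated_draws <- rpois(n = 100000, lambda = 6.7)
mean(simulated_draws)
var(simulated_draws)
```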
We could consider the following model:
$$
\begin{aligned}
y_i|\lambda_i &\sim \mbox{Poisson}(\lambda_i)\\
\log(\lambda_i) & = \beta_0 + \beta_1 \times \mbox{Number of words}_i\\
\beta_0 & \sim \mbox{Normal}(0, 2.5)\\
\beta_1 & \sim \mbox{Normal}(0, 2.5)
\end{aligned}
$$
where $y_i$ is the number of e/Es in the line and the explanatory variable is the number of words in the line. We could estimate the model using `stan_glm()`.
```{r}
#| eval: false
#| echo: true
#| message: false
#| warning: false