-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path07.Rmd
1879 lines (1424 loc) · 95.1 KB
/
07.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
```{r, echo = F, cache = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
options(width = 100)
```
# Ulysses' Compass
In this chapter we contend with two contrasting kinds of statistical error:
* overfitting, "which leads to poor prediction by learning too *much* from the data"
* underfitting, "which leads to poor prediction by learning too *little* from the data" [@mcelreathStatisticalRethinkingBayesian2020, p. 192, *emphasis* added]
> Our job is to carefully navigate among these monsters. There are two common families of approaches. The first approach is to use a **regularizing prior** to tell the model not to get too excited by the data. This is the same device that non-Bayesian methods refer to as "penalized likelihood." The second approach is to use some scoring device, like **information criteria** or **cross-validation**, to model the prediction task and estimate predictive accuracy. Both families of approaches are routinely used in the natural and social sciences. Furthermore, they can be--maybe should be--used in combination. So it's worth understanding both, as you're going to need both at some point. (p. 192, **emphasis** in the original)
There's a lot going on in this chapter. More more practice with these ideas, check out @yarkoniChoosingPredictionExplanation2017, [*Choosing prediction over explanation in psychology: Lessons from machine learning*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6603289/).
#### Rethinking stargazing.
> The most common form of model selection among practicing scientists is to search for a model in which every coefficient is statistically significant. Statisticians sometimes call this **stargazing**, as it is embodied by scanning for asterisks ($^{\star \star}$) trailing after estimates....
>
> Whatever you think about null hypothesis significance testing in general, using it to select among structurally different models is a mistake--$p$-values are not designed to help you navigate between underfitting and overfitting. (p. 193, **emphasis** in the original).
McElreath spent little time discussing $p$-values and null hypothesis testing in the text. If you'd like to learn more from a Bayesian perspective, you might check out the first several chapters (particularly 10--13) in Kruschke's [-@kruschkeDoingBayesianData2015] [text](https://sites.google.com/site/doingbayesiandataanalysis/) and my [-@kurzDoingBayesianData2020] [ebook translating it to **brms** and the tidyverse](https://bookdown.org/content/3686/). The great [Frank Harrell](https://twitter.com/f2harrell) has complied [*A Litany of Problems With p-values*](https://www.fharrell.com/post/pval-litany/).
Also, don't miss the statement on $p$-values released by the [American Statistical Association](https://www.amstat.org/) [@wassersteinASAStatementPvalues2016].
## The problem with parameters
The $R^2$ is a popular way to measure how well you can retrodict the data. It traditionally follows the form
$$R^2 = \frac{\text{var(outcome)} - \text{var(residuals)}}{\text{var(outcome)}} = 1 - \frac{\text{var(residuals)}}{\text{var(outcome)}}.$$
By $\operatorname{var}()$, of course, we meant variance (i.e., what you get from the `var()` function in **R**). McElreath is not a fan of the $R^2$. But it's important in my field, so instead of a summary at the end of the chapter, we will cover the Bayesian version of $R^2$ and how to use it in **brms**.
### More parameters (almost) always improve fit.
We'll start off by making the data with brain size and body size for seven `species`.
```{r, warning = F, message = F}
library(tidyverse)
(
d <-
tibble(species = c("afarensis", "africanus", "habilis", "boisei", "rudolfensis", "ergaster", "sapiens"),
brain = c(438, 452, 612, 521, 752, 871, 1350),
mass = c(37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5))
)
```
Let's get ready for Figure 7.2. The plots in this chapter will be characterized by `theme_classic() + theme(text = element_text(family = "Courier"))`. Our color palette will come from the [**rcartocolor** package](https://CRAN.R-project.org/package=rcartocolor) [@R-rcartocolor], which provides color schemes [designed by 'CARTO'](https://carto.com/carto-colors/).
```{r}
library(rcartocolor)
```
The specific palette we'll be using is "BurgYl." In addition to palettes, the **rcartocolor** package offers a few convenience functions which make it easier to use their palettes. The `carto_pal()` function will return the HEX numbers associated with a given palette's colors and the `display_carto_pal()` function will display the actual colors.
```{r, fig.width = 6, fig.height = 2.25}
carto_pal(7, "BurgYl")
display_carto_pal(7, "BurgYl")
```
We'll be using a diluted version of the third color for the panel background (i.e., `theme(panel.background = element_rect(fill = alpha(carto_pal(7, "BurgYl")[3], 1/4)))`) and the darker purples for other plot elements. Here's the plot.
```{r, fig.width = 3.5, fig.height = 3, message = F, warning = F}
library(ggrepel)
theme_set(
theme_classic() +
theme(text = element_text(family = "Courier"),
panel.background = element_rect(fill = alpha(carto_pal(7, "BurgYl")[3], 1/4)))
)
d %>%
ggplot(aes(x = mass, y = brain, label = species)) +
geom_point(color = carto_pal(7, "BurgYl")[5]) +
geom_text_repel(size = 3, color = carto_pal(7, "BurgYl")[7], family = "Courier", seed = 438) +
labs(subtitle = "Average brain volume by body\nmass for six hominin species",
x = "body mass (kg)",
y = "brain volume (cc)") +
xlim(30, 65)
```
Before fitting our models,
> we want to standardize body mass--give it mean zero and standard deviation one--and rescale the outcome, brain volume, so that the largest observed value is 1. Why not standardize brain volume as well? Because we want to preserve zero as a reference point: No brain at all. You can't have negative brain. I don't think. (p. 195)
```{r}
d <-
d %>%
mutate(mass_std = (mass - mean(mass)) / sd(mass),
brain_std = brain / max(brain))
```
Our first statistical model will follow the form
\begin{align*}
\text{brain_std}_i & \sim \operatorname{Normal}(\mu_i, \sigma) \\
\mu_i & = \alpha + \beta \text{mass_std}_i \\
\alpha & \sim \operatorname{Normal}(0.5, 1) \\
\beta & \sim \operatorname{Normal}(0, 10) \\
\sigma & \sim \operatorname{Log-Normal}(0, 1).
\end{align*}
> This simply says that the average brain volume $b_i$ of species $i$ is a linear function of its body mass $m_i$. Now consider what the priors imply. The prior for $\alpha$ is just centered on the mean brain volume (rescaled) in the data. So it says that the average species with an average body mass has a brain volume with an 89% credible interval from about −1 to 2. That is ridiculously wide and includes impossible (negative) values. The prior for $\beta$ is very flat and centered on zero. It allows for absurdly large positive and negative relationships. These priors allow for absurd inferences, especially as the model gets more complex. And that's part of the lesson. (p. 196)
Fire up **brms**.
```{r, warning = F, message = F}
library(brms)
```
A careful study of McElreath's **R** code 7.3 will show he is modeling `log_sigma`, rather than $\sigma$. There are ways to do this with **brms** [see @Bürkner2021Distributional], but I'm going to keep things simple, here. Our approach will be to follow the above equation more literally and just slap the $\operatorname{Log-Normal}(0, 1)$ prior directly onto $\sigma$.
```{r b7.1}
b7.1 <-
brm(data = d,
family = gaussian,
brain_std ~ 1 + mass_std,
prior = c(prior(normal(0.5, 1), class = Intercept),
prior(normal(0, 10), class = b),
prior(lognormal(0, 1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 7,
file = "fits/b07.01")
```
Check the model summary.
```{r}
print(b7.1)
```
As we'll learn later on, **brms** already has a convenience function for computing the Bayesian $R^2$. At this point in the chapter, we'll follow along and make a **brms**-centric version of McElreath's `R2_is_bad()`. But because part of our version of `R2_is_bad()` will contain the `brms::predict()` function, I'm going to add a `seed` argument to make the results more reproducible.
```{r}
R2_is_bad <- function(brm_fit, seed = 7, ...) {
set.seed(seed)
p <- predict(brm_fit, summary = F, ...)
# in my experience, it's more typical to define residuals as the criterion minus the predictions
r <- d$brain_std - apply(p, 2, mean)
1 - rethinking::var2(r) / rethinking::var2(d$brain_std)
}
```
Here's the estimate for our $R^2$.
```{r}
R2_is_bad(b7.1)
```
Do note that,
> in principle, the Bayesian approach mandates that we do this for each sample from the posterior. But $R^2$ is traditionally computed only at the mean prediction. So we'll do that as well here. Later in the chapter you'll learn a properly Bayesian score that uses the entire posterior distribution. (p. 197)
Now fit the quadratic through the fifth-order polynomial models using `update()`.
```{r b7.2, message = F}
# quadratic
b7.2 <-
update(b7.1,
newdata = d,
formula = brain_std ~ 1 + mass_std + I(mass_std^2),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 7,
file = "fits/b07.02")
# cubic
b7.3 <-
update(b7.1,
newdata = d,
formula = brain_std ~ 1 + mass_std + I(mass_std^2) + I(mass_std^3),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 7,
control = list(adapt_delta = .9),
file = "fits/b07.03")
# fourth-order
b7.4 <-
update(b7.1,
newdata = d,
formula = brain_std ~ 1 + mass_std + I(mass_std^2) + I(mass_std^3) + I(mass_std^4),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 7,
control = list(adapt_delta = .995),
file = "fits/b07.04")
# fifth-order
b7.5 <-
update(b7.1,
newdata = d,
formula = brain_std ~ 1 + mass_std + I(mass_std^2) + I(mass_std^3) + I(mass_std^4) + I(mass_std^5),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 7,
control = list(adapt_delta = .99999),
file = "fits/b07.05")
```
You may have noticed we fiddled with the `adapt_delta` setting for some of the models. When you try to fit complex models with few data points and wide priors, that can cause difficulties for Stan. I'm not going to get into the details on what `adapt_delta` does, right now. But it'll make appearances in later chapters and we'll more formally introduce it in [Section 13.4.2][Non-centered chimpanzees.].
Now returning to the text,
> That last model, `m7.6`, has one trick in it. The standard deviation is replaced with a constant value 0.001. The model will not work otherwise, for a very important reason that will become clear as we plot these monsters. (p. 198)
By "last model, `m7.6`," McElreath was referring to the sixth-order polynomial, fit on page 199. McElreath's **rethinking** package is set up so the syntax is simple to replacing $\sigma$ with a constant value. We can do this with **brms**, too, but it'll take more effort. If we want to fix $\sigma$ to a constant, we'll need to define a custom likelihood. Bürkner explained how to do so in his [-@Bürkner2021Define] vignette, [*Define custom response distributions with brms*](https://CRAN.R-project.org/package=brms/vignettes/brms_customfamilies.html). I'm not going to explain this in great detail, here. In brief, first we use the `custom_family()` function to define the name and parameters of a `custom_normal()` likelihood that will set $\sigma$ to a constant value, 0.001. Second, we'll define some functions for Stan which are not defined in Stan itself and save them as `stan_funs`. Third, we make a `stanvar()` statement which will allow us to pass our `stan_funs` to `brm()`.
```{r, warning = F, message = F}
custom_normal <- custom_family(
"custom_normal", dpars = "mu",
links = "identity",
type = "real"
)
stan_funs <- "real custom_normal_lpdf(real y, real mu) {
return normal_lpdf(y | mu, 0.001);
}
real custom_normal_rng(real mu) {
return normal_rng(mu, 0.001);
}
"
stanvars <- stanvar(scode = stan_funs, block = "functions")
```
Now we can fit the model by setting `family = custom_normal`. Note that since we've set $\sigma = 0.001$, we don't need to include a prior for $\sigma$. Also notice our `stanvars = stanvars` line.
```{r b7.6}
b7.6 <-
brm(data = d,
family = custom_normal,
brain_std ~ 1 + mass_std + I(mass_std^2) + I(mass_std^3) + I(mass_std^4) + I(mass_std^5) + I(mass_std^6),
prior = c(prior(normal(0.5, 1), class = Intercept),
prior(normal(0, 10), class = b)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 7,
stanvars = stanvars,
file = "fits/b07.06")
```
Before we can do our variant of Figure 7.3, we'll need to define a few more custom functions to work with `b7.6`. The `posterior_epred_custom_normal()` function is required for `brms::fitted()` to work with our `family = custom_normal` `brmsfit` object. Same thing for `posterior_predict_custom_normal()` and `brms::predict()`. Though we won't need it until [Section 7.2.5][Scoring the right data.], we'll also define `log_lik_custom_normal` so we can use the `log_lik()` function for any models fit with `family = custom_normal`. Before all that, we need to throw in a line with the `expose_functions()` function. If you want to understand why, read up in Bürkner's [-@Bürkner2021Define] vignette. For now, just go with it.
```{r, message = F, warning = F, results = "hide"}
expose_functions(b7.6, vectorize = TRUE)
posterior_epred_custom_normal <- function(prep) {
mu <- prep$dpars$mu
mu
}
posterior_predict_custom_normal <- function(i, prep, ...) {
mu <- prep$dpars$mu
mu
custom_normal_rng(mu)
}
log_lik_custom_normal <- function(i, prep) {
mu <- prep$dpars$mu
y <- prep$data$Y[i]
custom_normal_lpdf(y, mu)
}
```
Okay, here's how we might plot the result for the first model, `b7.1`.
```{r, fig.width = 3.25, fig.height = 3, message = F, warning = F}
library(tidybayes)
nd <- tibble(mass_std = seq(from = -2, to = 2, length.out = 100))
fitted(b7.1,
newdata = nd,
probs = c(.055, .945)) %>%
data.frame() %>%
bind_cols(nd) %>%
ggplot(aes(x = mass_std, y = Estimate)) +
geom_lineribbon(aes(ymin = Q5.5, ymax = Q94.5),
color = carto_pal(7, "BurgYl")[7], size = 1/2,
fill = alpha(carto_pal(7, "BurgYl")[6], 1/3)) +
geom_point(data = d,
aes(y = brain_std),
color = carto_pal(7, "BurgYl")[7]) +
labs(subtitle = bquote(italic(R)^2==.(round(R2_is_bad(b7.1), digits = 2))),
x = "body mass (standardized)",
y = "brain volume (standardized)") +
coord_cartesian(xlim = range(d$mass_std))
```
To slightly repurpose a quote from McElreath:
> We'll want to do this for the next several models, so let's write a function to make it repeatable. If you find yourself writing code more than once, it is usually saner to write a function and call the function more than once instead. (p. 197)
Our `make_figure7.3()` function will wrap the simulation, data wrangling, and plotting code all in one. It takes two arguments, the first of which defines which fit object we'd like to plot. If you look closely at Figure 7.3 in the text, you'll notice that the range of the $y$-axis changes in the last three plots. Our second argument, `ylim`, will allow us to vary those parameters across subplots.
```{r}
make_figure7.3 <- function(brms_fit, ylim = range(d$brain_std)) {
# compute the R2
r2 <- R2_is_bad(brms_fit)
# define the new data
nd <- tibble(mass_std = seq(from = -2, to = 2, length.out = 200))
# simulate and wrangle
fitted(brms_fit, newdata = nd, probs = c(.055, .945)) %>%
data.frame() %>%
bind_cols(nd) %>%
# plot!
ggplot(aes(x = mass_std)) +
geom_lineribbon(aes(y = Estimate, ymin = Q5.5, ymax = Q94.5),
color = carto_pal(7, "BurgYl")[7], size = 1/2,
fill = alpha(carto_pal(7, "BurgYl")[6], 1/3)) +
geom_point(data = d,
aes(y = brain_std),
color = carto_pal(7, "BurgYl")[7]) +
labs(subtitle = bquote(italic(R)^2==.(round(r2, digits = 2))),
x = "body mass (std)",
y = "brain volume (std)") +
coord_cartesian(xlim = c(-1.2, 1.5),
ylim = ylim)
}
```
Here we make and save the six subplots in bulk.
```{r}
p1 <- make_figure7.3(b7.1)
p2 <- make_figure7.3(b7.2)
p3 <- make_figure7.3(b7.3)
p4 <- make_figure7.3(b7.4, ylim = c(.25, 1.1))
p5 <- make_figure7.3(b7.5, ylim = c(.1, 1.4))
p6 <- make_figure7.3(b7.6, ylim = c(-0.25, 1.5)) +
geom_hline(yintercept = 0, color = carto_pal(7, "BurgYl")[2], linetype = 2)
```
Now use **patchwork** syntax to bundle them all together.
```{r, fig.width = 6, fig.height = 8.5, warning = F, message = F}
library(patchwork)
((p1 | p2) / (p3 | p4) / (p5 | p6)) +
plot_annotation(title = "Figure7.3. Polynomial linear models of increasing\ndegree for the hominin data.")
```
It's not clear, to me, why our **brms**-based 89% intervals are so much broader than those in the text. But if you do fit the size models with `rethinking::quap()` and compare the summaries, you'll see our **brms**-based parameters are systemically less certain than those fit with `quap()`. In you have an answer or perhaps even an alternative workflow to solve the issue, [share on GitHub](https://github.com/ASKurz/Statistical_Rethinking_with_brms_ggplot2_and_the_tidyverse_2_ed/issues).
If you really what the axes scaled in the original metrics of the variables rather than their standardized form, you can use the re-scaling techniques from back in [Section 4.5.1.0.1][Overthinking: Converting back to natural scale.].
"This is a general phenomenon: If you adopt a model family with enough parameters, you can fit the data exactly. But such a model will make rather absurd predictions for yet-to-be-observed cases" (pp. 199--201).
#### Rethinking: Model fitting as compression.
> Another perspective on the absurd model just above is to consider that model fitting can be considered a form of **data compression**. Parameters summarize relationships among the data. These summaries compress the data into a simpler form, although with loss of information ("lossy" compression) about the sample. The parameters can then be used to generate new data, effectively decompressing the data. (p. 201, **emphasis** in the original)
### Too few parameters hurts, too.
> The overfit polynomial models fit the data extremely well, but they suffer for this within-sample accuracy by making nonsensical out-of-sample predictions. In contrast, **underfitting** produces models that are inaccurate both within and out of sample. They learn too little, failing to recover regular features of the sample. (p. 201, **emphasis** in the original)
To explore the distinctions between overfitting and underfitting, we'll need to refit `b7.1` and `b7.4` several times after serially dropping one of the rows in the data. You can `filter()` by `row_number()` to drop rows in a [**tidyverse** kind of way](https://dplyr.tidyverse.org/reference/slice.html). For example, we can drop the second row of `d` like this.
```{r}
d %>%
mutate(row = 1:n()) %>%
filter(row_number() != 2)
```
In his **Overthinking: Dropping rows** box (p. 202), McElreath encouraged us to take a look at the `brain_loo_plot()` function to get a sense of how he made his Figure 7.4. Here it is.
```{r, warning = F, message = F}
library(rethinking)
brain_loo_plot
```
Though we'll be taking a slightly different route than the one outlined in McElreath's `brain_loo_plot()` function, we can glean some great insights. For example, we'll be refitting our **brms** models with `update()`.
```{r b7.1.1}
b7.1.1 <-
update(b7.1,
newdata = filter(d, row_number() != 1),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 7,
file = "fits/b07.01.1")
print(b7.1.1)
```
You can see by the `newdata` statement that `b7.1.1` is fit on the `d` data after dropping the first row. Here's how we might amend out plotting strategy from before to visualize the posterior mean for the model-implied trajectory.
```{r, fig.width = 3.25, fig.height = 3}
fitted(b7.1.1,
newdata = nd) %>%
data.frame() %>%
bind_cols(nd) %>%
ggplot(aes(x = mass_std)) +
geom_line(aes(y = Estimate),
color = carto_pal(7, "BurgYl")[7], size = 1/2, alpha = 1/2) +
geom_point(data = d,
aes(y = brain_std),
color = carto_pal(7, "BurgYl")[7]) +
labs(subtitle = "b7.1.1",
x = "body mass (std)",
y = "brain volume (std)") +
coord_cartesian(xlim = range(d$mass_std),
ylim = range(d$brain_std))
```
Now we'll make a **brms**-oriented version of McElreath's `brain_loo_plot()` function. Our version `brain_loo_lines()` will refit the model and extract the lines information in one step. We'll leave the plotting for a second step.
```{r}
brain_loo_lines <- function(brms_fit, row, ...) {
# refit the model
new_fit <-
update(brms_fit,
newdata = filter(d, row_number() != row),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 7,
refresh = 0,
...)
# pull the lines values
fitted(new_fit,
newdata = nd) %>%
data.frame() %>%
select(Estimate) %>%
bind_cols(nd)
}
```
Here's how `brain_loo_lines()` works.
```{r, message = F}
brain_loo_lines(b7.1, row = 1) %>%
glimpse()
```
Working within the **tidyverse** paradigm, we'll make a tibble with the predefined `row` values. We will then use `purrr::map()` to plug those `row` values into `brain_loo_lines()`, which will return the desired posterior `mean` values for each corresponding value of `mass_std`. Here we do that for both `b7.1` and `b7.4`.
```{r, echo = F}
# save(b7.1_fits, file = "/Users/solomonkurz/Dropbox/Recoding Statistical Rethinking 2nd ed/fits/b07.01_fits.rda")
# save(b7.4_fits, file = "/Users/solomonkurz/Dropbox/Recoding Statistical Rethinking 2nd ed/fits/b07.04_fits.rda")
load("/Users/solomonkurz/Dropbox/Recoding Statistical Rethinking 2nd ed/fits/b07.01_fits.rda")
load("/Users/solomonkurz/Dropbox/Recoding Statistical Rethinking 2nd ed/fits/b07.04_fits.rda")
```
```{r, eval = F}
b7.1_fits <-
tibble(row = 1:7) %>%
mutate(post = purrr::map(row, ~brain_loo_lines(brms_fit = b7.1, row = .))) %>%
unnest(post)
b7.4_fits <-
tibble(row = 1:7) %>%
mutate(post = purrr::map(row, ~brain_loo_lines(brms_fit = b7.4,
row = .,
control = list(adapt_delta = .999)))) %>%
unnest(post)
```
Now for each, pump those values into `ggplot()`, customize the settings, and combine the two ggplots to make the full Figure 7.4.
```{r, fig.width = 6, fig.height = 3, warning = F, message = F}
# left
p1 <-
b7.1_fits %>%
ggplot(aes(x = mass_std)) +
geom_line(aes(y = Estimate, group = row),
color = carto_pal(7, "BurgYl")[7], size = 1/2, alpha = 1/2) +
geom_point(data = d,
aes(y = brain_std),
color = carto_pal(7, "BurgYl")[7]) +
labs(subtitle = "b7.1",
x = "body mass (std)",
y = "brain volume (std)") +
coord_cartesian(xlim = range(d$mass_std),
ylim = range(d$brain_std))
# right
p2 <-
b7.4_fits %>%
ggplot(aes(x = mass_std, y = Estimate)) +
geom_line(aes(group = row),
color = carto_pal(7, "BurgYl")[7], size = 1/2, alpha = 1/2) +
geom_point(data = d,
aes(y = brain_std),
color = carto_pal(7, "BurgYl")[7]) +
labs(subtitle = "b7.4",
x = "body mass (std)",
y = "brain volume (std)") +
coord_cartesian(xlim = range(d$mass_std),
ylim = c(-0.1, 1.4))
# combine
p1 + p2
```
"Notice that the straight lines hardly vary, while the curves fly about wildly. This is a general contrast between underfit and overfit models: sensitivity to the exact composition of the sample used to fit the model" (p. 201).
#### Rethinking: Bias and variance.
> The underfitting/overfitting dichotomy is often described as the **bias-variance trade-off**. While not exactly the same distinction, the bias-variance trade-off addresses the same problem. "Bias" is related to underfitting, while "variance" is related to overfitting. These terms are confusing, because they are used in many different ways in different contexts, even within statistics. The term "bias" also sounds like a bad thing, even though increasing bias often leads
to better predictions. (p. 201, **emphasis** in the original)
Take a look at @yarkoniChoosingPredictionExplanation2017 for more on the bias-variance trade-off. As McElreath indicated in his endnote #104 (p. 563), Hastie, Tibshirani and Friedman [-@hastie2009elements] broadly cover these ideas in their freely-downloadable text, [*The elements of statistical learning*](https://link.springer.com/book/10.1007%2F978-0-387-84858-7).
## Entropy and accuracy
> So how do we navigate between the hydra of overfitting and the vortex of underfitting? Whether you end up using regularization or information criteria or both, the first thing you must do is pick a criterion of model performance. What do you want the model to do well at? We'll call this criterion the *target*, and in this section you'll see how information theory provides a common and useful target. (p. 202, *emphasis* in the original)
### Firing the weatherperson.
If you let rain = 1 and sun = 0, here's a way to make a plot of the first table of page 203, the weatherperson's predictions.
```{r, fig.width = 6, fig.height = 1.5}
weatherperson <-
tibble(day = 1:10,
prediction = rep(c(1, 0.6), times = c(3, 7)),
observed = rep(1:0, times = c(3, 7)))
weatherperson %>%
pivot_longer(-day) %>%
ggplot(aes(x = day, y = name, fill = value)) +
geom_tile(color = "white") +
geom_text(aes(label = value, color = value == 0)) +
scale_x_continuous(breaks = 1:10, expand = c(0, 0)) +
scale_y_discrete(NULL, expand = c(0, 0)) +
scale_fill_viridis_c(direction = -1) +
scale_color_manual(values = c("white", "black")) +
theme(axis.ticks.y = element_blank(),
legend.position = "none")
```
Here's how the newcomer fared:
```{r, fig.width = 6, fig.height = 1.5}
newcomer <-
tibble(day = 1:10,
prediction = 0,
observed = rep(1:0, times = c(3, 7)))
newcomer %>%
pivot_longer(-day) %>%
ggplot(aes(x = day, y = name, fill = value)) +
geom_tile(color = "white") +
geom_text(aes(label = value, color = value == 0)) +
scale_x_continuous(breaks = 1:10, expand = c(0, 0)) +
scale_y_discrete(NULL, expand = c(0, 0)) +
scale_fill_viridis_c(direction = -1) +
scale_color_manual(values = c("white", "black")) +
theme(axis.ticks.y = element_blank(),
legend.position = "none")
```
If we do the math entailed in the tibbles, we'll see why the newcomer could boast "I'm the best person for the job" (p. 203).
```{r, message = F}
weatherperson %>%
bind_rows(newcomer) %>%
mutate(person = rep(c("weatherperson", "newcomer"), each = n()/2),
hit = ifelse(prediction == observed, 1, 1 - prediction - observed)) %>%
group_by(person) %>%
summarise(hit_rate = mean(hit))
```
#### Costs and benefits.
Our new `points` variable doesn't fit into the nice color-based `geom_tile()` plots from above, but we can still do the math.
```{r, message = F}
bind_rows(weatherperson,
newcomer) %>%
mutate(person = rep(c("weatherperson", "newcomer"), each = n()/2),
points = ifelse(observed == 1 & prediction != 1, -5,
ifelse(observed == 1 & prediction == 1, -1,
-1 * prediction))) %>%
group_by(person) %>%
summarise(happiness = sum(points))
```
#### Measuring accuracy.
> Consider computing the probability of predicting the exact sequence of days. This means computing the probability of a correct prediction for each day. Then multiply all of these probabilities together to get the joint probability of correctly predicting the observed sequence. This is the same thing as the joint likelihood, which you've been using up to this point to fit models with Bayes' theorem. This is the definition of accuracy that is maximized by the correct model.
>
> In this light, the newcomer looks even worse. (p. 204)
```{r}
bind_rows(weatherperson,
newcomer) %>%
mutate(person = rep(c("weatherperson", "newcomer"), each = n() / 2),
hit = ifelse(prediction == observed, 1, 1 - prediction - observed)) %>%
count(person, hit) %>%
mutate(power = hit ^ n,
term = rep(letters[1:2], times = 2)) %>%
select(person, term, power) %>%
pivot_wider(names_from = term,
values_from = power) %>%
mutate(probability_correct_sequence = a * b)
```
### Information and uncertainty.
Within the context of information theory [@shannonMathematicalTheoryCommunication1948; also @cover2006elements], information is "the reduction in uncertainty when we learn an outcome" (p. 205). **Information entropy** is a way of measuring that uncertainty in a way that is (a) continuous, (b) increases as the number of possible events increases, and (c) is additive. The formula for information entropy is:
$$H(p) = - \text E \log (p_i) = - \sum_{i = 1}^n p_i \log (p_i).$$
McElreath put it in words as: "The uncertainty contained in a probability distribution is the average log-probability of an event." (p. 206). We'll compute the information entropy for weather at the first unnamed location, which we'll call `McElreath's house`, and `Abu Dhabi` at once.
```{r}
tibble(place = c("McElreath's house", "Abu Dhabi"),
p_rain = c(.3, .01)) %>%
mutate(p_shine = 1 - p_rain) %>%
group_by(place) %>%
mutate(h_p = (p_rain * log(p_rain) + p_shine * log(p_shine)) %>% mean() * -1)
```
Did you catch how we used the equation $H(p) = - \sum_{i = 1}^n p_i \log (p_i)$ in our `mutate()` code, there? Our computation indicated the uncertainty is less in Abu Dhabi because it rarely rains, there. If you have sun, rain and snow, the entropy for weather is:
```{r}
p <- c(.7, .15, .15)
-sum(p * log(p))
```
"These entropy values by themselves don't mean much to us, though. Instead we can use them to build a measure of accuracy. That comes next" (p. 206).
### From entropy to accuracy.
> How can we use information entropy to say how far a model is from the target? The key lies in **divergence**:
>
>> **Divergence**: The additional uncertainty induced by using probabilities from one distribution to describe another distribution.
>
> This is often known as *Kullback-Leibler divergence* or simply KL divergence. [p. 207, **emphasis** in the original, see @kullbackInformationSufficiency1951]
The formula for the KL divergence is
$$D_\text{KL} (p, q) = \sum_i p_i \big ( \log (p_i) - \log (q_i) \big ) = \sum_i p_i \log \left ( \frac{p_i}{q_i} \right ),$$
which is what McElreath described in plainer language as "the average difference in log probability between the target ($p$) and model ($q$)" (p. 207).
In McElreath's initial example
* $p_1 = .3$,
* $p_2 = .7$,
* $q_1 = .25$, and
* $q_2 = .75$.
With those values, we can compute $D_\text{KL} (p, q)$ within a tibble like so:
```{r}
tibble(p_1 = .3,
p_2 = .7,
q_1 = .25,
q_2 = .75) %>%
mutate(d_kl = (p_1 * log(p_1 / q_1)) + (p_2 * log(p_2 / q_2)))
```
Our systems in this section are binary (e.g., $q = \{ q_i, q_2 \}$). Thus if you know $q_1 = .3$ you know of a necessity $q_2 = 1 - q_1$. Therefore we can code the tibble for the next example, for when $p = q$, like this.
```{r}
tibble(p_1 = .3) %>%
mutate(p_2 = 1 - p_1,
q_1 = p_1) %>%
mutate(q_2 = 1 - q_1) %>%
mutate(d_kl = (p_1 * log(p_1 / q_1)) + (p_2 * log(p_2 / q_2)))
```
Building off of that, you can make the data required for Figure 7.6 like this.
```{r}
t <-
tibble(p_1 = .3,
p_2 = .7,
q_1 = seq(from = .01, to = .99, by = .01)) %>%
mutate(q_2 = 1 - q_1) %>%
mutate(d_kl = (p_1 * log(p_1 / q_1)) + (p_2 * log(p_2 / q_2)))
head(t)
```
Now we have the data, plotting Figure 7.6 is a just `geom_line()` with stylistic flourishes.
```{r, fig.width = 3, fig.height = 2.75}
t %>%
ggplot(aes(x = q_1, y = d_kl)) +
geom_vline(xintercept = .3, color = carto_pal(7, "BurgYl")[5], linetype = 2) +
geom_line(color = carto_pal(7, "BurgYl")[7], size = 1.5) +
annotate(geom = "text", x = .4, y = 1.5, label = "q = p",
color = carto_pal(7, "BurgYl")[5], family = "Courier", size = 3.5) +
labs(x = "q[1]",
y = "Divergence of q from p")
```
> What divergence can do for us now is help us contrast different approximations to $p$. As an approximating function $q$ becomes more accurate, $D_\text{KL} (p, q)$ will shrink. So if we have a pair of candidate distributions, then the candidate that minimizes the divergence will be closest to the target. Since predictive models specify probabilities of events (observations), we can use divergence to compare the accuracy of models. (p. 208)
#### Rethinking: Divergence depends upon direction.
Here we see $H(p, q) \neq H(q, p)$. That is, direction matters.
```{r}
tibble(direction = c("Earth to Mars", "Mars to Earth"),
p_1 = c(.01, .7),
q_1 = c(.7, .01)) %>%
mutate(p_2 = 1 - p_1,
q_2 = 1 - q_1) %>%
mutate(d_kl = (p_1 * log(p_1 / q_1)) + (p_2 * log(p_2 / q_2)))
```
The $D_\text{KL}$ was double when applying Martian estimates to Terran estimates.
> An important practical consequence of this asymmetry, in a model fitting context, is that if we use a distribution with high entropy to approximate an unknown true distribution of events, we will reduce the distance to the truth and therefore the error. This fact will help us build generalized linear models, later on in [Chapter 10][Big Entropy and the Generalized Linear Model]. (p. 209)
### Estimating divergence.
> The point of all the preceding material about information theory and divergence is to establish both:
>
> 1. How to measure the distance of a model from our target. Information theory gives us the distance measure we need, the KL divergence.
>
> 2. How to estimate the divergence. Having identified the right measure of distance, we now need a way to estimate it in real statistical modeling tasks. (p. 209)
Now we'll start working on item #2.
Within the context of science, say we've labeled the true model for our topic of interest as $p$. We don't actually know what $p$ is--we wouldn't need the scientific method if we did. But say what we do have are two candidate models $q$ and $r$. We would at least like to know which is closer to $p$. It turns out we don't even need to know the absolute value of $p$ to achieve this. Just the relative values of $q$ and $r$ will suffice. We express model $q$'s average log-probability as $\text E \log (q_i)$. Extrapolating, the difference $\text E \log (q_i) - \text E \log (r_i)$ gives us a sense about the divergence of both $q$ and $r$ from the target $p$. That is, "we can compare the average log-probability from each model to get an estimate of the relative distance of each model from the target" (p. 210). **Deviance** and related statistics can help us towards this end. We define deviance as
$$D(q) = -2 \sum_i \log (q_i),$$
where $i$ indexes each case and $q_i$ is the likelihood for each case. Here's the deviance from the OLS version of model `m7.1`.
```{r}
lm(data = d, brain_std ~ mass_std) %>%
logLik() * -2
```
In our $D(q)$ formula, did you notice how we ended up multiplying $\sum_i \log (p_i)$ by $-2$? Frequentists and Bayesians alike make use of information theory, KL divergence, and deviance. It turns out that the differences between two $D(q)$ values follows a $\chi^2$ distribution [@wilksLargesampleDistributionLikelihood1938], which frequentists like to reference for the purpose of null-hypothesis significance testing. Many Bayesians, however, are not into all that significance-testing stuff and they aren't as inclined to multiply $\sum_i \log (p_i)$ by $-2$ for the simple purpose of scaling the associated difference distribution to follow the $\chi^2$. If we leave that part out of the equation, we end up with
$$S(q) = \sum_i \log (q_i),$$
which we can think of as a log-probability score which is "the gold standard way to compare the predictive accuracy of different models. It is an estimate of $\text E \log (q_i)$, just without the final step of dividing by the number of observations" (p. 210). When Bayesians compute $S(q)$, they do so over the entire posterior distribution. "Doing this calculation correctly requires a little subtlety. The **rethinking** package has a function called `lppd`--**log-pointwise-predictive-density**--to do this calculation for quap models" (p. 210, **emphasis** in the original for the second time, but not the first). However, I'm now aware of a similar function within **brms**. If you're willing to roll up your sleeves, a little, you can do it by hand. Here's an example with `b7.1`.
```{r, message = F}
log_lik(b7.1) %>%
data.frame() %>%
set_names(pull(d, species)) %>%
pivot_longer(everything(),
names_to = "species",
values_to = "logprob") %>%
mutate(prob = exp(logprob)) %>%
group_by(species) %>%
summarise(log_probability_score = mean(prob) %>% log())
```
"If you sum these values, you'll have the total log-probability score for the model and data" (p. 210). Here we sum those $\log (q_i)$ values up to compute $S(q)$.
```{r, message = F}
log_lik(b7.1) %>%
data.frame() %>%
set_names(pull(d, species)) %>%
pivot_longer(everything(),
names_to = "species",
values_to = "logprob") %>%
mutate(prob = exp(logprob)) %>%
group_by(species) %>%
summarise(log_probability_score = mean(prob) %>% log()) %>%
summarise(total_log_probability_score = sum(log_probability_score))
```
#### Overthinking: Computing the lppd.
The Bayesian version of the log-probability score, what we've been calling the lppd, has to account for the data and the posterior distribution. It follows the form
$$\text{lppd}(y, \Theta) = \sum_i \log \frac{1}{S} \sum_s p (y_i | \Theta_s),$$
> where $S$ is the number of samples and $\Theta_s$ is the $s$-th set of sampled parameter values in the posterior distribution. While in principle this is easy--you just need to compute the probability (density) of each observation $i$ for each sample $s$, take the average, and then the logarithm--in practice it is not so easy. The reason is that doing arithmetic in a computer often requires some tricks to retain precision. (p. 210)
Our approach to McElreath's **R** code 7.14 code will look very different. Going step by step, first we use the `brms::log_lik()` function.
```{r}
log_prob <- log_lik(b7.1)
log_prob %>%
glimpse()
```
The `log_lik()` function returned a matrix. Each occasion in the original data, $y_i$, got a column and each HMC chain iteration gets a row. Given we used the **brms** default of 4,000 HMC iterations, which corresponds to $S = 4{,}000$ in the formula. Note the each of these $7 \times 4{,}000$ values is a log-probability, not a probability itself. Thus, if we want to start summing these $s$ iterations within cases, we'll need to exponentiate them into probabilities.
```{r}
prob <-
log_prob %>%
# make it a data frame
data.frame() %>%
# add case names, for convenience
set_names(pull(d, species)) %>%
# add an s iteration index, for convenience
mutate(s = 1:n()) %>%
# make it long
pivot_longer(-s,
names_to = "species",
values_to = "logprob") %>%
# compute the probability scores
mutate(prob = exp(logprob))
prob
```
Now for each case, we take the average of each of the probability sores, and then take the log of that.
```{r, message = F}
prob <-
prob %>%
group_by(species) %>%
summarise(log_probability_score = mean(prob) %>% log())
prob
```
For our last step, we sum those values up.
```{r}
prob %>%
summarise(total_log_probability_score = sum(log_probability_score))
```
That, my friends, is the log-pointwise-predictive-density, $\text{lppd}(y, \Theta)$.
### Scoring the right data.
Since we don't have a `lppd()` function for **brms**, we'll have to turn our workflow from the last two sections into a custom function. We'll call it `my_lppd()`.
```{r, message = F}
my_lppd <- function(brms_fit) {
log_lik(brms_fit) %>%
data.frame() %>%
pivot_longer(everything(),
values_to = "logprob") %>%
mutate(prob = exp(logprob)) %>%
group_by(name) %>%
summarise(log_probability_score = mean(prob) %>% log()) %>%
summarise(total_log_probability_score = sum(log_probability_score))
}
```
Here's a **tidyverse**-style approach for computing the lppd for each of our six **brms** models.
```{r, message = F}
tibble(name = str_c("b7.", 1:6)) %>%
mutate(brms_fit = purrr::map(name, get)) %>%
mutate(lppd = purrr::map(brms_fit, ~ my_lppd(.))) %>%
unnest(lppd)
```
> When we usually have data and use it to fit a statistical model, the data comprise a **training sample**. Parameters are estimated from it, and then we can imagine using those estimates to predict outcomes in a new sample, called the **test sample**. R is going to do all of this for you. But here's the full procedure, in outline:
>
> 1. Suppose there's a training sample of size $N$.
> 2. Compute the posterior distribution of a model for the training sample, and compute the score on the training sample. Call this score $D_\text{train}$.
> 3. Suppose another sample of size $N$ from the same process. This is the test sample.
> 4. Compute the score on the test sample, using the posterior trained on the training
sample. Call this new score $D_\text{test}$.
>
> The above is a thought experiment. It allows us to explore the distinction between accuracy measured in and out of sample, using a simple prediction scenario. (p. 211, **emphasis** in the original)
We'll see how to carry out such a thought experiment in the next section.
#### Overthinking: Simulated training and testing.
McElreath plotted the results of such a thought experiment in implemented in **R** with the aid of his `sim_train_test()` function. If you're interested in how the function pulls this off, execute the code below.
```{r, eval = F}
sim_train_test
```
For the sake of brevity, I am going to show the results of a simulation based on 1,000 simulations rather than McElreath's 10,000.
```{r, eval = F}
# I've reduced this number by one order of magnitude to reduce computation time
n_sim <- 1e3
n_cores <- 8
kseq <- 1:5
# define the simulation function
my_sim <- function(k) {
print(k);
r <- mcreplicate(n_sim, sim_train_test(N = n, k = k), mc.cores = n_cores);
c(mean(r[1, ]), mean(r[2, ]), sd(r[1, ]), sd(r[2, ]))
}
# here's our dev object based on `N <- 20`
n <- 20
dev_20 <-
sapply(kseq, my_sim)
# here's our dev object based on N <- 100
n <- 100
dev_100 <-
sapply(kseq, my_sim)
```
```{r, echo = F}
# Even after reducing `n_sim` by an order of magnitude, that simulation takes 20.32234 mins to
# complete. So instead of completing it on the fly for each new version of this project, I've
# saved the results in an external file.
# save(list = c("dev_100", "dev_20", "kseq", "n_sim", "n_cores"), file = "/Users/solomonkurz/Dropbox/Recoding Statistical Rethinking/2nd ed/sims/06.sim_1.rda")
load("sims/06.sim_1.rda")
```
If you didn't quite catch it, the simulation yields `dev_20` and `dev_100`. We'll want to convert them to tibbles, bind them together, and wrangle extensively before we're ready to plot.
```{r}
dev_tibble <-
rbind(dev_20, dev_100) %>%
data.frame() %>%
mutate(statistic = rep(c("mean", "sd"), each = 2) %>% rep(., times = 2),
sample = rep(c("in", "out"), times = 2) %>% rep(., times = 2),
n = rep(c("n = 20", "n = 100"), each = 4)) %>%
pivot_longer(-(statistic:n)) %>%
pivot_wider(names_from = statistic, values_from = value) %>%
mutate(n = factor(n, levels = c("n = 20", "n = 100")),
n_par = str_extract(name, "\\d+") %>% as.double()) %>%
mutate(n_par = ifelse(sample == "in", n_par - .075, n_par + .075))
head(dev_tibble)
```
Now we're ready to make Figure 7.6.
```{r, fig.width = 6, fig.height = 3.25}
# for the annotation
text <-
dev_tibble %>%
filter(n_par > 1.5,
n_par < 2.5) %>%
mutate(n_par = ifelse(sample == "in", n_par - 0.2, n_par + 0.29))
# plot!
dev_tibble %>%
ggplot(aes(x = n_par, y = mean,
ymin = mean - sd, ymax = mean + sd,
group = sample, color = sample, fill = sample)) +
geom_pointrange(shape = 21) +
geom_text(data = text,
aes(label = sample)) +
scale_fill_manual(values = carto_pal(7, "BurgYl")[c(5, 7)]) +
scale_color_manual(values = carto_pal(7, "BurgYl")[c(7, 5)]) +
labs(title = "Figure 7.6. Deviance in and out of sample.",
x = "number of parameters",
y = "deviance") +
theme(legend.position = "none",
strip.background = element_rect(fill = alpha(carto_pal(7, "BurgYl")[1], 1/4), color = "transparent")) +
facet_wrap(~ n, scale = "free_y")
```
Even with a substantially smaller $N$, our simulation results matched up well with those in the text.
> Deviance is an assessment of predictive accuracy, not of truth. The true model, in terms of which predictors are included, is not guaranteed to produce the best predictions. Likewise a false model, in terms of which predictors are included, is not guaranteed to produce poor predictions.
>
> The point of this thought experiment is to demonstrate how deviance behaves, in theory. While deviance on training data always improves with additional predictor variables, deviance on future data may or may not, depending upon both the true data-generating process and how much data is available to precisely estimate the parameters. These facts form the basis for understanding both regularizing priors and information criteria. (p. 213)
## Golem taming: regularization
> The root of overfitting is a model's tendency to get overexcited by the training sample. When the priors are flat or nearly flat, the machine interprets this to mean that every parameter value is equally plausible. As a result, the model returns a posterior that encodes as much of the training sample--as represented by the likelihood function--as possible.
>
> One way to prevent a model from getting too excited by the training sample is to use a skeptical prior. By "skeptical," I mean a prior that slows the rate of learning from the sample. The most common skeptical prior is a **regularizing prior**. Such a prior, when tuned properly, reduces overfitting while still allowing the model to learn the regular features of a sample. (p. 214, **emphasis** in the original)
In case you were curious, here's how you might make a version Figure 7.7 with **ggplot2**.
```{r, fig.width = 3, fig.height = 2.75}
tibble(x = seq(from = - 3.5, to = 3.5, by = 0.01)) %>%
mutate(a = dnorm(x, mean = 0, sd = 0.2),
b = dnorm(x, mean = 0, sd = 0.5),
c = dnorm(x, mean = 0, sd = 1.0)) %>%
pivot_longer(-x) %>%
ggplot(aes(x = x, y = value,
fill = name, color = name, linetype = name)) +
geom_area(alpha = 1/2, size = 1/2, position = "identity") +
scale_fill_manual(values = carto_pal(7, "BurgYl")[7:5]) +
scale_color_manual(values = carto_pal(7, "BurgYl")[7:5]) +
scale_linetype_manual(values = 1:3) +
scale_x_continuous("parameter value", breaks = -3:3) +
scale_y_continuous(NULL, breaks = NULL) +
theme(legend.position = "none")
```
In our version of the plot, darker purple = more regularizing.
To prepare for Figure 7.8, we need to simulate. This time we'll wrap the basic simulation code we used before into a function we'll call `make_sim()`. Our `make_sim()` function has two parameters, `n` and `b_sigma`, both of which come from McElreath's simulation code. So you'll note that instead of hard coding the values for `n` and `b_sigma` within the simulation, we're leaving them adjustable (i.e., `sim_train_test(N = n, k = k, b_sigma = b_sigma)`). Also notice that instead of saving the simulation results as objects, like before, we're just converting them to a data frame with the `data.frame()` function at the bottom. Our goal is to use `make_sim()` within a `purrr::map2()` statement. The result will be a nested data frame into which we've saved the results of 6 simulations based off of two sample sizes (i.e., `n = c(20, 100)`) and three values of $\sigma$ for our Gaussian $\beta$ prior (i.e., `b_sigma = c(1, .5, .2)`).