# Section 7. Hierarchical models and exchangeability
2021-10-12
```{r setup, warning=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>")
library(rstan)
library(tidybayes)
library(tidyverse)
options(mc.cores = 2)
rstan_options(auto_write = TRUE)
set.seed(659)
# CM_URL (the course materials base URL) is assumed to be defined in the
# environment in which this document is knit.
```
## Resources
- BDA3 chapter 5 and [reading instructions](`r paste0(CM_URL, "BDA3_ch05_reading-instructions.pdf")`)
- lectures:
- ['7.1. Hierarchical models'](https://aalto.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=79dee6de-afa9-446f-b533-aaf400cabf2b)
- ['7.2. Exchangeability'](https://aalto.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=c822561c-f95d-44fc-a1d0-aaf400d9fae3)
- [slides](`r paste0(CM_URL, "slides_ch5.pdf")`)
- [Assignment 7](`r paste0(CM_URL, "assignment-07.pdf")`)
## Notes
### Reading instructions
- "The hierarchical models in the chapter are simple to keep computation simple. More advanced computational tools are presented in Chapters 10-12 (part of the course) and 13 (not part of the course)."
#### Exchangeability vs. independence {-}
- exchangeability and independence are two separate concepts; neither necessarily implies the other
- independent identically distributed variables/parameters are exchangeable
- exchangeability is a less strict condition than independence
#### Weakly informative priors for hierarchical variance parameters {-}
- suggestions have changed since writing section 5.7
- section 5.7 recommends use of half-Cauchy as weakly informative prior for hierarchical variance parameters
- now recommend a "half-normal if you have substantial information on the high end values, or half-$t_4$ if there might be a possibility of surprise"
- "half-normal produces usually more sensible prior predictive distributions and is thus better justified"
- "half-normal leads also usually to easier inference"
### Chapter 5. Hierarchical models
- individual parameters for groups can be modeled as coming from a *population distribution*
- model these relationships hierarchically
- hierarchical models can often have more parameters than data points, yet they avoid the overfitting that a traditional (non-hierarchical) model with that many parameters would suffer
- sections
- 5.2: how to construct a hierarchical prior distribution in the context of a fully Bayesian analysis
- 5.7: weakly informative priors
#### 5.1 Constructing a parameterized prior distribution {-}
- have historical data to inform our model
- can use it to construct a prior for our new data or use it as data to inform the posterior
- should probably not use it for both, though, so favor using it directly in the model alongside the new data
- for each experiment $j$, with data $y_j$, estimate the parameter $\theta_j$
- the parameters $\theta_j$ can come from a *population distribution* parameterized by $\text{Beta}(\alpha, \beta)$
![hierarchical structure graph](notes-assets/07_hierarchical-models_bda3-05/bda3_ch5_fig5-1.png)
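A minimal simulation of this generative structure may help make it concrete (the hyperparameter values and group sizes below are my own illustrative choices, not from the book):

```{r}
# Simulate the hierarchical structure in figure 5.1: each experiment's
# tumor probability theta_j comes from a Beta(alpha, beta) population
# distribution, and its count y_j is binomial given theta_j.
alpha <- 2
beta <- 14
J <- 10
n_rats <- sample(10:30, J, replace = TRUE)
theta <- rbeta(J, alpha, beta)
y <- rbinom(J, size = n_rats, prob = theta)
tibble(experiment = 1:J, n = n_rats, theta = round(theta, 3), y = y)
```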
#### 5.2 Exchangeability and hierarchical models {-}
- if no information (other than $y$) is available to distinguish any of the $\theta_j$'s, and no ordering (time) or grouping can be made, then we must assume symmetry among the parameters in the prior distribution
- *this symmetry is represented probabilistically by exchangeability*: the parameters $(\theta_1, \dots, \theta_J)$ are *exchangeable* in the joint distribution if $p(\theta_1, \dots, \theta_J)$ is invariant to permutations of the indices $(1, \dots, J)$
- "in practice, ignorance implies exchangeability" (pg. 104)
- the less we know about a problem, the more confidently we can claim exchangeability (i.e. we don't know any better)
- exchangeability is not the same as i.i.d:
- probability of a die landing on each face: parameters $(\theta_1, \dots, \theta_6)$ are exchangeable because we think the faces are all the same, but they are not independent because the total must sum to 1
- exchangeability when additional information is available on the groups
- if the observations can be grouped into their own submodels, but the group properties are unknown, we can place a common prior distribution on the group properties
- if $y_i$ has additional information $x_i$ so that the $y_i$ are not exchangeable, but $(y_i, x_i)$ are exchangeable, we can make a joint model for $(y_i, x_i)$ or a conditional model $y_i | x_i$
- the usual way to model exchangeability with covariates is through conditional independence
$$
p(\theta_1, \dots, \theta_J | x_1, \dots, x_J) = \int \left[ \prod_{j=1}^J p(\theta_j | \phi, x_j) \right] p(\phi | x) \, d\phi
$$
- $\phi$ is unknown, so it gets a prior distribution too: $p(\phi)$
- the posterior distribution is of the vector $(\phi, \theta)$
- thus the joint prior is: $p(\phi, \theta) = p(\phi) p(\theta|\phi)$
- joint posterior:
$$
\begin{aligned}
p(\phi, \theta | y) &\propto p(\phi, \theta) p(y|\phi, \theta) \\
&= p(\phi, \theta) p(y|\theta)
\end{aligned}
$$
- the second line follows because the data distribution $p(y | \phi, \theta)$ depends on $\phi$ only through $\theta$
- posterior predictive distributions
- get predictions on both levels
#### 5.3 Bayesian analysis of conjugate hierarchical models {-}
- analytic derivation of the conditional and marginal distributions
1. write the (unnormalized) joint posterior density $p(\theta, \phi | y)$ as a product of the hyperprior $p(\phi)$, the population distribution $p(\theta|\phi)$ and the likelihood $p(y|\theta)$
2. determine the conditional posterior density of $\theta$ given $\phi$: $p(\theta | \phi, y)$
3. estimate $\phi$ from its marginal posterior distribution $p(\phi|y)$
- drawing simulations from the posterior distribution (a code sketch follows the rat tumor figure below)
1. draw $\phi$ from $p(\phi|y)$
2. draw $\theta$ from its conditional posterior distribution $p(\theta | \phi, y)$ using the values of $\phi$ from the previous step
3. draw predictive values $\tilde{y}$ given the values of $\theta$
- example on rat tumors is a good demonstration of the shrinkage of parameter estimates in a hierarchical model
- the 1:1 line represents where the posteriors would be for a non-pooling model
- can see the shrinkage of the extreme values towards the average
- the effect is stronger for experiments with fewer rats
![rat tumor model posterior](notes-assets/07_hierarchical-models_bda3-05/bda3_ch5_fig5-4.png)
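A rough sketch of the three simulation steps for the rat tumor (beta-binomial) model, assuming `phi_draws` already holds draws of $(\alpha, \beta)$ from the marginal posterior $p(\alpha, \beta | y)$ (in BDA3 these come from a grid approximation; the draws, counts, and group size below are placeholders):

```{r}
# Placeholder draws standing in for samples from p(alpha, beta | y).
phi_draws <- tibble(alpha = rgamma(1000, shape = 4), beta = rgamma(1000, shape = 25))
# One hypothetical experiment: y_j tumors out of n_j rats.
y_j <- 4
n_j <- 14
# Step 2: theta_j | alpha, beta, y ~ Beta(alpha + y_j, beta + n_j - y_j),
# the conjugate conditional posterior.
theta_draws <- rbeta(nrow(phi_draws), phi_draws$alpha + y_j, phi_draws$beta + n_j - y_j)
# Step 3: posterior predictive tumor counts for a replicate experiment.
y_tilde <- rbinom(length(theta_draws), size = n_j, prob = theta_draws)
quantile(theta_draws, c(0.05, 0.5, 0.95))
```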
#### 5.4 Normal model with exchangeable parameters {-}
- drawing posterior predictive samples:
1. for new data $\tilde{y}$ from the $J$ existing groups, use the posterior distributions for $(\theta_1, \dots, \theta_J)$
2. for new data $\tilde{y}$ for $\tilde{J}$ new groups:
1. draw values for the hyperparameters $\mu, \tau$ (in this case)
2. draw $\tilde{J}$ new parameters $(\tilde{\theta}_1, \dots, \tilde{\theta}_{\tilde{J}})$ from $p(\tilde{\theta} | \mu, \tau)$
3. draw $\tilde{y}$ given $\tilde{\theta}$
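A sketch of the new-groups case in code, assuming `post` holds posterior draws of the hyperparameters $\mu$ and $\tau$ and that the new groups share a known sampling deviation `sigma_new` (all values below are placeholders of my own):

```{r}
# Placeholder hyperparameter draws standing in for samples from p(mu, tau | y).
post <- tibble(mu = rnorm(1000, 8, 2), tau = abs(rnorm(1000, 0, 5)))
J_new <- 3
# Draw a new theta for each of the J_new groups, one per posterior draw.
theta_new <- matrix(
  rnorm(nrow(post) * J_new, mean = post$mu, sd = post$tau),
  ncol = J_new
)
# Draw predictive data given the new thetas (sigma_new is hypothetical).
sigma_new <- 12
y_new <- matrix(rnorm(length(theta_new), mean = theta_new, sd = sigma_new), ncol = J_new)
colMeans(y_new)
```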
#### 5.5 Example: parallel experiments in eight schools {-}
- a full example of a Bayesian hierarchical modeling process and analysis
#### 5.6 Hierarchical modeling applied to a meta-analysis {-}
- another good example of analyzing the results of a Bayesian hierarchical model
- notes the importance of interpreting both the mean and standard deviation hyperparameters to understand the distribution of possible parameter values
#### 5.7 Weakly informative priors for variance parameters {-}
- "It turns out the the choice of 'noninformative' prior distribution can have a big effect on inferences," (pg. 128)
- uniform prior distributions will often lead to overly dispersed posteriors
- this is especially true for the group-level standard deviation hyperparameter when there are few groups
- the half-Cauchy is often a good balance: it restricts the prior to reasonable values but has tails fat enough to allow for surprise
- it is particularly good for variance parameters of hierarchical models with few groups
### Lecture notes
#### ['7.1. Hierarchical models'](https://aalto.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=79dee6de-afa9-446f-b533-aaf400cabf2b) {-}
- description of the diagram of a hierarchical model
- box: known value
- double box: indicates a fixed experimental value (e.g. the number of rats chosen for an experiment)
- circle: unknown parameter
- box with the subscript $j$: indicates that everything within the box is repeated $J$ times
![hierarchical box diagram](notes-assets/07_hierarchical-models_bda3-05/slides_ch5_s7-crop.jpg)
- hierarchical model shrinkage effects with sample size
- below example shows how shrinkage is stronger for data points with smaller sample size (county radon example)
![shrinkage with sample size](notes-assets/07_hierarchical-models_bda3-05/slides_ch5_s17-crop.jpg)
- below, I replicate the 8-schools model in Stan and try to recreate the plot of $\theta$ against $\tau$
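The Stan program itself lives in `models/8-schools.stan` and is not printed in these notes; a minimal sketch of the standard non-centered parameterization (following the classic eight schools example from the rstan documentation; the actual file may differ) would look like:

```stan
data {
  int<lower=0> J;              // number of schools
  vector[J] y;                 // estimated treatment effects
  vector<lower=0>[J] sigma;    // standard errors of the effects
}
parameters {
  real mu;                     // population mean effect
  real<lower=0> tau;           // population sd of school effects
  vector[J] eta;               // standardized school-level deviations
}
transformed parameters {
  vector[J] theta = mu + tau * eta;   // school-level effects
}
model {
  eta ~ normal(0, 1);          // implies theta ~ normal(mu, tau)
  y ~ normal(theta, sigma);    // mu and tau get implicit uniform priors here
}
```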
```{r}
schools_data <- list(
J = 8,
y = c(28, 8, -3, 7, -1, 1, 18, 12),
sigma = c(15, 10, 16, 11, 9, 11, 10, 18)
)
eight_schools <- stan(
file = here::here("models", "8-schools.stan"),
data = schools_data,
chains = 4,
warmup = 1000,
iter = 2000,
cores = 2,
refresh = 0
)
eight_schools
```
```{r}
theta_tau_post <- eight_schools %>%
spread_draws(theta[school], tau) %>%
mutate(school = purrr::map_chr(school, ~ LETTERS[[.x]]))
theta_tau_post %>%
filter(tau < 20) %>%
arrange(tau) %>%
ggplot(aes(x = tau, y = theta, color = school)) +
geom_smooth(
method = "loess", formula = "y ~ x", n = 250, size = 0.8, alpha = 0.1
) +
scale_x_continuous(expand = expansion(c(0, 0))) +
scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
scale_color_brewer(type = "qual", palette = "Set1") +
theme_classic()
```
#### ['7.2. Exchangeability'](https://aalto.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=c822561c-f95d-44fc-a1d0-aaf400d9fae3) {-}
- **exchangeability**: parameters $\theta_1, \dots ,\theta_J$ (or observations $y_1, \dots , y_J$) are exchangeable if the joint distribution $p$ is invariant to the permutation of indices $(1, \dots, J)$
- e.g.: $p(\theta_1, \theta_2, \theta_3) = p(\theta_2, \theta_3, \theta_1)$
- exchangeability implies symmetry
- if there is no information which can be used a priori to separate the $\theta_j$ from each other, we can assume exchangeability
  - "ignorance implies exchangeability"
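The die example can be checked numerically (a small sketch of my own): draws of $(\theta_1, \dots, \theta_6)$ from a symmetric Dirichlet are exchangeable, but the sum-to-one constraint induces negative correlation, so they are not independent.

```{r}
# Simulate from a symmetric Dirichlet(2, ..., 2) by normalizing gammas.
draws <- matrix(rgamma(6000, shape = 2), ncol = 6)
draws <- draws / rowSums(draws)
# Negative correlation between any two faces' probabilities:
# exchangeable, but not independent.
round(cor(draws[, 1], draws[, 2]), 2)
```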
---
```{r}
sessionInfo()
```