-
Notifications
You must be signed in to change notification settings - Fork 1
/
gmm_stan_tutorial.Rmd
342 lines (277 loc) · 12.5 KB
/
gmm_stan_tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
---
title: "GMM Estimation Using Stan - Tutorial"
author: "Nicolas Kuehn and Peter Stafford"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
html_document:
keep_md: true
toc: true
toc_depth: 2
number_sections: true
highlight: tango
pdf_document: default
bibliography: REF/references.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.width=12,fig.height = 8, fig.path = 'pictures/',
root.dir = '/Users/nico/GROUNDMOTION/PROJECTS/STAN_Tutorial/Git/StanGMMTutorial')
```
# Introduction
This is a tutorial on how to use the program Stan (<https://mc-stan.org/>) to estimate the parameters of a ground-motion model (GMM).
Stan is a program that uses Bayesian inference to estimate the parameters of a model via Markov Chain Monte Carlo (MCMC) sampling.
In this tutorial, we estimate parameters of a GMM used as an example in Kuehn and Stafford.
A list of Stan programs covering a wide variety of GMMs is available in the other markdown file.
We will use he package `cmdstanR` to estimate the model parameters, sine it is a lightweight package that allows to use the lates Stan version.
We cover the basics of running the model and extracting parameters/posterior distributions, and looking at graphical summaries.
For more info and installation instructions, see <https://mc-stan.org/cmdstanr/> and the vignette <https://mc-stan.org/cmdstanr/articles/cmdstanr.html>.
For the use of the package Rstan, see <https://mc-stan.org/rstan/>.
# Getting Started
This tutorial uses **Stan** version 2.27.0 and requires the following **R** packages.
```{r load_libraries, warning=FALSE, message=FALSE}
# load required packages
library(lme4)
library(cmdstanr)
library(posterior)
library(bayesplot)
library(ggplot2)
options(mc.cores = parallel::detectCores())
```
``` {r package_options, echo=FALSE}
register_knitr_engine(override = TRUE)
```
First, we define the color scheme for the bayesplot package, and tell the cmdStanR package where to find cmdStan.
```{r set_options, results = FALSE, message=FALSE}
color_scheme_set("brightblue")
set_cmdstan_path('/Users/nico/GROUNDMOTION/SOFTWARE/cmdstan-2.27.0') # replace with Path
cmdstan_path()
cmdstan_version()
```
Next, we simulate some data.
Here, we have 50 events and 20 stations.
Each event is recorded at all 20 stations.
Events are randomly assigned a magnitude between 4 and 8, and an event term, sampled from a normal distribution with mea zero and standard deviation $\tau$.
Similarly, the $V_{S30}$ and station terms are randomly sampled for each station.
Then, for each event/station par, a distance is sampled, and the median PSA is calculated according to
$$y = c_1 + c_2 M + c_3 (8 - M)^2 + (c_4 + c_5 M) \ln \left[ R_{RUP} + h \right] + c_6 R_{RUP} + c_7 \ln \frac{V_{S30}}{400} + \delta B + \delta S + \delta WS$$
```{r simulate_data}
## simulate data
tau <- 0.5;
phiSS <- 0.5;
phiS2S <- 0.4;
neq <- 50;
nstat <- 20;
### determine M, event term and observed magnitude
dataM <- matrix(nrow = neq,ncol=2);
set.seed(5618);
for(i in 1:neq) {
mag <- round(runif(1,4,8),2);
eta <- rnorm(1,0,tau);
dataM[i,] <- c(mag,eta);
}
### determine VS, station term and observed VS
dataV <- matrix(nrow = nstat,ncol=2);
set.seed(8472);
for(i in 1:nstat) {
vs <- round(runif(1,log(300),log(1000)),2);
lambda <- rnorm(1,0,phiS2S);
dataV[i,] <- c(vs - log(400),lambda);
}
nrec <- nstat;
data <- matrix(nrow=neq * nrec,ncol = 7);
data_x <- matrix(nrow = neq * nrec, ncol = 7)
data_y <- vector(length = neq * nrec)
h <- 6
coeffs <- c(10.925, -0.985, -0.245, -3.245, 0.32, -0.008, -0.5)
set.seed(98765);
k <- 1
for(i in 1:neq) {
idx <- 1:nstat;
mag <- dataM[i,1];
eqt <- dataM[i,2];
for(j in 1:nrec) {
dist <- round(runif(1,1,200),2);
epsilon <- rnorm(1,0,phiSS);
vs <- dataV[idx[j],1];
disteff <- dist + h;
data_x[k,] <- c(1, mag, (8 - mag)^2, log(disteff), mag * log(disteff), dist, vs);
pga <- coeffs %*% data_x[k,]
pga2 <- pga + epsilon + eqt + dataV[idx[j],2];
data_y[k] <- pga2;
data[k,] <- c(mag,dist,vs,pga,pga2,i,idx[j]);
k <- k+1;
}
}
```
First, we fit a linear model using `lmer`, from the **R** package **lme4** [@Bates2015], to the data.
The package **lme4** is the successor to the package **nlme** and together these packages have been used quite extensively for the purposes of calibrating ground-motion models in the past.
These packages use more traditional maximum-likelihood based techniques with efficient numerical strategies to fit models.
They are computationally very efficient, but also have limitations with regard to what type of models can be fit.
To make our GMM linear, we have to fix the parameter `h` (often referred to as pseudo-depth of near-fault-saturation-term).
In this example, we fix it to `h = 6`, which is the value used to generate the data.
``` {r lmer}
eqid = data[,6]
statid = data[,7]
M <- data_x[,2]
M2 <- data_x[,3]
lnR <- data_x[,4]
MlnR <- data_x[,5]
R <- data_x[,6]
VS <- data_x[,7]
Y <- data_y
fit_lmer <- lmer(Y ~ 1 + M + M2 + lnR + MlnR + R + VS + (1 | eqid) + (1 | statid))
coeffs_lmer <- fixef(fit_lmer)
summary(fit_lmer)
```
# Stan
Now we describe how to fit the same model using Stan.
A Stan program is made up of blocks, like a `data {}`, `parameters {}` and a `model {}` block.
These are used to declare the data, the parameters to be estimated, and a generative model for the data.
A declaration of a variable will look like `real a;` to declare a variable `a` that is a real, or `vector[N] Y;` to declare a vector of length `N`.
Stan is typed, so there is a difference between a declaration `real a;` or `int a;`.
Constraints can be declared as `real<lower=L,upper=U> a;`, which means that `a` can take only values `L <= a <= U`.
Each line in a stan program has to end in `;`.
Below, we load a Stan model from file `STAN/gmm.stan`.
It is compiled with `mod <- cmdstan_model(file)`, which returns a `CmdStanModel` object, which can be used to access information about the model, and provides methods to for fitting the model.
``` {r stan_model}
file <- file.path('STAN', 'gmm.stan')
mod <- cmdstan_model(file)
mod$print()
```
## cmdstanR
Next, we declare the data for the Stan program, and run the sampler.
All parameters declared in the `data {}` block are defined as a named list in **R**.
Then, the `$sample()` method is used to call the MCMC sampler.
There are options that allow one to control the sampling process, such as setting the number of chains to run, the number of warm-up samples and post-warmup samples, and others.
For a full list of options, see <https://mc-stan.org/cmdstanr/reference/model-method-sample.html>.
``` {r run_stan}
data_list <- list(N = length(data_y),
N_eq = neq,
N_stat = nstat,
K = 7,
X = data_x,
Y = data_y,
idx_eq = eqid,
idx_stat = statid
)
fit <- mod$sample(
data = data_list,
seed = 123,
chains = 4,
iter_sampling = 1000,
iter_warmup = 1000,
refresh = 500,
max_treedepth = 12
)
```
### Assessing the Model Fit
Next, we do some checks to see whether there were some problems with the fit.
To understand wome of these diagnostices, see @Vehtari2020 (for Rhat) and @Betancourt2016 (for EBFMI (Energy Bayesian fraction of missing information)).
```{r diagnose}
fit$cmdstan_diagnose()
```
```{r summary}
fit$cmdstan_summary()
```
The output of running `mod$sample()` is a `CmdStanMCMC` object.
We can use the associated `print` method to get a summary of the fit.
``` {r print_stan_fit}
fit$print(variables = c('c','phi_SS', 'tau', 'phi_S2S'))
```
As we can see, the Rhat values are all close to one, indicating good convergence of the chains.
Below, we plot trace plots of the chains for the standard deviation parameters.
Trace plots are a quick qualitative view to check whether the different chains have converged.
We convert the posterior samples, which are stored in the `CmdStanMCMC` object `fit` into a `draws` format of the **posterior** package [@Buerkner2021] <https://mc-stan.org/posterior/reference/posterior-package.html>.
We then plot the traces of the standard deviations using `mcmc_trace' from the **bayesplot** package [@Gabry2019] <https://mc-stan.org/bayesplot/>.
``` {r stan_fit_trace}
posterior <- fit$draws()
mcmc_trace(posterior, regex_pars = c("phi_SS", "tau", "phi_S2S")) +
xaxis_title(size = 30, family = "sans") +
yaxis_title(size = 30, family = "sans")
```
### Assessing Results
Now, we look at some of the results.
The outcome of the fit are samples from the posterior distribution.
Together, these samples span the range of possible outcomes of the parameters, and are thus an assessment of the epistemic uncertainty associated with the model.
Histograms of the posterior samples of the parameters show the uncertainty associated with each parameter.
```{r posterior_hist}
mcmc_hist(posterior, regex_pars = c("c")) +
xaxis_title(size = 30, family = "sans") +
yaxis_title(size = 30, family = "sans")
mcmc_hist(posterior, regex_pars = c("phi_SS", "tau", "phi_S2S")) +
xaxis_title(size = 30, family = "sans") +
yaxis_title(size = 30, family = "sans")
```
Below, we plot the a density estimate of $c_1$, together with the true value (solid line) and he value estimated using `lmer` (dashed line),
```{r posterior hist2}
size_text <- 20
size_title <- 30
i <- 1 # plot first coefficient
mcmc_dens(posterior, pars = sprintf("c[%d]",i)) +
vline_at(c(coeffs[i], coeffs_lmer[i]), size = c(1,0.75), linetype = c(1,2)) +
xaxis_text(size = size_text, family = "sans") +
yaxis_text(size = size_text, family = "sans") +
xaxis_title(size = size_title, family = "sans") +
yaxis_title(size = size_title, family = "sans") +
grid_lines(color = "gray60")
```
A different way to visualize the posterior for each parameter is to plot intervals, which can be done with `mcmc_intervals`.
`mcmc_areas` plots density estimates of the posterior distribution of the parameters.
```{r posterior_intervals}
mcmc_intervals(posterior, regex_pars = c("phi_SS", "tau", "phi_S2S"),
prob = 0.5,
prob_outer = 0.9) +
xaxis_text(size = 30, family = "sans") +
yaxis_text(size = 30, family = "sans") +
grid_lines(color = "gray60")
mcmc_areas(posterior,
regex_pars = c("phi_SS", "tau", "phi_S2S"),
prob = 0.8,
prob_outer = 0.99) +
xaxis_text(size = 30, family = "sans") +
yaxis_text(size = 30, family = "sans") +
grid_lines(color = "gray60")
```
```{r posterior_intervals_deltaS}
mcmc_intervals(posterior, regex_pars = c("deltaS"),
prob = 0.84, prob_outer = 0.95) +
xaxis_text(size = 30, family = "sans") +
yaxis_text(size = 30, family = "sans") +
grid_lines(color = "gray60")
```
Pairs plots show the correlation between different parameters, and can be done via `mcmc_scatter`, `mcmc_pairs` or `mcmc_hex`.
Parameters to be plotted are selected via the `pars` or `regex_pars` argument, with the latter selecting parameters based on regular expressions.
It is also possible to transform the variables, either indvidually, or all of them.
```{r pairs plots}
mcmc_pairs(posterior, pars = "tau", regex_pars = "c\\[[1,4]\\]",
off_diag_args = list(size = 1, alpha = 0.5))
mcmc_hex(posterior, pars = c("tau", "phi_S2S"), transform = list(tau = "log"))
mcmc_scatter(posterior, pars = c("phi_SS", "phi_S2S"), transform = "log")
```
One can also extract variables from the posterior distribution (in draws format), and get summaries.
For example below we extract summaries of the event terms, and plot the event terms (and their uncertainty) against magnitude.
We also plot them agaianst the true (simulated) values, to see that they are well estimated.
```{r deltaB}
deltaB <- as.data.frame(summarise_draws(subset(posterior, variable = "^deltaB\\[[0-9]+\\]", regex = TRUE)))
df_plot <- data.frame(M = dataM[,1], deltaB = deltaB$mean, q05 = deltaB$q5, q95 = deltaB$q95, eta = dataM[,2])
ggplot(df_plot, aes(x = M, y = deltaB)) +
geom_point() +
ylim(-1.2,1.2) +
geom_pointrange(aes(ymin = q05, ymax = q95)) +
labs(title = "Event Terms") +
theme(
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
plot.title = element_text(size = 20)
)
ggplot(df_plot, aes(x = eta, y = deltaB)) +
geom_point() + geom_abline(intercept = 0, slope = 1, colour = "red") +
ylim(-1.2,1.2) +
geom_pointrange(aes(ymin = q05, ymax = q95)) +
labs(title = "Event Terms",
x = "deltaB_true", y = "detlaB_est") +
theme(
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
plot.title = element_text(size = 20)
)
```