
# Assignment 8

2021-11-12

**[Assignment 8](`r paste0(CM_URL, "assignment-08.pdf")`)**

## Setup

```{r setup, warning=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = TRUE, comment = "#>", dpi = 300)

for (f in list.files(here::here("src"), pattern = "R$", full.names = TRUE)) {
  source(f)
}

library(rstan)
library(bayestestR)
library(loo)
library(tidybayes)
library(tidyverse)

rstan_options(auto_write = TRUE)
options(mc.cores = 2)

theme_set(theme_classic() + theme(strip.background = element_blank()))

factory <- aaltobda::factory
set.seed(678)
```

## Exercise 1. Model assessment: LOO-CV for factory data with Stan

**Use leave-one-out cross-validation (LOO-CV) to assess the predictive performance of the pooled, separate and hierarchical Gaussian models for the factory dataset (see the second exercise in Assignment 7).**

**a) Fit the models with Stan as instructed in Assignment 7.**
**To use the `loo` or `psisloo` functions, you need to compute the log-likelihood values of each observation for every posterior draw (i.e. an $S$-by-$N$ matrix, where $S$ is the number of posterior draws and $N = 30$ is the total number of observations).**
**This can be done in the `generated quantities` block in the Stan code; for a demonstration, see the Gaussian linear model `lin.stan` in the R Stan examples that can be found here.**
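
To make the shape of that matrix concrete, here is a minimal sketch in plain R rather than Stan. It assumes a Gaussian likelihood with a single mean and standard deviation (as in the pooled model). The data `y` and the draws `mu_draws` and `sigma_draws` are made up for illustration and are not output from the fitted models; a Stan `generated quantities` block would fill the same kind of matrix one posterior draw at a time.

```r
# Hypothetical posterior draws and data (illustration only, not fitted values).
set.seed(1)
S <- 100                          # number of posterior draws
y <- c(85, 90, 78, 101, 95)       # a few made-up measurements
N <- length(y)
mu_draws <- rnorm(S, mean = 90, sd = 5)
sigma_draws <- abs(rnorm(S, mean = 15, sd = 2))

# Row s, column n holds log p(y_n | mu_s, sigma_s), giving an S-by-N matrix.
log_lik <- sapply(y, function(y_n) {
  dnorm(y_n, mean = mu_draws, sd = sigma_draws, log = TRUE)
})

dim(log_lik)
```

Each column collects the log-likelihood of one observation across all draws, which is the layout the `loo` package expects for a matrix input.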


*Separate model*

```{r}
separate_model_code <- here::here(
  "models", "assignment07_factories_separate.stan"
)

separate_model_data <- list(
  y = factory,
  N = nrow(factory),
  J = ncol(factory)
)

separate_model <- rstan::stan(
  separate_model_code,
  data = separate_model_data,
  verbose = FALSE,
  refresh = 0
)

print(separate_model, pars = c("mu", "sigma"))
```

*Pooled model*

```{r}
pooled_model_code <- here::here("models", "assignment07_factories_pooled.stan")

pooled_model_data <- list(
  y = unname(unlist(factory)),
  N = length(unlist(factory))
)

pooled_model <- rstan::stan(
  pooled_model_code,
  data = pooled_model_data,
  verbose = FALSE,
  refresh = 0
)

print(pooled_model, pars = c("mu", "sigma"))
```

*Hierarchical model*

```{r}
hierarchical_model_code <- here::here(
  "models", "assignment07_factories_hierarchical.stan"
)

hierarchical_model_data <- list(
  y = factory,
  N = nrow(factory),
  J = ncol(factory)
)

hierarchical_model <- rstan::stan(
  hierarchical_model_code,
  data = hierarchical_model_data,
  verbose = FALSE,
  refresh = 0
)

print(hierarchical_model, pars = c("alpha", "tau", "mu", "sigma"))
```

**b) Compute the PSIS-LOO elpd values and the $\hat{k}$-values for each of the three models.**
**Hint! It will be convenient to visualize the $\hat{k}$-values for each model so that you can easily see how many of these values fall in the range $\hat{k} > 0.7$ to assess the reliability of the PSIS-LOO estimate for each model.**
**You can read more about the theoretical guarantees for the accuracy of the estimate depending on $\hat{k}$ from the original article (see here or here), but regarding this assignment, it suffices to understand that if all the $\hat{k}$-values are $\hat{k} \le 0.7$, the PSIS-LOO estimate can be considered to be reliable, otherwise there is a concern that it may be biased (too optimistic, overestimating the predictive accuracy of the model).**
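
As a small illustration of the threshold rule in the hint, the sketch below bins a vector of $\hat{k}$-values at 0.7 using base R. The values in `khat` are hypothetical and chosen only to show both outcomes; for a real `loo` object, `loo::pareto_k_table()` and `loo::pareto_k_ids()` provide similar summaries.

```r
# Hypothetical k-hat values, one per observation (illustration only).
khat <- c(0.12, 0.35, 0.55, 0.71, 0.48, 1.05, 0.69, 0.70)

# Label each value according to the 0.7 threshold used in this assignment.
khat_status <- ifelse(khat <= 0.7, "ok (k <= 0.7)", "unreliable (k > 0.7)")
table(khat_status)

# Indices of the observations whose k-hat exceeds the threshold.
flagged <- which(khat > 0.7)
flagged
```

If `flagged` is empty, the PSIS-LOO estimate can be treated as reliable under the rule above; otherwise the flagged observations deserve a closer look.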

```{r}
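# Run PSIS-LOO on a fitted Stan model that stores pointwise `log_lik` values.
calc_loo <- function(mdl) {
  # Keep the chains separate so that the relative efficiency can be estimated.
  log_lik <- loo::extract_log_lik(mdl, merge_chains = FALSE)
  r_eff <- loo::relative_eff(exp(log_lik))
  loo::loo(log_lik, r_eff = r_eff)
}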
```

*Separate model*

```{r}
separate_loo <- calc_loo(separate_model)
print(separate_loo)
```

*Pooled model*

```{r}
pooled_loo <- calc_loo(pooled_model)
print(pooled_loo)
```

*Hierarchical model*

```{r}
hierarchical_loo <- calc_loo(hierarchical_model)
print(hierarchical_loo)
```

**c) Compute the effective number of parameters $p_\text{eff}$ for each of the three models.**
**Hint! The estimated effective number of parameters in the model can be computed from equation (7.15) in the book, where $\text{elpd}_\text{loo-cv}$ is the PSIS-LOO value (sum of the LOO log densities) and $\text{lpd}$ is given by equation (7.5) in the book.**

```{r}
extract_p_eff <- function(loo_res) {
  # `p_loo` is the second row of the estimates matrix; index it by name.
  loo_res$estimates["p_loo", ]
}

bind_rows(
  extract_p_eff(separate_loo),
  extract_p_eff(pooled_loo),
  extract_p_eff(hierarchical_loo)
) %>%
  add_column(model = c("separate", "pooled", "hierarchical"))
```
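
The hint's relation, $p_\text{eff} = \text{lpd} - \text{elpd}_\text{loo-cv}$, can also be computed by hand. The sketch below is one way to do that, assuming `log_lik` is an $S$-by-$N$ matrix of pointwise log-likelihoods (for example from `loo::extract_log_lik()` with its default merged chains) and that `elpd_loo` is the PSIS-LOO estimate. The helper names and the toy matrix are hypothetical.

```r
# Numerically stable log(mean(exp(x))).
log_mean_exp <- function(x) {
  m <- max(x)
  m + log(mean(exp(x - m)))
}

# lpd: for each observation, take the log of the likelihood averaged over the
# posterior draws, then sum over observations.
compute_lpd <- function(log_lik) {
  sum(apply(log_lik, 2, log_mean_exp))
}

# Effective number of parameters: p_eff = lpd - elpd_loo.
compute_p_eff <- function(log_lik, elpd_loo) {
  compute_lpd(log_lik) - elpd_loo
}

# Toy example with S = 2 draws and N = 2 observations (made-up likelihoods).
toy_log_lik <- matrix(log(c(0.2, 0.4, 0.1, 0.3)), nrow = 2)
compute_lpd(toy_log_lik)
```

For a fitted model, the result of `compute_p_eff()` should correspond to the `p_loo` estimate reported by the `loo` package.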

**d) Assess how reliable the PSIS-LOO estimates are for the three models based on the $\hat{k}$-values.**

```{r}
plot_khat <- function(loo_res, factory_data) {
  khat <- as.data.frame(loo_res$pointwise)$influence_pareto_k
  factory_data %>%
    mutate(idx = row_number()) %>%
    pivot_longer(-idx, names_to = "factory", values_to = "measure") %>%
    arrange(factory) %>%
    mutate(
      khat = !!khat,
      factory = str_replace(factory, "V", "factory ")
    ) %>%
    ggplot(aes(x = factor(idx), y = khat)) +
    facet_wrap(vars(factory), nrow = 1, scales = "free_x") +
    geom_hline(yintercept = c(0, 0.5, 0.7, 1.0), linetype = 2, color = "grey50") +
    geom_point(shape = 3, color = "#6497B1") +
    theme(axis.ticks = element_blank(), panel.grid.major.y = element_line()) +
    labs(x = "measurement", y = "Pareto shape k")
}
```

```{r}
plot_khat(separate_loo, factory) + labs(title = "Separate model")
```

```{r}
plot_khat(pooled_loo, factory) + labs(title = "Pooled model")
```

```{r}
plot_khat(hierarchical_loo, factory) + labs(title = "Hierarchical model")
```

**e) An assessment of whether there are differences between the models with regard to the $\text{elpd}_\text{loo-cv}$, and if so, which model should be selected according to PSIS-LOO.**

```{r}
extract_elpd <- function(loo_res) {
  # `elpd_loo` is the first row of the estimates matrix; index it by name.
  loo_res$estimates["elpd_loo", ]
}
```

```{r}
bind_rows(
  extract_elpd(separate_loo),
  extract_elpd(pooled_loo),
  extract_elpd(hierarchical_loo)
) %>%
  add_column(model = c("separate", "pooled", "hierarchical"))
```

From the ELPD and $\hat{k}$ values, the hierarchical model is superior to the separate and pooled models.
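
One way to judge whether a difference in ELPD is meaningful is to compare it with its standard error, computed from the pointwise contributions of the two models on the same observations. The sketch below shows that calculation in base R; the function name and the pointwise values are hypothetical and are not results from the models above. For actual `loo` objects, `loo::loo_compare(separate_loo, pooled_loo, hierarchical_loo)` reports the same kind of summary.

```r
# Difference in elpd between two models and the standard error of that
# difference, from pointwise elpd values for the same observations.
elpd_difference <- function(pointwise_a, pointwise_b) {
  stopifnot(length(pointwise_a) == length(pointwise_b))
  d <- pointwise_a - pointwise_b
  c(elpd_diff = sum(d), se_diff = sqrt(length(d) * var(d)))
}

# Hypothetical pointwise elpd values for two models (illustration only).
model_a <- c(-4.1, -3.9, -4.5, -4.0)
model_b <- c(-4.6, -4.2, -4.4, -4.9)
res <- elpd_difference(model_a, model_b)
res
```

A difference that is small relative to `se_diff` gives little basis for preferring one model over the other.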

**f) Both the Stan and R code should be included in your report.**

All of the R code is included in this file.
All of the models are described in [Assignment 7](#assignment-7).
Below is a list of the Stan code for all of the models (available in the [`models/`](`r MODELS_URL`) directory):

1. [separate models](`r paste0(MODELS_URL, "assignment07_factories_separate.stan")`)
2. [pooled model](`r paste0(MODELS_URL, "assignment07_factories_pooled.stan")`)
3. [hierarchical model](`r paste0(MODELS_URL, "assignment07_factories_hierarchical.stan")`)

---

```{r}
sessionInfo()
```