Prediction versus causal modeling post #3

Open · wants to merge 6 commits into base: main

Conversation

@tgerke commented Sep 23, 2023

No description provided.

@malcolmbarrett (Contributor)

I have some other comments and questions, but I'm taking a break!

@malcolmbarrett (Contributor) commented Sep 24, 2023

I think we could spruce up the plots a little without too much distraction. Here's something I tried, but it's just a suggestion:

# at the top somewhere
theme_set(theme_minimal(14))

# ...
# plot 1
df_sim |> 
  ggplot() + 
  geom_point(aes(x = x1, y = x2), alpha = .7, color = "steelblue", size = 3)

#...
# plot 2 
df_sim |> 
  ggplot() + 
  geom_histogram(aes(x = y), fill = "steelblue", bins = 35, alpha = .8, color = "steelblue") + 
  theme(panel.grid.minor = element_blank(), panel.grid.major.x = element_blank())

@malcolmbarrett (Contributor)

In general, it feels like you could add a little more explanation, particularly to the conclusion.

@malcolmbarrett (Contributor) commented Sep 24, 2023

cc @LucyMcGowan, because this comes back to my original attempt at a good simulation for this. I think a simplified example is perfect for the post, but I want to understand this better and have a more elaborate sim to show in the book.

OK, now my BIG questions. Part one: I tried running this as a repeated simulation, and on average it doesn't seem very illustrative of this situation. Is there a better simulated data set?

Part two: I tried comparing the expanded combinations of calculating the model and the RMSE by hand (as you do here) versus using lm() and yardstick::rmse_vec() (the vector version in this case). I can't make heads or tails of the results, but maybe I did something wrong. The lm() model with by-hand RMSE doesn't make much sense to me (the RMSE is basically 0 for both models??), and the by-hand RMSE and the yardstick RMSE seem to be on different scales. Is there something wrong with the by-hand calculation, or is yardstick just computing it differently? This version of the simulation also finds something similar: even in the extreme, this isn't a good example of the situation. The difference is basically 0 for every combination except lm with yardstick, where the causal model is actually better. But that could be due to overfitting, which is why my first attempt included a cross-validated assessment.

library(tidyverse)

simple_sim <- function(..., n = 100) {
  x1 <- 100*rnorm(n)
  x2 <- x1/100 + rnorm(n, sd = .1)
  
  # simulate the outcome
  y <- 10*x1 + x2 + rnorm(n, sd = 100)
  
  df_sim <- tibble(y = y, x1 = x1, x2 = x2)
  
  preds_causal <- 10*df_sim$x1 + df_sim$x2
  rmse_causal <- sqrt(mean(df_sim$y - preds_causal)^2)
  
  preds_biased <- 10*df_sim$x1
  rmse_biased <- sqrt(mean(df_sim$y - preds_biased)^2)
  
  tibble(causal = rmse_causal, biased = rmse_biased, diff = causal - biased)
}

rmses_simple <- map_dfr(1:10000, simple_sim)

rmses_simple$diff |> mean()
#> [1] 0.0006647142

rmses_simple |>
  pivot_longer(c(causal, biased), names_to = "model", values_to = "rmse") |> 
  ggplot(aes(rmse, fill = model)) + 
  geom_histogram(alpha = .5, color = NA)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(tidyverse)

simulate_rmse <- function(id, n = 100) {
  # simulate the exposure variables
  x1 <- 100*rnorm(n)
  x2 <- x1/100 + rnorm(n, sd = .1)
  
  # simulate the outcome
  y <- 10*x1 + x2 + rnorm(n, sd = 100)
  
  df_sim <- tibble(y = y, x1 = x1, x2 = x2)
  preds_causal_model <- lm(y ~ x1 + x2, data = df_sim)$fitted.values
  preds_biased_model <- lm(y ~ x1, data = df_sim)$fitted.values
  
  preds_causal_by_hand <- 10*df_sim$x1 + df_sim$x2
  preds_biased_by_hand <- 10*df_sim$x1 
  
  rmse_causal_model_yardstick <- yardstick::rmse_vec(df_sim$y, preds_causal_model)
  rmse_biased_model_yardstick <- yardstick::rmse_vec(df_sim$y, preds_biased_model)
  
  rmse_causal_model_by_hand <- sqrt(mean(df_sim$y - preds_causal_model)^2)
  rmse_biased_model_by_hand <- sqrt(mean(df_sim$y - preds_biased_model)^2)
  
  
  rmse_causal_by_hand_yardstick <- yardstick::rmse_vec(df_sim$y, preds_causal_by_hand)
  rmse_biased_by_hand_yardstick <- yardstick::rmse_vec(df_sim$y, preds_biased_by_hand)
  
  rmse_causal_by_hand_by_hand <- sqrt(mean(df_sim$y - preds_causal_by_hand)^2)
  rmse_biased_by_hand_by_hand <- sqrt(mean(df_sim$y - preds_biased_by_hand)^2)
  
  tribble(
    ~id, ~ rmse, ~ model, ~ calculation, ~ metric,
    id, rmse_causal_model_yardstick, "causal", "lm", "yardstick",
    id, rmse_biased_model_yardstick, "biased", "lm", "yardstick",
    id, rmse_causal_model_by_hand, "causal", "lm", "rmse by hand",
    id, rmse_biased_model_by_hand, "biased", "lm", "rmse by hand",
    id, rmse_causal_by_hand_yardstick, "causal", "model by hand", "yardstick",
    id, rmse_biased_by_hand_yardstick, "biased", "model by hand", "yardstick",
    id, rmse_causal_by_hand_by_hand, "causal", "model by hand", "rmse by hand",
    id, rmse_biased_by_hand_by_hand, "biased", "model by hand", "rmse by hand",
  )
}

rmses <- map_dfr(1:10000, simulate_rmse)
rmses |> 
  pivot_wider(names_from = model, values_from = rmse) |> 
  group_by(calculation, metric) |> 
  summarise(
    mean_causal = mean(causal),
    mean_biased = mean(biased),
    diff = mean_causal - mean_biased,
    pct_dff = ((mean_causal - mean_biased)/mean_causal) * 100, 
    .groups = "drop"
  ) |> 
  mutate(across(where(is.numeric), ~ scales::number(.x, 1.0001)))
#> # A tibble: 4 × 6
#>   calculation   metric       mean_causal mean_biased diff    pct_dff
#>   <chr>         <chr>        <chr>       <chr>       <chr>   <chr>  
#> 1 lm            rmse by hand 0.0000      0.0000      0.0000  0.0000 
#> 2 lm            yardstick    98.0098     99.0099     -1.0001 -1.0001
#> 3 model by hand rmse by hand 8.0008      8.0008      0.0000  0.0000 
#> 4 model by hand yardstick    100.0100    100.0100    0.0000  0.0000

rmses |>
  ggplot(aes(rmse, fill = model)) + 
  geom_histogram(alpha = .5, color = NA) + 
  facet_wrap(calculation ~ metric, scales = "free")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2023-09-24 with reprex v2.0.2

@malcolmbarrett (Contributor) commented Sep 24, 2023

A revelation on the scale part, because the yardstick version seems to be on the right scale: I think you have the square in the wrong place. It should be square, then mean, then square root, i.e. sqrt(mean((actual - predicted)^2)).
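
To make the difference concrete, here's a minimal sketch with made-up residuals (not from the post's data):

resid <- c(5, -3, 2, -4) # toy residuals (actual - predicted)

# square of the mean (what the original by-hand version computes)
sqrt(mean(resid)^2)
#> [1] 0

# mean of the squares (correct RMSE; matches yardstick::rmse_vec())
sqrt(mean(resid^2))
#> [1] 3.674235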

Updated sim (follow-up edit: changed the plot so it was on the same axes since the scale is the same now):

library(tidyverse)

simple_sim <- function(..., n = 100) {
  x1 <- 100*rnorm(n)
  x2 <- x1/100 + rnorm(n, sd = .1)
  
  # simulate the outcome
  y <- 10*x1 + x2 + rnorm(n, sd = 100)
  
  df_sim <- tibble(y = y, x1 = x1, x2 = x2)
  
  preds_causal <- 10*df_sim$x1 + df_sim$x2
  rmse_causal <- sqrt(mean((df_sim$y - preds_causal)^2))
  
  preds_biased <- 10*df_sim$x1
  rmse_biased <- sqrt(mean((df_sim$y - preds_biased)^2))
  
  tibble(causal = rmse_causal, biased = rmse_biased, diff = causal - biased)
}

rmses_simple <- map_dfr(1:10000, simple_sim)

rmses_simple$diff |> mean()
#> [1] -0.004241039

rmses_simple |>
  pivot_longer(c(causal, biased), names_to = "model", values_to = "rmse") |> 
  ggplot(aes(rmse, fill = model)) + 
  geom_histogram(alpha = .5, color = NA)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(tidyverse)

simulate_rmse <- function(id, n = 100) {
  # simulate the exposure variables
  x1 <- 100*rnorm(n)
  x2 <- x1/100 + rnorm(n, sd = .1)
  
  # simulate the outcome
  y <- 10*x1 + x2 + rnorm(n, sd = 100)
  
  df_sim <- tibble(y = y, x1 = x1, x2 = x2)
  preds_causal_model <- lm(y ~ x1 + x2, data = df_sim)$fitted.values
  preds_biased_model <- lm(y ~ x1, data = df_sim)$fitted.values
  
  preds_causal_by_hand <- 10*df_sim$x1 + df_sim$x2
  preds_biased_by_hand <- 10*df_sim$x1 
  
  rmse_causal_model_yardstick <- yardstick::rmse_vec(df_sim$y, preds_causal_model)
  rmse_biased_model_yardstick <- yardstick::rmse_vec(df_sim$y, preds_biased_model)
  
  rmse_causal_model_by_hand <- sqrt(mean((df_sim$y - preds_causal_model)^2))
  rmse_biased_model_by_hand <- sqrt(mean((df_sim$y - preds_biased_model)^2))
  
  
  rmse_causal_by_hand_yardstick <- yardstick::rmse_vec(df_sim$y, preds_causal_by_hand)
  rmse_biased_by_hand_yardstick <- yardstick::rmse_vec(df_sim$y, preds_biased_by_hand)
  
  rmse_causal_by_hand_by_hand <- sqrt(mean((df_sim$y - preds_causal_by_hand)^2))
  rmse_biased_by_hand_by_hand <- sqrt(mean((df_sim$y - preds_biased_by_hand)^2))
  
  tribble(
    ~id, ~ rmse, ~ model, ~ calculation, ~ metric,
    id, rmse_causal_model_yardstick, "causal", "lm", "yardstick",
    id, rmse_biased_model_yardstick, "biased", "lm", "yardstick",
    id, rmse_causal_model_by_hand, "causal", "lm", "rmse by hand",
    id, rmse_biased_model_by_hand, "biased", "lm", "rmse by hand",
    id, rmse_causal_by_hand_yardstick, "causal", "model by hand", "yardstick",
    id, rmse_biased_by_hand_yardstick, "biased", "model by hand", "yardstick",
    id, rmse_causal_by_hand_by_hand, "causal", "model by hand", "rmse by hand",
    id, rmse_biased_by_hand_by_hand, "biased", "model by hand", "rmse by hand",
  )
}

rmses <- map_dfr(1:10000, simulate_rmse)
rmses |> 
  pivot_wider(names_from = model, values_from = rmse) |> 
  group_by(calculation, metric) |> 
  summarise(
    mean_causal = mean(causal),
    mean_biased = mean(biased),
    diff = mean_causal - mean_biased,
    pct_dff = ((mean_causal - mean_biased)/mean_causal) * 100, 
    .groups = "drop"
  ) |> 
  mutate(across(where(is.numeric), ~ scales::number(.x, 1.0001)))
#> # A tibble: 4 × 6
#>   calculation   metric       mean_causal mean_biased diff    pct_dff
#>   <chr>         <chr>        <chr>       <chr>       <chr>   <chr>  
#> 1 lm            rmse by hand 98.0098     99.0099     -1.0001 -1.0001
#> 2 lm            yardstick    98.0098     99.0099     -1.0001 -1.0001
#> 3 model by hand rmse by hand 100.0100    100.0100    0.0000  0.0000 
#> 4 model by hand yardstick    100.0100    100.0100    0.0000  0.0000

rmses |>
  ggplot(aes(rmse, fill = model)) + 
  geom_histogram(alpha = .5, color = NA) + 
  facet_grid(calculation ~ metric)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2023-09-24 with reprex v2.0.2

@tgerke (Author) commented Sep 25, 2023

> I think we could spruce up the plots a little without too much distraction.

Totally agree; I didn't know how much theming we wanted to do without it becoming a distraction. hrbrthemes provides a one-liner that gets us plots like what you suggest (I've updated the code accordingly).
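
For reference, the kind of one-liner I mean is something like this (a sketch assuming hrbrthemes' theme_ipsum(); the exact theme in the updated qmd may differ):

# assumption: theme_ipsum() stands in for whatever hrbrthemes theme the qmd actually uses;
# assumes ggplot2/tidyverse is already loaded for theme_set()
library(hrbrthemes)
theme_set(theme_ipsum(base_size = 14))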

@tgerke (Author) commented Sep 25, 2023

> Part two: I tried comparing the expanded combinations of calculating the model and the RMSE by hand (as you do here) versus using lm() and yardstick::rmse_vec() (the vector version in this case).

So I just recoded it by hand from scratch and caught that my previous version was taking the square of the mean rather than the mean of the squares (R order of operations, argh!). yardstick is correct and matches the corrected version I just coded up. I'll go with yardstick for simplicity.

@malcolmbarrett (Contributor)

re: plots, go with whatever you like and we'll use it as a starting point to discuss #4 more broadly.

@tgerke (Author) commented Sep 25, 2023

> A revelation on the scale part, because the yardstick version seems to be on the right scale: I think you have the square in the wrong place. It should be square, then mean, then square root, i.e. sqrt(mean((actual - predicted)^2)).

lololol, it would have helped if I had read your comment from yesterday before recoding the whole thing only to arrive at the same revelation! 🫠

@tgerke (Author) commented Sep 25, 2023

I agree we need some intuition from @LucyMcGowan on which situations make the difference in RMSE most extreme. I've recoded the qmd to simplify things a bit (so we can toy with the parameters), and it's often the case that the biased model does better, but the difference is so small that a natural reaction would be "so what?" Equation 6 in the appendix here holds the key, and I'm beginning to think that staring at it for a few minutes would give us a better roadmap than the 4 bullet points that follow it.
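
For context, the parameterization I have in mind is roughly this (a sketch with hypothetical argument names, not the exact code in the qmd):

library(tidyverse)

# sketch: simple_sim() with the coefficients and noise SDs exposed as arguments,
# so we can toy with the strength of the x1 -> x2 path and the outcome noise
simple_sim_params <- function(..., n = 100, b1 = 10, b2 = 1, sd_x2 = 0.1, sd_y = 100) {
  x1 <- 100 * rnorm(n)
  x2 <- x1 / 100 + rnorm(n, sd = sd_x2)
  y <- b1 * x1 + b2 * x2 + rnorm(n, sd = sd_y)

  preds_causal <- b1 * x1 + b2 * x2
  preds_biased <- b1 * x1

  tibble(
    causal = sqrt(mean((y - preds_causal)^2)),
    biased = sqrt(mean((y - preds_biased)^2)),
    diff = causal - biased
  )
}

# e.g., shrink the outcome noise to see whether the gap between the models opens up
map_dfr(1:1000, simple_sim_params, sd_y = 10) |> summarise(mean_diff = mean(diff))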

@malcolmbarrett (Contributor)

Did you two make any progress here?

@tgerke (Author) commented Sep 27, 2023

> Did you two make any progress here?

Yes and no. We worked through it for another hour or so and began to believe the whole theory is broken. Intuition (and proof) says that in-sample mean squared error decreases as you add more predictors (hence the dangers of overfitting), so why would the more parsimonious model predict better? It doesn't make sense. We're a little stuck and letting it simmer for a few.
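
A quick sketch of that in-sample intuition, reusing the same data-generating setup as the sims above (illustrative only, not the post's code): for nested least-squares models fit and evaluated on the same data, the larger model's RMSE can never be higher.

library(tidyverse)

set.seed(1)
n <- 100
x1 <- 100 * rnorm(n)
x2 <- x1 / 100 + rnorm(n, sd = 0.1)
y <- 10 * x1 + x2 + rnorm(n, sd = 100)
df_sim <- tibble(y, x1, x2)

rmse_small <- yardstick::rmse_vec(df_sim$y, fitted(lm(y ~ x1, data = df_sim)))
rmse_large <- yardstick::rmse_vec(df_sim$y, fitted(lm(y ~ x1 + x2, data = df_sim)))

# in-sample, the model with more predictors always fits at least as well
rmse_large <= rmse_small
#> [1] TRUE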
