Prediction versus causal modeling post #3

Open · wants to merge 6 commits into base: main

Conversation

@tgerke commented Sep 23, 2023

No description provided.

@malcolmbarrett (Contributor)

I have some other comments and questions, but I'm taking a break!

@malcolmbarrett (Contributor) commented Sep 24, 2023

I think we could spruce up the plots a little without too much distraction. Here's something I tried, but it's just a suggestion:

# at the top somewhere
theme_set(theme_minimal(14))

# ...
# plot 1
df_sim |> 
  ggplot() + 
  geom_point(aes(x = x1, y = x2), alpha = .7, color = "steelblue", size = 3)

#...
# plot 2 
df_sim |> 
  ggplot() + 
  geom_histogram(aes(x = y), fill = "steelblue", bins = 35, alpha = .8, color = "steelblue") + 
  theme(panel.grid.minor = element_blank(), panel.grid.major.x = element_blank())

@malcolmbarrett (Contributor)

In general, it feels like you could add a little more explanation, particularly to the conclusion.

@malcolmbarrett (Contributor) commented Sep 24, 2023

cc @LucyMcGowan, because this comes back to my original attempt at a good simulation for this. I think a simplified example is perfect for the post, but I want to understand this better and have a more elaborate sim to show in the book.

OK, now my BIG questions. Part one: I tried running this as a repeated simulation, and on average it doesn't seem very illustrative of this situation. Is there a better simulated data set?

Part two: I tried comparing the expanded combinations of calculating the model and the RMSE by hand (as you do here) versus using lm() and yardstick::rmse_vec() (the vector version in this case). I can't make heads or tails of the results, but maybe I did something wrong. The lm() model with by-hand RMSE doesn't make much sense to me (the RMSE is basically 0 for both models??), and the by-hand RMSE and the yardstick RMSE seem to be on different scales. Is there something wrong with the by-hand calculation, or is yardstick just computing it differently? This version of the simulation also finds something similar: even in the extreme, this isn't a good example of the situation. The difference is basically 0 for every combination except lm with yardstick, where the causal model is actually better. But that could be due to overfitting, which is why my first attempt included a cross-validated assessment.

library(tidyverse)

simple_sim <- function(..., n = 100) {
  x1 <- 100*rnorm(n)
  x2 <- x1/100 + rnorm(n, sd = .1)
  
  # simulate the outcome
  y <- 10*x1 + x2 + rnorm(n, sd = 100)
  
  df_sim <- tibble(y = y, x1 = x1, x2 = x2)
  
  preds_causal <- 10*df_sim$x1 + df_sim$x2
  rmse_causal <- sqrt(mean(df_sim$y - preds_causal)^2)
  
  preds_biased <- 10*df_sim$x1
  rmse_biased <- sqrt(mean(df_sim$y - preds_biased)^2)
  
  tibble(causal = rmse_causal, biased = rmse_biased, diff = causal - biased)
}

rmses_simple <- map_dfr(1:10000, simple_sim)

rmses_simple$diff |> mean()
#> [1] 0.0006647142

rmses_simple |>
  pivot_longer(c(causal, biased), names_to = "model", values_to = "rmse") |> 
  ggplot(aes(rmse, fill = model)) + 
  geom_histogram(alpha = .5, color = NA)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(tidyverse)

simulate_rmse <- function(id, n = 100) {
  # simulate the exposure variables
  x1 <- 100*rnorm(n)
  x2 <- x1/100 + rnorm(n, sd = .1)
  
  # simulate the outcome
  y <- 10*x1 + x2 + rnorm(n, sd = 100)
  
  df_sim <- tibble(y = y, x1 = x1, x2 = x2)
  preds_causal_model <- lm(y ~ x1 + x2, data = df_sim)$fitted.values
  preds_biased_model <- lm(y ~ x1, data = df_sim)$fitted.values
  
  preds_causal_by_hand <- 10*df_sim$x1 + df_sim$x2
  preds_biased_by_hand <- 10*df_sim$x1 
  
  rmse_causal_model_yardstick <- yardstick::rmse_vec(df_sim$y, preds_causal_model)
  rmse_biased_model_yardstick <- yardstick::rmse_vec(df_sim$y, preds_biased_model)
  
  rmse_causal_model_by_hand <- sqrt(mean(df_sim$y - preds_causal_model)^2)
  rmse_biased_model_by_hand <- sqrt(mean(df_sim$y - preds_biased_model)^2)
  
  
  rmse_causal_by_hand_yardstick <- yardstick::rmse_vec(df_sim$y, preds_causal_by_hand)
  rmse_biased_by_hand_yardstick <- yardstick::rmse_vec(df_sim$y, preds_biased_by_hand)
  
  rmse_causal_by_hand_by_hand <- sqrt(mean(df_sim$y - preds_causal_by_hand)^2)
  rmse_biased_by_hand_by_hand <- sqrt(mean(df_sim$y - preds_biased_by_hand)^2)
  
  tribble(
    ~id, ~ rmse, ~ model, ~ calculation, ~ metric,
    id, rmse_causal_model_yardstick, "causal", "lm", "yardstick",
    id, rmse_biased_model_yardstick, "biased", "lm", "yardstick",
    id, rmse_causal_model_by_hand, "causal", "lm", "rmse by hand",
    id, rmse_biased_model_by_hand, "biased", "lm", "rmse by hand",
    id, rmse_causal_by_hand_yardstick, "causal", "model by hand", "yardstick",
    id, rmse_biased_by_hand_yardstick, "biased", "model by hand", "yardstick",
    id, rmse_causal_by_hand_by_hand, "causal", "model by hand", "rmse by hand",
    id, rmse_biased_by_hand_by_hand, "biased", "model by hand", "rmse by hand",
  )
}

rmses <- map_dfr(1:10000, simulate_rmse)
rmses |> 
  pivot_wider(names_from = model, values_from = rmse) |> 
  group_by(calculation, metric) |> 
  summarise(
    mean_causal = mean(causal),
    mean_biased = mean(biased),
    diff = mean_causal - mean_biased,
    pct_dff = ((mean_causal - mean_biased)/mean_causal) * 100, 
    .groups = "drop"
  ) |> 
  mutate(across(where(is.numeric), ~ scales::number(.x, 1.0001)))
#> # A tibble: 4 × 6
#>   calculation   metric       mean_causal mean_biased diff    pct_dff
#>   <chr>         <chr>        <chr>       <chr>       <chr>   <chr>  
#> 1 lm            rmse by hand 0.0000      0.0000      0.0000  0.0000 
#> 2 lm            yardstick    98.0098     99.0099     -1.0001 -1.0001
#> 3 model by hand rmse by hand 8.0008      8.0008      0.0000  0.0000 
#> 4 model by hand yardstick    100.0100    100.0100    0.0000  0.0000

rmses |>
  ggplot(aes(rmse, fill = model)) + 
  geom_histogram(alpha = .5, color = NA) + 
  facet_wrap(calculation ~ metric, scales = "free")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2023-09-24 with reprex v2.0.2

@malcolmbarrett (Contributor) commented Sep 24, 2023

A revelation on the scale part, because the yardstick version seems to be on the right scale: I think you have the square in the wrong place. It should be square, then mean, then square root, i.e. sqrt(mean((actual - predicted)^2)).
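
To make the difference concrete, here's a minimal sketch with made-up residuals (not from the post's data):

resid <- c(5, -3, 2, -4) # toy residuals (actual - predicted)

# square of the mean (what the original by-hand version computes)
sqrt(mean(resid)^2)
#> [1] 0

# mean of the squares (correct RMSE; matches yardstick::rmse_vec())
sqrt(mean(resid^2))
#> [1] 3.674235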

Updated sim (follow-up edit: changed the plot so it was on the same axes since the scale is the same now):

library(tidyverse)

simple_sim <- function(..., n = 100) {
  x1 <- 100*rnorm(n)
  x2 <- x1/100 + rnorm(n, sd = .1)
  
  # simulate the outcome
  y <- 10*x1 + x2 + rnorm(n, sd = 100)
  
  df_sim <- tibble(y = y, x1 = x1, x2 = x2)
  
  preds_causal <- 10*df_sim$x1 + df_sim$x2
  rmse_causal <- sqrt(mean((df_sim$y - preds_causal)^2))
  
  preds_biased <- 10*df_sim$x1
  rmse_biased <- sqrt(mean((df_sim$y - preds_biased)^2))
  
  tibble(causal = rmse_causal, biased = rmse_biased, diff = causal - biased)
}

rmses_simple <- map_dfr(1:10000, simple_sim)

rmses_simple$diff |> mean()
#> [1] -0.004241039

rmses_simple |>
  pivot_longer(c(causal, biased), names_to = "model", values_to = "rmse") |> 
  ggplot(aes(rmse, fill = model)) + 
  geom_histogram(alpha = .5, color = NA)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(tidyverse)

simulate_rmse <- function(id, n = 100) {
  # simulate the exposure variables
  x1 <- 100*rnorm(n)
  x2 <- x1/100 + rnorm(n, sd = .1)
  
  # simulate the outcome
  y <- 10*x1 + x2 + rnorm(n, sd = 100)
  
  df_sim <- tibble(y = y, x1 = x1, x2 = x2)
  preds_causal_model <- lm(y ~ x1 + x2, data = df_sim)$fitted.values
  preds_biased_model <- lm(y ~ x1, data = df_sim)$fitted.values
  
  preds_causal_by_hand <- 10*df_sim$x1 + df_sim$x2
  preds_biased_by_hand <- 10*df_sim$x1 
  
  rmse_causal_model_yardstick <- yardstick::rmse_vec(df_sim$y, preds_causal_model)
  rmse_biased_model_yardstick <- yardstick::rmse_vec(df_sim$y, preds_biased_model)
  
  rmse_causal_model_by_hand <- sqrt(mean((df_sim$y - preds_causal_model)^2))
  rmse_biased_model_by_hand <- sqrt(mean((df_sim$y - preds_biased_model)^2))
  
  
  rmse_causal_by_hand_yardstick <- yardstick::rmse_vec(df_sim$y, preds_causal_by_hand)
  rmse_biased_by_hand_yardstick <- yardstick::rmse_vec(df_sim$y, preds_biased_by_hand)
  
  rmse_causal_by_hand_by_hand <- sqrt(mean((df_sim$y - preds_causal_by_hand)^2))
  rmse_biased_by_hand_by_hand <- sqrt(mean((df_sim$y - preds_biased_by_hand)^2))
  
  tribble(
    ~id, ~ rmse, ~ model, ~ calculation, ~ metric,
    id, rmse_causal_model_yardstick, "causal", "lm", "yardstick",
    id, rmse_biased_model_yardstick, "biased", "lm", "yardstick",
    id, rmse_causal_model_by_hand, "causal", "lm", "rmse by hand",
    id, rmse_biased_model_by_hand, "biased", "lm", "rmse by hand",
    id, rmse_causal_by_hand_yardstick, "causal", "model by hand", "yardstick",
    id, rmse_biased_by_hand_yardstick, "biased", "model by hand", "yardstick",
    id, rmse_causal_by_hand_by_hand, "causal", "model by hand", "rmse by hand",
    id, rmse_biased_by_hand_by_hand, "biased", "model by hand", "rmse by hand",
  )
}

rmses <- map_dfr(1:10000, simulate_rmse)
rmses |> 
  pivot_wider(names_from = model, values_from = rmse) |> 
  group_by(calculation, metric) |> 
  summarise(
    mean_causal = mean(causal),
    mean_biased = mean(biased),
    diff = mean_causal - mean_biased,
    pct_dff = ((mean_causal - mean_biased)/mean_causal) * 100, 
    .groups = "drop"
  ) |> 
  mutate(across(where(is.numeric), ~ scales::number(.x, 1.0001)))
#> # A tibble: 4 × 6
#>   calculation   metric       mean_causal mean_biased diff    pct_dff
#>   <chr>         <chr>        <chr>       <chr>       <chr>   <chr>  
#> 1 lm            rmse by hand 98.0098     99.0099     -1.0001 -1.0001
#> 2 lm            yardstick    98.0098     99.0099     -1.0001 -1.0001
#> 3 model by hand rmse by hand 100.0100    100.0100    0.0000  0.0000 
#> 4 model by hand yardstick    100.0100    100.0100    0.0000  0.0000

rmses |>
  ggplot(aes(rmse, fill = model)) + 
  geom_histogram(alpha = .5, color = NA) + 
  facet_grid(calculation ~ metric)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2023-09-24 with reprex v2.0.2

@tgerke (Author) commented Sep 25, 2023

> I think we could spruce up the plots a little without too much distraction.

Totally agree; I didn't know how much theming we wanted to do without it becoming a distraction. hrbrthemes provides a one-liner that gets us plots like what you suggest (I've updated the code accordingly).
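
For reference, the kind of one-liner I mean is something like this (a sketch assuming hrbrthemes' theme_ipsum(); the exact theme in the updated qmd may differ):

# assumption: theme_ipsum() stands in for whatever hrbrthemes theme the qmd actually uses;
# assumes ggplot2/tidyverse is already loaded for theme_set()
library(hrbrthemes)
theme_set(theme_ipsum(base_size = 14))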

@tgerke (Author) commented Sep 25, 2023

> Part two: I tried comparing the expanded combinations of calculating the model and the RMSE by hand (as you do here) versus using lm() and yardstick::rmse_vec() (the vector version in this case).

So I just recoded it by hand from scratch and caught that my previous version was taking the square of the mean rather than the mean of the squares (R order of operations, argh!). yardstick is correct and matches the corrected version I just coded up. I'll go with yardstick for simplicity.

@malcolmbarrett (Contributor)

re: plots, go with whatever you like and we'll use it as a starting point to discuss #4 more broadly.

@tgerke (Author) commented Sep 25, 2023

> A revelation on the scale part, because the yardstick version seems to be on the right scale: I think you have the square in the wrong place. It should be square, then mean, then square root, i.e. sqrt(mean((actual - predicted)^2)).

lololol, it would have helped if I had read your comment from yesterday before recoding the whole thing only to arrive at the same revelation! 🫠

@tgerke (Author) commented Sep 25, 2023

I agree we need some intuition from @LucyMcGowan on which situations make the difference in RMSE most extreme. I've recoded the qmd to simplify things a bit (so we can toy with the parameters), and it's often the case that the biased model does better, but the difference is so small that a natural reaction would be "so what?" Equation 6 in the appendix here holds the key, and I'm beginning to think that staring at it for a few minutes would give us a better roadmap than the 4 bullet points that follow it.
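
For context, the parameterization I have in mind is roughly this (a sketch with hypothetical argument names, not the exact code in the qmd):

library(tidyverse)

# sketch: simple_sim() with the coefficients and noise SDs exposed as arguments,
# so we can toy with the strength of the x1 -> x2 path and the outcome noise
simple_sim_params <- function(..., n = 100, b1 = 10, b2 = 1, sd_x2 = 0.1, sd_y = 100) {
  x1 <- 100 * rnorm(n)
  x2 <- x1 / 100 + rnorm(n, sd = sd_x2)
  y <- b1 * x1 + b2 * x2 + rnorm(n, sd = sd_y)

  preds_causal <- b1 * x1 + b2 * x2
  preds_biased <- b1 * x1

  tibble(
    causal = sqrt(mean((y - preds_causal)^2)),
    biased = sqrt(mean((y - preds_biased)^2)),
    diff = causal - biased
  )
}

# e.g., shrink the outcome noise to see whether the gap between the models opens up
map_dfr(1:1000, simple_sim_params, sd_y = 10) |> summarise(mean_diff = mean(diff))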

@malcolmbarrett (Contributor)

Did you two make any progress here?

@tgerke (Author) commented Sep 27, 2023

> Did you two make any progress here?

Yes and no. We worked through it for another hour or so and began to believe the whole theory is broken. Intuition (and proof) says that in-sample mean squared error decreases as you add more predictors (hence the dangers of overfitting), so why would the more parsimonious model predict better? It doesn't make sense. We're a little stuck and letting it simmer for a few.
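
A quick sketch of that in-sample intuition, reusing the same data-generating setup as the sims above (illustrative only, not the post's code): for nested least-squares models fit and evaluated on the same data, the larger model's RMSE can never be higher.

library(tidyverse)

set.seed(1)
n <- 100
x1 <- 100 * rnorm(n)
x2 <- x1 / 100 + rnorm(n, sd = 0.1)
y <- 10 * x1 + x2 + rnorm(n, sd = 100)
df_sim <- tibble(y, x1, x2)

rmse_small <- yardstick::rmse_vec(df_sim$y, fitted(lm(y ~ x1, data = df_sim)))
rmse_large <- yardstick::rmse_vec(df_sim$y, fitted(lm(y ~ x1 + x2, data = df_sim)))

# in-sample, the model with more predictors always fits at least as well
rmse_large <= rmse_small
#> [1] TRUE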
