Prediction versus causal modeling post #3
Conversation
I have some other comments and questions but taking a break!
I think we could spruce up the plots a little without too much distraction. Here's something I tried, but it's just a suggestion:
# at the top somewhere
theme_set(theme_minimal(14))
# ...
# plot 1
df_sim |>
ggplot() +
geom_point(aes(x = x1, y = x2), alpha = .7, color = "steelblue", size = 3)
#...
# plot 2
df_sim |>
ggplot() +
geom_histogram(aes(x = y), fill = "steelblue", bins = 35, alpha = .8, color = "steelblue") +
theme(panel.grid.minor = element_blank(), panel.grid.major.x = element_blank())
In general it feels like you could add a little more explanation, particularly to the conclusion.
cc @LucyMcGowan because this comes back to my original attempt at a good simulation for this. I think a simplified example for the post is perfect, but I want to understand this better and have a more elaborate sim to show in the book. Ok, now my BIG questions. I tried to run this in a simulation, and on average this simulation doesn't seem that illustrative of the situation. That's part one: is there a better simulated data set? Part two: I tried to compare this to the expanded combo of calculating the model and RMSE by hand, as you do here, and using yardstick:
library(tidyverse)
simple_sim <- function(..., n = 100) {
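# simulate the exposure variables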
x1 <- 100*rnorm(n)
x2 <- x1/100 + rnorm(n, sd = .1)
# simulate the outcome
y <- 10*x1 + x2 + rnorm(n, sd = 100)
df_sim <- tibble(y = y, x1 = x1, x2 = x2)
preds_causal <- 10*df_sim$x1 + df_sim$x2
rmse_causal <- sqrt(mean(df_sim$y - preds_causal)^2)
preds_biased <- 10*df_sim$x1
rmse_biased <- sqrt(mean(df_sim$y - preds_biased)^2)
tibble(causal = rmse_causal, biased = rmse_biased, diff = causal - biased)
}
rmses_simple <- map_dfr(1:10000, simple_sim)
rmses_simple$diff |> mean()
#> [1] 0.0006647142
rmses_simple |>
pivot_longer(c(causal, biased), names_to = "model", values_to = "rmse") |>
ggplot(aes(rmse, fill = model)) +
geom_histogram(alpha = .5, color = NA)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
library(tidyverse)
simulate_rmse <- function(id, n = 100) {
# simulate the exposure variables
x1 <- 100*rnorm(n)
x2 <- x1/100 + rnorm(n, sd = .1)
# simulate the outcome
y <- 10*x1 + x2 + rnorm(n, sd = 100)
df_sim <- tibble(y = y, x1 = x1, x2 = x2)
preds_causal_model <- lm(y ~ x1 + x2, data = df_sim)$fitted.values
preds_biased_model <- lm(y ~ x1, data = df_sim)$fitted.values
preds_causal_by_hand <- 10*df_sim$x1 + df_sim$x2
preds_biased_by_hand <- 10*df_sim$x1
rmse_causal_model_yardstick <- yardstick::rmse_vec(df_sim$y, preds_causal_model)
rmse_biased_model_yardstick <- yardstick::rmse_vec(df_sim$y, preds_biased_model)
rmse_causal_model_by_hand <- sqrt(mean(df_sim$y - preds_causal_model)^2)
rmse_biased_model_by_hand <- sqrt(mean(df_sim$y - preds_biased_model)^2)
rmse_causal_by_hand_yardstick <- yardstick::rmse_vec(df_sim$y, preds_causal_by_hand)
rmse_biased_by_hand_yardstick <- yardstick::rmse_vec(df_sim$y, preds_biased_by_hand)
rmse_causal_by_hand_by_hand <- sqrt(mean(df_sim$y - preds_causal_by_hand)^2)
rmse_biased_by_hand_by_hand <- sqrt(mean(df_sim$y - preds_biased_by_hand)^2)
tribble(
~id, ~ rmse, ~ model, ~ calculation, ~ metric,
id, rmse_causal_model_yardstick, "causal", "lm", "yardstick",
id, rmse_biased_model_yardstick, "biased", "lm", "yardstick",
id, rmse_causal_model_by_hand, "causal", "lm", "rmse by hand",
id, rmse_biased_model_by_hand, "biased", "lm", "rmse by hand",
id, rmse_causal_by_hand_yardstick, "causal", "model by hand", "yardstick",
id, rmse_biased_by_hand_yardstick, "biased", "model by hand", "yardstick",
id, rmse_causal_by_hand_by_hand, "causal", "model by hand", "rmse by hand",
id, rmse_biased_by_hand_by_hand, "biased", "model by hand", "rmse by hand",
)
}
rmses <- map_dfr(1:10000, simulate_rmse)
rmses |>
pivot_wider(names_from = model, values_from = rmse) |>
group_by(calculation, metric) |>
summarise(
mean_causal = mean(causal),
mean_biased = mean(biased),
diff = mean_causal - mean_biased,
pct_dff = ((mean_causal - mean_biased)/mean_causal) * 100,
.groups = "drop"
) |>
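# note: the accuracy argument below, 1.0001, rounds each value to the nearest multiple of 1.0001
# (0.0001 may have been intended), hence the odd-looking decimals in the printed table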
mutate(across(where(is.numeric), ~ scales::number(.x, 1.0001)))
#> # A tibble: 4 × 6
#> calculation metric mean_causal mean_biased diff pct_dff
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 lm rmse by hand 0.0000 0.0000 0.0000 0.0000
#> 2 lm yardstick 98.0098 99.0099 -1.0001 -1.0001
#> 3 model by hand rmse by hand 8.0008 8.0008 0.0000 0.0000
#> 4 model by hand yardstick 100.0100 100.0100 0.0000 0.0000
rmses |>
ggplot(aes(rmse, fill = model)) +
geom_histogram(alpha = .5, color = NA) +
facet_wrap(calculation ~ metric, scales = "free")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Created on 2023-09-24 with reprex v2.0.2
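For reference, a minimal check with toy numbers (hypothetical values, not from the sim) of what yardstick::rmse_vec() computes:
# toy vectors, just to confirm the formula yardstick uses
truth <- c(1, 2, 3, 4)
estimate <- c(1.1, 1.9, 3.2, 3.7)
yardstick::rmse_vec(truth, estimate)
sqrt(mean((truth - estimate)^2)) # same value as the yardstick call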
Revelation on the scale part, because the yardstick version seems to be on the right scale: I think you have the square in the wrong place, e.g. it should square first and then take the mean. Updated sim (follow-up edit: changed the plot so it is on the same axes, since the scale is the same now):
library(tidyverse)
simple_sim <- function(..., n = 100) {
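# simulate the exposure variables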
x1 <- 100*rnorm(n)
x2 <- x1/100 + rnorm(n, sd = .1)
# simulate the outcome
y <- 10*x1 + x2 + rnorm(n, sd = 100)
df_sim <- tibble(y = y, x1 = x1, x2 = x2)
preds_causal <- 10*df_sim$x1 + df_sim$x2
rmse_causal <- sqrt(mean((df_sim$y - preds_causal)^2))
preds_biased <- 10*df_sim$x1
rmse_biased <- sqrt(mean((df_sim$y - preds_biased)^2))
tibble(causal = rmse_causal, biased = rmse_biased, diff = causal - biased)
}
rmses_simple <- map_dfr(1:10000, simple_sim)
rmses_simple$diff |> mean()
#> [1] -0.004241039
rmses_simple |>
pivot_longer(c(causal, biased), names_to = "model", values_to = "rmse") |>
ggplot(aes(rmse, fill = model)) +
geom_histogram(alpha = .5, color = NA)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
library(tidyverse)
simulate_rmse <- function(id, n = 100) {
# simulate the exposure variables
x1 <- 100*rnorm(n)
x2 <- x1/100 + rnorm(n, sd = .1)
# simulate the outcome
y <- 10*x1 + x2 + rnorm(n, sd = 100)
df_sim <- tibble(y = y, x1 = x1, x2 = x2)
preds_causal_model <- lm(y ~ x1 + x2, data = df_sim)$fitted.values
preds_biased_model <- lm(y ~ x1, data = df_sim)$fitted.values
preds_causal_by_hand <- 10*df_sim$x1 + df_sim$x2
preds_biased_by_hand <- 10*df_sim$x1
rmse_causal_model_yardstick <- yardstick::rmse_vec(df_sim$y, preds_causal_model)
rmse_biased_model_yardstick <- yardstick::rmse_vec(df_sim$y, preds_biased_model)
rmse_causal_model_by_hand <- sqrt(mean((df_sim$y - preds_causal_model)^2))
rmse_biased_model_by_hand <- sqrt(mean((df_sim$y - preds_biased_model)^2))
rmse_causal_by_hand_yardstick <- yardstick::rmse_vec(df_sim$y, preds_causal_by_hand)
rmse_biased_by_hand_yardstick <- yardstick::rmse_vec(df_sim$y, preds_biased_by_hand)
rmse_causal_by_hand_by_hand <- sqrt(mean((df_sim$y - preds_causal_by_hand)^2))
rmse_biased_by_hand_by_hand <- sqrt(mean((df_sim$y - preds_biased_by_hand)^2))
tribble(
~id, ~ rmse, ~ model, ~ calculation, ~ metric,
id, rmse_causal_model_yardstick, "causal", "lm", "yardstick",
id, rmse_biased_model_yardstick, "biased", "lm", "yardstick",
id, rmse_causal_model_by_hand, "causal", "lm", "rmse by hand",
id, rmse_biased_model_by_hand, "biased", "lm", "rmse by hand",
id, rmse_causal_by_hand_yardstick, "causal", "model by hand", "yardstick",
id, rmse_biased_by_hand_yardstick, "biased", "model by hand", "yardstick",
id, rmse_causal_by_hand_by_hand, "causal", "model by hand", "rmse by hand",
id, rmse_biased_by_hand_by_hand, "biased", "model by hand", "rmse by hand",
)
}
rmses <- map_dfr(1:10000, simulate_rmse)
rmses |>
pivot_wider(names_from = model, values_from = rmse) |>
group_by(calculation, metric) |>
summarise(
mean_causal = mean(causal),
mean_biased = mean(biased),
diff = mean_causal - mean_biased,
pct_dff = ((mean_causal - mean_biased)/mean_causal) * 100,
.groups = "drop"
) |>
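# note: same accuracy = 1.0001 rounding as above, hence the odd decimals in the table below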
mutate(across(where(is.numeric), ~ scales::number(.x, 1.0001)))
#> # A tibble: 4 × 6
#> calculation metric mean_causal mean_biased diff pct_dff
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 lm rmse by hand 98.0098 99.0099 -1.0001 -1.0001
#> 2 lm yardstick 98.0098 99.0099 -1.0001 -1.0001
#> 3 model by hand rmse by hand 100.0100 100.0100 0.0000 0.0000
#> 4 model by hand yardstick 100.0100 100.0100 0.0000 0.0000
rmses |>
ggplot(aes(rmse, fill = model)) +
geom_histogram(alpha = .5, color = NA) +
facet_grid(calculation ~ metric)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Created on 2023-09-24 with reprex v2.0.2
Totally agree; I didn't know how much theming we wanted to do without it being distracting.
So I just recoded it by hand from scratch and caught the fact that my previous version was taking the square of the mean rather than the mean of the squares (R order of operations, argh!). yardstick is correct and matches the corrected version I just coded up. Will go with yardstick for simplicity.
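A tiny sketch with made-up residuals (purely illustrative numbers) of how far apart the two orderings can be:
errors <- c(-3, 1, 4, -2) # hypothetical residuals
sqrt(mean(errors)^2) # square of the mean: 0 here, because the errors cancel out
sqrt(mean(errors^2)) # mean of the squares: the actual RMSE, about 2.74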
re: plots, go with whatever you like and we'll use it as a starting point to discuss #4 more broadly.
lololol it would have helped if I had read your comment from yesterday before I recoded the whole thing, only to have the same revelation! 🫠
I agree with the need to get some intuition from @LucyMcGowan on which situations make the difference in RMSE most extreme. I've recoded the qmd to simplify things a bit (i.e., so that we can toy with the parameters), and it's often the case that the biased model does better, but the difference is so small that a natural reaction would be "so what?" Equation 6 in the appendix here holds the key, and I'm beginning to think staring at that for a few minutes would give us a better roadmap than the 4 bullet points that follow it.
Did you two make any progress here? |
Yes and no. We worked through it for another hour or so, and began to believe the whole theory is broken. Intuition (and proof) shows that in-sample mean squared error decreases as you add more predictors (hence the dangers of overfitting), so why would the more parsimonious model predict better? It doesn't make sense. We're a little stuck, letting it simmer for a few.
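For what it's worth, a minimal sketch (made-up data, not our sim) of that in-sample intuition: adding a predictor, even pure noise, cannot increase the residual sum of squares of an lm() fit.
set.seed(123)
n <- 100
x1 <- rnorm(n)
noise <- rnorm(n) # unrelated to y by construction
y <- 2 * x1 + rnorm(n)
sum(resid(lm(y ~ x1))^2) # RSS with one predictor
sum(resid(lm(y ~ x1 + noise))^2) # never larger than the line above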