
Feedback on chapter 13 #296

Open · 7 tasks
rpruim opened this issue Nov 30, 2024 · 3 comments

rpruim commented Nov 30, 2024

  • park_extra_magic_morning = c(rep(1, 5000), rep(0, 5000)) -> rep(1:0, each = 5000) is shorter and clearer. rep(0:1, length.out = 10000) is perhaps safer (but the order of the 0s and 1s will be different).
  • I would re-order the workflow in 13.3: 1) refine the question; 2) wrangle the data (since it comes first in the workflow, you can use the wrangled data to test subsequent steps along the way, not just at the very end); 3) simulate the population for the left-most variables; 4) simulate the process (perhaps with a better/less vague name); 5) compute the statistics.
  • Should pivot_longer(names_to = "term", values_to = "estimate", cols = everything()) go inside compute_stats()?
  • In fit_models() from 13.3, fit_wait_minutes_posted is never needed or used since we set the values of that variable based on our contrast.
  • It would perhaps be good to add a bootstrap confidence interval at the end of 13.3 (see the sketch after the code below).
  • The modular functions that return lists (of models, of data plus a contrast, etc.) seem a little heavy and unnatural. It might be nicer if the functions returned more natural objects, like a simple data frame: sim_population() could produce a data frame from simulation parameters, and compute_stats() could produce a tidy model data frame from a data set. In any case, I would remove the pluck() from compute_stats().
  • In compute_stats(), exposure_val and control_val are never used, and the values 30 and 60 are hard-coded into the names of the returned object. I'd suggest something like the code below. (Alternatively, one could use lm() |> tidy().)
# sim_obj is a list created by our simulate_process() function
compute_stats <- function(sim_obj) {
  sim_obj |>
    pluck("df_outcome") |> # pluck() can be avoided if the input is a data frame
    group_by(wait_minutes_posted_avg) |>
    summarize(avg_wait_actual = mean(wait_minutes_actual_avg)) |>
    pivot_wider(
      names_from = wait_minutes_posted_avg,
      values_from = avg_wait_actual,
      names_prefix = "X_"
    ) |>
    mutate(effect = diff(c_across(1:2)))
}
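
To make the bootstrap suggestion above concrete, here is a minimal sketch of a percentile bootstrap interval, assuming the wrangled wait_times data from the chapter. estimate_effect() below is a hypothetical stand-in; in the chapter it would rerun the full fit/simulate/compute pipeline on each resample.

library(dplyr)
library(purrr)

# Hypothetical stand-in estimator for illustration only; replace with a
# function that reruns the chapter's pipeline and returns the effect estimate.
estimate_effect <- function(data) {
  fit <- lm(wait_minutes_actual_avg ~ wait_minutes_posted_avg, data = data)
  coef(fit)[["wait_minutes_posted_avg"]]
}

set.seed(1234)
boot_effects <- map_dbl(1:1000, \(i) {
  wait_times |>
    slice_sample(prop = 1, replace = TRUE) |> # resample rows with replacement
    estimate_effect()
})

quantile(boot_effects, c(0.025, 0.975)) # 95% percentile interval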
rpruim commented Nov 30, 2024

Here is an alternative way to do the g-computation in 13.3:

simulate_population2 <- function(orig_data, contrast = c(30, 60), size = 10000) {
  orig_data |>
    select(park_ticket_season, park_close, park_temperature_high) |>
    # resample the confounders from the original data
    slice_sample(n = size, replace = TRUE) |>
    # set the exposure to the contrast values
    mutate(wait_minutes_posted_avg = rep(contrast, length.out = size)) |>
    # simulate park_extra_magic_morning from its model fit on the original data
    augment(
      newdata = _,
      glm(park_extra_magic_morning ~
            park_ticket_season + park_close + park_temperature_high,
          data = orig_data, family = "binomial"),
      type.predict = "response"
    ) |>
    mutate(park_extra_magic_morning = rbinom(size, 1, .fitted)) |>
    # simulate the outcome from the outcome model fit on the original data
    augment(
      newdata = _,
      lm(wait_minutes_actual_avg ~
           splines::ns(wait_minutes_posted_avg, df = 3) + park_extra_magic_morning +
           park_ticket_season + park_close + park_temperature_high,
         data = orig_data)
    ) |>
    rename(wait_minutes_actual_avg = .fitted)
}

compute_stats2 <- function(population) {
  population |>
    lm(wait_minutes_actual_avg ~ factor(wait_minutes_posted_avg), data = _) |>
    tidy()
}

set.seed(8675309)
wait_times |>
  simulate_population2() |>
  compute_stats2() 
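
Note that this version relies on broom (augment() and tidy()) and on the base pipe placeholder _ as a named argument, which requires R 4.2 or later.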

rpruim commented Dec 7, 2024

Wrong milestone?

malcolmbarrett (Collaborator) commented
No, I'm going to rework this chapter to show a simpler approach you can use when it's a pre-post analysis (cloning and standardizing). We're going to use the currently described approach in the section on time-varying exposures, where you need this way of simulating the data.
