use of data argument in broom::augment() is unnecessary and potentially misleading #292

Open
rpruim opened this issue Nov 9, 2024 · 4 comments

Comments

@rpruim

rpruim commented Nov 9, 2024

Since glm models store the data used to fit them, the data argument to augment() is not needed when computing propensity scores. The more interesting argument to broom::augment() is newdata, which lets you compute propensity scores for a data set other than the one used to fit the model (for example, the matched pairs after matching, or any other data set you like).

From the help for augment():

data
A base::data.frame or tibble::tibble() containing the original data that was used to produce the object x. Defaults to stats::model.frame(x) so that augment(my_fit) returns the augmented original data. Do not pass new data to the data argument. Augment will report information such as influence and cooks distance for data passed to the data argument. These measures are only defined for the original training data.

newdata
A base::data.frame() or tibble::tibble() containing all the original predictors used to create x. Defaults to NULL, indicating that nothing has been passed to newdata. If newdata is specified, the data argument will be ignored.
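To make the two arguments concrete, here is a minimal sketch on the built-in mtcars data. The model is illustrative only (it is not from the book): am stands in for the exposure, and the object names ps_model, ps_default, heavy_cars, and ps_new are made up for this example.

```r
library(broom)

# Illustrative propensity model; `am` plays the role of the exposure
ps_model <- glm(am ~ wt + hp, data = mtcars, family = binomial())

# No `data` argument: augment() falls back to model.frame(ps_model)
ps_default <- augment(ps_model, type.predict = "response")

# `newdata`: score rows that need not be the rows used for fitting
heavy_cars <- subset(mtcars, wt > 3)
ps_new <- augment(ps_model, newdata = heavy_cars, type.predict = "response")

# Both results carry the predicted probabilities in `.fitted`
head(ps_default$.fitted)
head(ps_new$.fitted)
```

With type.predict = "response", .fitted is on the probability scale in both calls.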

@malcolmbarrett
Collaborator

Where did you find examples of data?

@rpruim
Author

rpruim commented Nov 11, 2024

Every use of augment() in chapter 8, for example. This includes the template for adding propensity scores to data:

glm(
  exposure ~ confounder_1 + confounder_2,
  data = df,
  family = binomial()
) |>
  augment(type.predict = "response", data = df)
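As a rough check of what dropping data = df from this template would change, the sketch below fits the template on simulated data and compares the two calls. Everything here is an assumption for illustration: df is simulated, and outcome is a hypothetical extra column that is not in the model formula. Because the default is model.frame(x), which holds only the variables in the formula, the comparison also shows which columns each version carries along.

```r
library(broom)

set.seed(1)
n <- 200

# Simulated stand-in for `df`; `outcome` is a hypothetical extra column
df <- data.frame(confounder_1 = rnorm(n), confounder_2 = rnorm(n))
df$exposure <- rbinom(n, 1, plogis(0.5 * df$confounder_1 - 0.3 * df$confounder_2))
df$outcome <- rnorm(n)

ps_model <- glm(
  exposure ~ confounder_1 + confounder_2,
  data = df,
  family = binomial()
)

with_data <- augment(ps_model, type.predict = "response", data = df)
without_data <- augment(ps_model, type.predict = "response")

# Are the propensity scores the same either way?
all.equal(as.numeric(with_data$.fitted), as.numeric(without_data$.fitted))

# Which columns appear only when `data = df` is passed?
setdiff(names(with_data), names(without_data))
```

If columns outside the formula are needed downstream (an outcome, for instance), passing the full data frame as newdata is another way to keep them.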

@rpruim
Author

rpruim commented Nov 11, 2024

Also here in chapter 2:

library(rsample)

fit_ipw <- function(.split, ...) {
  # get bootstrapped data frame
  .df <- as.data.frame(.split)

  # fit propensity score model
  propensity_model <- glm(
    net ~ income + health + temperature,
    data = .df,
    family = binomial()
  )

  # calculate inverse probability weights
  .df <- propensity_model |>
    augment(type.predict = "response", data = .df) |>
    mutate(wts = wt_ate(.fitted, net))

  # fit correctly bootstrapped ipw model
  lm(malaria_risk ~ net, data = .df, weights = wts) |>
    tidy()
}

@rpruim
Author

rpruim commented Nov 11, 2024

Chapter 9 mostly uses newdata, but there is one example using data:

library(broom)
library(touringplans)

seven_dwarfs <- seven_dwarfs_train_2018 |>
  filter(wait_hour == 9) |>
  mutate(park_extra_magic_morning = factor(
    park_extra_magic_morning,
    labels = c("No Magic Hours", "Extra Magic Hours")
  ))

seven_dwarfs_with_ps <- glm(
  park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,
  data = seven_dwarfs,
  family = binomial()
) |>
  augment(type.predict = "response", data = seven_dwarfs)
