use of data argument in broom::augment() is unnecessary and potentially misleading #292

rpruim · 2024-11-09T19:19:11Z

Since glm models store the data used to fit them, the data argument to augment() is not needed when computing propensity scores. The more interesting argument to broom::augment() is newdata, which lets you compute propensity scores for a data set other than the one used to fit the model (for example, the matched pairs after matching, or any other data set you like).

From the help for augment():

data
A base::data.frame or tibble::tibble() containing the original data that was used to produce the object x. Defaults to stats::model.frame(x) so that augment(my_fit) returns the augmented original data. Do not pass new data to the data argument. Augment will report information such as influence and cooks distance for data passed to the data argument. These measures are only defined for the original training data.

newdata
A base::data.frame() or tibble::tibble() containing all the original predictors used to create x. Defaults to NULL, indicating that nothing has been passed to newdata. If newdata is specified, the data argument will be ignored.
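
A minimal sketch of the three calling patterns described in the help text, using the built-in mtcars data rather than any data set from the book. The model formula and the object names (fit, ps_default, ps_with_data, ps_new, new_cars) are illustrative assumptions, not code from the chapters.

```r
library(broom)

# Illustrative propensity-style model on built-in data (assumption: not from the book)
fit <- glm(am ~ hp + wt, data = mtcars, family = binomial())

# 1. No data argument: augment() falls back to model.frame(fit)
ps_default <- augment(fit, type.predict = "response")

# 2. Explicit data argument: the original data frame is augmented
ps_with_data <- augment(fit, type.predict = "response", data = mtcars)

# The fitted probabilities agree between the two calls
same_fitted <- isTRUE(all.equal(
  ps_default$.fitted,
  ps_with_data$.fitted,
  check.attributes = FALSE
))

# Columns that were not in the model formula appear only when data is passed
extra_cols <- setdiff(names(ps_with_data), names(ps_default))

# 3. newdata: score a different set of rows with the same fitted model
new_cars <- mtcars[mtcars$cyl == 4, ]
ps_new <- augment(fit, newdata = new_cars, type.predict = "response")
```

One thing the sketch makes visible: the default model frame holds only the variables named in the formula, so a call without data returns the same fitted values but not the other columns of the original data frame. Whether that matters depends on what later steps need.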

malcolmbarrett · 2024-11-10T19:26:20Z

Where did you find examples that use the data argument?

rpruim · 2024-11-11T02:07:06Z

Every use of augment() in chapter 8, for example. This includes the template for adding propensity scores to data:

glm(
  exposure ~ confounder_1 + confounder_2,
  data = df,
  family = binomial()
) |>
  augment(type.predict = "response", data = df)

rpruim · 2024-11-11T02:11:25Z

Also here in chapter 2:

library(rsample)
library(broom)      # augment(), tidy()
library(dplyr)      # mutate()
library(propensity) # wt_ate()

fit_ipw <- function(.split, ...) {
  # get bootstrapped data frame
  .df <- as.data.frame(.split)

  # fit propensity score model
  propensity_model <- glm(
    net ~ income + health + temperature,
    data = .df,
    family = binomial()
  )

  # calculate inverse probability weights
  .df <- propensity_model |>
    augment(type.predict = "response", data = .df) |>
    mutate(wts = wt_ate(.fitted, net))

  # fit correctly bootstrapped ipw model
  lm(malaria_risk ~ net, data = .df, weights = wts) |>
    tidy()
}

rpruim · 2024-11-11T02:13:38Z

Chapter 9 mostly uses newdata, but there is one example using data:

library(broom)
library(dplyr) # filter(), mutate()
library(touringplans)

seven_dwarfs <- seven_dwarfs_train_2018 |>
  filter(wait_hour == 9) |>
  mutate(park_extra_magic_morning = factor(
    park_extra_magic_morning,
    labels = c("No Magic Hours", "Extra Magic Hours")
  ))

seven_dwarfs_with_ps <- glm(
  park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,
  data = seven_dwarfs,
  family = binomial()
) |>
  augment(type.predict = "response", data = seven_dwarfs)

malcolmbarrett added this to the Chapter 08: Propensity scores milestone Nov 11, 2024
