undocumented char -> factor conversion in recipe creates non-commutative condition #1377

beansrowning opened this issue Sep 27, 2024 · 1 comment


The problem

Perhaps I've missed some documentation, but I seem to have found an issue where {recipes} silently converts character features to factors. This in turn makes all_string() and all_string_predictors() behave differently depending on where in the recipe they're used.

In my example, I have columns of several different types. I have a few models that will use character features, and some which won't. In this case, I want to remove those features and only keep factors, integers, doubles, etc.

I'd assume that step_rm(all_string_predictors()) would sort this out quickly, but this actually results in unpredictable behavior depending on where in the recipe chain you place it.

I could remove these features beforehand, or pre-compute their values and remove them by name, but this seems somewhat antithetical to the entire tidymodels approach.

Is this expected behavior, and if so, what is the "preferred" solution to handling it?

Reproducible example

df <- structure(list(
  NEK = c(
    221119035L, 221213318L, 211030043L, 220842741L,
    220161193L, 221215066L
  ), DateChanged = structure(c(
    1670284800, 1634169600, 1660867200, 1643587200, 1670371200
  ), class = c(
  ), tzone = ""), DateCollected = structure(c(
    1670198400, 1634083200, 1660780800, 1643500800, 1670371200
  ), class = c(
  ), tzone = ""), DateofTreatment = structure(c(
    1669334400, 1632614400, 1660262400, 1642636800, 1665964800
  ), class = c(
  ), tzone = ""), AgeYrs = c(555, 555, 555, 555, 555, 555), DrugUse = structure(c(1L, 3L, 1L, 1L, 1L, 1L), levels = c(
  ), class = "factor"), Comments = c(
    "lorem ipsum", "lorem ipsum",
    "lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ipsum"
  ), DiagOther = c(
    "lorem ipsum", "lorem ipsum", "lorem ipsum",
    "lorem ipsum", "lorem ipsum", "lorem ipsum"
  ), DiagOther2 = c(
    "lorem ipsum",
    "lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ipsum",
    "lorem ipsum"
  ), Drug1 = c(
    "lorem ipsum", "lorem ipsum", "lorem ipsum",
    "lorem ipsum", "lorem ipsum", "lorem ipsum"
  ), CaseStatus = rbinom(6, 1, 0.5)
), row.names = c(
), class = c("tbl_df", "tbl", "data.frame"))

#> Rows: 6
#> Columns: 11
#> $ NEK             <int> 221119035, 221213318, 211030043, 220842741, 220161193,…
#> $ DateChanged     <dttm> 2022-11-07 19:00:00, 2022-12-05 19:00:00, 2021-10-13 …
#> $ DateCollected   <dttm> 2022-11-06 19:00:00, 2022-12-04 19:00:00, 2021-10-12 …
#> $ DateofTreatment <dttm> 2022-05-21 20:00:00, 2022-11-24 19:00:00, 2021-09-25 …
#> $ AgeYrs          <dbl> 555, 555, 555, 555, 555, 555
#> $ Comments        <chr> "lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ip…
#> $ DiagOther       <chr> "lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ip…
#> $ DiagOther2      <chr> "lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ip…
#> $ Drug1           <chr> "lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ip…
#> $ CaseStatus      <int> 1, 0, 1, 0, 0, 0

# --- With two steps, this seems to work as expected -------------------------------------
lr_recipe <- recipe(CaseStatus ~ ., data = df) |>
  step_rm(NEK, all_string_predictors()) |>
  step_date(DateofTreatment) |>
  step_time(DateofTreatment) |>
  step_holiday(DateofTreatment) |>
  step_rm(all_datetime_predictors()) # remove unique ID, any string predictors, dates

#> -- Recipe ----------------------------------------------------------------------
#> -- Inputs
#> Number of variables by role
#> outcome:    1
#> predictor: 10
#> -- Training information
#> Training data contained 6 data points and no incomplete rows.
#> -- Operations
#> • Variables removed: NEK, Comments, DiagOther, DiagOther2, Drug1 | Trained
#> • Date features from: DateofTreatment | Trained
#> • Time features from: DateofTreatment | Trained
#> • Holiday features from: DateofTreatment | Trained
#> • Variables removed: DateChanged, DateCollected, DateofTreatment | Trained
#> # A tibble: 6 × 12
#>   AgeYrs DrugUse     CaseStatus DateofTreatment_dow DateofTreatment_month
#>    <dbl> <fct>            <int> <fct>               <fct>                
#> 1    555 THERAPEUTIC          1 Sat                 May                  
#> 2    555 SELF-HARM            0 Thu                 Nov                  
#> 3    555 THERAPEUTIC          1 Sat                 Sep                  
#> 4    555 THERAPEUTIC          0 Thu                 Aug                  
#> 5    555 THERAPEUTIC          0 Wed                 Jan                  
#> 6    555 THERAPEUTIC          0 Sun                 Oct                  
#> # ℹ 7 more variables: DateofTreatment_year <int>, DateofTreatment_hour <int>,
#> #   DateofTreatment_minute <int>, DateofTreatment_second <dbl>,
#> #   DateofTreatment_LaborDay <int>, DateofTreatment_NewYearsDay <int>,
#> #   DateofTreatment_ChristmasDay <int>

# --- If all_string_predictors() is not in the first step, fails -----------------------------------
lr_recipe <- recipe(CaseStatus ~ ., data = df) |>
  step_rm(NEK) |>
  step_rm(all_string_predictors()) |>
  step_date(DateofTreatment) |>
  step_time(DateofTreatment) |>
  step_holiday(DateofTreatment) |>
  step_rm(all_datetime_predictors()) # remove unique ID, any string predictors, dates

#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> ── Inputs
#> Number of variables by role
#> outcome:    1
#> predictor: 10
#> ── Training information
#> Training data contained 6 data points and no incomplete rows.
#> ── Operations
#> • Variables removed: NEK | Trained
#> • Variables removed: <none> | Trained
#> • Date features from: DateofTreatment | Trained
#> • Time features from: DateofTreatment | Trained
#> • Holiday features from: DateofTreatment | Trained
#> • Variables removed: DateChanged, DateCollected, DateofTreatment | Trained
#> # A tibble: 6 × 16
#>   AgeYrs DrugUse     Comments    DiagOther   DiagOther2  Drug1       CaseStatus
#>    <dbl> <fct>       <fct>       <fct>       <fct>       <fct>            <int>
#> 1    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          1
#> 2    555 SELF-HARM   lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> 3    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          1
#> 4    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> 5    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> 6    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> # ℹ 9 more variables: DateofTreatment_dow <fct>, DateofTreatment_month <fct>,
#> #   DateofTreatment_year <int>, DateofTreatment_hour <int>,
#> #   DateofTreatment_minute <int>, DateofTreatment_second <dbl>,
#> #   DateofTreatment_LaborDay <int>, DateofTreatment_NewYearsDay <int>,
#> #   DateofTreatment_ChristmasDay <int>

# --- Calling step_rm() at the end of the chain fails: recipes only finds NEK and the datetime columns because the strings have already been converted to factors ----
lr_recipe <- recipe(CaseStatus ~ ., data = df) |>
  step_date(DateofTreatment) |>
  step_time(DateofTreatment) |>
  step_holiday(DateofTreatment) |>
  step_rm(NEK, all_string_predictors(), all_datetime_predictors()) # remove unique ID, any string predictors, dates

#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> ── Inputs
#> Number of variables by role
#> outcome:    1
#> predictor: 10
#> ── Training information
#> Training data contained 6 data points and no incomplete rows.
#> ── Operations
#> • Date features from: DateofTreatment | Trained
#> • Time features from: DateofTreatment | Trained
#> • Holiday features from: DateofTreatment | Trained
#> • Variables removed: NEK, DateChanged, DateCollected, ... | Trained
#> # A tibble: 6 × 16
#>   AgeYrs DrugUse     Comments    DiagOther   DiagOther2  Drug1       CaseStatus
#>    <dbl> <fct>       <fct>       <fct>       <fct>       <fct>            <int>
#> 1    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          1
#> 2    555 SELF-HARM   lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> 3    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          1
#> 4    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> 5    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> 6    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> # ℹ 9 more variables: DateofTreatment_dow <fct>, DateofTreatment_month <fct>,
#> #   DateofTreatment_year <int>, DateofTreatment_hour <int>,
#> #   DateofTreatment_minute <int>, DateofTreatment_second <dbl>,
#> #   DateofTreatment_LaborDay <int>, DateofTreatment_NewYearsDay <int>,
#> #   DateofTreatment_ChristmasDay <int>

Session info

This is happening because of the default value of the strings_as_factors argument to prep(). Setting it to FALSE will likely resolve your issue.

This is a known issue, and we are planning to move the argument to recipe() so that it is more central to the recipe object. #331
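
A minimal sketch of the suggestion above, in case it helps. The data frame `toy` and its column names are made up for illustration and are not the data from the report. It is written against the prep() signature described in this reply; later {recipes} releases may expect the argument elsewhere, as noted above.

```r
library(recipes)

# Hypothetical toy data: one character predictor and one numeric predictor.
toy <- data.frame(
  outcome = c(1L, 0L, 1L, 0L),
  note    = c("a", "b", "c", "d"),
  age     = c(30, 41, 52, 63),
  stringsAsFactors = FALSE
)

# The string selector is deliberately NOT in the first step.
rec <- recipe(outcome ~ ., data = toy) |>
  step_normalize(age) |>
  step_rm(all_string_predictors())

# Keep character columns as character while the recipe is trained,
# so the string selector can still find them in later steps.
prepped <- prep(rec, training = toy, strings_as_factors = FALSE)
baked <- bake(prepped, new_data = NULL)

names(baked)
```

With the conversion switched off, `note` should be dropped even though step_rm() is not the first step. The trade-off is that any character columns you do keep stay character rather than becoming factors.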

