Perhaps I've missed some documentation, but I seem to have identified an issue where {recipes} converts character features to factor invisibly to the user, and this in turn creates a condition where all_string() and all_string_predictors() operate differently depending on where in the recipe they're used.
In my example, I have columns of several different types. I have a few models that will use character features, and some which won't. In this case, I want to remove those features and only keep factors, integers, doubles, etc.
I assumed that step_rm(all_string_predictors()) would handle this, but the result depends on where in the recipe the step is placed: in the examples below it removes the character columns only when it is part of the first step.
I could pre-remove these features beforehand, or pre-compute their values and remove them by name, but this seemed somewhat antithetical to the entire tidymodels approach.
Is this expected behavior, and if so, what is the "preferred" solution to handling it?
Reproducible example
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(recipes)
#> Warning: package 'recipes' was built under R version 4.4.1
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#>     step

df <- structure(list(
NEK= c(
221119035L, 221213318L, 211030043L, 220842741L,
220161193L, 221215066L
), DateChanged= structure(c(
1667865600,
1670284800, 1634169600, 1660867200, 1643587200, 1670371200
), class= c(
"POSIXct",
"POSIXt"
), tzone=""), DateCollected= structure(c(
1667779200,
1670198400, 1634083200, 1660780800, 1643500800, 1670371200
), class= c(
"POSIXct",
"POSIXt"
), tzone=""), DateofTreatment= structure(c(
1653177600,
1669334400, 1632614400, 1660262400, 1642636800, 1665964800
), class= c(
"POSIXct",
"POSIXt"
), tzone=""), AgeYrs= c(555, 555, 555, 555, 555, 555), DrugUse= structure(c(1L, 3L, 1L, 1L, 1L, 1L), levels= c(
"THERAPEUTIC",
"ABUSE", "SELF-HARM", "ASSAULT", "UNKNOWN INTENT", "NOT AN ADE"
), class="factor"), Comments= c(
"lorem ipsum", "lorem ipsum",
"lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ipsum"
),
DiagOther= c(
"lorem ipsum", "lorem ipsum", "lorem ipsum",
"lorem ipsum", "lorem ipsum", "lorem ipsum"
), DiagOther2= c(
"lorem ipsum",
"lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ipsum",
"lorem ipsum"
), Drug1= c(
"lorem ipsum", "lorem ipsum", "lorem ipsum",
"lorem ipsum", "lorem ipsum", "lorem ipsum"
),
CaseStatus= rbinom(6, 1, 0.5)
), row.names= c(
NA,
-6L
), class= c("tbl_df", "tbl", "data.frame"))
glimpse(df)
#> Rows: 6
#> Columns: 11
#> $ NEK             <int> 221119035, 221213318, 211030043, 220842741, 220161193,…
#> $ DateChanged     <dttm> 2022-11-07 19:00:00, 2022-12-05 19:00:00, 2021-10-13 …
#> $ DateCollected   <dttm> 2022-11-06 19:00:00, 2022-12-04 19:00:00, 2021-10-12 …
#> $ DateofTreatment <dttm> 2022-05-21 20:00:00, 2022-11-24 19:00:00, 2021-09-25 …
#> $ AgeYrs          <dbl> 555, 555, 555, 555, 555, 555
#> $ DrugUse         <fct> THERAPEUTIC, SELF-HARM, THERAPEUTIC, THERAPEUTIC, THER…
#> $ Comments        <chr> "lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ip…
#> $ DiagOther       <chr> "lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ip…
#> $ DiagOther2      <chr> "lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ip…
#> $ Drug1           <chr> "lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ip…
#> $ CaseStatus      <int> 1, 0, 1, 0, 0, 0

# --- With two steps, this seems to work as expected ---
lr_recipe <- recipe(CaseStatus ~ ., data = df) |>
step_rm(NEK, all_string_predictors()) |>
step_date(DateofTreatment) |>
step_time(DateofTreatment) |>
step_holiday(DateofTreatment) |>
step_rm(all_datetime_predictors()) # remove unique ID, any string predictors, dates
prep(lr_recipe)
#>
#> -- Recipe ----------------------------------------------------------------------
#>
#> -- Inputs
#> Number of variables by role
#> outcome:    1
#> predictor: 10
#>
#> -- Training information
#> Training data contained 6 data points and no incomplete rows.
#>
#> -- Operations
#> • Variables removed: NEK, Comments, DiagOther, DiagOther2, Drug1 | Trained
#> • Date features from: DateofTreatment | Trained
#> • Time features from: DateofTreatment | Trained
#> • Holiday features from: DateofTreatment | Trained
#> • Variables removed: DateChanged, DateCollected, DateofTreatment | Trained
juice(prep(lr_recipe))
#> # A tibble: 6 × 12
#>   AgeYrs DrugUse     CaseStatus DateofTreatment_dow DateofTreatment_month
#>    <dbl> <fct>            <int> <fct>               <fct>
#> 1    555 THERAPEUTIC          1 Sat                 May
#> 2    555 SELF-HARM            0 Thu                 Nov
#> 3    555 THERAPEUTIC          1 Sat                 Sep
#> 4    555 THERAPEUTIC          0 Thu                 Aug
#> 5    555 THERAPEUTIC          0 Wed                 Jan
#> 6    555 THERAPEUTIC          0 Sun                 Oct
#> # ℹ 7 more variables: DateofTreatment_year <int>, DateofTreatment_hour <int>,
#> #   DateofTreatment_minute <int>, DateofTreatment_second <dbl>,
#> #   DateofTreatment_LaborDay <int>, DateofTreatment_NewYearsDay <int>,
#> #   DateofTreatment_ChristmasDay <int>

# --- If all_string_predictors() is not in the first step, fails ---
lr_recipe <- recipe(CaseStatus ~ ., data = df) |>
step_rm(NEK) |>
step_rm(all_string_predictors()) |>
step_date(DateofTreatment) |>
step_time(DateofTreatment) |>
step_holiday(DateofTreatment) |>
step_rm(all_datetime_predictors()) # remove unique ID, any string predictors, dates
prep(lr_recipe)
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome:    1
#> predictor: 10
#>
#> ── Training information
#> Training data contained 6 data points and no incomplete rows.
#>
#> ── Operations
#> • Variables removed: NEK | Trained
#> • Variables removed: <none> | Trained
#> • Date features from: DateofTreatment | Trained
#> • Time features from: DateofTreatment | Trained
#> • Holiday features from: DateofTreatment | Trained
#> • Variables removed: DateChanged, DateCollected, DateofTreatment | Trained
juice(prep(lr_recipe))
#> # A tibble: 6 × 16
#>   AgeYrs DrugUse     Comments    DiagOther   DiagOther2  Drug1       CaseStatus
#>    <dbl> <fct>       <fct>       <fct>       <fct>       <fct>            <int>
#> 1    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          1
#> 2    555 SELF-HARM   lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> 3    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          1
#> 4    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> 5    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> 6    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> # ℹ 9 more variables: DateofTreatment_dow <fct>, DateofTreatment_month <fct>,
#> #   DateofTreatment_year <int>, DateofTreatment_hour <int>,
#> #   DateofTreatment_minute <int>, DateofTreatment_second <dbl>,
#> #   DateofTreatment_LaborDay <int>, DateofTreatment_NewYearsDay <int>,
#> #   DateofTreatment_ChristmasDay <int>

# --- Calling step_rm at the end of the chain fails: recipes only finds NEK and
# --- the datetime cols because strings have already been converted to factor
lr_recipe <- recipe(CaseStatus ~ ., data = df) |>
step_date(DateofTreatment) |>
step_time(DateofTreatment) |>
step_holiday(DateofTreatment) |>
step_rm(NEK, all_string_predictors(), all_datetime_predictors()) # remove unique ID, any string predictors, dates
prep(lr_recipe)
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome:    1
#> predictor: 10
#>
#> ── Training information
#> Training data contained 6 data points and no incomplete rows.
#>
#> ── Operations
#> • Date features from: DateofTreatment | Trained
#> • Time features from: DateofTreatment | Trained
#> • Holiday features from: DateofTreatment | Trained
#> • Variables removed: NEK, DateChanged, DateCollected, ... | Trained
juice(prep(lr_recipe))
#> # A tibble: 6 × 16
#>   AgeYrs DrugUse     Comments    DiagOther   DiagOther2  Drug1       CaseStatus
#>    <dbl> <fct>       <fct>       <fct>       <fct>       <fct>            <int>
#> 1    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          1
#> 2    555 SELF-HARM   lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> 3    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          1
#> 4    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> 5    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> 6    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> # ℹ 9 more variables: DateofTreatment_dow <fct>, DateofTreatment_month <fct>,
#> #   DateofTreatment_year <int>, DateofTreatment_hour <int>,
#> #   DateofTreatment_minute <int>, DateofTreatment_second <dbl>,
#> #   DateofTreatment_LaborDay <int>, DateofTreatment_NewYearsDay <int>,
#> #   DateofTreatment_ChristmasDay <int>
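One workaround that may be worth testing, sketched here under the assumption that the character-to-factor conversion comes from the strings_as_factors argument of prep() (which defaults to TRUE): turn the conversion off so that character columns are still character when a later step_rm(all_string_predictors()) runs. The toy data frame and its column names are hypothetical stand-ins for the data above, and this has not been confirmed as the preferred approach.

```r
library(recipes)

# Hypothetical stand-in data: one ID, one character column, one factor, one outcome.
toy <- data.frame(
  id   = 1:4,
  note = c("a", "b", "c", "d"),
  grp  = factor(c("x", "y", "x", "y")),
  y    = c(0, 1, 0, 1),
  stringsAsFactors = FALSE
)

# all_string_predictors() is deliberately NOT in the first step.
rec <- recipe(y ~ ., data = toy) |>
  step_rm(id) |>
  step_rm(all_string_predictors())

# Assumption: with strings_as_factors = FALSE the character column is left
# as character during prep(), so the selector in the second step can match it.
baked <- bake(prep(rec, strings_as_factors = FALSE), new_data = NULL)
names(baked)
```

If this behaves as assumed, note would be dropped regardless of where the step sits, while grp (already a factor) is kept. Any model that needs factors would then have to convert the remaining character columns explicitly.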