[FEATURE] Ability to stratify with cols that contain some Nans values, this way people can hyperparameter tune best imputation methods #681

dec1costello · 2024-06-26T01:50:06Z

Hello!

I have a training pipeline that tunes the imputation method as a hyperparameter.
sklearn's train_test_split(stratify=stratify_data) does not seem to be sufficient when the stratification columns contain NaN values, and my pipeline fails.
I'm curious whether this seems like a scikit-lego feature people would want.

For more context, here's my attempt to stratify on columns that contain some NaNs. I am a beginner, so I'm open to better ideas, or to comments if this feature request is out of scope. Thanks in advance! I appreciate everyone's contributions to this package!

Stratification attempt:

from sklearn.model_selection import train_test_split

X = result_df[feature_cols]
y = result_df['strokes_to_hole_out']

# Extract the columns used for stratification
stratify_cols = ['from_location_scorer', 'from_location_laser']
stratify_data = result_df[stratify_cols]

# Split the data, using 'stratify_data' for stratification
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42, stratify=stratify_data)

Error I receive during training: Trial failed with exception: Found unknown categories ['blue'] in column 9 during transform
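
For illustration, here is a minimal sketch of one way to stratify on columns that contain NaNs: treat "missing" as its own category and collapse the stratification columns into a single key. This is not an existing scikit-lego or scikit-learn feature. The helper name make_stratify_key, the sentinel label, and the toy values are all assumptions made up for this example; only the column names come from the snippet above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def make_stratify_key(df, cols, missing_label="__missing__"):
    """Hypothetical helper: one string key per row, with NaN as its own category."""
    # Replace NaN with an explicit sentinel label so it forms a regular class.
    filled = df[cols].astype("object").where(df[cols].notna(), missing_label)
    # Join the per-column labels into a single key per row.
    return filled.astype(str).agg("|".join, axis=1)


# Toy data (invented for illustration): two categorical columns with NaNs.
toy_df = pd.DataFrame({
    "from_location_scorer": ["Fairway", "Fairway", np.nan, np.nan] * 5,
    "from_location_laser": ["Left", np.nan, "Left", np.nan] * 5,
    "strokes_to_hole_out": np.arange(20, dtype=float),
})
stratify_cols = ["from_location_scorer", "from_location_laser"]
strat_key = make_stratify_key(toy_df, stratify_cols)

# Stratify on the combined key instead of the raw columns.
train_df, valid_df = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=strat_key
)
```

Note that a key value occurring in only one row would still make the stratified split raise an error, so rare combinations may need to be grouped first. Stratifying on these columns also says nothing about categories in other feature columns.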

FBruzzesi · 2024-06-26T07:07:34Z

Hey @dec1costello, thanks for the feature request. I have a few questions:

Could you provide some minimal input data?
Could you provide some minimal expected output data?
The error seems to be related to a transformer failing in the .transform(X_valid) step. How would the proposal fix that?
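
To make the last question concrete, here is a small sketch of how an encoder can fail in .transform when the validation data contains a category that was absent from the training data. It assumes the failing transformer is a scikit-learn OneHotEncoder with its default handle_unknown="error", which the thread does not confirm. The toy values are invented.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data: 'blue' appears only in the validation split.
X_train_toy = np.array([["red"], ["green"]], dtype=object)
X_valid_toy = np.array([["blue"]], dtype=object)

# Default behaviour: unknown categories raise a ValueError at transform time.
strict = OneHotEncoder()
strict.fit(X_train_toy)
try:
    strict.transform(X_valid_toy)
    strict_failed = False
except ValueError:
    strict_failed = True

# With handle_unknown="ignore", unknown categories are encoded as all zeros.
tolerant = OneHotEncoder(handle_unknown="ignore")
tolerant.fit(X_train_toy)
encoded = tolerant.transform(X_valid_toy).toarray()
```

If this assumption holds, the error depends on which categories land in the training split for the encoded column, which may be a different column from the ones used for stratification.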

dec1costello added the enhancement New feature or request label Jun 26, 2024
