[FEATURE] `get_feature_names_out` for `sklego.preprocessing` transformers #543
Comments
I'll take the thumbs-up as a go! 😁 After some digging I discovered that

As discussed in #544 we are at a crossroads for this feature. Several options are valid.

I would prefer option 1.

There is no rush. And if we can save ourselves some work, that's always a good thing for a project we maintain in our precious spare time.
Hi, some estimators, like `ZeroInflatedRegressor()`, will require updates as well. Example:

```python
import xgboost as xgb
from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklego.meta import ZeroInflatedRegressor

set_config(transform_output="pandas")

param_grid_clf_normal = {
    'tree_method': 'hist',  # cpu: 'hist' | gpu: 'gpu_hist'
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    # Non-indicated features are treated as 0 (unconstrained).
    # For the classifier, I expect this to apply to predict_proba.
    'monotone_constraints': {"numeric_transf__tgt_period": 1},
}

numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value=-1, add_indicator=True)),
        # originally: ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
        ("variance_selector", VarianceThreshold(threshold=0.03)),
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("numeric_transf", numeric_transformer, numerical_features),
        ("categorical_transf", categorical_transformer, one_hot_features),
    ],
    remainder='drop',
)

zir = ZeroInflatedRegressor(
    classifier=xgb.XGBClassifier(**param_grid_clf_normal),
    regressor=xgb.XGBRegressor(**param_grid_reg),
)

# now preprocessor is an independent pipeline
model_zir_xgb = Pipeline(
    [
        ('preprocessor', preprocessor),
        ('estimator', zir),
    ]
)

model_xgb = Pipeline(
    [
        ('preprocessor', preprocessor),
        ('estimator', xgb.XGBRegressor(**param_grid_reg)),
    ]
)
```

When fitting the normal XGBoost:

When trying to fit the ZIR:

```
NotFittedError                            Traceback (most recent call last)
File /anaconda/envs/tweedie_experiments/lib/python3.10/site-packages/sklego/meta/zero_inflated_regressor.py:90, in ZeroInflatedRegressor.fit(self, X, y, sample_weight)
     89 try:
---> 90     check_is_fitted(self.classifier)
     91     self.classifier_ = self.classifier

File /anaconda/envs/tweedie_experiments/lib/python3.10/site-packages/sklearn/utils/validation.py:1390, in check_is_fitted(estimator, attributes, msg, all_or_any)
   1389 if not fitted:
-> 1390     raise NotFittedError(msg % {"name": type(estimator).__name__})

NotFittedError: This XGBClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[55], line 1
----> 1 model_zir_xgb["estimator"].fit(train_data, target_train)

File /anaconda/envs/tweedie_experiments/lib/python3.10/site-packages/sklego/meta/zero_inflated_regressor.py:96, in ZeroInflatedRegressor.fit(self, X, y, sample_weight)
     93     self.classifier_ = clone(self.classifier)
     95 if "sample_weight" in signature(self.classifier_.fit).parameters:
---> 96     self.classifier_.fit(X, y != 0, sample_weight=sample_weight)
     97 else:
     98     logging.warning("Classifier ignores sample_weight.")
...
   1599     "Constrained features are not a subset of training data feature names"
   1600 )
   1602 return tuple(value.get(name, 0) for name in feature_names)

ValueError: Constrained features are not a subset of training data feature names
```

(Output truncated.)
`get_feature_names_out` is an important component for interpreting scikit-learn `Pipeline` objects. A `get_feature_names_out` call on a `Pipeline` only works if it is implemented for all components in the pipeline except the last step (i.e. the model). Scikit-learn recently implemented `get_feature_names_out` for all transformers in their 1.1 release (source). I think it makes sense to also implement `get_feature_names_out` for all scikit-lego transformers that are not models and are not `TrainOnly`. This leaves most objects in `sklego.preprocessing`:

- `sklego.preprocessing.ColumnCapper`
- `sklego.preprocessing.DictMapper`
- `sklego.preprocessing.IdentityTransformer`
- `sklego.preprocessing.IntervalEncoder`
- `sklego.preprocessing.OutlierRemover` (TrainOnly)
- `sklego.preprocessing.PandasTypeSelector`
- `sklego.preprocessing.ColumnSelector`
- `sklego.preprocessing.ColumnDropper`
- `sklego.preprocessing.PatsyTransformer`
- `sklego.preprocessing.OrthogonalTransformer`
- `sklego.preprocessing.InformationFilter`
- `sklego.preprocessing.RandomAdder` (TrainOnly)
- `sklego.preprocessing.RepeatingBasisFunction`
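For many of the preprocessors listed above the work is mechanical, since they keep one output column per input column. A hypothetical sketch (this is not sklego's actual code; the class and attributes are illustrative) of what the method could look like on a capper-style transformer:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted


class CapperSketch(TransformerMixin, BaseEstimator):
    """Illustrative transformer: clips each column to a quantile range."""

    def __init__(self, quantile_range=(5.0, 95.0)):
        self.quantile_range = quantile_range

    def fit(self, X, y=None):
        X = check_array(X)
        lo, hi = self.quantile_range
        self.quantiles_ = np.percentile(X, [lo, hi], axis=0)
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        check_is_fitted(self, "quantiles_")
        X = check_array(X)
        return np.clip(X, self.quantiles_[0], self.quantiles_[1])

    def get_feature_names_out(self, input_features=None):
        # One output column per input column: echo the input names,
        # or generate x0, x1, ... when none are provided.
        check_is_fitted(self, "n_features_in_")
        if input_features is None:
            input_features = [f"x{i}" for i in range(self.n_features_in_)]
        return np.asarray(input_features, dtype=object)
```

Transformers that change the number of columns (e.g. `RepeatingBasisFunction`) would instead generate new names, similar to what `OneHotEncoder` does.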
Additionally, it should be tested whether `get_feature_names_out` works correctly with a `Pipeline` that contains transformers inheriting from `TrainOnlyTransformerMixin`, like `RandomAdder`.
@koaning and I recently discussed implementing `get_feature_names_out` for `sklego.meta` and ended up implementing this method for `EstimatorTransformer` (PR #539). It does not look like objects in `sklego.decomposition` and `sklego.mixture` require an implementation of `get_feature_names_out`, because they are mostly used as the last step in a pipeline or wrapped in an `EstimatorTransformer`.

Since this is such a systematic issue, we could consider adding an additional requirement for people contributing to `sklego.preprocessing`: make sure to implement `get_feature_names_out` for any new preprocessor that is not a train-time-only transformer.