[FEATURE] `get_feature_names_out` for `sklego.preprocessing` transformers. #543

CarloLepelaars · 2022-10-07T12:35:04Z

CarloLepelaars · 2022-10-10T14:31:43Z

I'll take the thumbs-up as a go! 😁

After some digging I discovered that scikit-learn uses a _OneToOneFeatureMixin to add get_feature_names_out to preprocessing transformers whose output shape does not change. The plan is to use this simple mixin for transformers where the shape stays the same, and to find a different clean solution for the preprocessors where the shape does change (like ColumnSelector and ColumnDropper).
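To make the one-to-one idea concrete, here is a minimal pure-Python sketch. It does not use scikit-learn's private mixin; `OneToOneNamesMixin` and `Doubler` are hypothetical names made up for illustration, and the fallback to generic `x0, x1, ...` names is an assumption about reasonable behaviour rather than a copy of the scikit-learn implementation.

```python
class OneToOneNamesMixin:
    """Hypothetical stand-in for a one-to-one feature-name mixin.

    Output names equal input names, which only makes sense for
    transformers that keep the number of columns unchanged.
    """

    def get_feature_names_out(self, input_features=None):
        if not hasattr(self, "n_features_in_"):
            raise RuntimeError("Call fit before get_feature_names_out.")
        if input_features is None:
            # Use the names seen during fit, else generic x0, x1, ...
            names = getattr(self, "feature_names_in_", None)
            if names is None:
                names = [f"x{i}" for i in range(self.n_features_in_)]
            return list(names)
        if len(input_features) != self.n_features_in_:
            raise ValueError("input_features has the wrong length.")
        return list(input_features)


class Doubler(OneToOneNamesMixin):
    """Toy transformer: doubles every value, so the shape is unchanged."""

    def fit(self, X, feature_names=None):
        self.n_features_in_ = len(X[0])
        if feature_names is not None:
            self.feature_names_in_ = list(feature_names)
        return self

    def transform(self, X):
        return [[2 * value for value in row] for row in X]


doubler = Doubler().fit([[1, 2], [3, 4]], feature_names=["age", "income"])
names_out = doubler.get_feature_names_out()
```

The transformer itself never has to think about names; it only records what it saw in `fit`, and the mixin does the rest.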

CarloLepelaars · 2022-11-04T11:52:48Z

As discussed in #544 we are at a crossroads for this feature. Several options are valid.

1. Wait until there have been multiple sklearn releases where _ClassNamePrefixFeaturesOutMixin is part of the public sklearn API. This is the most robust solution, but it requires dropping support for Python 3.7.
2. Implement get_feature_names_out manually for all classes in sklego.preprocessing. This is more brittle, but this way we can keep Python 3.7 support and can push this functionality pretty quickly. Tests for all transformers in sklego.preprocessing are already implemented.
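For the manual route, a rough sketch of what a hand-written get_feature_names_out could look like for a transformer that changes the shape. `ToyColumnDropper` is a hypothetical, simplified class that works on a plain dict of columns; it is not the sklego ColumnDropper and only illustrates the bookkeeping involved.

```python
class ToyColumnDropper:
    """Hypothetical, simplified column dropper (not the sklego class)."""

    def __init__(self, drop):
        self.drop = list(drop)

    def fit(self, frame):
        columns = list(frame)
        missing = [name for name in self.drop if name not in columns]
        if missing:
            raise KeyError(f"Columns not found: {missing}")
        self.feature_names_in_ = columns
        # Surviving columns keep their original order.
        self.feature_names_out_ = [c for c in columns if c not in self.drop]
        return self

    def transform(self, frame):
        return {name: frame[name] for name in self.feature_names_out_}

    def get_feature_names_out(self, input_features=None):
        # Written by hand: the output names depend on what was dropped.
        if not hasattr(self, "feature_names_out_"):
            raise RuntimeError("Call fit before get_feature_names_out.")
        return list(self.feature_names_out_)


frame = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
dropper = ToyColumnDropper(drop=["b"]).fit(frame)
remaining = dropper.get_feature_names_out()
```

Every shape-changing class needs its own version of this logic, which is where the brittleness of this option comes from.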

koaning · 2022-11-04T12:28:11Z

I would prefer option 1.

Per the zen of this library:

Some problems cannot be solved in a single day,
but if you ignore them, they sometimes go away.

There is no rush. And if we can save ourselves work, that's always a good thing for a project we're doing in our precious spare time.

edgBR · 2023-05-05T17:05:53Z

Hi,

Some estimators, such as ZeroInflatedRegressor(), will also need to be updated.

Example:

import xgboost as xgb
from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklego.meta import ZeroInflatedRegressor

# Make transformers return pandas DataFrames so feature names are kept.
set_config(transform_output="pandas")

# Note: param_grid_reg, numerical_features and one_hot_features are
# defined elsewhere and are not shown in this snippet.

param_grid_clf_normal = {
    'tree_method': 'hist',  # cpu: 'hist' | gpu: 'gpu_hist'
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    # Constraint keyed on the prefixed name produced by the ColumnTransformer.
    'monotone_constraints': {"numeric_transf__tgt_period": 1}
    # non-indicated features will be considered as 0.
    # For clf, I expect to apply this to the predict_proba function
}

numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant",
                                  fill_value=-1, add_indicator=True)),
        # originally: ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
        ("variance_selector", VarianceThreshold(threshold=0.03)),
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("numeric_transf", numeric_transformer, numerical_features),
        ("categorical_transf", categorical_transformer, one_hot_features),
    ],
    remainder='drop'
)

zir = ZeroInflatedRegressor(
    classifier=xgb.XGBClassifier(**param_grid_clf_normal),
    regressor=xgb.XGBRegressor(**param_grid_reg)
)

model_zir_xgb = Pipeline(
    [
        # now preprocessor is an independent pipeline
        ('preprocessor', preprocessor),
        ('estimator', zir)
    ]
)
model_xgb = Pipeline(
    [
        # now preprocessor is an independent pipeline
        ('preprocessor', preprocessor),
        ('estimator', xgb.XGBRegressor(**param_grid_reg))
    ]
)

When fitting the normal XGBoost:

When trying to fit the ZIR:

NotFittedError                            Traceback (most recent call last)
File /anaconda/envs/tweedie_experiments/lib/python3.10/site-packages/sklego/meta/zero_inflated_regressor.py:90, in ZeroInflatedRegressor.fit(self, X, y, sample_weight)
     89 try:
---> 90     check_is_fitted(self.classifier)
     91     self.classifier_ = self.classifier

File /anaconda/envs/tweedie_experiments/lib/python3.10/site-packages/sklearn/utils/validation.py:1390, in check_is_fitted(estimator, attributes, msg, all_or_any)
   1389 if not fitted:
-> 1390     raise NotFittedError(msg % {"name": type(estimator).__name__})

NotFittedError: This XGBClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[55], line 1
----> 1 model_zir_xgb["estimator"].fit(train_data, target_train)

File /anaconda/envs/tweedie_experiments/lib/python3.10/site-packages/sklego/meta/zero_inflated_regressor.py:96, in ZeroInflatedRegressor.fit(self, X, y, sample_weight)
     93 self.classifier_ = clone(self.classifier)
     95 if "sample_weight" in signature(self.classifier_.fit).parameters:
---> 96     self.classifier_.fit(X, y != 0, sample_weight=sample_weight)
     97 else:
     98     logging.warning("Classifier ignores sample_weight.")
...
   1599         "Constrained features are not a subset of training data feature names"
   1600     )
   1602 return tuple(value.get(name, 0) for name in feature_names)

ValueError: Constrained features are not a subset of training data feature names
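To show what this check is doing, here is a small sketch reconstructed only from the lines visible in the traceback above. The helper name `constraints_to_tuple` and the column `numeric_transf__other` are hypothetical, and whether the names reaching the classifier actually lack the prefix in this run is an assumption, not something the traceback confirms.

```python
def constraints_to_tuple(constraints, feature_names):
    """Hypothetical helper mirroring the check visible in the traceback."""
    if not set(constraints).issubset(feature_names):
        raise ValueError(
            "Constrained features are not a subset of training data feature names"
        )
    # Features without an explicit constraint default to 0.
    return tuple(constraints.get(name, 0) for name in feature_names)


constraints = {"numeric_transf__tgt_period": 1}

# Names as a ColumnTransformer with pandas output would prefix them.
prefixed = ["numeric_transf__tgt_period", "numeric_transf__other"]
resolved = constraints_to_tuple(constraints, prefixed)

# Names without the prefix (or lost names) trigger the error.
unprefixed = ["tgt_period", "other"]
try:
    constraints_to_tuple(constraints, unprefixed)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

So a dict-style constraint only works when the feature names seen by the inner classifier match the keys exactly, which is why name propagation through meta-estimators matters here.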

CarloLepelaars added the enhancement New feature or request label Oct 7, 2022

CarloLepelaars mentioned this issue Oct 11, 2022: [WIP] get_feature_names_out for sklego.preprocessing. #544 (Closed, 4 tasks)

Alex-Cremers mentioned this issue Jul 11, 2024: feat: RepeatingBasisFunction.inverse_transform #687 (Open, 9 tasks)
