[FEATURE] get_feature_names_out for sklego.preprocessing transformers. #543

Open
13 tasks done
CarloLepelaars opened this issue Oct 7, 2022 · 4 comments
Labels
enhancement New feature or request

Comments

@CarloLepelaars
Contributor

CarloLepelaars commented Oct 7, 2022

get_feature_names_out is an important component for interpreting scikit-learn Pipeline objects. A get_feature_names_out call on a Pipeline only works if it is implemented for all components in the pipeline, except the last step (i.e. the Model).

Scikit-learn recently implemented get_feature_names_out for all Transformers in their 1.1 release (Source).

I think it makes sense to also implement get_feature_names_out for all scikit-lego Transformers that are not models and are not TrainOnly. This leaves most objects in sklego.preprocessing.

  • sklego.preprocessing.ColumnCapper
  • sklego.preprocessing.DictMapper
  • sklego.preprocessing.IdentityTransformer
  • sklego.preprocessing.IntervalEncoder
  • sklego.preprocessing.OutlierRemover (TrainOnly)
  • sklego.preprocessing.PandasTypeSelector
  • sklego.preprocessing.ColumnSelector
  • sklego.preprocessing.ColumnDropper
  • sklego.preprocessing.PatsyTransformer
  • sklego.preprocessing.OrthogonalTransformer
  • sklego.preprocessing.InformationFilter
  • sklego.preprocessing.RandomAdder (TrainOnly)
  • sklego.preprocessing.RepeatingBasisFunction

Additionally, it should be tested if get_feature_names_out works correctly with a Pipeline that contains transformers inheriting from TrainOnlyTransformerMixin, like RandomAdder.

@koaning and I recently discussed implementing get_feature_names_out for sklego.meta and ended up implementing this method for EstimatorTransformer (PR #539). It does not look like objects in sklego.decomposition and sklego.mixture require an implementation of get_feature_names_out, because it seems they are mostly used as the last step in a pipeline or wrapped in an EstimatorTransformer.

Since this is such a systematic issue, we can consider adding some additional requirements for people contributing to sklego.preprocessing. That is, make sure to implement get_feature_names_out for any new preprocessor that is not a train-time only Transformer.

@CarloLepelaars CarloLepelaars added the enhancement New feature or request label Oct 7, 2022
@CarloLepelaars
Contributor Author

CarloLepelaars commented Oct 10, 2022

I'll take the thumbs-up as a go! 😁

After some digging I discovered that scikit-learn uses a _OneToOneFeatureMixin for adding get_feature_names_out to these preprocessing transformers where the shape does not change. The plan is to use this simple Mixin for transformers where the shape stays the same and try to find a different clean solution for the preprocessors where the shape does change (like ColumnSelector and ColumnDropper).
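To make the two cases concrete, here is a rough sketch that does not rely on scikit-learn's private mixin. All class names (`OneToOneNamesSketch`, `ScaleByTwoSketch`, `DropColumnsSketch`) are hypothetical and are not the sklego or scikit-learn implementations; they only illustrate the difference between shape-preserving and shape-changing transformers.

```python
import numpy as np
import pandas as pd


class OneToOneNamesSketch:
    """Hypothetical mixin: output names are identical to input names."""

    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            input_features = self.feature_names_in_
        return np.asarray(input_features, dtype=object)


class ScaleByTwoSketch(OneToOneNamesSketch):
    """Hypothetical shape-preserving transformer that reuses the mixin."""

    def fit(self, X, y=None):
        self.feature_names_in_ = np.asarray(X.columns, dtype=object)
        return self

    def transform(self, X):
        return X * 2


class DropColumnsSketch:
    """Hypothetical shape-changing transformer: names must be filtered."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        self.feature_names_in_ = np.asarray(X.columns, dtype=object)
        return self

    def transform(self, X):
        return X.drop(columns=self.columns)

    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            input_features = self.feature_names_in_
        kept = [name for name in input_features if name not in self.columns]
        return np.asarray(kept, dtype=object)


df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
same_names = ScaleByTwoSketch().fit(df).get_feature_names_out()
fewer_names = DropColumnsSketch(columns=["b"]).fit(df).get_feature_names_out()
print(same_names, fewer_names)
```

The shape-changing case needs its own logic because the output names depend on the transformer's parameters, which is why a single one-to-one mixin cannot cover ColumnSelector or ColumnDropper.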

@CarloLepelaars
Contributor Author

CarloLepelaars commented Nov 4, 2022

As discussed in #544 we are at a crossroads for this feature. Several options are valid.

  1. Wait until there have been multiple sklearn releases where _ClassNamePrefixFeaturesOutMixin is part of the public sklearn API. Most robust solution. Requires dropping of support for Python 3.7.
  2. Implement get_feature_names_out manually for all classes in sklego.preprocessing. More brittle, but this way we can keep Python 3.7 support and push this functionality pretty quickly. Tests for all transformers in sklego.preprocessing are already implemented.

@koaning
Owner

koaning commented Nov 4, 2022

I would prefer option 1.

Per the zen of this library:

Some problems cannot be solved in a single day,
but if you ignore them, they sometimes go away.

There is no rush. And if we can save ourselves work, that's always a good thing for a project we're doing in our precious spare time.

@edgBR

edgBR commented May 5, 2023

Hi,

Some estimators, like ZeroInflatedRegressor(), will also need to be updated.

Example:

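import xgboost as xgb
from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklego.meta import ZeroInflatedRegressor

# numerical_features, one_hot_features and param_grid_reg are defined elsewhere (not shown).
set_config(transform_output="pandas")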

param_grid_clf_normal = {
                'tree_method': 'hist',  # cpu: 'hist' | gpu: 'gpu_hist'
                'objective': 'binary:logistic',
                'eval_metric': 'logloss',
                'monotone_constraints': {"numeric_transf__tgt_period":1}
                #non-indicated features will be considered as 0.
                # For clf, I expect to apply this to the predict_proba function
            }

numeric_transformer = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="constant",
             fill_value=-1, add_indicator=True)),
            # originally: ("imputer", SimpleImputer(strategy="mean")),
            ("scaler", StandardScaler()),
            ("variance_selector", VarianceThreshold(threshold=0.03))
        ]
    )


categorical_transformer = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
            ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))]
    )

preprocessor = ColumnTransformer(
            transformers=[
                ("numeric_transf", numeric_transformer, numerical_features),
                ("categorical_transf", categorical_transformer, one_hot_features)],
            remainder='drop'
        )



zir = ZeroInflatedRegressor(
    classifier=xgb.XGBClassifier(**param_grid_clf_normal),
    regressor=xgb.XGBRegressor(**param_grid_reg)
)

model_zir_xgb = Pipeline(
    [
        # now preprocessor is an independent pipeline
        ('preprocessor', preprocessor),
        ('estimator', zir)
    ]
)
model_xgb = Pipeline(
    [
        # now preprocessor is an independent pipeline
        ('preprocessor', preprocessor),
        ('estimator', xgb.XGBRegressor(**param_grid_reg))
    ]
)

When fitting the normal XGBoost:

image

When trying to fit the ZIR:

NotFittedError                            Traceback (most recent call last)
File /anaconda/envs/tweedie_experiments/lib/python3.10/site-packages/sklego/meta/zero_inflated_regressor.py:90, in ZeroInflatedRegressor.fit(self, X, y, sample_weight)
     89 try:
---> 90     check_is_fitted(self.classifier)
     91     self.classifier_ = self.classifier

File /anaconda/envs/tweedie_experiments/lib/python3.10/site-packages/sklearn/utils/validation.py:1390, in check_is_fitted(estimator, attributes, msg, all_or_any)
   1389 if not fitted:
-> 1390     raise NotFittedError(msg % {"name": type(estimator).__name__})

NotFittedError: This XGBClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[55], line 1
----> 1 model_zir_xgb["estimator"].fit(train_data, target_train)

File /anaconda/envs/tweedie_experiments/lib/python3.10/site-packages/sklego/meta/zero_inflated_regressor.py:96, in ZeroInflatedRegressor.fit(self, X, y, sample_weight)
     93 self.classifier_ = clone(self.classifier)
     95 if "sample_weight" in signature(self.classifier_.fit).parameters:
---> 96     self.classifier_.fit(X, y != 0, sample_weight=sample_weight)
     97 else:
     98     logging.warning("Classifier ignores sample_weight.")
...
   1599         "Constrained features are not a subset of training data feature names"
   1600     )
   1602 return tuple(value.get(name, 0) for name in feature_names)

ValueError: Constrained features are not a subset of training data feature names
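
One possible reading of this error, which the traceback does not confirm: if the meta-estimator validates X with scikit-learn's array checks before fitting the inner classifier, a pandas DataFrame becomes a plain NumPy array and its column names are lost, so name-based monotone constraints can no longer be matched. The snippet below only demonstrates that conversion with `sklearn.utils.check_array`; the column names are illustrative, and whether ZeroInflatedRegressor.fit does this is an assumption.

```python
import pandas as pd
from sklearn.utils import check_array

# Illustrative frame; "tgt_period" mirrors the constrained feature above.
df = pd.DataFrame({"tgt_period": [1.0, 2.0], "other": [3.0, 4.0]})

# check_array returns a NumPy array, so column names do not survive.
validated = check_array(df)

has_names_before = hasattr(df, "columns")
has_names_after = hasattr(validated, "columns")
print(type(validated).__name__, has_names_before, has_names_after)
```

Note also that the constraint key in the example is the prefixed name "numeric_transf__tgt_period", so it would only match if the estimator receives the ColumnTransformer's pandas output rather than the raw training frame.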

3 participants