[FEATURE] PolynomialColumnTransformer with maximum number of columns, n_dimensions
#715
Comments
Hey @CamiloMartinezM thanks for the very detailed explanation. To me the proposed solution seems something which is already possible via:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import ColumnTransformer

X = pd.DataFrame(np.random.randn(100, 3), columns=list("abc"))
cols = ["a", "b"]  # or list(range(n_dimensions))

n_dim_polynomial = ColumnTransformer(
    transformers=[
        ("poly", PolynomialFeatures(degree=2), cols)
    ],
    remainder="passthrough",  # no need to do a union with the other columns, they are just passed through as they are
).set_output(transform="pandas")

n_dim_polynomial.fit_transform(X).head(2)
```

Resulting in a DataFrame with the polynomial expansion of `a` and `b` plus the passed-through column `c`.
Another alternative is to use the `ColumnSelector`:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklego.preprocessing import ColumnSelector

pipe = make_pipeline(
    ColumnSelector(['a', 'b']),
    PolynomialFeatures()
)
```
Interesting, I hadn't considered the `ColumnTransformer` approach.
Description
Add the ability to specify the maximum number of dimensions to include in the polynomial feature combinations generated by `sklearn.preprocessing.PolynomialFeatures`, allowing users to apply the polynomial transformation to only the first N columns while keeping the remaining columns unchanged.

Use Case
Features are often sorted by some measure of relevance, such as a `p_value`, for instance, after a statistical test. And often only the first few features would need polynomial expansion while later features should remain untouched. Or simply to reduce computational overhead. Some features only contain `[0, 1]` values; doing polynomial expansion on these doesn't make sense and increases the number of columns of our dataset unnecessarily.

I found myself coding this `Transformer` myself to be able to use it in a `scikit-learn` pipeline that would preprocess ~100 features in a Healthcare dataset, which quickly blew up in terms of the number of output columns when applying `PolynomialFeatures` with `degree <= 3`.

Current Behavior
Currently, `PolynomialFeatures` transforms all input columns. We can specify a `degree = (min_degree, max_degree)` and `interaction_only=True or False` to limit the combinations (see the example below).

Someone already pitched a similar idea to scikit-learn, which ended up in a PR. That change allowed specifying a `tuple` like `degree = (min_degree, max_degree)`, whereas previously one could only specify `degree=int`.
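For instance, a quick illustrative sketch of how these two parameters limit the generated combinations:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.randn(5, 3)

# degree=(2, 3): only terms of total degree 2 and 3 are generated,
# the raw degree-1 columns are dropped.
poly_range = PolynomialFeatures(degree=(2, 3), include_bias=False)
print(poly_range.fit_transform(X).shape)  # (5, 16) vs. (5, 19) for a full degree-3 expansion

# interaction_only=True: only products of distinct features, no powers like x0^2.
poly_inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(poly_inter.fit_transform(X).shape)  # (5, 6): x0, x1, x2, x0 x1, x0 x2, x1 x2
```

Hacky Solution?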
I haven't tried this myself, but I guess you could potentially use `sklego.preprocessing.ColumnSelector`, which selects columns based on column name (I'm taking this out of the `README` file), apply `PolynomialFeatures`, and then concatenate them with the remaining columns. But I find this unnecessarily complex, having to use three transformers: a column selector, polynomial features, and then a union to concatenate the features back (see the sketch below). Also, it assumes the user knows the names of the columns at any point in the pipeline, which doesn't always work, since some scikit-learn transformers, including `PolynomialFeatures`, change the names of the columns. To the best of my knowledge, there is no other way to achieve this in an intuitive way with a single transformer.
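A rough sketch of what that three-transformer workaround could look like (assuming a DataFrame with columns `a`, `b`, `c` as in the example above, and expanding only `a` and `b`):

```python
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklego.preprocessing import ColumnSelector

# Branch 1: select "a" and "b" and expand them.
# Branch 2: select "c" and pass it through untouched.
# FeatureUnion concatenates both branches column-wise.
hacky = FeatureUnion(
    transformer_list=[
        ("poly", make_pipeline(ColumnSelector(["a", "b"]), PolynomialFeatures(degree=2))),
        ("rest", ColumnSelector(["c"])),
    ]
)
```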
Proposed Solution
Add an `n_dimensions` parameter to control how many columns from the input should undergo the polynomial transformation (a rough sketch of the idea is shown below). Even though you don't specify which columns to use, only the first `n_dimensions` columns, you could sort the columns yourself first based on feature importances or domain knowledge, so that only the ones you want are used for polynomial expansion. Or one could use this transformer alongside other potential ones that allow you to move columns to the beginning or even sort them based on some feature-engineering algorithm. Nevertheless, I'm open to suggestions to make this even more general. I already have the code for this one implemented, as well as very thorough tests to make sure nothing breaks in between, so I could take this as a first issue, if you find it useful.
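Roughly, the idea is something like the following simplified sketch (illustrative only, not the full implementation):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PolynomialFeatures


class PolynomialColumnTransformer(BaseEstimator, TransformerMixin):
    """Apply PolynomialFeatures to the first `n_dimensions` columns only,
    passing the remaining columns through unchanged."""

    def __init__(self, n_dimensions=None, degree=2, interaction_only=False, include_bias=False):
        self.n_dimensions = n_dimensions
        self.degree = degree
        self.interaction_only = interaction_only
        self.include_bias = include_bias

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        # Default: expand every column, i.e. behave like plain PolynomialFeatures.
        self.n_dimensions_ = X.shape[1] if self.n_dimensions is None else self.n_dimensions
        self.poly_ = PolynomialFeatures(
            degree=self.degree,
            interaction_only=self.interaction_only,
            include_bias=self.include_bias,
        ).fit(X.iloc[:, : self.n_dimensions_])
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        expanded = pd.DataFrame(
            self.poly_.transform(X.iloc[:, : self.n_dimensions_]),
            columns=self.poly_.get_feature_names_out(),
            index=X.index,
        )
        # Concatenate the expanded block with the untouched remaining columns.
        return pd.concat([expanded, X.iloc[:, self.n_dimensions_ :]], axis=1)


# Only the first 2 of 4 columns are expanded; "c" and "d" are passed through as-is.
X = pd.DataFrame(np.random.randn(10, 4), columns=list("abcd"))
out = PolynomialColumnTransformer(n_dimensions=2, degree=2).fit_transform(X)
print(out.columns.tolist())  # ['a', 'b', 'a^2', 'a b', 'b^2', 'c', 'd']
```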