[ENH] Enable Grid-Search for TableVectorizer
#814
Conversation
TableVectorizer
I need to address a small docstring error.
LGTM, thanks!
from sklearn.utils.validation import check_is_fitted

from skrub import DatetimeEncoder, GapEncoder
from skrub._utils import parse_astype_error_message

HIGH_CARDINALITY_TRANSFORMER = GapEncoder(n_components=30)
LOW_CARDINALITY_TRANSFORMER = OneHotEncoder(
    sparse_output=False,
we expose the ColumnTransformer's sparse_threshold parameter, but with our default transformers the output will always be dense (even if the onehot encoder yields many zeros).
we could consider:
- pointing out in the doc that users need to change the transformers if they want sparse output
- not exposing sparse_threshold and always returning dense data
- making the onehot encoder sparse by default
(not in this PR)
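For context, a minimal sketch of that trade-off, assuming the low-cardinality transformer can be overridden through a constructor parameter (called low_cardinality_transformer here, following the diff) and that the output container still follows ColumnTransformer's sparse_threshold logic; both are assumptions to check against the merged signature:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

from skrub import TableVectorizer

X = pd.DataFrame({"city": ["Paris", "London", "Rome", "Paris", "Berlin"]})

# With the default dense OneHotEncoder, sparse_threshold never triggers,
# so the output stays dense even when it contains many zeros.
dense_out = TableVectorizer().fit_transform(X)

# A user who wants sparse output has to swap in a sparse encoder themselves
# (the documentation change suggested in the first bullet above).
sparse_tv = TableVectorizer(
    low_cardinality_transformer=OneHotEncoder(sparse_output=True)
)
sparse_out = sparse_tv.fit_transform(X)
print(type(dense_out), type(sparse_out))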
("numeric", self.numerical_transformer_, numeric_columns), | ||
("datetime", self.datetime_transformer_, datetime_columns), | ||
("low_card_cat", self.low_card_cat_transformer_, low_card_cat_columns), | ||
("high_card_cat", self.high_card_cat_transformer_, high_card_cat_columns), | ||
("low_card_cat", self.low_cardinality_transformer_, low_card_cat_columns), |
not super important but you could propagate the change 'card_cat' -> 'cardinality' to local variables
along that line it would be nice if we picked one of "numeric" or "numerical" and used it all the time :)
numeric, since it's shorter?
sounds good! it's also the choice made by polars.selectors.numeric. pandas select_dtypes uses "number", "category".
I guess this one is ready to merge?
I think so!
Very nice. Congratulations!
What does this PR fix/address?
Apply Gaël's suggestions and the outcome of discussion #796 to make grid-search possible.
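As a sketch of the kind of search this makes possible (the data here is made up, and the parameter names cardinality_threshold and high_cardinality_transformer are assumptions based on the diff above; check them against the merged signature):

import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

from skrub import GapEncoder, MinHashEncoder, TableVectorizer

X = pd.DataFrame(
    {
        "position title": [
            "Police Aide", "Master Police Officer", "Social Worker IV",
            "Police Officer III", "Library Assistant", "Police Sergeant",
            "Social Worker II", "Library Aide",
        ],
        "year first hired": [1986, 1988, 1989, 2014, 2007, 1998, 2001, 2010],
    }
)
y = [69222.0, 97392.0, 104717.0, 52734.0, 42053.0, 90000.0, 66000.0, 30000.0]

pipe = make_pipeline(
    # a low cardinality_threshold pushes the text column into the
    # high-cardinality branch
    TableVectorizer(cardinality_threshold=5),
    HistGradientBoostingRegressor(),
)

# Nested parameters are now reachable through get_params / set_params,
# so GridSearchCV can swap whole transformers or tune their hyperparameters.
param_grid = {
    "tablevectorizer__high_cardinality_transformer": [
        GapEncoder(n_components=3),
        MinHashEncoder(n_components=4),
    ],
}
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(X, y)
print(search.best_params_)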
What does it change?
Transformers set to None are turned into "passthrough" during fit (e.g. high_cardinality_encoder = None will result in high_cardinality_encoder_ = "passthrough").
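A small illustration of that behaviour (the description above says high_cardinality_encoder while the diff uses *_transformer_ attributes; the names below follow the diff and may not match the merged API exactly):

import pandas as pd

from skrub import TableVectorizer

X = pd.DataFrame(
    {
        "city": ["Paris", "London", "Paris", "Rome"],
        "temperature": [15.2, 11.0, 14.8, 19.3],
    }
)

# Passing None disables the corresponding transformer...
tv = TableVectorizer(high_cardinality_transformer=None)
tv.fit(X)

# ...and the fitted attribute becomes the string "passthrough", so the
# matching columns are forwarded unchanged instead of raising an error.
print(tv.high_cardinality_transformer_)  # "passthrough"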