Revisit explanation on category encoding strategies #792

Open
fritshermans opened this issue Dec 23, 2024 · 0 comments

I think that the statement on which category encoding strategy to choose could use some revision:

"In general OneHotEncoder is the encoding strategy used when the downstream models are linear models while OrdinalEncoder is often a good strategy with tree-based models."

There is no problem with using OneHotEncoder with a tree-based model as long as the feature does not have too many unique categories. Moreover, I don't think it's good to teach that tree-based models can be given ordinal-encoded categorical values whose order is meaningless, on the grounds that the trees will sort things out (pun intended) once they are deep enough. Imagine a modelling exercise where a country's climate is an important predictor and the country name is a categorical feature: 'Australia' (warm), 'Austria' (cold), 'Barbados' (warm), 'Belgium' (cold), etc. The tree then needs to be very deep purely because of the poor encoding of the categorical values.
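To make the country example concrete, here is a minimal pure-Python sketch. It counts how many threshold splits a tree would need on a single integer-coded feature to separate warm from cold countries. The helper `thresholds_needed` and the "by target" ordering are hypothetical, introduced only for illustration. The alphabetical codes assume the encoder assigns integers in sorted category order.

```python
# Climate labels taken from the country example above.
climate = {
    "Australia": "warm",
    "Austria": "cold",
    "Barbados": "warm",
    "Belgium": "cold",
}


def thresholds_needed(codes, labels):
    """Count label changes along the sorted code axis.

    Every change between neighbouring codes needs at least one
    threshold split on this feature to separate the two categories.
    """
    ordered = sorted(codes, key=codes.get)
    return sum(labels[a] != labels[b] for a, b in zip(ordered, ordered[1:]))


# Integer codes in alphabetical order (assumed ordinal-encoding behaviour).
alphabetical = {name: i for i, name in enumerate(sorted(climate))}

# Hypothetical alternative: codes that follow the target (cold first, then warm).
by_target = {
    name: i
    for i, name in enumerate(sorted(climate, key=lambda n: (climate[n], n)))
}

alphabetical_splits = thresholds_needed(alphabetical, climate)
by_target_splits = thresholds_needed(by_target, climate)
print(alphabetical_splits, by_target_splits)  # prints "3 1"
```

With only four countries the gap is small, but the alphabetical count grows with the number of alternating categories, while an ordering that follows the target keeps it at one.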

I propose adding the TargetEncoder with some explanation. It is a better-suited encoder when there are many unique categories and no logical order.
