Revisit explanation on category encoding strategies #792

fritshermans · 2024-12-23T15:26:19Z

I think that the statement on which category encoding strategy to choose could use some revision:

"In general OneHotEncoder is the encoding strategy used when the downstream models are linear models while OrdinalEncoder is often a good strategy with tree-based models."

There is no problem in using the OneHotEncoder with a tree-based model as long as there are not too many unique categorical values. Moreover, I don't think it's good to teach that tree-based models can be given ordinal-encoded categorical values whose order is meaningless, on the grounds that the trees will be able to sort things out (pun intended) when they are deep enough. Imagine a modelling exercise where a country's climate is an important predictor and the country name is a categorical feature: 'Australia' (warm), 'Austria' (cold), 'Barbados' (warm), 'Belgium' (cold), etc. With alphabetical ordinal codes, warm and cold countries alternate along the encoded axis, so the tree needs to be very deep only because of poor categorical value encoding.
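To make the country example concrete, here is a minimal sketch in plain Python (no scikit-learn needed). It assumes ordinal codes are assigned in alphabetical order, as happens when the categories are inferred from sorted string values, and it counts how many threshold splits on that single feature are needed to separate warm from cold countries. The helper `splits_needed` and the "grouped" ordering are hypothetical and only there for illustration.

```python
# Climate labels for the four countries from the example (1 = warm, 0 = cold).
climate = {"Australia": 1, "Austria": 0, "Barbados": 1, "Belgium": 0}


def splits_needed(codes, labels):
    """Count the thresholds needed to separate the labels on one numeric feature.

    Walking along the encoded axis, every change of label between two
    neighbouring codes requires one extra threshold split.
    """
    ordered = [labels[name] for name in sorted(codes, key=codes.get)]
    return sum(a != b for a, b in zip(ordered, ordered[1:]))


# Ordinal codes in alphabetical order: Australia=0, Austria=1, Barbados=2, Belgium=3.
alphabetical = {name: code for code, name in enumerate(sorted(climate))}

# Hypothetical ordering that groups countries by climate (cold first, then warm).
grouped = {
    name: code
    for code, name in enumerate(sorted(climate, key=lambda n: (climate[n], n)))
}

alphabetical_splits = splits_needed(alphabetical, climate)
grouped_splits = splits_needed(grouped, climate)

print("alphabetical order:", alphabetical_splits, "splits")
print("grouped by climate:", grouped_splits, "split")
```

With only four countries the difference is small, but the number of splits forced by an arbitrary order grows with the number of categories, which is the concern raised above.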

I propose adding the TargetEncoder with some explanation. It is a better-suited encoder when there are many unique categories and no logical order.
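The idea behind target encoding can be sketched in a few lines of plain Python: each category is replaced by the mean of the target observed for that category. This is a deliberately simplified illustration, not scikit-learn's implementation; the real `TargetEncoder` additionally applies smoothing and cross fitting to limit target leakage. The rows and the helper `mean_target_encoding` below are made up for the example.

```python
from collections import defaultdict

# Made-up toy rows: (country, target), where target 1 means "warm".
rows = [
    ("Australia", 1),
    ("Austria", 0),
    ("Barbados", 1),
    ("Belgium", 0),
    ("Australia", 1),
    ("Belgium", 0),
]


def mean_target_encoding(samples):
    """Map each category to the mean target value seen for that category."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, target in samples:
        totals[category] += target
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}


encoding = mean_target_encoding(rows)
encoded_feature = [encoding[country] for country, _ in rows]

# Categories with similar targets now sit next to each other on the encoded
# axis, so a single threshold (here 0.5) separates warm from cold countries.
predicted_warm = [value > 0.5 for value in encoded_feature]

print(encoding)
```

Because the encoded value reflects the target rather than the spelling of the category, the number of unique categories no longer determines how many splits a tree needs.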
