I think that the statement on which category encoding strategy to choose could use some revision:
"In general OneHotEncoder is the encoding strategy used when the downstream models are linear models while OrdinalEncoder is often a good strategy with tree-based models."
Using the OneHotEncoder with a tree-based model is not a problem as long as the number of unique categories is not too large. Moreover, I don't think it's good to teach that tree-based models can be given ordinal-encoded categories whose order is meaningless, on the grounds that a deep enough tree will sort things out (pun intended). Imagine a modelling exercise where a country's climate is an important predictor and the country name is a categorical feature: 'Australia' (warm), 'Austria' (cold), 'Barbados' (warm), 'Belgium' (cold), etc. The tree needs to be very deep only because of the poor encoding of the categorical values.
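To make the depth argument concrete, here is a minimal sketch in plain Python (no scikit-learn). It assumes the four countries above are the whole dataset and that the ordinal codes follow alphabetical order, which is how OrdinalEncoder assigns them by default. The helper names `ordinal_codes` and `thresholds_needed`, and the hand-picked `grouped` ordering, are hypothetical and only serve the illustration. The sketch counts how many threshold splits a tree would need on the encoded feature to separate warm from cold countries.

```python
# Toy data from the example above: country name -> climate label.
climate = {
    "Australia": "warm",
    "Austria": "cold",
    "Barbados": "warm",
    "Belgium": "cold",
}


def ordinal_codes(categories):
    """Assign integer codes in alphabetical order of the category names."""
    return {name: code for code, name in enumerate(sorted(categories))}


def thresholds_needed(codes, labels):
    """Count label changes between neighbours along the encoded axis.

    Every change requires its own threshold split before a tree can
    separate the classes perfectly using this single feature.
    """
    ordered = sorted(codes, key=codes.get)
    return sum(labels[a] != labels[b] for a, b in zip(ordered, ordered[1:]))


# Alphabetical codes: the climate label flips at every step.
alphabetical = ordinal_codes(climate)

# Hypothetical hand-picked codes that happen to group countries by climate.
grouped = {"Austria": 0, "Belgium": 1, "Australia": 2, "Barbados": 3}

print(thresholds_needed(alphabetical, climate))  # 3
print(thresholds_needed(grouped, climate))       # 1
```

With the alphabetical codes, the number of required splits grows with the number of countries, whereas an encoding that reflects the target needs a single split.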
I propose adding the TargetEncoder with some explanation. It is a better-suited encoder when there are many unique categories and no logical order among them.
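As a rough illustration of the idea behind target encoding, the sketch below replaces each category with the mean of the target observed for that category. It is a hand-rolled version in plain Python, not scikit-learn's TargetEncoder: the function name `target_mean_encode`, the `smoothing` parameter, and the toy data are assumptions made for this example, and the smoothing formula is a simple shrinkage towards the global mean that is not necessarily the one scikit-learn uses.

```python
from collections import defaultdict


def target_mean_encode(categories, targets, smoothing=0.0):
    """Map each category to the (optionally smoothed) mean of its targets.

    With smoothing > 0, each category mean is pulled towards the global
    mean, which matters most for categories with few samples.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }


# Hypothetical toy data: three observations per country, target 1 = warm.
countries = ["Australia", "Austria", "Barbados", "Belgium"] * 3
is_warm = [1, 0, 1, 0] * 3

encoding = target_mean_encode(countries, is_warm)
print(encoding)
# {'Australia': 1.0, 'Austria': 0.0, 'Barbados': 1.0, 'Belgium': 0.0}
```

After this encoding, a single threshold at 0.5 separates warm from cold countries, however many countries there are. Note that this sketch computes the means on the same rows it encodes, which leaks the target into the feature; scikit-learn's TargetEncoder guards against this by using internal cross fitting in `fit_transform`.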