Don't Use One Hot Encoding
Have you ever wanted to train a model that required numeric features, but couldn't, because your data contained categorical features?
Did you then remember what you were told in school? That you should create dummy variables to one-hot encode these features?
TL;DR: Don't do it. Encode them numerically.
Let me elaborate.
I recently used `hdbscan()` from the `dbscan` package in R to cluster a dataset. The `hdbscan()` function requires numerical data, so I had to take care of my categorical features.
At first, I precomputed a distance matrix using the `daisy()` function (from the `cluster` package) with the Gower metric. Gower distances can deal with categorical features, and `hdbscan()` can also cluster on a precomputed distance matrix. I was happy with the clustering results.
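Here is a minimal sketch of that precomputed-distance approach; the toy data frame and its column names are made up for illustration:

```r
library(cluster)  # provides daisy()
library(dbscan)   # provides hdbscan()

# Hypothetical toy data: one numeric and one categorical feature
df <- data.frame(
  height = c(1.62, 1.75, 1.80, 1.68, 1.95, 1.58),
  color  = factor(c("red", "blue", "red", "green", "blue", "red"))
)

# The Gower metric handles mixed numeric/categorical data directly
d <- daisy(df, metric = "gower")

# hdbscan() accepts a precomputed "dist" object as input
cl <- hdbscan(d, minPts = 2)
cl$cluster  # cluster assignments (0 = noise)
```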
Then, for reasons too elaborate to explain here, I found the distance matrix precomputation inconvenient. So I tried what I was told in school: I one-hot encoded my categorical variables instead and fed the data directly into `hdbscan()`. But, to my surprise, the results got dramatically worse!
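For reference, the one-hot route looks roughly like this. Base R's `model.matrix()` does the dummy expansion; the toy data is the same hypothetical one as above:

```r
library(dbscan)

df <- data.frame(
  height = c(1.62, 1.75, 1.80, 1.68, 1.95, 1.58),
  color  = factor(c("red", "blue", "red", "green", "blue", "red"))
)

# Expand the factor into one 0/1 column per level; the "- 1" drops
# the intercept so no level is absorbed into a baseline
onehot <- model.matrix(~ . - 1, data = df)

# Feed the now purely numeric matrix straight into hdbscan()
cl <- hdbscan(onehot, minPts = 2)
```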
Maybe the data just needed scaling, you say? Good point, but no. I tried centering the data, I tried using the Gower metric on the one-hot encoded data, and more. Nothing helped, until I stumbled upon a few articles while browsing the web in rage.
First of all, I discovered this article. It compares training a decision tree on continuous variables to training it on one-hot encoded ones. Surprisingly, they also found that the results suffered when using one-hot encoding.
Encouraged by these findings, I continued searching and found this very extensive comparison of encoding schemes. One-hot encoding performs worst throughout their analysis. To quote them:
There seems to be no reason to use One-Hot Encoding over Numeric Encoding.
So I tried exactly that: I switched to numeric encoding, and I was back at the model performance I had observed with `daisy()` and Gower.
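Numeric (label) encoding simply maps each category level to an integer. A minimal sketch, again on the hypothetical toy data:

```r
library(dbscan)

df <- data.frame(
  height = c(1.62, 1.75, 1.80, 1.68, 1.95, 1.58),
  color  = factor(c("red", "blue", "red", "green", "blue", "red"))
)

# Replace the factor with its integer level codes (1, 2, 3, ...)
df$color <- as.integer(df$color)

cl <- hdbscan(as.matrix(df), minPts = 2)
```

Note that this imposes an arbitrary order on the levels; in my case, that evidently hurt the clustering far less than one-hot encoding did.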