Don't Use One Hot Encoding
Have you ever wanted to train a model that required numeric features, but couldn't, because your data contained categorical features?
Did you then remember what you were told in school? That you should create dummy variables to one-hot encode these features?
TL;DR: Don't do it. Encode them numerically.
Let me elaborate.
I recently used `hdbscan()` from the `dbscan` package in R to cluster a dataset. The `hdbscan()` function requires numerical data, so I had to take care of my categorical features.
At first, I precomputed a distance matrix using the `daisy()` function (from the `cluster` package) with the Gower metric. Gower distances can deal with categorical features, and `hdbscan()` can also cluster on a precomputed distance matrix. I was happy with the clustering results.
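Here is a minimal sketch of that precomputed-distance approach; the toy data frame and its column names are made up for illustration:

```r
library(cluster)  # provides daisy()
library(dbscan)   # provides hdbscan()

# Hypothetical toy data: one numeric and one categorical feature
df <- data.frame(
  height = c(1.62, 1.75, 1.80, 1.68, 1.95, 1.58),
  color  = factor(c("red", "blue", "red", "green", "blue", "red"))
)

# The Gower metric handles mixed numeric/categorical data directly
d <- daisy(df, metric = "gower")

# hdbscan() accepts a precomputed "dist" object as input
cl <- hdbscan(d, minPts = 2)
cl$cluster  # cluster assignments (0 = noise)
```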
Then, for reasons too elaborate to explain here, I found the distance matrix precomputation inconvenient. So I tried what I was told in school: I one-hot encoded my categorical variables instead and fed the data directly into `hdbscan()`. But, to my surprise, the results got dramatically worse!
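For reference, the one-hot route looks roughly like this. Base R's `model.matrix()` does the dummy expansion; the toy data is the same hypothetical one as above:

```r
library(dbscan)

df <- data.frame(
  height = c(1.62, 1.75, 1.80, 1.68, 1.95, 1.58),
  color  = factor(c("red", "blue", "red", "green", "blue", "red"))
)

# Expand the factor into one 0/1 column per level; the "- 1" drops
# the intercept so no level is absorbed into a baseline
onehot <- model.matrix(~ . - 1, data = df)

# Feed the now purely numeric matrix straight into hdbscan()
cl <- hdbscan(onehot, minPts = 2)
```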
Maybe the data just needed scaling, you say? Good point, but no. I tried centering the data, I tried using the Gower metric on the one-hot encoded data, and more. Nothing helped, until I stumbled upon a few articles while browsing the web in rage.
First of all, I discovered this article. It compares training a decision tree on continuous variables to training it on one-hot encoded ones. Surprisingly, they also found that the results suffered when using one-hot encoding.
Encouraged by these findings, I continued searching and found this very extensive comparison of encoding schemes. One-hot encoding performs worst throughout their analysis. To quote them:
There seems to be no reason to use One-Hot Encoding over Numeric Encoding.
So I tried exactly that: I switched to numeric encoding, and I was back at the model performance I had observed with `daisy()` and Gower.
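Numeric (label) encoding simply maps each category level to an integer. A minimal sketch, again on the hypothetical toy data:

```r
library(dbscan)

df <- data.frame(
  height = c(1.62, 1.75, 1.80, 1.68, 1.95, 1.58),
  color  = factor(c("red", "blue", "red", "green", "blue", "red"))
)

# Replace the factor with its integer level codes (1, 2, 3, ...)
df$color <- as.integer(df$color)

cl <- hdbscan(as.matrix(df), minPts = 2)
```

Note that this imposes an arbitrary order on the levels; in my case, that evidently hurt the clustering far less than one-hot encoding did.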