Feature Request: Count-Based Target Encoder (Dracula)? #420

bking124 · 2023-09-11T13:53:36Z

I recently stumbled upon a categorical encoding idea dubbed "Distributed Robust Algorithm for Count-based Learning" (aka Dracula) described in this Microsoft blog as well as this talk. It seems like it mixes ideas of CountEncoder and TargetEncoder. Has anybody heard of this approach before and has there been thought of introducing such an encoder into the package? I'm interested to compare this approach with the typical TargetEncoder.

Thanks for the wonderful package!

PaulWestenthanner · 2023-09-18T17:05:51Z

Hi @bking124

I haven't heard of the approach before. Searching for "Dracula Encoder" or "CTR encoder" (as mentioned in the talk) doesn't yield much either. Since the talk and blog post are already 8 years old and the approach hasn't gained much traction since, I'd be surprised if it yields great results.
On the other hand, we could include it in the package. I think it should be rather straightforward to implement.
From what I understood the encoded value is calculated as:

1. Calculate the counts for each label: df.groupBy(col, label).count(). This can be done for only the top N categories; the rest go into a "rest" category.
2. Use as the encoded value for a label x: counts[x, target=0], counts[x, target=1], ..., log-odds, flag_is_rest
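
The two steps above can be sketched in plain Python roughly as follows, assuming a binary target. The names `fit_counts`, `encode`, `top_n`, the `__rest__` bucket, and the additive `smoothing` term in the log-odds are illustrative assumptions; none of them come from the blog post, the talk, or category_encoders.

```python
import math
from collections import Counter

REST = "__rest__"  # hypothetical name for the catch-all bucket


def fit_counts(values, targets, top_n):
    """Count (category, target) pairs; categories outside the top_n go to REST."""
    freq = Counter(values)
    keep = {cat for cat, _ in freq.most_common(top_n)}
    counts = Counter()
    for cat, y in zip(values, targets):
        counts[(cat if cat in keep else REST, y)] += 1
    return keep, counts


def encode(value, keep, counts, smoothing=1.0):
    """Return [count_target_0, count_target_1, log_odds, flag_is_rest]."""
    is_rest = value not in keep
    key = REST if is_rest else value
    n0, n1 = counts[(key, 0)], counts[(key, 1)]
    # Additive smoothing is an assumption here; it avoids log(0) and
    # division by zero when a category has only one class.
    log_odds = math.log((n1 + smoothing) / (n0 + smoothing))
    return [n0, n1, log_odds, int(is_rest)]


values = ["a", "a", "a", "b", "b", "c", "d"]
targets = [1, 1, 0, 0, 0, 1, 0]

keep, counts = fit_counts(values, targets, top_n=2)
encoded_a = encode("a", keep, counts)         # frequent category
encoded_c = encode("c", keep, counts)         # rare category, falls into REST
encoded_unseen = encode("zzz", keep, counts)  # unseen category, also REST
```

In this sketch, rare and unseen categories share the same encoding, which is one possible way to handle unknown values at prediction time.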

I'm not quite sure how to handle the regression case. Probably we'd need some binning of the target variable there?
Also, small categories might result in overfitting if the classifier basically ignores the counts and just uses the log-odds (which it will). This might be a potential issue (just like in target encoding with too little regularization).
In fact this is pretty much what you'd get when you encode a variable with both count encoder and target encoder (with no regularisation).
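
To illustrate that last point for a binary target: the count encoding n and the unregularised target mean p together determine the per-class counts (n1 = n * p, n0 = n - n1), and the log-odds is just log(p / (1 - p)). The snippet below is a minimal sketch with made-up data; it does not use the package's CountEncoder or TargetEncoder, and it assumes every category contains both classes so that the log-odds is finite.

```python
import math
from collections import defaultdict

values = ["a", "a", "a", "b", "b", "b", "b"]
targets = [1, 1, 0, 1, 0, 0, 0]

n = defaultdict(int)      # count encoding: rows per category
n_pos = defaultdict(int)  # positives per category
for cat, y in zip(values, targets):
    n[cat] += 1
    n_pos[cat] += y

recovered = {}
for cat in n:
    count_enc = n[cat]
    target_enc = n_pos[cat] / n[cat]  # unregularised target mean
    # Recover the per-class counts from the two encodings alone.
    n1 = count_enc * target_enc
    n0 = count_enc - n1
    # Log-odds computed from the raw counts and from the target mean agree.
    log_odds_counts = math.log(n_pos[cat] / (n[cat] - n_pos[cat]))
    log_odds_mean = math.log(target_enc / (1 - target_enc))
    recovered[cat] = (n0, n1, log_odds_counts, log_odds_mean)
```

So, at least in the binary case without a top-N cutoff, the count-based features carry the same information as the pair (count, target mean), just in a different parameterisation.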

PaulWestenthanner added the enhancement label Sep 18, 2023

EmilHvitfeldt mentioned this issue Nov 28, 2023

step_dracula() EmilHvitfeldt/extrasteps#68

Open
