Hash Encoding Outputting High Collision Output #402
Hi Eddie, thanks for reporting this issue.

```python
import random

import pandas as pd
from category_encoders import HashingEncoder

categories = [str(x) for x in range(100)]
data = [random.choice(categories) for _ in range(1000)]
df = pd.DataFrame({"col_a": data})

my_enc = HashingEncoder()
df_out = my_enc.fit_transform(df)

df_out["dummy"] = 1
print(df_out.groupby([f"col_{i}" for i in range(8)]).count())
```

I'm really surprised that this seems to have gone unnoticed for 7 years.
Also note a similar question and discussion in #350.
I think this is expected behavior for a hashing encoder. See the Wikipedia article on feature hashing, and compare with sklearn's implementation. A binary encoding based on a hash function could be interesting as a new encoder, though: it could hash to integers and then apply binary encoding to those. Hashing to an "ordinal" encoding first could be factored out, and the current transformer would then be that step followed by one-hot encoding.
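To make the proposal above concrete, here is a minimal sketch of that two-step encoder: hash each category to an integer bucket, then binary-encode the bucket. All names here (`bucket`, `binary_hash_encode`, `n_bits`) are illustrative, not existing library API, and `md5` merely stands in for whatever hash function would actually be used:

```python
# Sketch of the proposed encoder: hash each category to an integer
# bucket, then binary-encode the bucket index. Names are illustrative,
# not part of category_encoders.
import hashlib

import pandas as pd


def bucket(value: str, n_bits: int) -> int:
    """Hash a category string into one of 2**n_bits buckets."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16) % (2 ** n_bits)


def binary_hash_encode(series: pd.Series, n_bits: int = 8) -> pd.DataFrame:
    """n_bits output columns give up to 2**n_bits distinct rows."""
    ordinals = series.astype(str).map(lambda v: bucket(v, n_bits)).to_numpy()
    cols = {f"col_{i}": (ordinals >> i) & 1 for i in range(n_bits)}
    return pd.DataFrame(cols, index=series.index)


df = pd.DataFrame({"col_a": [str(x) for x in range(100)]})
encoded = binary_hash_encode(df["col_a"], n_bits=8)
# 8 bits address 256 buckets, so 100 categories mostly stay distinct
# (up to hash collisions) -- versus only 8 distinct rows today.
print(encoded.drop_duplicates().shape[0])
```

With this scheme the number of representable categories grows as 2^n_bits rather than n_components, which is the behavior the reporter expected.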
Thanks for pointing this out, Ben!
To me, 8 bits sounds like it should mean 2^8 hash values. Also, the default in sklearn is 2^20 features, whereas ours is 8.
I've removed the bug label since this is actually the correct behavior. Keeping the issue open to discuss whether to fix the documentation or drop the encoder altogether (in a backwards-compatible way, pointing to sklearn instead).
Expected Behavior
If I set `n_components` to 8, for example, I should be able to encode columns with up to 2^8 values.
Actual Behavior
Whatever I set `n_components` to becomes the number of unique rows in the encoded output. For example, if I set `n_components` to 32, there only end up being 32 unique rows. This will not do as an encoder, since it produces countless collisions.
This issue does not come up with binary encoding, for example: I can successfully encode my high-cardinality categorical column with it.
Steps to Reproduce the Problem
Specifications