
Hash Encoding Outputting High Collision Output #402

Open
eddietaylor opened this issue Mar 20, 2023 · 4 comments

Comments

@eddietaylor

Expected Behavior

If I set n_components to 8, for example, I should be able to encode columns with up to 2^8 values.

Actual Behavior

Whatever I set n_components to becomes the number of unique rows in the encoded output. For example, if I set n_components to 32, there end up being only 32 unique encoded rows. This obviously will not do as an encoder, as there are countless collisions.

This issue does not come up with binary encoding, for example; I can successfully encode my high-cardinality categorical column with it.

Steps to Reproduce the Problem

  1. Fit the transformer to a high-cardinality categorical column.
  2. Look at the number of unique rows across the col_0, ..., col_7 combinations.

Specifications

  • Version: category_encoders 2.6.0
  • Platform:
  • Subsystem:
@PaulWestenthanner
Collaborator

Hi Eddie,

thanks for reporting this issue.
I've created this example to reproduce the problem:

from category_encoders import HashingEncoder
import pandas as pd
import random

categories = [str(x) for x in range(100)]
data = [random.choice(categories) for _ in range(1000)]
df = pd.DataFrame({"col_a": data})

my_enc = HashingEncoder()
df_out = my_enc.fit_transform(df)
df_out["dummy"] = 1
# count rows per unique combination of the 8 hash columns
print(df_out.groupby([f"col_{i}" for i in range(8)]).count())

I'm really surprised that this seems to have gone unnoticed for 7 years.
It seems like we need to modify this line to basically do some logarithmic action, and we need to join the strings of the whole row before hashing.

tmp[int(hasher.hexdigest(), 16) % N] += 1
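A rough sketch of the idea (purely illustrative; tmp, hasher and N are the names used around that line, and this bit layout is an assumption, not existing library code):

val = int(hasher.hexdigest(), 16) % (2 ** N)
for i in range(N):
    # write bit i of the reduced hash into column i, so N columns can
    # distinguish up to 2**N hash values instead of only N buckets
    tmp[i] = (val >> i) & 1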

Also note a similar question and discussion in #350

@bmreiniger
Contributor

This is expected behavior for a hashing encoder, I think. See Wikipedia, and compare with sklearn's implementation.

A binary encoding based on a hash function could be interesting as a new encoder, though. It could hash to integers and then apply binary encoding to those. Hashing to an "ordinal" encoding first could be factored out, and the current transformer would then be that followed by one-hot?
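Something along these lines, as a quick prototype (not an existing encoder; hash_to_int is a made-up helper, and md5 with 8 bits is an arbitrary choice):

from category_encoders import BinaryEncoder
import hashlib
import pandas as pd
import random

categories = [str(x) for x in range(100)]
data = [random.choice(categories) for _ in range(1000)]
df = pd.DataFrame({"col_a": data})

def hash_to_int(value, n_bits=8):
    # reduce a stable hash of the category to n_bits, bounding the value range
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % (2 ** n_bits)

hashed = df["col_a"].map(hash_to_int).to_frame("col_a_hashed")
df_out = BinaryEncoder(cols=["col_a_hashed"]).fit_transform(hashed)
# with 8 hash bits, up to 2**8 = 256 distinct codes are representable
print(df_out.drop_duplicates().shape[0])

The hash_to_int step would be the hash-to-ordinal stage that could be factored out; the current HashingEncoder would then be that stage followed by one-hot.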

@PaulWestenthanner
Collaborator

Thanks for pointing this out, Ben!
But then our documentation and default values are wrong.

n_components: int
  how many bits to use to represent the feature. By default, we use 8 bits.
  For high-cardinality features, consider using up-to 32 bits.

8 bits to me sounds like 2^8 hash values. Also, the default in sklearn is 2^20, whereas ours is 8.
Given that the HashingEncoder does not implement handle NA / handle missing and does not keep the mapping, I'm starting to wonder what the benefit over sklearn is. Sure, we take DataFrames as input and have some multiprocessing features, but we'd need to test whether our implementation is faster at all. I'd suspect that sklearn uses better optimised libraries, sparse output, etc.
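For reference, the sklearn counterpart on the same kind of toy data might look like this (a minimal sketch; FeatureHasher's default n_features is 2**20 and its output is sparse):

from sklearn.feature_extraction import FeatureHasher
import random

categories = [str(x) for x in range(100)]
data = [random.choice(categories) for _ in range(1000)]

# FeatureHasher hashes into a large sparse space, so collisions stay rare
hasher = FeatureHasher(n_features=2**20, input_type="string")
X = hasher.transform([[value] for value in data])
print(X.shape, X.nnz)

The output is a scipy sparse matrix, which is what makes the 2**20-dimensional space cheap to store.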

@PaulWestenthanner
Collaborator

I've removed the bug label since this is actually the correct behavior. Keeping the issue open to discuss whether to fix the documentation or just drop the encoder altogether (in a backwards-compatible way, pointing to sklearn instead).
