Hash Encoding Outputting High Collision Output #402

Open
@eddietaylor

Description

Expected Behavior

If I set n_components to 8, for example, I should be able to encode columns with up to 2^8 = 256 distinct values.

Actual Behavior

The number of unique encoded rows I get equals whatever I set n_components to. For example, if I set n_components to 32, the encoded output contains only 32 unique rows. This makes it unusable as an encoder for a high-cardinality column, because most distinct values collide.
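
The behavior described above is what one would expect if the encoder follows the usual hashing trick, where each value is hashed to a single bucket index modulo n_components and that bucket's count is incremented. The sketch below is an illustration under that assumption, not the library's code: the helper names `hash_encode_value` and `count_unique_rows` are hypothetical, and md5 is chosen only for the example.

```python
import hashlib


def hash_encode_value(value, n_components):
    """Encode one value as a row of n_components counts (hashing-trick sketch)."""
    row = [0] * n_components
    # Assumption: the value's string form is hashed and reduced modulo n_components.
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % n_components
    row[bucket] += 1
    return tuple(row)


def count_unique_rows(values, n_components):
    """Count distinct encoded rows produced for a single input column."""
    return len({hash_encode_value(v, n_components) for v in values})


# 1000 distinct categories, but each row has exactly one non-zero entry,
# so there can never be more than n_components distinct rows.
values = [f"category_{i}" for i in range(1000)]
unique_rows = count_unique_rows(values, n_components=32)
print(unique_rows)
```

Under this scheme a single encoded column can yield at most n_components distinct rows, not 2^n_components, which would match the count reported here.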

This issue does not come up with binary encoding, for example: I can successfully encode my high-cardinality categorical column with it.
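
For contrast, here is a minimal sketch of why a binary code reaches 2^n distinct rows with n columns. It assumes each category is first given an integer index that is then written in base 2. The helper `binary_encode_index` is hypothetical and is not the library's implementation.

```python
def binary_encode_index(index, n_bits):
    """Write a non-negative integer index as a tuple of n_bits binary digits."""
    if not 0 <= index < 2 ** n_bits:
        raise ValueError("index does not fit in n_bits")
    # Most significant bit first.
    return tuple((index >> shift) & 1 for shift in reversed(range(n_bits)))


# 256 distinct categories, each mapped to its position in the list.
values = [f"category_{i}" for i in range(256)]
index_of = {v: i for i, v in enumerate(values)}

rows = {binary_encode_index(index_of[v], 8) for v in values}
print(len(rows))  # 256: every category gets its own 8-bit row
```

Here every bit pattern is a valid code, so 8 columns distinguish 256 values, whereas a one-bucket-per-value hash uses only the patterns with a single non-zero entry.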

Steps to Reproduce the Problem

  1. Fit the encoder to some high-cardinality categorical column and transform it.
  2. Count the number of unique rows across the col_0, ..., col_7 output columns.

Specifications

  • Version: category_encoders 2.6.0
  • Platform:
  • Subsystem:
