Auto encoding for categorical data during inference. #11088

Open
2 of 6 tasks
trivialfis opened this issue Dec 11, 2024 · 3 comments

Comments

@trivialfis
Member

trivialfis commented Dec 11, 2024

We are working on automatic re-encoding for categorical features during inference. This teaches the booster to handle data encoded differently than the training dataset and eliminates the need for a scikit-learn pipeline for data encoding when using DataFrame inputs.

  • Python native
  • Dask
  • R
  • PySpark
  • Scala/Spark
  • Thread safety

Removed the Spark variants: a Spark DataFrame doesn't carry a categorical encoding. Use the StringIndexer instead.
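To make the re-encoding idea concrete, here is a minimal sketch in plain Python. It assumes a categorical column is stored as a list of categories plus integer codes that index into it, with -1 for a missing value, and that categories unseen during training also map to -1. The names `build_recode_map` and `recode` are hypothetical and are not part of XGBoost.

```python
from typing import Hashable, List, Sequence


def build_recode_map(
    train_categories: Sequence[Hashable],
    infer_categories: Sequence[Hashable],
) -> List[int]:
    """For each inference-time category, find its training-time code.

    Categories that were not seen during training map to -1 (an assumption
    of this sketch, not a statement about XGBoost's behaviour).
    """
    position = {cat: i for i, cat in enumerate(train_categories)}
    return [position.get(cat, -1) for cat in infer_categories]


def recode(codes: Sequence[int], mapping: Sequence[int]) -> List[int]:
    """Translate inference-time codes into training-time codes.

    A negative code marks a missing value and is passed through as -1.
    """
    return [mapping[c] if c >= 0 else -1 for c in codes]


# The same three categories, but ordered differently at inference time.
train_categories = ["a", "b", "c"]
infer_categories = ["c", "b", "a"]
infer_codes = [0, 1, 0, 2, -1]  # c, b, c, a, missing

mapping = build_recode_map(train_categories, infer_categories)
recoded = recode(infer_codes, mapping)
print(recoded)
```

The mapping is built once per column, so the per-row work is a single list lookup. Building the mapping is where the search through the levels happens.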

Related:

Notes:
Looking into the Arrow CPU implementation, its compute module dispatches based on whether a null mask is present. If one is present, it finds stretches of consecutive valid values (each called a run) and then iterates over each run. This way, it avoids evaluating a validity predicate for every element. The runs are found using compiler builtins with leading nnz counting.
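A rough illustration of the run-based approach, in plain Python rather than Arrow's C++. It assumes the validity bitmap is packed into a single integer with element 0 in the least-significant bit, and it uses a count-trailing-zeros trick on Python integers in place of the compiler builtins. The function names are hypothetical; this is a sketch of the idea, not Arrow's code.

```python
from typing import Iterator, Optional, Sequence, Tuple


def _trailing_zeros(x: int) -> int:
    """Number of zero bits below the lowest set bit of a non-zero integer."""
    return (x & -x).bit_length() - 1


def valid_runs(bitmap: int, length: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, run_length) for each maximal run of valid elements."""
    bits = bitmap & ((1 << length) - 1)  # ignore bits past the array length
    pos = 0
    while bits:
        skip = _trailing_zeros(bits)  # nulls before the next run
        bits >>= skip
        pos += skip
        run = _trailing_zeros(~bits)  # consecutive valid elements
        yield pos, run
        bits >>= run
        pos += run


def sum_valid(values: Sequence[int], bitmap: Optional[int] = None) -> int:
    """Sum the valid elements, dispatching on whether a null mask exists."""
    if bitmap is None:
        return sum(values)  # fast path: no validity checks at all
    total = 0
    for start, run in valid_runs(bitmap, len(values)):
        # Inside a run every element is valid, so no per-element predicate.
        total += sum(values[start:start + run])
    return total


values = [1, 2, 3, 4, 5, 6, 7, 8]
mask = 0b01110011  # elements 0, 1, 4, 5, 6 are valid
print(list(valid_runs(mask, len(values))))
print(sum_valid(values, mask))
```

The branch on validity is paid once per run instead of once per element, which is the point of the dispatch described in the note.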

Tracking PRs:

@trivialfis
Member Author

trivialfis commented Dec 11, 2024

@david-cortes May I ask how the tmp is kept alive during the construction of QDM?

The setData from the proxy DMatrix only holds a reference (a C pointer) to the input data. If the data is released early, that is a use-after-free error. However, testing qdm.from_iterator seems fine.

@david-cortes
Contributor

@trivialfis In that function, the DMatrix is set in the line right before tmp gets deleted. Or are you saying that the data still needs to be kept after the DMatrix has been set? If so, at which point is it safe to deallocate the data?

Regarding the feature: since the idea is to have this feature in different interfaces, how would it work behind the scenes?

Would be ideal if the categorical encodings could get saved in the booster and be used in plots/trees-to-tables/jsons/etc. (#9927). Better yet if it's a standardized C-level attribute so that the encodings could survive transfers from one interface to another.

I see some potential difficulties though:

  • Pandas allows arbitrary Python objects for categorical encodings (not JSON-friendly).
  • Different data formats (pandas, arrow, polars, etc.) might have different limitations for what kind of values they allow as categories (e.g. strings-only vs. also integers).

@trivialfis
Member Author

trivialfis commented Dec 11, 2024

Or are you saying that the data still needs to be kept after the DMatrix has been set

It needs to be kept until the iterator's next call to next or reset. I got a bit confused since the data is deleted right after the setData call, yet there is no access error. So, in theory, it should be kept after the set, and until the next iteration.
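A small sketch of that lifetime rule in plain Python. The iterator itself owns the current batch and drops it only when next or reset is called again, so a consumer that merely borrows the data never sees it released mid-use. All class names here are hypothetical stand-ins; the weak reference plays the role of the non-owning C pointer, and none of this is XGBoost's actual iterator code.

```python
import weakref
from typing import Callable, List, Optional, Sequence


class Batch:
    """A temporary object created for one iteration step."""

    def __init__(self, rows: List[int]) -> None:
        self.rows = rows


class BorrowingConsumer:
    """Keeps only a non-owning reference, like a proxy holding a C pointer."""

    def __init__(self) -> None:
        self._ref: Optional[weakref.ref] = None

    def set_data(self, batch: Batch) -> None:
        self._ref = weakref.ref(batch)  # does not keep the batch alive

    def read(self) -> List[int]:
        batch = self._ref() if self._ref is not None else None
        if batch is None:
            raise RuntimeError("input was released while still borrowed")
        return batch.rows


class HoldingIterator:
    """Owns the current batch until the following next() or reset()."""

    def __init__(self, chunks: Sequence[Sequence[int]]) -> None:
        self._chunks = chunks
        self._pos = 0
        self._held: Optional[Batch] = None

    def next(self, set_data: Callable[[Batch], None]) -> bool:
        self._held = None  # the previous batch may be released now
        if self._pos == len(self._chunks):
            return False
        batch = Batch(list(self._chunks[self._pos]))  # the temporary
        self._held = batch  # keep it alive past the end of this call
        set_data(batch)
        self._pos += 1
        return True

    def reset(self) -> None:
        self._held = None
        self._pos = 0


it = HoldingIterator([[1, 2], [3, 4]])
consumer = BorrowingConsumer()
seen = []
while it.next(consumer.set_data):
    seen.append(consumer.read())  # safe: the iterator still holds the batch
print(seen)
```

Without the `_held` attribute, the temporary would be released as soon as `next` returns, and `read` would fail in this sketch. In native code the same mistake is a use-after-free that may go unnoticed.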

Regarding the feature: since the idea is to have this feature in different interfaces, how would it work behind the scenes?

We will store the levels in the booster as you suggested. Things will be handled in C++. We might allow users to optionally disable the encoder for performance reasons (searching through levels is not cheap in the context of inference, especially with strings).

Better yet if it's a standardized C-level attribute so that the encodings could survive transfers from one interface to another.

Currently, I'm returning the categories in the Arrow columnar format with the help of pyarrow. I haven't decided on the exact return type yet. In my experimental code, it's just a Python map (dictionary) from feature names to Arrow arrays.

Pandas allows arbitrary Python objects for categorical encodings

We accept only strings and some other primitive types like integers. Still working on the typing part. The pandas.Index (used for representing categories) should have the same type as the feature column, so it can't be arbitrary and has to be something XGBoost can understand.
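As a sketch of what such a restriction could look like, the check below accepts a list of categories only if every entry is a string or every entry is an integer. The function name `category_kind` is hypothetical. The accepted set of types is an assumption of this sketch, since the typing part is still undecided, and rejecting bool is likewise an assumption.

```python
from typing import Sequence


def category_kind(categories: Sequence[object]) -> str:
    """Return "str" or "int" for a homogeneous, supported category list.

    Raises TypeError for empty, mixed, or unsupported categories. Exact
    type matching is used, so bool (a subclass of int) is rejected here.
    """
    kinds = {type(cat) for cat in categories}
    if len(kinds) != 1:
        raise TypeError("categories must be non-empty and share one type")
    (kind,) = kinds
    if kind is str:
        return "str"
    if kind is int:
        return "int"
    raise TypeError(f"unsupported category type: {kind.__name__}")


print(category_kind(["red", "green", "blue"]))
print(category_kind([10, 20, 30]))
```

A check like this would run once per column when the categories are read from the input, before any re-encoding takes place.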
