
Strange behaviour with XGBoost when dealing with categorical data type, possibly a bug #9676

Closed
abhishek0093 opened this issue Oct 14, 2023 · 29 comments · Fixed by #9678

@abhishek0093

abhishek0093 commented Oct 14, 2023

Hi Community !
I'm experiencing strange behavior with the XGBoost module when using it with categorical data. I have attached a sample file here with comments and results to reproduce the issue (link).

The issue I'm facing is that during training I enabled categorical support and followed the documentation's guidelines: I encoded the categories to lie in the range [0, num_categories). The training phase follows the expected transformation of the data in the DMatrix.
But during the test phase, when I apply the same transformations and enable categorical support, the transformation in the DMatrix for a larger (>2 rows) dataframe doesn't follow the same encoding as during the training phase. Also, if I make predictions row by row, each of the categorical features gets encoded as 0.
Strangely, if I treat those categorical features as integers, I get the expected transformation in the DMatrix. But the result differs from treating the features as categorical, and I'm not sure which one (if either) is correct.

I think there is a problem with how categories are transformed during the testing phase on a completely new dataset. XGBoost tries to encode each provided category to lie in the range [0, num_categories), regardless of how the categories were treated during training. For example, suppose during the training phase I had 100 unique categories, each within the range [0, 100). If during the test phase I provide the same categorical column containing only the original cat_ids 89 and 98, XGBoost transforms them to cat_ids 0 and 1, which I think it shouldn't.
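
For illustration, a minimal pandas-only sketch (with made-up values) of the re-encoding I'm describing: converting a fresh dataframe to the category dtype assigns codes based only on the values present in that dataframe.

import pandas as pd

train = pd.DataFrame({"cat_id": list(range(100))}).astype({"cat_id": "category"})
test = pd.DataFrame({"cat_id": [89, 98]}).astype({"cat_id": "category"})

# Codes follow the categories of each individual dataframe.
print(train["cat_id"].cat.codes.tail(2).tolist())  # [98, 99]
print(test["cat_id"].cat.codes.tolist())            # [0, 1] -- re-encoded from scratch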

I would like to hear from the community whether there is something I'm missing or whether this is unexpected behavior.

Edit (14 Oct 2023): I'm also not sure whether there should be a parameter to validate that the feature data types of the DMatrix passed to the model for prediction match those used during training; something similar to `validate_features` of the predict method, but for validating the type of the data supplied.

@trivialfis
Member

Let me look into it.

@trivialfis
Member

trivialfis commented Oct 16, 2023

Hi, in the notebook the test data was created from scratch and doesn't maintain the category mapping from the training dataset. XGBoost by itself doesn't save information on how the data is mapped (that is part of the user's data engineering pipeline).

You can optionally use the OrdinalEncoder from sklearn (or other libraries providing similar features), or use your own mapping function with pandas (for instance, pd.Categorical.from_codes).
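
For example, a rough sketch (with made-up category values) of keeping the training categories around and reusing them at prediction time with pd.Categorical.from_codes:

import pandas as pd

train_categories = pd.Index([3, 7, 42, 89, 98])  # categories observed during training
test_values = pd.Series([89, 98])                # raw values seen at prediction time

# Map raw values to their positions in the training categories; unseen values become -1 (missing).
codes = train_categories.get_indexer(test_values)
encoded = pd.Categorical.from_codes(codes=codes, categories=train_categories)
print(list(encoded.codes))  # [3, 4] -- consistent with training, not re-encoded to [0, 1]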

@trivialfis
Member

Added a brief document for this: #9678.

@abhishek0093
Author

abhishek0093 commented Oct 17, 2023

Hi @trivialfis,
Thank you for responding to the issue.
However, I'm not sure I understood your comment fully, because I'm still facing the same issue even after using the same OrdinalEncoder() object at both training and testing time (link).
Could you please help with a small example to illustrate your point, as it seems I'm missing something here? I have a dataset (similar to the one in the attached notebook) containing categorical data (present as integers, not necessarily contiguous) that I want to train on. Later I want to deploy the built model into production, and for that I want to ensure the DMatrix constructed during the test phase on an entirely new dataset follows the same transformation rules as during the training phase, which currently doesn't seem to happen.

Edit 17 Oct 2023: I had thought of putting every step inside a pipeline and using .fit_transform() during training and .transform() at test time so that all transforms are remembered. But I'm not sure how to put DMatrix creation inside a pipeline (I can do it for the sklearn wrapper, but not for the native xgboost interface), or whether it is possible at all.

@trivialfis
Member

@abhishek0093 I don't think one needs to put DMatrix construction into a pipeline; if you can get the sklearn interface working, then you can get DMatrix working. Looking into the fit method, you will find that the first thing XGBoost does is create a DMatrix.

@trivialfis
Member

I can try to write an example using the ordinal encoder tomorrow if you haven't got it working yet.

abhishek0093 reopened this Oct 18, 2023
@abhishek0093
Author

@trivialfis I have already built some of my code around the native xgboost module, so it would be advantageous to keep using it. Also, I think the sklearn wrapper currently doesn't support some methods when dealing with categorical data, which was the reason I switched to the native module.

I have already tried various options, but none seems to work. Currently the problem is that the DMatrix created during the testing phase doesn't know that the data is already encoded and that it just needs to parse it as-is, without the internal processing that maps it into the range [0, n) as is required during training. One way was to put everything inside a pipeline and only use transform() during the testing phase, but I'm not sure how we could do this. I think there is something I'm missing or something unexpected here.

Can you please give a working example for the issue, to make sure we get a correct DMatrix encoding during the test phase on a new dataset when dealing with categorical data?

@trivialfis
Member

@abhishek0093 I put together a quick script based on your notebook, see if it addresses the issue:

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

np.random.seed(3)


def make_df():
    """Generate data for demo/testing."""
    n_samples = 512
    df = pd.DataFrame(
        np.random.randint(32, 96, size=(n_samples, 3)),
        columns=["brand_id", "retailer_id", "category_id"],
    )

    df["price"] = np.random.randint(100, 200, size=(n_samples,))
    df["stock_status"] = np.random.choice([True, False], n_samples)
    df["on_sale"] = np.random.choice([True, False], n_samples)
    df["label"] = np.random.normal(loc=0.0, scale=1.0, size=n_samples)

    features_list = [
        "brand_id",
        "retailer_id",
        "category_id",
        "stock_status",
        "on_sale",
        "price",
    ]
    categorical_features = ["brand_id", "retailer_id", "category_id"]

    X = df[features_list]
    y = df["label"]

    return X, y, categorical_features


X, y, cat_feats = make_df()


def native():
    """Using the native XGBoost interface."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=3, test_size=0.2
    )

    enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan)
    enc.set_output(transform="pandas")

    enc = enc.fit(X_train[cat_feats])

    def enc_transform(X: pd.DataFrame) -> pd.DataFrame:
        # don't make change inplace so that we can have demonstrations for encoding
        X = X.copy()
        cat_cols = enc.transform(X[cat_feats])
        for i, name in enumerate(cat_feats):
            cat_cols[name] = pd.Categorical.from_codes(
                codes=cat_cols[name].astype(np.int32), categories=enc.categories_[i]
            )
        X[cat_feats] = cat_cols
        return X

    X_train_enc = enc_transform(X_train)
    X_test_enc = enc_transform(X_test)

    Xy_train = xgb.QuantileDMatrix(X_train_enc, y_train, enable_categorical=True)
    Xy_test = xgb.QuantileDMatrix(
        X_test_enc, y_test, enable_categorical=True, ref=Xy_train
    )
    booster = xgb.train({}, Xy_train)
    booster.predict(Xy_test)

    # This addresses the question in the issue, to show that the encoding is done
    # consistently.
    # We first obtain result from newly encoded data
    predt0 = booster.inplace_predict(enc_transform(X_train.head(16)))
    # then we obtain result from already encoded data from training.
    predt1 = booster.inplace_predict(X_train_enc.head(16))
    np.testing.assert_allclose(predt0, predt1)


def pipeline():
    """Using sklearn pipeline."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=3, test_size=0.2
    )

    enc = make_column_transformer(
        (
            OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
            make_column_selector(pattern=".*_id"),
        ),
        remainder="passthrough",
        verbose_feature_names_out=False,
    )
    # No need to set pandas output, we use `feature_types` to indicate the type of
    # features.

    # enc.set_output(transform="pandas")

    feature_types = ["c" if fn in cat_feats else "q" for fn in X_train.columns]
    reg = xgb.XGBRegressor(
        feature_types=feature_types, enable_categorical=True, n_estimators=10
    )
    p = make_pipeline(enc, reg)
    p.fit(X_train, y_train)
    for a, b in zip(reg.get_booster().feature_types, feature_types):
        assert a == b

    predt0 = p.predict(X_train.iloc[:16, :])
    predt1 = p.predict(X_train)[:16]

    # This addresses the question in the issue, to show that the encoding is done
    # consistently.
    np.testing.assert_allclose(predt0, predt1)


if __name__ == "__main__":
    pipeline()
    native()

@abhishek0093
Author

abhishek0093 commented Oct 19, 2023

Hi @trivialfis ,
Thanks a lot for the help; the issue is resolved now. I appreciate you taking the time to resolve it.

I think the mistake I made might be common, and the approach mentioned above doesn't seem to be present on the documentation page for categorical data. Can you please put this example on the documentation page, or link this issue there, as it might be helpful for others facing the same issue? If you need my help with this, I would be very happy to assist.

@trivialfis
Member

@abhishek0093 I pushed the example into #9678 with small cleanups; feel free to take a look when you get a chance.

@sergiotj

Is this issue closed? I am wondering whether many people affected by this will notice that the documentation has changed and that they need to add ordinal encoding for categorical features to work properly.

@mustaq95

@trivialfis @abhishek0093 Can you please look into my issue? It's very similar to this one. I have added the link to reproduce the issue.
The issue is that I'm observing different quantile prediction values when the model is trained with different column orders (sorted alphabetically) in the QuantileDMatrix for a given alpha list.
This breaks model consistency even when training on the same dataset. Please help me fix this.

@abhishek0093
Author

Hi @mustaq95 ,
I will look into this.

@abhishek0093
Author

abhishek0093 commented Nov 16, 2023

Hi @mustaq95,
I looked into the code and it surprises me too, since only the ordering was changed and that shouldn't affect the results in general. The results match as expected if we run with default parameters. As I'm not very familiar with the quantile error and its implementation, @trivialfis would be a better person to look into this.

Edit: For others looking for a solution to the above issue, it has been discussed here: #9788

@jhaneyrf

jhaneyrf commented Apr 1, 2024

I've spent 4 hours figuring out how to get this to work for me, so I thought I'd share some of my findings to make things easier for others in the future. I think the code for the pipeline() function that @trivialfis provides above could benefit from a little tweak. The column transformer created by make_column_transformer() reorders the columns of the input dataframe, and the XGBoost model object needs feature_types to be specified in this new order, or else it produces an error message (at least this is true for sklearn 1.2.2).

The only reason this doesn't cause a problem in the sample code is that the transformed columns all start out at the beginning of the input dataframe. A version that I believe would work with the code example in the documentation is:

...
# After columns are transformed, transformed columns appear at the beginning of the list of columns and must be passed to
# XGBoost accordingly
cat_types = ["c" for col in X_train.columns if col.endswith("_id")]
feature_types = cat_types + ["q"] * (len(X_train.columns) - len(cat_types))
...
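
A more general alternative (a sketch, assuming the enc column transformer and X_train from the pipeline() example above, and a scikit-learn version that supports get_feature_names_out) is to fit the transformer first and derive feature_types from the column order it actually emits:

...
enc.fit(X_train)
out_cols = enc.get_feature_names_out()  # column names in the order the transformer outputs them
feature_types = ["c" if name.endswith("_id") else "q" for name in out_cols]
...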

Please let me know if you would prefer that I submit this as a new issue.

I also find that I can't get the pipeline object to work correctly if I want to supply the eval_set parameter, but that seems like it would require a big enough change to justify its own issue.

@trivialfis
Member

Thank you for sharing, @jhaneyrf. It would be great if you could help enhance the existing document with a PR. :-)

@jhaneyrf

jhaneyrf commented Apr 4, 2024

I'll put some time into this over the weekend.

@BBITWestin

I am using the xgboost native interface for encoding categorical columns during training. Everything works great and I'm able to pass np.testing.assert_allclose(predt0, predt1). But once I save the model to a JSON file and load it back in for inference with inplace_predict, passing in different categories (in the exact same format as our training data) does not have any impact on the output, as if all inputs were being encoded to -1 per enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1). Can we see an example of how to save and load a model that has been trained with cat features?

@trivialfis
Member

Could you please provide a reproducible example?

@BBITWestin

XGBoost Categorical Data Handling Issue

What I'm trying to demonstrate:

  1. I'm training an XGBoost model with a mix of integer and string categorical features, along with numerical features.
  2. I'm saving this model and then loading it for inference.
  3. I'm attempting to show that the model handles categorical data consistently, whether it's predicting on bulk data or individual rows.

My current approach:

  1. I create a bulk dataset for prediction, with categorical values within the range of my training data.
  2. I make predictions on this bulk dataset.
  3. I then create individual DataFrames for each row and make predictions on these single-row DataFrames.
  4. I compare the predictions from the bulk operation with the individual row predictions.

What I'm observing:

The predictions for individual rows are not matching the predictions from the bulk dataset. Specifically:

  • The first row's prediction matches between bulk and individual prediction.
  • Subsequent rows' predictions do not match, with the individual row predictions appearing to repeat the first row's prediction.

What I'm unsure about:

  1. Do I need to convert categorical columns to the 'category' dtype for inference? I'm currently doing this to avoid a ValueError, but is this the correct approach?

  2. Should I be using model.predict() with a DMatrix, or model.inplace_predict() with a pandas DataFrame? I've tried both approaches but am unsure which is correct for categorical data.

  3. Am I missing a step in preparing the data for inference that's causing the categorical encodings to be inconsistent between bulk and individual predictions?

  4. Is there a difference in how XGBoost handles categorical data during training versus inference that I'm not accounting for?

  5. Could the issue be related to the order of categories or how XGBoost is interpreting the categorical data in single-row predictions versus bulk predictions?

Code snippet demonstrating the issue:

import os
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

import xgboost as xgb


def make_example_data():
    n_samples = 2048
    rng = np.random.default_rng(1994)

    # Mix of integer and string categorical features
    brand_ids = rng.integers(1, 21, size=n_samples)  # 20 unique integer brands
    retailer_ids = [f"R{i}" for i in rng.integers(1, 11, size=n_samples)]  # 10 unique string retailers
    category_ids = rng.choice(['A', 'B', 'C', 'D', 'E'], size=n_samples)  # 5 unique string categories

    df = pd.DataFrame({
        "brand_id": brand_ids,
        "retailer_id": retailer_ids,
        "category_id": category_ids,
        "price": rng.integers(100, 200, size=n_samples),
        "stock_status": rng.choice([True, False], n_samples),
        "on_sale": rng.choice([True, False], n_samples)
    })

    df["label"] = rng.normal(loc=0.0, scale=1.0, size=n_samples)

    X = df.drop(["label"], axis=1)
    y = df["label"]

    categorical_features = ["brand_id", "retailer_id", "category_id"]
    for col in categorical_features:
        X[col] = X[col].astype('category')

    return X, y, categorical_features


def train_and_save_model() -> None:
    """Using the native XGBoost interface."""
    X, y, cat_feats = make_example_data()

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=1994, test_size=0.2
    )

    # Create an encoder based on training data.
    enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan)
    enc.set_output(transform="pandas")
    enc = enc.fit(X_train[cat_feats])

    def enc_transform(X: pd.DataFrame) -> pd.DataFrame:
        # don't make change inplace so that we can have demonstrations for encoding
        X = X.copy()
        cat_cols = enc.transform(X[cat_feats])
        for i, name in enumerate(cat_feats):
            # create pd.Series based on the encoder
            cat_cols[name] = pd.Categorical.from_codes(
                codes=cat_cols[name].astype(np.int32), categories=enc.categories_[i]
            )
        X[cat_feats] = cat_cols
        return X

    # Encode the data based on fitted encoder.
    X_train_enc = enc_transform(X_train)
    X_test_enc = enc_transform(X_test)
    # Train XGBoost model using the native interface.
    Xy_train = xgb.QuantileDMatrix(X_train_enc, y_train, enable_categorical=True)
    Xy_test = xgb.QuantileDMatrix(
        X_test_enc, y_test, enable_categorical=True, ref=Xy_train
    )
    booster = xgb.train({}, Xy_train)
    booster.predict(Xy_test)

    # Following shows that data are encoded consistently.

    # We first obtain result from newly encoded data
    predt0 = booster.inplace_predict(enc_transform(X_train.head(16)))
    # then we obtain result from already encoded data from training.
    predt1 = booster.inplace_predict(X_train_enc.head(16))

    np.testing.assert_allclose(predt0, predt1)

    # Save the model
    booster.save_model('booster_with_encoding.json')


def load_model() -> xgb.Booster:
    loaded_model = xgb.Booster()
    loaded_model.load_model('booster_with_encoding.json')
    return loaded_model

def test_model():
    model = load_model()

    # Create a small dataset with known categorical values for bulk prediction
    test_data_bulk = pd.DataFrame({
        'brand_id': [1, 5, 10, 15, 20],  # Within range of 1-20
        'retailer_id': ['R1', 'R5', 'R7', 'R9', 'R10'],  # Within range of R1-R10
        'category_id': ['A', 'B', 'C', 'D', 'E'],  # All possible categories
        'price': [150, 160, 170, 180, 190],
        'stock_status': [True, False, True, False, True],
        'on_sale': [False, True, False, True, False]
    })

    # Should I be converting inferences to type category? If I don't I get this error:
    # ValueError: DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, the experimental DMatrix parameter`enable_categorical` must be set to `True`.  Invalid columns:retailer_id: object, category_id: object
    # Convert categorical columns to 'category' dtype
    cat_feats = ['brand_id', 'retailer_id', 'category_id']
    for col in cat_feats:
        test_data_bulk[col] = test_data_bulk[col].astype('category')

    # Not sure which of these I should use: predict with a DMatrix vs inplace_predict with a pandas df
    # Make bulk predictions
    # dbulk = xgb.DMatrix(test_data_bulk, enable_categorical=True)
    predictions_bulk = model.inplace_predict(test_data_bulk)

    print("Bulk Predictions:")
    print(predictions_bulk)

    # Now, let's create and predict each row independently
    for i in range(len(test_data_bulk)):
        single_row = pd.DataFrame({col: [test_data_bulk.iloc[i][col]] for col in test_data_bulk.columns})
        # Should I be converting inferences to type category? If I don't I get this error:
        # ValueError: DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, the experimental DMatrix parameter`enable_categorical` must be set to `True`.  Invalid columns:retailer_id: object, category_id: object
        for col in cat_feats:
            single_row[col] = single_row[col].astype('category')
        
        # dsingle = xgb.DMatrix(single_row, enable_categorical=True)
        # prediction_single = model.predict(dsingle)
        prediction_single = model.inplace_predict(single_row)

        print(f"\nRow {i}:")
        print(single_row)
        print(f"Bulk prediction: {predictions_bulk[i]}")
        print(f"Single row prediction: {prediction_single[0]}")
        
        np.testing.assert_allclose(predictions_bulk[i], prediction_single[0], rtol=1e-5, atol=1e-8)
        print("Predictions match!")

if __name__ == "__main__":
    if not os.path.exists("booster_with_encoding.json"):
        print("Training and saving model...")
        train_and_save_model()
    else:
        print("Model already exists. Skipping training.")
    
    print("Testing model...")
    test_model()

trivialfis reopened this Aug 16, 2024
@BBITWestin

BBITWestin commented Aug 16, 2024

I am currently under the impression that if we use the native XGBoost enable_categorical=True after encoding our training data with the OrdinalEncoder, we do not need to encode anything when it comes time to use the model for inference, meaning we should not need to save the encoder, as it is baked into the saved model.

@trivialfis
Member

Hi, I haven't looked into your example yet, really exhausted today.

I am currently under the impression that if we use the native XGBoost enable_categorical=True after encoding our training data with the OrdinalEncoder, we do not need to encode anything when it comes time to use the model for inference, meaning we should not need to save the encoder, as it is baked into the saved model.

This is not true. XGBoost doesn't encode any data; it simply trains a model on the inputs. That's its scope: train a boosting model and nothing more. Any data pre-processing step, including encoding, is left to the user. As a result, one should apply the pipeline consistently.

@BBITWestin

Understood. Does anyone know where I can find an example where we encode some categorical features, train a model, save the model AND save the encoder... then, in a separate script, load the model (and the encoder?) and prove that the loaded model/pipeline predicts using the correct category encodings?

I will try and make my own example this weekend and post it here. Thanks for getting back to me so soon!

@trivialfis
Member

trivialfis commented Aug 17, 2024

The sklearn pipeline is just an example to show what a general pipeline looks like. Understandably, one can find it confusing. Sklearn doesn't provide any means for stable serialization; pickling is the only option for sklearn estimators and transformers.

Since you are already using the native interface, I'm sure you can find a way to preserve the sklearn encoder's state, or roll your own encoder if necessary.
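
For instance, a rough sketch (hypothetical helper names, not part of XGBoost) of rolling your own mapping whose state can be stored as JSON next to the saved booster:

import json
import pandas as pd

def fit_mapping(series: pd.Series) -> dict:
    # Record the categories seen during training in a stable order.
    return {str(v): i for i, v in enumerate(sorted(series.unique()))}

def apply_mapping(series: pd.Series, mapping: dict) -> pd.Categorical:
    # Re-encode new data with the stored training mapping; unseen values map to -1 (missing).
    codes = series.astype(str).map(mapping).fillna(-1).astype(int)
    return pd.Categorical.from_codes(codes, categories=list(mapping))

mapping = fit_mapping(pd.Series([3, 7, 42, 89, 98]))
with open("brand_id_mapping.json", "w") as fh:
    json.dump(mapping, fh)  # store alongside booster.save_model(...)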

@trivialfis
Member

On the other hand, I should update all examples and tutorials to reflect the need for consistent encoding.

@trivialfis
Member

We plan to include some form of encoder into XGBoost to eliminate the footgun. Hopefully, it doesn't bring too much overhead.

@abhishek0093
Author

Hi,
Sorry, I missed a few messages above. @BBITWestin I'm not sure whether your issue is already resolved, but I would like to share my approach for saving the model and loading it in a separate file using the native xgboost module, following the example approach shared by @trivialfis. This preserved the encoding at load time for me.

from sklearn.preprocessing import OrdinalEncoder
import numpy as np
import pandas as pd
import xgboost as xgb
import pickle
def enc_transform(X: pd.DataFrame, ordinal_encoder, categorical_features) -> pd.DataFrame:
        """
        Transforms list represented by categorical_features to categorical_codes
        """
        X = X.copy()
        cat_cols = ordinal_encoder.transform(X[categorical_features])

        for i, name in enumerate(categorical_features):
            cat_cols[name] = pd.Categorical.from_codes(codes=cat_cols[name].astype(np.int32), categories=ordinal_encoder.categories_[i])
        X[categorical_features] = cat_cols
        return X


categorical_features = ["abc", "xyz"]  # Your list containing categorical features.
# The snippet assumes X_train, Y_train, train_sample_weights, bst_param and num_trees are defined elsewhere.
feature_types = ["c" if str(c) == "category" else ("i" if str(c) == "bool" else "q") for c in X_train.dtypes]
# Set Encoder 
ord_enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1, encoded_missing_value = -1, min_frequency=500).set_output(transform="pandas").fit(X_train[categorical_features])


#Transform Data and form Dmatrix
X_train = enc_transform(X_train, ord_enc, categorical_features)
dtrain = xgb.DMatrix(X_train, Y_train, enable_categorical=True, missing=-1, weight=train_sample_weights, feature_types=feature_types)

#Save Encoder 
encoder_location = "/location/encoder.pkl"
with open(encoder_location, "wb") as filepath:
    pickle.dump(ord_enc, filepath)



# Train and Save the model
model_location = "/location/model.json"
bst = xgb.train(bst_param, dtrain, num_boost_round=num_trees, verbose_eval=0)
bst.save_model(model_location)
# Load model
loaded_model = xgb.Booster() 
loaded_model.load_model(model_location) 


# Load encoder 
with open(encoder_location, 'rb') as filepath:
    loaded_encoder = pickle.load(filepath)

# Make predictions
df_test = ...  # dataframe containing all the input feature columns
model_prediction = loaded_model.inplace_predict(enc_transform(df_test, loaded_encoder, categorical_features))

@trivialfis
Member

Thank you for sharing. Going through this discussion, we are considering adding the encoder into the XGBoost booster.

@trivialfis
Member

Closing in favor of #11088 for a roadmap.
