Strange behaviour with XGBoost when dealing with categorical data type, possibly a bug #9676
Let me look into it.
Hi, in the notebook, the test data was created fresh and doesn't maintain the category mapping from the training dataset. XGBoost by itself doesn't save information on how the data is mapped (that is part of the user's data-engineering pipeline). You can optionally use the
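A small sketch of what that means in practice (my own illustration, not from the notebook): a fitted OrdinalEncoder pins the value-to-code mapping learned from the training frame, whereas building a fresh pandas categorical from the test frame alone re-derives the codes from whatever values happen to be present.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"brand_id": [32, 50, 90, 50]})
test = pd.DataFrame({"brand_id": [50, 90]})

enc = OrdinalEncoder().fit(train)
print(enc.transform(test).ravel())                             # [1. 2.] -> same codes as during training
print(test["brand_id"].astype("category").cat.codes.tolist())  # [0, 1]  -> re-derived from the test frame
```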
Added a brief document for this: #9678.
Hi @trivialfis ,
Edit 17 Oct 2023: I had thought to put every step inside a pipeline, use .fit_transform() during training, and later at test time use .transform() so that all transforms are remembered. But I'm not sure how to put DMatrix creation inside a pipeline (I can do it for the sklearn wrapper, but not for native XGBoost), or whether it is even possible.
@abhishek0093 I think one doesn't need to put DMatrix construction into a pipeline; if you can get the sklearn interface working, then you can get DMatrix working. Looking into the
I can try to write an example using the ordinal encoder tomorrow if you haven't got it right yet.
@trivialfis I have already built some of the code around native XGBoost, so it would be advantageous to use the native module. Also, I think the sklearn wrapper currently doesn't support some methods when dealing with categorical data, which was the reason for me to switch to the native module. I have already tried various options, but none seems to work. Currently the problem is that the DMatrix created during the testing phase doesn't know that the data is already encoded and that it just needs to parse it as-is, without the internal processing that maps it into the range [0, n) as is required during training. One way was to put it inside a pipeline and only use transform() during the testing phase, but I'm not sure how we could do this. I think there is something I'm missing or unexpected here. Can you please give a working example for this issue, to make sure we get the correct DMatrix encoding during the test phase on a new dataset when dealing with categorical data?
@abhishek0093 I put together a quick script based on your notebook; see if it addresses the issue:

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

np.random.seed(3)


def make_df():
    """Generate data for demo/testing."""
    n_samples = 512
    df = pd.DataFrame(
        np.random.randint(32, 96, size=(n_samples, 3)),
        columns=["brand_id", "retailer_id", "category_id"],
    )
    df["price"] = np.random.randint(100, 200, size=(n_samples,))
    df["stock_status"] = np.random.choice([True, False], n_samples)
    df["on_sale"] = np.random.choice([True, False], n_samples)
    df["label"] = np.random.normal(loc=0.0, scale=1.0, size=n_samples)
    features_list = [
        "brand_id",
        "retailer_id",
        "category_id",
        "stock_status",
        "on_sale",
        "price",
    ]
    categorical_features = ["brand_id", "retailer_id", "category_id"]
    X = df[features_list]
    y = df["label"]
    return X, y, categorical_features


X, y, cat_feats = make_df()


def native():
    """Using the native XGBoost interface."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=3, test_size=0.2
    )
    enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan)
    enc.set_output(transform="pandas")
    enc = enc.fit(X_train[cat_feats])

    def enc_transform(X: pd.DataFrame) -> pd.DataFrame:
        # Don't modify the input in place so that we can demonstrate the encoding.
        X = X.copy()
        cat_cols = enc.transform(X[cat_feats])
        for i, name in enumerate(cat_feats):
            cat_cols[name] = pd.Categorical.from_codes(
                codes=cat_cols[name].astype(np.int32), categories=enc.categories_[i]
            )
        X[cat_feats] = cat_cols
        return X

    X_train_enc = enc_transform(X_train)
    X_test_enc = enc_transform(X_test)
    Xy_train = xgb.QuantileDMatrix(X_train_enc, y_train, enable_categorical=True)
    Xy_test = xgb.QuantileDMatrix(
        X_test_enc, y_test, enable_categorical=True, ref=Xy_train
    )
    booster = xgb.train({}, Xy_train)
    booster.predict(Xy_test)

    # This addresses the question in the issue, showing that the encoding is done
    # consistently. We first obtain results from newly encoded data,
    predt0 = booster.inplace_predict(enc_transform(X_train.head(16)))
    # then we obtain results from data that was already encoded for training.
    predt1 = booster.inplace_predict(X_train_enc.head(16))
    np.testing.assert_allclose(predt0, predt1)


def pipeline():
    """Using an sklearn pipeline."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=3, test_size=0.2
    )
    enc = make_column_transformer(
        (
            OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
            make_column_selector(pattern=".*_id"),
        ),
        remainder="passthrough",
        verbose_feature_names_out=False,
    )
    # No need to set pandas output; we use `feature_types` to indicate the type of
    # each feature.
    # enc.set_output(transform="pandas")
    feature_types = ["c" if fn in cat_feats else "q" for fn in X_train.columns]
    reg = xgb.XGBRegressor(
        feature_types=feature_types, enable_categorical=True, n_estimators=10
    )
    p = make_pipeline(enc, reg)
    p.fit(X_train, y_train)
    for a, b in zip(reg.get_booster().feature_types, feature_types):
        assert a == b

    predt0 = p.predict(X_train.iloc[:16, :])
    predt1 = p.predict(X_train)[:16]
    # This addresses the question in the issue, showing that the encoding is done
    # consistently.
    np.testing.assert_allclose(predt0, predt1)


if __name__ == "__main__":
    pipeline()
    native()
```
Hi @trivialfis , I think the mistake I made might be a common one, and the approach mentioned above doesn't seem to be present on the documentation page for categorical data. Can you please put this example on the documentation page, or link this issue there, as it might be helpful for others facing the same problem? If you need my help with this, I would be very happy to help.
@abhishek0093 I pushed the example into #9678 with small cleanups, feel free to take a look when you get a chance.
Is this issue closed? I am wondering if many people affected by this will notice that the documentation has changed and that they need to add ordinal encoding for the categorical feature to work properly.
@trivialfis @abhishek0093 Can you please look into my issue? It's very similar to the issue discussed in this thread. I have added a link to reproduce the issue.
Hi @mustaq95 ,
Edit: For others looking for a solution to the above issue, it has been discussed here: #9788
I've spent 4 hours figuring out how to get this to work for me, so I thought I'd share some of my findings to make things easier for others in the future. I think the code for the pipeline() function that @trivialfis provides above could benefit from a little tweak. The column transformer created by make_column_transformer() reorders the columns of the input dataframe, and the XGBoost model object needs feature_types to be specified in this new order or else it produces an error message. (At least this is true for sklearn 1.2.2.) The only reason this doesn't cause a problem in the sample code is that the transformed columns all start out at the beginning of the input dataframe. A version that I believe would work with the code example in the documentation is:
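A sketch of the adjustment described above (my reconstruction, assuming the names from the pipeline() example earlier in this thread): derive feature_types in the column order the fitted ColumnTransformer actually produces, rather than in the original dataframe order.

```python
import numpy as np
import xgboost as xgb
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Assumes X_train, y_train and cat_feats as in the pipeline() example above.
enc = make_column_transformer(
    (
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
        make_column_selector(pattern=".*_id"),
    ),
    remainder="passthrough",
    verbose_feature_names_out=False,
)
# Fit once so that get_feature_names_out() reflects the transformer's output order;
# the pipeline below refits on the same data, so the order stays the same.
enc.fit(X_train)
out_cols = enc.get_feature_names_out()
feature_types = ["c" if fn in cat_feats else "q" for fn in out_cols]

reg = xgb.XGBRegressor(
    feature_types=feature_types, enable_categorical=True, n_estimators=10
)
p = make_pipeline(enc, reg)
p.fit(X_train, y_train)
```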
Please let me know if you would prefer that I submit this as a new issue. I also find that I can't get the pipeline object to work correctly if I want to supply the eval_set parameter, but that seems like it would require a big enough change to justify its own issue. |
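On the eval_set point, one possible workaround (my assumption, not something confirmed in this thread) is to encode the validation data with the already-fitted transformer and route it to the XGBoost step through the pipeline's fit-parameter prefixing:

```python
# Sketch, reusing `enc`, `reg`, X_train/y_train and X_test/y_test from above.
# The pipeline does not transform eval_set automatically, so encode it first.
enc.fit(X_train)  # refitting inside p.fit() reproduces the same encoding
eval_set = [(enc.transform(X_test), y_test)]

p = make_pipeline(enc, reg)
p.fit(
    X_train,
    y_train,
    xgbregressor__eval_set=eval_set,
    xgbregressor__verbose=False,
)
```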
Thank you for sharing, @jhaneyrf. It would be great if you could help enhance the existing document with a PR. :-)
I'll put some time into this over the weekend.
I am using the XGBoost native interface for encoding categorical columns during training. Everything works great and I'm able to pass
Could you please provide a reproducible example?
XGBoost Categorical Data Handling Issue
What I'm trying to demonstrate:
My current approach:
What I'm observing:
The predictions for individual rows are not matching the predictions from the bulk dataset. Specifically:
What I'm unsure about:
Code snippet demonstrating the issue:

```python
import os

import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
import xgboost as xgb


def make_example_data():
    n_samples = 2048
    rng = np.random.default_rng(1994)
    # Mix of integer and string categorical features
    brand_ids = rng.integers(1, 21, size=n_samples)  # 20 unique integer brands
    retailer_ids = [f"R{i}" for i in rng.integers(1, 11, size=n_samples)]  # 10 unique string retailers
    category_ids = rng.choice(['A', 'B', 'C', 'D', 'E'], size=n_samples)  # 5 unique string categories
    df = pd.DataFrame({
        "brand_id": brand_ids,
        "retailer_id": retailer_ids,
        "category_id": category_ids,
        "price": rng.integers(100, 200, size=n_samples),
        "stock_status": rng.choice([True, False], n_samples),
        "on_sale": rng.choice([True, False], n_samples)
    })
    df["label"] = rng.normal(loc=0.0, scale=1.0, size=n_samples)
    X = df.drop(["label"], axis=1)
    y = df["label"]
    categorical_features = ["brand_id", "retailer_id", "category_id"]
    for col in categorical_features:
        X[col] = X[col].astype('category')
    return X, y, categorical_features


def train_and_save_model() -> None:
    """Using the native XGBoost interface."""
    X, y, cat_feats = make_example_data()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=1994, test_size=0.2
    )
    # Create an encoder based on training data.
    enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan)
    enc.set_output(transform="pandas")
    enc = enc.fit(X_train[cat_feats])

    def enc_transform(X: pd.DataFrame) -> pd.DataFrame:
        # Don't modify the input in place so that we can demonstrate the encoding.
        X = X.copy()
        cat_cols = enc.transform(X[cat_feats])
        for i, name in enumerate(cat_feats):
            # Create a categorical column based on the fitted encoder.
            cat_cols[name] = pd.Categorical.from_codes(
                codes=cat_cols[name].astype(np.int32), categories=enc.categories_[i]
            )
        X[cat_feats] = cat_cols
        return X

    # Encode the data based on the fitted encoder.
    X_train_enc = enc_transform(X_train)
    X_test_enc = enc_transform(X_test)
    # Train an XGBoost model using the native interface.
    Xy_train = xgb.QuantileDMatrix(X_train_enc, y_train, enable_categorical=True)
    Xy_test = xgb.QuantileDMatrix(
        X_test_enc, y_test, enable_categorical=True, ref=Xy_train
    )
    booster = xgb.train({}, Xy_train)
    booster.predict(Xy_test)

    # The following shows that the data are encoded consistently.
    # We first obtain results from newly encoded data,
    predt0 = booster.inplace_predict(enc_transform(X_train.head(16)))
    # then results from data that was already encoded for training.
    predt1 = booster.inplace_predict(X_train_enc.head(16))
    np.testing.assert_allclose(predt0, predt1)

    # Save the model
    booster.save_model('booster_with_encoding.json')


def load_model() -> xgb.Booster:
    loaded_model = xgb.Booster()
    loaded_model.load_model('booster_with_encoding.json')
    return loaded_model


def test_model():
    model = load_model()
    # Create a small dataset with known categorical values for bulk prediction
    test_data_bulk = pd.DataFrame({
        'brand_id': [1, 5, 10, 15, 20],  # Within range of 1-20
        'retailer_id': ['R1', 'R5', 'R7', 'R9', 'R10'],  # Within range of R1-R10
        'category_id': ['A', 'B', 'C', 'D', 'E'],  # All possible categories
        'price': [150, 160, 170, 180, 190],
        'stock_status': [True, False, True, False, True],
        'on_sale': [False, True, False, True, False]
    })
    # Should I be converting inferences to type category? If I don't, I get this error:
    # ValueError: DataFrame.dtypes for data must be int, float, bool or category. When
    # categorical type is supplied, the experimental DMatrix parameter
    # `enable_categorical` must be set to `True`. Invalid columns:
    # retailer_id: object, category_id: object

    # Convert categorical columns to 'category' dtype
    cat_feats = ['brand_id', 'retailer_id', 'category_id']
    for col in cat_feats:
        test_data_bulk[col] = test_data_bulk[col].astype('category')

    # Not sure which of these I should use: predict vs. inplace_predict with a DMatrix / pandas df
    # Make bulk predictions
    # dbulk = xgb.DMatrix(test_data_bulk, enable_categorical=True)
    predictions_bulk = model.inplace_predict(test_data_bulk)
    print("Bulk Predictions:")
    print(predictions_bulk)

    # Now, let's create and predict each row independently
    for i in range(len(test_data_bulk)):
        single_row = pd.DataFrame({col: [test_data_bulk.iloc[i][col]] for col in test_data_bulk.columns})
        # Should I be converting inferences to type category?
        # If I don't, I get the same ValueError as above.
        for col in cat_feats:
            single_row[col] = single_row[col].astype('category')
        # dsingle = xgb.DMatrix(single_row, enable_categorical=True)
        # prediction_single = model.predict(dsingle)
        prediction_single = model.inplace_predict(single_row)
        print(f"\nRow {i}:")
        print(single_row)
        print(f"Bulk prediction: {predictions_bulk[i]}")
        print(f"Single row prediction: {prediction_single[0]}")
        np.testing.assert_allclose(predictions_bulk[i], prediction_single[0], rtol=1e-5, atol=1e-8)
        print("Predictions match!")


if __name__ == "__main__":
    if not os.path.exists("booster_with_encoding.json"):
        print("Training and saving model...")
        train_and_save_model()
    else:
        print("Model already exists. Skipping training.")
    print("Testing model...")
    test_model()
```
I am currently under the impression that if we use the native XGBoost enable_categorical=True after encoding our training data with the OrdinalEncoder, then we do not need to encode anything when it comes time to use the model for inference; that is, we should not need to save the encoder, since it is baked into the model we saved. Docs:
Samples:
Hi, I haven't looked into your example yet; really exhausted today.
This is not true. XGBoost doesn't encode any data; it simply trains a model on the inputs. That's its scope: train a boosting model and nothing more. Any data pre-processing step, including encoding, is left to the user. As a result, one should use the pipeline consistently.
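To make the point concrete (my own sketch, not from this thread): the booster only ever sees the integer codes behind a pandas categorical, so the same raw value encoded against a different category set can yield a different prediction.

```python
import numpy as np
import pandas as pd
import xgboost as xgb

rng = np.random.default_rng(0)
X = pd.DataFrame({"cat": pd.Categorical(rng.integers(0, 10, 256), categories=np.arange(10))})
y = rng.normal(size=256)
booster = xgb.train({}, xgb.QuantileDMatrix(X, y, enable_categorical=True))

# The raw value 7, encoded consistently with training vs. re-encoded from scratch.
consistent = pd.DataFrame({"cat": pd.Categorical([7], categories=np.arange(10))})  # code 7
fresh = pd.DataFrame({"cat": pd.Categorical([7])})                                 # code 0
print(booster.inplace_predict(consistent), booster.inplace_predict(fresh))         # typically differ
```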
Understood. Does anyone know where I can find an example where we encode some categorical features, train a model, save the model AND save the encoder... then, in a separate script, load the model (and the encoder?) and prove that the loaded model/pipeline predicts using the correct category encodings? I will try to make my own example this weekend and post it here. Thanks for getting back to me so soon!
The sklearn pipeline is just an example to show what a general pipeline looks like. Understandably, one can find it confusing. Sklearn doesn't provide any means for stable serialization; pickling is the only option for sklearn estimators and transformers. Since you are already using the native interface, I'm sure you can find a way to preserve the sklearn encoder's state, or roll your own encoder if necessary.
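If rolling your own encoder, here is one minimal sketch (my assumption of what that could look like, not an XGBoost or sklearn API) that keeps its state in plain JSON next to the saved booster:

```python
import json
import numpy as np
import pandas as pd


class JsonOrdinalEncoder:
    """Minimal ordinal encoder whose state round-trips through plain JSON."""

    def fit(self, df: pd.DataFrame) -> "JsonOrdinalEncoder":
        # np.asarray(...).tolist() turns numpy scalars into plain Python values,
        # keeping the stored state JSON-safe.
        self.categories_ = {
            col: np.asarray(pd.unique(df[col])).tolist() for col in df.columns
        }
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        for col, cats in self.categories_.items():
            mapping = {v: i for i, v in enumerate(cats)}
            # Values not seen during fit are mapped to -1.
            out[col] = df[col].astype(object).map(mapping).fillna(-1).astype(np.int64)
        return out

    def save(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump(self.categories_, f)

    @classmethod
    def load(cls, path: str) -> "JsonOrdinalEncoder":
        enc = cls()
        with open(path) as f:
            enc.categories_ = json.load(f)
        return enc
```

The training script would then call something like JsonOrdinalEncoder().fit(X_train[cat_feats]) and save() next to booster.save_model(), and the prediction script would load() both and transform the incoming frame before inplace_predict.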
On the other hand, I should update all examples and tutorials to reflect the need for consistent encoding.
We plan to include some form of encoder in XGBoost to eliminate the footgun. Hopefully it doesn't bring too much overhead.
Hi,

```python
from sklearn.preprocessing import OrdinalEncoder
import numpy as np
import pandas as pd
import xgboost as xgb
import pickle


def enc_transform(X: pd.DataFrame, ordinal_encoder, categorical_features) -> pd.DataFrame:
    """
    Transforms the columns listed in categorical_features into categorical codes.
    """
    X = X.copy()
    cat_cols = ordinal_encoder.transform(X[categorical_features])
    for i, name in enumerate(categorical_features):
        cat_cols[name] = pd.Categorical.from_codes(
            codes=cat_cols[name].astype(np.int32), categories=ordinal_encoder.categories_[i]
        )
    X[categorical_features] = cat_cols
    return X


categorical_features = ["abc", "xyz"]  # Your list containing categorical features.
feature_types = [
    "c" if str(c) == "category" else ("i" if str(c) == "bool" else "q")
    for c in X_train.dtypes
]

# Set up the encoder
ord_enc = OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=-1,
    encoded_missing_value=-1,
    min_frequency=500,
).set_output(transform="pandas").fit(X_train[categorical_features])

# Transform the data and build the DMatrix
X_train = enc_transform(X_train, ord_enc, categorical_features)
dtrain = xgb.DMatrix(
    X_train,
    Y_train,
    enable_categorical=True,
    missing=-1,
    weight=train_sample_weights,
    feature_types=feature_types,
)

# Save the encoder
encoder_location = "/location/encoder.pkl"
with open(encoder_location, "wb") as filepath:
    pickle.dump(ord_enc, filepath)

# Train and save the model
model_location = "/location/model.json"
bst = xgb.train(bst_param, dtrain, num_boost_round=num_trees, verbose_eval=0)
bst.save_model(model_location)

# Load the model
loaded_model = xgb.Booster()
loaded_model.load_model(model_location)

# Load the encoder
with open(encoder_location, 'rb') as filepath:
    loaded_encoder = pickle.load(filepath)

# Make predictions
df_test = ...  # dataframe containing all the input feature columns
model_prediction = loaded_model.inplace_predict(
    enc_transform(df_test, loaded_encoder, categorical_features)
)
```
Thank you for sharing. Going through this discussion, we are considering adding the encoder into the XGBoost booster.
Closing in favor of #11088 for a roadmap.
Hi Community!
I'm experiencing strange behavior with the XGBoost module when using it with categorical data. I have attached a sample file here with comments and results to reproduce the issue (link).
The issue I'm facing: during training I turned on categorical support and followed the guidelines in the documentation. I encoded the categories to lie in the range [0, num_categories). The training phase follows the expected transformation of the data in the DMatrix.
But during the test phase, when I do the same transformations and enable categorical support, the transformation in the DMatrix for a larger (>2 rows) dataframe doesn't follow the same encoding as during the training phase. Also, if I make predictions row by row, each of the categorical features gets encoded as 0.
Strangely, if I treat those categorical features as integers, I get the correct expected transformation in the DMatrix. But its result is different from treating the features as categorical, and I'm not sure which one (if either) is the correct result.
I think there is some problem with how categories are transformed during the testing phase on a completely new dataset. XGBoost tries to encode each of the provided categories to lie in the range [0, num_categories), regardless of how they were treated during training. For example, suppose during the training phase I had 100 unique categories, each within the range [0, 100). Now, if during the test phase I provide the same categorical column with original cat_ids 89 and 98, XGBoost transforms them to cat_ids 0 and 1, which I think it shouldn't.
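The pandas side of that behavior can be shown in a couple of lines (my illustration, not taken from the attached notebook): building a categorical from just the test values collapses the codes back into [0, n).

```python
import pandas as pd

test_col = pd.Series([89, 98], dtype="category")
print(test_col.cat.categories.tolist())  # [89, 98]
print(test_col.cat.codes.tolist())       # [0, 1] -- not the codes used during training
```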
I would like to hear from the community whether there is something I'm missing or whether this is unexpected behavior.
Edit (14 Oct 2023): Also, I'm not sure whether there should be a parameter to validate that the data types of the DMatrix passed to the model for prediction match the feature data types used during training. I mean something similar to "validate_features" of the predict method, but validating the type of the data supplied.