The MiningField@invalidValueTreatment attribute gets silently overridden by OneHotEncoder transformer #428

kausmees opened this issue Aug 7, 2024 · 7 comments



kausmees commented Aug 7, 2024

Hello

I'm trying to create a PMML file that includes a specification for how to handle invalid values for both numerical and categorical features. I'm getting the PMML output I expect for the numerical features, but the MiningField elements for the categorical features don't carry any invalidValueTreatment attribute in the MiningSchema.

I've tried two things: handling invalid values as missing and handling invalid values by assigning them a specific value (both in code below), but am getting the same result.

Reproducible example:

import pandas as pd
import numpy as np 

from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn.preprocessing import OneHotEncoder
from jpmml_evaluator import make_evaluator
import xgboost as xgb

df = pd.DataFrame([(1, 1, 'a'), (0, 2,'a'), (1, 3, 'b'), (1, 4, 'b'), (0, 5, 'c'), (1, 6, 'c'), (0, 7, 'c')], columns = ['Y', 'X1', 'X2'])

print(df)

numerical_cols = ['X1']
categorical_cols = ['X2']

### attempt 1: handle invalid values as missing
mapper = DataFrameMapper([
  (categorical_cols, [CategoricalDomain(invalid_value_treatment = "as_missing",
                                        missing_value_replacement = "OTHER"), OneHotEncoder()]),    
  (numerical_cols, [ContinuousDomain(invalid_value_treatment = "as_missing", 
                                     missing_value_replacement = -1)]),
])

### attempt 2: assign invalid values a new value
# mapper = DataFrameMapper([
#   (categorical_cols, [CategoricalDomain(invalid_value_treatment = "as_value",
#                                         invalid_value_replacement = "OTHER",
#                                         missing_value_replacement = "OTHER"), OneHotEncoder()]),    
#   (numerical_cols, [ContinuousDomain(invalid_value_treatment = "as_missing", 
#                                      missing_value_replacement = -1)]),
# ])



classifier = xgb.XGBClassifier(n_estimators=500,
                               objective="binary:logistic",
                               random_state=0,
                               learning_rate=0.1,
                               missing=np.nan,
                               scale_pos_weight=3000,
                               max_depth=8
                              )

pipeline = PMMLPipeline([
  ("mapper", mapper),
  ("classifier", classifier)
])

pipeline.fit(df[df.columns.difference(["Y"])], df["Y"])

sklearn2pmml(pipeline, "pipeline_test.pmml")


## evaluate

evaluator = make_evaluator("pipeline_test.pmml").verify()

TEST_INPUT_1 = {
  "X1": 1,
  "X2": "a"
}

TEST_INPUT_2 = {
  "X1": 1,
  "X2": "d"
}

print(evaluator.evaluate(TEST_INPUT_1))
print(evaluator.evaluate(TEST_INPUT_2))

Without the invalidValueTreatment attribute, evaluation of TEST_INPUT_2 using jpmml_evaluator results in an error:

jpmml_evaluator.JavaError: org.jpmml.evaluator.ValueCheckException: Field "X2" cannot accept invalid value "d"

However, if I manually add invalidValueTreatment="asMissing" for X2 in the PMML file, then evaluation works as expected.
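
For reference, here is a minimal sketch of what that manual patching step could look like as a script (the PMML 4.4 namespace URI is an assumption and may need to be adjusted to match the schema version of the generated file):

import xml.etree.ElementTree as ET

# Assumed namespace; adjust to the PMML version declared in pipeline_test.pmml
PMML_NS = "http://www.dmg.org/PMML-4_4"
ET.register_namespace("", PMML_NS)

tree = ET.parse("pipeline_test.pmml")
for mining_field in tree.iter("{%s}MiningField" % PMML_NS):
    if mining_field.get("name") == "X2":
        # Mirror the manual edit: treat invalid "X2" values as missing
        mining_field.set("invalidValueTreatment", "asMissing")
tree.write("pipeline_test.pmml", xml_declaration = True, encoding = "UTF-8")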

Is there any way to get the PMML output to contain info on how to handle invalid values, or have I missed something about the intended behavior here? Thanks

@vruusmann
Member

First of all, if you want to train an XGBoost model using categorical features, then you shouldn't be using an explicit one-hot encoding transformation (such as sklearn.preprocessing.OneHotEncoder). The XGBoost library is able to ingest categorical features as-is; simply set enable_categorical = True.
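
For illustration, a minimal sketch of native categorical handling on the toy dataset above (assuming a reasonably recent XGBoost version; the PMMLPipeline/DataFrameMapper wiring is left out here):

# Native categorical support requires a pandas "category" dtype column
X = df[["X1", "X2"]].copy()
X["X2"] = X["X2"].astype("category")

classifier = xgb.XGBClassifier(
    n_estimators = 500,
    objective = "binary:logistic",
    enable_categorical = True,  # ingest "X2" as-is, no one-hot encoding
    tree_method = "hist"        # categorical support requires a hist-based tree method
)
classifier.fit(X, df["Y"])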

I'm getting the PMML output I expect for the numerical features, but the categorical features don't have any tags mentioning invalidValueTreatment in the schema.

Can confirm this observation. When running your example code (thanks for taking the time to provide it, including the imports section), I get the following PMML markup:

<MiningSchema>
	<MiningField name="Y" usageType="target"/>
	<MiningField name="X2" missingValueReplacement="OTHER" missingValueTreatment="asIs"/>
	<MiningField name="X1" missingValueReplacement="-1" missingValueTreatment="asIs" invalidValueTreatment="asMissing"/>
</MiningSchema>

Indeed, the MiningField@invalidValueTreatment attribute is not there for the "X2" feature.

This attribute is set twice in the pipeline. First, the CategoricalDomain step sets it to MiningField@invalidValueTreatment="asMissing" as expected. Then, the OneHotEncoder resets it to MiningField@invalidValueTreatment="returnInvalid" (see https://github.com/jpmml/jpmml-sklearn/blob/1.8.4/pmml-sklearn/src/main/java/sklearn/preprocessing/MultiOneHotEncoder.java#L85-L89). Since this is the default/implied value for this attribute, it is not shown in the PMML document.

TLDR: It's a bug - there are two instances of InvalidValueDecorator being applied. It wouldn't be a problem if they were functionally equivalent, but in the current case they're not. The converter should raise an exception about it.

The MiningField@invalidValueTreatment attribute sticks correctly for continuous features, because there is only one instance of InvalidValueDecorator being applied.

@vruusmann
Member

I've had this "new decorator overriding the old decorator" issue in my private TODO list for a long time. Perhaps it's time to work on it, now that someone is demonstrably suffering from it.

I think the solution would be to assume that an explicit decoration (here: CategoricalDomain(invalid_value_treatment = "as_missing")) should take priority over any implicit decorations (here: OneHotEncoder(handle_unknown = "error")).

However, the converter should log a warning message every time it "ignores" some decoration.

@vruusmann
Member

Also, a generic note about the PMML input value preparation algorithm.

The input value can belong to one of three value spaces: valid, invalid (i.e. non-missing, but not valid) or missing. The invalid value treatment is applied first; the missing value treatment is applied after that.

If the model supports missing values, then at the end of input value preparation there should be only valid or missing values present. If the model does not support missing values, only valid values should be present.
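
As a rough illustration (pseudo-code only; the helper names below are not part of any JPMML API), the ordering can be pictured like this:

def prepare_input_value(value, valid_values, treat_invalid, treat_missing):
    # Invalid value treatment is applied first ...
    if value is not None and value not in valid_values:
        value = treat_invalid(value)   # e.g. "asMissing" maps the value to None
    # ... and missing value treatment after that
    if value is None:
        value = treat_missing()        # e.g. substitute the missingValueReplacement value
    return value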

I can see the following logical error in your Python pipeline: you're replacing missing values with invalid values.

For example, the domain of the "X1" field is [1, 7], but the missing value replacement value is -1. Or, the domain of the "X2" field is {a, b, c}, but the missing value replacement value is OTHER.

It means that you're intentionally feeding an invalid value (i.e. a value that was not present in the training dataset) into your model. What good can that do?

Granted, decision tree models (such as XGBoost) are quite lenient towards invalid values. The evaluation path simply follows the "default way" (instead of erroring out).

This makes me think that perhaps SkLearn2PMML decorator classes should also check that the provided invalid and missing value replacement values are actually valid values.
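
Applied to the toy dataset above, a hedged sketch of decorator settings where the replacement values are themselves valid values (the concrete choices "c" and 1 are picked purely for illustration):

mapper = DataFrameMapper([
  (categorical_cols, [CategoricalDomain(invalid_value_treatment = "as_missing",
                                        missing_value_replacement = "c"), OneHotEncoder()]),
  (numerical_cols, [ContinuousDomain(invalid_value_treatment = "as_missing",
                                     missing_value_replacement = 1)]),
])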

@kausmees
Author

kausmees commented Aug 8, 2024

@vruusmann
Thanks for the fast and very thorough reply.

Would appreciate a fix; it would save us the extra step of inserting the attributes into the PMML file ourselves.

I can see the following logical error in your Python pipeline that you're replacing missing values with invalid values.

You're right about the example; it is a very simplified version of the data, and I should perhaps have taken the time to make it a bit more realistic or complete.

This makes me think that perhaps SkLearn2PMML decorator classes should also check that the provided invalid and missing value replacement values are actually valid values.

This could be good. I ran into a Java error (below) when calling sklearn2pmml() with a pipeline that assigned missing categorical data a value that was not present in the original data set (this also happens when changing an X2 value to missing in the example I gave). I know this doesn't make sense to do, but it happened when testing the pipeline with a subset of data during development, which by chance didn't contain any instances of the category that missing values were assigned to. Having a check that gives an error or warning on the Python side would probably be clearer and easier to troubleshoot.

Exception in thread "main" java.lang.IllegalArgumentException: Expected the same number of elements, got a different numbers of elements
        at org.jpmml.python.ClassDictUtil.checkSize(ClassDictUtil.java:47)
        at sklearn.preprocessing.MultiOneHotEncoder.encodeFeatures(MultiOneHotEncoder.java:153)
        at sklearn.Transformer.encode(Transformer.java:77)
        at sklearn_pandas.DataFrameMapper.encodeFeatures(DataFrameMapper.java:67)
        at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:48)
        at sklearn.Initializer.encode(Initializer.java:59)
        at sklearn.Composite.encodeFeatures(Composite.java:112)
        at sklearn.Composite.initFeatures(Composite.java:255)
        at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:113)
        at com.sklearn2pmml.Main.run(Main.java:99)
        at com.sklearn2pmml.Main.main(Main.java:84)

@vruusmann
Member

vruusmann commented Aug 8, 2024

Would appreciate a fix, it will save us from having an extra step in the pipeline of inserting the tags into the PMML file.

The fix about "decorator overrides" would go into the base JPMML-Converter library. It'll take some time to propagate it up to the JPMML-SkLearn library level.

Is the OneHotEncoder plus XGBClassifier pipeline use case realistic or not? If it is, then you can fix everything (while significantly improving computational and statistical performance) by simply leaving out the OneHotEncoder step (in favour of XGBClassifier(enable_categorical = True)).

I ran into a java error (below) when calling sklearn2pmml() with a pipeline that assigned a value to missing categorical data that was not present in the original data set

I also have something similar noted in my private TODO list.

The situation can likely be fixed by tweaking the OneHotEncoder configuration some more - you'd need to make it aware of the possibility of invalid feature values coming in during OneHotEncoder.transform(X).

Please note that Scikit-Learn refers to invalid values as "unknown values".
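
In Scikit-Learn terms, that tweak would be along these lines (how the converter then maps this setting to PMML is a separate question):

# Unknown ("invalid") categories encode to an all-zeros row instead of raising an error
encoder = OneHotEncoder(handle_unknown = "ignore")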

Alternatively, you may try replacing sklearn.preprocessing.OneHotEncoder with category_encoders.OneHotEncoder. The converter might be more up-to-date in parsing its invalid and missing value handling logic.

But it would be even better if you got rid of one-hot encoding on your XGBoost pipelines in the first place.

@vruusmann
Member

Is the use case about OneHotEncoder plus XGBClassifier pipeline realistic or not?

Another idea: the JPMML-SkLearn library should raise an error when it encounters a OneHotEncoder transformer step together with some XGBoost estimator step. Perhaps the converter should offer an option to suppress this error by paying some kind of fee :-)

People keep following 5+ year old tutorials, completely missing out on the new and correct way of doing things.

@kausmees
Author

kausmees commented Aug 8, 2024

Is the use case about OneHotEncoder plus XGBClassifier pipeline realistic or not?

The use of XGBClassifier now is mainly as a placeholder while developing the pipeline, and perhaps as a baseline for model performance later. But performing one-hot encoding at some point down the line definitely seems feasible, when assessing different models.

But thanks for pointing out that skipping OneHotEncoder would solve the problem in this particular case; it would probably also be better for performance. I will also look into using other libraries such as category_encoders, and knowing what's going on with the decorators being overridden is useful for this.
