BinaryEncoder doesn't work together with cross_val_predict from sklearn #232

grialx · 2020-01-30T14:42:41Z

Versions
sklearn: '0.22.1'
category_encoders: 2.1.0

Issue: if I use a fitted BinaryEncoder instance inside a custom classifier and pass that classifier to cross_val_predict, a ValueError is raised:
"ValueError: Must train encoder before it can be used to transform data."
Importantly, I do not want to refit the encoder again and again. It should be fitted once, at the beginning, on the whole dataset.

Minimal example:

import pandas as pd
from sklearn.model_selection import cross_val_predict
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.svm import SVC
from category_encoders.binary import BinaryEncoder
encoder = BinaryEncoder()
col_1 = ["Hello", "World!", "ML", "is", "interesting"]*10
col_2 = [1, 0, 0, 1, 1]*10
data = pd.DataFrame({"A": col_1, "B": col_2})
encoder.fit(data.loc[:, ["A"]])


class ToyClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, encoder=None):
        self.encoder = encoder
        self.reg = SVC(C=1000)

    def fit(self, X, y):
        X = self.encoder.transform(X)
        self.reg.fit(X, y)
        return self

    def predict(self, X):
        X = self.encoder.transform(X)
        return self.reg.predict(X)

    def __setstate__(self, state):
        self.encoder = state["encoder"]
        self.reg = state["reg"]

    def __getstate__(self):
        return {
            "encoder": self.encoder,
            "reg": self.reg
        }


cls = ToyClassifier(encoder=encoder)
res = cross_val_predict(cls, data.loc[:, ["A"]], data.loc[:, "B"], cv=2)

The whole error message is:

Traceback (most recent call last):
File "/opt/miniconda/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3319, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 41, in
res = cross_val_predict(cls, data.loc[:, ["A"]], data.loc[:, "B"], cv=2)
File "/opt/miniconda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 755, in cross_val_predict
for train, test in cv.split(X, y, groups))
File "/opt/miniconda/lib/python3.7/site-packages/joblib/parallel.py", line 921, in __call__
if self.dispatch_one_batch(iterator):
File "/opt/miniconda/lib/python3.7/site-packages/joblib/parallel.py", line 759, in dispatch_one_batch
self._dispatch(tasks)
File "/opt/miniconda/lib/python3.7/site-packages/joblib/parallel.py", line 716, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/opt/miniconda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 182, in apply_async
result = ImmediateResult(func)
File "/opt/miniconda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 549, in __init__
self.results = batch()
File "/opt/miniconda/lib/python3.7/site-packages/joblib/parallel.py", line 225, in __call__
for func, args, kwargs in self.items]
File "/opt/miniconda/lib/python3.7/site-packages/joblib/parallel.py", line 225, in
for func, args, kwargs in self.items]
File "/opt/miniconda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 841, in _fit_and_predict
estimator.fit(X_train, y_train, **fit_params)
File "", line 20, in fit
X = self.encoder.transform(X)
File "/opt/miniconda/lib/python3.7/site-packages/category_encoders/binary.py", line 125, in transform
return self.base_n_encoder.transform(X)
File "/opt/miniconda/lib/python3.7/site-packages/category_encoders/basen.py", line 214, in transform
raise ValueError('Must train encoder before it can be used to transform data.')
ValueError: Must train encoder before it can be used to transform data.


janmotl · 2020-01-30T16:20:09Z

Thank you for the report with the reproducible example.

The issue is related to cloning in scikit-learn: attributes with a leading underscore, such as _dim, are not cloned. Unfortunately, simply switching from the prefix to the suffix notation (e.g. _dim -> dim_) is not sufficient. It requires further investigation.
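
To make the cloning behaviour concrete, here is a minimal sketch that uses only scikit-learn. cross_val_predict clones the estimator before fitting it on each fold, and sklearn.base.clone rebuilds an estimator from its constructor parameters, so a fitted estimator passed in as a parameter comes out unfitted. StandardScaler stands in for the encoder here, and Holder is a hypothetical wrapper class playing the role of ToyClassifier; neither is part of category_encoders.

```python
from sklearn.base import BaseEstimator, clone
from sklearn.preprocessing import StandardScaler


class Holder(BaseEstimator):
    """Hypothetical wrapper that receives a fitted estimator as a parameter."""

    def __init__(self, inner=None):
        self.inner = inner


# Fit the inner estimator once, up front.
scaler = StandardScaler().fit([[0.0], [1.0], [2.0]])
holder = Holder(inner=scaler)

# This is what cross_val_predict does to the estimator before each fold.
cloned = clone(holder)

# The original still carries its fitted state ...
print(hasattr(holder.inner, "mean_"))
# ... but the clone holds a fresh, unfitted copy of the inner estimator.
print(hasattr(cloned.inner, "mean_"))
```

Under these assumptions, the first check reports that the fitted attribute is present and the second that it is missing, which is consistent with the "Must train encoder" error raised inside fit.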

janmotl added the bug label Jan 30, 2020

janmotl referenced this issue Mar 2, 2020

Test a fix for #229

0cb5a32
