BinaryEncoder doesn't work together with cross_val_predict from sklearn #232

Open
grialx opened this issue Jan 30, 2020 · 1 comment

grialx commented Jan 30, 2020

Versions
sklearn: '0.22.1'
category_encoders: 2.1.0

Issue: if I use an already fitted BinaryEncoder instance inside a custom classifier, cross_val_predict raises
"ValueError: Must train encoder before it can be used to transform data."
Importantly, I do not want to refit the encoder for every fold. It should be fitted once, at the beginning, on the whole dataset.

Minimal example:

import pandas as pd
from sklearn.model_selection import cross_val_predict
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.svm import SVC
from category_encoders.binary import BinaryEncoder
encoder = BinaryEncoder()
col_1 = ["Hello", "World!", "ML", "is", "interesting"]*10
col_2 = [1, 0, 0, 1, 1]*10
data = pd.DataFrame({"A": col_1, "B": col_2})
encoder.fit(data.loc[:, ["A"]])


class ToyClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, encoder=None):
        self.encoder = encoder
        self.reg = SVC(C=1000)

    def fit(self, X, y):
        X = self.encoder.transform(X)
        self.reg.fit(X, y)
        return self

    def predict(self, X):
        X = self.encoder.transform(X)
        return self.reg.predict(X)

    def __setstate__(self, state):
        self.encoder = state["encoder"]
        self.reg = state["reg"]

    def __getstate__(self):
        return {
            "encoder": self.encoder,
            "reg": self.reg
        }


cls = ToyClassifier(encoder=encoder)
res = cross_val_predict(cls, data.loc[:, ["A"]], data.loc[:, "B"], cv=2)

The whole error message is:

Traceback (most recent call last):
File "/opt/miniconda/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3319, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 41, in
res = cross_val_predict(cls, data.loc[:, ["A"]], data.loc[:, "B"], cv=2)
File "/opt/miniconda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 755, in cross_val_predict
for train, test in cv.split(X, y, groups))
File "/opt/miniconda/lib/python3.7/site-packages/joblib/parallel.py", line 921, in __call__
if self.dispatch_one_batch(iterator):
File "/opt/miniconda/lib/python3.7/site-packages/joblib/parallel.py", line 759, in dispatch_one_batch
self._dispatch(tasks)
File "/opt/miniconda/lib/python3.7/site-packages/joblib/parallel.py", line 716, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/opt/miniconda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 182, in apply_async
result = ImmediateResult(func)
File "/opt/miniconda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 549, in __init__
self.results = batch()
File "/opt/miniconda/lib/python3.7/site-packages/joblib/parallel.py", line 225, in __call__
for func, args, kwargs in self.items]
File "/opt/miniconda/lib/python3.7/site-packages/joblib/parallel.py", line 225, in
for func, args, kwargs in self.items]
File "/opt/miniconda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 841, in _fit_and_predict
estimator.fit(X_train, y_train, **fit_params)
File "", line 20, in fit
X = self.encoder.transform(X)
File "/opt/miniconda/lib/python3.7/site-packages/category_encoders/binary.py", line 125, in transform
return self.base_n_encoder.transform(X)
File "/opt/miniconda/lib/python3.7/site-packages/category_encoders/basen.py", line 214, in transform
raise ValueError('Must train encoder before it can be used to transform data.')
ValueError: Must train encoder before it can be used to transform data.

Collaborator

janmotl commented Jan 30, 2020

Thank you for the report with the reproducible example.

The issue is related to cloning in scikit-learn: attributes with a leading underscore, such as _dim, are not cloned. Unfortunately, simply switching from the prefix to the suffix notation (e.g. _dim -> dim_) is not sufficient. It requires further investigation.
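
For context, cross_val_predict fits a clone of the estimator for each fold, and sklearn.base.clone rebuilds an estimator from its constructor parameters rather than copying fitted state. A parameter that is itself an estimator (here the encoder argument) is rebuilt the same way, so the copy arrives unfitted. The sketch below is a simplified, standard-library-only model of that behavior, not the actual scikit-learn or category_encoders code: ToyEncoder, FittedHolder, and toy_clone are hypothetical names introduced for illustration. It also shows one possible workaround under that model: hiding the fitted encoder inside a plain object without get_params, which gets deep-copied instead of rebuilt.

```python
import copy


class ToyEncoder:
    """Hypothetical stand-in for an encoder; not the category_encoders API."""

    def __init__(self, base=2):
        self.base = base        # constructor parameter, survives cloning
        self.mapping = None     # fitted state, created by fit()

    def get_params(self):
        return {"base": self.base}

    def fit(self, values):
        self.mapping = {v: i for i, v in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        if self.mapping is None:
            raise ValueError("Must train encoder before it can be used to transform data.")
        return [self.mapping[v] for v in values]


class FittedHolder:
    """Hypothetical wrapper with no get_params, so toy_clone deep-copies it."""

    def __init__(self, encoder):
        self.encoder = encoder

    def transform(self, values):
        return self.encoder.transform(values)


def toy_clone(obj):
    """Simplified model of parameter-based cloning (an assumption, not sklearn's code)."""
    if not hasattr(obj, "get_params"):
        return copy.deepcopy(obj)  # plain objects are copied, state included
    params = {name: toy_clone(value) for name, value in obj.get_params().items()}
    return type(obj)(**params)     # rebuilt from parameters only; fitted state is gone


fitted = ToyEncoder().fit(["Hello", "World!", "ML", "is", "interesting"])

# Cloning the encoder directly drops the fitted mapping.
cloned = toy_clone(fitted)
print(cloned.mapping)  # None

# Cloning the holder deep-copies it, so the fitted mapping is preserved.
held = toy_clone(FittedHolder(fitted))
print(held.transform(["ML"]) == fitted.transform(["ML"]))  # True
```

Whether the holder trick behaves the same way with the real libraries is an assumption worth verifying against the installed scikit-learn version. A simpler alternative is to transform the data once with the fitted encoder and run cross_val_predict with a plain classifier on the encoded frame.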
