cross_validate don't work with LightGBM v4.0.0 #112

yuta100101 · 2023-07-16T05:38:11Z

Thanks for publishing such a useful tool!

LightGBM 4.0.0 was released a few days ago.
In this release, the early_stopping_rounds argument of fit() was removed.

As a result, functions that use cross_validate(), such as run_experiment, no longer work.
(Other functions may also be affected; I haven't investigated yet.)

Of course, there is no problem with versions up to 3.3.5.

pytest log

(nyaggle) yuta100101:~/nyaggle(master =)$ pytest tests/validation/test_cross_validate.py::test_cv_lgbm
========================================================================================== test session starts ===========================================================================================
platform linux -- Python 3.9.17, pytest-7.4.0, pluggy-1.2.0
rootdir: /home/yuta100101/practice/nyaggle
collected 1 item                                                                                                                                                                                         

tests/validation/test_cross_validate.py F                                                                                                                                                          [100%]

================================================================================================ FAILURES ================================================================================================
______________________________________________________________________________________________ test_cv_lgbm ______________________________________________________________________________________________

    def test_cv_lgbm():
        X, y = make_classification(n_samples=1024, n_features=20, class_sep=0.98, random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
    
        models = [LGBMClassifier(n_estimators=300) for _ in range(5)]
    
>       pred_oof, pred_test, scores, importance = cross_validate(models, X_train, y_train, X_test, cv=5,
                                                                 eval_func=roc_auc_score,
                                                                 fit_params={'early_stopping_rounds': 200})

tests/validation/test_cross_validate.py:52: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

estimator = [LGBMClassifier(n_estimators=300), LGBMClassifier(n_estimators=300), LGBMClassifier(n_estimators=300), LGBMClassifier(n_estimators=300), LGBMClassifier(n_estimators=300)]
X_train =            0         1         2         3         4         5         6         7         8   ...        11        12... ... -0.109782 -0.412230  1.707714 -0.240937 -0.276747  0.481276 -0.278111  1.304773 -0.139538

[512 rows x 20 columns]
y = 0      0
1      0
2      0
3      1
4      0
      ..
507    0
508    1
509    0
510    1
511    0
Name: target, Length: 512, dtype: int64
X_test =            0         1         2         3         4         5         6         7         8   ...        11        12... ... -2.598922 -0.351561  0.233836 -1.873634 -1.089221  0.373956 -0.520939 -0.489945  2.452996

[512 rows x 20 columns]
cv = KFold(n_splits=5, random_state=0, shuffle=True), groups = None, eval_func = <function roc_auc_score at 0x7fe910196ee0>, logger = <Logger nyaggle.validation.cross_validate (WARNING)>
on_each_fold = None, fit_params = {'early_stopping_rounds': 200}, importance_type = 'gain', early_stopping = True, type_of_target = 'binary'

    def cross_validate(estimator: Union[BaseEstimator, List[BaseEstimator]],
                       X_train: Union[pd.DataFrame, np.ndarray], y: Union[pd.Series, np.ndarray],
                       X_test: Union[pd.DataFrame, np.ndarray] = None,
                       cv: Optional[Union[int, Iterable, BaseCrossValidator]] = None,
                       groups: Optional[pd.Series] = None,
                       eval_func: Optional[Callable] = None, logger: Optional[Logger] = None,
                       on_each_fold: Optional[Callable[[int, BaseEstimator, pd.DataFrame, pd.Series], None]] = None,
                       fit_params: Optional[Union[Dict[str, Any], Callable]] = None,
                       importance_type: str = 'gain',
                       early_stopping: bool = True,
                       type_of_target: str = 'auto') -> CVResult:
        """
        Evaluate metrics by cross-validation. It also records out-of-fold prediction and test prediction.
    
        Args:
            estimator:
                The object to be used in cross-validation. For list inputs, ``estimator[i]`` is trained on i-th fold.
            X_train:
                Training data
            y:
                Target
            X_test:
                Test data (Optional). If specified, prediction on the test data is performed using ensemble of models.
            cv:
                int, cross-validation generator or an iterable which determines the cross-validation splitting strategy.
    
                - None, to use the default ``KFold(5, random_state=0, shuffle=True)``,
                - integer, to specify the number of folds in a ``(Stratified)KFold``,
                - CV splitter (the instance of ``BaseCrossValidator``),
                - An iterable yielding (train, test) splits as arrays of indices.
            groups:
                Group labels for the samples. Only used in conjunction with a “Group” cv instance (e.g., ``GroupKFold``).
            eval_func:
                Function used for logging and returning scores
            logger:
                logger
            on_each_fold:
                called for each fold with (idx_fold, model, X_fold, y_fold)
            fit_params:
                Parameters passed to the fit method of the estimator
            importance_type:
                The type of feature importance to be used to calculate result.
                Used only in ``LGBMClassifier`` and ``LGBMRegressor``.
            early_stopping:
                If ``True``, ``eval_set`` will be added to ``fit_params`` for each fold.
                ``early_stopping_rounds = 100`` will also be appended to fit_params if it does not already have one.
            type_of_target:
                The type of target variable. If ``auto``, type is inferred by ``sklearn.utils.multiclass.type_of_target``.
                Otherwise, ``binary``, ``continuous``, or ``multiclass`` are supported.
        Returns:
            Namedtuple with following members
    
            * oof_prediction (numpy array, shape (len(X_train),)):
                The predicted value on put-of-Fold validation data.
            * test_prediction (numpy array, hape (len(X_test),)):
                The predicted value on test data. ``None`` if X_test is ``None``.
            * scores (list of float, shape (nfolds+1,)):
                ``scores[i]`` denotes validation score in i-th fold.
                ``scores[-1]`` is the overall score. `None` if eval is not specified.
            * importance (list of pandas DataFrame, shape (nfolds,)):
                ``importance[i]`` denotes feature importance in i-th fold model.
                If the estimator is not GBDT, empty array is returned.
    
        Example:
            >>> from sklearn.datasets import make_regression
            >>> from sklearn.linear_model import Ridge
            >>> from sklearn.metrics import mean_squared_error
            >>> from nyaggle.validation import cross_validate
    
            >>> X, y = make_regression(n_samples=8)
            >>> model = Ridge(alpha=1.0)
            >>> pred_oof, pred_test, scores, _ = \
            >>>     cross_validate(model,
            >>>                    X_train=X[:3, :],
            >>>                    y=y[:3],
            >>>                    X_test=X[3:, :],
            >>>                    cv=3,
            >>>                    eval_func=mean_squared_error)
            >>> print(pred_oof)
            [-101.1123267 ,   26.79300693,   17.72635528]
            >>> print(pred_test)
            [-10.65095894 -12.18909059 -23.09906427 -17.68360714 -20.08218267]
            >>> print(scores)
            [71912.80290003832, 15236.680239881942, 15472.822033121925, 34207.43505768073]
        """
        cv = check_cv(cv, y)
        n_output_cols = 1
        if type_of_target == 'auto':
            type_of_target = multiclass.type_of_target(y)
        if type_of_target == 'multiclass':
            n_output_cols = y.nunique(dropna=True)
    
        if isinstance(estimator, list):
            assert len(estimator) == cv.get_n_splits(), "Number of estimators should be same to nfolds."
    
        X_train = convert_input(X_train)
        y = convert_input_vector(y, X_train.index)
        if X_test is not None:
            X_test = convert_input(X_test)
    
        if not isinstance(estimator, list):
            estimator = [estimator] * cv.get_n_splits()
    
        assert len(estimator) == cv.get_n_splits()
    
        if logger is None:
            logger = getLogger(__name__)
    
        def _predict(model: BaseEstimator, x: pd.DataFrame, _type_of_target: str):
            if _type_of_target in ('binary', 'multiclass'):
                if hasattr(model, "predict_proba"):
                    proba = model.predict_proba(x)
                elif hasattr(model, "decision_function"):
                    warnings.warn('Since {} does not have predict_proba method, '
                                  'decision_function is used for the prediction instead.'.format(type(model)))
                    proba = model.decision_function(x)
                else:
                    raise RuntimeError('Estimator in classification problem should have '
                                       'either predict_proba or decision_function')
                if proba.ndim == 1:
                    return proba
                else:
                    return proba[:, 1] if proba.shape[1] == 2 else proba
            else:
                return model.predict(x)
    
        oof = np.zeros((len(X_train), n_output_cols)) if n_output_cols > 1 else np.zeros(len(X_train))
        evaluated = np.full(len(X_train), False)
        test = None
        if X_test is not None:
            test = np.zeros((len(X_test), n_output_cols)) if n_output_cols > 1 else np.zeros(len(X_test))
    
        scores = []
        eta_all = []
        importance = []
    
        for n, (train_idx, valid_idx) in enumerate(cv.split(X_train, y, groups)):
            start_time = time.time()
    
            train_x, train_y = X_train.iloc[train_idx], y.iloc[train_idx]
            valid_x, valid_y = X_train.iloc[valid_idx], y.iloc[valid_idx]
    
            if fit_params is None:
                fit_params_fold = {}
            elif callable(fit_params):
                fit_params_fold = fit_params(n, train_idx, valid_idx)
            else:
                fit_params_fold = copy.copy(fit_params)
    
            if is_gbdt_instance(estimator[n], ('lgbm', 'cat', 'xgb')):
                if early_stopping:
                    if 'eval_set' not in fit_params_fold:
                        fit_params_fold['eval_set'] = [(valid_x, valid_y)]
                    if 'early_stopping_rounds' not in fit_params_fold:
                        fit_params_fold['early_stopping_rounds'] = 100
    
>               estimator[n].fit(train_x, train_y, **fit_params_fold)
E               TypeError: fit() got an unexpected keyword argument 'early_stopping_rounds'

nyaggle/validation/cross_validate.py:177: TypeError
======================================================================================== short test summary info =========================================================================================
FAILED tests/validation/test_cross_validate.py::test_cv_lgbm - TypeError: fit() got an unexpected keyword argument 'early_stopping_rounds'
=========================================================================================== 1 failed in 1.90s ============================================================================================


nyanp · 2023-07-17T10:52:29Z

@yuta100101 Thank you for reporting! The early_stopping_rounds argument should be replaced with the callback API.
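
A minimal sketch of that replacement, assuming the fix keeps accepting `early_stopping_rounds` in `fit_params` for backward compatibility and converts it just before calling `fit()`. `lightgbm.early_stopping` is the library's own callback; the helper name `to_callback_fit_params` and its `make_callback` parameter are hypothetical and not part of nyaggle.

```python
from typing import Any, Callable, Dict, Optional


def to_callback_fit_params(fit_params: Dict[str, Any],
                           make_callback: Optional[Callable[[int], Any]] = None,
                           default_rounds: int = 100) -> Dict[str, Any]:
    """Return a copy of fit_params in which the removed ``early_stopping_rounds``
    keyword is replaced by an early-stopping callback (hypothetical helper)."""
    params = dict(fit_params)  # shallow copy; the caller's dict is left untouched
    rounds = params.pop('early_stopping_rounds', default_rounds)

    if make_callback is None:
        # Imported lazily so the helper can be exercised without LightGBM installed.
        import lightgbm as lgb
        make_callback = lgb.early_stopping

    # Keep any callbacks the caller already supplied and append the new one.
    callbacks = list(params.get('callbacks') or [])
    callbacks.append(make_callback(rounds))
    params['callbacks'] = callbacks
    return params
```

With LightGBM installed, a call such as `estimator.fit(train_x, train_y, eval_set=[(valid_x, valid_y)], **to_callback_fit_params({'early_stopping_rounds': 200}))` should then avoid the `TypeError` shown in the log.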

wakame1367 · 2023-07-19T03:58:51Z

As a temporary measure, I have set a version constraint on the installation of LightGBM. The version has been limited to LightGBM<4.0.0.
I plan to address the main fix for this bug in a separate pull request.
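
For illustration, the same constraint can be expressed as a runtime check, which a later fix might use to choose between the old keyword and the callback. This is only a sketch: the function name is hypothetical, and it assumes the version string starts with a numeric major component, as `lightgbm.__version__` does.

```python
def accepts_early_stopping_kwarg(lightgbm_version: str) -> bool:
    """Return True if fit() in this LightGBM version still accepts the
    ``early_stopping_rounds`` keyword, i.e. the version is below 4.0.0."""
    major = int(lightgbm_version.split('.')[0])
    return major < 4
```

In practice it would be called as `accepts_early_stopping_kwarg(lightgbm.__version__)`.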

wakame1367 · 2023-07-20T06:38:39Z

Here is an article that may be helpful in resolving this issue.
Qiita - "LightGBM's early_stopping behavior has changed, so I looked into how to use it" (original title: LightGBMのearly_stoppingの仕様が変わったので、使用法を調べてみた)

Temporary Fix for Issue #112

yuta100101 · 2023-08-06T11:36:08Z

Not only cross_validate() but also find_best_lgbm_parameter() is affected, so it might be better to modify this library after Optuna's support for LightGBM 4.0.0 (Probably the PRs shown below) has been released.

yuta100101 · 2023-08-07T10:49:35Z

Sorry, my previous comment was unclear: find_best_lgbm_parameter() is affected by the removal of the fobj argument from train().
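
For context, a sketch of what the `fobj` removal means for calling code, assuming a custom objective was previously passed as `lgb.train(params, train_set, fobj=my_objective)`. LightGBM 4.0.0 is understood to take the callable under `params['objective']` instead; treat that as an assumption to verify against the release notes. The helper below is hypothetical and does not fix `find_best_lgbm_parameter()` itself, which depends on Optuna's integration.

```python
from typing import Any, Callable, Dict, Optional


def merge_fobj_into_params(params: Dict[str, Any],
                           fobj: Optional[Callable] = None) -> Dict[str, Any]:
    """Return a copy of params with a custom objective stored under 'objective'
    (hypothetical helper; the input dict is not modified)."""
    merged = dict(params)
    if fobj is not None:
        # Assumed 4.x convention: the callable replaces any string objective.
        merged['objective'] = fobj
    return merged
```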

nyanp added the bug, good first issue, and contributions welcome labels Jul 17, 2023

wakame1367 mentioned this issue Jul 19, 2023

Temporary Fix for Issue #112 #113

Merged

nyanp added a commit that referenced this issue Jul 22, 2023

Merge pull request #113 from wakame1367/bugfix/lightgbm_v4

86a9db4

Temporary Fix for Issue #112
