training eval_set does not default to "training" in Dask #4394

Open · ffineis opened this issue Jun 20, 2021 · 0 comments

ffineis (Contributor) commented Jun 20, 2021

Description

When using eval_set in LightGBM, if a component (or components) of eval_set is just the training data (e.g. eval_set[0][0] is X and eval_set[0][1] is y), then that component's default eval_name is "training" in the resulting evaluation artifacts such as best_score_ and evals_result_. This is the default behavior when eval_names is None.

The implementation of eval_set for Dask LightGBM estimators involves asking each (X, y) pair within eval_set, "hey, are you just the training X and y?" If a pair is, then Dask LightGBM does not copy those training set parts, so we skip having to .compute() them multiple times on the Dask cluster.
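To make that concrete, here is a minimal sketch of that check. The helper names below are hypothetical, for illustration only, and are not LightGBM's actual internals:

# -- hypothetical sketch of the identity check described above;
# -- these helpers are illustrative, not LightGBM internals
def _eval_part_is_training_part(eval_X, eval_y, train_X, train_y):
    # if the eval pair is literally the training pair, the already-computed
    # training parts can be reused instead of calling .compute() again
    return eval_X is train_X and eval_y is train_y

def _collect_eval_parts(eval_set, train_X, train_y):
    parts = []
    for eval_X, eval_y in eval_set:
        if _eval_part_is_training_part(eval_X, eval_y, train_X, train_y):
            parts.append('training')  # reuse training parts already on the worker
        else:
            parts.append((eval_X, eval_y))  # distinct data: computed separately
    return parts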

But when a Dask LightGBM estimator is trained with an eval_set that contains the training (X, y), even though that pair is detected as the training data in _train_part, LightGBM does not name it "training" in the default eval_names. Instead, it names the validation set valid_<index>, just like any other non-training validation component.

Note: as of #4101

Reproducible example

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster
import lightgbm as lgb
import numpy as np
import pandas as pd

X = pd.DataFrame(np.random.normal(size=(100, 5)))
y = pd.Series(np.random.choice([0, 1], size=100))
eval_set = [(X.sample(30, random_state=0), y.sample(30, random_state=0)), (X, y)]  # -- shared random_state keeps the sampled X and y rows aligned; second component is the training set

clf = lgb.LGBMClassifier()
clf.fit(X, y, eval_set=eval_set)

# -- "training" is in the default eval_names
eval_names = sorted(clf.evals_result_.keys())
assert eval_names == ['training', 'valid_0']

client = Client(LocalCluster())

dX = dd.from_pandas(X, chunksize=10)
dy = dd.from_pandas(y, chunksize=10)
eval_set = [(dX, dy), (dX.partitions[3:5], dy.partitions[3:5])]

dclf = lgb.DaskLGBMClassifier()
dclf.fit(dX, dy, eval_set=eval_set)

# -- "training" is not in default eval_names
assert sorted(dclf.evals_result_.keys()) == ['valid_0', 'valid_1']
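
For reference, explicitly passing eval_names seems to sidestep the broken default. This is a hedged sketch: it assumes DaskLGBMClassifier.fit accepts and forwards eval_names the same way the sklearn interface does:

# -- workaround sketch (assumes eval_names is forwarded as in the sklearn API)
dclf = lgb.DaskLGBMClassifier()
dclf.fit(dX, dy, eval_set=eval_set, eval_names=['training', 'valid_0'])
# -- expected: sorted(dclf.evals_result_.keys()) == ['training', 'valid_0']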

Environment info

LightGBM version or commit hash:

3.2.1.99

Command(s) you used to install LightGBM

pip install lightgbm

Additional Comments

This may be an issue with distributed LightGBM training in general, not specifically with the Dask interface in dask.py.

@jameslamb jameslamb added the dask label Jun 22, 2021
@jameslamb jameslamb changed the title training eval_set is does not default to "training" in Dask training eval_set does not default to "training" in Dask Jun 22, 2021
@StrikerRUS StrikerRUS added the bug label Jul 12, 2021