training eval_set does not default to "training" in Dask #4394

Open · ffineis opened this issue Jun 20, 2021 · 0 comments

ffineis (Contributor) commented Jun 20, 2021

Description

When using eval_set in LightGBM, if a component (or components) of eval_set is just the training data (e.g. eval_set[0][0] is X and eval_set[0][1] is y), then that component's default eval_name is "training" in the resulting evaluation artifacts such as best_score_ and evals_result_. This is the default behavior when eval_names is None.

The implementation of eval_set for Dask LightGBM estimators involves asking each (X, y) pair within eval_set, "hey, are you just the training X and y?" If a pair is, then Dask LightGBM does not copy those training set parts, so we skip having to .compute() them multiple times on the Dask cluster.
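To make that concrete, here is a minimal sketch of that check. The helper names below are hypothetical, for illustration only, and are not LightGBM's actual internals:

# -- hypothetical sketch of the identity check described above;
# -- these helpers are illustrative, not LightGBM internals
def _eval_part_is_training_part(eval_X, eval_y, train_X, train_y):
    # if the eval pair is literally the training pair, the already-computed
    # training parts can be reused instead of calling .compute() again
    return eval_X is train_X and eval_y is train_y

def _collect_eval_parts(eval_set, train_X, train_y):
    parts = []
    for eval_X, eval_y in eval_set:
        if _eval_part_is_training_part(eval_X, eval_y, train_X, train_y):
            parts.append('training')  # reuse training parts already on the worker
        else:
            parts.append((eval_X, eval_y))  # distinct data: computed separately
    return parts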

But when a Dask LightGBM estimator is trained with an eval_set that contains the training (X, y), even though that pair is detected as the training data in _train_part, LightGBM does not name it "training" in the default eval_names. Instead, it names the validation set valid_<index>, just like any other non-training validation component.

Note: as of #4101

Reproducible example

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster
import lightgbm as lgb
import numpy as np
import pandas as pd

X = pd.DataFrame(np.random.normal(size=(100, 5)))
y = pd.Series(np.random.choice([0, 1], size=100))
eval_set = [(X.sample(30, random_state=0), y.sample(30, random_state=0)), (X, y)]  # -- shared random_state keeps the sampled X and y rows aligned; second component is the training set

clf = lgb.LGBMClassifier()
clf.fit(X, y, eval_set=eval_set)

# -- "training" is in the default eval_names
eval_names = sorted(clf.evals_result_.keys())
assert eval_names == ['training', 'valid_0']

client = Client(LocalCluster())

dX = dd.from_pandas(X, chunksize=10)
dy = dd.from_pandas(y, chunksize=10)
eval_set = [(dX, dy), (dX.partitions[3:5], dy.partitions[3:5])]

dclf = lgb.DaskLGBMClassifier()
dclf.fit(dX, dy, eval_set=eval_set)

# -- "training" is not in default eval_names
assert sorted(dclf.evals_result_.keys()) == ['valid_0', 'valid_1']
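
For reference, explicitly passing eval_names seems to sidestep the broken default. This is a hedged sketch: it assumes DaskLGBMClassifier.fit accepts and forwards eval_names the same way the sklearn interface does:

# -- workaround sketch (assumes eval_names is forwarded as in the sklearn API)
dclf = lgb.DaskLGBMClassifier()
dclf.fit(dX, dy, eval_set=eval_set, eval_names=['training', 'valid_0'])
# -- expected: sorted(dclf.evals_result_.keys()) == ['training', 'valid_0']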

Environment info

LightGBM version or commit hash:

3.2.1.99

Command(s) you used to install LightGBM

pip install lightgbm

Additional Comments

This may be an issue with distributed LightGBM training in general, not specifically with the Dask interface in dask.py.

@jameslamb jameslamb added the dask label Jun 22, 2021
@jameslamb jameslamb changed the title training eval_set is does not default to "training" in Dask training eval_set does not default to "training" in Dask Jun 22, 2021
@StrikerRUS StrikerRUS added the bug label Jul 12, 2021