[dask] Result shape from DaskLGBMClassifier.predict(pred_contrib=True) for CSC matrices is inconsistent with LGBMClassifier #3881

Closed
jameslamb opened this issue Jan 29, 2021 · 3 comments · Fixed by #4378

Comments

@jameslamb
Collaborator

See the discussion in #3866 (comment) for full details.

lightgbm.dask.DaskLGBMClassifier tries to stay as close as possible to the API of lightgbm.sklearn.LGBMClassifier. This issue describes one known inconsistency.

In lightgbm.sklearn.LGBMClassifier, for multiclass classification tasks, if you call .predict(X, pred_contrib=True) and X is a scipy.sparse.csc_matrix, the result will be a list of CSC matrices, 1 per class.

In lightgbm.dask.DaskLGBMClassifier, for multiclass classification tasks, if you call .predict(X, pred_contrib=True) and X is a Dask Array whose partitions are each a scipy.sparse.csc_matrix, the result will be a Dask Array that, once .compute()'d, returns a scipy.sparse.coo_matrix.

To complete this feature, try to make Dask's behavior match the behavior from lightgbm.sklearn.LGBMClassifier, or document why that can't / shouldn't be done.
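To make the difference between the two result shapes concrete, here is a minimal sketch using only NumPy and SciPy (no LightGBM or Dask involved). It assumes that a single wide result matrix lays out its columns class by class, with the same number of columns for each class; that layout is an assumption made for illustration, and the helper name split_contribs_by_class is hypothetical, not part of either library.

```python
import numpy as np
from scipy import sparse


def split_contribs_by_class(stacked, n_classes):
    """Split a wide (n_samples, n_classes * k) sparse matrix into a list of
    n_classes CSC matrices, each of shape (n_samples, k).

    Hypothetical helper: assumes columns are grouped class by class.
    """
    n_cols = stacked.shape[1]
    if n_cols % n_classes != 0:
        raise ValueError("column count is not divisible by n_classes")
    k = n_cols // n_classes
    # COO matrices do not support slicing, so convert to CSC first
    csc = sparse.csc_matrix(stacked)
    return [csc[:, i * k:(i + 1) * k] for i in range(n_classes)]


# Toy stand-in for a computed result: 5 samples, 3 classes, 4 columns per class
rng = np.random.default_rng(0)
dense = rng.random((5, 12))
dense[dense < 0.5] = 0.0
stacked_coo = sparse.coo_matrix(dense)

# One CSC matrix per class, the shape the sklearn estimator is described as returning
per_class = split_contribs_by_class(stacked_coo, n_classes=3)
```

Whether the real fix should convert on the workers or after computing is a separate design question; this only shows what the two shapes look like side by side.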

@jameslamb
Collaborator Author

Added this to #2302, where we store feature requests for this project. Anyone is welcome to contribute this feature. Leave a comment below if you'd like to pick it up and the issue can be re-opened.

@StrikerRUS
Collaborator

One addition:

In lightgbm.sklearn.LGBMClassifier, for multiclass classification tasks, if you call .predict(X, pred_contrib=True) and X is a scipy.sparse.csc_matrix, the result will be a list of CSC matrices, 1 per class.

The same is true for CSR matrix as well.

See the following core Python API test to better understand what is expected:

import lightgbm as lgb
import numpy as np
from scipy.sparse import isspmatrix_csc, isspmatrix_csr
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split


def test_contribs_sparse_multiclass():
    n_features = 20
    n_samples = 100
    n_labels = 4
    # generate CSR sparse dataset
    X, y = make_multilabel_classification(n_samples=n_samples,
                                          sparse=True,
                                          n_features=n_features,
                                          n_classes=1,
                                          n_labels=n_labels)
    y = y.flatten()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
    params = {
        'objective': 'multiclass',
        'num_class': n_labels,
        'verbose': -1,
    }
    lgb_train = lgb.Dataset(X_train, y_train)
    gbm = lgb.train(params, lgb_train, num_boost_round=20)
    # sparse input: expect a list with one CSR matrix per class
    contribs_csr = gbm.predict(X_test, pred_contrib=True)
    assert isinstance(contribs_csr, list)
    for perclass_contribs_csr in contribs_csr:
        assert isspmatrix_csr(perclass_contribs_csr)
    # convert data to dense and get back same contribs
    contribs_dense = gbm.predict(X_test.toarray(), pred_contrib=True)
    # validate the values are the same
    contribs_csr_array = np.swapaxes(np.array([sparse_array.todense() for sparse_array in contribs_csr]), 0, 1)
    contribs_csr_arr_re = contribs_csr_array.reshape((contribs_csr_array.shape[0],
                                                      contribs_csr_array.shape[1] * contribs_csr_array.shape[2]))
    np.testing.assert_allclose(contribs_csr_arr_re, contribs_dense)
    contribs_dense_re = contribs_dense.reshape(contribs_csr_array.shape)
    # per-class contributions should sum to the raw scores
    assert np.linalg.norm(gbm.predict(X_test, raw_score=True) - np.sum(contribs_dense_re, axis=2)) < 1e-4
    # validate using CSC matrix
    X_test_csc = X_test.tocsc()
    contribs_csc = gbm.predict(X_test_csc, pred_contrib=True)
    assert isinstance(contribs_csc, list)
    for perclass_contribs_csc in contribs_csc:
        assert isspmatrix_csc(perclass_contribs_csc)
    # validate the values are the same
    contribs_csc_array = np.swapaxes(np.array([sparse_array.todense() for sparse_array in contribs_csc]), 0, 1)
    contribs_csc_array = contribs_csc_array.reshape((contribs_csc_array.shape[0],
                                                     contribs_csc_array.shape[1] * contribs_csc_array.shape[2]))
    np.testing.assert_allclose(contribs_csc_array, contribs_dense)
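The swapaxes-and-reshape step in that test is the part that relates the list-of-matrices result to the single wide dense result. A small sketch of just that step, on made-up data with NumPy and SciPy only (the sizes and variable names are illustrative assumptions, not taken from LightGBM):

```python
import numpy as np
from scipy import sparse

# Made-up sizes: 6 samples, 4 classes, 5 contribution columns per class
n_samples, n_classes, k = 6, 4, 5
rng = np.random.default_rng(42)
per_class_dense = [rng.random((n_samples, k)) for _ in range(n_classes)]
per_class_csr = [sparse.csr_matrix(a) for a in per_class_dense]

# Same approach as the test: (n_classes, n_samples, k) -> (n_samples, n_classes, k)
stacked = np.swapaxes(np.array([m.toarray() for m in per_class_csr]), 0, 1)
# Flatten the last two axes to get one wide row per sample
flat_via_reshape = stacked.reshape(n_samples, n_classes * k)

# Equivalent shortcut: place the per-class blocks side by side
flat_via_hstack = np.hstack([m.toarray() for m in per_class_csr])

# Summing over the last axis gives one total per sample and class,
# which is what the test compares against the raw scores
per_class_sums = stacked.sum(axis=2)
```

Both routes produce the same wide array, so a Dask-side test could use whichever reads more clearly.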

@jameslamb
Collaborator Author

Re-opening this to note that I am currently working on a fix for this, to try to unblock #4351.

@jameslamb jameslamb reopened this Jun 13, 2021
StrikerRUS added a commit that referenced this issue Jul 7, 2021
…rices match those from sklearn estimators (fixes #3881) (#4378)

* test_classifier working

* adding tests

* docs

* tests

* revert unnecessary changes in tests

* test output type

* linting

* linting

* use from_delayed() instead

* docstring pycodestyle is happy with

* isort

* put pytest skips back

* respect sparse return type

* fix doc

* remove unnecessary dask_array_concatenate()

* Apply suggestions from code review

Co-authored-by: Nikita Titov <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nikita Titov <[email protected]>

* update predict_proba() docstring

* remove unnecessary np.array()

* Update python-package/lightgbm/dask.py

Co-authored-by: Nikita Titov <[email protected]>

* fix assertion

* fix test use of len()

* restore np.array() in tests

* use np.asarray() instead

* use toarray()

* remove empty functions in compat

Co-authored-by: Nikita Titov <[email protected]>