[dask] Result shape from DaskLGBMClassifier.predict(pred_contrib=True) for CSC matrices is inconsistent with LGBMClassifier #3881

jameslamb · 2021-01-29T15:46:56Z

See the discussion in #3866 (comment) for full details.

lightgbm.dask.DaskLGBMClassifier tries to stay as close as possible to the API of lightgbm.sklearn.LGBMClassifier. This feature request describes one known inconsistency.

In lightgbm.sklearn.LGBMClassifier, for multiclass classification tasks, if you call .predict(X, pred_contrib=True) and X is a scipy.sparse.csc_matrix, the result will be a list of CSC matrices, 1 per class.

In lightgbm.dask.DaskLGBMClassifier, for multiclass classification tasks, if you call .predict(X, pred_contrib=True) and X is a Dask Array whose partitions are each a scipy.sparse.csc_matrix, the result will be a Dask Array that, once .compute()'d, returns a single scipy.sparse.coo_matrix rather than a list of per-class matrices.

To complete this feature, try to make Dask's behavior match the behavior from lightgbm.sklearn.LGBMClassifier, or document why that can't / shouldn't be done.
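To make the two result shapes concrete, here is a minimal sketch that uses only NumPy and SciPy and does not call LightGBM or Dask. It builds a list with one CSC matrix per class, then a single COO matrix holding the same values, and shows one way to convert the second form back into the first. The column layout of the single matrix (one block of n_features + 1 columns per class, side by side) is an assumption made for illustration, and split_by_class is a hypothetical helper, not part of either library.

```python
import numpy as np
from scipy import sparse

n_samples, n_features, n_classes = 5, 3, 4
width = n_features + 1  # one contribution per feature, plus the expected value

rng = np.random.default_rng(0)

# Form 1: a list of CSC matrices, one per class (the LGBMClassifier result
# shape described in this issue). Values here are random placeholders.
per_class = [
    sparse.csc_matrix(rng.random((n_samples, width)))
    for _ in range(n_classes)
]

# Form 2: a single COO matrix. ASSUMPTION: class blocks sit side by side,
# so the shape is (n_samples, n_classes * (n_features + 1)).
stacked = sparse.hstack(per_class, format="coo")


def split_by_class(matrix, n_classes):
    """Hypothetical helper: split a column-stacked matrix into per-class CSC matrices."""
    csc = matrix.tocsc()  # COO matrices do not support slicing
    block = csc.shape[1] // n_classes
    return [csc[:, i * block:(i + 1) * block] for i in range(n_classes)]


recovered = split_by_class(stacked, n_classes)
```

If the real Dask result uses a different column order, the slicing in split_by_class would need to change accordingly.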


jameslamb · 2021-01-29T16:04:47Z

Added this to #2302, where we store feature requests for this project. Anyone is welcome to contribute this feature. Leave a comment below if you'd like to pick it up and the issue can be re-opened.

StrikerRUS · 2021-01-29T16:24:22Z

One addition:

In lightgbm.sklearn.LGBMClassifier, for multiclass classification tasks, if you call .predict(X, pred_contrib=True) and X is a scipy.sparse.csc_matrix, the result will be a list of CSC matrices, 1 per class.

The same is true for CSR matrix as well.

See the following core Python API test to better understand what is expected:

LightGBM/tests/python_package_test/test_engine.py

Lines 1058 to 1100 in 217642c

    
    def test_contribs_sparse_multiclass():
        n_features = 20
        n_samples = 100
        n_labels = 4
        # generate CSR sparse dataset
        X, y = make_multilabel_classification(n_samples=n_samples,
                                              sparse=True,
                                              n_features=n_features,
                                              n_classes=1,
                                              n_labels=n_labels)
        y = y.flatten()
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
        params = {
            'objective': 'multiclass',
            'num_class': n_labels,
            'verbose': -1,
        }
        lgb_train = lgb.Dataset(X_train, y_train)
        gbm = lgb.train(params, lgb_train, num_boost_round=20)
        contribs_csr = gbm.predict(X_test, pred_contrib=True)
        assert isinstance(contribs_csr, list)
        for perclass_contribs_csr in contribs_csr:
            assert isspmatrix_csr(perclass_contribs_csr)
        # convert data to dense and get back same contribs
        contribs_dense = gbm.predict(X_test.toarray(), pred_contrib=True)
        # validate the values are the same
        contribs_csr_array = np.swapaxes(np.array([sparse_array.todense() for sparse_array in contribs_csr]), 0, 1)
        contribs_csr_arr_re = contribs_csr_array.reshape((contribs_csr_array.shape[0],
                                                          contribs_csr_array.shape[1] * contribs_csr_array.shape[2]))
        np.testing.assert_allclose(contribs_csr_arr_re, contribs_dense)
        contribs_dense_re = contribs_dense.reshape(contribs_csr_array.shape)
        assert np.linalg.norm(gbm.predict(X_test, raw_score=True) - np.sum(contribs_dense_re, axis=2)) < 1e-4
        # validate using CSC matrix
        X_test_csc = X_test.tocsc()
        contribs_csc = gbm.predict(X_test_csc, pred_contrib=True)
        assert isinstance(contribs_csc, list)
        for perclass_contribs_csc in contribs_csc:
            assert isspmatrix_csc(perclass_contribs_csc)
        # validate the values are the same
        contribs_csc_array = np.swapaxes(np.array([sparse_array.todense() for sparse_array in contribs_csc]), 0, 1)
        contribs_csc_array = contribs_csc_array.reshape((contribs_csc_array.shape[0],
                                                         contribs_csc_array.shape[1] * contribs_csc_array.shape[2]))
        np.testing.assert_allclose(contribs_csc_array, contribs_dense)
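The swapaxes and reshape steps in that test are the part that is easy to misread, so here is a small sketch of just that logic on synthetic data. It does not train a model: the per-class blocks are random placeholders, and contribs_dense is a stand-in built with np.hstack on the assumption that the dense result places the class blocks side by side, which is what the reshape in the test implies.

```python
import numpy as np
from scipy import sparse

n_samples, n_features, n_classes = 6, 3, 4
width = n_features + 1  # feature contributions plus the expected value

rng = np.random.default_rng(42)
dense_blocks = [rng.random((n_samples, width)) for _ in range(n_classes)]

# Stand-in for the sparse result: a list with one CSR matrix per class.
contribs_sparse = [sparse.csr_matrix(block) for block in dense_blocks]

# Stand-in for the dense result (ASSUMED layout: class blocks side by side).
contribs_dense = np.hstack(dense_blocks)

# (n_classes, n_samples, width) -> (n_samples, n_classes, width)
stacked = np.swapaxes(np.array([m.toarray() for m in contribs_sparse]), 0, 1)

# Flatten the last two axes to get the same 2-D layout as the dense result.
flattened = stacked.reshape(n_samples, n_classes * width)
np.testing.assert_allclose(flattened, contribs_dense)

# Going the other way: reshape the dense result to 3-D and sum over the last
# axis to get one total per (sample, class). The test compares this quantity
# with the raw scores.
per_class_totals = contribs_dense.reshape(stacked.shape).sum(axis=2)
```

The sketch uses toarray() rather than todense() so that the intermediate values are plain ndarrays instead of np.matrix objects.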

jameslamb · 2021-06-13T05:03:33Z

Re-opening this to note that I am currently working on a fix, to try to unblock #4351.

…rices match those from sklearn estimators (fixes #3881) (#4378)

* test_classifier working
* adding tests
* docs
* tests
* revert unnecessary changes in tests
* test output type
* linting
* linting
* use from_delayed() instead
* docstring pycodestyle is happy with
* isort
* put pytest skips back
* respect sparse return type
* fix doc
* remove unnecessary dask_array_concatenate()
* Apply suggestions from code review
  Co-authored-by: Nikita Titov
* Apply suggestions from code review
  Co-authored-by: Nikita Titov
* update predict_proba() docstring
* remove unnecessary np.array()
* Update python-package/lightgbm/dask.py
  Co-authored-by: Nikita Titov
* fix assertion
* fix test use of len()
* restore np.array() in tests
* use np.asarray() instead
* use toarray()
* remove empty functions in compat
  Co-authored-by: Nikita Titov

jameslamb added feature request dask labels Jan 29, 2021

This was referenced Jan 29, 2021

[dask] Add type hints in Dask package #3866

Merged

Feature Requests & Voting Hub #2302

Open

jameslamb closed this as completed Jan 29, 2021

jameslamb mentioned this issue Jun 5, 2021

WIP: [ci] remove pin on dask and distributed in CI (fixes #4285) #4307

Closed

jameslamb self-assigned this Jun 13, 2021

jameslamb reopened this Jun 13, 2021

jameslamb mentioned this issue Jun 14, 2021

[dask] Make output of feature contribution predictions for sparse matrices match those from sklearn estimators (fixes #3881) #4378

Merged

jameslamb mentioned this issue Jul 4, 2021

[dask] preserve chunks in results of multi-class pred_contrib predictions on sparse matrices #4438

Closed

StrikerRUS closed this as completed in #4378 Jul 7, 2021
