This repository has been archived by the owner on Jul 16, 2021. It is now read-only.

xgboost does not train from existing model in distributed environment #70

Open
trieuat opened this issue Mar 15, 2020 · 6 comments

trieuat commented Mar 15, 2020

When continuing training of an xgboost model from an existing model in a distributed environment with more than 3 workers, xgboost does not train: nothing happens on the workers and the call never finishes. With a local cluster, or a distributed cluster with fewer than 3 workers, training runs and finishes.

dxgb.train(client, params, X_train, y_train,
xgb_model=existing_model,...)
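
For reference, a fuller sketch of the failing call pattern (a hedged reconstruction, not the reporter's actual script: the cluster address, file names, and params below are placeholders):

```python
import dask.dataframe as dd
import dask_xgboost as dxgb
import xgboost as xgb
from dask.distributed import Client

client = Client("tcp://scheduler:8786")          # placeholder: distributed cluster with >3 workers

df = dd.read_parquet("train.parquet")            # placeholder training data
X_train = df.drop("label", axis=1)
y_train = df["label"]

existing_model = xgb.Booster()
existing_model.load_model("existing_model.bin")  # placeholder path to the previous model

params = {"objective": "binary:logistic", "tree_method": "hist"}

# dask-xgboost forwards extra keyword arguments to xgboost.train,
# so xgb_model should continue boosting from the loaded Booster.
booster = dxgb.train(client, params, X_train, y_train,
                     num_boost_round=10, xgb_model=existing_model)
```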

trieuat changed the title from "xgboost does not train from existing model in distributed client" to "xgboost does not train from existing model in distributed environment" on Mar 15, 2020
TomAugspurger (Member) commented

Do you know why that might be?

Do things work if you use the native dask integration? https://xgboost.readthedocs.io/en/latest/tutorials/dask.html
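
(For concreteness, a rough sketch of what the native integration looks like, assuming xgboost >= 1.0 with the dask module; the scheduler address, file names, and params are placeholders, and whether xgb.dask.train accepts xgb_model depends on the xgboost version:)

```python
import dask.dataframe as dd
import xgboost as xgb
from dask.distributed import Client

client = Client("tcp://scheduler:8786")          # placeholder scheduler address

df = dd.read_parquet("train.parquet")            # placeholder training data
X_train = df.drop("label", axis=1)
y_train = df["label"]

dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

prev = xgb.Booster()
prev.load_model("existing_model.bin")            # placeholder path

# Newer xgboost releases accept xgb_model here to continue from a booster.
output = xgb.dask.train(client,
                        {"objective": "binary:logistic", "tree_method": "hist"},
                        dtrain, num_boost_round=10, xgb_model=prev)
booster = output["booster"]
```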

trieuat (Author) commented Mar 16, 2020

I don't know why. The native dask integration in the link can train from an existing model.

However, I have a different problem with it: its predictive performance is essentially random in a distributed environment, versus good performance from dask-xgboost with the same parameters and data.

TomAugspurger (Member) commented Mar 19, 2020 via email

trieuat (Author) commented Mar 26, 2020

I can create a performance report, but the problem is that training does not seem to happen and never finishes, even if I build just one tree. CPU usage on all workers is 2-6% (vs. roughly 100% or more if I remove the xgb_model parameter). If you have any suggestions for how to debug it, please let me know.
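
(A sketch of how that report could be captured with dask.distributed's performance_report context manager; the filename and the surrounding training call are placeholders:)

```python
from dask.distributed import performance_report

# Wrap the (apparently hanging) call so the scheduler records task activity for the run.
with performance_report(filename="xgb-continue-training.html"):
    dxgb.train(client, params, X_train, y_train,
               num_boost_round=1, xgb_model=existing_model)
```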

jakirkham (Member) commented Mar 26, 2020

Just a blind guess, but have you tried deleting the dask-worker-space and storage directories that Dask creates?

They will be wherever temporary-directory is set. That would most likely be configured in ~/.config/dask/dask.yaml, but it could be set elsewhere depending on what your code is doing. If unspecified, they will be in the same directory you ran the script or notebook from.
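
(A small sketch of checking or overriding that setting from Python, assuming the standard dask.config machinery; the override path is illustrative:)

```python
import dask

# Where dask-worker-space and spilled data go; None means the current working directory.
print(dask.config.get("temporary-directory", None))

# Can also be overridden in code instead of ~/.config/dask/dask.yaml.
dask.config.set({"temporary-directory": "/tmp/dask-scratch"})  # illustrative path
```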

trieuat (Author) commented Mar 31, 2020

Thanks for the suggestion. I had dask-worker-space in my folder; after removing it, I can train with several workers on a small dataset. So I moved to submitting the job with skein instead of using my edge node as the client. But when I increase the dataset size (still under 1 GB), it fails to train again. Looking at the logs, I can see that only one worker started the hist algorithm but did not progress to building any tree, and nothing happened on the other workers.
