This repository has been archived by the owner on Jul 16, 2021. It is now read-only.

xgboost does not train from existing model in distributed environment #70

Open
trieuat opened this issue Mar 15, 2020 · 6 comments

trieuat commented Mar 15, 2020

When continuing training of an xgboost model from an existing model in a distributed environment with more than 3 workers, xgboost does not train: nothing happens on the workers and the call never finishes. With a local cluster, or a distributed cluster with fewer than 3 workers, training runs and finishes.

dxgb.train(client, params, X_train, y_train,
xgb_model=existing_model,...)
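
For reference, a fuller sketch of the failing call pattern (a hedged reconstruction, not the reporter's actual script: the cluster address, file names, and params below are placeholders):

```python
import dask.dataframe as dd
import dask_xgboost as dxgb
import xgboost as xgb
from dask.distributed import Client

client = Client("tcp://scheduler:8786")          # placeholder: distributed cluster with >3 workers

df = dd.read_parquet("train.parquet")            # placeholder training data
X_train = df.drop("label", axis=1)
y_train = df["label"]

existing_model = xgb.Booster()
existing_model.load_model("existing_model.bin")  # placeholder path to the previous model

params = {"objective": "binary:logistic", "tree_method": "hist"}

# dask-xgboost forwards extra keyword arguments to xgboost.train,
# so xgb_model should continue boosting from the loaded Booster.
booster = dxgb.train(client, params, X_train, y_train,
                     num_boost_round=10, xgb_model=existing_model)
```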

trieuat changed the title from "xgboost does not train from existing model in distributed client" to "xgboost does not train from existing model in distributed environment" on Mar 15, 2020
TomAugspurger (Member) commented

Do you know why that might be?

Do things work if you use the native dask integration? https://xgboost.readthedocs.io/en/latest/tutorials/dask.html
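
(For concreteness, a rough sketch of what the native integration looks like, assuming xgboost >= 1.0 with the dask module; the scheduler address, file names, and params are placeholders, and whether xgb.dask.train accepts xgb_model depends on the xgboost version:)

```python
import dask.dataframe as dd
import xgboost as xgb
from dask.distributed import Client

client = Client("tcp://scheduler:8786")          # placeholder scheduler address

df = dd.read_parquet("train.parquet")            # placeholder training data
X_train = df.drop("label", axis=1)
y_train = df["label"]

dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

prev = xgb.Booster()
prev.load_model("existing_model.bin")            # placeholder path

# Newer xgboost releases accept xgb_model here to continue from a booster.
output = xgb.dask.train(client,
                        {"objective": "binary:logistic", "tree_method": "hist"},
                        dtrain, num_boost_round=10, xgb_model=prev)
booster = output["booster"]
```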

trieuat (Author) commented Mar 16, 2020

I don't know why. The native dask integration in the link can train from an existing model.

However, I have a different problem with it: its predictive performance is essentially random in a distributed environment, versus good performance from dask-xgboost with the same parameters and data.

TomAugspurger (Member) commented Mar 19, 2020 via email

trieuat (Author) commented Mar 26, 2020

I can create a performance report, but the problem is that training does not seem to happen and never finishes, even if I build just one tree. CPU usage on all workers is 2-6% (vs. roughly 100% or more if I remove the xgb_model parameter). If you have any suggestions for how to debug it, please let me know.
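
(A sketch of how that report could be captured with dask.distributed's performance_report context manager; the filename and the surrounding training call are placeholders:)

```python
from dask.distributed import performance_report

# Wrap the (apparently hanging) call so the scheduler records task activity for the run.
with performance_report(filename="xgb-continue-training.html"):
    dxgb.train(client, params, X_train, y_train,
               num_boost_round=1, xgb_model=existing_model)
```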

jakirkham (Member) commented Mar 26, 2020

Just a blind guess, but have you tried deleting the dask-worker-space and storage directories that Dask creates?

They will be wherever temporary-directory is set. That would most likely be configured in ~/.config/dask/dask.yaml, but it could be set elsewhere depending on what your code is doing. If unspecified, they will be in the same directory you ran the script or notebook from.
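
(A small sketch of checking or overriding that setting from Python, assuming the standard dask.config machinery; the override path is illustrative:)

```python
import dask

# Where dask-worker-space and spilled data go; None means the current working directory.
print(dask.config.get("temporary-directory", None))

# Can also be overridden in code instead of ~/.config/dask/dask.yaml.
dask.config.set({"temporary-directory": "/tmp/dask-scratch"})  # illustrative path
```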

trieuat (Author) commented Mar 31, 2020

Thanks for the suggestion. I had dask-worker-space in my folder; after removing it, I can train with several workers on a small dataset. So I moved to submitting the job with skein instead of using my edge node as the client. But when I increase the dataset size (still under 1 GB), it fails to train again. Looking at the logs, I can see that only one worker started the hist algorithm but did not progress to building any tree, and nothing happened on the other workers.
