[dask] hold ports until training #5890
Seems like the actor approach won't work here due to dask/distributed#7842. Feel free to try the alternative here @jameslamb. EDIT: |
Very nice, thanks so much for working on this! This is a much better version of what I was kind of envisioning in #5865 (comment).
I'm unsure about exactly how it works, but mainly because of my own ignorance of Dask and TCP. If you tell me this is working for you on both LocalCluster and a multi-machine distributed Dask cluster, then I think it's worth the time for you to add the remaining type hints, tests, docs, etc. and I'll commit to testing it myself on both LocalCluster and some different configurations of multi-machine clusters from Coiled or with something like dask-cloudprovider.
Based on #5890 (comment), I'm feeling pretty optimistic about this approach! Awesome work. Some time this week, I'll try to test this on a multi-machine cluster using Coiled or
Seems like Coiled doesn't provide free credits for their infrastructure anymore, so I haven't been able to test this on a real cluster. I'll try to find an alternative because I don't have an AWS account.
I can test on my AWS account.
I tested this tonight on an AWS Elastic Container Service (ECS) cluster, using dask-cloudprovider: https://github.com/jameslamb/lightgbm-dask-testing/blob/main/notebooks/demo-aws.ipynb.
Happy to say it worked well!
I left one small comment, do what you want with it.
Great work, @jmoralez !!!
Looks great to me! You should be the one to click the button, @jmoralez 😁
Hey, only saw this now, but thanks for looking into it!
Contributes to #5865 by holding on to the ports that LightGBM will use during training for as long as possible on the Python side, thus decreasing the chance of a race condition (but not eliminating it).
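For readers unfamiliar with the idea, here is a minimal sketch of what "holding on to a port" can look like in plain Python. It is not the code from this PR: the helper names `hold_port` and `is_port_free` are hypothetical, and it only uses the standard `socket` module rather than anything from Dask or LightGBM. The point is that a bound socket keeps other processes from claiming the port, and that the race only reopens in the short window between closing the socket and the consumer binding the port itself.

```python
import socket


def hold_port(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Bind a TCP socket to (host, port) and return it still open.

    Hypothetical helper: port=0 asks the OS to pick any free port.
    The port stays reserved for as long as the returned socket is open.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, port))
    return sock


def is_port_free(host: str, port: int) -> bool:
    """Return True if a fresh socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Someone (possibly our own held socket) already owns the port.
            return False
    return True


if __name__ == "__main__":
    held = hold_port()
    host, port = held.getsockname()
    print(f"holding {host}:{port}; free while held? {is_port_free(host, port)}")

    # Release as late as possible, right before the real consumer binds.
    # Another process could still grab the port in this small window,
    # which is why the race is reduced rather than eliminated.
    held.close()
    print(f"released {host}:{port}; free after close? {is_port_free(host, port)}")
```

Under these assumptions, the longer the holder socket stays open before the training process needs the port, the smaller the window in which another process can take it.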