[tests][dask] fix workers without data test (fixes #5537) #5544
Conversation
Thanks for working on this!
Before I review, could you please explain, either in the description or in a comment here, what the root cause of the timeouts observed in #5537 were and how this fixes it?
The diff has a lot more changes in it than I'd expected from the description "have three workers and data in only two of them". For example, it's not obvious to me how that type of change would require bringing in new imports from sklearn.metrics.
I'm especially interested in the change where we're now passing an explicit list of workers= to client.persist(). Was the root problem just that "collection_to_single_partition() was resulting in, for example, dX being all on the first worker and dy being all on the second worker"? If so, couldn't the same effect be accomplished with the existing 2-worker cluster?
I couldn't reproduce the timeouts locally.
The purpose of the test was to check that, in a distributed training setting, having a worker without data wouldn't cause a problem. That's why I changed it to three workers with data on only two of them, so that distributed training is actually performed (because two workers have data), whereas previously it was probably falling back to just local training. The definition of "succeeds" is that we achieve a good metric value (hence the new imports from sklearn.metrics).
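The workers= pinning discussed above can be illustrated with a minimal, hypothetical sketch (this is not the PR's actual test code): a three-worker local cluster where both collections are persisted onto the same two workers, leaving the third worker without data. The cluster parameters and array shapes here are illustrative assumptions; the relevant part is passing an explicit workers= list to client.persist() so the data placement is deterministic.

```python
import dask.array as da
from distributed import Client, LocalCluster

# Illustrative cluster: three in-process workers (sizes/params are assumptions).
cluster = LocalCluster(
    n_workers=3, threads_per_worker=1, processes=False, dashboard_address=None
)
client = Client(cluster)

# Pick only two of the three worker addresses.
two_workers = list(client.scheduler_info()["workers"])[:2]

dX = da.random.random((100, 4), chunks=(50, 4))
dy = da.random.random((100,), chunks=(50,))

# Pin both collections to the SAME two workers, so dX and dy don't end up on
# disjoint workers, and the third worker is guaranteed to hold no data.
dX = client.persist(dX, workers=two_workers)
dy = client.persist(dy, workers=two_workers)

total = float(dX.sum().compute())

client.close()
cluster.close()
```

With this placement, a distributed trainer sees two workers with data (so distributed training actually happens) and one without, which is exactly the scenario the test is meant to exercise.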
previously it was probably falling back to just local training
OHHHHHHHHHHHH. Thank you so much for that explanation. Makes sense to me!
Let's merge this then 😁
Will merge this whenever we get CI working again 😭 Hopefully if/when #5514 is merged.
Thanks so much for fixing this @jmoralez!
Changes
tests/python_package_test/test_dask.py::test_training_succeeds_even_if_some_workers_do_not_have_any_data
to have three workers and data in only two of them. Fixes #5537.