[ci] prefer CPython in Windows test environment and use safer approach for cleaning up network (fixes #5509) #5510
This worked! (https://ci.appveyor.com/project/guolinke/lightgbm/builds/44896165/job/xvf4hw3hv45lchyf) Pushing another commit re-enabling all the other CI that I'd skipped just to save time, but then I think we should adopt this fix. @shiyu1994 @StrikerRUS |
This reverts commit 704aea4.
It seems the Linux jobs are getting stuck as well in this PR. I investigated a bit while working on #5505, but the environments seemed the same; it's strange that it started to happen now. Also, they sometimes randomly pass. |
😭 this is getting so complicated It looks like the two Azure DevOps Linux jobs that are getting stuck (have been running for more than 50 mins) are both it's the next thing on my list to try to upgrade that image, so that we can hopefully remove the pin on I'm going to just try manually rebuilding the timed-out Linux jobs here when they fail, to at least make some forward progress. "We sometimes have to manually re-run builds" is a bad state to be in, but not as bad as "Appveyor is failing on every commit and nothing can be merged to |
I've tried rebuilding the timed-out CUDA and Azure jobs a few times today, hoping to get lucky and have the builds not hit the Dask timeouts...so far, I haven't been successful. I'll be traveling for the next few days, and I'm not sure how much I'll be able to work on LightGBM during that time. @shiyu1994 @jmoralez @StrikerRUS if you're able to find a workaround for the Dask issues, it's ok with me if you want to push such fixes directly to this branch, so we can have one PR that resolves the CI issues. |
Thanks. Good to see that the Windows environment issue is resolved. Let's see whether we can fix the Dask issue together in this PR. |
I can try to manually debug the Dask tests on our self-hosted CUDA CI agent. Hopefully we can find a solution soon. |
We could also try pinning all |
I think I've found the root of the timeout of Dask tests. This is because However, because in our Then comes the issue. When a booster A is created in test case 1, it still exists in the memory space of the process. Then when we move to test case 2, a new booster B will be created. However, at this moment, somehow the garbage collection is triggered, and booster A is recycled. And the LightGBM/python-package/lightgbm/basic.py Lines 2826 to 2836 in dc4794b
which will deallocate all the network connections by calling free_network !LightGBM/python-package/lightgbm/basic.py Lines 2919 to 2929 in dc4794b
Lines 2511 to 2515 in dc4794b
Then booster A will train as a single process program since the num_machines_ is set to 1 .LightGBM/src/network/network.cpp Lines 60 to 66 in dc4794b
However, since we are training in a distributed way, there's another process waiting for a response from the process running booster A. That process gets stuck forever, since booster A's process never responds. I've pushed a quick workaround to this branch, which enforces using a new Could anybody familiar with Dask provide a better solution? In short, we want to use new processes for distributed training in each test case. |
Wow, great investigation!!! I wonder if this is the cause of reports like #4771 or #4942. I think creating a new Dask cluster every time is not a good solution...LightGBM's users wouldn't be happy with having to do that, since it would mean that, for example, the workflow of "initialize a Dask DataFrame, keep it in distributed memory with Instead, could we have each distributed training process allocate its own |
I agree with @jameslamb on this; users may want to run consecutive trainings in the same process when doing things like hyperparameter tuning. Also, this has been the way we run the Dask tests for more than a year (changed in #4159). @shiyu1994, can you think of a recent change that would cause this to start failing now? |
I tested on mac and in one of the jobs there are many errors that we've previously seen for mac: Edit: So I believe you're definitely right @shiyu1994. |
I'm going to try implementing a fix here where, at the end of training, the Python package explicitly calls I still think the ideal solution is the one described in #5510 (comment), because the Dask-only fix I'm proposing has the following drawbacks:
If my proposed fix seems to work, I think we should adopt it to unblock CI and then I'll document this issue in a separate issue that could be worked on. |
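The general idea of cleaning up explicitly at the end of training, instead of relying on a destructor that runs whenever garbage collection happens, can be sketched as follows. This is a minimal illustration under stated assumptions: `ToyNetwork` and `train_with_cleanup` are hypothetical names, not part of LightGBM or Dask, and the actual change in this PR may look different.

```python
class ToyNetwork:
    """Hypothetical process-wide network state (illustration only)."""

    def __init__(self):
        self.num_machines = 1

    def init(self, num_machines):
        self.num_machines = num_machines

    def free(self):
        self.num_machines = 1


def train_with_cleanup(network, num_machines, train_fn):
    """Run train_fn with the network set up, then always tear it down."""
    network.init(num_machines)
    try:
        return train_fn()
    finally:
        # Deterministic cleanup: runs right after training, even on errors,
        # so no later finalizer is needed to release the network.
        network.free()


def demo():
    network = ToyNetwork()
    seen = []
    for _ in range(2):  # two consecutive trainings in the same process
        seen.append(train_with_cleanup(network, 2, lambda: network.num_machines))
    return seen, network.num_machines
```

Because the teardown happens inside `finally`, each training run starts from a clean state and leaves one behind, regardless of when older objects are garbage collected.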
This reverts commit 7ff172d.
update: that So it seems like the only problematic test might be Going to continue investigating that next. |
I know it is difficult to follow all the debugging comments and commits here, but CI is passing and this PR is ready for review 🎉 After switching from I think we should
|
Awesome job! And thanks @shiyu1994 for the insights.
Thanks to both of you for your help! This was a really difficult one. It would be interesting in the future to work on changing the strategy for how I'm going to merge this and start updating / merging some of the other approved PRs, starting with #5506. |
Sorry for the late response. I just returned from our one-week national holiday. |
no problem, welcome back! Please look at my comment in #5502 (comment) as soon as possible and respond there...I'm nervous that the R package might be in danger of being archived on CRAN. |
Fixes #5507.
conda sometimes "downgrades" Python from a CPython build to a PyPy build in our Windows CI jobs. This has historically been because of dependencies introduced by either the python-graphviz or matplotlib conda packages.
This PR proposes trying to prevent that situation by explicitly passing flag --no-update-deps.
From the conda docs (link)
This PR proposes explicitly installing python={version}[build=*cpython] to prevent conda environment solves from switching to PyPy-based builds of Python.
I think this should be possible, based on https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/pkg-specs.html#package-match-specifications.
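
As a rough illustration of the package spec described above, the following Python sketch builds the match specification and a `conda install` argument list. The helper names `cpython_spec` and `conda_install_args`, the environment name, and the version are hypothetical; the sketch only constructs the arguments and does not run conda. It assumes that CPython build strings match the `*cpython` glob.

```python
def cpython_spec(version):
    """Build a conda match spec that pins Python to a CPython build.

    Assumes CPython build strings match the glob '*cpython'.
    """
    return f"python={version}[build=*cpython]"


def conda_install_args(env_name, version, extra_packages=()):
    """Return an argument list for 'conda install' (hypothetical helper)."""
    return [
        "conda", "install", "--yes",
        "--name", env_name,
        cpython_spec(version),
        *extra_packages,
    ]


# Example: arguments for a hypothetical environment named "test-env".
args = conda_install_args("test-env", "3.9", ["matplotlib", "python-graphviz"])
```

Passing the arguments as a list (for example to `subprocess.run`) avoids shell quoting problems with the square brackets and the `*` in the spec.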