Linux CI jobs hang forever after completing all Python tests successfully #4948
Comments
For macOS, we have the following workaround to avoid conflicts between multiple instances of the libomp library (lines 120 to 123 at 4aaeb22):
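The referenced workflow lines are not reproduced above. As a rough sketch of what such a workaround typically looks like (not the actual CI code), Intel's OpenMP runtime can be told to tolerate a second loaded copy of the library:

```shell
# Sketch only, not LightGBM's actual workaround: this is the documented
# (and officially "unsafe") escape hatch of Intel's OpenMP runtime that
# turns the "multiple OpenMP runtimes loaded" abort into a warning.
export KMP_DUPLICATE_LIB_OK=TRUE
```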
Given that I remember only Azure Pipelines … (lines 76 to 79 at 4aaeb22)
LightGBM/.github/workflows/cuda.yml, lines 30 to 33 at 4aaeb22
Official suggested workarounds: …
I guess we are facing a conflict between the default Ubuntu system-wide OpenMP runtime and … Just for example, the latest … (lines 52 to 56 at 4aaeb22)
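A hypothetical way to confirm such a conflict, assuming the compiled extension ships as lib_lightgbm.so inside the installed lightgbm package (that path is an assumption, not taken from the CI config), is to list the OpenMP runtimes the binary actually links against:

```shell
# Hypothetical diagnostic: resolve the compiled extension inside the
# installed lightgbm package and show which OpenMP runtime(s)
# (libgomp / libomp / libiomp) it pulls in at load time.
ldd "$(python -c 'import lightgbm, os; print(os.path.join(os.path.dirname(lightgbm.__file__), "lib_lightgbm.so"))')" | grep -i omp
```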
According to the …
If none of the maintainers knows a better workaround than the one documented in …
This CI problem is quite annoying because it makes us re-run CI jobs multiple times after they time out, and it slows down the whole development process in the repository. @guolinke, @chivee, @shiyu1994, @tongwu-msft, @hzy46, @Laurae2, @jameslamb, @jmoralez
Excellent investigation, thank you! I vote for option 1, setting … I'd also like to ask… @xhochy, if you have time, could you advise us on this issue? I'm wondering if you've experienced a similar issue with the …
We have seen these issues a long time ago in other packages, but they shouldn't be occurring these days in a pure conda-forge setting, as we have safeguards in place that ensure only a single OpenMP implementation is installed at a time. This is especially important for … As this seems to be directly about failing CI in LightGBM, is there any usage of …
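For context (not part of the original comment): the conda-forge safeguard mentioned here is, as far as I understand it, the _openmp_mutex metapackage, so a quick way to inspect an environment is to list it; the environment name below is a placeholder.

```shell
# Sketch of a quick check: conda-forge's _openmp_mutex metapackage pins an
# environment to a single OpenMP implementation (GNU libgomp or LLVM libomp),
# so listing OpenMP-related packages shows which one is active and whether
# a second runtime slipped in from another channel.
conda list -n test-env | grep -iE '_openmp_mutex|openmp'
```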
Thanks very much, @xhochy!
Yes. For more context, we do not use … (line 117 at 4aaeb22)
And for jobs on Linux which use …
I wonder, maybe it's a good time to migrate from the default conda channel to the conda-forge one? Besides this particular issue with different libomp implementations, the default conda channel is extremely slow in terms of updates and lacks some packages required for our CI. Just some small examples:
- https://anaconda.org/conda-forge/dask-core …
- The LightGBM version at the default conda channel is …
- Requests for adding new and upgrading existing R packages tend to be ignored: ContinuumIO/anaconda-issues#11604, ContinuumIO/anaconda-issues#11571. For this reason, we have already migrated to conda-forge for building our docs: #4767.
- In addition, the conda-forge channel often supports more architectures (Miniforge): #4843 (comment).
- Download stats for LightGBM (especially for the recent versions) show that users already prefer conda-forge.

Just a reminder: it's better not to mix different channels in one environment, not only due to possible package conflicts, but also due to the long time and high memory consumption needed to resolve the environment specification during the installation phase (this matters for CI): #4054 (review), ContinuumIO/anaconda-issues#11604 (comment).
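As a sketch of the direction proposed above (the Python version and package set are placeholders, not the actual CI requirements), a conda-forge-only environment with strict channel priority could be created like this:

```shell
# Sketch only: keep every package on conda-forge and forbid silent
# fallbacks to other channels during the solve.
conda config --set channel_priority strict
conda create -y -n test-env -c conda-forge --override-channels \
    python=3.9 numpy scipy scikit-learn pandas dask
```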
@StrikerRUS thanks for all the research! I strongly support moving LightGBM's CI to using only conda-forge.
@jameslamb Thank you very much! Before we start, let's wait for some other opinions... @guolinke @shiyu1994 @tongwu-msft @hzy46 @Laurae2 @jmoralez
I also support using conda-forge, and maybe we could consider using mamba in CI; the time needed to solve environments and install packages is reduced significantly.
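A rough sketch of that suggestion (package names are placeholders again): mamba exposes the same command-line interface as conda, so it can simply take over the environment solve.

```shell
# Sketch only: install mamba into the base environment from conda-forge,
# then let it perform the much faster solve for the test environment.
conda install -y -n base -c conda-forge mamba
mamba create -y -n test-env -c conda-forge python=3.9 numpy scipy scikit-learn
```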
@jmoralez …
@StrikerRUS @guolinke could I have "Write" access on https://github.com/guolinke/lightgbm-ci-docker? I realized today that to make this change to … Otherwise, I'll have to make a PR from my fork of …
@jameslamb Good idea, but unfortunately I have no rights to grant you "write" access.
This issue has been automatically locked since there has not been any recent activity since it was closed.
This problem started to time out our CI jobs about 5 days ago. The most frequent CI jobs that run over the allowed 60 min limit are Linux_latest regular and Linux_latest sdist at Azure Pipelines. Also, I just saw CUDA Version / cuda 10.0 pip (linux, clang, Python 3.8) encounter the same problem. From the test logs I guess that the root cause is connected to the following warning message from the joblib/threadpoolctl package:
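The warning text itself was not captured above. As a hypothetical way to reproduce the diagnosis locally, assuming the threadpoolctl package (the joblib/threadpoolctl project named above) is installed in the test environment, one can print every BLAS/OpenMP runtime loaded into the process:

```shell
# Hypothetical reproduction step: importing lightgbm loads its compiled
# extension, and threadpool_info() then reports every BLAS/OpenMP runtime
# found in the process, which makes a duplicate-OpenMP situation visible.
python -c "import lightgbm, pprint; from threadpoolctl import threadpool_info; pprint.pprint(threadpool_info())"
```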