
[ci] CUDA CI jobs are broken: "driver/library version mismatch" #5546

Closed
jameslamb opened this issue Oct 18, 2022 · 7 comments

@jameslamb (Collaborator)

Description

The CUDA CI jobs for this project are all failing with the following error.

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown.
Error: Process completed with exit code 125.

Reproducible example

References

Here is the line where these jobs are failing.

docker run --env-file docker.env -v "$GITHUB_WORKSPACE":"$ROOT_DOCKER_FOLDER" --rm --gpus all "$docker_img" /bin/bash $ROOT_DOCKER_FOLDER/docker-script.sh

The answers at https://stackoverflow.com/questions/43022843/nvidia-nvml-driver-library-version-mismatch suggest that this issue could be resolved by rebooting the machine.
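For reference, a minimal diagnostic sequence to run on the host, assuming a standard Linux NVIDIA driver install (reloading the kernel modules is a lighter-weight alternative that sometimes avoids a full reboot):

# Compare the loaded kernel module against the userspace driver library.
nvidia-smi                          # fails with "Driver/library version mismatch" when they disagree
cat /proc/driver/nvidia/version     # version of the kernel module currently loaded
modinfo nvidia | grep ^version      # version of the module installed on disk

# If the versions differ (e.g. after an unattended driver upgrade),
# try reloading the NVIDIA kernel modules...
sudo rmmod nvidia_drm nvidia_modeset nvidia_uvm nvidia
sudo modprobe nvidia

# ...or simply reboot so the matching module is loaded.
sudo reboot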

@jameslamb (Collaborator, Author)

@shiyu1994 since you are the only person with administrative access to the machine the CUDA jobs run on, could you please try rebooting that machine and investigate other fixes for this?

I'm happy to help with other research however I can, but you are the only person who can reboot the machine.

@jameslamb jameslamb changed the title [ci] CUDA CI jobs are broken [ci] CUDA CI jobs are broken: "driver/library version mismatch" Oct 18, 2022
@jameslamb (Collaborator, Author)

I just re-triggered a CUDA job... this is still broken.

https://github.com/microsoft/LightGBM/actions/runs/3281397887/jobs/5452120895

@shiyu1994 is there any way I can help you resolve this?

@jameslamb (Collaborator, Author)

I just triggered another run and this is still happening.

https://github.com/microsoft/LightGBM/actions/runs/3350110051/jobs/5563225006

@shiyu1994 I really hope you're able to get to this soon. @ me any time if there's any way I can help.

@shiyu1994 (Collaborator)

@jameslamb Sorry for the long delay.

I've fixed the virtual machine, and the CI tests should now be able to run.
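A quick sanity check that container GPU access is restored after such a fix, assuming any CUDA base image is available (the image tag below is illustrative, not necessarily the one this CI uses):

docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi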

@jameslamb (Collaborator, Author)

Excellent, thanks @shiyu1994! I'll try re-running the checks from #5545 right now. If they work, I'll work on merging some of the approved PRs today.

@jameslamb (Collaborator, Author)

🎉 🎉 🎉

That worked!

https://github.com/microsoft/LightGBM/actions/runs/3281397887/jobs/5609753729


Thank you so much for the help @shiyu1994 !

@github-actions (bot)

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot removed the blocking label Aug 19, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023