[Linux H100] containers with images considered "legacy" failed to do docker run #6238

Open
yangw-dev opened this issue Jan 30, 2025 · 3 comments

Comments

@yangw-dev

yangw-dev commented Jan 30, 2025

description

When the AO tests run on H100, the Linux test job is not consistent: when the image mode is auto-detected as 'legacy', docker run fails.

Error peek

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: error parsing IMEX info: unsupported IMEX channel value: all: unknown.

example

AO test
failure example: https://github.com/pytorch/ao/actions/runs/13042725250/job/36387841392
success example: https://github.com/pytorch/ao/actions/runs/12999348107/job/36254475921
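To tell this failure apart from other docker run errors when triaging job logs, a small check for the error string quoted above can help. This is only a sketch: `has_imex_error` is a hypothetical helper name, not part of any existing tooling, and it simply matches the fixed text from the error message in this issue.

```shell
# Hypothetical helper: succeeds (exit 0) when the given log text contains
# the IMEX parsing error quoted in this issue, fails otherwise.
has_imex_error() {
  printf '%s\n' "$1" | grep -qF 'nvidia-container-cli: error parsing IMEX info'
}

# Example with a shortened copy of the stderr text from the failing job.
sample_log="stderr: Auto-detected mode as 'legacy' nvidia-container-cli: error parsing IMEX info: unsupported IMEX channel value: all: unknown."

if has_imex_error "$sample_log"; then
  echo "docker run failed with the IMEX parsing error"
fi
```

In practice the argument would be the captured stderr of the failing docker run step rather than a hard-coded string.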

@yangw-dev

The bug might be related to some pet instances having legacy images, which causes the docker run error due to NVIDIA/nvidia-container-toolkit#797.

@huydhn

huydhn commented Feb 7, 2025

I think we mitigated this a while ago by pinning nvidia-container-toolkit (#5852). Maybe we need to do the same here.

cc @jeanschmidt @ZainRizvi

@huydhn

huydhn commented Feb 7, 2025

I manually installed the older version of nvidia-container-toolkit on the broken runner i-0d3ed1ff3ccbeec77 with apt-get install nvidia-container-toolkit=1.16.2-1 nvidia-container-toolkit-base=1.16.2-1 to get the runner up for now.

https://github.com/pytorch/ao/actions/runs/13190616812/job/36824369092
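The manual fix above could be scripted roughly as follows. This is a sketch under assumptions, not the actual change from #5852: the package names and version come from the comment above, while the function names, the `--allow-downgrades` flag (needed for a non-interactive downgrade with `-y`), and the `apt-mark hold` step (to stop a later upgrade from pulling the newer version back in) are additions for illustration.

```shell
# Version and package names are taken from the manual fix in this thread.
PINNED_VERSION="1.16.2-1"
PACKAGES="nvidia-container-toolkit nvidia-container-toolkit-base"

# Print one "package=version" argument per line for apt-get.
pinned_args() {
  for pkg in $PACKAGES; do
    printf '%s=%s\n' "$pkg" "$PINNED_VERSION"
  done
}

# Hypothetical wrapper: downgrade to the pinned version, then hold it.
# Assumes an apt-based runner and sudo access.
pin_toolkit() {
  # Word splitting of $(pinned_args) and $PACKAGES is intentional here.
  sudo apt-get install -y --allow-downgrades $(pinned_args) &&
    sudo apt-mark hold $PACKAGES
}
```

Calling `pin_toolkit` on an affected runner would perform the downgrade; `apt-mark unhold` on the same packages reverses the hold once a fixed toolkit version is available.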
