[k8s] gpu labeler job incorrectly detects A100 as A10 gpu #2959
Comments
I seem to be experiencing this as well, but I don't appear to be affected when I train with a single-node cluster. Is there anything else that could be affected by such an error?
I don't think so. It seems this only affects the UX. When I launch
Thanks for the report @asaiacai - I think I know the issue. The GPU labeller script is lacking a break statement after L103: skypilot/sky/utils/kubernetes/k8s_gpu_labeler_setup.yaml Lines 98 to 105 in 1e53317
which causes To fix, can you add a break statement after L103 in
Unfortunately, I haven't been able to secure an A100 GPU on a k8s cluster to test this fix, but if you can test it and submit a PR, that would be great. Thanks! This will also be resolved by #2890, since that PR also adds a break.
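For anyone following along, here is a minimal sketch of why a missing break can turn an A100 into an A10. It is not the actual labeler code: the list contents, its ordering, the function names, and the sample device string are all assumptions made for illustration. The only idea taken from the thread is that the loop keeps going after the first match, so a later, shorter name that is a substring of the GPU name can overwrite the earlier label.

```python
# Hypothetical canonical names; the real list in the labeler script may differ.
# Order matters because matching is done by substring.
CANONICAL_GPU_NAMES = ["A100", "A10", "T4"]


def match_without_break(gpu_name):
    """Return the label chosen when the loop never stops early."""
    label = None
    for name in CANONICAL_GPU_NAMES:
        if name in gpu_name:
            # No break: a later match overwrites an earlier one.
            label = name
    return label


def match_with_break(gpu_name):
    """Return the label chosen when the loop stops at the first match."""
    label = None
    for name in CANONICAL_GPU_NAMES:
        if name in gpu_name:
            label = name
            break  # Stop at the first match, as suggested above.
    return label


# Assumed example of a reported device name.
reported = "NVIDIA A100-SXM4-40GB"
print(match_without_break(reported))
print(match_with_break(reported))
```

With this assumed ordering, "A10" is also a substring of the A100 device name, so the version without the break ends up with the wrong label. The fix relies on more specific names appearing earlier in the list.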
I tried what you said and now run
This issue seems to be resolved by #2890. Please feel free to reopen the issue if the problem still exists. :)
I'm testing a single-node k8s cluster on an A100:2, and after running the sky gpu labeler my available GPUs are A10s.