[k8s] gpu labeler job incorrectly detects A100 as A10 gpu #2959

Closed
asaiacai opened this issue Jan 9, 2024 · 5 comments

@asaiacai
Contributor

asaiacai commented Jan 9, 2024

I'm testing a single-node k8s cluster on an A100:2, and after running the sky GPU labeler my available GPUs are reported as A10s:

$ nvidia-smi
Tue Jan  9 00:34:54 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   27C    P0    41W / 400W |    5MiB / 40960MiB   |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  Off  | 00000000:00:06.0 Off |                    0 |
| N/A   25C    P0    49W / 400W |      5MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
$ python -m sky.utils.kubernetes.gpu_labeler

$ sky show-gpus --cloud kubernetes
COMMON_GPU  AVAILABLE_QUANTITIES  
A10         1, 2
@landscapepainter landscapepainter self-assigned this Jan 9, 2024
@fourfireM

I seem to be experiencing this as well, but it seems like I'm not affected when I train with a single node cluster. Is there anything else that could be affected by such an error?

@asaiacai
Contributor Author

asaiacai commented Jan 9, 2024

I don't think so. It seems like this only affects the UX. When I run nvidia-smi from within the pod, the GPU shows up as an A100.

@romilbhardwaj
Collaborator

Thanks for the report @asaiacai - I think I know the issue. The GPU labeller script is lacking a break statement after L103:

    def main():
        gpu_name = get_gpu_name()
        if gpu_name:
            for allowed_name in allowed_gpu_names:
                if allowed_name.lower() in gpu_name.lower():
                    label_node(allowed_name.lower())
        else:
            print('No supported GPU detected.')

Without the break, A10 also matches the A100 string ("a10" is a substring of "a100") and overrides the label. This will be fixed.

To fix, can you add a break statement after L103 in sky/utils/kubernetes/k8s_gpu_labeler_setup.yaml like so:

    def main():
        gpu_name = get_gpu_name()
        if gpu_name:
            for allowed_name in allowed_gpu_names:
                if allowed_name.lower() in gpu_name.lower():
                    label_node(allowed_name.lower())
                    break
        else:
            print('No supported GPU detected.')

Unfortunately I haven't been able to secure an A100 GPU on a k8s cluster to test this fix, but if you can test and submit a PR that'll be great. Thanks!

This will also be resolved by #2890, since that PR also adds a break.
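
For anyone who wants to see the behavior without a GPU, here is a minimal sketch of the matching logic described above. The contents of the allowed-names list, the sample GPU name string, and the helper function names are assumptions made for illustration; they are not taken from the SkyPilot source.

```python
# Hypothetical allowed list; the real allowed_gpu_names may differ.
ALLOWED_GPU_NAMES = ['A100', 'A10', 'T4']


def pick_label_without_break(gpu_name, allowed_names):
    """Mimic the original loop: every match relabels, so the last match wins."""
    label = None
    for allowed_name in allowed_names:
        if allowed_name.lower() in gpu_name.lower():
            label = allowed_name.lower()
    return label


def pick_label_with_break(gpu_name, allowed_names):
    """Mimic the patched loop: stop at the first match."""
    for allowed_name in allowed_names:
        if allowed_name.lower() in gpu_name.lower():
            return allowed_name.lower()
    return None


# Hypothetical sample of what the GPU name might look like.
sample_gpu_name = 'NVIDIA A100'

print(pick_label_without_break(sample_gpu_name, ALLOWED_GPU_NAMES))  # a10
print(pick_label_with_break(sample_gpu_name, ALLOWED_GPU_NAMES))     # a100

# The break only helps when the longer name is listed first.
print(pick_label_with_break(sample_gpu_name, ['A10', 'A100']))       # a10
```

Note that the break relies on A100 appearing before A10 in the list. If the order cannot be guaranteed, a different approach (for example, preferring the longest matching name) would be needed.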

@fourfireM

I tried what you said and reran python -m sky.utils.kubernetes.gpu_labeler. My k8s now shows skypilot.co/accelerator=a100, and it is running normally!

@romilbhardwaj romilbhardwaj added the k8s Kubernetes related items label Jan 10, 2024
@Michaelvll
Collaborator

This issue seems to be resolved by #2890. Please feel free to reopen the issue if the problem still exists. : )
