[k8s] GPU support for sky local up
#2890
Conversation
Nice @romilbhardwaj! Tried it on a sky-launched T4 machine. Some UX feedback:
I then ran
without luck.
Changing from draft -> ready for review. Fixed some bugs/UX and tested on GCP and AWS GPU VMs.
That's a good point. I wanted to do that too, but realized that 1) the steps involve modifying the user's global settings, which will affect their environment, and 2) it requires sudo permissions, which may require launching
Thanks @romilbhardwaj. I tried it out on
and couldn't seem to get it to work. User journey:
It'd be great to print "To see detailed log: tail -f ..." just like
Making it auto-label would be very convenient. Recall the user feedback asking us to minimize the number of steps.
The job appears to have finished, but it's not labeled.
Thanks for the UX feedback - good points. Will fix. For autolabelling - yes, investigating if we can do away with the need to maintain the list of allowed_gpus and instead label with whatever GPU name nvidia-smi reports.
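For illustration, a rough sketch of what such autolabelling could look like (hypothetical helper; the label key, the normalization, and the use of kubectl here are assumptions for the example, not this PR's implementation):

# Hypothetical sketch: label a node with whatever GPU name nvidia-smi reports,
# instead of validating against a fixed allowed_gpus list.
import subprocess

def autolabel_node(node_name: str,
                   label_key: str = 'skypilot.co/accelerator') -> None:
    # Ask nvidia-smi for the name of the first GPU, e.g. "Tesla T4".
    gpu_name = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=name', '--format=csv,noheader', '--id=0'],
        text=True).strip()
    # Normalize to a label-safe, lowercase value, e.g. "tesla-t4".
    label_value = gpu_name.lower().replace(' ', '-')
    # Apply the label so the GPU type is discoverable on the node.
    subprocess.run(
        ['kubectl', 'label', 'node', node_name,
         f'{label_key}={label_value}', '--overwrite'],
        check=True)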
AFAIK, this method doesn't allow being selective about which and how many GPUs are visible, and for local machines the possible configurations are practically endless (I have 2x A5000 and 1x 2080Ti on my desktop and 1x 4090 on my laptop). Maybe for local kubernetes, we could just create a new
I tried this out - thank you for doing this work. Let's make sure that the --gpus feature is set as the default once this lands, because it's exceedingly nonsensical to have no GPU access when using sky, where this is the entire point 😆. I encountered a few issues:
Hey @tobi - thanks for the super useful feedback! I've made a bunch of UX and functionality fixes - hopefully this should fix your issues. Please give it a go now :)
Some screenshots:
lmk if you run into any issues! Tested:
…o localup_gpus
# Conflicts:
#   sky/utils/kubernetes/create_cluster.sh
Thanks for adding support for GPUs with sky local up @romilbhardwaj! I just tried it on a new fluidstack VM with 4 QUADRO-RTX-5000 GPUs, and it works well. Left several comments, mainly related to UX.
for gpu, _ in sorted(result.items()):
    gpu_table.add_row([gpu, _list_to_str(result.pop(gpu))])
nit:
-for gpu, _ in sorted(result.items()):
-    gpu_table.add_row([gpu, _list_to_str(result.pop(gpu))])
+for gpu, info_list in sorted(result.items()):
+    gpu_table.add_row([gpu, _list_to_str(info_list)])
I think we need to "pop" from the list to avoid having the GPU show up again in the other gpus table at L3502
Ahh, I see. Missed that part! Sounds good to me!
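A toy illustration of that point, with made-up GPU names rather than the real tables:

# Entries consumed by the first (common GPUs) table are popped from `result`,
# so the leftover loop that builds the "other GPUs" table does not show them
# again. Without pop(), 'T4' would appear in both tables.
result = {'T4': ['t4, 1x'], 'SomeOtherGPU': ['other, 1x']}

common_table_rows = []
for gpu, _ in sorted(result.items()):
    if gpu == 'T4':  # stand-in for "is a common GPU"
        # pop() both reads and removes the entry.
        common_table_rows.append((gpu, result.pop(gpu)))

# Only GPUs not already shown above remain.
other_table_rows = sorted(result.items())
print(common_table_rows)  # [('T4', ['t4, 1x'])]
print(other_table_rows)   # [('SomeOtherGPU', ['other, 1x'])]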
Thanks @Michaelvll! Addressed your comments and added a UX fix to show all dependency errors in one go:
really nice!
Sorry for missing this PR and thank you for the update @romilbhardwaj! It looks great to me.
if (cloud_obj is not None and
        cloud_obj.is_same_cloud(clouds.Kubernetes())):
nit:
-if (cloud_obj is not None and
-        cloud_obj.is_same_cloud(clouds.Kubernetes())):
+if isinstance(cloud_obj, clouds.Kubernetes):
-'A100-80GB', 'A100', 'A10G', 'K80', 'M60', 'T4g', 'T4', 'V100', 'A10',
-'P100', 'P40', 'P4', 'L4'
+'A100-80GB', 'A100', 'A10G', 'H100', 'K80', 'M60', 'T4g', 'T4', 'V100',
+'A10', 'P100', 'P40', 'P4', 'L4', 'A6000'
Is this A6000 the same as RTXA6000 in other clouds? If the labelling is A6000 by default in k8s, we should probably rename it for the other clouds as well in the future. : )
Yes! nvidia-smi lists this as Nvidia RTX 6000 - perhaps we can converge onto nvidia-smi's naming convention in the future?
Thanks for the review @Michaelvll! Tested again on a V100 - merging now.
Adds support for GPUs in kind clusters created by sky local up. This is useful for users who may have single machines with GPUs that they want to run SkyPilot tasks on.

Uses instructions from this guide: https://gist.github.com/romilbhardwaj/acde8657e319ecdc6ae9e50646acca33
Note that kind does not support GPUs natively - this is a hack which injects the GPU devices through the volume mounts feature of the nvidia container runtime - https://docs.google.com/document/d/1uXVF-NWZQXgP1MLb87_kMkQvidpnkNWicdpO2l9g-fw/edit#
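For reference, a rough sketch of the kind node config that hack relies on, written here as Python generating the YAML purely for illustration (the special nvidia-container-devices mount path and the accept-nvidia-visible-devices-as-volume-mounts runtime setting come from the linked guide, not from this PR's code):

# Sketch (based on the linked guide, not this PR's create_cluster.sh): mount
# /dev/null at the special nvidia-container-devices path so that the nvidia
# container runtime exposes all host GPUs inside the kind node. Assumes the
# runtime is configured with accept-nvidia-visible-devices-as-volume-mounts
# and set as the default runtime. Requires PyYAML.
import yaml

kind_config = {
    'kind': 'Cluster',
    'apiVersion': 'kind.x-k8s.io/v1alpha4',
    'nodes': [{
        'role': 'control-plane',
        'extraMounts': [{
            'hostPath': '/dev/null',
            # 'all' requests every GPU on the host.
            'containerPath': '/var/run/nvidia-container-devices/all',
        }],
    }],
}

with open('kind-gpu-cluster.yaml', 'w') as f:
    yaml.safe_dump(kind_config, f, sort_keys=False)
# Then: kind create cluster --config kind-gpu-cluster.yaml

With a config along these lines, the GPUs should become visible inside the kind node, provided the nvidia container runtime prerequisites above are met.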
Does not auto-install or auto-configure nvidia-container-runtime and other dependencies, because that is highly dependent on the user's environment and requires sudo. Instead, we print out instructions on how to do so.

Due to the brittle and non-native nature of this support, the --gpus flag will be hidden from users until we decide it is stable enough.

Tested: