Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[k8s] GPU support for sky local up #2890

Merged
merged 12 commits into from
Jan 22, 2024
Merged

[k8s] GPU support for sky local up #2890

merged 12 commits into from
Jan 22, 2024

Conversation

romilbhardwaj
Copy link
Collaborator

@romilbhardwaj romilbhardwaj commented Dec 21, 2023

Adds support for GPUs in kind clusters created by sky local up. This is useful for users who may have single machines with GPUs that they want to run SkyPilot tasks on.

Uses instructions from this guide: https://gist.github.com/romilbhardwaj/acde8657e319ecdc6ae9e50646acca33

Note that kind does not support GPUs natively - this is a hack which injects the GPU device through volume mounts feature in nvidia container runtime - https://docs.google.com/document/d/1uXVF-NWZQXgP1MLb87_kMkQvidpnkNWicdpO2l9g-fw/edit#

Does not auto-install or auto-configurenvidia-container-runtime and other dependencies because it will be highly dependent on user's env + requires sudo. Instead we print out instructions on how to do so.

Due to the brittle and non-native nature of this support, the --gpus flag will be hidden from users until we decide it is stable enough.

Tested:

  • AWS T4 instance
  • AWS V100 instance
  • GCP T4 instance

@concretevitamin
Copy link
Member

Nice @romilbhardwaj!

Tried it on a sky-launched T4 machine. Some UX.

(base) gcpuser@ray-sky-2d4a-zongheng-8a39-head-d50432f5-compute:~/skypilot$ sky local up --gpus
⠋ Creating local cluster...
Failed to create local cluster. No kind clusters found.
NVIDIA is not set as the default runtime for Docker. To fix, run:
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker

(base) gcpuser@ray-sky-2d4a-zongheng-8a39-head-d50432f5-compute:~/skypilot$ sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker
INFO[0000] Loading docker config from /etc/docker/daemon.json
INFO[0000] Successfully loaded config
INFO[0000] Wrote updated config to /etc/docker/daemon.json
INFO[0000] It is recommended that the docker daemon be restarted.
(base) gcpuser@ray-sky-2d4a-zongheng-8a39-head-d50432f5-compute:~/skypilot$ sky local up --gpus
⠋ Creating local cluster...
Failed to create local cluster. No kind clusters found.
NVIDIA visible devices are not set as volume mounts in container runtime. To fix, run:
sudo sed -i '/accept-nvidia-visible-devices-as-volume-mounts/c\accept-nvidia-visible-devices-as-volume-mounts = true' /etc/nvidia-container-runtime/config.toml

(base) gcpuser@ray-sky-2d4a-zongheng-8a39-head-d50432f5-compute:~/skypilot$ sudo sed -i '/accept-nvidia-visible-devices-as-volume-mounts/c\accept-nvidia-visible-devices-as-volume-mounts = true' /etc/nvidia-container-runtime/config.toml
(base) gcpuser@ray-sky-2d4a-zongheng-8a39-head-d50432f5-compute:~/skypilot$ sky local up --gpus
⠙ Creating local cluster...
Failed to create local cluster. No kind clusters found.
Creating cluster "skypilot" ...
 • Ensuring node image (kindest/node:v1.27.3) 🖼  ...
 ✓ Ensuring node image (kindest/node:v1.27.3) 🖼
 • Preparing nodes 📦   ...
 ✓ Preparing nodes 📦
 • Writing configuration 📜  ...
 ✓ Writing configuration 📜
 • Starting control-plane 🕹️  ...
 ✓ Starting control-plane 🕹️
 • Installing CNI 🔌  ...
 ✓ Installing CNI 🔌
 • Installing StorageClass 💾  ...
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-skypilot"
You can now use your cluster with:

kubectl cluster-info --context kind-skypilot

Not sure what to do next? 😅  Check out https://kind.sigs.k8s.io/docs/user/quick-start/
helm is not installed. Please install helm and try again. Installation instructions: https://helm.sh/docs/intro/install/
  1. There are two sets of hinted commands (NVIDIA is not set as the default runtime for Docker -> NVIDIA visible devices are not set) that the user had to manually copy paste and run. Is it possible to make sky local up run them automatically?

  2. The last hint seems to be new and comes from kind? It's not clear if a SkyPilot user should run the hinted kubectl cluster-info --context kind-skypilot command.

I then ran

$ sky show-gpus --cloud kubernetes
No GPUs found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerators) are setup correctly. To further debug, run: sky check.

$ python -m sky.utils.kubernetes.gpu_labeler
Found 0 GPU nodes in the cluster
No GPU nodes found in the cluster. If you have GPU nodes, please ensure that they have the label `nvidia.com/gpu: <number of GPUs>`

without luck.

@romilbhardwaj
Copy link
Collaborator Author

Changing from draft -> ready for review.

Fixed some bugs/UX and tested on GCP and AWS GPU VMs.

There are two sets of hinted commands (NVIDIA is not set as the default runtime for Docker -> NVIDIA visible devices are not set) that the user had to manually copy paste and run. Is it possible to make sky local up run them automatically?

That's a good point. I wanted to do that too, but realized that 1) the steps involve modifying user's global settings, which will affect their environment and 2) Requires sudo permissions, which may require launching sky local up as a sudo user for users who don't have passwordless sudo set up. Figured it's best to force users to be intentional about this and configure their environment themselves.

@concretevitamin
Copy link
Member

concretevitamin commented Dec 25, 2023

Thanks @romilbhardwaj. I tried out on

sky launch --cloud gcp --gpus L4:8 --use-spot --cpus 2+ -y -c dbg-docker

and couldn't seem to get it to work.

User journey
1.

$ sky local up --gpus
⠦ Creating local cluster with GPU support (this may take up to 15 minutes)...

It'd be great to print "To see detailed log: tail -f ..." just like sky launch. Currently it does take a while and makes me wonder "what it's doing and why so long".

$ sky local up --gpus
Local Kubernetes cluster created successfully with 96 CPUs. `sky launch` can now run tasks locally.
Hint: To change the number of CPUs, change your docker runtime settings. See https://kind.sigs.k8s.io/docs/user/quick-start/#settings-for-docker-desktop for more info.
  • with 96 CPUs -> with 96 CPUs and XX GPUs?
  • can we green the first line?
  1. It is not clear if the local up command labels the GPUs for the user, turns out not?
$ sky show-gpus --cloud kubernetes
No GPUs found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerators) are setup correctly. To further debug, run: sky check.

Making it auto label would be very convenient. Recall user feedback to minimize number of steps.

  1. I manually triggered labeling
$ python -m sky.utils.kubernetes.gpu_labeler
Found 1 GPU nodes in the cluster
Created GPU labeler job for node skypilot-control-plane
GPU labeling started - this may take a few minutes to complete.
To check the status of GPU labeling jobs, run `kubectl get jobs -n kube-system -l job=sky-gpu-labeler`
You can check if nodes have been labeled by running `kubectl describe nodes` and looking for labels of the format `skypilot.co/accelerators: <gpu_name>`.

The job appears finished?

$ kubectl get jobs -n kube-system -l job=sky-gpu-labeler
NAME                                               COMPLETIONS   DURATION   AGE
sky-gpu-labeler-d71e5a46f04de559ca097dd3c16cd8c5   1/1           6s         2m27s

but it's not labeled

kubectl describe nodes | grep accelerators  # returns nothing

@romilbhardwaj
Copy link
Collaborator Author

Thanks for the UX feedback - good points. Will fix.

For autolabelling - yes, sky local up --gpus will indeed autolabel, there should be no need to run the labelling separately. It works on T4/other gpus. Looks like the labelling script was broken for L4 GPUs - I've pushed a fix to add L4 and tested it on L4, seems to be working now.

Investigating if we can do away with the need to maintain the list of allowed_gpus and label with whatever gpu name nvidia-smi reports.

@creatorrr
Copy link

AFAIK, this method doesn't allow being selective about which and how many GPUs are visible and for local machines, the possible configurations are practically endless (I have 2x A5000 and 1x 2080Ti on my desktop and a 1x 4090 on my laptop). Maybe for local kubernetes, we could just create a new LOCAL gpu type?

@tobi
Copy link
Contributor

tobi commented Jan 4, 2024

I tried this out, thank you for doing this work. Let's make sure that the --gpus feature is set as default once this lands, because it's exceedingly nonsensical to have no GPU access when using sky where this is the entire point 😆.

I encountered a few issues:

  • invoking the shell script that you created from sky local up --gpus eats all of its output but the output is really necissary for debugging purposes, so we should maybe just shell out to it?
  • my cards are A6000 which were missing from the discovery process. This happened before at some point, I feel like sky should just really expose all nvidia cards and not rely on multiple implementations of allow-lists.
  • I got stuck on an endless loop of Waiting for GPU operator installation... after running the script directly. the gpu-operator never finds my A6000s. This is likely some latent configuration problem on my machine that i need to debug. Oddly running nvidia-smi through docker works fine, so i'm confused. But yes, maybe handle this case in such a way that after 10 retries it errors out.

@romilbhardwaj
Copy link
Collaborator Author

Hey @tobi - thanks for the super useful feedback! I've made a bunch of UX and functionality fixes - hopefully this should fix your issues. Please give it a go now :)

  • Logging: The detailed logs are now written to disk and the path to tail them is shown at the start. If any error happens, the stderr + the path to full logs is shown again. Hope this helps with debugging.
  • Fine-grained spinner: The progress spinner is now shows exactly the step that's being executed (e.g., "Installing NVIDIA GPU Operator", "Waiting for nodes to be labelled", etc.)
  • GPU discovery fixes: The GPU discovery has been reworked - your A6000 machines should work now. The allow list has been changed to simply be a canonical name "suggestion" list. If a GPU name fits any of the names in the list, it's considered a match and canonicalized so that the SkyPilot optimizer can handle failover to cloud, if necessary. If the GPU name does not have a match in the list, we directly use the name provided by nvidia-smi (useful for GPUs unknown to SkyPilot).
  • Timeouts and hints for debugging: Sorry to hear you got stuck on Waiting for GPU operator installation. This does seem like a configuration issue. Can you try running kubectl get pods -n gpu-operator and kubectl describe pod <stuck_pod_name> -n gpu-operator to help debug? (I've also added a timeout to this step so that it doesn't get stuck forever, and also added hints for how to debug if it does get stuck.)
  • GPU support is default now: GPU support for sky local up is now automatically enabled on supported hosts (can override with --no-gpus)

Some screenshots:

  • Spinner showing intermediate steps image
  • sky local up on a 4xA6000 machine image
  • Canonicalization in action on 4x V100-SXM on fluidstack - the optimizer shows both local V100s and AWS V100s as targets for provisioning (only AWS was enabled for this screenshot) image

lmk if you run into any issues!

Tested:

  • GCP 1x L4 machine
  • GCP 1x T4 machine
  • 4x A6000 machine (on fluidstack) - gets listed as a new GPU nvidia-rtx-a6000.
  • 4x V100-SXM2 machine (on fluidstack) - gets canonicalized to v100

…o localup_gpus

# Conflicts:
#	sky/utils/kubernetes/create_cluster.sh
@romilbhardwaj romilbhardwaj marked this pull request as ready for review January 5, 2024 17:14
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the support for GPU with sky local up @romilbhardwaj! I just tried it on a new fluidstack VM with 4 QUADRO-RTX-5000 GPUs, and it works well. Left several comments mainly related to UX.

sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
Comment on lines +3482 to 3483
for gpu, _ in sorted(result.items()):
gpu_table.add_row([gpu, _list_to_str(result.pop(gpu))])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
for gpu, _ in sorted(result.items()):
gpu_table.add_row([gpu, _list_to_str(result.pop(gpu))])
for gpu, info_list in sorted(result.items()):
gpu_table.add_row([gpu, _list_to_str(info_list)])

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to "pop" from the list to avoid having the GPU show up again in the other gpus table at L3502

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I see. Missed that part! Sounds good to me!

sky/cli.py Outdated Show resolved Hide resolved
sky/utils/kubernetes/k8s_gpu_labeler_setup.yaml Outdated Show resolved Hide resolved
sky/cli.py Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
@romilbhardwaj
Copy link
Collaborator Author

Thanks @Michaelvll! Addressed your comments and added a UX fix to show all dependency errors in one go:

image

@tobi
Copy link
Contributor

tobi commented Jan 8, 2024

really nice!

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for missing this PR and thank you for the update @romilbhardwaj! It looks great to me.

sky/cli.py Show resolved Hide resolved
Comment on lines +3482 to 3483
for gpu, _ in sorted(result.items()):
gpu_table.add_row([gpu, _list_to_str(result.pop(gpu))])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I see. Missed that part! Sounds good to me!

Comment on lines +3482 to +3483
if (cloud_obj is not None and
cloud_obj.is_same_cloud(clouds.Kubernetes())):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
if (cloud_obj is not None and
cloud_obj.is_same_cloud(clouds.Kubernetes())):
if isinstance(cloud_obj, clouds.Kubernetes):

'A100-80GB', 'A100', 'A10G', 'K80', 'M60', 'T4g', 'T4', 'V100', 'A10',
'P100', 'P40', 'P4', 'L4'
'A100-80GB', 'A100', 'A10G', 'H100', 'K80', 'M60', 'T4g', 'T4', 'V100',
'A10', 'P100', 'P40', 'P4', 'L4', 'A6000'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this A6000 the same as RTXA6000 in other clouds? If the labelling is A6000 by default in k8s, we should probably rename it for the other clouds as well in the future. : )

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! nvidia-smi lists this as Nvidia RTX 6000 - perhaps we can converge onto nvidia-smi's naming convention in the future?

@romilbhardwaj
Copy link
Collaborator Author

Thanks for the review @Michaelvll! Tested again on a V100 - merging now.

@romilbhardwaj romilbhardwaj merged commit c05a0ea into master Jan 22, 2024
19 checks passed
@romilbhardwaj romilbhardwaj deleted the localup_gpus branch January 22, 2024 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants