[k8s] Run GPU labelling job only on nodes with gpus #2550

Merged 2 commits on Sep 16, 2023
27 changes: 20 additions & 7 deletions sky/utils/kubernetes/gpu_labeler.py
@@ -89,8 +89,16 @@ def label():

     # Iterate over nodes
     nodes = v1.list_node().items
-    # TODO(romilb): Run this only on nodes with GPUs.
+
+    # Get the list of nodes with GPUs
+    gpu_nodes = []
     for node in nodes:
+        if 'nvidia.com/gpu' in node.status.capacity:
+            gpu_nodes.append(node)
+
+    print(f'Found {len(gpu_nodes)} GPU nodes in the cluster')
+
+    for node in gpu_nodes:
         node_name = node.metadata.name

         # Modify the job manifest for the current node
@@ -103,12 +111,17 @@ def label():
         # Create the job for this node`
         batch_v1.create_namespaced_job(namespace, job_manifest)
         print(f'Created GPU labeler job for node {node_name}')
-    print('GPU labeling started - this may take a few minutes to complete.'
-          '\nTo check the status of GPU labeling jobs, run '
-          '`kubectl get jobs --namespace=kube-system -l job=sky-gpu-labeler`'
-          '\nYou can check if nodes have been labeled by running '
-          '`kubectl describe nodes` and looking for labels of the format '
-          '`skypilot.co/accelerators: <gpu_name>`. ')
+    if len(gpu_nodes) == 0:
+        print('No GPU nodes found in the cluster. If you have GPU nodes, '
+              'please ensure that they have the label '
+              '`nvidia.com/gpu: <number of GPUs>`')
+    else:
+        print('GPU labeling started - this may take a few minutes to complete.'
+              '\nTo check the status of GPU labeling jobs, run '
+              '`kubectl get jobs -n kube-system -l job=sky-gpu-labeler`'
+              '\nYou can check if nodes have been labeled by running '
+              '`kubectl describe nodes` and looking for labels of the format '
+              '`skypilot.co/accelerators: <gpu_name>`. ')


 def main():
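For reviewers who want to try the filtering step from `label()` without a cluster, here is a minimal sketch that pulls it into a standalone function. The names `filter_gpu_nodes`, `make_node`, and `GPU_RESOURCE_KEY` are hypothetical and not part of this PR, and the stand-in node objects only mimic the `metadata.name` and `status.capacity` attributes the diff reads. The `or {}` guard for a missing capacity map is an extra assumption; the merged code does not include it.

```python
from types import SimpleNamespace

# Resource name the diff looks for in node.status.capacity.
GPU_RESOURCE_KEY = 'nvidia.com/gpu'


def filter_gpu_nodes(nodes):
    """Return the nodes whose capacity map advertises NVIDIA GPUs.

    Hypothetical helper; mirrors the loop added to label() in this PR.
    """
    gpu_nodes = []
    for node in nodes:
        # Assumption: treat a missing capacity map as "no GPUs".
        capacity = node.status.capacity or {}
        if GPU_RESOURCE_KEY in capacity:
            gpu_nodes.append(node)
    return gpu_nodes


def make_node(name, capacity):
    """Build a stand-in object with the attributes label() reads."""
    return SimpleNamespace(
        metadata=SimpleNamespace(name=name),
        status=SimpleNamespace(capacity=capacity),
    )


nodes = [
    make_node('cpu-node', {'cpu': '8'}),
    make_node('gpu-node', {'cpu': '8', 'nvidia.com/gpu': '1'}),
    make_node('no-capacity', None),
]
gpu_node_names = [n.metadata.name for n in filter_gpu_nodes(nodes)]
print(gpu_node_names)
```

Only `gpu-node` passes the filter, so a labeler job would be created for that node alone and the CPU nodes would no longer get jobs that sit in pending state.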
8 changes: 4 additions & 4 deletions tests/kubernetes/README.md
@@ -82,14 +82,14 @@ sky local up
 ```bash
 kubectl get jobs -n kube-system
 ```
-Note that some jobs may be in pending state if your cluster contains CPU nodes. To clean up these jobs after you're done, run:
-```bash
-python -m sky.utils.kubernetes.gpu_labeler --cleanup
-```
 After the jobs are done, you can verify the GPU labels are setup correctly by looking for `skypilot.co/accelerator` label in the output of:
 ```bash
 kubectl describe nodes
 ```
+In case something goes wrong, you can clean up these jobs by running:
+```bash
+python -m sky.utils.kubernetes.gpu_labeler --cleanup
+```
 5. Run `sky check`.
 ```bash
 sky check