
[k8s] Show per node status in sky show-gpus #3816

Merged
merged 2 commits into master from k8s_show_gpus_nodes on Aug 8, 2024

Conversation

romilbhardwaj (Collaborator) commented on Aug 8, 2024:

Updates sky show-gpus --cloud kubernetes to also show per-node GPU availability information. Helps users debug out-of-resource errors and get better visibility into their cluster.

Example:

$ sky show-gpus --cloud kubernetes
Kubernetes GPUs
GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
T4    1             4           3
V100  1, 2          4           1

Kubernetes per node GPU availability
NODE_NAME                               GPU_NAME  TOTAL_GPUS  FREE_GPUS
gke-gkeusc5-default-pool-93c7cbf8-6ll6  T4        1           1
gke-gkeusc5-default-pool-93c7cbf8-czx8  T4        1           1
gke-gkeusc5-default-pool-93c7cbf8-fdhf  T4        1           0
gke-gkeusc5-default-pool-93c7cbf8-mfj8  T4        1           1
gke-gkeusc5-v100-570bede1-kwsf          V100      2           1
gke-gkeusc5-v100-570bede1-s23r          V100      2           0
gke-gkeusc5-largecpu-4fa50bdc-g7hr      None      0           0
gke-gkeusc5-largecpu-4fa50bdc-qqxq      None      0           0
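
For context, here is a minimal sketch (not the code in this PR) of how such per-node numbers can be derived with the official kubernetes Python client: a node's total GPU count comes from its allocatable resources, and its free count is the total minus the nvidia.com/gpu requests of Running/Pending pods scheduled on it. It assumes kubeconfig access to the cluster.

from kubernetes import client, config

config.load_kube_config()  # assumes a local kubeconfig pointing at the cluster
v1 = client.CoreV1Api()

pods = v1.list_pod_for_all_namespaces().items
for node in v1.list_node().items:
    # Total GPUs advertised by the node (0 if it has none).
    total = int(node.status.allocatable.get('nvidia.com/gpu', 0))
    # Sum the GPU requests of pods scheduled on this node.
    allocated = 0
    for pod in pods:
        if (pod.spec.node_name == node.metadata.name and
                pod.status.phase in ['Running', 'Pending']):
            for container in pod.spec.containers:
                if container.resources.requests:
                    allocated += int(
                        container.resources.requests.get('nvidia.com/gpu', 0))
    print(f'{node.metadata.name}: total={total}, free={total - allocated}')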

Tested (run the relevant ones):

  • Code formatting: bash format.sh

Manual tests:

  • GKE cluster with different GPU types
  • GKE cluster with different GPU types and all GPUs utilized
  • Kind cluster without any GPUs

Michaelvll (Collaborator) left a comment:


Thanks for adding this @romilbhardwaj! LGTM.

Comment on lines +1702 to +1712
for pod in pods:
    # Get all the pods running on the node
    if (pod.spec.node_name == node.metadata.name and
            pod.status.phase in ['Running', 'Pending']):
        # Iterate over all the containers in the pod and sum the
        # GPU requests
        for container in pod.spec.containers:
            if container.resources.requests:
                allocated_qty += int(
                    container.resources.requests.get(
                        'nvidia.com/gpu', 0))
Collaborator:
nit: it might be interesting to show the number of running pods on each node as well, but this is minor and can wait unless users request it. :)

romilbhardwaj (Author) replied:
Good point - though one challenge is that a node may be running many pods across different namespaces; perhaps we can show only the pods in the user's configured namespace. We can add this if users ask :)
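
A rough sketch of that idea (not part of this PR): counting Running pods per node within a single namespace via the kubernetes Python client. The namespace value here is only an illustrative assumption.

from collections import Counter

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

namespace = 'default'  # assumption: the user's configured namespace
running = v1.list_namespaced_pod(
    namespace, field_selector='status.phase=Running').items
pods_per_node = Counter(p.spec.node_name for p in running if p.spec.node_name)
for node_name, count in pods_per_node.items():
    print(f'{node_name}: {count} running pod(s)')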

romilbhardwaj added this pull request to the merge queue on Aug 8, 2024
Merged via the queue into master with commit 7f64d60 Aug 8, 2024
20 checks passed
romilbhardwaj deleted the k8s_show_gpus_nodes branch on August 8, 2024 at 19:40