DRA driver fails with P4 #222

Open

johnbelamaric opened this issue Dec 20, 2024 · 4 comments

@johnbelamaric

I am working on a demo that uses DRA to run a Deployment with a mix of GPU models. Running it against a GKE node with P4 GPUs fails in NodePrepareResources.
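
(For context, a minimal sketch of the kind of claim template such a Deployment points at, assuming the resource.k8s.io/v1beta1 DRA API and the driver's gpu.nvidia.com device class; the names here are illustrative, not the exact manifests from the demo. The time-slicing settings that show up in the failure below would come from a driver-specific opaque config attached to the claim, omitted here.)

# Hypothetical claim template; the real demo manifests may differ.
kubectl apply -f - <<'EOF'
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
EOF

# The Deployment's pod template then references the claim (illustrative names):
#   spec:
#     resourceClaims:
#     - name: gpu
#       resourceClaimTemplateName: single-gpu
#     containers:
#     - name: ctr
#       resources:
#         claims:
#         - name: gpu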

Pods:

[hi on] jbelamaric@jbelamaric:~/proj/gh/johnbelamaric/GoogleCloudPlatform/kubernetes-engine-samples/autoscaling/custom-compute-classes/dra$ k get po
NAME                           READY   STATUS              RESTARTS   AGE
ccc-gpu-5969dcb484-c9gkd       1/1     Running             0          21m
ccc-gpu-67c77c9bdf-hlgwt       0/1     ContainerCreating   0          10m
ccc-gpu-deb-5f54d99d8f-zbb5p   1/1     Running             0          12m
[hi on] jbelamaric@jbelamaric:~/proj/gh/johnbelamaric/GoogleCloudPlatform/kubernetes-engine-samples/autoscaling/custom-compute-classes/dra$ k describe po ccc-gpu-67c77c9bdf-hlgwt

IPs:              <none>
Controlled By:    ReplicaSet/ccc-gpu-67c77c9bdf
Containers:
  ctr:
    Container ID:
    Image:         ubuntu:22.04
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -c
    Args:
      while [ 1 ]; do date; echo $(nvidia-smi -L || echo Waiting...); sleep 60; done
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2vpsr (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  kube-api-access-2vpsr:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              cloud.google.com/compute-class=inference-1x8x24
Tolerations:                 cloud.google.com/compute-class=inference-1x8x24:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason                         Age               From               Message
  ----     ------                         ----              ----               -------
  Normal   Scheduled                      10m               default-scheduler  Successfully assigned default/ccc-gpu-67c77c9bdf-hlgwt to gke-drabeta-n1-standard-4-4xp4-f7feecbe-4h4q
  Warning  FailedPrepareDynamicResources  1s (x8 over 10m)  kubelet            Failed to prepare dynamic resources: NodePrepareResources failed for claim default/ccc-gpu-67c77c9bdf-hlgwt-gpu-428dx: error preparing devices for claim e9504d19-2894-4331-aa0b-2c4536de9322: prepare devices failed: error applying GPU config: error setting timeslice config for requests '[gpu gpu gpu gpu]' in claim 'e9504d19-2894-4331-aa0b-2c4536de9322': error setting time slice: error running nvidia-smi: exit status 3
[hi on] jbelamaric@jbelamaric:~/proj/gh/johnbelamaric/GoogleCloudPlatform/kubernetes-engine-samples/autoscaling/custom-compute-classes/dra$

Looking at the driver log:

I1220 17:52:58.757018       1 driver.go:97] NodePrepareResource is called: number of claims: 1
E1220 17:53:10.813853       1 nvlib.go:534]
Failed to set timeslice policy with value Default for GPU 0 : Not Supported
Failed to set timeslice for requested devices : Not Supported
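
The "Not Supported" here appears to come from the GPU itself rejecting the compute time-slice policy, which matches the "error running nvidia-smi: exit status 3" in the pod events above. Assuming the driver shells out to nvidia-smi's compute-policy subcommand (an assumption on my part, not verified against the driver code), the same failure should be reproducible directly on the node:

# Hypothetical manual check on the node; 0 corresponds to the Default timeslice.
nvidia-smi compute-policy -i 0 --set-timeslice=0
echo $?   # expected to print 3 ("not supported on target device") on the P4 node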

@klueska (Collaborator) commented Dec 20, 2024

For my previous demo on GKE, I used V100 and T4 GPUs:
https://github.com/NVIDIA/k8s-dra-driver/blob/main/demo/clusters/gke/create-cluster.sh#L64-L114
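
(For reference, a sketch of what the node pool looks like when switched to V100s, assuming the standard gcloud accelerator flags rather than the exact arguments in the linked script; cluster name, zone, and machine shape are placeholders:)

# Illustrative V100 node pool; CLUSTER_NAME and ZONE are placeholders.
gcloud container node-pools create n1-standard-4-1xv100 \
  --cluster "${CLUSTER_NAME}" \
  --zone "${ZONE}" \
  --machine-type n1-standard-4 \
  --num-nodes 1 \
  --accelerator type=nvidia-tesla-v100,count=1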

@johnbelamaric (Author)

Thanks, I can look into switching to V100s.
