DRA driver fails with P4 #222

Open

johnbelamaric opened this issue Dec 20, 2024 · 4 comments

@johnbelamaric

I am working on a demo that uses DRA to run a Deployment with a mix of GPU models. Running it against a GKE node with P4 GPUs fails in NodePrepareResources.
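
(For context, a minimal sketch of the kind of claim template such a Deployment points at, assuming the resource.k8s.io/v1beta1 DRA API and the driver's gpu.nvidia.com device class; the names here are illustrative, not the exact manifests from the demo. The time-slicing settings that show up in the failure below would come from a driver-specific opaque config attached to the claim, omitted here.)

# Hypothetical claim template; the real demo manifests may differ.
kubectl apply -f - <<'EOF'
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
EOF

# The Deployment's pod template then references the claim (illustrative names):
#   spec:
#     resourceClaims:
#     - name: gpu
#       resourceClaimTemplateName: single-gpu
#     containers:
#     - name: ctr
#       resources:
#         claims:
#         - name: gpu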

Pods:

[hi on] jbelamaric@jbelamaric:~/proj/gh/johnbelamaric/GoogleCloudPlatform/kubernetes-engine-samples/autoscaling/custom-compute-classes/dra$ k get po
NAME                           READY   STATUS              RESTARTS   AGE
ccc-gpu-5969dcb484-c9gkd       1/1     Running             0          21m
ccc-gpu-67c77c9bdf-hlgwt       0/1     ContainerCreating   0          10m
ccc-gpu-deb-5f54d99d8f-zbb5p   1/1     Running             0          12m
[hi on] jbelamaric@jbelamaric:~/proj/gh/johnbelamaric/GoogleCloudPlatform/kubernetes-engine-samples/autoscaling/custom-compute-classes/dra$ k describe po ccc-gpu-67c77c9bdf-hlgwt

IPs:              <none>
Controlled By:    ReplicaSet/ccc-gpu-67c77c9bdf
Containers:
  ctr:
    Container ID:
    Image:         ubuntu:22.04
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -c
    Args:
      while [ 1 ]; do date; echo $(nvidia-smi -L || echo Waiting...); sleep 60; done
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2vpsr (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  kube-api-access-2vpsr:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              cloud.google.com/compute-class=inference-1x8x24
Tolerations:                 cloud.google.com/compute-class=inference-1x8x24:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason                         Age               From               Message
  ----     ------                         ----              ----               -------
  Normal   Scheduled                      10m               default-scheduler  Successfully assigned default/ccc-gpu-67c77c9bdf-hlgwt to gke-drabeta-n1-standard-4-4xp4-f7feecbe-4h4q
  Warning  FailedPrepareDynamicResources  1s (x8 over 10m)  kubelet            Failed to prepare dynamic resources: NodePrepareResources failed for claim default/ccc-gpu-67c77c9bdf-hlgwt-gpu-428dx: error preparing devices for claim e9504d19-2894-4331-aa0b-2c4536de9322: prepare devices failed: error applying GPU config: error setting timeslice config for requests '[gpu gpu gpu gpu]' in claim 'e9504d19-2894-4331-aa0b-2c4536de9322': error setting time slice: error running nvidia-smi: exit status 3
[hi on] jbelamaric@jbelamaric:~/proj/gh/johnbelamaric/GoogleCloudPlatform/kubernetes-engine-samples/autoscaling/custom-compute-classes/dra$

Looking at the driver log:

I1220 17:52:58.757018       1 driver.go:97] NodePrepareResource is called: number of claims: 1
E1220 17:53:10.813853       1 nvlib.go:534]
Failed to set timeslice policy with value Default for GPU 0 : Not Supported
Failed to set timeslice for requested devices : Not Supported
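
The "Not Supported" here appears to come from the GPU itself rejecting the compute time-slice policy, which matches the "error running nvidia-smi: exit status 3" in the pod events above. Assuming the driver shells out to nvidia-smi's compute-policy subcommand (an assumption on my part, not verified against the driver code), the same failure should be reproducible directly on the node:

# Hypothetical manual check on the node; 0 corresponds to the Default timeslice.
nvidia-smi compute-policy -i 0 --set-timeslice=0
echo $?   # expected to print 3 ("not supported on target device") on the P4 node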

@klueska (Collaborator) commented Dec 20, 2024

For my previous demo on GKE, I used V100 and T4 GPUs:
https://github.com/NVIDIA/k8s-dra-driver/blob/main/demo/clusters/gke/create-cluster.sh#L64-L114
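
(For reference, a sketch of what the node pool looks like when switched to V100s, assuming the standard gcloud accelerator flags rather than the exact arguments in the linked script; cluster name, zone, and machine shape are placeholders:)

# Illustrative V100 node pool; CLUSTER_NAME and ZONE are placeholders.
gcloud container node-pools create n1-standard-4-1xv100 \
  --cluster "${CLUSTER_NAME}" \
  --zone "${ZONE}" \
  --machine-type n1-standard-4 \
  --num-nodes 1 \
  --accelerator type=nvidia-tesla-v100,count=1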

@johnbelamaric (Author)

Thanks, I can look into switching to V100s.
