
GPU sharing on cuda compute capability >=7.5 #231

Open
wants to merge 1 commit into base: main

Conversation

@guptaNswati (Contributor) commented Jan 24, 2025

This adds a check that allows GPU sharing only when the GPU has a CUDA compute capability of 7.5 or higher; on older GPUs both timeslicing and MPS are skipped. Referencing these two issues and the related MR:

#41
https://github.com/NVIDIA/cloud-native-team/issues/97
https://github.com/NVIDIA/cloud-native-team/issues/96

Tested on a GeForce 980 and a Titan.
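For illustration only, below is a minimal, self-contained sketch of the kind of gate described above, assuming a semver-style comparison with golang.org/x/mod/semver (the approach used in the diff quoted in the review further down); the helper name, signature, and example program are hypothetical, not the driver's actual API.

package main

import (
	"fmt"
	"strings"

	"golang.org/x/mod/semver"
)

// gpuSharingSupported is a hypothetical helper: it reports whether a GPU with
// the given CUDA compute capability (e.g. "7.5" or "v8.0") would be allowed to
// use timeslicing or MPS under the >= 7.5 rule this PR introduces.
func gpuSharingSupported(computeCapability string) bool {
	cc := "v" + strings.TrimPrefix(computeCapability, "v")
	return semver.Compare(semver.Canonical(cc), semver.Canonical("v7.5")) >= 0
}

func main() {
	// A GeForce 980 (compute capability 5.2) is rejected; Turing/Ampere pass.
	for _, cc := range []string{"5.2", "7.5", "8.6"} {
		fmt.Printf("compute capability %s: sharing supported = %v\n", cc, gpuSharingSupported(cc))
	}
}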

Logs when called on incompatible GPUs:
$ kubectl logs nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-xbnr2 -n nvidia

I0130 23:08:07.073619       1 driver.go:108] NodeUnprepareResource is called: number of claims: 1
E0130 23:08:07.123606       1 nvlib.go:534] 
Failed to set timeslice policy with value Default for GPU 0 : Not Supported
Failed to set timeslice for requested devices : Not Supported

No MPS server is started; the MPS test pod stays stuck in ContainerCreating:
$ kubectl apply -f demo/specs/quickstart/gpu-test-mps.yaml

$ kubectl get pods -A
NAMESPACE            NAME                                                           READY   STATUS              RESTARTS   AGE
gpu-test-mps         test-pod                                                       0/2     ContainerCreating   0          31m
kube-system          coredns-668d6bf9bc-hwhxl                                       1/1     Running             0          34m
kube-system          coredns-668d6bf9bc-rb964                                       1/1     Running             0          34m
kube-system          etcd-k8s-dra-driver-cluster-control-plane                      1/1     Running             0          34m
kube-system          kindnet-gxfdc                                                  1/1     Running             0          34m
kube-system          kindnet-r88xt                                                  1/1     Running             0          34m
kube-system          kube-apiserver-k8s-dra-driver-cluster-control-plane            1/1     Running             0          34m
kube-system          kube-controller-manager-k8s-dra-driver-cluster-control-plane   1/1     Running             0          34m
kube-system          kube-proxy-m7m4t                                               1/1     Running             0          34m
kube-system          kube-proxy-tx7bp                                               1/1     Running             0          34m
kube-system          kube-scheduler-k8s-dra-driver-cluster-control-plane            1/1     Running             0          34m
local-path-storage   local-path-provisioner-58cc7856b6-x77dz                        1/1     Running             0          34m
nvidia               nvidia-dra-driver-k8s-dra-driver-controller-844fcb94b-66wkq    1/1     Running             0          32m
nvidia               nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-vqhfg          1/1     Running             0          32m

$ kubectl logs nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-vqhfg  -n nvidia
I0131 00:51:41.457384       1 device_state.go:73] using devRoot=/driver-root
I0131 00:52:26.105473       1 driver.go:97] NodePrepareResource is called: number of claims: 1
I0131 00:53:34.078698       1 driver.go:97] NodePrepareResource is called: number of claims: 1

$ kubectl get pods -n gpu-test-mps
NAME       READY   STATUS              RESTARTS   AGE
test-pod   0/2     ContainerCreating   0          29m

$ kubectl describe pod test-pod -n gpu-test-mps
Warning  FailedPrepareDynamicResources  31s (x25 over 30m)  kubelet            Failed to prepare dynamic resources: NodePrepareResources failed for claim gpu-test-mps/test-pod-shared-gpu-wfk6r: error preparing devices for claim 84b5789b-1f09-4d93-a3d3-a9fb61542cf9: prepare devices failed: error applying GPU config: GPU sharing is not available on this device UUID=GPU-34e8d7ba-0e4d-ac00-6852-695d5d404f51

@guptaNswati guptaNswati changed the title Draft:MPS on cuda compute capability >3.5 Draft: MPS on cuda compute capability >3.5 Jan 24, 2025
@guptaNswati guptaNswati requested a review from klueska January 31, 2025 01:28
@guptaNswati guptaNswati changed the title Draft: MPS on cuda compute capability >3.5 GPU sharing on cuda compute capability >=7.5 Jan 31, 2025
@guptaNswati guptaNswati requested a review from elezar January 31, 2025 01:29
@guptaNswati (Contributor Author) commented:

cc @elezar PTAL as you also reviewed #58

@elezar (Member) commented Feb 3, 2025

Thanks @guptaNswati. I will need to check how this differs from #58.

if deviceType.Gpu != nil {
	cudaCCv := "v" + strings.TrimPrefix(deviceType.Gpu.cudaComputeCapability, "v")
	gpuUUID := deviceType.Gpu.UUID
	if semver.Compare(semver.Canonical(cudaCCv), semver.Canonical("v7.5")) >= 0 {
Review comment (Member):

@guptaNswati where does the v7.5 threshold come from? In #58 we check for >= v7.0 and for MPS specifically, v3.5 is mentioned.

Review reply (Contributor Author):

I picked it from our device-plugin code, which checks whether the GPU is Volta: https://github.com/NVIDIA/k8s-device-plugin/blob/main/cmd/mps-control-daemon/mps/device.go#L51
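For reference, a hedged sketch of that style of architecture gate (Volta or newer, i.e. compute capability major version >= 7); the helper below is hypothetical, not the device-plugin's actual code, and assumes the standard "fmt", "strconv", and "strings" packages are imported.

// isVoltaOrNewer is a hypothetical stand-in for the linked device-plugin check:
// it parses the major part of a compute capability string such as "7.5" and
// requires it to be at least 7 (Volta).
func isVoltaOrNewer(computeCapability string) (bool, error) {
	parts := strings.SplitN(computeCapability, ".", 2)
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return false, fmt.Errorf("invalid compute capability %q: %w", computeCapability, err)
	}
	return major >= 7, nil
}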

// allow devices only with cuda compute capability >= 7.5, as timeslicing and MPS do not work on older architectures
shareableAllocatableDevices := make(AllocatableDevices)
for device, deviceType := range allocatableDevices {
	if deviceType.Gpu != nil {
Review comment (Member):

Does this mean that we don't timeslice MIG devices?

Review comment (Member):

In general, does it make sense to factor these checks into a function where we can better test the various combinations of options?
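One possible shape for such a function, sketched here with minimal stand-in types; the type and field names mirror the snippets quoted above but are assumptions, and the v7.5 threshold is the value this PR proposes rather than a settled choice.

package sharing

import (
	"strings"

	"golang.org/x/mod/semver"
)

// Minimal stand-ins for the driver's device types; not the real definitions.
type gpuInfo struct {
	UUID                  string
	cudaComputeCapability string
}

type allocatableDevice struct {
	Gpu *gpuInfo // nil for non-GPU devices such as MIG slices
}

// isShareable pulls the eligibility check into one place so the GPU / MIG /
// compute-capability combinations can be unit tested in isolation.
func isShareable(d allocatableDevice) bool {
	if d.Gpu == nil {
		// As written, MIG (and any other non-GPU) devices are excluded, which
		// is the behaviour the earlier review question asks about.
		return false
	}
	cc := "v" + strings.TrimPrefix(d.Gpu.cudaComputeCapability, "v")
	return semver.Compare(semver.Canonical(cc), semver.Canonical("v7.5")) >= 0
}

A table-driven unit test over compute capabilities (e.g. "5.2", "7.0", "7.5", "8.6") plus nil-GPU cases would then cover the combinations directly.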

@@ -413,7 +431,8 @@ func (s *DeviceState) applySharingConfig(ctx context.Context, config configapi.S
 	if err != nil {
 		return nil, fmt.Errorf("error getting MPS configuration: %w", err)
 	}
-	mpsControlDaemon := s.mpsManager.NewMpsControlDaemon(string(claim.UID), allocatableDevices)
+
+	mpsControlDaemon := s.mpsManager.NewMpsControlDaemon(string(claim.UID), shareableAllocatableDevices)
Review comment (Member):

Should we distinguish between timeslicing-sharable and MPS-sharable devices?
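If that distinction is wanted, one hedged option would be a per-strategy filter, reusing the stand-in types from the sketch above; the strategy names and thresholds below are illustrative placeholders, not confirmed values.

// shareableDevicesFor filters per sharing strategy instead of using a single
// shared set. The minimums below are placeholders for whatever the project
// settles on (e.g. the v7.0-vs-v7.5 question raised earlier in the review).
func shareableDevicesFor(strategy string, devices map[string]allocatableDevice) map[string]allocatableDevice {
	minCC := map[string]string{
		"TimeSlicing": "v7.0",
		"MPS":         "v7.0",
	}
	out := make(map[string]allocatableDevice)
	minVersion, ok := minCC[strategy]
	if !ok {
		return out // unknown strategy: nothing is shareable
	}
	for name, d := range devices {
		if d.Gpu == nil {
			continue // skip MIG and other non-GPU devices
		}
		cc := "v" + strings.TrimPrefix(d.Gpu.cudaComputeCapability, "v")
		if semver.Compare(semver.Canonical(cc), semver.Canonical(minVersion)) >= 0 {
			out[name] = d
		}
	}
	return out
}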
