
GPU sharing on cuda compute capability >=7.5 #231

Open
wants to merge 1 commit into base: main

Conversation

@guptaNswati (Contributor) commented Jan 24, 2025

This adds a check that allows GPU sharing only when the GPU has a CUDA compute capability of 7.5 or higher; on older GPUs both timeslicing and MPS are skipped. Referencing these two issues and the related MR:

#41
https://github.com/NVIDIA/cloud-native-team/issues/97
https://github.com/NVIDIA/cloud-native-team/issues/96

Tested on a GeForce 980 and a Titan.
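For illustration only, below is a minimal, self-contained sketch of the kind of gate described above, assuming a semver-style comparison with golang.org/x/mod/semver (the approach used in the diff quoted in the review further down); the helper name, signature, and example program are hypothetical, not the driver's actual API.

package main

import (
	"fmt"
	"strings"

	"golang.org/x/mod/semver"
)

// gpuSharingSupported is a hypothetical helper: it reports whether a GPU with
// the given CUDA compute capability (e.g. "7.5" or "v8.0") would be allowed to
// use timeslicing or MPS under the >= 7.5 rule this PR introduces.
func gpuSharingSupported(computeCapability string) bool {
	cc := "v" + strings.TrimPrefix(computeCapability, "v")
	return semver.Compare(semver.Canonical(cc), semver.Canonical("v7.5")) >= 0
}

func main() {
	// A GeForce 980 (compute capability 5.2) is rejected; Turing/Ampere pass.
	for _, cc := range []string{"5.2", "7.5", "8.6"} {
		fmt.Printf("compute capability %s: sharing supported = %v\n", cc, gpuSharingSupported(cc))
	}
}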

Logs when called on incompatible GPUs:
$ kubectl logs nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-xbnr2 -n nvidia

I0130 23:08:07.073619       1 driver.go:108] NodeUnprepareResource is called: number of claims: 1
E0130 23:08:07.123606       1 nvlib.go:534] 
Failed to set timeslice policy with value Default for GPU 0 : Not Supported
Failed to set timeslice for requested devices : Not Supported

No MPS server is started; the MPS test pod stays stuck in ContainerCreating:
$ kubectl apply -f demo/specs/quickstart/gpu-test-mps.yaml

$ kubectl get pods -A
NAMESPACE            NAME                                                           READY   STATUS              RESTARTS   AGE
gpu-test-mps         test-pod                                                       0/2     ContainerCreating   0          31m
kube-system          coredns-668d6bf9bc-hwhxl                                       1/1     Running             0          34m
kube-system          coredns-668d6bf9bc-rb964                                       1/1     Running             0          34m
kube-system          etcd-k8s-dra-driver-cluster-control-plane                      1/1     Running             0          34m
kube-system          kindnet-gxfdc                                                  1/1     Running             0          34m
kube-system          kindnet-r88xt                                                  1/1     Running             0          34m
kube-system          kube-apiserver-k8s-dra-driver-cluster-control-plane            1/1     Running             0          34m
kube-system          kube-controller-manager-k8s-dra-driver-cluster-control-plane   1/1     Running             0          34m
kube-system          kube-proxy-m7m4t                                               1/1     Running             0          34m
kube-system          kube-proxy-tx7bp                                               1/1     Running             0          34m
kube-system          kube-scheduler-k8s-dra-driver-cluster-control-plane            1/1     Running             0          34m
local-path-storage   local-path-provisioner-58cc7856b6-x77dz                        1/1     Running             0          34m
nvidia               nvidia-dra-driver-k8s-dra-driver-controller-844fcb94b-66wkq    1/1     Running             0          32m
nvidia               nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-vqhfg          1/1     Running             0          32m

$ kubectl logs nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-vqhfg  -n nvidia
I0131 00:51:41.457384       1 device_state.go:73] using devRoot=/driver-root
I0131 00:52:26.105473       1 driver.go:97] NodePrepareResource is called: number of claims: 1
I0131 00:53:34.078698       1 driver.go:97] NodePrepareResource is called: number of claims: 1

$ kubectl get pods -n gpu-test-mps
NAME       READY   STATUS              RESTARTS   AGE
test-pod   0/2     ContainerCreating   0          29m

$ kubectl describe pod test-pod -n gpu-test-mps
Warning  FailedPrepareDynamicResources  31s (x25 over 30m)  kubelet            Failed to prepare dynamic resources: NodePrepareResources failed for claim gpu-test-mps/test-pod-shared-gpu-wfk6r: error preparing devices for claim 84b5789b-1f09-4d93-a3d3-a9fb61542cf9: prepare devices failed: error applying GPU config: GPU sharing is not available on this device UUID=GPU-34e8d7ba-0e4d-ac00-6852-695d5d404f51

@guptaNswati guptaNswati changed the title Draft:MPS on cuda compute capability >3.5 Draft: MPS on cuda compute capability >3.5 Jan 24, 2025
@guptaNswati guptaNswati requested a review from klueska January 31, 2025 01:28
@guptaNswati guptaNswati changed the title Draft: MPS on cuda compute capability >3.5 GPU sharing on cuda compute capability >=7.5 Jan 31, 2025
@guptaNswati guptaNswati requested a review from elezar January 31, 2025 01:29
@guptaNswati (Contributor Author) commented:

cc @elezar PTAL as you also reviewed #58

@elezar (Member) commented Feb 3, 2025

Thanks @guptaNswati. I will need to check how this differs from #58.

if deviceType.Gpu != nil {
	cudaCCv := "v" + strings.TrimPrefix(deviceType.Gpu.cudaComputeCapability, "v")
	gpuUUID := deviceType.Gpu.UUID
	if semver.Compare(semver.Canonical(cudaCCv), semver.Canonical("v7.5")) >= 0 {
Review comment (Member):

@guptaNswati where does the v7.5 threshold come from? In #58 we check for >= v7.0 and for MPS specifically, v3.5 is mentioned.

Review reply (Contributor Author):

I picked it from our device-plugin code, which checks whether the GPU is Volta: https://github.com/NVIDIA/k8s-device-plugin/blob/main/cmd/mps-control-daemon/mps/device.go#L51
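For reference, a hedged sketch of that style of architecture gate (Volta or newer, i.e. compute capability major version >= 7); the helper below is hypothetical, not the device-plugin's actual code, and assumes the standard "fmt", "strconv", and "strings" packages are imported.

// isVoltaOrNewer is a hypothetical stand-in for the linked device-plugin check:
// it parses the major part of a compute capability string such as "7.5" and
// requires it to be at least 7 (Volta).
func isVoltaOrNewer(computeCapability string) (bool, error) {
	parts := strings.SplitN(computeCapability, ".", 2)
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return false, fmt.Errorf("invalid compute capability %q: %w", computeCapability, err)
	}
	return major >= 7, nil
}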

// allow devices only with cuda compute capability >= 7.5, as timeslicing and MPS do not work on older architectures
shareableAllocatableDevices := make(AllocatableDevices)
for device, deviceType := range allocatableDevices {
	if deviceType.Gpu != nil {
Review comment (Member):

Does this mean that we don't timeslice MIG devices?

Review comment (Member):

In general, does it make sense to factor these checks into a function where we can better test the various combinations of options?
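One possible shape for such a function, sketched here with minimal stand-in types; the type and field names mirror the snippets quoted above but are assumptions, and the v7.5 threshold is the value this PR proposes rather than a settled choice.

package sharing

import (
	"strings"

	"golang.org/x/mod/semver"
)

// Minimal stand-ins for the driver's device types; not the real definitions.
type gpuInfo struct {
	UUID                  string
	cudaComputeCapability string
}

type allocatableDevice struct {
	Gpu *gpuInfo // nil for non-GPU devices such as MIG slices
}

// isShareable pulls the eligibility check into one place so the GPU / MIG /
// compute-capability combinations can be unit tested in isolation.
func isShareable(d allocatableDevice) bool {
	if d.Gpu == nil {
		// As written, MIG (and any other non-GPU) devices are excluded, which
		// is the behaviour the earlier review question asks about.
		return false
	}
	cc := "v" + strings.TrimPrefix(d.Gpu.cudaComputeCapability, "v")
	return semver.Compare(semver.Canonical(cc), semver.Canonical("v7.5")) >= 0
}

A table-driven unit test over compute capabilities (e.g. "5.2", "7.0", "7.5", "8.6") plus nil-GPU cases would then cover the combinations directly.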

@@ -413,7 +431,8 @@ func (s *DeviceState) applySharingConfig(ctx context.Context, config configapi.S
 	if err != nil {
 		return nil, fmt.Errorf("error getting MPS configuration: %w", err)
 	}
-	mpsControlDaemon := s.mpsManager.NewMpsControlDaemon(string(claim.UID), allocatableDevices)
+
+	mpsControlDaemon := s.mpsManager.NewMpsControlDaemon(string(claim.UID), shareableAllocatableDevices)
Review comment (Member):

Should we distinguish between timeslicing-sharable and MPS-sharable devices?
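If that distinction is wanted, one hedged option would be a per-strategy filter, reusing the stand-in types from the sketch above; the strategy names and thresholds below are illustrative placeholders, not confirmed values.

// shareableDevicesFor filters per sharing strategy instead of using a single
// shared set. The minimums below are placeholders for whatever the project
// settles on (e.g. the v7.0-vs-v7.5 question raised earlier in the review).
func shareableDevicesFor(strategy string, devices map[string]allocatableDevice) map[string]allocatableDevice {
	minCC := map[string]string{
		"TimeSlicing": "v7.0",
		"MPS":         "v7.0",
	}
	out := make(map[string]allocatableDevice)
	minVersion, ok := minCC[strategy]
	if !ok {
		return out // unknown strategy: nothing is shareable
	}
	for name, d := range devices {
		if d.Gpu == nil {
			continue // skip MIG and other non-GPU devices
		}
		cc := "v" + strings.TrimPrefix(d.Gpu.cudaComputeCapability, "v")
		if semver.Compare(semver.Canonical(cc), semver.Canonical(minVersion)) >= 0 {
			out[name] = d
		}
	}
	return out
}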
