Support for allocating GPUs in Passthrough-Mode #183

Open

varunrsekar wants to merge 9 commits into main

Conversation

varunrsekar (Author):

This PR introduces a new DeviceClass, vfiopci.nvidia.com, that allocates a full GPU in passthrough (PT) mode by binding the GPU to the vfio-pci driver.

The primary use cases for this new DeviceClass are Kata containers and KubeVirt VMs, which require the GPU to be in PT mode and made available to a pod that then spins up a guest with the GPU.

Note: Regular pod workloads will not benefit from this DeviceClass and shouldn't try to use it.
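
For illustration, a claim against this class might be constructed like this (a sketch assuming the resource.k8s.io/v1beta1 API; the claim name and namespace mirror the validation output below and are illustrative):

package main

import (
    "fmt"

    resourceapi "k8s.io/api/resource/v1beta1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// passthroughClaim requests one full GPU in passthrough mode by
// referencing the new vfiopci.nvidia.com DeviceClass.
func passthroughClaim() *resourceapi.ResourceClaim {
    return &resourceapi.ResourceClaim{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "pod1-gpu",
            Namespace: "gpu-test-vfiopci",
        },
        Spec: resourceapi.ResourceClaimSpec{
            Devices: resourceapi.DeviceClaim{
                Requests: []resourceapi.DeviceRequest{{
                    Name:            "gpu",
                    DeviceClassName: "vfiopci.nvidia.com",
                }},
            },
        },
    }
}

func main() {
    fmt.Println(passthroughClaim().Name)
}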

As part of this change, I've introduced some (but not all) of the modifications to the kind cluster config that are needed for this DeviceClass to work. The following host-level modifications are also required:

# Example on Ubuntu:

# Enable IOMMU on the host kernel
if ! grep -q "GRUB_CMDLINE_LINUX_DEFAULT=.*intel_iommu=on" /etc/default/grub; then
   sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="/GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on /g' /etc/default/grub
fi
sudo update-grub

# Disable GDM
sudo systemctl stop gdm && sudo systemctl disable gdm
# Unload nvidia-drm
sudo modprobe -r nvidia-drm
# Reboot the node
sudo reboot

Validated on a kind cluster with a Quadro P2000 GPU:

$ nvidia-smi -L
GPU 0: Quadro P2000 (UUID: GPU-7bea1569-778c-fb4d-7801-df6b6b85ceac)
$ k get resourceclaim -n gpu-test-vfiopci
NAME             STATE                AGE
pod1-gpu-k9w6g   allocated,reserved   21s

$ k get pod -n gpu-test-vfiopci
NAME   READY   STATUS    RESTARTS   AGE
pod1   1/1     Running   0          2m20s

Open items:

  • How to make sysfs on the kind cluster node read-write mountable?
  • How to handle kubelet plugin restarts while the GPU is bound to the vfio-pci driver, given that no device discovery is possible at that time.

varunrsekar (Author):
/cc @klueska

Varun Ramachandra Sekar added 3 commits December 2, 2024 14:27
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
type GpuDriver string

const (
    NvidiaDriver GpuDriver = "nvidia"
Member:

Is there another name for this driver that isn't nvidia?

Author:

Nope. The driver is simply called nvidia

}

// Validate ensures that GpuDriverConfig has a valid set of values.
func (c *GpuDriverConfig) Validate() error {
Member:

Question: Is c == nil valid?

Author:

Nope. gpuConfig.Normalize() would always ensure GpuDriverConfig is set

Member:

My question should probably have been: "Should we add an explicit nil check and return an error if this is the case?"
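
A minimal sketch of that guard (the error wording is illustrative):

// Hypothetical explicit guard: fail with an error rather than panicking
// if Normalize() was never called and the receiver is nil.
func (c *GpuDriverConfig) Validate() error {
    if c == nil {
        return fmt.Errorf("gpu driver configuration is unset")
    }
    // ... existing driver validation follows ...
    return nil
}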

    case VfioPciDriver:
        break
    default:
        return fmt.Errorf("invalid driver specified in gpu driver configuration")
Member:

Should we include the invalid value here?
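
Something along these lines, assuming the invalid value lives in c.Driver (the field name is not visible in this hunk):

default:
    return fmt.Errorf("invalid driver %q specified in gpu driver configuration", c.Driver)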

Comment on lines 65 to 68
    if c.DriverConfig.Driver != NvidiaDriver {
        return nil
    }
Member:

Does adding a SupportsSharing() function to the GpuDriver type make this clearer?
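
A hedged sketch of that helper (the constants mirror those defined earlier in this PR):

// SupportsSharing reports whether GPUs managed by this driver can be
// shared between workloads. A GPU bound to vfio-pci is consumed whole
// by a single guest, so only the nvidia driver supports sharing.
func (d GpuDriver) SupportsSharing() bool {
    return d == NvidiaDriver
}

The check in this hunk would then read: if !c.DriverConfig.Driver.SupportsSharing() { return nil }.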

Member:

Also, is the expectation that we always call Validate() after Normalize() so that we check whether there is no sharing actually configured? Does it make sense to remove this check here entirely so that we only need to update the logic in one place with regard to the driver type and sharing interaction?

Author:

> Does it make sense to remove this check here entirely so that we only need to update the logic in one place with regard to the driver type and sharing interaction?

Removing the check here won't work. There are two possibilities:

  1. With the NVIDIA driver, normalize sharing by initializing it if it's nil, and then set the config based on the strategy.
  2. With the VFIO-PCI driver, don't initialize it.

If we allowed normalization of c.Sharing in case (2) by relaxing the check, we would fail subsequent validation.
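
Roughly, the branching in Normalize() looks like this (a sketch; GpuSharing and TimeSlicingStrategy are assumed names for the existing sharing types):

func (c *GpuConfig) Normalize() error {
    if c.DriverConfig == nil {
        c.DriverConfig = &GpuDriverConfig{Driver: NvidiaDriver}
    }
    // Case (2): vfio-pci. Leave c.Sharing nil; Validate() rejects any
    // sharing block that was set explicitly.
    if c.DriverConfig.Driver != NvidiaDriver {
        return nil
    }
    // Case (1): nvidia. Default the sharing config before applying
    // strategy-specific settings.
    if c.Sharing == nil {
        c.Sharing = &GpuSharing{Strategy: TimeSlicingStrategy}
    }
    return nil
}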

        allocatable:       allocatable,
        config:            config,
        nvdevlib:          nvdevlib,
        checkpointManager: checkpointManager,
    }

    // Initialize the vfio-pci driver manager.
    vfioPciManager.Init()
Member:

Why do we init this here? Where are other managers initialised?

Author:

tsManager/mpsManager don't have any "initializations". But checkpointManager is initialized underneath: L117-L131.

What's the concern here?

switch castConfig := config.(type) {
case *configapi.GpuConfig:
return s.applySharingConfig(ctx, castConfig.Sharing, claim, results)
configState.GpuConfig = castConfig
err = s.applyGpuConfig(ctx, castConfig, claim, results, &configState)
Member:

Why not still return if err != nil in each of the branches? Alternatively, does it make sense to have s.applyGpuConfig return (*DeviceConfigState, error) like the other apply functions, and keep the other function unchanged?

Comment on lines 408 to 412
    if config.Sharing != nil {
        err := s.applySharingConfig(ctx, config.Sharing, claim, results, configState)
        if err != nil {
            return err
        }
    }
Member:

Question: We unconditionally call applySharingConfig for mig devices. Since we now have the case where config.Sharing can be nil, could we either update applySharingConfig to ignore nil configs, or introduce a Disabled config so that we can set that for cases where we don't expect sharing to be allowed?

Author:

For MIG devices, config.Sharing can never be nil, right? MIG devices always support sharing, so Normalize would initialize that config.
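
For completeness, the nil-tolerant variant floated above could be as small as this (a sketch; it assumes configapi.Sharing is an interface type, as the signatures in this diff suggest):

func (s *DeviceState) applySharingConfig(ctx context.Context, config configapi.Sharing, claim *resourceapi.ResourceClaim, results []*resourceapi.DeviceRequestAllocationResult) (*DeviceConfigState, error) {
    // Hypothetical guard: a nil sharing config means sharing is
    // disabled, so there is nothing to apply.
    if config == nil {
        return &DeviceConfigState{}, nil
    }
    // ... existing strategy handling and final return ...
}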

}

func (s *DeviceState) applySharingConfig(ctx context.Context, config configapi.Sharing, claim *resourceapi.ResourceClaim, results []*resourceapi.DeviceRequestAllocationResult) (*DeviceConfigState, error) {
func (s *DeviceState) applySharingConfig(ctx context.Context, config configapi.Sharing, claim *resourceapi.ResourceClaim, results []*resourceapi.DeviceRequestAllocationResult, configState *DeviceConfigState) error {
Member:

As mentioned, I think we should keep this signature the same as before. I would rather avoid passing an argument by reference for modification.

return nil
}

func (s *DeviceState) applyGpuDriverConfig(ctx context.Context, config *configapi.GpuDriverConfig, results []*resourceapi.DeviceRequestAllocationResult, configState *DeviceConfigState) error {
Member:

Suggested change
func (s *DeviceState) applyGpuDriverConfig(ctx context.Context, config *configapi.GpuDriverConfig, results []*resourceapi.DeviceRequestAllocationResult, configState *DeviceConfigState) error {
func (s *DeviceState) applyGpuDriverConfig(ctx context.Context, config *configapi.GpuDriverConfig, results []*resourceapi.DeviceRequestAllocationResult) (*DeviceConfigState, error) {

@@ -199,6 +199,10 @@ func (l deviceLib) enumerateImexChannels(config *Config) (AllocatableDevices, er
return devices, nil
}

func getPciAddressFromNvmlPciInfo(info nvml.PciInfo) string {
    return fmt.Sprintf("%04x:%02x:%02x.0", info.Domain, info.Bus, info.Device)
Author:
It should be the same if it's returning the address in the format 0000:0a:00.0. I wasn't aware of that implementation; I'll switch to that.
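
For context, nvml.PciInfo also carries a pre-formatted bus ID in its BusId field; a sketch of extracting it (the helper name is illustrative):

// busIDString converts the fixed-size, NUL-terminated BusId field of
// nvml.PciInfo into a Go string such as "0000:0a:00.0".
func busIDString(info nvml.PciInfo) string {
    var b []byte
    for _, c := range info.BusId {
        if c == 0 {
            break
        }
        b = append(b, byte(c))
    }
    return string(b)
}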

    if ret != nvml.SUCCESS {
        return nil, fmt.Errorf("error getting PCI info for device %d: %v", index, ret)
    }
    pciAddress := getPciAddressFromNvmlPciInfo(pciInfo)
Member:

Could we replace this with device.GetBusID()?

}

func (vm *VfioPciManager) loadVfioPciModule() error {
    cmd := exec.Command("modprobe", vm.vfioPciModule) //nolint:gosec
Member:

Should we chroot to the host root mounted into the container instead of expecting /sys/ to be writable?

Author:

@elezar Do you mean something like: exec.Command("nsenter", "--mount=/proc/1/ns/mnt", "--", "modprobe", vm.vfioPciModule)?

}

func (vm *VfioPciManager) getIommuGroupForVfioPciDevice(pciAddress string) string {
    iommuGroup, err := os.Readlink(filepath.Join(vm.pciDevicesRoot, pciAddress, "iommu_group"))
Member:

Do we have similar functionality in go-nvlib?

Author:

So we have it tracked in NvidiaPciDevice here: https://github.com/NVIDIA/go-nvlib/blob/main/pkg/nvpci/nvpci.go. I'm guessing nvpci is the sysfs way of discovering GPUs, but nvlib doesn't have it.

Btw, one of the implicit assumptions I haven't captured here is that the IOMMU needs to be enabled in the kernel for us to be able to do GPU passthrough. If it isn't, we would hit the err in L171 and return an empty iommu group.

Member:

Can we reuse the NvidiaPciDevice from go-nvlib/pkg/nvpci? It seems as if it is set here https://github.com/NVIDIA/go-nvlib/blob/1482a942fb6d52a023cff85c2d76ed4127af661a/pkg/nvpci/nvpci.go#L292-L304.
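
For reference, the sysfs resolution itself is small; a self-contained sketch (the helper name and hard-coded paths are illustrative):

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// iommuGroupForDevice follows the iommu_group symlink in sysfs, e.g.
// /sys/bus/pci/devices/0000:0a:00.0/iommu_group -> .../kernel/iommu_groups/42,
// and returns the group number (the symlink's basename). An empty
// result means the symlink is absent, i.e. no IOMMU is enabled.
func iommuGroupForDevice(pciAddress string) string {
    link, err := os.Readlink(filepath.Join("/sys/bus/pci/devices", pciAddress, "iommu_group"))
    if err != nil {
        return ""
    }
    return filepath.Base(link)
}

func main() {
    fmt.Println(iommuGroupForDevice("0000:0a:00.0"))
}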

Comment on lines 106 to 111
- name: sysfs
  mountPath: /sys
  readOnly: false
- name: dev-vfio
  mountPath: /dev/vfio
  readOnly: false
Member:

Should these be optionally mounted instead?

# Usage: ./bind_to_driver.sh <ssss:bb:dd.f> <driver>
# Bind the GPU specified by the PCI_ID=ssss:bb:dd.f to the given driver.

bind_to_driver()
Member:

Why not implement this in Go?

Author:

At least for unbind, in case unbindLock is in use, we'd need the same process that acquires the unbindLock to be the one doing the driver unbind. I want to scope that unbindLock to just the task that's doing the unbind rather than to the whole kubelet plugin binary.

(FYI: unbindLock is in play when the NVIDIA GRID vGPU driver is used on the node.)

bind is a script just to be consistent with unbind.
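
For comparison, the same sequence in Go via the standard Linux PCI sysfs interface (a sketch; the function name and error messages are illustrative, and it ignores the unbindLock concern above):

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// bindToDriver mirrors bind_to_driver.sh: unbind the device from its
// current driver, pin the next probe to the requested driver via
// driver_override, then ask the PCI core to re-probe the device.
func bindToDriver(pciAddress, driver string) error {
    dev := filepath.Join("/sys/bus/pci/devices", pciAddress)
    // Unbind from whichever driver currently owns the device, if any.
    if _, err := os.Stat(filepath.Join(dev, "driver")); err == nil {
        if err := os.WriteFile(filepath.Join(dev, "driver", "unbind"), []byte(pciAddress), 0200); err != nil {
            return fmt.Errorf("unbind failed: %w", err)
        }
    }
    // Steer the next probe to the requested driver (e.g. "vfio-pci").
    if err := os.WriteFile(filepath.Join(dev, "driver_override"), []byte(driver), 0200); err != nil {
        return fmt.Errorf("setting driver_override failed: %w", err)
    }
    // Trigger a probe so the kernel binds the device.
    if err := os.WriteFile("/sys/bus/pci/drivers_probe", []byte(pciAddress), 0200); err != nil {
        return fmt.Errorf("drivers_probe failed: %w", err)
    }
    return nil
}

func main() {
    if err := bindToDriver("0000:0a:00.0", "vfio-pci"); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}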

Varun Ramachandra Sekar added 2 commits December 5, 2024 11:40
Varun Ramachandra Sekar added 3 commits December 9, 2024 10:31
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
@@ -66,3 +66,5 @@ nodes:
# on the kind nodes.
- hostPath: /usr/bin/nvidia-ctk
  containerPath: /usr/bin/nvidia-ctk
- hostPath: /sys
  containerPath: /sys
Member:

nit: newline

- name: dev-vfio
  mountPath: /dev/vfio
  readOnly: false
{{- end -}}
Member:

nit: newline

fi
}

bind_to_driver "$1" "$2" || exit 1
Member:

nit: newline.

return 0
}

unbind_from_driver "$1" || exit 1
Member:

nit: newline
