Support for allocating GPUs in Passthrough-Mode #183

Open

varunrsekar wants to merge 9 commits into main

Conversation

varunrsekar (Author):

This PR introduces a new DeviceClass, vfiopci.nvidia.com, that allocates a full GPU in passthrough (PT) mode by binding the GPU to the vfio-pci driver.

The primary use cases for this new DeviceClass are Kata containers and KubeVirt VMs, which require the GPU to be in PT mode and made available to a pod that then spins up a guest with the GPU.

Note: Regular pod workloads will not benefit from this DeviceClass and shouldn't try to use it.
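
For illustration, a claim against this class might be constructed like this (a sketch assuming the resource.k8s.io/v1beta1 API; the claim name and namespace mirror the validation output below and are illustrative):

package main

import (
    "fmt"

    resourceapi "k8s.io/api/resource/v1beta1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// passthroughClaim requests one full GPU in passthrough mode by
// referencing the new vfiopci.nvidia.com DeviceClass.
func passthroughClaim() *resourceapi.ResourceClaim {
    return &resourceapi.ResourceClaim{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "pod1-gpu",
            Namespace: "gpu-test-vfiopci",
        },
        Spec: resourceapi.ResourceClaimSpec{
            Devices: resourceapi.DeviceClaim{
                Requests: []resourceapi.DeviceRequest{{
                    Name:            "gpu",
                    DeviceClassName: "vfiopci.nvidia.com",
                }},
            },
        },
    }
}

func main() {
    fmt.Println(passthroughClaim().Name)
}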

As part of this change, I've introduced some (but not all) of the modifications to the kind cluster config that are needed for this DeviceClass to work. The following host-level modifications are also required:

# Example on Ubuntu:

# Enable IOMMU on the host kernel
if ! grep -q "GRUB_CMDLINE_LINUX_DEFAULT=.*intel_iommu=on" /etc/default/grub; then
   sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="/GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on /g' /etc/default/grub
fi
sudo update-grub

# Disable GDM
sudo systemctl stop gdm && sudo systemctl disable gdm
# Unload nvidia-drm
sudo modprobe -r nvidia-drm
# Reboot the node
sudo reboot

Validated on a kind cluster with a Quadro P2000 GPU:

$ nvidia-smi -L
GPU 0: Quadro P2000 (UUID: GPU-7bea1569-778c-fb4d-7801-df6b6b85ceac)
$ k get resourceclaim -n gpu-test-vfiopci
NAME             STATE                AGE
pod1-gpu-k9w6g   allocated,reserved   21s

$ k get pod -n gpu-test-vfiopci
NAME   READY   STATUS    RESTARTS   AGE
pod1   1/1     Running   0          2m20s

Open items:

  • How to make sysfs on the kind cluster node read-write mountable?
  • How to handle kubelet plugin restarts while the GPU is bound to the vfio-pci driver, given that no device discovery is possible at that time.

varunrsekar (Author):
/cc @klueska

Varun Ramachandra Sekar added 3 commits December 2, 2024 14:27
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
type GpuDriver string

const (
    NvidiaDriver GpuDriver = "nvidia"
Member:

Is there another name for this driver that isn't nvidia?

Author:

Nope. The driver is simply called nvidia

}

// Validate ensures that GpuDriverConfig has a valid set of values.
func (c *GpuDriverConfig) Validate() error {
Member:

Question: Is c == nil valid?

Author:

Nope. gpuConfig.Normalize() would always ensure GpuDriverConfig is set

Member:

My question should probably have been: "Should we add an explicit nil check and return an error if this is the case?"
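
A minimal sketch of that guard (the error wording is illustrative):

// Hypothetical explicit guard: fail with an error rather than panicking
// if Normalize() was never called and the receiver is nil.
func (c *GpuDriverConfig) Validate() error {
    if c == nil {
        return fmt.Errorf("gpu driver configuration is unset")
    }
    // ... existing driver validation follows ...
    return nil
}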

    case VfioPciDriver:
        break
    default:
        return fmt.Errorf("invalid driver specified in gpu driver configuration")
Member:

Should we include the invalid value here?
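
Something along these lines, assuming the invalid value lives in c.Driver (the field name is not visible in this hunk):

default:
    return fmt.Errorf("invalid driver %q specified in gpu driver configuration", c.Driver)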

Comment on lines 65 to 68
    if c.DriverConfig.Driver != NvidiaDriver {
        return nil
    }
Member:

Does adding a SupportsSharing() function to the GpuDriver type make this clearer?
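
A hedged sketch of that helper (the constants mirror those defined earlier in this PR):

// SupportsSharing reports whether GPUs managed by this driver can be
// shared between workloads. A GPU bound to vfio-pci is consumed whole
// by a single guest, so only the nvidia driver supports sharing.
func (d GpuDriver) SupportsSharing() bool {
    return d == NvidiaDriver
}

The check in this hunk would then read: if !c.DriverConfig.Driver.SupportsSharing() { return nil }.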

Member:

Also, is the expectation that we always call Validate() after Normalize() so that we check whether there is no sharing actually configured? Does it make sense to remove this check here entirely so that we only need to update the logic in one place with regard to the driver type and sharing interaction?

Author:

> Does it make sense to remove this check here entirely so that we only need to update the logic in one place with regard to the driver type and sharing interaction?

Removing the check here won't work. There are two possibilities:

  1. With the NVIDIA driver, normalize sharing by initializing it if it's nil, and then set the config based on the strategy.
  2. With the VFIO-PCI driver, don't initialize it.

If we allowed normalization of c.Sharing in case (2) by relaxing the check, we would fail subsequent validation.
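
Roughly, the branching in Normalize() looks like this (a sketch; GpuSharing and TimeSlicingStrategy are assumed names for the existing sharing types):

func (c *GpuConfig) Normalize() error {
    if c.DriverConfig == nil {
        c.DriverConfig = &GpuDriverConfig{Driver: NvidiaDriver}
    }
    // Case (2): vfio-pci. Leave c.Sharing nil; Validate() rejects any
    // sharing block that was set explicitly.
    if c.DriverConfig.Driver != NvidiaDriver {
        return nil
    }
    // Case (1): nvidia. Default the sharing config before applying
    // strategy-specific settings.
    if c.Sharing == nil {
        c.Sharing = &GpuSharing{Strategy: TimeSlicingStrategy}
    }
    return nil
}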

        allocatable:       allocatable,
        config:            config,
        nvdevlib:          nvdevlib,
        checkpointManager: checkpointManager,
    }

    // Initialize the vfio-pci driver manager.
    vfioPciManager.Init()
Member:

Why do we init this here? Where are other managers initialised?

Author:

tsManager/mpsManager don't have any "initializations". But checkpointManager is initialized underneath: L117-L131.

What's the concern here?

switch castConfig := config.(type) {
case *configapi.GpuConfig:
return s.applySharingConfig(ctx, castConfig.Sharing, claim, results)
configState.GpuConfig = castConfig
err = s.applyGpuConfig(ctx, castConfig, claim, results, &configState)
Member:

Why not still return if err != nil in each of the branches? Alternatively, does it make sense to have s.applyGpuConfig return (*DeviceConfigState, error) like the other apply functions, and keep the other function unchanged?

Comment on lines 408 to 412
    if config.Sharing != nil {
        err := s.applySharingConfig(ctx, config.Sharing, claim, results, configState)
        if err != nil {
            return err
        }
    }
Member:

Question: We unconditionally call applySharingConfig for mig devices. Since we now have the case where config.Sharing can be nil, could we either update applySharingConfig to ignore nil configs, or introduce a Disabled config so that we can set that for cases where we don't expect sharing to be allowed?

Author:

For MIG devices, config.Sharing can never be nil, right? MIG devices always support sharing, so Normalize would initialize that config.
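
For completeness, the nil-tolerant variant floated above could be as small as this (a sketch; it assumes configapi.Sharing is an interface type, as the signatures in this diff suggest):

func (s *DeviceState) applySharingConfig(ctx context.Context, config configapi.Sharing, claim *resourceapi.ResourceClaim, results []*resourceapi.DeviceRequestAllocationResult) (*DeviceConfigState, error) {
    // Hypothetical guard: a nil sharing config means sharing is
    // disabled, so there is nothing to apply.
    if config == nil {
        return &DeviceConfigState{}, nil
    }
    // ... existing strategy handling and final return ...
}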

}

func (s *DeviceState) applySharingConfig(ctx context.Context, config configapi.Sharing, claim *resourceapi.ResourceClaim, results []*resourceapi.DeviceRequestAllocationResult) (*DeviceConfigState, error) {
func (s *DeviceState) applySharingConfig(ctx context.Context, config configapi.Sharing, claim *resourceapi.ResourceClaim, results []*resourceapi.DeviceRequestAllocationResult, configState *DeviceConfigState) error {
Member:

As mentioned, I think we should keep this signature the same as before. I would rather avoid passing an argument by reference for modification.

return nil
}

func (s *DeviceState) applyGpuDriverConfig(ctx context.Context, config *configapi.GpuDriverConfig, results []*resourceapi.DeviceRequestAllocationResult, configState *DeviceConfigState) error {
Member:

Suggested change
func (s *DeviceState) applyGpuDriverConfig(ctx context.Context, config *configapi.GpuDriverConfig, results []*resourceapi.DeviceRequestAllocationResult, configState *DeviceConfigState) error {
func (s *DeviceState) applyGpuDriverConfig(ctx context.Context, config *configapi.GpuDriverConfig, results []*resourceapi.DeviceRequestAllocationResult) (*DeviceConfigState, error) {

@@ -199,6 +199,10 @@ func (l deviceLib) enumerateImexChannels(config *Config) (AllocatableDevices, er
return devices, nil
}

func getPciAddressFromNvmlPciInfo(info nvml.PciInfo) string {
    return fmt.Sprintf("%04x:%02x:%02x.0", info.Domain, info.Bus, info.Device)
Author:
It should be the same if it's returning the address in the format 0000:0a:00.0. I wasn't aware of that implementation; I'll switch to that.
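
For context, nvml.PciInfo also carries a pre-formatted bus ID in its BusId field; a sketch of extracting it (the helper name is illustrative):

// busIDString converts the fixed-size, NUL-terminated BusId field of
// nvml.PciInfo into a Go string such as "0000:0a:00.0".
func busIDString(info nvml.PciInfo) string {
    var b []byte
    for _, c := range info.BusId {
        if c == 0 {
            break
        }
        b = append(b, byte(c))
    }
    return string(b)
}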

    if ret != nvml.SUCCESS {
        return nil, fmt.Errorf("error getting PCI info for device %d: %v", index, ret)
    }
    pciAddress := getPciAddressFromNvmlPciInfo(pciInfo)
Member:

Could we replace this with device.GetBusID()?

}

func (vm *VfioPciManager) loadVfioPciModule() error {
    cmd := exec.Command("modprobe", vm.vfioPciModule) //nolint:gosec
Member:

Should we chroot to the host root mounted into the container instead of expecting /sys/ to be writable?

Author:

@elezar Do you mean something like: exec.Command("nsenter", "--mount=/proc/1/ns/mnt", "--", "modprobe", vm.vfioPciModule)?

}

func (vm *VfioPciManager) getIommuGroupForVfioPciDevice(pciAddress string) string {
    iommuGroup, err := os.Readlink(filepath.Join(vm.pciDevicesRoot, pciAddress, "iommu_group"))
Member:

Do we have similar functionality in go-nvlib?

Author:

So we have it tracked in NvidiaPciDevice here: https://github.com/NVIDIA/go-nvlib/blob/main/pkg/nvpci/nvpci.go. I'm guessing nvpci is the sysfs way of discovering GPUs, but nvlib doesn't have it.

Btw, one of the implicit assumptions I haven't captured here is that the IOMMU needs to be enabled in the kernel for us to be able to do GPU passthrough. If it isn't, we would hit the err in L171 and return an empty iommu group.

Member:

Can we reuse the NvidiaPciDevice from go-nvlib/pkg/nvpci? It seems as if it is set here https://github.com/NVIDIA/go-nvlib/blob/1482a942fb6d52a023cff85c2d76ed4127af661a/pkg/nvpci/nvpci.go#L292-L304.
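
For reference, the sysfs resolution itself is small; a self-contained sketch (the helper name and hard-coded paths are illustrative):

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// iommuGroupForDevice follows the iommu_group symlink in sysfs, e.g.
// /sys/bus/pci/devices/0000:0a:00.0/iommu_group -> .../kernel/iommu_groups/42,
// and returns the group number (the symlink's basename). An empty
// result means the symlink is absent, i.e. no IOMMU is enabled.
func iommuGroupForDevice(pciAddress string) string {
    link, err := os.Readlink(filepath.Join("/sys/bus/pci/devices", pciAddress, "iommu_group"))
    if err != nil {
        return ""
    }
    return filepath.Base(link)
}

func main() {
    fmt.Println(iommuGroupForDevice("0000:0a:00.0"))
}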

Comment on lines 106 to 111
- name: sysfs
  mountPath: /sys
  readOnly: false
- name: dev-vfio
  mountPath: /dev/vfio
  readOnly: false
Member:

Should these be optionally mounted instead?

# Usage: ./bind_to_driver.sh <ssss:bb:dd.f> <driver>
# Bind the GPU specified by the PCI_ID=ssss:bb:dd.f to the given driver.

bind_to_driver()
Member:

Why not implement this in Go?

Author:

At least for unbind, in case unbindLock is in use, we'd need the same process that acquires the unbindLock to be the one doing the driver unbind. I want to scope that unbindLock to just the task that's doing the unbind rather than to the whole kubelet plugin binary.

(FYI: unbindLock is in play when the NVIDIA GRID vGPU driver is used on the node.)

bind is a script just to be consistent with unbind.
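
For comparison, the same sequence in Go via the standard Linux PCI sysfs interface (a sketch; the function name and error messages are illustrative, and it ignores the unbindLock concern above):

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// bindToDriver mirrors bind_to_driver.sh: unbind the device from its
// current driver, pin the next probe to the requested driver via
// driver_override, then ask the PCI core to re-probe the device.
func bindToDriver(pciAddress, driver string) error {
    dev := filepath.Join("/sys/bus/pci/devices", pciAddress)
    // Unbind from whichever driver currently owns the device, if any.
    if _, err := os.Stat(filepath.Join(dev, "driver")); err == nil {
        if err := os.WriteFile(filepath.Join(dev, "driver", "unbind"), []byte(pciAddress), 0200); err != nil {
            return fmt.Errorf("unbind failed: %w", err)
        }
    }
    // Steer the next probe to the requested driver (e.g. "vfio-pci").
    if err := os.WriteFile(filepath.Join(dev, "driver_override"), []byte(driver), 0200); err != nil {
        return fmt.Errorf("setting driver_override failed: %w", err)
    }
    // Trigger a probe so the kernel binds the device.
    if err := os.WriteFile("/sys/bus/pci/drivers_probe", []byte(pciAddress), 0200); err != nil {
        return fmt.Errorf("drivers_probe failed: %w", err)
    }
    return nil
}

func main() {
    if err := bindToDriver("0000:0a:00.0", "vfio-pci"); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}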

Varun Ramachandra Sekar added 2 commits December 5, 2024 11:40
Varun Ramachandra Sekar added 3 commits December 9, 2024 10:31
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
@@ -66,3 +66,5 @@ nodes:
# on the kind nodes.
- hostPath: /usr/bin/nvidia-ctk
  containerPath: /usr/bin/nvidia-ctk
- hostPath: /sys
  containerPath: /sys
Member:

nit: newline

- name: dev-vfio
  mountPath: /dev/vfio
  readOnly: false
{{- end -}}
Member:

nit: newline

fi
}

bind_to_driver "$1" "$2" || exit 1
Member:

nit: newline.

return 0
}

unbind_from_driver "$1" || exit 1
Member:

nit: newline
