Add dev root option to mig-manager container #69

Merged
cdesiniotis merged 3 commits into main from dev-root on Jun 17, 2024

Conversation

cdesiniotis (Contributor)

No description provided.

@cdesiniotis cdesiniotis self-assigned this May 10, 2024
@cdesiniotis cdesiniotis changed the title Add dev root option to mig-manager container Draft: Add dev root option to mig-manager container May 10, 2024
@cdesiniotis cdesiniotis marked this pull request as draft May 10, 2024 21:13
echo "Creating NVIDIA control device nodes"
nvidia-ctk system create-device-nodes --control-devices --driver-root=${DRIVER_ROOT_CTR_PATH}
nvidia-ctk system create-device-nodes \
--load-kernel-modules \
cdesiniotis (Contributor Author):

This won't work currently since the toolkit code to load kernel modules relies on chroot'ing into the driverRoot.

General question -- does anyone remember why it was required to run nvidia-smi first before creating the device nodes and generating the management CDI spec?

cdesiniotis (Contributor Author):

I have removed --load-kernel-modules for the time being since this currently does not work when driverRoot is not chroot'able.

Member:

I think we use nvidia-smi to create the non-control device nodes. I don't believe it's required to run nvidia-smi before we run this command, but we do need to run it in addition to this command if we want the device nodes to be created.

The issue with nvidia-smi is that it doesn't create the /dev/nvidia-uvm* device nodes, which is why we create them here.
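
For reference, a minimal sketch of the two-step flow being described here, assuming DRIVER_ROOT_CTR_PATH points at the driver root mounted in the container; the exact ordering and error handling in reconfigure-mig.sh may differ:

    # nvidia-smi (re)creates the per-GPU, non-control device nodes under /dev ...
    chroot "${DRIVER_ROOT_CTR_PATH}" nvidia-smi >/dev/null || exit 1

    # ... but it does not create /dev/nvidia-uvm* or the other control nodes,
    # so those are created explicitly with nvidia-ctk.
    nvidia-ctk system create-device-nodes \
        --control-devices \
        --driver-root="${DRIVER_ROOT_CTR_PATH}"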

cdesiniotis (Contributor Author):

The non-control device nodes should typically exist at this point, except in the scenario where a reboot is required to toggle the MIG mode and the driver is preinstalled. Even though the driver and toolkit validations run before mig-manager starts, the invocation of nvidia-smi will create the device nodes in the container's /dev and not the host's.

Any recommendations for how we should handle this case?

Member:

In the longer term this seems related to our proposed driver API, meaning that one of the postconditions of a driver installation is that the required device nodes are created by the driver "container". This should ensure that we don't have to keep implementing these workarounds.

Thinking on this a bit more, I think one of the issues here is that the nvidia-ctk system create-device-nodes command does not actually need access to the driver root; what we should be passing in is the "kernel module" root instead.

cdesiniotis (Contributor Author):

I have updated the script to run nvidia-smi first if the NVIDIA driver is installed on the host (e.g. at /). This should address the scenario I described in #69 (comment)
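
A minimal sketch of that host-installed-driver special case (the comparison against /host and the exact error handling are assumptions based on the hunks quoted elsewhere in this review):

    if [ "${DRIVER_ROOT_CTR_PATH}" = "/host" ]; then
        echo "Running nvidia-smi to create the non-control device nodes"
        chroot "${DRIVER_ROOT_CTR_PATH}" nvidia-smi >/dev/null
        if [ "${?}" != "0" ]; then
            echo "Unable to run nvidia-smi"
            exit 1   # the real script uses its exit_failed helper here
        fi
    fi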

+    --load-kernel-modules \
+    --control-devices \
+    --driver-root=${DRIVER_ROOT_CTR_PATH} \
+    --dev-root=${DEV_ROOT_CTR_PATH}
cdesiniotis (Contributor Author):

This assumes we add a --dev-root option to the nvidia-ctk system create-device-nodes command.

cdesiniotis (Contributor Author):

I have reverted this change for now, and instead have resorted to setting --driver-root=${DEV_ROOT_CTR_PATH} since nvidia-ctk system create-device-nodes currently creates the device nodes under the driver root.

deployments/container/reconfigure-mig.sh (conversation resolved)
@cdesiniotis cdesiniotis force-pushed the dev-root branch 2 times, most recently from 8c33f3a to d3bba75 Compare May 23, 2024 23:21
@cdesiniotis cdesiniotis changed the title Draft: Add dev root option to mig-manager container Add dev root option to mig-manager container May 23, 2024
@cdesiniotis cdesiniotis marked this pull request as ready for review May 23, 2024 23:23
versions.mk (outdated)
@@ -24,4 +24,4 @@ BUILDIMAGE ?= ghcr.io/nvidia/k8s-test-infra:$(BUILDIMAGE_TAG)

 GIT_COMMIT ?= $(shell git describe --match="" --dirty --long --always --abbrev=40 2> /dev/null || echo "")

-NVIDIA_CTK_VERSION := v1.14.6
+NVIDIA_CTK_VERSION := v1.15.0
Member:

Do we need unreleased features for this?

cdesiniotis (Contributor Author):

Updated to v1.16.0-rc.1.

elezar (Member) left a comment:

I have some minor questions in terms of the nvidia-ctk interaction, but these are not blockers.

-chroot ${DRIVER_ROOT_CTR_PATH} nvidia-smi >/dev/null
-if [ "${?}" != "0" ]; then
-    exit_failed
+if [ "${DRIVER_ROOT_CTR_PATH}" = "/host" ]; then
Member:

Does the /run/nvidia/driver case still work as expected?

cdesiniotis (Contributor Author):

Based on my initial testing, yes, but I will keep investigating this.

cdesiniotis (Contributor Author):

I was wrong -- the /run/nvidia/driver case was not working with this change. As suspected, without running nvidia-smi the nvidia-cap* device nodes were not created / updated correctly after applying a MIG configuration, and thus the CDI specifications were not accurate. I have raised #82 to address both the /run/nvidia/driver case and the GKE case, where driverRoot != devRoot.
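
For illustration, a sketch of the ordering this failure implies, assuming the management CDI spec is produced with nvidia-ctk cdi generate --mode=management; the output path shown is illustrative:

    # Refresh the device nodes (including /dev/nvidia-cap*) after the MIG change ...
    chroot "${DRIVER_ROOT_CTR_PATH}" nvidia-smi >/dev/null

    # ... and only then regenerate the management CDI spec so that it does not
    # reference stale nvidia-cap* nodes.
    nvidia-ctk cdi generate \
        --mode=management \
        --driver-root="${DRIVER_ROOT_CTR_PATH}" \
        --output=/var/run/cdi/management.yaml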

-nvidia-ctk system create-device-nodes --control-devices --driver-root=${DRIVER_ROOT_CTR_PATH}
+nvidia-ctk system create-device-nodes \
+    --control-devices \
+    --driver-root=${DEV_ROOT_CTR_PATH}
Member:

Do we want to merge NVIDIA/nvidia-container-toolkit#526 first and switch to --dev-root here?

cdesiniotis (Contributor Author):

Updated to use --dev-root.

Signed-off-by: Christopher Desiniotis <[email protected]>
Signed-off-by: Christopher Desiniotis <[email protected]>
This change ensures that the nvidia-mig-manager CLI, which requires the path to the driver root, accepts both the NVIDIA_DRIVER_ROOT and DRIVER_ROOT environment variables in addition to the --driver-root command line argument.

Signed-off-by: Christopher Desiniotis <[email protected]>
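
As a rough usage sketch of what the last commit describes (the flag and environment variable names come from the commit message; any other arguments nvidia-mig-manager requires are omitted, and the path shown is illustrative):

    # Previously the driver root had to be passed explicitly on the command line:
    #   nvidia-mig-manager --driver-root=/run/nvidia/driver
    # With this change the same value can also come from the environment:
    export NVIDIA_DRIVER_ROOT=/run/nvidia/driver   # or: export DRIVER_ROOT=/run/nvidia/driver
    nvidia-mig-manager                             # remaining arguments unchanged, omitted here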
@cdesiniotis cdesiniotis merged commit ae6d51b into NVIDIA:main Jun 17, 2024
9 checks passed