Add dev root option to mig-manager container #69
Conversation
echo "Creating NVIDIA control device nodes" | ||
nvidia-ctk system create-device-nodes --control-devices --driver-root=${DRIVER_ROOT_CTR_PATH} | ||
nvidia-ctk system create-device-nodes \ | ||
--load-kernel-modules \ |
This won't work currently since the toolkit code that loads kernel modules relies on chroot'ing into the driverRoot.
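Roughly, the module loading looks something like the following sketch (an approximation for illustration only, not the actual toolkit code; the module names are assumptions):

# sketch: load the kernel modules by chroot'ing into the driver root
chroot "${DRIVER_ROOT_CTR_PATH}" modprobe nvidia
chroot "${DRIVER_ROOT_CTR_PATH}" modprobe nvidia-uvm

If ${DRIVER_ROOT_CTR_PATH} is not a chroot'able root filesystem, both calls fail, which is why the flag is problematic here.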
General question -- does anyone remember why it was required to run nvidia-smi first before creating the device nodes and generating the management CDI spec?
I have removed --load-kernel-modules for the time being, since this currently does not work when the driverRoot is not chroot'able.
I think we use nvidia-smi to create the non-control device nodes. I don't believe it's required to run nvidia-smi BEFORE we run this command, but we do need to run it in addition to this command if we want the device nodes to be created.
The issue with nvidia-smi is that it doesn't create the /dev/nvidia-uvm* device nodes, which is why we create them here.
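Put together, the flow ends up looking something like this (a sketch assuming a chroot'able driver root, not the exact script contents):

# sketch: nvidia-smi creates the per-GPU (non-control) device nodes
chroot "${DRIVER_ROOT_CTR_PATH}" nvidia-smi >/dev/null
# sketch: nvidia-ctk creates /dev/nvidia-uvm* and the other control device nodes
nvidia-ctk system create-device-nodes --control-devices --driver-root=${DRIVER_ROOT_CTR_PATH}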
The non-control devices should typically exist at this point, except in the scenario where a reboot is required to toggle the MIG mode and the driver is preinstalled. Even though the driver and toolkit validations run before mig-manager starts, the invocation of nvidia-smi will create the device nodes in the container's /dev and not the host's.
Any recommendations for how we should handle this case?
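To illustrate the problem (a sketch; /host is assumed to be where the host root filesystem is mounted in the container):

# sketch: run inside the mig-manager container
nvidia-smi                # creates missing device nodes under the container's own /dev
chroot /host nvidia-smi   # creates them under the host's /dev instead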
In the longer term this seems related to our proposed driver API: one of the postconditions of a driver installation would be that the required device nodes are created -- by the driver "container". This should ensure that we don't have to keep implementing these workarounds.
Thinking on this a bit more, I think one of the issues here is that the nvidia-ctk system create-device-nodes command does not actually need access to the driver root; what we should be passing in is the "kernel module" root instead.
I have updated the script to run nvidia-smi first if the NVIDIA driver is installed on the host (e.g. at /). This should address the scenario I described in #69 (comment).
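Roughly, the added guard looks like the following sketch (using the variable names from this script; not the exact diff):

# sketch: when the driver is preinstalled on the host, its root is mounted at /host,
# so run nvidia-smi from there first so the device nodes land on the host's /dev
if [ "${DRIVER_ROOT_CTR_PATH}" = "/host" ]; then
    chroot "${DRIVER_ROOT_CTR_PATH}" nvidia-smi >/dev/null || exit_failed
fi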
    --load-kernel-modules \
    --control-devices \
    --driver-root=${DRIVER_ROOT_CTR_PATH} \
    --dev-root=${DEV_ROOT_CTR_PATH}
This assumes we add a --dev-root option to the nvidia-ctk system create-device-nodes command.
I have reverted this change for now, and instead have resorted to setting --driver-root=${DEV_ROOT_CTR_PATH}, since nvidia-ctk system create-device-nodes currently creates device nodes there.
8c33f3a to d3bba75
versions.mk (Outdated)
@@ -24,4 +24,4 @@ BUILDIMAGE ?= ghcr.io/nvidia/k8s-test-infra:$(BUILDIMAGE_TAG)

GIT_COMMIT ?= $(shell git describe --match="" --dirty --long --always --abbrev=40 2> /dev/null || echo "")

-NVIDIA_CTK_VERSION := v1.14.6
+NVIDIA_CTK_VERSION := v1.15.0
Do we need unreleased features for this?
Updated to v1.16.0-rc.1.
I have some minor questions about the nvidia-ctk interaction, but these are not blockers.
chroot ${DRIVER_ROOT_CTR_PATH} nvidia-smi >/dev/null
if [ "${?}" != "0" ]; then
exit_failed
if [ "${DRIVER_ROOT_CTR_PATH}" = "/host" ]; then
Does the /run/nvidia/driver case still work as expected?
Based on my initial testing, yes, but I will keep investigating this.
I was wrong -- the /run/nvidia/driver case was not working with this change. As suspected, without running nvidia-smi, the nvidia-cap* device nodes were not created / updated correctly after a MIG configuration, and thus the CDI specifications were not accurate. I have raised #82 to address both the /run/nvidia/driver case as well as the GKE case, where driverRoot != devRoot.
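A sketch of one way the /run/nvidia/driver case could be handled (for illustration only; not necessarily what #82 does, and it assumes the driver root is chroot'able):

# sketch: after applying a MIG configuration, re-run nvidia-smi from the driver
# root so the nvidia-cap* device nodes are (re)created before the CDI spec is regenerated
chroot "${DRIVER_ROOT_CTR_PATH}" nvidia-smi >/dev/null || exit_failed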
nvidia-ctk system create-device-nodes --control-devices --driver-root=${DRIVER_ROOT_CTR_PATH}
nvidia-ctk system create-device-nodes \
    --control-devices \
    --driver-root=${DEV_ROOT_CTR_PATH}
Do we want to merge NVIDIA/nvidia-container-toolkit#526 first and switch to --dev-root here?
Updated to use --dev-root.
Signed-off-by: Christopher Desiniotis <[email protected]>
Signed-off-by: Christopher Desiniotis <[email protected]>
This change ensures that the nvidia-mig-manager CLI, which requires the path to the driver root, accepts both the NVIDIA_DRIVER_ROOT and DRIVER_ROOT environment variables in addition to the --driver-root command line argument. Signed-off-by: Christopher Desiniotis <[email protected]>
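For illustration, the lookup could be expressed in shell roughly as below (the precedence between the flag and the two environment variables is an assumption, and DRIVER_ROOT_FLAG is a hypothetical placeholder for the parsed --driver-root value):

# sketch: prefer the --driver-root flag, then fall back to NVIDIA_DRIVER_ROOT, then DRIVER_ROOT
driver_root="${DRIVER_ROOT_FLAG:-${NVIDIA_DRIVER_ROOT:-${DRIVER_ROOT}}}"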
No description provided.