Problems with MPS quickstart #106

anencore94 · 2024-05-03T08:13:00Z

Description

During testing of the MPS-related Quickstart (using the demo script to create a kind cluster), I encountered several issues concerning the deployment of the MPS control daemon and pod deletion processes.

Issues Encountered

MPS Control Daemon Deployment Failure:
The MPS control deployment did not deploy. The logs from the nvidia-k8s-dra-driver-kubelet-plugin daemonset indicated the following errors:

Defaulted container "plugin" out of: plugin, init (init)
I0503 03:44:08.333148       1 device_state.go:146] using devRoot=/driver-root
I0503 03:44:08.341885       1 nonblockinggrpcserver.go:105] "GRPC server started" logger="dra"
I0503 03:44:08.341960       1 nonblockinggrpcserver.go:105] "GRPC server started" logger="registrar"
I0503 03:44:17.213001       1 driver.go:104] NodePrepareResource is called: number of claims: 1
I0503 03:44:17.219672       1 sharing.go:183] Starting MPS control daemon for 'af3fbcca-a63a-4a62-8393-bf663267b4dc', with settings: &{DefaultActiveThreadPercentage:0xc0006ae510 DefaultPinnedDeviceMemoryLimit:10Gi DefaultPerDevicePinnedMemoryLimit:map[]}
E0503 03:44:17.227691       1 mount_linux.go:230] Mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t tmpfs -o rw,nosuid,nodev,noexec,relatime,size=65536k shm /var/lib/kubelet/plugins/gpu.resource.nvidia.com/mps/af3fbcca-a63a-4a62-8393-bf663267b4dc/shm
Output: mount: /lib/x86_64-linux-gnu/libselinux.so.1: no version information available (required by mount)
mount: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by mount)
mount: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by mount)
mount: /lib/x86_64-linux-gnu/libmount.so.1: version `MOUNT_2_38'

However, running the same mount command directly on the host node (via docker exec -it k8s-dra-driver-cluster-worker bash) succeeded.

Modifying the Docker image BASE_DIST from Ubuntu 20.04 to Ubuntu 22.04 (thereby updating GLIBC to version 2.37) resolved the issues with libselinux and libc but not with libmount (version mismatch continued with MOUNT_2_38 not found). Eventually, manually mounting /lib/x86_64-linux-gnu/libmount.so.1 from the host node to use the 2_38 version resolved the issue, allowing the MPS daemon and example pods to deploy correctly.

Pod Deletion Issue:
Pods created with kubectl apply often remain stuck in a Terminating state when attempting deletion with kubectl delete. Forcing the deletion (--force) seems to resolve this temporarily, but any subsequent applications of kubectl apply result in the MPS daemon deployment failing to deploy correctly.

Questions/Requests

Validation of Behavior:
Is the described workaround (modifying the Dockerfile and using hostPath volumeMounts) expected to be necessary, or could a misconfiguration or bug be causing these issues? If it's a bug, I would appreciate guidance on how to proceed with a fix.
Pod Deletion Stuck in Terminating State:
Is this a known issue? Are there any recommended solutions to avoid pods getting stuck in Terminating state without using --force?

Thank you for your attention to these issues. I look forward to your insights and recommendations on these matters.


klueska · 2024-05-06T15:02:05Z

That's strange. The only reason I could see this happening is if we somehow set the PATH such that it resolves mount to the host's binary while still using the container's LD_LIBRARY_PATH. @elezar do you have any thoughts on why this might be happening?

elezar · 2024-05-06T15:11:00Z

The issue is that we're running the following:

	updatePathListEnvvar("PATH", filepath.Dir(nvidiaSMIPath))

which adds the directory containing nvidia-smi to the PATH. In the container this is /driver-root/usr/bin, so when we run:

	mountExecutable, err := exec.LookPath("mount")
	if err != nil {
		return fmt.Errorf("error finding 'mount' executable: %w", err)
	}

we find /driver-root/usr/bin/mount, which is the host's executable rather than the container's.

klueska · 2024-05-06T15:16:35Z

Yeah, that would do it.

klueska · 2024-05-06T15:21:01Z

Do we need to set these envvars in the plugin itself, or can they be passed to the ENV of the exec.Command call when we invoke nvidia-smi?

elezar · 2024-05-06T16:33:07Z

We shouldn't need to set it for the plugin and can pass this to exec instead.

Note that for the compute mode we can also use the NVML api directly.

anencore94 · 2024-05-07T02:03:22Z

Thanks for clarifying 👍
Does the nvidia-smi compute-policy correspond to ComputeMode in NVML?

klueska · 2024-05-07T05:45:06Z

Yes. We are in the process of getting the NVML team to update things so that we can set a compute mode on a MIG device as well.

Previously, we were setting the envvars needed for nvidia-smi in the plugin's environment itself. This caused errors, however, when shelling out to other binaries that didn't need these envvars set. This change pushes the setting of the envvars into the actual exec call for nvidia-smi instead of setting them in the plugin's environment itself. Closes NVIDIA#106 Signed-off-by: Kevin Klues <[email protected]>

anencore94 · 2024-05-08T02:57:59Z

Thanks for the fast reply. By the way, regarding question 2 (Pod Deletion Stuck in Terminating State), is it resolved by #109? @klueska
I have hit cases where, after deleting the MPS GPU pod, I had to restart the kubelet on the GPU worker node (systemctl restart kubelet).

Bump golangci/golangci-lint-action from 4 to 6. Signed-off-by: dependabot[bot] <[email protected]>

Add a demo svg file showing the basic DRA use; update the demo, its svg format, and its description. Signed-off-by: Yuan Chen <[email protected]>

Add hostPID to MPS daemon template. Without this, the MPS server was not able to find its own PID via /proc/self and was failing to start. It's unclear why this wasn't needed previously, but it makes sense why adding hostPID would solve this. Signed-off-by: Kevin Klues <[email protected]>

Add basic examples for Linux workstations; update README; remove the timeslicing example; add restartPolicy to examples; update demo files. Signed-off-by: Yuan Chen <[email protected]>

klueska mentioned this issue May 7, 2024: Update logic to set environment for calls out to nvidia-smi #110 (Merged)

elezar mentioned this issue May 7, 2024: Set PATH per nvidia-smi invocation #111 (Closed)

klueska closed this as completed in #110 May 7, 2024

yuanchen8911 mentioned this issue May 15, 2024: Update logic to set environment for calls out to nvidia-smi #119 (Closed)
