Problems with MPS quickstart #106

Closed
anencore94 opened this issue May 3, 2024 · 8 comments · Fixed by #110

@anencore94

Description

During testing of the MPS-related Quickstart (using the demo script to create a kind cluster), I encountered several issues concerning the deployment of the MPS control daemon and pod deletion processes.

Issues Encountered

  1. MPS Control Daemon Deployment Failure:
    The MPS control deployment did not deploy. The logs from the nvidia-k8s-dra-driver-kubelet-plugin daemonset indicated the following errors:
Defaulted container "plugin" out of: plugin, init (init)
I0503 03:44:08.333148       1 device_state.go:146] using devRoot=/driver-root
I0503 03:44:08.341885       1 nonblockinggrpcserver.go:105] "GRPC server started" logger="dra"
I0503 03:44:08.341960       1 nonblockinggrpcserver.go:105] "GRPC server started" logger="registrar"
I0503 03:44:17.213001       1 driver.go:104] NodePrepareResource is called: number of claims: 1
I0503 03:44:17.219672       1 sharing.go:183] Starting MPS control daemon for 'af3fbcca-a63a-4a62-8393-bf663267b4dc', with settings: &{DefaultActiveThreadPercentage:0xc0006ae510 DefaultPinnedDeviceMemoryLimit:10Gi DefaultPerDevicePinnedMemoryLimit:map[]}
E0503 03:44:17.227691       1 mount_linux.go:230] Mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t tmpfs -o rw,nosuid,nodev,noexec,relatime,size=65536k shm /var/lib/kubelet/plugins/gpu.resource.nvidia.com/mps/af3fbcca-a63a-4a62-8393-bf663267b4dc/shm
Output: mount: /lib/x86_64-linux-gnu/libselinux.so.1: no version information available (required by mount)
mount: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by mount)
mount: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by mount)
mount: /lib/x86_64-linux-gnu/libmount.so.1: version `MOUNT_2_38'

However, running the same mount command directly on the host node (via docker exec -it k8s-dra-driver-cluster-worker bash) succeeded.

Modifying the Docker image BASE_DIST from Ubuntu 20.04 to Ubuntu 22.04 (thereby updating GLIBC to version 2.37) resolved the issues with libselinux and libc but not with libmount (version mismatch continued with MOUNT_2_38 not found). Eventually, manually mounting /lib/x86_64-linux-gnu/libmount.so.1 from the host node to use the 2_38 version resolved the issue, allowing the MPS daemon and example pods to deploy correctly.

  2. Pod Deletion Issue:
    Pods created with kubectl apply often remain stuck in a Terminating state when deleted with kubectl delete. Forcing the deletion (--force) seems to resolve this temporarily, but any subsequent kubectl apply results in the MPS daemon deployment failing to deploy correctly.

Questions/Requests

  1. Validation of Behavior:
    Is the described behavior (having to modify the Dockerfile and use hostPath volume mounts) expected, or could there be a misconfiguration or bug causing these issues? If it's a bug, I would appreciate guidance on how to proceed with a fix.

  2. Pod Deletion Stuck in Terminating State:
    Is this a known issue? Are there any recommended solutions to avoid pods getting stuck in Terminating state without using --force?

Thank you for your attention to these issues. I look forward to your insights and recommendations on these matters.

@klueska
Collaborator

klueska commented May 6, 2024

That's strange. The only reason I could see this happening is if we somehow set the PATH such that it references the host's mount binary, but with the container's LD_LIBRARY_PATH. @elezar do you have any thoughts on why this might be happening?

@elezar
Member

elezar commented May 6, 2024

The issue is that we're running the following:

	updatePathListEnvvar("PATH", filepath.Dir(nvidiaSMIPath))

which attempts to add nvidia-smi to the PATH. This will be at /driver-root/usr/bin in the container and as such when we run:

	mountExecutable, err := exec.LookPath("mount")
	if err != nil {
		return fmt.Errorf("error finding 'mount' executable: %w", err)
	}

we find /driver-root/usr/bin/mount which is the executable from the host and not in the container.
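
The lookup behavior described above can be reproduced in isolation. The following is a minimal sketch, not the plugin's code: it assumes a Unix-like system, uses two temporary directories as hypothetical stand-ins for the container's /usr/bin and for /driver-root/usr/bin, and assumes the host directory ends up ahead of the container directory in PATH. The function name lookupOrigin is made up for this illustration.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// lookupOrigin creates two temporary directories that each contain a fake
// "mount" executable, then reports which one exec.LookPath resolves before
// and after the "host" directory is placed first in PATH.
// The roles "container" and "host" are hypothetical stand-ins for
// /usr/bin and /driver-root/usr/bin.
func lookupOrigin() (before, after string, err error) {
	root, err := os.MkdirTemp("", "lookpath-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(root)

	dirs := map[string]string{}
	for _, role := range []string{"container", "host"} {
		dir := filepath.Join(root, role)
		if err := os.Mkdir(dir, 0o755); err != nil {
			return "", "", err
		}
		stub := []byte("#!/bin/sh\nexit 0\n")
		if err := os.WriteFile(filepath.Join(dir, "mount"), stub, 0o755); err != nil {
			return "", "", err
		}
		dirs[role] = dir
	}

	// Restore the real PATH when the demo finishes.
	oldPath := os.Getenv("PATH")
	defer os.Setenv("PATH", oldPath)

	resolve := func(pathValue string) (string, error) {
		if err := os.Setenv("PATH", pathValue); err != nil {
			return "", err
		}
		found, err := exec.LookPath("mount")
		if err != nil {
			return "", err
		}
		// Report only the role directory ("container" or "host").
		return filepath.Base(filepath.Dir(found)), nil
	}

	// PATH as the container image would normally have it.
	if before, err = resolve(dirs["container"]); err != nil {
		return "", "", err
	}
	// PATH after the host's driver directory has been put first (assumed order).
	hostFirst := dirs["host"] + string(os.PathListSeparator) + dirs["container"]
	if after, err = resolve(hostFirst); err != nil {
		return "", "", err
	}
	return before, after, nil
}

func main() {
	before, after, err := lookupOrigin()
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Printf("before: %s mount, after: %s mount\n", before, after)
}
```

Because exec.LookPath walks PATH in order and returns the first executable match, changing the process-wide PATH changes which binary every later lookup resolves to, not just the lookup for nvidia-smi.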

@klueska
Collaborator

klueska commented May 6, 2024

Yeah, that would do it.

@klueska
Collaborator

klueska commented May 6, 2024

Do we need to set these envvars in the plugin itself, or can they be passed to the Env of the exec.Command call when we invoke nvidia-smi?

@elezar
Member

elezar commented May 6, 2024

We shouldn't need to set it for the plugin and can pass this to exec instead.

Note that for the compute mode we can also use the NVML api directly.
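
As a rough illustration of passing the variables to the exec call instead of the plugin's environment, here is a minimal sketch. It is not the actual change that fixed this issue: the helpers envWithPrependedPath and runIsolated are hypothetical names, /bin/sh stands in for nvidia-smi, and a Unix-like system is assumed.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// envWithPrependedPath returns a copy of base in which dir is prepended to
// the PATH entry, or a PATH entry is appended if none exists.
// base itself is not modified.
func envWithPrependedPath(base []string, dir string) []string {
	out := make([]string, 0, len(base)+1)
	found := false
	for _, kv := range base {
		if strings.HasPrefix(kv, "PATH=") {
			kv = "PATH=" + dir + string(os.PathListSeparator) + strings.TrimPrefix(kv, "PATH=")
			found = true
		}
		out = append(out, kv)
	}
	if !found {
		out = append(out, "PATH="+dir)
	}
	return out
}

// runIsolated starts a child process whose environment has dir prepended to
// PATH, and reports whether the child saw dir first and whether the parent's
// own PATH changed as a side effect.
func runIsolated(dir string) (childSeesDir, parentChanged bool, err error) {
	parentBefore := os.Getenv("PATH")

	// /bin/sh is a stand-in for nvidia-smi; it prints the PATH it received.
	cmd := exec.Command("/bin/sh", "-c", `printf %s "$PATH"`)
	cmd.Env = envWithPrependedPath(os.Environ(), dir) // scoped to this child only

	out, err := cmd.Output()
	if err != nil {
		return false, false, err
	}
	childPath := string(out)
	childSeesDir = childPath == dir ||
		strings.HasPrefix(childPath, dir+string(os.PathListSeparator))
	parentChanged = os.Getenv("PATH") != parentBefore
	return childSeesDir, parentChanged, nil
}

func main() {
	childSeesDir, parentChanged, err := runIsolated("/driver-root/usr/bin")
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println("child sees driver dir first:", childSeesDir)
	fmt.Println("parent PATH changed:", parentChanged)
}
```

One detail to keep in mind with this approach: exec.Command resolves a bare command name using the parent's PATH, not cmd.Env, so the nvidia-smi binary would presumably need to be invoked by its full path (for example the nvidiaSMIPath value mentioned earlier in this thread).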

@anencore94
Author

Thanks for clarifying 👍
Does the nvidia-smi compute-policy correspond to ComputeMode in NVML?

@klueska
Collaborator

klueska commented May 7, 2024

Yes. We are in the process of getting the NVML team to update things so that we can set a compute mode on a MIG device as well.

klueska added a commit to klueska/k8s-dra-driver-gpu that referenced this issue May 7, 2024
Previously, we were setting the envvars needed for nvidia-smi in the plugin's
environment itself. This caused errors, however, when shelling out to other
binaries that didn't need these envvars set.

This change pushes the setting of the envvars into the actual exec call for
nvidia-smi instead of setting them in the plugin's environment itself.

Closes NVIDIA#106

Signed-off-by: Kevin Klues <[email protected]>
@anencore94
Author

Thanks for the fast reply. By the way, regarding Question 2 (Pod Deletion Stuck in Terminating State): is it resolved by #109? @klueska
I have hit cases where, after deleting the MPS GPU pod, I always have to restart the kubelet on the GPU worker node (systemctl restart kubelet).

yuanchen8911 pushed a commit to yuanchen8911/k8s-dra-driver that referenced this issue May 15, 2024
Previously, we were setting the envvars needed for nvidia-smi in the plugin's
environment itself. This caused errors, however, when shelling out to other
binaries that didn't need these envvars set.

This change pushes the setting of the envvars into the actual exec call for
nvidia-smi instead of setting them in the plugin's environment itself.

Closes NVIDIA#106

Signed-off-by: Kevin Klues <[email protected]>

Bump golangci/golangci-lint-action from 4 to 6

Bumps [golangci/golangci-lint-action](https://github.com/golangci/golangci-lint-action) from 4 to 6.
- [Release notes](https://github.com/golangci/golangci-lint-action/releases)
- [Commits](golangci/golangci-lint-action@v4...v6)

---
updated-dependencies:
- dependency-name: golangci/golangci-lint-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>

Add a demo svg file showing the basic DRA use

Update demo

Adjust demo svg format

Update svg format

Update the description of demo.

update

Signed-off-by: Yuan Chen <[email protected]>
Signed-off-by: Yuan Chen <[email protected]>

Update the demo svg description

Signed-off-by: Yuan Chen <[email protected]>

Update the svg demo

Signed-off-by: Yuan Chen <[email protected]>

Remove duplicated info.

Signed-off-by: Yuan Chen <[email protected]>

Clean up

Signed-off-by: Yuan Chen <[email protected]>

Add hostPID to MPS daemon template

Without this, the MPS server was not able to find its own PID via /proc/self
and was failing to start. It's unclear why this wasn't needed previously, but
it makes sense why adding hostPID would solve this.

Signed-off-by: Kevin Klues <[email protected]>

Add basic examples for Linux workstations

Signed-off-by: Yuan Chen <[email protected]>

Update README

Signed-off-by: Yuan Chen <[email protected]>

Update README

Signed-off-by: Yuan Chen <[email protected]>

Update README

Signed-off-by: Yuan Chen <[email protected]>

Update README

Signed-off-by: Yuan Chen <[email protected]>

Update README

Signed-off-by: Yuan Chen <[email protected]>

Update README

Signed-off-by: Yuan Chen <[email protected]>

Update README

Update README.md

Signed-off-by: Yuan Chen <[email protected]>

Update README.md

Signed-off-by: Yuan Chen <[email protected]>

Update README.md

Signed-off-by: Yuan Chen <[email protected]>

Update README.md

Signed-off-by: Yuan Chen <[email protected]>

Update README.md

Signed-off-by: Yuan Chen <[email protected]>

Update README.md

Signed-off-by: Yuan Chen <[email protected]>

Update README.md

Signed-off-by: Yuan Chen <[email protected]>

Update README

Signed-off-by: Yuan Chen <[email protected]>

Update README

Signed-off-by: Yuan Chen <[email protected]>

Update README

Signed-off-by: Yuan Chen <[email protected]>

Update README.md

Signed-off-by: Yuan Chen <[email protected]>

Remove the timeslicing example

Add restartPolicy to examples

Signed-off-by: Yuan Chen <[email protected]>

Update demo files

Signed-off-by: Yuan Chen <[email protected]>