Problems with MPS quickstart #106
Comments
That's strange. The only reason I could see this happening is if we somehow set the PATH such that it is referencing the host binary
The issue is that we're running the following:
which attempts to add
we find
Yeah, that would do it.
Do we need to set these envvars in the plugin itself, or can they be passed to the ENV of the
We shouldn't need to set it for the plugin and can pass this to exec instead. Note that for the compute mode we can also use the NVML API directly.
Thanks for clarifying 👍
Yes. We are in the process of getting the NVML team to update things so that we can set a compute mode on a MIG device as well.
Previously, we were setting the envvars needed for nvidia-smi in the plugin's environment itself. This caused errors, however, when shelling out to other binaries that didn't need these envvars set. This change pushes the setting of the envvars into the actual exec call for nvidia-smi instead of setting them in the plugin's environment itself. Closes NVIDIA#106 Signed-off-by: Kevin Klues <[email protected]>
Previously, we were setting the envvars needed for nvidia-smi in the plugin's environment itself. This caused errors, however, when shelling out to other binaries that didn't need these envvars set. This change pushes the setting of the envvars into the actual exec call for nvidia-smi instead of setting them in the plugin's environment itself. Closes NVIDIA#106 Signed-off-by: Kevin Klues <[email protected]> Bump golangci/golangci-lint-action from 4 to 6 Bumps [golangci/golangci-lint-action](https://github.com/golangci/golangci-lint-action) from 4 to 6. - [Release notes](https://github.com/golangci/golangci-lint-action/releases) - [Commits](golangci/golangci-lint-action@v4...v6) --- updated-dependencies: - dependency-name: golangci/golangci-lint-action dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Add a demo svg file showing the baisc DRA use Update demo Adjust demo svg format Update svg format Update the description of demo. update Signed-off-by: Yuan Chen <[email protected]> Signed-off-by: Yuan Chen <[email protected]> Update tje demo svg description Signed-off-by: Yuan Chen <[email protected]> Update the svg demo Signed-off-by: Yuan Chen <[email protected]> Remove duplicated info. Signed-off-by: Yuan Chen <[email protected]> Clean up Signed-off-by: Yuan Chen <[email protected]> Add hostPID to MPS daemon template Without this, the MPS server was not able to find it's own PID via /proc/self and was failing to start. It's unclear why this wasn't needed previously, but it makes sense why adding hostPID would solve this. 
Signed-off-by: Kevin Klues <[email protected]> Add basic examples for Linux workstations Signed-off-by: Yuan Chen <[email protected]> Update README Signed-off-by: Yuan Chen <[email protected]> Update README Signed-off-by: Yuan Chen <[email protected]> Update README Signed-off-by: Yuan Chen <[email protected]> Update README Signed-off-by: Yuan Chen <[email protected]> Update README Signed-off-by: Yuan Chen <[email protected]> Update README Signed-off-by: Yuan Chen <[email protected]> Update README Update README.md Signed-off-by: Yuan Chen <[email protected]> Update README.md Signed-off-by: Yuan Chen <[email protected]> Update README.md Signed-off-by: Yuan Chen <[email protected]> Update README.md Signed-off-by: Yuan Chen <[email protected]> Update README.md Signed-off-by: Yuan Chen <[email protected]> Update README.md Signed-off-by: Yuan Chen <[email protected]> Update README.md Signed-off-by: Yuan Chen <[email protected]> Update README Signed-off-by: Yuan Chen <[email protected]> Update README Signed-off-by: Yuan Chen <[email protected]> Update README Signed-off-by: Yuan Chen <[email protected]> Update README.md Signed-off-by: Yuan Chen <[email protected]> Remove the timeslicing example Add restartPolicy to examples Signed-off-by: Yuan Chen <[email protected]> Update demo files Signed-off-by: Yuan Chen <[email protected]>
Description
During testing of the MPS-related Quickstart (using the demo script to create a kind cluster), I encountered several issues with the deployment of the MPS control daemon and with pod deletion.
Issues Encountered
1. The MPS control daemon did not deploy. The logs from the `nvidia-k8s-dra-driver-kubelet-plugin` daemonset indicated errors (attached to the original issue as screenshots). However, executing `mount` directly on the host node (via `docker exec -it k8s-dra-driver-cluster-worker bash`) succeeded. Changing the Docker image BASE_DIST from Ubuntu 20.04 to Ubuntu 22.04 (thereby updating GLIBC to version 2.37) resolved the issues with libselinux and libc but not with libmount (the version mismatch continued with `MOUNT_2_38 not found`). Eventually, manually mounting `/lib/x86_64-linux-gnu/libmount.so.1` from the host node to use the 2_38 version resolved the issue, allowing the MPS daemon and example pods to deploy correctly.
2. Pods created with `kubectl apply` often remain stuck in a Terminating state when deleted with `kubectl delete`. Forcing the deletion (`--force`) seems to resolve this temporarily, but any subsequent `kubectl apply` results in the MPS daemon deployment failing to deploy correctly.
Questions/Requests
Validation of Behavior:
Is the described behavior (modifying the Dockerfile and using hostPath volumeMounts) expected, or could there be a misconfiguration or bug causing these issues? If it is a bug, I would appreciate guidance on how to proceed with a fix.
Pod Deletion Stuck in Terminating State:
Is this a known issue? Are there any recommended ways to avoid pods getting stuck in the Terminating state without using `--force`?
Thank you for your attention to these issues. I look forward to your insights and recommendations.