
After a k3s node reboot, GPU virtualization stops working and the nvidia.com/gpu count reverts to the physical GPU count; restarting the hami-device-plugin pod restores the correct nvidia.com/gpu count #829

Open
christu opened this issue Jan 23, 2025 · 3 comments
Labels
kind/bug Something isn't working

Comments

@christu

christu commented Jan 23, 2025

What happened:
After a k3s node reboot, GPU virtualization stops working and the nvidia.com/gpu count reverts to the number of physical GPUs. After restarting the hami-device-plugin pod, the nvidia.com/gpu count returns to normal.
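
For context, the advertised count can be checked from the node's allocatable resources. A minimal sketch, assuming kubectl access; node1 is a hypothetical node name:

```shell
# Show the nvidia.com/gpu capacity and allocatable as reported by the device plugin
kubectl describe node node1 | grep -i "nvidia.com/gpu"

# Or query the allocatable value directly (dots in the resource name must be escaped)
kubectl get node node1 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```

With HAMi working, this should show the virtualized count rather than the physical GPU count.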

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g. /etc/docker/daemon.json)
  • The hami-device-plugin container logs
  • The hami-scheduler container logs
  • The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg

Environment:

  • HAMi version: v2.3.13
  • nvidia driver or other AI device driver version:
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Others:
@christu christu added the kind/bug Something isn't working label Jan 23, 2025
@archlitchi
Collaborator

You probably didn't uninstall NVIDIA's device plugin beforehand.

@christu
Author

christu commented Feb 8, 2025

You probably didn't uninstall NVIDIA's device plugin beforehand.

I installed the NVIDIA GPU Operator first, then HAMi. After installing HAMi, do I need to uninstall the nvidia-device-plugin-daemonset?

@archlitchi
Collaborator

You probably didn't uninstall NVIDIA's device plugin beforehand.

I installed the NVIDIA GPU Operator first, then HAMi. After installing HAMi, do I need to uninstall the nvidia-device-plugin-daemonset?

Yes, unless you use a custom resource name.
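
For anyone hitting the same problem, a hedged sketch of the two options above. The chart keys devicePlugin.enabled (GPU Operator) and resourceName (HAMi) are assumptions; verify them against your chart version's values.yaml:

```shell
# Option 1: stop the GPU Operator from deploying its own device plugin,
# leaving only HAMi's plugin to advertise nvidia.com/gpu
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
  --set devicePlugin.enabled=false    # key name assumed; check your chart

# Option 2: keep both plugins but give HAMi's virtualized GPUs a custom
# resource name, so the two plugins no longer overwrite each other
helm upgrade hami hami-charts/hami -n kube-system \
  --set resourceName="nvidia.com/vgpu"    # key name assumed; check your chart
```

Pods requesting HAMi vGPUs would then ask for nvidia.com/vgpu (or whatever name you chose) instead of nvidia.com/gpu.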
