What happened:
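The PR mentioned in the title causes pod scheduling to fail for a long time, with a large number of bind timeout errors. Binds still time out even when the extender scheduler's configured timeout is raised to 5 minutes.
What you expected to happen:
Scheduling works normally and bind does not time out.
How to reproduce it (as minimally and precisely as possible):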
Anything else we need to know?:
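Our production environment has several thousand pods pending scheduling at any given time. We are running the master branch, because we saw that many PRs were merged between 2.4.1 and master recently. The slow scheduling, stalls, and bind timeouts are caused by the sync.Mutex that was added in nodelock.go: both functions, the one that acquires the node lock and the one that releases it, hold this mutex. Please rework this PR as soon as possible and remove the lock.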
The output of nvidia-smi -a on your host
Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
The hami-device-plugin container logs
The hami-scheduler container logs
The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
Any relevant kernel output lines from dmesg
Environment:
HAMi version:
nvidia driver or other AI device driver version:
Docker version from docker version
Docker command, image and tag used
Kernel version from uname -a
Others: