
This PR causes pod scheduling to hang for a very long time. https://github.com/Project-HAMi/HAMi/pull/714 #810

Open
huangjiasingle opened this issue Jan 15, 2025 · 0 comments
Labels
kind/bug Something isn't working

Comments

@huangjiasingle

What happened:
The PR mentioned in the title causes pod scheduling to stay unsuccessful for a very long time, with a large number of bind timeout errors. Even after setting the extender scheduler's timeout to 5 minutes, binds still time out.
What you expected to happen:
Scheduling proceeds normally and bind does not time out.
How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
In our production environment there are always around a few thousand pods waiting to be scheduled at any given time. We are running the master branch, because we saw that many PRs had recently been merged between v2.4.1 and master. The cause of the slow scheduling, the hangs, and the bind timeouts is that nodelock.go added a sync.Mutex, and both the lock and unlock functions acquire it. Please rework this PR as soon as possible and remove the lock.
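For illustration, here is a minimal sketch of the contention pattern being described. This is not the actual HAMi nodelock.go code; the function names, the simulated API-server latency, and the pod counts are hypothetical stand-ins. The point is that one package-level mutex held across both the lock and release paths serializes every bind request behind every other one:

```go
// Hypothetical sketch, not the real HAMi code: a single package-level mutex
// is held for the full duration of both locking and releasing a node, so
// concurrent bind requests serialize completely.
package main

import (
	"fmt"
	"sync"
	"time"
)

var mu sync.Mutex // one global mutex shared by lockNode and releaseNodeLock

// lockNode stands in for writing a "locked" annotation onto a Node object.
// The sleep models the API-server round trip performed while mu is held.
func lockNode(nodeName string) {
	mu.Lock()
	defer mu.Unlock()
	time.Sleep(10 * time.Millisecond)
	_ = nodeName
}

// releaseNodeLock stands in for removing that annotation; it contends on the
// same mutex, so releases also queue behind pending lock requests.
func releaseNodeLock(nodeName string) {
	mu.Lock()
	defer mu.Unlock()
	time.Sleep(10 * time.Millisecond)
	_ = nodeName
}

func main() {
	const pendingPods = 2000 // roughly the backlog described above
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < pendingPods; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			node := fmt.Sprintf("node-%d", id%10)
			lockNode(node)        // bind starts
			releaseNodeLock(node) // bind finishes
		}(i)
	}
	wg.Wait()
	// With everything serialized on mu, the total time is roughly
	// pendingPods * 2 * 10ms ≈ 40s, regardless of how many nodes exist.
	fmt.Printf("all binds finished after %v\n", time.Since(start))
}
```

Under this assumption the queueing delay grows linearly with the number of pending pods, which would match the observation that raising the extender timeout to 5 minutes only postpones the failure; scoping the lock per node, or removing it as requested above, would eliminate the global serialization.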

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The hami-device-plugin container logs
  • The hami-scheduler container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg

Environment:

  • HAMi version:
  • nvidia driver or other AI device driver version:
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Others:
@huangjiasingle added the kind/bug (Something isn't working) label on Jan 15, 2025