
This PR causes pod scheduling to hang for a very long time. https://github.com/Project-HAMi/HAMi/pull/714 #810

Open
huangjiasingle opened this issue Jan 15, 2025 · 0 comments
Labels
kind/bug Something isn't working

Comments

@huangjiasingle

What happened:
The PR mentioned in the title causes pod scheduling to stay unsuccessful for a very long time, with a large number of bind timeout errors. Even after setting the extender scheduler's timeout to 5 minutes, binds still time out.
What you expected to happen:
Scheduling proceeds normally and bind does not time out.
How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
In our production environment there are always around a few thousand pods waiting to be scheduled at any given time. We are running the master branch, because we saw that many PRs had recently been merged between v2.4.1 and master. The cause of the slow scheduling, the hangs, and the bind timeouts is that nodelock.go added a sync.Mutex, and both the lock and unlock functions acquire it. Please rework this PR as soon as possible and remove the lock.
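For illustration, here is a minimal sketch of the contention pattern being described. This is not the actual HAMi nodelock.go code; the function names, the simulated API-server latency, and the pod counts are hypothetical stand-ins. The point is that one package-level mutex held across both the lock and release paths serializes every bind request behind every other one:

```go
// Hypothetical sketch, not the real HAMi code: a single package-level mutex
// is held for the full duration of both locking and releasing a node, so
// concurrent bind requests serialize completely.
package main

import (
	"fmt"
	"sync"
	"time"
)

var mu sync.Mutex // one global mutex shared by lockNode and releaseNodeLock

// lockNode stands in for writing a "locked" annotation onto a Node object.
// The sleep models the API-server round trip performed while mu is held.
func lockNode(nodeName string) {
	mu.Lock()
	defer mu.Unlock()
	time.Sleep(10 * time.Millisecond)
	_ = nodeName
}

// releaseNodeLock stands in for removing that annotation; it contends on the
// same mutex, so releases also queue behind pending lock requests.
func releaseNodeLock(nodeName string) {
	mu.Lock()
	defer mu.Unlock()
	time.Sleep(10 * time.Millisecond)
	_ = nodeName
}

func main() {
	const pendingPods = 2000 // roughly the backlog described above
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < pendingPods; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			node := fmt.Sprintf("node-%d", id%10)
			lockNode(node)        // bind starts
			releaseNodeLock(node) // bind finishes
		}(i)
	}
	wg.Wait()
	// With everything serialized on mu, the total time is roughly
	// pendingPods * 2 * 10ms ≈ 40s, regardless of how many nodes exist.
	fmt.Printf("all binds finished after %v\n", time.Since(start))
}
```

Under this assumption the queueing delay grows linearly with the number of pending pods, which would match the observation that raising the extender timeout to 5 minutes only postpones the failure; scoping the lock per node, or removing it as requested above, would eliminate the global serialization.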

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The hami-device-plugin container logs
  • The hami-scheduler container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg

Environment:

  • HAMi version:
  • nvidia driver or other AI device driver version:
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Others:
@huangjiasingle added the kind/bug (Something isn't working) label on Jan 15, 2025