reclaim action uses HasPendingTasks instead of JobStarving, which may cause two jobs to reclaim from each other and lead to deadlock.
#3869
Labels
kind/bug
Description
Consider this scenario: two jobs each have 5 replicas and minAvailable set to 1. The cluster cannot run all replicas of both jobs at the same time, so at least one pod is always pending. That pending pod triggers reclaim, and the two jobs then keep reclaiming from each other, leading to deadlock.
Steps to reproduce the issue
3. Create a new job in a queue with 5 replicas and minAvailable set to 1:
Describe the results you received and expected
Deploy job-a first so that it occupies the cluster's resources, then deploy job-b. This triggers reclaim, which evicts job-a's pods to free resources for job-b. Once a job-a pod becomes pending, reclaim evicts job-b's pods for job-a, resulting in deadlock.
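The back-and-forth eviction can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Volcano's scheduler code: the `job` type, the `simulate` function, a cluster that fits exactly 5 pods, and the rule that a victim is never pushed below its minAvailable are all hypothetical simplifications. The two predicates are named after HasPendingTasks and JobStarving but are stand-ins that only compare pod counts.

```go
package main

import "fmt"

// job is a hypothetical stand-in for a scheduler job; it is not Volcano's JobInfo.
type job struct {
	name         string
	replicas     int
	minAvailable int
	running      int
}

// hasPendingTasks: the job wants to reclaim as long as any replica is not running.
func hasPendingTasks(j *job) bool { return j.running < j.replicas }

// jobStarving: the job wants to reclaim only while it is below minAvailable.
func jobStarving(j *job) bool { return j.running < j.minAvailable }

// simulate runs the given number of scheduling rounds on a cluster that fits
// exactly 5 pods and returns how many evictions happened. job-a starts with
// all 5 pods running and job-b with none, so the cluster is always full.
// Assumption: a victim is never evicted below its own minAvailable.
func simulate(wantsReclaim func(*job) bool, rounds int) int {
	a := &job{name: "job-a", replicas: 5, minAvailable: 1, running: 5}
	b := &job{name: "job-b", replicas: 5, minAvailable: 1, running: 0}
	evictions := 0
	for r := 0; r < rounds; r++ {
		// Each job gets one chance per round to reclaim a pod from the other.
		for _, pair := range [][2]*job{{a, b}, {b, a}} {
			reclaimer, victim := pair[0], pair[1]
			if wantsReclaim(reclaimer) && victim.running > victim.minAvailable {
				victim.running--
				reclaimer.running++
				evictions++
			}
		}
	}
	return evictions
}

func main() {
	fmt.Println("pending-tasks gate, evictions in 10 rounds:", simulate(hasPendingTasks, 10))
	fmt.Println("starving gate, evictions in 10 rounds:", simulate(jobStarving, 10))
}
```

In this model the pending-tasks gate keeps evicting for as long as the simulation runs, because both jobs always have at least one pending pod. The starving gate stops after job-b reaches its minAvailable of 1.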
We could change reclaim to use JobStarving instead: if a job already has enough running pods (running >= minAvailable), the reclaim action is not triggered for it.
What version of Volcano are you using?
v1.10
Any other relevant information
No response