Infinite preemption loop in fair sharing #3779
@yuvalaz99 thank you for reporting the issue with the detailed scenario. Let me ask some questions to triage it. I could probably look into this more closely by EOW, or next week. There are a couple of features used here: HierarchicalCohorts, Deployments, borrowWithinCohort. Do you have some evidence for which one is the actual culprit? Do you have a temporary workaround for the issue? Would you like to work on the fix?
Thank you for the quick response :)

Cohort Structure
I see, thank you for the update. It feels like the next step will be to replicate this, preferably in integration tests, and debug where the culprit is.
/assign @vladikkuzn |
cc @gabesaba |
Hi @yuvalaz99, I was unable to reproduce the issue. Were there any additional options/feature gates you configured when running Kueue? Are you still able to reproduce this issue using Kueue 0.10.0?

Setup:

Steps: I applied the configurations you provided, in order. I observed different preemptions: best-effort-tenant-b-2 when guaranteed-1 was created, and best-effort-tenant-b-1 when guaranteed-2 was created. guaranteed-1 and guaranteed-2 were admitted and scheduled successfully, with best-effort-tenant-b-1 and best-effort-tenant-b-2 remaining gated post preemption.

To match your description more closely, I also created the best-effort workloads sequentially (since we sort preemption candidates based on admission timestamp) [a1, b1, b2, a2], to reproduce the preemptions of a2 and b2 you mentioned. In this case, guaranteed-1 and guaranteed-2 were still able to admit and schedule.
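(For readers following along: a minimal sketch of how one of the best-effort pods in a reproduction like this might be submitted through the pod integration. The queue name, namespace, image, and CPU request are assumptions, not the exact manifests used above.)

```yaml
# Illustrative only: queue name, namespace, image, and request are assumptions.
# With the pod integration enabled, Kueue's webhook adds the
# kueue.x-k8s.io/admission scheduling gate to pods carrying the queue-name
# label, so the pod stays gated until it is (re)admitted.
apiVersion: v1
kind: Pod
metadata:
  name: best-effort-tenant-b-1
  namespace: tenant-b
  labels:
    kueue.x-k8s.io/queue-name: best-effort-tenant-b
spec:
  containers:
  - name: main
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: 250m
```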
Hi @gabesaba,

The core problem appears to occur when the preempting ClusterQueue has a higher weighted share than the target ClusterQueue: the preempted workload gets admitted again to maintain fairness in the system, and is then preempted once more. I'll provide a minimal configuration to reproduce this issue.

Cohort Structure

Event Sequence Leading to the Issue

GuaranteedTenantA1 (250m) admitted
How to reproduce the issue:
Expected pods status
Our Kueue configuration:
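(The referenced manifests are collapsed in the rendered issue. Purely as an illustration of the kind of setup being described — a flat cohort with fair sharing enabled, a guaranteed ClusterQueue holding the nominal quota, and zero-quota best-effort ClusterQueues — a sketch might look like the following. Flavor names, quotas, weights, and thresholds are assumptions, not the reporter's actual configuration.)

```yaml
# Illustrative sketch only -- not the reporter's collapsed configuration.

# Fragment of the kueue-manager configuration: fair sharing and the pod
# integration enabled.
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
fairSharing:
  enable: true
integrations:
  frameworks:
  - "pod"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
# Guaranteed queue: holds all of the cohort's nominal quota and may preempt
# lower-priority workloads in the cohort, including while borrowing.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: guaranteed
spec:
  cohort: root
  namespaceSelector: {}   # match all namespaces
  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: LowerPriority
    borrowWithinCohort:
      policy: LowerPriority
  resourceGroups:
  - coveredResources: ["cpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 300m
---
# Best-effort queue: no nominal quota of its own, so it only borrows unused
# cohort capacity. best-effort-tenant-b would be defined the same way.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: best-effort-tenant-a
spec:
  cohort: root
  namespaceSelector: {}   # match all namespaces
  fairSharing:
    weight: "1"
  resourceGroups:
  - coveredResources: ["cpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: "0"
```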
Thanks for the additional info, @yuvalaz99. I'll take a look this week into this more generalized issue you described. I want to note: in the current state, Hierarchical Cohorts and Fair Sharing are incompatible - their usage together results in undefined behavior. We are working on making Fair Sharing and Hierarchical Cohorts compatible now. That may be a culprit in the initial issue you described, but not the latest, as the cohort structure in your latest issue description is flat.
@gabesaba Thank you for your response and for looking into this! Please note that the new information shows this unexpected behavior also occurs when hierarchical cohorts are not used. If you need any more information, I'll be happy to help :)
I was able to reproduce the issue - though it resulted not in an infinite preemption loop, but in some unnecessary preemptions (
We're reproducing the bug in an integration test (#4030). Afterwards, we'll implement a fix.
@yuvalaz99, could you please update the issue title to reflect this more accurately? Something like "Preemption Loop in Fair Sharing".
I'm glad you were able to reproduce it. Thanks :)
The root cause is the below-threshold logic in kueue/pkg/scheduler/preemption/preemption.go, lines 378 to 381 (at commit d02a764).
When doing fair preemptions, even though the best-effort queue + workload has a lower DominantResourceShare than the guaranteed queue + workload, this logic is short-circuited by belowThreshold, causing Kueue to preempt anyway. Then, when we get to scheduling, the best-effort queue has a lower DominantResourceShare than guaranteed-1, so it is scheduled. Then the loop occurs. I will think about how to fix this. In the meantime, can you try removing
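(For context, the below-threshold path is only taken when the preempting ClusterQueue enables priority-based borrowing. A sketch of the relevant preemption stanza follows; the threshold value and names are illustrative assumptions.)

```yaml
# Illustrative ClusterQueue -- threshold value and names are assumptions.
# borrowWithinCohort lets this queue preempt cohort workloads whose priority
# is at or below maxPriorityThreshold even while it is borrowing; per the
# explanation above, that check short-circuits the fair-sharing
# DominantResourceShare comparison, enabling the preempt/re-admit loop.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: guaranteed
spec:
  cohort: root
  namespaceSelector: {}
  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: LowerPriority
    borrowWithinCohort:
      policy: LowerPriority
      maxPriorityThreshold: 100   # workloads at or below this priority may be preempted while borrowing
  resourceGroups:
  - coveredResources: ["cpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 300m
```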
@gabesaba Thanks for your reply :) Currently, we are required to use this parameter to meet our business requirements, so we won't be able to remove it.
/assign |
@yuvalaz99, please take a look at #4165, and see if you think it resolves your issue. Please note the behavior change that this introduces: was this preemption below priority, while ignoring the fair sharing value, something that you were relying on for your business requirements?
What happened:
An infinite preemption loop occurs in a hierarchical cohort scenario with Kueue's Pod integration.
When a higher-priority workload preempts a lower-priority one from a different queue, the Deployment controller recreates the preempted workload, causing it to be readmitted.
Once readmitted, the preemption occurs again, triggering this infinite loop.
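(As an illustration of this recreation behavior, a best-effort Deployment whose pods are managed through the pod integration might look like the sketch below; when Kueue preempts one of these pods, the Deployment's ReplicaSet immediately creates a replacement, which re-enters the queue. Names, namespace, image, and request size are assumptions, not the actual workloads.)

```yaml
# Illustrative only: names, namespace, image, and CPU request are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: best-effort-tenant-a
  namespace: tenant-a
spec:
  replicas: 2
  selector:
    matchLabels:
      app: best-effort-tenant-a
  template:
    metadata:
      labels:
        app: best-effort-tenant-a
        # Pods created from this template are managed by Kueue's pod integration.
        kueue.x-k8s.io/queue-name: best-effort-tenant-a
    spec:
      containers:
      - name: main
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: 250m
```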
I'll show an edge case where this infinite preemption behavior occurs. Not all preemption operations using hierarchical cohorts behave in this way.
Cohort Structure
--- Cohort Root (NM: 0m)
-------- Cohort Guaranteed (NM: 300m)
------------- ClusterQueue Guaranteed (NM: 0m)
-------- ClusterQueue BestEffortTenantA (NM: 0m)
-------- ClusterQueue BestEffortTenantB (NM: 0m)
Event Sequence Leading to the Issue
Pods status
What you expected to happen:
Preemption should complete successfully without entering an infinite loop.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
kueue-manager logs
Environment:
- OS (e.g: `cat /etc/os-release`):
- Kernel (e.g. `uname -a`):