Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YUNIKORN-2678] Improve FAIR queue sort algorithm #935

Closed
wants to merge 21 commits into from

Conversation

psantacl
Copy link
Contributor

@psantacl psantacl commented Aug 7, 2024

What is this PR for?

Changes to the Fair scheduling algorithm to always produce ratio's only when sorting queues for 'fair' usage.

Jira: https://issues.apache.org/jira/browse/YUNIKORN-2678

What type of PR is it?

  • - Bug Fix
  • - Improvement
  • - Feature
  • - Documentation
  • - Hot Fix
  • - Refactoring

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-2678

How should this be tested?

Screenshots (if appropriate)

Please see Jira.

  • - The licenses files need update.
  • - There is breaking changes for older versions.
  • - It needs documentation.

@psantacl psantacl changed the title YUNIKORN-2678 [YUNIKORN-2678] Description Aug 7, 2024
@psantacl psantacl changed the title [YUNIKORN-2678] Description [YUNIKORN-2678] Fair Scheduling Changes Aug 7, 2024
@craigcondit craigcondit changed the title [YUNIKORN-2678] Fair Scheduling Changes [YUNIKORN-2678] Improve FAIR queue sort algorithm Aug 7, 2024
Copy link
Contributor

@craigcondit craigcondit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial revisions to work on.

pkg/common/resources/resources.go Outdated Show resolved Hide resolved
pkg/common/resources/resources.go Outdated Show resolved Hide resolved
pkg/common/resources/resources.go Outdated Show resolved Hide resolved
pkg/common/resources/resources.go Show resolved Hide resolved
pkg/common/resources/resources.go Outdated Show resolved Hide resolved
pkg/common/resources/resources.go Outdated Show resolved Hide resolved
pkg/scheduler/objects/queue.go Outdated Show resolved Hide resolved
pkg/scheduler/objects/queue.go Show resolved Hide resolved
@@ -65,8 +65,10 @@ func sortQueuesByPriorityAndFairness(queues []*Queue) {
if lPriority < rPriority {
return false
}
comp := resources.CompUsageRatioSeparately(l.GetAllocatedResource(), l.GetGuaranteedResource(),
r.GetAllocatedResource(), r.GetGuaranteedResource())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function needs some work as long as we're rewriting it. We are doing the expensive recursive computation of resources on every comparison. Since we already have a slice of queues, it would be better to precompute a slice of resources computed on each one, and then use that in the comparison function so that we only create the resource maps once per sort run instead of once per comparison.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wilfred also pointed out in the community meeting that we may also be able to optimize the parent hierarchy separately as all the compared queues will have the same parent. If we do that, we do need to be sure that the parent is not the root.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently on vacation, but I will add this optimization when I return.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pushed. This should compute GetFairMaxResource O(n) times instead of NLog(N).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is better, but to Wilfred's point from the community meeting, check out the callers of queue.sortQueues() -- the are in queue.go under GetQueueOutstandingRequests(), TryAllocate(), TryPlaceholderAllocate() and TryReservedAllocate(). Each computes queue.getHeadRoom() at the start which is used for further processing. Since these methods are all recursive from the root queue down, you can compute the parent values at each step and then pass the results into the sortQueues() function. This avoids having to recompute all the parent queue values during sorting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with deferring that refactor until a follow-up PR. @wilfred-s what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can defer that to the next release. It is not just a question of performance but also a question about correctness. Since the queue values can change it might be the case that a parents' setting changes. If that happens between the child checks, specially when you have a larger number of children, you might not have the exact same inputs for each child computation.
The chance is low but we should try an prevent that.
The change would not be in the sorter but in the queue code when we build the fairMaxResources slice

pkg/scheduler/objects/queue.go Outdated Show resolved Hide resolved
pkg/scheduler/objects/queue.go Outdated Show resolved Hide resolved
pkg/scheduler/objects/queue.go Outdated Show resolved Hide resolved
@craigcondit
Copy link
Contributor

craigcondit commented Aug 8, 2024

@psantacl please run make lint / make test before pushing PRs. If the linter doesn't pass, the PR is rejected before the remainder of the checks run.

Copy link

codecov bot commented Aug 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.39%. Comparing base (375895b) to head (066797f).
Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #935      +/-   ##
==========================================
+ Coverage   80.33%   80.39%   +0.05%     
==========================================
  Files          97       97              
  Lines       12384    12431      +47     
==========================================
+ Hits         9949     9994      +45     
- Misses       2164     2165       +1     
- Partials      271      272       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@pbacsko pbacsko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just have a final comment, I think it's worth waiting for #943.

Copy link
Contributor

@pbacsko pbacsko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@psantacl please resolve merge conflicts

pkg/scheduler/objects/queue.go Outdated Show resolved Hide resolved
pkg/scheduler/objects/queue.go Outdated Show resolved Hide resolved
pkg/scheduler/objects/queue.go Outdated Show resolved Hide resolved
pkg/scheduler/objects/queue.go Outdated Show resolved Hide resolved
pkg/scheduler/objects/queue_test.go Show resolved Hide resolved
pkg/scheduler/objects/queue_test.go Show resolved Hide resolved
Copy link
Contributor

@craigcondit craigcondit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some clone-safety and code tweaks. Also the big optimization work yet to-do.

pkg/scheduler/objects/queue.go Outdated Show resolved Hide resolved
pkg/scheduler/objects/queue.go Outdated Show resolved Hide resolved
pkg/scheduler/objects/queue.go Outdated Show resolved Hide resolved
Copy link
Contributor

@craigcondit craigcondit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, other than the big refactor.

Copy link
Contributor

@pbacsko pbacsko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@craigcondit
Copy link
Contributor

@wilfred-s thoughts on committing this version and working on the performance refactor in a followup?

Copy link
Contributor

@wilfred-s wilfred-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Follow up for the performance and correctness change YUNIKORN-2840

Copy link
Contributor

@craigcondit craigcondit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants