multi goroutine deal taskUnschedulable #3921
Conversation
Welcome @lishangyuzi!
/assign @lowang-bh
Have you increased the QPS of the kube client in the Volcano scheduler?
The default QPS of the kube client already meets my expectations. It takes approximately 200 seconds for a job with 5000 pods to complete this stage.
volcano/cmd/scheduler/app/options/options.go Lines 127 to 128 in b169623
volcano/cmd/scheduler/app/options/options.go Lines 40 to 41 in b169623
The QPS-related parameters of my API server are as follows:
--max-mutating-requests-inflight=4000
--max-requests-inflight=2000
--watch-cache-sizes=node#2000,pod#10000
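As a rough sanity check on where the 200 seconds could come from, client-side rate limiting alone bounds the throughput of N sequential API calls. A minimal illustration (not Volcano code; the QPS and burst numbers below are placeholders, not Volcano's actual defaults):

```go
package main

import "fmt"

// estimateSeconds gives a rough lower bound on the wall-clock time
// needed to issue numCalls API requests through a client-side rate
// limiter: the burst is served immediately, the remainder is
// throttled to qps requests per second. Purely illustrative.
func estimateSeconds(numCalls int, qps float64, burst int) float64 {
	if numCalls <= burst {
		return 0
	}
	return float64(numCalls-burst) / qps
}

func main() {
	// Hypothetical numbers: 5000 pod-condition updates through a
	// limiter configured with qps=50, burst=100.
	fmt.Printf("%.1fs\n", estimateSeconds(5000, 50, 100)) // 98.0s
}
```

If the client limiter is generous (as reported above), the bottleneck shifts to round-trip latency and apiserver-side inflight limits, which is why parallelizing the updates helps.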
pkg/scheduler/cache/cache.go
Outdated
		"reason", reason, "message", msg)
	}
	wg.Add(1)
	semaphore <- struct{}{}
It might be better to use a goroutine pool to update those tasks concurrently.
Okay, I've changed this to use a goroutine pool.
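For readers following the thread, the semaphore-plus-WaitGroup pattern from the diff above can be sketched in isolation like this (the `string` task type and `updateConcurrently` helper are illustrative stand-ins, not Volcano code):

```go
package main

import (
	"fmt"
	"sync"
)

// updateConcurrently runs update over all tasks with at most
// workerNum goroutines in flight: a buffered channel acts as a
// counting semaphore, and a WaitGroup waits for completion.
func updateConcurrently(tasks []string, workerNum int, update func(string)) {
	var wg sync.WaitGroup
	semaphore := make(chan struct{}, workerNum)
	for _, t := range tasks {
		wg.Add(1)
		semaphore <- struct{}{} // blocks once workerNum goroutines are busy
		go func(task string) {
			defer wg.Done()
			defer func() { <-semaphore }() // release the slot
			update(task)
		}(t)
	}
	wg.Wait()
}

func main() {
	var mu sync.Mutex
	done := 0
	updateConcurrently(make([]string, 100), 16, func(string) {
		mu.Lock()
		done++
		mu.Unlock()
	})
	fmt.Println(done) // 100
}
```

The per-task update (here a counter increment) would be the apiserver call in the real code; the semaphore caps how many such calls are in flight at once.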
/ok-to-test
There is a concern: will this cause conflict errors?
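On the conflict concern: concurrent status updates against the apiserver can race on resourceVersion and fail with 409 Conflict; the usual remedy is a retry-on-conflict loop, as in client-go's `retry.RetryOnConflict`. A pure-stdlib sketch of that pattern (`errConflict` and `retryOnConflict` are illustrative stand-ins; real code would check `apierrors.IsConflict`):

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for an apiserver 409 Conflict error.
var errConflict = errors.New("conflict")

// retryOnConflict re-runs fn up to maxRetries times while it keeps
// failing with a conflict, returning the first non-conflict result.
func retryOnConflict(maxRetries int, fn func() error) error {
	var err error
	for i := 0; i < maxRetries; i++ {
		err = fn()
		if err == nil || !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

func main() {
	attempts := 0
	err := retryOnConflict(5, func() error {
		attempts++
		if attempts < 3 {
			return errConflict // simulate two stale writes
		}
		return nil
	})
	fmt.Println(err, attempts) // <nil> 3
}
```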
/area performance
pkg/scheduler/cache/cache.go
Outdated
for _, taskInfo := range job.TaskStatusIndex[status] {
statusTasks := job.TaskStatusIndex[status]
workerNum := 16
taskInfos := make([]*schedulingapi.TaskInfo, 0)
Please set the slice's capacity to the length of statusTasks.
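The point of the review comment: a slice created with `make(..., 0)` reallocates and copies its backing array repeatedly as it grows, while pre-sizing the capacity to `len(statusTasks)` needs one allocation. A minimal illustration (using `int` in place of `TaskInfo`):

```go
package main

import "fmt"

func main() {
	statusTasks := make([]int, 5000) // stand-in for job.TaskStatusIndex[status]

	// Capacity set up front: one allocation, and append never has to
	// grow or copy the backing array.
	taskInfos := make([]*int, 0, len(statusTasks))
	for i := range statusTasks {
		taskInfos = append(taskInfos, &statusTasks[i])
	}
	fmt.Println(len(taskInfos), cap(taskInfos)) // 5000 5000
}
```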
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
/lgtm
I merged branch 'master' into my branch, and the lgtm label was removed. @JesseStutler
Please squash your commits into one :)
done @JesseStutler
pkg/scheduler/cache/cache.go
Outdated
@@ -1486,22 +1486,32 @@ func (sc *SchedulerCache) RecordJobStatusEvent(job *schedulingapi.JobInfo, updat
}
// Update podCondition for tasks Allocated and Pending before job discarded
for _, status := range []schedulingapi.TaskStatus{schedulingapi.Allocated, schedulingapi.Pending, schedulingapi.Pipelined} {
for _, taskInfo := range job.TaskStatusIndex[status] {
statusTasks := job.TaskStatusIndex[status]
workerNum := 16
We can define a const like jobUpdaterWorker = 16.
When the workqueue.ParallelizeUntil function is used elsewhere in the codebase, a const is defined in some places, while in others it isn't.
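For reference, `workqueue.ParallelizeUntil(ctx, workers, pieces, doWorkPiece)` from client-go fans index-based work out to a fixed number of worker goroutines. A pure-stdlib sketch of the same shape, using the `jobUpdaterWorker` constant suggested above (the sketch omits the context cancellation the real helper supports):

```go
package main

import (
	"fmt"
	"sync"
)

const jobUpdaterWorker = 16 // the named constant suggested in review

// parallelizeUntil mimics the shape of client-go's
// workqueue.ParallelizeUntil: workers goroutines drain the indices
// 0..pieces-1 from a shared channel and run doWork on each.
func parallelizeUntil(workers, pieces int, doWork func(piece int)) {
	toProcess := make(chan int, pieces)
	for i := 0; i < pieces; i++ {
		toProcess <- i
	}
	close(toProcess)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for piece := range toProcess {
				doWork(piece)
			}
		}()
	}
	wg.Wait()
}

func main() {
	var mu sync.Mutex
	sum := 0
	parallelizeUntil(jobUpdaterWorker, 100, func(piece int) {
		mu.Lock()
		sum += piece
		mu.Unlock()
	})
	fmt.Println(sum) // 4950
}
```

In the PR, each piece would index into the task slice and issue one pod-condition update.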
I mean define a const and reference it here :)
done
Signed-off-by: 张建宇 <[email protected]>
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: Monokaix. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
/lgtm
In the scenario of scheduling large-scale jobs, I also encountered a problem: when a job fails to be scheduled, all the pods under this job update their PodCondition. Since this requires communicating with the apiserver, it takes a long time. Could we consider using a multi-goroutine approach to handle this part of the logic?