Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Enhancement]Optimize volcano end-to-end scheduling large-scale pod performance #3852

Open
6 tasks
JesseStutler opened this issue Dec 2, 2024 · 12 comments
Open
6 tasks
Labels
area/performance Issues or PRs related to performance good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/feature Categorizes issue or PR as related to a new feature.

Comments

@JesseStutler
Copy link
Member

JesseStutler commented Dec 2, 2024

What is the problem you're trying to solve

Recently, I used the scripts in benchmark/sh to compare the performance of volcano and kube-scheduler under benchmark. There are several problems that I found needed to be solved

Describe the solution you'd like

  • Traffic conflict optimization. The metric called apiserver_current_inflight_requests of kube-apiserver is significantly higher when volcano schedules large-scale pods than when kube-scheduler schedules, which may cause pressure on kube-apiserver and cause network conflicts, affecting scheduling performance. After reviewing the codes, I found there two places causing the problem:
    • At the end of each session, the session will update queue status's allocated field. If the session interval is short, there will be a lot of queue status update requests. And the session will also update the nums of pods in Running/Failed/Succeed states in podgroup status. If these fields of podgroups only used in kubectl to show the nums of pods in each state, we may rewrite the logic , not to persist these nums fields in podgroups, like this pr: feature: Add podgroups statistics #3751

    Already in Progress:update pg status with applyStatus #3865

    • If we open gang plugin, gang plugin will always refresh the Scheduled condition at the end of each session(Refresh the transition ID), it's a bug need to be fixed. In the logic of isPodGroupStatusUpdated, because the nums of pod in Running state is always refreshing in scheduling, therefore !equality.Semantic.DeepEqual(newStatus, oldStatus) is true, causing the condition always will be refreshed:
      func isPodGroupStatusUpdated(newStatus, oldStatus scheduling.PodGroupStatus) bool {
      newCondition := newStatus.Conditions
      newStatus.Conditions = nil
      oldCondition := oldStatus.Conditions
      oldStatus.Conditions = nil
      return !equality.Semantic.DeepEqual(newStatus, oldStatus) || isPodGroupConditionsUpdated(newCondition, oldCondition)
      }
      We also need to integrate Unschedulable and Scheduled conditions into the same condition, and use different ConditionStatus to distinguish them. Otherwise, even if the pods in the podgroups have met the minmember, the Unschedulable condition will still remain in the conditions list of the podgroup. , which would cause ambiguity.
  • Performance bottleneck of volcano admission. During the test, I found that we choose to use deployment to create large-scale pods, volcano-admission will seriously affect the speed of kube-controller-manager to create pods. We may need to rewrite or further optimize the logic of validate and mutate.
    • validate webhook improvement( use Validating Admission Policy instead, which is already released in v1.30)
    • mutate webhook improvement( use Mutating Admission Policy instead, which will be released in v1.32)
  • Deployment pods can not use the feature of predicateCache, the annotation of volcano.sh/template-uid is only patched on vc-job pods, predicateCache will firstly judge whether the annotation exists and then use the feature of predicateCache.
  • If we stop the volcano-scheduler first, create all the pods, and then start the volcano-scheduler again, it takes a very short time to schedule all the pods in one session. But if we don't stop the volcano-scheduler, increase the session interval(like 2min), and then create all the pods immediately after the last session, all the pods can still be scheduled in one session, but it takes twice as long as stopping volcano-scheduler and starting it again. We need to figure out why.

In addition, we may need to reconstruct the logic of volcano's predicate and prioritize. Assume that there are m nodes in the cluster and n pods need to be scheduled. Currently, all m nodes need to be filtered. The time complexity is O(mn). Can we not filter all the nodes? Or after predicates some nodes, there is no need to filter all these nodes when scoring.

  • We can use heap or B-tree to quickly find the best node (but we need a uniform function to build heaps or B-trees, volcano has too many plugins, it may be complicated to find a uniform function)

Additional context

No response

@JesseStutler JesseStutler added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 2, 2024
@JesseStutler
Copy link
Member Author

/good-first-issue

@volcano-sh-bot
Copy link
Contributor

@JesseStutler:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@volcano-sh-bot volcano-sh-bot added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Dec 2, 2024
@Monokaix
Copy link
Member

Monokaix commented Dec 2, 2024

Some previous performance issus: #3502

@jxs1211
Copy link

jxs1211 commented Dec 7, 2024

Hi, newbie to volcano community, I'd like to make my first contribution to get my feet wet in the community, and I'm looking at the first one, which hopes change the logic in the UpdatePodGroup of defaultStatusUpdater to avoid update the nums status in podgroup, like the queueStatusApply did, is that right?

@jxs1211
Copy link

jxs1211 commented Dec 8, 2024

And currently PredicateCache will only be used when predicateByStablefilter fails. Can we just use PredicateCache directly when pod template is the same? (Exclude some predicates such as podAntiAffinity and topologySpread)

I didn't find the logic of PredicateCache will only be used when predicateByStablefilter fails
I can only find the checking result of the PredicateWithCache, the cache will be updated if the the template-uid doesn't exist or there is no cache for the node and pod

		predicateCacheStatus := make([]*api.Status, 0)
		if predicate.cacheEnable {
			fit, err = pCache.PredicateWithCache(node.Name, task.Pod)
			if err != nil {
				predicateCacheStatus, fit, _ = predicateByStablefilter(nodeInfo)
				pCache.UpdateCache(node.Name, task.Pod, fit)
			} else {
				if !fit {
					err = fmt.Errorf("plugin equivalence cache predicates failed")
					predicateCacheStatus = append(predicateCacheStatus, &api.Status{
						Code: api.Error, Reason: err.Error(), Plugin: CachePredicate,
					})
				}
			}
		} else {
			predicateCacheStatus, fit, _ = predicateByStablefilter(nodeInfo)
		}

		predicateStatus = append(predicateStatus, predicateCacheStatus...)
		if !fit {
			return api.NewFitErrWithStatus(task, node, predicateStatus...)
		}

Basd on the code, could you pls explain how can I use PredicateCache if the podTemplate is the same instead of checking the predicateByStablefilter fails? or help me understand it correctly.

@JesseStutler
Copy link
Member Author

Hi, newbie to volcano community, I'd like to make my first contribution to get my feet wet in the community, and I'm looking at the first one, which hopes change the logic in the UpdatePodGroup of defaultStatusUpdater to avoid update the nums status in podgroup, like the queueStatusApply did, is that right?

Yes, you can try this task! You can open a new issue and refer here when you finish your work.

@JesseStutler
Copy link
Member Author

And currently PredicateCache will only be used when predicateByStablefilter fails. Can we just use PredicateCache directly when pod template is the same? (Exclude some predicates such as podAntiAffinity and topologySpread)

Sorry it was my fault, ignore this. But prediacte only works for vc-job pod now, you can implement it to work for deployment also.

@jxs1211
Copy link

jxs1211 commented Dec 10, 2024

And currently PredicateCache will only be used when predicateByStablefilter fails. Can we just use PredicateCache directly when pod template is the same? (Exclude some predicates such as podAntiAffinity and topologySpread)

Sorry it was my fault, ignore this. But prediacte only works for vc-job pod now, you can implement it to work for deployment also.

Is there any related or similar code can help me understand how to do and what to do, like statefulset implementation or something?

@JesseStutler
Copy link
Member Author

And currently PredicateCache will only be used when predicateByStablefilter fails. Can we just use PredicateCache directly when pod template is the same? (Exclude some predicates such as podAntiAffinity and topologySpread)

Sorry it was my fault, ignore this. But prediacte only works for vc-job pod now, you can implement it to work for deployment also.

Is there any related or similar code can help me understand how to do and what to do, like statefulset implementation or something?

No, you can check how the vc-job pods "volcano.sh/template-uid" are annotated in the controller and do some refactoring to implement deployment.

@lishangyuzi
Copy link

lishangyuzi commented Dec 23, 2024

In the scenario of scheduling large-scale jobs, I also encountered a problem. When the job fails to be scheduled, all the pods under this job will update the PodCondition. Since it is necessary to communicate with the apiserver, this will take a long time.Could we consider using the multi-goroutine approach to handle this part of the logic?

for _, status := range []schedulingapi.TaskStatus{schedulingapi.Allocated, schedulingapi.Pending, schedulingapi.Pipelined} {

I've implemented it simply and adjusted the QPS (Queries Per Second) of the apiserver. This method has helped me reduce a lot of time consumption. The code implementation is as follows.

releated PR:#3921

@lowang-bh
Copy link
Member

  • volcano.sh/template-uid

I think we can use taskGroupID("%s/%s", task.Job, task.TaskRole) to unify those cases.
There is already a taskSpec (from volcano.sh/task-spec annotation) in each role-pod。 Deployment also can use this annotation。

@Monokaix
Copy link
Member

/area performance

@volcano-sh-bot volcano-sh-bot added the area/performance Issues or PRs related to performance label Dec 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/performance Issues or PRs related to performance good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

No branches or pull requests

6 participants