[YUNIKORN-2037] Testing the throughput of Yunikorn #367

Closed
wants to merge 2 commits into from

wusamzong
Contributor

What is this PR for?

This PR documents the results of KWOK-based throughput tests. The tools and examples used are submitted in separate PRs. Once the tools PR is accepted, a link to the tool will be added to this document, along with tutorials on building the KWOK environment.
PR links: lightweight monitor tool, example & KWOK setup tool

Todos

  • [*] - Task

What is the Jira issue?

How should this be tested?

Screenshots (if appropriate)

Questions:

  • - The license files need an update.
  • - There are breaking changes for older versions.
  • - It needs documentation.

@wusamzong wusamzong self-assigned this Nov 14, 2023
@pbacsko pbacsko self-requested a review November 19, 2023 22:14
Contributor

@pbacsko pbacsko left a comment

@wusamzong could you rename the JIRA and the PR to something like "Testing the performance of Yunikorn" or "Testing the throughput of Yunikorn"? The phrase "performance of throughput" sounds odd and does not make sense.

@wusamzong wusamzong changed the title from "[YUNIKORN-2037] Testing the performance of Throughput" to "[YUNIKORN-2037] Testing the throughput of Yunikorn" Nov 20, 2023
@wusamzong
Contributor Author

> @wusamzong could you rename the JIRA and the PR to something like "Testing the performance of Yunikorn" or "Testing the throughput of Yunikorn"? The phrase "performance of throughput" sounds odd and does not make sense.

Sure! Thank you for your correction

@imliuda

imliuda commented Dec 7, 2023

I wonder what the performance is with 100 applications, 500 applications, and so on.

@pbacsko
Contributor

pbacsko commented Dec 7, 2023

> I wonder what the performance is with 100 applications, 500 applications, and so on.

Throughput definitely depends on the apps/pods ratio, the queue hierarchy, etc. Things like failed headroom checks and pending pods slow things down.
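To make the apps/pods ratio concrete, here is a minimal sketch of how a set of test cases could be laid out so that only the ratio changes between runs. The function name `split_pods` and the fixed total of 5000 pods are assumptions for illustration only; they are not taken from the test scripts in this PR.

```python
def split_pods(total_pods, app_counts):
    """Return (apps, pods_per_app) pairs that keep the total pod count fixed."""
    matrix = []
    for apps in app_counts:
        # Reject splits that would change the total number of pods.
        if total_pods % apps != 0:
            raise ValueError(
                f"{total_pods} pods do not divide evenly across {apps} apps"
            )
        matrix.append((apps, total_pods // apps))
    return matrix


# Hypothetical test matrix: same total load, different apps/pods ratios.
cases = split_pods(5000, [5, 100, 500])
for apps, pods in cases:
    print(f"{apps} applications x {pods} pods")
```

Keeping the total constant means any throughput difference between cases can be attributed to the ratio rather than to the overall load.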

Contributor

@zhuqi-lucas zhuqi-lucas left a comment

@wusamzong
One question about plugin mode: how does plugin mode with preEnqueue compare to plugin mode without it? Note that preEnqueue only applies to newer k8s versions.

@wusamzong
Contributor Author

@zhuqi-lucas Sorry for the delayed response. In my experiment, I observed no significant difference between plugin mode with preEnqueue and plugin mode without it. The experiment ran on Kubernetes v1.27.8 and YuniKorn 1.4.0, deploying 5 applications with 1000 pods each. Throughput with PreEnqueue was 50 pods/sec, while throughput without PreEnqueue was marginally higher at 50.01 pods/sec. Attached are flame graphs comparing YuniKorn in standard mode, plugin mode with PreEnqueue, and plugin mode without PreEnqueue.
[Flame graphs: StandardMode, withPrequeue, withoutPrequeue]
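For readers who want to reproduce a pods/sec figure like the ones quoted above, here is a minimal sketch of one way to compute it from per-pod scheduling timestamps. It assumes throughput is defined as the number of pods divided by the time between the first and last scheduling event; the thread does not state how the reported numbers were calculated, and the function name and sample data are hypothetical.

```python
from datetime import datetime, timedelta


def throughput_pods_per_sec(scheduled_at):
    """Pods scheduled per second between the first and last scheduling time."""
    if len(scheduled_at) < 2:
        raise ValueError("need at least two timestamps")
    ordered = sorted(scheduled_at)
    elapsed = (ordered[-1] - ordered[0]).total_seconds()
    if elapsed <= 0:
        raise ValueError("timestamps span no time")
    # Assumed definition: total pods divided by the scheduling window.
    return len(ordered) / elapsed


# Hypothetical data: 5 apps x 1000 pods, one pod scheduled every 20 ms.
start = datetime(2023, 12, 1, 12, 0, 0)
timestamps = [start + timedelta(milliseconds=20 * i) for i in range(5000)]
rate = throughput_pods_per_sec(timestamps)
print(f"{rate:.2f} pods/sec")
```

In a real run the timestamps would come from the cluster (for example, the time each pod was bound to a node) rather than being generated.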

@zhuqi-lucas
Contributor

Thanks @wusamzong for the info, it makes sense to me. We use preEnqueue to improve the autoscaling case: when a pod is made unschedulable, the preEnqueue plugin lets it skip the scheduling cycle. Our performance test case does not reach the autoscaling part, so this is enough for this case.

@pbacsko
Contributor

pbacsko commented Dec 31, 2023

@wusamzong could you upload the binary files so I can examine them? Thanks.

@wusamzong
Contributor Author

> Thanks @wusamzong for the info, it makes sense to me. We use preEnqueue to improve the autoscaling case: when a pod is made unschedulable, the preEnqueue plugin lets it skip the scheduling cycle. Our performance test case does not reach the autoscaling part, so this is enough for this case.

Okay, thank you for clarifying the outcomes of this experiment. I now have a much better understanding of the plugin aspect.

@pbacsko
Contributor

pbacsko commented Jan 2, 2024

> @pbacsko Sure! Here are the pprof results: standard-mode.samples.cpu.001.pb.gz, standard-mode.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz, plugin-mode.samples.cpu.001.pb.gz, plugin-mode.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz

Thanks. Standard mode looks pretty solid. I haven't seen anything that stands out and needs immediate addressing.

@wilfred-s
Contributor

Is there anything that is left to do on this or should we add this as is to show the current state?

@wusamzong
Contributor Author

I'll make an effort to upload the remaining test cases as soon as possible!

@pbacsko
Contributor

pbacsko commented Jan 26, 2024

@wusamzong is there a docs update which explains how to run these tests? It's OK to see the results, but what if I want to run this locally on my machine?

@wusamzong
Contributor Author

Certainly! I plan to submit a PR to k8shim that adds the scripts used for the performance test under yunikorn-k8shim/deployments/kwok-perf-test. I'll also include a README with instructions on how to use these scripts.

@pbacsko
Contributor

pbacsko commented Jan 26, 2024

Ok, thanks, this PR looks good.

Contributor

@pbacsko pbacsko left a comment

+1

github-actions bot pushed a commit to chia7712/yunikorn-site that referenced this pull request Jan 27, 2024
chenyulin0719 added a commit to chenyulin0719/yunikorn-site that referenced this pull request Jan 29, 2024
[YUNIKORN-2037] Document performance using kwok (apache#367)
5 participants