Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK] Add benchmark for Spark TRowSet generation of row-based and column-based #5809

Closed
wants to merge 3 commits into from

Conversation

bowenliang123
Copy link
Contributor

@bowenliang123 bowenliang123 commented Dec 3, 2023

🔍 Description

Issue References 🔗

Subtask of #5808.

Describe Your Solution 🔧

Add performance benchmark for Spark TRowSet generation for

  • row-based TRowSet on HIVE_CLI_SERVICE_PROTOCOL_V5 and below
  • column-based TRowSet on HIVE_CLI_SERVICE_PROTOCOL_V6 and above

Types of changes 🔖

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Test Plan 🧪

Behavior Without This Pull Request ⚰️

Behavior With This Pull Request 🎉

Row-based:
image

Column-based:
image

Related Unit Tests

Added "to row set benchmark" ut in Spark Engine's RowSetSuite.


Checklists

📝 Author Self Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • This patch was not authored or co-authored using Generative Tooling

📝 Committer Pre-Merge Checklist

  • Pull request title is okay.
  • No license issues.
  • Milestone correctly set?
  • Test coverage is ok
  • Assignees are selected.
  • Minimum number of approvals
  • No changes are requested

Be nice. Be informative.

@bowenliang123 bowenliang123 changed the title [SPARK] Add benchmark ut for Spark TRowSet generation of both row-based and column-based [SPARK] Add benchmark ut for Spark TRowSet generation of row-based and column-based Dec 3, 2023
@codecov-commenter
Copy link

codecov-commenter commented Dec 3, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (79b24a7) 61.44% compared to head (cba080a) 61.29%.
Report is 5 commits behind head on master.

❗ Current head cba080a differs from pull request most recent head 51919fb. Consider uploading reports for the commit 51919fb to get more accurate results

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #5809      +/-   ##
============================================
- Coverage     61.44%   61.29%   -0.15%     
  Complexity       23       23              
============================================
  Files           608      608              
  Lines         36094    36027      -67     
  Branches       4952     4952              
============================================
- Hits          22178    22083      -95     
- Misses        11522    11560      +38     
+ Partials       2394     2384      -10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@wForget
Copy link
Member

wForget commented Dec 4, 2023

Should we create a new class separately? And you can refer to org.apache.spark.sql.ZorderCoreBenchmark.

@bowenliang123 bowenliang123 changed the title [SPARK] Add benchmark ut for Spark TRowSet generation of row-based and column-based [SPARK] Add benchmark for Spark TRowSet generation of row-based and column-based Dec 4, 2023
@bowenliang123
Copy link
Contributor Author

Should we create a new class separately? And you can refer to org.apache.spark.sql.ZorderCoreBenchmark.

Thanks for the advice. Moved to a new class RowSetBenchmark.

import org.apache.spark.sql.types._

class RowSetBenchmark extends BaseRowSetSuite {
test("to row set benchmark") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't put benchmarks into tests

Copy link
Contributor Author

@bowenliang123 bowenliang123 Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Could you have a check?
I took the existing TPCDSTableGenerateBenchmark for the example.
class TPCDSTableGenerateBenchmark extends KyuubiFunSuite with KyuubiBenchmarkBase {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would like to keep it as running via tests by setting RUN_BENCHMARK=1, just like other existed benchmarks like TPCDSTableGenerateBenchmark.

JMH for isolated benchmark testing could be introduced next time.
Having trouble in integrating JMH for Scala without official Maven plugin support , using JMH Java annotations , the proper execution entry point to run with JMH and the isolation path for JMH benchmarks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @yaooqinn . I have changed to use Spark's Benchmark tools for running and generating the benchmarks, just like the TPCDSTableGenerateBenchmark and ZorderCoreBenchmark. Could you have a look?
And thanks for the guidance from @wForget .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 to use Spark's Benchmark tools in engine modules

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could refactor this onto JMH in the follow-up PRs. These tests are not run with GA tests. This should not be a blocker issue here for evaluating the overall TRowSet generation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 to use Spark's Benchmark tools in engine modules

what's the reason/major concern

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could refactor this onto JMH in the follow-up PRs

We don't need to refactor if it's originally designed with JMH

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@github-actions github-actions bot added kind:infra license, community building, project builds, asf infra related, etc. kind:build labels Dec 5, 2023
@yaooqinn
Copy link
Member

yaooqinn commented Dec 5, 2023

And based on your screenshots in your PR desc, what are actually the control group, experimental groups?

@bowenliang123
Copy link
Contributor Author

bowenliang123 commented Dec 5, 2023

And based on your screenshots in your PR desc, what are actually the control group, experimental groups?

There is no control or experimental group in this PR. It provides a benchmark tool for evaluating both column-based and row-based rowset for the access from V5 and V6 above. In the coming-up experiments, the benchmark will be run on the base version and different improvement implementations for comparison.

Copy link
Member

@pan3793 pan3793 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I agree to adopt this benchmark framework, because

  1. it is light
  2. there are existing benchmarks based on it
  3. many Kyuubi developers are familiar with it

@bowenliang123
Copy link
Contributor Author

And we could decouple it with Spark's utils and move it to kyuubi-util module for a general light-weight benchmark kit in the future. And when it's ready to integrate JMH in Kyuubi with Maven + sbt + Scala, this benchmark toolkit is able to be removed.

Copy link
Member

@yaooqinn yaooqinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will keep my -1 as the testing purpose here is not clear

@yaooqinn
Copy link
Member

yaooqinn commented Dec 5, 2023

In the coming-up experiments

If there are a bunch of PRs, I suggest you create an umbrella, and an KPIP(discuss/vote in the dev list) is necessary for introduce a benchmarking framework.

@wForget
Copy link
Member

wForget commented Dec 5, 2023

And based on your screenshots in your PR desc, what are actually the control group, experimental groups?

As I understand, Behavior Without This Pull Request is the control group and Behavior With This Pull Request is experimental group.

I will keep my -1 as the testing purpose here is not clear

I think the testing purpose is clear, this is for benchmarking the conversion performance from rows of sql result to TRowSet.

@yaooqinn
Copy link
Member

yaooqinn commented Dec 5, 2023

@wForget

You are correct about the purpose of this PR, but the benchmark itself needs to be corrected. Technically, if we introduce the Spark benchmark tool, the first line of results in each single benchmark should be the control group as it always produces 1x for Relative.

The current test also varies the simple rule of univariate analysis for experiments. What I saw in the results is a mess, TBH.

@wForget
Copy link
Member

wForget commented Dec 5, 2023

You are correct about the purpose of this PR, but the benchmark itself needs to be corrected. Technically, if we introduce the Spark benchmark tool, the first line of results in each single benchmark should be the control group as it always produces 1x for Relative.

The current test also varies the simple rule of univariate analysis for experiments. What I saw in the results is a mess, TBH.

Got it, thank you for your explanation.

@bowenliang123
Copy link
Contributor Author

Closing this PR with no enough consensus on the purposes, the design, the changes and the approaches.

@bowenliang123 bowenliang123 deleted the rowset-benchmark branch December 6, 2023 00:34
@bowenliang123
Copy link
Contributor Author

bowenliang123 commented Dec 6, 2023

In the coming-up experiments

If there are a bunch of PRs, I suggest you create an umbrella, and an KPIP(discuss/vote in the dev list) is necessary for introduce a benchmarking framework.

I'm strongly against your comment here. First, the umbrella issue is created for the whole task list that is still extendable, Second, you did not allow me to use the test-jars of Spark for using existed benchmark kit, unintentionally or intentionally ignoring that several benchmark tests have already introduced on it . Third, you told me to raise a KPIP for such a duplicated framework from a copied implementation. I respect all your comments but I just extremely unwillingly to see every and every and every effort in resolving this problem has been deliberately disregarded and pulled back a meter back for a inch forward. I did no evil and did not violate any community code of conduct now and ever! WHY make it difficult for me !!!

@yaooqinn
Copy link
Member

yaooqinn commented Dec 6, 2023

I'm strongly against your comment here. ... I respect all your comments

Hi @bowenliang123. First thing first, calm down.

I want to clarify that I am a regular contributor/PMC member of Apache Kyuubi, just like everyone else. My comments on this PR are simply my personal opinion. I have left a veto with explanations, which also have been challenged and discussed.

I did no evil and did not violate any community code of conduct now and ever! WHY make it difficult for me !!!

I know you well in person. You and nobody else don't violate CoC in this PR.

Third, you told me to raise a KPIP for such a duplicated framework from a copied implementation.

When in doubt, if a committer thinks a change needs a KPIP, it does. --- https://kyuubi.apache.org/improvement-proposals.html

@bowenliang123 bowenliang123 restored the rowset-benchmark branch December 6, 2023 13:56
@bowenliang123 bowenliang123 reopened this Dec 6, 2023
@bowenliang123 bowenliang123 deleted the rowset-benchmark branch April 24, 2024 05:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:infra license, community building, project builds, asf infra related, etc. module:spark
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants