[SPARK] Add benchmark for Spark TRowSet generation of row-based and column-based #5809

bowenliang123 · 2023-12-03T15:47:14Z

🔍 Description

Issue References 🔗

Subtask of #5808.

Describe Your Solution 🔧

Add a performance benchmark for Spark TRowSet generation, covering:

row-based TRowSet on HIVE_CLI_SERVICE_PROTOCOL_V5 and below
column-based TRowSet on HIVE_CLI_SERVICE_PROTOCOL_V6 and above
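
For readers unfamiliar with the two layouts, here is a minimal sketch of the difference. It assumes simplified stand-in types rather than the real Thrift TRowSet classes, and every name in it (`RowSetLayoutSketch`, `toColumnBased`, `useColumnBased`) is hypothetical; it is not the Kyuubi implementation. The version argument is the N in HIVE_CLI_SERVICE_PROTOCOL_VN, not the Thrift enum ordinal.

```scala
// Simplified stand-ins for the two TRowSet layouts; NOT the Hive/Kyuubi Thrift classes.
object RowSetLayoutSketch {
  type Row = Seq[Any]

  // Row-based layout: one entry per row, each holding that row's column values.
  final case class RowBasedSet(rows: Seq[Row])

  // Column-based layout: one entry per column, each holding that column's values for all rows.
  final case class ColumnBasedSet(columns: Seq[Seq[Any]])

  def toRowBased(rows: Seq[Row]): RowBasedSet = RowBasedSet(rows)

  // Transpose the rows so that values are grouped by column index.
  def toColumnBased(rows: Seq[Row], numColumns: Int): ColumnBasedSet =
    ColumnBasedSet((0 until numColumns).map(i => rows.map(_(i))))

  // Hypothetical helper: choose the layout from the N in HIVE_CLI_SERVICE_PROTOCOL_VN,
  // following the "V5 and below" / "V6 and above" split described above.
  def useColumnBased(protocolVersionNumber: Int): Boolean = protocolVersionNumber >= 6
}
```

The benchmark in this PR measures how long each of these two conversions takes for the same input rows.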

Types of changes 🔖

Bugfix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Test Plan 🧪

Behavior Without This Pull Request ⚰️

Behavior With This Pull Request 🎉

Row-based:

Column-based:

Related Unit Tests

Added a "to row set benchmark" unit test to the Spark engine's RowSetSuite.

Checklists

📝 Author Self Checklist

My code follows the style guidelines of this project
I have performed a self-review
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
This patch was not authored or co-authored using Generative Tooling

📝 Committer Pre-Merge Checklist

Be nice. Be informative.

codecov-commenter · 2023-12-03T18:34:28Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (79b24a7) 61.44% compared to head (cba080a) 61.29%.
Report is 5 commits behind head on master.

❗ Current head cba080a differs from pull request most recent head 51919fb. Consider uploading reports for the commit 51919fb to get more accurate results

Additional details and impacted files

@@             Coverage Diff              @@
##             master    #5809      +/-   ##
============================================
- Coverage     61.44%   61.29%   -0.15%     
  Complexity       23       23              
============================================
  Files           608      608              
  Lines         36094    36027      -67     
  Branches       4952     4952              
============================================
- Hits          22178    22083      -95     
- Misses        11522    11560      +38     
+ Partials       2394     2384      -10


wForget · 2023-12-04T01:43:22Z

Should we create a new class separately? And you can refer to org.apache.spark.sql.ZorderCoreBenchmark.

bowenliang123 · 2023-12-04T02:13:11Z

Should we create a new class separately? And you can refer to org.apache.spark.sql.ZorderCoreBenchmark.

Thanks for the advice. Moved to a new class RowSetBenchmark.

yaooqinn · 2023-12-04T02:31:09Z

...-spark-sql-engine/src/test/scala/org/apache/kyuubi/engine/spark/schema/RowSetBenchmark.scala

+import org.apache.spark.sql.types._
+
+class RowSetBenchmark extends BaseRowSetSuite {
+  test("to row set benchmark") {


don't put benchmarks into tests

Updated. Could you take a look?
I used the existing TPCDSTableGenerateBenchmark as the example.
class TPCDSTableGenerateBenchmark extends KyuubiFunSuite with KyuubiBenchmarkBase {

I would like to keep running it via tests by setting RUN_BENCHMARK=1, just like other existing benchmarks such as TPCDSTableGenerateBenchmark.
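
As an illustration of the RUN_BENCHMARK=1 gating mentioned above, a minimal sketch follows. It assumes the gate is a plain environment-variable check; `BenchmarkGateSketch` and its methods are hypothetical names, and the actual check in KyuubiBenchmarkBase may differ.

```scala
object BenchmarkGateSketch {
  // Assumption: the benchmark is enabled only when RUN_BENCHMARK is set to "1".
  def benchmarkEnabled(env: Map[String, String]): Boolean =
    env.get("RUN_BENCHMARK").contains("1")

  // Runs `body` only when the gate is open; returns whether it ran.
  def runIfEnabled(env: Map[String, String])(body: => Unit): Boolean = {
    val enabled = benchmarkEnabled(env)
    if (enabled) body
    enabled
  }
}
```

In a test this could be called as `BenchmarkGateSketch.runIfEnabled(sys.env) { /* benchmark body */ }`, so that a regular test run skips the expensive body.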

JMH for isolated benchmark testing could be introduced next time.
I am having trouble integrating JMH for Scala: the lack of official Maven plugin support, the use of JMH's Java annotations, the proper execution entry point for running with JMH, and the isolation path for JMH benchmarks.

Hi @yaooqinn. I have switched to Spark's Benchmark tools for running and generating the benchmarks, just like TPCDSTableGenerateBenchmark and ZorderCoreBenchmark. Could you have a look?
And thanks to @wForget for the guidance.

-1 to use Spark's Benchmark tools in engine modules

We could refactor this onto JMH in the follow-up PRs. These benchmarks are not run with the GA tests. This should not be a blocker for evaluating the overall TRowSet generation.

-1 to use Spark's Benchmark tools in engine modules

what's the reason/major concern

We could refactor this onto JMH in the follow-up PRs

We don't need to refactor if it's originally designed with JMH

jmh-scala-benchmark-archetype

externals/kyuubi-spark-sql-engine/pom.xml

externals/kyuubi-spark-sql-engine/src/test/scala/org/apache/spark/kyuubi/TRowSetBenchmark.scala

yaooqinn · 2023-12-05T06:21:26Z

And based on the screenshots in your PR description, what are actually the control group and the experimental groups?

bowenliang123 · 2023-12-05T07:24:00Z

And based on the screenshots in your PR description, what are actually the control group and the experimental groups?

There is no control or experimental group in this PR. It provides a benchmark tool for evaluating both row-based and column-based rowsets, as accessed from V5 and below and from V6 and above respectively. In the coming-up experiments, the benchmark will be run on the base version and on different improvement implementations for comparison.

pan3793

LGTM, I agree to adopt this benchmark framework, because

it is light
there are existing benchmarks based on it
many Kyuubi developers are familiar with it

bowenliang123 · 2023-12-05T10:29:26Z

And we could decouple it from Spark's utils and move it to the kyuubi-util module as a general lightweight benchmark kit in the future. And once JMH can be integrated into Kyuubi with Maven + sbt + Scala, this benchmark toolkit can be removed.

yaooqinn

I will keep my -1 as the testing purpose here is not clear

yaooqinn · 2023-12-05T11:28:06Z

In the coming-up experiments

If there are a bunch of PRs, I suggest you create an umbrella, and a KPIP (discuss/vote on the dev list) is necessary for introducing a benchmarking framework.

wForget · 2023-12-05T12:49:23Z

And based on the screenshots in your PR description, what are actually the control group and the experimental groups?

As I understand it, Behavior Without This Pull Request is the control group and Behavior With This Pull Request is the experimental group.

I will keep my -1 as the testing purpose here is not clear

I think the testing purpose is clear: this is for benchmarking the performance of converting the rows of a SQL result to a TRowSet.

yaooqinn · 2023-12-05T13:19:09Z

@wForget

You are correct about the purpose of this PR, but the benchmark itself needs to be corrected. Technically, if we introduce the Spark benchmark tool, the first line of results in each single benchmark should be the control group as it always produces 1x for Relative.

The current test also violates the simple rule of univariate analysis for experiments. What I saw in the results is a mess, TBH.
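
To make the point about the control group concrete, here is a minimal sketch of how a Relative column can be derived. It assumes Relative is the first case's best time divided by each case's best time, which matches my understanding of Spark's Benchmark output but should be checked against its source; `RelativeSketch` and `CaseResult` are hypothetical names.

```scala
object RelativeSketch {
  // Hypothetical result type: a case name and its best observed time in milliseconds.
  final case class CaseResult(name: String, bestMs: Double)

  // Relative speed of each case against the first case, which acts as the baseline.
  // The first entry is always 1.0 by construction, so it should be the control group.
  def relativeToFirst(results: Seq[CaseResult]): Seq[(String, Double)] =
    results.headOption match {
      case None => Seq.empty
      case Some(baseline) => results.map(r => r.name -> baseline.bestMs / r.bestMs)
    }
}
```

Under this assumption, a benchmark whose cases differ from the first case in more than one factor yields Relative values that cannot be attributed to any single change, which is the univariate concern raised above.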

wForget · 2023-12-05T14:03:14Z

You are correct about the purpose of this PR, but the benchmark itself needs to be corrected. Technically, if we introduce the Spark benchmark tool, the first line of results in each single benchmark should be the control group as it always produces 1x for Relative.

The current test also violates the simple rule of univariate analysis for experiments. What I saw in the results is a mess, TBH.

Got it, thank you for your explanation.

bowenliang123 · 2023-12-05T14:21:02Z

Closing this PR as there is not enough consensus on the purposes, the design, the changes and the approaches.

bowenliang123 · 2023-12-06T00:44:54Z

In the coming-up experiments

If there are a bunch of PRs, I suggest you create an umbrella, and a KPIP (discuss/vote on the dev list) is necessary for introducing a benchmarking framework.

I'm strongly against your comment here. First, the umbrella issue has been created for the whole task list, which is still extendable. Second, you did not allow me to use Spark's test-jars for the existing benchmark kit, unintentionally or intentionally ignoring that several benchmark tests have already been built on it. Third, you told me to raise a KPIP for such a duplicated framework from a copied implementation. I respect all your comments, but I am extremely unwilling to see each and every effort to resolve this problem deliberately disregarded and pulled back a meter for every inch forward. I did no evil and did not violate any community code of conduct now and ever! WHY make it difficult for me !!!

yaooqinn · 2023-12-06T02:08:08Z

I'm strongly against your comment here. ... I respect all your comments

Hi @bowenliang123. First things first, calm down.

I want to clarify that I am a regular contributor/PMC member of Apache Kyuubi, just like everyone else. My comments on this PR are simply my personal opinion. I have left a veto with explanations, which have also been challenged and discussed.

I did no evil and did not violate any community code of conduct now and ever! WHY make it difficult for me !!!

I know you well in person. Neither you nor anyone else has violated the CoC in this PR.

Third, you told me to raise a KPIP for such a duplicated framework from a copied implementation.

When in doubt, if a committer thinks a change needs a KPIP, it does. --- https://kyuubi.apache.org/improvement-proposals.html

github-actions bot added the module:spark label Dec 3, 2023

bowenliang123 mentioned this pull request Dec 3, 2023

[Umbrella] Improvements and evaluation for TRowSet generation of Spark Engine #5808

Open

12 tasks

bowenliang123 changed the title ~~[SPARK] Add benchmark ut for Spark TRowSet generation of both row-based and column-based~~ [SPARK] Add benchmark ut for Spark TRowSet generation of row-based and column-based Dec 3, 2023

bowenliang123 requested review from yaooqinn and pan3793 December 4, 2023 00:49

bowenliang123 changed the title ~~[SPARK] Add benchmark ut for Spark TRowSet generation of row-based and column-based~~ [SPARK] Add benchmark for Spark TRowSet generation of row-based and column-based Dec 4, 2023

yaooqinn reviewed Dec 4, 2023

bowenliang123 force-pushed the rowset-benchmark branch from 089bb2b to 3619589 Compare December 4, 2023 09:42

github-actions bot added the kind:infra and kind:build labels Dec 5, 2023

bowenliang123 mentioned this pull request Dec 5, 2023

[DRAFT] [DEMO] [SPARK] Use explicit Rowset time formatters for improving RowSet generation #5815

Closed

18 tasks

pan3793 reviewed Dec 5, 2023

externals/kyuubi-spark-sql-engine/pom.xml (outdated, resolved)

pan3793 reviewed Dec 5, 2023

externals/kyuubi-spark-sql-engine/src/test/scala/org/apache/spark/kyuubi/TRowSetBenchmark.scala (outdated, resolved)

github-actions bot removed the kind:build label Dec 5, 2023

bowenliang123 force-pushed the rowset-benchmark branch from ef0d3a1 to c7cfba8 Compare December 5, 2023 08:32

pan3793 approved these changes Dec 5, 2023

yaooqinn requested changes Dec 5, 2023

bowenliang123 closed this Dec 5, 2023

bowenliang123 deleted the rowset-benchmark branch December 6, 2023 00:34

bowenliang123 restored the rowset-benchmark branch December 6, 2023 13:56

bowenliang123 reopened this Dec 6, 2023

add benchmark for Spark TRowSet generation

d2a360f

bowenliang123 force-pushed the rowset-benchmark branch from cba080a to d2a360f Compare December 9, 2023 15:28

bowenliang123 added 2 commits December 9, 2023 23:56

import

8b9aa6b

set array length to 10

51919fb

bowenliang123 closed this Apr 24, 2024

bowenliang123 deleted the rowset-benchmark branch April 24, 2024 05:18
