
[KYUUBI #6315] Spark 3.5: MaxScanStrategy supports DSv2 #5852

Closed
wants to merge 17 commits into from

Conversation

zhaohehuhu
Contributor

@zhaohehuhu zhaohehuhu commented Dec 13, 2023

🔍 Description

Issue References 🔗

Currently, MaxScanStrategy can limit the maximum scanned file size for some data sources, such as Hive. This change enhances MaxScanStrategy to also support DataSource V2.

Describe Your Solution 🔧

Retrieve statistics about the files to be scanned through the DataSource V2 API.
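The check can be sketched roughly as follows. This is a simplified, hypothetical model only: `ScanStats` stands in for Spark's `org.apache.spark.sql.connector.read.Statistics`, and `maxScanBytes` stands in for the watchdog's configured limit; neither is the actual implementation.

```scala
// Simplified sketch of the size check MaxScanStrategy could apply to a
// DSv2 relation. ScanStats is a stand-in for Spark's DSv2 Statistics.
final case class ScanStats(sizeInBytes: BigInt)

object MaxScanCheck {
  // Throws when the estimated scan size exceeds the configured limit,
  // mirroring how the watchdog rejects oversized Hive scans.
  def assertWithinLimit(stats: ScanStats, maxScanBytes: BigInt): Unit = {
    if (stats.sizeInBytes > maxScanBytes) {
      throw new IllegalStateException(
        s"Scan size ${stats.sizeInBytes} bytes exceeds limit $maxScanBytes bytes")
    }
  }
}
```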

Types of changes 🔖

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Test Plan 🧪

Behavior Without This Pull Request ⚰️

Behavior With This Pull Request 🎉

Related Unit Tests


Checklists

📝 Author Self Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • This patch was not authored or co-authored using Generative Tooling

📝 Committer Pre-Merge Checklist

  • Pull request title is okay.
  • No license issues.
  • Milestone correctly set?
  • Test coverage is ok
  • Assignees are selected.
  • Minimum number of approvals
  • No changes are requested

Be nice. Be informative.

@pan3793
Member

pan3793 commented Dec 14, 2023

Please make sure that the Kyuubi Spark extension also works well on iceberg-free Spark runtime.

@zhaohehuhu
Contributor Author

Please make sure that the Kyuubi Spark extension also works well on iceberg-free Spark runtime.

Good point. Thanks.

@zhaohehuhu
Contributor Author

Please make sure that the Kyuubi Spark extension also works well on iceberg-free Spark runtime.

Fixed. Please review again.

@pan3793 pan3793 changed the title enable MaxScanStrategy when accessing iceberg datasource MaxScanStrategy supports DSv2 Mar 14, 2024
@codecov-commenter

codecov-commenter commented Mar 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.40%. Comparing base (67f099a) to head (3c5b0c2).
Report is 23 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #5852      +/-   ##
============================================
- Coverage     58.58%   58.40%   -0.19%     
  Complexity       24       24              
============================================
  Files           649      651       +2     
  Lines         39379    39513     +134     
  Branches       5415     5441      +26     
============================================
+ Hits          23070    23076       +6     
- Misses        13841    13955     +114     
- Partials       2468     2482      +14     

☔ View full report in Codecov by Sentry.

@pan3793 pan3793 requested a review from wForget March 15, 2024 06:46
@pan3793 pan3793 changed the title MaxScanStrategy supports DSv2 Spark 3.5. MaxScanStrategy supports DSv2 Mar 15, 2024
@pan3793 pan3793 changed the title Spark 3.5. MaxScanStrategy supports DSv2 Spark 3.5: MaxScanStrategy supports DSv2 Mar 15, 2024
@wForget
Member

wForget commented Mar 15, 2024

@zhaohehuhu Could you add a unit test?

@zhaohehuhu
Contributor Author

@zhaohehuhu Could you add a unit test?

Sure. I will add it. Thanks!

import org.apache.spark.sql.connector.read.partitioning.{KeyGroupedPartitioning, Partitioning}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class ReportStatisticsAndPartitionAwareDataSource extends SimpleWritableDataSource {
Member

Do we need to add a new data source? Is it better to use iceberg datasource directly? @pan3793 WDYT?

Member

Prefer to use a dummy DS like Spark does.
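The "dummy DS" approach might look like this in miniature. This is illustrative only: the real test source in the PR extends Spark's `SimpleWritableDataSource` and the DSv2 `SupportsReportStatistics` mixin, while the trait below merely mimics that mixin's shape.

```scala
// Miniature analogue of a test-only data source that reports fixed
// statistics, in the spirit of Spark's own dummy test sources.
trait ReportsStatistics {
  def estimatedSizeInBytes: BigInt
}

// A dummy scan that always reports the size it was constructed with,
// letting tests exercise the size-limit rule deterministically.
class DummyReportingScan(fixedSize: BigInt) extends ReportsStatistics {
  override def estimatedSizeInBytes: BigInt = fixedSize
}
```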

@zhaohehuhu
Copy link
Contributor Author

Thanks @wForget @pan3793

lazy val scanFileSize = stats.sizeInBytes
lazy val scanPartitions = relation.scan.asInstanceOf[SupportsReportPartitioning]
.outputPartitioning()
.numPartitions()
Member

numPartitions does not seem to be the number of scanned table partitions. In the Iceberg implementation, it is the size of taskGroups.

[screenshot of the Iceberg implementation omitted]

Member

I think it is the task number of the RDD/stage rather than the table's partition number. Does taskGroups in Iceberg mean the same thing?

Contributor Author

It's the input RDD partition number for the Iceberg data source. Its value may happen to equal the table's partition number, but they are not the same thing. It seems hard to get the number of scanned table partitions.

@zhaohehuhu
Contributor Author

Disabled the rule that checks maxPartitions for DSv2. @wForget
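The resulting behavior can be summarized in a small sketch (hypothetical names; the actual rule lives in Kyuubi's watchdog extension): for DSv2 relations only the size limit is enforced, since `numPartitions` reflects RDD task groups rather than table partitions.

```scala
// Hypothetical summary of which limits apply per relation kind: the
// partition-count limit is only meaningful for V1 sources, where the
// planner knows the real table partition count.
sealed trait RelationKind
case object DataSourceV1 extends RelationKind
case object DataSourceV2 extends RelationKind

object WatchdogLimits {
  def enforcedLimits(kind: RelationKind): Set[String] = kind match {
    case DataSourceV1 => Set("maxFileSize", "maxPartitions")
    case DataSourceV2 => Set("maxFileSize") // partition check disabled
  }
}
```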

@zhaohehuhu zhaohehuhu requested review from pan3793 and wForget April 9, 2024 04:37
@wForget wForget left a comment

Member

Thanks, LGTM

@wForget wForget added this to the v1.9.1 milestone Apr 10, 2024
@pan3793 pan3793 changed the title Spark 3.5: MaxScanStrategy supports DSv2 [KYUUBI #6315] Spark 3.5: MaxScanStrategy supports DSv2 Apr 17, 2024
@pan3793 pan3793 closed this in 8edcb00 Apr 17, 2024
pan3793 pushed a commit that referenced this pull request Apr 17, 2024
# 🔍 Description
## Issue References 🔗

Currently, MaxScanStrategy can limit the maximum scanned file size for some data sources, such as Hive. This change enhances MaxScanStrategy to also support DataSource V2.
## Describe Your Solution 🔧

Retrieve statistics about the files to be scanned through the DataSource V2 API.

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

#### Behavior Without This Pull Request ⚰️

#### Behavior With This Pull Request 🎉

#### Related Unit Tests

---

# Checklists
## 📝 Author Self Checklist

- [x] My code follows the [style guidelines](https://kyuubi.readthedocs.io/en/master/contributing/code/style.html) of this project
- [x] I have performed a self-review
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] New and existing unit tests pass locally with my changes
- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

## 📝 Committer Pre-Merge Checklist

- [x] Pull request title is okay.
- [x] No license issues.
- [x] Milestone correctly set?
- [x] Test coverage is ok
- [x] Assignees are selected.
- [ ] Minimum number of approvals
- [ ] No changes are requested

**Be nice. Be informative.**

Closes #5852 from zhaohehuhu/dev-1213.

Closes #6315

3c5b0c2 [hezhao2] reformat
fb113d6 [hezhao2] disable the rule that checks the maxPartitions for dsv2
acc3587 [hezhao2] disable the rule that checks the maxPartitions for dsv2
c8399a0 [hezhao2] fix header
70c845b [hezhao2] add UTs
3a07396 [hezhao2] add ut
4d26ce1 [hezhao2] reformat
f87cb07 [hezhao2] reformat
b307022 [hezhao2] move code to Spark 3.5
73258c2 [hezhao2] fix unused import
cf893a0 [hezhao2] drop reflection for loading iceberg class
dc128bc [hezhao2] refactor code
661834c [hezhao2] revert code
6061f42 [hezhao2] delete IcebergSparkPlanHelper
5f1c3c0 [hezhao2] fix
b15652f [hezhao2] remove iceberg dependency
fe620ca [hezhao2] enable MaxScanStrategy when accessing iceberg datasource

Authored-by: hezhao2 <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
(cherry picked from commit 8edcb00)
Signed-off-by: Cheng Pan <[email protected]>
@pan3793
Copy link
Member

pan3793 commented Apr 17, 2024

Thanks, merged to master/1.9
