[SPARK-48177][BUILD] Upgrade Apache Parquet to 1.14.1 #46447
Conversation
Yay, finally.
Please run the following and attach the updated dependency file, @Fokko .
dev/test-dependencies.sh --replace-manifest
cc @cloud-fan , @HyukjinKwon , @mridulm , @sunchao , @yaooqinn , @LuciferYang , @steveloughran , @viirya , @huaxin , @parthchandra , too.
Oh, it seems that there exist many unit test failures.
[info] *** 189 TESTS FAILED ***
[error] Failed: Total 1526, Failed 189, Errors 0, Passed 1337, Ignored 597
[error] Failed tests:
[error] org.apache.spark.sql.hive.execution.SQLQuerySuite
[error] org.apache.spark.sql.hive.execution.HiveResolutionSuite
[error] org.apache.spark.sql.hive.execution.HiveDDLSuite
[error] org.apache.spark.sql.hive.execution.HiveQuerySuite
[error] org.apache.spark.sql.hive.execution.SQLQuerySuiteAE
[error] org.apache.spark.sql.hive.execution.HiveSQLViewSuite
[error] org.apache.spark.sql.hive.execution.HashUDAQuerySuite
[error] org.apache.spark.sql.hive.execution.PruneHiveTablePartitionsSuite
[error] org.apache.spark.sql.hive.execution.HiveUDAFSuite
[error] org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite
[error] org.apache.spark.sql.hive.execution.HiveTableScanSuite
[error] org.apache.spark.sql.hive.execution.HashAggregationQueryWithControlledFallbackSuite
[error] org.apache.spark.sql.hive.execution.HiveCommandSuite
[error] org.apache.spark.sql.hive.execution.HashUDAQueryWithControlledFallbackSuite
[error] org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite
[error] org.apache.spark.sql.hive.execution.HiveUDFSuite
[error] org.apache.spark.sql.hive.HiveSparkSubmitSuite
[error] org.apache.spark.sql.hive.execution.HashAggregationQuerySuite
[error] (hive / Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 1448 s (24:08), completed May 7, 2024, 9:07:49 PM
For example,
- SPARK-6851: Self-joined converted parquet tables *** FAILED *** (4 seconds, 473 milliseconds)
[info] java.util.concurrent.ExecutionException: org.apache.spark.SparkException:
[FAILED_READ_FILE.NO_HINT] Encountered error while reading file
file:///home/runner/work/spark/spark/target/tmp/warehouse-75fc0262-e914-40da-98bf-ad2460270fb5/orders/state=CA/month=20151/part-00000-d46019ae-951c-4974-96da-2b38ade7b49e.c000.snappy.parquet. SQLSTATE: KD001
Oh, it seems that it's wrong. FYI, this PR is supposed to have two files:
Thanks for pointing that out, @dongjoon-hyun. I've fixed it right away 👍
I have to look into the tests 👀
I think the …
Thanks for digging into this @rshkv, let's follow up on the Parquet side.
Changed the title: Upgrade Apache Parquet to 1.14.0 → Upgrade Apache Parquet to 1.14.1
Apache Parquet 1.14.1 has been released, thanks @wgtmac 🙌
Could you make CI happy, @Fokko ?
1.14.1 still seems to have this error in writing statistics. Does this indicate an incompatibility?
@LuciferYang Could you please check the test case? It seems to be writing …
This is an existing test case in Spark; this error does not occur when using version 1.13.x.
Yes, I know that. The exception is thrown when building size statistics, which is a new feature and has caught similar issues in the test cases of parquet-mr. So I'd suggest checking whether the existing test violates the rule of …
These lines are suspicious: Lines 501 to 505 in 05c87e5
If …
@wgtmac Thank you for your explanation; it seems you are correct. Should line 505 be changed from `val definitionLevels = inputValues.map(v => if (v == null) 0 else 1)` to `val definitionLevels = inputValues.map(v => if (v == null) 0 else maxDef)`? I manually tested it, and this way …
Yes, that change looks reasonable. Thanks for the verification! @LuciferYang
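The fix above can be illustrated with a minimal sketch. This is a Python model of the idea, not the parquet-mr API or the actual Spark test helper: in Parquet's Dremel encoding, a present leaf value must be written with the column's maximum definition level (`max_def`), while the hard-coded `1` in the test helper only happens to be correct when `max_def == 1`. For simplicity this sketch assumes a flat nullable column, where a null row gets level 0.

```python
# Correct behavior (sketch): a non-null value carries exactly max_def.
def definition_levels(values, max_def):
    return [0 if v is None else max_def for v in values]

# Buggy variant mirroring the old test helper: hard-codes 1 for
# non-null values, which is wrong whenever max_def > 1 (e.g. an
# optional leaf nested inside an optional group).
def definition_levels_buggy(values):
    return [0 if v is None else 1 for v in values]

values = ["a", None, "b"]
assert definition_levels(values, max_def=2) == [2, 0, 2]
assert definition_levels_buggy(values) == [1, 0, 1]  # diverges when max_def > 1
```

Parquet 1.14's size-statistics builder consumes these levels, which is why the previously latent mismatch only surfaced after the upgrade.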
(I have to admit that it is a little bit aggressive to enable a new feature by default on the parquet side, sigh)
@LuciferYang Thanks for the pointer, I've updated the PR 👍
@Fokko It seems that the data written by 1.14.1 is larger than that by 1.13.1.
The test at sql/core/src/test/scala/org/apache/spark/sql/InjectRuntimeFilterSuite.scala, lines 487 to 503 in b77caf7 …
The log on line 489 needs to be fixed; the statement "…
@LuciferYang Thanks again for the elaborate pointers. I just switched jobs and got a new laptop, so I have to reconfigure everything :) I'll keep an eye on the CI.
Looks very promising! Thanks all for the work! The failed tests do not seem related; I just re-triggered the CI jobs to be sure. One nice-to-have is to do a bit of perf comparison using benchmarks like …
Has anyone set up a nightly Jenkins job with stable Spark and its tests set to run off a nightly build of Parquet? It would seem a good way to catch regressions early, provided the test failures get attention. That's always a problem with cross-project builds.
@Fokko Do you have time to move this PR forward?
@LuciferYang Yes, let me get right to it!
Sorry for the long wait, that's quite a comprehensive test suite. I've run the benchmarks on both the main branch and this branch:

This branch: DataSourceReadBenchmark
BuiltInDataSourceWriteBenchmark

Main branch: DataSourceReadBenchmark
BuiltInDataSourceWriteBenchmark

Verdict: I don't see any huge deviations from main. Sometimes this branch is a bit faster, sometimes the main branch is just a bit faster. Does the deviation look acceptable to you?
We should run the corresponding benchmarks using GitHub Actions and update their results in the PR, for both Java 17 and 21.
Kicked them off: https://github.com/Fokko/spark/actions/workflows/benchmark.yml
@LuciferYang I've updated the PR. Sorry, I wasn't aware that you'd need to run the benchmarks in the GA. I was assuming that the runners would be too noisy.
ParquetReader Vectorized: DataPageV2              69    71    2    228.4    4.4    1.0X
ParquetReader Vectorized -> Row: DataPageV1       47    48    1    332.4    3.0    1.5X
ParquetReader Vectorized -> Row: DataPageV2       47    48    1    334.0    3.0    1.5X
ParquetReader Vectorized: DataPageV1              93    94    1    169.9    5.9    1.0X
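The derived columns in rows like these follow directly from the best time. A small sketch of the arithmetic, where the row count (1024 * 1024 * 15) is an assumption about this particular Spark benchmark rather than something printed in the output:

```python
# Reconstructing the Rate, Per Row, and Relative columns from the best
# time of the "ParquetReader Vectorized: DataPageV2" row above.
rows = 1024 * 1024 * 15          # assumed benchmark row count (15728640)
best_ms = 69.0                   # best time from the table, in ms

per_row_ns = best_ms * 1e6 / rows              # "Per Row(ns)" column
rate_m_per_s = rows / (best_ms / 1000) / 1e6   # "Rate(M/s)" column

assert round(per_row_ns, 1) == 4.4
assert abs(rate_m_per_s - 228.4) < 1.0  # small gap: best_ms is rounded

# "Relative" compares each case's best time against the group baseline,
# e.g. the "-> Row" cases at 47 ms versus the 69 ms baseline:
assert round(69 / 47, 1) == 1.5
```

Viewed this way, the question for the reviewers is only whether the Relative ratios hold steady across branches; the absolute times wobble with the runner.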
From the result, it appears that there is a slight decrease in throughput in the TINYINT scenario.
cc @dongjoon-hyun @sunchao
I guess this is probably due to other factors. It's fine as long as the relative numbers stay the same?
+1, LGTM. Although there is a comment from @LuciferYang , given that this is the GitHub Actions environment, the result looks good overall to me.
Thank you, @Fokko and all. Let me merge this for Apache Spark 4.0.0-preview2.
Thanks @dongjoon-hyun for merging, and @LuciferYang and @sunchao for the pointers 👍
The benchmark results look OK to me as well; there is no big deviation from the previous result. Thanks @Fokko for the PR!
What changes were proposed in this pull request?

Why are the changes needed?
Fixes quite a few bugs on the Parquet side: https://github.com/apache/parquet-mr/blob/master/CHANGES.md#version-1140

Does this PR introduce any user-facing change?
No

How was this patch tested?
Using the existing unit tests

Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#46447 from Fokko/fd-bump-parquet.
Authored-by: Fokko Driesprong <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>