Performance Regression Caused by Schema Hash in Spark PartitionPruning with Wide Tables #11763
Labels
bug
Apache Iceberg version
1.5.0
Query engine
Spark
Please describe the bug 🐞
Description:
In Spark's optimization rule `PartitionPruning`, the method `SparkBatchQueryScan#filterAttributes` is called, which triggers the computation of `Set<PartitionSpec> specs`. During this process, Iceberg iterates over every file and parses the jsonString into a `PartitionSpec`. To avoid repeated parsing, a cache map keyed by `(schema, jsonStr) -> PartitionSpec` was added in `org.apache.iceberg.PartitionSpecParser#fromJson`. However, for tables with a large number of files and columns, calculating the schema hash for each cache lookup consumes significant CPU time.
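To make the cost concrete, here is a minimal, self-contained sketch (hypothetical types, not Iceberg's actual implementation) of why a `(schema, jsonStr)` cache key is expensive: the map has to hash the key, and therefore the whole schema, on every lookup, even on a cache hit.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SpecCacheSketch {
  // Hypothetical stand-ins for Iceberg's Schema and PartitionSpec types.
  record Field(int id, String name, String type) {}

  // A record's hashCode() hashes all components, so hashing a Schema
  // walks every field; with 1500+ columns that walk is non-trivial.
  record Schema(List<Field> fields) {}

  record PartitionSpec(String json) {}

  // Cache key: its hashCode() includes schema.hashCode().
  record Key(Schema schema, String json) {}

  private final Map<Key, PartitionSpec> cache = new ConcurrentHashMap<>();

  PartitionSpec fromJson(Schema schema, String json) {
    // computeIfAbsent hashes the key on every call, so the full schema
    // hash is recomputed once per file even when the entry is cached.
    return cache.computeIfAbsent(new Key(schema, json), k -> parse(json));
  }

  private PartitionSpec parse(String json) {
    // Placeholder for the JSON parsing that the cache is meant to avoid.
    return new PartitionSpec(json);
  }
}
```

With 900,000 files, the per-lookup schema hash dominates even though the parsing itself is cached.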
Proposed Solution:
Avro's `Schema` avoids this kind of overhead by caching the schema's hashCode so it is computed only once. A similar optimization could be applied to Iceberg's schema to reduce the performance regression caused by frequent schema hash calculations.
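For reference, a minimal sketch of the Avro-style fix, assuming an immutable schema (hypothetical class, not an actual patch): compute the structural hash once and memoize it, the way `java.lang.String#hashCode` and Avro's `Schema` do.

```java
import java.util.List;

// Sketch only: memoize the expensive structural hash of an immutable schema.
final class CachedHashSchema {
  private final List<String> columns; // stand-in for real field metadata
  private int cachedHash;             // 0 means "not computed yet"

  CachedHashSchema(List<String> columns) {
    this.columns = List.copyOf(columns); // immutability makes caching safe
  }

  @Override
  public boolean equals(Object other) {
    return other instanceof CachedHashSchema that && columns.equals(that.columns);
  }

  @Override
  public int hashCode() {
    int h = cachedHash;
    if (h == 0) {               // racy but benign: at worst recomputed under a race
      h = columns.hashCode();   // the expensive walk over all columns
      if (h == 0) {
        h = 1;                  // avoid recomputing when the real hash is 0
      }
      cachedHash = h;
    }
    return h;
  }
}
```

After this change, the first lookup still pays the full hashing cost, but the remaining lookups per file become cheap integer reads.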
Reproduction Example:
I added a timer to `org.apache.iceberg.spark.source.SparkPartitioningAwareScan` and tested a SQL query on a table with 900,000 files and 1500+ columns (a representative query shape is sketched below).
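The original query was not included in the issue; the following is only a hypothetical sketch of a query shape that triggers dynamic partition pruning, with assumed catalog, table, and column names (`catalog.db.fact`, `catalog.db.dim`, `dt`, `region`).

```java
import org.apache.spark.sql.SparkSession;

public class DppRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-dpp-repro")
        .getOrCreate();

    // Assumed names: `fact` is a wide Iceberg table partitioned by `dt`.
    // A join whose probe side filters the fact table's partition column
    // is the classic trigger for Spark's PartitionPruning rule.
    spark.sql(
        "SELECT f.* "
            + "FROM catalog.db.fact f "
            + "JOIN catalog.db.dim d ON f.dt = d.dt "
            + "WHERE d.region = 'EU'")
        .collectAsList();
  }
}
```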
This query triggers the `org.apache.spark.sql.execution.dynamicpruning.PartitionPruning` optimization rule twice. Before task execution, the driver spends approximately 150 seconds on pre-execution preparation, with over 140 seconds consumed in calculating `PartitionSpec`.
Flame Graph: (see attachment)
Thread Dump: (see attachment)
Environment:
I tested on Iceberg 1.5.0; the issue is expected to persist in the latest version as well.
Willingness to contribute