Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[Spark] Skip collecting commit stats to prevent computing Snapshot State
#### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Before this PR, Delta computes a [SnapshotState](https://github.com/delta-io/delta/blob/v3.1.0/spark/src/main/scala/org/apache/spark/sql/delta/SnapshotState.scala#L46-L58) during every commit. Computing a SnapshotState is fairly slow and expensive, because it involves reading the entirety of a checkpoint, sidecars, and log segment. For many types of commit, it should be unnecessary to compute the SnapshotState. After this PR, an transaction can avoid computing the SnapshotState of a newly created snapshot. Skipping the computation is enabled via a spark configuration option `spark.databricks.delta.commitStats.collect=false` This change can have a big performance impact when writing into a Delta Table. Especially when the table comprises a large number of underlying data files. ## How was this patch tested? - Locally built delta-spark - Ran a small spark job to insert rows into a delta table - Inspected log4j output to see if snapshot state was computed - Repeated again, this time setting `spark.databricks.delta.commitStats.collect=false` ## Does this PR introduce _any_ user-facing changes? Yes, after this PR the user can set spark config option `spark.databricks.delta.commitStats.collect=false` to avoid computing SnapshotState after a commit. Signed-off-by: Ian Streeter <[email protected]>
- Loading branch information