From 098d6136f4de24d69fbdfda0fa11aa122bdbb566 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Malte=20S=C3=B8lvsten=20Velin?= Date: Sun, 29 Dec 2024 00:50:35 +0100 Subject: [PATCH] =?UTF-8?q?Add=20unit=20test=20ensuring=20that=20if=20prop?= =?UTF-8?q?erty=20is=20set=20to=20true=20then=20output=20is=20sorted=20on?= =?UTF-8?q?=20Z-order=20value.=20Signed-off-by:=20Malte=20Velin=20=20Author:=20Malte=20S=C3=B8lvsten=20Velin?= =?UTF-8?q?=20=20Date:=20=20=20Sat=20Dec=2028?= =?UTF-8?q?=2020:10:01=202024=20+0100?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add configuration property to toggle sorting output on Z-order value. Signed-off-by: Malte Velin commit 82e940f17f51a0ebeaac0b03441b13875da3c439 Author: Fred Storage Liu Date: Fri Dec 20 17:02:18 2024 -0800 Fix indentation in CloneTableBase (#3996) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Fix indentation in CloneTableBase ## How was this patch tested? ## Does this PR introduce _any_ user-facing changes? commit 4dbadbbf8ddd0a12273ac9521d61bc89196dc80d Author: Carmen Kwan Date: Thu Dec 19 22:39:44 2024 +0100 [Spark] Make Identity Column High Water Mark updates consistent (#3989) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Currently: - When we do a MERGE, we will always call `setTrackHighWaterMarks` on the transaction. This will have an effect if there is an INSERT clause in the MERGE. - If we `setTrackHighWaterMarks`, we collect the max/min of the column using `DeltaIdentityColumnStatsTracker`. This stats tracker is only invoked on files that are written/rewritten. These min/max values are compared with the existing high watermark. 
If the high watermark doesn't exist, we will keep as high watermark the largest of the max or the lowest of the min without checking against the starting value of the identity column. - If an identity column did not generate a value yet, the high watermark is None and isn't stored in the table. This is true for GENERATED ALWAYS AS IDENTITY tables when it is empty and true for GENERATED BY DEFAULT AS IDENTITY tables when it only has user inserted values for the identity column. - If you run a MERGE UPSERT that only ends up updating values in a GENERATED BY DEFAULT table that doesn't have a high watermark yet, we will write a new high watermark that is the highest for the updated file, which may be lower than the starting value specified for the identity column. Proposal: - This PR makes all high water mark updates go through the same validation function by default. It will not update the high watermark if it violates the start or the existing high watermark. Exception is if the table already has a corrupted high water mark. - This does NOT prevent the scenario where we automatically set the high watermark for a generated by default column based on user inserted values when it does respect the start. - Previously, we did not do high water mark rounding on the `updateSchema` path. This seems erroneous as the min/max values can be user inserted. We fix that in this PR. - Previously, we did not validate that on SYNC identity, the result of max can be below the existing high water mark. Now, we also do check this invariant and block it by default. A SQLConf has been introduced to allow reducing the high water mark if the user wants. - We add logging to catch bad high water mark. ## How was this patch tested? New tests that were failing prior to this change. ## Does this PR introduce _any_ user-facing changes? 
No commit ae4982ce267052c526fef638a88ce86f7d85e583 Author: Allison Portis Date: Thu Dec 19 11:42:13 2024 -0800 [Kernel] Fix flaky test for the Timer class for metrics (#3946) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description Fixes a flaky test. ## How was this patch tested? Unit test fix. ## Does this PR introduce _any_ user-facing changes? No. commit da58cad55741313852005cf2d84a7f2e0280bf2b Author: Allison Portis Date: Wed Dec 18 19:35:07 2024 -0800 [Kernel] Remove CC code from SnapshotManager (#3986) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description We are re-thinking the design of the Coordinated Commits table feature and much of this snapshot code will be refactored. Remove it for now as it greatly complicates our snapshot construction, and hopefully we can be more intentional in our code design/organization when re-implementing it. https://github.com/delta-io/delta/commit/fc81d1247d66cc32e454e985f0cfc81447f897b6 already removed the public interfaces and made it such that `SnapshotImpl::getTableCommitCoordinatorClientHandlerOpt` never returned a handler. ## How was this patch tested? Existing tests should suffice. ## Does this PR introduce _any_ user-facing changes? No. commit 34f02d8858faf2d74465a40c22edb548e0626c05 Author: Cuong Nguyen Date: Wed Dec 18 14:46:52 2024 -0800 [Spark] Avoid unnecessarily calling update and some minor clean up in tests (#3965) commit 1cd6fed7987ad15e7d8b2d593c4579ce865f4cbe Author: Andreas Chatzistergiou <93710326+andreaschat-db@users.noreply.github.com> Date: Wed Dec 18 23:01:54 2024 +0100 [Spark] Drop feature support in DeltaTable Scala/Python APIs (#3952) #### Which Delta project/connector is this regarding? 
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR adds drop feature support to the DeltaTable API for both the Scala and Python APIs. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? Yes. See description. commit baa55187fd32bb4b0f97fd1d2305db4e0dd7d44e Author: Carmen Kwan Date: Wed Dec 18 20:21:45 2024 +0100 [Spark][TEST-ONLY] More tests updating Identity Column high water mark (#3985) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Test-only PR. Add one more test for updating the identity column high water mark when it is not already available. ## How was this patch tested? Test-only PR. ## Does this PR introduce _any_ user-facing changes? No. commit f577290c5dec0b76130397cc0a050f9030b12035 Author: Rahul Shivu Mahadev <51690557+rahulsmahadev@users.noreply.github.com> Date: Tue Dec 17 13:55:03 2024 -0800 [Spark] Fix auto-conflict handling logic in Optimize to handle DVs (#3981) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Bug: There was a long-standing bug where the custom conflict detection logic in Optimize did not catch concurrent transactions that add DVs, e.g. AddFile(path='a') -> AddFile(path='a', dv='dv1'). Fix: Updated the conflict resolution to consider a composite key of (path, dvId) instead of depending on path alone. ## How was this patch tested? - unit tests ## Does this PR introduce _any_ user-facing changes? no commit fc81d1247d66cc32e454e985f0cfc81447f897b6 Author: Scott Sandre Date: Fri Dec 13 11:14:09 2024 -0800 [Kernel] Remove Coordinated Commits from public API (#3938) #### Which Delta project/connector is this regarding? 
- [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description We are re-thinking the design of the Coordinated Commits table feature (currently still in RFC). Thus, we should remove it from the public Kernel API for the Delta 3.3 release. To summarize the changes of this PR - I remove `getCommitCoordinatorClientHandler` from the `Engine` interface - I move various previously `public` CC interfaces and classes to be `internal` now - `SnapshotImpl::getTableCommitCoordinatorClientHandlerOpt` is hardcoded to return an empty optional - Delete failing test suites and inapplicable utils ## How was this patch tested? Existing CI tests. ## Does this PR introduce _any_ user-facing changes? We remove coordinated commits from the public kernel API. commit 2f5673e0432962cb834e103dbc79ce8aea9a4e37 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Fri Dec 13 01:09:27 2024 +0100 [Docs] Update documentation for Row Tracking to include Row Tracking Backfill introduced in Delta 3.3 (#3968) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (Docs) ## Description - Update the [Row Tracking blog](https://docs.delta.io/latest/delta-row-tracking.html#-limitations). Previously, the limitations section stated that Row Tracking cannot be enabled on non-empty tables. With the [Row Tracking Backfill release](https://github.com/delta-io/delta/releases/) in Delta 3.3, we can now enable Row Tracking on non-empty tables. - Explicitly mention that you can enable Row Tracking on existing tables from Delta 3.3. ## How was this patch tested? N/A ## Does this PR introduce _any_ user-facing changes? 
N/A commit 259751b51d73831fd6222d98178091b037ef0d7a Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Fri Dec 13 01:09:17 2024 +0100 [Docs][3.3] Update documentation for Row Tracking to include Row Tracking Backfill introduced in Delta 3.3 (#3969) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (Docs) ## Description - Cherry-pick https://github.com/delta-io/delta/pull/3968 into Delta 3.3. ## How was this patch tested? N/A ## Does this PR introduce _any_ user-facing changes? N/A commit d0be1d7b6c376b5d7cf7fba5daf039a2638cd7b9 Author: Zhipeng Mao Date: Thu Dec 12 20:01:50 2024 +0100 Add identity column doc (#3935) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description It adds doc for identity column. ## How was this patch tested? Doc change. ## Does this PR introduce _any_ user-facing changes? No. commit fdf887d6104582955ad75d3f7297b36d249d91d1 Author: Zhipeng Mao Date: Thu Dec 12 19:59:20 2024 +0100 [SPARK] Add test for Identity Column merge metadata conflict (#3971) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description It adds a test for identity column to verify merge will be aborted if high water mark is changed after analysis and before execution. ## How was this patch tested? Test-only. ## Does this PR introduce _any_ user-facing changes? No. commit 58f94afafd16a19644fef7130a46cb8a93d18ec8 Author: Dhruv Arya Date: Thu Dec 12 10:58:13 2024 -0800 [PROTOCOL][Version Checksum] Remove references to Java-specific Int.MaxValue and Long.MaxValue (#3961) #### Which Delta project/connector is this regarding? 
- [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (PROTOCOL) ## Description Fixes a Version Checksum spec change introduced in https://github.com/delta-io/delta/pull/3777. The last two bin bounds for the Deleted File Count Histogram are currently defined in terms of Java's Int.MaxValue and Long.MaxValue. This PR makes the spec language-independent by inlining the actual values of these bounds. ## How was this patch tested? N/A ## Does this PR introduce _any_ user-facing changes? No commit 05cdd3cd4752dbb826f6bcfa4ba1d46ef1b246ee Author: Anton Erofeev Date: Thu Dec 12 17:20:08 2024 +0300 [Kernel] Fix incorrect load protocol and metadata time log (#3964) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description Resolves #3948 Fixed incorrect load protocol and metadata time log ## How was this patch tested? Unit tests ## Does this PR introduce _any_ user-facing changes? No commit 19d89f6ba0803b0f4c1826a521c27ababdd50864 Author: Jiaheng Tang Date: Wed Dec 11 18:28:25 2024 -0800 Update liquid clustering docs (#3958) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (docs) ## Description Add docs for OPTIMIZE FULL, in-place migration, and create table from external location. ## How was this patch tested? ![127 0 0 1_8000_delta-clustering html (6)](https://github.com/user-attachments/assets/4148e5e0-3aad-403a-bb91-641f08a500b7) ## Does this PR introduce _any_ user-facing changes? No commit 30d74a6b8d5a305ce4a6ab625f69d0b9b93e6f92 Author: Carmen Kwan Date: Wed Dec 11 21:25:09 2024 +0100 [Spark][TEST-ONLY] Identity Column replace tests for partitioned tables (#3960) #### Which Delta project/connector is this regarding? 
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Quick follow-up for https://github.com/delta-io/delta/pull/3937. Expand the test to cover partitioned tables too. ## How was this patch tested? Test only change. New tests and existing tests pass. ## Does this PR introduce _any_ user-facing changes? No. commit 10972577202783720f5e61925ee7d7c6fc204a78 Author: Fred Storage Liu Date: Wed Dec 11 11:50:01 2024 -0800 Update Delta uniform documentation to include ALTER enabling (#3927) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (fill in here) ## Description Update Delta uniform documentation to include ALTER enabling ## How was this patch tested? ## Does this PR introduce _any_ user-facing changes? commit 57d0e3b42f60d133db9c4a81a432804803d9955b Author: Fred Storage Liu Date: Wed Dec 11 07:37:09 2024 -0800 Expose Delta Uniform write commit size in logs (#3898) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Expose Delta Uniform write commit size in logs commit 407d4c99b437636cde2fcc5c52039bb19510bb64 Author: Kaiqi Jin Date: Wed Dec 11 07:36:30 2024 -0800 Use default partition value during uniform conversion when partition value is missing (#3924) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Previously, a missing pair in the partitionValues map was not handled correctly, resulting in a Delta -> Iceberg conversion failure. To fix this, this PR uses the default value for missing entries in the partitionValues map. ## How was this patch tested? Existing tests ## Does this PR introduce _any_ user-facing changes? 
No commit e3a613dfa550defb86a05a57d9fef52daa86e8da Author: Cuong Nguyen Date: Tue Dec 10 15:35:16 2024 -0800 [Spark] Pass catalog table to DeltaLog API call sites, part 3 (#3949) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Fix a number of code paths where we want to pass catalog table to the commit coordinator client via DeltaLog API ## How was this patch tested? Unit tests ## Does this PR introduce _any_ user-facing changes? No. commit b39d5b328ffa8e1071fc6aab78cfb345c8f2d8f7 Author: Fred Storage Liu Date: Tue Dec 10 09:33:33 2024 -0800 Add sizeInBytes API for Delta clone (#3942) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Add sizeInBytes API for Delta clone ## How was this patch tested? existing UT ## Does this PR introduce _any_ user-facing changes? commit 61ac84d4579fdf99465861991f1a0fb697fa0325 Author: Cuong Nguyen Date: Tue Dec 10 09:07:26 2024 -0800 [SPARK] Clean up vacuum-related code (#3931) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR cleans up a few things + In scala API, use the `VacuumTableCommand` instead of calling `VacuumCommand.gc` directly, + Pass `DeltaTableV2` to `VacuumCommand.gc` instead of `DeltaLog`. + Use `DeltaTableV2` in tests instead of `DeltaLog`. ## How was this patch tested? Unit tests ## Does this PR introduce _any_ user-facing changes? No commit 79e518ba81505384695ec4a71ba0013eeb860646 Author: Johan Lasperas Date: Tue Dec 10 17:50:10 2024 +0100 [Spark] Allow missing fields with implicit casting during streaming write (#3822) ## Description Follow-up on https://github.com/delta-io/delta/pull/3443 that introduced implicit casting during streaming write to delta tables. 
The feature was shipped disabled due to a regression found in testing where writing data with missing struct fields started being rejected. Streaming writes are one of the few inserts that allow missing struct fields. This change allows configuring the casting behavior used in MERGE, UPDATE and streaming writes with respect to missing struct fields. ## How was this patch tested? Extensive tests were added in https://github.com/delta-io/delta/pull/3762 in preparation for this change, covering all inserts (SQL, dataframe, append/overwrite, ..): - Missing top-level columns and nested struct fields. - Extra top-level columns and nested struct fields with schema evolution. - Position vs. name based resolution for top-level columns and nested struct fields. with, in particular, the goal of ensuring that enabling implicit casting in stream writes here doesn't cause any other unwanted behavior change. ## This PR introduces the following *user-facing* changes From the initial PR: https://github.com/delta-io/delta/pull/3443 Previously, writing to a Delta sink using a type that doesn't match the column type in the Delta table failed with `DELTA_FAILED_TO_MERGE_FIELDS`:
```
spark.readStream
  .table("delta_source")
  # Column 'a' has type INT in 'delta_sink'.
  .select(col("a").cast("long").alias("a"))
  .writeStream
  .format("delta")
  .option("checkpointLocation", "")
  .toTable("delta_sink")

DeltaAnalysisException: [DELTA_FAILED_TO_MERGE_FIELDS] Failed to merge fields 'a' and 'a'
```
With this change, writing to the sink now succeeds and data is cast from `LONG` to `INT`. If any value overflows, the stream fails with (assuming default `storeAssignmentPolicy=ANSI`):
```
SparkArithmeticException: [CAST_OVERFLOW_IN_TABLE_INSERT] Fail to assign a value of 'LONG' type to the 'INT' type column or variable 'a' due to an overflow. Use `try_cast` on the input value to tolerate overflow and return NULL instead."
``` commit 8f344098e0601d04f9bd3fa25306569b3d106e06 Author: jackierwzhang <67607237+jackierwzhang@users.noreply.github.com> Date: Tue Dec 10 08:49:46 2024 -0800 Fix schema tracking location check condition against checkpoint location (#3939) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR introduces a better way to check if the schema tracking location is under the checkpoint location that would work with arbitrary file systems and paths. ## How was this patch tested? New UT. ## Does this PR introduce _any_ user-facing changes? No commit fdc2c7f7c7367a50de8734cc9b4520cecc5aeadc Author: Rajesh Parangi <89877744+rajeshparangi@users.noreply.github.com> Date: Mon Dec 9 17:50:24 2024 -0800 Add Documentation for Vacuum LITE (#3932) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Adds Documentation for Vacuum LITE ## How was this patch tested? N/A ## Does this PR introduce _any_ user-facing changes? NO commit 00fa0ae8a0d2ec9f0e52cbe8ab28274a80e6272b Author: Carmen Kwan Date: Mon Dec 9 21:02:47 2024 +0100 [Spark][TEST-ONLY] Identity Column high watermark and replace tests (#3937) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description In this PR, we expand the test coverage for identity columns. Specifically, we add more assertions for the high watermarks and cover more test scenarios with replacing tables. ## How was this patch tested? Test-only PR. We expand test coverage. ## Does this PR introduce _any_ user-facing changes? No. commit 7224677acda11eb21103112c8b636963874e9071 Author: Carmen Kwan Date: Mon Dec 9 20:32:51 2024 +0100 [Spark] Enable Identity column SQLConf (#3936) #### Which Delta project/connector is this regarding? 
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR is part of https://github.com/delta-io/delta/issues/1959 In this PR, we flip the SQLConf that guards the creation of Identity Column from false to true. Without this, we cannot create identity columns in Delta Spark! ## How was this patch tested? Existing tests pass. ## Does this PR introduce _any_ user-facing changes? Yes, it enables the creation of Identity Columns. commit bb3956f0c8e290725d0b6ab02981d2c5ad462c12 Author: Andreas Chatzistergiou <93710326+andreaschat-db@users.noreply.github.com> Date: Fri Dec 6 14:51:31 2024 +0100 [Spark] CheckpointProtectionTableFeature base implementation (#3926) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Base implementation of `CheckpointProtectionTableFeature`. Writers are only allowed to clean up metadata as long as they can truncate history up to `requireCheckpointProtectionBeforeVersion` in one go. As a second step, the feature can be improved by allowing metadata cleanup even when the invariant above does not hold. Metadata cleanup could be allowed if the client verifies it supports all writer features contained in the history it intends to truncate. This improvement is important for providing GDPR compliance. ## How was this patch tested? Added tests in `DeltaRetentionSuite`. ## Does this PR introduce _any_ user-facing changes? No. commit da162a097a25524fc97334f47a180257cb487789 Author: Dhruv Arya Date: Thu Dec 5 17:16:14 2024 -0800 [Protocol] Add a version checksum to the specification (#3777) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (PROTOCOL) ## Description Adds the concept of a Version Checksum to the protocol. 
This version checksum can be emitted on every commit and stores important bits of information about the snapshot which can later be used to validate the integrity of the delta log. ## How was this patch tested? N/A ## Does this PR introduce _any_ user-facing changes? N/A commit 8fb17a0160a937307d6fb9276a77403aeb7efc63 Author: Dhruv Arya Date: Thu Dec 5 16:58:21 2024 -0800 [Spark][Version Checksum] Read Protocol, Metadata, and ICT directly from the Checksum during Snapshot construction (#3920) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Stacked over https://github.com/delta-io/delta/pull/3907. This PR makes the Checksum (if available) the source of truth for Protocol, Metadata, ICT during snapshot construction. This helps us avoid a Spark query and improves performance. ## How was this patch tested? Added some test cases to existing suites ## Does this PR introduce _any_ user-facing changes? No commit 1ee278ae23bc08a25c448524264622ba106686cd Author: Allison Portis Date: Wed Dec 4 15:40:15 2024 -0800 [Kernel][Metrics][PR#4] Adds Counter class (#3906) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description Adds a `Counter` that will be used by following PRs to count metrics. ## How was this patch tested? Adds a unit test. ## Does this PR introduce _any_ user-facing changes? No. commit 8cd614107468389a117362f708a540c0263c01e7 Author: Qiyuan Dong Date: Wed Dec 4 23:21:34 2024 +0100 [Kernel] Add JsonMetadataDomain and RowTrackingMetadataDomain (#3893) #### Which Delta project/connector is this regarding? 
- [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description This PR adds the following to Delta Kernel Java: - `JsonMetadataDomain.java`: Introduces the base abstract class `JsonMetadataDomain` for metadata domains that use JSON as their configuration string. Concrete implementations, such as `RowTrackingMetadataDomain`, should extend this class to define their specific metadata domain. This class provides utility functions for - serializing to/deserializing from a JSON configuration string - creating a `DomainMetadata` action for committing - creating a specific metadata domain instance from a `SnapshotImpl` - `RowTrackingMetadataDomain.java`: Implements the metadata domain `delta.rowTracking`. It has a configuration field `long rowIdHighWaterMark`, which will be used in the future for assigning fresh row IDs. ## How was this patch tested? Added unit tests and integration tests covering the serialization and deserialization functionalities of `JsonMetadataDomain` and `RowTrackingMetadataDomain` in `DomainMetadataSuite.scala`. ## Does this PR introduce _any_ user-facing changes? No. commit 82c2f648ee864e0b6c77428cbf2514f92ceb2321 Author: Allison Portis Date: Wed Dec 4 14:17:09 2024 -0800 [Kernel][Metrics][PR#1] Adds initial interfaces and a Timer class (#3902) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description Adds the initial interfaces for https://github.com/delta-io/delta/issues/3905 as well as a `Timer` class that will be used in follow-up PRs. ## How was this patch tested? Just interface changes. ## Does this PR introduce _any_ user-facing changes? No. commit a52578bfb15c8f5216232a37c86fc34f9935bed1 Author: Scott Sandre Date: Wed Dec 4 14:02:04 2024 -0800 [Kernel] Add `Snapshot::getTimestamp` public API (#3791) #### Which Delta project/connector is this regarding? 
- [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description Add a new `Snapshot::getTimestamp` public API. ## How was this patch tested? N/A. Trivial. ## Does this PR introduce _any_ user-facing changes? Yes. Adds a new `Snapshot::getTimestamp` public API. commit 09e75238586f1f458761a59da0e9f19dc3dfa832 Author: Dhruv Arya Date: Wed Dec 4 13:31:19 2024 -0800 [Spark][Version Checksum] Enable checksum writes by default (#3919) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Version Checksum can now be incrementally computed without triggering a full state reconstruction (see https://github.com/delta-io/delta/pull/3899, https://github.com/delta-io/delta/pull/3895). This PR enables writing the version checksum by default. ## How was this patch tested? Existing tests in DeltaLogSuite and ChecksumSuite should cover this change. ## Does this PR introduce _any_ user-facing changes? No commit 4aecba55eb5519f133a164ad003e23d017a8ccbd Author: Adam Binford Date: Wed Dec 4 15:38:04 2024 -0500 [Spark] Make `delta.dataSkippingStatsColumns` more lenient for nested columns (#2850) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Resolves #2822 Make `delta.dataSkippingStatsColumns` more lenient for nested columns by not throwing an exception if a nested column doesn't support gathering stats. This more closely matches the behavior of the `dataSkippingNumIndexedCols` which allows for unsupported types in those columns (and seems to still gather null counts for those unsupported types). This also allows more use cases where you might have a wide variety of types inside a top level struct, and you simply want to gather stats on whatever columns inside that struct you can. 
I kept the duplicate column checking in place to minimize changes, but I'm not sure how necessary that really is besides letting users know they are doing something dumb. ## How was this patch tested? A couple of tests were removed that were specifically testing for the now-allowed behavior, and a new test was added to verify the new behavior works. ## Does this PR introduce _any_ user-facing changes? Yes, specifying a struct with unsupported stats gathering types in `delta.dataSkippingStatsColumns` is now allowed instead of throwing an exception. --------- Co-authored-by: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> commit 28a9e6fffe236a4e2f4255913516e094a826a0e1 Author: Scott Sandre Date: Wed Dec 4 09:47:38 2024 -0800 [Kernel] Add a public utility function to determine if a given partition exists (i.e. actually has data) (#3918) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description Add a public utility function to determine if a given partition exists (i.e. actually has data) ## How was this patch tested? Simple UTs. ## Does this PR introduce _any_ user-facing changes? Yes. commit f56bcd5c03234e72321725d0e95b772971ad0404 Author: Andreas Chatzistergiou <93710326+andreaschat-db@users.noreply.github.com> Date: Wed Dec 4 17:53:12 2024 +0100 [Spark] Add support for dropping the CheckpointProtection table feature (#3915) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Adds support for dropping the `CheckpointProtection` table feature. To drop the feature, we need to truncate all history prior to the CheckpointProtection version. For this cleanup operation we use a shorter retention period of 24 hours. 
This is configured in `delta.dropFeatureTruncateHistory.retentionDuration` and is the same config we use in the drop feature with History Truncation implementation. ## How was this patch tested? Added tests in `DeltaFastDropFeatureSuite`. ## Does this PR introduce _any_ user-facing changes? No. commit e91dd1fe4b4376088a87bd3af2b2ffad3d9fdb3d Author: Johan Lasperas Date: Wed Dec 4 17:51:34 2024 +0100 Update checkError calls in DeltaInsertIntoTableSuite to compile against spark master (#3921) ## Description Small test fix to get https://github.com/delta-io/delta/commit/ca118d189591a98082e7bfa5014bf9264918c0a2 to compile against Spark master. The `errorClass` argument of `checkError` was renamed to `condition` in a recent Spark release. To work with both, the named argument is changed to be unnamed. ## How was this patch tested? N/A, test-only ## Does this PR introduce _any_ user-facing changes? No commit 6345a88ebe52e93928d94d3b7920ab5f7b60d4d9 Author: Scott Sandre Date: Wed Dec 4 08:42:20 2024 -0800 [Kernel] Add `Snapshot::getPartitionColumnNames` public API (#3916) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description Add new `Snapshot::getPartitionColumnNames` public API ## How was this patch tested? Simple UTs. ## Does this PR introduce _any_ user-facing changes? Yes. commit 8ef5efc67a783e7416bf3dc781e50d2adb7afbf1 Author: Johan Lasperas Date: Wed Dec 4 16:27:18 2024 +0100 Increase timeout on first test in DeltaSinkImplicitCastSuite (#3922) ## Description Test `write wider type - long -> int` in suite `DeltaSinkImplicitCastSuiteBase` is randomly timing out. Specifically, the first write in the test suite times out, due to the initial streaming setup taking a long time. The timeout on that first write is increased to address the test flakiness. ## How was this patch tested? Test-only change ## Does this PR introduce _any_ user-facing changes? 
No commit 43ea6892146a2aa486426e9ce218bfbf27ff1297 Author: Scott Sandre Date: Tue Dec 3 14:45:22 2024 -0800 Revert "[Kernel] [CC Refactor #2] Add TableDescriptor and CommitCoordinatorClient API" (#3917) This reverts commit 6ae4b62845ed579bb5a19f4646831c4ee2931c02 We seem to be rethinking our Coordinated Commits CUJ / APIs, and we don't want these APIs leaked in Delta 3.3. commit d4ced37bbd4aa8e4d105a89aa15452dbd41e47f2 Author: Scott Sandre Date: Tue Dec 3 13:31:35 2024 -0800 Revert "[Kernel] [CC Refactor #1] Add `TableIdentifier` API (#3795)" (#3900) This reverts commit 024dadbb425e8edb862b425d5857a18bcb0d7045. We seem to be rethinking our Coordinated Commits CUJ / APIs, and we don't want these APIs leaked in Delta 3.3. commit 510c170e2c81eb45d997c487236397290aa3a58a Author: YotillaAntoni <92581297+YotillaAntoni@users.noreply.github.com> Date: Tue Dec 3 20:58:42 2024 +0100 [Kernel] Fix RoaringBitmapArray create/add methods. Closes issue #3881 (#3882) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description Resolves #3881 - The bitmaps array should be initialized in the `create` path. - The `expandBitMaps` method should set new bitmaps from the old length up to the new length, instead of overwriting the old ones. Also adds a `toArray` method mimicking the one provided by the scala version which the class is based on. ## How was this patch tested? Added unit tests. ## Does this PR introduce _any_ user-facing changes? 
No --------- Signed-off-by: Antoni Reus commit ca118d189591a98082e7bfa5014bf9264918c0a2 Author: Richard-code-gig <69102122+Richard-code-gig@users.noreply.github.com> Date: Tue Dec 3 17:37:44 2024 +0000 [Spark] Fix schema evolution issue with nested struct (within a map) and column renamed (#3886) This PR fixes an issue with schema evolution in Delta Lake where adding a new field to a struct within a map and renaming an existing top level field caused the operation to fail. The fix includes logic to handle these transformations properly, ensuring that new fields are added without conflicts. It also resolved a ToDo of casting map types in the [DeltaAnalysis.scala](https://github.com/Richard-code-gig/delta/blob/feature/schema-evolution-with-map-fix/spark/src/main/scala/org/apache/spark/sql/delta/DeltaAnalysis.scala) module. ### Changes: - Updated schema evolution logic to support complex map transformations. - Enabled schema evolution for both map keys, simple and nested values - Added additional case statements to handle MapTypes in addCastToColumn method in DeltaAnalysis.scala module. - Modified TypeWideningInsertSchemaEvolutionSuite test to support schema evolution of maps. - Added an additional method (addCastsToMaps) to DeltaAnalysis.scala module. - Changed argument type of addCastToColumn from attributes to namedExpression - Added [EvolutionWithMap](https://github.com/Richard-code-gig/delta/blob/feature/schema-evolution-with-map-fix/examples/scala/src/main/scala/example/EvolutionWithMap.scala) in the example modules to demonstrate use case. - Modified nested struct type evolution with field upcast test in map in TypeWideningInsertSchemaEvolutionSuite.scala - Added new tests cases for maps to DeltaInsertIntoTableSuite.scala ### Related Issues: - Resolves: #3227 #### Which Delta project/connector is this regarding? - [✓] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description ## How was this patch tested? 
Tested through: - Integration Tests: Validated changes with Delta Lake and Spark integration. See [EvolutionWithMap](https://github.com/Richard-code-gig/delta/blob/feature/schema-evolution-with-map-fix/examples/scala/src/main/scala/example/EvolutionWithMap.scala). - Validated the test suites passed and [TypeWideningInsertSchemaEvolutionSuite](https://github.com/Richard-code-gig/delta/blob/feature/schema-evolution-with-map-fix/spark/src/test/scala/org/apache/spark/sql/delta/typewidening/TypeWideningInsertSchemaEvolutionSuite.scala) to add support for maps. - Added additional tests cases in [DeltaInsertIntoTableSuite](https://github.com/Richard-code-gig/delta/blob/feature/schema-evolution-with-map-fix/spark/src/test/scala/org/apache/spark/sql/DeltaInsertIntoTableSuite.scala) to cover complex map transformations ## Does this PR introduce _any_ user-facing changes? No, it doesn't introduce any user-facing changes. It only resolved an issue even in the released versions of Delta Lake. The previous behaviour was an error message when attempting operations involving adding extra fields to StructField in maps: [[DATATYPE_MISMATCH.CAST_WITHOUT_SUGGESTION](https://docs.databricks.com/error-messages/error-classes.html#datatype_mismatch.cast_without_suggestion)] Cannot resolve "metrics" due to data type mismatch: cannot cast "MAP>" to "MAP>". --------- Co-authored-by: Sola Richard Olorunfemi commit 634ba15143f1fefed1442738e7498a8ef9bc3344 Author: Cuong Nguyen Date: Tue Dec 3 09:37:11 2024 -0800 [SPARK] Rewrite GENERATE SYMLINK MANIFEST command to use Spark table resolution (#3914) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR rewrites the GENERATE SYMLINK MANIFEST command (`DeltaGenerateCommand`) to use Spark's table resolution mechanism. It is replaced to be a `UnaryRunnableCommand` that takes in a child logical plan, which is initially an unresolved table. 
The table is then resolved by Spark's resolution mechanism, resulting in a `DeltaTableV2` object. With this object, we also have a catalog table which is passed down into `generateFullManifest` as part of the effort to pass the catalog table to DeltaLog API call sites. ## How was this patch tested? unit tests ## Does this PR introduce _any_ user-facing changes? No commit 2937bc8fd7e9264ccc8369664b65584c88f2c21f Author: Alexey Shishkin Date: Tue Dec 3 10:41:42 2024 +0100 [Spark] Fix dependent constraints/generated columns checker for type widening (#3912) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description The current checker of dependent expressions doesn't validate changes for array and map types. For example, using type widening could break constraints: ``` scala> sql("CREATE TABLE table (a array) USING DELTA") scala> sql("INSERT INTO table VALUES (array(1, -2, 3))") scala> sql("SELECT hash(a[1]) FROM table").show() +-----------+ | hash(a[1])| +-----------+ |-1160545675| +-----------+ scala> sql("ALTER TABLE table ADD CONSTRAINT ch1 CHECK (hash(a[1]) = -1160545675)") scala> sql("ALTER TABLE table SET TBLPROPERTIES('delta.enableTypeWidening' = true)") scala> sql("ALTER TABLE table CHANGE COLUMN a.element TYPE BIGINT") scala> sql("SELECT hash(a[1]) FROM table").show() +----------+ |hash(a[1])| +----------+ |-981642528| +----------+ scala> sql("INSERT INTO table VALUES (array(1, -2, 3))") 24/11/15 12:53:23 ERROR Utils: Aborting task com.databricks.sql.transaction.tahoe.schema.DeltaInvariantViolationException: [DELTA_VIOLATE_CONSTRAINT_WITH_VALUES] CHECK constraint ch1 (hash(a[1]) = -1160545675) violated by row with values: ``` The proposed algorithm is stricter and takes maps, arrays and structs into account when checking constraint and generated column dependencies. ## How was this patch tested?
Added new tests for constraints and generated columns used with the type widening feature. ## Does this PR introduce _any_ user-facing changes? Because the algorithm is stricter, additional potentially dangerous type changes will be prohibited. An exception will be thrown in the example above. But such changes mostly occur through the schema evolution feature, which was introduced recently, so this should not affect many users. commit 81f27b3cbe2869e380c397eee28c3da4de5742a6 Author: Chirag Singh <137233133+chirag-s-db@users.noreply.github.com> Date: Mon Dec 2 13:43:49 2024 -0800 [Spark] Restrict partition-like data filters to whitelist of known-good expressions (#3872) commit 4d2c5cfd45119d78a86560de12e7b946228d77f2 Author: Dhruv Arya Date: Tue Nov 26 22:29:48 2024 -0800 [Spark][Version Checksum] Incrementally compute allFiles in the checksum (#3899) commit 700bdafbb5a43de8b070f9ad3fc7f2fcefeb8e49 Author: Qiyuan Dong Date: Mon Nov 25 18:25:45 2024 +0100 [Kernel] Add Domain Metadata support to Delta Kernel (#3835) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description This PR adds support for domain metadata to Delta Kernel as described in the [Delta Protocol](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#domain-metadata). In particular, it adds the following to Delta Kernel: - `DomainMetadata` Class - Used to represent a [domain metadata action](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#domain-metadata) as described in the Delta Protocol. - Includes necessary utility functions, such as creating a `DomainMetadata` instance from `Row`/`ColumnVector` and creating an action `Row` from a `DomainMetadata` instance for committing. - Transaction Support - Checks for duplicate domain metadata and protocol support prior to committing them. - Adds an internal `addDomainMetadata` API to `TransactionImpl` for testing purposes.
In real scenarios, domain metadata will be constructed by feature-specific code within `TransactionImpl`. A future PR introducing Row Tracking will provide a concrete example of domain metadata usage in practice. - Checkpointing - Domain metadata is maintained during checkpointing. - Log Replay. - Currently, domain metadata is lazily loaded in a separate pass of replay when requested. - We might want to improve this in the future by caching domain metadata during the initial Protocol & Metadata replay. - Conflict Resolution. - Two overlapping transactions conflict if they include domain metadata actions for the same metadata domain. - Future features can implement custom conflict resolution logic as needed. - Adds `domainMetadata` to `SUPPORTED_WRITER_FEATURES` ## How was this patch tested? Added tests covering operations involving DomainMetadata in `DomainMetadataSuite`. - Unit tests for committing, log replaying, checkpointing, and conflict resolution related to domain metadata. Negative tests for missing writer feature in the protocol and duplicate domain metadata actions. - Integration tests where a table with domain metadata is written by Spark and read by Kernel, and vice versa. ## Does this PR introduce _any_ user-facing changes? No. Domain metadata is currently intended for internal use by kernel developers to support specific table features. We don't plan to allow users to create their own domain metadata in the near future. So this PR only involves changes to internal APIs with no additions/modifications to public APIs. --------- Co-authored-by: Johan Lasperas commit ec0ab0d7e18dc891b81d1c3aa15ad741a705fbd8 Author: Dhruv Arya Date: Fri Nov 22 14:14:50 2024 -0800 [Spark][Version Checksum] Incrementally compute VersionChecksum setTransactions and domainMetadata (#3895) #### Which Delta project/connector is this regarding?
- [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Follow up for https://github.com/delta-io/delta/pull/3828. Adds support for incrementally computing the set transactions and domain metadata actions based on the current commit and the last version checksum. Incremental computation for both these action types has thresholds so that we don't store them if they are too long (tests have been added for the same). ## How was this patch tested? Added new tests in DomainMetadataSuite and a new suite called `DeltaIncrementalSetTransactionsSuite` ## Does this PR introduce _any_ user-facing changes? No commit 68275d150db7980419836d9b891c61cddb986457 Author: michaelzhan-db Date: Mon Nov 18 14:37:37 2024 -0800 Make type of logged values consistent for each logkey (#3883) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Ensures that the types of logged values for each logkey are consistent. ## How was this patch tested? Code compiles and existing tests pass. ## Does this PR introduce _any_ user-facing changes? No Signed-off-by: Michael Zhang commit ee87b777d60fff490744cd048f73c86848c4ef0d Author: Cuong Nguyen Date: Mon Nov 18 12:02:21 2024 -0800 [SPARK] Tests for streaming use case of coordinated commits (#3884) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Add tests for the streaming use case of coordinated commits. ## How was this patch tested? Unit tests ## Does this PR introduce _any_ user-facing changes?
No commit ed6ae700a0c7dff0ba6fd0e0120b5e757e757588 Author: Johan Lasperas Date: Mon Nov 18 21:02:11 2024 +0100 [Delta Sharing] Enable D2D delta sharing with type widening (#3675) ## Description Adds type widening to the list of supported features for D2D delta sharing and adds client-side tests covering reading a table that had a type change applied using the type widening table feature. ## How was this patch tested? Added tests. commit 3a98b8a33a48416649d0844355acffceca4ae21c Author: Ole Sasse Date: Fri Nov 15 23:11:04 2024 +0100 [SPARK] Add SQL metrics for ConvertToDelta (#3841) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Add the metric "numConvertedFiles" to the ConvertToDelta command node ## How was this patch tested? Added a new test validating the values ## Does this PR introduce _any_ user-facing changes? The metric becomes visible in the plan (i.e. Spark UI) commit 6ff1cbf2a6a2ac35098cfdfc4c478e706c8db0cb Author: Fred Storage Liu Date: Thu Nov 14 16:18:07 2024 -0800 Not include baseConvertedDeltaVersion in iceberg table property when flag is off (#3877) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Not include baseConvertedDeltaVersion in iceberg table property when flag is off ## How was this patch tested? manual test ## Does this PR introduce _any_ user-facing changes? commit a94e84e69f43ce8dc772681b337b79d40011bac1 Author: Burak Yavuz Date: Thu Nov 14 14:56:25 2024 -0800 [Spark] Remove unused method and do a little refactor in Checkpoints.scala (#3874) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description I found an unused method. Removed it, and also did a mini refactor in Checkpoints.scala ## How was this patch tested?
Existing unit tests ## Does this PR introduce _any_ user-facing changes? No commit 98a47a192be26e67a0a0ebe6923ae7fa34348795 Author: Jungtaek Lim Date: Fri Nov 15 02:38:27 2024 +0900 Upgrade the version of delta-sharing-client to 1.2.2 (#3878) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR proposes to upgrade the version of delta-sharing-client to 1.2.2, which is compatible with recent change in master branch for Apache Spark. The version can work for both Spark 4.0 and prior. ## How was this patch tested? Existing tests with CI ## Does this PR introduce _any_ user-facing changes? No. commit e099883d21f6a9a775cf7caf793e8f1d7f4a691d Author: Ming DAI Date: Wed Nov 13 16:36:03 2024 -0800 Fix import order (#3875) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description As title, fix import order ## How was this patch tested? format only ## Does this PR introduce _any_ user-facing changes? commit 6783cca57b3e74f0c11a6e93a3eaeaeca8651bfa Author: Cuong Nguyen Date: Wed Nov 13 16:35:38 2024 -0800 [Spark] Pass catalog table to DeltaLog API call sites, part 2 (#3804) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Fix a number of code paths where we want to pass table identifier to the commit coordinator client via the update method of DeltaLog ## How was this patch tested? unit tests ## Does this PR introduce _any_ user-facing changes? No commit fbdd347148e5d4ca1341acb1348b2d9d97f3d499 Author: jintao shen Date: Tue Nov 12 14:36:02 2024 -0800 [Spark]Add OPTIMIZE FULL history support (#3852) #### Which Delta project/connector is this regarding? 
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Add the isFull operation parameter to the commit history for the command OPTIMIZE tb FULL. ## How was this patch tested? Existing unit tests. ## Does this PR introduce _any_ user-facing changes? No commit 4f5431306edb9287cd7b8e79846b0e9c8bf7e12b Author: Cuong Nguyen Date: Tue Nov 12 11:22:38 2024 -0800 [Spark] Pass table catalogs throughout DeltaLog (#3863) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Previously, we passed the table identifier to the DeltaLog's constructor and methods to use. To future-proof this, we should pass the whole catalog table instead. ## How was this patch tested? Unit test ## Does this PR introduce _any_ user-facing changes? No commit 7a8fdd30f2f7ba614711b5bad06cac25dbbc2333 Author: Johan Lasperas Date: Tue Nov 12 19:16:38 2024 +0100 [Spark] Ignore internal metadata when detecting schema changes in Delta source (#3849) ## Description When reading from a Delta streaming source with schema tracking enabled - by specifying `schemaTrackingLocation` - internal metadata in the table schema causes a schema change to be detected. This is especially problematic for identity columns that track the current high-water mark for ids as metadata in the table schema and update it on every write, causing streams to repeatedly fail and requiring a restart. This change addresses the issue by ignoring internal metadata fields when detecting schema changes. A flag is added to revert to the old behavior if needed. ## How was this patch tested? Added test case covering problematic use case with both fix enabled and disabled.
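The idea of ignoring internal metadata when comparing schemas can be sketched in plain Scala. This is a minimal illustration, not the code from the commit above: `Field`, `internalKeys`, and `hasSchemaChanged` are hypothetical names, the schema is modeled without Spark types, and the single key listed in `internalKeys` is an assumption about which metadata counts as internal.

```scala
// Hypothetical sketch; not Delta's implementation.
object SchemaChangeDetection {
  // Simplified stand-in for a schema field: name, type name, string metadata.
  final case class Field(name: String, dataType: String, metadata: Map[String, String])

  // Assumed set of internal metadata keys (illustrative only).
  val internalKeys: Set[String] = Set("delta.identity.highWaterMark")

  // Drop internal metadata entries so they do not take part in the comparison.
  private def stripInternal(f: Field): Field =
    f.copy(metadata = f.metadata.filter { case (k, _) => !internalKeys.contains(k) })

  // Two schemas differ only if they still differ after internal metadata is removed.
  def hasSchemaChanged(before: Seq[Field], after: Seq[Field]): Boolean =
    before.map(stripInternal) != after.map(stripInternal)
}
```

With this comparison, a write that only advances an identity column's high-water mark does not register as a schema change, while a change to a field's name or type still does.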
commit f4a427db796bce7bdfefc7ed77686132fb5434d6 Author: Johan Lasperas Date: Tue Nov 12 18:00:25 2024 +0100 [Spark] Remove use of current_date in MERGE test (#3860) ## Description Fix a flaky test that very rarely fails, with the actual result being off by one day compared to the expected result. Turns out it calls `current_date()` separately to compute the input data and the expected result. ## How was this patch tested? ~Run the test around midnight~ (as a thought experiment) commit cf64afaafe40fa523d42ad2f6ad1c8558b466be9 Author: Christos Stavrakakis Date: Tue Nov 12 18:00:16 2024 +0100 [Spark] Detect opaque URIs and throw proper exception (#3870) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description An opaque URI is an absolute URI whose scheme-specific part does not begin with a slash character ('/'), and is not further parsed by the `java.net.URI` library (see https://docs.oracle.com/javase/7/docs/api/java/net/URI.html): ``` val uri = new URI("http:example.com") uri.isOpaque -> true uri.isAbsolute -> true uri.getPath -> null ``` This causes issues when we try to call path-related methods on the URIs, e.g.: ``` val filePath = new Path(uri) filePath.toString -> "http:" filePath.isAbsolute -> NullPointerException ``` This commit fixes this issue by detecting such URIs in Delta file actions and throwing a proper exception. ## How was this patch tested? Add new UT. ## Does this PR introduce _any_ user-facing changes? No commit fcc3d9bc892927af74bc628daab81347f79afd7f Author: Cuong Nguyen Date: Mon Nov 11 18:36:20 2024 -0800 [Spark] Pass catalog table to DeltaLog API call sites (#3862) #### Which Delta project/connector is this regarding?
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR changes the API of the DeltaLog to take in an extra parameter for table catalog and switches some call sites (more to come) to use the new API version. Delta log API changes + Added `forTable(SparkSession, CatalogTable, Map[String, String])` + Added `forTableWithSnapshot(SparkSession, CatalogTable, Map[String, String])` + Modified `withFreshSnapshot` to take in a catalog table. ## How was this patch tested? Unit tests ## Does this PR introduce _any_ user-facing changes? No commit 860438f97bc209296f6b55ab5311d7f3d2bf6d25 Author: Andreas Chatzistergiou <93710326+andreaschat-db@users.noreply.github.com> Date: Mon Nov 11 19:08:14 2024 +0100 [Spark] Fast Drop Feature Command (#3867) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This is the base PR for the Fast Drop feature. This is a new implementation of the DROP FEATURE command that requires no waiting time and no history truncation. The main difficulty when dropping a feature is that after the operation is complete the history of the table still contains traces of the feature. This may cause the following problems: 1. Reconstructing the state of the latest version may require replaying log records prior to feature removal. Log replay is based on checkpoints, an auxiliary data structure, which is used by clients as a starting point for replaying history. Any actions before the checkpoint do not need to be replayed. However, checkpoints are not permanent and may be deleted any time. 2. Clients may create checkpoints in historical versions when they do not support the required features. The proposed solution is `CheckpointProtectionTableFeature`. This is a new writer feature that ensures that the entire history until a certain table version, V, can only be cleaned up in its entirety or not at all.
Alternatively, the writer can delete commits and associated checkpoints up to any version (less than V) as long as it validates against all protocols included in the commits/checkpoints it plans to remove. We protect against the anomalies above as follows: - All checkpoints before the transition table version are protected. This prevents anomaly (1) by turning checkpoints into reliable barriers that can hide unsupported log records behind them. - Because log cleanup is disabled for the older versions, this also removes the only reason why writers would create new checkpoints, preventing anomaly (2). This still uses a writer feature, but is a step forward compared to previous solutions because it allows the table to be readable by older clients immediately, instead of after 24 hours. Compatibility with older writers can subsequently be achieved by truncating the history after a 24-hour waiting period. ## How was this patch tested? Added `DeltaFastDropFeatureSuite` as well as tests in `DeltaProtocolTransitionsSuite`. ## Does this PR introduce _any_ user-facing changes? No. commit e65d06e00d6a8fef5505221872d7c59633d24829 Author: Fred Storage Liu Date: Sat Nov 9 13:30:08 2024 -0800 add retry logic for delta uniform iceberg conversion (#3856) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description add retry logic for delta uniform iceberg conversion ## How was this patch tested? ## Does this PR introduce _any_ user-facing changes? commit 95d493cc85f03c8086e44d4b9f9376c9013cbbb3 Author: Dhruv Arya Date: Fri Nov 8 14:54:23 2024 -0800 [Spark] Validate computed state against checksum on checkpoint (#3846) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Follow up for https://github.com/delta-io/delta/pull/3828. This PR adds checksum validation logic.
On every checkpoint, we will take the computed state of the table as per the deltas and the previous checkpoint and compare it against the checksum that was written at that version. The same methods can potentially be used to validate more frequently (if needed). ## How was this patch tested? Added a new test case in ChecksumSuite that tests that all logically corrupted fields are being caught by the validation logic. ## Does this PR introduce _any_ user-facing changes? No commit f4dbc9b6091a27d1901f96ef86a3f1c8437b9d23 Author: Fred Storage Liu Date: Fri Nov 8 10:46:03 2024 -0800 Revert "Skip getFileStatus call during iceberg to delta clone" (#3855) This reverts commit a2ba9e9159725b44be08a7319d7159cb0cb78d88. The approach needs to be revisited and prepared in separate PR later. #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description ## How was this patch tested? ## Does this PR introduce _any_ user-facing changes? commit 6257799a25e25602115cdfcb1b1f4d36e458e6b7 Author: Lars Kroll Date: Fri Nov 8 18:01:34 2024 +0100 [Spark] Accept generated columns of nested types that differ in nullability (#3859) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description - Changes the behaviour of generated columns over nested types to allow nullability mismatches like we do for top-level types by switching exact data type equal to using `equalsIgnoreNullability` instead. - Additionally, improve the error message in the case where the SQL string is identical (which hopefully shouldn't happen anymore after this, but just in case this comes up again later). ## How was this patch tested? - Added new tests for various type (mis-) matches. ## Does this PR introduce _any_ user-facing changes? 
Yes: Generated columns now allow the type of the column definition to differ by nullability from the type of the generating expression, even when the type is a nested type. commit e8d09ce11130385b59062e5b81547b644b187a1d Author: Hao Jiang Date: Thu Nov 7 14:54:41 2024 -0500 [Hudi] Disable a flaky UniForm Hudi unit test (#3854) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Disable a flaky UniForm Hudi Unit test ## How was this patch tested? Change to UT ## Does this PR introduce _any_ user-facing changes? No commit 6118f3fd4bf4cb32abd5ebeff1e74d3d9656bd86 Author: Zhipeng Mao Date: Thu Nov 7 20:47:51 2024 +0100 [SPARK] Add metric for DV-related errors (#3858) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description It wraps two `RoaringBitmapArray` APIs, `readFrom` and `serializeAsByteArray`, with two utility functions, `DeletionVectorUtils.serialize` and `DeletionVectorUtils.deserialize`, which log a delta event when (de-)serialization fails, to enable monitoring and debugging of the error. ## How was this patch tested? Log-only change. ## Does this PR introduce _any_ user-facing changes? No. commit e3d8ae26f8ea22d5e4a9172c19b5ed444f0106c2 Author: Rajesh Parangi <89877744+rajeshparangi@users.noreply.github.com> Date: Thu Nov 7 09:56:48 2024 -0800 Implement parser changes for Lite Vacuum (#3857) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR implements parser related changes for Lite Vacuum. As part of this, the overall Vacuum syntax has been changed to accept vacuum parameters in any order, unlike the fixed order before. This was one of the recommendations from the SQL committee. Additionally, the PR contains some test related changes. ## How was this patch tested?
Added new tests ## Does this PR introduce _any_ user-facing changes? No commit dc402d35f2bba18a055a858b9db54c90f81f08d2 Author: Jungtaek Lim Date: Fri Nov 8 01:40:52 2024 +0900 [Spark] Reflect a new parameter in the constructor of LogicalRelation for Spark 4.0 (#3847) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR proposes to reflect a new parameter in the constructor of LogicalRelation for Spark 4.0. This PR deals with it via creating a shim object like we have done with IncrementalExecution for parameters difference. There are overloaded methods to create a LogicalRelation instance and Spark community tried to provide the new method to avoid this change, but Spark couldn't just add a new overloaded method to workaround as it was ambiguous with existing method with default param. ## How was this patch tested? Existing tests. ## Does this PR introduce _any_ user-facing changes? No. commit 520c8e8bb288604ff044ac1f593bedd6aac6a5ca Author: Paddy Xu Date: Thu Nov 7 17:39:45 2024 +0100 [Spark] Add `DeltaTable.addFeatureSupport` API to PySpark (#3786) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR introduces a `DeltaTable.addFeatureSupport` API which was missing in PySpark. This API is used to add support of a table feature to a given Delta table. ## How was this patch tested? A new test is added. ## Does this PR introduce _any_ user-facing changes? Yes. See the above `Description` section. commit cb352c24809097021ad0b2811e419a2e01841cde Author: Chirag Singh <137233133+chirag-s-db@users.noreply.github.com> Date: Wed Nov 6 10:44:22 2024 -0800 [Spark] Support partition-like data filters (#3831) #### Which Delta project/connector is this regarding? 
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Given an arbitrary data skipping expression, we can skip a file if: 1. For all of the referenced attributes in the expression, the collected min and max values are equal AND there are 0 nulls on that column. AND 2. The data skipping expression (when evaluated on the collected min==max values on all referenced attributes) evaluates to false. This PR adds support for some of these expressions with the following limitations: 1. The table must have >= 100 files (this is to ensure that the added data skipping expressions do not regress performance for small tables that won't have many files with the same min-max value). 2. The table must be a clustered table and all referenced attributes must be clustering columns (we use this heuristic to avoid adding extra complexity to data skipping for expressions that won't be able to filter out many files). 3. The expression must not reference a Timestamp-type column. Because stats on timestamp columns are truncated to millisecond precision, we can't safely assume that the min and max value for a timestamp column are the same (even if the collected stats are the same). Because timestamp is generally quite high cardinality, it should anyways be relatively rare that the min and max value for a file are equal for the timestamp column. One more minor nuance: There's one more case where the collected stats differ from the behavior of partitioned tables - a truncated string. However, if a string value is truncated to the first 32 characters, then the collected max value for the string will not be equal to the collected min value (as one or more tiebreaker character(s) will be appended to the collected max value). As a result, it should be sufficient to validate equality, since for any truncated string, the min and max value will not be equal. ## How was this patch tested? See new test.
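The two-part skipping rule from the Description above can be sketched in plain Scala. This is a minimal illustration under stated assumptions, not the code from this PR: `ColumnStats`, `FileStats`, and `canSkipFile` are hypothetical names, column values are modeled as `Long`, and the data skipping expression is a plain function rather than a Catalyst expression. The limitations listed above (file count, clustering columns, timestamps) are not modeled.

```scala
// Hypothetical sketch of partition-like data skipping; not Delta's implementation.
object PartitionLikeSkipping {
  // Per-file statistics for one column (names are assumptions for illustration).
  final case class ColumnStats(min: Long, max: Long, nullCount: Long)

  // Statistics for one data file, keyed by column name.
  final case class FileStats(columns: Map[String, ColumnStats])

  /**
   * Returns true if the file can be skipped:
   *  1. every referenced column has min == max and zero nulls, and
   *  2. the predicate evaluates to false on those single values.
   */
  def canSkipFile(
      stats: FileStats,
      referenced: Seq[String],
      predicate: Map[String, Long] => Boolean): Boolean = {
    // Keep only the referenced columns that hold a single non-null value in this file.
    val constants = referenced.flatMap { name =>
      stats.columns.get(name).collect {
        case ColumnStats(min, max, 0L) if min == max => name -> min
      }
    }
    // Skipping is only safe when all referenced columns are constant within the file.
    constants.size == referenced.size && !predicate(constants.toMap)
  }
}
```

For example, a file whose `year` column has min = max = 2024 and no nulls can be skipped for a predicate such as `year % 2 = 1`, whereas a file with a range of values on a referenced column is never skipped by this rule.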
## Does this PR introduce _any_ user-facing changes? No. commit 235ce96c31641cdbfa157e94c84a6ae2a2d07bc2 Author: Christos Stavrakakis Date: Wed Nov 6 17:30:39 2024 +0100 [Spark] Fix wrong entry in COLUMN_MAPPING_METADATA_KEYS (#3848) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Commit d07a7bd50 introduced a bug in `COLUMN_MAPPING_METADATA_KEYS`: The list should contain `PARQUET_FIELD_NESTED_IDS_METADATA_KEY` and not `PARQUET_MAP_VALUE_FIELD_NAME`. This commit fixes this issue. ## How was this patch tested? UTs. ## Does this PR introduce _any_ user-facing changes? No commit c60437b98ba4d26434d4a4d966cbba26db31b5c1 Author: Jiaheng Tang Date: Tue Nov 5 10:28:05 2024 -0800 [INFRA] Add structured logging linter script to github workflow (#3840) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Infra) ## Description Add a python linter script to enforce new code to adhere to structured logging. There is a similar effort in Spark: https://github.com/apache/spark/pull/47239. ## How was this patch tested? github workflow passed. ## Does this PR introduce _any_ user-facing changes? No commit 796f518f889f0cecb1fcb15f02d6607af8122456 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Tue Nov 5 17:27:54 2024 +0100 [Spark] Move some ICT test helper utils to InCommitTimestampTestUtils.scala (#3843) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description For code hygiene, we move some ICT test helper utils from the `ICTSuite` to the object `InCommitTimestampTestUtils`, which is dedicated to contain the ICT test utils. ## How was this patch tested? Existing UTs. ## Does this PR introduce _any_ user-facing changes? No. 
commit 6ae4b62845ed579bb5a19f4646831c4ee2931c02 Author: Scott Sandre Date: Fri Nov 1 11:28:53 2024 -0700 [Kernel] [CC Refactor #2] Add `TableDescriptor` and `CommitCoordinatorClient` API (#3797) This is a stacked PR. Please view this PR's diff here: - https://github.com/scottsand-db/delta/compare/delta_kernel_cc_1...delta_kernel_cc_2 #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description Adds new `TableDescriptor` and `CommitCoordinatorClient` API. Adds a new `getCommitCoordinatorClient` API to the `Engine` (with a default implementation that throws an exception). ## How was this patch tested? N/A trivial. ## Does this PR introduce _any_ user-facing changes? Yes. See the above. commit 024dadbb425e8edb862b425d5857a18bcb0d7045 Author: Scott Sandre Date: Thu Oct 31 19:02:33 2024 -0700 [Kernel] [CC Refactor #1] Add `TableIdentifier` API (#3795) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description - adds a new `TableIdentifier` class, that kernel will pass on to Commit Coordinator Client - adds a new `Table::forPathWithTableId(engine, path, tableId)` interface - the tableId is stored as an `Optional` in the `Table`, and this PR does **not** propagate that value into SnapshotManager, Snapshot, etc. Future PRs can take care of that. ## How was this patch tested? TableIdentifier UTs ## Does this PR introduce _any_ user-facing changes? Yes. See the above. commit 97655288414ac60ce31ccf0d9c4500b5b84acb36 Author: Charlene Lyu Date: Thu Oct 31 16:03:47 2024 -0700 [Sharing][TEST ONLY] Reformat DeltaFormatSharingSourceSuite (#3811) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Sharing) ## Description No actual code change, format only change. ## How was this patch tested? Test only change. 
## Does this PR introduce _any_ user-facing changes? commit 9f452812a1fd2e6e049e2b4bb6bf75eded2330d3 Author: Dhruv Arya Date: Thu Oct 31 11:47:45 2024 -0700 [Spark][Version Checksum] Incrementally compute the checksum (#3828) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description https://github.com/delta-io/delta/pull/3799 added the capability to write a Checksum file after every commit. However, writing a checksum currently requires a full state reconstruction, which is expensive. This PR adds the capability to compute most of the fields incrementally (apply the current delta on top of the last checksum to get the checksum of the current version). This works as long as the actual operation performed matches exactly with the specified operation type in the commit. Note that this feature is gated behind a flag that is `true` by default. ## How was this patch tested? Added tests in ChecksumSuite. ## Does this PR introduce _any_ user-facing changes? No commit 1eff5df1e70a3e78a2801174a7e7b90cd1c91bfd Author: Jungtaek Lim Date: Thu Oct 31 13:47:24 2024 +0900 [Spark] Replace the default pattern matching for LogicalRelation to LogicalRelationWithTable (#3805) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR proposes to replace the default pattern matching for `LogicalRelation` with the newly introduced pattern object `LogicalRelationWithTable`, which will be available in the upcoming Spark 4.0. This change means the project has to modify fewer pieces of code when Spark changes `LogicalRelation`; most pattern matches on LogicalRelation only extract the relation and catalog table, hence they fit with LogicalRelationWithTable. ## How was this patch tested? Existing tests would suffice. ## Does this PR introduce _any_ user-facing changes? No.
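For the incremental checksum change (#3828) described above, the idea of applying the current delta on top of the last checksum might look like the sketch below. It is an assumption-laden illustration in Python: `Checksum`, `FileAction`, `apply_commit`, and the two tracked fields are hypothetical names, not the fields or API of the Spark implementation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Checksum:
    """Summary of table state at one version (illustrative fields only)."""
    num_files: int
    table_size_bytes: int


@dataclass(frozen=True)
class FileAction:
    """One file-level action in a commit (hypothetical shape)."""
    kind: str        # "add" or "remove"
    size_bytes: int


def apply_commit(previous: Checksum, actions: Iterable[FileAction]) -> Checksum:
    """Derive the next checksum from the previous one plus the commit's
    actions, without reconstructing the full table state."""
    num_files = previous.num_files
    size = previous.table_size_bytes
    for action in actions:
        if action.kind == "add":
            num_files += 1
            size += action.size_bytes
        elif action.kind == "remove":
            num_files -= 1
            size -= action.size_bytes
        else:
            raise ValueError(f"unknown action kind: {action.kind}")
    return Checksum(num_files, size)


# Usage: one file added and one removed on top of the last checksum.
last = Checksum(num_files=10, table_size_bytes=1000)
print(apply_commit(last, [FileAction("add", 200), FileAction("remove", 50)]))
```

The cost is proportional to the number of actions in the commit rather than to the size of the table, which is the motivation the description gives for avoiding a full state reconstruction.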
--------- Signed-off-by: Jungtaek Lim commit 010a44ca4d2f955e8253122a15b544abc201c451 Author: ChengJi-db Date: Wed Oct 30 10:06:10 2024 -0700 [Delta] Extend signature for checkpoint function (#3830) ## Description A minor refactor for checkpoint function * Change the signature from `checkpoint(snapshotToCheckpoint, tableIdentifierOpt: Option[TableIdentifier] = None)` to `checkpoint(snapshotToCheckpoint, catalogTable: Option[CatalogTable] = None)` ## How was this patch tested? Existing UTs ## Does this PR introduce _any_ user-facing changes? No commit 959765ac2da2a346a28bde633c97931e9183811a Author: jintao shen Date: Wed Oct 30 10:04:53 2024 -0700 [Spark] Support OPTIMIZE tbl FULL for clustered table (#3793) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description 1. Add new sql syntax OPTIMIZE tbl FULL 2. Implemented OPTIMIZE tbl FULL to re-cluster all data in the table. ## How was this patch tested? new unit tests added ## Does this PR introduce _any_ user-facing changes? Yes Previously clustered table won't re-cluster data that was clustered against different cluster keys. With OPTIMIZE tbl FULL, they will be re-clustered against the new keys. commit 0c916e02d81bd3d64f19e34a9cbe3408dcb03c4b Author: Zhipeng Mao Date: Wed Oct 30 17:23:57 2024 +0100 [SPARK] Match collated StringType in TypeWideningMetadata (#3832) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Spark master introduced collated `StringType`. In `TypeWideningMetadata.collectTypeChanges`, we used to only match StringType instance, therefore if the type change is from/to `StringType(collationId)`, which is expected and unrelated to type widening, we will throw an error `typeWidening.unexpectedTypeChange`. This PR fixes this by matching both collated and uncollated `StringType` when capturing string type changes. 
## How was this patch tested? Existent tests. Tests related to collations cannot be added as delta does not support collations yet. ## Does this PR introduce _any_ user-facing changes? No. commit c80ba4f1e1f8f18ff6a12401973651669f97c26c Author: Lin Zhou <87341375+linzhou-db@users.noreply.github.com> Date: Wed Oct 30 08:39:33 2024 -0700 Add additional logging in DeltaFormatSharingSource (#3826) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (Sharing) ## Description Add additional logging in DeltaFormatSharingSource ## How was this patch tested? Unit Test ## Does this PR introduce _any_ user-facing changes? No. commit 051f85a4d0a44d745252b9a5eb8c1aa6384a998f Author: Fred Storage Liu Date: Tue Oct 29 16:02:51 2024 -0700 Revert "Write Int64 by default for Timestamps" (#3827) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Revert "Write Int64 by default for Timestamps" ## How was this patch tested? UT commit 285ed6bd2434ac760dba61399a8156ecdb4c1c47 Author: Fred Storage Liu Date: Tue Oct 29 13:06:24 2024 -0700 Skip getFileStatus call during iceberg to delta clone (#3825) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description use snapshot time as the Delta AddFile modificationTime ## How was this patch tested? UT ## Does this PR introduce _any_ user-facing changes? No commit 0d960ada11ef48f4325313cb1198e08806fea55c Author: kamcheungting-db <91572897+kamcheungting-db@users.noreply.github.com> Date: Tue Oct 29 10:28:21 2024 -0700 [Spark][Table Redirect] No Redirect Rules (#3818) #### Which Delta project/connector is this regarding? 
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR implements the `No Redirect Rules` component on table redirect features and integrate this functionality to ReadClone. ## How was this patch tested? Test: OSS PathBasedRedirect No Redirect Rules, ## Does this PR introduce _any_ user-facing changes? No commit 43853d0c8153fba78a18fb3f2545256b4d4895c5 Author: Lin Zhou <87341375+linzhou-db@users.noreply.github.com> Date: Mon Oct 28 17:17:46 2024 -0700 Upgrade delta-sharing-client to 1.2.1 (#3820) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (Sharing) ## Description Upgrade delta-sharing-client to 1.2.1 in DBR master, see the changes in https://github.com/delta-io/delta-sharing/releases/tag/v1.2.1, supporting EndStreamAction, and additional logging. ## How was this patch tested? Unit Tests ## Does this PR introduce _any_ user-facing changes? No commit b86500d44d37a2ffe112c650656b4d31ee8ea5c4 Author: Jiaheng Tang Date: Mon Oct 28 17:17:18 2024 -0700 [Spark] Update more logs to use structured logging (#3816) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description As title ## How was this patch tested? Existing tests. ## Does this PR introduce _any_ user-facing changes? No commit ab5a9080be96a286c24bfb1082c6639945102683 Author: kamcheungting-db <91572897+kamcheungting-db@users.noreply.github.com> Date: Mon Oct 28 10:47:24 2024 -0700 [Spark] Introduce Redirect WriterOnly Feature (#3813) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR introduces a new writer-only table feature "redirection". 
This table feature would redirect the read and write query from the current storage location to a new storage location described inside the value of table feature. ## How was this patch tested? ## Does this PR introduce _any_ user-facing changes? No. commit 6599b406cf28768925950efb0cbdbb52b3f21e7b Author: Johan Lasperas Date: Mon Oct 28 17:09:56 2024 +0100 [Spark] Add INSERT tests with missing, extra, reordered columns/fields (#3762) ## Description Follow on https://github.com/delta-io/delta/pull/3605 Adds more tests covering behavior for all ways of running insert with: - an extra column or struct field in the input, in `DeltaInsertIntoSchemaEvolutionSuite` - a missing column or struct field in the input, in `DeltaInsertIntoImplicitCastSuite` - a different column or field ordering than the table schema, in `DeltaInsertIntoColumnOrderSuite` Note: tests are spread across multiple suites as each test case covers 20 different ways to run inserts, quickly leading to large test suites. This change includes improvements to `DeltaInsertIntoTest`: - Group all types of inserts into categories that are easier to reference in tests: - SQL vs. Dataframe inserts - Position-based vs. name-based inserts - Append vs. overwrite - Provide a mechanism to ensure that each test covers all existing insert types. ## How was this patch tested? N/A: test only commit 223894f119db04d01ff11ccccce000179eab3799 Author: Yuya Ebihara Date: Tue Oct 29 00:37:59 2024 +0900 [Kernel] Remove unused code from Path (#3815) Remove unused code from `Path` class i commit 30d7356125168be8afad8efd8309069a3b054185 Author: kamcheungting-db <91572897+kamcheungting-db@users.noreply.github.com> Date: Fri Oct 25 19:38:40 2024 -0700 [Table Redirect] Introduce Redirect ReaderWriter Feature (#3812) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR introduces a new reader-writer table feature "redirection". 
This table feature would redirect the read and write query from the current storage location to a new storage location described inside the value of table feature. The redirection has several phases to ensure no anomaly. To label these phases, we introduce four states: 0. NO-REDIRECT: This state indicates that redirect is not enabled on the table. 1. ENABLE-REDIRECT-IN-PROGRESS: This state indicates that the redirect process is still going on. No DML or DDL transaction can be committed to the table when the table is in this state. 2. REDIRECT-READY: This state indicates that the redirect process is completed. All types of queries would be redirected to the table specified inside RedirectSpec object. 3. DROP-REDIRECT-IN-PROGRESS: The table redirection is under withdrawal and the redirection property is going to be removed from the delta table. In this state, the delta client stops redirecting new queries to redirect destination tables, and only accepts read-only queries to access the redirect source table. To ensure no undefined behavior, the valid procedures of state transition are: 0. NO-REDIRECT -> ENABLE-REDIRECT-IN-PROGRESS 1. ENABLE-REDIRECT-IN-PROGRESS -> REDIRECT-READY 2. REDIRECT-READY -> DROP-REDIRECT-IN-PROGRESS 3. DROP-REDIRECT-IN-PROGRESS -> NO-REDIRECT 4. ENABLE-REDIRECT-IN-PROGRESS -> NO-REDIRECT The protocol RFC document is on: https://github.com/delta-io/delta/issues/3702 ## How was this patch tested? Unit Test of transition between different states of redirection. ## Does this PR introduce _any_ user-facing changes? No commit d0e6b96491c2a495961a58328264e1440136c220 Author: kamcheungting-db <91572897+kamcheungting-db@users.noreply.github.com> Date: Fri Oct 25 16:19:02 2024 -0700 Revert "[Table Redirect] Introduce Redirect ReaderWriter Feature (#38… (#3810) #### Which Delta project/connector is this regarding?
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This reverts commit f6a771d7a8e078c6423d81d6837ea5ff12366b7c. The commit contains some problematic comments. ## How was this patch tested? N.A. ## Does this PR introduce _any_ user-facing changes? No. commit 5b06dc816e20530551a09148ef7421407bc78e67 Author: Scott Sandre Date: Fri Oct 25 14:44:03 2024 -0700 [Kernel] Remove `engine` from `TableConfig` APIs (#3808) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description Remove `engine` references from `TableConfig` APIs. We don't actually need it, and it muddles our APIs, making us pass `engine` references everywhere. Clean up code along the way. ## How was this patch tested? Existing UTs. ## Does this PR introduce _any_ user-facing changes? No. commit fe1df605ba0345b66677acff521022982cae615c Author: Dhruv Arya Date: Fri Oct 25 14:19:45 2024 -0700 [Spark] Write a checksum after every commit (#3799) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR adds a `ChecksumHook` which is responsible for a writing a checksum (See https://github.com/delta-io/delta/pull/3777) of the current table state after every commit. This is guarded behind a flag which is `false` by default. Currently, every checksum write will trigger a full state reconstruction, which can be very expensive. An upcoming PR will try to make this checksum computation incremental so that we don't have to pay a performance penalty. ## How was this patch tested? Added a new suite --- ChecksumSuite. ## Does this PR introduce _any_ user-facing changes? 
No commit f6a771d7a8e078c6423d81d6837ea5ff12366b7c Author: kamcheungting-db <91572897+kamcheungting-db@users.noreply.github.com> Date: Fri Oct 25 08:38:09 2024 -0700 [Table Redirect] Introduce Redirect ReaderWriter Feature (#3801) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR introduces a new reader-writer table feature "redirection". This table feature would redirect the read and write query from the current storage location to a new storage location described inside the value of table feature. The redirection has several phases to ensure no anomaly. To label these phases, we introduce four states: 0. NO-REDIRECT: This state indicates that redirect is not enabled on the table. 1. ENABLE-REDIRECT-IN-PROGRESS: This state indicates that the redirect process is still going on. No DML or DDL transaction can be committed to the table when the table is in this state. 2. REDIRECT-READY: This state indicates that the redirect process is completed. All types of queries would be redirected to the table specified inside RedirectSpec object. 3. DROP-REDIRECT-IN-PROGRESS: The table redirection is under withdrawal and the redirection property is going to be removed from the delta table. In this state, the delta client stops redirecting new queries to redirect destination tables, and only accepts read-only queries to access the redirect source table. To ensure no undefined behavior, the valid procedures of state transition are: 0. NO-REDIRECT -> ENABLE-REDIRECT-IN-PROGRESS 1. ENABLE-REDIRECT-IN-PROGRESS -> REDIRECT-READY 2. REDIRECT-READY -> DROP-REDIRECT-IN-PROGRESS 3. DROP-REDIRECT-IN-PROGRESS -> NO-REDIRECT 4. ENABLE-REDIRECT-IN-PROGRESS -> NO-REDIRECT The protocol RFC document is on: https://github.com/delta-io/delta/issues/3702 ## How was this patch tested? Unit Test of transition between different states of redirection.
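The four states and five valid transitions listed above form a small state machine, which can be sketched as below. This is only an illustration in Python: `RedirectState`, `VALID_TRANSITIONS`, and `is_valid_transition` are hypothetical names, and only the state labels and the allowed transitions come from the description.

```python
from enum import Enum


class RedirectState(Enum):
    NO_REDIRECT = "NO-REDIRECT"
    ENABLE_REDIRECT_IN_PROGRESS = "ENABLE-REDIRECT-IN-PROGRESS"
    REDIRECT_READY = "REDIRECT-READY"
    DROP_REDIRECT_IN_PROGRESS = "DROP-REDIRECT-IN-PROGRESS"


# Allowed transitions, copied from the list in the description.
VALID_TRANSITIONS = {
    RedirectState.NO_REDIRECT: {RedirectState.ENABLE_REDIRECT_IN_PROGRESS},
    RedirectState.ENABLE_REDIRECT_IN_PROGRESS: {
        RedirectState.REDIRECT_READY,
        RedirectState.NO_REDIRECT,  # enabling can be abandoned
    },
    RedirectState.REDIRECT_READY: {RedirectState.DROP_REDIRECT_IN_PROGRESS},
    RedirectState.DROP_REDIRECT_IN_PROGRESS: {RedirectState.NO_REDIRECT},
}


def is_valid_transition(current: RedirectState, target: RedirectState) -> bool:
    """True when moving from `current` to `target` is one of the listed transitions."""
    return target in VALID_TRANSITIONS[current]


# Usage: a table cannot jump straight from NO-REDIRECT to REDIRECT-READY.
print(is_valid_transition(RedirectState.NO_REDIRECT, RedirectState.REDIRECT_READY))
```

Keeping the transitions in one table makes the asymmetry easy to see: ENABLE-REDIRECT-IN-PROGRESS is the only state with two exits, and REDIRECT-READY can only be left through DROP-REDIRECT-IN-PROGRESS.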
## Does this PR introduce _any_ user-facing changes? No commit 85db7e471388a372e9529ecd3ed49d44203a503b Author: Jiaheng Tang Date: Thu Oct 24 12:39:22 2024 -0700 [Spark] Clean up log keys (#3802) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Rename a few log keys to make the intention more clear. Also remove unused log keys. ## How was this patch tested? Existing tests. ## Does this PR introduce _any_ user-facing changes? No commit ebdb35b68e455c4e0ed6c5a361b8650ef3a10f9f Author: Cuong Nguyen Date: Thu Oct 24 09:40:29 2024 -0700 [Spark] Refactor CommitFailedException to allow passing in an inner exception (#3794) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Refactor `CommitFailedException` to allow for passing in a root cause exception so that we don't lose its stack trace ## How was this patch tested? Simulate exceptions in unit tests and observe the output exception ## Does this PR introduce _any_ user-facing changes? No commit e3dd5b0f4b20c527bc90055893bf186d5e59b8d0 Author: Nils Andre Date: Thu Oct 24 14:11:51 2024 +0100 Fix typo in PROTOCOL.md (#3790) Co-authored-by: R. Tyler Croy commit 4f96aa140917ee6971fb96770d7921b218e983be Author: Sumeet Varma Date: Tue Oct 22 14:43:07 2024 -0700 [Spark] Add utils to find logPath from dataPath in CoordinatedCommitsUtils (#3779) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Add utils to find logPath from dataPath in CoordinatedCommitsUtils ## How was this patch tested? Build ## Does this PR introduce _any_ user-facing changes? 
No commit 8442bc60da5ac45d13e76fb0cd59aacfc0c447ad Author: Yuya Ebihara Date: Tue Oct 22 00:01:59 2024 +0900 [Kernel] Minor cleanup (#3783) commit e3ec6771f1ce290f0aa77902e87748136dd86110 Author: Yumingxuan Guo Date: Wed Oct 16 13:04:22 2024 -0700 [Delta] Update Coordinated Commits Usage Logs (#3774) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description 1. Renames the full error message with - `"exceptionString" -> exceptionString(e)` to record the full error message; - `"exceptionClass" -> e.getClass.getName` to retain the error class information. 2. Adds field `registeredCommitCoordinators`. 3. Removes nested JSONizing. ## How was this patch tested? Existed UTs. ## Does this PR introduce _any_ user-facing changes? No. commit cb7e27ffe0ff2d91d90b6b6b1aaa3e471ad4252c Author: Sumeet Varma Date: Tue Oct 15 15:23:48 2024 -0700 [Spark] Writing of UUID commits should not use put-if-absent semantics (#3765) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR fixes the coordinated commits utils to not write UUID-based commit files with put-if-absent semantics. This is not necessary because we assume that UUID-based commit files are globally unique so we will never have concurrent writers attempting to write the same commit file. DynamoDBCommitCoordinator also now uses the utils for writing backfilled files. ## How was this patch tested? Existing tests are sufficient as this only affects how a commit is written in the underlying storage layer but does not change any logic in Delta Spark. ## Does this PR introduce _any_ user-facing changes? 
No commit 5d2a275b55fa07930a51034239c6e33972a5bc90 Author: Rajesh Parangi <89877744+rajeshparangi@users.noreply.github.com> Date: Fri Oct 11 15:24:15 2024 -0700 Implementation for Lite Vacuum without parser changes (#3757) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR implements the logic of Lite Vacuum. Since the user surface is not defined yet, all of the code is behind a spark conf which is set to false by default. ## How was this patch tested? Modified existing tests to run in both Lite and full mode. Additionally, added new tests cases specific to Lite vacuum. ## Does this PR introduce _any_ user-facing changes? NO commit 09fb300d16ad528fb47da4d0271ef6fb62c51ad3 Author: Yuya Ebihara Date: Fri Oct 11 21:28:13 2024 +0900 [Kernel] Fix typo (#3766) commit e6cd0ff2816d32f0c5b63066a96f92d5c8b54138 Author: Alden Lau Date: Thu Oct 10 12:43:03 2024 -0700 [Spark] Allow setting both allowTypeWidening and keepExistingTypes to true in mergeSchemas (#3759) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Currently, `SchemaMergingUtils.mergeSchemas` will not widen fields when both `keepExistingTypes` and `allowTypeWidening` are set to true. This change will allow types to be widened when both `keepExistingTypes` and `allowTypeWidening` are true, and in the case of non-widening type changes, the existing type will be kept. This change can be used to fall back to existing types instead of throwing a `DeltaAnalysisException` when we want to merge a schema with widened fields that may also have non-widenable type changes. ## How was this patch tested? New unit test in `SchemaUtilsSuite`.scala ## Does this PR introduce _any_ user-facing changes? 
No commit 9d4e0988933dda87507312c80592bef9b0c1ce59 Author: jackierwzhang <67607237+jackierwzhang@users.noreply.github.com> Date: Thu Oct 10 13:53:15 2024 +0800 Allow schema to be merged on non struct root type (#3745) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Expanding existing Delta utils to add schema merging of non-struct root functions ## How was this patch tested? New IT. ## Does this PR introduce _any_ user-facing changes? No. commit 9e472b70e0f7cd9616bbd2023ee4d0c29d4b11ab Author: Zhipeng Mao Date: Wed Oct 9 16:06:53 2024 +0200 [SPARK] Refactor StatsCollectionUtils.computeStats (#3756) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description The PR refactors `StatsCollectionUtils.computeStats`: - Replaces parameter `dataPath` with `deltaLog` - Replaces parameter `columnMappingMode` and `dataSchema` with `snapshot` The old fields can be accessed from the newly added fields, so it's better to not have parameters more than necessary. In order to achieve this, the PR adds a new parameter `protocolOpt` in the constructor of `InitialSnapshot` to make the `statsSchema` derived from the snapshot consistent with the desired protocol rather than the default protocol computed from the metadata in the snapshot. We also changed the name `InitialSnapshot` to `DummySnapshot` to fit its semantics more. ## How was this patch tested? It's a refactor change. Existing tests should cover this. ## Does this PR introduce _any_ user-facing changes? No. commit 8e0b133f46f641941ad15ed8cbe7c2d1cc777a5b Author: Dhruv Arya Date: Tue Oct 8 10:30:42 2024 -0700 [Spark] Make formatting consistent (#3755) #### Which Delta project/connector is this regarding? 
- [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Makes some no-op changes to make the formatting consistent after https://github.com/delta-io/delta/pull/3754. ## How was this patch tested? No logical changes. ## Does this PR introduce _any_ user-facing changes? No commit f0082a90937256c0f171b6d924a98fb4814c0ef2 Author: Fred Storage Liu Date: Tue Oct 8 06:31:30 2024 +0800 make the iceberg conversion on replace/overwrite uniform table asynchronous (#3751) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Today the iceberg conversion on replace/overwrite of a uniform table is synchronous, so if users run concurrent replaces, many will fail because a conversion of a higher version has very likely already succeeded and committed. Making the conversion asynchronous (one at a time) is a good solution that sticks with the existing behavior for all other kinds of commits. If there is a conflict, that's because a conversion of a higher version has committed, so we should not worry about the conflict. ## How was this patch tested? manual test ## Does this PR introduce _any_ user-facing changes? commit 4f7d1772ada949748c764df96ffc92d2b317ae42 Author: Cuong Nguyen Date: Mon Oct 7 14:27:34 2024 -0700 [Spark] Pass table identifier through DeltaLog API to commit coordinator (#3754) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Change all relevant APIs of `DeltaLog` to take in a table identifier and pass this identifier all the way to the commit coordinator. Update `OptimisticTransaction` to use the new API. ## How was this patch tested? New unit tests ## Does this PR introduce _any_ user-facing changes?
No commit ea7745ba9b870247c5a5d544783fd117a0406cca Author: Sumeet Varma Date: Mon Oct 7 14:01:16 2024 -0700 [Spark] Add Catalog Type to Spark Structured Logging MDC framework for Delta (#3753) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Add a new `Catalog` type to be used in Spark Structured Logging within delta-spark ## How was this patch tested? Build ## Does this PR introduce _any_ user-facing changes? No commit e70ca04e2a49b2809756339d093c165d7ad1033d Author: Johan Lasperas Date: Wed Oct 2 19:38:13 2024 +0200 [Spark] Fix type widening with char/varchar columns (#3744) ## Description Using type widening on a table that contains a char/varchar column causes the following reads to fail with `DELTA_UNSUPPORTED_TYPE_CHANGE_IN_SCHEMA`: ``` CREATE TABLE t (a VARCHAR(10), b INT); ALTER TABLE t SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true'); ALTER TABLE t ALTER COLUMN b TYPE LONG; SELECT * FROM t; [DELTA_UNSUPPORTED_TYPE_CHANGE_IN_SCHEMA] Unable to operate on this table because an unsupported type change was applied. Field cut was changed from VARCHAR(10) to STRING` ``` Type changes are recorded in the table metadata and a check on read ensures that all type changes are supported by the current implementation as attempting to read data after an unsupported type change could lead to incorrect results. CHAR/VARCHAR columns are sometimes stripped down to STRING internally, for that reason, ALTER TABLE incorrectly identify that column `a` type changed to STRING and records it in the type widening metadata. The read check in turn doesn't recognize that type change as one of the supported widening type changes (which doesn't include changes to string columns). Fix: 1. Never record char/varchar/string type changes in the type widening metadata 2. Never record unsupported type changes in the type widening metadata and log an assertion instead. 3. 
Don't fail on char/varchar/string type changes in the type widening metadata if such a type change slips through 1. This will prevent failing in case a non-compliant implementation still records a char/varchar/string type change. 4. Provide a table property to bypass the check if a similar issue happens again in the future. commit 3b15f0e435a350116793f8fa1d170f90e58afd1c Author: Andreas Chatzistergiou <93710326+andreaschat-db@users.noreply.github.com> Date: Wed Oct 2 18:56:26 2024 +0200 [Spark] Revert column mapping protocol fix (#3748) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description We revert https://github.com/delta-io/delta/commit/920f185a04382ff466e47e75240dfda48efe40d3 because it created a backward/forward compatibility issue. In the original PR we fixed an issue where, when column mapping was the only reader feature, it would not appear in the reader features set. This is primarily a memory representation issue, but it turns out the invalid protocol could also be serialized with a specific sequence of events. The protocol action has a requirement at initialization time to ensure that only protocols with version 3 have the reader features set. When we fixed the column mapping bug, the requirement was expanded to also include reader features with version 2. This can be problematic if a table was created with an old Delta version that allowed the invalid protocol to be serialized, and the table is then read with the latest Delta version. The reverse is also problematic. ## How was this patch tested? Clean revert. Existing tests. ## Does this PR introduce _any_ user-facing changes? No. commit 19c054b5b3a865a8b583c6216aaee7096188fae9 Author: Sumeet Varma Date: Tue Oct 1 15:11:44 2024 -0700 [Spark][Kernel] Add isFatal detection to LogStoreErors (#3743) #### Which Delta project/connector is this regarding?
- [x] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description Simple PR that adds `isFatal` utility that can be used in `delta-storage` ## How was this patch tested? Build ## Does this PR introduce _any_ user-facing changes? No commit 567d448420bee3e1690cb89873f15483c8e94ca5 Author: Carmen Kwan Date: Tue Oct 1 19:05:08 2024 +0200 [Protocol] Amend Row Tracking Protocol to explicitly require domainMetadata (#3740) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Protocol changes) ## Description Right now, the Delta protocol for Row tracking references the DomainMetadata table feature, but does not explicitly state domainMetadata as one of its required table feature like e.g. Clusteredtable. The code in `TableFeature.scala` accurately lists domainMetadata as one of the required table feature. This PR amends the Protocol to accurately reflect the state of the system and the remainder of the Row tracking proposition. Row Tracking cannot require writers to write DomainMetadata if it is not listing domainMetadata as one of its required feature. This is also to be consistent with how other table features handle listing their required table features. ## How was this patch tested? N/A ## Does this PR introduce _any_ user-facing changes? 'No' commit 440cc3e106e466e64d25229075dadf03522f8b4c Author: Johan Lasperas Date: Tue Oct 1 08:23:02 2024 +0200 [Spark] Nicer error when failing to read table due to type change not support… (#3728) ## Description Delta 3.2/3.3 only supports a limited subset of type changes that will become available with Delta 4.0 / Spark 4.0. This changes improves the error returned when reading a table with an unsupported type change to tell user to upgrade to Delta 4.0 in case the type change will be supported in that version. ## How was this patch tested? Added tests to cover the error path. ## Does this PR introduce _any_ user-facing changes? 
Updates error thrown on unsupported type change when reading a table. commit be191c542be7af6b82aa50bb81c46b7d9b1a5a14 Author: Juliusz Sompolski Date: Mon Sep 30 22:09:28 2024 +0200 [Spark] Fix DeltaColumnMappingSuite."column mapping batch scan should detect physical name changes" test (#3742) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description After https://github.com/apache/spark/pull/48211 this test needs to use a separate Dataframe object for the failing / succeeding test case, because LazyTry will cache the error of the failing case. ## How was this patch tested? Running the test on Spark master. ## Does this PR introduce _any_ user-facing changes? No. Co-authored-by: Julek Sompolski commit 4cf9f28467d70404f3dba99b6e486c6cd88075d9 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Mon Sep 30 19:37:29 2024 +0200 [Spark] Add isShallow argument to Clone Scala APIs (#3736) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description - Add the `isShallow` argument to the Clone Scala APIs (`clone`, `cloneAtVersion`, and `cloneAtTimestamp`). - This enables users to be able to decide the type of clone they would like from the API call. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? Yes, modified the Clone Scala APIs of `DeltaTable` commit bcdd0bc5c57bd954f4975ef3bb5fa76d44c705db Author: Hao Jiang Date: Mon Sep 30 09:56:03 2024 -0700 Remove snapshotAnalysis from TahoeLogFileIndex (#3734) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR fixes the OOM caused by SparkSession.cloneSession and TemporaryView. 
It replaces the reference of Snapshot in TahoeLogFileIndex using SnapshotDescriptor, thus remove the reference to SparkSession from TahoeLogFileIndex. ## How was this patch tested? UT ## Does this PR introduce _any_ user-facing changes? No commit a163d33538f1f898d70da24dfb2c0a0bc462ef99 Author: Johan Lasperas Date: Mon Sep 30 18:55:38 2024 +0200 [Spark] Fix stripping temp view in MERGE (#3669) ## Description We (unfortunately) allow running MERGE on views over Delta tables. We detect when the view is equivalent to a `SELECT * FROM target` and strip the plan to only keep the target logical relation. That step doesn't handle more complex views, e.g. with multiple aliases and can then leave target plans with Project nodes that can cause analysis exceptions during MERGE execution. This fix improves detecting and stripping views that are equivalent to `SELECT * FROM target` ## How was this patch tested? - Expand existing MERGE into temp views tests and check that views are correctly removed from the target plan. commit 1b4d2ca1ffa57268ed1d816907217f30aede4b46 Author: Robin Moffatt Date: Mon Sep 30 17:40:57 2024 +0100 [Flink] [docs] Add blog link for help configuring Flink SQL for Delta Lake (#3645) ## Description I wrote a blog that explains how to configure and troubleshoot the Flink connector. It might be helpful for others to include it in the doc to help them find it. ## How was this patch tested? n/a ## Does this PR introduce _any_ user-facing changes? n/a commit 6e75c236330360d6318a3812b119dc9a7b8dc2d8 Author: Lukas Rupprecht Date: Fri Sep 27 16:55:08 2024 -0700 [Spark] Makes DataSkippingReader encoders lazy to prevent initialization failures (#3733) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This is a small fix that changes the `sizeCollectorInputEncoders` in `DataSkippingReader` to a `lazy val`. 
We are already doing this for other encoders in the codebase (e.g. see [here](https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/actions/actions.scala#L106)) in order to prevent initialization failures of those encoders during JVM startup. ## How was this patch tested? Existing tests are sufficient as this does not make any logical changes. ## Does this PR introduce _any_ user-facing changes? No commit 1f8bcb24fd52fffb59c9f08becbbde44bca4252c Author: Yumingxuan Guo Date: Fri Sep 27 16:54:57 2024 -0700 [Delta] Refactor Coordinated Commits Util Names and Calling Orders (#3732) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Refactor Coordinated Commits utility method names and calling orders. ## How was this patch tested? Existing UTs. ## Does this PR introduce _any_ user-facing changes? No. commit 48c3192bd338364b5c532c3bfbda1319e1c21ca6 Author: Hao Jiang Date: Fri Sep 27 14:30:12 2024 -0700 Revert "Remove snapshotAnalysis from TahoeLogFileIndex" (#3731) This reverts commit d10e62a2da74d31b619eeffce925b1725224ee74. https://github.com/delta-io/delta/pull/3722 #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description ## How was this patch tested? ## Does this PR introduce _any_ user-facing changes? commit c54666baf204db9ae927dbb07063fec5e3078c70 Author: Scott Sandre Date: Fri Sep 27 08:28:11 2024 -0700 [Infra] [Spark] Reduce delta-spark CI test runtime by 33 mins (1h46m to 1h13m) (#3712) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (Infra) ## Description This PR reduces delta-spark CI test runtime by 33 mins. Previously the max shard duration was 1h 46 mins, and now it is 1h 13 mins. This PR does so by the following 0. We add an extra shard 1.
I used https://github.com/delta-io/delta/pull/3694 to collect some metrics about delta-spark test runtime execution. 2. I specifically identified (a) the 50 slowest test suites and (b) the average suite duration excluding those top 50 (it was 0.71 minutes) 3. I used this information to update `TestParallelization` to do smarter test suite assignment. The logic is as follows: - For the top 50 slowest test suites, we assign them deterministically by, in sorted descending order, assigning the suites to the shard + group (group means thread) with the lowest duration so far. - For the remaining tests that are not in the top 50, we assign them to a random shard, and within that shard we assign it to the group with the lowest duration so far, too 4. We also update the hash function used to be MurmurHash3 which is known to create balanced assignments in scenarios where the input strings (test names) might have similar prefixes or patterns Note that purely adding another shard and using a better hash function does NOT yield any better results. That was attempted here: https://github.com/delta-io/delta/pull/3715. ## How was this patch tested? GitHub CI tests. https://github.com/delta-io/delta/actions/runs/11004181545?pr=3712 ## Does this PR introduce _any_ user-facing changes? No. commit b1e4a033ba81c57ba64a3705ded73d6a93d44497 Author: Marko Ilić Date: Fri Sep 27 02:40:10 2024 +0200 [Kernel] Extended schema JSON serde to support collations (#3628) ## Description Extended serialization and deserialization to support collations in metadata. ## How was this patch tested? Tests added to `DataTypeJsonSerDe.java` and `StructTypeSuite.scala`. Co-authored-by: Venki Korukanti commit b29a30f538d4efaf7ecd789457a8cbcb8aed9648 Author: Yumingxuan Guo Date: Thu Sep 26 14:53:25 2024 -0700 [Delta] Extend ClusteredTableDDLSuite with Coordinated Commits and Fix Issues (#3720) #### Which Delta project/connector is this regarding?
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description 1. Adds coordinated commits coverage for ClusteredTableDDLSuite. 2. Fix CC configurations' interaction with CREATE LIKE command. Before, all three CC configurations are copied along with all other table properties from the source table; now, we filter them out, since CREATE LIKE is similar to CLONE, and we do not copy the commit coordinator from the source table during CLONE. 3. Fix CC configuration's interaction with CREATE an external table in a location with an existing table. Before, the CC configurations from the existing table and the command are compared against each other, and exceptions are thrown when they don't match; now, omitting CC configurations in the command does not error out, and the existing commit coordinator is retained, along with the ICT configuration dependencies. This is similar to REPLACE command. 4. Added UTs for the above changes. 5. Added some more workaround for the in-memory commit coordinator. ## How was this patch tested? UTs. ## Does this PR introduce _any_ user-facing changes? No. commit 0b89ac31da318f16bbb6c70290443c95771dd74b Author: Yumingxuan Guo Date: Thu Sep 26 14:53:17 2024 -0700 [Delta] Add Test Suites for Coordinated Commits Properties (#3729) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Added UTs for CC properties' interaction with CREATE/REPLACE/CLONE against this [commit](https://github.com/delta-io/delta/commit/18eb1a6cbd09049eb835586a4d607fbd9673a1ee). All future CC properties-related test cases could go here. ## How was this patch tested? UTs. ## Does this PR introduce _any_ user-facing changes? No. 
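The two coordinated-commits (CC) fixes described in #3720 above can be pictured with a small sketch. This is a toy model in plain Python, not Delta's implementation: the key prefix `delta.coordinatedCommits.`, both function names, and the choice to still raise on an explicit mismatch are assumptions made for illustration only.

```python
# Assumed key prefix for CC table properties (illustrative, not taken from the PR).
CC_PREFIX = "delta.coordinatedCommits."


def properties_for_create_like(source_props):
    """Copy table properties for CREATE LIKE, dropping CC configurations.

    Mirrors the described behavior: CC configurations are filtered out
    instead of being copied from the source table.
    """
    return {k: v for k, v in source_props.items() if not k.startswith(CC_PREFIX)}


def resolve_cc_for_existing_location(existing_cc, command_cc):
    """Pick the CC configuration when creating a table over an existing one.

    If the command omits CC configurations, the existing ones are retained.
    Assumption: an explicit, non-matching configuration still raises.
    """
    if not command_cc:
        return dict(existing_cc)
    if command_cc != existing_cc:
        raise ValueError("CC configurations do not match the existing table")
    return dict(existing_cc)


source = {
    CC_PREFIX + "commitCoordinator": "in-memory",  # hypothetical key and value
    "delta.appendOnly": "true",
}
copied = properties_for_create_like(source)
retained = resolve_cc_for_existing_location({CC_PREFIX + "commitCoordinator": "in-memory"}, {})
```

In this model, `copied` keeps only the non-CC property, and `retained` equals the existing table's CC configuration because the command supplied none.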
commit 67934f818691e4d285ec721e900073a0c4153d30 Author: Johan Lasperas Date: Thu Sep 26 18:18:27 2024 +0200 [Kernel] Add type widening to supported table features (#3656) ## Description Allow reading and writing to tables that have the type widening table features enabled (both preview and stable table feature). Reading: - The default kernel parquet reader supports widening conversions since https://github.com/delta-io/delta/pull/3541. Engines may also choose to implement type widening natively in their parquet reader if they wish. Writing: - Nothing to do, type widening doesn't impact the write path - writing data always uses the latest data schema. ## How was this patch tested? Added read integration tests. Tests are based on golden tables. Generating the tables requires Spark 4.0; because Spark master cross-compilation is broken, the table generation code is not included here. The following steps were used to generate the tables. 1. Create a table with initial data types and insert initial data 2. Enable type widening and schema evolution 3. Insert data with wider type for each column. Column types are automatically widened during schema evolution. `type-widening` table: | Column | Initial type | Widened Type | | - | - | - | | byte_long | byte | long | | int_long | int | long | | float_double | float | double | | byte_double | byte | double | | short_double | short | double | | int_double | int | double | | decimal_decimal_same_scale | decimal(10, 2) | decimal(20, 2) | | decimal_decimal_greater_scale | decimal(10, 2) | decimal(20, 5) | | byte_decimal | byte | decimal(11, 1) | | short_decimal | short | decimal(11, 1) | | int_decimal | int | decimal(11, 1) | | long_decimal | long | decimal(21, 1) | | date_timestamp_ntz | date | timestamp_ntz | `type-widening-nested` table: | Column | Initial type | Widened Type | | - | - | - | | struct | struct | struct | | map | map | map | | array | array | array | ## Does this PR introduce _any_ user-facing changes?
Yes, it's now possible to read from and write to delta tables with type widening enabled using kernel. commit 68713799ea7ceb769366c32f0f3df811674765ef Author: Hao Jiang Date: Wed Sep 25 15:56:32 2024 -0700 Remove snapshotAnalysis from TahoeLogFileIndex (#3722) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR fixes the OOM caused by SparkSession.cloneSession and TemporaryView. It replaces the reference of `Snapshot` in `TahoeLogFileIndex` using `SnapshotDescriptor`, thus remove the reference to `SparkSession` from `TahoeLogFileIndex`. ## How was this patch tested? UT ## Does this PR introduce _any_ user-facing changes? No commit 6cfae83ce640e9ca42f099fff6c306ad31e80cd8 Author: Hyukjin Kwon Date: Thu Sep 26 00:48:56 2024 +0900 [Spark] Fix compilation error against latest Spark master (#3721) ## Description This PR fixes the compilation error against the latest Spark. ## How was this patch tested? Manually compiled. ## Does this PR introduce _any_ user-facing changes? No commit 37cc8218757de7ab288da06604622c01702d8333 Author: Dhruv Arya Date: Tue Sep 24 20:38:49 2024 -0700 [Spark][ICT] Make CDCReader.changesToDF aware of InCommitTimestamps (#3714) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Context: Currently CDCReader.changesToDF relies on DeltaHistoryManager.getCommits for getting a list of commits and their timestamps. Since DeltaHistoryManager.getCommits is not aware of InCommitTimestamps and Coordinated Commits, it will either return the wrong timestamp or no timestamp at all for certain commits. This PR updates CDCReader.changesToDF so that it only relies on DeltaHistoryManager.getCommits for non-ICT commits. The rest of CDCReader.changesToDF relies on the output of deltaLog.getChanges which is already Coordinated Commit-aware. 
The function also already extracts the `CommitInfo` for all of these commits, which we reuse to get the In-Commit Timestamp for relevant commits. Since the actions were already being read in the function, this PR does not add any additional IO. This PR also updates `DeltaSource` so that it propagates `CommitInfo` actions to `CDCReader.changesToDF`. These `CommitInfo` actions are only used for InCommitTimestamps and are later filtered out. ## How was this patch tested? Added a Coordinated Commit variant of DeltaCDCScalaSuite with a batch size of 10. New test cases in InCommitTimestampSuite. More tests coming up. ## Does this PR introduce _any_ user-facing changes? No commit a8cc4b4c6ce1f27a3aa073f04e7ef231670d6de6 Author: Marko Ilić Date: Tue Sep 24 18:40:09 2024 +0200 [KERNEL] Extended StringType to have CollationIdentifier (#3627) ## Description Extended StringType to have attribute collationIdentifier. ## How was this patch tested? Tests added to `CollatioinIdentifierSuite` and `StringTypeSuite` ## Does this PR introduce _any_ user-facing changes? Yes. Previously, users could use StringType just as StringType.STRING, but now they can create StringType instances with arbitrary CollationIdentifier values. commit 2514222497402dbc04bbb3452c6931249aa9b9c3 Author: Venki Korukanti Date: Tue Sep 24 08:57:32 2024 -0700 [Spark] Upgrade latest Spark dependency to 3.5.3 (#3647) commit 971deda78ba5a7b827fa603156386b7e7fb4af77 Author: Fred Storage Liu Date: Tue Sep 24 23:06:43 2024 +0800 Make CTAS with replace honor column mapping when writing first set of parquet files (#3704) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Make CTAS with replace honor column mapping when writing first set of parquet files; Otherwise the uniform table will end up with parquet files without column mapping and thus have null in query results. ## How was this patch tested?
manual test with Spark ## Does this PR introduce _any_ user-facing changes? commit 19374e27dfc95ffd53f3ad80f692450aececf728 Author: jackierwzhang <67607237+jackierwzhang@users.noreply.github.com> Date: Tue Sep 24 23:06:05 2024 +0800 Genericalize schema utils to support non-struct root level access (#3716) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Improving some schema utils to allow path index into non-struct root data structures. ## How was this patch tested? New UT. ## Does this PR introduce _any_ user-facing changes? No. commit 36995d991a79d1f00195b0199961d67fe401fcd9 Author: Sumeet Varma Date: Mon Sep 23 16:02:00 2024 -0700 [Storage] Add Javadoc for CoordinatedCommitsUtils::getCoordinatorName (#3711) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Storage) ## Description This is a simple cleanup PR for improving documentation of getCoordinatorName 1. Return an Optional to indicate the method may return null if no coordinator is set 2. Rename the method from getCoordinator -> getCoordinatorName ## How was this patch tested? Build ## Does this PR introduce _any_ user-facing changes? No commit 437db30f6ba6a001d1682ff05e4cb890161737a3 Author: Venki Korukanti Date: Mon Sep 23 13:03:23 2024 -0700 Revert "[Spark] Fix Spark-master compile errors" (#3710) Reverts delta-io/delta#3591 commit 538e736ea5526e98a7b0e9124315757c1d5e54f3 Author: Shixiong Zhu Date: Mon Sep 23 10:10:57 2024 -0700 Set parallelism for the parallelize job in recursiveListDirs (#3708) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description `DeltaFileOperations.recursiveListDirs` calls `parallelize` without specifying the parallelism. 
Hence, it always uses [the number of available cores on a cluster](https://github.com/apache/spark/blob/d2e8c1cb60e34a1c7e92374c07d682aa5ca79145/core/src/main/scala/org/apache/spark/SparkContext.scala#L1003). When a cluster has many cores but `subDirs` is small, it will launch many empty tasks. This PR makes a small change to use `subDirs.length.min(spark.sparkContext.defaultParallelism)` as the parallelism so that when `subDirs` is smaller than the number of available cores, it will not launch empty tasks. ## How was this patch tested? Existing tests. ## Does this PR introduce _any_ user-facing changes? No commit a99f62be69b6754daf571acdfa2048f3048e8143 Author: Johan Lasperas Date: Mon Sep 23 16:06:28 2024 +0200 [Spark] Disable implicit casting in Delta streaming sink (#3691) ## Description https://github.com/delta-io/delta/pull/3443 introduced implicit casting when writing to a Delta table using a streaming query. We are disabling this change for now as it regresses behavior when a struct field is missing in the input data. This previously succeeded, filling the missing fields with `null` but would now fail with: ``` [DELTA_UPDATE_SCHEMA_MISMATCH_EXPRESSION] Cannot cast struct to struct. All nested columns must match. ``` Note: batch INSERT fails in this scenario with: ``` [DELTA_INSERT_COLUMN_ARITY_MISMATCH] Cannot write to ', not enough nested fields in ; target table has 3 column(s) but the inserted data has 2 column(s) ``` but since streaming write allowed this, we have to preserve that behavior. ## How was this patch tested? Tests added as part of https://github.com/delta-io/delta/pull/3443, e.p. with flag disabled. ## Does this PR introduce _any_ user-facing changes? Disabled behavior change that was to be introduced with https://github.com/delta-io/delta/pull/3443.
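The streaming behavior that #3691 preserves (a struct field missing from the input is filled with `null` rather than rejected) can be modeled with a short sketch. This is a toy illustration in plain Python, not Delta or Spark code: rows are dicts, a schema is a nested dict whose leaf fields map to `None`, and the function name `fill_missing_fields` is hypothetical.

```python
def fill_missing_fields(row, schema):
    """Return a copy of `row` shaped like `schema`, using None for missing fields.

    `schema` maps each field name to a nested dict (a struct) or to None (a leaf).
    Fields present in `row` keep their values; absent fields become None.
    """
    out = {}
    for name, sub_schema in schema.items():
        value = row.get(name)
        if isinstance(sub_schema, dict) and isinstance(value, dict):
            # Recurse into nested structs so inner missing fields are filled too.
            out[name] = fill_missing_fields(value, sub_schema)
        else:
            out[name] = value
    return out


# Target table has a struct with 3 nested fields; the input row only has 2.
target_schema = {"id": None, "s": {"a": None, "b": None, "c": None}}
incoming_row = {"id": 1, "s": {"a": 10, "b": 20}}
written_row = fill_missing_fields(incoming_row, target_schema)
```

Here `written_row["s"]["c"]` ends up as `None`, which corresponds to the lenient streaming write; the batch INSERT path described above would instead reject the row for having too few nested fields.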
commit 1753cb547e0194aa3021e45de5c19fc24209471c Author: Cuong Nguyen Date: Fri Sep 20 18:08:01 2024 -0700 [Spark] Passing TableIdentifier from CheckpointHook to commit coordinator client (#3695) ## Description As part of the effort to make commit coordinator client aware of the table identifier, this PR handles the code path going from the checkpoint hook to the commit coordinator client. ## How was this patch tested? Ran existing unit tests commit 5bfd99eb1b5c5bc066acc791bb0f39d2de2f9a1f Author: Venki Korukanti Date: Fri Sep 20 18:07:07 2024 -0700 Revert "Make CTAS with replace honor column mapping when writing first set of parquet files" (#3703) Reverts delta-io/delta#3696 which breaks the build. commit cddde68533f243ab049aaec4e464cf4e35a4db72 Author: Scott Sandre Date: Fri Sep 20 16:13:35 2024 -0700 [Spark] Fix delta-spark test log4j (#3700) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description https://github.com/delta-io/delta/pull/3146 added support for spark master structured logging, but broke the test logging for delta-spark against spark 3.5. This PR fixes that. The broken logging looked like this (https://github.com/delta-io/delta/actions/runs/10856009436/job/30129811815): ``` ERROR StatusConsoleListener Unable to locate plugin type for JsonTemplateLayout ERROR StatusConsoleListener Unable to locate plugin for JsonTemplateLayout ERROR StatusConsoleListener Could not create plugin of type class org.apache.logging.log4j.core.appender.FileAppender for element File: java.lang.NullPointerException java.lang.NullPointerException at org.apache.logging.log4j.core.config.plugins.visitors.PluginElementVisitor.findNamedNode(PluginElementVisitor.java:104) ``` ## How was this patch tested? GitHub CI tests. ## Does this PR introduce _any_ user-facing changes? No. 
commit 4c3c70b28c3e7e32271da6dc1c46924e0cbcf727 Author: Fred Storage Liu Date: Sat Sep 21 06:37:27 2024 +0800 Make CTAS with replace honor column mapping when writing first set of parquet files (#3696) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Make CTAS with replace honor column mapping when writing first set of parquet files; Otherwise the uniform table will end up with parquet files without column mapping and thus have null in query results. ## How was this patch tested? manual test with CTAS. ## Does this PR introduce _any_ user-facing changes? commit e50416dc46cb212b71e3468b9afc85876aecc49f Author: Wenchen Fan Date: Fri Sep 20 23:32:00 2024 +0800 DeltaCatalog#createTable should respect write options (#3674) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description As of today, Delta extracts write options from table properties only for CTAS (in `StagedDeltaTableV2#commitStagedChanges`), but not for CREATE TABLE. In general, this makes sense because CREATE TABLE has no data-writing but CTAS has. However, the write options can be file system configs that we should respect because CREATE TABLE needs to access Delta logs. This PR makes Delta CREATE TABLE to follow CTAS and also extract write options from table properties. ## How was this patch tested? Locally tested with Unity Catalog. It's hard to write a test in Delta because the write options are not persisted but only used during table creation. ## Does this PR introduce _any_ user-facing changes? No commit 413a5cb7edb2c4b0cf2b7653f0c3827f57bf9d4e Author: Amogh Jahagirdar Date: Fri Sep 20 09:28:55 2024 -0600 Revert "[UNIFORM] Disable cleanup of files in expire snapshots via API" (#3692) #### Which Delta project/connector is this regarding? 
Uniform ## Description This reverts commit b51b5b45595e46748339e42dbb0792e8b485a234. Disabling the cleanExpiredFiles API technically prevents removal of manifests/manifest lists and users may not be running orphan file removal, so for those users manifests/manifest lists may never be cleaned up. For now we can revert this patch to preserve the original behavior of just preventing data file removal so storage can be reclaimed via manifest/manifest list cleanup during uniform commits. ## How was this patch tested? ## Does this PR introduce _any_ user-facing changes? This reverts to cleaning up unreachable manifests/manifest lists during background commits, not really a direct user facing change. commit 2547e91202f75888fd39977efe511ae4fa462c80 Author: Ryan Johnson Date: Fri Sep 20 08:26:34 2024 -0700 [SPARK] Use StateCache to manage DatasetRefCache instances (#3682) ## Description `DatasetRefCache` instances are currently untracked, making it hard to discard or invalidate them when no longer needed. We start to address that issue by using `StateCache` to track them, so that `Snapshot.uncache()` can clean them up -- the same way `CachedDS` instances are already tracked and cleaned up. ## How was this patch tested? Existing unit tests exercise `Snapshot.uncache` path. ## Does this PR introduce _any_ user-facing changes? No Co-authored-by: Ryan Johnson commit 80457b00829da17e0cc1fd793e7dd9b4f6089d9e Author: Scott Sandre Date: Thu Sep 19 14:48:28 2024 -0700 [Cleanup] Remove unused directories from `/connectors` (#3693) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (Cleanup) ## Description Remove unused and no-longer-applicable directories from `/connectors` ## How was this patch tested? N/A. ## Does this PR introduce _any_ user-facing changes? No.
commit cb0c26083386a05e3cd4e40fa1a7be013d9109ba Author: Jun <85203301+junlee-db@users.noreply.github.com> Date: Thu Sep 19 13:31:12 2024 -0700 [Spark] Extend FuzzSuite to have PostCommit phase (#3679) ## Description Adds additional phase to capture in Fuzz test. PostCommit is intended to capture the phase after the commit to run some post commit hooks. ## How was this patch tested? Unit tests commit dcaf833b4439eae5293be84c8a222a393c664328 Author: Juliusz Sompolski Date: Thu Sep 19 21:09:24 2024 +0200 [Spark] Followup to #3633: update some comments to checkAddFileWithDeletionVectorStatsAreNotTightBounds (#3641) Post merge review followup: update and fix some comments. Co-authored-by: Julek Sompolski commit 408b8948073ae3515a52ca8d5ba93926e436e4aa Author: Christos Stavrakakis Date: Thu Sep 19 18:30:07 2024 +0200 Exclude metadata only updates from DV check (#3686) ## Description During commit we validate that `AddFile` actions cannot contain Deletion Vectors when DVs are not enabled for a table (table property). This restriction is incorrect for actions that update metadata of existing files, e.g. `ComputeStatistics` or `RowTrackingBackfill`. The current code skips the check for `ComputeStatistics` operation but not for other operations that perform in-place-metadata updates. The new `isInPlaceFileMetadataUpdate` method is added to Delta operations so that we can easily distinguish such operations. The `getAssertDeletionVectorWellFormedFunc` function is slightly refactored to be more readable. ## How was this patch tested? Existing tests provide coverage. commit d07a7bd50c0ea0af84adc51bc3652030d99d3b00 Author: Christos Stavrakakis Date: Thu Sep 19 18:29:44 2024 +0200 Strip column mapping metadata when feature is disabled (#3688) ## Description Transactions might try to create or update the schema of a Delta table with columns that contain column mapping metadata, even when column mapping is not enabled.
For example, this can happen when transactions copy the schema from another table without stripping metadata. To avoid such issues, we automatically strip column mapping metadata when column mapping is disabled. We are doing this only for new tables or for transactions that add column mapping metadata for the first time. If column metadata already exist, we cannot strip them because this would break the table. A usage log is emitted so we can understand the impact on existing tables. Note that this change covers the cases where txn.updateMetadata is called (the "proper API") and not the cases where a Metadata action is directly committed to the table. Finally, this commit changes drop column mapping command to check that all column mapping metadata do not exist, and not only physical column name and ID. ## How was this patch tested? Added new UT. commit 93eef1112fce9c766aec504f26f09d53bbcabb03 Author: Scott Sandre Date: Wed Sep 18 15:36:18 2024 -0700 [Storage] Remove file path cache tech debt from S3SingleDriverLogStore (#3685) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (Storage) ## Description S3 has been strongly consistent (for `GET`, `PUT`, `LIST`) for years ([announcement](https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/)), yet the `S3SingleDriverLogStore` has had code that assumes S3 is read-after-write inconsistent. This PR removes such redundant code. ## How was this patch tested? Existing unit tests. ## Does this PR introduce _any_ user-facing changes? No. commit afc8bf6af2bc689dc8db3199a734acf5d52b30e2 Author: Venki Korukanti Date: Wed Sep 18 13:03:45 2024 -0700 [Kernel][Cleanup] Remove unused API (#3687) Remove `ColumnarBatch.slice` API. This API was proposed during the prototype for handling DVs, but the final implementation took a different route (basically passing a selection vector to indicate the selected rows). 
This is not used anywhere. commit 1a5203768e4b837e35939b57c110986d5cc542c3 Author: Aleksei Shishkin Date: Wed Sep 18 18:13:34 2024 +0200 [Spark] Get rid of duplicated replaceCharWithVarchar function (#3684) ## Description WriteIntoDeltaLike has a function replaceCharWithVarchar that is identical to Spark's. This PR removes WriteIntoDeltaLike#replaceCharWithVarchar and replaces its usages with Spark's CharVarcharUtils#replaceCharWithVarchar. ## How was this patch tested? It compiles. The changes don't require new tests. commit b2339cb5d80fc29c8f202d9eec4c338b25abda59 Author: Maxim Gekk Date: Wed Sep 18 17:54:24 2024 +0200 [Spark] Use `condition` instead of `errorClass` in `checkError()` (#3680) ## Description In the PR, I propose to use the `condition` parameter instead of `errorClass` in calls of `checkError` because `errorClass` was renamed in Spark by the PR https://github.com/apache/spark/pull/48027. This PR fixes compilation issues like: ``` [error] checkError( [error] ^ [error] /home/runner/work/delta/delta/spark/src/test/scala/org/apache/spark/sql/delta/rowtracking/RowTrackingReadWriteSuite.scala:304:7: overloaded method checkError with alternatives: [error] (exception: org.apache.spark.SparkThrowable,condition: String,sqlState: Option[String],parameters: Map[String,String],context: RowTrackingReadWriteSuite.this.ExpectedContext)Unit ``` ## How was this patch tested? By compiling locally. ## Does this PR introduce _any_ user-facing changes? No. This makes changes in tests only. commit d467f520dc3e63cffaa17ddcd837ade6a85a1901 Author: Yan Zhao Date: Wed Sep 18 11:37:08 2024 +0800 [Kernel] Support cleaning expired Delta logs as part of checkpointing (#3212) Add support for metadata cleanup as part of the checkpointing. Metadata cleanup removes expired Delta table log files (delta + checkpoints) according to the table log retention configuration. Removing delta log files must not make the table state impossible to reconstruct.
Co-authored-by: Venki Korukanti commit e1dd98728bf79beeace976db5f3179a305335ae7 Author: Lukas Rupprecht Date: Tue Sep 17 13:04:28 2024 -0700 [Spark] Correctly handles protocol properties during repeat table creation (#3681) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR fixes a bug that could occur, if a table is created at the same location but under a different name and had delta.minReader/WriterVersion set explicitly as part of the table creation. Because these properties are removed from the table metadata, they will not appear as part of the table property comparison during the second table creation. As it is required for the properties to match, the second creation will fail, even though the specified properties are identical to the first one. This PR removes these two special properties from the comparison to allow table creation to succeed. ## How was this patch tested? Added a unit test to assert that repeat table creation succeeds, even if minReader/WriterVersion is specified. ## Does this PR introduce _any_ user-facing changes? No commit 8e9741708a434e3776b16d8ff047b55ec875f0bf Author: Rajesh Parangi <89877744+rajeshparangi@users.noreply.github.com> Date: Tue Sep 17 08:42:31 2024 -0700 Refactor Vacuum code to properly handle path url encoding (#3678) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This is intended to handle special characters in the table path. For context, DV team recently made a change to include special characters in all paths used by tests. As of today, the paths we get from listing are not url encoded -- they get url-encoded later in the logic. However, paths contained in delta log files are already url encoded. 
To keep these two things compatible, this change makes the file names from listing to be url encoded and changes the later logic to not url encode again. ## How was this patch tested? Existing tests ## Does this PR introduce _any_ user-facing changes? NO commit 74d19a5ad2360d358563de27a0b1d90034983439 Author: Ryan Johnson Date: Fri Sep 13 11:45:29 2024 -0700 [SPARK] Use better TahoeLogFileIndex constructor (#3677) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Tiny bit of code hygiene: Remove some redundant code from `DeltaSharingFileIndex` by invoking the existing alternate constructor for `TahoeLogFileIndex` that does the same thing. ## How was this patch tested? Existing tests cover this code. ## Does this PR introduce _any_ user-facing changes? No. commit 9b6a3e0976208c5f8237963aa89433241e45c530 Author: Zhipeng Mao Date: Fri Sep 13 18:56:25 2024 +0200 [Spark] Refactor CONVERT TO DELTA blocking conversion logic (#3676) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description The PR extracts the logic to block conversion to Delta table into a separate method `checkConversionIsAllowed` in `ConvertToDeltaCommand` in order not to bloat the main `performConvert` method when more blocking conditions are added. ## How was this patch tested? Just refactoring. Existent tests should cover the change. ## Does this PR introduce _any_ user-facing changes? No. commit 1aaf10b89e8247c90c7a9091fb2886a2c0758d05 Author: Sumeet Varma Date: Thu Sep 12 15:31:43 2024 -0700 [Spark][Kernel] Remove duplicate CoordinatedCommitsUtils.java from DynamoDBCommitCoordinator (#3653) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other ## Description - Remove duplicate CoordinatedCommitsUtils.java from DynamoDBCommitCoordinator package. 
- Move the common utils to non-test package. ## How was this patch tested? Compile ## Does this PR introduce _any_ user-facing changes? No commit 920f185a04382ff466e47e75240dfda48efe40d3 Author: Zhipeng Mao Date: Thu Sep 12 18:09:43 2024 +0200 [Spark] Table features protocols should handle legacy reader features as reader features (#3672) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR fixes a bug where protocol (2, 7) is not considered to support reader features. This issue could manifest in several places. For example, it would result in hiding (2, x) features in the reader features list when it was the only reader feature present. For example, Protocol(2, 7, None, Set(ColumnMapping, RowTracking). ## How was this patch tested? Added new test and adapted existing tests. ## Does this PR introduce _any_ user-facing changes? No. commit fbf0f9b4f208e08b51c590a6816555f470bf4685 Author: Tom van Bussel Date: Thu Sep 12 17:51:09 2024 +0200 [Spark] Fix for pushing down filters referencing attributes with special characters in name into CDF (#3673) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR fixes the push down of filters into CDF. Previously this would cause a parsing exception to be thrown if the filter referenced an attribute with at least one special character in the name. This is due to having to turn the `Filter`s back into `Expression`s. We avoid this by extending `CatalystScan` instead, which avoids the `Expression` to `Filter` to `Expression` roundtrip. ## How was this patch tested? Added a new test ## Does this PR introduce _any_ user-facing changes? 
No commit 479a5c81b39ed5e9ec77fb81a7a814a725ec0fc7 Author: Yumingxuan Guo Date: Wed Sep 11 16:40:43 2024 -0700 [Delta] Block ALTER from unsetting ICT property dependencies for CC (#3666) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description 1. Blocks ALTER TABLE command unsetting or disabling ICT properties, which is a dependency for Coordinated Commits. Scenarios to block: a) Table had CC enabled -- `ALTER TABLE t SET TBLPROPERTIES ('ict' = 'xx')` -- blocked. b) Table had CC enabled -- `ALTER TABLE t UNSET TBLPROPERTIES ('ict' = 'xx')` -- blocked. c) Table without CC -- `ALTER TABLE t SET TBLPROPERTIES ('coordinator' = 'yy', 'ict' = 'xx')` -- blocked, because the table is about to be upgraded with CC, so ICT modification along with it is blocked as well. 2. Added UTs. ## How was this patch tested? UTs. ## Does this PR introduce _any_ user-facing changes? No. commit d7b94cd27e217f63705dd9610a5665ae7c01eb06 Author: Allison Portis Date: Wed Sep 11 13:03:00 2024 -0700 [Kernel] Add some logging to the getChanges implementation (#3667) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description Adds some logging to the getChanges implementation. ## How was this patch tested? N/A commit 27cdcb901e6b7c2d929c5f576a90f47f38e89716 Author: Allison Portis Date: Wed Sep 11 11:37:13 2024 -0700 [Kernel] Adds protocol checks to the public getChanges API on TableImpl (#3651) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description To avoid reading invalid tables, Kernel should check that any read protocol actions are supported by Kernel. This PR makes the current API private, and adds a public API around it that does this check when the `Protocol` is included in the list of actions to be read from the file. 
Also removes the "byVersion" part of the API name since we are adding separate timestamp APIs in https://github.com/delta-io/delta/pull/3650. ## How was this patch tested? Adds unit tests. commit b843ad6ef3900ac9f87433dace5c60ce1b39af5c Author: Venki Korukanti Date: Wed Sep 11 09:33:25 2024 -0700 [Spark] Fix Spark-master compile errors (#3591) The Spark-master-based build was broken by the change https://github.com/apache/spark/pull/47785 --------- Co-authored-by: Thang Long VU Co-authored-by: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> commit 73b420a9bcd32c114805b83368d6b18c38444a8d Author: Wenchen Fan Date: Wed Sep 11 14:37:16 2024 +0800 [MINOR] fix import ordering (#3612) commit 6c22aca54bff1e0b648f0ac4da10d70ab785f3dd Author: Tulio Cavalcanti Date: Tue Sep 10 13:57:51 2024 -0300 [FLINK] [#2331] Add support to partition table by date type (#3533) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [x] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Adds support for partitioning tables by the DATE type for RowData elements. Resolves #2331 ## How was this patch tested? Tests were updated to include the new type. ## Does this PR introduce _any_ user-facing changes? No --------- Signed-off-by: Tulio Cavalcanti commit 81afbe35697e2eccabdad02fe4514d6f0e3be68a Author: Christos Stavrakakis Date: Tue Sep 10 18:30:15 2024 +0200 [Spark] Fix path handling in DeltaTableCreationSuite (#3662) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Fix a number of tests in `DeltaTableCreationSuite` that were not correctly handling paths with special chars. Enable special char injection in the `DeltaTableCreationIdColumnMappingSuite` and `DeltaTableCreationNameColumnMappingSuite` suites. ## How was this patch tested? Test-only PR. ## Does this PR introduce _any_ user-facing changes? No.
commit 559ca40d30571b955c9eb73a69d9dd77c7c302b4 Author: Christos Stavrakakis Date: Tue Sep 10 18:30:03 2024 +0200 [Spark] Allow setting the same active Delta transaction (#3660) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description The OptimisticTransaction.setActive() and OptimisticTransaction.withActive() methods fail if the active transaction is already set, even if the caller tries to set the same transaction. This commit fixes this issue and allows setting the same transaction instance. ## How was this patch tested? New and existing tests. ## Does this PR introduce _any_ user-facing changes? No commit b7ef01c1298b0181679565515247b0895f949062 Author: Paddy Xu Date: Tue Sep 10 18:29:51 2024 +0200 [Spark] Report writer offset mismatch in DVStore (#3661) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description We identified a potential issue with the API used to write DVs to files, where the `DataOutputStream.size()` method may not return the correct file size. Our investigation revealed that `DataOutputStream` maintains its own byte count, but its `write(data)` method does not increment this counter. `DataOutputStream` has multiple subclasses, which might override the counter or the `write(data)` method to update the counter correctly. We want to find out which class is being used when the issue occurs, hence this PR. To address this, we introduced our own mechanism to track the number of bytes written, which will be used solely for logging. If there is a discrepancy between the system's reported file size and our own record, a Delta event will be triggered. ## How was this patch tested? Not needed. ## Does this PR introduce _any_ user-facing changes? No.
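The byte-tracking idea described for DVStore can be sketched as a small counting wrapper. This is only an illustration, not the code from the PR: the class name `CountingOutputStream` and the method `getBytesWritten` are hypothetical, and the sketch assumes a plain `java.io.OutputStream` rather than the stream class Delta actually uses. It counts bytes in both `write` overloads, so the total can later be compared with whatever size the underlying stream reports.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical helper: counts every byte passed through to the wrapped stream. */
class CountingOutputStream extends FilterOutputStream {
    private long bytesWritten = 0;

    CountingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        bytesWritten += 1;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        // Forward the whole range in one call and count it once.
        out.write(b, off, len);
        bytesWritten += len;
    }

    // FilterOutputStream.write(byte[]) delegates to write(b, 0, b.length),
    // so it is covered by the override above.

    long getBytesWritten() {
        return bytesWritten;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        CountingOutputStream counting = new CountingOutputStream(sink);
        counting.write(new byte[] {1, 2, 3});
        counting.write(42);
        // A mismatch here is the kind of discrepancy that would be logged.
        if (counting.getBytesWritten() != sink.size()) {
            System.out.println("offset mismatch: " + counting.getBytesWritten()
                + " vs " + sink.size());
        } else {
            System.out.println("sizes agree: " + sink.size());
        }
    }
}
```

Keeping the counter in a separate wrapper means the comparison does not depend on how any particular stream subclass maintains its own size.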
commit a08712d1b83743410bbd7e0e264d8b9b10db0228 Author: Wenchen Fan Date: Tue Sep 10 23:56:33 2024 +0800 DeltaCatalog.createTable should respect PROP_IS_MANAGED_LOCATION (#3654) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Even if a table has the location field, it should still be a managed table if `PROP_IS_MANAGED_LOCATION` is present in the table properties. Note: this case won't happen with Spark integration solely. It's only an issue for third-party catalogs that delegate requests to `DeltaCatalog`, such as Unity Catalog. ## How was this patch tested? new test ## Does this PR introduce _any_ user-facing changes? no commit 5d0a05e544497594af89019d38fab1357ed61294 Author: Yumingxuan Guo Date: Mon Sep 9 20:54:48 2024 -0700 [Delta] Exclude default coordinated commits properties during replace (#3658) 1. Exclude default CC properties during REPLACE command. Before, all default properties are merged at some later stage of REPLACE; now, we exclude the CC ones. If an existing commit coordinator is retained, the existing ICT properties are also retained; otherwise, default ICT properties are included. 2. Added UTs. #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description ## How was this patch tested? UTs. ## Does this PR introduce _any_ user-facing changes? No. commit 66440f0f57d15efa525b48edd0a9a0f3e4077d97 Author: Allison Portis Date: Mon Sep 9 15:48:25 2024 -0700 [Kernel] Support beforeOrAt and atOrAfter semantics for getting a commit version from a timestamp (#3650) #### Which Delta project/connector is this regarding? 
- [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description Adds the functions `getVersionBeforeOrAtTimestamp` and `getVersionAtOrAfterTimestamp` to `TableImpl` for use with the `getChangesByVersion` API to enable querying changes between two timestamps. ## How was this patch tested? Adds unit tests. commit 92f2068b0801fabd92c76f860e0b06d649a88785 Author: Johan Lasperas Date: Mon Sep 9 18:49:36 2024 +0200 [Spark] Fix overflow in GeneratedColumnSuite (#3652) ## Description Small test fix, follow-up from https://github.com/delta-io/delta/pull/3601. The test contains an overflow and fails when running with ANSI_MODE enabled. ## How was this patch tested? Test-only commit 15da4aa9a5789d46af9d2289f54d88b49514f2e6 Author: Bart Samwel Date: Mon Sep 9 15:19:22 2024 +0200 [Spark] Only enable a single legacy feature with legacy metadata properties (#3657) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Currently, enabling legacy features on legacy protocols with metadata properties results in enabling all preceding legacy features. For example, enabling enableChangeDataFeed results in protocol (1, 4). This is inconsistent with the rest of the protocol operations. In this PR, we fix this inconsistency by always enabling only the requested feature. This is a behavioral change. ## How was this patch tested? Existing and new unit tests. ## Does this PR introduce _any_ user-facing changes? Yes. When enabling a feature using a table property, e.g. by setting `delta.enableChangeDataFeed` to `true`, you would previously get protocol `(1, 4)` in the typical case. Now you would get `(1, 7, changeDataFeed)`. The user can get `(1, 4)` by also asking for `delta.minWriterVersion = 4`.
This change is OK now because (a) enabling fewer features is safer than enabling more features, (b) Deletion Vectors require table features support and are very popular to implement, so many clients have added support for table features, and (c) users can easily get back to the legacy protocol by ALTERing the protocol and asking for `delta.minWriterVersion = 4`. Signed-off-by: Bart Samwel commit ac667ff7d508c006b88611d8d73653c3248b3222 Author: Jun <85203301+junlee-db@users.noreply.github.com> Date: Fri Sep 6 13:46:07 2024 -0700 [Spark] Extends IncrementalClustering Suite with Coordinated Commits (#3648) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR extends the incremental clustering suite to run with coordinated commits configs. This gives clustering-related code better coverage with coordinated commits. ## How was this patch tested? This is a test PR. ## Does this PR introduce _any_ user-facing changes? No commit d59e3f0d4a27a38d7c1e965d79ae6dd4794899bb Author: Johan Lasperas Date: Fri Sep 6 18:46:58 2024 +0200 [Spark] Automatic type widening in Delta streaming sink (#3626) ## Description This change introduces automatic type widening during schema evolution in the Delta streaming sink. Conditions for type widening to trigger: - Type widening is enabled on the Delta table - Schema evolution (`mergeSchema`) is enabled on the sink - The data written to the sink uses a type that is strictly wider than the current type in the table schema, and moving from the narrower to the wider type is eligible for type widening - see `TypeWidening.isTypeChangeSupportedForSchemaEvolution` When all conditions are satisfied, the table schema is updated to use the wider type before ingesting the data. ## How was this patch tested?
Added test suite `TypeWideningStreamingSinkSuite` covering type widening in the Delta streaming sink ## Does this PR introduce _any_ user-facing changes? This builds on the user-facing change introduced in https://github.com/delta-io/delta/pull/3443 that allows writing to a delta sink using a different type than the current table type. Without type widening: ``` spark.readStream .table("delta_source") # Column 'a' has type INT in 'delta_sink'. .select(col("a").cast("long").alias("a")) .writeStream .format("delta") .option("checkpointLocation", "") .toTable("delta_sink") ``` The write to the sink succeeds, column `a` retains its type `INT` and the data is cast from `LONG` to `INT` on write. With type widening: ``` spark.sql("ALTER TABLE delta_sink SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true')") spark.readStream .table("delta_source") # Column 'a' has type INT in 'delta_sink'. .select(col("a").cast("long").alias("a")) .writeStream .format("delta") .option("checkpointLocation", "") .option("mergeSchema", "true") .toTable("delta_sink") ``` The write to sink succeeds, the type of column `a` is changed from `INT` to `LONG`, data is ingested as `LONG`. commit 9bd2c43a01936f3d43a58ae94363bed8b790f9e6 Author: Marko Ilić Date: Fri Sep 6 07:52:37 2024 +0200 [Kernel] Changed string comparator to fallback to binary comparator (#3621) Changed string comparator to fallback to binary comparator. commit 8a9b447b07d31e39553565bf9c30739ce37d589b Author: Allison Portis Date: Thu Sep 5 14:08:05 2024 -0700 [KERNEL] Additional Delta action support (Metadata, Protocol, CommitInfo and AddCDCFile) for LogImpl::getChangesByVersion (#3532) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description https://github.com/delta-io/delta/pull/3531 added support for reading the add and remove file actions using the LogImpl::getChangesByVersion API. 
This PR adds support for additional actions metadata, protocol, commitInfo and cdc. ## How was this patch tested? Adds unit tests. commit fab761067facaf0ba4d24986fef5acea3ebe01a7 Author: Scott Sandre Date: Thu Sep 5 13:36:34 2024 -0700 [Standalone] Safer checkpointing and better logging (#3646) #### Which Delta project/connector is this regarding? - [ ] Spark - [X] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR - adds better logging throughout Delta Standalone (improves debuggability) - fixes an issue in the Delta Standalone Checkpoint code path where we would incorrectly close an incomplete checkpoint upon error ## How was this patch tested? Existing UTs + tested and verified locally ## Does this PR introduce _any_ user-facing changes? No --------- Co-authored-by: vkorukanti commit 3a9762f85793742b65f403ed1800054ae6ddd41e Author: Zihao Xu Date: Thu Sep 5 12:37:41 2024 -0400 [Spark] Relax check for generated columns and CHECK constraints on nested struct fields (#3601) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description close #3250. this PR relaxes the check for nested struct fields especially when only some are being referenced by CHECK constraints or generated columns, which allows for more valid use cases in scenarios involving type widening or schema evolution. the core function, `checkReferencedByCheckConstraintsOrGeneratedColumns`, inspects the nested/inner fields of the provided `StructType` to determine if any are referenced by dependent (CHECK) constraints or generated columns; for column types like `ArrayType` or `MapType`, the function checks these properties directly without inspecting the inner fields. ## How was this patch tested? through unit tests in `TypeWideningConstraintsSuite` and `TypeWideningGeneratedColumnsSuite`. ## Does this PR introduce _any_ user-facing changes? 
yes, now the following (valid) use case will not be rejected by the check in [ImplicitMetadataOperation.checkDependentExpression](https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/schema/ImplicitMetadataOperation.scala#L241). ```sql -- with `DELTA_SCHEMA_AUTO_MIGRATE` enabled create table t (a struct) using delta; alter table t add constraint ck check (hash(a.x) > 0); -- changing the type of struct field `a.y` when it's not -- the field referenced by the CHECK constraint is allowed now. insert into t (a) values (named_struct('x', CAST(2 AS byte), 'y', 1030)) ``` --------- Co-authored-by: Tathagata Das commit 7dfc6d97cc11cabe4facb2ca3b943e40230b36ec Author: Allison Portis Date: Wed Sep 4 19:05:53 2024 -0700 [KERNEL] Initial support to read the Delta actions from the Delta log between two versions (#3531) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description Adds support for `TableImpl.getChangesByVersion` which takes a `startVersion`, `endVersion`, and a set of Delta actions to read from the log files. ## How was this patch tested? Adds unit tests in this PR. commit 7a0db43df1ef8236e4db8a57837734b83ed15153 Author: Amogh Jahagirdar Date: Wed Sep 4 13:55:08 2024 -0600 [Uniform] Add support for Timestamp partition values, and move away from using partition string paths to using StructLike partition values in Iceberg.. (#3606) ## Description This change adds support for timestamp partition values in Delta Uniform and refactors to avoid using partition strings during partition value conversion from Delta->Iceberg. Instead, Delta partition values are deserialized and then converted Iceberg partition values which are stored in StructLike and passed to the DataFile builder during metadata conversion. ## How was this patch tested? 
Added unit test which tests different partition data types in addition to the new support for timestamp partitions. ## Does this PR introduce _any_ user-facing changes? Before this change, the table could successfully be created but writes to the table with timestamp partition would fail. Now, writes to the table with the timestamp partition value will succeed. commit b17d48a39116b642b19a9aa0bdead1c8cb5210c3 Author: Scott Sandre Date: Wed Sep 4 12:27:41 2024 -0700 [Infra] Include GitHub workflow name in each task (#3644) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (Infra) ## Description I'd like to do some data science using the GitHub REST API on our PR runtimes. Unfortunately, using the GitHub CHECKs API (https://docs.github.com/en/rest/checks/runs?apiVersion=2022-11-28#list-check-runs-for-a-git-reference), I only get the name of the individual job. Getting the name of the parent workflow requires an additional API call. This is because our job names are the default job names right now for most of our GitHub workflows. This doesn't give us any info as to which workflow it is for. For example, today I'd get result `test (2.13.13, 0)` but I am unable to determine if this is for Delta Spark Latest or Delta Spark Master. This PR updates the workflow job name to include the workflow name so now we can uniquely identify the jobs. ## How was this patch tested? CI tests. ## Does this PR introduce _any_ user-facing changes? No commit 0bfe331f66c4cabda821c364ffb3ac47103d6a83 Author: Yumingxuan Guo Date: Wed Sep 4 11:30:53 2024 -0700 [Delta] Block alter command from overriding or unsetting coordinated commits properties (#3573) 1. Blocks ALTER TABLE command from setting Coordinated Commits properties if the table already had them. If the table did not have them, checks that the CC property overrides contains exactly Coordinator Name and Coordinator Conf, no Table Conf. 2. 
Blocks ALTER TABLE command from unsetting Coordinated Commits properties if the table already has them. 3. Added UTs for both cases. Moved the UT to a new suite `CoordinatedCommitsUtilsSuite.scala`. 4. Renamed some methods for clarity. #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description ## How was this patch tested? UTs. ## Does this PR introduce _any_ user-facing changes? No. commit d0c2a0033be8890beea1402a2eb5063b3b7f7b1b Author: Jun <85203301+junlee-db@users.noreply.github.com> Date: Wed Sep 4 11:30:36 2024 -0700 [Spark] Extends Vacuum Suite with Coordinated Commits (#3631) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR extends the vacuum suite to run with coordinated commits configs. This will help vacuum related code to have a better coverage with coordinated commits. ## How was this patch tested? This is a test PR. ## Does this PR introduce _any_ user-facing changes? No commit cdd39dda25f1a9b06ce7a414d0ebff4ddfa419c5 Author: Christos Stavrakakis Date: Tue Sep 3 16:29:49 2024 +0200 [Spark] Update tests in ConvertToDeltaSuite (#3635) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Enable injection of special chars in temporary directories in two tests in ConvertToDeltaSuite. The previous issues have been resolved in 90c1b2a26a86ce0c16e8218ec4188c81b9f7a21c. ## How was this patch tested? Test-only change. ## Does this PR introduce _any_ user-facing changes? No commit 5568dd0c382876ac5c1dd078d4725ae2ba724056 Author: Johan Lasperas Date: Tue Sep 3 16:27:56 2024 +0200 [Spark] Add tests for implicit cast in INSERT to delta table (#3605) ## Description Add tests covering implicit casting when inserting into a Delta table. 
Covers various insert API: - Dataframe V1, V2, SQL, streaming - Append vs. Overwrite - Position-based vs. name-based Changes: - Move test abstraction to run insert using various APIs out of `TypeWideningInsertSchemaEvolutionSuite` and into its own trait to allow reusability. - Add streaming write to the set of insert APIs that are covered by that abstraction. - Add implicit casting tests for insert. ## How was this patch tested? Test-only ## Does this PR introduce _any_ user-facing changes? No commit 5d6309477ab51be5f88d421e868a4489723f400d Author: Juliusz Sompolski Date: Tue Sep 3 06:51:25 2024 +0200 [Spark] Fix stats with tightBounds check for AddFiles with deletionVectors (#3633) ## Description Check `DeletionVectorFilesHaveWideBoundsCheck` has been disabled for COMPUTE STATS because it reintroduces stats with tight bound to files with Deletion Vectors. However, there are other operations that can then copy over these AddFile actions with DVs and tight stats. These operations resulted in DELTA_ADDING_DELETION_VECTORS_WITH_TIGHT_BOUNDS_DISALLOWED error, which was a false positive. In this PR we also attempt to introduce and discuss a "framework" for checks like that as a property of DeltaOperations, with DeltaOperations declaring as a member method whether a certain property and check should be performed. This is opposed to current practice, where many places in the code feature special cases like matching against a certain DeltaOperation and doing something special; this kind of code is very decentralized, and it's easy to miss if any new place or new operation needs such central handling. If this was centralized in DeltaOperations, this could lead to better discoverability of special cases and edge cases when implementing new operations. ## How was this patch tested? Tests added. 
Co-authored-by: Julek Sompolski commit a7450003327a41414ca320aca7f610d18a92a3bb Author: Prakhar Jain Date: Fri Aug 30 08:24:05 2024 -0700 [Spark][Kernel] Pass table identifier in Coordinated Commits (#3608) ## Description This PR adds TableIdentifier in the Coordinated Commit Interface. There will be a separate change to pass the tableName reliably to the Commit Coordinator in delta-spark. ## How was this patch tested? Existing UTs commit 215996f7c3b1327a215806fa06459a28aefc2b7b Author: Johan Lasperas Date: Fri Aug 30 00:40:54 2024 +0200 [Spark] Handle type mismatches in Delta streaming sink (#3443) ## Description This change enables writing data to a Delta streaming sink using data types that differ from the actual Delta table schema. This is achieved by adding an implicit cast to columns when needed. Casting behavior respects the configuration `spark.sql.storeAssignmentPolicy`, similar to batch INSERT. ## How was this patch tested? - Added test suite `DeltaSinkImplicitCastSuite` covering writing to a Delta sink using mismatching types. Covers e.p. interactions with: schema evolution, schema overwrite, column mapping, partition columns, case sensitivity. ## Does this PR introduce _any_ user-facing changes? Previously, writing to a Delta sink using a type that doesn't match the column type in the Delta table failed with `DELTA_FAILED_TO_MERGE_FIELDS`: ``` spark.readStream .table("delta_source") # Column 'a' has type INT in 'delta_sink'. .select(col("a").cast("long").alias("a")) .writeStream .format("delta") .option("checkpointLocation", "") .toTable("delta_sink") DeltaAnalysisException: [DELTA_FAILED_TO_MERGE_FIELDS] Failed to merge fields 'a' and 'a' ``` With this change, writing to the sink now succeeds and data is cast from `LONG` to `INT`. 
If any value overflows, the stream fails with (assuming default `storeAssignmentPolicy=ANSI`): ``` SparkArithmeticException: [CAST_OVERFLOW_IN_TABLE_INSERT] Fail to assign a value of 'LONG' type to the 'INT' type column or variable 'a' due to an overflow. Use `try_cast` on the input value to tolerate overflow and return NULL instead." ``` commit 3a6a60b4788af8b23d99e716f1027a443ab21479 Author: Fred Storage Liu Date: Thu Aug 29 15:40:15 2024 -0700 Adds null pointer check in DeltaParquetWriteSupport (#3623) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Adds null pointer check in DeltaParquetWriteSupport to handle null field ids ## How was this patch tested? always good and safe to have. no test. commit 210a39f2440f46b6b39244e2e87619351b1175d5 Author: jintao shen Date: Thu Aug 29 15:03:22 2024 -0700 [Spark] Fix wordings and sqlState for DELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED (#3622) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description ## How was this patch tested? Existing UTs ## Does this PR introduce _any_ user-facing changes? No commit d94a9fe02a1113d4bd53b43b3ecbdb793f445ea7 Author: ChengJi-db Date: Wed Aug 28 14:45:15 2024 -0700 [Delta Uniform] Delete Iceberg Metadata when Vacuum (#3614) ## Description Guard the iceberg metadata for Vacuum only when uniform is enabled on table ## How was this patch tested? UTs commit 851ddf9f659f5faef6d4bb3e128155afef6c3031 Author: Eduard Tudenhoefner Date: Wed Aug 28 22:30:03 2024 +0200 [Infra] Separate out delta-spark python tests (#3542) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (Infra) ## Description Separate out delta-spark python tests to their own GitHub CI ## How was this patch tested? 
CI tests ## Does this PR introduce _any_ user-facing changes? No commit 2ab5a8d419d869525b1c110372a7782875caa88b Author: Fred Storage Liu Date: Wed Aug 28 10:37:39 2024 -0700 Uniform iceberg conversion transaction should not convert commit with only AddFiles without datachange (#3615) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Uniform iceberg conversion transaction should not convert commit with only AddFiles without datachange. Otherwise it will result in duplicate AddFile in Iceberg. ## How was this patch tested? Manual tested with Spark; UT will be added in future PR. ## Does this PR introduce _any_ user-facing changes? commit 55b1d1232914a7a10b8c4a4aab527ec103ea26d0 Author: Marko Ilić Date: Wed Aug 28 17:44:03 2024 +0200 [Kernel] Fix binary comparator to use the unsigned comparison (#3617) ## Description Fixed binary comparator. Previously, bytes were compared as signed, which was incorrect. ## How was this patch tested? Tests added to `DefaultExpressionEvaluatorSuite.scala` commit f468733b89f80431dbbd7ac0558a800988e3c34e Author: Marko Ilić Date: Wed Aug 28 17:40:33 2024 +0200 [Kernel] Change string comparing from UTF16 to UTF8 (#3611) ## Description Changed string comparing from UTF16 to UTF8. This fixes comparison issues around the characters with surrogate pairs. ## How was this patch tested? Tests added to `DefaultExpressionEvaluatorSuite.scala` commit a0beb105b4dd2983aadbac41ec5631bf678b3034 Author: Christos Stavrakakis Date: Wed Aug 28 17:28:24 2024 +0200 [Spark] Use Spark conf to control special char injection in paths (#3616) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Use a static Spark config to select the prefix that will be used for Parquet and DV files. This is a follow-up from commit 82f797159. ## How was this patch tested? 
Test-only changes. ## Does this PR introduce _any_ user-facing changes? No. commit 103af9dd8605a87c3bf693f451de9a4f8a2a22bd Author: Venki Korukanti Date: Tue Aug 27 22:34:07 2024 -0700 [Docs] Refactor the docs (#3544) Refactor the docs to: * make it easy to access the documentation for each connector. * add Kernel docs Changes are staged at: https://docs.delta.io/0.0.2/index.html commit 44182b5a2dd6cb1acfc4c53549730daf60bd4cca Author: Dhruv Arya Date: Tue Aug 27 09:30:05 2024 -0700 [Spark][Test-Only] Allow override of the name of the tracked commit coordinator (#3609) ## Description Updates `TrackingInMemoryCommitCoordinatorBuilder` so that the class user can now override the name of the coordinator returned by this builder. This will be useful when the `defaultCommitCoordinatorClientOpt` is defined and is different from the "in-memory" coordinator. ## How was this patch tested? Test-only change commit 386fa573bf51cc04648e6771bd50383d9bef4727 Author: Venki Korukanti Date: Mon Aug 26 22:05:48 2024 -0700 [Kernel] Remove `JsonHandler.deserializeStructType` and add `DataType` JSON SerDe in API module (#3602) ## Description Given now Kernel-API module contains Jackson libs: 1) Remove `JsonHandler.deserializeStructType` and its usages 2) Rename `DataTypeParser` to `DataTypeJSONSerDe` (similarly the test suite) 3) Add `StructType` serialization utilities to `DataTypeJSONSerDe` ## How was this patch tested? Unit tests ## Does this PR introduce _any_ user-facing changes? Connectors that implement custom `JsonHandler`, don't need to implement the `deserializeStructType` API. commit 4e323aecda3ad3bcd7b67e6083440d0e19b79b8b Author: Wenchen Fan Date: Tue Aug 27 06:46:48 2024 +0800 always include CatalogTable#storage#properties into data source options (#3536) #### Which Delta project/connector is this regarding? 
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description `DeltaTableV2` uses the options to create `DeltaLog`, assuming that the options should always be consistent with the one in `CatalogTable#storage#properties`. This may not be true as custom catalogs may add more table storage properties on the fly, like Unity Catalog. ## How was this patch tested? Tested locally with UC. It's not an issue for HMS. ## Does this PR introduce _any_ user-facing changes? no commit 5c6b451579acf5fb0d2413075e162512bf27a6c7 Author: Rakesh Veeramacheneni <173086727+raveeram-db@users.noreply.github.com> Date: Mon Aug 26 15:10:16 2024 -0700 [Kernel] Shade `jackson` into Kernel-Api for internal JSON operations (#3587) ## Description Adds Jackson as a general-purpose JSON parsing library into the Kernel-Api jar. The library is shaded to avoid version dependency conflicts downstream for connector developers. ## How was this patch tested? Ran existing tests. commit 020ef32095f234106e8e44bfc3bff28dade1d813 Author: ChengJi-db Date: Mon Aug 26 11:42:39 2024 -0700 [Delta Uniform] Enforce newly enabled iceberg converter to create a new iceberg table (#3600) ## Description When Delta Uniform Iceberg is newly enabled on a table, the converter now always converts all Delta metadata into a newly created Iceberg table. This lets a newly enabled Uniform table start a new history line for Iceberg metadata, so that if a Uniform table is corrupted, the user can unset and re-enable it to unblock. ## How was this patch tested? Add UTs commit 82f797159bcddfc3658978deae03282ac9db7367 Author: Christos Stavrakakis Date: Mon Aug 26 19:44:22 2024 +0200 [Spark] Inject special characters in file paths in testing mode (#3597) ## Description Change Parquet and DV file names to use, in testing mode, a name prefix that contains special chars. This provides good coverage of handling special characters in paths for all Delta code. ## How was this patch tested? Test-only PR.
Updated existing tests that were incorrectly handling paths. commit dcf9ea98d015a34bbbe9ce2e63c46a4d829dcecd Author: Johan Lasperas Date: Mon Aug 26 18:16:46 2024 +0200 [Kernel] Add widening type conversions to Kernel default parquet reader (#3541) ## Description Add a set of conversions to the default parquet reader provided by kernel to allow reading columns using a wider type than the actual type in the parquet file. This will support the type widening table feature, see https://github.com/delta-io/delta/blob/master/protocol_rfcs/type-widening.md. Conversions added: - INT32 -> long - FLOAT -> double - decimal precision/scale increase - DATE -> timestamp_ntz - INT32 -> double - integers -> decimal ## How was this patch tested? Added tests covering all conversions in `ParquetColumnReaderSuite` ## Does this PR introduce _any_ user-facing changes? This change alone doesn't allow reading Delta tables that use the type widening table feature. That feature is still unsupported. It does allow reading Delta tables that somehow have Parquet files that contain types that are different from the table schema, but that really should never happen for tables that don't support type widening. commit b5e9aebe6f28afec552a782afd4f890740014f55 Author: Zhipeng Mao Date: Mon Aug 26 18:08:03 2024 +0200 [SPARK] Move identity column feature out of dev mode (#3598) ## Description This PR is part of https://github.com/delta-io/delta/issues/1959 . We have implemented identity column support and all the tests passed. We now can move identity column feature out of developer mode. ## How was this patch tested? Existing tests. commit d75babdaa66505d82acc001050463739946e74d8 Author: Christos Stavrakakis Date: Mon Aug 26 16:43:20 2024 +0200 [Spark] Inject special chars in temp dirs of all Delta tests (#3604) #### Which Delta project/connector is this regarding?
- [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Make DeltaSQLTestUtils use a default prefix for temporary directories that contains special chars and provides test coverage for URL-encoding of paths. In addition, update all Delta suites to inherit from DeltaSQLTestUtils instead of SQLTestUtils to use the temp dir overrides. Finally, update a bunch of tests to correctly handle tests with special chars. A number of tests require further investigation and potential code fixes and will be handled in follow-up commits. ## How was this patch tested? Test-only PR. ## Does this PR introduce _any_ user-facing changes? No commit 7bafe8e1eb5aa28fdb03be78a275c7a024c031c3 Author: Wenchen Fan Date: Mon Aug 26 08:06:15 2024 +0800 invoke Spark catalog plugin API to create tables (#3497) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Today, `DeltaCatalog` takes care of the Spark CREATE TABLE command and calls `CreateDeltaTableCommand#run` at the end. Within this command, `spark.sessionState.catalog.createTable` is called, which bypasses any custom catalog that overwrites `spark_catalog`. `DeltaCatalog` always creates tables in the Hive Metastore. This PR fixes this issue by calling the Spark catalog plugin API to create the table, to respect custom catalog that overwrites `spark_catalog`, such as Unity Catalog. ## How was this patch tested? Locally tested it with Spark + Unity Catalog. ## Does this PR introduce _any_ user-facing changes? no --------- Co-authored-by: Ryan Johnson Co-authored-by: Tathagata Das commit b1fbe3e6b22afb9ac9420b0a108050c2ead7f168 Author: ChengJi-db Date: Fri Aug 23 16:27:42 2024 -0700 [Delta Uniform] Refactor iceberg & hudi constants (#3599) ## Description Refactor iceberg & hudi constants ## How was this patch tested? 
Existing UTs commit 0d71f85412004db0a0da3167829f87b48b23a734 Author: Liwen Sun <36902243+liwensun@users.noreply.github.com> Date: Fri Aug 23 12:14:23 2024 -0700 Open up APIs to access all Delta configs and features (#3582) ## Description Provide an API for apps or developers to have a view of all configs and features available in Delta. ## How was this patch tested? A new unit test on the new API + existing tests. commit 9d2270e9b950612d96a65e75eb260005c8820e09 Author: Zhipeng Mao Date: Fri Aug 23 17:57:57 2024 +0200 [SPARK] Add MERGE support for tables with Identity Columns (#3566) ## Description This PR is part of https://github.com/delta-io/delta/issues/1959 . It supports MERGE command to provide system generated IDENTITY values in INSERT and UPDATE actions. Unlike INSERT, where the identity columns that needs writing are collected in `WriteIntoDelta.writeAndReturnCommitData` exactly before writing in `TransactionalWrite.writeFiles`, MERGE expressions are resolved earlier. Specifically, we resolve the table's identity columns to track for high water marks in `PreprocessTableMerge.apply`. The column set will be passed to `OptimisticTransaction` and be written in `TransactionalWrite.writeFiles`. ## How was this patch tested? New test suite `IdentityColumnDMLScalaSuite`. commit eb00b0d1f86aae72b9fa25739725132fa56b4853 Author: jintao shen Date: Thu Aug 22 17:12:49 2024 -0700 [Spark] Improve missing stats column message for unsupported data skipping types (#3577) ## Description Today when either the clustering column data type is unsupported for data skipping or clustering column is not in the first 32 columns, it shows the following message [DELTA_CLUSTERING_COLUMN_MISSING_STATS] Liquid clustering requires clustering columns to have stats. Couldn't find clustering column(s) 'current_version' in stats schema: This is confusing to users when the column data type is not supported for data skipping. 
To improve this scenario, this PR introduces a new error class DELTA_CLUSTERING_COLUMNS_NOT_SUPPORTED_DATATYPE, raised when clustering by a data type that does not support data skipping, such as a Boolean column. ## How was this patch tested? Existing unit tests. commit 1ea621bf8c36915fe72d0e749db5bd0b90887f19 Author: Zhipeng Mao Date: Thu Aug 22 22:34:06 2024 +0200 [SPARK][TEST-ONLY] Use unique table name for identity column tests (#3594) ## Description This PR is part of https://github.com/delta-io/delta/issues/1959 . `IdentityColumnSuite` is flaky because the duplicate table name 'identity_test' is used across tests. This PR generates all table names in identity column related suites by using UUID to make them unique. ## How was this patch tested? It is a test-only change. commit 33ad6b3d09cf4d0eaabfc59238e43ca27f9d7ec4 Author: Christos Stavrakakis Date: Thu Aug 22 18:21:42 2024 +0200 [Spark] Add tests for DML commands on partitioned tables with special chars (#3593) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description The paths of partitioned tables contain the partition values in escaped form that follows the Hive-style partitioning. This PR introduces a basic test for DML operations on such tables. ## How was this patch tested? Test-only PR. ## Does this PR introduce _any_ user-facing changes? No commit ef1fabe6ba773e06af10ba9421e6ff361082907e Author: Venki Korukanti Date: Thu Aug 22 09:14:36 2024 -0700 Revert "[Spark] Add Delta Connect Merge Server and Scala Client" (#3592) Reverts delta-io/delta#3580 This PR appears to be missing `DeltaMergeBuilder`, which is causing build failures.
See https://github.com/delta-io/delta/pull/3591 commit 90c1b2a26a86ce0c16e8218ec4188c81b9f7a21c Author: Ming DAI Date: Wed Aug 21 14:46:04 2024 -0700 [Spark] Handle URI properly in CatalogFileManifest for special characters (#3590) ## Description As title, handle URI properly to cover special characters in partition path for ConvertToDelta ## How was this patch tested? Unit test commit c2d343750c384985b1ee6b623758a01772e4d624 Author: richardc-db <87336575+richardc-db@users.noreply.github.com> Date: Wed Aug 21 13:50:19 2024 -0700 [SHARING][VARIANT] Add variant table feature to sharing client (#3549) ## Description adds variant table feature as a supported table feature. Because delta-spark already can read variants, support comes mostly for free cross tests with Spark 3.5 and 4.0 in the "sharing" package because we require spark 4.0 to create variants. ## How was this patch tested? added UTs for the streaming and non-streaming case tested ``` build/sbt -DsparkVersion=master sharing/'testOnly io.delta.sharing.spark.DeltaSharingDataSourceDeltaSuite -- -z "basic variant test"' ``` with both `-DsparkVersion=master` and `-DsparkVersoin=latest` commit 56d057cce5b09742d0315e81a3436fa418eacf5c Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Wed Aug 21 22:47:27 2024 +0200 [Spark] Remove waiting for a fixed time for Delta Connect Server to be available in Delta Connect testing (#3576) ## Description For local E2E Delta Connect testing, we also designed an [util class](https://github.com/delta-io/delta/blob/01bf60743b77c47147843e9083129320490f1629/spark-connect/client/src/test/scala-spark-master/io/delta/connect/tables/RemoteSparkSession.scala#L62) to start a local server in a different process similar to [SparkConnect](https://github.com/apache/spark/blob/ba208b9ca99990fa329c36b28d0aa2a5f4d0a77e/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/test/RemoteSparkSession.scala#L37). 
We noticed that the server takes a variable number of seconds to start up, and back then we received the error `[INVALID_HANDLE.SESSION_NOT_FOUND] The handle 746e6c86-9fa9-4b08-9572-388c20eaed47 is invalid. Session not found. SQLSTATE: HY000`, so we added a 10s `Thread.sleep` before starting the client. This is not robust, so we are removing the `Thread.sleep`. This should work because: 1. The SparkSession's builder here already uses the default [Configuration](https://github.com/apache/spark/blob/3edc9c23a723a92c5a951cea0436529de65c640a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala#L891) of the `SparkConnectClient` which includes a default retry policy. 2. Spark patched the error `INVALID_HANDLE.SESSION_NOT_FOUND` in this [PR](https://github.com/apache/spark/pull/46971) at some point, so we should be able to retry even if we encounter this error. ## How was this patch tested? Existing UTs. commit 0c188c2a3d1e798c0219888bc51a9979887f322f Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Wed Aug 21 22:46:59 2024 +0200 [Spark] Add Delta Connect Merge Server and Scala Client (#3580) ## Description Add support for `merge` for Delta Connect Server and Scala Client. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? No. commit cfb0292c2a3a301021d42e444e14f517ca16111a Author: Juliusz Sompolski Date: Wed Aug 21 20:50:23 2024 +0200 [Spark] CDC reader accept case insensitive Boolean option values (#3584) ## Description `DeltaOptions` are equipped to accept case-insensitive values of boolean flags, but CDCReader was not, so it rejected 'True'. Make it case insensitive. A separate bug in Spark Connect was causing "True" to be passed from Python boolean True. That is being fixed by https://github.com/apache/spark/pull/47790 ## How was this patch tested? Tests added. ## Does this PR introduce _any_ user-facing changes?
Datasource option to enable CDC should now accept "True" and other mixed-case variants and not only "true". --------- Co-authored-by: Julek Sompolski commit 01bf60743b77c47147843e9083129320490f1629 Author: Rajesh Parangi <89877744+rajeshparangi@users.noreply.github.com> Date: Tue Aug 20 10:51:39 2024 -0700 [TEST-ONLY + Refactor] Fix Vacuum test code to make use of artificial clock everywhere (#3572) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Fix Vacuum test code to make use of artificial clock everywhere ## How was this patch tested? Existing tests ## Does this PR introduce _any_ user-facing changes? No commit 0afde444ce11eb20c12252b89aded3bef4370818 Author: Charlene Lyu Date: Tue Aug 20 10:01:15 2024 -0700 [Sharing] Add timestamp_ntz to Delta Sharing reader feature header (#3579) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Sharing) ## Description Add timestamp_ntz to Delta Sharing reader feature header. ## How was this patch tested? Unit test ## Does this PR introduce _any_ user-facing changes? commit e213023ebf3ab0c03ed67ff0cd0c6d8eebcb6b29 Author: Amogh Jahagirdar Date: Tue Aug 20 09:08:31 2024 -0600 [Protocol, Spark] UTC normalize timestamp partition values (#3378) ## Description Currently, in the Delta Protocol, timestamps are not stored with their time zone. This leads to unexpected behavior when querying across systems with different timezones configured (e.g. different spark sessions for instance). For instance in Spark, the timestamp value will be adjusted to spark session time zone and written to the delta log partition values without TZ. If someone were to query the same "timestamp" from a different session timezone, the same time zone value it can fail to surface results due to partition pruning. 
What this change proposes to the delta lake protocol is to allow timestamp partition values to be adjusted to UTC and explicitly stored in partition values with a UTC suffix. The original approach is still supported for compatibility but it is recommended for newer writers to write with UTC suffix. This is also important for Iceberg Uniform conversion because Iceberg timestamps must be UTC adjusted. Now we have a well defined format for UTC in delta, we can convert string partition values to Iceberg longs to make Uniform conversion succeed. This change updates the Spark-Delta integration to write out the UTC adjusted values for timestamp types. This also addresses an issue of microsecond partitions where previously microsecond partitioning (not recommended but technically allowed) would not work and be truncated to seconds. ## How was this patch tested? Added unit tests for the following cases: 1.) UTC timestamp partition values round trip across different session TZ 2.) A delta log with a mix of Non-UTC and UTC partition values round trip across the same session TZ 3.) Timestamp No Timezone round trips across timezones (kind of a tautology but important to make sure that the timestamp_ntz does not get written with UTC timestamp unintentionally) 4.) Timestamp round trips across same session time zone: UTC normalized 5.) Timestamp round trips across same session time zone: session time normalized (this case worked before this change, so it's important that it keeps working after this change) Mix of microsecond/second level precision and dates before epoch (to test if everything works with negative) ## Does this PR introduce _any_ user-facing changes? Yes in the sense that new timestamp partition values will be the normalized UTC values. 
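The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the Delta-Spark implementation: the helper name `to_utc_partition_value`, the fixed-offset session time zone, and the ISO-8601 output with a `Z` suffix are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def to_utc_partition_value(wall_time: str, utc_offset_hours: float) -> str:
    """Hypothetical helper: render a session-local wall-clock timestamp as a
    UTC-adjusted partition value string with an explicit UTC suffix."""
    # Model the session time zone as a fixed offset (an assumption for brevity).
    session_tz = timezone(timedelta(hours=utc_offset_hours))
    # Interpret the wall-clock string in the session time zone.
    local = datetime.fromisoformat(wall_time).replace(tzinfo=session_tz)
    # Shift to UTC and keep microsecond precision in the output.
    utc = local.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.%f") + "Z"


# The same instant written from two sessions with different time zones
# maps to one partition value, so partition pruning sees a single key.
a = to_utc_partition_value("2024-08-20 09:08:31", -6)
b = to_utc_partition_value("2024-08-20 17:08:31", 2)
print(a, b, a == b)
```

Without the UTC adjustment, the two sessions would have written `09:08:31` and `17:08:31` as different partition values for the same instant, which is the pruning mismatch the description refers to.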
commit 8f1b29789e54d232a2e456c97b0626e5bafb2a39 Author: zzl-7 <143959416+zzl-7@users.noreply.github.com> Date: Tue Aug 20 06:20:03 2024 -0700 [Kernel][Expressions] Add IS NOT DISTINCT FROM support (#3230) Adding support for `IS NOT DISTINCT FROM` (aka null safe equal) Resolves part of https://github.com/delta-io/delta/issues/2538 ## How was this patch tested? UTs commit a3d86dc4d304d88347d8e5f9480613291073ef30 Author: Marko Ilić Date: Tue Aug 20 07:02:58 2024 +0200 [Kernel] Add editable property to TableConfig (#3555) ## Description Added editable property to TableConfig. Changed `TableConfig#validateProperties` to check if the property is editable. Resolves #3455. ## How was this patch tested? Test added to `TableConfigSuite.scala`. commit 2c00219413c00b168b3bccd877b7375b0b8bccdc Author: Allison Portis Date: Mon Aug 19 13:40:03 2024 -1000 [Spark][Master] Fix compilation issues on Spark master (#3581) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Spark master compilation is currently broken due to this commit in Spark https://github.com/apache/spark/commit/a767f5cb1704075ee249169e8faf2ab3610b9dbc#diff-81eca9f7af2e9b23b13904131c3d32b0af9e9e1dcc7ddb5efba13201b00066c4 which changes the `Column` constructor. To fix, change any instances of `new Column(...)` to `Column(...)` so that we can use the constructors in `object Column`. ## How was this patch tested? Existing tests suffice. commit b2d82353060dc507f7139852d796fb6de4d5a315 Author: Adam Binford Date: Mon Aug 19 19:15:25 2024 -0400 [Spark] Add Scala `clone`, `cloneAtVersion`, and `cloneAtTimestamp` API (#3392) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Resolves #3391 Add Scala APIs for cloning a Delta Table: `clone`, `cloneAtVersion`, and `cloneAtTimestamp`.
Simply creates a `CloneTableStatement` like the SQL parser does and lets the analyzer handle the rest. I tried to mimic what limited info I could find on the current Databricks implementation: https://docs.databricks.com/en/delta/clone.html#language-scala. ## How was this patch tested? New suite extending `CloneTableSuiteBase` ## Does this PR introduce _any_ user-facing changes? New Scala API on DeltaTable for cloning --------- Co-authored-by: Thang Long VU Co-authored-by: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> commit 64d46026d4cea4ac37bfce5d084df47e6e8bb763 Author: Marko Ilić Date: Tue Aug 20 00:37:24 2024 +0200 [Kernel] Fixed the issue with reading a list of JSON files when some files are empty (#3537) ## Description Resolves #2251. Fixed the issue with reading a list of JSON files when some files are empty. Previously, the reading process would stop after encountering an empty file. ## How was this patch tested? Test added to `DefaultJsonHandlerSuite.scala`. commit 28546cedba226b9670f4a93155756f1e4ef0085c Author: Fred Storage Liu Date: Mon Aug 19 13:02:20 2024 -0700 use logical column name as physical name for Iceberg clone source tables (#3578) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description The current randomly assigned physical names do not match with anything in parquet file in source iceberg table, and caused it cannot convert type widened iceberg tables; ## How was this patch tested? manually tested the fix to convert a type widened iceberg table. ## Does this PR introduce _any_ user-facing changes? commit 95aff65fab185e297d4eda94c015eaae0322df6c Author: Zhipeng Mao Date: Mon Aug 19 19:32:07 2024 +0200 [SPARK] Refactor IdentityColumnTestUtils (#3568) #### Which Delta project/connector is this regarding? 
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR is part of https://github.com/delta-io/delta/issues/1959 . The change refactors `IdentityColumnTestUtils` to reuse `createTableWithIdColAndIntValueCol` to create tables and to unify the column names in identity column tests. ## How was this patch tested? This is test only change. ## Does this PR introduce _any_ user-facing changes? No. commit c67180b30a04d424e3419aef823608ed1d7eb615 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Mon Aug 19 19:31:33 2024 +0200 [Spark] Add Delta Connect Update/Delete Server and Scala Client (#3545) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Add support for `update` and `delete` for Delta Connect Server and Scala Client. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? No. --------- Co-authored-by: Dhruv Arya Co-authored-by: Qianru Lao <55441375+EstherBear@users.noreply.github.com> Co-authored-by: Zihao Xu Co-authored-by: Lukas Rupprecht Co-authored-by: Venki Korukanti Co-authored-by: jackierwzhang <67607237+jackierwzhang@users.noreply.github.com> commit d2a6fa023b7552e270eb4d2c48992e71c1f4071f Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Mon Aug 19 19:28:15 2024 +0200 [Spark] Enable Row Tracking Backfill by default in Delta (#3569) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Enable Row Tracking Backfill by default in Delta, after we extensively added the test coverage. ## How was this patch tested? Existing UTs. ## Does this PR introduce _any_ user-facing changes? No. 
commit 9aab676be5cc6143c78052c68b770c72d3323114 Author: Taiga Matsumoto Date: Mon Aug 19 09:20:41 2024 -0700 [Sharing] Update delta-sharing-client to 1.2.0 (#3564) commit da5a5d29b6f00265439bc84020580080a2d4568f Author: Jun <85203301+junlee-db@users.noreply.github.com> Date: Fri Aug 16 16:44:33 2024 -0700 [Spark] Do not use the same variable name in the coordinator registration code (#3570) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description In commit coordinator registration, we use the same variable name from the method and case use case which are bad practice. This PR simply fixes it. ## How was this patch tested? Simple variable renaming change + existing unit tests will cover them ## Does this PR introduce _any_ user-facing changes? No commit 597950c5ff5cf0aa2b76a3d5fba9da58fdba332e Author: Yan Zhao Date: Sat Aug 17 04:09:10 2024 +0800 [Kernel] Add `remove` schema to the single action schema (#3209) commit 09aa7a5ca5ebd8650d20b0e67f8895fdbc814576 Author: Allison Portis Date: Fri Aug 16 09:24:15 2024 -1000 [Spark] Use `encoderFor` to copy encoders for DeltaUDF (#3562) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This Spark commit https://github.com/apache/spark/commit/67d2888c476b6c472722f9cfebfb2e99302aff1c breaks Spark master compilation (specifically [this diff](https://github.com/apache/spark/commit/67d2888c476b6c472722f9cfebfb2e99302aff1c#diff-7852549cf1376f95414d804c8a9382a236179582cb4f03f836c284b0c1a81191L93-R99)). ## How was this patch tested? Existing tests suffice. commit 18eb1a6cbd09049eb835586a4d607fbd9673a1ee Author: Yumingxuan Guo Date: Fri Aug 16 11:01:48 2024 -0700 [Delta] Centralize checks for coordinated commits property via different code paths (#3561) #### Which Delta project/connector is this regarding? 
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description 1. Moves the checks for Coordinated Commits configurations for CLONE early to `CreateDeltaTableCommand.scala`. This also affects CREATE and REPLACE code paths; the same checks are placed on all three commands. 2. Modified the precedence of Coordinated Commits configurations for CLONE. Now for existing tables, it blocks Coordinated Commits overrides completely, even if the existing table had no commit coordinator before. We want the users to upgrade a table with Coordinated Commits only with ALTER command, and not piggyback in other commands like CLONE. 4. Updated the error messages to better reflect the exceptions. 5. Added UTs and modified existing UTs for the new logic. ## How was this patch tested? UTs. ## Does this PR introduce _any_ user-facing changes? No. commit 992be8412c1e0b843ef9881995d63800f95f9f02 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Fri Aug 16 18:53:30 2024 +0200 [Spark] Add tests that check row tracking preservation works on DMLs for backfilled enabled tables (#3472) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description - Add tests that check row tracking preservation works on DMLs for backfilled enabled tables, as opposed to having row tracking enabled from the start. - `FileMetadataMaterializationTrackerSuite` has a test that sets a different number of permits than the default in the `FileMetadataMaterializationTracker` but forgot to set it back, which causes unwanted failure in other test. We address that by using the default number of permits in that test instead of a different one. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? No. 
commit 489a6c3a8ba2548bfdc74b30725faeedb46054d0 Author: Venki Korukanti Date: Fri Aug 16 07:39:57 2024 -0700 [Kernel][Defaults] Add field id validation when writing data to Parquet files (#3511) ## Description Add validation of field IDs in the given schema of data before writing to the Parquet files. ## How was this patch tested? UTs commit 939b54ea2415c715bbc46f41338fda3cc602d25e Author: Venki Korukanti Date: Thu Aug 15 23:19:37 2024 -0700 [Kernel][IcebergCompatV2] Support writing nested field ids in schema to Parquet file (#3504) ## Description Currently, any nested field IDs that are created as of the `delta.icebergCompatV2` protocol feature are not written into the generated Parquet files as part of the data writes. This PR adds support. ## How was this patch tested? Unit and integration tests using the Parquet data files generated by Delta-Spark commit 703023ff376965091053abdb7c73569da4295248 Author: Lin Zhou <87341375+linzhou-db@users.noreply.github.com> Date: Fri Aug 16 13:42:04 2024 +0800 Fix possible stability issues when reusing deltaLog of delta sharing queries (#3514) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (delta sharing) ## Description 1. Add limitHint in the queryParamHashId used in delta log table path 2. Use fileIndexId instead of customId. ## How was this patch tested? Unit Test ## Does this PR introduce _any_ user-facing changes? No commit 92946fcad4ea6f75d2bcc43cabd146afc03548e4 Author: richardc-db <87336575+richardc-db@users.noreply.github.com> Date: Thu Aug 15 22:15:50 2024 -0700 [Kernel][VARIANT] Variant schema deserialization (#3464) ## Description schema parsing when "variant" is in the schema string. Introduces a new "VariantType" ## How was this patch tested? 
added UTs commit 06b56a02e50e4a8911e865fc1a45be0a15490c4c Author: Allison Portis Date: Thu Aug 15 15:32:40 2024 -1000 [INFRA] Fix the pip version in the dockerfile as well to accommodate versions with "-" in it (#3560) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (INFRA) ## Description Later versions of pip don't allow installing python packages with "-" in the name, but we use a version with "-SNAPSHOT" in it. Instead, fix the pip version to 24.0. ## How was this patch tested? Copied from our other CI. commit a056dc2546bb579c6f12882d2b767cd13be77f6a Author: Felipe Pessoto Date: Thu Aug 15 17:46:09 2024 -0700 [Infra] [Security] Update Scala and packages dependencies - Follow up (#3139) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [X] Other (connector, examples, benchmark) ## Description #2828 updated the SBT version for Spark Delta. This is a follow up to update other projects. - Update SBT to 1.9.9. [CVE-2023-46122](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2023-46122) ## How was this patch tested? CI ## Does this PR introduce _any_ user-facing changes? No --------- Signed-off-by: Felipe Pessoto commit f28c7e96789e40e36004645b7635ecab79550334 Author: Zhipeng Mao Date: Thu Aug 15 21:44:52 2024 +0200 [SPARK] Allow non-deterministic expressions in actions of merge (#3558) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description It makes `MergeIntoCommandBase` extend a trait `SupportsNonDeterministicExpression` in Spark that logical plans can extend to check whether it can allow non-deterministic expressions and pass the CheckAnalysis rule. `MergeIntoCommandBase` extends `SupportsNonDeterministicExpression` to check whether all the conditions in the Merge command are deterministic.
This is harmless and allows more flexible usage of merge. For example, we use a non-deterministic UDF to generate identity values for identity columns, so it is required to allow non-deterministic expressions in updated/inserted column values of merge statements in order to support merge on target tables with identity columns. So this PR is part of https://github.com/delta-io/delta/issues/1959. ## How was this patch tested? New test cases. ## Does this PR introduce _any_ user-facing changes? Yes. We are changing the behavior to allow non-deterministic expressions in updated/inserted column values of merge statements. We still don't allow non-deterministic expressions in conditions of merge statements. e.g. We currently don't allow the merge statement to add a random noise to the value that is inserted in merge ``` MERGE INTO target USING source ON target.key = source.key WHEN MATCHED THEN UPDATE SET target.value = source.value + rand() ``` Now we are allowing this as this may be helpful in terms of data privacy to not disclose the actual data while preserving the data properties e.g. mean values etc. commit 78479deb42f9c195bc901b3c0e8d5c0e0dd0e403 Author: Felipe Pessoto Date: Thu Aug 15 11:59:50 2024 -0700 [Spark] CI - Publish Spark Group only during Python tests (#2690) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Python tests are specific to Spark Delta, so there is no need to publish everything. ``` # Python tests are run only when spark group of projects are being tested. is_testing_spark_group = args.group is None or args.group == "spark" # Python tests are skipped when using Scala 2.13 as PySpark doesn't support it. is_testing_scala_212 = scala_version is None or scala_version.startswith("2.12") if is_testing_spark_group and is_testing_scala_212: run_python_tests(root_dir) ``` ## How was this patch tested? 
Unit Tests ## Does this PR introduce _any_ user-facing changes? No Signed-off-by: Felipe Pessoto commit e763fe93024d77f96239c81afe67dff9cafe64be Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Wed Aug 14 21:04:39 2024 +0200 [Spark] Add RowTrackingBackfill and Clone conflicts tests (#3550) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Add some `RowTrackingBackfill` and Clone conflicts tests. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? No. commit 2ef280689330f122625258bae847707c222713df Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Wed Aug 14 21:04:25 2024 +0200 [Spark] Add Delta Connect Describe History/Detail, Convert To Delta, Restore and isDeltaTable (#3490) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Add support for `describeHistory/Detail`, `convertToDelta`, `RestoreTable` and `isDeltaTable` for Delta Connect Server and Scala Client. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? No. commit b7a040bedad310fb30fc38a4d22a0bf2d680d9b8 Author: Eduard Tudenhoefner Date: Wed Aug 14 18:16:41 2024 +0200 [Infra] Increase test parallelism (#3524) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Infra) ## Description Increase test JVM parallelization to 4. ## How was this patch tested? CI tests. ## Does this PR introduce _any_ user-facing changes? No. 
commit db67cc20f7964eeeb35fd30e2be49bc77416e4b4 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Wed Aug 14 17:37:32 2024 +0200 [Spark] Add RowTrackingBackfillBackfillConflictsSuite (#3551) ## Description Add `RowTrackingBackfillBackfillConflictsSuite` with test cases for concurrent backfills. ## How was this patch tested? Added UTs. commit f1ea35c05059646f154c96f61f636e5cfd4ef23f Author: Prakhar Jain Date: Wed Aug 14 08:36:55 2024 -0700 [Spark] Fix time travel utility method for ICT (#3553) ## Description Fix time travel utility method for ICT. DeltaHistoryManager modified the in-commit-timestamp to custom values for testing specific scenarios. This is done by getting the delta file from the snapshot's logsegment and modifying its content. Sometimes the Snapshot might be pointing to UUID files - in such cases we should also modify the timestamp in the backfilled commit.json file. ## How was this patch tested? Existing UTs commit 48def61dca62f8320845182a4eb38ff46e9b306c Author: Zhipeng Mao Date: Wed Aug 14 17:34:09 2024 +0200 [SPARK][TEST-ONLY] Add more tests for Identity Column (#3526) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR is part of https://github.com/delta-io/delta/issues/1959 . It adds more tests for Identity Column, covering: - logging identity column properties and stats - reading table should not see identity column properties - compatibility with table of older protocols - identity value generation starting at range boundaries of long data type ## How was this patch tested? Test-only change. ## Does this PR introduce _any_ user-facing changes? No. commit 51ecfe5ef1b759e88d7e799b94c1919c4aa9eef0 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Wed Aug 14 02:04:39 2024 +0200 [Spark] Remove duplicate unblockCommit to fix GitHub tests (#3547) #### Which Delta project/connector is this regarding?
- [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Two PRs both added `unblockCommit` at around the same time, which causes a compile error in the repo. Remove one of them to fix the problem. ## How was this patch tested? Existing UTs. ## Does this PR introduce _any_ user-facing changes? No. commit 184ffc0770a1d34d597576d2def00c868d1bb973 Author: Yuya Ebihara Date: Wed Aug 14 07:36:50 2024 +0900 [Protocol] Fix style in PROTOCOL.md (#3488) commit e2cbd5b29eeee6334fc0232c448a4c191edee926 Author: Scott Sandre Date: Tue Aug 13 13:54:20 2024 -0700 [INFRA] Cleanup build.sbt build output (#3480) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (INFRA) ## Description Remove print statements from Iceberg assembly output. ## How was this patch tested? CI tests. ## Does this PR introduce _any_ user-facing changes? No commit 179040cffb5ac45cdf7cc239a8ef96a1a6e3f855 Author: Venki Korukanti Date: Tue Aug 13 13:37:13 2024 -0700 [Spark] Upgrade Spark version to 3.5.2 (#3417) commit dbdcd0b143d1b13b4f06f1928c0e19607145edf3 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Tue Aug 13 21:39:31 2024 +0200 [Spark] Add RowTrackingBackfillConflictsSuite (#3540) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Add extensive tests that check various conflict scenarios between Backfill and other concurrent transactions. Note: We have not yet covered the case of conflicts between Row Tracking Backfill and Clone, and also the case of multiple Row Tracking Backfills, which will be added in follow-up PRs. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? No.
commit 7122c2525b4cdad1eb9836d99b349521d5afe84b Author: Venki Korukanti Date: Tue Aug 13 10:16:06 2024 -0700 [Docs] Minor update to docs build instructions (#3543) An additional step is needed to build the docs locally. commit 6bc1eb03c1a73529f9d02264ca568a1d94ac39ad Author: Chirag Singh <137233133+chirag-s-db@users.noreply.github.com> Date: Tue Aug 13 09:44:39 2024 -0700 [Kernel] Always read the checkpoint manifest for CPV2 if sidecar files are present (#3509) ## Description The sidecar actions in the checkpoint manifest of a CPV2 table should always be read regardless of the checkpoint predicate, as actions satisfying the predicate may be found in the sidecar file. ## How was this patch tested? See added test. commit 55a038c117e1c11cf5e863fd2f3ed3c3190d53e6 Author: Zhipeng Mao Date: Tue Aug 13 18:29:59 2024 +0200 [SPARK] Relax metadata conflict for identity column (#3525) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR is part of https://github.com/delta-io/delta/issues/1959. The PR relaxes metadata conflict for identity column SYNC high water mark operation. When the winning transaction contains an identity column metadata change and the current transaction does not contain a metadata change, we mark the current transaction as having no metadata conflict. ## How was this patch tested? A new test suite. ## Does this PR introduce _any_ user-facing changes? No. commit cfd420dbe1d0732bac7d2e36afb654173286f50f Author: Amogh Jahagirdar Date: Tue Aug 13 10:12:10 2024 -0600 Infra: Shard spark and spark master tests across 3 actions (#3517) This change adds a matrix of 3 shards to github CI for the delta spark/spark master tests and plumbs those through to our parallelization strategy which will then select tests to run if they are assigned to that shard. This brings down CI times from ~4 hours to 2 hours.
3 shards seemed to be the point of diminishing returns, and we should look at the actual tests and possibly higher instance types if we really want to get those down further. No sharding: 4 hours 2 shards: 3 hours 3 shards: ~2 hours 4 shards: ~2 hours ### How to revert To undo this change without a full revert commit there are a few ways (in order of recommendation) 1. Remove/comment out *both* the shard row in the GH matrix https://github.com/delta-io/delta/pull/3517/files#diff-b77047dc65c62d814f7f67eec57c23c4cf9d9796e6a8a018558e0e935739d8d8R10-R11 AND the NUM_SHARDS env variable https://github.com/delta-io/delta/pull/3517/files#diff-b77047dc65c62d814f7f67eec57c23c4cf9d9796e6a8a018558e0e935739d8d8R15 2. Comment out this block https://github.com/delta-io/delta/pull/3517/files#diff-00b77d52981e1ba8cc5ac9b80b22d8f7bb9814222121b714269e14e139daaae0R141-R152 in TestParallelization commit 1606fdf9a7e6656254d92408446099d1c35cfda8 Author: Yuya Ebihara Date: Tue Aug 13 23:39:29 2024 +0900 [Kernel] Minor cleanup in delta-kernel (#3539) commit 3ba54a18852f625f936a53b2e404069030bc63bb Author: Venki Korukanti Date: Mon Aug 12 16:37:50 2024 -0700 [Kernel][IcebergCompatV2] Validate `add` actions have `numRecords` in statistics (#3534) ## Description Delta [IcebergCompatV2](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#iceberg-compatibility-v2) requires each `add` action added to the Delta Log to contain `numRecords` in statistics. This PR adds the change to enforce the check as part of `Transaction.generateAppendActions`. ## How was this patch tested? Unit tests. commit 99c7b0b6ab157c94237612eccf46362a16ed39c9 Author: Venki Korukanti Date: Mon Aug 12 13:11:14 2024 -0700 [Kernel][IcebergCompatV2] Include partition columns in physical data written to files (#3530) ## Description When `icebergCompatV2` is enabled, include the partition columns in the physical data that are going to be written in the data files.
`Transaction.transformLogicalData` takes care of converting the table logical data into physical data that should go into the Parquet files. One of the transformations done is removing the partition columns that are not required to be in data files for Delta. `icebergCompatV2` requires partition columns to be in data files. Update `Transaction.transformLogicalData` so that partition columns are not removed from physical data if the table has `icebergCompatV2` enabled. ## How was this patch tested? Unit tests commit 4910909bc969aebf6407f0839046a6491b12ae90 Author: jackierwzhang <67607237+jackierwzhang@users.noreply.github.com> Date: Mon Aug 12 10:20:55 2024 -0700 Do not leak column mapping metadata during streaming read (#3487) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Column mapping metadata should be pruned during a streaming read, just like that of batch read. ## How was this patch tested? New UT. ## Does this PR introduce _any_ user-facing changes? No commit bb12b94d9c0ecb833084c49cbc595b81a8f67b8e Author: Venki Korukanti Date: Sun Aug 11 10:23:44 2024 -0700 [Kernel][Default Parquet Writer] Always return the row count in returned file statistics (#3515) ## Description Currently, the default Parquet writer returns `numRows` statistics only if one or more column statistics are requested. This change always returns `numRows` in `DataFileStatus`. It is easy to compute the number of rows written without reading the Parquet footer. Reading the Parquet footer is needed for finding the column stats of the written file. The `numRows` stat is needed for supporting writing into Uniform enabled tables. ## How was this patch tested? Modify existing tests.
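Note on #3530 above: a minimal sketch of the partition-column rule that `Transaction.transformLogicalData` is described as applying. The function `physical_columns` and its parameters are hypothetical and written in Python for brevity; the Kernel code is Java and works on columnar batches, not on lists of column names.

```python
from typing import List, Sequence


def physical_columns(
    schema_columns: Sequence[str],
    partition_columns: Sequence[str],
    iceberg_compat_v2_enabled: bool,
) -> List[str]:
    """Return the column names that would be written to the data files.

    Hypothetical helper: plain Delta drops partition columns from the
    physical data, while icebergCompatV2 keeps them.
    """
    if iceberg_compat_v2_enabled:
        # icebergCompatV2 requires partition columns in the data files.
        return list(schema_columns)
    partitions = set(partition_columns)
    # Preserve schema order while removing the partition columns.
    return [c for c in schema_columns if c not in partitions]


columns = ["id", "date", "value"]
without_compat = physical_columns(columns, ["date"], iceberg_compat_v2_enabled=False)
with_compat = physical_columns(columns, ["date"], iceberg_compat_v2_enabled=True)
```

Here `without_compat` omits `date`, while `with_compat` keeps all three columns.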
commit 87a073ab17acf132b7c0921c58819f7fb647bd44 Author: Lukas Rupprecht Date: Fri Aug 9 12:49:11 2024 -0700 [Spark] Makes LogSegment comparison aware of coordinated commits (#3506) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR updates the comparison (equals) method of LogSegments to consider coordinated commits. The reason for this change is as follows: The current comparison logic determines whether a LogSegment is equal to another LogSegment, if the deltas match and the modification time of the last delta is equal for both LogSegments. However, on a coordinated commits table, this may not be true anymore. This is because after a commit, we update the LogSegment by appending the last commit to it to create the post commit snapshot. In a coordinated commits table, this will be a UUID commit. However, by the next time DeltaLog.update is called, that UUID commit could have already been backfilled and so the LogSegment determined as part of the update call will contain the backfilled commit, which will have a different modification time compared to its UUID counterpart. As a result, even though the LogSegments represent identical table states, they would be identified as different. The problem is that if we determine that the LogSegment from the previous snapshot is different to the LogSegment created by update, we will swap the old snapshot with the new snapshot. However, if these are indeed identical, then the swapping is not necessary and leads to Delta losing any cached state on the swapped out snapshot. This can lead to unnecessary slowdowns. This PR fixes this issue by updating the LogSegment comparison logic. Instead of comparing the last file in the segment, we compare the minimum last backfilled file across both segments. 
For example, if segment 1 contains files 0.json, 1.json, 2.uuid.json, and 3.uuid.json and segment 2 contains 0.json, 1.json, 2.json, and 3.json, we compare 1.json and if that matches in both segments, we assume that the segments are equal (they also need to match in length of their deltas). In addition to the LogSegment comparison fix, we also introduce a new member on Snapshot, that captures the correct last known backfilled version. In the example above, segment 1 would report 1.json as the last known backfilled version but segment 2 already contains 3.json as the last known backfilled version. To correctly determine, whether all commits for a specific table version (snapshot) have been backfilled, we update this state on the snapshot whenever we detect a stale LogSegment. ## How was this patch tested? Updated existing and added new unit tests. ## Does this PR introduce _any_ user-facing changes? No commit 9bb45d244ffb009fc4f6c2552fff2fde5f7a1203 Author: Zihao Xu Date: Fri Aug 9 11:55:35 2024 -0700 [Spark][UniForm] Add `UniversalFormatSuite` and utils for `IcebergCompatV1/V2` & `REORG UPGRADE UNIFORM` (#3507) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description this PR adds unit tests in `UniversalFormatSuiteBase` and the corresponding testing utilities, especially for `IcebergCompatV1/V2` table features and `REORG UPGRADE UNIFORM` command. - `UniversalFormatSuiteBase` contains all common unit tests. - `UniversalFormatSuite` contains the actual classes to run the corresponding suites, it also contains helper traits for testing. ## How was this patch tested? through unit tests in `UniversalFormatSuite`. ## Does this PR introduce _any_ user-facing changes? no. 
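Note on #3506 above: the LogSegment comparison can be sketched as follows. This is an illustration under assumptions, not the Delta Spark code. `DeltaFile`, `last_backfilled_version`, and `segments_equal` are hypothetical names, modification times are plain integers, and two segments without any backfilled file are treated as equal when their versions match.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DeltaFile:
    version: int
    backfilled: bool        # False for UUID commits such as 2.<uuid>.json
    modification_time: int  # hypothetical integer timestamp


def last_backfilled_version(deltas: List[DeltaFile]) -> Optional[int]:
    """Highest version that exists as a backfilled file, if any."""
    versions = [f.version for f in deltas if f.backfilled]
    return max(versions) if versions else None


def segments_equal(a: List[DeltaFile], b: List[DeltaFile]) -> bool:
    # The deltas must cover the same versions (this also checks the length).
    if [f.version for f in a] != [f.version for f in b]:
        return False
    last_a, last_b = last_backfilled_version(a), last_backfilled_version(b)
    if last_a is None or last_b is None:
        return True  # assumption: nothing backfilled to compare
    # Compare the minimum last backfilled file across both segments.
    pivot = min(last_a, last_b)
    time_a = next(f.modification_time for f in a if f.version == pivot)
    time_b = next(f.modification_time for f in b if f.version == pivot)
    return time_a == time_b


# Segment 1: 0.json, 1.json, 2.uuid.json, 3.uuid.json
segment_1 = [DeltaFile(0, True, 100), DeltaFile(1, True, 101),
             DeltaFile(2, False, 102), DeltaFile(3, False, 103)]
# Segment 2: the same versions after backfill, with new modification times
segment_2 = [DeltaFile(0, True, 100), DeltaFile(1, True, 101),
             DeltaFile(2, True, 202), DeltaFile(3, True, 203)]
```

With these inputs the pivot is version 1, so the two segments compare as equal even though versions 2 and 3 carry different modification times.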
commit 28ca61af03485e9508a3430ad9ee7af939b0d62c Author: Qianru Lao <55441375+EstherBear@users.noreply.github.com> Date: Fri Aug 9 09:02:36 2024 -0700 [Kernel] Update the Kernel read with coordinated commit support (#3381) ## Description Update the Kernel read with coordinated commit support and prepare for Kernel coordinated commit write support. ## How was this patch tested? Unit tests commit 4cc50bc33632044f19a295507654460292656f64 Author: Dhruv Arya Date: Fri Aug 9 08:58:34 2024 -0700 [Spark][Kernel][Protocol] Merge InCommitTimestamp RFC, and remove the -preview suffix from feature name and configs. (#3416) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [X] Other (Protocol) ## Description 1. Merges the InCommitTimestamp RFC 2. Removes the -preview suffix from the feature name and properties. ## How was this patch tested? Existing tests should cover this change. ## Does this PR introduce _any_ user-facing changes? No commit 3cebe546aaf9876da821da7036e4eae951507479 Author: Rakesh Veeramacheneni <173086727+raveeram-db@users.noreply.github.com> Date: Thu Aug 8 17:50:44 2024 -0700 [Kernel] Support data skipping on timestamp/timestampNtz columns (#3481) Adds a `TIMEADD` scalar expression to the data skipping logic. This addresses issues arising from TIMESTAMP being truncated to millisecond precision when serialized to JSON. For example, a file containing only `01:02:03.456789` will be written with `min == max == 01:02:03.456`, so we must consider it to contain the range from `01:02:03.456 to 01:02:03.457`. Resolves #2462. ## How was this patch tested? Unit tests. commit 78cdeb06d5c9f141b108f8644d03586b201eb185 Author: Zhipeng Mao Date: Thu Aug 8 18:25:49 2024 +0200 [Spark] Python DeltaTableBuilder API for Identity Columns (#3404) #### Which Delta project/connector is this regarding? 
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR is part of https://github.com/delta-io/delta/issues/1959 In this PR, we extend the addColumn interface in DeltaTableBuilder to allow for Identity Columns creation. Resolves https://github.com/delta-io/delta/issues/1072 ## How was this patch tested? New tests. ## Does this PR introduce _any_ user-facing changes? We update the arguments of addColumn method: - Support a new data type for parameter `generatedAlwaysAs`. Users can specify `generatedAlwaysAs` as `IdentityGenerator` to add an identity column that is GENERATED ALWAYS. - Add a new parameter `generatedByDefaultAs`. Users can specify `generatedByDefaultAs` as `IdentityGenerator` to add an identity column that is GENERATED BY DEFAULT. - Users can optionally pass in `start` (default = 1) and `step` (default = 1) values to construct `IdentityGenerator` object, which specify the start and step value to generate the identity column. Interface ``` def addColumn( self, colName: str, dataType: Union[str, DataType], nullable: bool = True, generatedAlwaysAs: Optional[Union[str, IdentityGenerator]] = None, generatedByDefaultAs: Optional[IdentityGenerator] = None, comment: Optional[str] = None, ) -> "DeltaTableBuilder" ``` Example Usage ``` DeltaTable.create() .tableName("tableName") .addColumn("id", dataType=LongType(), generatedAlwaysAs=IdentityGenerator()) .execute() DeltaTable.create() .tableName("tableName") .addColumn("id", dataType=LongType(), generatedAlwaysAs=IdentityGenerator(start=1, step=1)) .execute() DeltaTable.create() .tableName("tableName") .addColumn("id", dataType=LongType(), generatedByDefaultAs=IdentityGenerator()) .execute() DeltaTable.create() .tableName("tableName") .addColumn("id", dataType=LongType(), generatedByDefaultAs=IdentityGenerator(start=1, step=1)) .execute() ``` --------- Co-authored-by: Carmen Kwan commit fd75e83a94127075263be4e55ddd7203b7b9675b Author: Thang Long Vu 
<107926660+longvu-db@users.noreply.github.com> Date: Thu Aug 8 18:25:28 2024 +0200 [Spark] Allowing Row Tracking enablement only txn to not fail concurrent no metadata update txns (#3473) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Introducing a new `DeltaCommitTag` which allows ALTER TABLE commands that do row tracking enablement only not to fail concurrent transactions if they do not do metadata update. This is a special case of metadata update conflict that we can safely resolve. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? No. commit c23b9731408afce10d2d822432a16b1321accac9 Author: Eduard Tudenhoefner Date: Thu Aug 8 18:22:07 2024 +0200 [Kernel] build/sbt javafmtAll (#3489) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description This re-formats the code in the Kernel by running `build/sbt javafmtAll` and enforcing the style during compilation. ## How was this patch tested? ## Does this PR introduce _any_ user-facing changes? commit 08b67d78cf61b1cec6da0c1dbaaab331b252d2c8 Author: Lukas Rupprecht Date: Wed Aug 7 18:36:25 2024 -0700 [Spark] Writing of UUID commits should not use put-if-absent semantics (#3501) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR fixes the coordinated commits utils to not write UUID-based commit files with put-if-absent semantics. This is not necessary because we assume that UUID-based commit files are globally unique so we will never have concurrent writers attempting to write the same commit file. ## How was this patch tested? Existing tests are sufficient as this only affects how a commit is written in the underlying storage layer but does not change any logic in Delta Spark. 
## Does this PR introduce _any_ user-facing changes? No commit e82900633bacbfa1b61609ffdb99695ada4ff919 Author: Zhipeng Mao Date: Wed Aug 7 22:27:08 2024 +0200 [Spark] Block unsupported operations on identity columns (#3457) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR is part of https://github.com/delta-io/delta/issues/1959 In this PR, we block unsupported operations on identity columns including: - ALTER TABLE ALTER COLUMN is not supported for IDENTITY columns. - Providing values for GENERATED ALWAYS AS IDENTITY column is not supported. - PARTITIONED BY IDENTITY column is not supported. - ALTER TABLE REPLACE COLUMNS is not supported for table with IDENTITY columns. - UPDATE on IDENTITY column is not supported. ## How was this patch tested? A new test suite `IdentityColumnAdmissionScalaSuite` is added. ## Does this PR introduce _any_ user-facing changes? Yes. The aforementioned operations on identity columns are blocked. commit 1af62b0a47b93b81306d6a9615e0b94cfa731d65 Author: Zihao Xu Date: Wed Aug 7 10:01:41 2024 -0700 [Spark][UniForm] Add test suite/utils for enable UniForm without rewrite (#3431) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description add the test suite `IcebergCompatV2EnableUniformByAlterTableSuite` and `ParquetIcebergCompatV2Utils` for enable UniForm without rewrite. ## How was this patch tested? through the newly-added unit tests. ## Does this PR introduce _any_ user-facing changes? no. commit 019cb809e23dcd287ed580b6ea939125a546a7cb Author: Charlene Lyu Date: Wed Aug 7 09:23:28 2024 -0700 [Sharing] Upgrade delta-sharing-client to 1.1.1 (#3492) ## Description Upgrade delta-sharing-client to 1.1.1 ## How was this patch tested? Existing unit tests. 
commit 4f768f1bfb0de003af5e4f20d2cd0fcbb857b996 Author: Robert Dillitz Date: Wed Aug 7 17:45:43 2024 +0200 [Spark] Fix AttributeReference mismatch for readChangeFeed queries coming from Spark Connect (#3451) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This fixes an issue in the `DeltaAnalysis` rule, more specifically `fromV2Relation`, that leads to Spark Connect `readChangeFeed` queries failing when applying Projections, Selections, etc. to the underlying table's columns due to an `AttributeReference` mismatch: Spark Connect Query: ``` spark.read.format("delta").option("readChangeFeed", "true").option("startingVersion", 0) .table("main.dillitz.test").select("id").show() ``` Unresolved Logical Plan: ``` common { plan_id: 15 } project { input { common { plan_id: 14 } read { named_table { unparsed_identifier: "main.dillitz.test" options { key: "startingVersion" value: "0" } options { key: "readChangeFeed" value: "true" } } } } expressions { unresolved_attribute { unparsed_identifier: "id" } } } ``` Resolved Logical Plan: ``` 'Project ['id] +- 'UnresolvedRelation [dillitz, default, test], [startingVersion=0, readChangeFeed=true], false ``` Plan before DeltaAnalysis rule: ``` Project [id#594L] +- SubqueryAlias dillitz.default.test +- RelationV2[id#594L] dillitz.default.test dillitz.default.test ``` Plan after DeltaAnalysis rule: ``` !Project [id#594L] +- SubqueryAlias spark_catalog.delta.`/private/var/folders/11/kfrr0zqj4w3_lb6mpjk76q_00000gp/T/spark-8f2dc5b0-6722-4928-90bb-fba73bd9ce87` +- Relation [id#595L,_change_type#596,_commit_version#597L,_commit_timestamp#598] DeltaCDFRelation(SnapshotWithSchemaMode(...)) ``` Error: ``` org.apache.spark.sql.catalyst.ExtendedAnalysisException: [MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION] Resolved attribute(s) "id" missing from "id", "_change_type", "_commit_version", "_commit_timestamp" in operator !Project 
[id#493L]. Attribute(s) with the same name appear in the operation: "id". Please check if the right attribute(s) are used. SQLSTATE: XX000; !Project [id#493L] +- SubqueryAlias dillitz.default.test +- Relation dillitz.default.test[id#494L,_change_type#495,_commit_version#496L,_commit_timestamp#497] DeltaCDFRelation(SnapshotWithSchemaMode(...)) at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:55) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:694) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2$adapted(CheckAnalysis.scala:197) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:287) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:197) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:179) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:341) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:167) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:155) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:155) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:341) at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$2(Analyzer.scala:396) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:169) at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:396) at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:443) at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:393) at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:260) at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:441) at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$4(QueryExecution.scala:600) at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:1152) at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:600) at com.databricks.util.LexicalThreadLocal$Handle.runWith(LexicalThreadLocal.scala:63) at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:596) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1180) at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:596) at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:254) at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:253) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:235) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:105) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1180) at org.apache.spark.sql.SparkSession.$anonfun$withActiveAndFrameProfiler$1(SparkSession.scala:1187) at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94) at org.apache.spark.sql.SparkSession.withActiveAndFrameProfiler(SparkSession.scala:1187) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:103) at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformShowString(SparkConnectPlanner.scala:323) at 
org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelation$1(SparkConnectPlanner.scala:169) at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$3(SessionHolder.scala:480) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.connect.service.SessionHolder.usePlanCache(SessionHolder.scala:479) at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:166) at org.apache.spark.sql.connect.execution.SparkConnectPlanExecution.handlePlan(SparkConnectPlanExecution.scala:90) at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.handlePlan(ExecuteThreadRunner.scala:312) at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1(ExecuteThreadRunner.scala:244) at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1$adapted(ExecuteThreadRunner.scala:176) at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$2(SessionHolder.scala:343) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1180) at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$1(SessionHolder.scala:343) at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:97) at org.apache.spark.sql.artifact.ArtifactManager.$anonfun$withResources$1(ArtifactManager.scala:83) at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:237) at org.apache.spark.sql.artifact.ArtifactManager.withResources(ArtifactManager.scala:82) at org.apache.spark.sql.connect.service.SessionHolder.withSession(SessionHolder.scala:342) at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.executeInternal(ExecuteThreadRunner.scala:176) at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.org$apache$spark$sql$connect$execution$ExecuteThreadRunner$$execute(ExecuteThreadRunner.scala:126) at 
org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.$anonfun$run$2(ExecuteThreadRunner.scala:530) at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:51) at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:103) at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:108) at scala.util.Using$.resource(Using.scala:269) at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:107) at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.run(ExecuteThreadRunner.scala:529) ``` ## How was this patch tested? Created an E2E Spark Connect test running queries like the one above. Not sure how to merge it into this repository. ## Does this PR introduce _any_ user-facing changes? No. commit 100cc4dba42ce9a88999c6c0b4fb31400bfaf0cb Author: Dhruv Arya Date: Tue Aug 6 17:19:06 2024 -0700 [Spark] Allow stale reads when commit coordinator has not been registered (#3454) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Snapshot construction will now be allowed even if the table has a configured coordinator but the coordinator implementation has not been registered. Writes will still be blocked in such cases. The change has been gated behind a new flag. ## How was this patch tested? Added a new test in CoordinatedCommitsSuite. ## Does this PR introduce _any_ user-facing changes? No commit f8d7d76a272a0bb5c86cfbbcb4f19fe904010ac2 Author: Carmen Kwan Date: Tue Aug 6 20:59:38 2024 +0200 [Spark] ALTER TABLE ALTER COLUMN SYNC IDENTITY SQL support (#3005) ## Description This PR is part of https://github.com/delta-io/delta/issues/1959 In this PR, we add SQL support for `ALTER TABLE ALTER COLUMN SYNC IDENTITY`. This is used for GENERATED BY DEFAULT Identity Columns, where a user may want to manually update the identity column high watermark. ## How was this patch tested? 
This PR adds a new test suite `IdentityColumnSyncSuite`. ## Does this PR introduce _any_ user-facing changes? Yes. We introduce the SQL syntax `ALTER TABLE (ALTER| CHANGE) COLUMN? SYNC IDENTITY` into Delta. This will update the high watermark stored in the metadata for that specific identity column. **Example Usage** ``` ALTER TABLE ALTER COLUMN id SYNC IDENTITY ALTER TABLE CHANGE COLUMN id SYNC IDENTITY ALTER TABLE ALTER id SYNC IDENTITY ALTER TABLE CHANGE id SYNC IDENTITY ``` --------- Co-authored-by: zhipeng.mao Co-authored-by: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> commit 93ad94f0bf45bca4c7f674c5b015e0f4b8fc9daa Author: Amogh Jahagirdar Date: Tue Aug 6 11:29:58 2024 -0700 [UNIFORM] Remove timestamp partition patch as it is not effective currently (#3486) ## Description Delta-Iceberg uniform currently does not support partitioning on timestamps. There was originally a patch intended to address that from the beginning of the project, but it is not effective because Iceberg internally relies on long timestamp values since epoch, while the patch converts to java.sql.Timestamp; as a result the conversion currently fails with ``` IllegalArgumentException: Wrong class, expected java.lang.Long, but was java.sql.Timestamp, for object ``` Since this patch is essentially ineffective, I think it makes the most sense to remove it. Removing it is also very important because it is required to enable us to upgrade Iceberg versions: this patch does not cleanly apply on anything after Iceberg 1.2! Note: I am also working towards adding this support so this gap will be closed soon. ## How was this patch tested? Existing CI ## Does this PR introduce _any_ user-facing changes? 
Technically, a user with a timestamp-partitioned table will now encounter a different error. Before this change the error would manifest as: ``` IllegalArgumentException: Wrong class, expected java.lang.Long, but was java.sql.Timestamp, for object: ``` After this change the error would be ``` "Unsupported type for fromPartitionString: Timestamp" ``` Considering that timestamp partitioning is unsupported, the new error is a bit clearer. Note: I am also working towards adding this support so this gap will be closed soon. commit c1f42375eaf99c61dc4614fc7352c3af3cd7e874 Author: Eduard Tudenhoefner Date: Tue Aug 6 17:35:48 2024 +0200 [Kernel] Configure code formatter for Java Code (#3466) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description This configures a code formatter using *google-java-format* for the Java code in the Kernel (similarly to how it was done in https://github.com/unitycatalog/unitycatalog/commit/54b76d88255dd2baa3f11e515159cfd34cb295e2). Code can be checked by running `build/sbt javafmtCheckAll`. Code can be formatted automatically by running `build/sbt javafmtAll`. Once this PR is in, we should open follow-up PRs where the code in the Kernel is properly formatted. After that we can enforce the new code style during compilation. ## How was this patch tested? ## Does this PR introduce _any_ user-facing changes? commit cb6e386864fbd606cde81b8cffbecaf2eff868c9 Author: Christos Stavrakakis Date: Mon Aug 5 23:38:02 2024 +0200 [Spark] Add commit version and logical records in Delta DML metrics (#3458) ## Description Extend `delta.dml.{merge, update, delete}.stats` metrics with the following fields: - `commitVersion`: The commit version of the DML operation. This allows associating DML metrics with commit metrics and distinguishing DML operations that did not commit. 
- `numLogicalRecordsAdded` and `numLogicalRecordsRemoved`: The number of logical records in AddFile and RemoveFile actions to be committed. These metrics can be compared to the row-level metrics emitted by the DML operations. Finally, this commit adds the `isWriteCommand` field in DELETE metrics to distinguish DELETE operations that are performed in the context of WRITE commands that selectively overwrite data. ## How was this patch tested? Log-only changes. Existing tests. commit 6463b3e2359909f56b1e8750fcbb050ba8404059 Author: Lukas Rupprecht Date: Mon Aug 5 11:49:33 2024 -0700 [Spark] Uses java-based coordinated commits classes in Delta Spark (#3470) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR adopts the coordinated commits interfaces from the storage module in Delta Spark. It removes the existing scala classes and adds the necessary conversion code from java -> scala (and in the opposite direction) where necessary. ## How was this patch tested? Adds some unit tests for the critical code pieces (action serialization/deserialization and LogStore conversion). For the remainder, existing tests are sufficient. ## Does this PR introduce _any_ user-facing changes? No commit 9be04ba143373f14c3b4a6e39822b27adf34fbfa Author: Dhruv Arya Date: Mon Aug 5 11:49:07 2024 -0700 [Spark] Add an integration test for DynamoDB Commit Coordinator (#3158) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Adds an integration test for the DynamoDB Commit Coordinator. Tests the following scenarios 1. Automated dynamodb table creation 2. Concurrent reads and writes 3. Table upgrade and downgrade The first half of the test is heavily borrowed from `dynamodb_logstore.py`. ## How was this patch tested? Test runs successfully with real DynamoDB and S3. 
Set the following environment variables (after setting the credentials in ~/.aws/credentials): ``` export S3_BUCKET= export AWS_PROFILE= export RUN_ID= export AWS_DEFAULT_REGION= ``` Ran the test: ``` ./run-integration-tests.py --use-local --run-dynamodb-commit-coordinator-integration-tests \ --dbb-conf io.delta.storage.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider \ spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider \ --dbb-packages org.apache.hadoop:hadoop-aws:3.4.0,com.amazonaws:aws-java-sdk-bundle:1.12.262 ``` ## Does this PR introduce _any_ user-facing changes? commit 1248c5ca2606edd48b10fb7ef468da77597e176a Author: Robert Dillitz Date: Mon Aug 5 18:58:21 2024 +0200 [Spark] Fix DeltaConnectPlannerSuite by copying the moved createDummySessionHolder (#3465) ## Description Fixes `DeltaConnectPlannerSuite` by replacing `SessionHolder.forTesting` with a copy of `createDummySessionHolder`, as this method got moved in the Spark master branch: https://github.com/apache/spark/commit/acb2fecb8c174fa4e2f23c843a904161151c8dfa ## How was this patch tested? Fixes test. commit ac2bcb4a6f82b9ead91e2ccae868ebecd4a87ae9 Author: Fred Storage Liu Date: Mon Aug 5 09:57:00 2024 -0700 [Spark] populate Delta clone override table properties to catalog (#3469) ## Description populate clone override table properties to catalog, which is missed in the current impl ## How was this patch tested? UT commit 930901237d814cf76595c5670855009e1e88e778 Author: Zhipeng Mao Date: Mon Aug 5 18:55:35 2024 +0200 [Spark] Add DELTA_TESTING=1 environment variable when running Python tests (#3444) ## Description In Python tests, we want to test features that are only for testing. But `DELTA_TESTING=1` is missing for Python tests. So this PR adds it to the environment variable when running Python tests. 
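The idea in #3444 above, running Python tests with `DELTA_TESTING=1` in their environment, can be illustrated with a minimal sketch. This is not the code from the Delta test runner: the helper name `run_with_delta_testing` is hypothetical, and the child command is just a one-line Python script that echoes the variable back.

```python
import os
import subprocess
import sys


def run_with_delta_testing(cmd):
    """Run `cmd` in a child process with DELTA_TESTING=1 set.

    Hypothetical helper for illustration only. The parent's environment
    is copied, so the variable is added for the child without mutating
    os.environ in the calling process.
    """
    env = dict(os.environ)
    env["DELTA_TESTING"] = "1"
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


if __name__ == "__main__":
    # The child prints the value it sees for DELTA_TESTING.
    child = [sys.executable, "-c", "import os; print(os.environ.get('DELTA_TESTING'))"]
    print(run_with_delta_testing(child).stdout.strip())
```

Copying the environment rather than assigning to `os.environ` keeps the flag scoped to the test subprocess, which is one reasonable way to avoid leaking a testing-only switch into unrelated code in the same process.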
commit 9151a5466217456e51079687d869c240b7dbb308 Author: Zhipeng Mao Date: Mon Aug 5 18:53:51 2024 +0200 [Spark] Support clone and restore for Identity Columns (#3459) ## Description This PR is part of https://github.com/delta-io/delta/issues/1959 . It adds support for cloning and restoring tables with identity columns. ## How was this patch tested? Clone and restore related test cases. commit 63845c201643ac1571d58a3c5be1e6fd30761a46 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Fri Aug 2 23:40:57 2024 +0200 [Spark] Add Row Tracking Backfill Conflict Checker RemoveFile rule (#3467) ## Description `RowTrackingBackfill` (or backfill for short) is a special operation that materializes and recommits all existing files in the table using one or several commits to ensure that every AddFile has a base row ID and default Row Commit Version. In this PR, we add a new rule to `ConflictChecker` that resolves concurrent conflicts involving Backfill. We check that RowTrackingBackfill is not resurrecting files that were removed concurrently and that an AddFile and its corresponding RemoveFile have the same base row ID and default RCV. We also add logic to skip certain concurrency checks if they involve Backfill. This opens up a lot of interesting conflict resolution cases to test, so we will add more UTs covering conflict checking/resolution between Backfill and other operations in different scenarios in the next PRs. ## How was this patch tested? Added ConflictResolution UTs. commit 2d1faaeccbf140fbad5fd17d1fd9bab19ee28912 Author: Krishnan Paranji Ravi Date: Fri Aug 2 17:05:51 2024 -0400 [Kernel][Expression] - Performance Optimization for LIKE expression evaluation (#3185) ## Description Resolves https://github.com/delta-io/delta/issues/3129 ## How was this patch tested? Existing tests validated. This is a performance optimization for LIKE expression evaluation. 
Signed-off-by: Krishnan Paranji Ravi commit 19248992f0a4a14155a0b0e32d0202331b2a8955 Author: Eduard Tudenhoefner Date: Fri Aug 2 18:39:17 2024 +0200 [Kernel] Add support for nested schema fields (#3445) This handles nested schema and resolves https://github.com/delta-io/delta/issues/3427. commit 2e371e79e40806237e9bc97f3dd264aad48c9ae7 Author: Fred Storage Liu Date: Fri Aug 2 09:03:42 2024 -0700 Use correct partition/batch size for Delta Uniform iceberg conversion (#3453) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description The existing conversion logic used toLocalIterator which will spawn many Spark jobs to collect AddFiles to the driver based on default spark partition size. Mostly the default size is not good and thus conversion and commit to Iceberg will be bottlenecked. The PR used repartition to size the partition properly to avoid the bottleneck. ## How was this patch tested? manually tested on a 5M files table. performance improved from tens of minutes to 5 minutes. ## Does this PR introduce _any_ user-facing changes? commit 8eb7a4fa7b68d7edd8862f3de59673b2ea743167 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Fri Aug 2 17:11:32 2024 +0200 [Spark] Add RowTrackingBackfillCommand (#3449) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Adding the [RowTrackingBackfillCommand](https://docs.google.com/document/d/1ji3zIWURSz_qugpRHjIV_2BUZPVKxYMiEFaDORt_ULA/edit#heading=h.8al9qhd83yov), the ability to assign row IDs to table rows after the table creation. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? No. 
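The partition-sizing idea from #3453 above (size partitions explicitly instead of relying on the default Spark partition size) can be sketched without Spark. The snippet below only shows the arithmetic of deriving a partition count from a file count and a target batch size. The names `partition_count`, `batches` and `files_per_partition` are hypothetical and are not taken from the Delta code base.

```python
import math


def partition_count(num_files, files_per_partition):
    """Number of partitions needed so each holds at most `files_per_partition` files.

    Hypothetical helper: always returns at least 1 so an empty input
    still maps to a single (empty) partition.
    """
    if files_per_partition <= 0:
        raise ValueError("files_per_partition must be positive")
    return max(1, math.ceil(num_files / files_per_partition))


def batches(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

In a real job the resulting count would presumably be passed to something like a repartition call. That wiring is deliberately left out here because the commit message does not describe it.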
commit 9ca8e82ddd47e24fa84c105f88c3a5fe199bbdbc Author: Johan Lasperas Date: Thu Aug 1 23:14:00 2024 +0200 [Spark] Execute MERGE using Dataframe API in Scala (#3456) ## Description Due to Spark's unfortunate behavior of resolving plan nodes it doesn't know, the `DeltaMergeInto` plan created when using the MERGE Scala API needs to be manually resolved to ensure Spark doesn't interfere with its analysis. This currently completely bypasses Spark's analysis, as we then manually execute the MERGE command, which has negative effects, e.g. the execution is not visible in QueryExecutionListener. This change addresses this issue by executing the plan using the Dataframe API after it's manually resolved so that the command goes through the regular code path. Resolves https://github.com/delta-io/delta/issues/1521 ## How was this patch tested? Covered by existing tests. commit eb719f8f2eedf6d010c54a69cf126321bcfa6f11 Author: Juliusz Sompolski Date: Thu Aug 1 19:05:46 2024 +0200 [Spark] Add annotation for merge materialize source stage of merge (#3452) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Originally MERGE source materialization was lazy, and triggered when the source was used for the first time. Because of that, it couldn't be cleanly separated as a stage. Since it was changed to be eager, we can now annotate it, which should make it easier to find in Spark UI. ## How was this patch tested? Merge materialize source stage is now annotated: ![image](https://github.com/user-attachments/assets/5a468cda-ffae-40d4-9054-dcfca681c470) Unit tests validate that the new stage is present in MERGE commit metrics. ## Does this PR introduce _any_ user-facing changes? 
No Co-authored-by: Julek Sompolski commit 890889a3b841f8157c833f813728b49d7276c73b Author: Sumeet Varma Date: Wed Jul 31 17:04:22 2024 -0700 [Spark] Fix the inconsistencies in min/max Delta Log stats for special characters (#3430) ## Description When truncating maxValue strings longer than 32 characters for statistics, it's crucial to ensure the final truncated string is lexicographically greater than or equal to the input string in UTF-8 encoded bytes. Previously, we used the Unicode replacement character as the tieBreaker, comparing it directly against one byte of the next character at a time. This approach was insufficient because the tieBreaker could incorrectly win against the non-first bytes of other characters (e.g., � < 🌼 but � > the second byte of 🌼). We now compare one UTF-8 character (i.e. up to 2 Scala UTF-16 characters depending on surrogates) at a time to address this issue. We also now use U+10FFFD, the highest Unicode code point that is not a noncharacter, as the tie-breaker. ## How was this patch tested? UTs commit a88198709eb99f363e9f7377d8c6d234d44862dc Author: Allison Portis Date: Tue Jul 30 13:29:04 2024 -1000 [Kernel] Add exception principles for Kernel to solidify the rules for our exception framework (#3408) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description ## How was this patch tested? ## Does this PR introduce _any_ user-facing changes? commit cfd91336755eb98eae902250e354d84eb3724df2 Author: Allison Portis Date: Tue Jul 30 13:28:54 2024 -1000 [Docs] Clean up the docs in master for future 3.X releases (#3440) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (Docs) ## Description Removes the docs changes for some Delta 4.0+ specific features since the next release will be 3.X. Also adds a banner to the landing page that announces and points to the Delta 4.0 Preview release. 
Also fixes the version numbers in the quickstart page to be `3.2.0` and not `3.1.0` ## How was this patch tested? Local build. ## Does this PR introduce _any_ user-facing changes? No. commit ef8c779c415ca305f36d60cfd4088fe96feaf436 Author: Allison Portis Date: Tue Jul 30 13:26:54 2024 -1000 [Spark] Add variant integration test to master (#3439) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Variant type support was added for the Delta 4.0 preview release on Spark 4.0 Preview. An integration test was added to the release branch in https://github.com/delta-io/delta/pull/3220, this PR adds the integration test to master with some updated infra. - Doesn't run the test when Spark version is too low - Updates `examples/scala/build.sbt` to work for 4.0.0+ ## How was this patch tested? Ran the scala integration tests using both `3.2.0` and `4.0.0rc1` ## Does this PR introduce _any_ user-facing changes? No. --------- Co-authored-by: richardc-db <87336575+richardc-db@users.noreply.github.com> commit 03ca73fe70cceea2999edb7ebbbb66d4e75f5055 Author: Eduard Tudenhoefner Date: Tue Jul 30 23:42:47 2024 +0200 [Kernel] Add get method to FieldMetadata for easier extraction of value in the correct type (#3435) ## Description This adds a `get()` method to `FieldMetadata` for easier extraction of the value in the correct type and fixes https://github.com/delta-io/delta/issues/3419 commit c45c6e6bb4b7e34a2ddcd8e47aaacbe169e1729a Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Tue Jul 30 22:08:35 2024 +0200 [Spark] Add BackfillBatchIterator for Row Tracking Backfill (#3441) ## Description In this PR, we add `BackfillBatchIterator` which contains the core logic for selecting files for the WIP Row Tracking Backfill operation, creating iterators which batches files together. 
The Row Tracking Backfill operation will have multiple commits, each commit adds `baseRowId` to a batch of files. More details can be found in the [Design Doc](https://docs.google.com/document/d/1ji3zIWURSz_qugpRHjIV_2BUZPVKxYMiEFaDORt_ULA/edit#heading=h.8al9qhd83yov). ## How was this patch tested? Added new Test Suite. commit 2156efde82048e90bf25d31ae44e138f538be6aa Author: Christos Stavrakakis Date: Tue Jul 30 22:07:50 2024 +0200 [Spark] Make ConflictCheckerPredicateEliminationUnitSuite work Spark 4.0 (#3442) ## Description Update `ConflictCheckerPredicateEliminationUnitSuite` to work with Spark 4.0 where `rand()` returns an `UnresolvedFunction` instead of `Rand()` expression. ## How was this patch tested? Test-only change. commit 22ff95194891db5e915e7e477f0fd6d79b9047c7 Author: Venki Korukanti Date: Tue Jul 30 11:54:34 2024 -0700 [Kernel][Clean up] Use enums for column mapping modes instead of strings (#3446) ## Description Resolve to `enum` as part of the `TableConfig` fetch methods. This avoids comparing strings and dealing with case sensitivity. ## How was this patch tested? Existing tests. commit 455dbacaf456158644bcbe8b0b13bd4eae65c974 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Mon Jul 29 22:32:01 2024 +0200 [Spark] Add Delta Connect Scala Client (#3177) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR adds the skeleton for the Delta Connect Scala Client. It adds some basic methods (`forPath`, `forName`, and `as`) to confirm that Delta Connect relations work. More methods will be added in future PRs. ## How was this patch tested? Added `DeltaTableSuite`. ## Does this PR introduce _any_ user-facing changes? No. 
--------- Co-authored-by: Scott Sandre Co-authored-by: Allison Portis commit 03bdf8476c3e4f76d9a2d26592b7fd638736f57a Author: Scott Sandre Date: Mon Jul 29 12:58:57 2024 -0700 [#3423] Fix unnecessary DynamoDB GET calls during LogStore::listFrom VACUUM calls (#3425) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Resolves #3423. This PR updates the logic in `BaseExternalLogStore::listFrom` so that it does not make a request to get the latest entry from the external store (which is used to perform recovery operations) in the event that a non-`_delta_log` file is being listed. This is useful for VACUUM operations, which may do hundreds or thousands of list calls in the table directory and nested partition directories of parquet files. This is NOT the `_delta_log`. Thus, checking the external store during these list calls is (1) useless and unwanted, as we are not listing the `_delta_log` so clearly now isn't the time to attempt to do a fixup, and (2) expensive. This PR makes it so that future VACUUM operations do not perform unnecessary calls to the external store (e.g. DynamoDB). ## How was this patch tested? Unit tests and an integration test that actually runs VACUUM and compares the number of external store calls using the old/new logic. I ran that test myself 50 times, too, and it passed every time (therefore, not flaky). ## Does this PR introduce _any_ user-facing changes? No commit ec3f6be3771dca6631b3f7586fc3d9f7f21e488f Author: Eduard Tudenhoefner Date: Mon Jul 29 18:37:32 2024 +0200 [Kernel] Add column mapping metadata update functionality (#3393) ## Description This adds column mapping metadata when the column mapping feature is enabled on a table and fixes #3383 ## How was this patch tested? 
New tests have been added to verify the behavior commit e1dd541ce99389a79e8c70dbcf9045fd5d4112a7 Author: Lin Zhou <87341375+linzhou-db@users.noreply.github.com> Date: Mon Jul 29 21:23:13 2024 +0800 [Sharing]Use bytes length for file size in DeltaSharingLogFileStatus (#3432) ## Description Use bytes length for file size in DeltaSharingLogFileStatus, to match the actual size of the bytes in SeekableByteArrayInputStream, this is to avoid the length difference caused by non utf-8 characters. ## How was this patch tested? Unit Test commit 6495cba229024ce1d3a0e2260c1d7f9944f2e8f0 Author: Yumingxuan Guo Date: Fri Jul 26 14:54:26 2024 -0700 [DELTA] Catch fatal error in error matching in commitLarge (#3428) ## Description Catch fatal error in error matching in commitLarge ## How was this patch tested? Existing test cases pass. commit 86b4313d8e00bd2349f7d5514bb603aa3d05de08 Author: Eduard Tudenhoefner Date: Fri Jul 26 18:32:56 2024 +0200 [Kernel] Track nested fields in ArrayType / MapType (#3426) Rather than just tracking the underlying `DataType` this tracks the underlying field and thus allows to track nullability and metadata through that field commit 4fefba182f81d39f1d11e2f2b85bfa140079ea11 Author: Jun <85203301+junlee-db@users.noreply.github.com> Date: Thu Jul 25 15:23:06 2024 -0700 [Spark] Extend coordinated commit changes to other suites (#3305) ## Description Extends various existing Delta suite with coordinated commits. This helps to add more coverage with coordinated commit changes. ## How was this patch tested? This is test PR. commit e89617342720c6482750ddb62cd43b55ae25357a Author: Zihao Xu Date: Thu Jul 25 14:26:04 2024 -0700 [Spark] Fix the semantic of `shouldRewriteToBeIcebergCompatible` in REORG UPGRADE UNIFORM (#3412) ## Description currently we utilize the helper function `shouldRewriteToBeIcebergCompatible` to filter the portion of parquet files that need to be rewritten when running `REORG UPGRADE UNIFORM` based on the tags in the `AddFile`. 
however, the `DeltaUpgradeUniformOperation.icebergCompatVersion` is accidentally shadowed, which will make `shouldRewriteToBeIcebergCompatible` always return `false` if the `AddFile.tags` is not `null` - this is not the expected semantic of this function. this PR introduces the fix for this problem and adds unit tests to ensure the correctness. ## How was this patch tested? through unit tests in `UniFormE2ESuite.scala`. ## Does this PR introduce _any_ user-facing changes? no. commit de87d7f6b70d9aa5313cd3a880617a2e430349eb Author: Lars Kroll Date: Thu Jul 25 17:11:44 2024 +0200 [Spark] Flip isDeltaTable.throwOnError to true (#3422) ## Description - Change the default value of the isDeltaTable.throwOnError flag to `true`. ## How was this patch tested? Existing tests (was already `true` in testing). ## Does this PR introduce _any_ user-facing changes? Resolving a Delta table that is accessed by path (e.g. DeltaTable.forPath() or SELECT ... FROM delta.) will now forward exceptions thrown while accessing , instead of always throwing a DELTA_MISSING_DELTA_TABLE exception. This helps with locating issues such as missing access permissions without having to go through support. commit 69230a1ccd6ed59ae0145abafb6fc00c4f328af5 Author: Lars Kroll Date: Thu Jul 25 16:22:54 2024 +0200 [Spark] Add Delta log throttling class (#3418) ## Description - Add a token bucket based throttler implementation that can be used to throttle log messages that are suspected of potential log flooding issues. ## How was this patch tested? New test suite: `LogThrottlingSuite` commit 4243fef89b3ca599b9812d76a0970233632fb290 Author: Annie Wang <170372889+anniewang-db@users.noreply.github.com> Date: Wed Jul 24 15:29:12 2024 -0700 [Hudi] Refactor tests (#3415) ## Description This PR refactors a unit test for UniForm Hudi ## How was this patch tested? 
It is a test-only change commit 7467dfb4587fe483b3df52b530c0365c715e5da9 Author: Qianru Lao <55441375+EstherBear@users.noreply.github.com> Date: Wed Jul 24 18:20:07 2024 -0400 [Storage][Kernel] Add InMemoryCommitCoordinator (#3377) ## Description An InMemoryCommitCoordinator to kernel to track per-table commits, backfills and validates various edge cases and unexpected scenarios. ## How was this patch tested? Unit tests commit 6e3e4ef20224ace721780d132f01daea738a33b2 Author: Ming DAI Date: Wed Jul 24 13:50:40 2024 -0700 Match table exhaustively for ConvertToDelta to avoid scala.MatchError (#3411) ## Description Make the match loop exhaustive to avoid scala.MatchError ## How was this patch tested? Existing Unit tests. commit 2b2ef732533c707b7ca1af30e2a059da86c3c3ff Author: Allison Portis Date: Tue Jul 23 15:50:20 2024 -1000 [Kernel] Wrap calls into the engine implementation with `KernelEngineException` (#3407) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description Wrap all calls into `Engine` implementations with try-catch blocks to wrap any unchecked exceptions as `KernelEngineException`. We do this using helper methods added to `DeltaErrors` and kernel-api adds additional context about the failing operation at hand. ## How was this patch tested? Existing tests should suffice. ## Does this PR introduce _any_ user-facing changes? No. commit 9cdc1c71331eed04053dd886ce3628d4cd991be9 Author: Qianru Lao <55441375+EstherBear@users.noreply.github.com> Date: Tue Jul 23 19:09:21 2024 -0400 [Kernel] Add parsing support for coordinated commits Map type table properties (#3400) ## Description Add parsing support for coordinated commits `Map` type table properties and give a new option of using `fromMetadata` with `Engine`. ## How was this patch tested? 
Unit tests commit 64c9b979085692e9b750455c82bc1be99d21de63 Author: leonwind-db Date: Tue Jul 23 21:43:00 2024 +0200 Revert "[Spark] Add custom not matched for insert clauses expr (#3405)" (#3410) This reverts commit 1d9ad36f53a73d9a5e207adaac41dd54626937b8. #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Reverts adding custom expressions for not matched insert clauses. ## How was this patch tested? Unit tests ## Does this PR introduce _any_ user-facing changes? No commit acd67103537c5bd532d000b756dd3312c046c787 Author: Amogh Jahagirdar Date: Tue Jul 23 13:41:30 2024 -0600 [UNIFORM] Remove unnecessary iceberg patch and disable cleanup of files in expire snapshots via cleanExpiredFiles API (#3399) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ X] Other Uniform ## Description We don't need a custom patch to disable cleaning up data files for Iceberg's expire snapshots API. icebergShaded/iceberg_src_patches/0002-iceberg-core-must-not-delete-any-delta-data-files.patch An option to disable file cleanup already exists through the `cleanExpiredFiles` API. ## How was this patch tested? Iceberg's cleanFiles option is already tested by Iceberg. I can add separate tests here to make sure that the custom transaction logic for Uniform uses that option. ## Does this PR introduce _any_ user-facing changes? No, it preserves the existing behavior. Snapshots can be removed, but files will never be deleted. 
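The behavior described in #3399 above ("Snapshots can be removed, but files will never be deleted") can be illustrated with a toy in-memory model. This is explicitly not Iceberg's API: `ToyTable` and its methods are hypothetical, and only the flag name mirrors the `cleanExpiredFiles` option mentioned in the commit message.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ToyTable:
    """Toy model: snapshots reference data files; files live in 'storage'."""
    snapshots: Dict[int, Set[str]] = field(default_factory=dict)
    files: Set[str] = field(default_factory=set)

    def add_snapshot(self, snapshot_id: int, paths: List[str]) -> None:
        """Register a snapshot and make sure its files exist in storage."""
        self.snapshots[snapshot_id] = set(paths)
        self.files |= set(paths)

    def expire_snapshots(self, older_than: int, *, clean_expired_files: bool) -> List[int]:
        """Drop snapshots with id < older_than.

        Data files that are no longer referenced are deleted only when
        clean_expired_files is True. With False, only snapshot metadata
        is removed and every data file stays in storage.
        """
        expired = sorted(sid for sid in self.snapshots if sid < older_than)
        for sid in expired:
            del self.snapshots[sid]
        if clean_expired_files:
            still_referenced: Set[str] = set()
            for paths in self.snapshots.values():
                still_referenced |= paths
            self.files &= still_referenced
        return expired
```

The flag is keyword-only and has no default in this sketch, so a caller has to state the intent explicitly. In the UniForm case the data files belong to the Delta table, which is why the commit keeps file cleanup disabled.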
commit 3cd70412dc72c67e3f873969b26c7d2cc464726c Author: Qianru Lao <55441375+EstherBear@users.noreply.github.com> Date: Mon Jul 22 22:09:09 2024 -0400 [Kernel] Add default coordinated commits handler implementation (#3396) ## Description Add default coordinated commits handler implementation, `CommitCoordinatorProvider` and `CommitCoordinatorBuilder` for user to define, register and get their own commit coordinator builder from `DefaultEngine`. ## How was this patch tested? Unit tests commit 1d9ad36f53a73d9a5e207adaac41dd54626937b8 Author: leonwind-db Date: Mon Jul 22 20:34:07 2024 +0200 [Spark] Add custom not matched for insert clauses expr (#3405) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Make `notMatchedClauses` a custom `def` ## How was this patch tested? Unit tests ## Does this PR introduce _any_ user-facing changes? No commit 3d2ce5e303ec1c4df2d32388c1d354844ec9489e Author: Andreas Chatzistergiou <93710326+andreaschat-db@users.noreply.github.com> Date: Mon Jul 22 20:32:17 2024 +0200 [Spark] Minor refactor in Delta Protocol Transition Suite (#3402) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Minor refactor in `DeltaProtocolTransitionsSuite`. ## How was this patch tested? Test only PR. ## Does this PR introduce _any_ user-facing changes? No. commit 2c450feba1b9e26ac2b3c6019a6bdf42a70583ce Author: Carmen Kwan Date: Fri Jul 19 23:59:30 2024 +0200 [Spark] Identity Columns Value Generation (without MERGE support) (#3023) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR is part of https://github.com/delta-io/delta/issues/1959 In this PR, we enable basic ingestion for Identity Columns. 
* We use a custom UDF `GenerateIdentityValues` to generate values when not supplemented by the user. * We introduce classes to help update and track the high watermark of identity columns. * We also do some cleanup/ improve readability for ColumnWithDefaultExprUtils Note: This does NOT enable Ingestion with MERGE INTO yet. That will come in a follow up PR, to make this easier to review. ## How was this patch tested? We introduce a new test suite IdentityColumnIngestionSuite. ## Does this PR introduce _any_ user-facing changes? No. commit 589cabad0cb4d3318c85989b7915a461a2ddc39b Author: Qianru Lao <55441375+EstherBear@users.noreply.github.com> Date: Fri Jul 19 17:34:26 2024 -0400 [Kernel] Add coordinated commits interfaces and table properties (#3370) ## Description Add coordinated commits interfaces and related table properties in Kernel to prepare for Kernel read and write supported by coordinated commits. ## How was this patch tested? Unit tests ## Does this PR introduce _any_ user-facing changes? For now this PR just contains the interfaces, but once fully implemented user can define and register their own commit coordinator builder and get the corresponding commit coordinator client handler from engine. commit 2bfc2f2f19c27289339505bb3e3bb88f8612e176 Author: Carmen Kwan Date: Fri Jul 19 20:04:37 2024 +0200 [Spark] Batch resolve DeltaMergeAction in ResolveMergeInto (#3366) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description In this PR, we batch resolve `DeltaMergeActions` in `ResolveMergeInto` to avoid redundant invocations of the analyzer. For example, if we have the clause ``` WHEN NOT MATCHED THEN INSERT (c1, c2) VALUES (src.c1, 22) ``` Previously, we would call `resolveSingleExprOrFail` which invokes the analyzer 4 times: 1 call to resolve `c1`, 1 call to resolve `src.c1`, 1 call to resolve `c2`, and 1 call to resolve `22`. 
With this PR, we batch the resolution of the target column name parts (`[c1, c2]`) against the target relation and the resolution of the assignment expressions (`[src.c1, 22]`) together. With batching, we can resolve a MERGE clause with 2 analyzer calls. This helps analyzer performance on wide tables.

## How was this patch tested?

Existing tests pass.

## Does this PR introduce _any_ user-facing changes?

No.

commit fb4d88d1e4a762ffb81adb10f7801784ec24bf97
Author: Fokko Driesprong
Date: Thu Jul 18 23:13:13 2024 +0200

Move OutputTimestampType to `DeltaSQLConf` (#3388)

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

## How was this patch tested?

## Does this PR introduce _any_ user-facing changes?

commit 0e45ad2b75d7b836cb85453e0e308dc8e4d77d83
Author: Zihao Xu
Date: Thu Jul 18 14:12:23 2024 -0700

[Spark] Remove dropped columns when running REORG PURGE (#3371)

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

According to #3228, we add support to also find and remove dropped columns when running `REORG PURGE`. Closes #3228.

## How was this patch tested?

Through a unit test in `DeltaReorgSuite.scala`.

## Does this PR introduce _any_ user-facing changes?

No.

commit 3797fe810e97a709cf38fc1658f6d1eade1239f1
Author: Johan Lasperas
Date: Thu Jul 18 21:33:32 2024 +0200

[Spark] Increase replication level in MERGE source materialization after first retry (#3386)

## Description

Source materialization for MERGE INTO currently uses 2-way replication on retries. This may not be enough when executors are aggressively killed, for example when using spot instances. This change retains 2-way replication on the first retry, then increases it to 3-way by default on the following retries.

## How was this patch tested?
Update merge source materialization test

commit a2bcd1a90ca50e15e0191b27ee651d48b9d70db4
Author: Mingkang Li
Date: Wed Jul 17 19:34:37 2024 -0700

[SPARK] Add Integration Test in `MergeIntoMaterializeSourceSuite` (#3387)

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

This PR adds an integration test in `MergeIntoMaterializeSourceSuite` to improve code coverage for `MergeIntoMaterializeSource.scala`.

## How was this patch tested?

An integration test was added to `MergeIntoMaterializeSourceSuite`.

## Does this PR introduce _any_ user-facing changes?

No.

commit ef35e67782a7b6c97f0ac991eb036739e00c4000
Author: Amogh Jahagirdar
Date: Wed Jul 17 17:44:15 2024 -0600

[Spark] Only attempt parsing partition path into different types if type inference is enabled (#3374)

#### Which Delta project/connector is this regarding?

- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

I was looking at this code path for another issue and noticed that when parsing partition values from paths (only used for the write path) without type inference, we still attempt to do parsing for date/decimal/timestamp types even though it's not required. We can avoid that work with a small refactoring.

## How was this patch tested?

This is a refactoring/minor optimization; existing unit tests exercise this path.

## Does this PR introduce _any_ user-facing changes?

No

commit db9d9ac5a1be27506932cffc1b9abc3470bd0ab4
Author: Johan Lasperas
Date: Wed Jul 17 17:35:41 2024 +0200

[Spark] Use DeltaSQLTestUtils in more test suites (#3376)

## Description

Follow-up from https://github.com/delta-io/delta/pull/3365

Mix in `DeltaSQLTestUtils` in more test suites so that they use the Delta temp dir creation helpers.

## How was this patch tested?
Test-only commit 669dca9c05f1ddb32ce1fd612ffd83eabb1cddd9 Author: Andreas Chatzistergiou <93710326+andreaschat-db@users.noreply.github.com> Date: Wed Jul 17 02:53:14 2024 +0200 [Spark] Improve Delta Protocol Transitions (#2848) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Currently, protocol transitions can be hard to manage. A few examples: - It is hard to predict the output of certain operations. - Once a legacy protocol transitions to a Table Features protocol it is quite hard to transition back to a legacy protocol. - Adding a feature in a protocol and then removing it might lead to a different protocol. - Adding an explicit feature to a legacy protocol always leads to a table features protocol although it might not be necessary. - Dropping features from legacy protocols is not supported. As a result, the order the features are dropped matters. - Default protocol versions are ignored in some cases. - Enabling table features by default results in feature loss in legacy protocols. - CREATE TABLE ignores any legacy versions set if there is also a table feature in the definition. This PR proposes several protocol transition improvements in order to simplify user journeys. The high level proposal is the following: Two protocol representations with singular operational semantics. This means that we have two ways to represent a protocol: a) The legacy representation and b) the table features representation. The latter representation is more powerful than the former, i.e the table features representation can represent all legacy protocols but the opposite is not true. This is followed by three simple rules: 1. All operations should be allowed to be performed on both protocol representations and should yield equivalent results. 2. The result should always be represented with the weaker form when possible. 3. 
Conversely, if the result of an operation on a legacy protocol cannot be represented with the legacy representation, use the Table Features representation.

**The PR introduces the following behavioural changes:**

1. Now all protocol operations are followed by denormalisation and then normalisation. Up to now, normalisation would only be performed after dropping a feature.
2. Legacy features can now be dropped directly from a legacy protocol. The result is represented with table features if it cannot be represented with a legacy protocol.
3. Operations on table feature protocols now take into account the default versions. For example, enabling deletion vectors on a table results in protocol `(3, 7, AppendOnly, Invariants, DeletionVectors)`.
4. Operations on table feature protocols now take into account any protocol versions set on the table. For example, creating a table with protocol `(1, 3)` and deletion vectors results in protocol `(3, 7, AppendOnly, Invariants, CheckConstraints, DeletionVectors)`.
5. It is no longer possible to have a table features protocol without table features. For example, creating a table with `(3, 7)` and no table features is now normalised to `(1, 1)`.
6. Column Mapping can now be automatically enabled on legacy protocols when the mode is changed explicitly.

## How was this patch tested?

Added `DeltaProtocolTransitionsSuite`. Also modified existing tests in `DeltaProtocolVersionSuite`.

## Does this PR introduce _any_ user-facing changes?

Yes.

commit 4430dc1699108e385de4dc29297272a12b58ab99
Author: Qianru Lao <55441375+EstherBear@users.noreply.github.com>
Date: Tue Jul 16 17:56:24 2024 -0400

[Kernel] Update ConflictChecker to perform conflict resolution of ICT (#3283)

## Description

Update ConflictChecker to perform conflict resolution of inCommitTimestamp and complete the inCommitTimestamp support in Kernel.

## How was this patch tested?

Add unit tests to verify the conflict resolution of timestamps and enablement version.
## Does this PR introduce _any_ user-facing changes? Yes, user can enable monotonic inCommitTimestamp by enabling its property. commit 8f66b06a1e65e8346d77943b6b021478ce073e29 Author: Dhruv Arya Date: Tue Jul 16 12:41:25 2024 -0700 [Spark] Make TrackingCommitCoordinatorClient more generic (#3380) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description `TrackingCommitCoordinatorClient` currently takes a `InMemoryCommitCoordinator` in the constructor. This makes it hard to test the coordinator with other coordinator types. This PR makes it take `CommitCoordinatorClient` instead and moves the `InMemoryCommitCoordinator`-specific APIs to `InMemoryCommitCoordinator`. ## How was this patch tested? No new tests for this test-only change. ## Does this PR introduce _any_ user-facing changes? No commit dc0f35de20227527308b8b7989f6cdbd89df13a6 Author: Fokko Driesprong Date: Tue Jul 16 19:42:09 2024 +0200 [Spark] Write INT64 by default for Timestamps (#3373) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description - Iceberg states in the spec that Timestamps should be written with INT64 physical types. There were already flags to enable this, but this PR makes this behavior the default. - `INT96` is discouraged: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L981 ## How was this patch tested? - Added new unit-tests to validate the correct behavior ## Does this PR introduce _any_ user-facing changes? Yes, it will default to INT64 if nothing has been set. commit 7b352599d7f0c4e8c5bd72208f2a5a7395cded92 Author: Zihao Xu Date: Tue Jul 16 10:21:00 2024 -0700 [Spark] Enable UniForm Without Rewrite (#3379) #### Which Delta project/connector is this regarding? 
- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

Users can now enable Delta UniForm directly with the `ALTER TABLE SET TBLPROPERTIES` command. This only converts the corresponding metadata from `delta` to `iceberg` without rewriting all the underlying parquet files.

## How was this patch tested?

Through manual tests and e2e tests.

## Does this PR introduce _any_ user-facing changes?

Yes, this PR lets users enable Delta UniForm directly via the `ALTER TABLE SET TBLPROPERTIES` command.

commit 573a57f62918d0cb8937ca8c9b4047f8d696cc9c
Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com>
Date: Fri Jul 12 20:46:35 2024 +0200

[Spark] Remove tracking usageRecords in runFunctionsWithOrderingFromObserver (#3369)

#### Which Delta project/connector is this regarding?

- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

Remove tracking of usageRecords generated from the query runs in runFunctionsWithOrderingFromObserver because it is unnecessary for our testing purposes.

## How was this patch tested?

Existing UTs.

## Does this PR introduce _any_ user-facing changes?

No.

commit 9c576327d3d737ccd7cf09553b8c3852dfb5cfa0
Author: Johan Lasperas
Date: Fri Jul 12 17:54:12 2024 +0200

[Spark] Remove timeout source from withTempDir in Delta tests (#3365)

## Description

`withTempDir` is widely used across Delta tests to create temporary directories. The Spark version in `SQLTestUtil` waits for all running Spark tasks to finish with a 10s timeout, but seems to be prone to under- or over-counting the number of running tasks, sometimes causing it to time out and fail the test.

Change:
- Use Delta's version of `withTempDir`/`withTempPath`/`withTempPaths`, which immediately deletes the temp directory when the code returns. This version is already in many Delta tests.

## How was this patch tested?

Test-only change.
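The "delete immediately when the code returns" behaviour described for `withTempDir` above can be sketched in Python. This is only an illustration of the pattern, not Delta's Scala helper: the name `with_temp_dir` and the `prefix` parameter are hypothetical.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def with_temp_dir(prefix="delta-test-"):
    """Hypothetical helper: yield a fresh directory, delete it on exit."""
    path = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        yield path
    finally:
        # No waiting on background tasks and no timeout: the directory is
        # removed as soon as the block exits, even if the body raised.
        shutil.rmtree(path, ignore_errors=True)


# Usage: the directory exists inside the block and is gone right after it.
with with_temp_dir() as tmp_dir:
    (tmp_dir / "marker.txt").write_text("placeholder")
```

Because cleanup does not depend on counting running tasks, there is nothing in this pattern that can time out.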
commit b3d764ae6869c5103b7fadce9982f7c3edda373d
Author: Lars Kroll
Date: Fri Jul 12 17:50:55 2024 +0200

[Spark] Expose uncertainty in isDeltaTable resolution via exceptions (#3368)

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

- Add a new codepath (disabled by default except in testing) to `findDeltaTableRoot` (which is the implementation of `isDeltaTable`) that throws the first exception if the passed-in path is not accessible and no accessible parent is found to contain a `_delta_log` folder.
- This improves error reporting on permission errors, which currently often get misreported as `t is not a DeltaTable`.
- This also reduces the risk of transient cloud storage errors causing us to silently resolve a DeltaTable as a non-Delta table, which can lead to inconsistencies in how we process it when we check multiple times.
- Note that this is still not ideal: we could still end up in a situation where the first path is accessible but is a child of the Delta table, such as a partition, and then, when we check the path that actually contains the `_delta_log`, we get a transient error and wrongly resolve it as not a Delta table. Ultimately, this kind of parent resolution has been and continues to be best-effort. But at least the case where we are pointed at the correct folder behaves more sanely now.

The new path is only enabled in unit tests for now. The flag-flip PR will follow separately.

## How was this patch tested?

Existing tests (showing that there is no behaviour change in the "happy path").

## Does this PR introduce _any_ user-facing changes?

No, they will only apply on the flag-flip PR.

commit 27a89efa72e07c1cfa5ef13f887d30ee0e7b02b7
Author: Matt Braymer-Hayes
Date: Fri Jul 12 11:31:36 2024 -0400

[Flink][Docs] Fix link to Flink connector's Java API docs (#3359)

Fix broken link to Flink connector's Java API docs.
Signed-off-by: Matt Braymer-Hayes

commit c422db88ef0c405954ac90c0cc07fe3a233ad232
Author: jackierwzhang <67607237+jackierwzhang@users.noreply.github.com>
Date: Thu Jul 11 12:34:32 2024 -0700

[SPARK] Add schema utils to prune empty structs from a schema (#3361)

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

As the title says, this adds a useful util to prune empty structs from a schema, in preparation for some future work.

## How was this patch tested?

New UT.

## Does this PR introduce _any_ user-facing changes?

No

commit bde83ad6ae60476176d52b24e41dbcb663db0b1d
Author: Hao Jiang
Date: Thu Jul 11 11:40:08 2024 -0700

[UniForm] Report accurate type when checking Iceberg compatible types (#3362)

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

This PR fixes an inaccurate error message when IcebergCompat discovers an incompatible data type in the schema.

Before the fix, if a struct has a field containing a nested incompatible data type, the checks in IcebergCompat report the field type. For example, if "STRUCT>" contains an incompatible type "VOID", our message will report "struct is not supported". As struct is supported, this causes confusion.

This fix makes it report the actual incompatible data type. In the example above, we will report "void is not supported".

## How was this patch tested?

UT

## Does this PR introduce _any_ user-facing changes?

No

commit 9dad86a0bc2cf350628f0e7a635993f868efd481
Author: Sumeet Varma
Date: Thu Jul 11 09:04:23 2024 -0700

[Spark] Add list utils to DeltaCommitFileProvider (#3349)

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

Add a list util to DeltaCommitFileProvider.

## How was this patch tested?
Build ## Does this PR introduce _any_ user-facing changes? No commit 23770cbceba7483339cb8f4ed8229b6398fcdce8 Author: Andreas Chatzistergiou <93710326+andreaschat-db@users.noreply.github.com> Date: Thu Jul 11 18:02:31 2024 +0200 Use valid protocol versions in CreateCheckpointSuite (#3356) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Minor fix in `CreateCheckpointSuite` to use a valid legacy protocol. ## How was this patch tested? Test only change. ## Does this PR introduce _any_ user-facing changes? No. commit d04b3cdd7a75ba8c2b313a41c4bb966fb72c9eed Author: Jun <85203301+junlee-db@users.noreply.github.com> Date: Wed Jul 10 21:01:07 2024 -0700 [Spark] Extends NewTransactionSuite with CoordinatedCommits (#3344) commit 569ec7a9b2af5fa3c6a70356db9252c5921add38 Author: Sumeet Varma Date: Wed Jul 10 10:06:09 2024 -0700 [Spark] Use Coordinated Commits Properties from SparkSession during CLONE (#3325) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description For CLONE, the priority order for Coordinated Commits related settings is: 1. Explicit overrides with the CLONE command 2. SparkSession defaults Note, we never pick the Coordinated Commits related settings from the source metadata even if the above two are not set. ## How was this patch tested? UTs ## Does this PR introduce _any_ user-facing changes? No commit 83ecfc596b248a751871ec700a415795efe75e54 Author: Jiaheng Tang Date: Wed Jul 10 08:44:39 2024 -0700 [Spark] Migrate remaining logging code to use Spark Structured Logging (#3354) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Migrate remaining logging code to use Spark structured logging. 
Structured logging will be introduced in Spark 4.0, as highlighted in the [preview release of Spark 4.0 | Apache Spark](https://spark.apache.org/news/spark-4.0.0-preview1.html). Note that the feature is turned off by default and the output log message will remain unchanged. Resolves #3145 ## How was this patch tested? Existing tests. ## Does this PR introduce _any_ user-facing changes? No commit 3a6fd47e531c7ae1535aeab290aa6b96c9c54545 Author: Sumeet Varma Date: Wed Jul 10 08:43:50 2024 -0700 [Spark] Fix error message formatting in DeltaCommitFileProvider (#3353) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Make sure that version and maxVersionInclusive are formatted correctly. ## How was this patch tested? Build ## Does this PR introduce _any_ user-facing changes? No commit 1375b74afb0033ccac54f96307ccde7d61932c63 Author: Dhruv Arya Date: Tue Jul 9 17:22:13 2024 -0700 [Spark] Fix VARCHAR/CHAR to string conversion (#3346) commit 0d87908cdaf49bd11fb9f4fd6e3106ab14541e34 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Tue Jul 9 20:50:13 2024 +0200 [Spark] Fix CDC Commit Timestamp value under different Timezones (#3347) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description The CDC's `_commit_timestamp` is incorrect when we try to read/display it under a different Spark Session's timezone `spark.sql.session.timeZone` (e.g. `America/Chicago`, `Asia/Ho_Chi_Minh`, ...). 
In this PR, we address this issue by taking into account timezone to capture the precise point in time when we convert `CDCDataSpec`'s [Java Timestamp](https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/commands/cdc/CDCReader.scala#L173) field to [Spark's Timestamp](https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/commands/cdc/CDCReader.scala#L80) for the `_commit_timestamp` column for all CDC's file indexes (`CDCAddFileIndex`, `TahoeRemoveFileIndex`, `TahoeChangeFileIndex`). This is needed in order for CDF to work properly under a different timezone than `UTC`. ## How was this patch tested? Added UT, some minor UTs fix to take into account timezone. ## Does this PR introduce _any_ user-facing changes? Yes. - CDC's `_commit_timestamp` should now be correct when we try to read/display it under a different Spark Session's timezone `spark.sql.session.timeZone` (e.g.` America/Chicago`, `Asia/Ho_Chi_Minh`, ...). - This is a user-facing change compared to the released Delta Lake versions and within the unreleased branches such as master. commit 70bfe82e40db6ba1fd0fe591d635bef5e177411f Author: Johan Lasperas Date: Tue Jul 9 17:53:52 2024 +0200 [Spark] Deprecate tableVersion field in type widening metadata (#3334) ## Description The protocol specification for type widening dropped the requirement to populate a `tableVersion` field as part of type change history to track in which version of the table a given change was applied. See protocol update: https://github.com/delta-io/delta/pull/3297 This field was used at some point during the preview but isn't needed anymore. It is now deprecated: - The field is preserved in table metadata that already contains it. - The field isn't set anymore when the stable table feature is active on the table. - The field is still set when only the preview table feature is active on the table. 
The last point is necessary to avoid breaking preview clients (Delta 3.2 & Delta 4.0 preview) that require the field to be set. ## How was this patch tested? - Updated existing metadata tests to cover `tableVersion` not being set by default. - Added metadata tests to explicitly cover `tableVersion` being set. - Added tests covering `tableVersion` when using the preview and stable table features. ## Does this PR introduce _any_ user-facing changes? Yes. As of this change, a table that supports the stable table feature `typeWidening` won't have a `tableVersion` field in the type change history stored in the table metadata. Tables that only support the preview table feature `typeWidening-preview` don't see any change. commit 0e5b85681955d19237c51865af58d8ae9ceedc06 Author: Annie Wang <170372889+anniewang-db@users.noreply.github.com> Date: Mon Jul 8 10:44:22 2024 -0700 [Hudi] Flesh out tests and update column type support (#3339) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Hudi) ## Description This PR adds unit tests and also removes support of some data types that are not supported by Hudi (e.g. TIMESTAMP_NTZ, SHORT, BYTE, etc) ## How was this patch tested? Unit tests ## Does this PR introduce _any_ user-facing changes? No commit 74d09e9343ec14b63000874757eaffe680e4f140 Author: Annie Wang <170372889+anniewang-db@users.noreply.github.com> Date: Mon Jul 8 10:13:07 2024 -0700 [Hudi] Add integration tests for Hudi (#3338) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Hudi) ## Description This PR adds integration tests for Hudi. Previously, the integration test was not able to actually read the table from a Hudi engine due to Spark incompatibility, but since the new Hudi 15.0.0 release supports Spark 3.5 we can now add verifications that actually read the tables from Hudi. ## How was this patch tested? 
Added integration tests ## Does this PR introduce _any_ user-facing changes? No commit 4afb16bf599b27d338512abd85d772c24f42be8d Author: Vishwas Modhera <35566657+vishxm@users.noreply.github.com> Date: Sun Jul 7 20:21:41 2024 +0530 [Documentation] typo correction - very to every in quick-start docs (#3340) modified very to every #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Documentation) ## Description This PR corrects the typographical error in quick-start.md by modifying `very` to `every`. ## How was this patch tested? Not needed. ## Does this PR introduce _any_ user-facing changes? No. commit 97439835a4a667ac2ad86ec6054f0e85e8214760 Author: Christos Stavrakakis Date: Wed Jul 3 18:20:45 2024 +0200 [Spark] Use binary encoding for DV descriptor in file metadata (#3331) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description The Deletion Vector descriptor of each file is currently serialized in file's custom metadata as a JSON string. We can reduce the size of the descriptor by using a custom binary encoding. Note that the serialized DV descriptor is never persisted to disk so this change is safe. ## How was this patch tested? Updated existing new tests. ## Does this PR introduce _any_ user-facing changes? No commit 146d49718956b8e713ae9cdf2877eacc7c6352df Author: Fred Storage Liu Date: Tue Jul 2 09:22:05 2024 -0700 [Spark] add Delta logging to UniForm conversion mismatch (#3327) #### Which Delta project/connector is this regarding? 
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description add Delta logging to UniForm conversion mismatch commit 0a99248379ce2a6350d3b594538a9f50569cab8d Author: sabir-akhadov <52208605+sabir-akhadov@users.noreply.github.com> Date: Tue Jul 2 18:02:26 2024 +0200 Add Delta command operation metrics to SQL metrics (#3328) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Add missing operation metrics to SQL metrics emitted by Delta command Spark plans. ## How was this patch tested? existing tests ## Does this PR introduce _any_ user-facing changes? no commit f61b12d31a32d73f24b992f86e9617c714d75493 Author: Annie Wang <170372889+anniewang-db@users.noreply.github.com> Date: Tue Jul 2 09:02:06 2024 -0700 [Hudi] Add SQL config for synchronous Delta -> Hudi metadata conversion (#3326) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Hudi) ## Description This PR adds a Delta SQL config to let us make the Hudi metadata conversion synchronously after each commit instead of asynchronously (default). This SQL config is used for testing and debugging purposes only. ## How was this patch tested? Unit tests ## Does this PR introduce _any_ user-facing changes? No commit 23aa41e582dbab04af23b26f781d1d3d74334f92 Author: Annie Wang <170372889+anniewang-db@users.noreply.github.com> Date: Mon Jul 1 14:33:56 2024 -0700 [Hudi] Catch harmless HoodieExceptions when metadata conversion aborts (#3323) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Hudi) ## Description This PR adds a clause to catch a few specific HoodieExceptions from the Hudi metadata conversion that are not data-corrupting and only cause the conversion to abort. 
This is acceptable because the conversion of this commit will just happen in a future Delta commit instead. These exceptions can occur due to IO failures or multiple writers trying to write to the metadata table at the same time.

If multiple writers are writing to the metadata table at the same time, both writers see a failed commit in the Hudi metadata timeline and try to roll it back. The faster writer is able to roll it back with no problem and no exception, but the slower writer will try to roll it back and find that the commit no longer exists (since it was already rolled back by the faster writer). This will lead to an error. However, since the Hudi metadata table is updated within the Hudi commit transaction, the entire transaction will abort if there is a failure in writing to the metadata table. Thus, the commit is not marked as completed and the state of the Hudi table is unchanged. The incomplete commits to both the metadata table and the actual table itself will be cleaned up in a later transaction. The changes that we wanted to make in this commit will be made in the next commit instead, since the lastDeltaVersionConverted is unchanged in the table (no new data added).

Similar errors can also happen after the data is already committed and both writers are trying to clean up the metadata (in function markInstantsAsCleaned). Multiple writers may try to clean up the same Instant and the slower writer will again run into an error. (This is the "Error getting all file groups in pending clustering" error.) Again, this does not lead to any data corruption because it is only a cleanup step that gets aborted, and the data is already committed to the table. The cleanup will just be performed after a later commit instead.

## How was this patch tested?

Unit tests

## Does this PR introduce _any_ user-facing changes?
No commit dd39415912f6009fb9e5d2f4057288bb1e9fd117 Author: Annie Wang <170372889+anniewang-db@users.noreply.github.com> Date: Mon Jul 1 09:48:57 2024 -0700 [Hudi] Support list/map data type conversion (#3320) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Hudi) ## Description This PR adds functionality to convert Delta tables with list or map type columns to be Hudi-readable. ## How was this patch tested? Added unit tests in ConvertToHudiSuite and tested manually with external Hudi Spark reader. ## Does this PR introduce _any_ user-facing changes? Yes. Previously users could not enable the Delta table property for Hudi conversion on tables containing list/map columns and would receive an unsupportedType error but now they can. commit 207d8d2f50c79e0afc10327395e5b80a1b7caad7 Author: Annie Wang <170372889+anniewang-db@users.noreply.github.com> Date: Fri Jun 28 08:57:54 2024 -0700 [Hudi] Fix duplicate record naming in schema conversion (#3310) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Hudi) ## Description This PR fixes a bug in the Delta->Hudi schema conversion for structs. Previously, any table conversion that included at least one struct somewhere in the schema would fail. This is because the previous code was incorrectly naming the Avro RecordSchemas. There were two problems: 1. There was no namespace added to each record, so for a struct column with the following schema: ` (myName STRUCT>)` there would be an error due to the duplicate naming of the two structs. (**Avro does not allow recordSchemas with the same name under the same namespace**) It would incorrectly place the nested struct in the same namespace as the parent struct, and even though our schema should be valid the Avro schema creation would fail. 2. For some reason we were naming each record by its data type name instead of its own name. 
Since we represent the Delta schema as a struct, even if we don't have any nested structs inside our table, as long as we have at least one struct column in the schema we will end up creating a nested struct. Both of these records would be named "struct" and be under the same namespace (due to problem 1), so we would run into a duplication error even if we just have a single struct of ints.

So for an example table defined as follows: `CREATE TABLE myTable (col1 STRUCT)` the previous code would not work because it would represent our overall schema as a struct with name "struct", and our struct column would be a nested struct with name "struct" under the same namespace.

Now, I have changed it so that it works and is compatible with Spark+Hudi. We are now using namespaces and also naming with column names rather than column type names. For this example, our Avro schema would look like this:

```
{
  "type": "record",
  "name": "table",
  "fields": [
    {
      "name": "col1",
      "type": [
        "null",
        {
          "type": "record",
          "name": "col1",
          "namespace": "table",
          "fields": [
            { "name": "field1", "type": [ "null", "int" ] },
            { "name": "field2", "type": [ "null", "string" ] }
          ]
        }
      ]
    }
  ]
}
```

## How was this patch tested?

Unit test and manually tested with Hudi SparkSession reader.

## Does this PR introduce _any_ user-facing changes?

No

commit 87f0685ee7e68e680df8f3627388b0267040f83b
Author: Qianru Lao <55441375+EstherBear@users.noreply.github.com>
Date: Fri Jun 28 02:38:59 2024 -0400

[Kernel] Add monotonic inCommitTimestamp and read support for inCommitTimestamp (#3282)

## Description

Add read support for inCommitTimestamp and ensure the increasing monotonicity of inCommitTimestamp assuming that there are no conflicts to prepare for the complete inCommitTimestamp support with conflict resolution in Kernel.

## How was this patch tested?

Add unit tests to verify that the read of inCommitTimestamp is correct and inCommitTimestamp is monotonic.

## Does this PR introduce _any_ user-facing changes?
Yes, user can enable monotonic inCommitTimestamp assuming that there are no conflicts by enabling its property. commit c8e87b4cf0c43c37ece9614e3ea273974fde8d11 Author: Ming DAI Date: Thu Jun 27 19:15:05 2024 -0700 Use trigger.AvailableNow in ConvertToDelta suites (#3315) commit 66699bbf801ef63935f3bbc60a355441e3a86d9c Author: Johan Lasperas Date: Fri Jun 28 04:14:31 2024 +0200 [Spark] Introduce stable type widening table feature (#3314) commit b7da7f40a3955d3af3f279f7d18483e686d8d286 Author: Sumeet Varma Date: Thu Jun 27 13:14:11 2024 -0700 [Spark] Throw exception when additional listing also doesn't reconcile the listing gap in testing (#3244) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description We already log the failure scenario when additional file-system listing also can't reconcile the gap between concurrent file-system listing and commit-owner calls. With this PR, we will throw an exception if the above condition is triggered while testing. ## How was this patch tested? UTs ## Does this PR introduce _any_ user-facing changes? No commit c91de4b882c3e706d83c8c3c74fe7c44f9c88b3d Author: Johan Lasperas Date: Thu Jun 27 21:49:01 2024 +0200 [Protocol RFC] Finalize type widening protocol (#3297) ## Description Finalize the protocol RFC for type widening: - Add missing supported type changes: `byte`,`short`,`int` to `double` and integers to decimals. - Remove `tableVersion` from the type change metadata fields. - Remove requirements around populating default row commit versions. The two last requirements were initially intended to allow matching each file against type changes that happened before or after it was written. This didn't prove useful in practice - it was temporarily used to collect files to rewrite when dropping the feature but this now relies on fetching parquet footer as a more robust and simpler approach. 
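As a rough illustration of the type changes named in the type widening RFC commit above, the following Python sketch encodes them as a lookup table. The names `LISTED_WIDENINGS` and `is_listed_widening` are hypothetical, "integers" is assumed here to mean `byte`, `short`, `int` and `long`, the table covers only the changes this commit message mentions (not the full protocol), and decimal precision/scale constraints are not modeled.

```python
# Hypothetical lookup of the type changes named in the commit message above.
# This is NOT the complete list from the Delta protocol, and decimal
# precision/scale requirements are deliberately left out.
INTEGER_TYPES = ("byte", "short", "int", "long")  # assumed meaning of "integers"

# byte, short, int -> double
LISTED_WIDENINGS = {(src, "double") for src in ("byte", "short", "int")}
# integers -> decimal
LISTED_WIDENINGS |= {(src, "decimal") for src in INTEGER_TYPES}


def is_listed_widening(from_type: str, to_type: str) -> bool:
    """Return True if (from_type, to_type) is one of the changes listed above."""
    return (from_type, to_type) in LISTED_WIDENINGS


# Every 32-bit integer is exactly representable as a 64-bit double,
# so the largest int survives the round trip unchanged.
INT_MAX = 2**31 - 1
int_max_round_trips = float(INT_MAX) == INT_MAX
```

A reader could call `is_listed_widening("short", "double")` to check a proposed change against this partial list; anything absent from the table is simply reported as not listed, which says nothing about whether the full protocol allows it.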
commit 47871d8adb82f47240347934156b2abee5426e15 Author: Annie Wang <170372889+anniewang-db@users.noreply.github.com> Date: Thu Jun 27 09:37:35 2024 -0700 [Hudi] setCommitFileUpdates bug fix (#3309) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Hudi read support) ## Description This PR fixes a bug in the Delta->Hudi metadata conversion logic. When the number of actions to convert is greater than the action batch size (can be changed by user), the previous code incorrectly only converted the last batch instead of converting all batches. ## How was this patch tested? Unit test ## Does this PR introduce _any_ user-facing changes? No commit c395a63caef724bfe2e8e0633754038f8f5987d4 Author: Qianru Lao <55441375+EstherBear@users.noreply.github.com> Date: Thu Jun 27 11:35:41 2024 -0400 [Kernel] Add non-monotonic inCommitTimestamp with related table properties and table features (#3276) ## Description Add non-monotonic `inCommitTimestamp` with related table properties and table features to prepare for adding monotonic `inCommitTimestamp` in later PRs. ## How was this patch tested? Add unit tests to verify the `inCommitTimestamp` and related table properties and table features when enabling the `inCommitTimestamp` enablement property. ## Does this PR introduce _any_ user-facing changes? Yes, the user can enable non-monotonic `inCommitTimestamp` by enabling its property. commit eb26989e046165d0c93cff30f3655bf5c179b0f4 Author: Andreas Chatzistergiou <93710326+andreaschat-db@users.noreply.github.com> Date: Wed Jun 26 21:34:39 2024 +0200 History truncation/validation support for writer features in DROP FEATURE command (#3296) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Currently, history truncation/validation in DROP feature command is only performed for reader+writer features. 
This PR enables this functionality for writer features as well. For reader+writer features it is always enabled by default, while for writer features it has to be explicitly enabled. ## How was this patch tested? Added new tests in `DeltaProtocolVersionSuite`. ## Does this PR introduce _any_ user-facing changes? No. commit 7b4dbea42a286eadfdad71906ffe988ed966c90b Author: Annie Wang <170372889+anniewang-db@users.noreply.github.com> Date: Wed Jun 26 11:10:11 2024 -0700 [Hudi] Upgrade Hudi version (#3311) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Hudi) ## Description This PR upgrades the Hudi version from 0.14.0 to 0.15.0. This upgrade is needed because 0.14.0 does not support Apache Spark 3.5. ## How was this patch tested? Unit tests ## Does this PR introduce _any_ user-facing changes? No commit 0dc07226da9311d116b280eabdf4aeae60c75754 Author: Sumeet Varma Date: Wed Jun 26 08:49:57 2024 -0700 [Spark] Handle concurrent CREATE TABLE IF NOT EXISTS ... LIKE ... table commands (#3306) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description When 2 or more CREATE TABLE IF NOT EXISTS table commands are run concurrently, they both think the table doesn't exist yet and the second command fails with a TABLE_ALREADY_EXISTS error. With this PR, we aim to make sure the second command ends up as a no-op instead of a failure. ## How was this patch tested? UTs ## Does this PR introduce _any_ user-facing changes? No commit 4482ee3a346390055cac2a722fa896658659ef7b Author: Chirag Singh <137233133+chirag-s-db@users.noreply.github.com> Date: Wed Jun 26 08:46:56 2024 -0700 [Spark] Improve internal error message for clustering column validation (#3300) #### Which Delta project/connector is this regarding?
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Improves error message on internal clustering column API. ## How was this patch tested? N/A. commit 136f88a1ef3ea9acc69fd58bea7f8aefad6271c1 Author: Lars Kroll Date: Wed Jun 26 16:28:04 2024 +0200 [Spark] Break down ImplicitDMLCastingSuite (#3308) ## Description - Break down ImplicitDMLCastingSuite into command-specific sub-suites to improve organisation of tests. ## How was this patch tested? Testing-only PR. commit 2abc2a29655a10cfa9f52d666ed7e46be6d61fda Author: Jiaheng Tang Date: Tue Jun 25 20:43:02 2024 -0700 [Spark] Migrate logging code to use Spark Structured Logging (#3293) ## Description Migrate Delta's logging code to use Spark structured logging. Structured logging will be introduced in Spark 4.0, as highlighted in the [preview release of Spark 4.0 | Apache Spark](https://spark.apache.org/news/spark-4.0.0-preview1.html). Note that the feature is turned off by default and the output log message will remain unchanged. Resolves #3145 ## How was this patch tested? Existing tests. commit 8e3647a3a88307e23113f361d0e5cd7a73f0e979 Author: Rakesh Veeramacheneni <173086727+raveeram-db@users.noreply.github.com> Date: Tue Jun 25 20:40:34 2024 -0700 [Kernel][Defaults] Support reading timestamp_ntz stored as INT96 (#3301) ## Description Resolves #2908 ## How was this patch tested? 
Added a unit test that reads an INT96 column written by Spark as TIMESTAMP_NTZ commit 7bb979205d7eb4cd8aaa04da8fd960f3862b53b7 Author: ChengJi-db Date: Tue Jun 25 14:04:02 2024 -0700 [Delta Uniform] Support expireSnapshot in uniform iceberg table automatically when OPTIMIZE (#3298) ## Description **_Issue_**: the current uniform iceberg table doesn't have a mechanism to cleanup old manifest/manifest list files, which adds great storage maintenance overhead **_Proposed_**: when `OPTIMIZE` is running on uniform delta table, it will trigger the `expireSnapshot` operation on corresponding iceberg table to do cleanup on manifests. The `OPTIMIZE` is chosen as the trigger since it's recommended to run frequently on delta table and iceberg's `expireSnapshot` is also recommended to run frequently (once every day) ## How was this patch tested? Manually tested commit e054904f32d1da94e2bebad3e1bbd8daa4a8919c Author: leonwind-db Date: Tue Jun 25 19:38:01 2024 +0200 [Spark] Add trait for generic executor observer thread local storage (#3307) ## Description Add generic trait for thread local execution observers to extend and implement. ## How was this patch tested? Unit tests commit 715f45ebe7848b8b549717118270636b91847015 Author: Jade Wang <111902719+jadewang-db@users.noreply.github.com> Date: Tue Jun 25 16:56:15 2024 +0000 [Spark][Sharing] Update delta sharing client to version 1.1.0 (#3303) ## Description Upgrade delta-sharing-client to v1.1.0 ## How was this patch tested? Existing test commit d929d369cee4fcdee6a5329cda4807eac3ed47cc Author: Venki Korukanti Date: Tue Jun 25 09:19:36 2024 -0700 [Spark] Pin the `pip` version to `24.0` to get around the version format requirement (#3302) ... enforced by the `pip` from `24.1` Recent `delta-spark` [CI jobs](https://github.com/delta-io/delta/actions/runs/9628486756/job/26556785657) are failing with the following error. 
``` ERROR: Invalid requirement: 'delta-spark==3.3.0-SNAPSHOT': Expected end or semicolon (after version specifier) delta-spark==3.3.0-SNAPSHOT ~~~~~~~^ ``` Earlier [runs](https://github.com/delta-io/delta/actions/runs/9526169441/job/26261227425) had the following warning ``` DEPRECATION: delta-spark 3.3.0-SNAPSHOT has a non-standard version number. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of delta-spark or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063 ``` Pinning the `pip` version to `23.2.1` to let the jobs pass. We need to find a long-term solution on the version of the PyPI generated to avoid this issue but it is a bit complicated as the `delta-spark` PyPI also depends on the delta jars with the same version as the PyPI package name. commit e36829b0d987499ed9ac44d70e33c05f4d43f120 Author: Christos Stavrakakis Date: Tue Jun 25 17:45:52 2024 +0200 [PROTOCOL] Allow CDC actions to register data files (#3285) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (PROTOCOL) ## Description Update the Delta PROTOCOL to allow `AddCDCFile` actions that do not add Change Data Files, but instead register Data Files that are also added by `AddFile` actions. In this case the `_change_type` column in the Data Files might not be `null`. Non-change data readers should disregard this column and only process columns defined within the table's schema. ## How was this patch tested? N/A ## Does this PR introduce _any_ user-facing changes? Protocol clarification. commit 1bccf8d906c4ab2109ce071932d6744e0c4b79df Author: Abhishek Radhakrishnan Date: Mon Jun 24 16:50:00 2024 -0700 [Kernel] Change `Class.forName()` usage in the `LogStoreProvider`. 
(#3304) ## Description Look for the `LogStore` class more broadly in the class loader than just the thread local's class loader. The thread context class loader requires the Thread local variables to have the `LogStore` in it, but this class loader may not have all the dependencies wired up. ## How was this patch tested? Not tested yet. Fixes https://github.com/delta-io/delta/issues/3299. commit 05e647ab4a8db3dc9cb87a9f1b4f225e3e08f15d Author: Venki Korukanti Date: Mon Jun 24 13:43:54 2024 -0700 [Kernel] Fix issue querying tables with spaces in the name (#3291) ## Description Currently, Kernel uses a mix of path (file system path) or URI (in string format) in API interfaces, which causes confusion and bugs. Context: Path refers to a file system path which could have some characters that should be escaped when converted to URI E.g. path: `s3:/bucket/path to file/`, URI for the same path: `s3:/bucket/path%20to%20file/` Make it uniform everywhere to just use the paths (file system path). ## How was this patch tested? Additional tests with table path containing spaces. commit 9f03492d34152b76868e8c8c2842633135d8e932 Author: Venki Korukanti Date: Mon Jun 24 11:33:10 2024 -0700 [Kernel][Defaults] Add support for pushing IS [NOT] NULL into Parquet reader (#3292) ## Description Allows pushing down predicates `IS NULL` and `IS NOT NULL` into the default Parquet reader. Helps prune the number of row groups read based on the predicates. ## How was this patch tested? Unit tests commit 5ea073b648c8e2ec088d2912eda87284e4792281 Author: Venki Korukanti Date: Mon Jun 24 11:12:54 2024 -0700 [Kernel] Fix the partition path construction (#3290) ## Description The current code does not escape the control + special characters in partition values when constructing a path for writing the data related to the partition. Not escaping these characters could cause invalid path issues. The escaping logic is similar to what Spark and Hive do. ## How was this patch tested? Unit tests. 
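The partition path fix above can be pictured with a small Python sketch of Hive-style percent-escaping. The character set below is illustrative only and is an assumption, not Kernel's actual list; the helper names `escape_partition_value` and `partition_path` are hypothetical as well.

```python
from typing import Mapping

# Illustrative set only: ASCII control characters plus a handful of characters
# that are unsafe in a path segment. The exact list used by Kernel, Spark or
# Hive may differ.
SPECIAL_CHARS = (
    set('"#%\'*/:=?\\{[]^')
    | {chr(code) for code in range(0x01, 0x20)}
    | {"\x7f"}
)


def escape_partition_value(value: str) -> str:
    """Replace each special character with '%' followed by its two-digit hex code."""
    return "".join(
        f"%{ord(ch):02X}" if ch in SPECIAL_CHARS else ch for ch in value
    )


def partition_path(partition_values: Mapping[str, str]) -> str:
    """Build a 'col=value/col=value' directory path from escaped names and values."""
    return "/".join(
        f"{escape_partition_value(col)}={escape_partition_value(val)}"
        for col, val in partition_values.items()
    )
```

With this sketch, a value such as `11:12` would be written under `ts=11%3A12`, so the colon never reaches the file system path. Escaping `%` itself keeps the mapping reversible.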
commit 864b5e34953fd9240e882b293afdc03d550d687c Author: Allison Portis Date: Mon Jun 24 11:07:44 2024 -0700 [Kernel] Throw InvalidTableException when we encounter a known invalid table state (#3288) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description During log reconstruction we throw a variety of errors when we encounter a table state that we consider invalid. This PR updates some of those errors to be `InvalidTableException`. This doesn't necessarily cover all possible exceptions due to an invalid table, just some of the known and clear cases during log reconstruction. ## How was this patch tested? Updates existing tests. ## Does this PR introduce _any_ user-facing changes? Yes, instead of internal `RuntimeExceptions`, `IllegalStateExceptions`, etc., an `InvalidTableException` will be thrown when certain invalid table states are encountered. --------- Co-authored-by: Venki Korukanti commit 956b95078edca14599e9fa3d1a560bfeddf7be7b Author: Jiaheng Tang Date: Mon Jun 24 10:50:20 2024 -0700 [Spark] Fix compile error on Spark master due to ParserInterface changes (#3294) ## Description https://github.com/apache/spark/pull/46665 introduced a new `parseScript` method in `ParserInterface`, which broke compilation against Spark master: ``` [error] /home/runner/work/delta/delta/spark/src/main/scala/io/delta/sql/parser/DeltaSqlParser.scala:75:7: class DeltaSqlParser needs to be abstract. [error] Missing implementation for member of trait ParserInterface: [error] def parseScript(sqlScriptText: String): org.apache.spark.sql.catalyst.parser.CompoundBody = ???
[error] class DeltaSqlParser(val delegate: ParserInterface) extends ParserInterface { [error] ^ [warn] 100 warnings found [error] one error found [error] (spark / Compile / compileIncremental) Compilation failed ``` This PR fixes the issue by shimming the `ParserInterface` since `parseScript` is not available in Spark 3.5 ## How was this patch tested? Existing DeltaSqlParserSuite commit 2dfbcdcd5010801892b191b08b41148c98d7445e Author: andrewxue-db <169104436+andrewxue-db@users.noreply.github.com> Date: Mon Jun 24 08:29:53 2024 -0700 [Spark] use map lookup in createPhysicalSchema (#3236) ## Description Instead of calling `SchemaUtils.findNestedFieldIgnoreCase` for each column, we prepare a map with `SchemaUtils.explode` beforehand, and perform map lookups during iteration. This speeds up this function on wide tables. It may still be slow for tables with deeply nested schemas because the path needs to be built every time, but there should be no regression. ## How was this patch tested? Manual profiling for an alter table add columns query: Before: ~13s. After: ~3s. commit 6b01387c03912a57976ce5411589c57c932db019 Author: Venki Korukanti Date: Fri Jun 21 13:09:58 2024 -0700 [Kernel][Tests] Fix test failure when running in non-UTC tz env (#3289) ## Description Currently, Spark interprets the timestamp partition columns in the local zone and Kernel in UTC. Delta protocol specifies no details on what timezone to use. This is a known issue and in Kernel we decided to always interpret as UTC to avoid timezone issues. The fix here is to set the timezone to UTC when getting the expected results using Spark, so that they match the values from Kernel. ## How was this patch tested?
Ran locally on Pacific timezone test env commit 12efca46ba31ce89164469c0bad110b69e291a73 Author: Tom van Bussel Date: Tue Jun 18 19:56:35 2024 +0200 [Docs] Add documentation for Row Tracking (#2939) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Docs) ## Description This PR adds documentation for the new Row Tracking table feature that is releasing in Delta 3.2.0 ## How was this patch tested? N/A ## Does this PR introduce _any_ user-facing changes? N/A commit 0cd33a143bd36ff348a78f03d4180277736dd67f Author: Qianru Lao <55441375+EstherBear@users.noreply.github.com> Date: Tue Jun 18 00:47:20 2024 -0400 [Kernel] Add API on TxnBuilder to set the table properties (#3269) ## Description Adds an API to `TransactionBuilder` to set the table properties to provide a way to configure the table by committing a transaction. For example, user can enable inCommitTimestamp property with this API. ## How was this patch tested? Adds unit tests when setting valid and invalid properties with this API. ## Does this PR introduce _any_ user-facing changes? Yes, connectors can use this API to set table properties. commit 75c6acbb1cffe745b3616703185f91a76b2ea962 Author: Jiaheng Tang Date: Mon Jun 17 10:47:54 2024 -0700 [Spark] Support show tblproperties and update catalog for clustered table (#3271) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Support show tblproperties for clustered table, and support updating the clustering column properties in the catalog. Remove table properties from describe detail's output since that's using the properties from metadata. ## How was this patch tested? Add verification for table properties, describe detail, and catalog table in verifyClusteringColumns. ## Does this PR introduce _any_ user-facing changes? 
No commit 96ae1c53a8cfe89a27801fd673aca94856a584fe Author: Lukas Rupprecht Date: Mon Jun 17 10:46:56 2024 -0700 [Other] Adds coordinated commits-related interfaces and definitions to storage module (#3275) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (storage module) ## Description This PR takes the existing definitions in CommitCoordinatorClient.scala and converts them to Java classes/interfaces in the storage module. This is in preparation for replacing the CommitCoordinatorClient in Spark Delta with a generic Java-based module that can be implemented in any Delta client. ## How was this patch tested? This PR only copies existing definitions to Java classes/interfaces in the storage module. These new classes/interfaces are not in use yet so no tests are required. ## Does this PR introduce _any_ user-facing changes? No commit 8c7b62e1a39a27e7108208a8d921e8de07b60ff2 Author: Chirag Singh <137233133+chirag-s-db@users.noreply.github.com> Date: Fri Jun 14 17:06:48 2024 -0700 [Spark] Fix validation of clustering columns (#3273) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Currently, clustering columns are validated by parsing a list of clustering columns. This is super brittle, and breaks when any clustering column has a comma in the name. Fix that by passing a list of clustering columns directly. This fix resolves https://github.com/delta-io/delta/issues/3265 ## How was this patch tested? Test-only change. ## Does this PR introduce _any_ user-facing changes? No. commit d23324dd8854dfe790130122fc9ca66e557508ee Author: Adam Binford Date: Fri Jun 14 20:06:14 2024 -0400 [Spark] Optimize batching / incremental progress (#3089) #### Which Delta project/connector is this regarding?
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Resolves https://github.com/delta-io/delta/issues/3081 Adds support for splitting an optimize run into batches with a new config `spark.databricks.delta.optimize.batchSize`. Batches will be created by grouping existing bins into groups until `batchSize` is reached. The default behavior remains the same, and batching is only enabled if the `batchSize` is configured. This will apply to all optimization paths. I don't see any reason it shouldn't apply to compaction, z-ordering, clustering, auto-compaction, or reorg/DV rewriting if a user configures it. The way transactions are handled within the optimize executor had to be updated. Instead of creating a transaction upfront, we list all the files in the most recent snapshot, and then create transactions for each batch. This is very important to add for clustering, as there is no way to manually optimize a subset of the table using partition filtering. Without batching, a lot of execution time and storage space could be wasted if something fails before the entire table has been optimized. ## How was this patch tested? A simple new UT is added. I can add others as well, just looking for some feedback on the approach and suggestions of what other tests to add. ## Does this PR introduce _any_ user-facing changes? Yes, adds new capability to optimization that is disabled by default. commit ee350db1fd5e8a5aae42ad579e69646c83870663 Author: Venki Korukanti Date: Fri Jun 14 11:46:28 2024 -0700 [Kernel][Defaults] Minor cleanups + additional logging (#3274) ## Description Minor cleanups + additional logging. * Pass the cause exception to `KernelException`, so that it is visible to the caller * Add logging when the LogStore can't be created. * Convert `RuntimeException` to `KernelException` when an invalid schema JSON string is received. ## How was this patch tested?
Existing tests commit 7b19e8e777962a6fd459ccf610ba69d7fb62da2c Author: Dhruv Arya Date: Fri Jun 14 11:08:09 2024 -0700 [Spark] Rename managed commit to coordinated commits (#3237) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (RFC) ## Description Renames Managed Commit to Coordinated Commits and Commit Owner to Commit Coordinator to better express the meaning of the feature. Configs, table feature name, classes, functions, comments, file names, and directory names have all been updated to reflect the new terminology. ## How was this patch tested? Existing tests should cover these changes. ## Does this PR introduce _any_ user-facing changes? Yes, the feature name and config names have changed. commit 984e81d30299bde86c951c321968bcdfcd1f55a3 Author: Tom van Bussel Date: Fri Jun 14 17:50:17 2024 +0200 [Spark] Clean up imports in CheckConstraintsSuite.scala (#3270) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description See title ## How was this patch tested? N/A ## Does this PR introduce _any_ user-facing changes? No commit 7a451dc4f35a004b48fc9f1bea52fb15f3682601 Author: Jiaheng Tang Date: Thu Jun 13 15:42:51 2024 -0700 [Spark] Support Spark Structured Logging API (#3146) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Uber issue: #3145 Support Spark Structured Logging API by using the shimming layer. Spark Structured Logging is only available in Spark 4.0 snapshot so need to shim the API to make it compile for Spark 3.5. ## How was this patch tested? Tests that the new API on Spark master can produce structured logs: DeltaStructuredLoggingSuite Tests that the shimming API on Spark 3.5 still produce the same plain text logs: DeltaPatternLoggingSuite ## Does this PR introduce _any_ user-facing changes? 
No. commit fa7d6c0eb5de0e05b0efac6b11cf96c58740a828 Author: Yumingxuan Guo Date: Thu Jun 13 11:25:48 2024 -0700 [DELTA] Removes asInstanceOf[Metadata] casts (#3259) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR removes the asInstanceOf[Metadata] casts in AbstractBatchBackfillingCommitOwnerClient. ## How was this patch tested? Ensured the existing test suites, along with the newly added unit test for the helper function, all pass after the removal. ## Does this PR introduce _any_ user-facing changes? No. commit 5a6f382f2b61bb0db21e8ea33e154defb9ce9b37 Author: Tom van Bussel Date: Thu Jun 13 19:02:19 2024 +0200 [Spark] Support dropping the CHECK constraints table feature (#2987) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR adds support for dropping the `checkConstraints` table feature using the `ALTER TABLE ... DROP FEATURE` command. It throws an error if the table still contains CHECK constraints (as dropping a feature should never change the logical state of a table). ## How was this patch tested? Added a test to `CheckConstraintsSuite` ## Does this PR introduce _any_ user-facing changes? Yes, `ALTER TABLE ... DROP FEATURE checkConstraints` is now supported. commit 4fac1f1b7ea3b9cd1bbeeccec4fb02129338a986 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Wed Jun 12 16:44:57 2024 -0700 [Doc] Delta 4.0 - Delta Connect documentation (#3232) ## Description Add a documentation page for the [Delta Connect](https://github.com/delta-io/delta/issues/1570), in Delta 4.0 preview. ## How was this patch tested? 
N/A --------- Co-authored-by: Allison Portis commit b07bf47d0e2593c87b22443d92eb377c31738163 Author: Dhruv Arya Date: Wed Jun 12 16:26:47 2024 -0700 [Doc] Add doc for the coordinated commits writer feature (#3203) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (Doc) ## Description Documents the Coordinated Commits writer table feature as well as the DynamoDB Commit Coordinator implementation. Describes the process for enabling the feature for tables. --------- Co-authored-by: Allison Portis commit 35dcd2767af33b2e14d80ce0ad4da0f21206fea6 Author: richardc-db <87336575+richardc-db@users.noreply.github.com> Date: Wed Jun 12 16:20:40 2024 -0700 [VARIANT][DOCS] Add Variant type (Preview) to docs (#3206) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description adds variant type to the docs in "feature by protocol version" and "what delta features require client upgrades" section. I didn't think that this feature requires a dedicated section (similar to TimestampNTZ). ## How was this patch tested? manually tested links using localhost ## Does this PR introduce _any_ user-facing changes? --------- Co-authored-by: Allison Portis commit ed508cb2ecdd330acd5af82aaec71d8846cb08c4 Author: Andreas Chatzistergiou <93710326+andreaschat-db@users.noreply.github.com> Date: Wed Jun 12 20:58:08 2024 +0200 [Tests] Improve the running time of DeletionVectorsSuite (#3263) ## Description Improves the running time of DeletionVectorsSuite. It also includes some minor import cleanup. Resolves #3257 ## How was this patch tested? Test only PR. 
commit c41dd6d56e9560a4d5ad573cb274ceab48ac4902 Author: Andreas Chatzistergiou <93710326+andreaschat-db@users.noreply.github.com> Date: Wed Jun 12 19:24:34 2024 +0200 Improve documentation in PROTOCOL.md about TightBounds (#3264) ## Description This PR adds a remark in PROTOCOL.md about the behavior of `tightBounds` when not present in the statistics. ## Does this PR introduce _any_ user-facing changes? Yes. Adds a remark in PROTOCOL.md about the behavior of `tightBounds` when not present in the statistics. commit 5106d279248c781122962ec474906f17bfd4dfd6 Author: Sumeet Varma Date: Tue Jun 11 19:25:08 2024 -0700 [Spark] Append the tieBreaker unicode max character only if we actually truncated the string (#3222) ## Description This is to not append the tieBreaker character when no part of the string was truncated ## How was this patch tested? UTs commit 3aa8be0b1761bdb3f2800853efda82332c9f34e6 Author: Lin Zhou <87341375+linzhou-db@users.noreply.github.com> Date: Tue Jun 11 19:24:12 2024 -0700 Fix the issue of not using the correct refreshToken (#3238) ## Description Fix the issue of not using the correct refreshToken, when refreshing on the 2nd time, or more. ## How was this patch tested? Unit Tests commit bb148334a7e4f35dd00e597f8ec6902123b82503 Author: Jiaheng Tang Date: Tue Jun 11 11:04:03 2024 -0700 [Spark] Support create external table for clustered table (#3251) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Support creating a clustered table from an external location that already has a clustered table. 
We follow the same semantics as partitioned tables:

External location already has clustered/partitioned table:

| Create clustered/partitioned table | partitioned | clustered |
| ----------------------------------- | ------ | ----------- |
| schema not specified, cluster/partitioned by not specified | success | success |
| schema specified, cluster by/partitioned by not specified | throw `DELTA_CREATE_TABLE_WITH_DIFFERENT_PARTITIONING` | throw `DELTA_CREATE_TABLE_WITH_DIFFERENT_CLUSTERING` |
| schema specified, cluster by/partitioned by different column | throw `DELTA_CREATE_TABLE_WITH_DIFFERENT_PARTITIONING` | throw `DELTA_CREATE_TABLE_WITH_DIFFERENT_CLUSTERING` |
| schema specified, cluster by/partitioned by same column | success | success |

External location already has non-clustered/non-partitioned table:

| Create clustered/partitioned table | partitioned | clustered |
| ----------------------------------- | ------ | ----------- |
| schema specified, cluster by/partitioned by specified | throw `DELTA_CREATE_TABLE_WITH_DIFFERENT_PARTITIONING` | throw `DELTA_CREATE_TABLE_WITH_DIFFERENT_CLUSTERING` |

## How was this patch tested? Added new unit tests to cover all scenarios above. ## Does this PR introduce _any_ user-facing changes? No. commit a4c33846d9d5425848b182528b979ca2c2988dc7 Author: Prakhar Jain Date: Tue Jun 11 10:35:39 2024 -0700 [DELTA] Add logs around managed commits (#3246) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Add logging around managed commits. ## How was this patch tested? UTs ## Does this PR introduce _any_ user-facing changes? No commit 7320d6bc9ea301d13247756c8d070d85bce6c646 Author: Tom van Bussel Date: Tue Jun 11 17:58:50 2024 +0200 [Spark] Log duration of DeltaSource operations (#2846) #### Which Delta project/connector is this regarding?
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR adds logging to the main DeltaSource operations so that we can measure the duration of these operations. ## How was this patch tested? Existing tests to make sure nothing breaks. ## Does this PR introduce _any_ user-facing changes? No commit a24a857d27bbb4b3585814c35fa3002cee929fec Author: ChengJi-db Date: Mon Jun 10 10:36:24 2024 -0700 [Delta Uniform] Compute correct MAX_ID in column mapping on a schema with nested fields and already have IDs assigned (#3234) ## Description Proposes a fix to prevent a Delta table from getting duplicate ids assigned when the schema has nested fields that already have ids assigned. **Issue**: today, when we assign column ids, we first compute the `maxId` of the existing columns and assign ids for new fields starting from `maxId + 1`. However, the existing code doesn't consider nested ids when computing the `maxId`, so it's possible to have duplicate ids assigned to different columns, which causes the uniform Iceberg conversion to fail since Iceberg requires that the id is unique for each column. **Proposed fix**: we are adding the logic to consider nested fields' ids when computing `maxId`. ## Does this PR introduce _any_ user-facing changes? No commit 3e60ff159bd90eb7ba422eb6312342944a4ce2f9 Author: Tom van Bussel Date: Mon Jun 10 16:26:25 2024 +0200 [Spark] Use SQL expression in VARCHAR violation error (#3242) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Small fix to use the SQL expression instead of the internal catalyst representation of an expression in the error message for a VARCHAR length constraint violation. ## How was this patch tested? Modified an existing test ## Does this PR introduce _any_ user-facing changes?
Yes, the error message for the class `DELTA_EXCEED_CHAR_VARCHAR_LIMIT` will use different formatting for the expression of the violated constraint. For example, the user will now see `((value IS NULL) OR (length(value) <= 12))` instead of `(isnull('value) OR (length('value) <= 12))` commit ef1def9e3c97d80518dcb9b5cd48f58e8e859bd8 Author: Jiaheng Tang Date: Fri Jun 7 15:53:29 2024 -0700 [Spark] Support RESTORE for clustered table (#3194) ## Description Support RESTORE for clustered tables by adding a new domain metadata to overwrite the existing one so that clustering columns are correctly restored. ## How was this patch tested? New unit tests. commit 87549c53e8fb76d40e42711e59b8eba97473f035 Author: Johan Lasperas Date: Fri Jun 7 21:00:24 2024 +0200 [Spark][Test-only] Split type widening tests in multiple suites (#3223) ## Description The `DeltaTypeWideningSuite` has grown organically as the feature was implemented and now contains a very large number of tests covering different aspects. This change splits it into individual suites, this is fairly straightforward as the tests were already collected in multiple traits. - Only test `widening Date -> TimestampNTZ rejected when TimestampNTZ feature isn't supported` had to be updated to explicitly disable timestampNTZ support in the test. All other tests are completely unchanged. - All files, classes, traits are renamed from `DeltaTypeWideningX` to `TypeWideningX` and moved from `tahoe` to `tahoe.typewidening` ## How was this patch tested? Test-only change ## Does this PR introduce _any_ user-facing changes? No commit 5289a5ec8228ce32637e6aa10e75a77fe9485484 Author: Venki Korukanti Date: Fri Jun 7 11:03:39 2024 -0700 [Kernel][Parquet Writer] Fix an issue with writing decimal as binary (#3233) ## Description The number of bytes needed to calculate the max buffer size needed when writing the decimal type to Parquet is off by one. Resolved #3152 ## How was this patch tested? 
Added unit tests that read and write decimals with various precision and scales. commit 273f988ff74ff39d5ca75b47230e5b17e5b2857b Author: Dhruv Arya Date: Thu Jun 6 11:56:29 2024 -0700 [Delta] Simplify DeltaHistoryManagerSuite (#3226) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Refactors DeltaHistoryManagerSuite so that a common method is used for updating timestamps. ## How was this patch tested? Test-only change. ## Does this PR introduce _any_ user-facing changes? No commit 1495bb4fe5c21bf326c3f54ea3a097e149c3cead Author: Dhruv Arya Date: Thu Jun 6 11:56:01 2024 -0700 [Spark] Make naming of manage commit consistent (#3224) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description 1. Replaces instances of Managed Commits with Managed Commit (without the s) in functions, configurations, and docs. 2. Renames the feature name to managedCommit-preview 3. Renames the DynamoDBCommitOwner tableConfig key to dynamoDBTableName ## How was this patch tested? Existing tests should cover this refactor. ## Does this PR introduce _any_ user-facing changes? The managed commit table feature name has been updated to managedCommit-preview commit 6a843ba4c49e2c91b972856ec7e6b9d999eced2a Author: Paddy Xu Date: Thu Jun 6 17:29:12 2024 +0200 [Spark] Add a Delta config to enable/disable the workaround of handling colons ':' in paths (#3193) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR follows https://github.com/delta-io/delta/pull/3153 and introduces a Delta config to enable or disable the workaround of handling colons ':' in paths. The value of this config is `true` by default. ## How was this patch tested? Existing tests. ## Does this PR introduce _any_ user-facing changes? 
No, unless the user encountered any problem, which is very unlikely. commit 18d1a276bfda4404d5809848820625d80fe19ca1 Author: Qianru Lao <55441375+EstherBear@users.noreply.github.com> Date: Wed Jun 5 16:04:45 2024 -0400 [Kernel] Add the InCommitTimestamp table feature (#3218) [Kernel] Add the `InCommitTimestamp` as supported writer table feature commit a8e339f32eab2aadcb4e93053d7e247c3a5ae481 Author: Sumeet Varma Date: Wed Jun 5 11:51:24 2024 -0700 [Spark] Remove deprecated FileNames.deltaFile method (#3178) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description - Clients are expected to use DeltaCommitFileProvider instead. - FileNames.unsafeDeltaFile can be used in case the clients knows for sure the file has been backfilled. ## How was this patch tested? Build ## Does this PR introduce _any_ user-facing changes? No commit cd957996d54aa21adde47a4cd093543152a86a3e Author: Johan Lasperas Date: Wed Jun 5 17:25:39 2024 +0200 [Spark] Drop Type widening feature: read Parquet footers to collect files to rewrite (#3155) ## What changes were proposed in this pull request? The initial approach to identify files that contain a type that differs from the table schema and that must be rewritten before dropping the type widening table feature is convoluted and turns out to be more brittle than intended. This change switches instead to directly reading the file schema from the Parquet footer and rewriting all files that have a mismatching type. ### Additional Context Files are identified using their default row commit version (a part of the row tracking feature) and matched against type changes previously applied to the table and recorded in the table metadata: any file written before the latest type change should use a different type and must be rewritten. 
This requires multiple pieces of information to be accurately tracked: - Default row commit versions must be correctly assigned to all files. E.g. files that are copied over without modification must never be assigned a new default row commit version. On the other hand, default row commit versions are preserved across CLONE but these versions don't match anything in the new cloned table. - Type change history must be reliably recorded and preserved across schema changes, e.g. column mapping. Any bug will likely lead to files not being correctly rewritten before removing the table feature, potentially leaving the table in an unreadable state. ## How was this patch tested? Tests added in previous PR to cover CLONE and RESTORE: https://github.com/delta-io/delta/pull/3053 Tests added and updated in this PR to cover rewriting files with different column types when removing the table feature. commit 4b102d34a2ce881b2a851b4c6cfbf2ab3ab5534f Author: Venki Korukanti Date: Tue Jun 4 09:50:00 2024 -0700 [Kernel] Add Delta log replay metrics in active file list construction (#3205) ## Description Currently, the log replay code used to find the active add files in a table snapshot is not observable. Add a few metrics to provide visibility into the log replay. These metrics are collected as the `CloseableIterator` of scan files is consumed and printed in slf4j logs when the iterator is closed. The following metrics are collected: * number of `AddFile` actions seen * number of `AddFile` actions seen from delta files * number of duplicate `AddFile` actions seen * number of tombstones seen * number of active `AddFile`s ## How was this patch tested? Add unit tests with different scenarios * deltas only * checkpoint class/multi-parts * stats recompute - generates duplicate add files * removes - generates tombstones ## Does this PR introduce _any_ user-facing changes? No, but the info level logs will have a message with the log replay metrics.
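
The five log replay metrics listed above can be illustrated with a minimal Python sketch. This is not the Kernel implementation: the `ReplayMetrics` class, the `replay` function, and the `(kind, path, from_delta)` tuple layout are hypothetical names chosen for illustration, and the sketch assumes actions arrive newest first, with checkpoint actions last.

```python
from dataclasses import dataclass


@dataclass
class ReplayMetrics:
    """Hypothetical container mirroring the metrics named in the commit message."""
    add_files_seen: int = 0
    add_files_seen_from_delta: int = 0
    duplicate_add_files: int = 0
    tombstones_seen: int = 0
    active_add_files: int = 0


def replay(actions):
    """Replay (kind, path, from_delta) tuples, newest action first.

    Assumption for this sketch: kind is "add" or "remove", and from_delta is
    True when the action came from a delta file rather than a checkpoint.
    """
    metrics = ReplayMetrics()
    tombstones = set()  # paths removed by a newer action
    seen_adds = set()   # paths already emitted as active
    active = []
    for kind, path, from_delta in actions:
        if kind == "remove":
            metrics.tombstones_seen += 1
            tombstones.add(path)
        elif kind == "add":
            metrics.add_files_seen += 1
            if from_delta:
                metrics.add_files_seen_from_delta += 1
            if path in seen_adds:
                # A newer add for the same path was already emitted.
                metrics.duplicate_add_files += 1
            elif path not in tombstones:
                seen_adds.add(path)
                active.append(path)
                metrics.active_add_files += 1
    return active, metrics


# Newest first: "a" is added twice (e.g. after a stats recompute),
# "b" is removed by a newer commit than the checkpoint that added it.
example_actions = [
    ("add", "a", True),
    ("remove", "b", True),
    ("add", "a", True),
    ("add", "b", False),
    ("add", "c", False),
]
active_files, replay_metrics = replay(example_actions)
```

Counting inside the loop mirrors the description that the metrics are collected while the iterator of scan files is consumed, so no second pass over the log is needed.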
commit b3c0a1db59c5d518a1862ebb946de2ec567e9c4c Author: Venki Korukanti Date: Tue Jun 4 09:42:06 2024 -0700 [Kernel][Test] Add test for reading a shallow cloned table (#3208) Shallow cloned tests have `AddFile`s with an absolute path to the source table location. Normally the path is relative to the table root. Add a test to verify that reading works in Kernel. commit 74bf5db6b6ad10ca20c94ee5a215c25befa0c0e8 Author: Jacek Laskowski Date: Tue Jun 4 17:11:27 2024 +0200 [SPARK][MINOR] Code cleanup (reformatting and typo hunting) (#3191) ## Description Many of these changes are to make the code more Scala-idiomatic (and less "Pythonic") to ease code comprehension ❤️ A few typos got squashed along the way, too 😎 ## How was this patch tested? Local builds commit ad094e2ae0dd41caaae84099a809336c94c52321 Author: Venki Korukanti Date: Mon Jun 3 14:25:31 2024 -0700 [Kernel] Throw `KernelException` when `VOID` type is encountered (#3196) ## Description Minor fix to throw `KernelException` when `VOID` type is encountered as it is an unsupported data type in Kernel. ## How was this patch tested? Test commit be9718dfc1eb0b9cedd428182eb8f3d604fd7663 Author: Hao Jiang Date: Mon Jun 3 10:02:14 2024 -0700 [Spark] Fix race condition in Uniform conversion (#3189) ## Description This PR fixes a race condition in UniForm Iceberg Converter. Before our change, UniForm Iceberg Converter executes as follows: 1. Read `lastConvertedDeltaVersion` from Iceberg latest snapshot 2. Convert the delta commits starting from `lastConvertedDeltaVersion` to iceberg snapshots 3. Commit the iceberg snapshots. When there are multiple iceberg conversion threads, a race condition may occur, causing one delta commit to be written into multiple Iceberg snapshots, and data corruption. As an example, considering we have a UniForm table with latest delta version and iceberg version both 1. Two threads A and B start writing to delta tables. 1. 
Thread A writes Delta version 2, reads `lastConvertedDeltaVersion` = 1, and converts delta version 2. 2. Thread B writes Delta version 3, reads `lastConvertedDeltaVersion` = 1, and converts delta version 2, 3. 3. Thread A commits Iceberg version 2, including converted delta version 2. 4. Thread B commits Iceberg version 3, including converted delta version 2 and 3. When both threads commit to Iceberg, we will have delta version 2 included in iceberg history twice as different snapshots. If version 2 is an AddFile, that means we insert the same data twice into iceberg. Our fix works as follows: 1. Read `lastConvertedDeltaVersion` and **a new field** `lastConvertedIcebergSnapshotId` from Iceberg latest snapshot 2. Convert the delta commits starting from `lastConvertedDeltaVersion` to iceberg snapshots 3. Before the Iceberg commit, check that the base snapshot ID of this transaction equals `lastConvertedIcebergSnapshotId` (**this check is the core of this change**) 4. Commit the iceberg snapshots. This change makes sure we are only committing against a specific Iceberg snapshot, and will abort if the snapshot we want to commit against is not the latest one. As an example, our fix will successfully block the example above. 1. Thread A writes Delta version 2, reads `lastConvertedDeltaVersion` = 1, `lastConvertedIcebergSnapshotId` = S0 and converts delta version 2. 2. Thread B writes Delta version 3, reads `lastConvertedDeltaVersion` = 1, `lastConvertedIcebergSnapshotId` = S0 and converts delta version 2, 3. 3. Thread A creates an Iceberg transaction with parent snapshot S0. Because `lastConvertedIcebergSnapshotId` is also S0, it commits and updates the latest Iceberg snapshot to S1. 4. Thread B creates an Iceberg transaction, with parent snapshot S1. Because `lastConvertedIcebergSnapshotId` is S0 != S1, it aborts the conversion.
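
The snapshot-ID check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the UniForm converter or the Iceberg API: `ToyIcebergTable`, `ConflictError`, `read_state`, and `commit` are hypothetical names, and the sketch runs single-threaded, so it assumes the check and the commit happen atomically (a real system has to guarantee that).

```python
class ConflictError(Exception):
    """Raised when the base snapshot is no longer the latest one."""


class ToyIcebergTable:
    """Hypothetical stand-in for an Iceberg table; not the Iceberg API."""

    def __init__(self):
        self.snapshot_id = "S0"
        self.last_converted_delta_version = 1
        self.converted_versions = []  # Delta versions written into snapshots
        self._counter = 0

    def read_state(self):
        # Step 1: read lastConvertedDeltaVersion and the latest snapshot ID.
        return self.last_converted_delta_version, self.snapshot_id

    def commit(self, last_converted_snapshot_id, delta_versions):
        # Step 3: abort unless we commit against the snapshot we read earlier.
        if self.snapshot_id != last_converted_snapshot_id:
            raise ConflictError(
                f"expected {last_converted_snapshot_id}, found {self.snapshot_id}"
            )
        # Step 4: commit and advance the latest snapshot.
        self._counter += 1
        self.snapshot_id = f"S{self._counter}"
        self.converted_versions.extend(delta_versions)
        self.last_converted_delta_version = max(delta_versions)


table = ToyIcebergTable()
_, base_a = table.read_state()  # thread A reads S0
_, base_b = table.read_state()  # thread B also reads S0

table.commit(base_a, [2])       # A commits; latest snapshot becomes S1
try:
    table.commit(base_b, [2, 3])  # B still expects S0, so it must abort
    b_aborted = False
except ConflictError:
    b_aborted = True
```

Without the comparison in `commit`, thread B's call would succeed and Delta version 2 would appear in two snapshots, which is the duplication the commit message describes.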
commit c0b3c971fe5fd5e236924a5c2085162ba1a137aa Author: Shawn Chang <42792772+CTTY@users.noreply.github.com> Date: Mon Jun 3 09:41:23 2024 -0700 [Uniform][Minor] Fix typo (#3140) Signed-off-by: Shawn Chang Co-authored-by: Shawn Chang commit a70c8e20ccd52abbb5464f4be9d6b3e5baf32e31 Author: Avril Aysha <68642378+avriiil@users.noreply.github.com> Date: Mon Jun 3 17:37:47 2024 +0100 [DOCS] Fix typo suggestions on Delta APIs page (#3134) Minor typo fix suggestions commit 40896bc22cc0fb4b737a9962d238aff2952bb9ce Author: Venki Korukanti Date: Mon Jun 3 08:43:38 2024 -0700 [Kernel] Handle long values in `FieldMetadata` parsing (#3186) ## Description Currently when parsing the `FieldMetadata` in `StructField` (as part of the schema parsing), we always assume the integral values are of `int` type, but it could be of value `long`. ## How was this patch tested? Add couple of cases to existing test commit eccb75e83dcb33de237eaabefb9e58f9b2a8295d Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Mon Jun 3 17:42:00 2024 +0200 [Spark] Add Delta Connect Server Library (#3136) ## Description This PR adds a skeleton for Delta Connect server library, and we add support for `Scan` to Delta's planner plugin. ## How was this patch tested? Added some basic tests for [SparkConnectPlanner](https://github.com/apache/spark/commit/6637bbe2b25ff2877b41a9677ce6d75e6996f968) using the `DeltaRelationPlugin` and `DeltaCommandPlugin` plugins. commit ed87fbf075265950de273b6738e7fbe2d6b85882 Author: Sumeet Varma Date: Mon Jun 3 08:41:06 2024 -0700 [Kernel] Remove usages of FileNames.deltaFile from Delta-Kernel and Delta-Standalone (#3180) ## Description `FileNames.deltaFile` is deprecated and will be removed in future versions of Delta-Spark. ## How was this patch tested? 
Existing UTs commit 59f8c64c2a2bc49877e8e19913c9249f9f337028 Author: Jiaheng Tang Date: Fri May 31 14:57:06 2024 -0700 [Spark] Fix replacing clustered table with non-clustered table (#3175) ## Description Fix replacing clustered table with non-clustered table, by creating a domain metadata with empty clustering columns. ## How was this patch tested? New UTs. commit 56ce2126615e9e046d2e0b60a3e4011234d87866 Author: Jiaheng Tang Date: Fri May 31 08:27:29 2024 -0700 [Spark] Support in-place migration from unpartitioned table to clustered table (#3174) ## Description Support in-place migration from unpartitioned table to clustered table. If the table is an unpartitioned table and users run `ALTER TABLE CLUSTER BY` on it, it will now create a clustered table with ClusteringMetadataDomain. Resolves #2460 ## How was this patch tested? New UTs. commit 8cfb11962819edf417a5afb009556797ba217641 Author: Tom van Bussel Date: Fri May 31 17:26:22 2024 +0200 [Spark] Include violating value in DELTA_EXCEED_CHAR_VARCHAR_LIMIT (#3167) ## Description This PR modifies the error message for the `DELTA_EXCEED_CHAR_VARCHAR_LIMIT` error class to include the value that violated the constraint. ## How was this patch tested? Modified a test in `DeltaErrorsSuite` and added a test to `DeltaConstraintsSuite`. ## Does this PR introduce _any_ user-facing changes? Yes, the error message for the `DELTA_EXCEED_CHAR_VARCHAR_LIMIT` error class is modified to include the value that violated the constraint. commit 085f11718b2a4de2dc72d1cb09139039ba22dab2 Author: zzl-7 <143959416+zzl-7@users.noreply.github.com> Date: Wed May 29 22:38:28 2024 -0700 [Kernel] Change comparator expression to lazy evaluation (#2853) ## Description Resolves https://github.com/delta-io/delta/issues/2541 ## How was this patch tested? 
Existing tests commit 39e91af88c838902764743069f2d113ad39a50d3 Author: Qianru Lao <55441375+EstherBear@users.noreply.github.com> Date: Wed May 29 21:33:39 2024 -0700 [Kernel] Remove the Reference to Engine in LogReplay (#3165) Remove the reference to `Engine` in `LogReplay` and get it as an argument to methods on `LogReplay` Resolves #2641 commit 6421bc52ce7d67e514fc565672cb225fbd5a1216 Author: Dhruv Arya Date: Wed May 29 17:26:39 2024 -0700 [Spark] Make listDeltaCompactedDeltaCheckpointFilesAndLatestChecksumFile reusable (#3157) ## Description Factors out the main logic from `listDeltaCompactedDeltaCheckpointFilesAndLatestChecksumFile` into a new function `listFromFileSystemInternal` which can potentially be reused by other callers. Also added a new test utility which runs tests with in-memory-tracking commit owner enabled. ## How was this patch tested? Existing tests should cover this. commit 6deecf994946b2a17c8a55864ad62ba2d0e0f931 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Wed May 29 23:43:47 2024 +0200 [Spark] Fix license headers for some Spark Connect proto files (#3170) ## Description - Adjusted the formatting of the license headers of some Spark Connect proto files. - Added also the Delta Lake license since even though we copied the Spark Connect proto files, we modified the `option java_package` of each proto files in [Spark Connect](https://github.com/apache/spark/tree/master/connector/connect/common/src/main/protobuf/spark/connect), similar to how it was done [here](https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/util/DateFormatter.scala) by this [PR](https://github.com/delta-io/delta/commit/e26435bcd787b232c1cf73eb118202971f1e18f1#diff-727ff4ed853adab6032cd9d06895dee225f15dec45fbcb0541676e77f459ef2e). 
commit 46c18dfe78fc0d2c491cfaba564bdaea85d7dd3b Author: Abhishek Radhakrishnan Date: Wed May 29 11:27:13 2024 -0700 [Kernel] Few updates to usage guide and javadocs (#3160) ## Description Changes: - Fix a few Javadocs in the `Table` interface. - Update `USER_GUIDE.md` to fix the optional vector usage. commit 31f09f05c2e78482b8ed69b5d31c1f222b5770c5 Author: Johan Lasperas Date: Wed May 29 17:40:43 2024 +0200 [Docs][4.0] Type Widening documentation (#3162) ## Description Update the type widening documentation to list additional type changes supported in Delta 4.0 commit 96411496a0cf9a60416e5ab83ae745ace3aa382a Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Wed May 29 16:27:06 2024 +0200 [Spark] Adding Spark Connect Protobuf Messages to the Delta repository (#3168) ## Description Delta Connect is an extension for Spark Connect that adds support for the DeltaTable API. To be able to communicate back to Spark Connect we need to have access to Spark Connect's Protobufs. Since Buf does not support local dependencies, while Delta and Apache Spark are two separate repositories, the only solution for Delta Connect to get access to [Spark Connect's Protobuf messages](https://github.com/apache/spark/tree/master/connector/connect/common/src/main/protobuf/spark/connect) is to keep copies of the messages in the Delta repository. These copies may go out of sync, and we would need to update them manually from the Apache Spark repository, but this should be fine and would not break anything since Protobufs are backward compatible. ## How was this patch tested? N/A. commit 5829da49d0c3bb66ac6f26b89e7b9795f8bfdec0 Author: Christos Stavrakakis Date: Wed May 29 16:19:45 2024 +0200 [Spark] Set active txn during write and ctas commands (#3163) ## Description Make WRITE commands call `OptimisticTransaction.withActive` to set as active the transaction that is created by `startTxnForTableCreation()`. ## How was this patch tested? Existing tests.
commit eb638bba6973a98a13bc543d6d2bde0efd90b300 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Wed May 29 16:18:58 2024 +0200 [Spark] Set up Python Protobuf codegen for Delta Connect (#3125) ## Description Added the very first protobuf messages for `DeltaTable`, `Clone` and `Scan`. This is the first PR for Delta Connect, which adds support for Delta's `DeltaTable` interface to Spark Connect. This is needed to support Delta table operations outside of SQL directly on Spark Connect clients. This PR sets up the [Python code generation for the Protobufs of Delta Connect](https://protobuf.dev/getting-started/pythontutorial/). For this I created a new Buf workspace and I added a few initial Protobuf messages to confirm that everything works. This is the ground work of the project, before we move on to setting up the server and client library. What we are doing here is similar to the [Spark Connect's protobuf development guide](https://github.com/apache/spark/tree/master/connector/connect). ## How was this patch tested? Added the `check-delta-connect-codegen-python.py` to the automated testing, making sure the Python Protobuf Generated codes stay in sync with the proto messages. commit 85428ee08cca58f601d9fc45e86b7a28705fee9b Author: Paddy Xu Date: Mon May 27 19:29:25 2024 +0200 Fix concating paths with relative filenames that contain colon ':' (#3153) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR fixes an issue where the `safeConcatPaths` method throws an exception when the `relativeChildPath` contains a colon `:`. Such a character is not allowed in Hadoop paths due to ambiguity (`aa:bb.csv` can be interpreted as an absolute path like `aa://bb.csv` where `aa` is the scheme), but is allowed in many file systems such as S3. Thus we need to handle this case. 
The fix here is to prepend a `/` so that Hadoop will know that everything after `/` belongs to the path, not the scheme. ## How was this patch tested? New tests. ## Does this PR introduce _any_ user-facing changes? Nope. commit 6c0137b18eaf0ae93e8d2760381b879ed58794eb Author: Ole Sasse Date: Mon May 27 17:34:36 2024 +0200 Protocol RFC for collations (#3068) ## Description Protocol RFC for adding collation support to Delta [Design doc](https://docs.google.com/document/d/1cwztlKt7b2hWF6Uu1S895ko6jPfRlP9x-V5POUcXtXk/edit?usp=sharing) commit 3cd9529b6a9fcc1fd6d72e2574760b1c622e12bb Author: James DeLoye Date: Fri May 24 11:53:35 2024 -0700 [Spark] Add CREATE TABLE LIKE compatibility with user-provided table properties (#3138) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description User provided properties when performing CREATE LIKE commands were being ignored and only the properties from source table were being added. This PR adds/overwrites any applicable properties with the user provided ones. ## How was this patch tested? Unit tests were created replicating the customer issue for CREATE LIKE commands both originating in Delta tables and other formats. ## Does this PR introduce _any_ user-facing changes? No commit ff5b36fbcc3bb894b9a885eaa05338460c8173d6 Author: Johan Lasperas Date: Fri May 24 18:34:11 2024 +0200 [Spark] Allow type widening for all supported type changes with Spark 4.0 (#3024) This PR adds shims to ungate the remaining type changes that only work with Spark 4.0 / master. Spark 4.0 contains the required changes to Parquet readers to be able to read the data after applying the type changes. 
## Description Extend the list of supported type changes for type widening to include changes that can be supported with Spark 4.0: - (byte, short, int) -> long - float -> double - date -> timestampNTZ - (byte, short, int) -> double - decimal -> decimal (with increased precision/scale that doesn't cause precision loss) - (byte, short, int, long) -> decimal Shims are added to support these changes when compiling against Spark 4.0/master and to only allow `byte` -> `short` - > `int` when compiling against Spark 3.5. ## How was this patch tested? Adding test cases for the new type changes in the existing type widening test suites. The list of supported / unsupported changes covered in tests differs between Spark 3.5 and Spark 4.0, shims are also provided to handle this. ## Does this PR introduce _any_ user-facing changes? Yes: allow using the listed type changes with type widening, either via `ALTER TABLE CHANGE COLUMN TYPE` or during schema evolution in MERGE and INSERT. commit bfb5c94aa818495f35ed007bd4566cd2f7fecf42 Author: Tom van Bussel Date: Fri May 24 18:33:33 2024 +0200 [Spark] Validate the expression in AlterTableAddConstraintDeltaCommand (#3143) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR fixes an internal error thrown from `AlterTableAddConstraintDeltaCommand`. This error is thrown when adding a CHECK constraint with a non-existent column in the expression. The error is thrown when we check if the expressions returns a boolean. This works correctly for most expressions, but will result in an exception if the data type of the unresolved column is checked. This PR fixes this issue by making the analyzer responsible for checking whether the expression returns a boolean by wrapping the expression with a `Filter` node. ## How was this patch tested? Added a test ## Does this PR introduce _any_ user-facing changes? Yes, `ALTER TABLE ... 
ADD CONSTRAINT ... CHECK` will now throw a `UNRESOLVED_COLUMN` error instead of a `INTERNAL_ERROR` in the case described above. commit 039a29abb4abc72ac5912651679233dc983398d6 Author: Johan Lasperas Date: Fri May 24 00:54:44 2024 +0200 [Spark] Test type widening compatibility with other Delta features (#3053) ## Description Additional tests covering type widening and: - Reading CDF - Column mapping - Time travel - RESTORE - CLONE ## How was this patch tested? Test only commit 1609c38782f21c06fe3b24b32d0f4c97c3c2c755 Author: Johan Lasperas Date: Thu May 23 18:06:08 2024 +0200 [Spark] Update OptimizeGeneratedSuite to apply constant folding (#3141) ## Description The following change in Spark master broke tests in `OptimizeGeneratedColumnSuite`: https://github.com/apache/spark/commit/7974811218c9fb52ac9d07f8983475a885ada81b It added an execution of the `ConstantFolding` rule after `PrepareDeltaScan`, causing constant expressions in filters on generated columns to be simplified, which `OptimizeGeneratedColumnSuite` heavily used. This change: - updates the expected results in `OptimizeGeneratedColumnSuite` to simplify constant expressions - adds a pass of `ConstantFolding` after `PrepareDeltaScan` so that Delta on spark 3.5 behaves the same as Delta on spark master. ## How was this patch tested? Updated tests commit b043f5d7c2655c27866d4c33e2255e076f7598a2 Author: Dhruv Arya Date: Thu May 23 08:57:04 2024 -0700 [Spark] Make ManagedCommit a preview feature (#3137) ## Description Renames the ManagedCommit feature and config names by replacing -dev with -preview to indicate that it is in preview phase. ## How was this patch tested? No new tests. commit 0deef042b18689fd4b73f4b252700dd2f1ab94f8 Author: Krishnan Paranji Ravi Date: Thu May 23 11:45:39 2024 -0400 [Kernel][Expressions] Add support for LIKE expression (#3103) ## Description Add SQL `LIKE` expression support in Kernel list of supported expressions and a default implementation. 
Addresses part of https://github.com/delta-io/delta/issues/2539 (where `STARTS_WITH` as `LIKE 'str%'`) ## How was this patch tested? added unit tests Signed-off-by: Krishnan Paranji Ravi commit 35c7536a70c2d0ba57e140704d3e213e4e75a516 Author: Tai Le Manh <49281946+tlm365@users.noreply.github.com> Date: Thu May 23 04:58:33 2024 +0700 [INFRA] Improve the java style checks log the errors to sbt console (#3115) ## Description Resolves #3067. ## How was this patch tested? On a local machine, intentionally create checkstyle errors in module `kernelDefaults` (as an experiment), then run `build/sbt compile` and `build/sbt kernelDefaults/test`. Signed-off-by: Tai Le Manh commit 420d9e059db18845a49b85cb1571752667d39dc6 Author: sergiupoco-db Date: Wed May 22 09:43:30 2024 -0700 [Standalone] Introduce FileAction.tagsOrEmpty (#3132) ## Description This PR introduces `FileAction.tagsOrEmpty` to factor out the common pattern `Option(tags).getOrElse(Map.empty)`. ## How was this patch tested? Existing unit tests. ## Does this PR introduce _any_ user-facing changes? No Signed-off-by: Sergiu Pocol commit a5263cc02b450b16d80b65f358dbf5ff092355bf Author: sergiupoco-db Date: Wed May 22 09:42:20 2024 -0700 [Standalone] AddFile Long Tags Accessor + Memory Optimization (#3131) ## Description This PR introduces `AddFile.longTag`, which factors out the pattern `tag(...).map(_.toLong)`, and also converts the insertion time tag lazy val to a method in order to save memory. ## How was this patch tested? Existing unit tests. ## Does this PR introduce _any_ user-facing changes?
No Signed-off-by: Sergiu Pocol commit d5e9a26195742728ea4693a6abca493c5e6e2241 Author: Dhruv Arya Date: Wed May 22 09:40:32 2024 -0700 [Spark] DynamoDBCommitOwner: add logging, get dynamic confs from sparkSession (#3130) ## Description Updates DynamoDBCommitOwner: - Added logging around table creation flow - Get wcu, rcu, and awsCredentialsProvider from SparkSession - Return -1 as the table version if registerTable has already been called but no actual commits have gone through the owner. This is done by tracking an extra flag in DynamoDB. ## How was this patch tested? Existing tests ## Does this PR introduce _any_ user-facing changes? Yes, introduces new configs (see DeltaSQLConf changes) which can be used to configure the DynamoDBCommitOwner. commit 0c35eea4100d83040a11417f66016b48c246c466 Author: sabir-akhadov <52208605+sabir-akhadov@users.noreply.github.com> Date: Wed May 22 15:07:11 2024 +0200 [Spark] Column Mapping DROP FEATURE (#3124) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Allow column mapping feature to be dropped. ``` ALTER TABLE DROP FEATURE columnMapping ``` Feature is hidden behind a flag. ## How was this patch tested? new unit tests ## Does this PR introduce _any_ user-facing changes? No commit 529717bb6f171ff5252e3a913dd3667d53a2095c Author: Sumeet Varma Date: Tue May 21 22:46:55 2024 -0700 [Spark] Metadata Cleanup for Unbackfilled Delta Files (#3094) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Unbackfilled deltas eligible for deletion: - Version <= max(backfilled-delta-deleted-versions) ## How was this patch tested? Unit tests ## Does this PR introduce _any_ user-facing changes? 
No commit f2d6c8b4e1ccdd1fcdffc44f87536f1f56408d31 Author: Tom van Bussel Date: Wed May 22 07:46:24 2024 +0200 [Spark] Apply filters pushed down into DeltaCDFRelation (#3127) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR modifies `DeltaCDFRelation` to apply the filters that are pushed down into this. This enables both partition pruning and row group skipping to happen when reading the Change Data Feed. ## How was this patch tested? Unit tests ## Does this PR introduce _any_ user-facing changes? No commit 0ee9fd0996e2d34630cd094123f54562211570af Author: Anish <100322362+anishshri-db@users.noreply.github.com> Date: Tue May 21 18:10:21 2024 -0700 [Spark] Skip reading log entries beyond endOffset, if specified while getting file changes for CDC in streaming queries (#3110) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Skip reading log entries beyond endOffset, if specified while getting file changes for CDC in streaming queries ## How was this patch tested? Existing unit tests Also verified using logs to ensure that additional Delta logs are not read ``` 24/05/16 01:21:01 INFO StateStore: StateStore stopped Run completed in 54 seconds, 237 milliseconds. Total number of tests run: 1 Suites: completed 1, aborted 0 Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0 All tests passed. 
``` Before: ``` 10457:24/05/16 01:38:37 INFO DeltaSource: [queryId = 199ce] [batchId = 0] Getting CDC dataFrame for delta_log_path=file:/tmp/spark-c3309e79-80cf-4819-8b03-2fc607cc2679/_delta_log with startVersion=0, startIndex=-100, isInitialSnapshot=false, endOffset={"sourceVersion":1,"reservoirId":"75194ee3-4ff6-431e-8e7c-32006684d5ad","reservoirVersion":1,"index":-1,"isStartingVersion":false} took timeMs=52 ms 11114:24/05/16 01:38:39 INFO DeltaSource: [queryId = 199ce] [batchId = 1] Getting CDC dataFrame for delta_log_path=file:/tmp/spark-c3309e79-80cf-4819-8b03-2fc607cc2679/_delta_log with startVersion=1, startIndex=-100, isInitialSnapshot=false, endOffset={"sourceVersion":1,"reservoirId":"75194ee3-4ff6-431e-8e7c-32006684d5ad","reservoirVersion":2,"index":-1,"isStartingVersion":false} took timeMs=25 ms 11518:24/05/16 01:38:39 INFO DeltaSource: [queryId = 199ce] [batchId = 2] Getting CDC dataFrame for delta_log_path=file:/tmp/spark-c3309e79-80cf-4819-8b03-2fc607cc2679/_delta_log with startVersion=2, startIndex=-100, isInitialSnapshot=false, endOffset={"sourceVersion":1,"reservoirId":"75194ee3-4ff6-431e-8e7c-32006684d5ad","reservoirVersion":3,"index":-1,"isStartingVersion":false} took timeMs=24 ms ``` After: ``` 10498:24/05/16 01:32:10 INFO DeltaSource: [queryId = ede3f] [batchId = 0] Getting CDC dataFrame for delta_log_path=file:/tmp/spark-270c3d6e-40df-4e6f-b1da-c293af5d6741/_delta_log with startVersion=0, startIndex=-100, isInitialSnapshot=false, endOffset={"sourceVersion":1,"reservoirId":"516bafe0-e0ea-4380-afcb-44e416302a07","reservoirVersion":1,"index":-1,"isStartingVersion":false} took timeMs=39 ms 11155:24/05/16 01:32:11 INFO DeltaSource: [queryId = ede3f] [batchId = 1] Getting CDC dataFrame for delta_log_path=file:/tmp/spark-270c3d6e-40df-4e6f-b1da-c293af5d6741/_delta_log with startVersion=1, startIndex=-100, isInitialSnapshot=false, 
endOffset={"sourceVersion":1,"reservoirId":"516bafe0-e0ea-4380-afcb-44e416302a07","reservoirVersion":2,"index":-1,"isStartingVersion":false} took timeMs=14 ms 11579:24/05/16 01:32:12 INFO DeltaSource: [queryId = ede3f] [batchId = 2] Getting CDC dataFrame for delta_log_path=file:/tmp/spark-270c3d6e-40df-4e6f-b1da-c293af5d6741/_delta_log with startVersion=2, startIndex=-100, isInitialSnapshot=false, endOffset={"sourceVersion":1,"reservoirId":"516bafe0-e0ea-4380-afcb-44e416302a07","reservoirVersion":3,"index":-1,"isStartingVersion":false} took timeMs=13 ms ``` The difference is even larger when reading through a large number of backlog versions. In a Cx setup, batches took > 300s before the change; after the change, batches complete in < 15s. ## Does this PR introduce _any_ user-facing changes? No commit 699df388f977f936a0b2ecc5462a5e811dafb09b Author: Allison Portis Date: Tue May 21 13:25:21 2024 -0700 [DOCS] Add page on BigQuery Delta Lake integration (#3123) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (fill in here) ## Description Adds a page on BigQuery Delta Lake integration to the "integrations" page. The page looks like this: [Screenshot 2024-05-20 at 6 15 42 PM] and it is indexed as part of the integrations page here: [Screenshot 2024-05-20 at 6 15 57 PM] ## How was this patch tested? Local build. ## Does this PR introduce _any_ user-facing changes? No. --------- Co-authored-by: Tathagata Das commit 9a0a2826f46418d1a97b289f6c9a756bc42621d3 Author: sabir-akhadov <52208605+sabir-akhadov@users.noreply.github.com> Date: Tue May 21 22:16:25 2024 +0200 [Spark] Enable column mapping removal feature flag (#3114) #### Which Delta project/connector is this regarding? 
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Enable the column mapping removal feature flag to allow users to run ``` ALTER TABLE table_name SET TBLPROPERTIES ('delta.columnMapping.mode' = 'none') ``` or ``` ALTER TABLE table_name UNSET TBLPROPERTIES ('delta.columnMapping.mode') ``` to remove the column mapping property from their table and rewrite physical files to match the logical column names. Also allows the column mapping feature to be dropped with ``` ALTER TABLE table_name DROP FEATURE columnMapping ``` ## How was this patch tested? Existing tests ## Does this PR introduce _any_ user-facing changes? Yes. Allows users to run ``` ALTER TABLE table_name SET TBLPROPERTIES ('delta.columnMapping.mode' = 'none') ``` or ``` ALTER TABLE table_name UNSET TBLPROPERTIES ('delta.columnMapping.mode') ``` to remove column mapping from a Delta table. Previously, these commands would not run successfully and would return an exception stating that such an operation is prohibited. Also allows the column mapping feature to be dropped with ``` ALTER TABLE table_name DROP FEATURE columnMapping ``` commit d22a4da23970e68def84226b745a7c2310f532b5 Author: Yan Zhao Date: Wed May 22 02:56:18 2024 +0800 [Kernel] Return `tags` in `Scan.getScanFiles()` output (#3119) Currently, the scan file result does not contain a `tags` field. In our case, we define some custom properties in the `tags` field, so we want to filter the `AddFile` entries according to the custom tags. commit 063c71d99ece90081caf20d10cdb5ccd63f3c27c Author: Dhruv Arya Date: Tue May 21 10:51:03 2024 -0700 [Spark] Integrate ICT into Managed Commits (#3108) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Resolves TODOs around ICT <-> MC integration. ## How was this patch tested? Updated existing tests. ## Does this PR introduce _any_ user-facing changes? 
No commit 611ad13c88833332d78e624ec2ecec15b77f36b3 Author: Ole Sasse Date: Tue May 21 17:59:02 2024 +0200 [Spark] Support predicates for stats that are not at the top level (#3117) ## Description This refactoring adds support for nested statistics columns. So far, all statistics are keys in the stats struct in AddFiles. This PR adds support for statistics that are part of nested structs. This is a prerequisite for file skipping on collated string columns ([Protocol RFC](https://github.com/delta-io/delta/pull/3068)). Statistics for collated string columns will be wrapped in a struct keyed by the versioned collation that was used to generate them. For example: ``` "stats": { "statsWithCollation": { "icu.en_US.72": { "minValues": { ...} } } } ``` This PR replaces statType in StatsColumn with pathToStatType, which can be used to represent a path. This way we can re-use all of the existing data skipping code without changes. ## How was this patch tested? It is not possible to test this change without altering [statsSchema](https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/stats/StatisticsCollection.scala#L285). I would still like to ship this PR separately because the change is big enough in itself. There is existing test coverage for stats parsing and file skipping, but none of them uses nested statistics yet. ## Does this PR introduce _any_ user-facing changes? No commit 25a42df8b68a944490a38dcd838acff0c438d517 Author: richardc-db <87336575+richardc-db@users.noreply.github.com> Date: Mon May 20 10:18:20 2024 -0700 [SPARK] Add more testing for variant + delta features (#3102) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description adds testing for auto compaction and deletion vectors ## How was this patch tested? test only change ## Does this PR introduce _any_ user-facing changes? 
commit 7b4ee63ac02f9a11664689c664116f11c29bed86 Author: Dhruv Arya Date: Mon May 20 09:40:08 2024 -0700 [Spark] Managed Commits: add a DynamoDB-based commit owner (#3107) ## Description Taking inspiration from https://github.com/delta-io/delta/pull/339, this PR adds a Commit Owner Client which uses DynamoDB as the backend. Each Delta table managed by a DynamoDB instance will have one corresponding entry in a DynamoDB table. The table schema is as follows: * tableId: String --- The unique identifier for the entry. This is a UUID. * path: String --- The fully qualified path of the table in the file system. e.g. s3://bucket/path. * acceptingCommits: Boolean --- Whether the commit owner is accepting new commits. This will only be set to false when the table is converted from managed commits to file system commits. * tableVersion: Number --- The version of the latest commit. * tableTimestamp: Number --- The inCommitTimestamp of the latest commit. * schemaVersion: Number --- The version of the schema used to store the data. * commits: --- The list of unbackfilled commits. - version: Number --- The version of the commit. - inCommitTimestamp: Number --- The inCommitTimestamp of the commit. - fsName: String --- The name of the unbackfilled file. - fsLength: Number --- The length of the unbackfilled file. - fsTimestamp: Number --- The modification time of the unbackfilled file. For a table to be managed by DynamoDB, `registerTable` must be called for that Delta table. This will create a new entry in the db for this Delta table. Every `commit` invocation appends the UUID delta file status to the `commits` list in the table entry. `commit` is performed through a conditional write in DynamoDB. ## How was this patch tested? Added a new suite called `DynamoDBCommitOwnerClient5BackfillSuite` which uses a mock DynamoDB client, plus manual testing against a DynamoDB instance. 
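
The conditional-write commit described in the DynamoDB commit owner PR above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class names `TableEntrySketch` and `ConditionalCommitSketch` are hypothetical, the entry is reduced to three of the fields listed in the schema, and the condition used here (the entry is accepting commits and the new version is exactly `tableVersion + 1`) is an assumption, not the condition the PR actually uses. No DynamoDB API is involved; a compare-and-replace on a concurrent map stands in for the conditional write.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical, simplified model of one table entry (immutable snapshot).
final class TableEntrySketch {
    final long tableVersion;        // version of the latest commit
    final boolean acceptingCommits; // whether new commits are accepted
    final List<Long> commits;       // versions of unbackfilled commits (simplified)

    TableEntrySketch(long tableVersion, boolean acceptingCommits, List<Long> commits) {
        this.tableVersion = tableVersion;
        this.acceptingCommits = acceptingCommits;
        this.commits = Collections.unmodifiableList(new ArrayList<>(commits));
    }
}

// In-memory stand-in for the backing store; not a DynamoDB client.
final class ConditionalCommitSketch {
    private final ConcurrentMap<String, TableEntrySketch> store = new ConcurrentHashMap<>();

    // Creates the entry for a table if it does not exist yet.
    void registerTable(String tableId, long currentVersion) {
        store.putIfAbsent(tableId,
            new TableEntrySketch(currentVersion, true, Collections.<Long>emptyList()));
    }

    // Returns true if this writer won the commit, false if the condition failed.
    boolean commit(String tableId, long version) {
        TableEntrySketch seen = store.get(tableId);
        // Assumed condition: entry exists, accepts commits, and version is the next one.
        if (seen == null || !seen.acceptingCommits || version != seen.tableVersion + 1) {
            return false;
        }
        List<Long> commits = new ArrayList<>(seen.commits);
        commits.add(version);
        TableEntrySketch updated = new TableEntrySketch(version, true, commits);
        // Succeeds only if the stored entry is still the snapshot we read.
        return store.replace(tableId, seen, updated);
    }

    public static void main(String[] args) {
        ConditionalCommitSketch owner = new ConditionalCommitSketch();
        owner.registerTable("table-1", 0L);
        System.out.println(owner.commit("table-1", 1L)); // first writer for version 1
        System.out.println(owner.commit("table-1", 1L)); // second writer for version 1
    }
}
```

The point of the compare-and-replace is that two writers racing for the same version cannot both succeed; the loser observes a failed condition and has to re-read the entry before retrying.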
commit 57df2c0c5d1e25e70aac1d5ce5c6ac7dba54d0f9 Author: Prakhar Jain Date: Sun May 19 21:04:10 2024 -0700 [Spark] Handle case when Checkpoints.findLastCompleteCheckpoint is passed MAX_VALUE (#3105) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Fixes an issue where `Checkpoints.findLastCompleteCheckpoint` goes into an almost infinite loop if it is passed a Checkpoint.MAX_VALUE. ## How was this patch tested? UT ## Does this PR introduce _any_ user-facing changes? No commit 3af433517bf5a42b1774cb63a8bd1d262e7d933d Author: Dhruv Arya Date: Fri May 17 16:15:34 2024 -0700 [Spark] Pass sparkSession to commitOwnerBuilder (#3112) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Updates CommitOwnerBuilder.build so that it can take in a sparkSession object. This allows it to read CommitOwner-related dynamic confs from the sparkSession while building it. ## Does this PR introduce _any_ user-facing changes? No commit eca5a7f439f8039207705ee03b66916c6a987b79 Author: Dhruv Arya Date: Fri May 17 10:51:49 2024 -0700 [Spark] InCommitTimestamp: Use clock.currentTimeMillis() instead of nanoTime() in commitLarge (#3111) ## Description We currently use NANOSECONDS.toMillis(System.nanoTime()) for generating the ICT when `commitLarge` is called. However, this usage of System.nanoTime() is not correct as it should only be used for measuring time difference, not to get an approximate wall clock time. This leads to scenarios where the ICT becomes very small (e.g. 1 Jan 1970) sometimes because some systems return a very small number when System.nanoTime() is called. This PR changes this so that clock.getCurrentTimeMillis() is used instead. ## How was this patch tested? Added a test case to ensure that `clock.getCurrentTimeMillis()` is being used. 
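
The `nanoTime()` issue described in the InCommitTimestamp PR above can be illustrated with plain JDK calls. This is a minimal sketch, not the Delta code: `IctClockSketch` is a hypothetical name, and `System.currentTimeMillis()` stands in for the `clock` object mentioned in the PR. `System.nanoTime()` has an arbitrary origin, so converting it to milliseconds does not yield a time since the Unix epoch; it is only meaningful as a difference between two readings.

```java
import java.util.concurrent.TimeUnit;

final class IctClockSketch {
    // Wall-clock time in milliseconds since the Unix epoch: usable as a timestamp.
    static long wallClockMillis() {
        return System.currentTimeMillis();
    }

    // Elapsed milliseconds since an earlier System.nanoTime() reading:
    // the only kind of value nanoTime() should be used to produce.
    static long elapsedMillis(long startNanos) {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    }

    public static void main(String[] args) {
        // Misuse: the origin of nanoTime() is arbitrary, so this is not an epoch timestamp.
        long misused = TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
        System.out.println("nanoTime as millis (not a timestamp): " + misused);
        System.out.println("wall clock millis: " + wallClockMillis());

        long start = System.nanoTime();
        System.out.println("elapsed millis: " + elapsedMillis(start));
    }
}
```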
commit e15132b021f3d4ec5b2f359aaf682fa29fadbb46 Author: Jiaheng Tang Date: Fri May 17 07:55:48 2024 -0700 [Spark] Fall back to zordering when clustering on a single column (#3109) ## Description Fall back to Z-ordering when clustering on a single column, because Hilbert clustering doesn't support a single column. Resolves #3087 ## How was this patch tested? New unit test. commit 8a8e757eba08abe83a4ddbb328442a9c6125ce03 Author: richardc-db <87336575+richardc-db@users.noreply.github.com> Date: Thu May 16 11:24:14 2024 -0700 [SPARK] Promote Variant Type Table Feature to Preview (#3091) ## Description Promotes the variant type table feature to preview by removing the `-dev` suffix and appending the `-preview` suffix. We will support `-preview` forever. ## How was this patch tested? Existing tests commit c024269aff0be26c515cfdc6cce34e6df2fda59d Author: Venki Korukanti Date: Thu May 16 11:11:07 2024 -0700 [Kernel] Handle `KernelEngineException` when reading the `_last_checkpoint` file (#3086) There is an issue with the `CloseableIterator` interface that Kernel is using. Currently, it extends Java's `Iterator`, which doesn't declare any checked exceptions. We use `CloseableIterator` when returning data read from a file or any incremental data access. Any `IOException` in `hasNext` or `next` is wrapped in an `UncheckedIOException` or `RuntimeException`. Users of the `CloseableIterator` need to catch `UncheckedIOException` or `RuntimeException` explicitly and look at the cause if they are interested in the `IOException`. This is not consistent and causes problems for code that wants to handle exceptions like `FileNotFoundException` (subclass of `IOException`) and take further actions. * Change the `CloseableIterator.{next, hasNext}` contract to expect `KernelEngineException` for any exceptions that occur while executing in the `Engine`. * Update the `DefaultParquetHandler` and `DefaultJsonHandler` to throw `KernelEngineException` instead of `UncheckedIOException` or `RuntimeException`. 
* In the checkpoint metadata loading method, catch `KernelEngineException` and see if the cause is `FileNotFoundException`. If yes, don't retry loading. commit 5bec678293b7014979cd76741f680914eef60a6f Author: Venki Korukanti Date: Thu May 16 09:26:41 2024 -0700 [Build] Use `--release` instead of `--source` and `--target` (#3104) ## Description Currently, we are using `--target` and `--source` to set the target JVM version, but this is known to cause issues when the jars are built using JDK17 to generate bytecode that is runnable using JVM 1.8 and above. See [here](https://www.morling.dev/blog/bytebuffer-and-the-dreaded-nosuchmethoderror/) for an example. Instead, use `--release` to generate bytecode that is usable on JVM 1.8 and above. Quote from [blog](https://www.morling.dev/blog/bytebuffer-and-the-dreaded-nosuchmethoderror/): ``` In contrast to the more widely known pair of --source and --target, the --release switch will ensure that only byte code is produced that is actually usable with the specified Java version. ``` ## How was this patch tested? * Set JDK home to 17 * `./build/sbt clean publishM2` * Edit `kernel/examples/run-kernel-examples.py` to comment out the line `clear_artifact_cache()`. * Set JAVA_HOME to 1.8 * Run Kernel integration tests: `./kernel/examples/run-kernel-examples.py --version 3.2.1-SNAPSHOT` commit 0138129f299b373fda587fc16f14c7478f2e1015 Author: Venki Korukanti Date: Thu May 16 09:00:20 2024 -0700 [Release] Set next development version 3.3.0-SNAPSHOT (#3101) commit 95f924bbe93040dbadf9e60105cc1d0bebf24ab9 Author: Lukas Rupprecht Date: Wed May 15 17:54:11 2024 -0700 [Spark] Improve GetCommits validation in CommitOwnerClient tests (#3096) ## Description This PR makes a small improvement to the recently introduced CommitOwnerClientImplSuiteBase (a base trait that can be extended by CommitOwnerClient implementation tests). The change improves how the result of a getCommits call is validated. 
Instead of passing in a list of versions to validateGetCommitsResult, it now accepts the entire GetCommitsResponse. This allows to do more filtering on the result and also compare the latestTableVersion that is returned as part of the response (which is now implemented in the InMemoryCommitOwnerSuite). ## How was this patch tested? Ensure that InMemoryCommitOwnerSuite still passes ## Does this PR introduce _any_ user-facing changes? No commit 69fd7e4ecbff136ec77a6c37fc76fada6dadfc41 Author: Anish <100322362+anishshri-db@users.noreply.github.com> Date: Wed May 15 11:35:16 2024 -0700 [Spark] Add time tracking for file changes and getting dataframe for Delta source with/without CDC (#3090) #### Which Delta project/connector is this regarding? Spark ## Description Add time tracking for file changes and getting dataframe for Delta source with/without CDC ## How was this patch tested? Existing unit tests ## Does this PR introduce _any_ user-facing changes? No commit b9fe0e1d26c5aeb340b9808ad09de46e1dec4a7b Author: Venki Korukanti Date: Tue May 14 16:23:03 2024 -0700 [Kernel][Defaults] Handle legacy map types in Parquet files (#3097) ## Description Currently, Kernel's Parquet reader explicitly looks for the `key_value` repeated group under the Parquet map type, but the older versions of Parquet writers wrote any name for the repeated group. Instead of looking for the explicit `key_value` element, fetch the first element in the list. See [here](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps) for more details. ## How was this patch tested? The [test](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetThriftCompatibilitySuite.scala#L29) and sample file written by legacy writers are taken from Apache Spark™. Some columns (arrays with 2-level encoding, another legacy format) from the test file are currently not supported. I will follow up with a separate PR. 
It involves a bit of refactoring of the ArrayColumnReader. commit b1b84d5784afa5754b1d3666e9b70723c937181f Author: Scott Sandre Date: Tue May 14 09:18:09 2024 -0700 [Spark] [Test only] Fix Spark Master test in DeltaSourceSuite (#3088) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Fix and re-enable a failing delta-spark test against Spark Master in DeltaSourceSuite ## How was this patch tested? Test only change. commit 9c88f8bf4420f57c5ed7e99f65bb7dbdab4eb287 Author: Tathagata Das Date: Tue May 14 01:01:56 2024 -0400 Update README.md commit 13fdbe87ae6b4115dfc96a3051c75b8cef09fbd3 Author: Venki Korukanti Date: Mon May 13 22:00:17 2024 -0700 [Kernel] Update README.md (#3085) commit 4ee7f4d669558b3d19785152298c899f70d7b055 Author: Andreas Chatzistergiou <93710326+andreaschat-db@users.noreply.github.com> Date: Mon May 13 18:29:07 2024 +0200 [Spark] Protocol version downgrade in the presence of table features (#2841) ## Description This PR adds support for protocol version downgrade when table features exist in the protocol. The downgraded protocol versions should be the minimum required to support all available table features. For example, `Protocol(3, 7, DeletionVectors, RowTracking)` can be downgraded to `Protocol(1, 7, RowTracking)` after removing the DV feature. ## How was this patch tested? Added new UTs in DeltaProtocolVersionSuite. Furthermore, existing UTs cover a significant part of the functionality. These are the following: - Downgrade protocol version on table created with (3, 7). - Downgrade protocol version on table created with (1, 7). - Protocol version downgrade on a table with table features and added legacy feature. - Protocol version is not downgraded when writer features exist. - Protocol version is not downgraded when reader+writer features exist. - Protocol version is not downgraded when multiple reader+writer features exist. 
## Does this PR introduce _any_ user-facing changes? Yes. Dropping a table feature from a table with multiple features may now result in a protocol version downgrade. For example, `Protocol(3, 7, DeletionVectors, RowTracking)` can now be downgraded to `Protocol(1, 7, RowTracking)`. commit f6ebe24a559bd435ec241475b51b96602d26a6c0 Author: Venki Korukanti Date: Fri May 10 16:00:23 2024 -0700 [Kernel][Writes] Write `timestamp` as `INT64` type to Parquet data files (#3084) ## Description Write the `timestamp` type in the `INT64` physical format in Parquet. Currently, it is written as `INT96`, which is a very old way of writing timestamps and was deprecated a long time ago. Also, collect statistics for `timestamp` type columns. ## How was this patch tested? Updated the existing tests. commit a5d7c6936e21453e1541dbaf9bfbff90d62e7ced Author: Venki Korukanti Date: Fri May 10 14:13:55 2024 -0700 [Kernel][Defaults] Support reading parquet files with legacy 3-level repeated types (#3083) ## Description When legacy mode is enabled in Spark, array physical types are stored slightly differently from the standard format. Standard mode (default): ``` optional group readerFeatures (LIST) { repeated group list { optional binary element (STRING); } } ``` When legacy write mode is enabled (`spark.sql.parquet.writeLegacyFormat = true`): ``` optional group readerFeatures (LIST) { repeated group bag { optional binary array (STRING); } } ``` TODO: We need to handle the 2-level lists. Will post a separate PR. The challenge is with generating or finding the Parquet files with 2-level lists. ## How was this patch tested? 
Added tests Fixes #3082 commit c2f23d75b187e750e20d729299a0910c5410cde5 Author: Scott Sandre Date: Fri May 10 12:31:58 2024 -0700 [BUILD] Set the version to 3.2.1-SNAPSHOT (#3076) Set the version to 3.2.1-SNAPSHOT commit da93223fddf2f3e724077793f28cac2166e9427b Author: richardc-db <87336575+richardc-db@users.noreply.github.com> Date: Fri May 10 10:14:13 2024 -0700 [SPARK][VARIANT] Add more variant test cases (#3013) ## Description adds more tests to ensure that Delta spark can use various delta features with variant ## How was this patch tested? test only change ## Does this PR introduce _any_ user-facing changes? no commit e6738fdf9c19ba58cf4a59cb11737df8d58f4a90 Author: Dhruv Arya Date: Fri May 10 09:39:44 2024 -0700 [Spark] Fix inCommitTimestamp deserialization for very small timestamps (#3077) ## Description Follow up for https://github.com/delta-io/delta/commit/2ef3a2ae887e4d79c1469c60c54b3ec993b97db0 . There is another site where we don't specify the right way to serialize inCommitTimestamp which will cause failures when deserializing very small timestamps. commit 5cf570cb8a9665a3afda67510e65e30107299d4f Author: Lukas Rupprecht Date: Thu May 9 15:31:33 2024 -0700 [Spark] Refactors InMemoryCommitOwnerClient suite into a generic suite that can be implemented by other CommitOwnerClient implementations (#3075) ## Description This PR refactors the existing InMemoryCommitOwnerClientSuite into a generic suite, CommitOwnerClientImplSuiteBase. This base suite can be extended by CommitOwnerClient implementations to run the basic set of tests provided by the base suite. This is to make it easier for CommitOwnerClient implementations to get test coverage in the future. It also adds an implementation of the suite for the InMemoryCommitOwnerClient with two different backfilling batch sizes (1 and 5). ## How was this patch tested? Ensured that the newly introduced suites pass. ## Does this PR introduce _any_ user-facing changes? 
No commit f348774df3d4208eda2596fc5d8a1ba3b664a38c Author: Venki Korukanti Date: Thu May 9 14:14:10 2024 -0700 [Kernel][Writes] Update the `USER_GUIDE` (#3074) Add the usage of create table, blind append, idempotent writes and checkpointing usage guides. Rendered at: https://github.com/vkorukanti/delta/tree/writeUsageGuide commit 50d4f024a766576aaa5390bdd5cfd8c38a8dcb80 Author: Sumeet Varma Date: Thu May 9 09:54:17 2024 -0700 [Spark] Extend various Delta Suites to Managed Commits (#2900) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Extend various Delta suites to use Managed Commits. Bug fix: During listing delta-files, Filter initial list to exclude files with versions beyond initialListingMaxDeltaVersionSeen to prevent duplicating non-delta files with higher versions in the combined list ## How was this patch tested? UTs ## Does this PR introduce _any_ user-facing changes? No commit cb25d7f3292f722e179fce6b5315fae740da18b5 Author: Wenchen Fan Date: Fri May 10 00:08:15 2024 +0800 [Spark] Fix time option evaluation (#2999) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description We made a fix to evaluate current datetime functions for Delta options. However, the fix is not completed as it doesn't handle data types other than `TimestampType`. This PR fixes it. ## How was this patch tested? A new test ## Does this PR introduce _any_ user-facing changes? Yes, before the fix, Delta throws `[INTERNAL_ERROR] Cannot evaluate expression` commit 0abe4645962cdde16a706c7a477d64954db6fe36 Author: Dhruv Arya Date: Thu May 9 08:22:57 2024 -0700 [Spark] Add drop support for managed commits (#3010) ## Description Adds table feature phaseout support for managed commits. Fixes the ManagedCommit table feature --> correctly marks it as a WriterFeature instead of a ReaderWriterFeature. 
## How was this patch tested? New tests in DeltaProtocolVersionSuite. ## Does this PR introduce _any_ user-facing changes? The `DROP FEATURE` command now works for managed commits. commit 027d6e71a2b461fdf4171b3c1549eb444464d0dc Author: Jun <85203301+junlee-db@users.noreply.github.com> Date: Thu May 9 08:22:16 2024 -0700 [Spark] Extend DeltaLogSuite to have ManagedCommit coverage (#3047) ## Description Extends DeltaLogSuite to test ManagedCommit. DeltaLogSuite already has good coverage for Delta, so this suite is used to test ManagedCommit-enabled tables as well. ## How was this patch tested? This PR itself is a test. commit 87d92defe36d0fd2e7287204bd743f8f69776af0 Author: Lin Zhou <87341375+linzhou-db@users.noreply.github.com> Date: Wed May 8 19:22:09 2024 -0700 Use lazy val for DeltaSharingFileIndex.queryCustomTablePath (#3073) ## Description Fix for an edge case: Use lazy val for DeltaSharingFileIndex.queryCustomTablePath, to allow the table path to be resolved at execution time, to match the table path used in `CachedTableManager.INSTANCE.register` (https://github.com/delta-io/delta-sharing/blob/main/client/src/main/scala/org/apache/spark/delta/sharing/PreSignedUrlCache.scala#L181), and to allow the query to find the pre-signed URLs for the Delta Sharing table. ## How was this patch tested? Unit Tests ## Does this PR introduce _any_ user-facing changes? 
No commit 4a57ca64e9a74cd06e3250b7d820f602275cd245 Author: Venki Korukanti Date: Wed May 8 19:21:28 2024 -0700 [Kernel][Writes] Example connector programs using the Kernel Write APIs (#3070) ## Description Sample connector programs using the Kernel write APIs to * create unpartitioned table * create partitioned table * create unpartitioned table and insert data into it (CTAS) * create partitioned table and insert data into it (CTAS) * insert into an existing unpartitioned table * idempotent inserts * insert with optional checkpoint Also run these examples as part of the integration tests for release verification ## How was this patch tested? Manually ran ``` $ kernel/examples/run-kernel-examples.py --use-local $ kernel/examples/run-kernel-examples.py --version 3.2.0 --maven-repo https://oss.sonatype.org/content/repositories/iodelta-1138 ``` commit 8a56bdc5aba88302e6d3da8d5e9a1a932e14d7d6 Author: Jun <85203301+junlee-db@users.noreply.github.com> Date: Wed May 8 17:44:45 2024 -0700 [Spark] Extend Fuzz test For Managed Commit (#3049) This PR extends the fuzz test to cover managed commit features. Specifically, it adds a new event phase inside the commit operation, so that we can capture the backfill as a separate operation. As a result, multiple commits can go through before a backfill, and managed commit is expected to handle these situations and return the correct output. ## How was this patch tested? Existing fuzz tests should naturally use the extended backfill phases. commit 40b8f970f8990200d17eca46a5d8f26f0bb1c88e Author: Venki Korukanti Date: Wed May 8 12:38:58 2024 -0700 [Kernel] Fix the infinite loop in `CloseableIterator.map` (#3071) ## Description `CloseableIterator.map` creates a new `CloseableIterator` which has a wrong `forEachRemaining` impl (it calls itself). We should remove the impl and fall back on the default impl. 
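
The shape of the `forEachRemaining` bug described above can be reproduced with a plain `java.util.Iterator`. This is a hypothetical reconstruction, not the Kernel source: `MapIteratorSketch`, `buggyMap` and `fixedMap` are illustrative names, and `Iterator` stands in for `CloseableIterator`. The buggy wrapper overrides `forEachRemaining` and calls the same method on itself, so it never makes progress; the fixed wrapper simply omits the override and inherits the default implementation, which loops over `hasNext()` and `next()`.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Consumer;
import java.util.function.Function;

final class MapIteratorSketch {
    // Buggy shape: forEachRemaining delegates to itself instead of to hasNext()/next().
    static <T, R> Iterator<R> buggyMap(Iterator<T> in, Function<T, R> f) {
        return new Iterator<R>() {
            @Override public boolean hasNext() { return in.hasNext(); }
            @Override public R next() { return f.apply(in.next()); }
            @Override public void forEachRemaining(Consumer<? super R> action) {
                this.forEachRemaining(action); // unbounded self-recursion
            }
        };
    }

    // Fixed shape: no override, so Iterator's default forEachRemaining is used.
    static <T, R> Iterator<R> fixedMap(Iterator<T> in, Function<T, R> f) {
        return new Iterator<R>() {
            @Override public boolean hasNext() { return in.hasNext(); }
            @Override public R next() { return f.apply(in.next()); }
        };
    }

    public static void main(String[] args) {
        Iterator<Integer> doubled =
            fixedMap(Arrays.asList(1, 2, 3).iterator(), (Integer x) -> x * 2);
        doubled.forEachRemaining(System.out::println);
    }
}
```

In plain Java the self-call surfaces as a `StackOverflowError` rather than a silent hang, which is one reason to prefer inheriting the default method over re-implementing it in a wrapper.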
This is a day 0 bug, not needed for 3.2 (given 3.2 is so close to release) commit 3c6f18ec6e6e26cd6f2fc8608c8f8dc6f237ae76 Author: sabir-akhadov <52208605+sabir-akhadov@users.noreply.github.com> Date: Wed May 8 17:12:58 2024 +0200 [Spark] Column mapping removal: support streaming tables (#3061) ## Description Add support for streaming reads on tables that undergo a column mapping downgrade. Most streams should fail when a downgrade is happening on the table. ## How was this patch tested? New unit tests commit 0ca72a8511e8adb816dab4e727ec56a7e4bfe5d3 Author: Sumeet Varma Date: Tue May 7 22:29:20 2024 -0700 [Spark] Ensure commit directory is created for old and new tables (#3000) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description - Remove `createLogDirectory` and replace all usages with `ensureLogDirectoryExists` since the latter is now optimized for creation rather than for an existence check. Rename it `createLogDirectoriesIfNotExists` - Ensure the commit directory is created on existing tables from older releases when managed commits is enabled. ## How was this patch tested? UTs ## Does this PR introduce _any_ user-facing changes? No commit 52b6be2dc1d4de8b898646f9f2ffd9e527ca9b05 Author: Ami Oka Date: Tue May 7 14:50:16 2024 -0700 fix hanging behavior for constructDataFilters with expression like EqualTo(Literal, Literal) (#3059) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This fixes the hanging behavior for `constructDataFilters` with expressions like `EqualTo(Literal, Literal)`. This is due to infinite recursion caused by cases like `case EqualTo(v: Literal, a) => constructDataFilters(EqualTo(a, v))`. 
This also fixes the same behavior for `EqualTo, Not(EqualTo), EqualNullSafe, LessThan, LessThanOrEqual, GreaterThan, GreaterThanOrEqual` ## How was this patch tested? new UT `DataSkippingDeltaConstructDataFiltersSuite` ## Does this PR introduce _any_ user-facing changes? No commit 5efee74bab3be6761e93914e84a5570e0d5e0915 Author: Lukas Rupprecht Date: Tue May 7 14:08:24 2024 -0700 [Spark] Makes fields in CommitOwner-related classes private and introduces getters (#3033) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This is the second PR in a series of PRs to move the current CommitOwner interface to its own module. PR1 is [here](https://github.com/delta-io/delta/pull/3002). This PR makes the fields of all CommitOwner-related classes private and forces callers to use getters to access the fields. This is needed in preparation for making the CommitOwner module a Java module (the same as the existing LogStore module) and to follow Java best practices to keep fields private and only allow access through getters. ## How was this patch tested? Simple refactor so existing tests are sufficient. ## Does this PR introduce _any_ user-facing changes? No commit 64fe4dee80be2271ae0528fefd8ed3b5ca1561c7 Author: Venki Korukanti Date: Tue May 7 13:28:01 2024 -0700 [Kernel][Examples] Refactor/clean up the Kernel examples (#3064) ## Description * Rename the `table-reader` directory to `kernel-examples` as we are going to have write examples in the same project * Rename `IntergrationTestSuite.java` to `ReadIntegrationTestSuite.java` as it is just for reads and be specific about it. * Misc. clean up and docs for running the integration tests ## How was this patch tested? 
Manually ran ``` $ kernel/examples/run-kernel-examples.py --use-local $ kernel/examples/run-kernel-examples.py --version 3.2.0 --maven-repo https://oss.sonatype.org/content/repositories/iodelta-1138 ``` commit bd3d56a7c8e66f99951a11fa0bc72e027a973258 Author: Krishnan Paranji Ravi Date: Tue May 7 16:20:28 2024 -0400 [Kernel] [API-Docs] API doc update for ScanBuilder.withFilter (#3027) ## Description Fixes https://github.com/delta-io/delta/issues/2703 ## How was this patch tested? No. API docs update only. ## Does this PR introduce _any_ user-facing changes? No --------- Signed-off-by: Krishnan Paranji Ravi Co-authored-by: Venki Korukanti commit d0bd28ee179af5a7e73eb1b576d6ba54789b83c5 Author: Prakhar Jain Date: Tue May 7 11:19:16 2024 -0700 [Spark] Fix O(n^2) issue in find last complete checkpoint before (#3060) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR fixes an O(n^2) issue in the `find last complete checkpoint before` method. Today, `findLastCompleteCheckpointBefore` tries to find the last checkpoint before a given version. In order to do this, it does the following: findLastCompleteCheckpointBefore(10000): 1. List from 9000 2. List from 8000 3. List from 7000 ... Each of these listings currently lists to the end, because they completely ignore delta files and rely on a `takeWhile` with a version clause: ``` listFrom(..) .filter { file => isCheckpointFile(file) && file.getLen != 0 } .map{ file => CheckpointInstance(file.getPath) } .takeWhile(tv => (cur == 0 || tv.version <= cur) && tv < upperBoundCv) ``` This PR fixes this issue by terminating each listing early once we have crossed a delta file for untilVersion. In addition to this, we also optimize how much to list in each iteration. E.g. after this PR, findLastCompleteCheckpointBefore(10000) will need: 1. Iteration-1 lists from 9000 to 10000. 2. Iteration-2 lists from 8000 to 9000. 3. 
Iteration-3 lists from 7000 to 8000. 4. and so on... ## How was this patch tested? UT commit 3baeb999963009a007e6c03752714a2ec7386ed6 Author: Allison Portis Date: Tue May 7 10:48:56 2024 -0700 [INFRA] Move checkstyle config files to one location (#3056) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (INFRA) ## Description Currently the Java checkstyle config files are shared by some projects and stored in different locations nested under those projects. This PR moves the checkstyle config files to a common location. This will make it easier to compare our configs in the future and keep track of the different settings for the different projects. ## How was this patch tested? Locally ran `checkstyle` for all the affected projects. Also modified one file and checked that the checks failed. commit 8827d7583356ee1fe4ebb2b6a3f52d1d3e19e6fb Author: Venki Korukanti Date: Tue May 7 09:57:29 2024 -0700 [Kernel][Usage Guide] Migration docs for Kernel 3.2 (#3062) Capture the breaking API changes to enable connectors to migrate to the latest version of Kernel commit 58844aba636db2ea8106a8a1cc36ec95974d970e Author: Venki Korukanti Date: Mon May 6 15:55:06 2024 -0700 [Kernel][Docs] Rename `TableClient` to `Engine` (#3058) ## Description Rename the references of `TableClient` to `Engine`. Follow up to #3015. ## How was this patch tested? Just docs change commit 0ca2607a87dd849c96c77be171b0b6913d23cb84 Author: Venki Korukanti Date: Mon May 6 14:50:39 2024 -0700 [Kernel][Writes] Allow transaction retries for blind append (#3055) ## Description Currently, Kernel throws an exception when there is a conflict (i.e., there already exists a committed file at a given version). We should retry the transaction as the current support is just for blind appends. The retry checks that there are no logical conflicts (`metadata`, `protocol` or `txn` (Set Transaction)) that affect the blind append. ## How was this patch tested? 
Tests for protocol, metadata and setTxn conflicts. Also tests to verify blind appends are retried and committed. commit 129ea9556ec2d16569ef312d08ecf5487bbb2335 Author: Allison Portis Date: Mon May 6 10:20:08 2024 -0700 [Docs] Update the Snowflake integration page (#2814) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (fill in here) ## Description Updates the Snowflake integration page to mark the manifest file method as outdated and include the Snowflake Delta Lake connector as main content. Current page: https://docs.delta.io/latest/snowflake-integration.html Updated page: Screenshot 2024-03-27 at 5 51 29 PM ## How was this patch tested? Built locally. commit 96f87bf701628e8c09a7afec0108d54f139cc9e9 Author: Venki Korukanti Date: Mon May 6 10:04:25 2024 -0700 [Kernel][Writes] Support idempotent writes (#3051) ## Description (Split from #2944) Adds an API on `TransactionBuilder` to take the transaction identifier for idempotent writes ``` /* * Set the transaction identifier for idempotent writes. Incremental processing systems (e.g., * streaming systems) that track progress using their own application-specific versions need to * record what progress has been made, in order to avoid duplicating data in the face of * failures and retries during writes. By setting the transaction identifier, the Delta table * can ensure that the data with same identifier is not written multiple times. For more * information refer to the Delta protocol section * Transaction Identifiers. * * @param engine {@link Engine} instance to use. * @param applicationId The application ID that is writing to the table. * @param transactionVersion The version of the transaction. This should be monotonically * increasing with each write for the same application ID. * @return updated {@link TransactionBuilder} instance. 
*/ TransactionBuilder withTransactionId( Engine engine, String applicationId, long transactionVersion); ``` During the transaction build, check the latest txn version of the given AppId. If it is not monotonically increasing, throw `ConcurrentTransactionException`. ## How was this patch tested? Added to `DeltaTableWriteSuite.scala` commit 91cd61a867a0d3b243832f8ea486f5b8cba2e002 Author: Johan Lasperas Date: Mon May 6 18:40:40 2024 +0200 [Spark] Fix logging failure in deltaAssert helper (#3052) ## Description Fix helper method `deltaAssert` introduced in https://github.com/delta-io/delta/pull/2709 to not log failures when the assertion holds. ## How was this patch tested? The helper intentionally behaves differently in tests and in production (failing vs. logging), so there's no easy/meaningful way to test the prod behavior from tests. commit 4bcd4b9f011b8fb1c96259683cf115bf4b143271 Author: Johan Lasperas Date: Mon May 6 18:29:24 2024 +0200 [Doc] Type Widening documentation (#3038) Cherry-pick of https://github.com/delta-io/delta/pull/3036 to `master` ## Description Add a documentation page for the type widening table feature, in preview in Delta 3.2. ## How was this patch tested? N/A commit 2ef3a2ae887e4d79c1469c60c54b3ec993b97db0 Author: Dhruv Arya Date: Mon May 6 09:06:19 2024 -0700 [Spark] Fix CommitInfo.inCommitTimestamp deserialization for very small timestamps (#3046) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Currently, if we deserialize a CommitInfo with a very small inCommitTimestamp and then try to access this inCommitTimestamp, this exception is thrown: ``` java.lang.ClassCastException: class java.lang.Integer cannot be cast to class java.lang.Long ``` This PR fixes the CommitInfo so that the inCommitTimestamp field is deserialized correctly. ## How was this patch tested? Added a new test case that was failing before the fix. 
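The `ClassCastException` in the `inCommitTimestamp` commit above is the kind of failure that appears when a generic deserializer boxes a small whole number as `Integer` and the reader then casts it to `Long`. The following is a minimal, self-contained sketch of that pitfall and one common way around it. It does not reproduce the actual Delta fix, which is not shown in this log; the class name `InCommitTimestampSketch`, its methods, and the map-based stand-in for a deserialized `CommitInfo` are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not Delta's CommitInfo implementation.
final class InCommitTimestampSketch {

    // Stand-in for a deserialized CommitInfo. Generic JSON parsers commonly
    // box small whole numbers as Integer and larger ones as Long.
    static Map<String, Object> parsedCommitInfo(Object timestamp) {
        Map<String, Object> fields = new HashMap<>();
        fields.put("inCommitTimestamp", timestamp);
        return fields;
    }

    // Unsafe read: assumes the boxed value is always a Long.
    static long readWithCast(Map<String, Object> fields) {
        return (Long) fields.get("inCommitTimestamp");
    }

    // Safe read: widens any boxed numeric type to a primitive long.
    static long readWithWidening(Map<String, Object> fields) {
        return ((Number) fields.get("inCommitTimestamp")).longValue();
    }

    public static void main(String[] args) {
        Map<String, Object> small = parsedCommitInfo(Integer.valueOf(42));
        try {
            readWithCast(small);
        } catch (ClassCastException e) {
            System.out.println("cast failed: " + e.getClass().getSimpleName());
        }
        System.out.println("widened value: " + readWithWidening(small));
    }
}
```

Going through `Number.longValue()` makes the read independent of which boxed type the parser happened to choose, which is why it is a frequent remedy for this class of bug.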
## Does this PR introduce _any_ user-facing changes? No commit 47f4fc04a337db44ec825b46c80fa4158d30253a Author: Christos Stavrakakis Date: Mon May 6 18:00:14 2024 +0200 [Spark] Make txn readPredicates thread safe (#3022) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description `OptimisticTransaction.readPredicates` may be updated by multiple threads that call `filesForScan`. This commit turns it from an `ArrayBuffer` to a `ConcurrentLinkedQueue` to be thread safe. ## How was this patch tested? Existing tests. ## Does this PR introduce _any_ user-facing changes? No commit d78ae295b9f2024a902f78226007db72fc0f5616 Author: Johan Lasperas Date: Mon May 6 17:59:25 2024 +0200 [Spark] Fix GeneratedColumnSuite tests with type changes (#3037) ## Description Spark now enables ANSI mode by default, breaking a couple of tests in `GeneratedColumnSuite` that contained silent overflows. These tests are updated to remove the overflows. ## How was this patch tested? Running tests with ANSI mode enabled commit 7f199febb84d2c62218fdffbc3a7fe1e48086638 Author: Venki Korukanti Date: Sun May 5 16:56:04 2024 -0700 [Kernel][Writes] Add support of inserting data into tables (#3030) ## Description (Split from #2944) Adds support for inserting data into the table. ## How was this patch tested? Tests for inserting into partitioned and unpartitioned tables with various combinations of the types, partition values etc. Also tests that the checkpoint is ready to create. commit 7c6020133f0bfc6bd514b6ce26eafc7263e3981d Author: Venki Korukanti Date: Fri May 3 19:03:06 2024 -0700 [Kernel][Writes] APIs and impl. for creating new tables (#3016) ## Description (Split from #2944) APIs and implementation for creating partitioned or unpartitioned tables. No data insertion yet. Will come in the next PR. ## How was this patch tested? 
Test suite commit 589ae4f33844afeffdd347445130248947ce017e Author: Kaiqi Jin Date: Fri May 3 17:14:16 2024 -0700 [Spark]Block metadata conversion for Iceberg tables with row-level deletes (#3039) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description The Iceberg V2 spec introduces row-level deletes, which are not yet supported for metadata conversion. We add a blocker here that throws an exception if a customer tries to convert an Iceberg table with row-level deletes. Also, MOR does not necessarily need to be enabled to use row-level deletes, so we remove the check for MOR configurations here and only check whether the deleteFiles is nonEmpty in `planFiles()`. ## How was this patch tested? Existing tests. ## Does this PR introduce _any_ user-facing changes? No commit c426f030cd48ac6fc09bd6e7bf9fcff9252ca211 Author: Allison Portis Date: Fri May 3 15:57:30 2024 -0700 [Kernel] Refactor all user-facing exceptions to be "KernelExceptions" (#3014) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description In order to provide a unified and user-friendly experience we are making all user-facing exceptions instances of one parent class `KernelException`. This PR also attempts to unify/improve some of these error messages. ## How was this patch tested? Existing tests should suffice (updated to match the changes here). ## Does this PR introduce _any_ user-facing changes? Yes, changes exception types + messages. commit 79ffd69791adc4737c8ff3b9b94d709037a27023 Author: Kaiqi Jin Date: Fri May 3 15:19:59 2024 -0700 [Spark]fix the exception name for REORG TABLE ... APPLY (UPGRADE UNIFORM ...) (#3040) #### Which Delta project/connector is this regarding? 
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Fix the exception name for REORG TABLE table_name APPLY (UPGRADE UNIFORM (ICEBERG_COMPAT_VERSION = version) ## How was this patch tested? Current unit tests ## Does this PR introduce _any_ user-facing changes? No commit e7cafecdb10a7698eef7aadcf1a2876c9e067fbb Author: Allison Portis Date: Fri May 3 11:27:25 2024 -0700 [Kernel] Remove unused `ExpressionHandler.isSupported(...)` for now (#3018) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description Removes a currently unused API in `ExpressionHandler`. See https://github.com/delta-io/delta/issues/3017 for details. **TLDR** we added this API before achieving consensus on what we want to implement and it is currently unused. Remove it for now and we can add it back later if needed. ## How was this patch tested? N/A ## Does this PR introduce _any_ user-facing changes? No commit b472d4ea22cd19361bdc3d07cda81fd03e52dc08 Author: Jiaheng Tang Date: Fri May 3 10:37:46 2024 -0700 [Example] Update clustering example for 3.2 (#2991) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Example) ## Description Update clustering example for delta 3.2 ## How was this patch tested? Manually ran the example ``` python3 run-integration-tests.py --use-local ``` ## Does this PR introduce _any_ user-facing changes? No commit ec91e73818b66f9cced4d50d9f6e7b402831888d Author: Sumeet Varma Date: Fri May 3 10:37:10 2024 -0700 [Spark] Refactor CommitStore getCommits and backfillToVersion API params (#2998) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description - We use optional startVersion and optional endVersion for getCommits now. 
- We use optional lastKnownBackfilledVersion and a required version for backfillToVersion now since the consumers of backfillToVersion must know the version. ## How was this patch tested? Various existing UTs ## Does this PR introduce _any_ user-facing changes? No commit 38ed85a9a32e3a79f688291cf08000614a4d3b28 Author: Johan Lasperas Date: Fri May 3 18:38:09 2024 +0200 [Spark] Fix dropping type widening feature with multipart identifiers (#3035) ## Description Fix an issue found while testing the Delta 3.2 RC2: Dropping the type widening table feature may fail parsing the table identifier: ``` ALTER TABLE default.type_widening_int DROP FEATURE 'typeWidening-preview' TRUNCATE HISTORY; [PARSE_SYNTAX_ERROR] Syntax error at or near '.'.(line 1, pos 21) == SQL == spark_catalog.default.type_widening_int ``` Parsing the table identifier isn't needed as it's not used by the REORG operation that rewrites files when dropping the feature. This change skips parsing the table identifier and directly passes the table name to the REORG command. ## How was this patch tested? Added test covering the issue commit fe88cc3496940a082b0db414f603c0c4712a95cf Author: Allison Portis Date: Thu May 2 18:25:35 2024 -0700 [Kernel][Infra] Fix the java checkstyle for Kernel's Meta file (#3034) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description Merging https://github.com/delta-io/delta/pull/3019 broke the `kernelApi/checkstyle` which is failing CI jobs since its original tests ran before https://github.com/delta-io/delta/commit/8cdf411d2c3e483a483595d117619c4aa6e15faa was committed. ## How was this patch tested? Checked locally that `kernelApi/checkstyle` passes. commit 12cabb79f955105cd640396f599eff88c6c220c7 Author: Allison Portis Date: Thu May 2 17:09:05 2024 -0700 [Build] Fix Java checkstyle plugin to work with SBT upgrade (#3019) #### Which Delta project/connector is this regarding? 
- [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (fill in here) ## Description https://github.com/delta-io/delta/pull/2828 upgrades the SBT version from 1.5.5 to 1.9.9 which causes `projectName/checkstyle` to fail with ``` sbt:delta> kernelApi/checkstyle [error] stack trace is suppressed; run last kernelApi / checkstyle for the full output [error] (kernelApi / checkstyle) org.xml.sax.SAXParseException; lineNumber: 18; columnNumber: 10; DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true. [error] Total time: 0 s, completed May 1, 2024 2:59:48 PM ``` This failure was silent in our CI runs for some reason, if you search the logs before that commit you can see "checkstyle" in them but no instances after. This is a little concerning but don't really have time to figure out why this was silent. For now, upgrades versions to match Spark's current plugins which fixes the issue. See the matching Spark PR here https://github.com/apache/spark/pull/38481. ## How was this patch tested? Ran `kernelApi/checkstyle` locally. TODO: verify it's present in the CI runs after as well ## Does this PR introduce _any_ user-facing changes? No. commit f727b84ba248c118043ef27c16b7df83bc378589 Author: Allison Portis Date: Thu May 2 17:08:08 2024 -0700 [Kernel] Rename `TableClient` to `Engine` (#3015) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description Renames `TableClient` to `Engine` in package names and code. A follow-up PR will be needed to update the `USER_GUIDE` and `README` ## How was this patch tested? Existing tests suffice. ## Does this PR introduce _any_ user-facing changes? Yes, renames public interface. 
commit e7fa94de1df391296de99baf09e1e101495e8033 Author: Sumeet Varma Date: Thu May 2 16:32:52 2024 -0700 [Spark] Extend Delta feature drop safety check to unbackfilled deltas (#3028) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description When dropping a feature, DeltaLog now checks both backfilled and unbackfilled deltas for any traces of the feature before confirming it's safe to drop. However, feature dropping currently does a checkpoint before detecting feature traces in the history, so there are no unbackfilled deltas at that point. ## How was this patch tested? UTs ## Does this PR introduce _any_ user-facing changes? No commit 122288fe88beb9b55dca728e91dfe7b6fc438285 Author: Sumeet Varma Date: Thu May 2 16:27:32 2024 -0700 Add documentation link for Vacuum Protocol Check (#3029) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Add documentation link for Vacuum Protocol Check spec commit 8cdf411d2c3e483a483595d117619c4aa6e15faa Author: Venki Korukanti Date: Thu May 2 14:52:12 2024 -0700 [Kernel] Add a meta file containing the Kernel version. (#3032) This is useful when we need to get the current version of Kernel. Approach is similar to how we do it for other modules in the delta. Manually verified. commit 97bdf6bf485e23815e74f907d1edff7ab4901464 Author: Scott Sandre Date: Thu May 2 14:07:08 2024 -0700 [INFRA] Fix and create new workflow to compile delta-spark examples (#3011) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (INFRA) ## Description Create new workflow `spark_examples_test.yaml` that compiles the `examples/scala/build.sbt` project. Also fixes such compilation so that it uses the local jars (previously it was incorrectly hardcoded to Delta 3.0). 
This requires running `publishM2` beforehand. ## How was this patch tested? CI tests and specifically tested on a commit that uses a new `clusterBy` API in 3.2: https://github.com/delta-io/delta/pull/3012 ## Does this PR introduce _any_ user-facing changes? No. commit eba5d658f461c1807ec5e1245349d66540edfa48 Author: Lukas Rupprecht Date: Thu May 2 09:30:55 2024 -0700 [Spark] Makes CommitOwnerClient independent of Delta Spark dependencies (#3002) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR is the first in a series of PRs that refactors the current CommitOwnerClient interface to become its own module. This is similar to `storage` module that currently exists and contains the LogStore interface and its implementations. In this PR, we remove any Delta Spark dependencies from the CommitOwnerClient in preparation for it to be moved outside of Delta Spark. ## How was this patch tested? Added tests to check equivalence of newly introduced AbstractProtocol/AbstractMetadata with Protocol/Metadata. ## Does this PR introduce _any_ user-facing changes? No commit 4f21f774ed2bfb34dc260a624c6bfb70cbabcaf8 Author: sabir-akhadov <52208605+sabir-akhadov@users.noreply.github.com> Date: Thu May 2 18:30:02 2024 +0200 Streaming Delta Source should not drop NullType columns (#3021) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Prevent createDataFrame from dropping NullType columns with streaming sources. ## How was this patch tested? New unit tests for streaming read/write and test for non-streaming createDataFrame code path verifying the current behavior is preserved. ## Does this PR introduce _any_ user-facing changes? 
No commit fae8f70ce1bc91f12bf3303040558e9299f0d9b6 Author: Prakhar Jain Date: Wed May 1 16:02:02 2024 -0700 [Spark] Disallow dropping of dependent table features (#3009) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR adds validations around dropping of dependent table features. ## How was this patch tested? UTs ## Does this PR introduce _any_ user-facing changes? No commit cae0c6e01a45c3f514ea3e9c8fbced5d282ef151 Author: richardc-db <87336575+richardc-db@users.noreply.github.com> Date: Wed May 1 13:03:35 2024 -0700 [SPARK][VARIANT] Add Variant as a delta source to enable writing with DF API and more tests (#2978) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Adds Variant as a delta source to enable writing with DF API and more tests ## How was this patch tested? Writing variant using `df.write.format("delta").save(...)` works in tests. ## Does this PR introduce _any_ user-facing changes? commit 809b62b1243648d2e8857c2ca2895625177a0ce2 Author: Tom van Bussel Date: Wed May 1 20:11:56 2024 +0200 [Spark] Avoid parsing in translateFilterForColumnMapping (#3001) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR changes how Column Mapping is applied to the filters that are pushed down in the Parquet reader. Before this change, we would parse the identifiers before replacing them. This could cause some queries to fail, as the identifiers in the pushed down filters are not 100% guaranteed to be quoted. With this PR we avoid the parsing and instead match the unparsed identifier. If the identifier was not quoted as expected then we simply ignore the predicate. 
This matches how `ParquetFilters` (used by `ParquetFileFormat`) processes the identifiers in the pushed down predicates. ## How was this patch tested? Existing tests ## Does this PR introduce _any_ user-facing changes? No commit bc4fd2350a090fb02d8bb19ecad340e6e94bea75 Author: Venki Korukanti Date: Wed May 1 10:55:34 2024 -0700 [Kernel][Writes] Add schema validation utils (#3003) ## Description (Split from the larger PR #2944) These are utility to make sure the given schema when creating the table is valid (has no duplicate column names or invalid chars). The code/logic is similar to Delta-Spark/Standalone. ## How was this patch tested? Unittests commit 8a4a91e18bdbc5b2066f74277f9dd2c0dc30a3a4 Author: Venki Korukanti Date: Wed May 1 10:19:22 2024 -0700 [Kernel][Writes] Add `FileSystemClient.mkdirs` API (#3004) ## Description (Split from larger PR #2944) This API allows the creation of an initial delta log directory when the table is created. commit 8eb3bb32552b8d02494c4ed6cf505b39b4a20180 Author: Felipe Pessoto Date: Wed May 1 09:30:57 2024 -0700 [Infra] [Security] Update Scala and packages dependencies (#2828) #### Which Delta project/connector is this regarding? - [X] Spark - [X] Standalone - [X] Flink - [X] Kernel - [ ] Other (fill in here) ## Description We haven't updated some dependencies for a while, exposing us to security risks. This PR updates: - Scala 2.12 to 2.12.18 (the same used by Spark 3.5 branch) - Scala 2.13 to 2.13.13 (the same in Spark master branch). [CVE-2022-36944](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-36944) - Update SBT to 1.9.9. [CVE-2023-46122](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2023-46122) - Update JUnit. Fix #1518 - [CVE-2020-15250](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-15250) - Update plugins: sbt-mima-plugin and sbt-scoverage ## How was this patch tested? CI ## Does this PR introduce _any_ user-facing changes? 
No --------- Signed-off-by: Felipe Pessoto commit 86e145348f5a58969c8405e36bc144f723fe6fb2 Author: Scott Sandre Date: Wed May 1 09:18:01 2024 -0700 [SPARK] Fix failing Spark Master test in DeltaExtensionAndCatalogSuite (#2996) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Fix failing test `DeltaLog should throw exception if spark.sql.catalog.spark_catalog config is not found` in `DeltaExtensionAndCatalogSuite` by replacing `.getOption($key).isEmpty` with `.contains($key)` - In Spark 3.5, `spark.conf.getOption("spark.sql.catalog.spark_catalog")` returned `None` - In Spark Master (4.0), `spark.conf.getOption("spark.sql.catalog.spark_catalog")` returned `Some("undefined")`. I'm not sure why ## How was this patch tested? Existing UTs. ## Does this PR introduce _any_ user-facing changes? No commit 15447198bd59b4ca10a0e829876bbf754285746f Author: Fred Storage Liu Date: Wed May 1 09:13:16 2024 -0700 Add documentation for Uniform Hudi (#2968) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Add documentation for Uniform Hudi commit 4f397565c0943e193a1b526664f9b6e61201aeda Author: Hyukjin Kwon Date: Wed May 1 11:57:47 2024 +0900 [Spark] Move private import to method call to make it compatible with `pyspark-connect` (#2979) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR proposes to move import from `from pyspark.sql.column import _to_seq ` to `from pyspark.sql.classic.column import _to_seq ` `pyspark.sql.column._to_seq` has moved to `pyspark.sql.classic.column._to_seq`, and `pyspark.sql.classic` does not exist in `pyspark-connect` package. ## How was this patch tested? Manually tested. ## Does this PR introduce _any_ user-facing changes? 
Yes, this makes `delta.tables` compatible with pure Python library `pyspark-connect` (in Spark development branch). commit b2d182adedbd1bbf5c3b00d7bc363a89edf9b348 Author: Dhruv Arya Date: Tue Apr 30 19:47:32 2024 -0700 [Spark] Enable ICT and VacuumProtocolCheck automatically with Managed Commits (#2977) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Enabling Managed Commits will automatically: 1. Add ICT and VacuumProtocolCheck features to the protocol. 2. Enable ICT These two features are pre-requisites for managed commit. ## How was this patch tested? New suite ManagedCommitEnablementSuite. Updated existing tests to work with ICT. ## Does this PR introduce _any_ user-facing changes? No commit 3d9574573a864ad613c2aec5aac00613374d7517 Author: Venki Korukanti Date: Tue Apr 30 19:16:34 2024 -0700 [Kernel][Writes] Remove the target file size in `ParquetHandler.writeParquetFiles` API (#2997) Currently Delta protocol doesn't enforce any particular target file size for writes. Remove the `maxFileSize` argument from the `ParquetHandler.writeParquetFiles` API. In future if this is really needed, we can add it back. In order for the existing tests to pass, the `DefaultParquetHandler` takes the target file size as config in the Hadoop configuration. commit 44b76fa06121182aefaaf30f40e87b7466c502c3 Author: Carmen Kwan Date: Wed May 1 01:35:21 2024 +0200 [Spark] Identity Columns APIs in DeltaColumnBuilder (#2857) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR is part of https://github.com/delta-io/delta/issues/1959 * We introduce `generateAlwaysAsIdentity` and `generatedByDefaultAsIdentity`APIs into DeltaColumnBuilder so that users can create Delta table with Identity column. 
* We guard the creation of identity column tables with a feature flag until development is complete. ## How was this patch tested? New tests. ## Does this PR introduce _any_ user-facing changes? Yes, we introduce `generatedAlwaysAsIdentity` and `generatedByDefaultAsIdentity` interfaces to DeltaColumnBuilder for creating identity columns. **Interfaces** ``` def generatedAlwaysAsIdentity(): DeltaColumnBuilder def generatedAlwaysAsIdentity(start: Long, step: Long): DeltaColumnBuilder def generatedByDefaultAsIdentity(): DeltaColumnBuilder def generatedByDefaultAsIdentity(start: Long, step: Long): DeltaColumnBuilder ``` When the `start` and the `step` parameters are not specified, they default to `1L`. `generatedByDefaultAsIdentity` allows users to insert values into the column while a column specified with `generatedAlwaysAsIdentity` can only ever have system generated values. **Example Usage** ``` // Creates a Delta identity column. io.delta.tables.DeltaTable.columnBuilder(spark, "id") .dataType(LongType) .generatedAlwaysAsIdentity() // Which is equivalent to the call io.delta.tables.DeltaTable.columnBuilder(spark, "id") .dataType(LongType) .generatedAlwaysAsIdentity(start = 1L, step = 1L) ``` commit e84a8741dbd93593b6f4cc3938b042f162b26a27 Author: Prakhar Jain Date: Tue Apr 30 14:14:42 2024 -0700 [Spark] Use DeltaLog.getChanges API in DeltaFileProviderUtils's getDeltaFilesInVersionRange API (#2986) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Use DeltaLog.getChanges API in DeltaFileProviderUtils's getDeltaFilesInVersionRange API. ## How was this patch tested? UT ## Does this PR introduce _any_ user-facing changes? No commit 99efb2175c8c1a33840cb4fa46a52febb910f632 Author: Scott Sandre Date: Tue Apr 30 12:58:28 2024 -0700 [DOCS] Update `releases.md` to mention that Delta Spark 3.2 uses Spark 3.5 (#2994) #### Which Delta project/connector is this regarding? 
- [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (DOCS) ## Description Update `releases.md` to mention that Delta Spark 3.2 uses Spark 3.5 ## How was this patch tested? Trivial ## Does this PR introduce _any_ user-facing changes? No commit cb8e0cf23a5e7d00f3fd20a1e8df390aa2487658 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Tue Apr 30 20:31:57 2024 +0200 [Spark] Add nextObserver and the PhaseLockingExecutionObserver for the Concurrency Testing Framework (#2932) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Add the `nextObserver` and the `PhaseLockingExecutionObserver` for the Concurrency Testing Framework, as we will need them to write tests for conflicts between Row Tracking Backfill and other commands when we add Row Tracking Backfill. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? No. commit 866a5c1a5ad5d5df057d710fca006a1e202f8d54 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Tue Apr 30 20:12:56 2024 +0200 [Spark] Add FileMetadataMaterializationTracker for Row Tracking Backfill (#2926) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description ### Background on Row Tracking Backfill In a future PR, we will introduce Row Tracking Backfill, a command in Delta Lake that assigns Row IDs and Row Commit Versions to each row when we enable Row Tracking on an existing table. This is done by committing `addFile` actions to assign the `baseRowId` and the `defaultRowCommitVersion` for every file in the table. Because of the size of the table, doing this in one commit can be very large and unstable and can cause many concurrency conflicts, so we propose doing it in batches (that is, multiple commits, each handling a subset of the `addFile` actions of the table). 
### Why do we need this for Row Tracking Backfill? However, we could still hit stability issues when a single batch is large enough to OOM the driver. This can happen when the individual tasks batched together are huge. Think of tables where each `AddFile` is just 1-2 rows. ### What is the solution? We propose having a global file materialization limit that restricts the number of files that can be materialized at once on the driver; this limit will be added when Row Tracking Backfill is introduced in a future PR. A case we need to consider is what happens if the task size exceeds the materialization limit. In that case we can allow the task to complete, as we do not want to break the task boundary (doing so would break the idempotence of Row Tracking Backfill). The `FileMetadataMaterializationTracker` is the component used in the Row Tracking Backfill process to ensure that a single batch is not large enough to OOM the driver. ### Design The driver holds a semaphore that can give out permits equal to the materialization limit of the driver. The driver also holds an overallocation lock that allows only one query to over-allocate in order to finish materializing its task. Each `RowTrackingBackfillCommand` instance will maintain a `FileMetadataMaterializationTracker` that keeps track of how many files were materialized, and it will acquire/release the over-provisioning semaphore as well. A permit to materialize a file is acquired while iterating over the files iterator when creating a task. The permits are released upon failure or when the batch completes executing. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? No. 
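The Design section of the `FileMetadataMaterializationTracker` commit above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the Delta implementation: the class `MaterializationTrackerSketch` and its method names are hypothetical, both shared semaphores are passed in by the caller, and each tracker instance is assumed to be used by a single thread.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of the permit scheme; names are not taken from Delta.
final class MaterializationTrackerSketch {
    private final Semaphore permits;         // shared: one permit per materialized file
    private final Semaphore overAllocation;  // shared: single over-allocation slot
    private int acquiredPermits = 0;         // permits held by this tracker
    private int overAllocatedFiles = 0;      // files admitted beyond the limit
    private boolean holdsOverAllocation = false;

    MaterializationTrackerSketch(Semaphore permits, Semaphore overAllocation) {
        this.permits = permits;
        this.overAllocation = overAllocation;
    }

    // Called once per file while building a task. Returns false when the
    // limit is reached and another tracker already owns the over-allocation slot.
    boolean tryAcquireForFile() {
        if (permits.tryAcquire()) {
            acquiredPermits++;
            return true;
        }
        if (holdsOverAllocation || overAllocation.tryAcquire()) {
            holdsOverAllocation = true;  // only one tracker may exceed the limit
            overAllocatedFiles++;
            return true;
        }
        return false;
    }

    // Called when the batch completes or fails: returns everything to the pool.
    void releaseAll() {
        permits.release(acquiredPermits);
        acquiredPermits = 0;
        overAllocatedFiles = 0;
        if (holdsOverAllocation) {
            overAllocation.release();
            holdsOverAllocation = false;
        }
    }

    int materializedFiles() {
        return acquiredPermits + overAllocatedFiles;
    }
}
```

With a limit of two permits, one tracker can materialize a third file by taking the over-allocation slot, while a second tracker is refused until the first calls `releaseAll()`. Keeping the over-allocation slot as a separate single-permit semaphore is one simple way to let exactly one task run past the limit without splitting it.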
commit 7714e813fcb00cd13ff0a86517b531b8584eab62 Author: Chirag Singh <137233133+chirag-s-db@users.noreply.github.com> Date: Tue Apr 30 11:06:23 2024 -0700 [Kernel] Add integration test for V2 checkpoints (#2992) commit 2f599e634508d93d4095c21abda22eb2ac750aaf Author: Johan Lasperas Date: Tue Apr 30 19:03:46 2024 +0200 [Spark] Better schema validation in MERGE schema evolution tests (#2981) ## Description Address following shortcomings of schema evolution tests in MERGE: - Tests ignore the nullability of fields when validating the schema of the table after evolution, which prevents checking for nullability of specific fields. - For struct evolution tests, a result schema can be passed but this schema is only used to parse the result data and not to validate the schema of the table after evolution. ## How was this patch tested? Updated tests ## Does this PR introduce _any_ user-facing changes? No, test-only commit 44869f49f843ed8192122d787cd150cdfec5bb9e Author: Scott Sandre Date: Mon Apr 29 13:50:56 2024 -0700 [Docs] Improve doc generation CUJ and ensure that version is correctly set in `conf.py` (#2990) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (Docs) ## Description Today, `conf.py`'s `version` is hardcoded to `3.0.0`. Let's make this easily configurable (and mandatory) when we generate the docs. This will help prevent us from unintentionally forgetting to update this conf and publishing docs with the incorrect visible version. ## How was this patch tested? Locally ## Does this PR introduce _any_ user-facing changes? No commit 60f4a538f0134d9befd1af32f394a397197d38d5 Author: Lukas Rupprecht Date: Mon Apr 29 13:11:16 2024 -0700 [Spark] Refactor several managed commit-related methods into ManagedCommitUtils (#2989) #### Which Delta project/connector is this regarding? 
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR refactors several methods that were internal to commit owner (client) implementations (AbstractBatchBackfillingCommitOwnerClient and InMemoryCommitOwner) into ManagedCommitUtils. These methods are generic and should be easily accessible to any commit owner (client) implementations so we are making them static methods. ## How was this patch tested? Refactor only so existing tests are sufficient. ## Does this PR introduce _any_ user-facing changes? No commit 2833a3efc15cec45315db7a181d3ca456a8d8436 Author: Prakhar Jain Date: Mon Apr 29 10:47:21 2024 -0700 [Spark] DROP Support for Vacuum Protocol Check table feature (#2983) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description DROP Support for Vacuum Protocol Check table feature ## How was this patch tested? UTs ## Does this PR introduce _any_ user-facing changes? No commit f4a49446b114962a6e621eb5c45f18e2801450c9 Author: Andreas Chatzistergiou <93710326+andreaschat-db@users.noreply.github.com> Date: Fri Apr 26 17:55:51 2024 +0200 [Spark] Support predicate pushdown in scans with DVs (#2933) ## Description Currently, when Deletion Vectors are enabled we disable predicate pushdown and splitting in scans. This is because we rely on a custom row index column which is constructed in the executors and cannot not handle splits and predicates. These restrictions can now be lifted by relying instead on `metadata.row_index` which was exposed recently after relevant [work](https://issues.apache.org/jira/browse/SPARK-37980) was concluded. Overall, this PR adds predicate pushdown and splits support as follows: 1. Replaces `__delta_internal_is_row_deleted` with `_metadata.row_index`. 2. Adds a new implementation of `__delta_internal_is_row_deleted` that is based on `_metadata.row_index`. 3. 
`IsRowDeleted` filter is now non-deterministic to allow predicate pushdown.

Furthermore, it includes previous relevant [work](https://github.com/delta-io/delta/pull/2576) to remove the UDF from the `IsRowDeleted` filter.

## How was this patch tested?

Added new suites.

commit 78970abd96dfc0278e21c04cda442bb05ccde4a1
Author: Allison Portis
Date: Thu Apr 25 17:20:08 2024 -0700

[Kernel] Expose the "stats" field in returned scan files in `ScanImpl` using a boolean parameter (#2928)

## Description

Add a private API `ScanImpl.getScanFiles(TableClient tableClient, boolean includeStats)` to enable connectors to fetch the `stats` as part of the returned scan file rows. By default `stats` are not read unless there is a predicate on data columns.

The main reason why we are not making this a public API yet is that currently, the `ColumnarBatch` interface has no way to remove a nested column vector before returning it to the connector. It requires major work on the expression interfaces. Until we have that, this is a workaround for connectors that are interested in the `stats` field in scan files.

## How was this patch tested?

UTs

commit 4e43b9050af1b90dd8f147ffe0e78a9127e830ae
Author: Venki Korukanti
Date: Thu Apr 25 16:14:29 2024 -0700

[Kernel][Writes] Add methods to encode Delta Log actions as `Row` objects (#2976)

(Split from larger PR #2944)

Add `toRow` for each of the action objects. Also add any missing actions such as `CommitInfo`.

Also:
* Rename `READ_SCHEMA` to `FULL_SCHEMA` when all columns in an action are present. `READ_SCHEMA` is a term used to read just the subset of columns, but in some actions, it also represents the full schema.
* Add a utility method to create a single-action `Row` object from a specific action. E.g. `createMetadataSingleAction(Metadata metadata)` returns a `Row` of single action schema with `metaData` column representing the given `metadata` object.
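The single-action idea described above (a row with one column per action type, where only the column for the given action is populated) can be illustrated with a small sketch. It does not use the Kernel `Row` API: a plain `Map` stands in for a row, and `SingleActionSketch`, `singleAction`, and the chosen column subset are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a single-action row: every action column is
// present, but only the requested one carries a value.
public class SingleActionSketch {

    // Illustrative subset of top-level action columns in a Delta log entry.
    public static final List<String> ACTION_COLUMNS =
        Arrays.asList("txn", "add", "remove", "metaData", "protocol", "commitInfo");

    public static Map<String, Object> singleAction(String column, Object action) {
        if (!ACTION_COLUMNS.contains(column)) {
            throw new IllegalArgumentException("Unknown action column: " + column);
        }
        // LinkedHashMap keeps the column order and allows null values.
        Map<String, Object> row = new LinkedHashMap<>();
        for (String name : ACTION_COLUMNS) {
            row.put(name, name.equals(column) ? action : null);
        }
        return row;
    }
}
```

Calling `singleAction("metaData", metadata)` in this sketch yields a map in which only the `metaData` entry is non-null, which mirrors what `createMetadataSingleAction` is described as returning.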
commit c7bed5cbd0a6250926358f2f37aec3e477a2f0e7 Author: Scott Sandre Date: Thu Apr 25 15:15:00 2024 -0700 Fix Delta Spark Master aborted DeltaTableBuilderSuite (#2974) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description `DeltaTableBuilderSuite` was being aborted when testing against Delta on Spark Master (JDK 17) due to ``` [info] io.delta.tables.DeltaTableBuilderSuite *** ABORTED *** [info] java.lang.IllegalAccessError: Update to non-static final field io.delta.tables.DeltaTableBuilderSuite$SetPropertyThroughCreate$1$.preservedCaseConfig$1 attempted from a different method (io$delta$tables$DeltaTableBuilderSuite$CasePreservingTablePropertySetOperation$1$_setter_$preservedCaseConfig$1_$eq) than the initializer method [info] at io.delta.tables.DeltaTableBuilderSuite$SetPropertyThroughCreate$1$.io$delta$tables$DeltaTableBuilderSuite$CasePreservingTablePropertySetOperation$1$_setter_$preservedCaseConfig$1_$eq(DeltaTableBuilderSuite.scala:400) [info] at io.delta.tables.DeltaTableBuilderSuite$CasePreservingTablePropertySetOperation$1.$init$(DeltaTableBuilderSuite.scala:384) [info] at io.delta.tables.DeltaTableBuilderSuite$SetPropertyThroughCreate$1$.(DeltaTableBuilderSuite.scala:400) [info] at io.delta.tables.DeltaTableBuilderSuite.SetPropertyThroughCreate$lzycompute$1(DeltaTableBuilderSuite.scala:400) [info] at io.delta.tables.DeltaTableBuilderSuite.SetPropertyThroughCreate$2(DeltaTableBuilderSuite.scala:400) [info] at io.delta.tables.DeltaTableBuilderSuite.$anonfun$new$42(DeltaTableBuilderSuite.scala:448) ``` This PR fixes that. TBH I have no idea how or why this fixes it. ## How was this patch tested? CI tests. ## Does this PR introduce _any_ user-facing changes? No commit 280c878909a9a2d10062894b8ac30a772b44fa4a Author: Carmen Kwan Date: Thu Apr 25 23:41:48 2024 +0200 [Spark] Add GenerateIdentityValues UDF (#2915) #### Which Delta project/connector is this regarding? 
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR is part of https://github.com/delta-io/delta/issues/1959 In this PR, we introduce the `GenerateIdentityValues` UDF used for populating Identity Column values. The UDF is not used in Delta in this PR yet. `GenerateIdentityValues` is a simple non-deterministic UDF which keeps a counter with the user specified `start` and `step`. It counts in increments of `numPartitions` so that it can be parallelized in different tasks. ## How was this patch tested? New test suite and unit tests for the UDF. ## Does this PR introduce _any_ user-facing changes? No. commit 42f09bd1fa2e5a984f3b2836a6c11648889f5f28 Author: Venki Korukanti Date: Thu Apr 25 12:46:47 2024 -0700 [Kernel][Writes] Utility method to construct partition data directory (#2975) ## Description (Split from larger PR #2944) Utility method to construct the partition data directory. ## How was this patch tested? UTs commit 52e61de15300e14b4bc5564220eb0dfeda64fd92 Author: richardc-db <87336575+richardc-db@users.noreply.github.com> Date: Thu Apr 25 11:58:47 2024 -0700 [SPARK][VARIANT] Add minimal support for variant type in delta-spark (#2923) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Adds the variant table feature to minimally implement the variant type as described in the RFC in #2867. Also disables using variant columns as partition columns. ## How was this patch tested? Added some UTs. More UTs will be merged in followup PRs tested against both spark 3.5 and 4.0 snapshot with ``` build/sbt -DsparkVersion=latest spark/'testOnly org.apache.spark.sql.delta.DeltaVariantSuite' build/sbt -DsparkVersion=master spark/'testOnly org.apache.spark.sql.delta.DeltaVariantSuite' ``` ## Does this PR introduce _any_ user-facing changes? 
no commit e3b58d226f769f46d89220c7837c0b56c1e4a290 Author: Venki Korukanti Date: Thu Apr 25 11:01:48 2024 -0700 [Kernel][Writes] Utility method to serialize partition literals as strings (#2973) ## Description (Split from larger PR #2944) Adds a utility method to convert the partition value literal to Delta protocol-compliant string value. ## How was this patch tested? UTs commit 458318b9b101c6254ce4abbdb64c21776e3393bb Author: Prakhar Jain Date: Thu Apr 25 10:39:37 2024 -0700 [Spark] Handle incomplete backfills after MC -> FS conversion (#2957) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Handle incomplete backfills after MC -> FS conversion. ## How was this patch tested? UTs ## Does this PR introduce _any_ user-facing changes? No commit 358a91f9cad6772a6ce66cc250b73749ef112eca Author: Jiaheng Tang Date: Thu Apr 25 08:47:09 2024 -0700 [Docs] Update liquid clustering documentation for Delta 3.2 (#2930) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Doc) ## Description Update documentation about liquid clustering for Delta 3.2 ## How was this patch tested? ![127 0 0 1_8000_delta-clustering html (5)](https://github.com/delta-io/delta/assets/13973764/96d49fdd-76b1-428e-9490-27803222988d) ## Does this PR introduce _any_ user-facing changes? No commit 4df50672d838bc43cc295d8f7dc53f0ea5cbbc13 Author: Hyukjin Kwon Date: Fri Apr 26 00:45:19 2024 +0900 [Spark] Make test_deltatable compatible with the latest Spark dev branch (#2970) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description After https://github.com/apache/spark/commit/e01ac581f46aa595e66daf33fe92b56d1328bc78, `pyspark.sql.column._to_seq` has moved to `pyspark.sql.classic.column._to_seq`. 
This PR fixes the test to make the test compatible with old and new Spark versions. ## How was this patch tested? Date: Thu Apr 25 08:24:46 2024 -0700 [Kernel] Add kernel support for v2 checkpoints ## Description Add support for V2 checkpoints. When reconstructing the `LogSegment` of a table at a given version, check if the checkpoint file to be read is a checkpoint manifest. If it is, include the sidecar files referenced by that manifest in the `LogSegment` checkpoint files. See https://github.com/delta-io/delta/issues/2232 ## How was this patch tested? See changes to `LogReplaySuite`, `SnapshotManagementSuite`, `CheckpointerSuite`, `FileNamesSuite`, and `CheckpointInstanceSuite`. ## Does this PR introduce _any_ user-facing changes? No. commit 1e2c74f36a6de72f341bb881d409e3536688e124 Author: gene-db <77996944+gene-db@users.noreply.github.com> Date: Thu Apr 25 05:52:50 2024 -0700 [PROTOCOL RFC] Variant data type (#2867) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Adds the protocol changes for the Variant data type (see https://github.com/delta-io/delta/issues/2864) to the RFC folder. ## How was this patch tested? N/A ## Does this PR introduce _any_ user-facing changes? N/A commit 88c03d37348ee99ddfbb1efa081cbb615e2f91b6 Author: Scott Sandre Date: Wed Apr 24 18:45:13 2024 -0700 Disable 3 Delta Spark tests against Spark Master (#2967) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Disable 3 delta-spark tests from executing against Spark Master. They execute against Spark 3.5 only. ## How was this patch tested? CI tests. ## Does this PR introduce _any_ user-facing changes? 
No commit 92761bf788eaad35a51db76b9cc04c7730ea290d Author: Dhruv Arya Date: Wed Apr 24 17:23:17 2024 -0700 [Spark] Mark InCommitTimestamp as a preview feature (#2962) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Follow up for https://github.com/delta-io/delta/pull/2946. Updates the configs and table feature name associated with InCommitTimestamp to indicate that it is in the preview phase --- the RFC has not been finalized but the feature is code complete.   ## How was this patch tested? No additional testing. ## Does this PR introduce _any_ user-facing changes? No commit 1c80ad63c196deb3c0acb632611a0da8c8e7146f Author: Scott Sandre Date: Wed Apr 24 17:22:12 2024 -0700 [INFRA] Set Spark Master test parallelism to 2 (#2964) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (INFRA) ## Description Set test parallelism to 2 for delta-spark against Spark Master tests ## How was this patch tested? CI tests on this PR ## Does this PR introduce _any_ user-facing changes? No commit 3a2e469ba85f11846038cc465d4bbf480ee1afd5 Author: Prakhar Jain Date: Wed Apr 24 16:19:50 2024 -0700 [Spark] Rename Commit Store to Commit Owner (#2960) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Rename Commit Store to Commit Owner ## How was this patch tested? Existing UTs ## Does this PR introduce _any_ user-facing changes? No commit 5f2d957aaa258bfaf3827fb33fcdff60c70e6c37 Author: Venki Korukanti Date: Wed Apr 24 12:33:06 2024 -0700 [Kernel][Writes] Implement `Table.checkpoint` API to write a classic single file checkpoint (#2941) ## Description Implements `Table.checkpoint` API to checkpoint the table at given version ``` /** * Checkpoint the table at given version. It writes a single checkpoint file. 
* * @param tableClient {@link TableClient} instance to use. * @param version Version to checkpoint. * @throws TableNotFoundException if the table is not found * @throws CheckpointAlreadyExistsException if a checkpoint already exists at the given version * @throws IOException for any I/O error. * @since 3.2.0 */ void checkpoint(TableClient tableClient, long version) throws TableNotFoundException, CheckpointAlreadyExistsException, IOException; ``` ## How was this patch tested? Unit and integration tests. commit 8b4b6cce7071046da3d6d3fda4b85120a7445771 Author: Lin Zhou <87341375+linzhou-db@users.noreply.github.com> Date: Wed Apr 24 11:38:20 2024 -0700 Upgrade delta-sharing-client to 1.0.5 (#2955) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (Delta Sharing) ## Description Upgrade delta-sharing-client to 1.0.5 ## How was this patch tested? Unit Tests ## Does this PR introduce _any_ user-facing changes? No commit 3c09d95a34b71fff20cb23753c65af95da5cb48f Author: Dhruv Arya Date: Wed Apr 24 09:06:48 2024 -0700 [Spark] Make update catalog schema truncation threshold configurable (#2911) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Currently, during schema sync to catalog, the whole schema gets truncated if any of the fields is longer than 4000 characters. This PR makes this threshold a configurable through the config `DeltaSQLConf. DELTA_UPDATE_CATALOG_LONG_FIELD_TRUNCATION_THRESHOLD`. ## How was this patch tested? Created variations of existing test cases that validate that setting the config to a bigger value skips the truncation. ## Does this PR introduce _any_ user-facing changes? 
No

Co-authored-by: Tathagata Das

commit 5ace82788a1073ca9c5302070cb07aecad87a743
Author: Johan Lasperas
Date: Wed Apr 24 18:04:14 2024 +0200

[Spark] Type Widening preview (#2937)

## Description

Expose the type widening table feature outside of testing and set its preview user-facing name: typeWidening-preview (instead of typeWidening-dev used until now).

Feature description: https://github.com/delta-io/delta/issues/2622

The type changes that are supported for now are `byte` -> `short` -> `int`. Other types depend on Spark changes which are going to land in Spark 4.0 and will be available once Delta picks up that Spark version.

## How was this patch tested?

Extensive testing in `DeltaTypeWidening*Suite`.

## Does this PR introduce _any_ user-facing changes?

User-facing changes were already covered in PRs implementing this feature. In short, it allows:
- Adding the type widening table feature (using a table property):
```
ALTER TABLE t SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true');
```
- Manual type changes:
```
ALTER TABLE t CHANGE COLUMN col TYPE INT;
```
- Automatic type changes via schema evolution:
```
CREATE TABLE target (id int, value short);
CREATE TABLE source (id int, value int);
SET spark.databricks.delta.schema.autoMerge.enabled = true;
INSERT INTO target SELECT * FROM source;
-- value now has type int in target
```
- Dropping the table feature, which rewrites data to make the table readable by all readers:
```
ALTER TABLE t DROP FEATURE 'typeWidening'
```

commit 0cca6b8778c0ca966a27682b667a6b1973ad8e6e
Author: Fred Storage Liu
Date: Wed Apr 24 08:34:12 2024 -0700

Add Delta Uniform Hudi integration test (#2951)

#### Which Delta project/connector is this regarding?
- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

Add Delta Uniform Hudi integration test to verify it generates Hudi metadata

## How was this patch tested?
run integration test commit 58707407f1dc1356557a56679fe92c250dae72da Author: Adam Binford Date: Wed Apr 24 11:25:33 2024 -0400 [SPARK] Cluster by Scala and Python API (#2880) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Resolves https://github.com/delta-io/delta/issues/2471 Adds `clusteredBy` methods to the Scala and Python DeltaTableBuilder classes. ## How was this patch tested? A few small new UTs, there's not much new behavior as it's simply wrapping the main create table logic. ## Does this PR introduce _any_ user-facing changes? Yes, adds the ability to create clustered tables via Scala and Python APIs commit 5289b84ed2c00c8214cad3e7f5955b59f83034f8 Author: sabir-akhadov <52208605+sabir-akhadov@users.noreply.github.com> Date: Wed Apr 24 17:00:36 2024 +0200 Column mapping removal: support CDC enabled tables (#2938) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Extend CDC support around column mapping removal. Most notably, ensure CDC data is not generated for the column mapping removal commit and prohibit reading CDC data across column mapping removal boundary. ## How was this patch tested? New unit tests ## Does this PR introduce _any_ user-facing changes? commit 9c2cdf0ca6ce538692b9626af4c742de7f68d749 Author: Chirag Singh <137233133+chirag-s-db@users.noreply.github.com> Date: Tue Apr 23 19:39:11 2024 -0700 [Kernel] Allow kernel-defaults to use kernel-api test utils (#2956) ## Description Allows kernel-defaults module to take advantage of kernel-api test utils (in preparation for refactoring containing these changes). 
commit e8a6bad0213c24a6b133f65c46de24995f6dbebe Author: Venki Korukanti Date: Tue Apr 23 18:40:58 2024 -0700 [Kernel][Writes] Update the behavior of `Table.forPath` to not throw error for empty tables (#2949) ## Description (Split from #2941) Earlier it used to throw if the path is not valid. Now, it doesn't validate immediately. Instead the validation is done when we try to create a snapshot or other operations. We want to create tables from scratch and `Table. forPath` throwing error doesn't let us proceed. ## How was this patch tested? Unittests commit ce0d9621c37c069565b998cd8e5bddf2421491e8 Author: Scott Sandre Date: Tue Apr 23 17:10:57 2024 -0700 Fix typo in Spark Master test workflow (#2958) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (INFRA) ## Description Rename Spark Master github workflow name ## How was this patch tested? CI test ## Does this PR introduce _any_ user-facing changes? No commit bd732b13af863cf288b022c7c3ef979a39744eea Author: Prakhar Jain Date: Tue Apr 23 15:33:16 2024 -0700 [Spark] Managed Commit support for DeltaLog's getChanges/getChangeLogFiles API (#2901) ## Description This PR makes changes to `DeltaLog.getChanges/getChangeLogFiles` API in order to support managed-commits. ## How was this patch tested? UTs ## Does this PR introduce _any_ user-facing changes? No commit 83b4719e3b55b6883339679dd721b928d88c642c Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Tue Apr 23 23:33:54 2024 +0200 [Spark] Add preserving Row Tracking for Column Mapping Removal (#2952) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description - Add Row Tracking Preservation for column mapping removal. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? No. 
commit 0fa0a7b5d12986ab8aed28a627b14f1489a0f0a9
Author: Dhruv Arya
Date: Tue Apr 23 11:28:30 2024 -0700

[Spark] Make ICT available outside of test environments (#2946)

#### Which Delta project/connector is this regarding?
- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

Move the InCommitTimestamp feature to the list of supported features so that users can enable it outside of the test environment. Note that InCommitTimestamp is still an RFC and the feature name will still have the -dev suffix in it.

## How was this patch tested?

Existing tests.

## Does this PR introduce _any_ user-facing changes?

Yes, users can now use the `InCommitTimestamp-dev` feature in their environments if they want to try it before the RFC is merged.

commit 36174cb14544b2f2ea50a1c7c91ded289d8e15d6
Author: Venki Korukanti
Date: Tue Apr 23 11:13:54 2024 -0700

[Kernel][Writes] Add a utility to parse interval duration in Delta table configs (#2954)

Currently these configs are stored in SQL interval format. We could depend on external libraries, but to minimize the dependencies of `kernel-api`, the minimal relevant code is copied from Standalone (which was a copy of Apache Spark code). Also, I couldn't find a minimal library that does just the interval parsing.

commit 2851f56a58c1a7a2425fee24ee264ee42c214f46
Author: Venki Korukanti
Date: Tue Apr 23 10:54:26 2024 -0700

[Kernel][Writes] Add `TableFeatures.validateWriteSupportedTable` utility method (#2950)

## Description

(Split from #2941)

Add a utility method to check whether the table's protocol and features are ones that the insert APIs currently being planned can write into. The current plan is to provide write support similar to that of Standalone, which is at (minReaderVersion = 1 and minWriterVersion = 2 with `appendOnly` support).

## How was this patch tested?
Unittests commit ce4a35350b8c3e99e3145908003b3575aad86822 Author: Scott Sandre Date: Tue Apr 23 09:36:39 2024 -0700 Run Delta-Spark tests against Spark Master (new github workflow) (#2889) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Create a new GitHub action workflow to run delta-spark tests against Spark Master ## How was this patch tested? Ran CI tests. ## Does this PR introduce _any_ user-facing changes? No commit a7f5b967fd09821cf76430eb7f322957b301fa5d Author: Johan Lasperas Date: Tue Apr 23 18:09:25 2024 +0200 [Spark] Reject type changes on columns referenced by constraints/generated columns (#2881) ## What changes were proposed in this pull request? It is generally not safe to change the type of a column or field that is referenced by a CHECK constraint or a generated column. For example, some functions may produce different results depending on the input data type, e.g. `hash`. This change adds checks to fail when the type of a column or field that is referenced by a CHECK constraint or a generated column is changed: - using `ALTER TABLE t CHANGE COLUMN col TYPE type`. - using schema evolution, in `ImplicitMetadataOperation.mergeSchema()`. For the latter, a check for generated columns only was already in place in `SchemaMergingUtils.mergeSchemas`. That check is replaced in favor of the more generic check in `ImplicitMetadataOperation` which reuses existing logic already used to block column rename/drop in `ALTER TABLE`. ## How was this patch tested? - Tests for rejecting type changes with CHECK constraints and generated columns added to `DeltaTypeWideningSuite`. - Existing tests for rejecting type changes in `GeneratedColumnSuite` are extended. 
- Tests covering the updated and newly added error classes are added to `DeltaErrorsSuite`

## This PR introduces the following *user-facing* changes

The type widening table feature isn't available publicly yet, so this change isn't user-facing in that regard.

This change updates the following error codes:
- `_LEGACY_ERROR_TEMP_DELTA_0004` -> `DELTA_CONSTRAINT_DEPENDENT_COLUMN_CHANGE`
- `_LEGACY_ERROR_TEMP_DELTA_0005` -> `DELTA_GENERATED_COLUMNS_DEPENDENT_COLUMN_CHANGE`

and introduces the following error code:
- `DELTA_CONSTRAINT_DATA_TYPE_MISMATCH`

commit b8caf4399ab2e92d6a6d46fb99aad1e2665ebf7d
Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com>
Date: Tue Apr 23 17:10:05 2024 +0200

[Spark] Add additional read stable Row IDs check (#2917)

#### Which Delta project/connector is this regarding?
- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

Add additional read stable Row IDs assertions in some tests, now that read stable Row Tracking has been introduced in the previous PRs.

## How was this patch tested?

Added UTs.

## Does this PR introduce _any_ user-facing changes?

No.

commit be7183bef85feaebfc928d5f291c5a90246cde87
Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com>
Date: Tue Apr 23 17:09:09 2024 +0200

[Spark] DV Reads Performance Improvement in Delta by removing Broadcasting DV Information (#2888)

#### Which Delta project/connector is this regarding?
- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

Previously, we relied on an [expensive Broadcast of DV files](https://github.com/delta-io/delta/pull/1542) to pass the DV files to the associated Parquet Files. With the ability to [add custom metadata to files](https://github.com/apache/spark/pull/40677), introduced in Spark 3.5, we can now pass the DV through the custom metadata field. This is expected to improve the performance of DV reads in Delta.

## How was this patch tested?
Adjusted the existing UTs that cover our changes. ## Does this PR introduce _any_ user-facing changes? No. commit cb53c9ac2282f2a3e045ebb3fcb19604d7f2dd3e Author: Venki Korukanti Date: Mon Apr 22 19:44:32 2024 -0700 [Kernel][Writes] `ParquetHandler` and `JsonHandler` API changes (#2948) ## Description (extracted out the `TableClient` related changes from #2941) At a high level, the requirements for writing files (either JSON or Parquet) are as follows: * Write given data into one or more Parquet files (for writing the data or v2 checkpoint sidecar files) * `ParquetHandler.writeParquetFiles` is used in this case. * Write given data into exactly one atomically file (i.e either the file is created with full content or no file is created at all) (checkpoint classic, delta commit file, v2 cp manifest file) * JSON: `JsonHandler.writeJsonFileAtomically` with `overwrite = false` * Parquet: `ParquetHandler.writeParquetFileAtomically` * Write given data atomically into a file (overwrite if it exists) (`_last_checkpoint`) * JSON: `JsonHandler.writeJsonFileAtomically` with `overwrite = true` * Not supported: Write given data atomically in one or more files (multi-part checkpoint - this is not possible and the reason for creating checkpoint v2 to get around this issue. In this PR following API are added/updated and also the default implementation for the same. ``` interface ParquetHandler { ... existing APIs ... /** * Write the given data as a Parquet file. This call either succeeds in creating the file with * given contents or no file is created at all. It won't leave behind a partially written file. *

* @param filePath Fully qualified destination file path * @param data Iterator of {@link FilteredColumnarBatch} * @throws FileAlreadyExistsException if the file already exists and {@code overwrite} is false. * @throws IOException if any other I/O error occurs. */ void writeParquetFileAtomically( String filePath, CloseableIterator data) throws IOException; } ``` The `DefaultParquetHandler` implements the above API using `LogStore`s from `delta-storage` module to achieve the atomicity. For writing the `_last_checkpoint` file, update the existing `JsonHandler.writeJsonFileAtomically` to take an option `overwrite`. ``` void writeJsonFileAtomically( String filePath, CloseableIterator data, boolean overwrite) throws IOException; ``` ## How was this patch tested? Unittests. commit cb5ef2181a31a5c06ac0f1b64baa182aa4b558af Author: Scott Sandre Date: Mon Apr 22 18:01:03 2024 -0700 Fix Delta on Spark Master tests broken by error-classes.json rename (#2947) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description In Apache Spark https://github.com/apache/spark/commit/c5b8e60e0d5956d9f648f77ae13a1558c99adf6b, `error-classes.json` was renamed to `error-conditions.json`. Update Delta's DeltaThrowableHelper to use the right json file when cross compiling against Spark 3.5 and Spark Master. ## How was this patch tested? Existing UTs. ## Does this PR introduce _any_ user-facing changes? No commit 128cf783fba26ec36e329ca880f04094524ab68d Author: Xupeng Li <162375861+xupengli-db@users.noreply.github.com> Date: Mon Apr 22 13:45:31 2024 -0700 Add vacuum inventory table related description (#2918) #### Which Delta project/connector is this regarding? 
- [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description We have added [a new feature](https://github.com/delta-io/delta/commit/7d41fb7bbf63af33ad228007dd6ba3800b4efe81) for VACUUM command that allows users to provide a inventory table to specify the files to be considered by VACUUM. This PR updates the documentation to reflect this feature. ## How was this patch tested? N/A. Doc updates only. ## Does this PR introduce _any_ user-facing changes? No commit e75e4f99358c3ce1b4ac5dc7cb3a799027b5b4e3 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Mon Apr 22 22:17:17 2024 +0200 [Spark] Add Preserving Row Tracking in Merge (#2936) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Preserve row IDs in Merge by reading the metadata column and writing it out to the physical column. ## How was this patch tested? Existing UTs. ## Does this PR introduce _any_ user-facing changes? No. commit 7d31b4f0be27d1ca4eba86352dd793a6c71945a8 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Mon Apr 22 17:54:29 2024 +0200 [Spark] Add Preserving Row Tracking in Optimize/Reorg/Z-Order/Purge (#2935) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Preserve row IDs in Optimize/Reorg/Z-Order/Purge by reading the metadata column and writing it out to the physical column. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? No. commit 2878f61e9dbb66b95215d4fba95f020b2b1ebfc6 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Mon Apr 22 17:54:11 2024 +0200 [Spark] Add Preserving Row Tracking in Update (#2927) #### Which Delta project/connector is this regarding? 
- [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Preserve row IDs in UPDATE by reading the metadata column and writing it out to the physical column. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? No. commit 4b835e91a75e95d4596abd4f3060ec0411ecb096 Author: Tathagata Das Date: Fri Apr 19 20:12:29 2024 -0400 Increase tests parallelism to 2 on free Github action runners (#2769) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other: Infra ## Description As the title says ## How was this patch tested? ## Does this PR introduce _any_ user-facing changes? commit fcca4a671b8f97f3a0b5e863c58d273eae36309e Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Sat Apr 20 02:10:39 2024 +0200 [Spark] Add Preserving Row Tracking in Delete (#2925) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Add Preserving Row Tracking in Delete by reading the metadata column and writing it out to the physical column. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? No. commit a1ddb8bcce96e963a6341ccc16922af9d60fbebc Author: Venki Korukanti Date: Fri Apr 19 13:24:20 2024 -0700 [Kernel] Add and implement an `JsonHandler` API to atomically write the data to JSON file (#2903) ## Description Add the following API to `JsonHandler` which will be used when writing the Delta Log actions to a delta file as part of Delta table write. ``` /** * Serialize each {@code Row} in the iterator as JSON and write as a separate line in * destination file. This call either succeeds in creating the file with given contents or no * file is created at all. It won't leave behind a partially written file. *

* Following are the supported data types and their serialization rules. At a high-level, the * JSON serialization is similar to that of {@code jackson} JSON serializer. *

 * <ul>
 *   <li>Primitive types: {@code boolean, byte, short, int, long, float, double, string}</li>
 *   <li>{@code struct}: any element whose value is null is not written to file</li>
 *   <li>{@code map}: only a {@code map} with {@code string} key type is supported</li>
 *   <li>{@code array}: {@code null} value elements are written to file</li>
 * </ul>
* * @param filePath Fully qualified destination file path * @param data Iterator of {@link Row} objects where each row should be serialized as JSON * and written as separate line in the destination file. * @throws FileAlreadyExistsException if the file already exists. * @throws IOException if any other I/O error occurs. */ void writeJsonFileAtomically(String filePath, CloseableIterator data) throws IOException; ``` The default implementation makes use of the `LogStore` implementations in `delta-storage` module. ## How was this patch tested? Unittests commit 9d4e4f543d48e781d289f4f0a6c2e265ebb0ce62 Author: Johan Lasperas Date: Fri Apr 19 21:19:10 2024 +0200 [Spark] Add type widening tests for data skipping (#2895) ## Description Add tests covering stats and data skipping when a type change is applied to a Delta table. Covers: - A combination of disabling/enabling storing JSON stats in checkpoint files and the impact on data skipping. - Parsed partition values stored as part of the checkpoint file when the type of the partition column is changed. ## How was this patch tested? See above. commit bd4b37b9961edf5889f45ea5329560da44ec07e1 Author: Tom van Bussel Date: Fri Apr 19 21:18:31 2024 +0200 [Spark] Apply Column Mapping to filters pushed to Parquet scan (#2924) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR fixes a correctness bug when Column Mapping is enabled. The issue occurs when a column is renamed to the physical name of another column. If a query filters on this column then the Parquet scan incorrectly applies this filter to the other column. This PR fixes this problem by translating the column names in the filters from logical to physical in `DeltaParquetFileFormat` before passing them on to `ParquetFileFormat`. ## How was this patch tested? Added a unit test to `DeltaColumnMappingSuite`. ## Does this PR introduce _any_ user-facing changes? 
No commit b721e1913f5bdf93e69254333d839fa32673a30a Author: Scott Sandre Date: Thu Apr 18 17:31:48 2024 -0700 [Spark] [Delta X-Compile] Fix tests for both Spark 3.5 and Spark Master (batch 4) (#2921) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR is batch 4. This PR fixes 20 tests in 1 suite for Delta on Spark Master using shimming. ## How was this patch tested? Existing UTs ## Does this PR introduce _any_ user-facing changes? No commit dfdfd77ece328a453e24fc065507ee3e400fe057 Author: Dhruv Arya Date: Thu Apr 18 11:56:23 2024 -0700 [Spark] Fix InCommitTimestampSuite to make master compilable (#2919) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description The merge of these two PRs: 1. https://github.com/delta-io/delta/pull/2902 2. https://github.com/delta-io/delta/pull/2883 broke master. This renames a variable in InCommitTimestampSuite to unbreak it. ## How was this patch tested? Test-only fix. ## Does this PR introduce _any_ user-facing changes? No commit dee9fd7c074c45667a31c622942fac106f40a9d7 Author: Kaiqi Jin Date: Thu Apr 18 10:49:59 2024 -0700 Refactor Create and Clone command (#2913) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Minor refactor to consolidate the code path. Replace `new Path(tableWithLocation.location)` with `getDeltaTablePath(tableWithLocation)`, which better expresses the purpose here. ## How was this patch tested? Existing unit tests. ## Does this PR introduce _any_ user-facing changes? No commit 659617bda36981ee9734f369b8081c995a1c5659 Author: Carmen Kwan Date: Thu Apr 18 19:49:00 2024 +0200 [Spark] Add IdentityColumn.scala (#2916) #### Which Delta project/connector is this regarding? 
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR is part of https://github.com/delta-io/delta/issues/1959 In this PR, we introduce IdentityColumn.scala, a common file which contains most of the helpers for Identity Columns, necessary for unblocking future PRs. ## How was this patch tested? This PR commits dead code. Existing tests pass. ## Does this PR introduce _any_ user-facing changes? No. commit 3ebbbdade79f4bb45c616b5da9594ef26baea759 Author: Allison Portis Date: Thu Apr 18 10:40:27 2024 -0700 [Kernel] Rename snapshot time-travel APIs to be consistent with SQL syntax (#2909) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description Renames the snapshot time travel APIs on `Table` to be consistent with SQL syntax. ## How was this patch tested? Just refactoring existing tests suffice. ## Does this PR introduce _any_ user-facing changes? Neither changed API has been released yet so no. commit 666a524e0dedaf8efc77d35195e2c4622064675c Author: Johan Lasperas Date: Thu Apr 18 18:07:40 2024 +0200 [Spark] Record an event on type widening (#2887) ## What changes were proposed in this pull request? This change records an event `delta.typeWidening.typeChanges` whenever one or more type changes are applied to a Delta table. For consistency, the existing event `delta.typeWideningFeatureRemovalMetrics` is renamed to `delta.typeWidening.featureRemovalMetrics`. ## How was this patch tested? 
- Added a test covering event `delta.typeWidening.typeChanges` - Added a test covering event `delta.typeWidening.featureRemovalMetrics` commit d23a617566c14e9fe02b8c3154a21c89dde6c5f8 Author: Dhruv Arya Date: Thu Apr 18 09:07:20 2024 -0700 [Spark] Make DeltaHistory.getActiveCommitAtTime compatible with managed commits (#2883) ## Description Makes DeltaHistory.getActiveCommitAtTime compatible with Managed Commits so that timestamp-based time travel queries work correctly with managed commits. Note that the only major changes are around edge case handling where commits can sometimes be missing. ## How was this patch tested? New test suite InCommitTimestampWithManagedCommitSuite which also extends the existing ICT suite. Added a test case where we assert that missing unbackfilled files result in an exception. ## Does this PR introduce _any_ user-facing changes? No. commit c8651f1fea0859582d38820a5a689736d6e8fede Author: Scott Sandre Date: Thu Apr 18 09:04:19 2024 -0700 [Spark] [Delta X-Compile] Fix tests for both Spark 3.5 and Spark Master (batch 3) (#2910) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR is batch 3. This PR fixes 2 tests in 2 suites for Delta on Spark Master using shimming. ## How was this patch tested? Existing UTs ## Does this PR introduce _any_ user-facing changes? No commit 81e2ade74a3804d289948f8e2c2522f1940ee9ef Author: Prakhar Jain Date: Thu Apr 18 08:53:12 2024 -0700 [Spark] Register API with managed-commits (#2902) ## Description This PR adds a register API with managed commits. 1. Initial snapshot shouldn't have CommitStore. 2. Introduce concept of managedCommitTableConf. This info is unique to the table. e.g. tableId etc. 
- with this change, the configuration now has 3 table properties related to managed commits: 2.1 `managedCommitOwnerName` - name of the MC Owner 2.2 `managedCommitOwnerConf` - configuration related to the owner 2.3 `managedCommitTableConf` - managed commit configuration related to the table. `2.1` and `2.2` are used to initialize CommitStore. `2.3` is passed as an argument to the CommitStore{commit/getCommits/backfillUpto} APIs, as the owner needs it to uniquely identify the table. 3. CommitStore pre-registration support. If a table is transitioning from FS -> MC, we register with the commit-store and get the `managedCommitTableConf` for it. This conf is then set in Metadata. 4. This PR moves the `FS -> MC` commit (or 0th commit) to the filesystem, as forcing it to go via the Commit Owner adds additional complexity and doesn't give any major advantage. So whenever a transition is happening, the commit goes through the previous owner, i.e. for FS -> MC the commit goes through FS, and for MC -> FS the commit goes through MC. 5. This PR also introduces a `TableCommitStore`, which is a helper class / simplified version of CommitStore. It takes care of passing the basic arguments itself, e.g. logPath, managedCommitTableMetadata, hadoopConf. ## How was this patch tested? UTs ## Does this PR introduce _any_ user-facing changes? No commit ce4565139ef61e3f243cc5eb5d4e868a1254f997 Author: Juliusz Sompolski Date: Thu Apr 18 00:25:16 2024 +0200 [Spark] Merge should materialize source with correlated subqueries (#2906) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Small fixes for determinism determination for Merge Materialize Source. * since we consider joins as potentially non-deterministic, we should also consider correlated subqueries as such, because they essentially are also joins. * a non-Delta source will not be found in subqueries in `findFirstNonDeltaScan`, but is found in `findFirstNonDeterministicNode`. 
This will make merge source materialization give wrong reason for materialization. ## How was this patch tested? Added tests with subqueries and nested subqueries in Project and Filters and correlated subqueries. ## Does this PR introduce _any_ user-facing changes? No. Co-authored-by: Julek Sompolski commit 0461ef813df610eaab29dbeb747595e6725cae75 Author: Tathagata Das Date: Wed Apr 17 16:06:01 2024 -0400 [DOCS] Fixed canonical links in docs that was messing up the SEO (#2804) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (fill in here) - Docs infra ## Description As the title says ## How was this patch tested? N/A ## Does this PR introduce _any_ user-facing changes? No commit fab4b31146f1208492df622c43bf44d2be89fd75 Author: Venki Korukanti Date: Wed Apr 17 13:05:21 2024 -0700 [Kernel][Defaults] Add support for `timestamp_ntz` in Parquet writer (#2907) ## Description This is a Parquet writer-only change. ## How was this patch tested? Added a column of timestamp_ntz to the existing golden tables, which are read and written by the tests. commit bba0e94f02bce0eb9d3ac1fbd3a8766ce5b0011d Author: Ian Streeter Date: Wed Apr 17 02:33:48 2024 +0100 [Spark] Skip collecting commit stats to prevent computing Snapshot State (#2718) ## Description Before this PR, Delta computes a [SnapshotState](https://github.com/delta-io/delta/blob/v3.1.0/spark/src/main/scala/org/apache/spark/sql/delta/SnapshotState.scala#L46-L58) during every commit. Computing a SnapshotState is fairly slow and expensive, because it involves reading the entirety of a checkpoint, sidecars, and log segment. For many types of commit, it should be unnecessary to compute the SnapshotState. After this PR, a transaction can avoid computing the SnapshotState of a newly created snapshot. 
Skipping the computation is enabled via a spark configuration option `spark.databricks.delta.commitStats.collect=false` This change can have a big performance impact when writing into a Delta Table. Especially when the table comprises a large number of underlying data files. ## How was this patch tested? - Locally built delta-spark - Ran a small spark job to insert rows into a delta table - Inspected log4j output to see if snapshot state was computed - Repeated again, this time setting `spark.databricks.delta.commitStats.collect=false` Simple demo job that triggers computing SnapshotState, before this PR: ```scala val spark = SparkSession .builder .appName("myapp") .master("local[*]") .config("spark.sql.warehouse.dir", "./warehouse") .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") .getOrCreate spark.sql("""CREATE TABLE test_delta(id string) USING DELTA """) spark.sql(""" INSERT INTO test_delta (id) VALUES (42) """) spark.close() ``` ## Does this PR introduce _any_ user-facing changes? Yes, after this PR the user can set spark config option `spark.databricks.delta.commitStats.collect=false` to avoid computing SnapshotState after a commit. commit 1b210c21cbdddf4b553549286f28fdbe06a20532 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Wed Apr 17 02:37:14 2024 +0200 [Spark] Add Read support for RowCommitVersion and Row Tracking Preservation in Read/Write (#2878) ## Description 1. Adding the `row_commit_version` field to the _metadata column for Delta tables, allowing us to read the `row_commit_version` from the file metadata after it is stored. 2. Adding Row Tracking preservation in Read/Write. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? No. 
commit 0027d70f103faec26c775f4e366a075132ba2aa1 Author: Venki Korukanti Date: Tue Apr 16 15:45:29 2024 -0700 Revert "[Spark][TEST-ONLY] Merge source materialization non-determinism determination in subqueries tests" (#2898) Reverts delta-io/delta#2896 It was merged too soon before the tests were completed. Unfortunately, the tests have a failure. commit d00c1494e6d014a9bfd6a9c88a2c237f843fc0d2 Author: Scott Sandre Date: Tue Apr 16 15:42:58 2024 -0700 [Spark] [Delta X-Compile] Fix tests for both Spark 3.5 and Spark Master (batch 2) (#2897) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR is batch 2 (of either 2 or 3). This PR fixes 12 tests in 8 suites for Delta on Spark Master using shimming. See batch 1 here: https://github.com/delta-io/delta/pull/2884 ## How was this patch tested? Existing UTs. ## Does this PR introduce _any_ user-facing changes? No commit c41977db3529a3139d6306abe5ded161f070982a Author: Sumeet Varma Date: Tue Apr 16 10:09:04 2024 -0700 [Spark] Enhance CommitStore Equality with Semantic Comparison (#2892) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR revises the CommitStore equality check to use semantic comparison rather than direct object reference comparison. ## How was this patch tested? UTs ## Does this PR introduce _any_ user-facing changes? No commit f23442b260e30709a3135ca8f82191ef1daaffae Author: Dhruv Arya Date: Tue Apr 16 09:42:51 2024 -0700 [Spark] V2Checkpoint Manifest Writes: Use overwrite=false for logstores when partial write is visible (#2893) ## Description V2 Checkpoint Manifest writes currently have a bug: logStore.write(overwrite=true) is invoked for all logstores. This is unsafe for clouds like Azure where partial writes are visible to readers (i.e. a failed result can create a corrupted manifest). 
This PR updates this so that logStore.isPartialWriteVisible is used to determine where overwrite should be true. ## How was this patch tested? Added a new test suite V2CheckpointManifestOverwriteSuite which makes sure that the correct API is invoked depending on what logStore.isPartialWriteVisible returns. ## Does this PR introduce _any_ user-facing changes? No commit efd88991651adf3ecdf78649d8a70426a121de97 Author: Juliusz Sompolski Date: Tue Apr 16 18:06:08 2024 +0200 [Spark][TEST-ONLY] Merge source materialization non-determinism determination in subqueries tests (#2896) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR adds extra tests to consider determinism in subqueries for source merge materialization. ## How was this patch tested? This patch is test only and adds new tests. ## Does this PR introduce _any_ user-facing changes? No. Co-authored-by: Julek Sompolski commit 726165608505e94b61736d21976459b2fea5d24c Author: Fred Storage Liu Date: Mon Apr 15 20:43:11 2024 -0700 Explicitly throw when sync iceberg conversion fails (#2886) ## Description sync iceberg conversion happens when enabled with session config, or uniform iceberg is turned on for the first time. This explicitly throws when sync iceberg conversion fails so user knows conversion status. ## How was this patch tested? UT commit 9bae74989c35fa1ce8b9e6f4712e0183193469e5 Author: Jonas Irgens Kylling Date: Mon Apr 15 18:21:07 2024 +0300 [PROTOCOL] Clarify checkpoint names (#2843) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Protocol) ## Description Resolves #2842 ## How was this patch tested? ## Does this PR introduce _any_ user-facing changes? 
No commit e1063a1bfafd06d57efc4b2f9bd89c7a9051dbec Author: Scott Sandre Date: Fri Apr 12 08:38:49 2024 -0700 [Spark] [Delta X-Compile] Fix tests for both Spark 3.5 and Spark Master (batch 1) (#2884) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Fixes 25 tests in 6 suites for Delta OSS on Spark Master using shimming. ## How was this patch tested? Existing UTs ## Does this PR introduce _any_ user-facing changes? No commit 99ff4fd8ce1c9b44c5e5937170a776ac936eaa6c Author: Johan Lasperas Date: Fri Apr 12 16:25:15 2024 +0200 Reject reads on table with unsupported type changes (#2787) ## Description Adds a guardrail for the type widening table feature to reject reads when an unsupported change was applied to the table. This should never happen unless an implementation doesn't respect the type widening feature specification, which explicitly lists type changes that are allowed. ## How was this patch tested? - Added a test manually committing an invalid type change. commit 79a8558e4746bdd7c16dd44ba71e778754a9c6fb Author: Scott Sandre Date: Thu Apr 11 15:49:57 2024 -0700 [Spark] Remove (and rename) package `org.apache.spark.sql.delta.shims` (#2882) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Refactor shim package `org.apache.spark.sql.delta.shims`. When shimming `$package.$class`, use shim package `$package` ## How was this patch tested? Trivial refactor only. ## Does this PR introduce _any_ user-facing changes? No commit 8eeaeea8485daa79d2b6ffd409fed187283bc833 Author: Dhruv Arya Date: Thu Apr 11 15:22:24 2024 -0700 [Spark] Add drop support for InCommitTimestamp table feature (#2873) #### Which Delta project/connector is this regarding? 
- [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Adds drop feature support for InCommitTimestamps. When the user runs DROP FEATURE, the following will happen if ICT is present in the PROTOCOL: 1. If any of the ICT-related properties are present, the first commit will: - Set `IN_COMMIT_TIMESTAMPS_ENABLED` = true - Remove `IN_COMMIT_TIMESTAMP_ENABLEMENT_VERSION` and `IN_COMMIT_TIMESTAMP_ENABLEMENT_TIMESTAMP` 2. A second commit will remove ICT from the protocol. ## How was this patch tested? New tests in the DeltaProtocolVersionSuite for the following scenarios: 1. When ICT is enabled from commit 0 onwards. 2. When ICT is enabled in some commit after 0. 3. Dropping when the feature is not there in protocol 4. Dropping when only one provenance property is present and even the enablement property is not present 5. Dropping when none of the table properties are present ## Does this PR introduce _any_ user-facing changes? Yes. Users will now be able to run ALTER TABLE <> DROP FEATURE inCommitTimestamps-dev on their tables. commit 45ad64131039d3e024010e3438c5a1d99e5c5ad5 Author: Venki Korukanti Date: Thu Apr 11 08:53:27 2024 -0700 Add Delta data type to Parquet physical type mappings in PROTOCOL.md (#2048) ## Description Currently, Delta protocol doesn't specify how a Delta data type is stored physically in Parquet files. This PR is attempting to document the Delta data type to Parquet physical/logical type mappings. ## How was this patch tested? NA ## Does this PR introduce _any_ user-facing changes? No commit 3bf970459b9324a0b9dfacdcf350748e8a254da3 Author: Carmen Kwan Date: Thu Apr 11 17:23:05 2024 +0200 [Spark] Add IdentityColumnsTableFeature (#2859) ## Description This PR is part of https://github.com/delta-io/delta/issues/1959 In this PR, we introduce the IdentityColumnsTableFeature to test-only so that we can start developing with it. 
Note, we do not add support to minWriterVersion 6 yet to properties.defaults.minWriterVersion because that will enable the table feature outside of testing. ## How was this patch tested? Existing tests pass. ## Does this PR introduce _any_ user-facing changes? No, this is a test-only change. commit 8b419e7b6929cee56c054c576e904bb3f118353e Author: Prakhar Jain Date: Wed Apr 10 20:31:35 2024 -0700 [Spark] Add Managed Commit support in getSnapshotAt API (#2879) ## Description Add Managed Commit support in deltaLog.getSnapshotAt() API. ## How was this patch tested? UTs ## Does this PR introduce _any_ user-facing changes? No commit ad9f67a665c5311e2d9f91c4b70404cd34004549 Author: Scott Sandre Date: Wed Apr 10 16:35:30 2024 -0700 Make Delta able to cross-compile against Spark Latest Release (3.5) and Spark Master (4.0) (#2877) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description ### What DOES this PR do? - changes Delta's `build.sbt` to compile `delta-spark` against spark master. compilation succeeds. tests pass against spark 3.5. tests run but fail against spark master - e.g. `build/sbt -DsparkVersion=master spark/test` - the default spark version for Delta is still Spark 3.5 - testing requires building unidoc for (unfortunately) ALL projects in build.sbt. that breaks since spark master uses JDK 17 but delta-iceberg uses JDK 8. thus, we disable unidoc for delta-spark compiling against spark-master for now. - Delta: creates `spark-3.5` and `spark-master` folders. Delta will be able to cross compile against both. These folders will contain `shims` (code that will be selectively pulled to compile against a single spark version) but also spark-version-only code ### What does this PR NOT do? - this PR does not update any build infra (GitHub actions) to actually compile or test delta-spark against Spark Master. That will come later. ## How was this patch tested? Existing tests. 
`build/sbt -DsparkVersion=3.5 spark/test` ✅ `build/sbt -DsparkVersion=master spark/compile` ✅ `build/sbt -DsparkVersion=master spark/test` ❌ (expected, these fixes will come later) ## Does this PR introduce _any_ user-facing changes? No commit 3dcbbb807673ad4946cde393cfcd46d45233d031 Author: Venki Korukanti Date: Wed Apr 10 14:53:56 2024 -0700 [Kernel] Push predicate on partition values to checkpoint reader in state reconstruction (#2872) ## Description Converts the partition predicate into a filter on `add.partitionValues_parsed.`. This predicate is pushed to the Parquet reader when reading the checkpoint files during the state reconstruction. This helps prune reading checkpoint files that can't possibly have any scan files satisfying the given partition predicate. This can be extended in future to even support pushdown of predicate on data columns as well. ## How was this patch tested? Unittests commit 0fe578b59398694f3e1f3985e9d5119c79427b63 Author: Sumeet Varma Date: Wed Apr 10 08:46:02 2024 -0700 [Spark] Rename FileNames.deltaFile to FileNames.unsafeDeltaFile (#2838) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Previously, certain code paths assumed the existence of delta files for a specific version at a predictable path `_delta_log/$version.json`. This assumption is no longer valid with managed-commits, where delta files may alternatively be located at `_delta_log/_commits/$version.$uuid.json`. We explicitly rename the old method to `unsafeDeltaFile` to warn future users about it being incorrect for tables with Managed Commits. To not break dependent systems, plan: 1. Update all delta-spark usages to use the unsafe method. (current PR) 2. Deprecate the deltaFile method. (current PR) 3. Remove the deprecated method once it's proven to be safe. (future PR) ## How was this patch tested? UTs ## Does this PR introduce _any_ user-facing changes? 
No commit f012f0bab99c8d0b8470f4a1df0e767a20f8e236 Author: Andreas Chatzistergiou <93710326+andreaschat-db@users.noreply.github.com> Date: Wed Apr 10 15:35:15 2024 +0200 [Spark] Always create checkpoint when dropping reader+writer features (#2860) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description In the DROP FEATURE command we generate a checkpoint after cleaning reader+writer feature traces to allow metadata cleaning operations to truncate history before the cleanup. However, the checkpoint is not created if the cleanup operation does not do any work. This might cause an issue in some cases if the cleanup operation does not perform any work but there are historical traces that needed to be truncated. A scenario like this may occur if the user manually cleans the feature traces before invoking the DROP FEATURE command. This PR resolves this issue by relaxing the conditions the checkpoint is created. Note, this has as a side effect to create a checkpoint in both invocations of the DROP FEATURE command. ## How was this patch tested? Added a new test in DeltaProtocolVersionSuite. ## Does this PR introduce _any_ user-facing changes? No. commit 23b7c17628c21881fbefd04db11a31c973205d95 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Wed Apr 10 00:58:23 2024 +0200 [Spark] Add read support for RowId (#2856) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description 1. Add the Analyzer Rule `GenerateRowIds` to generate default Row IDs. 2. Add the `row_id` field to the `_metadata` column for Delta tables, allowing us to read the `row_id` from the file metadata after it is stored. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? No. 
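
To make "default Row IDs" in #2856 a little more concrete: as I read the row tracking section of the Delta protocol, a row that has no materialized Row ID is assigned the file's `baseRowId` plus the row's position within the file, and a stored (materialized) Row ID takes precedence when present. The following is a minimal plain-Java sketch of that fallback only. It is not the `GenerateRowIds` analyzer rule, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Delta. */
final class DefaultRowIds {
    private DefaultRowIds() {}

    /**
     * Returns the materialized Row ID when one was stored for the row,
     * otherwise the default: the file's baseRowId plus the row's position.
     */
    static long rowId(Long materializedRowId, long baseRowId, long positionInFile) {
        if (materializedRowId != null) {
            return materializedRowId;
        }
        return baseRowId + positionInFile;
    }

    /** Default Row IDs for a file with numRows rows and nothing materialized. */
    static List<Long> defaultRowIds(long baseRowId, int numRows) {
        List<Long> ids = new ArrayList<>(numRows);
        for (int pos = 0; pos < numRows; pos++) {
            ids.add(rowId(null, baseRowId, pos));
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(defaultRowIds(100L, 3)); // freshly written file
        System.out.println(rowId(7L, 100L, 2L));    // a preserved Row ID wins
    }
}
```

In this sketch a freshly written file with `baseRowId` 100 yields IDs 100, 101, 102, while a row carrying a stored ID keeps it regardless of where it sits in the file.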
commit 3b50a92970bffc32b02e51676fa75d5baee408e0 Author: Venki Korukanti Date: Tue Apr 9 14:31:42 2024 -0700 [Kernel] Support filter pushdown in default parquet reader (#2692) ## Description Pushing down the predicate helps read fewer records, especially when getting the scan files and reading multi-part checkpoint files. The library we currently use, `parquet-mr`, already supports pruning records based on the stats in row groups and filtering individual records as each record is read. This PR converts the given Kernel `Predicate` into a `parquet-mr` predicate and passes it as input to the `parquet-mr` reader. The support is only to prune the row groups using the pushed-down predicate. The individual record-level filter is disabled for the following reasons: * `parquet-mr` materializes the entire record first before evaluating the predicate. This causes implementation challenges around the current `ColumnarBatch` construction out of the rows returned by `parquet-mr` * We have additional partition pruning/data skipping that helps prune the records anyway. Current support is on the following column types: BYTE, SHORT, INT, LONG, FLOAT, DOUBLE, DATE, BOOLEAN, STRING, BINARY. Timestamp is currently not supported, as the most popular physical format is INT96, which can't be used for stats-based row group pruning. Supported operators: eq, gt, lt, gte, lte, not, and, or Resolves #2667 ## How was this patch tested? Unit tests commit 9d4a1a525aa9dbe55623b0b1238ce680fe4df0b5 Author: Wenchen Fan Date: Tue Apr 9 23:38:17 2024 +0800 [Spark] do not hardcode the default conf value of READ_SIDE_CHAR_PADDING (#2861) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description In `WriteIntoDelta.scala`, there is a small mistake: we hardcode the default value of `SQLConf.READ_SIDE_CHAR_PADDING` to `false` instead of respecting the config definition. ## How was this patch tested? Existing test. 
## Does this PR introduce _any_ user-facing changes? no commit 5fdba3699e82ed19290fc3f314ed3bf47c9928c5 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Tue Apr 9 17:31:09 2024 +0200 [Spark] Enable Row Tracking outside of testing (#2866) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Delta Protocol for Row IDs was introduced in this PR: https://github.com/delta-io/delta/pull/1610 Support for writing fresh row IDs / row commit versions was introduced in the following PRs: - https://github.com/delta-io/delta/pull/1723 - https://github.com/delta-io/delta/pull/1781 - https://github.com/delta-io/delta/pull/1896 **This is sufficient to enable row tracking on a table and write to a table that has row tracking enabled** but not to actually read row IDs / row commit versions back, which is also being added in Delta at the moment ([read BaseRowId](https://github.com/delta-io/delta/commit/283ac02c0510ce67744ff5c410ca416f7fbaa0b9). [read defaultRowCommitVersion](https://github.com/delta-io/delta/pull/2795), [read RowId](https://github.com/delta-io/delta/pull/2856)...) Using row tracking is currently only allowed in testing, this change allows enabling row tracking outside of testing so that the upcoming Delta 3.2 release includes support for writing to tables with row tracking enabled, making Delta writers future-proof. ## How was this patch tested? Tests have already been added in previous changes, this only flips the switch to let users enabled Row Tracking outside of tests. ## Does this PR introduce _any_ user-facing changes? 
Users are now able to enable Row Tracking when creating a delta table: ``` CREATE TABLE tbl(a int) USING DELTA TBLPROPERTIES ('delta.enableRowTracking' = 'true') ``` commit dc9ae44b3175c54383fb6e448486c52cae57e901 Author: Lars Kroll Date: Tue Apr 9 17:30:44 2024 +0200 [Spark] Don't use fixed tahoe id for testing (#2870) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description * Do not use the literal "testId" for every single tahoeId in testing. This can hide issues where we use the wrong id (for example as a key in a map). * Fix one of these issues with the partition stats for auto compaction, where we used a completely random tableId as the key, rather than the tableId that commitLarge just wrote. * Add a (temporary) flag to not include the tahoeId in the equals and hashCode methods of TahoeLogFileIndex. Having an unstable external field there is prone to race conditions (of the form this != this) and to losing instances in hash sets/maps. So ideally we should make this the default. ## How was this patch tested? Testing-only PR. ## Does this PR introduce _any_ user-facing changes? No. commit d330cb18e390b8c431697436e9828659be51b26e Author: Jiaheng Tang Date: Mon Apr 8 17:01:21 2024 -0700 [Spark] Remove preview guard for clustered table (#2865) commit 79a0581bddb2b1f9c21972662c1646d5153a30c9 Author: Fred Storage Liu Date: Mon Apr 8 16:12:03 2024 -0700 fix Delta UniversalFormat bug so it supports overwrite table (#2852) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Fix a Delta UniversalFormat bug so that it supports table overwrite. It should use the full metadata for the later detection logic. ## How was this patch tested? Manual test with dataframe overwrite. 
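
Going back to the `TahoeLogFileIndex` point in #2870, the risk of keeping an unstable field in `equals`/`hashCode` can be shown with a small stand-alone Java sketch. The class below is a hypothetical stand-in, not Delta's `TahoeLogFileIndex`, and the field names are assumptions made for illustration. It shows only the "losing instances in hash sets/maps" half of the problem, not the race condition.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

/** Hypothetical stand-in for an index object whose identity includes a mutable id. */
final class IndexWithUnstableId {
    final String path;
    String externalId; // mutable: may be reassigned after construction

    IndexWithUnstableId(String path, String externalId) {
        this.path = path;
        this.externalId = externalId;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof IndexWithUnstableId)) return false;
        IndexWithUnstableId other = (IndexWithUnstableId) o;
        return path.equals(other.path) && Objects.equals(externalId, other.externalId);
    }

    @Override
    public int hashCode() {
        // The unstable field participates in the hash, which is the problem.
        return Objects.hash(path, externalId);
    }
}

final class UnstableHashDemo {
    private UnstableHashDemo() {}

    /** Returns true when the set can no longer find the very instance it holds. */
    static boolean lostAfterIdChange() {
        Set<IndexWithUnstableId> set = new HashSet<>();
        IndexWithUnstableId idx = new IndexWithUnstableId("/tbl", "id-1");
        set.add(idx);
        idx.externalId = "id-2"; // hash code changes while the object is stored
        return !set.contains(idx);
    }

    public static void main(String[] args) {
        System.out.println("instance lost: " + lostAfterIdChange());
    }
}
```

Dropping `externalId` from both methods, which is roughly what the temporary flag is described as doing for the tahoeId, would keep the lookup stable.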
commit afa0e395e6c46f94b03d7c500a15edf713e9abea Author: Fred Storage Liu Date: Mon Apr 8 15:32:53 2024 -0700 Minor fix: add IcebergCompatV2 to appendix supported list of table features (#2868) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Minor fix: add IcebergCompatV2 to appendix supported list of table features commit 64c32071ec0817efd482d64aa20984a8834ced20 Author: Venki Korukanti Date: Fri Apr 5 22:32:45 2024 -0700 [Kernel] Add support for reading tables with `timestamp_ntz` type columns 1) Add read support (schema, parquet reader and type) 2) Add partition pruning on ts_ntz type columns 3) Add expression evaluator support in default handler for ts_ntz type columns. Close #2279 commit 615c184a3677afb26b7ca82c630ebe595421555b Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Sat Apr 6 02:25:21 2024 +0200 [Spark] Add ConflictCheckerRowIdSuite (#2854) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Add the `ConflictCheckerRowIdSuite`, to test the Row Tracking's Conflict Checker logic. We want to ensure that each row ID can only be assigned by one transaction. If we detect that a concurrent transaction also assigned row IDs, then we will assign different row IDs to the files added by the current transaction. ## How was this patch tested? Add UTs. ## Does this PR introduce _any_ user-facing changes? No. commit 9e6cc9ea788c074bf85db271ff9e2c8ee8db2dc1 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Thu Apr 4 23:42:53 2024 +0200 [Spark] Rename BusyWait to ConcurrencyHelpers (#2851) #### Which Delta project/connector is this regarding? 
- [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description We will soon add some common helper functions for the Concurrency Testing Framework, so we are renaming the object `BusyWait` to `ConcurrencyHelpers` in preparation for this. ## How was this patch tested? Existing UTs, just some small name changes. ## Does this PR introduce _any_ user-facing changes? No. commit b8aa3bd973e9a91765edbeb0f492ccfbf2fdc44c Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Thu Apr 4 21:56:07 2024 +0200 [Spark] Make sure the RowId Suite tests use the Delta Format when reading from path (#2849) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Specify the Delta format in the `RowIdSuite` tests that read from a path without it. If we don't specify the format when reading from the path, the default would be Parquet. We would like to make sure all the relevant Row Tracking tests use the Delta format to properly test Row Tracking. ## How was this patch tested? Existing UTs. ## Does this PR introduce _any_ user-facing changes? No. commit c9739a1af6d8c3ad74e17054536fd319a261f5a6 Author: Paddy Xu Date: Thu Apr 4 19:58:25 2024 +0200 [Spark] [TEST-ONLY] Fix generating FileActions with unescaped path (#2847) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR fixes an issue (in `DeltaVacuumSuite`) where we generate `RemoveFile`s with an unescaped absolute path. It also improves the naming of a util method to emphasize the path that should be escaped. ## How was this patch tested? Test-only. ## Does this PR introduce _any_ user-facing changes? No.
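Editorial sketch for "[Spark] [TEST-ONLY] Fix generating FileActions with unescaped path (#2847)" above: a rough Python illustration of why the path recorded in a file action should be escaped. It assumes the recorded path is a URI-style string that readers percent-decode before use. The helper names `escape_action_path` and `unescape_action_path` and the example path are hypothetical, and the exact set of characters Delta leaves unescaped may differ from the `safe` set chosen here.

```python
from urllib.parse import quote, unquote


def escape_action_path(raw_path):
    """Percent-encode a filesystem path so it can be stored as a URI-style string.

    '/', ':' and '=' are left as-is (an assumption made for readability);
    a literal '%' is always encoded as '%25'.
    """
    return quote(raw_path, safe="/:=")


def unescape_action_path(recorded_path):
    """Decode a recorded path back to the filesystem path, as a reader would."""
    return unquote(recorded_path)


if __name__ == "__main__":
    # A directory whose name literally contains the characters '%20'.
    raw = "/data/a%20b/part 1.parquet"

    escaped = escape_action_path(raw)
    print(escaped)
    print(unescape_action_path(escaped) == raw)  # round trip preserves the path

    # Recording the raw path without escaping makes the reader decode '%20'
    # into a space and therefore point at a different file.
    print(unescape_action_path(raw))
```

The round trip only works when the writer escapes exactly once; recording the raw path skips that step, and the reader's decoding then changes any literal percent sequence in the file name.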
commit bac59b8d257d192389365f0375c81d6baafe5e83 Author: Paddy Xu Date: Thu Apr 4 17:49:51 2024 +0200 [Protocol] Rephase a message about Default column values (#2810) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Delta Protocol) ## Description This PR follows https://github.com/delta-io/delta/pull/2794 to rephrase a message regarding table protocol upgrades. ## How was this patch tested? Not needed. ## Does this PR introduce _any_ user-facing changes? No. commit 1f464232222be952e998f60769e5dc80415d86a9 Author: Prakhar Jain Date: Thu Apr 4 08:49:27 2024 -0700 [Spark] Add CommitStore support in more snapshot create codeflow (#2845) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Add CommitStore support in more snapshot create codeflow ## How was this patch tested? Existing UTs commit e3a481bd6c42a4f91686377d78ec9d9c934e27ee Author: Venki Korukanti Date: Thu Apr 4 07:42:17 2024 -0700 [Kernel] Adding logging in few important places to help in debugging Add log4j logging in the following cases: 1) Time taken to get the version for a given timestamp in time travel 2) When the `_last_checkpoint` metadata file is missing or corrupted. 3) Time to find the last completed checkpoint before a version. 4) Time to list the delta files after a given last checkpoint 5) Time to construct a `LogSegment` 6) Time to construct a snapshot with a `LogSegment` (includes loading P&M) commit 5fdbbb6ec18e9da162c2afd4847aa23b8d1202f2 Author: Sumeet Varma Date: Wed Apr 3 13:10:10 2024 -0700 [Protocol] Accept the Vacuum Protocol Check RFC proposal (#2808) ## Protocol Change Request ### Description Adds the VacuumProtocolCheck PROTOCOL change.
Design Doc: https://docs.google.com/document/d/15o8WO2T0vN21S5JG-FT_ZNhXFCWyh0i9tqhr9kBmZpE/edit#heading=h.4cz970y1mk93 closes https://github.com/delta-io/delta/issues/2630 ### Willingness to contribute The Delta Lake Community encourages protocol innovations. Would you or another member of your organization be willing to contribute this feature to the Delta Lake code base? - [x] Yes. I can contribute. - [ ] Yes. I would be willing to contribute with guidance from the Delta Lake community. - [ ] No. I cannot contribute at this time. commit 22ccb0110f9ee836e1928c96328276e9a295e016 Author: Dhruv Arya Date: Wed Apr 3 09:02:30 2024 -0700 [Spark] Make DeltaHistoryManager.getHistory aware of ICT (#2834) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Return the ICT (In-Commit Timestamp) timestamp for ICT-enabled ranges in DeltaHistoryManager.getHistory when ICT is currently enabled. ## How was this patch tested? New tests: 1. When all of history has ICT enabled. 2. When requested history is not an ICT range. 3. When requested range contains both ICT and non-ICT commits. ## Does this PR introduce _any_ user-facing changes? No visible differences for the user, except that the timestamp returned by `DESCRIBE HISTORY` will be different depending on whether ICT is currently enabled. commit 01c0ef91565d9a3e06fb9f62295f7c80fba09351 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Tue Apr 2 20:53:31 2024 +0200 [Spark] Add TransactionExecutionObserver (#2816) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description 1. Add an observer field to `OptimisticTransaction` which can be set via a thread-local to a custom instrumentation class.
Instrumentation methods are invoked when creating a new transaction, for `prepareCommit`, for `doCommit`, and for completion of a transaction (failure or successful). 2. The default observer simply performs no-ops. 3. Added a testing `PhaseLocking` observer implementation that allows both observing, but also blocking the transaction's thread until unblocked by another thread. This allows fine control of how the transaction progresses, which is needed for some testing scenarios. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? No. commit c88db07a60e7ed6987dc9c956989ceaaedfe8458 Author: Sumeet Varma Date: Tue Apr 2 11:51:52 2024 -0700 [Spark] Backfill commit files before checkpointing or minor compaction in managed-commits (#2823) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description With managed-commit, commit files are not guaranteed to be present in the _delta_log directory at the time of checkpointing or minor compactions. While it is possible to compute a checkpoint file without backfilling, writing the checkpoint file in the log directory before backfilling the relevant commits will leave gaps in the directory structure. This can cause issues for readers that are not communicating with the CommitStore. To address this problem, we now backfill commit files up to the committedVersion before performing a checkpoint or minor compaction operation ## How was this patch tested? UTs ## Does this PR introduce _any_ user-facing changes? No commit a014fee827757035d090c25faac7263021ea9286 Author: Fred Storage Liu Date: Tue Apr 2 11:50:46 2024 -0700 [DeletionVectorsSuite] Delete commit json then write the new json (#2836) #### Which Delta project/connector is this regarding? 
- [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Delete the commit json, then write the new json, as the environment may not be able to overwrite the file. ## How was this patch tested? UT commit b3de6c2b98318f0571df2f6a6018138b6ddc2475 Author: Carmen Kwan Date: Tue Apr 2 20:50:22 2024 +0200 [Spark][TEST-ONLY] Add DDLTestUtils helper (#2811) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description In this PR, we add DDLTestUtils helper and ColumnSpec traits so that we have flexible interfaces for defining Delta tables. This will enable us to test both Scala and SQL interfaces more easily. This PR introduces the helpers but this interface will not be used until later. We don't reuse the helper functions in GeneratedColumnSuite because they are less flexible. We can rewrite GeneratedColumnSuite to use DDLTestUtils later on. ## How was this patch tested? Unused code. Introducing the interface only. ## Does this PR introduce _any_ user-facing changes? No. commit 8424d5857c314dca0f61ae2bc44f3892fb74680b Author: Sumeet Varma Date: Tue Apr 2 11:49:13 2024 -0700 [Spark]Handle CommitFailedException in OptimisticTransaction::commitLarge (#2819) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Handle CommitFailedException in OptimisticTransaction::commitLarge depending on the values of `retryable` and `conflict` in the exception. ## How was this patch tested? UTs to cover the scenarios ## Does this PR introduce _any_ user-facing changes? No commit 65ef9d139f1b4e0342b177d4e46d865c757e3acc Author: Simon Dahlbacka Date: Tue Apr 2 20:15:52 2024 +0300 [docs] Update README.md (#2840) fix a spelling error #### Which Delta project/connector is this regarding?
- [ ] Spark - [ ] Standalone - [x] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description ## How was this patch tested? ## Does this PR introduce _any_ user-facing changes? commit fb03330f37a202566b070ac447be597d591e9d44 Author: Scott Sandre Date: Tue Apr 2 09:58:50 2024 -0700 [SPARK] [Delta X-Compile] Get `delta-spark` tests cross-compiling against Spark 3.5.0 and Spark Master (#2835) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description We want Delta to be able to cross-compile against Spark 3.5 and Spark Master (4.0). Previously we have gotten Delta production code to compile (in a branch that enables shimming). This PR gets the test code to compile, too, using changes that are forwards-compatible with spark master. ## How was this patch tested? Existing tests. commit 537ed8ee0be983579873851855ba3e96b20004bd Author: Sumeet Varma Date: Mon Apr 1 13:56:18 2024 -0700 [Spark] Make DeltaHistoryManager::getHistory more lenient towards incorrect ranges (#2825) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Allow invalid ranges in DeltaHistoryManager::getHistory and return the valid subset of the commits. ## How was this patch tested? UTs ## Does this PR introduce _any_ user-facing changes? No commit 9576d0823b729348437833745b0d65451d95ff93 Author: Fred Storage Liu Date: Fri Mar 29 18:51:35 2024 -0700 only overwrite schema and field id for create table Iceberg conversion txn (#2820) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description The existing Uniform conversion txn has logic to launch a second txn to set schema, after the first CREATE TABLE or REPLACE TABLE txn, to set the correct field ids because Iceberg may reassign those.
This behavior has the following flaws: Firstly, Iceberg core does NOT reassign field id for REPLACE txn if the table already exists (in Uniform case it always does). So setting the schema for REPLACE TABLE is not necessary. Secondly, Uniform uses the replace transaction when the number of snapshots to convert exceeds the threshold. The replace txn will set the last Delta converted version as -1, which can be confusing or erroneous. This PR fixes the above flaws by NOT setting the schema for REPLACE txn. ## How was this patch tested? Manually tested. Unit test will come in separate PR. commit 5ae57cc1ea58efd8b1e6cbbe13fbaeb51a231c4f Author: Venki Korukanti Date: Fri Mar 29 14:55:13 2024 -0700 [Kernel] Iteratively search (1000 versions at a time) for last checkpoint before a given version (#2817) ## Description Adds a utility method for iteratively searching backwards 1000 delta versions at a time from a given version to find the checkpoint. This utility method is used when loading a snapshot by version id. This is similar to how delta-spark does it. More details are [here](https://docs.google.com/document/d/13Nock1I8-143Dwidj8rMpgt3wAicrOI2OvDJt3OufOQ/edit?usp=sharing). ## How was this patch tested? Existing tests and mock unittests for granular tests. commit 296e100af646f52752d7db9e85e2436a0f4c842c Author: Dhruv Arya Date: Fri Mar 29 13:07:08 2024 -0700 [Spark] InCommitTimestamps: Use parallel radix search for ICT commit history (#2813) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Makes DeltaHistoryManager.getActiveCommitAtTime aware of `inCommitTimestamps`. DeltaHistoryManager.getActiveCommitAtTime will now bisect the history of commits into non-ICT and ICT commits based on the ICT enablement version and only search the range of commits relevant to the timestamp.
If the relevant commit range is composed of ICT commits, `getActiveCommitAtTime` will perform a parallelized search (see `getActiveCommitAtTimeFromICTRange` for details). ## How was this patch tested? New tests in InCommitTimestampSuite. ## Does this PR introduce _any_ user-facing changes? No commit 40926cb1291c077d94010c5c8c63e5f448b5a1b9 Author: Venki Korukanti Date: Fri Mar 29 12:50:19 2024 -0700 [Kernel][Tests] Refactor `TableClient` mocking code duplication into a single file (#2824) ## Description We have been duplicating the mocking `TableClient` across the test suites. Consolidate and make common utilities/classes that allow mocking `TableClient` with minimal code. `MockTableClientUtils.scala` - is the base trait for mocking `TableClient`. ## How was this patch tested? Just a refactor commit 74ee8763ca2449505bc4addefdcb4ce34fb5f1c6 Author: Prakhar Jain Date: Fri Mar 29 11:50:28 2024 -0700 [Spark] Minor refactor to ManagedCommitBaseSuite utility - add comments (#2821) ## Description Minor refactor to ManagedCommitBaseSuite utility - add comments. ## How was this patch tested? Existing UTs ## Does this PR introduce _any_ user-facing changes? No commit 09835436e38ab54ca8decd0d73a3effad92d1c9a Author: Chirag Singh <137233133+chirag-s-db@users.noreply.github.com> Date: Fri Mar 29 10:11:35 2024 -0700 [Spark] Add CLONE support for clustered tables (#2802) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Adds support for CLONE operation on clustered tables. ## How was this patch tested? Test added to `ClusteredTableDDLSuite` ## Does this PR introduce _any_ user-facing changes? No. commit 902830369662f5a84e987b3a97e23f916da104ca Author: Tim Brown Date: Fri Mar 29 11:17:54 2024 -0500 Hudi uniform support (#2333) #### Which Delta project/connector is this regarding? 
- [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (Uniform) ## Description - This change aims to add support for Hudi in Uniform - The changes were mostly copied from [OneTable](https://github.com/onetable-io/onetable) which has a working version of Delta to Hudi already ## How was this patch tested? Some basic tests are added ## Does this PR introduce _any_ user-facing changes? Yes, this allows users to expose their Delta tables as Hudi commit 9c302b012f441d0f0919c3845196f967f2e78bbe Author: Scott Sandre Date: Fri Mar 29 09:10:30 2024 -0700 [Spark] [Delta X-Compile] Refactor AnalysisException to DeltaAnalysisException (batch 3) (#2815) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description We want Delta to cross-compile against both Spark 3.5 and Spark Master (4.0). Unfortunately, the constructor `new AnalysisException("msg")` became protected in Spark 4.0, meaning that all such occurrences do not compile against Spark 4.0. Thus, we decided to: - replace `AnalysisException` with `DeltaAnalysisException` - use errorClasses - assign temporary error classes when needed to speed this along This PR fixes all remaining related compilation errors. ## How was this patch tested? New UTs in `DeltaErrorsSuite`. Also, cherry-picked to the oss-cross-compile branch (https://github.com/delta-io/delta/pull/2780) and cross-compiled: - (this branch) Spark 3.5: ✅ - (this branch) Spark 4.0: no remaining compilation errors. commit acd9c6c5564ed9a4aed9155ff0236151d3d27e22 Author: Sumeet Varma Date: Fri Mar 29 08:19:46 2024 -0700 [Spark]Optimize Directory Creation in ensureLogDirectoryExists method (#2796) ## Description This PR updates the ensureLogDirectoryExists method to optimistically handle directory creation by attempting to create the directory before checking its existence.
This is efficient because we're assuming it's more likely that the directory doesn't exist and it saves a filesystem existence check in that case. Cloud Hadoop implementations are expected to throw org.apache.hadoop.fs.FileAlreadyExistsException if the file already exists. e.g. [S3AFileSystem.java](https://github.com/apache/hadoop/blob/fc166d3aec7c95110a8cd4ef6ce1fbf4955107e5/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L3763) Unix-based File Systems are expected to throw java.nio.file.FileAlreadyExistsException if the file already exists. To cover all bases, including unforeseen file systems, it retains a final existence check for exceptions outside these specific cases, ensuring robustness without compromising functionality. ## How was this patch tested? Existing UTs ## Does this PR introduce _any_ user-facing changes? No commit 1bdd4687d8c48dd66cab0407f9afa81fb9d1a95f Author: Sumeet Varma Date: Fri Mar 29 08:19:34 2024 -0700 [Spark]Parallelize deltaLog listFrom and commitStore getCommits call (#2775) ## Description Improve performance of listDeltaCompactedDeltaAndCheckpointFilesWithCommitStore by making parallel calls to both the file-system and a commit-store (if available), reconciling the results to account for concurrent backfill operations, and potentially making another list call on the file-system to ensure a comprehensive list of file statuses. ## How was this patch tested? UTs ## Does this PR introduce _any_ user-facing changes? No commit 734df2f641e8173c828420150b41ffd69d57b02c Author: Sumeet Varma Date: Fri Mar 29 08:19:23 2024 -0700 [Spark] Update deltaLog.getHistory to serve [start, end] version range (#2818) ## Description Previously, `deltaLog.getHistory(start, endOpt)` served the `[start, endOpt]` range. However, this behavior was accidentally changed in #2799 to serve the `[start, endOpt)` range. We revert it back to the original behavior and also update the function comment. ## How was this patch tested?
UTs ## Does this PR introduce _any_ user-facing changes? No commit f0b0878c88294f79562c8d86efbeb8cc94c79270 Author: Tom van Bussel Date: Fri Mar 29 02:09:07 2024 +0100 [PROTOCOL RFC] Column Mapping Usage Tracking (#2683) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (fill in here) ## Description Adds the proposal for spec change Column Mapping Usage Tracking (see https://github.com/delta-io/delta/issues/2682) to the RFC folder. ## How was this patch tested? N/A ## Does this PR introduce _any_ user-facing changes? N/A commit f37dc86b87f7d6229bddb9d259b186a406d62ce6 Author: Venki Korukanti Date: Thu Mar 28 14:14:55 2024 -0700 [Kernel] Retry `_last_checkpoint` loading in case of failures (#2812) ## Description Currently, Kernel stops on first failure at loading the `_last_checkpoint` file (which contains info about the last checkpoint). We should have mechanisms to retry just like how Delta-Spark retries on retryable failures. More details are [here](https://docs.google.com/document/d/13Nock1I8-143Dwidj8rMpgt3wAicrOI2OvDJt3OufOQ/edit?usp=sharing). ## How was this patch tested? Added mock unittests ## Does this PR introduce _any_ user-facing changes? commit 06222bc5006adbcf118b08560e35e0945f8f5a5b Author: Jintian Liang <105243217+jintian-liang@users.noreply.github.com> Date: Thu Mar 28 11:10:37 2024 -0700 [Spark] Add error codes to Delta concurrent exceptions (#2800) ## Description This PR bolsters the concurrent exceptions Delta throws by adding error codes to them and plugs them into the DeltaThrowable framework. ## How was this patch tested? Added unit tests to verify the error codes get populated correctly for these exceptions. commit b0ed7775d76f55e91d754873e38212214fe3d7e3 Author: Scott Sandre Date: Wed Mar 27 14:26:37 2024 -0700 [Spark] [Delta X-Compile] Refactor AnalysisException to DeltaAnalysisException (batch 2) (#2805) #### Which Delta project/connector is this regarding? 
- [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description We want Delta to cross-compile against both Spark 3.5 and Spark Master (4.0). Unfortunately, the constructor `new AnalysisException("msg")` became protected in Spark 4.0, meaning that all such occurrences do not compile against Spark 4.0. Thus, we decided to: - replace `AnalysisException` with `DeltaAnalysisException` - use errorClasses - assign temporary error classes when needed to speed this along This PR fixes compilation errors [6, 11] of ~20. ## How was this patch tested? New UTs in `DeltaErrorsSuite`. Also, cherry-picked to the oss-cross-compile branch (https://github.com/delta-io/delta/pull/2780) and cross-compiled: - (this branch) Spark 3.5: ✅ - (this branch) Spark 4.0: 10 compilation errors. (the oss-cross-compile branch has 16 compilation errors, this is expected) ## Does this PR introduce _any_ user-facing changes? No commit bcb1960b2935d7be27c896d50aa5aa91446c5a10 Author: Sumeet Varma Date: Wed Mar 27 08:31:04 2024 -0700 [Spark] Move VacuumProtocolCheck Table Feature from Dev/Testing to Prod (#2807) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description We move the Table Feature from dev/test only to prod as part of this PR. ## How was this patch tested? Existing UTs ## Does this PR introduce _any_ user-facing changes? No commit 10edc9c2bb3691643dc8f6e931918beeab386d44 Author: Yousof Hosny <132951652+yhosny@users.noreply.github.com> Date: Wed Mar 27 08:07:08 2024 -0700 Support Java8 time objects (#2752) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Added support for Java8 time objects when Java8Api is enabled.
The following query fails when Java8Api is enabled: ``` INSERT INTO $tableName REPLACE where (DATE IN (DATE('2024-03-11'), DATE('2024-03-12'))) VALUES ('1', DATE('2024-03-13')) ``` ## How was this patch tested? Unit Test. ## Does this PR introduce _any_ user-facing changes? The potential public surface area for this change is limited to only external DateType, TimestampType. By default, Spark converts DateType values to java.sql.Date, TimestampType -> java.sql.Timestamp but with the SQL conf enabled, Spark uses java.time.LocalDate as the external type for DateType and java.time.Instant for TimestampType. Necessity of fix: If users use Spark 3.0 and enable Java 8, they will be unable to parse and use dates in Delta tables correctly, leading to matching errors. --------- Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com> commit 2f9a538a07691ed8c23a37e85c2385ef94f43ff9 Author: Jiaheng Tang Date: Wed Mar 27 08:06:30 2024 -0700 [Spark] Show clusterBy in DESCRIBE HISTORY for clustered tables (#2604) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Resolves #2470 Support showing `clusterBy` in DESCRIBE HISTORY for clustered table. Currently it supports create/replace table operations. The `clusterBy` key will always be present, even if we are not creating a clustered table. This behavior is consistent with `partitionBy`. ## How was this patch tested? New unit tests. ## Does this PR introduce _any_ user-facing changes? Yes, it adds `clusterBy` to DESCRIBE HISTORY's output. commit 637820aae2622ff34c75c89494817fbb30a525e9 Author: Miles Cole <52209784+mwc360@users.noreply.github.com> Date: Wed Mar 27 08:21:07 2024 -0600 [BUG] Default columns is a writer table feature, not a reader feature. 
(#2794) The documentation incorrectly notes that "Enabling default column values for a table upgrades the Delta table version as a byproduct of enabling table features. This protocol upgrade is irreversible. Tables with default column values enabled can only be read in Delta Lake 3.1 and above." Signed-off-by: Miles Cole [m.w.c.360@gmail.com](mailto:m.w.c.360@gmail.com) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Default columns is a writer table feature, not a reader feature. ## How was this patch tested? I confirmed that a table with the defaultColumns table feature was readable by delta lake 3.0 ## Does this PR introduce _any_ user-facing changes? No commit 11385014fc19c5e027199812b4c146ccd36844a1 Author: Sumeet Varma Date: Tue Mar 26 15:54:40 2024 -0700 [Kernel] Add reader support for Tables with vacuumProtocolCheck feature (#2806) ## Description We add reader support for Tables with Vacuum Protocol Check Table Feature in Kernel. As per the protocol, for tables with Vacuum Protocol Check enabled, readers don’t need to understand or change anything new; they just need to acknowledge the feature exists. ## How was this patch tested? - GENERATE_GOLDEN_TABLES=1 build/sbt 'goldenTables/testOnly *GoldenTables -- -z "basic-with-vacuum-protocol-check-feature"' - build/sbt 'testOnly io.delta.kernel.defaults.DeltaTableReadsSuite' ## Does this PR introduce _any_ user-facing changes? No commit 77ce12c1e2f3d87d25513756ea31421c9f3add99 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Tue Mar 26 22:48:24 2024 +0100 [Spark] Add read support for defaultRowCommitVersion (#2795) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Adding the `default_row_commit_version` field to the `_metadata` column for Delta tables. 
This field contains the value in the `defaultRowCommitVersion` field of the `AddFile` action for the file, allowing us to read the `default_row_commit_version` from the file metadata after it is stored. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? No. commit 9012736abb8ea3cfc326d0c648cb89b543f97d7c Author: Jason Teoh Date: Tue Mar 26 13:38:40 2024 -0700 [Spark] Delete disabled flaky DeltaLogSuite test case (#2776) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Delete a currently ignored test case. ## How was this patch tested? N/A (only a test removal) ## Does this PR introduce _any_ user-facing changes? No commit dbed7a9d53702b8152985f05d14d6e3b4a7eafa1 Author: Sumeet Varma Date: Tue Mar 26 08:49:01 2024 -0700 [Spark]Update Delta File Resolution Logic with introduction of Managed Commits (#2799) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR introduces necessary adjustments in our approach to locating delta files, prompted by the adoption of managed-commits. Previously, certain code paths assumed the existence of delta files for a specific version at a predictable path `_delta_log/$x.json`. This assumption is no longer valid with managed-commits, where delta files may alternatively be located at `_delta_log/_commits/$x.$uuid.json`. We attempt to locate the correct delta files from the Snapshot's LogSegment now. ## How was this patch tested? Add managed-commits to some of the existing UTs ## Does this PR introduce _any_ user-facing changes? No commit aa0af003068a9491b7b6830712082af3f6fc87de Author: Scott Sandre Date: Mon Mar 25 18:47:31 2024 -0700 [Spark] [Delta X-Compile] Refactor AnalysisException to DeltaAnalysisException (batch 1) (#2798) #### Which Delta project/connector is this regarding? 
- [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description We want Delta OSS to cross-compile against both Spark 3.5 and Spark Master (4.0). Unfortunately, the constructor `new AnalysisException("msg")` became protected in Spark 4.0, meaning that all such occurrences do not compile against Spark 4.0. Thus, we decided to: - replace `AnalysisException` with `DeltaAnalysisException` - use errorClasses - assign temporary error classes when needed to speed this along This PR fixes the first 5 (of ~20) compilation errors. ## How was this patch tested? New UTs in `DeltaErrorsSuite`. Also, cherry-picked to the oss-cross-compile branch (https://github.com/delta-io/delta/pull/2780) and cross-compiled: - (this branch) Spark 3.5: ✅ - (this branch) Spark 4.0: 11 compilation errors. (the oss-cross-compile branch has 16 compilation errors, this is expected) ## Does this PR introduce _any_ user-facing changes? No commit b889eeb9151df95aeb4e41a6efa14b3f16cfa2b4 Author: Ami Oka Date: Mon Mar 25 14:49:05 2024 -0700 refactor isSameDomain for JsonMetadataDomain (#2767) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This refactors `RowTrackingMetadataDomain.isRowTrackingDomain` to `JsonMetadataDomain.isSameDomain` so that it can be useful for other domain metadata. ## How was this patch tested? Existing unit tests. commit 3cdcdd190e6e38519e1c1435951526d12724d4b7 Author: Felipe Pessoto Date: Mon Mar 25 13:10:07 2024 -0700 [Build] Remove orphan file (#2791) #### Which Delta project/connector is this regarding?
- [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (Build) ## Description MiMaExcludes.scala has theoretically been renamed to SparkMimaExcludes.scala: https://github.com/delta-io/delta/pull/1952/files#diff-e7ee88f4ceae7019f7bff3eef41ee5cee6da1e54c436df887e3fd4c96c282eda For some reason both files exist, the former not being used. @allisonport-db ## How was this patch tested? Unit Tests ## Does this PR introduce _any_ user-facing changes? No Signed-off-by: Felipe Pessoto commit 283ac02c0510ce67744ff5c410ca416f7fbaa0b9 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Mon Mar 25 18:44:56 2024 +0100 [Spark] Add read support for baseRowID (#2779) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Adding the `base_row_id` field to the `_metadata` column for Delta tables. This field contains the value in the `baseRowId` field of the `AddFile` action for the file, allowing us to read the `base_row_id` from the file metadata after it is stored. ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? No. commit 4b043b1d885979d2c01e15bb42550b2b77bafad8 Author: Fred Storage Liu Date: Mon Mar 25 10:36:44 2024 -0700 Refactor to Delta shallow clone (#2789) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Minor refactor to Delta shallow clone to reorder and simplify logic. ## How was this patch tested? UT commit ff9d819bd9a2506d27bba206a09a1b661c97c022 Author: Andreas Chatzistergiou <93710326+andreaschat-db@users.noreply.github.com> Date: Mon Mar 25 15:38:39 2024 +0100 [Spark] Refactor deduplication in DeltaMergeBuilder to utilize DeduplicateRelations Rule. (#2771) #### Which Delta project/connector is this regarding?
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Currently, `DeltaMergeBuilder` performs a custom relation deduplication logic reminiscent of the `DeduplicateRelations` analyzer rule. This duplicates logic and might cause issues from unforeseen interaction with the rule during the analysis of the MERGE plan. This PR refactors the deduplication in `DeltaMergeBuilder` to utilise the `DeduplicateRelations` rule. The new flow is as follows: 1. Un-resolve the ambiguous pre-resolved references. 2. Invoke DeduplicateRelations to do the deduplication work with FakeLogicalPlan. 3. Use the deduplicated source and target plan in the final MERGE command. ## How was this patch tested? Existing tests. `MergeIntoScalaSuite` covers self merges with the Scala API. ## Does this PR introduce _any_ user-facing changes? No. commit a172276e945667861c350507feb09ccc9da0287f Author: Johan Lasperas Date: Mon Mar 25 15:37:36 2024 +0100 Automatic type widening in INSERT (#2785) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This change is part of the type widening table feature. Type widening feature request: https://github.com/delta-io/delta/issues/2622 Type Widening protocol RFC: https://github.com/delta-io/delta/pull/2624 It adds automatic type widening as part of schema evolution in INSERT.
During resolution, when schema evolution and type widening are enabled, type differences between the input query and the target table are handled as follows: - If the type difference qualifies for automatic type evolution: the input type is left as is, the data will be inserted with the new type and the table schema will be updated in `ImplicitMetadataOperation` (already implemented as part of MERGE support) - If the type difference doesn't qualify for automatic type evolution: the current behavior is preserved: a cast is added from the input type to the existing target type. ## How was this patch tested? - Tests are added to `DeltaTypeWideningAutomaticSuite` to cover type evolution in INSERT ## This PR introduces the following *user-facing* changes The table feature is available in testing only, there's no user-facing changes as of now. When automatic schema evolution is enabled in INSERT and the source schema contains a type that is wider than the target schema: With type widening disabled: the type in the target schema is not changed. A cast is added to the input to insert to match the expected target type. With type widening enabled: the type in the target schema is updated to the wider source type. ``` -- target: key int, value short -- source: key int, value int INSERT INTO target SELECT * FROM source ``` After the INSERT operation, the target schema is `key int, value int`. commit 36f95ddd03c45aba669d65ee1a4092cbd122215d Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Fri Mar 22 18:47:18 2024 +0100 Add Atomic Barrier and PhaseLockingTestMixin (#2772) #### Which Delta project/connector is this regarding? 
- [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Added the `AtomicBarrier` and `PhaseLockingTestMixin`, the initial building blocks of the Concurrency Testing framework, in order to make way to add a Suite that tests the interaction of the `ConflictChecker` with the `RowTracking` feature, ensuring Row Tracking is well-tested and getting ready for [enabling Row Tracking outside of testing in Delta](https://github.com/delta-io/delta/pull/2059). ## How was this patch tested? Added UTs. ## Does this PR introduce _any_ user-facing changes? No. commit cb77f540b9a09f003d27606a3dfb5e592d5e6d18 Author: Venki Korukanti Date: Fri Mar 22 09:33:52 2024 -0700 [Kernel] Use `LogStore`s in `listFrom` implementation in default `FileSystemClient` (#2770) ## Description `LogStore`s in `storage` module have file system operations (needed for reading/writing DeltaLogs) implemented for each storage (e.g. s3, GCS etc.) to take into account of the behavior of storage and also efficiently implement certain operations depending upon the storage system support (e.g. fast listing in S3). This PR creates `LogStoreProvider` to get the specific implementation of the `LogStore` for given `scheme`. Also updates the `DefaultFileSystemClient.listFrom` to use the `LogStore.listFrom`. The majority of the code here is copied from the `delta-spark` and `standalone` projects. ## How was this patch tested? Unittests for `LogStoreProvider` and existing integration tests for `DefaultFileSystemClient.listFrom` changes. 
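The scheme-based lookup described in the `LogStoreProvider` commit above can be sketched roughly as follows. This is a plain-Python illustration only, not the Kernel's Java code: the class names, the `log_store_for` helper, and the scheme table are all assumptions made for the example.

```python
from urllib.parse import urlparse


class LogStore:
    """Hypothetical stand-in for a storage-specific log store."""

    def __init__(self, files):
        self._files = sorted(files)

    def list_from(self, path):
        # Return files in the same directory that sort at or after `path`.
        parent = path.rsplit("/", 1)[0]
        return [
            f for f in self._files
            if f.rsplit("/", 1)[0] == parent and f >= path
        ]


class S3LogStore(LogStore):
    """Hypothetical store for S3-like schemes."""


class DefaultLogStore(LogStore):
    """Hypothetical fallback store for any other scheme."""


# Assumed mapping from URI scheme to implementation, for illustration only.
_SCHEME_TO_STORE = {"s3": S3LogStore, "s3a": S3LogStore, "s3n": S3LogStore}


def log_store_for(path):
    """Pick a LogStore class from the scheme of `path`, with a default."""
    scheme = urlparse(path).scheme.lower()
    return _SCHEME_TO_STORE.get(scheme, DefaultLogStore)
```

A caller would resolve the class once per table path and then route `listFrom`-style calls through the returned implementation, so storage-specific behavior stays behind one interface.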
commit 9e74e5640134f21454f558f267450a077651d5d7 Author: Venki Korukanti Date: Fri Mar 22 08:18:21 2024 -0700 [Build] Add `storage` module dependency to `kernel-defaults` ## Description Build changes from PR: https://github.com/delta-io/delta/pull/2770 commit 90b98e3e30c641d911af71c2cbfae3828cb27eb6 Author: Johan Lasperas Date: Fri Mar 22 15:18:32 2024 +0100 Automatic type widening in MERGE (#2764) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This change is part of the type widening table feature. Type widening feature request: https://github.com/delta-io/delta/issues/2622 Type Widening protocol RFC: https://github.com/delta-io/delta/pull/2624 It adds automatic type widening as part of schema evolution in MERGE INTO: - During resolution of the `DeltaMergeInto` plan, when merging the target and source schema to compute the schema after evolution, we keep the wider source type when type widening is enabled on the table. - When updating the table schema at the beginning of MERGE execution, metadata is added to the schema to record type changes. ## How was this patch tested? - A new test suite `DeltaTypeWideningSchemaEvolutionSuite` is added to cover type evolution in MERGE ## This PR introduces the following *user-facing* changes The table feature is available in testing only, there are no user-facing changes as of now. When automatic schema evolution is enabled in MERGE and the source schema contains a type that is wider than the target schema: With type widening disabled: the type in the target schema is not changed. the ingestion behavior follows the `storeAssignmentPolicy` configuration: - LEGACY: source values that overflow the target type are stored as `null` - ANSI: a runtime check is injected to fail on source values that overflow the target type. - STRICT: the MERGE operation fails during analysis. 
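The three policies above can be sketched for a single `int` source value written to a `short` target column. This is a simplified plain-Python illustration of the behavior as described in this PR, not Spark's implementation; the `store_short` helper and the exception types are assumptions chosen for the example.

```python
SHORT_MIN, SHORT_MAX = -32768, 32767


def store_short(value, policy):
    """Mimic writing an int `value` into a short column under `policy`."""
    policy = policy.upper()
    if policy == "STRICT":
        # STRICT rejects the int -> short assignment up front,
        # regardless of the actual values.
        raise TypeError("STRICT: int to short assignment fails during analysis")
    if SHORT_MIN <= value <= SHORT_MAX:
        return value  # fits in the target type, stored unchanged
    if policy == "LEGACY":
        return None  # overflowing values are stored as null
    if policy == "ANSI":
        # ANSI injects a runtime check that fails on overflow.
        raise OverflowError(f"ANSI: {value} overflows short")
    raise ValueError(f"unknown policy: {policy}")
```

For example, `store_short(40000, "LEGACY")` returns `None`, while the same call with `"ANSI"` raises at run time.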
With type widening enabled: the type in the target schema is updated to the wider source type. The MERGE operation always succeeds: ``` -- target: key int, value short -- source: key int, value int MERGE INTO target USING source ON target.key = source.key WHEN MATCHED THEN UPDATE SET * ``` After the MERGE operation, the target schema is `key int, value int`. commit cb070925b8ef5313582f679b207d47a660a78697 Author: Venki Korukanti Date: Thu Mar 21 15:29:04 2024 -0700 [Build] Add `storage` module dependency to Kernel examples ## Description Prep for #2770 PR commit 050e0188ada0f45601098bd3d8b6d699ed410a96 Author: Jiaheng Tang Date: Thu Mar 21 14:01:22 2024 -0700 [Spark] OPTIMIZE on clustered table with no clustering columns should run compaction (#2777) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Currently, when running OPTIMIZE on a clustered table without any clustering columns(after ALTER TABLE CLUSTER BY NONE), it would fail with a long stack trace: ``` [info] org.apache.spark.SparkException: Exception thrown in awaitResult: [info] at org.apache.spark.util.SparkThreadUtils$.awaitResult(SparkThreadUtils.scala:56) [info] at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:310) [info] at org.apache.spark.util.ThreadUtils$.parmap(ThreadUtils.scala:383) [info] at org.apache.spark.sql.delta.commands.OptimizeExecutor.$anonfun$optimize$1(OptimizeTableCommand.scala:276) ... [info] Cause: java.util.concurrent.ExecutionException: Boxed Error [info] at scala.concurrent.impl.Promise$.resolver(Promise.scala:87) ... [info] Cause: java.lang.AssertionError: assertion failed: Cannot cluster by zero columns! 
[info] at scala.Predef$.assert(Predef.scala:223) [info] at org.apache.spark.sql.delta.skipping.MultiDimClustering$.cluster(MultiDimClustering.scala:51) [info] at org.apache.spark.sql.delta.commands.OptimizeExecutor.runOptimizeBinJob(OptimizeTableCommand.scala:427) [info] at org.apache.spark.sql.delta.commands.OptimizeExecutor.$anonfun$optimize$6(OptimizeTableCommand.scala:277) ... ``` This change makes OPTIMIZE on a clustered table without any clustering columns run regular compaction, which is the desired behavior. ## How was this patch tested? This change adds a new test to verify the correct behavior. The test would fail without the fix. ## Does this PR introduce _any_ user-facing changes? No. commit 86183887bd423981cea3288f3f8c0d162d746c47 Author: Paddy Xu Date: Thu Mar 21 19:02:07 2024 +0100 [PySpark] Add schema evolution config to PySpark `DeltaMergeBuilder` (#2778) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] PySpark ## Description This PR continues from https://github.com/delta-io/delta/pull/2737 to add a `withSchemaEvolution()` method for `DeltaMergeBuilder` in PySpark. ## How was this patch tested? New unit tests. ## Does this PR introduce _any_ user-facing changes? Yes, this PR allows the user to turn on schema evolution for MERGE in PySpark by calling the `table.merge(...).withSchemaEvolution()` method. commit 9f040d4c5130be7376ab7abdbf23f4ebc7e07516 Author: Andreas Chatzistergiou <93710326+andreaschat-db@users.noreply.github.com> Date: Thu Mar 21 18:57:32 2024 +0100 [Spark] Current Date/Time resolution in constraints (#2766) #### Which Delta project/connector is this regarding? 
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Heading towards the removal of codegenFallback from Date/Time expressions (https://github.com/apache/spark/pull/44261), delta constraints need to resolve current_datetime expressions during the analysis of the invariants. The proposed changes work for both Spark 3.5 and Spark master. ## How was this patch tested? Existing tests. In particular, `CheckConstaintsSuite` covers constraints with `current_timestamp()` expressions. Added an extra test to cover `current_date()`. ## Does this PR introduce _any_ user-facing changes? No. commit 8cd0f948d107065051d762599789edbd08d24cbc Author: Babatunde Micheal Okutubo Date: Thu Mar 21 10:13:30 2024 -0700 [Spark] Use checkError for testing DeltaAnalysisException (#2768) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Followup for this PR https://github.com/delta-io/delta/pull/2695, that classifies DeltaAnalysisException. This includes test change only to use checkError for exception verification. ## How was this patch tested? existing tests changed ## Does this PR introduce _any_ user-facing changes? No commit 4619af70c84d6379160db107990fe65647bb838c Author: Prakhar Jain Date: Wed Mar 20 15:59:17 2024 -0700 [SPARK] Managed Commit support for cold and hot snapshot update (#2755) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR adds support for cold and hot snapshot update for managed-commits. ## How was this patch tested? UTs ## Does this PR introduce _any_ user-facing changes? 
No commit a11b92de2850dfbe439fad6ec7cca747be503b77 Author: Sumeet Varma Date: Wed Mar 20 15:36:38 2024 -0700 [PROTOCOL Change Request] Return latestTableVersion in the CommitStore.getCommits API (#2712) ## Protocol Change Request ### Description of the protocol change This change puts added responsibility on CommitStore to return the latestTableVersion in the CommitStore.getCommits API along with the list of Commits. Protocol RFC issue: https://github.com/delta-io/delta/issues/2598 ### Willingness to contribute The Delta Lake Community encourages protocol innovations. Would you or another member of your organization be willing to contribute this feature to the Delta Lake code base? - [x] Yes. I can contribute. - [ ] Yes. I would be willing to contribute with guidance from the Delta Lake community. - [ ] No. I cannot contribute at this time. commit a584fe1e5b36afc3a02b8569ee24a6f7238eda85 Author: Tai Le <49281946+tlm365@users.noreply.github.com> Date: Thu Mar 21 04:10:16 2024 +1100 [Kernel] Remove redundant imports and test suites (#2773) ## Description Remove redundant imports and test suites. Signed-off-by: Tai Le Manh commit c5ff2365390a10c87da04ffec437d8d218d98490 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Wed Mar 20 16:20:48 2024 +0100 [Spark][Test-Only] Add Row Tracking Clone tests with empty source table (#2742) ## Description Add Materialized Row Tracking columns CLONE tests, making sure we have enough Row Tracking test coverage necessary to gear us towards [enabling Row Tracking outside of testing](https://github.com/delta-io/delta/pull/2059). ## How was this patch tested? Existing + added UTs. ## Does this PR introduce _any_ user-facing changes? No. commit d4ffc42ca79c977e4c2f7d0b57ce836ca564b21b Author: Tathagata Das Date: Tue Mar 19 17:14:27 2024 -0400 [INFRA] Removed old, unused Github actions (#2763) #### Which Delta project/connector is this regarding? 
- [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other: Infra ## Description As the title says ## How was this patch tested? N/A ## Does this PR introduce _any_ user-facing changes? No commit fbc23846de29b756674ed85015deb419baab16aa Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Tue Mar 19 18:44:12 2024 +0100 [Spark][Test-Only] Add Row Tracking self-Clone tests (#2743) ## Description Add Row Tracking self-clone tests, making sure we have enough Row Tracking test coverage necessary to gear us towards [enabling Row Tracking outside of testing](https://github.com/delta-io/delta/pull/2059). ## How was this patch tested? Existing + added UTs. ## Does this PR introduce _any_ user-facing changes? No. commit fb08c8ff52a27dc8a6ce7b121b9eaf7d2b5937f8 Author: Thang Long Vu <107926660+longvu-db@users.noreply.github.com> Date: Tue Mar 19 18:34:29 2024 +0100 [Spark][Test-Only] Add Materialized Row Tracking columns CLONE tests ## Description Add Materialized Row Tracking columns CLONE tests, making sure we have enough Row Tracking test coverage necessary to gear us towards [enabling Row Tracking outside of testing](https://github.com/delta-io/delta/pull/2059). ## How was this patch tested? Existing + added UTs. ## Does this PR introduce _any_ user-facing changes? No. commit a2691fb5d58a1a27aa15710596dc66e9388f4d87 Author: Jason Teoh Date: Tue Mar 19 09:42:00 2024 -0700 [Spark] Disable flaky DeltaLogSuite unit test that relies on filesystem behavior (#2761) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Mark the flaky `"DeltaLog should throw exception when unable to create log directory with silent filesystem failure"` DeltaLogSuite test as ignored. ## How was this patch tested? N/A. CI tests. ## Does this PR introduce _any_ user-facing changes? 
No commit d8d751ac9af9e0e78365f166d6aee1e043e4cd59 Author: Paddy Xu Date: Tue Mar 19 16:11:00 2024 +0100 [Spark] Add tests for MERGE with schema evolution Scala API (#2765) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR follows https://github.com/delta-io/delta/pull/2737 (https://github.com/delta-io/delta/commit/bbbace1085e46d2ebaa6204bf2603d0a4e2f23ee) to add tests for turning on schema evolution via the `WITH SCHEMA EVOLUTION` syntax or the `.withSchemaEvolution()` Scala API. ## How was this patch tested? This PR is test-only. ## Does this PR introduce _any_ user-facing changes? No. commit 72fad38a3b003122ccdd88144cfc4f4110d9e154 Author: sabir-akhadov <52208605+sabir-akhadov@users.noreply.github.com> Date: Mon Mar 18 21:22:27 2024 +0100 [Spark] Column mapping removal: support tables with deletion vectors, column constraints and generated columns. (#2753) ## Description Add additional tests for tables with deletion vectors, generated columns and column constraints for column mapping removal. ## How was this patch tested? New unit tests ## Does this PR introduce _any_ user-facing changes? No commit 6ccbabb1fe40da3df797aa99f214e17458ddd38a Author: Ala Luszczak Date: Mon Mar 18 21:19:13 2024 +0100 [SPARK] Add exclusion for integral types check in `SchemaUtils.normalizeColumnNamesInDataType` (#2760) ## Description The check was too restrictive and was logging false-positives. ## How was this patch tested? Updated tests in `SchemaUtilsSuite`. ## Does this PR introduce _any_ user-facing changes? No. commit 679e4441556d3a38259b4b9579bc524beaf19dcb Author: Johan Lasperas Date: Mon Mar 18 21:11:29 2024 +0100 Disentangle schema evolution in MERGE INTO command (#2709) ## What changes were proposed in this pull request? 
### Context Schema evolution in the MERGE INTO command relies on manually updating the plan of the target table and replaces the original output attributes to match the schema after evolution. Manually updating the target plan is unnecessarily complex, error-prone and ultimately wrong: the new attributes don't match the actual target schema that was used to resolve all expressions. This worked so far for schema evolution that only adds new fields or columns that are then implicitly filled with `null`s on read but breaks when introducing type evolution due to type mismatches that can't be reconciled. ### Changes The target plan isn't manually updated to support schema evolution in the MERGE INTO command anymore. Instead, the original target output is used and we rely on the different expressions used to write out the data to produce an output that supports schema evolution. E.g.: - For target columns that are assigned to by an UPDATE action: `generateUpdateExpressions` already does the heavy lifting, it is simply updated to use the actual target output instead of trying to reference the evolved schema. This allows simplifying the method a bit. - For target columns that aren't assigned to by an UPDATE action (either columns without an assignment or with a DELETE action and the deleted row must be written to the CDC output): no-op expressions are generated by casting the target output attributes to their corresponding type after evolution in `getTargetOutputCols`. The manual handling in `buildTargetPlanWithIndex` and `replaceFileIndex` that isn't needed anymore is removed. This in turn allows splitting `replaceFileIndex` in two methods with separate concerns: `replaceFileIndex` (already exists) and `dropColumns` to manually remove columns from the target plan. ## How was this patch tested?
This is extensively covered by existing MERGE tests, in particular: - `MergeIntoSchemaEvolutionTests` - `MergeCDCSuite` - `schema evolution with non-nullable schema` that covers handling non nullable fields that turn nullable due to the outer join in MERGE. commit 5cf1a607a67545cf375c1fd242a2e76c5fedd56b Author: kamcheungting-db <91572897+kamcheungting-db@users.noreply.github.com> Date: Mon Mar 18 09:49:51 2024 -0700 [Spark] Fix case sensitive of delta statistic column (#2758) ## Description This PR fixes a bug that happens when collecting the column attributes specified inside the Delta statistic column table property. The Delta statistic column table property translates all column names into lower case, while the table schema keeps the case of each column as created by the customer. As a result, if there are upper case columns inside the table definition, the Delta statistic collection would miss these columns. This PR fixes this issue by translating the column name to lower case while searching statistic columns. ## How was this patch tested? Modify existing test case to cover more column character cases. commit bbbace1085e46d2ebaa6204bf2603d0a4e2f23ee Author: Paddy Xu Date: Mon Mar 18 14:46:45 2024 +0100 [Spark] MERGE INTO: turn on schema evolution when WITH SCHEMA EVOLUTION is given in SQL command ## Description This PR teaches the `MERGE INTO` command to turn on schema evolution when the `WITH SCHEMA EVOLUTION` keywords are present in the command. Changes in this PR are: 1. Modify case classes `MergeIntoCommand` and `DeltaMergeInto` to store schema evolution enablement information. 2. For `DeltaMergeInto`, we reuse the existing migrateSchema fields instead of adding a new one. 3. Scala user-facing `DeltaMergeBuilder` API. Changes to be done, but not in this PR: 1. Python user-facing DeltaMergeBuilder API. 2. Extend schema evolution tests to test `WITH SCHEMA EVOLUTION` keywords. ## How was this patch tested? Improving the existing tests.
## Does this PR introduce _any_ user-facing changes? This PR allows users to turn on automatic schema evolution for a specific MERGE command by issuing `MERGE WITH SCHEMA EVOLUTION INTO ...` SQL commands. commit fe3c3c05bf2656ab45b0a4e66fc010df13f5ed92 Author: Nick Lanham Date: Fri Mar 15 14:04:06 2024 -0700 [Flink] disable two flaky flink tests (#2735) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [X] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description We have two flaky tests that sometimes block other jobs. See [here](https://github.com/delta-io/delta/actions/runs/8191593850/job/22401155588?pr=2727) and [here](https://github.com/delta-io/delta/pull/1934) for example. This disables these tests while we investigate and fix them. ## How was this patch tested? Existing UTs ## Does this PR introduce _any_ user-facing changes? No commit ec8ab169ac29a9ff20e190da7e5e11d2e020dc86 Author: Johan Lasperas Date: Fri Mar 15 15:37:16 2024 +0100 [Spark] Drop Type Widening Table Feature This PR includes changes from https://github.com/delta-io/delta/pull/2708 which isn't merged yet. The changes related only to dropping the table feature are in commit https://github.com/delta-io/delta/pull/2720/commits/e2601a6e049f82f8e7fc68f3284d7b9efcffa54b ## Description This change is part of the type widening table feature. Type widening feature request: https://github.com/delta-io/delta/issues/2622 Type Widening protocol RFC: https://github.com/delta-io/delta/pull/2624 It adds the ability to remove the type widening table feature by running the `ALTER TABLE DROP FEATURE` command. Before dropping the table feature, traces of it are removed from the current version of the table: - Files that were written before the latest type change and thus contain types that differ from the current table schema are rewritten using an internal `REORG TABLE` operation. - Metadata in the table schema recording previous type changes is removed.
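The two cleanup steps listed above can be pictured with a small sketch. It uses plain Python dictionaries as stand-ins for file and schema metadata; the data layout, the helper names, and the `typeChanges` metadata key are hypothetical and chosen only for illustration, not taken from the Delta code.

```python
TYPE_CHANGES_KEY = "typeChanges"  # hypothetical metadata key, illustration only


def files_needing_rewrite(files, table_schema):
    """Return paths of files storing a column with a type that differs
    from the current table schema (columns absent from a file are ignored)."""
    stale = []
    for f in files:
        for column, file_type in f["types"].items():
            if column in table_schema and table_schema[column] != file_type:
                stale.append(f["path"])
                break
    return stale


def strip_type_change_metadata(fields):
    """Return copies of the schema fields without the type-change metadata."""
    return [
        {
            **field,
            "metadata": {
                k: v
                for k, v in field.get("metadata", {}).items()
                if k != TYPE_CHANGES_KEY
            },
        }
        for field in fields
    ]
```

In this picture, a file written while `value` was still `short` is selected for rewrite once the table schema says `int`, and the schema fields are then returned with the type-change record removed.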
## How was this patch tested? - A new set of tests is added to `DeltaTypeWideningSuite` to cover dropping the table feature with tables in various states: with/without files to rewrite or metadata to remove. ## Does this PR introduce _any_ user-facing changes? The table feature is available in testing only, there are no user-facing changes as of now. When the feature is available, this change enables the following user action: - Drop the type widening table feature: ``` ALTER TABLE t DROP FEATURE typeWidening ``` This succeeds immediately if no version of the table contains traces of the table feature (= no type changes were applied in the available history of the table). Otherwise, if the current version contains traces of the feature, these are removed: files are rewritten if needed and type widening metadata is removed from the table schema. Then, an error `DELTA_FEATURE_DROP_WAIT_FOR_RETENTION_PERIOD` is thrown, telling the user to retry once the retention period expires. If only previous versions contain traces of the feature, no action is applied on the table, and an error `DELTA_FEATURE_DROP_HISTORICAL_VERSIONS_EXIST` is thrown, telling the user to retry once the retention period expires. commit 9a59c0a7524a0046cab5744a5b0c6711ad7f3979 Author: Dhruv Arya Date: Thu Mar 14 12:35:35 2024 -0700 [Spark] Fix char/varchar to string column conversions Users are currently allowed to change the data type of their columns from char/varchar to string. Even though the command succeeds, the column metadata that indicates that the column is char/varchar is not removed during the process. As a result, Delta enforces the old char/varchar constraints on future inserts. This PR fixes that by removing this metadata from any updated char/varchar column when the column is updated. ## How was this patch tested? Added a new test which was failing without the code fix. ## Does this PR introduce _any_ user-facing changes?
Yes, `ALTER COLUMN X TYPE STRING;` will now work correctly when X is char/varchar. commit 4456a122929b834e5c2652f99cc64ff8a71f4113 Author: jintao shen Date: Wed Mar 13 14:44:51 2024 -0700 [Spark] Implement incremental clustering using ZCUBE approach ## Description Implement incremental Liquid clustering according to the design [doc](https://docs.google.com/document/d/1FWR3odjOw4v4-hjFy_hVaNdxHVs4WuK1asfB6M6XEMw/edit?usp=sharing). This implementation uses a ZCube-based approach to achieve incremental clustering. When a ZCube is big enough, it is sealed and the next clustering run won't re-cluster its files, which reduces write amplification. Key changes: each clustered file is tagged with ZCUBE_ID to track which ZCube it belongs to, and the id is generated using UUID. Another tag, ZCUBE_ZORDER_BY, is used to track the clustering columns. Each clustered file has the clusteringProvider populated with liquid. ## How was this patch tested? New unit tests. ## Does this PR introduce _any_ user-facing changes? No commit c046547b5721ee096e5c0beb04bd9b2059021630 Author: Venki Korukanti Date: Wed Mar 13 08:57:14 2024 -0700 [Kernel] Add `ExpressionHandler.isSupported` to check for expression support ## Description Having an API that tells whether an expression is supported on a given input schema and expected data type lets the Kernel make better decisions when splitting the given query predicate into a guaranteed predicate and a best-effort predicate. The proposed API is: ``` /** * Is the given expression evaluation supported on the data with given schema? * * @param inputSchema Schema of input data on which the expression is evaluated. * @param expression Expression to evaluate. * @param outputType Expected result data type. * @return true if supported and false otherwise. */ boolean isSupported(StructType inputSchema, Expression expression, DataType outputType); ``` ## How was this patch tested?
Unittests commit 2e197f130765d91f201b6b649f30190a44304b29 Author: Sumeet Varma Date: Wed Mar 13 08:10:32 2024 -0700 [Spark]Add VacuumProtocolCheck ReaderWriter Table Feature (#2730) ## Description Add a new VacuumProtocolCheck ReaderWriter Table Feature so that Vacuum command on older DBR client and OSS clients fail. This is in follow-up to https://github.com/delta-io/delta/pull/2557 where protocol-check was added during the vacuum-write flow. ## How was this patch tested? UTs ## Does this PR introduce _any_ user-facing changes? No commit 60914cdd872c4d6c3a14721d4cb63e93e8f463b6 Author: sabir-akhadov <52208605+sabir-akhadov@users.noreply.github.com> Date: Tue Mar 12 18:42:21 2024 +0100 [Spark]Column mapping removal basic rewrite operation (#2741) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Implement basic rewrite command to rewrite a table with column mapping enabled. ## How was this patch tested? New unit tests ## Does this PR introduce _any_ user-facing changes? commit 3ae99ae592f8d4d3a8b2ae83a6670977176a3598 Author: Dhruv Arya Date: Tue Mar 12 10:41:44 2024 -0700 [Spark] Rename lastCommitTimestamp to lastCommitFileModificationTimestamp (#2746) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Renames LogSegment.lastCommitTimestamp to lastCommitFileModificationTimestamp so as to distinguish this file timestamp from in-commit-timestamp. ## How was this patch tested? Related to https://github.com/delta-io/delta/issues/2532 . Name-only change. ## Does this PR introduce _any_ user-facing changes? 
No commit d0477bba4bd796967ace5f989f3285f1107baada Author: Venki Korukanti Date: Mon Mar 11 22:36:13 2024 -0700 [Kernel][LogReplay] Make a single read request for all checkpoint files ## Description Currently, the `kernel-api` reads one file (either checkpoint or commit file) at a time. Once the file is fully read, the read request for the next file is issued. This makes reading large checkpoints split over multiple files slower. Instead, `kernel-api` could issue read requests for all checkpoint files at once (in case of multi-part checkpoints) using the `ParquetHandler.readParquetFiles` and let the implementations of the `ParquetHandler` prefetch or use multiple threads to read the checkpoint parts concurrently. This PR makes the change to `kernel-api` to issue one read request for all checkpoint files that need to be read for state reconstruction. Resolves #2668 Resolves #1965 ## How was this patch tested? Existing tests and a benchmark with a test-only parallel Parquet reader. Here are the sample benchmark results with the test-only parallel Parquet reader. `Score` tells the average time to construct the Delta table state. `parallelReaderCount` indicates the number of parallel Parquet reading threads used.
``` Benchmark (parallelReaderCount) Mode Cnt Score Error Units BenchmarkParallelCheckpointReading.benchmark 0 avgt 5 1565.520 ± 20.551 ms/op BenchmarkParallelCheckpointReading.benchmark 1 avgt 5 1064.850 ± 19.699 ms/op BenchmarkParallelCheckpointReading.benchmark 2 avgt 5 785.918 ± 176.285 ms/op BenchmarkParallelCheckpointReading.benchmark 4 avgt 5 729.487 ± 51.470 ms/op BenchmarkParallelCheckpointReading.benchmark 10 avgt 5 693.757 ± 41.252 ms/op BenchmarkParallelCheckpointReading.benchmark 20 avgt 5 702.656 ± 19.145 ms/op ``` commit 6c4d86093f7d186fc87ddc5860c5fc7ebac2e016 Author: Venki Korukanti Date: Mon Mar 11 21:45:56 2024 -0700 [Kernel][Build] Add `jmh` dependency in kernel-defaults module This dependency is needed for https://github.com/delta-io/delta/pull/2701/files. commit 9b02ec727ad9c046e33632d46bd13e90bcc8a751 Author: Paddy Xu Date: Mon Mar 11 23:55:33 2024 +0100 [Spark] Test special characters for temp folders used by DV tests (#2726) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR follows https://github.com/delta-io/delta/pull/2696 and https://github.com/delta-io/delta/pull/2719 to finally enable testing special characters in all DV tests. One test is currently failing due to a potential bug in the `OPTIMIZE` code path, which is pending investigation. ## How was this patch tested? Test-only. ## Does this PR introduce _any_ user-facing changes? No. commit 86e3b2c1bb65f464dfdeb9eb02b348fd8d0baf9d Author: Chloe Xia Date: Mon Mar 11 09:13:10 2024 -0700 [Spark] Classify AnalysisException in SchemaMergingUtils.scala (#2695) ## Description This pr modifies (DELTA_FAILED_TO_MERGE_FIELDS ) and adds a new error class (DELTA_MERGE_INCOMPATIBLE_DATATYPE) to migrate AnalysisException to use the new error framework. ## How was this patch tested? It modifies the existing test. ## Does this PR introduce _any_ user-facing changes? Yes. 
Exception message before: Failed to merge fields 'c0' and 'c0'. Failed to merge incompatible data types IntegerType and StringType. Exception message after: [DELTA_FAILED_TO_MERGE_FIELDS] Failed to merge fields 'c0' and 'c0'. Failed to merge incompatible data types IntegerType and StringType. Co-authored-by: fanyue-xia commit ecbeef7fb6a4a08ef5e15bf46c776c35a80e6cd3 Author: Dhruv Arya Date: Fri Mar 8 17:55:40 2024 -0800 [Spark] Read In-Commit Timestamp during the P&M query (#2723) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description The ICT is now retrieved during the Protocol & Metadata query. The only time this query will not retrieve the ICT will be when the latest version has a checkpoint. In that case, we fall back to reading the ICT from latest commit directly. ## How was this patch tested? Added multiple tests in InCommitTImestampSuite ## Does this PR introduce _any_ user-facing changes? No commit 02ef7c7fbedf4e21d7669a35e99ea92d935c96ea Author: sokolat <98281366+sokolat@users.noreply.github.com> Date: Fri Mar 8 18:49:02 2024 -0500 [Build] Workflow to generate unidoc (#2224) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (Doc) ## Description Resolves #2209 this PR will make sure all the component apis are documented. Documentation will be generated for all changes ## How was this patch tested? commit triggered workflow tests --------- Co-authored-by: Venki Korukanti Co-authored-by: Allison Portis commit a1fe1123902fa1c8874b15a530c3a6e7f833ff66 Author: Jason Teoh Date: Fri Mar 8 08:55:57 2024 -0800 [Spark] Add IOException handling in DeltaLog.ensureLogDirectoryExist (#2724) #### Which Delta project/connector is this regarding? 
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR adds a fix to handle IOExceptions that can be thrown when creating Delta log directories (e.g., due to permission errors) and includes tests. ## How was this patch tested? Added test case to test a scenario where IOException is thrown when creating the delta log directories in `spark/src/test/scala/org/apache/spark/sql/delta/DeltaLogSuite.scala`: "DeltaLog should throw exception when unable to create log directory with filesystem IO Exception". I also added missing test cases for the positive case and existing negative case - for silent failures / cases when the filesystem commands return false. ## Does this PR introduce _any_ user-facing changes? Previously unhandled `IOException`s thrown by underlying filesystems (exception varies by implementation, e.g. `org.apache.hadoop.fs.ParentNotDirectoryException` in the test case) will now be wrapped in a DELTA_CANNOT_CREATE_LOG_PATH exception with message: > [DELTA_CANNOT_CREATE_LOG_PATH] Cannot create \ where is substituted with the actual directory path. This is a change for both master and existing released Delta Lake versions. commit 2f2acb7a6bb77a6f54fe6c1653722a1de6384839 Author: Lars Kroll Date: Fri Mar 8 17:26:01 2024 +0100 [Protocol] Correct Delta Spec DV version field byte range The protocol spec accidentally described the version field for the DV file format as being 0 - 1, i.e. a 2-byte range, but in truth it's only a single byte. This was likely a left-over from when the spec describe the ranges with exclusive bounds, rather than inclusive. This PR corrects this to clarify that only byte 0 contains the version number. commit e9081d6df58fdb3889e40a0c765060cab97a2c45 Author: Allison Portis Date: Thu Mar 7 22:36:31 2024 -0800 [CI] Disable running unidoc from the spark CI tests for now (#2732) #### Which Delta project/connector is this regarding? 
- [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (fill in here) ## Description This removes compiling unidoc from the spark CI tests since we've seen a flaky failure on #2727. We will follow up and create a separate job just for unidoc (which makes more sense anyways as it should run for changes in any of the projects.) ## How was this patch tested? CI runs. commit 9c57f7b1e831bafcd335b0b4b3dac2bc6955f5e0 Author: Lars Kroll Date: Fri Mar 8 04:05:01 2024 +0100 [Spark] In materialize merge source: Only log storage error if we actually materialized (#2727) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Be more defensive about calling `Option.get` on `materializedSourceRDD` and check first that it's actually `Some`. In fact only invoke the entire branch where we log only when we actually did materialize the source (and not, say, ran out of disk space before). ## How was this patch tested? This PR adds a new test for the scenario where we throw an out of disk space error without having materialized the source. ## Does this PR introduce _any_ user-facing changes? No commit 77d43ed5ed2085b4953e55df1b2ef5ef74dc9a2f Author: Felipe Pessoto Date: Thu Mar 7 16:36:39 2024 -0800 [Spark] CI - Use Python packages compatible to Python 3.9 + Run CI when workflow files change (#2353) - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Use Python packages compatible to both Python 3.8 and 3.9 https://setuptools.pypa.io/en/stable/history.html#v41-1-0 https://pypi.org/project/pandas/1.1.3/ ## How was this patch tested? ## Does this PR introduce _any_ user-facing changes? 
--------- Signed-off-by: Felipe Fujiy Pessoto Signed-off-by: Felipe Pessoto commit 33699686c747ee0747113d3f49743a414da88448 Author: Sumeet Varma Date: Thu Mar 7 15:13:12 2024 -0800 [Spark] Add LatestTableVersion to CommitStore GetCommits API response (#2716) #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description With this change, we require CommitStores to maintain and return the latestTableVersion in the getCommits API. The motivation here is for readers to get more context from CommitStore when all commits are backfilled and the Commit Response is empty. Clients can use latestTableVersion to determine the version the table is when all commits are backfilled and the commits list is empty in the getCommits response. ## How was this patch tested? Unit tests ## Does this PR introduce _any_ user-facing changes? No commit 7b9c4ed34dcf102ff3dca0abed3f668ebc9f3061 Author: ram-seek <143373457+ram-seek@users.noreply.github.com> Date: Fri Mar 8 06:15:57 2024 +1100 [Spark][Python] Change to `isinstance` to allow `DataFrame` subclasses in `merge` API This PR allows the merge to accept subclasses of a dataframe. resolves #2619 commit 707e7a64be7c07b83381fa2c8f59366254f6a55b Author: Renan Tomazoni Pinzon Date: Thu Mar 7 14:02:34 2024 -0300 [Kernel] Rename Parquet readers to maintain consistency with Parquet writers Rename Parquet readers to maintain consistency with Parquet writers. Resolves #2636 commit 2fddb8b0b87b08ec57b13e86fcd2feec83ea10f1 Author: Thang Long Vu Date: Wed Mar 6 13:34:17 2024 +0100 Add tests to check the behaviour of CLONE with row IDs. 
Closes https://github.com/delta-io/delta/pull/2678 GitOrigin-RevId: 9473242559e6957b3ce56925cfe1da256972ba1b commit 6cdacab4649fcbf81582e061878b7fbd76d95e1b Author: Prakhar Jain Date: Tue Mar 5 15:35:41 2024 -0800 Use FileSystemCommitStore to simplify code when managed commits is not enabled Use FileSystemCommitStore to simplify code when managed commits is not enabled Closes https://github.com/delta-io/delta/pull/2721 GitOrigin-RevId: 5dbdb0ce0f357ba60e932000d83c31731915c24b commit 06d87d93e4ba658bb6319835989f91e29835fa9e Author: Paddy Xu Date: Tue Mar 5 17:54:06 2024 +0100 [Spark] Deletion Vector descriptor `path` type should URI-escape special characters #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR makes the `path` type of `DeletionVectorDescriptor` URI-escape special characters in the DV path. A special character may exist if the DV refers to an existing file from another table that contains such a character in its path. New tests. ## Does this PR introduce _any_ user-facing changes? No. Closes https://github.com/delta-io/delta/pull/2719. GitOrigin-RevId: 5e3b5e9f60959e01cf4f324eaf6b24c6028ae365 commit e07697481d5033299f8c05e8d278866a96bc354c Author: Johan Lasperas Date: Tue Mar 5 10:34:47 2024 +0100 Record type widening metadata #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) This change is part of the type widening table feature. Type widening feature request: https://github.com/delta-io/delta/issues/2622 Type Widening protocol RFC: https://github.com/delta-io/delta/pull/2624 It introduces metadata to record information about type changes that were applied using `ALTER TABLE`.
This metadata is stored in table schema, as specified in https://github.com/delta-io/delta/pull/2624/files#diff-114dec1ec600a6305fe7117bed7acb46e94180cdb1b8da63b47b12d6c40760b9R28 For example, changing a top-level column `a` from `int` to `long` will update the schema to include metadata:
```
{
  "name" : "a",
  "type" : "long",
  "nullable" : true,
  "metadata" : {
    "delta.typeChanges": [
      {
        "tableVersion": 1,
        "fromType": "integer",
        "toType": "long"
      },
      {
        "tableVersion": 5,
        "fromType": "integer",
        "toType": "long"
      }
    ]
  }
}
```
- A new test suite `DeltaTypeWideningMetadataSuite` is created to cover methods handling type widening metadata.
- Tests covering adding metadata to the schema when running `ALTER TABLE CHANGE COLUMN` are added to `DeltaTypeWideningSuite`

Closes delta-io/delta#2708 GitOrigin-RevId: cdbb7589f10a8355b66058e156bb7d1894268f4d commit 4e9a15cac87edf559a232182785c16658d2b9624 Author: Fred Storage Liu Date: Tue Mar 5 00:33:29 2024 -0800 Refactor iceberg to delta partition value conversion refactor iceberg to delta partition value conversion to make code more modular. No function change. Closes delta-io/delta#2715 GitOrigin-RevId: 385178c997d40aa2fe1be8273dd65d01f2ddccc1 commit 5cc01848eb787f673d2f22b87d353eef236fa2e8 Author: Sabir Akhadov Date: Mon Mar 4 23:37:31 2024 +0100 Add RemoveColumnMappingSuite Add missing RemoveColumnMappingSuite. Closes delta-io/delta#2697 GitOrigin-RevId: 992c940b42748701bd5f080ee2219d8aa09c9570 commit 8568b3c9907ceffd0103649c40e38669fa7d4af9 Author: Paddy Xu Date: Mon Mar 4 16:24:28 2024 +0100 [Spark] Allow specifying custom prefixes for temp folders used by tests #### Which Delta project/connector is this regarding?
- [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR introduces a new trait, `DeltaSQLTestUtils`, to be used by all tests as a replacement for Apache Spark's `SQLTestUtils.` The new trait provides an ability for each test to specify a custom name prefix for temporary folders, so we can test the behavior of Delta Lake when a table's path contains special characters. Not needed. ## Does this PR introduce _any_ user-facing changes? No. Closes delta-io/delta#2696 Signed-off-by: Paddy Xu GitOrigin-RevId: fa9bead06ff6da1d7baff58a564fbe4cbeac6727 commit c6f0baaf7c96b21c890d50f371c0de27f11989f2 Author: panbingkun Date: Mon Mar 4 14:37:19 2024 +0000 Update error check in DeltaOptionSuite to use Spark error classes This PR updates an error check in DeltaOptionSuite to use the error class framework from Apache Spark. This makes the test less brittle and fits into the framework that Spark is using. Closes delta-io/delta#2699 GitOrigin-RevId: 563d5a6e0d762275551951662eee14ea40c52eb4 commit d0d4eb2e988cf29638712e0dc54a69690da512f8 Author: Sumeet Varma Date: Fri Mar 1 13:15:15 2024 -0800 [Spark] Use FileStatus instead of SerializableFileStatus in CommitStore APIs Similar to LogStore, we will let CommitStore depend on Hadoop FileStatus. CommitStore already depends on Hadoop Path and Hadoop Configuration. Closes delta-io/delta#2700 GitOrigin-RevId: d51421997b603e5a47351503b7edfc1e0f9d05fe commit b15a2c97432c8892f986c1526ceb2c3f63ed5d2c Author: Dhruv Arya Date: Thu Feb 29 16:37:42 2024 -0800 Introduce InCommitTimestamp feature and write monotonically increasing timestamps in CommitInfo Follow-up for https://github.com/delta-io/delta/issues/2532. Adds a new writer feature called `inCommitTimestamp`. When this feature is enabled, the writer will make sure that it writes `commitTimestamp` in CommitInfo which contains a monotonically increasing timestamp. 
This PR is an initial implementation, it does not handle timestamp retrieval efficiently. It does not try to populate the inCommitTimestamp in Snapshot even in places where it is already available, instead Snapshot has to perform an IO to read the timestamp. Closes delta-io/delta#2596 GitOrigin-RevId: 44904e734eee74378ee55f708beb29a484cd93e6 commit f50bd83ee8e02d311c4a060628786b53710beff9 Author: Allison Portis Date: Mon Mar 4 12:37:33 2024 -0800 [Kernel] Support getting snapshots by timestamp (time-travel) (#2662) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description Resolves #2276 Adds support for reading the snapshot of the table at a specific timestamp using `Table:: getSnapshotAtTimestamp`. ## How was this patch tested? Adds unit tests. commit f71fef7c49f3fe6e2df9d7bd502093e2ab6572ba Author: Sumeet Varma Date: Mon Mar 4 11:17:19 2024 -0800 [Protocol Change Request] Add VacuumProtocolCheck PROTOCOL change request (#2693) ## Protocol Change Request ### Description of the protocol change Adds the VacuumProtocolCheck PROTOCOL change proposal. Design Doc: https://docs.google.com/document/d/15o8WO2T0vN21S5JG-FT_ZNhXFCWyh0i9tqhr9kBmZpE/edit#heading=h.4cz970y1mk93 Protocol RFC issue: https://github.com/delta-io/delta/issues/2630 ### Willingness to contribute The Delta Lake Community encourages protocol innovations. Would you or another member of your organization be willing to contribute this feature to the Delta Lake code base? - [x] Yes. I can contribute. - [ ] Yes. I would be willing to contribute with guidance from the Delta Lake community. - [ ] No. I cannot contribute at this time. 
--------- Co-authored-by: Prakhar Jain commit fda41dd8fee58f2f973db819b3c199bfc6410335 Author: Tai Le <49281946+tlm365@users.noreply.github.com> Date: Tue Mar 5 02:10:51 2024 +0700 [Kernel][Java to Scala test conversion] Convert and refactor TestParquetBatchReader Resolves #2638 commit 9b3fa0a1a05e51b38cec083afb41226beb399b0f Author: Johan Lasperas Date: Thu Feb 29 08:52:09 2024 +0100 Type Widening in ALTER TABLE CHANGE COLUMN ## Description This change introduces the `typeWidening` delta table feature, allowing to widen the type of existing columns and fields in a delta table using the `ALTER TABLE CHANGE COLUMN TYPE` or `ALTER TABLE REPLACE COLUMNS` commands. The table feature is introduced as `typeWidening-dev` during implementation and is available in testing only. For now, only byte -> short -> int are supported. Other changes will require support in the Spark parquet reader that will be introduced in Spark 4.0 Type widening feature request: https://github.com/delta-io/delta/issues/2622 Type Widening protocol RFC: https://github.com/delta-io/delta/pull/2624 A new test suite `DeltaTypeWideningSuite` is created, containing: - `DeltaTypeWideningAlterTableTests`: Covers applying supported and unsupported type changes on partitioned columns, non-partitioned columns and nested fields - `DeltaTypeWideningTableFeatureTests`: Covers adding the `typeWidening` table feature ## This PR introduces the following *user-facing* changes The table feature is available in testing only, there's no user-facing changes as of now. 
The type widening table feature will introduce the following changes:
- Adding the `typeWidening` table feature via a table property:
```
ALTER TABLE t SET TBLPROPERTIES ('delta.enableTypeWidening' = true)
```
- Applying a widening type change:
```
ALTER TABLE t CHANGE COLUMN int_col TYPE long
```
or
```
ALTER TABLE t REPLACE COLUMNS int_col TYPE long
```
Note: both ALTER TABLE commands reuse the existing syntax for setting a table property and applying a type change; no new SQL syntax is introduced by this feature. Closes delta-io/delta#2645 GitOrigin-RevId: 2ca0e6b22ec24b304241460553547d0d4c6026a2 commit eb59d4af321b21c0e38220e4e65b21a6b991b410 Author: Fred Storage Liu Date: Wed Feb 28 14:18:15 2024 -0800 [Delta Uniform] overwrite source column field id for Iceberg PartitionField with field id assigned by Delta Context: Delta and Iceberg traverse the schema and assign field ids in different ways. Delta UniForm currently uses an extra Iceberg txn to overwrite the schema in the Iceberg table, which has wrong field ids reassigned by Iceberg. However, Iceberg always expects the source field id of partition columns to stay the same and has no logic to reconcile a difference, so it fails the overwrite txn when that id differs in the schema overwrite. This PR adds logic to adopt the new source column field id in the Iceberg PartitionField if it changed, so the overwrite txn can go through and set the correct source column field id for PartitionFields. Closes delta-io/delta#2676 GitOrigin-RevId: 37be472e9794d0a87c59a8fe06efe237ee1c609e commit 01b8da4c32678966f5065565b0d6bb06d1fe192f Author: Ole Sasse Date: Wed Feb 28 22:30:08 2024 +0100 [SPARK] Add a test case for implicit decimal conversion casts in DML commands #### Which Delta project/connector is this regarding?
-Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Add a test case that validates that the implicit conversion between decimal types in DML commands works correctly Only adding a new test ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#2698 GitOrigin-RevId: efc1f37125a3453dd3d856ac67d6c3b0d5bada4f commit c8c1cfa6bc1b0280934b1c5e6031300372d82c57 Author: Thang Long Vu Date: Wed Feb 28 17:43:56 2024 +0100 Check that RESTORE doesn't downgrade protocol when restoring to a version where row tracking is not enabled, unless the flag is set. Closes https://github.com/delta-io/delta/pull/2677 GitOrigin-RevId: 0ead932e2cfb85f6164f1b8a2b98944312cd9281 commit 22a5616cbf7dbfaf89bcaf484e97d69544732781 Author: Thang Long Vu Date: Tue Feb 27 23:05:51 2024 +0100 Add more TimestampNTZ unit tests for data skipping on TimestampNTZ columns.

Add more TimestampNTZ unit tests for data skipping on TimestampNTZ columns. Unit tests are added to the `DataSkippingDeltaTests`. Closes https://github.com/delta-io/delta/pull/2691 GitOrigin-RevId: 50f5624ec6cb98ac6349ba5b28e7b66ca57eb32d commit dceba958e1dfc1d106f52499a27e1563f2dde2db Author: Johan Lasperas Date: Tue Feb 27 21:35:10 2024 +0100 [Spark] Move MERGE resolution logic to its own file #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Move `resolveReferencesAndSchema` that implements the core resolution logic for Delta MERGE INTO command from package `catalyst` to `delta`. `deltaMerge` should only contain plan node definition but grew over time to accumulate all of the MERGE resolution logic. We'll eventually want to split this logic into smaller, more maintainable units. This a plain refactor moving code around. Closes delta-io/delta#2685 GitOrigin-RevId: f5b1e2303ce402e8e2c23740f7d8ab978dfa1174 commit ffa668d456205c34e111d63866c115eee478502b Author: sumeet-db Date: Tue Feb 27 11:43:54 2024 -0800 [Spark] Limit number of retries for non-conflict retryable commit failures We add limited retries (default = 10) for retryable CommitFailedExceptions which are non-conflict. The number of default retries for retryable conflict CommitFailedExceptions is set to 10M right now. Closes delta-io/delta#2689 GitOrigin-RevId: 4b0e02e87bd5c101c47dfbe84759d6e09abf92f0 commit fcdd4654d3125a33c081102423f6076c848770fb Author: Ala Luszczak Date: Tue Feb 27 13:31:15 2024 +0100 [Spark] Fix adding wide bounds to files with incomplete stats No. 
Closes delta-io/delta#2687 GitOrigin-RevId: ba56180229234e3353e8816b220f7a34b0982c05 commit 5d25578120342246743abcb272f137ce74e7e799 Author: Johan Lasperas Date: Tue Feb 27 17:20:01 2024 +0100 [Protocol Change Request] Type Widening table feature (#2624) * Type Widening Protocol RFC * Clarified requirements * Add requirement on garbage collecting type change metadata * Update protocol_rfcs/type-widening.md Co-authored-by: Ryan Johnson * Update protocol_rfcs/type-widening.md Co-authored-by: Ryan Johnson * Update protocol_rfcs/type-widening.md Co-authored-by: Bart Samwel * Clarify reader & writer requirements re: unsupported type changes * Fix metadata example: s/int/integer --------- Co-authored-by: Ryan Johnson Co-authored-by: Bart Samwel commit 7d41fb7bbf63af33ad228007dd6ba3800b4efe81 Author: Arun Ravi M V Date: Fri Feb 23 18:36:59 2024 -0800 Use inventory reservoir as source for all files and dirs - Currently, users have large tables with daily/hourly partitions for many years, among all these partitions only recent ones are subjected to change due to job reruns, corrections, and late arriving events. - When Vacuum is run on these tables, the listing of files is performed on all the partitions and it runs for several hours/days. This duration grows as tables grow and vacuum becomes a major overhead for customers especially when they have hundreds or thousands of such delta tables. File system scan takes the most amount of time in Vacuum operation for large tables, mostly due to the parallelism achievable and API throttling on the object stores. - This change provides a way for users to pass a reservoir of files generated externally (eg: from inventory reports of cloud stores) as a delta table or as a spark SQL query (having a predefined schema). The vacuum operation when provided with such a reservoir data frame will skip the listing operation and use it as a source of all files in the storage. "Resolves #1691". 
- Unit Testing (` build/sbt 'testOnly org.apache.spark.sql.delta.DeltaVacuumSuite'`) yes, the MR accepts an optional method to pass inventory. `VACUUM table_name [USING INVENTORY ] [RETAIN num HOURS] [DRY RUN]` `VACUUM table_name [USING INVENTORY ] [RETAIN num HOURS] [DRY RUN]` eg: `VACUUM test_db.table using inventory select * from reservoir_table RETAIN 168 HOURS dry run` Closes delta-io/delta#2257 Co-authored-by: Arun Ravi M V Signed-off-by: Bart Samwel GitOrigin-RevId: 2bc824e524c677dd5f3a7ed787762df60c3b6d86 commit 4c8a442643cf434ea865b6c7a111325450220c05 Author: Prakhar Jain Date: Thu Feb 22 17:19:41 2024 -0800 Write side changes for managed-commits Write side changes for managed-commits Closes https://github.com/delta-io/delta/pull/2660 GitOrigin-RevId: f6bf0849db9b63473c3b13d3938962c8a6aff901 commit 26676aebdb2790207c6e354f5050bdcd0976987c Author: Costas Zarifis Date: Thu Feb 22 16:33:12 2024 -0800 Eagerly compute InternalRows before calling makePartitionDirectories Minor refactoring in [TahoeFileIndex.scala](https://github.com/delta-io/delta/pull/2671/files#diff-bd55057ac76812c275eae15225d360f1bb6b2d997a095dee512a1eb9ab1686aa). We push the computation of InternalRow down before calling makePartitionDirectories. This significantly improves the edge-side without causing any side-effects on the non-edge-side. By doing this we are able to cache the result, instead of recomputing it repeatedly on the edge-side. Additionally InternalRow is more memory-efficient than Map[String, String] which also improves the memory utilization. Closes delta-io/delta#2671 GitOrigin-RevId: 9acc90e921a98fc17e7c45e2b04bd7e6b2a85974 commit 441505117ae6bc111d3c5c10b461a6c80427ce4d Author: Tai Le <49281946+tlm365@users.noreply.github.com> Date: Fri Feb 23 21:51:11 2024 +0700 [Kernel][TEST-ONLY] Refactors the test parquet suite Create a new ParquetSuiteBase then move all logic handlers, i/o utils... 
isn't test case, and keep only unit tests in ParquetFileReaderSuite/ParquetFileWriterSuite. Signed-off-by: Tai Le Manh commit e10d249988ca949377882f151aaae7bd7c35625d Author: Thang Long Vu Date: Thu Feb 22 15:47:20 2024 +0100 Add missing Row IDs High Watermark test. Closes https://github.com/delta-io/delta/pull/2656 GitOrigin-RevId: c6d973ef7c64b4640f7c94502aec66d5eb705793 commit 19a45676d59e50991f53c1197bf89b12c00e91a3 Author: Sabir Akhadov Date: Thu Feb 22 15:41:54 2024 +0100 Column mapping removal check for invalid column names Add new column mapping removal command basic class hidden behind a feature flag. Add a check for invalid column names before applying the command. Closes delta-io/delta#2617 GitOrigin-RevId: e18ce58e11ce89115812e8644b99339bb1041a16 commit 05415edc157c13186bae4d00b29b278ac3a083c7 Author: Thang Long Vu Date: Thu Feb 22 01:11:06 2024 +0100 Add tests to check the behaviour of CONVERT TO DELTA with row IDs. Closes https://github.com/delta-io/delta/pull/2647 GitOrigin-RevId: 586d283d516d5eb179b4a03696c1d4294d3b7c80 commit 3f0496ba3a7648a92fc83cde80248e4339d7768f Author: Fred Storage Liu Date: Wed Feb 21 14:28:45 2024 -0800 Make a Delta SQL conf for DeltaLog cache size Make a Delta SQL conf for DeltaLog cache size Closes delta-io/delta#2568 GitOrigin-RevId: 2f5b0992afe7aba5586a5e0e083c782e8dab40e5 commit 210503a1ae23244edeed7c64c0e3a73c307b99ef Author: Sumeet Varma Date: Wed Feb 21 12:44:21 2024 -0800 Fix Checksum.scala store.read callsite to use the new LogStore API Cache the fileStatus of the last-read checksum file in SnapshotManagement.scala. This cache can then be used to potentially invoke the new LogStore read API in Checksum.scala. 
Closes delta-io/delta#2643 GitOrigin-RevId: e285e87168f4816e729bd10fc7a86a0f3624b2cc commit a842991739c5b338478af8c1d44eb4fd5580767e Author: Allison Portis Date: Thu Feb 22 13:19:38 2024 -0800 [Kernel] Test refactoring for #2662 (separate utilities for mocking file system handler) (#2663) commit f54c419fc0fe184b64d56e306e98fbf748b4c53e Author: Tai Le <49281946+tlm365@users.noreply.github.com> Date: Fri Feb 23 03:50:19 2024 +0700 [Kernel][Data skipping] Add support data skipping for IS_NULL and NOT(IS_NULL) (#2658) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description Resolves #2530 ## How was this patch tested? Unit tests added. commit ba6d5b9b3724512780951b075c2444f028da84f9 Author: Tai Le <49281946+tlm365@users.noreply.github.com> Date: Fri Feb 23 02:09:01 2024 +0700 [Kernel][Java to Scala test conversion] Convert TestDefaultJsonHandler written in Java to Scala Resolves #2639 Signed-off-by: Tai Le Manh commit 98fac5728f8efb297d3e68c1c6301871bfc93095 Author: Venki Korukanti Date: Wed Feb 21 13:29:59 2024 -0800 [Kernel] Collect file statistics as part of writing Parquet files ## Description Add support for collecting statistics for columns as part of the Parquet file writing. ## How was this patch tested? Refactored existing tests to make them concise. Added tests for stats collection and verifying the stats using the Spark reader. Also added a few special cases around collecting stats when the input contains NaN, -0.0 or 0.0. commit 3a972508db4531e0b8f6475799c3df4c566db47f Author: Sumeet Varma Date: Tue Feb 20 11:20:18 2024 -0800 Add InMemoryCommitStore to test Managed Commit Backend An in-memory-commit-store that tracks per-table commits, backfills and validates various edge cases and unexpected scenarios. 
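The behaviour described above can be sketched roughly as follows. This is a minimal illustration in Python, not the `InMemoryCommitStore` from this commit: the class shape, the method names (`commit`, `backfill_up_to`, `get_commits`) and the table and file names are all hypothetical. It assumes versions must be committed consecutively per table, and that the store reports the latest table version alongside the unbackfilled commits, as the `getCommits` change earlier in this log describes.

```python
from typing import Dict, List, Tuple


class InMemoryCommitStore:
    """Hypothetical per-table commit tracker; all names are illustrative."""

    def __init__(self) -> None:
        # table -> {version: commit file name} for commits not yet backfilled
        self._unbackfilled: Dict[str, Dict[int, str]] = {}
        # table -> latest version accepted by the store
        self._latest: Dict[str, int] = {}

    def commit(self, table: str, version: int, commit_file: str) -> None:
        # Assumption: versions are consecutive, starting at 0 for each table.
        expected = self._latest.get(table, -1) + 1
        if version != expected:
            raise ValueError(f"expected version {expected}, got {version}")
        self._unbackfilled.setdefault(table, {})[version] = commit_file
        self._latest[table] = version

    def backfill_up_to(self, table: str, version: int) -> None:
        # Reject backfilling a version the store never accepted.
        if version > self._latest.get(table, -1):
            raise ValueError("cannot backfill a version that was never committed")
        commits = self._unbackfilled.get(table, {})
        for v in [v for v in commits if v <= version]:
            del commits[v]

    def get_commits(self, table: str) -> Tuple[List[Tuple[int, str]], int]:
        # Return the unbackfilled commits plus the latest table version, so a
        # reader still learns the version when the commit list is empty.
        commits = sorted(self._unbackfilled.get(table, {}).items())
        return commits, self._latest.get(table, -1)


store = InMemoryCommitStore()
store.commit("tbl", 0, "0.uuid.json")
store.commit("tbl", 1, "1.uuid.json")
store.backfill_up_to("tbl", 1)
commits, latest = store.get_commits("tbl")
```

After both commits are backfilled, `commits` is empty while `latest` still identifies the table version, which is the case the latest-version field is meant to cover.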
Closes delta-io/delta#2649 GitOrigin-RevId: 4da495caa6259501f16723ced0ca236ab8420044 commit 19f3a4fc95860feee9b4d5508bbdc42c99417459 Author: Carmen Kwan Date: Tue Feb 20 14:18:24 2024 +0100 [Spark][TEST-ONLY] Add more test coverage for TRUNCATE HISTORY #### Which Delta project/connector is this regarding? -Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description We are currently missing SQL tests for ALTER TABLE DROP FEATURE TRUNCATE HISTORY with non-path based table. We have tests for writer feature, but not for readerwriter feature that require the TRUNCATE HISTORY syntax. This PR addresses that gap. This is a test-only PR. ## Does this PR introduce _any_ user-facing changes? No. Closes delta-io/delta#2635 GitOrigin-RevId: f64a7226defe240145dca756b51c5f325a030841 commit 622bd3257182332b5ed3ca2d2146301b347c7c58 Author: Thang Long Vu Date: Fri Feb 16 19:05:31 2024 +0100 Add tests to check the behaviour of the different combinations of CREATE and REPLACE with row IDs. Add tests to check the behaviour of the different combinations of `CREATE` and `REPLACE` with row IDs. Closes https://github.com/delta-io/delta/pull/2642 GitOrigin-RevId: b8bddb63afa616416be01211a5167639e3f044d9 commit 70e527bd29eb640f9829cb16c0cda9b4c138c4a6 Author: Jing Zhan Date: Fri Feb 16 09:37:40 2024 -0800 Add a config flag for partition change check in DeltaSource Add a config for partition change check in Delta Source. Users can turn on or turn off the check by changing the config. 
Closes delta-io/delta#2618 GitOrigin-RevId: cbfc621d5f07e01b2ee60a048fb15e1fa80a9322 commit d6482c4440903e22a3855b973aa38f24b7ca284e Author: Hao Jiang Date: Thu Feb 15 14:12:29 2024 -0800 Add example for IcebergCompatV2 and REORG This PR adds an example explaining how to use the REORG APPLY UniForm command to enable IcebergCompatV2 and UniForm. Closes delta-io/delta#2500 GitOrigin-RevId: 23ccd5bac7d95977530646fcf5ba5d53a25d2734 commit 66d0c54bde1bb77c37ff61f9edfd62c2fd381fd4 Author: Johan Lasperas Date: Thu Feb 15 20:12:12 2024 +0100 Support map and arrays in ALTER TABLE CHANGE COLUMN #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This change addresses an issue where trying to change the key/value of a map or element of an array with ALTER TABLE CHANGE COLUMN would succeed while doing nothing. In addition, a proper error is now thrown when trying to add or drop the key or value of a map or element of an array.
- Added tests to `DeltaAlterTableTests` to cover changing maps and arrays in ALTER TABLE CHANGE COLUMN.
- Added tests to `SchemaUtilsSuite` and `DeltaDropColumnSuite` to cover the updated error when trying to add/drop map key/value or array element.

## This PR introduces the following *user-facing* changes Changing the type of the key or value of a map or of the elements of an array now fails if the type change isn't supported (= anything except setting the same type or moving between char, varchar, string):
```
CREATE TABLE table (m map) USING DELTA;
ALTER TABLE table CHANGE COLUMN m.key key long;
-- Fails with DELTA_UNSUPPORTED_ALTER_TABLE_CHANGE_COL, previously succeeded while applying no change.
```
Similarly, adding a comment now also fails.
The error when trying to add or drop a map key/value or array element field is updated:
```
CREATE TABLE table (m map) USING DELTA;
ALTER TABLE table ADD COLUMN m.key long;
-- Now fails with DELTA_ADD_COLUMN_PARENT_NOT_STRUCT instead of IllegalArgumentException: Don't know where to add the column m.key
```
Closes delta-io/delta#2615 GitOrigin-RevId: e9d4ba42cefaf7be7e70d948075312922059cde0 commit 0ee57b79e54574cf6827553129ce4f248e309099 Author: Ala Luszczak Date: Thu Feb 15 12:29:04 2024 +0100 [Spark] Handle NullType in normalizeColumnNames() The sanity check in normalizeColumnNamesInDataType() introduced by that change is a bit too restrictive, and fails to handle NullType correctly. Closes delta-io/delta#2634 GitOrigin-RevId: faaf3d981c57ef3ceb4081e0bc94d457359fc9d8 commit 360e066a9653e5486544c57f2c728b62d5bc3bd1 Author: Johan Lasperas Date: Wed Feb 14 21:20:50 2024 +0100 Factor logic to collect files to REORG out of OPTIMIZE #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This is a plain refactor of REORG TABLE / OPTIMIZE to allow for better extensibility and for adding new types of REORG TABLE operations in the future. The REORG operation currently supports:
- PURGE: remove soft deleted rows (DVs) and dropped columns.
- UPGRADE UNIFORM: rewrite files to be Iceberg compatible.

More operations can be added in the future to allow dropping table features, in particular for column mapping: rewrite files to have the physical column names match the logical column names. This is a plain refactor without functional changes.
Closes delta-io/delta#2616 GitOrigin-RevId: b8e8ad4d148201a33b1fb173ebcfe4ad8b8407ef commit 25c44838b4b3457bff6cc010860fe4f2412cf8cd Author: Tai Le <49281946+tlm365@users.noreply.github.com> Date: Wed Feb 21 01:14:05 2024 +0700 [Kernel][Java to Scala test conversion] Convert TestDefaultFileSystemClient written in Java to Scala Resolves #2640 commit efc0e34dd907e257f84eabb8d43f3e0346859cc9 Author: Tai Le <49281946+tlm365@users.noreply.github.com> Date: Tue Feb 20 02:34:15 2024 +0700 [Kernel][Expressions] Adds the `IS_NULL` expression Resolve #2632. Adds the `IS NULL` expression. Signed-off-by: Tai Le Manh commit 8313f0270246aca9d5a77c2b80f3d60b53e5bae6 Author: Tai Le <49281946+tlm365@users.noreply.github.com> Date: Tue Feb 20 01:51:13 2024 +0700 [Kernel][Java to Scala test conversion] Convert `TestDeltaTableReads` written in Java to Scala Resolves #2637 Signed-off-by: Tai Le Manh commit 4dde5a74bb2e85d9005bfef21c02897291846261 Author: Allison Portis Date: Fri Feb 16 13:27:45 2024 -0800 [Kernel] Support getting snapshot by version (#2607) commit 4ecfa451c0618a8a1a0d048aa33f9bd031b8b0f2 Author: Venki Korukanti Date: Wed Feb 14 12:56:53 2024 -0800 [Kernel] Parquet writer `TableClient` APIs and default implementation (#2626) Add the following API to `ParquetHandler` to support writing Parquet files.
```
/**
 * Write the given data batches to a Parquet files. Try to keep the Parquet file size to given
 * size. If the current file exceeds this size close the current file and start writing to a new
 * file.
 *
 * @param directoryPath Path to the directory where the Parquet should be written into.
 * @param dataIter      Iterator of data batches to write.
 * @param maxFileSize   Target maximum size of the created Parquet file in bytes.
 * @param statsColumns  List of columns to collect statistics for. The statistics collection is
 *                      optional. If the implementation does not support statistics collection,
 *                      it is ok to return no statistics.
 * @return an iterator of {@link DataFileStatus} containing the status of the written files.
 * Each status contains the file path and the optionally collected statistics for the file
 * It is the responsibility of the caller to close the iterator.
 *
 * @throws IOException if an I/O error occurs during the file writing. This may leave some files
 *                     already written in the directory. It is the responsibility of the caller
 *                     to clean up.
 * @since 3.2.0
 */
CloseableIterator writeParquetFiles(
    String directoryPath,
    CloseableIterator dataIter,
    long maxFileSize,
    List statsColumns) throws IOException;
```
The default implementation of the above interface uses `parquet-mr` library. ## How was this patch tested? Added support for all Delta types except the `timestamp_ntz`. Tested writing different data types with variations of nested levels, null/non-null values and target file size. ## Followup work * Support 2-level structures for array and map type data writing * Support INT64 format timestamp writing * Decimal legacy format (always binary) support * Uniform support to add field id for intermediate elements in `MAP`, `LIST` data types. commit 665aa7def029ea86419242e185c5a58c6f82e1a5 Author: Felipe Pessoto Date: Tue Feb 13 14:48:27 2024 -0800 [Docs] Fix readme test command - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (fill in here) The command in readme is broken since we added Scala API for optimize.
Tested locally No Closes delta-io/delta#2620 Signed-off-by: vkorukanti GitOrigin-RevId: 71000fadb2ed417f0b8aa1f52da83c01a5389fdc commit 0602ae8d5323b3845f6d9417349a24f43dd7b636 Author: Ala Luszczak Date: Tue Feb 13 15:59:29 2024 +0100 Normalize column names in the nested fields when writing Function `SchemaUtils.normalizeSchemaNames()` is used during a Delta write. It was intended to correct the case of names of any top-level fields that differed between the input schema and the table. If the case of any nested field differed, the function was supposed to raise an exception. However, due to a long-standing bug, the function could instead ignore the difference in the nested fields. This results in data corruption. While the column values are written into the Delta table, the stats for these columns are not gathered correctly. Instead, the stats are recorded as if the columns were missing in the input, that is: `minValue = null`, `maxValue = null`, `nullCount = rowCount`. This PR implements the full normalization of nested field names to fix this.
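The nested normalization described above can be pictured with a small sketch. This is not the code in `SchemaUtils`; it is an illustration under stated assumptions: schemas are modelled as plain Python dicts (field name to type string, or to a nested dict for a struct), and `normalize_field_names` is a hypothetical helper name. It only shows the idea of matching names case-insensitively at every nesting level and adopting the table's casing.

```python
def normalize_field_names(input_schema, table_schema):
    """Return a copy of input_schema whose field names use the table's casing.

    Hypothetical illustration: a schema is a dict mapping a field name to a
    type string, or to a nested dict when the field is a struct.
    """
    # Index the table's fields by lower-cased name for case-insensitive lookup.
    by_lower = {name.lower(): name for name in table_schema}
    normalized = {}
    for name, dtype in input_schema.items():
        table_name = by_lower.get(name.lower())
        if table_name is None:
            # Field unknown to the table: keep it untouched in this sketch.
            normalized[name] = dtype
            continue
        table_dtype = table_schema[table_name]
        if isinstance(dtype, dict) and isinstance(table_dtype, dict):
            # Recurse so that nested struct fields are normalized as well.
            normalized[table_name] = normalize_field_names(dtype, table_dtype)
        else:
            normalized[table_name] = dtype
    return normalized


table = {"id": "long", "address": {"City": "string", "zip": "string"}}
incoming = {"ID": "long", "Address": {"city": "string", "ZIP": "string"}}
result = normalize_field_names(incoming, table)
```

With names aligned at every level, per-column statistics can be looked up under the table's own field names rather than being recorded as if the nested columns were absent.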
Closes delta-io/delta#2569 GitOrigin-RevId: 8139d9c883b053d0ce4e27dd9a2db3214df1b5b5 commit bc519f094cd55c28f2476c4f10372485470ec53f Author: Prakhar Jain Date: Mon Feb 12 16:49:43 2024 -0800 Add ManagedCommit table feature and CommitStore interface Introduce ManagedCommit table feature, CommitStore interface Closes https://github.com/delta-io/delta/pull/2627 GitOrigin-RevId: e68448566e6c28f83b596fc4a28355eefeec2996 commit 3af14f0a72e7c16dd0d71e630417bec3dfa49121 Author: Lars Kroll Date: Fri Feb 9 20:03:54 2024 +0100 [Spark] Fix typo in ROW_INDEX_STRUCT_FIELD - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) Correct a typo: `ROW_INDEX_STRUCT_FILED` -> `ROW_INDEX_STRUCT_FIELD` N/A Closes delta-io/delta#2625 GitOrigin-RevId: 137522148d61557e282a519e3a2d3a22acbe64a7 commit 8db9617b59c9af76953f0c4efe20c89ee8fcd938 Author: Johan Lasperas Date: Thu Feb 8 12:13:15 2024 +0100 Fix inconsistent field metadata in MERGE -Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) Fixes an issue where internal metadata attached to fields leaks into the plan constructed when executing a MERGE command, causing the same attribute to appear both with and without the internal metadata. This can cause plan validation to fail due to the same attribute having two apparently different data types (metadata is part of a field datatype). Added test that would fail without the fix Closes delta-io/delta#2612 GitOrigin-RevId: 50688200c53f9450512b76ee2d375b2e55db8216 commit cc1660be33d774fbd05c8df7fc782f4ab18f8996 Author: Bo Gao Date: Wed Feb 7 15:36:11 2024 -0800 Updated error message for DELTA_SOURCE_TABLE_IGNORE_CHANGES Updated error message for DELTA_SOURCE_TABLE_IGNORE_CHANGES to provide more information on offending commit and a clearer guide for using `skipChangeCommits` option. 
Closes delta-io/delta#2590 GitOrigin-RevId: 5fe93175955aba8a489364d5c4bc9163347a6f73 commit 5b71a4372c7f29dca871544053c453ac01679d35 Author: Johan Lasperas Date: Wed Feb 7 12:25:16 2024 +0100 Add tests for schema evolution in MERGE with assignments qualified with target name -Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) Add a set of tests to capture the current behavior of schema evolution in MERGE when assignments are qualified with the target table name or alias, in particular when that qualifier conflicts with an existing column name, typ.: ``` -- target: key int, t struct MERGE INTO target t USING source s ON t.key = s.key WHEN MATCHED THEN UPDATE SET t.value = s.value ``` N/A, test-only change N/A Closes delta-io/delta#2597 GitOrigin-RevId: fcafe8d40ce078b957a162677710f9e537682cad commit a8074d307b6f514af53c79395339ae081cf2cd3d Author: Paddy Xu Date: Wed Feb 7 10:24:20 2024 +0100 DELTA_ICEBERG_COMPAT_V1_VIOLATION error message fix -Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) This PR fixes an unfriendly error message that shows the raw string form of a table feature object. Instead it should show the name of the table feature. No because too trivial. No. Closes delta-io/delta#2611 GitOrigin-RevId: 2af4f7feeb75c2a51b520320857e7ad54d48c994 commit b3d7e8647109801eccf9fe38f55fb1ad76df4546 Author: Prakhar Jain Date: Tue Feb 13 14:12:59 2024 -0800 [Protocol Change] SPEC change proposal for managed-commits Adds the proposal for spec change for Managed-Commits (see #2598) to the RFC folder. 
commit c12ef3b34b497e3499c8c17ad6c43dff9274e286 Author: Allison Portis Date: Tue Feb 13 11:38:07 2024 -0800 [Kernel] Support reading timestamp partition columns (#2608) commit 49f2625423e16a7d0f6c7892145c7b3eed329052 Author: Allison Portis Date: Wed Feb 7 17:09:32 2024 -0800 [Kernel] Refactor SnapshotManager to take logPath and dataPath as part of the constructor (#2613) commit 5545f284fe878ece072c8aeb716571b6e192d480 Author: Dhruv Arya Date: Wed Feb 7 16:22:40 2024 -0800 [Protocol Change] Add the In-Commit Timestamps spec change proposal (#2599) * add the in-commit timestamps spec change proposal * add changes to the cdc section, specify timestamp as of algorithm that considers delta.inCommitTimestampEnablementTimestamp * use long instead of timestamp, fix time travel section * remove references to UTC * fix table formatting and add positional info * refine terminology around time * add reference to DESCRIBE HISTORY, replace Unix time with Unix epoch * update rfc proposal list commit e7959fed22c2da3a3b769813c55406b1388cb78d Author: Tathagata Das Date: Wed Feb 7 17:28:20 2024 -0500 [Process] Added the RFC template docs [#2594] (#2601) * Added the RFC template docs commit bfc652a84762a1b96e6e800c5b8fb37f1dd9a0bf Author: Prakhar Jain Date: Tue Feb 6 13:06:24 2024 -0800 [Spark] Refactor OptimisticTransaction commitLarge API This PR refactors the OptimisticTransaction.commitlarge API. Currently the API is confusing as it expects metadata to be not passed from outside but protocol could be passed from outside. Closes https://github.com/delta-io/delta/pull/2605 GitOrigin-RevId: b6c16f9ca48b6a2e9bde967c78ee2d55c4dd08b7 commit 9dc83e6ffdf635f1adc29d72a1a9f24756d085d0 Author: Jiaheng Tang Date: Tue Feb 6 12:25:02 2024 -0800 [Spark] Show clusteringColumns in DESCRIBE DETAIL for clustered tables Support showing `clusteringColumns` in DESCRIBE DETAIL for clustered tables. 
Closes delta-io/delta#2603 GitOrigin-RevId: 7631ef051395cb488e2bd8c50a01324edf24385b commit 6ffcbe9de58f774b17ce6fd00d337b2111eb1b7b Author: Venki Korukanti Date: Mon Feb 5 17:22:26 2024 -0800 [Docs] Fix issue with `@return` tag in API docs not getting generated #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (docs) ## Description The current `unidoc` settings exclude the `@return` tag from the API docs. We should include it as it can contain expanded info about the returned value of the APIs. Example: [API doc](https://docs.delta.io/latest/api/java/kernel/io/delta/kernel/Scan.html#getScanFiles-io.delta.kernel.client.TableClient-) and the [code](https://github.com/delta-io/delta/blob/master/kernel/kernel-api/src/main/java/io/delta/kernel/Scan.java#L51) it is generated from. Manually verified locally. Closes delta-io/delta#2606 Signed-off-by: vkorukanti GitOrigin-RevId: 74aeca2d5ef9ffc8baa253a1132687a1a31a4674 commit bc4d34573e49e801a40b50cc09b388de43d4f8dc Author: Prakhar Jain Date: Mon Feb 5 11:04:48 2024 -0800 Fix typo around log compactions in Delta Spec This PR fixes a small typo in Protocol around log compaction files. Closes https://github.com/delta-io/delta/pull/2600 GitOrigin-RevId: fa55484ecc4e9e2067a1a4f5d325ddc1dedf9e53 commit 3650f5679d33f25b1b39d982dab0bb2130822f23 Author: Tom van Bussel Date: Mon Feb 5 19:26:54 2024 +0100 [Spark] Always materialize the merge source if it contains a UDF #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR fixes a correctness issue in merge due to the source query returning different results in job 1 and job 2. This issue is present when the source contains a non-deterministic UDF that has been marked as deterministic.
UDFs are often incorrectly marked as deterministic, and therefore we should not trust this information and instead always materialize the source if it contains a UDF Added a test to `MergeIntoMaterializeSourceSuite`. ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#2585 GitOrigin-RevId: 1b55acc398d8c8e36dd618ebba28c92ed3309222 commit 6f191387f4c2737ef3963b9be6baf30b37f150ae Author: Venki Korukanti Date: Mon Feb 5 16:41:50 2024 -0800 [Docs] Clarify JDK requirements for API docs generation commit dad51b50799e1158d6efedbb6376c3f3dd45916c Author: Carmen Kwan Date: Mon Feb 5 18:16:12 2024 +0100 [Spark] Improve exception in resolveQueryColumnsByName In this PR, we replace an assertion error with an actual exception. This improves the error message from an internal error in spark to a more actionable error indicating user error. The assertion comes from an assumption that the code in DeltaAnalysis is run after PreprocessTableInsertion (resolution), but this is not the case. We choose to duplicate the check instead of reordering the rules to have a smaller change. Closes delta-io/delta#2555 GitOrigin-RevId: 5c61f919852c72b28a944abf46b44f5de3e61ed6 commit 8427f8115e9924ab9119d937fc578e878e190030 Author: Sabir Akhadov Date: Mon Feb 5 13:52:49 2024 +0100 Include column names in the DELTA_INVALID_CHARACTERS_IN_COLUMN_NAMES exception Add column names to `DELTA_INVALID_CHARACTERS_IN_COLUMN_NAMES` exception. It is thrown whenever an invalid character is found in column names. Previously: ``` Found invalid character(s) among ' ,;{}()\n\t=' in the column names of your schema. Please use other characters and try again. ``` After this patch: ``` Found invalid character(s) among ' ,;{}()\n\t=' in the column names of your schema. Invalid column names: . Please use other characters and try again. 
``` Closes delta-io/delta#2592 GitOrigin-RevId: 52c9e5faf8806aa81b14f22eaa420a600d681823 commit 6f4e051972a8d9711820b110550e7e3f659c6814 Author: Jiaheng Tang Date: Fri Feb 2 13:10:51 2024 -0800 [Spark] Support ALTER TABLE CLUSTER BY This PR adds support for ALTER TABLE CLUSTER BY syntax for clustered tables: * `ALTER TABLE CLUSTER BY (col1, col2, ...)` to change the clustering columns * `ALTER TABLE CLUSTER BY NONE` to remove the clustering columns Closes delta-io/delta#2556 GitOrigin-RevId: 7cc2ff2abe6fdd1cba6150648c71f27fc7432be1 commit dc574eb9ae86ecaf45a9205c36dbe1d9105b736c Author: Scott Sandre Date: Thu Feb 1 18:50:33 2024 -0800 [Spark] Increase test heap size Increase spark test heap size from 4GB to 6GB to prevent `java.lang.OutOfMemoryError: GC overhead limit exceeded` Closes https://github.com/delta-io/delta/pull/2577 GitOrigin-RevId: b6b260e00eb5bf30466f31cad3140a055a65a1d4 commit 714d4ea889009ba49985daa3fff779913cd6565e Author: Venki Korukanti Date: Wed Jan 31 14:19:37 2024 -0800 [Release] Upgrade `version` to `3.2.0-SNAPSHOT` Set the new development version. Closes delta-io/delta#2586 GitOrigin-RevId: bdae745ac633e2c4a4666a3251115eafaec35174 commit fde8be3163b370cf8a9e33405a007e6121d386ec Author: Fred Storage Liu Date: Fri Feb 2 13:26:33 2024 -0800 Minor update to Delta Uniform doc specifying delta version that requires column mapping for REORG (#2595) commit 7874ac8f3f7f92b5d82f20f1dd6fa25b04603c6e Author: Ole Sasse Date: Wed Jan 31 20:27:24 2024 +0100 [Spark] Override Jackson string length limits #### Which Delta project/connector is this regarding? -Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Override the Jackson string length limit to allow reading and writing large json objects to and from the Delta Log. Added a new test suite ## Does this PR introduce _any_ user-facing changes? 
No Closes delta-io/delta#2575 GitOrigin-RevId: bc73b05ab24c86f28df43e621c201e3de90a7f1a commit fa25e20dfbb3f3235bff8306b309f58328c28b14 Author: Christos Stavrakakis Date: Wed Jan 31 16:37:30 2024 +0100 [Spark] Emit event for compaction errors #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Emit Delta event when auto compaction fails. Log-only change. ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#2584 GitOrigin-RevId: 0ec2ac8500714262a3f0276afaaebb40e139dcc4 commit 476cf66992d5d7072e3c8068664a2c41dd6220de Author: Christos Stavrakakis Date: Wed Jan 31 11:01:55 2024 +0100 [Spark] Add config to make post-commit hooks throw on error #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Currently most post-commit hooks, except manifest generation, suppress errors, i.e. they are logged but do not throw during the command that triggered them. Although this makes sense for production, it can hide issues in testing. This commit adds a config to make post-commit hooks always throw an exception. Existing tests provide sufficient coverage. ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#2583 GitOrigin-RevId: fd5fdd5d5c0916019ba209bdd5c09a4d6414c247 commit 9046622e2fde2689339acc1ca7a7628b5ca70d97 Author: Christos Stavrakakis Date: Tue Jan 30 21:25:24 2024 +0100 [Spark] Introduce tag delta.rowTracking.preserved tag - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) Introduce `delta.rowTracking.preserved` commit tag that should be set by transactions that preserve Row Tracking. This commit adds only the tag definition and helper code for manipulating tags to prepare for subsequent commits that will actually use it. Existing tests. 
N/A Closes delta-io/delta#2582 GitOrigin-RevId: 95abaff9b2701105a9cd15ac06d02d4964957f10 commit f164f85f8b5efd5d62d6ccb4b99665893cf1a04f Author: Ami Oka Date: Fri Jan 26 16:42:57 2024 -0800 Disallow setting clustering table feature through tblproperties It throws an error when we do set tblproperties (delta.feature.clustering="supported") through `CREATE TABLE` or `ALTER TABLE` Closes delta-io/delta#2548 GitOrigin-RevId: da25d62fa2b4346652b4ccc7c3880b03eadbebef commit ae5ce354f6b72221711a30079c53ba1674f9cae7 Author: Lin Zhou Date: Fri Jan 26 15:51:41 2024 -0800 Fix Delta Sharing DataFrame not updated for Snapshot Query When provider updated the table(insert or delete row), in delta-sharing-spark session, the dataframe on the same query for the same table is not updated, though a new rpc is made, the local delta log is updated, but it’s still at version 0. Closes delta-io/delta#2574 GitOrigin-RevId: 101e22404e1d3ee55f98bdc525c516a2157d0540 commit 60be17aef73fb49d01df2bdecb13ef82db8bc411 Author: Davin Tjong Date: Wed Jan 24 17:06:14 2024 +0000 [Spark] Fix SQLMetric usage Fix usage of SQLMetric not to use positive `initValue`, and to register these values with the AccumulatorContext as intended. Only in `WriteIntoDeltaLike.scala`. Before, we were incorrectly using a metric name as the metric type, where we want to set that on registration rather than construction. Closes delta-io/delta#2563 GitOrigin-RevId: a8518c05dadd2418e0e6470ac0da77007a22c2d1 commit c09bf019fdf86882bbf904572d009d09559ddfb4 Author: Lars Kroll Date: Wed Jan 24 17:46:22 2024 +0100 [Spark] Add messageParameters to DeltaAnalysisException #### Which Delta project/connector is this regarding? 
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Add the correct `messageParameters` mapping to `DeltaAnalysisException` (like we have for `DeltaIllegalArgumentException` for example), so that Spark's `checkError` actually tests them, rather than testing against an empty `Map`. Test-only PR. ## Does this PR introduce _any_ user-facing changes? No. Closes delta-io/delta#2566 Signed-off-by: larsk-db GitOrigin-RevId: f7434fa843612dd1f7c98890a2358064c76a84f3 commit 6d7063f87b93534a4f6f7e15b7a32febe4d995cf Author: Prakhar Jain Date: Tue Jan 23 16:58:32 2024 -0800 [Spark] Add frame profiling around createCheckpointV2ParquetFile Add frame profiling around createCheckpointV2ParquetFile Closes https://github.com/delta-io/delta/pull/2559 GitOrigin-RevId: 9a7ef3be9ece561f9d86a42978ffdbb8057a7efb commit 1b4d8877450e3e7fb9f168bb9daf68cc20bb2d87 Author: Rachel Bushrian Date: Tue Jan 23 11:46:40 2024 -0800 Upgrade Hadoop version to 3.3.4 #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Upgrade Hadoop version to 3.3.4 to align with the Spark version (https://github.com/apache/spark/blob/v3.5.0/pom.xml#L125). Fix #1935. Unit test - SUCCEEDED ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#2413 Co-authored-by: rbushrian Signed-off-by: Scott Sandre GitOrigin-RevId: 017ae2075e0165d7231896621ac5f2d1e91490e5 commit 6318f044f22d3adab66c2942b435566cb234a4f4 Author: Christos Stavrakakis Date: Tue Jan 23 20:07:22 2024 +0100 [Spark] Reorder checks in ConflictChecker #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Reorder checks in Delta Conflict Checker: - Move the check `checkIfDomainMetadataConflict` and `checkForUpdatedApplicationTransactionIdsThatCurrentTxnDependsOn` earlier to fail the transaction as early as possible.
- Move Row Tracking reconciliation before file-level checks to ensure that files do not have duplicate row IDs. Existing tests. ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#2454 GitOrigin-RevId: 26bf33ac2514ec42d9cc195f83c81ee05607d81a commit 162be16a3359844b9bf865a2dbea289d43516801 Author: Johan Lasperas Date: Tue Jan 23 08:55:02 2024 +0100 Fix Delta benchmarks: inaccessible Utils.median New Delta benchmarks for MERGE added in https://github.com/delta-io/delta/pull/1835 broke the benchmark build due to the Spark helper method `Utils.median` being inaccessible for two reasons: - Spark `Utils` object is package private to `spark` - Delta benchmarks use an ancient Spark version that doesn't even define `Utils.median`. This change upgrades the Spark version used in benchmarks to 3.5.0 and exposes `Utils.median` to the benchmark package. Compiled the benchmarks locally. Closes delta-io/delta#2554 GitOrigin-RevId: 3e8fcfd01899410b68210461707cc82dec83de22 commit 16c386b155218e8099832d8d60e8099576ccba9b Author: Prakhar Jain Date: Mon Jan 22 21:20:06 2024 -0800 Add missing Write PROTOCOL check to the Vacuum Command This PR adds Write PROTOCOL check to the Vacuum Command since it makes changes to delta directory. Closes https://github.com/delta-io/delta/pull/2557 GitOrigin-RevId: 92282898c139002146e7e1844089dc84df2ac8c3 commit 542ca62251c43bc9f3ffbe659d61732a4e6bcd8a Author: Lin Zhou Date: Mon Jan 22 12:37:04 2024 -0800 Log more info in DeltaFormatSharingSource Log more info in DeltaFormatSharingSource and throw error when server returns bad version. Closes delta-io/delta#2547 GitOrigin-RevId: ac9d0bfb5b278bab3de8d2a27fea197b85650b8c commit 4aab4d375bbbf283e8c53e4687fe06f4e3083f57 Author: Christos Stavrakakis Date: Mon Jan 22 14:23:38 2024 +0100 [Spark] Widen all UDFs during conflict checking #### Which Delta project/connector is this regarding?
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Delta conflict detection widens non-deterministic expressions before applying them to the changes of the winning transaction. Unfortunately, user-defined functions are marked as deterministic by default and customers need to explicitly mark them as non-deterministic. This can result in actually non-deterministic UDFs incorrectly being treated as deterministic. This commit makes conflict detection widen all UDFs to prevent customers from shooting themselves in the foot. Existing tests. ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#2553 GitOrigin-RevId: 851a127034b98fed886621491705e344b6f87c11 commit d013462136fcd133fa8b966414a38abda37abf14 Author: Dhruv Arya Date: Fri Jan 19 22:39:49 2024 -0800 Throw exception when CREATE OR REPLACE is run on existing catalog tables with missing delta_log Currently, CREATE OR REPLACE does not throw any exception when it is run on an existing table where the backing delta_log directory has been manually deleted. Even though we don't throw any exceptions during CREATE OR REPLACE, other queries don't work on these broken tables. This PR handles this scenario explicitly and throws an exception with the message: ``` [DELTA_METADATA_ABSENT_EXISTING_CATALOG_TABLE] The table `spark_catalog`.`default`.`delta_tbl` already exists in the catalog but no metadata could be found for the table at the path file:/tmp/spark-5103f24e-a80a-4cc0-b1a7-86926fe8ff4f/subdir/_delta_log. Did you manually delete files from the _delta_log directory? If so, then you should be able to recreate it as follows. First, drop the table by running `DROP TABLE `spark_catalog`.`default`.`delta_tbl``. Then, recreate it by running the current command again.
``` Closes delta-io/delta#2429 GitOrigin-RevId: cd123efad2d4789710de4ce0b6bc472b48402442 commit ef751d2364cadb594aae4f74285658d30dec59ca Author: Rajesh Parangi Date: Fri Jan 19 15:13:11 2024 -0800 [Spark]Add additional metrics regarding Vacuum to get better visibility This change adds additional metrics regarding Vacuum to get better visibility Closes delta-io/delta#2534 GitOrigin-RevId: d125486836b167c63fa0b5a0f3535fc4cfc04274 commit 4308771af65b6db55a25072288f9afeb37731157 Author: Fred Storage Liu Date: Wed Jan 31 14:16:58 2024 -0800 [Uniform] update documentation to explicitly enable column mapping before REORG update Uniform documentation for explicitly enable column mapping during REORG commit 50121df8f85210387e0c0bf2e037f1b6e10d891b Author: Nick Lanham Date: Tue Jan 30 18:39:35 2024 -0800 [Flink] Allow Flink connector to get snapshots via Kernel rather than standalone Allow the Flink connector to get a snapshot via the kernel rather than standalone. We do so by having a delegator class that can use kernel whenever possible but falls back to standalone when absolutely needed. The option io.delta.flink.kernel.enabled (defined as io.delta.flink.internal.DeltaFlinkHadoopConf.DELTA_KERNEL_ENABLED) can be set to enable this feature. commit 13739bf7740a9a2bee11d4e46dcab6e52d4fa8f3 Author: Dhruv Arya Date: Thu Jan 25 14:57:21 2024 -0800 [Doc] Add a link to the V2 Checkpoint specification in the DROP TABLE Feature doc Support for dropping V2 Checkpoints was added in Delta 3.1. This PR makes sure that the V2 Checkpoint spec is linked in the doc about DROP Table Feature. commit b77ff6b888ae5ab21985e8288298a8d87edf58b2 Author: Lin Zhou <87341375+linzhou-db@users.noreply.github.com> Date: Tue Jan 23 15:33:22 2024 -0800 [Spark][Sharing] Add doc for delta sharing Add doc for delta sharing with examples of commands. 
commit 53b4c6e5c24746ee17fedce8e8826fc6acbe5487 Author: Christos Stavrakakis Date: Tue Jan 23 18:29:00 2024 +0100 [Docs] Fix documentation for default columns Fix index to contain correct reference to delta-default-columns instead of delta-column-defaults and include default replacements. commit 97455f8c4c4500186a6ab67c5c78e11c222fc308 Author: Christos Stavrakakis Date: Tue Jan 23 18:28:28 2024 +0100 [Docs] Add docs for dropping table feature Update documentation to cover `ALTER TABLE DROP FEATURE ` command. commit 07a894d660374006b2369ff04a7122574bc6fce0 Author: Venki Korukanti Date: Mon Jan 22 14:15:15 2024 -0800 [Kernel] Update the usage docs to reflect the recent API changes (#2444) ## Description There have been a few changes to the API recently (e.g. delta-io/delta#2383) that need an update to the usage docs and updated API docs. ## How was this patch tested? View formatting/links in GitHub rendering. commit a936597e664f8ab3ebb8ae5d468b216abb958736 Author: Fred Storage Liu Date: Fri Jan 19 10:08:56 2024 -0800 [Docs] Update Delta Uniform documentation for Delta 3.1 The doc update captures several major upgrades for Delta Uniform in Delta 3.1 commit 19b9bdf428099e53820b015194205cc4df62cca9 Author: Venki Korukanti Date: Thu Jan 18 16:15:29 2024 -0800 [Release] Update `run-integration-tests.py` to install the generated pypi artifact ## Description Currently the `run-integration-tests.py` requires the PyPi artifact of delta-spark to be either on `test.pypi.org` or `pypi.org`. However, neither of these package repositories allows replacing an existing artifact with the same version due to obvious package maintenance rules. This is a problem when generating multiple release candidates during the pre-release testing. The fallback approach is to append `rc1` to the release version name (e.g. 3.0.0rc1), which requires regenerating the artifacts again for the final release without the `rc1` tag in the version.
If we want to use the same artifact that passed the release candidate testing for final release, we need to avoid the integration run script depending only on test.pypi.org or pypi.org. This PR updates the script to install the pypi packages that are generated locally and used as release candidate pypi artifacts. ### Generate the pypi artifacts ``` # check out a commit that changes version that is without SNAPSHOT suffix. pip3 install wheel twine setuptools --upgrade  (rm -r dist 2> /dev/null || true) && python3 setup.py bdist_wheel && python3 setup.py sdist ``` The above command should generate two artifacts under /dist folder ``` $ ls -l dist/ total 48 -rw-r--r-- 1 venkateshwar.korukanti ubuntu 21943 Jan 16 22:21 delta-spark-3.1.0.tar.gz -rw-r--r-- 1 venkateshwar.korukanti ubuntu 21003 Jan 16 22:21 delta_spark-3.1.0-py3-none-any.whl ``` ### Verify there is no `delta-spark` in current environment ``` $ pip3 show delta-spark WARNING: Package(s) not found: delta-spark ``` ### Run the integration test ``` $ python3 run-integration-tests.py --version 3.1.0 --maven-repo https://oss.sonatype.org/content/repositories/iodelta-1129 --use-localpypiartifact /home/venkateshwar.korukanti/delta/dist --pip-only ``` The script shows following output confirming the `delta-spark` installation from the given distribution directory ``` Found existing installation: pyspark 3.5.0 Uninstalling pyspark-3.5.0: Successfully uninstalled pyspark-3.5.0 Processing ./dist/delta_spark-3.1.0-py3-none-any.whl Collecting pyspark<3.6.0,>=3.5.0 (from delta-spark==3.1.0) Using cached pyspark-3.5.0-py2.py3-none-any.whl Requirement already satisfied: importlib-metadata>=1.0.0 in /home/venkateshwar.korukanti/.conda/envs/delta-release/lib/python3.8/site-packages (from delta-spark==3.1.0) (7.0.1) Requirement already satisfied: zipp>=0.5 in /home/venkateshwar.korukanti/.conda/envs/delta-release/lib/python3.8/site-packages (from importlib-metadata>=1.0.0->delta-spark==3.1.0) (3.17.0) Requirement already 
satisfied: py4j==0.10.9.7 in /home/venkateshwar.korukanti/.conda/envs/delta-release/lib/python3.8/site-packages (from pyspark<3.6.0,>=3.5.0->delta-spark==3.1.0) (0.10.9.7) Installing collected packages: pyspark, delta-spark Successfully installed delta-spark-3.1.0 pyspark-3.5.0 https://oss.sonatype.org/content/repositories/iodelta-1129 added as a remote repository with the name: repo-1 :: loading settings :: url = jar:file:/home/venkateshwar.korukanti/.conda/envs/delta-release/lib/python3.8/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml Ivy Default Cache set to: /home/venkateshwar.korukanti/.ivy2/cache The jars for the packages stored in: /home/venkateshwar.korukanti/.ivy2/jars ``` Closes delta-io/delta#2526 Signed-off-by: vkorukanti GitOrigin-RevId: 9664dde1003c5717a9c8bc4403cafac748cb3e98 commit 52a391beced7475376e8b2d343bca3351be8a85b Author: Venki Korukanti Date: Thu Jan 18 13:47:34 2024 -0800 [Spark][Sharing][Build] Add release settings to new `sharing` artifact Cherry-pick of 800065bf0553ace7dbc87ac0e83369f2dd358c58 from branch-3.1 to master Closes delta-io/delta#2537 Signed-off-by: vkorukanti GitOrigin-RevId: 3b8b7bf2a56ee0665c9c6d99fa4f3cb71a459b5d commit bf6ba5a0a3e369375f4c36ba1abfec22d27b833f Author: Andreas Chatzistergiou Date: Thu Jan 18 15:11:02 2024 +0100 Avoid redundant jobs in buildRowIndexSetsForFilesMatchingCondition. In `buildRowIndexSetsForFilesMatchingCondition` we join the results DataFrame with the candidate file list to fetch the Deletion Vector Descriptors. Currently, there is a sanity check there to verify we do not eliminate any rows due to path encoding issues. The sanity check compares before and after counts. The counts produce 2 extra jobs which can cause a significant performance overhead. For example, in the case of merge each job could be a full shuffle join between the source and the target. This PR eliminates the extra jobs while maintaining the sanity check.
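The commit message does not say how the extra jobs are removed, so the following is only a sketch of one way to keep such a check without separate counting passes: tally rows while the join itself runs. Everything here is an assumption made for illustration: plain Python lists and dicts stand in for the DataFrames, and `join_with_sanity_check` is a hypothetical name, not a Delta function.

```python
def join_with_sanity_check(results, candidates):
    """Join result rows to candidate files by path in a single pass.

    results: iterable of (path, row_indexes) pairs.
    candidates: dict mapping a file path to its deletion vector descriptor.
    Rows are counted during the join, so no before/after counting pass
    is needed to detect rows dropped by a path mismatch.
    """
    rows_in = 0
    joined = []
    for path, row_indexes in results:
        rows_in += 1
        if path in candidates:
            joined.append((path, row_indexes, candidates[path]))
    if len(joined) != rows_in:
        # A result row had no matching candidate, e.g. due to path encoding.
        raise ValueError(
            f"join dropped {rows_in - len(joined)} of {rows_in} rows"
        )
    return joined


candidates = {"part-0001.parquet": "dv-a", "part-0002.parquet": None}
results = [("part-0001.parquet", [3, 7]), ("part-0002.parquet", [1])]
joined = join_with_sanity_check(results, candidates)
```

The membership test uses `path in candidates` rather than the looked-up value, so a file whose descriptor is legitimately empty is still treated as a match.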
Closes delta-io/delta#2516 GitOrigin-RevId: e9f1c17c482be35979a64b2ee5e72c4562c084d5 commit 10598647aff28e23cfe2300d4970cc83f6dd976a Author: Allison Portis Date: Wed Jan 17 16:29:41 2024 -0800 [Kernel] Downgrade SLF4J version to 1.x to avoid compatibility issues with connectors #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description Downgrades SLF4J version to 1.x. Version 2.x includes additional APIs that are unsupported in lower versions and should not be used to avoid compatibility issues. Ran some unit tests and checked the output logs locally. Closes delta-io/delta#2527 GitOrigin-RevId: 3ba616e632fe427057b02414546698d9b063d9a3 commit 4d2444b111d3fac1bf2c5994f771ea59b95b60e7 Author: Nick Lanham Date: Wed Jan 17 11:42:04 2024 -0800 [Flink] Changes preparing for flink-on-kernel A few needed changes to build scripts in preparation for merging https://github.com/delta-io/delta/pull/2473 Closes delta-io/delta#2524 GitOrigin-RevId: 8ec5560ebefde4b5c7edaac39569b404f9f79a8b commit ba758a6660d727f5aaa4c7262292e675bca1753d Author: Andreas Chatzistergiou <93710326+andreaschat-db@users.noreply.github.com> Date: Thu Jan 18 17:23:35 2024 +0100 Update Deletion Vectors doc to include UPDATE and MERGE Update Deletion Vectors doc to include deletion support for UPDATE and MERGE. commit 531bae9895506b1a4d0d4369a79fcaee229d9e37 Author: Allison Portis Date: Wed Jan 17 16:28:53 2024 -0800 [Kernel] Add additional integration tests for data skipping, partition pruning, and column mapping mode ID (#2533) commit e3d00b69bee88274c03cb207a9665d57c1b8992e Author: Venki Korukanti Date: Wed Jan 17 09:39:45 2024 -0800 [Docs] Update the docs to refer to the 3.1.0 version A few updates to get the docs ready for the 3.1.0 release.
commit 3e7a22f69e051dbd4b9a94c9bdcee4616bd3ef49 Author: Gabriel Russo Date: Wed Jan 17 10:53:33 2024 +0100 [Spark] Make sure files are writable in EvolvabilitySuiteBase -Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) When running tests, in some circumstances the files might be read-only. We make sure to make all the files inside our temporary directory writable so the tests can write to tables. Existing tests No Closes delta-io/delta#2517 GitOrigin-RevId: c489eecbcd5fa6f597dcedc17b0be4fb792c07d8 commit cd58b8c61bb5d690f2d0240325c4a2fa460020d3 Author: Hao Jiang Date: Tue Jan 16 23:08:58 2024 -0500 Enable Column mapping in metadata before Cluster table fetch column names This PR allows UniForm to work together with Liquid table by enabling column mapping before Liquid table read the column names. Closes delta-io/delta#2521 GitOrigin-RevId: d05beeb5e738bd072f200d67be6a6c871492b98c commit 007195d3e1990467582e426fcb5e14ddb7f75dcc Author: Jiaheng Tang Date: Tue Jan 16 14:03:04 2024 -0800 [Spark] Add Clustering example Add Clustering example Closes delta-io/delta#2520 GitOrigin-RevId: 895b82e456dd1cd26c7e6b13d84f4ad4428b9913 commit 0fbc0e15a817515bb0f82b6914b4bed9daf412e9 Author: Christos Stavrakakis Date: Tue Jan 16 16:32:53 2024 +0100 [Spark] Abbreviate long strings in DML stats # #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description The stats of DML commands contain the raw expression strings that can be arbitrarily long. The result is that the JSON blob is truncated at 16K and is not parseable. This commit fixes this issue by truncating the long string fields. Existing tests. ## Does this PR introduce _any_ user-facing changes? 
No Closes delta-io/delta#2514 GitOrigin-RevId: 25d163dbd59c41ff4c5fb9c367d37e5669f60b21 commit 52a5f6dd1e5b2bc82efef21a3a54012223f30a3b Author: Hyukjin Kwon Date: Tue Jan 16 20:10:12 2024 +0900 [Spark] Uncomment `assume`s in DeltaTableCreationTests #### Which Delta project/connector is this regarding? -Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR uncomment valid `assume`s in `DeltaTableCreationTests.scala`. The tests should be skipped on Windows. Manually. ## Does this PR introduce _any_ user-facing changes? No, test-only. Closes delta-io/delta#2465 GitOrigin-RevId: 2cbdbcabd70730f5ab9348a0283e4c3a2e407cec commit 39aac66d19b91f3d868e70e673c0c934fcc61ac6 Author: Fredrik Klauss Date: Tue Jan 16 12:06:38 2024 +0100 [Spark] Widen non-deterministic predicates during partition-level conflict checking During conflict checking we use the partition predicates of the transaction to see if any newly added files of the winning transaction should have been read by the current transaction. This causes a re-evaluation of non-deterministic predicates. This can lead to not detecting conflicts, even though they should have been detected. To fix this, we widen non-deterministic predicates to `TrueLiteral`s, to always conflict with more files. Closes delta-io/delta#2363 GitOrigin-RevId: 4745ce2f36caad724171a282e22da87e0fc68594 commit 213a739ab9dbd4efaa2c36ef62230312d138fefe Author: tangjiafu Date: Mon Jan 15 21:25:17 2024 +0000 [Spark] Fix Implicit definition should have explicit type #### Which Delta project/connector is this regarding? -Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR fixes `Implicit definition should have explicit type` in Scala 2.13. Manually tested with Spark build. ## Does this PR introduce _any_ user-facing changes? No. 
Closes delta-io/delta#2509 GitOrigin-RevId: 826a1aff5e86b1c31159f0b7820645e2a9179896 commit f6c3d924ebcd2ede6557e6253e7bc22606ddad83 Author: Paddy Xu Date: Mon Jan 15 17:37:21 2024 +0100 [Spark] Emphasize paths should (or should not) be escaped in DV helper This PR renames `stringToPath` and `pathToString` methods, this will emphasize the input/output string should/will be escaped. We also added a third new method `unescapedStringToPath`. which is used by almost all calls from tests. Refactoring PR. Closes https://github.com/delta-io/delta/pull/2510. GitOrigin-RevId: 3cc9847b0531f84815172cc067054cfa4b40844f commit ad650ac51d00c0b06caec0663e50736e4cf4af30 Author: Jiaheng Tang Date: Tue Jan 16 19:42:02 2024 -0800 [Docs] Add liquid clustering preview documentation This PR adds documentation about Liquid clustering preview in Delta Lake 3.1.0. Co-authored-by: Terry Kim commit 75c7e9ee0507580089efff3c26b9b2cd3dd06ebd Author: Allison Portis Date: Tue Jan 16 17:24:14 2024 -0800 [Kernel][Logging] Don't use the fluent logging API in SLF4J (#2523) commit 24f61ec3dbb0c64253b8d6d1ccb4d4479ae1fada Author: Dhruv Arya Date: Tue Jan 16 15:39:17 2024 -0800 [Doc] Add doc describing table schema/properties sync to HMS (#2505) https://github.com/delta-io/delta/pull/2409 added table metadata sync (to HMS) to Delta Lake. This PR adds some documentation about this feature. commit 494f2b2d414bce7a6e81b8447d2d635a1415a141 Author: Wei Luo <143362963+weiluo-db@users.noreply.github.com> Date: Tue Jan 16 15:33:39 2024 -0800 Add docs for optimized write (#2452) Optimized write feature was added by https://github.com/delta-io/delta/pull/2145. This PR adds the corresponding documentation for the feature. 
Co-authored-by: Venki Korukanti commit 8ae8fb3fc6eaac5b98a01a8165c332289f2c3adf Author: Daniel Tenedorio Date: Fri Jan 12 15:54:23 2024 -0800 Add docs for default column values (#2488) * commit * respond to code review comments * respond to code review comments commit e2ae08c701b332e621d964f6cabf6818bd19a35e Author: Venki Korukanti Date: Fri Jan 12 15:44:09 2024 -0800 [Kernel] Fix issue with column mapping and partition pruning ## Description Currently, the column name in the logical schema is used for lookup. For column mapping mode enabled tables, partition values are stored with physical names. ## How was this patch tested? Added tests for pruning with `id` and `name` modes commit db8b594ac26d2fd87fb25fe7cd5da484de2ab27f Author: Venki Korukanti Date: Fri Jan 12 15:21:56 2024 -0800 [Kernel] Fix javadoc generate error ## Description Due to a merge conflict, this error didn't occur in the CI of PR that introduced the issue. https://github.com/delta-io/delta/actions/runs/7507859889/job/20442273564?pr=2497 ``` [error] /home/runner/work/delta/delta/kernel/kernel-api/src/main/java/io/delta/kernel/internal/skipping/DataSkippingPredicate.java:25:1: error: type arguments not allowed here [error] * A {@link Predicate} with a set of {@link Set} of columns referenced by the expression. [error] ``` It looks like we can't have a reference to generics class with a particular type in javadoc due to [type erasure with generics](https://stackoverflow.com/questions/9482309/javadoc-bug-link-cant-handle-generics) ## How was this patch tested? 
build succeeds locally `build/sbt clean kernelGroup/publishM2` commit 21a5c48a8f78c3b98f9b3119ab6c3df88011682f Author: Allison Portis Date: Fri Jan 12 13:34:19 2024 -0800 [Kernel][Data skipping] Prune the statistics schema based on the columns used in the generated data skipping filter (#2474) commit 2774fbbeeea35ea6aac3d22266c3ff14b93d0556 Author: Venki Korukanti Date: Fri Jan 12 11:58:53 2024 -0800 [Kernel] Run Kernel integration test/examples as part of the CI We have been changing the Kernel APIs which may cause the examples to fail. Currently, examples/integration can only be run manually. This PR runs the integration as part of the PR CI job so that we can catch any failure at PR time. Delta CI Job Closes delta-io/delta#2494 GitOrigin-RevId: c6ea78a12cc294d077ad75b6e17f88bc74100214 commit e69a4d307032ce0ed1ff5c97d0bac9a571eb5931 Author: Lin Zhou Date: Fri Jan 12 11:04:23 2024 -0800 [Spark][Sharing] Adds deletion vector support for "delta format sharing" - Extends PrepareDeltaScan to PrepareDeltaSharingScan, to convert DeltaSharingFileIndex to TahoeLogFileIndex. - Update DeltaSparkSessionExtension to add the rule of PrepareDeltaSharingScan - Added unit test in DeltaSharingDataSourceDeltaSuite Closes delta-io/delta#2480 GitOrigin-RevId: 816ae9b4c9409f301690e205621ed252848cbb5b commit 11cd832880adb85c019520b21513e820ca6762a1 Author: Sabir Akhadov Date: Fri Jan 12 18:44:47 2024 +0100 [Spark] CreateOrReplace table command should not follow dynamic partition overwrite option CreateOrReplace table command accidentally followed DPO semantics when replacing an existing table. This PR adds writeOptions to the write command to differentiate a replace command from other inserts.
Closes delta-io/delta#2482 GitOrigin-RevId: f1085d3a34775d93949b55ec52f2f2e0af09bcc6 commit 68d56e98d6e128419c4b648481883373139225a4 Author: Fred Storage Liu Date: Fri Jan 12 08:39:32 2024 -0800 [Spark][Uniform] Auto upgrade Delta protocol for REORG UPGRADE UNIFORM cmd Closes delta-io/delta#2485 GitOrigin-RevId: 93d502ca11e47a03d2be1a03a0821d45edad5836 commit 42dea93574fa27fcaedb5b333bbea33ce3edb202 Author: Nick Lanham Date: Fri Jan 12 11:47:25 2024 -0800 [Docs] Add auto-compact docs Adding docs for auto-compact commit 295e39b0a6e7d59221f227f190e2315bdd360945 Author: Venki Korukanti Date: Fri Jan 12 09:41:44 2024 -0800 [Kernel] Fix the integration test failures due to recent changes ## Description Recent data-skipping feature support that went into the master requires few changes in the expected results of integration tests. ## How was this patch tested? Ran locally. Will follow up with a PR to enable this as part of the CI. `delta/kernel/examples/run-kernel-examples.py --use-local` commit c4d1618923ed84f8b4321198d31f7cd56fcc3c97 Author: Allison Portis Date: Thu Jan 11 19:00:13 2024 -0800 [Kernel][Data skipping] Generate data skipping filters when possible for IS_NOT_NULL (#2478) commit 04096f624e8bf1cc0c8ca8126000c176d79af4e7 Author: Hao Jiang Date: Thu Jan 11 18:27:03 2024 -0800 [Spark][Uniform] Fix REORG APPLY UNIFORM This change fixes two problems found in REORG APPLY UNIFORM INSERT into a table enabling IcebergCompatV2 triggers NullPointerException. REORG APPLY UNIFORM does not rewrite files Closes delta-io/delta#2486 GitOrigin-RevId: 54765a9285ef680dbecdd230a7fcac83c0da8f16 commit 5a2eaa39d23350b20be502f5eecb5d36b46750fd Author: Jiaheng Tang Date: Thu Jan 11 17:37:53 2024 -0800 [Spark] Add comments to DeltaTable.forName to clarify views are not supported DeltaTable.forName doesn't work with views and passing in view name will always result in: ``` AnalysisException: xxxx is not a Delta table. 
``` This PR updates the comments for `forName` API to avoid confusion. Closes delta-io/delta#2479 GitOrigin-RevId: 05d1bba13436a563f7a80945e0350a0b57a73181 commit a840be1f8bb5452aceb5161d29159c79fa915494 Author: Lin Zhou Date: Thu Jan 11 16:46:37 2024 -0800 [Spark][Sharing] Adds streaming and column mapping support for "delta format sharing" Adds Streaming support for "delta format sharing", and add column mapping test - DeltaSharingDataSource with streaming query support - DeltaFormatSharingSource - DeltaFormatSharingSourceSuite/DeltaSharingDataSourceCMSuite Closes delta-io/delta#2472 GitOrigin-RevId: c7ab3cbd2774eac1bee131b55e26d3a510e14e35 commit 70722af541ba23795e45959ffdf5771fc36792b3 Author: Dhruv Arya Date: Thu Jan 11 12:41:02 2024 -0800 Remove disclaimer about the Delta protocol still being a draft The protocol still has an old (~4 year old) disclaimer towards the top claiming that it is an in-progress draft. This PR removes that line. Closes https://github.com/delta-io/delta/pull/2483 GitOrigin-RevId: f82f2d50ff10df885f74aae0392d6c5991fe72e6 commit f638c91030cc250d399d0f7b21b82aa2a06b15ea Author: Allison Portis Date: Thu Jan 11 13:09:47 2024 -0800 [Kernel][Data skipping] Generated data skipping filters when possible for OR and NOT(OR) (#2477) commit e1fa1011e83f499a807401a768458fe7537f6495 Author: Paddy Xu Date: Thu Jan 11 12:21:38 2024 +0100 [Spark][UPDATE with DV] Let UPDATE command write DVs by default #### Which Delta project/connector is this regarding? -Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR flips the flag for enabling write DVs for UPDATE commands. Actually, this PR should have been created much earlier because all required works (code, test, benchmarking) are done. Existing tests. ## Does this PR introduce _any_ user-facing changes? Yes, after this PR all UPDATE commands will write DVs when possible. 
Closes delta-io/delta#2456 Signed-off-by: Paddy Xu GitOrigin-RevId: f8e4cc7ea09a1bf14bfabdc71ca61764133e38b5 commit 5dbee87a38b00d34097b12272aadda8d3d98d116 Author: Dhruv Arya Date: Wed Jan 10 17:11:37 2024 -0800 [Spark] Temporarily disable two flaky tests from UpdateCatalogSuite Temporarily disables two tests that have made builds flaky. This will unblock the PRs that are blocked due to this. I will re-enable these once they have been fixed. Closes https://github.com/delta-io/delta/pull/2476 GitOrigin-RevId: c46a3342115f86e40851c3444c4981f3afced093 commit a280a8c333d1699e862ded7e7bf9881df52f7dd1 Author: Prakhar Jain Date: Wed Jan 10 15:46:25 2024 -0800 [Spark] Refactor conflict detection code and pass fileStatus to Conflict Checker Closes delta-io/delta#2423 GitOrigin-RevId: 4bd63ad92bcf534bb1bec95113e8009f41767bc3 commit d18a441fcffbeb5d0a3f542b68556588c582635b Author: Nick Lanham Date: Wed Jan 10 14:59:29 2024 -0800 [Spark] Adds code to support Auto Compaction. Auto compaction combines small files within partitions to reduce problems due to a proliferation of small files. Auto compaction is implemented as a post commit hook, and so occurs after the write to a table has succeeded. It runs synchronously on the cluster that has performed the write. You can control the output file size by setting the spark.databricks.delta.autoCompact.maxFileSize. Auto compaction is only triggered for partitions or tables that have at least a certain number of small files. You can optionally change the minimum number of files required to trigger auto compaction by setting spark.databricks.delta.autoCompact.minNumFiles. This PR creates a post commit hook, which runs an `OptimizeExecutor` (from `OptimizeTableCommand.scala`), which will do the compaction. We add a post-commit hook in TransactionalWrite, that will check if auto-compaction is needed. If the configs are set such that the write meets the criteria (i.e. 
AC is enabled, enough small files exist, etc) then partitions that meet the criteria will be reserved, and used to make an OptimizeExecutor targeting those partitions, and with the appropriate config values. This runs and will compact the files. Partitions are then released for future compactions to consider. AutoCompact is *disabled* by default There are a number of new configs introduced by this PR, all with prefix spark.databricks.delta.autoCompact. Through a lot of experimentation and user feedback, we found these values to work well across a large range of tables and configurations. - `autoCompact.enabled`: should auto compaction run? (default: false) - `autoCompact.maxFileSize`: Target file size produced by auto compaction (default: 128 MB) - `autoCompact.minFileSize`: Files which are smaller than this threshold (in bytes) will be grouped together and rewritten as larger files by the Auto Compaction. (default: half of maxFileSize) Closes delta-io/delta#2414 GitOrigin-RevId: b0fb8f09a5c13d2ac14a907e9e872646354e7539 commit 9962dd0fa9bc5bab6e81a3328eb53b632e6cd071 Author: Allison Portis Date: Wed Jan 10 22:29:00 2024 -0800 [Kernel][Data skipping] Generate data skipping filters when possible for NOT expressions (#2475) commit 9b93ed0c5d1338a87fcb920f5cff241ee25e7e28 Author: Kaiqi Jin Date: Wed Jan 10 13:11:38 2024 -0800 Implement UPGRADE UNIFORM (ICEBERG_COMPAT_VERSION) command This PR is the implementation of REORG TABLE command for uniform tables to rewrite the parquet files in iceberg compatible way. Previous sql syntax is REORG TABLE table_name [WHERE predicate] APPLY (PURGE). 
We added the UPGRADE UNIFORM option to this command; the REORG command now also supports the syntax REORG TABLE table_name APPLY (UPGRADE UNIFORM (ICEBERG_COMPAT_VERSION = version)) Closes delta-io/delta#2467 GitOrigin-RevId: 3a906221689c4dcfed402746b13f8fa9a67f11fe commit 701b2535f06415dc3e00d8b7e090c67bd0fb9351 Author: Prakhar Jain Date: Wed Jan 10 12:09:01 2024 -0800 [Spark] Refactor: Move lockInterruptibly to SnapshotManagement class Move def lockInterruptibly to SnapshotManagement class as this is where it is used. Closes https://github.com/delta-io/delta/pull/2431 GitOrigin-RevId: 0627069552284f72cebb9ffd55259f5eba0404d9 commit 41469df96b7cda9a128e281f6ecff31e4e65b1fc Author: Scott Sandre Date: Wed Jan 10 11:26:16 2024 -0800 If you look closely, you'll see that the error test this PR deletes already exists directly above it. This PR cleans this up and deletes the bottom duplicate test. Closes delta-io/delta#2463 GitOrigin-RevId: 821b91853cd229538e2f2e8e4439f5ce7a1438ce commit 153f374f314de1f0ab2e66e074ead12031be39e8 Author: Lin Zhou Date: Wed Jan 10 08:04:00 2024 -0800 [Spark][Sharing] Adds cdf support for "delta format sharing" Adds cdf support for "delta format sharing", this is the third PR of issue #2291 - DeltaSharingDataSource with cdf query support - DeltaSharingDataSourceDeltaSuite Closes delta-io/delta#2457 GitOrigin-RevId: 2482f48c1344db14ecc79137ba1ea675820c83eb commit 8e3db4db791459eb2b312e294038e38162510fb0 Author: Hao Jiang Date: Wed Jan 10 01:26:52 2024 -0800 [Uniform] Trigger synchronous Iceberg Conversion when enabling UniForm This PR triggers a sync Iceberg conversion when enabling UniForm Closes delta-io/delta#2466 GitOrigin-RevId: 8161fc3fa4e4aa5c0e0192b336038439b2e435be commit 3501c0c6bb2a74d79d91ae59695c4285fc074e93 Author: Jiaheng Tang Date: Tue Jan 9 22:41:25 2024 -0800 [Spark] Integrate domain metadata with OPTIMIZE Resolves https://github.com/delta-io/delta/issues/2451 Integrates domain metadata with OPTIMIZE to get the clustering
columns from DomainMetadata and perform clustering. Disallow specifying partition predicates and ZORDER BY for a clustered table. Introduce a new test suite `ClusteredTableClusteringSuite` and add test for running optimize on a clustered table. Closes delta-io/delta#2461 GitOrigin-RevId: 9ab789acc9892daf95a8bbb0b71fcc3feb393d34 commit 96f77ab3b0692e75c015aeb64716fce50043e77f Author: Dhruv Arya Date: Tue Jan 9 17:15:55 2024 -0800 Add support for syncing Delta table schema and properties to HMS This PR adds the class `UpdateCatalog` which is used as a post commit hook and during table creation for syncing table schema and properties to HMS. The new behaviour in Delta Lake is that when `catalog.update.enabled` = `true`, the schema and properties of the Delta table will be synced to HMS during creation/replacement and whenever an update to the table updates either of the schema or properties (asynchronously, in a post-commit hook). Some new additions are: 1. New Config: `catalog.update.enabled` -> Setting this to true will enable schema sync. 2. New OSS Config: `catalog.update.threadPoolSize` -> Controls the size of the thread pool which is used to asynchronously update the catalog. 3. HMS-specific Quirk: Hive does not allow for the removal of the partition column once it has been added and OSS Spark always appends the partition columns to the end of the schema when it finds them in Hive. So, for converted tables with partitions, the column order returned by Hive is incorrect. 
Closes https://github.com/delta-io/delta/pull/2409 Co-authored-by: dhruvarya-db Co-authored-by: schatterjee6 GitOrigin-RevId: 984fa6f3304bf68e97757724c69a542205319b23 commit 1f7a475954aec6db0e64e5401f01905be196a845 Author: Allison Portis Date: Wed Jan 10 01:16:33 2024 -0700 [Kernel][Data skipping] Initial end-to-end support for data skipping for a small set expressions (AND, =, <, <=, >, >=) (#2433) commit b4e5d5c4532f9eb94edcbabe63e613619bf87003 Author: Lukas Rupprecht Date: Tue Jan 9 13:46:16 2024 -0800 [Uniform] Support List and Map columns in Uniform This PR adds support for List and Map columns in Uniform. To support these types, Delta column mapping needs to write additional field IDs to the parquet schema. List columns require one additional field ID for the 'element' subfield and Map columns require two additional field IDs for the 'key' and 'value' subfields inside the parquet file. These nested field IDs are added to the table schema during the generation of the IDs and physical names for column mapping. They are added to the parquet schema through a new class, `DeltaParquetWriteSupport`, that hooks into Spark's parquet write path and rewrites the parquet schema based on the additional field IDs. 
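The extra field IDs described above can be illustrated with a small, self-contained sketch. It is not the `DeltaParquetWriteSupport` code: the dictionary-based schema model, the helper names `assign_field_ids` and `_nested_ids`, the dotted-path keys, and the depth-first numbering starting at 1 are all assumptions made for illustration. It only shows the counting rule from the description: one additional ID for a list's `element`, two for a map's `key` and `value`.

```python
from itertools import count


def assign_field_ids(fields, ids=None, prefix=""):
    """Return {dotted_path: field_id} for each column and nested subfield.

    Hypothetical helper: IDs are handed out depth-first from a shared counter.
    """
    ids = ids if ids is not None else count(1)
    out = {}
    for f in fields:
        path = prefix + f["name"]
        out[path] = next(ids)                      # ID for the column itself
        out.update(_nested_ids(f["type"], ids, path))
    return out


def _nested_ids(dtype, ids, path):
    """Assign the additional IDs that a non-primitive type needs under `path`."""
    if not isinstance(dtype, dict):                # primitive type, e.g. "int"
        return {}
    out = {}
    if dtype["type"] == "array":                   # one extra ID: element
        out[path + ".element"] = next(ids)
        out.update(_nested_ids(dtype["elementType"], ids, path + ".element"))
    elif dtype["type"] == "map":                   # two extra IDs: key, value
        out[path + ".key"] = next(ids)
        out.update(_nested_ids(dtype["keyType"], ids, path + ".key"))
        out[path + ".value"] = next(ids)
        out.update(_nested_ids(dtype["valueType"], ids, path + ".value"))
    elif dtype["type"] == "struct":                # recurse into struct fields
        out.update(assign_field_ids(dtype["fields"], ids, path + "."))
    return out


# Illustrative schema (made-up column names).
schema = [
    {"name": "id", "type": "long"},
    {"name": "tags", "type": {"type": "array", "elementType": "string"}},
    {"name": "attrs", "type": {"type": "map", "keyType": "string", "valueType": "int"}},
]
field_ids = assign_field_ids(schema)
```

In this model a table with one primitive, one list, and one map column ends up with six IDs in total: three for the columns and three for the nested subfields.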
Closes delta-io/delta#2459 GitOrigin-RevId: 275cd6ec36d49bd078e5e4b7bd7c9bab871b0e1e commit 74a2b39308c15d0c28f85bd78aae486f6a9c2ee8 Author: Kaiqi Jin Date: Tue Jan 9 13:06:27 2024 -0800 [Uniform] Add REORG TABLE APPLY (UPGRADE UNIFORM ICEBERG_COMPAYT_VERSION) error message Preparation for ` REORG TABLE table_name APPLY (UPGRADE UNIFORM (ICEBERG_COMPAT_VERSION = version)`: adding new error messages Closes delta-io/delta#2450 GitOrigin-RevId: c61f3ea158811246dea0deef2a26d00866d46496 commit 0c2a6a17af93f7af0622fe3df9085b75d5b88a0b Author: Lin Zhou Date: Tue Jan 9 10:04:21 2024 -0800 [Spark][Sharing] Adds snapshot support for "delta format sharing" Adds snapshot support for "delta format sharing", this is the second PR of issue #2291 - DeltaSharingDataSource with snapshot query support - DeltaSharingDataSourceDeltaSuite - DeltaSharingDataSourceDeltaTestUtils/TestClientForDeltaFormatSharing/TestDeltaSharingFileSystem Closes delta-io/delta#2440 GitOrigin-RevId: a095445b6da809ee9a5b4ece7c38d04a172ff70f commit 2ef52a6298135c92b4b5f9a7c25059863bc516ff Author: Jiaheng Tang Date: Tue Jan 9 09:55:56 2024 -0800 [Spark] Store CLUSTER BY columns as DomainMetadata Resolves https://github.com/delta-io/delta/issues/2447 According to the [protocol](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#writer-requirements-for-clustered-table), store the CLUSTER BY columns as DomainMetadata. This completes the table creation path for clustered tables. 
Summary: * Introduce a new DomainMetadata called `ClusteringMetadataDomain` that tracks clustering columns * Extract the clustering columns passed down from the transform array and perform validations: * clustering column should have stats collected * number of clustering columns should not be greater than 4 (controlled by a config) * appending to a clustered table should have matching clustering columns * disallow replacing a clustered table with partitioned table * disallow dropping clustering columns * Introduce a `ClusteringColumn` to hold physical names in order to support Column Mapping Since clustered table support is still in preview, introduce a config to by default throw an exception when creating or writing to a clustered table. Closes delta-io/delta#2448 GitOrigin-RevId: 189b30448d606df335fa5c7af4b7a53d5208aefb commit 499eb4719abf0cd96fa2f43469ac9cbd436ae267 Author: Andreas Chatzistergiou Date: Tue Jan 9 17:15:01 2024 +0100 [Spark] Add Deletion Vectors metrics to Merge. This is a followup work of https://github.com/delta-io/delta/pull/2428. This PR adds deletion vector metrics to merge. In particular, it adds `numTargetDeletionVectorsAdded`, `numTargetDeletionVectorsUpdated`, `numTargetDeletionVectorsRemoved`. This is in line with the deletion vector metrics in DELETE and UPDATE. Furthermore, it adjusts metrics related to removed files. 
Closes delta-io/delta#2453 GitOrigin-RevId: 0ded3b867705b5eb3921b936d45519df5dbe40e1 commit 3b3d729e931772339d58d200ef130d05cd39466d Author: Fred Storage Liu Date: Mon Jan 8 23:11:38 2024 -0800 [Spark][Uniform] Support Delta Uniform statistics conversion to Iceberg Closes delta-io/delta#2439 GitOrigin-RevId: 1406baa237a9bb439a0676fa3c5bd61ea23da0b2 commit d485660f7bf9836dfcf2e41300f31702165bed3c Author: Hao Jiang Date: Mon Jan 8 23:04:49 2024 -0800 [Spark][Uniform] Fix Write Partition values after introducing IcebergCompatV2 This PR fixes a problem in writing partition values after introducing IcebergCompatV2 Closes delta-io/delta#2446 GitOrigin-RevId: bbcd1c5e46cd45727a519892cae22b57f458f544 commit 384e38b1ba7df3068e7213adb413528241573cb1 Author: Felipe Pessoto Date: Mon Jan 8 21:00:41 2024 -0800 Optimize Min/Max using Delta metadata ## Description Follow up of https://github.com/delta-io/delta/issues/1192, which optimizes COUNT. This PR adds support for MIN/MAX as well. Fix delta-io/delta#2092 Created additional unit tests to cover MIN/MAX. ## Does this PR introduce _any_ user-facing changes? Only performance improvement Closes delta-io/delta#1525 Signed-off-by: vkorukanti GitOrigin-RevId: 9b88f76bf99cc38bd4cf9d3397b7bb8ade822d0b commit 9ef9eb1dd47de3acc0984bd5fcc007ce51f6cdcd Author: Hao Jiang Date: Mon Jan 8 20:47:48 2024 -0800 Update Delta Protocol for IcebergCompatV2 This PR adds the new table feature `icebergCompatV2` to the Delta Protocol.
Closes delta-io/delta#2370 GitOrigin-RevId: 42f9138f6c0bfc0fa8e98d4161b6ccb252482536 commit 29f053304d823434acd6bdb9d222f0bbec29561d Author: Venki Korukanti Date: Mon Jan 8 12:57:15 2024 -0800 [Kernel] Revert running integration tests as part of the CI Part of delta-io/delta#2445 commit 28881f5710e7b0b0bf2d959e36a5b6ebd75c62ae Author: Venki Korukanti Date: Mon Jan 8 12:50:11 2024 -0800 [Kernel] Run Kernel integration tests as part of the CI job ## Description We have been changing the Kernel APIs which may cause the examples to fail. Currently, examples/integration can only be run manually. This PR runs the integration as part of the PR CI job so that we can cause any failure during the PR time. Also fixes an issue with the integration test run script when using the local version of Kernel code. ## How was this patch tested? CI job commit 19b99017bc8910751e75a37d8cd90e947d06bee0 Author: Hao Jiang Date: Mon Jan 8 08:47:35 2024 -0800 Use IcebergCompatV2 in UniForm This PR puts IcebergCompatV2 in action, allowing it to be used together with UniForm Closes delta-io/delta#2438 GitOrigin-RevId: 9f0a8d40a3af45351f4544dde339fcb769d20f3e commit 5a3466d5caa6cd2309531e3c29f9d7eeb2dd1f5f Author: Andreas Chatzistergiou Date: Sat Jan 6 21:56:15 2024 +0100 Add support for Deletion Vectors to Merge. This PR adds deletion vector support in Merge. It is part of a wider effort to speed up DML operations with Deletion Vectors (DVs). It builds on top of previous work: https://github.com/delta-io/delta/issues/1591 and https://github.com/delta-io/delta/issues/1923. The current implementation of merge is based on the Copy-on-Write (CoW) approach where touched files are rewritten entirely. This includes both the modified rows as well as the unmodified rows. On the other hand, deletion vectors allow a Merge-on-Read (MoR) approach where we "soft" delete the affected rows in the touched files and only rewrite the modified rows. The "soft" deleted rows are then filtered out on read. 
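The difference between the two approaches can be seen in a toy model. The sketch below is not Delta's merge implementation: files are plain Python lists of `(key, value)` rows, a deletion vector is modeled as a set of row indexes per file, and the function names `update_cow`, `update_mor`, and `read` are hypothetical. It only illustrates that CoW rewrites every row of a touched file, while MoR writes just the modified rows plus a soft-delete marker that is applied on read.

```python
def update_cow(files, matches, new_value):
    """Copy-on-Write: rewrite each touched file in full."""
    out, rows_written = {}, 0
    for name, rows in files.items():
        if any(matches(r) for r in rows):
            # Modified and unmodified rows are both written again.
            out[name + "-v2"] = [new_value(r) if matches(r) else r for r in rows]
            rows_written += len(rows)
        else:
            out[name] = rows
    return out, rows_written


def update_mor(files, matches, new_value):
    """Merge-on-Read: soft-delete old rows, write only the new versions."""
    dvs, new_rows = {}, []        # dvs: file name -> set of deleted row indexes
    for name, rows in files.items():
        hit = {i for i, r in enumerate(rows) if matches(r)}
        if hit:
            dvs[name] = hit
            new_rows.extend(new_value(rows[i]) for i in sorted(hit))
    out = dict(files)             # existing files are left untouched
    if new_rows:
        out["new-file"] = new_rows
    return out, dvs, len(new_rows)


def read(files, dvs=None):
    """Return all live rows, skipping positions marked in a deletion vector."""
    dvs = dvs or {}
    return sorted(
        r
        for name, rows in files.items()
        for i, r in enumerate(rows)
        if i not in dvs.get(name, set())
    )


# Illustrative data: update the row with key 2.
files = {"f1": [(1, "a"), (2, "b"), (3, "c")], "f2": [(4, "d"), (5, "e")]}
matches = lambda row: row[0] == 2
new_value = lambda row: (row[0], row[1].upper())

cow_files, cow_written = update_cow(files, matches, new_value)
mor_files, mor_dvs, mor_written = update_mor(files, matches, new_value)
```

Both paths return the same rows on read; in this toy example CoW writes three rows where MoR writes one row and one deletion-vector entry.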
The MoR approach can result in significant performance gains during writes at the cost of a small overhead on read. This is because, in the most common case, merge operations only touch a small portion of the data. The current implementation of merge consists of 2 jobs: `Job 1`: Finds touched files by joining the source and target tables. `Job 2`: Rewrites touched files. The new implementation splits job 2 into two parts: one for writing the modified rows and one for writing the deletion vectors. Overall, merge with DVs consists of the following jobs: `Job 1`: Finds touched files by joining the source and target tables. `Job 2.a`: Writes modified and new rows. `Job 2.b`: Writes deletion vectors for the modified rows. From a performance point of view, the extra job adds some overhead but only operates on the touched files produced by job 1 and only shuffles the columns required by the predicates. Furthermore, jobs `2.a` and `2.b` perform stricter joins. Closes delta-io/delta#2428 GitOrigin-RevId: 7bbd0c4d3f08d91df890f6900c76de3c5cc25748 commit b4ca3108c86716cdc942fde89a034667939e5f46 Author: Lin Zhou Date: Fri Jan 5 13:07:55 2024 -0800 Adds utils classes for "delta format sharing" Adds utils classes for "delta format sharing" - DeltaSharingUtils - DeltaSharingFileIndex - DeltaSharingLogFileSystem Closes delta-io/delta#2403 GitOrigin-RevId: ab2ddea623cf267eed305fd007a673c6d6be46c5 commit e1315b9626b9a953ad1eea7ce746a542c4581a36 Author: Venki Korukanti Date: Fri Jan 5 15:46:00 2024 -0800 [Kernel] Remove contextualization APIs ## Description Currently, Delta Kernel has `FileHandler.contextualize` and `FileReadContext` API/interfaces to allow the connectors to split the scan files into chunks. However, Kernel is the one initiating these calls. It should be in the control of the connector. By making the connectors control the contextualization/splitting, we can get rid of a few interfaces/APIs from Kernel, which simplifies the Kernel interfaces.
More details are [here](https://docs.google.com/document/d/11XJ_5x9A8g_tYSlfPHu4M4_sGnT40nVIOJs4a9HLi9c/edit?usp=sharing). ## How was this patch tested? Existing tests commit 3f1c4f323b355513e3e18d8b733cde608ec0aa5c Author: Hao Jiang Date: Fri Jan 5 10:31:33 2024 -0800 Add checking logic for IcebergCompatV2 This PR adds IcebergCompatV2 's check logic, including required table properties and features, data types, deletion vectors Closes delta-io/delta#2425 GitOrigin-RevId: 6e46a49a3aad043bd7c2312ff1e29a11c707f72e commit a66406bda46295911da5005ea21f17e72a8e248d Author: Allison Portis Date: Fri Jan 5 11:52:18 2024 -0700 [Kernel] Fix splitting of partition & data predicates (#2437) commit 12545e1372fda0af85ef350207986cf5b86c80ed Author: Scott Sandre Date: Fri Jan 5 10:30:24 2024 -0800 [Kernel] [#2252] Use a SnapshotHint to bound protocol and metadata loading when loading latest Snapshot (#2260) commit fdcfa62a31f99a5b217ead533e587172ca58937b Author: Felipe Pessoto Date: Thu Jan 4 12:30:36 2024 -0800 [Spark] Discard scoverage from delta-iceberg JAR. - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) Fixes delta-io/delta#2355 Unit tests Closes delta-io/delta#2356 Signed-off-by: Fred Storage Liu GitOrigin-RevId: 4882b1174045ef2732a08e9ee200fab104ab15bf commit 863a9756f021a9403268b05747047a55048ee93e Author: Prakhar Jain Date: Wed Jan 3 14:41:59 2024 -0800 [Spark] Add new checkpoint API which doesn't invoke metadata cleanup This PR adds a new checkpoint API which doesn't invoke metadata cleanup. 
Closes https://github.com/delta-io/delta/pull/2430 GitOrigin-RevId: e809c32c84534111f020c42a98d5e818960508a7 commit a280e39992c91e62fb3132b491a3153f1931b774 Author: Fred Storage Liu Date: Tue Jan 2 15:36:39 2024 -0800 write timestamp as int64 in parquet for Delta Uniform iceberg Closes delta-io/delta#2304 GitOrigin-RevId: b8254c02158a55441bedab8fae64dccd500ce3be commit 70a6354977cdcdc9b3c5d803dd59bf6fbab5af65 Author: Fred Storage Liu Date: Tue Jan 2 15:10:13 2024 -0800 Minor fix for Delta UniFormE2EIcebergSuite nested struct types unit test Closes delta-io/delta#2424 GitOrigin-RevId: 16af7cd84853875ce40ff276c07ce25da4fef8b1 commit b78252125dbfecf05e6356f991118459f8391035 Author: Hao Jiang Date: Tue Jan 2 16:30:27 2024 -0500 Refactor the enforceInvariant logic of IcebergCompatV1 This PR refactors IcebergCompatV1 's check logic to make it can be reused by IcebergCompatV2 Closes delta-io/delta#2422 GitOrigin-RevId: 7252fe3a11ab4fb58af105e9d0137fa4f55c42be commit ebd0e6f0cb4933e0800c7f6cc37897ac34030f3a Author: Venki Korukanti Date: Thu Jan 4 11:32:58 2024 -0800 [Kernel] Add an internal API to fetch the latest txn identifier ## Description This PR adds an internal API on `SnapshotImpl` to fetch the latest version of the transaction identifier for a given app id. For now, this API is internal to unblock the Flink upgrade to use Kernel. ## How was this patch tested? Unittest commit f61031a8185d23b1d0a0f344fdc78bc113f3a479 Author: Allison Portis Date: Wed Jan 3 22:34:16 2024 -0700 [Kernel][Expressions] Support COALESCE expression for boolean type arguments (#2415) commit de05f250e587b0294578545c0280830fb6899f4d Author: Venki Korukanti Date: Tue Dec 26 08:26:22 2023 -0800 [Spark] Fix an issue when reading a wide table with deletion vectors ## Description Querying a wide table (containing more than 100 leaf-level columns) with deletion vectors throws unsupported operation exceptions This is happening for wide tables which have more than 100 columns. 
When reading more than 100 columns, Spark's code generator decides (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L560, https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L541) not to use codegen. When codegen is not used, Spark sets options to get rows instead of columnar batches from the Parquet reader. This causes the [vectorized Parquet reader](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L189) to return a row abstraction over each column in the columnar batch. This row abstraction doesn't allow modification of its contents. Fix the issue by handling `ColumnarBatchRow`: copy the row and apply the updates to the copy. This is not efficient, but it affects only wide tables. More details at delta-io/delta#2246 Fixes delta-io/delta#2246 Added a unit test Closes delta-io/delta#2392 Signed-off-by: vkorukanti GitOrigin-RevId: b13b973dacd16a483f8a9610f1a87c3a1f5b57a3 commit 5f0986c2f82d8284b82eaa3d853b57e8eba1796a Author: Sabir Akhadov Date: Fri Dec 22 18:53:43 2023 +0100 Refactor WriteIntoDelta Introduce WriteIntoDeltaLike trait for better extensibility. Additionally, refactor `UniversalFormat.enforceInvariantsAndDependenciesForCTAS` to patch a configuration if needed instead of a `WriteIntoDelta` object.
Closes delta-io/delta#2391 GitOrigin-RevId: 69d6991df147bf0f14ff8b4d03fb10ac21c19c05 commit ad2620404d8fb4d89c37f3736298097d818969be Author: Hao Jiang Date: Fri Dec 22 01:13:40 2023 -0500 Upgrade IcebergCompat related error classes and error messages This PR update existing IcebergCompat related error messages to make it usable with both V1 and V2 Closes delta-io/delta#2401 GitOrigin-RevId: ea5e7a0bc3ab95bc883d8e607526494620d01184 commit dd64fe48416a45cfdc6f965909db086300e30ad3 Author: Fred Storage Liu Date: Thu Dec 21 13:47:31 2023 -0800 Fix Delta Uniform field id for nested types Iceberg CREATE_TABLE and REPLACE_TABLE reassigns the field id in schema, which made the field id in converted Iceberg schema different from whats in parquet file written by Delta. This fixes by setting Delta schema with Delta generated field id to ensure consistency between field id in Iceberg schema after conversion and field id in parquet files. Closes delta-io/delta#2388 GitOrigin-RevId: e6e2630854cf98d87b6e39f985b9a4e75269b3d2 commit 4c93afd869733287a0626925a461f819cb579597 Author: Ami Oka Date: Thu Dec 21 12:54:08 2023 -0800 Minor refactor to DeltaCatalog.scala and DeltaProtocolVersionSuite.scala Closes delta-io/delta#2400 GitOrigin-RevId: 15bcfe627a3ab79c52e429574879c84b34843d8a commit b8743edfc525f2377465f25b4d039f2c1b862cd7 Author: Hao Jiang Date: Thu Dec 21 15:10:45 2023 -0500 Add new table features and table properties for IcebergCompatV2 This PR introduces a new table feature and a new table property for IcebergCompatV2 Closes delta-io/delta#2372 GitOrigin-RevId: 029f3150e75ebf9d32b84ae6d7f63644e097a394 commit c0af792cfa5906874b8ea3c29fc5036f3fd77587 Author: Johan Lasperas Date: Thu Dec 21 12:19:58 2023 +0100 [Spark] Create MergeCDCCoreSuite to run only core CDC tests #### Which Delta project/connector is this regarding? 
- [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This is a small refactor to `MergeCDCSuite` to create a second, smaller suite `MergeCDCCoreSuite` that only runs CDC tests defined in that same file. This allows running a small number of tests with high CDC coverage instead of running the (very) large number of tests defined in `MergeIntoSQLSuite` every time. Running the tests ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#2385 GitOrigin-RevId: ceed6022e75ade4ea40359876e85c2453b624c6c commit 9f9993db8a51fd83168186bba24041feffdbb566 Author: Allison Portis Date: Wed Dec 27 17:15:42 2023 -0800 [Kernel][Data skipping] Add selection vector for JsonHandler::parseJson and support remaining types in DefaultJsonHandler (#2397) commit 09b6dc5fd1ccaf681e561e7acdae084520848ff8 Author: Allison Portis Date: Wed Dec 27 15:40:07 2023 -0800 [Kernel][Data skipping] Conditionally read the "add.stats" column during log replay (#2398) commit 025aa398a6d57a85ef18c4f0358fd20ef09f7d82 Author: Tathagata Das Date: Tue Dec 26 18:02:13 2023 -0500 [Docs] Adding all the markdown source files for docs.delta.io ## Description This PR adds all the source files for generating the docs.delta.io ## How was this patch tested? Currently manually tested by `python docs/generate_docs.py --livehtml --api-docs`. The instructions are in `docs/README.md`. Co-authored-by: vkorukanti commit 639e1a52b743b7eed3d74db7813f0c117c91a55e Author: Allison Portis Date: Wed Dec 20 19:19:57 2023 -0800 [Kernel][Expressions] Adds the NOT and IS_NOT_NULL expressions (#2360) commit d51ed4518a9e5e91073e763529ddafd894fe0b49 Author: Lin Zhou Date: Tue Dec 19 22:26:17 2023 -0800 [Spark] Adds util function getTotalSizeOfDVFieldsInFile in DeletionVectorStore Adds util function getTotalSizeOfDVFieldsInFile in DeletionVectorStore, so it's consistent across the usages. 
Closes delta-io/delta#2393 GitOrigin-RevId: addfa1a0bb5314bfccb29a29f6cc6f2d38e41650 commit ccba00506a628727481e221beee27fd6f97efec7 Author: Venki Korukanti Date: Tue Dec 19 20:22:49 2023 -0800 [Docs] Remove the existing docs scripts to make place for the new revamped docs ## Description Delta docs are getting revamped as part of the PR delta-io/delta#2352 which is adding new doc source code and combining both the docs on delta.io website and API docs generation together. Closes delta-io/delta#2394 Signed-off-by: vkorukanti GitOrigin-RevId: bf4732942bec6327343af2db665d49ad0753caf1 commit 29404299cf10e97e9efc3ef87a4c71c65d6606e2 Author: Wei Luo Date: Mon Dec 18 17:52:26 2023 -0800 [Spark] Implement Hilbert clustering This PR is part of https://github.com/delta-io/delta/issues/1874. This PR implements a new data clustering algorithm based on Hilbert curve. No code uses this new implementation yet. Will implement incremental clustering using ZCube in follow-up PRs. Design can be found at: https://docs.google.com/document/d/1FWR3odjOw4v4-hjFy_hVaNdxHVs4WuK1asfB6M6XEMw/edit#heading=h.uubbjjd24plb. Closes delta-io/delta#2314 GitOrigin-RevId: abafaa717ba8f7d8809114858c0fd2a25861fcb8 commit feb1258c0941745fa6ea52bd2f460617411af7ae Author: Johan Lasperas Date: Mon Dec 18 18:51:17 2023 +0100 [Benchmarks] Add Merge benchmarks This PR extends the existing benchmarks with new test cases dedicated to the MERGE INTO command, with two scale factors: 1GB & 3TB. The following type of test cases are created and can be extended in the future: - `SingleInsertOnlyTestCase` - `MultipleInsertOnlyTestCase` - `DeleteOnlyTestCase` - `UpsertTestCase` Each test case uses the same (cloned) target table and defines its source table using the following parameters: - `fileMatchedFraction`: The fraction of target files sampled to create the source table. - `rowMatchedFraction: The fraction of rows sampled in each selected target file to form the rows that match the `ON` condition. 
- `rowNotMatchedFraction`: The fraction of rows sampled in each selected target file to form the rows that don't match the `ON` condition.

The target and source tables are created using `merge-1gb-delta-load`/`merge-3tb-delta-load`, which collect all the source table configurations used in merge test cases and create the required source tables. This benchmark is added to measure the impact of a series of changes to the merge command, see https://github.com/delta-io/delta/issues/1827. I followed the instructions in https://github.com/delta-io/delta/tree/master/benchmarks to create an EMR cluster and run the new benchmarks. Here are the results comparing the impact of https://github.com/delta-io/delta/issues/1827:

| Test case | Base duration (s) | Test duration (s) | Improvement ratio |
| --- | --- | --- | --- |
| `delete_only_fileMatchedFraction_0.05_rowMatchedFraction_0.05` | 26.1 | 20.5 | 1.27 |
| `multiple_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_0.05` | 8.8 | 15.2 | 0.58 |
| `multiple_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_0.5` | 27.7 | 17.5 | 1.58 |
| `multiple_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_1.0` | 36.3 | 21.2 | 1.71 |
| `single_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_0.05` | 14.9 | 14.8 | 1.01 |
| `single_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_0.5` | 17.5 | 17.3 | 1.01 |
| `single_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_1.0` | 20.3 | 20.7 | 0.98 |
| `upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.01_rowNotMatchedFraction_0.1` | 39.5 | 28.8 | 1.37 |
| `upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.0_rowNotMatchedFraction_0.1` | 19.9 | 19.3 | 1.03 |
| `upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.1_rowNotMatchedFraction_0.0` | 39.1 | 29.9 | 1.31 |
| `upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.1_rowNotMatchedFraction_0.01` | 39.1 | 31 | 1.26 |
| `upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.5_rowNotMatchedFraction_0.001` | 41.9 | 32.5 | 1.29 |
| `upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.99_rowNotMatchedFraction_0.001` | 43.3 | 33.8 | 1.28 |
| `upsert_fileMatchedFraction_0.05_rowMatchedFraction_1.0_rowNotMatchedFraction_0.001` | 43.8 | 34.1 | 1.28 |
| `upsert_fileMatchedFraction_0.5_rowMatchedFraction_0.01_rowNotMatchedFraction_0.001` | 147.9 | 84.8 | 1.74 |
| `upsert_fileMatchedFraction_1.0_rowMatchedFraction_0.01_rowNotMatchedFraction_0.001` | 266.9 | 142.5 | 1.87 |

Closes delta-io/delta#1835 GitOrigin-RevId: 443099e8a02b98fffe5e5a9ec2cecb5d3b8f9537 commit d4fd5e2ae13fd6199fc4ffabee9e70d0ccdf644d Author: Fred Storage Liu Date: Sat Dec 16 11:27:12 2023 -0800 [Spark] Support write partition columns to data files for Delta This adds support for writing partition columns to data files for Delta. The feature is required by Uniform Iceberg, as the Iceberg spec defines it so. The approach is to copy FileFormatWriter from Spark as a DeltaFileFormatWriter, and add the option and logic there to support writing partition columns. Closes delta-io/delta#2367 GitOrigin-RevId: 77657bb422ce93b924f3cb25548e845477f8632f commit 75dba0747ec43b4cc421f984297b07c0bd75b989 Author: Mark Jarvin Date: Fri Dec 15 16:33:17 2023 -0500 [Spark] Minor changes around ExternalRDDScan operations Add a `recordFrameProfile` around some potentially-long-running Delta operations. Some Spark jobs triggered by Delta have an `ExternalRDDScan` at the root and can be very expensive in terms of runtime. This PR adds `recordFrameProfile` to give a bit of additional visibility into the runtime of these operations. Closes delta-io/delta#2380 GitOrigin-RevId: 7eba73fb2e5e7eaf45ecaea6730187116eadef43 commit e0835ba939eee6473e237013add168ecdfed5e13 Author: Ole Sasse Date: Thu Dec 14 15:03:53 2023 +0100 [Spark] Validate writer features in all commit operations Add protocol checks for all commit paths where they are missing. Before, it was for example possible to clone into a table even though a writer feature on that table was not supported. Add new tests for all the DML commands.
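The writer feature validation described above can be illustrated with a minimal Python sketch. This is not Delta's implementation: the feature names in `SUPPORTED_WRITER_FEATURES`, the `Protocol` class, and the `commit` helper are hypothetical and exist only to show the idea that every commit path, CLONE included, runs the same check before writing.

```python
from dataclasses import dataclass

# Hypothetical set of writer features this client understands (illustrative only).
SUPPORTED_WRITER_FEATURES = frozenset({"appendOnly", "invariants", "deletionVectors"})


class UnsupportedWriterFeatureError(Exception):
    """Raised when a table requires a writer feature the client lacks."""


@dataclass(frozen=True)
class Protocol:
    # Writer features listed in the table's protocol (hypothetical model).
    writer_features: frozenset = frozenset()


def assert_can_write(protocol, supported=SUPPORTED_WRITER_FEATURES):
    """Fail fast if the table uses any writer feature we do not support."""
    missing = sorted(protocol.writer_features - supported)
    if missing:
        raise UnsupportedWriterFeatureError(
            "unsupported writer features: " + ", ".join(missing)
        )


def commit(protocol, operation, actions):
    """Single commit entry point: every operation passes through the same check."""
    assert_can_write(protocol)
    return {"operation": operation, "numActions": len(actions)}
```

Routing all operations through one entry point is the design choice being sketched: a path such as CLONE cannot skip the check, because the check lives in `commit` rather than in each command.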
Closes delta-io/delta#2373 GitOrigin-RevId: b45a480eb6a887689f46e84596676d14042e616f commit 930f74106850c24c49ca87d8693ff6ff0afea233 Author: Max Gekk Date: Wed Dec 13 18:02:30 2023 +0000 [Spark] Assign error classes for DeltaParseException calls Yes, parse exceptions contain error class and sqlState. Closes delta-io/delta#2381 GitOrigin-RevId: 9276a5a583119913141817658e8a6b5820596009 commit ffef450c9047f7c0daa985ede4073f7d594ece4d Author: Scott Sandre Date: Tue Dec 12 14:31:37 2023 -0800 [Kernel] Add spark deps to kernel-defaults - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) Add spark deps to kernel-defaults module so that we can use spark in tests. See this other PR: https://github.com/delta-io/delta/pull/2260 Closes delta-io/delta#2366 Signed-off-by: Scott Sandre GitOrigin-RevId: 5ffade444356ca02879b353a8716e557a01b4ca1 commit a6a7c3c6a0dad326e39550591ba453173a600e06 Author: Allison Portis Date: Mon Dec 18 11:14:22 2023 -1000 [Kernel][Expressions] Update `AND` and `OR` expressions to follow SQL null semantics (#2361) commit b19b529e359a9986c390f5f6c8f6d0c87abe0f0d Author: Amogh Akshintala Date: Fri Dec 8 14:20:30 2023 -0800 Minor refactoring in InvariantViolationException.scala Minor refactoring of the constructors in DeltaInvariantViolationException to make them easier to use in other parts of Spark. Closes delta-io/delta#2365 GitOrigin-RevId: 3248d484a248aa962146cbbd46caf15e8f5bdf88 commit 9e45f7b8d2d3bed6a2b93af2afb35040a168a4bd Author: Wenchen Fan Date: Wed Dec 6 20:29:33 2023 -0800 [Spark] DeltaAnalysis should not turn the target table of MERGE into v1 relation too early In the Spark analyzer, it skips resolving the MERGE command if the target table is v2 and reports `ACCEPT_ANY_SCHEMA`. However, the `DeltaAnalysis` will turn Delta v2 relation to v1, and make Spark mistakenly resolve the Delta MERGE commands, which can lead to issues as Delta supports more features like schema evolution. 
We can't simply move the "turn Delta v2 relation to v1" code to the post-resolution phase, as we still need to do it to resolve metadata columns, which heavily rely on v1 relations. This PR does a surgical fix: if we find a MERGE command that can't be turned into its Delta variant yet (because the source query is not resolved), we mark its target table and do not turn it into a v1 relation. After the MERGE command is fully resolved, we will turn it into the Delta variant, and all Delta relations inside it will become v1 relations. Closes delta-io/delta#2308 GitOrigin-RevId: 6548dea6889215e04c7a1513b6028d42f651b2eb commit f7a6d26b0a64c44ebe9d81462c056b0a068c3718 Author: Hao Jiang Date: Wed Dec 6 11:57:16 2023 -0800 Allow the UniForm testing framework to test against Delta with HMS as MetaStore. This PR introduces a test trait that allows a UniForm e2e test to write a Delta table to a Hive MetaStore, and read the table as an Iceberg table from the same Hive MetaStore. Closes delta-io/delta#2364 GitOrigin-RevId: 0eb7403b5d014883ba2ed0a60b8c9335f881b1b9 commit ac6c2cc243c7cdf72aae21d1553b37778107fa9a Author: Hao Jiang Date: Wed Dec 6 02:56:50 2023 -0800 [Spark] Create an In Memory HMS This PR adds an implementation of an in-memory Hive Metastore that will be used by the UniForm testing framework when testing against Delta. This PR is part of the tests. Closes delta-io/delta#2359 GitOrigin-RevId: 74f32a4f3836aa33fe94451e4565a7f06803da4f commit 92b68f856e02ca3035259f0e656da7ed45cc44ae Author: Li Haoyi Date: Mon Dec 4 16:36:15 2023 -0800 Split out utilities classes into their own files Closes delta-io/delta#2351 GitOrigin-RevId: 0864356b9249eea6aa0f290d491d52b05a1e93e1 commit 87ff9c4f82038bd06ef3fa50a62a7f48b8b822bf Author: Jiaheng Tang Date: Sat Dec 2 03:08:29 2023 -0800 Delta SQL parser change to support CLUSTER BY This PR changes the Delta SQL parser to support CLUSTER BY for Liquid clustering.
Since Delta imports Spark as a library and can't change the source code directly, we instead replace `CLUSTER BY` with `PARTITIONED BY`, and leverage the Spark SQL parser to perform validation. After parsing, clustering columns will be stored in the logical plan's partitioning transforms. When we integrate with Apache Spark's CLUSTER BY implementation([PR](https://github.com/apache/spark/pull/42577)), we'll remove the workaround in this PR. Closes https://github.com/delta-io/delta/pull/2328 GitOrigin-RevId: 19262070edbcaead765e7f9eefe96b6e63a7f884 commit faa083b412332b3c8288a020a744ac9fc400ebde Author: Xin Zhao Date: Thu Nov 30 10:21:25 2023 -0800 Reformat CloneIcebergSuite.scala Update the format of CloneIcebergSuite.scala, no behavior change Closes delta-io/delta#2348 GitOrigin-RevId: 17465607744544e229feb07077df5bedb6abf462 commit a437f52f635a96e9532ea54c80d99328dd429b54 Author: jintao shen Date: Wed Nov 29 15:31:52 2023 -0800 Introduce clusteringProvider to AddFile action for Clustered Table According to the delta protocol for [Clustered Table](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#writer-requirements-for-clustered-table), any clustered files must be tagged with the implementation name in clusteringProvider field in AddFile action. The PR also removed the CLUSTERED_BY tag due to the above delta protocol change since last time. Closes delta-io/delta#2327 GitOrigin-RevId: 4f847e9d3c47ef5c1cd653589599d14df9809f0f commit 1d42c86f431f0bc05f52235005fbb8ff5681d293 Author: Venki Korukanti Date: Wed Dec 13 04:40:44 2023 -0800 [Kernel] Add support for column mapping `id` mode ## Description Adds support for column mapping id mode. The Parquet handler API contract is updated to look for the field id in `StructField`s of given read schema. When field IDs are present attempt is made to look up the column in the Parquet file by ID. If not found, an attempt is made to find the column in the Parquet file by column name. ## How was this patch tested? 
Added integration tests and granular unittests (converting missing field ids etc.) for the Delta schema to Parquet schema conversion utilities. commit 02ed385b638cc60f0420fad1802555b9ba34a999 Author: Venki Korukanti Date: Tue Dec 12 22:15:50 2023 -0800 [Kernel] Support complex type columns in `name` column mapping mode ## Description Currently Kernel can not read a table with `name` column mapping mode and having complex type columns. Also includes refactoring of the column mapping-related code in one file `ColumnMapping.java`. ## How was this patch tested? Added a test reading column mapping mode enabled table with different types of complex and primitive types. commit 0a0366400f8b2eab71065401d48edbecf6ba890e Author: Venki Korukanti Date: Mon Dec 11 13:37:09 2023 -0800 [Kernel] Move Parquet utility methods into a separate utils file ## Description Just a refactoring. Currently, the Parquet-related utility methods are in the generic utility file. Move them to a separate file as the utility methods are going to increase (with #2374). ## How was this patch tested? Existing tests. commit c63188c40329b6264dfcfa4a48711f83ad736a1b Author: Venki Korukanti Date: Mon Dec 11 13:28:41 2023 -0800 [Kernel] Fix comparison bug in default Parquet reader ## Description We are using `==` to compare strings. It worked because we used the same static variable everywhere. ## How was this patch tested? Existing tests. commit 1b595bd7f3e426f939acc253223d9e7e9041797e Author: Scott Sandre Date: Thu Nov 30 12:53:23 2023 -0800 [Kernel] Remove `Lazy` usage in Snapshot + LogReplay; Misc cleanup (#2341) commit 79c848f1f555e07c5d2ff368627d1414424cbef5 Author: Felipe Pessoto Date: Tue Nov 28 14:45:36 2023 -0800 [Spark] Improves DELTA_UNSUPPORTED_FEATURES_FOR_[READ/WRITE] message #### Which Delta project/connector is this regarding? 
- [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Improves DELTA_UNSUPPORTED_FEATURES_FOR_READ and DELTA_UNSUPPORTED_FEATURES_FOR_WRITE error message by adding the table path/name and Delta version. Follow up of #2118 Resolves #2215 @ryan-johnson-databricks Unit Tests ## Does this PR introduce _any_ user-facing changes? Yes, the error message is improved. Closes delta-io/delta#2242 Signed-off-by: Ryan Johnson GitOrigin-RevId: bdf02e2dbfc5dc72e074cfed6d4d492d158c34af commit e5de2c502e72f15c3e4cce62c16ec14c6c3d80f4 Author: Daniel Tenedorio Date: Tue Nov 28 22:40:36 2023 +0100 Support the integration of Apache Spark's column DEFAULT values with Delta Lake. This feature allows users to create or alter tables to assign default value expressions to columns of interest. After doing so, future INSERT, UPDATE, and MERGE commands may refer to the default value of any column using the DEFAULT keyword, and the query planner will replace each such reference with the result of evaluating the corresponding assigned default value expression (or NULL if there is no such explicit default assignment for that column yet). INSERT commands with user-specified lists of fewer columns than the target table will also add corresponding default values for all remaining non-specified columns. Note that at the time of this writing, Apache Spark's CSV, JSON, Orc, and Parquet data sources also support ALTER TABLE ADD COLUMN commands with DEFAULT values, but this Delta Lake implementation currently does not. The reason is that such commands update column metadata so that future scans inject the defaults for only those rows where the corresponding values are not present in storage. This requires scan operator support, but the Delta Lake part seemed too complex for this benefit; instead, such commands return detailed error messages instructing the user to first add the column and then alter it to assign the default after. 
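The DEFAULT semantics described above (an explicit `DEFAULT` reference is replaced by the column's assigned default, or NULL if none is assigned, and columns omitted from an INSERT column list are filled the same way) can be sketched in plain Python. This is an illustrative model under stated assumptions, not the query planner's code: the `DEFAULT` sentinel and `resolve_insert_row` are hypothetical names, and defaults are modeled as constants rather than evaluated expressions.

```python
# Sentinel standing in for the SQL DEFAULT keyword (hypothetical name).
DEFAULT = object()


def resolve_insert_row(table_columns, column_defaults, insert_columns, values):
    """Build a full row for an INSERT with an explicit column list.

    table_columns:   all column names of the target table, in order
    column_defaults: mapping of column name -> assigned default (constants here)
    insert_columns:  the user-specified column list
    values:          one value per insert column; may contain DEFAULT
    """
    if len(insert_columns) != len(values):
        raise ValueError("column list and value list differ in length")
    unknown = [c for c in insert_columns if c not in table_columns]
    if unknown:
        raise ValueError("unknown columns: " + ", ".join(unknown))

    provided = dict(zip(insert_columns, values))
    row = {}
    for col in table_columns:
        # Columns missing from the column list behave like an explicit DEFAULT.
        value = provided.get(col, DEFAULT)
        if value is DEFAULT:
            # No assigned default means NULL, modeled as None.
            value = column_defaults.get(col)
        row[col] = value
    return row
```

For example, with columns `id`, `status`, `note` and a default only on `status`, inserting `(id, status)` as `(1, DEFAULT)` yields `status` set to its default and `note` set to NULL.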
Closes delta-io/delta#2210 GitOrigin-RevId: 9f3b1762c45853feae224458f279904c09abaae5 commit cb59da158c53eae07e2851c71d965d58212d8f28 Author: Bart Samwel Date: Tue Nov 28 10:34:41 2023 +0100 Make DeltaSourceOffset versionless The `DeltaSourceOffset` still has a version, but it's supposed to always be `VERSION_3`. This turned out to not always be the case. Furthermore, it was not guaranteed that all use cases would use the new `BASE_INDEX`, as some of them still used a hardcoded `-1`. This PR removes the concept of a version from `DeltaSourceOffset` and makes it entirely versionless. The versioning is retained in the serialized form where it belongs. Closes delta-io/delta#2335 GitOrigin-RevId: 835f15d3adf83a47b547344b6087626c93c26222 commit 834ec45f4ceb5a414611cad5a4c8ced8ee61de59 Author: Christos Stavrakakis Date: Tue Nov 28 09:42:25 2023 +0100 [Spark] Minor refactor in MERGE source materialization #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Introduce MergeIntoMaterializeSource.MergeSource class to hold together the source dataframe, whether it was materialized or not and the reason for this decision. Refactoring PR. Closes delta-io/delta#2324 GitOrigin-RevId: a51f8ffa69934112f49b76c5cad2e0fbfabbbacf commit 60e187a34f1bd8f8f5a081480427850f5628e129 Author: Sabir Akhadov Date: Fri Nov 24 11:23:09 2023 +0100 [Spark] Force MERGE INTO source materialization when ignore unreadable files options are enabled. Force MERGE INTO source materialization when Spark configs `spark.sql.files.ignoreMissingFiles`, `spark.sql.files.ignoreCorruptFiles` or file source read options `ignoreMissingFiles` or `ignoreCorruptFiles` are enabled on the source. This is done so to prevent irrecoverable data loss or unexpected results. 
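The materialization rule described above can be written down as a small decision function. The sketch below is a hedged Python model, not the Delta code: only the four config and option names come from the commit message, while `forced_materialization_reason`, the plain-dict representation of configs, and the case-insensitive matching of read option keys are assumptions made for illustration.

```python
# Session configs and read options named in the commit message.
SESSION_CONFS = (
    "spark.sql.files.ignoreMissingFiles",
    "spark.sql.files.ignoreCorruptFiles",
)
READ_OPTIONS = ("ignoreMissingFiles", "ignoreCorruptFiles")


def _enabled(value):
    """Treat the string or boolean form of true as enabled."""
    return str(value).strip().lower() == "true"


def forced_materialization_reason(session_conf, read_options):
    """Return why the MERGE source must be materialized, or None if not forced.

    session_conf and read_options are plain dicts in this sketch.
    """
    for key in SESSION_CONFS:
        if _enabled(session_conf.get(key, "false")):
            return f"session config {key} is enabled"
    # Assumption: read option keys are matched case-insensitively.
    lowered = {k.lower(): v for k, v in read_options.items()}
    for key in READ_OPTIONS:
        if _enabled(lowered.get(key.lower(), "false")):
            return f"read option {key} is enabled"
    return None
```

Returning a reason rather than a bare boolean mirrors the idea of recording why materialization happened, which makes the decision easier to log and test.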
Closes delta-io/delta#2287 GitOrigin-RevId: 78027b1769735a7ce4709f91fa4457d05730c8e2 commit f8b268df6301357b1ec70ad3909e1742ac56ec40 Author: Christos Stavrakakis Date: Thu Nov 23 11:02:07 2023 +0100 [Spark] Skip file matching for unpartitioned tables #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description ConflictChecker supports partition-level concurrency by filtering files using the partition predicates. However, the file matching is performed even for unpartitioned tables, with true as a filter. This commit skips this check to avoid the unnecessary job. Existing tests. Closes delta-io/delta#2322 GitOrigin-RevId: 87f5fb4678b35b7e7a1f7b95d19431416a3fb0bb commit aa54af7252a170582c71a98748f34ea7e030fd8d Author: Andreas Chatzistergiou Date: Wed Nov 22 21:49:35 2023 +0100 [Spark] Replace hour truncation with minute truncation in DROP FEATURE command. In DROP FEATURE command, we truncate history when removing reader+writer features. The cutoff time is calculated by subtracting the logRetentionPeriod/historyTruncationPeriod and then truncating the result at an hour granularity. This PR changes the hour truncation to minute truncation to simplify error messaging. Closes delta-io/delta#2325 GitOrigin-RevId: b6116c8f81ebe431536225f91920d3e4416654bd commit 25e794387c45edcc80bc8bf4cf017a782f8bff3d Author: Hao Jiang Date: Wed Nov 22 12:15:31 2023 -0800 [Uniform] Base Support of UniForm test framework This PR adds the support framework and base classes for UniForm testing framework. Closes delta-io/delta#2329 GitOrigin-RevId: c2f481f6a3fa6252b923790bf1059e259c4b3670 commit 8c39476a78c6a38c0d5bd97d22426162ea1f2d11 Author: Paddy Xu Date: Wed Nov 22 16:41:05 2023 +0100 Protocol change: add notice about the additional column in data files #### Which Delta project/connector is this regarding? 
- [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (Delta Protocol) ## Description This PR adds a clarification to make sure readers skip the additional `_change_type` column in the data Parquet files. This additional column may exist when the table has CDF enabled, and is not a part of the table schema. Not needed. Closes delta-io/delta#2318 Signed-off-by: Paddy Xu GitOrigin-RevId: fb9a0764cd2ba8722fae4946290ea1a8d1a3bbd0 commit da9bb9de7cefae6e9f375ac80a7c2aa29244b7fa Author: Ami Oka Date: Tue Nov 21 14:19:05 2023 -0800 [Spark] Minor refactor to CreateDeltaTableCommand Minor refactor to CreateDeltaTableCommand Closes delta-io/delta#2320 GitOrigin-RevId: 69316c46b63e834236ecefea9adb290c2f1d8290 commit 9f618c92cd9b762d98f999f1efd0c039ad517e7c Author: Bart Samwel Date: Tue Nov 21 16:14:18 2023 +0100 [Spark] Use IndexedFile for DeltaSource admission control instead of FileAction This is a small refactoring. The DeltaSource has admission control (e.g. maxFilesPerTrigger), and the code is based on `FileAction` objects. However, all the calling code really has `IndexedFile` objects, and has to do manual conversions to deal with the fact that not all `IndexedFile` objects have file actions. This PR simplifies this by making the admission control based on `IndexedFile`. Closes delta-io/delta#2317 GitOrigin-RevId: 683e6369cc5383a7863f15bdbdc87c4849b36d45 commit 683a73087e69dd32c4b4f195a3c5450a03e37bff Author: EJ Song Date: Tue Nov 21 12:05:25 2023 +0100 [Spark] Fix a data loss bug in MergeIntoCommand This is a cherry-pick of https://github.com/delta-io/delta/pull/2128 to the master branch. #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Fix a data loss bug in MergeIntoCommand. It's caused by using different Spark session config objects for PreprocessTableMerge and MergeIntoCommand, which is possible when multiple Spark sessions are running concurrently.
If the source DataFrame has more columns than the target table, the auto schema merge feature adds the additional nullable columns to the target table schema. The updated output projection is built in PreprocessTableMerge, so `matchedClauses` and `notMatchedClauses` contain the additional columns, but the target table schema in MergeIntoCommand doesn't have them. As a result, the following index doesn't point at the delete flag column, whose index is `numFields - 2`.

```
def shouldDeleteRow(row: InternalRow): Boolean =
  row.getBoolean(outputRowEncoder.schema.fields.size)
```

`row.getBoolean` returns `getByte() != 0`, which causes rows to be dropped at random:
- matched rows in the target table are lost
- because autoMerge doesn't take effect, data in newly added source columns is also lost

The fix makes sure MergeIntoCommand uses the same Spark session / config object. Fixes #2104 I confirmed that #2104 is fixed with the change. I confirmed the following through debug log messages without the change:
1. `matchedClauses` has more columns after processRow
2. `row.getBoolean(outputRowEncoder.schema.fields.size)` reads a random column value (it's an unsafe read)
3. `canMergeSchema` in MergeIntoCommand is false, although it was true in PreprocessTableMerge

## Does this PR introduce _any_ user-facing changes?
Yes, fixes the data loss issue Closes delta-io/delta#2162 Co-authored-by: Chungmin Lee Signed-off-by: Johan Lasperas GitOrigin-RevId: 49acacf8ff1c71d7e6bcb2dc2f709c325211430a commit b92885cb69274be7bc821878c70a22bb4200fa76 Author: Allison Portis Date: Mon Nov 27 11:39:45 2023 -0800 [Kernel] Add class `FieldMetadata` for storing field metadata (#2316) commit cca2f0180170234ef1bc0175cbed456b4905e9d0 Author: BjarkeTornager Date: Mon Nov 27 18:29:45 2023 +0100 [Flink] README.md update Azure known limitations section (#2332) Since Flink 1.17 Azure Data Lake Gen2 is supported [FLINK-30128](https://issues.apache.org/jira/browse/FLINK-30128) Signed-off-by: Bjarke Tornager commit 13f7fbce7b89cec387df9fbaba0389fe892322b8 Author: Allison Portis Date: Mon Nov 20 13:05:12 2023 -0800 [Kernel] Use SLF4J for logging in kernel-api - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) Resolves #2230 Uses SLF4J for logging in kernel-API. See #2230 for the decision doc on this. Also adds log4j for logging in tests. Temporarily changed log level=DEBUG and ran tests in both kernelApi and kernelDefaults and confirmed that logs were outputted. Closes delta-io/delta#2305 Signed-off-by: Allison Portis GitOrigin-RevId: eed1be726887a2faac5c3a22ad3e5f6fdad82b49 commit 35e5d6946658eacc4303d47dbe2befb0fa88e627 Author: ericm-db Date: Mon Nov 20 12:46:43 2023 -0800 Delta Test Suite Refactoring Closes delta-io/delta#2292 GitOrigin-RevId: 27e88f1e2dc5c8b8dad5ea36fa348476dd6f3d6c commit 246fcf3eb42d8a60b4204457fcab6d5c7e1f91a6 Author: Vitalii Li Date: Mon Nov 20 08:16:32 2023 +0000 [SPARK] Companion change to SPARK-45828 to remove use of deprecated methods - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) This is companion change for https://github.com/apache/spark/pull/43708. It replaces use of deprecated methods with `Symbol` parameter with methods with `String` parameter. 
Existing unit test No Closes delta-io/delta#2303 GitOrigin-RevId: a3cf851bcb1a8fb163d5748473b731b9e6857cbe commit 74965a32ea3fe4561c582b3fc8a723a55c29ea31 Author: Christos Stavrakakis Date: Sat Nov 18 14:31:00 2023 +0100 [Spark] Remove two unused methods in DeltaErrors.scala - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) Remove two unused methods in DeltaErrors.scala N/A Closes delta-io/delta#2309 GitOrigin-RevId: 55d8b064ff70c4d4b9a0410559ab3faac33ae801 commit 4ceba238a4311c7ad9eadb65c55420cd2af074fa Author: lzlfred Date: Thu Nov 16 21:52:43 2023 -0800 [Spark] fix Delta Uniform vacuum hidden dir Closes delta-io/delta#2301 GitOrigin-RevId: 7a6ecffd62fddb4745d772cdf1bedac1ce6b0ad4 commit 98deff5848ff9ff579a7290312cd17e514ce6a81 Author: Hao Jiang Date: Thu Nov 16 16:39:35 2023 -0800 [Spark] Add SYNC_CONVERT switch to IcebergConverter This PR introduces a new DeltaConf to control whether to perform Iceberg conversion synchronously. It is required by the UniForm testing framework we introduce in this stack PR. We want to ensure that iceberg conversion is finished before starting checking the result. Closes delta-io/delta#2307 GitOrigin-RevId: 634d6d0c82d68d6369d21c925904bef7bf5c11e4 commit f20a94a8fe6b86f3d23342d774c463bd3be9f15e Author: Bart Samwel Date: Thu Nov 16 21:59:47 2023 +0100 Replace IndexedFile.isLast by END_INDEX to reliably move to BASE_INDEX of next version Remove `IndexedFile.isLast` and replace it by a `DeltaSourceOffset.END_INDEX` that is always included after the last entry. There were several problems with `isLast`. For one thing, we were not setting it for the last file in initial snapshots, so when a streaming query starts and processes the initial snapshot, it would stop at the last file in the snapshot. Then on the next trigger we would reconstruct the initial snapshot only to find out that we'd already processed all of it. Another problem with `isLast` is that it doesn't exist for empty commits or snapshots. 
As a result, the stream would not be able to move forward if it started at the `BASE_INDEX` of an empty commit. With this change, I added `END_INDEX` to the end of every snapshot (CDC and non-CDC), and also a `BASE_INDEX` to the start (it's technically not used, but it's good to be consistent). I also updated some tests in `DeltaCDCStreamSuite` that were incorrect. They restarted a query without a checkpoint, and then did checks assuming that the query had restarted from a checkpoint. I also changed some tests to be very specific about the offsets that they expect the batches in the stream to end at. Closes delta-io/delta#2241 GitOrigin-RevId: 25eb321c2c85d0db9c89e0d0caf8aa6b3febd5a7 commit 266a2fc98565eafc3dd1c09b4e2dd47bfe9b4e57 Author: jintao shen Date: Thu Nov 16 10:52:53 2023 -0800 Follow up comments for ClusteringTable feature Follow up comments for ClusteringTable feature https://github.com/delta-io/delta/pull/2264 We had late comments after the original PR merged, and this PR addresses those missed comments. Closes delta-io/delta#2294 GitOrigin-RevId: 12b53a4788a8ba7403cb72f60714bd380c742b04 commit d51ee3e44d3540c4d0ae434c617cbc9d24baed0d Author: Ryan Johnson Date: Thu Nov 16 04:57:14 2023 -0800 [SPARK] Remove unnecessary CheckpointsSuiteBase trait #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description For a long time, `CheckpointsSuite` has been the only class that implements the `CheckpointsSuiteBase` trait. The division serves no purpose, so fold the trait into the class. Test-only change. ## Does this PR introduce _any_ user-facing changes? No.
Closes https://github.com/delta-io/delta/pull/2302 GitOrigin-RevId: ff7185ed7591aa21b8fb9faa2f501e4d7a325e41 commit 20365e175ddc646216167bd69ee0aed2db825f1e Author: Allison Portis Date: Mon Nov 20 13:04:04 2023 -0800 [Kernel] Add parseStructType to the JsonHandler for parsing Delta schemas (#2282) commit ca82bef84dc7b1d94541609d68cfae70b36550ad Author: Kam Cheung Ting Date: Wed Nov 15 09:15:28 2023 -0800 Make Data Skipping Column List case insensitive This PR makes the `delta.dataSkippingStatsColumns` table property case insensitive. Closes delta-io/delta#2295 GitOrigin-RevId: f1241e98ada41dc49ef2b0e68b68cfcb9193cfb8 commit d0c8f9b2b83cc35150b2e4a8b0e34ab842294ee7 Author: jintao shen Date: Tue Nov 14 23:53:43 2023 -0800 Fix a delta protocol bug for DomainMetadata action. DomainMetadata is listed as a writer-only feature, but the protocol says ``` Readers must preserve all domains even if they don't understand them, i.e. the snapshot read must include them. ``` This doesn't seem right for a writer-only feature: if the reader doesn't understand metadata domains at all, the snapshot wouldn't include that column. This PR fixes the problem. Closes delta-io/delta#2288 GitOrigin-RevId: 911c1af6dd0347d88702616d5e19545ab5b2f305 commit bcd0ee2deb682aa5c8df488cb93dae003bc08921 Author: Dhruv Arya Date: Tue Nov 14 11:11:06 2023 -0800 Add Feature Phaseout support for V2 Checkpoints This PR adds table feature phaseout support for V2 checkpoints. Users can now downgrade their tables from v2 checkpoints to classic checkpoints, allowing older clients to interact with these tables.
Closes delta-io/delta#2284 GitOrigin-RevId: 6321298103baeda6fbd9a0d9932090f915af4ee0 commit 355263f3cbe1a8e60ed46ba72d6524b4889ff852 Author: Christos Stavrakakis Date: Tue Nov 14 18:49:50 2023 +0100 [Spark] Add two error classes for Deletion Vector integrity checks - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) Add the error classes for `DELTA_DELETION_VECTOR_NON_INCREMENTAL_UPDATE` and `DELTA_DELETION_VECTOR_OVERLAPPING_ROWS` integrity checks. Existing tests. No Closes delta-io/delta#2286 GitOrigin-RevId: d55b861a75417d7856e92d2c5ee09e3b10859f2e commit 39e11e5ddc04ed70e10cdffb5d57a87bbdafe3a4 Author: Wenchen Fan Date: Mon Nov 13 11:15:38 2023 +0000 [Spark] Fix code styles fix code styles GitOrigin-RevId: 6f36a787806b62becbe3926bf4a308422fdb2355 commit ee6b1ab8b1dfb63ff394290bbcf40e166e8bad20 Author: lzlfred Date: Fri Nov 10 12:20:01 2023 -0800 Support enable Delta Uniform on tables with DV supported but not present - part1 refactor Closes delta-io/delta#2265 GitOrigin-RevId: b2a5e94bda9f64a677e1a92333c7fbe11045726b commit 61dd5d16dc0f09d401965df685fc7be63a413682 Author: Johan Lasperas Date: Fri Nov 10 14:34:41 2023 +0100 Correct case handling in MERGE with schema evolution This fixes an issue where inconsistently using case-insensitive column names with schema evolution and generated columns can trigger an assertion during analysis. If `new_column` is a column present in the source and not the target of a MERGE operation and the target contains a generated column, the following INSERT clause will fail as `NEW_column` and `new_column` are wrongly considered different column names when computing the final schema after evolution: ``` WHEN NOT MATCHED THEN INSERT (NEW_column) VALUES (source.new_column); ``` Added tests for schema evolution, generated column to cover case-(in)sensitive column names. 
Closes delta-io/delta#2272 GitOrigin-RevId: 5f4e3f1294ca2538484de7238c294236cfc8a8b5 commit 8b768b60de4bd2ce726f020e65bb49c9f428046d Author: jintao shen Date: Thu Nov 9 17:47:16 2023 -0800 Introduce ClusteringTableFeature and CLUSTERED_BY file tag https://github.com/delta-io/delta/issues/1874 requests Liquid clustering, and this PR starts the first step to introduce ClusteringTableFeature and CLUSTERED_BY tags. When creating a clustered table, The feature clustering must exist in the table protocol's writerFeatures. When a clustering implementation clusters files, writers must incorporate a tag with CLUSTERED_BY as the key and the name of the clustering implementation as the corresponding value in add action. More detail can be found in the Delta protocol change PR https://github.com/delta-io/delta/pull/2264 The next step is to pave the way to integrate the table feature and clusterby tags when defining and clustering a clustered table. Closes delta-io/delta#2281 GitOrigin-RevId: e210b491a324a0794ec9f3a9236bb1932a6677e3 commit eafb36caf689c3ac72908a07356e7f6e57448566 Author: Gerhard Brueckl Date: Thu Nov 16 19:46:12 2023 +0100 [PowerBI] fix issue with special characters in partitioning columns (#2289) commit d7cc30ecd2e6e632904cc43aee8733cbf49f80e5 Author: Allison Portis Date: Wed Nov 15 11:47:36 2023 -0800 [Kernel] Refactor SnapshotManagerSuite helper methods (#2298) commit 0d3b6e2f24a4862e199b03e1b295e8f8cfc276a5 Author: Johan Lasperas Date: Wed Nov 8 18:06:11 2023 +0100 Add override flag to OptimisticTransaction.withActive Small improvement to allow replacing an active transaction with `OptimisticTransaction.withActive`. Callers must explicitly pass a flag `overrideExistingTransaction` to prevent unintentionally replacing an active transaction. 
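The explicit-override pattern described above can be sketched as follows. This is a Python illustration under stated assumptions, not the Scala `OptimisticTransaction.withActive` code: the thread-local holder, the `RuntimeError`, and the choice to restore the previous transaction on exit are all assumptions made for the example. Only the idea of an opt-in `overrideExistingTransaction` flag comes from the commit message.

```python
import threading
from contextlib import contextmanager

# Hypothetical per-thread holder for the currently active transaction.
_active = threading.local()


def get_active():
    """Return the transaction active on this thread, or None."""
    return getattr(_active, "txn", None)


@contextmanager
def with_active(txn, override_existing_transaction=False):
    """Run a block with `txn` as the active transaction.

    Replacing an already-active transaction requires the caller to pass
    override_existing_transaction=True, so it cannot happen by accident.
    """
    previous = get_active()
    if previous is not None and not override_existing_transaction:
        raise RuntimeError("A transaction is already active on this thread")
    _active.txn = txn
    try:
        yield txn
    finally:
        # Assumption for this sketch: restore whatever was active before.
        _active.txn = previous


with with_active("txn-a"):
    with with_active("txn-b", override_existing_transaction=True):
        print(get_active())
    print(get_active())
```

Making the flag default to `False` keeps existing call sites safe: only callers that deliberately opt in can replace an active transaction.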
GitOrigin-RevId: 1df2577c5348ad1f48f6cf7848f843181487834a commit 24f2d935cef6e7bd574af17e6bcfadb7bfd4cb1f Author: Tathagata Das Date: Wed Nov 8 15:21:17 2023 +0000 fix code style Fix the compilation warning "Auto-application to () is deprecated" GitOrigin-RevId: e4e892107ddf87ede30ec75c3f7065802bac7e3e commit 72c81c60fa664668093fb9c5888bc5701e542ebc Author: Christos Stavrakakis Date: Wed Nov 8 09:56:44 2023 +0100 [Spark] Minor refactor to Merge test utils Minor refactor in `MergeIntoSQLTestUtils` to allow using helpers in suites that want to generate SQL test for MERGE commands but don't use `executeMerge` helper. Closes delta-io/delta#2267 GitOrigin-RevId: 287fc4d5f40fb8f628270574ab315de304655fb3 commit bc1c11d983fa8ed2e18f22233cf15fe4d0d7a54e Author: Paddy Xu Date: Wed Nov 8 09:54:38 2023 +0100 Correct Delta protocol terminology "enabled" vs "supported" ## Description This PR fixes a terminology issue in Delta protocol, so the term `supported` is now used to describe a table feature name being listed in table protocol's `readerFeatures` and/or `writerFeatures`. The choice of this word is to emphasize that, in such a scenario, the Delta table *may* use the listed table features but is not forced to do so. For example, when `appendOnly` is listed in a table's protocol, the table may or may not be append-only, depending on the existence and value of table property `delta.appendOnly`. However, writers must recognize the table feature `appendOnly` and know that the table property should be checked before writing this table. This PR did not touch the Row ID/Row Tracking sections, as it's handled by another PR: https://github.com/delta-io/delta/pull/1747. 
Closes delta-io/delta#1780 Co-authored-by: Lars Kroll Co-authored-by: Bart Samwel Signed-off-by: Paddy Xu GitOrigin-RevId: 8d9b86262e91a88a85388c6333c5ef7ac296931e commit deb1b312da36dcbe5787a76e335d149749b3831b Author: Ami Oka Date: Tue Nov 7 15:40:24 2023 -0800 Fix DomainMetadata updates for DataFrameWriterV1 overwrite mode. Before this change, it was not handling DomainMetadata updates (METADATA_DOMAINS_TO_REMOVE_FOR_REPLACE_TABLE) for DataFrameWriterV1 overwrite mode correctly. This change adds the handling in WriteIntoDelta.scala to only remove the DomainMetadata in METADATA_DOMAINS_TO_REMOVE_FOR_REPLACE_TABLE when it is an overwrite mode and overwriteSchema is true. GitOrigin-RevId: c21c2f0ff9802064a9fe17c53efc4cd79386ac91 commit d7cd654ff8e9a931749cd9589db15461b6cc00e9 Author: Daniel Tenedorio Date: Tue Nov 7 12:43:44 2023 -0800 This is a follow-up to https://github.com/delta-io/delta/pull/2240 which updated the Delta log protocol to support column default values. This PR updates the description to explicitly mention that this feature only applies to write operations, not reads. Closes delta-io/delta#2266 GitOrigin-RevId: a1749871b8d2ae33496136a8bf26bd7ac2bd4437 commit 5cd42d644f0be82e5a5c7be3d49080bb89f538b2 Author: Shixiong Zhu Date: Mon Nov 6 21:32:41 2023 -0800 A minor change from `versions.toString` to `versions.mkString(", ")` to improve the error message for `DELTA_VERSIONS_NOT_CONTIGUOUS`. GitOrigin-RevId: 88969376bcb36256dff542cc5eaee60042d789fd commit 74cd1c93fe4c84367903f2dd9c300551dcdb02db Author: jintao shen Date: Mon Nov 6 15:10:34 2023 -0800 Introduce Clustered Table to Delta spec We propose to introduce a new feature **Clustered Table** to the Delta spec. The Clustered Table feature facilitates the physical clustering of rows that share similar values on a predefined set of clustering columns. This enhances query performance when selective filters are applied to these clustering columns through data skipping. 
More details can be found in the github issue [here](https://github.com/delta-io/delta/issues/1874). Closes https://github.com/delta-io/delta/pull/2264 GitOrigin-RevId: db124f01b8a8bfa06367700fbabb588c5b03497b commit d541cf016ff761ee98a580e5347bdd88b7bd4720 Author: Ming DAI Date: Fri Nov 3 14:56:33 2023 -0700 Make Iceberg Converter file format check case insensitive GitOrigin-RevId: e8cb9c6cae9fee962672911cba7207312e5f1b86 commit fcfd440fa51b4bc037f828b3d5caae19bf1b841c Author: Wei Luo Date: Fri Nov 3 10:07:31 2023 -0700 [Spark] Implement [optimized write](https://github.com/delta-io/delta/issues/1158). -Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) Optimized write is an optimization that repartitions and rebalances data before writing them out to a Delta table. Optimized writes improve file size as data is written and benefit subsequent reads on the table. This PR introduces a new `DeltaOptimizedWriterExec` exec node. It's responsible for executing the shuffle (`HashPartitioning` based on the table's partition columns) and rebalancing afterwards. More specifically, the number of shuffle partitions is controlled by two new knobs: - `spark.databricks.delta.optimizeWrite.numShuffleBlocks` (default=50,000,000), which controls "maximum number of shuffle blocks to target"; - `spark.databricks.delta.optimizeWrite.maxShufflePartitions` (default=2,000), which controls "max number of output buckets (reducers) that can be used by optimized writes". After repartitioning, the blocks are then sorted in ascending order by size and bin-packed into appropriately-sized bins for output tasks. The bin size is controlled by the following new knob: - `spark.databricks.delta.optimizeWrite.binSize` (default=512MiB). Note that this knob is based on the in-memory size of row-based shuffle blocks. So the final output Parquet size is usually smaller than the bin size due to column-based encoding and compression. 
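The sort-then-bin-pack step described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the `DeltaOptimizedWriterExec` implementation: the function name `bin_pack`, the sequential greedy strategy, and the treatment of blocks larger than the bin size are assumptions made for the example.

```python
from typing import List


def bin_pack(block_sizes: List[int], bin_size: int) -> List[List[int]]:
    """Group shuffle block sizes into bins of at most `bin_size` bytes.

    Blocks are sorted in ascending order and added to the current bin until
    the next block would overflow it. A block larger than `bin_size` gets a
    bin of its own (an assumption of this sketch).
    """
    bins: List[List[int]] = []
    current: List[int] = []
    current_total = 0
    for size in sorted(block_sizes):
        if current and current_total + size > bin_size:
            # Close the current bin before it overflows.
            bins.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        bins.append(current)
    return bins


MIB = 1024 * 1024
sizes = [300 * MIB, 100 * MIB, 200 * MIB, 400 * MIB]
for i, packed in enumerate(bin_pack(sizes, 512 * MIB)):
    print(i, [s // MIB for s in packed])
```

Each resulting bin would correspond to one output task. Because the sizes here stand for in-memory shuffle blocks, the Parquet files written from a bin would typically be smaller than the bin size, as the commit message notes.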
The whole optimized write feature can be controlled in the following ways, in precedence order from high to low (i.e. each option takes precedence over any successive ones): 1. The `optimizeWrite` Delta option in DataFrameWriter (default=None), e.g. `spark.range(0, 100).toDF().write.format("delta").option("optimizeWrite", "true").save(...)`; 2. The `spark.databricks.delta.optimizeWrite.enabled` Spark session setting (default=None); 3. The `delta.autoOptimize.optimizeWrite` table property (default=None). Optimized write is **DISABLED** by default. Closes delta-io/delta#2145 GitOrigin-RevId: f76f96d7a94fddab027bfa512d223b12ab3dd681 commit cacb7a3cceab89d0d36f5542974676acaf1585dc Author: Dhruv Arya Date: Fri Nov 3 09:12:28 2023 -0700 Resolve inconsistencies between the V2 Checkpoint specification and implementation Follow-up for https://github.com/delta-io/delta/issues/2214. The V2 Checkpoint implementation does not match what the PROTOCOL expects in some places. It does not write some fields in the V2 Checkpoint-related actions: 1. flavor in checkpointMetadata 2. type in sidecar Also, 3. The implementation writes a field called `version` (checkpoint version) in checkpointMetadata and relies on it, but the PROTOCOL does not specify any such field. 4. The PROTOCOL requires that the sidecar’s relative file path be specified under the field `fileName` in the sidecar action, but the implementation writes it under the field name `path`. This PR updates the specification so that it correctly reflects the implementation.
Closes delta-io/delta#2249 GitOrigin-RevId: 39a11840e6eae8fcf24b792b39f83cf9f2cb8dd4 commit 7f85ae0864bfe2c8616b70c45ee3009ea84e2543 Author: Tathagata Das Date: Fri Nov 3 11:19:21 2023 +0000 Fix code style Fix code style to eliminate compile warnings for Scala 2.13 GitOrigin-RevId: 23fd1b46997443606ba2ac1f1b0640222682884f commit cc40f41b7746206b772b18fdaa83bcadf7abbbf5 Author: lzlfred Date: Thu Nov 2 16:56:56 2023 -0700 create catalogTable existence test for Delta v2 commands create catalogTable existence test for Delta v2 commands Closes delta-io/delta#2202 GitOrigin-RevId: 7ea094260317a0a9f2f615f1e86bdfa9715113a9 commit 392545187c435810c0a63645a4fe8edc27959a19 Author: Kaiqi Jin Date: Thu Nov 2 15:24:24 2023 -0700 [Spark][Uniform]support timestamp as partition value for uniform Support timestamp as partition value for UniForm. Now UniForm could work if a Delta table is partitioned by a timestamp field. Closes delta-io/delta#2256 GitOrigin-RevId: 900cf0c8e6cfbd04e22eedc99bcd4a670c3d7253 commit dbe347c134bbb7357915d4cb602eea1101d84170 Author: Dhruv Arya Date: Thu Nov 2 00:23:25 2023 -0700 Fix CheckpointsSuite for V2 Checkpoints This commit removes the assumption in CheckpointsSuite that snapshot.checkpointProvider will always be LazyCompleteCheckpointProvider for V2 Checkpoints. GitOrigin-RevId: ca6cc99c301835471247de737f0c0c18094f8414 commit eb480e4eabdf535d506d4f8c37a9f52d8d086df5 Author: Patrick Leahey <42986994+PatrickLeahey@users.noreply.github.com> Date: Fri Nov 10 14:14:45 2023 -0500 [Kernel] Define LogReplay assertLogFilesBelongToTable() helper method (#2206) Resolves https://github.com/delta-io/delta/issues/2150 Defines LogReplay assertLogFilesBelongToTable() helper method based on the Spark implementation [here](https://github.com/delta-io/delta/blob/4f9c8b9cc294ec7b321847115bf87909c356bc5a/spark/src/main/scala/org/apache/spark/sql/delta/Snapshot.scala#L430) as specified in issue. 
Verifies that a set of delta or checkpoint files to be read actually belong to the transaction log of the delta table. Throws AssertionError if violation detected. Signed-off-by: Patrick Leahey james.patrick.leahey@gmail.com Co-authored-by: vkorukanti commit 70a5234777a709abc6492d86bac1ab0def0ba3f9 Author: Allison Portis Date: Fri Nov 10 10:33:13 2023 -0800 [Kernel] Add mocked unit tests for log segment construction (#2223) commit 616af05e487a9a4ccffe90a9469cb03674607690 Author: Allison Portis Date: Fri Nov 3 12:33:38 2023 -0700 [Kernel] Copy over more golden table tests from standalone (#2211) commit aac1465f1a2f8d04818fe448c8b7ab7da8c42bf4 Author: Scott Sandre Date: Wed Nov 1 17:16:44 2023 -0700 [Examples] Fix compilation #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (Examples) ## Description Previously, the `/examples/scala` sbt project was using (hardcoded) an older version of delta, which had no `delta-iceberg` artifact. Updated the delta version to be the latest. `cd /examples/scala && build/sbt compile` ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#2228 Signed-off-by: Scott Sandre GitOrigin-RevId: 8c7a3032974282a4fc89e474db0713c2439184dc commit 4b32e59063738550581b2abd097e31f32af0bd8c Author: Allison Portis Date: Wed Nov 1 13:37:32 2023 -0700 [Connectors] Add java checkstyle checks to Flink and Standalone - [ ] Spark - [X] Standalone - [X] Flink - [ ] Kernel - [ ] Other (fill in here) Adds checkstyle checks back to the Flink and Standalone projects (were removed when archiving the delta-io/connectors repo) Checked locally that checks failed when modifying import order. 
Closes delta-io/delta#2193 Signed-off-by: Allison Portis GitOrigin-RevId: 7c438ccd2d0883d266d98982331cb5184bec07fe commit c052d3b42e2f240a956b75621eff5a3eb9fd6322 Author: Venki Korukanti Date: Wed Nov 1 08:22:52 2023 -0700 [Spark] Fix config issue in `DeltaParqutFileFormatSuite` Currently, the SQL config block is outside of the `test()` function, which causes the test to not take the config into account. GitOrigin-RevId: 9b3b9f6cf0f9c8761132d434a1642971458ccba3 commit d1d088f7d6f883019cf4ebda3b4b068d9926e4a8 Author: Gengliang Wang Date: Tue Oct 31 17:15:45 2023 -0700 Support 3-part identifier in table_changes() function Support 3-part identifier in table_changes() function. For example, with ``` spark.sql.catalog.customer_catalog=org.apache.spark.sql.CustomerCatalog ``` We can query ``` SELECT * from table_changes('customer_catalog.default.t1', 1); ``` Before this change, the query above would fail with: ``` [PARSE_SYNTAX_ERROR] Syntax error at or near '.'.(line 1, pos 13) == SQL == dummy.default.dummy_table -------------^^^ ``` GitOrigin-RevId: 1e7c46ab6449637ee18a5f1356b70d80bcbea384 commit 1536ebe109ed672471efb552aa3aedd19f0c3c17 Author: Daniel Tenedorio Date: Tue Oct 31 11:43:43 2023 -0700 Update Delta log protocol to support column default values. This will support column default values for Delta Lake tables. Users should be able to associate default values with Delta Lake columns at table creation time or thereafter. Support for column defaults is a key requirement to facilitate updating the table schema over time and performing DML operations on wide tables with sparse data. Please refer to an open design doc [here](https://docs.google.com/document/d/e/2PACX-1vTyozwH8A4lemW_wNq7YC7GpuTzNn19NUZQ_pw9dDJNYBuhmdqDunauqmLr0qIuD8kQRNI7a4x72c55/pub).
Closes delta-io/delta#2240 GitOrigin-RevId: c8f4c730e5740ae060d3c6082267fafc90900719 commit ff38a252aaa258729b2c951ab4fb5532e6b0724e Author: Christos Stavrakakis Date: Tue Oct 31 19:33:56 2023 +0100 Allow specifying start snapshot in CDCReader Extend CDCReader.changesToBatchDF() method with startVersionSnapshotOpt argument that allows specifying the Snapshot of the start version. Closes delta-io/delta#2255 GitOrigin-RevId: 414ad54a478a1519619ea7f7311188cfd4569f72 commit ca4931379464e066a65f16c60db9740e8b6f1c82 Author: Andreas Chatzistergiou Date: Tue Oct 31 19:26:48 2023 +0100 Removed DROP FEATURE flag which was used to guard the feature while in development. GitOrigin-RevId: ff268a6df649f9057e5076cb0e747ae7b450b19f commit 55c4075ff58dc78f5255db45df9d421035686eb0 Author: Costas Zarifis Date: Mon Oct 30 22:11:52 2023 -0700 Minor refactor to TransactionalWrite.scala GitOrigin-RevId: e75c6a91787894bbc0af85f4627dddb43358c0e4 commit aa43438f8b272a2002c15e15285802fc80984e2c Author: Tom van Bussel Date: Mon Oct 30 22:30:47 2023 +0100 Revert commit 7251507fd83518fd206e54574968054f77a11cc0 This reverts commit 7251507fd83518fd206e54574968054f77a11cc0, as it can cause MERGE statements that reference a view containing a non-correlated scalar subquery referring a CTE to fail. GitOrigin-RevId: 300883c1e69ea7838ce1842415c41882bb1334ea commit 348f87a602deb9bf51cfbb96d958340b72a62f54 Author: Prakhar Jain Date: Mon Oct 30 11:17:49 2023 -0700 [Spark] Extend DeltaLogSuite to V2 Checkpoints Extend DeltaLogSuite to V2 Checkpoints GitOrigin-RevId: 8c57418f9ae436c38e55be2b46a9049142fcef5e commit f0d91ea2e64ba5dfdbbe444f3d29b9bef5c7c440 Author: Lars Kroll Date: Mon Oct 30 16:08:25 2023 +0000 MERGE: Track when a second source scan is performed - We used to set `numSourceRowsInSecondScan` to -1 whenever no second source scan was performed, but the metrics framework just sets this back to 0. - Instead, added a field `performedSecondSourceScan` to track this. 
GitOrigin-RevId: 78128469bd2e76db67c9fa4e71d1f3820a6e39b7 commit 9265b9e566465d50a7b55abd888b77c141c624cc Author: Lin Zhou Date: Fri Oct 27 11:57:10 2023 -0700 Move `getMetadataTrackingLogForDeltaSource` to object `DeltaDataSource`, so it could be used by other streaming sources later. GitOrigin-RevId: f16aa3182cb7434565950e165deee18dd14a80b3 commit ee8d0951363d08296171c4f4feea94b0057f77a5 Author: Fredrik Klauss Date: Fri Oct 27 10:31:36 2023 +0200 Parallel stats collection within each partition * Do stats collection in parallel within a partition instead of sequentially to reduce the time spent idle waiting for network requests to go through while fetching the file status or the parquet footers from cloud store * Using global threadpool on each executor to do parallel stats collection * Code to partition the dataset before collecting stats to increase the achievable throughput Closes https://github.com/delta-io/delta/pull/2203 GitOrigin-RevId: 60ce68f0a1627f9afacd5d3266e6596e1c0e8796 commit ff71245aafdac560948b4f3deee5623852286dd4 Author: Ami Oka Date: Thu Oct 26 17:43:04 2023 -0700 Minor refactor to WriteIntoDelta.scala, ImplicitMetadataOperation, DeltaDataSource.scala, and DeltaSourceUtils GitOrigin-RevId: cbeffc270fcacd5de2cf6629df105e8cc38509b6 commit 2b0a782cb05fbd302de49b4ac0a04ef3dfb397f2 Author: Scott Sandre Date: Thu Oct 26 15:21:12 2023 -0700 [Infra] Set next Delta version to 3.1.0-SNAPSHOT Now that the Delta Lake 3.0.0 release has been finalized, set the next Delta version to 3.1.0-SNAPSHOT. 
Closes delta-io/delta#2208 Signed-off-by: Scott Sandre GitOrigin-RevId: cc234af97393c8ab29b730befda07aedfa20a627 commit 7b5761d9299697c401f3549268a3e62006758356 Author: Allison Portis Date: Mon Oct 30 09:37:18 2023 -1000 [Kernel] Adds tests for LogReplay using golden tables (#2212) commit 1d5dd774111395b0c4dc1a69c94abc169b1c83b6 Author: Ryan Johnson Date: Thu Oct 26 11:51:31 2023 -0700 [Spark] Make NonFateSharingFuture fully non fate sharing It turns out `NonFateSharingFuture` can propagate the exception from a failed future between sessions, causing spurious failures in my session if it waits on a future launched by a job that got canceled. The fix has several pieces: * Non-fatal exceptions are completely ignored (other than being logged) * Fatal exceptions only propagate (once) to a caller with the same session that created the future. * Change `DeltaThreadPool` to use Java futures, because Scala futures do not handle fatal exceptions gracefully (they swallow the exception while leaving the future permanently unfinished). A new unit test validates the expected behavior of `NonFateSharingFuture`. Existing unit tests validate other use sites of `DeltaThreadPool`. Yes -- spurious exceptions no longer thrown in my session, if a shared future launched by somebody else's session fails. Closes https://github.com/delta-io/delta/pull/2237 GitOrigin-RevId: 2376fee832721c9d4d3268331239490ab17dbf75 commit ae2bc34e310ea5ae796245a14ed4d6d54a04b7b6 Author: Felipe Pessoto Date: Wed Oct 25 10:53:11 2023 -0700 [Spark] Improves InvalidProtocolVersionException message Improve the InvalidProtocolVersionException error message. Fix #2082 Unit tests Yes, the error message changed to: Delta protocol version is not supported by this version of Delta Lake: table "**tableNameOrPath**" requires reader version **readerRequired** and writer version **writerRequired**, client supports reader versions **supportedReaders** and writer versions **supportedWriters**. 
Please upgrade to a newer release. Closes delta-io/delta#2118 Signed-off-by: Ryan Johnson GitOrigin-RevId: 99353fbbd29a01d38d3758c3a1d3c28d9603e360 commit 500d18002cd806100ee19bfceecf7e3604fdd315 Author: Andreas Chatzistergiou Date: Wed Oct 25 19:05:08 2023 +0200 [Spark] Checkpoint tombstones need to respect the HistoryTruncationRetention period. Snapshot reconstruction filters tombstones based on the `tombstoneRetentionPeriod` (default 7 days). This behaviour may prevent the DROP FEATURE command from downgrading the protocol because feature traces are detected in the earliest checkpoint. To avoid this issue, we filter out all tombstones from the earliest checkpoint. GitOrigin-RevId: 7a1dc0ddca8e376f4856fb0c83c1af73e345d235 commit ff15d91271758b0f95cc89efc607c77a19d6084e Author: Thang Long Vu Date: Wed Oct 25 12:35:23 2023 +0200 [Spark] Support data skipping on TimestampNTZ columns. Support data skipping on `TimestampNTZType` columns in `DataSkippingReader`. Unit tests are added to the `DataSkippingDeltaTests` and `DeltaTimestampNTZSuite`. Closes https://github.com/delta-io/delta/pull/2222 GitOrigin-RevId: 57ff108cd5fe63cefaf7d74a1639a1bf7576440f commit 3880c7a277c2ebc6c1c1ae77da72979a25c862c3 Author: Paddy Xu Date: Wed Oct 25 11:24:18 2023 +0200 [Spark] Rephrase error message on CHECK constraint violation in replaceWhere This PR rephrases an error message on CHECK constraint violation in replaceWhere from ``` Data written out does not match replaceWhere ''. ``` to ``` Written data does not conform to partial table overwrite condition or constraint ''. ``` because `` is sometimes not a replaceWhere condition but a CHECK constraint. GitOrigin-RevId: b44582e485cb9b3159f2d043e49fc335f8073688 commit 53c0e198c87567ee0d831ed5cf42bf4a1f9d7783 Author: jintao shen Date: Tue Oct 24 18:28:33 2023 -0700 [Spark] Minor changes .. to DeltaErrors.scala around `createTableWithDifferentPropertiesException ` to have deterministic order of properties. ..
to CreateDeltaTableCommand.scala to fix creating external table GitOrigin-RevId: 5ecd5331d43e368971f5894f7eef236a9ba661cb commit 8a5be053bc66fa0ec5649945f4c1a673ad8ef999 Author: Lukas Rupprecht Date: Tue Oct 24 17:05:10 2023 -0700 [Spark] Replaces startTransaction in StatisticsCollection.recompute with catalogTable overload #### Which Delta project/connector is this regarding? -Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR replaces the call to DeltaLog.startTransaction() in StatisticsCollection.recompute with calls to DeltaLog.startTransaction(Option[CatalogTable], Snapshot). This PR is part of https://github.com/delta-io/delta/issues/2105 and a follow-up to https://github.com/delta-io/delta/pull/2125, to ensure that all transactions have a valid catalogTable attached to them so Uniform can correctly update the table in the catalog. This PR also introduces a new helper in DeltaTestImplicits, which allows unit test call sites to still call the old version of StatisticsCollection.recompute and passes None as the catalogTable. This implicit should only be used if the test really only runs against a path-based Delta table and so no catalogTable is present. This is a small refactoring change so existing test coverage is sufficient. ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#2217 GitOrigin-RevId: bc3783d298de7d7ad442ac347042d8fc7820afbe commit 1799e2022c5312c22d4f2dd98b62a61ddbffbbcc Author: Venki Korukanti Date: Mon Oct 23 17:50:36 2023 -0700 [Docs] Update the conda `environment.yml` to work with Python 3.8 ## Description PySpark has dropped support for Python 3.7 which broke the API doc generation script. Update the `docs/environment.yml` dependencies to work with Python 3.8. Tested by following the `docs/README.md` and was able to see the generated docs for all modules: delta-spark, flink, connectors and kernel. 
Closes delta-io/delta#2200 Signed-off-by: vkorukanti GitOrigin-RevId: c22f8ed8de21e86933b5a1e670f0b0a2e0f7050d commit cb305c0ccdc5cab3b4c7188bf0126d919f78fe3b Author: Serge Rielau Date: Mon Oct 23 15:47:29 2023 -0700 [Spark] Minor refactor to DeltaDDLUsingPathSuite GitOrigin-RevId: a483c48ec57feb004a24b4df090d13953e8ad53e commit fe974608555469146d3a1b4b16a7f60e5cc7f1a5 Author: Felipe Pessoto Date: Sat Oct 21 09:25:22 2023 -0700 [Spark] Remove databricks reference Removes a Databricks reference from an error message Closes delta-io/delta#1861 Signed-off-by: vkorukanti GitOrigin-RevId: 53f0572f5e17c7dc91b811332a2040a02d5a110e commit f6ec635a2ae70a04534d2c3e41fce2f5afbabb69 Author: Bart Samwel Date: Fri Oct 20 17:42:29 2023 +0200 [Spark] Use a Serializer and Deserializer for DeltaSourceOffset This PR changes `DeltaSourceOffset` to have a jackson `Serializer` and `Deserializer`. `DeltaSourceOffset` objects need extra adjustment of values at serialization and deserialization time, for backward compatibility. This was implemented using a hacky custom `.json` getter for serialization, and an `apply` constructor from `OffsetV2`. But some tests use a plain old `JsonUtils.mapper.readValue`, and they wouldn't get the special magic applied. With this change, the special magic is always applied. The PR also adds a specific error for arbitrary JSON parse errors. Closes delta-io/delta#2205 GitOrigin-RevId: 58b18ff2099d731b6e4e68b8a3fb02663cd2b3ff commit c8219b75756a9ffdcc64e9abc5b902c6077276fd Author: Ala Luszczak Date: Fri Oct 20 15:22:36 2023 +0200 [Spark] Minor refactoring GitOrigin-RevId: df6f13b2740f039c2a9e95ee30c4b52b5bf6f272 commit e415e7a5267b69e4fbadabb8ed6295a655bd1c8a Author: Paddy Xu Date: Thu Oct 19 18:49:15 2023 +0200 [Spark] Remove column mapping suggestion when column name contains invalid character Removes a suggestion to turn on Column Mapping when the user tries to create a table that has a column containing invalid characters. 
This suggestion is not useful for table creation and is error-prone since enabling CM is a one-way road that cannot be reversed. GitOrigin-RevId: ee68ddad9ffc81ea9a1b02e803f93353d68b3fa9 commit b505412dab2c51ffa0f71ada34b3db449407e95d Author: Johan Lasperas Date: Thu Oct 19 15:34:51 2023 +0200 [Spark] Factor out withKeyValueData and getCDCForLatestOperation in DeltaTestUtils This allows reusing `withKeyValueData` outside of `MergeIntoSuiteBase` and avoids reimplementing CDC reads across different test suites. GitOrigin-RevId: f5c1ba6a41a03d3220c19010aa8eac755507ba2b commit 1c5249ec9cde385a3abc08948a0b12778d623269 Author: Sabir Akhadov Date: Thu Oct 19 10:23:04 2023 +0200 [Spark] Minor comment update GitOrigin-RevId: e09a59fcd70c4efebb3bae855907b8233adada0f commit 9cb6f2d4b30c5e0483db6cb07260c7f83832a3be Author: Eric Maynard Date: Thu Oct 19 00:07:39 2023 -0700 [Spark] Minor refactor to DeltaSuite.scala GitOrigin-RevId: a8751f2f6d5404d217e7cc0c4655af29e88834f8 commit acb1f3b74f97928ff1f85208921abd8438035b0c Author: Lars Kroll Date: Wed Oct 18 14:53:47 2023 +0200 [Spark] Minor refactoring to OptimisticTransaction GitOrigin-RevId: 314f119a51f1cbb0b29919fccfefbacf4e115a48 commit 91788192f46ef90ca28b40ef1844a80708432edc Author: Sabir Akhadov Date: Wed Oct 18 12:10:31 2023 +0200 [Spark] Add condDo to ScalaExtensions Add an implicit condDo which is similar to [condOpt](https://www.scala-lang.org/api/2.13.4/scala/PartialFunction$.html#condOpt[T,U](x:T)(pf:PartialFunction[T,U]):Option[U]) but doesn't return a result. GitOrigin-RevId: bf87dea94e47138490895c723f9d553e4e210b05 commit ae6eef8e7e7f3505d9d91ed3cf0279ac62ff27a6 Author: Ole Sasse Date: Tue Oct 17 15:33:35 2023 +0200 [Spark] Print error details in user facing stack trace for implicit casts For DeltaArithmeticException, print the error details in the message. This will, for example, make the column for which an implicit DML cast failed visible to the user.
Added a new test to validate message details Closes https://github.com/delta-io/delta/pull/2198 GitOrigin-RevId: b1535d2d6d0171b79a64d051903531d2e4bc667a commit 6a3c92269c782566e2c765e07c487b30a678d40e Author: Fredrik Klauss Date: Tue Oct 17 14:03:59 2023 +0200 [Spark] Propagate thread locals to Delta thread pools * The default thread pool executor in Apache Spark does not forward thread locals to threads spawned in a thread pool. * This can cause issues if the threads depend on the thread locals. * To fix this, we introduce a wrapper class around the thread pool executor that forwards thread locals. Closes delta-io/delta#2154 GitOrigin-RevId: 9e9423e4b041232457ffaab18f5f96490bb45b88 commit 65668893473db35a6783a63bd07d1a51b6737fa0 Author: Sabir Akhadov Date: Tue Oct 17 13:49:35 2023 +0200 Minor refactoring in MergeInto Minor refactoring in MergeInto to facilitate extensibility. GitOrigin-RevId: c0244c42e153010dabfaf1246adf244c45d0d2a5 commit 2991b8c259be027d3b402abfcc5d278b8bef6301 Author: Johan Lasperas Date: Tue Oct 17 08:55:25 2023 +0200 Resolve multiple UpCast expressions in MERGE and UPDATE Follow-up from https://github.com/delta-io/delta/pull/1998 to run the rule `PostHocResolveUpCast` multiple times until all `UpCast` expressions are resolved. With the previous change, it only ran once, failing to resolve nested UpCast expressions. Added a test covering a problematic case with multiple `UpCast` expressions. GitOrigin-RevId: de56850c6013ad2b7c4a134917778069ee2f5915 commit 3a494e94b7fb5c10abe383080d210da8c2a7c32a Author: Renan Tomazoni Pinzon Date: Thu Oct 26 17:19:24 2023 -0300 [Kernel] Consolidate `Preconditions` checks in one utility class Moves `checkArgument` methods from `io.delta.kernel.defaults.internal.DefautKernelUtils` and `io.delta.kernel.internal.util.InternalUtils` to `io.delta.kernel.internal.util.Preconditions` class to clean up the code. 
Resolves #2148 --------- Signed-off-by: Renan Tomazoni Pinzon Co-authored-by: vkorukanti commit 8639c411890a5c77386f04e2282fcf4caa401eff Author: Venki Korukanti Date: Thu Oct 19 10:11:29 2023 -0700 [Kernel] Update `README.md` (#2197) The `kernel/README.md` got a bit stale. Fix the links, add more links and update content. commit f879bfbe1e81aab0be7fe8b04bb08b2adeb1397d Author: Venki Korukanti Date: Wed Oct 18 15:48:45 2023 -0700 [Kernel] Fix doc build error Qualify the class reference with full namespace in `package-info`. commit a7971a939eabd00db8f5dc6d5c64fe95f5478651 Author: Venki Korukanti Date: Tue Oct 17 10:51:54 2023 -0700 [Kernel] Add `package-info` for `io.delta.kernel.defaults.client` package Add missing `package-info` for `io.delta.kernel.defaults.client` package. commit 79d0f0d5c7db1edadc1fb41a1251642bc6eee2dd Author: Dhruv Arya Date: Mon Oct 16 18:18:28 2023 -0700 [Spark] Test equivalency of compatibility V2 Checkpoints with normal V2 checkpoints Adds Unit Tests that check whether Compatibility V2 Checkpoints are equivalent to normal V2 Checkpoints after being loaded. Closes delta-io/delta#2194 GitOrigin-RevId: c3f9161a983c71cf26e311ce9cc1e57379ac930d commit a923c43b060c7cb36d14e7cd23523e15e8c80282 Author: Dhruv Arya Date: Mon Oct 16 17:55:13 2023 -0700 [Spark] Allow V2 checkpoints with no sidecars The current code implicitly requires that all V2 checkpoints should have sidecars even though the Delta specification does not require this. This PR relaxes this assumption. Adds a new test which ensures that a snapshot can be initialized from a V2 Checkpoint where all of the actions are in the manifest and no sidecars are present.
GitOrigin-RevId: 1c1dc158898dd1c66c999ed4d1b52d51380fd23e commit 5d7182ce6dd3b807dffd6ae461bfad1154bbbaa6 Author: Hao Jiang Date: Mon Oct 16 16:50:23 2023 -0700 [Spark] Uniform table creation with AS SELECT returns null values during table creation This PR fixes a bug in CTAS of delta table when using Universal Format but not explicitly specifying columnMapping. GitOrigin-RevId: fb1db0a680f197c98c30c022520c81d4b5525d2f commit 60f17d9560def6ac5369bef3fdfd53c42bffbf61 Author: lzlfred Date: Mon Oct 16 16:01:55 2023 -0700 [Spark] Support populating iceberg default name mapping in Uniform Delta to Iceberg conversion Closes delta-io/delta#2191 GitOrigin-RevId: 83fde01987adddb5f26f8b1d7434f49afad13401 commit 447e4eda2fcf8ac7d3192fc374c525d751aabbbc Author: Tathagata Das Date: Mon Oct 16 10:13:16 2023 -0700 [Docs] Updated all language doc generation script Fixed doc generation scripts to generate and collect all language docs of all modules together in one place. ran locally. Closes delta-io/delta#1990 Signed-off-by: Venki Korukanti GitOrigin-RevId: eda137f00e1c67aef09766300d5635db6067ede1 commit d754513412024e779e77ff95620e56d864cb4575 Author: Dhruv Arya Date: Fri Oct 13 19:02:48 2023 -0700 [Spark] Minor refactor of DeltaAnalysis.scala GitOrigin-RevId: 2904bc6b6d2de5da6a9300a8629f6d976917634c commit 8af2bb33f35277b892596068b85874e863da8a20 Author: Hao Jiang Date: Fri Oct 13 17:30:57 2023 -0700 [Spark] Add a UniForm example for Delta 3.0 Closes delta-io/delta#2181 GitOrigin-RevId: 2efb13f6eec083602c78c760cf851494f20d8b8d commit 370ae5a4951590b3f90327420ad1f0d903b79444 Author: Hao Jiang Date: Fri Oct 13 11:11:08 2023 -0700 [Spark] Add new test cases against enforceSupportInCatalog Add new test cases in ConvertToIcebergSuite against enforceSupportInCatalog GitOrigin-RevId: fffd4edb470d5cdf6c792c7f7b4bb92fac523e66 commit 2d30f78629728c6a5a67c9228b13dd3490f9f46d Author: Hao Jiang Date: Fri Oct 13 11:10:00 2023 -0700 [Spark] Minor fixes to make release smoother Fix to scala 
version check in examples Fix to allow iceberg integration tests to use staged sonatype repo Closes delta-io/delta#2175 GitOrigin-RevId: b999b748b13eaf703cda853fdba392c6fd08bb0a commit 310d43d4889de3e1b3e838c7a968ecb3ff47c2b5 Author: panbingkun Date: Fri Oct 13 09:24:23 2023 +0000 [Spark] Minor refactor to DeltaAlterTableTests.scala and DescribeDeltaDetailSuite.scala GitOrigin-RevId: ed16cfc8cc56583a9a240df59e72d1548a883f2b commit f3c7cc3907bd3802a0ba4f369d6a0a9ea2af8bdc Author: Rui Wang Date: Fri Oct 13 06:28:36 2023 +0000 [Spark] Relax exception test in DeltaTableBuilderSuite Relax exception test in DeltaTableBuilderSuite to allow general AnalysisException GitOrigin-RevId: 809c19a8c2599ae22ef31408c4f24974a5941596 commit fd06d99d4b6198bd1ad6c63a4d2c55618f079351 Author: Hao Jiang Date: Thu Oct 12 17:29:15 2023 -0700 [Spark] Fix the hardcoded iceberg version in build.sbt `build.sbt` uses hardcoded jar name when packing iceberg jars. Update to use variable in jar name instead of hardcoding Closes delta-io/delta#2177 GitOrigin-RevId: 8957369dc7a0eb37c9b6cf734057e0946e197511 commit b44bd8a9023a318325fe8738a5c56b1325ed56b7 Author: Venki Korukanti Date: Fri Oct 13 14:50:12 2023 -0700 [Kernel] Add integration tests for staged/released maven artifact verification ## Description Add the following integration tests to verify the sanity of staged maven artifacts for Kernel release signoff. * Reading a simple table w/o partitions * Reading a simple table with partitions * Reading only a subset of columns from a table * Reading a table with deletion vectors * Reading a table with deletion vectors and column mapping. * Reading a table with partition pruning (simple filter) * Reading a table with partition pruning (filter - combined with partition col filter and data col filter) ## How was this patch tested? 
Here are the examples to run these tests ``` # To run with binaries built from the current repo /kernel/examples/run-kernel-examples.py --use-local # To run with staged artifacts at a particular location /kernel/examples/run-kernel-examples.py --maven-repo=https://oss.sonatype.org/content/repositories/iodelta-1120 --version=3.0.0rc2 ``` commit 91df04da8abe2747a156a1218f4d981bfae6fb27 Author: Venki Korukanti Date: Fri Oct 13 14:21:47 2023 -0700 [Kernel] API clean up - move few APIs to `internal` ## Description Move unneeded classes from `io.delta.kernel.utils` into `io.delta.kernel.internal.util` Also move `io.delta.kernel.fs.FileStatus` class to `io.delta.kernel.utils`. ## How was this patch tested? Ran the integration test/examples. commit d0490c8c9c7537568b5c621b72ec4c33397a7481 Author: Allison Portis Date: Fri Oct 13 14:08:46 2023 -0700 [Kernel] Rename row index column variable names to mention "metadata" (#2188) ## Description Renames `ROW_INDEX_COLUMN_NAME` to `METADATA_ROW_INDEX_COLUMN_NAME` and `ROW_INDEX_COLUMN` to `METADATA_ROW_INDEX_COLUMN`. Also fixes a few typos in the user guide. ## How was this patch tested? NA commit c7932974981ccc45c81ba279bef08bd12a0b18c3 Author: Venki Korukanti Date: Fri Oct 13 13:06:04 2023 -0700 [Kernel] Minor test clean up ## Description Remove the golden table `dv-with-columnmapping` from the `kernel` module and add it the `golden-tables` module (generated using `GoldenTables.scala`. This was done originally because of delta-io/delta#1886 which was fixed in the current master. ## How was this patch tested? Ran the `DeletionVectorSuite` and it passes. commit 67d6c5de763c2f73d61172b007e4159b1353ca41 Author: Allison Portis Date: Fri Oct 13 10:48:30 2023 -0700 [Kernel] Remove the ColumnVector::getStruct API (#2131) ## Description Removes the `getStruct` API from `ColumnVector`. We will use a wrapper to convert to rows only for the ColumnarBatch/FilteredColumnarBatch row-based processing APIs. ## How was this patch tested? 
Existing tests should suffice. commit ffd501a39c7d68c9e0b6941ae4a95acd4575feb5 Author: Venki Korukanti Date: Fri Oct 13 05:59:02 2023 -0700 [Kernel] Update Kernel USER_GUIDE based on the latest API updates ## Description There have been a few changes to APIs since the last update to `USER_GUIDE.md`. Update it to reflect the latest API changes. commit c5b84c69e9b70af1169dba8a327da3869e081b3d Author: Hao Jiang Date: Thu Oct 12 13:34:35 2023 -0700 Change Apache Iceberg src branch from `master` to `main` Change Apache Iceberg src branch from `master` to `main` Closes delta-io/delta#2174 GitOrigin-RevId: 3ebb2293b447ee5c92ebda33b152d87c572fd7c5 commit 91c56d962e8a0fdfb683a8edce71b0ee494491b9 Author: Felipe Pessoto Date: Thu Oct 12 12:20:35 2023 -0700 [Spark] Change Scala version to match Spark 3.4 - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) Matches Scala version of Spark 3.4: https://github.com/apache/spark/blob/59fcecb5a59df54ecb3c675d4f3722fc72c1466e/pom.xml#L171 https://github.com/scala/scala/releases/tag/v2.12.16 https://github.com/scala/scala/releases/tag/v2.12.17 https://github.com/scala/scala/releases/tag/v2.13.6 https://github.com/scala/scala/releases/tag/v2.13.7 https://github.com/scala/scala/releases/tag/v2.13.8 Fix #1909 Changes Scala version, which should be compatible: "As usual for our minor releases, Scala 2.12.17 is binary-compatible with the whole Scala 2.12 series." "As usual for our minor releases, Scala 2.13.8 is binary-compatible with the whole Scala 2.13 series." 
Closes delta-io/delta#1936 Signed-off-by: Allison Portis GitOrigin-RevId: 232abd3a2f7f8d7395e1cdeb21baecea096f15a6 commit dbfa010db02e473d51cb2c5ec2ad21fd8050ba20 Author: Haejoon Lee <44108233+itholic@users.noreply.github.com> Date: Fri Oct 13 02:53:15 2023 +0900 Minor refactor to DeltaAlterTableTests.scala GitOrigin-RevId: cfdb7483fd41c9371c66f8117cfcb92305523e18 commit f3ec1677088db3e164313b90a5e2b343eadabb0c Author: lzlfred Date: Wed Oct 11 05:59:55 2023 -0700 Use fieldId instead of index to lookup partition values during Iceberg to Delta Conversion Use fieldId instead of index to lookup partition values during Iceberg to Delta Conversion GitOrigin-RevId: 1970ad95b37759a78a97e14c8a70d8af1413a15e commit 208e73c5e2d21efd389d0f18c6c9f6881d3ffe8e Author: Haejoon Lee Date: Wed Oct 11 03:54:07 2023 +0000 Minor refactor to CloneIcebergSuite GitOrigin-RevId: 2fe3ed308cd1518530136e4836ca818f03b8cdb3 commit 29c177e07c4786606fbf12fe5c9c72e01351da09 Author: lzlfred Date: Tue Oct 10 19:33:20 2023 -0700 Disable iceberg convert to delta with partition evolution suite Disable iceberg convert to delta with partition evolution suite GitOrigin-RevId: 9a2f11abf2fd0eb403bcfbdabf2354d5522b7a5c commit e31138068a2073b8bd3475d6f80135adce5163e8 Author: Allison Portis Date: Tue Oct 10 16:16:04 2023 -0700 Update kernel docs github actions to publish kernelDefaults java docs Merged into my personal fork and successfully published default docs here https://allisonport-db.github.io/delta/snapshot/kernel-defaults/java/ https://github.com/allisonport-db/delta/actions/runs/6450710871 GitOrigin-RevId: e453ae9f2d6388fbbc3f57cfb4e55086e84a5003 commit 9bf98d6be5285d0974df19364988863d567492ca Author: Hao Jiang Date: Tue Oct 10 11:48:23 2023 -0700 Expose the cause when exceptions happens in getDeltaLogFromCache When `ExecutionException` happens in DeltaLog#getDeltaLogFromCache, the cause exception is not exposed and ExecutionException is thrown out to the caller. 
This does not follow the delta exception framework. The fix is to catch the ExecutionException and throw the cause. GitOrigin-RevId: 5d48eb84209f89c1690cf7f7f84099d06004989f commit 450dc84e7d1c53913e701f9718d5823b0ccebf0e Author: lzlfred Date: Tue Oct 10 10:10:58 2023 -0700 fix Delta Iceberg "generate_iceberg_jars" script The existing script fails if the machine does not have git set up, when it runs "git config" on a non-git dir. This fixes the problem by running git clone first and then git config. GitOrigin-RevId: d35b870c3fa83c9a0f552f19c5b8bcb69df3bc5e commit ce3c00c82caa323de797ab0e403f3200e53d05c6 Author: Dhruv Arya Date: Tue Oct 10 01:06:49 2023 -0700 Reduce DeltaRetentionSuite flakiness by making ManualClock initialization consistent Most RetentionSuite tests rely on ManualClock to time travel and trigger metadata cleanup. Cleanup boundaries are determined by finding files that were modified before 00:00 of the day on which currentTime-LOG_RETENTION_PERIOD falls. This means that for a long running test started at 23:59, the number of expired files would jump suddenly within 1 minute (the expiration boundary would move by a day as soon as the system clock hits 00:00 of the next day). By fixing the start time of the test to 01:00, we avoid these scenarios. This should reduce test flakiness.
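
The boundary effect described above can be sketched as follows. This is an illustration only, not the code in `DeltaRetentionSuite` or `MetadataCleanup`: the object name `RetentionCutoffSketch`, the 30-day retention, the two-minute step, and the use of UTC are all assumptions made for the example.

```scala
import java.time.{Duration, Instant}
import java.time.temporal.ChronoUnit

// Hypothetical sketch of the cutoff rule from the description above.
object RetentionCutoffSketch {
  // Files modified before midnight of the day containing (now - retention)
  // are treated as expired. Truncating in UTC is an assumption of this sketch.
  def cutoff(now: Instant, retention: Duration): Instant =
    now.minus(retention).truncatedTo(ChronoUnit.DAYS)

  val retention: Duration = Duration.ofDays(30)  // assumed value
  val step: Duration = Duration.ofMinutes(2)     // simulated test run time

  val lateStart: Instant = Instant.parse("2023-10-10T23:59:00Z")
  val earlyStart: Instant = Instant.parse("2023-10-10T01:00:00Z")

  // Starting at 23:59, the cutoff moves by a whole day while the test runs.
  val lateShifted: Boolean =
    cutoff(lateStart, retention) != cutoff(lateStart.plus(step), retention)

  // Starting at 01:00, the same step leaves the cutoff unchanged.
  val earlyShifted: Boolean =
    cutoff(earlyStart, retention) != cutoff(earlyStart.plus(step), retention)
}
```

With a clock pinned to 01:00, a test would have to run for nearly 23 hours before crossing the day boundary, which is why a fixed start time removes this source of flakiness.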
GitOrigin-RevId: 9540daec4f65174fb6a089b31011a70f57ecb770 commit cd50f62d1a6d541acf23186bd5d5bf380dc0bf08 Author: Lars Kroll Date: Mon Oct 9 15:26:10 2023 +0200 Make DeletionVectorDescriptor.STRUCT_TYPE a lazy val GitOrigin-RevId: 7c64a5f406aea5726bcbe4dbeadb06aace889a0b commit 41c175c65126cc498f36b100881253f8bca24eaf Author: Peter Toth Date: Sat Oct 7 02:46:21 2023 +0000 [Spark] Update test suite in advance for changes in Spark Closes delta-io/delta#2152 GitOrigin-RevId: 3faac61096037ef53f0482ffb52c9f40d62e32d8 commit 38d4bda251b2be282ae50bffee8a7d3ccb38f9f2 Author: Jared Wang Date: Fri Oct 6 17:38:11 2023 +0000 Revert 8962318 "Use fieldId instead of index to lookup partition values during Iceberg to Delta Conversion" GitOrigin-RevId: 857ac70dcb9d4d9cff9aa9f2b7a97ad06acfdd92 commit 200c72c92532369fab263fd25b6608ee39bbc9c0 Author: Bart Samwel Date: Fri Oct 6 15:49:15 2023 +0200 [Spark] Rename DeltaSource.isStartingVersion to DeltaSource.isInitialSnapshot `DeltaSourceOffset.isStartingVersion` means "is this offset part of an initial snapshot", which is the exact opposite of `"startingVersion"` which is the user specified option that means "no initial snapshot, just changes starting at this version". This PR renames `isStartingVersion` to `isInitialSnapshot`, keeping the serialized name as "isStartingVersion". Closes delta-io/delta#2139 GitOrigin-RevId: edcb79d942c597d5b4fd61820624951a717891a3 commit d45cbe23c3a8522739f02f7cde8fe85c6c618149 Author: Venki Korukanti Date: Tue Oct 10 22:23:27 2023 -0700 [Kernel] Update `kernel/examples` to work with latest API changes ## Description There have been changes to Kernel APIs since the last update to the `kernel/examples`. Fix the examples to work with the latest API changes. ## How was this patch tested? Manually ran the `/kernel/examples/run-kernel-examples.py` and verified the build and display results are valid. 
commit 1210edf271416e04d65ccfd9232492d74add53b4 Author: Allison Portis Date: Tue Oct 10 14:58:57 2023 -0700 [Kernel] Add a FileReadRequest interface for reading deletion vectors (#2142) ## Description Adds interface `FileReadRequest` which exposes the necessary information for reading a deletion vector. ## How was this patch tested? Existing tests suffice. commit cd02359e4dcd49c841eabe89865a2674179f382a Author: Allison Portis Date: Tue Oct 10 14:58:23 2023 -0700 [Kernel] Implement getChild for a few remaining column vectors (#2133) ## Description Provides implementations for `getChild` for column vectors that are missing them. ## How was this patch tested? Adds simple tests for `DefaultViewVector` and `DefaultGenericVector` (used by complex types in the JSON handler). https://github.com/delta-io/delta/pull/2131 also is based off these changes and uses `getChild` instead of `getStruct` everywhere in the code. commit 2117d503c0b05243124512726153870fb6ec9f7c Author: Allison Portis Date: Tue Oct 10 11:04:06 2023 -0700 [Kernel] Resolve some TODOs (#2143) ## Description Resolves some miscellaneous TODOs in the Kernel code including - Moves `MixedDataType` to an internal package since it is a temporary type - Change `INSTANCE` to `$TYPENAME` for all simple data types - Move `singletonColumnVector` to internal utilities - Remove logging to stdout for now - Remove miscellaneous TODO comments that don't need to be in the code ## How was this patch tested? Existing tests suffice. commit 4f9c8b9cc294ec7b321847115bf87909c356bc5a Author: Venki Korukanti Date: Fri Oct 6 00:38:12 2023 -0700 [Spark] Upgrade Spark dependency to 3.5.0 in Delta-Spark The following are the changes needed: * PySpark 3.5 has deprecated the support for Python 3.7. This required changes to Delta test infra to install the appropriate Python version and other packages. 
The `Dockerfile` used for running tests is also updated to have required Python version and packages and uses the same base image as PySpark test infra in Apache Spark. * `StructType.toAttributes` and `StructType.fromAttributes` methods are moved into a utility class `DataTypeUtils`. * The `iceberg` module is disabled as there is no released version of `iceberg` that works with Spark 3.5 yet * Remove the URI path hack used in `DMLWithDeletionVectorsHelper` to get around a bug in Spark 3.4. * Remove unrelated tutorial in `delta/examples/tutorials/saiseu19` * Test failure fixes * `org.apache.spark.sql.delta.DeltaHistoryManagerSuite` - Error message has changed * `org.apache.spark.sql.delta.DeltaOptionSuite` - Parquet file name using the LZ4 code has changed due to a apache/parquet-mr#1000 in `parquet-mr` dependency. * `org.apache.spark.sql.delta.deletionvectors.DeletionVectorsSuite` - Parquet now generates `row-index` whenever `_metadata` column is selected, however Spark 3.5 has a bug where a row group containing more than 2bn rows fails. For now don't return any `row-index` column in `_metadata` by overriding the `metadataSchemaFields: Seq[StructField]` in `DeltaParquetFileFormat`. * `org.apache.spark.sql.delta.perf.OptimizeMetadataOnlyDeltaQuerySuite`: A behavior change by apache/spark#40922. In Spark plans a new function called `ToPrettyString` is used instead of `cast(aggExpr To STRING)` in when `Dataset.show()` usage. 
* `org.apache.spark.sql.delta.DeltaCDCStreamDeletionVectorSuite` and `org.apache.spark.sql.delta.DeltaCDCStreamSuite`: Regression in Spark 3.5 RC fixed by apache/spark#42774 before the Spark 3.5 release Closes delta-io/delta#1986 GitOrigin-RevId: b0e4a81b608a857e45ecba71b070309347616a30 commit bcd18679488a24e72873347354fc4f078574e5c5 Author: Prakhar Jain Date: Thu Oct 5 10:05:55 2023 -0700 [Spark] Add test for checkpoints with no file actions Same as title GitOrigin-RevId: d0fee6fc7cdaa580a6e0445289f8d9eb2f742b8c commit f1fca3ae184afc252c566241e4a2461e6948bd19 Author: lzlfred Date: Wed Oct 4 21:24:05 2023 -0700 [Spark] Cleanup DeltaLog.startTransaction(snapshot) GitOrigin-RevId: 43061bf232f9cfc77423a8eaa787a5c169a182ff commit 7e5a9f499ff22e773ca4f6cceafcbbf016a44c31 Author: Ryan Johnson Date: Wed Oct 4 20:32:44 2023 -0700 [Spark] Reliably pass options when resolving path-based tables Path-based table resolution in spark often relies on options passed by the user. The recently-added `UnresolvedPathBased[Delta]Table` node types do not currently support options, which prevents the user from passing them reliably. Fortunately the code is dead so far, but future changes will rely on this capability. Thus, we fix the shortcoming now so those future changes can rely on it. Refactor only so far, unit tests verify that there are no unexpected behavior changes. No Closes https://github.com/delta-io/delta/pull/2137 GitOrigin-RevId: b0ac83939ffcf074c11cc66b4ef8acce693a7928 commit 4657426baf5cad1052cfeccdf4bbec1532fe189d Author: lzlfred Date: Wed Oct 4 16:31:39 2023 -0700 [Spark] Remove DeltaTableIdentifier.getDeltaLog GitOrigin-RevId: 991754414c6e1aefd09cabcadc1eee574510b8a7 commit ab95ab5147f8ef4b0396103c34a208f6c3817e53 Author: Ryan Johnson Date: Wed Oct 4 14:11:04 2023 -0700 [Spark] Handle unresolved RelationTimeTravel gracefully Spark doesn't handle unresolved `RelationTimeTravel` very gracefully (throwing spark internal error instead of `AnalysisException`). 
It will eventually get fixed, but meanwhile Delta needs a workaround. New unit test No Closes https://github.com/delta-io/delta/pull/2136 GitOrigin-RevId: 50cb3cca7734e9850134918cd54e4bbac02ef674 commit e319a8c0a2248d91fb7e737f72c1f30707d92776 Author: Hao Jiang Date: Wed Oct 4 12:27:05 2023 -0700 [Spark] Add End-to-end test to Uniform Iceberg conversion This PR add an e2e test to uniform HMS migration. It assumes an external HMS has been started and run the test case when the port is in use. Closes delta-io/delta#2135 GitOrigin-RevId: 5937269f7b13edfa9c9aaf48fb0327dfb15ff59d commit 8962318c62d0e26044348416f566437a5279a5bf Author: lzlfred Date: Wed Oct 4 09:57:21 2023 -0700 [Spark] Use fieldId instead of index to lookup partition values during Iceberg to Delta Conversion The existing conversion logic relies on the ordering of "table.spec().fields()" and "fileScanTask.file().partition()" to be exactly the same, which is not true for the latest Iceberg 1.4.0 release: For the exactly same iceberg table that experienced partition evolution: Before 1.4.0: partFields Buffer(1000: date: void(1), 1001: date_month: month(1)) 1.4.0: partFields Buffer(1001: date_month: month(1)) Then the lookup logic later will map the first/only partition col before evolution to 0 idx, which is date_month after the evolution, then results goes wrong. This PR uses "field_id" to look up for partition values to avoid the implicit dependency on ordering. Closes delta-io/delta#2132 GitOrigin-RevId: 6ed3ab3ad18999004ada7053b89b663d88f7f19e commit 7251507fd83518fd206e54574968054f77a11cc0 Author: Tom van Bussel Date: Wed Oct 4 13:21:58 2023 +0200 [Spark] Inline non-correlated scalar subqueries in Merge This PR fixes a bug in MERGE that affects MERGE statements that contain a scalar subquery with non-deterministic results. Such a subquery can return different results during source materialization, while finding matches, and while writing modified rows. 
This can cause rows to be either dropped or duplicated. Closes delta-io/delta#2134 GitOrigin-RevId: 79b5eadebf5781a10a31c85088261783a02c98a0 commit 5475e72ae456b8fee48c079a5a82df5905434994 Author: Jing Wang Date: Tue Oct 3 10:45:57 2023 -0700 [Spark] Repartition in delta source should throw exception in Streaming Previously streaming from repartitioned delta source would quietly return results with data loss columns. This PR adds checks so that streaming from repartitioned delta source would throw corresponding errors for all modes (no-mapping, name-mapping, id-mapping). GitOrigin-RevId: 9c209bb06188c8aff5830c895570bb19da75b6a1 commit d07b64cfe3405637baa5d057d68f81135c528587 Author: Paddy Xu Date: Tue Oct 3 19:24:46 2023 +0200 [Spark] Write DV metrics during UPDATE with DVs added metrics for deletion vectors in operationMetrics of the commit. ``` numDeletionVectorsAdded numDeletionVectorsRemoved numDeletionVectorsUpdated ``` using DeletionVectors test named "Metrics when deleting with DV" GitOrigin-RevId: cf6f982105fc2818c49d7144038f6b81d1d11bf5 commit 0a0ea97bc6dad6ae7229a21cc328d16446dc9e3a Author: Paddy Xu Date: Mon Oct 2 22:57:25 2023 +0200 [Spark][UPDATE with DV] Let UPDATE command write DVs This is the first PR in [[Feature Request] Support UPDATE command with Deletion Vectors](https://github.com/delta-io/delta/issues/1923). This PR introduces a `UPDATE_USE_PERSISTENT_DELETION_VECTORS` config to enable/disable writing DVs for the UPDATE command. In short, rows being updated will be marked as `deleted` by DV, while updated rows will be written to a new file. When CDF is enabled, updated rows and CDC (`preimage` and `postimage`) will be written to the file. New, preliminary tests. Yes. When `UPDATE_USE_PERSISTENT_DELETION_VECTORS` is set to true, `UPDATE` command will not rewrite the whole file but write only the rows being updated. 
Closes delta-io/delta#1942 Signed-off-by: Paddy Xu GitOrigin-RevId: 3ad7c251bb064420d17cd1e685265e61845096a7 commit bbf19c3735b666c14ebb8d41b8a7b068308cb44c Author: Prakhar Jain Date: Mon Oct 2 11:47:07 2023 -0700 [SPARK] Don't cache failed future result in LazyCheckpointProvider Don't cache failed future result in LazyCheckpointProvider. Currently we define LazyCheckpointProvider as following: ``` def createCheckpointProvider(): CheckpointProvider = { ... val v2Actions = ThreadUtils.wait(readActionsFuture) CheckpointProvider(v2Actions, ...) } lazy val underlyingCheckpointProvider = createCheckpointProvider() ``` If the future here fails, then the `createCheckpointProvider()` will fail and whoever is accessing the `underlyingCheckpointProvider` will also get exception. The `underlyingCheckpointProvider` is accessed by Snapshot class - so snapshot class will get some error and query will fail. But a user might run the same query again in some time - since the snapshot is cached under deltalog, the snapshot will again invoke methods on checkpointProvider, which will invoke lazyCheckpointProvider.underlyingCheckpointProvider. Due to this, it will again fail as future has already failed once in the beginning due to some intermittent failure. The solution here is to not use the future and instead invoke real readV2Actions method in the subsequent invocation of `createCheckpointProvider()`. If the `createCheckpointProvider()` has already succeeded in first attempt, then the checkpointProvider will be cached under `lazy val underlyingCheckpointProvider`. But if it failed in first attempt, then next time we should not use future result for getting v2 actions and we should do I/O and read v2 actions again. 
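
A minimal sketch of the retry behaviour described above, under stated assumptions: `LazyProviderSketch`, `prefetched`, and `readDirectly` are hypothetical names, and `Seq[String]` stands in for the real V2 actions. It relies on the fact that Scala re-runs a `lazy val` initializer on the next access if the previous attempt threw. It is not the actual `LazyCheckpointProvider` code.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.util.Try

// Hypothetical stand-in for the provider described above.
final class LazyProviderSketch(
    prefetched: Future[Seq[String]],      // assumed: result of an earlier async read
    readDirectly: () => Seq[String]) {    // assumed: synchronous re-read doing fresh I/O

  // Set once the prefetched future has been consumed, whether it succeeded or not.
  @volatile private var futureConsumed = false

  private def createProvider(): Seq[String] = {
    if (!futureConsumed) {
      futureConsumed = true
      Await.result(prefetched, 10.seconds) // rethrows if the prefetch failed
    } else {
      readDirectly() // later attempts bypass the failed future
    }
  }

  // If the initializer throws, the lazy val stays uninitialized and the
  // initializer runs again on the next access.
  lazy val underlying: Seq[String] = createProvider()
}

object LazyProviderDemo {
  def run(): (Boolean, Seq[String]) = {
    val failed = Future.failed[Seq[String]](new java.io.IOException("transient"))
    val provider = new LazyProviderSketch(failed, () => Seq("v2-action"))
    val first = Try(provider.underlying)   // first access surfaces the failure
    (first.isFailure, provider.underlying) // second access re-reads and succeeds
  }
}
```

In this sketch the first access fails with the prefetch error, while the second access performs the direct read and caches its result in the `lazy val`.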
GitOrigin-RevId: a5631e91a15ec7b991bf5c7ba213a59a465b1d1a commit 4e7cb5fcbbce8038bfbce8fe0ad3904974736080 Author: Lukas Rupprecht Date: Mon Oct 2 10:47:44 2023 -0700 [Spark] Replaces startTransaction(Snapshot) calls with catalogTable overload This PR replaces all calls to DeltaLog.startTransaction(Snapshot) with calls to DeltaLog.startTransaction(Option[CatalogTable], Snapshot). This PR is part of https://github.com/delta-io/delta/issues/2105 and a follow-up to https://github.com/delta-io/delta/pull/2125. It makes sure that transactions have a valid catalogTable attached to them so Uniform can correctly update the table in the catalog. This is a small refactoring change so existing test coverage is sufficient. No Closes https://github.com/delta-io/delta/pull/2126 GitOrigin-RevId: d82787c64979a2dd4a363bf92a1640b7635ec02e commit 5f9b98e86590cf8891c73eb2b78482aee89547d1 Author: Venki Korukanti Date: Tue Oct 3 09:33:37 2023 -0700 [Kernel] Implement partition pruning ## Description Part of delta-io/delta#2071 (Partition Pruning in Kernel). This PR integrates the different pieces added in previous PRs to have an end-to-end partition pruning. ## How was this patch tested? Added `PartitionPruningSuite` commit f1eae09c4dc0686c06e9a25e6011e95cc525c7e3 Author: Gerhard Brueckl Date: Tue Oct 3 08:22:50 2023 +0200 [Connectors] Update PowerBI Connector - fixes and restructuring ## Description This PR resolves [issue #1978](https://github.com/delta-io/delta/issues/1978). During the restructuring of the project, I also fixed some other minor bugs that no one found yet if possible, changed my tests to use the Golden Tables instead of Delta Lake tables residing on my blob stores which partially required authentication ## How was this patch tested? Tested with a set of Golden Tables from https://github.com/delta-io/delta/tree/master/connectors/golden-tables/src/main/resources/golden` ## Does this PR introduce _any_ user-facing changes? 
No Co-authored-by: Gerhard Brueckl commit a03818666f6cf11df618504a9014135ec2b7bb9f Author: Allison Portis Date: Mon Oct 2 14:43:56 2023 -0700 [Kernel] Refactor the Array and Map return types in ColumnVector and Row (#2087) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description BASED OFF OF #2069; to review changes in this PR select the last 4 commits ([here](https://github.com/delta-io/delta/pull/2087/files/a5073b4c14496314bae1ca97163f9e7bbc67bc6d..d160915717d683ebfd6087a93ccb224a57216d82)) Refactors the ColumnVector and Row `getArray` and `getMap` APIs to return wrapper objects. These wrappers provide APIs to retrieve column vector views of the elements or keys/values of the map/array. ## How was this patch tested? Supporting the new APIs in some of the java testing infrastructure was non-trivial, so this PR also moves some tests from TestDeltaTableReads to DeltaTableReadsSuite and adds the complex types to the scala testing infrastructure. Existing tests have been modified for this PR as well as a few additional checks/tests added. commit 32918db26e87849ac5107e936ea6141d6a304436 Author: Allison Portis Date: Mon Oct 2 08:38:59 2023 -0700 [Kernel] Upgrade JUnit versions for kernel modules Upgrade JUnit so we can use `assertThrows` in the kernel modules GitOrigin-RevId: ea9ad1ca13c2e4af76a594964a4a9bdf13794bc0 commit 01fee68c533e2a85d4a0d52feb03d8c7115dc028 Author: Hao Jiang Date: Mon Oct 2 02:52:51 2023 -0400 [Spark] Support UniForm to use Hive Metastore UniForm will use HMS as catalog instead of using file system. 
Closes delta-io/delta#2120 GitOrigin-RevId: f2d863c6e91e4d5d8c2e0f373f4c0c4ad9956fb6 commit 789ea30f494eb0ea71f386398348014224d6edff Author: Lukas Rupprecht Date: Sun Oct 1 22:04:03 2023 -0700 [Spark] Replaces startTransaction calls with catalogTable overload This PR replaces all calls to OptimisticTransaction.startTransaction() with calls to OptimisticTransaction.startTransaction(Option[CatalogTable]). This PR is part of https://github.com/delta-io/delta/issues/2105 and ensures that transactions have a valid catalogTable attached to them so Uniform can correctly update the table in the catalog. This is a small refactoring change so existing test coverage is sufficient. No Closes https://github.com/delta-io/delta/pull/2125 GitOrigin-RevId: 014a459275e5fec5bfc51ff143563b2949c607c3 commit 5d43f1db5975dca31da29f714b1a155aa4367aee Author: Prakhar Jain Date: Fri Sep 29 16:25:25 2023 -0700 Protocol changes for log compaction Protocol changes for log compaction Issue: https://github.com/delta-io/delta/issues/2072 Closes https://github.com/delta-io/delta/pull/2122 GitOrigin-RevId: c15ff24a2a4242520f5cf8ffdb8604a4ffc36805 commit eb33558b5bd1cf08281d2e0d794318c5c55ae4b0 Author: Dhruv Arya Date: Fri Sep 29 12:38:36 2023 -0700 [Spark] Update V2 Checkpoint feature name to indicate feature completeness Now that all V2 Checkpoints are feature complete as per the Delta spec, this updates the table feature name to indicate that it is ready for use. 
Closes delta-io/delta#2124 GitOrigin-RevId: 8e45e735ef3e1ce071082e1c0833e9157f1c81b5 commit 0e05caf5c2124f61da69dc6671c8011450a6e831 Author: Prakhar Jain Date: Fri Sep 29 11:30:19 2023 -0700 [Spark] Read support for log compactions This PR adds read support for log compactions described here: https://github.com/delta-io/delta/issues/2072 Closes https://github.com/delta-io/delta/pull/2073 GitOrigin-RevId: 6f4a09c3fa09c303cdeb747c382cedcfda5a2a4c commit 2d922660d6fbb8b65bc5d20b276969eedde74c8d Author: Venki Korukanti Date: Fri Sep 29 11:16:17 2023 -0700 [Spark] Fix resource leaking when a deletion vector file is not found The `DeltaParquetFileFormat` adds additional glue for filtering the deleting rows (according to the deletion vector) from the iterator returned by the `ParquetFileFormat`. In case when the DV file is not found, the iterator returned from `ParquetFileFormat` should be close. An integration test simulating the DV file deletion and verifying no resource leak. Closes delta-io/delta#2113 Signed-off-by: Venki Korukanti GitOrigin-RevId: d378b495630da31ff2af062dd9124874fdb69e12 commit 9e3d4c232ba9f80f040b762eecfd19f5d0f0cca9 Author: Dhruv Arya Date: Fri Sep 29 09:51:38 2023 -0700 [Spark] Add metadata cleanup logic for V2 Checkpoints MetadataCleanup has been updated to make it aware of V2 Checkpoint sidecar files. Sidecar files are only deleted if 1. They are not being used by any non-expired checkpoints AND 2. They have expired as per the log retention period. Closes delta-io/delta#2102 GitOrigin-RevId: 82f76853e049f47d5affc8fe82628c7ff7fecbf9 commit fe7c8b8b8e344d19bef9233879a0e11e8bda8edc Author: Venki Korukanti Date: Fri Sep 29 09:19:01 2023 -0700 [Spark] Minor update to `generate_iceberg_jars.py` to be specific about exclude files The script `generate_iceberg_jars.py` breaks the build if the delta source is cloned in a directory with the name `source` or `opensource`. 
Update the script to be specific about the files to exclude (it should be `sources.jar`). Ran `build/sbt clean publishM2` successfully after deleting the `icebergShaded/libs` folder and was able to see the generated libs. Closes delta-io/delta#2121 Signed-off-by: Venki Korukanti GitOrigin-RevId: dbc0f87b3054c06c57eec699dba9349151260de1 commit ccf7fbc04b85df9cd850f4aa4d1fb6edf7c9c389 Author: Christos Stavrakakis Date: Thu Sep 28 20:08:14 2023 +0200 [Spark] Check DV cardinality when reading DVs Update DeletionVectorStore to verify that the cardinality of the DVs we read from files are the expected ones. Closes https://github.com/delta-io/delta/pull/2079 GitOrigin-RevId: c6b3c7a3088bcb05e4541e9e073249c2886b16c9 commit aa1285441b3899d6ffc746a35da909c1953e232f Author: Ryan Johnson Date: Thu Sep 28 10:13:52 2023 -0700 [Spark] Propagate catalog table through DeltaSink #### Which Delta project/connector is this regarding? -Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description In order to implement https://github.com/delta-io/delta/issues/2052 for streaming writes, `DeltaSink` needs to track the catalog table, if any, so it can properly initialize the transactions it executes. We can't change the Spark DataSource API that creates the sink, so instead we add logic in `DeltaAnalysis` that extracts the catalog table from the `WriteToStream` and applies it to the underlying `DeltaSink`. New unit test. ## Does this PR introduce _any_ user-facing changes? No. Closes https://github.com/delta-io/delta/pull/2109 GitOrigin-RevId: b1abc21ecdbe863611fee2dc82a519d2f214bf43 commit a8eeeb14b68669a16b47d3690194eccaabd0a123 Author: Ryan Johnson Date: Wed Sep 27 15:30:21 2023 -0700 [Spark] Remove unused DeltaTableIdentifier overloads of DeltaLog.forTable #### Which Delta project/connector is this regarding? 
-Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description The `DeltaTableIdentifier` overloads of `DeltaLog.forTable` involve an internal catalog lookup, which interferes with https://github.com/delta-io/delta/issues/2052 because the resulting `CatalogTable` is lost. Fortunately, the overloads are not actually used, so we can simply delete them to prevent any future problems. Dead code removal. Compilation suffices. ## Does this PR introduce _any_ user-facing changes? No. Closes https://github.com/delta-io/delta/pull/2114 GitOrigin-RevId: c9d90f8af57658a9302dcda0685fb7aa78d7eea4 commit eb92c28b6306dd9e74e14adb3bff8d791b727be2 Author: Jacek Laskowski Date: Wed Sep 27 18:23:12 2023 +0200 [Spark] Code cleanup #### Which Delta project/connector is this regarding? - Spark ## Description There are various code cleanups to make reading the code easier: 1. Remove unnecessary type annotations 2. Typo fixes 3. Code formatting 4. Replacing `Literal(true)` with `Literal.TrueLiteral` 5. Replacing `new Column` with `Column` (object) Local build. Expecting more to come from the official checks on github ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#2067 Co-authored-by: Johan Lasperas Signed-off-by: Johan Lasperas GitOrigin-RevId: f3127b17345467b07d3901e1f58bf339dc45c0f2 commit 28a84a3c81bc68b52fd8ba51b361d84b027dff4f Author: lzlfred Date: Wed Sep 27 07:30:36 2023 -0700 [Spark] Pass `catalogTable` to `WriteIntoDelta` Closes delta-io/delta#2110 GitOrigin-RevId: 6db61ce5d11629ba55e26efe05fc65257f4ecb2b commit ca698951e6e223a07d584445e164678626da986c Author: Venki Korukanti Date: Thu Sep 28 12:42:57 2023 -0700 [Kernel] Add partition pruning related utility methods ## Description Part of delta-io/delta#2071 (Partition Pruning in Kernel). 
This PR adds the following utility methods: * Dividing `Predicate` given to the `ScanBuilder.withFilter` into data column and partition column predicates * Rewrite the partition column `Predicate` to refer to the columns in the scan file columnar batch with the appropriate partition value deserialization expressions applied. ## How was this patch tested? Added UTs commit 879df3c5b1bad076372e6e840acdb382f7aee29f Author: Venki Korukanti Date: Wed Sep 27 14:30:58 2023 -0700 [Kernel] Add `partition_value` and `element_at` expressions (#2096) ## Description Part of delta-io/delta#2071 (Partition Pruning in Kernel). We need the following two expressions in order to evaluate the predicate on scan file columnar batch data. * `element_at(map_column, key_value)`: Take input a `map` type column and `key` value, return the `value` for the given `key`. This is similar to Apache Spark™ UDF for similar purposes. This expression will be used to retrieve the specific partition value from the `(partition column name -> string serialized partition)` map * `partition_value(string_type_value, datatype)`: Decode the partition value given as a string into the given datatype format. The interpretation of the string value is according to the Delta protocol. ## How was this patch tested? Added UTs commit 7f8fe8bbb7693a39c84362e1cf0f7b0006bba757 Author: Gengliang Wang Date: Tue Sep 26 15:35:20 2023 -0700 Add CustomCatalogSuite for testing external DSV2 catalog support Add CustomCatalogSuite for testing external DSV2 catalog support. It sets the current catalog to a dummy DSV2 catalog and verifies the result of utility commands. GitOrigin-RevId: 9b0ce66abcc6025e1a8810cee4c1b1c05e8a8c0e commit dcd4b502469064a8173190726c7883c9590a0ef6 Author: Lukas Rupprecht Date: Tue Sep 26 14:00:25 2023 -0700 #### Which Delta project/connector is this regarding? 
- [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR adds a catalogTable member to UPDATE, DELETE, and MERGE commands. This catalogTable member is then passed to the transaction these commands create so that it is later accessible in the Iceberg conversion that Uniform performs. This is necessary so Uniform can correctly retrieve and update a table from HMS. This PR is part of https://github.com/delta-io/delta/issues/2105. Adapted existing unit tests. This is mainly a refactoring change so existing unit test coverage is sufficient. ## Does this PR introduce _any_ user-facing changes? No Closes https://github.com/delta-io/delta/pull/2108 GitOrigin-RevId: 81ac8a60f9527e1200fe1b40b2a9b4c0ebad00d5 commit a12ff5cab8da2509a0590f1720a0236d962991ad Author: Ryan Johnson Date: Tue Sep 26 11:43:02 2023 -0700 [Spark] Rewrite SHOW COLUMNS as v2 command #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description The Delta version of the SHOW COLUMNS command used to rely on a hacky custom SQL parser rule, but it turns out this is unnecessary -- we can simply add a new DeltaAnalysis rule to intercept the v2 `ShowColumns` command, and replace it with the Delta-specific `ShowDeltaTableColumnsCommand` (which we rename and also upgrade to v2, so it doesn't trigger extra catalog lookups any more). Existing unit tests updated. ## Does this PR introduce _any_ user-facing changes? No. Closes https://github.com/delta-io/delta/pull/2093 GitOrigin-RevId: 56f1ed3524b73ff62700733195b3a1f46ce7a302 commit f026ea3ef714b99d365d56297887a169c3da8345 Author: Prakhar Jain Date: Mon Sep 25 18:34:46 2023 -0700 [Spark] Make randomization expression configurable in MultiDimension clustering configurable Make the randomization expression in MultiDimension clustering configurable.
GitOrigin-RevId: 0927a8003a40499bb5d1aa57bcf4f364ed44949e commit 0a364eccb5df99b998b3697112546aaf980b469f Author: Hao Jiang Date: Mon Sep 25 17:13:56 2023 -0700 Add [errorClass] prefix to delta error messages Modify delta exceptions to use the format `[errorClass] message` for their error messages. GitOrigin-RevId: 241d0157eb93ecc27fbf0a750a055f9a898a794e commit 8409c28e0b0ec1ed41a48ad3868aad75b9328ad8 Author: Venki Korukanti Date: Wed Sep 27 06:17:02 2023 -0700 [Kernel] Add `PredicateEvaluator` interface and default implementation ## Description Part of delta-io/delta#2071 (Partition Pruning in Kernel). We need a way to evaluate the predicate on a `ColumnarBatch` using an existing optional selection vector. Adds `PredicateEvaluator` interface. ``` public interface PredicateEvaluator { /** * Evaluate the predicate on given inputData. Combine the existing selection vector with the * output of the predicate result and return a new selection vector. * * @param inputData {@link ColumnarBatch} of data to which the predicate * expression refers to input. * @param existingSelectionVector Optional existing selection vector. If not empty, it is * combined with the predicate result. The caller is also * releasing the ownership of `existingSelectionVector` to this * callee, and the callee is responsible for closing it. * @return A {@link ColumnVector} of boolean type that captures the predicate result for each * row together with the existing selection vector. */ ColumnVector eval(ColumnarBatch inputData, Optional existingSelectionVector); } ``` ## How was this patch tested? Unit tests commit 8df0ff097d9a4141236c4a0f84aa4a589f8bb3f3 Author: Ryan Johnson Date: Mon Sep 25 13:25:36 2023 -0700 [Spark] Remove String+options overload of DeltaLog.forTable The `String` overloads of `DeltaLog.forTable` are dangerous, because we can't tell whether the caller intended to pass a path-string or a table-identifier string. 
Rework unit tests to no longer use the string+options overload of that method, preferring instead to use the `Path` that is almost always available. Closes https://github.com/delta-io/delta/pull/2025 GitOrigin-RevId: 7ecd856afb5acf0111c7e150c76ddfb7fd2bf8af commit 6859c863e88bfe7be6d5ccbb0c221bdde57a00c3 Author: Prakhar Jain Date: Mon Sep 25 09:22:51 2023 -0700 [Spark] Read side changes for v2 checkpoints This PR adds read side changes for v2 checkpoints. Closes https://github.com/delta-io/delta/pull/2056 GitOrigin-RevId: 3673bb576aed5e1b572f2dfc4b69e829ae9555a6 commit 4622db66fdab41a7e1729a58ab7daa8038f70da1 Author: Kam Cheung Ting Date: Fri Sep 22 16:00:05 2023 -0700 Code refactor refactor code inside StatisticsCollection.scala. GitOrigin-RevId: d59a7df828db4df6eafec94d8622176b3fc8bb49 commit 4eb177eaf4c16080887d78407bb64a4183832686 Author: Lukas Rupprecht Date: Fri Sep 22 14:14:40 2023 -0700 [Spark] Rewrite DESCRIBE HISTORY to use Spark Table Resolution This PR rewrites the Delta DESCRIBE HISTORY command to use Spark's table resolution logic instead of resolving the target table manually at command execution time. For that, it changes DescribeDeltaHistory to a UnaryNode that takes either a UnresolvedTable or UnresolvedPathBasedDeltaTable as a child plan node, which will be resolved by Spark. Once resolved, the DescribeDeltaHistory node is transformed to an actual runnable command (DescribeDeltaHistoryCommand) in DeltaAnalysis. The resolved table is passed to the command in the form of a DeltaTableV2. This is mainly a refactor and the existing DescribeDeltaHistory suite already contains a large set of tests, which this PR relies on. The PR also updates the DeltaSqlParserSuite to check that commands are correctly parsed into a DescribeDeltaHistory. 
No Closes https://github.com/delta-io/delta/pull/2090 GitOrigin-RevId: 75eb8c8ea06350612b8b51fc6a88e11845e21b92 commit 6f08187c39f5ba2f680304ebe47da33ecd6f9dfa Author: Paddy Xu Date: Fri Sep 22 23:11:43 2023 +0200 [Spark] Correct protocol downgrade error when some table features are enabled This PR simplifies the logic of ALTER TABLE command by making protocol downgrade a no-op. Some examples: ```sql -- table has Protocol(2, 2) ALTER TABLE table SET TBLPROPERTIES ( delta.minReaderVersion = '1', delta.minWriterVersion = '1' ) -- before: cannot downgrade from Protocol(2, 2) to Protocol(1, 1) -- after: table stays on Protocol(2, 2) ``` ```sql -- table has Protocol(2, 2) ALTER TABLE table SET TBLPROPERTIES ( delta.minReaderVersion = '1', delta.minWriterVersion = '1', delta.enableChangeDataFeed = 'true' ) -- before: cannot downgrade from Protocol(2, 2) to Protocol(1, 1) -- after: table upgraded to Protocol(2, 4) ``` Closes https://github.com/delta-io/delta/pull/2062. New tests. Yes. See the first section. GitOrigin-RevId: c60acb990bf8e93581b9ed7e8ee36865b1960ea3 commit f0a3864921f4d34946a3c6b218bd40c817f9bcf4 Author: Andreas Chatzistergiou Date: Fri Sep 22 23:09:36 2023 +0200 Protocol version downgrade support in Delta Tables Currently, we support Protocol downgrade in the form of feature removal but downgrading protocol versions is not possible. This PR adds support for protocol version downgrade. This is only allowed for tables that support either reader+writer table features or writer table features. The protocol downgrade takes place when the user removes a table feature and there are no non-legacy table features left in the table. The protocol is downgraded to the minimum reader/writer versions required to support all enabled legacy features. For example, `Protocol(3, 7, readerFeatures=(DeletionVectors), writerFeatures=(DeletionVectors, ChangeDataFeed)` is downgraded to `Protocol(1, 4)` after removing the DeletionVectors table feature. 
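The downgrade rule described above (drop to the minimum reader/writer versions needed by the remaining legacy features) can be sketched as follows. This is an illustration only, not the Spark implementation: the function name `downgraded_protocol` is hypothetical, the feature-to-version table is an assumption based on the legacy feature versions listed in the Delta protocol specification and should be checked against it, and falling back to `(1, 1)` when no features remain is also an assumption.

```python
# Assumed minimum (reader, writer) versions for legacy features; verify
# these against the Delta protocol specification before relying on them.
LEGACY_FEATURE_MIN_VERSIONS = {
    "appendOnly": (1, 2),
    "invariants": (1, 2),
    "checkConstraints": (1, 3),
    "changeDataFeed": (1, 4),
    "generatedColumns": (1, 4),
    "columnMapping": (2, 5),
    "identityColumns": (1, 6),
}


def downgraded_protocol(remaining_features):
    """Return the (reader, writer) versions to downgrade to, or None.

    None means at least one non-legacy table feature is still present,
    so the table has to stay on the table-features protocol.
    """
    features = set(remaining_features)
    if any(f not in LEGACY_FEATURE_MIN_VERSIONS for f in features):
        return None
    # Assumption: with no features left, the lowest protocol is (1, 1).
    reader = max((LEGACY_FEATURE_MIN_VERSIONS[f][0] for f in features), default=1)
    writer = max((LEGACY_FEATURE_MIN_VERSIONS[f][1] for f in features), default=1)
    return (reader, writer)


# Mirrors the example in the commit message: after DeletionVectors is
# removed, only ChangeDataFeed is left.
print(downgraded_protocol({"changeDataFeed"}))
# A non-legacy feature (here an arbitrary name) blocks the downgrade.
print(downgraded_protocol({"changeDataFeed", "someNonLegacyFeature"}))
```

Under the assumed table, the first call yields `(1, 4)`, which matches the `Protocol(1, 4)` result stated in the commit message.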
Closes delta-io/delta#2061 GitOrigin-RevId: 76633f6a08ae747ea508ef84e4e4f62a7ad5609d commit 87f80ce099564844a74775392180ff7689834076 Author: Venki Korukanti Date: Fri Sep 22 14:06:23 2023 -0700 [Kernel] Misc. cleanup of code in `kernel-api` module around scan file generation ## Description Various refactorings which together reduce the code size by 1000+ lines. * Change the `Scan.getScanFiles()` to return `FilteredColumnarBatch` instead of `ColumnarBatch`. The former has an additional selection vector which avoids rewriting the `ColumnarBatch`es generated by the readers of Delta log commit/checkpoint files. Before this change, there was a rewriting of `ColumnarBatch`es in `kernel-api` module. * Remove the POJO based `ColumnarBatch` and `Row`. They are no longer needed. * Create a `ScanFile` API class that contains the schema of the scan file rows returned by `Scan.getScanFiles` * Create an extension `InternalScanFile` for utility methods that are internal only to `kernel-api` module. * Clean up the `ScanStateRow` (move related APIs from `Utils.java` to `ScanStateRow.java`) * Remove unneeded `Action` classes ## How was this patch tested? Existing tests. commit c7a39da509434d6864ffc73c753af7f414fec657 Author: Gengliang Wang Date: Wed Sep 20 17:27:54 2023 -0700 [spark] Support external DSV2 catalog in: - Restore command - Clone command #### Which Delta project/connector is this regarding? -Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Support external DSV2 catalog in RESTORE & Clone command. After the changes, the restore command supports tables from the external DSV2 catalog. 
For example, with ``` spark.sql.catalog.customer_catalog=org.apache.spark.sql.CustomerCatalog ``` We can query ``` SET CATALOG customer_catalog; RESTORE TABLE t1 VERSION AS OF 0 CREATE TABLE spark.default.t2 SHALLOW CLONE t1 VERSION AS OF 1 ``` Or simply ``` RESTORE TABLE customer_catalog.default.t1 VERSION AS OF 0 CREATE TABLE spark.default.t2 SHALLOW CLONE customer_catalog.default.t1 VERSION AS OF 1 ``` --> 1. new end-to-end tests 2. new parser test cases ## Does this PR introduce _any_ user-facing changes? Yes, users can use RESTORE & Clone command on the tables of their external DSV2 catalogs. Closes delta-io/delta#2057 Signed-off-by: Gengliang Wang GitOrigin-RevId: 95f76326f8a3d35f0cfa1037d8ce7c31f430b47e commit 444bdfcdc8ea976e776eb20f4822e1ea2541446a Author: Ryan Johnson Date: Wed Sep 20 15:58:01 2023 -0700 [Spark] Define OptimisticTransaction.catalogTable #### Which Delta project/connector is this regarding? -Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description As part of implementing https://github.com/delta-io/delta/issues/2052, `OptimisticTransaction` needs the ability to track a `CatalogTable` for the table it updates. That way, post-commit hooks can reliably identify catalog-based tables and make appropriate catalog calls in response to table changes. For now, we just define the new field, and add a new catalog-aware overload of `DeltaLog.startTransaction` that leverages it. Future work will start updating call sites to actually pass catalog information when starting a transaction. The new field is currently not used, so nothing really to test. Existing unit tests verify the existing overloads are not broken by the change. ## Does this PR introduce _any_ user-facing changes? 
No Closes https://github.com/delta-io/delta/pull/2083 GitOrigin-RevId: 03f2d2732a939cdd9ee2e56e07b23e8be00bcb6f commit 60a3c03b298acf099edfd19a39a570cda7e2be2e Author: lzlfred Date: Wed Sep 20 09:03:59 2023 -0700 minor change on error message minor change on error message GitOrigin-RevId: 4365ea1d64d427264eb546121ce4cd46076ff02f commit a1b41a2f57a6ea0f3d1358a015ae0881e5f3b01d Author: RunyaoChen Date: Tue Sep 19 15:13:55 2023 -0700 Followup on "Rewrite DescribeDeltaDetailCommand to v1/v2 hybrid" This PR is a follow-up of https://github.com/delta-io/delta/commit/ffdd04a7134f0e131b993fc44127b7d38f62d85d: * Added ResolvedPathBasedNonDeltaTable as a resolved placeholder plan for UnresolvedPathBasedTable * Added test cases for raw path in the parser UT. Closes delta-io/delta#2084 GitOrigin-RevId: 5eab8703469e65bd31df6284cbff3a258a8541a8 commit 87e0b79d4431f6b3b220cffabfd66196d5da6e9d Author: Allison Portis Date: Wed Sep 20 10:33:24 2023 -0700 [Kernel] Remove kernel/build.sbt as it is not used anymore (#2085) commit 8e3943f639c1e2e0b00e8bec6deb5acfd388eafd Author: Venki Korukanti Date: Tue Sep 19 21:48:44 2023 -0700 [Kernel] Add support for nested column reference expression ## Description Part of delta-io/delta#2071 (Partition Pruning in Kernel). We need a way to reference the `partitionValues` nested column in scan file `ColumnarBatch`. Currently, the `Column` expression can only be used to refer to a top-level column. There is no way to refer to a nested column. This PR updates the `Column` expression to be a multi-part identifier. This is similar to the Spark's [`NamedReference`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/NamedReference.java) DSv2 expression. Fixes delta-io/delta#2040 (also contains different approaches to refer to a nested column). ## How was this patch tested? 
Added a UT commit 3648be74852a673355111e4bcf787f3883d5855e Author: Venki Korukanti Date: Tue Sep 19 21:47:25 2023 -0700 [Kernel] Add selection vector create API on `ExpressionHandler` ## Description Part of delta-io/delta#2071 (Partition Pruning in Kernel). Delta `kernel-api` module creates a selection vector (a `ColumnVector` boolean data type) that goes along with `ColumnarBatch` of scan files read from the Delta checkpoint or commit files. It uses the `selection vector` to select only a subset of rows from the scan file `ColumnarBatch`. Also at the same time `kernel-api` module shouldn't be creating `ColumnVector`s. Instead should rely on the `TableClient` APIs to create the vectors. It adds the below API on `ExpressionHandler` ``` /** * Create a selection vector, a boolean type {@link ColumnVector}, on top of the range of values * given in values array. * * @param values Array of initial boolean values for the selection vector. The ownership of * this array is with the caller and this method shouldn't depend on it after the * call is complete. * @param from start index of the range, inclusive. * @param to end index of the range, exclusive. * @return A {@link ColumnVector} of {@code boolean} type values. */ ColumnVector createSelectionVector(boolean[] values, int from, int to); ``` This also handles [this](https://github.com/delta-io/delta/pull/1939#discussion_r1293764014) code review comment. ## How was this patch tested? Added tests. commit 2d57978e5c6339f8fa7ef42f251c43e1dcf07ac3 Author: Christos Stavrakakis Date: Tue Sep 19 23:12:27 2023 +0200 Throw deletionVectorSizeMismatch for non-incremental DV updates. Update the sanity check for non-incremental updates to throw a `deletionVectorSizeMismatch` instead of using `require()`. 
Closes delta-io/delta#2081 GitOrigin-RevId: 79b4a8372c585916c7a167f49b29d598ec750ef1 commit 6570e8af0ced298076cb7b1ce9b65931eb7247ef Author: Allison Portis Date: Tue Sep 19 12:53:20 2023 -0700 Use the root build.sbt unidoc command for the kernel docs github action Instead of using `kernel/build.sbt` use the root build and the unidoc command to publish the java API docs Merged it into master on my own fork [allisonport-db/delta](https://github.com/allisonport-db/delta) and ran the job to publish the docs. - Job: https://github.com/allisonport-db/delta/actions/runs/6229920837 - Published docs: https://allisonport-db.github.io/delta/snapshot/kernel-api/java/index.html GitOrigin-RevId: 6e82e81137d680137abcd0334c6dbcf402b54306 commit 1a6c419f40798da8d6461d778f56c642a9c138e7 Author: Ming DAI Date: Tue Sep 19 10:23:33 2023 -0700 Disable stats computation on Decimal and unknown types, also respect parquet rebase mode for Date. GitOrigin-RevId: e676a67fce08692f4b1d4a5d08a934f323517589 commit 84eabee84e9ddf4fae77b70ae12506991c61f7c6 Author: Dhruv Arya Date: Mon Sep 18 16:12:27 2023 -0700 Minor refactor in DeltaConfig and TableFeature Minor refactor in DeltaConfig and TableFeature. No logical changes. Only affects code related to V2 Checkpoints. 
GitOrigin-RevId: 44dc4df44543a34af531354b962d29bedaec55e1 commit dd02e9440df8825f6bf680c7133efd7f45d239a2 Author: Ming DAI Date: Mon Sep 18 16:04:18 2023 -0700 Update imports GitOrigin-RevId: 59bb6223b2c9c71bcabec70081bacb132b0f7133 commit f84d8782c5e8782d25dba878ad3fb678d1a1be20 Author: Hao Jiang Date: Sun Sep 17 05:38:42 2023 -0700 Minor refactor to DeltaDDLSuite.scala GitOrigin-RevId: 727281f96f587217e137d6e848cf8ebe408e70a2 commit 2ee75788d77eb62c9304a09af5cff4cb360fd59f Author: Kam Cheung Ting Date: Sat Sep 16 17:37:19 2023 -0700 Minor refactor to StatisticsCollection.scala and StatsCollectionSuite.scala GitOrigin-RevId: 24857055e38290b694ceea15e519ce4085b8da65 commit f82dc65aa0bc850e77d4818baeae72fde37d5323 Author: Ryan Johnson Date: Sat Sep 16 07:11:23 2023 -0700 Java File overload of DeltaLog.forTable is now test-only Production Delta code doesn't use Java files -- it uses Hadoop paths -- so we can remove all references to `java.io.File` from DeltaLog.scala. Because unit tests make heavy use of Java files, we add suitable replacements in DeltaTestImplicits.scala as a convenience (and to avoid churning test code so much in this PR). Closes https://github.com/delta-io/delta/pull/2016 GitOrigin-RevId: e1c52abb664879f376bb233b6dd15829771d6c9b commit f5da44f1187d3d4f5a396922ca891aa5abfefdad Author: Hao Jiang Date: Fri Sep 15 18:05:56 2023 -0700 Allow user to create table with varchar type on a pre-existing location When a user runs `create table (a varchar(100)) using delta location xx` twice with the same delta location, an error message reports that the new schema does not match the existing one, while in fact they match. The root cause is that Spark converts `varchar` columns to `string` plus a key-value pair `["__CHAR_VARCHAR_TYPE_STRING" -> "varchar()"]` in the column metadata, but Delta does not apply this conversion when it compares the new schema with the existing one. Comparing `varchar` vs. `string` causes a mismatch and the error.
We apply the conversion to the new schema before comparing it against the existing schema to fix the problem. GitOrigin-RevId: 645bdd48a8dd6b8c0ac121551c27d91c8f81cf68 commit fc08c2f1344e6b06d025b9bef1fc40ed095a50bf Author: Ryan Johnson Date: Fri Sep 15 18:04:23 2023 -0700 [Spark] Define and use DeltaTableV2.startTransaction helper method As part of making Delta more catalog-aware (see https://github.com/delta-io/delta/issues/2052), we need to solve two basic problems: 1. When code calls the `TableIdentifier` overload of `DeltaLog.forTable`, the catalog lookup is performed internally and immediately discarded after extracting the table's storage location from it. If the caller needed the catalog info, they are out of luck. Caller can avoid the problem by creating a `DeltaTableV2` instead, which already provides both `DeltaLog` and `CatalogTable`. To support this use case, we define new helper methods that make this easy to do (especially for unit tests). 2. Even if we have a `DeltaTableV2`, there's no convenient way to start a transaction from it, in order to pass along the catalog info. To make it easy for callers to do the right thing, we define new helper methods for starting transactions directly from the `DeltaTableV2` itself. When transactions eventually become aware of catalog info, these new helper methods will make a narrow waist that can be enhanced to pass along their catalog info. Closes https://github.com/delta-io/delta/pull/2053 GitOrigin-RevId: b93443826b6de666fde81c824259b5569738df0e commit 87d15cc5d21a669a480a5bdd0abc2e5034beb57b Author: Venki Korukanti Date: Fri Sep 15 16:17:03 2023 -0700 [Kernel] Add `commons-io % test` dependency to `kernel-defaults` module `commons-io` contains many utility methods which make writing tests simpler. 
GitOrigin-RevId: 39beff6d0348596b969d21de3f99846c0bd75d31 commit b402ad120831855f3c78dad6e9c80c07ed8cd9c1 Author: Ami Oka Date: Fri Sep 15 14:32:39 2023 -0700 Minor refactor to DeltaTableBuilderSuite.scala GitOrigin-RevId: cc2cbb4cc69f4ecb2a4cd3ede12c89152a908578 commit ed6378e3536368473eb06b7af8357a67160a6082 Author: Ryan Johnson Date: Fri Sep 15 11:34:09 2023 -0700 [Spark] Prod Delta code avoids String overloads of DeltaLog.forTable The `String` overloads of `DeltaLog.forTable` are dangerous, because we can't tell whether the caller intended to pass a path-string or a table-identifier string. Rework prod Delta code to explicitly pass `Path` instead of relying on strings to be converted automatically. Closes https://github.com/delta-io/delta/pull/2026 GitOrigin-RevId: 0b75d31c71ab001da88b6061b81f09531cc1b3c4 commit d75c35f294be4d7085346f187daa5d9cfbaf077f Author: Ryan Johnson Date: Fri Sep 15 09:58:19 2023 -0700 [Spark] Rename DeltaTableV2.snapshot as initialSnapshot Closes https://github.com/delta-io/delta/pull/2063 `DeltaTableV2.snapshot` can be stale, if the object was instantiated long before the command using it. This can happen, for example, with a `DeltaTable` (which binds an internal `DeltaTableV2` when constructed and never refreshes it). This can be dangerous, so we rename the field and document the danger so it is less likely to cause bugs. GitOrigin-RevId: 76e69f6889823dba8183cbd8f5e8d2671100d500 commit 5aee228a7693badd2fb5023102e50a0255aeab2f Author: Ryan Johnson Date: Fri Sep 15 07:52:55 2023 -0700 [Spark] Define and use DeltaTableV2.toLogicalRelation Most call sites of `DeltaTableV2.toBaseRelation` immediately use the result to create a `LogicalRelation`, using other fields of `DeltaTableV2` as additional input. The duplicated code is not always consistent, either. Create helper methods `toLogicalRelation` and `toDf` which encapsulate the common-case logic for these operations. 
Closes https://github.com/delta-io/delta/pull/2042 GitOrigin-RevId: 985e93d415a6881f87cef38b7ebf0ae233897e7d commit b1eeed776052253aae0415b1e6fb5d836727596b Author: panbingkun Date: Fri Sep 15 14:09:50 2023 +0000 Update error message in tests Update error message in tests GitOrigin-RevId: 3c2d87d3b0c63b72e7617600e9011944d7dfaf9a commit a183559172642f55ccf55e240467404a06acdc8f Author: Bo Gao Date: Thu Sep 14 23:25:03 2023 -0700 Minor refactor to DeltaSource.scala. GitOrigin-RevId: cec529a1efb53e6eb1d2366eb592bc80973ea6e1 commit 631d681f4b8fd21fc33c80a40d7da266aa7023bc Author: Ryan Johnson Date: Thu Sep 14 15:15:53 2023 -0700 [Spark] Remove String+Clock overload of DeltaLog.forTable The `DeltaLog.forTable` overloads that accept a `Clock` only exist for testing purposes, and have no place in prod code. Furthermore, even the unit tests that do use these overloads are better-served by calling more specific overloads instead of passing a string. Closes https://github.com/delta-io/delta/pull/2024 GitOrigin-RevId: 5e05b1da764d689d0791c32b59005a8428f15684 commit 701520bcec41a0f82079e25b571b632730a90cd6 Author: Venki Korukanti Date: Fri Sep 15 16:03:38 2023 -0700 [Kernel] Resolve path given to `Table.forPath` using `TableClient` APIs ## Description Currently, we expect the path given to `Table.forPath()` to be fully qualified. This enforces an unnecessary burden on the connector to add the necessary schema or authority etc. Instead, add a `FileSystemClient.resolvePath` API and use it to resolve the path to a fully qualified path from `Table.forPath()`. ## How was this patch tested? Existing tests (deleted the `file:` prefix added to tests tables in the path) and a couple of new tests around missing table paths. commit 9ee024b5107657f2a8c9d77ce0add18d58de3958 Author: Venki Korukanti Date: Fri Sep 15 11:07:15 2023 -0700 [Kernel] Rename `DataReadResult` to `FilteredColumnarBatch` ## Description This is a preparatory change for supporting partition pruning in Delta Kernel. 
Currently `DataReadResult` is used in representing a `ColumnarBatch` with a selection vector. It is used when reading the data from the scan files. The `ColumnarBatch` is the data read from Parquet data files and the selection vector is populated based on the deletion vector associated with the scan file. Having selection vector avoids the cost of rewriting the `ColumnarBatch`. We want to use the similar `DataReadResult` in metadata. Currently `Scan.getScanFiles` returns an iterator of `ColumnarBatch`. Instead, the plan is to return `DataReadResult` (which contains the `ColumnarBatch` of scan files, and a selection vector that tells whether a scan file is pruned or not based on the partition filter.). Renaming `DataReadResult` to `FilteredColumnarBatch` to make it generic so that it is used both in the data and metadata path. Also: * Remove the API `public ColumnarBatch rewriteWithoutSelectionVector()` as `FilteredColumnarBatch` is `kernel-api` class and `kernel-api` shouldn't rewrite `ColumnarBatch`es. * Add `getRows()` API on `FilteredColumnarBatch` to iterate through the rows that survived the selection vector. This API helps when the connectors have an easy way to access the survived rows (for example to iterate over the scan file rows that survived the selection vector). * Fix a bug in `ColumnarBatch.getRows().hasNext()` commit 19b6c9e9f5f148e53d1b2a63907cd73d091fda1c Author: Allison Portis Date: Thu Sep 14 19:15:31 2023 -0700 [Kernel] Add test utilities like checkAnswer and checkTable (#2034) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description Improves the testing infrastructure for Scala tests in Delta Kernel. For now adds it to `kernel-defaults` but if we have tests with `ColumnarBatch`s in `kernel-api` we can move it there. ## How was this patch tested? Refactors existing tests to use the new infra. 
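The idea behind `FilteredColumnarBatch` from the rename commit above (a batch plus a selection vector, with `getRows()` visiting only the surviving rows) can be modelled in a few lines. This is a simplified Python model, not the Kernel Java API: the class `FilteredBatch`, its fields, and `get_rows` are hypothetical names, and rows are plain dictionaries rather than columnar data.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class FilteredBatch:
    """Hypothetical stand-in for a batch paired with a selection vector."""

    rows: List[dict]
    # One boolean per row; None means "no filtering, every row survives".
    selection: Optional[List[bool]] = None

    def get_rows(self) -> Iterator[dict]:
        """Yield only the rows that the selection vector keeps."""
        if self.selection is None:
            yield from self.rows
            return
        for row, keep in zip(self.rows, self.selection):
            if keep:
                yield row


# The underlying rows are never rewritten; pruning only flips booleans.
batch = FilteredBatch(
    rows=[{"path": "a"}, {"path": "b"}, {"path": "c"}],
    selection=[True, False, True],
)
print([row["path"] for row in batch.get_rows()])
```

Keeping the selection vector separate is what lets the same structure serve both the data path (rows removed by a deletion vector) and the metadata path (scan files pruned by a partition filter) without copying the batch.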
commit a9f8c8d6eb496691ab07ea37fb74449a2f04b26c Author: Venki Korukanti Date: Thu Sep 14 18:38:58 2023 -0700 [Kernel] Remove unused source files ## Description These files are not used anywhere and no longer needed. commit e2633aebc0ce43d3c1ce05c8888dd795121cbeb5 Author: Venki Korukanti Date: Thu Sep 14 16:19:05 2023 -0700 [Kernel] Fix a bug around closing partially opened resources ## Description Fix a bug in resource handling where we are always closing the resources which should be closed only when a failure occurs. Currently, the code opens two resources, it maintains the opened resource list to close in case the function isn't successful in opening all the required resources. Resources in the maintained list should only be closed when an error occurs (it should be in the `catch` block not in `finally`) ## How was this patch tested? Existing tests commit 36d460f84760aa7ae3f5906e44157b2fb28bc934 Author: Scott Sandre Date: Thu Sep 14 13:18:41 2023 -0700 Remove delta-io/connectors references from delta-io/delta repo [PR A] ## Description When the delta-io/connectors repo was migrated into delta-io/delta, various `delta-io/connectors` references were not updated. This PR fixes that. N/A ## Does this PR introduce _any_ user-facing changes? Updates `pom.xml`s to reference the correct repo. Closes delta-io/delta#2028 Signed-off-by: Scott Sandre GitOrigin-RevId: 8e0f14d9d009cd5e3aa76097f83c0757f4396b58 commit a926bcb4f9b03cdeb341d89baa746d576a8fc1bb Author: Scott Sandre Date: Thu Sep 14 13:17:39 2023 -0700 [Spark] Fix integration tests (use `delta-spark` or `delta-core` artifact based on the version to test) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Update our integration tests (scala, python, with pip, without pip) to use the artifact `delta-spark` when the version is >= 3, else `delta-core`. 
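A minimal sketch of that artifact choice, assuming the version string always starts with a numeric major component (as in `3.0.0rc1` and `2.4.0`). The helper name `delta_artifact_name` is hypothetical and is not taken from `run-integration-tests.py`.

```python
def delta_artifact_name(version: str) -> str:
    """Pick the Maven artifact name for a given Delta version string."""
    # Only the major component matters; "3.0.0rc1" -> 3, "2.4.0" -> 2.
    major = int(version.split(".", 1)[0])
    return "delta-spark" if major >= 3 else "delta-core"


for v in ("3.0.0rc1", "2.4.0"):
    print(v, "->", delta_artifact_name(v))
```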
``` python3 run-integration-tests.py --version 3.0.0rc1 --no-pip python3 run-integration-tests.py --version 2.4.0 --no-pip python3 run-integration-tests.py --version 3.0.0rc1 --pip-only python3 run-integration-tests.py --version 2.4.0 --pip-only ``` ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#2049 Signed-off-by: Scott Sandre GitOrigin-RevId: 795841f2633852ef2db3455915df841230436b77 commit c67d9829ec44aa716b4274863017408d158dcd21 Author: Scott Sandre Date: Thu Sep 14 13:16:27 2023 -0700 [Standalone] Remove delta-storage classes from delta-standalone jar #### Which Delta project/connector is this regarding? - [ ] Spark - [X] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description When the delta-connectors repo was migrated to delta-io/delta, and when delta-standalone's build.sbt configuration was updated to depend on the local delta-storage project (instead of the maven dependency), we were accidentally including the delta-storage classes inside the delta-standalone JAR. This is not what we want. This PR fixes that, so that delta-storage classes are correctly excluded from the delta-standalone JAR. 
Resolves delta-io/delta#1892 ``` build/sbt standaloneCosmetic/publishM2 jar tvf /Users/scott.sandre/.m2/repository/io/delta/delta-standalone_2.12/3.0.0-SNAPSHOT/delta-standalone_2.12-3.0.0-SNAPSHOT.jar ``` Before this PR: ``` 0 Fri Jan 01 00:00:00 PST 2010 io/delta/storage/ 905 Fri Jan 01 00:00:00 PST 2010 io/delta/storage/AzureLogStore.class 290 Fri Jan 01 00:00:00 PST 2010 io/delta/storage/CloseableIterator.class 4945 Fri Jan 01 00:00:00 PST 2010 io/delta/storage/GCSLogStore.class 6103 Fri Jan 01 00:00:00 PST 2010 io/delta/storage/HDFSLogStore.class 6153 Fri Jan 01 00:00:00 PST 2010 io/delta/storage/HadoopFileSystemLogStore.class 1645 Fri Jan 01 00:00:00 PST 2010 io/delta/storage/LineCloseableIterator.class 1157 Fri Jan 01 00:00:00 PST 2010 io/delta/storage/LocalLogStore.class 1544 Fri Jan 01 00:00:00 PST 2010 io/delta/storage/LogStore.class 727 Fri Jan 01 00:00:00 PST 2010 io/delta/storage/S3SingleDriverLogStore$FileMetadata.class 11387 Fri Jan 01 00:00:00 PST 2010 io/delta/storage/S3SingleDriverLogStore.class 0 Fri Jan 01 00:00:00 PST 2010 io/delta/storage/internal/ 1355 Fri Jan 01 00:00:00 PST 2010 io/delta/storage/internal/FileNameUtils.class 1319 Fri Jan 01 00:00:00 PST 2010 io/delta/storage/internal/LogStoreErrors.class 1151 Fri Jan 01 00:00:00 PST 2010 io/delta/storage/internal/PathLock.class 547 Fri Jan 01 00:00:00 PST 2010 io/delta/storage/internal/S3LogStoreUtil$1.class 3761 Fri Jan 01 00:00:00 PST 2010 io/delta/storage/internal/S3LogStoreUtil.class 1000 Fri Jan 01 00:00:00 PST 2010 io/delta/storage/internal/ThreadUtils$1.class 2316 Fri Jan 01 00:00:00 PST 2010 io/delta/storage/internal/ThreadUtils.class ``` After this PR: excludes any `io/delta/storage` classes. ## Does this PR introduce _any_ user-facing changes? Yes, but an intended one; we remove delta-storage classes from the delta-standalone jar. 
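The manual `jar tvf` check above could also be automated. The following is a rough sketch, assuming only that a JAR is a zip archive; the function name `storage_entries` is hypothetical and is not part of the build.

```python
import zipfile


def storage_entries(jar):
    """List entries under io/delta/storage/ in a JAR.

    `jar` may be a filesystem path or a binary file-like object.
    An empty result means no delta-storage classes are bundled.
    """
    with zipfile.ZipFile(jar) as archive:
        return [
            name
            for name in archive.namelist()
            if name.startswith("io/delta/storage/")
        ]
```

After this PR, calling the helper on the published delta-standalone JAR would be expected to return an empty list.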
Closes delta-io/delta#2021 Signed-off-by: Scott Sandre GitOrigin-RevId: 1beeaafce52cf559a71fab22180e8ba084b12552 commit 3837bc383cfa358c14e528601a74f1161e968561 Author: Scott Sandre Date: Thu Sep 14 13:15:15 2023 -0700 [Standalone] Include sources in delta-standalone JAR #### Which Delta project/connector is this regarding? - [ ] Spark - [X] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Include the proper sources in the delta-standalone JAR. Note: these sources are from the delta-standalone-original project, and so currently excludes `/connectors/standalone-parquet/src/main/java/io/delta/standalone/util/ParquetSchemaConverter.java`. I couldn't figure out how to include that one file. But this at least is a substantial improvement for now. Resolves delta-io/delta#1922 ``` build/sbt standaloneCosmetic/publishM2 jar tvf /Users/scott.sandre/.m2/repository/io/delta/delta-standalone_2.12/3.0.0-SNAPSHOT/delta-standalone_2.12-3.0.0-SNAPSHOT-sources.jar ~/.m2/repository/io/delta jar tvf /Users/scott.sandre/.m2/repository/io/delta/delta-standalone_2.12/3.0.0-SNAPSHOT/delta-standalone_2.12-3.0.0-SNAPSHOT-sources.jar 144 Fri Jan 01 00:00:00 PST 2010 META-INF/MANIFEST.MF 0 Fri Jan 01 00:00:00 PST 2010 io/ 0 Fri Jan 01 00:00:00 PST 2010 io/delta/ 0 Fri Jan 01 00:00:00 PST 2010 io/delta/standalone/ 0 Fri Jan 01 00:00:00 PST 2010 io/delta/standalone/actions/ 0 Fri Jan 01 00:00:00 PST 2010 io/delta/standalone/data/ 0 Fri Jan 01 00:00:00 PST 2010 io/delta/standalone/exceptions/ 0 Fri Jan 01 00:00:00 PST 2010 io/delta/standalone/expressions/ 0 Fri Jan 01 00:00:00 PST 2010 io/delta/standalone/internal/ 0 Fri Jan 01 00:00:00 PST 2010 io/delta/standalone/internal/actions/ 0 Fri Jan 01 00:00:00 PST 2010 io/delta/standalone/internal/data/ 0 Fri Jan 01 00:00:00 PST 2010 io/delta/standalone/internal/exception/ 0 Fri Jan 01 00:00:00 PST 2010 io/delta/standalone/internal/expressions/ 0 Fri Jan 01 00:00:00 PST 2010 io/delta/standalone/internal/logging/ 0 Fri Jan 
01 00:00:00 PST 2010 io/delta/standalone/internal/scan/ ........ ``` ## Does this PR introduce _any_ user-facing changes? Yes! Correctly includes the sources in the delta-standalone jar on maven. Closes delta-io/delta#2038 Signed-off-by: Scott Sandre GitOrigin-RevId: ac26f9258caa5c2ed3b84a9a62f25fbe1d474fb3 commit a6bdc35733f7f005c7814aff03ae10d9d2497b2c Author: Scott Sandre Date: Thu Sep 14 13:13:45 2023 -0700 [Delta-Iceberg] Fix delta-iceberg jar to not pull in delta-spark and delta-storage jars - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (delta-iceberg) Resolves delta-io/delta#1903 Previously, the `delta-iceberg` jar was incorrectly including all of the classes from `delta-spark` and `delta-storage`. You could run ``` wget https://repo1.maven.org/maven2/io/delta/delta-iceberg_2.13/3.0.0rc1/delta-iceberg_2.13-3.0.0rc1.jar jar tvf delta-iceberg_2.13-3.0.0rc1.jar ``` and see ``` com/databricks/spark/util/MetricDefinitions.class ... io/delta/storage/internal/ThreadUtils.class ... org/apache/spark/sql/delta/DeltaLog.class ``` This PR fixes that by updating various SBT assembly configs: 1) `assemblyExcludedJars`: excluding jars we don't want (but this only works for jars from `libraryDependencies`, not `.dependsOn`) 2) `assemblyMergeStrategy`: manually discarding other classes using case matching Added a new test suite and sbt project. The new project depends on the assembled version of the `delta-iceberg` jar. The test suite loads that jar and analyses its classes. Published the jars locally and ran through a simple end-to-end UniForm example. 
``` ========== Delta ========== build/sbt storage/publishM2 build/sbt spark/publishM2 build/sbt iceberg/publishM2 spark-shell --packages io.delta:delta-spark_2.12:3.0.0-SNAPSHOT,io.delta:delta-storage:3.0.0-SNAPSHOT,io.delta:delta-iceberg_2.12:3.0.0-SNAPSHOT --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" val tablePath = "/Users/scott.sandre/uniform_tables/table_000" sql(s"CREATE TABLE delta.`$tablePath` (col1 INT, col2 INT) USING DELTA TBLPROPERTIES ('delta.universalFormat.enabledFormats'='iceberg')") sql(s"INSERT INTO delta.`$tablePath` VALUES (1, 1), (2,2), (3, 3)") sql(s"SELECT * FROM delta.`$tablePath`").show() +----+----+ |col1|col2| +----+----+ | 3| 3| | 2| 2| | 1| 1| +----+----+ ========== Iceberg ========== spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1 \ --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \ --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \ --conf spark.sql.catalog.spark_catalog.type=hive \ --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \ --conf spark.sql.catalog.local.type=hadoop \ --conf spark.sql.catalog.local.warehouse=/Users/scott.sandre/iceberg_warehouse spark.read.format("iceberg").load("/Users/scott.sandre/uniform_tables/table_000").show() +----+----+ |col1|col2| +----+----+ | 1| 1| | 2| 2| | 3| 3| +----+----+ ``` Fixes a bug where delta-iceberg jar included delta-spark and delta-storage Closes delta-io/delta#2022 Signed-off-by: Scott Sandre GitOrigin-RevId: 187ca6a09bd423de0fde00bff26e47a88797a2f2 commit 51f97b8917850581c9198bec50bcaf65f2f70c57 Author: Gengliang Wang Date: Thu Sep 14 11:08:41 2023 -0700 [Spark] Support external DSV2 catalog in Vacuum command #### Which Delta project/connector is this regarding? 
- [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Support external DSV2 catalogs in the Vacuum command. After the changes, the Vacuum command supports tables from external DSV2 catalogs. For example, with ``` spark.sql.catalog.customer_catalog=org.apache.spark.sql.CustomerCatalog ``` We can query ``` SET CATALOG customer_catalog; VACUUM t1 ``` Or simply ``` VACUUM customer_catalog.default.t1 ``` This PR also introduces a new analyzer rule, ResolveDeltaPathTable, so that external DSV2 catalogs won't need to implement the resolution of Delta file path tables. 1. new end-to-end tests 2. new parser test cases ## Does this PR introduce _any_ user-facing changes? Yes, users can use the Vacuum command on the tables of their external DSV2 catalogs. Closes delta-io/delta#2039 Signed-off-by: Gengliang Wang GitOrigin-RevId: 2c4ac73b53fe3f550a8d978f038a0d038402eab9 commit 2e8cab38d9240e4f5e8b36eff870e3818767251a Author: Ryan Johnson Date: Thu Sep 14 10:22:36 2023 -0700 Use DeltaTable.forTableWithSnapshot more often in tests Many unit tests do not currently take advantage of `DeltaLog.forTableWithSnapshot` and can be simplified. Closes https://github.com/delta-io/delta/pull/2017 GitOrigin-RevId: c30269bf5ef4b9a8d1db2f035f15a9cba21f6ff9 commit f3f04a70bdad48710f8a9eb9aa9094124f5f406b Author: Andreas Chatzistergiou Date: Wed Sep 13 18:18:43 2023 +0200 Reader+writer feature removal requires history truncation to allow protocol downgrade. Logs can only be truncated if they are outside the retention window and if they are succeeded by a checkpoint. This PR adds a new checkpoint after the feature cleanup to help slow-moving tables satisfy the truncate history requirement.
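The truncation condition stated above can be modelled roughly as follows. This is a simplified sketch, not Delta's metadata cleanup code: `LogEntry` and `truncatable_versions` are hypothetical names, and "succeeded by a checkpoint" is assumed to mean that a checkpoint exists at a strictly later version.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    version: int
    timestamp_ms: int  # commit time in epoch milliseconds


def truncatable_versions(entries, checkpoint_versions, now_ms, retention_ms):
    """Return versions that are both expired and followed by a checkpoint."""
    latest_checkpoint = max(checkpoint_versions, default=None)
    if latest_checkpoint is None:
        # Without any checkpoint, nothing can be truncated.
        return []
    cutoff_ms = now_ms - retention_ms
    return [
        entry.version
        for entry in entries
        if entry.timestamp_ms < cutoff_ms and entry.version < latest_checkpoint
    ]
```

In this model a slow-moving table with no recent checkpoint has few or no truncatable versions even after the retention window has passed, which is why writing a checkpoint right after the feature cleanup helps.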
GitOrigin-RevId: be1e6c75869f0760ad46547f71041c4fc55856a5 commit ffdd04a7134f0e131b993fc44127b7d38f62d85d Author: RunyaoChen Date: Tue Sep 12 15:24:05 2023 -0700 Rewrite DescribeDeltaDetailCommand to v1/v2 hybrid This PR updates the Delta Lake DESCRIBE DETAIL command to work with V1 and V2 tables. It rewrites the Delta DESCRIBE DETAIL command to use Spark's table resolution logic instead of resolving the target table manually at command execution time. For that, it changes `DescribeDeltaDetailCommand` from a `LeafRunnableCommand` to a `UnaryLike` command that takes an `UnresolvedTable` as a child plan node, which will be resolved by Spark. In addition, it also uses an `UnresolvedPath` LogicalPlan for the case that DESCRIBE DETAIL is run against a raw path. This logical plan is an indication that the target table is specified as a path so no table resolution is necessary. Closes delta-io/delta#2012 GitOrigin-RevId: 33e35c225f0da0f970578e3ca065eaf7cb57c9bc commit bc50775977408aed7d7e0c1f99b00ce862f403f1 Author: Christopher Watford Date: Tue Sep 12 12:03:51 2023 -0700 [DOCS] Fix broken markdown in PROTOCOL.md #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (Documentation) ## Description Minor markdown fix. GitHub preview. ![image](https://github.com/delta-io/delta/assets/132389385/a39a9e9d-2918-4a99-b5ce-4660e4a64565) ## Does this PR introduce _any_ user-facing changes? No. Closes delta-io/delta#2004 Signed-off-by: Allison Portis GitOrigin-RevId: 219a0a5a8e7e9021f566830ae3748e440e241e48 commit ceeb4daa33b84573dd5ef64487ea9c4da45ce722 Author: Andreas Chatzistergiou Date: Tue Sep 12 19:44:59 2023 +0200 The DROP FEATURE command allows to drop table features from Delta Tables. Dropping a reader+writer feature is performed in two steps: 1) We clean all traces of the feature in the latest version and inform the user they need to wait until the retention period is over. 
2) After the retention period is over, the user executes the command again and the protocol is downgraded. This PR adds the `TRUNCATE HISTORY` option to the `DROP FEATURE` command. The new option automatically sets the history retention period to the minimum and cleans up metadata. This happens the second time the user invokes the command. Closes delta-io/delta#2045 GitOrigin-RevId: ae70a00d4dec9d5867f09bd5644d4546461f1c12 commit 7442ebfb8df1ae7ed8630d092abd617c110be5d6 Author: Dhruv Arya Date: Mon Sep 11 20:08:07 2023 -0700 Add support for writing V2 Checkpoints Adds capability to write V2 Checkpoints. This PR only adds write support. More unit tests will follow in an upcoming PR that adds read support. Closes delta-io/delta#2031 GitOrigin-RevId: c3a889b8f67829ec48f6ff650d4d77b75531288f commit 201e89b7641a0358d2e17859a5134ca6c0740af4 Author: Scott Sandre Date: Thu Sep 14 13:19:35 2023 -0700 Remove delta-io/connectors references from delta-io/delta repo [PR B] (#2051) commit 03217d1fd2396c8748903dac4cc3461bbf869d8a Author: Venki Korukanti Date: Tue Sep 12 15:40:18 2023 -0700 [Kernel] Refactor expressions ## Description * Refactor the Kernel expression interfaces. Currently, the `Expression` base interface contains a few unnecessary methods such as `dataType` and `eval`. Keep the `Expression` to a minimum, so that it is just used to represent a `SQL` string version of the expression in Kernel `Expression` classes. ``` interface Expression { /** * @return a {@link List} of the immediate children of this node */ List children(); } ``` * Introduce a subtype of `Expression` called `ScalarExpression` which is a base class for all scalar expressions. * Introduce a subtype of `ScalarExpression` called `Predicate` as the base expression class for scalar expressions that evaluate to the `boolean` type. The `Predicate` is defined such that it takes a generic expression name and any number of input expressions.
It is up to the evaluator to make sure the given `Predicate` is evaluable. Currently `Predicate` only allows a subset of expressions (`=`, `<`, `<=`, `>`, `>=`, `AND`, `OR`, `ALWAYS_TRUE`, `ALWAYS_FALSE`) as of now. In the future, this can be extended to support more predicate expressions with minimal code changes. * Update scan-related APIs to `Predicate` instead of `Expression`. * Remove the use of `Literal.FALSE` and `Literal.TRUE` and instead use `AlwaysTrue.ALWAYS_TRUE` and `AlwaysFalse.ALWAYS_FALSE`. `Literal` is not a predicate. * Extract the expression evaluation from `kernel-api` into `kernel-defaults`. * `DefaultExpressionEvaluator` validates the expression and adds necessary implicit casts to allow evaluation. TODO (will be addressed after this PR is landed): * It is not clear now whether we need the `CAST` expression as a first class `Expression` in the `kernel-api` module. If needed in the future, we can add one (https://github.com/delta-io/delta/issues/2043). * Implicit cast in `kernel-default` may need to support more type conversions, especially around the Decimal type (https://github.com/delta-io/delta/issues/2044) * Add support for nested column reference in `Column` expression (https://github.com/delta-io/delta/issues/2040). * Implicit cast of `DefaultExpressionEvaluator` output type to expected type (https://github.com/delta-io/delta/issues/2047) ## How was this patch tested? Moved the existing Java-based test to Scala and also added new tests (some of them are copied over from the standalone `ExpressionSuite` and updated). commit 70ad00fc2125b1c1dac00cab8add945badcf214a Author: Venki Korukanti Date: Tue Sep 12 12:40:47 2023 -0700 [Kernel] Add `@Evolving` and `@since` tags to all the Kernel API interfaces ## Description Kernel APIs are in the development phase. Add tags to indicate the APIs are evolving to set the expectations for API users. Also, add a `@since` tag to indicate which version the API interface/method was introduced in. 
This is not done for the Kernel APIs in this PR. In the future, Kernel API docs can adopt the same. The `Delta-Spark` module does extra labeling of evolving APIs by modifying the generated API - example [javadoc](https://docs.delta.io/latest/api/java/io/delta/tables/DeltaTable.html#detail--), [scaladoc](https://docs.delta.io/latest/api/java/io/delta/tables/DeltaTable.html#detail--), [code](https://github.com/delta-io/delta/blob/master/spark/src/main/scala/io/delta/tables/DeltaTable.scala#L138) - Evolving API label in Scala and Java docs. This is done by patching the generated HTML docs (code [here](https://github.com/delta-io/delta/blob/master/docs/api-javadocs.js#L44) and [here](https://github.com/delta-io/delta/blob/master/docs/generate_api_docs.py#L55)) commit d906e342396835810abce03626ad86c05082a2a8 Author: Scott Sandre Date: Tue Sep 12 11:20:15 2023 -0700 [Standalone] Use GoldenTableUtils in ShadedJarSuite (to test delta standalone jar) (#2046) Update ShadedJarSuite to use GoldenTableUtils to load golden table resource commit 2d8520cb019116562e27f8783286f5c799448840 Author: Lars Kroll Date: Fri Sep 8 14:45:54 2023 +0200 Trivial refactor of ConflictChecker GitOrigin-RevId: 8f411a8bcf8b5ca10be539273ff8194582693a31 commit 9a5eeb73dc69c0054823c81272235a842485957c Author: lzlfred Date: Thu Sep 7 16:56:40 2023 -0700 Fix to Delta Uniform to support convert Delta null partition values to iceberg The existing Delta to iceberg conversion has a bug that it does not handle null partition values as it will write the string with content "null" in the partition path, and "null" cannot be converted to other numeric types. The fix uses a special marker from iceberg library so it recognizes the null value and converts correctly. 
GitOrigin-RevId: 667e795ead753803565340abcc23ae01d9738a2c commit b30a9033d699c2f6788ebaab075612fe1c66d7f1 Author: Ryan Johnson Date: Thu Sep 7 15:33:26 2023 -0700 [Spark] Clean up OptimisticTransaction constructor code Get rid of an unnecessary alternate constructor, eliminate the implicit `clock` arg by referencing `DeltaLog.clock` instead, and update unit tests accordingly. Closes https://github.com/delta-io/delta/pull/2029 GitOrigin-RevId: b3ff58a13df71807fea5b90a3d57b71d09dfd2de commit 8e4ac0a2c333a1a29e0d4201da1d849df62d9817 Author: Christos Stavrakakis Date: Wed Sep 6 18:50:51 2023 +0200 Minor fix in concurrency test Drop `operationMetrics` when checking actions, since metrics can contain values that are specific to the commit, e.g. execution time. GitOrigin-RevId: bc58bef023d00c299e00dbf0f742c497673bac7d commit 3461aeef295c77ac80626aab3186241c9ca0ad5f Author: Johan Lasperas Date: Wed Sep 6 10:49:44 2023 +0200 Move schema evolution tests to dedicated traits ## Description Schema evolution tests are moved out of `MergeIntoSuiteBase` and collected in separated test traits to allow more flexibility when running tests with different combinations of parameters in the future: - MergeIntoSchemaEvolutionCoreTests: a very small subset of tests intended to be run with a large number of varying combinations to preserve core test coverage with e.g. CDF, DVs, column mapping. - MergeIntoSchemaEvolutionBaseTests: All basic schema evolution tests - and by extension tests that don't fit in other traits. - MergeIntoSchemaEvolutionNotMatchedBySourceTests: Tests covering schema evolution with NOT MATCHED BY SOURCE clauses. - MergeIntoNestedStructEvolutionTests: Tests covering nested struct evolution. This is a refactor of existing tests. 
Closes delta-io/delta#2009 GitOrigin-RevId: cef8641a54c21ac5b83d0e06278a3791e5a07ad3 commit 245857690eb5d70a2e0235062a2c33f4ef40ff7d Author: Lars Kroll Date: Wed Sep 6 09:38:19 2023 +0200 [Spark] Exclude all temp view tests from Scala API DML suites Instead of trying to work around the issue that `DeltaTable.forName` cannot resolve temporary views, exclude these tests from the {Delete,Update,Merge}ScalaSuites, since there is no public API to use this anyway. GitOrigin-RevId: 247f26f1e95b6aef8a2fabde904a78e5bce3de46 commit 7d76a325a8ca962b970d63f56fbfa1b19e23cda5 Author: Boyang Jerry Peng Date: Tue Sep 5 17:26:29 2023 -0700 minor refactoring GitOrigin-RevId: dbb93e024ad614586b917959927a8ee757c8d226 commit ca6271ac657463b1a606e5e3451be279c1e138b2 Author: Lin Zhou Date: Tue Sep 5 14:41:39 2023 -0700 Minor refactor to DeltaSource.scala Use SnapshotDescriptor instead of Snapshot as the class type for parameter snapshotAtSourceInit, to make it more light weight, and easier to be used by DeltaSharingSource. GitOrigin-RevId: 379d9a797545b5c473c165870c370c7b3e134446 commit 2efef5b697100bb09d431d49f2166ea2c20a94f6 Author: Tathagata Das Date: Mon Sep 11 14:10:29 2023 -0400 Remove unnecessary files ## Description Remove unnecessary files from the connectors directory commit d9ba620c9c4c6df422a705b3b81f480c602c6e3c Author: Christopher Watford Date: Wed Aug 30 20:16:37 2023 -0700 Don't commit empty transactions on Write Closes delta-io/delta#1934 commit 27f2ce6b6303b0188bd5dbaa06f5927fd5c9b957 Author: Venki Korukanti Date: Thu Sep 7 16:54:18 2023 -0700 Revert "Don't commit empty transactions on Write" This reverts commit 3e3b5ef181e8ad97d7b184055add3bec89b8180e. The author attribution is wrong due to a mixup during merging. 
commit 6dacf13e964562098bb4d986979c126ec1f01947 Author: Lars Kroll Date: Mon Sep 4 12:47:46 2023 +0200 [Spark] Use unique id to match pre- and post-clone files Use a unique id tag instead of paths to match pre- and post-clone files during verification in unit tests. GitOrigin-RevId: 99de25a12e791da3f3ffad79f2eb3246e847faa9 commit 9595d4d1e53dbaeb52a2a3943512f90e990ff679 Author: Jungtaek Lim Date: Fri Sep 1 17:10:54 2023 +0900 [MINOR] Fix broken logging target offset for Trigger.AvailableNow GitOrigin-RevId: 0ee6c162dc2ec3bdfc8f9cfb4f0b2de33038098b commit 8566a9ddefe424de656faa6435910fb552a0a076 Author: Ryan Johnson Date: Thu Aug 31 13:22:39 2023 -0700 Obtain snapshots more efficiently in DeltaAlterTableTests DeltaAlterTableTests rely on `DeltaLog.unsafeVolatileSnapshot` because they only call `DeltaLog.forTable`. Easily fixed by calling `DeltaLog.forTableWithSnapshot` instead. GitOrigin-RevId: c44f51ebe432ca69def9e53b1989f5a1b0da1373 commit 3e3b5ef181e8ad97d7b184055add3bec89b8180e Author: Prakhar Jain Date: Wed Aug 30 20:16:37 2023 -0700 Don't commit empty transactions on Write Closes https://github.com/delta-io/delta/pull/1934 Closes delta-io/delta#1934 GitOrigin-RevId: 047fe31ca51e31ac60d5c79dc58a1c960c13157b commit 475b164b6df198bd6069e6b99e3ad1af7ac56543 Author: Neil Ramaswamy Date: Wed Aug 30 13:20:04 2023 -0700 Minor refactor to DeltaSink.scala GitOrigin-RevId: c61611a67f691a6e5c9de9a20d3c47b7bd0c3794 commit 1450475e9e6483e90d7a24157e565ba17e50d114 Author: Johan Lasperas Date: Wed Aug 30 10:19:37 2023 +0200 Resolve UpCast expressions introduced in Delta DMLs https://github.com/delta-io/delta/pull/1938 changed the casting behavior in MERGE and UPDATE to follow the value of the `storeAssignmentPolicy` config instead of the `ansiEnabled` one, making the behavior consistent with INSERT. This change breaks MERGE and UPDATE operations that contain a cast when `storeAssignmentPolicy` is set to `STRICT`, throwing an internal error during analysis. 
The cause is the `UpCast` expression(s) added by `PreprocessTableMerge` and `PreprocessTableUpdate` when processing assignments. `UpCast` is meant to be replaced by a regular cast, after checks are performed, by the `ResolveUpCast` rule, which runs during the resolution phase **before** `PreprocessTableMerge` and `PreprocessTableUpdate` introduce the expression. The fix is to run the `ResolveUpCast` rule once more after `PreprocessTableMerge` and `PreprocessTableUpdate` have run. Missing tests covering cast behavior for the different values of `storeAssignmentPolicy` for UPDATE and MERGE are added, covering: - Invalid implicit cast (string -> boolean), valid implicit cast (string -> int), upcast (int -> long) - UPDATE, MERGE - storeAssignmentPolicy = LEGACY, ANSI, STRICT Closes delta-io/delta#1998 GitOrigin-RevId: 5b18006b9cb8efa522cda0a4cfa6e981a0b40d28 commit 7693cfce8e227620f4078136e7fef7018a16b515 Author: Jackie Zhang Date: Tue Aug 29 12:25:53 2023 -0700 Minor refactor to Delta source. GitOrigin-RevId: d6343242ee3f8514ab1df4c91fa0bc22dae3cbc5 commit 3392434a537ab88c5f8b0f147958c19793a36626 Author: Jackie Zhang Date: Tue Aug 29 09:11:39 2023 -0700 Minor refactor to DeltaSourceMetadataTracking log. GitOrigin-RevId: 7adfcb744d31c4593b33d2e72de91b0e7be7352e commit 7c3eaf99719f050a7fef54c9a439f0535e724a94 Author: Christos Stavrakakis Date: Tue Aug 29 14:29:11 2023 +0200 Extend metric tests to compare metrics to committed actions Delta emits file-level and byte-level metrics that can be derived from the actions that are committed to each Delta version. This commit extends metrics tests for MERGE, UPDATE and DELETE commands to compare the operation metrics with the committed actions.
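To illustrate how file-level and byte-level metrics can be derived from committed actions, here is a rough sketch. It assumes actions are given as dictionaries shaped like Delta log JSON lines (`{"add": {...}}`, `{"remove": {...}}`); the function name and the metric keys are illustrative assumptions and may not match what each command actually reports.

```python
def metrics_from_actions(actions):
    """Derive file and byte counts from a list of committed actions."""
    adds = [a["add"] for a in actions if "add" in a]
    removes = [a["remove"] for a in actions if "remove" in a]
    return {
        "numAddedFiles": len(adds),
        "numRemovedFiles": len(removes),
        "numAddedBytes": sum(add["size"] for add in adds),
        # `size` is optional on remove actions, so default to 0.
        "numRemovedBytes": sum(rem.get("size", 0) for rem in removes),
    }
```

A test in the spirit of this commit would compare such derived values against the operation metrics recorded for the same version.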
GitOrigin-RevId: b73d3d6d7b286bc7589c5bd80d802f5dd111fe34 commit d62358ff3998a078056cbec2dd96e18f0e50aaa3 Author: Ming DAI Date: Mon Aug 28 14:08:51 2023 -0700 Fix the path used in the snapshot GitOrigin-RevId: 61a989079fbbe3845942b470d40a968b7e3ab154 commit 67c4b983088ae7e2d1d5cac6403792eb9ba35385 Author: Lars Kroll Date: Mon Aug 28 12:11:08 2023 +0200 [Spark] Fix flaky test suite DeltaSourceDeletionVectorsSuite Remove flakiness from a previously flaky test case, by ensuring that the stream is stopped processing before running new DML commands on the source table. Fixes https://github.com/delta-io/delta/issues/1982 Closes https://github.com/delta-io/delta/pull/1989 GitOrigin-RevId: a47f5fdd533e4f4c7ff2a044085cfd99367a7287 commit 22ade20b20e34b6ae18715d81288b26e0f53194e Author: Prakhar Jain Date: Thu Aug 24 14:10:47 2023 -0700 Introduce V2 Checkpoint actions and Table Feature This PR introduces actions and placeholder TableFeature for V2 Checkpoints. Closes [delta-io/delta#1983](https://github.com/delta-io/delta/pull/1983) GitOrigin-RevId: fd41917ea6b21701defe422ef643fd0ff59e125f commit c69892a6e6bd1931ce19386a3304a0044bef5a99 Author: Eric Maynard Date: Thu Aug 24 12:46:03 2023 -0700 SchemaTrackingLog can now handle non-StructType values SchemaTrackingLog is able to track any DataType now instead of specifically StructType GitOrigin-RevId: ae6a9f682430e85b82dbee5f385c3f98e37ddf4d commit 512562dbd43b03bd3debcebd059ed2a36577d140 Author: Venki Korukanti Date: Thu Aug 31 09:54:44 2023 -0700 [Kernel] Clean up `ParquetBatchReader` move to next row logic ## Description Minor cleanups. * Currently, the `Converter` implementations have a method called `boolean moveToNextRow()`, which returns whether the previous row is `null` after moving the converter state to consume the next row. The return value is unnecessary now after delta-io/delta#1974. This PR removes the return type. * There is an extra method on `RowConverter` called `boolean moveToNextRow(int rowIndex)`. 
This method is called to pass the row index of the record that was read. Also, there is an `if..and..else` to call this method or other method depending upon the `Converter`. Remove this extra method and just add one method `boolean finalizeCurrentRow(long currentRowIndex)`. This method sets the row index of the current row depending on the converter type and also finalizes the state of the column read from the current row. ## How was this patch tested? Existing tests. commit 8fe6ea6115f22b8341fce1917ee55974eb5f728d Author: Allison Portis Date: Wed Aug 30 14:56:04 2023 -0700 [Kernel] Support reading multipart checkpoints (#1984) ## Description Support reading multipart checkpoints. ## How was this patch tested? Adds a few unit tests and an end-to-end test. commit 9726f7f2e3922ecdbe4c3c9a0a1c11174193d14d Author: Dhruv Arya Date: Wed Aug 23 12:20:52 2023 -0700 Minor refactor to `Checkpoints.scala` for future V2Checkpoint changes. Makes `checkpointAndCleanUpDeltaLog` publicly accessible so that it can be invoked directly in the future. GitOrigin-RevId: dedded5e4b5dff526489ccbb0e256aaf5cbf5319 commit ab52b7a8124f7eabdbe7d40dd1088067959c155e Author: Johan Lasperas Date: Wed Aug 23 11:46:47 2023 +0200 Refactor merge & DML test helpers to allow reusability ## Description This change is a plain refactor of our test helpers, gathering merge and DML test helpers outside of the existing test suites into traits that can be reused in separate suites. N/A Closes delta-io/delta#1988 GitOrigin-RevId: 92dad77bacba7764a2d6f82142a2b9ec0e575647 commit c55d63d5034e1ca251d3da6fa27ff56e46834167 Author: Fredrik Klauss Date: Tue Aug 22 18:31:07 2023 +0200 Fix double spacing indentation in Delta classes ## Description * Fix double space indentation in some Delta classes. This is inconsistent and makes the code less readable. 
N/A Closes delta-io/delta#1995 GitOrigin-RevId: 43dee69401da5ac46f9333947d73da7e6b750abc commit e756131776f0271121712b18c2723960c2890619 Author: Lars Kroll Date: Tue Aug 22 12:53:36 2023 +0200 [Spark] Skip flaky suite temporarily Temporarily disable `DeltaSourceDeletionVectorsSuite` `subsequent DML commands are processed correctly in a batch - INSERT->UPDATE` until I can fix the flakiness. GitOrigin-RevId: 89dd29cb86926b8f34602c976548e77c9f24039a commit ae652f2eb4684dd99717cc61596c049f26dfeece Author: Lars Kroll Date: Tue Aug 22 12:48:38 2023 +0200 [Other] Clarify DV file format byte order - Clarify in the spec that DV files are big endian, while serialized DVs are little endian. Fixes https://github.com/delta-io/delta/issues/1979. Closes https://github.com/delta-io/delta/pull/1987 GitOrigin-RevId: 156e456f81d7fd5d67365923373ff0ae05fd97ee commit 8398040e666e5b23cc8a4fcd97f46002668f1a28 Author: Martin Grund Date: Tue Aug 22 12:46:46 2023 +0200 Minor refactor to `test_deltatable.py`. Check lower case table name in test_replace_table_behavior. GitOrigin-RevId: af181e8e6c74010edcadde098624f8719cbc1e27 commit 252d6b538d66342358de9de66f71506b12baf477 Author: Christos Stavrakakis Date: Tue Aug 22 10:09:55 2023 +0200 Add sanity check for incremental DV updates Extend `addFile.removeRows` method that is used by DML commands to update the DV of files to check that the cardinality of the new DV is at least the cardinality of the previous DV, i.e. that no rows are restored GitOrigin-RevId: f537979826414865223b8e0ffe9e503c9673d331 commit 3ff4075d6ccfc834223aeeb231831b7b8d5ec1b8 Author: Gengliang Wang Date: Mon Aug 21 09:54:49 2023 -0700 [Spark] Support 3-part naming in table identifier parsing -Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) Support 3-part naming in table identifier parsing. 
Before the changes, the following command ``` OPTIMIZE catalog_foo.db.tbl ``` will throw error ``` org.apache.spark.sql.delta.DeltaParseException: Illegal table name catalog_foo.db.tbl(line 1, pos 9) == SQL == optimize catalog_foo.db.tbl ---------^^^ at io.delta.sql.parser.DeltaSqlAstBuilder.$anonfun$visitTableIdentifier$1(DeltaSqlParser.scala:430) at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:160) at io.delta.sql.parser.DeltaSqlAstBuilder.visitTableIdentifier(DeltaSqlParser.scala:427) at io.delta.sql.parser.DeltaSqlAstBuilder.$anonfun$visitOptimizeTable$5(DeltaSqlParser.scala:348) at scala.Option.map(Option.scala:230) at io.delta.sql.parser.DeltaSqlAstBuilder.$anonfun$visitOptimizeTable$1(DeltaSqlParser.scala:348) ``` After the changes, the command works. A new unit test No Closes delta-io/delta#1985 Signed-off-by: Venki Korukanti GitOrigin-RevId: dd297a9d8e77a6fdfafb834c74a915de1aeae737 commit 79bf28d0d616d7c6373e692d43288d2fd03fb7fe Author: Costas Zarifis Date: Fri Aug 18 18:54:22 2023 -0700 Minor refactor to: WriteIntoDelta.scala and DeltaSink.scala GitOrigin-RevId: fad6bb91fca7bdf79d8afe1f3167cdbb685c3c58 commit 508eaf21f6b90ae55ef73c247bac262a5a0cdd42 Author: Paddy Xu Date: Fri Aug 18 14:21:09 2023 +0200 Count DVs that are updated as both "added" and "removed" Increase DV metrics "added" and "removed" for all DVs that have been "updated" because, from Delta Log's perspective, an "update" action is equivalent to a "remove" followed by an "add". Updating the counting rule will ensure that the metric reflects more accurately what happened at a lower level. Closes https://github.com/delta-io/delta/pull/1981. 
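The counting rule for deletion vectors can be sketched as below. The function and key names are hypothetical; the only point is that an updated DV contributes to both counters.

```python
def dv_metrics(num_new, num_dropped, num_updated):
    """Count each updated DV as one removal followed by one addition."""
    return {
        "added": num_new + num_updated,
        "removed": num_dropped + num_updated,
    }
```

For example, 2 new, 1 dropped and 3 updated DVs would be reported as 5 added and 4 removed under this rule.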
GitOrigin-RevId: 78acfad7df288baf906c59a7af3b432a2b44eaaa commit 51e320d1126af7ab6824241fe0763943e7ed1f89 Author: Desmond Cheong Date: Thu Aug 17 16:42:53 2023 -0700 Minor refactor GitOrigin-RevId: f3572f95a04ac11e2d351cb97ce754379139fa89 commit 6368ad0d2d1b8835266195d204687eeb40fa7a69 Author: Jackie Zhang Date: Wed Aug 16 16:07:47 2023 -0700 Fix and improve Delta source metadata log GitOrigin-RevId: 957b32d5baa318cc8597197c46aadf3c8df88a48 commit 64653676a8514fbd0b80d2c4a6acc36700bb2162 Author: Christopher Watford Date: Wed Aug 16 15:43:05 2023 -0700 [StorageS3DynamoDB] Minor exception message changes - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [X] Other (storage-s3-dynamodb) I was trying to track down the source of a weird FileSystemException and eventually discovered what it meant after looking at the source code. This is a minor change to the language used to help push you into a direction to discover the actual issue. My original change was to introduce a new, more specific exception type, but that would likely be a more painful breaking change. Executed unit tests locally. Yes, exception message text changed which if being explicitly tested against may fail. Closes delta-io/delta#1972 Signed-off-by: Venki Korukanti GitOrigin-RevId: 8f18d5f8580ff9234cb2b4374080af1fcfd1bc71 commit 23826a3b25f39cc1097d3b6174f009da496028b7 Author: Hang Jia <754332450@qq.com> Date: Thu Aug 24 01:34:22 2023 +0800 [Flink] fix STRING Type converted to varchar(1) (#1930) commit 4c5fa1f5b5069d1b247cf1a337ffdc87f7edfce7 Author: Allison Portis Date: Thu Aug 17 19:17:32 2023 -0700 [Kernel] Enforce header in java files with checksytle (#1977) commit 3835df1babba7597e804a947117b4127dcb972bd Author: Tathagata Das Date: Wed Aug 16 15:21:58 2023 -0400 #### Which Delta project/connector is this regarding? -Spark -Standalone -Flink -Kernel - [ ] Other (fill in here) ## Description This is leftover task from the connector code migration to this repo. 
The unidoc/java doc generation was disabled for standalone, and flink. This is to enable them back. Specifically I did the following 1. Refactored all the unidoc source file filtering code in a clean way so that each subproject can configure but reuse all the code. 2. With this refactoring we can also easily generate combined docs for multiple projects. For example, the 2 Kernel projects can be documented together using the kernelGroup aggregation project. This is what I plan to upload to docs.delta.io. 3. ~The generated doc for Maven upload is also fixed to make sure we dont public doc jars exposing inner classes. As a result, each project with public APIs will generate it own scala/java docs.~ I reverted this as this is throwing errors that I cannot fix yet. Added a TODO in the code. Also in a separate PR, I will update the higher level doc generation scripts in delta/docs to consolidate the separate docs into one location. You can generate all the docs by `build/sbt unidoc` Existing tests force generation of docs ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1950 GitOrigin-RevId: f5d6c7abca44296004b6e629ca2ef48462a6a509 commit f5814caf7a5c39eba8094e4927cc9e2252b0d45e Author: Ming DAI Date: Tue Aug 15 14:58:23 2023 -0700 Disable stats computation on INT64 timestamp in stats collection. GitOrigin-RevId: ca668d17e7da482e442974bf1db11cc06d3d50fc commit 4815007c066a46efe9aca22e4138815cc3cc0f29 Author: Allison Portis Date: Tue Aug 15 13:25:13 2023 -0700 [Build] Update scalastyle config to not enforce a specific year for copyright Updates the scalastyle config to accept years other than 2021 in the copyright header. Note: This doesn't specify valid vs non valid years but accepts any 4 digit number. We can be more specific if desired. Updates one file in Kernel to be "2023" which fails without this change. 
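As a rough illustration of the relaxed copyright check (a sketch, not the actual scalastyle configuration): the header wording and the class name `CopyrightHeaderCheck` below are assumptions. The only point taken from the description is that any four-digit year is accepted in place of a hard-coded 2021.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the real check lives in the scalastyle config.
final class CopyrightHeaderCheck {
    // Assumed header wording. \d{4} accepts any four-digit year,
    // without distinguishing valid from invalid years.
    private static final Pattern ANY_YEAR =
        Pattern.compile("Copyright \\(\\d{4}\\) The Delta Lake Project Authors\\.");

    // Returns true if the given text contains a header with a four-digit year.
    static boolean hasValidHeader(String text) {
        return ANY_YEAR.matcher(text).find();
    }
}
```

A tighter pattern (for example, restricting the year to a range) would be the natural next step if "any 4 digit number" turns out to be too permissive.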
Closes delta-io/delta#1970 Signed-off-by: Allison Portis GitOrigin-RevId: 36ff94b7c7dcdd0aec58c6cefadfa37417de97ef commit d19e989e7b83ac848ebe11cefb122207a8556ae5 Author: Paddy Xu Date: Tue Aug 15 14:30:45 2023 +0200 Improve CDF reading speed by scanning in batch This PR improves CDF reading speed and reduces the number of tasks by creating a single scan for all CDF files from all versions(*). Compared to the previous approach (which creates a scan for each version), the new method significantly reduces the number of cloud requests and puts less stress on the Spark scheduler. (*) Due to a limitation in how we broadcast DVs, the combining will happen for files that have only one associated DV across the entire requested version range. GitOrigin-RevId: faf9a5ce062914ee182ff86d5f5696aeb62d1421 commit 7e871926c20bac7c29b326a310380c143714e3e7 Author: Andreas Chatzistergiou Date: Mon Aug 14 08:14:50 2023 +0200 DROP FEATURE command allows dropping a feature from a delta table. This PR adds support for dropping legacy features. Legacy features can be removed, as long as the protocol supports Table Features. This will not downgrade protocol versions but only remove the feature from the supported features list. For example, removing legacyRWFeature from (3, 7, [legacyRWFeature], [legacyRWFeature]) will result in (3, 7, [], []) and not (1, 1). GitOrigin-RevId: f77d1662fd1a7653b5bf8f1bc69d95f2b2679472 commit 74bcce8634119796f9faa709b96d3de97eb9fa71 Author: Venki Korukanti Date: Wed Aug 16 20:55:44 2023 -0700 [Kernel] Fix the Parquet reader identifying empty groups as null ## Description Currently the default parquet reader `ParquetBatchReader` can't distinguish between empty `array` or `null` `array` values (similarly between empty `map` or `null` `map` values). It always returns `null` when it is possible that the value could be an empty `array` or `map`. 
Example: Expected output: ``` | tag| ids| +------+---------+ |normal|[1, 2, 3]| | empty| []| |normal|[4, 5, 6]| | null| null| +------+---------+ ``` Actual output: ``` | tag| ids| +------+---------+ |normal|[1, 2, 3]| | empty| null| <-- notice the difference here |normal|[4, 5, 6]| | null| null| +------+---------+ ``` The `ParquetBatchReader` uses `parquet-mr`'s [`ReadSupport`](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/api/ReadSupport.java) framework to get the values for each column. This Parquet [framework](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/RecordReaderImplementation.java#L400) parses each leaf level value, its definition level and repetition level, and calls methods on `Converter`s returned by the Kernel's `ParquetBatchReader` to pass the null-ness, end-of-array-element, end-of-map-element signals and the column values. If the column is a nested column, based on the definition level, it calls the intermediate converter's (which is basically a `GroupConverter`) `start` method which signals that the value at that level is not null. For example for column: a.b.c, it is possible the a=null or a.b=null or a.b.c=null. The definition level tells (0-3) at what level the value is null. If the definition level 1, that means it calls `GroupConverter(a).start`, but not `GroupConverter(a.b).start` or `GroupConverter(a.b.c).start`. However the current implementation of `ParquetBatchReader`'s `GroupConverter`s for `map`, `struct` and `array` types, whenever `GroupConverter.start()` is called, they call the `GroupConverter.start()` on its children's converters. This results in assuming that a `null` value is a `non-null` value. This PR also refactors the common code in `MapConverter` and `ArrayConverter` into a new `RepeatedValueConverter` abstract class. ## How was this patch tested? 
Enable commented-out existing tests and also add new tests for testing `null` top-level structs. commit 44f59b59f76986cc64a1389e80b81975b38c6bc8 Author: Allison Portis Date: Wed Aug 16 13:05:07 2023 -0700 [Kernel] Support TimestampType (#1920) ## Description Adds support for reading TimestampType data columns. Does not support reading TimestampType partition columns; this is explicitly blocked. See the decision doc [here](https://docs.google.com/document/d/15chiWn7QwUzAxy6t2inIoTCKTBQMnv4UWRWtrZvfULQ/edit?usp=sharing) for details on the in-memory format choice. ## How was this patch tested? Adds unit tests. commit 5e90a97e0c4c0eed9dfca895c39658ff046ea091 Author: Venki Korukanti Date: Sun Aug 13 22:39:14 2023 -0700 [Kernel] API user guide ## Description This PR adds `USER_GUIDE.md` which explains how to use the Delta Kernel APIs to build Delta Lake connectors. commit d36623f0819a3b9a7e9ff924bb1faa4e9caf6050 Author: Lars Kroll Date: Fri Aug 11 11:18:15 2023 +0200 Fix streaming when the same file occurs with different DVs in the same batch ## Description There was an edge case in streaming with deletion vectors in the source: in `ignoreChanges` mode, if the same file occurred with different DVs in the same batch (or both with a DV and without a DV), we would read the file with the wrong DV, since we broadcast the DVs to the scans by data file path. This PR fixes the issue by reading files from different versions in different scans and then taking the union of the results to build the final `DataFrame` for the batch. Added new tests for having 2 DML commands (DELETE->DELETE and DELETE->INSERT) in the same batch for all change modes. ## Does this PR introduce _any_ user-facing changes? No.
Closes delta-io/delta#1899 Signed-off-by: larsk-db GitOrigin-RevId: 43a2d479832ba9b4be7b888c0a633e725729744a commit 1ebea3dacb35565b8bb8d4d93c49bf42f72cde35 Author: Andreas Chatzistergiou Date: Fri Aug 11 10:23:40 2023 +0200 Reader+writer feature removal validation does not take into account commits before the earliest checkpoint This PR addresses an issue that occurs when validating the history of reader+writer features during feature removal. The issue is that the validator would not take into account commits before the earliest checkpoint of the table. That could result in incorrectly allowing feature removal in some cases. GitOrigin-RevId: 746050ecd16625dab2c01ff178b592fd556783b4 commit 9e1b627bbd136a4361ce02b6a648984e40b1693e Author: srielau Date: Thu Aug 10 23:19:09 2023 +0000 Changes to error message text The error message text from UNRESOLVED_COLUMN has changed. Delta wrongly tests the text with `contains` instead of `checkError`. GitOrigin-RevId: 2a2c06708683590bb51cce063016c8c932b9a127 commit bac0bab342cce309c505cf5273bb0afc1709e9f6 Author: Jackie Zhang Date: Wed Aug 9 12:07:29 2023 -0700 Fix a corner case with schema tracking when reading initial snapshot GitOrigin-RevId: 106e2e00dc79007b3a923586e2cf13078924450e commit 45a6902ed1652ee9b840c816288c2d596c2d1d35 Author: Allison Portis Date: Wed Aug 9 11:31:02 2023 -0700 [Standalone][Flink] Re-enables mima checks for connectors #### Which Delta project/connector is this regarding? - [ ] Spark - [x] Standalone - [x] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Re-enables mima checks for Delta Standalone. Adds mima checks for Delta Flink (this was previously not done in delta-io/connectors). Made changes locally and verified the check fails.
Closes delta-io/delta#1952 Signed-off-by: Allison Portis GitOrigin-RevId: 36d99421db4a22777250b1cadfecb552a08c58fd commit 79a5f8f7f8b368b0ed8b162fd8c254fea91eec95 Author: Yuya Ebihara Date: Wed Aug 9 09:35:18 2023 -0700 Fix feature name for column invariants in PROTOCOL.md #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel -Other (fill in here) ## Description Fix feature name for column invariants in PROTOCOL.md N/A ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1964 Signed-off-by: Venki Korukanti GitOrigin-RevId: c3033b9315ad0a21094f41aecd86212bb11e7691 commit a35c559c569ea83fc1535c406ee9f5cd47b7f3fb Author: Scott Sandre Date: Thu Aug 10 11:05:41 2023 -0700 [Kernel] Improved LogReplay logic using column pruning + ColumnarBatch-based design (#1939) * Implement columnar-batch-based LogReplay for active add files and for latest protocol & metadata * use resolved path in test suite * add TODO comment to ActionsIterator.java * Keep track of closeables to close in getNextActionsIter commit 84d6dedcb627b660eabe7a4bcaf33bd12e33ded9 Author: Allison Portis Date: Wed Aug 9 15:04:30 2023 -0700 [Kernel] Support reading decimals (#1951) ## Description Adds support for `DecimalType`. - Adds the `getDecimal` API to `ColumnVector` and `Row` - Reading decimals from parquet (in the default parquet reader) - Reading decimal partition columns - Creating decimal type literals, and evaluating them in the default `ExpressionHandler` - Also refactors some test utilities to `TestUtils` and updates `DeletionVectorSuite` accordingly ### Some details 1) We add the `getDecimal` API to `ColumnVector`. Connector implementations of a decimal `ColumnVector`s are welcome to add and use their own decimal API in their implementation. 2) In the default parquet reader, for now we materialize decimals in memory as java `BigDecimals` in `DefaultDecimalVector` since performance is not our main goal. 
We can revisit this later and back int/long decimals with primitive arrays (and store the precision/scale separately). We would still implement the `getDecimal` API so this can be optimized later. ## How was this patch tested? Adds test tables to `GoldenTables`. Adds end-to-end tests and some specific tests for the parquet reader. commit efa566fc24af0b2c65b08c2a7ad81e9c865e0e83 Author: Venki Korukanti Date: Tue Aug 8 22:23:15 2023 -0700 [Kernel] Clean up package organization of `kernel-default` module ## Description Rename the module `kernel-default` to `kernel-defaults`. Rename package `io.delta.kernel` to `io.delta.kernel.defaults` in `kernel-defaults` module. Move non-public classes under the `io.delta.kernel.defaults.internal` package. Note: `default` is not used because it is a keyword in Java and can't be part of a package name. ## How was this patch tested? Updated the `kernel/examples` and ran `python3 kernel/examples/run-kernel-examples.py --use-local` successfully. commit a986f45f98d3caf0f1e78224d0c1f693b217cde0 Author: Venki Korukanti Date: Tue Aug 8 18:28:49 2023 -0700 [BUILD] Rename `kernel-default` module to `kernel-defaults` GitOrigin-RevId: d947a3568669c0ac2e75b6eb018e420221c7370b commit a9206aad3d50016da62158b8e9260f8c838a0082 Author: Feng Zhu Date: Tue Aug 8 11:18:54 2023 -0700 Use repos.spark-packages.org instead of Bintray as the repository service for spark-packages. See https://www.mail-archive.com/dev@spark.apache.org/msg27774.html Closes delta-io/delta#1635 GitOrigin-RevId: 642442c207d096fb79828b99f85a82cf0a8bfac9 commit 2236479e156898a3cc5005df0f49995398462a3e Author: Jonas Irgens Kylling Date: Tue Aug 8 09:32:25 2023 -0700 Add required/optional column to transaction and protocol action description Adds an optional/required column to the tables describing the fields of transaction and protocol actions, similar to what we have for the other actions.
This clarifies which fields of the transactions and protocol actions are required or optional. Closes delta-io/delta#1890 GitOrigin-RevId: 46df4b76b15fa785aa85cd3497d83751e643834d commit e38a86393b84fd9d9f507f3a4feeaecdf2916720 Author: Desmond Cheong Date: Mon Aug 7 21:53:20 2023 -0700 Minor refactor GitOrigin-RevId: 7673f4f2533d849d0af88bae809993da4070c6ae commit 8add682fbfb51302ba4a4d587ed0ed3a9f0c5ed8 Author: Scott Sandre Date: Tue Aug 8 10:27:53 2023 -0700 [Kernel] Replace `Arrays.stream` with `for` loops to improve replay performance ~3x (#1960) Replace `Arrays.stream` with `for` loops to improve replay performance ~3x commit 011bda5b573cd3fe90975034fc578b793b941e54 Author: Allison Portis Date: Mon Aug 7 15:41:47 2023 -0700 [Kernel] Update the Javastyle to enforce brace style and fix instances - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) Updates the java checkstyle for Kernel to enforce - Left { braces are at the end of the line https://checkstyle.org/checks/blocks/leftcurly.html#LeftCurly - Right } braces are on the same line as the next block start https://checkstyle.org/checks/blocks/rightcurly.html#RightCurly - Fixes the import order Checkstyle passes Closes delta-io/delta#1962 commit d2c4d506d8d8649eb8573fc0778045791500deff Author: Jacek Laskowski Date: Mon Aug 7 12:21:27 2023 +0200 [Spark] Avoid superfluous always-true filter conditions in MERGE command ## Description While building the output dataset to write to a delta table as part of MERGE command, Delta Lake builds conditions based on optional parts that lead to extra processing (possibly even slowing down MERGE execution = just guessing = no evidence = just gut feelings). Although Spark SQL optimizer could likely remove such extra always-true filter conditions, so can we (making the codebase smaller and easier to comprehend, hopefully). Hence the PR. Local builds ## Does this PR introduce _any_ user-facing changes? 
No Closes delta-io/delta#1943 Signed-off-by: Johan Lasperas GitOrigin-RevId: e07de437718896b59a0753c267d1e619314f0862 commit a73c485cf33e5967d341fd4eff18fbe6c9a8d8bf Author: Scott Sandre Date: Fri Aug 4 11:33:41 2023 -0700 Make the UniversalFormat validation error msg describe exactly how to "enable IcebergCompatV1", by setting the table property `'delta.enableIcebergCompatV1' = 'true'`. Some users had confusion about this. GitOrigin-RevId: 1bee540767744d103ba81612f5a20a88b8d98f6c commit 9b3d1f1158f07a78602b6c682bcb5dd1e317d08d Author: Fredrik Klauss Date: Fri Aug 4 13:41:44 2023 +0200 Delete unused filesForScan method ## Description * Delete unused filesForScan method to clean up the code. there is an alternative `filesForScan` method that does exactly the same. GitOrigin-RevId: 59d4c82013a17e3662d3d0217f0d4e7739738a2a commit a6c1176a020a09254ea5a78e4c07c80881f0e0e2 Author: sherlockbeard Date: Fri Aug 4 12:33:43 2023 +0200 added metrics for DELETE with Deletion Vectors ## Description added metrics for deletion vectors in operationMetrics of the commit . Resolves #1879 ``` numDeletionVectorsAdded numDeletionVectorsRemoved numDeletionVectorsUpdated ``` using DeletionVectors test named "Metrics when deleting with DV" ## Does this PR introduce _any_ user-facing changes? 
The operationMetrics for the delete operation will have two additional values ``` +-------+-----------------------+------+--------+-----------------+-----------------------------------------------------------------------------+----+--------+---------+-----------+--------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+--------------------------------------------+ |version|timestamp |userId|userName|operation |operationParameters |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics |userMetadata|engineInfo | +-------+-----------------------+------+--------+-----------------+-----------------------------------------------------------------------------+----+--------+---------+-----------+--------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+--------------------------------------------+ |4 |2023-07-01 08:59:31.257|null |null |DELETE |{predicate -> ["(trgKey#4162 > 1)"]} |null|null |null |3 |Serializable |false |{numRemovedFiles -> 1, numRemovedBytes -> 0, numCopiedRows -> 0, numDeletionVectorsAdded -> 1, numDeletionVectorsRemoved -> 2, numAddedChangeFiles -> 0, executionTimeMs -> 6773, numDeletedRows -> 3, scanTimeMs -> 0, numAddedFiles -> 0, numAddedBytes -> 0, rewriteTimeMs -> 0} |null |Apache-Spark/3.4.0 Delta-Lake/3.0.0-SNAPSHOT| |3 |2023-07-01 08:59:20.848|null |null |DELETE |{predicate -> ["(trgKey#2301 > 2)"]} |null|null |null |2 |Serializable |false |{numRemovedFiles -> 0, numRemovedBytes -> 0, numCopiedRows -> 0, 
numDeletionVectorsAdded -> 2, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 10001, numDeletedRows -> 2, scanTimeMs -> 0, numAddedFiles -> 0, numAddedBytes -> 0, rewriteTimeMs -> 0}|null |Apache-Spark/3.4.0 Delta-Lake/3.0.0-SNAPSHOT| |2 |2023-07-01 08:59:07.022|null |null |SET TBLPROPERTIES|{properties -> {"delta.enableDeletionVectors":"true"}} |null|null |null |1 |Serializable |true |{} |null |Apache-Spark/3.4.0 Delta-Lake/3.0.0-SNAPSHOT| |1 |2023-07-01 08:59:02.902|null |null |WRITE |{mode -> Append, partitionBy -> []} |null|null |null |0 |Serializable |true |{numFiles -> 2, numOutputRows -> 6, numOutputBytes -> 1525} |null |Apache-Spark/3.4.0 Delta-Lake/3.0.0-SNAPSHOT| |0 |2023-07-01 08:58:44.631|null |null |CREATE TABLE |{isManaged -> true, description -> null, partitionBy -> [], properties -> {}}|null|null |null |null |Serializable |true |{} |null |Apache-Spark/3.4.0 Delta-Lake/3.0.0-SNAPSHOT| +-------+-----------------------+------+--------+-----------------+-----------------------------------------------------------------------------+----+--------+---------+-----------+--------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+--------------------------------------------+ ``` Closes delta-io/delta#1880 Co-authored-by: Lars Kroll Signed-off-by: Paddy Xu GitOrigin-RevId: e6e6b4165b7c51d571eb6442bdd95829511bdaa7 commit e61bb6a2776b2c1bdc52ef047fedab3edcaa40e9 Author: Ole Sasse Date: Fri Aug 4 08:55:56 2023 +0200 Implicit DML casts: Fix Delta version in error message and add more This is a follow up to delta-io/delta#1938 that fixes the Delta version mentioned in the error message and adds more tests for the implicit casting behaviour in MERGE and UPDATE Closes 
https://github.com/delta-io/delta/pull/1944 GitOrigin-RevId: afed6bb66c1832120731b9633fe9dcbbcbdb18d7 commit ceb433edf2f34e4041d5a9e976148457b4ade30e Author: Ming DAI Date: Wed Aug 2 16:05:06 2023 -0700 Refactoring to separate unit tests of Clone Parquet and Clone Delta GitOrigin-RevId: 781a810bbf77727a8e6da342a927c72ad6b3db47 commit 2823efd587106ee184a828ab6fd2ada636ae8bde Author: Prakhar Jain Date: Wed Aug 2 15:49:43 2023 -0700 Protocol changes for V2 checkpoints Same as title Closes https://github.com/delta-io/delta/pull/1946 GitOrigin-RevId: 1c95ed5bd24eea4d995b81537f86fc4ba805bf72 commit 004f8071415db7929e559b5dad46e4391183387d Author: Jacek Laskowski Date: Wed Aug 2 13:54:50 2023 -0700 [BUILD] targetJvm and default_scala_version settable ## Description It is currently not possible to build the sources with Java 11 and Scala 2.13 with no changes to `build.sbt`. In order to make Java and Scala versions "settable" (e.g. on command line) this PR introduces two sbt settings: * `default_scala_version` * `targetJvm` With the settings, it's possible to build the sources with Java 11 and just a single Scala 2.13 (no cross-compiling that would take longer). ```shell sbt 'set Global / default_scala_version := "2.13.5"' \ 'set Global / targetJvm := "11"' \ spark/publishLocal storage/publishLocal ``` Various build configurations (valid and invalid, e.g. `targetJvm := "123"`) Once built, checked the ivy directories and ran `spark-shell`. 
``` ls -l /Users/jacek/.ivy2/local/io.delta/delta-storage/3.0.0-SNAPSHOT/jars ls -l /Users/jacek/.ivy2/local/io.delta/delta-spark_2.12/3.0.0-SNAPSHOT/jars ls -l /Users/jacek/.ivy2/local/io.delta/delta-spark_2.13/3.0.0-SNAPSHOT/jars ``` ``` ./bin/spark-shell --packages io.delta:delta-spark_2.13:3.0.0-SNAPSHOT \ --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \ --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog ``` ``` sql("DROP TABLE IF EXISTS demo_001") spark.range(0, 5, 1, numPartitions = 1).write.format("delta").saveAsTable("demo_001") sql("DESC EXTENDED demo_001").where($"col_name" === "Provider").select("data_type").show +---------+ |data_type| +---------+ | delta| +---------+ assert(spark.table("demo_001").collect.length == 5) ``` ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1908 Signed-off-by: Venki Korukanti GitOrigin-RevId: bb5712591690be5bebf87302717689eaa7669902 commit 555dad7d6f0cc946cb1a7d3222dbef77c9f914fc Author: Andreas Chatzistergiou Date: Wed Aug 2 17:45:54 2023 +0200 Allow users to run DROP FEATURE command. GitOrigin-RevId: 3fce378b61f3849f2afebefd72399327addf19ea commit 790b636442bef8a7b7d957498077c58fd18bf3d7 Author: Fredrik Klauss Date: Wed Aug 2 00:15:10 2023 +0200 Expose operation names for CLONE, RESTORE, SET TBLPROPERTIES ## Description * Expose operation names in CLONE, RESTORE, SET TBLPROPERTIES * Remove redundant opName in CLONE and match on the CLONE operation name instead * Removing dead code in CLONE path GitOrigin-RevId: ec93e4ef08db5d287e12ae949e8f87e3caeefb08 commit dd1b5ec00c91554a20df0fa5469ad8eb7370ec71 Author: Venki Korukanti Date: Sun Aug 6 23:29:38 2023 -0700 Revert "Refactoring to separate unit tests of Clone Parquet and Clone Delta" This reverts commit 98fe7797f681408036d08aefda8c6684d8b0e0a1. 
commit 98fe7797f681408036d08aefda8c6684d8b0e0a1 Author: Ming DAI Date: Wed Aug 2 16:05:06 2023 -0700 Refactoring to separate unit tests of Clone Parquet and Clone Delta GitOrigin-RevId: 781a810bbf77727a8e6da342a927c72ad6b3db47 commit 08049c8b96ef0562fdba39645117d6831b2b1144 Author: Prakhar Jain Date: Tue Aug 1 01:10:53 2023 -0700 Minor refactoring in CheckpointProvider GitOrigin-RevId: 33cb8a6f68fd2de4c7ee2278c33fa3ce2811ed5e commit 0658ee90833516921fd33d896646582810b75284 Author: Burak Yavuz Date: Mon Jul 31 17:45:23 2023 -0700 Add more DDL test coverage and support ALTER TABLE REPLACE COLUMNS ## Description This PR adds more DDL test coverage around commands like SHOW TBLPROPERTIES, SHOW COLUMNS, DESC COLUMN and also adds support for ALTER TABLE REPLACE COLUMNS. REPLACE COLUMNS as it is implemented today drops all columns and then adds all new columns. This is actually undesirable for a Delta table, as it will cause all data to be dropped when column mapping is enabled. Therefore the way we implement is to look at the final schema, which is provided by all the ADD COLUMN table changes that are provided and try to resolve the differences. If a difference is ambiguous, then we will throw an error. A column being dropped and added would lead to an ambiguous change, because we don't know if we should be treating that as a column rename or literally a drop + add. Users should explicitly call ALTER TABLE drop columns + ALTER TABLE add columns instead. This patch introduces many new unit tests ## Does this PR introduce _any_ user-facing changes? Yes, this PR may start throwing ambiguous change errors for ALTER TABLE REPLACE COLUMNS. I doubt that it is used as much, but it would prevent accidental deletion of data. Users can do a REPLACE TABLE if they really wanted to replace the table with empty data. 
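The ambiguity rule for REPLACE COLUMNS described above can be sketched like this. It is a simplified illustration under stated assumptions, not the code from this PR: the class `ReplaceColumnsCheck` is hypothetical, columns are compared by name only (types, nesting and ordering are ignored), and the only rule encoded is that dropping and adding columns in the same change is rejected because it could be either a rename or a real drop + add.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not a Delta API.
final class ReplaceColumnsCheck {
    // Returns the columns that must be added to reach the target schema,
    // or throws if the change both drops and adds columns.
    static Set<String> columnsToAdd(List<String> current, List<String> target) {
        Set<String> dropped = new LinkedHashSet<>(current);
        dropped.removeAll(target);   // in current but not in target

        Set<String> added = new LinkedHashSet<>(target);
        added.removeAll(current);    // in target but not in current

        if (!dropped.isEmpty() && !added.isEmpty()) {
            throw new IllegalArgumentException(
                "Ambiguous REPLACE COLUMNS: dropped " + dropped + ", added " + added
                    + "; this could be a rename or a drop + add");
        }
        return added;
    }
}
```

Under this sketch, going from `(id, name)` to `(id, name, email)` resolves to adding `email`, while going from `(id, name)` to `(id, full_name)` is rejected.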
Closes delta-io/delta#1822 Co-authored-by: Ryan Johnson Signed-off-by: Burak Yavuz GitOrigin-RevId: e1ff7c1d2c03028fa4803bdc13456fbc63448cc4 commit 9888acfe4ffa2737f65672870b726813251122f5 Author: Ala Luszczak Date: Fri Jul 28 13:10:01 2023 +0200 Minor refactoring to CreateDeltaTableCommand GitOrigin-RevId: 577f8fa4ba3bacd085e335c2a51091c53d472a7f commit dfaa3156f608b8f270069b3c5241b5264d821a9d Author: Andreas Chatzistergiou Date: Fri Jul 28 09:49:51 2023 +0200 Adds SQL support for table feature removal. The syntax is the following: `ALTER TABLE table_name DROP FEATURE feature_name` GitOrigin-RevId: a4206bbd1a74e260e09d0070e77d5eb2fd8f47ff commit 278e1d214da5ba71497ea24aba00546e7dca58c7 Author: Venki Korukanti Date: Wed Aug 2 10:38:57 2023 -0700 [Kernel] Add examples for Delta Kernel API usage ## Description Adds an example project that shows how to read a Delta table using the Kernel APIs. The sample program can also be used as a command line to read the Delta table. Single threaded reader ``` java io.delta.kernel.examples.SingleThreadedTableReader \ --table=file:/connectors/golden-tables/src/main/resources/golden/data-reader-primitives \ --columns=as_int,as_long --limit=5 as_int| as_long null| null 0| 0 1| 1 2| 2 3| 3 ``` Multi-threaded reader (simulating a distributed execution environment) ``` java io.delta.kernel.examples.MultiThreadedTableReader --table=file:/connectors/golden-tables/src/main/resources/golden/data-reader-primitives \ --columns=as_int,as_long --limit=20 --parallelism=5 as_int| as_long null| null 0| 0 1| 1 2| 2 3| 3 ``` ## How was this patch tested? Manual testing ``` Usage: java io.delta.kernel.examples.SingleThreadedTableReader [-c ] [-l ] -t -c,--columns Comma separated list of columns to read from the table. Ex. --columns=id,name,address -l,--limit Maximum number of rows to read from the table (default 20). 
-t,--table Fully qualified table path ``` ``` Usage: java io.delta.kernel.examples.MultiThreadedTableReader [-c ] [-l ] [-p ] -t -c,--columns Comma separated list of columns to read from the table. Ex. --columns=id,name,address -l,--limit Maximum number of rows to read from the table (default 20). -p,--parallelism Number of parallel readers to use (default 3). -t,--table Fully qualified table path ``` commit 115abb3a7b8788efc65bc7bd4c022c9272606652 Author: Allison Portis Date: Thu Jul 27 14:08:33 2023 -0700 Rename incorrectly named file for java checkstyle GitOrigin-RevId: 892491d6b7106438d5c1edfb09d31281449b3d51 commit 22a22d223273a0875bf5ae98f7b274d464dd4637 Author: Tom van Bussel Date: Thu Jul 27 22:24:34 2023 +0200 Minor refactorings GitOrigin-RevId: 34152fe02c3b3953dbc819008c2bdbccc363ac8b commit f81e0ec7d5331baba64bc7cc02270647b7acd23a Author: Fredrik Klauss Date: Wed Jul 26 18:16:42 2023 +0200 Extract method to check if table exists and refactor CreateTableCommand ## Description * Add `tableExists` method in CLONE that checks whether a table exists at a given snapshot. * Refactor CreateDeltaTableCommand so `checkPathEmpty` takes a `txn` as argument. * Allow to pass a specific snapshot that should be used to start the table creation transaction. * Existing UTs. GitOrigin-RevId: e049cc52618b3ce7f6d2d8e91077cabc04a6f1b8 commit 6d78d43470fb14bba264c5107d1f07b3beaacec4 Author: Ole Sasse Date: Wed Jul 26 17:30:06 2023 +0200 Use storageAssighmentPolicy for casts in DML commands Follow spark.sql.storeAssignmentPolicy instead of spark.sql.ansi.enabled for casting behaviour in UPDATE and MERGE. This will by default error out at runtime when an overflow happens. 
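The difference between a silent overflow and a runtime error can be illustrated outside Spark with plain Java integer narrowing. This is only an analogy, assuming a `long` value being written into an `int` column; it calls no Spark or Delta API, and the class name `NarrowingCastDemo` is hypothetical.

```java
// Hypothetical demo class; an analogy for lenient vs. strict casts.
final class NarrowingCastDemo {
    // Lenient cast: silently wraps around when the value does not fit.
    static int wrappingCast(long value) {
        return (int) value;
    }

    // Strict cast: fails at runtime when the value does not fit,
    // which is the kind of behaviour the commit describes as the default.
    static int checkedCast(long value) {
        return Math.toIntExact(value); // throws ArithmeticException on overflow
    }
}
```

Here `wrappingCast(2147483648L)` quietly produces `Integer.MIN_VALUE`, whereas `checkedCast(2147483648L)` throws, so the bad write is surfaced instead of stored.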
Closes https://github.com/delta-io/delta/pull/1938 GitOrigin-RevId: c960a0521df27daa6ee231e0a1022d8756496785 commit 06266643b118eff1928c33a1ffd7e48ea5113508 Author: Animesh Kashyap Date: Wed Jul 26 16:54:02 2023 +0200 Minor refactor to DeltaSuite.scala GitOrigin-RevId: 75df0d544be34d482070dbb92464c6a0b9e22242 commit 9da7aa73a24c22976954fa292c99662062f01727 Author: Christos Stavrakakis Date: Wed Jul 26 12:26:54 2023 +0200 Extend remove actions with stats field Extend the `remove` actions with statistics about the data in the file that is being removed from the table and that can be used to optimize queries that read the changes of older versions, e.g., Change Data Feed queries. These statistics are copied from the `add` action when a file is being removed. File statistics can be large and increase the size of checkpoints. However, statistics of `remove` actions are not needed for tombstones but only for Delta JSON files, and so are dropped from checkpoints. Resolves https://github.com/delta-io/delta/pull/1907 GitOrigin-RevId: 5013848522e906fea207bea5078a2e4a0ce605c8 commit e4dfbeae3804d10de2291e2f742dc429c372a3ae Author: Sabir Akhadov Date: Wed Jul 26 10:44:15 2023 +0200 Prohibit spark.databricks.delta.dynamicPartitionOverwrite.enabled=false when DPO requested With Spark SQL config `spark.databricks.delta.dynamicPartitionOverwrite.enabled=false`, both Spark SQL `spark.sql.sources.partitionOverwriteMode=dynamic` and Delta write option `partitionOverwriteMode=dynamic` are ignored. This could lead to corrupt data as users might explicitly require and expect DPO instead of overwriting the whole table. We prohibit such a combination by throwing an exception instead. The `spark.databricks.delta.dynamicPartitionOverwrite.enabled` is marked internal, so we do not expect users to have set it. 
GitOrigin-RevId: b857a04e9e97199479945df44a935af9d7b5db46 commit 29f84e045b02a13f5ba2003654ff550adb6d6000 Author: Desmond Cheong Date: Tue Jul 25 22:01:16 2023 -0700 Minor refactor GitOrigin-RevId: eaeb6c784639d20046c8a1e72547ce95ec7dc7f6 commit 644cb77a52af4d4cfba01902871e0eba2a4d4102 Author: Yijia Cui Date: Tue Jul 25 10:39:46 2023 -0700 Refactor MergeIntoMaterializeSource Refactor MergeIntoMaterializeSource so it doesn't mix with MergeIntoCommandBase. GitOrigin-RevId: 02b157a10332eb3df1c598955930c011ee0a391d commit 5613d1ef9195c085c937e288e0744da80976512e Author: Ole Sasse Date: Tue Jul 25 17:55:19 2023 +0200 Prefactoring: Propagate column name in UpdateExpressionsSupport This is preparation for an upcoming change that introduces custom error messages to explicit casts added in UpdateExpressionsSupport. Those will have the column name as part of them and this prefactoring make them available GitOrigin-RevId: 1c738d531e6dff2e337994f495c23a1f7397d1b4 commit d85aea44cfec6cbacc31b6b584717c0f1f81e688 Author: Fredrik Klauss Date: Mon Jul 24 18:41:10 2023 +0200 Put post commit updates in CREATE TABLE into separate method ## Description * Put post table creation commit updates in CREATE TABLE into separate method to increase readability * Pure refactoring, no code change. * Existing UTs. GitOrigin-RevId: ce3f70ec43fc5a2498a0ef9cf8b39362456d32d1 commit fdeb15399739aadd1e07a11b2791822f92b6800d Author: Wenchen Fan Date: Mon Jul 24 15:52:09 2023 +0000 Update error messages in DeltaColumnRenameSuite This change updates error messages in DeltaColumnRenameSuite to match the change in https://github.com/apache/spark/pull/41863 GitOrigin-RevId: e25d6f4ce206af00bccc14cb40f9a57dd9ad7929 commit f0f7fea9512cb9abef976599265576ba638392de Author: Carmen Kwan Date: Mon Jul 24 17:07:01 2023 +0200 Minor refactor to DeltaDLLSuite.scala . Use unique table names when running tests with ALTER TABLE to get rid of flakiness when running many tests in parallel. 
We DROP the table at the end of each test, which leads to a validation error in a concurrent test that the command is running on a table that exists. GitOrigin-RevId: 42f51e446e4b8a2da298f34738b3b150cadb0506 commit d90706852256c0d271d69ca8a2c13c934958c481 Author: Sabir Akhadov Date: Mon Jul 24 14:12:48 2023 +0200 Disallow overwriteSchema with dynamic partitions overwrite Disallow overwriteSchema when partitionOverwriteMode is set to dynamic. Otherwise, the table might become corrupted as schemas of newly written partitions would not match the non-overwritten partitions. GitOrigin-RevId: ce30323db9c458ed994b26431b931fdfa5673c31 commit 124c87f0735dccea471560201cc36f66c437135a Author: Zhen Li Date: Fri Jul 21 18:59:24 2023 -0700 Minor refactor GitOrigin-RevId: 9d633ca993b5f912c91cb2cd903d499f642ac88d commit 29147f8c9e367eff5b3b104da7367cdd40d1648d Author: Scott Sandre Date: Fri Jul 21 15:35:17 2023 -0700 Re-enable iceberg module and clone iceberg github repo using shallow clone #### Which Delta project/connector is this regarding? -Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description This PR uncomments the iceberg module (it was commented out previously to prevent compilation). This PR also changes how we clone the iceberg src code (in order to be able to apply patch files). Instead of using a deep clone, we use a shallow clone. This had some ramifications in how the JAR files were named ... so needed to update the python generation script, too. 
Closes delta-io/delta#1917 Signed-off-by: Scott Sandre GitOrigin-RevId: d8249310313b4a01c387e81e9bc5c6a9d8d0cf0b commit 9881a6ec590ec52dd830512c1f31c90e762555df Author: Zhen Li Date: Fri Jul 21 21:42:37 2023 +0000 Minor refactor GitOrigin-RevId: 5530091bbb4e97216e62f75583744f05f5b3e7d2 commit 79b15566bb7070214d6601c5b6d2f967e8b682b3 Author: Ming DAI Date: Fri Jul 21 13:55:04 2023 -0700 Correct some comments in ConvertUtils and refactor ParquetFileManifest to uniform the interface of doList() GitOrigin-RevId: 8ee7a408b464b31bea2babd1577c02f328cde1c9 commit 53f981b89c58c5805570439984b6348c1226013f Author: Burak Yavuz Date: Fri Jul 21 12:39:53 2023 -0700 Refactor table_changes TVF resolution Refactors the CDC and table_changes TVF resolution a little bit to make it cleaner and catalyst-y. GitOrigin-RevId: 43e97829a84e620aa588acd0db9f9cea84353aad commit 0325f8cad1d0c698e3b572ed147ec7af20a1f135 Author: panbingkun Date: Fri Jul 21 11:48:54 2023 +0000 Unify error message for unsupported time travel Unify error message for unsupported time travel GitOrigin-RevId: 61d0a8a4cb6eb533ce25e120e66c15aa3de5cbc0 commit 7fa7fdb2cfeefe671be39b5ffa26b8465918c7f1 Author: Wenchen Fan Date: Fri Jul 21 11:46:51 2023 +0000 Update tests to accept more general error message Update tests to accept more general error message GitOrigin-RevId: 7ef53ddb7a1de549230ae0fe1fca93960cd64990 commit 8824a348ebe0b7ea430ad31ea4423fe65a6dee21 Author: Allison Portis Date: Thu Jul 20 21:03:01 2023 -0700 [Kernel] Updates build.sbt to add javastyle checks for Kernel GitOrigin-RevId: 56bf82fc83a68dd16a285228a06abaa652e5b409 commit a2bbf9dadf3e56b46021aace522ff3cf41b37767 Author: Adam Binford Date: Thu Jul 20 13:59:59 2023 -0700 Incremental loading for DeltaSourceSnapshot actions Resolves https://github.com/delta-io/delta/issues/1699 Updates `DeltaSourceSnapshot` with a few improvements, most notably loading actions for the snapshot incrementally using `toLocalIterator` instead of `collect`'ing the entire 
snapshot on the driver and keeping it there. If we're worried about performance aspects of `toLocalIterator` I can put the two different methods behind a feature flag. Additionally: - Adds a `repartitionByRange` before the sort on time and path to control the number of output partitions of the sort. Without this the default shuffle partitions are used, which can be large for large datasets and make `toLocalIterator` as well as `collect` take longer. - Skip encoding the cached dataset as an `IndexedRow` because it is immediately converted back to a DataFrame to create the iterator. - Includes the same updates as https://github.com/delta-io/delta/pull/1703 to drop stats before creating streaming file indices - Uncaches the `Delta Source Snapshot` in addition to the underlying `Snapshot` on close Existing UTs, shouldn't change any behavior. ## Does this PR introduce _any_ user-facing changes? No, just resource requirement changes. Closes delta-io/delta#1721 GitOrigin-RevId: 3add8f2ca2234c1c16f7ba36eadbbad9f1cab3ee commit 7ee0db0beb80813fc5a09106e2628f222c0ff167 Author: Fredrik Klauss Date: Thu Jul 20 17:28:27 2023 +0200 Pass through txn created in CREATE TABLE to CLONE ## Description * Pass through transaction created in CREATE DELTA TABLE codepath to CLONE to make sure the transaction is actually used for the CLONE commit * Don't make Clone a `RunnableCommand`, as it's always used as part of CREATE TABLE and isn't self-contained. 
* Ensure that one can not pass in an already committed transaction into CLONE Existing UTs GitOrigin-RevId: cd9c40b606c9718cc4fa1f830f7d2df55c5f3043 commit 113f18ee750938d155bc13ce0be24ca6e0f506ed Author: Fredrik Klauss Date: Thu Jul 20 09:16:44 2023 +0200 Separate methods for validations of CREATE TABLE commit ## Description * Wrap validations done before table creation into separate method to make it clear what they are used for * Existing UTs GitOrigin-RevId: ec4a4fc9e37922afd5e75d9375f417d548968d39 commit 0c4cd02387828263053c44c58bd8afe8c16fe144 Author: Allison Portis Date: Wed Jul 26 14:41:25 2023 -0700 [Kernel] Re-enable javastyle checks for the kernel projects Re-enables javastyle checks for Kernel java code and fixes any violations of it. Runs the checks as part of compile & test. Closes delta-io/delta#1906 commit c78daeff2062ce3c9d8c159301add1848f25c809 Author: Fredrik Klauss Date: Wed Jul 19 19:53:54 2023 +0200 Put commit handling logic in CreateDeltaTableCommand into separate method ## Description * Put commit handling logic into extra method to make control flow easier * Existing UTs GitOrigin-RevId: b575cd7e5fcef1a01edc52343ae615d310f61350 commit 3de8248aca60b872c05624975a499acdfeb3dcac Author: Satya Valluri Date: Wed Jul 19 10:25:33 2023 -0700 Minor code refactor in DeltaSink.scala. GitOrigin-RevId: 9804cd1d4bffa598a633e657bd518501aea661fc commit 7336a2b1085732a91ae02524f2918e512112a15d Author: Fredrik Klauss Date: Wed Jul 19 13:01:53 2023 +0200 Extra method for CREATE TABLE without AS clause in CreateDeltaTableCommand ## Description * Refactor code in `CreateDeltaTableCommand` to put the logic that handles the commit for CREATE TABLE AS SELECT into a separate method. * Existing UTs GitOrigin-RevId: 2121edd03c6dceeba0ac08eb1544c6566bae4cef commit 8c140a9035d907bc48065c4e621eea54372b8a20 Author: Jiaan Geng Date: Tue Jul 18 23:36:37 2023 +0000 Add additional check for spark error message. 
GitOrigin-RevId: 11844642799230fb8522aff4cfe5061057457ee2 commit e41db5c17ab0b2edb5bcd2ba594be98c62fb4522 Author: sherlockbeard Date: Mon Jul 17 17:00:50 2023 -0700 Fixed DV DELETE on column mapping enabled tables Resolves delta-io/delta#1886 Added ROW_INDEX_COLUMN_NAME in DELTA_INTERNAL_COLUMNS set so Delta consider it as internal field Added test in `DeletionVectorsSuite.scala` Closes delta-io/delta#1891 GitOrigin-RevId: 5e881a392634d1636ac1864e08c4d92ff1cfeb26 commit 355804caee1924842c640bb99697cfd4dd66d07c Author: Johan Lasperas Date: Mon Jul 17 16:38:28 2023 +0200 Assign materialized Row ID and Row commit version column names ## Description This change is part of implementing row tracking as specified in https://github.com/delta-io/delta/pull/1610 and https://github.com/delta-io/delta/commit/7272b04db76507a98e07e96dc88e2c1a54283693. It covers assigning a column name for the materialized Row ID and Row commit version columns by setting them in the table metadata when creating or cloning a table. - Add test suite `rowtracking.MaterializedColumnSuite` to cover assigning materialized column names in various table creation and clone scenarios. Closes delta-io/delta#1896 GitOrigin-RevId: 3b963ed7e08524f24d60744160490e41a0aab3e8 commit dbb2210054ede73068cd83cf1a911dd5ca36fbf7 Author: Johan Lasperas Date: Mon Jul 17 14:42:13 2023 +0200 Support struct evolution inside maps ## Description This PR resolves issue https://github.com/delta-io/delta/issues/1641 to allow automatic schema evolution in structs that are inside maps. 
Assuming the target and source tables have the following schemas: target: `id string, map map>` source: `id string, map map>` ``` SET spark.databricks.delta.schema.autoMerge.enabled = true; MERGE INTO target t USING source s ON t.id = s.id WHEN MATCHED THEN UPDATE SET * ``` returns an analysis error today: ``` AnalysisException: cannot resolve 's.map' due to data type mismatch: cannot cast map> to map>; ``` With this change, the merge command succeeds and the target table schema evolves to include field `c` inside the map value. The same also works for map keys. - Tests are added to `MergeIntoSuiteBase` and `MergeIntoSQLSuite` to cover struct evolution inside of maps values and keys. ## Does this PR introduce _any_ user-facing changes? Yes, struct evolution inside of maps now succeeds instead of failing with an analysis error, see previous example. Closes delta-io/delta#1868 GitOrigin-RevId: 07ce2531e03c4e2fa69e8a34f33ba8d2dc3a0228 commit 71e0a83e41a678e5e5fc7aa6ee404ea0b62ee462 Author: Herman van Hovell Date: Sun Jul 16 05:19:51 2023 +0000 Minor refactor in test assertion to use ParseException GitOrigin-RevId: dbe29e3e8c4886c19586808e20a723180e259de1 commit b2b2bd7499dcfff808cf6e7bee873d88a5a89ec4 Author: Andreas Chatzistergiou Date: Fri Jul 14 12:28:25 2023 +0200 This PR lays the groundwork for Table Feature phase-out and adds support for removing writer features from a delta table. This is achieved as follows: Introduces a new alter table command for removing features, AlterTableDropFeatureDeltaCommand. Introduces a new trait, RemovableFeature, that all removable features must implement. Each removable feature may implement a PreDowngradeTableFeatureCommand that is responsible of removing any traces of the features in the table. 
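The drop-feature flow described above could look roughly like the sketch below. It is written in Python purely for illustration and does not mirror the Scala classes: the dict-based table model, `ExampleWriterFeature`, its property prefix, and the `drop_feature` function are hypothetical. Only the idea that a removable feature may supply a pre-downgrade step that cleans up its traces before the feature is removed comes from the commit message.

```python
class RemovableFeature:
    """Hypothetical stand-in for a feature that can be dropped from a table."""

    name = ""

    def pre_downgrade(self, table):
        """Remove traces of the feature from the table. No-op by default."""
        return None


class ExampleWriterFeature(RemovableFeature):
    """Illustrative writer feature that stores state in table properties."""

    name = "exampleFeature"
    property_prefix = "delta.exampleFeature."

    def pre_downgrade(self, table):
        props = table["properties"]
        # Delete every property owned by this feature.
        for key in [k for k in props if k.startswith(self.property_prefix)]:
            del props[key]


def drop_feature(table, feature):
    """Run the pre-downgrade step, then remove the feature from the protocol."""
    if feature.name not in table["writer_features"]:
        raise ValueError(f"feature {feature.name!r} is not present on the table")
    feature.pre_downgrade(table)
    table["writer_features"].discard(feature.name)
    return table
```

Running the cleanup before touching the feature set means a failed cleanup leaves the feature still listed, so the table never advertises a protocol it does not satisfy.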
Closes delta-io/delta#1888 GitOrigin-RevId: 7e1310d807c2ab852b8dd069bd7e92b633df36e9 commit 4932956ae52f7a1211999f5364b6f9e1875d183a Author: Lars Kroll Date: Fri Jul 14 12:03:03 2023 +0200 Add config to run MERGE source materialization eagerly. - Add a new config that controls whether the MERGE source is materialized eagerly or lazily. - New config is set to eager by default to avoid suspected determinism issues with lazy materialization. Closes delta-io/delta#1910 GitOrigin-RevId: 2b9157f82bc1826467c6fb06baa734790c5fdc75 commit dc061f7b5b47ffb4cf980b445d85100690bedf05 Author: Allison Portis Date: Fri Jul 14 17:40:26 2023 -0700 [Kernel][Standalone] Consolidate testing infrastructure that uses the GoldenTables project and other related clean-up - Updates the `goldenTables` project to include utils that expose the resource file path. Updates other projects to depend on it and use that for getting the golden table file paths. - Moves the resources to `src/main/resources` so that other projects can depend on the `goldenTables` compile conf and skip all the test dependencies - Updates `goldenTables` to use the latest `delta-spark` version (just fixes a couple of compile errors) - Update the kernelDefault `DeletionVectorSuite` - Update `"end-to-end usage: reading partitioned dv table with checkpoint"` to use a golden table since that's how the resource table was originally generated. Adds this table to GoldenTables and removes the no longer used test resource table. - Update the test resource `"dv-with-columnmapping"` to actually have DV enabled (previously it mistakenly wasn't). Testing change. No. Closes delta-io/delta#1887 commit e0f0e91c80b431ee47861c49bd21ff5ac0294cc3 Author: Paddy Xu Date: Thu Jul 13 11:10:57 2023 +0200 Probe whether the metadata path is canonicalized in Spark The mechanism of this fix is to call the Spark internal method, which is used to generate metadata columns, to see if it will canonicalize spaces in a crafted path string.
If the answer is yes, then we don't need to do anything on the Delta side; otherwise, we manually canonicalize the obtained metadata column. Why don't use the Spark internal method on `FileToDvDescriptor`, so both sides of the join are either canonicalized or not-canonicalized? Because most Delta methods are expecting a canonicalized path, thus the returned DF must be canonicalized in all cases. Closes https://github.com/delta-io/delta/pull/1769. PR targeting `branch-2.4`: https://github.com/delta-io/delta/pull/1770. GitOrigin-RevId: 3538b18ff23e81c603acbc7df3930ba1730903f2 commit 1861c8a260ba1dd29afb5230214c946e73782cb9 Author: Eric Maynard Date: Wed Jul 12 23:55:16 2023 -0700 Minor refactors to DeltaSource.scala GitOrigin-RevId: 4358924cb0ac745cc79dd56a30fe2dd1487a0c69 commit d0e1223f135bb0392c72685839ddb672f5a27c55 Author: Ming DAI Date: Wed Jul 12 16:40:12 2023 -0700 Minor refactor to create a helper function for IcebergTablePlaceHolder GitOrigin-RevId: 47a900e16bead9e2374dc08499ea6f82400c6d91 commit 4e500b39d361b0b87ca7d8490b9c6fca0ad331b2 Author: Johan Lasperas Date: Wed Jul 12 12:52:02 2023 +0200 Resolve merge actions in batch Expressions in merge clauses are resolved one at a time by running a full analysis on a fake plan that includes that single expression. The time required to resolve `UPDATE SET *` or `INSERT *` actions then grows quadratically with the number of columns in the target and source tables respectively: the number of analysis pass grows linearly with the number of expressions, the cost of each pass grows linearly with the number of columns. This change makes the analysis time linear by resolving all `UPDATE SET *` or `INSERT *` assignment expressions in a single analysis pass, using a fake plan that includes all expressions to resolve. In addition, the logic to pass the plans that should be used for resolution is simplified. 
Instead of passing a dummy plan with the right children - either the source only, the target only or both - we directly pass the list of plans to use to the resolution helper. - Confirmed locally that the analysis time is orders of magnitude faster when the source and table contain 5000 columns. Closes delta-io/delta#1875 GitOrigin-RevId: 6208cc52aead6a04c80b602ee9d4018c48e9b816 commit aae30ac0d5bcd64917e3bba5c0215bf626866a01 Author: Allison Portis Date: Tue Jul 11 15:09:02 2023 -0700 Update build.sbt for golden tables refactoring Update the build dependencies so projects using golden tables to test can depend on goldenTables and avoid hacky path solutions. GitOrigin-RevId: 23fb7794bfc0aa2ab5b0ee03bb82bc7631302424 commit 41348c32692724efb5da1be7d6329673265915a4 Author: Jackie Zhang Date: Tue Jul 11 09:18:12 2023 -0700 Refactor Delta Source Schema Log. GitOrigin-RevId: 2e22ab1a157197f39deaa8ef8d2d082f7d692023 commit 0b7e4cab1912929f12e011cd4fb224c70fe53e80 Author: Allison Portis Date: Mon Jul 10 14:02:58 2023 -0700 Update issue templates to include the related component ## Description Now that more than just the Spark connector resides in this repository we need issues to specify the relevant connector. This updates the templates to do so. N/A Closes delta-io/delta#1883 Signed-off-by: Allison Portis GitOrigin-RevId: 8bdce12f376a38339ae2e2c01c1a6b6975490935 commit d45e868f19dd09d19c936087ef97d47d2a5ed60d Author: Felipe Pessoto Date: Mon Jul 10 12:52:21 2023 -0700 Add Iceberg to the table feature appendix and fix Timestamp NTZ link Adds icebergCompatV1 to the table features in the appendix. 
And fix the Timestamp NTZ link N/A Docs only Closes delta-io/delta#1894 Signed-off-by: Scott Sandre GitOrigin-RevId: 7efb94f515aba9a6bfa853c0a3f6d27649c51b59 commit 800168e60e5c1191e59dcba866b7ab71273a1921 Author: Lars Kroll Date: Mon Jul 10 14:48:00 2023 +0200 Small refactor to DeltaErrorsSuite.scala GitOrigin-RevId: 42251075ba6a9e2f7fd953edf7b663d46a38bc84 commit 396587980e6681fbdc4b8221f770311e5ef61154 Author: Wenchen Fan Date: Fri Jul 7 22:26:16 2023 +0000 Allow SQL MERGE statement to serve other data sources SQL MERGE is a standard DML command and is not Delta-only, but supports DS v2 as well. GitOrigin-RevId: b93e5a366f4aeb1616b9989b41847a59e846f717 commit 3a05da8d013d13ef91bc1cbf6fa216ec373a0990 Author: panbingkun Date: Fri Jul 7 21:44:49 2023 +0000 Minor refactoring for the DeltaInsertIntoTableSuite Add additional check for spark error message. GitOrigin-RevId: 55007a8f6acea55f3e47b5ddbb05ed296ce5211b commit 59ac1a67a7a95f2fb9aba9ff114ce057e29e262d Author: ChengJi-db Date: Fri Jul 7 10:13:46 2023 -0700 Refactor getCheckpointProvider GitOrigin-RevId: 79b9630263b1d8bda0e67050db8045995eb1fc1f commit 975daae690bce6295e5f8814f2515562553ed2ad Author: Fredrik Klauss Date: Fri Jul 7 18:56:10 2023 +0200 Set spark.sql.extensions to enable Delta in tests instead of overriding SparkSession * Set spark.sql.extensions instead of overriding extensions of SparkSession object. * The current implementation overrode the extensions of SparkSession, as it did not pick up spark.sql.extensions. This was fixed and now it picks up extensions set in spark.sql.extensions GitOrigin-RevId: cbd1654496225baffca02ec3072cbdc2297c1e3c commit 84cb047671768e07d09231b08ee1d3783b305476 Author: Fredrik Klauss Date: Thu Jul 6 20:41:43 2023 +0100 Extra method for CREATE TABLE without AS clause in CreateDeltaTableCommand * Refactor code in `CreateDeltaTableCommand` to put the logic that handles the commit for CREATE TABLE without AS clause into a separate method. 
GitOrigin-RevId: 8ff04b0971d8984b29a27857fdc69ef01548da14 commit ee5a0c8325a91e8e97a0809261bbabdf30aef9df Author: Johan Lasperas Date: Thu Jul 6 18:46:18 2023 +0200 Move committing merge actions before collecting stats This is a follow-up on https://github.com/delta-io/delta/pull/1834 which too aggressively moved committing merge actions into the `collectMergeStats` method. This is a non-functional refactor moving code around. Existing merge metrics tests in `MergeIntoMetricsBase` extensively cover this code path. GitOrigin-RevId: 6d505b4dea8ad02618d7debc5138c1a557083f03 commit 4e2c7b094149fcdbe4535c2a673025a4667c8137 Author: Fredrik Klauss Date: Thu Jul 6 17:42:56 2023 +0100 Minor refactor Minor refactor to prep for subsequent refactors in `CreateDeltaTableCommand`. GitOrigin-RevId: 5d2417186b4396e2c8df941710fde848b2bbebbe commit 3b2dea946fcd357810b894c82efca0192d9cd2a3 Author: Wenchen Fan Date: Thu Jul 6 16:04:47 2023 +0000 minor code style changes minor code style changes GitOrigin-RevId: 5dbb8d6ac934142fddb41a971cf95e29f324b682 commit 88a9ba174c1d0d605683e84ffced9fa2b4f4b6f3 Author: Jackie Zhang Date: Wed Jul 5 12:39:15 2023 -0700 Adding support for evolving config and protocol changes for Delta source schema tracking. GitOrigin-RevId: dc614a1285ecd5d3ab6d6a454ed1093a79fe286e commit 91cdf40d6ef6745ab40496a76c9fd488f8dfa549 Author: Jacek Laskowski Date: Wed Jul 5 23:05:14 2023 +0200 Replace the underscore since Java 11 says it may not be used as an identifier Fix to build the sources using Java 11 (as the target JVM). Otherwise the following errors show up: ```text $ sbt kernelApi/Compile/doc ... 
[error] /Users/jacek/dev/oss/delta/kernel/kernel-api/src/main/java/io/delta/kernel/internal/deletionvectors/RoaringBitmapArray.java:214:1: as of release 9, '_' is a keyword, and may not be used as an identifier [error] /Users/jacek/dev/oss/delta/kernel/kernel-api/src/main/java/io/delta/kernel/internal/deletionvectors/RoaringBitmapArray.java:214:1: as of release 9, '_' is a keyword, and may not be used as an identifier [error] /Users/jacek/dev/oss/delta/kernel/kernel-api/src/main/java/io/delta/kernel/internal/deletionvectors/RoaringBitmapArray.java:214:1: as of release 9, '_' is a keyword, and may not be used as an identifier ``` It is only "visible" when you change `build.sbt` to use the following line (which I'm going to submit in a separate PR): ```text Compile / compile / javacOptions ++= Seq("-target", "11"), ``` A local build with Scala 2.13 (`val default_scala_version = scala213`) and Java 11 (`val targetJvm = "11"` not `"1.8"`) with the dependencies loaded from local-ivy-repo (not central) ```shell $ sbt clean publishLocal $ ./bin/spark-shell --packages io.delta:delta-spark_2.13:3.0.0rc1 \ --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \ --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog ... found io.delta#delta-spark_2.13;3.0.0rc1 in local-ivy-cache found io.delta#delta-storage;3.0.0rc1 in local-ivy-cache ``` No Close delta-io/delta#1882 commit 1bbe2ebdbea2f7d06ad88ef80c83215e79a0d885 Author: Paddy Xu Date: Tue Jul 4 11:32:48 2023 +0200 Add huge DV reading & writing tests This PR adds a test case to make sure reading & writing a huge DV work as expected. GitOrigin-RevId: 29e87ffb87718357a98c6871dc05bbc545b323c9 commit d6d87d02901f3fa4d995a7a30e407023baeeaf1d Author: Xinyi Date: Mon Jul 3 18:42:03 2023 -0700 Avoid string matching on delta as provider This PR avoids few string matching on 'delta' as provider and instead uses the utility function. 
GitOrigin-RevId: 96483e97280a9c84f086e512f894c993e4e1f6ff commit 8a2da73df31040b79b800f790aacc6a36787d397 Author: Min Yang Date: Fri Jun 30 14:38:38 2023 -0700 Fix a bug that CDCReader.changesToDF does not produce a plan with correct stats CDCReader.changesToDF always creates a empty LogicalRDD to retain schema and some properties, but LogicalRDD has stats of INT_MAX sizeInByte by default so it could make CDF-based queries optimized incorrectly in many cases. GitOrigin-RevId: 80bfe6b50d8504d6d013e102366b6225cfafa3f7 commit bf482d59aa9261d95ac90f62a424332f8ded0ae2 Author: Fredrik Klauss Date: Fri Jun 30 09:06:04 2023 +0100 Stats code refactor * Refactor stats code to ease further development by putting code to collect stats for a single AddFile into separate helper method GitOrigin-RevId: 7eb20386907a1e92d8e30518e5ab215f7fb4b36c commit 02f16a41c9da33dc2be4b78e9a65eb5db6cea9ce Author: Desmond Cheong Date: Thu Jun 29 22:32:32 2023 -0700 Minor refactor Remove an unused function. GitOrigin-RevId: 58f4d50ccbe3bc0762ddde8e92d6b7fb8562c243 commit a83fb634101aa60c21f2cb4b8857a8ed6b66b289 Author: Tathagata Das Date: Thu Jun 29 11:05:21 2023 -0700 Updated build for 3.0.0 As the title says. GitOrigin-RevId: 96d2a89c898d343dbb959d47428f5009e6055823 commit d883bd34c999d9e35bb4081ef921ea2bcc43ec03 Author: Scott Sandre Date: Thu Jun 29 11:05:02 2023 -0700 Currently, every per-PR test run is building the iceberg module, which depends on the icebergShaded module, which is cloning the iceberg github. This is a bad bad idea. Let's figure out a safer and smarter way to pull the iceberg github later. For now, disable it. GitOrigin-RevId: eff97223bd506b3e8f243d4ba3eccff4510170fe commit 6426ee53024182e2b9317e0d9c5b3140a0971d1d Author: jintao shen Date: Wed Jun 28 12:34:02 2023 -0700 Minor refactor to `SnapshotState.scala` .. to replace `collect_list` with `collect_set` when collecting `domainMetadata` to fix commit perf regression. 
GitOrigin-RevId: 41f174c505c03322ebf69d034c9079e8a21bb459 commit 30dd4d0fae1c2e4a745571427f9bcda24b71be6c Author: Jackie Zhang Date: Wed Jun 28 09:04:22 2023 -0700 Allow column mapping to reuse id and physical name across overwrite. Closes https://github.com/delta-io/delta/pull/1840 GitOrigin-RevId: 9c04721445ae8547b0f14e1db751d321f5145bc0 commit 1f5a7c010bfc2089feed494f221d1b62635c34b0 Author: Ming DAI Date: Tue Jun 27 10:09:14 2023 -0700 Fix a small typo in CatalogFileManifest GitOrigin-RevId: ad9386edfda9820d602434f8ea5021e80fed29dd commit 9f70ee5a65f90970432c80ba96d718f56997442c Author: Lars Kroll Date: Tue Jun 27 16:09:12 2023 +0200 Do not split commits that contain DVs in CDF streaming - When the add and remove file entries for a commit that contains DVs gets split into different batches, we don't recognise that they belong together and we need to produce CDF by calculating the diff of the two DVS. - This PR prevents splitting of these commits, just like we do for AddCDCFile actions to prevent splitting the update_{pre/post}_image information. GitOrigin-RevId: 11438c01ecc69f7c55c3a8826fa540f6b984e4c4 commit 8503647bf17d08d5a233356961fccd2c354011f0 Author: Tom van Bussel Date: Mon Jun 26 20:06:13 2023 +0200 Protocol Specification for Row ID High Water Mark using DomainMetadata This PR updates the protocol specification of Row Tracking to use `DomainMetadata` to store the high water mark instead of the special purpose `RowIdHighWaterMark` action. 
Closes delta-io/delta#1821 GitOrigin-RevId: 1e7124120b95e6c1e23d11f7322a435adfa5d673 commit 4383b7ca7650830d766c28f5d91eab4898f95213 Author: Pulkit Singhal Date: Mon Jun 26 22:49:27 2023 +0530 Filter the non-parquet files while processing the parquet footers Filter out the non-parquet files while reading the footers for inferring the schema GitOrigin-RevId: e2b67b4842dca4650d8780a39d504026602b89ce commit 3a4ac1bdb0182e32407f749582664ff525b9b748 Author: Tom van Bussel Date: Mon Jun 26 18:03:38 2023 +0200 Expose version as partition col in DeltaLogFileIndex This PR changes Delta Log replay to compute the file version only once per delta file, instead of once per every action in the delta files. It does this by exposing the commit version as a "virtual" partition column in `DeltaLogFileIndex`. Closes delta-io/delta#1864 GitOrigin-RevId: c049bfccf3985c10b32950a0db5e20a556a5ded2 commit dafbbc4fb719289a5469cd0000db8cc4b03297bb Author: Fredrik Klauss Date: Mon Jun 26 14:08:42 2023 +0100 Refactor clone metadata/protocol updates * Refactor metadata and protocol updates in CLONE to use helper methods. Currently all updates are part of the main code block in CLONE, making it difficult to understand the control flow. GitOrigin-RevId: 109f2333472067227e8070777c6a3f7d056c4b58 commit 52cc6a440a066eef49b343bebbb1fed8afd9c12e Author: Allison Portis Date: Wed Jun 28 12:22:48 2023 -0700 [Kernel] Hide internal packages from java doc generation This is used by the github action to publish the API docs Ran build/sbt kernelApi/doc locally and checked the index. Closes delta-io/delta#1872 commit acca47c3159a512f024624c74f1f6478200f6c9a Author: Tathagata Das Date: Tue Jun 27 18:12:08 2023 -0700 Updated the Kernel README Added more information about the Kernel project. 
Close delta-io/delta#1871 commit a2be73a9fa17dd56a5573565735af6d44d97f6cf Author: Anonymous <> Date: Wed Jun 28 08:59:41 2023 -0700 update version.sbt and flink/README v2.5.0 to 3.0.0 commit 6430fc07bf2a9bd3f55d14402a077adcd57643ba Author: Allison Portis Date: Tue Jun 27 20:51:16 2023 -0700 DV implementation for Kernel Support DVs in Kernel - Read the DeletionVectorDescriptors in AddFiles - Read DVs into RoaringBitmapArrays using the FileSystemHandler - Convert to a selection vector based on row_ids for each DataReadResult Closes delta-io/delta#1858 commit 54492016f0eb9c12d25cbbdcc85804fb0a5d9a3a Author: Scott Sandre Date: Tue Jun 27 14:26:53 2023 -0700 IcebergCompatV1 Protocol Specification ## Description This PR introduces a new table feature called `icebergCompatV1`. The goal of this table feature is to keep Delta tables in valid states (and block certain operations) such that Delta to Iceberg conversion can be applied to them through the Universal Format feature. This table feature has nothing to do with implementing that actual Delta to Iceberg metadata conversion. See the UniversalFormat PR here: https://github.com/delta-io/delta/pull/1870. Closes delta-io/delta#1869 GitOrigin-RevId: b84b9801a98f43cb7930c6681cdbdc1dc7a5fb9c commit 9b50cd206004ae28105846eee9d910f39019ab8b Author: Scott Sandre Date: Tue Jun 27 14:39:13 2023 -0700 Delta Universal Format (UniForm) allows you to read Delta tables with Iceberg clients. ## Description UniForm takes advantage of the fact that both Delta and Iceberg consist of Parquet data files and a metadata layer. UniForm automatically generates Iceberg metadata asynchronously, allowing Iceberg clients to read Delta tables as if they were Iceberg tables. You can expect negligible Delta write overhead when UniForm is enabled, as the Iceberg conversion and transaction occurs asynchronously after the Delta commit. A single copy of the data files provides access to both format clients. 
This PR adds the implementation for Universal Format (Iceberg) as well as the IcebergCompatV1 protocol validation. To create a table with UniForm: ```sql CREATE TABLE T(c1 INT) USING DELTA SET TBLPROPERTIES( 'delta.universalFormat.enabledFormats' = 'iceberg'); ``` To enable UniForm on an existing table ```sql ALTER TABLE T SET TBLPROPERTIES( 'delta.columnMapping.mode' = 'name', 'delta.universalFormat.enabledFormats' = 'iceberg'); ``` See the IcebergCompatV1 protocol specification PR here: https://github.com/delta-io/delta/pull/1869. New UT `iceberg/src/test/scala/org/apache/spark/sql/delta/ConvertToIcebergSuite.scala` as well as manual local publishing and integration testing with two spark shells, one loaded with Delta, the other with Iceberg. ## Does this PR introduce _any_ user-facing changes? Optional delta table property `delta.universalFormat.enabledFormats`. Closes delta-io/delta#1870 GitOrigin-RevId: 8a4723680b12bb112190ee1f94a5eae9c4904a83 commit 27111ee9fb0107d63fe550c6bd51019a0f3d0abb Author: Venki Korukanti Date: Thu Jun 22 13:33:45 2023 -0700 [Kernel] Delta table state reconstruction This PR is part of delta-io/delta#1783. It adds the Delta table state reconstruction and end-2-end API implementation. Integration tests with different types of Delta tables. Closes delta-io/delta#1857 commit 957c64db01b5869365f48519efcbbcb141d28747 Author: Scott Sandre Date: Mon Jun 26 11:15:19 2023 -0700 Import delta-connectors#559 Flink SQL documentationi Closes delta-io/delta#1866 Author: Scott Sandre Date: Mon Jun 26 11:02:39 2023 -0700 update version from 0.7.0 to 2.5.0 commit ed7304b1dc74229f71c1b80992c5700f409786d0 Author: Krzysztof Chmielewski Date: Wed Apr 26 13:33:38 2023 +0200 - Adding information about configuring Flink to work with Hive Cluster. 
Signed-off-by: Krzysztof Chmielewski commit 9c7fde5538e294d9638d438a573a57a83fc7faa4 Author: Krzysztof Chmielewski Date: Wed Apr 19 19:21:53 2023 +0200 - adding SQL doc #2 Signed-off-by: Krzysztof Chmielewski commit f3828195e21aebbf0547cd3d624a4d9524a39182 Author: Krzysztof Chmielewski Date: Fri Apr 14 12:39:44 2023 +0200 - adding SQL doc. Signed-off-by: Krzysztof Chmielewski commit 2ce6643c58e4d64ed87f3159d0fa48ef97723355 Author: Costas Zarifis Date: Fri Jun 23 17:56:13 2023 -0700 Minor refactor to DeltaSuite.scala GitOrigin-RevId: 1fad154a001354a3d55542c22f5aae9de9ba6d8c commit ead1ff0079173e15e9176850d11c1342161dc155 Author: Johan Lasperas Date: Fri Jun 23 23:34:10 2023 +0200 Improve generating and writing out changes in Merge This change is part of a larger effort to improve merge performance, see https://github.com/delta-io/delta/issues/1827 ## Description This change rewrites the way modified data is written out in merge to improve performance. `writeAllChanges` now generates a dataframe containing all the updated and copied rows to write out by building a large expression that selectively applies the right merge action to each row. This replaces the previous method that relied on applying a function to individual rows. Changes: - Move `findTouchedFiles` and `writeAllChanges` to a dedicated new trait `ClassicMergeExecutor` implementing the regular merge path when `InsertOnlyMergeExecutor` is not used. - Introduce methods in `MergeOutputGeneration` to transform the merge clauses into expressions that can be applied to generate the output of the merge operation (both main data and CDC data). This change fully preserves the behavior of merge, which is extensively tested in `MergeIntoSuiteBase`, `MergeIntoSQLSuite`, `MergeIntoScalaSuite`, `MergeCDCSuite`, `MergeIntoMetricsBase`, `MergeIntoNotMatchedBySourceSuite`.
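The idea of turning merge clauses into a single expression, rather than calling a function per row, can be illustrated with the sketch below. It is a simplified Python illustration, not the logic in `MergeOutputGeneration`: the function `build_output_expression`, the `(condition, assignments)` clause representation, and the `t.`/`s.` aliases for target and source are assumptions made for the example.

```python
def build_output_expression(column, clauses):
    """Build one SQL CASE expression for `column` from ordered merge clauses.

    `clauses` is a list of (condition_sql, assignments) pairs, where
    `assignments` maps column names to SQL value expressions. The first
    clause whose condition holds decides the output value, and rows that
    match no clause keep the existing target value (a copied row).
    """
    if not clauses:
        return f"t.{column}"
    branches = []
    for condition, assignments in clauses:
        # A clause that does not assign this column leaves it unchanged.
        value = assignments.get(column, f"t.{column}")
        branches.append(f"WHEN {condition} THEN {value}")
    return f"CASE {' '.join(branches)} ELSE t.{column} END"
```

Because the result is a plain expression, it can be evaluated by the query engine for the whole dataframe at once, which is the kind of benefit the commit attributes to the rewrite.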
Closes delta-io/delta#1854 GitOrigin-RevId: d8c8a0e9439c6710978f2ec345cb94b2b9b19e0e commit 3c264fa764216914d3f52f29a556f2eadd45533a Author: Fredrik Klauss Date: Fri Jun 23 11:03:56 2023 +0200 Skip reading commit version in streaming if all addFiles in version processed * Fix a bug that breaks an optimization in Streaming to skip reading a commit after all AddFiles within that commit have been processed. * Currently, we mark an AddFile as the last AddFile within a commit if it is the last action. Instead, we should mark it as the last AddFile when it is the last AddFile within a commit. To do so, we add a sentinel AddFile at the end, and check whether the next AddFile in the iterator is the sentinel, and then mark the current AddFile as the last. GitOrigin-RevId: fdb692bf3e65091871123fc9dc40ae7e2815a5bd commit b10430953ad605b1df1196066bf2e86b8ef88396 Author: Johan Lasperas Date: Fri Jun 23 08:59:31 2023 +0200 Use Merge Insert-only path with multiple NOT MATCHED clauses and merges with only inserted rows ## Description This change improves the insert-only code path in merge to be able to support more than one `NOT MATCHED clause`. It also leverages the insert-only path for merges that have arbitrary clauses but effectively only have rows inserted after the `findTouchedFiles` step. - Added test cases specifically covering merge with multiple `NOT MATCHED` clauses and merge with `MATCHED`/`NOT MATCHED` where only the `NOT MATCHED` clause is satisfied. - Added a test to check that the insert-only code path is used when there are only inserted rows after `findTouchedFiles` by looking at the recorded operations. - Insert-only merge are otherwise already covered by a number of different existing tests. 
Closes delta-io/delta#1852 GitOrigin-RevId: 569b1fb706f95581d4670942e3e3678b76828e29 commit cfeec84b18e02f9f22eb5c1bef67b4689dcab150 Author: Fredrik Klauss Date: Thu Jun 22 12:57:00 2023 +0200 Allow maintenance operations to downgrade to snapshot isolation with row tracking * Row IDs handle correct serialization of row IDs during conflict checking, so should not hold maintenance operations from downgrading their isolation level. * Allow maintenance operations to downgrade to snapshot isolation with row tracking by ignoring the row ID high watermark action when looking for non-file actions. GitOrigin-RevId: c52248d496a3770ffbd984771f6f5be37b3d19a9 commit 4da9bf4a6072136e5980ac2752da56f4487e6954 Author: Allison Portis Date: Thu Jun 22 19:49:36 2023 -1000 [Kernel] Introduce the row_index column to Structfield and update ParquetBatchReader to populate it - Introduces the row_index column to StructField. For now, we indicate a metadata column in the field metadata with `{"isMetadataColumn" : "true"}`. - Updates `ParquetBatchReader` to track any file row index columns using [ParquetReader.getCurrentIndex()](https://javadoc.io/doc/org.apache.parquet/parquet-hadoop/1.12.3/org/apache/parquet/hadoop/ParquetReader.html#getCurrentRowIndex()). Adds a test to TestParquetBatchReader. Existing tests check for scenario when the row_index column is not requested. Closes delta-io/delta#1856 commit 73ac6d9421b6343ca458697190bbcbc1622b416e Author: Venki Korukanti Date: Sun Jun 18 09:41:53 2023 -0700 [Kernel] Add Default client implementations This PR is part of delta-io/delta#1783. Following client implementations for the default module are added: * `JsonHandler` * `ExpressionHandler` * `FileSystemClient` and the supporting classes. UTs Closes delta-io/delta#1843 commit 963b1cdc401508eb1ac284b8bee42c5582173951 Author: Jiaheng Tang Date: Wed Jun 21 14:07:44 2023 -0700 Add `getMessageParameters` to `DeltaUnsupportedOperationException` to make `checkError` able to check parameters. 
GitOrigin-RevId: f365a54a0e05c4cdddb5f87bc473cf12db5f3040 commit 60a582b51fff28aa73b34153b3450c136e1ce96b Author: Tathagata Das Date: Wed Jun 21 12:00:06 2023 -0400 Update the build and test infra for connectors ## Description - Updated build.sbt to add new subprojects for connectors. - Updated the unidoc settings to make sure it ignores all the new kernel and connectors code. - Moved unidoc generation testing into an explicit separate step (explicitly call `sbt test unidoc` from run-tests.py) instead of implicitly via sbt task dependencies. - Disabled code coverage because it was having problems with scala-2.13 builds and we were not really using it anyways. - Added a new Github action for running all connector tests by copying the existing action from the connectors dir. - Renamed Spark test Github action file from tests.yaml to spark_tests.yaml - Increased SBT's Java heap size with .sbtoptions New and existing Github actions Closes delta-io/delta#1845 GitOrigin-RevId: d195e7123e37aae67d117d434139f2f28e3c6f2b commit 7a80c800515162e2fc68a6884b92546496acd878 Author: Prakhar Jain Date: Tue Jun 20 13:57:57 2023 -0700 Minor refactoring to CheckpointProvider Minor refactoring to CheckpointProvider GitOrigin-RevId: fa25d45e8e4adb74b81384aba9c67480605c011d commit ad3bc96cefff00d8ceb2f9b6f9c365a22e35623d Author: Scott Sandre Date: Tue Jun 20 12:49:57 2023 -0700 Minor refactor to CreateDeltaTableCommand. This PR removes an unused import. GitOrigin-RevId: e61ce04ed5a6bd4fe9b9fec00e7f6fc5e5318346 commit 2271132b3bab7a84e08e1307a8944b028e01bccc Author: Fredrik Klauss Date: Tue Jun 20 18:36:56 2023 +0200 Ensure no table features are enabled by default metadata * Add test to ensure that no table feature is automatically enabled by the default metadata. Otherwise, this will break reading from existing tables that do not support that table feature. 
GitOrigin-RevId: 1e4e7f034148de1f2a1fa5feb7be65f8f0b8af7c commit 7e51538dee68bff4775220af11e1b7c321d5b6ab Author: Lukas Rupprecht Date: Tue Jun 20 07:55:53 2023 -0700 Fixes wrong path concatenation in DeltaLog Uses the recently introduced DeltaTableUtils.safeConcatPaths to fix path concatenation in Delta to be able to deal correctly with URIs with an empty path component. GitOrigin-RevId: 6c1955134439ebb67b133e502af4efc9073a6482 commit 04a29a4f7e898b49bc8dacbb3d93cbee74ad6dc7 Author: Venki Korukanti Date: Sun Jun 18 08:28:31 2023 -0700 [Kernel] Default Parquet reader implementation This PR is part of delta-io/delta#1783. It implements a Parquet reader based on `parquet-mr` and generates the output as columnar batches of `ColumnVector` and `ColumnarBatch` interface implementations. UTs Closes delta-io/delta#1846 commit cb89436e5421e08c6194d837934aaf1f7620faad Author: Venki Korukanti Date: Thu Jun 15 23:02:31 2023 -0700 [Kernel] Add missing data types and support for data type (de)serialization This PR is part of delta-io/delta#1783. It adds additional data types supported by the Delta Lake protocol that were missing from the interfaces PR delta-io/delta#1808. It also adds serialization and deserialization of the table schema represented as `StructType`. UTs Closes delta-io/delta#1842 commit 8dc900004476d6dc4b6f294cff30eda5c6455f80 Author: Tathagata Das Date: Wed Jun 21 00:17:12 2023 -0400 Another trivial flink test fix Fix test helper method so that only .checkpoint.parquet files are returned (and not, for example, .crc files) Closes delta-io/delta#1850 commit e1a84d24f12dae7ddbeed1ff6aaccebfc8bbf39f Author: Tathagata Das Date: Tue Jun 20 19:37:05 2023 -0400 Trivial flink test fix getDeltaCheckpointFiles should just explicitly filter for file names that end with .checkpoint.parquet. We don't want to accidentally read .crc files and return true.
Closes delta-io/delta#1848 commit 7fee6f947dfa12d19d26356aebe6333f44c3c831 Author: Venki Korukanti Date: Mon Jun 19 21:52:07 2023 -0700 [Delta Kernel] Additional build changes ## Description These are few additional changes needed for Kernel module. N/A GitOrigin-RevId: 0fd92dc438cc06cd7c585bd4180f19a02b5a9a04 commit fc39f78d5d451dac02666c9e146dfe704a97c0a6 Author: Paddy Xu Date: Mon Jun 19 17:20:08 2023 +0200 Improve DV path canonicalization ## Description This PR improves the FILE_PATH canonicalization logic by avoiding calling expensive `Path.toUri.toString` calls for each row in a table. Canonicalized paths are now cached and the UDF just needs to look it up. Future improvement is possible for handling huge logs: build `canonicalizedPathMap` in a distributed way. Related PR target the 2.4 branch: https://github.com/delta-io/delta/pull/1829. Existing tests. Closes delta-io/delta#1836 Signed-off-by: Paddy Xu GitOrigin-RevId: c4810852f9136c36ec21f3519620ca26ed12bb04 commit 6583b2e0771529b180a3b4c4e984e051b2092e77 Author: panbingkun Date: Mon Jun 19 14:46:22 2023 +0000 Improve error message quote identifiers correctly in the error message GitOrigin-RevId: 6e80340a7c8e5abfbdbe31e87faf2c93e07ae30b commit ea680a9ff71d3b023a8de6a427f4f2a3fcfb8ff9 Author: Wenchen Fan Date: Sat Jun 17 10:32:51 2023 +0000 Prepare to support the default column value Add necessary code to support column default value. We will officially enable this feature after Spark 3.5 is released. GitOrigin-RevId: 820e1608162cb32b2fdecb155cdf6a655ad1ed8b commit 7a34e866ae2ec4965804559346c479ef7c25cddf Author: Junyong Lee Date: Fri Jun 16 22:32:38 2023 -0700 Refactor Vacuum command snapshot construction into a helper VacuumCommand contains a lot of logic to digest, this PR adds a helper to create a valid file from given snapshot. 
GitOrigin-RevId: 82966fb8a41bc8e3eb7cd00ac76a798cf53f2048 commit a5341a5675aad33b80f9bdca29c0be814cea87c5 Author: Jackie Zhang Date: Fri Jun 16 18:18:17 2023 -0700 Fix misc issues with Delta source schema tracking. GitOrigin-RevId: 6496df10dd2b432181ff68af39d355ffeaa4b754 commit 82bfefb60b2b0517ef9496c6ee1fa741431809a4 Author: jintao shen Date: Fri Jun 16 15:21:19 2023 -0700 Made a small change to Snapshot.scala to force create domainMetadata in VersionChecksum. GitOrigin-RevId: 490947ce3de5a7bf1141968ab224241c9f37a00a commit d9a5f9f9c64b5066467801917de158650ae091cb Author: Tathagata Das Date: Fri Jun 16 12:09:52 2023 -0700 Re-enabled the iceberg module using iceberg-1.3 on spark 3.4 ## Description Re-enabled the iceberg module using iceberg-1.3 on spark 3.4 Existing unit tests. ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1820 Signed-off-by: Scott Sandre GitOrigin-RevId: c48bfb7c3f4bc3d28b99f31d1d3d6d60afd2fbad commit cebedd983953a13a2127305a269ce3df22aaa08f Author: Johan Lasperas Date: Thu Jun 15 14:54:36 2023 +0200 Track individual merge steps using recordMergeOperation ## Description This change adds finer-grained tracking of individual MERGE steps (`findTouchedFiles`, `writeAllChanges`, `writeInsertsOnlyWhenNoMatchedClauses`) by recording additional merge sub-operations. The helper method `recordMergeOperation` is improved to include an operation type. Adding tests to check tracking works as expected for upserts and insert only merges. Closes delta-io/delta#1841 GitOrigin-RevId: 265a186092dbbfd0db810f9adaaed54eee4252e6 commit 51e7b2ba24d753a9df6620fa931d7d124ad55f61 Author: Johan Lasperas Date: Wed Jun 14 19:57:54 2023 +0200 Replace UDF to increment metrics in DML commands with native IncrementMetric expression ## Description This change introduces a new expression `IncrementMetric` to support the frequent use case of incrementing metrics in Merge, Update and Delete. It replaces the use of UDFs to achieve that goal.
Added IncrementMetricSuite to cover usage of the newly introduced expression. Metrics correctness is covered in existing test suites `MergeIntoMetricsBase`, `UpdateMetricsSuite` and `DeleteMetricsSuite`. Closes delta-io/delta#1828 GitOrigin-RevId: 746d761b5d2a51651e9d1df0eb1fcae0280791fa commit 8ea0c89fa6f06f5fc101e3d013edf1981105baa5 Author: Desmond Cheong Date: Wed Jun 14 08:54:04 2023 -0700 Minor refactor GitOrigin-RevId: 0c8d7b0a0b80dcb3fd715dff19f6d6f0fda4fd41 commit f46809fe8d1be972681074d21633f74c27d8a1d6 Author: Johan Lasperas Date: Wed Jun 14 14:09:57 2023 +0200 Refactor collecting merge stats ## Description This change is a plain refactor that will help future work to improve overall merge performance, see https://github.com/delta-io/delta/issues/1827 It creates a base merge class that gathers functionality shared by the current insert-only and classic merge code paths to allow splitting them in a following PR. Shared code to collect merge statistics is added there already. This is a non-functional refactor covered by extensive merge tests, esp. MergeIntoMetricsBase tests. Closes delta-io/delta#1834 GitOrigin-RevId: caf346b4136e6738e30bd15219eaaeabbd833bd5 commit f7e852a842bcfb35a1937baf8b5cf60583026598 Author: Fredrik Klauss Date: Wed Jun 14 11:29:34 2023 +0200 Refactoring of commitLarge Add an empty map to commitLarge that gets committed GitOrigin-RevId: 2378d498c117ba69cf1c3b21f0540fb99e899908 commit 15884ade679f5a6296eff7184b76edd57583d425 Author: Ala Luszczak Date: Wed Jun 14 11:27:59 2023 +0200 Tests counting number of executed jobs should ignore failing jobs. Under heavy load Spark jobs can occasionally fail, but then get retried and succeed on a second attempt. In such cases, tests checking the number of jobs received an unexpected count. This change addresses the issue.
GitOrigin-RevId: 2cd7accd446f55e9552e242aee685ffd22d0c5d2 commit 826ca24449ac58e50f33860421d49a2b7647d5b6 Author: Mohamed Zait Date: Tue Jun 13 23:57:34 2023 -0700 Minor refactor to DeltaOperations.scala GitOrigin-RevId: 29b9ae846f269fc033e1dfce7f20e5dddcfefba1 commit 02990faf1f3cd03e863f4dc37ec555b25924c260 Author: Ming DAI Date: Tue Jun 13 14:30:43 2023 -0700 Use a different table name in test to avoid conflict GitOrigin-RevId: b6d1b788754f872fb75842e261cde9334f472bd1 commit 8d6c1fe84b32c7ed1418bba2cbfd575fad64f5cc Author: Jiaheng Tang Date: Tue Jun 13 13:21:36 2023 -0700 Track numOfDomainMetadatas in commitLarge We currently track numOfDomainMetadatas in doCommit but missed it in commitLarge. This change adds numOfDomainMetadatas to commitStats for commitLarge path. GitOrigin-RevId: fc50a467d58f595abf684bb28977ebab7aa204b9 commit 0bb113629e7ccdb74565ae329df39ce7f4c66e41 Author: Jiaheng Tang Date: Tue Jun 13 13:01:15 2023 -0700 Add helper for handling domain metadata for RESTORE table Add helper function handleDomainMetadataForRestoreTable to help with handling domain metadata for RESTORE table. GitOrigin-RevId: 5dae6a60a7fdb62c69dea88ed20992849523a178 commit f38c544da1f5f9c5f98a0dca31586c37d8304210 Author: Tom van Bussel Date: Mon Jun 12 16:58:04 2023 +0200 Small refactoring in UpdateCommand GitOrigin-RevId: 59b972c267f43fb97d5ee432d3497b3a26d2822a commit 0a34f0f193c83d93d54d47b942ee992d971eaf0c Author: Terry Kim Date: Fri Jun 9 22:54:45 2023 -0700 Fix the default behavior of Domain metadata's lifespan for REPLACE TABLE By default, existing domain metadata should “survive” for the `REPLACE TABLE` operation (similar to what AppId is already doing). Also, we introduce a list of metadata domains (an empty list for now as a placeholder) that should be removed by default (e.g., table features require some specific system domains). 
GitOrigin-RevId: f790296072120fa34fe535c35674be9dbb1e1eed commit f7ce183c72d8f58eb5cad7f90b4fb59034db49cd Author: Terry Kim Date: Fri Jun 9 10:49:26 2023 -0700 Use String instead of Map[String, String] for DomainMetadata.configuration The change also provides a generic framework to convert from a scala class to DomainMetadata. Also, the Delta protocol is changed in this PR. Closes delta-io/delta#1818 GitOrigin-RevId: 572d6022e0a0b6da0ac5e8589085e70ff59b3c2c commit 31cee975445756d0deeed4006ed55aaede38429b Merge: 0bec32820 6278da543 Author: Tathagata Das Date: Mon Jun 19 21:28:37 2023 -0600 [#1824] Merging Delta Connector code into this repo with full commit history (#1837) ## Description In this PR, I am merging the entire codebase of Delta Connectors repository into the connectors/ subdirectory. I have maintained full commit history up to current connectors master (https://github.com/delta-io/connectors/commit/47ae5a3540d3e9400b8140e460b74e09343b0497). This is the first step in the process to unify these 2 repos. See #1824. ## How was this patch tested? Not yet tested. The test infra is unable to handle a diff of this large size. I am just merging the code. I have ensured that all the changes in this PR are in the connectors/ directory (which did not exist before) and therefore will not affect any existing code. I plan to merge this PR and then make follow up PRs to get it integrated in the main build and tested.
commit 6278da543276943b9686a17b6e54cd09d81612ce Author: Tathagata Das Date: Wed Jun 14 10:43:11 2023 -0600 undo changes outside connectors repo commit 96a60ca44c66c5816c78520db7ec9cbc892c0a96 Merge: 6cd5b87ca 0bec32820 Author: Tathagata Das Date: Wed Jun 14 10:07:26 2023 -0600 Merge remote-tracking branch 'origin/master' into connectors3 commit 0bec32820e95e80b6accd2c88e4d532b1247497d Author: Bart Samwel Date: Fri Jun 9 17:50:35 2023 +0200 Revert of commit 579a3151db611c5049e5ca04a32fc6cccb77448b This was a breaking change GitOrigin-RevId: dbfdb9c29c6134b97e8ace925e5751cea070ea5c commit 8ccbfd1c6919c9cdf0dcbc05b06d50209f2abb29 Author: Bart Samwel Date: Fri Jun 9 15:03:45 2023 +0200 Revert: Prohibit spark.databricks.delta.dynamicPartitionOverwrite.enabled=false when DPO requested Revert for commit 2207930656a1508651bbec8e1ca2c9f2c6691f7d This was a breaking change. GitOrigin-RevId: 763248327e3d13ecbfdd0c64dfe19d88497f1522 commit afaf27b3d5076abb2eed5b73a4ce196707b6bf36 Author: Jiaheng Tang Date: Thu Jun 8 22:41:30 2023 -0700 Minor refactor to DeltaOperations.scala and WriteIntoDelta.scala GitOrigin-RevId: c5c99c752e74c286468d1d3386249a9608edc391 commit 19bfa9fc32907512f7bdb8ad1fcec1296d0548b0 Author: Scott Sandre Date: Thu Jun 8 21:30:14 2023 -0700 Minor refactor to actions.scala (Protocol) and OptimisticTransaction.scala GitOrigin-RevId: 1f561f3433de2919465be001241786f8eb58e8c3 commit 78b50a3fb9c53fe5ed925c06dee504d8f8cc8eb9 Author: Tom van Bussel Date: Thu Jun 8 16:49:57 2023 +0200 Minor refactoring GitOrigin-RevId: b50bca76746f85856a058a8e6ec7e22aa8b7313a commit 377181ce9d746b29b2c178238a84e4798ed7a8d9 Author: Jackie Zhang Date: Wed Jun 7 14:18:02 2023 -0700 Detect read-incompatible physical name change for batch scan. 
GitOrigin-RevId: c91177467d885447d564c78f19faa4a10796ff89 commit 9cdd748cbb4381d64896f841f8d56a2327655428 Author: Tom van Bussel Date: Wed Jun 7 19:19:03 2023 +0200 Use DomainMetadata to store Row ID High Water Mark This PR removes the `RowIdHighWaterMark` action by using the `DomainMetadata` action instead. As part of doing this it adds a dependency between the row tracking and domain metadata table features. Closes delta-io/delta#1815 GitOrigin-RevId: 5009f531b22a107b90697c358341ed8b2f29b5ac commit 420f8c5fb155fbdd8bd188a16512eb90ca30b98a Author: Prakhar Jain Date: Tue Jun 6 20:13:59 2023 -0700 Make checkpointProvider in LogSegment non-optional. Same as title. This is to avoid `logSegment.getCheckpointProvider.map(_.method).getOrElse(..)` pattern which were getting common in code. Closes https://github.com/delta-io/delta/pull/1817 GitOrigin-RevId: 180594b82a38bea687677b99a78e31de33bc2f7b commit 1fb363c91f90a9c10321f3d820ccaa74e27c544b Author: Allison Portis Date: Tue Jun 6 19:05:26 2023 -0700 [Delta Kernel] Add github action to publish the javadocs for KernelApi ## Description Adds a github action that can be manually triggered to publish the javadocs using github pages. This was adapted from the starter workflow suggested by github actions (when you go to the Pages settings and select github actions to deploy). Merged this code into my own fork and ran the github actions. 
Action run: https://github.com/allisonport-db/delta/actions/runs/5193716240 Docs: https://allisonport-db.github.io/delta/snapshot/kernel-api/java/api/index.html Closes delta-io/delta#1819 GitOrigin-RevId: f54c3f5e782e29633aab9a44d0ec8d6a3a673883 commit 6cd5b87ca57d6ed05c5550ac837323017705fb08 Author: Tathagata Das Date: Mon Jun 5 03:21:56 2023 -0400 integrated connectors into main build.sbt, compile and test:compile passing commit 995b16b7761afe87e22ffb76eb9d0f01de750d97 Merge: 0b6ae9233 a03ed3296 Author: Tathagata Das Date: Tue Jun 6 18:34:15 2023 -0400 Merge remote-tracking branch 'origin/master' into connectors3 commit 0b6ae9233ea34b4337e8b594be772d47980f92d9 Author: Tathagata Das Date: Tue Jun 6 18:30:51 2023 -0400 moved all connectors code to connectors dir commit 47ae5a3540d3e9400b8140e460b74e09343b0497 Author: Krzysztof Chmielewski Date: Mon Jun 5 17:03:44 2023 -0700 Flink SQL/Catalog Support (#555) * [FlinkSQL_PR_1] Flink Delta Sink - Table API UPDATED (#389) Signed-off-by: Krzysztof Chmielewski Signed-off-by: Krzysztof Chmielewski Signed-off-by: Krzysztof Chmielewski Co-authored-by: Paweł Kubit Co-authored-by: Krzysztof Chmielewski * [FlinkSQL_PR_2] - SQL Support for Delta Source connector. (#487) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_3] - Delta catalog skeleton (#503) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_4] - Delta catalog - Interactions with DeltaLog. Create and get table. (#506) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_5] - Delta catalog - DDL option validation. (#509) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_6] - Delta catalog - alter table + tests. (#510) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_7] - Delta catalog - Restrict Delta Table factory to work only with Delta Catalog + tests. (#514) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_8] - Delta Catalog - DDL/Query hint validation + tests. 
(#520) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_9] - Delta Catalog - Adding Flink's Hive catalog as decorated catalog. (#524) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_10] - Table API support SELECT with filter on partition column. (#528) * [FlinkSQL_PR_10] - Table API support SELECT with filter on partition column. --------- Signed-off-by: Krzysztof Chmielewski Co-authored-by: Scott Sandre * [FlinkSQL_PR_11] - Delta Catalog - cache DeltaLog instances in DeltaCatalog. (#529) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_12] - UML diagrams. (#530) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_13] - Remove mergeSchema option from SQL API. (#531) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_14] - SQL examples. (#535) Signed-off-by: Krzysztof Chmielewski * remove duplicate function after rebasing against master --------- Signed-off-by: Krzysztof Chmielewski Signed-off-by: Krzysztof Chmielewski Co-authored-by: kristoffSC Co-authored-by: Paweł Kubit Co-authored-by: Krzysztof Chmielewski commit f17a6fd3bcddb923705c23e1a50b787e7636a01e Author: Tathagata Das Date: Tue Jun 6 18:07:10 2023 -0400 Revert "Flink SQL/Catalog Support (#555)" This reverts commit e036171db2e46c506a3dd4670039a50b91461dff. commit a03ed32969198f20f79d5670fe7111b04c20f93c Author: Allison Portis Date: Tue Jun 6 14:30:38 2023 -0700 [Delta Kernel] Add Delta Kernel java interfaces Adds the initial java interfaces for Delta Kernel #1783. Also adds javastyle checks and some javadoc settings. N/A. Only adds interfaces. 
Closes delta-io/delta#1808 commit 8e8ef793baffab3d745bef3a11bc5ee0cd7f41f6 Author: maryannxue Date: Tue Jun 6 09:31:03 2023 -0500 Minor refactor to InvariantViolationException.scala GitOrigin-RevId: f6a4fe88f8fe6e2b1ce248a2bff0d655cac38a18 commit a655bed59e436dd0e9500a02441e65bfa9f103d6 Author: Carmen Kwan Date: Tue Jun 6 11:23:06 2023 +0200 Fix protocol check in DeltaColumnMapping.verifyAndUpdateMetadataChange This PR patches the check in DeltaColumnMapping.verifyAndUpdateMetadataChange. Table protocols should not need to support both reader table features and writer table features for it to support Column Mapping. For example, if a table has protocol version minReaderVersion=2 and minWriterVersion=7, it should be able to upgrade to column mapping. Right now, the system asks the user to upgrade to minReaderVersion=3, which is unnecessary. Resolves delta-io/delta#1816 GitOrigin-RevId: d527241a2b1f7923a672babd9cfc86089d6f6195 commit ccd3092da05a68027bf9be9ec4273a810b4b9ef3 Author: Tathagata Das Date: Tue Jun 6 01:28:33 2023 -0400 Integrate Delta Kernel project into Delta main SBT build and rename Delta Spark Maven artifact The Delta Kernel sub project is currently disconnected from the main build. This PR updates the main build to compile the kernel project as well. Specifically, it does the following changes 1. Add the kernel subprojects in the main build.sbt 2. Define groups of the subprojects, "spark" and "kernel" in build.sbt such that each group can be easily tested while disabling tests for all other groups. 3. Update the run-tests.py with support for specifying test groups, so that different test build (Github actions) can test each group just by calling run-tests.py with different groups. 4. Updated github actions for kernel to use this. 
In addition, to the kernel related changes, this PR also updates the delta on spark build to be named more appropriately; instead of delta-core, it should be delta-spark (similar to delta-flink and delta-hive in the connectors repo). The `core` directory has been updated to `spark` as well. Closes delta-io/delta#1814 GitOrigin-RevId: 5fd0ce3c148783525cee7af34763385b7c5334d6 commit 7696e0437918c94c503d32298e40c0ebd6d02141 Author: Lukas Rupprecht Date: Mon Jun 5 16:18:50 2023 -0700 Rewrite OPTIMIZE to use Spark Table Resolution This PR is part of https://github.com/delta-io/delta/issues/1705. It rewrites the Delta OPTIMIZE command to use Spark's table resolution logic instead of resolving the target table manually at command execution time. For that, it changes OptimizeTableCommand from a LeafRunnableCommand to a UnaryLike command that takes a UnresolvedTable as a child plan node, which will be resolved by Spark. In addition, it also introduces a UnresolvedPathBasedDeltaTable LogicalPlan for the case that OPTIMIZE is run against a raw path. This new logical plan is an indication that the target table specified as a path so no table resolution is necessary. Through the large set of existing unit tests that OPTIMIZE already has + adapted the DeltaSqlParserSuite to work with the new OptimizeTableCommand. N/A Fixes https://github.com/delta-io/delta/issues/1705. Closes https://github.com/delta-io/delta/pull/1708. GitOrigin-RevId: 33cb734ec9c9082aa29de790f6fc5e5959705d93 commit cdcd32377ecad09c1b6d49dd17c9b8a934600409 Author: Ole Sasse Date: Mon Jun 5 21:38:09 2023 +0200 Fix a bug in RowTracking conflict resolution Fix a bug where a concurrent commit that did not bump RowIdHighWaterMark was not handled correctly GitOrigin-RevId: 5fa1b2f34d8448e0eeb0350265a65afd1bcc1054 commit ba9b66a1bab1296b38eaae59f2b290830ccb8a03 Author: jintao shen Date: Mon Jun 5 10:44:06 2023 -0700 Minor change for /org/apache/spark/sql/delta/OptimisticTransaction.scala to support required features. 
GitOrigin-RevId: 7081ebf491817b591110619fada72d78297c5973 commit 7d78ff26ec3298f93587123edee9698a8c62cdc8 Author: Terry Kim Date: Mon Jun 5 10:38:55 2023 -0700 Minor refactor to CreateDeltaTableCommand.scala for making "isReplace" a class-level "private def". GitOrigin-RevId: 0eba89e6369768c0ca26be13962a145307d11400 commit 4f448d101ae303cbc60407c7d2af395af8739ba5 Author: Yuya Ebihara Date: Sun Jun 4 17:16:21 2023 -0700 rename `timestampNTZ` feature to `timestampNtz` in PROTOCOL.md ## Description Rename `timestampNTZ` feature to `timestampNtz` in PROTOCOL.md. Ref https://github.com/delta-io/delta/blob/d74cc6897730f4effb5d7272c21bd2554bdfacdb/core/src/main/scala/org/apache/spark/sql/delta/TableFeature.scala#LL310C35-L310C35 ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1766 Signed-off-by: Rahul Mahadev GitOrigin-RevId: e766c1c779cd922442c8afaa24054f6ab150f88a commit 84c869c5f7c555118ba0b3f4fb9c6a35cb16d81b Author: lzlfred Date: Fri Jun 2 12:03:55 2023 -0700 support write all data files to "data" subdir with sql conf Add the support to write parquet data files to data/ subdir for Delta table via sql conf. Closes delta-io/delta#1810 GitOrigin-RevId: 735d1a0d6f196cc98b4afce08ad045fb4343dad9 commit e036171db2e46c506a3dd4670039a50b91461dff Author: Scott Sandre Date: Mon Jun 5 17:03:44 2023 -0700 Flink SQL/Catalog Support (#555) * [FlinkSQL_PR_1] Flink Delta Sink - Table API UPDATED (#389) Signed-off-by: Krzysztof Chmielewski Signed-off-by: Krzysztof Chmielewski Signed-off-by: Krzysztof Chmielewski Co-authored-by: Paweł Kubit Co-authored-by: Krzysztof Chmielewski * [FlinkSQL_PR_2] - SQL Support for Delta Source connector. (#487) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_3] - Delta catalog skeleton (#503) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_4] - Delta catalog - Interactions with DeltaLog. Create and get table. (#506) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_5] - Delta catalog - DDL option validation. 
(#509) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_6] - Delta catalog - alter table + tests. (#510) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_7] - Delta catalog - Restrict Delta Table factory to work only with Delta Catalog + tests. (#514) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_8] - Delta Catalog - DDL/Query hint validation + tests. (#520) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_9] - Delta Catalog - Adding Flink's Hive catalog as decorated catalog. (#524) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_10] - Table API support SELECT with filter on partition column. (#528) * [FlinkSQL_PR_10] - Table API support SELECT with filter on partition column. --------- Signed-off-by: Krzysztof Chmielewski Co-authored-by: Scott Sandre * [FlinkSQL_PR_11] - Delta Catalog - cache DeltaLog instances in DeltaCatalog. (#529) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_12] - UML diagrams. (#530) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_13] - Remove mergeSchema option from SQL API. (#531) Signed-off-by: Krzysztof Chmielewski * [FlinkSQL_PR_14] - SQL examples. (#535) Signed-off-by: Krzysztof Chmielewski * remove duplicate function after rebasing against master --------- Signed-off-by: Krzysztof Chmielewski Signed-off-by: Krzysztof Chmielewski Co-authored-by: kristoffSC Co-authored-by: Paweł Kubit Co-authored-by: Krzysztof Chmielewski commit 61689859dc1e91f012f84c38ebb8be7547ffc7c0 Author: kristoffSC Date: Fri Jun 2 23:50:31 2023 +0200 [Test][Flink] - Recovery from savepoint test (#516) Signed-off-by: Krzysztof Chmielewski commit 767970f4002efe0cdd4739414ba9459c379984fe Author: Allison Portis Date: Fri Jun 2 11:26:08 2023 -0700 Update github actions for Delta Kernel Based off of #1785. Updates our current `test.yaml` to not run when there are only changes in `kernel/`. Adds a separate action for future `kernel/` tests. Currently this just compiles the code. 
### Testing I merged these changes to my own fork and created these PRs to check https://github.com/allisonport-db/delta/pull/11 Changes to `core/src/main/scala/org/apache/spark/sql/delta/Snapshot.scala` - https://github.com/allisonport-db/delta/actions/runs/5125555163/jobs/9218878662?pr=11 - https://github.com/allisonport-db/delta/actions/runs/5125555163/jobs/9218878794?pr=11 https://github.com/allisonport-db/delta/pull/12 Changes to `build.sbt` - https://github.com/allisonport-db/delta/actions/runs/5125557362/jobs/9218883935?pr=12 - https://github.com/allisonport-db/delta/actions/runs/5125557362/jobs/9218884043?pr=12 https://github.com/allisonport-db/delta/pull/13 Changes to `examples/scala/src/main/scala/example/ChangeDataFeed.scala` - https://github.com/allisonport-db/delta/actions/runs/5125558258/jobs/9218886849?pr=13 - https://github.com/allisonport-db/delta/actions/runs/5125558258/jobs/9218886959?pr=13 https://github.com/allisonport-db/delta/pull/14 Changes to `run-tests.py` - https://github.com/allisonport-db/delta/actions/runs/5125560869/jobs/9218892247?pr=14 - https://github.com/allisonport-db/delta/actions/runs/5125560869/jobs/9218892419?pr=14 https://github.com/allisonport-db/delta/pull/15 Changes to `kernel/build.sbt` (Spark tests SHOULD NOT run) - https://github.com/allisonport-db/delta/actions/runs/5125973036/jobs/9219853307?pr=15 (skipped) - https://github.com/allisonport-db/delta/actions/runs/5125973036/jobs/9219853403?pr=15 (skipped) Closes delta-io/delta#1786 Signed-off-by: Allison Portis GitOrigin-RevId: cd5449a905c057f61f66afd7dfd5e5cb19348270 commit a4ceca2249a918b8c0a221298fad1b4bf8da65ac Author: Ryan Johnson Date: Fri Jun 2 07:08:45 2023 -0700 Use deletion vector schema from log schema instead of hand-coding it Today, `DeletionVectorDescriptor.STRUCT_TYPE` schema is hand-coded, and has to be updated manually whenever the schema changes. 
Eliminate the possibility that the two can get out of sync, by extracting the schemas automatically from the actual log schema. GitOrigin-RevId: 2be0b2b867813ac0d9e8d57458a2736b9d5cd19c commit c73a1ab357886ba3d7cebf81858ada7660c82375 Author: Lars Kroll Date: Fri Jun 2 14:59:34 2023 +0200 Minor refactoring to DeltaSparkPlanUtils. GitOrigin-RevId: 40259a9d19e38e6e27111189c01f101c1073b24f commit 79530dba8b079ab0ff12d2a233250bc6305e10ff Author: jintao shen Date: Thu Jun 1 11:59:30 2023 -0700 Minor refactor to table feature code GitOrigin-RevId: f08d2570b863320cb6cbe7f026c3ca33aeead209 commit 3822a443d7f8851fcc6d11ff66a32215077a758f Author: Ole Sasse Date: Thu Jun 1 19:01:09 2023 +0200 Conflict resolution when concurrently enabling Row Tracking Add conflict resolution and testing for scenarios in which the Row Tracking table feature gets enabled concurrently with another transaction and conflict resolution has to happen. This PR focuses on AddFiles actions that are concurrently added, RemoveFile actions will be handled in a separate PR. Closes https://github.com/delta-io/delta/pull/1807 GitOrigin-RevId: 20d95203aa55d942d3703e90cb8fd5d03740b83c commit adb1e8d066a36d3cc6f2ec2f84782b57983c9535 Author: Fredrik Klauss Date: Thu Jun 1 13:39:28 2023 +0200 Remove ROW_IDS_ALLOWED flag * Remove the ROW_IDS_ALLOWED flag to test as if row tracking would be enabled * Change test suites to either commit AddFiles with statistics Closes delta-io/delta#1805 GitOrigin-RevId: 1030d32e12bdbf2f2b034603e63cb31b03a234b1 commit c8313527039d7df2e93476120f1ffca083d94c92 Author: Shixiong Zhu Date: Thu Jun 1 00:56:06 2023 -0700 Minor refactor to `CloneTableBase` Avoid using a magic value `true` for `dataChange` directly. Pull `true` into a constant value. GitOrigin-RevId: 459040a0465320cf1db23eae9fc8b74950fbe8fc commit 3edf7300d3874851830f2559c66ea235debe21c8 Author: Lars Kroll Date: Wed May 31 19:47:00 2023 +0200 Refactor deterministic-plan detection. 
GitOrigin-RevId: 42f27b7de3845c8650f71bfba65a4e78a7657179 commit bc40515e1296ce73d2dd33c31d68ac5b46d3bf0b Author: lzlfred Date: Wed May 31 08:12:59 2023 -0700 Minor refactor to CreateDeltaTableCommand Minor refactor to CreateDeltaTableCommand GitOrigin-RevId: 44aecd1d01d4e2965b9580a7f26d69708ecc3057 commit bb987a6e195030319379966ca0be40e2055f6506 Author: Lars Kroll Date: Wed May 31 14:16:17 2023 +0200 Fix DV struct schema The DV schema description had become out of sync with the actual case class fields. This PR brings them back in line. GitOrigin-RevId: a579bcac4a4917f9a51cbbd75aebba2d2616e3a4 commit e1bbbc812cacd59532b3ed39586ecdf9ce320320 Author: Wenchen Fan Date: Wed May 31 10:10:55 2023 +0000 Minor import refactor DeltaConfig GitOrigin-RevId: 4d75aee59c5b370baebee3ec758280ae288dbc5c commit 262d213c54ae3afcaec08771c11faf63e8e07429 Author: Tom van Bussel Date: Wed May 31 08:28:39 2023 +0200 Reassign Default Row Commit Versions on conflict This PR handles setting the `defaultRowCommitVersion` field in `AddFile` correctly when a concurrent transaction commits first. This also handles the case in which the row tracking table feature is set to supported by a concurrent transaction. GitOrigin-RevId: a039bda21bc8dd492cd1a61043c5fefc99a0e3e7 commit b29805c4d5b9e009a391a54812eecccaf0f5e1a5 Author: Felipe Pessoto Date: Tue May 30 10:30:12 2023 -0700 Validate operation name in Optimize command tests ## Description Validate the operation name for optimize command - Fixes #1756 Also changed DeltaLog.snapshot to DeltaLog.unsafeVolatileSnapshot, as snapshot field is deprecated. Run tests locally ## Does this PR introduce _any_ user-facing changes? 
No Closes delta-io/delta#1758 Signed-off-by: Venki Korukanti GitOrigin-RevId: a6340ca9b31f45f43a1e7d26a751bb69cefa76e4 commit d7c4f06da1e7ee7e2761e4a6d8a8fc1c84081b6d Author: Fredrik Klauss Date: Tue May 30 19:29:25 2023 +0200 Column mapping test infrastructure fix * Bump the reader protocol version when enabling column mapping to table features when the writer version of the existing table is using table features * Add tests for schema evolution with partition columns GitOrigin-RevId: 45b81a55f5f443d10ba9356b9a81b8af21398142 commit 10dc607c27b712e559fc6072ab59214e05d3ff2f Author: Wenchen Fan Date: Mon May 29 15:52:42 2023 +0000 Minor refactor DeltaCreateTableLikeSuite GitOrigin-RevId: 8febe82f5704a8a4423b65cc50fd5b78b658c734 commit 3cab25d13c30f853d33575dd19644f012895e2ad Author: Terry Kim Date: Fri May 26 17:31:37 2023 -0700 Minor refactor to WriteIntoDelta GitOrigin-RevId: c09d7b174537f435553f840001a308dbea76bcd1 commit e56db3c4f19917ddff5d9eddb18bc246a3430bea Author: jintao shen Date: Fri May 26 16:17:07 2023 -0700 Minor refactor to DeltaAnalysis GitOrigin-RevId: 796d8a851d646b7d96aaab9b79c995feafd172b2 commit b221b69160c4527c7b016ad06dc6fc26dc8bd092 Author: Allison Portis Date: Tue May 30 11:52:13 2023 -0700 Create initial structure for Delta Kernel development See #1783 for details. This PR just sets up the `kernel/` subdirectory and sbt for future development. https://github.com/delta-io/delta/pull/1786 updates the github actions and is based off of this PR Closes delta-io/delta#1785 commit 458bcaeb26420adee7dfa10d5b89a2c0a500615f Author: Mohamed Zait Date: Fri May 26 15:08:31 2023 -0700 Minor refactoring GitOrigin-RevId: 991120919893a5d1500d14bdf45a2e2c7ee5946f commit f5be0ce67f91dff5039a248dac2c508f3ca63ba4 Author: Tom van Bussel Date: Fri May 26 09:18:01 2023 +0200 Dependencies between Delta Table Features This PR adds the ability to specify dependencies between Delta table features. 
When enabling a table feature all dependent table features will automatically be enabled as well. This will be used by Row Tracking to add a dependency on the Domain Metadata feature (which will be used to store the high water mark). Closes delta-io/delta#1789 GitOrigin-RevId: c0b03df9d708f1c0ff9b902679a06357567d3907 commit 91d24352ea7530f9426bce7036fd33e505f64d5b Author: Jiaheng Tang Date: Thu May 25 19:56:31 2023 -0700 Minor refactor to DeltaCatalog.scala GitOrigin-RevId: 9361e833bf1a35e975898dbf71d3f03206ad585d commit 351cc8c1f66b926a46119ee4917ca8ab6b3c4fd8 Author: Felipe Pessoto Date: Thu May 25 17:38:08 2023 -0700 Update checkpoint schema description ## Description Update checkpoint schema description to explicitly mention that readerFeatures and writerFeatures should only exist in supported versions. N/A ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1682 Co-authored-by: Paddy Xu Signed-off-by: Allison Portis GitOrigin-RevId: 688d607ac4ec6b6e36b09e8cef6eb6d73634db8b commit 0925df8c22e4445231ac9da1ee80cc855b9d9460 Author: Allison Portis Date: Thu May 25 16:39:30 2023 -0700 Bump version to 2.5.0-SNAPSHOT 2.4.0 is released Closes delta-io/delta#1795 Signed-off-by: Allison Portis GitOrigin-RevId: 8a65587b5a787461835eb1569634f4e39a443f20 commit d50b224a28727c587678f25178d862026c5c0d61 Author: Jackie Zhang Date: Thu May 25 16:39:13 2023 -0700 Adding consecutive schema change support for schema tracking. GitOrigin-RevId: 50555e560d530ed966d64a1b65059b226ae7f6b4 commit 8f2b532ae88929e8685d1224d7b727c8b483ab69 Author: Kam Cheung Ting Date: Thu May 25 16:38:34 2023 -0700 Introduce Delta Statistics Columns dataSkippingStatsColumns Allow user to specify delta dataskipping statistic columns list. 
Closes delta-io/delta#1763 GitOrigin-RevId: b20c801057431ca0ba2a3494de49c24c5812434d commit b99d700d7f2a8acc2552ebdfcaedffbb7a2ddfd2 Author: Lars Kroll Date: Thu May 25 15:37:33 2023 -0700 Fix TimestampNTZ section location ## Description The TimestampNTZ section in the protocol somehow ended up within the DV section. I moved it to after the end of the DV section instead. n/a ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1792 Signed-off-by: Allison Portis GitOrigin-RevId: b92b9f9872b0d6a7cf67c72ced4a0f8f9e5a98be commit 79cb385461e9070294136c6d8f257562d778c801 Author: Tom van Bussel Date: Thu May 25 20:40:02 2023 +0200 Add Default Row Commit Version to AddFile and RemoveFile This PR implements part of the changes proposed in https://github.com/delta-io/delta/pull/1747. It adds the `defaultRowCommitVersion` field to `AddFile` and `RemoveFile`, and it makes sure that it's populated during commits and read during log replay. It **does not** handle any transaction conflicts yet. Closes delta-io/delta#1781 GitOrigin-RevId: 781617fd33b3be2f39ac8ab36aa0b741ba99c97e commit 3e2157f5f330cd5ca4df7127cfe15d64b8c77a34 Author: Fredrik Klauss Date: Thu May 25 12:58:31 2023 +0200 Minor refactor GitOrigin-RevId: c4d7d8659f3a7c67e50c28a4083b5b26562a0c3c commit df9d74dd5d6dac72cad869cfab29909c37b40f97 Author: Lin Ma Date: Wed May 24 19:00:29 2023 -0400 Add jobRunId to JobInfo This PR adds a jobRunId field to JobInfo. When a multi-task job is created, the runId in JobInfo represents the task run ID, and jobRunId (retrieved from multitaskParentRunId in the CommandContext) represents the job run ID. 
GitOrigin-RevId: 1bd0027983de4af93e1e07e5237ac1c286853da8 commit 6eb78c25537b5f77a23ede657c552b3f119d0122 Author: Terry Kim Date: Wed May 24 13:57:39 2023 -0700 Minor refactor to CreateDeltaTableCommand and WriteIntoDelta GitOrigin-RevId: c095f38503014c502541d83c5cc873e94451097a commit 031a34819f2d92bb0d8bf5b8d0d301ca40f03723 Author: Prakhar Jain Date: Wed May 24 11:16:02 2023 -0700 Minor Change to CheckpointInstance Pass deltaLog instead of path to CheckpointInstance's getCheckpointProvider GitOrigin-RevId: d6d92a124932d49355a53ff0f08e0644087826c6 commit 42c80c164420f371cab06f63fe229891913538e7 Author: jintao shen Date: Tue May 23 17:20:26 2023 -0700 Minor refactor to CreateDeltaTableCommand GitOrigin-RevId: 0d6b570b8da26dc7481e3dedf8e6a03d729c9220 commit fdcd6d8629fc9ae0331a6bb3d588a5162ba777b1 Author: Allison Portis Date: Thu May 25 17:13:33 2023 -0700 bump delta storage version (#554) commit 5ad64436ecc3218d6dc3c079afbfc0dc70325500 Author: Allison Portis Date: Tue May 23 15:56:23 2023 -0700 Handle FileAlreadyExistsException in S3DynamoDBLogStore ## Description Resolves #1734 It is possible that more than one concurrent reader/writers will try to fix the same incomplete entry in DynamoDB. This could result in some seeing a `FileAlreadyExistsException` when trying to copy the temp file to the delta log. We should not propagate this error to the end user since the recovery operation was successful and we can proceed. See #1734 for more details. Note, we attempt to copy a temp file in two places: 1. As part of writing N.json [here](https://github.com/delta-io/delta/blob/master/storage-s3-dynamodb/src/main/java/io/delta/storage/BaseExternalLogStore.java#L249) 2. In `fixDeltaLog` when performing a recovery for N-1.json as part of either a write or listFrom We only need to catch the exception in scenario (2). In scenario (1) we already catch _all_ seen errors. This is hard to test without manipulating the FS + external store a lot. 
We could manipulate `FailingFileSystem` to throw a `FileAlreadyExistsException`. Closes delta-io/delta#1776 Signed-off-by: Allison Portis GitOrigin-RevId: fcce5d5577d79dff4d071ebdd63b3ee837e5b645 commit adfe0ceadb601efe585397377d4563bc15f727d6 Author: jintao shen Date: Tue May 23 15:20:26 2023 -0700 Implement DomainMetadata Action This change implements [DomainMetadata](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#domain-metadata) Action in the delta spec. Closes delta-io/delta#1759 GitOrigin-RevId: 0a216f1e05f887c9369f31d5fc3fa7ee4e506600 commit c9bbe2e9647396e0be19b476f54e0b667dff6bbd Author: Fredrik Klauss Date: Tue May 23 18:18:56 2023 +0200 Throw proper error message when assigning row IDs without stats * Create a proper error message to throw when assigning row IDs without stats * Tell the user to call the Scala API to recompute statistics on the table. GitOrigin-RevId: b3883eda96d5fa8c3554727b62f61d21ff64410c commit 7272b04db76507a98e07e96dc88e2c1a54283693 Author: Tom van Bussel Date: Tue May 23 11:36:48 2023 +0200 Protocol Specification for Row Commit Versions This PR adds the protocol specification changes for the Row Commit Versions that are proposed in https://github.com/delta-io/delta/issues/1715. In particular it makes the following changes: - Renames the rowIds feature to rowTracking. - Renames the delta.enableRowIds property to delta.enableRowTracking. - Renames and moves the preservedRowIds flag in rowIdHighWaterMark to delta.rowTracking.preserved in the tags of commitInfo. - Refactors the specification of Row IDs - Adds the specification for Row Commit Versions. 
Closes delta-io/delta#1747 GitOrigin-RevId: ac774c4b92c53643d9f4f5b174270a94ab71e1e1 commit 4944e24ec7b7fc84c07e5ddbb71d8f5c68df5bc9 Author: Eric Ogren Date: Fri May 19 17:18:28 2023 -0700 Minor refactor to ScanReport.scala GitOrigin-RevId: 6628128310d4055aed9bb9965da8f11831e26342 commit d948e039ec626309e821c196d800befdbda4c957 Author: Satya Valluri Date: Fri May 19 09:11:17 2023 -0700 Minor refactor to CreateDeltaTableCommand GitOrigin-RevId: 7793451172a002f1c49ecf66ac3a4b1f9045f97c commit a467814bae3564ab5a76a4b78c2966c0d5d4da6a Author: Lars Kroll Date: Fri May 19 12:44:48 2023 +0200 Refactor deterministic plan detection into a separate trait. GitOrigin-RevId: 3e695d333a1c78bc076e71d5a309682e778224a0 commit e06424dc11237ccf0a9925394f0635dc7333ed0e Author: Prakhar Jain Date: Thu May 18 14:29:16 2023 -0700 Rename checkpointV2 to checkpointWithStructCols. The Delta Spec has only 1 version of checkpoint but the delta connector uses the term CheckpointV2 in order to write partition values in struct form (`partitionValues_parsed` col). This is kind of confusing. This PR fixes this terminology. Closes https://github.com/delta-io/delta/pull/1772 GitOrigin-RevId: 7964ce3ba8e6b92281e3de509b8a1a10ab5a5ec2 commit f052bbb61caea48a681142abca9bdb7583caeb7f Author: Prakhar Jain Date: Wed May 17 18:18:10 2023 -0700 Rename CheckpointMetadata class to LastCheckpointInfo since that's what it actually is. Closes https://github.com/delta-io/delta/pull/1771 GitOrigin-RevId: f7be12e7b46603bbf00f3c7736a014803191ab0a commit 95f4efd03111b6e475c048a95b252053ccab3f6f Author: Yuming Wang Date: Wed May 17 17:13:18 2023 -0700 Upgrade antlr4 to 4.9.3 ## Description Upgrade antlr4 to 4.9.3 to fix incompatible warning: ``` ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3 ``` Manual test. ## Does this PR introduce _any_ user-facing changes? No. 
Closes delta-io/delta#1751 Signed-off-by: Allison Portis GitOrigin-RevId: 6184957b7c83f9d9fe53445d8e509cbf23a11c9f commit 3fcb9a701734ab7d25f65374cb889887d5d7c16f Author: Jackie Zhang Date: Wed May 17 16:41:22 2023 -0700 Allow empty schema table to be recreated when txn log exists GitOrigin-RevId: 0859323756ab0082051e5e090094cc3763069081 commit d5f85c0493f79f18648ce61a4782634ee033e4d1 Author: Tom van Bussel Date: Wed May 17 20:36:12 2023 +0200 Minor refactor GitOrigin-RevId: b103477e3f7147458fcbb96469e6af1388b8cf10 commit 2207930656a1508651bbec8e1ca2c9f2c6691f7d Author: Sabir Akhadov Date: Wed May 17 10:33:28 2023 +0200 Prohibit spark.databricks.delta.dynamicPartitionOverwrite.enabled=false when DPO requested With Spark SQL config `spark.databricks.delta.dynamicPartitionOverwrite.enabled=false`, both Spark SQL spark.sql.sources.partitionOverwriteMode=dynamic and Delta write option partitionOverwriteMode=dynamic are ignored. This could lead to corrupt data as users might explicitly require and expect DPO instead of overwriting the whole table. We prohibit such a combination by throwing an exception instead. The `spark.databricks.delta.dynamicPartitionOverwrite.enabled` is marked internal, so we do not expect users to have set it. GitOrigin-RevId: 31731d80117d899df66581a9fccd90f67e21495e commit d74f788fa28b0ab9e5ee2f743db23375621fd7aa Author: Prakhar Jain Date: Tue May 16 17:41:14 2023 -0700 Remove logSegment.checkpoints API and move it to CheckpointProvider GitOrigin-RevId: f84ce9d06434eca19ea98bb63f76b2c0ce4b862b commit 921c98e36a48dacc4537af534afa8f91599dca07 Author: Paddy Xu Date: Tue May 16 16:43:34 2023 +0200 Table feature usage logging This PR adds a new `opType = delta.protocol.change` to `tahoeEvents`, to record protocol changes. The payload (blob) includes old and new protocol components. 
Example: For a new table: ```json { "toProtocol":{ "minReaderVersion":1, "minWriterVersion":2, "supportedFeatures":["appendOnly","invariants"] } } ``` For an existing table: ```json { "fromProtocol":{ "minReaderVersion":1, "minWriterVersion":7, "supportedFeatures":["appendOnly", "invariants"] }, "toProtocol":{ "minReaderVersion":3, "minWriterVersion":7, "supportedFeatures":["appendOnly","deletionVectors","invariants"] } } ``` GitOrigin-RevId: 60ee7ebf3e38e01743774f675a3284faeef5f409 commit cb618e17f106c3d9374320cb69791e80f28f4fce Author: Fredrik Klauss Date: Tue May 16 15:19:13 2023 +0200 Don't mark row tracking as enabled if table feature supported on new table * Don't mark row tracking as enabled if table feature is supported on a new table, as the use case is to enable the table property. Flipping the table property on in this code path is non-standard and could lead to errors. * upgrade the protocol include the automatically enabled table features when `canAssignAnyNewProtocol` is false, as otherwise we miss them until we call `commit` GitOrigin-RevId: 0469997ca5243f7e4c82737460060cc3bf971fbf commit 4b93d5c12121b9bd3594ae47b53d19b2c1562ea4 Author: Eric Ogren Date: Mon May 15 15:24:43 2023 -0700 Minor refactor to DeltaAnalysis.scala GitOrigin-RevId: 20a92c88c68af441e609120743f0f54daccf4094 commit db3c5a361dc6725522c9b0be0a15e9fbb252c9e6 Author: lzlfred Date: Mon May 15 13:35:03 2023 -0700 write field_id in parquet schema with delta column mapping This PR fixes an existing bug where delta misses to set field_id in parquet schema with column mapping, although the [protocol]( https://github.com/delta-io/delta/blob/master/PROTOCOL.md#writer-requirements-for-column-mapping ) requires so. 
Closes delta-io/delta#1762 GitOrigin-RevId: d875c7d4dc2e977cfbddaaf8c5530c7e310af286 commit 4311aa7f992e6d933fc1ba0520130e76361e047d Author: Prakhar Jain Date: Mon May 15 13:27:40 2023 -0700 Move Checkpoint File Index from Snapshot to CheckpointProvider This PR moves the Checkpoint File Index from Snapshot to CheckpointProvider. This PR also adds comments to CheckpointProvider. GitOrigin-RevId: b9ae37ddf87482b4c97eac11d0f7fbccbdbbd90c commit 09ad24560054a5ae8190587f158677528ed5ba96 Author: Fredrik Klauss Date: Mon May 15 20:17:23 2023 +0200 Expose getMessageParameters in DeltaErrors Expose getMessageParameters in DeltaErrors, that are used in Spark since Spark 3.4. GitOrigin-RevId: 46b4ec1a1e4b698ec54ac8c3fc08869b53912480 commit 1c95fda75b6b25be6d952a9cf676511bff2e5f59 Author: Lars Kroll Date: Mon May 15 17:50:46 2023 +0200 Minor RoaringBitmapArray add fix. Fixed an issue with creating bitmaps with `add(Range)`, when they end on a container boundary. GitOrigin-RevId: f253a607775ce4ad24fceb4cfca6a4360e1d4a7a commit da4b27105d633b13d8f7ee13b3d27384db7d04ab Author: Tom van Bussel Date: Sat May 13 17:41:49 2023 +0200 Rename Row IDs feature and flag to Row Tracking GitOrigin-RevId: c32678071adc7c33cc93acd9f11f880d1dc98197 commit 939156804c9996a652a119e6995975ae3dc9fd75 Author: Venki Korukanti Date: Fri May 12 18:33:46 2023 -0700 Add REORG operation details in commit history ## Description `REORG` command was added as part of the delta-io/delta#1732. The operation is recorded as `OPTIMIZE`. 
This PR changes it to `REORG` Updated existing UTs to verify the Delta table history Closes delta-io/delta#1753 Signed-off-by: Venki Korukanti GitOrigin-RevId: d5d27e6b5710659d3766eed0b0c32ec5b716d644 commit 9e714405ce5bafbbc04890f743d98fe598410b73 Author: Boyang Jerry Peng Date: Fri May 12 14:51:17 2023 -0700 Minor refactor to Delta source and sink test suites GitOrigin-RevId: 13ddca182d667ad710338253799fabec9d527de4 commit fbcf5df1912f908734b00fbec61f621afd4ca21b Author: Prakhar Jain Date: Fri May 12 14:49:37 2023 -0700 Move Checkpoint related logic to Checkpoint Provider Move Checkpoint related logic to Checkpoint Provider. Also reuse old checkpointProvider inside the new snapshot after the commit if possible. GitOrigin-RevId: 0748825793f80033b303d8a8319f4e91a2960d65 commit fd601d202389af22b44427948b7d648a76f2f88e Author: Siying Dong Date: Fri May 12 13:07:35 2023 -0700 Add Logging in DeltaSink Add logging in DeltaSink that records duration of file writing and transaction commit, as well as some other information. The log lines look like following: ``` 23/05/11 17:11:03 INFO DeltaSink: Wrote 1 files, with total size 1429, 1 logical records, duration=467 ms. ...... 23/05/11 17:11:04 INFO DeltaSink: Committed transaction, batchId=5, duration=331 ms,added 1 files, removed 0 files. ``` GitOrigin-RevId: b71b74886dd520139239012208d6705305ecf103 commit 0c10fe08fcf7a79b7df1d31d51a7a0b363aa155b Author: Paddy Xu Date: Fri May 12 11:57:30 2023 +0200 Fix incorrect row index filter in CDCReader ## Description This PR fixes a bug in CDCReader, where row index filters are incorrectly flipped for insert rows. Bug description: when a tuple `(remove, file, dv1)` is removed and re-added with a new DV, i.e., `(add, file, dv2)`, we expect that: 1. rows marked in `dv2` but not in `dv1` are deleted. 2. rows marked in `dv1` but not in `dv2` are inserted (during restore for example). We diff two DVs to produce inline DVs and pass them to file scans. 
The bug is in how we use the inline DVs -- we wrongly used the `IfContained` filter for Item (2), and therefore all rows that are not marked in DV are being returned. The correct filter type here is `IfNotContained`, which will return all marked rows. New test. Closes delta-io/delta#1750 Signed-off-by: Paddy Xu GitOrigin-RevId: 3e3c767022e9cd4feaccfb0b58ffbb7ba2db0e6f commit c219bcc24748a09821a9910db6abac512321ccbc Author: Mohamed Zait Date: Thu May 11 16:50:39 2023 -0700 Minor refactoring GitOrigin-RevId: aa566d198db184dffc1b852d7afbc5e3c6745819 commit c06e1598846737fa57b56091056fdf8e24da100a Author: Paddy Xu Date: Thu May 11 23:49:27 2023 +0200 Remove super slow DV stress tests This PR removes DV stress tests because they run too slow in Github Actions. GitOrigin-RevId: 405e8c59b0eccf0ed2af365f34dc76c6075734c9 commit 9f4b336a25cc2d90fc54c6fdf2a6cb09dc70045c Author: Paddy Xu Date: Thu May 11 21:42:13 2023 +0200 Write the number of deleted rows for DELETEs with DV This PR improves DELETE with DVs to write the following operational metrics: `numDeletedRows` and `numRemovedFiles`. GitOrigin-RevId: a237be027a53a7b7fa55396293ab380d9e8e1473 commit 6a6cbb33342f7585d76c84f9f5550bec174c49e6 Author: Paddy Xu Date: Thu May 11 14:52:57 2023 +0200 Refactor protocol version handling in DeltaConfig This PR improves `minimumProtocolVersion` handling in `DeltaConfig` by nuking it from the codebase, because it's supposed to guard a config from being enabled but we never did that. Now this is not needed anymore as table features will manage table protocols. GitOrigin-RevId: 5b0761b6e8c474ff0f1605add744186558cdcdc2 commit ba10d1526119cbfc437479858e61e9e5daf8b8b1 Author: Johan Lasperas Date: Thu May 11 13:24:20 2023 +0200 Assign fresh Row IDs during commit ## Description This change implements assigning unique fresh Row IDs when committing files on tables that support Row Ids: - Set the `baseRowId` field of every `add` and `remove` actions in commits. 
- Generate `highWaterMark` action to update the Row ID high watermark. - Gracefully resolve conflicts between transactions by reassigning overlapping Row IDs before committing. - Adding tests to RowIdSuite to cover assigning fresh Row IDs. - Adding tests to CheckpointSuite to ensure `baseRowId` and `highWaterMark` information survives checkpointing. Closes delta-io/delta#1723 GitOrigin-RevId: 3ab7cba5b66585baf17de10b1d5fbe6e320e7665 commit c8b5acb0d6a5bc6fc36ec9d3e6bf3698881d8db6 Author: aokolnychyi Date: Thu May 11 11:02:34 2023 +0000 Allow UPDATE command with non-delta tables Delta is not the only format that supports UPDATE. GitOrigin-RevId: 1d59679ee96796de2428861b4e90bff03f6e4812 commit 02a46d191584de4dd8004a2d3a20704e6f30cdcf Author: Antoine Amend Date: Thu May 11 17:02:04 2023 -0700 Allow absolute paths (#543) commit d74cc6897730f4effb5d7272c21bd2554bdfacdb Author: Allison Portis Date: Wed May 10 11:27:10 2023 -0700 Fix a test in DeltaVacuumSuite to pass locally "vacuum after purging deletion vectors" in `DeltaVacuumSuite` fails locally because the local filesystem only writes modification times to second accuracy. This means a transaction might have timestamp `1683694325000` but the tombstone for a file removed in that transaction could have deletionTimestamp `1683694325372`. ---> The test fails since we set the clock to the transaction timestamp + retention period, which isn't late enough to expire the tombstones in that transaction. GitOrigin-RevId: 63018c48524edb0f8edd9e40f1b21cc97bc546cc commit c3ff8eeef3a1ba1a3b32fea96b4c90bdfbd8d57c Author: Sabir Akhadov Date: Wed May 10 12:47:03 2023 +0200 Add estLogicalFileSize to FileAction Add estLogicalFileSize to FileAction for easier Deletion Vector processing. 
GitOrigin-RevId: c7cf0ad32e378bcfc4e4c046c5d76667bb8659c7 commit 422a670bc6b232e451db83537dcad34a5de97b67 Author: Allison Portis Date: Tue May 9 23:01:57 2023 -0700 Support insert-into-by-name for generated columns ## Description Spark 3.4 no longer requires users to provide _all_ columns in insert-by-name queries. This means Delta can now support omitting generated columns from the column list in such queries. This test adds support for this and adds some additional tests related to the changed by-name support. Resolves delta-io/delta#1215 Adds unit tests. ## Does this PR introduce _any_ user-facing changes? Yes. Users will be able to omit generated columns from the column list when inserting by name. Closes delta-io/delta#1743 GitOrigin-RevId: 8694fab3d93b71b4230bf6f5dd0f2a21be6f3634 commit 9fac2e6af313b28bf9cd3961aa5dec8ea27a2e7b Author: Paddy Xu Date: Tue May 9 21:41:11 2023 -0700 Implement PURGE to remove DVs from Delta tables ## Description This PR introduces a `REORG TABLE ... APPLY (PURGE)` SQL command that can materialize soft-delete operations by DVs. The command works by rewriting and bin-packing (if applicable) only files that have DVs attached, which is different from the `OPTIMIZE` command where all files (with and without) DV will be bin-packed. To achieve this, we hack the `OPTIMIZE` logic so files of any size with DVs will be rewritten. Follow-up: - Set the correct commit info. Now the resulting version is marked as `optimize` rather than `purge`. - Clean up DVs from the filesystem. New tests. Closes delta-io/delta#1732 Signed-off-by: Venki Korukanti GitOrigin-RevId: 98ef156d62698986bfb54681e386971e2fec08b8 commit dcad4fddce630caefcc7d49c2170b71d9a71ca57 Author: Lukas Rupprecht Date: Tue May 9 18:35:10 2023 -0700 Unify predicate strings in CommitInfo to record the information in a consistent way. 
GitOrigin-RevId: 043a6a4181c112b9c9a45906c1275fbbdbbb1388 commit ba7dc56aa66f3afb6deefb7d6ed956883125d1f2 Author: Jackie Zhang Date: Tue May 9 15:31:10 2023 -0700 Minor refactoring to Delta source. GitOrigin-RevId: 3625a5c44999139ef4976c62473b233167a4aa83 commit c26a4cb985553d915420b9fb1d3b0855ebe53192 Author: Christos Stavrakakis Date: Tue May 9 19:06:43 2023 +0200 Add Option.whenNot Scala extension helper and replace usage of Option.when(!cond). GitOrigin-RevId: e26244544cadeeff1d55862f840d4c6c5570e83b commit a3526d803dece4948da1c2bb2453f75aadab1820 Author: jintao shen Date: Tue May 9 10:03:10 2023 -0700 Introduce DomainMetadata action to delta spec We propose to introduce a new Action type called DomainMetadata to the Delta spec. In a nutshell, DomainMetadata allows specifying configurations (string key/value pairs) per metadata domain, and a custom conflict handler can be registered to a metadata domain. More details can be found in the design doc [here](https://docs.google.com/document/d/16MHP7P78cq9g8SWhAyfgFlUe-eH0xhnMAymgBVR1fCA/edit?usp=sharing). The github issue https://github.com/delta-io/delta/issues/1741 was created. Spec only change and no test is needed. Closes delta-io/delta#1742 GitOrigin-RevId: 5d33d8b99e33c5c1e689672a8ca2ab3863feab54 commit baf54ffdd7287312bda19796e5bfbce939ff7844 Author: Scott Sandre Date: Wed May 10 15:31:37 2023 -0700 Delta/Flink Sink and Delta Standalone - allow disabling delta checkpointing (#545) commit 03e9a806d72dec53ee9252dd0924fdc0eb160299 Author: Paddy Xu Date: Tue May 9 17:25:11 2023 +0200 DV stress test: Delete from a table of a large number of rows with DVs This PR tests DELETing from a table of 2 billion rows (`2<<31 + 10`), some of which are marked as deleted by a DV. The goal is to ensure that DV can still be read and manipulated in such a scenario. We don't `delete a large number of rows` and `materialize DV` because they run too slow to fit in a unit test (9 and 20 minutes respectively). 
GitOrigin-RevId: 1273c9372907be0345465c2176a7f76115adbb47 commit 6ef881f3bad99d951201e221a73f74961bb5be6b Author: Venki Korukanti Date: Tue May 9 07:21:00 2023 -0700 RESTORE support for Delta tables with deletion vectors This PR is part of the feature: Support Delta tables with deletion vectors (more details at https://github.com/delta-io/delta/issues/1485) It adds support for running RESTORE on a Delta table with deletion vectors. The main change is to take the `AddFile.deletionVector` into consideration when comparing the target version being restored to and the current version to find the list of data files to add and remove. Added tests Closes delta-io/delta#1735 GitOrigin-RevId: b722e0b058ede86f652cd4e4229a7217916511da commit 579a3151db611c5049e5ca04a32fc6cccb77448b Author: Sabir Akhadov Date: Tue May 9 13:25:54 2023 +0200 Disallow overwriteSchema with dynamic partitions overwrite Disallow overwriteSchema when partitionOverwriteMode is set to dynamic. Otherwise, the table might become corrupted as schemas of newly written partitions would not match the non-overwritten partitions. GitOrigin-RevId: 1012793448c1ffed9a3f8bde507d9fe1ee183803 commit 6556d6fa256611070f9fc8e1f05ba73b29837dcc Author: Venki Korukanti Date: Tue May 9 00:32:03 2023 -0700 SHALLOW CLONE support for Delta tables with deletion vectors. This PR is part of the feature: Support Delta tables with deletion vectors (more details at https://github.com/delta-io/delta/issues/1485) This PR adds support for SHALLOW CLONEing a Delta table with deletion vectors. The main change is to convert the relative path of DV file in `AddFile` to absolute path when cloning the table. Added tests Closes delta-io/delta#1733 GitOrigin-RevId: b634496b57b93fc4b7a7cc16e33c200e3a83ba64 commit 7c352e9a0bf4b348a60ca040f9179171d2db5f0d Author: Allison Portis Date: Mon May 8 22:47:29 2023 -0700 Adds tests for REPLACE WHERE SQL syntax Spark 3.4 added REPLACE WHERE SQL support for insert.
This PR adds tests for the feature after upgrading to Spark 3.4. Closes delta-io/delta#1737 GitOrigin-RevId: 8bf0e7423a6f0846d5f9ef4e637ee9ced9bef8d1 commit fe83c966f8694cf7ef786f6b24b73b2ed3cbbd7d Author: Yaohua Zhao Date: Mon May 8 13:53:29 2023 -0700 Fix a test in `DeltaThrowableSuite.scala` Fix a test in `DeltaThrowableSuite.scala` GitOrigin-RevId: 28acd5fe8d8cadd569c479fe0f02d99dac1c13b3 commit 16ca361dc5fab9d02fab6ddd5de2205b4d1f6c75 Author: Venki Korukanti Date: Mon May 8 12:55:26 2023 -0700 Fix statistics computation issues with Delta tables with DVs This PR makes the following changes: - Delta protocol [requires](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#writer-requirement-for-deletion-vectors) that every `AddFile` with DV must have `numRecords` in file statistics. The current implementation of DELETE with DVs violates this requirement when the source `AddFile` has no statistics to begin with. This PR fixes it by computing stats for `AddFile`s that have missing stats and have DVs generated as part of the DELETE with DV operation. The stats are generated by reading the Parquet file footer. - DELETE with DVs currently has a bug where setting the `tightBounds=false` for `AddFile`s with DVs doesn't correctly set the `NULL_COUNT` for columns with all nulls. - Throw error when stats re-computation command is run on Delta tables with DVs. This is a TODO, we need to implement it but for now throw error to avoid calculating wrong statistics for Delta tables with DVs. GitOrigin-RevId: f69968961dcf4766b6847a191b66aae7f9ff295d commit 0331737f8b94f8a941c66eab05cb2664dea08c4c Author: Venki Korukanti Date: Mon May 8 10:39:47 2023 -0700 Remove the check that disables writes to Delta tables with deletion vectors Given that now we have support for writing into DV tables and table utility operations as part of the delta-io/delta#1485 and delta-io/delta#1591, we should remove the check.
Closes delta-io/delta#1736 Signed-off-by: Venki Korukanti GitOrigin-RevId: 17e7e9c6796229ada77148a730c69348a55890b9 commit ca986a6b1020b82b417dc36a80b9bf367f500bb1 Author: Lars Kroll Date: Mon May 8 16:50:52 2023 +0200 Regex based table matching in DeleteScalaSuite Use a more reliable regex-based approach to getting a `DeltaTable` instance from a sql identifier string in `DeleteScalaSuite`. GitOrigin-RevId: 1d0e1477a7d22373e8478d7debc3565c092090da commit c53e95c71f25e62871a3def8771be9eb5ca27a2e Author: Johan Lasperas Date: Mon May 8 15:48:22 2023 +0200 Enable SQL support for WHEN NOT MATCHED BY SOURCE # Description The SQL syntax for merge with WHEN NOT MATCHED BY SOURCE clauses was shipped with Spark 3.4. Now that Delta picked up Spark 3.4, we can enable SQL support and mix in SQL tests for WHEN NOT MATCHED BY SOURCE. Existing tests for WHEN NOT MATCHED BY SOURCE are now run in the Merge SQL suite. Closes delta-io/delta#1740 GitOrigin-RevId: 1ddd1216e13f854901da47896936527618ea4dca commit c180d15083d043a526ea0acd0c5bd384732b7b70 Author: Jiaheng Tang Date: Fri May 5 18:54:14 2023 -0700 Minor refactor to DeltaCatalog.scala GitOrigin-RevId: 53b083f9abf92330d253fbdd9208d2783428dd98 commit 243c0ebbafa786260accd091fdf4765c42b2e55f Author: Johan Lasperas Date: Thu May 4 11:21:38 2023 +0200 Correctly recurse into nested arrays & maps in add/drop columns It is not possible today in Delta tables to add or drop nested fields under two or more levels of directly nested arrays or maps. The following is a valid use case but fails today: ``` CREATE TABLE test (data array>>) ALTER TABLE test ADD COLUMNS (data.element.element.b string) ``` This change updates helper methods `findColumnPosition`, `addColumn` and `dropColumn` in `SchemaUtils` to correctly recurse into directly nested maps and arrays. Note that changes in Spark are also required for `ALTER TABLE ADD/CHANGE/DROP COLUMN` to work: https://github.com/apache/spark/pull/40879. 
The fix is merged in Spark but will only be available in Delta in the next Spark release. In addition, `findColumnPosition`, which currently returns both the position of a nested field and the size of its parent, making it overly complex, is split into two distinct and generic methods: `findColumnPosition` and `getNestedTypeFromPosition`. - Tests for `findColumnPosition`, `addColumn` and `dropColumn` with two levels of nested maps and arrays are added to `SchemaUtilsSuite`. Other cases for these methods are already covered by existing tests. - Tested locally that ALTER TABLE ADD/CHANGE/DROP COLUMN(S) works correctly with Spark fix https://github.com/apache/spark/pull/40879 - Added missing test coverage for ALTER TABLE ADD/CHANGE/DROP COLUMN(S) with a single map or array. Closes delta-io/delta#1731 GitOrigin-RevId: 53ed05813f4002ae986926506254d780e2ecddfa commit 4dadc0288ac0671403c62b753440c64fbd60f57d Author: Allison Portis Date: Tue May 9 11:14:20 2023 -0700 Block adding a new column with nullable=false for existing tables (#546) commit 5c3f4d37951ab4edf1cc3364ec6e8259844b331c Author: Allison Portis Date: Wed May 3 12:45:04 2023 -0700 Support Spark 3.4 ## Description Makes changes to support Spark 3.4. These include changes necessary to compile, and test _and_ code changes due to changes in Spark behavior. Some of the bigger changes include - A lot of changes regarding error classes. These include... - Spark 3.4 changed `class ErrorInfo` to private. This means the current approach in `DeltaThrowableHelper` can no longer work. We now use `ErrorClassJsonReader` (these are the changes to `DeltaThrowableHelper` and `DeltaThrowableSuite`) - Many error functions switched the first argument from `message: String` to `errorClass: String` which **does not** cause a compile error, but instead causes a "SparkException-error not found" when called. Some things affected include `ParseException(...)`, `a.failAnalysis(..)`.
- Supports error subclasses - Spark 3.4 supports insert-into-by-name and no longer reorders such queries to be insert-into-by-ordinal. See https://github.com/apache/spark/pull/39334. In `DeltaAnalysis.scala` we need to perform schema validation checks and schema evolution for such queries; right now we only match when `!isByName` - SPARK-27561 added support for lateral column alias. This broke our generation expression validation checks for generated columns. We now separately check for generated columns that reference other generated columns in `GeneratedColumn.scala` - `DelegatingCatalogExtension` deprecates `createTable(..., schema: StructType, ...)` in favor of `createTable(..., columns: Array[Column], ...)` - `_metadata.file_path` is not always encoded. We update `DeleteWithDeletionVectorsHelper.scala` to accommodate this. - Support for SQL `REPLACE WHERE`. Tests are added to `DeltaSuite`. - Misc test changes due to minor changes in Spark behavior or error messages Resolves delta-io/delta#1696 Existing tests should suffice since there are no major Delta behavior changes _besides_ support for `REPLACE WHERE` for which we have added tests. ## Does this PR introduce _any_ user-facing changes? Yes. Spark 3.4 will be supported. `REPLACE WHERE` is supported in SQL. GitOrigin-RevId: b282c95c4e6a7a1915c2a4ae9841b5e43ed4724d commit 9f217a5d4f56e577eab833b7fa9037f05bc6c06a Author: Scott Sandre Date: Tue May 2 13:16:14 2023 -0700 S3DynamoDBLogStore::read: use a ThrowingSupplier to correctly propagate the FileNotFoundException ## Description Previously, we were wrapping `super.read` calls in the Supplier to RetryableCloseableIterator with a try-catch statement, and rethrowing IOExceptions as UncheckedIOExceptions. This could cause downstream readers who are catching `FileNotFoundException` to fail, as instead of a `FileNotFoundException` we were throwing an `UncheckedIOException`.
Instead, this PR creates a `ThrowingSupplier` class so that `super.read` calls don't need to be caught and re-thrown. New UTs and tested with a real S3 bucket and DDB table: ``` spark-shell --packages io.delta:delta-core_2.12:2.3.0,org.apache.hadoop:hadoop-aws:3.3.1 \ --jars /Users/scott.sandre/.m2/repository/io/delta/delta-storage-s3-dynamodb/2.4.0-SNAPSHOT/delta-storage-s3-dynamodb-2.4.0-SNAPSHOT.jar,/Users/scott.sandre/.m2/repository/io/delta/delta-storage/2.4.0-SNAPSHOT/delta-storage-2.4.0-SNAPSHOT.jar \ --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \ --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \ --conf spark.delta.logStore.s3a.impl=io.delta.storage.S3DynamoDBLogStore \ --conf spark.io.delta.storage.S3DynamoDBLogStore.ddb.region=XXXX \ --conf spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName=YYYY \ --conf spark.io.delta.storage.S3DynamoDBLogStore.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider \ --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider > spark.range(100).write.format("delta").save(tablePath) 23/05/01 15:54:52 INFO S3DynamoDBLogStore: using tableName YYYY 23/05/01 15:54:52 INFO S3DynamoDBLogStore: using credentialsProviderName com.amazonaws.auth.profile.ProfileCredentialsProvider 23/05/01 15:54:52 INFO S3DynamoDBLogStore: using regionName XXXX 23/05/01 15:54:52 INFO S3DynamoDBLogStore: using ttl (seconds) 86400 23/05/01 15:54:52 INFO S3DynamoDBLogStore: Table `YYYY` already exists 23/05/01 15:54:52 INFO DelegatingLogStore: LogStore `LogStoreAdapter(io.delta.storage.S3DynamoDBLogStore)` is used for scheme `s3a` > spark.read.format("delta").load(tablePath).show() // shows valid output ``` ## Does this PR introduce _any_ user-facing changes? 
No Closes delta-io/delta#1730 Signed-off-by: Scott Sandre GitOrigin-RevId: ac2cd2c9bd02aa1298ad8426e267df8eb41af491 commit 8ce42dc97ba25686c960ae65542edb89c9ae818f Author: Desmond Cheong Date: Mon May 1 19:51:51 2023 -0700 Minor refactor removing unused code Removes code related to an unused statistics tracker. GitOrigin-RevId: bdc1312f71bc466ae2c5675ca537f798fe59655c commit ad6a53c3fa9029d45dbb5f9500623d98b4a76968 Author: Matthew Powers Date: Fri Apr 28 14:10:41 2023 -0700 Add pypi downloads to README ## Description This PR adds the number of monthly pypi downloads to the README. This is a README-only change. ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1668 Signed-off-by: Allison Portis GitOrigin-RevId: 46f915a9a075b25b1a4a1dbbdbdc4dc4a72d3b3c commit e432cbc27f46c11226c48d75804c6d90d810bbb4 Author: Satya Valluri Date: Fri Apr 28 14:08:42 2023 -0700 Adding new stats tracker for auto-stats. In this PR we are adding a new stats tracker that collects both analyze statistics and Delta statistics. The tracker is added in parallel with the Delta stats tracker and eventually will replace it. The tracker tasks collect per file analyze statistics during data ingestion. The file level statistics are then aggregated to task level statistics. The tracker job that runs in the driver aggregates the task level statistics to create table level statistics. GitOrigin-RevId: f7cb4aae152207f09e2bb6f6acc03282a5ffa4b3 commit d7483ad5a5ad50cafbe74cbe9019be8f9389d8b4 Author: Scott Sandre Date: Fri Apr 28 13:16:16 2023 -0700 Catch `RemoteFileChangedException` inside of `S3DynamoDBLogStore::read` ## Description In S3, if `N.json` is over-written while an input stream is open on it, then the ETag will change and a `RemoteFileChangedException` will be thrown. This PR adds logic to retry reading that `N.json` file, at the exact same line that the error occurred at. 
This assumes and requires that the contents of N.json have been overwritten with identical content! An important implementation detail: if we are at index 25 (thus, the last successfully read index is 24), and we try to call `.next()` on the read iterator, and a `RemoteFileChangedException` is thrown, we will re-generate the read iterator, skip all the way to index 25, and try reading it again. New UTs. ## Does this PR introduce _any_ user-facing changes? No. Closes delta-io/delta#1712 Signed-off-by: Scott Sandre GitOrigin-RevId: e94217d3207abf9a246be7464b1f9d76d1463597 commit 91ab166c1055d73e9400eb334936dea56c1e6a0e Author: Sabir Akhadov Date: Fri Apr 28 18:44:25 2023 +0200 Minor refactor to DeltaLog.scala and TahoeFileIndex.scala GitOrigin-RevId: 3808e971762b1af76b3acfbde1e727fca9c06092 commit 0487682248a75d7a8df1fb5060839be320c0b6ba Author: Fredrik Klauss Date: Fri Apr 28 07:56:44 2023 +0200 Add assertMetadata check to commitLarge commitLarge never called assertMetadata, which checks that the metadata is in a correct state after a commit. This PR adds a call to it to ensure correct behavior. It also moves the assertMetadata call to the last position before updating metadata, to ensure any updates to it within that codepath are validated. GitOrigin-RevId: 3b4ea566c93e2c6e92372c7087c1bf2e53f30ba8 commit 0940475f9fde51358c898556a8db299722d3b768 Author: Scott Sandre Date: Thu Apr 27 14:01:00 2023 -0700 Use per-JVM lock in S3DynamoDBLogStore to reduce number of `N.temp.json -> N.json` copies ## Description Adds a global (per JVM) path lock to S3DynamoDBLogStore to reduce the number of `T(N) -> N.json` copies, which can occur when there are concurrent readers/writers. Note: multiple `T(N) -> N.json` copies will not cause data loss, but they may impact readers who already have an existing InputStream open on that particular file. It's really hard to test this specific concurrency issue. Code review + existing tests.
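The per-JVM path lock described above can be illustrated with a minimal Java sketch. This is not the actual `S3DynamoDBLogStore` code: the class `PathLockSketch`, the method `copyIfMissing`, and the in-memory set standing in for the object store are all hypothetical names chosen for illustration. The idea shown is only that holding one lock per path makes the "does `N.json` exist?" check and the copy atomic within a JVM, so concurrent callers perform the copy once.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a per-JVM path lock; none of these names come from S3DynamoDBLogStore.
class PathLockSketch {
    // One lock object per path, shared by all threads in this JVM.
    // Entries are never evicted here; a real implementation would need to bound this map.
    private static final ConcurrentHashMap<String, Object> LOCKS = new ConcurrentHashMap<>();

    // Stand-ins for the object store contents and for the number of copies performed.
    private static final Set<String> EXISTING = ConcurrentHashMap.newKeySet();
    static final AtomicInteger COPIES = new AtomicInteger();

    // Copies the temp file to `path` only if `path` is missing, holding the path's lock
    // so that the existence check and the copy cannot interleave across threads.
    static void copyIfMissing(String path) {
        Object lock = LOCKS.computeIfAbsent(path, p -> new Object());
        synchronized (lock) {
            if (!EXISTING.contains(path)) {
                COPIES.incrementAndGet(); // stands in for the T(N) -> N.json copy
                EXISTING.add(path);
            }
        }
    }

    // Runs `threads` concurrent callers against the same path and returns the copy count.
    static int runConcurrently(String path, int threads) {
        COPIES.set(0);
        EXISTING.remove(path);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> copyIfMissing(path));
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return COPIES.get();
    }

    public static void main(String[] args) {
        System.out.println("copies performed: " + runConcurrently("N.json", 8));
    }
}
```

A lock like this only coordinates threads inside one JVM, which matches the commit's framing of reducing, not eliminating, duplicate copies.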
## Does this PR introduce _any_ user-facing changes? No. Closes delta-io/delta#1711 Signed-off-by: Scott Sandre GitOrigin-RevId: 49514b8192c904b5a648d2d6146269ce5b368f6f commit 6d5fc3e6696c12f90037da20a32cd89908e66556 Author: Scott Sandre Date: Thu Apr 27 09:55:44 2023 -0700 Minor refactor to CloneIcebergSuite GitOrigin-RevId: a3cb8307ada03351798c7a45d2bdf3918a388881 commit d82c3155117c2db61aa36959b4f1c8a064aa7c2c Author: Allison Portis Date: Wed Apr 26 15:19:56 2023 -0700 Fix Dockerfile to work after Debian removed support for stretch To get both Java 8 and Python 3.7 we use `openjdk:8-jdk-buster` as our base image. We install pip3 ourselves. GitOrigin-RevId: 80b4e903ddeeb808fd54a8a1c0abd6797f427a0e commit ce4520110dcd4d251209f3238f82a8d077ee982c Author: David Lewis Date: Wed Apr 26 14:15:57 2023 -0600 Clarify path encoding in comment. GitOrigin-RevId: 2c353f6107d91be928f92c814d8d5596731cc9e4 commit f4689810019c7ba4bedef3bfebd2498076d1b43e Author: Prakhar Jain Date: Tue Apr 25 17:33:19 2023 -0700 Add test case in SnapshotManagementSuite. GitOrigin-RevId: 50c12b1d3731e5c1ef239ee03e34ebe83e8b825c commit 5ffc0568075337d77d55a00d0c7a207429005193 Author: Ming DAI Date: Tue Apr 25 13:43:47 2023 -0700 Refactor ConvertToDeltaCommand to separate out common conversion logics GitOrigin-RevId: f5048489b573ba9b8a29a093cb60f63d9108376e commit ce9fcbfd4e9a217b5536d1760615e16807322c90 Author: Dhruv Shah Date: Tue Apr 25 13:07:35 2023 -0700 Minor interface refactor to DataSkippingReader and DataSkippingPredicateBuilder GitOrigin-RevId: e2e1aec2fe403ae4cce92a0ee4d13c7a11a685bb commit 9e59cf8cc5c86f46a594e6ee7ff72c170685d753 Author: Paddy Xu Date: Tue Apr 25 20:41:30 2023 +0200 Allow reading CDF from files with Deletion Vectors This PR is part of https://github.com/delta-io/delta/issues/1701. This is a follow-up of https://github.com/delta-io/delta/pull/1680 to add support to allow reading CDC from files that have DV associated. 
In this PR we modify the CDC reader to construct in-line DVs diff'ed from two existing DVs, and modify the corresponding FileIndex to use the in-line DV. Closes https://github.com/delta-io/delta/pull/1704 GitOrigin-RevId: 9e3589eb576a773b9f05777521b01485ebeaf33e commit a54ef2a3959b111d1d9510d3106a86f1bba54030 Author: Paddy Xu Date: Wed Apr 26 17:41:08 2023 +0200 Add tests to make sure that DML commands are able to read Delta tables having DVs - `UPDATE` and `MERGE` can read DVs, and will re-write a file to materialize DV if rows within are modified. - `DELETE` can write DVs. New tests. No, this is a test-only PR. Closes delta-io/delta#1678 Signed-off-by: Paddy Xu GitOrigin-RevId: 4f5721d850a1aa93e963354a5afc2e262cde2f08 commit 94fff7109cec3e08550f73f3a64935ce847632a6 Author: aokolnychyi Date: Tue Apr 25 07:50:25 2023 +0000 Fix `StagedDeltaTableV2.partitioning` Fix `StagedDeltaTableV2.partitioning` as it should not always return an empty array. GitOrigin-RevId: 21be990f6926ff61bd64f5b1fe0a95755227ea1a commit 32bdb031aecb7dc808ac5e70e7fbe5e3f41a25fa Author: Jackie Zhang Date: Tue Apr 25 00:13:53 2023 -0700 Add schema tracking support for CDC streaming. 
Closes https://github.com/delta-io/delta/pull/1709 GitOrigin-RevId: 9fb1cc676336ce171d26c87c17cc19ea1cc04a81 commit a12aa4754979e5e13252d01689af8091678102bc Author: Chaoqin Li Date: Mon Apr 24 20:23:40 2023 -0700 Minor refactor to InvariantEnforcementSuite GitOrigin-RevId: 73451deac7816367549a772a9e2b92bcb4bb6cb9 commit 2093d0414d3f139b4ba3da5e4bad2b29de97ef08 Author: Allison Portis Date: Mon Apr 24 11:08:39 2023 -0700 Bump version in master to 2.4.0-SNAPSHOT after 2.3 release n/a Closes delta-io/delta#1713 Signed-off-by: Allison Portis GitOrigin-RevId: 4f42281b646114c541c04894ed81a518299a4c20 commit e132ff9dce6d74c791455ac18cd91f1b9b7b1be4 Author: Paddy Xu Date: Mon Apr 24 11:44:00 2023 +0200 Block writing to Delta tables that supports identity columns This PR fixes an issue described in https://github.com/delta-io/delta/issues/1694, where it is possible to INSERT values into an identity column without updating the high watermark. This issue is caused by a misplaced check in `actions.scala`. The check didn't fire for INSERTs. Added new tests. Yes. After this PR is merged, it will no longer be possible to write to a table that has `minWriterVersion` = `6`, or has `identityColumns` in `writerFeatures`. Closes delta-io/delta#1695 Signed-off-by: Paddy Xu GitOrigin-RevId: 048bd370b7c50fa052650c77ba96469cdda5f0e9 commit 270571a53bbfd5992e671e9ffb71d5036f91f866 Author: Jackie Zhang Date: Fri Apr 21 19:17:41 2023 -0700 Adding SQL conf logic to allow user to unblock a specific stream. Closes: https://github.com/delta-io/delta/pull/1710 GitOrigin-RevId: be6de5af6a51d8a644a65855042a6b476a31f324 commit 47983402771853b43cfd23704f2e45df2dac4406 Author: Jintian Liang Date: Fri Apr 21 19:09:47 2023 -0700 Ignore internal column mapping table properties when creating table When creating a table against an existing Delta location, a comparison is made between the configuration of the newly created table and the existing Delta table. The create fails if this comparison fails. 
However, some internal column mapping properties cannot be set by the create so they should be ignored for the purposes of this verification. GitOrigin-RevId: 29505a33c0c99678b50448f048fbdf5ef8c20a7d commit 8272ee916a581f413dc24cfa6e2aaffde1db5b3d Author: Johan Lasperas Date: Fri Apr 21 20:35:46 2023 +0200 Introduce row ID write table feature and table property This change adds knobs to enable Row IDs when creating a new Delta table, as defined in the Row ID specification (https://github.com/delta-io/delta/pull/1610): - Write table feature `rowIds`: Require writers to support row IDs. Used to enable row IDs on newly created tables. - Table property: `rowIds.enabled`: Indicate whether all rows have an assigned row ID. - SQL conf `rowIds.allowForDevOnly`: restrict the use of Row IDs to testing for now. Adding test suite RowIdSuite: - Test enabling Row IDs on a new table succeeds - Test enabling Row IDs on an existing table fails. Closes delta-io/delta#1702 GitOrigin-RevId: 961ff72f1ae7abf1f08d53052062ce20669d4aad commit c47445ed748ce4fed9d196bd228f6e3a05bc9c89 Author: Tom van Bussel Date: Fri Apr 21 18:59:55 2023 +0200 Add a test for schema evolution of a non-nullable column. GitOrigin-RevId: e8eb0ac8905fb9bfb80a4c114f62f9ec6e17337c commit 4779d473d77f6fe465b05b7bd7c84f8467217d18 Author: Paddy Xu Date: Fri Apr 21 10:15:22 2023 +0200 Minor DELETE test refactoring Minor refactoring of DELETE suite to use `executeDelete` of the superclass. GitOrigin-RevId: e130924fa662c04914341d441175d63ff1fd0a36 commit 29dd9741c64df1a40b14dde86ebdab983d79f0a8 Author: Paddy Xu Date: Thu Apr 20 10:06:48 2023 +0200 [DELETE with DVs] Allow reading CDF from files with Deletion Vectors - Part 1/2 This PR is part of https://github.com/delta-io/delta/issues/1701. A detailed overview of changes is described at https://github.com/delta-io/delta/issues/1701. This is the first PR to add support to allow reading CDC from files that have DV associated. 
In this PR we do some preparation work to allow fine control of how to handle masked rows: keep or drop. Later these two types will be used by CDCReader to pull masked rows out from files. Closes https://github.com/delta-io/delta/pull/1680 GitOrigin-RevId: d0f49ee0a11e604f089d45df1611272a81d47813 commit 8a352eb36f4db94adf1a56882880b5491df2d97c Author: Paddy Xu Date: Wed Apr 19 23:33:49 2023 +0200 Fix attaching metadata column to files inside subquery ## Description We found that the metadata column won't be attached to the file source for query plans generated from temp views created by `SELECT`, for example ```scala sql("CREATE TEMP VIEW v AS SELECT * FROM tab") sql("DELETE FROM v WHERE key = 1 AND value = 5") ``` corresponding to ```scala 'Project [key#599, value#600, __delta_internal_row_index#785L, '_metadata.file_path AS filePath#792] +- SubqueryAlias v +- Project [cast(key#601 as int) AS key#599, cast(value#602 as int) AS value#600, __delta_internal_row_index#785L] +- Project [key#601, value#602, __delta_internal_row_index#785L] +- SubqueryAlias spark_catalog.default.tab +- Relation default.tab[key#601,value#602,__delta_internal_row_index#785L] parquet ``` When executed, the above query plan will fail with an error message `_metadata column does not exist`. The likely cause is the multiple levels of projection, which hide the `_metadata` column from the underlying file scan. This PR fixes the above issue by attaching the metadata column before sending it to the analyzer. The root issue, however, probably lies in the Spark library, which is hard to fix. Adapting existing tests. ## Does this PR introduce _any_ user-facing changes? No.
Closes delta-io/delta#1684 Signed-off-by: Paddy Xu GitOrigin-RevId: e692c4cfd22880ef115ab76a7f0329c0138f41c6 commit 81c7a58e5bfa3aff191cd60f0459bb894822713d Author: Shixiong Zhu Date: Wed Apr 19 01:20:36 2023 -0700 Remove spark's internal metadata stored intentionally in Delta [SPARK-43123](https://github.com/apache/spark/pull/40776) fixed an issue that Spark might leak internal metadata, which caused Delta to store Spark's internal metadata in its table schema. Spark's internal metadata may trigger special behaviors. For example, if a column metadata has `__metadata_col`, it cannot be selected by star. If we leak `__metadata_col` to any column in a Delta table, we won't be able to query this column when using `SELECT *`. Although [SPARK-43123](https://github.com/apache/spark/pull/40776) fixes the issue in new Spark versions, we might have already persisted internal metadata in some Delta tables. To make these Delta tables readable again, this PR adds an extra step to clean up internal metadata before returning the table schema to Spark. GitOrigin-RevId: 60eb4046d55e955379c98e409993b33e753c5256 commit 30fa2c5842c32aebbc05b8cad9f6313c3db7df20 Author: Jackie Zhang Date: Tue Apr 18 21:32:56 2023 -0700 Fix the streaming unsafe escape flag for non-additive schema changes. GitOrigin-RevId: 9dfe9417689563810fa63fd10c035937ea8c6d5b commit 8dfa534b11c300319b00d1ab62b6be9150a16177 Author: Tom van Bussel Date: Tue Apr 18 10:17:28 2023 +0200 Add Row ID specification to protocol This change updates the Delta protocol to add the specification for the Row ID feature. Row IDs are integers that are used to uniquely identify rows within a table. 
See [Row ID design document](https://docs.google.com/document/d/1ji3zIWURSz_qugpRHjIV_2BUZPVKxYMiEFaDORt_ULA/edit?usp=sharing) for additional context: Closes delta-io/delta#1610 GitOrigin-RevId: 506a41f2d0f6495f656bea659f63a1004a0b0ecb commit 08d6f63ce1848f619de115bc476747c92f81636c Author: Prakhar Jain Date: Mon Apr 17 23:39:43 2023 -0700 Refactoring around CheckpointMetadata - Move it from Snapshot to LogSegment. GitOrigin-RevId: 200db588b57e8a5fdbf43a1199533a54fb66b4d0 commit 3bcb8f06bc5958bf3dcff29765c0d8456a1872d5 Author: Jackie Zhang Date: Mon Apr 17 17:54:14 2023 -0700 Fix base index for CDC streaming GitOrigin-RevId: d41d3d29161b53d2f4e12a19ed90378236d184fd commit 3441df161defdcbedf3b6fe7d68106332f4f53b7 Author: Jackie Zhang Date: Thu Apr 13 16:13:55 2023 -0700 Support non-additive Delta source schema evolution with schema tracking log. Closes https://github.com/delta-io/delta/pull/1690 GitOrigin-RevId: 5b06490b5bb16ea1f92f5e68d67674537ab7cb24 commit c87ea1d9a5bdff6794ed677fc4db65c6ab478c11 Author: Chaoqin Li Date: Thu Apr 13 10:54:31 2023 -0700 Test table initialization status in DeltaProtocolVersionSuiteBase. Allow testTableCreation method to accept parameter that specify the initialization status of the table GitOrigin-RevId: 7927f8ce2fe06ff75619096c840f64f058752e57 commit db980a33dbe6842be30a6ec81fe8e36150f048b8 Author: Johan Lasperas Date: Thu Apr 13 14:06:44 2023 +0200 Add not matched by source tests with empty source Adding missing test coverage for merge with WHEN NOT MATCHED BY SOURCE clauses when the source is empty. 
GitOrigin-RevId: f45912d95f427a3bb408fca5c4a5693d93899f1f commit 73d0ff1ee9e5accc42f278169cf9dcb8e1cec127 Author: Paddy Xu Date: Thu Apr 13 15:06:13 2023 +0300 Allow restricting table features to auto-update protocol versions This PR allows marking a metadata table feature as "auto-update capable", such that during ALTER TABLE when its metadata requirements are satisfied, the table's protocol will be automatically and silently bumped to the required version. Note that this mechanism only applies to "metadata requirements". `'delta.feature.featureName' = 'supported'` table prop will still auto-update the table's protocol to support table features. Also, this mechanism only affects existing tables. For new tables, the protocol is always set to the min version that satisfies all enabled features, aka. all features are auto-update capable. For compatibility, legacy table features (features supported by `Protocol(2, 6)`) are always auto-update capable. Specifically, Column Mapping implements its own mechanism to block the usage without protocol version bumps. Closes https://github.com/delta-io/delta/pull/1660 GitOrigin-RevId: 3961434992d2fef2ac89e4bfe67a1b39cd91a5ca commit 3e685faef7271a04c0244beb747b42ba14ced833 Author: Lars Kroll Date: Thu Apr 13 11:57:06 2023 +0200 Cleanup unneeded dv path relativization in vacuum We no longer try to relativize absolute DV paths in VACUUM, since this will never succeed anyway. GitOrigin-RevId: 9b85e416ef0cf05078128445b4c2fbaf8ab22618 commit 28ec93ea023f4eaae922239148cb7d5d2a002f16 Author: Jintian Liang Date: Wed Apr 12 22:59:01 2023 -0700 Split DeltaErrorsBaseSuite into two test cases. Single test case for DeltaErrorsBaseSuite grew to be too big. Split all test cases into two groups. GitOrigin-RevId: d195addb04f854211a8314fc980a75e1429a629b commit ba42944343092469611cf615e24c2b21e28c1368 Author: Prakhar Jain Date: Wed Apr 12 10:56:51 2023 -0700 Minor refactor in Snapshot Management class. 
GitOrigin-RevId: 1d162ccc778c5ae10458c6a694609e7fc19096ba commit 9e1439ce4466898b85ed9718fed6be59fdd93a29 Author: Lars Kroll Date: Wed Apr 12 18:02:26 2023 +0200 Minor refactoring to DataSkipping Reader GitOrigin-RevId: 83e98666aa454cd72c850d96d436b40a3bcd05c6 commit c834cd093c47a86fcfed8a6989e77bc4ca0d4039 Author: Lars Kroll Date: Wed Apr 12 13:41:32 2023 +0200 Testing utilities. GitOrigin-RevId: 72c0dafa3887e5d75d917eab6093f1c0c3ac48cb commit deb290fd730854dff93328a5da1993aff2b2f63e Author: Wenchen Fan Date: Wed Apr 12 11:02:44 2023 +0000 Refactoring in CDCReader Small refactoring in CDCReader to make further development easy. GitOrigin-RevId: 5beaad4b9561b097848c8fc5d142a2b778ca248a commit 6a49af44a02bb7eb697e66c5a43b83e7ca6cf7a9 Author: aokolnychyi Date: Wed Apr 12 02:20:38 2023 +0000 Update test for upcoming Apache Spark 3.4 changes Apache Spark 3.4 adds runtime null check for table insertion, instead of compilation time, and some tests need to be adjusted. GitOrigin-RevId: 15a264780bab735d502aaea70bcd3609b94b7872 commit 547d6ba30d5fc3248c220854394d866a724f8461 Author: Hussein Nagree Date: Tue Apr 11 09:05:50 2023 -0700 Add permissions to extract state method Ensure that extractStateFromDF has the right permissions to run GitOrigin-RevId: dd84cdbef82a8e94a85f76af9941d73bce4233b1 commit 4854a295ad476ffff0d62be314b5907ff5603f76 Author: Ming DAI Date: Mon Apr 10 19:53:30 2023 -0700 Support partition of truncate[int] and truncate[long] in iceberg converter GitOrigin-RevId: 216d6dcae9343cb4cf57390f28e16398611a95e5 commit e0bece9ac47bdac0e840e68d934576edd5fb42d2 Author: Prakhar Jain Date: Thu Apr 6 17:51:41 2023 -0700 Make parts in Checkpoint Instance optional. The parts will be used only for multi-part checkpoint tie breaking. So making them optional for others makes it cleaner. 
GitOrigin-RevId: 8e8987b3f28602e4f9b4a08df9a7778c80b910eb commit f11c355649bf1182dd74b563bc399fa7f1b7fe97 Author: Scott Sandre Date: Wed Apr 12 19:43:58 2023 -0700 [Delta-Standalone] Compute Snapshot protocol and metadata more efficiently (#533) commit aaaad9fb57478f18626142ebc63f1812fb407c68 Author: Bart Samwel Date: Thu Apr 6 15:45:40 2023 +0200 Cleanup: Use Option instead of null for AvailableNow trigger offset The AvailableNow trigger offset was being set to `null` when it was not set. Proper Scala idiom is to use `Option` for this. GitOrigin-RevId: 313314890c554aa99c14cb56034c560ae2148c2e commit 3ffa2f4363ee6b59560ee27f96361946c8ecb117 Author: Lukas Rupprecht Date: Wed Apr 5 13:58:05 2023 -0700 Adds DeltaTableV2.tableExists helper method This PR is a minor refactor that adds a tableExists helper method to DeltaTableV2. This allows callers to directly call tableExists on a DeltaTableV2 instead of going through the DeltaTableV2's deltaLog member (which is currently required). Instead of `table.deltaLog.tableExists`, callers can now just call `table.tableExists` (where `table` is a DeltaTableV2). GitOrigin-RevId: f07c3977733b9df9f5087318a13526c30465ab4c commit 5759de83f4ba135cd9b25328e8cac6aad02e8f47 Author: Scott Sandre Date: Mon Apr 10 13:45:21 2023 -0700 [Delta-Flink Sink] Compute `txnVersion` lazily to reduce CPU utilization on commit (#532) commit 5ab678db978ee28d64b6ba3ff258a54769177fbc Author: Andreas Chatzistergiou Date: Tue Apr 4 20:07:04 2023 +0200 This PR fixes an issue in CDC read where we convert timestamp datatypes to string without extracting the timezone. While the timezone can be independently inferred, DST cannot. This affects records created during the ambiguous hour the DST change occurs. GitOrigin-RevId: b3f6996d05d90e8968ba81b22f6c591b2c730b2b commit a44cbbef4f48366e3135940b84fd27f3709b7757 Author: Andreas Chatzistergiou Date: Tue Apr 4 19:15:48 2023 +0200 Minor refactoring. 
GitOrigin-RevId: 42f9216bf74a42be0a0cca080f52f7864537d753 commit 324536f83462e0d0dbe979f2576ab4bf7f1be9d5 Author: Venki Korukanti Date: Tue Apr 4 08:37:28 2023 -0700 This is part of [support reading Delta tables with deletion vectors](https://github.com/delta-io/delta/issues/1485) It adds support for running VACUUM command on Delta tables with deletion vectors. Main change is to include the DV file of the GC survived FileAction in the list of GC survived files, so that the valid DV files are not considered for deletion. Added tests. Closes delta-io/delta#1676 Signed-off-by: Venki Korukanti GitOrigin-RevId: c5a156779701934366c36b4049648f43c7b97ebe commit ee1aec62aec0eb4cc475567a57709c95c2ec4aaf Author: Johan Lasperas Date: Tue Apr 4 12:11:19 2023 +0200 Add test coverage for special characters in Delta table path This change provides coverage for special characters in Delta table path in all tests by default by including a space in the default temp directory used when creating test tables in DML tests. An improved helper to parse the table name and table alias from tests that behaves correctly when the table name or path contains spaces is provided. GitOrigin-RevId: 989649fd59ad33e55400786fb43924b6d8040480 commit 72b7a74dcc21c2f40963516a803802fa7a408820 Author: Lars Kroll Date: Mon Apr 3 12:37:15 2023 +0200 Better error message for merge materialization exhausted retries failures Add a more user friendly message for repeated merge materialization failures. GitOrigin-RevId: e4f44545ff9e4baea07bd43e79bda4b1c69f6403 commit 9706210d517d8ddb76898d2985673cfe55b30bd0 Author: Chaoqin Li Date: Fri Mar 31 20:30:57 2023 -0700 Move DeltaSink transaction commit logic to a separate function. Refactor: Move DeltaSink transaction commit logic to a separate function. 
GitOrigin-RevId: b42e357fe371fd8fa36df6e13bbec5df5070da8c commit 6b403a73641e5980e4bcf4ccbd00820a372c44e0 Author: Prakhar Jain Date: Fri Mar 31 15:32:52 2023 -0700 Minor refactor around checkpoint metadata GitOrigin-RevId: 4b6977d8b5197e600e5a56c6c74c0f1a443ac8e1 commit 25351c472f135e5b2ed9f2a61d6ef8a3b3ce5540 Author: Rui Wang Date: Thu Mar 30 16:11:35 2023 -0700 Minor refactor to UpdateExpressionsSupport GitOrigin-RevId: 7f2c55b7c86ec4ed99cbb7a2b4539b01d25af9af commit f1bad1f2269e36811d60a678c19033c271418b00 Author: Bart Samwel Date: Thu Mar 30 11:40:48 2023 +0200 Refactor getFileChanges to reduce duplication of logic The `getFileChanges` function repeats the same logic using two iteration models, one for in-memory commits and one for commits that are read from disk. This PR refactors that to not replicate logic, using an iterator factory helper function. GitOrigin-RevId: 41699d6da1930d95d67e5fa49befc066be97cbc1 commit 47119d4fc2a80c97a9e69626c617d31db61adc20 Author: Paddy Xu Date: Thu Mar 30 16:17:44 2023 +0800 [DELETE with DVs] Write stats for DELETE with Deletion Vectors This PR is part of https://github.com/delta-io/delta/issues/1591. A detailed overview of changes is described at https://github.com/delta-io/delta/issues/1591. This PR enables DELETE with DVs to produce correct stats for `AddFile`s, which was a TODO item before. Rather than being recomputed at high cost, the stats are obtained by copying, for each `AddFile` with a DV, the previous stats from the last committed version and changing `tightBounds` to `false`. The stats are guaranteed correct because the physical Parquet file does not change when doing soft deletes. Closes https://github.com/delta-io/delta/pull/1661.
GitOrigin-RevId: 1392dba07977d07856f8fd1c14a1a389954c231b commit f6cef1741b865580c0d6fdc00b3328523b5de442 Author: Rahul Shivu Mahadev Date: Wed Mar 29 13:50:22 2023 -0700 Update ZOrder test names to specify if they are on partitioned tables or not GitOrigin-RevId: 595ef18fddaed5dcbae16f71c9f56013eece1a2b commit 3e50fd46747df4c479c4e81ad97a4689efcbf03b Author: Venki Korukanti Date: Wed Mar 29 11:59:00 2023 -0700 This PR is part of [DELETE with Deletion Vector implementation](https://github.com/delta-io/delta/pull/1591). It adds support for combining multiple DVs, each belonging to a different data file, into one DV file. This helps reduce the number of DV files created. The config that controls the DV file size is `DeltaSQLConf.DELETION_VECTOR_PACKING_TARGET_SIZE` (default value is 2MB). Closes delta-io/delta#1670 Signed-off-by: Venki Korukanti GitOrigin-RevId: c861944ab496aa980d4fa012464a2d5e5fb6588e commit 6f66012950d02ed3152c96928df1c233d48f9b5e Author: Bart Samwel Date: Tue Mar 28 11:36:05 2023 +0200 Cleanup: Don't initialize AvailableNow limit in getBatch The AvailableNow limit is not used in `getBatch`, but we do initialize it. We also initialize it in `latestOffset`, which is where we actually need it. `getBatch` doesn't need it because it works entirely on already-defined ranges. So we can just leave the initialization to `latestOffset`. GitOrigin-RevId: 18fdfe5a3d48a9967cb720eb4d5ed3cef51de2d6 commit 26d50722eefad16c7d512aa2381cc5623156f6dc Author: Lars Kroll Date: Tue Mar 28 09:52:28 2023 +0200 Add pre-commit check for mandatory numRecords with DVs. Add a new check in prepareCommit to ensure that all files with DVs added also have the numRecords stat, as preparation for DV write support.
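As a rough illustration of the kind of pre-commit check described here, the following Java sketch rejects any added file that carries a deletion vector but no `numRecords` stat. It is an assumption-laden model, not the Delta implementation: the `AddFile` class below is a simplified stand-in with hypothetical fields, and `checkNumRecordsForDvs` is an invented name rather than the method used in `prepareCommit`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model; the real AddFile action and prepareCommit logic differ.
class DvNumRecordsCheckSketch {
    static final class AddFile {
        final String path;
        final boolean hasDeletionVector;
        final Long numRecords; // null when the file carries no numRecords stat

        AddFile(String path, boolean hasDeletionVector, Long numRecords) {
            this.path = path;
            this.hasDeletionVector = hasDeletionVector;
            this.numRecords = numRecords;
        }
    }

    // Throws if any added file has a DV but no numRecords stat.
    static void checkNumRecordsForDvs(List<AddFile> addedFiles) {
        for (AddFile f : addedFiles) {
            if (f.hasDeletionVector && f.numRecords == null) {
                throw new IllegalStateException(
                    "AddFile with a deletion vector is missing numRecords: " + f.path);
            }
        }
    }

    // Convenience wrapper that reports the outcome as a boolean instead of an exception.
    static boolean passesCheck(List<AddFile> addedFiles) {
        try {
            checkNumRecordsForDvs(addedFiles);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<AddFile> ok = Arrays.asList(
            new AddFile("a.parquet", false, null),  // no DV, so missing stats are tolerated
            new AddFile("b.parquet", true, 10L));   // DV present and numRecords present
        List<AddFile> bad = Arrays.asList(
            new AddFile("c.parquet", true, null));  // DV present but numRecords missing
        System.out.println("ok batch passes: " + passesCheck(ok));
        System.out.println("bad batch passes: " + passesCheck(bad));
    }
}
```

Files without a DV are deliberately left alone in this sketch, since the requirement quoted in this log applies only to `AddFile`s with DVs.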
GitOrigin-RevId: e5b0d05f630dd10e3f4d51d149aa8ef89e64af3a commit dd72a517a152f76f80855ffae517d852663dac67 Author: Lukas Rupprecht Date: Mon Mar 27 14:39:10 2023 -0700 Fixes and adds more tests to DeltaTableUtilsSuite This PR fixes a failing test in DeltaTableUtilsSuite and adds more test cases. GitOrigin-RevId: dd462ac9997b786f42587fe0089b8c5150ecb212 commit 4fdc2687882cb2485cb59f7d83b2e8a251f33ce6 Author: Ryan Johnson Date: Mon Mar 27 13:41:38 2023 -0700 Async Delta operations run with correctly captured spark session. Today, Delta code that uses scala's `Future` API must manually capture the current spark session and ensure the future's body executes inside `SparkSession.withActive`. This makes it easy to forget, and e.g. asynchronous snapshot update was missing it. Fix the problem by defining and using a new `DeltaThreadPool` that abstracts away the unholy java/scala mix of futures, execution contexts, etc and presents a simple `submit` method that requires users to pass the spark session the async task should use. While we're at it, also ensure that failure to launch an async snapshot update doesn't surface as a user error, since it's anyway just a best-effort operation. GitOrigin-RevId: 423b4d2e35f4ecd3d3f3a7c5decd4d8b2f8b709e commit 9627de80f34c9c4f2b401fa6968060cf01938792 Author: Vitalii Li Date: Fri Mar 24 22:22:53 2023 -0700 Minor refactor to DeltaLog GitOrigin-RevId: d2ec92c47a0af516d6121e51da23e95db4a8303f commit cbc68db811c7008dff4869335f41188e937577e4 Author: Rahul Shivu Mahadev Date: Fri Mar 24 14:06:20 2023 -0700 ## Description - This change(653253545a97f224bb123b8fa2f3d67e7bff3852) added support for Timestamp without timezone data type via a table feature. - This PR updated the protocol spec to reflect these changes - no test as this is a protocol doc update. ## Does this PR introduce _any_ user-facing changes? 
- no GitOrigin-RevId: 4706547157df5f8d7ebce8d7562b733f6e5b87ba commit 2ef0fefb14e2d91ba9a9d69940a0d8e30b02f890 Author: Carl Fu Date: Fri Mar 24 10:17:25 2023 -0700 Minor refactor to WriteIntoDelta.scala GitOrigin-RevId: d2686f2e8570a156c4cf3613e86e288a95917af7 commit 12cdd526bdd2df91dac5cc65cfbbbca4fafbf47c Author: Lars Kroll Date: Fri Mar 24 17:51:16 2023 +0100 In DV Testing use a DeltaTable function instead of instance - `DeltaTable` hangs on to the dataframe it is created with for the entire object lifetime. That means subsequent `table.toDF` calls will return the same snapshot. - A lot of the DV tests are written assuming `table.toDF` would return a new snapshot. - Change the commonly used `withTempDeltaTable` function to pass in a `() => DeltaTable` instead of a raw `DeltaTable` to catch these issues, by providing a fresh instance on every call. GitOrigin-RevId: 0f16c1cb6ef80dd9202437d090f5e3b190034e8f commit 8800cb925705664f328d4e4c8249c7315354d5b5 Author: Johan Lasperas Date: Fri Mar 24 14:25:00 2023 +0100 Fix nested schema evolution tests Some schema evolution tests for nested columns are silently failing due to invalid JSON inputs and expected results being parsed as `null`: the actual and expected results end up both being equal to `null` and the tests are passing. This change makes parsing JSON inputs and results more strict in tests to raise an error on invalid inputs. It also fixes tests that had malformed JSON data. N/A. Confirmed that tests are failing after enforcing strict JSON parsing and fixed them. GitOrigin-RevId: 6432bdb2d25ea9efcab41dd2762de4821060db87 commit 075be756b7eb68eb96dd486553bc1cced2f9c50a Author: Ryan Johnson Date: Fri Mar 24 00:52:36 2023 -0700 Refactor SnapshotManagement.isSnapshotFresh as getSnapshotIfFresh Simplify code that calls `SnapshotManagement.isSnapshotFresh` by changing the function to return the fresh snapshot as an `Option`, so it can be directly consumed by the caller. 
GitOrigin-RevId: 73ca48fd5a6fa00414e5ebc81e2d0c9b7e64421f commit c264251a678732d0842d1efa9e5291f6842906bd Author: Venki Korukanti Date: Thu Mar 23 18:13:13 2023 -0700 Fix test build errors and enable running SBT tests There was an [inadvertent change](https://github.com/delta-io/delta/pull/1622/files#diff-55af8247ec95043557b9810df657aade6cf363e20e83a29ed961b4f1f00ccb7aL56) that went in recently causing the SBT tests to not run. Couple of failing tests are disabled for now. They will be fixed in followup PRs. GitOrigin-RevId: 9b8c83e74368f659c988346441c5e5f326ad84c2 commit 40d837949f7e555b0d09c501d60cf673d8c79b59 Author: Ming DAI Date: Thu Mar 23 16:59:57 2023 -0700 Upgrade Iceberg to 1.1.0 and adapt new API GitOrigin-RevId: 1cfcf6b31cc9910082d503e9b041b86c6753e5e1 commit 1bb39dbf31bacb444477683b071acf02fc95cfc7 Author: Eric Maynard Date: Thu Mar 23 11:36:12 2023 -0700 Add a minimum number of files to CompositeLimit GitOrigin-RevId: 1fa2b5dad57e2d2825779dd34a3502220fdd4e54 commit 106e1a70a3951ce12f134be48d984a4cda6b672c Author: Christos Stavrakakis Date: Thu Mar 23 11:14:13 2023 +0100 Allow commands to set CommitInfo tags Extend `OptimisticTransaction.commit()` to allow commands to specify custom tags that are set in the `CommitInfo` object. GitOrigin-RevId: e8e1f9fc7847b60b99ce7e43af2a3838971bd15b commit f882b66d75135541013f69c331c2fc62f60e8c98 Author: Prakhar Jain Date: Wed Mar 22 20:12:19 2023 -0700 Add additional metrics in Delta Log Update GitOrigin-RevId: d904f1c546119751001d89b12a8fb173bc790c60 commit c250938220659c66d570536fb2a4f27460a6c076 Author: Christos Stavrakakis Date: Wed Mar 22 19:37:21 2023 +0100 Minor refactor to CDC reader. Extend `CDCDataSpec` to contain the `CommitInfo` object. GitOrigin-RevId: d9ba901dd9474bfe8c1ae007f0ab19eb7b22b147 commit be2e99df9763a34dbc84e1479a06d160abe921c6 Author: Andreas Chatzistergiou Date: Wed Mar 22 16:53:53 2023 +0100 Added a new property in RemoveFile. 
GitOrigin-RevId: 3ad447b522a47902aad86472d2dccc8088685d89 commit 65f86f2b7cbfac8e14e6379f50f43c4a32a1b9c2 Author: Johan Lasperas Date: Wed Mar 22 15:52:07 2023 +0100 Move MergeStats into own file Move `MergeStats` outside of `MergeIntoCommand` and into its own file. This change only moves code around and doesn't update it in the process, only updating imports when needed. The refactor doesn't introduce any functional change; existing tests, e.g. MergeIntoMetricsBase, cover it. GitOrigin-RevId: 63e8e5a855487ef74cc5a4fb16abc21b6ffca95f commit d4de19996eaa76752cfda8848f6f83143d390186 Author: Allison Portis Date: Wed Mar 22 14:18:58 2023 +0000 Remove generated column related error Apache Spark 3.4 will support generated columns and we no longer need this error in Delta Lake. GitOrigin-RevId: 938a3647d53e1368aec44fd7af30a40a3906b834 commit 9a9b60910b53c76459df280821957580adcfd275 Author: Lars Kroll Date: Wed Mar 22 13:39:00 2023 +0100 Add pre-commit check for mandatory numRecords with DVs Add a new check in `prepareCommit` to ensure that all files with DVs added also have the `numRecords` stat, as preparation for DV write support. GitOrigin-RevId: 00c3f6c2c5aa6997b104b1dc53a77a0461521ea6 commit b16b1bc0231c182eea3441fda012c584bc1ec995 Author: Allison Portis Date: Tue Mar 21 23:02:54 2023 -0700 Upgrade to Spark 3.3.2 Upgrade spark version to 3.3.2. Updates one test for a fix in 3.3.2. Resolves #1479 Closes delta-io/delta#1644 Signed-off-by: Allison Portis GitOrigin-RevId: c7d6751d28558b41a1d4c72c1301c367fd505339 commit 28148976839063dbdcba430387a5fafba64b44a8 Author: Lukas Rupprecht Date: Tue Mar 21 20:29:06 2023 -0700 Fixes bug in DeltaTableUtils.findDeltaTableRoot Previously, DeltaTableUtils.findDeltaTableRoot would throw an exception if it was passed a base path that is converted to a Uri with an empty path component (e.g. `s3://my-bucket`). This PR catches such cases and prepends a slash when combining a base path with a _delta_log subdirectory.
It also adds a new test suite for DeltaTableUtils. GitOrigin-RevId: ebf74770dc3b0cdfddeadb97114d38cb00802995 commit c2baa30237570a59f83a015b5b9491ca3102d0e8 Author: Christos Stavrakakis Date: Wed Mar 22 00:29:53 2023 +0100 Implement RoaringBitmapArray bitwise and operation GitOrigin-RevId: 6f5c71a94b9505767c9c72213fe7eb4de88bf827 commit ab346af6bd928e005f566b6c1ef85220964ad715 Author: Naga Raju Bhanoori Date: Mon Mar 20 13:10:32 2023 -0700 Improve error message to not expose commit validation config DELTA_COMMIT_VALIDATION_ENABLED shouldn't be disabled by users. This PR improves the error message by no longer suggesting that users turn it off. GitOrigin-RevId: 22f20e1e51a0efbe7006a49d8f9d85eb1aa345ba commit 981307aa3c336fbed15d9eb22d99feedb8febe26 Author: Ming DAI Date: Fri Mar 17 16:21:47 2023 -0700 Improve error message of Clone Parquet source blocking Delta sharing tables. GitOrigin-RevId: 5afa7addad259d127b6d6a0ddc1b02e11eadd545 commit 48388b9aca560fb215019796bf7d73c074809b4e Author: Scott Sandre Date: Fri Mar 17 15:53:20 2023 -0700 Add support for `.show()` in COUNT(*) queries Previously, metadata-only aggregate pushdown was only working for `COUNT(*)` queries when you were collecting the result, as opposed to calling `.show()`. This PR fixes that bug. Added a UT that captures the optimized logical plan and checks that it is using the LocalRelation created by OptimizeMetadataOnlyDeltaQuery. Also did a performance test locally. Created a table with 100M rows and 100K files and ran the query `sql("SELECT COUNT(*) FROM ").show()` - master took ~161 seconds. - this PR took ~16 seconds. Thus, this is a ~10x improvement. Resolves delta-io/delta#1571.
Closes delta-io/delta#1643 Signed-off-by: Scott Sandre GitOrigin-RevId: e266e5d82220ca331e117f202abc6f085a99448c commit 5bf749cf0576831a98986de20f33eaa557c22d19 Author: Paddy Xu Date: Fri Mar 17 17:57:47 2023 +0800 Allow supporting table features via `DeltaTable` user-facing API This PR adds a new user-facing API `addFeatureSupport`, allowing users to add table features to a table's `protocol` action. Along with an existing API `upgradeTableProtocol`, we now provide sufficient API to manipulate table protocol. Closes https://github.com/delta-io/delta/pull/1649 GitOrigin-RevId: 926f9406727227386addfb42cb75588e0d5e6ec6 commit aac9aa8e4821ada7f55ca7de5197edaf647a9e15 Author: Bart Samwel Date: Fri Mar 17 09:03:12 2023 +0100 Get rid of previousOffset in DeltaSource DeltaSource was storing its latest processed offset in `previousOffset`, in `getBatch`. At the same time, it was ignoring the `startingOffset` that is passed into `latestOffset`, which is the parameter that is actually intended to convey this information. This PR gets rid of `previousOffset` and actually uses the `latestOffset` parameter for this. In addition, the PR changes the `DeltaSource` code to use `DeltaSourceOffset` consistently inside the internals of the class, instead of passing the base class `Offset` everywhere and then converting back to `DeltaSourceOffset` all over the place. Furthermore, it adds a utility function to do the conversion from `Offset` to `DeltaSourceOffset`. GitOrigin-RevId: b47de2fe9ce3381c09b4c5efd15ecb828c3f39e2 commit d6d3e6545a95e4221e214454d25cca64666f5578 Author: Chaoqin Li Date: Wed Mar 15 21:10:37 2023 -0700 Clean up conflict check logic for appId txn Clean up conflict check logic for appId txn to make it more concise and understandable. 
GitOrigin-RevId: 4fa615a3b5404ae007448e8c4b483c80582463a6 commit ea2a4ab485bfe0d52a35bf91ac3f512bc2de42f7 Author: Venki Korukanti Date: Wed Mar 15 11:16:06 2023 -0700 Fix bugs around blocking write operations on DV enabled tables ## Description Currently the disabling of write operations on DV tables added as part of #1603 is not complete. It still allows the following: * VACUUM can proceed if the logging is disabled - fix to always check for DV enabled table before running the VACUUM * DVs can be enabled using table property `'delta.feature.deletionVectors'='supported'`. Consider this table property when checking for DV presence in tables Added tests. Closes delta-io/delta#1642 Signed-off-by: Venki Korukanti GitOrigin-RevId: 9cae36161a69f149e63142e97ab7b03daaaa37c6 commit 59a0c23442b8eb2104efcea63f95453468084aeb Author: Lars Kroll Date: Wed Mar 15 17:43:15 2023 +0100 Minor refactoring to `DeletionVectorsTestUtils.scala`. GitOrigin-RevId: 0d69fc88aceecdd3b983f79e3345e841c344f74f commit 72b388a1c75d33fb3221e7875b2c43f8831aa6fb Author: Paddy Xu Date: Wed Mar 15 17:29:18 2023 +0100 Disable reading CDC from DV-supported Delta tables ## Description This PR adds a check to disable reading CDC from a table that has DVs marked as `supported` in the protocol. Reading CDC with DV is not yet supported in the current version of Delta Lake. Not needed. Trivial change. Closes https://github.com/delta-io/delta/pull/1640. Signed-off-by: Paddy Xu GitOrigin-RevId: 586b8fa1bedbf53cc537ee89e0c4a63032ff0c9c commit 51d882c72c3293c927d5ccae75dd186e56f34c2c Author: Paddy Xu Date: Wed Mar 15 15:50:12 2023 +0100 Assert DV does not exist during CLONE and table prop overrides (continued) The previous PR does not work on Delta Lake. This PR fixes that.
GitOrigin-RevId: c2465d45b6ac80930678cc9036a0fdba9c270946 commit ae0938a295a68e30be91f414733a15b345452263 Author: Tom van Bussel Date: Wed Mar 15 10:48:36 2023 +0100 Minor test refactoring Small test refactor to remove usage of RDDs and to be more lenient with which function in `DeltaTableBuilder` is required to throw `AnalysisException`s. GitOrigin-RevId: 9cc4041d1a920587cc94fa39c065386147324786 commit 42e6ff58050eb4e3189fe928fabe36bcf6090dab Author: Hussein Nagree Date: Tue Mar 14 11:40:45 2023 -0700 Factor out computedState from Snapshot.scala Factor out Snapshot.State into its own class SnapshotState, with a SnapshotStateManager to provide all the utility methods previously present in Snapshot. Changes are lifted as-is to SnapshotStateManager. Additionally, re-arrange the remaining methods in Snapshot to logically group together similar code. GitOrigin-RevId: c244b4c80faf3af3499020918ef2e591643cec6a commit d56c031fb53ec3e99db28fada71353e43b492596 Author: Paddy Xu Date: Tue Mar 14 11:58:16 2023 +0100 Assert DV does not exist during CLONE and table prop overrides This PR adds a check to `txn.commitLarge` to avoid cloning DVs to a table while disabling DV table prop: ```sql CREATE TABLE tbl DEEP CLONE tbl_dv TBLPROPERTIES('delta.enableDeletionVectors'='false') ``` Before: `tbl_dv` will have DVs. After: the transaction will fail. 
Closes https://github.com/delta-io/delta/pull/1647 GitOrigin-RevId: f7ee8fba30383756c631c987765cc8c0447c5e73 commit 97b409a5bb9cfe06c7f1151c0429aefb3bb2920a Author: Vitalii Li Date: Mon Mar 13 18:29:40 2023 -0700 Minor refactor to DeltaLog GitOrigin-RevId: b1400dea6ac8d98fc17ae2e23a98054ed36fabe6 commit afda535fc426f324943dab60068a4aab683f1d6b Author: itholic Date: Sat Mar 11 05:07:06 2023 +0000 Update test Update test in DeltaTableCreationTests GitOrigin-RevId: 0596e7aaa90fd8e9513c31a9fb47677cbb30a7c1 commit 44457df06f4ed5e2614fbd0917e5a87e9e958bf2 Author: Tom van Bussel Date: Fri Mar 10 16:04:00 2023 +0100 Improve Python tests by using `ReusedSQLTestCase` This PR improves the Python tests by sharing a `SparkSession` for all test cases using `ReusedSQLTestCase`. In order to make these change we have to clean up the created tables after each test using Pyspark's `table` context manager. GitOrigin-RevId: f91a6bf65fb2254bbefa66a723bb296b1e971792 commit 83513484bc064b73cb5855e8c7aa8f244fcb4119 Author: Paddy Xu Date: Fri Mar 10 10:40:19 2023 +0100 Return empty CDC result when no commit is in range This PR improves timestamp handling for CDC reads, so that a range with no commit in between can return an empty DF instead of throwing an exception: ``` version: 4 5 ---------|-------------------------------------------------|-------- ^ start timestamp ^ end timestamp ``` Before: fail with `end version 4 is older than the start version 5`. After: success and return an empty DF. GitOrigin-RevId: c048b00df27b18c7072c205481340007e8bba6f7 commit 53b8464418fd9df1c2dfb65afe0f2b822dd50c3a Author: Chaoqin Li Date: Thu Mar 9 22:27:09 2023 -0800 log the offset range when delta source is processing a large batch Sometimes delta source is stuck in a large batch without logging any progress. Log the offset range in getBatch and latestOffsetInternal for observability. 
GitOrigin-RevId: a28bacbe2056c4dfe51f7507aea3123d50b5969d commit 2f60ac294f82d49152b19c146095d98db5666663 Author: Bo Gao Date: Thu Mar 9 17:40:03 2023 -0800 Refactored duplicated code in DeltaSource regarding getting changes in Delta log. ### Description We have several duplicated code in function `filterAndIndexDeltaLogs`. This PR aims to reduce the code duplication. - Use same function `validateCommitAndDecideSkipping` for both in-memory file and file too large to be read into memory cases to reduce duplicated code. - Changed variable names `ignoreChanges` -> `shouldAllowChanges`, `ignoreDeletes` -> `shouldAllowChanges` to resolve confusions since they are different from the `ignoreChanges` and `ignoreDeletes` options we have. Also moved them inside `validateCommitAndDecideSkipping` as local variables. GitOrigin-RevId: 0777c07feb0582438f10b0f7f083b1d43529f790 commit 52e549d84e377cc31d388aed449b182625807c44 Author: Jackie Zhang Date: Thu Mar 9 16:08:43 2023 -0800 Introduce DeltaSourceSchemaLog to support non-additive schema evolution Closes delta-io/delta#1634 GitOrigin-RevId: bec8b0cc1448b3e12cb08346812602585150e902 commit 4a7a279f4a46e4feaf56d020980aae21e3afb7a8 Author: Christos Stavrakakis Date: Thu Mar 9 13:59:25 2023 +0100 Remove stale method. Remove the stale trackReadPartitionPredicates method from OptimisticTransaction.scala. GitOrigin-RevId: 30d667137fd5bf0f4be558dfa86870c4dbb32b58 commit 303d640ab895402be080252def0595455a27335b Author: Wenchen Fan Date: Thu Mar 9 17:10:48 2023 +0800 Allow altering column type with char/varchar to follow the Apache Spark behavior Today we can't alter a char/varchar column of a Delta table to a different but compatible char/varchar type. This is too strict and we should follow Apache Spark to allow it. 
GitOrigin-RevId: 84e5550457edfe4075dfd130d689302679f82e8e commit c1f3b37d734890cce63f02c53b8e2738e45069de Author: Andreas Chatzistergiou Date: Thu Mar 9 08:00:21 2023 +0100 Currently, the Delta Statistics column names are only accessible through the snapshot instance or by directly inheriting from UsesMetadataFields. This also causes the fields to be instantiated for every snapshot instance. This PR exposes Delta Statistics through a singleton. GitOrigin-RevId: b472247070541e47344c3af402f2fef719a37dd9 commit e3a0378a0e91a796938b6b12ffc708dbebf62dbd Author: Prakhar Jain Date: Wed Mar 8 16:10:36 2023 -0800 Use Pattern matching for different Delta log file types Use pattern matching to simplify code and reduce regex matches around delta/crc/checkpoint versions. GitOrigin-RevId: 87884c293e3d5d8114c1eac9214b76130b408bea commit 7396cfb7f05cfc7398c03e29e154aedf5dfa7266 Author: Allison Portis Date: Wed Mar 8 15:53:03 2023 -0800 Add DELTA_TESTING env var to support table features test Some table features tests rely on some "test table features" that are only supported when `DELTA_TESTING=1`. Otherwise tests fail with ``` [info] org.apache.spark.sql.delta.DeltaTableFeatureException: Table features configured in the following Spark configs or Delta table properties are not recognized by this version of Delta Lake: delta.feature.testwriter. ``` This PR sets this in `build.sbt` and also removes setting it from `run-tests.py` since it should no longer be needed. Closes delta-io/delta#1622 Resolves delta-io/delta#1602 Signed-off-by: Allison Portis GitOrigin-RevId: 0f1c5d8c6384712192338437914ea74dd89bee03 commit deb2a62a118da224f5adbad1b5a0aef57bb0f2eb Author: Lars Kroll Date: Wed Mar 8 15:52:36 2023 -0800 Update the protocol to reflect that for all files with deletion vectors the `numRecords` statistic is mandatory.
Closes delta-io/delta#1624 Signed-off-by: Allison Portis GitOrigin-RevId: 4697410ed9154d43939be3deafadca9d74c4fe98 commit a7269a29522f84585c795e0768c5f37ae64cf3a8 Author: Naga Raju Bhanoori Date: Wed Mar 8 12:48:15 2023 -0800 Removing config commit info enabled DELTA_COMMIT_INFO_ENABLED is an old config that is already enabled by default. This PR removes the config, as there are no reasons for a customer to explicitly disable the config. This allows for lesser dependencies on DELTA_USER_METADATA and DELTA_VACUUM_LOGGING_ENABLED which currently rely on DELTA_COMMIT_INFO_ENABLED to be enabled. GitOrigin-RevId: 95f24cebe5c518a5f4d55bbd1e1852fb764b9bc1 commit 7ef1e887f3d0d21648ff056d2b52c9679d8ef62d Author: Lukas Rupprecht Date: Tue Mar 7 13:56:40 2023 -0800 Better unit testing This PR changes the visibility of `DataSkippingReader.pruneFilesByLimit`, which allows for better unit testing of the method. GitOrigin-RevId: ac2be239ed05a77d7bde1587ebe6ded31226c6a9 commit 3de308e631155db4f72dc7a6932ed4bb84d3346a Author: Christos Stavrakakis Date: Tue Mar 7 19:47:18 2023 +0100 Make OptimisticTransaction collect all predicates Currently `OptimisticTransaction.readPredicates` contains only the partition predicates by which the transaction has queried files, as instances of `DeltaTablePartitionReadPredicate`. This commit updates the transaction code to collect both partition and data predicates in `DeltaTableReadPredicate` objects. GitOrigin-RevId: 0a63dbd33eaa533c5a99ee09e6e463a6438f5148 commit f801d1b753f045204e32d6e7790e3f6782c8a865 Author: Xi Liang Date: Tue Mar 7 10:32:13 2023 -0800 Minor refactor to DescribeDeltaDetailsCommand. 
GitOrigin-RevId: ad87fc479a192da122365150c1c0cfb01403e1b4 commit 3090db5607e4336ededff50a1831e762436a0cd0 Author: Paddy Xu Date: Tue Mar 7 18:48:19 2023 +0100 Allow ignoring protocol-related configs in Spark session defaults when creating a table This PR introduces a Spark SQL config `delta.ignoreProtocolDefaults` that affects `CREATE TABLE` and `REPLACE AS` commands to ignore protocol-related Spark session defaults, including: - `spark.databricks.delta.properties.defaults.minReaderVersion` - `spark.databricks.delta.properties.defaults.minWriterVersion` - configs with keys start with `spark.databricks.delta.properties.defaults.feature.` When these session defaults are ignored, the user must specify the protocol versions and table features manually, or the table will get a min protocol that satisfies its metadata. For example, an empty table will get `Protocol(1, 1)`, regardless of the default protocol version `(1, 2)`. Closes delta-io/delta#1628 GitOrigin-RevId: 05670d6a1cf46cba1592a1dfd460d02f081c51b0 commit 71b43a62abb24a5c6fe877af464ed85341f7f336 Author: Vitalii Li Date: Mon Mar 6 20:09:01 2023 -0800 Minor refactor to DeltaCatalog GitOrigin-RevId: 42245c83ba2d92d2b0bbe16eaacb5a1bd6ea71ad commit 4b38c4b68c749caa5b12991f326d7a9931133bf2 Author: lzlfred Date: Mon Mar 6 17:15:36 2023 -0800 Remove extra exists call during Delta log checkpoint creation GitOrigin-RevId: e74e85d366f68a8f29b87f4bebe9b8a5ef5f903f commit 595e15eff5a3218467e99c3bdeda4e90fcd91662 Author: Alkis Evlogimenos Date: Mon Mar 6 15:18:36 2023 +0000 Remove `LocatedFileStatus` wrapping from synthesized `FileStatus`. When synthesizing a `FileStatus` from delta metadata and we have no `BlockLocations` instantiate `FileStatus` instead of `LocatedFileStatus`. 
GitOrigin-RevId: edca7a79f075e9bca7b440f885d904df57bddec3 commit 8ddc9e3978e09c36a3f888557cbc278b08b0622e Author: Prakhar Jain Date: Thu Mar 2 15:47:36 2023 -0800 Add numSetTransaction metric to CommitStats class GitOrigin-RevId: d162d37fc54263023d65080d2af0b011d1384e21 commit e83bd06f8362ec430fa3c66a8acc001546c2ac83 Author: Prakhar Jain Date: Thu Mar 2 13:21:29 2023 -0800 Minor refactoring, make tests independent of default checkpoint interval GitOrigin-RevId: 4cb152ec47d76b6d7545f4bb9bc31ff43bf62e2c commit 653253545a97f224bb123b8fa2f3d67e7bff3852 Author: Rahul Shivu Mahadev Date: Wed Mar 1 12:36:41 2023 -0800 Adds support to TimestampNTZ type in Delta Previously this type was not supported in Spark and Spark 3.3 added support for this To prevent(gate) older writers/readers from reading to this column we need a protocol(feature) bump that does the gating * This PR creates a new TableFeature TimestampNTZ feature that is a ReaderWriter feature * This is how to feature is automatically enabled Scenario | Previously | With this change -- | -- | -- User creates a new table with timestamp NTZ column | AnalysisException saying type not supported | Protocol upgraded to feature vector protocol and TimestampNTZ Feature automatically enabled and DDL successful User adds a new column of type TimestampNTZ on legacy protocol version | AnalysisException saying type not supported | User DDL completes successful.(Protocol also upgraded automatically) User adds a new column of type TimestampNTZ on table with TimestampNTZFeature enabled on the table | AnalysisException saying type not supported | User DDL completes successful. Closes delta-io/delta#1626 GitOrigin-RevId: d92b62895cf1cdc3dfaed9e97d2ef6e9378f98a3 commit 380425bd6eb4f54572e858af0bdc26bfbec17371 Author: Hussein Nagree Date: Wed Mar 1 11:30:41 2023 -0800 Reorganize methods in Snapshot Separate out some methods in Delta DataSkippingReader & Snapshot code for convenience. 
GitOrigin-RevId: ca64e6421a869f55253ba3d1687a729781880326 commit 27888a90951baef0405b64bab718bb8acedb14c6 Author: Andreas Chatzistergiou Date: Wed Mar 1 15:40:08 2023 +0100 Minor refactoring. GitOrigin-RevId: b149961c88b430f4eedcb1f5b187595af2264a32 commit d2223faf20787244b685b8b02fa2665590f64221 Author: Ming Dai Date: Tue Feb 28 15:45:19 2023 -0800 Add more metrics into delta log. GitOrigin-RevId: 3510f3d4ca58ae70f86c1d07067712658bc662f7 commit d55cb5f8ae8c76bf6f120f540767258dc872686a Author: Bo Gao Date: Tue Feb 28 10:08:28 2023 -0800 Added skipChangeCommits flag to skip commits that contain removeFiles in DeltaSource ### Description Added skipChangeCommits flag to skip commits that contain removeFiles in DeltaSource for structured streaming. The purpose of this change is to replace the current `ignoreChanges` flag, which could result in data duplication issues when using Delta as a structured streaming source with DELETE/UPDATE/MERGE INTO operations (e.g. GDPR). Behavior for existing `ignoreChanges` flag: - When a `removeFile` is detected in a commit, the `removeFile` is ignored. If an `addFile` exists as well, the `addFile` is processed by structured streaming. Behavior for new `skipChangeCommits` flag: - When a `removeFile` is detected in a commit, both the `removeFile` and the `addFile` (if it exists) are ignored by structured streaming to prevent duplication in the sink. Since a lot of users are using the `ignoreChanges` flag today, to not change the expected behavior, I will not modify the existing `ignoreChanges` flag. Instead, I added a new `skipChangeCommits` flag. Once the new `skipChangeCommits` flag is available, we can start deprecating the old `ignoreChanges` and `ignoreDeletes` flags.
Resolves https://github.com/delta-io/delta/pull/1616 GitOrigin-RevId: 125755f4bc3f03968a4ab60332abf708195df1bc commit bac5f40b2c33abf3ba16175a2560e7bcb06c8df8 Author: Ryan Johnson Date: Tue Feb 28 09:27:18 2023 -0800 Add protocol to DeltaParquetFileFormat, for table features As table features become more and more prominent, it will become important for `DeltaParquetFileFormat` to understand a table's `Protocol`, in addition to its `Metadata`. Here, we pre-emptively add that capability (a messy change) so that it can be used easily when the time comes. GitOrigin-RevId: c9694da060bc52b1d82e83c7db82cf72cdc62508 commit 021e439b541cdda26fa44f54f477676aa44ae1cb Author: Andreas Chatzistergiou Date: Tue Feb 28 09:36:08 2023 +0100 Exposed the largest element of the RoaringBitmapArray in the BitmapAggregator. GitOrigin-RevId: 5ccf483e229516891014b89916f393219b4c2ffb commit 18aed87f658db1484f5b3044ed898b934e36f749 Author: Tom van Bussel Date: Tue Feb 28 08:42:41 2023 +0100 Remove type checks from test_deltatable These checks won't be compatible with Spark Connect in the future, as Spark Connect uses `pyspark.sql.connect.DataFrame` instead of `pyspark.sql.DataFrame`. GitOrigin-RevId: 46167d358c7d3e381c5bc27e2bc4278061d36c88 commit 8e295b22d939a2a34440b9d541171b877e8e3845 Author: Ming DAI Date: Fri Feb 24 19:40:50 2023 -0800 Minor refactoring of Clone code for readability GitOrigin-RevId: 0991e1d588cb720cde007c70af9858cd9654ef8a commit 5fc938d481ad2080bf1fe9ad24f75d4089b5c3e6 Author: Lars Kroll Date: Fri Feb 24 18:51:01 2023 +0100 Add isEmpty, first, and last to RoaringBitmapArray - Added some additional methods to `RoaringBitmapArray` (`isEmpty`, `first`, `last`). - Added unit tests for the new methods. GitOrigin-RevId: 2a03d9f05630ba8fb7a6eb749632393019493380 commit 2a5a2bce0e45decc28aec7c6f1ffc929b1b84e82 Author: David Lewis Date: Fri Feb 24 10:50:21 2023 -0700 Clean up path handling for Spark 3.4 scala style. Also fix some mishandled path <-> URI conversions.
GitOrigin-RevId: b42f52c4363f26dfe944167ad73be521a28ec315 commit b724c3b898807f3014795703e5626bdce167281b Author: Yaohua Zhao Date: Fri Feb 24 21:34:12 2023 +0800 Minor code re-organization GitOrigin-RevId: 9dad7fbcdfb3eab72e7c81677d2d13f04a9df9ce commit 7edb182268639b6464f8263aab4bce5ad1efe54a Author: Scott Sandre Date: Mon Mar 13 10:06:08 2023 -0700 Upgrade pgp plugin to improve release automation (#521) commit a7c137370bd92c1581c55147c2b319a498c2c124 Author: Scott Sandre Date: Sat Mar 11 15:55:33 2023 -0800 Upgrade sonatype plugin to version 3.9.15 (#519) commit 67985e367cd3668a296650f66907a8c0eafcc8af Author: Scott Sandre Date: Sat Mar 11 15:50:47 2023 -0800 Update build.sbt for better release automation (#518) commit b733cf7b73abcfebb2ff84de6c83a00e9efe6531 Author: kristoffSC Date: Fri Mar 10 19:21:58 2023 +0100 GlobalCommitter_Perf - fix intensive logging on a hot path for Global committable and committer objects. Use class member field for DeltaLog in DeltaGlobalCommitter. (#515) Signed-off-by: Krzysztof Chmielewski commit bfce882075a1b667f2a3c12bc4d1dc58d3838869 Author: kristoffSC Date: Fri Mar 10 19:18:33 2023 +0100 FlinkComplexNestedTypeTest - adding test for Delta Sink to try to write rows with nested complex types such as Array Array and Array> that are not supported in flink-parquet library. 
(#507) Signed-off-by: Krzysztof Chmielewski commit 1c18c1d972e37d314711b3a485e6fb7c98fce96d Author: lzlfred Date: Thu Feb 23 09:16:52 2023 -0800 Minor Delta refactor GitOrigin-RevId: bf5899b860e10789f413254b4901fda5893c30d2 commit 398bf03edb13081099169e4bd64a26c7f9067019 Author: Paddy Xu Date: Thu Feb 23 14:10:03 2023 +0100 Make use of `mergeGlobalConfigs` method to process default table feature configs This PR modifies the existing `mergeGlobalConfigs` method, which is responsible for copying session default configs to table metadata, to copy table feature configs (configs start with `spark.databricks.delta.properties.defaults.feature.`) to table metadata, so CREATE TABLE and REPLACE TABLE will be using them. Before this PR, we are reading session default configs manually for table features. This creates some problems w.r.t. REPLACE, SHALLOW CLONE, and RESTORE commands. After this PR, the above commands will not process session defaults manually but rely on the existing, battle-tested logic to do so. These commands will have the following behavior: - REPLACE will respect table features set in session defaults. - SHALLOW CLONE and RESTORE will not respect table features set in session defaults. GitOrigin-RevId: bcc862b7923b160547be1e9e6fcc2171f3cb7fb0 commit d4e49c290a47e7493a350cbfe74aa84e8855f1b6 Author: Paddy Xu Date: Thu Feb 23 11:33:51 2023 +0100 Use "supported" to replace "enabled" for table feature configs This PR changes the value of table feature configs to be `supported` instead of `enabled`, to avoid confusion with the same word in the sentence `Change Data Feed is enabled on a table when delta.enableChangeDataFeed=true`. The old value `enabled` will still work. Before this change: ```sql CREATE TABLE tbl ... TBLPROPERTIES ('delta.feature.featureName' = 'enabled') ``` After this change, both of the following will work: ```sql CREATE TABLE tbl ... TBLPROPERTIES ('delta.feature.featureName' = 'enabled'); CREATE TABLE tbl ... 
TBLPROPERTIES ('delta.feature.featureName' = 'supported'); ``` New terminology: - `supported` means a feature is listed in the protocol, and can be enabled when its metadata requirement is satisfied. For example, `changeDataFeed` is supported when the feature name is in the protocol's `writerFeatures`, but no CDF is captured unless `delta.enableChangeDataFeed` is set to `true`. - `enabled` means a feature's metadata requirement is satisfied. For example, `changeDataFeed` is enabled when `delta.enableChangeDataFeed` is set to `true`, and there are CDFs captured. Closes https://github.com/delta-io/delta/pull/1609 GitOrigin-RevId: d9304b6866355f91f44ef217d2831d92e39ce27e commit 91dc9808dc29d1081f1d4ae2577fbad5f41bf323 Author: Jackie Zhang Date: Wed Feb 22 17:02:58 2023 -0800 Minor refactor to `ConvertIcebergToDeltaPartitioningSuite`. GitOrigin-RevId: af0601ba3f7d2712ab901a89bd02b738fcede3cf commit 9a4c38b29a585acc03de40fd5e940408de5dba85 Author: Lars Kroll Date: Wed Feb 22 15:25:03 2023 +0100 Minor refactoring to RoaringBitmapArray. GitOrigin-RevId: 95de91e5b9559e8de614e2210bc6076784a20bb4 commit 419e011389080d58324a8a29b37325478f86b277 Author: Andreas Chatzistergiou Date: Wed Feb 22 12:12:17 2023 +0100 This PR adds a validation in DML operations to ensure we do not generate DVs that contain out-of-bounds indexes. This is achieved by adding a temporary property, maxRowIndex, to the DeletionVectorDescriptor. The new property is not stored in the log. GitOrigin-RevId: f3b7d76f90560032e0ee02eb7529a3441857df2c commit 94b902050c400e7b33657dafa322387578af2535 Author: Allison Portis Date: Tue Feb 21 15:16:40 2023 -0800 Support CDF table value function in SQL queries ## Description Adds support for change data feed functions `table_changes` and `table_changes_by_path` in SQL to read CDF data. This is done by injecting into Spark's table function registry in `DeltaSparkSessionExtension`. We then resolve these expressions in `DeltaAnalysis`.
Adds unit tests in `DeltaCDCSQLSuite` ## Does this PR introduce _any_ user-facing changes? Yes, users will now be able to use functions `table_changes` and `table_changes_by_path` to read the CDF data from Delta tables. For example: ``` SELECT * FROM table_changes('tbl', 0, 2) ``` ``` +---+------------+---------------+--------------------+ | id|_change_type|_commit_version| _commit_timestamp| +---+------------+---------------+--------------------+ | 2| insert| 1|2023-02-16 14:22:...| | 3| insert| 1|2023-02-16 14:22:...| | 1| insert| 1|2023-02-16 14:22:...| | 5| insert| 2|2023-02-16 14:22:...| +---+------------+---------------+--------------------+ ``` Closes delta-io/delta#1604 GitOrigin-RevId: c9abb6325c546c4344b5dc6358cf269cd7a718e9 commit eaba4058f501f477913f8dbe6b177f9d90c88211 Author: Jackie Zhang Date: Tue Feb 21 13:12:02 2023 -0800 Add an integration test for Delta Iceberg Converter Closes delta-io/delta#1573 GitOrigin-RevId: a84094218d4edeff2e1eb460f0becbc0b40eb2b8 commit d730b53c43b60d81705cd314fbc0961dd2eff746 Author: Ole Sasse Date: Tue Feb 21 14:07:49 2023 +0100 Make CDCReader.CDC_PARTITION_COL an internal column CDC_PARTITION_COL is missing in the internal column logic in SchemaUtils. The code path is skipped in most cases, because there is short-circuiting logic around it. It is only triggered when writing out custom metadata columns. GitOrigin-RevId: cfba46aa984ac1692d6c034def6cd9876ec452e7 commit 60d6d1ee8a585702176dc16ab8e5224ecfa5e09c Author: Paddy Xu Date: Mon Feb 20 15:35:09 2023 +0100 Rename enabledTableFeatures to tableFeatures in DESCRIBE DETAILS This PR changes the output of `DESCRIBE DETAILS`, a row header from `enabledTableFeatures` to `tableFeatures`. The reason to drop "enable" is because: the listed features are required by the protocol for readers/writers, and does not mean that the features are enabled/used. 
GitOrigin-RevId: 4a505d21852fc3fcd1bba31c670e6cff379e07ad commit c8913fc2c16d34d9895f06d098a477c7e83cedec Author: Paddy Xu Date: Fri Feb 17 20:54:52 2023 +0100 Make `DESCRIBE DETAILS` show implicitly-enabled features This PR makes `DESCRIBE DETAILS` show table features implicitly enabled by the current table protocol, when the protocol does not support table features. Example query: ```sql CREATE TABLE tbl TBLPROPERTIES (delta.minReaderVersion='1',delta.minWriterVersion='2'); DESCRIBE DETAIL tbl; ``` Result: ``` ... minReaderVersion 1 minWriterVersion 2 enabledFeatures [appendOnly,invariants] ... ``` GitOrigin-RevId: 4724ebc1e9751a39754a8f9ac05e99063f92b9b2 commit 66ef697a3f21185b669efbda2d51a24772bf765f Author: Paddy Xu Date: Fri Feb 17 17:48:24 2023 +0100 Bump only the writer version when delta.feature. props enable writer-only features This PR is a follow-up of https://github.com/delta-io/delta/pull/1600 where we do allow `delta.feature.featureName = enabled` to bump table protocols to support table features. But there is a bug where enabling a writer-only table feature will also bump the reader protocol version to `3`. This PR fixes this issue so that reader versions are bumped only when a reader-writer feature is enabled. Before: ```sql CREATE TABLE tbl TBLPROPERTIES (delta.feature.appendOnly = 'enabled') -- tbl gets Protocol(3, 7, [], [appendOnly]) ``` After: ```sql CREATE TABLE tbl TBLPROPERTIES (delta.feature.appendOnly = 'enabled') -- tbl gets Protocol(1, 7, None, [appendOnly]) ``` GitOrigin-RevId: b92e983abf049f1d73be20f86f193778329ab90b commit 67bdf8c47070e1284e70dec482d6078805661987 Author: Paddy Xu Date: Fri Feb 17 10:31:43 2023 +0100 Allow delta.feature.xxx table property to bump protocol silently This PR allows the table property `delta.feature.featureName = 'enabled'` to silently bump protocol version to support table features. Therefore, no manual version bump is required. 
Note that even if the feature being enabled is a legacy feature, the final table will still have `Protocol(3, 7)` because the `delta.feature.` prefix is exclusively used by table features, and once used, we assume that the user is explicitly asking to bump the table to support table features. Before: ```sql -- assume tbl is on Protocol(1, 1) ALTER TABLE tbl SET TBLPROPERTIES ( delta.feature.columnMapping = 'enabled' ) -- Exception: "table features are required but not supported" ALTER TABLE tbl SET TBLPROPERTIES ( delta.minReaderVersion = '3', delta.minWriterVersion = '7', delta.feature.columnMapping = 'enabled' ) -- tbl will have Protocol(3, 7, [columnMapping], [columnMapping]) ``` After: ```sql -- assume tbl is on Protocol(1, 1) ALTER TABLE tbl SET TBLPROPERTIES ( delta.feature.columnMapping = 'enabled' ) -- tbl will have Protocol(3, 7, [columnMapping], [columnMapping]) ALTER TABLE tbl SET TBLPROPERTIES ( delta.feature.deletionVectors = 'enabled' ) -- tbl will have Protocol(3, 7, [deletionVectors], [deletionVectors]) ALTER TABLE tbl SET TBLPROPERTIES ( delta.minReaderVersion = '3', delta.minWriterVersion = '7', delta.feature.columnMapping = 'enabled' ) -- tbl will have Protocol(3, 7, [columnMapping], [columnMapping]) ``` Closes https://github.com/delta-io/delta/pull/1600. GitOrigin-RevId: e98170f18554fd7a257093b2ee556604fe64f90f commit 336dae36a5be4a178fd5a183ab399ff53c722ad1 Author: Dhruv Shah Date: Thu Feb 16 17:26:12 2023 -0800 CREATE TABLE LIKE Support for Delta ### Description This PR aims to enable CREATE TABLE LIKE for Delta. CREATE TABLE LIKE is a SQL DDL that creates an empty new table using the definition and metadata of an existing table or view. CREATE TABLE LIKE is not supported for creating Delta tables. Currently, when a user tries to use CREATE TABLE LIKE with Delta, it throws an exception saying this command is not supported for Delta tables. ### What changes were proposed in this pull request? This PR aims to enable CREATE TABLE LIKE for Delta.
[CREATE TABLE LIKE](https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-like.html) is a SQL DDL that creates an empty new table using the definition and metadata of an existing table or view. [CREATE TABLE LIKE](https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-like.html) is not supported for creating Delta tables. Currently, when a user tries to use CREATE TABLE LIKE with Delta, it throws an exception saying this command is not supported for Delta tables. Specifically, the following table properties are being copied over: | Metadata Field/Type | Non-Delta Tables | Delta Tables | |----------------------|-----------------------------------------------------|--------------| | Description(Comment) | Yes | Yes | | Schema | Yes | Yes | | | Partition Columns | Yes | Yes | | Configuration | Yes | Yes | | Delta Protocol | No(Current Default Protocol for that spark session) | Yes | ### How was this patch tested? This patch is tested by adding unit tests for CREATE TABLE LIKE in the DeltaCreateTableLikeSuite.scala file. Other existing tests were modified to reflect the change that Delta Tables are now supported for the Create Table Like Command. ## Description This PR aims to enable CREATE TABLE LIKE for Delta. CREATE TABLE LIKE is a SQL DDL that creates an empty new table using the definition and metadata of an existing table or view. CREATE TABLE LIKE is not supported for creating Delta tables. Currently, when a user tries to use CREATE TABLE LIKE with Delta, it throws an exception saying this command is not supported for Delta tables. Closes delta-io/delta#1584 GitOrigin-RevId: de25640020ae12ed2d15bfbcb4e46c0305f9e4a2 commit 94eb62e2f7034aa2e07d3bb209e2e937c3be8e3d Author: Paddy Xu Date: Thu Feb 16 14:23:33 2023 +0100 Misc code refactoring for OptimisticTransaction This PR refactors a confusing variable name, and fixes a grammar issue in error messages. 
GitOrigin-RevId: a819658a623de9d0b9f7f8791af4577ecbe93213 commit 41d0f825b696924bfe2a6447ef86df5f9228e8f0 Author: Paddy Xu Date: Thu Feb 16 11:41:20 2023 +0100 Fix formatting and missing CDF feature name in Delta PROTOCOL.md This PR fixes a formatting issue that causes the Markdown engine to fail to render a table, and an issue where the CDF feature name is missing from the table. Closes delta-io/delta#1581 Closes https://github.com/delta-io/delta/pull/1581 GitOrigin-RevId: 36a9eb54963ea46053d025b2c431e7fa3a4669a5 commit 417589c198e6676d42f8b44e423fc744805f345f Author: Jackie Zhang Date: Wed Feb 15 17:05:58 2023 -0800 Ensure delta.columnMapping.maxColumnId is set correctly during any commit GitOrigin-RevId: e3368f35fc58544435d2cc967aeb64c694fadaea commit 3aa32d50b66caa0a534d480fdb5a09b12ba23316 Author: kristoffSC Date: Thu Feb 16 01:27:41 2023 +0100 Flink_1.16.1 - update Flink version to 1.16.1 for Flink connector. (#500) Signed-off-by: Krzysztof Chmielewski commit e0e9b0095dcc5b1c4372474a54c87428340fe899 Author: Venki Korukanti Date: Wed Feb 15 11:12:21 2023 -0800 Temporarily block write operations on Delta tables with deletion vectors ... until we have implementation and correct handling of all update operations on Delta tables with deletion vectors. Users will still be able to read tables containing deletion vectors. Closes delta-io/delta#1603 Signed-off-by: Venki Korukanti GitOrigin-RevId: b8e09cb2e49be6bbc82b19d32ce54fb2a0c1d05c commit ecd420af5e1d2fda884e91315aec8ad72103d315 Author: Christos Stavrakakis Date: Wed Feb 15 17:25:01 2023 +0100 Create trait for bitmaps stored as Deletion Vectors Convert `StoredBitmap` to a trait and make `DeletionVectorStoredBitmap` extend this trait. GitOrigin-RevId: 19c280bafca4182de174e134843953dd5a562edc commit dbf315eacd35c0ea95e04450f297fd3404036c53 Author: Christos Stavrakakis Date: Tue Feb 14 17:31:57 2023 +0100 Return table name in DeltaTable.detail() The `DeltaTable.detail()` API currently does not return the name of the Delta table.
This is because `DescribeDeltaDetailCommand.getPathAndTableMetadata()` skips the catalog lookup when a path is set, assuming that either a path or a table identifier will be provided. This commit fixes this issue by always performing a lookup to the catalog when a table identifier is provided. Closes delta-io/delta#1549 GitOrigin-RevId: a899637cd384e6e303809ac3106b9159bcfae0e4 commit 5fddbc18d71f32982dae2d35655414e6b0ca5170 Author: Venki Korukanti Date: Tue Feb 14 07:41:00 2023 -0800 [DELETE with DVs] Update DeleteCommand to generate DVs instead of rewriting files This PR is part of [DELETE with Deletion Vector implementation](https://github.com/delta-io/delta/pull/1591). Detailed overview of changes are described in [feature request issue](https://github.com/delta-io/delta/pull/1591). Closes delta-io/delta#1596 GitOrigin-RevId: 3cbf0283345819336a7727bccbc4bae03e380038 commit 88393a764a6522eaa685967724df0d9ee53cab48 Author: Venki Korukanti Date: Tue Feb 14 03:45:15 2023 -0800 [DELETE with DVs] Update `DeltaParquetFileFormat` to return a row index This PR is part of [DELETE with Deletion Vector implementation](https://github.com/delta-io/delta/pull/1591). It updates the `DeltaParquetFileFormat` to populate row index when the schema contains the columns. In order to generate the row_index, we disable file splitting and filter pushdown into reader. This is temporary until we upgrade to Parquet reader in Spark 3.4 that generates the row_indexes with file splitting and filter pushdown. 
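To illustrate why the row index depends on reading a file's rows consecutively, here is a small plain-Python sketch. It is not the `DeltaParquetFileFormat` code: `with_row_index` and the sample rows are hypothetical, and the only point is that applying a filter before numbering the rows yields positions that no longer match the file.

```python
from typing import Iterable, List, Tuple


def with_row_index(rows: Iterable[str]) -> List[Tuple[int, str]]:
    """Number rows by their position in the stream they are read from."""
    return list(enumerate(rows))


# Hypothetical contents of one data file, in physical order.
file_rows = ["a", "b", "c", "d"]

# Reading the whole file in order: each index is the true position in the file.
indexed = with_row_index(file_rows)

# If a filter ran inside the reader first, the surviving rows would be
# renumbered, so "c" would be reported at position 1 instead of 2.
pushed_down = with_row_index(r for r in file_rows if r != "b")
```

The same mismatch would occur if a reader started numbering from zero in the middle of a split file, which is why both features are disabled in this sketch's scenario.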
Closes delta-io/delta#1593 Signed-off-by: Venki Korukanti GitOrigin-RevId: 2a3cf2816eff576346ad3f0edebd3e09cd86f780 commit 48040a58ab9ed9a04534ac50c1676147f7c071be Author: Venki Korukanti Date: Tue Feb 14 03:44:15 2023 -0800 Block manifest file generation when DVs are present This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at https://github.com/delta-io/delta/issues/1485) Manifest file contains the list of Parquet data files paths that can be consumed by clients with symlink table type reading capability. However when DVs are present on top of the Parquet data files, there is no way to expose the DVs to symlink table reader. It is best to block manifest file generation when DVs are present in the table. Closes delta-io/delta#1595 GitOrigin-RevId: 2589bcbd08c59bb65c5dc465d31b5e0e4aabf5c0 commit 0b52f911a96a08afce2439049f04215dadeff516 Author: Daniel Tenedorio Date: Mon Feb 13 16:50:22 2023 -0800 Minor refactoring change only. This change performs a minor adjustment to whitespace. GitOrigin-RevId: 0fee6751fc3bd9ef821c6c08610830d1813173b4 commit 5533714ec07ccaa85c8d434ef0db19b819f15acb Author: Vitalii Li Date: Fri Feb 10 21:02:45 2023 -0800 Minor refactor to DeltaCatalog GitOrigin-RevId: 3a22a579caba0cd7365eca8422e35dceb5688e76 commit aaf3cd77dae06118f5cb7716eb2e71c791c6a148 Author: Daniel Tenedorio Date: Thu Feb 9 13:18:59 2023 -0800 Minor refactoring change only. GitOrigin-RevId: aafd68aaa6e66c01e1392f8aed5275fb829f5140 commit 2e3b97806153638a862ff8483c67d67e81ce9654 Author: Venki Korukanti Date: Thu Feb 9 11:49:10 2023 -0800 Fix a bug where selecting `_metadata` columns on Delta tables with DVs fails Currently when rewriting the scan `PreprocessTablesWithDVs`, we use the `Scan` relation output as the data output for `HadoopFsRelation`. Data output comes directly from the data file or data file reader. 
Scan output contains additional columns such as `_metadata` which are populated in the Scan operator once the data is read from the file reader. Added a test. Also added a golden table for partitioned tables with DVs and tests. Closes delta-io/delta#1583 Signed-off-by: Venki Korukanti GitOrigin-RevId: ee0d1d43e38597b653fdd5f47163b32710393054 commit e54e1a25ff906848b924ea4c583402bf9b16ba87 Author: Venki Korukanti Date: Thu Feb 9 11:24:32 2023 -0800 [DELETE with DVs] Add a bitmap aggregator UDAF This PR is part of [DELETE with Deletion Vector implementation](https://github.com/delta-io/delta/pull/1591). It adds a UDAF implementation that takes a column of longs (basically the row indexes) and generates a `RoaringBitmapArray`. Closes delta-io/delta#1592 Signed-off-by: Venki Korukanti GitOrigin-RevId: 04b998b2832670aee19a1ee3cf1f93b4114e75ea commit 30669ba6765003fa816c9511c75e6dddf692d7a7 Author: Felipe Pessoto Date: Thu Feb 9 11:24:07 2023 -0800 Add optional/required column to AddFile and RemoveFile spec. Resolves #1566 Add the optional/required column to AddFile and RemoveFile spec. I used the Scala source code as source of truth. I also formatted the markdown table, so every column is aligned. Docs only No Closes delta-io/delta#1588 Signed-off-by: Scott Sandre GitOrigin-RevId: 11e26432717050436d74de3154b7a2dcdbe1b5b3 commit 6815482708eba947ae057c287f6776303ef51d35 Author: Jackie Zhang Date: Wed Feb 8 16:48:07 2023 -0800 Block Converting MOR Iceberg Table GitOrigin-RevId: 97bcaf472fd5be202fc5cf8e66f854718db702ab commit 89715537f95278844085d22a9a2c41d827d7bbb3 Author: itholic Date: Wed Feb 8 13:16:03 2023 +0000 Minor test refactor GitOrigin-RevId: 8bd581603d8fe97e334cbd9a9d62b888a0db266a commit 952130615983bc6805cbd3f01d69369c6a13de9d Author: Bo Zhang Date: Wed Feb 8 18:31:41 2023 +0800 Remove DeltaErrors.deltaFileNotFoundHint This change removes an unused method `DeltaErrors.deltaFileNotFoundHint`.
GitOrigin-RevId: 7966d59628992ba18dc2db4872f69cebd0eb28cf commit 142667a35392cfe6a0ed2e23e90246fe5a7b3286 Author: itholic Date: Tue Feb 7 21:44:23 2023 +0000 Minor refactor to DeltaAlterTableTests GitOrigin-RevId: 0c64f23eb8d7d5ac21b1f569589493bd7a1c2a65 commit 7e446fded11d1769e98b68aeb7a26f5f42ff4f3d Author: Ryan Johnson Date: Mon Feb 6 13:15:36 2023 -0800 Support generated column for yyyy-MM-dd date_format The generated column machinery supports partition pruning for `date_format` for `yyyy-MM` and `yyyy-MM-dd-HH` but missed `yyyy-MM-dd`. Add the latter to close the gap. GitOrigin-RevId: aee7e7ecda7761327d9b35e777d3059b4ce6b613 commit b38f4046ba0a83d5da9eab7ec7eb6c7c8da2cc27 Author: Tom van Bussel Date: Mon Feb 6 17:55:59 2023 +0100 Small test refactor GitOrigin-RevId: 3f3c7051612b9e62698f7edeb767b0a8183eedd3 commit 668a51528be54e113d35dfed06c0db5ba9d0b703 Author: Lin Ma Date: Fri Feb 3 21:01:57 2023 -0500 Add table column information to OptimizeMetrics This PR adds two metrics related to the table column information to the Optimize metrics: (1) the number of columns in the table, and (2) the number of columns to collect the data skipping stats in the table. 
GitOrigin-RevId: 64c73d6b1d17c89c3a3f574cd6ba1f9ae3e77224 commit 3694ddc138317855dbfff4f9c011f16b26c20b5f Author: Rahul Shivu Mahadev Date: Fri Feb 3 17:38:38 2023 -0800 Record byte level metrics for Delete and Replace Where Start recording byte level metrics with Delete (numAddedBytes, numRemovedBytes) and ReplaceWhere(numOutputBytes, numRemovedBytes) that will be accessible via DESCRIBE HISTORY command GitOrigin-RevId: 119020918df37fb93632c5bdd56536fc9556013a commit 6a1a58df55625a7b84ca0dbf7858107205ee1a24 Author: Serge Rielau Date: Fri Feb 3 16:26:07 2023 -0800 Revise SQLSTATEs for a set of DELTA_* error classes We start issuing "proper" SQLSTATEs with a special attention to differentiating between system, unsupported, compile time, runtime GitOrigin-RevId: 6e505e32f4265eecc045caf241b9c6e9e8c20b65 commit 645d6fb4a89070ee5b2635ca26b265e23749ad84 Author: Paddy Xu Date: Fri Feb 3 18:57:28 2023 +0100 Separate metadata-enabled feature logic from assertion check Refactor the `verifyNewMetadata` method to `assertNewMetadata` which only does assertions. GitOrigin-RevId: 0f5ee8ff52bdd35cf1da2276f2a18da88dbb0276 commit affd5774def34bc4d553aff805cd17afe7d57ea1 Author: Johan Lasperas Date: Fri Feb 3 17:55:00 2023 +0100 Schema evolution in merge for UPDATE/INSERT non-star actions This change implements support for schema evolution in merge for UPDATE SET and INSERT (...) VALUES (...) actions. Before this, schema evolution was only triggered with UPDATE SET * and INSERT * actions. 
The following example fails on resolving `target.newColumn` before this change; with schema evolution enabled, it now succeeds and adds `newColumn` to the target table schema: Target schema: `key: int, value: int` Source schema: `key: int, value: int, newColumn: int` ``` MERGE INTO target USING source ON target.key = source.key WHEN MATCHED THEN UPDATE SET target.newColumn = source.newColumn ``` Changes: - When schema evolution is enabled, allow resolving assignments in merge actions against the source table when resolving against the target table fails. - When schema evolution is enabled, collect all new columns and nested fields in the source table that are assigned to by any merge action and add them to the table schema, taking into account * actions. Extensive tests added to MergeIntoSuiteBase covering schema evolution for both top level columns and nested attributes. GitOrigin-RevId: 3387f19fd987d769fdc719255bbdbcaf92db6bba commit 8a70d113514159262a5c580ec0a9878753cd2840 Author: Johan Lasperas Date: Fri Feb 3 12:31:37 2023 +0100 Generate update expressions for new nested target fields in PreprocessTableMerge With schema evolution, new nested columns can be added to the target table during a merge operation. The code used to generate expressions to set these columns to null when they are not otherwise set by a merge clause only handles top-level columns. It is extended here to also handle evolution of nested attributes. GitOrigin-RevId: e0592fdc6ce6c3b98a5153f48b65ff4b1d905499 commit 1405322f2d8dc5fe4878e32803faacb69aadbeb7 Author: Prakhar Jain Date: Thu Feb 2 23:48:28 2023 -0800 Use listingPrefix at all listFrom callsites in Delta This PR modifies the checkpointPrefix and makes it a generic listingPrefix to take care of all deltaLog filetypes. This makes code more uniform and future proof against all fileTypes irrespective of whether they are lexicographically smaller/larger than the checkpoint path.
We also unify all other listFrom callSites which currently do not use this helper. GitOrigin-RevId: 0bdc3e2061d460f5830ffdb56ac881e21e15775a commit c6bbdaf8dae8fc118e478ba2a2831d3efb8f4423 Author: Yingyi Bu Date: Thu Feb 2 14:37:40 2023 -0800 Minor refactor DataSkippingDeltaTests GitOrigin-RevId: 8136303d4436f16d4416cfbafd4e238c13b4465c commit 67846927def3500f563afc412afe2c8c7396886d Author: Serge Rielau Date: Wed Feb 1 14:34:35 2023 -0800 Revise SQLSTATEs for a set of DELTA_* error classes We start issuing "proper" SQLSTATEs with a special attention to differentiating between system, unsupported, compile time, runtime GitOrigin-RevId: 3a43fc477f53785f867e512d95f2a398803d5e57 commit 4e6f8075b421c64927779f322d9c025b9d55dda7 Author: Herivelton Andreassa <57838698+handreassa@users.noreply.github.com> Date: Wed Feb 8 22:21:37 2023 -0300 Fixing powerbi connector tutorial (#498) commit 8a28c7ea488147321e54bda0d8a50573bf49aacd Author: Scott Sandre Date: Tue Feb 7 11:41:58 2023 -0800 update version to 0.6.1-SNAPSHOT commit 553476dcc0e8d556f8c813f0cc8e4a3b872322e3 Author: Venki Korukanti Date: Tue Jan 31 12:38:17 2023 -0800 Support OPTIMIZE on Delta tables with deletion vectors This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at https://github.com/delta-io/delta/issues/1485) This PR adds support for running OPTIMIZE (file compaction or Z-Order By) on Delta tables with deletion vectors. It changes the following: * Selection criteria * File compaction: earlier we used to select files with size below `optimize.minFileSize` for compaction. Now we also consider the ratio of rows deleted in a file. If the deleted rows ratio is above `optimize.maxDeletedRowsRatio` (default 0.05), then it is also selected for compaction (which removes the DVs) * Z-Order: This hasn't been changed. 
We always select all the files in the selected partitions, so if a file has DV it gets removed as part of the Z-order by * Reading selected files with DV for OPTIMIZE: We go through the same read path as Delta table read which removes the deleted rows (according to the DV) from the scan output. * Metrics for deleted DVs Added tests. Closes delta-io/delta#1578 GitOrigin-RevId: f20d234357fa5b24e56aea098fa60f026ad1f160 commit 639c7b327cf48200bd3cb514ff1a7cd9fb257492 Author: Tom van Bussel Date: Tue Jan 31 14:35:52 2023 +0100 Turn DeltaTableTests into a mixin to allow reuse This will allow future reuse of this suite when we are adding Delta support for Spark Connect. GitOrigin-RevId: db28f942de9d3ad989028cce2d7dbbbe39cdffe3 commit e12974a1c9457c3312686bd294ca0ec46418693c Author: Andreas Chatzistergiou Date: Tue Jan 31 12:23:25 2023 +0100 Minor test utility change GitOrigin-RevId: 987da1337fa27c01cd861f7155cb2f9fbd9d6451 commit 8a5aa66857f473605b48ade6800952d1146c982e Author: Venki Korukanti Date: Mon Jan 30 15:47:13 2023 -0800 Minor indentation clean up GitOrigin-RevId: 0c442a1719a4192ea0906bba590dae1acec3b91d commit d87fdfd3199c197b15d95c1cc28669e4d13bc66a Author: Pranav Date: Mon Jan 30 15:42:13 2023 -0500 Minor refactor to DeletionVectorDescriptor.scala Make existing method public GitOrigin-RevId: fdac404b6ed03e12d32e1a1e7a03a6ee366d65c2 commit be3e49a145d01f7113b5005d37e33e8bbc735825 Author: Venki Korukanti Date: Mon Jan 30 11:27:47 2023 -0800 Support limit pushdown on Delta tables with DVs. This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at https://github.com/delta-io/delta/issues/1485) Currently the limit pushdown code doesn't take into account the DVs when pruning the list of files based on the `limit` value. Update the code to take `dv.cardinality` when finding the number of rows in a `AddFile`. 
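The limit pushdown adjustment described here can be sketched in plain Python, outside of Spark. This is only an illustration under stated assumptions, not Delta's Scala implementation: `FileEntry`, `num_records`, `dv_cardinality`, and `prune_files_for_limit` are hypothetical names, and the sketch assumes each file's physical row count is known from its statistics.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileEntry:
    """Hypothetical stand-in for an AddFile with an optional deletion vector."""
    path: str
    num_records: int         # physical rows stored in the Parquet file
    dv_cardinality: int = 0  # rows marked as deleted by the deletion vector

    @property
    def logical_records(self) -> int:
        # Rows a reader would actually return once the DV is applied.
        return self.num_records - self.dv_cardinality


def prune_files_for_limit(files: List[FileEntry], limit: int) -> List[FileEntry]:
    """Keep files, in order, until their logical rows cover the limit."""
    selected: List[FileEntry] = []
    remaining = limit
    for f in files:
        if remaining <= 0:
            break
        selected.append(f)
        remaining -= f.logical_records
    return selected


files = [FileEntry("a", 10, dv_cardinality=8), FileEntry("b", 10), FileEntry("c", 10)]
chosen = [f.path for f in prune_files_for_limit(files, limit=5)]
```

In this example file `a` holds 10 physical rows but only 2 logical ones, so counting physical rows alone would stop after `a` and leave a `LIMIT 5` query short of rows; subtracting the DV cardinality pulls in file `b` as well.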
It also adds * comments around data skipping code and how data skipping continue to work without giving wrong results (with a bit performance overhead) when querying tables with DVs * test utilities to associate DVs with `AddFile` for testing purposes. Closes delta-io/delta#1577 GitOrigin-RevId: fdb0c78db9b2d0ac37cce67886110a32688fd531 commit 9e065c86999b188598cc3eb28da604c23e67df20 Author: Jungtaek Lim Date: Mon Jan 30 15:34:12 2023 +0900 Loosen scope of DeltaSource.initLastOffsetForTriggerAvailableNow Loosen scope of DeltaSource.initLastOffsetForTriggerAvailableNow from private to protected, so that it is open for extension. This is in line with the scope of DeltaSource.initForTriggerAvailableNowIfNeeded. GitOrigin-RevId: aad820562d2503c95d46bf9840734c0b04aa1a71 commit 0c6ea501f23ea4e6c5a3e0476435a34792446e10 Author: Allison Portis Date: Mon Jan 30 01:33:13 2023 +0000 Minor updates to DeltaInsertIntoTableSuite GitOrigin-RevId: c8cd480f6093cebaeba464ae140d46a044bb89ae commit bdf9938e6895c45238c40d1565e0473f8470c04d Author: Serge Rielau Date: Fri Jan 27 13:10:49 2023 -0800 Revise SQLSTATEs for a set of DELTA_* error classes We start issuing "proper" SQLSTATEs with a special attention to differentiating between system, unsupported, compile time, runtime GitOrigin-RevId: 2a54963f48afa43bf582076709007ef0c8cb4c59 commit 60d2d114df632368e35b06b123fd1c96c88e267d Author: Venki Korukanti Date: Thu Jan 26 13:35:55 2023 -0800 Support checkpoints on Delta tables with deletion vectors This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at https://github.com/delta-io/delta/issues/1485) Add support of checkpoints when updating Delta tables containing deletion vectors. In checkpointing job, read the existing deletionVector in file action and write it to checkpoint file. 
Closes delta-io/delta#1576 GitOrigin-RevId: 9ee23c8a876c45f539fc81cef100382f6efe6fae commit 79d3d0bbbf1e816899edf6aa2ba8375a0e0a3577 Author: Serge Rielau Date: Thu Jan 26 10:53:34 2023 -0800 Refine small number of SQLSTATEs The following SQLSTATE changes: - DELTA_ACTIVE_SPARK_SESSION_NOT_FOUND: 42000 => 08003 - RESERVED_CDC_COLUMNS_ON_WRITE: 42000 => 42939 GitOrigin-RevId: d1047a87659d1d8d7e44b52385e3d7c79e8c6669 commit deeb59d60c71f38f221b95ff97397dd69e21c8cb Author: Venki Korukanti Date: Thu Jan 26 05:50:55 2023 -0800 Add DV table plan transformer trait to prune the deleted rows from scan output This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at https://github.com/delta-io/delta/issues/1485) Add a trait (used by `PrepareDeltaScan` to modify its output) to modify DV enabled tables to prune the deleted rows from scan output. Planner trait to inject a Filter just after the Delta Parquet scan. This transformer modifies the plan: * Before rule: ` -> Delta Scan (key, value)` * Here we are reading `key`, `value` columns from the Delta table * After rule: ` -> Project(key, value) -> Filter (udf(__skip_row == 0)) -> Delta Scan (key, value, __skip_row)` * Here we insert a new column in Delta scan `__skip_row`. This value is populated by the Parquet reader using the DV corresponding to the Parquet file read (refer [to the change](https://github.com/delta-io/delta/pull/1542)) and it contains `0` if we want to keep the row. * The scan created also disables Parquet file splitting and filter pushdowns, because to generate the `__skip_row` we need to read the rows in a file consecutively in order to generate the row index. This is a cost we need to pay until we upgrade to the latest Apache Spark, which contains Parquet reader changes that automatically generate the row_index irrespective of the file splitting and filter pushdowns. * The scan created also contains a broadcast variable of Parquet File -> DV File map.
The Parquet reader created uses this map to find the DV file corresponding to the Parquet file. * The Filter created keeps only the rows whose `__skip_row` equals 0 * And at the end we have a `Project` to keep the plan node output the same as before the rule is applied In addition * it adds the `deletionVector` to DeltaLog protocol objects (`AddFile`, `RemoveFile`) * It also updates the `OptimizeMetadataOnlyDeltaQuery` to take the DVs into consideration when calculating the row count. * end-to-end integration of reading Delta tables with DVs in `DeletionVectorsSuite` In follow-up PRs, we will be adding extensive tests. Closes delta-io/delta#1560 GitOrigin-RevId: 3d67b6240865d880493f1d15a80b00cb079dacdc commit 6e482678decb438dba961f288f45f91b6477cfc2 Author: Fredrik Klauss Date: Thu Jan 26 14:43:50 2023 +0100 INSERT REPLACE WHERE with less than condition should be evaluated as less than INSERT REPLACE WHERE with a less than condition was evaluated as a less than or equal condition. This was wrong, and led to data being deleted that shouldn't have been. GitOrigin-RevId: f188a2be4d62f87a0711a93c4c75168bab98d8d9 commit f934153e0b4e2ae2d8f54862a3255754d3e12ca3 Author: Allison Portis Date: Wed Jan 25 17:29:27 2023 -0800 Additional tests for inserts/overwrites and generated columns GitOrigin-RevId: cacd71552bf4f4bd1d04ea659be4229f1dba6a21 commit 1f03aef3afe944b2723d70eb2bf0ffbd778ae5ab Author: Yingyi Bu Date: Tue Jan 24 18:47:02 2023 -0800 Minor formatting/cleanup changes GitOrigin-RevId: add3f0eab03f84344f97782b86f3f41fbac20d3d commit 4adb95d4c2c22d6c3b589755d86b4080c8f6cb64 Author: Tom van Bussel Date: Tue Jan 24 20:54:09 2023 +0100 Minor cleanup Turns a `var` that could be a `val` into a `val`.
GitOrigin-RevId: 1f6e6cd801061086d33272e88ec73608536f08fd commit 496b529eb2ba49c5c22c259a9985acac7f9c62fa Author: Ahir Reddy Date: Tue Jan 24 11:11:56 2023 -0800 Update sbt-assembly to fix Scala shading (#495) commit c442ddac8f8b1c65bdb94e2bf39ed2974ba14257 Author: Andreas Chatzistergiou Date: Tue Jan 24 17:08:50 2023 +0100 Minor refactoring GitOrigin-RevId: 628cf9ccc155bfc9debdacfb9443d5c0ad2348a1 commit 17f3c58747a4833841e98d1165c0840ac520749b Author: Lukas Rupprecht Date: Tue Jan 24 07:07:03 2023 -0800 Adds spark session to cleanupTableDefinition Minor refactor to make a spark session available to the catalog update during table creation. GitOrigin-RevId: 70e4693bf4e7d04d35f6b28cf71e2be71d835f99 commit b6aa5893f5381dbf2639952eed5569ae6e50331e Author: Kam Cheung Ting Date: Mon Jan 23 16:43:20 2023 -0800 Minor refactoring GitOrigin-RevId: dde389b2fab0adca0d14cd5a2d408d34df1bc848 commit a46a42704cb25be3eee1cc75d805be2e8ff7912e Author: Jackie Zhang Date: Fri Jan 20 15:23:38 2023 -0800 Minor refactor to Delta streaming source. GitOrigin-RevId: 4716f84e9ae00b7169e74e3b57769011c8d3d39c commit c144366351e21826c1b1caa3c56ebefe3978da28 Author: Rajesh Parangi Date: Fri Jan 20 09:55:11 2023 -0800 Minor refactoring GitOrigin-RevId: 617e41de5c14277472416e7c8de4923c79316583 commit abefa4f2e4ca3b58528a77a708ce23399ec43f25 Author: Jackie Zhang Date: Fri Jan 20 09:24:35 2023 -0800 Fix a column mapping helper method name GitOrigin-RevId: a4da677f7c8c111ad95438ce16757ecf51a60217 commit 77c1e0a20778a8b7df22741a4579e2658126ff7b Author: Lukas Rupprecht Date: Thu Jan 19 12:24:13 2023 -0800 Fixes schema evolution for complex types during INSERT OVERWRITE GitOrigin-RevId: f3310fa7be21364b3c36f47115d18d24ddc47c42 commit 098543674ba74bf8f26f7e53a557a1e4a7cd11fb Author: Yingyi Bu Date: Thu Jan 19 12:01:48 2023 -0800 Let DeltaScan::unapply returns the original plan. This way, call site can decide to canonicalize the plan or not. 
GitOrigin-RevId: 262af5bbdc76c93db14bf5832a0fec4477c5b2c3 commit b2249970c9f55ffdb7e2cf556802ef547f2a3132 Author: Hussein Nagree Date: Thu Jan 19 11:54:46 2023 -0800 Vacuum test refactorings GitOrigin-RevId: ee48a0dac2ce92292c43140280c0412028cf3742 commit 7f8d0419aec0f20c2becba528a133258c9a30a60 Author: Andreas Chatzistergiou Date: Thu Jan 19 15:55:18 2023 +0100 Replaced the use of `input_file_name()` in `StatisticsCollection.Recompute` with `_metadata.file_path` plus minor Refactoring. GitOrigin-RevId: 7505995e704c75de3312fa4c799f142f637b7c25 commit f1ec3a0d77a2ab2701fbc8170adae3d5d7d10d90 Author: Jiaheng Tang Date: Wed Jan 18 18:05:17 2023 -0800 Support idempotent write using SQL options for INSERTS/DELETE/UPDATE/MERGE Currently, delta supports idempotent write using Dataframe writer options. These writer options are applicable to inserts only. This PR adds support for idempotency using SQL options(`DELTA_IDEMPOTENT_DML_TXN_APP_ID` and `DELTA_IDEMPOTENT_DML_TXN_VERSION`) to INSERTS/DELETE/UPDATE/MERGE etc. When both writer options and SQL conf are specified, we will use the writer option values. Idempotent write works by checking the txnVersion and txnAppId from user-provided write options or from session configurations(as a SQL conf). If the same or higher txnVersion has been recorded, then it will skip the write. Add unit tests to test out the idempotency. Closes delta-io/delta#1555 GitOrigin-RevId: 842e56508137c5913e42bc32267ac74e6f2643ce commit 6eddbb01ba65349e4448d3653bcb1c4b3c1e9f61 Author: Scott Sandre Date: Wed Jan 18 12:53:08 2023 -0800 Minor updates to build.sbt to improve release process. 
GitOrigin-RevId: 4731d65fa13243c46006b111fc884c8a1599fd06 commit 1fa48c8764125d0a16c4718374763b1e1bfd44be Author: Scott Sandre Date: Fri Jan 20 16:09:48 2023 -0800 Update flink README.md to remove old reference to `shouldTryUpdateSchema` commit b4f88a10c7489110fcf365a61bd2aefde12720ce Author: Ole Sasse Date: Wed Jan 18 12:11:25 2023 +0100 Return DataFrame from target building functions in MERGE All plans returned from buildTargetPlanWithFiles are wrapped into DataFrame before using them. This PR proposes to return DataFrame from buildTargetPlanWithFiles to have a central place where the wrapping happens. No change in functionality, and hence no new tests added GitOrigin-RevId: 2310189528d499b12e413d11e11baa9597a2076d commit 46f0c5ef830821a77bb75ae234ce19469e1a7f47 Author: Paddy Xu Date: Wed Jan 18 10:41:00 2023 +0100 Protocol update for Table Features ## Description This PR proposes a change to the Delta Protocol to accommodate Table Features discussed in [[Feature Request] Table Features to specify the features needed to read/write to a table #1408](https://github.com/delta-io/delta/issues/1408). TOC is updated using `doctoc`. Not needed. Closes delta-io/delta#1450 Signed-off-by: Paddy Xu GitOrigin-RevId: 66a0b89a12d0c387a6ac03a1458ab4d64af5ac3d commit d739caf3a5f0025c1f65a7e6f4de6286e0eff6ab Author: Paddy Xu Date: Tue Jan 17 20:02:49 2023 +0100 Minor test refactoring. 
GitOrigin-RevId: 92fa36521202eb10ac6263ccaac926e7736b442d commit 6313782104a5c1fd12968efb4ba9a388458d9d5f Author: Yingyi Bu Date: Sat Jan 14 16:06:21 2023 -0800 Refactor tests in DataSkippingDeltaTests GitOrigin-RevId: 924cc50af29f7f7c8522d061b9c7024f02dc43c3 commit 714b33ea7f1a7f0f75fac2985be71965ca75bc2b Author: Wenchen Fan Date: Sat Jan 14 04:20:25 2023 +0000 Minor refactor to DeltaAnalysis.scala Author: Juliusz Sompolski Author: itholic Author: Wenchen Fan GitOrigin-RevId: 76f3a60bb1bb4dc03eb6f8aac016aa47d88f61f5 commit 78160ed377d702e52b8d7500f14015f0fe28904a Author: Tushar Machavolu Date: Fri Jan 13 12:28:42 2023 -0800 Updated PySpark version to v3.3.1 Signed-off-by: Tushar Machavolu This PR resolves issue #1546 The patch is for a version upgrade of PySpark for testing. The newer version has already been in use for the Scala tests but not in PySpark. Hence testing is not necessary No Closes delta-io/delta#1561 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 43d6694f1ca16099cb0a1e441565b7d6614e264f commit 46822a0489c639f15fdafbb20449edbb1ba20dce Author: Christos Stavrakakis Date: Fri Jan 13 21:26:53 2023 +0100 Forbid protocol downgrades when replacing Delta tables Replacing a Delta table is implemented as committing a new Delta version that clears existing data and updates metadata. Currently this is treated as creating a new table and the protocol version is determined based on existing settings and used features. However, the table's history is not cleared and customers can time travel or read CDF from previous versions. If the new version results in a protocol downgrade, these operations may return wrong data. This commit fixes this issue by preventing protocol downgrades when replacing a table. Replacing a table is treated as a commit that is allowed to reset all metadata, but only upgrade the protocol. Note that this change has two important side-effects: * Replacing a table will no longer pick the latest default table protocol.
* Users can no longer perform an in-place downgrade using `REPLACE TABLE X AS SELECT * FROM X`. This was allowed until now, but could return wrong results for time-travel/cdf queries. GitOrigin-RevId: a2aa81835981cd7b502e6df1b755153889cfb878 commit 2f99adab90680f846c1f8fba2d98ca67c1aaa6b9 Author: Jackie Zhang Date: Fri Jan 13 09:15:18 2023 -0800 Clean up and fix nullability handling for batch CDF schema check GitOrigin-RevId: 11927ec971815bcafbf226d62066443ffde1684d commit 026b42900a5955f10dce5b272a4d624e16d42324 Author: Tom van Bussel Date: Fri Jan 13 13:19:22 2023 +0100 Minor refactor in CheckpointsSuite GitOrigin-RevId: ee18c0203e57857aa8ef34b55d7d4696e012c8a3 commit ceb19cdf9b12f878d49ffe77bb07de655f4ac169 Author: Hussein Nagree Date: Thu Jan 12 14:21:33 2023 -0800 Fix column mapping advice error message Remove the `<>` around `table_name` in order to render it properly to the user GitOrigin-RevId: 252b478b41af6599dbf2534a31d1cecf3facbb51 commit dd744c8fe87f675012450f7417f2f1b32e5dfec5 Author: Johan Lasperas Date: Thu Jan 12 19:00:59 2023 +0100 Add merge metrics for number of rows updated/deleted by each clause type This change introduces 4 new metrics to the merge command that come in addition to the already existing `numTargetRowsUpdated` and `numTargetRowsDeleted` metrics: - `numTargetRowsMatchedUpdated`: number of rows updated by any MATCHED clause. - `numTargetRowsMatchedDeleted`: number of rows deleted by any MATCHED clause. - `numTargetRowsNotMatchedBySourceUpdated`: number of rows updated by any NOT MATCHED BY SOURCE clause. - `numTargetRowsNotMatchedBySourceDeleted`: number of rows deleted by any NOT MATCHED BY SOURCE clause. 
The new metrics are checked in existing tests and new tests covering not matched by source clauses are added to `MergeIntoMetricsBase` GitOrigin-RevId: 9ecce4be1a7bbe2a6bffa83281e5ddcdcbf1dfc4 commit fa6f44194b796a1376e3cef8f123b5718b3dd6dc Author: Fredrik Klauss Date: Thu Jan 12 15:22:59 2023 +0100 Don't drop table feature set by table property override during CLONE In CLONE, we merged the `txn.metadata` to come up with a right protocol version to account for table property overrides. With table features, this approach does not work anymore as it doesn't account for table features set in the table property overrides, as they end up in the protocol of the transaction, and not the metadata. This is fixed by merging the table property overrides with the `txn.metadata` when coming up with the minimal protocol components. GitOrigin-RevId: 18aaf85df24b80970b5098d6e891451c875ea79c commit d9a889e03ae63f34fd14b453c9d8f43b64d84059 Author: Paddy Xu Date: Thu Jan 12 10:12:05 2023 +0100 Fix a nit in the Delta error message This PR follows https://github.com/delta-io/delta/pull/1520 to fix a small grammar issue in the Delta error message. GitOrigin-RevId: 197159689d32a35c05d5c1d7a0022e9e8d373a07 commit a6ee38b629062f2cb8971c805202ffac2bbdac4c Author: Brayan Jules Date: Wed Jan 11 19:35:00 2023 -0700 Automatically generate partition filters for generated columns using the trunc date function We are adding support for partition filter on partition columns that use the generation expression trunc date. We are considering the following data filters: * lessThan * lessThanOrEqual * equalTo * greaterThan * greaterThanOrEqual * isNull Because the date resolution of the partition column has truncated information, we need to reuse the implementation of the following functions: * lessThanOrEqual reused in lessThan function. * greaterThanOrEqual reused in greaterThan function. I added three new tests to the OptimizeGeneratedColumnSuite class. 
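The reason `lessThan` has to reuse `lessThanOrEqual` can be checked with a small model. The sketch below is a toy illustration in plain Python, not the Delta implementation: `trunc_month` stands in for a month-level trunc date generation expression, and `partition_filter_lt` is a hypothetical helper that derives the partition-column filter for the data filter `d < bound`.

```python
from datetime import date


def trunc_month(d):
    """Toy model of a month-level trunc date expression: first day of d's month."""
    return d.replace(day=1)


def partition_filter_lt(part_value, bound, strict):
    """Hypothetical derived filter on the partition column for `d < bound`.

    strict=False models reusing lessThanOrEqual; strict=True models a naive
    lessThan on the truncated values.
    """
    truncated_bound = trunc_month(bound)
    if strict:
        return part_value < truncated_bound
    return part_value <= truncated_bound


rows = [date(2023, 1, 5), date(2023, 1, 20), date(2022, 12, 31), date(2023, 2, 1)]
bound = date(2023, 1, 15)

# Rows the original data filter `d < bound` must return.
expected = [d for d in rows if d < bound]

# Rows living in partitions kept by each variant of the derived filter.
kept_le = [d for d in rows if partition_filter_lt(trunc_month(d), bound, strict=False)]
kept_lt = [d for d in rows if partition_filter_lt(trunc_month(d), bound, strict=True)]
```

With the non-strict variant every expected row survives partition pruning (a superset is acceptable, assuming the original data filter is still applied to the surviving rows). The strict variant drops the partition containing `2023-01-05`, because that row truncates to the same value as the bound.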
Resolves delta-io/delta#1446 Closes delta-io/delta#1513 Co-authored-by: Brayan Jules Signed-off-by: Allison Portis GitOrigin-RevId: e6744298c85ebfd508e858ad391e168e015d7957 commit c77d2581a9c98c73d678cfc666c18b213b41e3be Author: Scott Sandre Date: Wed Jan 11 17:47:15 2023 -0800 Register VACUUM operations in the delta log This PR registers the start and end of VACUUM operations in the delta log. This means that we write a commit with no Add/Remove files, and only a `CommitInfo` file which contains the delta operation info. `VacuumStart` operation contains metrics: `numFilesToDelete` and `sizeOfDataToDelete` `VacuumEnd` operation contains metrics: `numDeletedFiles` and `numVacuumedDirectories` New UTs. Expose additional metrics and history in the _delta_log for the start and end of VACUUM operations. Closes delta-io/delta#1552. Resolves delta-io/delta#868. Co-authored-by: Yann Byron GitOrigin-RevId: 94805531d022bac4afafd0b672d17b8828d8aa2c commit 74f73bf23853feedf4ba5eaa0d08334a6844c8b4 Author: Allison Portis Date: Wed Jan 11 17:49:34 2023 -0700 Adds note about void columns to the Delta protocol Resolves delta-io/delta#1499 Closes delta-io/delta#1559 GitOrigin-RevId: 7bf0e780b56c10169afcdddac5572e010a407ff3 commit 0a172308c26d64c9aa1fcee74aa8bc5f57c9bdfb Author: Fredrik Klauss Date: Wed Jan 11 13:28:15 2023 +0100 Enabling legacy metadata table feature should not enable other table features during CLONE If we'd enable a table feature using its metadata property during CLONE for a table that already has a table feature, we'd enable all table features up to the protocol version of the table feature enabled through the table property. This was caused by coming up with a protocol for the table property overrides and source metadata without considering the overall protocol version resulting from the CLONE. This is fixed by: 1.
Determine the maximum protocol versions from the source table, table property overrides, transaction metadata, and (if required) the target table. Additionally determine the table features that result from a new table created from `txn.metadata`. 2. Create a protocol out of that and merge it with the other protocols. GitOrigin-RevId: f8fd213de796f98346909b44987c807d364417ef commit 41d42718ab6b705df16a0ee6978dec7246bdd31d Author: Johan Lasperas Date: Wed Jan 11 09:44:33 2023 +0100 Update docstrings mentioning outdated merge limitation The Scala and Python delta merge API docstrings mention a limitation of at most one merge clause/action of each type. This limitation was lifted and the comments are outdated; this change rephrases them. This change only updates comments. Python unit tests are added nevertheless to cover cases where multiple clauses/actions are specified. The Scala API is already covered by existing tests. Rewords docstring for delta merge Python and Scala API. GitOrigin-RevId: 0c37132346e260d172e9ea2d785199272506621f commit 9a814a86b98302d025fd2929b5b5bfc1580d1edd Author: Prakhar Jain Date: Tue Jan 10 12:46:30 2023 -0800 Minor refactor in DeltaTableV2 GitOrigin-RevId: 29053af447efecc758a35ec78d75117a1af7971c commit b6eb00601e7e590960ef61b1b3431114d0307781 Author: Fredrik Klauss Date: Tue Jan 10 10:21:20 2023 +0100 Rename ignoreAddCDCFileActions to useCoarseGrainedCDC Rename ignoreAddCDCFileActions to align it more with what it actually does, namely using coarse-grained CDC.
GitOrigin-RevId: 80dbb3e836e1f301a54cb0e4098d51ba52fdc4d0 commit 7bb687173c4cf8f2244e0bc58f5a1e9620d58850 Author: Venki Korukanti Date: Mon Jan 9 20:54:16 2023 -0800 Update `DeltaParquetFileFormat` to add `isRowDeleted` column populated from DV This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at https://github.com/delta-io/delta/issues/1485) It modifies the `DeltaParquetFileFormat` to append an additional column called `__delta_internal_is_row_deleted__`. This column is populated by reading the DV associated with the Parquet file. We assume the rows returned are in order given in the file. To ensure the order we disable file splitting and filter pushdown to Parquet reader. This has performance penalty for Delta tables with deletion vectors until we upgrade Delta to Spark version to 3.4 (which has Parquet reader that can generate row-indexes correctly with file splitting and filter pushdown). Currently added a single test. There will be e2e tests that cover the code better. Closes delta-io/delta#1542 GitOrigin-RevId: b73fa628ad6d04c171f56534b80a894e9cd1220e commit 3a396458c5a4a139189ccb77758c861a3c2315e8 Author: Prakhar Jain Date: Mon Jan 9 17:47:09 2023 -0800 Adds new API commitIfNeeded in OptimisticTransaction class to allow skipping commits when they are not needed. This patch makes following changes: 1. WriteSerializable mode: Empty commits today are not recorded for DELETE/UPDATE. This is incorrect behavior and this patch fixes it. Merge today already records empty commit - which is the correct behavior. 2. Serializable mode: Empty commits today are not recorded for DELETE/UPDATE but MERGE records empty commit. We can either record them or skip recording them - both are accepted behavior from correctness perspective. But empty commits may have CDF overheads and so sometimes customers wants to avoid them. So as part of this PR, we make behavior of empty commits consistent for all 3 - MERGE/UPDATE/DELETE. 
All three will skip empty commits when `DELTA_SKIP_RECORDING_EMPTY_COMMITS` conf is enabled (defaults to true). Added UTs. GitOrigin-RevId: af1825dbe4879cfcac72aeeaa596d1d241c9468d commit c31a8f6114f1f4ad216ff4c8fcc13f7124761c13 Author: Venki Korukanti Date: Mon Jan 9 17:30:38 2023 -0800 Improve Z-Order Python tests ## Description Followup from test failures in(https://github.com/delta-io/delta/commit/d8c4fc17c25d6b5e0e9b3ebe1ff4cba39ecb39c5) - The expected number of files added/removed is hard coded which could change depending upon the Spark version (For example [change](https://issues.apache.org/jira/browse/SPARK-40407) causes different number of output files between Spark 3.3.0 and Spark 3.3.1 Update the test to get the number of files from query. - Change `assertTrue` to `assertEqual`. Existing tests Closes delta-io/delta#1550 Signed-off-by: Venki Korukanti GitOrigin-RevId: a909d1b67526bcd218968d07d5ec9ca8cba638bb commit 9b6c08d80166c67f223829972ca5889736d4ad46 Author: Lars Kroll Date: Mon Jan 9 14:45:58 2023 +0100 Minor test refactoring. GitOrigin-RevId: 83ded804437d4295f13284cc0d739d9d140123e1 commit fee3a58060ced3f30fc587a1d6432eead5eb553d Author: Johan Lasperas Date: Sat Jan 7 03:38:59 2023 +0100 Fix pruning with target only predicate in merge with NOT MATCHED BY SOURCE clause Fix a bug introduced that may lead to incorrect results in merge commands that include a target only predicate in the merge condition and a WHEN NOT MATCHED BY SOURCE clause. Example impacted merge command: ``` MERGE INTO target USING source ON source.key = target.key AND target.key > 4 WHEN NOT MATCHED BY SOURCE THEN DELETE ``` The target only predicate `target.key > 4` in the merge condition is incorrectly used to prune target files early when collecting files to rewrite. Files that have no rows matching the target only predicate may be skipped instead of being deleted or updated. Test cases added that covers the problematic merge statements. 
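The pruning bug in the NOT MATCHED BY SOURCE fix can be illustrated with a toy model. The Python sketch below is not Delta's merge code: target "files" are plain lists of keys, the source is a set of keys, and `merge_delete_not_matched_by_source` is a hypothetical function that models the example merge statement with and without the faulty early pruning.

```python
def merge_delete_not_matched_by_source(files, source_keys, prune_with_target_predicate):
    """Toy model of:

        MERGE INTO target USING source
        ON source.key = target.key AND target.key > 4
        WHEN NOT MATCHED BY SOURCE THEN DELETE

    Returns the remaining target keys, file by file.
    """
    result = []
    for rows in files:
        # Buggy behavior: a file with no row satisfying `target.key > 4`
        # is pruned early and left untouched.
        if prune_with_target_predicate and not any(k > 4 for k in rows):
            result.append(list(rows))
            continue
        # A target row is matched only if the full merge condition holds for
        # some source row; every other row is "not matched by source" and deleted.
        result.append([k for k in rows if k in source_keys and k > 4])
    return result


files = [[1, 2, 3], [5, 6, 7]]
source_keys = {2, 6}

correct = merge_delete_not_matched_by_source(files, source_keys, False)
buggy = merge_delete_not_matched_by_source(files, source_keys, True)
```

In this model only key `6` satisfies the whole merge condition, so a correct merge leaves `[[], [6]]`. With pruning enabled, the first file is skipped and its rows `1, 2, 3` wrongly survive, which is the class of incorrect result the commit describes.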
GitOrigin-RevId: 7d78d989fcfc68b5ba44b84f453f1cd5ff6c457c commit a981d568bf5057ecf9bbeb701cbb9f8efc100dae Author: Ole Sasse Date: Fri Jan 6 23:19:33 2023 +0100 Fix a var instead of val warning Change a var to a val in WriteIntoDelta.scala. Small style changes GitOrigin-RevId: 3a27019ac794d7fc487d8241dcb8cb3b1a6c1b36 commit 77441cb96e16919d066913f04aba915b11744c17 Author: lzlfred Date: Wed Jan 4 17:34:11 2023 -0800 Add test to verify DataSkippingReader always return latest AddFile GitOrigin-RevId: 71fc56e257c67926961bdea90bad869286e3d6fd commit 69c6a5255f134c9a59520c9f94be3e7afb1d99fe Author: Prakhar Jain Date: Wed Jan 4 15:14:57 2023 -0800 Fix a typo in PROTOCOL.md. GitOrigin-RevId: 6231c2e8cee74f412e539ca412fe58ec4fe4549c commit 42349b8122e66a97a4d053683645d4ec2b210c83 Author: Scott Sandre Date: Wed Jan 4 17:32:39 2023 -0500 Fix CI job GC overhead limit exceeded This PR a) increases the JVM heap size from 1GB to 4GB, and b) changes the garbage collector to be G1GC ([link](https://docs.oracle.com/javase/7/docs/technotes/guides/vm/G1.html)) which should be more performant on systems with more processors and more RAM (such as the worker this job runs on) ``` java -Dsbt.override.build.repos=true -Dsbt.repository.config=/ephemeral/tmp/tmpfj8j3fgv/build/sbt-config/repositories -Xms1000m -Xmx1000m -XX:ReservedCodeCacheSize=128m -jar build/sbt-launch-1.5.5.jar clean "++ 2.13.5 test" ... ... 
[error] Error while emitting DatasetRefCacheSuite.scala [error] Java heap space ``` ``` java -Dsbt.override.build.repos=true -Dsbt.repository.config=/ephemeral/tmp/tmptf2fzixi/build/sbt-config/repositories -Xms1000m -Xmx1000m -XX:ReservedCodeCacheSize=128m -XX:+UseG1GC -Xmx4G <-- this will override the -Xmx1000m above -jar build/sbt-launch-1.5.5.jar clean "++ 2.13.5 test" ``` GitOrigin-RevId: 6b0411588ae5e7df5cfd97a62feeb39e2ce88355 commit 36a7edb8cf507e713700ba827c5fb5ad32b9163e Author: Yaohua Zhao Date: Thu Jan 5 01:14:57 2023 +0800 Minor refactor of DeltaUnsupportedOperationsCheck GitOrigin-RevId: df0867bcb126c70d58e1b22c121da9562ed6b518 commit 9cb4dc4d3b14780b538efbaef9497dabc68d7e76 Author: Johan Lasperas Date: Wed Jan 4 09:50:32 2023 +0100 Extend the existing Python Delta Table API to expose WHEN NOT MATCHED BY SOURCE clause in merge commands. Support for the clause was introduced in https://github.com/delta-io/delta/pull/1511 using the Scala Delta Table API, this patch extends the Python API to support the new clause. See corresponding feature request: https://github.com/delta-io/delta/issues/1364 Adding python tests covering WHEN NOT MATCHED BY SOURCE to test_deltatable.py. The extended API for NOT MATCHED BY SOURCE mirrors existing clauses (MATCHED/NOT MATCHED). 
Usage: ``` dt.merge(source, "key = k") .whenNotMatchedBySourceDelete(condition="value > 0") .whenNotMatchedBySourceUpdate(set={"value": "value + 0"}) .execute() ``` Closes delta-io/delta#1533 GitOrigin-RevId: 76c7aea481fdbbf47af36ef7251ed555749954ac commit c3e0a1ad012f8fc1162cd74bc2053b1a843ef7be Author: Andrew Li Date: Tue Jan 3 19:30:33 2023 -0800 Minor refactoring of DeltaLog.scala GitOrigin-RevId: aedee6a7312093c325d003016b7130adb906326f commit e0177435bd0d1f16e0e2ea26e188aa4703f4f61f Author: Jintian Liang Date: Tue Jan 3 16:29:53 2023 -0800 Refactor DeltaLogging for tahoeId and tahoePath tags There are various places where we are explicitly grabbing the tahoeId and tahoePath for logging purposes. This is a simple refactor to unify that code into one place in DeltaLogging. GitOrigin-RevId: f634f7bf32d526b7241fb37fc6a896f01d0b2d06 commit 8e8d36c52d92c3b3bf1a0391dbf2caed96a3169f Author: Allison Portis Date: Tue Jan 3 15:47:32 2023 -0800 Add test for incorrect behavior that will fail when upgrading to Spark 3.4 See #1479 this test checks for the **incorrect** behavior we are seeing and should fail when upgrading to Spark 3.4. Closes delta-io/delta#1535 Signed-off-by: Allison Portis GitOrigin-RevId: 7c5c08038376a3df3f04df3d9e7dbde36a2295e4 commit cc08cd58f6fe8a3a30e46ddd5f69ceba3ace2a22 Author: Rahul Shivu Mahadev Date: Tue Jan 3 13:56:13 2023 -0800 Improve error message language Improve error message for 1. DELTA_CREATE_TABLE_WITH_NON_EMPTY_LOCATION 2. 
DELTA_READ_TABLE_WITHOUT_COLUMNS GitOrigin-RevId: 41203be807f70353a194204a337344a44cc730a3 commit aefc6ae487448e46026248a18f02f61f55d21cd0 Author: Rahul Shivu Mahadev Date: Tue Jan 3 13:10:03 2023 -0800 Record byte level metrics for Update command Start recording byte level metrics with Update (numTargetBytesAdded, numTargetBytesRemoved) that will be accessible via DESCRIBE HISTORY command GitOrigin-RevId: 786535aa4924d0f101740792b581b5cd90358975 commit 8e42f029e76fd371a36db1837252646549181df5 Author: Rahul Shivu Mahadev Date: Tue Jan 3 13:08:24 2023 -0800 Minor Refactor related to recordDeltaEvent and improve test flakiness GitOrigin-RevId: 33393d3fd33a1f1ec43058a2638c5849d51b884b commit bdc376eb3f6d6738905b14cf0a0400242334f781 Author: Johan Lasperas Date: Tue Jan 3 18:23:21 2023 +0100 Add tests for WHEN NOT MATCHED BY SOURCE Add 3 tests covering edge cases to NotMatchedBySourceSuite: - Full table update - Insert-only with a not matched by source clause updating no rows. - Merge statement with all 3 type of clauses with no rows changed or inserted. GitOrigin-RevId: 9f16c3c9784685ec38e78c197feea3c6063d9647 commit 8a2ce5c6ceb9428c47d2ccc9107101d264bc9a6c Author: Venki Korukanti Date: Mon Jan 2 21:04:18 2023 -0800 Add `RowIndexFilter` interface and implementation This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at https://github.com/delta-io/delta/issues/1485) It adds an interface called `RowIndexFilter` which evaluates whether to keep a row in the output or not. `FilterDeletedRows` implements `RowIndexFilter` to filter out rows that are deleted from a deletion vector. In the final integration, this filter is used just after fetching the rows from the data parquet file. Refer to task IDs 7 and 8 in the [project plan.](https://github.com/delta-io/delta/issues/1485) Test suite is added. 
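As a rough illustration of the `RowIndexFilter` idea, the following Python sketch models a filter that decides per row index whether a row is kept. Only the two class names come from the commit message; the `keep` method, the constructor argument, and `read_with_filter` are assumptions made for this example, and a plain set stands in for the deletion vector.

```python
from abc import ABC, abstractmethod


class RowIndexFilter(ABC):
    """Toy interface: decides whether the row at a given index stays in the output."""

    @abstractmethod
    def keep(self, row_index):
        """Return True if the row at row_index should be kept (hypothetical signature)."""


class FilterDeletedRows(RowIndexFilter):
    """Drops rows whose index is marked as deleted."""

    def __init__(self, deleted_row_indexes):
        # A plain set stands in for the deletion vector in this sketch.
        self._deleted = frozenset(deleted_row_indexes)

    def keep(self, row_index):
        return row_index not in self._deleted


def read_with_filter(rows, row_filter):
    """Apply the filter right after 'reading' rows, in file order."""
    return [row for index, row in enumerate(rows) if row_filter.keep(index)]


surviving = read_with_filter(["a", "b", "c", "d"], FilterDeletedRows({1, 3}))
```

The sketch relies on rows being visited in file order, so that the position passed to `keep` is the row's index within the file.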
Closes delta-io/delta#1536 GitOrigin-RevId: d2075dba21818e2889d51b4331d88bb62d192328 commit cc654a024f3630ce6ddf40080e7f4c6b0aff4c63 Author: Fredrik Klauss Date: Sat Dec 31 10:12:09 2022 +0100 Legacy table features enabled via metadata can be enabled When we enable a table feature from metadata only, we figure out the minimum required protocol to accommodate the metadata. For legacy features, this means they could end up in the implicitly enabled features, even though they have been enabled explicitly via the metadata if the minimum required protocol to accommodate the metadata is below activating table features. While initializing the protocol, we might bump the protocol version so that table features are enabled. If table features are enabled, this means there are no implicit enabled table features anymore. However, this means we lose the implicit table features that have been enabled via the metadata, as they are not carried over. To combat this, we first have to determine the protocol version, and then use this to add table features that are enabled by metadata either to the implicitly enabled features or the set of enabled table features. 
GitOrigin-RevId: 8c8f84e6d3951f09bdb336bcd9ab59fbd7368099 commit 5a5df79427e44ad645e7ca802167122dace874d5 Author: Paddy Xu Date: Fri Dec 30 17:24:42 2022 +0100 Small Refactoring to Table Features GitOrigin-RevId: 7b3aa9fe23f95d3bfe56b6462d4e2bb0d2c42a45 commit 959c98bc973b14e8798c5ed8c80f798c7563cb1d Author: Rahul Shivu Mahadev Date: Thu Dec 29 10:38:57 2022 -0800 Record byte level metrics for Merge Into command Start recording byte level metrics with Merge (numTargetBytesAdded, numTargetBytesRemoved) that will be accessible via DESCRIBE HISTORY command GitOrigin-RevId: 97281f19bf719b3dca9fafd334b357f655339627 commit 24c025128612a4ae02d0ad958621f928cda9a3ec Author: Venki Korukanti Date: Thu Dec 29 08:48:22 2022 -0800 Add `DeletionVectorStore` to read from and write DVs to storage This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at https://github.com/delta-io/delta/issues/1485) It adds a `DeletionVectorStore` which contains APIs to load DVs from and write DVs to Hadoop FS compliant file system. The format of the DV file is described in the protocol [here](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vector-file-storage-format). Added a test suite. Closes delta-io/delta#1534 GitOrigin-RevId: 2941dc32e87565a97c3e1d70470a1eaf65e524c7 commit 3255a00a559bbed478781e3e4583800d900ce977 Author: Fredrik Klauss Date: Thu Dec 29 16:32:33 2022 +0100 Refactorings Do a refactor in `CloneTableBase` to extract the logic of updating the metadata. GitOrigin-RevId: d8070b73a9245129d8029932f2ffa255039fb1b8 commit 10bd261a1eb632ac97f0804f965e4aca5f468894 Author: Burak Yavuz Date: Wed Dec 28 10:32:12 2022 -0800 Minor refactoring on metastore access. Trying to reduce the amount of metastore RPCs in order to reduce latencies on some calls. 
GitOrigin-RevId: 20640bf2ea17d3c35f914b867f7dc457169fdb5a commit 5c2e6d040092969898f1881c7d95ec278d0b108d Author: Paddy Xu Date: Sat Dec 24 13:40:39 2022 +0100 Remove feature descriptors in Table Features This change addresses feedback in https://github.com/delta-io/delta/pull/1450#discussion_r1054717897 to remove all uses of `TableFeatureDescriptor` in the codebase because that `status` field in the descriptor can only be `enabled` which can be omitted. Therefore we could simplify store feature names as `Set[String]` in the protocol and assume all listed features are `enabled`. `Protocol` action before this change: ```json { "protocol":{ "readerVersion":3, "writerVersion":7, "readerFeatures":[ {"name":"columnMapping","status":"enabled"} ], "writerFeatures":[ {"name":"columnMapping","status":"enabled"}, {"name":"identityColumns","status":"enabled"} ] } } ``` `Protocol` action after this change: ```json { "protocol":{ "readerVersion":3, "writerVersion":7, "readerFeatures":["columnMapping"], "writerFeatures":["columnMapping","identityColumns"] } } ``` GitOrigin-RevId: 445601bbaebf39f1cd41f8257ece7aeaff48e101 commit 211da7c5a021cd2d4337c4cd2967c12e35d5a1a0 Author: Paddy Xu Date: Fri Dec 23 14:10:07 2022 +0100 Allow native table features to be automatically enabled This change allows a native table feature to be automatically enabled on a table during (1) table creation or (2) metadata change. This change will make native features have an aligned behavior with which of legacy features, i.e., can be enabled via a metadata change. Given the following table feature: ```scala object NativeFreature extends ReaderWriterFeature(name = "nativeFeature") with FeatureAutomaticallyEnabledByMetadata { ... 
} ``` To enable this feature on a table, currently users must specify the feature as `enabled` along with the metadata change: ```sql ALTER TABLE tbl SET TBLPROPERTIES ( delta.feature.nativeFeature = 'enabled', delta.tablPropForNativeFeature = 'true' ) ``` After this PR is merged, specifying only the metadata will be sufficient: ```sql ALTER TABLE tbl SET TBLPROPERTIES ( delta.tablPropForNativeFeature = 'true' ) ``` GitOrigin-RevId: 8059c4d7c1c7745c7cce27cffb361ab1a785cf8e commit ff5904de26be3246c624ec55d995a3d85ab06f79 Author: Paddy Xu Date: Fri Dec 23 09:44:01 2022 +0100 Set Table Features Protocol Version to (3, 7) This PR changes the protocol version of Table Features to (3, 7) because all features are now adapted (see https://github.com/delta-io/delta/pull/1530). Closes delta-io/delta#1531 GitOrigin-RevId: fd3e2ba8fa473c5d751cfe1e8cb10ab8efaa706b commit a847526c8fc8cbffaa508c965649f20853491331 Author: Paddy Xu Date: Thu Dec 22 21:13:23 2022 +0100 Port existing features to Table Features This PR adapts the following legacy features to Table Features: - appendOnly - invariants - checkConstraints - changeDataFeed - generatedColumns - columnMapping - identityColumns Note that Deletion Vector will be ported in a separate PR. This PR does not modify each feature to check the `protocol` action to determine if it's used. Instead, there's a one-time check (in `Snapshot`) when opening a table for read, to ensure all legacy features implicitly enabled in metadata are referenced in `protocol`. Closes delta-io/delta#1530 GitOrigin-RevId: 0e89a718271d3ae94efce2bf2aef255beaad4d96 commit f248ae9aaedde77606c911d4621db945a819324c Author: Tobias Fabritz Date: Wed Dec 21 19:04:17 2022 -0800 Implement automated partition filters for TruncTimestamp (date_trunc) ## Description Adding automated partition filter creation for generated columns using TruncTimestamp (date_trunc) Implements #1445 Test case was added ## Does this PR introduce _any_ user-facing changes?
Closes delta-io/delta#1514 Signed-off-by: Allison Portis GitOrigin-RevId: 85003be24893458a4105d857c8c92fe31213a075 commit 6d255c44ada940b1422041bd9dae96a1c4ee70fc Author: Venki Korukanti Date: Wed Dec 21 11:05:50 2022 -0800 Add `DeletionVectorDescriptor` to represent DV metadata in `DeltaLog` ## Description This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at delta-io/delta#1485) It adds new class called `DeletionVectorDescriptor` which is used to represent the deletion vector for a given file (`AddFile`) in `DeltaLog`. The format of the metadata is described in the [protocol](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vector-descriptor-schema). New test suite is added Closes delta-io/delta#1528 GitOrigin-RevId: 3fe23fcf8b23c9478903e52b82b42f7cc1abf94f commit 2227446e331fbc5ac43a41f1348b78dc1768096e Author: Paddy Xu Date: Wed Dec 21 18:30:58 2022 +0100 Unlock Table Features supports for all newer protocol versions GitOrigin-RevId: 37e989fc58e3862ed38e18d2644ed99080ae6743 commit 6598679d72580239241dd328db2833e00ea11313 Author: Paddy Xu Date: Wed Dec 21 12:40:30 2022 +0100 Update `DESCRIBE DETAILS` and `RESTORE TABLE` to work with Table Features This PR updates the `DESCRIBE DETAILS` command to show enabled table features. The name of the column is `enabledFeatures` to be future-proof, in case we add more columns like `blockedFeatures`, `allowedFeatures`, etc. Example: ``` ... ... minReaderVersion 3 minWriterVersion 7 enabledFeatures [appendOnly,invariants] ``` This PR also updates `RESTORE TABLE` to merge work with Table Feature, by making the output table have a merged protocol from source and target. 
GitOrigin-RevId: b96a578830fa05eb63e18a56f03e57df94224810 commit e6185e56f5be9bf898a2b7f1ab05a6b749733266 Author: Andreas Chatzistergiou Date: Tue Dec 20 09:56:12 2022 +0100 Minor formatting GitOrigin-RevId: 3e3aa70befeb6009d38cf3713b29de719c590f01 commit 574fe73b15237179fe0b917b1fd285ec6d22f7db Author: Kam Cheung Ting Date: Mon Dec 19 16:26:28 2022 -0800 Support idempotent overwrite and append on saveAsTable 1. Unit test to ensure idempotent write is supported in saveAsTable 2. Unit test to ensure idempotent write is supported in save. GitOrigin-RevId: ece4875f8f254f58a7595c448534a01967179ca8 commit b6a1c50d91306c2e22e5b960582244330f8831d8 Author: Allison Portis Date: Mon Dec 19 16:11:59 2022 -0800 Fix DPO (dynamic partition overwrite) to work with multiple partition data types Fixes DPO to work with partition columns of more than 1 data type. Adds a test to `DeltaSuite` GitOrigin-RevId: f6db69052d6b3d7115d4671052f31a25b16fa84d commit 22889ec0ad889076ab494ad040542ebada2b07ae Author: Ming DAI Date: Mon Dec 19 16:11:25 2022 -0800 Minor refactoring in Convert To Delta code path GitOrigin-RevId: cfbf516cf51ef99a2c3fb7752e6ba8898fa6f079 commit d174c180b57482ca0854e2c0636f069209633d61 Author: Xinyi Yu Date: Mon Dec 19 23:58:11 2022 +0000 Minor refactor in generated column validation to expose the plan. GitOrigin-RevId: f74cf63bc705a74ef11e5d0cb8c58867f6e7497f commit 45f64cace216d474182381e6d52ce470f39658b9 Author: Fredrik Klauss Date: Mon Dec 19 13:02:42 2022 +0100 Match on op name in CloneTableBase to ensure only valid commands use code path Introduce a match on the RESTORE or CLONE op name, to ensure that if a new command starts using this code path, it is modified to work with the intended behavior of the new command. 
GitOrigin-RevId: 96e2a9f6d5eae11ca00c05aec9435a38be045e31 commit 019d268503178ccb29cfac0b76916d73a24d4aeb Author: Tyson Condie Date: Fri Dec 16 12:54:57 2022 -0800 UT for allowing column mapping change when schema is empty GitOrigin-RevId: 1442e7980b8e8f73129ba79e27559c70df6bde26 commit bed62cb02694a38ef906389605fa6e26c1f0f1bd Author: Jackie Zhang Date: Thu Dec 15 15:29:51 2022 -0800 Robust schema usage and checks for batch CDF queries Properly handle read-incompatible schema changes when querying batch CDF (e.g. `table_changes()` TVF, and SQL/DF APIs). Right now, batch CDF almost always serves past data using the latest schema, but the latest schema may not be read-compatible with the data files. This PR introduces the following: 1. Added checks for read-incompatibility in batch CDF. This is a behavior change, and since no one has complained about the current behavior, we provide a SQL conf that allows customers to disable the check. 2. A new Delta SQL conf `changeDataFeed.defaultSchemaModeForColumnMappingTable` that can be set to `endVersion`, `latest` or `legacy`. If set to `legacy`, it falls back to the current behavior, in which either the latest schema or a time-travel version schema is used; if set to `endVersion`, the end version's schema is used to serve the batch; and if set to `latest`, the latest schema is always used. Note that this is orthogonal to 1): the checks are always triggered to ensure read-compatibility, even when `endVersion` is used. 3. The SQL conf cannot be used when the time-travel option `versionOf` is specified. Apparently users can time-travel the schema while querying batch CDF. This is probably unintentional, but since it exists, the two options are explicitly blocked from being used concurrently. Closes https://github.com/delta-io/delta/pull/1509. Unit test.
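The schema selection described in point 2 can be sketched as follows. This is a simplified model and not Delta's code: `CdfSchemaModeSketch`, `SchemaMode`, `schemaVersionToServe` and its parameters are hypothetical names, and rejecting a time-travel version only for the non-legacy modes is an assumption about how point 3 is enforced.

```scala
// Hypothetical model of the three schema modes; not Delta's actual classes.
object CdfSchemaModeSketch {
  sealed trait SchemaMode
  case object Legacy extends SchemaMode
  case object EndVersion extends SchemaMode
  case object Latest extends SchemaMode

  /** Returns the table version whose schema would serve a batch CDF read. */
  def schemaVersionToServe(
      mode: SchemaMode,
      latestVersion: Long,
      endVersion: Long,
      timeTravelVersion: Option[Long]): Long =
    (mode, timeTravelVersion) match {
      // Legacy: a time-travelled schema if one was requested, else the latest schema.
      case (Legacy, Some(v)) => v
      case (Legacy, None) => latestVersion
      // Assumption: the explicit modes cannot be combined with schema time travel.
      case (_, Some(_)) =>
        throw new IllegalArgumentException(
          "schema mode and schema time travel cannot be used together")
      case (EndVersion, None) => endVersion
      case (Latest, None) => latestVersion
    }
}
```

The read-compatibility check from point 1 is deliberately left out; per the description it runs regardless of which schema is chosen.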
GitOrigin-RevId: 584b3b009202de73199f3ca7399a39c6855070ec commit 4e51a9969708080b9ac002462f20f64000288978 Author: Jackie Zhang Date: Thu Dec 15 13:06:55 2022 -0800 Fix a bug in Delta source that could miss read compatibility checks in some cases. Added check with stream start schema to detect an edge case in which we only have one schema change that looks exactly like the stream source schema. Enhanced `isReadCompatible` with option to determine the correct nullability conversion fallback. GitOrigin-RevId: 6fd79e5703da9ead16fc54e717a156e1b4f15fdd commit 7fa42c3db357c28d75f6ce98f366af83f3d6f684 Author: Ryan Johnson Date: Thu Dec 15 11:07:27 2022 -0800 Eliminate most calls to the unsafe DeltaLog.metadata method Because it relies on `DeltaLog.unsafeVolatileSnapshot`, which can change at any time without notice as other threads access the Delta log, the `DeltaLog.metadata` method is also unsafe. This PR eliminates almost all call sites of the method, and renames it `DeltaLog.unsafeVolatileMetadata` to make the danger clearer to callers. GitOrigin-RevId: 2e0a5b4b211c5b010a188adf08ff949a43c2e0ec commit cba13ec2219aa4cf1296b294a4208cd1c84cc869 Author: Paddy Xu Date: Thu Dec 15 14:10:25 2022 +0100 Remove check when adding feature descriptors to a protocol This PR fixes a bug when the following method is called twice in a row: ```scala var p = Protocol(3, 7) // Protocol(3, 7, {}, {}) p = p.withFeatureDescriptor(feature, addToReaderFeatures = true, addToWriterFeatures = false) // Return: Protocol(3, 7, {feature}, {}) p = p.withFeatureDescriptor(feature, addToReaderFeatures = false, addToWriterFeatures = true) // Return: Protocol(3, 7, {feature}, {}) // Expected: Protocol(3, 7, {feature}, {feature}) ``` The issue is that the second call does not add `feature` to `writerFeatures`: the check sees that it already exists in `readerFeatures` and therefore does not try to add it to `writerFeatures`. This PR removes that check.
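A minimal sketch of the behavior after the fix, assuming a simplified stand-in for the protocol. `ProtocolSketch` and `withFeature` are hypothetical names that model features as plain strings; they are not Delta's `Protocol` or `withFeatureDescriptor`. The point is only that the reader and writer sets are updated independently, with no "already present elsewhere" check.

```scala
// Hypothetical, simplified model of a protocol; not Delta's actual Protocol class.
final case class ProtocolSketch(
    minReaderVersion: Int,
    minWriterVersion: Int,
    readerFeatures: Set[String] = Set.empty,
    writerFeatures: Set[String] = Set.empty) {

  // Each set is updated on its own; membership in one set never blocks the other.
  def withFeature(
      name: String,
      addToReaderFeatures: Boolean,
      addToWriterFeatures: Boolean): ProtocolSketch =
    copy(
      readerFeatures = if (addToReaderFeatures) readerFeatures + name else readerFeatures,
      writerFeatures = if (addToWriterFeatures) writerFeatures + name else writerFeatures)
}

object ProtocolSketchDemo {
  def main(args: Array[String]): Unit = {
    val p = ProtocolSketch(3, 7)
      .withFeature("feature", addToReaderFeatures = true, addToWriterFeatures = false)
      .withFeature("feature", addToReaderFeatures = false, addToWriterFeatures = true)
    // Both sets now contain the feature, matching the "Expected" line above.
    println(p)
  }
}
```

Because the features are held in sets, calling `withFeature` repeatedly with the same arguments is harmless, which is one reason a separate existence check is not needed in this model.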
GitOrigin-RevId: 8b14c5e91c9e610c142d0bfe613563f49d95c9c8 commit ff805d0a40fc3be377bf16fe01f9096bebad9024 Author: Ming DAI Date: Wed Dec 14 10:40:46 2022 -0800 Support SHALLOW CLONVERT Iceberg tables for Delta Lake As a followup to the SHALLOW CLONE [support](https://github.com/delta-io/delta/pull/1505) for Delta Lake, it would be great if we could enable SHALLOW CLONE on an Iceberg table as well. This will be a CLONVERT (CLONE + CONVERT) operation, in which we will create a Delta catalog table with files pointing to the original Iceberg table in one transaction. 1. It allows users to quickly experiment with Delta Lake without modifying the original Iceberg table's data. 2. It simplifies the user flow by combining a Delta catalog table creation with an Iceberg conversion. Similar to SHALLOW CLONE, it will work as follows: 1. Clone a Iceberg catalog table (after the setup [here](https://iceberg.apache.org/docs/latest/getting-started/)) ``` CREATE TABLE [IF NOT EXISTS] delta SHALLOW CLONE iceberg.db.table [TBLPROPERTIES clause] [LOCATION path] ``` 2. Clone a path-based Iceberg table ``` CREATE TABLE [IF NOT EXISTS] delta SHALLOW CLONE iceberg.`/path/to/iceberg/table`[TBLPROPERTIES clause] [LOCATION path] ``` Closes https://github.com/delta-io/delta/pull/1522 New unit tests. No. GitOrigin-RevId: e01994e037cf44e06f4ef3d6f185f5925dd77e48 commit 803d1492a2e4f4e005e09abf96861addac9a01ba Author: Johan Lasperas Date: Wed Dec 14 19:31:30 2022 +0100 Add tests for WHEN NOT MATCHED BY SOURCE Add tests covering Change Data Capture (CDC) with conditional insert + not matched by source update/delete Also clean up exceptions thrown during merge analysis when encountering invalid actions: changed to IllegalArgumentException and removed unnecessary error code DELTA_MERGE_INVALID_WHEN_NOT_MATCHED_CLAUSE (the exception can't be triggered by users). 
GitOrigin-RevId: 9bb1b73036ee6d7db4bfac6f31cc68640c4dbaa3 commit 63fc800e766aa6bd6d2ad4d81af321aad748fea1 Author: Wenchen Fan Date: Wed Dec 14 16:54:18 2022 +0000 Minor refactor DeltaAnalysis GitOrigin-RevId: ad7ad2f1b795a38eafc814c6868ccce70caaf631 commit 4a8786b31576341ce9b16c867dce2e443f74d686 Author: Paddy Xu Date: Wed Dec 14 13:38:36 2022 +0100 Basic functionality of Delta Table Features This PR implements Table Features proposed in the feature request (https://github.com/delta-io/delta/issues/1408) and the PROTOCOL doc (https://github.com/delta-io/delta/pull/1450). This PR implements the basic functionality, including: - The protocol structure and necessary APIs - Protocol upgrade logic - Append-only feature ported to Table Features - Protocol upgrade path - User-facing APIs, such as allowing referencing features manually - Partial test coverage Not covered by this PR: - Adapt all features - Full test coverage - Make `DESCRIBE TABLE` show referenced features - Enable table clone and time travel paths Table Features support starts from reader protocol version `3` and writer version `7`. When supported, features can be **referenced** by a protocol by placing a `DeltaFeatureDescriptor` into the protocol's `readerFeatures` and/or `writerFeatures`. A feature can be one of two types: writer-only and reader-writer. The former means that only writers must care about such a feature, while the latter means that in addition to writers, readers must also be aware of the feature to read the data correctly. We now have the following features released: - `appendOnly`: legacy, writer-only - `invariants`: legacy, writer-only - `checkConstraints`: legacy, writer-only - `changeDataFeed`: legacy, writer-only - `generatedColumns`: legacy, writer-only - `columnMapping`: legacy, reader-writer - `identityColumn`: legacy, writer-only - `deletionVector`: native, reader-writer Some examples of the `protocol` action: ```scala // Valid protocol.
Both reader and writer versions are capable. Protocol( minReaderVersion = 3, minWriterVersion = 7, readerFeatures = {(columnMapping,enabled), (changeDataFeed,enabled)}, writerFeatures = {(appendOnly,enabled), (columnMapping,enabled), (changeDataFeed,enabled)}) // Valid protocol. Only writer version is capable. "columnMapping" is implicitly enabled by readers. Protocol( minReaderVersion = 2, minWriterVersion = 7, readerFeatures = None, writerFeatures = {(columnMapping,enabled)}) // Invalid protocol. Reader version does not enable "columnMapping" implicitly. Protocol( minReaderVersion = 1, minWriterVersion = 7, readerFeatures = None, writerFeatures = {(columnMapping,enabled)}) ``` When reading or writing a table, clients MUST respect all enabled features. Upon table creation, the system assigns the table a minimum protocol that satisfies all features that are **automatically enabled** in the table's metadata. This means the table can be on a "legacy" protocol with both `readerFeatures` and `writerFeatures` set to `None` (if all active features are legacy, which is the current behavior) or be on a Table Features-capable protocol with all active features explicitly referenced in `readerFeatures` and/or `writerFeatures` (if one of the active features is Table Features-native or the user has specified a Table Features-capable protocol version). It's possible to upgrade an existing table to support table features. The upgrade can be either for writers or for both readers and writers. During the upgrade, the system will explicitly reference all legacy features that are implicitly supported by the old protocol. Users can mark a feature as required by a table by using the following commands: ```sql -- for an existing table ALTER TABLE table_name SET TBLPROPERTIES (delta.feature.featureName = 'enabled') -- for a new table CREATE TABLE table_name ... 
TBLPROPERTIES (delta.feature.featureName = 'enabled') -- for all new tables SET spark.databricks.delta.properties.defaults.feature.featureName = 'enabled' ``` When some features are set to `enabled` in table properties and some others in Spark sessions, the final table will enable all features defined in two places: ```sql SET spark.databricks.delta.properties.defaults.feature.featureA = 'enabled'; CREATE TABLE table_name ... TBLPROPERTIES (delta.feature.featureB = 'enabled') --- 'table_name' will have 'featureA' and 'featureB' enabled. ``` Closes https://github.com/delta-io/delta/pull/1520 GitOrigin-RevId: 2b05f397b24e57f1804761b3242a0f29098a209c commit 9fc7da818b0c5aa97756b8db2d4bdf2a5e6e147c Author: Ryan Johnson Date: Mon Dec 12 21:29:04 2022 -0800 Reliably obtain retention timestamps from Metadata Today, Delta log replay filters out expired `RemoveFile` tombstones and `SetTransaction` actions – but those retention periods are controlled by table properties and so log replay is required in order to properly access them. Delta has no code to break that cycle, and so on cold start it will access an empty/defaulted metadata and thus ends up using default values for both retention periods – even if the user set the table property to something longer. The solution is to craft a "miniature" version of log replay that only fetches protocol and metadata. The mini-replay can then run first, ensuring that the retention period table properties are reliably available during state reconstruction. This will add a small amount of latency to snapshot creation, but enforces correct behavior. GitOrigin-RevId: a953946146b419871e0f6686476f8119f7db12a0 commit 740a7fd8ee901e33d2c6734493504a6ac3f38f58 Author: Christos Stavrakakis Date: Mon Dec 12 21:04:09 2022 +0100 Add helper to get Delta prefix length. Add getRandomPrefixLength() helper that returns the length of the random prefix to use for the data files of a Delta table. 
GitOrigin-RevId: baa9f809e5e6c9d8a46d3b16c9e82b0b356a98ad commit e038c21945daad7624ca024f81631d0fc442c3ac Author: Vitalii Li Date: Fri Dec 9 22:00:32 2022 -0800 Minor refactor CloneTableSQLSuite GitOrigin-RevId: 65bd5911fb4d0287e41b2b58d6ef509bf7fd180d GitOrigin-RevId: 799d862f48b6fc0563c630d9b4183107ea3fba06 commit c2b0ce6d4f35df2fe6c2281299ae38513f50305d Author: Johan Lasperas Date: Fri Dec 9 09:25:28 2022 +0100 Add WHEN NOT MATCHED BY SOURCE to MergeIntoCommand This PR adds support for WHEN NOT MATCHED BY SOURCE clauses in merge into command using the Scala/Java and Python Delta API. Support for WHEN NOT MATCHED BY SOURCE using SQL will be available with Spark 3.4 release. Changes: - Extend Delta Merge API with support for NOT MATCHED BY SOURCE clause. - Extend Delta analyzer to support the new type of clause: - Resolve target column references in NOT MATCHED BY SOURCE conditions and update actions - Handle schema evolution (same as MATCHED clause): generate update expressions to align with the expected target schema - Implement support for NOT MATCHED BY SOURCE in MergeIntoCommand. Testing: New test trait MergeIntoNotMatchedBySourceSuite is added and collects all tests covering this feature. It is mixed into the Merge Scala test class to run tests against the Delta API and will be mixed in the Merge base test class to also cover the Spark SQL API once Spark 3.4 is released. Test coverage: - Analysis errors: invalid clauses or conditions, invalid column references. - Correctness with various combination of clauses. - Schema evolution. 
Close delta-io/delta#1511 GitOrigin-RevId: 5ccaa3f09e8d8b4e05c716014191e30ec38b64dd commit acb1f26f6e1940f632ec078554c9a3c111b17fb7 Author: Gengliang Wang Date: Thu Dec 8 12:43:49 2022 -0800 Minor refactoring GitOrigin-RevId: 9fa3c6a49abf67e9727a7e2d98880699a6de11c2 commit 6f02e6f07fc47374c9d1cc0f48175679b14c8317 Author: Vitalii Li Date: Thu Dec 8 10:52:32 2022 -0800 Minor refactoring GitOrigin-RevId: 28a7045d6695c593953bb3874b5da718cc8c4d05 commit 01df9aab0c504d24d825dede993f71fb08c45ce0 Author: Wenchen Fan Date: Thu Dec 8 09:09:15 2022 +0000 Add read-side char padding to cover external data files Author: Wenchen Fan Author: Wenchen Fan GitOrigin-RevId: 3a7894ad3a93d0290c127fabf8ad3ae3a2718d72 commit 4f766d9b6ca016de55e3d251a380252f06b4fc7b Author: Fredrik Klauss Date: Thu Dec 8 10:08:55 2022 +0100 Add more tests for MERGE source materialization The current MERGE source materialization test suite is rather small and doesn't test all behavior changes source materialization introduced. This change adds more tests to check for correct behavior. GitOrigin-RevId: 451c92ec2c05521a0993bea99649d9d0f44311ca commit a89375a92a157213386468c90803789b782a5a03 Author: Tyson Condie Date: Wed Dec 7 14:57:02 2022 -0800 Allow column mapping change when schema is empty GitOrigin-RevId: dfc514cad631d16d422ee21a8e2416ed64038154 commit b038b9d66bbe1f25a63bf404efaebb6e71587008 Author: Jackie Zhang Date: Tue Dec 6 11:59:27 2022 -0800 Support SHALLOW CLONE for Delta Lake This PR introduces support for **SHALLOW CLONE** for both Delta **and Parquet table** on Delta Lake, specifically the following command: `CREATE [OR REPLACE] TABLE [IF NOT EXISTS] target SHALLOW CLONE source [VERSION AS OF version | TIMESTAMP AS OF timestamp] [TBLPROPERTIES clause] [LOCATION path]` This enables the following use cases: 1. Create a `target` table with Delta log pointing to the files from the `source` table. The source table can be either a Delta table or a Parquet table. 
You may also specify a custom `path` to create as an external table in a path, a `clause` for additional table properties to append to, or a `version` to create the target table as a time-travelled version of the source table (if is Delta). E.g. `CREATE TABLE target SHALLOW CLONE source` 2. Replace/restore a table by shallow-cloning itself. This requires the table to be **empty** by the time of cloning to avoid data duplication. E.g. `REPLACE TABLE source SHALLOW CLONE source VERSION AS OF 1` Closes https://github.com/delta-io/delta/pull/1505 Existing tests. GitOrigin-RevId: 6b609eddc8c49c2831bd1d7a77660ae817c6246e commit eb43a2f457ebaa9e4c032ffbe376508df84f9b33 Author: itholic Date: Tue Dec 6 19:05:51 2022 +0000 Minor test assert improvement Author: itholic GitOrigin-RevId: c4760d22d51fe5100a65266802cc837d7ef54301 commit 7380c5c0629122921fcc7da913cbd8ce7c39bd8d Author: Mitchell Riley Date: Tue Dec 6 09:42:35 2022 -0800 Add varargs annotation to executeZOrderBy ## Description Using the current API from Java land means you have to use JavaConverters to massage a List into a Seq. Adding varargs exposes a new API that can take a String... `sbt compile` shows the newly overloaded method in the `target` output. ## Does this PR introduce _any_ user-facing changes? Introduces a public overloaded method for executeZOrderBy with the following signature: `executeZOrderBy (java.lang.String... 
columns)` Closes delta-io/delta#1510 Signed-off-by: Venki Korukanti GitOrigin-RevId: 025f2edfebc277213d4a9ae8fe92bdc38d44f639 commit 94a93e6cc7220b4951a4e405799e2fc666f0b089 Author: Shixiong Zhu Date: Mon Dec 5 14:30:48 2022 -0800 Fix the vacuum log message in DRY RUN Currently, we output a log line with `DeltaVacuumStats` like this ``` Found 12 files and directories in a total of 1 directories that are safe to delete.DeltaVacuumStats(true,Some(0),604800000,1669825818524,1,12,10404,11374,0) ``` This change fixes it in a better format: ``` Found 12 files (10404 bytes) and directories in a total of 1 directories that are safe to delete. ``` A simple log change. GitOrigin-RevId: 182e7c094647f639e925d714f60f2b4dc8c32944 commit 4cd7f756fb733b49d251c49aa950b506a2876222 Author: Venki Korukanti Date: Fri Dec 9 10:52:57 2022 -0800 Upgrade Delta Lake next version to 2.3.0-SNAPSHOT Upgrade Delta Lake next version to 2.3.0-SNAPSHOT NA GitOrigin-RevId: 15ad1531493a11542e0917f5252ddba931ba562e commit f295eccd86ff14533a5870d0d8dc39cd29a32cb4 Author: Scott Sandre Date: Thu Dec 8 11:08:53 2022 -0800 Update & fix flink integration tests dependencies (#485) * update pom.xml and build.sbt to include necessary dependencies and right flink version * Update some more dependencies; add testing shell script commit a29bc446b8ca5312e60c276c7ff692b590c71fae Author: Scott Sandre Date: Tue Dec 6 08:48:20 2022 -0800 Various minor changes for 0.6.0 release (#483) * set version to 0.6.0-SNAPSHOT * set delta storage version to 2.1.1 * update flink readme versions to 0.6.0 * update javadocs for 0.6.0 and latest * upgrade delta storage to 2.2.0 commit d46ea959f129dc491e64c6cd02f842142bc6b4b7 Author: Christos Stavrakakis Date: Mon Dec 5 15:51:27 2022 +0100 Do not store tags for tombstones in checkpoints Tombstones in checkpoints are read only by `VACUUM` and the only fields that are used are the `path` and `deletionTimestamp`. However, currently we store all fields, include the `tags`. 
This commit drops the tags from tombstones while creating checkpoints, reducing their size and allowing us to store more tags in RemoveFile actions without increasing the checkpoint size. GitOrigin-RevId: 6c1415d83c0cfe415cfa6275a65e23fbcecf46e3 commit e077f278e919b75942fd9fcd97545c67ebd84f12 Author: Andreas Chatzistergiou Date: Sun Dec 4 11:03:45 2022 +0100 Minor formatting change GitOrigin-RevId: 50e4f2d1edf4abd2f8d94fce5548c5b49b2eabdf commit 675dcc3be63d473454cc8c2ae79d97fd53cee8ce Author: Koki Otsuka Date: Thu Dec 1 20:23:13 2022 -0800 add metadata information to PROTOCOL.md ## Description This PR adds metadata information to PROTOCOL.md because the TODO link is not available. This PR is linked to [#1250](https://github.com/delta-io/delta/issues/1250). Not tested; this PR is a documentation-only change. ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1473 Co-authored-by: Tathagata Das Signed-off-by: Venki Korukanti GitOrigin-RevId: d5522224a3c78affe4267ccad1cf0141ed8988bf commit c7934d3b307808535df60f71d91e6636f4d9c601 Author: Lars Kroll Date: Thu Dec 1 20:22:45 2022 -0800 Update Spec to use the portable Deletion Vector serialization format ## Description The RoaringBitmap community has recently standardised on a ["portable" 64-bit serialization format](https://github.com/RoaringBitmap/RoaringFormatSpec#extention-for-64-bit-implementations). We should switch the Deletion Vector feature to use this format, as it will make adoption by 3rd parties easier, enabling them to use an off-the-shelf 64-bit implementation from the RoaringBitmap library of their choice, as long as it supports this format (which CRoaring already does and the next release of Java RoaringBitmap will do as well). This PR makes the appropriate changes to the DV format description in the appendix of the Delta spec. N/A ## Does this PR introduce _any_ user-facing changes? No.
Closes delta-io/delta#1494 Signed-off-by: Venki Korukanti GitOrigin-RevId: 1503f42337a553879ad527bee8b78bf2d034bfd2 commit a1fbbc67fdbe7d3689fd8c8e226fb6a38d78b0e8 Author: Lukas Rupprecht Date: Thu Dec 1 13:45:42 2022 -0800 Add check to enforce at most one Protocol action per commit This PR adds a safety check to ensure that every commit to a Delta table contains at most one Protocol action. As the added UT demonstrates, it is possible to have more than one Protocol action in a commit when manually constructing a commit. This can be a problem if Protocols conflict because Delta will just pick one of the two version when determining the current table version. This can lead to error because features that should be enabled on a table can be rejected. Currently, commits generated by Delta should be safe from this issue as none of the existing code paths can end up producing more than one protocol version. However, we are adding this additional check to future proof the code so that if such a case ever occurs (either through a newly introduced code path or by manually constructing a commit), it is caught early. GitOrigin-RevId: 7f5318c7c0606495fce8410572f1b97f2610f893 commit 1382c68a1deed8bed9d697a2cb851cb18d7d7f39 Author: Fredrik Klauss Date: Thu Dec 1 10:03:02 2022 +0100 Coarse-grained CDF Allow for CDF that computes the change data entirely from Add and RemoveFiles. GitOrigin-RevId: 461298c417c7b6d399f35cf7e3619d41943ecb54 commit 05e857525764b67b97fdde7e4568b4f66d00373d Author: slim Date: Wed Nov 30 11:52:18 2022 -0800 Make examples/scala compile -- detail() is available since v2.1 Signed-off-by: Slim Ouertani ## Description This Pr makes `examples/scala` compile and run with `sbt compile` and `sbt run`. By default this error is raised ``` [info] compiling 6 Scala sources to /private/tmp/delta/examples/scala/target/scala-2.13/classes ... 
[error] /private/tmp/delta/examples/scala/src/main/scala/example/Utilities.scala:67:16: value detail is not a member of io.delta.tables.DeltaTable [error] deltaTable.detail().show() [error] ^ [error] one error found [error] (Compile / compileIncremental) Compilation failed ``` This is tested by making sure that `sbt compile` and `sbt run` commands works are expected. ## Does this PR introduce _any_ user-facing changes? No changes, this is part of the build of example dir. Closes delta-io/delta#1489 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 15b132c42b6728a24cd0011f0ce5bf7031cd1302 commit 62b5ad09b411504bc22bd4a4a75ca3e946ae372b Author: kristoffSC Date: Fri Dec 2 17:45:41 2022 +0100 [Delta-474][Delta-481] - Use InitContext#getRestoredCheckpointId to get checkpointID needed by Delta writer. Flink 1.15.3 update. (#482) Signed-off-by: Krzysztof Chmielewski Signed-off-by: Krzysztof Chmielewski commit a5fcec4ff7611ce91ed499306ed4b78fef7e3df5 Author: Amir Mor Date: Tue Nov 29 18:11:16 2022 -0800 1385 - Collect statistics by default in ConvertToDelta & Update SQL API ## Description Resolves https://github.com/delta-io/delta/issues/1385 This PR gives the option to collect column statistics per file when doing ConvertToDelta operation. The current behaviour should stay the same (meaning that statistics should not be collected on convert). Current unit tests + added a couple of unit tests to cover this change ## Does this PR introduce _any_ user-facing changes? Yes. Previously ConvertToDelta command was without the ability to collect statistics. Now, the convert function is overloaded with additional boolean parameter (with false as default) to enable collect statistics. 
When this parameter is set to true, after the creation of the table, delta will compute all the files statistics (using legacy functionality and create another COMPUTE STATS commit to the log) Closes delta-io/delta#1401 Co-authored-by: Amir Mor Signed-off-by: Scott Sandre Signed-off-by: Venki Korukanti GitOrigin-RevId: 095b503053d2749ae1fd50e49348c3157b19366d commit 406e225cb0f74158dae682a5d6eb8e054ba84c87 Author: Joe Harris Date: Tue Nov 29 18:49:55 2022 -0500 Update Delta version to `2.1` GitOrigin-RevId: 5b091faeafd47bcb1c293059d5993da11a99c4e1 commit add68969f36534c2cf34f5128dd0ec750c8b0442 Author: Linhong Liu Date: Tue Nov 29 13:59:38 2022 -0800 Minor formatting change GitOrigin-RevId: 89f874449ef8507b0e5b1a9a8e15a1388bc61c44 commit 80b1224ecbdc0e8707a34edac4c1612bbcf9cc9b Author: Lars Kroll Date: Thu Nov 24 12:45:55 2022 +0100 Minor refactoring to RoaringBitmapArraySuite. GitOrigin-RevId: eb77c958731b273964277b6dc01f033df573bd25 commit b3ff96c9b822b9c54d62df04437eb772bde799a0 Author: Vitalii Li Date: Thu Nov 24 04:54:24 2022 +0000 Minor refactoring GitOrigin-RevId: 6515b748c9022b943d38fe635b6b2618c86d666e commit 40e6b19d74e912dc4e09978bd624aaa6a93ef037 Author: Paddy Xu Date: Wed Nov 23 16:43:36 2022 +0100 Remove redundant nullity check when reading and writing a Delta table This change removes an unnecessary nullity check during the protocol compatibility validation. When a table has no `protocol` action, an exception will be thrown when constructing a snapshot, hence the said nullity check is redundant. GitOrigin-RevId: 3b8bc0bfc40b235e81dc188bc8acb968fc52028b commit 3ec7c18880627d2f9216149d4d403d58361fd3aa Author: Tom van Bussel Date: Wed Nov 23 13:28:35 2022 +0100 Refactor DeltaParquetFileFormat to take metadata as argument This is a small cleanup to remove some duplication. 
GitOrigin-RevId: 45302ac456652d2ec4b9464ece7f8d517b2618e3 commit 99eaa10b8bc06da434902a0243049f459ba17328 Author: Lars Kroll Date: Wed Nov 23 10:28:32 2022 +0100 Implement Portable RoaringBitmapArray Format Implements an alternative serialization format to our existing "native" one. The new "portable" format corresponds to existing behaviour of CRoaring and has recently been standardised in the [Roaring Spec](https://github.com/RoaringBitmap/RoaringFormatSpec#extention-for-64-bit-implementations). GitOrigin-RevId: 06e1ff8e7532a31e95fb0c20bfdaef94e403baa8 commit 1a94a585b74477896cbcae203fc26eaca733cbaa Author: Scott Sandre Date: Tue Nov 22 20:15:40 2022 -0800 Add support for `LIMIT` push down during query planning to reduce number of files scanned ### Description This PR adds support for limit pushdown, where we will "push down" any `LIMIT`s during query planning so that we scan the minimum number of files necessary. ### How was this patch tested? New test suite. ## Does this PR introduce _any_ user-facing changes? No. Resolves delta-io/delta#1495 GitOrigin-RevId: 43d228fb10affbd87a10aabeba240525403c71dd commit 2574debe9570b668238dba2a67a9ccf08710388e Author: Shixiong Zhu Date: Tue Nov 22 13:36:40 2022 -0800 Cleanup OptimizeMetadataOnlyDeltaQuery code - Put `OptimizeMetadataOnlyDeltaQuery` and its tests in the consistent package `org.apache.spark.sql.delta.perf`. - A few style fixes. - Clean up a few tests. GitOrigin-RevId: 4d296caac2910aec726b9d716d1f2e6896dc8b4b commit 6c7d22385deee1c5b4929d3b7062a680e3bf6672 Author: Andreas Chatzistergiou Date: Tue Nov 22 22:33:48 2022 +0100 Minor refactor CheckSum and Snapshot GitOrigin-RevId: 4595bc7359e55fd3a192b36d80a2f412aa20f510 commit e9a3f2ffd14f47202baef39f854275697187b567 Author: Christos Stavrakakis Date: Tue Nov 22 10:56:24 2022 +0100 Minor refactor to AddFile action Minor refactor to removeWithTimestamp method. 
GitOrigin-RevId: 497a2158d6455f617b05a1300cb800bb5cf22cf8 commit 1003ae0ea2485c3342ede17ac35f7b89a2b42e4a Author: Lars Kroll Date: Tue Nov 22 10:29:37 2022 +0100 Refactor RoaringBitmapArray and RoaringBitmapArraySuite GitOrigin-RevId: 4d8c2bdd7b05c7694876f343411101beaa4b859f commit 9d3255de912f1a5fe936e56ca262e8e46383d017 Author: Wenchen Fan Date: Tue Nov 22 02:19:56 2022 +0000 Minor refactor DeltaSuiteBase GitOrigin-RevId: c3602216497ec1eaa67308164d96003595cf3634 commit 0c11c07e22bef855105c54b627ec44b3802f984f Author: Jungtaek Lim Date: Tue Nov 22 10:17:31 2022 +0900 Additional logging to help debug for Trigger.AvailableNow in Delta streaming source GitOrigin-RevId: db23d42a5af095ce9cc35bdc7df080f8a2f66fe5 commit b82354a68eb3dec68aa8c5c5d7703d4a7920104e Author: Ryan Johnson Date: Mon Nov 21 15:33:25 2022 -0800 Simplify scan generator code with new SnapshotDescriptor trait. The `Snapshot` class has an extremely broad surface, and many uses only need basic info such as `version`, `protocol`, and `metadata`. Users of the `TahoeFileIndex` API also need this same info. Introduce a new `SnapshotDescriptor` class that both of these classes can implement, and simplify the code that uses these classes. Also introduce a `TahoeFileIndexWithIndex` base class that factors out snapshot-related file index logic. GitOrigin-RevId: edce3b2cccbf9e7db21108352ef02af8203f6eb8 commit 238f15abcc9951527b3cadb6fa164567aff60813 Author: Venki Korukanti Date: Mon Nov 21 08:17:52 2022 -0800 Add RoaringBitmapArray to store indices of rows deleted in a data file This PR is part of the feature: Support reading Delta tables with deletion vectors (more at delta-io/delta#1485) Adds a new bitmap implementation called `RoaringBitmapArray`. This will be used to encode the deleted row indices. 
There already exists a `Roaring64Bitmap` provided by the `org.roaringbitmap` library , but this implementation is optimized for use case of handling row indices, which are always clustered between 0 and the index of the last row number of a file, as opposed to being arbitrarily sparse over the whole `Long` space. Closes delta-io/delta#1486 GitOrigin-RevId: c94d0cd1b4d1c179b9947e984705ab81e97a6dec commit cfa85f0dcd46a8ba99c9ed87b5ebe9e45f99176d Author: Venki Korukanti Date: Fri Nov 18 05:41:35 2022 -0800 Add utilities for encoding/decoding of Deletion Vectors bitmap This PR is part of the feature: Support reading Delta tables with deletion vectors (more at delta-io/delta#1485) Deletion vectors are stored either in a separate file or inline as part of `AddFile` struct in DeltaLog. More details of the format are found [here](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vector-descriptor-schema) In this PR, add utilities to encode/decode the DV bitmap in Base85 variant [Z85](https://rfc.zeromq.org/spec/32/) for storing it in `AddFile` struct in DeltaLog. 
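For orientation, plain Z85 as defined in the linked ZeroMQ spec can be sketched as below. This is an illustration of the encoding only, not the utility added by this PR: `Z85Sketch`, `encode` and `decode` are hypothetical names, and the sketch assumes the input length is already a multiple of 4 bytes (how the PR handles padding is not stated here).

```scala
// Hypothetical sketch of plain Z85 (ZeroMQ RFC 32); not the code added by this PR.
object Z85Sketch {
  // The 85-character alphabet from the Z85 spec.
  private val Alphabet =
    "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#"

  /** Encodes every 4 input bytes (big-endian) as 5 Z85 characters. */
  def encode(data: Array[Byte]): String = {
    require(data.length % 4 == 0, "Z85 input length must be a multiple of 4 bytes")
    val out = new StringBuilder(data.length / 4 * 5)
    data.grouped(4).foreach { chunk =>
      // Assemble an unsigned 32-bit value inside a Long.
      var value = chunk.foldLeft(0L)((acc, b) => (acc << 8) | (b & 0xffL))
      val chars = new Array[Char](5)
      var i = 4
      while (i >= 0) { // fill from the least significant base-85 digit
        chars(i) = Alphabet.charAt((value % 85).toInt)
        value /= 85
        i -= 1
      }
      out.append(new String(chars))
    }
    out.toString
  }

  /** Decodes every 5 Z85 characters back into 4 bytes. */
  def decode(text: String): Array[Byte] = {
    require(text.length % 5 == 0, "Z85 text length must be a multiple of 5 characters")
    text.grouped(5).flatMap { group =>
      val value = group.foldLeft(0L) { (acc, c) =>
        val digit = Alphabet.indexOf(c)
        require(digit >= 0, s"not a Z85 character: $c")
        acc * 85 + digit
      }
      Seq(24, 16, 8, 0).map(shift => ((value >> shift) & 0xff).toByte)
    }.toArray
  }
}
```

With the test vector from the Z85 spec, the bytes `86 4F D2 6F B5 59 F7 5B` encode to `HelloWorld`, and decoding that string returns the same 8 bytes. Every character in the alphabet is printable ASCII and none is a quote or backslash, which is what makes the result convenient to embed in a JSON string.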
Close delta-io/delta#1487 GitOrigin-RevId: e12b67abd7498b174cd3942b7c0ae82ffd362cc6 commit 6726e576d59c6f90bb740d55cb7925977fad794d Author: Prakhar Jain Date: Thu Nov 17 22:05:11 2022 -0800 Minor refactor to Snapshot and DataSkippingReader GitOrigin-RevId: f1c043eb56fe408e2db22d8f59bf3ef60d03e78e commit 26263efe451fcd8a60531f4cd97e1991cfc17778 Author: Hussein Nagree Date: Thu Nov 17 17:15:45 2022 -0800 Use consistent default snapshot for TahoeFileIndex Use the getSnapshot API for the default implementations of `metadata` & `tableVersion` instead of the cached snapshot No real functionality change, so existing unit tests GitOrigin-RevId: e56669c81737815ff8b8006756909d36e835fbf7 commit b07257df6bf7de2cadf44c21b767200f63094cb9 Author: Fredrik Klauss Date: Thu Nov 17 17:40:31 2022 +0100 Materialize non-deterministic source in MERGE - This PR fixes https://github.com/delta-io/delta/issues/527 - MERGE consists of two passes. During the first pass, it scans over the target table to find all files that are affected by the MERGE operation. During the second pass it reads those files again to update/insert the rows from the source table. - If the source changes between the two passes and contains an additional row that is in the target table, but not in one of the files that have been identified in pass 1, it will insert this row into the target table instead of updating the original row, leading to duplicate rows. - This can happen if the source is non-deterministic. A source is classified as non-deterministic if any of the operators in the source plan is non-deterministic (i.e. depends on some mutable internal state or some other input that is not part of the outputs of the children), or if it is a non-delta scan. - We solve this issue by materializing the source table at the start of a MERGE operation if it is non-deterministic, removing the possibility that the table changes during the two passes. 
The logic of source materialization is encapsulated in `MergeIntoMaterializeSource` and is used by `MergeIntoCommand`. - The source is materialized onto the local disks of the executors using RDD local checkpoint. Because RDD blocks can be lost (e.g. when Spot instances are killed), a retry loop is introduced. When autoscaling through Spark dynamic allocation is used, executor decommissioning can be enabled with the following configs to gracefully migrate the blocks. ``` spark.decommission.enabled=true spark.storage.decommission.rddBlocks.enabled=true ``` - When materializing the source table we lose the statistics and inferred constraints about the table, which can lead to regressions. We include a manual broadcast hint in the source table if the table size is small, ensuring that we choose the most efficient join when possible, and a "dummy" filter to re-introduce the constraints that can be used for further filter inference. https://github.com/apache/spark/pull/37248 makes this work out of the box in Spark 3.4, so these workarounds can be removed then. Closes delta-io/delta#1418 GitOrigin-RevId: f8cd57e28b52c58ed7ba0b44ae868d5ea5bd534c commit 6a2725a7945aef99f697e894be88835edbaa123b Author: Ryan Johnson Date: Wed Nov 16 23:54:01 2022 -0800 Remove unused DELTA_SNAPSHOT_ISOLATION config Remove the `DELTA_SNAPSHOT_ISOLATION` internal config (`spark.databricks.delta.snapshotIsolation.enabled`), which was added as default-enabled to protect a then-new feature that stabilizes snapshots in Delta queries and transactions that scan the same table multiple times. The feature has been stable ever since and the flag has never been used.
GitOrigin-RevId: 741c4b6c2deae77e916684ed8132a6e6728d4fb8 commit 56e0fcc0e09332c452f29b783c949a4fe857b029 Author: Weitao Wen Date: Wed Nov 16 15:27:53 2022 -0800 Minor refactor to DeltaLog GitOrigin-RevId: 9956ec22837c1ff01327c1d7069927cf3c75f432 commit 38f146b3e9e319cb6817eb0cccf367f3ac2dc370 Author: lzlfred Date: Tue Nov 15 21:55:25 2022 -0800 make Delta checkpoint and state reconstruction visible in Spark UI Adding "withNewExecutionId" to state cache and checkpoint logic so those jobs are visible in Spark/notebook UI. GitOrigin-RevId: 9b606b57e1158b02e923a8b586f00fdcd52c15df commit 0c349da8c70d5abacc4aa6decd57c6daf511696b Author: Felipe Pessoto Date: Tue Nov 15 18:00:43 2022 -0800 Optimize common case: SELECT COUNT(*) FROM Table Fix #1192 ## Description Running the query "SELECT COUNT(*) FROM Table" takes a long time for big tables: Spark scans all the Parquet files just to return the number of rows, even though that information is available from the Delta log. Resolves #1192 Created unit tests to validate that the optimization works, including cases not covered by this optimization. ## Does this PR introduce _any_ user-facing changes? Only performance improvement Closes delta-io/delta#1377 Signed-off-by: Shixiong Zhu GitOrigin-RevId: a9116e42a9c805adc967dd3e802f84d502f50a8b commit 1521be5e61d98ce68146454b89ecb0bed4fa94f2 Author: Johan Lasperas Date: Tue Nov 15 10:06:46 2022 +0100 Refactor merge output expression generation in reusable methods Move code used to generate merge output expressions into reusable methods. GitOrigin-RevId: bb4cb7daec037422490e48fa5260ef8de391fb2a commit 379289fbb4b4362900106da57ba1debd399e2336 Author: Prakhar Jain Date: Mon Nov 14 21:50:43 2022 -0800 Minor refactor around SetTransaction This PR does a minor refactor around SetTransaction and minSetTransactionRetentionTimestamp.
GitOrigin-RevId: 5ea8bb1a07fc337c1623fa2eb76a920c09da02c7 commit 25e1d7fa37136acf91f4ef8537ad34903e544f28 Author: Scott Sandre Date: Mon Nov 14 18:02:42 2022 -0800 Delta benchmarks - Remove .ivy2/cache and .ivy2/jar delta folders on EMR cluster When running the delta benchmarks on EMR, I found that sometimes my latest changes in my delta repo were not being run (e.g. `logInfo` statements not showing up). After some brief investigation, I surmised that the `.ivy2/cache/io.delta` and `.ivy2/jars/io.delta___` folders were causing it. When I updated the script to remove these folders, the problem went away. GitOrigin-RevId: 175ae77915864e369872539cd01229b3778b9af3 commit ba723db49ca1ea7f0b1a0852a523ae40d088a067 Author: lzlfred Date: Mon Nov 14 16:54:36 2022 -0800 New test Delta log should handle malformed json GitOrigin-RevId: aeb5c1509932766381c8f8267acbc084329e5d83 commit c45f269cdc31f566297e037ae9a79c641bbd91ca Author: Vitalii Li Date: Sat Nov 12 03:33:22 2022 +0000 Minor change to test in DeltaColumnRenameSuite Author: Vitalii Li GitOrigin-RevId: 2e8dc274e8260a7a3264330c08b940c381d6a10d commit 4aaa77e4727c1dcc2b10dbe5d1ccb8d8d8164526 Author: Jackie Zhang Date: Fri Nov 11 15:57:25 2022 -0800 Minor refactoring GitOrigin-RevId: 47b1a2bc80e7bf9c338ade9a8ea0f348cd3c0fab commit 083d0bdc9c1178c5d45f7390ec1c924e7124de1b Author: Shixiong Zhu Date: Fri Nov 11 07:36:28 2022 -0800 Improve the schema change error in Delta streaming source This PR makes the following improvements for the schema change error in Delta streaming source: - Report the table version that includes the schema change. - Add a message suggesting that users change `startingVersion` or `startingTimestamp` when a new checkpoint still hits the same schema change error. - Update `StreamingRetryableException` to use the new error framework to simplify the code.
Updated tests GitOrigin-RevId: c48901a25a6dddf3b64ece2b07f22bf048229004 commit 446ebf19c5bbed354f21be01f748aa3f876f23dd Author: Vitalii Li Date: Fri Nov 11 02:44:52 2022 +0000 Minor refactor to error messages Author: Vitalii Li GitOrigin-RevId: 7a7f3aa74d14e0e3d3aff3f0891e32ea3e2b4404 commit 68c8e183e0e380caabde586fd23906c8110c2ea9 Author: Helge Brügner Date: Thu Nov 10 17:18:37 2022 -0800 Pass the SparkSession when resolving the tables in the DeltaTableBuilder Signed-off-by: Helge Bruegner ## Description - Forward the SparkSession when resolving the tables in the DeltaTableBuilder - Remove some unused imports Resolves #1475 / ## Does this PR introduce _any_ user-facing changes? No. Closes delta-io/delta#1476 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 880de0f55ee79289cecd19d74c4c177c76d30aeb commit 5e9a417a7bd94e908c2239b7287421ef31055bb9 Author: lzlfred Date: Thu Nov 10 12:45:04 2022 -0800 Minor refactor to Snapshot, DataSkippingReader and TahoeFileIndex GitOrigin-RevId: 5bd6113c5f30a0e4f774782ccd15c79c30378791 commit 3678bde9c914c27bdaa54a17055255ba4f97857f Author: Allison Portis Date: Thu Nov 10 12:44:35 2022 -0800 Update protocol.md with CHECK constraints Update the [Delta Transaction Log Protocol](https://github.com/delta-io/delta/blob/master/PROTOCOL.md) to include CHECK constraints Closes delta-io/delta#1467 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 47cffc6ed16c2ef89e5253279fd148ff4daf9e80 commit dddaf3ee7f9dc27bde180c77af971e0da7fc9b41 Author: Jackie Zhang Date: Wed Nov 9 13:11:31 2022 -0800 Minor refactor PreprocessTableRestore --> PreprocessTimeTravel GitOrigin-RevId: 2bc3155e15e17b3baa90f4b846b891ac8a6773c2 commit 4f634f96d055710ca9eeaf5f91b586f705e142a3 Author: Ming DAI Date: Wed Nov 9 12:45:01 2022 -0800 Respect partition values from Catalog in CatalogFileManifest This PR updates CatalogFileManifest to respect catalog partition values blindly instead of infer the partition values from the actual partition path. 
GitOrigin-RevId: 640d13f31bdcee475e8d8a0180eb1de2f17d9523 commit 951a97d3705939d473967b7f1c97c99f4472f755 Author: Lars Kroll Date: Wed Nov 9 20:04:33 2022 +0100 Fix metadata updates via actions on first commit. Fixed an issue where you couldn't set the partition scheme on the initial commit via `txn.commit` but doing it via `txn.updateMetadata` followed by `txn.commit` worked fine, even though both submitted the same metadata during the same transaction. GitOrigin-RevId: 804baa968acbc8633cf3f25639331a41cd0a81e9 commit ca8df21783b844bda1f60443368f34402080271b Author: kristoffSC Date: Tue Nov 15 19:33:31 2022 +0100 Delta-467_Flink_1.15 - fix data loss on Delta Sink due to skipping late RPC calls to GlobalCommitter. (#473) Signed-off-by: Krzysztof Chmielewski commit e60d8b6fbb550353d235a428210c72fd7ebc49d1 Author: Serge Rielau Date: Wed Nov 9 01:52:57 2022 +0000 Minor refactor to DeltaErrors.scala GitOrigin-RevId: caab44784de767b3316c87bb8e42458f75a065b7 commit 4e62435da0265b2f872a4807e9b19422dffb642d Author: Ming DAI Date: Tue Nov 8 15:17:46 2022 -0800 Handle empty partitioned catalog table in ConvertToDelta GitOrigin-RevId: cb736b37878c8bcca9d6d2caabd10c391443038b commit 8fbe092e64a978b2f37eb9a69b56c2ff6b2ce56f Author: Jintian Liang Date: Tue Nov 8 14:06:56 2022 -0800 Refactor ZORDER key in Optimize code Define a constant for the hardcoded "zOrderBy" key in optimize command. GitOrigin-RevId: d2d4a8accc0947bdd127f7dfa9d9b627ac3f9b6b commit 4beb21a88a2b4577bc11cd31e0c3f6d74f0c378b Author: Christos Stavrakakis Date: Tue Nov 8 19:55:03 2022 +0100 Remove hard-coded versions from UPDATE CDC tests The CDC tests for UPDATE are using hard-coded versions. This commit makes them read the latest version from the Delta log. GitOrigin-RevId: aed5f32a69a87a0bf0a2357f0cc48626a74410ff commit 1e053353ff081801eb421653e1563d76d4ea302b Author: Shixiong Zhu Date: Tue Nov 8 16:30:12 2022 +0000 Add a test for self union for Delta streaming source when using `readStream.table`. 
GitOrigin-RevId: 4c65246a17d1bbf11b8668dd6834e0da4befcb1b commit f43e5f2f5e0385567443f27a7fb8bcfa08732ead Author: Johan Lasperas Date: Fri Nov 4 16:09:29 2022 +0100 Refactor Merge clauses in tests Small refactor of the test helper for creating merge clauses. GitOrigin-RevId: 6ea4c50d755eaf4d7d53655fb464e20815632c6a commit f0299fa3e0ae69f00a0f5a3e313f4c58d81402fc Author: Kevin Neville Date: Thu Nov 3 15:27:03 2022 -0700 Adress workflow warnings Resolves #1466 ## Description The version of setup-java is throwing a bunch of warnings related to https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/. This PR should remove those warnings. Closes delta-io/delta#1465 Signed-off-by: Allison Portis GitOrigin-RevId: 7d88f120e2102cc0b48244e30c13e1bd87eee1b5 commit 840f2c5dd88d0d87bb1c1d07483d7743d83599b0 Author: Christos Stavrakakis Date: Thu Nov 3 10:57:39 2022 +0100 Improvements to DeltaSourceColumnMappingSuite Column mapping tests try to upgrade the Delta protocol to a specific version, which makes the tests fail if the default protocol is for some reason set to a higher version. This commit fixes this issue by using the max of the default and required version. GitOrigin-RevId: 48ab8941733eb748945dd1848dc28c865688d3c5 commit 4ef595958e4788e8a0bd605bc84a263dc38e038e Author: Jackie Zhang Date: Wed Nov 2 22:50:30 2022 +0000 Adding support for converting Iceberg to Delta for the `CONVERT TO DELTA` command. Closes delta-io/delta#1463 GitOrigin-RevId: f1a334d3dafb3996a1105dcbc22b1024d66a5b18 commit 2e8d8439ff70336923834946cf91c4e0a52d5733 Author: kristoffSC Date: Thu Nov 10 18:00:55 2022 +0100 Flink_1.15 - Migrate Flink Delta connector to Flink 1.15. 
(#434) Signed-off-by: Krzysztof Chmielewski commit 05f1e9c6ea8e9988f03567224d6536bdfb022a01 Author: Max Gekk Date: Wed Nov 2 21:02:07 2022 +0000 Minor refactor to DeltaThrowable GitOrigin-RevId: 826a81dc9c689e5336bce1b8a3b1c66c91e8f365 commit e275e6cbe1c3d94f8af2e94ae0895e16d22c2fbc Author: Scott Sandre Date: Wed Nov 2 13:34:03 2022 -0700 Previously, we were filtering out empty AddFiles (i.e. numLogicalRecords = 0) in the MergeIntoCommand. This code belongs somewhere more common/central, instead of in only 1 DML command. This PR puts that logic in TransactionalWrite. It belongs there, and not in DelayedCommitProtocol, since we need the stats to determine if an AddFile is empty, and those are collected and used by TransactionalWrite, outside of DelayedCommitProtocol. GitOrigin-RevId: 33f06e15c3f87960e18ac551a201f8bc421e2d85 commit 0f30d29b5736b8eee204bdb903e22ce96bbaae51 Author: Fredrik Klauss Date: Tue Nov 1 23:39:52 2022 +0100 Use flag instead of extra method to keep the number of records when filtering files Currently, there's ```filterFilesWithNumRecords``` and ```filterFiles``` to either filter with row level metrics or not. This is difficult to use as one needs to do an if statement to use the correct method. Instead we just unify those two methods and add a flag to indicate if we need row level metrics. GitOrigin-RevId: 95f2497a55be5fba3782808b3143840ffb128911 commit 9a95bd640058b16c355d7790097a06e35e1ee561 Author: Ionut Boicu Date: Tue Nov 1 09:07:23 2022 +0100 Minor refactor to DeltaSuite GitOrigin-RevId: c8e9cec98c4f7e99e27e8ece35891e429945d486 commit 7b1a83dcccd1ce69840c6f641a6bea6b8ed92b0e Author: Johan Lasperas Date: Mon Oct 31 15:39:02 2022 +0100 Rename Delta merge clause classes The names of classes used today to represent Delta merge clauses will become ambiguous if NOT MATCHED BY SOURCE is ever introduced in the future. In particular, UPDATE and DELETE actions may be used with both MATCHED and NOT MATCHED BY SOURCE clauses. 
GitOrigin-RevId: 98b4a20694642651e9c0e54df51fe3df26b54f99 commit 7f571d530d62021934963cf1d62eea2273e23768 Author: Max Gekk Date: Fri Oct 28 12:31:32 2022 +0000 Minor refactor to DeltaHistoryManagerSuite GitOrigin-RevId: 9b8056a8632093b1cb7c3e174ffd3ebf55416eaf commit fd503d80328bec38a274ea36f99c2ba68e64f8ad Author: Rajesh Parangi Date: Thu Oct 27 12:24:38 2022 -0700 Add additional metrics to vacuum stats. This change adds these three additional metrics to the vacuum stats: 1. sizeOfDataToDelete 2. timeTakenToIdentifyEligibleFiles 3. timeTakenForDelete GitOrigin-RevId: 34a0e40cbe1049a219d8dfcd576631f7b6ed4aa2 commit 529313d61a0d998b66b6fa0cad7d35e971edfb94 Author: Yuming Wang Date: Thu Oct 27 10:40:28 2022 -0700 Update Spark to 3.3.1 ## Description Spark 3.3.1 has been released. This PR upgrades Spark to 3.3.1. Existing tests. ## Does this PR introduce _any_ user-facing changes? No. Closes delta-io/delta#1382 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 75413de8536a3a083ec59a67f4715bd269e9415b commit 518827aa2c5eeda317217fce5bccefdc94c72967 Author: lzlfred Date: Thu Oct 27 10:32:27 2022 -0700 Minor refactor to DataSkippingReader GitOrigin-RevId: a3d4b7862b0e1f36fdfbf7824570fc42d40ecabb commit c156c9814156dcdbc0524f6e1101accc265fb1b9 Author: Jonas Irgens Kylling Date: Wed Oct 26 19:58:09 2022 -0700 Use startAfter in S3SingleDriverLogStore.listFrom ## Description The current implementation of `S3SingleDriverLogStore.listFrom` lists the entire content of the parent directory and filters the result. This can take a long time if the parent directory contains a lot of files. In practice, this can happen for _delta_log folders with a lot of commits. We change the implementation to use the [startAfter](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/s3/model/ListObjectsV2Request.html#startAfter--) parameter such that we only get keys lexicographically greater than or equal to the resolved path in the S3 list response.
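The effect on the number of list requests can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `list_page` is a hypothetical stand-in for a single paginated list request returning at most 1000 keys, it does not call S3, and the key layout is made up for the example. The sketch models `start_after` as exclusive (only keys strictly after it are returned), so the caller passes the key just before the first one it wants.

```python
from bisect import bisect_right

PAGE_SIZE = 1000  # assumed maximum number of keys returned per list request


def list_page(sorted_keys, start_after=""):
    """Hypothetical single list request: up to PAGE_SIZE keys strictly after start_after."""
    begin = bisect_right(sorted_keys, start_after)
    return sorted_keys[begin:begin + PAGE_SIZE]


def list_all_then_filter(sorted_keys, start_after):
    """Old approach: page through the whole directory, filter on the client."""
    requests, result, cursor = 0, [], ""
    while True:
        page = list_page(sorted_keys, cursor)
        requests += 1
        result.extend(k for k in page if k > start_after)
        if len(page) < PAGE_SIZE:
            return result, requests
        cursor = page[-1]


def list_with_start_after(sorted_keys, start_after):
    """New approach: let the store skip everything up to start_after."""
    requests, result, cursor = 0, [], start_after
    while True:
        page = list_page(sorted_keys, cursor)
        requests += 1
        result.extend(page)
        if len(page) < PAGE_SIZE:
            return result, requests
        cursor = page[-1]


# 2500 zero-padded commit files, so lexicographic order matches version order.
keys = [f"_delta_log/{v:020d}.json" for v in range(2500)]
old_result, old_requests = list_all_then_filter(keys, keys[1989])
new_result, new_requests = list_with_start_after(keys, keys[1989])
```

In this toy setup both approaches return the same 510 keys, but the full scan issues 3 list requests while the `start_after` variant issues 1.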
This will usually reduce the number of S3 list requests from `size of _delta_log / 1000` to 1. This resolves #(https://github.com/delta-io/delta/issues/1191). I've tested the patch briefly with the sample test described in #(https://github.com/delta-io/delta/issues/1191). The [previous iteration of this patch](https://github.com/jkylling/delta/commit/ec998ee9bc62b65c0f4be5ae8f38a5c5753b443c) has been tested a bit more. Correctness has not been tested thoroughly. ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1210 Signed-off-by: Scott Sandre GitOrigin-RevId: 2a0d1279655672cbdffd5604b7d7d781556888b9 commit 743cf095687802ecea29f315614974d43199404b Author: noelo Date: Wed Nov 2 23:36:02 2022 +0100 Update Flink FAQ to remove outdated answer (#463) Fixes #462 commit 4eb7d3cff0d60768865c16b08dd19a6f2f48da53 Author: Gopi Krishna Madabhushi Date: Sat Oct 29 00:57:44 2022 +0530 Add DeltaConnectorConfiguration options to flink sink impl (#425) * refactor - move options related classes from source package up to options package * Rename DeltaConfiguration to Delta, extract option validator class * move options to an internal package * fix import order * fix import order * fix test by passing default connector options * Add comments * fix comment * unused import * checkstyle fixes * checkstyle fixes * fix import order * fix import order * Add test for OptionValidator * Create an exception type for option validation errors * code review comments * Move source option validation to use common option validator * fix comment * fix indent * use junit5 api * checkstyle error Co-authored-by: Gopi Madabhushi commit 4a401ea992fe105de61cb152a8082b86b85551dd Author: Shixiong Zhu Date: Wed Oct 26 13:06:57 2022 -0700 Call DeltaLog.update in Delta streaming source to ensure we use latest table schema When a `DeltaLog` instance is cached, `DeltaSource` will get the cached `DeltaLog` when calling `DeltaLog.forTable`. However, it doesn't call `DeltaLog.update`. 
This means if nobody on the same cluster touches `DeltaLog`, running the streaming query on this cluster will always use a stale `Snapshot` in the cached `DeltaLog`. This breaks one use case: when a streaming query detects a schema change in a Delta table, it will fail. But when the streaming query gets restarted on the same cluster, it should recover and continue to run like running on a different cluster. Due to the above bug, the streaming query cannot get the latest schema (`DeltaSource.schema` is using the stale `Snapshot` to get the schema) and fail during restart. This PR adds the missing `update` calls to make sure `DeltaDataSource.sourceSchema` and `DeltaSource.schema` always get the latest table schema. The new added unit test. GitOrigin-RevId: b5488671ceaf942e48c4cbdb068b305fdc582d46 commit 155be0dba1bd0b3884655adceb607154557b1858 Author: Wenchen Fan Date: Wed Oct 26 00:57:15 2022 +0000 Minor refactoring Author: Wenchen Fan GitOrigin-RevId: 24136d53d29b8c80f296c0ce4797eb48d92955e3 commit 06524f0dcc36b691e4d4d79b459edc625293b045 Author: Paddy Xu Date: Tue Oct 25 17:19:30 2022 +0200 Return canonicalized paths from DeltaLog GitOrigin-RevId: a851c0781121c0b963d166c853aece33accf6f19 commit 9e392e680e0773b1bd203f984d17c57d80e877e5 Author: Christos Stavrakakis Date: Tue Oct 25 09:43:13 2022 +0200 Minor refactoring GitOrigin-RevId: 36b2bdb6daa4b56962d06388f45cd82346805992 commit eca0247ce53017c297984453f5387de982e5c3cf Author: Sabir Akhadov Date: Mon Oct 24 23:13:26 2022 +0200 Refactor `MERGE INTO` suites. Minor refactoring for test suites to facilitate the reuse of Merge Clause helpers. GitOrigin-RevId: f9fd57b0aa0b8721e505ad78636702750d9f2fc8 commit 4473b838e5e22e7725210c1c5c02753877d3500e Author: Juliusz Sompolski Date: Mon Oct 24 22:50:11 2022 +0200 Tests for handling user defined _change_type column with CDC disabled Extra tests for handling user defined _change_type column with CDC disabled. 
DML commands make various assumptions about _change_type column when CDC is enabled. Add some extra tests that check that it doesn't break when CDC is disabled and the column is user-defined. GitOrigin-RevId: 260cfeb29dc4ec90c22b22587400ee082295e9b0 commit 8b7154d94dd0c75f0681846568918a7a94fd1300 Author: Eric Maynard Date: Fri Oct 21 17:46:58 2022 -0700 Rename sourceType to sourceFormat in Delta logging We recently added a new field sourceType for ConvertToDeltaCommand, but let's call this `sourceFormat` instead. GitOrigin-RevId: 4d766a4105aa831e44be4f97f70e13a8e8fe33da commit f0a8523e03402544a08946345c856acd9b47d621 Author: Hussein Nagree Date: Fri Oct 21 18:25:35 2022 -0400 Create deltaLog.forTableWithSnapshot Create a new function that returns both the deltaLog & the latest snapshot. Added a new test. GitOrigin-RevId: 8d715469cc3b3c762f40e3d5a092e6de2cbcb2cc commit 53ba118f5259aabf58459acbd333ee370f421458 Author: Jackie Zhang Date: Fri Oct 21 15:06:29 2022 -0700 Support id column mapping mode in Delta Lake Enable id column mapping mode + related tests for Delta Lake. Unit tests. GitOrigin-RevId: 9cdffa0abdf8233937b6cd8ce28c42d49afe65eb commit 8370b5154ec8055e344184dbb28885d2b54fb501 Author: Lars Kroll Date: Fri Oct 21 17:46:35 2022 +0200 Minor refactoring GitOrigin-RevId: ea41fbeef8f6c91aa2f2c415601569bd0ade779d commit 926fd632205428b5044f80bfc6e872d5c0f4a8bf Author: Christos Stavrakakis Date: Fri Oct 21 10:59:16 2022 +0200 Remove hard-coded version from column mapping tests `DeltaColumnMappingTestUtils` are using table properties while running `CONVERT TO DELTA` to upgrade the Delta version to (2, 5). This breaks when Delta default version is higher. This commit fixes this issue by using the max of the versions. Existing tests. 
GitOrigin-RevId: 710638aa42eb1fab5c11bd133dc8a5e832dabbd1 commit f71ab622143453a2f29495db02caf2ab8252b7e7 Author: Weitao Wen Date: Thu Oct 20 17:05:08 2022 -0700 Minor refactoring GitOrigin-RevId: 505ae9acdd01268233231824286afe1d8d6c3825 commit 90ae723728a75acb0442163bb9368732a5040cc2 Author: Anish Date: Wed Oct 19 14:01:56 2022 -0700 Add test to validate that empty dataframe is generated for microbatch processing non-data operations in delta source GitOrigin-RevId: 5ef59fcb6598781f7589e260181382664483ddb5 commit 2118e64b22346aa367b7aad425befa13350be763 Author: Carl Fu Date: Wed Oct 19 09:12:05 2022 -0700 Populate metrics when running DELETE on a partitioned column/field Export numDeletedRows in DeleteCommand's Case 1 and Case 2 so that numRowsAffected shows up when a user runs a DELETE query on a partition key or on the whole table. GitOrigin-RevId: f5af1c495a49d344a73c405c7d8582c7a2ed7160 commit 7e9ec8d2fa3c577994488b8861357db63e35a458 Author: Paddy Xu Date: Wed Oct 19 11:48:36 2022 +0200 Update CDC Streaming tests to check correctness This PR updates CDC streaming tests to check correctness. Existing tests. GitOrigin-RevId: c239a9d3a82a9696f4b0db7530ad49b4d6e0a38c commit be6ddca5cb5c41c8534542b94f5e10e9de280d8b Author: Johan Lasperas Date: Wed Oct 19 09:45:23 2022 +0200 Refactor pattern matching on MergeIntoTable Pattern matching on MergeIntoTable fully expands all the arguments, which is error prone (arguments can be swapped) and makes changing MergeIntoTable in the future harder. This is a plain refactor, with no functional change introduced. GitOrigin-RevId: 214a2fe57e0c399af5173fec59cf54674a55f6ad commit 23751f8eead9d73f743f4d2325657331997eb684 Author: ChenShuai1981 Date: Tue Oct 18 23:49:36 2022 -0700 Issue #1436: Fix restore delta table NotSerializableException for Hadoop 2 Executing the `restore` command on a Delta table through Spark SQL with Hadoop 2 reported `java.io.NotSerializableException: org.apache.hadoop.fs.Path`.
The issue is only in Hadoop 2 because [Path is serializable in Hadoop 3](https://issues.apache.org/jira/browse/HADOOP-13519). ## Description Resolves #1436 Package new version of delta-core jar and put it under $SPARK_HOME/jars directory. Launch spark-sql and execute `restore table xxx TO VERSION AS OF xx` command on existed delta table, it executed successfully. Then execute `DESCRIBE HISTORY xxx` command on the delta table, it show `RESTORE` operation at the last commit. spark-sql (default)> restore table default.people10m TO VERSION AS OF 4; table_size_after_restore num_of_files_after_restore num_removed_files num_restored_files removed_files_size restored_files_size 1808 4 5 4 2260 1808 Time taken: 22.38 seconds, Fetched 1 row(s) spark-sql (default)> DESCRIBE HISTORY default.people10m; version timestamp userId userName operation operationParameters job notebook clusterIreadVersion isolationLevel isBlindAppend operationMetrics userMetadata engineInfo 7 2022-10-18 10:23:33.325 NULL NULL RESTORE {"timestamp":null,"version":"4"} NULL NULL NULL Serializable false {"numOfFilesAfterRestore":"4","numRemovedFiles":"5","numRestoredFiles":"4","removedFilesSize":"2260","restoredFilesSize":"1808","tableSizeAfterRestore":"1808"} NULL Apache-Spark/3.3.0 Delta-Lake/2.1.0-SNAPSHOT ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1440 Signed-off-by: Scott Sandre GitOrigin-RevId: e4e453bc07ee43e893d7356c8c5f45c7dd5ebe14 commit ac13fcb06b5a968a66ece5a9450d034744dd7a73 Author: Christos Stavrakakis Date: Tue Oct 18 15:35:15 2022 +0200 Add source type for CONVERT TO DELTA logging GitOrigin-RevId: c68e82ca5a9384a831c917fd6955f8056672e9ff commit 16dad5a05d6303aa991c988827b9c54b159b885f Author: Lars Kroll Date: Mon Oct 17 21:41:08 2022 -0700 Update Protocol Spec for Deletion Vectors ## Description - This PR makes the concrete changes proposed in #1367 to the Delta protocol specification. For details of what this proposal entails, see that issues. 
- In addition, this PR makes some clarification changes to the wording in the spec in various places, many of which were necessary to correctly reflect concepts introduced by the proposal (e.g., _logical files_, exact column stat semantics). N/A (document-only). ## Does this PR introduce _any_ user-facing changes? No. Closes delta-io/delta#1372 Signed-off-by: Scott Sandre GitOrigin-RevId: 3de4c4248db7a6ae3052ea65ccd0d8ebe741c8f2 commit d3f63633c56ecd5b6d805c6961bf14dbcee6b4f3 Author: Vitalii Li Date: Tue Oct 18 04:40:26 2022 +0000 Minor refactor test code GitOrigin-RevId: 2f43127c486e4738d8b13f92959106e94205a35a commit 94fdfc2a7cfb390e15cfcefd3327bf76e7b4f7a5 Author: Max Gekk Date: Fri Oct 14 13:00:33 2022 +0000 Minor refactor to error assertions in MergeInto- & Update-SuiteBase GitOrigin-RevId: cc02e2879e6974b5571ac8afa1d1b2f006fc8cf3 commit 41e064e9afef31fa84695f893f32a918b19e9e49 Author: Lars Kroll Date: Fri Oct 14 14:27:21 2022 +0200 Minor refactor to DeltaSuite GitOrigin-RevId: 882448b34f50933bb1e6865e951bc35ca8e5ac2f commit 6dbc55db53332c985e5bc8470df6c95106afac25 Author: Scott Sandre Date: Thu Oct 13 17:17:38 2022 -0700 Fix S3DynamoDBLogStore concurrent writer bug ## Description See the full bug explanation at #1410; [this comment in the code explains the problem scenario further](https://github.com/delta-io/delta/pull/1416/commits/5ef0035fd210b3065e64b5f8a9adff41854b98be#diff-50c09f154175b53894ded5fe6a722f12129ab3742afc7612f56fd6eebfe3487dR52-R80). The existing code had the following bugs, for which this PR adds the following fixes: - the code path in `BaseExternalLogStore` that copies `T(N)` into `N.json` was using `overwrite=true`. This can overwrite existing files and cause data loss. This PR sets `overwrite=false`.
- when "acknowledging" the commit and writing to the external store entry `E(N, complete=true, commitTime=now)`, this let any underlying TTL infra clean up that commit **right away**, because our docs specified that the TTL key was the `commitTime` attribute. Instead, we need to add some delay (this PR uses 1 hour) after which the entry can be TTLd and cleaned up. This gives enough time that any future `fileSystem.exists(path)` call will return true and we don't run into any mutual exclusion issues. This PR also renamed this field to `expireTime`. - in general, we were not short-circuiting early if `overwrite=false` and path already exists. This PR now does that. Added new unit tests. Also, updated integration tests. Ran integration tests with ``` # 1. Created DDB table aws dynamodb create-table \ --region us-west-2 \ --table-name scott_test_2022_10_07_000 \ --attribute-definitions AttributeName=tablePath,AttributeType=S \ AttributeName=fileName,AttributeType=S \ --key-schema AttributeName=tablePath,KeyType=HASH \ AttributeName=fileName,KeyType=RANGE \ --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5 # 2. Enabled TTL on that table aws dynamodb update-time-to-live \ --region us-west-2 \ --table-name scott_test_2022_10_07_000 \ --time-to-live-specification "Enabled=true, AttributeName=expireTime" # 3. Configured test params export DELTA_CONCURRENT_WRITERS=20 export DELTA_NUM_ROWS=50 export DELTA_DYNAMO_TABLE_NAME=scott_test_2022_10_07_000 # 4. Ran integration test ./run-integration-tests.py --use-local \ --run-storage-s3-dynamodb-integration-tests \ --dbb-packages org.apache.hadoop:hadoop-aws:3.3.1 \ --dbb-conf io.delta.storage.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider \ spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider ``` ## Does this PR introduce _any_ user-facing changes? 
Users will need to modify any existing TTL settings to use the new `expireTime` field instead of the old `commitTime` field. ## Migrating from old TTL attribute `commitTime` to new TTL attribute `expireTime` ### Enabling TTL for the first time If TTL is not enabled but you would like to enable it, run ``` aws dynamodb update-time-to-live \ --region \ --table-name \ --time-to-live-specification "Enabled=true, AttributeName=expireTime" ``` You may wish to clean up old, existing entries, too. It is safe to clean up any entry **older than an hour** with attribute `commitTime` set. Any entry with this attribute set will already be completed and written to the file system. Since it is older than an hour, the fixes in this PR mean we no longer need DynamoDB to provide mutual exclusion during the write. ### Changing TTL from `commitTime` attribute to the new `expireTime` attribute [As the DDB docs say](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/time-to-live-ttl-before-you-start.html) > You cannot reconfigure TTL to look for a different attribute. You must disable TTL, and then reenable TTL with the new attribute going forward. so first disable the current TTL attribute ``` aws dynamodb update-time-to-live \ --region \ --table-name \ --time-to-live-specification "Enabled=false, AttributeName=commitTime" ``` and then enable the new TTL attribute ``` aws dynamodb update-time-to-live \ --region \ --table-name \ --time-to-live-specification "Enabled=true, AttributeName=expireTime" ``` After this, you can perform the same cleanup as described above, for all entries older than an hour with a `commitTime` attribute set.
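The cleanup rule described above can be sketched as a small filter. This is an illustration only, not code from the PR: it assumes entries are available as plain dictionaries, that `commitTime` and `expireTime` hold epoch seconds (the unit DynamoDB TTL works with), and that `expireTime` is the commit time plus the one-hour delay mentioned in the description. The helper names are hypothetical.

```python
ONE_HOUR_S = 60 * 60  # the delay this PR is described as using


def expire_time(commit_time_s, delay_s=ONE_HOUR_S):
    """Assumed relationship: an entry becomes eligible for TTL one hour after commit."""
    return commit_time_s + delay_s


def legacy_entries_safe_to_delete(entries, now_s):
    """Select entries that still carry the old commitTime attribute and are older than an hour."""
    cutoff = now_s - ONE_HOUR_S
    return [e for e in entries if "commitTime" in e and e["commitTime"] < cutoff]


# Hypothetical entries; only the attribute names come from the description above.
entries = [
    {"fileName": "0.json", "commitTime": 1_000},   # old attribute, older than an hour
    {"fileName": "1.json", "commitTime": 9_000},   # old attribute, too recent
    {"fileName": "2.json", "expireTime": 12_600},  # new attribute, left to TTL
]
to_delete = legacy_entries_safe_to_delete(entries, now_s=10_000)
```

Entries that already carry `expireTime` are deliberately left alone here, since the TTL configuration shown above takes care of them.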
Closes delta-io/delta#1416 Co-authored-by: ormahler Signed-off-by: Scott Sandre GitOrigin-RevId: 4f148ac93f5ff7ea0af0546ae6cad18e38755134 commit 643cf6312c8ae5fba07dd14c5ff6103570ff328a Author: Anish Shrigondekar Date: Thu Oct 13 09:20:32 2022 -0700 Skip processing on newer versions if rate limit has been reached and return only corresponding entries in iterator GitOrigin-RevId: 56678b556a0bab669f7aa137151313844078f80a commit 9017ac0d811c0a42ba8ac45720bddf06c8f17e63 Author: Adam Binford Date: Thu Oct 13 01:40:49 2022 -0700 Allow schema pruning for delete first pass Resolves https://github.com/delta-io/delta/issues/1411 Re-orders the delete "find files to rewrite" step to put the empty project (input_file_name) before the non-deterministic filter UDF. This allows for top-level schema pruning, and in Spark 3.4 it should allow for nested pruning as well. Also updated the formatting (and `Column(InputFileName())` -> `input_file_name()`) to match Update. New UTs. Performance improvement on delete on data condition. Closes delta-io/delta#1412 Signed-off-by: Shixiong Zhu GitOrigin-RevId: abfee4cf9f8d8ffaef9e397ad9c237c576b8b807 commit 6bca231e3fb1bf61aa376f8a497de075144e0ce5 Author: lzlfred Date: Wed Oct 12 12:23:01 2022 -0700 Always validate state reconstruction Delta has configs that disable state reconstruction validation, and error messages sometimes expose them to users. These are dangerous backdoors. This PR always validates the state (for missing metadata and protocol) and always throws an exception if validation fails. Existing UT.
GitOrigin-RevId: 020d23774e7fda0d7b05bc68baf75cfe000c7746 commit 7420cfcf32f9015cbb2026586d1ffa0be6913285 Author: Shixiong Zhu Date: Wed Oct 12 11:23:39 2022 -0700 Fix Delta streaming CDC source filter logic to not return incorrect -1 index GitOrigin-RevId: 69028f5f24b2a4f1326475d02219addd1b72363b commit c396a1f037c58c85e34a5f4807d99483460a9566 Author: Abhishek Somani Date: Wed Oct 12 14:21:37 2022 -0400 Minor refactor to DataSkippingReader GitOrigin-RevId: dfa709c50d7fa12b584924475bea1cdded3f54d0 commit e1e8fa59145d816c6298ebd11747a32e427acb9f Author: Max Gekk Date: Wed Oct 12 12:13:22 2022 +0000 Minor refactor to DeltaThrowableHelper GitOrigin-RevId: 27bcb09814a28f479d58aa1a4b55a1728d2447da commit acd0bd4fa6f20da6af8f149b37b5d414d3174023 Author: Adam Binford Date: Wed Oct 12 03:20:24 2022 -0700 Add a nested schema pruning test case for Update GitOrigin-RevId: 0955ec2d924d264262097a9321306730624f677c commit ecb78cddb7d41051d924a7d34066ecfcff5eabdf Author: lzlfred Date: Wed Oct 12 00:33:52 2022 -0700 Minor refactor to DeltaSQLConf GitOrigin-RevId: 1e1541ae284f32f30c18d0462de2201d931de396 commit 2c236c9056abbde904849e5b543df0585961c0ae Author: Ionut Boicu Date: Wed Oct 12 08:44:12 2022 +0200 Minor refactor to UpdateMetricsSuite GitOrigin-RevId: 2e8b87e26ce9c07924eec203c5c20c175e4a3f4e commit 0bbec372842ea75b38cd37ec438fe8cadea2948e Author: Anish Shrigondekar Date: Tue Oct 11 11:42:55 2022 -0700 Update offset for delta source within latestOffset for no data changes in CDF case GitOrigin-RevId: fea46124c400d6a94beee5e6e29864cd1f520509 commit 90aef096a1818b67558fc30cefd9b5126e958021 Author: Jackie Zhang Date: Mon Oct 10 19:44:15 2022 -0700 Add extra test to DeltaSourceColumnMappingSuite GitOrigin-RevId: 02a117195c2b39705c835878d29710dd9993e537 commit 12336212f5dbf20f63be77e4e8d246a35d20e13c Author: Wenchen Fan Date: Tue Oct 11 10:21:17 2022 +0800 Minor refactor to DeltaTimeTravelSuite GitOrigin-RevId: 78a08be07cef652172f1d5474cb5ac95a5926939 commit 
6505f028b1ca83fa4e0b1063470dcc9afc1c4a92 Author: Felipe Pessoto Date: Mon Oct 10 15:00:21 2022 -0700 Fix bug on merge command when DELTA_COLLECT_STATS is disabled ## Description When Delta stats collection is disabled, the merge command removes all Delta files and doesn't add any, resulting in an empty table. Add a new test to run without statistics. ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1413 Signed-off-by: Shixiong Zhu GitOrigin-RevId: acec67906c54802ceeaa8b1f970331bd201f3d7b commit 6c76169ed9aad737e900ad70d6a22d8a660de4a7 Author: Paddy Xu Date: Mon Oct 10 17:31:27 2022 +0200 Add tests for operations with special characters in path to DeltaSuite.scala GitOrigin-RevId: 5bd50124031940fa929f296be497470e7d5b2f38 commit 162c10a7828c7263cfac8aea0335d3b553f34982 Author: Max Gekk Date: Sat Oct 8 05:01:25 2022 +0000 Minor refactor to RangePartitionIdSuite GitOrigin-RevId: 0d9ac761f4b3deda8a94a31791467dbdd7c531ef commit d1492c1cf3937aa0b3abe783a9e4514c9f9d45d1 Author: Ryan Johnson Date: Fri Oct 7 14:03:07 2022 -0700 Minor refactor to Snapshot.scala GitOrigin-RevId: 53d9c49c506b6dae68564808eadaed681e386f70 commit 553fada01700fdf850641db8f29091a4eb5fe66c Author: Jackie Zhang Date: Fri Oct 7 10:50:13 2022 -0700 Block progressing on `latestOffset()` for DeltaSource if detected incompatible schema changes GitOrigin-RevId: 0e618d07d92092f27f90c405c88febe97da1b141 commit 601b913f60f2e60cecf07d3866443cfee4de88ba Author: K.I. (Dennis) Jung Date: Wed Oct 12 11:34:24 2022 +0900 Add 'equivalent' for type comparison (#426) Currently, comparison expressions fail with `IllegalArgumentException` for decimal type arguments that don’t have equal precision/scale. This should be valid (and is in Spark SQL). This impacts IN and all the binary comparison operators. This change adds an `equivalent` method, used only internally by our code, to compare data types. Refer to https://github.com/delta-io/connectors/issues/222 for details.
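The `equivalent` check described above can be pictured with a minimal sketch. The types below are hypothetical stand-ins written for illustration, not the connector's actual classes, and the rule that any two decimals are equivalent regardless of precision/scale is an assumption drawn from the description, not from the real implementation.

```scala
// Hypothetical stand-ins for the connector's data types (illustration only).
sealed trait DataType {
  // Default rule: two types are equivalent only when they are equal.
  def equivalent(other: DataType): Boolean = this == other
}
case object IntegerType extends DataType
case object StringType extends DataType
final case class DecimalType(precision: Int, scale: Int) extends DataType {
  // Assumed rule: decimals are mutually comparable whatever their precision/scale.
  override def equivalent(other: DataType): Boolean = other.isInstanceOf[DecimalType]
}

object TypeCheck {
  // Validates the operands of a binary comparison such as `<`, `=` or IN.
  def requireComparable(left: DataType, right: DataType): Unit =
    if (!left.equivalent(right)) {
      throw new IllegalArgumentException(s"Cannot compare $left with $right")
    }
}
```

Keeping `equivalent` separate from `equals` means strict equality stays available wherever exact schema matching matters, while comparison expressions use the looser check.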
Signed-off-by: Kwangin Jung commit 0eb4c7eb7a84aa451fc8accd82bb771167ad137f Author: Ryan Johnson Date: Fri Oct 7 07:27:23 2022 -0700 Delta metadata reads fail on missing/corrupt files GitOrigin-RevId: 92ef2532fe717da6368e6124c1ae3307133d134c commit 346c695866b2e23c1f45ae00953fa46bf3f076eb Author: lzlfred Date: Thu Oct 6 11:05:54 2022 -0700 Support suppress optional fields in _last_checkpoint Adding the new flag to suppress optional fields in _last_checkpoint in case some third party reader does not support those optional fields. Unit test GitOrigin-RevId: a935344e1e93d91fb3a914567b3780842934d115 commit 177443e914cb2a1c08a74ad7e38dfdc7406c4bdd Author: Johan Lasperas Date: Thu Oct 6 15:03:10 2022 +0200 Minor refactor to TahoeFileIndex GitOrigin-RevId: 0f0b97acf34cd1f8ac639eaa0ce680b6dd532f11 commit 8342674433e53a3c0e2b04882eb4c29bcac4d6a9 Author: Lars Kroll Date: Thu Oct 6 15:02:52 2022 +0200 Remove default impl of getSnapshot from TahoeFileIndex Existing tests. GitOrigin-RevId: b0f571927fddf308d921c1abed4973b72b4ba281 commit 8d4583cee596672c8fb19c236bf2fc7df7fce84b Author: Karthik Subramanian Date: Tue Oct 4 21:20:09 2022 -0700 Remove unused declaration in MergeIntoCommand Remove unused declaration - seqToString Remove unused declaration seqToString in `MergeIntoCommand`. Existing unit tests No Closes delta-io/delta#1407 Signed-off-by: Scott Sandre GitOrigin-RevId: 9580a350dbf16ca541f81df4f07e1527c64db950 commit c850b5b9cf0a3ae7a74bd04a55145eda563696bc Author: Allison Portis Date: Tue Oct 4 21:19:41 2022 -0700 Update issue workflow Tested on my fork Closes delta-io/delta#1400 Signed-off-by: Allison Portis GitOrigin-RevId: cddb6184a23daa899bb2a6c339714d688c8cbbc0 commit a934789fa0eb86d6b7ed1ad5fd791ce5b6a9520e Author: Hussein Nagree Date: Tue Oct 4 13:45:50 2022 -0700 Refactor `snapshot` to `unsafeVolatileSnapshot` Change the API `snapshot` to `unsafeVolatileSnapshot`. In all src code, we update the name to this new one to reflect the danger of using the api. 
In test code, we define an implicit helper method `snapshot` so that the test usages (of which there are hundreds) don't all have to change. The reason for the change is that the `snapshot` def is backed by a volatile var. It has the potential to introduce race condition bugs since its value might change between sequential accesses. There are a few cases where a better fix is available (for example, fetching the snapshot by doing a `deltaLog.update`), but in order to reduce the potential of bugs in such a large PR, I've left any actual logic changes to be done in follow-up PRs. GitOrigin-RevId: 2313c46fb4c1db7d85e4632a75a34949f56aec2f commit f8781ce880917359c468c4374bddca7edca4b654 Author: Jintian Liang Date: Tue Oct 4 10:42:53 2022 -0700 Minor refactor to DeltaScan GitOrigin-RevId: 73c0661973ca1f1140d746df833da03bacd0f94f commit fcfba8d0083fe82d803c30b237a8de1b229d434c Author: Tom van Bussel Date: Tue Oct 4 09:53:31 2022 +0200 Minor refactor to various test suites GitOrigin-RevId: 991796bfff8636b5d00a12e6e69d1cbc71e054e9 commit 4ce80dc4fa31fef3397e75809168fec6797c7070 Author: Shixiong Zhu Date: Mon Oct 3 16:25:24 2022 -0700 Fix Delta streaming source filter logic to not return incorrect -1 index This PR fixes a bug in `DeltaSource.getFileChanges` that may return an IndexedFile before the given `(fromVersion, fromIndex)`. Right now, if there is an `IndexedFile(version=X, fromIndex = -1)`, searching it using `(fromVersion=X, fromIndex=Any)` will return `IndexedFile(version=X, fromIndex = -1)`. 
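The intended filtering can be sketched as follows. This is a simplified model, not the code in `DeltaSource.getFileChanges`: the `IndexedFile` case class here has only two fields (with the position named `index`), and the `FileChanges.after` helper and its `inclusive` flag are hypothetical names introduced for illustration.

```scala
// Hypothetical, simplified model of an entry produced by the streaming source.
final case class IndexedFile(version: Long, index: Long)

object FileChanges {
  // Keeps only the entries positioned after (fromVersion, fromIndex), comparing
  // (version, index) lexicographically. With inclusive = true the entry sitting
  // exactly at the given position is kept as well.
  def after(
      files: Iterator[IndexedFile],
      fromVersion: Long,
      fromIndex: Long,
      inclusive: Boolean): Iterator[IndexedFile] =
    files.filter { f =>
      f.version > fromVersion ||
        (f.version == fromVersion &&
          (if (inclusive) f.index >= fromIndex else f.index > fromIndex))
    }
}
```

With this comparison, an entry with index -1 for version X is dropped whenever the search starts at `(X, 0)` or later, which is the behaviour the fix aims for.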
GitOrigin-RevId: 6af2c7f44383b9eef40e5ad4185442ce5e88c994 commit 08a2ceb8ad1e4bae8d883327b77700cbdb2aee6d Author: Hussein Nagree Date: Mon Oct 3 11:22:32 2022 -0700 Minor refactor to SnapshotManagement GitOrigin-RevId: c8bb4ec8ad5cd8bba7867558be4dc7bf86a2f02f commit 1afdd952968a76e401016ca37cd6a03b4df9b4d1 Author: Gerhard Brueckl Date: Thu Oct 6 22:08:40 2022 +0200 PowerBI Connector - add support for MinReaderVersion 2 (#448) * add PowerBI connector * add parameter to buffer the binary file before reading it to mitigate streaming errors * update global readme to include reference to the Power BI connector * fix issue with online refresh * add inline-documentation update readme.md add *.pbix to .gitignore * - added supprt for ADLS Gen1 and recursive load of folder content in general (option IterateFolderContent) - better support for data types - now all simple data types are supported and casted correctly * fix docs/sample for ADLS Gen1 * - add support from TimeZoneOffset - fix issue with special characters in column names * update README.md * added option to use file pruning to skip files based on the min/max stats in the logs added support for all data types (simple and complex) added support for negative versions performance improvements * optimized Table.RowCount by leveraging the stats in the delta-log reading final column datatypes from delta-log * added full parsing for complex types added type for log schema * update readme * Add Example for ADLS Gen2 add proper error if the folder listing is not from a Delta Lake table * added support for `minReaderVersion` 2 Co-authored-by: gbrueckl Co-authored-by: Gerhard Brueckl commit a109bab61407e7c02fa551291f81cf12a58f1f6a Author: Hussein Nagree Date: Fri Sep 30 13:13:13 2022 -0700 Fix the metadata & tableVersion in FileIndexes Create a metadata method in the TahoeFileIndex, and where available fix the metadata & tableVersion so that they aren't dependent on the deltaLog.snapshot. 
However, don't change the TahoeLogFileIndex and leave the default implementations for the metadata and tableVersion methods in TahoeFileIndex. Existing tests GitOrigin-RevId: 05f2d8c7453b495db1ac40a05e335593bc1da54f commit 29d3a09289f5d799714df448ea9818d2fb5fcdd7 Author: Shixiong Zhu Date: Fri Sep 30 11:49:24 2022 -0700 Fix Delta source initialization issue when using AvailableNow When using AvailableNow, here are the flows for Delta source: - New query: prepareForTriggerAvailableNow, (latestOffset -> getBatch)*. - Restarted query: prepareForTriggerAvailableNow, getBatch, (latestOffset -> getBatch)*. When restarting a query, getBatch is required to be called first. Otherwise, previousOffset will not be set and latestOffset will assume it's a new query and return an incorrect offset. Today, we call latestOffset inside prepareForTriggerAvailableNow, which causes the incorrect initialization for lastOffsetForTriggerAvailableNow because previousOffset is not set yet at this moment when restarting a query. In this PR, we add isTriggerAvailableNow and set it to true in prepareForTriggerAvailableNow without initializing lastOffsetForTriggerAvailableNow, and make lastOffsetForTriggerAvailableNow initialization happen inside latestOffset (for new query) or getBatch (for restarted query) so that it can be initialized correctly. We add an internal flag spark.databricks.delta.streaming.availableNow.offsetInitializationFix.enabled to allow users switching back to the old behavior. In addition, we also add a validation for previousOffset and currentOffset to make sure they never move backward. This would ensure we will not cause data duplication even if we have any bug in offset generation. spark.databricks.delta.streaming.offsetValidation.enabled is added to allow users turning off the check. 
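The offset validation mentioned above amounts to a monotonicity check between two consecutive offsets. The sketch below is an assumption-laden illustration: `Offset`, its fields, and `OffsetValidation.validate` are hypothetical names, and the `enabled` parameter merely stands in for the `spark.databricks.delta.streaming.offsetValidation.enabled` flag rather than reading it from a real Spark configuration.

```scala
// Hypothetical, simplified offset: a position in the Delta log.
final case class Offset(version: Long, index: Long)

object OffsetValidation {
  // Throws if `current` is positioned before `previous`; equal offsets are allowed.
  // `enabled` models the configuration flag that turns the check off.
  def validate(previous: Offset, current: Offset, enabled: Boolean = true): Unit = {
    val movedBackward =
      current.version < previous.version ||
        (current.version == previous.version && current.index < previous.index)
    if (enabled && movedBackward) {
      throw new IllegalStateException(
        s"Found invalid offsets: previous=$previous, current=$current")
    }
  }
}
```

Failing fast here trades availability for correctness: the query stops rather than reprocessing data that was already emitted.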
GitOrigin-RevId: a2684fbcdecf8f621fd3ce751f3e71fc96a17d7c commit ab0158d0145e37eca7bb75fb15bf021c7d1fbe19 Author: Hussein Nagree Date: Fri Sep 30 11:09:01 2022 -0700 Minor refactor CheckpointHook GitOrigin-RevId: 5017a36840a0c1e143c2d1f527b80d3fe0bf5915 commit 140e3ece954c89162a179e4987cae5c16dce8e2c Author: Prakhar Jain Date: Fri Sep 30 05:12:29 2022 -0700 More time travel `AS OF` tests This PR adds more `AS OF` SQL syntax time travel tests. Just unit test changes. GitOrigin-RevId: 09394e6c4320b322e90139e32d4629a9392eae96 commit 29a45af0614be0dfde959ae88fd3ed3d1a7f6b75 Author: Nick Karpov Date: Fri Sep 30 00:00:40 2022 -0700 Update PROTOCOL to include change data spec ## Description Update PROTOCOL.md to include the change data file spec. I think it's possible to consider these new change files as "data files", but I've documented them as their own file type to start because they do not represent the actual table data the same way `add` and `remove` files do. N/A ## Does this PR introduce _any_ user-facing changes? Yes. This PR introduces changes to the documentation of the Delta Lake protocol. Closes delta-io/delta#1300 Signed-off-by: Shixiong Zhu GitOrigin-RevId: a20c0767d1e17e266705424a8f805b80465a0757 commit c3020e1dfb04be08fc56ff96bfdfe4ce997539d1 Author: Scott Sandre Date: Thu Sep 29 11:59:07 2022 -0700 Update delta Dockerfile - update the delta Dockerfile to install twine, setuptools, and wheel - remove the `pip install` of the above packages from the pypi tests that are run as part of the python unit tests - do not uninstall pyspark during unit tests. **Without this change** when we install the local delta-spark artifact, it will re-install pyspark, which is not what we want.
GitOrigin-RevId: 9971993e0de5e35e24293387177305039960671c commit ff6d9aa5d6fc18dbcd7bdff6ea0700a74e677896 Author: Lars Kroll Date: Thu Sep 29 09:38:48 2022 +0200 Minor refactor GitOrigin-RevId: fd552e2c06ba4636fadf0a8909dae19ce59fc08b commit 86451dfc2c0a048c554a84ded8481f564ad71508 Author: Lin Ma Date: Wed Sep 28 20:55:28 2022 -0400 Minor refactor DataSkippingDeltaTests GitOrigin-RevId: 1e89e4bd6cd9c5df93f9b826af83d354ebbe0e5f commit 0aeda64dac9c6e731aef70d8253bcf8639c1c95a Author: Zach Schuermann Date: Wed Sep 28 17:27:50 2022 -0700 Minor refactor deltaMerge GitOrigin-RevId: a360b30d48308e158c0d386efa8e61e378ed33ef commit 70d1f4cbe28c7cc372d8f4ba0a45f00e2cf71b62 Author: Tyson Condie Date: Wed Sep 28 16:05:27 2022 -0700 Minor refactor DeltaTestImplicits GitOrigin-RevId: 3551d0b5453abc1ddfbfd335bafe8b76257dec93 commit 7d528db4cfaaf7584949f01e05e6cab08fb66539 Author: Juliusz Sompolski Date: Wed Sep 28 20:24:58 2022 +0200 Minor refactoring to DeltaConfig GitOrigin-RevId: 7cd6d253911575d21b7db0ef80b7a77166f96255 commit c57d7d790fda1b2236551b5eda78b34287a93d9a Author: Prakhar Jain Date: Wed Sep 28 09:14:50 2022 -0700 Minor refactor GitOrigin-RevId: fbb0722f5d32898288e945a921c46733b2df9865 commit d41e30d3bf0d88a77a976d23421f087590fa26a3 Author: Juliusz Sompolski Date: Wed Sep 28 13:12:42 2022 +0200 Minor refactor PreprocessTableMerge and MergeIntoTimestampConsistencySuite GitOrigin-RevId: e658b841f66789383a6f31a4cb298dc79a1933d1 commit 2086a2b466b3962ee7405aaf02a1f80d52d44361 Author: Lin Ma Date: Tue Sep 27 21:04:41 2022 -0400 Minor refactor DataSkippingDeltaTests GitOrigin-RevId: 399b9efc6140f0b9c7c50e881553f56e3540c2fa commit bb0dde713e90516f9b2468af1f264d9856d96cd2 Author: Ryan Johnson Date: Tue Sep 27 13:07:43 2022 -0700 Revert "Make TahoeFileIndex not rely on..." This reverts commit 1f6ab824e14794c17202b5e4e5df6a95357a799c. 
GitOrigin-RevId: 12daf0436347dc8b94b1da7e78c917e7c4966f57 commit 7c48c13a4af07b207cbe7bd8f6d2f552990afe08 Author: Paddy Xu Date: Tue Sep 27 17:22:02 2022 +0200 Added test to UpdateCDCSuite and minor refactoring GitOrigin-RevId: dce234c1e54155dc0702dbd29985f267d1e9b74a commit 8a5ea84d37228ce9c991894997649ed1b3b892e4 Author: Lars Kroll Date: Mon Sep 26 14:44:46 2022 +0200 Misc metrics transformation helpers GitOrigin-RevId: b0f84e35c172285bd26c0682ad24fa0a76beeef1 commit dddb01ce0734ab33a83baab2b7cd791fc7b027e4 Author: Ryan Johnson Date: Sat Sep 24 17:13:44 2022 -0700 Minor refactor to DeltaCatalog GitOrigin-RevId: 458c7f02411bcd7e37794b480e5488ba7b8dfc89 commit dbb92daee845fed8052e1d91c4c3af3e0a40685a Author: Prakhar Jain Date: Fri Sep 23 13:58:14 2022 -0700 Minor refactor to Checksum and Snapshot GitOrigin-RevId: e142047d8025ea92e0ad762614a7140eb4c5a0b4 commit 524a042bc6cf7ab602e49fabfdc29d8e038deccd Author: Ryan Johnson Date: Fri Sep 23 09:26:05 2022 -0700 Revert "Make test-only DeltaLog.forTable overloads…" This reverts commit 213eaaff67eb2279243d2a3e9474df894e45dbbe. This reverts commit 851d360e18fb83b68974216b05bb63e7f437e303. GitOrigin-RevId: ce775dc80cf743a7d1d676c21399d1c634d7551a commit 3320693a4ac7faf6b251f628a85bfe6700c636df Author: Andreas Chatzistergiou Date: Fri Sep 23 08:21:46 2022 +0200 Minor refactor to DataSkipping files GitOrigin-RevId: cc13f1803ae718caeb4715c76187da57e7b5ffb0 commit 5a8daa9555a7e8dacf9a3450aa49fcb6fb3c9009 Author: Lukas Rupprecht Date: Thu Sep 22 10:15:05 2022 -0700 Adds back DeltaLog.forTable(spark, path) GitOrigin-RevId: 5114aea803d833758dd5aa8acec87f16e50b7cc8 commit 67bf022d3e4d8fc7c17c12be7875b855283b6996 Author: Lin Ma Date: Wed Sep 21 18:36:29 2022 -0400 Change the status collection schema to follow the order in the table schema Delta table collects column-level stats such as min/max and null counts. By default, stats for the first 32 columns are collected. However, the “first” is creating confusion for users. 
The public doc says Delta Lake collects statistics on the first delta.dataSkippingNumIndexedCols columns defined in your table schema, but the implementation collects the first columns based on the DataFrame schema for write. This PR proposes to change the stats collection to follow the order in the table schema. Added new unit tests GitOrigin-RevId: b852bcbd13462d72b68d4ef8223c5ee3e8344f9b commit 1f6ab824e14794c17202b5e4e5df6a95357a799c Author: Hussein Nagree Date: Tue Sep 20 23:50:09 2022 -0700 Make TahoeFileIndex not rely on deltaLog.snapshot Remove the dependence of the TahoeFileIndex on deltaLog.snapshot, by leaving any functions that require it to be abstract. All classes that extend it must reimplement the `tableVersion` and `metadata` functions Refactor, existing unit tests GitOrigin-RevId: b8d1a521c46ea54d509855a7eb0ed01beabda44a commit 8f30350d719a68ed8d435d910b0a13f732083c84 Author: Zainab Lawal Date: Tue Sep 20 11:47:18 2022 -0700 Removed the unused PinnedTahoeFileIndex class Removed the PinnedTahoeFileIndex class to fix this issue: Closes delta-io/delta#1386 Closes delta-io/delta#1388 Signed-off-by: Shixiong Zhu GitOrigin-RevId: ea58af65a17f8a3bf9a7be7b98841d3f2b2cb16f commit 213eaaff67eb2279243d2a3e9474df894e45dbbe Author: Ryan Johnson Date: Tue Sep 20 06:39:00 2022 -0700 Make test-only DeltaLog.forTable overloads implicit Today, `DeltaLog.forTable` has a large number of overloads, most of which are only called from unit tests (i.e. passing custom clocks, or taking `java.io.File` as the table path). This PR moves all those methods to `DeltaTestImplicits`, so it is obvious which overloads are actually used in prod code. If the code compiles, it should be correct. Meanwhile, the change almost exclusively affects unit tests. 
GitOrigin-RevId: 5cc024c32f8680f248cb2b9807c6c538e9dcdea3 commit 18bf2827270f1c06373598886133efe667d0727e Author: Ryan Johnson Date: Mon Sep 19 14:29:52 2022 -0700 Remove unused ValidateChecksum GitOrigin-RevId: b1fbbecc3e7b984eea8038fb69656115577da15b commit 938a00e1315534d44a5870c84fa0098812c3888e Author: Ryan Johnson Date: Mon Sep 19 14:29:05 2022 -0700 Make CommitInfo the first action in every commit GitOrigin-RevId: 8c5453afb13e6f5fe89f1e937ffc04ba1f330d60 commit bccac862bef0e8ec8dd936c2ae3750b17e3a5a3a Author: Paddy Xu Date: Mon Sep 19 10:17:08 2022 +0200 Refactor CDF code GitOrigin-RevId: 1d6e29f44a78799d6213d42c3649e127924674d2 commit 065f75f3e18f63290ad03139521336fe151034a8 Author: Ming DAI Date: Fri Sep 16 17:22:27 2022 -0700 Minor refactor for ConvertToDelta GitOrigin-RevId: 88e6e7215cfe15d6ed482175bd860d765bfc1331 commit 857a955a3fa2d1d32ead749ad4b5000a75f14fcc Author: Yuming Wang Date: Fri Sep 16 10:01:04 2022 -0700 Upgrade Scala to 2.12.15 Upgrade Scala to 2.12.15 to fix `java.lang.IllegalArgumentException: too many arguments` if running on JDK 17: ``` [info] Caused by: java.lang.reflect.InvocationTargetException [info] at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [info] at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) [info] at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [info] at java.base/java.lang.reflect.Method.invoke(Method.java:568) [info] at java.base/java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:278) [info] ... 
71 more [info] Caused by: java.lang.IllegalArgumentException: too many arguments [info] at java.base/java.lang.invoke.LambdaMetafactory.altMetafactory(LambdaMetafactory.java:511) [info] at scala.runtime.LambdaDeserializer$.makeCallSite$1(LambdaDeserializer.scala:105) [info] at scala.runtime.LambdaDeserializer$.deserializeLambda(LambdaDeserializer.scala:114) [info] at scala.runtime.LambdaDeserialize.deserializeLambda(LambdaDeserialize.java:38) [info] at org.apache.spark.sql.execution.LocalTableScanExec.$deserializeLambda$(LocalTableScanExec.scala) [info] ... 76 more ``` Existing test. No. Closes delta-io/delta#1380 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 89cfdae13bf5bc1d63fbdd8ae60538428f99e471 commit 5660484b6254ff7544411d6df38ed803c7b6eacc Author: Fredrik Klauss Date: Fri Sep 16 11:44:50 2022 +0200 Move getRandomPrefix to Utils GitOrigin-RevId: 93f4ee4eae545e419c87c4772b90efa8ef2ca180 commit 4a72fdd891169cf73e32f2a466fd19bc5023f515 Author: Ryan Johnson Date: Thu Sep 15 09:27:49 2022 -0700 import DeltaTestImplicits._ in tests that will need it soon As part of an ongoing cleanup to more cleanly separate prod and test code in Delta, `DeltaTestImplicits` was already created as a central place to hold test-only implicit method definitions. In preparation for moving more test-only code here, we first import the implicits into unit tests which we know will be affected by future code movement. That way, this change is trivial to review (extremely regular), and the subsequent cleanups will be vastly smaller and thus also easier to review. Test-only code, and also if it compiles it's correct. 
GitOrigin-RevId: 8573ba2fb8a5c86d84f6d89cb6ae8c79b530810c commit b4c656b6aa0637fd1590f4c0fcf9e73e7b1d3421 Author: Ryan Johnson Date: Wed Sep 14 21:16:16 2022 -0700 Minor refactor to Checksum.scala GitOrigin-RevId: 213602540e3380ad31ee0a4d8f0a7715dae74e64 commit df57fc60c43e01b37199114a62348c95ae1f3b13 Author: Ryan Johnson Date: Wed Sep 14 21:01:11 2022 -0700 Define object DeltaTestImplicits Define `object DeltaTestImplicits` (in unit test code base) and move the existing implicit class `DeltaTestUtils.OptimisticTxnTestHelper` there. The change improves test code readability by making it obvious where the implicits come from: `import DeltaTestImplicits._` vs. importing classes with vague names or even silently inheriting them from a parent trait. It also provides a central location where we can define additional test-only implicits for Delta code -- several more are coming soon as part of an ongoing effort to clean up certain aspects of the Delta code base. Test-only code; correct if unit tests continue to compile and pass. GitOrigin-RevId: e32928c2b6274259b8e4938d4e3c0165c9d567ee commit 2d29270aabd30aa145ce09852d1e237b24bc5cfc Author: Patrick Marx Date: Wed Sep 14 12:09:55 2022 -0700 Remove obsolete dependency on org.codehaus.jackson `org.codehaus.jackson.annotate.JsonRawValue` is a no-op today. Remove it and dependency on `org.codehaus.jackson:jackson-core-asl` can be simply dropped as well. Signed-off-by: Patrick Marx Closes delta-io/delta#1368 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 4966ed9b6b86dd2661e976455ab4ae330d016c68 commit e5a7cd0533c4d3a5e9c461a73b7ca04e8a680945 Author: Hussein Nagree Date: Wed Sep 14 11:26:09 2022 -0700 Catch errors when last_checkpoint write fails Catch exceptions that are thrown when writing the _last_checkpoint file. 
These are already handled for writing the checkpoints themselves, so extend the same logging/catching logic Added unit test GitOrigin-RevId: 95202c7c27c6c7875ae8e78a4261f7d236afe038 commit 1e8a1d2ae2d34d3a4f7d3d06785f77760c6529cb Author: lzlfred Date: Tue Sep 13 09:29:27 2022 -0700 Minor refactor to SnapshotManagement GitOrigin-RevId: 899ef2ad5924652954dd755962b0feef6257b2c0 commit 3720a734ca103197ee1c2ec8f7dc7edc627291e6 Author: Vitalii Li Date: Tue Sep 13 00:20:11 2022 +0000 Minor refactor to error message checks in Delete- and Update- SuiteBase Author: wuyi Author: Ivan Sadikov Author: Vitalii Li GitOrigin-RevId: ed717f5af67bd908c238456c347899396b8a96b3 commit 5258a366a6f2350a74d6ae9c82735f11c0e74db9 Author: lzlfred Date: Mon Sep 12 10:15:21 2022 -0700 Minor refactor to Checkpoints.scala GitOrigin-RevId: 24c5c4ef294692d20e5a56974a775e253ce76095 commit 2499f5408c63de39914a789cf8bb57137224fb3a Author: Jiaheng Tang Date: Mon Sep 12 03:34:21 2022 -0700 Fix wrong error message when delta table doesn't exist When user tries to query a Delta table whose data files are deleted, they will see `` is not a Delta table which is confusing. This patch fixes it. GitOrigin-RevId: 5cb10c842fbad9549e9848635f625455794c8486 commit 6a30e958de4322100b2ccfa13fa29ae155369a07 Author: Jiaheng Tang Date: Sat Sep 10 18:57:22 2022 -0700 Refactor twelfth set of 20 Delta error messages GitOrigin-RevId: de7634ca9e1dec12692df0657a7967e30dd8a7e3 commit 993870de9a28f0d16601f9e4f29a5e45d1e7202f Author: Anish Shrigondekar Date: Sat Sep 10 09:56:32 2022 -0700 Add shouldSkip field within IndexedFile if commitInfo metrics show no changes for no-op delta merge Add shouldSkip field within IndexedFile if commitInfo metrics show no changes for no-op delta merge for streaming queries within the Delta source. 
Added new test to verify no-op functionality works correctly within streaming query GitOrigin-RevId: 6ccbdbe7266b4a4af98c47b198a3d626cabd19f8 commit 3b315790054d9a4f224503c7278f13cf66890200 Author: Supun Nakandala Date: Fri Sep 9 23:47:18 2022 -0700 Improve plan canonicalization stability - It was observed that plan canonicalization is unstable across different JVM invocations. - The root cause for this issue is the use of `getClass.hashCode` in various places in the code. `getClass.hashCode` is not consistent across different JVM invocations. - As a fix we replace `getClass.hashCode` => `getClass.getCanonicalName.hashCode` which is consistent across different JVM invocations. GitOrigin-RevId: 599d30964cd88262ce3e45c2faf7a67eca501bc4 commit 1f5ca78ede796138ee48967693482678eb8047b1 Author: Anish Shrigondekar Date: Fri Sep 9 11:36:33 2022 -0700 Minor refactor to DeltaSinkSuite GitOrigin-RevId: a84a28addc0b39dea2ef15f8fbe71a49936cbbe5 commit 1d4fcae55b5636e90ad237b410367cea37109890 Author: CabbageCollector Date: Wed Sep 14 20:19:06 2022 +0200 Update CONTRIBUTING.md to fix a broken link (#443) commit 2dc8c561998dbb53c1c73fad3fac7f8ce129435e Author: kristoffSC Date: Mon Sep 12 21:36:06 2022 +0200 FlinkDeltaSink-MultipleJobsTests - Adding sink test for multiple jobs. 
(#440) Signed-off-by: Krzysztof Chmielewski commit fbfaae2afa9bc994670cf3ec315672fe6b655b8e Author: Scott Sandre Date: Fri Sep 9 11:10:07 2022 -0700 [424] Fix delta-flink connector classloader bug (#436) * Use a per-snapshot ForkJoinPool; not the static commonPool() * Check for leaked classes by default in flink tests * Unrelated: fix failing local time travel tests * change java.concurrent to scala.concurrent * unrelated: fix unused import warning * fix scala 2.11 and 2.12/2.13 ForkJoinPool incompatibility * Update SnapshotImpl.scala * update comment commit d2785aa93e8f2f5c70a3351ddc4995e2c943897b Author: Pranav Date: Thu Sep 8 15:21:27 2022 -0400 Minor refactor to DeltaSourceOffset GitOrigin-RevId: 0ec17c83de675a1f434f9f92bba9d1914f9189b5 commit d7df41c49e4ea4907822812e0cdb63f426285ea8 Author: Ronald Zhang Date: Wed Sep 7 23:00:38 2022 -0700 Specify source table path in IgnoreChanges/IgnoreDeletes error message Previously, the ignoreChanges/ignoreDeletes error messages only included metadata about the version of the source table with the non-append commit. This change adds the tablePath to the error message to specify which source table has the issue. 
Updated unit tests GitOrigin-RevId: ccf3faca0647e5dbed0dea6f999ad94d67a4faa6 commit 0bdfe8f0ba4a7ea5375af02b3b4dd406f9095e84 Author: Scott Sandre Date: Wed Sep 7 13:22:19 2022 -0700 Minor update to run-tests.py GitOrigin-RevId: 64364accedb00419f2136b866cefe565da4a76b7 commit a62c30f859e254a991766e602c7c885ac3e96199 Author: Ganesh Chand Date: Wed Sep 7 11:24:57 2022 -0700 DeltaLog checks for required Spark configs and throws an error if not found ## Description Resolves https://github.com/delta-io/delta/issues/1144 - Added a new test in `DeltaLogSuite`: ``` scala test("DeltaLog should not throw exception if SparkSession in initialized with " + ".withExtension api and spark.sql.catalog.spark_catalog conf is set") test("DeltaLog should not throw exception if SparkSession in initialized with " + ".withExtension api and spark.sql.catalog.spark_catalog conf is set") ``` - Refactored other unit tests ## Does this PR introduce _any_ user-facing changes? Yes. If `spark.sql.catalog.spark_catalog` configuration is not found in the `SparkSession`, it now throws the following user friendly error. ``` This Delta operation requires the SparkSession to be configured with the DeltaSparkSessionExtension and the DeltaCatalog. Please set the necessary configurations when creating the SparkSession as shown below. SparkSession.builder() .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") ... 
.getOrCreate() If you are using spark-shell/pyspark/spark-submit, you can add the required configurations to the command as show below: --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog ``` Signed-off-by: Ganesh Chand Closes delta-io/delta#1326 Signed-off-by: Shixiong Zhu GitOrigin-RevId: b13f73270620bb3dbf7fba435d7545620e9a1093 commit 2eab44f20b2a01e392457ca8810d6c3b48e525b5 Author: Ming DAI Date: Wed Sep 7 10:18:02 2022 -0700 Minor refactor to DeltaAnalysis GitOrigin-RevId: f867dd60b7533855febca1994407c9f980ee494a commit 1624ebc405a8678e1003ac4188e6e37f797ce2ee Author: Shixiong Zhu Date: Tue Sep 6 17:03:35 2022 -0700 Fix 'since' for new DeltaTable.forPath APIs ## Description The new `DeltaTable.forPath` APIs missed the `2.1.0` release. Changed `since` to `2.2.0`. Closes delta-io/delta#1369 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 55f29520ddfe4a550ee58d4bb07be0e3c7dc4850 commit 2620d3c313cbf13887e8e10f039b6cac29a8fb6e Author: Prakhar Jain Date: Tue Sep 6 11:07:11 2022 -0700 Implement logStore read via readAsIterator This PR implements `LogStore.read(fileStatus, hadoopConf)` via `LogStore.readAsIterator(fileStatus, hadoopConf)`. Existing UTs GitOrigin-RevId: 415adb5b0e2a49e68754d2bd1945a3755d645a38 commit 7e876792efdd92a85aa3f7b81d81f34c8b276d7b Author: Lars Kroll Date: Mon Sep 5 14:45:24 2022 +0200 Prevent Protocol Downgrades during RESTORE in Delta Until now, RESTORE TABLE could downgrade the protocol version of the table. This is unsafe, however, as it makes time travel assume an invalid protocol version, which can lead to corrupted reads. - This changes the default behaviour to never downgrade, only upgrade the protocol version during RESTORE TABLE. - The old behaviour can be regained with a newly introduced flag, which comes with a stern warning to always wipe the table history afterwards to prevent time travel to illegal versions.
- Added test cases for the protocol downgrade with flag on/off. GitOrigin-RevId: 8cc554f5ead17eb9f80fbdcf142a229026a0aade commit 8c3d5f88f427094fdfcee66dc172d9cc131bdfeb Author: Allison Portis Date: Fri Sep 2 15:44:37 2022 -0700 Upgrade version in master after 2.1 release Closes delta-io/delta#1362 Signed-off-by: Allison Portis GitOrigin-RevId: 36d348bd888787da895dcab3034b8475ec7eb2a9 commit 2f36c1aa2c08393087e47cd10d5b5ef94c31126a Author: Hussein Nagree Date: Thu Sep 1 18:24:27 2022 -0700 Minor update to SnapshotManagement GitOrigin-RevId: a9f3e54450681d873346831a7fb53d5365e46f75 commit f3fdfce560b8c522c2f5831a0fd0d675a95afd21 Author: zml1206 Date: Wed Sep 7 23:00:10 2022 -0700 Fix deltalog checkpoint/json not found When a table is written and a new checkpoint is generated, the old checkpoint is cleared due to expiration. If the executor is restarted for some reason at this time, an error will be reported when reading the table next. `[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 38.0 failed 1 times, most recent failure: Lost task 0.0 in stage 38.0 (TID 34) (192.168.130.11 executor driver): java.io.FileNotFoundException: [info] File file:/private/var/folders/gc/c__qhntd7s502txfp0ltxh880000gn/T/spark-9a479158-0bf7-4e82-afbd-cef470f54fa6/_delta_log/00000000000000000001.checkpoint.parquet does not exist [info] [info] It is possible the underlying files have been updated. You can explicitly invalidate [info] the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by [info] recreating the Dataset/DataFrame involved. 
[info] [info] at org.apache.spark.sql.errors.QueryExecutionErrors$.readCurrentFileNotFoundError(QueryExecutionErrors.scala:506) [info] at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:119) [info] at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:164) [info] at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) [info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) [info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) [info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) [info] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491) [info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) [info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) [info] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) [info] at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) [info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) [info] at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140) [info] at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) [info] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) [info] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) [info] at org.apache.spark.scheduler.Task.run(Task.scala:131) [info] at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) [info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) [info] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [info] at java.lang.Thread.run(Thread.java:750) [info] [info] Driver stacktrace: [info] at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2403) [info] at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2352) [info] at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2351) [info] at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) [info] at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) [info] at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2351) [info] at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1109) [info] at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1109) [info] at scala.Option.foreach(Option.scala:407) [info] at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1109) [info] at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2591) [info] at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2533) [info] at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2522) [info] at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) [info] Cause: java.io.FileNotFoundException: File file:/private/var/folders/gc/c__qhntd7s502txfp0ltxh880000gn/T/spark-9a479158-0bf7-4e82-afbd-cef470f54fa6/_delta_log/00000000000000000001.checkpoint.parquet does not exist ` The solution is adjust snapshot logSegment comparison for added the version comparison of checkpoint, when the log version is the same, the 
checkpoint version is different, need to create a new snapshot. Unit test. No Closes delta-io/delta#1351 Signed-off-by: Allison Portis GitOrigin-RevId: 3d483d2f818a915eb96b4be62009e153e37ccd89 commit 2e068af8d0bf5ae8ecbe36e2e8f3937966d5b2b9 Author: Jackie Zhang Date: Thu Sep 1 14:28:59 2022 -0700 Minor refactor to DeltaParquetFileFormat GitOrigin-RevId: 2408147cf1a0e8b6c44a4982f96a7acd1cefe383 commit f4df3538db904b68ac162ec2a369bd79b40897f5 Author: Thang Long Vu Date: Thu Sep 1 01:29:44 2022 +0200 Minor refactor to DataSkippingReader GitOrigin-RevId: 507d815a38986f1c6e4e843a300355bb98977ecc commit e3cf764235f325764859521a8eb2a815bfd1d13b Author: Hussein Nagree Date: Tue Aug 30 20:32:42 2022 -0700 Clean up some more unsafe snapshots GitOrigin-RevId: 9aa2e83cfbca6d3d423742500b5f0a30ee3238c1 commit ddc36911c20a43994accbc71e1120730e09a8c2f Author: sherlockbeard Date: Tue Aug 30 11:02:54 2022 -0700 Support multiple `where` calls in DeltaTable.optimize ## Description Resolves #1338 Tested with the test suite and by checking the metrics of the optimize operation in the history table. One test was added that replicates the previous one (the test above it) with multiple `where` statements. ## Does this PR introduce _any_ user-facing changes? Closes delta-io/delta#1353 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 17a4ed28cb5fa28a11f12426b8c869799b4d38a8 commit 89384632efde7cd87e8a273e22d250461d3ed02d Author: Jackie Zhang Date: Tue Aug 30 10:51:57 2022 -0700 Enable streaming / CDF streaming reads on column mapping enabled tables with fewer limitations Resolves delta-io/delta#1357 Because streaming uses the latest schema to read historical data batches, and column mapping schema changes (e.g. rename/drop column) can cause the latest schema to diverge, we previously decided to temporarily block streaming reads on column mapping tables entirely. As a close follow-up, this PR enables at least the following use cases: 1. Read from a column mapping table without rename or drop column operations. 2.
Upgrade to column mapping tables. 3. Existing compatible schema change operations such as ADD COLUMN. Resolves https://github.com/delta-io/delta/pull/1358 new unit tests. GitOrigin-RevId: f40a063dde0329da0750105186cc6711a7b0ea02 commit 0bedbbbb00c1337ad7968d36664ad7b042a9e7fb Author: Sabir Akhadov Date: Tue Aug 30 12:14:55 2022 +0200 Minor refactor to UpdateSuiteBase GitOrigin-RevId: 0075439fc32d805a14f5c518b52626e3caa5f5d7 commit 24ef5df4c90b5c57f72d2599144e3837a1562b08 Author: Scott Sandre Date: Mon Aug 29 16:00:10 2022 -0700 Minor refactor to the `updateTableProtocol` API implementation in DeltaTable. GitOrigin-RevId: c039f22723b4a37e19d4863bdfee6df750d6efe0 commit 526e9012b7fa9ff753d119003146f3087657dd26 Author: Xinyi Date: Sun Aug 28 22:00:20 2022 -0700 Minor refactor to alterDeltaTableCommands GitOrigin-RevId: c9f3cafd8c56e9c3250c09e2c245b5a20d93249c commit 36fabfd374c9834f78d695791347b53fb4a9086d Author: Lars Kroll Date: Sat Aug 27 16:30:49 2022 +0200 Minor refactor to RestoreTableSuiteBase GitOrigin-RevId: 6977f1509d58a5402203eaecb74d39a1e687e38e commit 3e8d2d164ca53accd3ee8addf432587626586924 Author: Ming DAI Date: Fri Aug 26 16:45:10 2022 -0700 Support partition directory starting with underscore and dot This PR enables "CONVERT TO DELTA" to support partition column name starting with underscore and dot for completeness. 
A unit test is added to reproduce the gap, which is fixed by the code change in this PR. GitOrigin-RevId: 168a84d946ddb2d3575070fb3152b0e492946b0b commit ce466a7f3d0f4439e5f20a69d5d3b32f72a72106 Author: Ole Sasse Date: Thu Aug 25 13:52:11 2022 +0200 Minor refactor to DeltaTable.scala GitOrigin-RevId: 4e4a8cfe1b17651c03dc9958bf9a4924d4197359 commit bc1b4a05fc1a408e563e4e22b8b0c7bf4afa2742 Author: Allison Portis Date: Thu Sep 1 20:14:38 2022 -0700 upgrade delta-storage version to newest release (#431) commit 2041c3b7138b9614529540613195d4afd39fde71 Author: Jackie Zhang Date: Wed Aug 24 11:10:15 2022 -0700 Enable batch CDF queries on column mapping enabled tables with fewer limitations Resolves delta-io/delta#1349 Due to the unclear behavior of streaming/CDC reads on column mapping tables, we decided to temporarily block batch CDC reads when the user has performed rename or drop column operations after enabling column mapping. Note that the following are not blocked: 1. CDC Read from a column mapping table without rename or drop column operations. 2. Upgrade to column mapping tables. 3. Existing compatible schema change operations such as ADD COLUMN. New unit tests.
Resolves delta-io/delta#1350 GitOrigin-RevId: 9b83b570623f42ecd614990f8caf4b6febbbc724 commit d9c308efb9122c854300c8519a194caeb30fdab8 Author: Fabian Paul Date: Wed Aug 24 16:24:01 2022 +0200 Minor refactor to MergeCDCSuite GitOrigin-RevId: 7517e05157a12b3d1e67e9dd194cce024fef48b6 commit a283d834c2384d28c0832aee73a702a9e0173082 Author: Wenchen Fan Date: Wed Aug 24 08:25:44 2022 +0000 Minor refactor to DeltaThrowableSuite GitOrigin-RevId: f681e98e5226f2d62abab1e78614dbb0001d058e commit d93aacbcd667265bfdd876bdd23af9d9e078174f Author: Wenchen Fan Date: Wed Aug 24 00:50:28 2022 +0000 Minor refactor to DeltaTableBuilder GitOrigin-RevId: 28e8c78b172b94c4b6f0b920fb1dcb5a0e10282a commit 7a040c74e838ddc01cc30aa984f8e6c3e7c1109c Author: Allison Portis Date: Tue Aug 23 12:52:39 2022 -1000 Misc integration test updates ## Description - Uses SQL for time travel in QuickstartSQL example since support has been added for Delta 2.1 - Updates table_exists.py integration test in response to 11fb2eadf3ead06fa1bcb049e7dcc925e21166e1. That change was included in 2.0 and failing for me with 2.0 n/a Closes delta-io/delta#1340 Signed-off-by: Allison Portis GitOrigin-RevId: d2b00b98606e8b4fa7918bb72b3dbbdb113de77b commit 4fa3e4bdb6d6384594eea56bc911d2bf6dfe8979 Author: Shixiong Zhu Date: Tue Aug 23 13:44:55 2022 -0700 Block interval types in Delta ## Description Spark 3.3 will start to allow users using interval types in a data source. However, Delta is not ready to support interval types. Hence, this PR just disables for now. Existing tests ## Does this PR introduce _any_ user-facing changes? 
No Closes https://github.com/delta-io/delta/pull/1352 GitOrigin-RevId: 875a50e6425c145fe7301115beacde63676fd415 commit 0d7d10c7f78c37e05f01b65325388474cf6c1a23 Author: Allison Portis Date: Tue Aug 23 08:45:52 2022 -1000 Change DeltaTable.details() to DeltaTable.detail() to be consistent with SQL ## Description Change DeltaTable.details() to DeltaTable.detail() to be consistent with SQL command `DESCRIBE DETAIL` Also adds the @evolving tag. Updates corresponding tests. Closes delta-io/delta#1343 Signed-off-by: Allison Portis GitOrigin-RevId: dabee34a77aa880b8e81158afe6b9ef04eb2d16d commit a03c7b9d9ead250d2a4c83b757b7b4ab916d2ba6 Author: Ming DAI Date: Tue Aug 23 10:55:01 2022 -0700 Refactor DeltaEncoders and ConvertToDeltaCommand GitOrigin-RevId: b094f7bc2965b08f1936d18bb82ac2ccdf8d0730 commit 1f9de86378806047a5e663311bdfdf57b6de3368 Author: Adam Binford Date: Tue Aug 23 06:12:44 2022 -0400 Trigger available now ## Description Resolves https://github.com/delta-io/delta/issues/1318 Looks like the main functionality was added a while ago, just waiting for the upgrade to Spark 3.3. Just added the finishing touch. New UTs. ## Does this PR introduce _any_ user-facing changes? Add new trigger options for streaming reads Closes delta-io/delta#1319 Signed-off-by: Tathagata Das GitOrigin-RevId: a737a2baa0655d2228acaffd641ffa46b150bbd6 commit 627c7d5768def8e5ec6775f6a0110bee35f79d2d Author: Hedi Bejaoui Date: Tue Aug 23 01:39:17 2022 -0400 Add sbt test coverage and integration with Codecov ## Description Resolves #1123 What was done: - Added sbt-scoverage plugin and enabled coverage. - Modified `run-tests.py` to generate an aggregate report of all Scala-based subprojects (core, contribs, storage, and storageS3DynamoDB). This functionality is disabled by default and can be enabled using the `--coverage` flag. Todos: - [ ] Decide on the [minimum coverage](https://github.com/scoverage/sbt-scoverage#minimum-coverage) needed for the project. 
Ran `sbt test coverageAggregate` to generate the following coverage report: Screen Shot 2022-08-01 at 4 06 43 PM ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1312 Signed-off-by: Tathagata Das GitOrigin-RevId: dfe73529d6fed057aed0a5983ab1c7e09a7d20c0 commit 5d22a38d9d206e2909f184677f82d48ce19729cf Author: Scott Sandre Date: Mon Aug 22 20:36:15 2022 -0700 Fix merge command `numTargetFilesAdded` metric ## Description This PR fixes a bug in `MergeIntoCommand` where the metric `numTargetFilesAdded` sometimes gave an unexpected value. This PR ensures that `MergeIntoCommand` only writes out new files to the target table that are non-empty (i.e. at least 1 row). Note: the value was never wrong, for example it would say that we wrote out 1 file, and we did in fact write out 1 empty file. However, there was no logical reason for us to write out an empty file with no rows. This PR also updates existing tests (which knew about this bug and so were ignored) inside of `MergeIntoMetricsBase`. ``` build/sbt 'core/testOnly *DescribeDeltaHistorySuite -- -z "merge-metrics"' ``` Closes delta-io/delta#1334 Signed-off-by: Scott Sandre GitOrigin-RevId: 1c04cff75461ec1d2987653ad1a82dcbcf5926c1 commit b71ad65d045d9593d3ac49c721dd0ff42e683fa1 Author: Carlos Peña Date: Mon Aug 22 09:52:18 2022 -0700 Make update command return the number of updated rows. ## Description This PR makes the `UPDATE` command return the number of updated rows. The update command returns the following output: ![image](https://user-images.githubusercontent.com/6467558/184264601-89506bf7-d816-4992-996a-2ade4c9e38a8.png) Modified existing tests. 
Closes delta-io/delta#1331 Signed-off-by: Scott Sandre GitOrigin-RevId: a00cdd1a8a4e10dae5afece4ec5414d9bc9fde89 GitOrigin-RevId: 160dce9de1a1e34a29d47c6e21c9117f151b9a30 GitOrigin-RevId: 71f954cfcaf4d3a14e0a0693525a577bd0898d3f commit 6428aa19c1be5355160d15e06dd81ce9e2742321 Author: kristoffSC Date: Wed Aug 24 16:31:43 2022 +0200 Delta -Flink connector - Flink version updated to 1.14 (#427) Signed-off-by: Krzysztof Chmielewski Co-authored-by: Krzysztof Chmielewski commit d9b766b1932bcf64fd18c7757eb4d2c790546e0b Author: Son <14060682+sonhmai@users.noreply.github.com> Date: Tue Aug 23 03:20:16 2022 +0700 Cache the partition pruning result in FilteredDeltaScanImpl to improve performance (#419) * Cache the partition pruning result in FilteredDeltaScanImpl to improve performance Signed-off-by: sonhmai <14060682+sonhmai@users.noreply.github.com> * Add feature flag for partitionFilterRecordCaching * Remove lazy evaluationResults for partition filter record caching * Add micro-benchmarking for partition filter record caching * Improve micro-benchmarking for partition filter record caching * Fix scalastyle * Fix scalastyle Signed-off-by: sonhmai <14060682+sonhmai@users.noreply.github.com> commit edaeb86304211513c8028d056a7d90e98ec2839c Author: Shixiong Zhu Date: Mon Aug 22 12:47:02 2022 +0000 Minor refactor GitOrigin-RevId: 95e11b9e84436540d6a78b134df80d03115933bd commit 3e66810579c60dd78eb84f275fb6d0f061e953e7 Author: Jiaheng Tang Date: Sun Aug 21 18:46:26 2022 -0700 Refactor eleventh set of 20 Delta error messages GitOrigin-RevId: 2fd735174f57079f3a8082ed6044437c58efb557 commit cfb5ce9bff42ff002363795ea5b3af6e2b406278 Author: Prakhar Jain Date: Fri Aug 19 12:21:39 2022 -0700 Add new read APIs in Scala LogStore to deal with FileStatus instead of Path This PR adds new read APIs to the Scala LogStore interface. The new APIs take a FileStatus instead of a Path and return an Iterator over the rows in the file.
Existing UTs GitOrigin-RevId: 9feef450f49eeb207b8f497441353885cf6c8523 commit 0979d188a21613c1b0a1434290b1d586743739b0 Author: Hussein Nagree Date: Fri Aug 19 11:43:55 2022 -0400 Add postCommitHooks for checkpoint Move checkpoint writing after a commit to PostCommitHooks. Refactor, no new functionality. GitOrigin-RevId: 7bdf3435b2fca0d2aefc6ccce08c2e9a936c6cbb commit f0d65edc6e7737c642e93e20cbe44f1ae97ad500 Author: Wenchen Fan Date: Fri Aug 19 22:39:05 2022 +0800 Minor refactor DeltaInvariantCheckerExec GitOrigin-RevId: c0fd6123cc63391f701f92175b2032ff7f5487cc commit 0abadc78d2b457d5ef4236095e077ec62a886017 Author: lzlfred Date: Thu Aug 18 18:24:17 2022 -0700 clean up and combine withDmqTag with FrameProfiler This removes indent caused by nested thunks. No test. do not expect functional change. GitOrigin-RevId: 57cda67c458c0ed3e67797325596817de608b51b commit b4f0975f25bb78fdbf498d05aaab118e096bab22 Author: sherlockbeard Date: Thu Aug 18 09:59:49 2022 -0700 Make MERGE operation return useful metrics Resolves #1322. This PR makes the `MERGE` operation return useful metrics, described below. 
Console output after the change:
```
+-----------------+----------------+----------------+-----------------+
|num_affected_rows|num_updated_rows|num_deleted_rows|num_inserted_rows|
+-----------------+----------------+----------------+-----------------+
|                2|               1|               0|                1|
+-----------------+----------------+----------------+-----------------+
```
Tested with the SQL test suite. Closes delta-io/delta#1327 Signed-off-by: Scott Sandre GitOrigin-RevId: 2589deb47603b43505ddea8f34c971f1be2e800e commit 12e611459d99ecd9c922a2e4a8a9fba41a0a5326 Author: lzlfred Date: Wed Aug 17 14:32:13 2022 -0700 Minor refactor GitOrigin-RevId: c0d7410635f32e2703471faea1b500c0dcc4af5d commit 9f0e4bb71835275a38a577b06bdc0454be6a551f Author: Kam Cheung Ting Date: Wed Aug 17 13:29:06 2022 -0700 Minor refactor test codes GitOrigin-RevId: 70ad13701906932ae95146a6ac0aed0e181c5e52 commit e0d001e57e99560513c497853dcd2ccb961eeaf4 Author: Prakhar Jain Date: Wed Aug 17 09:37:39 2022 -0700 Rename deltaLog.lastCheckpoint API to readLastCheckpointFile This PR renames the deltaLog.`lastCheckpoint` API to `readLastCheckpointFile()`. This API reads the LAST_CHECKPOINT file from the object store and has overhead. The older name doesn't convey this clearly, which invites bugs where callers don't capture the result in a temporary variable before reusing it.
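The pitfall behind the rename can be illustrated with a small, self-contained Python sketch. `FakeDeltaLog`, its `reads` counter and the version numbers are hypothetical stand-ins, not Delta classes or data; the only point is that each call repeats the storage read and that two calls may observe different values, so the result is worth capturing once.

```python
class FakeDeltaLog:
    """Hypothetical stand-in for a log whose checkpoint pointer lives in object storage."""

    def __init__(self):
        self.reads = 0        # number of simulated storage round trips
        self._version = 10    # simulated contents of the LAST_CHECKPOINT file

    def read_last_checkpoint_file(self):
        # Every call pays for a (simulated) read and may see a newer value.
        self.reads += 1
        version = self._version
        self._version += 10   # pretend a concurrent writer checkpointed again
        return version


log = FakeDeltaLog()

# Risky: two calls, two reads, possibly two different answers.
first = log.read_last_checkpoint_file()
second = log.read_last_checkpoint_file()

# Safer: read once, then reuse the captured value.
captured = log.read_last_checkpoint_file()
start, end = captured, captured
```

A method name that starts with `read` makes the cost visible at the call site, which is the motivation the commit message gives for the rename.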
Existing UTs GitOrigin-RevId: 66b83975271558ade95be6da48bd9258cc2f6709 commit 861f19216c668a640ab1d6261af76d1730198ea0 Author: Josh Rosen Date: Wed Aug 17 06:07:34 2022 +0000 Use reflection to make MergeIntoAccumulatorSuite work in coming Spark 3.4 Author: Josh Rosen GitOrigin-RevId: 1b301d99ef910da2c5a09276980bd1df396730d4 commit f768a0f839e5fd5a3dc9c41bdc881ee08dbf299a Author: Joe Harris Date: Tue Aug 16 22:48:34 2022 -0400 Minor update to benchmarks.py GitOrigin-RevId: 1335f6f32c79ba40ce01e7052522f23d9580b3c6 commit 943e1531a8cd27fa58d3c82be1abea92a3b96e36 Author: Zach Schuermann Date: Tue Aug 16 19:43:17 2022 -0700 Fix PreprocessTableMerge to include new columns from WHEN MATCHED clauses Delta's `MERGE INTO` fails when there are multiple `UPDATE` clauses and one is `UPDATE SET *` with schema evolution. Specifically, if `UPDATE SET *` is used to merge a source with a superset of target columns and an additional `UPDATE SET` clause is present (which must operate on a target column), then the merge will fail due to inability to resolve some source-only columns. The example below fails: ```sql SET spark.databricks.delta.schema.autoMerge.enabled=true; -- tgt1 has columns: [a] -- s has columns: [a, b] WITH s(a, b) AS (SELECT * FROM VALUES (1, 's_b')) MERGE INTO zvs.tgt1 t USING s ON t.a = s.a WHEN MATCHED AND t.a < 1 THEN UPDATE SET t.a = 0 WHEN MATCHED THEN UPDATE SET * -- output: -- Error in SQL statement: AnalysisException: Resolved attribute(s) b#247210 missing from ... ``` This case seems to have been missed when implementing `processMatched` in `PreprocessTableMerge`. Specifically, that other `WHEN MATCHED` clauses can introduce new columns that must be filled in with ‘null’. Currently, only `WHEN NOT MATCHED` are considered. 
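The missing case can be sketched in plain Python. This is an illustration of the idea only, not the code in `PreprocessTableMerge`; `new_columns` and `pad_update` are hypothetical helper names, and assignments are modeled as simple dictionaries from column name to expression string.

```python
def new_columns(target_cols, clauses):
    """Columns assigned by any clause that the target lacks (first-seen order)."""
    seen = set(target_cols)
    added = []
    for assignments in clauses:
        for col in assignments:
            if col not in seen:
                seen.add(col)
                added.append(col)
    return added


def pad_update(assignments, final_schema, target_cols):
    """Expand one clause's assignments so they cover the whole final schema."""
    padded = {}
    for col in final_schema:
        if col in assignments:
            padded[col] = assignments[col]
        elif col in target_cols:
            padded[col] = f"t.{col}"   # keep the existing target value
        else:
            padded[col] = None         # column new to the target: fill with null
    return padded


target_cols = ["a"]
matched = [{"a": "0"}, {"a": "s.a", "b": "s.b"}]   # UPDATE SET t.a = 0; UPDATE SET *
not_matched = []                                    # no WHEN NOT MATCHED clauses

# Looking only at not_matched finds no new columns; including matched finds `b`.
final_schema = target_cols + new_columns(target_cols, matched + not_matched)
clause1 = pad_update(matched[0], final_schema, target_cols)
```

With the matched clauses included, the first clause is padded with a null for `b` instead of passing an unresolved source attribute through.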
This is best shown with the code flow for the SQL example above: - `processMatched` is mapped over (clause1 [SET t.a=0], clause2 [SET *]) - resolvedActions: - clause1 resolvedActions are `[a=0]` - clause2 resolvedActions are `[a=a, b=b]` => causes schema evolution: the target now has schema `[a, b]`. Now consider only clause1, which is where the failure occurs; clause2 matters only in that it triggers schema evolution so that finalSchema is `[a, b]`. Since there are no `WHEN NOT MATCHED` clauses, there are no `newColumns`. Then, the only `UpdateOperation` used in `generateUpdateExpressions` is: `[a=0]`. This means that `generateUpdateExpressions` is called with `targetCols` `[a, b]` and only `updateOp` `[a=0]`. Column `b` (not present in the target) is passed through and an unresolvable attribute ends up in our final plan. The fix is to simply consider new columns from other `WHEN MATCHED` clauses as well as `WHEN NOT MATCHED`. New unit test validating correct behavior with multiple UPDATE clauses where one is `UPDATE SET *`. GitOrigin-RevId: f2a849c1fc5589a26512e0a7f1cc5adc8e8eb7f1 commit 7b3669109d3a11ab2caec5487dc00401beb44bc0 Author: Lars Kroll Date: Tue Aug 16 20:55:33 2022 +0200 Minor refactor GenerateSymlinkManifest GitOrigin-RevId: cb46b88509c176a338a9326b741916e456db2826 commit df5a7c2b52b3fc58886a9057cfa2c2e98b3cca22 Author: Prakhar Jain Date: Tue Aug 16 11:08:05 2022 -0700 Undo changes around making LogSegment Json serializable Existing UTs GitOrigin-RevId: ef9e2646e74862a8143047296e97a98416818af0 commit ee3917fc010c090294b0ade56c4524f9e6efafc4 Author: jintao shen Date: Tue Aug 16 10:30:35 2022 -0700 Support passing Hadoop configurations via DeltaTable This PR makes DeltaTable support reading Hadoop configuration.
It adds a new public API to the DeltaTable in both Scala and Python: ``` def forPath( sparkSession: SparkSession, path: String, hadoopConf: scala.collection.Map[String, String]) ``` Along with the API change, it adds the necessary change to make operations on `DeltaTable` work: ``` def as() def alias() def toDF() def optimize() def upgradeTableProtocol() def vacuum(...) def history() def generate(...) def update(...) def updateExpr(...) def delete(...) def merge(...) def clone(...) def cloneAtVersion(...) def restoreToVersion(...) ``` With the change in this PR, the above functions work and are verified in a new unit test. Some commands such as Merge/Vacuum/restoreToVersion etc don't pick up the Hadoop configurations even though they are passed to DeltaTableV2 through new forPath(..., options) API. Note that the unit test is written first by verifying that it fails without the change and passes with the change. New unit tests. AFFECTED VERSIONS: Delta 2.2 PROBLEM DESCRIPTION: Similar to DataFrame, DeltaTable API in both Scala and Python supports custom Hadoop file system options to access underlying storage system. Example: ``` val myCredential = Map( "fs.azure.account.auth.type" -> "OAuth", "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider", "fs.azure.account.oauth2.client.id" -> "...", "fs.azure.account.oauth2.client.secret" -> "...", "fs.azure.account.oauth2.client.endpoint" -> "..." ) val deltaTable = DeltaTable.forPath(spark, "/path/to/table", myCredential) ``` Before this PR, there is no way to pass these Hadoop configurations through DeltaTable. DeltaTable will only pick up ones starting with `fs.` or `dfs.` to create Hadoop Configuration object to access the storage the same way as DataFrame options for Delta. We avoid picking up other options because: - We don't want unrelated options to be passed into Delta underlying constructs such as DeltaLog. 
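The prefix rule described above can be sketched in plain Python. This only illustrates the filtering idea and is not Delta's code; `hadoop_options` is a hypothetical helper name, and `mergeSchema` stands in for any unrelated option.

```python
HADOOP_PREFIXES = ("fs.", "dfs.")


def hadoop_options(options):
    """Keep only the entries whose key starts with `fs.` or `dfs.`."""
    return {k: v for k, v in options.items() if k.startswith(HADOOP_PREFIXES)}


options = {
    "fs.azure.account.auth.type": "OAuth",
    "dfs.client.use.datanode.hostname": "true",
    "mergeSchema": "true",   # unrelated option: dropped by the filter
}
picked = hadoop_options(options)
```

Filtering by key prefix keeps storage credentials flowing to the file system layer while other options stay out of the constructs that do not expect them.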
GitOrigin-RevId: 89cfb1a3465d30081a14f74ae6aa80a4c48f9e56 commit 8e7e109c3623ce146dced6427aa8844e8fce1414 Author: lzlfred Date: Tue Aug 16 08:54:26 2022 -0700 add FrameProfiler instrumentation for state reconstruction Add instrumentation to cover missing areas. GitOrigin-RevId: 4c301ead226125cba604cd9a88f538c4a6e92298 commit 3c14fe26c34c9ea25348b6ee1c3e9fc16e796b6e Author: Hussein Nagree Date: Tue Aug 16 11:53:26 2022 -0400 Reduce logError to logWarning for metadata id changes Just a logging change, no tests GitOrigin-RevId: 0f4efd15611079a8b0fcc8a75cda570146e878e3 commit 699a9851992dd2c6f1cde6779a679d4c9f801d30 Author: Shixiong Zhu Date: Mon Aug 15 01:07:04 2022 -0700 Eliminate ScalaReflection calls from Delta after warming up Spark's `ScalaReflection` is a performance killer when the concurrency is high. This PR makes the following changes to eliminate unnecessary ScalaReflection calls from Delta after warming up: - `typedlit` calls [Literal.create](https://github.com/apache/spark/blob/144d4c546f7023b20e07619134feca1a46017a5f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala#L167) which will touch `ScalaReflection`. This PR replaces `typedlit` with `lit` or `new Column(Literal.apply(v, ))`. Most of the changes in this PR are caused by changing `val CDC_TYPE_NOT_CDC: String = null` to `val CDC_TYPE_NOT_CDC: Literal = Literal(null, StringType)`. - A new style check is added to block `typedlit`. - Add more templates to `DeltaUDF` and use these templates to create `udf`s in Delta to avoid touching `ScalaReflection`. - `count(*)` will touch `as(ExpressionEncoder[Long]())` in order to return a `TypedColumn`. Replace it with `count(new Column("*"))` in Delta so that we can avoid touching Scala reflection code. - A new style check is added to block `count(string)`. - Add a new `DeltaEncoder` class to simplify the code pattern that uses `ExpressionEncoder`.
We introduce `DeltaEncoders` to cache all reusable `Encoder`s and mix it into `com.databricks.sql.transaction.tahoe.implicits._`. With this change, we can replace `import spark.implicits._` (always create new `Encoder`s) with `com.databricks.sql.transaction.tahoe.implicits._` (reuse the shared `Encoder`s) to minimize the code touching `ScalaReflection` after warming up. - A scala style check is added to block `spark.implicits._`. New tests + existing tests. No. Closes delta-io/delta#1327 GitOrigin-RevId: ee7a129fef821a83da8e16fa51c92dd3a6acc04c commit 0ce5e5f19eef183bcce65c7f78c3e5493e77a39b Author: Rajesh Parangi <89877744+rajeshparangi@users.noreply.github.com> Date: Sat Aug 13 22:55:58 2022 -0700 Minor refactor UpdateMetricsSuite GitOrigin-RevId: c2a11ea3798fcca3cf88a7e8753cacfef45b6731 commit c80ab9a0c0b1b94ae111b91f3bcfcfa60e6da3e1 Author: Tom van Bussel Date: Sat Aug 13 10:19:13 2022 +0200 Improve code style for insertionTime GitOrigin-RevId: fe5e8acb5ca8654b575cfbbecf21aae0c597a998 commit b433041fd786bb0e27486bd5174122edd82c97d3 Author: Prakhar Jain Date: Fri Aug 12 19:30:20 2022 -0700 Refactor SnapshotManagement GitOrigin-RevId: 317373b295141ad784639840c8fb18a90ade2fae commit 527e033f47d0a7a98785ce45f743daf463364070 Author: Hussein Nagree Date: Fri Aug 12 09:42:46 2022 -0700 Refactor doCommit and updateAfterCommit Improve the doCommit method to use updateAfterCommit instead of update. This will have a few incremental improvements like better logging, avoiding reading of non-existent checksums, etc. Just a refactor of existing methods, so existing unit tests should cover all changes. 
GitOrigin-RevId: 6b14b992905971ed31c2df5656ee115e9ddecff4 commit 4eeae3cdcca22718c5b949b48bb4c42dbeab66d2 Author: Scott Sandre Date: Thu Aug 11 12:05:23 2022 -0700 Support AS OF SQL syntax for time travel queries on Delta Table Support following SQL queries on Delta tables to allow reading from a specific version ```sql SELECT * FROM default.people10m VERSION AS OF 0; SELECT * FROM default.people10m TIMESTAMP AS OF '2019-01-29 00:37:58'; ``` Unit tests. Closes https://github.com/delta-io/delta/pull/1288 GitOrigin-RevId: b59e2d7ff7416235f00f03725b00199472760b9e commit 31a57ae8445774a499679a26642ae26dafe83055 Author: Scott Sandre Date: Thu Aug 11 11:19:02 2022 -0700 Make Delta Lake `DELETE` command return numAffectedRows This PR makes the `DELETE` command in Delta Lake return the `numAffectedRows` metric. For example, ```sql DELETE FROM TABLE table WHERE ... ``` will now return a DataFrame with one row with column num_affected_rows of type Long. Updated tests inside of `DeleteMetricsSuite` which now test for `numAffectedRows` Resolves delta-io/delta#1222 Closes delta-io/delta#1328 GitOrigin-RevId: 5b89b20b0c9846981e92da7b7bc5c5dc5b6a3f19 commit 8f1fe3500bca64ae02ac8af5fe096730a770daf7 Author: Scott Sandre Date: Thu Aug 11 08:11:34 2022 -0700 Minor refactor to DeltaSourceOffset GitOrigin-RevId: b7a4f4b025f191cbf4367c2748d22fc7c4faef3d commit 2f36c03c7b59fe2d0dabf1c7fde7e62dbff0d186 Author: Sabir Akhadov Date: Thu Aug 11 13:36:09 2022 +0200 Minor refactor to UpdateCDCSuite GitOrigin-RevId: abd6adad4de914df311dcc0d64247e87121eecea commit e166c65b3db23b29bf3d3832fbecd993f5c77bdc Author: yikf Date: Wed Aug 10 22:20:45 2022 -0700 Fix documentation for `forPath` and `forName` ## Description This pr is a followup of https://github.com/delta-io/delta/pull/1275, it aims to fix doc of `forPath` and `forName` doc only ## Does this PR introduce _any_ user-facing changes? 
No, doc only Closes delta-io/delta#1290 Signed-off-by: Scott Sandre GitOrigin-RevId: 1c6436630688a3981510119903cd25db5dc5466b commit 5b214dd5456fac5a99aea8f8b782a37290d17165 Author: Hedi Bejaoui Date: Wed Aug 10 13:40:06 2022 -1000 Scala and Python API for Describe Detail ## Description Resolves #1188 ### PR changes Added a `details()` function to the Python and Scala APIs of `DeltaTable` similar to running `DESCRIBE DETAIL` to get information about a table (name, format, size, etc.) ### Why we need the changes Just like we have `DeltaTable.history()`, there should be a `DeltaTable.details()` - Scala unit tests were added to [DescribeDeltaDetailSuite.scala](https://github.com/hedibejaoui/delta/blob/70cf2656b44798136edf39a46678fa1de2b67f0b/core/src/test/scala/org/apache/spark/sql/delta/DescribeDeltaDetailSuite.scala#L61) - Python unit tests were added to [tables.py](https://github.com/hedibejaoui/delta/blob/70cf2656b44798136edf39a46678fa1de2b67f0b/python/delta/tables.py#L281) - Extended Scala and Python utilities examples and checked output of `python3 run-integration-tests.py --use-local`: ```bash [info] Describe Details for the table [info] +------+--------------------+----+-----------+--------------------+--------------------+--------------------+----------------+--------+-----------+----------+----------------+----------------+ [info] |format| id|name|description| location| createdAt| lastModified|partitionColumns|numFiles|sizeInBytes|properties|minReaderVersion|minWriterVersion| [info] +------+--------------------+----+-----------+--------------------+--------------------+--------------------+----------------+--------+-----------+----------+----------------+----------------+ [info] | delta|13a60c59-2527-4cc...|null| null|file:/tmp/parquet...|2022-07-02 15:13:...|2022-07-02 15:13:...| []| 6| 2656| {}| 1| 2| [info] 
+------+--------------------+----+-----------+--------------------+--------------------+--------------------+----------------+--------+-----------+----------+----------------+----------------+ .. ######## Describe details for the table ###### +------+--------------------+----+-----------+--------------------+--------------------+--------------------+----------------+--------+-----------+----------+----------------+----------------+ |format| id|name|description| location| createdAt| lastModified|partitionColumns|numFiles|sizeInBytes|properties|minReaderVersion|minWriterVersion| +------+--------------------+----+-----------+--------------------+--------------------+--------------------+----------------+--------+-----------+----------+----------------+----------------+ | delta|46d665c3-b2bd-4de...|null| null|file:/tmp/delta-t...|2022-07-02 15:27:...|2022-07-02 15:27:...| []| 6| 2656| {}| 1| 2| +------+--------------------+----+-----------+--------------------+--------------------+--------------------+----------------+--------+-----------+----------+----------------+----------------+ ``` ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1246 Signed-off-by: Allison Portis GitOrigin-RevId: 5438c88342f31a33322eec6e43c0ac0e3a1189e9 commit bb24d50650bbe4a1ea4e85aa87d1441811c2f018 Author: Ming DAI Date: Wed Aug 10 14:46:20 2022 -0700 Minor refactor to RestoreTableCommand GitOrigin-RevId: fd073d40365e491137f0988b3c7d6db2352b0c1a commit d55a2de2b642ecfb60ae291f943a6603aa154081 Author: Tom van Bussel Date: Wed Aug 10 23:29:52 2022 +0200 Make timing metrics more readable Swaps out `createMetric` to `createTimingMetric`, to make the timing metrics more readable. Existing tests to make sure nothing breaks. 
GitOrigin-RevId: b777a012e5480df8797c4086f4cd6d6cf536ad95 commit 8b3fd4855deda50b20f27984d187961e7fd4a5a3 Author: Jintian Liang Date: Tue Aug 9 11:03:43 2022 -0700 Refactor FileAction and OptimisticTxn Exposed common methods in all subclasses of FileActions to the parent class. Made commitLarge return the post-commit snapshot. GitOrigin-RevId: 9802ff990a4df6c3f54b97d66aa72f5d587e0885 commit de7ba235ec83bac401cf9e233f8308b578f41d8c Author: Adam Binford Date: Tue Aug 9 09:30:31 2022 -0700 Use ThreadUtils.parmap for optimize ## Description Resolves https://github.com/delta-io/delta/issues/1220 Uses ThreadUtils.parmap to parallelize the compaction instead of parallel collections. This should improve the "tail" of the compaction execution. It seems that the parallel collections method buckets the jobs into maxThreads groups and then executes each group with one of the threads in the pool. I think `ThreadUtils.parmap` is more of a proper queue-based approach, so any remaining tasks can be done by any free thread. Existing UTs ## Does this PR introduce _any_ user-facing changes? Just tail performance gain for optimize commands. Closes delta-io/delta#1315 Signed-off-by: Scott Sandre GitOrigin-RevId: 0ef683ac806153dd41c1cb74fc2fe1864dab1c76 commit 95e87487cd64bfec0ff593767ce76ea99725689e Author: Terry Kim Date: Tue Aug 9 09:02:33 2022 -0700 Minor refactor to DataSkippingReader GitOrigin-RevId: 00086a827f9c895b331d272dcc552fcc70b77ba6 commit 1ff4299480437e9d25eceb4c146a015a56990482 Author: Prakhar Jain Date: Mon Aug 8 23:39:29 2022 -0700 Parallelize renames when creating checkpoint part files GitOrigin-RevId: bb8d39161963b23cbb3df35e18085d27a5cea163 commit 2fb3f2950485f24386ec9726a34d9a06df0a62b2 Author: Terry Kim Date: Mon Aug 8 21:16:50 2022 -0700 Refactor DataSkippingReaderBase to enable storing states for constructing data filters This PR proposes to refactor `DataSkippingReaderBase` to enable storing states for constructing data filters.
Storing states per data filter construction is needed so that they can be accessed from extractors (e.g., to enable date / timestamp functions only when partition filters are not present). This PR is a pure refactor and existing tests should cover the change. GitOrigin-RevId: a48edee2dfa5f36896a03b9ed76b80d3c032453f commit 1349efc10ed8098ebdbcb354008c46a28ae309d7 Author: Scott Sandre Date: Mon Aug 8 15:50:23 2022 -0700 Fix ActionSerializerSuite This test changes when/how actions are instantiated inside of ActionSerializerSuite when passed into testActionSerDe. We change the action type from Action to => Action. We also make the failing metadata lazy. This ensures that the metadata are instantiated after SparkFunSuite::beforeAll has run. This ensures that Utils.isTesting is true, meaning that the metadata will be instantiated with id testId. Existing UTs. Resolves delta-io/delta#1283 Closes delta-io/delta#1317 GitOrigin-RevId: 020565687710afa4aa240355ca4aca4c0ba9d4e8 commit 27adb009c62283ef296076827e4b32da4aaa47b3 Author: Scott Sandre Date: Fri Aug 5 09:23:35 2022 -0700 Dockerized run-tests.py will now use test parallelism if env variable is set `run-tests.py` will now propagate the environmental variable `TEST_PARALLELISM_COUNT` to the SBT tests, if that variable is present. This, in turn, will cause the SBT tests to run using that set parallelism. . GitOrigin-RevId: a8e36af935987ed7ebc5d4e32e70afcbd0fba242 commit 7c3abac0da6005eedddaf222b7215cfb627e5826 Author: Hussein Nagree Date: Thu Aug 4 22:24:07 2022 -0700 Reduce unsafe snapshot usage in postCommitHooks Reduce the usage of unsafe snapshots by passing the postCommitSnapshot computed at the end of `doCommit` in to the PostCommitHooks. 
GitOrigin-RevId: 5551a06d22c629201d493d9f146c1e61181fa7f1 commit ef49ae21cc6130a2b2e4d9e7b10409d43e3a540d Author: Allison Portis Date: Thu Aug 4 20:55:38 2022 -0700 Fix Merge CDC delete with duplicate matches bug and add tests Resolves https://github.com/delta-io/delta/issues/1274. This adds tests for a Merge + CDF bug for delete merges with duplicate matches and as well as a fix. Implementation details are explained throughout the code in comments. This PR adds tests to `MergeCDCSuite`. Closes delta-io/delta#1309 GitOrigin-RevId: 2fdb845a6de144babce2efe393fae8b2e5eefbbf commit ff83a375e23b2de028c4dbead3518eb83b8e670c Author: Scott Sandre Date: Thu Aug 4 13:43:26 2022 -0700 Fix `numRecords` compilation issue This PR renames `numRecords` to `numLogicalRecords` inside of `WriteIntoDelta.scala` to fix a compilation issue. GitOrigin-RevId: ba64eb90ab211bc1a1c8b7f3f5bcb7e4abaaa469 commit 39aff15941b98d7bbb22efa92ec876f26e2a8c11 Author: Rahul Shivu Mahadev Date: Thu Aug 4 11:34:43 2022 -0700 Fix Replace Where Operation Metrics * ReplaceWhere is a Delta write operation mode that allows overwriting the table by appending data in the df and deleting rows that match the specified condition. * Operation Metrics is operation based history like numRemovedFiles, numAddedFiles that is retrievable via the `DESC HISTORY` command * A new modification of replaceWhere command was added recently that allows for non-partition commands. 
The replaceWhere is now implemented as an append and Delete(Delete command can write too) * Modified the metrics collection to look at stats and also collect through UDF replaceWhere - writeFiles(append) -> basicWriteStatsTracker (conflict with below when registerSQLMetrics is called) - DeleteCommand (delete + write out rows) -> basicWriteStatsTracker metrics from first write is not reported - added unit tests GitOrigin-RevId: b9c6d1036aa8d4a707fb649ab8fa0498558cc0d3 commit d1c9a2f63ab5c0859e3b83ee6bdd96bf8d2c0a75 Author: Ole Sasse Date: Thu Aug 4 18:21:10 2022 +0200 Remove duplicate rule ReplaceCurrentTimeInMergeInto Remove the ReplaceCurrentTimeInMergeInto rule. To have consistent timestamps in Delta, the same functionality had to be duplicated in PreprocessTableMerge. This is a follow up to remove the initial solution based on the ReplaceCurrentTimeInMergeInto rule. It also uses the code from the ComputeCurrentTime rule in PreprocessTableMerge when it is available. Test coverage already there GitOrigin-RevId: b848fe911479cbeb0a3949520dabf8434569382a commit f46728c55298b0ee9a0b4515eaef2ad09ca386da Author: Lars Kroll Date: Thu Aug 4 18:20:39 2022 +0200 Minor refactoring in OptTxn GitOrigin-RevId: 22cc1241e6875e6aca3678e6fce6642b881592c3 commit 4059a9f499808e7589bd256214c41c0e1af89159 Author: Ionut Boicu Date: Thu Aug 4 11:29:33 2022 +0200 Consistent naming of logical and physical number of records With splitting the DVs `numRecords` into logical and physical number of records, the usages are becoming vague and unclear. Currently, the code is using different names, like `numRecords`, `numTotalRecords`, `physicalNumRecords`, `numNonDeletedRecords`. For consistency we agreed on replacing: - `numRecords` and `numNonDeletedRecords` with `numLogicalRecords`. - `numTotalRecords` with `numPhysicalRecords`. 
Existing tests GitOrigin-RevId: 517e12a3a3223510f89c81e91daad6640ffeeef8 commit fac54d20aa9c80cb7982a860236b2980c761a54d Author: Tyson Condie Date: Wed Aug 3 19:48:55 2022 -0700 Do not mask checksum failures when testing When testing, do not mask checksum failures when incremental commit checksum is verified. Existing UTs have been fixed. GitOrigin-RevId: 76f49415e5a3644aeaca36c4ab18dcbbf6159053 commit e0b945cfdf9d47d7d1e6f7cce568c609674c7211 Author: KaiFei Yi Date: Wed Aug 3 17:48:07 2022 -0500 Remove unused import ## Description fix https://github.com/delta-io/delta/issues/1268 use `./build/sbt compile` and the compile passed ## Does this PR introduce _any_ user-facing changes? No, code beautiful only Closes delta-io/delta#1269 Signed-off-by: Venki Korukanti GitOrigin-RevId: f0e5480188adb322bad0c3f5681b2bad2513a265 commit 7a3d785937048841bf7cd75d2b6debc476d019d5 Author: Jackie Zhang Date: Wed Aug 3 14:43:02 2022 -0700 Minor refactoring in ALTER TABLE GitOrigin-RevId: 8714957d9d5d0be82ffaf8a8c42dae4409954443 commit ebff29904f3ababb889897343f8f8f7a010a1f71 Author: Ming DAI Date: Tue Aug 2 22:24:27 2022 -0700 Introduce a new file manifest for 'CONVERT TO DELTA' on c… …atalog table Currently, when _spark_metadata log is missing, "CONVERT TO DELTA" always scans the whole directory recursively for data files. However, this is not necessary for some cases. When the source table is a catalog table, we can use metadata from catalog to prune out-of-date files. This PR introduces a new manifest to fetch data files based on partition informations from catalog, and it is invoked for the scenario above. The feature is protected with a SQL conf, whose default is true Unit test is added in this PR. 
GitOrigin-RevId: 0eb87398bc65e95d54ebf32c1bec2d1c683dd327 commit 27797ec121d12818eb8f90dc730936fad6d240c2 Author: Prakhar Jain Date: Tue Aug 2 19:29:53 2022 -0700 Rename DeltaErrors.failOnCheckpoint to DeltaErrors.failOnCheckpointRename GitOrigin-RevId: eeff5c93d645cd1fc6302f7cea27acb3b103a7ff commit ed9ff6e1eced91bc54b55d059805b203273838d9 Author: Scott Sandre Date: Tue Aug 2 17:36:35 2022 -0700 Correctly use `initializeStats` in `DataSkippingStatsTracker` Add missing call to `initializeStats` inside of `DataSkippingStatsTracker`. Existing UTs. ## Followup Backport this to branch-1.2, branch-2.0 in Delta Lake, too. GitOrigin-RevId: b4e7c08d14d0e8ba6b9c6c7a12cf2e63990e0ad9 commit 7344149665de138f6863af99e11beffdc45402d9 Author: Shixiong Zhu Date: Tue Aug 2 16:14:51 2022 -0700 Avoid calling Dataset.isEmpty in vacuum ## Description `diff.isEmpty` may not be cheap. This PR removes it and updates the code to avoid using `Dataset.reduce`. ## Does this PR introduce _any_ user-facing changes? Closes delta-io/delta#1306 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 6a43ee8643da724423e42b37c4f503b1a8024295 commit 40943c66cf4fe850b201adfca3e8aa4811fb1076 Author: lzlfred Date: Tue Aug 2 09:29:21 2022 -0700 Minor refactoring GitOrigin-RevId: 46c3bdde452e1cf1d69cd7db7d977efe91317f29 commit ada1edc336db82cf22696ac2ee6aa8144a604242 Author: Tyson Condie Date: Tue Aug 2 04:07:07 2022 -0700 Provide context to validate checksum Added context argument to validateChecksum such that it can be passed to usage log on validation failures. This will be used to narrow down the root cause of an invalid checksum. Existing UTs. GitOrigin-RevId: 3169e85450b2900e82eb79d96653ec11c9c56ef5 commit 8c41fe7fc5d3cf3fb28f76b9f25c2e2139e01109 Author: Koert Kuipers Date: Fri Jul 29 15:37:07 2022 -0700 Add isolationLevel delta config but only allow Serializable for now ## Description Resolves #1265 Added test to DeltaConfigSuite for setting isolation level ## Does this PR introduce _any_ user-facing changes? 
Yes, user can now set `delta.isolationLevel = Serializable` on table Closes delta-io/delta#1276 Signed-off-by: Allison Portis GitOrigin-RevId: 949e4a663ab46f5f09a2b7742c3b53ef419ce01b commit 2d47ffdaa3f3f18804d1c2e7d8c0c4262b1564df Author: Venki Korukanti Date: Fri Jul 29 13:10:23 2022 -0400 Upgrade Delta to use Apache Spark 3.3.0 ## Description Upgrade the Spark dependency version to 3.3.0. Following are the major changes: * Test fixes to change the expected error message * `VacuumCommand`: Update the parallel delete to first check if there are entries before trying to `reduce`. * Update the `LogicalPlan` used to represent the create or replace command in `DeltaTableBuilder` * Spark version upgrade in build and test setup scripts * Spark 3.3 upgraded the log4j from 1.x to 2.x which has a different log4j properties format Fixes delta-io/delta#1217 Closes delta-io/delta#1257 Signed-off-by: Venki Korukanti GitOrigin-RevId: 3e930d3c2cef5fca5f2cd8dd94a8617dbe2f747b commit 58d0f4852838ad9913524c88e493519fad3d5297 Author: Christos Stavrakakis Date: Fri Jul 29 17:06:17 2022 +0200 Minor refactoring to UpdateMetricsSuite GitOrigin-RevId: 03b95eebd473bb25241f425fd63aa0481873829e commit b2eb3e098cc443fd51db1225d9389016851ae2cd Author: Juliusz Sompolski Date: Fri Jul 29 15:47:57 2022 +0200 Test MERGE with UDTs Adding a test for MERGE with UDTs should provide useful cross testing of using UDTs in many scenarios. GitOrigin-RevId: 3d26aac17cc5af3ef083b53e144ad5f4f5db36aa commit 12e648d20b656c11d87c0612dbc878b728153501 Author: Thomas Newton Date: Thu Jul 28 09:56:14 2022 -0700 Minor improvements to protocol.md table of contents Signed-off-by: Thomas Newton ## Description Run doctoc to auto-update the table of contents. This PR was repurposed after what it was originally intended to document was discovered to be a [bug](https://github.com/delta-io/delta/issues/1229). Its just a docs update ## Does this PR introduce _any_ user-facing changes? 
No Closes delta-io/delta#1216 Signed-off-by: Scott Sandre GitOrigin-RevId: 4e912811520717dc07d8f2f1e32863e65e7b33b4 commit 90e99385555091051cd1be1ea1fbd04c02159038 Author: Denis Krivenko Date: Thu Jul 28 09:27:13 2022 -0700 Resolve variables for Delta commands Signed-off-by: Denis Krivenko ## Description This PR adds variables substitution to Delta SQL Parser. Resolves https://github.com/delta-io/delta/issues/1267 issue. DeltaSqlParserSuite was changed to test this PR. ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1287 Signed-off-by: Scott Sandre GitOrigin-RevId: e77ec8925e74c881ff003ccc433f04bc3f80429a commit 99f5ecf2ccb3d0f00ac19ed6ac00e42d005de673 Author: Prakhar Jain Date: Wed Jul 27 18:22:25 2022 -0700 Minor refactoring GitOrigin-RevId: ad7361e8218d68cdf3002400d110debc45b51ac9 commit 8aa9db208c9a912e3a02bd1c579d6d72d9bb7138 Author: Tathagata Das Date: Wed Jul 27 19:28:53 2022 -0400 Updates to the benchmark framework - Allow multiple benchmarks to be run in a sequence, just give the names in a comma-separated sequence - Allow the Spark UI to work in EMR while benchmark is running - Allow benchmarking arbitrary function, not just SQL query - Locally download the full log as well, even if there is a failure Manual testing Closes delta-io/delta#1295 GitOrigin-RevId: 85bfc2fe906c8672d3461ae0e39aa94902759086 commit 815866382c08fc65f154a9eb21212ece182286c1 Author: Grzegorz Kołakowski Date: Wed Jul 27 13:27:22 2022 -0700 Add terraform deploying infrastructure for benchmarks ## Description Currently, in order to run performance benchmarks one need to create the infrastructure manually. This PR adds Terraform scripts which do that automatically for AWS and GCP. I tested this patch manually on AWS and GCP cloud. ## Does this PR introduce _any_ user-facing changes? No. 
Closes delta-io/delta#1179 Co-authored-by: Grzegorz Kołakowski Signed-off-by: Scott Sandre GitOrigin-RevId: 9cb7769afc7889beb743f499f271d8eac1167c1f commit ff6914ba43b30ea32fcb8ddb374a98e6b087912a Author: lzlfred Date: Tue Jul 26 11:21:32 2022 -0700 Minor refactor to DatasetRefCache.scala GitOrigin-RevId: 411e674c3f0bacc1728ca76a3108bca1f9a3a754 commit 176ed08781e30a156eda4c61da8f545c1e032ac6 Author: Adam Binford Date: Tue Jul 26 13:21:06 2022 -0500 Allow for schema pruning during update check for files to touch ## Description Resolves #1201 Allows for schema pruning in the first part of an update to check for files to touch. Code snippet I ran: ```python >>> import pyspark.sql.functions as F >>> from delta.tables import DeltaTable >>> table = DeltaTable.forPath(spark, "test") >>> table.toDF().printSchema() root |-- key: string (nullable = true) |-- value: long (nullable = true) >>> table.update("key = 'c'", set={'value': F.lit(6)}) ``` The execution plan for the step that finds the files to update: before: ``` (1) Scan parquet Output: [key#526, value#527L] Batched: true Location: TahoeBatchFileIndex [file:.../projects/delta/test] PushedFilters: [IsNotNull(key), EqualTo(key,c)] ReadSchema: struct ``` after: ``` (1) Scan parquet Output: [key#686] Batched: true Location: TahoeBatchFileIndex [file:.../projects/delta/test] PushedFilters: [IsNotNull(key), EqualTo(key,c)] ReadSchema: struct ``` Only key is read, not value as well. The line swap should result in the same behavior, but doing the select before the nondeterministic UDF allows schema pruning to happen. Existing UTs plus screenshot of execution plan. ## Does this PR introduce _any_ user-facing changes? Performance improvement for update with data predicate.
Closes delta-io/delta#1202 Signed-off-by: Venki Korukanti GitOrigin-RevId: a4a52a19fa1d18f0727d1dd134e7d38d4cbabfc3 commit f3bdcd8ee6e8ae36c94750b146759549dfbe83d3 Author: Ming DAI Date: Tue Jul 26 01:25:06 2022 -0700 Add a conf to protect the behavior change introduced by 18d4d12ed06f973006501f6c39c8785db51e2b1f Introduce a conf (default true) to allow fallback to the bad behavior before 18d4d12ed06f973006501f6c39c8785db51e2b1f: the converted delta table will contain all columns from the data regardless the definition of catalog table on top of the data. E.g., data contains columns [col1, col2, col3], but the catalog table is created with [col1, col2] A unit test is added in this PR GitOrigin-RevId: 8b4aed4a0046b33aac73dad61f77484fd0d082e2 commit 69551d7ac2812a5a6b84920a7941058da75046d5 Author: Edmondo Porcu Date: Mon Jul 25 18:30:31 2022 -0700 Adding a test parallelization strategy (#1128) ## Description Adds a test parallelization option to run Delta tests on different JVM addressing #1128: - Implements a generic abstraction called `GroupingStrategy` and a default implementation `SimpleHashStrategy` that uses a fixed number of `Test.Groups` each forked in its own JVM. - Provides two collection of settings, one that can be added to enable test parallelization, one to use the default strategy which will use 4 JVM unless separately specified by an environment variable called `DELTA_TEST_JVM_COUNT` - Adds those two settings to the core package Resolves #1128 on local developer machine, but not on the CI Pipeline. 
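For illustration only, the grouping idea described above can be sketched in Python. The actual change lives in the SBT build, and this is not its code. The sketch assumes a stable hash of the test-suite name taken modulo the group count, with the count read from `DELTA_TEST_JVM_COUNT` and defaulting to 4 as stated above. The function names `group_count` and `assign_groups`, the use of CRC32, and the suite names are hypothetical.

```python
import os
import zlib
from collections import defaultdict


def group_count(env=os.environ, default=4):
    """Number of test groups: DELTA_TEST_JVM_COUNT if set, else the default."""
    return int(env.get("DELTA_TEST_JVM_COUNT", default))


def assign_groups(test_names, groups):
    """Map each test name to a group index via a stable hash modulo `groups`."""
    buckets = defaultdict(list)
    for name in test_names:
        # CRC32 is stable across processes, unlike Python's salted hash().
        index = zlib.crc32(name.encode("utf-8")) % groups
        buckets[index].append(name)
    return dict(buckets)


if __name__ == "__main__":
    # Made-up suite names, only to show how the distribution could be logged.
    suites = [f"org.example.Suite{i}" for i in range(130)]
    for index, members in sorted(assign_groups(suites, group_count()).items()):
        print(f"Test group {index} contains {len(members)} tests")
```

A hash-based assignment needs no coordination between runs, at the cost of groups that are only roughly balanced, which matches the uneven group sizes in the log output quoted in this commit message.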
Logging has been introduced using SBT logger so to get some statistics around the distribution of tests ``` sbt:delta-core> Test/testGrouping [info] scalastyle using config /Users/edmondoporcu/Development/personal/delta/scalastyle-config.xml [info] scalastyle Processed 151 file(s) [info] scalastyle Found 0 errors [info] scalastyle Found 0 warnings [info] scalastyle Found 0 infos [info] scalastyle Finished in 25 ms [success] created output: /Users/edmondoporcu/Development/personal/delta/core/target [info] Tests will be grouped in 4 groups [info] Test group 0 contains 34 tests [info] Test group 1 contains 31 tests [info] Test group 2 contains 34 tests [info] Test group 3 contains 30 tests [success] Total time: 6 s, completed Jul 3, 2022 3:39:29 PM ``` Additionally running tops on my Macbook shows 4 JVM running as expected, all with the same Parent PID Screen Shot 2022-07-03 at 3 40 53 PM Still need to be tested, needs to trigger build on CI/CD ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1249 Signed-off-by: Scott Sandre GitOrigin-RevId: 08a0fa1cc4bfbbb46d55c764441361dfa3b35b85 commit e71a45bac9f0fa5ec56ca004fa2f928faeef42fd Author: Max Gekk Date: Tue Jul 26 01:06:45 2022 +0000 Don't include 'Type` in SHOW TBLPROPERTIES SHOW TBLPROPERTIES will not show the Type property. It's generated internally and makes more sense to not show it to end users Authored-by: Max Gekk Author: Wenchen Fan Author: Wenchen Fan Author: Max Gekk GitOrigin-RevId: 98cc117f26d4db88c2d6507aef7d5cf7615a49dd commit 87a73b9c130d393077be4daa00c86dca812efb56 Author: Edmondo Porcu Date: Mon Jul 25 15:23:33 2022 -0700 Preserving case of table properties created via DeltaTableBuilder ## Description This PR makes the DeltaTableBuilder respect the case of table properties written via the builder, while offering an option to fallback to the previous behavior. 
It also adds tests that sets table property in 4 ways: - On Table creation via SQL - Altering a Table via SQL - Via DeltaTableBuilder, both in the "case sensitive" and "not case sensitive mode" Resolves #1182 The DeltaTableBuilderSuite was extended adding one tests that uses four different examples to verify the correct behavior ## Does this PR introduce _any_ user-facing changes? Yes. The Javadoc has been updated accordingly Closes delta-io/delta#1254 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 1ce32e8488c28b13869f91ee546a6fbb8313a3cd commit 75b471693d077233acd5519eefb08951a02981e5 Author: Karen Feng Date: Mon Jul 25 14:55:17 2022 -0700 Improve error message of duplicate constraint name Improves the error message thrown when a constraint with a duplicate name is created on the same table. GitOrigin-RevId: 9ba85fb4de3dbcf1ca41ecb1d500576b662f20c2 commit 9b26c98ba3047ff7eec30949ad88bf4a0ecc2a82 Author: Andreas Chatzistergiou Date: Mon Jul 25 22:34:57 2022 +0200 Minor refactor to DataSkippingReader GitOrigin-RevId: fa4368dad3c248b81288d8f03eaac37bd73c81c0 commit 3386ab500653a8ba1d60612ce1e2702a49072e56 Author: Hussein Nagree Date: Mon Jul 25 11:57:42 2022 -0700 Keep track of the latestLogSegment during a txn GitOrigin-RevId: 1ac5ae937d684032959673974e9b10a78421f4e5 commit 516d167096927ec12caac83a6800f0c5a30f5a4d Author: Scott Sandre Date: Fri Jul 29 12:33:51 2022 -0700 Upgrade version to 0.5.1-SNAPSHOT commit 5c4dc9a95bcc3510c386bd614bfceb558040f32e Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Fri Jul 29 10:39:54 2022 -0700 update doc/example/readme versions to 0.5.0 or to x.y.z (more generic) (#400) commit ec6ce1780517f2d46b1773b3ac62466b7c6b4328 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Fri Jul 29 10:39:03 2022 -0700 Move javadocs from 0.5.0-SNAPSHOT into 0.5.0, and then 0.5.0 into latest (#417) commit 0f5b88a226ba1841de196063c4493463ae8474a9 Author: Scott Sandre 
<59617782+scottsand-db@users.noreply.github.com> Date: Thu Jul 28 16:06:52 2022 -0700 Stage latest javadocs for 0.5.0 release (#416) commit 226c0da740f577b05ca092b49fc747b5a221056e Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Thu Jul 28 15:27:20 2022 -0700 Update delta-flink javadocs and README.md (#415) * Update delta-flink javadocs, using feedback provided by Allison * update docs for ignoreChanges and ignoreDeletes * grammar fix * minor comment changes; implement PR feedback * Fix flink unidoc compile error * More minor javadoc fixes * Add options to README * Minor updates in response to PR feedback * one last trivial update in response to PR feedback commit 5e37b0834ea3e421445e96a9d912d4a86f33f07e Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Thu Jul 28 15:00:58 2022 -0700 Done (#414) commit 3d25c59472ee8b25e8057f36c7bee414aababc8f Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Tue Jul 26 08:48:35 2022 -0700 Upgrade delta-standalone's delta-storage dependency version to 2.0.0 (#408) commit 7fb7759ae7764afda01fa0da18dc4fab38031cdd Author: Shixiong Zhu Date: Mon Jul 25 15:30:14 2022 -0700 Make delta-standalone jar include ParquetSchemaConverter (#406) * Make delta-standalone jar include ParquetSchemaConverter * bump cache key to cache recent dependency changes * more tests commit e65496665fd1b187b488ab77f79986a8b381ee16 Author: David Lewis Date: Fri Jul 22 13:38:15 2022 -0600 Minor refactor to DeltaScan GitOrigin-RevId: a08b1fcb2a72d141caf341c80a475ca4163ecb8b commit b0f3b384bc65d995004a6898ed5d5ac6f2c2fec9 Author: Kam Cheung Ting Date: Fri Jul 22 00:22:18 2022 -0700 Minor refactor to OptimisticTransaction GitOrigin-RevId: eb93e25a453a66aa2ee5c04d0f9b531dfb34e0f5 commit 18d4d12ed06f973006501f6c39c8785db51e2b1f Author: Ming DAI Date: Thu Jul 21 10:43:30 2022 -0700 Support partition schema autofill for 'Convert To Delta' on catalog tables Currently 'Convert To Delta' fails on 
partitioned Parquet tables if partition schema is not provided, this PR enables 'Convert To delta' to autofill the partition schema from the table's catalog if available. Unit test is added in this PR. GitOrigin-RevId: 7b4bb330fdfb98e560d1b440898a621c2df8d97b commit d3eadc6c32208bc2eea04939bd5a90be2d42a71e Author: Venki Korukanti Date: Thu Jul 21 11:38:56 2022 -0500 [Delta] Set version to 2.1.0-SNAPSHOT Delta Lake version 2.0.0 is released. Updated the current development version to 2.1.0-SNAPSHOT NA GitOrigin-RevId: 09bca4904448ed61091c61ebbc30662397e6ca9a commit 1f9440cf2c4231641cd5fc8a4150f83a5e649188 Author: Vini Jaiswal Date: Wed Jul 20 21:02:14 2022 -0400 Update slack invite links ## Description The current Slack invite has expired, updating it to a bit.ly link that is updated regularly. Also see #1270. Verified that it works for any user and goes to the current Slack invite ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1282 Signed-off-by: Scott Sandre GitOrigin-RevId: 6a49ff06c92b653d2b504a0c513f1422c97c2d8b commit 42895d6e3a4fedf42ab6dde5214283cbdb64f57b Author: Ole Sasse Date: Wed Jul 20 17:57:24 2022 +0200 Minor refactor to OptimisticTransaction GitOrigin-RevId: 7aec31f2a8569b3e4da2d94485d6fad3b886976a commit ca74733669165c6878ac6c5e1d6b6a6772931444 Author: Lars Kroll Date: Wed Jul 20 10:14:44 2022 +0200 Minor refactor to CDCReaderSuite GitOrigin-RevId: 9f678845d439d10a9c525f2d21e82cafb1ed8eff commit e3e0ca74b70092872bcad515725a6ef35216950b Author: Andy Lam Date: Tue Jul 19 21:39:55 2022 -0700 Remove `projection` field from `DeltaScan`, and removes `Project` as a plan differentiator for `PrepareDeltaScan` Two Spark plans having the same filter and relation, but different `Project` attributes, spawns different Delta file listing threads. However, `Project` should not affect file listing results and hence should not be considered as a plan differentiator. 
This PR removes `Project` before using it as a key for mapping plans to `DeltaScans`. Added UT in `PrepareDeltaScanSuite`. GitOrigin-RevId: b730c5a2908e7c46db88d732a30700891ea7225f commit 09e9770752c70b93608ec5b76085029749433519 Author: EJ Song Date: Tue Jul 19 18:10:59 2022 -0500 Use repartition(1) instead of coalesce(1) in OPTIMIZE ## Description Use repartition(1) instead of coalesce(1) in OPTIMIZE for better performance. Since it involves shuffle, it might cause some problem when the cluster has not much resources. To avoid it, add a new config to make it switchable: `spark.databricks.delta.optimize.repartition.enabled` (default: false) #### Reference and quick benchmark result for repartition(1) and coalesce(1) Spark API documentation: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#coalesce-int- > However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is). Quick benchmark result (spark 3.2): use 100 files, ~24MB parquet in total (64.2MB in Dataset) use 100 files, ~240MB parquet in total (640MB in Dataset) Result with 9 executors / 72 cores = enough resources image n/a ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1161 Signed-off-by: Venki Korukanti GitOrigin-RevId: 3ddd1a5d735004082febcbeabf6a17f375174d93 commit 61556885114b823bab3b9ac006e40e3b3684a3dd Author: KaiFei Yi Date: Tue Jul 19 18:10:36 2022 -0500 Fix documentation for DeltaTable.forPath ## Description fix https://github.com/delta-io/delta/issues/1272 doc update only ## Does this PR introduce _any_ user-facing changes? 
No, just internal based-code comment updated Closes delta-io/delta#1275 Signed-off-by: Venki Korukanti GitOrigin-RevId: fc6474dca61a2dbae91a06a206056a6033334dfe commit 01e5cade4b64b970cd9e3036fa2db1f9519b36fc Author: Ryan Johnson Date: Tue Jul 19 07:44:56 2022 -0700 Add and use FileStatus overloads for FileNames utility methods Today's Delta code is littered with calls such as `isDeltaFile(fileStatus.getPath)` -- which is a bit annoying because most such calls are in the context of file listing, which always returns `FileStatus` rather than `Path`. Especially annoying in the context of collection operations, such as: ```scala listing .filter(f => isDeltaFile(f.getPath)) .map(f => deltaVersion(f.getPath)) ``` This PR adds overloads to those utility methods, which accept `FileStatus`. With that, the above code can simplify to: ```scala listing .filter(isDeltaFile) .map(deltaVersion) ``` Existing unit tests cover all affected Delta code. GitOrigin-RevId: 5602f22bccfe7585c43b0b20c265ff6904394aaa commit 5f1ebff55fa0a010e1e95f608d5806a255bebef6 Author: kristoffSC Date: Wed Jul 20 17:44:08 2022 +0200 ReadmeUpdate_Azure - add known limitation for Azure writes. (#402) * ReadmeUpdate_Azure - add known limitation for Azure writes. Signed-off-by: Krzysztof Chmielewski * Fix `Bloob` to `Blob` Co-authored-by: Krzysztof Chmielewski Co-authored-by: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> commit 56e1b9b417e8815cd7003eaf1796988c670b8b0d Author: Fu Chen Date: Mon Jul 18 23:04:38 2022 -0500 [ZOrder] Fast approach of interleaving bits to resolve https://github.com/delta-io/delta/pull/1149#discussion_r881782878 Improve interleave bits perf, the basic idea is from http://graphics.stanford.edu/~seander/bithacks.html#InterleaveTableObvious Existing UTs. No. 
Before this PR `spark.databricks.delta.optimize.zorder.fastInterleaveBits.enabled = false`: ``` [info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.5 [info] Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz [info] 1000000 rows interleave bits benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] sequence - 1 int columns benchmark 428 440 16 2.3 427.6 1.0X [info] sequence - 2 int columns benchmark 572 605 39 1.7 571.6 0.7X [info] sequence - 3 int columns benchmark 749 768 32 1.3 748.6 0.6X [info] sequence - 4 int columns benchmark 893 934 64 1.1 892.9 0.5X [info] random - 1 int columns benchmark 384 391 11 2.6 383.9 1.1X [info] random - 2 int columns benchmark 551 557 7 1.8 550.9 0.8X [info] random - 3 int columns benchmark 716 722 6 1.4 716.3 0.6X [info] random - 4 int columns benchmark 922 932 11 1.1 922.4 0.5X ``` After this PR `spark.databricks.delta.optimize.zorder.fastInterleaveBits.enabled = true`: ``` [info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.5 [info] Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz [info] 1000000 rows interleave bits benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] sequence - 1 int columns benchmark 299 340 36 3.3 298.5 1.0X [info] sequence - 2 int columns benchmark 423 459 50 2.4 423.5 0.7X [info] sequence - 3 int columns benchmark 560 579 24 1.8 560.4 0.5X [info] sequence - 4 int columns benchmark 694 719 37 1.4 694.0 0.4X [info] random - 1 int columns benchmark 270 274 4 3.7 269.8 1.1X [info] random - 2 int columns benchmark 418 428 13 2.4 418.4 0.7X [info] random - 3 int columns benchmark 548 568 32 1.8 547.6 0.5X [info] random - 4 int columns benchmark 680 686 6 1.5 679.6 0.4X ``` 
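As a rough illustration of the technique referenced above (table-driven bit interleaving, in the spirit of the linked bithacks page), here is a minimal Python sketch. It is not the code from this PR. It assumes non-negative 32-bit inputs and an output ordered from the most significant bit of the first column downward. The names `interleave_naive`, `make_spread_table`, and `interleave_table` are hypothetical.

```python
import random


def interleave_naive(values, bits=32):
    """Interleave one bit at a time, from the most significant bit down."""
    out = 0
    for bit in range(bits - 1, -1, -1):
        for v in values:
            out = (out << 1) | ((v >> bit) & 1)
    return out


def make_spread_table(n):
    """spread[b] places bit i of byte b at position i * n (gaps for n columns)."""
    return [sum(((b >> i) & 1) << (i * n) for i in range(8)) for b in range(256)]


def interleave_table(values, bits=32):
    """Same result as interleave_naive, but one table lookup per input byte.

    `bits` must be a multiple of 8.
    """
    n = len(values)
    spread = make_spread_table(n)
    out = 0
    for j, v in enumerate(values):
        for k in range(bits // 8):
            byte = (v >> (8 * k)) & 0xFF
            # Byte k of column j starts at output bit 8*k*n; column j is
            # offset by (n - 1 - j) so that column 0 is the most significant.
            out |= spread[byte] << (8 * k * n + (n - 1 - j))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (1, 2, 3, 4):
        cols = [rng.randrange(1 << 32) for _ in range(n)]
        assert interleave_table(cols) == interleave_naive(cols)
    print("table-driven and naive interleaving agree")
```

The table variant replaces eight shift-and-mask steps per byte with a single lookup, which is the kind of saving the benchmark above measures. A real implementation would build the table once rather than on every call.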
cc and thank the Apache Kyuubi interleave bits author @ulysses-you Closes delta-io/delta#1160 Signed-off-by: Venki Korukanti GitOrigin-RevId: 7140594ca3b8a0806a0adcd68500dd971636c54c commit d2804cb92a7e36863144c7be9c55df1c6f1c1a1e Author: Allison Portis Date: Mon Jul 18 19:53:37 2022 -0700 [Delta] Metric tests for merge This PR introduces tests for metrics in Merge. These added tests revealed multiple bugs in our MergeIntoCommand metric calculation. These metrics are further outlined below. This PR also fixes 3 of these issues. The remaining failing tests are ignored in this PR so that we can merge them. When the metrics are fixed in the future we will reenable these tests. This PR adds tests and some fixes. The fixes are tested by the new tests. - `numOutputRows`: failing when CDF enabled. Fixed in this PR. - `numSourceRows`: only with an empty target. Fix in this PR. (non-regression) - `numTargetRowsDeleted`: failing " delete-only without join", "delete-only with duplicate". Fixed in this PR. (non-regression) - The first one was due to the cross-product join bug for Merge CDC. However the fix for duplicate matches also corrects this result. - `numTargetFilesAdded`: only when `partitioned=false` (non-regression) (expected: 0, found: 1) - `numTargetRowsCopied`: our implementation seems to return the number of rows persisted (without regards to whether they were physically rewritten or not) with some additional nuance. Not sure of an easy fix for this at the moment. (non-regression) GitOrigin-RevId: ee67a87fdff3f13c9401d5814b5308c9260d404f commit a39e17225d4743c4d3148611f1b4b8c9e294d04e Author: Jackie Zhang Date: Mon Jul 18 19:47:42 2022 -0700 Minor refactoring GitOrigin-RevId: 14e9baf00a8b59a2c9db02d00bb3a921f411316c commit b527c6fa86a5f050413001d99c32e8b9869eddee Author: Allison Portis Date: Mon Jul 18 19:25:58 2022 -0700 Fix incomplete SQL conf keys in DeltaErrors Fixes two error messages that provided incomplete conf keys to users. 
GitOrigin-RevId: 6fd821f29719adbd13bf63fe735ea120b62c5c5c commit 11b2d1fb845e625e9a134951067cb607792b278b Author: Allison Portis Date: Mon Jul 18 15:12:43 2022 -0700 [Delta] Accept LogStore conf keys with and without the "spark." prefix This PR adds support for both `spark.` prefixed and non-`spark.` prefixed keys for LogStore confs (class and scheme.) This is to maintain compatibility across the delta ecosystem. Originally, I did this similarly to https://github.com/delta-io/connectors/pull/370 where we "normalize" the `SparkConf`, but I think this is less safe than the way I've chosen to do it, because we would need to update all spark configurations that might be accessed from anywhere else in the code. For example, `DelegatingLogStore` uses `SparkEnv.get.conf`. Instead, we alter our access when looking up keys to check for both spark-prefixed and non-spark-prefixed keys. Changes in this PR: - adds `LogStoreProvider. getLogStoreConfValue` which checks for both keys - adds `verifyLogStoreConfs` which checks for discrepancies between the two keys - refactors part of `DelegatingLogStoreSuite` into `LogStoreProviderSuite` Adds tests to `LogStoreProviderSuite` and `DelegatingLogStoreSuite` GitOrigin-RevId: 2ea19970e3fc40e8f2b06cb70f9c27197f399b08 commit 2fbd54677ea214dd76ecb454681e72564a63becf Author: sherlockbeard Date: Fri Jul 15 11:49:26 2022 -0700 Delta un supported column mapping error message change Resolves #1186 From SqlTest Suite Yes ``` Exception in thread "main" org.apache.spark.sql.delta.DeltaColumnMappingUnsupportedException: Your current table protocol version does not support changing column mapping modes using delta.columnMapping.mode. Required Delta protocol version for column mapping: Protocol(2,5) Your table's current Delta protocol version: Protocol(1,2) Please upgrade your table's protocol version using ALTER TABLE SET TBLPROPERTIES and try again. 
``` ``` Exception in thread "main" org.apache.spark.sql.delta.DeltaColumnMappingUnsupportedException: Your current table protocol version does not support changing column mapping modes using delta.columnMapping.mode. Required Delta protocol version for column mapping: Protocol(2,5) Your table's current Delta protocol version: Protocol(1,2) Please upgrade your Delta table to reader version 2 and writer version 5 and change the column mapping mode to 'name' mapping. You can use the following command: ALTER TABLE SET TBLPROPERTIES ( 'delta.columnMapping.mode' = 'name', 'delta.minReaderVersion' = '2', 'delta.minWriterVersion' = '5') ``` Closes delta-io/delta#1263 Co-authored-by: sherlockbeard Signed-off-by: Allison Portis GitOrigin-RevId: c6fa50748ccdd4c6cad0a50565c8db933bde07de commit d4e9c5f9933d7414a1e13bc52f7693cd6ac32e4e Author: minyyy Date: Fri Jul 15 09:15:00 2022 -0700 Minor refactoring GitOrigin-RevId: 3d26fb28af97856ae8b34ebb56a398a97cbb546e commit 0fcc6267a485869f2a114fd8b3d8c7747ea9537f Author: Sabir Akhadov Date: Fri Jul 15 10:50:09 2022 +0200 Minor refactoring GitOrigin-RevId: a6aa0f0d2d4b7022082f18539433cea11d859dbd commit f8638d2abe103805ee5c1725a0e76185c5bd7f72 Author: Prakhar Jain Date: Thu Jul 14 23:48:32 2022 -0700 Make SerializableFileStatus and LogSegment json serializable This PR makes SerializableFileStatus and LogSegment case classes json serializable. Added UTs. GitOrigin-RevId: afcc62e1af167969a0f7daabbc0adc993de9fbca commit 38945d0d62abd10e95c7d79cb4c8052146bf9b2e Author: Rahul Shivu Mahadev Date: Thu Jul 14 21:57:08 2022 -0700 Nullable columns should work when using generated columns - There was a bug in the generated columns code `addDefaultExprsOrReturnConstraints` that would not allow null columns in the insert DataFrame to be written even if the column was nullable. 
- added unit test GitOrigin-RevId: effdb5732e7aeaf0da7fa5e18bc2eda7436ecfbc commit 0eee6e3e12ba5d82339f6c44dba953f17b32c20f Author: Junlin Zeng Date: Thu Jul 14 15:53:17 2022 -0700 Minor refactoring GitOrigin-RevId: 5257ffb2d0e35c8515f4a809d8250f8ddc62420e commit 2871e550eabd9e4956aca1b02baeb20635a66931 Author: Ryan Johnson Date: Thu Jul 14 13:28:59 2022 -0700 Minor refactoring GitOrigin-RevId: 958542ad8affcc693d89a3130ee8a4151c25835e commit a6a0e9bd2848f58827fe0fa76e957276f6279771 Author: Prakhar Jain Date: Thu Jul 14 13:25:20 2022 -0700 Move SerializableFileStatus to SnapshotManagement and remove unnecessary getters GitOrigin-RevId: 991b865eb65abeed0addc4d6f595da0af80552c8 commit 86cf73f8e51d3bd63d136fde2529b4153ddc92fe Author: Nick Date: Thu Jul 14 08:41:15 2022 -0700 Add support of hadoop-aws s3a SimpleAWSCredentialsProvider to S3DynamoDBLogStore - Add support for `SimpleAWSCredentialsProvider` or `TemporaryAWSCredentialsProvider` in `spark.io.delta.storage.S3DynamoDBLogStore.credentials.provider` options. - Because Delta relies on the Spark and Hadoop FS storage layer, it makes sense to be able to authorize the DynamoDB client in the same way as we authorize for S3. Resolves delta-io/delta#1235. We use it in production with Spark 3.2 on YARN 2.9.1 and my own fork of Delta 1.2.1. The fork was made from the latest 1.2.1 with the multipart checkpoint commit cherry-picked. Scala 2.12. I have more than 100 tables, with data ingested every 10 minutes and multiple jobs running daily, such as retention and row-level updates in some files. No, except that the [official example](https://docs.delta.io/latest/delta-storage.html#quickstart-s3-multi-cluster) will now work in any environment, not only in an environment where the node on which the Spark app is scheduled has AWS credentials configured. Please find more details about the reason in delta-io/delta#1235.
Closes delta-io/delta#1253 Signed-off-by: Venki Korukanti GitOrigin-RevId: cbcc087457971c91d9908ac44398492bfa49d811 commit de92ca813824033346856d07238dfb4e5fc6f94b Author: Denny Lee Date: Wed Jul 13 19:15:29 2022 -0400 Update Slack invite The current Slack invite has expired, updating it to a bit.ly link that is updated regularly. Verified that it goes to the current Slack invite No Closes delta-io/delta#1270 Signed-off-by: Scott Sandre GitOrigin-RevId: fc6a987b800d790a2348f955f630e927a31fcb3c commit dd9d1eb68eafa6d9f6b79bf0a343ee55aff52dc4 Author: Christos Stavrakakis Date: Wed Jul 13 23:32:44 2022 +0200 Minor test refactoring GitOrigin-RevId: b44625315cc87eaa6d3ef8ff375eeb52447e006d commit 2b207bc3dab0072607b4eb198fe381b3f00559d2 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Wed Jul 13 13:48:44 2022 -0400 Create `delta-standalone-parquet` project to refactor public API parquet dependencies out from `delta-standalone` (#394) * Refactor classes to new project - update build.sbt - refactor java/scala/test classes * Add flink test * update .github test.yaml * update copyright * fix standalone MIMA * refactor utils project to standalone-parquet * refactor packages * fix package visibility * do not publish standalone-parquet package; have standaloneCosmetic depend on standaloneParquet * update .github test.yaml * update build.sbt; add comment; rename dependency commit 0126b5e3a36fc374c6c78a28882e28195ec57b28 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Wed Jul 13 13:28:48 2022 -0400 Fix typo in README.md commit 25188eb939840d335a4376b79a7180e414f30075 Author: Scott Sandre Date: Wed Jul 13 11:20:05 2022 -0400 Fix `numCopiedRows` in UPDATE command metric This PR fixes how the `numCopiedRows` metric is calculated in the `UpdateCommand` GitOrigin-RevId: ffb5dda80196627d6161d81af68b50afe1bc5dc7 commit dadd9723596ef190945cd0002039b216a12c7117 Author: Scott Sandre Date: Wed Jul 13 11:19:38 2022 -0400 Add metric 
tests for `RESTORE` and `CREATE TABLE` This PR adds more metric tests for the `RESTORE` and `CREATE TABLE` operations. GitOrigin-RevId: 289f2ac2500b2741e62222be3fededbd6a282368 commit 464d7cf3e92453c21e22f20789bf727b7246cb50 Author: Christos Stavrakakis Date: Wed Jul 13 10:15:56 2022 +0200 Minor refactoring DELETE test suite GitOrigin-RevId: 34178158a640e8836f99200bc6a50dd4c37eaa0f commit 97afc34dbf408f18b51906d84ef46986a4bcb7cc Author: Scott Sandre Date: Tue Jul 12 09:33:18 2022 -0400 DELETE operation metrics and tests GitOrigin-RevId: 31631ccc2dab73844beb3d2879e2b74981e617d7 commit d4c7a2414b5db35acaec6c41a9a09887bced776f Author: Venki Korukanti Date: Mon Jul 11 16:11:54 2022 -0700 Add an evolvability test for `numRecords` removed as part of 5842b5f Make sure the latest versions are able to read a checkpoint containing `numRecords` in the schema. GitOrigin-RevId: 83edb2767497de33d7c71995b97ed2b41364c1c6 commit efdc46a0e6aff94d03acbf9623fca7ca9b0539cd Author: Felipe Pessoto Date: Mon Jul 11 13:16:12 2022 -0700 Fix error message Signed-off-by: Felipe Pessoto ## Description Fix an error message with incorrect instructions. Tested locally; the old code example doesn't even compile. ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1238 Signed-off-by: Venki Korukanti GitOrigin-RevId: fa2393dae000a2497ccba7466b704de4aecb9393 commit cf27bde9d46d77854ff287ad8bb8604fed21dfb1 Author: Yousry Mohamed Date: Mon Jul 11 13:15:47 2022 -0700 Fix broken link in PROTOCOL.md file ## Description There was a broken link in the `PROTOCOL.md` file. It is the link titled `Delta Sink for Apache Spark's Structured Streaming` inside the **Transaction Identifiers** section. The fix is a simple change in the URL to point to the right source file. It is a Markdown update, hence no tests are included. ## Does this PR introduce _any_ user-facing changes? No.
Closes delta-io/delta#1251 Signed-off-by: Venki Korukanti GitOrigin-RevId: 65122335b90cb7820035a669d19cd9438b314988 commit 143a6b9a0a56374dc6c40d60d438e669a18f6b9f Author: Denny Lee Date: Wed Jul 13 09:16:57 2022 -0700 Updated Slack invite (#396) Current Slack invite is expired, update it to point to a bit.ly link that is regularly updated with a new invite. commit 0d3e92c6b3318fb934c4925cb214f8ec9fac5aae Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Wed Jul 13 11:40:34 2022 -0400 generated javadocs for standalone and flink (#398) commit 29d8399d138bae44fdf97c8513dae88e17976424 Author: kristoffSC Date: Wed Jul 13 16:22:14 2022 +0200 Update Flink README.md limitations (#392) * ReadmeUpdate_Limitations - add limitations to README.md Signed-off-by: Krzysztof Chmielewski * ReadmeUpdate_Limitations - chanegs after code review Signed-off-by: Krzysztof Chmielewski * ReadmeUpdate_Limitations - changes after code review Signed-off-by: Krzysztof Chmielewski * Update README.md Co-authored-by: Krzysztof Chmielewski Co-authored-by: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> commit 0644fa2f2271bb8e65405ea7a8123bebe5782b5b Author: kristoffSC Date: Wed Jul 13 15:59:47 2022 +0200 PR_20_UpdateFlinkSource_JavaDoc - changes to Javadocs (#395) * PR_20_UpdateFlinkSource_JavaDoc - changes to Javadocs Signed-off-by: Krzysztof Chmielewski * PR_20_UpdateFlinkSource_JavaDoc - changes to Javadocs Signed-off-by: Krzysztof Chmielewski * PR_20_UpdateFlinkSource_JavaDoc - changes to Javadocs Signed-off-by: Krzysztof Chmielewski Co-authored-by: Krzysztof Chmielewski commit eaab5866bb7d2c9240fb97c09fc48925a25510da Author: kristoffSC Date: Tue Jul 12 16:25:28 2022 +0200 Flink Sink [PR19 Source plan] - Adding a test that verifies if Flink Delta Source created Delta checkpoint (#387) * FlinkDeltaSource_PR19_Flink_Sink_integration_checkpoint - WIP test for DeltaSink Delta Checkpoint Signed-off-by: Krzysztof Chmielewski * 
FlinkDeltaSource_PR19_Flink_Sink_integration_checkpoint - Test to verify if Flink Delta Sink produces Delta's checkpoints. Signed-off-by: Krzysztof Chmielewski Co-authored-by: Krzysztof Chmielewski commit 43aaea1b44d810e4762db21fffc415b06e71ddf2 Author: Shixiong Zhu Date: Mon Jul 11 11:12:26 2022 -0700 Fix unshaded files in delta-standalone jar (#391) Currently the delta-standalone jar has a few unshaded files: ``` com/github/mjakubowski84/parquet4s/RowParquetRecord$.class com/github/mjakubowski84/parquet4s/RowParquetRecord.class com/github/mjakubowski84/parquet4s/RowParquetRecordConverter.class com/github/mjakubowski84/parquet4s/SchemaDef$.class ... shaded/parquet/net/openhft/hashing/MurmurHash_3$AsLongHashFunctionSeeded.class shaded/parquet/net/openhft/hashing/MurmurHash_3$BigEndian.class shaded/parquet/net/openhft/hashing/MurmurHash_3.class shaded/parquet/net/openhft/hashing/Primitives.class ``` This PR fixes the shading rules and also adds tests to make sure we will not add files unintentionally to delta-standalone jar. commit ab0946e739e6e4dece26e2a675ad452c8d33f8be Author: Will Jones Date: Fri Jul 8 18:50:53 2022 -0700 Clarify the serialized form of column invariants in PROTOCOL.md ## Description I've been looking into implementing support for writer protocol V2, and had trouble figuring out the expected format of column invariants. Thankfully, it was clarified in #1239. Note: the current Spark implementation does not enforce column invariants if column is not nullable (see #1239), but I believe that is a bug, so I didn't include that behavior as part of the updated protocol. I was able to verify existing behavior in this script:

Example script

```python
import pyarrow as pa
import pyspark
import pyspark.sql.types
import pyspark.sql.functions as F
import delta
from delta.tables import DeltaTable

def get_spark():
    builder = (
        pyspark.sql.SparkSession.builder.appName("MyApp")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config(
            "spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        )
    )
    return delta.configure_spark_with_delta_pip(builder).getOrCreate()

spark = get_spark()

schema = pyspark.sql.types.StructType([
    pyspark.sql.types.StructField(
        "c1",
        dataType = pyspark.sql.types.IntegerType(),
        nullable = True,
        metadata = { "delta.invariants": "{\"expression\": { \"expression\": \"c1 > 3\"} }" }
    )
])

table = DeltaTable.create(spark) \
    .tableName("testTable") \
    .addColumns(schema) \
    .execute()

# This now fails
spark.createDataFrame([(2,)], schema=schema).write.saveAsTable(
    "testTable",
    mode="append",
    format="delta",
)
```
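The hand-escaped `delta.invariants` string in the script above can also be produced with `json.dumps`, which avoids quoting mistakes. This is a minimal sketch, assuming only the nested `{"expression": {"expression": ...}}` shape that the script itself uses; the helper name `invariant_metadata` is hypothetical and not part of any Delta API.

```python
import json


def invariant_metadata(sql_expression: str) -> dict:
    """Build column metadata carrying a column invariant.

    Hypothetical helper: the nested shape is copied from the example
    script above, not taken from a Delta library call.
    """
    payload = {"expression": {"expression": sql_expression}}
    # The metadata value is a JSON document serialized to a string.
    return {"delta.invariants": json.dumps(payload)}


metadata = invariant_metadata("c1 > 3")

# Round-trip the string to confirm it is valid JSON with the expected nesting.
decoded = json.loads(metadata["delta.invariants"])
print(decoded["expression"]["expression"])
```

The resulting dictionary could be passed as the `metadata` argument of the `StructField` in the script, in place of the escaped literal.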
## Does this PR introduce _any_ user-facing changes? No, this just documents the existing behavior as the protocol. Closes delta-io/delta#1241 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 6bb0aaa481711aab0362bffa38432607e11c3ab9 commit 892ed27f9df9c1b42c96141c97e3a4569fd248c0 Author: Jintian Liang Date: Fri Jul 8 16:36:59 2022 -0700 Improve error messages for char/varchar type length limitation check in delta check constraints Improved the error message for varchar type length checks by including the maximum allowed length as well as the expression being evaluated. This should give the end user a better idea of where the error occurred. Modified an existing unit test and verified both the expected max length as well as the expression being evaluated both show up in the error message. GitOrigin-RevId: 7c36df8bdc89d7a62b6c5057723dc46322e6cfb9 commit 60362325da66e8bf605b0d608cbaf1a2bbbcfd52 Author: Kam Cheung Ting Date: Fri Jul 8 11:36:26 2022 -0700 Add metricis for cluster parallelism in optimize metrics GitOrigin-RevId: 76e3828bce188278d8df8dd92e382a4edcd35548 commit e3a7e37f61d0494cc2fb6673067b1acd007729bd Author: Jiawei Bao <1041291059@qq.com> Date: Fri Jul 8 10:33:57 2022 -0700 Support `SHOW COLUMNS` command in Delta SQL Resolves delta-io/delta#1027 . ``` SHOW COLUMNS (FROM | IN) table_identifier [(FROM | IN) database]; ``` Compared with [Spark SQL syntax](https://spark.apache.org/docs/3.0.0/sql-ref-syntax-aux-show-columns.html), this command added the support of representing the table by file path. The Delta command `Describe Detail` adds the similar support path based table extension to Apache Spark. ``` SHOW COLUMNS (FROM | IN) ${schema_name}.${table_name} SHOW COLUMNS (FROM | IN) ${table_name} (FROM | IN) ${schema_name} ``` This feature was tested with 8 cases. Including: - Delta table and non-Delta table. - Tables with wrong table identity. - Tables represented by separated schema name. And some other edge cases. 
See [ShowTableColumnsSuite.scala](https://github.com/6a0juu/delta/blob/1f77fae9dce98441dee43eade932d985272b41be/core/src/test/scala/org/apache/spark/sql/delta/ShowTableColumnsSuite.scala) for details. Yes. Before this PR, when making `SHOW COLUMNS` query, like: ``` spark.sql(s"SHOW COLUMNS IN delta.`test_table`").show() ``` It returns: ``` org.apache.spark.sql.AnalysisException: SHOW COLUMNS is not supported for v2 tables. ``` But with this PR, the output would be like: ``` +----------+ | col_name| +----------+ | column1| | column2| +----------+ ``` Closes delta-io/delta#1203 Signed-off-by: Jiawei Bao GitOrigin-RevId: f68947004bced59a4fcbce693b462604df63a39e commit b60f6e98fd8db3d50b9a5e9b5e3e8da0af138d2a Author: Prakhar Jain Date: Fri Jul 8 08:44:35 2022 -0700 Minor refactoring GitOrigin-RevId: b6101a1256987225b2af46bc945de4d82b8bd805 commit de1e5c585f9f346c81dbd462c1cc6ccda6cf9055 Author: Prakhar Jain Date: Fri Jul 8 01:18:29 2022 -0700 Minor refactoring GitOrigin-RevId: 4542e6100f349d1f8d1ea8453945466a57557e29 commit ed1793642856c6b236f96c2a3ee55cca88c5ef8b Author: Serge Rielau Date: Fri Jul 8 05:58:17 2022 +0000 Minor test improvements Authored-by: Serge Rielau GitOrigin-RevId: 19dc8a3cf07544563559cfca313c810110cb2451 commit 4ef57bf77c22050ff4a78fe9852b99cf5fc7aaac Author: Junlin Zeng Date: Wed Jul 6 13:27:17 2022 -0700 Minor refactoring GitOrigin-RevId: db3c92d41729d533b9ce25f4c5bb3b00b8771978 commit 8b73fee3aeade22bb848315305e35c56c8a280c4 Author: Karen Feng Date: Sat Jul 2 10:39:10 2022 +0200 Minor refactoring GitOrigin-RevId: e84050099d6ee97d67c60a8510d33ad910f005a3 GitOrigin-RevId: 5e3731b97a1765b7639428658af3dec0aa8f6bb6 commit 8d38708efd13fbd93bc0968c5a0268188ad90790 Author: Juliusz Sompolski Date: Sat Jul 2 02:03:37 2022 +0200 Minor refactoring in tests commit 9c790ef019712c4a9e5b8b2cb4e25cef5f216a17 Author: Jackie Zhang Date: Thu Jun 30 23:14:32 2022 -0700 DROP COLUMN test with path based table Just tests. 
GitOrigin-RevId: 0cd4fae023279aab59653cb561cacd82370d3d7f commit f1fa49c2f3ea99481b9f66a010628dd646765df8 Author: Terry Kim Date: Thu Jun 30 18:36:48 2022 -0700 Minor refactoring in data skipping tests GitOrigin-RevId: aa389dae193b6aaf8f5cabf0d89349b3c9bc733d commit 06569d54de2487b1aeefd7b79a6a068460bb07f3 Author: Venki Korukanti Date: Thu Jun 30 09:46:44 2022 -0700 [Delta] Handle the last version in previous major release Update the logic in finding the last version in previous major release. This is in preparation for the Delta 2.0.0 manually tested GitOrigin-RevId: ef0fbcf19c7ed5b758a4cafae53327f0a42615b9 commit f0e29b9971fb1a152f9cae8bceabda35aaea684f Author: Venki Korukanti Date: Thu Jun 30 09:19:25 2022 -0700 Fix the integration test `image_storage.py` ## Description The integration test `image_storage.py` requires `delta-spark` package setup, but currently is run as part of the no-python setup. Move this test to `delta-spark` python package setup loop. Also update the test to use the staging Maven artifact if given. Ran the integration test locally. Closes delta-io/delta#1237 Signed-off-by: Venki Korukanti GitOrigin-RevId: 3a972f241dc7627fd21d461ce86c52b9d15b3c2e commit 7afc547efb128e020006bc085c1c58f61899bd81 Author: Hussein Nagree Date: Wed Jun 29 13:20:17 2022 -0700 Reduce unsafe snapshot usage in Delta GitOrigin-RevId: cbf2852816cebc29596ef76e1e9df5800aa105fa commit 88d73f0a8f9635fd47d74f2fada1ce97ae5e0398 Author: Lars Kroll Date: Wed Jun 29 15:42:13 2022 +0200 Minor refactoring in DELETE test suite GitOrigin-RevId: de6f1ae1a4a3dd2683c991896ca77e35866407a9 commit 1905a58c7dfff5034a780f63890a8f19a3c8385d Author: Venki Korukanti Date: Tue Jun 28 22:41:49 2022 -0700 [Delta] Update the API `since` version to 2.0.0 from 1.3.0 Currently the new APIs added after Delta 1.2 have `@since` version as 1.3. Change it to 2.0. 
N/A GitOrigin-RevId: b6395ad15faba230940fc237cc5bf561e0c45c51 commit 5842b5f300688f83d3bc4adbc9637c21af80a583 Author: Kam Cheung Ting Date: Tue Jun 28 17:26:50 2022 -0700 Avoid Persisting NumRecord of RemoveFile to Checkpoint This PR avoids persisting NumRecord of RemoveFile action to checkpoint by removing this attribute from the constructor of RemoveFile object. Resolves delta-io/delta#1229. This PR ensures that `RemoveFile.numRecords` field is not written out to the delta checkpoint. We write out a checkpoint, and read it back as parquet, and ensure that its schema does not contain `numRecords`. No. Closes delta-io/delta#1230. GitOrigin-RevId: 518e46c0622cca4277729e9e6e7ebb08452619f3 commit d4b70a8a43508131e0d66c63faa5816a3905d40f Author: Xi Liang Date: Tue Jun 28 15:57:31 2022 -0700 Minor refactoring GitOrigin-RevId: b5ed991cfafb37da0fdef00b02073c7b50d4963d commit 4442be5a8816a83c3f2ba31c9df8e8c4d660e83c Author: Ryan Johnson Date: Tue Jun 28 13:57:27 2022 -0700 Minor refactoring GitOrigin-RevId: 5036a99fd6d5375420140074b3181e52259f55c7 commit 93187ed03b79a350eac82c63cab26599c460fb03 Author: Rui Wang Date: Tue Jun 28 01:28:43 2022 +0000 Minor refactoring Authored-by: Rui Wang GitOrigin-RevId: 8ff814108705a35dbe5c0f1cc5d06497b7c1914c commit fadefea91a41dbb87b49e352f261054d306858d3 Author: Ole Sasse Date: Sat Jun 25 17:59:45 2022 +0200 Consistent timestamps in MergeIntoCommand Transform timestamps for MergeIntoCommand during PreprocessTableMerge. 
GitOrigin-RevId: ddcb2ecb5a04a2887997d117c2049555eb8816e5 commit 6b7802a478e477f07f93704f16a30a84314b4aa9 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Mon Jul 11 11:33:16 2022 -0400 Update flink `README.md` (#384) * remove java boilerplate from flink readme examples; fix indentation * add time travel examples; add table version info to example descriptions * re-arrange examples and metrics to per-source and per-sink sections * add table of contents * remove more imports * remove unused tag * update flink source supported versions * update flink sink connector version in flink/README.md commit c91813d7f43ec4abc5c46a503cbe3e0ed16c87e5 Author: Shixiong Zhu Date: Fri Jul 8 11:11:00 2022 -0700 Checkpoint write should use DeltaLog.hadoopConf (#388) We should pass DeltaLog.hadoopConf to parquet writer so that it will use the correct hadoop conf. commit d6a9a13ce53b320180127732e8d3286620ed44c8 Author: Grzegorz Kołakowski Date: Fri Jul 1 19:10:15 2022 +0200 [362] Shade parquet dependencies (#382) This PR solves org.apache.parquet.io.InvalidRecordException from https://github.com/delta-io/connectors/issues/362 . * upgrade parquet4s-core to 1.9.4 and parquet-hadoop to 1.12.0 * shade org.apache.parquet.* packages commit 67c9196b086846adc6865bdcf05c9cbcd4436a57 Author: kristoffSC Date: Mon Jun 27 21:11:36 2022 +0200 FlinkDeltaSource_PR_17_Update_Examples (#381) * FlinkDeltaSource_PR_17_Update_Examples - Delta Source Bounded mode example. 
Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_17_Update_Examples - Delta Source Examples Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_17_Update_Examples - Delta Source Examples Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_17_Update_Examples - Delta Source Examples Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_17_Update_Examples - Delta Source Examples Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_17_Update_Examples - Delta Source Examples Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_17_Update_Examples - Delta Source Examples Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_17_Update_Examples - Delta Source Examples Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_17_Update_Examples - Delta Source Examples Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_17_Update_Examples - Delta Source Examples Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_17_Update_Examples - Delta Source Examples Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_17_Update_Examples - Delta Source Examples Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_17_Update_Examples - Changes after code review Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_17_Update_Examples - Changes after code review Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_17_Update_Examples - changes after code review Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_17_Update_Examples - changes after code review Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_17_Update_Examples - changes after code review Signed-off-by: Krzysztof Chmielewski Co-authored-by: Krzysztof Chmielewski commit 5d3d73fe714f47bbe30e0414a8f9132000d8932c Author: Allison Portis Date: Fri Jun 24 11:03:42 2022 -0700 Add a SQL conf to allow arbitrary table properties (disable checks) Adds SQL conf `ALLOW_ARBITRARY_TABLE_PROPERTIES` that disables our enforcement of valid table 
properties. This added as a solution to https://github.com/delta-io/delta/issues/1129. Adds tests to `DeltaConfigSuite` Closes delta-io/delta#1234 GitOrigin-RevId: d9b5c19dbfc197e6b4805595ace073c2e1e09560 commit 3f3be4663f263b465f0e26bf822bac17b09e7a6d Author: Venki Korukanti Date: Thu Jun 23 16:37:19 2022 -0700 Block streaming reads when column mapping is enabled Column mapping enables schema changes such as rename a column and drop a column. These schema change operations have undefined behavior in streaming. As a proactive step we are blocking streaming reads from a Delta table that has column mapping enabled. Added UTs Closes delta-io/delta#1224 GitOrigin-RevId: 84f1781b664bce115a4849567cea67056ca538b9 commit 0d594875e151d9a898a86fe77d23faf1e26997c3 Author: lzlfred Date: Thu Jun 23 15:30:52 2022 -0700 [Delta] stats collector: directly build the stats schema from table schema Instead of using analyzer to get the stats schema, this PR directly builds the stats schema from table schema. GitOrigin-RevId: e4dd4606c4f469d8762dbabf2e6083257580b299 commit 6d611639de180d866d7d5947b7882c8aa0449bc4 Author: Scott Sandre Date: Thu Jun 23 13:47:10 2022 -0700 Block CDC + column mapping batch/streaming reads This PR blocks CDF read queries (batch, streaming) in Delta Lake whenever column mapping (CM) is enabled. We do this because CDF + CM semantics are currently undefined. Adds a new unit test and covers cases of - batch read with CM, CDF disabled -> okay - stream read with CM, CDF disabled -> okay - batch read with CM, CDF enabled -> blocked - stream read with CM, CDF enabled -> blocked Closes delta-io/delta#1228. GitOrigin-RevId: ca9cfac39f6ca160091cc3a1d7727ee0f784d044 commit abb171c8401200e7772b27e3be6ea8682528ac72 Author: Scott Sandre Date: Thu Jun 23 08:24:20 2022 -0700 Block writes that contain CDF + Drop/Rename + FileActions This PR block commits if 1. table has CDC enabled and there are `FileActions` to write 2. 
table has column mapping enabled and there is a column mapping related metadata action (e.g. DROP, RENAME, upgrade) We do this because the current semantics are undefined. So, we block it for now, and can unblock it in the future once we define the expected output. Please note: under current public APIs, this scenario is not possible. E.g. during a DROP or RENAME operation, only metadata is changed, so no `FileAction`s are committed. Nonetheless, we want to future-proof this. This block occurs during `OptimisticTransactionImpl::prepareCommit` At a high level, this PR solves the given problem by solving 3 smaller problems - what is the mapping of physical -> logical names for a given schema? For this we add `DeltaColumnMapping::getPhysicalNameFieldMap` - given a new schema and an existing schema, detect a drop/rename column operation using the map from 1. For this we add `DeltaColumnMapping::isDropColumnOperation` and `DeltaColumnMapping::isRenameColumnOperation` - Using the helper methods from 2, implement the desired blocking algorithm. For this we add `OptimisticTransaction::performCdcColumnMappingCheck`. For each of the 3 sub-problems described above, we add a unit test to DeltaColumnMappingSuite. Closes delta-io/delta#1211 GitOrigin-RevId: 5ea6bc8c6271d2d82b79c8cf462a510d9607546a commit 6328adf09e1c9645230ce9abd8eacac292887342 Author: Chang Yong Lik Date: Thu Jun 23 07:21:38 2022 -0700 Python APIs for OPTIMIZE ZORDER BY ## Description Resolves https://github.com/delta-io/delta/issues/1185 This PR replaces https://github.com/delta-io/delta/pull/1213 which had the initial code. Added Python API for executeZOrderBy in `DeltaOptimizeBuilder` Added `test_optimize_zorder_by` and `test_optimize_zorder_by_w_partition_filter` to `test_deltatable.py` ## Does this PR introduce _any_ user-facing changes?
New API for Python executeZOrderBy Closes delta-io/delta#1226 Co-authored-by: Chang Yong Lik Co-authored-by: Venki Korukanti Signed-off-by: Venki Korukanti GitOrigin-RevId: 235edf246c7e3f545008a1e45fdd4ab3c6d22074 commit ef45b6eb52575da649b9b0e2a8f620e3ad7e57f0 Author: Allison Portis Date: Wed Jun 22 22:20:38 2022 -0700 [Delta] Enable replaceWhere when DPO (dynamic partition overwrite) is enabled in the Spark configuration This PR changes how `replaceWhere` and DPO interact with each other. Current behavior: - DPO + replaceWhere always throws an error Behavior in this PR: - DPO in spark session configuration + replaceWhere = data overwritten according to replaceWhere - DPO as DataFrameWriter option + replaceWhere = throw error Updates and adds a test to `DeltaSuite`. Closes delta-io/delta#1214 GitOrigin-RevId: 2940a671dae60935ea4b8016a8f527b92e532d4a commit d96cfea307fcb9cc3e763b7fb0fef0627bf37465 Author: Zach Schuermann Date: Wed Jun 22 22:37:26 2022 -0600 Minor refactoring in `DescribeDeltaDetailSuite` GitOrigin-RevId: b0f3d3d41f5e2c27d02d700633c76804a5f8b535 commit f321cae21c12f5f58da9e380e677ab9ec267fe4f Author: Allison Portis Date: Wed Jun 22 18:16:38 2022 -0700 Update accepted hadoop configuration keys for setting LogStore class (#370) commit fdaeb68759f8bd7394f11a5c840f8b54b55bbb17 Author: Lin Zhou Date: Wed Jun 22 10:43:56 2022 -0700 Minor refactor to DeltaCatalog GitOrigin-RevId: f38e16d9f243fb57ee7d00fe2944ac0d40fd1fd9 commit bc6e678922ccd620e431e99dbc015ecbeebec205 Author: Jintian Liang Date: Wed Jun 22 09:22:06 2022 -0700 Refactor code in OptimisticTransaction and DeltaCommand GitOrigin-RevId: ae73596bf98c0d2f0f6133f25c313937d3bbfcff commit b5bf70ede5f2b7e07b81d72691e65e77be115c97 Author: Allison Portis Date: Tue Jun 21 20:10:17 2022 -0700 Adds RESTORE + CDF test Adds test for CDF output with RESTORE Adds a test Closes delta-io/delta#1212 GitOrigin-RevId: 421b6618a33f9d4814fe23778224a08325bd34e3 commit 6663deb1eac34438c7e6f15d679677d97fc61dd8 Author: 
Jonas Irgens Kylling Date: Tue Jun 21 12:03:27 2022 -0700 Set scalac and javac source version for all sbt subprojects ## Description Some of the sbt subprojects are built using the source version of the java version used to compile the project. This creates class files with versions greater than 52.0, which will fail to run on JVM 1.8, which is typically used to run Spark. We fix this by setting the scalac and javac source version explicitly in each subproject. I've tested that the updated configuration works by running ``` delta$ build/sbt clean package delta$ find . -type f -name "*class" | xargs file -b | sort | uniq compiled Java class data, version 52.0 (Java 1.8) ``` ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1190 Signed-off-by: Scott Sandre GitOrigin-RevId: fcbda2f3c13cc751113650b1ce68c3a2518f3a84 commit 752145a63e405fe7c6e0f2a04ec25e69ea842377 Author: Allison Portis Date: Sat Jun 18 12:13:36 2022 -0700 Fix error class format Fixes formatting in `delta-error-classes.json`. GitOrigin-RevId: 1c61e9cb117187dadcf550854d59f1b9568790cb commit 0141dfeb86aee6601520993b618c88a68784620a Author: Allison Portis Date: Fri Jun 17 16:01:04 2022 -0700 Adds test for DPO (dynamic partition overwrite) + CDF Adds a test for DPO + CDF. Also adds a CDF check to `WriteIntoDelta`. 
Closes delta-io/delta#1209 GitOrigin-RevId: 265cbb3f0f285cb49cb37bb82a5067edb0b6910e commit 5d06507d3a034bf8c5069a3b1223b985314d7cd6 Author: Lars Kroll Date: Fri Jun 17 09:14:27 2022 +0200 Minor refactor to Snapshot and DataSkippingReader GitOrigin-RevId: c34dcac9ab316522d033e888ff1f076faf3193d0 commit a02672c37ba986eb00208138e88d205c22c1b2d1 Author: Terry Kim Date: Thu Jun 16 20:43:56 2022 -0700 Refactor DataSkippingReader GitOrigin-RevId: 01ae777b9ee6a9b81d80a1d105ecdeb42f8aa381 commit 5265f7dbe0e039a8f5f3f704b1e014f3bfa76f3c Author: Karen Feng Date: Fri Jun 17 02:44:34 2022 +0000 Improve newline formatting for error messages Error messages in the JSON file should not contain newline characters; newlines are delineated as different elements in the array. This PR: - Checks that newline characters do not exist, and improves the formatting of the JSON file so that each array element is on a new line - Checks that messages are trimmed, and improves the formatting of the JSON file by adding spaces after all newlines and between error class messages/subclass messages in the code - Introduces an environment variable to generate a formatted file: `SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "core/testOnly *DeltaThrowableSuite"` Author: Karen Feng Author: Takuya UESHIN Author: Wenchen Fan Author: Wenchen Fan GitOrigin-RevId: 6e6c571b08600cec41a9294f6ac14cbc7c7fb5e8 commit 0284f7b3b5f023fb8408e207e5ab5ebbabb0c3a5 Author: Tyson Condie Date: Thu Jun 16 19:20:57 2022 -0700 Minor refactor to DeltaTable and DeltaTableV2 GitOrigin-RevId: 74207506e58d5cd499b519801bce96e77e41e4a1 commit ff69130458bb7f61b42a58205d7a95d30fe99b31 Author: Jackie Zhang Date: Thu Jun 16 17:24:06 2022 -0700 Miscellaneous minor refactors GitOrigin-RevId: 889a55ae8af2ece4b43a2f88b6b36cba22e105ec commit 75cb021a9017ef7ad9a9a9d41d03be1e2af81d84 Author: Scott Sandre Date: Thu Jun 16 15:28:11 2022 -0700 Add better error message when delta-storage dependency can't be found ## Description Resolves #1199 This PR 
checks that classes within the delta-storage dependency can be properly found. If not, we throw an error with more information and a link on how to solve this problem. Wrote an integration test. In this test, we run a Python file using `spark-submit` by passing in the `delta-core` JAR instead of the package. This is not the suggested way to do this. But by doing it this way, we know that the `delta-storage` JAR will be missing, and that we can expect our nice error message to be thrown. You can execute this test using: `python3 run-integration-tests.py --python-only --use-local` Closes delta-io/delta#1200 Signed-off-by: Scott Sandre GitOrigin-RevId: 1f71bf638fa04dc5241fed28d6b4d843d608f9fc commit eff5163865eb718e12671133a18798028b6d492b Author: Scott Sandre Date: Thu Jun 16 15:27:45 2022 -0700 Tests for operation metrics for CDF + Merge command This PR adds a new test suite skeleton and 1 concrete test. It adds a new test suite trait `MergeIntoMetricsBase` which is extended only by `trait DescribeDeltaHistorySuiteBase`. `MergeIntoMetricsBase` is only the skeleton of a test suite: it adds all of the necessary helper methods, but doesn't actually implement all of the tests. Instead, it adds only 1 test. We will add more tests in a follow-up PR.
GitOrigin-RevId: a0557c9c239d1de8e209e2e6461bb9959f85860e commit 171badd6eab71e560748084d07a8235ed4d6f3a2 Author: Rajesh Parangi Date: Thu Jun 16 10:25:38 2022 -0700 Minor refactor to DeltaTableCreationTests GitOrigin-RevId: 330de22b6b637075cfd1859a77fdba1e0aebe08d commit 352d2f144628f6c6a9de27045114981dd7ce1072 Author: Felipe Pessoto Date: Wed Jun 15 16:58:13 2022 -0700 Remove old comment Fix https://github.com/delta-io/delta/issues/1204 No, no code change No Closes delta-io/delta#1207 Signed-off-by: Venki Korukanti GitOrigin-RevId: bbf167d147d9bb60aa4b59004506f61af68985c8 commit 76cda893491243217bba60f4791ceb5aaa7cffb2 Author: Gerhard Brueckl Date: Wed Jun 22 23:43:57 2022 +0200 Feature-update Power BI connector (#369) - Support for File Pruning using file stats - Support all simple and complex data types (struct, map, array, ...) - Added shortcut to read `COUNT` from `_delta_log` directly if possible commit 29a6444dd73d09eec3a3d123384a241ea290794c Author: kristoffSC Date: Wed Jun 22 02:02:45 2022 +0200 FlinkDeltaSource_PR_16_RAEDME - README.md update (#379) * FlinkDeltaSource_PR_16_RAEDME - README.md update Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_16_RAEDME - Changes after code review. Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_16_RAEDME - Changes after code review. Signed-off-by: Krzysztof Chmielewski Co-authored-by: Krzysztof Chmielewski commit 93a42b74c10f3554603e37bd1c28711cf09560fa Author: kristoffSC Date: Tue Jun 21 19:38:53 2022 +0200 FlinkDeltaSource_PR_15 - Table remote file path test for Flink source (#378) * FlinkDeltaSource_PR_15_IT_test_case_for_remote_file_table_path - use schema full path in Source Execution tests. Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_15_IT_test_case_for_remote_file_table_path - merge from master + retry on end to end tests. 
Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_15_IT_test_case_for_remote_file_table_path - fix checkstyle Signed-off-by: Krzysztof Chmielewski Co-authored-by: Krzysztof Chmielewski commit 098a053a608bfb08fb7ed40fd35456174b1c6f3a Author: kristoffSC Date: Fri Jun 17 20:21:15 2022 +0200 FlinkDeltaSource_PR_14_IT_Tests - Add Execution IT tests for Delta Source (#375) * TestUtilCleanup - Move few methods from Flink sink test utils to common Flink connector tests utils. Signed-off-by: Krzysztof Chmielewski * TestUtilCleanup - #2 Move few methods from Flink sink test utils to common Flink connector tests utils. Signed-off-by: Krzysztof Chmielewski * TestUtilCleanup - #3 Move few methods from Flink sink test utils to common Flink connector tests utils. Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_14_IT_Tests - chery pick changes to Test Utils -> pull up Row Type as an argument. Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_14_IT_Tests - end2end WIP test Signed-off-by: Krzysztof Chmielewski * TestUtilCleanUp - Changes after Review Signed-off-by: Krzysztof Chmielewski * TestUtilCleanUp - new changes Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_14_IT_Tests - tests. Signed-off-by: Krzysztof Chmielewski * TestUtilCleanUp - more refactoring Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_14_IT_Tests - end to end tests Signed-off-by: Krzysztof Chmielewski * TestUtilCleanUp - additional refactoring Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_14_IT_Tests - end2end test for unbounded stream with updates. Signed-off-by: Krzysztof Chmielewski * TestUtilCleanUp - Merge From master + remove source partition table. Add log4j property file for tests. Signed-off-by: Krzysztof Chmielewski * TestUtilCleanUp - Merge From master + remove source partition table. Add log4j property file for tests. Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_14_IT_Tests - end to end test with reading/writing all types. 
Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_14_IT_Tests - merge from master Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSorce_SupressFLinkLogs_FixNPE - Set Log level for Flink to ERROR, fix NPE in logs for a couple of tests. Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSorce_SupressFLinkLogs_FixNPE - repeat failed integration test Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_14_IT_Tests - Add Source Execution test to read all data types. Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_14_IT_Tests - Add Source Execution test for options. Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_14_IT_Tests - Changes after merge from master. Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_14_IT_Tests - Changes after merge from master. Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_14_IT_Tests - Changes after Code Review Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_14_IT_Tests - change Delta Log last modification time attribute for tests. Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_14_IT_Tests - changes after code review. Signed-off-by: Krzysztof Chmielewski Co-authored-by: Krzysztof Chmielewski commit 73ca6fcea0a25f302ee655f9849f86832bbe5f23 Author: Allison Portis Date: Wed Jun 15 12:49:03 2022 -0700 [Delta] Add feature flag for dynamic partition overwrite This pull request adds a feature flag `DYNAMIC_PARTITION_OVERWRITE_ENABLED` for dynamic partition overwrite mode. Adds tests to `DeltaSuite` and `DeltaOptionsSuite` to check when it's disabled. Closes delta-io/delta#1208 GitOrigin-RevId: 0a57c6c3ecc28ecf71c57b631be3fed0d71cd440 commit 7060710fc67acbefc5c987258829181295168834 Author: Shixiong Zhu Date: Tue Jun 14 21:31:16 2022 -0700 Create `predicateToken` to workaround an ANTLR issue to fix Optimize command parsing Fixes delta-io/delta#1205 It's unclear which ANTLR issue we are hitting. But creating a new `predicateToken` fix it. 
The new added test No Closes delta-io/delta#1206 Signed-off-by: Venki Korukanti GitOrigin-RevId: 60fea7c93473d4f80f05d215cf2697192d5dfbea commit 2c0c940e44cfff0ee59243aa1841a38de5597087 Author: Ivan Sadikov Date: Wed Jun 15 15:58:54 2022 +1200 Improve the exception verification in DeltaSourceSuite GitOrigin-RevId: 4cd6ad2b8f778107f6e4bf1beecc89546a5c873b commit 31d6b32388dcecf50d59bbac7201202c9c6bcbea Author: Rahul Shivu Mahadev Date: Tue Jun 14 16:43:15 2022 -0700 Minor improvement to table cleanup in `withTable` test helper method GitOrigin-RevId: 6cac3550adc18b34963e53dbd8779ee089900e10 commit d50d4f276cdf71ddef7214fea279693f7c3da6ab Author: Lukas Rupprecht Date: Tue Jun 14 14:43:45 2022 -0700 Minor refactoring in Delta command implementations GitOrigin-RevId: 0add86e1af153e7ce5f55720f2ff3b422532c685 commit bd8c7657633917c7c1bf5793c6553fe719c5e0f5 Author: jintao shen Date: Tue Jun 14 14:40:33 2022 -0700 Enhance CDF start timestamp support to handle out-of-range timestamp in stream workload This change follows up on 786abc7e34f8732a915852310360f3c811cb77ec and handles an out-of-range start timestamp in streaming workloads. Without this change, a streaming workload with a start timestamp beyond the latest commit version throws an error. The changed behavior is guarded by the config DeltaSQLConf.DELTA_CDF_ALLOW_OUT_OF_RANGE_TIMESTAMP, which is disabled by default, so it only takes effect when the user opts in. The key change in this patch modifies `getStartingVersion` in DeltaSource.scala so that it does not throw an exception when the config is enabled.
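
The opt-in behaviour described above can be sketched roughly as follows. This is a minimal, plain-Python illustration, not the code in DeltaSource.scala: `starting_version`, `TimestampOutOfRange`, and the `allow_out_of_range` flag are hypothetical names, and the value returned for an out-of-range timestamp (the version after the latest commit) is an assumption, since the commit message only states that no exception is thrown.

```python
from typing import List, Tuple


class TimestampOutOfRange(Exception):
    """Hypothetical error for a start timestamp past the latest commit."""


def starting_version(
    commits: List[Tuple[int, int]],
    start_ts: int,
    allow_out_of_range: bool = False,
) -> int:
    """Return the first version whose commit timestamp is >= start_ts.

    `commits` is a non-empty list of (version, timestamp) pairs sorted by version.
    """
    for version, ts in commits:
        if ts >= start_ts:
            return version
    if not allow_out_of_range:
        # Default behaviour described in the commit message: fail loudly.
        raise TimestampOutOfRange(f"start timestamp {start_ts} is after the latest commit")
    # ASSUMPTION: when the user opts in, start right after the latest commit,
    # so the stream only picks up commits that arrive later.
    return commits[-1][0] + 1


commits = [(0, 100), (1, 200), (2, 300)]
print(starting_version(commits, 150))
print(starting_version(commits, 400, allow_out_of_range=True))
```

The flag defaults to off, mirroring the statement that the config is disabled unless the user opts in.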
Added a new unit test to verify the change GitOrigin-RevId: 75a9c3fb1a31d9324f6f0c428d77c5e343ccfe90 commit 3bb3119fc07c730b1a06926d6fc031df17dc60f9 Author: Andrew Vine Date: Tue Jun 14 12:37:40 2022 -0700 Remove unused class AlterTableReplaceColumnsDeltaCommand Remove unused class `AlterTableReplaceColumnsDeltaCommand` Resolves delta-io/delta#1189 Closes delta-io/delta#1194 Co-authored-by: Andrew Vine Signed-off-by: Venki Korukanti ORIGINAL_AUTHOR=Andrew > GitOrigin-RevId: d3853b89dd0903fadea2ec247a1fa522e8214dfd commit 569bc07ba786ba418fbadcb402174ed2644ef44a Author: Hussein Nagree Date: Tue Jun 14 11:40:32 2022 -0700 [Delta] Reduce unsafe snapshot usage in DeltaLog Remove usages of `deltaLog.snapshot` from the codebase. For this PR, fixed all the instances where an alternate source of the snapshot was easily available. Ran existing tests GitOrigin-RevId: cfe4d4f138fdd16a2df309ae1c6c1b80f4dd1aab commit 0196bee085875b02596f7d606913ab2455e2f5f6 Author: kristoffSC Date: Wed Jun 15 17:01:06 2022 +0200 FlinkDeltaSorce_SupressFLinkLogs_FixNPE - Log Level for Flink classes (#376) * FlinkDeltaSorce_SupressFLinkLogs_FixNPE - Set Log level for Flink to ERROR, fix NPE in logs for a couple of tests. Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSorce_SupressFLinkLogs_FixNPE - repeat failed integration test Signed-off-by: Krzysztof Chmielewski commit 5f58b96f38dbdd1e12cae35bcd8b1d4753d572ac Author: kristoffSC Date: Tue Jun 14 18:13:13 2022 +0200 Flink connector Test Util Cleanup - Move few methods from Flink sink test utils and ExecutionITCaseBase class to common util class (#372) * TestUtilCleanup - Move few methods from Flink sink test utils to common Flink connector tests utils. Signed-off-by: Krzysztof Chmielewski * TestUtilCleanup - #2 Move few methods from Flink sink test utils to common Flink connector tests utils. Signed-off-by: Krzysztof Chmielewski * TestUtilCleanup - #3 Move few methods from Flink sink test utils to common Flink connector tests utils. 
Signed-off-by: Krzysztof Chmielewski * TestUtilCleanUp - Changes after Review Signed-off-by: Krzysztof Chmielewski * TestUtilCleanUp - new changes Signed-off-by: Krzysztof Chmielewski * TestUtilCleanUp - more refactoring Signed-off-by: Krzysztof Chmielewski * TestUtilCleanUp - additional refactoring Signed-off-by: Krzysztof Chmielewski * TestUtilCleanUp - Merge From master + remove source partition table. Add log4j property file for tests. Signed-off-by: Krzysztof Chmielewski * TestUtilCleanUp - Merge From master + remove source partition table. Add log4j property file for tests. Signed-off-by: Krzysztof Chmielewski Co-authored-by: Krzysztof Chmielewski commit 71ad22370e6e4f3de932aa6ce2d4268c422d9d05 Author: Andreas Chatzistergiou Date: Tue Jun 14 14:20:59 2022 +0200 Minor refactoring in data skipping code GitOrigin-RevId: 6b91b1528ebdeda1e6756cc4ef71364b28b075f3 commit 9feb6d99a18624beae78c569f4a8271f849503aa Author: Karen Feng Date: Mon Jun 13 22:18:30 2022 -0700 Minor refactoring in DeltaCatalog GitOrigin-RevId: 854d1ab3026d47d7e9ba3cac77b40fb332e63df0 commit c2c0e4f20ce7adf5892b13a0ed8c6e3c87ead30d Author: Adam Binford Date: Mon Jun 13 20:59:03 2022 -0700 Add scala API for optimize zOrder ## Description Resolves https://github.com/delta-io/delta/issues/1184 Adds Scala API for optimize zorder by Updated UTs to test both SQL and Scala ## Does this PR introduce _any_ user-facing changes? New API for Scala optimize Closes delta-io/delta#1195 Signed-off-by: Venki Korukanti GitOrigin-RevId: 49411ad60be9cb7dbd79d1c1ec96bf70ed572125 commit 75de6fba66c94033999c2052fd6a932767ba2914 Author: Venki Korukanti Date: Mon Jun 13 20:57:45 2022 -0700 [ZOrderBy OSS] Add zOrderBy column names to DeltaOperation Add zOrderBy columns to DeltaOperation `Optimize`. This will help log the zOrderBy columns in Delta table history Added test for checking the Delta history once the zOrderBy command is complete Added test to verify the zOrderBy command output. 
As part of this test a bug was found where we were setting the number of zCubes constructed to 1. It should be equal to the number of partitions. Closes delta-io/delta#1197 GitOrigin-RevId: b4152a73a40d105d604685633dfe0ae8b6a50456 commit f124f7a610643b3b049124465d9d01141f88c9d4 Author: Tathagata Das Date: Mon Jun 13 19:06:25 2022 -0400 Dynamic Partition Overwrite with SQL INSERT OVERWRITE This PR enables Dynamic Partition Overwrite (DPO) to work with SQL `INSERT OVERWRITE`. The "clean" way to make partition overwrite work is for the Delta DSv2 implementation to declare that it supports the `DYNAMIC_PARTITION` capability; with its existing `BATCH_WRITE_V1` capability, Delta could then execute DPO using the v1 code paths (i.e., `WriteIntoDelta`). However, Spark currently does not allow fallbacks of dynamic overwrite (which is a surprising gap as it allows all other write fallbacks). So instead, in this PR, we handle the `DynamicPartitionOverwrite` logical plan explicitly by converting it to a command `DeltaDynamicPartitionOverwriteCommand` which will produce the `WriteIntoDelta` during execution. This is similar to how we handle MergeIntoTable logical plan -> MergeIntoCommand. However, I also added the necessary changes to make it eventually work the "clean" way. - Enabled existing dynamic partition overwrite tests with INSERT OVERWRITE - Also changed tests with `DataFrameWriterV2.overwritePartitions` because it now works, though only if the DF schema is exactly the same as the table schema Closes delta-io/delta#1187 GitOrigin-RevId: 5add8f0d1be6ef5dfb650f1877d97ee4a13041e6 commit 5a181aea0d519ed98f9b3f8700f5c0468161b9fe Author: Jackie Zhang Date: Mon Jun 13 11:31:14 2022 -0700 Fix potential false alarm caused by Delta column mapping duplicated physical name checks We also need to apply backticks to column paths with dots in them to prevent a possible false alarm in which a column `a.b` is duplicated with `a`.`b`. Manually verified that, without the code change, the added test fails.
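
To illustrate the false alarm described in the column mapping fix above, here is a minimal plain-Python sketch. It is not the Delta implementation: `quote_if_needed`, `physical_path`, and `find_duplicates` are hypothetical helpers, and representing a column as a list of name parts is an assumption made for the example.

```python
from typing import List, Sequence


def quote_if_needed(part: str) -> str:
    """Wrap a single name part in backticks when it contains a dot."""
    if "." in part:
        # Double any embedded backtick so the quoted form stays unambiguous.
        return "`" + part.replace("`", "``") + "`"
    return part


def physical_path(parts: Sequence[str]) -> str:
    """Join name parts into one dotted path, quoting parts that contain dots."""
    return ".".join(quote_if_needed(p) for p in parts)


def find_duplicates(paths: List[Sequence[str]]) -> List[str]:
    """Return rendered paths that occur more than once."""
    seen, dups = set(), []
    for parts in paths:
        rendered = physical_path(parts)
        if rendered in seen:
            dups.append(rendered)
        seen.add(rendered)
    return dups


# A top-level column literally named "a.b" versus field "b" nested inside "a".
columns = [["a.b"], ["a", "b"]]

# Without quoting, both columns render to the same string: a false duplicate.
naive = [".".join(parts) for parts in columns]
print(naive[0] == naive[1])

# With quoting, the two paths stay distinct and no duplicate is reported.
print([physical_path(parts) for parts in columns], find_duplicates(columns))
```

Quoting only the parts that contain a dot keeps ordinary nested paths unchanged while making the two cases distinguishable.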
new unit test. GitOrigin-RevId: 4148026483ed8a9c3058f96a9f04f71c9af74afc commit b49935e0e03ffe2e8884c340c577520fd458e612 Author: Scott Sandre Date: Mon Jun 13 09:37:35 2022 -0700 CDF Scala + Python integration tests This PR adds two integration tests / examples for CDF, one in Python and one in Scala. These integration tests run insert, update, delete, and merge commands on a table with CDF enabled, and then read those CDC changes. They perform both batch and streaming reads. You can run them using ``` python3 run-integration-tests.py --use-local --scala-only --test ChangeDataFeed ``` and ``` python3 run-integration-tests.py --use-local --python-only --test change_data_feed ``` Closes delta-io/delta#1176 Signed-off-by: Scott Sandre GitOrigin-RevId: 5d21546e355ff6d79b0167f47066292e5563df3b commit ffaec115bee9e9f39c6eed336cec7ec9a2b8910f Author: Christos Stavrakakis Date: Mon Jun 13 18:06:08 2022 +0200 Minor refactor in VACUUM code Refactoring commit only with no change in external behaviour. Existing tests provide coverage. GitOrigin-RevId: afe942824f297700c57343eca850935c49a7b840 commit 8e1ca4691480ffac46f9a8dd19a67e234037aa7e Author: Yijia Cui Date: Mon Jun 13 02:42:47 2022 -0700 Cast Expression first then perform array functions in Merge Schema Evolution We introduced schema evolution for ArrayType to cast each array element into the specified type. However, the fromExpression can be an array function, for example, array_union. In this case, when we apply the transformation on array elements, the array function is also evaluated for every element. To avoid this cost, we need to apply the transformation to the underlying expression first and then perform the array function. Unit tests.
GitOrigin-RevId: 11b53b6b421445f4c0fac5447e86550e28611ba7 commit 6db8d832d3079fd1b31a4175c80f4cdbadb34869 Author: Max Gekk Date: Mon Jun 13 05:55:49 2022 +0000 Minor improvement to tests Authored-by: Max Gekk Author: Max Gekk GitOrigin-RevId: d8225daca9990f78c3c6607e4575f9913fca78c1 commit 606ca9670b93b820ef781b64b93c039c36c465b0 Author: Jiaan Geng Date: Mon Jun 13 05:55:25 2022 +0000 Minor refactoring in DeltaAnalysis Authored-by: Jiaan Geng Author: Jiaan Geng Author: Linhong Liu Author: Wenchen Fan GitOrigin-RevId: 4e5db5b3787533bbab6457af084ebcf28f3a437e commit 786abc7e34f8732a915852310360f3c811cb77ec Author: jintao shen Date: Fri Jun 10 15:52:24 2022 -0700 Enhance CDF start and end timestamp support to automatically handle out-of-range timestamps Add an option that lets users opt in to not throwing an exception when a timestamp in a time travel command such as "table_changes" exceeds the latest Delta commit timestamp. Case 1: the start timestamp is greater than the latest commit timestamp: return an empty relation instead of crashing. Case 2: the end timestamp is greater than the latest commit timestamp: cap the end timestamp to the latest commit instead of crashing. Please explain how this patch was tested. E.g.
unit tests, integration tests, manual tests Add two new unit test cases in DeltaCDCSuite GitOrigin-RevId: 82fb848c9c4e68a626ac04a4037154f4295e4491 commit 6e4e38e2a869ac2e34e540043cc092af7f048861 Author: lzlfred Date: Fri Jun 10 12:35:36 2022 -0700 Minor change to DESCRIBE HISTORY tests GitOrigin-RevId: f2f1e4a16afbb6b82524be5d78ab3e80ef3e94a4 commit 399348aa7c52d679a2e1c3e5581ac49ee8ee4870 Author: lzlfred Date: Thu Jun 9 18:30:46 2022 -0700 Code import cleanup GitOrigin-RevId: f4832574bf64f81fbb0a8419c226907aca487d9b commit 3e9f609ef1a729f8ca2995dd9beacec221ec17b9 Author: Jintian Liang <105243217+jintian-liang@users.noreply.github.com> Date: Thu Jun 9 15:34:53 2022 -0700 [Delta] Refactor sixteenth set of 20 Delta error messages This change refactors a set of twenty Delta-related error messages. Existing unit tests pass and additional unit tests in `DeltaErrorsSuite` were introduced to verify the message of the refactored errors. GitOrigin-RevId: f988ae3110bd26a0339e5d0510d34e8853df3869 commit 60b3de985644f7eaaebfaa2f934112e30e675501 Author: Terry Kim Date: Wed Jun 8 15:24:48 2022 -0700 Minor refactoring GitOrigin-RevId: c48880cbb1345c9a99a089d16949d5944eeeb9b1 commit cb0883653865ee5eff2e899004114d57f8caa1e3 Author: Prakhar Jain Date: Wed Jun 8 09:18:10 2022 -0700 Minor refactorings GitOrigin-RevId: d9d6d85cf879c8e4984c3499f0ee742d6fbcfc13 commit c1c10909842c684adaa52f7e1b6c07a9fe60eed7 Author: Scott Sandre Date: Wed Jun 8 08:04:13 2022 -0700 Add tests for CDF timestamp valid and invalid formats This PR adds tests for valid and invalid timestamp formats for CDF. GitOrigin-RevId: 0ec09576ed4454d589e8acdbea8afa934bfd35d8 commit e79b13c140ee03bd94a466d75426ca945f8af665 Author: Terry Kim Date: Wed Jun 8 07:27:35 2022 -0700 Refactor DataSkippingReader code by introducing StatsProvider and updating SkippingEligibleColumn to return DataType This PR refactors the code around `DataSkippingReader` by 1. Introducing `StatsProvider` for easier APIs to access stats 2. 
Updating `SkippingEligibleColumn` to return `DataType` and to mark an expression as eligible only if it's resolved. Refactoring, so the existing test should cover the changes. GitOrigin-RevId: e4b6886fbf9234036df6c2dda0cba788a4191a6f commit fa3a8737467cfb30249582cd7388404ee2ae6e1f Author: Wenchen Fan Date: Wed Jun 8 09:49:25 2022 +0800 Minor refactoring in DeltaTable GitOrigin-RevId: f0580e2cccca2479a803907b797afca8be696a05 commit 42e023885f0f11062f480ab217b02fb6c31f1b7b Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Mon Jun 13 15:26:28 2022 -0700 [delta-standalone] Add check that schema contains partition columns (#353) * add partition cols check to OptTxn; new tests pass; 68 old existing standaloen tests fail * num failing tests reduced from 68 to 59 * num failing tests reduced from 59 to 48 * num failing tests reduced from 48 to 31 * num failing tests reduced from 31 to 25 * all delta standalone tests pass * fix compatibility tests * Fix Sink test after regenerating test/resources partitioned table (#46) * KC_fix_schema_contains_partition_cols_check - Fix Sink test after regenerating test/resource partitioned table to one that includes partition columns in Delta Schema Signed-off-by: Krzysztof Chmielewski * KC_fix_schema_contains_partition_cols_check - code cleanup. Signed-off-by: Krzysztof Chmielewski * Fix compilation issue * Update missing partitions cols algorithm and error message to show all missing cols * respond to PR feedback * fix typo * update test * Add comments to make test case more clear Co-authored-by: kristoffSC commit 9a5459d1c6fa32fdcd1240ed7a8ffc19f89db727 Author: kristoffSC Date: Sat Jun 11 01:08:08 2022 +0200 FlinkDeltaSource_PR_13 - Use Delta's new API DeltaLog::getVersionAtOrAfterTimestamp to support startingTimestamp option (#374) * FlinkDeltaSource_PR_13_getVersionAtOrAfterTimestamp - use Delta's new API deltaLog.getVersionAtOrAfterTimestamp(long) to properly support startingTimestamp option. 
Signed-off-by: Krzysztof Chmielewski * FlinkDeltaSource_PR_13_getVersionAtOrAfterTimestamp - changes after code review. Signed-off-by: Krzysztof Chmielewski Co-authored-by: Krzysztof Chmielewski commit ca5224212c54cac180127c913fbf6410b833edcd Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Fri Jun 10 12:56:57 2022 -0700 [delta-standalone] Implement "get version at time X" APIs (#346) * Add public API to DeltaLog.java * Update DeltaHistoryManager.scala * Update DeltaErrors.scala * Add implementation to DeltaLogImpl * Add tests to DeltaLogSuite * fix imports * fix mima * fix broken time travel test due to wrong exception type * Rename API (to 'Timestamp'), update API comment * Update getActiveCommitAtTime method docs * Add check for non-existent table * Add recoverability tests * update time constants * remove timeBefore/Start/After final variables, and use numeric constants instead commit a2f119bc589f352cb27f7ce02e694ac02daf1f2e Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Fri Jun 10 11:51:05 2022 -0700 [delta-standalone] Add `DataType::fromJson` API (#366) * add public from Json API to DataType.java * Added tests for positive and negative cases * fix imports commit 153993dc242adf3a774345a8746e97ceb1dfcdcc Author: kristoffSC Date: Wed Jun 8 21:58:55 2022 +0200 Flink Delta Source PR 12.1 - DeltaSource::option(...) value type conversion - bug fix and more tests (#365) * PR ColumnsFromDeltaLog - get table schema from Delta Log. Signed-off-by: Krzysztof Chmielewski * PR ColumnsFromDeltaLog - get table schema from Delta Log. Using SnapshotVersion in Enumerator factory. Signed-off-by: Krzysztof Chmielewski * PR 10.1 ColumnsFromDeltaLog_Tests - extra test for Delta table Schema discovery. Signed-off-by: Krzysztof Chmielewski * PR 11 - Partition support using Delta Log Metadata Signed-off-by: Krzysztof Chmielewski * PR 10 - Changes after code review. 
Signed-off-by: Krzysztof Chmielewski * PR 10.1 - Added Delta - Flink - Delta type conversion test. Signed-off-by: Krzysztof Chmielewski * PR 10 ColumnsFromDeltaLog - Prevent user to set internal options via source builder. Signed-off-by: Krzysztof Chmielewski * PR 10 ColumnsFromDeltaLog - Changes after code review Signed-off-by: Krzysztof Chmielewski * PR 10.1 Adding tests Signed-off-by: Krzysztof Chmielewski * PR 10.1 Adding tests Signed-off-by: Krzysztof Chmielewski * PR 11 Partition Support Signed-off-by: Krzysztof Chmielewski * PR 12 Option validation - Get BATCH_SIZE for Format builder from Source options. Add validation for option names and values. Signed-off-by: Krzysztof Chmielewski * PR 12 Option validation - Get BATCH_SIZE for Format builder from Source options. Add validation for option names and values. Signed-off-by: Krzysztof Chmielewski * PR 10.1 Changes after code review. Signed-off-by: Krzysztof Chmielewski * PR 10.1 Added tests. Signed-off-by: Krzysztof Chmielewski * PR 11 Test fix after merge Signed-off-by: Krzysztof Chmielewski * PR 10.1 cleanup. Signed-off-by: Krzysztof Chmielewski * PR 11 Fix after merge conflicts from master. Signed-off-by: Krzysztof Chmielewski * PR 11 Make RowDataFormat constructor package protected. Signed-off-by: Krzysztof Chmielewski * PR 11 Changes after Code review Signed-off-by: Krzysztof Chmielewski * PR 12 Fix compilation error after merge from base branch. Signed-off-by: Krzysztof Chmielewski * PR 12 - Validation for Inapplicable Option Used + tests. Signed-off-by: Krzysztof Chmielewski * PR 12 - Javadocs Signed-off-by: Krzysztof Chmielewski * PR 12 - Adding option type safety Signed-off-by: Krzysztof Chmielewski * PR 12 - Adding option type safety tests Signed-off-by: Krzysztof Chmielewski * PR 11 - Changes after code review. Signed-off-by: Krzysztof Chmielewski * PR 11 - Changes after code review. Signed-off-by: Krzysztof Chmielewski * PR 11 - Changes after code review. 
Signed-off-by: Krzysztof Chmielewski * PR 11 - Changes after code review. Signed-off-by: Krzysztof Chmielewski * PR 11 - Changes after code review. Signed-off-by: Krzysztof Chmielewski * PR 12 - test for options Signed-off-by: Krzysztof Chmielewski * PR 12 - Type conversion for Timestamp based options and adding tests. Signed-off-by: Krzysztof Chmielewski * PR 12 - Change TODO's from PR 12 to PR 12.1 Signed-off-by: Krzysztof Chmielewski * PR 12.1 - Option validation, throw DeltaSourceValidationException + tests. Signed-off-by: Krzysztof Chmielewski * PR 12 - changes after code review Signed-off-by: Krzysztof Chmielewski * PR 12.1 - Validation for option setting and more tests. Signed-off-by: Krzysztof Chmielewski * PR 12.1 - Validation for option setting and more tests. Signed-off-by: Krzysztof Chmielewski * PR 12.1 - Validation for set options - BugFixes, tests, changes after code review. Signed-off-by: Krzysztof Chmielewski * PR 12.1 - Changes after code review. Signed-off-by: Krzysztof Chmielewski Co-authored-by: Krzysztof Chmielewski commit 6707b29f58a30989f40478ce40b3200c48bf37ce Author: Allison Portis Date: Tue Jun 7 09:55:30 2022 -0700 Add Checkpoint + CDF test. This PR adds a test to ensure CDC fields are not included in the checkpoint file. Resolves delta-io/delta#1180 GitOrigin-RevId: d4a7b8bc4d1a79d30806ff18c5b507f6edcd964c commit 8fb49618cbd014ccc303302dc9dcc9ac55eca896 Author: koertkuipers Date: Tue Jun 7 08:59:06 2022 -0700 Fixes #348 Support Dynamic Partition Overwrite The goal of this PR to to support dynamic partition overwrite mode on writes to delta. To enable this on a per write add `.option("partitionOverwriteMode", "dynamic")`. It can also be set per sparkSession in the SQL Config using `.config("spark.sql.sources.partitionOverwriteMode", "dynamic")`. Some limitations of this pullreq: Dynamic partition overwrite mode in combination with replaceWhere is not supported. If both are set this will result in an error. 
The SQL `INSERT OVERWRITE` syntax does not yet support dynamic partition overwrite. For this more changes will be needed to be made to `org.apache.spark.sql.delta.catalog.DeltaTableV2` and related classes. Fixes delta-io/delta#348 Closes delta-io/delta#371 Signed-off-by: Allison Portis GitOrigin-RevId: 5b01e5b04e573dabe91ac2d71991a127617b8038 commit 3c15c6c2eb47f933aaf62fed4110ce3a3fbbe5d5 Author: Sabir Akhadov Date: Tue Jun 7 11:09:55 2022 +0200 Minor refactor to UpdateCommand GitOrigin-RevId: ab3fcbe0522aa194ac0781730a50112655d5c7ec commit 68092ed8a4eedd29d4f55edcb04f27b34d0bc8a5 Author: Allison Portis Date: Mon Jun 6 11:46:52 2022 -0700 Adds miscellaneous (e.g. end-to-end workload, and CDCReader) tests for CDF. Resolves delta-io/delta#1178 GitOrigin-RevId: 5c7da4ff9413d84e73137a80673872065de8267b commit 6f51258382cf35d1c3722dc28e9752b064a4cf2b Author: Scott Sandre Date: Mon Jun 6 11:39:34 2022 -0700 CDF + Vacuum tests This PR adds two tests testing CDF + VACUUM integration. Closes delta-io/delta#1177 GitOrigin-RevId: f2f7b187cb3cc78c267d378eaf3d0657a56241d9 commit 60cc04c96c6e73b749c6adbf2d4d60d95ee0c1c3 Author: Scott Sandre Date: Sat Jun 4 19:08:04 2022 -0700 Minor refactor to DeltaErrors and DeltaErrorsSuite GitOrigin-RevId: ceacc3a6239f36e76cc8cf4f3447285bb06fc6a0 commit e151fdedd4735cb6b90e9c1ae8105c39b0cadfe3 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Tue Jun 7 11:56:05 2022 -0700 update version to 0.5.0-SNAPSHOT (#367) commit 5ebb74e50b7e60e038150bc445603e78361db9c6 Author: Jiawei Bao <1041291059@qq.com> Date: Mon Jun 6 20:13:04 2022 -0700 [delta-standalone] Implement actions iterator from `VersionLog` (#354) * java implementation with scala implementation * add action closeable iterator * delete java action closeable iterator * Switch VersionLog constructor * Build two implementation with the same name by interface * replace comments to interface file * fix comments * make wrapper class anonymous * restore comments * make 
Scala implementation private * add getActionIterator in VersionLog Java class * fix naming and parameter issues * add counter to VersionLog test * add comments and docs for MemoryOptimizedVersionLog * fix coding style * add tests & fix dependencies * add comments for tests & fix bugs based on PR comments * fix bugs based on PR comments * using generic extension instead of asInstanceOf, fix class doc, ensure thread-safe * revert generic extension * apply lazy loading for action list * add type casting from MemoryOptimizedVersionLog to VersionLog * fix test & variable name * fix coding style & removed some useless thread-safe restrictions * update unused dependencies and annotations * remove obvious clue * fix tests, naming, class doc and format * fix code format by scalafmt * fix list constructor and duplicated description in class doc * fix format * fix class doc * update class doc * update class doc commit 2af893251136444c43727334eb70bccfa90fae11 Author: kristoffSC Date: Mon Jun 6 19:56:53 2022 +0200 Flink Delta Source PR 12 - DeltaSource::option(...) value type conversion. (#364) * PR ColumnsFromDeltaLog - get table schema from Delta Log. Signed-off-by: Krzysztof Chmielewski * PR ColumnsFromDeltaLog - get table schema from Delta Log. Using SnapshotVersion in Enumerator factory. Signed-off-by: Krzysztof Chmielewski * PR 10.1 ColumnsFromDeltaLog_Tests - extra test for Delta table Schema discovery. Signed-off-by: Krzysztof Chmielewski * PR 11 - Partition support using Delta Log Metadata Signed-off-by: Krzysztof Chmielewski * PR 10 - Changes after code review. Signed-off-by: Krzysztof Chmielewski * PR 10.1 - Added Delta - Flink - Delta type conversion test. Signed-off-by: Krzysztof Chmielewski * PR 10 ColumnsFromDeltaLog - Prevent user to set internal options via source builder. 
Signed-off-by: Krzysztof Chmielewski * PR 10 ColumnsFromDeltaLog - Changes after code review Signed-off-by: Krzysztof Chmielewski * PR 10.1 Adding tests Signed-off-by: Krzysztof Chmielewski * PR 10.1 Adding tests Signed-off-by: Krzysztof Chmielewski * PR 11 Partition Support Signed-off-by: Krzysztof Chmielewski * PR 12 Option validation - Get BATCH_SIZE for Format builder from Source options. Add validation for option names and values. Signed-off-by: Krzysztof Chmielewski * PR 12 Option validation - Get BATCH_SIZE for Format builder from Source options. Add validation for option names and values. Signed-off-by: Krzysztof Chmielewski * PR 10.1 Changes after code review. Signed-off-by: Krzysztof Chmielewski * PR 10.1 Added tests. Signed-off-by: Krzysztof Chmielewski * PR 11 Test fix after merge Signed-off-by: Krzysztof Chmielewski * PR 10.1 cleanup. Signed-off-by: Krzysztof Chmielewski * PR 11 Fix after merge conflicts from master. Signed-off-by: Krzysztof Chmielewski * PR 11 Make RowDataFormat constructor package protected. Signed-off-by: Krzysztof Chmielewski * PR 11 Changes after Code review Signed-off-by: Krzysztof Chmielewski * PR 12 Fix compilation error after merge from base branch. Signed-off-by: Krzysztof Chmielewski * PR 12 - Validation for Inapplicable Option Used + tests. Signed-off-by: Krzysztof Chmielewski * PR 12 - Javadocs Signed-off-by: Krzysztof Chmielewski * PR 12 - Adding option type safety Signed-off-by: Krzysztof Chmielewski * PR 12 - Adding option type safety tests Signed-off-by: Krzysztof Chmielewski * PR 11 - Changes after code review. Signed-off-by: Krzysztof Chmielewski * PR 11 - Changes after code review. Signed-off-by: Krzysztof Chmielewski * PR 11 - Changes after code review. Signed-off-by: Krzysztof Chmielewski * PR 11 - Changes after code review. Signed-off-by: Krzysztof Chmielewski * PR 11 - Changes after code review. 
Signed-off-by: Krzysztof Chmielewski * PR 12 - test for options Signed-off-by: Krzysztof Chmielewski * PR 12 - Type conversion for Timestamp based options and adding tests. Signed-off-by: Krzysztof Chmielewski * PR 12 - Change TODO's from PR 12 to PR 12.1 Signed-off-by: Krzysztof Chmielewski * PR 12 - changes after code review Signed-off-by: Krzysztof Chmielewski Co-authored-by: Krzysztof Chmielewski commit 2231e02a33be8af30f2323b300caa8556868cc68 Author: Venki Korukanti Date: Sat Jun 4 09:15:54 2022 -0700 [ZOrderBy] Implement OPTIMIZE ZORDER BY command This PR is part of https://github.com/delta-io/delta/issues/1134. This is the final PR that integrates end-to-end of from OPTIMIZE SQL to Z-Order clustering execution. It adds the parser definition to accept OPTIMIZE ZORDER BY SQL. Modifies the existing `OptimizeTableCommand` to handle the Z-Order by making the following changes. - `OptimizeTableCommand` now has two modes `MultiDimClustering` and `Compaction` - When selecting the files for OPTIMIZE jobs: In `MultiDimClustering` mode select all files in the selected partition. In `Compaction` mode select only the files that have size less than the `minFileSize` in the selected partitions - When constructing bins for Optimize jobs: In `MultiDimClustering` mode, select all files in a partition as one bin. In `Compaction` mode, divide the files in partition into bins where the total size of the files in the bin is at most the `maxFileSize` - When constructing the Optimize jobs to run: In `MultiDimClustering` mode, convert the DataFrame containing input files into a clustering DataFrame using `MultiDimClustering`. In `Compaction` mode, we just coalesce all input files into one partition. 
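
The file selection and bin construction rules listed above for the two OPTIMIZE modes can be sketched in plain Python. This is only an illustration under stated assumptions, not the `OptimizeTableCommand` code: `DataFile`, `select_files`, and `build_bins` are hypothetical names, and the greedy smallest-first packing order is an assumption, since the commit message only says that a bin's total size is at most `maxFileSize`.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class DataFile(NamedTuple):
    path: str
    partition: str
    size: int  # bytes


def select_files(files: List[DataFile], mode: str, min_file_size: int) -> List[DataFile]:
    """MultiDimClustering takes every file; Compaction only files below min_file_size."""
    if mode == "MultiDimClustering":
        return list(files)
    return [f for f in files if f.size < min_file_size]


def build_bins(files: List[DataFile], mode: str, max_file_size: int) -> Dict[str, List[List[DataFile]]]:
    """Group the selected files into bins, per partition."""
    by_partition: Dict[str, List[DataFile]] = defaultdict(list)
    for f in files:
        by_partition[f.partition].append(f)

    bins: Dict[str, List[List[DataFile]]] = {}
    for partition, group in by_partition.items():
        if mode == "MultiDimClustering":
            # All files of a partition form a single bin.
            bins[partition] = [group]
            continue
        # ASSUMPTION: greedy packing, smallest files first. A single file
        # larger than max_file_size would still get a bin of its own.
        packed: List[List[DataFile]] = []
        current: List[DataFile] = []
        current_size = 0
        for f in sorted(group, key=lambda x: x.size):
            if current and current_size + f.size > max_file_size:
                packed.append(current)
                current, current_size = [], 0
            current.append(f)
            current_size += f.size
        if current:
            packed.append(current)
        bins[partition] = packed
    return bins


files = [
    DataFile("f1", "date=2022-06-01", 10),
    DataFile("f2", "date=2022-06-01", 20),
    DataFile("f3", "date=2022-06-01", 30),
    DataFile("f4", "date=2022-06-01", 200),
]
compaction = build_bins(select_files(files, "Compaction", 100), "Compaction", 50)
zorder = build_bins(select_files(files, "MultiDimClustering", 100), "MultiDimClustering", 50)
print([[f.path for f in b] for b in compaction["date=2022-06-01"]])
print([[f.path for f in b] for b in zorder["date=2022-06-01"]])
```

In this sketch the large file `f4` is skipped by Compaction but included by MultiDimClustering, which rewrites every file in the selected partition.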
Detailed design is [here](https://docs.google.com/document/d/1TYFxAUvhtYqQ6IHAZXjliVuitA5D1u793PMnzsH_3vs/edit?usp=sharing) This closes delta-io/delta#1171 Unit tests GitOrigin-RevId: 0d7b2972d1345a5ab99e1f4c3affd3b036821113 commit fe36a53f3c70c5f9c9b5052c12cd1703f495da97 Author: Allison Portis Date: Fri Jun 3 16:08:23 2022 -0700 CDF for Merge command See the project plan at https://github.com/delta-io/delta/issues/1105. This PR adds CDF to the `MERGE` command. Merge is implemented in two ways. - Insert-only merges. For these we don't need to do anything special, since we only write `AddFile`s with the new rows. - However, our current implementation of insert-only merges doesn't correctly update the metric `numTargetRowsInserted`, which is used to check for data changes in [CDCReader](https://github.com/delta-io/delta/blob/master/core/src/main/scala/org/apache/spark/sql/delta/commands/cdc/CDCReader.scala#L313). This PR fixes that. - For all other merges, we generate CDF rows for inserts, updates, and deletions. We do this by generating expression sequences for CDF outputs (i.e. preimage, insert, etc) on a clause-by-clause basis. We apply these to the rows in our joinedDF in addition to our existing main data output sequences. - Changes made to `JoinedRowProcessor` make column `ROW_DELETED_COL` unnecessary, so this PR removes it. Tests are added in `MergeCDCSuite`. Closes delta-io/delta#1155 GitOrigin-RevId: 0386c6ff811abe433644b5f5f46a3c7d51001740 commit cdb354e3b2bededd69a566d52c171929a79d9703 Author: Scott Sandre Date: Fri Jun 3 12:47:18 2022 -0700 CDF evolvability tests This PR adds two evolvability tests for CDF. Specifically, we test that CDF will continue to work even if there is some future column inside the delta log and checkpoint. 
Closes delta-io/delta#1172 GitOrigin-RevId: df4705cb1bdbfa6802dfa96d9f0ba0902e0d53a6 commit 764a7d392dd283679443d089609d80fa9cd6be38 Author: Allison Portis Date: Fri Jun 3 11:50:44 2022 -0700 Update github actions Closes delta-io/delta#1138 Signed-off-by: Allison Portis GitOrigin-RevId: e525abcd966a8affb408c0b135e294d3726c9b7d commit d10f2f07ed7777606ea0e8f31f814c5d4e80a3b3 Author: Allison Portis Date: Fri Jun 3 11:36:07 2022 -0700 Support generated columns with CDF CDF and Generated Columns would throw an error together since we were dropping CDC columns when generating our generated columns. This PR fixes it and adds a test. Closes delta-io/delta#1173 GitOrigin-RevId: 234def0e140ada5deb21680aea926a222e14dbf8 commit 44d4c848727bf9ff5b946e33b59c33a8533b39f3 Author: Jackie Zhang Date: Fri Jun 3 10:06:00 2022 -0700 More streaming + column mapping tests Added tests after stream restarts. new unit tests. GitOrigin-RevId: c2be51f574ddef0a1b9fdb2a54ba1f07e82c27a9 commit 9e7f3e9d070345c5f53b2d14313f0f1aee615169 Author: Grzegorz Kołakowski Date: Fri Jun 3 09:03:45 2022 -0700 Enable to run TPCDS performance benchmarks on GCP ## Description This PR enables to run TCPDS performance benchmarks not only on AWS but also on Google Cloud Platform.This is an extension of the framework introduced by: https://github.com/delta-io/delta/pull/973 . In order to run the benchmark, you need to manually create a Dataproc cluster first. All prerequisites and sample gcloud commands are in README file. After that you can run *Load data* and *Query data* steps provided by the framework. Please see the README updates for more details. I manually ran the benchmarks on GCP. ## Does this PR introduce _any_ user-facing changes? No. 
Closes delta-io/delta#1142 Co-authored-by: Grzegorz Koakowski Signed-off-by: Allison Portis GitOrigin-RevId: 392d1f277c837f85bf152d429dfffad4c166c550 commit aaf9fbdfecb0a48f85f937f4ad48c56f77c2e1aa Author: Jintian Liang Date: Fri Jun 3 03:54:24 2022 -0700 [Delta] Refactor seventeenth set of Delta error messages This change further refactors a set of Delta-related error messages as Delta-classed exceptions and adds more unit tests in `DeltaErrorsSuite` to verify the messaging is correctly formatted. Existing unit tests already verify the refactored callsites have the correct error message. Additional unit tests were added to `DeltaErrorsSuite` to verify the refactor. GitOrigin-RevId: 696bfcf2aeea2b90d951ad83e9428b658325b240 commit bdf2ef106ce79237eda6f80473ef8899cab70a24 Author: Jintian Liang Date: Thu Jun 2 23:52:03 2022 -0700 [Delta] Refactor eighteenth set of Delta error messages This change refactors exceptions into Delta-classed exceptions that also expose a SQL error code. Added unit tests in `DeltaErrorsSuite` to verify the messaging of the exceptions is correct. GitOrigin-RevId: 774ed5e2df893e05d6efd08446f7c93dab660be4 commit 3488b212acd0a738ea6eaec177dab554662cb76f Author: Shixiong Zhu Date: Thu Jun 2 21:14:28 2022 -0700 Minor refactor to SchemaUtils GitOrigin-RevId: cbd51d5ab5de98a38819fa5fb0f6a3de84cc55dc commit 9bcb7ebace2aea8e6d0b63aea6dff98ec492af9b Author: Peng Zhong Date: Thu Jun 2 15:00:27 2022 -0700 Refactor DeltaSourceOffset and DeltaSource GitOrigin-RevId: a7a220a636b5567a8d5aa2ad8d2edc962e8a3b8a commit b9acbc759a5be610df45706740cba159d64427f4 Author: kristoffSC Date: Fri Jun 3 20:11:49 2022 +0200 Flink Delta Source PR 11 - Partition support using Delta Log Metadata (#361) * PR ColumnsFromDeltaLog - get table schema from Delta Log. Signed-off-by: Krzysztof Chmielewski * PR ColumnsFromDeltaLog - get table schema from Delta Log. Using SnapshotVersion in Enumerator factory. 
Signed-off-by: Krzysztof Chmielewski * PR 10.1 ColumnsFromDeltaLog_Tests - extra test for Delta table Schema discovery. Signed-off-by: Krzysztof Chmielewski * PR 11 - Partition support using Delta Log Metadata Signed-off-by: Krzysztof Chmielewski * PR 10 - Changes after code review. Signed-off-by: Krzysztof Chmielewski * PR 10.1 - Added Delta - Flink - Delta type conversion test. Signed-off-by: Krzysztof Chmielewski * PR 10 ColumnsFromDeltaLog - Prevent user to set internal options via source builder. Signed-off-by: Krzysztof Chmielewski * PR 10 ColumnsFromDeltaLog - Changes after code review Signed-off-by: Krzysztof Chmielewski * PR 10.1 Adding tests Signed-off-by: Krzysztof Chmielewski * PR 10.1 Adding tests Signed-off-by: Krzysztof Chmielewski * PR 11 Partition Support Signed-off-by: Krzysztof Chmielewski * PR 10.1 Changes after code review. Signed-off-by: Krzysztof Chmielewski * PR 10.1 Added tests. Signed-off-by: Krzysztof Chmielewski * PR 11 Test fix after merge Signed-off-by: Krzysztof Chmielewski * PR 10.1 cleanup. Signed-off-by: Krzysztof Chmielewski * PR 11 Fix after merge conflicts from master. Signed-off-by: Krzysztof Chmielewski * PR 11 Make RowDataFormat constructor package protected. Signed-off-by: Krzysztof Chmielewski * PR 11 Changes after Code review Signed-off-by: Krzysztof Chmielewski * PR 11 - Changes after code review. Signed-off-by: Krzysztof Chmielewski * PR 11 - Changes after code review. Signed-off-by: Krzysztof Chmielewski * PR 11 - Changes after code review. Signed-off-by: Krzysztof Chmielewski * PR 11 - Changes after code review. 
Signed-off-by: Krzysztof Chmielewski Co-authored-by: Krzysztof Chmielewski commit b36f6a7a57caa47dce72d9eb7fac8b7a4d25b15e Author: Tathagata Das Date: Thu Jun 2 15:24:37 2022 -0400 Limit SetTxn action retention in the Delta Log This PR adds a new optional table property `delta.setTransactionRetentionDuration` that lets users set a duration so that newly created snapshots ignore all `SetTransaction`s older than that duration. Since we have allowed options to specify arbitration txnAppId and txnVersion for idempotent writes, there is a chance that unlimited number of SetTxn actions will forever clutter the Delta log. This is an opt-in mechanism to clear those actions. New unit tests Closes delta-io/delta#1167 GitOrigin-RevId: ae88a9ecfb9130439b27f4e4609ce82fc91f5c75 commit fbaca4385a061ddb36e7da9714019a41f12c449a Author: jintao shen Date: Thu Jun 2 05:56:06 2022 -0700 [Delta] Refactor fifteenth set of 20 Delta error messages Refactor 20 Delta error messages into strong error class Error code is tested in unit test inside DeltaErrorsSuite target GitOrigin-RevId: 8715cd7be0d8ce887ead44c58c30aed12b0dd74b commit 6f0c48ecc9054122462c3bcb5519b2b7c6ce3f58 Author: Yijia Cui Date: Thu Jun 2 00:17:27 2022 -0700 Fix CDF with ReplaceWhere When replaceWhere on a non-partition Column, insert type should be generated for new rows if CDF is enabled. Unit test. 
GitOrigin-RevId: b0ca02609e42b4b846d39e3568189063a2157416 commit 05d643d6df039ec88f5085b66852c4e2bb287c62 Author: jintao shen Date: Wed Jun 1 16:30:18 2022 -0700 [Delta] Refactor fourteenth set of 20 Delta error messages Refactor 20 Delta error messages into strong error class GitOrigin-RevId: d6dae67ba6587c7d54583171cc7d371aa53ff031 commit f94b75623eaf40a543d6d4ab40f0f3b65d80d165 Author: jintao shen Date: Wed Jun 1 11:25:09 2022 -0700 [Delta] Refactor thirteenth set of 20 Delta error messages Refactor 20 Delta error messages into strong error class Error code is tested in unit test inside DeltaErrorsSuite target GitOrigin-RevId: c0c57d2d07ec8a9997ce26ec2ae1607d4f475d0b commit 32c3681f37a01d10565c21b70e1296bba0b2ea25 Author: chenqingzhi Date: Wed Jun 1 11:18:35 2022 -0700 Fix describeDeltaDetailsCommand that returns stale metadata DescribeDeltaDetailsCommand doesn't call DeltaLog.update See https://github.com/delta-io/delta/issues/921 Closes delta-io/delta#1133 Co-authored-by: chenqingzhi Signed-off-by: allisonport-db GitOrigin-RevId: 8e9b7aa23a5304142bda561f02f2d21173aa88d4 commit a71cc67f640418f82195aef98a6e7ba0c7a647ff Author: Michael Mengarelli Date: Wed Jun 1 10:04:54 2022 -0700 [788-DELTA] Update DeltaMergeBuilder documentation: Updated Python Docstring and Javadoc to indicate that you can use any number of whenMatched and whenNotMatched clauses.
Signed-off-by: Mike Mengarelli Closes delta-io/delta#1113 Signed-off-by: Scott Sandre GitOrigin-RevId: a04349891bf61e93b0b39acb0db94927f0f778ef commit 7cc7eafb42a7d15831e8aea45351b4f7912780e3 Author: Min Yang Date: Tue May 31 17:20:53 2022 -0700 Disabled a DeltaSourceSuite test case as it is too flaky GitOrigin-RevId: 3dcf1f4cb9733f90b31773b9c83bfb8a606a417d commit 3ebbb91a6a723f9791ae4f652ff91b5d5988817d Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Thu Jun 2 09:03:21 2022 -0700 [0.4.1 release] Update examples and READMEs for master (#358) * updated examples and READMEs * update flink example readme * change version to x.y.z in README commit e7e880aea20740c2d5dcb997bfc890d9bd15c845 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Thu Jun 2 08:58:10 2022 -0700 [0.4.1 release] add 0.4.1 javadocs to master (#356) commit 6ee1c14e6d4e0a4fee631081699e32e9f4077160 Author: kristoffSC Date: Wed Jun 1 19:31:40 2022 +0200 Flink Delta Source PR 10.1 - Adding more tests to Delta table schema discovery for Flink source. (#352) * PR ColumnsFromDeltaLog - get table schema from Delta Log. Signed-off-by: Krzysztof Chmielewski * PR ColumnsFromDeltaLog - get table schema from Delta Log. Using SnapshotVersion in Enumerator factory. Signed-off-by: Krzysztof Chmielewski * PR 10.1 ColumnsFromDeltaLog_Tests - extra test for Delta table Schema discovery. Signed-off-by: Krzysztof Chmielewski * PR 10 - Changes after code review. Signed-off-by: Krzysztof Chmielewski * PR 10.1 - Added Delta - Flink - Delta type conversion test. Signed-off-by: Krzysztof Chmielewski * PR 10 ColumnsFromDeltaLog - Prevent user to set internal options via source builder. Signed-off-by: Krzysztof Chmielewski * PR 10 ColumnsFromDeltaLog - Changes after code review Signed-off-by: Krzysztof Chmielewski * PR 10.1 Adding tests Signed-off-by: Krzysztof Chmielewski * PR 10.1 Adding tests Signed-off-by: Krzysztof Chmielewski * PR 10.1 Changes after code review. 
Signed-off-by: Krzysztof Chmielewski * PR 10.1 Added tests. Signed-off-by: Krzysztof Chmielewski * PR 10.1 cleanup. Signed-off-by: Krzysztof Chmielewski Co-authored-by: Krzysztof Chmielewski commit f4cbfaccad533dedee8a479a3011e4c2f7f9be20 Author: Scott Sandre Date: Tue May 31 15:59:10 2022 -0700 Delta Lake CDF - Stream API This PR adds CDF + Streaming functionality, as part of the ongoing CDF project. The bulk of this PR is a) adding `DeltaSourceCDCSupport`. This PR adds on the necessary CDF functionality to DeltaSource. b) updating DeltaSource to use the various `DeltaSourceCDCSupport` APIs when CDF is enabled c) adding a new test suite Closes delta-io/delta#1154 GitOrigin-RevId: e7e8f6d48f99a63e7c5d35a5d0173a9cc26cf274 commit cfe0177e46c4403f35513f3383d70bbbdbb58642 Author: Naga Raju Bhanoori Date: Tue May 31 15:25:44 2022 -0700 [Delta] Refactor tenth set of 20 Delta error messages This PR refactors 20 exceptions of the delta code base to follow the new error handling framework that better organizes exceptions and makes them queryable. This commit refactors the tenth set 20 delta error messages. Refactored existing exceptions into functions in DeltaErrors.scala and unit tested those functions part of DeltaErrorsSuite.scala. GitOrigin-RevId: 82ecc2a3988709c41aa33c922d7a79acb39da782 commit bb57d870bd5788e5c3fdf62d7e1d0cc9ead89b8b Author: Shixiong Zhu Date: Tue May 31 13:50:17 2022 -0700 Prevent a restarted streaming query from reading a non-existent version GitOrigin-RevId: 74748181cd5f11ef1a7deeb3f3ac33be78801b7c commit 40bf4486f327081091286a8572bb801c7ff19704 Author: Sabir Akhadov Date: Tue May 31 11:58:31 2022 +0200 Minor refactor to DeltaLogging.scala GitOrigin-RevId: 170e024bfb201205a5e0fd84f5b876a696b7d7ef commit e7ca437e9aaa2321c828062590aeb29b8c00460d Author: kristoffSC Date: Sun May 29 20:37:11 2022 +0200 Flink Delta Source PR 10 - get table schema from Delta Log. (#347) * PR ColumnsFromDeltaLog - get table schema from Delta Log. 
Signed-off-by: Krzysztof Chmielewski * PR ColumnsFromDeltaLog - get table schema from Delta Log. Using SnapshotVersion in Enumerator factory. Signed-off-by: Krzysztof Chmielewski * PR 10 - Changes after code review. Signed-off-by: Krzysztof Chmielewski * PR 10 ColumnsFromDeltaLog - Prevent user to set internal options via source builder. Signed-off-by: Krzysztof Chmielewski * PR 10 ColumnsFromDeltaLog - Changes after code review Signed-off-by: Krzysztof Chmielewski Co-authored-by: Krzysztof Chmielewski commit 91051df45f573a9dcdeaf9173aff6ba18314c82e Author: Venki Korukanti Date: Fri May 27 13:48:02 2022 -0700 [ZOrderBy] Add transformation for multi-dimensional clustering of data This PR is part of https://github.com/delta-io/delta/issues/1134. This PR adds `MultiDimClustering` which makes use of the [`range_partition_id`](https://github.com/delta-io/delta/pull/1137) and [`interleave_bits`](https://github.com/delta-io/delta/pull/1149) to transform the layout of the data in Z-order clustering. Detailed design details are [here](https://docs.google.com/document/d/1TYFxAUvhtYqQ6IHAZXjliVuitA5D1u793PMnzsH_3vs/edit?usp=sharing). Following are the two new options to control the clustering - DeltaSQLConf.OPTIMIZE_ZORDERBY_NUM_RANGE_IDS - This controls the domain of rangeId values to be interleaved. The bigger, the better granularity, but at the expense of performance (more data gets sampled). - DeltaSQLConf.OPTIMIZE_ZORDERBY_ADD_NOISE - Whether or not a random byte should be added as a suffix to the interleaved bits when computing the Z-order values for multi dimensional clustering. This can help deal with skew, but may have a negative impact on overall min/max skipping effectiveness. 
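To make the noise option more concrete, here is a small Python sketch of appending one random byte to each Z-order value. It is only an illustration of the idea described above: `add_noise_suffix` and its `seed` parameter are hypothetical, and the claim that the suffix lets heavily duplicated values be spread across several output ranges is an interpretation of "can help deal with skew", not something the commit message spells out.

```python
import random
from typing import List


def add_noise_suffix(z_values: List[bytes], seed: int = 0) -> List[bytes]:
    """Append one pseudo-random byte to every Z-order value (hypothetical helper)."""
    rng = random.Random(seed)
    return [z + bytes([rng.randrange(256)]) for z in z_values]


# Many rows sharing one Z-order value (a skewed input) become distinguishable,
# so a range partitioner could split them across more than one output range.
skewed = [b"\x00\x00\x00\x07"] * 1000
noisy = add_noise_suffix(skewed)
distinct_keys = len(set(noisy))
```

Because the random byte is a suffix, the ordering between two different Z-order values of equal length is unchanged; only ties are broken at random. Rows with the same value may then land in different files, which is one way to read the warning about reduced min/max skipping effectiveness.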
Close delta-io/delta#1150 UTs GitOrigin-RevId: 01afcfcc3c01ef6e75320a30733f563e463a1eef commit 0d08f11fdf28ceb95998e8d34daa065c8c59ca24 Author: panbingkun Date: Fri May 27 01:29:35 2022 +0000 Minor refactoring Author: panbingkun Author: Bo Zhang GitOrigin-RevId: 67fbeb0c8abf17fd8b41c5dbd5520fa814fd91de commit 42f3521623ff540bdcffd9125464456d8900f39d Author: koertkuipers Date: Thu May 26 13:24:29 2022 -0700 Support setting metadata configuration upon write to a path This PR makes the change to save `delta.` prefix options set by `DataFrameWriter.option` as table properties when writing to a new path (In other words, create a new table). This makes Delta consistent with other APIs when creating a table: - DataFrameWriterV2 (either table or path) - DataFrameWriter.option(...).saveAsTable Closes delta-io/delta#374 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 07d047f537bac76b09689cd7bade57010e97496a commit a7c9bf89ebe0f583e566ba0d695a25b55050fba1 Author: Allison Portis Date: Wed May 25 16:18:56 2022 -0700 Adds an integration test that catches python exceptions This PR adds an integration test to test the fix in delta-io/delta#1078. It also checks that we are throwing `pyspark.sql.utils.AnalysisException` Run with `python3 run-integration-tests.py --python-only --test table_exists --use-local` to test local code and `python3 run-integration-tests.py --version 1.1.0 --python-only --test table_exists`. Closes delta-io/delta#1081 Signed-off-by: allisonport-db GitOrigin-RevId: 191aeefb0f94e11bb9137e930f5a54f2a4a0e5ba commit 1a1429a4cd73a14b606a730ba6f86364b2ef71e5 Author: Tathagata Das Date: Wed May 25 14:27:15 2022 -0400 Idempotent writes using Dataframe write options This PR makes changes to support idempotent delta writes within foreachBatch() command. Prior to this change, streaming delta writes within foreachBatch command were not idempotent -- that is, rerunning a failed batch could result in duplicate data writes. 
With this change, we expose two options to the users: - txnAppId: A unique string that customers can pass on each dataframe write. For example, customers can use the StreamingQuery's id as appId. - txnVersion: A monotonically increasing number that acts as transaction version. The streaming batch id can be used as the txnVersion. Combination of txnAppId and txnVersion are used to identify duplicate writes and ignore duplicate writes by the runtime. Specifically, WriteIntoDelta class is modified to persist the transaction state (which includes the txnAppId and txnVersion) and consult on new writes to the same location. Example: ``` df.write.format("delta").option("txnAppId", "myapp").option("txnVersion", 1).save(tablePath) df.write.format("delta").option("txnAppId", "myapp").option("txnVersion", 1).save(tablePath) // this will be ignored ``` Closes delta-io/delta#1141 New unit tests GitOrigin-RevId: aa9836a2498a65477584d8b60681246b972e4378 commit 2c8246a0dbf5e4605cb61feca124bcaed8f6a709 Author: Tathagata Das Date: Wed May 25 14:26:07 2022 -0400 [Delta] Improved dockerized testing environment with custom registry integration Added the ability for the run-tests.py script to run tests in docker while integrating with docker registry - Script will automatically attempt to generate docker images hashed by the Dockerfile contents so that the same deterministically-named-image can be reused across test runs. - Docker registry has to be specified via env variable DOCKER_REGISTRY. When specified, the script will push the image to docker registry and pull from it as needed to speed up all future test runs. 
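As a rough illustration of the deterministic image naming described above, the following Python sketch derives an image name from the Dockerfile contents. It is not the `run-tests.py` implementation: the function name, the `delta-test` repository name, the use of SHA-256, and the 12-character truncation are all assumptions; only the `DOCKER_REGISTRY` environment variable comes from the commit message.

```python
import hashlib
import os
from typing import Optional


def image_name(dockerfile_contents: bytes,
               repo: str = "delta-test",          # hypothetical repository name
               registry: Optional[str] = None) -> str:
    """Return an image name that changes only when the Dockerfile changes."""
    # Assumed scheme: tag = first 12 hex characters of the SHA-256 digest.
    digest = hashlib.sha256(dockerfile_contents).hexdigest()[:12]
    name = f"{repo}:{digest}"
    # With a registry configured, the same name can be pushed once and pulled
    # by later test runs instead of rebuilding the image.
    return f"{registry}/{name}" if registry else name


registry_from_env = os.environ.get("DOCKER_REGISTRY")  # None when unset
example = image_name(b"FROM ubuntu:20.04\n", registry=registry_from_env)
```

The design point is that identical Dockerfile contents always map to the same name, so a cached image can be reused across runs, while any edit to the Dockerfile yields a new name and forces a rebuild.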
Manual testing GitOrigin-RevId: da23bb3f1e44ddfec43b5a68eea1d6b863504f90 commit 4762dccce9e45b4077ed096e9e0f79bcdd4088e3 Author: Pablo Flores Date: Fri May 27 08:16:38 2022 -0700 [340] Fix java.io.IOException when writing delta log to non local path (#341) * Pass full table path using Path.toString Signed-off-by: Pablo Flores * Use the safer approach of building a hadoop Path using `Path.toUri` Signed-off-by: Pablo Flores * Added unit test to check the fix Signed-off-by: Pablo Flores * Removed single use function Signed-off-by: Pablo Flores * Added inline comments to the test Signed-off-by: Pablo Flores * Assert that the table path's scheme is file Signed-off-by: Pablo Flores commit 66045f384a42694662f2b02396aed7c8ac9a0908 Author: Allison Portis Date: Wed May 25 19:20:09 2022 -0700 Use `delta-storage` in Delta Standalone (#338) commit ef33f6e3891f702677c96f2a93c01a86a865327e Author: Allison Portis Date: Wed May 25 16:12:48 2022 -0700 Automatically select LogStore based on path scheme by adding DelegatingLogStore (#337) commit 47ca27f14f10c5b1ca48f818a19a20512684a1db Author: Scott Sandre Date: Wed May 25 08:33:36 2022 -0700 Delta Lake - CDF - UPDATE command See the project plan at https://github.com/delta-io/delta/issues/1105. This PR adds CDF to the UPDATE command, during which we generate both preimage and postimage CDF data. This PR also adds UpdateCDCSuite, which provides basic tests for these CDF changes. As a high-level overview of how this CDF-update operation is performed: when we find a row that satisfies the update condition, we `explode` an array containing the pre-image, post-image, and main-table updated rows. The pre-image and post-image rows are typed with the corresponding CDF_TYPE, and the main-table updated row has CDF_TYPE `null`. Thus, the first two rows are written to the CDF Parquet file, while the third is written to the standard main-table data Parquet file.
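The explode step can be pictured with a small Python sketch over plain dictionaries. It is a simplified model under stated assumptions, not the `UpdateCommand` code: `update_with_cdf` is a hypothetical function, the `cdf_type` key stands in for the CDF_TYPE column, and the type labels `update_preimage` and `update_postimage` are assumed here for illustration.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


def update_with_cdf(rows: List[Row],
                    condition: Callable[[Row], bool],
                    update: Callable[[Row], Row]) -> Tuple[List[Row], List[Row]]:
    """Return (main_table_rows, cdf_rows) for an UPDATE over `rows`."""
    exploded: List[Row] = []
    for row in rows:
        if condition(row):
            new_row = update(row)
            # One matching input row explodes into three output rows.
            exploded.append({**row, "cdf_type": "update_preimage"})
            exploded.append({**new_row, "cdf_type": "update_postimage"})
            exploded.append({**new_row, "cdf_type": None})
        else:
            # Rows that do not match are copied unchanged to the main table.
            exploded.append({**row, "cdf_type": None})

    # Rows with a non-null type go to the change data file ...
    cdf_rows = [r for r in exploded if r["cdf_type"] is not None]
    # ... and rows with a null type go to the regular data file.
    main_rows = [{k: v for k, v in r.items() if k != "cdf_type"}
                 for r in exploded if r["cdf_type"] is None]
    return main_rows, cdf_rows


main_rows, cdf_rows = update_with_cdf(
    [{"id": 1, "v": 10}, {"id": 2, "v": 20}],
    condition=lambda r: r["id"] == 1,
    update=lambda r: {**r, "v": r["v"] + 1},
)
```

In this example the main table ends up with the updated row for `id` 1 and the untouched row for `id` 2, while the change data holds the before and after images of the updated row only.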
Closes delta-io/delta#1146 GitOrigin-RevId: 47413c5345bb97c0e1303a7f4d4d06b89c35ab7a commit ae9c3f1eb83c71ea48c3cbe35f78d1fa19111eaf Author: Ole Sasse Date: Wed May 25 16:20:29 2022 +0200 Minor refactor to MergeIntoSQLSuite GitOrigin-RevId: b7b0b6039064210146351ce809d6b3f69c27a5bf commit 8f85c881e213f2b3f2939514404e36c3dd98f446 Author: Shixiong Zhu Date: Tue May 24 19:14:17 2022 -0700 Add a log for UserDefinedType GitOrigin-RevId: ffded8fd836995f552e598e912c6594367df121e commit 22a0be1c81e9f43fab1550ada9dab586f5e7203e Author: Venki Korukanti Date: Mon May 23 11:48:08 2022 -0700 [ZOrderBy] Add `interleave_bits` expression This PR is part of https://github.com/delta-io/delta/issues/1134. It implements `interleave_bits(col1:int, col2:int, …. col_n:int) -> byte array (Z-order value)`. This expression is used to combine multiple Z-order by columns into a [Z-order value](https://en.wikipedia.org/wiki/Z-order_curve). This Z-order value is used to layout the data such that the records with Z-Order By column having close values remain close when stored in files. Detailed design details are [here](https://docs.google.com/document/d/1TYFxAUvhtYqQ6IHAZXjliVuitA5D1u793PMnzsH_3vs/edit?usp=sharing), specifically [this](https://docs.google.com/document/d/1TYFxAUvhtYqQ6IHAZXjliVuitA5D1u793PMnzsH_3vs/edit#bookmark=id.pngsryg3gbl2) section. 
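For readers unfamiliar with Z-order values, bit interleaving can be sketched in a few lines of Python. This is an illustrative version only: it assumes each input is treated as a 32-bit pattern and that bits are interleaved from the most significant bit down, which may differ in detail from the actual `interleave_bits` expression.

```python
def interleave_bits(*cols: int) -> bytes:
    """Interleave the bits of 32-bit integers into one big-endian byte string.

    Output bit order (assumed): bit 31 of col1, bit 31 of col2, ...,
    bit 30 of col1, bit 30 of col2, ..., down to bit 0 of the last column.
    """
    out = bytearray(4 * len(cols))
    out_bit = 0
    for bit in range(31, -1, -1):
        for value in cols:
            value &= 0xFFFFFFFF  # treat the input as an unsigned 32-bit pattern
            if (value >> bit) & 1:
                out[out_bit // 8] |= 0x80 >> (out_bit % 8)
            out_bit += 1
    return bytes(out)


# col1 = 0b10 and col2 = 0b11 interleave to ...1101 in the lowest four bits.
z = interleave_bits(2, 3)
```

Comparing the resulting byte strings lexicographically orders points along the Z-order curve, so records whose Z-order columns have close values tend to sort near each other.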
Closes delta-io/delta#1149 UTs GitOrigin-RevId: a91c6116179b53d6dbc3bacd700ae7bc5133e3c8 commit d63cf46753e27857bef6a18f3dbd8e7c9e4285c9 Author: Venki Korukanti Date: Mon May 23 09:57:09 2022 -0700 Additional tests for Optimize file compaction GitOrigin-RevId: 74cbd004db0ebdb4a8bfcf12f939f31e1cc3d1f1 commit b25d12dd71a7b6c0b845fe8deae38e6eb85455a9 Author: Wenchen Fan Date: Mon May 23 06:31:48 2022 +0000 Minor test cleanup in DeleteCDCSuite Author: Wenchen Fan GitOrigin-RevId: d24e74255d0b39c17a4638b4d6d0f07670d744ee commit 5d8974fdfa866da138e71479b1db27bdd54ecc30 Author: Sabir Akhadov Date: Sat May 21 10:39:07 2022 +0200 Minor refactoring in DeleteCommand GitOrigin-RevId: 16c6a7e33e40493b2e12aba1f5566fda0be8fe5b commit 728bf902542077ce1c2e97ca67a53c53bb460c64 Author: Venki Korukanti Date: Fri May 20 14:40:37 2022 -0700 [ZOrderBy] Add `range_partition_id` expression This PR is part of https://github.com/delta-io/delta/issues/1134. It implements `range_partition_id(col, N) -> int` expression. This expression is used to convert each Z-order column values to a range id. The ranges are selected by sampling the input column. For sampling and choosing the ranges make use of the existing `RangePartitioner` in Spark. Detailed design details are [here](https://docs.google.com/document/d/1TYFxAUvhtYqQ6IHAZXjliVuitA5D1u793PMnzsH_3vs/edit?usp=sharing), specifically [this](https://docs.google.com/document/d/1TYFxAUvhtYqQ6IHAZXjliVuitA5D1u793PMnzsH_3vs/edit#bookmark=id.5aav37q4qho2) section. 
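The mapping from column values to range ids can be approximated with a short Python sketch. It only mirrors the idea of sampling the column and picking range boundaries: `choose_boundaries` is a hypothetical helper using a simple random sample, whereas the real expression relies on Spark's `RangePartitioner`, whose sampling strategy is more involved.

```python
import bisect
import random
from typing import List


def choose_boundaries(values: List[int], num_ranges: int,
                      sample_size: int = 1000, seed: int = 0) -> List[int]:
    """Pick up to num_ranges - 1 ascending boundaries from a sample of the column."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(values, min(sample_size, len(values))))
    bounds: List[int] = []
    if not sample:
        return bounds
    for i in range(1, num_ranges):
        candidate = sample[min(len(sample) - 1, i * len(sample) // num_ranges)]
        # Skip duplicates so the boundaries stay strictly increasing.
        if not bounds or candidate > bounds[-1]:
            bounds.append(candidate)
    return bounds


def range_partition_id(value: int, boundaries: List[int]) -> int:
    """Return the index of the range that `value` falls into (0-based)."""
    # A value equal to a boundary stays in the lower range.
    return bisect.bisect_left(boundaries, value)


column = list(range(100))
bounds = choose_boundaries(column, num_ranges=4)
ids = [range_partition_id(v, bounds) for v in column]
```

The resulting ids are small integers that preserve the ordering of the original values, which makes columns of very different types and scales comparable before their bits are interleaved.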
Closes delta-io/delta#1137 UTs GitOrigin-RevId: 1b900174908d76945c582e3b0f67e1298c9e049b commit c78d4d03df4e03d5a7fbb31c963151a3f0e0bf9e Author: Hussein Nagree Date: Fri May 20 14:34:04 2022 -0700 Minor refactoring to snapshot management code GitOrigin-RevId: afcb4371f246e565148ac20eb9b3950335f81d9a commit 73314f274b3b0af33926e4c3df29faf8246444dd Author: lzlfred Date: Fri May 20 12:14:17 2022 -0700 Make createdTime in metadata as None by default Today if we use JsonUtils to deserialize a Metadata object that doesn't have a createdTime field, Jackson will fill in the current system timestamp. This is not desired and is not consistent with the behavior when using Spark to read. This inconsistency may cause issues in the future. Hence we change the default value of createdTime to None, and when a caller creates a Metadata object, it should set it explicitly. In the Delta protocol createdTime is optional, and thus it's totally valid for it to be empty. Unit test. The same unit test failed without this fix. GitOrigin-RevId: 04989db590694d3057b948d94e74dcdd81f80ad6 commit 8974685f574394628cf9b5c5457093e7cbcd3d23 Author: Patrick GRANDJEAN Date: Fri May 20 08:31:06 2022 -0700 [Issue #1088] Allow write option "compression" as for parquet write options This PR allows the use of the "compression" parquet write option in delta. Tests were added, covering the cases where the "compression" option is absent and defaults to snappy, where "compression" specifies a valid value, and finally where "compression" provides an invalid value and raises an exception. Yes, please check issue #1088 for examples. Fixes delta-io/delta#1088 Closes delta-io/delta#1126 Signed-off-by: Scott Sandre GitOrigin-RevId: b5aa74467f5ec2d0f1813468232e51e1c0dbc653 commit 2f2242918e93bf38893ce739b00b2b254403e321 Author: Shixiong Zhu Date: Fri May 20 08:12:35 2022 -0700 Revert "Remove unused DeltaConfigs" This reverts commit f50ecb6e0a3c18d7e2d0ac0600aa6a2d1d7fbdc7 We realized removing DeltaConfigs could break old Delta versions.
So just revert the change. This is a clean revert. Existing tests GitOrigin-RevId: 8eca43edefb83fce64fbb3f6d76020970ee01abb commit 258071dd02049f669b631a720277ec2140bf5270 Author: Adam Binford Date: Thu May 19 18:15:43 2022 -0700 Support nested columns in generated column skipping - Add support for nested columns in the automatic partition pruning for generated columns. - Add partition pruning support when a partition column is defined as a data column or a nested column of a data column directly. Updated existing tests to also test a nested version, and added a new test for the identity mapping of nested columns. Extends automatic data skipping from generated columns to nested columns Resolves delta-io/delta#1074 Closes delta-io/delta#1075 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 8a7d239782fbf9c0836ebcebd9e0c8b9fa1ababe commit 09c9f2c406fe70d6515f810e4690994ad3db649f Author: Nick Karpov Date: Thu May 19 14:07:58 2022 -0700 Fix Delta Standalone link in README ## Description `Delta Standalone` link was pointing to the YouTube channel. Changing it to the docs link. N/A ## Does this PR introduce _any_ user-facing changes? N/A Closes delta-io/delta#1143 Signed-off-by: Venki Korukanti GitOrigin-RevId: ff985fa7b095155c2b999ce0effcef42e95b15b1 commit a69ede38c82ee830f511fee97251b62c602c3ab0 Author: Jackie Zhang Date: Thu May 19 14:05:55 2022 -0700 Add tests for column mapping operations during streaming This PR adds tests covering the behaviors of column rename/drop in: - streaming operations (Delta as source/sink) It also piggybacks some changes to re-enable some tests using a modified `CONVERT TO DELTA` in column mapping modes.
GitOrigin-RevId: 447bfbfb4457c4775743151a7a81ce308407003f commit 900a3563c3b08123e3ee7c66cba55433385b125f Author: Scott Sandre Date: Thu May 19 11:04:16 2022 -0700 Delta Configs test matrix TLDR Tests 76 different ways of writing configs to delta :) Add a test suite that tests basically all different ways to write configs to a Delta table, and see which ways do/don't work as expected. Prints out nice summary tables at the end for us to view and analyze. Closes delta-io/delta#1120 GitOrigin-RevId: e735eb084fc5ce279d9ecfb4f1dac3b78b55f014 commit 01226e5158fc0a9c1f52245a8cb53af7ea874f0a Author: Shixiong Zhu Date: Wed May 18 14:16:21 2022 -0700 Rewrite EqualNullSafe for data skipping to simplify the code ## Description Instead of creating new rules to handle `EqualNullSafe`, we can rewrite `EqualNullSafe(a, NotNullLiteral)` as `And(IsNotNull(a), EqualTo(a, NotNullLiteral))` and rewrite `EqualNullSafe(a, null)` as `IsNull(a)` to let the existing logic handle it. Here is the diff of `core/src/main/scala/org/apache/spark/sql/delta/stats/DataSkippingReader.scala` when comparing to a commit not including changes in #1014: ```diff $ git diff 29530ae -- core/src/main/scala/org/apache/spark/sql/delta/stats/DataSkippingReader.scala diff --git a/core/src/main/scala/org/apache/spark/sql/delta/stats/DataSkippingReader.scala b/core/src/main/scala/org/apache/spark/sql/delta/stats/DataSkippingReader.scala index 5bceb4b4..ec911840 100644 --- a/core/src/main/scala/org/apache/spark/sql/delta/stats/DataSkippingReader.scala +++ b/core/src/main/scala/org/apache/spark/sql/delta/stats/DataSkippingReader.scala @@ -29,11 +29,11 @@ import org.apache.spark.sql.{DataFrame, _} import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.expressions.Literal.{FalseLiteral, TrueLiteral} -import org.apache.spark.sql.catalyst.util.{GenericArrayData, TypeUtils} +import org.apache.spark.sql.catalyst.util.TypeUtils 
import org.apache.spark.sql.execution.InSubqueryExec import org.apache.spark.sql.expressions.SparkUserDefinedFunction import org.apache.spark.sql.functions._ -import org.apache.spark.sql.types.{AtomicType, BooleanType, ByteType, CalendarIntervalType, DataType, DateType, DoubleType, FloatType, IntegerType, LongType, NumericType, ShortType, StringType, StructType, TimestampType} +import org.apache.spark.sql.types.{AtomicType, BooleanType, CalendarIntervalType, DataType, DateType, NumericType, StringType, StructType, TimestampType} import org.apache.spark.unsafe.types.{CalendarInterval, UTF8String} /** @@ -439,6 +439,19 @@ trait DataSkippingReaderBase case Not(EqualTo(v: Literal, a)) => constructDataFilters(Not(EqualTo(a, v))) + // Rewrite `EqualNullSafe(a, NotNullLiteral)` as `And(IsNotNull(a), EqualTo(a, NotNullLiteral))` + // and rewrite `EqualNullSafe(a, null)` as `IsNull(a)` to let the existing logic handle it. + case EqualNullSafe(a, v: Literal) => + val rewrittenExpr = if (v.value != null) And(IsNotNull(a), EqualTo(a, v)) else IsNull(a) + constructDataFilters(rewrittenExpr) + case EqualNullSafe(v: Literal, a) => + constructDataFilters(EqualNullSafe(a, v)) + case Not(EqualNullSafe(a, v: Literal)) => + val rewrittenExpr = if (v.value != null) And(IsNotNull(a), EqualTo(a, v)) else IsNull(a) + constructDataFilters(Not(rewrittenExpr)) + case Not(EqualNullSafe(v: Literal, a)) => + constructDataFilters(Not(EqualNullSafe(a, v))) + // Match any file whose min is less than the requested upper bound. case LessThan(SkippingEligibleColumn(a), SkippingEligibleLiteral(v)) => val minCol = StatsColumn(MIN, a) ``` Existing tests added by #1014 should cover the correctness. ## Does this PR introduce _any_ user-facing changes? 
No Closes delta-io/delta#1136 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 9d3997d51a89f8f65e3e2d8f0597e1d06b1d6a75 commit ef32026f7a864e84a87121d9e2d5fa213529a5a7 Author: Sabir Akhadov Date: Wed May 18 14:57:40 2022 +0200 Minor test refactoring in DeleteCommand GitOrigin-RevId: 741db660730196c1170b5d8bccd5841fa53eeeef commit cb134a362be42c41dadf68d9d59935cb00f8b80a Author: Prakhar Jain Date: Wed May 18 05:36:45 2022 -0700 Add Checkpoint Schema in Checkpoint Metadata This PR added the checkpoint file schema to `CheckpointMetadata` (stored in `_last_checkpoint`) and expose the checkpoint file schema in `Snapshot` for future improvements. UTs No Closes delta-io/delta#1145 GitOrigin-RevId: 5b8179d1baa154d46b015dd7dfba0f52e7032df5 commit 0764ea7721e6025956309a1a4d645ffac35af086 Author: Scott Sandre Date: Tue May 17 16:36:36 2022 -0700 Change Data Feed - PR 3 - DataFrame API See the project plan at https://github.com/delta-io/delta/issues/1105. This PR adds the DataFrame API for CDF as well as a new test suite to test this API. This API includes options "startingVersion" "startingTimestamp" "endingVersion" "endingTimestamp" "readChangeFeed" Misc. other CDF improvements, too, like extra schema checks during OptTxn write and returning a CDF relation in the DeltaLog::createRelation method. Closes delta-io/delta#1132 GitOrigin-RevId: 7ffafc6772fc314064971d65d9e7946b7a01de64 GitOrigin-RevId: b901d21804fe7aaecd6bb2e03cb33c76e19ae2ad commit 6712ce81d799066f3bcd8d06fe6fcb10113757e5 Author: John O'Dwyer Date: Tue May 17 12:59:56 2022 -0700 Adds image storage example - Resolves #1012 Signed-off-by: John ODwyer ## Description Adds an image storage example to the python examples Resolves #1012 It was not because it just adds an example ## Does this PR introduce _any_ user-facing changes? 
No Closes delta-io/delta#1067 Signed-off-by: allisonport-db GitOrigin-RevId: 152bbf419c5343115e6972fd0a867e42786c9e17 commit ad1b2634ab856cf48fa9bf0de94e674affa88860 Author: Scott Sandre Date: Tue May 17 10:43:27 2022 -0700 refactor UPDATE command Refactor UpdateCommand so that we perform only one call to `rewriteFiles` instead of two. This simplifies how we parse the rewritten files (and their corresponding actions) to generate metrics. This will make the code changes for CDF for the UPDATE command simpler. Existing unit tests. GitOrigin-RevId: 5eb77a0b84afb71929fa5037dec72a4cb609c7d4 commit ec40367dbb1c9762e47ccd4ec4bc155bb71ee900 Author: Scott Sandre Date: Mon May 16 18:29:26 2022 -0700 Re-enable a test after the metric collection was fixed. GitOrigin-RevId: 293394774f4a6f3f605f137fcb7d8a4ddabd5401 commit ae3d551b096dead5cf2ed39f6b81a88002a70218 Author: Jackie Zhang Date: Mon May 16 13:52:11 2022 -0700 Enable drop column by default Existing test. GitOrigin-RevId: 2560a009783db6176b7b287372870d87ea75d4a6 commit 233727c61898acd7b988c51b14b3ff11a63fdd3a Author: Scott Sandre Date: Fri May 13 19:47:58 2022 -0700 Fix `DeleteCommand` `numCopiedRows` metric This metric was not being properly set in `DeleteCommand.scala` and was causing a test failure. Existing UT. GitOrigin-RevId: b693cdbc0ea45caee87016ebceff6c2d1a57aa36 commit 8f185adf4208a6362b71dc25a5aad8a6bcc8887e Author: Scott Sandre Date: Fri May 13 16:55:45 2022 -0700 Change failing data skipping metrics test to `ignore` Mark `DescribeDeltaHistorySuite > operation metrics - delete` test as `ignore` since it is failing and will take some time to investigate why. GitOrigin-RevId: f83d63cd0859c8f154a1f22b1201383aee9dead9 commit 6661d4e49843a4577cbaaa347aa4004f6780d261 Author: Scott Sandre Date: Fri May 13 14:14:55 2022 -0700 Add data skipping tests with column mapping enabled. 
GitOrigin-RevId: 680dfb54c121bfc460df4ef85215a6a21088806e commit fdf054b6396a0ba5ec31ed7d829d02b1f8e22192 Author: Ole Sasse Date: Fri May 13 09:13:20 2022 +0200 Do not overwrite numCopiedRows for DELETE commands During the collection of Metrics for an OptimisticTransaction, the value for "numCopiedRows" was overwritten. This was done to fix errors in metric collection code. There are already sufficiently many tests for the different metrics. The operational metrics for DELETE were wrong and differed from the correct values in the usage logs. This has been fixed. GitOrigin-RevId: 73fab1d54e094c897857a28b7fa5583f102411d7 commit 6bc46e70bc44d23044a6d1830a864e81b1a94f9c Author: Adam Binford Date: Thu May 12 20:02:59 2022 -0700 Multipart checkpoints Resolves delta-io/delta#837 Adds a config that sets the maximum number of actions to include in a single file of a multipart checkpoint. If the total number of actions in a snapshot is larger than this, the checkpoint will be split up into multiple parts. It is disabled by default so the behavior is purely opt-in, as most users probably don't know about actions or how to set this value. In the future it could be set to some sane default value if there's some consensus on what it should be. Closes delta-io/delta#946 Signed-off-by: Venki Korukanti GitOrigin-RevId: b57610d842256bdacd10a6986a71758de777f48d commit 7103115962ab795272d9a259b0c069c277777939 Author: Scott Sandre Date: Thu May 12 15:04:28 2022 -0700 Change Data Feed - PR 2 - DELETE command See the project plan at https://github.com/delta-io/delta/issues/1105. This PR adds CDF write functionality to the DELETE command, as well as a test suite `DeleteCDCSuite`.
At a high level, during the DELETE command, when we realize that we need to delete some rows in some files (but not the entire file), then instead of creating a new DataFrame which just contains the non-deleted rows (thus, in this new delta version, the previous rows were logically deleted), we instead partition the DataFrame into CDF-deleted rows and non-deleted rows. Then, when writing out the parquet files, we write the CDF-deleted rows into their own CDF parquet file, and we write the non-deleted rows into a standard main-table parquet file (same as usual). We then also add an extra `AddCDCFile` action to the transaction log. Closes delta-io/delta#1125. GitOrigin-RevId: 7934de886589bf3d70ce81dcf9d7de598e35fb2e commit 8e0623398c8cf4971869a34f00b5501310b93a6f Author: Kam Cheung Ting Date: Thu May 12 13:49:43 2022 -0700 Minor refactor to actions.scala GitOrigin-RevId: 3fa59fe87d2204d6222f562c4ec97722e717c3a2 commit 76f1b74a0185bc50d313369dc413be1879fd5aa7 Author: Allison Portis Date: Thu May 12 12:54:53 2022 -0700 Minor refactor to DeltaErrors.
- Removes an error that no longer exists - Fix `columnMappingAdviceMessage`, which wasn't a formatted string (and thus it also wasn't being caught for being omitted from `errorsWithDocsLinks`) - Make `DeltaErrors.errorsWithDocsLinks` and `DeltaErrorsSuite.errorsToTest` consistent - Enforce that errors added to `DeltaErrors.errorsWithDocsLinks` are added to `DeltaErrorsSuite` GitOrigin-RevId: 2aae86bc5936416620eebb18986935c8d9ba9fe0 commit e0b6efebc8dace20458345929f31cf2bd95b71c0 Author: Serge Rielau Date: Wed May 11 17:42:13 2022 -0700 Replace more '%s' in error messages Replace %s pattern in error messages with symbolic tags GitOrigin-RevId: 89dce4f14379b6c2961da2b070a1337c4d7c4563 commit fb5411a76f12b9b1d406a8d9179415062400df43 Author: Liwen Sun Date: Wed May 11 15:48:16 2022 -0700 Minor refactor to DeltaCommand.scala GitOrigin-RevId: 6493ac01857ad110450cd88556013b9915fd2b69 commit 6b0f13c4b4b3fd9f58b0cec5ad75370a09798b2c Author: Serge Rielau Date: Wed May 11 09:57:17 2022 -0700 Prefix Delta error classes with DELTA GitOrigin-RevId: 7198663d17b497ab1918fb83124a4e3c73127c39 commit 205d0440906c1fbd8f4ba65bef44a061f9c13c26 Author: Jackie Zhang Date: Tue May 10 18:58:01 2022 -0700 Minor refactor to OptimizeTableCommand.scala GitOrigin-RevId: 20b5dd2538fa9ed38c74fc2cc31ec85ed00952a4 commit 57207c8acd4bbe7ad2069ed19a940fd215a09727 Author: Shixiong Zhu Date: Tue May 10 18:22:16 2022 -0700 Block unsupported data type when updating a Delta table schema Currently Delta doesn't do a data type check, hence any data type added by Spark will be supported by Delta automatically. This causes Delta to support the following types unintentionally: - YearMonthIntervalType - DayTimeIntervalType - UserDefinedType In order to prevent such issues from happening, this PR will: - Add a data type check to only allow the following data types in a Delta table.
The data types defined in the [Protocol](https://github.com/delta-io/delta/blob/6905ce757f67935960a9a13ecb6854d53c117d31/PROTOCOL.md#schema-serialization-format). - YearMonthIntervalType - DayTimeIntervalType - UserDefinedType - Add an internal flag `spark.databricks.delta.schema.typeCheck.enabled` to allow users to disable the check in case it’s needed. - Any new data type added in future will be blocked by default. - `TimestampNTZType` will be rejected given that a user cannot read/write a Delta table using `TimestampNTZType` today. New added tests. No. This is not a user facing change because we don't expect this would break any existing workflows. Closes https://github.com/delta-io/delta/pull/1119 GitOrigin-RevId: 685cae655553e72a05f9e36a2c9a5d839cbab7b6 commit d90f90b6656648e170835f92152b69f77346dfcf Author: Scott Sandre Date: Tue May 10 15:40:01 2022 -0700 Delta Lake CDF - PR 1 - Batch Reading + DelayedCommitProtocol This PR introduces a key class used for CDF and CDC in DeltaLake: `CDCReader`. As the class docs say, ``` The basic abstraction here is the CDC type column defined by [[CDCReader.CDC_TYPE_COLUMN_NAME]]. When CDC is enabled, our writer will treat this column as a special partition column even though it's not part of the table. Writers should generate a query that has two types of rows in it: the main data in partition CDC_TYPE_NOT_CDC and the CDC data with the appropriate CDC type value. ``` We also add `CDCReaderSuite` which tests very basic functionality of the CDCReader. We also update `DelayedCommitProtocol` to understand and handle cases when CDC is enabled. For example, it knows to keep track of the added CDC files during a write, and how to properly partition CDC data from main table data. 
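
The row split described above can be illustrated with a minimal, dependency-free sketch. It is not the `CDCReader` or `DelayedCommitProtocol` implementation: the column name `_change_type`, the use of `None` for `CDC_TYPE_NOT_CDC`, and the helper `split_cdc_rows` are assumptions made for illustration only.

```python
from typing import Any, Dict, List, Optional, Tuple

# Assumed values for illustration; the commit message only names the constants.
CDC_TYPE_COLUMN_NAME = "_change_type"
CDC_TYPE_NOT_CDC: Optional[str] = None

Row = Dict[str, Any]


def split_cdc_rows(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Split writer output into main-table rows and CDC rows (hypothetical helper)."""
    main_rows: List[Row] = []
    cdc_rows: List[Row] = []
    for row in rows:
        cdc_type = row.get(CDC_TYPE_COLUMN_NAME, CDC_TYPE_NOT_CDC)
        if cdc_type is CDC_TYPE_NOT_CDC:
            # Main data: the CDC type column is not part of the table, so drop it.
            main_rows.append(
                {k: v for k, v in row.items() if k != CDC_TYPE_COLUMN_NAME}
            )
        else:
            # CDC data: keep the change type so a reader can tell what happened.
            cdc_rows.append(dict(row))
    return main_rows, cdc_rows


rows = [
    {"id": 1, CDC_TYPE_COLUMN_NAME: None},
    {"id": 2, CDC_TYPE_COLUMN_NAME: "delete"},
    {"id": 3},
]
main_rows, cdc_rows = split_cdc_rows(rows)
```

In this sketch the two lists stand in for the two kinds of output files: `main_rows` would go to regular table data files and `cdc_rows` to separately tracked CDC files. Keeping the change type only on CDC rows is a simplification chosen here, not something the commit message specifies.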
GitOrigin-RevId: 380e23dba37075a726f5a307f3ab03d24f7877bf commit 9f46b9c3dadb237d5f80bc05ea0e21deb0f98be4 Author: lzlfred Date: Tue May 10 12:34:19 2022 -0700 Minor refactor to SnapshotManagement GitOrigin-RevId: 8c0db15b11f30d1b9194a454bcf0be160c4a5d85 commit 15d7d5456b65ae6092b8db8ee04cab5a6f3e836c Author: Jackie Zhang Date: Tue May 10 11:50:17 2022 -0700 Minor refactor to ConvertToDeltaSuiteBase GitOrigin-RevId: 0632df20068dba7c0aaf8a67e0b1e240ca559083 commit d82c37bc0a3fa1d0c839438500e464b22579ac6d Author: Lars Kroll Date: Tue May 10 09:46:41 2022 +0200 Minor refactor to ActionSerializerSuite GitOrigin-RevId: a0acc3769d71850a76d24e9eb92968559e389d2a commit 9d49cb221e7eaee54d3cc3a076704cfebdbd0877 Author: burakyilmaz321 Date: Mon May 9 10:02:07 2022 -0700 Add python api for optimize ## Description Adds optimize command to the Python API. - New class in ``tables.py``: ``DeltaOptimizeBuilder`` - New method: ``DeltaTable.optimize`` Resolves https://github.com/delta-io/delta/issues/1080 Added new unit tests to ``test_deltatable.py``: - ``test_optimize`` - ``test_optimize_w_partition_filter`` ## Does this PR introduce _any_ user-facing changes? No Closes delta-io/delta#1091 Signed-off-by: Venki Korukanti GitOrigin-RevId: 93ede77656227f068c1cd32157e6cc46a78922e8 commit bcc8d57422a5113b99dd8d6672f57952554618d9 Author: Jacek Laskowski Date: Mon May 9 08:41:21 2022 -0700 Code cleanup 1. Hiding internal private method (following up the comment) 1. Fix to scaladoc `sbt compile`. Waiting for the repo build to execute tests. No Closes delta-io/delta#1112 Signed-off-by: Scott Sandre GitOrigin-RevId: 82991cf8acd9aa4085223a99be1604084ece0b79 commit cacc4c644ce72f97c7150d83da60af6498c68883 Author: Prakhar Jain Date: Fri May 6 20:02:37 2022 -0700 Add checksum logic in LAST_CHECKPOINT file The LAST_CHECKPOINT file is overwritten after Delta writes a checkpoint. This can lead to different type of issues when concurrent writers/readers are accessing the file. e.g. 
split reads This PR adds a checksum to the LAST_CHECKPOINT_FILE. Changes include: 1. Canonicalization of json for generating checksum (to take care of whitespaces/json field reordering etc) 2. Update Delta's PROTOCOL.md 3. Calculate and write checksum while writing the LAST_CHECKPOINT_FILE 4. Calculate and validate the checksum while reading the LAST_CHECKPOINT_FILE Process: - The checksum is calculated by converting the CheckpointMetadata to json first, then computing the checksum of that json String and then embedding the checksum back into the json. Then this content is written to the LAST_CHECKPOINT file. - On the read side, we read the content of the LAST_CHECKPOINT file, then extract the checksum from it. This checksum is called the `storedChecksum`. This checksum is then removed from the json and the checksum is recalculated on the remaining json String to produce the `actualChecksum`. The `actualChecksum` is compared against the `storedChecksum` to do the validation. Added UTs Closes delta-io/delta#1114 Closes delta-io/delta#1117 GitOrigin-RevId: 8832e527b46b1b2d732f134ee86028e5aebadff0 commit 6019a8c934913091510eb224cfb48692afc465b7 Author: Lars Kroll Date: Fri May 6 09:48:24 2022 +0200 Various refactors and code cleanup GitOrigin-RevId: d94fda5f16ecaf2ee9936629d8a24a3af5b9d60a commit 122fb0954f86c3a08ab1a98b276f809a2be398ff Author: Naga Raju Bhanoori Date: Thu May 5 21:28:46 2022 -0700 Refactor ninth set of 20 Delta error messages This PR refactors 20 exceptions of the delta code base to follow the new error handling framework that better organizes exceptions and makes them queryable. This commit refactors the ninth set of 20 delta error messages. Refactored existing exceptions into functions in DeltaErrors.scala and unit tested those functions as part of DeltaErrorsSuite.scala.
GitOrigin-RevId: 8820e122948669e68f5f06eaaea2b101ce42518a commit 236b483ded91d83310f3877914fb9b3bf99e32dd Author: Serge Rielau Date: Thu May 5 20:55:14 2022 -0700 Add to more error-classes.json files Replace %s markers in error-classes files with \ GitOrigin-RevId: a80193511050d6d140b9cb8a9f7a6cbc588dc573 commit 27ea61adc706f45a4c368e1acab1684600b5e00d Author: Kam Cheung Ting Date: Thu May 5 11:42:04 2022 -0700 Refactor InvariantViolationException to DeltaInvariantViolationException This PR fixes the error-class-not-found exception raised while handling InvariantViolationException by deriving a new error class, DeltaInvariantViolationException, from InvariantViolationException. GitOrigin-RevId: 767ca29486d408bb273f8ac328d75107e1ba03ff commit fb6696fee2255bd7d3681a578ee6e139f9fffbed Author: ericfchang Date: Wed May 4 21:48:15 2022 -0700 Move convert to delta errors to error classes GitOrigin-RevId: d906caa0638de994e8b0b1c4d3d2cca2911bc8d6 commit 4e6a13c403d417e259f3d5b4519bcd47dc8216dc Author: Rajesh Parangi Date: Wed May 4 16:38:24 2022 -0700 Minor refactor to tests in DeltaSourceTableAPISuite.scala GitOrigin-RevId: b938ca69a542a8e70814c8603edcb16e468ed59c commit 925bd79bcae9b0c85e7cd7342e4686e7be2a21d5 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Thu May 5 14:14:30 2022 -0700 [324] Fix examples (#335) * fix examples and run_examples.py and test * update comment in run_examples.py commit 23e479b2b539c638b583ed11dbe79794cc1f9d5e Author: Venki Korukanti Date: Wed May 4 15:47:14 2022 -0700 Fix python test failures ## Description Currently two python tests are failing on master branch (986ab1cfa46c93f848ab682880b18dbd5b193059) due to incorrect exception handling. These tests were added as part of 462418c77b0165fbdb1b28d772087a91b559591f. These tests rely on [Apache Spark 3.3 python exception handling](https://github.com/apache/spark/blob/master/python/pyspark/sql/utils.py#L156). Currently Delta relies on Apache Spark 3.2.1.
Until Delta Lake upgrades to Spark 3.3, copy the exception handling cases from Apache Spark 3.3 to Delta Lake python exception handling. Ran the tests in CI and verified all Python tests pass. Closes delta-io/delta#1110 Signed-off-by: Venki Korukanti GitOrigin-RevId: 2f9938cf1f9816a0ed5b612a612a8216a5ee24a0 commit 90736d54c9403bcaff8a03aa130c6f8072706218 Author: Scott Sandre Date: Wed May 4 15:07:51 2022 -0700 Minor refactor to DeleteCommand.scala. GitOrigin-RevId: a1beb1d213272b5c1a79171599e7e171d9208a0c commit 4439e414074d10972373309f42709abb2b834f77 Author: lzlfred Date: Wed May 4 10:01:46 2022 -0700 Minor refactoring to snapshot management code GitOrigin-RevId: af8138f52847712a7eea0722c0f4b58d9d063eea commit d596f6a8177c85562ac78ab4bad88586182dccca Author: Xinyi Date: Wed May 4 08:30:35 2022 -0700 Minor test refactoring GitOrigin-RevId: 0139e0fe77e7ec47953f7ba6351bfb50ce3efe96 GitOrigin-RevId: 0ff74fef932d3c746c1a04b702efb3e3fb6b9c96 GitOrigin-RevId: 4a6756a2b8387c52b7fe110144259178e072d894 commit 5c572380076f95017ef964d113db45e38989f6c7 Author: Naga Raju Bhanoori Date: Wed May 4 01:51:59 2022 -0700 [Delta] Refactor eighth set of 20 Delta error messages This PR refactors 20 exceptions of the delta code base to follow the new error handling framework that better organizes exceptions and makes them queryable. This commit refactors the eighth set of 20 delta error messages. Refactored existing exceptions into functions in `DeltaErrors.scala` and unit tested those functions as part of `DeltaErrorsSuite.scala`.
Few exceptions invoked part of `catalyst` weren't included in `DeltaErrors.scala` GitOrigin-RevId: fe0f4a86a035b392f860cb3a3656904979e1b885 commit 77a9eb0ca4795e8827840e50ea783afe4aa8059d Author: Hussein Nagree Date: Wed May 4 11:11:05 2022 +0530 Minor refactoring GitOrigin-RevId: 92178efd53d74992f5160db53b3272f31be0c874 commit 6f645e819644fb035109abb2f97a9d8f469caff6 Author: Hussein Nagree Date: Wed May 4 10:56:05 2022 +0530 Minor refactoring in snapshot management code GitOrigin-RevId: cb986827861fa99db600e9e40a05fb3cdf25ed79 commit 81e8a919d2a80f7170a986053e4e3c0246d7a21b Author: Jackie Zhang Date: Tue May 3 10:01:47 2022 -0700 Minor refactoring in Delta DROP Column test suite GitOrigin-RevId: 66645b8ff641f5694502565aeafd39cde35ca35c commit cd8f898ab9e5a17df34a94d5372b968c376846d6 Author: lzlfred Date: Mon May 2 10:58:18 2022 -0700 Add additional fields to _last_checkpoint Add `sizeInBytes` and `numOfAddFiles` to `_last_checkpoint` to help delta look up checkpoint details. unit test Closes delta-io/delta#1109 GitOrigin-RevId: 2e07b1358888685ca8cff254f87d0f4f2de9226b GitOrigin-RevId: 50493fdaa3d54698b913ea4a8f5bf6194b5b9cf3 commit c0b372624a1ec72d91700c1b483b789afaf1f7c9 Author: kristoffSC Date: Thu May 5 00:04:15 2022 +0200 Flink Delta Source PR_9 - Builder (#331) * PR 9 Two Builders, hidde format Signed-off-by: Krzysztof Chmielewski * PR 9 Hide generic types in Sink builders. Signed-off-by: Krzysztof Chmielewski * PR 9 Refactor names for base abstract classes Signed-off-by: Krzysztof Chmielewski * PR 9 Refactor name for RowDataFormat Signed-off-by: Krzysztof Chmielewski * PR 9 Validation for Builders WIP Signed-off-by: Krzysztof Chmielewski * PR 9 Validation for Builders WIP Signed-off-by: Krzysztof Chmielewski * PR 9 Validation for Builders WIP - temporary remove of option values validation. 
Signed-off-by: Krzysztof Chmielewski * PR 9 Javadocs WIP Signed-off-by: Krzysztof Chmielewski * PR 9 Javadocs and tests Signed-off-by: Krzysztof Chmielewski * PR 9 Changes after code review. Signed-off-by: Krzysztof Chmielewski * PR 9 Changes after code review Signed-off-by: Krzysztof Chmielewski * PR 9 Changes after code review Signed-off-by: Krzysztof Chmielewski * PR 9 Changes after code review Signed-off-by: Krzysztof Chmielewski * PR 9 increase timeout for IT execution tests to 5 minutes. Signed-off-by: Krzysztof Chmielewski Co-authored-by: Krzysztof Chmielewski Co-authored-by: Scott Sandre Co-authored-by: Tathagata Das commit 986ab1cfa46c93f848ab682880b18dbd5b193059 Author: Fu Chen Date: Mon May 2 09:42:05 2022 -0700 DataSkippingReader - `EqualNullSafe` support Resolves https://github.com/delta-io/delta/pull/974#discussion_r825686186 Add `EqualNullSafe` support in data skipping |Expression| Behavior| | --- | --- | |`EqualNullSafe(Literal(...), a)`| this has the same behavior as `EqualNullSafe(a, Literal(...))` | |`Not(EqualNullSafe(Literal(...), a))`|this has the same behavior as `Not(EqualNullSafe(a, Literal(...)))`| |`EqualNullSafe(a, Literal(null, _))` | this will be optimized to `IsNull(a)` by rule `NullPropagation` | |`Not(EqualNullSafe(a, Literal(null, _)))` | this will be optimized to `Not(IsNull(a))` by rule `NullPropagation` and then optimized to `IsNotNull(a)` by rule `BooleanSimplification`| |`EqualNullSafe(a, NotNullLiteral(v, _))` | we will select files meeting `min <= v && max >= v` | |`Not(EqualNullSafe(a, NotNullLiteral(v, _)))` | we will select files meeting `min.isNull or max.isNull or nullCount.isNull or !(min === v && max === v && nullCount === 0)` | Closes delta-io/delta#1014 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 87aff325226525300444f251b97f263cc3253aef commit 29530ae4083be0047894473c6b1d575eaa8b8c80 Author: Lukas Rupprecht Date: Mon May 2 08:57:14 2022 -0700 [Delta] Refactor fifth set of 20 Delta error messages This PR refactors 
20 exceptions of the delta code base to follow the new error handling framework which better organizes exceptions and makes them easier to query. This commit refactors the fifth set 20 delta error messages. Refactored existing exceptions into functions in DeltaErrors.scala (where applicable) and unit tested those functions. GitOrigin-RevId: 7f43f408be2c0fcb3d14e2f2cc1a03b1f8f6f27a commit 198a4bb1e3009307b559bef790b6df5346b47014 Author: Adam Binford Date: Sun May 1 08:17:19 2022 -0700 Add Scala API for optimize Add functions to `DeltaTable` to perform optimization. API documentation: ``` /** * Optimize the data layout of the table. This returns * a [[DeltaOptimizeBuilder]] object that can be used to specify * the partition filter to limit the scope of optimize and * also execute different optimization techniques such as file * compaction or order data using Z-Order curves. * * See the [[DeltaOptimizeBuilder]] for a full description * of this operation. * * Scala example to run file compaction on a subset of * partitions in the table: * {{{ * deltaTable * .optimize() * .where("date='2021-11-18'") * .executeCompaction(); * }}} * * @since 1.3.0 */ def optimize(): DeltaOptimizeBuilder ``` ``` /** * Builder class for constructing OPTIMIZE command and executing. * * @param sparkSession SparkSession to use for execution * @param tableIdentifier Id of the table on which to * execute the optimize * @since 1.3.0 */ class DeltaOptimizeBuilder( sparkSession: SparkSession, tableIdentifier: String) extends AnalysisHelper { /** * Apply partition filter on this optimize command builder to limit * the operation on selected partitions. * @param partitionFilter The partition filter to apply * @return [[DeltaOptimizeBuilder]] with partition filter applied */ def where(partitionFilter: String): DeltaOptimizeBuilder /** * Compact the small files in selected partitions. 
* @return DataFrame containing the OPTIMIZE execution metrics */ def executeCompaction(): DataFrame } ``` Closes delta-io/delta#961 Fixes delta-io/delta#960 Signed-off-by: Venki Korukanti GitOrigin-RevId: 615e215b96fb9e9b9223d3d2b429dc18dff102f4 commit 462418c77b0165fbdb1b28d772087a91b559591f Author: Kam Cheung Ting Date: Fri Apr 29 13:57:26 2022 -0700 Add Test Case on exception handling to Pyspark In Delta Lake 1.2.0, we introduce DeltaAnalysisException and DeltaIllegalArgumentException, which derive from AnalysisException and IllegalArgumentException. Python UT for both new exceptions. GitOrigin-RevId: 130d8c04a6d6211c2dc256c56e9c7c0eb35dbd14 commit 392f8bbbed373e2e25b620df92fbc6c807b57517 Author: Jackie Zhang Date: Fri Apr 29 13:49:55 2022 -0700 Support creating Delta table with empty schema This PR supports creation of empty schema Delta tables by: 1. optionally turning off the existing check and 2. block read access to empty schema tables. New unit test. With this PR, we allow user to create **Delta** table with an **empty** schema using the following syntax: `CREATE TABLE table USING DELTA` Blocked: - Creating a table using an empty dataframe and `df.save()` won't be allowed - MERGE is blocked atm because there are no valid merge conditions if no columns is present. We are looking into supporting this in the future. - INSERT INTO is blocked unless `delta.schema.autoMerge.enabled` is turned on to allow schema evolution. 
Allowed: - `dataframe.save()` with `mergeSchema = true` GitOrigin-RevId: 640ab1b5c28c8f4c9cf32f00adc04853d550bdd7 commit b2d50490d7ee5282e53827f50dd62131b2ff6049 Author: Kam Cheung Ting Date: Fri Apr 29 13:49:14 2022 -0700 Additional tests for OPTIMIZE GitOrigin-RevId: f639239b858daadc1c09f6af399e94261245620a commit e83c0f2dffe4b8ee29d792417230a548ecc300a2 Author: Terry Kim Date: Fri Apr 29 12:19:49 2022 -0700 Minor refactoring GitOrigin-RevId: 12bafd0ab764440f3e7a6b430272149590b2b3ce commit 84ff14070a7590941136231929647b4ee527b4f8 Author: Terry Kim Date: Fri Apr 29 11:33:14 2022 -0700 Minor refactoring GitOrigin-RevId: 060fac6096fec66afcf3b4da8a874f85aea8831d commit 58b25108e1e275fa7e2ff4ba758184a19270cf15 Author: Liwen Sun Date: Fri Apr 29 10:16:33 2022 -0700 Drop column support in Delta column mapping mode Support ALTER TABLE DROP COLUMN as a metadata only operation. This feature is not ready for release yet, so it's currently behind a feature flag. Closes https://github.com/delta-io/delta/pull/1076 Unit tests GitOrigin-RevId: 181d6dc38f5c074eb16710191795118e78fdf972 commit bc1b550235f3bcd1e308473f2052911daad21d6a Author: Kam Cheung Ting Date: Fri Apr 29 02:02:39 2022 -0700 Minor refactoring GitOrigin-RevId: 4acc564dc6499f54986d9c74f80fced716295843 commit 11038cc4c0cdac102cc793ad56ff80f483fce23f Author: Shoumik Palkar Date: Thu Apr 28 20:26:43 2022 -0700 Minor refactoring GitOrigin-RevId: b529506832021b73375b5bab228890503417a3c8 commit acb33f0f8097f5b9962ea79ac7f81b42a55700af Author: Jacek Laskowski Date: Thu Apr 28 15:54:09 2022 -0700 README How to execute a single test suite and a test Add two more sample sbt command lines to show how to execute a single test suite as well as a single test (within a test suite). I consider it very helpful for new contributors (and I found it helpful myself too as I seem to have googled these commands all the time). 
Closes delta-io/delta#1102 Signed-off-by: Venki Korukanti GitOrigin-RevId: 1c6b612b063222493636636fd2e065f7b0aacd89 commit f8e6d8365fafbaa689c10dd47366ff2baa769a13 Author: Ankur Dave Date: Thu Apr 28 15:49:43 2022 -0400 Minor refactoring GitOrigin-RevId: c9573385923755406d7f566072c12b71a26e4fbf commit c71f3a513c48e1c83619fdfaaf3ad7c5ff8deee1 Author: Terry Kim Date: Wed Apr 27 19:19:32 2022 -0700 [Delta] Refactor seventh set of 20 Delta error messages Refactors 20 exception error messages to error-classes format. UT added in DeltaErrorsSuite. GitOrigin-RevId: 77019280f217343cf0fd7101559557a0c0ea31b2 commit 4085ad9dcb0224daa2a07361adffba4dc53efbec Author: Vini Jaiswal Date: Wed Apr 27 16:47:28 2022 -0700 Updated Contributing Guide 1. Updated the linux foundation delta chapter link. 2. Updated the Communication part to provide prescriptive guidelines for potential contributors. 3. Updated contributor metrics. Not Applicable. Users will see more prescriptive guidance and updated resources to contribute. Signed-off-by: Vini Jaiswal Closes delta-io/delta#1066 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 3f5f3ecf593fe489c5be1f94352978861da92268 commit 802f6d4fa53276321c8d610682e470e0dbc446f8 Author: Alkis Evlogimenos Date: Wed Apr 27 09:23:51 2022 +0200 Refactor sixth set of 20 Delta error messages Refactor existing exceptions in delta code base so that they are queryable. Existing and new unit tests. 
GitOrigin-RevId: 0d3cfc0a894d3cb946da8ba731e2a843d4da93bf commit 6c5f86461ff431f62cea3a2877c4663748e6a5d0 Author: Naga Raju Bhanoori Date: Tue Apr 26 20:02:30 2022 -0700 Minor refactoring GitOrigin-RevId: aca09e3288e3906edf62b88bcaf8d9546de38ea9 commit a83d9fcfda07d646a99ec286d69d535f9fabb860 Author: Naga Raju Bhanoori Date: Tue Apr 26 10:27:01 2022 -0700 Minor refactoring GitOrigin-RevId: f81d95c580eb6e71b00113064ec3b51acd80a6a7 commit 5cfe164705f5877fbf13544e4d4af962f19059d3 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Mon May 2 09:48:32 2022 -0700 Flink source feature branch (#315) - PRs 1 to 7 inclusive commit edd7c76838e1cfde8952e2d98fbfccdc3aa6a2f4 Author: Paweł Kubit <91378169+pkubit-g@users.noreply.github.com> Date: Thu Apr 28 21:20:01 2022 +0200 Fix examples and correct dependencies in the README (#327) commit 6905ce757f67935960a9a13ecb6854d53c117d31 Author: Scott Sandre Date: Tue Apr 26 08:16:35 2022 -0700 Fix config for S3 multi-cluster mode ## Problem Description - problem 1: there was a mismatch between the config params in the documentation (`io.delta.storage.S3DynamoDBLogStore.ddb.tableName`) and the config params actually used in the code + in the integration test (`io.delta.storage.ddb.tableName`). - solution 1: include the `S3DynamoDBLogStore` string in the config in the code. - problem 2: the `io.delta.storage` prefix didn't work when specified using `--conf`. they DID work using `spark.conf.set()` or `SparkSession.builder.config()` but not with `--conf`. - solution 2: we now allow 2 prefixes. - `spark.io.delta.storage.S3DynamoDBLogStore.$key.....` this will work all contexts (`--conf`, `spark.conf.set()`, etc). - `io.delta.storage.S3DynamoDBLogStore.$key`. this is the original prefix. this will be able to be used by delta-standalone and flink and hive. 
they just use hadoopConfs and don't need to have prefix starting with `spark` ## PR Description - resolves https://github.com/delta-io/delta/issues/1094 - update config for S3 multi-cluster mode (i.e. S3DynamoDBLogStore) to match the public documentation. the configs were missing the string literal `S3DynamoDBLogStore` from the conf prefix - now supports 2 confs. `spark.io.delta.storage.S3DynamoDBLogStore.$key` and `io.delta.storage.S3DynamoDBLogStore.$key`. - added unit test for the 2 new confs - ran integration test using existing + new s3 table, and using existing + new ddb table - ran manual tests using locally published jars on pyspark + spark-shell + spark-submit (spark-submit via integration test) - tested using `--conf` as well as `spark.conf.set` as well as `SparkSession.builder.config()` ## Does this PR introduce _any_ user-facing changes? Not really, just fixes a just-released, experimental LogStore config key Closes delta-io/delta#1095 Signed-off-by: Scott Sandre GitOrigin-RevId: e5db1e6b0dfe958e3234644462891f269313ca33 commit 4a1f1dc4c078c6d3ba74899bdd6413ce8e9f92e6 Author: Prakhar Jain Date: Mon Apr 25 10:53:45 2022 -0700 Minor refactor to PrepareDeltaScan GitOrigin-RevId: 0f73bc43a9965894c891a9ee3b80fd3860185c7e commit 9c9caaf2e0b7ab5a8ab12fae4ca627b8d839fee0 Author: gaurav-rupnar Date: Mon Apr 25 09:40:42 2022 -0700 Minor refactor to SchemaUtils GitOrigin-RevId: 49da22fbc76a4c3b6784a0feae3ea56da6dca7d7 commit 663819d71cc031dd45c63b68d0eb3a209027b540 Author: Tyson Condie Date: Sun Apr 24 10:20:13 2022 -0700 Refactor third set of error messages Refactors 20 exception error messages to error-classes format. UT added in DeltaErrorsSuite. GitOrigin-RevId: 47882917ae2d3eced7a116bff855d96ce1783bb9 commit 11fb2eadf3ead06fa1bcb049e7dcc925e21166e1 Author: Ole Sasse Date: Sun Apr 24 16:33:40 2022 +0200 Improve error message if delta table path does not exist The functionality was already sufficiently covered by tests, which have been adjusted. 
When the user uses a path based delta table as a source in a query and that path does not exist, the error message will now read "Path does not exist" instead of "not a Delta table". GitOrigin-RevId: a7a1b51e0a39903f1a5ff435e9d13786f3779d1d commit f50ecb6e0a3c18d7e2d0ac0600aa6a2d1d7fbdc7 Author: Shixiong Zhu Date: Fri Apr 22 17:06:45 2022 -0700 Remove unused DeltaConfigs Existing tests GitOrigin-RevId: 847f44295c5f3e7ec9f7e752d59680cfd604bf11 commit f950d534ce4c6eac6b94e790c45b19965d778eda Author: Denny Lee Date: Fri Apr 22 10:24:17 2022 -0700 Update CODE_OF_CONDUCT.md ## Description Fixed spelling of here, it looks like previous accidentally included "shipit" Reviewed the text ## Does this PR introduce _any_ user-facing changes? It does not. Closes delta-io/delta#1092 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 66ce1b97fae8dbc892d4e0b36118a5755922f537 commit 87e7758f4cfd26d8efe5cac19ba93f40afd4ba16 Author: Scott Sandre Date: Fri Apr 22 10:18:07 2022 -0700 Update DelegatingLogStore log statement to include LogStoreAdapter.logStoreImpl class ## Description Update the logInfo statement in DelegatingLogStore to include the actual implementation class of the LogStoreAdapter published jars locally and tested with pyspark Closes delta-io/delta#1089 Signed-off-by: Scott Sandre GitOrigin-RevId: 6520a6441dbad553ba08fefe2d87c26ba6799190 commit deab80da176aa77ad0dbdd3090305dfe2bf1c156 Author: lzlfred Date: Thu Apr 21 21:06:01 2022 -0700 Avoid getSnapshotAt in PreparedDeltaFileIndex getSnapshotAt can be very slow (up to 2 minutes in large delta log dir), and thus we pass the scannedSnapshot to PreparedDeltaFileIndex to avoid getSnapshotAt. existing UT. 
GitOrigin-RevId: f74cb8664247bf9571a865d9c7cfd3c4ec76daac commit 9009ca9d67ceddbf774ce2af5288257fa1efc36f Author: Ruslan Dautkhanov Date: Thu Apr 21 21:02:05 2022 -0700 Remove merge.maxInsertCount as not used It was brought up in delta-oss slack channel https://delta-users.slack.com/archives/CJ70UCSHM/p1626443655490700 Remove merge.maxInsertCount as it's not used Closes delta-io/delta#1090 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 20eb4dee4d7df82c9d8167dbbe43096f7f1affc1 commit b2c18e66e40e88d432d1d2cba5d2780a0d5acfc3 Author: lzlfred Date: Thu Apr 21 16:15:44 2022 -0700 Ensure DeltaLog is always created with dir ending with _delta_log This PR fixes all places the delta log is NOT ending with _delta_log by making the constructor private and thus only allow to create the log by calling forTable. GitOrigin-RevId: 01730b48dd8c432da59d8187d76c10ed846b5041 commit d1ad3ba34766e6a8582f9313256bdcbd74bbc68d Author: Wenchen Fan Date: Thu Apr 21 21:46:58 2022 +0800 Minor refactor to DeltaTableV2 GitOrigin-RevId: bc1c8024f4667a7fe015238e705c422e074e233d commit a9f2f85f7178be89a8fe183dd44369f45735862d Author: Adam Binford Date: Wed Apr 20 14:38:22 2022 -0700 Remove unused PartitionFiltering Closes https://github.com/delta-io/delta/pull/1077 Signed-off-by: allisonport-db GitOrigin-RevId: 4fabf7a96427870721801a6d6e44038491770990 GitOrigin-RevId: e29451c7a18f4cdc4162c57f86405cf36b451089 commit 7be147564baece6bcfbd74833c5d2ca6ac11f53e Author: Sabir Akhadov Date: Wed Apr 20 16:39:40 2022 +0200 Delete trait that is no longer used GitOrigin-RevId: 0f26075d179d24baab8207219b7f70d9a0634a24 commit f7a9b2bc5ffbe13a7a336a3bff3f9c72cde64834 Author: Kam Cheung Ting Date: Wed Apr 20 00:35:44 2022 -0700 Minor refactor to OptimizeStats and OptimizeMetricsSuite GitOrigin-RevId: a5c1620aa6da726209cf86184698bb057adde111 commit ccae0729d4aed8446e5b6582ec60b6c00f144633 Author: Rajesh Parangi Date: Tue Apr 19 22:46:09 2022 -0700 Minor refactor to DeltaTimeTravelSuite GitOrigin-RevId: 
3f1cf666cffb4452d030ed3975e87c7d1a7dcaaa commit f68512685ca5fc7639df60f286d589d66b28baf5 Author: lzlfred Date: Tue Apr 19 22:20:38 2022 -0700 Minor refactors in DeltaVacuum and DescribeDeltaHistory suites GitOrigin-RevId: 2a37bc43ec1e05d2695e81f813b09691a6777d93 commit a6c161dc02454879bcf19a4a814fa13050b3cd4a Author: Karen Feng Date: Tue Apr 19 20:57:29 2022 -0400 Change default behavior of DROP CONSTRAINT to differ from IF EXISTS Changes the behavior of DROP CONSTRAINT to throw an error by default if the constraint does not exist. The error will not be thrown if the user provides the argument IF EXISTS. Unit test GitOrigin-RevId: 969e7d41899ed911f596ee2c9221e6c73e2d444d commit b664c7a923bff87fc25b9bcb0b8cb3d09246f4b9 Author: Chang Yong Lik Date: Mon Apr 18 21:27:06 2022 -0700 Added default LogStore implementation for the .gs scheme ## Description This PR adds the gs scheme to the default LogStore implementation. Resolves https://github.com/delta-io/delta/issues/1068 Unit test and manual testing with GCS (details [here](https://github.com/delta-io/delta/pull/1071#issuecomment-1101812886)) ## Does this PR introduce _any_ user-facing changes? 
Yes, users can automatically derive the GCS log store conf based on the path Closes delta-io/delta#1071 Signed-off-by: Venki Korukanti GitOrigin-RevId: 5e877a8a9723bc5cae253cd27924da4aeadcb7c8 commit 4692222bcc3fe73c81a7ee1c9c1f887a99026c90 Author: Tyson Condie Date: Mon Apr 18 16:34:42 2022 -0700 Minor refactor to Checksum.scala GitOrigin-RevId: bd8baa9b5bf4df11fb829e22fa9bbebed5682ed5 commit dde57a3b330ed34a9c71c09977c53ee394ef8c31 Author: Rahul Shivu Mahadev Date: Mon Apr 18 12:25:56 2022 -0700 Fix generate symlink manifest to delete partitions that contain spaces - Problem: URI paths were being compared with regular paths - Fixed by decoding the URI path to get the regular path - Added unit test GitOrigin-RevId: 6ddf2dd4b2292aa8da7970fe7a37afed2523433c commit 81d56794b1ea1032b1d7726c343325661e8ed20f Author: Tyson Condie Date: Mon Apr 18 11:51:24 2022 -0700 Refactor second set of 20 Delta error messages. This PR refactors the existing exceptions of the Delta code base so that they are well organized and queryable. This commit refactors the second set of 20 Delta error messages. Tests added to DeltaErrorsSuite covering the revised error messages for this task. GitOrigin-RevId: 0027bea90515f4c0c18627c637598ba86170def2 commit 3f2249534de9112f359387c10d623d3eb5784439 Author: Kam Cheung Ting Date: Sun Apr 17 01:06:23 2022 -0700 Resource loading error with delta-spark This PR fixes the resource loading error in delta-pyspark. The issue is caused by the root path of pyspark being different from that of spark. In this PR, we use `Utils.getContextOrSparkClassLoader` instead of `Utils.getSparkClassLoader`. Added a python test to ensure that error class resource loading succeeds. 
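A minimal Java sketch of the class loader fallback idea behind the resource loading fix, assuming `Utils.getContextOrSparkClassLoader` behaves as its name suggests (prefer the thread's context class loader, otherwise fall back to a fixed one). The class and method names below are hypothetical and only use the JDK; this is not the Spark or Delta implementation.

```java
// Hypothetical illustration only: not Spark's Utils and not Delta code.
final class ResourceLoadingSketch {

    /**
     * Prefers the current thread's context class loader and falls back to the
     * loader that defined this class when no context loader is set.
     */
    static ClassLoader contextOrDefaultClassLoader() {
        ClassLoader context = Thread.currentThread().getContextClassLoader();
        return context != null ? context : ResourceLoadingSketch.class.getClassLoader();
    }

    /** Returns true if the named resource is visible through the chosen loader. */
    static boolean resourceExists(String name) {
        return contextOrDefaultClassLoader().getResource(name) != null;
    }

    public static void main(String[] args) {
        // The resource name here is made up; a real caller would pass its own file name.
        System.out.println(resourceExists("no/such/resource-for-this-sketch.json"));
    }
}
```

The point of the fallback is that a resource packaged with the application may only be visible through the context class loader when the host process (for example a Python-launched JVM) sets up class paths differently.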
GitOrigin-RevId: 1b93cebb936e324d55cf2f3f18f5576f714807cc commit 6cecd5a2917162027d6da5e6e3cb5a37c9b04fc3 Author: ericfchang Date: Fri Apr 15 11:47:08 2022 -0700 Minor refactor to DeltaErrors.scala GitOrigin-RevId: 483d849a61f5623f87482d5f847c10bccde7ea8c commit 771892fc9543f4926088e3f9cde7b73e074251ac Author: Rahul Shivu Mahadev Date: Fri Apr 15 11:22:55 2022 -0700 Make the Merge Set accumulator thread safe - Merge uses an accumulator to get the list of matching AddFiles that is not thread safe and can lead to tasks failing in very large merges. - This PR aims to fix this issue using `java.util.Collections.synchronizedSet` GitOrigin-RevId: b4bf64ff3e39aab239e50c81c7b217da72fb78e6 commit 5a829a2550c17effc379581db93fd9e2927f80a4 Author: Hussein Nagree Date: Fri Apr 15 09:57:30 2022 -0700 Fix snapshot version used to log CommitStats I found an issue when we were reporting commit stats after a checkpoint for large commits. We were grabbing the current snapshot instead of the one resulting from our commit, which could result in incorrect stats being logged (since the current snapshot can easily change while the checkpoint is in progress). 
Existing tests GitOrigin-RevId: feb0af2101e630fbc0b9b748db5f4f39e386ca73 commit 47f50a28f04e638d9cd0918ccb53cf9aee73d089 Author: Junyong Lee Date: Thu Apr 14 09:49:38 2022 -0700 Minor refactor to Checkpoints.scala GitOrigin-RevId: 06b12e1ea5564d363dfff8caae083254d46552e8 commit 30893b982883b71066d0ca91846827901c2869e3 Author: Paweł Kubit <91378169+pkubit-g@users.noreply.github.com> Date: Mon Apr 25 22:19:53 2022 +0200 Add Flink metrics to the DeltaSink (#328) commit 475451f0dbd9c3dc31e3bc1040f738a99fc2c5c4 Author: Scott Sandre Date: Wed Apr 13 10:45:20 2022 -0700 ExternalLogStoreSuite: add test that writing N+1 will first recover version N Add test cases to `ExternalLogStoreSuite` to ensure that, when writing version N+1 - if version N doesn't exist in the file system, and its external entry is marked as complete, then throw error - if version N doesn't exist in the file system, and its external entry is marked as incomplete, then recover version N before committing version N+1. Unit test. Closes delta-io/delta#1057 Signed-off-by: Scott Sandre GitOrigin-RevId: 0abe1bc5a31aeb8f00d9e737350bdb4e82b02134 commit 33b0ed3f40f161412224112cbc3490549415c67d Author: Venki Korukanti Date: Wed Apr 13 08:57:02 2022 -0700 Upgrade version to 1.3.0-SNAPSHOT Given the branch for 1.2 release is cut, upgrade the version to 1.3.0-SNAPSHOT on master branch Closes delta-io/delta#1058 Signed-off-by: Venki Korukanti GitOrigin-RevId: f528e0bc8b2f67857c3a42140420567102da4c48 commit c33fa72212b888eda0c7f8f0804adcfbe030e112 Author: Venki Korukanti Date: Wed Apr 13 07:16:30 2022 -0700 Update Delta version to 1.2.0 in integration tests Upgrade the version in integration tests to test the Delta Lake 1.2.0 release candidates. 
Closes delta-io/delta#1065 Signed-off-by: Venki Korukanti GitOrigin-RevId: 2b7737d03db557f6e63dc63ab3b4b62821395d98 commit 78f319f5f1655473889000181dc815927b77d02e Author: Jasraj Dange Date: Wed Apr 13 02:08:47 2022 -0700 Move the invalid chars in column name error to new error message framework GitOrigin-RevId: 5fc80d534ee391894ce8a42a241638999f9b999f commit 8b026ff3594e5a1b2b297bb9d1ac49317431d030 Author: Bart Samwel Date: Wed Apr 13 09:10:22 2022 +0200 Minor test refactoring GitOrigin-RevId: d63e0225eca6e65554706f2a7ade95f49fd1c2ae commit 0ecdc9a576568d8215e526c3063571852a723826 Author: Tyson Condie Date: Tue Apr 12 16:01:10 2022 -0700 Minor code refactoring GitOrigin-RevId: 3f7b87ecb4a5e69cbdc9e7f0af9b24a1115395df commit 912ae774869373bb6eef1e4c7d6e8e2fd5da56ea Author: Rajesh Parangi Date: Tue Apr 12 11:07:03 2022 -0700 Fix a bug in comparing checkpoints GitOrigin-RevId: 081f73586267ffb80f294d49dd666633119e27f1 commit 3d2e93ce4ff64aae19ff3120024f6a26f0e56b64 Author: Adrian Ionescu Date: Tue Apr 12 08:10:21 2022 +0200 Minor refactoring to LogStoreSuite GitOrigin-RevId: 16cb1eab132c079d9e043a38da374daa03d5ce20 commit 03353ac757284ed05cd79b03ad6bc72a2152119c Author: Wenchen Fan Date: Tue Apr 12 09:27:11 2022 +0800 Always use non-ANSI cast when extracting partition values GitOrigin-RevId: a36f95ae09ab472cf2173ca4eae686a92ac0e3e5 commit b8d83c638b0d6e4d87aeba353238abc6ab4d342d Author: Shixiong Zhu Date: Mon Apr 11 16:18:17 2022 -0700 Minor refactoring to lazily initialize the file histogram schema GitOrigin-RevId: f2886a332b12843d69e0e1a202eafcd8a6274881 commit 32f4a0f6c44d1aa0f3983fce206cae0acaf0e4e2 Author: Lukas Rupprecht Date: Sat Apr 9 12:43:10 2022 -0700 Revert the changes to exception type in SchemaMergingUtils Change was introduced in https://github.com/delta-io/delta/commit/a3a07e33e655cec741f81e8caa4e629ef180a5a9 and it is breaking the protocol. 
GitOrigin-RevId: d48d753fe40f71db61a752ee1d3ebe8fc3d3b5c8 commit 9bbaefb0103d12ca9b869ac16da55da932f0a7ac Author: Paweł Kubit <91378169+pkubit-g@users.noreply.github.com> Date: Mon Apr 11 23:04:38 2022 +0200 Flink Delta Sink - fix cluster deployment issue for the Flink Delta Sink (#320) * Fix cluster deployment issue for the Flink Delta Sink and add README.md note * Update README.md Co-authored-by: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> commit 3fe6f7a1bf6481cd588d96eba1822e040b7d2506 Author: Andrew Olson Date: Thu Apr 7 15:32:15 2022 -0700 Add support for Spark DataFrameWriter maxRecordsPerFile option Today, parquet supports the [maxRecordsPerFile](https://github.com/apache/spark/pull/16204) option to limit the max number of records written per file so that users can control the parquet file size to avoid humongous files. For example, ``` spark.range(100) .write .format("parquet") .option("maxRecordsPerFile", 5) .save(path) ``` The above code will generate 20 parquet files and each one contains 5 rows. This is missing in Delta. This PR adds the support for Delta by passing the `maxRecordsPerFile` option from Delta to ParquetFileFormat. Note: today both Delta and parquet support the SQL conf `spark.sql.files.maxRecordsPerFile` to control the file size. This PR is just adding the `DataFrameWriter` option support to mimic the parquet format behavior. 
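The file count in the `maxRecordsPerFile` example above can be checked with a small, hedged Java sketch. It assumes each write task splits its own rows independently and that a non-positive limit means "no limit"; the class and method names are hypothetical, and how empty tasks are handled is an assumption, not something the commit states.

```java
// Hypothetical arithmetic sketch: not part of Delta or Spark.
final class MaxRecordsPerFileSketch {

    /** Files written by one task holding `rows` rows under the given per-file limit. */
    static long filesForTask(long rows, long maxRecordsPerFile) {
        if (rows < 0) {
            throw new IllegalArgumentException("rows must be non-negative");
        }
        if (rows == 0) {
            return 0; // assumption: an empty task contributes no data file
        }
        if (maxRecordsPerFile <= 0) {
            return 1; // assumption: non-positive limit disables splitting
        }
        // Ceiling division: the last file may hold fewer rows than the limit.
        return (rows + maxRecordsPerFile - 1) / maxRecordsPerFile;
    }

    /** Total files across tasks, since each task rolls over to a new file on its own. */
    static long totalFiles(long[] rowsPerTask, long maxRecordsPerFile) {
        long total = 0;
        for (long rows : rowsPerTask) {
            total += filesForTask(rows, maxRecordsPerFile);
        }
        return total;
    }

    public static void main(String[] args) {
        // 100 rows with a limit of 5 rows per file, written by a single task.
        System.out.println(filesForTask(100, 5));
    }
}
```

Note that when rows are spread unevenly over tasks, the total can exceed `ceil(totalRows / limit)`, because every task may end with a partially filled file.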
Fixes #781 Closes delta-io/delta#1017 Co-authored-by: Andrew Olson Signed-off-by: Shixiong Zhu GitOrigin-RevId: 02af2c40457fe0acc76a31687e4fd6c47f3f2944 commit 952f25b04956b323c37ae273a3671c5f9632a03c Author: Scott Sandre Date: Thu Apr 7 12:28:02 2022 -0700 Updates to delta-storage/-s3-dynamodb artifacts and java/scaladocs ## Description - `DynamoDBLogStore` renamed to `S3DynamoDBLogStore` - `delta-storage-dynamodb` artifact renamed to `delta-storage-s3-dynamodb` - `delta-storage` artifact name now has no scala version, and pom has no scala dependency - `delta-storage-s3-dynamodb` artifact name now has no scala version, and pom has no scala dependency - `io.delta.storage` scaladocs now contain only the `io.delta.storage` Java APIs - NO CHANGE: `io.delta.storage` java APIs docs include only `LogStore.java` and `CloseableIterator.java` - updated integration tests for new folder and new artifact name - java artifacts are only generated when using scala 2.12 ... we do NOT double publish them / double generate the jars. e.g. `build/sbt '++2.13.5 publishM2'` does not generate these jars ### Artifact Names and POMs Ran `build/sbt publishM2` locally. 
#### delta-storage published correctly ``` [info] published delta-storage to file:/Users/scott.sandre/.m2/repository/io/delta/delta-storage/1.2.0-SNAPSHOT/delta-storage-1.2.0-SNAPSHOT.pom [info] published delta-storage to file:/Users/scott.sandre/.m2/repository/io/delta/delta-storage/1.2.0-SNAPSHOT/delta-storage-1.2.0-SNAPSHOT.jar [info] published delta-storage to file:/Users/scott.sandre/.m2/repository/io/delta/delta-storage/1.2.0-SNAPSHOT/delta-storage-1.2.0-SNAPSHOT-sources.jar [info] published delta-storage to file:/Users/scott.sandre/.m2/repository/io/delta/delta-storage/1.2.0-SNAPSHOT/delta-storage-1.2.0-SNAPSHOT-javadoc.jar // pom.xml 4.0.0 io.delta delta-storage jar delta-storage 1.2.0-SNAPSHOT Apache-2.0 http://www.apache.org/licenses/LICENSE-2.0 repo delta-storage io.delta https://delta.io/ git@github.com:delta-io/delta.git scm:git:git@github.com:delta-io/delta.git ... org.apache.hadoop hadoop-common 3.3.1 provided ``` #### delta-storage-s3-dynamodb published correctly ``` [info] published delta-storage-s3-dynamodb to file:/Users/scott.sandre/.m2/repository/io/delta/delta-storage-s3-dynamodb/1.2.0-SNAPSHOT/delta-storage-s3-dynamodb-1.2.0-SNAPSHOT.pom [info] published delta-storage-s3-dynamodb to file:/Users/scott.sandre/.m2/repository/io/delta/delta-storage-s3-dynamodb/1.2.0-SNAPSHOT/delta-storage-s3-dynamodb-1.2.0-SNAPSHOT.jar [info] published delta-storage-s3-dynamodb to file:/Users/scott.sandre/.m2/repository/io/delta/delta-storage-s3-dynamodb/1.2.0-SNAPSHOT/delta-storage-s3-dynamodb-1.2.0-SNAPSHOT-sources.jar [info] published delta-storage-s3-dynamodb to file:/Users/scott.sandre/.m2/repository/io/delta/delta-storage-s3-dynamodb/1.2.0-SNAPSHOT/delta-storage-s3-dynamodb-1.2.0-SNAPSHOT-javadoc.jar // pom.xml 4.0.0 io.delta delta-storage-s3-dynamodb jar delta-storage-s3-dynamodb 1.2.0-SNAPSHOT Apache-2.0 http://www.apache.org/licenses/LICENSE-2.0 repo delta-storage-s3-dynamodb io.delta https://delta.io/ git@github.com:delta-io/delta.git 
scm:git:git@github.com:delta-io/delta.git ... io.delta delta-storage 1.2.0-SNAPSHOT io.delta delta-core_2.12 1.2.0-SNAPSHOT test com.amazonaws aws-java-sdk 1.7.4 provided ``` #### other artifacts still generate correctly ``` ls /Users/scott.sandre/.m2/repository/io/delta delta-contribs_2.12 delta-core_2.12 delta-storage delta-storage-s3-dynamodb ``` ### Javadocs and Scaladocs Ran `build/sbt unidoc` locally. #### Javadocs BEFORE ![image](https://user-images.githubusercontent.com/59617782/161815044-c0b7b650-bcc4-4bb1-9bb4-c90a3c943ff4.png) #### Javadocs AFTER - *Note*: the `:: Developer API ::` tag isn't working here ... but it's not working on branch-1.1 master either for me locally ... so I don't think this is an issue ![image](https://user-images.githubusercontent.com/59617782/161815083-4869cb2f-b259-4b51-8ed4-04c7e24add2c.png) #### Scaladocs BEFORE ![image](https://user-images.githubusercontent.com/59617782/161814589-37ecb600-6f9a-47ab-be21-2ce69d9a47dc.png) #### Scaladocs AFTER ![image](https://user-images.githubusercontent.com/59617782/161869888-08678416-4081-488c-b919-becb63d86874.png) ### Integration Tests - also re-ran integration tests (to test the new artifact name, with no scala version) Closes delta-io/delta#1054 Signed-off-by: Scott Sandre GitOrigin-RevId: 80763cb099c95c342bc102b5c8de11048b56060e commit c55acf49e0ecff760e33fb4e2716a1a96565dd8c Author: Allison Portis Date: Thu Apr 7 10:54:24 2022 -0700 Don't mock `pyspark` import for python API doc generation ## Description Currently our python api docs has `sphinx.ext.autodoc.importer._MockObject object at 0x105f04580>` for all `pyspark` objects. This is because our sphinx configuration "mocks" all the imports under `pyspark`. This removes `pyspark` from `autodoc_moc_imports`. Build the docs locally in a `pipenv` with `pyspark==3.2.1` installed. @vkorukanti also verified this fix. ## Does this PR introduce _any_ user-facing changes? No. 
Closes delta-io/delta#1055 Signed-off-by: allisonport-db GitOrigin-RevId: f8a74949680065037e7ec17df3e2f5ad0486cff7 commit 225e2bbf9ecaf034d08ef8d2fee1929e51c951bf Author: Kapil Sreedharan Date: Thu Apr 7 09:55:41 2022 -0700 [#956 ] Delta LogStore Refactor - GCSLogStore This PR refactors GCSLogStore from scala to java Resolves https://github.com/delta-io/delta/issues/956 - adds GCSLogStore.java (in delta-storage artifact) - adds ThreadUtils.java (in delta-storage artifact) - adds ThreadUtilsSuite.scala (in delta-storage artifact) - updates LogStoreSuite.scala (in delta-core artifact) - deletes GCSLogStore.scala (from delta-contribs artifact) - deletes GCSLogStoreSuite.scala (from delta-contribs artifact) Closes delta-io/delta#1024 Co-authored-by: Scott Sandre Signed-off-by: Scott Sandre GitOrigin-RevId: 0350206fe857b93e6a9e5db60348c978149747bd commit d402b188b801085eb29b24ab321aea01b48edcc6 Author: Ryan Johnson Date: Thu Apr 7 04:54:27 2022 -0700 Avoid unnecessary struct() creation during state reconstruction. This PR fixes a performance regression in Delta state reconstruction due to bad query planning: When the Delta checkpoint schema contains `add.stats_parsed` column instead of `add.stats` (because JSON stats are disabled while struct stats are enabled), we must synthesize the json stats which in turn requires a rewrite of the `add` struct. Further, the struct was already rewritten before that, in order to canonicalize `add.path`. All of these rewrites were replicated at every access of any field of the final `add` struct, causing the `to_json(add.stats_parsed)` projection to run ten times instead of just once. Path canonicalization also ran multiple times, but the JSON overhead dwarfed it. This behavior is a definite shortcoming in the query optimizer's handling of nested fields, but there is no immediately obvious fix for it. 
The workaround is to rework the state reconstruction query, so that the `add` struct rewrite that applies stats construction and path canonicalization happens at the last possible moment -- just before converting to `Dataset[SingleAction]` -- so that the optimizer has no opportunity to replicate the projections. The result is actually faster than the pre-regression code, because even the original state reconstruction still evaluated the struct twice. Many existing unit tests cover correctness. GitOrigin-RevId: f4fa51a0a900f8a8b892ed4129fad69932181296 commit 827543feefa9c6b604a270a9bebf55f9cd435b9c Author: Junlin Zeng Date: Wed Apr 6 18:00:44 2022 -0700 Minor refactoring to DeltaLog GitOrigin-RevId: 6de3cf61941033bae3be416a2374c0fc3404a1a6 commit a3a07e33e655cec741f81e8caa4e629ef180a5a9 Author: Lukas Rupprecht Date: Wed Apr 6 13:47:45 2022 -0700 Refactor fourth set of 20 Delta error messages. This PR refactors 20 exceptions of the delta code base to follow the new error handling framework, which better organizes exceptions and makes them easier to query. This commit refactors the fourth set 20 delta error messages. Refactored existing exceptions into functions in `DeltaErrors.scala` and unit tested those functions. GitOrigin-RevId: 3efb54c2643923a35296400193cef845b8e19301 commit e2163bf883ef285eea642ee6a48a96d7bea45a3b Author: Ryan Johnson Date: Wed Apr 6 11:13:03 2022 -0700 Reliably verify delta/checkpoint file names Delta includes code to verify that the log and checkpoint files being used for log replay actually come from the snapshot's Delta log base path. However, the existing check ran on a per-row basis during Snapshot state reconstruction, which poses the following problems: 1. Significant performance overhead due to validating a per-file property on a per-row basis 2. The check only runs during state reconstruction, and even then did not run reliably when caching was involved. 
The new approach, taken in this PR, validates the file names as soon as they are first used, thus guaranteeing that the check always runs no matter what code uses the file names. GitOrigin-RevId: 2f522ab8e21b4e8faf9faf0ff6844287b4469f40 commit 4cc342087b64cbccb4b6f454acb4b2ccdee065a4 Author: Scott Sandre Date: Tue Apr 5 17:13:49 2022 -0700 Add 'IntelliJ Setup' section to README ## Description - Resolves https://github.com/delta-io/delta/pull/988 - Adds an IntelliJ Setup section to README.md which describes how to setup delta-lake in IntelliJ and how to fix a common build error Re-cloned Delta Lake and followed the steps myself. Closes delta-io/delta#1048 Signed-off-by: Scott Sandre GitOrigin-RevId: 81fe0aada56d33577c53ab3b446710b1853b6036 commit c068f5abd6146f8d94e87b29cddb6af51a6b6d44 Author: Scott Sandre Date: Tue Apr 5 08:12:07 2022 -0700 Update DelegatingLogStore to use Java LogStore implementations by default Resolves [#954](https://github.com/delta-io/delta/issues/954) ## Description - `DelegatingLogStore.scala` now, by default, uses the `io.delta.storage` Java `LogStore` implementations. At runtime, each of these implementations is then wrapped inside of a `LogStoreAdapter` to be used as a `LogStore.scala` throughout the delta code base. Updated existing unit tests. We test setting the scheme for both the default (java) LogStores, as well as the previous scala LogStores. Thus, if a user specifies a Scala LogStore for a given scheme, that will continue to work. 
Closes delta-io/delta#1041 Signed-off-by: Scott Sandre GitOrigin-RevId: 3b6caab303d09553802609fbc4d28c64b4e1d154 commit feabc2c110848a7a2f474996bcc51ec18437c8bd Author: Yaohua Zhao Date: Mon Apr 4 15:30:55 2022 -0700 Remove unused errors from DeltaErrors GitOrigin-RevId: 863d9c2d0e1600d434303e7f4c0c2d90d3f2966a commit 2964eeb9bfc12570a7e9fbf87d72e10e63b34321 Author: Hussein Nagree Date: Mon Apr 4 11:06:18 2022 -0700 Minor refactoring to ScanReport GitOrigin-RevId: 1cc65f560357194e81fd75326ba40cd8795dec90 commit 7ffb34fedfa2e0169ff11e7a87d479d4b9da21e2 Author: Allison Portis Date: Tue Apr 5 14:19:20 2022 -0700 remove accidental merge conflict (#317) commit c3b967ca0e37c1e424bdcef2d0be6edae6b33bcf Author: Allison Portis Date: Tue Apr 5 11:50:05 2022 -0700 [WAIT_TO_MERGE] Javadocs for 0.4.0 release (#310) commit 9d1347db9ff6469af74814729ddfb476deb489ee Author: Allison Portis Date: Tue Apr 5 11:49:38 2022 -0700 [WAIT TO MERGE] Upgrade versions for 0.4.0 release in README build files (#307) commit 5e2493804725832ec626823fdc7156cd7458b717 Author: Mariusz Krynski Date: Mon Apr 4 08:54:37 2022 -0700 S3 Multi-cluster writes support using DynamoDB Resolves #41 This PR addresses issue #41 - Support for AWS S3 (multiple clusters/drivers/JVMs). It implements few ideas from #41 discussion: - provides generic base class BaseExternalLogStore for storing listing of commit files in external DB. 
This class may be easily extended for a specific DB backend - stores contents of commit in temporary file and links to it in DB's row to be able to finish uncompleted write operation while reading - provides concrete DynamoDBLogStore implementation extending BaseExternalLogStore - implementations for other DB backends should be simple to implement (ZooKeeper implementation is almost ready, I can create separate PR if anyone is interested) - unit tests in `ExternalLogStoreSuite` which uses `InMemoryLogStore` to mock `DynamoDBLogStore` - python integration test inside of `storage-dynamodb/integration_test/dynamodb_logstore.py` which tests concurrent readers and writers - that integration test can also run using `FailingDynamoDBLogStore` which injects errors into the runtime execution to test error edge cases - This solution has also been stress-tested (by SambaTV) on Amazon's EMR cluster (multiple test jobs writing thousands of parallel transactions to a single delta table) and no data loss has been observed so far. To enable DynamoDBLogStore, set the following spark property: `spark.delta.logStore.class=io.delta.storage.DynamoDBLogStore` The following configuration properties are recognized: io.delta.storage.DynamoDBLogStore.tableName - table name (defaults to 'delta_log') io.delta.storage.DynamoDBLogStore.region - AWS region (defaults to 'us-east-1') Closes delta-io/delta#1044 Co-authored-by: Scott Sandre Co-authored-by: Allison Portis Signed-off-by: Scott Sandre GitOrigin-RevId: 7c276f95be92a0ebf1eaa9038d118112d25ebc21 commit cf1b815d9feb36e94d4995a81e6c30acd77daa5b Author: Scott Sandre Date: Sat Apr 2 19:11:47 2022 -0700 Public (Java) LogStores can use hadoopConf with non-`spark` prefix ## Description Update test case to ensure that the public, java LogStores can receive non-`spark` prefix configuration params. 
We test that our custom test suite LogStore implementation can still receive hadoopConf params using `--conf this.is.a.non.spark.prefix.key=bar` Closes delta-io/delta#1049 Signed-off-by: Scott Sandre GitOrigin-RevId: 987c551ead5f6f0d3160ed2f275d2bf1a1d726ef commit 1a2ce8bce222b268d812123be741aaa79cf63f4d Author: Hussein Nagree Date: Wed Mar 30 14:28:57 2022 -0700 Don't update snapshot if within the same request Allow SnapshotManagement.update to take in a timestamp, which if specified can be used to identify if the snapshot has already been updated in the duration of the current request. This enables us to save a listFrom call to the underlying logstore, since previously DeltaTableV2 could potentially trigger up to 2 LISTs - one when fetching the table (if it wasn't in the cache), and 1 when evaluating the `lazy val snapshot`. With the change, we no-op during the second update if the table was fetched from the logstore. GitOrigin-RevId: 01d1bccecb2fd1d4446102ecad561c1757c1049e commit 389dbae936cd9cb7142b52ec6b54c7b28f726918 Author: Vini Jaiswal Date: Sun Apr 3 21:23:11 2022 -0700 Adding the latest integrations and community channels (#1053) * Adding the latest integrations and community channels - Added a separate section for Delta Stand-alone. - Added the connectors information about Apache Hive, Apache Flink, PrestoDB, Trino, Rust and Kafka-delta-ingest - Added Delta Lake LinkedIn and youtube channels. * Updated the README.md Updated the README.md to include some more changes. - Added the integration section at the top. - Updated structure - Updated Roadmap links Signed by: Vini * Minor fix in formatting the note Minor fix in formatting the note under description. * Updated README.md with the format and links. Updated the file per the changes requested. 
* Removed a header commit 0cd5bf7521c68ad9d2e4190ba7c8f736f5f406ac Author: Allison Portis Date: Fri Apr 1 11:08:31 2022 -0700 Rename Flink connector artifact & folder (#312) commit 1420c95b75c7b96c6ec448c29f89d11891534681 Author: Allison Portis Date: Fri Apr 1 10:31:29 2022 -0700 Flink integration tests use sbt's scala version (#313) commit 851afc9170b8f1867ac7e44b680f42ce6189ba3d Author: Allison Portis Date: Thu Mar 31 18:32:02 2022 -0700 Fix `build.sbt` for flink connector release (#308) commit 29051725df933447b46b09b3822645f2f9cea2e8 Author: kristoffSC Date: Wed Mar 30 23:58:06 2022 +0200 Fix Time tests for Flink Source after date time saving change. (#311) Signed-off-by: Krzysztof Chmielewski commit fab016e874532cad4de17f1077dbdc860548be9a Author: Chang Yong Lik Date: Wed Mar 30 09:06:06 2022 -0700 Delta LogStore Refactor - S3SingleDriverLogStore This PR refactors S3SingleDriverLogStore from scala to java Resolves https://github.com/delta-io/delta/pull/992 - adds `S3SingleDriverLogStore.java` - updates `LogStoreSuite.scala` No Closes delta-io/delta#995 Co-authored-by: Scott Sandre Signed-off-by: Scott Sandre GitOrigin-RevId: 641ffe0fe7371ed7f143b73cb9f09465c3bee642 commit 8b63a943d7aecff78962cda7a4f8973b584db7a2 Author: Adam Binford Date: Wed Mar 30 07:49:58 2022 -0700 Fix thread pool leak in optimize Shuts down the thread pool after we're done with it in an optimize job. This is particularly important right now before incremental optimization is implemented, since manual batching will create and leave around a new thread pool per batch. 
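The thread pool fix described above follows a common pattern, sketched here in plain Java under stated assumptions: one pool per batch, released in a `finally` block so it is shut down even when a job fails. `OptimizeBatchSketch` and its methods are hypothetical names for illustration and are not the code changed by this commit.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: not Delta's optimize implementation.
final class OptimizeBatchSketch {

    /**
     * Runs all jobs of one batch on the given pool, sums their results, and
     * always shuts the pool down afterwards so its threads do not linger.
     */
    static long runBatch(ExecutorService pool, List<Callable<Long>> jobs) {
        try {
            long total = 0;
            for (Future<Long> result : pool.invokeAll(jobs)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("batch interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("batch job failed", e.getCause());
        } finally {
            pool.shutdown(); // without this, every batch would leak a pool
        }
    }

    /** Convenience overload that creates a fresh fixed-size pool for the batch. */
    static long runBatch(List<Callable<Long>> jobs, int threads) {
        return runBatch(Executors.newFixedThreadPool(threads), jobs);
    }
}
```

With manual batching, a caller would invoke `runBatch` once per batch; because the shutdown sits in `finally`, the number of live pools stays bounded regardless of how many batches run or fail.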
Closes delta-io/delta#1034 Signed-off-by: Venki Korukanti GitOrigin-RevId: 359f8ad4d8bcbc7c7bca4078a2e839faf39185cb commit 6165b933e65f06a85b93ad13dccf685dc3702244 Author: Venkata Sai Akhil Gudesa Date: Wed Mar 30 15:15:11 2022 +0200 Tag Delta Metadata Queries for Logging using a generalised frame-like tagging mechanism This PR adds a general utility object that allows "tagging" queries that are executed in an enclosing block with support for nested blocks. These "tags" are read by the usage logger/query profile logger and all parent/ancestor tags of the queries are added to the `QueryProfile`. This mechanism is implemented as a utility object `QueryTagger` (which extends the `ThreadLocalTagger` trait that contains the core logic). The tag is currently used for tagging Delta Metadata Queries but the tags can be easily extended to support tagging of say, queries generated by MERGE INTO. New unit tests + existing tests. GitOrigin-RevId: 403432a9baa0cda4d414acb8f57690cdd5b8dd19 commit c8131648750a78b3e33c6f851bf2114eae4c950a Author: Anton Okolnychyi Date: Wed Mar 30 03:34:50 2022 +0000 Minor refactor to DeltaDelete.scala GitOrigin-RevId: 6736f77cc42f8bab04b2376bb4e142e4ee15c474 commit f880311d90f92119eb17de025bc6bf26742d0d27 Author: Xinyi Date: Tue Mar 29 00:39:47 2022 -0700 Fix DeltaColumnRenameSuite under ANSI mode This PR fixes the DeltaColumnRenameSuite test failure when ANSI mode is on. Through local unit test. GitOrigin-RevId: 244427bf3744b61e4dcca0eb2892a183d4f8943d commit 06160b1586edbbaffda05f10550027a1ca63fb07 Author: Adam Binford Date: Mon Mar 28 22:25:07 2022 -0700 Add an internal config to set snapshot cache storage level Resolves #999 Adds a config to set the storage level for snapshot caching. Current behavior remains the same, but offers the ability to choose the storage level for a session. 
This benefits streaming workloads that don't benefit from the snapshot being cached, and offers ways to improve dynamic allocation: - Set the storage level to disk only, and then you can use the serve cache from shuffle service mechanism to still have the data cached but be able to deallocate the executors - Set the storage level to none to just disable caching of the snapshot Closes delta-io/delta#1000 Signed-off-by: Shixiong Zhu GitOrigin-RevId: fa4b108d22f5442be10a664687fbb7b27d4fef85 commit a0c54abc403052963a1d6cb14cc23f01b524e31e Author: Scott Sandre Date: Mon Mar 28 14:00:08 2022 -0700 LogStore: Replace spark.sparkContext.hadoopConfiguration by spark.sessionState.newHadoopConf Instantiates a LogStore using `spark.sessionState.newHadoopConf` instead of using `spark.sparkContext.hadoopConfiguration`. This ensures that the SQLConf values get passed to the hadoopConfig, and that users can pass in hadoopConfig values using `$key=$value` instead of having to use `spark.hadoop.$key=$value`. Closes delta-io/delta#1031 Closes delta-io/delta#1022 Signed-off-by: Scott Sandre GitOrigin-RevId: 71309ddf83a4247eed6d7a334dd76998bf897dd5 commit f6a64bf92b042eb36ee1cdb206bb8f335b39529a Author: Hussein Nagree Date: Fri Mar 25 16:11:19 2022 -0700 Create a CapturedSnapshot class with update timestamps Create a class called CapturedSnapshot, which strongly ties together the current snapshot with the timestamp it was last updated. This protects us from race conditions where the update took place but the updateTimestamp didn't go through yet, or where the current snapshot gets modified midway through an update Existing unit tests GitOrigin-RevId: c48fff9cf2496a91a8542c8aec07bca6d9b72e2f commit 3e65aeea89f80dba55bd039041fd70a344b7cf96 Author: Venki Korukanti Date: Fri Mar 25 12:27:30 2022 -0700 Add Delta RESTORE operation commit metrics in DeltaLog and output This change adds the following metrics to Delta commit log as part of committing RESTORE changes. 
``` "tableSizeAfterRestore", // table size in bytes after restore "numOfFilesAfterRestore", // number of files in the table after restore "numRemovedFiles", // number of files removed by the restore operation "numRestoredFiles", // number of files that were added as a result of the restore "removedFilesSize", // size in bytes of files removed by the restore "restoredFilesSize" // size in bytes of files added by the restore ``` Same metrics are output as command output. ``` "table_size_after_restore" "num_of_files_after_restore" "num_removed_files" "num_restored_files" "removed_files_size" "restored_files_size" ``` Closes https://github.com/delta-io/delta/pull/1030 UTs GitOrigin-RevId: cc49bc3b161653fbc70d950bc038d35d560d47f1 commit c4aae1ed1010359919ee67ed46da235890528dcc Author: Pranav Date: Sat Mar 26 00:30:18 2022 +0530 Make calculation of backlog metrics more efficient for streaming sources - Ability to cache backlog metrics between batches such that fewer versions need to be iterated over. - Usage logging added on backlog calculation that takes longer than `DELTA_STREAMING_METRICS_CALC_DURATION_REPORT_THRESHOLD` Results: Metrics turned off (control): 90 mins Metrics turned on (without newest changes): 122 mins Metrics turned on (with newest changes): 94 mins - Augmented existing test GitOrigin-RevId: b706d12c400c82ffefd1d55425477542ee9ff78d commit c99cea01ef78b95a19ce1784c29f534021dd4a09 Author: Adam Binford Date: Fri Mar 25 11:30:35 2022 -0700 Make HDFSLogStore consistent with an Observer NameNode Resolves #767 Adds an `msync` call after a successful write in the HDFSLogStore. This is needed because after a write using the FileContext API, the FileSystem API, which is used for reading, is not guaranteed to be able to read the write, even from the same process. This causes issues when a transaction is committed and then the new version is immediately read. 
This msync on the FileSystem API forces the cached FileSystem object to update its internal state version so that the next read is guaranteed to see the write. Closes delta-io/delta#769 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 70bd5a8d644f58489b70cd8590db82cccbba4038 commit 1118a72e0e7280b85a21c7e9f04a95cbbd427d28 Author: 110416 Date: Fri Mar 25 08:19:42 2022 -0700 improve scala example - tweak build.sbt - update dependencies - address deprecation warning - add scalafmt - add README.md Closes delta-io/delta#1020 Signed-off-by: Scott Sandre GitOrigin-RevId: 8208d8354329c131e0873cd7b4a962babf1a63cd commit 5bdec74b382722ccc08bece4e9b9f440e48ae550 Author: Venki Korukanti Date: Fri Mar 25 08:15:04 2022 -0700 Add Delta OPTIMIZE operation commit metrics in DeltaLog Commit metrics in DeltaLog are helpful when looking at the history of the Delta Table. The following metrics are added: ``` "numAddedFiles", // number of files added "numRemovedFiles", // number of files removed "numAddedBytes", // number of bytes added by optimize "numRemovedBytes", // number of bytes removed by optimize "minFileSize", // the size of the smallest file "p25FileSize", // the size of the 25th percentile file "p50FileSize", // the median file size "p75FileSize", // the 75th percentile of the file sizes "maxFileSize" // the size of the largest file ``` Closes https://github.com/delta-io/delta/pull/1029 Added a UT. GitOrigin-RevId: fb8a657937414d405cdd880556a1d8e1958a767d commit c688466263ec552bf5270e710ce2e6024af7bb1b Author: Xinyi Date: Thu Mar 24 21:29:45 2022 -0700 Fix ANSI tests in Delta and ACL Some test cases are fixed by running them only with ANSI mode turned off. Some test cases failed due to bugs in the tests themselves that passed silently with ANSI off but surfaced with ANSI on. Tested locally with ANSI on by default. 
GitOrigin-RevId: d8cf178def2ab514786760abddc7ebd18bf9a150 commit c08330f34a2a2be904ccce63ed1c2e0197bb7b59 Author: John ODwyer Date: Thu Mar 24 16:10:15 2022 -0700 Adds connection to the LF code of conduct Closes delta-io/delta#1018 Signed-off-by: Scott Sandre GitOrigin-RevId: d379a86229b7bbea8dda6d0d18d3428de390687f commit 02745437ba933b583e4968ab1559e97977fbe736 Author: Venki Korukanti Date: Thu Mar 24 15:34:06 2022 -0700 Minor refactor to DeltaUnsupportedOperationsCheck GitOrigin-RevId: ac3afafeb3e0459559d07baec2d2f7e532c58285 commit a203dabf2352b9bc9b6141b8e975ba0583a90da6 Author: Peng Zhong Date: Wed Mar 23 19:25:53 2022 -0700 Minor refactor to DeltaOptions GitOrigin-RevId: 552eb972457e8955bb4ac27c90b7b9e8b6ec8060 commit 03803ec52de95838cdac08d49b9f16115d11efed Author: Junlin Zeng Date: Wed Mar 23 15:43:51 2022 -0700 Minor refactor to alterDeltaTableCommands.scala GitOrigin-RevId: 72003144ff036c0302efb747bbe528ef189706af commit c70be8ec33d999dc28f1efb6ce0d95be330b12ac Author: Chang Yong Lik Date: Tue Mar 22 15:43:05 2022 -0700 Create GitHub issue and pull request templates ## Description This PR creates GitHub issue and pull request templates required for future Delta Lake releases. Resolves #917 This PR was not tested because it only introduces issue and pull request templates. ## Does this PR introduce _any_ user-facing changes? 
No Closes delta-io/delta#942 Signed-off-by: allisonport-db GitOrigin-RevId: 5b90038b5cf14bd18a742e0184a8809cc37df8e7 commit 631ea84e2be2d63b48a90f23c1242d6d852aff74 Author: John ODwyer Date: Tue Mar 22 15:32:36 2022 -0700 Points charter link to Delta.io Also removes the pdf from this repo which is not needed anymore Closes delta-io/delta#1007 Signed-off-by: Scott Sandre GitOrigin-RevId: 6b3c37e0eb000ed9e6ef1d3037c6a06303f9da05 commit 7a1f308a6af1c9da26ffdae3157b8870b536c1bc Author: Allison Portis Date: Tue Mar 29 18:07:19 2022 -0700 update directory strategy (#309) commit bbc979f587b93ea93eff695ec508922f10cff141 Author: Allison Portis Date: Mon Mar 28 16:16:40 2022 -0700 Add java docs links to main readme (#304) commit 307eb2df6e2d9b7a2de3a43171fed9f74109a22b Author: Allison Portis Date: Mon Mar 28 16:15:47 2022 -0700 Update integration tests for 0.4.0 release (#305) commit e9457e73a632fd93c8591c506904519d9863fd81 Author: Allison Portis Date: Fri Mar 25 11:41:36 2022 -0700 Update README's for the 0.4.0 release (#301) commit b8528381fe32b88a1204125a3ebabb19f133cc88 Author: Allison Portis Date: Fri Mar 25 11:30:39 2022 -0700 generate docs (#300) commit 855e5ccb9f6a5fcb37f699385bdd95cf702b7bcc Author: pkubit-g <91378169+pkubit-g@users.noreply.github.com> Date: Fri Mar 25 00:48:11 2022 +0100 Flink Delta Sink - Update Readme (#297) * Add 'known limitations' to Readme.md * Update flink-connector/README.md Co-authored-by: Allison Portis Co-authored-by: Allison Portis commit f6121ad216e0c6630ece34fd041c435d59c5551e Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Thu Mar 24 14:25:23 2022 -0700 Update README.md - fix missing `)` commit 309002b975cf94ed620ff0d2b07f889f68cc363b Author: Olivier NOUGUIER Date: Thu Mar 24 21:54:45 2022 +0100 :arrow_up: Add Scala 2.13 test runs (#270) commit 532da849290f9fa63ccacfc176f39f6d4775035d Author: Liwen Sun Date: Mon Mar 21 18:56:23 2022 -0700 Block rename column when there are dependent check constraints and 
generated columns when renaming a column, and the column is referenced by a constraint or generated column expression, we need to block the rename command, otherwise the generated column or constraints will no longer work. new test GitOrigin-RevId: c6977ed4089e26920ec5be31b4cae845abfac016 commit 989078f4aaa58dbbb846c6611ca0c3cfe7dfe339 Author: Christos Stavrakakis Date: Mon Mar 21 10:51:26 2022 +0100 Fail MERGE queries that are adding void columns Void (`NullType`) columns are not properly supported in MERGE command. Specifically, the first merge can successfully add a void column, but subsequent merges will fail with a mismatch error. Until proper void column support, we prefer to fail any query that tries to add a void column and return an error message that instructs users to explicitly add a type. Note that different kind of writes can still add void columns to tables, and customers will not be able to merge into these tables until they provide a type for all columns. Added new unit tests that check that void column or nested fields in merge are not allowed. GitOrigin-RevId: c7d4129ecb8c27b399658eaba76f8a1976010c3a commit 68705a56420a16ad45d08aac9be3880396d17cd7 Author: Scott Sandre Date: Thu Mar 17 16:20:13 2022 -0700 Minor fix to AzureLogStore.java Ensure child method uses `@Override` annotation, to be consistent with other java class implementations. Trivial change. 
GitOrigin-RevId: 4736fbeb6af8e443ddb48f7a55b7fffc575fd477 commit fe5e3d7aa7aa71d89b15635fb8db6e76499fd6c0 Author: Hoang Pham Date: Thu Mar 17 14:54:49 2022 -0700 Delta LogStore Refactor: refactor LocalLogStore Solves #955 - Add `LocalLogStore.java` - Add `PublicLocalLogStoreSuite` class on `LogStoreSuite.scala` Signed-off-by: PhVHoang Closes delta-io/delta#1002 Signed-off-by: Scott Sandre GitOrigin-RevId: d2dfe87981dafe4da29ffbd3ef24cece7cca9962 commit 73092da524e9a39b9181ee7db4e963793ae3e191 Author: Jackie Zhang Date: Wed Mar 16 21:30:20 2022 -0700 Expanding name mode column mapping test coverage Add selected name mode column mapping tests within critical suites to ensure test coverage. Tests are automatically run under column mapping modes. Closes https://github.com/delta-io/delta/pull/1008 GitOrigin-RevId: e5bb882e95d10ca4e659e09c01276e22dc7dd4bc commit 618533e4bcaa1c15ad7cc2ebe1a462ea2e8bd91f Author: Xinyi Yu Date: Thu Mar 17 02:11:20 2022 +0000 Minor error message improvement Author: Xinyi Yu GitOrigin-RevId: 7de399f54e779c8d83413cd685a1c0cffd415a76 commit f3add0beba00327b83b0de24823b1c043288aaf5 Author: Kapil Sreedharan Date: Wed Mar 16 08:38:23 2022 -0700 Delta Storage: AzureLogStore - adds AzureLogStore.java - updates LogStoreSuite Closes delta-io/delta#1003 Signed-off-by: Scott Sandre GitOrigin-RevId: c13d901e0faea57e663b174c8a2c967c9f0df221 commit c7daae713195d9cadf202d65016c3ca83f370c8f Author: Xinyi Yu Date: Wed Mar 16 13:02:48 2022 +0000 Minor error message improvement in case of 'mismatched input' cases from ANTLR Author: Xinyi Yu GitOrigin-RevId: cf019028f4b03cae238f4f59445247865b93789b commit d110ec3fb74a190297f03acbfe6f77fc4a1418b4 Author: Tom van Bussel Date: Tue Mar 15 21:55:15 2022 +0100 Fix some metrics in DeleteCommand This change fixes a few metrics in `DeleteCommand`: - `numRemovedFiles` was incorrectly set to the number of files after skipping, instead of the number of files removed. 
- `numPartitionsRemovedFrom` and `numPartitionsAddedTo` were not set in some cases. GitOrigin-RevId: 32e41db1c38aca75f6788e07928011ecc919d420 commit cc1764082858890e0451ff1ff7e27091545cf276 Author: Jackie Zhang Date: Tue Mar 15 10:08:12 2022 -0700 Add test infra for running selected column mapping tests Implementing the test infra for selecting column mapping tests so we won't overwhelm the test runners. Enabled one `MergeIntoSQLSuite` test to demonstrate the effect. The tests are automatically run. Closes https://github.com/delta-io/delta/pull/1001 GitOrigin-RevId: 1854b3fe0dbf52757b7f1b2b79522de2e086ad81 commit c4e52ebf129abe68c9ec2fdf085788e5ed970a66 Author: Allison Portis Date: Tue Mar 15 10:03:48 2022 -0700 Update run-integration-tests.py to be able to use local source code Right now `run-integration-tests.py` can only be run with a released, or staged version. This PR adds a flag `--use-local` that allows us to run the integration tests for any local version. `--use-local` uses `build/sbt publishLocal` to package the local source code before running the tests. Closes delta-io/delta#990 Signed-off-by: allisonport-db GitOrigin-RevId: 632434bb3ae61fb3de96a36cf338066219ce7977 commit 3564851abed23482c4db4fdebbf911d94dfa0a74 Author: Allison Portis Date: Mon Mar 14 16:54:51 2022 -0700 Automatic data skipping using generated columns This PR adds automatic data skipping using generated column. This builds on delta-io/delta#966 which exposes `PrepareDeltaScan`. 
Closes delta-io/delta#928 GitOrigin-RevId: 4d15809bc277017ba3d08cee131d00a548ad13b4 commit df688f5d2a9c3c1d8f487d52479ec4492f3325bb Author: pkubit-g <91378169+pkubit-g@users.noreply.github.com> Date: Mon Mar 21 19:30:49 2022 +0100 Update pomExtra setting for flinkConnector (#296) commit c06be2d85bc36cd209787a9cc1cadb2d5d22853a Author: pkubit-g <91378169+pkubit-g@users.noreply.github.com> Date: Mon Mar 21 19:22:40 2022 +0100 Remove logging interface (#295) commit 4af5293145f079da2295edee312b435a6b602f55 Author: pkubit-g <91378169+pkubit-g@users.noreply.github.com> Date: Mon Mar 14 22:46:56 2022 +0100 Update example project && docs and fix minor docs issues (#291) commit b6d9d53679a51ab7c4ea05b84cbe2668060c8d54 Author: Kapil Sreedharan Date: Mon Mar 14 12:21:46 2022 -0700 Delta Storage: Update HadoopFileSystemLogStore #953 - Update HadoopFileSystemLogStore.java to include writeWithRename - Required for child classes such as [AzureLogStore](https://github.com/delta-io/delta/blob/master/core/src/main/scala/org/apache/spark/sql/delta/storage/AzureLogStore.scala) Closes delta-io/delta#993 Signed-off-by: Scott Sandre GitOrigin-RevId: d60e958a8c023d45c57d736aa9db602033a09ecf commit 7d5390dbbeb31d55cdcf89afda2d692ab1849d45 Author: Wenchen Fan Date: Mon Mar 14 07:43:12 2022 +0000 Minor refactor to DeltaTimeTravelSpec GitOrigin-RevId: 6f89dae42abf1f6255ff669941e4a4ce1021ec46 commit c4e40c15116965ad737b8334fb611590d0155050 Author: Venki Korukanti Date: Fri Mar 11 13:29:08 2022 -0800 Add support for RESTORE SQL in Delta OSS This PR adds the RESTORE SQL syntax and handles the parsed SQL command by executing the RESTORE through the existing `RestoreTableCommand`.
``` RESTORE [TABLE] TO VERSION AS OF ex: RESTORE TABLE delta.`s3://bucket/table/deltatabl1` TO VERSION AS OF 20 RESTORE [TABLE] TO TIMESTAMP AS OF RESTORE TABLE delta.`s3://bucket/table/deltatabl1` TO TIMESTAMP AS OF '2021-11-18' ``` This PR also refactors the `RestoreTableCommand` to take `DeltaTableV2` with time travel info and both SQL and Scala API to use the same method i.e construct `DeltaTableV2` and pass it to `RestoreTableCommand`. Also adds a test suite. This fixes delta-io/delta#891 This closes delta-io/delta#998 Restore SQL suite is added GitOrigin-RevId: d4de80e025d9c559b255bdb6128a2e008278162a commit 348e4e079523bd2c5aa0ca95fd82a2c0f55887fa Author: Jackie Zhang Date: Thu Mar 10 14:36:30 2022 -0800 Support column rename This PR introduces the `DeltaColumnRenameSuite` which tests the column renaming functionalities supported by the column mapping feature. This test should be automatically run. Closes delta-io/delta#994 GitOrigin-RevId: 314d082bd8fdb2dc1a5fecf045439c21e3a83748 commit 5ffda69480475353940aeeb662629e2201535561 Author: Scott Sandre Date: Thu Mar 10 12:58:41 2022 -0800 Data Skipping PR 2/2 - DataSkippingReader This is PR 2/2 for Delta Lake OSS Data Skipping (aka file skipping with column stats) feature. See the issue https://github.com/delta-io/delta/issues/931. This PR completes the `DataSkippingReader` class implementation and adds tests. We add `DataSkippingDeltaTests`. 
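As an illustration of the idea behind file skipping with column stats, here is a minimal Python sketch. It is not the `DataSkippingReader` implementation: `FileStats`, `may_contain` and `files_to_scan` are hypothetical names, and the sketch assumes that each file carries a min and max value for a single integer column and that the query is a simple equality filter.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for one integer column."""
    path: str
    min_value: int
    max_value: int


def may_contain(stats: FileStats, value: int) -> bool:
    """A file can only hold `value` if it lies inside the [min, max] range."""
    return stats.min_value <= value <= stats.max_value


def files_to_scan(files: List[FileStats], value: int) -> List[str]:
    """Keep the files whose stats cannot rule out the filter `col = value`."""
    return [f.path for f in files if may_contain(f, value)]


files = [
    FileStats("part-0.parquet", 0, 9),
    FileStats("part-1.parquet", 10, 19),
    FileStats("part-2.parquet", 5, 12),
]
print(files_to_scan(files, 11))  # ['part-1.parquet', 'part-2.parquet']
```

Stats can only prove that a file does not match, so a file that survives the check may still contain no matching rows and has to be read.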
Closes delta-io/delta#974 GitOrigin-RevId: 1e1b3473df0fab97567265ffcb16686e12072579 commit 10e0d5b3e3cfa706d12313377b3a80a9ce24b9fe Author: Allison Portis Date: Thu Mar 10 09:59:14 2022 -0800 Add new PR's to "Needs review" rather than backlog Closes delta-io/delta#992 Signed-off-by: allisonport-db GitOrigin-RevId: 810704ae3a03532fac4584e54a8dc6b1d6828534 commit 59b4c8e876c23404b8a6075c53d691a87896cce9 Author: alexoss68 Date: Thu Mar 10 09:35:00 2022 -0500 Improve row metrics for UPDATE GitOrigin-RevId: 26ec189bbdc45d795511c6f282e17f911b9a6621 commit 2d9968f79f8b0cdb81c124b5d36dc428f39cd601 Author: Prakhar Jain Date: Wed Mar 9 15:07:45 2022 -0800 Allow every Delta txn other than those containing Metadata updates against blind appends Now any transaction which has non-file actions goes through conflict detection against blind append unlike before. This may lead to failure of such transactions. This was intentionally done to stop METADATA / other non-file-action transactions to follow conflict detection. But as a side effect, Streaming Updates (which have SetTransaction action) also started following this flow and started failing. Instead of allowList, we want to move to blockList approach. The blockList approach is less safe from correctness perspective but more safe from rollout perspective. Added UT GitOrigin-RevId: d09aea4bacbe6de3d000cde58c84f1d9ee1bb638 commit eff0b3ca289fefb988faba99c4cb1de03cdd9b38 Author: Allison Portis Date: Wed Mar 9 10:23:30 2022 -0800 Add new PR's to "Needs review" rather than backlog (#290) commit 59d5ea234bcf1d84ddf5c6114a4208eae80cce6c Author: Kam Cheung Ting Date: Wed Mar 9 08:47:26 2022 -0800 Refactor first set of 20 error messages. This PR is going to refactor the existing exceptions in our code base, so that they are well organized and queryable. This commit refactor the first 20 error messages. To handle compilation and error class handling, there are two more classes created: 1. 
The DeltaThrowable: this is used as the base interface for Delta code and contains the implementation needed by the error message framework. With its help, the compilation error of OSS Delta caused by outdated Apache Spark can be avoided. 2. The DeltaThrowableHelper: This is used to pick the error class template from JSON files. It handles the messages of all Delta exceptions. Thus, there are 2 JSON files we read: 1. error-classes.json: it stores the error classes for exceptions inside the Apache Spark code base. It will be maintained by the Apache Spark community. 2. delta-error-classes.json: it stores the error classes for exceptions inside our Delta code base. This is maintained by Delta Lake. Note: 0. The error-classes.json and delta-error-classes.json are not allowed to include the same error class. 1. Test cases for all error classes are covered. 2. No duplicate error classes in Delta 4. No error classes are shared by Delta and Spark 5. Delta error classes are correctly formatted 6. Delta message format invariants GitOrigin-RevId: cff83f531faeaa97391f8debad7e8dbe7f18f632 commit 0d1652a2af7b673b36605784eba4d0ae79bf9b17 Author: Tathagata Das Date: Wed Mar 9 11:35:38 2022 -0500 Added TPCDS to Delta OSS benchmark framework In this PR, I am adding the TPCDS benchmark in the newly added benchmark framework. It is constructed such that you have to run the following two steps. 1. *Load data*: You have to create the TPC-DS database with all the Delta tables. To do that, the raw TPC-DS data has been provided as Apache Parquet files. In this step you will have to use your EMR cluster to read the parquet files and rewrite them as Delta tables. 2. *Query data*: Then, using the table definitions in the Hive Metastore, you can run the 99 benchmark queries. Please see the README updates for more details.
Manual Closes delta-io/delta#973 Signed-off-by: Shixiong Zhu GitOrigin-RevId: ceb2bdaa51a4d637e791ea48bfd7eaabdd297167 commit f01a93c6c007e2edca8b4f0a9e8df7f032e26a2e Author: Scott Sandre Date: Wed Mar 9 08:10:00 2022 -0800 DynamoDBLogStore Java Refactor setup - create dynamodbLogStore SBT module - create basic skeleton for java implementation Closes delta-io/delta#959 Signed-off-by: Scott Sandre GitOrigin-RevId: a371dbb8f7622934d22567114717987d5a0aef99 commit 3cd8d704ed7ee5125f0bcb644ad4c653acb7c1d9 Author: Yaohua Zhao Date: Wed Mar 9 10:16:11 2022 +0800 Minor refactor to OptimisticTransaction GitOrigin-RevId: 1d06412d25359e2f06ab4fe50b7f8e593bd7dc6c commit e71c7a77dbd8addb965e391cbd62ac363eef26f1 Author: Venki Korukanti Date: Tue Mar 8 16:42:14 2022 -0800 Refactor Delta RESTORE command tests Refactor restore tests to make them common enough to use in other forms of RESTORE command such as SQL. Currently they are tied to the Scala RESTORE APIs. As part of it also increase the coverage by adding more tests. Test only change. GitOrigin-RevId: 37664d955d9e805b7788494f0994cc0006497a92 commit 8721451747168ae31fb3d94fce0148895c2d3303 Author: Vegard Stikbakke Date: Tue Mar 8 15:23:32 2022 -0800 Fix one grammar error and a broken link in PROTOCOL.md Hi all, thanks for this terrific project. I was reading through PROTOCOL.md and spotted these two mistakes. Closes delta-io/delta#982 Signed-off-by: Scott Sandre GitOrigin-RevId: 21b3dc7fb0d43ba25dcca96a0bc9b1d98c55f9da commit 89d2b9d58ad315379a1b2fc9fb7f56b97aef47a1 Author: Peng Zhong Date: Tue Mar 8 14:15:17 2022 -0800 Add log when trying to find last checkpoint This change adds a log message when trying to find last checkpoint of delta. 
GitOrigin-RevId: e3c5d27b27d4c751a532250aa9fd0f137713f27e commit 3eb6c869acad3e32d1a0c6edbab28c87febe1c1d Author: Scott Sandre Date: Tue Mar 8 13:10:21 2022 -0800 Add test to LogStoreSuite for HDFSLogStore rename() error edge case - follow up to https://github.com/delta-io/delta/pull/980 - rename `class HDFSLogStoreSuite` to `trait HDFSLogStoreSuiteBase` so that internal (scala) and public (java) HDFSLogStore suites can both use that trait - we add a specific test to `HDFSLogStoreSuiteBase` to test the `HDFSLogStore.writeInternal ... rename` edge case brought up by the above PR [#980](https://github.com/delta-io/delta/pull/980) Closes delta-io/delta#983 Signed-off-by: Scott Sandre GitOrigin-RevId: fea6e50002cd8934685fc64d500c4f06fd06bf93 commit c645045aee415398ae9316332e6d839a0b368eca Author: Adam Binford Date: Tue Mar 8 12:20:38 2022 -0800 Add gitignores for VSCode + Metals Finally adding some gitignores that I've kept locally for a VSCode + Scala Metals plugin based setup. Closes delta-io/delta#970 Signed-off-by: Scott Sandre GitOrigin-RevId: 0158f7457825277fc2c81da1a6f5d4e749449876 commit 918da41d7f14c3199c87f9d99f4cd3b3db210884 Author: Jackie Zhang Date: Mon Mar 7 18:33:02 2022 -0800 Support arbitrary chars in column names This PR introduces the `DeltaArbitraryColumnNameSuite`, which tests arbitrary column names containing special characters such as `% , #` that weren't allowed before but are now supported for tables under column mapping. This test should be automatically run.
Closes https://github.com/delta-io/delta/pull/976 GitOrigin-RevId: d7fb7bee6b1ef3456e660402e71fdb3c11cd5d80 commit 78898a09e7bd9f9b389b69f46659bf9499494b40 Author: lzlfred Date: Mon Mar 7 17:35:39 2022 -0800 Fix the support for gettimestamp in generated columns GitOrigin-RevId: 888ec2c8dd99eb09ecd868da1c3ff5b7aaae331e commit e3582493443e360046ea580480ca4f6ad5c00a91 Author: Hoang Pham Date: Mon Mar 7 15:27:44 2022 -0800 HDFSLogStore.java Fix catch - throw exception mismatch Resolve #978 Signed-off-by: Pham Hoang Closes delta-io/delta#980 Signed-off-by: Scott Sandre GitOrigin-RevId: 80dbdf00e701116572373fca7fd15de532a98993 commit c4acedf645ad67f6e73243f853fea60f3cd73787 Author: lzlfred Date: Thu Mar 3 19:53:25 2022 -0800 Handle exception in listDeltaAndCheckpointFiles when accessing results from listFrom The PR fixes the exception handling when some file system only throws exception when accessing the results from listFrom GitOrigin-RevId: 6acbb82984d2ee18a57554178b165fb9bb5d0e00 commit 8626b3f066bc8e4d35863bb96fb67b1bc04798e0 Author: Kam Cheung Ting Date: Wed Mar 2 21:27:34 2022 -0800 Disable unstable test from DeltaColumnMappingSuite The test case "read/write id mode should be blocked" of DeltaColumnMappingSuite is not stable. GitOrigin-RevId: fbe8bc832029403bdecbdef32f3ae47eab29ed16 commit f1b87d090c8fc19695e74b217e93801a18905140 Author: Adam Binford Date: Wed Mar 2 17:18:55 2022 -0800 Make hadoop provided for storage module Resolves delta-io/delta#967 Make Hadoop dependency provided for storage module. Usages must provide their own Hadoop libraries and versions. 
Closes delta-io/delta#969 Signed-off-by: Shixiong Zhu GitOrigin-RevId: bc2e0224acadcea4fbf10931c43226aa84c2ee2d commit a8d8b03b8d31f5f7c238e5b29cf6cea6f0d0f1fd Author: pkubit-g <91378169+pkubit-g@users.noreply.github.com> Date: Tue Mar 8 17:34:48 2022 +0100 Flink Delta Sink - refactor partition computer (#281) * Expose DeltaPartitionComputer instead of DeltaTablePartitionAssigner && add RowData partition computer * Make DeltaPartitionComputer internal and expose only utility method * Create DeltaSinkRowDataBuilder and make all other interfaces internal * Update README.md * update DeltaBucketAssigner javadoc * minor comment changes * Update DeltaSinkPartitionedTableExample.java * Correct code review remarks Co-authored-by: Scott Sandre commit 9a0b8648e38fca1eeaf7ec87d9781dccef2b02c6 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Tue Mar 8 08:23:33 2022 -0800 Update build.sbt and tested (#289) commit 6c7ebda877c4b4d69d65ca2b7abf806e29132d94 Author: Tathagata Das Date: Wed Mar 2 16:58:53 2022 -0500 Basic OSS framework for running benchmarks on Delta workloads Add a basic framework for writing benchmarks on Spark running in an EMR cluster. Here is how to use it. - Start your own EMR cluster, and get the hostname and the pem file to ssh into it. - The file `delta/benchmarks/run-benchmark.py` has a list of benchmarks as a map between benchmark name --> specification (explained further below). - In the directory `delta/benchmarks/`, run a selected benchmark as `./run-benchmark.py --cluster-hostname -i --benchmark `. - This will eventually produce an output like this. ``` ... There is a screen on: 18319..ip-172-31-54-89 (Detached) Files for this benchmark: 20220107-131623-test-benchmarks.jar 20220107-131623-test-cmd.sh 20220107-131623-test-out.txt >>> Benchmark script started in a screen. Stdout piped into 20220107-131623-test-out.txt.Final report will be generated on completion in 20220107-131623-test-report.json.
``` - You can then ssh into the EMR cluster to monitor it. The actual spark code of a benchmark is defined in the scala files that are compiled into a jar using build.sbt. The benchmark specifications in `delta/benchmarks/run-benchmark.py` defines the following - which main scala class to be started. - command line argument for the main function - additional maven artifact to load (example `io.delta:delta:1.0`) - spark confs The script `run-benchmark.py` does the following: - compile the scala code into a fat jar - upload it to the given hostname - using ssh to the hostname, it will launch a screen and start the main class with spark-shell/spark-submit - Monitor the output file with continuous logging and wait for the completion of the workload - Upload the generated benchmarks results as csv/json to the given Structure of the code - `build.sbt`, `project/`, `src/` form the SBT project. - `Benchmark.scala` is the basic interface, and `TestBenchmark.scala` is a test implementation. - `scripts` has the core python scripts that are called by `run-benchmark.py` Manual testing using the following command: `./run-benchmark.py --cluster-hostname --benchmark-path --benchmark test` Closes delta-io/delta#971 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 19aa890c5291e302fdbe9edb45c080bbe4b5d36e commit a944d79299715bf58dae76ed4706be8056e69b87 Author: Jackie Zhang Date: Wed Mar 2 13:52:34 2022 -0800 Enable column name mapping in Delta Add a column name mapping mode in Delta, which allows Delta to use different names in the table schema and in the underlying Parquet files. This is a building block for issues https://github.com/delta-io/delta/issues/957 and https://github.com/delta-io/delta/issues/958. New unit tests. 
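To make the idea of separate schema names and Parquet names concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not Delta's implementation: `to_physical`, `rename_column` and the `col-0001` style physical names are hypothetical, and the real metadata layout is not described in this commit message.

```python
from typing import Dict, List


def to_physical(mapping: Dict[str, str], logical_columns: List[str]) -> List[str]:
    """Translate logical (table schema) names to physical (Parquet file) names."""
    return [mapping[name] for name in logical_columns]


def rename_column(mapping: Dict[str, str], old: str, new: str) -> Dict[str, str]:
    """Return a new mapping with `old` renamed to `new`.

    Only the logical name changes; the physical name is kept, so files
    that were already written do not need to be rewritten.
    """
    if old not in mapping:
        raise KeyError(f"unknown column: {old}")
    if new in mapping:
        raise ValueError(f"column already exists: {new}")
    return {(new if name == old else name): physical
            for name, physical in mapping.items()}


# Hypothetical mapping: logical name -> physical name stored in Parquet.
mapping = {"id": "col-0001", "user name": "col-0002"}
renamed = rename_column(mapping, "user name", "user_name")
print(to_physical(renamed, ["id", "user_name"]))  # ['col-0001', 'col-0002']
```

Because readers resolve every logical name through the mapping, a rename touches table metadata only, which is why a mapping of this kind can serve as a building block for column rename.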
Closes https://github.com/delta-io/delta/pull/962 GitOrigin-RevId: 7a64a33fd60781c17236bff6168044e702c8413a commit 1c60f8855e9af05ab5cf378e3264582fa09c7b8c Author: Carmen Kwan Date: Wed Mar 2 22:48:01 2022 +0100 Expand metrics for DeleteCommand In this PR, we add 13 new metrics to usage logs for the `DELETE` command: * `numRemovedFiles`: how many files we removed. Alias for `numTouchedFiles` * `numAddedFiles`: how many files we added. Alias for `numRewrittenFiles` * `numFilesBeforeSkipping`: how many candidate files before data skipping * `numBytesBeforeSkipping`: how many candidate bytes before data skipping * `numFilesAfterSkipping`: how many candidate files after data skipping * `numBytesAfterSkipping`: how many candidate bytes after data skipping * `numPartitionsAfterSkipping`: how many candidate partitions after data skipping * `numPartitionsAddedTo`: how many new partitions were added * `numPartitionsRemovedFrom`: how many partitions were removed * `numCopiedRows`: how many rows were copied * `numDeletedRows`: how many rows were deleted * `numBytesAdded`: how many bytes were added * `numBytesRemoved`: how many bytes were removed Unit tests. GitOrigin-RevId: e6cad814b21521e16b377e2da52ea89479eb5439 commit fcdaf67b8026e36f0ab9f6f11668d9c73313ace3 Author: Tom van Bussel Date: Wed Mar 2 19:53:06 2022 +0100 Minor refactor to UpdateCommand GitOrigin-RevId: e8da5718a5c8e65f6967fb5ca42d40222aceafb1 commit 0ed90b5822cd23c3d44b1a685d90f66cc741362e Author: Scott Sandre Date: Wed Mar 2 10:39:49 2022 -0800 Data Skipping PR 1/2 - PrepareDeltaScan This is PR 1 of 2 for the Delta Lake OSS Data Skipping feature (aka file skipping with columns stats). Closes https://github.com/delta-io/delta/issues/931 This PR adds several key classes (and some tests) that focuses on properly transforming the catalyst logical plan so that parquet files can be properly filtered using stats + filter expressions. 
Notably, this PR doesn't implement the actual statistic reader to properly filter those files. That will come in PR 2/2. The main changes are - added `PrepareDeltaScan`: a new rule (transformation) to be applied during the query planning process. It also ensures that scans on the same Snapshot in a query reuse that snapshot (performance optimization) - `DataSkippingReader`: a bare-bones skeleton to read stats + filter files that we will fully implement later in PR 2/2 - changed `OptimisticTransaction` and `Snapshot` to properly use `PrepareDeltaScan` - add `DeltaWithNewTransactionSuite` to test that concurrent transactions maintain snapshot isolation given the snapshot-reuse changes that we have added to PrepareDeltaScan GitOrigin-RevId: 284d8cc2df02bbe392b47b37b8b78f98cd989376 commit 4396c8997fd15f1b0f1eaf9586335d8cd2873aa2 Author: Lars Kroll Date: Wed Mar 2 16:56:00 2022 +0100 Minor refactor to MergeIntoSuiteBase GitOrigin-RevId: a1423359191b3d659eddfefc8dfcbba9391ee66a commit 8624b92ddd8d47f98e91b88b19b6d4af2e09033b Author: Bart Samwel Date: Wed Mar 2 09:33:38 2022 +0100 Calculate MERGE numSourceRows from findTouchedFiles Calculate the MERGE metric `numSourceRows` from the first job (findTouchedFiles) instead of from subsequent jobs. The problem is that numSourceRows as captured in Job 2 doesn't always get captured because Job 2 doesn't always run. So instead, we move the calculation of numSourceRows to Job 1 (preserving the existing metric, but calculating it in a different way), but we then re-add the calculation in Job 2 (if it runs) under a different name. In the diff it looks like renaming the metric, but it's really a move + re-add. Changes: * Renamed the existing metric from the other jobs to `numSourceRowsInSecondScan`. * Added calculation of `numSourceRows` to `findTouchedFiles`. * Added a check for mismatches between those two values, enabled only by a config that is disabled by default. This will help us to flag nondeterministic sources. Unit tests. 
New test for the config that detects nondeterministic sources. GitOrigin-RevId: 2d92c969a4077449b85e661c9044fac9aa095759 commit 582bc95a48434cae7779b4d6a7b22491c44f6298 Author: Ryan Johnson Date: Tue Mar 1 14:29:01 2022 -0800 FileNotFoundException does not imply InitialSnapshot This PR changes Delta snapshot management and file listing code to return options, with `None` meaning the directory was empty or missing. Otherwise, they return `Some(logSegment)` -- possibly with an empty file list, if the search found no usable commit files. That way, we can reliably distinguish a truly empty/missing Delta table from one whose log files are corrupted or missing in a way that prevents snapshot construction. The former should produce an `InitialSnapshot` while the latter should propagate an error. Previously, Delta snapshot management code made the unsafe assumption that `FileNotFoundException` always necessarily meant the directory was empty, and several code locations caught the exception in order to create an `InitialSnapshot` that designates an empty table. This led to an awkward and brittle design, where code had to either avoid throwing `FileNotFoundException` -- even if the problem was, in fact, a file not found... or else catch and wrap the exception to ensure it propagated past the catch clauses that would wrongly create an `InitialSnapshot`. Existing unit tests cover this code. 
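The distinction described above can be sketched in a few lines of Python, with `Optional` standing in for Scala's `Option`. This is only an illustration under stated assumptions: `LogSegment` here is a simplified stand-in, `get_log_segment` and `describe_snapshot` are hypothetical names, and treating files that end in `.json` as commit files is a simplification for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LogSegment:
    """Simplified stand-in: the usable commit files found by a listing."""
    files: List[str] = field(default_factory=list)


def get_log_segment(listing: Optional[List[str]]) -> Optional[LogSegment]:
    """Return None only when the log directory is missing or empty.

    `listing` is None for a missing directory and [] for an empty one.
    A non-empty directory always yields a LogSegment, possibly with an
    empty file list when no usable commit file was found.
    """
    if not listing:
        return None
    commits = [name for name in listing if name.endswith(".json")]
    return LogSegment(commits)


def describe_snapshot(listing: Optional[List[str]]) -> str:
    segment = get_log_segment(listing)
    if segment is None:
        # Truly empty or missing table: safe to start from an initial snapshot.
        return "InitialSnapshot"
    if not segment.files:
        # Files exist but none is usable: surface the problem to the caller.
        raise RuntimeError("log directory has files but no usable commits")
    return f"Snapshot({len(segment.files)} commits)"


print(describe_snapshot(None))                  # InitialSnapshot
print(describe_snapshot(["0.json", "1.json"]))  # Snapshot(2 commits)
```

With the two cases separated in the return type, callers no longer need to catch `FileNotFoundException` to decide whether a table is empty.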
GitOrigin-RevId: 6d4330b43cdfa11f69f64ee3849eab9192c9b268 commit 01ce4f5b4f319e64a4003db95b2f13a51ad35527 Author: Venki Korukanti Date: Tue Mar 1 11:49:53 2022 -0800 Minor improvements to Restore Python API implementation and tests * Add args check * Add more tests (negative tests) Existing UTs Closes https://github.com/delta-io/delta/pull/968 GitOrigin-RevId: dee05a116a1a685425a09b34ebb643d86e70486b commit 0a063fa18e56380ac8c1ef90b5e7cbbd5774fa86 Author: Bart Samwel Date: Tue Mar 1 09:22:47 2022 +0100 Minor refactor to DeltaLogging GitOrigin-RevId: 87d0e2103311deec34f15a1ee818982ce920e5b4 commit 06839e4ad34d01b3cdeac10f1bcf8cb9138205af Author: Yijia Cui Date: Mon Feb 28 21:56:50 2022 -0800 Fix array-of-struct schema evolution failing to cast when containsNull is false. There is no user-facing API involved in this PR fix. Problem: When the column is constructed by spark sql function `array(struct(…))`, all the struct fields are not nullable, and `containsNull` in that arrayType is false. Because castIfNeeded is called recursively to cast every array structType elements, which generates nullable expressions, ArrayTransform will generate an ArrayType with containsNull as true. Thus, an AnalysisException is thrown saying that it cannot resolve the two types because of the inconsistency in `containsNull` between the arrayType generated by ArrayTransform and the one being casted to when the schema evolution should be resolved properly. Fix: We need to make `containsNull` in `to` as true so casting won't fail in the above case. Unit test. 
GitOrigin-RevId: 38163dfc9f6b47ec70851507a70bec2e8e82f0fb commit 687a465350d5cffff4e49dfbe36661319ab6f2c0 Author: Jackie Zhang Date: Mon Feb 28 21:07:34 2022 -0800 Minor refactor to DeltaErrors GitOrigin-RevId: 690a783ce099787335b3a9d1b8df893ba76c1685 commit a848c78d5bd34301fc7bbc55f42325b7e14ba1e8 Author: Sabir Akhadov Date: Mon Feb 28 12:09:19 2022 +0100 Minor refactor to UpdateCommand GitOrigin-RevId: 06b977da0217ed4660225d27fc025393fae0289b commit 31b2480d902d3df5cec72f551e007ebbc1e5e5d2 Author: Hyukjin Kwon Date: Fri Feb 25 06:30:24 2022 +0000 Minor refactor to tables.py Authored-by: Hyukjin Kwon (cherry picked from commit 88696ebcb72fd3057b1546831f653b29b7e0abb2) Author: Hyukjin Kwon GitOrigin-RevId: 03d5caf29ea77c627785f128b5fe775a084fa227 commit de040fd41b843a637a89d4aaf0cb5c2f215feb61 Author: Kam Cheung Ting Date: Thu Feb 24 10:53:23 2022 -0800 Minor refactor to OptimisticTransaction GitOrigin-RevId: be17660a18efb173139302950cf542ac5ee4fe06 commit 2a5fcef74e35854eb9925531199dbcecee34953c Author: Sabir Akhadov Date: Thu Feb 24 11:07:16 2022 +0100 Minor refactor to DeleteCommand GitOrigin-RevId: 80194b9be69b6a1bb2ec9d0e7f2448b16172e571 commit f0610888ec7857edf24b47a901351725f1f303be Author: Scott Sandre Date: Wed Feb 23 23:20:32 2022 -0800 Minor refactor to DeltaLogging GitOrigin-RevId: bd43d008ade91803be89b1857d42a0df2b2a826d commit 04ffdf0a85ea6ade9b65979f02586f4191ad1440 Author: Pranav Date: Wed Feb 23 15:55:01 2022 -0800 Fix DeltaSinkSuite flakiness - Increase timeout for the first test - Test only change GitOrigin-RevId: aa53880b4e3f10d05b7d50a0e581aa403f45c1ee commit 08c286228a2f5462a1fc8ebf194f9ade7b3c7a20 Author: Ryan Johnson Date: Wed Feb 23 14:33:22 2022 -0800 Minor refactoring of log segment and unit test code This patch makes minor cleanups to Delta code: - Rename `LogSegment.checkpointVersion` to `checkpointVersionOpt`, to improve readability at use sites. 
- Rework several unit tests to bypass the Guava cache (and the exception wrapping it imposes) by updating a stale `DeltaLog` instance instead of invoking `DeltaLog.forTable`. The variable renaming is safe (compiles => correct), and existing unit tests heavily exercise the affected code. Meanwhile, the changed unit tests still pass. GitOrigin-RevId: 22b2e1480d6ae2ff6c36145e992fc4931c994da3 commit 4f06c09a5fdc1c0a9b30caf3566a8276eefe5646 Author: Venki Korukanti Date: Wed Feb 23 13:10:49 2022 -0800 Minor refactor to ZOrderMetrics GitOrigin-RevId: 459a25d90b5e5dea59980d3a7fb6bb71ab9b378a commit a03e6a9fbe1a581caa8fbac33f6c3a97ee1e3f7b Author: Junyong Lee Date: Wed Feb 23 09:55:00 2022 -0800 Fix unit tests not to depend on checkpoint interval config In many of our existing unit tests, we assume the checkpoint interval is 10. Unit tests should not depend on a particular config value, so this change makes those tests more robust by removing the hard-coded checkpoint interval dependency. Ran the unit tests with different checkpoint interval configs and verified that they all pass.
GitOrigin-RevId: 3db16db7a4a444a49745918655b1576591c4e045 commit d91c8f75ea8edc33cec8de5d289564eec11c08d3 Author: Sabir Akhadov Date: Wed Feb 23 18:25:11 2022 +0100 Minor refactor to actions.scala GitOrigin-RevId: 04d8d739fd9232ac6812e02aa931d8dcd351c432 commit 85d194023b5aa53436e2ba506fc1fc9e5c03b1e9 Author: Scott Sandre Date: Tue Feb 22 20:21:20 2022 -0800 Delta Storage: `HDFSLogStore` - updates `LogStore.java` interface to include `throw IOException` when applicable - adds `HadoopFileSystemLogStore.java` abstract class - adds `HDFSLogStore.java` - updates `LogStoreSuite` - updates `DelegatingLogStoreSuite` Closes delta-io/delta#933 Signed-off-by: Scott Sandre GitOrigin-RevId: f190ad291a11a2fea8180e9bc69c5b9c1cfdbc88 commit e366ccd6179c70dd603c2093a912aacfe719ed00 Author: Venki Korukanti Date: Tue Feb 22 13:00:27 2022 -0800 Add support for optimize (file compaction) on Delta tables This PR adds the functionality to compact small files in a Delta table into large files through OPTIMIZE SQL command. The file compaction process potentially improves the performance of read queries on Delta tables. This processing (removing small files and adding large files) doesn't change the data in the table and the changes are committed transactionally to the Delta log. **Syntax of the SQL command** ``` OPTIMIZE ('/path/to/dir' | delta.table) [WHERE part = 25]; ``` * The SQL command allows selecting subset of partitions to operate file compaction on. **Configuration options** - `optimize.minFileSize` - Files which are smaller than this threshold (in bytes) will be grouped together and rewritten as larger files. - `optimize.maxFileSize` - Target file size produced by the OPTIMIZE command. 
- `optimize.maxThreads` - Maximum number of parallel jobs allowed in OPTIMIZE command New test suite `OptimizeCompactionSuite.scala` Closes https://github.com/delta-io/delta/pull/934 GitOrigin-RevId: f818d49b0f13296768e61f9f06fadf33a7831056 commit 58e9fcfbe09809915fce35a130ae13d66a385c75 Author: Gengliang Wang Date: Mon Feb 21 13:14:51 2022 +0800 Use parseMultipartIdentifier rather than parseTableIdentifier in DeltaTableV2 GitOrigin-RevId: cd88d403a7b2f7ceb46c8eb01cd3515bcdf58eeb commit 9ea19020419c4c23532b14cf4495351ad413654c Author: Scott Sandre Date: Fri Feb 18 10:36:36 2022 -0800 Column Stats Generation Resolves https://github.com/delta-io/delta/issues/923 Please see the full design doc here. An excerpt is copied below. This PR adds statistics generation such that per-file statistics are injected into `AddFile` during writes. This is accomplished by `StatisticsCollection` which `TransactionalWrite` extends. During the actual file write, `DataSkippingStatsTracker` will use the given `StatisticsCollection` `Expression` to generate the stats values. We add `StatsCollectionSuite` to test that statistics generation is correct. Public API Changes - DeltaSQLConfig.DELTA_COLLECT_STATS - whether or not stats collection is enabled - DeltaSQLConfig.DATA_SKIPPING_STRING_PREFIX_LENGTH - for string columns, how long a prefix to store in the data skipping index - DeltaConfig.DATA_SKIPPING_NUM_INDEXED_COLS - the number of columns to collect stats on for data skipping Statistics column schema - The AddFile metadata action already supports storing the numRecords statistic, and the Delta [protocol](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#per-file-statistics) already gives an example of how to store per-column min, max, and null-count statistics, too. I propose we use the same statistics schema. The actual implementation can be broken down into four fairly self-contained tasks.
- Create the per-column statistics expression - I propose a StatisticsCollection class be created which creates the stats generation Expression given an input schema (and the number of columns to index) - Execute that Expression to actually generate and collect the statistics values - [org.apache.spark.sql.execution.datasources.FileFormatWriter.write](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L95) already supports receiving a statistics tracker of type [org.apache.spark.sql.execution.datasources.WriteJobStatsTracker](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriteStatsTracker.scala#L37). We just need to inject our own that uses the StatisticsCollection instance above - Inject both of these entities during the actual data write - `TransactionalWrite::writeFiles` seems like the place for this - Tests - All we really need to test is StatisticsCollection Closes https://github.com/delta-io/delta/pull/924 GitOrigin-RevId: 91c31a7ba73b13be4265cd44ee36afa1fbe7c47f commit af77ec6c9456073a4542f92f3c31ef57a7a0fa76 Author: Jackie Zhang Date: Sat Feb 19 00:04:24 2022 +1100 Minor refactor to DeltaTable tests GitOrigin-RevId: 35a3a29890415af959bd38459cb5f2a6d7739b2c commit e3b19011aebd4ae4caadad720b716813540ec445 Author: lzlfred Date: Thu Feb 17 10:55:29 2022 -0800 Minor refactor to Checkpoints, Checksum, DeltaLog, SnapshotManagement GitOrigin-RevId: fac3b286d3fbb7fcc5a2bd2dccbd4a9c5499e8d7 commit b40b2df3f800547d20741d1535b26eead6c5d7d7 Author: Scott Sandre Date: Tue Feb 15 14:26:00 2022 -0800 Re-enable and fix flaky test in DeltaRetentionSuite GitOrigin-RevId: 5202f660707957aabf9436a834536ce4cda9d2e9 commit 03af77bea6dca884762853259f8c34666689737f Author: Prakhar Jain Date: Tue Feb 15 10:12:16 2022 -0800 Disable special handling of empty transactions in Delta Delta reduces the isolation level for empty transactions
e.g. MERGE/DELETE/Update etc. This causes SERIALIZATION issues when mixed with "blind append special handling" in Delta. This PR removes the special handling of empty txns behind a config so that we can go back to earlier behaviour (the wrong behaviour) if needed. GitOrigin-RevId: 190aa3f461f637202322e371a26d8a32d5f1c9af commit 3b185b210c7f51e884d123c8c37ccc407a8270a5 Author: pkubit-g <91378169+pkubit-g@users.noreply.github.com> Date: Tue Feb 15 18:44:31 2022 +0100 Flink Delta Sink - Docs Refactor 2 (#278) * Refactor main classes to hide internal implementations from JavaDoc generation * Revert abstract interfaces and fix docs generation * Correct code review remarks commit aa2b06c96fa22ce6c7116d8b564d183bb583b175 Author: Scott Sandre Date: Mon Feb 14 09:37:41 2022 -0800 Add LineCloseableIterator to delta storage Adds Java implementation of `LineCloseableIterator` to `delta-storage` and updates `LineCloseableIteratorSuite` in `delta-core` to test both scala (internal) and java (public) versions. This class is a prerequisite for `HadoopFileSystemLogStore.java`. This PR also moves `storage/src/scala/java/io/delta/storage/CloseableIterator.java` (scala folder) to `storage/src/main/java/io/delta/storage/CloseableIterator.java`. It should be in the java folder. 
Closes delta-io/delta#926 Signed-off-by: Scott Sandre GitOrigin-RevId: 9f7c3b8236e8163f517515f3fac76fefab33faa5 commit 078ccf42cac53d54e313a75e098521f841bbe663 Author: Shixiong Zhu Date: Fri Feb 11 10:30:35 2022 -0800 Disable a flaky test in DeltaRetentionSuite GitOrigin-RevId: 34a13f51e9fae4b0376f5a702e93b16d84b267a2 commit 867256dcc77b9d707bd5677115b06b97241b004d Author: Wenchen Fan Date: Fri Feb 11 21:19:56 2022 +0800 Minor refactor to DeltaDataFrameWriterV2Suite GitOrigin-RevId: 5c8060aff4ef0d4f534c5d801c532bb54b782040 commit 2f5e51d2e66c66d0bb02dd941afdc8df4bf4a1ba Author: Tyson Condie Date: Thu Feb 10 17:46:34 2022 -0800 Add file size histogram to version checksum GitOrigin-RevId: 47a30a064c9c358c775c06202a5102224bc137ed commit d8e09cf64e018b2ef2f12025f344c6828d8a46c8 Author: Gengliang Wang Date: Fri Feb 11 01:24:03 2022 +0000 Minor refactor to DeltaTable.scala GitOrigin-RevId: c2ac7522614cdf900da2e235bc107ef7755ea886 commit 118963e45f0b6e0083bc5b850333d8d8b96a3aa5 Author: Tyson Condie Date: Thu Feb 10 10:57:43 2022 -0800 Removed set transaction count from checksum The removal of set transaction count in the checksum state. Existing tests cover these changes. GitOrigin-RevId: d6227b734389f6c8ce5b7c247fe08d51d3a20cfa commit 26193466508b14d0186ebb66fa7041eb0cd0d0e2 Author: Jackie Zhang Date: Thu Feb 10 17:45:15 2022 +1100 Fix duplicated physical name detection in upgraded name mode tables When user upgrades an existing table to name mode, we use the field's logical name as the physical name. In this case, fields like `a.b.c` should not conflict with `x.y.c` just because the leaf field share the same column name. 
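The rule described above can be sketched as follows. This is a minimal illustration in plain Python, not the Delta implementation: the function names and the tuple-based representation of field paths are assumptions made for the example. It shows why uniqueness has to be checked on the full physical path rather than on the leaf name alone.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical representation: each field is the tuple of name parts on its path,
# e.g. ("a", "b", "c") for the nested field `a.b.c`.
FieldPath = Tuple[str, ...]


def duplicated_leaf_names(paths: Iterable[FieldPath]) -> List[str]:
    """Leaf-only check: flags any leaf name that appears under more than one path."""
    counts = Counter(path[-1] for path in paths)
    return sorted(name for name, n in counts.items() if n > 1)


def duplicated_physical_paths(paths: Iterable[FieldPath]) -> List[FieldPath]:
    """Full-path check: flags only paths that are repeated in their entirety."""
    counts = Counter(tuple(path) for path in paths)
    return sorted(path for path, n in counts.items() if n > 1)


# `a.b.c` and `x.y.c` share a leaf name but are distinct fields.
fields = [("a", "b", "c"), ("x", "y", "c")]
leaf_conflicts = duplicated_leaf_names(fields)      # the leaf-only check reports "c"
path_conflicts = duplicated_physical_paths(fields)  # the full-path check reports nothing
```

With the leaf-only check the two fields would be reported as a conflict; with the full-path check only a genuinely repeated path such as `a.b.c` listed twice is reported.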
GitOrigin-RevId: 8b109b61ac3bdc80ac54b89f51a1f1259e45d540 commit 113aa91d4e4409a3a91041ca8ffacfb5a05667ee Author: Prakhar Jain Date: Wed Feb 9 18:17:37 2022 -0800 Minor improvement to log message in ConflictChecker GitOrigin-RevId: 6b9f2f01c5a9f33b8dac061eda7b7a23c5cbbb61 commit 1b62b9dd463859530f7d8bf54db73257911600ae Author: Scott Sandre Date: Wed Feb 9 14:18:23 2022 -0800 Delta Storage: initial project setup The first PR https://github.com/delta-io/delta/pull/925 was accidentally merged into master and then reverted. So, here's the same PR again. Closes delta-io/delta#930 Signed-off-by: Scott Sandre GitOrigin-RevId: 22f053c68533b8e638ef835d436b974cc4086f80 commit fe99e939da66af8cdaa28ac6fba8a889571f72ab Author: Scott Sandre Date: Tue Feb 8 14:57:32 2022 -0800 Disable Python tests when running in docker mode temporarily Due to docker.com rate limit reached errors. Co-authored-by: Shixiong Zhu GitOrigin-RevId: 70087391783a83e8c4e6bd0ac3e654f50e8b6c7e GitOrigin-RevId: 37d69a60c0304b4e477224339495613529482a88 commit 4eb2c4d3739ae198af1b13d52dd51ba0f6c2d766 Author: Wenchen Fan Date: Tue Feb 8 13:28:17 2022 +0000 Fix Delta tests to pass under ANSI SQL dialect mode This PR fixes a few places in Delta, to make it work under Spark ANSI dialect mode. There are two types of fixes: 1. Some tests are not supported in ANSI dialect mode, so ANSI mode is disabled explicitly in those tests 2. Some places need to do a cast, and expect the cast to return null. This PR sets the `ansiEnabled` flag to false in the created `Cast`. There is no behavior change when ANSI mode is false.
Authored-by: Wenchen Fan Co-authored-by: allisonwang-db GitOrigin-RevId: 82702cb80b15ae50938c5d40c9085ed64e2980e0 commit 5101c932f785b2ba180b3991d9e86adf0a3b885f Author: Allison Portis <89107911+allisonport-db@users.noreply.github.com> Date: Tue Feb 8 17:39:41 2022 -0800 Fix hive assembly JAR name (#277) commit 6ab2655bb9195eb42a9930b35b67bca1c54427f3 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Tue Feb 8 10:30:40 2022 -0800 Revert "Delta Storage: initial project setup (#925)" (#929) This reverts commit 5f8bd7ac94599b3039e59ee6be4d2df736660ce1. commit ebf398c14cae7861d6b14252ff9b8a0e7a6e5382 Author: pkubit-g <91378169+pkubit-g@users.noreply.github.com> Date: Tue Feb 8 19:06:46 2022 +0100 Refactor to hide internal classes and expose only abstractions (#275) commit 5f8bd7ac94599b3039e59ee6be4d2df736660ce1 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Tue Feb 8 09:31:35 2022 -0800 Delta Storage: initial project setup (#925) * initial project setup * remove mima settings from delta-storage sbt project * change hadoop-client-api and update comment * fix mima error for delta-core * update comment commit 59b38fb6c6fb0e9ba8fb2dd9868bf1210b95a232 Author: pkubit-g <91378169+pkubit-g@users.noreply.github.com> Date: Tue Feb 8 18:10:03 2022 +0100 Flink Delta Sink - Docs (#236) * Add readme and wiki * Correct review remarks and correct docs about sink's dependencies * Add flink-connector example project * Consolidate readme and wiki files && correct docs and descriptions * Correct code review remarks * Get unidoc command (javadocs) running ... 
- error with internal classes mentioned in public javadocs, however * Correct code review remarks * Adjust to sbt 1.6.1 * Correct JavaDoc generation * Update Java doc plugin && temporarily comment out docs exclusions Co-authored-by: Scott Sandre Co-authored-by: Paweł Kubit commit 875d562cf67d7f5dee7f2311d62ae835c8a9000d Author: Shixiong Zhu Date: Fri Feb 4 14:30:11 2022 -0800 Document tables created by other systems cannot be read by the Hive connector directly (#274) commit ab366306a8c9c47e5a551bdaa36dc7e9adb20984 Author: Patrick Pichler Date: Fri Feb 4 22:02:22 2022 +0100 Existing Power BI function wrapped into a custom connector. (#271) * Existing Power BI function wrapped into a custom connector. * Gitignore added /obj /bin. Moved MEZ file to parent folder. * folder deleted /obj /bin commit 580aa9aa7c4ead8a76400ab25521b942876f7a81 Author: Venki Korukanti Date: Thu Feb 3 22:01:12 2022 -0800 Minor refactoring to DeltaTable.scala GitOrigin-RevId: 8f3475d23d344e9a464c9d9e23f16e71ba91553a commit e74c29a2489cffd88a22971563c91a19f64d7e9d Author: Maksym Dovhal Date: Tue Feb 1 12:42:39 2022 -0800 Refactoring and optimisation of RestoreTableCommand * RestoreTableCommand moved to org.apache.spark.sql.delta.commands package * cache() of filesToRemove DataFrame removed (according to https://github.com/delta-io/delta/pull/863#issuecomment-1015672532) * cache() of filesToAdd will be applied only if spark.sql.files.ignoreMissingFiles = false (default value) Signed-off-by: Maksym Dovhal Closes delta-io/delta#912 Signed-off-by: Venki Korukanti GitOrigin-RevId: b10707c96766f74423874f01898587f97c69c6b5 commit cd0945c81c1e322896c9ba6e9732211abb701402 Author: David Lewis Date: Tue Feb 1 12:13:35 2022 -0700 Catch database exceptions as well as table exceptions The `TableCatalog.loadTable` method should only throw TableNotFound exceptions. Currently it might also throw `DatabaseNotFound`. This PR fixes that.
GitOrigin-RevId: 6ecd89fb5799b650e310c3dcb7f732fecd0482b3 GitOrigin-RevId: fb93fc20d77a704b82b88d293d39d0b8530eeaa1 commit 38807a7b6b90032675b350b59b022b89cce48337 Author: Will Jones Date: Tue Feb 1 10:34:56 2022 -0800 Mention deletion of delta log entries in PROTOCOL Existing writers may delete old JSON log entries if there are newer checkpoints. Fixes #888. Closes delta-io/delta#913 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 79cce715d78edb9aca33f2f8db7861e15634e812 commit be605d1b402de92f678c49ecdd640f2991c3a124 Author: Maksym Dovhal Date: Tue Feb 1 08:17:47 2022 -0800 Python API for restoring delta table * Add possibility to restore delta table using version or timestamp from pyspark Examples: ``` DeltaTable.forPath(spark, path).restoreToVersion(0) DeltaTable.forPath(spark, path).restoreToTimestamp('2021-01-01 01:01-01') ``` Tested by unit tests. Fixes https://github.com/delta-io/delta/issues/890 Signed-off-by: Maksym Dovhal Closes delta-io/delta#903 Signed-off-by: Venki Korukanti GitOrigin-RevId: 8ca6a3643d97b1a95ebf3a48edcb23f4f2adb6f4 commit 4bd549fddd11b9b95f84b86c8f0cdc801b8832bb Author: Meng Tong Date: Mon Jan 31 17:26:30 2022 -0800 Update Delta Protocol for Identity column The ability to have a column that is auto incrementing and generates integer values is a highly requested feature. This is a well established feature in existing data warehouses (such as Oracle, Redshift, ...). Not having this basic functionality makes it difficult for users to migrate from their existing DWs to Delta Lake. Hence, we propose to support identity columns in Delta Lake. As this change requires to update Delta protocol, this PR updates `PROTOCOL.md` to describe the Identity column support in the transaction log layer. We will work on the user facing feature after the new protocol format is accepted. 
Closes delta-io/delta#904 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 76d8f070c914539b3fe5a4aeb4147d10e7a4cfe8 commit 6e1a488f3e0ca3ea4e605948aa3870c25d0b25ee Author: stikkireddy Date: Fri Jan 28 20:43:32 2022 -0800 configure_spark_with_delta_pip fix to stop overwriting spark.jars.packages This pull request enhances the `configure_spark_with_delta_pip` function with an optional parameter `extra_packages` that lets users add their own custom jars as required. A new test class, `PipUtilsCustomJarsTests`, was added; it unit tests the change by adding a duplicate of the delta jar and asserting that the Spark conf contains both jars. A new test class was needed, rather than the existing one, because the test has to modify the Spark session. This resolves issue #889. Closes delta-io/delta#909 Signed-off-by: Shixiong Zhu GitOrigin-RevId: e667ef3aa63e8546fe12dd820fed8dc790919abe commit 416a3d593d254e11d444f9c60ba71cec93d6739f Author: Olivier NOUGUIER Date: Mon Jan 31 05:34:08 2022 +0100 :arrow_up: Scala 2.13.8 (#259) commit db98bea38940bb9ae93219af371d517081805f99 Author: pkubit-g <91378169+pkubit-g@users.noreply.github.com> Date: Fri Jan 28 23:54:08 2022 +0100 Rename shouldTryUpdateSchema -> mergeSchema (#263) commit 1fcb1dacc8d4ec671cd499676bb61d55d343cc2a Author: Gengliang Wang Date: Fri Jan 28 08:16:16 2022 +0000 Inline ParquetSchemaConverter.checkFieldName to avoid depending on this Spark internal API GitOrigin-RevId: 10064c35675bd8631a6ef0d0e0e81b3ba4c8e479 commit a00b85b01636980209b9800db842c008b5b351b5 Author: Jan Paw Date: Thu Jan 27 13:59:50 2022 -0800 Mima check for Scala 2.13 #911 To enable Mima check for Scala 2.13 Signed-off-by: Jan Paw Closes delta-io/delta#915 Co-authored-by: Jan Paw Signed-off-by: Shixiong Zhu GitOrigin-RevId: 5df9cc74ba738bbbe7eeb94b5e8a19b8a1a7df28 commit d093639f93ba110340303877e8df038fe2bda88d Author: Wenchen Fan Date: Thu Jan 27 07:40:54 2022 +0000 Update a test error message for Spark 3.3 support GitOrigin-RevId:
39f8495f21ac25e75042f18be1f6d61943ca000f commit eb2985f21ce2bbccfbc2c71b49a0a5eab7524fa2 Author: Yaohua Zhao Date: Thu Jan 27 06:54:49 2022 +0800 Minor refactor to DeltaErrors GitOrigin-RevId: bbaf6b8873af4ee08cabfac216f88cc7c3294708 commit 392b30562e982a565438bbbf93df3cf13a885f39 Author: stikkireddy Date: Wed Jan 26 12:16:41 2022 -0800 Update integration tests to run with all published Scala versions This PR adds a new flag --scala-version to `run-integration-tests.py`. You can run the following commands: ``` $ python run-integration-tests.py --version 1.1.0 --scala-version 2.13 --scala-only $ python run-integration-tests.py --version 1.1.0 --scala-version 2.12 --scala-only $ python run-integration-tests.py --version 1.1.0 --scala-only ``` It defaults to 2.12 when no Scala version is provided and fails if any version other than 2.12 or 2.13 is provided. The build.sbt in examples/scala/build.sbt is also updated to fetch the Scala version from the ENV var `SCALA_VERSION`. This is meant to resolve issue #846. Closes delta-io/delta#908 Signed-off-by: Scott Sandre GitOrigin-RevId: 24319d36a0d8c0524d99f3d360da670ea8db2fcb commit ff59810772e476bd7800c1c717c88a0f11fe9c7d Author: Allison Portis <89107911+allisonport-db@users.noreply.github.com> Date: Thu Jan 27 17:47:40 2022 -0800 update readme (#261) commit b15fe73a8e95a4696c676a982cf2ad343af62f36 Author: Denny Lee Date: Thu Jan 27 15:17:58 2022 -0800 Update contributing guide (#916) * Update contributing guide * minor fixes * fix formatting * fixed spelling commit 451a49492156da8048f93c7c02aa7fca2289dba4 Author: Shixiong Zhu Date: Wed Jan 26 12:03:39 2022 -0800 Fix crossScalaVersions for Scala 2.11 (#267) `crossScalaVersions` is set in a wrong place after upgrading SBT. For example, `build/sbt "++ 2.11.12 standalone/test"` still uses Scala 2.12. This PR fixes it.
commit 0edd30e434ff0ed1b5b78e2dfbaa101882432b50 Author: Shixiong Zhu Date: Wed Jan 26 10:31:25 2022 -0800 Access tables in golden-tables/ directly (#266) `data reader can read partition values` currently fails because SBT doesn't copy all of the files in the table to the correct place. I tried various SBT changes and noticed that it won't copy files whose depth is high. To work around the issue, I changed the code to read the tables in `golden-tables/` rather than asking SBT to copy files to `standalone/target/scala-2.12/test-classes/` and use files in this target folder. As some of the tests may need to modify files, I also added utility methods `withLogForWritableGoldenTable`, `withLogImplForWritableGoldenTable`, and `withWritableHiveGoldenTable` for these tests so that we can avoid modifying files in `golden-tables/` when running these tests. commit 8ae11aff025dd495dfc27ec303535229a1362092 Author: Allison Portis Date: Tue Jan 25 13:44:59 2022 -0800 Correctly close iterator when reading large log files in Delta Streaming Source GitOrigin-RevId: a31e0529bf169d4a1f1e85b9c14982a83a899767 commit 1dd2ea0355de3302fecdb53cb6b2a1bbf8d88deb Author: Wenchen Fan Date: Tue Jan 25 10:51:40 2022 +0000 Minor change to DeltaTableV2 GitOrigin-RevId: e0a7eb4f0b69a0eaa42d44d1535081d7e7c6f99a commit 4922e167362cfca82592de9ca13a7f6f43030344 Author: Venki Korukanti Date: Mon Jan 24 09:26:59 2022 -0800 Minor refactor to DeltaCommand GitOrigin-RevId: dc12a0d93d84b30ffa4601eeed60d9b610a0c98b commit 6f396301f74bea16ad27404a7c79e9aa698d6e2c Author: Kam Cheung Ting Date: Fri Jan 21 23:38:31 2022 -0800 Minor refactor to actions.scala GitOrigin-RevId: c5fc4bac6c3275606e556cba74fc9cc253b0b78f commit 71f26fbf85cf840aefc9552cf5260e2286bafb93 Author: Allison Portis Date: Fri Jan 21 15:18:33 2022 -0800 Issue board automation Move new & commented on issues to "Needs Review" Closes delta-io/delta#899 Signed-off-by: allisonport-db GitOrigin-RevId: e020dabead1e187a20e5d88b7fdb9c62a11f38ae commit
a86f43973a825f7a42ce52d48b49732dba1f1ad8 Author: Gengliang Wang Date: Fri Jan 21 19:12:31 2022 +0800 Minor refactor to ConvertToDeltaCommand GitOrigin-RevId: 4cb4a9b37e08094a8c1fcfe249e9c7def0f7d732 commit 5f6896ddc90426eb93902b8eb04c4ba059b46004 Author: Wenchen Fan Date: Fri Jan 21 01:48:33 2022 +0000 Minor refactor to test code GitOrigin-RevId: 361704dc1813dd13025e76ce123b466844095977 commit 834a3c9dfe2bbe5c8dcb32a06169afb893c27714 Author: Allison Portis Date: Thu Jan 20 12:05:19 2022 -0800 Remove sbt-coursier from plugins Resolves #800 Remove `sbt-coursier` from build. sbt 1.3.x includes `sbt-coursier`, so it's no longer needed after the version upgrade; the version shipped with sbt 1.5.5 resolves this issue. Verified by running `build/sbt package` Closes delta-io/delta#900 Signed-off-by: allisonport-db GitOrigin-RevId: 653b615f0debd261e1003079fa398b988c7a7e80 commit e11ff840f203d338a4c487c22d3e8754032820ec Author: Liwen Sun Date: Wed Jan 19 17:43:47 2022 -0800 Minor refactor to DeltaTable GitOrigin-RevId: 01a66191e1b7fe0e5c042a5a67f390e9a902c9eb commit 298abdb7b15484f753e340c883e8fba571ebcfb4 Author: Jackie Zhang Date: Wed Jan 19 19:56:07 2022 +1100 Minor refactor to ImplicitMetadataOperation GitOrigin-RevId: c2cc0f2e47775b4e3c9991cad02dbdd9f5f1fecf commit d4eedb75af8592c63ef02a446b5800b7fc5de23b Author: Meng Tong Date: Wed Jan 19 00:05:10 2022 -0800 Minor refactor to alterDeltaTableCommands GitOrigin-RevId: 063aaa806ba730282a64a7cf504134e613f38713 commit 6c0eb2c926716bc39de16a28ffdd227b1b0c805d Author: Meng Tong Date: Tue Jan 18 19:06:57 2022 -0800 Minor refactor to prepare work for identity column GitOrigin-RevId: d6d016d23e1b06e7fa101595055bd7249a29bfeb commit 1b013fff3ae50a9401ad491c6c063acaf95c87a6 Author: Allison Portis <89107911+allisonport-db@users.noreply.github.com> Date: Tue Jan 25 18:40:39 2022 -0800 Fix MiMa issue (#264) commit 33570c7bc452bb7f1bcabceca7aea9ae17b6b136 Author: pkubit-g <91378169+pkubit-g@users.noreply.github.com> Date: Tue Jan 25 16:52:18
2022 +0100 Flink Delta Sink - Logging (#254) * Add logging for committables and commit actions * Correct code review remarks * Correct code review remarks commit 8a3b270c6a65bb5678a9a4c74df2ca2e702b0aef Author: Allison Portis <89107911+allisonport-db@users.noreply.github.com> Date: Fri Jan 21 12:14:12 2022 -0800 Issue board automation (#262) commit 0d07d094ccd520c1adbe45dde4804c754c0a4baa Author: Prakhar Jain Date: Tue Jan 18 15:50:00 2022 -0800 Fix issues around empty file action transactions This PR fixes the empty file action transaction issue. Currently Delta uses SnapshotIsolation when a txn commits with no file actions. This is wrong: a txn with no FileActions can still have Metadata updates, and a metadata update shouldn't use SnapshotIsolation. GitOrigin-RevId: 10d4fbe5f6ff29b9a6aad995f0e4dbc4b30da135 commit f5cef197ffbf35024377d2ad1d7eb8232d122688 Author: Jackie Zhang Date: Sat Jan 15 14:44:26 2022 +1100 Addition to DeltaErrors.scala GitOrigin-RevId: 167dcfbf372c0d66a4a3312eab9654fdbd7d28ea commit 88a906d31268888226a9c735d4b4bb5b95a12359 Author: John ODwyer Date: Fri Jan 14 13:27:58 2022 -0800 Include edits made on the contributing page of the new website dennyglee and I made a couple of edits to the version of CONTRIBUTING.md on the new version of the website. This PR reflects those edits and it also adds delta-charter.pdf to the repo which is the correct place for the charter.
Closes delta-io/delta#853 Signed-off-by: Scott Sandre GitOrigin-RevId: a5f021e37ac0e013deaaa7210f553c4422a5f9ad commit ade04f191bbba0f5aa8c2daf0442bf9e7fcb39d6 Author: Wenchen Fan Date: Fri Jan 14 16:50:24 2022 +0000 Add support for sorted bucket with `BucketTransform` to DeltaCatalog.scala GitOrigin-RevId: 6eeb7c6f275a6e029778b6fdf3325e441cd06a15 commit 8ed75b93368f1e79f24c02e3d1aa41134c308de3 Author: pkubit-g <91378169+pkubit-g@users.noreply.github.com> Date: Fri Jan 21 15:48:33 2022 +0100 Flink Delta Sink - Update Commit Metadata (#260) * Update engine metadata for commit log * Add test commit 8348ee49f0a8b4504ec84de4d60448783ad29422 Author: Yann Byron Date: Wed Jan 12 08:27:37 2022 -0800 Fix error in adding a column after an array column When adding a column after a column whose dataType is `ArrayType` and whose element type is not `StructType`, an exception is raised as follows: `Cannot add col3 because its parent is not a StructType. Found ArrayType(MapType(StringType,StringType,true),true)`. The following statements reproduce this case: ``` create table s1 (id int) using delta; alter table s1 add columns (array_map_col array<map<string,string>>); alter table s1 add columns (col3 string after array_map_col); ``` After tracing the source code, I think adding the `col3` column after `array_map_col` is mistakenly treated as inserting it inside `array_map_col`, hence this PR. Closes delta-io/delta#864 Co-authored-by: biyan.by Signed-off-by: Venki Korukanti GitOrigin-RevId: a3f7157c69eabab04d6f6a5ca0402f74c3772138 commit 448d18d0b433a3f7b58e45a512bd98182017bfa0 Author: Maksym Dovhal Date: Tue Jan 11 12:33:33 2022 -0800 Scala API for restoring delta table Add possibility to restore delta table using version or timestamp.
Examples: ```scala io.delta.tables.DeltaTable.forPath("/some_delta_path").restoreToVersion(1) io.delta.tables.DeltaTable.forPath("/some_delta_path").restoreToTimestamp("2021-01-01 00:00:00.000") io.delta.tables.DeltaTable.forPath("/some_delta_path").restoreToTimestamp("2021-01-01") ``` Fixes https://github.com/delta-io/delta/issues/632 Signed-off-by: Maksym Dovhal Tested locally using spark-shell ```bash sbt package spark-shell --jars ./core/target/scala-2.12/delta-core_2.12-1.1.0-SNAPSHOT.jar --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" ``` ```scala spark.range(2).write.format("delta").mode("overwrite").save("/tmp/delta_restore_test") spark.range(2,3).withColumnRenamed("id", "id_new").write.option("mergeSchema", "true").format("delta").mode("overwrite").save("/tmp/delta_restore_test") io.delta.tables.DeltaTable.forPath("/tmp/delta_restore_test").restoreToVersion(0) io.delta.tables.DeltaTable.forPath("/tmp/delta_restore_test").restoreToTimestamp("2021-12-18 16:40:14.54") // At next day io.delta.tables.DeltaTable.forPath("/tmp/delta_restore_test").restoreToVersion(0) io.delta.tables.DeltaTable.forPath("/tmp/delta_restore_test").restoreToTimestamp("2021-12-19") io.delta.tables.DeltaTable.forPath("/tmp/delta_restore_test").history().show(false) ``` Output: ``` +-------+-----------------------+------+--------+---------+------------------------------------------------------+----+--------+---------+-----------+--------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+--------------------------------------------+ |version|timestamp |userId|userName|operation|operationParameters |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics |userMetadata|engineInfo | 
+-------+-----------------------+------+--------+---------+------------------------------------------------------+----+--------+---------+-----------+--------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+--------------------------------------------+ |5 |2021-12-19 09:43:41.604|null |null |RESTORE |{version -> null, timestamp -> 2021-12-19} |null|null |null |4 |Serializable |false |{numRestoredFiles -> 2, removedFilesSize -> 1252, numRemovedFiles -> 3, restoredFilesSize -> 794, numOfFilesAfterRestore -> 2, tableSizeAfterRestore -> 794} |null |Apache-Spark/3.2.0 Delta-Lake/1.1.0-SNAPSHOT| |4 |2021-12-19 09:43:17.415|null |null |RESTORE |{version -> 0, timestamp -> null} |null|null |null |3 |Serializable |false |{numRestoredFiles -> 3, removedFilesSize -> 794, numRemovedFiles -> 2, restoredFilesSize -> 1252, numOfFilesAfterRestore -> 3, tableSizeAfterRestore -> 1252}|null |Apache-Spark/3.2.0 Delta-Lake/1.1.0-SNAPSHOT| |3 |2021-12-18 16:42:14.083|null |null |RESTORE |{version -> null, timestamp -> 2021-12-18 16:40:14.54}|null|null |null |2 |Serializable |false |{numRestoredFiles -> 2, removedFilesSize -> 1252, numRemovedFiles -> 3, restoredFilesSize -> 794, numOfFilesAfterRestore -> 2, tableSizeAfterRestore -> 794} |null |Apache-Spark/3.2.0 Delta-Lake/1.1.0-SNAPSHOT| |2 |2021-12-18 16:40:53.861|null |null |RESTORE |{version -> 0, timestamp -> null} |null|null |null |1 |Serializable |false |{numRestoredFiles -> 3, removedFilesSize -> 794, numRemovedFiles -> 2, restoredFilesSize -> 1252, numOfFilesAfterRestore -> 3, tableSizeAfterRestore -> 1252}|null |Apache-Spark/3.2.0 Delta-Lake/1.1.0-SNAPSHOT| |1 |2021-12-18 16:40:14.54 |null |null |WRITE |{mode -> Overwrite, partitionBy -> []} |null|null |null |0 |Serializable |false |{numFiles -> 2, numOutputRows -> 1, numOutputBytes -> 794} |null |Apache-Spark/3.2.0 
Delta-Lake/1.1.0-SNAPSHOT| |0 |2021-12-18 16:40:08.045|null |null |WRITE |{mode -> Overwrite, partitionBy -> []} |null|null |null |null |Serializable |false |{numFiles -> 3, numOutputRows -> 2, numOutputBytes -> 1252} |null |Apache-Spark/3.2.0 Delta-Lake/1.1.0-SNAPSHOT| +-------+-----------------------+------+--------+---------+------------------------------------------------------+----+--------+---------+-----------+--------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+--------------------------------------------+ ``` Examples of transactions: /tmp/delta_restore_test/_delta_log/00000000000000000000.json ```json {"protocol":{"minReaderVersion":1,"minWriterVersion":2}} {"metaData":{"id":"b090f082-f927-4372-9537-9623ae280ad8","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1639838404635}} {"add":{"path":"part-00000-cb74dd35-ae80-4b3a-b97c-ea492e11ddc3-c000.snappy.parquet","partitionValues":{},"size":296,"modificationTime":1639838406641,"dataChange":true}} {"add":{"path":"part-00005-beac50f7-dbe7-40b7-9ce2-5e0e6a1607ad-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1639838406642,"dataChange":true}} {"add":{"path":"part-00011-7e35258f-a724-43f3-8622-c7efa51f01a6-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1639838406642,"dataChange":true}} {"commitInfo":{"timestamp":1639838407868,"operation":"WRITE","operationParameters":{"mode":"Overwrite","partitionBy":"[]"},"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numFiles":"3","numOutputRows":"2","numOutputBytes":"1252"},"engineInfo":"Apache-Spark/3.2.0 Delta-Lake/1.1.0-SNAPSHOT"}} ``` 
/tmp/delta_restore_test/_delta_log/00000000000000000001.json ```json {"metaData":{"id":"b090f082-f927-4372-9537-9623ae280ad8","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"id_new\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1639838404635}} {"add":{"path":"part-00000-b71c3566-429b-4038-8127-6e480656038c-c000.snappy.parquet","partitionValues":{},"size":304,"modificationTime":1639838414041,"dataChange":true}} {"add":{"path":"part-00011-7a5341e6-4876-467a-b33d-56d8dc0bf243-c000.snappy.parquet","partitionValues":{},"size":490,"modificationTime":1639838414041,"dataChange":true}} {"remove":{"path":"part-00000-cb74dd35-ae80-4b3a-b97c-ea492e11ddc3-c000.snappy.parquet","deletionTimestamp":1639838414511,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":296}} {"remove":{"path":"part-00011-7e35258f-a724-43f3-8622-c7efa51f01a6-c000.snappy.parquet","deletionTimestamp":1639838414511,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":478}} {"remove":{"path":"part-00005-beac50f7-dbe7-40b7-9ce2-5e0e6a1607ad-c000.snappy.parquet","deletionTimestamp":1639838414511,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":478}} {"commitInfo":{"timestamp":1639838414511,"operation":"WRITE","operationParameters":{"mode":"Overwrite","partitionBy":"[]"},"readVersion":0,"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numFiles":"2","numOutputRows":"1","numOutputBytes":"794"},"engineInfo":"Apache-Spark/3.2.0 Delta-Lake/1.1.0-SNAPSHOT"}} ``` /tmp/delta_restore_test/_delta_log/00000000000000000002.json ```json 
{"commitInfo":{"timestamp":1639838436332,"operation":"RESTORE","operationParameters":{"version":0,"timestamp":null},"readVersion":1,"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numRestoredFiles":"3","removedFilesSize":"794","numRemovedFiles":"2","restoredFilesSize":"1252","numOfFilesAfterRestore":"3","tableSizeAfterRestore":"1252"},"engineInfo":"Apache-Spark/3.2.0 Delta-Lake/1.1.0-SNAPSHOT"}} {"metaData":{"id":"b090f082-f927-4372-9537-9623ae280ad8","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1639838404635}} {"add":{"path":"part-00005-beac50f7-dbe7-40b7-9ce2-5e0e6a1607ad-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1639838406642,"dataChange":true}} {"add":{"path":"part-00011-7e35258f-a724-43f3-8622-c7efa51f01a6-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1639838406642,"dataChange":true}} {"add":{"path":"part-00000-cb74dd35-ae80-4b3a-b97c-ea492e11ddc3-c000.snappy.parquet","partitionValues":{},"size":296,"modificationTime":1639838406641,"dataChange":true}} {"remove":{"path":"part-00011-7a5341e6-4876-467a-b33d-56d8dc0bf243-c000.snappy.parquet","deletionTimestamp":1639838435578,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":490}} {"remove":{"path":"part-00000-b71c3566-429b-4038-8127-6e480656038c-c000.snappy.parquet","deletionTimestamp":1639838435586,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":304}} ``` /tmp/delta_restore_test/_delta_log/00000000000000000003.json ```json {"commitInfo":{"timestamp":1639838517073,"operation":"RESTORE","operationParameters":{"version":null,"timestamp":"2021-12-18 
16:40:14.54"},"readVersion":2,"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numRestoredFiles":"2","removedFilesSize":"1252","numRemovedFiles":"3","restoredFilesSize":"794","numOfFilesAfterRestore":"2","tableSizeAfterRestore":"794"},"engineInfo":"Apache-Spark/3.2.0 Delta-Lake/1.1.0-SNAPSHOT"}} {"metaData":{"id":"b090f082-f927-4372-9537-9623ae280ad8","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"id_new\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1639838404635}} {"add":{"path":"part-00011-7a5341e6-4876-467a-b33d-56d8dc0bf243-c000.snappy.parquet","partitionValues":{},"size":490,"modificationTime":1639838414041,"dataChange":true}} {"add":{"path":"part-00000-b71c3566-429b-4038-8127-6e480656038c-c000.snappy.parquet","partitionValues":{},"size":304,"modificationTime":1639838414041,"dataChange":true}} {"remove":{"path":"part-00005-beac50f7-dbe7-40b7-9ce2-5e0e6a1607ad-c000.snappy.parquet","deletionTimestamp":1639838516199,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":478}} {"remove":{"path":"part-00011-7e35258f-a724-43f3-8622-c7efa51f01a6-c000.snappy.parquet","deletionTimestamp":1639838516199,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":478}} {"remove":{"path":"part-00000-cb74dd35-ae80-4b3a-b97c-ea492e11ddc3-c000.snappy.parquet","deletionTimestamp":1639838516202,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":296}} ``` /tmp/delta_restore_test/_delta_log/00000000000000000004.json ```json 
{"commitInfo":{"timestamp":1639899780668,"operation":"RESTORE","operationParameters":{"version":0,"timestamp":null},"readVersion":3,"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numRestoredFiles":"3","removedFilesSize":"794","numRemovedFiles":"2","restoredFilesSize":"1252","numOfFilesAfterRestore":"3","tableSizeAfterRestore":"1252"},"engineInfo":"Apache-Spark/3.2.0 Delta-Lake/1.1.0-SNAPSHOT"}} {"metaData":{"id":"b090f082-f927-4372-9537-9623ae280ad8","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1639838404635}} {"add":{"path":"part-00005-beac50f7-dbe7-40b7-9ce2-5e0e6a1607ad-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1639838406642,"dataChange":true}} {"add":{"path":"part-00011-7e35258f-a724-43f3-8622-c7efa51f01a6-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1639838406642,"dataChange":true}} {"add":{"path":"part-00000-cb74dd35-ae80-4b3a-b97c-ea492e11ddc3-c000.snappy.parquet","partitionValues":{},"size":296,"modificationTime":1639838406641,"dataChange":true}} {"remove":{"path":"part-00011-7a5341e6-4876-467a-b33d-56d8dc0bf243-c000.snappy.parquet","deletionTimestamp":1639899779981,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":490}} {"remove":{"path":"part-00000-b71c3566-429b-4038-8127-6e480656038c-c000.snappy.parquet","deletionTimestamp":1639899779981,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":304}} ``` /tmp/delta_restore_test/_delta_log/00000000000000000005.json ```json 
{"commitInfo":{"timestamp":1639899805962,"operation":"RESTORE","operationParameters":{"version":null,"timestamp":"2021-12-19"},"readVersion":4,"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numRestoredFiles":"2","removedFilesSize":"1252","numRemovedFiles":"3","restoredFilesSize":"794","numOfFilesAfterRestore":"2","tableSizeAfterRestore":"794"},"engineInfo":"Apache-Spark/3.2.0 Delta-Lake/1.1.0-SNAPSHOT"}} {"metaData":{"id":"b090f082-f927-4372-9537-9623ae280ad8","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"id_new\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1639838404635}} {"add":{"path":"part-00011-7a5341e6-4876-467a-b33d-56d8dc0bf243-c000.snappy.parquet","partitionValues":{},"size":490,"modificationTime":1639838414041,"dataChange":true}} {"add":{"path":"part-00000-b71c3566-429b-4038-8127-6e480656038c-c000.snappy.parquet","partitionValues":{},"size":304,"modificationTime":1639838414041,"dataChange":true}} {"remove":{"path":"part-00005-beac50f7-dbe7-40b7-9ce2-5e0e6a1607ad-c000.snappy.parquet","deletionTimestamp":1639899805448,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":478}} {"remove":{"path":"part-00011-7e35258f-a724-43f3-8622-c7efa51f01a6-c000.snappy.parquet","deletionTimestamp":1639899805445,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":478}} {"remove":{"path":"part-00000-cb74dd35-ae80-4b3a-b97c-ea492e11ddc3-c000.snappy.parquet","deletionTimestamp":1639899805444,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":296}} ``` Closes delta-io/delta#863 Signed-off-by: Scott Sandre GitOrigin-RevId: 3f1c0e77b403f49f9460baff13174dd4f88da47d commit a266abdc7d9f1e4b4b7d23999076abbd4789d8fc Author: Sabir Akhadov Date: Tue Jan 11 10:25:59 2022 +0100 copyWithTag should not 
set extendedMetadata. `extendedMetadata` should only be set to true if all 3 nullable fields are non-null: `tags`, `partitionValues` and `size`. It was wrongly set to true in all cases. The helper method in question was only used in `OPTIMIZE` operation where the fields are guaranteed to be non-null. Uses existing tests. GitOrigin-RevId: eddd30c9e8006778712ac3e0aa039463a4a6fd50 commit 0c88b45303c83454954bd6eac39d6140960afac0 Author: Yijia Cui Date: Mon Jan 10 23:53:18 2022 -0800 Fix AddCDCFile to preserve partitionValues with null `AddCDCFile.partitionValues` is missing `@JsonInclude(JsonInclude.Include.ALWAYS)`, hence when we update a partition column to null, the field in file action `map("partitionCol1" -> "foo", "partitionCol2" -> null)` will be serialized as `{"partitionCol1": "foo"}`. This will cause `NoSuchElementException` when reading cdc history to get `partitionCol2`. This PR adds `@JsonInclude(JsonInclude.Include.ALWAYS)` to `AddCDCFile.partitionValues` to fix the serialization issues. However, as old versions may still output incorrect `partitionValues` jsons, we also update `TahoeFileIndex` to handle the missing partition column case. Tested with unit tests. GitOrigin-RevId: 87ef34a01d5812feddbe2daee2528e27bb2b3096 commit 413201163f749cdd658dcffb22789bf70963235e Author: Prakhar Jain Date: Mon Jan 10 17:16:10 2022 -0800 Add unique identifier to Delta commits to avoid duplicate commits This PR fixes the duplicate commit issue in Delta by making sure that every commit is different from the other. This is done by adding the txnId in the CommitInfo. TxnId is an existing unique uuid for each commit. Added UT. 
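A minimal sketch of the idea in the commit above: two commits whose `commitInfo` is otherwise identical become distinguishable once a unique transaction id is included. The `commit_info` helper is hypothetical, and the JSON field name `txnId` is an assumption taken from the commit message, not a confirmed schema.

```python
import json
import uuid


def commit_info(operation, timestamp_ms, txn_id=None):
    """Serialize a simplified commitInfo action, optionally with a transaction id."""
    info = {"timestamp": timestamp_ms, "operation": operation}
    if txn_id is not None:
        info["txnId"] = txn_id  # assumed field name
    return json.dumps({"commitInfo": info}, sort_keys=True)


# Without an id, two commits with the same content serialize identically.
without_id_a = commit_info("WRITE", 1639838407868)
without_id_b = commit_info("WRITE", 1639838407868)

# With a fresh UUID per transaction, every commit differs from the others.
with_id_a = commit_info("WRITE", 1639838407868, txn_id=str(uuid.uuid4()))
with_id_b = commit_info("WRITE", 1639838407868, txn_id=str(uuid.uuid4()))

print(without_id_a == without_id_b)  # True
print(with_id_a == with_id_b)        # False
```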
GitOrigin-RevId: ab62afd597ccacd2b49e39a4d36cc085cf60fd54 commit aa5ec3fdfc028c9150636ff4f28a3211637f9138 Author: Hussein Nagree Date: Mon Jan 10 11:35:44 2022 -0800 Add `numPartitionColumnsInTable` to OptimisticTransaction.scala and DeltaCommand.scala GitOrigin-RevId: 68844006b24f8e39030c99374c1d1e67c8dce340 commit 3c05fad5dc9d91e60cdce1ddd5d018c239d4af6d Author: Hyukjin Kwon Date: Mon Jan 10 14:15:17 2022 +0000 Update test_deltatable.py to comply with PEP 8 rule E722 - [E722](https://www.flake8rules.com/rules/E722.html): Do not use bare except, specify exception instead I manually tested it Python linter. GitOrigin-RevId: 0112cdfd69214e37d55b496aef890f99175342da commit 47e393a6f5a33af952d920ebf28ed22adc8001fc Author: Olivier NOUGUIER Date: Fri Jan 14 06:21:40 2022 +0100 :arrow_up: sbt 1.6.1 (was 0.13.x) (#253) commit d6eae3967d3084ca080b63a5598b745efd20771d Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Wed Jan 12 16:40:56 2022 -0800 done (#255) commit 6cfe05073ed23fcf3165408ac012abc480b7642d Author: Shixiong Zhu Date: Fri Jan 7 15:32:07 2022 -0800 Add udf template for Delta to avoid calling ExpressionEncoder() after warming up There are some places that always create an encoder by calling `ExpressionEncoder()` rather than `encoder.copy()`. `ExpressionEncoder()` needs to hit Scala Reflection to generate the SQL schema from Scala classes and it's a performance killer when the parallelism is high. This PR attempts to fix all of these places by creating UDFs used by Delta internally from a template rather than calling `functions.udf` which always calls `ExpressionEncoder()`. 
GitOrigin-RevId: 369f3e97bfc45362bf38e19c844d7ed6ebf0c2ee commit 95cad7f18923e4ac1307dbfbf39fc1b7679a0cde Author: Gengliang Wang Date: Fri Jan 7 05:10:59 2022 +0000 Minor refactoring to DeltaTable Author: Gengliang Wang Author: Gengliang Wang Signed-off-by: Ubuntu GitOrigin-RevId: 30b8f125d398f6a7f595f1e9493d001883baffe0 commit c6ffd6337c0d10583cce110d60d312e30e249c3e Author: sabir-akhadov Date: Thu Jan 6 17:00:49 2022 +0100 Minor refactoring to actions.scala GitOrigin-RevId: 9c89fe359d6c00aca9d1d901b62d89ee2ebd4911 commit 6622f1c98eef03e8fb75d08c6282fb6e0df1c710 Author: Lars Kroll Date: Thu Jan 6 09:25:12 2022 +0100 Import order in MergeIntoCommand GitOrigin-RevId: 0ab0ec05c0386f091dc59f17d989cebd0cd119ce commit e322192574a2e1bc33001a21f9e43808591c2cac Author: Junyong Lee Date: Wed Jan 5 20:42:50 2022 -0800 Change the interface of loadAction to reduce unnecessary conversion In state reconstruction of DeltaLog, we have a state in Dataset[SingleAction] converting it to Dataframe to massage some columns and sort, then we re-convert back this to Dataset[SingleAction]. This conversion is not necessary and may have small performance impact, so we improve the interface of loadActions to be just DataFrame. This will give following benefits: a) Minor performance improvement: conversion from DF -> DS[SingleAction] -> DF is not needed anymore. b) Since we return DF, the interface becomes more flexible if any future optimization needs to be applied for sorting and partition purpose. All existing unit tests (it is a common code path that uses state reconstruction for almost all deltalog operations) GitOrigin-RevId: 1e1fa5cc9d42ae294c90a0861278619a2a1fdcc3 commit 7c1b9d6b58a7d9c403289f7dceecc13a7e373f15 Author: Hussein Nagree Date: Wed Jan 5 12:15:23 2022 -0800 Add more metrics to Delta log commit operation Add in several more stats to the delta log commit operation. 
Updated unit tests GitOrigin-RevId: bca091828aada214f3b8814aefe43e6b90bd9657 commit 7146173141e9b92733e9c02b7eab26c5c6df989e Author: Shixiong Zhu Date: Wed Jan 5 10:46:49 2022 -0800 Recreate DeltaLog objects when the original SparkContext is stopped This PR keeps a reference to `SparkContext` in `DeltaLog` and uses it to detect whether the `SparkContext` used to create `DeltaLog` is stopped. If it's stopped, we should remove the invalid cached `DeltaLog` and create a new one. Closes #629 Closes delta-io/delta#881 Signed-off-by: Shixiong Zhu GitOrigin-RevId: f11ce9982cee9930ea57acb803de1d32a36d5fce commit d3bc12503f01e54628477596cd5a201100b29793 Author: allisonport-db Date: Tue Jan 4 11:13:45 2022 -0800 Project board automation for new & updated PR's Tested in connectors Closes delta-io/delta#871 Signed-off-by: allisonport-db GitOrigin-RevId: d4576aaf714fc3f1eed38c4bc92f4b902cc07db9 commit 2421ac92709494363567b559a9035362c9733500 Author: Junyong Lee Date: Tue Dec 21 11:05:25 2021 -0800 Add an internal flag for checkpoint to throw exceptions It's important for our existing test suite to catch any checkpoint-related errors. We introduced a flag so that certain tests (those not controlled by the original utils.IsTesting flag) can throw exceptions on checkpoint failures. Tested with unit tests and an integration test that catches a buggy case GitOrigin-RevId: 0f2113d219f837bedf4b44e525fd2e8024ba3408 commit 9a86ca79a9063cffa9b361bc9ef423162c3de11e Author: Fabio Badalì Date: Tue Dec 21 13:20:20 2021 -0500 Optimize search through output attributes of the target table in merge operation This PR addresses the resolution of certain issues experienced during the merge operation for a table with a large number of columns (similar issues have been described by another user: https://github.com/delta-io/delta/issues/479). It might be possible to further improve this solution, in both execution time and memory consumption, in future PRs. 
* This PR introduces a map to perform search through output attributes of the target table in order to reduce the time complexity. The PR was tested with an additional test case (not included in the PR because of its intrinsic slowness) which generates a table with a lot of columns and tries to perform a merge on it: ``` test("updateAll and insertAll with a huge number of columns") { withTempPath { targetDir => val targetPath = targetDir.getCanonicalPath val columns = col("key") +: (1 to 1500).map(c => col("value") as s"value_${c}") val df = Seq((1, 10)).toDF("key", "value").select(columns: _*) df.write.format("delta").save(targetPath) val t = io.delta.tables.DeltaTable.forPath(spark, targetPath).as("t") val source = Seq((1, 11)).toDF("key", "value").select(columns: _*) t.as("t") .merge(source.as("s"), "t.key = s.key") .whenMatched() .updateAll() .whenNotMatched() .insertAll() .execute() checkAnswer( readDeltaTable(targetPath), source.collect().toSeq ) } } ``` Moreover, the PR was also tested in a real case scenario, preventing effectively the merge operation from "hanging". Signed-off-by: Fabio Badali Closes delta-io/delta#584 Signed-off-by: Scott Sandre GitOrigin-RevId: 39e9557c7b7741b03122db5d75e42f8f1245ecbe commit c3eec05d80d116daefc99d5307e17c007cf2f0db Author: Wenchen Fan Date: Tue Dec 21 09:14:53 2021 +0000 Minor refactoring to DeltaTableBuilder GitOrigin-RevId: 0ccd58a2730358c94c03c07b8c932d18fe347c0e commit 00a3abeec157130824d6a77e93c1213d746d5e04 Author: Shixiong Zhu Date: Fri Dec 17 16:53:09 2021 -0800 Make Snapshot creation resilient to corrupt checkpoint in OSS Delta When a checkpoint is corrupted, OSS Delta will fail to create the Snapshot. This PR improves the Snapshot loading to make it resilient to corrupt checkpoint: when it fails to read a checkpoint, it will try to search an alternative checkpoint and use it to construct the Snapshot. We will retry at most two times by default (attempt to create Snapshot at most three times). 
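The fallback behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Delta's code: `CorruptCheckpointError`, `load_snapshot`, and the `read_checkpoint` callback are hypothetical names, and the default of two retries mirrors the "at most three attempts" wording of the commit message.

```python
class CorruptCheckpointError(Exception):
    """Raised by the (hypothetical) reader when a checkpoint cannot be read."""


def load_snapshot(checkpoint_versions, read_checkpoint, max_retries=2):
    """Try the newest checkpoint first, falling back to older ones on corruption.

    Makes at most max_retries + 1 attempts and re-raises the last error
    if every attempted checkpoint is corrupt.
    """
    candidates = sorted(checkpoint_versions, reverse=True)
    last_error = None
    for version in candidates[: max_retries + 1]:
        try:
            return version, read_checkpoint(version)
        except CorruptCheckpointError as err:
            last_error = err  # remember the failure and try an older checkpoint
    if last_error is None:
        raise ValueError("no checkpoints available")
    raise last_error


def fake_reader(version):
    # Pretend the two newest checkpoints are corrupt.
    if version in (20, 30):
        raise CorruptCheckpointError(f"checkpoint {version} is corrupt")
    return f"state@{version}"


result = load_snapshot([10, 20, 30], fake_reader)
print(result)  # (10, 'state@10')
```

Because the snapshot is built from whichever checkpoint finally loads, the returned object never changes afterwards, which avoids the mutability concern raised for the alternative design.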
Alternatively, we could make `Snapshot.logSegment` mutable and re-create it when failing to load a checkpoint file. But it's risky as `Snapshot` would become mutable. E.g., if a caller calls `Snapshot.logSegment` first time before touching the corrupt checkpoint file, then after touching the checkpoint file, it calls `Snapshot.logSegment` again, and two calls now will return different `LogSegment`s. Adds new unit tests. GitOrigin-RevId: 1013c7831360a02e70e3713ee0ff664869b29188 commit 1369ea04c0ed6d4a30310a599e34551baa9893fb Author: lzlfred Date: Fri Dec 17 10:06:34 2021 -0800 Replace checkpoint metric `checkpointSize` with `checkpointRowCount` GitOrigin-RevId: 5aec20c44e89a0856eaaa53518aac80d87e483a7 commit b98a294cb243417a6f207e70bd977590cecc9415 Author: Wenchen Fan Date: Thu Dec 16 11:32:36 2021 +0000 Handle qualified `location` property for v2 commands GitOrigin-RevId: b9af01887165c13a10bd4184e08805e138ddae9d commit e324b679263eb10de2245ffa1d8d8c51d4735ac4 Author: Yijia Cui Date: Wed Dec 15 10:17:45 2021 -0800 Minor refactoring to deltaMerge GitOrigin-RevId: 7937f792cc2b4bd98a5a822319e6a923734d6a5c commit 6a3f759fc01162bcdcca8522e3672999e8c10af3 Author: Allison Portis <89107911+allisonport-db@users.noreply.github.com> Date: Thu Jan 6 17:37:03 2022 -0800 Artifact verification for release process (#247) commit 17a10489b969dce2d2bf837bca985bac4987e512 Author: Allison Portis <89107911+allisonport-db@users.noreply.github.com> Date: Thu Jan 6 11:43:08 2022 -0800 Regenerate docs for 0.3.0 release (#246) commit 94639203a1e5ce6d0f42b17c08cf04927c071a18 Author: pkubit-g <91378169+pkubit-g@users.noreply.github.com> Date: Sat Dec 25 00:53:14 2021 +0100 Flink Delta Sink - PR 8 (#244) * Add complementary modifications to finalize development of Flink DeltaSink * Modify build.sbt to mark delta-standalone as provided commit f5d2b57f31cce6786aae85d99b4325ed61b0fab5 Author: Allison Portis <89107911+allisonport-db@users.noreply.github.com> Date: Tue Dec 21 14:31:59 2021 -0800 Fail 
when committing metadata with incorrect protocol update (#242) commit 339cf7ebc5c8c3830bed4bb2b910487c1d30aeda Author: Allison Portis <89107911+allisonport-db@users.noreply.github.com> Date: Tue Dec 21 06:58:49 2021 -0800 Upgrade sbt-pgp plugin to support `skip in publish` (#245) commit 62cee06ec5f9915d248f0e83cfaf337f372b6b48 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Mon Dec 20 16:53:59 2021 -0500 DeltaLog::getChanges javadoc clarification (#243) * done updating java doc comment and updating test * fix import statement order commit caa7a39c306d28d9a2fd7fea636bb6a8d7193625 Author: allisonport-db Date: Tue Dec 14 10:04:02 2021 -0800 [DELTA-OSS-EXTERNAL] Update Github actions to use default bash shell `bash -l {0}` only reports errors for the last command executed. We return to the [default](https://docs.github.com/en/actions/learn-github-actions/workflow-syntax-for-github-actions#exit-codes-and-error-action-preference), which is `bash -e`. Closes delta-io/delta#836 Signed-off-by: allisonport-db GitOrigin-RevId: f2d9ebdfd558d4981ef89c4419a30cad056eab71 commit a5cbeb16e7cc1bef5d7371c88e391a681b313bfd Author: Meng Tong Date: Mon Dec 13 09:01:09 2021 -0800 [SC-86821][Delta] Minor refactoring GitOrigin-RevId: 6a1edec37b56e0db7cf581d2978725756d1db818 commit 33502e647bbb877252e12dcdcf2dd90c34c7cb1d Author: Kam Cheung Ting Date: Fri Dec 10 01:18:33 2021 -0800 [SC-90462] [Delta] Additional tags GitOrigin-RevId: e3fd45156d7971b703bcb43f7c975d2cab699abc commit 7ed203257ef0a800c6c3a45fe224ec94c778bb72 Author: Christian Williams Date: Wed Dec 8 22:47:17 2021 -0800 [DELTA-OSS-EXTERNAL] Add PROTOCOL.md mention that schemaString is required in first Delta log entry The existing protocol doc mentions that a metadata action is required in the first version of a table, but it does not specify required or optional fields for the metadata action. I believe _schemaString_ should be required at a minimum in the v1 delta log entry. 
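As a rough illustration of the requirement argued for here, the sketch below checks whether the first commit of a table carries a `metaData` action with a non-empty `schemaString`. The helper `first_commit_has_schema` and both sample commits are hypothetical and heavily shortened; only the `metaData` and `schemaString` key names follow the log examples elsewhere in this history.

```python
import json


def first_commit_has_schema(commit_text):
    """Return True if the commit contains a metaData action with a schemaString."""
    for line in commit_text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        meta = action.get("metaData")
        if meta is not None:
            return bool(meta.get("schemaString"))
    return False  # no metaData action at all


# Shortened, made-up first commits: one with a schema, one without.
with_schema = "\n".join([
    json.dumps({"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}),
    json.dumps({"metaData": {"id": "example-id", "schemaString": '{"type":"struct","fields":[]}'}}),
])
without_schema = "\n".join([
    json.dumps({"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}),
    json.dumps({"metaData": {"id": "example-id"}}),
])

print(first_commit_has_schema(with_schema))     # True
print(first_commit_has_schema(without_schema))  # False
```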
This PR adds a short and sweet mention in the first paragraph of the `### Change Metadata` section to communicate that `schemaString` is required in the first `metadata` entry and transitively, the first log entry -- piggy backing on the existing mention that a `metadata` action [is required](https://github.com/delta-io/delta/blob/4277443703c5ab59a567c1e80189bbcdb7495817/PROTOCOL.md#change-metadata) in the first delta table version. Ideally, additional optional/required/required-if specifications for `metadata` properties should be provided in the protocol doc, because as it stands, an empty `metadata` object (i.e. `metadata: {}`) is permitted by the protocol doc which is not helpful to any human or computer. With this PR, I am interested in solving for absent schema in the first table version. Unfortunately, [some projects](https://github.com/delta-io/connectors/blob/master/golden-tables/src/test/resources/golden/canonicalized-paths-normal-a/_delta_log/00000000000000000000.json) are already leveraging the under-specification. A table with no schema metadata is not useable from a reader perspective, so should not be possible to create from a writer perspective IMHO. Hopefully, this PR should discourage folks following the protocol from creating scenarios that do not include metadata schema in the first log entry. I would like to see optional/required/required-if specifications added to the `metadata` section for all properties for v1 + n, but I cannot determine which properties fit which category other than that v1 must include `schemaString` at this time. --- UPDATE After realizing the full implications of the existing protocol statement that: > Subsequent metaData actions completely overwrite the current metadata of the table. and also grokking the required/optional mentions by @houqp - I have changed the PR content significantly. 
Instead of a "short and sweet mention in the first paragraph" - I have gone all in on an optional/required column in the `metadata` field table + a mention that required properties must be included in every `metadata` action. Closes delta-io/delta#610 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 81f826c53d42b531f9869de3825c31de4b8f5640 commit b6271dd701256efb7320bfaa7a880fa92be7d766 Author: Prakhar Jain Date: Wed Dec 8 13:08:33 2021 -0800 [SC-87761][DELTA] Track txn Id in commit stats GitOrigin-RevId: 8372b79e473ded71ab04a53271fa377d87dba380 commit e5a25176e3eea0b3d5b093280f1ffd8b5c0ecae9 Author: Yaohua Zhao Date: Tue Dec 7 15:38:14 2021 +0800 [SC-89743][Delta] Minor improvement to Delta error messages GitOrigin-RevId: 01479b276594c22dfdb6f4d9f0b66195c10cbb83 commit 3935fd80ddae25603f3bf4a7b66d01430a1831ad Author: Scott Sandre Date: Mon Dec 6 10:08:15 2021 -0800 [DELTA-OSS-EXTERNAL] Upgrade version to 1.2.0-SNAPSHOT Closes delta-io/delta#852 Signed-off-by: Scott Sandre #31800 is resolved by databricks/master. GitOrigin-RevId: 443c19dfd680904d8bbfbd2f993ae82ca6cd6f5d commit 7e88e6b75f16709a3380a7f304dda61e0254f097 Author: Hussein Nagree Date: Thu Dec 2 15:02:01 2021 -0800 [SC-89410] Capture non-fatal exceptions while writing checkpoints We discovered that we might throw an exception to the user when a checkpoint fails during a write command. However, this is undesirable, since the write has succeeded from the user's perspective, even if the checkpoint failed. 
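The intended behavior can be sketched as below: the commit result is returned even when the follow-up checkpoint fails, and the failure is only logged. This is a loose Python analogue, not the Scala change itself; `commit_then_checkpoint` and both callbacks are hypothetical. Catching `Exception` (rather than `BaseException`) is used here as a rough stand-in for "non-fatal", so `KeyboardInterrupt` and `SystemExit` still propagate.

```python
import logging

logger = logging.getLogger("delta_sketch")


def commit_then_checkpoint(write_commit, write_checkpoint):
    """Commit first, then checkpoint; a failed checkpoint must not fail the write."""
    version = write_commit()
    try:
        write_checkpoint(version)
    except Exception as err:  # does not catch KeyboardInterrupt / SystemExit
        logger.warning("checkpoint for version %s failed: %r", version, err)
    return version


def failing_checkpoint(version):
    # Simulates a file-system error while writing the checkpoint.
    raise OSError("disk full")


committed_version = commit_then_checkpoint(lambda: 7, failing_checkpoint)
print(committed_version)  # 7
```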
Capture non-fatal exception types (don't try to capture fatal exceptions like JVM failures though), and log when this occurs Updated existing unit tests that captured exceptions, and added a new one to test for different types of exceptions (like fs exceptions) GitOrigin-RevId: 4911f90d5d46bac44defe561bf32ce817f50656c commit 2a367f4dc1dab2e7287fa211d4bbea319ef398a6 Author: jackierwzhang Date: Thu Dec 2 02:01:07 2021 +0000 [SC-86212] Add name mode suites under column mapping Added name mode suites and bring back some tests blocked by CONVERT TO DELTA by doing a CONVERT followed by an upgrade to name mode. GitOrigin-RevId: b9106b5705806d1bbec37df5000652c0d1a8f57e commit 41fe03b0c628aad749dd04cb60b3fa911cbf5935 Author: pkubit-g <91378169+pkubit-g@users.noreply.github.com> Date: Sat Dec 18 00:47:02 2021 +0100 Flink Delta Sink - PR 7 (#239) * Rename packages for flink-connector && mark its dependencies as provided in build.sbt * Create internal package and move all internal classes there commit 5088dab44d23bfff791b443868c5f8a1e94a5a78 Author: Shixiong Zhu Date: Thu Dec 16 21:58:20 2021 -0800 Clean up public APIs (#237) Fix some minor issues in public APIs commit 8d7479545fdf5fd13c39bf90df15a878fe834fd6 Author: Allison Portis <89107911+allisonport-db@users.noreply.github.com> Date: Thu Dec 16 15:16:50 2021 -0800 Fix: release process only publishes locally (#241) commit a046108b3dd1c80c6287c5f3ea8eaf7f67a1954d Author: Allison Portis <89107911+allisonport-db@users.noreply.github.com> Date: Thu Dec 16 12:41:44 2021 -0800 Update build.sbt (#240) commit ed24ceaed436b0a54e574533f64956de0eea540e Author: pkubit-g <91378169+pkubit-g@users.noreply.github.com> Date: Thu Dec 16 15:38:57 2021 +0100 Flink Delta Sink - PR 6 (#233) * Add support for partitioned tables commit 809b30f14b39141b2e13f0a0bc26af179773aa01 Author: Allison Portis <89107911+allisonport-db@users.noreply.github.com> Date: Wed Dec 15 13:51:57 2021 -0800 OptimisticTransactionImpl.commit changes absolute paths to 
relative paths (#234) Resolves #228 commit 1a3a2f8674c2870ccf9e2d6ceb30e7f8f2584913 Author: allisonport-db Date: Wed Dec 15 13:24:04 2021 -0800 add allisonport-db back to excluded usernames commit e8736f481266af0f597a7e2dd4db3e8875a14788 Author: allisonport-db Date: Tue Dec 14 15:22:10 2021 -0800 Use github.event.sender instead of github.event.actor commit 2c730c1dddeb84a8c3357ada9527c5e4d0427ecb Author: Allison Portis <89107911+allisonport-db@users.noreply.github.com> Date: Tue Dec 14 14:30:10 2021 -0800 Update updated_pull_request.yaml commit e32715f5622151110549ed3acaa1533c41190df3 Author: allisonport-db Date: Tue Dec 14 14:13:51 2021 -0800 remove allisonport-db from excluded usernames for testing purposes commit f66e601535717bd5b31a20a7b25e7cf1c92793e1 Author: Allison Portis <89107911+allisonport-db@users.noreply.github.com> Date: Mon Dec 13 21:42:19 2021 -0800 Create a Metadata builder from an existing instance (#231) Since actions are immutable, it is cumbersome to update an existing table metadata with one or two value changes. This introduces `Metadata::copyBuilder()` which instantiates a `Metadata.Builder` with properties copied over. - Added `Metadata::copyBuilder()` - Added tests - Updated `OptimisticTransaction::updateMetadata(Metadata metadata)` documentation with example usage commit 6c8192858e114c721d27bcb303adc92a5709cfe8 Author: Allison Portis <89107911+allisonport-db@users.noreply.github.com> Date: Mon Dec 13 12:31:37 2021 -0800 Automate adding new PR's and moving updated PR's on project board (#227) commit dbc09edf64db28622f87bc98a8ab7463c48904cc Author: Allison Portis <89107911+allisonport-db@users.noreply.github.com> Date: Mon Dec 13 11:37:03 2021 -0800 Add StructType.isWriteCompatible (#232) Add `StructType.isWriteCompatible` to allow users to check whether a new schema can be used in a Delta table safely. 
commit a4db48ecadf13d1082281ab60016f123481504ed Author: pkubit-g <91378169+pkubit-g@users.noreply.github.com> Date: Mon Dec 13 19:11:14 2021 +0100 Flink Delta Sink - PR 5 (#230) * Add schema management and validation commit b8d3b9f571bbffe73f85c2bb6e7faa39dc37e56c Author: pkubit-g <91378169+pkubit-g@users.noreply.github.com> Date: Wed Dec 8 01:53:02 2021 +0100 Flink Delta Sink - PR 4 (#224) Add basic implementation for committing to the DeltaLog commit 9a9c6d7ffc5d59b69414a20d82bf023a59f981b7 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Fri Dec 3 13:04:01 2021 -0800 DeltaLog.tableExists API (#216) commit 6b468dabcbea5e24a8f81887d2f6e855b2b63ed5 Author: allisonport-db <89107911+allisonport-db@users.noreply.github.com> Date: Thu Dec 2 15:40:39 2021 -0800 Java doc revisions (#226) commit bac070340a7d60fd386b8a6bceef73dd9c459dc1 Author: allisonport-db <89107911+allisonport-db@users.noreply.github.com> Date: Thu Dec 2 12:12:33 2021 -0800 Update AND, OR and NOT to fail during construction and not eval for invalid input expression types (#225) Moved the check for boolean input types to construction instead of eval. - More intuitive - Matches our behavior in `BinaryOperator` Updated tests to match. commit a1991f5ddeab903ebcd37199c4bf7710e52b66c4 Author: pkubit-g <91378169+pkubit-g@users.noreply.github.com> Date: Thu Dec 2 06:19:32 2021 +0100 Flink Delta Sink - PR 3 (#220) Extend committables with transactional info (appId + checkpointId) commit 2ddff5e95074097320d3287345ebbc9a562b0a4b Author: Scott Sandre Date: Wed Dec 1 08:50:01 2021 -0800 Fix misc issues with Delta 1.1 release process In order to have our release process properly cross-publish our different Scala versions, we had to rearrange some code in `build.sbt`. 
I followed this tutorial [here](https://www.acervera.com/blog/2020/04/sbt-crossversion-release-bintray/#sbt-release-configuration) which basically just has us move the `releaseProcess` to the root project instead of in each sub-project. This step is also documented on the official sbt docs [here](https://www.scala-sbt.org/1.x/docs/Cross-Build.html#Note+about+sbt-release). I also noticed, when running our integration tests, that they were at first using the **old** JAR, and not the newly-staged JAR. Clearing the ivy2 local directory resolved this. Closes delta-io/delta#847 Signed-off-by: Scott Sandre GitOrigin-RevId: 6bfc5b0397eb02bcf54b78b2ca746be854091155 commit 3509050a6c75bebd45190bd1c27429a70df78ee4 Author: Fred Liu Date: Tue Nov 30 17:14:49 2021 -0800 Data skipping metrics Added 1) scanDurationMs 2) scanType(noSkippingV1, noSkippingV2, partitionFilteringV1, partitionFilteringV2, dataSkippingV1, dataSkippingV2, limit, filteredLimit) 3) number of files total and survived through skipping to class DeltaScan. Will add those metrics to usage log in future PRs. 
GitOrigin-RevId: f17011a95ae7b257d5284d35d4885fd21a70baeb commit 81189e1ac2d6045fc4adb599df450f7fbcb7378a Author: Jackie Zhang Date: Tue Nov 30 00:35:03 2021 -0800 Minor refactor to alterDeltaTableCommands GitOrigin-RevId: d6adf8aaa8f1c50283026be70a179b4b1f927575 commit 8f2c6939fbbfe7db010603fe5ae1188c5a34917b Author: Ryan Johnson Date: Mon Nov 29 20:09:29 2021 -0800 Refactor test code GitOrigin-RevId: d0f9feb520495b3a887f6c681c939cd7d67793bd commit 763a9266221ef1a6e1ce833cb04c0d32451f89fb Author: David Lewis Date: Mon Nov 29 18:22:26 2021 -0700 Minor refactor to DeltaHistoryManager GitOrigin-RevId: 1ea2c6210454c26af58d8e8e8b86e08495bec0d3 commit a14541a5de5a0960ccfa610d18f84381a42afd05 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Tue Nov 30 12:34:37 2021 -0800 Add null-check for data reader partition field (#219) Our `RowRecord` APIs say that we will throw a `NullPointerException if field is not nullable and {@code null} data value read`. We are doing this for data fields, but we forgot to do this for partition fields. We can't test this as we can't insert a row with a null value for a non-nullable column. commit 872f6d444fba35c89dd6074c08a6af2615cd9687 Author: Scott Sandre Date: Mon Nov 29 16:28:57 2021 -0800 Delta 1.1 `org.codehaus.jackson` dependency fix Delta 1.1.0 staged JAR is failing integration tests with error `java.lang.AssertionError: unsafe symbol JsonRawValue (child of package annotate) in runtime reflection universe`. Further investigation showed that Delta 1.0.0 does have `org.codehaus.jackson` as a runtime dependency, but our staged Delta 1.1.0 does not.
Closes delta-io/delta#844 Signed-off-by: Scott Sandre GitOrigin-RevId: 3dbd3a9ff40a693afb6d87720c7eae6e829e53ff commit 5322c6e8621814d9abcef1d6152c47a165ac7be4 Author: Liwen Sun Date: Mon Nov 29 13:42:05 2021 -0800 Minor refactor to alterDeltaTableCommands GitOrigin-RevId: 80214b196955e157e3804c253989d283142aa68f commit 4e0f982db02664594d8b725c11ff1880b77ba375 Author: Liwen Sun Date: Mon Nov 29 12:59:23 2021 -0800 Minor refactor to ConvertToDeltaCommand GitOrigin-RevId: b82a20daf84fa414a1bb86680a719e6a91f3fe3b commit d43511d103ab6287e638bc2ffe58b957f5be8e2b Author: Meng Tong Date: Tue Nov 23 16:27:29 2021 -0800 Minor refactor to PreprocessTableMerge GitOrigin-RevId: 6bb67d4cae2f52e15cdca77a93768ef907a283b2 commit 80295c259e49c5e72e88b5b0ac932ba7dd1dca45 Author: Jackie Zhang Date: Thu Nov 18 19:08:59 2021 -0800 Minor refactor to DeltaTable GitOrigin-RevId: 965beb763c976c043431445c1c2cf102608769fe commit 1ec6b57d3bd0a29110b0b8e2b30bd2309e7be297 Author: Scott Sandre Date: Thu Nov 18 12:21:36 2021 -0800 Revert "Remove try and catches for deltaLog.checkpoint" This PR reverts the changes from commit 4821acd730f3ee7faeb54a7b02e16a84386aa497 which removes try and catch statements surround `deltaLog.checkpoint` calls. GitOrigin-RevId: ce936bf76eaefe1de313212f546f04b65f8f87a4 commit bf1bcca31b887f11ff4dab031e497071d7ddf058 Author: pkubit-g <91378169+pkubit-g@users.noreply.github.com> Date: Tue Nov 30 17:07:21 2021 +0100 Flink Delta Sink - PR 2 (#213) * Add base implementation for writing files to the FS * Extend java docs && apply code review remarks * Correct java docs * Correct Java docs commit f971dc2e0f933b57a59354167b0d16e376ccb67b Author: Shixiong Zhu Date: Tue Nov 23 21:16:29 2021 -0800 Fix flaky build and add Hive tests back (#217) I noticed the following output in each failed build. ``` Files to process must be specified, found 0. Files to process must be specified, found 0. Error: Process completed with exit code 255. 
``` After digging more, I found the root cause of our flaky build is the same as https://github.com/etsy/sbt-checkstyle-plugin/issues/32 sbt-checkstyle calls the Java checkstyle library, which will [call System.exit when the style check fails](https://github.com/etsy/sbt-checkstyle-plugin/blob/58b5c5fc4fc2d45b56ea07a9ad858089e34853e7/src/main/scala/com/etsy/sbt/checkstyle/Checkstyle.scala#L47-L52). In order to avoid exiting the SBT JVM process, sbt-checkstyle [uses its own NoExitSecurityManager to prevent exits from the JVM](https://github.com/etsy/sbt-checkstyle-plugin/blob/58b5c5fc4fc2d45b56ea07a9ad858089e34853e7/src/main/scala/com/etsy/sbt/checkstyle/Checkstyle.scala#L122-L141). However, the implementation is [not thread safe](https://github.com/etsy/sbt-checkstyle-plugin/issues/32#issuecomment-976206148), and cannot prevent the JVM from exiting when two concurrent threads are running the [noExit](https://github.com/etsy/sbt-checkstyle-plugin/blob/58b5c5fc4fc2d45b56ea07a9ad858089e34853e7/src/main/scala/com/etsy/sbt/checkstyle/Checkstyle.scala#L122-L141) method. As most of our projects have zero Java files, the Java checkstyle library will fail and call System.exit for most of our projects. Due to the above race condition, the SBT JVM may be killed when sbt-checkstyle's `noExit` method cannot prevent the JVM from exiting. `exit code 255` in our output matches the following code in Checkstyle: - https://github.com/checkstyle/checkstyle/blob/a42a3e1733f3c7583177a3c96de65e6a2fd7671f/src/main/java/com/puppycrawl/tools/checkstyle/Main.java#L87 - https://github.com/checkstyle/checkstyle/blob/a42a3e1733f3c7583177a3c96de65e6a2fd7671f/src/main/java/com/puppycrawl/tools/checkstyle/Main.java#L184-L189 - https://github.com/checkstyle/checkstyle/blob/a42a3e1733f3c7583177a3c96de65e6a2fd7671f/src/main/java/com/puppycrawl/tools/checkstyle/Main.java#L810-L812 The fix is just disabling parallel execution.
While it makes the build slower, the speed is still acceptable since most of the time is spent on tests which cannot run in parallel anyway. This PR also adds a build cache, adds the Hive tests back, and upgrades Ubuntu back to 20.04. The SBT heap size is also increased, as I occasionally hit SBT OOMs during my investigation. commit d9a162aa22b07458c1c2c59ca32660d59312b1e7 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Tue Nov 23 12:00:11 2021 -0800 DSW convert-to-delta example (#205) * example WIP * delete file made in error * remove parquet files, verify commit, generate less parquet * move script to resources * dealing w existing/non-existent directory * edit comments * python instructions * removing personal file path * addressing comments * add back source vs target path * fix doc * use delta log for check instead * small changes Co-authored-by: allisonport-db Co-authored-by: allisonport-db <89107911+allisonport-db@users.noreply.github.com> commit c2b7a9c7d64b221020f431757d7b5cc9443311da Author: allisonport-db <89107911+allisonport-db@users.noreply.github.com> Date: Fri Nov 19 17:29:27 2021 -0800 Fix Github actions to fail when an sbt test fails (#212) A few issues fixed: - Github actions wasn't reporting any failed tests unless the last test on the list failed, since it restarted `build/sbt` each time, and Github actions only considers the final exit code. - Moved `compatibility/test` to only run on Scala 2.12 since it uses Delta 1.0.0 - Fixed a few JavaDoc compilation errors that were merged in #196. - [QUESTION] Documentation in #196 uses `@implNote` Javadoc tag. I added it to the Javadoc compilation, but should we just change that doc instead? - [NOTE] Seems related to https://github.com/sbt/sbt/issues/875, but when there is a compilation error, all the other warnings are reported as errors.
For example, in the previous commit, it showed 18 `[error]` messages but also: ``` [info] 2 errors [info] 18 warnings [error] (standalone/javaunidoc:doc) javadoc returned nonzero exit code ``` - Fixed the other tests that were failing commit adb70719a47637aa2d3fe33d205fcae477664735 Author: Junlin Zeng Date: Thu Nov 18 00:25:44 2021 -0800 Minor refactor in CreateDeltaTableCommand GitOrigin-RevId: c4c832f3a785315f1387ac5b6437735e8d66ac5f commit 45cc12d177ffd31ad4e711f0d0da2d0a4f5aec9a Author: Scott Sandre Date: Wed Nov 17 16:18:29 2021 -0800 Make test_pip_utils.py run during tests This PR adds code to `test_pip_utils.py` to make sure that the pip utils tests are actually run. To solve this, we rearrange the order of tests to ensure we locally generate `delta-spark` before we run that test. Closes delta-io/delta#835 Signed-off-by: Scott Sandre GitOrigin-RevId: cd3fafbd608b00710075c4c5eb67700431031473 commit 63c195cc8074f53976afef4b639aa556c38533a6 Author: Feng Zhu Date: Wed Nov 17 13:17:20 2021 -0800 Use a non-final field for the state of SetAccumulator Our production cluster once encountered an NPE caused by SetAccumulator. After investigation, we found the issue is similar to [SPARK-20977](https://issues.apache.org/jira/browse/SPARK-20977). Inspired by the fix, this PR proposes a similar one. Closes delta-io/delta#830 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 5f5b4829bb3fccd58da71deb57c9d872440acb0c commit 4d9df502d054a141592cc217f2ee30f563503160 Author: Prakhar Jain Date: Wed Nov 17 07:49:44 2021 -0800 Minor refactor in OptimisticTransactionSuite GitOrigin-RevId: b58e06897438da25294ec1e1b531103668493519 commit 1c8590168a03431926ea87432441d1e4fad29168 Author: Jackie Zhang Date: Tue Nov 16 20:24:50 2021 -0800 Fix accessing nested complex schema & flaky tests under column mapping modes Fix columnNotFound exception when we have complex nested schema such as a.element.element.key.value.field. Also removed flaky tests. New tests + existing tests.
GitOrigin-RevId: 82baaadedeb843f784ff9512b4a5f6cf373b1143 commit 09a7b061c8e4c23a010d85ea3144c79342c33813 Author: Prakhar Jain Date: Tue Nov 16 17:00:41 2021 -0800 Minor refactor in ConflictChecker GitOrigin-RevId: 4c6405a5a9b2e8ff36b8bde4fa45dd474ffaab75 commit 9c1ff8792e237e6111cb95ae0840ebad2321d9b2 Author: Shixiong Zhu Date: Tue Nov 16 11:28:22 2021 -0800 Add 'columns with commas as partition columns' back to OSS Delta This PR adds the test `columns with commas as partition columns` back to OSS Delta but removes the test code that tries to load tables using special chars in a partition column. In Delta, we pass the entire schema as `dataSchema` to `HadoopFsRelation` to preserve the partition column locations in the table schema when *reading* a Delta table. For example, if the table schema is `(c1 INT, c2 INT, c3 INT)` and the partition column is `c2`. We will pass `(c1 INT, c2 INT, c3 INT)` as `HadoopFsRelation.dataSchema` to make the output schema be `(c1 INT, c2 INT, c3 INT)`. This is because if we passed `(c1 INT, c3 INT)` as `HadoopFsRelation.dataSchema`, the output schema would be `(c1 INT, c3 INT, c2 INT)`. This is because without extra information, `HadoopFsRelation` will always put the partition columns as the end of the output schema. However, Spark 3.2 has [a change](https://github.com/apache/spark/pull/33566/files#diff-5e4e444d7b0b87628cac7643c5be48c123b9aa2534ee0b892853003b013d6a2fR87) that adds field name check for `HadoopFsRelation.dataSchema` in `DataSourceUtils.verifySchema` (`HadoopFsRelation.dataSchema` is passed into `DataSourceUtils.verifySchema` [here](https://github.com/apache/spark/blob/9666046c90d80887d2e007ef661880af7cd81bf6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L437)). Hence, when we have a special char in a partition column, we won't be able to read such table. 
As creating tables using special chars in a partition column has been blocked since OSS Delta 0.6.0, it's unlikely people would hit this issue in OSS Delta. Hence, not supporting to read such tables in OSS Delta 1.1.0 should be fine. GitOrigin-RevId: 080c08ac6fd25af17ca86c35cdc64f23ba8303a0 commit bc3ec0da2cd5e2ff493b7a5e328311492801155f Author: Maciej Date: Tue Nov 16 09:30:36 2021 -0800 Add Python type annotations This PR adds type annotations for Python module, provided as stub files. This should: - Improve auto-completion performance in editors which support type hints. - Optionally enable static checking through mypy or built-in tools (for example Pycharm tools). - Provide additional way to document usage of certain methods, where arguments are required, despite providing default values. Closes delta-io/delta#305 Signed-off-by: Scott Sandre GitOrigin-RevId: fbd1b56d82c7951cf855665fe67cc71d3baadd03 commit 8d732b5becfac8e444677019d3038ec44c73b82f Author: Scott Sandre Date: Mon Nov 15 21:49:47 2021 -0800 Add test for broken SET NOT NULL; Fix ALTER TABLE semi-colon error Add test for failing statement `ALTER TABLE my_table CHANGE COLUMN id SET NOT NULL;` (Cannot change nullable column to non-nullable) Fix issue where `ALTER TABLE ... ADD CONSTRAINT ... CHECK ...` commands could not end with a semi-colon. Closes delta-io/delta#828 Signed-off-by: Scott Sandre GitOrigin-RevId: 5b5d9be6dede8a8327d1a9dcccf6d2720c20bcb3 commit db113dab3db5bdc371f3d49734e26a7403372c24 Author: Meng Tong Date: Mon Nov 15 19:59:34 2021 -0800 Reject empty string for string partition column This PR adds a projection to convert empty string to null values before `DeltaInvariantCheckExec` so that any constraint (NOT NULL or CHECK) can correctly treat empty string as null values. The change is behind an internal conf which is by default off. Added new tests to verify that empty strings are correctly rejected when the constraint does not allow null values. 
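The empty-string handling described above can be sketched in plain Python. This is only an illustration of the ordering (normalize first, then check the constraint), not Delta's Spark implementation: the dict-based row, the function names, and the `InvariantViolation` error are all hypothetical.

```python
from typing import Dict, Iterable, List, Optional

# A row is modeled as a plain dict; this representation is an assumption.
Row = Dict[str, Optional[str]]


class InvariantViolation(ValueError):
    """Hypothetical error type standing in for a constraint failure."""


def normalize_empty_strings(row: Row, partition_cols: Iterable[str]) -> Row:
    """Return a copy of the row with '' replaced by None in partition columns."""
    out = dict(row)
    for col in partition_cols:
        if out.get(col) == "":
            out[col] = None
    return out


def check_not_null(row: Row, not_null_cols: Iterable[str]) -> None:
    """Raise if any NOT NULL column holds None."""
    for col in not_null_cols:
        if row.get(col) is None:
            raise InvariantViolation(f"NOT NULL constraint violated for column: {col}")


def write_rows(rows: Iterable[Row],
               partition_cols: List[str],
               not_null_cols: List[str]) -> List[Row]:
    """Normalize each row before the constraint check, mirroring the ordering
    described in the commit message (projection first, invariant check second)."""
    accepted = []
    for row in rows:
        normalized = normalize_empty_strings(row, partition_cols)
        check_not_null(normalized, not_null_cols)
        accepted.append(normalized)
    return accepted
```

Because the normalization runs before the check, a row with `""` in a NOT NULL partition column is rejected rather than silently stored.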
GitOrigin-RevId: af2923e7e5296201f4b3329673f2d87921f0d720 commit 18aa5d318f5a9eb0785f8010e3b647fc1f101a80 Author: Meng Tong Date: Thu Nov 11 16:14:22 2021 -0800 Add COPY TEST test for GENERATED columns GitOrigin-RevId: e86044a60ae9e0269f4d4d87a4f02b0bd211d4e3 commit a29d6b4b7b5911b398eafecd86fcb0b50805ae10 Author: pkubit-g <91378169+pkubit-g@users.noreply.github.com> Date: Tue Nov 16 19:09:39 2021 +0100 Flink Delta Sink - PR 1 (#196) * Add base classes and interfaces for Flink Delta Sink * Apply code review remarks commit 50bb8c23a771d26fb9f730da95145f3ddeca7547 Author: allisonport-db <89107911+allisonport-db@users.noreply.github.com> Date: Mon Nov 15 11:38:00 2021 -0800 [#124] Fix for getBigDecimal decoding bug with parquet4s LongValue (#208) Resolves #124 • Added `Parquet4s.LongValue` to `customDecimalCodec` • Added test case to `DeltaDataReaderSuite` • Created table for test in `GoldenTables.scala` commit f0c41c01a21afb2341c433af42e36b2a190eda86 Author: allisonport-db <89107911+allisonport-db@users.noreply.github.com> Date: Mon Nov 15 11:36:40 2021 -0800 javadocs with return statements' (#210) commit d9e38771fe8130f5dc01fd8576173ac0247accd1 Author: Liwen Sun Date: Thu Nov 11 11:45:29 2021 -0800 Support changing mapping mode on existing tables Support changing existing Delta table's mapping mode from none to name. All other changes are still blocked. new unit tests GitOrigin-RevId: 750af80737cd8af723ad23681c44828c534b3886 commit a69d2ebb6e68d204ee13e4e65c915cb5057e4746 Author: allisonport-db Date: Wed Nov 10 23:01:58 2021 -0800 Add Delta OSS version for generating CommitInfo.engineInfo `CommitInfo.getEngineInfo` is updated to use the added Delta version. GitOrigin-RevId: 8d538536682f2aabe990489cbfc7ebd2ba2c8381 commit c4227f3dd4c9bb174e51bfc5ee86855d4f23eda9 Author: Shixiong Zhu Date: Tue Nov 9 15:56:29 2021 -0800 [DELTA-OSS-EXTERNAL] Add Scala 2.13 support Most of changes are caused by the following two breaking changes in Scala. 
- scala.Seq[+A] is now an alias for scala.collection.immutable.Seq[A] (instead of scala.collection.Seq[A]). Note that this also changes the type of Scala varargs methods. - mapValues and filterKeys now return a MapView instead of a Map. The fix is just adding `toSeq` and `toMap`. See https://docs.scala-lang.org/overviews/core/collections-migration-213.html for more details. Closes delta-io/delta#823 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 843af7c94d17639ffc04f4877839d94e5a613eca commit b2452850689b916d9432ce979c0b938155945034 Author: Wenchen Fan Date: Tue Nov 9 18:18:18 2021 +0000 Minor Refactoring GeneratedColumnSuite Author: Wenchen Fan GitOrigin-RevId: 3ccb4391361c414368d56df5ac150022fc01a087 commit 9a6018a983f8e1becd3dd0d839a66deb295e83f1 Author: Meng Tong Date: Tue Nov 9 10:14:03 2021 -0800 Minor refactoring Delta Transaction Related Files GitOrigin-RevId: 6450a682f4725391c499929150e49bd1a55f55c9 commit 733bfd2d0799c9eee69457205d2e62eea8e0cb42 Author: Prakhar Jain Date: Mon Nov 8 19:18:58 2021 -0800 Minor refactoring to CurrentTransactionInfo Description: This PR do some minor refactoring to CurrentTransactionInfo class. Existing UTs. GitOrigin-RevId: a9e4ba8ef8ece7ed0b93723e3fa07698ddfb7449 commit b4a2ef4c792d36bbef775b432991e98a704c6e07 Author: allisonport-db Date: Mon Nov 8 14:08:06 2021 -0800 Update CommitInfo to include engineInfo and show it This PR adds the field `engineInfo` to the action `CommitInfo`, and updates `OptimisticTransaction` and `DeltaHistory` to include it. • `engineInfo = "Apache-Spark/3.2.0 Delta-Lake/1.1.0"` in OSS A follow-up change will be necessary to add the OSS Delta version. `engineInfo` tags commits so we can identify writers (i.e. connectors) in the Delta Standalone project. 
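As a rough illustration of how an `engineInfo` string in the format shown above could be built and read back, here is a small Python sketch. The helper names are hypothetical, and replacing spaces with `-` is borrowed from the Delta Standalone commit elsewhere in this log; it is an assumption that the same rule applies here.

```python
from typing import Dict


def build_engine_info(engine: str, engine_version: str, delta_version: str) -> str:
    """Build a string such as 'Apache-Spark/3.2.0 Delta-Lake/1.1.0'.

    Spaces inside each component are replaced with '-' so the single space
    between the two tokens stays an unambiguous separator (assumed rule).
    """
    def clean(value: str) -> str:
        return value.strip().replace(" ", "-")

    return f"{clean(engine)}/{clean(engine_version)} Delta-Lake/{clean(delta_version)}"


def parse_engine_info(engine_info: str) -> Dict[str, str]:
    """Split 'name/version name/version' into a {name: version} mapping."""
    parts: Dict[str, str] = {}
    for token in engine_info.split(" "):
        name, _, version = token.partition("/")
        parts[name] = version
    return parts
```

A reader of the commit history could use something like `parse_engine_info` to group commits by writer.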
Unit tests: • checks `engineInfo` is part of the schema for `DeltaTable.history` • `engineInfo` serialization and deserialization added to `ActionSerializerSuite` GitOrigin-RevId: be474209a18bec2f5e48eae146f5cb9db1afd73c commit c42a6c64012d1464413d093d559d60903afe25e3 Author: Shixiong Zhu Date: Mon Nov 8 13:40:22 2021 -0800 [DELTA-OSS-EXTERNAL] Block SHOW CREATE TABLE for Delta Although Spark 3.2 adds support for SHOW CREATE TABLE for v2 tables, it doesn't work properly for Delta. For example, table properties, constraints and generated columns are not shown properly. This PR blocks SHOW CREATE TABLE for Delta to unblock the 1.1.0 release. In the future, we should implement Delta's own ShowCreateTableCommand to show the Delta table definition correctly. Closes delta-io/delta#822 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 4bb2da1ce0d48b997376204beb1163b5994ef839 commit d5f79f14f44c7a65f54887b4bccd3014fabb9b17 Author: allisonport-db <89107911+allisonport-db@users.noreply.github.com> Date: Thu Nov 11 15:01:53 2021 -0800 Close ParquetIterable in SnapshotImpl.loadInMemory (#207) commit 53be736f515fd7d1296403ff0edd6427c6958254 Author: allisonport-db <89107911+allisonport-db@users.noreply.github.com> Date: Wed Nov 10 23:24:43 2021 -0800 update package object path (#206) commit cdfd7fc459a1319db338f3a5533dc75d36c302ab Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Wed Nov 10 09:14:22 2021 -0800 Add `remove` API to AddFile (#201) commit 04dc2c9b3fe32ea6efa24c5e094ddbefece96e83 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Tue Nov 9 18:18:03 2021 -0800 Add DSW 0.3.0-SNAPSHOT java docs to preview (#203) - ran `build/sbt standalone/unidoc` - copied output (from `connectors/standalone/target/javaunidoc/`) into `connectors/docs/0.3.0-SNAPSHOT/delta-standalone/api/java/` commit b787012054205b3313ecc5b85ba0d75e016e84d3 Merge: fd34ec53f 6afab6e8d Author: Shixiong Zhu Date: Tue Nov 9 12:21:31 2021 -0800 Merge pull request
#139 from delta-io/delta_standalone_writer_feature Delta Standalone Writer commit 6afab6e8dd1c917dd4e6f5e133c633c00ca6f35f Author: allisonport-db <89107911+allisonport-db@users.noreply.github.com> Date: Mon Nov 8 11:52:03 2021 -0800 EngineInfo: replace spaces with '-' for engineVersion and deltaVersion (#200) commit 860ba8d89e5b73802445a02f286b9d66f4cbe3d7 Merge: f9b10b6d4 fd34ec53f Author: Scott Sandre Date: Mon Nov 8 11:46:46 2021 -0800 merge with master commit f9b10b6d4bd0994185a661e2ea9ad2fa03638fbe Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Mon Nov 8 11:19:32 2021 -0800 [DSW] Memory optimized iterator (#194) commit d490f43aef3ea9721fc5939464423b46057fd945 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Mon Nov 8 09:46:36 2021 -0800 [DSW] Update README.md (#199) commit fd34ec53f8b222112f4c9cfd6f90c87e137e8f09 Author: Shixiong Zhu Date: Mon Nov 8 09:41:52 2021 -0800 Make hive connector support both hive 2 and 3 (#198) Since we share the jackson libraries in standalone jar, it's easy to make hive connector support both Hive 2 and 3. Here are the changes in the PR: - Make hive connector depend on the standalone jar that has the shaded jackson libraries. - Create a new hiveAssembly project to shade the hive connector. - Create a new hiveTest project and move existing code in hive-test/src/test to this project. - hiveAssembly and hiveTest will be built using hive 3. However, to make sure the assembly hive connector work with hive 2. We also create two new projects: hive2MR and hive2Tez. They will just depend on the jars created by `hiveAssembly` and `hiveTest` (We need the common test codes). After this change, hive connector tests will also help us test the production standalone jar we will release. 
commit 0f30f6f47761cb01043d84aad182f30cb59b5e31 Author: Shixiong Zhu Date: Fri Nov 5 15:25:28 2021 -0700 [DELTA-OSS-EXTERNAL] Pin pipenv version to fix the broken Python setup Our Python setup is broken with the following error: ``` Traceback (most recent call last): File "/usr/local/bin/pipenv", line 8, in sys.exit(cli()) File "/usr/local/lib/python3.8/dist-packages/pipenv/vendor/click/core.py", line 1137, in __call__ return self.main(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/pipenv/vendor/click/core.py", line 1062, in main rv = self.invoke(ctx) File "/usr/local/lib/python3.8/dist-packages/pipenv/vendor/click/core.py", line 1668, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/usr/local/lib/python3.8/dist-packages/pipenv/vendor/click/core.py", line 1404, in invoke return ctx.invoke(self.callback, **ctx.params) File "/usr/local/lib/python3.8/dist-packages/pipenv/vendor/click/core.py", line 763, in invoke return __callback(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/pipenv/vendor/click/decorators.py", line 84, in new_func return ctx.invoke(f, obj, *args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/pipenv/vendor/click/core.py", line 763, in invoke return __callback(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/pipenv/cli/command.py", line 440, in run do_run( File "/usr/local/lib/python3.8/dist-packages/pipenv/core.py", line 2456, in do_run ensure_project( File "/usr/local/lib/python3.8/dist-packages/pipenv/core.py", line 560, in ensure_project ensure_pipfile( File "/usr/local/lib/python3.8/dist-packages/pipenv/core.py", line 269, in ensure_pipfile project.create_pipfile(python=python) File "/usr/local/lib/python3.8/dist-packages/pipenv/project.py", line 710, in create_pipfile from .vendor.pip_shims.shims import InstallCommand ImportError: cannot import name 'InstallCommand' from 'pipenv.vendor.pip_shims.shims' 
(/usr/local/lib/python3.8/dist-packages/pipenv/vendor/pip_shims/shims.py) ``` By comparing the build logs, I found it's because pipenv's new release `2021.11.5.post0` broke it. Green build log ``` Collecting pipenv Downloading pipenv-2021.5.29-py2.py3-none-any.whl (3.9 MB) ``` Red build log ``` Collecting pipenv Downloading pipenv-2021.11.5.post0-py2.py3-none-any.whl (3.9 MB) ``` This PR pins pipenv to 2021.5.29 (the previous version we used) to fix the build. Closes delta-io/delta#821 Signed-off-by: Shixiong Zhu #30616 is resolved by zsxwing/zchw1xlj GitOrigin-RevId: ebb09c34eceaa077ee56fe3f133c4ea588eb894d commit 42b093fbe13dd85e87c3f6ad5cfa987e11232209 Author: Vítor Mussa Date: Fri Nov 5 11:38:53 2021 -0700 [DELTA-OSS-EXTERNAL] Natively handle errors when calling `shutil.rmtree` in Python examples In the Python examples, remove the try-except block in favor of `shutil`'s native error handling, avoiding possible inconsistencies that could come from the broad scope of the bare `except` statement. Closes delta-io/delta#819 Signed-off-by: Shixiong Zhu #30577 is resolved by zsxwing/xi0q3yy3 GitOrigin-RevId: 63eb4013fcbdf3e81b9bba9287cf203af4e36151 commit 117e7820467f2c5b801373974e10e3e6fb77eb5c Author: wwang-talend Date: Fri Nov 5 15:02:25 2021 -0700 Fix RowRecord to allow read partition values (#197) This PR fixes a gap in the RowRecord implementation where partition (metadata) values were not included and could not be read; only data values were. Closes #197 Lead-authored-by: wwang-talend Co-authored-by: Scott Sandre Signed-off-by: Shixiong Zhu commit 5571d95966e3b03c4699dcfa1a5cd86bb16447dc Author: Shixiong Zhu Date: Thu Nov 4 12:01:38 2021 -0700 [SC-83988][DELTA] Upgrade Spark to 3.2.0 in OSS Delta This PR upgrades Spark to 3.2.0 for OSS Delta. The major change we need to make is implementing `withNewChildrenInternal` (TreeNode) or `withNewChildInternal` (LeafNode) for Delta's logical nodes.
Closes delta-io/delta#618 delta-io/delta#765 GitOrigin-RevId: 6e05659bfffd563a19b701fcb2dc0b58c886d5f4 commit 5ad704713d3cc6b3fd016d727a2659a2d90c34c0 Author: Linhong Liu Date: Thu Nov 4 03:56:55 2021 -0700 Make DESC TABLE to show varchar instead of char GitOrigin-RevId: f00f995417eb8b3191b7518882a6a69c3bf71cef commit c0ad2ef94e8cb3042a8494d38a6b61fdf636ad30 Author: jackierwzhang Date: Wed Nov 3 16:23:51 2021 -0700 [SC-86568] Add id column mapping suites - Batch 2 This is the final batch. GitOrigin-RevId: c02b1699df6a0cbb4b22bda0395b4d242b8e96e1 commit d6b172a31f73d4c59a4123d65b3645bf0a6ea711 Author: Junyong Lee Date: Wed Nov 3 14:34:28 2021 -0700 [SC-85245][DELTA] Refactor logSegment in snapshot management Loading a log segment in snapshot management consists of two code paths, although most of the logic is shared. This PR refactors them to go through a common path, with slightly different options for each. By doing so we expect any future improvement to this path to be easier to maintain and modify. GitOrigin-RevId: 5dc9fd0f6aa387f36dbaf147cdcf9ded6060cd9a commit 4aa1c260c80430b3a634bd5b57d1d36c05b81a66 Author: Rahul Mahadev Date: Thu Oct 28 18:14:33 2021 -0700 [SC-65265][DELTA]Change spark.databricks.delta.merge.repartitionBeforeWrite.enabled to true in OSS Delta Make repartitionBeforeWrite enabled by default in OSS. GitOrigin-RevId: 1c6b7c48097c6070fcf184dfbb3c37d4cd6a9f37 commit 1b1dec27bd8a1c166fb8370cbdb01c9225e5d8e9 Author: Allison Portis Date: Thu Oct 28 16:22:41 2021 -0700 [SC-54273] Use LogStore.readAsIterator in Delta streaming source This pull request changes Delta streaming source to use `LogStore.readAsIterator` to read file actions when a delta JSON file is too large to load into memory entirely. `DeltaSourceSuite` has been extended to create `DeltaSourceLargeLogSuite` with `DeltaSQLConf.MAX_FILESIZE_IN_MEMORY.key = 0` so that all the tests are run using the altered code.
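The size-based switch between eager and lazy reading described above can be sketched in plain Python. This is a simplified stand-in, not the Scala `LogStore.readAsIterator` code: `read_actions` and the `max_filesize_in_memory` parameter are hypothetical names, and the log file is assumed to hold one JSON action per line.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterator


def _iter_actions(path: str) -> Iterator[Dict[str, Any]]:
    """Lazily yield one parsed action per non-empty line; the file is closed
    once the generator is exhausted."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def read_actions(path: str, max_filesize_in_memory: int) -> Iterator[Dict[str, Any]]:
    """Read a JSON-lines log file.

    Small files are parsed fully into memory; files larger than the threshold
    are streamed line by line. A threshold of 0 forces the streaming path for
    any non-empty file, which is how a test suite could exercise it.
    """
    if os.path.getsize(path) <= max_filesize_in_memory:
        with open(path, "r", encoding="utf-8") as f:
            lines = f.read().splitlines()
        return iter([json.loads(line) for line in lines if line])
    return _iter_actions(path)
```

Both paths return an iterator, so callers do not need to know which one was taken; only the peak memory use differs.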
GitOrigin-RevId: 465b0fc98b381d7c8c10f7cb63c4a4fd84e5ac06 commit 95e90763fd9f54df8880911b28b97b023a485d5f Author: Shixiong Zhu Date: Thu Oct 28 16:09:46 2021 -0700 [SC-86940][DELTA]Fix potential checkpoint corruption issue for GCS in OSS Delta This PR moves checkpoint write to a separate thread for GCS to workaround the potential checkpoint corruption issue when a thread is interrupted. GitOrigin-RevId: 418df2441923c85c6415f9baec157c923c4c3dca commit 315a4a0679d39fb90dfa3501d6c95990df838846 Author: KamCheung Ting Date: Thu Oct 28 13:44:51 2021 -0700 [SC-87685] [DELTA] Tiny code improvement in checkpoint code path. This PR contains some tiny code improvement on commit-checkpoint code path by simplifying some logic. GitOrigin-RevId: ecc8d9fc41587a16b7b65a66b5433f7e93abd1fb commit 030a0f21dcdfc6a4487f3a6ed75ba6e0f80e5598 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Wed Nov 3 09:45:35 2021 -0700 [DSW] [34] Add Local, Azure, S3SingleDriver log stores (#193) * add Azure log store * add S3SingleDriver log store * add LocalLogStore commit b29ab7a3e8bdeece4c95a46f865ca5b7214eb78f Author: Scott Sandre Date: Tue Nov 2 16:09:24 2021 -0700 update build.sbt to use publishM2 instead of publishLocal commit 75cb17d47e21e4f991b1c41094b33a02c661368e Author: allisonport-db <89107911+allisonport-db@users.noreply.github.com> Date: Mon Nov 1 15:42:05 2021 -0700 Hadoop 3.2.1 compatibility on EMR commit d574d5b33ea1e29fb319b55e967034fdca9d66f4 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Mon Nov 1 10:23:28 2021 -0700 [DSW] [33] Better documentation, misc fixes (#192) commit f909be90316af11406f5a8e78582d3c979287bad Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Fri Oct 29 12:23:16 2021 -0700 [DSW] [32] fix test sources scalastyle (#191) commit 3ffb30d86c6acda9b59b9b029e6e945992ef4f42 Author: Meng Tong Date: Thu Oct 28 10:38:20 2021 -0700 [SC-86818] Add util methods for Column with default expression 
Add utility methods for column with default expression, some refactors. GitOrigin-RevId: 1807a064486fbca138b8ebb4b69d561908c6d2b3 commit 7f46e91cf0950e437ffbce93d8a5925ebd0a3991 Author: Adam Binford Date: Wed Oct 27 00:43:09 2021 -0700 [DELTA-OSS-EXTERNAL] Repartition when performing a vacuum parallel delete to avoid AQE coalescing Resolves https://github.com/delta-io/delta/issues/810 Added a repartition to the parallel delete step of vacuum so that it is not affected when AQE is enabled. The number of partitions is added as a new config and defaults to `spark.sql.shuffle.partitions`. Closes delta-io/delta#811 Signed-off-by: Shixiong Zhu #30133 is resolved by zsxwing/eltas8lb GitOrigin-RevId: 7674439070b739a78388e58950844677b24d8dc3 commit 398d71de6275e5a6fa04353d2b0ccee85778dfac Author: Liwen Sun Date: Tue Oct 26 17:49:38 2021 -0700 [SC-87005] Add foundInvalidCharsInColumnNames DeltaError Add foundInvalidCharsInColumnNames DeltaError and refactor some code. GitOrigin-RevId: e7556ec9fe6f03569cfdefdecfc7d7a754273bb6 commit 61cf007d083e2f88fd4236f73133052a21ac0337 Author: Liwen Sun Date: Tue Oct 26 15:28:41 2021 -0700 [SC-86316] Convert to Delta can ignore casting partition values In Convert To Delta, we throw an error when failing to cast a partition value. This PR adds a flag so that we can silence the error and fill in nulls. New unit test. GitOrigin-RevId: 67eacea8338d2cadbff52babaad715f3fa363a35 commit 323e909b054b946406f717882aae810cfe66d767 Author: Ryan Johnson Date: Tue Oct 26 13:04:50 2021 -0700 [SC-87654] Clean up several iffy unit tests This PR fixes unit tests that were violating the Delta specification, such as: - Creating Delta logs directly in the table's directory, rather than a `_delta_log` subdirectory Test change only, and the affected unit tests still pass.
GitOrigin-RevId: 5cbf3ae6149046d5a2642b83bfa64aaee6ca2b03 commit 685820b66ec42de7ef8f8a61ef3fd0fcfb702a70 Author: Wenchen Fan Date: Tue Oct 26 21:27:33 2021 +0800 [SC-56391] Add char/varchar length check to Delta CONSTRAINT This PR adds the char/varchar type input string length check to Delta CONSTRAINT, so that it applies to all the write paths. This also introduces a reserved constraint name that end-users can't use: `__CHAR_VARCHAR_STRING_LENGTH_CHECK__` This PR removes the char type padding from Delta INSERT, to be consistent with other write paths. We can add the padding back when we have the infra to do it in the future. new tests Before this PR, when writing data to delta tables via UPDATE/MERGE, no char/varchar check is done. The written data may exceed the char/varchar length limitation, which violates the char/varchar semantic. This PR fixes this issue by adding the length check in Delta CONSTRAINTS so that it applies to all the write paths. This PR also removes the char padding logic from Delta INSERT, to be consistent with other write paths that only do length check. GitOrigin-RevId: 91e96ee9643bf1eba4d6ed159858d433e8357c95 commit 0b6e4392d06ccbe3b28f0b6adc4c7af1b970ee85 Author: jackierwzhang Date: Mon Oct 25 10:21:24 2021 -0700 [SC-86568] Minor change in SchemaUtils and refactor some tests Minor change in SchemaUtils and add essential test suites for column mapping modes. New + modified tests. GitOrigin-RevId: 7bb030802cc96ac186f9d8103ee343a84a1fc38a commit 4afd4156680ef37c9e8509cebab10ad43a0755f4 Author: Shixiong Zhu Date: Fri Oct 22 17:37:55 2021 -0700 [SC-86916]Support passing Hadoop FileSystem options via DataFrame options in Delta Similar to parquet, Delta now supports reading Hadoop file system configurations from DataFrameReader/Writer options when the table is read or written using `DataFrameReader.load(path) or DataFrameWriter.save(path)`. 
Example:
```
val myCredential = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> "...",
  "fs.azure.account.oauth2.client.secret" -> "...",
  "fs.azure.account.oauth2.client.endpoint" -> "..."
)
val df = spark.read.format("delta").options(myCredential).load()
df.write.format("delta").options(myCredential).save()
```
Note: There is a slight difference between Delta and Parquet: Delta will only pick up options starting with `fs.` to create the Hadoop Configuration object used to access the storage, while Parquet will pick up all options except `path`. We avoid picking up other options because:
- We don't want to affect reading/writing of Delta transaction logs.
- We don't want unrelated options to create a different DeltaLog entry in the cache for the same table.
GitOrigin-RevId: 2a5a34370d28ea2afce5eeff79e8723329474c83 commit 1f70855af3bf602a0813d1a0a00dbe907b37c55e Author: Yijia Cui Date: Fri Oct 22 16:24:44 2021 -0700 [SC-86082]Refactor Delta Source Some Refactor of DeltaSource GitOrigin-RevId: 92f1451baf5eab8c97091b97c2ec884d3734c618 commit 03bbac52084b0eaf972d2867f3c83a94c4900f61 Author: jackierwzhang Date: Fri Oct 22 13:19:45 2021 -0700 [SC-87514] Use display name for partition error As title. GitOrigin-RevId: 5d5256d10305627eeb802c6e122fd55763fd32bc commit c71962abc3ac74e94a4c89e57f08497207794491 Author: jackierwzhang Date: Fri Oct 22 10:34:51 2021 -0700 [SC-86725] Some refactors in DeltaErrors Some refactors in DeltaErrors code. GitOrigin-RevId: fa5cc897e316dbb86c3701c2529c4aba521de0f4 commit a4236949fa8ae724900a2a9cd3809e4486925146 Author: Bogdan Raducanu Date: Fri Oct 22 14:35:53 2021 +0200 Minor refactor of AnalysisHelper Minor refactor of AnalysisHelper.
GitOrigin-RevId: 6a3eb532e6871e60a4bfb6c52307166f668f72f3 commit 481021a588ab91c775c5912ed0629e95957ff1f0 Author: Scott Sandre Date: Thu Oct 21 15:10:31 2021 -0700 [SC-87924] Remove try and catches for deltaLog.checkpoint Removes 3 occurrences of `deltaLog.checkpoint` that was wrapped by a `try { ... } catch { case e: IllegalStateException } ` GitOrigin-RevId: 4821acd730f3ee7faeb54a7b02e16a84386aa497 commit 57f9f095aec62ee4b4a433b0b65823c5984c7b03 Author: Wenchen Fan Date: Thu Oct 21 21:00:15 2021 +0800 Further refactor of DeltaCatalog Small refactor of DeltaCatalog code. GitOrigin-RevId: 0c45940107a1b5024779032b2ce3acca678076ec commit 0f048ceaef965fb741b5dfb00a673b912baf4696 Author: Maciej Date: Wed Oct 20 14:38:30 2021 -0700 [DELTA-OSS-EXTERNAL] Fix DeltaTable._condition_to_jcolumn and DeltaTable._dict_to_jmap usage Correct `DeltaTable._condition_to_jcolumn` and `DeltaTable._dict_to_jmap` usage and convert both from `@classmethod` to `@staticmethod` Closes #801 Closes delta-io/delta#803 Signed-off-by: allisonport-db #29785 is resolved by allisonport-db/felza6a0 GitOrigin-RevId: 14904f67931796c8688331d746ad604d1f93257e commit fbcc3f35dac76c039fcc42b90955000d2a3589d8 Author: gurunath Date: Wed Oct 20 14:08:48 2021 -0700 [DELTA-OSS-EXTERNAL] Fix Workflow UI Bug Fixes Github Actions Workflow UI Bug mentioned here: https://github.com/delta-io/delta/pull/798#discussion_r730151656 Closes delta-io/delta#812 Signed-off-by: Shixiong Zhu #29887 is resolved by zsxwing/fmg4yagl GitOrigin-RevId: 220cef5eb73bc154263b8f40c2bd33b35591d51d commit 5d533b653cf0f54697167a5c667dcf5e983c8025 Author: jackierwzhang Date: Wed Oct 20 12:43:26 2021 -0700 [SC-86565] Enable name mode for Delta Column Mapping Enable name mode GitOrigin-RevId: 77c27722499bfef60762f9bdb92d9300f9c047e9 commit c1fce21fc2d972465db4b6ec41b2d8f956690d48 Author: jackierwzhang Date: Tue Oct 19 14:18:08 2021 -0700 [SC-86844] Refactor alterDeltaTableCommands Refactor schemaEqual method in alterDeltaTableCommands. 
GitOrigin-RevId: de64cae3c1a032c54bf2624e7aaf1897bdb12d09 commit 1732170e27bd077c54acbff0cac5317f5f6c9a24 Author: KamCheung Ting Date: Tue Oct 19 10:59:44 2021 -0700 [SC-87178] Refactor transaction commit timestamp Refactor transaction commit timestamp code path. GitOrigin-RevId: 6b51559a5292c5a42f2983ab7ac39e0a4a3e7faa commit b0cbaa8d36f404f795d97ef3dc9988f150009f00 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Thu Oct 28 13:58:23 2021 -0700 [DSW] [31] Copyright (#190) commit 3d24e481269994cb5e83573a44c295181b0682a2 Author: Alexandre Lopes Date: Mon Oct 18 19:29:28 2021 +0000 [DELTA-OSS-EXTERNAL] Clarify delta-log entries JSON format in PROTOCOL.md #762 Clarify the format for delta-log entries on the documentation to explain that every action is written as a JSON document and separated by a new-line character. Signed-off-by: Alexandre Lopes Closes delta-io/delta#808 Signed-off-by: Scott Sandre Author: Alexandre Lopes #29736 is resolved by scottsand-db/sx6n38n2. GitOrigin-RevId: 940f02c6b73a5a8f4e59e7bc42385cb6f79c3357 commit c5c7f68ed452a6376d60633c68a257130cbe47b7 Author: Shixiong Zhu Date: Fri Oct 15 20:30:34 2021 +0000 Minor refactor of GenerateSymlinkManifest Minor refactor of GenerateSymlinkManifest code. Author: Shixiong Zhu Author: Shixiong Zhu GitOrigin-RevId: ddda3069ae73932493ffdb639aae884ba443f97f commit bb22eb47f611d39c1276de33d4128c88f5e9e72a Author: jackierwzhang Date: Fri Oct 15 18:47:48 2021 +0000 [SC-85872] Block manifest generation in column mapping modes Block manifest generation in column mapping modes. 
Author: jackierwzhang GitOrigin-RevId: 6c6ef7bd44abfda4f95a5e06c74f9a8a29787c9d commit 613fece295282fe33e7e817ee94b1ef266cfe679 Author: Terry Kim Date: Fri Oct 15 07:24:23 2021 -0700 Minor refactor of DeltaCatalog Authored-by: Terry Kim GitOrigin-RevId: fd4781cec99156c7e1db09fd9b635128a3a8a84d commit 050fc868c8bb95b96ff5b76ccb4df507372aa6ad Merge: 72c3c00ef 78a3c9eef Author: Scott Sandre Date: Thu Oct 28 12:06:39 2021 -0700 update with master commit 72c3c00effb18f283cc81997972de48388eb9941 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Thu Oct 28 11:54:41 2021 -0700 [DSW] [30] scalastyle fixes (#189) commit ab2b4e23f6965cee6ffd9da2bf6b13aab4a18a22 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Thu Oct 28 11:52:22 2021 -0700 Minor fix in Literal Expression (#188) commit 1693e795a529a9648ede3288234f9ecb1b143208 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Thu Oct 28 11:51:16 2021 -0700 [DSW] [25] Add Java checkstyle tests (#179) commit a57ed24db49165ba54c0365a3c036772ff19e60f Author: Scott Sandre Date: Thu Oct 28 11:48:03 2021 -0700 merged with master commit 78a3c9eef5b308cb304b3f48d47f45be2a6151f1 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Thu Oct 28 09:20:55 2021 -0700 [#137] Shade jackson and json4s jars (#187) Closes #137 commit 12a89a96657a2482b5b526979c675270b303f248 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Mon Oct 25 16:40:52 2021 -0700 [DSW] [28] API Documentation (#184) commit 72814ba25c7800bda90655b6e0ab0ebf339c7e9e Merge: 8c5df277e 9c8f76eaf Author: Scott Sandre Date: Mon Oct 25 11:15:11 2021 -0700 merge with master commit 9c8f76eafefdb3ffbb2d2e5b7f08cfb0b71df5c9 Author: allisonport-db <89107911+allisonport-db@users.noreply.github.com> Date: Mon Oct 25 10:25:20 2021 -0700 Add more Expression functionality (#162) This PR extends the Expression functionality for the Delta standalone reader and writer. 
This includes: • Column and Literal support for partition data types • CastingComparator and Util support for valid comparison data types • PartitionRowRecord support for partition data types • In.eval() updated to follow Databrick's SQL semantics guide • ExpressionSuite testing • misc. changes & fixes commit 8c5df277e536092274f456eea71142cde36a90b8 Author: allisonport-db <89107911+allisonport-db@users.noreply.github.com> Date: Fri Oct 22 15:16:43 2021 -0700 structTypeToParquet converter (#171) This PR introduces a schema converter for our Java StructType to Parquet MessageType. The converter is implemented internally in Scala, and exposed through ParquetSchemaConverter.java in io.delta.standalone.util. Tests are copied over from Apache Spark. commit 75558ad1999b507e95640457c953e19770e8ac65 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Fri Oct 22 11:57:22 2021 -0700 [DSW] [29] Operation Parameters fix; document Operation metrics (#185) commit 54a28a99f6cc7a69004c6b68322d9c319842eeeb Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Fri Oct 22 11:50:15 2021 -0700 [DSW] [27] Misc fixes 5 - schema utils suite (#181) commit c569297fd5c4d12ab8b9e29a380c4535b9d7e8dd Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Fri Oct 22 10:29:20 2021 -0700 [DSW] [26] Misc Fixes 4 (#180) commit 764f7cbd8962708d509e3956e6867f39345247f9 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Wed Oct 20 14:12:10 2021 -0700 Add SchemaUtilsSuite; add two new public APIs to StructType (add); add test to OptTxnSuite (#178) commit 935f3036b4351cc64118896135beddfd8929897a Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Wed Oct 20 09:59:25 2021 -0700 [DSW] [23] Add logging (#177) commit 56f40e38329d728794452a414676161d737baea6 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Mon Oct 18 10:41:33 2021 -0700 [DSW] [22] Misc Fixes 2 (#176) 
commit 24e00ec38527cac973a286f1b470fb456ef5e17c Merge: 17c549805 ac9bc2fa9 Author: Scott Sandre Date: Fri Oct 15 11:28:54 2021 -0700 merge with master commit f5edc6f6b6d0b1c907e80813965e1afb53147c16 Author: Karen Feng Date: Thu Oct 14 17:54:13 2021 -0700 Update code for new missing column error introduced by Spark See https://issues.apache.org/jira/browse/SPARK-36943 GitOrigin-RevId: d5811840a29a66f75ccc3c9f2e1c4b59e9e80a1f commit 70a86a61b029ae5a32f7263826081417e7a2e38c Author: jackierwzhang Date: Thu Oct 14 20:39:26 2021 +0000 [SC-86598] Block CONVERT TO DELTA in column mapping modes Author: jackierwzhang GitOrigin-RevId: 138905e912d28e3523f948224fb99e7960058098 commit 1b75ac4007fe4f9ce85ed66a4c45d03fd7193b31 Author: Shixiong Zhu Date: Thu Oct 14 16:35:19 2021 +0000 [SC-87026][DELTA]Update status badges for OSS Delta Add the following status badges and remove the CircleCI one: - GitHub Actions - License - PyPI Author: Shixiong Zhu GitOrigin-RevId: bea2ce0a3405af2938e4761044f101bd45bf4c7e commit c586f9a7374923867c36f61df4ed133725c8df2c Author: Tom Lynch Date: Thu Oct 14 00:23:49 2021 +0000 [DELTA-OSS-EXTERNAL][BREAKING]Python: fix convertToDelta to return a delta.tables.DeltaTable Fix the return type of DeltaTable.convertToDelta in Python. 
GitOrigin-RevId: 4cd9bd00708a5cdd3c442bf9715ce9047d2b4893 commit 17c549805e7b600dafd8ff44acf3d86d7a9a45b3 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Fri Oct 15 08:58:45 2021 -0700 [DSW] [21] Misc Fixes 1 (#172) * added comments for future work on operation json encoded values * added commitInfo engineInfo name & version string * remove a todo; add reader, writer version to public java Action interface * remove a todo; throw proper exception in Checkpoints.checkpoint * remove a todo; use schema.toJson instead of schemaStr in OptTxnSuite * remove old comments; minor style change; add extra oss compatibility comparison utils * add tests to DeltaLogSuite * tidy up conversionUtils * refactor action builders to their own suite * cleanup with conversion utils; add conversion utils test * update engineInfo format (slash instead of dash); rename BuildInfo to package object commit 08b17e0fae5222179a52ee277f93d8ebc0029621 Author: Shixiong Zhu Date: Wed Oct 13 21:22:13 2021 +0000 [SC-86513][DELTA]Unify Delta code to use the same API to create Hadoop configuration This PR adds a new method `DeltaLog.newDeltaHadoopConf` and refactors Delta code to use this method to create Hadoop configuration. This is the prerequisite for supporting passing Hadoop file system configurations using DataFrameReader/writer options. Here is the list of the major changes in the PR: - Add `DeltaLog.newDeltaHadoopConf`. - Add a scala style rule to OSS Delta to ban `sessionState.newHadoopConf`. - Add a new method that accepts Hadoop Configuration for each method in log store Here is what I did: - Add a new hadoop configuration parameter to all LogStore methods. In other words, we add new methods and remove the old methods. - Fix all compiler errors and pass all tests. - Add back the old methods to maintain the compatibility but none of Delta code will call these old method. Existing tests. This PR should not change any behaviors in the code. 
Author: Shixiong Zhu GitOrigin-RevId: a9f61e060fa2588621b26be72725da7b0fc4d92f commit 5d8e259598bbb074792a93c44cb80659bde3100f Author: gurunath Date: Wed Oct 13 16:59:24 2021 +0000 [DELTA-OSS-EXTERNAL] Convert build to GitHub Actions This PR solves https://github.com/delta-io/delta/issues/792 Closes delta-io/delta#798 Signed-off-by: Shixiong Zhu Author: gurunath Author: Shixiong Zhu #29538 is resolved by zsxwing/nbqcxxr0. GitOrigin-RevId: 1b35ce70185ddfcdfca9992d695ffbe1de397e40 commit 0df41fb220ab590d9bb3dfae451b799d649b8994 Author: Ryan Johnson Date: Wed Oct 13 00:10:55 2021 +0000 [DELTA] StateCache now works with DataFrame in addition to Dataset[A]. This PR updates the `StateCache` helper class so that callers can access its underlying RDD as a `Dataset[A]` or as a plain `DataFrame`. The latter potentially allows for lower runtime overhead by avoiding the need to parse rows to scala objects, in case the calling code does not require it. While we're at it, split the vague and generic `state` method of `Snapshot` into `stateDS` (a `Dataset[SingleAction]`) and `stateDF` (a `DataFrame`), and update some of Snapshot's internal use sites to use the latter since they were already treating the `state` as a `DataFrame`. For now, all other use sites stick with `stateDS` (= unchanged behavior). Existing unit tests exercise this code. Author: Ryan Johnson GitOrigin-RevId: e8f798fed53422e6fac4fd0bd2ef2efe1067eb6a commit f4351289065a6e7e688c57572715135fa266432c Author: JassAbidi Date: Fri Oct 8 22:33:05 2021 +0000 [DELTA-OSS-EXTERNAL] Intercept a more specific exception for block read+append against append test Closes delta-io/delta#373 Signed-off-by: Shixiong Zhu Author: JassAbidi GitOrigin-RevId: 4c02a56764254dd482ad2f9cd2a8881271f70ef8 commit fdb84e1bb7fbf124f2f7ac8ecbd4e76340236c44 Author: Eunjin Song Date: Fri Oct 8 21:19:28 2021 +0000 [DELTA-OSS-EXTERNAL] Improve resolving references in MergeIntoCommand This PR fixes a perf issue in MergeIntoCommand. 
In `MergeIntoCommand`.`writeAllChanges`, `resolveOnJoinedPlan` applies `tryResolveReferences` for each column. However, `tryResolveReferences` calls `sparkSession.sessionState.analyzer.execute(newPlan)` on a fake logical plan, which is quite expensive. (ref: https://github.com/apache/spark/blob/38d39812c176e4b52a08397f7936f87ea32930e7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L247)
```
val newPlan = FakeLogicalPlan(Seq(expr), planContainingExpr.children)
sparkSession.sessionState.analyzer.execute(newPlan) match {
```
If a table has many columns (a few hundred to thousands), this takes a noticeable amount of time on the driver. I tested with 1000 columns and the following code took 22~25 seconds on both Spark 2.4 and Spark 3.1.
```
// call resolveOnJoinedPlan 4 times
val processor = new JoinedRowProcessor(
  targetRowHasNoMatch = resolveOnJoinedPlan(Seq(col(SOURCE_ROW_PRESENT_COL).isNull.expr)).head,
  sourceRowHasNoMatch = resolveOnJoinedPlan(Seq(col(TARGET_ROW_PRESENT_COL).isNull.expr)).head,
  matchedConditions = matchedClauses.map(clauseCondition),
  matchedOutputs = matchedClauses.map(matchedClauseOutput),
  notMatchedConditions = notMatchedClauses.map(clauseCondition),
  notMatchedOutputs = notMatchedClauses.map(notMatchedClauseOutput),
  noopCopyOutput = resolveOnJoinedPlan(targetOutputCols :+ Literal.FalseLiteral :+ incrNoopCountExpr),
  deleteRowOutput = resolveOnJoinedPlan(targetOutputCols :+ Literal.TrueLiteral :+ Literal.TrueLiteral),
  joinedAttributes = joinedPlan.output,
  joinedRowEncoder = joinedRowEncoder,
  outputRowEncoder = outputRowEncoder)
```
With this fix, it took less than 1 second.
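The commit message does not show the fix itself. As a rough illustration of why per-column analysis is slow, the Python sketch below assumes the analyzer pass has a fixed cost per invocation and compares one call per expression with a single batched call. `CountingAnalyzer`, `resolve_per_column`, and `resolve_batched` are hypothetical names, and batching is only one plausible way to reduce the number of analyzer runs, not necessarily what this PR does.

```python
from typing import List, Sequence


class CountingAnalyzer:
    """Hypothetical stand-in for an expensive analyzer; it only counts invocations."""

    def __init__(self) -> None:
        self.calls = 0

    def execute(self, exprs: Sequence[str]) -> List[str]:
        # Each invocation models one full (costly) analysis pass over a fake plan.
        self.calls += 1
        return [f"resolved({e})" for e in exprs]


def resolve_per_column(analyzer: CountingAnalyzer, exprs: Sequence[str]) -> List[str]:
    """One analyzer pass per expression: the number of passes grows with the column count."""
    return [analyzer.execute([e])[0] for e in exprs]


def resolve_batched(analyzer: CountingAnalyzer, exprs: Sequence[str]) -> List[str]:
    """A single analyzer pass over all expressions (assumed alternative, see lead-in)."""
    return analyzer.execute(list(exprs))
```

With 1000 column expressions, the first helper triggers 1000 passes and the second triggers one, while both return the same resolved list in this toy model.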
Closes delta-io/delta#797 Signed-off-by: Shixiong Zhu GitOrigin-RevId: 0f547f2da3fc1feb9b5a2581db366e26fc802940 commit 83780aeeadd67893ad69ed6481f7c6bce5be563c Author: Meng Tong Date: Thu Oct 7 20:48:43 2021 +0000 [SC-86171] Delta check constraint expression should go through ReplaceExpressions rule Before the fix, the expression inside check constraint only goes through analyzer rules. However, some rules in optimizer are necessary to evaluate a `RuntimeReplaceable` function. When these functions appear in Delta check constraint, the insert queries will fail to compile. This PR fixes the issue by adding a specialized optimizer for Delta check constraint to run a specific rule to make sure such functions are correctly handled. Tested with all expressions that implement `RuntimeReplaceable` trait. Author: Meng Tong GitOrigin-RevId: 6ab0090e9e8b45a577868e9ada4fd9344c850ee9 commit d888ba78ca29e788373d244a8322b51039644aa1 Author: jackierwzhang Date: Thu Oct 7 01:40:37 2021 +0000 [SC-85973] Misc bug fixes under column mapping ID mode Introducing misc bug fixes under column mapping id mode. 
Author: jackierwzhang GitOrigin-RevId: 9217d9adea40f75cdfdb6290e31d2b7516a33b63 commit c4ca3a8a8851be1efd833bf8e2f139f84f0fcdd1 Author: allisonport-db <89107911+allisonport-db@users.noreply.github.com> Date: Wed Oct 13 22:38:00 2021 -0700 Add DeltaConfigs to support table properties stored in Metadata.configuration (#164) commit 0737c5eb6e5215007aa0b2abaa8a6a113cb0b9a9 Author: jackierwzhang Date: Wed Oct 6 20:48:30 2021 +0000 [SC-85863] Store physical partition column names - Store physical names for partition columns Author: jackierwzhang GitOrigin-RevId: a486b0cc38af157d0e3c979708570b7d0307ab86 commit ec89fbb5b61f8f35890d7aa0b04c914c7b7e4439 Author: Prakhar Jain Date: Wed Oct 6 15:48:12 2021 +0000 [SC-78511][DELTA] Fix spelling typo Typo Comment only change Author: Prakhar Jain GitOrigin-RevId: cb1f31a94fd73fd0a3e20e2b75ca8563fb1f6422 commit ba5ba93955b8a84240fecd38157e4d039d03a246 Author: Lars Kroll Date: Fri Oct 1 17:01:44 2021 +0000 [SC-83630] Merge suite helper method Existing tests New helper method for executing merges Author: Lars Kroll GitOrigin-RevId: 89713e5c14c2cc67b907b4927de425b8ec4dbff7 commit 6b0b16ff1e5e6824d33572b36f5be05d583a5fef Author: Liwen Sun Date: Thu Sep 30 21:33:13 2021 +0000 [SC-85374] Should remove Delta column mapping metadata in HadoopFsRelation Delta column mapping metadata should be dropped before the table schema is passed on to downstream, because such metadata is not needed in Spark SQL and also query plan transformation is not very robust in terms of propagating metadata correctly. Specifically, we fix the place when creating HadoopFsRelation and DeltaTableV2. New unit tests Author: Liwen Sun GitOrigin-RevId: 707681a4f7023aa72d82f180fbe112ac4ad20379 commit 282fc689467a31938114e4e1e0d452e0bd6b0aaf Author: Burak Yavuz Date: Wed Sep 29 20:57:46 2021 +0000 [SC-83667] Add column mapping to Delta PROTOCOL.md Introduces column mapping details to the Delta protocol. 
N/A Author: Burak Yavuz Author: Burak Yavuz GitOrigin-RevId: fa9fc0c3d87333d281fad05eb5f113e4949580d0 commit 43d14226cc802d721d1683495cdc8511acf460a1 Author: Shixiong Zhu Date: Wed Sep 29 19:30:19 2021 +0000 [DELTA-OSS-EXTERNAL] Set the correct Hadoop configuration for checkpoint write Checkpoint write doesn't pass the correct Hadoop configuration. It may fail when users specify their storage credentials using Spark session configs. This PR fixes it and adds a test for it. Closes delta-io/delta#784 Signed-off-by: Shixiong Zhu Author: Shixiong Zhu #28914 is resolved by zsxwing/66uz7rxl. GitOrigin-RevId: cc42b868f2a3ecffca88424fb7465792dc22c7e6 commit ac9bc2fa990adb44f897b7bd55212a1496b1c6af Author: Shixiong Zhu Date: Wed Oct 13 12:15:40 2021 -0700 Update the build status to use GitHub Actions and add license badge (#174) commit 4a28a1125c4a3cacfbc3a87ed15a60e8ade9d5d9 Author: gurunath Date: Wed Oct 13 23:21:00 2021 +0530 Moving the Build to Github Actions (#173) Fixes #81 commit d69c76c84bd9d2c7b7dd323f186abeeecc4a5600 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Wed Oct 13 09:16:04 2021 -0700 [DSW] [18] OSS Compatibility Tests v3 (#167) commit d7f1e7823d98986d9fa17acf14e82d9e5ecfefdc Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Mon Oct 11 10:15:26 2021 -0700 [#125] Fix a bug in CloseableParquetDataIterator where a single empty file would stop iteration (#170) commit a248425596e271cfc4b9a93447ff9350cb894a5f Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Fri Oct 8 14:43:43 2021 -0700 [DSW] [20] Opt Txn Delta OSS tests v2 (#169) commit 8dc5830d4f606a25659871b768d89170bd1b2ac9 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Fri Oct 8 14:43:22 2021 -0700 [DSW] [17] LogStore write tests (#163) commit 9692cabf1eb8615981555f8af805428077debd13 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Fri Oct 8 12:20:15 2021 -0700 [DSW] 
[19] Delta OSS Opt Txn unit tests added to DSW (v1) (#168) * rename OptTxnSuite to OptTxnLegacySuite; add new OptTxnSuiteBase and OptTxnSuite; add OptTxn.metadata API; add no-arg Protocol constructor * add disjoint txn; disjoint delete/read tests * add disjoint add/read test * Update OptimisticTransactionSuite.scala commit 2228c8442e23261ad03954da1b4967755034a3c4 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Fri Oct 8 12:18:54 2021 -0700 [DSW] [13] Production Log Store (#157) commit a9e56819b241fa629f01fc5ed39b402d9011ff2d Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Thu Oct 7 17:05:04 2021 -0700 [DSW] [16] OSS Compatibility Tests v2 (#166) commit bca014e0386bad3c1115228d72521239eb70494a Author: Shixiong Zhu Date: Thu Oct 7 10:26:55 2021 -0700 Fix jackson incompatibility issue for Delta Hive 3 connector (#165) Found some jackson incompatibility issue when testing on an Amazon EMR 6.4.0. This PR fixes it by - Upgrade jackson-module-scala version - Upgrade json4s version - Shade jackson-module-scala to fix the incompatibility issue commit 7090c51f04ccb3c1bd342953f296173759003d3c Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Tue Oct 5 10:25:28 2021 -0700 [DSW] [12] Production Checkpoint API (#154) commit 92b986e638dcafadb6c5cdff85802650dd8be9e9 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Tue Oct 5 08:05:24 2021 -0700 [DSW] [15] DSW-OSS Compatibility Unit Tests framework prototype (#161) commit af9fd0e28a1500fd3c8047ca38855618d2523bfe Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Mon Oct 4 12:19:51 2021 -0700 [DSW] [11] Production updateMetadata (#152) commit ffa0c90ffde4016a2c258e9496ec1d005f22c64f Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Thu Sep 30 14:20:59 2021 -0700 [DSW] [14] OptimisticTransaction.markFilesAsRead returns a DeltaScan (#160) * first pass working; refactored 
DeltaScanImpl into Base and Filtered versions; fixed usages and tests * update scaladoc * added test suite; added equals method to some expressions; changed FilteredDeltaScanImpl constructor * introduce snapshot.scanScala interfaces; add scaladoc explaining difference between PartitionUtils.filterFileList and FilteredDeltaScanImpl * add test for expression equality * improve expression equality test case commit d37869c706f3a608bff25b37c2ce37e8189b85bf Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Wed Sep 29 16:44:16 2021 -0700 [DSW] [10] Production conflict checker (#150) * Update OptimisticTransactionSuite.scala * update conflict checker detect concurrent appends; add test; add StringType to expressions * add more concurrent append and concurrent delete read exception * added concurrent protocol update conflict check; add another dataChange=false concurrent append test * add concurrent metadata update check with test * add concurrent delete delete conflict check with test * add concurrent txn conflict check with test commit 645aa8a156ffac4e86d678b6ad4f4d91310b4c31 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Wed Sep 29 12:30:48 2021 -0700 Fix mima failure (#159) commit 2067d9e0a8860166515df0677de3f6f331d42b78 Author: Prakhar Jain Date: Wed Sep 29 02:14:18 2021 +0000 [SC-84123][DELTA] Refactor Delta Conflict detection code Author: Prakhar Jain GitOrigin-RevId: 1b7444fb800ea1ea7193a358f78cda4b1ce72922 commit 96ad33cb114225233405ea4db1ac5eec7ef4d8ea Author: Yijia Cui Date: Tue Sep 28 23:43:37 2021 +0000 [SC-76961][Delta] Minor refactor for schema change error in DeltaSource Author: Yijia Cui GitOrigin-RevId: 1cf5af9210f40199aa2c01a76aaeeb5de8cd44d6 commit 49f390347e9d2c392b0163a0e3c8c312d1b48d3b Author: Jose Torres Date: Tue Sep 28 23:30:55 2021 +0000 [SC-81537][DELTA] Minor refactor for the CDF conf Author: Jose Torres GitOrigin-RevId: 5f02309d27cf3fc783f5f63b8072d587c518f388 commit 
78b69fec1a3d5f442fffb444f5723340fb408529 Author: Wenchen Fan Date: Tue Sep 28 12:09:59 2021 +0000 [SC-85209] Use TableCatalog.PROP_EXTERNAL in DeltaCatalog Author: Wenchen Fan GitOrigin-RevId: 4d8bb95496918a9b96cd2e0976af0ee3b96d4ffe commit 7a3f1e8ec626e80880d524c2b897a969c8b4d63a Author: Shixiong Zhu Date: Mon Sep 27 16:59:30 2021 +0000 [DELTA-OSS-EXTERNAL] Move GCS write to a separate thread to workaround a GCS corruption issue We found a potential GCS corruption due to an issue in the GCS Hadoop connector: when writing a file on GCS, if the current thread is interrupted, GCS may close the partial write and upload the incomplete file. Here is a stack trace showing this behavior:
```
at java.io.PipedOutputStream.close(PipedOutputStream.java:175)
at java.nio.channels.Channels$WritableByteChannelImpl.implCloseChannel(Channels.java:469)
at java.nio.channels.spi.AbstractInterruptibleChannel$1.interrupt(AbstractInterruptibleChannel.java:165)
  - locked <0xc2f> (a java.lang.Object)
at java.nio.channels.spi.AbstractInterruptibleChannel.begin(AbstractInterruptibleChannel.java:173)
at java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:457)
  - locked <0xc3b> (a java.lang.Object)
at com.google.cloud.hadoop.util.BaseAbstractGoogleAsyncWriteChannel.write(BaseAbstractGoogleAsyncWriteChannel.java:136)
  - locked <0xc3c> (a com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$2)
at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
at java.nio.channels.Channels.writeFully(Channels.java:101)
at java.nio.channels.Channels.access$000(Channels.java:61)
at java.nio.channels.Channels$1.write(Channels.java:174)
  - locked <0xc3d> (a java.nio.channels.Channels$1)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
  - locked <0xc3e> (a java.io.BufferedOutputStream)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.write(GoogleHadoopOutputStream.java:118)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
  - locked <0xc3f> (a org.apache.hadoop.fs.FSDataOutputStream)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
```
This PR moves the GCS write in GCSLogStore to a separate new thread to work around the issue. Closes delta-io/delta#782 Signed-off-by: Shixiong Zhu Author: Shixiong Zhu #28761 is resolved by zsxwing/mm7ity34. GitOrigin-RevId: 5bfa936db1983a151da7d6e4f239a200f634690a commit 52716a44ba236ab3973f9ebfc0fcb0532b08c3a2 Author: Fred Liu Date: Thu Sep 23 03:01:36 2021 +0000 [SC-84984] Add protocol and metadata fields to VersionChecksum Author: Fred Liu GitOrigin-RevId: c2c3d86b6360f5c2643fb688f133893486c2b31f commit 0c9d9f0979f14e249e853d2fc4386d8c96486ddd Author: Flavio Cruz Date: Thu Sep 23 01:30:18 2021 +0000 Update to Scala 2.12.14 for Spark 3.2 Author: Flavio Cruz GitOrigin-RevId: da96d97a7011bebeab848381f8181b1ea478aff7 commit 0f872432268c7526c007d3a3634cebd330546dec Author: Junyong Lee Date: Fri Sep 17 15:21:17 2021 +0000 [SC-85245] Minor style change in Snapshot Author: Junyong Lee GitOrigin-RevId: c32bb347a191751ef3ffeccd9b3c9b94c01a9a3c commit 00a71f225530223515e624df9b254e884a70386a Author: Yijia Cui Date: Fri Sep 17 06:16:53 2021 +0000 [SC-85058][Delta][WARMFIX]Add a flag to disable the constraint check for replaceWhere Add a flag to disable the constraint check for replaceWhere Add unit test.
Author: Yijia Cui GitOrigin-RevId: 33e4b5ce32994b937f6ec1f8399a3a099e5c1d34 commit c44bf631d96b0819aa0f411a1ac834a6aad11083 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Wed Sep 29 10:38:08 2021 -0700 [DSW] [9] Production commit API (#149) * add Expression::references() func; implement OptimisticTxn::markFilesAsRead(); isBlindAppend unit tests * Update OptimisticTransactionSuite.scala * whitespace javadoc changes commit 9d6ad509ed95bd8f07bae1428f66228c1ff205e1 Merge: 23894b04d 8e5c79c41 Author: Scott Sandre Date: Wed Sep 29 09:57:39 2021 -0700 update with master commit 23894b04dab196b273b502d6afa543068b8d000e Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Wed Sep 29 09:12:25 2021 -0700 [DSW] [7] checkpoint prototype v2 (with tests; with metadata cleanup) (#148) commit 8e5c79c41b5fcb6f5cb56073c802ced9daedf145 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Tue Sep 28 15:47:44 2021 -0700 Add Snapshot.scan API (#156) commit 91a2b1bb4c66093a9a7058a4a25e0b25741ca7b8 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Mon Sep 27 10:08:04 2021 -0700 [DSW] [6] updateMetadata() prototype with tests (#147) * implement updateMetadata() function with tests * remove isCommitLockEnabled commit 544ff2f9707c0089f583bfa74a8fc238da3732af Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Wed Sep 22 14:04:17 2021 -0700 [DSW] [5] updated internal LogStore (#146) * refactor ReadOnlyLogStore to LogStore; add test suite commit 2f0a578e3cadd1d2a2187b99c21655dd6305d880 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Tue Sep 21 13:46:03 2021 -0700 [DSW] [4] Improved commit API prototype with tests (#145) commit 48ee239b3ddfbf1034d672439cc20c87862382fb Author: allisonport-db <89107911+allisonport-db@users.noreply.github.com> Date: Tue Sep 21 11:44:55 2021 -0700 Add FieldMetadata to StructField (#130) Add column metadata to the 
schema via a new type FieldMetadata commit d2fcc2adf15120ea7477c2424478763c9f658465 Author: Shixiong Zhu Date: Tue Sep 21 09:52:54 2021 -0700 Update CONTRIBUTING.md commit 2c7ac31355e3f7dd1fb33e46409a14fd18bb4359 Author: Denny Lee Date: Tue Sep 21 09:52:32 2021 -0700 Update CONTRIBUTING.md (#153) Update connectors to match delta-core contributing.md commit 6e1b1ec4d368fe33ea3510951b589f30ec53ac3f Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Tue Sep 21 08:58:49 2021 -0700 [DSW] [3] Conflict Checker Prototype (#144) commit b0f4f69f585cdacf4be4f907969afd146a1f77b4 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Mon Sep 20 08:27:52 2021 -0700 [NEW] [2] Checkpoint Prototype (#142) * got basic checkpointing implementation and test working commit fb0a0ec6b36db1074bf66ec775922ab70a85a928 Author: Gerhard Brueckl Date: Mon Sep 20 15:34:34 2021 +0200 Support ADLS Gen1 and all simple data types (#107) * add PowerBI connector * add parameter to buffer the binary file before reading it to mitigate streaming errors * update global readme to include reference to the Power BI connector * fix issue with online refresh * add inline-documentation update readme.md add *.pbix to .gitignore * - added support for ADLS Gen1 and recursive load of folder content in general (option IterateFolderContent) - better support for data types - now all simple data types are supported and cast correctly * fix docs/sample for ADLS Gen1 * - add support from TimeZoneOffset - fix issue with special characters in column names * update README.md Co-authored-by: gbrueckl commit 8279d6a61bf45296efb92e9dd7244f48d89bc242 Author: Yann Byron Date: Mon Sep 20 09:31:22 2021 +0800 Support for Hive 3 (#36) (#151) commit db66ca4eb1da1f49b65376aad8614e5b0423d33b Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Fri Sep 17 11:23:52 2021 -0700 fix circle CI errors for delta standalone writer feature branch (#143) commit
1de2365581966f9132aef44687f29ad7dcbd7b05 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Fri Sep 17 10:10:11 2021 -0700 [NEW] [1] Commit Prototype & Skeleton (#141) * Rebased off of 0_API branch; started skeleton; WIP on scala impl of OptimisticTransaction * finish most of txn.prepareCommit * implement verifyNewMetadata; add comments for AnalysisException throughout code; add some Schema utils * added some comments for future tests * finished (mostly) commit and doCommit functions; added test comments as future work; TODO log store refactor * added HDFS Log Store write impl * add conversion utils from java actions to scala actions * starting to write a simple test; fixed java api Collection.unmodifiable.. null error (but not all); WIP fixing CommitInfo operationParams JSON escaped value bug * got a very basic test working * minor comment changes * clean up code after rebase with dsw_prototype_0_API * remove DeltaOperations scala class * removed unnecessary StandaloneHadoopConf vals * rename writerId to engineInfo; fix failing scalastyle * remove schemaStr from Metadata constructor and use schema.toJson instead * empty commit; PR isn't updating with previous commit commit 479a22bfcc4136a9c93a4f6ddd4192e7078605bd Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Fri Sep 17 09:32:49 2021 -0700 [NEW] [0] API Prototype (#140) * Expression framework prototype (#37) * WIP; have a basic skeleton for expressions; have a column resolver; todo comparator * WIP; figuring out comparators * finished first pass at a basic skeleton, using just Ints and Booleans Literal types * add leaf expression; make expression and predicate both ABCs; add children() and bound() methods * add nullSafeBoundEval method to BinaryExpression * add verifyInputDataTypes function to Expression * big refactor; add DataType to Column constructor; no more need for 'bound' checks; use nullSafeEval; refactor Comparator usage; short circuit null checks in 
eval * rename createColumn to column * Update ExpressionSuite.scala * add IsNotNull predicate; test; null check to Column::eval * make Expression interface; add back Predicate interface with default dataType field; make more member fields final * add Not expression; add nullSafeEval to UnaryExpression; add test for Not expr * added interfaces * add RowRecordBuilder; remove duplicate ClosableIterator * add newline to LogStore.java * update interface for OptimisticTransaction with javadoc * update DeltaLog; remove RowRecordBuilder; remove RowRecord build interface * update Operation; add writerId to CommitInfo.java * minor comment change * update javadoc for CommitResult and OptTxn * fix typo * add 2 new public methods to OptTxn interface * add asParquet method to StructType * Update Operation.java * rename writerId to engineInfo * respond to PR comments; fix Operation enum str; remove StructType asParquet; fix LogStore version commit 17c00fb873dffc6cdaad7e2e84cd2b67d84ce95c Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Fri Sep 17 09:29:43 2021 -0700 Expression framework prototype (#37) (#138) * WIP; have a basic skeleton for expressions; have a column resolver; todo comparator * WIP; figuring out comparators * finished first pass at a basic skeleton, using just Ints and Booleans Literal types * add leaf expression; make expression and predicate both ABCs; add children() and bound() methods * add nullSafeBoundEval method to BinaryExpression * add verifyInputDataTypes function to Expression * big refactor; add DataType to Column constructor; no more need for 'bound' checks; use nullSafeEval; refactor Comparator usage; short circuit null checks in eval * rename createColumn to column * Update ExpressionSuite.scala * add IsNotNull predicate; test; null check to Column::eval * make Expression interface; add back Predicate interface with default dataType field; make more member fields final * add Not expression; add nullSafeEval to 
UnaryExpression; add test for Not expr commit f55281abecdf51e8945e10c8b60e04002ef1b41c Author: Florian Valeye Date: Tue Sep 14 04:31:28 2021 +0000 [DELTA-OSS-EXTERNAL] Add decimal data in the primitive type in the specification of the Delta Transaction Protocol The Decimal primitive type is missing in the Delta Transaction Protocol specification. Closes delta-io/delta#608 Signed-off-by: Liwen Sun Author: Florian Valeye #28269 is resolved by liwensun/jg2t3ugf. GitOrigin-RevId: 1f98052d67b421ec79f53a8c30e9483f4245b9e2 commit a37ebc59814ef3bd7606b0e5afde7b928ca8f51f Author: jackierwzhang Date: Fri Sep 10 17:49:30 2021 +0000 Add column mapping metadata uniqueness check and some other bug fixes in test suite... Add uniqueness check for column mapping metadata so we can fail early. Author: jackierwzhang GitOrigin-RevId: 19ef985d5ea2676f5f492894fcf1f378f784f272 commit fc952851f78dc4cadd87bd71ddd3e664960da9bc Author: Florian Valeye Date: Thu Sep 9 05:14:33 2021 +0000 [DELTA-OSS-EXTERNAL] Add Option to the deletionTimestamp from the Remove Action in the PROTOCOL The deletionTimestamp in the Remove action is an Option: https://github.com/delta-io/delta/blob/master/core/src/main/scala/org/apache/spark/sql/delta/actions/actions.scala#L319 Closes delta-io/delta#775 Signed-off-by: Liwen Sun Author: Florian Valeye #28066 is resolved by liwensun/iowjzi2k. GitOrigin-RevId: 684126ff06bb3daea309e007085d3d13ed302bb6 commit 8410a20d4699af414ea12b983b2f763f849fbd55 Author: allisonport-db <89107911+allisonport-db@users.noreply.github.com> Date: Mon Sep 13 13:09:42 2021 -0700 Builder classes for action classes (#123) add builders for action classes & corresponding tests commit fb0452c2fb142310211c6d3604eefb767bb4a134 Author: Yijia Cui Date: Thu Sep 2 18:01:08 2021 +0000 Audit Delta Commands on Temp Views Audit Delta Commands on Temp Views to see if they can run on temp views or not. For update, delete, and merge, known bugs should be caught with new tests. Unit test only PR. 
Author: Yijia Cui GitOrigin-RevId: 07e6643a6a6bab8ddb07a7ab11303c98011ba7c0 commit 915fd701b42bdae99e1cddb4b75a0260c9edb85f Author: Burak Yavuz Date: Wed Sep 1 19:11:32 2021 +0000 Partition pruning support for column mapping This PR adds support for partition pruning for Delta tables that leverage column mapping. We use name mapping for partition pruning even if the table leverages `id` mapping when reading the files. Unit tests + will add more Author: Burak Yavuz GitOrigin-RevId: cd9fc02ea959437ed98cde86c6e0c245fb84abe6 commit 549d4c0669f6b4620cac9f5635c18c9ec194e8ad Author: ericfchang Date: Wed Sep 1 17:48:31 2021 +0000 Minor improvement to SchemaUtils. Unit tests. Author: ericfchang GitOrigin-RevId: 6d8c212dc24fc2cfb137cef762cd2450bd1dbc69 commit a4dd4e0f83d2bf2d8de3c7a295819fbb62ea454f Author: Yijia Cui Date: Wed Sep 1 07:21:31 2021 +0000 Support Array-of-Struct in MergeIntoCommand Schema Evolution Enhanced MergeIntoCommand Schema Evolution to support Array-of-Struct, by applying the existing transformation function for struct schema evolution into array elements. `UPDATE` and `INSERT` now can resolve struct fields by name, cast each struct element in the source column array to the struct type defined for the target column, filling additional / missing fields in the source/ target with nulls. Before this change, we just do a Spark cast to cast source array elements into target array struct type, which resolves struct fields by position. Unit tests Before this change, we just do a Spark cast to cast source array elements into target array struct type, which resolves struct fields by position. If struct fields are out-of-order, or contains additional columns or miss any columns, Spark can't cast and will raise an exception. Now, `ArrayType>` schema evolution is supported in Delta's Merge command. The resulting target array element type can now be evolved from the source array element type and target array element type. 
And elements in the source array and the target array will be cast into the resulting array element type. For `UPDATE`, the resulting array will completely replace the matched array in the target. Schema Evolution will support the following cases for array of structs:

1. Struct fields can now be resolved by name, and cast into the corresponding data type in the target.

**Example**:

Source `array>`, target `array>`. The source array will be evolved into `array>`.

2. If there are more fields in the target array, the structs in the source array will be cast to the target array's struct type, with the missing fields filled with null.

**Example:**

Source `array>`, target `array>`. The source array will be evolved into `array>` with `c` filled in with null.

3. If there are more fields in the source array, the structs in the target array will have the additional fields appended to the end and filled with null.

**Example:**

Source `array>`, target `array>`. The resulting target array type will be `array>`. The source array will be cast to `array>`, and the existing elements in the target array will be cast to `array>`, with the additional fields filled with null.

4. If both the source array and the target array have additional fields that don't exist in the other array, the resulting target array type will retain its existing fields and append the non-overlapping fields from the source to the end, with those source-only fields filled in with null. The source array values will be cast to the resulting data type, with target-only fields filled in with null.

**Example:**

Source `array>`, target `array>`. The resulting target array type will be `array>`. The source array will be cast to `array>`, and the missing field, which is `o` in this example, will be filled with null. The target array will cast the existing array elements to `array>` and fill the additional field, which is `z` in this example, with null. Author: Yijia Cui GitOrigin-RevId: 5358cc412ee8e1ba034119cf436ce41eeda531da commit c4b596b8de3bcb191e947839474f9043eb94ddc4 Author: jackierwzhang Date: Wed Sep 1 07:16:11 2021 +0000 Column addition schema evolution in new Delta protocol Enabled column schema evolution in ID mode (although I feel like the Name mode will just work too). Unit tests Author: jackierwzhang GitOrigin-RevId: eeb06e929a26467276092a368426871229c47a30 commit 91caceba56c081c1c27d4ecd3d395644b364b4b1 Author: Prakhar Jain Date: Wed Sep 1 05:53:45 2021 +0000 Add tags field to CommitInfo action This PR adds "tags" field in CommitInfo action. Added UTs. Author: Prakhar Jain GitOrigin-RevId: 41c5e6d115a881fe1b0931259404ac7288549b46 commit d33eb94e96141d0288f623e166a025863108adb1 Author: Scott Sandre Date: Tue Aug 31 23:00:47 2021 +0000 Minor refactor to SetTransaction Author: Scott Sandre GitOrigin-RevId: 1e281d6753d762eccea54b080d864913d5d2b3c3 commit 6382a360bde6fa38d3d53a463e8c3620bda3a27b Author: Prakhar Jain Date: Tue Aug 31 14:33:24 2021 +0000 Make output of Describe Delta History command independent of CommitInfo schema Currently the output of "DescribeDeltaHistoryCommand" depends on CommitInfo's schema. This PR tries to make describe delta history independent of CommitInfo, so that a change in CommitInfo's schema doesn't affect the output of DescribeDeltaHistoryCommand. Added UT.
Author: Prakhar Jain GitOrigin-RevId: 8a56e70189c57d4ccd3cb8377b9f7c145f3a963c commit 98a99f207dc4aba1322328b645a64c44b0a322d0 Author: Prakhar Jain Date: Mon Aug 30 23:40:37 2021 +0000 Fix a minor issue in OptimisticTransactionSuiteBase that doesn't use `_delta_log` for tests Added UTs Author: Prakhar Jain GitOrigin-RevId: fc5efffd1fe2ca62d41d37b15c0ca5fe23007cf5 commit 3874e0102a9e262986b69a9cdcdbe834282e048f Author: Alex Jing Date: Mon Aug 30 18:44:41 2021 +0000 Revert "Add tags field to CommitInfo action" Author: Alex Jing GitOrigin-RevId: 36682ff72c7d9c4b0235887ed32481200c90fc14 commit d90b4c56376a23816f63d003c1f77ff44ec00108 Author: Liwen Sun Date: Sat Aug 28 03:18:42 2021 +0000 Add physical name support in Delta name mapping Add physical name mapping new unit tests Author: Liwen Sun GitOrigin-RevId: d10fa7f1738164bca6cd5c58b3b185a92dd54763 commit bd02a2b1441ffa46ae2b442ff27366a32bbb8b94 Author: Pranav Anand Date: Fri Aug 27 22:32:12 2021 +0000 Minor refactor to SchemaUtils and Fix an issue writing to Delta where we were failing on mismatched nested schema metadata Author: Pranav Anand GitOrigin-RevId: c6e95b3175c3d4abb29ff12fcea6e51188c784fa commit caeb6a1b9e7afb810b52b0e307fea0ba413eec14 Author: Prakhar Jain Date: Fri Aug 27 21:35:29 2021 +0000 Add tags field to CommitInfo action Author: Prakhar Jain GitOrigin-RevId: 80914eaa2c6dbd5f25c834d181c3bf8dee6f8ba8 commit 50b8102cba4b603a5bc04c88173da94f3dfd9485 Author: sabir-akhadov Date: Fri Aug 27 12:57:09 2021 +0000 Fix AddFile insertion time to be in microseconds. Author: sabir-akhadov GitOrigin-RevId: 18c69626b67955e2724e96dac8386b44547e1366 commit ed04b88be072a897daab3dc4193062ee95a0a029 Author: Liwen Sun Date: Fri Aug 27 05:19:19 2021 +0000 Support Column Mapping in Delta Protocol Support Delta column mapping mode and bump delta protocol. 
new unit tests Author: Liwen Sun GitOrigin-RevId: b0ba53b0ff5b79b2f98605f0c06320ea12dd4a85 commit ef8d2676ffad4d07aad4f4fc91d65e288839e7e0 Author: jackierwzhang Date: Fri Aug 27 04:59:11 2021 +0000 Mini refactoring in ConvertToDeltaCommand Author: jackierwzhang GitOrigin-RevId: 5e3277225a3e006ca4032d17f2fae858c89cebc1 commit 4359484368b4a06c32b663826ac60bf12d9e8025 Author: Jarred Parrett Date: Thu Aug 26 17:40:44 2021 +0000 [DELTA-OSS-EXTERNAL] Fix an issue in Merge/Update/Delete when the table path contains special chars getTouchedFiles incorrectly calls toUri which will escape the table path when it contains special chars. This causes Merge/Update/Delete not able to find matches files in this case. This PR removes toUri to fix the issue and adds a test to confirm the fix. Fix #724 Closes delta-io/delta#741 Signed-off-by: Shixiong Zhu Author: Jarred Parrett GitOrigin-RevId: c8e6563fc48e82b86a00b58910eba8983fc5f933 commit cce4a35d0d9ceb1874c73043829576efcf8c2f09 Author: Burak Yavuz Date: Thu Aug 26 16:18:29 2021 +0000 Mini style change in DeltaLog Author: Burak Yavuz GitOrigin-RevId: 00a656853c1c5741633f86a6ce8c40b51f564645 commit ca798e65da206d829da0921cc39563570cf9824b Author: Prakhar Jain Date: Thu Aug 26 03:08:33 2021 +0000 Add serialization/deserialization tests for all Delta actions Author: Prakhar Jain GitOrigin-RevId: 808f4e4c67547866a43f7cdd5c913eac0c406de5 commit 9fc6c4fe0231528d5589e59d6dc4ad2413d8bd7f Author: Max Gekk Date: Wed Aug 25 13:09:49 2021 +0000 Reverts Minor Refactoring in DeltaLog Author: Max Gekk GitOrigin-RevId: 691dd3ad83e73aa01266514d0b1142b093db3df0 commit f55b0e5d94a76e4843309544dd567b78eac763bb Author: Burak Yavuz Date: Wed Aug 25 03:54:07 2021 +0000 Mini style change Author: Burak Yavuz GitOrigin-RevId: 5cb32f5b979fb735a9fd6fc1c0abb9de37583851 commit 4e1c53c6984ba7d56fd2a0d9fe27ac2573df27ea Author: Zach Schuermann Date: Tue Aug 24 20:11:44 2021 +0000 Fix AnalysisException on multiple insert clauses in merge With schema evolution on, 
the following merge query fails with: ``` Error in SQL statement: AnalysisException: There is a conflict from these SET columns: [`c`, `c`]. ``` For tables: ```scala spark.range(2).selectExpr("id a", "id b").write.format("delta").saveAsTable("t") spark.range(10).selectExpr("id a", "-1 c").write.format("delta").saveAsTable("s") ``` ```sql MERGE INTO t USING s ON t.a = s.a WHEN MATCHED AND s.a = 0 THEN UPDATE SET t.b = s1.c WHEN NOT MATCHED AND s.a = 7 THEN INSERT * WHEN NOT MATCHED AND s.a = 8 THEN INSERT * ``` This PR simply adds a `.distinct` call during `PreprocessTableMerge` when processing matched clauses; then when we calculate the columns that are present in the insert clauses but not present in the target/update, we aren't creating duplicates (which caused the above error). The above scenario is replicated in a test in `MergeIntoSuiteBase`. Author: Zach Schuermann GitOrigin-RevId: 4ca1ade59953221611a560c5d6b00bf047720303 commit 27039b02c0524fb1bec76342a44d53022669dd1c Author: Prakhar Jain Date: Tue Aug 24 00:30:33 2021 +0000 Minor Refactoring ConflictChecker and OptimisticTransaction Author: Prakhar Jain GitOrigin-RevId: cf922e476d6d2bc763f2e7add17de11534f95ee6 commit dfd4433c02bfcc02ac9c6e18ba6d78d870bc12a6 Author: jackierwzhang Date: Mon Aug 23 16:54:22 2021 +0000 Minor Refactoring in ConvertToDeltaCommand Author: jackierwzhang GitOrigin-RevId: ebbc2703467f10acf91d347eaf4de196319dd5d0 commit ed728326cb9cc298f7ad765ca61e70822ff3e12c Author: Lars Kroll Date: Fri Aug 20 11:57:41 2021 +0000 Minor Refactoring in operation metrics Author: Lars Kroll GitOrigin-RevId: a16207d06f1eb9de681616b8d473717466aff563 commit a6e64ca2aa3ce6a114cae8b0f9342a3d95559478 Author: Prakhar Jain Date: Wed Aug 18 20:58:56 2021 +0000 Fix incorrect json serialization of Delta RemoveFile Author: Prakhar Jain GitOrigin-RevId: 0396e54ede9ccad85e51899b519b00843207157a commit da6de9781484e0ce90e9d2f37fcfd1d85ee42dd3 Author: Yann Byron Date: Fri Aug 27 02:20:48 2021 +0800 remove spark 
dependencies for hive subproject (#116) * remove spark dependencies for hive-subproject (#115) * responded to PR comments Co-authored-by: Yann commit 43bb0093a26de0bfbea06ded4191cb5bb633914c Author: allisonport-db <89107911+allisonport-db@users.noreply.github.com> Date: Wed Aug 25 15:05:35 2021 -0700 Add toJson method to DataType.java (#119) This PR adds support for conversion to json strings in `DataTypeParser.scala` and exposure via a `String toJson()` method in `DataType.java.` Includes added tests in `DeltaDataReaderSuite.scala.` commit 4d405ff32fe7622add1c1f43abae850e065322ec Author: Burak Yavuz Date: Mon Aug 16 23:21:45 2021 +0000 [SC-83112] Minor refactor in DeltaAnalysis Author: Burak Yavuz GitOrigin-RevId: 6d973f32d0f22426ce03e72feae1c57f2066a3a2 commit 47bd54898178d80e0f479163598f0c63c0910007 Author: junlinzeng-db Date: Mon Aug 16 21:50:55 2021 +0000 [SC-82801] Add owner field for delta v2 table. Add owner field for delta v2 table if its dsv1 table has owner field specified. Author: junlinzeng-db GitOrigin-RevId: 4311e6f154eba3db28e01a6387838fcd10fe7dea commit 4d86594990a1bc152dd6dc4a22854d715a3f7206 Author: Shixiong Zhu Date: Mon Aug 16 19:38:11 2021 +0000 [DELTA-OSS-EXTERNAL] Fix the path doc for AddFile and RemoveFile Mention that the path is encoded. Closes delta-io/delta#567 Signed-off-by: Shixiong Zhu Author: Shixiong Zhu #26692 is resolved by zsxwing/n3f7bwsv. 
GitOrigin-RevId: 8eb5399c84bcdfeeac654976821b9e4ea8422206 commit f03a475b4ff246ab000406a447395ad3f56d76c3 Author: Liwen Sun Date: Mon Aug 16 16:31:26 2021 +0000 [SC-81796] Refactor ConvertToDeltaCommand Author: Liwen Sun Author: Burak Yavuz GitOrigin-RevId: 4a3fe689f031e611e9d5f6df71469be1cc800c6d commit ae99d7190ad0f74b5126b1a4be7eddae1c7d046f Author: Burak Yavuz Date: Fri Aug 13 05:20:52 2021 +0000 [SC-76695][Delta]Arbitrary replaceWhere Support Enhanced `replaceWhere` option to selectively overwrite only the data that matches arbitrary predicates over arbitrary columns by writing out the new data first and then deleting the old data matching the predicates. The data written out is guaranteed to meet the `replaceWhere` criteria by InvariantChecker. To delete matching data, refactored `DeleteCommand` to return the list of actions instead of just committing a transaction. Added a feature flag `REPLACEWHERE_DATACOLUMNS_ENABLED`, which is set to true by default, enabling the new feature. If the flag is disabled, then replaceWhere behavior will fall back to replace on partition columns only. Author: Burak Yavuz Author: Yijia Cui GitOrigin-RevId: 089f903529fc65596b1b525b8fed4db403de1292 commit 625de3b305f109441ad04b20dba91dd6c4e1d78e Author: Liwen Sun Date: Sat Aug 7 05:02:19 2021 +0000 [SC-80723] Refactor ConvertToDeltaCommand Minor refactor ConvertToDeltaCommand by moving logic of createAddFile into ConvertToDeltaCommand object. 
Author: Liwen Sun GitOrigin-RevId: 1bc9b19dbd06b2b4288d6725e51aa81bec50cfd4 commit 83a733101ecaa473d646d09e713579644600696f Author: Rahul Mahadev Date: Thu Aug 5 19:28:24 2021 +0000 [SC-82137][DELTA] Minor refactor in GeneratedColumns Minor refactor in GeneratedColumns Author: Rahul Mahadev GitOrigin-RevId: a74362dcb6b6868eaf85aa28cdfadfb7cda715b0 commit 152c83702e85d81359b41faefaec1c0462211943 Author: Prakhar Jain Date: Thu Aug 5 17:03:25 2021 +0000 [SC-82097][DELTA] Improve Delta ConflictChecker and OptimisticTransaction Improve Delta ConflictChecker with commit information and refactor OptimisticTransaction with CurrentTransactionInfo. Author: Prakhar Jain GitOrigin-RevId: f32ab3eff00aa1db2ec9b1d3a99b371596d3eb6b commit 27355b7dd9905573a43b35cebf780fa5c5d1784d Author: Yuyuan Tang Date: Thu Aug 5 01:57:27 2021 +0000 [SC-81968] Minor refactor in DescribeDeltaDetailsCommand Minor refactor in DescribeDeltaDetailsCommand. Author: Yuyuan Tang GitOrigin-RevId: c97bcbaa53797ff68d9e71e9b401fbcc8b1f138f commit a23e305c13156fef5a66a7554c8e7381734352de Author: YuXuan Tay Date: Wed Aug 4 17:32:12 2021 +0000 [DELTA-OSS-EXTERNAL] Relax importlib_metadata version required The current requirements for `importlib_metadata` is too strict (`importlib_metadata>=3.10.0`) and has led to it conflicting with versions required by other python libraries such as `apache-airflow` (`importlib_metadata==1.7.0`). Given that the only feature used from `importlib_metadata` is the `version()` function, and the behaviour of this function has not changed from version `1.0.0`, it is safe to relax the version required to improve compatibility with other libraries. Thank you. Changes - Relax `importlib_metadata` version required from `>=3.10.0` to `>=1.0.0` Closes delta-io/delta#709 Signed-off-by: Scott Sandre Author: YuXuan Tay #25854 is resolved by scottsand-db/b5cu40hu. 
GitOrigin-RevId: 20976a9ce0b7a321134cc53e5b271d4c98e2ed2c commit 1a4a0f15fbe906e807c5fd9f6067b1683116eedc Author: Wenchen Fan Date: Wed Aug 4 01:32:29 2021 +0000 Use UnresolvedTable in some Delta commands Refactor by using using UnresolvedTable in some Delta commands. Author: Wenchen Fan GitOrigin-RevId: 28e324fa1f8861ceed5402cb7674791e9e7d539c commit 904cb714f1cafd4fb43c5832d63e025c414e8983 Author: Meng Tong Date: Mon Aug 2 20:57:03 2021 +0000 [SC-81127] Minor refactor in DeltaLogging Minor refactor in DeltaLogging Author: Meng Tong GitOrigin-RevId: 929dd02a1f2fc89624f48fb15d2b31c0461410cf commit dfd7bcc0cda1409827a7f83519df1619272d7b2a Author: Scott Sandre Date: Fri Jul 30 20:00:59 2021 +0000 [DELTA-OSS-EXTERNAL] Pypi packaging automated test Create an automated test so that the pypi delta package is installed and used in python tests every commit. I changed the `version.sbt` version to be `version in ThisBuild := "1.1.0-SNAPSHOT-TEST"` (and did not commit those changes) That version (with the `-TEST` appended to it) was then used when executing `using_with_pip.py` ``` Processing ./dist/delta_spark-1.1.0_SNAPSHOT_TEST-py3-none-any.whl Collecting pyspark<3.2.0,>=3.1.0 Using cached pyspark-3.1.2-py2.py3-none-any.whl Requirement already satisfied: importlib-metadata>=3.10.0 in /usr/local/lib/python3.9/site-packages (from delta-spark==1.1.0-SNAPSHOT-TEST) (4.6.1) Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.9/site-packages (from importlib-metadata>=3.10.0->delta-spark==1.1.0-SNAPSHOT-TEST) (3.5.0) Requirement already satisfied: py4j==0.10.9 in /usr/local/lib/python3.9/site-packages (from pyspark<3.2.0,>=3.1.0->delta-spark==1.1.0-SNAPSHOT-TEST) (0.10.9) Installing collected packages: pyspark, delta-spark Successfully installed delta-spark-1.1.0-SNAPSHOT-TEST pyspark-3.1.2 Test command: ['python3', '/Users/scott.sandre/delta/examples/python/using_with_pip.py'] :: loading settings :: url = 
jar:file:/usr/local/lib/python3.9/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml Ivy Default Cache set to: /Users/scott.sandre/.ivy2/cache The jars for the packages stored in: /Users/scott.sandre/.ivy2/jars io.delta#delta-core_2.12 added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent-0b707f56-f34e-4a63-9008-b9090b7f024c;1.0 confs: [default] found io.delta#delta-core_2.12;1.1.0-SNAPSHOT-TEST in local-m2-cache found org.antlr#antlr4;4.8 in local-m2-cache found org.antlr#antlr4-runtime;4.8 in local-m2-cache found org.antlr#antlr-runtime;3.5.2 in local-m2-cache found org.antlr#ST4;4.3 in local-m2-cache found org.abego.treelayout#org.abego.treelayout.core;1.0.3 in local-m2-cache found org.glassfish#javax.json;1.0.4 in local-m2-cache found com.ibm.icu#icu4j;61.1 in local-m2-cache downloading file:/Users/scott.sandre/.m2/repository/io/delta/delta-core_2.12/1.1.0-SNAPSHOT-TEST/delta-core_2.12-1.1.0-SNAPSHOT-TEST.jar ... [SUCCESSFUL ] io.delta#delta-core_2.12;1.1.0-SNAPSHOT-TEST!delta-core_2.12.jar (4ms) :: resolution report :: resolve 611ms :: artifacts dl 10ms :: modules in use: com.ibm.icu#icu4j;61.1 from local-m2-cache in [default] io.delta#delta-core_2.12;1.1.0-SNAPSHOT-TEST from local-m2-cache in [default] org.abego.treelayout#org.abego.treelayout.core;1.0.3 from local-m2-cache in [default] org.antlr#ST4;4.3 from local-m2-cache in [default] org.antlr#antlr-runtime;3.5.2 from local-m2-cache in [default] org.antlr#antlr4;4.8 from local-m2-cache in [default] org.antlr#antlr4-runtime;4.8 from local-m2-cache in [default] org.glassfish#javax.json;1.0.4 from local-m2-cache in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 8 | 2 | 2 | 0 || 8 | 1 | 
--------------------------------------------------------------------- ``` I added ``` // scalastyle:off println println("pypi TEST") // scalastyle:on println ``` to the top (constructor) of `DeltaLog.scala` (and did not commit those changes). While running the `using_with_pip.py` tests, you can see that the `println` is executed: ``` pypi TEST +---+ | id| +---+ | 1| | 4| | 0| | 2| | 3| +---+ +---+ | id| +---+ | 1| | 4| | 0| | 2| | 3| +---+ Therefore, this specific package version was used. ``` Closes delta-io/delta#721 Signed-off-by: Scott Sandre Author: Scott Sandre GitOrigin-RevId: 5dc7c2ba80bbe87929151665520de3e1e017b1b1 commit 75283b4dc4e034ce81c01ffdb34dab0458c1b1aa Author: sabir-akhadov Date: Fri Jul 30 15:02:06 2021 +0000 [SC-80344] Minor refactor in TahoeFileIndex Minor refactor method in TahoeFileIndex N/A Author: sabir-akhadov GitOrigin-RevId: 621c4d930d60ac22e75ac5197c39d4802f93dc5a commit 37c3d5d8c44b87cd4a0c1c6d37eb90501fb792f6 Author: Wenchen Fan Date: Fri Jul 30 05:32:27 2021 +0000 Refactor testsuite UpdateSuiteBase Test only change Authored-by: Wenchen Fan Author: Wenchen Fan Author: Wenchen Fan GitOrigin-RevId: cba91a9000ebce99c195d5017339ca75760b1838 commit de5cbc1fc5f530a877ce92a298bbf4c72e0926c2 Author: Scott Sandre Date: Thu Jul 29 22:08:33 2021 +0000 [DELTA-OSS-EXTERNAL] Upgrade SBT to 1.5.5 This PR takes over the SBT upgrade work from #642 . Thanks Jacek Laskowski for the initial investigation and great work! All pass: - `build/sbt compile` - `build/sbt package` - `build/sbt test` - `build/sbt unidoc` Closes delta-io/delta#720 Merge AFTER https://github.com/databricks/universe/pull/111353 This was tested by commenting (on this PR) `Jenkins Trigger: Delta-OSS-Pr`. This then triggered the Delta-OSS-Pr job on a custom runbot CI shard, which had updated the `SBT_1_5_5_MIRROR_JAR_URL`. All tests passed. See the above universe PR for more details. 
Signed-off-by: Scott Sandre Author: Scott Sandre GitOrigin-RevId: 022a955f4f597888a5769df16d45bad4f5f0c6c0 commit 0bf07df709be2a93cb582eb0cf220f81e791fa9e Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Tue Aug 10 11:47:00 2021 -0700 add MiMa settings (#113) * add MiMa settings and test them * responded to PR comments commit eab445a15f478ed2454cf152edb795e9fece91f6 Author: sabir-akhadov Date: Wed Jul 28 11:49:01 2021 +0000 [SC-81903] Utility methods for action tags - Added utility methods for tags in actions Author: sabir-akhadov Author: Pranav Anand GitOrigin-RevId: 5dc4ef6acfd2288102808a476599dd4464ad4210 commit 338ccde72a46acf9546ae0e9baf64e5c63fee625 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Mon Aug 2 11:24:36 2021 -0700 [#101] add incremental changes api (#102) - resolves #101 - adds `DeltaLog::getChanges` API. Exposes an iterator of `VersionDelta`s, which is a java wrapper class containing the version and actions. - includes `DeltaLog.java` public interface - includes `DeltaLogImpl.scala` internal implementation - adds remaining `Actions` missing from Delta OSS (RemoveFile, AddCDCFIle, Protocol, SetTransaction) as both internal scala representations and public java interfaces - includes tests in `DeltaLogSuite.scala` - increases goldenTables project delta oss version to 0.8.0 (to get access to AddCDCFile) commit c19793f629a96ffa1648fb0a5c4a5031e53d82f1 Author: Yijia Cui Date: Tue Jul 27 09:17:54 2021 +0000 [SC-81245] Remove mergeSchemas from SchemaUtils Remove mergeSchemas from SchemaUtils No need to test. Author: Yijia Cui GitOrigin-RevId: 74766dc1701a46605b04e915d6a758e532863065 commit 3c0fb2efc4b547a397544f37f3dec78b3899f328 Author: JassAbidi Date: Fri Jul 23 15:59:31 2021 +0000 [DELTA-OSS-EXTERNAL] Set the right isolation level in the commit info This PR solve this #518. The isolation level is always set to null. this PR set the right isolation level for each transaction. 
Closes delta-io/delta#530 Signed-off-by: Shixiong Zhu Author: JassAbidi commit 896e39ad8bcc32618b564ec78cb27bdc829065c9 Author: Meng Tong Date: Thu Jul 22 16:28:25 2021 +0000 [SC-80763] Support Generated Columns in INSERT action in MERGE ## What changes were proposed in this pull request? This PR adds support for INSERT action in MERGE INTO command. Previously, when a generated column is not explicitly inserted, it will automatically be set to null, which may break the generated column check constraint when the referenced columns are not null. This PR generates correct value for generated columns in such cases based on the generation expression. This PR also simplifies how we generate all actions for INSERT in MERGE INTO in PreprocessTableMerge. Previously, we combine three sources: (1) explicit inserted actions; (2) implicit actions from original target table; (3) implicit actions from evolved columns. Now that we have the final schema already computed at this stage, we can simply do explicit and implicit actions based on the final schema. ## How was this patch tested? New unit tests. Author: Meng Tong #24624 is resolved by mengtong-db/merge-insert. GitOrigin-RevId: e536486662e706ed2aee12fbfa1f5b987412a742 commit c4d3a6470d9f3fe29772ad3c12891eba2bc81d6f Author: bart Samwel Date: Wed Jul 21 16:17:54 2021 +0000 [SC-80727] Test improvements for OptimisticTransactionSuite - Test only change Extended unit tests to cover all of these cases. 
Author: bart Samwel GitOrigin-RevId: 86d1ea838bb7722e9fa6250b5ed35e268564678f commit d05606cb97751dea2c26d48c32e6ae9b01e21037 Author: Burak Yavuz Date: Tue Jul 20 21:42:52 2021 +0000 [SC-74210] Don't drop NullType columns for SQL read path Don't drop nulltype columns for read path Author: Burak Yavuz GitOrigin-RevId: 812ccfa61494445bbb82338812bf22f5528833d1 commit b8a1d04041431fee1fd429ddacdcf85e887e9882 Author: sabir-akhadov Date: Mon Jul 19 15:55:17 2021 +0000 [SC-74196] Slight refactor of TahoeFileIndex Small refactor of TahoeFileIndex Author: sabir-akhadov GitOrigin-RevId: e8b2c0fe49c828bde9e24e33dac8ba9d907a452b commit c0184af52ef2cf9240382c28272aab1b31101c41 Author: Prakhar Jain Date: Fri Jul 16 20:26:27 2021 +0000 [SC-80329] Increase visibility of txn id in OptimisticTransaction - Make txn id protected Author: Prakhar Jain Author: Lars Kroll GitOrigin-RevId: cc1fd39ac753f9221d8b0e5af64c19884423da38 commit 553132d7f5475d1cb4066cecb7a8b00ca68c62a0 Author: Wenchen Fan Date: Fri Jul 16 01:14:37 2021 +0000 [AUTO][SPARK-36074][SQL] Change error message in AlterTableTests - Test only change Authored-by: Wenchen Fan Author: Wenchen Fan GitOrigin-RevId: be1db9f3c719c5f50c7df634fcd6ab045b792b78 commit fdbcf60215e78bec82f7c923a16b6ed5f1d03a7c Author: Yijia Cui Date: Thu Jul 15 17:50:49 2021 +0000 [SC-78889] Add SchemaUtils APIs - Add a few SchemaUtils APIs Author: Yijia Cui GitOrigin-RevId: 3db81520790efa26cbd3920d409e3cea63f8dde9 commit b48b8605c558d0ec70eaae4788381e8706fb8d24 Author: Prakhar Jain Date: Thu Jul 15 00:34:05 2021 +0000 [SC-80726][SC-80727][DELTA] Make logPrefix accessible - Increase visibility of logPrefix Author: Prakhar Jain GitOrigin-RevId: e649799040868fc2ad7ffd69c5c7cc6c4b7b78db commit 561b7198a88e7c7c1be7b7042cd41126299e117f Author: Jacek Laskowski Date: Tue Jul 13 22:36:39 2021 +0000 [DELTA-OSS-EXTERNAL] Move asyncUpdateTask to SnapshotManagement (where it is used) Closes delta-io/delta#713 Signed-off-by: Shixiong Zhu Author: Jacek Laskowski 
Author: Shixiong Zhu #24601 is resolved by zsxwing/st8nhfch. GitOrigin-RevId: 71a1c0bf64403fb9877253ed44fcabf4f5f2a590 commit 8d124cd6203be6bd5095fc0092e2e9bae66f1cc9 Author: Abhishek Somani Date: Tue Jul 13 21:42:28 2021 +0000 [SC-78248]: Make newDeltaPathTable protected Increase visibility of utility method Author: Abhishek Somani GitOrigin-RevId: 76df61b078ffc60acc3366b7e00d7fd42c4cc3c0 commit 327816cd95f048cdc9dbc60bdfefbdda38e0084b Author: Wenchen Fan Date: Tue Jul 13 02:58:16 2021 +0000 Change AlterTableTests error message Test only change Author: Wenchen Fan Author: Wenchen Fan GitOrigin-RevId: daafc2be94f0ef83f0fa94cf239d0a26651a0d92 commit f7ce276e779f6a14395be65e938870b3f5787637 Author: Rahul Mahadev Date: Mon Jul 12 22:54:48 2021 +0000 [SC-78904] Make Delta Vacuum handle duplicate paths ## What changes were proposed in this pull request? Make Vacuum handle duplicate listings from logStores ## How was this patch tested? Added a unit test that uses a dummy logstore that lists duplicate URLs Author: Rahul Mahadev #24443 is resolved by rahulsmahadev/handleDuplicatesInVacuum. GitOrigin-RevId: de044ff3e130017f8b1d94ddcfdaf34f67f8ce92 commit 104e2a472b5a0a5c718c42ac14ac8b851a1a7fe8 Author: gurunath Date: Mon Jul 12 20:51:22 2021 +0000 [DELTA-OSS-EXTERNAL] Raise error when dataType was not provided Throw an error explicitly if dataType was not provided for DeltaColumnBuilder. The issue was raised and discussed here: https://github.com/delta-io/delta/issues/698 Closes delta-io/delta#714 Signed-off-by: Shixiong Zhu Author: gurunath Author: Shixiong Zhu #24669 is resolved by zsxwing/cuj6lcz0.
GitOrigin-RevId: a9d92bc195693bb1ddca515a8069c38ce99d0497 commit 94ad72a038d8dfd99a53450b6903f5b71d7d3606 Author: Lars Kroll Date: Fri Jul 9 07:52:28 2021 +0000 [SC-79618] Augment Symlink Manifest Suite - Test only change Author: Lars Kroll GitOrigin-RevId: 0ad1cc1a70171daed062256cc00e5f2f3ac18672 commit 4b7b176aee8b173416c22d1d21d0786346337883 Author: Scott Sandre <59617782+scottsand-db@users.noreply.github.com> Date: Thu Jul 29 18:06:42 2021 -0700 [SC-80783] Change OSS Connectors repo to use Sonatype for the release (#105) Update build.sbt to use sonatype and not bintray, as bintray is sunset. Made changes similar to those made to delta-io/delta when the same thing was done (https://github.com/delta-io/delta/commit/926e30da6b6e86448cdb4739a082e0b566eec129) Tested by staging a sonatype release. Changed the `version.sbt` to be `0.2.1-SNAPSHOT-sonatype-test` and was able to see that new package version in the sonatype staging repository. The pgp keys were present, too. commit 1f876ca138e9a5378adb76b71f618be3036c900f Author: Gerhard Brueckl Date: Tue Jul 27 05:23:32 2021 +0200 Add inline-documentation to PQ function (#103) * add PowerBI connector * add parameter to buffer the binary file before reading it to mitigate streaming errors * update global readme to include reference to the Power BI connector * fix issue with online refresh * add inline-documentation update readme.md add *.pbix to .gitignore Co-authored-by: gbrueckl commit 137f7c378dac2f506fade74360b95255ea4159dc Author: Shixiong Zhu Date: Tue Jul 13 18:45:20 2021 -0700 Fix the sbt launch download url (#98) commit f71f646575d7b8d22b69bd9d05961a24a1c7eb85 Author: Gerhard Brueckl Date: Thu Jul 8 18:18:12 2021 +0200 Fix issue with data sources that do not support streaming of binary files (#94) * add PowerBI connector * add parameter to buffer the binary file before reading it to mitigate streaming errors * update global readme to include reference to the Power BI connector * fix issue with online refresh 
Co-authored-by: gbrueckl commit 86bbe9905ebf8c57f5bf5078a5d18ad2b5081abe Author: Shixiong Zhu Date: Thu Jul 8 00:16:57 2021 +0000 [DELTA-OSS-EXTERNAL] Fix the sbt launch download url The bintray url is not working now. Use the `repo.typesafe.com` link instead. Closes delta-io/delta#711 Signed-off-by: Shixiong Zhu Author: Shixiong Zhu #24449 is resolved by zsxwing/den3b8hd. GitOrigin-RevId: 1f8fdb3bba694ff53001d13ebca9f84dfae0748e commit 3fa6bcaa048e71862d2a7e3c151a55b08a5960d0 Author: Prakhar Jain Date: Wed Jul 7 20:43:55 2021 +0000 [SC-79815][DELTA] Refactor Conflict detection code flow This PR refactors the conflict detection code flow into a separate class in order to: - Improve readability of the current code: The current code has a single `checkForConflict` method which does all the required checks. Existing UTs GitOrigin-RevId: 54ad050e0967fa49a61f2677fe1510242d0916d5 commit bf0d95b5434d293425b47362f0d860029628f0c6 Author: Yuchen Huo Date: Sun Jul 4 02:57:47 2021 +0000 [SC-78855] Call default table path only once in delta table creation path A minor refactor to call defaultTablePath only once Author: Yuchen Huo GitOrigin-RevId: 1eafe477d0d2ad9d6980398f57dedea31c020d75 commit b29742d8088d1885210b54c5509ada4d2f60957f Author: Yijia Cui Date: Fri Jul 2 21:34:56 2021 +0000 [SC-78889][Delta] Move MergeSchema in SchemaUtils to a new file and Refactor DeltaMergeInto to Include Final Schema Move MergeSchema in SchemaUtils to a new file to report finalSchema in DeltaMergeInto. Refactor PreprocessTableMerge to report the fully analyzed DeltaMergeInto. Unit test.
Author: Yijia Cui GitOrigin-RevId: 5fe7e0d2a2e899a384382d8caa7273be8408ee14 commit 7c6a0ceca84b5d04653dec7361aa82075542dfee Author: Prakhar Jain Date: Thu Jul 1 18:53:45 2021 +0000 [SC-80108][DELTA] Add new testsuite OptimisticTransactionSuite Add new testsuite OptimisticTransactionSuite UTs Author: Prakhar Jain Author: Prakhar Jain GitOrigin-RevId: 6aa1b08ea56220e76b75807c4577d21b4547762c commit f5da8237f0fd46823f08f30af0237f9137e4ce49 Author: Meng Tong Date: Thu Jul 1 18:22:26 2021 +0000 [SC-70216] Support generated columns in merge command ## What changes were proposed in this pull request? This PR adds support for generated command in MERGE ... UPDATE case. Previously, if the generated column is not explicitly updated, we will copy over the old values, which would potentially break the generated column check constraint and fail the query. With this PR, the values of generated columns will be computed correctly using the (potentially updated) referenced columns. This PR mostly reuses the utility functions for UPDATE to generate the correct update expressions, with some small changes needed to cover the schema evolution case. ## How was this patch tested? Added new unit test. Author: Meng Tong #23499 is resolved by mengtong-db/generated-column-merge. GitOrigin-RevId: 6245c07c323255eb4a0db88150e520ef24e02af8 commit 1cd3740ecefcc8115ff6673dbcb52fb5264a320a Author: Prakhar Jain Date: Mon Jun 28 21:26:30 2021 +0000 [SC-80108][DELTA] Minor code style change in OptimisticTransactionLegacySuite Minor code style change. Author: Prakhar Jain GitOrigin-RevId: 217f785ec1dbb111e6ff1aca88e680774ce68e90 commit 9f04cf77210e5d34ac073bc2bfb15fe9c72d0147 Author: Junyong Lee Date: Mon Jun 28 19:05:42 2021 +0000 [SC-74210] Drop NullType columns for SQL read path NullType column is not very useful as they do not contain any contents. Hence we used to drop this NullType column when we create a table from DataFrameReader, but we did not do the same thing on SQL read path. 
This PR unifies the behavior, which will drop NullType column always in any read/table/sql APIs. Unit tests testing different read APIs. Author: Junyong Lee GitOrigin-RevId: 9b55e8fb5e51ffbfb86832a811668e9b920c225e commit 83277eb30c0834bd837d9658864261fc31d366f6 Author: Jose Torres Date: Fri Jun 25 19:32:57 2021 +0000 [SC-75521][DELTA] Strip the full temp view plan for Delta DML commands Strip the full temp view plan for Delta DML commands. This allows us to reenable the test for merging into SQL temp views for MERGE - previously resolution would fail. new unit test Author: Jose Torres GitOrigin-RevId: b418f4bd194d6186390261cd8d32c4f2c9ed1048 commit ac5f9e1871a3fb5d889393dc6fadada009e83a2f Author: Linhong Liu Date: Fri Jun 25 08:20:59 2021 +0000 [SC-59185] Refactor DeltaAnalysis code and update comments Minor refactor of DeltaAnalysis code and update comments in DeltaInvariantCheckerExec existing UT Author: Linhong Liu GitOrigin-RevId: 4582218e20f7eae532d063cc8613e2d964ee35d9 commit c424efad8b03c2dce6d988a927677a0e9c314a11 Author: Zach Schuermann Date: Wed Jun 23 15:34:10 2021 +0000 [SC-78050][DELTA] Call `deltaLog.update()` in DataFrame read path ## What changes were proposed in this pull request? When a snapshot was created as an `InitialSnapshot` for a Delta table and is cached as such (for example, a race condition due to unmounting and mounting paths), then all following reads on that Delta table would return a “This path is not a Delta table” error. This PR adds an `update()` call to the Dataframe read path to prevent this from happening (and give the valid table). This is done by forcing the computation of `snapshot` when we create a `BaseRelation` for `DeltaTableV2`. In short, this will call `deltaLog.update()` so we ensure that the check whether or not the table exists is accurate. This costs an additional RPC but is deemed necessary for correctness. ## How was this patch tested? 
Added a unit test to simulate reading from a table with cached `InitialSnapshot` and a valid DeltaLog. Author: Zach Schuermann #23778 is resolved by schuermannator/sc-78050. GitOrigin-RevId: 8fd732bbf39788f92ea390f720aa9bb4246e8d12 commit 7e9c6e5e0ec1f472bf84d8066b47a71c1383bf46 Author: Zach Schuermann Date: Mon Jun 21 15:50:47 2021 +0000 [SC-74475][DELTA] Minor refactor of EvolvabilitySuiteBase Minor refactor of EvolvabilitySuiteBase Test-only PR. Author: Zach Schuermann GitOrigin-RevId: 73bc357d0634b1607ed77b3a4d709a39fe625b8b commit 65eff808b4fcf529393d0715a5ddd5dd94bc1f59 Author: Lars Kroll Date: Mon Jun 21 09:53:56 2021 +0000 [SC-78522] Remove redundant import Minor refactor Author: Lars Kroll GitOrigin-RevId: d9b49a4fa92dea967104a82fdbae69534c14436a commit 736b9f2456bccf9eea2c306900b1405a4f5b31aa Author: Prakhar Jain Date: Fri Jun 18 23:34:13 2021 +0000 [SC-77769][DELTA] Refactor conf code Minor refactor of Delta conf code Author: Prakhar Jain GitOrigin-RevId: 5ebc318ed5a9ad34529f0bf49ca7bf4b9399bccf commit 7bd8e22ff2514d4c32571c913970bfdc4b10063a Author: Prakhar Jain Date: Fri Jun 18 18:32:18 2021 +0000 [SC-79320][DELTA] Support getBinIndex func in FileSizeHistogram Add a new function getBinIndex in FileSizeHistogram, which returns the index of the bin to which the given fileSize belongs, or -1 if the given fileSize doesn't belong to any bin. Existing UTs. Author: Prakhar Jain GitOrigin-RevId: 9a8bee48e60a4cf2b0e1207c9a7ddc3c31991c82 commit 62ad794694c0276c49d8cfc94ae3416b9bf10ab8 Author: Lars Kroll Date: Fri Jun 18 14:22:00 2021 +0000 [SC-77778] Minor refactor style Minor refactor style N/A Author: Lars Kroll GitOrigin-RevId: c9c06110075d32c749eb1afb24ea6f873bfece61 commit 181941f360ac12785f2e3012afe7fa4f79e97d83 Author: Tathagata Das Date: Fri Jun 18 01:30:54 2021 +0000 [SC-78753][Delta]Add more logging for measuring timing in conflict detection two improvements - every log line prints a unique identifier of the txn.
this differentiates logs from concurrent txns to the same table in the same jvm (optimize does this all the time). the id is completely internal and used only for this log4j logging purpose. - additional timing metrics to show the breakdown of timing between the different steps of conflict detection. no unit tests Author: Tathagata Das GitOrigin-RevId: 3a0c424288660cbbcef5cb76cd66d75050b10828 commit 0b8e6cb6bd9577d26630f67ebc4e6b134abc2987 Author: Yuyuan Tang Date: Thu Jun 17 01:11:47 2021 +0000 [SC-76551] Refactor DeltaCatalog and CreateDeltaTableCommand Minor refactor Author: Yuyuan Tang GitOrigin-RevId: 1ae361b37d749cac3d06fe4cae18fc172fa464a7 commit ac57b7858d55eb21d5b259cb9768a85ca6211254 Author: Zach Schuermann Date: Wed Jun 16 17:04:23 2021 +0000 [SC-79127][DELTA] Refactor some test names, remove redundant comment Minor refactor of test names and comment Author: Zach Schuermann GitOrigin-RevId: 7d7c15e13c4c0b3f41fa3421c91e6a5a02812efa commit 31bf4bcff38a610d4eec3a5ddea627426322a9fc Author: Shixiong Zhu Date: Wed Jun 16 01:35:05 2021 +0000 [SC-78244]Delta should lock commits on Azure Set `spark.databricks.delta.commitLock.enabled` to `true` on Azure, as removing the lock will increase the chance of hitting the concurrent error when overwriting the `_last_checkpoint` file concurrently. The new unit tests. -Regression: Azure users may hit the concurrent error when overwriting the `_last_checkpoint` file concurrently. Author: Shixiong Zhu GitOrigin-RevId: df9d11f1982bb71563934d9d389e40a1e37b7add commit d1f8b83f8bf7da873670d6777f19e0af7c0091b0 Author: Guy Khazma Date: Fri Jun 11 23:38:51 2021 +0000 [DELTA-OSS-EXTERNAL] [Storage System] IBM Cloud Object Storage Support - cleanup This PR cleans up an unused variable in the IBMCOSLogStore. The variable (`writeSize`) was a leftover from an older version of the LogStore implementation. No logic changes are introduced.
Closes delta-io/delta#692 Signed-off-by: Yijia Cui Author: Guy Khazma <33684427+guykhazma@users.noreply.github.com> #23256 is resolved by yijiacui-db/73jdbmso. GitOrigin-RevId: 0a1de46e6b55b7ebff76b187a4d24f5a826df385 commit 1470e33f3f728a1670a77da63f3fb78780c30873 Author: Yuhong Chen Date: Fri Jun 11 18:33:39 2021 +0000 [DELTA-OSS-EXTERNAL] [SC-77949] Make DeltaTable.forName support "delta.``" name ## What changes were proposed in this pull request? Make DeltaTable.forName support "delta.``" name. Before this change, DeltaTable.forName(s"delta.`$dir`") would result in an error. ## This PR introduces the following *user-facing* changes Before this change, DeltaTable.forName(s"delta.`$dir`") would result in an error. After this change, DeltaTable.forName(s"delta.`$dir`") would be allowed for Delta Table directories, but still blocked for empty (non Delta Table) directories. ## How was this patch tested? Unit tested that DeltaTable.forName(s"delta.`$dir`") on a Delta Table directory is allowed, and that DeltaTable.forName(s"delta.`$dir`") on an empty directory is still blocked. Author: Yuhong Chen Author: Yuhong Chen #22994 is resolved by FX196/dpgget9c. GitOrigin-RevId: 7f0bd84e5d1064bbc2282d5330f2e8b74a45959d commit 4bee7ae50c50299cdea54ed3afeb59effa0f0790 Author: yaohua Date: Thu Jun 10 19:49:27 2021 +0000 [ES-113602] Minor change in sbt install script Minor change in sbt install script. Author: yaohua GitOrigin-RevId: 26b0fe9df735739332a482ced59bdd90bd7534ec commit 4243bccbe397e0f47dc36b525f14983d57bbc848 Author: Li Zhang Date: Thu Jun 10 18:09:57 2021 +0000 Allow earliest Delta table time travel to smallest Delta file version - 1 When there’s a checkpoint at version 10 and a delta file at version 11, the earliest version returned should be version 10.
We don’t handle that case correctly right now unit test AFFECTED VERSIONS: PROBLEM DESCRIPTION: Author: Li Zhang GitOrigin-RevId: 8e70e6b2aae76a5043653d3b6fdee45b824a9c27 commit 04d3c5ffbbfd4133ee019fc32a6e42afea6d2db3 Author: Yuhong Chen Date: Mon Jun 14 20:27:59 2021 -0700 Pass down hadoopConf to ParquetReader (#93) commit 26495e9d72bf6d129f4820aa74c959b2fc185fc5 Author: Gerhard Brueckl Date: Tue Jun 15 04:43:40 2021 +0200 add PowerBI connector (#87) Co-authored-by: gbrueckl commit 539463a99b0c9b84f03448efe6c8da3a5b4f4a28 Author: Alex Date: Mon Jun 14 22:31:30 2021 -0400 Adding `sql-delta-import` utility. (#80) * Adding sql-delta-import * cleanup. resolving compilation and scala fmt issues * changing copyright to Scribd * adding link to README.md * using scala 2_12 only for `sql-delta-import` * Create AUTHORS.md (#83) * Rename AUTHORS.md to AUTHORS * Addressing PR feedback. Adding Scribd to AUTHORS Changing attribution to The Delta Lake Project Authors * Update AUTHORS * Addressing PR feedback * fixing formatting * fixing dependency resolution failures * changing spark-sql dependency for cross scala versions * adding test for sql-delta-import * proper project name in CircleCI * Just trying to restart circle CI build.. * only publishing sql-delta-import for scala 2.12 * adding aliases to lower/upper bounds columns in bounds query to better support data stores that require it Co-authored-by: Alex Kushnir Co-authored-by: Denny Lee commit b76e2314583b0e2081a01163cea628031384b987 Author: Yijia Cui Date: Thu Jun 10 15:44:12 2021 +0000 [SC-78118][Delta] Improve Evolvability test by adding a new column in the existing action. Improve Evolvability test by adding a new column in the existing action. test only PR. 
Author: Yijia Cui GitOrigin-RevId: d0dc71791c2e56cb67386a5b0dc6e601aaa418c6 commit fd515cda15eb8e8d130f4c29a8188dc0e7d1672f Author: lizhangdatabricks <85116904+lizhangdatabricks@users.noreply.github.com> Date: Wed Jun 9 19:03:08 2021 +0000 Revert "[SC-78695] Fix bug to allow earliest checkpoint version to be smallest Delta file version - 1" Reverts databricks/runtime#23115 Author: lizhangdatabricks <85116904+lizhangdatabricks@users.noreply.github.com> GitOrigin-RevId: 9a525374b7ceeb723a92a51771802f94af9d32cf commit 3d9d14f3b46091563679550949ac84d1c1205ee8 Author: lizhangdatabricks <85116904+lizhangdatabricks@users.noreply.github.com> Date: Wed Jun 9 09:20:08 2021 -0700 [SC-78695] Fix bug to allow earliest checkpoint version to be smallest Delta file version - 1 (#23115) GitOrigin-RevId: 95b7d8f509cc0a8b11346966cf9f68e708320ae5 commit a7f0ac792c1912a70179ab5e692cc983c9b3c1c7 Author: Prakhar Jain Date: Tue Jun 8 15:31:22 2021 +0000 [SC-77108][DELTA] Refactoring the commits stats GitOrigin-RevId: df9c3b2ffcbe4dd07fecf929278f4469d42f1dc7 modified: core/src/main/scala/org/apache/spark/sql/delta/Snapshot.scala commit 4333a9e83aa6106e28c1db755cd60479c6071b9f Author: Yuhong Chen Date: Mon Jun 7 04:57:05 2021 +0000 [DELTA-OSS-EXTERNAL] [SC-77396] Block confusing SQL CREATE TABLE behavior Added logic and unit tests to block queries like ``CREATE TABLE delta.`/foo` USING delta LOCATION "/bar"`` where ambiguous paths are supplied. Users can allow such queries by setting the `DELTA_LEGACY_ALLOW_AMBIGUOUS_PATHS` flag to true, in which case the `USING delta LOCATION "/bar"` statement will be ignored. Closes delta-io/delta#688 Signed-off-by: FX196 **PROBLEM DESCRIPTION:** - Queries like ``CREATE TABLE delta.`/foo` USING delta LOCATION "/bar"`` will be blocked because there are two different paths in the query. - We still allow queries that use the same location such as ``CREATE TABLE delta.`/foo` USING delta LOCATION "/foo"``. 
- We also add a legacy flag `spark.databricks.delta.legacy.allowAmbiguousPathsInCreateTable` to allow such ambiguous queries since this is a behavior change. Author: FX196 Author: Yuhong Chen GitOrigin-RevId: 10513d45b134706ebb760ef719376cbbc3e9420b commit f5e1116efbf31b3fee35d7913f5643db12fdbd87 Author: Shuting Zhang Date: Fri Jun 4 19:26:51 2021 +0000 [DATA-15917]Log tahoeEvent in product logs record tahoeEvent in product logs api: recordProductEvent Author: Shuting Zhang GitOrigin-RevId: 4d7df82f27e9baae971b183f429bb8fcded740fc commit 732bcb8218b5ed0ba12c1a56567e0644e14d407a Author: Meng Tong Date: Fri May 28 16:37:41 2021 +0000 [SC-77308][Delta] Pass the Schema in the Delta Catalog to downstream Logic We use the schema in the catalog in `updateMetadata`. Once the schema is fixed, downstream logic can use that. Author: Meng Tong GitOrigin-RevId: cf3ae4feec1ecaa8a6e420432069aa90fe7a29d7 commit 2addad118a1755b24098b0e75dc2a562be1b2cc9 Author: Yuming Wang Date: Fri May 28 01:32:20 2021 +0000 [DELTA-OSS-EXTERNAL] Upgrade Antlr4 to 4.8 Upgrade Antlr4 to 4.8 to fix Antlr4 incompatible warning. Before: ``` yumwang@LM-SHC-16508156 spark-3.1.1-bin-hadoop2.7 % bin/spark-sql --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog ... spark-sql> create table test_delta using delta as select id from range(10); ANTLR Tool version 4.7 used for code generation does not match the current runtime version 4.8ANTLR Tool version 4.7 used for code generation does not match the current runtime version 4.8 21/05/21 21:14:53 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `default`.`test_delta` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. 
Time taken: 9.841 seconds ``` After: ``` yumwang@LM-SHC-16508156 spark-3.1.1-bin-hadoop2.7 % bin/spark-sql --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog ... spark-sql> create table test_delta using delta as select id from range(10); 21/05/21 21:10:27 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `default`.`test_delta` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. Time taken: 5.949 seconds ``` Closes delta-io/delta#676 Signed-off-by: Shixiong Zhu Author: Yuming Wang #22678 is resolved by zsxwing/0vx467nv. GitOrigin-RevId: 7627cfbe7ed84448f5b37a44cc538d328a472585 commit 59aa330c403f8a71b3eef0e90bd61cd54aab108c Author: Yijia Cui Date: Fri May 28 01:31:40 2021 +0000 [SC-78072][Delta] Fix partitionedBy in DeltaTableBuilder Python API Fix partitionedBy in DeltaTableBuilder Python API Added more usages in the unit tests. Author: Yijia Cui GitOrigin-RevId: 99e2e89242c8fa41c960df25dcc382faae439ab1 commit 50fcd9737979a9afbf4eacb9f666abf7be11b729 Author: Vijayan Prabhakaran Date: Thu May 27 21:39:31 2021 +0000 [SC-74857] Minor refactor of WriteIntoDelta Minor refactor of WriteIntoDelta. Author: Vijayan Prabhakaran GitOrigin-RevId: 5c2fe8b996085a1f447035e5aa79e865c76bf14a commit 8553b1c27dd7286ce626b10c66b02a006a7e5cac Author: Jose Torres Date: Thu May 27 21:06:33 2021 +0000 [SC-76446][DELTA] Set isBlindAppend correctly for DSv2 plans. ActiveOptimisticTransactionRule is not idempotent, but V2 write plans will be planned again based on the optimized version of the original plan through the V1 fallback paths. So we have to skip these rules for V2 write plans, or the file indices will be pre-pinned and thus not invoke the proper logic to signal their scans to the transaction. 
new unit test Author: Jose Torres GitOrigin-RevId: 51fc053d022ebcf6b45bf4441eed799d7d68d969 commit dae440de43ba741c39414e5d720cd55080fbd8f4 Author: Tathagata Das Date: Thu May 27 15:53:51 2021 +0000 [DELTA-OSS-EXTERNAL] Updated integration tests to test pypi package Made following updates to the integration test - Added test for pypi package, both main pypi and testpypi - Added support for providing staging maven repo for testing staged maven release artifacts Closes delta-io/delta#675 Signed-off-by: Tathagata Das Author: Tathagata Das Author: Tathagata Das #22601 is resolved by tdas/lpmdfft7. GitOrigin-RevId: 679b0b4df9e57a0a106c9258d40381f5b0c37f00 commit a9196c5f5df4256ecf291b4a3517a82432515160 Author: Meng Tong Date: Thu May 27 01:07:16 2021 +0000 [SC-78035][Delta] Minor Refactor of GeneratedColumn Minor refactor of GeneratedColumn. Author: Meng Tong GitOrigin-RevId: 60dbb5d9a0e15d9b74c901beb94d70a196cfef0b commit ba15bbd3e439de980670769fbb2db150d36f0eaa Author: yaohua Date: Wed May 26 20:31:01 2021 +0000 [SC-74862][Delta] Refactor the Vacuum Command Skip paths the file system fails to make them qualified. Author: yaohua GitOrigin-RevId: 1340e76c5d1c4d23e593a2694437b6123e4c2131 commit a2722f8b17369a47dd8d23696fc4958f022bb496 Author: Zach Schuermann Date: Wed May 26 19:13:21 2021 +0000 [SC-77958][Delta] Fix `userMetadata` option when creating/replacing Delta table This PR fixes a bug when adding `userMetadata` during table creation (specifically `saveAsTable` API). The following example would yield `null` for `userMetadata`: ``` spark.range(10).write.format("delta") .option("userMetadata", "someMeta").mode("overwrite").saveAsTable("user_meta") ``` This was due to `userMetadata` only being included with `DeltaOperations.Write`, not `CreateTable` or `ReplaceTable`. This PR adds support to pass `userMetadata` through from `saveAsTable()`/`createOrReplace()`, fixing the above behavior. Additional unit test explicitly using `saveAsTable` and `createOrReplace` APIs. 
Author: Zach Schuermann GitOrigin-RevId: b149bebb7fc27446ace70714452599aa51c54b8e commit efa944a0d5564933ff70492bdb94e44dece9f82f Author: Yijia Cui Date: Wed May 26 07:51:27 2021 +0000 [SC-75287][Delta] Add Evolvability Tests For Checkpoint Schema and Json Schema Add a new column in checkpoint schema to test actions that shouldn't fail. Now the checkpoint schema has a new column "unknown". Write json log file with a new column to test actions that shouldn't fail. Now the json schema is `{"some_new_feature":{"a":1}}`. test-only PR. Author: Yijia Cui GitOrigin-RevId: 5ede8bc24bfc78dd80468bd3bf7bde0cd2cef057 commit ef76d4dc45748b2d3f7781555a740fe8954fde62 Author: Yijia Cui Date: Mon May 24 17:03:16 2021 +0000 [SC-69796][Delta] Fix JavaDeltaTableBuilderSuite and JavaDeltaTableSuite in OSS ## What changes were proposed in this pull request? Fix JavaDeltaTableBuilderSuite and JavaDeltaTableSuite in OSS ## How was this patch tested? Unit test only Author: Yijia Cui #22428 is resolved by yijiacui-db/SC-69796-java-suite. GitOrigin-RevId: 26053a0d4e517e7d3f472175b25910fa03b04948 commit e36dc6b9ca8ea8e893080dcea847978d5835125b Author: Tathagata Das Date: Mon May 24 17:00:16 2021 +0000 [SC-77952] Update Delta Lake OSS licenses from 2020 to 2021 ## What changes were proposed in this pull request? As the title says. ## How was this patch tested? Existing checks Author: Tathagata Das #22485 is resolved by tdas/SC-77952. GitOrigin-RevId: 85a17d069efe112989ac883c1395e2bbeebd9ae4 commit 9d352171433464adde7ac8784ea67a96b286f7aa Author: Rahul Mahadev Date: Thu May 20 23:38:10 2021 +0000 Minor refactor to catalog table code paths As the title says Author: Rahul Mahadev GitOrigin-RevId: a42e1ad347d3f0665070d6ddb43ef311e058b576 commit ebf76250e1e9909b0424f224145ac1d3a64c9801 Author: Tathagata Das Date: Thu May 20 18:50:07 2021 +0000 [DELTA-OSS-EXTERNAL] Update pypi name to delta-spark it's shorter!
Closes delta-io/delta#674 Signed-off-by: Tathagata Das Author: Tathagata Das #22309 is resolved by tdas/gvazkgm1. GitOrigin-RevId: 6cf1e197e0ac0ad7f6330af4733a40416f185a02 commit 9c316ad80847761f89069e117c0d69184138ac03 Author: Tathagata Das Date: Thu May 20 16:14:25 2021 +0000 [DELTA-OSS-EXTERNAL] Polish docs for 1.0 - show all classes in python docs - fix developer API tag in scala/java docs - add annotations to the new exceptions and logstores - refactored DeltaTableBuilder's options to hide them from docs Closes delta-io/delta#672 Signed-off-by: Tathagata Das Author: Tathagata Das #22282 is resolved by tdas/ok3cd8hv. GitOrigin-RevId: fba84e46059f667a9aabf6387efa65533c1e660e commit 441871c05d46ea940c44d2cf8fe9cead2fc93a60 Author: Shixiong Zhu Date: Thu May 20 01:23:18 2021 +0000 [DELTA-OSS-EXTERNAL] Add Writer Version 4 to PROTOCOL.md - Add Writer Version 4 requirements. - Add Generated Columns. - Make the `Writer Version Requirements` table use GitHub markdown format. - Update the table of contents. Closes delta-io/delta#671 Signed-off-by: Shixiong Zhu Author: Shixiong Zhu #22252 is resolved by zsxwing/bgtk1zni. GitOrigin-RevId: 6a3c66920e67829814088060522dd19eb4a0393b commit cb92e8f6e8ddeb90c7c3ba06261d2a5a2de1f5f9 Author: Meng Tong Date: Wed May 19 06:57:02 2021 +0000 [SC-77537][Delta] Add annotations and improve LogStore class docs ## What changes were proposed in this pull request? As titled. Also moved the classes to the correct package locations ## How was this patch tested? Doc/comment change. Existing tests are sufficient. Author: Meng Tong #22045 is resolved by mengtong-db/annotatiion. GitOrigin-RevId: 11e3d22206362a9de4d869812dd6b58074212cb8 commit 90606ffa468d9a55dcffccc3fae95b005dbd413c Author: Ranu Vikram Date: Tue May 18 23:22:47 2021 +0000 [DELTA-OSS-EXTERNAL] Adding support for GCS Adding support for Google Cloud Storage(GCS) as Delta Storage by introducing GcsLogStore. This PR addresses [issue #294]. 
File creation is an all-or-nothing approach to achieve atomicity and uses GCS [preconditions] to avoid race conditions among multiple writers/drivers. This implementation relies on gcs-connector to provide the necessary `FileSystem` implementations. This has been tested on a Google Dataproc cluster. #### GcsLogStore requirements 1. spark.delta.logStore.class=org.apache.spark.sql.delta.storage.GcsLogStore 2. Include gcs-connector in classpath. The Cloud Storage connector is automatically installed on Dataproc clusters. #### Usage
```
TABLE_LOCATION = 'gs://ranuvikram-test/test/delta-table'

# Write data to table.
data = spark.range(5, 10)
data.write.format("delta").mode("append").save(TABLE_LOCATION)

# Read data from table.
df = spark.read.format("delta").load(TABLE_LOCATION)
df.show()
```
[issue #294]: https://github.com/delta-io/delta/issues/294 [preconditions]: https://cloud.google.com/storage/docs/generations-preconditions#_Preconditions Closes delta-io/delta#560 Signed-off-by: Tathagata Das Author: Tathagata Das Author: Ranu Vikram #22070 is resolved by tdas/o9ixtoaw. GitOrigin-RevId: 0a1ce1d4407637d7697b93a25d8fd6be3efe2f6d commit 9a7338caa2ba3a1b84b56c23e886f7764f619d64 Author: Yijia Cui Date: Tue May 18 23:11:38 2021 +0000 [SC-69796] [Delta] Add DeltaTableBuilder Python API To Support Generated Columns ## What changes were proposed in this pull request? Add DeltaTableBuilder Python API. ## How was this patch tested? Unit tests. ## This PR introduces the following *user-facing* changes AFFECTED VERSIONS: OSS Delta 1.0 and DBR 8.3. PROBLEM DESCRIPTION: OSS Delta users and DBR customers will now be able to create / replace DeltaTable using Python APIs. Lazy consensus on https://groups.google.com/a/databricks.com/g/eng-streamteam/c/THVJ4DvrQGM/m/Sg6WcpcAAgAJ Author: Yijia Cui #21551 is resolved by yijiacui-db/SC-69796-python.
GitOrigin-RevId: 4a8d321194a2902883126b981c5066c176b148fb commit b61e138eb3dd91b2bf5ab07ed6b3e34879a55656 Author: Yijia Cui Date: Tue May 18 06:11:47 2021 +0000 [SC-77585] Fix Docs Generation and Add Instructions. ## What changes were proposed in this pull request? As title says. ## How was this patch tested? Tool fix. No need to test. Author: Yijia Cui #22131 is resolved by yijiacui-db/SC-77585. GitOrigin-RevId: 737e76d164058c77d4dfae68871146eb377a16ae commit f153958ef2cdae694700abf39b3c2bf9a56b516d Author: Prakhar Jain Date: Mon May 17 07:58:45 2021 +0000 [SC-76886][DELTA] Add usage logs for DeltaCommand.commitLarge code flow Currently there are usage logs for the commitLarge code flow - which is used by CONVERT commands. This PR add the same with a new tag "delta.commitLarge.stats". Added UTs. Author: Prakhar Jain GitOrigin-RevId: 6b4d2466aa41370cc95605edef282bd2759c9e8d commit bce30bdd8dd61147b99c24930633266f0af213df Author: Yijia Cui Date: Fri May 14 23:29:31 2021 +0000 [SC-77334][Delta] Remove Evolving Annotation From DeltaTable and DeltaMergeBuilder ## What changes were proposed in this pull request? Remove Evolving Annotation From DeltaTable and DeltaMergeBuilder ## How was this patch tested? No need to test - comment change only. Author: Yijia Cui #22038 is resolved by yijiacui-db/SC-77334. GitOrigin-RevId: 1d7559dad455ace3734caa32e04eabb72f52a608 commit 0717c059a962406ca0af4bdb787fe83c45275f15 Author: Meng Tong Date: Fri May 14 05:40:42 2021 +0000 Minor refactor to checkpoint logic Made the code cleaner Existing tests Author: Meng Tong GitOrigin-RevId: 15ee044259329088b32a8fe9662709b2af205588 commit 4b3a4ae9dfed164e5e6c38ac1870367e91649513 Author: Yijia Cui Date: Thu May 13 19:37:07 2021 +0000 [SC-69796][Delta] Add Create Delta Table APIs in Scala with support for Generated Columns. ## What changes were proposed in this pull request? Add Create Delta Table APIs as DeltaTableBuilder in scala with support for Generated Columns. 
See https://groups.google.com/a/databricks.com/g/spark-api/c/5vkssqZUmP0 for more details. ## How was this patch tested? unit test. ## This PR introduces the following *user-facing* changes AFFECTED VERSIONS: OSS Delta only. PROBLEM DESCRIPTION: OSS Delta users will now be able to create / replace DeltaTable with GeneratedColumn supported. Author: Yijia Cui #21310 is resolved by yijiacui-db/SC-69796. GitOrigin-RevId: f12cb173d9ebd20535801a316a665d62a7695ab4 commit 2ab890a9364af80218ea9115a68e86815102d33e Author: Meng Tong Date: Wed May 12 20:27:51 2021 +0000 [SC-76694][Delta] Add DelegatingLogStore ## What changes were proposed in this pull request? This PR adds `DelegatingLogStore` for OSS Delta with the capability of resolving the LogStore implementation based on the scheme of a path. The decision logic is as follows: 1. Check `spark.delta.logStore.class`. If it is set, use the value. If not, go to next step. 2. `DelegatingLogStore` will be used, which will 2.1 Check `spark.delta.logStore.scheme.impl`. If it is set, use the value. If not, go to next step. 2.2 Check if we have default implementation for `scheme`. If we do, use the corresponding default value. If not, go to next step. 2.3 Use `HDFSLogStore`. ## How was this patch tested? Added new unit test for the LogStore resolution logic. ## This PR introduces the following *user-facing* changes AFFECTED VERSIONS: OSS Delta only. PROBLEM DESCRIPTION: OSS Delta users will now be able to specify log store implementation for different schemes using `spark.delta.logStore.scheme.impl`. Author: Meng Tong #20836 is resolved by mengtong-db/logstore-public-api-oss-delegate. 
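The LogStore resolution order described in the `DelegatingLogStore` commit above (steps 1, 2.1, 2.2, 2.3) can be sketched in plain Python. This is an illustrative model only, not the Scala implementation. The function name `resolve_log_store`, the `DEFAULT_IMPL_BY_SCHEME` table and its placeholder class name are hypothetical. The sketch also assumes that the `scheme` segment of `spark.delta.logStore.scheme.impl` stands for the scheme of the path being resolved.

```python
from typing import Dict, Mapping

# Hypothetical defaults table, for illustration only: the commit message does
# not say which schemes ship with a default implementation.
DEFAULT_IMPL_BY_SCHEME: Dict[str, str] = {
    "s3a": "ExampleS3LogStore",  # placeholder name, not a real class
}

# Step 2.3 of the commit message: the final fallback.
FALLBACK_IMPL = "HDFSLogStore"


def resolve_log_store(conf: Mapping[str, str], scheme: str) -> str:
    """Return the LogStore class name for a path scheme, following the
    resolution order listed in the commit message."""
    # Step 1: an explicit global setting wins over everything else.
    global_impl = conf.get("spark.delta.logStore.class")
    if global_impl:
        return global_impl

    # Step 2.1: per-scheme override (key pattern is an assumption, see above).
    per_scheme_impl = conf.get(f"spark.delta.logStore.{scheme}.impl")
    if per_scheme_impl:
        return per_scheme_impl

    # Step 2.2: built-in default for this scheme, if one exists.
    if scheme in DEFAULT_IMPL_BY_SCHEME:
        return DEFAULT_IMPL_BY_SCHEME[scheme]

    # Step 2.3: nothing matched, fall back to HDFSLogStore.
    return FALLBACK_IMPL
```

With an empty configuration, `resolve_log_store({}, "file")` falls through every step and returns the `HDFSLogStore` fallback, while setting `spark.delta.logStore.class` short-circuits the per-scheme lookup entirely.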
GitOrigin-RevId: 97bbd862405a2370a3d6b1fa46445297f60b172a commit 2b39f63c7cf78cdda639f254261416890fdf0304 Author: Jose Torres Date: Wed May 12 17:16:52 2021 +0000 Minor refactoring in the CreateDeltaTableCommand's updateCatalog As explained in the title Author: Jose Torres GitOrigin-RevId: f55a27328b6e0c97ac9ce57f126ebbae968bd06e commit a26c35de768c96e0d9631be602087ced7076a932 Author: Tathagata Das Date: Tue May 11 21:14:10 2021 +0000 [SC-77172] Enable unlimited clause tests for merge in OSS ## What changes were proposed in this pull request? Spark 3.1 supports unlimited merge clauses in SQL. So we enable the existing tests. ## How was this patch tested? newly enabled unit tests Author: Tathagata Das #21787 is resolved by tdas/SC-77172. GitOrigin-RevId: d5760ea79e7a8173953abe53e9705b1b5beebd3a commit cb46fd19d40e12241afd343f2998e1b8fad90486 Author: Vivek Bhaskar Date: Tue May 11 19:35:51 2021 +0000 [DELTA-OSS-EXTERNAL] Adding support for Oracle Cloud Infrastructure (OCI) Object Store as Delta Storage Adding support for Oracle Cloud Infrastructure (OCI) Object Store as Delta Storage by introducing OCILogStore. Regarding [Storage configuration](https://docs.delta.io/latest/delta-storage.html) page in Delta Documentation, I request following changes mentioned [here](https://docs.google.com/document/d/1DJvRAuUWUov5kepAQb176uSUsdlgU2MxaGBDBYiRGsg/edit?usp=sharing). Closes delta-io/delta#468 Co-authored-by: Vivek Bhaskar Signed-off-by: Tathagata Das Author: Tathagata Das Author: Vivek Bhaskar #21793 is resolved by tdas/5i600laq. GitOrigin-RevId: 9b0460f06e2cd9ab7fbf6bbce688d92ab10975f0 commit 34c53ac13f3d6218abf8831faec9182d3a7ff54b Author: Bruno Palos Date: Tue May 11 16:27:43 2021 +0000 [DELTA-OSS-EXTERNAL] Add null check to avoid NPE when Exception message is null while merging. 
Hi, When we interrupt a stream that updates a Delta table using _merge_ mode, the JVM runtime throws a plain `InterruptedException` with no description, but `AnalysisHelper` expects every exception to have a message, which causes a `NullPointerException` in such cases. This PR adds a null check to avoid that. Also added a unit test for this scenario. Thank you, Bruno Closes delta-io/delta#648 Signed-off-by: Shixiong Zhu Author: Bruno Palos Author: Shixiong Zhu #21879 is resolved by zsxwing/pouscxop. GitOrigin-RevId: 6be7561f24dd6b26a8f8732e82e25896234071a8 commit 0e16356ff46520404e2376d048f002ca74f6dc0c Author: mahmoud mahdi Date: Tue May 11 00:39:34 2021 +0000 [DELTA-OSS-EXTERNAL] Use isSameLogAs instead of comparing directly composite ids of two deltaLogs The main goal of this Pull Request is to use the `isSameLogAs` function when possible. Closes delta-io/delta#650 Signed-off-by: Shixiong Zhu Author: mahmoud mahdi #21878 is resolved by zsxwing/95ddzfsx. GitOrigin-RevId: 19023f7b0291c368bbaf234c4abfaf228833521e commit 5b3055efc65cc97d6cd29faf062cbe325083f431 Author: mahmoud mahdi Date: Tue May 11 00:35:41 2021 +0000 [DELTA-OSS-EXTERNAL] Clean the DeltaLog class While digging into the DeltaLog code, I realized that there's some code that can be refactored, and that's the main goal of this Pull request. 1. Removed some unused imports 2. Removed an unused method 3. Removed an unnecessary variable 4. Converted some code to single abstract methods Closes delta-io/delta#651 Signed-off-by: Shixiong Zhu Author: Shixiong Zhu Author: mahmoud mahdi #21874 is resolved by zsxwing/qum30qw7. GitOrigin-RevId: dcadfb75bbdcdc56ed681038892ac396e4ba2862 commit ed68d1f6599d85351b64ec5dff14449d1e154486 Author: Shixiong Zhu Date: Sat May 8 01:34:19 2021 +0000 [SC-59856]Remove the lock during a Delta commit Now we can ensure a checkpoint operation doesn't overwrite others, so we can remove the lock during a Delta commit to speed up concurrent commits.
Author: Shixiong Zhu GitOrigin-RevId: 482edc3da0511c659baeae3fa3d44357b3277152 commit 0dca802e9148d0026d315dd09e6df8e7a9510927 Author: Rahul Mahadev Date: Fri May 7 18:59:31 2021 +0000 Speed up Delta Vacuum Suite Speed up the vacuum suite by lowering the parallelParitionDiscovery parallelism tested locally went from ~500s to ~300s Author: Rahul Mahadev GitOrigin-RevId: 0e70eee2f4beeba93218c00a4fc7a228e81f1da6 commit 286b9c1fd05cf998b86ede1a1022de944dbf82ba Merge: d2990624d 405a41183 Author: Denny Lee Date: Fri May 7 17:20:31 2021 -0700 Merge pull request #661 from rtyler/patch-1 Create CODE_OF_CONDUCT.md commit d2990624d34b6b86fa5cf230e00a89b095fde254 Author: Guy Khazma Date: Thu May 6 20:57:09 2021 +0000 [DELTA-OSS-EXTERNAL] [Storage System] Support for IBM Cloud Object Storage This PR adds support for IBM Cloud Object Storage (IBM COS) by creating `COSLogStore` which extends the `HadoopFileSystemLogStore` and relies on IBM COS ability to handle [atomic writes using Etags](https://cloud.ibm.com/docs/cloud-object-storage?topic=cloud-object-storage-upload#upload-conditional). The support for IBM COS relies on the following properties: 1. Write on COS is all-or-nothing, whether overwrite or not. 2. List-after-write is consistent. 3. Write is atomic when using the [Stocator - Storage Connector for Apache Spark](https://github.com/CODAIT/stocator) (v1.1.1+) and setting the configuration `fs.cos.atomic.write` to `true`. In addition I propose the following [documentation](https://docs.google.com/document/d/1ued0rajmIZPZXJZ65uvvUTb088rcsxLl3zct1Y5A4p8/edit) to be added to the [Storage Configuration](https://docs.delta.io/latest/delta-storage.html) page. Closes delta-io/delta#302 Signed-off-by: Tathagata Das Author: Guy Khazma <33684427+guykhazma@users.noreply.github.com> Author: Tathagata Das #21738 is resolved by tdas/7pvlaz2d. 
GitOrigin-RevId: 00d961bad7e2e15521ac51f06a4a101cd5bd925f commit 98c66a64d7ab60fc27ba790d7ef3df82bd76c3cb Author: Christopher Grant Date: Thu May 6 20:48:02 2021 +0000 [DELTA-OSS-EXTERNAL] Create PyPI release Related to delta-io/delta#282, contains code from delta-io/delta#353. This PR contains two changes: - Creates a PyPI release - A Python-only function, `delta.configure_spark_with_delta_pip()` that gets the currently installed version, to be used in an IDE when initializing a spark session, like: `spark_session = delta.configure_spark_with_delta_pip(spark_session_builder).getOrCreate()`. The idea here is to allow the package to be self-referential, avoiding version mismatch problems which can be tough to debug. A few items need to be addressed: - Get a username and password for PyPI. The username will be added in config.yml (), whereas the password will be stored in $PYPI_PASSWORD env variable in circleci. Other notes: - This change only targets IDE workflows. Command-line programs like `pyspark` and `spark-submit` will still need to specify the delta.io package on launch, see [quickstart docs](https://docs.delta.io/latest/quick-start.html#pyspark). Closes delta-io/delta#659 Co-authored-by: Nikolaos Tsipas Signed-off-by: Tathagata Das Author: Tathagata Das Author: Tathagata Das Author: Christopher Grant #21456 is resolved by tdas/htj80y82. GitOrigin-RevId: 138065a1d1659bd3bc5c303868c15d1b2747655b commit 1247307c9eb04f9b863a85e4a780a8ee13b81c7e Author: Shixiong Zhu Date: Thu May 6 18:40:18 2021 +0000 Minor refactoring to GeneratedColumns Author: Shixiong Zhu GitOrigin-RevId: b063a0d34b833df4f5fb999d0a49719890cd3b3d commit 789e5a40c8ba7d99a8d749e177bc75824e4442b8 Author: sabir-akhadov Date: Tue May 4 14:08:23 2021 +0000 Minor improvements New tags and copy method Author: sabir-akhadov GitOrigin-RevId: 2490fcef902cdedcb8ec59c4f1210c9fe39f3cdb commit 405a4118317892a43c1233cf06d4d8296f82a6a7 Merge: a2ed67859 8828aa537 Author: R. 
Tyler Croy Date: Thu May 6 11:41:16 2021 -0700 Merge pull request #1 from rtyler/patch-2 Patch 2 commit 8828aa537c65ccc406ce353c0375b4b3d6439d81 Author: R. Tyler Croy Date: Thu May 6 11:40:32 2021 -0700 Reference the code of conduct in the readme commit a2ed67859552b13b9b5ec56b8c801052b2d340e3 Author: R. Tyler Croy Date: Thu May 6 11:39:16 2021 -0700 Spell good commit e0f4cb535018ae6914735e7a3349653fcd6cd302 Author: Meng Tong Date: Mon May 3 14:24:16 2021 +0000 [SC-76491] Minor refactor Author: Meng Tong GitOrigin-RevId: 329ef04133858b76c512dbdd372dad2cee5fc6e5 commit 52f55ca6dbb3f56c70d2d0f85eeefb3d66aa9231 Author: Shixiong Zhu Date: Thu Apr 29 23:49:11 2021 +0000 [SC-71820] Support Generated Columns in update command Currently, when a source column of a generated column is updated in UPDATE command, we don't update the generated column accordingly. For example, given a table that has a generated column `g` defined as `c1 + 10`. For the following update command: ``` UPDATE target SET c1 = 100 ``` We will copy the old value of `g` . This is not correct, and it will fail because of the constraints we apply for generated columns. The correct approach should be updating `g` to `110` according to the new value of `c1`. This PR updates `PreprocessTableUpdate` to generate update expressions for generated columns automatically when the user doesn't provide update expressions. New unit tests Author: Shixiong Zhu GitOrigin-RevId: 5b6d3c5d37439d18b158269c2165e1b12c068209 commit b7cb203dd23a9a225fbcc414bc075916bd32a817 Author: Meng Tong Date: Thu Apr 29 20:29:55 2021 +0000 [SC-18397] LogStore public API This PR adds the new LogStore API designed to be a public API. The existing LogStore API is not changed. The new API is based on the existing API but we cleaned up a few places to make it simpler and easier to work with. As the new API and old API are not binary compatible, this PR also adds `LogStoreAdaptor` which adapts all implementations of new API to existing API. 
All implementations of existing LogStore API will continue to work as they are today without any changes. Author: Meng Tong GitOrigin-RevId: b389717c28fed9ea4ffbfdd999dc4265615270af commit 388448d4d70163d4636e372a41f31c5cdabc2ab0 Author: Yijia Cui Date: Thu Apr 29 06:28:58 2021 +0000 [SC-75743] Improve test Author: Yijia Cui GitOrigin-RevId: 459ce29075cddf7436ba4f67a783d2314a0445bd commit 57ef985f02f0071189d78d8b7fcae1252e527435 Author: Vijayan Prabhakaran Date: Tue Apr 27 20:18:09 2021 +0000 [SC-76061] Add better error message for concurrent write failures on empty path When two concurrent writers write to an empty path for the first time, the 2nd writer would fail because of ProtocolChangedException. In particular, the first writer would have updated the reader and writer version of the protocol and the second writer (even though could be updating the same reader-writer version) would consider this as a transaction conflict and throw a ProtocolChangedException error message. This change adds an additional information to the error message that this was due to concurrent transactions writing to the same empty path. 
Author: Vijayan Prabhakaran GitOrigin-RevId: 96e85c3363fd04181f7880e5907457b14b35b197 commit 6890c72e3eaa36cfab5fa0af50707c61bddcfd51 Author: Meng Tong Date: Mon Apr 26 16:31:32 2021 +0000 [ES-71465] Fix flaky test Author: Meng Tong GitOrigin-RevId: 991b1d5537abe52eca2739a86746e8b2e75a0a17 commit 96344cfe47d327e19eeee6d2bfd92d9e47fe477e Author: Shixiong Zhu Date: Sun Apr 25 07:29:25 2021 +0000 [ES-89890] Improve error message Author: Shixiong Zhu GitOrigin-RevId: 4cfe588476481e497364147c1f0429a5e5121728 commit 9532c28be7bc9885e71f055b0dd2398e25b9a13b Author: Wenchen Fan Date: Fri Apr 23 04:48:02 2021 +0000 [ES-80483] Minor refactor Author: Wenchen Fan GitOrigin-RevId: 759f76862fd4ff38c674dc56330cdbb9daad68bf commit 0efb2d336ea9ab99358e4e3cce07b8d1fff55030 Author: Meng Tong Date: Wed Apr 21 23:12:03 2021 +0000 [SC-76113] Refactor checkpoint() Author: Meng Tong GitOrigin-RevId: c18fd13bd6417c4c933f8da3304ec3b7cd3c7c5c commit df8a83c57a41ce5c387f18fb1274ea7d88ef9088 Author: Ali Afroozeh Date: Fri Apr 16 18:26:46 2021 +0000 [AUTO][SPARK-34989] Refactor tree transformation This PR is mainly a refactoring and passes existing tests. Author: Ali Afroozeh GitOrigin-RevId: b5d7f16d250349d1480d56ab46ed16cf91d896bd commit 8eff03939380df0c711f72a630cf2f97f6663aed Author: R. Tyler Croy Date: Mon May 3 16:59:16 2021 -0700 Create CODE_OF_CONDUCT.md Noticed while preparing some Delta community work around Data and AI Summit that this project doesn't actually have a defined code of conduct. 
Pilfering the one we're using for delta-rs with just the Enforcement/reporting section changed, commit 34002d3349dcb2cac38fcc44b89d584fffa63751 Author: Pranav Anand Date: Fri Apr 16 02:28:50 2021 +0000 [Delta OSS][SC-74312] enable update and delete on temp views - Enable temp views with Spark 3.1.1 for Delete and Update - Tests in Update and Delete Author: Pranav Anand GitOrigin-RevId: e3d34029be093ae960e7c3de0abca83bfedcc9e6 commit af8d3219e25f66f130461d645648a7cca0585794 Author: Yijia Cui Date: Thu Apr 15 06:47:16 2021 +0000 [SC-68794][Delta]Add test for aggregate expression for DML Add unit test for aggregate expression. Unit test. Author: Yijia Cui GitOrigin-RevId: 77636d544abfb53dda95b9dffcc3bb15474e38cc commit d0f0257d1c59f7acd9f785d03aed25ec2d1fda03 Author: Yijia Cui Date: Tue Apr 13 23:59:26 2021 +0000 [SC-42766][Delta]Make concurrent exceptions as public apis in python Make concurrent modification exceptions related apis public in Python. Unit tests. Author: Yijia Cui GitOrigin-RevId: 0a22e06162b1ee6747d0b55da1871fa1f142b56d commit 926e30da6b6e86448cdb4739a082e0b566eec129 Author: Tathagata Das Date: Tue Apr 13 20:38:32 2021 +0000 [SC-75485] Update Delta oss to publish directly to sonatype instead of bintray As the title says manual publish to sonatype staging. Author: Tathagata Das Author: Tathagata Das GitOrigin-RevId: e4d76cf07334e20dd0ef4238430690944df01189 commit b4897984f7cfc9b3b2e6215042aa42069e444d50 Author: Tathagata Das Date: Mon Apr 12 21:25:13 2021 +0000 [SC-74853] Convert Delta OSS into a multi module SBT project Making Delta a multi-module project will enable us to add other sub-modules. For example, we can then add a contribs sub-module that can have contributions from the community that needs to be very closely tied to the delta-core project (hence in this repo, and not delta/connectors) but does not have the same level of maturity as delta-core. 
Changes made in the Delta repo - Moved all files in root/src/ to root/core/src/ - Updated build.sbt to multiple modules - Removed dependency on spark-packages. existing tests Closes https://github.com/delta-io/delta/pull/644 Author: Tathagata Das GitOrigin-RevId: 68038d27302e82f6e680fe717633109757e48ba0 commit 7b708242353b2fe4a8498dbf06dd92ddf6538c5a Author: Jose Torres Date: Mon Apr 12 15:32:44 2021 +0000 [ES-76172][DELTA] Fix MERGE INTO evolution for partial updates of nested columns. Fix MERGE INTO evolution for partial updates of nested columns. We need to pass the flag to permit struct evolution, even though the UPDATE operation doesn't actually reference the new columns, because generateUpdateExpressions will implicitly generate them in order to produce one update action per target column. new unit test Instead of throwing an error, this use case will work Author: Jose Torres GitOrigin-RevId: 5a4b68082bb329822d4361b7e4a764ef061cf878 commit 10df749012cf29d4881d2a4d54017e00fb752f07 Author: Shixiong Zhu Date: Fri Apr 9 23:12:20 2021 +0000 [SC-74439] Minor refactoring in GeneratedColumns Author: Shixiong Zhu GitOrigin-RevId: ef04d52ba1110134cb0eccf54d77342705c2c6f8 commit a4d3da4daed0e9630843157423877688a2cf36b0 Author: Jose Torres Date: Fri Apr 9 01:35:55 2021 +0000 [SC-75290][DELTA] Drop cdc field from the checkpoint file. Drop the cdc field from the checkpoint file. (Note that the actual CDC actions are already filtered out of the snapshot state in InMemoryLogReplay - right now this column is always null.) new unit test Author: Jose Torres GitOrigin-RevId: 8567d09a99b8fbba053930dff5695e0e67238961 commit b54095330e428307bfb4d21bad48f577bef48bd7 Author: Jose Torres Date: Thu Apr 8 18:26:05 2021 +0000 [SC-75186][DELTA] Use "change data feed" in the public-facing protocol incompatibility message. Use "change data feed" in the public-facing protocol incompatibility message.
n/a Author: Jose Torres GitOrigin-RevId: 219e33773ce7f06b56d879aff521f544395f1ecb commit 942cdfc9c77558d8d2f554c30b8c4a3242dbe7f7 Author: Rahul Mahadev Date: Thu Apr 8 16:00:46 2021 +0000 [SC-75235][DELTA] Minor Refactoring in Vacuum. Author: Rahul Mahadev GitOrigin-RevId: f4210034290eb7c9ab6cca69529c555ef37d9819 commit 9e3240f40b17330f04e7a946521fb78b39139f15 Author: Yijia Cui Date: Wed Apr 7 21:14:19 2021 +0000 [SC-42766][Delta] Make concurrent modification exceptions related apis public in Scala Make concurrent modification exceptions related apis public in Scala. Unit tests. Author: Yijia Cui GitOrigin-RevId: a47a79db4a5d5bb799dc3510a41d7a0777eb54de commit 5c0a85e1d282a0c6ccc6cc7a85735c6588f9ce77 Author: Jose Torres Date: Wed Apr 7 20:06:10 2021 +0000 [SC-59632][DELTA] Add a test for MERGE INTO schema evolution for unlimited clauses Schema evolution didn't originally work with unlimited clauses. We need a test to prevent this from regressing in the future. n/a test only PR Author: Jose Torres GitOrigin-RevId: 6ba537a531fe591b8fbb8f2a1e03fc242c8f88ab commit 63aba97249076032e3018f9bde29dd8907b28408 Author: Jacek Laskowski Date: Wed Apr 7 19:21:21 2021 +0000 [DELTA-OSS-EXTERNAL] [MINOR] Remove PreprocessTableDelete.toCommand Signed-off-by: Jacek Laskowski Closes delta-io/delta#641 Signed-off-by: Yijia Cui Author: Jacek Laskowski GitOrigin-RevId: d03067231e3b2f73fc32d76f0feb19622d9968b4 commit 2c99ace19185c956abc40c7611fd73d4051c4d9f Author: Ali Afroozeh Date: Wed Apr 7 18:03:20 2021 +0000 Minor refactoring in deltaConstraints. 
Author: Ali Afroozeh GitOrigin-RevId: 2805d80e6cc1953c45117c1a111bf66de805006c commit 4f08f24037e5f7ad3ee1db301a36ca3c22bdd287 Author: Tathagata Das Date: Wed Apr 7 17:34:58 2021 +0000 [SC-70829][Delta] Prevent DeltaMergeInto plan generated by Scala API from being resolved twice If, in any way, a DeltaMergeInto generated from the Scala API with an `updateAll()` and schema evolution enabled goes through reference resolution twice, it can throw errors because - The first resolution will expand `star` in the plan to `x = source.x` assignments for all columns in source. This may include columns that are in the source but not yet in the target. - The second resolution, currently, can now throw an error because it does not know The obvious way to solve this is by making the resolution idempotent - it will undergo reference resolution only if `plan.resolved` is false. However, it does not handle rare corner cases. The Scala API can generate DeltaMergeInto where all the expressions are already resolved. If we add the conditional check for plan.resolved, then in those cases with pre-resolved expressions, DeltaMergeInto may never go through the reference resolution phase and skip a lot of additional checks besides resolution. This can cause incorrect plans containing target column names that are wrong - since the target column names are stored as Seq[String] and not expressions, plans containing all resolved expressions but incorrect column names will be considered as already resolved. To get around this, the solution in this PR is to add a boolean field `targetColNameResolved` that explicitly represents whether the target column has gone through resolution or not. This `targetColNameResolved` is considered in `expression.resolved` and is set to false by default when generated by the Scala API. This forces all plans to go through the resolution phase as the DeltaMergeInto.resolved will always be false even if all the expressions are resolved.
In the resolution logic, the checks on the target columns will be done only when `targetColNameResolved` is false, and after the check, it will be set to true. This makes the checks robust on multiple passes - once the star has been expanded to columns and the boolean is set to true, future passes will skip checks. Note: The ideal solution here is to unify SQL and Scala code paths by Scala generated MergeIntoTable which in one shot gets converted into fully resolved DeltaMergeInto thus eliminating possibilities of another resolution attempt. This would be the ideal solution but we cannot do that now because the Assignment class in MergeIntoTable cannot differentially represent `Assignment()` and `Assignment()`. This is important because the Scala API can generate the latter (not the SQL API). This needs to be fixed in Apache Spark so will not be available until Spark 3.2. I have added this contextual information as inline docs for future development. Added a test with a function that failed without this change. Author: Tathagata Das Author: Tathagata Das GitOrigin-RevId: d72339b7b48671016b919f5e0f5bb268732fbc68 commit 96a7713c764417259cfa7c3c7bac50131e66c3b2 Author: Jacek Laskowski Date: Wed Apr 7 16:38:49 2021 +0000 [DELTA-OSS-EXTERNAL] [MINOR] Make end boundary of history optional (undefined) Since `None` is used for the end boundary of a delta table history it could also easily be "transferred" up the call chain and be the default input value. That's the purpose of the PR. Signed-off-by: Jacek Laskowski Closes delta-io/delta#633 Signed-off-by: Yijia Cui Author: Yijia Cui Author: Jacek Laskowski #20448 is resolved by yijiacui-db/1oyk8ltj. GitOrigin-RevId: dfda5f3ed910ba3b607111bed87009caf19ba8fe commit 9063677e88bf5aff0c46bcf59a4fc1f429f6138e Author: Meng Tong Date: Wed Apr 7 16:16:42 2021 +0000 [SC-69797][Delta] Remove unused fields. 
Author: Meng Tong GitOrigin-RevId: 13ccf8974ae9df9a142dd7a87d1ee3c902327b0c commit 83c824de2691a31f4bdf2e375f8597c76e09b593 Author: Yijia Cui Date: Wed Apr 7 07:43:42 2021 +0000 [SC-73942] Enable show create table test in OSS and Make it pass. SHOW CREATE TABLE isn't supported in Spark 3.1. We should catch the exception in the unit test instead of expecting a correct result. Unit test. Author: Yijia Cui GitOrigin-RevId: fc94772eb136d48f748fa37d5cca9879027e46cf commit c6d29e960754895314433daf88372c6399b1e5d1 Author: Wenchen Fan Date: Wed Apr 7 05:55:40 2021 +0000 Improve error handling in tests. Author: Wenchen Fan GitOrigin-RevId: f274112da57ebd3397d481b8525584542686b6d0 commit 8c92b736ce6bce4b69f1753a22b04e65cb37576c Author: Shixiong Zhu Date: Tue Apr 6 23:08:27 2021 +0000 [SC-74546] Fix the generation expression propagation issue for Generated Columns Currently we store the generation expressions in the column metadata of the table schema. However, when Spark reads the schema from a table, it will also propagate column metadata to downstream operations. For example, let's say table X is a table that contains generated columns. The following command will create a new table whose column metadata contains the generation expression. This happens in all DBR versions when reading generated column tables. ``` CREATE TABLE Y AS SELECT * FROM X ``` This is not expected. This PR removes the generation expressions from the column metadata before giving the schema to Spark, so that the generation expressions won't be leaked to downstream operations. However, for old DBR versions, especially the EOS versions, the metadata propagation behavior still exists. But since old DBR versions that don't support generated columns have an old writer version (they don't support writer version 4), we can change the definition of generated columns to: A table has generated columns only if its min writer version >= 4 and some of its columns contain generation expressions in the metadata.
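As a rough illustration of that definition, the check can be written as a small predicate. This is a sketch, not the Scala code from the PR: the `has_generated_columns` helper and the dict-based schema representation are hypothetical, and the metadata key `delta.generationExpression` is assumed rather than taken from this commit message.

```python
# Assumed metadata key under which a column's generation expression is stored.
GENERATION_EXPRESSION_KEY = "delta.generationExpression"
# Writer version threshold stated in the definition above.
MIN_WRITER_VERSION_FOR_GENERATED_COLUMNS = 4


def has_generated_columns(min_writer_version, schema_fields):
    """Return True only if both conditions of the definition hold.

    `schema_fields` is a list of dicts, each with an optional "metadata" dict
    (a simplified stand-in for the table schema).
    """
    # Tables written by old writers are treated as normal tables, even if
    # generation expressions leaked into their column metadata.
    if min_writer_version < MIN_WRITER_VERSION_FOR_GENERATED_COLUMNS:
        return False
    return any(
        GENERATION_EXPRESSION_KEY in field.get("metadata", {})
        for field in schema_fields
    )
```

Under this sketch, a table created via `CREATE TABLE Y AS SELECT * FROM X` by an old writer would carry the leaked metadata but still return `False`, because its writer version is below 4.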
With this definition, tables containing generation expressions but created by old DBR versions will be treated as normal tables, and none of generated column code path should be triggered when reading/writing such tables. New unit tests. Author: Shixiong Zhu GitOrigin-RevId: e3116c0d16c9f868aba03326efd82b72f7971b2c commit 59f79968f99ad207ef39807f1026aec221f32490 Author: Yijia Cui Date: Tue Apr 6 21:43:23 2021 +0000 [SC-66811][Delta] Test window functions in DML commands. Expect analysis exception for window functions in merge and update. Unit tests. Author: Yijia Cui GitOrigin-RevId: a407868f4d87e8d82119796aa3742edd5a8438ec commit 7d3a7f82f30709a07f7f545c21f3c8d9f7ebf3a3 Author: Jacek Laskowski Date: Tue Apr 6 15:48:11 2021 +0000 [DELTA-OSS-EXTERNAL] [MINOR] Use Literal constants That should lower memory requirements (as no extra objects are created) and improve readability Signed-off-by: Jacek Laskowski Closes delta-io/delta#638 Signed-off-by: Meng Tong Author: Jacek Laskowski #20440 is resolved by mengtong-db/gnggx483. GitOrigin-RevId: 1bc6619592a6bbb456112933f0181ed6e61078d2 commit c19b1599571e7c2009fed03acece09db8aa8a2f0 Author: Tathagata Das Date: Mon Apr 5 21:50:06 2021 +0000 [SC-74809][Delta] Minor Refactoring in DeltaOperations Author: Tathagata Das GitOrigin-RevId: d7f6bbc3168107eaba2942826bb945c9f6757737 commit 1b920604f7d38f56d72e541822dad4e826e73900 Author: Antonio Date: Mon Apr 5 18:03:38 2021 +0000 [DELTA-OSS-EXTERNAL] Typo in Python docs for DeltaMergeBuilder I know it's a tiny thing, but someone have got to fix it someday :) Closes delta-io/delta#596 Signed-off-by: Yijia Cui Author: Antonio #20377 is resolved by yijiacui-db/6zl3xm8p. 
GitOrigin-RevId: b3b80dc7215f0fc3fcc216cdb766fe66af816304 commit 3c0d614cca929d6ffe953c883127657303544e8d Author: Jacek Laskowski Date: Wed Mar 31 21:43:08 2021 +0000 [DELTA-OSS-EXTERNAL] Use deltaLog.tableExists for table availability Signed-off-by: Jacek Laskowski Closes delta-io/delta#637 Signed-off-by: Meng Tong Author: Jacek Laskowski #20259 is resolved by mengtong-db/n91aik90. GitOrigin-RevId: 1f5defe5a0da23234a35af929274a03e282c5c33 commit 262c6c651137278b09f017a2e9565c5733145ed1 Author: Rahul Mahadev Date: Wed Mar 31 14:16:14 2021 +0000 [SC-74667] Rename Change Data Capture to Change Data Feed Rename Change Data Capture to Change Data Feed Ran existing tests Author: Rahul Mahadev GitOrigin-RevId: e74e8f0d15e9ccdd733522c53a53fcc79eea90a0 commit 161ce093d0d5597034df4bd50e1274ed68379527 Author: Shixiong Zhu Date: Tue Mar 30 19:01:07 2021 +0000 [SC-63352] Refactor for generated columns Author: Shixiong Zhu GitOrigin-RevId: 4f7d0a28f25f3fcfe7097402f7398e9044139526 commit e7557202b30f5659eeaa4e2b45ba8092498b2aeb Author: Meng Tong Date: Sun Mar 28 00:52:21 2021 +0000 [SC-69797] Refactor around bucketing Author: Meng Tong Author: Meng Tong <77353730+mengtong-db@users.noreply.github.com> GitOrigin-RevId: 94ea828441cd90856841af4e8c76f75bd485f6b4 commit 22e473908b140e5c91ec9a51cbe729295228a0b0 Author: Pranav Anand Date: Fri Mar 26 20:11:51 2021 +0000 [SC-69842] Migrating Delta to use Spark 3.1.1 ## What changes were proposed in this pull request? - Migrate to use Spark 3.1.1 adding tests and refactors ## How was this patch tested? - Existing tests Author: Pranav Anand #19552 is resolved by pranavanand/pa-delta311migration. 
GitOrigin-RevId: fd3b86468f07752fbd3d56f653aec683af05d0b4 commit 73ade72c4c03fb76fff28ed8b6b60ebeee8389fd Author: Stefan Zeiger Date: Fri Mar 26 16:49:22 2021 +0000 Revert "[SC-59632][DELTA] Add a test for MERGE INTO schema evolution for unlimited clauses" Author: Stefan Zeiger GitOrigin-RevId: 5609916eafa039d5c76f5931ed38dc612cb5c231 commit 06e28f55cccd44012dd7acc3b5743bc9195bbd0e Author: Rahul Mahadev Date: Fri Mar 26 15:40:20 2021 +0000 [SC-74019] Refactor getStartingVersionFromTimestamp Author: Rahul Mahadev GitOrigin-RevId: e7e94e70ff2795a81ec73f12ba0a91ccf730056c commit 34566e02a4cef530df42dabc29fbb5fa945fbe1f Author: Lars Kroll Date: Fri Mar 26 12:36:07 2021 +0000 [SC-69552] Minor refactor. Author: Lars Kroll GitOrigin-RevId: 33a4fcdf50af72096e800bc5b2b4cc45476cb735 commit 64e8008e936c87e98a7966599f74c8b4982ec616 Author: brennerh1 Date: Fri Mar 26 02:43:42 2021 +0000 [DELTA-OSS-EXTERNAL] Add Delta Lake cheat sheet Add Delta Lake cheat sheet to `/examples/cheat_sheet/`. Closes delta-io/delta#628 Co-authored-by: Brenner Heintz Signed-off-by: Meng Tong Author: brennerh1 <65046554+brennerh1@users.noreply.github.com> GitOrigin-RevId: 1086666e5ccf56841f8c5e32d94af61ca14913ff commit 62d8544c2dbee0cefefd32834d98069a2adf8a7b Author: Jose Torres Date: Fri Mar 26 01:51:07 2021 +0000 [SC-59632][DELTA] Add a test for MERGE INTO schema evolution for unlimited clauses Schema evolution didn't originally work with unlimited clauses. We need a test to prevent this from regressing in the future. Right now there's just a Scala suite test, since the unlimited clauses test harness doesn't support evolution and the evolution test harness doesn't support unlimited clauses, so it's complicated to write a test that the ACL and CDC extensions of the SQL merge suite will correctly be able to skip. 
new unit test Author: Jose Torres GitOrigin-RevId: 4e492459966fafb23d1d5b3d4fea95656d02cf55 commit bc6992fb3995cc47e87ee3444dff405b5c7630a6 Author: Wenchen Fan Date: Wed Mar 24 15:48:42 2021 +0000 Refactor in DeltaAnalysis Authored-by: Wenchen Fan (cherry picked from commit be888b27edfbb0d7ebb2265de1bf74acb8d3d09a) Author: Wenchen Fan GitOrigin-RevId: ff2bea17c03ac092693c1db4610d800889b8ea49 commit 4e88d4f075b2e0359e54f6ebf8cbc592930f87ac Author: Howard Xiao Date: Tue Mar 23 22:15:22 2021 +0000 [SC-70743] Add ability to incrementally list from a particular file in DeltaFileOperations Added two methods, `recursiveListFrom` and `localListFrom`, into `DeltaFileOperations`. Currently, the only way to list files is to specify a directory path, then all files under that directory will be listed. This is wasteful if we only want _new_ files after a certain filename, instead of needing to re-list the entire directory and filter thereafter. These two methods allow you to specify a directory name and a path (in that directory) from which to list from. Then, taking advantage of `LogStore.listFrom`, only files with filenames after the specified path will be returned. - Added tests in `DeltaFileOperationsSuite`. Author: Howard Xiao GitOrigin-RevId: af5712e60224835cf00e247c6a9f740e04126968 commit 09e96febf35a33f0f9aee0acb9e71d464bb0f6ab Author: Gengliang Wang Date: Tue Mar 23 16:25:09 2021 +0000 Refactor test and error message Author: Gengliang Wang Author: Gengliang Wang Author: Wenchen Fan Author: Wenchen Fan Author: Kris Mok GitOrigin-RevId: 1d6c55bcd5232af4a0a486dfc3bb97afcc439f1f commit d77b0bfff01ec02816b29aa2b4a60ac2647d0323 Author: Gengliang Wang Date: Tue Mar 23 13:08:32 2021 +0000 [SC-72790] A new approach to extract all metadata predicates in Delta Converting all the predicates into CNF may result in a very long predicate and the codegen become unnecessarily large. 
We should follow the approach of https://github.com/apache/spark/pull/29101, which extracts all convertible predicates gracefully. Author: Gengliang Wang GitOrigin-RevId: 9ecedbd83f85b8235262c3de0bd83b4540cbc560 commit 1912ac5b6d4f3742d7d0d21b52e03846dd5e8683 Author: Denny Lee Date: Tue Apr 6 09:20:29 2021 -0700 Rename AUTHORS.md to AUTHORS commit e4c85e64a7f9e70f7b70d14d63cf03c2a86152d9 Author: Rahul Mahadev Date: Mon Mar 22 17:27:21 2021 +0000 [SC-71855] Improve vacuum logging Improve vacuum logging Add unit test Author: Rahul Mahadev GitOrigin-RevId: 44ffdb72030de6ac6aadb9590238effe43bbaf4d commit 6382dea1be365aec2c52f8446363d3b2facb7420 Author: Shixiong Zhu Date: Fri Mar 19 17:56:50 2021 +0000 [SC-74152] Disallow column type change in Generated Columns Currently Delta allows some safe type changes, such as from SMALLINT to INT. However, this may break the generated column contract. For example, let's say we have a column c1, and a generated column c2 defined as `CAST(hash(c1 + 32767s) AS SMALLINT)`. When c1's type is SMALLINT and we insert `32767s`, the expression will return 31349, but if c1's type is INT and we insert `32767`, the expression will return 9876. This means changing the column type may require rewriting the existing data. But since that is too heavy, we can simply disallow it. New unit test Author: Shixiong Zhu GitOrigin-RevId: 93a0475fed83ec6a751e4b05902aec3fa71410a5 commit 7fa9cc0533744b9d325788a8ff550f93aab04029 Author: Meng Tong Date: Thu Mar 18 17:36:59 2021 +0000 [SC-65677] Avoid checkpointing the same version Before this PR, when we create a checkpoint, we use the snapshot established at the transaction start and create a checkpoint for that version. This might cause multiple transactions to checkpoint the same version and potentially lead to corrupted checkpoint status. In this PR, we change to checkpoint the version committed by this transaction to avoid such a scenario.
The downside of this approach is that we pay extra cost to get the new snapshot occasionally. We expect this is a rare case and we can tolerate the cost. Existing tests. Author: Meng Tong GitOrigin-RevId: 1858f58f69e0618924709631b67f81fe1d0b863d commit c48c42d7ff24339fa5de4b89f7b18147ce2b68cb Author: fvaleye Date: Thu Mar 18 00:00:20 2021 +0000 [DELTA-OSS-EXTERNAL] Add missing fields in the RemoveFile of the protocol specification Add missing fields in the RemoveFile of the protocol specification Closes delta-io/delta#613 Signed-off-by: Rahul Mahadev Author: fvaleye #19691 is resolved by rahulsmahadev/yhhem4p5. GitOrigin-RevId: bf3b646a83d830be1b04d83e2cdb566f744dfd39 commit 571b4abef33238f47e4dc5356fa1616809565b15 Author: liwensun Date: Tue Mar 16 20:44:48 2021 +0000 Refactor methods in SchemaUtils Author: liwensun GitOrigin-RevId: d48e075d5b6cdc55bf2b139016b3399ecfccf10c commit 3dee337a349e696d7afb23caa547b095076ad02b Author: yaohua Date: Tue Mar 16 19:45:27 2021 +0000 Improve comments Author: yaohua GitOrigin-RevId: bd103eeeb672424b8133aea732b9d66381f4d77f commit 88357fb2d4a1b125aa95090aa83e0cd14d9f6b0a Author: Tathagata Das Date: Tue Mar 16 13:58:50 2021 +0000 Improve evolvability test Author: Tathagata Das GitOrigin-RevId: 1cc8f84c0c5a3c04910feed934d3ad66b869ae77 commit a361a05059153c9cb5818ac00179994b0e66f386 Author: herman Date: Mon Mar 15 16:37:08 2021 +0000 [SC-68592] Refactor Delta invariant check Author: herman GitOrigin-RevId: b03bd9be547625516b0cef80522322af85a4d05a commit 7cb511599b62192a47a24fb179999896b34a9802 Author: Tathagata Das Date: Sat Mar 13 00:09:05 2021 +0000 [SC-72276] Add a config and test This PR adds a test so that we can detect https://github.com/delta-io/delta/issues/618 Author: Tathagata Das GitOrigin-RevId: 1af03b03f64c607c8f61b41eef678f3a72355ad7 commit 26e217e4c7538211985aa7c37fe49a763a84aae4 Author: Linhong Liu Date: Fri Mar 12 02:30:17 2021 +0000 Improve merge test Author: Linhong Liu (cherry picked from commit 
3d79e78ee2fd05936ffb87b67ec1039e26257ba5) Signed-off-by: Ubuntu GitOrigin-RevId: 6524e412ea882e46e3d15db4a9ba22eee6eec125 commit b76223b375cbff54f3bc31b3c1086871041e8389 Author: Denny Lee Date: Mon Apr 5 09:10:45 2021 -0700 Create AUTHORS.md (#83) commit 331c2af50dc3a2746f72f58cb896a5459caebff1 Author: Rahul Mahadev Date: Thu Mar 11 19:30:50 2021 +0000 Fix CircleCI pyspark version to 3.0.1 ## What changes were proposed in this pull request? We did not update the CircleCI pyspark version to 3.0.1 when we changed it in the DockerFile. This PR fixes that ## How was this patch tested? This OSS PR - https://github.com/delta-io/delta/pull/617 build run is proof that this works Author: Rahul Mahadev #19475 is resolved by rahulsmahadev/fixCircleCI. GitOrigin-RevId: 596e2b3dc281acc7c1994b87a924cac2de16b685 commit 184317a6195b1f2367d599324146db4205d566a1 Author: Denny Lee Date: Thu Mar 11 17:57:41 2021 +0000 [DELTA-OSS-EXTERNAL] Update README.md Updating Slack channel invite Closes delta-io/delta#615 Signed-off-by: Rahul Mahadev Author: Denny Lee #19431 is resolved by rahulsmahadev/227y60x6. GitOrigin-RevId: d78b14c3ae488a08ecfa22cb359e34b6b9acda27 commit 4f02e9d5a917f7d450623d9ba2537ed93701df64 Author: Burak Yavuz Date: Thu Mar 11 01:12:23 2021 +0000 [SC-50579][DELTA] Refactored operation metrics for UpdateCommand Refactored operation metrics for UpdateCommand Author: Burak Yavuz Author: Burak Yavuz GitOrigin-RevId: 68e4c959dd63a8630f44a5da5d45c881e37d2e1d commit cda67f242df82594fddc75b953c36097d9d70c66 Author: Jose Torres Date: Wed Mar 10 22:53:22 2021 +0000 [SC-72103][DELTA] Add new metrics to CommitStats. Add new metrics to CommitStats. Currently return 0 as there is no CDC implementation. 
Author: Jose Torres GitOrigin-RevId: 9a35989028ade08a13b20cdecf0ec9d9217bfe1d commit 5d394c649c44b1b1e98b5594b349bf477b94b09f Author: Meng Tong Date: Wed Mar 10 05:43:54 2021 +0000 Add test for when directory is deleted before first DataFrame is defined Add a test to DeltaSuite As titled. Author: Meng Tong GitOrigin-RevId: 91f206ed572b39a831e964aa9f3600a202580980 commit 3e0885618524d93f43c382763f4ec13c6a081893 Author: Yijia Cui Date: Tue Mar 9 21:54:19 2021 +0000 [SC-70419][Delta] Pass the correct table identifier for the Delta commands such as Vacuum / History / Generate Pass the correct identifier for Vacuum / History / Generate Unit tests. Author: Yijia Cui GitOrigin-RevId: 87ef6249c34ef55e99f6a4a8e1612f88f23f43fa commit 09816195c44c59d6697ba348f6e0ce183bef3eed Author: yaohua Date: Tue Mar 9 17:57:02 2021 +0000 [SC-69133][Delta] Minor refactor of CreateDeltaTableCommand Minor refactor of CreateDeltaTableCommand Author: yaohua GitOrigin-RevId: ab0751dd9d8139950e342fc3dae7ddf551e0397b commit 864c20cd59d06038adea4a43ee93dabbd51d4c75 Author: Shixiong Zhu Date: Mon Mar 8 22:31:02 2021 +0000 [SC-70637]Remove unused AppendDeltaByName Remove unused `AppendDeltaByName` added by #19035. Found this when backporting the change. Jenkins Author: Shixiong Zhu GitOrigin-RevId: 67b020145852b5d2257f0c46969118b7e505b5a7 commit 4ab1c87e4635152777c12e2a1ecbaf36ae42bad3 Author: Shixiong Zhu Date: Mon Mar 8 09:06:26 2021 +0000 [SC-72013][WARMFIX]Support complex type extractors in Generated Columns This PR allows users to use complex type extractors such as `a.b` or `array` in generation expressions.
Author: Shixiong Zhu GitOrigin-RevId: 7f667fa4f2a6c964b7977b3c435058f4b597bea8 commit 6b2e91506835d3022a58ea8d0adf9913d4e07a6b Author: Shixiong Zhu Date: Fri Mar 5 21:27:37 2021 +0000 [SC-70637] Minor refactor in DeltaAnalysis and add more tests for generated columns Author: Shixiong Zhu GitOrigin-RevId: b885fe525182215facd966ab16132b522f542e2b commit 448af5ca537f5463842af5d1a0d0adbcdf6081f1 Author: Jose Torres Date: Thu Mar 4 06:09:17 2021 +0000 [SC-71359][DELTA] Minor added new conf Added new conf called POTENTIALLY_UNMONOTONIZED_TIMESTAMPS Author: Jose Torres GitOrigin-RevId: 8b5f6d9353975c88f33b2688fb637cf355f599c5 commit 61a8a9d42f33b706e31e5aaf2672cd3cb36263fa Author: Linhong Liu Date: Thu Mar 4 05:17:41 2021 +0000 [SC-68972] Fix using temp view in Delta DML operations after SPARK-33142, the SQL temp view and permanent view have the same behavior. But this breaks some delta commands when a temp view is used. For example, after SPARK-33142, below queries will fail ```sql CREATE TEMP VIEW v AS SELECT * FROM tbl DELETE FROM v WHERE key = 1 ``` error: ``` AssertionError: assertion failed: No plan for DeleteFromTable ``` ```sql CREATE TEMP VIEW v AS SELECT * FROM tbl UPDATE v SET key = 1 ``` error: ``` UPDATE destination only supports Delta sources. ``` This PR will fix above usage. Newly added tests Author: Linhong Liu GitOrigin-RevId: 76ecfba9155dc3c4aa20504852bfff93d490b083 commit 88c602bcaf3de78359f6eeff7795bd2c0ba079ce Author: Shixiong Zhu Date: Thu Mar 4 03:52:18 2021 +0000 [SC-71676]Support backtick in replaceWhere When a user uses backtick in `replaceWhere` option, backtick will be treated as a part of the column name. Then we will fail the query because we cannot find such column in the table's partition columns. This PR fixes it. New unit tests. 
Author: Shixiong Zhu GitOrigin-RevId: 4059f137769f4051d433f7ae2e309ca8405531be commit a7380203767230badc5e78c23aece3efda457e4a Author: Rahul Mahadev Date: Wed Mar 3 20:20:32 2021 +0000 [SC-70742][DELTA] Refactored DeltaSource and split admission control into traits Refactored DeltaSource and split admission control into traits. Author: Rahul Mahadev GitOrigin-RevId: e559d04d38450d9fd8f4817c11a42d8697a76ae7 commit 04930a0e8cc3c135b9f27f7821b88f2c4de8ebfd Author: Shixiong Zhu Date: Wed Mar 3 15:48:23 2021 +0000 [SC-71667][TEST][WARMFIX]Fix table creation in tests of generated columns move createTable method in generated columns test suite to base class. Existing tests. -Other (please describe): Fix unit tests. Author: Shixiong Zhu GitOrigin-RevId: 7117af4d7ee56e7520773d578b5dbfa09701f907 commit 1569d9aa6a3c5a75e3d2abe9c17e0aa7f1acb15d Author: Burak Yavuz Date: Wed Mar 3 00:57:40 2021 +0000 Fix the version of Delta's pyspark builder to 3.0 It's installing 3.1 and breaking builds. Author: Burak Yavuz Author: Burak Yavuz GitOrigin-RevId: 5b8e03f3cce56c05e9e53d317408b20e731ac412 commit c76043800d4a3160419d4fa54d42329269412dc1 Author: Howard Xiao Date: Sat Feb 27 00:36:59 2021 +0000 [SC-70709] Minor refactoring of DeltaFileOperations Author: Howard Xiao GitOrigin-RevId: ef4e9970b0668b0a5812cd1d5c7a5a72f37c5527 commit e8b4bfef2611d56eb510ec9893e5832192dfe0f4 Author: Jose Torres Date: Fri Feb 26 22:12:30 2021 +0000 [SC-71212][WARMFIX][DELTA] Disable CDC writes Disable CDC writes Author: Jose Torres GitOrigin-RevId: ce204de6bfa1d612ac72192efab6af47b32eec9f commit a814a953298aeb7cb50375cc9a8c168facc1f3ef Author: Meng Tong Date: Fri Feb 26 21:22:52 2021 +0000 [SC-70705]Removes the support of partition transform expressions for Generated Columns This PR removes the support of partition transform expressions Modified existing tests. 
Author: Meng Tong GitOrigin-RevId: c431a6c29a5ab8c8211f36fcf8ebfa5b2255069d commit 4500752778e081d689a4eb31b12f89ffd88a2a68 Author: Yijia Cui Date: Wed Feb 24 04:57:21 2021 +0000 [SC-70376] [Delta] Minor refactor of InvariantViolationException Minor refactor of InvariantViolationException to extend RuntimeException Author: Yijia Cui GitOrigin-RevId: f1f6d6033b68484d2cae2db5ece531b4a9c6898b commit ee33a03e1f839db2195acca437e92ce4cf91d003 Author: Jose Torres Date: Tue Feb 23 20:07:44 2021 +0000 [SC-64953][DELTA] Fix RemoveFile deserialization default test. Re-enable the existing unit test com.databricks.sql.transaction.tahoe.ActionSerializerSuite.removefile, which was broken by the upgrade. Author: Jose Torres GitOrigin-RevId: 70d1f997dcfb0881e2643b36910adc475bc707ce commit 85d4334f2f5d4d87d520214991f4dfd07af45483 Author: Shixiong Zhu Date: Tue Feb 23 11:25:40 2021 -0800 Revert "[SC-70376][Delta] Minor refactor of InvariantViolationException" This reverts commit 99ac06bc894ab0659bf0a9b47477b1e64dcfbb0b. 
GitOrigin-RevId: af4fe08326e742ce697079a1eea358db90c9d12c commit 2fbfe7ef206dad411ba65c202b2b74170a17343e Author: Yijia Cui Date: Tue Feb 23 19:18:42 2021 +0000 [SC-70376][Delta] Minor refactor of InvariantViolationException Minor refactor of InvariantViolationException to extend RuntimeException instead of IllegalArgumentException Author: Yijia Cui GitOrigin-RevId: 99ac06bc894ab0659bf0a9b47477b1e64dcfbb0b commit 8cb984e17eaa83e8bb35883f6a1a8d6e40dc010e Author: Rahul Mahadev Date: Fri Feb 19 18:11:11 2021 +0000 [SC-67650][DELTA] Refactor Delta Source - Refactor Delta Source to have DeltaSourceBase - modify IndexedFile to support other FileActions other than AddFile - Existing unit tests Author: Rahul Mahadev GitOrigin-RevId: b32fbc9a15d5a1d759a812cb37692eb2858b954e commit 5f69ca0d2d085b0639fb3e5e8d6ce9166dc558d5 Author: Shixiong Zhu Date: Fri Feb 19 15:52:08 2021 +0000 [SC-70791]Handle null values for constraint of generated column The constraint of generated column is using `EqualTo` which always returns `null` when any child is `null`. This means if the provided value and the calculated value are both `null`, it won't pass the constraint. This PR changes the expression to `EqualNullSafe` to fix the issue. It also addresses the outstanding comment in #18182 The new unit test Author: Shixiong Zhu GitOrigin-RevId: 3538b0a66251ca903f98e21ac35afc67f24a9e29 commit 45835907aedb2032b3e0cf18724456b345ba84e1 Author: Shixiong Zhu Date: Fri Feb 19 02:44:21 2021 +0000 [SC-70383]Fix an issue in `normalizeColumnNames` to handle dots in the column name We need to call `UnresolvedAttribute.quoted` before calling `col` to refer to the column correctly in `normalizeColumnNames`. Otherwise, the field name will be parsed to multiple name parts when it contains `dot`. New tests. 
Author: Shixiong Zhu GitOrigin-RevId: 0738f850b9984ee4c15d8b68596009f73794b173 commit 391a692eb55a32ada7a650acdd95b943572fd4e4 Author: Jose Torres Date: Fri Feb 19 02:04:32 2021 +0000 [SC-69565][DELTA] Fix invalid timestamp parsing in time travel. Add a flag (on by default) to make timestamp parsing "strict", where invalid timestamps will fail rather than being silently converted to unix time 0. This is most relevant for the startingTimestamp streaming option, because direct time travel doesn't assume it's provided with a literal, and any attempt to time travel to unix time 0 already would fail because no Delta tables are that old. new unit test Author: Jose Torres GitOrigin-RevId: ee4c04f883aadb86fbf4ec68cdad0f0b1d98f83d commit 1f7a911cbef12e1e964e2940e92f028b2b67029f Author: Meng Tong Date: Thu Feb 18 07:20:24 2021 +0000 [SC-70676] Deleting delta table directory before first DF is defined leads to empty result This PR fixes the following corner case: ``` spark.range(1,5).write.format("delta").save("/tmp/t1") // Delete directory /tmp/t1 in filesystem spark.read.format("delta").load("/tmp/t1") -> unexpected empty result. ``` In this case we will get an InitialSnapshot in DeltaLog.createRelation for a read query. Before this PR we proceed and produce empty result, which is not desired behavior, as an empty Delta table still need to have the directory and some metadata in it. With this PR we will now error out for this case. Author: Meng Tong GitOrigin-RevId: a296587aa1789b2c9d4be621ed1558213dd95e37 commit 30c8939415b8904f54cc4625e77cf147aed9bda1 Author: Shixiong Zhu Date: Wed Feb 17 21:06:32 2021 +0000 [SC-65351]Generated Columns for Delta This PR adds Generated Columns for Delta. Users can define an expression for a table column to generate values for this column automatically. Here are the major changes: - Support adding generated expressions into the column's metadata using the `delta.generatedExpression` key. 
- When writing to a Delta table, we will check the output. If the output is missing a generated column, we will add an expression to generate it. If a generated column exists in the output, in other words, we will add a constraint to ensure the given value doesn't violate the generation expression. Known limitations: - Update and Merge are not fully supported. Will be fixed in a followup PR New unit tests. Author: Shixiong Zhu GitOrigin-RevId: fad0bc6d71548c30b60997404def67674e4c1dbe commit a056163bff41fdf558c12a053556ff63cfc4bc31 Author: Rahul Mahadev Date: Wed Feb 17 21:05:28 2021 +0000 [SC-69855] Minor refactor in Delta executeGenerate Change delta table operations(generate) to go through sql code path instead of directly calling them. Author: Rahul Mahadev GitOrigin-RevId: 5c2c1fd848a71fee844e4e2ec0c31bd11b766c34 commit 1404c40c6961e04b28826555896c8f5727810ec3 Author: Yingyi Bu Date: Tue Feb 16 17:57:50 2021 +0000 [SC-70528] Reduce contentions on DatasetRefCache Use an atomic update operation to eliminate a synchronization. Existing tests. Verified that the contention on DatasetRefCache doesn't show up Author: Yingyi Bu GitOrigin-RevId: c08e3d7ded8af67f9ff1b886a260a863d4075849 commit 1e7d22a6012d1b37b7ee60e803dc6a45e909ac9a Author: herman Date: Tue Feb 16 11:08:43 2021 +0000 [SC-68684] Minor change that adds name to withNewExecutionId in TransactionalWrite Minor change that adds name to withNewExecutionId Added a new integration test in `DeltaObservableMetricsSuite`. 
Author: herman GitOrigin-RevId: 554640a7d1ea406a27f3565acaad60a67acf02e8 commit 9e3a44df8b313ae9dc981e462e58ad6d2c55604a Author: Yuchen Huo Date: Tue Feb 16 03:13:00 2021 +0000 [SC-67510][CPFS] Refactor Canonicalize path code for actions in Snapshot Refactor Canonicalize path to different method Author: Yuchen Huo GitOrigin-RevId: 4f46e40c3c51fd610446be347f86a2eaa0d9093b commit 469e6b6f88d8a67368632ad21f37e199921a2f99 Author: Sabir Akhadov Date: Thu Feb 11 20:38:43 2021 +0000 [SC-63766] Add insertionTime column to delta log We introduce insertion time tag to the delta log to indicate when the data in files has been originally inserted. This timestamp should be preserved during the lifetime of the data. After merging a group of files, the insertion time of the new file is set to the latest of the insertion times of the original files, unlike the modificationTime which depicts the time the file was written. new unit test Author: Sabir Akhadov GitOrigin-RevId: 7b2bd87b7a17e3cedc3c97c1724304331729f214 commit f6cfca9055bc429b095d0c361c8866911212f624 Author: Jose Torres Date: Wed Feb 10 05:54:48 2021 +0000 [SC-67663][DELTA] Add additional metrics writes that can capture changed rows. Add metrics (both Delta history and usage) for number and total size of Change files generated by DELETE, UPDATE, MERGE. new unit test Author: Jose Torres Author: Tathagata Das GitOrigin-RevId: 2d6341e09a2bec84ee62f1100361a434203a3b80 commit c01396fbc1d4917a41f59ab49637324adb32cb40 Author: Burak Yavuz Date: Tue Feb 9 23:15:47 2021 +0000 [SC-69350][DELTA] Do not store write options in the catalog for Delta Write options such as `replaceWhere` and `mergeSchema` can be stored in the transaction log, as well as the catalog when using the DataFrameWriter with `saveAsTable`. This has been a bug, as write options should not be stored in the transaction log. Nor do we need to store anything in the Catalog for Delta. This PR cleans up these properties from the catalog as well as the transaction log. 
However, there may be users who depend on this behavior. Therefore we do two things: 1. Introduce a legacy flag so that users can revert to the old behavior 2. We continue to store any Delta specific configurations, which are prefixed by `delta.` to the transaction log Unit test Author: Burak Yavuz GitOrigin-RevId: 3bf8db1b25f94096b13855a806a24122871a1585 commit 577f101d42422213aee4418f1b24450aee35b216 Author: Ali Afroozeh Date: Mon Feb 8 15:44:06 2021 +0000 [SC-68772] Refactor and change indentation Basic refactor and change indentation preferences. Author: Ali Afroozeh Author: Sander Goos GitOrigin-RevId: 46fec053a3ed05ce32056e28f179422966e23f0b commit 4277443703c5ab59a567c1e80189bbcdb7495817 Author: Alex Ott Date: Sat Feb 6 19:24:21 2021 +0000 [DELTA-OSS-EXTERNAL] Delta README updates - fix grammar - use 0.8.0 as the latest version in README Closes delta-io/delta#592 Signed-off-by: Shixiong Zhu Author: Alex Ott #18124 is resolved by zsxwing/7mjytwsk. GitOrigin-RevId: 60327bce4f2ab8504dc8448a48c81c971b2a056e commit c17690a921f3e056792efe065fdafac28dcbecaa Author: Yuchen Huo Date: Sat Jan 30 21:09:53 2021 +0000 [SC-67507] Add a utility method newDeltaPathTable to convert Identifier to DeltaTableV2 Author: Yuchen Huo GitOrigin-RevId: 17510bd0f09630a8490b7b048a4a436ab48f5c59 commit 508b0ef05a829206f7f4c06679bb55b2042dc0fa Author: Meng Tong Date: Thu Jan 28 22:09:44 2021 +0000 [SC-65157] Add time metrics for DELETE and TRUNCATE ## What changes were proposed in this pull request? This PR adds execution/scan/rewrite time metrics for DELETE and execution time metric for TRUNCATE. ## How was this patch tested? Modified existing tests to cover the new metrics. Author: Meng Tong #17684 is resolved by mengtong-db/add-metrics-update-truncate. 
GitOrigin-RevId: 6ee95a3a185be94fdc7bd3950bace745b27c6ad4 commit 2fa8f3a90280e95a3baee7fb0b713b805b7cdff9 Author: Joe Widen Date: Thu Jan 28 18:25:09 2021 +0000 [DELTA-OSS-EXTERNAL] Update README.md Update Slack Invitation Closes delta-io/delta#587 Signed-off-by: Shixiong Zhu Author: Joe Widen #17685 is resolved by zsxwing/vpnw7xou. GitOrigin-RevId: a0776c76002bc4cdeafccd1e588ad2555d6068ab commit 616e627ce142b49e10bff843aff25c47f85529df Author: Wenchen Fan Date: Thu Jan 28 11:08:08 2021 +0000 Update our forked CharVarcharUtils to follow SPARK-34192 Author: Wenchen Fan GitOrigin-RevId: b8537fb28647cbbc57e2e193c300e855caccd9da commit 4dd5cb20ef6a8b1527c4be55de387723f99618d0 Author: Meng Tong Date: Thu Jan 28 02:44:49 2021 +0000 [SC-68578] Record execution time before commit for MERGE ## What changes were proposed in this pull request? Before this PR, the execution time of MERGE was recorded after the commit, so the commit info always had it as 0. This PR fixes the issue. ## How was this patch tested? Modified the existing test to check that the execution time metric is the largest time metric. Author: Meng Tong #17554 is resolved by mengtong-db/fix-merge-time. GitOrigin-RevId: cfc630229ee41f65529a5ed5355e5b0cc9b0e889 commit be829c5d2737930c66a0ffd6dc3a8fdf25f14419 Author: Meng Tong Date: Wed Jan 27 15:50:29 2021 +0000 [SC-65157] Add scan/rewrite/execution time metrics for UPDATE ## What changes were proposed in this pull request? This PR adds scan/rewrite/execution time metrics for the Update command. ## How was this patch tested? Added a new test in DescribeDeltaHistorySuite to test that the new metrics exist. The exact values change from run to run and are thus not checked. Author: Meng Tong #17515 is resolved by mengtong-db/add-metrics.
GitOrigin-RevId: dd2cb870c1f0a1a136a79d47a03dad19ef83e979 commit f630c6e17a012a845746625ff801a6856ee485d6 Author: Tom van Bussel Date: Tue Jan 26 14:41:30 2021 +0000 [SC-45173] Refactor DeltaInvariantCheckerExec Author: Tom van Bussel GitOrigin-RevId: 23ed42915eab652c4919fdd458130b1412a9a6f7 commit 6449181bcc651c92178b92d34fda3ce7f8c27553 Author: Jose Torres Date: Tue Jan 26 01:46:10 2021 +0000 [SC-65096][DELTA] Pass schema evolved output into the LogicalRelation of the target plan in MERGE INTO. ## What changes were proposed in this pull request? Pass schema evolved output into the LogicalRelation of the target plan in MERGE INTO. Right now we replace the attribute reference types only in a projection *above* the LogicalRelation, which isn't valid - it means the same exprId is used for both the pre and post-evolution datatypes, causing inconsistent codegen depending on which type happens to be picked up first. After this PR, the exprId should always have the post-evolution datatype. (Ideally, we wouldn't need to do this kind of surgery at all, but I don't see another way to resolve this without refactoring the entire MERGE implementation.) ## How was this patch tested? The primary test is LogicalPlanIntegrity.checkIfExprIdsAreGloballyUnique, which checks that all references to a particular exprId have the same type. This check is disabled in master pending resolution of SC-67287 but I verified locally that this change makes the check pass in MergeIntoSQLSuite. This also fixes the repartition(1) repro in SC-65096, along with the original issue we saw in CDC (but re-enabling the CDC schema evolution we should defer to a separate PR). Author: Jose Torres #17249 is resolved by jose-torres/fixbehavior. 
GitOrigin-RevId: 7bf4230699a8d50e6400f6a4fa44880d80401d32 commit 097f97a10b4eaa9d8eb63ffd593655e5f92df937 Author: Meng Tong Date: Thu Jan 21 17:17:02 2021 +0000 [SC-65144] Refactor DeltaSQLConf Author: Meng Tong GitOrigin-RevId: e1bbb1d0023b2fc0a90405a3970a388e4542d337 commit 45c0226e9ca00669388f554085cdf7d6d13be9fb Author: Meng Tong Date: Thu Jan 21 17:15:53 2021 +0000 [SC-66747] Fix LogStore test failures for Hadoop 3 ## What changes were proposed in this pull request? Different Hadoop versions produce slightly different error message text. This PR changes a few related tests to do regex match for error message comparison to cover both of them. ## How was this patch tested? The modified tests now pass on both Hadoop versions. Author: Meng Tong #17254 is resolved by mengtong-db/sc-67661-logstore-test. GitOrigin-RevId: d0a27bbe0eb0cbf67e4b1eda1ac99e3667976258 commit cf9d72e67ac7858397ddaa7bfa183b9b6899972a Author: Jose Torres Date: Thu Jan 21 15:51:13 2021 +0000 [SC-67123][DELTA] Update file name prefix for CDC Author: Jose Torres GitOrigin-RevId: 6203f0e817f3256dcbaf861c5198ad6a1e0f1bc2 commit 254677db756ea5bef709868849549822120fc00a Author: Meng Tong Date: Thu Jan 21 03:09:51 2021 +0000 [SC-65144] Refactor DeltaConfigs Author: Meng Tong GitOrigin-RevId: 55fec6397f41f2bbaccb4140147105f44f7903ee commit e936918e79b44e327a2030a18af5702eb12c9621 Author: Pranav Anand Date: Fri Feb 5 19:17:20 2021 +0000 Update Delta version snapshot - Update snapshot version Author: Pranav Anand GitOrigin-RevId: d1d5ff696d9c92c7f8546804bde3c2fd43f067d0 commit 1fb4dc7826ae8d3a31caf13f9d346d55716f7eb7 Author: Pranav Anand Date: Fri Feb 5 00:06:21 2021 +0000 Setting version to 0.8.1-SNAPSHOT commit 7b271cb067a555bca352ed00f5790b71c5482f40 Author: Pranav Anand Date: Fri Feb 5 00:05:16 2021 +0000 Setting version to 0.8.0 commit 5172443f2229d964fdbc3b3e83ccc449383198a6 Author: Shixiong Zhu Date: Wed Jan 20 01:08:02 2021 +0000 [SC-65398]Fix a bug that startingVersion/startingTimestamp doesn't 
work with rate limit correctly ## What changes were proposed in this pull request? When a query is using the rate limit options such as `maxFilesPerTrigger` with startingVersion/startingTimestamp, `isStartingVersion` may be set to `true` incorrectly. This PR fixes the bug and also adds a test for this. Fixes https://github.com/delta-io/delta/issues/568 ## How was this patch tested? The new added test. Author: Shixiong Zhu #17099 is resolved by zsxwing/SC-65398. GitOrigin-RevId: 962754812143197562c0f748b4d7b028e3982c3c commit c75a55f5fe52f6ada40e0f194071346d50aeb05a Author: Rahul Mahadev Date: Tue Jan 19 16:13:36 2021 +0000 [SC-64127][DELTA] Vacuum support for Delta CDC Add new type of hidden file Author: Rahul Mahadev GitOrigin-RevId: 2609974531f7bb99884877e5fa66378a039602bf commit 7899c47dd6594128d80db341bcb8d89ef62a9b78 Author: Wenchen Fan Date: Thu Jan 14 15:23:14 2021 +0000 [SPARK-34086] Change RaiseError to use RaiseErrorCopy expression Author: Wenchen Fan GitOrigin-RevId: 26d43465cd1f6f8499b3b8b9c5a5440dba0c3e6c commit 48b0fb94c28d7826cb485ea7c826378048764a98 Author: Pranav Anand Date: Wed Jan 13 23:59:02 2021 +0000 [SC-65387] Remove unused exception Author: Pranav Anand GitOrigin-RevId: cce84a349bc2c90c30d6fd49fb46b995a5460ede commit 59ef2d402d46a20de6bb56edfc1f99d9d76af6a2 Author: Rahul Mahadev Date: Wed Jan 13 18:30:16 2021 +0000 [SC-62562][DELTA] Refactor to add new options to DeltaTableV2 Author: Rahul Mahadev Author: Tathagata Das GitOrigin-RevId: cb87bbb1040a31f632fc19f795948e6c616ca94a commit 0661cc4d0c3f8d5c25ea97ef05fbb6e776caa91b Author: Wenchen Fan Date: Wed Jan 13 14:08:04 2021 +0000 [AUTO][SPARK-33875][SQL] Change check on ResolvedTable Author: Wenchen Fan GitOrigin-RevId: b3d0ba62e09d483f8eaf77c6d09fd05bd26df4f5 commit 3fe43a0bef7a734112fb72ed7c967a0b8fe582f1 Author: Shixiong Zhu Date: Mon Jan 11 21:13:55 2021 +0000 Revert "[SC-59856]Remove the lock during a Delta commit" We decided to revert this as it highly increased the chance to hit the 
concurrent checkpoint issue. Author: Shixiong Zhu GitOrigin-RevId: f6d492f7d3b869481d65e63fbab804bc5456dfce commit 5b3f6682b6c45c40d51c6a4968ff33dc0323b21c Author: Anton Okolnychyi Date: Mon Jan 11 05:22:54 2021 +0000 [SC-64604] Clean up DeltaAnalysis with AppendDelta and OverwriteDelta Author: Wenchen Fan GitOrigin-RevId: 212e75692943e70e6b2c72420d115f1edc31603b commit 667a999bf42e3ba8551bddb62d4ca6ab4adbcf1e Author: Yijia Cui Date: Thu Jan 7 05:57:11 2021 +0000 [SC-61645] isDeltaTable should check whether a path is a real delta table. Add check to ensure that a path with an empty _delta_log isn't a real delta table. Author: Yijia Cui GitOrigin-RevId: 05c77334e814e14c0fb1d304b47226328549e34d commit 9ee0e0f02941507be8826c85be27da579c4b46d9 Author: Xiao Li Date: Wed Jan 6 19:12:36 2021 +0000 [SC-65335] [SC-64953] Disable test and ActionSerializerSuite This PR is to disable the test that always fails N/A Author: Xiao Li GitOrigin-RevId: d9bdc6aebdcb916c33de1f05b9217d46170e5bea commit a170f364168c772774ab69b022b70ecd9aba459a Author: Shixiong Zhu Date: Wed Jan 6 01:09:52 2021 +0000 [SC-65167] Check RemoveFile when OptimisticTransaction.readWholeTable is called When a conflict commit contains RemoveFiles, we should check whether it removes files read by OptimisticTransaction. Currently, we only check whether RemoveFiles contain any of files in `readFiles`. However, `OptimisticTransaction.readWholeTable` doesn't update `readFiles`, so this is not enough when `OptimisticTransaction.readWholeTable` is called. This PR disallows any RemoveFiles when `OptimisticTransaction.readWholeTable` is called so that we are able to detect this case. 
Impacts: - Reading a Delta table and writing to the same table using DeltaSink Author: Shixiong Zhu GitOrigin-RevId: 6158df82baf5a8bfca288db343eeaa6fc7cb9165 commit 55a55e65b8a8a1f5810b998a9581333cf0312312 Author: Jose Torres Date: Wed Jan 6 00:49:31 2021 +0000 [SC-65156][DELTA] Improve docs Improve documentation for matchingFiles Author: Jose Torres GitOrigin-RevId: 00b8672ce3273b3521ea7d668d2a69837ec74cae commit cd3047bdd1fa9625e7810c1554c7a24f911bb009 Author: Shixiong Zhu Date: Tue Jan 5 23:30:37 2021 +0000 [SC-62936]Add ActiveOptimisticTransactionRule to ensure we use the same version when there is an active transaction This PR adds `ActiveOptimisticTransactionRule` to ensure we use the same version when there is an active transaction. Fixes https://github.com/delta-io/delta/issues/550 DeltaWithNewTransactionSuite Author: Shixiong Zhu GitOrigin-RevId: 5e720c3c63fde3991c1981186e5fea0754cfe5e0 commit 7465eff6551f7fd1cbc457e94686d412f0378389 Author: Tathagata Das Date: Tue Jan 5 23:20:22 2021 +0000 [SC-62571][Delta] Add import Author: Tathagata Das GitOrigin-RevId: e8d650ecef72eb4dc3dcffbd56324b8d76e82507 commit 3fbe77503b69bd214254a4f877da876d8eda07db Author: Tathagata Das Date: Tue Jan 5 20:40:53 2021 +0000 [SC-65136][Delta] Exposed scan / rewrite / total execution time of merge in operation metrics As the title says. 
Updated unit tests Author: Tathagata Das GitOrigin-RevId: f1e3c63e68a3127a1d109ba0747fa359e998b279 commit 50328ead9bf163fed4b844d83c608bb386609cbc Author: Jose Torres Date: Tue Jan 5 18:35:15 2021 +0000 [SC-62563][DELTA] Collect additional metrics in MergeInto Add functionality to collect additional metrics for Merge Author: Jose Torres GitOrigin-RevId: 85a3261d574ad269b3acf91a1ddf3605ae9217c2 commit 7a5443205d88b590f15eeccb69b6a55bf3a63715 Author: Rahul Mahadev Date: Tue Jan 5 16:26:03 2021 +0000 [SC-62562][DELTA] Refactor to add new options to DeltaTableV2 Add options to DeltaTableV2 Author: Rahul Mahadev GitOrigin-RevId: af7579c0be3237338381dff52d93d205a49aca2e commit 8cac68f56a93648892bb7ae92617ca18624c7b8c Author: Xiao Li Date: Sat Dec 26 19:21:40 2020 +0000 [SC-64323] Fix existing test by removing semicolon Author: Xiao Li GitOrigin-RevId: 56e20bbd2c07f624146979cb1bfe7620ad5a8ce8 commit b9cb93e5c53605c6d2067f122fd0a033f13b6d07 Author: Jose Torres Date: Tue Dec 22 21:11:23 2020 +0000 [SC-62565][DELTA] Add new FileAction and augment DelayedCommitProtocol Refactor to add a new FileAction Add helper methods to DelayedCommitProtocol Author: Jose Torres GitOrigin-RevId: 0c1ebce3b510733d066583df95ba8d8f1d4cfc56 commit 8939b8fee8acee8940dae416631b77e1b7e3c91f Author: Wenchen Fan Date: Tue Dec 22 09:25:39 2020 +0000 [SC-63547] Add new helper methods to CharVarcharUtils Author: Wenchen Fan GitOrigin-RevId: bc86288ba6115f9fedf777c7ba36982d0b8c0737 commit da2faf2f901f3b0c097932f54855bc447d37c51f Author: Alex liu Date: Fri Dec 18 22:59:45 2020 +0000 [SC-52455] Show delta table type for SHOW TBLPROPERTIES query In DataSourceV2, the DESCRIBE command no longer shows whether a table is a managed table or not. Add the table "Type" to the results returned by Delta's properties to show up as part of SHOW TBLPROPERTIES. Test: The new unit tests. 
Author: Alex liu Author: Alex liu GitOrigin-RevId: 2743ed62f4ba1a806d81860e35d31edba52d130e commit 1cffe48fb3a0b60b8a31b10715b57e601eef04b4 Author: Shixiong Zhu Date: Fri Dec 18 16:01:04 2020 +0000 [SC-63225]Block CREATE TABLE LIKE when the target table is using delta We forget to block CREATE TABLE LIKE when the target table is using delta. Currently, it will create a table in Hive metastore without creating any transaction logs. This PR blocks such commands. The updated tests. Author: Shixiong Zhu GitOrigin-RevId: bd4546723bb45855dd4e45b44a7c311133241094 commit 7cf96ec39f32f657318eeca0d3713cf038b9f7f8 Author: gatorsmile Date: Thu Dec 17 23:50:31 2020 -0800 Add import GitOrigin-RevId: fbeacadbff7f7092bf74176c1865b7dcebace5d7 commit 4a31a0ffdc1c845ce7383bd07ee34836c0cb9ff5 Author: Wenchen Fan Date: Fri Dec 18 04:41:01 2020 +0000 [SC-40165] Add new OSS DELETE test A new test for deleting from a temp view. Author: Wenchen Fan GitOrigin-RevId: afb3982585a01e7e66e13bb359f6a1d1366ad764 commit 6d16520e9f6932649432ffc6f47e89ec5cdc2c76 Author: Stephanie Bodoff Date: Thu Dec 17 17:48:48 2020 +0000 [DELTA-OSS-EXTERNAL] Remove duplicate parameter in docstring Closes delta-io/delta#570 Signed-off-by: Shixiong Zhu Author: Stephanie Bodoff GitOrigin-RevId: c43bac34546b9dd2c632ed09885d109e1a994524 commit ceda4f08957a4f54c273e0277bfc22180ecbdc56 Author: Jose Torres Date: Wed Dec 16 21:36:51 2020 +0000 [SC-62565][DELTA] Refactor writing to return Seq[FileAction] rather than Seq[AddFile]. Refactor writing to return Seq[FileAction] rather than Seq[AddFile]. 
existing tests - refactoring only Author: Jose Torres GitOrigin-RevId: a1a2e0588ecd0515523ea3f22ac5ba296b6ec3b0 commit a75b01bccd6ed13293514b3b01593f57da249dfe Author: Burak Yavuz Date: Wed Dec 16 18:50:28 2020 +0000 [SC-39976][DELTA] Enable reading Delta tables that contain NullType columns Currently there's no way to read Delta tables that have NullType defined in their schema due to restrictions in Apache Spark([Parquet check](https://github.com/apache/spark/blob/be09d37398f6b62c853e961df64b94b34fd3389d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L371), [DataSource check](https://github.com/apache/spark/blob/be09d37398f6b62c853e961df64b94b34fd3389d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L445)). Users [seem content](https://github.com/delta-io/delta/issues/450#issuecomment-654206338) with the idea of dropping null type columns during reads, therefore that's the temporary fix we are going for now. In the future, I don't think that the NullTypes necessarily need to be dropped. Spark's Parquet reader already handles the lack of existence of columns by entering nulls when necessary. We could keep the fields in the schema and just return the null values. Unit tests Author: Burak Yavuz GitOrigin-RevId: c4880d50e46a962d313aa8402f4c8e8ee4a8e013 commit de5e725f39cd250d309a495f490c84cf8b8c476c Author: Burak Yavuz Date: Wed Dec 16 17:13:29 2020 +0000 [SC-62672][DELTA] Fix support for updating comments for Delta tables We did not propagate the comment field to the description correctly in ALTER TABLE. This PR fixes that and a couple other things. 
Unit tests Author: Burak Yavuz Author: Burak Yavuz GitOrigin-RevId: 975e7c1d2aa89688531542ce8df731a55b5bb012 commit 8eccdd204c219e5ae08e2bd0262b1b31cdeade80 Author: Jose Torres Date: Tue Dec 15 17:47:45 2020 +0000 [SC-62560][DELTA] Add ChangeFile FileAction Adds a third type of FileAction for ChangeFile Author: Jose Torres GitOrigin-RevId: 7fedc15a3416957561230383caf22da0eec1cf60 commit 8d1d1ee77f0c52af10d516a21d2c7970dfe6e66c Author: Wenchen Fan Date: Sat Dec 12 04:50:32 2020 +0000 [SPARK-33480][SQL] Util methods for char/varchar type This PR adds util methods for char/varchar type which is kind of a variant of string type: Authored-by: Wenchen Fan Signed-off-by: Wenchen Fan (cherry picked from commit 5cfbdddefe0753c3aff03f326b31c0ba8882b3a9) Signed-off-by: Wenchen Fan GitOrigin-RevId: ef167f622cf0ec21c22347b60cf7b485ae1aa746 commit 73bd173581949c631307e328a0257ad6452bda9c Author: yi.wu Date: Fri Dec 11 22:34:25 2020 +0800 Fix external property key in DeltaCatalog GitOrigin-RevId: 0664bbdd3f0a8ad0bf3b5ff6e22d7e9bc7bfb328 commit 20907e69e56e6073b1df86a10606d39467796135 Author: Linhong Liu Date: Tue Dec 8 15:25:33 2020 +0000 Fix open source delta compile fix open source delta compile Author: Linhong Liu GitOrigin-RevId: 206b9145b7b176cf6dc14fab5dc32809d05d2958 commit ebc2ecfbfb77155d29b8899dbbdd58672a159a14 Author: Gengliang Wang Date: Tue Dec 8 21:39:18 2020 +0800 Add conf to lock on Delta commits Adds a conf which dictates whether or not we should lock when writing a commit GitOrigin-RevId: 67face971d09627738946f8623493fd9de05c788 commit 7c36c5e27c143367802efe8eb4cc3b11ebf5fc36 Author: Gengliang Wang Date: Tue Dec 8 13:21:47 2020 +0000 Fix many compiling errors Minor change to Preprocess table classes Author: Gengliang Wang GitOrigin-RevId: 57e6c3ad24e38cd4fd0d966c8f9e9de9b0c5b97c commit e0933bc42d20ca4235e6bb5bea5ce175dcf67ee4 Author: Linhong Liu Date: Mon Dec 7 16:10:16 2020 +0800 Include SQLConf in UpdateExpressionSupport GitOrigin-RevId: 
91ce2bda9886b53944b44346b42507d5b7339870 commit 94b716e3a428ebfb2f68c16358d2fddbc10d510a Author: Burak Yavuz Date: Thu Dec 3 21:03:13 2020 +0000 [ES-51029][DELTA][TEST] Fix flakiness of HiveDeltaDDLSuite Changes the table name used in HiveDeltaDDLSuite. There could be some race conditions where the HiveMetaStore entry for the table is not properly deleted in previous tests. This can cause the test to receive the pointer for a wrong table path and cause flakiness. This PR changes tests Author: Burak Yavuz GitOrigin-RevId: b609734b207427b3b9a7cfe2d4de2294153478c7 commit 84bc294401297416c0b83063134231d16aedbe3e Author: gatorsmile Date: Thu Dec 3 09:22:50 2020 -0800 Catch correct exception in DataFrameWriterSuite - Catch AnalysisException instead of NoSuchTableException GitOrigin-RevId: f45da4fc6cf6408d9562396d54f565c1c4b31521 commit 981ddbcc2927271bcd68be1b3eef59d25e4a6935 Author: Tathagata Das Date: Thu Dec 3 08:33:37 2020 +0000 [SC-60703][Delta] Fix execution of scans on different versions of same Delta table in the same query plan Fixed equality comparison for TahoeLogFileIndex such that two indices referring to different versions by time travel specs are never equal. unit test Author: Tathagata Das GitOrigin-RevId: f660e76e13318a5e22133f973186d89c726582a0 commit 412f51d8fc6c08f49b3962634c3e6a648b817e19 Author: Jose Torres Date: Thu Dec 3 05:09:54 2020 +0000 [SC-59546][DELTA] Fix merge schema evolution for isolated specific updates. There's no need to even begin doing the schema evolution work if there's no star operation, since non-star operations can't schema evolve. new unit tests Author: Jose Torres GitOrigin-RevId: b96f1adf91e2dbdcb0ec84ae0ebc3782228ace0c commit fd0f035c2591e069cd54770b7f8cd050d99f7613 Author: Jose Torres Date: Wed Dec 2 16:05:24 2020 +0000 [SC-57630][DELTA] Handle struct schema evolution for MERGE INTO. ## What changes were proposed in this pull request? 
Allow schema evolution to work for nested columns, including proper by-name resolution. The finalSchema field in DeltaMergeInto is removed because it doesn't end up being necessary. We added it in the previous schema evolution PR to handle calculating the final schema in deltaMerge.scala, but that calculation is now deferred to PreprocessTableMerge where it can be added directly to the merge command without intermediate steps. ## How was this patch tested? new unit tests Author: Jose Torres #13248 is resolved by jose-torres/mergeupdate2. GitOrigin-RevId: b8fbeb5a38005df6282eb50bf7cec795b7bccbdd commit 9dc391818813b951692702b13d85f2dbff73b35b Author: Shixiong Zhu Date: Wed Dec 2 00:29:10 2020 +0000 [SC-58974]Update Spark to 3.0.1 in OSS Delta ## What changes were proposed in this pull request? Update OSS Delta to use Spark 3.0.1 and update test code accordingly. ## How was this patch tested? Jenkins Author: Shixiong Zhu #14257 is resolved by zsxwing/update-oss-delta-spark. GitOrigin-RevId: 5433eee54d019434a8024433b9e6149a0a40e963 commit fa2df7668b99bc6bd306107ac976adee4d38b312 Author: Shixiong Zhu Date: Tue Dec 1 23:22:35 2020 +0000 [SC-59927]Disallow to checkpoint version -1 When a table doesn't exist, its version is -1. In this case, we should not allow `checkpoint` to run, because it would produce a -1 checkpoint file (`-0000000000000000001.checkpoint.parquet`) and make this table unreadable. New unit tests.
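The file name quoted above is what you get when the version is zero-padded to 20 characters and the version happens to be negative. A minimal sketch of that formatting and of the kind of guard this change describes; `CheckpointNameSketch`, `checkpointFileName`, and `requireCheckpointable` are hypothetical names for illustration, not Delta's actual helpers, and the format string is inferred from the quoted file name:

```scala
// Hypothetical sketch; names and format string are assumptions, not Delta's real code.
object CheckpointNameSketch {
  // Pad the version to 20 characters. For a negative version the sign takes
  // one of those 20 positions, which yields the odd-looking name quoted above.
  def checkpointFileName(version: Long): String =
    f"$version%020d.checkpoint.parquet"

  // Reject the "table does not exist" sentinel (-1) before any file is written.
  def requireCheckpointable(version: Long): Unit =
    require(version >= 0, s"Cannot checkpoint a non-existent table (version $version)")
}
```

Calling `requireCheckpointable` before `checkpointFileName` is one way to make sure the sentinel version never reaches the file system.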
Author: Shixiong Zhu GitOrigin-RevId: d6fa48fddd8759692cb9e7a0b0a0690b0c6909dd commit 4fd7f370b53b99b5aa91b384d1dcf13a123d37ea Author: Shixiong Zhu Date: Wed Dec 9 16:56:44 2020 -0800 add javadoc commit bd37155802d92d8ebcc8a28129c2c115054321d3 Author: Shixiong Zhu Date: Wed Dec 9 16:26:20 2020 -0800 Setting version to 0.2.1-SNAPSHOT commit e7004adce030f1cb33add4474f7ab24c03742d5a Author: Shixiong Zhu Date: Wed Dec 9 16:23:05 2020 -0800 Setting version to 0.2.0 commit c7ba2598b38f489ad3b4e56a92715a85cb6960a1 Author: Shixiong Zhu Date: Wed Dec 9 16:20:09 2020 -0800 Fix NPE caused by asJava for Scala 2.11 (#68) commit 75882909c4440aad889bb38a316722c0fdf95185 Author: scottsand-db <59617782+scottsand-db@users.noreply.github.com> Date: Wed Dec 9 16:59:51 2020 -0500 Remove DeltaLogImpl.forTable func using File (#67) commit 157cbcd433e8f5e14de1f7863bcb40f0d961d746 Author: Shixiong Zhu Date: Wed Dec 9 08:27:56 2020 -0800 Audit API and doc for Delta Standalone Reader (#66) Review public APIs and make minor changes: - Remove unused or redundant APIs. - Add missing equals and hashCode methods. - Fix the null values for primitive types in RecordValue. - Minor doc changes. 
commit bbe1fe12d4d8b497304f0fab22f11dc8834909fd Author: scottsand-db <59617782+scottsand-db@users.noreply.github.com> Date: Tue Dec 8 20:54:16 2020 -0500 Update Spark and Delta version for the golden tables project (#65) commit e2216ac9bf3d614d7ed01002db5f89c1db8fcebb Author: scottsand-db <59617782+scottsand-db@users.noreply.github.com> Date: Tue Dec 8 13:32:17 2020 -0500 Add DeltaLog.getCommitInfoAt to query the metadata in a commit (#64) commit b8c8a5eb1b129341e5c7933692b6cf41c09ef8a1 Author: scottsand-db <59617782+scottsand-db@users.noreply.github.com> Date: Tue Dec 8 13:24:51 2020 -0500 Make Hive connector cache DeltaLog to speed up warm queries (#62) commit 39f3ac3c31ee9f5223ecdbe260f59e11505ccc22 Author: scottsand-db <59617782+scottsand-db@users.noreply.github.com> Date: Tue Dec 1 14:59:02 2020 -0500 Assertion error msg improvement (#61) commit cf2cb61fda7c4b2af82ada1e5f9819055241e0c0 Author: scottsand-db <59617782+scottsand-db@users.noreply.github.com> Date: Mon Nov 30 21:26:51 2020 -0500 Update README.md (#58) commit 7533738c17944d1cbff9913329ed3fe1f9997e13 Author: Tathagata Das Date: Wed Nov 25 18:20:17 2020 +0000 [SC-59423] streamlined writeFiles method for Merge Author: Tathagata Das GitOrigin-RevId: b2667032603d6994ca8bb84aa31d6839dd8c4faa commit b8f68cd3ed1d180a464d50cc6b59b9ccf35f8e6d Author: Pranav Anand Date: Wed Nov 25 08:18:43 2020 +0000 [SC-58738] Minor refactoring - removed todo Author: Pranav Anand GitOrigin-RevId: 247b5b10eefdfb5a92d2ca6de95afdcf722e8d6c commit b168037635027f73024de17274c9222c52e84445 Author: Rahul Mahadev Date: Wed Nov 25 04:14:32 2020 +0000 [SC-55218][DELTA]Minor refactoring of Auto Optimize conf Author: Rahul Mahadev GitOrigin-RevId: 04c50dba83734177a4ecac584d775a8c21e46f31 commit 20e51bdee58081fe3d66bdf60b13801bfcd6bd5c Author: Alex liu Date: Wed Nov 25 00:45:41 2020 +0000 [SC-45268] Save comment when CTAS a delta table with comment ## What changes were proposed in this pull request? 
When creating a new delta table with a comment by CTAS, the metadata of the new table is missing the comment. The replace delta table command updates the comment in an update metadata request, so the metadata is correct there even though the comment is not set in the initial metadata. To fix the issue, the comment is now set on the metadata during metadata creation. ## How was this patch tested? The new unit tests. Author: Alex liu Author: Alex liu Author: Burak Yavuz #14249 is resolved by alexoss68/SC-45268-master. GitOrigin-RevId: e6c56b08a7ec55be63e05f888b8d968f62cca6d4 commit 87e549262d5ad8f318e5e4d6fd1e84750b95dbde Author: mahmoud mahdi Date: Wed Nov 18 22:43:21 2020 +0000 [DELTA-OSS-EXTERNAL] Add implicits to provide delta(path) methods in Spark's read/write APIs for Scala The main goal of this Pull Request is to create a ```delta``` function which replaces the ```.format("delta")``` we use whenever we want to read/write from/in a delta table. I used the Scala "Pimp My Library" Pattern in order to extend the following APIs: - DataFrameReader - DataStreamReader - DataFrameWriter - DataStreamWriter I tested this feature by replacing, in the ```DeltaSourceSuite.scala```, all the ```.format("delta")``` by the function ```.delta(path)``` Closes delta-io/delta#369 Signed-off-by: Shixiong Zhu Author: Shixiong Zhu Author: mahmoud mahdi #14303 is resolved by zsxwing/p7uot0s0. GitOrigin-RevId: 9b85ca569779db3c500a8ca8e2a8a0ef9028f371 commit 82646384bbdfe399cca7b73474f5f8cb19cb8fd3 Author: Shixiong Zhu Date: Tue Nov 17 21:02:20 2020 +0000 [DELTA-OSS-EXTERNAL] Make JsonUtils.mapper lazy to show a better error Currently, if `JsonUtils.mapper` fails to initialize (likely a `jackson-module-scala` incompatibility issue), class `JsonUtils$` will not be loaded. Because this is a class loading error, the root cause of the initialization failure will be shown only once by the JVM, and subsequent calls that touch `JsonUtils` will fail without a cause. This makes it hard for users to find the root cause in their logs.
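For contrast with the eager failure mode just described, here is a minimal sketch of how a Scala `lazy val` behaves when its initializer throws: the value stays uninitialized, so every access re-runs the initializer and surfaces the original exception again. `LazyMapperSketch` and `touchTwice` are hypothetical stand-ins, not Delta's `JsonUtils`, and the thrown exception only simulates an incompatibility:

```scala
import scala.util.Try

// Hypothetical stand-in for an object like JsonUtils; not Delta's actual code.
object LazyMapperSketch {
  // Counts how many times the initializer has run.
  var attempts: Int = 0

  // A lazy val whose initializer throws stays uninitialized,
  // so the initializer runs again (and throws again) on the next access.
  lazy val mapper: String = {
    attempts += 1
    throw new IllegalStateException(s"simulated mapper init failure (attempt $attempts)")
  }

  // Touch the mapper twice and capture both outcomes.
  def touchTwice(): (Try[String], Try[String]) = (Try(mapper), Try(mapper))
}
```

Both results returned by `touchTwice()` carry the full exception, which is the property that keeps the root cause visible in logs on every call.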
This PR makes JsonUtils.mapper lazy so that the initialization failure will be shown every time `JsonUtils` is touched. Closes delta-io/delta#552 Signed-off-by: Rahul Mahadev Author: Shixiong Zhu #14292 is resolved by rahulsmahadev/msqzx64j. GitOrigin-RevId: de71894fe398d0eeaa210ed8e837dbee4bf094ef commit 4080cc3e6c0626f4e21d65e2a77d6b63567b72b3 Author: Burak Yavuz Date: Mon Nov 16 19:22:11 2020 +0000 [SC-55207][DELTA] Minor refactoring in SnapshotManagement and StateCache Author: Burak Yavuz Author: Burak Yavuz GitOrigin-RevId: 5dca66d7af82ea792bbfa4bead21ce602fa796e1 commit 69ff08fed19005af87f5e7c3dc9c61b0df70c812 Author: Jose Torres Date: Mon Nov 16 14:58:38 2020 +0000 [SC-37912][DELTA] Fix struct resolution to be by name for UPDATE operations. ## What changes were proposed in this pull request? Fix struct resolution to be by name for UPDATE operations, including those in MERGE INTO. Legacy flag provided to retain an opt-in for the old behavior. ## How was this patch tested? new unit tests Author: Jose Torres #13938 is resolved by jose-torres/structbyname. GitOrigin-RevId: a59918225fddeec001cb401c38ca0b267ed79f37 commit b6ff6610430c760d24d17475d030b868d453142c Author: ekoifman Date: Fri Nov 13 01:09:47 2020 +0000 [DELTA-OSS-EXTERNAL] Add user friendly description to internal delta queries to make Spark UI easier to follow Delta operation generates multiple queries to manage the transaction log. For example ``` val data = spark.range(0, 15) data.write.format("delta").save("/tmp/delta-table") val df = spark.read.format("delta").load("/tmp/delta-table") import io.delta.tables._ import org.apache.spark.sql.functions._ val deltaTable = DeltaTable.forPath("/tmp/delta-table") deltaTable.delete(condition = expr("id == 0")) ``` executes several "housekeeping" queries. This enhancement makes it easier to figure out which is which. "Default Names" and "Enhanced Names" screen shots show the before and after UI. 
(Stack traces shown in the UI are not useful here) "Default Names" [screenshot: DefaultNames] "Enhanced Names" [screenshot: Enhanced Names] Closes https://github.com/delta-io/delta/issues/528 Closes delta-io/delta#525 Signed-off-by: Burak Yavuz Author: ekoifman Author: Burak Yavuz #14159 is resolved by brkyvz/c0ygkr5c. GitOrigin-RevId: 5d9d1d168ffa13bb1cdb3ca638c691a094005e77 commit 1e280f2340b040586247c084813ba36f580a1782 Author: HyukjinKwon Date: Wed Nov 4 03:06:12 2020 +0000 Minor Refactoring Logically checked and fixed. Author: HyukjinKwon GitOrigin-RevId: 7bfedaf8dac6084738bf323ea5778329f01aba34 commit 06e40925e9f414e263a66e4ee63a15ee5cc5e839 Author: Shixiong Zhu Date: Mon Nov 9 14:46:28 2020 -0800 Fix build.sbt for bintray release (#57) We plan to release the following two artifacts to maven central: ```xml <dependency> <groupId>io.delta</groupId> <artifactId>delta-standalone_{scala_version}</artifactId> <version>0.2.0</version> </dependency> <dependency> <groupId>io.delta</groupId> <artifactId>delta-hive_{scala_version}</artifactId> <version>0.2.0</version> </dependency> ``` So we need to rename the projects: - Rename `hive-delta` to `delta-hive`. - Rename `standalone` to `delta-standalone`. I also updated build.sbt to only release the above two projects. Manually verified that `build/sbt release` works locally and uploaded the above two projects to bintray successfully.
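For readers who build with sbt rather than Maven, the same coordinates could be referenced as follows. This is a sketch for a hypothetical downstream `build.sbt`, assuming the artifacts are published under the names listed above; sbt's `%%` operator appends the Scala binary version suffix that the `_{scala_version}` placeholder stands for:

```scala
// build.sbt of a hypothetical downstream project (illustrative only).
// %% appends the Scala binary version, e.g. delta-standalone_2.12.
libraryDependencies ++= Seq(
  "io.delta" %% "delta-standalone" % "0.2.0",
  "io.delta" %% "delta-hive" % "0.2.0"
)
```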
commit d44231f3b07fded106f61344135cded6bfa30f8f Author: scottsand-db <59617782+scottsand-db@users.noreply.github.com> Date: Fri Nov 6 14:17:37 2020 -0500 minor tweaks (#56) commit 015609d8db03c14a4e23ae5ebfd11b80324dda29 Author: scottsand-db <59617782+scottsand-db@users.noreply.github.com> Date: Thu Nov 5 15:01:45 2020 -0500 generated release jar & javaunidocs (#55) commit 3f518a2d56784de362151a115e920e2e4050b59d Author: Jose Torres Date: Sat Oct 31 08:42:56 2020 -0700 Add DeltaUnevaluable to be compatible with the coming Spark 3.1 GitOrigin-RevId: 2d8c03cf680251b49b08b90e813d7ac954962807 commit d352a0445ab6cb71c4be6a48fed0918a10e2249a Author: Burak Yavuz Date: Fri Oct 30 22:45:34 2020 +0000 Minor refactoring Author: Burak Yavuz GitOrigin-RevId: 3fb12ef9d2287338691d7be361946a36a4e478cd commit 3a726e62757748f20cd194e8113b5c972165495c Author: Burak Yavuz Date: Thu Oct 29 00:34:14 2020 +0000 [SC-54155][DELTA][WARMFIX] Fail a Delta table's stream if logs are skipped ## What changes were proposed in this pull request? Introduces a flag "failOnDataLoss" for the Delta streaming source. Our `getChanges` method to get the changes to a streaming source uses `listFrom`, which returns the earliest available delta file after a given version. This version may be later than what we should be processing. In such cases, we will fail the stream unless the "failOnDataLoss" flag is set to false. Imagine the following case: 1. You start a stream 2. You process all your data 3. You stop your stream 4. The table has a delta.logRetentionDuration set to 1 day, writes happen to your table, and Delta performs log cleanup 5. You restart your stream 2 days later. The delta files providing the changes that you need to process have already been deleted. We should fail in the case above. ## How was this patch tested? Unit tests Author: Burak Yavuz Author: Burak Yavuz #13644 is resolved by brkyvz/deltaFailOn.
GitOrigin-RevId: 355d168763d1909abd3c4a0335dd716829142ed4 commit 0fec63980f4ccc16bd382687dfbd22fb028b94cd Author: Yuanjian Li Date: Fri Oct 23 06:09:38 2020 +0000 Pass Datasource V2 options to V1 Make time travel options such as `versionAsOf` work in `DataFrameReader.table` when we upgrade to Spark 3.1. Author: Yuanjian Li GitOrigin-RevId: 6bab02deb39ef45c062f971a8f648abd64de1115 commit cfe153c029a141b731b7ef28c9d2ac86187b61a5 Author: Burak Yavuz Date: Thu Oct 22 23:16:44 2020 +0000 Add DeltaConfig.defaultTablePropertyKey Author: Burak Yavuz GitOrigin-RevId: ef331a0f155a348bfcc5d761a5c95077f8222786 commit 7e5990e12940ec7363f1873e9b0c3dab97cdbc4f Author: Shixiong Zhu Date: Wed Oct 21 01:14:19 2020 +0000 [SC-44129]Add LogStore.readAsIterator This PR proposes a new method `LogStore.readAsIterator` to load the file on demand. `LogStore.readAsIterator` returns `ClosableIterator` which can be implemented by a LogStore implementation to load the content on demand. We also implement a new class `LineClosableIterator` which can turn a `Reader` to `ClosableIterator`. This class makes it easy to implement `readAsIterator` for a LogStore implementation. This PR also uses this API in `describe history` to solve its scalability issue. The new unit tests. Author: Shixiong Zhu GitOrigin-RevId: fb136c38c8070c4a822ac667063710b1bfc1f02c commit 54d33a8344a070d4432a4cdfbebd125b81b5bc41 Author: scottsand-db <59617782+scottsand-db@users.noreply.github.com> Date: Mon Nov 2 20:08:21 2020 -0500 updated build.sbt to build the hive assembly jar; updated README (#54) commit 75d41a953e2bfe2b2506048006a0baf8a2a03554 Author: Shixiong Zhu Date: Mon Nov 2 10:31:14 2020 -0800 Fix the broken Tez tests due to incompatible parquet version (#53) Update `build.sbt` to avoid adding `hive-exec` jar (contains an old parquet) to the classpath in Hive Tez tests so that we don't introduce an old parquet version in the classpath. 
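As a side note on the `LogStore.readAsIterator` change (commit 7e5990e above): the idea of turning a `Reader` into an iterator that loads lines on demand and can be closed can be sketched roughly as below. The trait and class here are hypothetical illustrations written from the commit description only; they are not Delta's `ClosableIterator` or `LineClosableIterator`, whose actual signatures are not shown in this log.

```scala
import java.io.{BufferedReader, Closeable, Reader}

// Hypothetical trait modeled on the description in the commit message.
trait ClosableIteratorSketch[T] extends Iterator[T] with Closeable

// Reads one line at a time instead of loading the whole file into memory.
class LineIteratorSketch(reader: Reader) extends ClosableIteratorSketch[String] {
  private val buffered = new BufferedReader(reader)
  private var nextLine: String = null
  private var lookedAhead = false
  private var closed = false

  override def hasNext: Boolean = {
    if (closed) return false
    if (!lookedAhead) {
      // Pull exactly one line ahead; null signals end of input.
      nextLine = buffered.readLine()
      lookedAhead = true
      if (nextLine == null) close()
    }
    nextLine != null
  }

  override def next(): String = {
    if (!hasNext) throw new NoSuchElementException("end of input")
    lookedAhead = false
    nextLine
  }

  // Idempotent close so callers can always call it in a finally block.
  override def close(): Unit = if (!closed) { closed = true; buffered.close() }
}
```

A caller would typically wrap iteration in `try { ... } finally { it.close() }`, so the underlying reader is released even when the consumer stops early, for example after reading only the most recent history entries.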
commit 5965a72d4b5923359b74ef365f8784779398c34f Author: scottsand-db <59617782+scottsand-db@users.noreply.github.com> Date: Thu Oct 29 12:33:19 2020 -0400 Delta Standalone Reader (#51) In this PR, we add a new project `standalone` which implements a Delta reader without using Spark. The reader provides APIs to access Delta transaction logs programmatically and a simple interface to read the data files in the table as well. We also update the Hive connector project to use this new project and get rid of the embedded Spark from the Hive connector. commit 00b44ad34463e9141f666384b63b3972733d0580 Author: Rahul Mahadev Date: Thu Oct 15 19:28:25 2020 +0000 [SC-52256][DELTA] Minor refactoring - Get rid of dead code in DeltaOperations, DeltaDDLSuite - Add some special exception classes for Time Travel exceptions - No new tests Author: Rahul Mahadev GitOrigin-RevId: 5fe090310cd054f0a13a969dd248df78f908c319 commit 40182f3de69ca083afd48ef522897519d36e5fab Author: Adam Binford Date: Wed Oct 14 06:44:17 2020 +0000 [DELTA-OSS-EXTERNAL] Added config option to enable parallel deletes for vacuum command Resolves #395 https://github.com/delta-io/delta/pull/416 hasn't been updated in over four months, and it would be a very useful feature for us to have, so I took my own stab at it. - A new config value is added `vacuum.parallelDelete.enabled` that defaults to false - I updated the default behavior to coalesce to 1 instead of iterating on the driver so that you can see something being done by spark in the UI/console instead of it just sitting there. I'm not sure if there's a reason this would cause issues, so happy to revert this back if you think it should be. - If `vacuum.parallelDelete.enabled` is set to true, it maintains the existing partitions from the `diff` calculation. Because this is the result of a `join`, your partitions are then based off your `spark.sql.shuffle.partitions`.
So your parallelism will be min(number of executors, shuffle partitions), and you can tweak your shuffle partitions if you want more/less parallelism I removed the delete static method because the number of parameters that had to be passed to it made it seem like too much. Happy to move that back if that's not preferred. Also happy to make any updates to the name or description of the new config. Closes delta-io/delta#522 Signed-off-by: Jose Torres Author: Jose Torres Author: Adam Binford #12941 is resolved by jose-torres/ee2ucyf3. GitOrigin-RevId: a73aa60a4820c4d6a37f0b21a0db31d72a09cfa5 commit be420465b4163959110959a6230e2fafd00aaf7c Author: Jose Torres Date: Tue Oct 13 17:51:20 2020 +0000 [SC-45767][DELTA][WARMFIX] Improve write-time resolution of Delta constraints. ## What changes were proposed in this pull request? When playing around with it I found two additional cases not accounted for: * Functions need to be resolved too. Fortunately much of the complexity of normal function resolution comes from analyzer-specific considerations and the existence of window + aggregate functions - we can just look up straight in the catalog since we don't have to worry about such things. * Implicit type coercion. ## How was this patch tested? new unit tests ## **IMPORTANT** Warmfix instructions What type of warmfix is this? Please select **exactly one choice**, or write description in Other. - [ ] Regression (e.g. fixing the behavior of a feature that regressed in the current release cycle) - [ ] ES ticket fix (e.g. Customer or internally requested update/fix) -Other (please describe): Fix to an unreleased feature required for protocol compatibility. No currently existing workloads will execute this new code. Author: Jose Torres #13058 is resolved by jose-torres/resolveconstraints. 
GitOrigin-RevId: 9ae99d108878c84159a7e48736a7acd84f614fda commit b18aebb2cfb3f0261ff930574adb4812727f71e8 Author: Burak Yavuz Date: Thu Oct 8 19:41:17 2020 +0000 [SC-51757][DELTA] Re-use Map encoder for GenerateManifest ## What changes were proposed in this pull request? There's something weird that happens with Scala reflection and the map type encoder in `GenerateSymlinkManifestImpl`. ScalaReflection can throw an exception that looks like: ```scala Caused by: scala.ScalaReflectionException: type T2 is not a class ``` I think it happens intermittently only for the `Map` type due to type erasure. Creating the encoder once, and re-using it would help us work around this rare problem. ## How was this patch tested? YOLO, going on a hunch here Author: Burak Yavuz #13088 is resolved by brkyvz/enc. GitOrigin-RevId: e9d1298a5cf39476fec53c4d1b904e179e1418cc commit 59d9555279d06d4f62832b2a4bc146f878af97b0 Author: Burak Yavuz Date: Wed Oct 7 01:33:22 2020 +0000 Add logging for add constraint ## What changes were proposed in this pull request? Improve ADD CONSTRAINT usability by logging saying that the entire dataset will be checked when adding a constraint. ## How was this patch tested? No tests needed Author: Burak Yavuz #13093 is resolved by brkyvz/usability. GitOrigin-RevId: abdbc1a2234fd19f0504b56e956bb561008c8888 commit 001d9b2efd4351f83e11457e0fed1f503a8f5c64 Author: Burak Yavuz Date: Tue Oct 6 16:49:10 2020 +0000 [SC-51428][DELTA] Model table protocol version as table properties for Delta ## What changes were proposed in this pull request? Model the protocol version of Delta as table properties so that it can be updated using the SQL API and SET TBLPROPERTIES and displayed as part of DESC EXTENDED. 
Requirements: - Protocol version should not be stored as part of table properties to avoid a split brain problem - Protocol version can be explicitly provided as part of table creation - Protocol version can be explicitly upgraded with ALTER TABLE SET TBLPROPERTIES - Protocol version should show up in DESC EXTENDED - Explicit setting of protocol version should trump the SQL configuration during table creation - Map default protocol version configuration to the table property default of "spark.databricks.delta.properties.defaults.min[Reader|Writer]Version" In addition, after these changes, if a user opts into a feature that needs a higher protocol version, the protocol version will be automatically upgraded. ## How was this patch tested? Unit tests Author: Burak Yavuz #13011 is resolved by brkyvz/ProtAsProps. GitOrigin-RevId: 134ee33fed357fe0dadfc7bb3e5dfdcc8c3ddaaa commit adab6191a6d5d9a56a555628601612b8c60651b0 Author: Jose Torres Date: Mon Oct 5 16:25:46 2020 +0000 [DELTA] Fix the spacing to align properly when multiple features require a protocol upgrade. ## What changes were proposed in this pull request? We're missing a space in list elements after the first so it looks like - Setting column level invariants -Setting CHECK constraints instead of - Setting column level invariants - Setting CHECK constraints ## How was this patch tested? code inspection Author: Jose Torres #12982 is resolved by jose-torres/fixspacing. GitOrigin-RevId: 19312c72187cd3ffdd56b95ff96ab37174722f34 commit dcf78ae37af6e77e3113cd7614a6f5170c6743f2 Author: Tathagata Das Date: Fri Oct 2 00:21:31 2020 +0000 [SC-49334][Delta] Add more partition and byte stats to Merge ## What changes were proposed in this pull request? Added the following metrics that were cheap to compute and collect. - Number of partitions touched at different stages (skipping, first scanning, writing) of the execution of merge - Byte information of files before and after skipping. 
These will help in analyzing the effectiveness of partition and file pruning. In addition, I have refactored the code to be a bit cleaner. Author: Tathagata Das #12887 is resolved by tdas/SC-49334. GitOrigin-RevId: bd0dc29932e1e239fe85f1a14c6702711fc93032 commit 90d6ca6599041aed533943ded1f7319560a521eb Author: Scott Sandre Date: Thu Oct 1 13:07:00 2020 +0000 DeltaErrors.scala shouldn't throw exception, but instead return them # What changes were proposed in this pull request? DeltaErrors helper methods should return new Throwable. Not actually throw them. Looked at all instances of `throw new` in DeltaErrors.scala. I checked all callers, and if the caller was not throwing that return type (i.e. caller was doing `DeltaErrors.___` instead of `throw DeltaErrors.___`) then I added the `throw` keyword. Author: Scott Sandre #12940 is resolved by scottsand-db/delta_errors_shouldnt_throw_exceptions. GitOrigin-RevId: abca9746169623534e8be40ac35581bee774843d commit 72eb889aab164eceeec6a8081f28d4e058755890 Author: Jose Torres Date: Wed Sep 30 20:39:39 2020 +0000 [SC-49814][DELTA] Support SET NOT NULL column changes. ## What changes were proposed in this pull request? We now allow SET NOT NULL, checking to make sure that all existing values of the column satisfy the condition. ## How was this patch tested? new unit tests Author: Jose Torres #12777 is resolved by jose-torres/addnn. GitOrigin-RevId: 2044cbe5d3aa3f00cf818088baccf49908032105 commit 9c3313295dfd7e32df6f95b5aef7ae855d1b84b8 Author: Burak Yavuz Date: Tue Sep 29 01:35:22 2020 +0000 [SC-50564][DELTA][HOTFIX] Do not re-analyze optimized plans while creating Delta tables ## What changes were proposed in this pull request? There are unfortunately some Catalyst rules in Spark that are not idempotent. These rules can cause data corruption issues if an optimized plan gets re-analyzed. Therefore, we should prevent Delta from having that problem. 
In DataSourceV2, the query plan for an AS SELECT statement (such as CREATE TABLE AS SELECT) gets optimized and turned into a SparkPlan. In this phase, for DataSource V1 writers, we pass in a DataFrame created from the optimized plan (and skip re-analysis) as fixed in [Apache Spark](https://github.com/apache/spark/commit/d378dc5f6db6fe37426728bea714f44b94a94861). When Delta was a V1 data source, we didn't have this problem, because we would intercept the analyzed plan, and plug it into a CreateDeltaTableCommand, where the query plan didn't get mucked around with anymore. If we try to go back to the LogicalPlan and to a DataFrame again, the optimized plan can get re-analyzed and cause issues. This PR is a follow-up to that fix and prevents the re-analysis of an optimized plan during table creation. The code isn't super clean, but the changes are kept to a minimum intentionally so that we can backport this fix. To the curious: One bug we observed was a manifestation of https://github.com/apache/spark/pull/29805. A literal in the query plan that was propagated during optimization caused the value of the column to completely replace the real value after the plan got re-analyzed and optimized. ## How was this patch tested? A new test suite that checks all DSV2 operations that Delta supports. ## **IMPORTANT** Warmfix instructions If this PR needs to be warmfixed (i.e. merged into the release branch after the code freeze), please follow steps below. What type of warmfix is this? Please select **exactly one choice**, or write description in Other. -Regression (e.g. fixing the behavior of a feature that regressed in the current release cycle) -ES ticket fix (e.g. Customer or internally requested update/fix) - [ ] Other (please describe): Make the following updates to this PR: :warning: *Only* do this if you are targeting the master branch and *additionally* want to have the MergePRSpark job put your commits into additional branches.
If your PR already targets the branch you want to warmfix it to, just merge it like a normal PR and do not apply branch labels, milestone, or put "warmfix" in the title before triggering a merge. Track improvements to this confusion at [PLAT-6334](https://databricks.atlassian.net/browse/PLAT-6334). - [ ] Add `[WARMFIX]` in the title of this PR. - [ ] Label the PR using label(s) corresponding to the WARMFIX branch(es). The label name should be in the format of `dbr-branch-a.b` (e.g. `dbr-branch-3.2`), which matches the release branch name for Runtime release `a.b`. - [ ] Ask your team lead or a staff engineer (or above) to sign off this warmfix and add the `warmfix-approved` label. - [ ] When merging the PR using the merge script, make sure to get this PR merged into the following branches: - The branch against which your PR is opened, and - Any extra release branch(es) corresponding to the `dbr-branch-a.b` label(s) applied to your PR. Author: Burak Yavuz #12836 is resolved by brkyvz/multi-plan. GitOrigin-RevId: b3f2793a2cc5ccee7acafb865d145a4b43d54687 commit 2e61b2d22dba81e0dd4eb5d77603e11df70fe26f Author: Alan Jin Date: Fri Sep 25 22:33:14 2020 +0000 [DELTA-OSS-EXTERNAL] Fix attribute resolution failure when insert overwrite some constants This PR fixes the attribute resolution failure that occurs when INSERT OVERWRITE is given some constants. How to reproduce: ```sql -- create a partitioned delta table CREATE TABLE t1 (a int, b int, c int) USING delta PARTITIONED BY (b, c); ``` The three queries below all fail in the attribute resolution phase. ```sql -- make sure there are two columns with same values: b=2, c=2 INSERT OVERWRITE TABLE t1 PARTITION (c=2) SELECT 3, 2; INSERT OVERWRITE TABLE t1 PARTITION (b=2, c=2) SELECT 3; INSERT OVERWRITE TABLE t1 PARTITION (b=2, c) SELECT 3, 2; ``` All three queries throw the exception > Resolved attribute(s) c#792 missing from a#793,b#794,c#795 in operator !OverwriteByExpression RelationV2[a#789, b#790, c#791] default.t1, (c#792 = cast(2 as int)), false.
Attribute(s) with the same name appear in the operation: c. Please check if the right attribute(s) are used.;; !OverwriteByExpression RelationV2[a#789, b#790, c#791] default.t1, (c#792 = cast(2 as int)), false +- Project [2#787 AS a#793, 2#788 AS b#794, c#792 AS c#795] +- Project [2#787, 2#788, cast(2 as int) AS c#792] +- Project [2 AS 2#787, 2 AS 2#788] +- OneRowRelation Closes delta-io/delta#521 Signed-off-by: liwensun Author: Alan Jin Author: liwensun #12646 is resolved by liwensun/319x1oex. GitOrigin-RevId: 5a24aa8ba091586c5db90b8c9a4efee823cd41ba commit ca37b4a123f9efd8429d078d940dd4e0987a1e78 Author: Mike Dias Date: Fri Sep 25 21:59:52 2020 +0000 [DELTA-OSS-EXTERNAL] Parse the partition structure using the delta_log as the base path This allows converting a location that is under a path that looks like a partition value. For example, running the convert to delta command over a path like `s3://massive-events/year=2020/` would fail because the command will try to compare the partitions above the delta log base path. Signed-off-by: Mike Dias Closes delta-io/delta#510 Signed-off-by: liwensun Author: Mike Dias #12643 is resolved by liwensun/4w7hocg4. GitOrigin-RevId: 4dc2e55c97b47b3ce928a5a780b115324170e95d commit 140401dc3652b7731fd9ae54726d55f67e1dbb36 Author: contrun Date: Fri Sep 25 20:22:25 2020 +0000 [DELTA-OSS-EXTERNAL] fix a small typo Closes delta-io/delta#511 Signed-off-by: liwensun Author: contrun #12644 is resolved by liwensun/oyxa8ejl. GitOrigin-RevId: 97d874ddd933b3f08fdeecc4bd3d2757f76a409b commit 289c9689c8c5c8a20814192861ad463101ab6097 Author: Terry Kim Date: Fri Sep 25 20:21:33 2020 +0000 [DELTA-OSS-EXTERNAL] Fix AnalysisException message for configureSparkSessionWithExtensionAndCatalog This PR proposes to fix the `AnalysisException` message in `DeltaErrors.configureSparkSessionWithExtensionAndCatalog`. 
Before: ``` org.apache.spark.sql.AnalysisException: This Delta operation requires the SparkSession to be configured with the DeltaSparkSessionExtension and the DeltaCatalog. Please set the necessary configurations when creating the SparkSession as shown below. SparkSession.builder() .option("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") .option("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog" ... .build() ``` After fix: ``` org.apache.spark.sql.AnalysisException: This Delta operation requires the SparkSession to be configured with the DeltaSparkSessionExtension and the DeltaCatalog. Please set the necessary configurations when creating the SparkSession as shown below. SparkSession.builder() .option("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") .option("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") ... .build() ``` Closes delta-io/delta#512 Signed-off-by: liwensun Author: Terry Kim #12645 is resolved by liwensun/o69e912w. GitOrigin-RevId: e1924cd53d79296f889aa78f10ec9a1f7b809c3a commit df78549ab635b4e79f45eb6bdd786a814683f1d8 Author: Jose Torres Date: Wed Sep 23 03:43:41 2020 +0000 [SC-45767][DELTA] ADD/DROP CONSTRAINT APIs Add SQL APIs to add and drop CHECK constraints in a Delta table. new unit tests Author: Jose Torres GitOrigin-RevId: b2fb8d932524f42581ec088658b487d6fb4174ad commit 93549c6dd52cb8b04e31281276ddfc6add7ce893 Author: Jose Torres Date: Tue Sep 22 15:35:21 2020 +0000 [SC-49224][DELTA] Fix bug preventing INSERT INTO on nested struct literals ## What changes were proposed in this pull request? In DeltaAnalysis, we try to construct the appropriate casts for by-position resolution, but we cast nested struct fields to their own names on the source rather than the appropriate name in the target. This produces an infinite loop where DeltaAnalysis detects a cast is needed, inserts another project, but makes no progress. 
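The by-position renaming described above can be sketched in plain Python. This is only an illustrative model, not the `DeltaAnalysis` code: schemas are represented as nested lists of `(name, type)` pairs, and `rename_by_position` is a hypothetical helper. The point it shows is that nested fields must take the target's names; reusing the source's own names leaves the schemas different, so the same cast is detected again on every pass.

```python
from typing import List, Tuple, Union

# Illustrative model: a field is (name, type); a struct type is a list of
# fields and a leaf type is a plain string such as "int".
FieldType = Union[str, List["Field"]]
Field = Tuple[str, FieldType]


def rename_by_position(source: List[Field], target: List[Field]) -> List[Field]:
    """Return the source fields cast, position by position, to the target's names and types."""
    if len(source) != len(target):
        raise ValueError("by-position resolution needs the same number of fields")
    aligned: List[Field] = []
    for (_, src_type), (tgt_name, tgt_type) in zip(source, target):
        if isinstance(src_type, list) != isinstance(tgt_type, list):
            raise TypeError(f"cannot align struct and non-struct at field {tgt_name}")
        if isinstance(src_type, list):
            # Recurse using the *target's* nested fields. Keeping the source's
            # nested names here would make no progress toward the target schema.
            aligned.append((tgt_name, rename_by_position(src_type, tgt_type)))
        else:
            aligned.append((tgt_name, tgt_type))
    return aligned


# A struct literal with its own field names, inserted into a column `s`.
source_schema = [("col1", [("col1", "int"), ("col2", "int")])]
target_schema = [("s", [("a", "int"), ("b", "int")])]
aligned_schema = rename_by_position(source_schema, target_schema)
```

After one pass the aligned schema equals the target schema, so a second pass changes nothing, which is the fixed point the analysis rule needs in order to terminate.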
## How was this patch tested? new unit test Author: Jose Torres #12597 is resolved by jose-torres/testbug. GitOrigin-RevId: b27586d381c01245a62e67c96aca8282d599ac5d commit 819bcd5635dd47775d1c6adb51b3f40674d0ea81 Author: Shixiong Zhu Date: Wed Sep 16 00:20:14 2020 +0000 [SC-44942]Add a Trigger.Once test that runs a second batch for Delta streaming source ## What changes were proposed in this pull request? Add a Trigger.Once test that runs a second batch for Delta streaming source. Author: Shixiong Zhu #12471 is resolved by zsxwing/SC-44942. GitOrigin-RevId: aca0d22f6e2ef2d96a3c9f3b63f4a5930af505d0 commit fd5ad8edde9ba802b3086fcb5d7528e27651da5c Author: Jose Torres Date: Tue Sep 15 17:54:55 2020 +0000 [SC-48358][DELTA] Add a "startingVersion" = "latest" option in Delta. ## What changes were proposed in this pull request? Add a "startingVersion" = "latest" option in Delta. We translate latest to the version 1 after the most recently committed version. ## How was this patch tested? new unit tests Author: Jose Torres #12474 is resolved by jose-torres/deltalatest. GitOrigin-RevId: 1bf158ac3f3376478f955fb0cec63079732bf218 commit 5c137b1f4aaef96ab1f02e20a591094416557be2 Author: Scott Sandre Date: Tue Sep 15 13:17:35 2020 +0000 [SC-48966] Add a flag to disable writing Delta checksum file - added new `DeltaSQLConf` value `DELTA_WRITE_CHECKSUM_ENABLED` - if that conf is set to false when `Checksum.scala::writeChecksumFile` is called, then we return right away GitOrigin-RevId: 5849a60ebf3a5ddbf2a145b9f070e86713be950e commit 5920a813f03b8d285e48e3b72ee17816af597909 Author: Burak Yavuz Date: Sat Sep 12 02:05:16 2020 +0000 [SC-48540][WARMFIX][DELTA] Introduce table properties writeStatsAsJson and writeStatsAsStruct for the checkpoint format We introduce the table properties `delta.checkpoint.writeStatsAsJson` and `delta.checkpoint.writeStatsAsStruct` to decide what to include as part of the checkpoint data. 
We used to consider the protocol version as a requirement for writing the new checkpoint columns. However, it'll be a better design to have these table properties instead and have the protocol upgrade a way to enforce the selection of these table properties. We also introduce a SQL conf for users to test whether they would want to opt-in to the new format or not, instead of having to make a table property change that can cause transaction conflicts. Author: Burak Yavuz GitOrigin-RevId: 0d7873424ce5aa7c351e3f9a2ccead47118cef37 commit eca36507f2e036daa69687c880ff4a58878d9807 Author: Burak Yavuz Date: Thu Sep 10 03:12:22 2020 +0000 [SC-48412][DELTA][WARMFIX] Do not lose stats in checkpoints when statsAsJson field is not written We were creating the stats_parsed column in V2 checkpoints by using the "stats" column in AddFile, however V2 checkpoints that do not write the "stats" column, when `writeStatsAsJson = false` no longer have this information. This PR fixes the checkpoint write code to leverage the "stats_parsed" field existing in previous checkpoint information when writing the new checkpoint. GitOrigin-RevId: 0aeccb076e899c6d8b8399dfdb2cb210ec657983 commit 9fc2acba2bce9869c98f2990dfe57a2667720d1c Author: Shixiong Zhu Date: Wed Sep 9 00:26:50 2020 +0000 [SC-47216][WARMFIX]Fix "startingTimestamp" behavior for Delta streaming source ## What changes were proposed in this pull request? Currently when we look up the version for `startingTimestamp`, we will search the latest commit that's **before or at** the timestamp. However, the behavior of `startingTimestamp` should be returning all changes happening **at or after** the timestamp. So we should search the earliest commit that's at or after the timestamp. Another change is we don't need to require `startingVersion/startingTimestamp` point to a recreatable version. If the json file exists, we should be able to read changes. ## How was this patch tested? 
Refactoring the existing tests for `startingVersion` and `startingTimestamp` to cover more edge cases. ## **IMPORTANT** Warmfix instructions - [X] Other (please describe): Fix an incorrect behavior of a new feature Author: Shixiong Zhu #12333 is resolved by zsxwing/SC-47216. GitOrigin-RevId: 7918cc10365c95d79f544beb3670f8d7121a79f3 commit 862e3c41bcae850dfbb99d3bf73346450809cc5a Author: Scott Sandre Date: Tue Sep 8 03:33:03 2020 +0000 [SC-46515] Add the RemoveFile path to DeltaSourceIgnore{Changes,Deletes}ErrorMessage Resolves: https://databricks.atlassian.net/browse/SC-46515 For the errors `DeltaSourceIgnoreChangesErrorMessage` and `DeltaSourceIgnoreDeleteErrorMessage` we add to the exception message: - the RemoveFile Action path that caused the exception - the commit version The 2 error messages now look like this: ``` s"Detected deleted data (for example, file $removedFile) from streaming source at " + s"version $version. This is currently not supported. If you'd like to ignore deletes, " + "set the option 'ignoreDeletes' to 'true'." ``` and ``` s"Detected a data update (for example, file $removedFile) in the source table at version " + s"$version. This is currently not supported. If you'd like to ignore updates, set the " + "option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, " + "please restart this query with a fresh checkpoint directory." ``` Inside of `DeltaSourceSuite` added 2 unit tests: - `SC-46515: deltaSourceIgnoreDeleteError contains removeFile, version`. This covers the scenario where there is an AddFile action & RemoveFile action & `ignoreChanges == false`. - `SC-46515: deltaSourceIgnoreChangesError contains removeFile, version`. This covers the scenario where there is NO AddFile action & is a RemoveFile action & `ignoreDeletes == false`. 
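For readers who want to see what the two messages look like once the file path and version are filled in, here is a small Python rendering of the templates quoted above. The function names are hypothetical and exist only for this illustration; the actual messages are built in Scala.

```python
def ignore_deletes_error(removed_file: str, version: int) -> str:
    """Render the 'deleted data' message with the offending file and commit version."""
    return (
        f"Detected deleted data (for example, file {removed_file}) from streaming source at "
        f"version {version}. This is currently not supported. If you'd like to ignore deletes, "
        "set the option 'ignoreDeletes' to 'true'."
    )


def ignore_changes_error(removed_file: str, version: int) -> str:
    """Render the 'data update' message with the offending file and commit version."""
    return (
        f"Detected a data update (for example, file {removed_file}) in the source table at version "
        f"{version}. This is currently not supported. If you'd like to ignore updates, set the "
        "option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, "
        "please restart this query with a fresh checkpoint directory."
    )


# Hypothetical file name and version, used only to show the rendered text.
deletes_message = ignore_deletes_error("part-0001.parquet", 7)
changes_message = ignore_changes_error("part-0001.parquet", 7)
```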
Author: Scott Sandre GitOrigin-RevId: de403937f68bc299b362ab8f9d2f5617c7be12c8 commit 0ca9e329918a4c65051ab9e1f9a01525e5de86e5 Author: HyukjinKwon Date: Sat Aug 8 08:51:57 2020 -0700 [SPARK-32319][PYSPARK] Disallow the use of unused imports Removing unused imports from the Python files to keep everything nice and tidy. Cleaning up of the imports that aren't used, and suppressing the imports that are used as references to other modules, preserving backward compatibility. Authored-by: HyukjinKwon commit 0259547afa9c97be254293d873ed04d1dc7c3f9d Author: Yuanjian Li Date: Fri Sep 4 12:01:03 2020 +0000 [SC-46562] Small fixes due to spark merge Small fix in DeltaCommand and Snapshot Author: Yuanjian Li commit 80e19f26002c49e5dea1bd37d1c98f1ce4261958 Author: Maryann Xue Date: Wed Sep 2 18:48:22 2020 +0000 [SC-46662][SQL] Small changes in MergeInto tests changes in MergeIntoSuiteBase Author: Maryann Xue GitOrigin-RevId: a1ec6d604ee9ad0d38ea07d1e7db8eaadee7ba7b commit 7c75c0b15037565dcf07e0344763f519c5616de0 Author: Scott Sandre Date: Wed Sep 2 14:13:10 2020 -0400 [DELTA-OSS-EXTERNAL] Remove passPartitionByAsOptions from DeltaDataSource (#12269) Resolves delta-io/delta#505 Signed-off-by: Scott Sandre Closes delta-io/delta#506 Signed-off-by: Scott Sandre GitOrigin-RevId: 62b4353ef9e8e4f2e52f5f73349bc801d3474693 commit 263c0d58093cc9e3707759a3579c4297ce334fac Author: Alan Jin Date: Wed Sep 2 15:52:03 2020 +0000 [DELTA-OSS-EXTERNAL][SC-42555] Allow multiple matches in Merge when matches are unconditionally deleted ```scala def multipleSourceRowMatchingTargetRowInMergeException(spark: SparkSession): Throwable = { new UnsupportedOperationException( s"""Cannot perform MERGE as multiple source rows matched and attempted to update the same |target row in the Delta table. By SQL semantics of merge, when multiple source rows match |on the same target row, the update operation is ambiguous as it is unclear which source |should be used to update the matching target row. 
|You can preprocess the source table to eliminate the possibility of multiple matches. |Please refer to |${generateDocsLink(spark.sparkContext.getConf, "/delta-update.html#upsert-into-a-table-using-merge")}""".stripMargin ) } ``` Checking multiple rows matching is to avoid updating the same target row in the Delta table. So for delete only clause (without update clause), we should not throw this exception. Closes delta-io/delta#434 Signed-off-by: Tathagata Das Author: Tathagata Das Author: Alan Jin #12212 is resolved by tdas/wrhspo2s. GitOrigin-RevId: 32047e408a7cf2734b23449e9016d405444fba9f commit ccda889a15167e3e51b078fabaec8c36b416e27a Author: Burak Yavuz Date: Wed Sep 2 04:39:55 2020 +0000 [SC-46931][DELTA] Clean up commitLarge ## What changes were proposed in this pull request? There are some unused variables in commitLarge. Clean them up. ## How was this patch tested? Minor refactor, no new tests needed. Author: Burak Yavuz #12254 is resolved by brkyvz/optClone. GitOrigin-RevId: e5100308abc6c3602a88aa00cded5acb8a3ed915 commit f7ae8f4af3881a0d46837bfdbbeac105219442f4 Author: Burak Yavuz Date: Tue Sep 1 23:29:42 2020 +0000 [SC-37256][DELTA] Remove flag to enable v2 checkpoint ## What changes were proposed in this pull request? Now that we have the protocol upgrade, we don't need this flag anymore. ## How was this patch tested? Existing tests Author: Burak Yavuz #12244 is resolved by brkyvz/enableV2. GitOrigin-RevId: 81022965a3870ae32f168f5c249db1d3c7c5da98 commit 9266b7b4884e7a2f0b357541b19928c89a8d0869 Author: Jose Torres Date: Tue Sep 1 23:10:29 2020 +0000 [SC-45766][DELTA] Various followups for CHECK constraint enforcement. ## What changes were proposed in this pull request? * Change to a protocol version check rather than a hard exception in requiredMinimumProtocol, so other metadata changes can be made to a table which has had a constraint added by future versions. (There's still no API to add constraints currently.) 
* Unfold InvariantViolationExceptions from their nesting inside SparkException trees for usability. * Comment the codegen column values extraction. ## How was this patch tested? Existing tests with slight tweaks Author: Jose Torres #12246 is resolved by jose-torres/constraintfollow. GitOrigin-RevId: 20b15c79b628b8d0a1a6ddeff87e93875c7da375 commit f4369fbdf50cc0e04c7f938baea3256807e9d5c6 Author: Jose Torres Date: Tue Sep 1 18:39:54 2020 +0000 [SC-45766][DELTA] Add CHECK constraint enforcement. ## What changes were proposed in this pull request? Add enforcement of CHECK constraints specified in table properties. These are merged with column-level invariants extracted from the table schema (including both NOT NULL and some single-column CHECK constraints, although those single-column constraints were never exposed in a public API), and then enforced through the existing invariant checker with some modifications. This doesn't yet include the API to create CHECK constraints. ## How was this patch tested? new unit tests Author: Jose Torres #12111 is resolved by jose-torres/invariantstorage. GitOrigin-RevId: f9cd927127d6432b4bdd8a67acea32ee00fa73ca commit 4498e201218377b517fb7b6ba90f1b473abcaf4d Author: Andrew Fogarty Date: Mon Aug 31 23:31:33 2020 +0000 [DELTA-OSS-EXTERNAL] Add "@since" to DeltaMergeMatchedActionBuilder.delete() This small PR adds a `@since 0.3.0` annotation to `DeltaMergeMatchedActionBuilder.delete()`, which seems to be the only public API function missing this. I also noticed that this function is missing a `@Evolving` tag. Is this intentional? CC: @rapoth @suhsteve @imback82 Closes delta-io/delta#493 Signed-off-by: Tathagata Das Author: Tathagata Das Author: Andrew Fogarty #12210 is resolved by tdas/8wg7e9nq. 
GitOrigin-RevId: e55fd064c923186d349c8c6078a3d5fc4ed375b4 commit d627769ce9822b8fad1f42bd0a00245a21892216 Author: Burak Yavuz Date: Mon Aug 31 20:13:30 2020 +0000 [SC-30351][DELTA] Remove checkpoint size config ## What changes were proposed in this pull request? Remove another unused configuration. ## How was this patch tested? Existing tests Author: Burak Yavuz #8658 is resolved by brkyvz/chkSize. GitOrigin-RevId: 68665da57b716ea7721cfde03898d0b3687237ea commit 3f1c603d3a9ab10d52109788d1c0c7af2789c3ba Author: Burak Yavuz Date: Sat Aug 29 00:20:33 2020 +0000 [SC-45757][DELTA] Upgrade the max Delta Protocol Version 3 and introduce upgrade API ## What changes were proposed in this pull request? Introduces the `DeltaTable.upgradeTableProtocol` method for upgrading a Delta table and also revs up the writer version to version 3. With version 3, writers are required to: - Write the new checkpoint format - Respect Check Constraints when writing to Delta tables SQL API will be added in a subsequent PR. ## How was this patch tested? Adds new tests for the checkpoint version and API tests Author: Burak Yavuz #12143 is resolved by brkyvz/protUpgrade. GitOrigin-RevId: 1cf8504306b2e26e4f1a7038f7de80aeabce932c commit d630877e8943b92bdc63543fe6f3c2dfa5b547d2 Author: Tathagata Das Date: Fri Aug 28 18:09:21 2020 +0000 [SC-44274][Delta] small changes in MergeInto tests Small changes in MergeInto tests Author: Tathagata Das Author: Zach Schuermann GitOrigin-RevId: 8e5a072c2ae93efc582ed0af061b813347b6ced2 commit af179901b0301c60467136af713f6246b2617e30 Author: Jose Torres Date: Fri Aug 28 16:55:16 2020 +0000 [SC-45763][DELTA] Add a protocol check for new table level CHECK constraints. ## What changes were proposed in this pull request? Just includes the check, reporting that the current version can't enforce constraints for forward compatibility. No actual implementation yet. ## How was this patch tested? 
new unit tests Author: Jose Torres #12115 is resolved by jose-torres/invariants2. GitOrigin-RevId: b0eb9dfb87694e2ab9ad329fbeba36210b246ebe commit 548fc74fa226379c3d34d4d7c2e652167656641a Author: Burak Yavuz Date: Wed Aug 26 23:52:43 2020 +0000 [SC-46268][DELTA] Change constructor of DeltaTable to leverage DeltaTableV2 ## What changes were proposed in this pull request? Leverage DeltaTableV2 as the constructor of DeltaTable, so that we can understand if the table was created through the `forPath` or `forName` code path. ## How was this patch tested? Existing unit tests Author: Burak Yavuz #12058 is resolved by brkyvz/cloneStats. GitOrigin-RevId: 9477ae98467ae115363c66a2ffcce886a0436fd6 commit b753ae3b5506bb551e6fa39cd06e6bc993e0478f Author: Pranav Anand Date: Wed Aug 26 19:38:26 2020 +0000 [SC-45770] Pass in operation metrics to commitLarge - Pass in operation metrics to `commitLarge` Author: Pranav Anand GitOrigin-RevId: 809dbc5d7a7ffc36a656c68756ef17b28863bb76 commit 7ffbaed75cb28569a58667e2f6f6d77f5605be90 Author: Jose Torres Date: Wed Aug 26 19:29:19 2020 +0000 [ES-31086][DELTA] Don't allow NOT NULL constraints inside arrays and maps. ## What changes were proposed in this pull request? Don't allow NOT NULL constraints inside arrays and maps, since they can't be enforced there. Includes a fallback flag to allow the constraint (but properly unset the nullability flag) since this has been incorrectly allowed for a while. ## How was this patch tested? new unit tests Author: Jose Torres #11448 is resolved by jose-torres/notnullarr. GitOrigin-RevId: 3e84b5b8d7d5cd5e1755fa441d58aaee28ab1677 commit f274039edf2ac3df88cad754e91849736f28c60b Author: Burak Yavuz Date: Fri Aug 21 21:14:40 2020 +0000 [SC-45188][DELTA] Update the new checkpoint format for Delta ## What changes were proposed in this pull request? Implement a new format for checkpoints in Delta that contain partition values as parsed columns as part of the checkpoint. 
This will help in avoiding potential casting bugs when performing partition filters. This new code path will be enabled by default for tables that are upgraded to writer protocol 3. The related code for upgrading the protocol will be introduced in a follow up PR. ## How was this patch tested? Unit tests Author: Burak Yavuz Author: Burak Yavuz #11864 is resolved by brkyvz/checkpointV2New. GitOrigin-RevId: ad74d3e3b15efc58ce432e56bd197db4411f6c7b commit a1e23b1e6dea33ba47fea2bd49c620b3aee506ab Author: Pranav Anand Date: Wed Aug 19 22:05:17 2020 +0000 [SC-42342] Fail starved writers on Delta tables - Most of the diff here is whitespace - New method `doCommitRetryIteratively` which calls `doCommit`, if `doCommit` fails, we separately call `checkAndRetry` and loop until the commit succeeds or the `DELTA_MAX_RETRY_COMMIT_ATTEMPTS` (a new conf) is exceeded - `doCommit` now does not catch `FileAlreadyExistsException` - `checkAndRetry` returns the next commit version to retry at instead of calling `doCommit` directly - Added new error to `DeltaErrors` - `maxCommitRetriesExceededException` - New test suite `TransactionRetrySuite` that creates a fake log store that just throws an error. It verifies that the `DELTA_MAX_RETRY_COMMIT_ATTEMPTS` works as intended. Author: Pranav Anand commit 63367af36877096f1efc6ae3702b212d8803db00 Author: Luca Menichetti Date: Thu Sep 3 19:27:55 2020 +0200 Corrected small typo in the Scala version example (#45) commit 3ed57d0668b09d48172b0eb7221f647811657972 Author: Wesley Hoffman Date: Wed Aug 19 01:15:39 2020 +0000 [DELTA-OSS-EXTERNAL] Add option to start stream from a specific version **Goal:** Add the ability to start structured streaming from a user provided version as outlined in #474 **Desc:** This change will introduce `startingVersion` and `startingTimestamp` to the Delta Stream Reader so that users can define a starting point of their stream source without processing the entire table. 
**What's the behavior if both startingVersion and startingTimestamp are set?** Both options cannot be provided together. An error will be thrown telling you to choose only one. **When a query restarts, how does the new options work?** If a query restarts and a checkpoint location is provided, the stream will start from the checkpoints offset as opposed to the provided option. **If the user uses an old Delta Lake version to read the checkpoint, what will happen?** If we use a newer version of Delta with this option, we will store an offset that looks like: Process up to this version and this index. If the stream fails and someone reverts to an older version of Delta, we would try to process ALL the data in the snapshot up to the end offset which we came up with. Credit to brkyvz for a large percentage of the work. Closes delta-io/delta#478 Signed-off-by: Burak Yavuz Author: Burak Yavuz Author: Wesley Hoffman GitOrigin-RevId: c654217a70b9f6bbee6e96a079223a693dd85e8a commit cb0384272dc1c3a42fb382c93cd300df5f474beb Author: Burak Yavuz Date: Mon Aug 17 23:50:09 2020 +0000 [DELTA][MINOR] Fix logging of starting checkpoint in SnapshotManagement ## What changes were proposed in this pull request? Fix logging for Delta.update `Some(...)` today which makes it less readable. Author: Burak Yavuz Author: Burak Yavuz #11904 is resolved by brkyvz/fixLogging. GitOrigin-RevId: df69dfc01d267796988b0f0c8abe8ace66eda425 commit b088dc0b1d0dc14f0cdf51cbc7806039b3767123 Author: Burak Yavuz Date: Mon Aug 17 21:01:54 2020 +0000 [SC-44268][DELTA] Add note about unpartitioned tables for Delta's checkpoint format ## What changes were proposed in this pull request? Clarify that the field `partitionValues_parsed` is required when the table is partitioned. If it is not partitioned, it may be empty. ## How was this patch tested? Doc change Author: Burak Yavuz #11899 is resolved by brkyvz/protocolDoc2. 
GitOrigin-RevId: 22a36fcf9bad33ab1691446b57ae5e0c61e31d62 commit 6500abbf9a2f52046cbd30daaa81ffdc00cbb26f Author: Burak Yavuz Date: Thu Aug 13 23:33:25 2020 +0000 [SC-44271][DELTA] Introduce default protocol version for Delta tables ## What changes were proposed in this pull request? In this PR, we introduce the concept of a default protocol version for Delta tables. As we build new features for Delta, we may have to make protocol breaking changes. Protocol upgrades could be a high-friction process as all readers and/or writers to a Delta table may have to be upgraded after a protocol change. With the new default configurations, users will be able to create tables that can be accessed the most widely. Only if they decide to use a new feature, only then will the table protocol require an upgrade. If the table is created with the respective table configuration, then we will choose the minimum required protocol version. During this process, I found some other irregularities, specifically around table creation. REPLACE wasn't really "creating" a new table as it should in the sense that: - Nullability information would be dropped (tests for this will be added separately) - default configurations for a table through SQL configs were not being picked up Therefore I introduced a new `updateMetadataForNewTable` method that will set the protocol for the new table as well. I've also moved protocol verification checks into `updateMetadata` for fail-fast behavior and `prepareCommit` as a last check. We also remove the "Upgrade your protocol" message as it can become annoying (it was initially meant to be). ## How was this patch tested? Tests in DeltaProtocolVersionSuite Author: Burak Yavuz #11723 is resolved by brkyvz/protocolConf. 
GitOrigin-RevId: df07ae9b8c183f4c9732632d8a7cfcf70522f038 commit 6dbd3a382d3d03f6f8d2a5caa089e2f0bb798158 Author: Burak Yavuz Date: Thu Aug 13 22:05:43 2020 +0000 [SC-44268][DELTA] Update Delta Protocol Doc with Checkpoint format information ## What changes were proposed in this pull request? In this PR we add sections to the Delta Protocol doc around the checkpoint format, essentially describing what format the checkpoint should be in, how the data should be distributed if this is a multi-part checkpoint, and what the schema of the checkpoint file should be. We also note the new columns that must exist in the new protocol, when we upgrade to writer version 3. I also noticed that the action reconciliation section was missing, therefore added that as well. ## How was this patch tested? N/A Author: Burak Yavuz #11673 is resolved by brkyvz/protDoc. GitOrigin-RevId: e6a4bb90875a1a09b24b123eb928993bee838556 commit efedee8e366d77ccc6feb4b5e1ff1ab47f5e2e2a Author: Tathagata Das Date: Wed Aug 12 23:24:08 2020 +0000 [SC-44184][Delta] Fixed incorrect results due to ambiguous attribute resolution in self-merge ## What changes were proposed in this pull request? Merge operation through Scala/Python APIs can cause data corruption when the source DataFrame was built using the same DeltaTable instance as that used for invoking merge. That is something like this. ``` val source = deltaTable.toDF.filter(...) deltaTable.as("t").merge(source.as("s"), "t.key = s.key") ``` This is because `deltaTable.toDF` always returns the same DF with resolved attribute refs, and any source DF generated from that will get the same attribute references. Therefore when the merge plan is generated by the `DeltaMergeBuilder`, both the children (`source` and `target`) have the same refs. All expressions in the merge plan may resolve successfully but still create an ambiguous resolved plan, something like this. 
```
'DeltaMergeInto ('t.key = 's.key), [Delete [, ]]
:- SubqueryAlias t
:  +- Relation[key#257L,value#258L] parquet
+- SubqueryAlias s
   +- Filter (key#257L = cast(4 as bigint))
      +- Relation[key#257L,value#258L] parquet
```

This can lead to incorrect results because all the expressions may bind to the incorrect columns when executing the merge in MergeIntoCommand. For example, expressions that copy unmodified target rows can instead pick up values from correspondingly-named columns in the source row (i.e., pick up the value of source's `key#257L` instead of target's `key#257L`).

The solution chosen in this PR is to rewrite all the attribute references in the target plan when duplicate attribute refs are detected between source and target. Here are the details of when and how this rewrite occurs.

- Rewrite target instead of source - Rewriting refs is a non-trivial challenge in general plans, as shown [here](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L1167). So we choose to rewrite the target plan, which is guaranteed to be simple and contains only a few known plan items.
- Only when duplicate refs are detected - We do not want to always rewrite the references because there may be expressions provided by the user that are pre-resolved on the source and target (i.e., when using `target("colName")` format to generate expressions) and arbitrary reference rewrite will cause unnecessary resolution errors in unambiguous queries in existing workloads.

When the duplicated refs in the target plan are rewritten, if there are pre-resolved merge expressions (conditions and actions) where the same refs are used, then those are likely to be ambiguous. Hence we unresolve those refs and let the standard resolution logic check whether they are ambiguous or not.
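The rewrite rule above can be modelled in a few lines of plain Python. This is a sketch under stated assumptions rather than the Catalyst implementation: `Attr`, `rewrite_target_if_ambiguous`, and the `fresh_ids` iterator are hypothetical names, and a plan is reduced to its list of output attributes. The caller is assumed to supply IDs that are not already in use.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class Attr:
    """A resolved attribute reference, e.g. key#257 is Attr("key", 257)."""
    name: str
    expr_id: int


def rewrite_target_if_ambiguous(
    target: List[Attr], source: List[Attr], fresh_ids: Iterator[int]
) -> Tuple[List[Attr], Dict[int, Attr]]:
    """Give every target attribute a fresh ID, but only if target and source share an ID."""
    source_ids = {attr.expr_id for attr in source}
    if not any(attr.expr_id in source_ids for attr in target):
        # No duplicate refs: leave the plan alone so pre-resolved user
        # expressions keep resolving as before.
        return list(target), {}
    # Duplicate refs detected: rewrite all target attributes (never the source).
    mapping = {attr.expr_id: Attr(attr.name, next(fresh_ids)) for attr in target}
    return [mapping[attr.expr_id] for attr in target], mapping


# Self-merge: source was derived from the same DataFrame, so the IDs collide.
target_attrs = [Attr("key", 257), Attr("value", 258)]
self_merge_source = [Attr("key", 257), Attr("value", 258)]
rewritten, old_to_new = rewrite_target_if_ambiguous(target_attrs, self_merge_source, count(1000))

# Ordinary merge: distinct IDs, so nothing is rewritten.
other_source = [Attr("key", 300), Attr("value", 301)]
untouched, no_mapping = rewrite_target_if_ambiguous(target_attrs, other_source, count(2000))
```

The keys of the returned mapping are the old IDs that were replaced. In this model, a pre-resolved merge expression that still refers to one of those IDs is the kind of reference that would be unresolved and handed back to the standard resolution logic.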
**This is the part where there is a slight chance of regression - some existing self-merges with pre-resolved expressions that used to accidentally produce the correct results will now start failing. However, the chance of such regression is very low.**

More specifically, here is the precise characterization of what queries are affected.

- Non-self-merge queries - Not affected, as this PR rewrites plans only when duplicate refs are detected.
- Self-merge queries - there are two cases:
  - Without pre-resolved exprs - There are unambiguous self-merge queries that would generate ambiguous plans and therefore most likely produce incorrect results. With this PR, those unambiguous queries will always produce unambiguous plans and therefore correct results. No risk of failing those workloads, just making them correct.
  - With pre-resolved exprs - The small set of queries that produced ambiguous plans will now throw a resolution error. A very small subset of these queries that passed earlier (but now fail) could have accidentally produced correct results. This is the tiny risk.

## How was this patch tested?

- Added more tests covering the presence of pre-resolved refs in all merge expressions
- Added tests for self merge, without and with pre-resolved refs.

Author: Tathagata Das
#11610 is resolved by tdas/SC-44184.
GitOrigin-RevId: 6809a073f3342b83d6591281f40565ad513f49f3

commit cdbff575dec5d7ceca14576d5fba8988abcdf196
Author: Burak Yavuz
Date: Fri Aug 7 20:02:14 2020 +0000

[SC-43819][DELTA] Catch Glue permission errors in DeltaCatalog for path identifiers

## What changes were proposed in this pull request?

Some users connect to Glue as their MetaStore (catalog). When Spark needs to resolve path based identifiers, it first needs to check with the catalog if the user provided table exists, and only then does it try to resolve if the provided identifier is a path identifier.
When accessing Glue though, Glue can easily throw a permission error. This shouldn't block users from accessing path based Delta tables. ## How was this patch tested? Unit tests in DeltaSuite and ConvertToDeltaSuite Author: Burak Yavuz Author: Burak Yavuz #11573 is resolved by brkyvz/glueErrors. GitOrigin-RevId: b3b49cd38406d1fe0a94864b430b71177a2949c7 commit c47cde080a41ce6c249e644aecc1d1b825fdf6c8 Author: Wesley Hoffman Date: Tue Aug 4 20:04:05 2020 +0000 [DELTA-OSS-EXTERNAL] Make DeltaTable Java Serializable Make `DeltaTable` Serializable so that it can be sent to executors without throwing `NotSerializableException`. However, methods of `DeltaTable` should not be allowed to run on the executors so should throw a clear exception explaining this. Closes delta-io/delta#485 delta-io/delta#486 Signed-off-by: Shixiong Zhu Author: Shixiong Zhu Author: Wesley Hoffman #11575 is resolved by zsxwing/jdmirx4o. GitOrigin-RevId: bb7dd868cb4782a7c09e46d8ad4a20405edbab3d commit 7e1588171325c62dbdbb8176889f56bce1a874b0 Author: Burak Yavuz Date: Thu Jul 30 19:07:01 2020 +0000 [SC-42763][DELTA][WARMFIX] commitLarge shouldn't commit the protocol ## What changes were proposed in this pull request? commitLarge shouldn't change the protocol, otherwise it causes transaction conflicts with other transactions. ## How was this patch tested? Unit test Author: Burak Yavuz Author: Burak Yavuz #11493 is resolved by brkyvz/cloneRocks. GitOrigin-RevId: f8b6fe904775be7041ab2324add0d3ec540d17f0 commit 13c9c6ee9ee6e6921d59e940243f5eabbee3841e Author: Zach Schuermann Date: Wed Jul 29 21:00:32 2020 +0000 [DELTA] Unlimited MATCHED/NOT MATCHED clauses in MERGE Currently, Delta’s MERGE only supports two MATCHED and one NOT MATCHED clauses. Since the Spark 3.0 SQL parser supports any number of MATCHED and NOT MATCHED clauses in MERGE, this functionality is extended to Delta to allow for any number of MATCHED and NOT MATCHED clauses in Delta’s Merge. 
The API largely remains the same, with the removal of the limitation on the number of MATCHED/NOT MATCHED clauses:
```sql
MERGE INTO [db_name.]target_table [AS target_alias]
USING [db_name.]source_table [<time_travel_version>] [AS source_alias]
ON <merge_condition>
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
...
[ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ]
[ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ]
...

with:
<matched_action> =
  DELETE |
  UPDATE SET * |
  UPDATE SET column1 = value1 [, column2 = value2 ...]

<not_matched_action> =
  INSERT * |
  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])

<time_travel_version> =
  TIMESTAMP AS OF timestamp_expression |
  VERSION AS OF version
```
Old unit tests were modified to accommodate new behavior and new tests were added. Additionally, quicksilver benchmarks were used to ensure no performance regressions occurred. Benchmark comparison with master shows no significant regression in most non-trivial merge cases (cases that take more than 10 seconds) Author: Zach Schuermann GitOrigin-RevId: 63fc1628cd9b3009f0208d54661fc923960aa563 commit f75971a658c0a8c8e6fb887dcb3bcbe09b476866 Author: Shixiong Zhu Date: Wed Jul 29 14:47:47 2020 +0000 [SC-42546]Update ScalaObjectMapper's package name in imports ## What changes were proposed in this pull request? `com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper` is deprecated in [jackson-module-scala 2.10.0](https://github.com/FasterXML/jackson-module-scala/blob/jackson-module-scala-2.10.0/src/main/scala/com/fasterxml/jackson/module/scala/experimental/ScalaObjectMapper.scala#L9). All of the code has been moved to `com.fasterxml.jackson.module.scala.ScalaObjectMapper`. `com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper` is just an empty trait extending `com.fasterxml.jackson.module.scala.ScalaObjectMapper` now. This PR updates our code to use the new package name `com.fasterxml.jackson.module.scala`, as `com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper` is removed in jackson-module-scala 2.11.0. ## How was this patch tested?
Jenkins Author: Shixiong Zhu #11464 is resolved by zsxwing/ScalaObjectMapper. GitOrigin-RevId: 622c216b5b5bfe7d988208d359a890375b3e9940 commit d70cf7794c75d6d42e81f3486f3e4bba748a12e7 Author: Pranav Anand Date: Fri Jul 24 23:37:03 2020 +0000 [WARMFIX][SC-38808] Add more logging to checkAndRetry - `logInfo` and `recordDeltaOperation` added to `checkAndRetry` Author: Pranav Anand GitOrigin-RevId: 93573dd6d9aa8991cbbb0aab72d84ef7bb595492 commit 546849165dea091d59c8f202afa8c0d3ea6b014b Author: David Lewis Date: Thu Jul 23 18:16:40 2020 +0000 Minor changes Author: David Lewis commit 63d9d59dc3a2155fc95beb15c7bc4fcd8322c3a7 Author: Wenchen Fan Date: Thu Jul 23 06:06:10 2020 +0000 [SPARK-32030][SQL] Added tests in preparation for unlimited MATCHED and NOT MATCHED clauses in MERGE Added a few more tests to verify current merge clause limits, in preparation for unlimited merge clause support in Apache Spark. Author: Wenchen Fan GitOrigin-RevId: 9416708cf5e0a0e1a590e46499118b166083f3a0 commit bf9902ced9c906b682605e462d288e6b5ee83972 Author: Allen Reese Date: Sat Aug 1 15:25:05 2020 -0700 Fix a small typo in the README, and add a create table command example (#44) commit f587c1d117e9ef907a377604eb78b1e2c705ef6c Author: Alan Jin Date: Fri Jul 10 20:34:29 2020 +0000 [DELTA-OSS-EXTERNAL] testNullCaseMatchedOnly misses non-partition table case This fixes an obvious gap in the unit test `testNullCaseMatchedOnly`: `isPartitioned` was not set correctly. Closes delta-io/delta#475 Signed-off-by: Rahul Mahadev Author: Alan Jin #11095 is resolved by rahulsmahadev/ip8ggg3x. GitOrigin-RevId: b847c41a43591fe30715723d91945b4f813a1c39 commit 7028414d180ec0947e92249b5ef2fb579fbd25ff Author: Shixiong Zhu Date: Wed Jul 8 18:48:00 2020 +0000 [SC-37944][WARMFIX]Use a temp dir for spark warehouse dir in Delta python tests This PR creates a temp dir for spark warehouse dir so that we don't leak files in the project directory when tests crash or get interrupted.
- Manually ran `delta.tests.test_sql` locally and confirmed the spark warehouse dir changed to a temp dir. Author: Shixiong Zhu GitOrigin-RevId: e8f4d5c0814b9ef57270c326078e205bab43d6e2 commit f5cde0114174aaeac589faf80b7b8e0e248f7a03 Author: Shixiong Zhu Date: Mon Jul 6 23:13:55 2020 +0000 [SC-35302] Add a test to make sure Describe History can work in a complicated query Author: Shixiong Zhu GitOrigin-RevId: 3d1d45b6ab7f30dbd7b07841862fec07de319231 commit b81b40298f0de1eca288686294909ee812bba772 Author: Zach Schuermann Date: Wed Jul 1 18:26:50 2020 +0000 [DELTA] Support querying last commit version in SparkSession This introduces a new feature to query the version of the last commit written by the current SparkSession across **any** Delta table. The API is a new SQLConf field in the `SparkSession`’s `SessionState`: `spark.databricks.delta.lastCommitVersionInSession`, that the user can query from SQL or any supported language. The user can simply query a new SQLConf field to find an up-to-date (optional) String encoding of the version of the last commit written by that `SparkSession`. If no commits have been made by the spark session, querying the key will return `None`, and after commits, it will return `Some()`. How much do we care about atomicity for this operation? Currently, it's possible that the following occurs: - A pair of commits are being written concurrently (say `commit(0)` and `commit(1)`) - The ordering of the commit => update SQLConf could be: 1. `commit(0)` 2. `commit(1)` 3. `setSQLConf(0)` 4. `setSQLConf(1)` - A read between steps 3 and 4 would yield an incorrect result. In order to mitigate this behavior, locks would likely need to attach the update operation to the commit. We have moved the `setSQLConf` as close as possible to the code performing the deltaLog write, thereby closing the section where we would produce invalid reads to a negligible size. 
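The interleaving described above can be replayed deterministically in a few lines. This is only a minimal sketch under stated assumptions: `SessionConf` and `replay` are hypothetical stand-ins written for illustration, not Delta or Spark classes, and the "commits" are simulated by appending version numbers to a buffer. It shows why a read between steps 3 and 4 returns version 0 even though version 1 has already been committed.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical model of the race; none of these names come from Delta itself.
object LastCommitVersionRace {

  // Stand-in for the session-level conf that stores the last commit version.
  final class SessionConf {
    private var lastCommitVersion: Option[String] = None
    def set(version: Long): Unit = { lastCommitVersion = Some(version.toString) }
    def get: Option[String] = lastCommitVersion
  }

  /** Replays steps 1-4 in order and returns
    * (conf value read between steps 3 and 4,
    *  newest committed version at that moment,
    *  conf value read after step 4). */
  def replay(): (Option[String], Long, Option[String]) = {
    val conf = new SessionConf
    val committed = ArrayBuffer.empty[Long]

    committed += 0L            // 1. commit(0)
    committed += 1L            // 2. commit(1)
    conf.set(0L)               // 3. setSQLConf(0)
    val staleRead = conf.get   // a reader arriving here sees the older version
    val newestCommitted = committed.max
    conf.set(1L)               // 4. setSQLConf(1)

    (staleRead, newestCommitted, conf.get)
  }

  def main(args: Array[String]): Unit = {
    val (during, newest, after) = replay()
    println(s"read between steps 3 and 4: $during (newest committed: $newest)")
    println(s"read after step 4: $after")
  }
}
```

Closing the window entirely would require the commit and the conf update to happen under one lock, which is the trade-off the description weighs against simply moving the update next to the write.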
Unit tests Author: Zach Schuermann GitOrigin-RevId: f816852145480297ea7244368f5c4c02d3b15dd9 commit 13a13a4b867ceb485c97aaabe592a3ce0e43e267 Author: Burak Yavuz Date: Wed Jul 1 06:29:31 2020 +0000 [SC-40285][DELTA] Refactor commitLarge to be able to commit an arbitrary version ## What changes were proposed in this pull request? Refactors commitLarge to be able to commit the next version with arbitrary actions. `commitLarge` won't retry and will simply throw an exception if a commit exists. ## How was this patch tested? YOLO Author: Burak Yavuz #10887 is resolved by brkyvz/cloneReplace. GitOrigin-RevId: 4bd6bd2b95fe585a6e81d25505407dacd685fe3f commit 857c68a2badc3e2b8a718aed1a60cfe7e7cb8786 Author: liwensun Date: Tue Jun 30 06:02:36 2020 +0000 [SC-37899] Add a method to list the directory locally using LogStore without launching a spark job Author: liwensun GitOrigin-RevId: 0733f71b09091872d42b75fcab13b5655f1744ac commit 7ff30c5ac8703fd8c3dd505738a38f95620bced3 Author: Burak Yavuz Date: Mon Jun 29 23:41:05 2020 +0000 [SC-39734][DELTA] Fix working on a stale snapshot when accessing a Delta table by path When a Delta table is accessed by path, a possibly stale snapshot of the table is being used during analysis, whereas we should be using the latest available state. Otherwise, if some external writer changes the table, we will simply hit an error. Added a unit test Author: Burak Yavuz Author: Burak Yavuz GitOrigin-RevId: 8a2a8c0f53d43c8ea6240c61669217c07e23bec1 commit cdf828291019d0627cb146767228de0a812eba54 Author: Jose Torres Date: Mon Jun 29 21:17:36 2020 +0000 [SC-37296][DELTA] Clarify error message when doing DESCRIBE DETAIL on a view. ## What changes were proposed in this pull request? DESCRIBE DETAIL assumes it's been provided a table, and when given a view name it will simply report that there's no table by that name. This is a bit confusing, and it's easy enough for us to just check before throwing the error. ## How was this patch tested?
new unit test Author: Jose Torres #10854 is resolved by jose-torres/fixerrmsg2. GitOrigin-RevId: 08d6872bf5322afaa77e5df924ebd115a8be17d0 commit 5f14b4fccd1b01714f8fe3044b9ad2918b0a3592 Author: kian-db Date: Thu Jun 25 18:35:56 2020 +0000 [SC-23547] Added Delta Option to use MetadataLog on CONVERT TO DELTA ## What changes were proposed in this pull request? The use of the parquet MetadataLog during a CONVERT TO DELTA is now a user flag as opposed to the current use of the metadatalog by default. ## How was this patch tested? A unit test was written to ensure all data is read (not just streamed data in the metadatalog) when this flag is set to false. Author: kian-db #10802 is resolved by kian-db/convert_option. GitOrigin-RevId: 4f12d0488adc123bae353b9fcf8e59d3f2fd1add commit f6c9e970fbebe50259c539c4d0a6458ec4c48d1c Author: kian-db Date: Wed Jun 24 19:54:12 2020 +0000 [SC-23547] ConvertToDelta: read parquet file index when available Convert to delta command uses Parquet file sink's MetadataLogFileIndex to read files instead of recursively listing files via hdfs. Current test coverage was sufficient: https://github.com/databricks/runtime/blob/master/sql/core/src/test/scala/com/databricks/sql/transaction/tahoe/ConvertToDeltaSuiteBase.scala#L234 Author: kian-db GitOrigin-RevId: a79f7b185e546c76118ee851de27ea2ad10d513a commit 011c122f00f8e8772de57e06b7b3e8137e1f3701 Author: mahmoud mahdi Date: Tue Jun 23 04:38:03 2020 +0000 [DELTA-OSS-EXTERNAL] Update version from 0.6.1 to 0.7.0 in README.md and version.sbt The main goal of this Pull Request is to update the version of delta lake to 0.7.0 in the README file and version.sbt Closes delta-io/delta#457 Co-authored-by: mahmoud MEHDI Signed-off-by: Tathagata Das Author: mahmoud mahdi #10767 is resolved by tdas/p04zztzp. 
GitOrigin-RevId: db895dcf4d04136c43c5259ca056944307ed27c5 commit 50f80f1a813309db5f2275748ab7756c65278d48 Author: Michael Armbrust Date: Fri Jun 19 23:59:58 2020 +0000 [DELTA-OSS-EXTERNAL] Make Delta a bit more inclusive Removes some archaic terminology from the codebase in the interest of making Delta more inclusive. There is more work to be done as in some cases no alternative APIs exist in Spark, but this is a small step forward. Closes delta-io/delta#458 Signed-off-by: Burak Yavuz Author: Michael Armbrust #10728 is resolved by brkyvz/wimk4v1l. GitOrigin-RevId: 489d95dee28ba32d2ee831b6292ccc52453ff672 commit 4f061eea9b8eab1759641c12ec156df23fc74452 Author: Burak Yavuz Date: Fri Jun 19 21:15:57 2020 +0000 [SC-37255][DELTA] Get rid of an extra list from the isValid call ## What changes were proposed in this pull request? Remove the `isValid` check in DeltaLog creation as it is unnecessary. We already call `update()` while running any Delta query or operation. The result of this call can be equivalent to the same thing that `isValid` accomplishes, which is to invalidate the cached snapshot in the case that the underlying table has been recreated. This change would save us a couple listFrom calls to the underlying storage system, and save users both money and time. We should be okay after removing this change, because: 1. update() is called during analysis when running queries 2. update() is called right before an OptimisticTransaction and we also have a check to ensure that our commit version has not gone backwards after a transaction 3. Streaming queries always call `listFrom` from the latest version, and don't need isValid in the first place ## How was this patch tested? New unit tests Author: Burak Yavuz Author: Burak Yavuz #10603 is resolved by brkyvz/isValid. 
GitOrigin-RevId: f5955e43fb2d2d6d53d9b213c18e4fc7b4f1f71a commit 184ae4f1c4d0ae8b2173432048880333adad128d Author: Jacek Laskowski Date: Fri Jun 19 02:19:16 2020 +0000 [DELTA-OSS-EXTERNAL] [MINOR] Fix typos Follow-up to #413 Closes delta-io/delta#455 Signed-off-by: Burak Yavuz Author: Jacek Laskowski #10720 is resolved by brkyvz/s4v1b75w. GitOrigin-RevId: 29491b7c64f0ee773bdee5c6721a25b0ff9df4b9 commit 04f16206b1ae0eb4973491c0e296711a3cff2305 Author: Ali Afroozeh Date: Thu Jun 18 10:05:56 2020 +0000 Added a method to run a block of code with an active txn As the title says Author: Ali Afroozeh GitOrigin-RevId: 390f0ba30287117256d441b23ea2c0cf472c90fa commit 55d01c8bd60e3d4159ffcaca40f8681017a76f15 Author: Pranav Anand Date: Wed Jun 17 00:25:10 2020 +0000 [SC-37890] Generalize streamWrite method in DeltaCommand for shared use - Move `streamWrite` method to `DeltaCommand` so that it can be used elsewhere as a utility Author: Pranav Anand Author: Burak Yavuz #9806 is resolved by pranavanand/cloneDelta. GitOrigin-RevId: 68179f9f08b619ade564eb727cc0ee5ba1960a5e commit 87fecf37b68d44cf99a18cafc16a7092bb2a723a Author: Tathagata Das Date: Tue Jun 16 20:37:10 2020 +0000 [DELTA-OSS-EXTERNAL] Added SQL examples and integration tests for python/scala - Modified old tests to work with Spark 3.0 by adding the right spark conf - Added a new test to test SQL on metastore tables and paths - Updated integration testing script to run the examples in a fine-grained manner. Originally done by @rahulsmahadev in delta-io/delta#426 Closes delta-io/delta#451 Co-authored-by: Rahul Mahadev Signed-off-by: Tathagata Das Author: Tathagata Das #10659 is resolved by tdas/qfwc2lq2. GitOrigin-RevId: 4247f17fc78e741b01f5ee47cc8b96f776f4b250 commit a0db1e7b5ee519a8b8cefef87c9f6bac868851d5 Author: Tathagata Das Date: Tue Jun 16 18:36:55 2020 +0000 [SC-37900] [Delta] Updated Delta dependencies to Spark 3.0 final ## What changes were proposed in this pull request?
- as the title says ## How was this patch tested? all tests Author: Tathagata Das #10658 is resolved by tdas/SC-37900. GitOrigin-RevId: 8a8994dfd7979185294697bfdd65642a7a230653 commit d4407d558b6c993dbe1839c2ba7241aabcd5526e Author: Tathagata Das Date: Tue Jun 16 15:30:30 2020 +0000 [SC-38364][Delta] Added support for trigger once in DeltaSource - Added support with AdmissionControl trait new unit test Author: Tathagata Das GitOrigin-RevId: 4d96f3bc4c90a7aedf5594527da88c23f3954483 commit 83fd1a44f4ea54501543da3cd5bca688ec7a6d27 Author: Shixiong Zhu Date: Mon Jun 15 23:58:41 2020 +0000 [SC-37893] Add a usage log for Delta commit list inconsistency Add a usage log when we detect an inconsistent list. We need this usage log to set up an alert job for this. Jenkins Author: Shixiong Zhu GitOrigin-RevId: 7b2a8ed5b56eacff1de820e83714a23d8f057000 commit 956ffc765daa7f1edaddd022932239ae3295937d Author: Tathagata Das Date: Thu Jun 11 20:53:01 2020 +0000 [DELTA-OSS-EXTERNAL] Fixed doc generation For java docs, we copy the jquery library file generated in the Scala API docs and inject it into the Java API docs and using it later to show the "evolving" badges. scala docs changed. However, the jquery library was changed (probably due to scala 2.11 to scala 2.12 dependency change) from `jquery.js` to `jquery.min.js` causing the script to fail. Closes delta-io/delta#449 Signed-off-by: Tathagata Das Author: Tathagata Das Author: Tathagata Das #10535 is resolved by tdas/cxe28xsf. GitOrigin-RevId: ac5d1d349fd02f6a7032cbe5d3ea8bbc05a441a7 commit b1271e0b77171d28449552df4ba2bd8d53c11758 Author: Jose Torres Date: Thu Jun 11 01:12:44 2020 +0000 [SC-36916][DELTA] Added more error messages Error messages used for Copy Into Author: Jose Torres GitOrigin-RevId: ee319f71e27785ec09346857147d77902cbee7a8 commit 501e6c7d9666e6995785fe2418f0134361b55f8a Author: Jose Torres Date: Thu Jun 11 00:42:52 2020 +0000 [SC-36442][DELTA] Add version to checksum failure message. 
## What changes were proposed in this pull request? Add the version to the checksum failure message, so in the future we don't have to look at the logs to identify the right version to look at. ## How was this patch tested? n/a Author: Jose Torres #10518 is resolved by jose-torres/versionnum. GitOrigin-RevId: cbd7ac0be3b9b8dfbacf3237d14b22896271d790 commit c65c7f4a9d81fa383c36a6960b010e46cb791ff8 Author: Shixiong Zhu Date: Wed Jun 10 20:19:15 2020 +0000 [SC-37814][WARMFIX]Fix the location of the MergeIntoAccumulatorSuite.scala file Move the `MergeIntoAccumulatorSuite.scala` file to the correct location in OSS Delta. Jenkins GitOrigin-RevId: 2796dc66a55e20e95b43574ddc4e926a193d2b95 commit e618302abe5f3cc839c39e786af07d50f00d00ad Author: Tathagata Das Date: Wed Jun 10 19:04:38 2020 +0000 [SC-34725] [Delta] Added Python DDL tests ## What changes were proposed in this pull request? As the title says ## How was this patch tested? new tests Author: Tathagata Das #10352 is resolved by tdas/SC-34725. GitOrigin-RevId: d51a7cff7d720c281f71ddd8f48e67f76d28045a commit 6a0cd386a8d379a5f7310d419eb73ee2cfb98311 Author: Tathagata Das Date: Tue Jun 9 02:33:10 2020 +0000 [SC-37690] [Delta] Update OSS Spark to depend on Spark 3.0 RC3 ## What changes were proposed in this pull request? - Changed all references to the Spark 3.0 artifacts to new stage location which stages artifacts generated from https://github.com/apache/spark/commit/fa608b949b854d716904f4e43a4a10c71742b3c6 (the last commit before v3.0.0-rc3 tag) to ensure that we depend on Spark 3.0.1-SNAPSHOT - Changed pyspark binary to new staged locations. ## How was this patch tested? All existing tests Author: Tathagata Das #10451 is resolved by tdas/SC-37690. GitOrigin-RevId: 77ffd1511b83bcf785c91f67129e2932fc56d513 commit 4ed8aad19cc17cb38857b288b763175117fba995 Author: Rahul Mahadev Date: Mon Jun 8 20:44:17 2020 +0000 [SC-37289][DELTA] Add Feature flag for validation checks ## What changes were proposed in this pull request? 
Added feature flags for all the new Exceptions added during state reconstruction validation and commit validation. The idea here is to have a flag that reverts the behavior if the flag is turned off so that we don't break existing pipelines. ## How was this patch tested? - modified tests to check if feature flag works - await test report Author: Rahul Mahadev #10369 is resolved by rahulsmahadev/featureFlagValidation. GitOrigin-RevId: e3190be494e45a36c1717d71352649ebdf88a6b5 commit e365c2d3db3bf5bd9c70888fd704f43c98393837 Author: Subhash Burramsetty Date: Sat Jun 6 04:15:32 2020 +0000 [DELTA-OSS-EXTERNAL] Update binaries version in Readme to latest (0.6.1) Closes delta-io/delta#441 Signed-off-by: Shixiong Zhu Author: Subhash Burramsetty #10414 is resolved by zsxwing/bfapbyp3. GitOrigin-RevId: f4dd3a92f86a09ad9e21efdc8b347df5df3fa6c6 commit b7eee5b2fbe6c668329ca55025e0e8edb4529104 Author: Zach Schuermann Date: Sat Jun 6 01:06:09 2020 +0000 [DELTA] Allow embedding user-defined metadata in commits to Delta table ## What changes were proposed in this pull request? Add new API to store user-defined metadata within `CommitInfo`. This is accomplished via either: 1. The `"userMetadata"` option during `DataFrame.write` and `DataFrame.writeStream` operations. 2. `spark.conf.set("spark.databricks.delta.commitInfo.userMetadata", "...")` for non-DataFrame writes. Fetch this data via `describe history`. ### Example
```scala
df.write.format("delta") // or writeStream
  .option("userMetadata", "...")
  .save(savePath)
```
OR
```scala
spark.conf.set("spark.databricks.delta.commitInfo.userMetadata", "...")
spark.sql("INSERT INTO deltaTable ...")
```
**Note:** when both the config and option are set (to potentially different metadata), the option will take precedence over the config. The major change was including a new field in `CommitInfo`, `userMetadata`. Additionally, the `"userMetadata"` option is validated in `DeltaOptions` similar to `replaceWhere`.
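The precedence rule in the note (option wins over config) can be illustrated with a small pure-Scala sketch. It assumes nothing about how Delta implements the lookup: `UserMetadataPrecedence` and `resolve` are hypothetical names, and plain `Map`s stand in for the write options and the session conf. Only the two key strings are taken from the description above.

```scala
// Hypothetical helper illustrating "option takes precedence over config";
// this is not Delta's implementation.
object UserMetadataPrecedence {
  val OptionKey = "userMetadata"
  val ConfKey = "spark.databricks.delta.commitInfo.userMetadata"

  /** Picks the per-write option when present, otherwise falls back to the session conf. */
  def resolve(
      writeOptions: Map[String, String],
      sessionConf: Map[String, String]): Option[String] =
    writeOptions.get(OptionKey).orElse(sessionConf.get(ConfKey))

  def main(args: Array[String]): Unit = {
    val conf = Map(ConfKey -> "from-conf")
    println(resolve(Map(OptionKey -> "from-option"), conf)) // option wins
    println(resolve(Map.empty, conf))                       // falls back to conf
    println(resolve(Map.empty, Map.empty))                  // nothing set
  }
}
```

Using `Option.orElse` keeps the fallback order explicit in one expression, which is convenient when more sources of metadata are added later.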
## How was this patch tested? Unit tests writing metadata with each API, then retrieving history, checking `CommitInfo`'s new `userMetadata` field. At a high level:
```scala
assert(deltaTable.history().head.userMetadata == Some("..."))
```
Author: Zach Schuermann #10188 is resolved by schuermannator/add-user-metadata. GitOrigin-RevId: 82b78bd4fbe5ee74724fbd8e8a871279b80fd27b commit 4cc2c2c4943f3ef4b1f6933ee9baba3aedb2a127 Author: Shixiong Zhu Date: Fri Jun 5 16:30:12 2020 +0000 [SC-37021]Move the accumulator test for Merge to a separate suite Move the accumulator test for Merge to a separate suite so that we don't run it multiple times. Author: Shixiong Zhu GitOrigin-RevId: 3dd9c97ba0afc0d6a2e19ad32fc4c519d0a5cab1 commit 26aecae63214a8e6fff547adb06399241e3135c2 Author: Burak Yavuz Date: Wed Jun 3 23:42:23 2020 +0000 [SC-37249][DELTA] Fix an error message for Delta Merge command Author: Burak Yavuz GitOrigin-RevId: 56f09d7dea0e385f455dcb98de87b1ac24ee910d commit 68f921b1e2accbfa824c3da00cef4447935d7322 Author: Tathagata Das Date: Wed Jun 3 21:04:24 2020 +0000 [SC-36744][Delta] Added tests for vacuum operation on name-based tables ## What changes were proposed in this pull request? As the title says. As part of improving the test structure, I have moved around a few tests and named them as "basic case" so that it's obvious which tests verify the basic functionality with path and name-based tables. ## How was this patch tested? existing and new tests Author: Tathagata Das #10308 is resolved by tdas/SC-36744. GitOrigin-RevId: 2dc4db177eb9a071840607e4c01501281a99ec11 commit 35bb5a4f41dfb3bda842ed22ddf07e308f8026fa Author: Tathagata Das Date: Wed Jun 3 18:36:19 2020 +0000 [SC-36746][Delta] Added tests for generate operation on name-based tables ## What changes were proposed in this pull request? As the title says. As part of this, some unit tests were moved up and renamed as "basic case..."
to make it obvious which tests ensure that the test matrix of SQL/Scala and name/path-based tables is correctly covered. ## How was this patch tested? new unit tests Author: Tathagata Das #10261 is resolved by tdas/SC-36746. GitOrigin-RevId: 4f9f77374322fd474d96f2f201af0e49e73adf37 commit 21671177d4612f4343b9cf451002c9e10993c00a Author: Burak Yavuz Date: Wed Jun 3 17:49:24 2020 +0000 [SC-34720][DELTA] Add check for INSERT INTO an empty directory ## What changes were proposed in this pull request? DML operations like Update/Delete/Merge require a table to exist. So does INSERT INTO, when you use table names. However, with INSERT INTO, we allow an INSERT when the table directory contains a `_delta_log` directory, even if that directory is empty. This adds a check to ensure that we are writing into a table that exists, i.e. one that has at least one commit, which unifies the behavior of these DML operations. ## How was this patch tested? New unit tests Author: Burak Yavuz #9828 is resolved by brkyvz/stopDML. GitOrigin-RevId: 1416577c510d3fd22b92890299f5ea61785cad52 commit e91fc50f993961dcd5fc0cc47e6227a61c59707e Author: Tathagata Das Date: Wed Jun 3 16:37:49 2020 +0000 [SC-36743] [Delta] Added tests for history operation on name-based tables ## What changes were proposed in this pull request? As the title says. As part of improving the test structure, I have moved around a few tests and named them as "basic case" so that it's obvious which tests verify the basic functionality with path and name-based tables. ## How was this patch tested? New tests Author: Tathagata Das #10202 is resolved by tdas/SC-36743. GitOrigin-RevId: c17e8430daf9439492486277dc8f761edb01d700 commit fafe393d85a6fb329c6e7abd4a910e12d1d93d21 Author: Burak Yavuz Date: Mon Jun 1 18:36:40 2020 +0000 [SC-37064][DELTA][TEST] Use DeltaLog.forTable consistently in tests when possible ## What changes were proposed in this pull request?
Changes some very old tests to use DeltaLog.forTable instead of creating DeltaLog's at arbitrary directories. ## How was this patch tested? Simply changes some tests, so that they run under `_delta_log` directories. Author: Burak Yavuz #10291 is resolved by brkyvz/testRefactor. GitOrigin-RevId: c12fbe7a48a5b621b170404cf176d9fe92a7f534 commit c09cddc301d8695f2c18b523b6e9679ab01b883e Author: Burak Yavuz Date: Thu May 28 23:21:49 2020 +0000 [SC-36584][DELTA] Update the Delta table during analysis in DeltaTableV2 ## What changes were proposed in this pull request? We need to call update() when getting the Snapshot of the DeltaLog in DeltaTableV2. Otherwise we will use a stale snapshot when analyzing the query. We also unify the code path for createRelation, and leverage all the machinery we have in DeltaTableV2 to perform it. ## How was this patch tested? Unit test Author: Burak Yavuz Author: Burak Yavuz #10116 is resolved by brkyvz/databricksSeshCat. GitOrigin-RevId: b1a790904a5290495d5a331114615c85012f1dec commit 6e40bbe917435b7e4e13844fd6778ae285af187d Author: Tathagata Das Date: Tue May 26 23:17:03 2020 +0000 [SC-35384][Delta] Improve error messages in Scala API when Delta extension and catalog are not configured ## What changes were proposed in this pull request? Not configuring the SparkSession with the Delta extension and catalog can produce confusing error messages. In the Scala API (and therefore Python as well) we can intercept those messages and throw better error messages with instruction on configuring the session correctly. ## How was this patch tested? Added new unit tests. Author: Tathagata Das #10005 is resolved by tdas/SC-35384. GitOrigin-RevId: c16072444f913cc64c552146c6a8c17b876b72a2 commit a05c97e54a50118de3b2f8afdbed33887edb45c2 Author: Tathagata Das Date: Tue May 26 23:15:41 2020 +0000 [SC-36516] [Delta] Updated Delta to depend on Spark 3.0 RC2 ## What changes were proposed in this pull request? 
Change Delta OSS dependencies to the commit right before Spark 3.0 RC2 (in order to depend on -SNAPSHOT artifacts). Incidentally, that commit had the maven version as 3.0.1-SNAPSHOT (the result of the previous RC1), so the build, for now, depends on that version. ## How was this patch tested? existing unit tests Author: Tathagata Das #10106 is resolved by tdas/SC-36516. GitOrigin-RevId: 3ab55130baf6c56bf4b788c1456ec8e9d4fcac57 commit 4055b65d6693858c40328fa4f0e2a654a0e677dc Author: Rahul Mahadev Date: Tue May 26 21:29:04 2020 +0000 [SC-31603][DELTA] Added commit validation ## What changes were proposed in this pull request? Added a few checks during commit process - partition value should be the same across AddFile and Metadata - Metadata cannot be null for the first commit - there cannot be more than one metadata action for the first commit ## How was this patch tested? - Added tests in OptimisticTransactionSuite for the above checks Author: Rahul Mahadev #9742 is resolved by rahulsmahadev/commitValidation. GitOrigin-RevId: 2715eae4d4e8002f0475b6fb33071cfc25623de9 commit a01d1ae4c0ee4b6e29923d399c6bb7f30f8b8ed2 Author: Burak Yavuz Date: Fri May 22 18:04:03 2020 +0000 [SC-36522][DELTA] Minor changes for Spark 3.0.0 Author: Burak Yavuz GitOrigin-RevId: 24d80f294b2753e826a85e341f225786990555e2 commit bc9278b8f8c6b1ba9b3bbf1629778a616ad9aef4 Author: Burak Yavuz Date: Thu May 21 16:26:07 2020 +0000 [SC-32335][DELTA] Speed up Delta table resolution ## What changes were proposed in this pull request? When resolving Delta tables through the Delta Catalog, we were performing some redundant checks that add significant overhead. This PR removes these overheads when possible. ## How was this patch tested? Existing unit tests. In addition, benchmarks show that the time to perform analysis on TPC-DS queries has dropped drastically. Author: Burak Yavuz Author: Burak Yavuz #10082 is resolved by brkyvz/resolutionRegMaster.
GitOrigin-RevId: 8bf43f10159c5af064cf8b497333ee935d17f908 commit a50321c82dc1c19fac6c0e05212038238f11d10e Author: Tathagata Das Date: Wed May 20 17:20:15 2020 +0000 [SC-35564][Delta] Added check for the schema changes before executing merge With the revert of lazy generation of LogicalPlan in DeltaTable in fcbc369d3ebde002106c72b271e6e0669750750c, users are more likely to run into opaque errors caused by schema changes between the creation of the DeltaTable object and the merge operation. Hence this PR adds a check that catches such changes between the analyzed plan schema and the current DeltaLog schema. Updated tests Author: Tathagata Das GitOrigin-RevId: e2e7f76a25a7f5ed03555141a697baaac9c5bd5f commit c98549672c860816616fbe9ef30773ca79bf6da1 Author: Jose Torres Date: Tue May 19 17:24:57 2020 +0000 [SC-29001][DELTA] Extend non-contiguous version error message Extend the error message for non-contiguous versions to describe what might be going wrong. Author: Jose Torres GitOrigin-RevId: c6142cf2e284198695394dbf6854a7600c6b3f46 commit d69aa09432d48401252546cf4a38c16aa7e3c3c8 Author: Jose Torres Date: Tue May 19 13:49:40 2020 +0000 [ES-24686][DELTA] Fix path conversion in CONVERT TO DELTA. ## What changes were proposed in this pull request? We currently use `table.location.getPath` to determine the path of a catalog table. But this is a URI path, which is relative to the host, so it won't work for things like s3 buckets where the host is not root. We have to instead do `new Path(table.location).toString`. ## How was this patch tested? new unit test Author: Jose Torres #9973 is resolved by jose-torres/fixthething. GitOrigin-RevId: eec1cb75204b7c147f5e5a7f653c9ca50715aab3 commit 764ab94514c2e0298705cc58c6892c20bb81872e Author: Alan Jin Date: Tue May 19 03:27:30 2020 +0000 [DELTA-OSS-EXTERNAL] Fix integer overflow of numUpdatedRows `numUpdatedRows` can overflow an integer when updating a large table.
```
java.lang.NumberFormatException: For input string: "4029988707"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:583)
    at java.lang.Integer.parseInt(Integer.java:615)
    at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
    at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
    at org.apache.spark.sql.delta.DeltaOperations$Update.transformMetrics(DeltaOperations.scala:190)
    at org.apache.spark.sql.delta.files.SQLMetricsReporting$class.getMetricsForOperation(SQLMetricsReporting.scala:62)
    at org.apache.spark.sql.delta.OptimisticTransaction.getMetricsForOperation(OptimisticTransaction.scala:80)
```
Closes delta-io/delta#428 Signed-off-by: liwensun Author: Alan Jin #10064 is resolved by liwensun/4eq64o5z. GitOrigin-RevId: c4c77a7044d75896fc5ed8b766bd97ac87929d81 commit 8ddb49ab86c7417938fde0bfb74dedeb7e6ae59d Author: Jose Torres Date: Mon May 18 20:05:58 2020 +0000 [SC-35564][DELTA] Change back to non-lazy DF evaluation in scala API The lazy evaluation causes problems in the original repro for https://github.com/delta-io/delta/issues/419. change test to reflect desired behavior Author: Jose Torres GitOrigin-RevId: ccb52eda16dd2661b26dbe35a923798aec2fb7eb commit e5d2b645de302eae9ca000e3cc4a3bae674429a9 Author: QP Hou Date: Fri May 15 21:41:15 2020 +0000 [DELTA-OSS-EXTERNAL] add missing fields for metadata and txn action record Add missing fields for metadata and txn actions in delta spec. Closes delta-io/delta#422 Signed-off-by: liwensun Author: QP Hou #10000 is resolved by liwensun/pye8o0se. GitOrigin-RevId: f8634fa13e43d24bc908d17fd45e5eecb8326cb4 commit cb882ac5318fe1c9e063213b5679d103929d733c Author: Jose Torres Date: Fri May 15 14:38:24 2020 +0000 [SC-34321][DELTA] Add copy into exceptions Add copy into exceptions.
Author: Jose Torres GitOrigin-RevId: c618ebbd6ff64a3b1d9583304977b4d8233ea7fe commit fbd82476f2be04ebd379f5b1db19fb25958711d3 Author: Youngbin Kim Date: Thu May 14 21:33:30 2020 +0000 [SC-31676] Minor refactoring Minor refactoring in DeltaAnalysis, DeltaAlterTableTests and DeltaDDLSuite Author: Youngbin Kim GitOrigin-RevId: df260b50820760246acfef6697c81cb82f19e460 commit 8da5df0421ab159d74ae5d65d5d1d9c03dc48b74 Author: Jose Torres Date: Thu May 14 14:42:18 2020 +0000 [SC-35564][DELTA] Resolve against the correct plan in merge. Make Delta merge resolution resolve against `fakeSourcePlan` and `fakeTargetPlan`. In bb73ec8f014d4933a7290291b29a64fc4e449004, we changed it to resolve against `source` and `target`, since there was no obvious reason that it wouldn't work. But we found a case where it doesn't work. new unit test Author: Jose Torres GitOrigin-RevId: d7b60267529bf0964ce1c69590e408960311e694 commit 055309ce2484ed7834eda782b5b073fa281d8353 Author: Rahul Mahadev Date: Wed May 13 21:27:59 2020 +0000 [SC-34322][DELTA] Analysis and Execution validation ## What changes were proposed in this pull request? Test Suite that checks the behavior of multiple queries in parallel. It stops one query via an optimizer rule and resumes it after another query changes the schema. ## How was this patch tested? This patch adds only the test suite Author: Rahul Mahadev #9475 is resolved by rahulsmahadev/metadataCheck. GitOrigin-RevId: ec59ff9b3feb67715c5923b25a65950cec5b7ded commit 61962bb486ea5760a40c2a02c5bfb02e38d7aecb Author: QP Hou Date: Wed May 13 21:14:35 2020 +0000 [DELTA-OSS-EXTERNAL] add primitive type long in protocol design doc Closes delta-io/delta#413 Signed-off-by: Shixiong Zhu Author: QP Hou #9940 is resolved by zsxwing/v3174kwo.
GitOrigin-RevId: 60b77cb14d9b752887f57ae695a645ba9e8d7516 commit 3bf7ed307db8823842b46111dea6f3fa58442cd3 Author: Shixiong Zhu Date: Wed May 13 20:03:01 2020 +0000 [SC-35335]Don't call DeltaLog.isValid when the DeltaLog object is just created ## What changes were proposed in this pull request? When a DeltaLog object is not in the cache, we will create a new DeltaLog object. In this case, it should be valid and we don't need to call `isValid`. This will save one LIST request for a cold query. ## How was this patch tested? The updated tests show the saving of list requests. Author: Shixiong Zhu #9892 is resolved by zsxwing/SC-35335. GitOrigin-RevId: e1e21271783110ad4931b8a6743d3351100ab565 commit 1867b5c04a836b656acd53ff35cfba07ff716e7c Author: Shixiong Zhu Date: Wed May 13 18:39:29 2020 +0000 [SC-35329][FOLLOWUP]Add JavaDeltaSparkSessionExtensionSuite.java add `JavaDeltaSparkSessionExtensionSuite.java` back. Author: Shixiong Zhu GitOrigin-RevId: 401fa4772ffdb555f122341b1152cafc550b801c commit cb83e1e8cfbb81ad2f55a70bdbf4fabf6eaadff5 Author: Tathagata Das Date: Wed May 13 02:32:30 2020 +0000 [SC-35329] [Delta] Remove JavaDeltaSparkSessionExtensionSuite Remove JavaDeltaSparkSessionExtensionSuite Author: Tathagata Das GitOrigin-RevId: a8140d73295a1b17ab407cc677a33ec72d7850da commit 1e01dbbc82d2ea897aa204fcf4d39b4109bbbd7d Author: Rahul Mahadev Date: Tue May 12 07:35:13 2020 +0000 [SC-31599][DELTA] Add extra validation to State computation and fix failing tests ## What changes were proposed in this pull request? - Added extra validation to state reconstruction logic to not assume defaults for protocol and metadata - Added a check to ensure empty checkpoint files are not written ## How was this patch tested? - Around 101 tests were failing, fixed them by creating the DeltaLog in the right way Author: Rahul Mahadev #9402 is resolved by rahulsmahadev/stateReconstructionValidation. 
GitOrigin-RevId: 56492447d25b5ea343cf8a2d1873020ec8bdbe7d commit c728c6ec987a4fc14ae60c899211d83460df2374 Author: Tathagata Das Date: Tue May 12 00:11:27 2020 +0000 [SC-35330] [Delta] Update Delta OSS dependencies to use Spark 3.0.0-SNAPSHOT ## What changes were proposed in this pull request? Depending on Spark 3.0 RC1, which actually depends on the maven version 3.0.0, can pollute the local ivy2 cache, causing future errors when Spark 3.0 is actually released - SBT will not redownload the 3.0 artifacts and will keep using the locally cached RC1 artifacts (at least for a day) and fail to compile. Instead, the build now depends on the snapshot artifacts generated from the commit right before the v3.0.0-rc1 tag, so it practically depends on 3.0 RC1. ## How was this patch tested? Updated existing test infra to depend on new cached snapshot artifacts. Author: Tathagata Das #9888 is resolved by tdas/SC-35330. GitOrigin-RevId: ee84e62021e6523d4c03eeeefd5612826ff9e36e commit afd42c85f9a8f01865f712d815cbb8e6770bb52e Author: Rahul Mahadev Date: Mon May 11 23:30:32 2020 +0000 [SC-34647][DELTA] Fix DeltaHistoryManager getEarliestReproducable for multi part checkpoints ## What changes were proposed in this pull request? DeltaVersionManager `getEarliestReproducableCommit` was not handling the case with multi-part checkpoints accurately when the smallest delta version equal to the checkpoint version was not seen yet. Removed an unnecessary check which prevented the `lastCompleteCheckpoint` from getting set. ## How was this patch tested? Added a unit test. Author: Rahul Mahadev #9777 is resolved by rahulsmahadev/timeTravelBugFix. GitOrigin-RevId: b0b6c133c0f131034776ca924a8fd3380bd575f9 commit ed9dfe8eed1a9fc81ef54ea71de4ec551e7c9e35 Author: Tathagata Das Date: Sat May 9 05:59:10 2020 +0000 [SC-34624] [Delta] Enable SQL and table support for Delete ## What changes were proposed in this pull request? As the title says. ## How was this patch tested? New unit tests enabled.
Author: Tathagata Das #9769 is resolved by tdas/SC-34624. GitOrigin-RevId: 0f44d06b538685460f8e8ecdbc46dc86b560bee0 commit e3c385e037083a5268a5a9f6de1e09f4f2bb4502 Author: Wesley Hoffman Date: Sat May 9 02:14:02 2020 +0000 [DELTA-OSS-EXTERNAL] Fix commit duration statistic Closes delta-io/delta#411 Signed-off-by: Shixiong Zhu Author: Shixiong Zhu Author: Wesley Hoffman #9776 is resolved by zsxwing/p8cicfyb. GitOrigin-RevId: dac293258011fb6e88a7e107f1b5e6a2106aa5d4 commit 368a35fb43a4ee780f16b2b3fb10667f0d34aba9 Author: Burak Yavuz Date: Fri May 8 18:53:11 2020 +0000 [SC-29721][DELTA] Ensure schema changes on a Delta DataFrame throw good errors Delta has the ability to auto-update for queries. This auto-updating can sometimes be problematic, because a query can change the schema of the source table so drastically that it would be impossible to read the new data files with the schema fixed during analysis. This PR adds a nice error message for cases where this can happen. Unit tests in DeltaSuite Author: Burak Yavuz Author: Burak Yavuz GitOrigin-RevId: 04f2436188bd30b89c8f1cbb09fd483a9812feb9 commit 0b60d71dacc59f2b3f3cffa01725090698455070 Author: Burak Yavuz Date: Thu May 7 19:00:13 2020 +0000 [SC-34318][DELTA] Add test utility class in Python to configure Delta's extensions ## What changes were proposed in this pull request? Introduces a new utility class DeltaTestCase for Python that sets up a properly configured SparkSession that can be reused across python test suites. In addition, we add the API DeltaTable.forName to check that the suite is in fact being correctly configured. ## How was this patch tested? By the use of DeltaTable.forName to see that it works. Author: Burak Yavuz Author: Burak Yavuz #9801 is resolved by brkyvz/pythonDeltaDDL. 
GitOrigin-RevId: 378d155b368bf8db91d4f65b75e41eb92e95c868 commit 9a17a441f699bf8a8936828b27cfa9432ae80722 Author: Tathagata Das Date: Thu May 7 17:47:13 2020 +0000 [SC-34483][Delta] Enable SQL and table support for Update ## What changes were proposed in this pull request? As the title says. ## How was this patch tested? Newly enabled unit tests Author: Tathagata Das #9729 is resolved by tdas/SC-34483. GitOrigin-RevId: 73483e0a332bc98a8facba9eca759c3b22a30b68 commit 453b4255d8cb5cf6cd05c3a8dd37d6a0c80ed928 Author: Burak Yavuz Date: Wed May 6 01:24:15 2020 +0000 [SC-34645][DELTA] Test Delta DDL work with a Hive Catalog ## What changes were proposed in this pull request? Adds test utilities and additional traits for testing certain Delta DDL features by explicitly using a HiveClient and Hive catalog instead of Spark's inbuilt catalog to better replicate production usage. ## How was this patch tested? This PR adds tests Author: Burak Yavuz #9778 is resolved by brkyvz/hiveDDLTests. GitOrigin-RevId: 876031a32310bd5baba437b6afb391e94297d14b commit ee0719577fd38f61c5b9b1d55bcb577c43d4ee1b Author: Burak Yavuz Date: Tue May 5 01:34:49 2020 +0000 [SC-34299][DELTA] Support MetaStore tables with Convert to Delta ## What changes were proposed in this pull request? Now that we can support tables defined in the MetaStore with Delta, this PR adds support for CONVERT TO DELTA. ## How was this patch tested? Table identifier based tests Author: Burak Yavuz Author: Burak Yavuz #9700 is resolved by brkyvz/convertOSS. GitOrigin-RevId: 284b173164ad11e7a90841382c256b1568eba18b commit 06409f0d41121361f931a40fc789f971cb4fb994 Author: Burak Yavuz Date: Tue May 5 01:28:07 2020 +0000 [SC-34298][DELTA] Open source more DDL tests ## What changes were proposed in this pull request? Adds test suites such as DDLSuite, DDLUsingPathSuite and DeltaNotSupportedDDLSuite ## How was this patch tested? these are new test suites Author: Burak Yavuz Author: Burak Yavuz #9699 is resolved by brkyvz/moTests. 
GitOrigin-RevId: 087bd3d0948287138014d03d42be06f997feb298 commit ad6b4638e934f33985bf01340687d1b827adc9e4 Author: Tathagata Das Date: Tue May 5 00:47:48 2020 +0000 [SC-34475][Delta] Enable SQL and table support for Merge ## What changes were proposed in this pull request? As the title says. ## How was this patch tested? Newly enabled unit tests Author: Tathagata Das #9705 is resolved by tdas/SC-34323. GitOrigin-RevId: f2d0dc83b99b92992679ce0412f6453a85cf0976 commit 105fe9b91eff2eb41b56ba54bb6937e4d0cd6050 Author: Tathagata Das Date: Fri May 1 17:16:19 2020 +0000 [SC-29840][Delta] Added DeltaTable.forName() to Scala APIs ## What changes were proposed in this pull request? Since Delta on Spark 3.0 now supports table, we can add `DeltaTable.forName` in the Scala and Python APIs. ## How was this patch tested? New unit tests Author: Tathagata Das #9697 is resolved by tdas/SC-29840. GitOrigin-RevId: 251a96e9bd3be160d6797251b23a2009320e3374 commit a7b874c6b44cecd67147f2e2e03fa4ec16a2d1a0 Author: Burak Yavuz Date: Fri May 1 02:12:42 2020 +0000 [DELTA] Fix pyenv used in CircleCi ## What changes were proposed in this pull request? Fixes the CircleCI environment for Delta Author: Burak Yavuz #9701 is resolved by brkyvz/fixDeltaTest. GitOrigin-RevId: 1cb3574c4cc055d6cd6adcffb4a950aec201f389 commit 934131d105c72245be5a24d29c0fabe717985e9d Author: mahmoud mahdi Date: Fri May 1 00:43:49 2020 +0000 [DELTA-OSS-EXTERNAL] Refactor some tests in the delta code The main goal of this Pull Request is to make small changes in the tests by providing very minor changes which in most cases consists of : - removing unused imports - removing unused values - stop using a deprecated method and replace it by the adequate method - replacing var by val when possible - not using the new operator when instantiating a scala case class (like StructField or StructType) These changes were tested by running all the tests. 
Closes delta-io/delta#380 Co-authored-by: mahmoud MEHDI Signed-off-by: Shixiong Zhu Author: Shixiong Zhu Author: mahmoud mahdi #9675 is resolved by zsxwing/8i40er62. GitOrigin-RevId: a3e789fc65bc671dc13d9115980b525226d497d5 commit 9ee05c69d7ef16506df0c1528d65215df15d5ce6 Author: Rob Kelly Date: Fri May 1 00:35:01 2020 +0000 [DELTA-OSS-EXTERNAL] Update DeltaSourceSuite.scala fix typo in test description Closes delta-io/delta#404 Signed-off-by: Rahul Mahadev Author: Rob Kelly #9656 is resolved by rahulsmahadev/bncq1cbm. GitOrigin-RevId: 1bf7f3d4500e71a92aea82c10711ff93e1dcc887 commit 5cc383496b35905d3b7911a1f3418777156464c9 Author: Burak Yavuz Date: Thu Apr 30 22:30:40 2020 +0000 [SC-34233][DELTA] Open Source Delta DDL work ## What changes were proposed in this pull request? Now that Delta Lake is being built on Spark 3.0, we can support Delta table definitions in the Hive MetaStore. This is the first PR that adds support for DDL operations such as table creation and ALTER TABLE methods by leveraging the new TableCatalog API in Spark 3.0. For Delta table support, users need to set certain configurations as they create their SparkSessions, e.g. ```scala SparkSession.builder() // other configs .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") .config("spark.sql.catalog.spark_catalog", "com.databricks.sql.transaction.tahoe.catalog.DeltaCatalog") .getOrCreate() ``` Once the SparkSession is created as such, users will be able to register Delta tables in the HiveMetaStore _correctly_ with Spark. By _correctly_ we mean that Delta should create a transaction log at the table root directory, and the Hive MetaStore shouldn't contain any information other than the table format and location of the table. All table properties, schema and partitioning information will live in the transaction log to avoid a split brain situation. ## How was this patch tested? A plethora of test suites Author: Burak Yavuz #9631 is resolved by brkyvz/releaseDDL.
GitOrigin-RevId: 06e0c832004311fdf8fcffcec1317fe490f7b92e commit 1f9e5c8faa9a946e0082c5ae02182ea301a9760d Author: Shixiong Zhu Date: Tue May 5 08:26:13 2020 -0700 Revert "Fix pipenv used in tests (#408)" This reverts commit 9b9d0003d9286d35d7817bec94ec8bedb480eb81. commit 9b9d0003d9286d35d7817bec94ec8bedb480eb81 Author: Burak Yavuz Date: Thu Apr 30 18:11:12 2020 -0700 Fix pipenv used in tests (#408) * try to fix pipenv * this fixed it locally * fix there too * what about this * try this * how about * argh * c'mon * fix indent: commit 4c2769da8ab4ddffa609dd96be5a35cde534029b Author: mahmoud mahdi Date: Thu Apr 30 17:51:52 2020 +0000 [DELTA-OSS-EXTERNAL] Use exists instead of combining map with .getOrElse(false) in the DeltaOptions The main goal of this Pull Request is to enhance the scala code of the Delta Options by using the ```exists``` method on Scala Options instead of combining ```.map``` with ```.getOrElse(false)```. Closes delta-io/delta#393 Co-authored-by: mahmoud MEHDI Signed-off-by: Shixiong Zhu Author: mahmoud mahdi #9674 is resolved by zsxwing/gey3aqon. GitOrigin-RevId: 435a3745d7aaa58260fabcfb1d4c0f8247398d3a commit 63a6e5c12e2797812fb219901e49b214d0d1a570 Author: Pranav Anand Date: Thu Apr 30 16:04:50 2020 +0000 [SC-34083] Test DataFrameWriter.load with no path Test that a nice error is thrown when a path is not specified on DataFrameWriter.load` - Added a test for no paths being specified Author: Pranav Anand GitOrigin-RevId: c6214b8b887a77e763cb0f3b2620138a63493469 commit ddd65906d5c2185d2b9bff5145688a445749f970 Author: Pranav Anand Date: Thu Apr 30 11:57:36 2020 -0400 [SC-26033] Remove dependence on CalendarInterval in Delta code CalendarInterval was removed in Spark 3.0. This PR removes the references to our internal copy of it. - Existing tests Closes #9550 from pranavanand/pa-revert-calendarinterval. 
Authored-by: Pranav Anand Signed-off-by: Tathagata Das GitOrigin-RevId: 6476ddf978227b113a04b7e6e9a4fa3d32769803 commit 7cad5ed360e2a801900cd5764c0feddf10e8d4d3 Author: Alex Ott Date: Thu Apr 30 13:55:03 2020 +0000 [DELTA-OSS-EXTERNAL] Update README.md to list latest released version 6.0 was released, but README wasn't updated to reflect this fact. Closes delta-io/delta#399 Signed-off-by: Shixiong Zhu Author: Alex Ott #9673 is resolved by zsxwing/y461ufh6. GitOrigin-RevId: ef84ad85929a01fdfbf9f3357cb9514c8cb7ee2a commit bc537271bd05ef3204ea6144d121c094238f73f0 Author: Rahul Mahadev Date: Thu Apr 30 06:41:59 2020 +0000 [SC-33980][WARMFIX][DELTA] Fix incorrect metrics in Delete/Update and add more tests ## What changes were proposed in this pull request? - Fix incorrect way of capturing number of copied rows - copied rows were incorrectly computed based on scanned files. Changing this to use the computed write stats in DeleteCommand. For UpdateCommand we now use the udf in the right place. - Fix incorrect way of capturing number of removed files ## How was this patch tested? - added more tests - changed existing tests Author: Rahul Mahadev #9576 is resolved by rahulsmahadev/rowLevelHistoryFix. GitOrigin-RevId: 83b991e4952a263549e1de2733003885c89737a9 commit 2e15921e49af146e2daba250ff75df3a0f73e7ed Author: Tathagata Das Date: Wed Apr 29 18:40:18 2020 +0000 [DELTA-OSS-EXTERNAL] Integrate scala examples with the integration tests Update `run-integration-tests.py` to run the scala tests - Update scala example build file to take the version from the env variable DELTA_VERSION - Update run-integration-tests.py to call all scala examples with the given version injected via env variable Closes delta-io/delta#396 Signed-off-by: Tathagata Das Author: Tathagata Das #9637 is resolved by tdas/jx52ioqe.
GitOrigin-RevId: 20ea8fa957e4e4c8e603ef02d72db4832cdcb321 commit d12655b55c214c35284126eb96bbf4547125fd1a Author: Jose Torres Date: Tue Apr 28 20:20:31 2020 -0700 [SC-33974][DELTA] Remove messaging around Table ACL from error messages Removes dead code from the error messages. new unit test suite Closes #9545 from jose-torres/fixmergecondit2. Lead-authored-by: Jose Torres Co-authored-by: Jose Torres <30604243+jose-torres@users.noreply.github.com> Signed-off-by: Jose Torres GitOrigin-RevId: a571d1d880f035ed5ad85aa45287b2b4093ed2d9 commit f08d9eeccdc112f73ff68b0692137709c890bab2 Author: Burak Yavuz Date: Wed Apr 29 00:29:52 2020 +0000 [SC-31672][DELTA] Minor refactoring in Snapshot ## What changes were proposed in this pull request? Did a very minor change in Snapshot. ## How was this patch tested? None necessary Author: Burak Yavuz Author: Burak Yavuz #9436 is resolved by brkyvz/checkpointCache. GitOrigin-RevId: 1af63cee243c493cc4fa2a96a5d163aff95185c4 commit a4dc67188fd50db20d7e98c5361bb05370669f64 Author: Tathagata Das Date: Tue Apr 28 07:06:48 2020 +0000 [SC-33906] Upgrade Delta OSS to Spark 3.0 ## What changes were proposed in this pull request? The following changes were made - Updated build.sbt to depend on Spark 3.0.0-preview2 and only scala 2.12 - Updated API changes related to annotations and CalendarInterval. ## How was this patch tested? Existing unit tests + 5 unit tests had to be modified. Author: Tathagata Das #9499 is resolved by tdas/SC-33906. 
GitOrigin-RevId: 9224232acbd3794d9148cbc1cdd56cf3847bba7d commit 64c173c5d4d401892eefc2a2daa1d447c222ec29 Author: ericfchang Date: Fri Apr 24 20:10:30 2020 +0000 [SC-12873] Whitelist example code for flake8 Ignores some lines for the flake8 linter Author: ericfchang GitOrigin-RevId: cec0e91ddd0afa31b7280cc04d0ad753cbb39476 commit cdba0b9356903cdbce70df84ec65b1700ce8fc21 Author: Yin Huai Date: Thu Apr 23 16:03:08 2020 -0700 Fix delta/python/run-tests.py GitOrigin-RevId: c4baf5f8057838a9732709cea5cf1113ce2d6489 commit 8b89417adbd7752cf390917b541a338ca4a14294 Author: herman Date: Thu Apr 23 19:17:16 2020 +0000 [SPARK-31450][SQL] Make ExpressionEncoder usage thread-safe ExpressionEncoders are not thread-safe. We had various (nasty) bugs because of this. This PR makes the `ExpressionEncoder` usage of Delta safer and (more) reusable. The function objects are not thread safe, however they are documented as such and should be used in a more limited scope (making it easier to reason about thread safety). Existing tests. Author: herman GitOrigin-RevId: ac79e712f081fba8c828d4343d1458db425fe9ed commit 89319caf01fcf28862b1d37c71849908dba62c45 Author: Pranav Anand Date: Wed Apr 22 20:07:26 2020 +0000 [SC-32934] Remove quiet logging to help debug flakiness in test Removed the testQuietly label of a test so that we can see logs when this test fails again to help debugging Author: Pranav Anand GitOrigin-RevId: c69f82f987a8a0512dbe7c872e2498c16e88aaeb commit 1f3bbd4418710efedaffb4c0b779166ba2c69c85 Author: Burak Yavuz Date: Wed Apr 22 09:47:24 2020 -0700 [DELTA] Expose utility method Exposes the utility method canonicalizePath. Closes #9340 from brkyvz/expose. 
Authored-by: Burak Yavuz Signed-off-by: Burak Yavuz GitOrigin-RevId: ce7d46caaabb9d6dacccf77e860b22fb57b803d5 commit e0f954137e984143a61a34aca10e7e860657eeac Author: Tathagata Das Date: Wed Apr 22 16:59:13 2020 +0000 Setting version to 0.6.1-SNAPSHOT commit 87e46d33257595ec47b4e3ef419a951b117d9aba Author: Tathagata Das Date: Wed Apr 22 16:57:42 2020 +0000 Setting version to 0.6.0 commit ec721b87ffb64e7f40a654bc2808a7fb9297748d Author: KevinKarlBob Date: Wed Apr 22 01:19:05 2020 -0400 [DELTA-OSS-EXTERNAL] add supporting remove partitions if delta was partitioned This is a PR for the issue with removing partitions in a Delta table. It was discussed here https://delta-users.slack.com/archives/CJ70UCSHM/p1587048581235000 and I implemented the proposed solution https://delta-users.slack.com/archives/CJ70UCSHM/p1587068793244400 Closes delta-io/delta#390 Co-authored-by: hleb.lizunkou Signed-off-by: Burak Yavuz GitOrigin-RevId: eee5573959bc9827f1d381133cde45685f8dbee4 commit bb73ec8f014d4933a7290291b29a64fc4e449004 Author: Jose Torres Date: Tue Apr 21 05:13:24 2020 +0000 [SC-33262][SS] Don't preserve the original dataframe in the Delta scala API. ## What changes were proposed in this pull request? In the scala API, when we create a DeltaTable object, we reuse the same dataframe forever. This isn't correct, and can lead to errors if the delta table's schema has changed between DeltaTable initialization and the time of executing an action such as merge. We should instead recreate it from the current state of the delta table when executing actions. ## How was this patch tested? new unit test Author: Jose Torres #9382 is resolved by jose-torres/fixmerge.
GitOrigin-RevId: 70c28441197328ad89398c936d0efd95bfa87fd9 commit 40152ece9dff3139b01bfeaac59aa156274b473e Author: Shixiong Zhu Date: Fri Apr 17 14:25:10 2020 -0700 [SC-31106] Add DatasetRefCache to cache Dataset reference when the active session is not changed (master) This PR adds DatasetRefCache to cache Dataset reference when the active session is not changed to avoid the overhead of Dataset creation. DatasetRefCache will cache the Dataset reference and automatically create a new one when the active session changes. Jenkins Closes #9403 from zsxwing/SC-31106-master. Authored-by: Shixiong Zhu Signed-off-by: Shixiong Zhu GitOrigin-RevId: 5d73ae000f0f4740b232685134d85cf5d74e17a6 commit eab82b20fdf191a79c01cefbbf6748133068d260 Author: Burak Yavuz Date: Wed Apr 15 16:38:20 2020 +0000 [SC-31675][DELTA] Fix time travel with timestamp expressions The Analyzer resolves timestamp expressions, by using the session local time-zone during Analysis. Sometimes we were prematurely resolving TimeTravel nodes, which caused ```scala com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:347) ``` This PR adds a resolution check on timestamp expressions. The main bug existed for tables that were accessed through paths, e.g. ```delta.`/some/path` ```. Queries that accessed tables directly through names worked fine. Author: Burak Yavuz Author: Burak Yavuz GitOrigin-RevId: b311691fc6202f3fa1ed4a3e4b8e4197a28198f5 commit 32237829451cb5a7e015560a283bebb5f8b6193c Author: Tathagata Das Date: Wed Apr 15 09:12:10 2020 +0000 [SC-32755] Remove DeltaLogging from DeltaMergeBuilder ## What changes were proposed in this pull request? DeltaLogging is adding unnecessary methods in the API docs of DeltaMergeBuilder. Author: Tathagata Das #9345 is resolved by tdas/SC-32755. 
GitOrigin-RevId: 15314baff52247bb8e71f0ef600176e186428be2 commit 5dd28424008838e150e6137f6f833b06b3c531d6 Author: Shixiong Zhu Date: Wed Apr 15 00:20:03 2020 +0000 [SC-30821] Fix test flakiness After looking at failures, I found that the root cause is `enableExpiredLogCleanup` is not set to `false` in some tests and the automatic log cleanup gets triggered. This flag is not flipped because these tests don't have any `Metadata` in the commit logs. `spark.databricks.delta.properties.defaults.enableExpiredLogCleanup` is picked up only if there is a `Metadata` action committed. This PR adds `startTxnWithManualLogCleanup` and uses it to do the first commit to make sure we disable `enableExpiredLogCleanup` correctly. Author: Shixiong Zhu GitOrigin-RevId: d136dd842b29e165bfba8dfe079094d00e8cb5db commit 482c0f75a142c817111fc8b6f41b674707e83aaa Author: Burak Yavuz Date: Tue Apr 14 20:27:41 2020 +0000 [SC-31282][DELTA] Refactor the composition of Snapshot and Checksum Validation ## What changes were proposed in this pull request? This PR reorganizes some of the information passed into Snapshot and MetadataGetter to reduce the complexity of the implementations. Checksum validation is something that the Snapshot performs instead of the MetadataGetter now. We also get rid of the MetadataGetter interface, as it doesn't really add any value. ## How was this patch tested? Existing unit tests Author: Burak Yavuz Author: Burak Yavuz #9061 is resolved by brkyvz/refValidation. GitOrigin-RevId: 437d0ffbcd62c6253422a902d5a6c6e886fbf147 commit 38fa6dc8daaa03cbab86eb9df62ecc8a12626741 Author: Rahul Mahadev Date: Mon Apr 13 14:58:18 2020 -0700 [SC-31727][DELTA] Minor improvement to repartition tests Avoid flakiness by cleaning up scope of view in test Closes #8890 from rahulsmahadev/mergeLessFiles. 
Authored-by: Rahul Mahadev Signed-off-by: Tathagata Das GitOrigin-RevId: 740e7685d0570af007fae325e72f9cac0247740a commit 6804c8bf7f006bca193be132bbb25682391fd0e7 Author: Rahul Mahadev Date: Fri Apr 10 20:43:19 2020 +0000 [SC-30759][DELTA] Fix bug in operation metrics where redundant metrics don't get filtered out ## What changes were proposed in this pull request? Fixed a bug which caused the stats from `BasicFileStatsTracker` to not be filtered out ## How was this patch tested? Added a unit test which basically tests if the captured metrics are the ones defined in the schema. Author: Rahul Mahadev #8847 is resolved by rahulsmahadev/historyFixExtra. GitOrigin-RevId: 35719c253a035445dbbf2376ddf210e55e5f0adc commit 89bec0bbb8838f929b583d0f1aaca402c1d4e349 Author: Burak Yavuz Date: Fri Apr 10 19:15:28 2020 +0000 [SC-31514][DELTA] Refactor to simplify snapshot computation This PR removes the re-use of the previous Snapshot's state for the next Snapshot's computation. This logic adds some complexity around having to keep track of RDD lineage to avoid stack overflows, and requiring us to have 2 different Snapshot computation methods when we're trying to load a given version of the table versus just the latest version. Instead we will now perform file listing from the latest known checkpoint at all times and give the full set of files required to build a Snapshot. The only advantage of reusing the state from the previous snapshot was that it would avoid a cost of hitting the storage system for the existing data. We ran some benchmarks to show that this cost is minimal and should be dwarfed by the time it takes to actually perform ETL. 
- Write a very small commit to a large table (500,000 actions in checkpoint, 106 MB in size) + 8 delta files (100 actions per commit) (This should be a pathological case, because we need to hit the storage system every commit) Old code path: 1.7 seconds New code path: 8.1 seconds - How long it takes to run a rateStream with 10 rows per second to a Delta table (effects on latency) for 100 batches: Old code path: 5.3 minutes New code path: 4.6 minutes ^^ I honestly don't know how it got better. I would've expected it to get slower above. Could be variability in Cloud instance performance These numbers should be dwarfed by the time to actually write the full data, therefore it seems like a worthy compromise. Author: Burak Yavuz GitOrigin-RevId: afe702686df982766794f498969761c320c42e42 commit 8433a53066ea09131d120c1a5c0155d2d9c79c1f Author: Tathagata Das Date: Thu Apr 9 11:48:45 2020 -0700 [SC-31462] Added metadata checks and extra logging - Metadata checks for detecting table id changes, these checks will generate logs - Extra log4j logs to help debug metadata issues Closes #9221 from tdas/SC-31462. Authored-by: Tathagata Das Signed-off-by: Shixiong Zhu GitOrigin-RevId: c6e63057acb5c712ebefdb28d88df50bdbd875da commit 9477adbbd5c89c19d371de594bbd1610ce148c34 Author: Jose Torres Date: Thu Apr 9 06:33:19 2020 +0000 [SC-29000][DELTA] Implement MERGE INTO schema evolution Implement MERGE INTO schema evolution new unit tests Author: Jose Torres GitOrigin-RevId: 633dbcdd90143c8551982d97e48fffbca1dbb073 commit 847dddfb3352e32649ed9d2f42666cd792be8be4 Author: Burak Yavuz Date: Thu Apr 9 00:17:26 2020 +0000 [SC-27858][DELTA] Added a new type of error message Added a new type of message to throw on errors related to invalid options. 
Author: Burak Yavuz GitOrigin-RevId: d4133f79d0a52636d47665c49f6105ee8c00d8cd commit 6d9420539e9b00f8df8d5bc80ce6aa7a5a9f3b7d Author: Rahul Mahadev Date: Tue Apr 7 02:28:34 2020 +0000 [SC-30496][DELTA]Add option to repartition by partition column in merge to reduce the number of files ## What changes were proposed in this pull request? Added a DeltaSQLConf to allow repartition the merge output dataframe by partition columns to reduce the number of files. closes https://github.com/delta-io/delta/pull/367 ## How was this patch tested? Added unit tests in MergeIntoCommandSuiteBase Author: Rahul Mahadev #8890 is resolved by rahulsmahadev/mergeLessFiles. GitOrigin-RevId: 58ba990cb0bf05a93d78d18e0285b795e6014a24 commit b8f474d2f90498b996c0bd430c72526466174153 Author: Shixiong Zhu Date: Mon Apr 6 19:21:51 2020 +0000 [SC-30925]Add tests to ensure checkpoint uses `isPartialWriteVisible` to decide whether to use rename Add tests to ensure checkpoint uses `isPartialWriteVisible` to decide whether to use rename. The new unit test. Author: Shixiong Zhu GitOrigin-RevId: b72ebec9695aaabbc156ea27fe7d4077ce4d0fe1 commit 3f82a815d7e9b2db6a3633ef991deedcf3cb716c Author: Tathagata Das Date: Sat Apr 4 04:14:05 2020 +0000 [SC-31254][HOTFIX] Minor refactoring in OptimisticTransactionImpl Minor refactoring Author: Tathagata Das GitOrigin-RevId: 91f0ae9c7e82a153f812e0e9c94b0dbf7532e1df commit 1bd275a8d49fe2113c4845dcd9cee668f90797c0 Author: Burak Yavuz Date: Thu Apr 2 02:51:29 2020 +0000 [SC-31128][DELTA][WARMFIX] Do not return deltas older than checkpoint in getFilesForUpdate ## What changes were proposed in this pull request? Fixes a bug where delta files older than the latest checkpoint is returned as part of `getFilesForUpdate`. ## How was this patch tested? Regression test Author: Burak Yavuz #9060 is resolved by brkyvz/fixDeltas. 
GitOrigin-RevId: e4823726dc9c4b78c1672cbdea16b2b1988df017 commit 0cd59a858bfa6c9e3153a414a2ce137a7f0eba07 Author: gatorsmile Date: Wed Apr 1 11:13:45 2020 -0700 Minor import changes to GenerateSymlinkManifest.scala Minor refactoring in the imports for GenerateSymlinkManifest.scala GitOrigin-RevId: ef036ca9eae859797d828a604818b9ca24d966f2 commit 1984fa9a10413ec4b8ceeb302d4b287e3b856c9c Author: Jose Torres Date: Mon Mar 30 22:48:58 2020 +0000 [SC-30810][DELTA] Allow time travel to checkpoints with no previous deltas ## What changes were proposed in this pull request? Currently, a checkpoint must have delta files listed before it in the directory to be considered recoverable. This has a fencepost error: If there's only one recoverable checkpoint, and all deltas before it have been aged out, we'll report that the checkpoint isn't recoverable even if all deltas from that checkpoint onwards are available. This PR fixes the issue by advancing the check one iteration forwards. ## How was this patch tested? new unit test Author: Jose Torres #8868 is resolved by jose-torres/fixthebug2. GitOrigin-RevId: 044b81fd9973b1d3c90d87cfb7891b994c0655f7 commit b63bfa9d6d4063f6d192d280c6a731d664f59047 Author: Michael Armbrust Date: Thu Mar 26 23:57:42 2020 +0000 [DELTA-OSS-EXTERNAL] Update docs with LF processes This PR updates our docs to discuss the new process for accepting commits under the Linux foundation. Closes delta-io/delta#337 Co-authored-by: Tathagata Das Signed-off-by: Tathagata Das Author: Michael Armbrust #8924 is resolved by tdas/2zm6hc96. GitOrigin-RevId: a5f125b77c6d7d4cf790de4163e04e1bed93023c commit bb4aa67377e0369be93127a03a9a97abb173593f Author: Burak Yavuz Date: Sat Mar 21 19:18:05 2020 +0000 [SC-29465][DELTA] Fix Convert To Delta for tables in the HiveMetaStore Fixes the support for CONVERT TO DELTA on tables stored in the HiveMetaStore. This feature will be open sourced after Spark 3.0 is released. 
To ensure that the feature works, I had to refactor some of the test suites, so that the table based tests can run using Hive test utilities as well. Unit tests Author: Burak Yavuz Author: Burak Yavuz Author: Burak Yavuz Author: Burak Yavuz #8701 is resolved by brkyvz/hiveConvert. GitOrigin-RevId: fe3057f52c2c85053a69585e7fd7e26b2c240642 commit 4c2e08d26c26d641527e2cd1490c1d8b2687f2a6 Author: Rahul Mahadev Date: Fri Mar 20 21:38:55 2020 +0000 [SC-30486][Delta] Use File System for log reads instead of FileContext of Default HDFSLogStore ## What changes were proposed in this pull request? * Use FileSystem APIs for log reads instead of FileContext APIs. This was achieved by making the `HDFSLogStore` extend the `HadoopFileSystemLogStore` * Throw a better error message for writes closes https://github.com/delta-io/delta/pull/358 ## How was this patch tested? Added a couple of unit tests in the `HDFSLogStoreSuite` and an end-to-end test which reads a dataframe from a filesystem that doesn't implement AbstractFileSystem Author: Rahul Mahadev #8733 is resolved by rahulsmahadev/fileSystemForLogReads. GitOrigin-RevId: f5fdf83f63db1d81f0adc481c5e72ced520f1525 commit a43c8921dba15222d09188d7133416438028861c Author: Tathagata Das Date: Fri Mar 20 17:07:24 2020 +0000 [SC-30048][Delta]Changed copyright in license headers based on linux foundation guidelines ## What changes were proposed in this pull request? Based on Linux Foundation guidelines, we are changing the Copyright in the license headers from "Databricks, Inc" to "The Delta Lake Project Authors" ## How was this patch tested? Checked that no files have "Databricks, Inc" in them. Author: Tathagata Das #8792 is resolved by tdas/SC-30048. GitOrigin-RevId: dc1cfe30c1780a0c3202bcd114fcc2c747cb72c3 commit 5726a3b50f40416caca3cb8d2927e57bcbd22220 Author: Burak Yavuz Date: Thu Mar 19 00:02:28 2020 +0000 [SC-29737][DELTA] Fix tableId's for Delta tables for all tests ## What changes were proposed in this pull request?
We fix the tableIds in all tests to break the assumption that all tableId's must be unique. Users may copy their tables to new locations resulting in duplicate tableIds. None of the logic within Delta code should assume that tables have unique ids. ## How was this patch tested? A test that required a new table Id in DeltaSourceSuite actually failed, so this change actually works. Author: Burak Yavuz Author: Burak Yavuz #8749 is resolved by brkyvz/tableId. GitOrigin-RevId: 0ef93c1cb18829db315dec7746a7ed6a352e4d3f commit 7f41022521596bca6c4bdd8bdc01ac49ee15d608 Author: Shixiong Zhu Date: Wed Apr 1 11:30:09 2020 -0700 Update README.md (#34) - Move where to download JARs first, before the compilation instructions. - Make it clear that this connector does not provide the support for defining Hive Metastore tables in Apache Spark. - Mention that the connector doesn't support Spark execution engine in Hive. commit 0e8ebc4d2ab799e2526103d6a864fc89de010186 Author: Shixiong Zhu Date: Wed Apr 1 08:27:26 2020 -0700 update README commit 246765c015b8ff5452b0e985d2f36579c6b6607b Author: Shixiong Zhu Date: Wed Apr 1 08:25:12 2020 -0700 update README commit 58768fa2f3d25a3dc90beeb1565c17f8955e1827 Author: Shixiong Zhu Date: Wed Apr 1 08:08:46 2020 -0700 Changed copyright in license headers based on linux foundation guidelines commit 16d3b5158a3c3879264da86775b375a57ae909d2 Author: Shixiong Zhu Date: Wed Apr 1 07:29:29 2020 -0700 update README commit 9bf77cbffdf57505c69c80ddc467b86a1ab8e399 Author: Shixiong Zhu Date: Wed Apr 1 07:07:28 2020 -0700 set version to 0.1.0 commit fd9ef41e954d142e1728b23906d4843c8f50a94b Author: Shixiong Zhu Date: Wed Apr 1 06:47:54 2020 -0700 The query should fail when a path is deleted after a table is created commit 965f3fd93ad1e1426b4c8fc8d13348840331e8c5 Author: Shixiong Zhu Date: Wed Apr 1 04:24:36 2020 -0700 Save path in SerdeInfo's parameters (spark's data source table reads from it) commit 3645d9eaa101b521ae18a62727b3612993710806 Author: Shixiong 
Zhu Date: Wed Apr 1 06:48:44 2020 -0700 update FAQ commit cb2bbf8df3d55f62aab7c55195cd36de2487580c Author: Shixiong Zhu Date: Mon Mar 30 10:49:27 2020 -0700 Log Delta operation duration and the table stats (#32) Log Delta operation duration and the table stats so that we have a better understanding on how long it takes to load Delta metadata. commit aeff6c8df40beb02950df12055585b5dce77073f Author: Shixiong Zhu Date: Wed Mar 25 09:41:00 2020 -0700 Get table schema from table properties directly (#31) so that we don't need to store the property `delta.table.schema` in Hive Metastore. commit 1f8b4700b8a6591e65c876bf739947069badf016 Author: Shixiong Zhu Date: Tue Mar 24 22:10:12 2020 -0700 Fail incorrect format and throw an error rather then returning NULL values (#30) When a user doesn't set the input format to `io.delta.hive.HiveInputFormat`, we will return NULL values right now. This is pretty bad as the user may not notice it. It's better to throw an error when the input format is not set. commit a92d3722c6738e70789421917a5e18bd54ab71fe Author: Shixiong Zhu Date: Tue Mar 24 00:55:14 2020 -0700 Run unit tests for Tez (#29) - Move the unit tests for Hive MR to HiveMR project - Add a new project HiveTez to run unit tests in Tez mode. - Document that `hive.tez.input.format` should also be set to `io.delta.hive.HiveInputFormat`. - Fix the dependency issue for Tez - Remove dependencies used by Spark UI because Tez has its own UI and its dependencies conflict with Spark's. - Exclude `org.xerial.snappy` and don't shade it. Tez uses a different version and doesn't work with the Spark one. Closes #21 commit dc20a2739c19744e91d9047228af584c7ce73993 Author: ekoifman Date: Wed Mar 18 18:17:46 2020 +0000 [DELTA-OSS-EXTERNAL] Optimize merge command when there is no whenNotMatched clause #342 When Merge command doesn't have a whenNotMatched clause, the Full Outer Join in MergeIntoCommand.writeAllChanges can be changed to Right Outer Join. 
Since left/source side is usually small, this can enable Broadcast join - closes #342 Closes delta-io/delta#343 Author: Tathagata Das Author: ekoifman GitOrigin-RevId: 0bf21ea6eaa30df303a9b1ac926833c38f294e0e commit 71606f47d5bc8d06b8acefe1f3a649bc44d38c10 Author: Wesley Hoffman Date: Wed Mar 18 06:42:42 2020 +0000 [DELTA-OSS-EXTERNAL] Add python linting to circleci build Shamelessly stolen from https://github.com/apache/spark/blob/master/dev/lint-python Closes delta-io/delta#214 Author: Tathagata Das Author: Wesley Hoffman GitOrigin-RevId: 95f460ba0a79f2271983cf55fcbb2d07da13c779 commit d0f23c60b61b7e78b08e2d748bd3c535e73784a2 Author: mahmoud mahdi Date: Fri Mar 13 18:33:18 2020 +0000 [DELTA-OSS-EXTERNAL] Made some enhancements on the Scala Options' code The main goal of this Pull Request is to make some code enhancements on the different Options used in this project. Some pattern matching snippets were unnecessary since we could have treated options as collections. for example : - **foreach** can replace the following code: ``` option match { case None => {} case Some(x) => foo(x) } ``` - **forall** can replace the following code: ``` option match { case None => true case Some(x) => foo(x) } ``` - **exists** can replace the following code: ``` option match { case None => false case Some(x) => foo(x) } ``` I also removed some unnecessary pattern matching conditions which could have been anticipated using a ```getOrElse``` on the Scala Option. These modifications were tested by running the existant tests. Closes delta-io/delta#335 Author: mahmoud mahdi #8661 is resolved by tdas/rvc8m6aq. GitOrigin-RevId: 3cce52edf222df0a2f67813b8bd39b76f974c8fd commit 68580b66aa3f14df4a03c94bffcec3bad5066806 Author: mahmoud mahdi Date: Fri Mar 13 01:36:21 2020 +0000 [DELTA-OSS-EXTERNAL] Fix a typing mistake in the WriteIntoDelta's javadoc While digging into delta's code, I found a typing mistake in the javadoc of ```WriteIntoDelta``` which I fixed in this Pull Request. 
Closes delta-io/delta#334 Author: mahmoud mahdi #8656 is resolved by tdas/rkqow8pb. GitOrigin-RevId: 526bcf5c0b1238b7313161ad27cf41d4061d50c1 commit 79d75454b92547aacb7a5fa57ea16e70b94d735b Author: Burak Yavuz Date: Wed Mar 11 20:10:55 2020 +0000 [SC-30229][DELTA] Prevent partition columns from having invalid characters ## What changes were proposed in this pull request? This PR adds the same invalid character checks made on data columns to partition columns as well. Having different rules for different physical types of columns violates data independence. Since there is a possibility that this may break existing workloads, we add a flag to disable the check. ## How was this patch tested? Existing unit tests were retrofitted to perform this check Author: Burak Yavuz #8599 is resolved by brkyvz/partColCheck. GitOrigin-RevId: 4233f36e862056982c6245981567a7965033182c commit 6bdeb0c5c59ab4f03dafbd23b227a113ae588eb5 Author: Pranav Anand Date: Fri Mar 6 21:12:07 2020 +0000 [SC-29537] Ensure that docs links added correctly are tested ## What changes were proposed in this pull request? - The issue with the existing tests which caused the ticket to be filed was that though the docs link was added correctly (not hard coded, used the generateDocsLink method), it wasn't explicitly added to the tests therefore was not tested - This PR aims to make that less likely to happen i.e. if the link was added correctly, it will be tested ## How was this patch tested? - Changed existing test to check all links Author: Pranav Anand #8217 is resolved by pranavanand/pa-fix-docs-mergelink. GitOrigin-RevId: f0af7308ddd2214455f995fed0e505350b86a38a commit 85bdeddb68363ab0da28765541aa47b4fc348d7b Author: Shixiong Zhu Date: Mon Mar 9 10:59:38 2020 -0700 Cross build Scala 2.11 and 2.12 (#22) Add build support for Scala 2.11. 
Closes #18 commit b85fdd6fbb574c700429a5d57da2387efdc9967b Author: Shixiong Zhu Date: Mon Mar 9 00:18:56 2020 -0700 Optimize the connector memory usage (#24) - Disable Spark UI and remove `SparkListener`s to reduce the memory usage of Spark. They are unnecessary for the Spark cluster in connector. - Set the `DeltaLog` cache size to 1 if the user doesn't specify it to avoid caching lots of `DeltaLog` objects. commit 7927a80d8bcc9079683f554bbd9402666cd2a3ae Author: Shixiong Zhu Date: Mon Mar 9 00:18:16 2020 -0700 Use SchemaUtils.reportDifferences to report schema differences (#23) This is one example of the new error message: ``` MetaException(message:The Delta table schema is not the same as the Hive schema: Specified schema is missing field(s): a Specified schema has additional field(s): e Delta table schema: root -- a: integer (nullable = true) -- b: string (nullable = true) -- c: string (nullable = true) Hive schema: e: int c: string b: string Please update your Hive table's schema to match the Delta table schema.) ``` commit 9e2d147b9f114f6ef7d6d1b1e7c9a7616c40af9c Author: Burak Yavuz Date: Wed Mar 4 21:10:20 2020 +0000 [SC-29983][DELTA] Add a new error message when REPLACE table is missing schema This PR adds a new error message for REPLACE table. We will use this for Spark 3.0. Adding this here to reduce conflicts when we merge the Spark 3.0 branch. Author: Burak Yavuz GitOrigin-RevId: f1bdb229ea52c206563c60b449a54a83581a5809 commit 08bdd974e9e6947f9e17f6a9da228c3b25a996cf Author: maryannxue Date: Wed Mar 4 09:34:14 2020 -0800 [SC-24314][SQL] Use Spark's SubqueryExpression.hasSubquery to check whether there is a sub query Use Spark's SubqueryExpression.hasSubquery to check whether there is a sub query so that it's working even if Spark adds more types of sub queries in future. 
Authored-by: maryannxue Signed-off-by: gatorsmile GitOrigin-RevId: 46a16f2eaad9bc8b31ee7c0013b397783c8c6b85 commit 2fd15e480424f4bd3223847847501585df3f21c0 Author: Burak Yavuz Date: Wed Mar 4 02:27:21 2020 +0000 [SC-29969][DELTA] Support maxBytesPerTrigger for Admission Control in the DeltaSource Adds support for the option maxBytesPerTrigger to the DeltaSource. This will help the Delta Streaming source to process a soft-max of maxBytesPerTrigger each micro-batch. Unit tests Author: Burak Yavuz GitOrigin-RevId: 2e6b2f09a444dd03ec968096ae29a2e3c4bb2186 commit 0490c8b2e69f87562c8277cb18c7b038fa1d25e3 Author: Burak Yavuz Date: Mon Mar 2 07:20:59 2020 +0000 [SC-29635][DELTA] Rename `Delete` to `DeltaDelta` Author: Burak Yavuz GitOrigin-RevId: 5ebd0dd61de7fabb1064a104e587dd1add15464d commit cfcc5aca81cffd145386bde1a6d2134b474bef98 Author: IonutBoicuAms Date: Thu Feb 27 20:45:30 2020 +0000 [SC-24983]Use the public `Encoder` API rather than `ExpressionEncoder` if possible There are several places we can use the public `Encoder` class rather than the private concrete class `ExpressionEncoder`. This PR just fixes them. Author: IonutBoicuAms GitOrigin-RevId: c8f981ab8f1fbac8e5dc25a0bb62725841482ec7 commit 1498491e997ea0622f27eb133258ded5e37e4bde Author: mahmoud mahdi Date: Wed Feb 26 23:40:09 2020 +0000 [DELTA-OSS-EXTERNAL] Avoid using mutable values in the withRelativePartitionDir method The main goal of this simple Pull Request is to provide a simple enhancement of the existing code by avoiding to use a mutable dataframe variable in the ```withRelativePartitionDir``` method. Closes delta-io/delta#330 Author: mahmoud mahdi #8329 is resolved by mukulmurthy/df4pjj4s. GitOrigin-RevId: 5fa6aa0612c5818ccf7ef30320aaddd028e5c948 commit a1f7976b9c32849c605cfdce8d97254298555f33 Author: Shixiong Zhu Date: Wed Feb 26 21:26:51 2020 +0000 [SC-29560]Use an internal accumulator in MergeIntoCommand ## What changes were proposed in this pull request? 
Spark UI will track all normal accumulators along with Spark tasks to show them on Web UI. However, the accumulator used by `MergeIntoCommand` can store a very large value since it tracks all files that need to be rewritten. We should ask Spark UI to not remember it, otherwise, the UI data may consume lots of memory (It keeps 1000 stages by default. It means we will keep such accumulator until there are 1000 new stages completed). Hence, we use the prefix `internal.metrics.` to make this accumulator become an internal accumulator, so that it will not be tracked by Spark UI. Note: this doesn't fix the issue entirely. `org.apache.spark.status.LiveTask` is used to track all running or pending tasks. The internal accumulators are still stored in this class until the task completes (`LiveTask` will be converted to `TaskDataWrapper` and internal accumulators will be dropped in this step. See `org.apache.spark.status.LiveEntityHelpers#newAccumulatorInfos`). This is fine since the accumulator is needed when tasks are active. However, **if spark events get dropped**, we may leak task completion events and internal accumulators will stay with these `stale` tasks forever. ## How was this patch tested? The new test to verify the accumulator is not tracked by Spark UI. Author: Shixiong Zhu #8222 is resolved by zsxwing/SC-29560. GitOrigin-RevId: f9cb1244dabaa1c3f6a2d5128dcc13a86a9a3e6d commit 8be3787d387ae2ed7273591ba576c9701e2a2d10 Author: Ali Afroozeh Date: Wed Feb 26 03:02:40 2020 +0000 [SC-24988][Delta] Make Snapshot partition config an optional configuration ## What changes were proposed in this pull request? This PR makes the SNAPSHOT PARTITIONS configurations an optional configuration. This helps us understand when users override it. ## How was this patch tested? Unit tests [SC-24988]: https://databricks.atlassian.net/browse/SC-24988 Author: Ali Afroozeh #8208 is resolved by dbaliafroozeh/AutotuneNumPartitions. 
GitOrigin-RevId: 1cb47b4dbcf96a0b9ce900d2c18d9b2df5b61c7e commit 4b11cf11df12030fd7a9d9f5010e6eababd78e1b Author: herman Date: Tue Feb 25 23:12:17 2020 +0000 [SC-23870] Broadcast Hadoop Configuration to reduce the Spark task size Author: herman Author: Herman van Hovell GitOrigin-RevId: 512d1d6930f8a45d4249a6ffb5c4a93fe0a6d1ae commit 8b011227328eba60d2be83e8a016ca50b35d84f6 Author: Burak Yavuz Date: Mon Feb 24 20:22:38 2020 +0000 [SC-29540][DELTA] Change UpdateTable to DeltaUpdateTable Author: Burak Yavuz GitOrigin-RevId: 34d1a54965dc953b8b6518cb33501e574dcc0445 commit d039e00faf06f3c4bc4682b7ee9b87d015450ebe Author: Burak Yavuz Date: Fri Feb 21 20:12:56 2020 +0000 [SC-28999][SS] Minor improvements to Delta streaming source Existing tests Author: Burak Yavuz Author: Burak Yavuz GitOrigin-RevId: c8fd5fa6cd8625412a550a2d64a3a2e8878c22eb commit 5c84322f0221687a13bc97524b40a31f1178dbcd Author: Burak Yavuz Date: Wed Feb 19 07:07:54 2020 +0000 [SC-29289][DELTA] Refactor SnapshotManagement and Checksums ## What changes were proposed in this pull request? This PR refactors the newly introduced SnapshotManagement and MetadataGetter classes, as well as where checksums are verified. We were missing a case, where `getSnapshotAt` was not verifying checksums. The checksum verification has been moved into `MetadataGetter`, since this class is responsible for computing the statistics around a Delta table, and is the thin-waist where we can perform this check. We also unify some code in SnapshotManagement, and also introduce a method for computing the initial state of a Delta table when it is first loaded. This avoids a two step: 1. Create initial empty snapshot, because checkpoint doesn't exist 2. Now do a list and update the table initialization and consolidates them into one, reducing total file system operations ## How was this patch tested? Existing tests as this is a pure refactor Author: Burak Yavuz #8112 is resolved by brkyvz/checksumRefactor. 
GitOrigin-RevId: 5ad11307c39c5444b2feacb0ad74a9cf89711b1d commit 8ff9b1888c9683741cd8b47dfdeaf0e794f495c6 Author: Erik LaBianca Date: Tue Feb 18 09:05:40 2020 +0000 [DELTA-OSS-EXTERNAL] Avoid creating null output stream in S3SingleDriverLogStore Fixes #316 Closes delta-io/delta#317 Author: Erik LaBianca #7992 is resolved by zsxwing/8poe59z8. GitOrigin-RevId: 4e2306940262b3f942a8c325f494f22693e874b1 commit e48e43c2dce20fa41d14211336d1d8d789b0cebd Author: Rahul Govind Date: Fri Feb 14 10:46:08 2020 +0000 [SC-29179] Add Table ID in schema mismatch errors This commit adds the ID of the table being written to in schema mismatch error messages. Closes delta-io/delta#322 GitOrigin-RevId: 6523ca6eb81dd9ca12b9e97163fd520ff8dc839b commit b93686cf6f5db5361f3dafd3e26994ca9d2f7221 Author: gatorsmile Date: Thu Feb 13 19:28:38 2020 -0800 Improve some error messages GitOrigin-RevId: 8ac5d37ca3bfeb0509f5ca3c763eb4786f2f8df3 commit b2618a7a6049fa352b5741ff68e912da6e3bf838 Author: Ali Afroozeh Date: Wed Feb 12 21:24:52 2020 +0000 [SC-28990][DELTA] Add SnapshotManagement interface to DeltaLog ## What changes were proposed in this pull request? Moves the snapshot management piece in DeltaLog to a specific trait. DeltaLog code is a bit leaner now. ## How was this patch tested? This PR is a refactoring, passes existing tests. Author: Ali Afroozeh #7880 is resolved by dbaliafroozeh/IntroduceSnapshotEdge. GitOrigin-RevId: a5b108c7f648c18b306ed64c5f26806f5ae914e6 commit d27095e58bc07580501cbd3d4e10719a9292b566 Author: Pranav Anand Date: Wed Feb 12 16:49:10 2020 +0000 [SC-28439]Fix change column for array and map types ## What changes were proposed in this pull request? - WIP PR. 
Remaining changes are as follows: - In order to support adding comments on array/map columns and adding comments to primitive types on an array's element or a map's key/value, may need to use `transformColumn` or do something other than use `dropColumn` in its current form - Change `dropColumn` to support `ArrayType` and `MapType` ## How was this patch tested? - Tests added to `DeltaAlterTableTests` Author: Pranav Anand #7808 is resolved by pranavanand/pa-30-altertable-changecolumn. GitOrigin-RevId: 4c0b02dc28b6119a77463327ecc36df89c42393a commit 313a240c7f6eeaccbe41b641b095da4f72d087e0 Author: Rahul Mahadev Date: Wed Feb 12 02:48:21 2020 +0000 [DELTA] Enable operation metrics by default ## What changes were proposed in this pull request? Purpose of this PR is to enable Operation metrics by default and fix the tests that fail when they are enabled by default. ## How was this patch tested? Author: Rahul Mahadev #7841 is resolved by rahulsmahadev/historyEnableDefault. GitOrigin-RevId: ee67cd36f6de63b86d46e46ad4f657d8741a877d commit 6d860a65e8c1cddd7817766d6ac414c9ee2f4bd2 Author: Burak Yavuz Date: Mon Feb 10 16:34:56 2020 -0800 [SC-28961] Define how to compare two Delta tables Introduces a `compositeId`, which should be used in cases where we compare two Delta tables. Regression test Authored-by: Burak Yavuz Signed-off-by: Burak Yavuz GitOrigin-RevId: e62a1d1a2ccd92d4b67d7edad1fb80c4ba7a5497 commit 4b80a77880c04f1195201d4772f90d15f2916a1f Author: Ali Afroozeh Date: Mon Feb 10 16:55:36 2020 +0000 [SC-28594] Add MetadataGetter interface This PR introduces the `MetadataGetter` trait, which provides an interface for getting the metadata information of a Delta log. Also, the `StateMetadataGetter` class is added that implements the `MetadataGetter` interface and extracts the metadata information from the state of a Delta log (`Dataset[SingleAction]`). It's a refactoring, passes the existing tests. 
Author: Ali Afroozeh GitOrigin-RevId: dbe250c55567892f4a13d2ab911a5e53f43cad76 commit 240feb02ff0bdc390cf5db72f95c0e82f5a1f1db Author: Rahul Mahadev Date: Sun Feb 9 00:33:12 2020 +0000 [SC-28441][DELTA] Delta Operation metrics - convert ## What changes were proposed in this pull request? Added operation metrics for `CONVERT TO DELTA` -> Additionally added a missing field in `STREAMING UPDATE` ## How was this patch tested? added a test in `DescribeDeltaHistorySuite` Author: Rahul Mahadev #7767 is resolved by rahulsmahadev/historyConvert. GitOrigin-RevId: ad1354fef0dff1f4f8d734318de606cb3e0450c3 commit 79ff2e45810b2a79849894e51accc27ff6ba6dbe Author: Rahul Mahadev Date: Fri Feb 7 23:47:59 2020 +0000 [SC-28708][DELTA] Row level metrics for Delta Update ## What changes were proposed in this pull request? Adding row level metrics for UpdateTableCommand in Delta. this PR will add instrumentation to track the number of updated rows and the number of copied rows during a Delta Update. ## How was this patch tested? Added unit tests. Author: Rahul Mahadev #7821 is resolved by rahulsmahadev/rowLevelUD. GitOrigin-RevId: b1f99cd05a215675c19c9804bc9751ef0ba487e4 commit 9bdf37ae4bbb5108bce7b91289705117722f772b Author: Burak Yavuz Date: Fri Feb 7 05:21:36 2020 +0000 [SC-28833][DELTA] Add more types of errors for INSERT INTO Author: Burak Yavuz GitOrigin-RevId: 954c109b114cf59b76fa5fb6b6d9adcff5b95e39 commit ca10cf732ce201ceaab7f2e4bc15e9e378b7cee1 Author: Rahul Mahadev Date: Fri Feb 7 04:06:51 2020 +0000 [SC-28708][DELTA] row level metrics for delete ## What changes were proposed in this pull request? Added row level metrics for Delete ## How was this patch tested? added unit tests Author: Rahul Mahadev #7840 is resolved by rahulsmahadev/rowLevelDelete. 
GitOrigin-RevId: 7f5629ebb92534f55458e1449e17713c4244c526 commit ea094d838f3d61d9c264f173e3b9f416200eee93 Author: lswyyy <228930204@qq.com> Date: Thu Feb 6 14:29:10 2020 -0800 [DELTA-OSS-EXTERNAL] Generate does not update manifest if delete all data from unpartitioned tables fixes #275 Closes delta-io/delta#277 Closes #7931 from tdas/1u0hfort. Lead-authored-by: Tathagata Das Co-authored-by: Tathagata Das Co-authored-by: lswyyy <228930204@qq.com> Signed-off-by: Tathagata Das GitOrigin-RevId: fa35d0ea76e2973e84134e34e901be50d17e0f39 commit 7929edd2050f0f28f31614d4c599f8cb04f2576d Author: Shixiong Zhu Date: Tue Feb 25 22:26:18 2020 -0800 Column names should be case insensitive (#20) Hive's column names are case insensitive. We should ignore case when comparing the column names. This PR also improved the error message to make it easier to compare the schema differences. Closes #17 commit 21d2af1d8a33b9911cc3308ff5e4a1cc14517ac2 Author: herman Date: Thu Jan 30 22:06:23 2020 +0000 [SC-28546] Reenable test initial snapshot delta test ## What changes were proposed in this pull request? This PR re-enables one of the initial snapshot tests that was disabled during the last batch merge. It turns out that the payload of a `SparkListenerJobStart` event slightly changed, it now does not contain any stage info when the job is empty. ## How was this patch tested? It is a test. Author: herman #7860 is resolved by hvanhovell/SC-28546. GitOrigin-RevId: 2aa3174b1d1c13243c6133dde3f120827b9c7642 commit 472c9debe60417d3463cb8d327d4b0ed0e68b8d0 Author: Tom van Bussel Date: Thu Jan 30 13:32:32 2020 +0000 [SC-13365] Add getBasePath interface declare getBasePath in TahoeFileIndex Author: Tom van Bussel GitOrigin-RevId: 3c0e8be777db0ebe123cccc7baf8ca35e09a1481 commit 348ae42f89df08cbc588434c9165c02eed2b1eb6 Author: Rahul Mahadev Date: Wed Jan 29 18:51:49 2020 +0000 [SC-28443][DELTA] operation metrics - create table ## What changes were proposed in this pull request? 
added operation metrics for `CREATE TABLE`. Note: our SQLMetrics framework will allow us to siphon the metrics directly from the underlying `WRITE` ## How was this patch tested? Added unit test in `DescribeDeltaHistorySuite` Author: Rahul Mahadev #7771 is resolved by rahulsmahadev/historyCreateTable. GitOrigin-RevId: 723425ad1a03d674bfb93222e6a3309088c59cc2 commit 9382f40c034eda720926058322399e539a070d74 Author: Gengliang Wang Date: Wed Jan 29 00:19:44 2020 -0800 Disable test initial snapshot delta test GitOrigin-RevId: d6f628d292243095c2e6a668d1df6aecf20c7a28 commit ddfb085b42d24a5e65f0f9af6e6794684254b766 Author: Rahul Mahadev Date: Mon Jan 27 23:32:06 2020 +0000 [SC-28440][DELTA] Delta Operation Metrics - FSCK Command Added operation metrics for `FSCK` command Author: Rahul Mahadev GitOrigin-RevId: 17853fe70a302c6fd499fa44e6432b167fe38b16 commit 4e559168b526f76134907b821b589683e2a2af7f Author: Tathagata Das Date: Sat Jan 25 01:41:49 2020 +0000 [SC-28544] Rename Merge classes to DeltaMerge to avoid conflict with Apache Merge classes ## What changes were proposed in this pull request? Renamed all Merge* classes to DeltaMerge* classes. ## How was this patch tested? Existing tests Author: Tathagata Das #7763 is resolved by tdas/SC-28544. GitOrigin-RevId: 5f19f332601804befb0f1f83c45c34f2601eaee6 commit 0bf903ea19ae2a402ef358702fb17d1a56285fc8 Author: Pranav Anand Date: Fri Jan 24 01:07:11 2020 +0000 [SC-27974] Fix column addition into array and map types ## What changes were proposed in this pull request? - Change `findColumnPosition` for `MapType` to be able to tell if position needed is for a key or for a value and support `ArrayType` - Change `addColumn` to support `ArrayType` and `MapType` ## How was this patch tested? - Add tests for `MapType` and `ArrayType` Author: Pranav Anand #7638 is resolved by pranavanand/pa-30-altertable-addcolumn. 
GitOrigin-RevId: 755b00b3d788ea24f179cef0220ee7d8d4eb3e94 commit 9821e9af4b9e7d3bd1d90dd0bd8b65c697a86543 Author: Tathagata Das Date: Wed Jan 22 17:53:27 2020 +0000 [DELTA-OSS-EXTERNAL] Fixed two subtle bugs in merge resolution Here are the two bugs fixed. 1. Insert condition should be resolved only with source and not with source+target. This was because clause conditions were being resolved with the full `merge` plan (i.e., with both source and target output attribs) independent of the clause type. This PR fixes it by using the `planToResolveAction` to resolve the references of the condition, which is customized for each clause type. 2. Fix for bug #252 where incorrect columns were being printed as valid columns names. In the code, `plan.references` were being printed as valid column names. This is wrong because `references` includes invalid column names as well. This PR fixes it by using the output attributes of `plan.children` which are the only valid column names that can be referred to. Updated unit tests to verify the presence/absence of valid/invalid column names. Closes #252 Closes delta-io/delta#303 Author: Tathagata Das #7702 is resolved by tdas/u9qiqvwd. GitOrigin-RevId: 26a1458dc3deb05f2398b0c5daec3d4ef5a9a1a7 commit 6ff2e1a6a733a48476416b989c0c3fda81d570bc Author: Tathagata Das Date: Wed Jan 22 03:01:22 2020 +0000 [DELTA-OSS-EXTERNAL] Made the python api doc generation more robust - Made sphinx throw all warnings as errors. Sphinx tends to mark a build successful even if there are major issues (e.g., import not found) that cause the contents of built docs to be invalid. These issues shows up as warning, and converting them to errors allows them to be caught in the CI/CD loop early on. - Fixed indentation issues in docs. - Ignore pyspark not being present during python doc generation. Closes delta-io/delta#300 Author: Tathagata Das #7698 is resolved by tdas/nnjju8mt. 
GitOrigin-RevId: 86e3d5ac9a7f6e2698702138d3735f817177ecca commit 9dd54242f4a6c64385e3841b133e21f6cf2f129d Author: Rahul Mahadev Date: Tue Jan 21 22:14:37 2020 +0000 [SC-27760][DELTA] Add Descibe History metrics for update and delete ## What changes were proposed in this pull request? This PR adds Describe History metrics for Update and Delete ## How was this patch tested? added tests in `DescribeHistorySuite` Author: Rahul Mahadev #7494 is resolved by rahulsmahadev/historyStatsUpdateDelete. GitOrigin-RevId: 913dd271f0ecf6976582f74ac55c6981c54b2b22 commit 7c2c0dabff5fb48b2e08c2b1c9ef1ffbedf11c95 Author: Rahul Mahadev Date: Fri Jan 17 11:09:29 2020 -0800 [SC-27761][DELTA] Add Describe History metrics for OPTIMIZE command ## What changes were proposed in this pull request? This PR adds metrics for OPTIMIZE Command via Describe History. ## How was this patch tested? Added unit tests in `DescribeHistorySuiteEdge` Closes #7492 from rahulsmahadev/historyStatsOptimize. Authored-by: Rahul Mahadev Signed-off-by: Rahul Mahadev GitOrigin-RevId: 1becff31f20533c133fb82cc503bc58f8d6f501b commit 070c023fc200d3fff78720aa8ad98f6271fee750 Author: Ali Afroozeh Date: Thu Jan 16 12:11:03 2020 +0000 [SC-24866] Add SQLConf for Parallelize file index collection in PrepareDeltaScan This PR adds a SQLConf Author: Ali Afroozeh GitOrigin-RevId: 4ee5f1f5837e85d4344bbc96eadb87bec284591b commit e8797872d1b839d3966713ef5c14c7fde9246cbd Author: Xiao Li Date: Wed Jan 15 14:16:25 2020 -0800 Fix Delta on file table resolution Author: Xiao Li GitOrigin-RevId: 90ed86bc78b461b9e96fe01511f125a47dc7b3d5 commit adc03b860fc7ac11fe652e961ed5c570b606427e Author: Rahul Mahadev Date: Wed Jan 15 20:22:54 2020 +0000 [SC-27762][DELTA] Added Streaming Update to Describe History Stats ## What changes were proposed in this pull request? This PR adds metrics for Streaming Updates via Describe History. ## How was this patch tested? 
Added a test in `DescribeHistorySuite` Author: Rahul Mahadev #7493 is resolved by rahulsmahadev/historyStatsStreamingup. GitOrigin-RevId: d108c4b90e854188d920e3075f4d100db326e288 commit b46159e3a6f0ccbe9b3dc6488ab904815b9b283e Author: chet Date: Wed Jan 15 19:50:11 2020 +0000 [DELTA-OSS-EXTERNAL] Improvements - Scala code style What was proposed in this Pull Request? 1. Scala doesn't require the `return` keyword. 2. Unwanted parentheses have been removed. 3. Curly braces removed for variables; they are only needed for expressions. 4. map - lambda function doesn't need anonymous function. Example - `val replacedDb = database.map(quoteIdentifier(_))`. 5. The `new` keyword is only required for a class, not for a case class, which automatically invokes the `apply` method. 6. Collection type check, `!columns.isEmpty` has been changed to `columns.nonEmpty` for better readability. 7. dataframe object declaration changed from `var` to `val` for immutability. How was this pull request tested? * ran `sbt compile` to check compile time errors. PASSED * Checked circle-ci log, could not see any exception. Looking forward to your review comments. @brkyvz Closes delta-io/delta#286 Author: chet #7611 is resolved by brkyvz/flxve6u6. GitOrigin-RevId: 531ce1f4b69ab31cf91c7c95d10fef362fd7f349 commit 3ea54f2ec680faf26289e2988903d9e7f7bdc106 Author: Pranav Anand Date: Tue Jan 14 22:03:30 2020 +0000 [SC-27875] Make ConvertToDelta use Catalyst to see if TableIdentifier is a path or not ## What changes were proposed in this pull request? - New trait `ResolveTableIdentifier` which can be used by commands to determine if a `TableIdentifier` refers to a path based table or a table from a metastore - Make `ConvertToDelta` use `ResolveTableIdentifier` ## How was this patch tested? - This PR does not add functionality. Existing tests Author: Pranav Anand #7569 is resolved by pranavanand/pa-istableorpath.
GitOrigin-RevId: 76e52d04fe3a140983872f909081e573a916b466 commit 119e67e6128b186bdc610bb81634ba4efa3481d0 Author: Shixiong Zhu Date: Fri Jan 17 10:36:14 2020 -0800 Set 'path' in the storage properties as Spark is using it for its data sources (#16) Spark reads the `path` option from the storage properties. We should set it when creating a table in Hive so that the table can be read by Spark as well. I also removed `delta.table.path` from the table properties since we can get it from the `location` property. commit f573ce7e272f2ca0f1085fa4e7a3f35f305a9d2c Author: Shixiong Zhu Date: Mon Jan 13 06:34:03 2020 +0000 [SC-26325] Refactor codes and add more tests for interval config value validation Author: Shixiong Zhu GitOrigin-RevId: 0b7170ac0bf9d89565d1efb82d51134738d64932 commit b0ac303d6b5640081b63e665603995c90fc43ddf Author: Rahul Mahadev Date: Fri Jan 10 03:01:54 2020 +0000 [SC-27763][DELTA] Added Describe History metrics for Truncate command Added Describe history metrics for Truncate Command Author: Rahul Mahadev GitOrigin-RevId: 3faea5cf832937a203fd995d8104ef8e415c7ce1 commit 2df6049090f197fad422adc29a73b787bd664d79 Author: Burak Yavuz Date: Fri Jan 10 01:57:50 2020 +0000 [SC-27921][DELTA] Refactor time travel codes We have special logic to handle partition filters and time travel specifications being part of a path to load a Delta table. This logic exists in createRelation of the V1 relation of Delta. This PR refactors some of this logic to prepare for Spark 3.0. Author: Burak Yavuz Author: Burak Yavuz GitOrigin-RevId: a299b927ded7a1fb904ca8af9ee4d825e79ffb65 commit 2a37476ccb2328e2a71d64b343460dd6218d69f7 Author: Andrew Fogarty Date: Fri Jan 10 00:33:40 2020 +0000 [DELTA-OSS-EXTERNAL] Clean-up VacuumCommand Remove `reservoirScheme` which is never used. 
Closes delta-io/delta#237 Author: Andrew Fogarty GitOrigin-RevId: 9115b0a6d20f5415367918b768c2ba26a2dae5bb commit 66f3bb527300592658692c8fef5b943cc7f47cec Author: Anurag870 Date: Fri Jan 10 00:18:41 2020 +0000 [DELTA-OSS-EXTERNAL] Minor changes for README.md Closes delta-io/delta#211 Author: Anurag870 #7544 is resolved by zsxwing/lfpuuecc. GitOrigin-RevId: 76f108cedc7ba486c97a016ed34954a673e3e15b commit e85b3ca1ce8fc470e486defe871dc4c2f609cfa7 Author: Pranav Anand Date: Thu Jan 9 21:02:11 2020 +0000 [SC-26881] ConvertToDelta using parquet.`/some/path` fails when external catalog fails Catch `AnalysisException` in `ConvertToDelta` when checking whether `databaseExists`, and try to use the path instead of throwing an exception Author: Pranav Anand GitOrigin-RevId: b882d316f28f58cd9b41d95eb1f62a610b5d5f6a commit f0dc24df5178e773b604811066d98c0306782594 Author: Shixiong Zhu Date: Fri Jan 10 13:35:57 2020 -0800 Add Hive connector instructions commit 5afc01b1d58da4244190b595c77a1690c3b00c48 Author: herman Date: Wed Jan 8 20:19:33 2020 +0000 [SC-24982][DELTA] Do not run jobs on an empty snapshot When replaying a log for an uncheckpointed Delta table, we first compute an empty snapshot. While this is correct, it nearly doubles the time spent on the executor for these tables. This PR changes this by using the InitialSnapshot class for the initial replay. This class has been modified to make sure it avoids expensive computation (constructing dataframes, caching & executing jobs). Added a unit test to `DeltaSuite`. Author: herman Author: Youngbin Kim GitOrigin-RevId: 5cb7fe0517f3a8f9c032657cc27851ea15da0c62 commit 6a22a9575d69265c5b25637bf657bf192e8b26ea Author: lys0716 Date: Tue Jan 7 21:34:13 2020 +0000 [DELTA-OSS-EXTERNAL] Allow file names to include = when convert to delta Allow file names to include `=` when converting to Delta Signed-off-by: Yishuang Lu Closes delta-io/delta#264 Author: lys0716 #7514 is resolved by zsxwing/2s7xlu95.
GitOrigin-RevId: cbbbc42443fa277738c98c794f6d4c41403a3a3e commit d9be0a45f5286c1cce0ead0381f2836c17e990c3 Author: Rahul Mahadev Date: Thu Jan 2 22:00:44 2020 +0000 [SC-27618][DELTA] Fix package names for DML Java Suite Fix package names for DML commands Java suite to have package `test.org.apache.spark.sql.delta` instead of `test.com.databricks.sql.transaction.tahoe` Ran test builders Closes https://github.com/delta-io/delta/pull/295 Closes https://github.com/delta-io/delta/issues/290 Author: Rahul Mahadev GitOrigin-RevId: 92f0ab08e238b182c76c24456fe276a79e4f82e1 commit 54643efc07dfaa9028d228dcad6502d59e4bdb3a Author: Rahul Mahadev Date: Fri Dec 20 18:14:14 2019 +0000 [SC-24567][DELTA] Add additional metrics to Describe Delta History - Added a way to capture SQLMetrics and inject it into CommitInfo - Currently supports Writes and Merge - Guarded by feature flag Preview of describe history on Merge +-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+-----------------+-------------+--------------------+ |version| timestamp|userId|userName|operation| operationParameters| job|notebook|clusterId|readVersion| isolationLevel|isBlindAppend| operationalMetrics| +-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+-----------------+-------------+--------------------+ | 1|2019-12-19 11:16:24| null| null| MERGE|[predicate -> (s....|null| null| null| 0|WriteSerializable| false|[numTargetRowsCop...| | 0|2019-12-19 11:16:17| null| null| WRITE|[mode -> ErrorIfE...|null| null| null| null|WriteSerializable| true|[numFiles -> 2, n...| +-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+-----------------+-------------+--------------------+ Map(numTargetRowsCopied -> 0, numTargetRowsDeleted -> 0, numFiles -> 5, numTargetFilesAdded -> 5, numTargetRowsInserted -> 50, numTargetRowsUpdated -> 50, numOutputRows -> 
100, numOutputBytes -> 2838, numSourceRows -> 100, numTargetFilesRemoved -> 1) - Added tests in `DeltaDescribeHistorySuite` Author: Rahul Mahadev GitOrigin-RevId: 177c4ac5b70ea2a49cdd24086700d32def66ed91 commit 8f8df61b383e94a3d4a170c16c3d2384f98e5653 Author: Steve Suh Date: Wed Dec 18 23:34:05 2019 +0000 [DELTA-OSS-EXTERNAL] Fix typo in DeltaTable Generate method descriptions The Generate method descriptions list **symlink_manifest_format** as a valid mode, however the [Generate documentation](https://docs.delta.io/latest/delta-utility.html#generate) as well as the [DeltaGenerateCommand](https://github.com/delta-io/delta/blob/2a322facb5c57e7322302ab878eddb16ff21b5f1/src/main/scala/org/apache/spark/sql/delta/commands/DeltaGenerateCommand.scala#L66) use **symlink_format_manifest** Closes delta-io/delta#284 Author: Steve Suh #7378 is resolved by rahulsmahadev/s3qo81ns. GitOrigin-RevId: 83c7e4f6e39a35a2b6c05b6d848158b9c24a09e4 commit 44f255f417aa95031c8ceba8f72dd215dce297bb Author: hongdd Date: Wed Dec 18 21:34:19 2019 +0000 [DELTA-OSS-EXTERNAL] Update README.md version to 0.5.0 update 0.4.0 to latest version Closes delta-io/delta#279 Author: hongdd #7375 is resolved by mukulmurthy/z0yaa3go. GitOrigin-RevId: 50f69cb9b8cc3bd4119eaba93172921403a31dfc commit e63f98e48a974e30d5feac90eb534e569b269cc6 Author: Burak Yavuz Date: Wed Dec 18 02:56:51 2019 +0000 [SC-26887][DELTA] Fix the unapply DeltaFullTable DeltaFullTable currently misses filters performed on the relation. This PR fixes the unapply method Author: Burak Yavuz GitOrigin-RevId: 32781a1b2e0b49857a06b37e226b38c02714765b commit d38fdc0e944a8836a7a63ed1891a02473dacca0f Author: Jose Torres Date: Tue Dec 17 15:46:06 2019 -0800 [SC-24886][DELTA] Evaluate partition predicates on the right data type ## What changes were proposed in this pull request? Right now, we use the wrong data type for partition columns when evaluating partition predicates for the Delta file index - they are always strings. 
This almost always works because of implicit casting, but there are a few edge cases where implicit casts are not added and we must explicitly set the right type. SC-10573 was an earlier patch for one such case; we made it minimal to safely warmfix it, but kinda dropped the ball on doing the non-minimal fix. ## How was this patch tested? new unit test, along with the old one for the old patch Closes #6743 from jose-torres/fixparttype. Authored-by: Jose Torres Signed-off-by: Jose Torres GitOrigin-RevId: fc63605f97da3f937cf8f33cda7f7d504a024d36 commit 67e89b051354bc787dc3f1f5f583717abc0a9673 Author: Jakub Orłowski Date: Tue Dec 17 09:13:55 2019 +0000 [DELTA-OSS-EXTERNAL] Update Delta README Slack register link The old invite link is no longer active. The new one was posted by @dennyglee via Delta Lake Users and Developers mailing list, here: https://groups.google.com/forum/#!topic/delta-users/tv_umPl2avs Closes delta-io/delta#273 Author: Jakub Orłowski Author: Rahul Mahadev #7323 is resolved by rahulsmahadev/4u0z8neq. GitOrigin-RevId: 1f7938f08f44a6a6330abf70c96ee5de74a48a39 commit 0b4138539b7041ae19bd6a9606cfef054334a808 Author: Xiao Li Date: Mon Dec 16 17:52:20 2019 -0800 Fixup some import statements GitOrigin-RevId: c11bd517e4f23820886d972e21a6eb74d96a9287 commit e7db8325705d08074bf01bdfc1a6d2768f0c8929 Author: Burak Yavuz Date: Tue Dec 17 00:35:42 2019 +0000 [SC-26785][DELTA] Prepare for Alter Table support Add an error message that will be needed to support ALTER TABLE. Unit tests Author: Burak Yavuz Author: Burak Yavuz GitOrigin-RevId: d452857d712988d9b28ff16e1f8ff656ac91334c commit e6fbc27f81484565593d3458500dff018e91850f Author: Timothy Zhang Date: Sat Dec 14 20:08:48 2019 +0000 [DELTA-OSS-EXTERNAL] Updated example Streaming 1. Added some println as comments for different sections 2. Added variable checkpointPath to replace hardcode in option 3. Changed duration of stream3 from 10000 to 20000 so as to get obvious results for updated Delta Table 4. 
Added a delete statement of checkpoint directory in the end Closes delta-io/delta#270 Author: Timothy Zhang Author: Rahul Mahadev GitOrigin-RevId: 66b8da80dce9f5fac3922638ea064f4272842969 commit eb1389b976cb5224abe83e34d7978bea76edfe2a Author: Tathagata Das Date: Fri Dec 13 04:57:34 2019 +0000 [DELTA-OSS-EXTERNAL] Update latest Delta version in readme to 0.5.0 Closes delta-io/delta#272 Author: Tathagata Das #7299 is resolved by tdas/obke9grz. GitOrigin-RevId: d5ee228f5e52af134a925c76378028da572c20e3 commit b092e8db4bfe460be070bc4a04d4d83c721a6d5f Author: Rahul Shivu Mahadev <51690557+rahulsmahadev@users.noreply.github.com> Date: Fri Dec 13 03:08:30 2019 +0000 [DELTA-OSS-EXTERNAL] Updates Scala/Python examples with 0.5.0 features Updating examples to include examples 1. Streaming: Streaming append and concurrent repartition using `dataChange=false` 2. Utilities: generating manifest files. built jar and tested locally. Closes delta-io/delta#267 Author: Rahul Shivu Mahadev <51690557+rahulsmahadev@users.noreply.github.com> Author: Rahul Mahadev #7278 is resolved by rahulsmahadev/ujcblitf. GitOrigin-RevId: 14de114a98d8a4075e27525c0889e438fdcc90e5 commit 1d1ffb99209fc35f8d542196e0092bcd24151dad Author: Rahul Mahadev Date: Fri Dec 13 02:45:41 2019 +0000 [DELTA] Bumping version to match delta OSS version bumping version number to match with delta OSS release Tests will start passing after maven release Author: Rahul Mahadev GitOrigin-RevId: 5906a613a4ceb960dd8670f4299b6c7de4c30341 commit 868388a0155deb2db9e79cebeb412821cb216946 Author: Shixiong Zhu Date: Tue Dec 17 08:11:55 2019 -0800 Add Scala stylecheck (#15) This PR applies Scala stylecheck plugin to enforce the same `scalastyle-config.xml` file from the delta-core project. The major changes in this PR include: - Copy `scalastyle-config.xml` from the delta-core project. - Apply the stylecheck plugin to all projects. - Add the missing license header. - Fix the long lines. 
- Remove unused methods: `DeltaHelper.parsePathPartition` and `DeltaHelper.parsePartitionColumn`. - Update Delta version to 0.5.0. commit ea203636e075f9d7fbd23c805ba91869e52daa8e Author: Shixiong Zhu Date: Fri Dec 13 12:36:06 2019 -0800 Consistent schema in Hive and Delta metadata (#11) Right now we require that the partition columns come after the data columns. This PR adds a new DeltaInputSplit to remove the above limitation and also adds validation to ensure Hive's schema is always consistent with Delta's metadata regarding column types and order. commit 2168cebad3491fd2c6fa9006d568815d6ec03063 Author: Rahul Mahadev Date: Thu Dec 12 23:30:15 2019 +0000 Setting version to 0.5.1-SNAPSHOT commit 0ddd3322b03964aaccce4ded5eb5d67788331fed Author: Rahul Mahadev Date: Thu Dec 12 23:28:42 2019 +0000 Setting version to 0.5.0 commit 878ebda7d6e6a3752558bc75c5ebaf5232e66bb9 Author: Burak Yavuz Date: Thu Dec 12 19:49:57 2019 +0000 [SC-26788][DELTA] Fix issue with DateType partition columns in ConvertToDelta ## What changes were proposed in this pull request? DateType and TimestampType partitions were not working in ConvertToDelta, because we were not passing in the timezone information during casting. Previously, the Cast threw a `None.get` error while checking whether the directory partition value, when cast to the data type, returned null. ## How was this patch tested? Regression test Author: Burak Yavuz Author: Burak Yavuz #7288 is resolved by brkyvz/convert. GitOrigin-RevId: 598bd4ee434672cf71e778208a13b24061a055ba commit 89449e820f3eda09a95bfe3054f826480a58f22b Author: Pranav Anand Date: Thu Dec 12 06:12:07 2019 +0000 [SC-26798] Check if documentation links in DeltaErrors point to valid URLs ## What changes were proposed in this pull request? - `DocsPath` now has a method `generateDocsLink` which will be used to create the docs link that is passed to the error message - New test suite to test the doc links in `DeltaErrors` ## How was this patch tested?
- `DeltaErrorsSuite` was added - `DeltaErrorsSuiteBase` contains a list of all the error message in `DeltaErrors` that are applicable to Delta. The test accesses each link in each error message and checks to see if the HTTP response is valid for `docs.delta.io.{path}` Author: Pranav Anand #7264 is resolved by pranavanand/pa-docs-fixlinks. GitOrigin-RevId: 24b7353ba335dba4325afbab258f4cdb9f7618a9 commit 4933f0aac2e6c6d30428bb8ba0bc49caea07305e Author: Tathagata Das Date: Thu Dec 12 00:58:32 2019 +0000 [DELTA-OSS-EXTERNAL] Make the Delta API doc generation script Python 3 compatible - self explanatory - tested locally on python 3 conda environment Closes delta-io/delta#269 Author: Tathagata Das #7263 is resolved by tdas/g2qh6y4y. GitOrigin-RevId: 6091087098803eb14fd0ceeb50eb8a9e947348bc commit 16b3c1765818d4efc8daa429def14912e68fa24d Author: Burak Yavuz Date: Tue Dec 10 22:46:50 2019 +0000 [SC-26711][DELTA] Add exceptions for insert errors - Added a new type of exception, to be used later for DSv2 Author: Burak Yavuz GitOrigin-RevId: 87c2249b1447acee65d3904cf06cb8c89e78f996 commit 811f950e6d3270eb7d632f949f9fe83b1f885c16 Author: Pranav Anand Date: Tue Dec 10 02:46:19 2019 +0000 [SC-25510] Update merge error message to clarify the duplicate source key issue - No behavior change is a part of this PR - Change error message to be more verbose when merge fails - It seems that people think the previous error message was a bug in Delta rather than something that needs to be fixed on their end - the new error message should make this clear as well - Should consider adding a feature flag which, when enabled, gives table specific information about the merge like which source rows match target rows Existing tests Author: Pranav Anand GitOrigin-RevId: 44a77c5c5fb5463428855991deba3a7a33bce83f commit f32830022e6a664f99f25f3dad8584f5cd9952bf Author: Tathagata Das Date: Fri Dec 6 09:54:27 2019 +0000 [DELTA-OSS-EXTERNAL] Improved Delta concurrency with finer-grained conflict 
detection in OptTxnImpl This is a modified PR from the original PR https://github.com/delta-io/delta/pull/114 by `tomasbartalos` (kudos, it was a very good PR!). This PR tracks transaction changes at a finer granularity (no new columns required in RemoveFile action) thus allowing more concurrent operations to succeed. closes delta-io/delta#228 and delta-io/delta#72 This PR improves the conflict detection logic in OptTxn using the following strategy. - OptTxn tracks two additional things - All the partitions read by the query using the OptTxn - All the files read by the query - When committing a txn, it checks this txn's actions against the actions of concurrently committed txns using the following strategy: 1. If any of the concurrently added files are in the partitions read by this txn, then fail because this txn should have read them. -It’s okay for files to have been removed from the partitions read by this txn as long as this txn never read those files. This is checked by the next rule. 2. If any of the files read by this txn have already been removed by concurrent txns, then fail. 3. If any of the files removed by this txn have already been removed by concurrent txns, then fail. - In addition, I have made another change where setting `dataChange` to `false` in all the actions (enabled by #223) will ensure the txn will not conflict with any other concurrent txn based on predicates. Tests written by `tomasbartalos` in the original PR. Some tests were changed because some scenarios that were blocked in the original PR are now allowed, thanks to more granular and permissive conflict detection logic. Some test names tweaked to ensure clarity. 
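The three rules above can be sketched in a few lines of Scala. This is a simplified illustration under stated assumptions, not the code in OptTxn: the `ConflictSketch` object, the `FileRef`, `TxnFootprint` and `ConcurrentCommit` types, their field names, and `firstConflict` are all hypothetical, and partitions and file paths are modeled as plain strings.

```scala
// Hypothetical, simplified model of the three conflict rules described above.
// None of these names come from the Delta code base.
object ConflictSketch {
  // A data file, identified by its path and the partition it belongs to.
  final case class FileRef(path: String, partition: String)

  // What the committing transaction read and what it wants to remove.
  final case class TxnFootprint(
      readPartitions: Set[String],
      readFiles: Set[String],
      removedFiles: Set[String])

  // What one concurrently committed transaction added and removed.
  final case class ConcurrentCommit(added: Seq[FileRef], removed: Seq[FileRef])

  // Returns a description of the first violated rule, or None if no rule is violated.
  def firstConflict(txn: TxnFootprint, winners: Seq[ConcurrentCommit]): Option[String] = {
    val added = winners.flatMap(_.added)
    val removedPaths = winners.flatMap(_.removed).map(_.path).toSet

    if (added.exists(f => txn.readPartitions.contains(f.partition))) {
      Some("rule 1: a concurrent txn added a file to a partition this txn read")
    } else if (txn.readFiles.exists(removedPaths.contains)) {
      Some("rule 2: a concurrent txn removed a file this txn read")
    } else if (txn.removedFiles.exists(removedPaths.contains)) {
      Some("rule 3: a concurrent txn removed a file this txn also removes")
    } else {
      None
    }
  }
}
```

In this sketch, a concurrent removal inside a read partition is allowed as long as the removed file is in neither `readFiles` nor `removedFiles`, which mirrors the note attached to rule 1.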
GitOrigin-RevId: f02a8f48838f86d256a86cd40241cdbfa74addb4 Lead-authored-by: Tathagata Das Co-authored-by: Tomas Bartalos commit 48a74fa83858d164d1f789947981445a06cb2668 Author: Burak Yavuz Date: Thu Dec 5 21:37:13 2019 +0000 [SC-25475][DELTA] Add the `ReplaceTable` operation for Spark 3.0 Author: Burak Yavuz Author: Burak Yavuz GitOrigin-RevId: a477dde19a4dea5a724697acc0fa4f9201713cee commit 206c268f052324e8d9697180b93bd1f97049df4b Author: Shixiong Zhu Date: Thu Dec 5 21:44:04 2019 -0800 Convert AddFile to FileStatus directly to save FileSystem RPC calls (#10) This PR creates `FileStatus` directly from `AddFile` to save FileSystem RPC calls. As `AddFile` doesn't have the block locations, we lose the locality. But this is fine today since most storage systems are in the cloud and the computation runs separately. I also fix a bug where we return incorrect file paths to Hive when the partition values contain special characters that are escaped in the file path. commit 2a322facb5c57e7322302ab878eddb16ff21b5f1 Author: Shixiong Zhu Date: Wed Dec 4 23:27:00 2019 +0000 [SC-26576]Minor refactor for DescribeDeltaDetailCommand ## What changes were proposed in this pull request? Rename files to make them consistent with other code. ## How was this patch tested? Existing tests. Author: Shixiong Zhu #7197 is resolved by zsxwing/refactor-describe-detail. GitOrigin-RevId: fac330b592f61e43e1abfe7426050d91d85b29a5 commit 5b3e3ebd68be9be7825e6b2b7018a0731ebba7b0 Author: Rahul Mahadev Date: Wed Dec 4 04:42:00 2019 +0000 [SC-25748][DELTA] Scala/Python/SQL public APIs for Generate Delta Manifest ## What changes were proposed in this pull request? * Added Scala APIs and tests * Added Python APIs and tests * Added SQL APIs and tests in both Scala and Python closes delta-io/delta#262 ## How was this patch tested? Added tests in the 'DeltaGenerateManifest' suite Author: Rahul Mahadev #7053 is resolved by rahulsmahadev/generateManifestPub.
GitOrigin-RevId: 8e31833ca0bc3b9a97128105861a84f63d086aba commit 1d698ba2419caadc5ea9b4eeab99140822b42022 Author: Rahul Mahadev Date: Wed Dec 4 00:59:06 2019 +0000 [SC-26331][DELTA] Exporting InvariantEnforcementSuite, SchemaUtilsSuite and CaseSensitivitySuite to OSS ## What changes were proposed in this pull request? This change proposes to export `InvariantEnforcementSuite`, `SchemaUtilsSuite` and `CaseSensitivitySuite` to OSS Delta. ## How was this patch tested? Ran `InvariantEnforcementSuite`,`SchemaUtilsSuite` and `CaseSensitivitySuite` locally Author: Rahul Mahadev #7137 is resolved by rahulsmahadev/invariantEnforcementOSS. GitOrigin-RevId: d723fbf25e4406cb35bb513a78e32485b5b99c7e commit 17ea0e7c5732b28060922d70ed80ea3f1666e7a7 Author: Xiao Li Date: Wed Nov 27 13:24:07 2019 -0800 Add isPositiveDayTimeInterval to unify all internal config checks GitOrigin-RevId: dad5a32665fbf848f9c5c1844dedf8f842fd78a4 commit 5a2198b4ca3873004fbc68965cfe7d1993c2bb4f Author: Shixiong Zhu Date: Thu Nov 28 09:20:17 2019 -0800 Improve error messages for unsupported features in Hive (#9) This PR makes some minor improvements in the error messages for unsupported features. It also adds the table property `spark.sql.sources.provider` so that a Delta table created by Hive can be read by Spark 3.0.0+ when they share the same metastore. commit e932a923e017af9f068731eda5cb934148682a33 Author: windpiger Date: Wed Nov 27 14:42:54 2019 +0800 minor modify README.md (#8) commit 4aeac03de0b613622404e8be4b7098693e69a9d1 Author: Shixiong Zhu Date: Tue Nov 26 22:40:00 2019 -0800 Rewrite Java files using Scala (#7) This PR rewrites Java files using Scala to make future development easier since Delta itself is written using Scala. `IndexPredicateAnalyzer` is not rewritten because it's a fork from Hive's `IndexPredicateAnalyzer` and it's better to not change it so that we can compare them when `IndexPredicateAnalyzer` is changed in Hive. This PR doesn't change any logic in codes. 
commit 526cc526f79dc22f6e76f7a61a2ce4561e65da7a Author: Pranav Anand Date: Fri Nov 22 00:49:18 2019 +0000 [SC-25990] Minor refactoring of the Convert To Delta internal APIs and tests **Refactor Changes** - `ConvertToDeltaSuite` is now `ConvertToDeltaSuiteBase` and is an abstract class - `ConvertToDeltaScalaSuite` became `ConvertToDeltaSuite` - Simplified test class definitions GitOrigin-RevId: a9bf5f307b4b80a4622f309379553551d0bef90a commit afbe153c4c7f33bc36ebd816093c0b2ca55882eb Author: windpiger Date: Tue Nov 26 06:37:59 2019 +0800 Add DeltaStorageHandler for Hive (#6) Implement HiveOnDelta with StorageHandler **DDL:** ``` create external table deltaTbl(a string, b int) stored by 'io.delta.hive.DeltaStorageHandler' location '/test/delta' ``` - must be an external table - must not be a Hive partitioned table - if the Delta table is partitioned, the partition columns should come after the data columns when creating the Hive table - Hive's schema should match the underlying Delta schema, including column count and column names - the delta.table.path must exist **Read:** `set hive.input.format = io.delta.hive.HiveInputFormat` - supports reading a non-partitioned or a partitioned table - supports pushing down filters on Delta's partition columns; currently supported predicates: (=,!=,>,>=,<,<=,in,like) - auto-detects Delta partition changes **Unit Tests:** - Added(`build/sbt clean test`) - `build/sbt clean package` tested OK in a real Hive cluster using delta-core-shaded-assembly-0.4.0.jar and hive-delta_2.12-0.4.0.jar commit 02802b4f7657cce16e92f079b2e8c6b1d46325e6 Author: Tathagata Das Date: Wed Nov 20 14:53:36 2019 -0800 Updated README.md once again commit 794fc8d8aa3f1ea9c98ff7c7894ed8780ee781a3 Author: Tathagata Das Date: Wed Nov 20 14:52:43 2019 -0800 Updated README.md with more detailed instructions commit 9495d5ef21c9da6e29a1632f98d5f2a174501e4e Author: Burak Yavuz Date: Mon Nov 18 22:30:22 2019 +0000 [SC-25431][DELTA][TEST] Refactor Delta Schema Enforcement tests in preparation
for Spark 3.0 ## What changes were proposed in this pull request? Adds tests for schema enforcement, which we can run once we start implementing Delta as a V2 data source. ## How was this patch tested? Existing tests Author: Burak Yavuz Author: Burak Yavuz #6963 is resolved by brkyvz/schemaEnforcementTests. GitOrigin-RevId: 78fe293c662da4b7fdb1ecb16febc1f45408b912 commit 29c23d42434c3816ca9fdf7288b71658d2c5714e Author: Andrew Fogarty Date: Mon Nov 18 18:55:42 2019 +0000 [DELTA-OSS-EXTERNAL] OpType is case class This PR makes the fields of `OpType` (`typeName` and `description`) accessible so that custom logging implementations can log them. Closes delta-io/delta#253 Author: Andrew Fogarty GitOrigin-RevId: 0770e2a9ba127cc7fea0b341630051c2ef70a7fb commit 0cacfcc97d8f0345f04fb5094cc35ddaae19c3f8 Author: Rahul Mahadev Date: Fri Nov 15 01:34:13 2019 +0000 [SC-25746][DELTA] Changing the name of Feature Flag to align with naming convention ## What changes were proposed in this pull request? Changing the name of the feature flag key so that it aligns with the naming convention of other feature flags. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review https://spark.apache.org/contributing.html before opening a pull request. Author: Rahul Mahadev #7008 is resolved by rahulsmahadev/changeFlagName. GitOrigin-RevId: cc3448768418e24975f33989857293dac5fcb265 commit b18ffba990ad251dcb41b2f6a5684fb855b7c412 Author: Tathagata Das Date: Thu Nov 14 23:24:53 2019 +0000 [DELTA-OSS-EXTERNAL] Added the SymlinkTextInputFormat manifest generation for Presto/Athena support ## What changes were proposed in this pull request? This PR is the first in the sequence of PRs to add manifest file generation (SymlinkInputFormat) to OSS Delta for Presto/Athena read support (issue https://github.com/delta-io/delta/issues/76). 
Specifically, this PR adds the core functionality for manifest generation and rigorous tests to verify the contents of the manifest. Future PRs will add the public APIs for on-demand generation. - Added post-commit hooks to run tasks after a successful commit. - Added GenerateSymlinkManifest implementation of post-commit hook to generate the manifests. - Each manifest contains the name of data files to read for querying the whole table or partition - Non-partitioned table produces a single manifest file containing all the data files. - Partitioned table produces partitioned manifest files; same partition structured like the table, each partition directory containing one manifest file containing data files of that partition. This allows Presto/Athena partition-pruned queries to read only manifest files of the necessary partitions. - Each attempt to generate manifest will atomically (as much as possible) overwrite the manifest files in the directories (if they exist) and also delete manifest files of partitions that have been deleted from the table. Closes delta-io/delta#250 Co-authored-by: Tathagata Das Co-authored-by: Rahul Mahadev Author: Tathagata Das Author: Rahul Mahadev #6910 is resolved by tdas/SC-25511. GitOrigin-RevId: a3e04f2fcdafb6ac29c3adcfb791a3d0611583dc commit 48f5185808817a737f401f77c1bff0a298e5343d Author: Rahul Mahadev Date: Thu Nov 14 21:25:25 2019 +0000 [SC-25746][DELTA] Adding feature flag for Optimized Insert only merge Added a feature flag `DELTA_MERGE_INSERT_ONLY_ENABLED` which is enabled by default for Optimized insert only merge. Also did some refactor on tests in this PR Added a test in the `MergeIntoSuiteBase` to toggle the flag off and check if the behavior is like previous implementation. 
Author: Rahul Mahadev GitOrigin-RevId: fb8bd5d333ae6a8e250b50315143147ed74c72a8 commit 0b9966bc1bfc872a43f1788cc0c03266bb6a2fc2 Author: Rahul Mahadev Date: Wed Nov 13 06:20:35 2019 +0000 [SC-25227][DELTA] Added checks for metadata changing operations with dataChange ## What changes were proposed in this pull request? Added checks to catch operations that change metadata while having `dataChange` set to false. ## How was this patch tested? Added `DeltaOptionSuite` Author: Rahul Mahadev #6846 is resolved by rahulsmahadev/dataChangeFlags. GitOrigin-RevId: 15442af0d5a5d640aa2ee19ca2c903fdec8aa5d0 commit 23c7613aa2e2af94967b7dce4b0423a857ab66f2 Author: Rahul Mahadev Date: Tue Nov 12 20:53:20 2019 +0000 [SC-25233][DELTA] Optimized Insert only merge for OSS Using the merge statement with only a WHEN NOT MATCHED THEN INSERT * clause is becoming more and more common. This PR optimizes this use case by performing an anti-join on the source data to insert the data instead of performing a full-outer join. Added unit tests in the `MergeIntoSuiteBase` closes delta-io/delta#247 Author: Rahul Mahadev GitOrigin-RevId: d0ae1264f1189617c897fbb63eec15be6d5a00f9 commit 658b8493ad947cfe1b920d23c9fd0915170b7835 Author: Burak Yavuz Date: Mon Nov 11 21:14:44 2019 +0000 [SC-25430][DELTA][FIXIT] Avoid unnecessary filesystem checks in Delta Snapshot creation We do a lot of unnecessary file listing when creating/updating a Delta Snapshot. We have checks around: 1. Is there a _delta_log directory under each parent directory (recursive) 2. File listing to infer file format partitioning (for both JSON and parquet) 3. File existence checks for each file These all add latency to the state update of a Delta table, and especially (2) has a linear cost as the table grows larger, which is unacceptable. This PR introduces a DeltaLogFileIndex to use as part of a `LogicalRelation` and `HadoopFSRelation` to skip (1) and (3) and avoid (2), which are all done by DataSource.
Existing tests + Improved numbers under DeltaFileOperationsSuite, and on average 1 second saved on the loading of `prod_ds.usage_logs`. Author: Burak Yavuz Author: Burak Yavuz GitOrigin-RevId: 2e25a5b4d4fd821af00dd28942544ba539751d2a commit 261e1ad8e674dfac556d500000f07f3ddb99d2b3 Author: Nicolas Paris Date: Thu Nov 7 21:18:48 2019 +0000 [DELTA-OSS-EXTERNAL] Merge insertAll and updateAll fail in case of dotted columns This resolves #233 and #208. This PR add a test to track the issue and also proposes a fix by quoting with backticks the columns coming from the target plan. Closes delta-io/delta#235 Author: Tathagata Das Author: Tathagata Das Author: Nicolas Paris GitOrigin-RevId: e893389e671cbee2cce21eeda33afd3b6fa93c84 commit 32af18c615afe0d4bcac27183790ca8cfed94d26 Author: Mukul Murthy <38224594+mukulmurthy@users.noreply.github.com> Date: Thu Nov 7 00:56:37 2019 +0000 [DELTA-OSS-EXTERNAL] Update Spark Summit tutorial description and add video link Now that the videos from Spark Summit have been published, we can link to the video of the tutorial. Also updated the tutorial's description. Closes delta-io/delta#234 Author: Tathagata Das Author: Mukul Murthy <38224594+mukulmurthy@users.noreply.github.com> #6887 is resolved by tdas/lvr4tjzx. GitOrigin-RevId: be593b21236f8b9fb0a81e81e33454b0df79135e commit d309dbf1feedab958f9eaa83afa758cea16743c1 Author: Burak Yavuz Date: Mon Nov 4 22:44:15 2019 +0000 [SC-25431][DELTA][TEST] Remove invalid test Remove an accidental test that tests table support. Until Spark 3.0 comes out, Delta cannot support Hive metastore tables. This test was invalid but passing because it tested a trivial behavior. 
Author: Burak Yavuz GitOrigin-RevId: e903bdbb143eb11cc940fafc8205ef29e86df174 commit a2f62149af3b8e9c4b1a6d3016c88bf5f5e4e59e Author: Shixiong Zhu Date: Tue Nov 12 17:16:48 2019 -0800 Add files for the open source project (#4) - Updated README - Added LICENSE, CONTRIBUTING, NOTICE commit 5d9a4c93617a9fa4170249a9dbd4c4dbae268ced Author: windpiger Date: Wed Nov 6 10:48:48 2019 -0800 hive connector initial commit commit 958251f9fa578c9bc0307fc1f2e1769f8a55cf13 Author: Tathagata Das Date: Tue Nov 5 19:35:40 2019 -0800 Added circleci commit 16b6d4b694e8cca269b48643a95b17309f899f2f Author: Tathagata Das Date: Tue Nov 5 19:10:21 2019 -0800 Added log4j.properties to make tests less verbose commit ce8ad3ef96032d7347caef104773ae405853b3f4 Author: Tathagata Das Date: Tue Nov 5 19:02:35 2019 -0800 Improved sbt commit 638ce0b8daa1ae58198f2921f482853ca08a7e65 Author: Tathagata Das Date: Tue Nov 5 18:06:40 2019 -0800 Added core files commit 6dde85d87581cac4344ad702055b872946fc56b1 Author: Tathagata Das Date: Tue Nov 5 18:05:18 2019 -0800 Initial commit commit 009c9492d0c729106b1bd7e918beec7c6fa87e23 Author: JassAbidi <35536039+JassAbidi@users.noreply.github.com> Date: Wed Oct 30 21:05:52 2019 +0000 [DELTA-OSS-EXTERNAL] [ISSUE#146] add support for setting dataChange = false This adds support for setting the dataChange parameter when writing to a Delta table. The dataChange parameter is implemented as a DataFrame writer option. Adds a test illustrating the scenario of rearranging the data without changing it. - Closes issue #146 Closes delta-io/delta#223 Author: Rahul Mahadev Author: JassAbidi <35536039+JassAbidi@users.noreply.github.com> #6806 is resolved by mukulmurthy/nv7wo55e.
GitOrigin-RevId: c65006f57b10788efafccf8b0d6dce19eaf5475d commit 4d8aa5727f11501b3c15da9ecd650e82cb3aa48f Author: Mukul Murthy <38224594+mukulmurthy@users.noreply.github.com> Date: Tue Oct 29 01:11:43 2019 +0000 [DELTA-OSS-EXTERNAL] Remove unnecessary adding packagedArtifacts to publishM2 @tdas fixed packaging a month ago; so we shouldn't need to add spPublishLocal packagedArtifacts to publishM2 anymore. This also fixes a warning when running build/sbt. Closes delta-io/delta#227 Author: Mukul Murthy <38224594+mukulmurthy@users.noreply.github.com> #6792 is resolved by mukulmurthy/5kvx5nsi. GitOrigin-RevId: 3cfdef4298a245e916cb121009bff634902c0492 commit 3a95ae8b25753d16735165ee867ec50332883821 Author: Andreas Neumann Date: Thu Oct 24 19:53:33 2019 +0000 [DELTA-OSS-EXTERNAL] [ISSUE#221] Minor fixes for tutorial simple fixes for correctness. Closes delta-io/delta#222 Author: Andreas Neumann #6760 is resolved by mukulmurthy/c1081idg. GitOrigin-RevId: 7c31f1d1d2d0de3f944ba819fdf4b66c771e2ea8 commit 8c4fe42c58b948acb93c0d9418c3cf1df43da431 Author: Rahul Mahadev Date: Tue Oct 22 19:11:51 2019 -0700 [SC-24892] Add typesafe bintray repo for sbt-mima-plugin ## What changes were proposed in this pull request? Adding the `typesafe` bintray repo for `sbt-mima-plugin` since the old `sbt-mima-plugin` was removed. ## How was this patch tested? Run the usual tests Closes #6744 from rahulsmahadev/mimaFix. Lead-authored-by: Rahul Mahadev Co-authored-by: Shixiong Zhu Signed-off-by: Rahul Mahadev GitOrigin-RevId: 9cbdca25128b010f488b880a9cf7f39db89f3b2c commit 6c7142235fc88180d9471979445abb20ebc3e0d7 Author: Denny Lee Date: Wed Oct 16 07:44:01 2019 +0000 [DELTA-OSS-EXTERNAL] Update Tutorial to attach notebooks Closes delta-io/delta#216 Author: Denny Lee #6697 is resolved by tdas/949oueiv. 
GitOrigin-RevId: 5f0406e6e968d6a655aa11c696c3a820e55447f6 commit 3c9f685b8ead9e4ff9a3bccce75e77cf1b1b6927 Author: Denny Lee Date: Tue Oct 15 23:59:25 2019 +0200 [DELTA-OSS-EXTERNAL] Delta Lake Tutorial Instructions for SAIS EU 2019 Delta Lake Tutorial Instructions for SAIS EU 2019 cc tdas mukulmurthy Closes delta-io/delta#215 Closes #6694 from mukulmurthy/6ec796xz. Authored-by: Denny Lee Signed-off-by: Mukul Murthy GitOrigin-RevId: 89cfe0d049cf7a89e09c0425fc4d60f959367608 commit d0ad9a426bfba68b16b72e1087ef7e066ae1e081 Author: Pranav Anand Date: Mon Oct 14 23:22:14 2019 +0000 [SC-23897] Minor refactoring of DeltaTable - Pass options to findDeltaTableRoot to pass options to hadoop conf Author: Pranav Anand GitOrigin-RevId: d4ebeef9a073eb482b9f3d7066b7b536464848fe commit c9499263ca8da2f0caca7d495db69d5cf4d2a2f5 Author: Pranav Date: Tue Oct 8 20:32:23 2019 +0000 [DELTA-OSS-EXTERNAL] Added scala quick start example - Added a scala quick start example to simulate the quick start guide. - Created sbt project to easily run the quick start guide. Closes delta-io/delta#201 Author: Pranav Anand Author: Pranav GitOrigin-RevId: 9e24c9c006aaf2deca58f141be7ae47c68efbc95 commit e39b93fedc215d6e34b461823120cceba9327131 Author: Reynold Xin Date: Mon Oct 7 18:36:18 2019 +0000 [DELTA-OSS-EXTERNAL] Update README to point to PROTOCOL.md Closes delta-io/delta#202 Author: Reynold Xin #6631 is resolved by mukulmurthy/q5jhrbaa. GitOrigin-RevId: f397a04dc034d829eaf0d8430804687a056cf123 commit 9753fd1a904750fa7c16be1cae0b59b8af0f8473 Author: Matthew Powers Date: Mon Oct 7 18:34:45 2019 +0000 [DELTA-OSS-EXTERNAL] Update Maven setup instructions and add SBT setup command Closes delta-io/delta#203 Author: Matthew Powers #6632 is resolved by mukulmurthy/d546rq0o. 
GitOrigin-RevId: 6a54f1cf972d54ca1a99505bdb81b4fd4bb803a6 commit d20ea2aefe04507b085012d45f275c3670486d21 Author: Rahul Shivu Mahadev Date: Fri Oct 4 12:56:59 2019 -0700 [DELTA-OSS-EXTERNAL] Added examples directory for quick start example files * Added a python quick start example to simulate the quick start guide. * this can act as integration tests during release Closes delta-io/delta#197 Closes #6612 from rahulsmahadev/c2m6zbfl. Lead-authored-by: Rahul Mahadev Co-authored-by: Rahul Shivu Mahadev <51690557+rahulsmahadev@users.noreply.github.com> Signed-off-by: Rahul Mahadev GitOrigin-RevId: b564318a382d13a355dd5616aef123b1b32b18fe commit 1208e5aac78c376d775cb530a8e14c308a65cef4 Author: Matthew Powers Date: Fri Oct 4 17:47:33 2019 +0000 [DELTA-OSS-EXTERNAL] Add API doc links to the README Adds the Scala, Java, and Python API documentation links to the README so they're easily accessible. Closes delta-io/delta#199 Author: Matthew Powers #6619 is resolved by mukulmurthy/5ixjkmvw. GitOrigin-RevId: 1b348d939f2b5661e37e829ee3f71d9ed656c2bf commit def26ff9abbb55d80c5ce8966d704ef55a9508e0 Author: Matthew Powers Date: Fri Oct 4 00:40:21 2019 +0000 [DELTA-OSS-EXTERNAL] Remove unused imports Closes delta-io/delta#200 Author: Matthew Powers #6611 is resolved by zsxwing/wanqxejs. 
GitOrigin-RevId: cbc280d25397e3f684092e8481d448cd14078750 commit b624209cb802e600a9419b2a75e13c54c24eec7d Author: Pranav Anand Date: Wed Oct 2 21:02:33 2019 +0000 [SC-23293] Convert to Delta SQL support - Adds the ability to CONVERT TO DELTA to the SQL parser (including providing partition columns) - SQL test added to python - Changed existing scala OSS tests to also test SQL for certain tests Author: Pranav Anand GitOrigin-RevId: f80adf083ce8f3369be2c9a606768cf92538325e commit 8e7c30482cfda5dbd4c68c60b056e78edb3ebe0a Author: Mukul Murthy <38224594+mukulmurthy@users.noreply.github.com> Date: Wed Oct 2 09:53:52 2019 -0700 Update Delta README Slack register link GitOrigin-RevId: d99c1c6f829d68f32c8b1e679fd9c16b62210e69 commit 4654798e65f9dfa2c3c945f501617b25fbb45864 Author: Andreas Neumann Date: Tue Oct 1 21:02:52 2019 +0000 [SC-23606] Correctly handle arrays and maps in SchemaUtils.isReadCompatible() ## What changes were proposed in this pull request? DeltaSource validates that new files have a compatible schema using SchemaUtils.isReadCompatible(). This method did not handle array types and map types correctly: - For array types, it did not check nullability of the elements. - For map types, it required exact equality (not tolerating variation of case nullability) This change fixes isReadCompatible() to consistently recurse into subtypes and apply the same constraints everywhere. It also restructures the tests for this method for full coverage of all cases. ## How was this patch tested? - using unit tests Author: Andreas Neumann #6538 is resolved by anew/sc23606. GitOrigin-RevId: c227b97a6f5b3bde99dd77657e2a14a35b9c1a91 commit 5b166e5fc7a732b5788ce4bbe995276e1c9e0e48 Author: Shixiong Zhu Date: Tue Oct 1 06:01:30 2019 +0000 [SC-22738]Describe detail SQL support for OSS Delta This PR adds `describe detail` SQL support for OSS Delta. The user can use this command to show the metadata of a Delta table. 
`DescribeDeltaDetailsCommand.scala` is split to 3 files so that we can have different implementations of `getPathAndTableMetadata`. The codes for the edge `describe detail` should be the same. The OSS Delta needs to resolve `delta./blah/blah` because we don't have DatabricksSessionCatalog in OSS. Jenkins Author: Shixiong Zhu GitOrigin-RevId: 65707740e474a600ba31f818b0d12dc1967f8558 commit a8ee9b89ee78590478f50461ab1bcbca2a124eab Author: Ubuntu Date: Mon Sep 30 17:35:47 2019 +0000 Setting version to 0.4.1-SNAPSHOT commit a746c3af92ea56b8ceac3ed20bb7769c67c11b6a Author: Ubuntu Date: Mon Sep 30 17:34:25 2019 +0000 Setting version to 0.4.0 commit fbdf9ea630694f5fbf61e0ddcceb71a28a08f7cf Author: Tathagata Das Date: Mon Sep 30 15:52:21 2019 +0000 [DELTA-OSS-EXTERNAL] Allow Python vacuum() `vacuum(0)` was throwing the following error because the Java API has Double and Py4j was searching for integer. The fix is to convert the python parameter to float before calling java. ``` Traceback (most recent call last): File "", line 1, in File "/private/var/folders/0q/c_zjyddd4hn5j9jkv0jsjvl00000gp/T/spark-ff856d5c-6a62-45bd-b9a4-0a0fea0acd09/userFiles-de551310-4bed-4267-be1c-b3c36556d6ff/io.delta_delta-core_2.11-0.4.0.jar/delta/tables.py", line 210, in vacuum File "/usr/local/Cellar/spark/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__ File "/usr/local/Cellar/spark/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/usr/local/Cellar/spark/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 332, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o49.vacuum. 
Trace: py4j.Py4JException: Method vacuum([class java.lang.Integer]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:274) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) ``` Closes delta-io/delta#198 Author: Tathagata Das #6557 is resolved by tdas/j0ag9ic6. GitOrigin-RevId: 7a171b8c110d40d23b18bcfffc2c5b8e59d6c0f4 commit d28c72d5e240d09fc046ef4c5f4bda547d2fd002 Author: Tathagata Das Date: Fri Sep 27 16:44:06 2019 -0700 Fixed packaging commit 0041eb3fc27008ff9f40e7023c3a2cdab513b138 Author: Tathagata Das Date: Fri Sep 27 16:41:14 2019 -0700 Revert "Setting version to 0.4.0" This reverts commit 982c445e0f937e3aa3354f8f8868b0b1282acf9a. commit 284c40fd0239b52c94c37d368a211da6dc33a999 Author: Tathagata Das Date: Fri Sep 27 16:41:07 2019 -0700 Revert "Setting version to 0.4.1-SNAPSHOT" This reverts commit 5e07029483c0cd8ad7d02065040a6f1d358e3849. commit 5e07029483c0cd8ad7d02065040a6f1d358e3849 Author: Ubuntu Date: Thu Sep 26 21:37:06 2019 +0000 Setting version to 0.4.1-SNAPSHOT commit 982c445e0f937e3aa3354f8f8868b0b1282acf9a Author: Ubuntu Date: Thu Sep 26 21:36:44 2019 +0000 Setting version to 0.4.0 commit 9802594eea729ea62d64deff4c7103ab1e1dfddd Author: Shixiong Zhu Date: Thu Sep 26 13:22:42 2019 -0700 [DELTA-OSS-EXTERNAL] Clean up temp files in DeltaSparkSessionExtensionSuite Clean up temp files because the directory created by `Files.createTempDirectory` will not be deleted automatically. Closes delta-io/delta#196 Closes #6541 from zsxwing/gw01nzhw. 
Authored-by: Shixiong Zhu Signed-off-by: Shixiong Zhu GitOrigin-RevId: 6fe8c52088fa7a20482e545bb83b9ad4b27e7045 commit db81af1f897abcc2e34312b0b9e217e618cc04f7 Author: Shixiong Zhu Date: Thu Sep 26 07:03:13 2019 +0000 [DELTA-OSS-EXTERNAL] Add examples and tests to DeltaSparkSessionExtension ## What changes were proposed in this pull request? Update DeltaSparkSessionExtension's doc to add examples. Also add tests to make sure the examples work. ## How was this patch tested? The new unit tests. Author: Shixiong Zhu #6533 is resolved by zsxwing/delta-sql-doc. GitOrigin-RevId: 0d342157d533b1d944e8eb8d60f3810a71ecbfba commit a5b8fe966398c6351512daa99f654b9e22aef667 Author: Rahul Mahadev Date: Thu Sep 26 03:11:31 2019 +0000 [SC-23612] Adding python api and tests for isDeltaTable ## What changes were proposed in this pull request? Added python APIs for isDeltaTable ## How was this patch tested? ran `test_deltatable` Author: Rahul Mahadev #6527 is resolved by rahulsmahadev/isDeltaTable. GitOrigin-RevId: 1a8a21490ad31ed6a8177907cf55821aa99b223a commit 28a2de664c47fbdc06deb3fcf9a0b41248900d0c Author: Rahul Mahadev Date: Wed Sep 25 21:19:32 2019 +0000 [SC-23573] Changed package name for DeltaConvert ## What changes were proposed in this pull request? Package name was incorrect, changing it to hide Scala/Java doc generation ## How was this patch tested? Ran `ConvertToDeltaSuite` Author: Rahul Mahadev #6521 is resolved by rahulsmahadev/hideClass. GitOrigin-RevId: a5e1763e75848cd3ccf4a544f86f7ebb6130085e commit d00953aa7f19997724a9a0b0d68dad054f17f8dd Author: Andreas Neumann Date: Wed Sep 25 20:27:40 2019 +0000 [SC-22791] Issue a specific message about how the schemas mismatch ## What changes were proposed in this pull request? If a table is created over an existing location that already has a transaction log, then that log determines the schema of the table. If the new creation DDL specifies a different schema, then table creation fails. 
In the case where the schema only differs in the metadata of a field, we currently print a message that does not show the metadata. It is highly confusing to the user because the error states that the schemas are different, and then prints two seemingly identical schemas. This fixes that error message by printing a detailed statement of how the two schemas differ: - Introduces a new method SchemaUtils.detectDifference that returns the first difference found - Uses that method instead of equality to test for schema compatibility - Adds various tests for this method Also fixes an issue with the existing method isReadCompatible, which previously ignored the nullability of structs in array types. Added a test case to cover that scenario. For now I added println() statements to the tests so that one can see what the error messages look like in the various cases. I plan to remove these print statements in a follow-up commit once the code is reviewed. ## How was this patch tested? - unit tests - manual tests to see the error messages Please review https://spark.apache.org/contributing.html before opening a pull request. Author: Andreas Neumann #6470 is resolved by anew/sc22791. GitOrigin-RevId: eba98ffe394a1c0af728d4cd924523a411b11f3e commit adc020599760382a97b53366ebf28bbf6047e1a7 Author: Wesley Hoffman Date: Wed Sep 25 20:22:13 2019 +0000 [DELTA-OSS-EXTERNAL] Add public API for identifying delta tables API to see if a Delta table exists at a path or cataloged table. Closes delta-io/delta#149 Author: Mukul Murthy Author: Wesley Hoffman #6519 is resolved by mukulmurthy/aab1sdti. GitOrigin-RevId: 91069bebfc50ab2f19a17bf1e8e3eae26ef123c8 commit 42bf98d2b5b948975f7bea85cfb27b5471ce12a3 Author: Shixiong Zhu Date: Wed Sep 25 09:56:00 2019 -0700 [SC-23557] Improve error message for OSS Delta VACUUM and DESC HISTORY commands ## What changes were proposed in this pull request? Improve error message for OSS Delta VACUUM and DESC HISTORY commands and add unit tests.
## How was this patch tested? The new tests. Closes #6506 from zsxwing/fix-error. Authored-by: Shixiong Zhu Signed-off-by: Shixiong Zhu GitOrigin-RevId: 82cb3cc52abc9b1452628b40a0213d6469a91997 commit acc1a5dbee3e3ef945763621917919d0401c8252 Author: Tathagata Das Date: Wed Sep 25 01:50:11 2019 +0000 [DELTA-OSS-EXTERNAL] Improved Python docs - Added examples for each method - Added param and return docs for more complicated methods - Added version information for all classes and methods Closes delta-io/delta#190 Author: Tathagata Das #6513 is resolved by tdas/3llcz8yx. GitOrigin-RevId: 957096fbe467322d164506e8c0d15f31f8b83065 commit 167065ef8aaa06cae5321a3084c484c0678a5003 Author: Pranav Anand Date: Wed Sep 25 00:52:00 2019 +0000 [SC-23565] Publish should use spPublishLocal ## What changes were proposed in this pull request? Change build.sbt to also add artifacts from spPublishLocal when calling publish ## How was this patch tested? Manually published and downloaded release to verify it works Author: Pranav Anand #6512 is resolved by pranavanand/pa-sppublish-build-sbt. GitOrigin-RevId: 7f5ac68463f24cd9a7c115f823bc19ee55a2f23d commit 7bf2e8f5c9f3cf13db73e3c9a339710355437717 Author: Rahul Mahadev Date: Tue Sep 24 23:15:33 2019 +0000 [SC-23497][Delta] Added new API for convert to delta and Scala/Python tests ## What changes were proposed in this pull request? * Added new API to convert to delta to allow paritionSchema to be a string * Moved idempotence tests from Edge to Base ## How was this patch tested? Added Scala and Python tests to verify and validate the added API Author: Rahul Mahadev #6493 is resolved by rahulsmahadev/cvtDltSim. 
GitOrigin-RevId: 35b71e996418e6cf54fb87ec8783d9de678c42fb commit 8d1abe6af1f8df2e8804b2765492c328d54317fe Author: Rahul Mahadev Date: Tue Sep 24 00:11:55 2019 +0000 [DELTA-OSS-EXTERNAL] Added python doc generation using Sphinx * Added Python docs generation using Sphinx * modified `generate_api_docs` to generate Python docs Closes delta-io/delta#185 Author: Rahul Mahadev Author: Rahul Shivu Mahadev <51690557+rahulsmahadev@users.noreply.github.com> #6492 is resolved by rahulsmahadev/6ma482ij. GitOrigin-RevId: 6af6528b7cc891451756257132c446a8a2c5239e commit 888f78d850e84e3a4198889a18698acca82434ec Author: Tathagata Das Date: Mon Sep 23 22:48:51 2019 +0000 [SC-21520][Delta] Added Python APIs for merge, update, delete ## What changes were proposed in this pull request? Added update, delete, merge to the Python DeltaTable class. ## How was this patch tested? New unit tests Closes delta-io/delta#182 Author: Tathagata Das #6448 is resolved by tdas/SC-21520. GitOrigin-RevId: cf26b4c87333b6979fb3742ce901b5e3ed963e62 commit f3a1d21c7246477299036d7badd1dbfdc3063e65 Author: Wenchen Fan Date: Mon Sep 23 00:10:08 2019 -0700 Update `Delete` comment GitOrigin-RevId: f16d2d5da481cf068b7970a331964c4951034183 commit 43ec7481e9f935dd70d494332e59ed588c3ed7c0 Author: Pranav Anand Date: Fri Sep 20 23:27:13 2019 +0000 [SC-22415] Modify the catalog metadata to reflect that source of truth is delta log Author: Pranav Anand GitOrigin-RevId: 906d7b00ad5ceea8ea882c0614e9b2082c9971ed commit fce8eff2956c16f570502dc5104eb5de94cb0fb9 Author: Shixiong Zhu Date: Fri Sep 20 22:56:52 2019 +0000 [DELTA-OSS-EXTERNAL] Enable doclint on Circle CI Looks like `doclint` is disabled on Circle CI. See https://github.com/Debian/openjdk-8/blob/master/debian/patches/disable-doclint-by-default.patch This PR enables it manually to test our javadoc. In addition, it also includes the private package `io.delta.sql.parser` to fix javadoc build. 
Closes delta-io/delta#184 Author: Shixiong Zhu #6471 is resolved by zsxwing/2t677slp. GitOrigin-RevId: 3c5399392d6c28406fd8f4df74904464c90c7893 commit d95777bd0445a78b04e7acfd54b7addd9e56e905 Author: Shixiong Zhu Date: Fri Sep 20 21:59:27 2019 +0000 [DELTA-OSS-EXTERNAL] Describe history SQL support Add a `describe history` SQL command to query a table's history. Here are some examples of this command. ``` DESCRIBE HISTORY '/foo/bar' DESCRIBE HISTORY delta.`/foo/bar` limit 3 ``` Resolves https://github.com/delta-io/delta/issues/168 Closes delta-io/delta#181 Author: Shixiong Zhu #6459 is resolved by zsxwing/k7rx6y8s. GitOrigin-RevId: 7f23263a9f6e35de08356d0a0add957d5c65a776 commit 1af993df2ed9e6604a1f4a4f0d116476fad1844c Author: Rahul Mahadev Date: Fri Sep 20 21:58:22 2019 +0000 [DELTA-OSS-EXTERNAL] Changed tests to use the spark-submit in package mode instead of jars - changed tests to use the spark-submit in package mode instead of jars. - quality of life change to remove INFO messages from log Closes delta-io/delta#183 Author: Rahul Mahadev Author: Rahul Shivu Mahadev <51690557+rahulsmahadev@users.noreply.github.com> GitOrigin-RevId: 0d83933021a0709400b99f9d99a5e9f813404bbf commit b1524d7ba2da52410ba875d4d1a72b2543467f00 Author: Rahul Mahadev Date: Thu Sep 19 14:14:23 2019 -0700 [DELTA] Reenabling Python tests for OSS/PR builds ## What changes were proposed in this pull request? Attempting to fix the issue where sbt was failing to be fetched in the dockerized PR builder, will now use the Google mirror instead of the usual sbt source. ## How was this patch tested? ran locally and on jenkins Closes #6454 from rahulsmahadev/ossBuilderFix. 
Authored-by: Rahul Mahadev Signed-off-by: Rahul Mahadev GitOrigin-RevId: 7168d01d0765e12b82c1065db65da3253dd673a8 commit 91349ad310985334f34d0718da2769526a5b0f89 Author: Rahul Mahadev Date: Wed Sep 18 19:10:01 2019 -0700 [DELTA] Disable Delta Python tests until Docker issue is fixed Disable Python tests for OSS PR builder as they were failing due to a SBT issue. Closes #6451 from rahulsmahadev/releaseTestint. Authored-by: Rahul Mahadev Signed-off-by: Rahul Mahadev GitOrigin-RevId: f638c6250f7688de9cf82ad15a6e5d1d8742b13e commit 99ad0d130f2833f2c8042a5ec90615c4e148009c Author: Rahul Mahadev Date: Wed Sep 18 12:59:33 2019 -0700 [SC-20936] Python API for Delta Utility Commands(Vacuum, History, ConvertToDelta) * Added Python APIs for Vacuum, ConvertToDelta and Describe History * Added unit tests for the above * Added fix to not swallow Python tests Exceptions closes delta.io/delta #173 * Unit tests( `test_deltatable.py`) Closes #6423 from rahulsmahadev/pyUtilCommands. Lead-authored-by: Rahul Mahadev Co-authored-by: Tathagata Das Co-authored-by: Zhitong Yan Signed-off-by: Rahul Mahadev GitOrigin-RevId: 6576d4457f19583cc62a88350a9d83c990703109 commit b9cf201fee07d5748e0a8b53883c6a5902cf836f Author: Shixiong Zhu Date: Wed Sep 18 10:24:13 2019 -0700 [SC-22736] Vacuum SQL support for OSS Delta This PR adds SQL support for Delta. The user can set `spark.sql.extensions` to `io.delta.sql.DeltaSparkSessionExtension` to enable Delta's SQL commands (`vacuum` command is added in this PR as well to show how this works). `io.delta.sql.DeltaSparkSessionExtension` is the only public API added by this PR. This is a Spark extension to make Spark SQL understand Delta's SQL commands. It will inject a new parser defined by `DeltaSqlBase.g4`, which will parse SQL text before Spark SQL. If it finds any of Delta's commands, it will forward the calls to the corresponding Delta commands. Otherwise, we just delegate the calls to Spark SQL. In addition, this PR adds `vacuum` SQL support.
There are two ways to use this SQL command: ``` vacuum '/foo/bar'; vacuum delta.`/foo/bar`; ``` This currently does not support tables because we cannot add a Delta table to Hive right now. Lastly, there is a known issue: `spark.sql.extensions` is not picked up in PySpark. See https://issues.apache.org/jira/browse/SPARK-25003. The user needs to run the following lines in PySpark to enable it. ``` >>> sc._jvm.io.delta.sql.DeltaSparkSessionExtension().apply(spark._jsparkSession.extensions()) >>> spark = SparkSession(sc, spark._jsparkSession.cloneSession()) ``` Closes #6333 from zsxwing/delta-sql. Authored-by: Shixiong Zhu Signed-off-by: Shixiong Zhu GitOrigin-RevId: 5cac0f4cd41fb5d00a4e71a4e2a82828e3000439 commit 10eb6776f4b7dba0152d8e419265e640cb0f0248 Author: Rahul Mahadev Date: Tue Sep 17 09:21:21 2019 +0000 [SC-19633][DELTA] Convert to delta Scala APIs for DBR and OSS Added Scala APIs for ConvertToDelta. New tests in ConvertToDeltaSuite. Closes #5975 from rahulsmahadev/convert_to_delta. Authored-by: Rahul Mahadev Signed-off-by: Rahul Mahadev GitOrigin-RevId: 07f3aa8d01649cd445765088dd93ba2b98f85b41 commit 7e76205a93c28637072e17c1365af5a5009f999e Author: rahulsmahadev Date: Fri Sep 13 11:01:29 2019 -0700 [SC-19220] Python API and tests for DeltaTable PR to make Python API tests to work on the DBR and OSS PR builders. * Added Python APIs for DeltaTable * Tests for DeltaTable Python APIs * Dockerized the tests for OSS - Checks for Jenkins/Circle CI and runs on docker else it will fall back to run without it. Set DOCKER_MODE_DELTA env variable to run on docker locally. A lot of the work has been originally done by ZhitongDB so massive shout out to him. Closes delta-io/delta#169 Closes #6240 from rahulsmahadev/pyInfra.
Lead-authored-by: Rahul Mahadev Co-authored-by: Tathagata Das Co-authored-by: Zhitong Yan Signed-off-by: Shixiong Zhu GitOrigin-RevId: 7a7619129d343a1e6fef05b2bc1712ec1721263a commit 43f40d7c6fef3b75b8e97026eb22589b127972ea Author: Michael Armbrust Date: Fri Sep 13 10:41:30 2019 -0700 [DELTA-OSS-EXTERNAL] Specification for the Delta Transaction Protocol This PR adds a specification for the Delta Transaction Protocol. The goal of this specification is to allow other implementors to build integrations that read and write from a Delta table. This is an early draft that should contain all the information needed to correctly read from a delta table. There are TODOs later in the document, that will be fleshed out in a future PR, on the additional requirements for modifying a table. Closes delta-io/delta#153 Closes #6388 from zsxwing/filw1ck1. Authored-by: Michael Armbrust Signed-off-by: Shixiong Zhu GitOrigin-RevId: c2969b0f07854a665ac2f3aa5a8aa5d0b624eafd commit 587289b6099a6c2b411db51df4629f13f6e8010e Author: Xiao Li Date: Thu Sep 12 22:20:15 2019 -0700 Merge remote-tracking branch 'databricks/pr/6287' into HEAD GitOrigin-RevId: e92d8d3f4566beae2b748bd510fd4e5955060368 commit 4d6d63fbd6fac19637a4676baa60a0b6d17c1423 Author: Mukul Murthy Date: Thu Sep 12 19:53:32 2019 +0000 [DELTA-OSS-EXTERNAL] Fix flaky test in DeltaRetentionSuite `DeltaRetentionSuite.log files being already deleted shouldn't fail log deletion job` is flaky because it fails when started in the last ~200 seconds of a UTC day. The test logic is as follows: * Make 25 commits 10 seconds apart, starting from the current time. * Delete commits 5-15. * Advance the clock LOG_RETENTION (30) days + 1 day and run a cleanup. * Version 20 should be the latest checkpoint, and logs 1-19 should have been deleted. When the test starts in the last 190 seconds of the day, version 19 (or older) falls on the other side of a date boundary, and so does not get cleaned up. 
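The date-boundary arithmetic described above can be sketched in a few lines. This is only an illustration of the timing, not the test's actual code: the names `COMMITS`, `SPACING` and `versions_after_midnight` are hypothetical, and the sketch assumes versions are numbered from 0 and that expiry is decided per UTC calendar day, as the description implies.

```python
from datetime import datetime, timedelta, timezone

COMMITS = 25                      # versions 0..24 (assumed numbering)
SPACING = timedelta(seconds=10)   # commits are 10 seconds apart


def versions_after_midnight(start):
    """Return the versions whose commit time lands on a later UTC date than version 0."""
    first_day = start.date()
    return [v for v in range(COMMITS) if (start + v * SPACING).date() != first_day]


# Started at noon: every commit shares one UTC date, so nothing straddles a boundary.
noon = datetime(2019, 9, 12, 12, 0, 0, tzinfo=timezone.utc)
print(versions_after_midnight(noon))

# Started 180 seconds before midnight: the later versions, including 19,
# land on the next UTC date.
late = datetime(2019, 9, 12, 23, 57, 0, tzinfo=timezone.utc)
print(versions_after_midnight(late))
```

With a start time inside the last 190 seconds of the day, version 19 is among the commits stamped with the following date, which is the situation the description identifies as the cause of the flakiness.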
This change fixes this test by advancing the clock an extra day so that no matter when in the day the test starts, all log files are expired and the unnecessary ones can get cleaned up. Fixes https://github.com/delta-io/delta/issues/163. Closes delta-io/delta#164 Author: Mukul Murthy <38224594+mukulmurthy@users.noreply.github.com> GitOrigin-RevId: 8a4ed7d4076edcc77370c958bdb042b798472ce9 commit 5026b73dc7073f3de0215252f9a549671db915d6 Author: Pranav Anand Date: Fri Sep 6 20:12:10 2019 +0000 [SC-20615] Refactor tests to make it easier to extend them Split the utility methods into a separate trait so that they can be reused in other tests. The existing tests should still work, no behavior should have been changed Author: Pranav Anand GitOrigin-RevId: e430beaf69ebfd870fd955a5b8bc1c4cec4c6bb9 commit e025ac1de374ce652849a8498fc36476844c96e2 Author: Pranav Anand Date: Fri Sep 6 10:58:14 2019 -0700 [SC-20730] Log invalid DeltaOptions - Adds a verification method to `DeltaOptions` which checks if given options are valid or not in which case it usage logs them - Adds tests which check different ways users may be able to pass in incorrect DeltaOptions and asserts whether they are logged correctly in `DeltaLogSuiteEdge` - Test case has both a unit test as well as an "end to end" test where writing to and from delta are tested Closes #5949 from pranavanand123/options-checker-delta. Authored-by: Pranav Anand Signed-off-by: Burak Yavuz GitOrigin-RevId: 37d6c5f5453eed9fab2e1647940a8df91db4fc75 commit 5ba1a0dcc2aab37cb27a16d5e3231e2dfa0fa692 Author: Burak Yavuz Date: Fri Sep 6 16:40:17 2019 +0000 [SC-20947] Always make the output attributes nullable when writing to Delta Even though we change the schema as nullable when writing to Delta, the output attributes may remain not-null. In such cases, when writing to Parquet, we still want to keep the attributes as nullable, to avoid potential corruption with Parquet. 
This unfortunately bloats certain parquet file sizes, but the changes in the tests suggest that we weren't doing the right thing in the first place. Regression test Author: Burak Yavuz GitOrigin-RevId: ee2941458fd1ba1ba62e911e8b1987f219dce10e commit 103f5cb8937182c69b09b855587579dd91067c9b Author: Pranav Anand Date: Thu Sep 5 22:51:41 2019 +0000 [SC-18153] Minor refactoring in the exceptions Changed exception to take table names and paths. Author: Pranav Anand GitOrigin-RevId: 6aa9d068c332c0cb782f67bac0ef8d9125fb20e7 commit 55ca0054a7ba01a05acda5d8edc961f7a2ce9cdc Author: Yishuang Lu Date: Wed Sep 4 18:12:49 2019 +0000 [DELTA-OSS-EXTERNAL] Remove all unused imports in the code Remove all unused imports in the code Signed-off-by: Yishuang Lu Closes delta-io/delta#135 Author: lys0716 Author: Shixiong Zhu GitOrigin-RevId: 5732e6337d9a66dad1f95cf60dd4bbab34924325 commit 20840668e18a89f85568d1d8e025f0de5d440394 Author: dmatrix Date: Tue Sep 3 23:34:23 2019 +0000 [DELTA-OSS-EXTERNAL] found minor typos and usage; fixed it Minor edits in usage in the README.md. Closes delta-io/delta#57 Author: dmatrix #6265 is resolved by zsxwing/sk5qnz6j. GitOrigin-RevId: 40f1c948fbb0c493f4f24c5117e5105e40d50942 commit 06e33df20a724ed5f4e76e6746b5c4bccf674b1d Author: Shixiong Zhu Date: Tue Sep 3 10:12:29 2019 -0700 [SC-21941][WARMFIX] Log cleanup should delete the checksum file for version 0 Right now as we are listing from the delta file name from version 0, it will not return the checkpoint file for version 0. This is usually fine since we don't checkpoint version 0. However, technically, we can create a checkpoint for version 0, so it's better to also handle it by using the checkpoint file name to list. 
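A small sketch of why the starting file name matters when listing. It assumes, beyond what the message states, that log files are named with a 20-digit zero-padded version followed by `.json`, `.checkpoint.parquet` or `.crc`, and that listing returns names that sort at or after the starting name; `list_from` is a hypothetical stand-in, not the LogStore API.

```python
def delta_file(version):
    # Assumed naming scheme: 20-digit zero-padded version plus ".json".
    return f"{version:020d}.json"


def checkpoint_file(version):
    # Assumed naming scheme for checkpoints.
    return f"{version:020d}.checkpoint.parquet"


def list_from(names, start):
    """Hypothetical stand-in for a log listing: names >= start, in lexicographic order."""
    return sorted(n for n in names if n >= start)


names = [
    delta_file(0),
    checkpoint_file(0),
    f"{0:020d}.crc",   # assumed checksum file name
    delta_file(1),
]

# Listing from the version-0 delta file skips the version-0 checkpoint and
# checksum files, because ".checkpoint" and ".crc" sort before ".json".
print(list_from(names, delta_file(0)))

# Listing from the version-0 checkpoint file name picks up all four files.
print(list_from(names, checkpoint_file(0)))
```

Under these assumptions, starting the listing at the checkpoint name is enough to cover every version-0 file, since it sorts before the other two suffixes.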
Authored-by: Shixiong Zhu Signed-off-by: Shixiong Zhu GitOrigin-RevId: 335904ef1738aebd94d0e30684956c675576f266 commit 64905d8ccd2013a1b77118b1f9004aa1fe352372 Author: Wesley Hoffman Date: Thu Aug 29 21:02:57 2019 +0000 [DELTA-OSS-EXTERNAL] Document features not supported in OSS Delta fixes #77 Closes delta-io/delta#129 Author: Wesley Hoffman GitOrigin-RevId: 4e7513968ec154ac629a61c23e996b03e83d63de commit 68aec53866b4159acf3318149f59a6c884ce0b08 Author: Rahul Mahadev Date: Mon Aug 26 16:52:54 2019 -0700 [DELTA-OSS-EXTERNAL] Add MIMA - Scala Binary Incompatibility Check to sbt This PR adds MIMA to the build process of Delta Lake. During the build process we fetch the latest release of Delta Lake and check whether the new change would break binary compatibility with the previous versions. Note: `sbt test` triggers the MIMA check; `sbt testOnly` does not. Closes delta-io/delta#137 Closes #6171 from rahulsmahadev/720b5kn1. Authored-by: Rahul Shivu Mahadev <51690557+rahulsmahadev@users.noreply.github.com> Signed-off-by: Rahul Mahadev GitOrigin-RevId: e5bbd0dc17fad44331176d0d17a427dfb2dcc830 commit 778cc6923bb3985122062f7ce4af4cd58a15f849 Author: Jungtaek Lim Date: Mon Aug 26 19:50:22 2019 +0000 [DELTA-OSS-EXTERNAL] Try to delete leaked CRC file in HDFSLogStore due to HADOOP-16255 Due to [HADOOP-16255](https://issues.apache.org/jira/browse/HADOOP-16255), `fc.rename` doesn't correctly rename the CRC file of the source file if the filesystem is a descendant of `ChecksumFs` (specifically `LocalFs`), which makes HDFSLogStore leak the CRC files of temp files. This patch will try to delete the CRC file of the source file when renaming, but only as a "best-effort", since it's better to leak some CRC files than to let the write fail. Also added verification logic to check for any leaked CRC files. Closes delta-io/delta#139 Author: Jungtaek Lim #6165 is resolved by zsxwing/l406oky1.
GitOrigin-RevId: 2abafc268b3f0407115378d218c5b7a11118e200 commit ee1770714072874f295d6ce3205849b702c5eda2 Author: Jose Torres Date: Fri Aug 23 20:09:55 2019 +0000 [SC-19523][DELTA] Move DeltaSource offset forward even if there are no AddFiles Currently, only commits with data move the DeltaSource offset forward. So if there are many no-data commits over a long period of time, the retention period will eventually hit the old commit that the DeltaSource is at, causing it to fail. new unit test Author: Jose Torres GitOrigin-RevId: 2c69af0f16a186b52742e4bd4e21ef4ee1597230 commit ad72d485c6c481c870ed80e0b52e922223721af6 Author: Yishuang Lu Date: Wed Aug 21 13:36:52 2019 -0700 [DELTA-OSS-EXTERNAL] Fix typos in the code Fix typos in the code Signed-off-by: Yishuang Lu Closes delta-io/delta#132 Closes #6105 from mukulmurthy/czjxk36z. Lead-authored-by: Yishuang Lu Co-authored-by: Mukul Murthy <38224594+mukulmurthy@users.noreply.github.com> Co-authored-by: lys0716 Signed-off-by: Mukul Murthy GitOrigin-RevId: a3f46c85c09f7ca33482beb45020720e8dde1e4f commit 5adb8d1f43b2c921a589e0edacaac52e8af44724 Author: Rahul Mahadev Date: Tue Aug 20 11:46:47 2019 -0700 [SC-21403][DELTA] Describe History Scala API test cleanup Refactored the DescribeDeltaHistorySuite to remove an unnecessary DeltaLog creation. Closes #6081 from rahulsmahadev/describe_history_ss. Authored-by: Rahul Mahadev Signed-off-by: Rahul Mahadev GitOrigin-RevId: aa3126ab362df2a5e46ae14c2ff530e93ecaee1b commit d3b5e42f955d45e71ad2f20253800e37eadee7b6 Author: Shixiong Zhu Date: Tue Aug 20 18:30:03 2019 +0000 [DELTA-OSS-EXTERNAL] Hide package private types and methods in javadoc Add `-P:genjavadoc:strictVisibility=true` to scalac options in order to hide package private types and methods in javadoc. Manually ran `build/sbt clean unidoc` and verified the generated javadoc doesn't show package private methods of `DeltaTable`. Closes delta-io/delta#130 Author: Shixiong Zhu #6079 is resolved by zsxwing/nar2ifi2. 
GitOrigin-RevId: 085060f712eaee15379e8366358913cac64d441d commit 94572bfe1387766d8b9d2d9d5f613d4d888ab7a6 Author: Tathagata Das Date: Tue Aug 20 10:45:13 2019 +0000 [SC-20935] Add DeltaLogging to DeltaMergeBuilder Add DeltaLogging to DeltaMergeBuilder to allow tracking metrics about merges. Author: Tathagata Das GitOrigin-RevId: 4298da297e73ec916ef5d87ac9b96f6f5a03f518 commit 07f1e6f6be377a10e4257c6d9e44c8c0bc557d54 Author: Terry Kim Date: Tue Aug 13 15:58:37 2019 +0000 [DELTA-OSS-EXTERNAL] Fix build warnings and remove unnecessary imports `build/sbt compile` shows: ``` [info] Compiling 72 Scala sources to /tmp/delta/target/scala-2.12/classes... [warn] /tmp/delta/src/main/scala/org/apache/spark/sql/delta/PreprocessTableMerge.scala:21: imported `DeltaErrors' is permanently hidden by definition of object DeltaErrors in package delta [warn] import org.apache.spark.sql.delta.{DeltaErrors, DeltaFullTable} [warn] ^ [warn] /tmp/delta/src/main/scala/org/apache/spark/sql/delta/PreprocessTableMerge.scala:21: imported `DeltaFullTable' is permanently hidden by definition of object DeltaFullTable in package delta [warn] import org.apache.spark.sql.delta.{DeltaErrors, DeltaFullTable} [warn] ^ [warn] /tmp/delta/src/main/scala/org/apache/spark/sql/delta/PreprocessTableUpdate.scala:19: imported `DeltaErrors' is permanently hidden by definition of object DeltaErrors in package delta [warn] import org.apache.spark.sql.delta.{DeltaErrors, DeltaFullTable} [warn] ^ [warn] /tmp/delta/src/main/scala/org/apache/spark/sql/delta/PreprocessTableUpdate.scala:19: imported `DeltaFullTable' is permanently hidden by definition of object DeltaFullTable in package delta [warn] import org.apache.spark.sql.delta.{DeltaErrors, DeltaFullTable} [warn] ^ [warn] /tmp/delta/src/main/scala/org/apache/spark/sql/delta/UpdateExpressionsSupport.scala:19: imported `DeltaErrors' is permanently hidden by definition of object DeltaErrors in package delta [warn] import org.apache.spark.sql.delta.DeltaErrors [warn] ^ 
[warn] there were two deprecation warnings; re-run with -deprecation for details [warn] 6 warnings found ``` Remove unnecessary imports in some other files as well. Closes delta-io/delta#120 Author: Mukul Murthy Author: Terry Kim GitOrigin-RevId: c9a40e1063abf96aada8a6a6b19ac04425b799bd commit 6cb6406483d4fbd06ca0f791b1da8d748466ffe3 Author: Jose Torres Date: Tue Aug 13 14:58:19 2019 +0000 [SC-20682][DELTA] Save partition schema in PreparedDeltaFileIndex PreparedDeltaFileIndex right now recomputes the partition schema from snapshot every time. There's no need for this, and it ends up meaning time travel creates a whole new snapshot for each partition (since the time travel snapshot isn't the most recent one in the Delta log). Added a file operations test. The number of listed paths in the provided test case goes down from 309 to 9. Author: Jose Torres GitOrigin-RevId: e78974ac4122f72833c874f156c975a5bd022156 commit 1b8b376f43b2d56b54cf9e78694c749754e2ff9f Author: Mukul Murthy Date: Tue Aug 13 09:34:31 2019 -0700 [DELTA-REFACTOR] Cleanup and merge a couple import statements Authored-by: Mukul Murthy Signed-off-by: Mukul Murthy GitOrigin-RevId: 8d34431be055121705c8462a65969e0af67fcd35 commit 36bc2f111f3a761534d6beda8067e08f6cc84c21 Author: Burak Yavuz Date: Mon Aug 12 23:25:30 2019 +0000 [DELTA-OSS-EXTERNAL] Reduce file listing parallelism for tests This should make Vacuum tests a lot faster. It's running 10,000 individual Spark jobs right now due to the very high setting of file parallelism setting. cc @rahulsmahadev Closes delta-io/delta#113 Author: Burak Yavuz GitOrigin-RevId: 94dbb425368952bba633991b2c1a8045b8b53a4f commit 94c407c28a9b4f0bf8aeb6fe7180c23379afcc96 Author: Shixiong Zhu Date: Mon Aug 12 14:45:01 2019 -0700 [DELTA-OSS-EXTERNAL] Disable the automatic async log cleanup in DeltaRetentionSuite to make tests stable Sometimes tests in DeltaRetentionSuite fail because of the automatic async log cleanup. This PR just disables to make tests stable. 
Closes delta-io/delta#125 Authored-by: Shixiong Zhu Signed-off-by: Mukul Murthy GitOrigin-RevId: 6531f320ba97cc019b4c3a7b74b493a5f165ea8d commit ccda9e4a5acfae4e71a3e7ab43f45575a5900ea1 Author: Yucai Yu Date: Mon Aug 12 14:39:17 2019 -0700 [DELTA-OSS-EXTERNAL] Fix comments in DeltaLog.createRelation The return type of `createRelation` should be `BaseRelation` instead of `DataFrame`. Closes delta-io/delta#117 Authored-by: Yucai Yu Signed-off-by: Mukul Murthy GitOrigin-RevId: a3678d988cf3799c9999d67e88e0534be43b4fca commit f5ed7dd5c534819d3bdfdbe7c1ccac5b685ce33b Author: lys0716 Date: Mon Aug 12 17:37:32 2019 +0000 [DELTA-OSS-EXTERNAL] Update binaries version in README.md from 0.2.0 to 0.3.0 Update binaries version in README.md from 0.2.0 to 0.3.0 Signed-off-by: Yishuang Lu Closes delta-io/delta#119 Author: lys0716 GitOrigin-RevId: 5e5e541b89d0b46ac84b5b0ba3a28069cb2cb6cd commit d9749f7c26fa63c5f05fdb2ed4c7116bdd56a51d Author: liwensun Date: Fri Aug 9 11:17:50 2019 -0700 [SC-20980][DELTA] Bump oss build version to 0.3.1-SNAPSHOT ## What changes were proposed in this pull request? a.t.t Closes #5991 from liwensun/sc20980-oss-version. Authored-by: liwensun Signed-off-by: Shixiong Zhu GitOrigin-RevId: b4c5d1fdb50b6c49b5b07a54edbccfd210c58e97 commit b18b1775b9be6dcf855265268e3c29fdc6c9ed35 Author: Rahul Mahadev Date: Thu Aug 8 15:42:27 2019 -0700 [SC-20260][DELTA] Evolvability test for describe history command. ## What changes were proposed in this pull request? Added evolvability test for Describe History Scala API. Generated the resource files for OSS using `build/sbt "test:runMain org.apache.spark.sql.delta.EvolvabilitySuite src/test/resources/delta/delta-0.2.0`. on delta lake 0.2.0 release. ## How was this patch tested? Closes #5929 from rahulsmahadev/evolvability_suite. 
Authored-by: Rahul Mahadev Signed-off-by: Rahul Mahadev GitOrigin-RevId: adca1e06f3d59d99ed3841f2b7967fc4031ebc2b commit 2da5bcfc8449473c0f1d0e3eae4b9da3dc03a554 Author: Shixiong Zhu Date: Mon Aug 5 19:40:04 2019 +0000 [SC-20741]Remove DeltaTable.apply and add DeltaTableTestUtils to open it in tests only ## What changes were proposed in this pull request? This PR removes the public method `DeltaTable.apply` to avoid exposing an internal API. I made the constructor of `DeltaTable` package private and add `DeltaTableTestUtils` in tests to open it up in tests only. ## How was this patch tested? Jenkins Author: Shixiong Zhu #5951 is resolved by zsxwing/SC-20741. GitOrigin-RevId: 679c9d245dacf5849c6fa2ed055e4b2bd2db0c70 commit 75439ff80157b7abbc50eb3ed5963f68e0503296 Author: Rahul Mahadev Date: Fri Aug 2 11:25:04 2019 -0700 [SC-20324] Refactored DeltaTable Refactored DeltaTable to take SparkSession and DeltaLog as parameters instead of DataFrame. Closes #5934 from rahulsmahadev/deltatable_refacotr. Authored-by: Rahul Mahadev Signed-off-by: Rahul Mahadev GitOrigin-RevId: 663ab2070c58e900ca3dd73452696281cbff1fc7 commit 3100b6d87247946ec3d65686bee78c43f50c551b Author: Tathagata Das Date: Thu Aug 1 16:41:11 2019 -0700 Setting version to 0.3.1-SNAPSHOT commit 21ba848dad637ed6e7e64f785397112b46770c15 Author: Tathagata Das Date: Thu Aug 1 16:39:41 2019 -0700 Setting version to 0.3.0 commit e75c8d14270a5f12fcf44cde05eb151575496d15 Author: Rahul Mahadev Date: Thu Aug 1 20:06:21 2019 +0000 Refactoring io.delta package to io.delta.tables Staging this PR for now. Changing the namespace `io.delta` to `io.delta.tables` No new tests added, re ran old tests. Author: Rahul Mahadev GitOrigin-RevId: ab6a5968a29f698426bea27db0f33f431da637ef commit df0393e66a614d993b1dbf4ebd02f5aeb69e6a12 Author: Shixiong Zhu Date: Wed Jul 31 23:14:46 2019 +0000 [SC-20413]Snapshot staleness should use the last update timestamp ## What changes were proposed in this pull request? 
Right now we use the timestamp of latest commit to check staleness. This is not great since if a table is old and doesn't have recent commits, we will always run `update` to check, which is wasting a list request. We can remember the timestamp of the last update, and use it to check staleness. Then we can save a list request when a table is not stale. ## How was this patch tested? Jenkins Author: Shixiong Zhu #5904 is resolved by zsxwing/update-staleness. GitOrigin-RevId: 8f6939d7d84997ff0ec9490b3ec41db0519d450e commit 890158a020494948275c63e6966159246dbe25a3 Author: Shixiong Zhu Date: Wed Jul 31 20:04:30 2019 +0000 [SC-20377]Delta streaming source should check the latest protocol of a table ## What changes were proposed in this pull request? - Delta streaming source should verify if it's allowed to read a table when loading a json file. - Add tests to make sure protocolRead/Write is called with the right Protocol instance. Author: Shixiong Zhu #5881 is resolved by zsxwing/protocol-fix. GitOrigin-RevId: fd64c4b528a1d482f2f3edede24cb93c3b1a54a0 commit 388f1f4aebb24d45df09995f3787f63b14175950 Author: Rahul Mahadev Date: Fri Jul 26 23:00:50 2019 +0000 Fixed comments in vacuum Minor changes Author: Rahul Mahadev GitOrigin-RevId: 3f40622903e5ee301af3be06c0cc91f4722a52f8 commit 46b995b56a8b15724980eade822df7a91c1bb224 Author: Tathagata Das Date: Fri Jul 26 22:58:05 2019 +0000 [DELTA-OSS-EXTERNAL] Refactored and improved API docs for DeltaTable operations - Moved the update() methods to DeltaTable class to fix java doc issues - Converted builder classes from case class to simple class because case classes have unnecessary public methods (e.g. productArity, etc.) that show up in the API docs. - Added details docs for update and merge Closes delta-io/delta#101 Author: Tathagata Das #5897 is resolved by tdas/dunpvr01. 
GitOrigin-RevId: cc7b9fe82c3182859ffa0ff0e7bc6942ffb02be7 commit e624d92d7cfa7669ef97e423a9c451de88f4e5ca Author: Rahul Mahadev Date: Fri Jul 26 13:31:29 2019 -0700 [SC-19634] Add Describe History Scala APIs to OSS Delta Lake ## What changes were proposed in this pull request? Adding DescribeDeltaHistory history Scala API DescribeDeltaHistory on a DeltaTable would return a DataFrame with the commit info in reverse chronological order. The limit optional parameter specifies the last limit operations to fetch the History on. Sample usage : deltaTable = new DeltaTable(spark.table(tableName)) deltaTable.history(limit = 10) deltaTable.history() ## How was this patch tested? DeltaDescribeHistorySuite Closes #5852 from rahulsmahadev/describe-history. Authored-by: Rahul Mahadev Signed-off-by: Rahul Mahadev GitOrigin-RevId: 071cb9cdcbd0e56e35eb9cbb727ee3c80fb57904 commit 96ec65407ba178a37ad7b389954faae2b0dd1d10 Author: Zhitong Yan Date: Thu Jul 25 23:52:35 2019 +0000 [DELTA-OSS] Enable JUnit tests in Delta Lake ## What changes were proposed in this pull request? After this PR, Delta Lake will have the ability to run JUnit tests in SBT. ## How was this patch tested? by `JavaDeltaTableSuite.java` Closes delta-io/delta#97 Author: Zhitong Yan #5654 is resolved by ZhitongDB/enable-java-test-in-oss. GitOrigin-RevId: 89f93948a8deda9c260df3e5265d440b70d45931 commit 43c30f6631b4922234b81357df0b518649e6dd30 Author: Shixiong Zhu Date: Thu Jul 25 22:41:53 2019 +0000 [SC-20256]Save one Snapshot when creating DeltaLog ## What changes were proposed in this pull request? Right now creating DeltaLog needs to create 2 Snapshots when the last checkpoint version is not the same as the latest version. We load the parquet checkpoint to create a `Snapshot`, then call `update` to pick up latest json files after the checkpoint. We can load the checkpoint and json files together to save one `Snapshot`. ## How was this patch tested? Jenkins Author: Shixiong Zhu #5865 is resolved by zsxwing/load. 
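
The single-pass load described in SC-20256 can be sketched roughly as follows. This is an illustration only, not the code from the patch: the `LogSegmentSketch` object and its members are hypothetical names, and it assumes the usual Delta log file naming (20-digit zero-padded versions, `.json` for commits, `.checkpoint.parquet` for single-file checkpoints). Multi-part checkpoints are ignored.

```scala
object LogSegmentSketch {
  // Assumed naming: <20-digit version>.json for commits,
  // <20-digit version>.checkpoint.parquet for single-file checkpoints.
  private val DeltaFile = """(\d{20})\.json""".r
  private val CheckpointFile = """(\d{20})\.checkpoint\.parquet""".r

  /** The files needed to build exactly one snapshot. */
  final case class LogSegment(checkpointVersion: Option[Long], deltaVersions: Seq[Long]) {
    // Version of the resulting snapshot, if the log is not empty.
    def version: Option[Long] =
      (checkpointVersion.toSeq ++ deltaVersions).reduceOption(_ max _)
  }

  /** Picks the latest checkpoint plus every JSON commit newer than it. */
  def select(fileNames: Seq[String]): LogSegment = {
    val checkpoints = fileNames.collect { case CheckpointFile(v) => v.toLong }
    val deltas = fileNames.collect { case DeltaFile(v) => v.toLong }
    val latestCheckpoint = if (checkpoints.isEmpty) None else Some(checkpoints.max)
    // Commits at or below the checkpoint version are already folded into it.
    val newerDeltas = deltas.filter(v => latestCheckpoint.forall(v > _)).sorted
    LogSegment(latestCheckpoint, newerDeltas)
  }
}
```

Selecting both file sets up front is what lets a caller replay them in one pass instead of building a checkpoint-only snapshot and then updating it.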
GitOrigin-RevId: 6ab2ce407f613ecf2cbcd5b6136443176470442d commit d6dd8bca125cebc2bda9f0d51c1321a652b8ab01 Author: Arul Ajmani Date: Thu Jul 25 20:11:41 2019 +0000 [DELTA-OSS-EXTERNAL] Update analysisException to take all parameters AnalysisException takes line, startPosition, and cause as constructor parameters in addition to message and plan; this change updates analysisException to accept those as well. Closes delta-io/delta#102 Author: Arul Ajmani #5885 is resolved by mukulmurthy/zw6c4nhx. GitOrigin-RevId: bb66861c1400f5e06afbb59877233190688e10e8 commit 73f4a64284f7e971be4a5238aa4b3f310360df85 Author: Rahul Mahadev Date: Thu Jul 25 03:39:31 2019 +0000 [SC-19632][DELTA] - VACUUM Scala API ## What changes were proposed in this pull request? Users can vacuum a given DeltaTable with a given retention period. Vacuum recursively deletes files and directories in the table that are not needed for maintaining older versions up to the given retention threshold. Note: Vacuum removes the ability to time travel beyond the retention period. Sample usage: `deltaTable = new DeltaTable(spark.table(tableName))` `deltaTable.vacuum(retentionHours = 13)` `deltaTable.vacuum()` ## How was this patch tested? - DeltaVacuumSuite to test the functionality of Vacuum. Closes delta-io/delta#95 ORIGINAL_AUTHOR=rahulsmahadev rahul.mahadev@databricks.com Author: Rahul Mahadev #5741 is resolved by rahulsmahadev/vacuum. GitOrigin-RevId: 1fe58f3f8877fd82a3770e7d12e282ab0241b89a commit 39070b448e8d9d1c2c0b60a71c6d36b34bceb045 Author: Shixiong Zhu Date: Thu Jul 25 12:21:43 2019 -0700 Fix GitOrigin-RevId GitOrigin-RevId: ec7fecac856b4c2033f288cae787f0bb7f6b3149 commit 526982101e2aeade3e9e98968312403d42bbdc12 Author: Tathagata Das Date: Fri Jul 19 06:46:46 2019 +0000 Merge Scala API for DeltaTable This PR is for #42. After the change, a Delta table can merge a source table/query with an optional condition and update/insert rules.
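
A minimal usage sketch of the merge API described above. It assumes the builder methods as published in Delta Lake 0.3.0 under `io.delta.tables` (the namespace this log later renames `io.delta` to), and it needs Spark and Delta Lake on the classpath. The path, column names, and sample rows are hypothetical.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

object MergeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("merge-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical target table location and contents.
    val path = "/tmp/delta/merge-sketch"
    Seq((1, "a"), (2, "b")).toDF("key", "value")
      .write.format("delta").mode("overwrite").save(path)

    // Hypothetical source: one existing key to update, one new key to insert.
    val updates = Seq((2, "b2"), (3, "c")).toDF("key", "value")

    DeltaTable.forPath(spark, path).as("t")
      .merge(updates.as("s"), "t.key = s.key")   // merge condition
      .whenMatched().updateExpr(Map("value" -> "s.value"))
      .whenNotMatched().insertExpr(Map("key" -> "s.key", "value" -> "s.value"))
      .execute()

    spark.read.format("delta").load(path).orderBy("key").show()
    spark.stop()
  }
}
```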
This PR has been tested by MergeIntoScalaSuite.scala Closes delta-io/delta#96 Author: Zhitong Yan Author: Tathagata Das GitOrigin-RevId: 7d5b61738546c63460d717db24720843718fd3d5 commit 824989c61458856e771770322e95e055e6ff09f9 Author: Tathagata Das Date: Fri Jul 19 01:03:49 2019 +0000 [SC-19637] [Delta] Add nested data support to Update Scala/Java APIs in OSS ## What changes were proposed in this pull request? Supporting nested data in Update required explicitly resolving dotted column names in the analysis phase. This was done by DBR's Analyzer which explicitly handled `UpdateTable` logical plan. But in OSS, Apache Spark's Analyzer does not do this. Hence we need to explicitly resolve nested columns in OSS and throw errors if they dont resolve. Here are the changes - Moved all the resolution code from DBR's analyzer to UpdateTable so that it can be invoked directly. - Updated DeltaTableOperations.executeUpdate to explicitly resolve expressions to extract the nested name parts. - To resolve references of expressions from outside the Analyzer (without writing a rule or invoking ResolveReferences rule directly), I stuck each expression in a fake LogicalPlan and invoked the Analyzer to resolve it. This keeps the dependency on internal APIs at a minimum. ## How was this patch tested? Moved nested data tests from UpdateScalaSuite to UpdateSuiteBase to ensure they run in OSS. Author: Tathagata Das #5794 is resolved by tdas/SC-19637. GitOrigin-RevId: 2bc7adf42a1862200af32bb3de7818690983528a commit f6e28a34e247c3cdff5a9eb6a137eed25732eb19 Author: Tathagata Das Date: Tue Jul 16 22:30:20 2019 +0000 [DELTA-OSS-EXTERNAL] [DOCS] Added scripts to generate api docs and API stability annotations - Used Spark annotations to annotate public APIs regarding their stability - Copied Spark scripts to patch the generated docs files to make the annotations visible - Additional `:: Evolving ::` is necessary for these scripts to dynamically add badges for the annotation. 
Without this snippet + patching, generated Java docs do not show annotations. - The `generate_api_docs()` script does the following. 1. Generates Scala and Java docs with `sbt unidoc` 1. Patches the docs' html, js, and css to dynamically inject badges 1. Copies the docs to a standard location `docs/_site/api/` which will be useful for publishing the API docs in the Delta docs. - Moved public methods in DeltaTableOperations into DeltaTable because Java docs generated by unidoc incorrectly handles inherited methods in some cases. In this case, it was showing `delete` as static methods. **Java docs** ![image](https://user-images.githubusercontent.com/663212/61239363-8fb86180-a6f3-11e9-866d-0a2852f6be74.png) **Scala docs** ![image](https://user-images.githubusercontent.com/663212/61254509-893ce080-a719-11e9-98ff-c8239cc2ca74.png) Closes delta-io/delta#93 Author: Tathagata Das Author: Tathagata Das #5779 is resolved by tdas/myt2d8a3. GitOrigin-RevId: a433a0ffdab0eeb3471f4a4c8be66316bd5d828e commit 5bf990d1a4ce4d429212118fa2d6008d982820bf Author: ZhitongDB Date: Mon Jul 15 18:32:51 2019 +0000 [SC-19217] Add Update Scala API to DBR + OSS ## What changes were proposed in this pull request? In this PR, added the update operations for DeltaTable. After the change, DeltaTable can perform a update operation based on the condition specified. ## How was this patch tested? by `UpdateScalaSuite` Closes delta-io/delta#86 Author: Zhitong Yan Author: Tathagata Das #5458 is resolved by ZhitongDB/update-scala-api. 
GitOrigin-RevId: 54517b5b64a2a6d14185d8109cd21d71a095f230 commit 3f8e7541e894dcb06b8a3b315ea2617c9e977925 Author: Shixiong Zhu Date: Thu Jul 11 23:27:36 2019 +0000 [SC-19619]Remove unused codes and change the default Scala version to 2.12 - Remove unused codes - Change the default Scala version to 2.12 Author: Shixiong Zhu GitOrigin-RevId: 8a31c6df41417c6fff84d8e4026d2e821532b9c2 commit f5b84b6ddaa1bbfd6f51bf1e176da6d44dcd1905 Author: Tathagata Das Date: Thu Jul 11 01:38:48 2019 +0000 [DELTA-OSS-EXTERNAL] Added ability to generate scala and java docs Added sbt-unidoc to generate Scala and Java docs. To generate both docs, just run `build/sbt unidoc` - Any classes not directly in `io.delta` will be ignored from the docs. - Unidoc will be generated when testing so that we can verify that docs are never broken. The overhead of generating the docs is just a few seconds, so does not add much to test times. Closes delta-io/delta#88 Author: Tathagata Das #5735 is resolved by tdas/5jt6u51s. GitOrigin-RevId: ebf7aff73e50cb762be74fd78d9adc5825ab14cf commit 635520855284a887747c6c9668454d25676fa97d Author: Jose Torres Date: Wed Jul 3 11:12:36 2019 -0700 [SC-15200][DELTA] Make SaveAsTable create the table with an empty partitioned dataframe. ## What changes were proposed in this pull request? There's a special case to skip over making a commit if no data files were written. This bug indicates we don't want that special case. ## How was this patch tested? new unit test Closes #5634 from jose-torres/fixempty. Authored-by: Jose Torres Signed-off-by: Jose Torres GitOrigin-RevId: 4e0cef36df7138e7aa032edc50ecd3b443a08519 commit f22a7e2eed5249ad8841bdb6bc0087aa40399e95 Author: liwensun Date: Wed Jul 3 05:27:34 2019 +0000 [SC-19357][DELTA][WARMFIX]Fix performance regression on small tables PR #5500 is a bug fix regarding DeltaLog cache polluting spark sessions, but it introduced a regression on small table optimization for Delta. 
This PR fixes this regression by caching the collected rows instead of having to compute them repeatedly. existing tests Author: liwensun GitOrigin-RevId: da298b2ee21c1727998b34cf97a7d005384c47d4 commit 729ccb552184acda4095c17e0ebcaebc0bb1d83b Author: Rahul Mahadev Date: Tue Jul 2 15:33:36 2019 -0700 Minor change Lead-authored-by: Rahul Mahadev Co-authored-by: Burak Yavuz Signed-off-by: Rahul Mahadev GitOrigin-RevId: d0b4eb75fa10727293dfca93d37bb0c41465206a commit 5483d5475d62d6b44e6c33a5afa10be211505284 Author: Zhitong Yan Date: Tue Jul 2 11:29:42 2019 -0700 [SC-18879] Add Delete Scala API to DBR + OSS In this PR, added the delete operations for DeltaTable. After the change, DeltaTable can perform a delete operation based on the condition specified. By `DeleteScalaSuite` Closes delta-io/delta#75 Closes #5391 from ZhitongDB/delete-scala-api. Lead-authored-by: Zhitong Yan Co-authored-by: Tathagata Das Co-authored-by: ZhitongDB <50844714+ZhitongDB@users.noreply.github.com> Signed-off-by: Tathagata Das GitOrigin-RevId: f692e0020a8a6d279a25d4a5106fb19ee105ec49 commit 07f53f4296b1154adda0b1022fa81bf828db9aa0 Author: Zhitong Yan Date: Sun Jun 30 17:21:55 2019 -0700 [DELTA-OSS-EXTERNAL] Fix DeltaTable forPath to not accept non-delta table paths In this PR, I added exception handling for `forPath` method in `DeltaTable.scala`. After the change, Delta Table will make sure create instance just for "Delta Source", other sources will throw an exception. This PR is tested by `DeltaTableSuite.scala` Closes delta-io/delta#79 Closes #5599 from tdas/qy70ntky. 
Lead-authored-by: Zhitong Yan Co-authored-by: Tathagata Das Co-authored-by: ZhitongDB <50844714+ZhitongDB@users.noreply.github.com> Signed-off-by: Tathagata Das GitOrigin-RevId: eb3a014714aacc1b97d165089e610186b1bb9f83 commit ab260c9a7ee912357a5516447001442863d4c4e2 Author: Shixiong Zhu Date: Fri Jun 28 03:38:41 2019 +0000 [DELTA-OSS-EXTERNAL] Add a system property for the DeltaLog cache size Add "delta.log.cacheSize" system property for the DeltaLog cache size, and set it to 10 in tests to reduce the memory footprint. Closes delta-io/delta#80 GitOrigin-RevId: cf9e0560df0cdc08ff7c1978963ff9584278d050 Author: Shixiong Zhu #5595 is resolved by tdas/fa38kpiz. GitOrigin-RevId: 2c19cd19e491501fd42efff200e9d2b5ddcb9c22 commit adaee91212233ae5391b9bf4e069e680d1fae4da Author: Zhitong Yan Date: Thu Jun 27 20:15:57 2019 +0000 [DELTA-OSS] Fix filterFiles() in OptimisticTransaction ## What changes were proposed in this pull request? In this PR, changed the implementation of `filterFiles` method in `OptimisticTransaction.scala`. Original one may return empty matched file list based on some prediction, if there is some partition specified in delta table. ## How was this patch tested? By `OptimisticTransactionSuite.scala` zhitong.yan@databricks.com Author: Zhitong Yan #5534 is resolved by ZhitongDB/filter-files. GitOrigin-RevId: cac5cb2d198413f10305666eb1a6b66c490adb7e commit ffa2575f2596c0fec83b81076ec99aeff6f62671 Author: Shixiong Zhu Date: Wed Jun 26 23:17:43 2019 +0000 [SC-19257] Don't use DataSourceOptions in Delta ## What changes were proposed in this pull request? Apache Spark has removed DataSourceOptions in master and it will not be available in Spark 3.0. See https://github.com/apache/spark/commit/2a80a4cd39c7bcee44b6f6432769ca9fdba137e4 We should avoid using it in Delta. ## How was this patch tested? Jenkins Author: Shixiong Zhu #5586 is resolved by zsxwing/SC-19257. 
GitOrigin-RevId: c4fe4d18d5029fffb2ad725ef8b8e9cd725b8bda commit 911267a0c6e06a81ea22b1f9d160eed9a02a4a74 Author: Burak Yavuz Date: Tue Jun 25 23:55:29 2019 -0700 [SC-19019] Remove another unused configuration ## What changes were proposed in this pull request? Found another unused configuration... ## How was this patch tested? YOLO Closes #5564 from brkyvz/turnOnMQO2. Authored-by: Burak Yavuz Signed-off-by: Burak Yavuz GitOrigin-RevId: d189c033f4d52d9f8269fb0a47b672b380737fcb commit 5569557a8cce61f1bdaba84b76c91b835159ee53 Author: Jose Torres Date: Wed Jun 26 05:38:45 2019 +0000 [SC-18753] Minor code cleanup GitOrigin-RevId: 4e0463d89d3af39b150b76f02f6d83e68bd1dc5b commit a1e4ff82eb045cc7a4ffe4d9e88a378a19c769e9 Author: liwensun Date: Tue Jun 25 17:52:07 2019 -0700 [SC-14260][DELTA] Stop caching DFs in DeltaLog to prevent spark session pollution ## What changes were proposed in this pull request? TL;DR: This PR changes the DataFrame fields `state` and `withStats` from `val` to `def`. This is to prevent these DFs, when cached as part of a `DeltaLog` instance, from polluting the active Spark session. When these cached DFs are executed from a different session, the original session in which these DFs were created will become the active session (because Spark just sets the current DF's session as the active session). This pollutes session-specific configs and libraries. By making these DFs a method call instead of a cacheable field, they will be created using the current active session every time they are called. Hopefully this change won't degrade performance too much, because these DFs can still be evaluated from the underlying RDD cache instead of being computed from scratch. ## How was this patch tested? A regression test to make sure DeltaLog operations no longer change the active session unexpectedly.
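
The difference between a `val` field and a `def` in the SC-14260 change above can be illustrated without Spark. The sketch below is a toy model only: `Session`, `ActiveSession`, `Frame`, and `CachedLog` are hypothetical stand-ins for a Spark session, the active-session holder, a DataFrame, and a cached `DeltaLog`.

```scala
object ValVsDefSketch {
  final case class Session(name: String)

  // Hypothetical stand-in for the notion of a process-wide active session.
  object ActiveSession {
    @volatile private var current: Session = Session("default")
    def get: Session = current
    def set(s: Session): Unit = { current = s }
  }

  // Stand-in for a DataFrame: it remembers the session it was created in.
  final case class Frame(session: Session)

  class CachedLog {
    // Evaluated once at construction, so it stays bound to the creating session.
    val stateAsVal: Frame = Frame(ActiveSession.get)
    // Re-evaluated on every call, so it picks up whichever session is active now.
    def stateAsDef: Frame = Frame(ActiveSession.get)
  }

  /** Returns the session names seen through the val and through the def. */
  def demo(): (String, String) = {
    ActiveSession.set(Session("A"))
    val log = new CachedLog            // cached while session A is active
    ActiveSession.set(Session("B"))    // a different session uses the cached log
    (log.stateAsVal.session.name, log.stateAsDef.session.name)
  }
}
```

In this model the `val` keeps pointing at session A even when session B is the caller, which is the kind of stale binding the commit removes by switching to `def`.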
Authored-by: liwensun Signed-off-by: liwensun GitOrigin-RevId: 406ae31b224f276921c9952b044866d6ee48406c commit 02fa7d94edf6181b7bfcb698896a252d866be405 Author: Shixiong Zhu Date: Tue Jun 25 22:01:07 2019 +0000 [SC-19030] Add a new API to invalidate DeltaLog for a path Add a new API to invalidate the DeltaLog for a path. Jenkins Author: Shixiong Zhu GitOrigin-RevId: 9f34f1ab9fb81d9aa78bcc110f87b3c54a2099be commit e7efebc1cdc1e5e056bd1c6989b16f2660af7147 Author: Maxim Gekk Date: Fri Jun 21 10:36:17 2019 +0000 [SC-18743] Minor code cleanup GitOrigin-RevId: 969d5fd29b01e916e2367ee0a59f48ec9122c1dc commit bbc7c981deaaf3a3171279177ef82384ea172886 Author: Liwen Sun <36902243+liwensun@users.noreply.github.com> Date: Thu Jun 20 13:44:21 2019 -0700 [DELTA-OSS-EXTERNAL] Update readme for 0.2.0 - update the latest version to 0.2.0 - point storage and concurrency control to docs. Closes delta-io/delta#74 Closes #5496 from liwensun/092oqefd. Authored-by: Liwen Sun <36902243+liwensun@users.noreply.github.com> Signed-off-by: liwensun GitOrigin-RevId: a1ed89c626d374cd0c353d054754f0372969a302 commit 6b81231cecbedded552e5ab542fbcd358f8caf46 Author: Zhitong Yan Date: Thu Jun 20 20:41:24 2019 +0000 Export delta table to oss ## What changes were proposed in this pull request? Added support for exporting DeltaTable.scala to OSS. ## How was this patch tested? By DeltaTableSuite.scala zhitong.yan@databricks.com Author: Zhitong Yan #5482 is resolved by ZhitongDB/export-delta-table-to-OSS. GitOrigin-RevId: bbad3c44d0e36707c660d29ee4d9e413e68eb988 commit f88fc36669831d29c6e860042b301d2782810e69 Author: liwensun Date: Tue Jun 18 11:34:37 2019 -0700 Setting version to 0.2.1-SNAPSHOT commit ae3daa85be7cfb574a83f8d73eb10920243e4014 Author: liwensun Date: Tue Jun 18 11:33:18 2019 -0700 Setting version to 0.2.0 commit d85727026506649e3f2716d46f72d1eeb2089acb Author: liwensun Date: Mon Jun 17 14:03:43 2019 -0700 [DELTA-OSS] Small edits on readme ## What changes were proposed in this pull request?
small edits on readme ## How was this patch tested? existing tests. Closes #5455 from liwensun/readme-edits-0.2.0. Authored-by: liwensun Signed-off-by: liwensun GitOrigin-RevId: 4fccf5635133264662b632919e6f87cdbae7c54f commit 973e34a15a59d693a146a916343cb60fccf46507 Author: Liwen Sun <36902243+liwensun@users.noreply.github.com> Date: Mon Jun 17 11:53:15 2019 -0700 [DELTA-OSS-EXTERNAL] Update Readme for changes in 0.2.0 a.t.t. Closes delta-io/delta#67 Closes #5448 from liwensun/k2rge5ic. Lead-authored-by: liwensun Co-authored-by: Liwen Sun <36902243+liwensun@users.noreply.github.com> Signed-off-by: liwensun GitOrigin-RevId: 44b45f5ae869be13ba1fb2ddef67aecbb4adcc50 commit bf6efb6bc1f228f893333b0b84f79fc17ab496b2 Author: Shixiong Zhu Date: Fri Jun 14 21:12:56 2019 +0000 [SC-18796]Delta checkpoint should not fail because of FileAlreadyExistsException ## What changes were proposed in this pull request? When a stage gets retried, a zombie task may still run and write the checkpoint file. This will cause the new tasks in the retried stage to fail with FileAlreadyExistsException on **S3**. Since the zombie task actually writes the same checkpoint, we can just make the new task succeed if the checkpoint file exists. I also added a TaskFailureListener to delete temp files when a task fails. ## How was this patch tested? Jenkins Author: Shixiong Zhu #5401 is resolved by zsxwing/SC-18796. GitOrigin-RevId: a01ed042ab1f59c32b3e2620b2d56600420724a5 commit b6aa8c199fcbe7ef68c23513e72aa148de8c1d0e Author: Liwen Sun <36902243+liwensun@users.noreply.github.com> Date: Fri Jun 14 07:13:45 2019 +0000 [DELTA-OSS-EXTERNAL] Change the LogStore conf key name Right now the LogStore conf key is "delta.logStore.class", but this is not getting picked up because Spark requires Spark conf keys to start with "spark.". So we change this to "spark.delta.logStore.class". Closes delta-io/delta#66 Author: Liwen Sun <36902243+liwensun@users.noreply.github.com> #5440 is resolved by liwensun/x310tcc3.
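
A configuration sketch for the renamed key, assuming Spark is on the classpath. Only the key `spark.delta.logStore.class` comes from the commit above. The class name `S3SingleDriverLogStore` appears elsewhere in this log, but its package (`org.apache.spark.sql.delta.storage`) and the application name are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

object LogStoreConfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("logstore-conf-sketch") // hypothetical application name
      .master("local[*]")
      // The key must carry the "spark." prefix to be picked up as a Spark conf.
      // The fully qualified class name is an assumption, not taken from this log.
      .config("spark.delta.logStore.class",
        "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
      .getOrCreate()

    // Read the value back to confirm it was accepted as a Spark conf.
    println(spark.conf.get("spark.delta.logStore.class"))
    spark.stop()
  }
}
```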
GitOrigin-RevId: 3554916081e8be86bd8e25630f285f9e3809831f commit c3805a8fed12544c8450928ce839c84b0091306a Author: liwensun Date: Fri Jun 14 03:02:53 2019 +0000 [SC-17326][DELTA-OSS] Allow concurrent appends In Delta OSS: - Add `checkRetry` - Allow concurrent appends New unit tests Author: liwensun Author: Liwen Sun GitOrigin-RevId: b4cd65f15402263621021667d7259295ff951f9a commit b1bdec75b9e4df249c2c6df9ab02c675d92b2795 Author: Rahul Mahadev Date: Thu Jun 13 15:27:38 2019 -0700 [SC-18596][DELTA] ZORDERing on a partition column should throw a better error message ## What changes were proposed in this pull request? Changed the error message when Z-Ordering is done on a partitioned column. ## How was this patch tested? Unit test available with this commit. Closes #5421 from rahulsmahadev/SC-18596. Lead-authored-by: Rahul Mahadev Co-authored-by: rahulsmahadev <51690557+rahulsmahadev@users.noreply.github.com> Signed-off-by: Rahul Mahadev GitOrigin-RevId: d0f9b6913c1338396b666b19aa0cec65b22a0e8d commit b3b3ccf65eff6ccac6f5837192b46d4e31103d3e Author: Naoki Takezoe Date: Thu Jun 13 06:20:22 2019 +0000 [DELTA-OSS-EXTERNAL] Minor code enhancements Includes following fixes: - Use StandardCharsets.UTF_8 instead of “utf-8” - Remove unnecessary return statement - Fix typo in error message - Fix warnings Closes delta-io/delta#60 Author: Naoki Takezoe #5426 is resolved by zsxwing/rlcw8og1. GitOrigin-RevId: 20e1f065e38c1e73e621221a738e593ab87bfe01 commit 2d77fca9164dc01fcdd0b7704625de683f62a2a1 Author: feiwang Date: Wed Jun 12 14:13:26 2019 -0700 [DELTA-OSS-EXTERNAL] Fix code style for the naming of variables. The naming of `checkpointMetaData` and `checkpointMetadata` is confusing. Closes delta-io/delta#64 Closes #5422 from zsxwing/2s78063f. 
Authored-by: feiwang Signed-off-by: Shixiong Zhu GitOrigin-RevId: 3d23e386387ab56391242630c477d9110c32521f commit 352eaec8b203c4b1186cb8b887fadd1efd806f57 Author: liwensun Date: Mon Jun 10 16:57:55 2019 -0700 [SC-18394][DELTA] Remove HDFSLogStore and other LogStore renames - Rename `FileSystemLogStore` to `HadoopFileSystemLogStore` - Rename `S3LogStore` to `S3SingleDriverLogStore` - Add a new `LocalLogStore` and make it the default LogStore implementation for DBR, while `HDFSLogStore` will be the default for OSS. Existing tests. Author: liwensun GitOrigin-RevId: 0ff96c6015a50b30855251b408fc331c15c05cd3 commit e380e5a6149447fe2cddcc4f03d9e2e851fbad41 Author: Burak Yavuz Date: Mon Jun 10 12:36:02 2019 -0700 [SC-18778] Clean-up some unused configurations ## What changes were proposed in this pull request? This PR cleans up some unused configurations. ## How was this patch tested? Existing tests Closes #5399 from brkyvz/obscureConfs. Lead-authored-by: Burak Yavuz Co-authored-by: Burak Yavuz Signed-off-by: Burak Yavuz GitOrigin-RevId: 63071a2c77a23e789f4944121a91b8f5c7d33d50 commit 7af3bf57477ffdfe70214f379a5f204915d21f74 Author: Burak Yavuz Date: Fri Jun 7 11:05:45 2019 -0700 [SC-18732] Remove unused configs from DeltaSQLConf Clean up some unused configurations from DeltaSQLConf. Build Closes #5387 from brkyvz/aoRemove. Authored-by: Burak Yavuz Signed-off-by: Burak Yavuz GitOrigin-RevId: 428bef6a9485b1547aeba1eaace61cf204bf6369 commit f1c39c362f8bc0032cf9934dc045917ed5874c82 Author: Tathagata Das Date: Thu Jun 6 22:01:19 2019 +0000 [SC-18233][Delta] Fix logStore class config to not have "databricks" The log store class config is "spark.databricks.tahoe.logStore.class" which was not fixed for Delta Lake. So this PR changes the config to "delta.logStore.class". Existing unit tests and new unit tests. 
Author: Tathagata Das GitOrigin-RevId: bd2903849c01eb614848142635de93176636e8f5 commit db90371105daae2bdc6dc50dcabfa0fa4ed4c69d Author: liwensun Date: Tue Jun 4 13:55:39 2019 -0700 [SC-18267][DELTA] Cleanup import in LogStore suite as the title existing tests. Closes #5352 from liwensun/sc18267-mixin. Authored-by: liwensun Signed-off-by: liwensun GitOrigin-RevId: f7a608b61073c15f2275fd7e8ca1a32032d1f6bc commit 27d86b08384c0eafb49099314669b2561b4cee3c Author: Shixiong Zhu Date: Fri May 31 20:49:13 2019 +0000 [DELTA-OSS]Display test duration for Delta OSS tests ## What changes were proposed in this pull request? The whole build right now takes about 20 minutes. It's better to display the test duration so that we can find out which tests take too long. ## How was this patch tested? Jenkins Author: Shixiong Zhu #5326 is resolved by zsxwing/show-test-duration. GitOrigin-RevId: 42c65974a708cd3d3e221e2d698c78e4d11f06b0 commit c8169bd106e1a509d4138c5d1655457aa11179c5 Author: liwensun Date: Wed May 29 23:36:12 2019 +0000 [SC-18034][DELTA-OSS] S3 Support Add a S3 LogStore implementation. new unit tests Author: liwensun GitOrigin-RevId: 5071e09398fd7237d4ec3de4d3ed80103ec0371f commit ae4aa3c8450f548ddecb36f54f084a16b59c727e Author: liwensun Date: Wed May 29 00:19:55 2019 +0000 [SC-18133][DELTA] Add a `isPartialWriteVisible` interface in LogStore The writing of checkpoint files doesn't go through log store, so we add an interface `isPartialWriteVisible` to `LogStore` to let the out-of-band writers know whether to use rename or not. This is a temporary solution - ultimately it would be good to encapsulate this information within log store. Add a simple end to end test for different log store implementations. Author: liwensun GitOrigin-RevId: 5bf04c818d702ea24d28bed981f838d164a8d373 commit 43a7a7a1304a00cefdfc7dfcf1c98077ee2e54ef Author: Cheng Lian Date: Tue May 28 22:35:49 2019 +0000 [SPARK-24601][SPARK-27051][CHERRY-PICK] Update Jackson to 2.9.8 Existing tests. 
Author: Cheng Lian GitOrigin-RevId: 4b9428c7939dabd12f3380ef70f090eebe32e3e2 commit 95e4d0d8c9be5e7494d16b52dc3664e00a5de93b Author: Kaushal Prajapati Date: Tue May 28 19:35:27 2019 +0000 [DELTA-OSS-EXTERNAL] Delta conf fix Closes delta-io/delta#26 Author: Kaushal Prajapati #5288 is resolved by mukulmurthy/cktxuu1w. GitOrigin-RevId: 79c7e9018710fcd63a8a00dbe8d3c7272306cdf4 commit 6c34e7c49422a9ee21adf9f6b3811c3b9f8bf34b Author: Naoki Takezoe Date: Tue May 28 19:23:49 2019 +0000 [DELTA-OSS-EXTERNAL] Use Set.empty instead of Set() in DelayedCommitProtocol Closes delta-io/delta#56 Author: Naoki Takezoe #5289 is resolved by mukulmurthy/bx50c6g3. GitOrigin-RevId: 7f7068bbb417f2f11706fedc5f5d983c24bfc9f9 commit fcbe2e7fe002a5285a2a68fbf06390567c033dbc Author: Shixiong Zhu Date: Tue May 28 07:13:56 2019 +0000 [SC-17993]Fix an issue when re-adding the same file If deleting a file and re-adding it back, a `Snapshot` may contain both `AddFile` and `RemoveFile` for this path. If this `Snapshot` is used to build a new `Snapshot`, as `input_file_name` for AddFile and `RemoveFile` is `null` and the order of them is not stable, InMemoryLogReplay may just delete this new added file if its `RemoveFile` is after `AddFile`. This PR updates InMemoryLogReplay to remove the tombstone when a file is added. This will ensure `InMemoryLogReplay` always output only one `FileAction` (either `AddFile` or `RemoveFile`) for each path. This PR also removes unused `stateSize` and `hadoopConf` from `InMemoryLogReplay`. The new unit test. Author: Shixiong Zhu GitOrigin-RevId: 7c3e96c2d344d7c08383019ac2bbfda22f0f6119 commit d29a0e1a797be2b6911fc26cea92649b4ed11424 Author: Liang-Chi Hsieh Date: Thu May 23 22:43:42 2019 +0000 [DELTA-OSS-EXTERNAL] Fix wrong data when recording event in protocolWrite In `protocolWrite`, it records event with wrong data `minReaderVersion`, currently. This patch goes to fix that. This also fixes few other typos and style. 
Closes delta-io/delta#45 Author: Liang-Chi Hsieh #5262 is resolved by mukulmurthy/j98bt8j6. GitOrigin-RevId: 1d996b3aaed93a5657f811c90f032312e14411f0 commit e20843875da4ac35483c8261e88f8819158621b6 Author: Naoki Takezoe Date: Thu May 23 22:23:36 2019 +0000 [DELTA-OSS-EXTERNAL] Fix typo in an argument name of OptimisticTransactionImpl Closes delta-io/delta#55 Author: Naoki Takezoe #5261 is resolved by mukulmurthy/60bvtd9h. GitOrigin-RevId: 47b33a8841d10022985be3462866c04090cd8525 commit 35d0e0039b03351d00a3b672b96eef80c17105bc Author: runzhliu Date: Thu May 23 15:09:25 2019 -0700 [DELTA-OSS-EXTERNAL] remove useless imports Obviously, there is currently no check for code styles including useless import, now I remove all of the useless imports. Closes delta-io/delta#12 Closes #5260 from mukulmurthy/6020wsjz. Lead-authored-by: runzhliu Co-authored-by: Mukul Murthy Co-authored-by: Mukul Murthy <38224594+mukulmurthy@users.noreply.github.com> Signed-off-by: Mukul Murthy GitOrigin-RevId: af35df3c7e881336641b182743c92a2aedfb1d6e commit eed0dcc3695befbee060017140435e2e79f59190 Author: Shixiong Zhu Date: Wed May 22 20:19:53 2019 +0000 [SC-17875]Parsing interval Delta config should be case-insensitive ## What changes were proposed in this pull request? Some Delta configs accept an interval string. However, they only accept lower case. This is inconvenient. This PR forks `CalendarInterval.fromCaseInsensitiveString` from https://github.com/apache/spark/pull/24619 and uses it to parse the interval string to support upper case. It also improves the error message when the input string is an invalid interval (Right now it just throws NPE). ## How was this patch tested? New unit test. Author: Shixiong Zhu #5213 is resolved by zsxwing/SC-17875. 
GitOrigin-RevId: d4d9cc0c2684096812be5697d1c093f4b000ce51 commit a98550011591d78d55bd6e003ce455b9a9f7a232 Author: liwensun Date: Wed May 22 00:14:47 2019 +0000 [SC-18029][DELTA] Azure support This PR adds Azure support for Delta LogStore: - Create a class `FileSystemLogStore` with a default implementation for any underlying storage that implements hadoop FileSystem APIs (e.g., Azure and S3) - Create a `AzureLogStore` that extends `FileSystemLogStore` and relies on Azure's atomic rename. We will make a S3 implementation that extends `FileSystemLogStore` in a separate PR. Checkpoints should also use rename for AzureLogStore. We will do that in a follow up PR. New unit tests on basic operations. Author: liwensun GitOrigin-RevId: a800eacf77ea7199b259d232e2040a9ee93ffd71 commit c28f7ce4c308ca8f0f9ad89e336da41e8d770f64 Author: Joe Ellis Date: Tue May 21 02:58:11 2019 -0700 [DELTA-OSS-EXTERNAL] Fix spelling Closes delta-io/delta#19 Closes #5235 from tdas/yaviiaq9. Authored-by: Joe Ellis Signed-off-by: Tathagata Das GitOrigin-RevId: 040df282385745dcbdaca50df6edac4c1b4fa4f7 commit 81601a1b025c4cc925d8e3b043fd8a046d59d6fc Author: Naoki Takezoe Date: Tue May 21 00:47:50 2019 -0700 [DELTA-OSS-EXTERNAL] Fix Scaladoc of DeltaLogging.scala DeltaLogging's Scaladoc says that underneath it uses `com.databricks.spark.util.UsageLogging`, but I guess it's `com.databricks.spark.util.DatabricksLogging`. Closes delta-io/delta#25 Closes #5230 from tdas/hdnle5x7. Authored-by: Naoki Takezoe Signed-off-by: Tathagata Das GitOrigin-RevId: 6cf2878e70baf7a4544d028dc7ee823854120499 commit e3b0b8ef4a1854414598e7f8a3e95c2af2fd98ce Author: merrily01 Date: Tue May 21 00:39:38 2019 -0700 [DELTA-OSS-EXTERNAL] Fix scalastyle errors of DeltaLogging.scala ## What changes were proposed in this pull request? Fix scalastyle errors of DeltaLogging.scala. Change "// scalastyle:off on" to "// scalastyle:on println" to ensure the validity of grammar checks. 
Closes delta-io/delta#36 Closes #5225 from tdas/znzmjb67. Authored-by: merrily01 Signed-off-by: Tathagata Das GitOrigin-RevId: 7b0826f7171a7c40704957116d456d6a42b67b23 commit 015597667d42184cb0b3efe6668b38aeb815f03c Author: liwensun Date: Sun May 19 19:59:45 2019 +0000 [ALL TESTS][SC-17838][DELTA] Migrate EvolvabilitySuite ## What changes were proposed in this pull request? Migrate EvolvabilitySuite ## How was this patch tested? New tests for OSS. Author: liwensun #5167 is resolved by liwensun/sc17838-evolvability-suite. GitOrigin-RevId: c2db307ef36041f53aba681559c21974f98c844e commit 22fe83946aae84b45916012cc7203720b5f6f131 Author: liwensun Date: Sat May 18 21:27:21 2019 +0000 [SC-17690][DELTA] Migrate DeltaSuite to OSS Port DeltaSuite to OSS Moving around existing tests Author: liwensun GitOrigin-RevId: 72cb23c9e308e4cd79c1b5f7620cf5812afc85df commit 97f132083b1442715fd33867619653dcf3a4abbe Author: Shixiong Zhu Date: Fri May 17 23:02:12 2019 +0000 [DELTA-OSS]Remove PGP plugin for OSS Delta ## What changes were proposed in this pull request? We enabled automatic content signing to simplify the release process. The PGP plugin is not needed any more, so just remove it. I also removed `bintrayReleaseOnPublish in ThisBuild := false`. Turning this off will make the release invisible until someone goes to Bintray and clicks a button. This was added basically for release testing and it's not needed now. ## How was this patch tested? Manually pushed a test release to Bintray and confirmed that Bintray signed files for us. Author: Shixiong Zhu #5203 is resolved by zsxwing/remove-pgp-plugin. GitOrigin-RevId: a2428c2bd35bf119c0310625734c539384b4d0ba commit d1c55a74d0ddfbce02f783bd08e34a9acbde00bb Author: liwensun Date: Wed May 15 21:36:01 2019 +0000 [SC-17737][DELTA] DeltaRetentionSuite migration and cleanup ## What changes were proposed in this pull request? Migrate DeltaRetentionSuite, also some cleanup ## How was this patch tested? Existing tests. New tests for OSS.
Author: liwensun #5165 is resolved by liwensun/sc17737-delta-retention-suite. GitOrigin-RevId: 45ed25679a9151e422e0a1b99b11db18a12be300 commit 21af841058f3aedbec93877f8385548c6cdf5289 Author: Mukul Murthy Date: Wed May 15 12:44:30 2019 -0700 [SC-17535] DeltaSourceOffset should not write null json field ## What changes were proposed in this pull request? DeltaSourceOffset should not write null JSON field. An earlier change updated our serialization to always include null values meant we were writing out a `"json": null` in the offset. While this is not wrong by itself, https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeqLog.scala expects that the serialized field is either not present or nonnull. We fix this by making the json in DeltaSourceOffset a def instead of a val so it doesn't get serialized as null. ## How was this patch tested? New unit test Closes #5084 from mukulmurthy/bug. Authored-by: Mukul Murthy Signed-off-by: Mukul Murthy GitOrigin-RevId: b9455563eaf6df087e5b165be7d1f5d673a131e6 commit 60dc435ead5e4d462f7b3ae72ca700bafcc88d6d Author: Burak Yavuz Date: Tue May 14 16:12:20 2019 -0700 [SC-17862][DELTA] Fix time travel when the partitioning of a table changes ## What changes were proposed in this pull request? When the partitioning of a table changes with `mode("overwrite").option("overwriteSchema", true)`, we use the latest partitionSchema in the file index. This PR fixes that bug. ## How was this patch tested? Unit test Closes #5184 from brkyvz/ttPartition. Authored-by: Burak Yavuz Signed-off-by: Burak Yavuz GitOrigin-RevId: 4d10e57dd61b4fbb1a62d03be302fc404390d385 commit 1977fd2d4b35a9c5aab6c51c38db1e5e3830bb01 Author: liwensun Date: Mon May 13 07:17:53 2019 +0000 [SC-17689][DELTA] DeltaLogSuite migration and cleanup This PR cleans up and migrates DeltaLogSuite. Existing tests. 
N/A Author: liwensun GitOrigin-RevId: 9cda2cd5e50f98cb3547a3e93dc777c6f790fbd2 commit bcf4d92bb1668f42dff85da2ab1456bd38f45b13 Author: liwensun Date: Sun May 12 07:15:32 2019 +0000 [SC-17804][DELTA] Re-enable OSS tests Existing tests Author: liwensun GitOrigin-RevId: 38aaec6880cbe00cbd0438268968f02031bafb1d commit 6febef767b960d400f7200bf64f37c1abac4101e Author: Mukul Murthy Date: Thu May 9 21:30:16 2019 -0700 [SC-17683][DELTA] Fix configs Lead-authored-by: Mukul Murthy Co-authored-by: Adrian Ionescu Signed-off-by: Mukul Murthy GitOrigin-RevId: 06f2a3a1dddfcc345b5e591368e7346573698c6e commit 0e9ff7030a259b33ff02959e7434fdda3c459258 Author: Tathagata Das Date: Fri May 10 00:15:16 2019 +0000 [SC-17706] Remove config Author: Tathagata Das GitOrigin-RevId: 1ee42d89d826bec32cf4f12040201c11ebb73fca commit b36b5ed88265c695f03ef183c673fb7f04c0c84d Author: Tathagata Das Date: Tue May 7 12:57:12 2019 -0700 [SC-17155][DELTA] Add a CommitInfo flag to identify when a transaction is a blind append or not - Added a flag to identify in future whether a transaction is a blind append or not Updated tests Authored-by: Tathagata Das Signed-off-by: Tathagata Das GitOrigin-RevId: ca272b767ee0f1b630f9d459a7bc2f2beb870014 commit 00221f6bcc3fbcc4929d93c910f78d634d12a551 Author: Burak Yavuz Date: Fri May 3 17:00:30 2019 -0700 [SC-17606] Turn off metadata queries for Delta The rule is not ready for production yet. Closes #5106 from brkyvz/statsOFf. 
Lead-authored-by: Burak Yavuz Co-authored-by: Burak Yavuz Signed-off-by: Burak Yavuz GitOrigin-RevId: 7813d59418f4795464d3a2fde335e215a57f7ffa commit f28c295903b099455bc84c51ba88c30aa12bd75f Author: liwensun Date: Thu May 2 23:52:31 2019 +0000 [SC-17605][DELTA] Catch all NonFatal exception when reading a checksum file Tested by existing tests Author: liwensun commit 513e6c03bb89ef36162b70d34c7117e7a728072a Author: Burak Yavuz Date: Wed May 1 13:50:07 2019 -0700 [SC-16426][DELTA] Reduce listing load when rewriting Delta checkpoints Existing tests Closes #4730 from brkyvz/smarterRecreation. Authored-by: Burak Yavuz Signed-off-by: Burak Yavuz commit f72bb4147c3555b9a0f571b35ac4d9a41590f90f Author: liwensun Date: Thu May 2 22:40:12 2019 +0000 [SC-16833][Delta] Add startVersion to CommitStats ## What changes were proposed in this pull request? right now we don't record the version read by the txn in CommitStats. `readVersion` sounds like it but it's not. so we add a new field `startVersion` and add docs also change variable names to make this less confusing. ## How was this patch tested? modify existing test to include the new field ## **IMPORTANT** Warmfix instructions N/A Author: liwensun #4912 is resolved by liwensun/sc16833-startVersion. GitOrigin-RevId: ab874d31875af6c31c79857c46041b753f7858c1 commit a9e752a5cc6f20f7933450ee068272e5e334f221 Author: Tathagata Das Date: Thu May 2 14:54:24 2019 -0700 Updated README regarding storage system support ## What changes were proposed in this pull request? Minor change Closes #5092 from tdas/delta-readme. Authored-by: Tathagata Das Signed-off-by: Tathagata Das GitOrigin-RevId: ad07af6c785d96a0a3fbf9e4af4a734ce4590050 commit f08ee6aa6f700dc653d253e2f14498367fecbeb0 Author: Burak Yavuz Date: Thu May 2 14:32:46 2019 -0700 [SC-17180] Ensure that partition values are being json serialized even if they are null ## What changes were proposed in this pull request? 
We need to ensure that partition values are always json serialized in the `AddFile` actions even if they are null. Potential Jackson upgrades may cause them to be dropped, and we would like to avoid that to maintain forwards compatibility of Delta Lake. ## How was this patch tested? Unit test Closes #5098 from brkyvz/addFileNullParts. Authored-by: Burak Yavuz Signed-off-by: Burak Yavuz GitOrigin-RevId: 012a7c0889a38d708b2c6fc09437cec6794edf68 commit 62708c4931e101d8942f80e399e0c7f19feb9e2f Author: Ivan Sadikov Date: Thu May 2 15:29:16 2019 +0000 [SC-17420] Tolerate IOExceptions when reading Delta LAST_CHECKPOINT file ## What changes were proposed in this pull request? This PR updates `loadMetadataFromFile` method to retry when IOException was thrown while reading last checkpoint file from log store. Author: Ivan Sadikov #5053 is resolved by sadikovi/SC-17420-delta-last-checkpoint-file. GitOrigin-RevId: 956b9c4c917b5b00305c16f6a395afad2d9ba15d commit c1367a76542bf0a9e71ad776db5f4434c7b73aee Author: David Lewis Date: Thu May 2 13:49:21 2019 +0000 [SC-17165][SC-17138] Improvements to ZOrder Optimize Minor changes New unit tests Author: David Lewis GitOrigin-RevId: 966a1486e71ff641aec378164015d11de9f3130b commit cac0235c74e3b17047bb372fdf9f572342e147df Author: Burak Yavuz Date: Wed May 1 17:07:35 2019 -0700 [SC-17512][DELTA][EDGE] Improvements to file size on optimize Minor improvements to optimize Unit tests Lead-authored-by: Burak Yavuz Co-authored-by: Burak Yavuz Signed-off-by: Burak Yavuz GitOrigin-RevId: 9125d63787b1ec53ad9eb20b582cbf8402173612 commit 92e58a6fd08f93bd3b684d5df10c2fd0a240addc Author: Shixiong Zhu Date: Wed May 1 22:16:07 2019 +0000 [SC-17484] Apply scala style check for compile task ## What changes were proposed in this pull request? The initial scala style check commit only checks the test codes. This PR fixes it by adding a task to the compile stage. Author: Shixiong Zhu #5077 is resolved by zsxwing/compile-check. 
GitOrigin-RevId: 613910a7afc6e23a489bf618f305f022b54b11ac commit 9905bae5a05e4ccda97e34819e166cd07aa78b67 Author: Burak Yavuz Date: Wed May 1 13:50:07 2019 -0700 [SC-16426][DELTA][EDGE] Reduce listing load when rewriting Delta checkpoints ## What changes were proposed in this pull request? ## How was this patch tested? Existing tests Closes #4730 from brkyvz/smarterRecreation. Authored-by: Burak Yavuz Signed-off-by: Burak Yavuz GitOrigin-RevId: 5854568bfac3606ba5c1acaeadc48d44385735e8 commit 34883b349c151f5267945cb71c69b9bc5c3f6bde Author: Tathagata Das Date: Wed May 1 09:26:14 2019 +0000 Revert "[SC-16831] Always include null fields when serializing Delta metadata" This reverts commit for "Always include null fields when serializing Delta metadata" Author: Tathagata Das #5085 is resolved by tdas/revert-json-null. GitOrigin-RevId: d4dcd21012632bea996a6209379c01bb4dbd3180 commit da3734e62b6698045a4e2b07286aac7f24cee44b Author: Shixiong Zhu Date: Mon Apr 29 16:09:25 2019 -0700 [SC-17484] Port Apache Spark Scala style check to Delta ## What changes were proposed in this pull request? Port Apache Spark Scala style check to Delta so that we are using the same Scala style check rules. Closes #5059 from zsxwing/scalastyle. Authored-by: Shixiong Zhu Signed-off-by: Shixiong Zhu GitOrigin-RevId: ce2f69a1431446bb4e657c6bd2156ff7e41774fd commit ee88c737770182918796c06041272ab44277bdee Author: Mukul Murthy Date: Mon Apr 29 22:38:32 2019 +0000 [SC-16831] Always include null fields when serializing Delta metadata ## What changes were proposed in this pull request? Always include null fields when serializing Delta metadata. This is to ensure compatibility across Jackson Databind versions, some of which remove null entries that may then result in errors when trying to read partition columns with null values. Author: Mukul Murthy #4910 is resolved by mukulmurthy/16831-null. 
GitOrigin-RevId: 85873c5f379c5a483ceacdf5158e9d031d3bf251 commit 282a50d875d22b6df8b3b60b5cbf8e8874ed57a0 Author: Wesley Hsiao Date: Mon Apr 29 17:48:14 2019 +0000 [SC-17067] Add missing copyright headers to databricks scala files Author: Wesley Hsiao GitOrigin-RevId: 8496201f1619bf5495e31648cfebfb2f7a256b7e commit 9a0c1b1d6dd97265c85de6feb7c4c44f2ea6ceb2 Author: Mukul Murthy Date: Mon Apr 29 17:18:06 2019 +0000 [DELTA-OSS] Update comment in LogStore to match README ## What changes were proposed in this pull request? Update comment in LogStore to match README. Scaladoc for LogStore.scala is currently missing one condition required by the storage system, according to the README: https://github.com/delta-io/delta/blob/master/README.md Author: Mukul Murthy #5048 is resolved by mukulmurthy/comment. GitOrigin-RevId: 5b4542731da15d79922a6ab165cd57d622920a3f commit 8d9db348f6b920133ad221ddc490f42ac41252e4 Author: Wesley Hsiao Date: Mon Apr 29 07:12:52 2019 +0000 [SC-17067] Update other scala files with new copyright header Author: Wesley Hsiao GitOrigin-RevId: 806112d4fe5c221c283c4cc8ef031b5bd29b8ab8 commit 31ff76515e4d610a03c324831f959d2d6cdf975e Author: Yuming Wang Date: Fri Apr 26 13:45:49 2019 -0700 [EXTERNAL] Fix broken NOTICE link [EXTERNAL] Fix broken NOTICE link It throws `No such file or directory` when running test in IDEA: ![image](https://user-images.githubusercontent.com/5399861/56707622-340db600-674c-11e9-9e60-ca706ab8d3a9.png) The reason is that `NOTICE` is changed to `NOTICE.txt`. This PR update the link to `NOTICE.txt`. 
**After this PR**: ![image](https://user-images.githubusercontent.com/5399861/56707853-2c9adc80-674d-11e9-9876-dfc26c56787e.png) Closes delta-io/delta#8 Author: Yuming Wang GitOrigin-RevId: af1ec7a0737eba33b3fba44a2be7e6b96c13fc17 commit faaa0bca969baff09b7c19e135b1a51fd0812ade Author: Shixiong Zhu Date: Tue Apr 23 22:40:10 2019 -0700 Setting version to 0.1.1-SNAPSHOT commit 81a384231ecc11a518e03ebef853256f62416818 Author: Shixiong Zhu Date: Tue Apr 23 22:38:07 2019 -0700 Setting version to 0.1.0 commit 39243a3e822dc4df74e5af30063b341628cddd25 Author: Shixiong Zhu Date: Tue Apr 23 22:21:11 2019 -0700 Prepare for 0.1.0 release commit 08c6e75b84b1dc18d663c6cc97118fd34298f3c8 Author: Tathagata Das Date: Tue Apr 23 19:00:52 2019 -0700 [DELTA-OSS] Add Link to quickstart in README ## What changes were proposed in this pull request? Added link ## How was this patch tested? No test! YOLO! Closes #5007 from tdas/link-to-quickstart. Authored-by: Tathagata Das Signed-off-by: Tathagata Das GitOrigin-RevId: 3cce3373de1667464a8d3a5aa4687cfb1bdbc7ca commit 8bf5f3730cce84fa830afb3d7ac512c6daeb7623 Author: Tathagata Das Date: Tue Apr 23 18:39:14 2019 -0700 [DELTA-OSS] Fix comments ## What changes were proposed in this pull request? Fix a few comments. ## How was this patch tested? None Closes #5006 from tdas/fix-comment. Authored-by: Tathagata Das Signed-off-by: Tathagata Das GitOrigin-RevId: 045242d51e86dc16abbeb91c0e3683a016eb5447 commit 7bf79ba1573e7afb3159f3dd2f236a6231666772 Author: Stephanie Bodoff Date: Tue Apr 23 18:04:07 2019 -0700 Fix typos, correct product names. ## What changes were proposed in this pull request? Fix typos and correct product names in the README. ## How was this patch tested? Not needed. (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. 
Authored-by: Stephanie Bodoff Signed-off-by: Tathagata Das GitOrigin-RevId: d46cc11e15190d2179445c24599a626551214c3b commit 8862c4a95d686f0827e01fbbedb01dad9b7ca8db Author: Tathagata Das Date: Tue Apr 23 15:06:18 2019 -0700 [DELTA-OSS] Fix license ## What changes were proposed in this pull request? Fixed license management in delta files. ## How was this patch tested? manually running copybara and checking whether the license is correct. Closes #5003 from tdas/fix-license. Authored-by: Tathagata Das Signed-off-by: Tathagata Das GitOrigin-RevId: b0d705d936dce3f4acdfbd75fc68470b4dff1bd5 commit 0ac0a43ce47bef56ba8d7a924c89115edd45b5bc Author: Stephanie Bodoff Date: Tue Apr 23 14:14:25 2019 -0700 Remove dup and fix GH. ## What changes were proposed in this pull request? Fix dup phrase and GitHub spelling in Delta Lake contributing doc. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. Authored-by: Stephanie Bodoff Signed-off-by: Tathagata Das GitOrigin-RevId: 734434805f3ef2180b4b1bf8ed259042b76311b7 commit 16e087979ebd41ff3e893c8735a0e64e913de78b Author: Shixiong Zhu Date: Tue Apr 23 10:32:56 2019 -0700 [SC-17193] Fix apache spark build and add DeltaTimeTravelSuite back ## What changes were proposed in this pull request? Fix apache spark build and add DeltaTimeTravelSuite back along with other minor cleanup. ## How was this patch tested? Jenkins Closes #4981 from zsxwing/SC-17193. Authored-by: Shixiong Zhu Signed-off-by: Shixiong Zhu GitOrigin-RevId: 96cf836312c839039914bc2835c8f4fbb1360763 commit d6e5f67a60b5c83a436bd92482c10a329ca1e0e7 Author: Wesley Hsiao Date: Mon Apr 22 22:41:37 2019 -0700 [SC-17067] Update scala files with new copyright header ## What changes were proposed in this pull request? 
jira: https://databricks.atlassian.net/browse/SC-17067 This pr is the batch copy right header update for scala files. Make sure scala styles pass the lint check and do not change any of the file permissions. ## How was this patch tested? Command to update files: `python dev/update_databricks_copyright_header.py` Please review http://spark.apache.org/contributing.html before opening a pull request. Authored-by: Wesley Hsiao Signed-off-by: Wesley Hsiao GitOrigin-RevId: 9b8148676c22aa3df29e9f579069ea2fb6cee1cd commit 565de18c11a3e536c65538c97f13f4722bc8f98e Author: Wesley Hsiao Date: Mon Apr 22 22:33:12 2019 -0700 [SC-17067] Other files' copyright header being updated ## What changes were proposed in this pull request? Updating other files copyright headers. - R - Sh - Proto - Xml - css - js - sky - sh - g4 ## How was this patch tested? Command to update files: python dev/update_databricks_copyright_header.py Please review http://spark.apache.org/contributing.html before opening a pull request. Authored-by: Wesley Hsiao Signed-off-by: Wesley Hsiao GitOrigin-RevId: b98ae40a6f9423787a3ea213cbdd895344e2002f commit cd508b443dc655617a6841b32bc9f1905a721fe5 Author: Michael Armbrust Date: Mon Apr 22 15:42:57 2019 -0700 Update README.md GitOrigin-RevId: cde50a92f7370d429583860788526335ba4a6605 commit de0a87c0e59bb89709b9f42d36a56be6af201023 Author: Jose Torres Date: Mon Apr 22 15:07:17 2019 -0700 [SC-17187][DELTA-OSS] Port streaming source tests. ## What changes were proposed in this pull request? Port streaming source tests. Closes #4913 from jose-torres/oss4. Lead-authored-by: Jose Torres Co-authored-by: Ahir Reddy Signed-off-by: Jose Torres GitOrigin-RevId: ca7c785107aeafe2f0924749982bf229d6eaac1f commit bd25e20051aff506ef5f409374c2c274fe2c5a2a Author: Shixiong Zhu Date: Mon Apr 22 14:37:03 2019 -0700 [DELTA-OSS] Update Delta OSS links ## What changes were proposed in this pull request? Update links to use the Delta OSS repo. Closes #4977 from zsxwing/update-links. 
Authored-by: Shixiong Zhu Signed-off-by: Michael Armbrust GitOrigin-RevId: 82976341f2f720e08dd6ed36d0509b7d59c2e4de commit 3c31161c85b2b06016c431b504f661e1a88d09ec Author: Burak Yavuz Date: Mon Apr 22 14:31:37 2019 -0700 [DELTA-REFACTOR] Add CircleCi badge for master, and configure SBT to pass tests ## What changes were proposed in this pull request? Title explains it all ## How was this patch tested? https://circleci.com/gh/delta-io/delta/6 Closes #4979 from brkyvz/fixSbtChanges. Authored-by: Burak Yavuz Signed-off-by: Tathagata Das GitOrigin-RevId: 5a9be10747151e0ee44c1b02ed42cebb6770d8d9 commit ec28775897e736b151fdcabff55c75752678287c Author: Shixiong Zhu Date: Mon Apr 22 14:31:18 2019 -0700 Update links commit 2db4d54892c04ff68a6c6eca0e74ec6015520d05 Author: Tathagata Das Date: Mon Apr 22 14:12:50 2019 -0700 [SC-17341] Updated README ## What changes were proposed in this pull request? Added intro Fixed license ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. 
Authored-by: Tathagata Das Signed-off-by: Tathagata Das GitOrigin-RevId: d76f6d04ca9f06fdfbea1dab34d44baacbf714c4 commit 954eb260806229d3843cd4cce516858df249ccb1 Author: Burak Yavuz Date: Mon Apr 22 14:14:45 2019 -0700 Reduce SBT memory to ~2gb (#1) * Reduce to ~2g * Update build.sbt * Update build.sbt * Update sbt-launch-lib.bash commit f2d5d7a5c108473d7573b9e8c963af5467325eaa Author: Mukul Murthy <38224594+mukulmurthy@users.noreply.github.com> Date: Mon Apr 22 11:49:23 2019 -0700 Fix delta docs versions GitOrigin-RevId: 6a748039c56bb56e951cdb6cfff9e2b9dd9ee6e3 commit 116e79454690a283e077b2cfc883075d48e9cb8c Author: Mukul Murthy <38224594+mukulmurthy@users.noreply.github.com> Date: Mon Apr 22 11:33:02 2019 -0700 Update Delta quickstart docs GitOrigin-RevId: 123d4b89bb6a88d0c2d224d79ddc572aa1420512 commit 554be6210d20ce02dcf7b8d37db3ebe800ff1677 Author: Shixiong Zhu Date: Mon Apr 22 10:45:45 2019 -0700 [DELTA-OSS] Remove "partitionByHack" and fix filter push down ## What changes were proposed in this pull request? - Upgrade to Spark 2.4.2-rc1. - Always enable `spark.sql.legacy.sources.write.passPartitionByAsOptions` in `DeltaDataSource` constructor and remove `partitionByHack`. - Fix filter push down. ## How was this patch tested? New unit tests. Closes #4955 from zsxwing/spark2.4.2-rc1. Lead-authored-by: Shixiong Zhu Co-authored-by: Shixiong Zhu Signed-off-by: Michael Armbrust GitOrigin-RevId: ac17c3e7cf5e19cebbdbbbea42025fb8bf877f84 commit 2259f6f8870d9ecdd6585141b2d0f1dd76601e82 Author: Burak Yavuz Date: Mon Apr 22 10:41:19 2019 -0700 [SC-17152] Open source some Time Travel tests ## What changes were proposed in this pull request? This PR open sources some time travel tests. ## How was this patch tested? Copybara'd to open source repo and ran tests Closes #4909 from brkyvz/ttTests. 
Lead-authored-by: Burak Yavuz Co-authored-by: liwensun Signed-off-by: Tathagata Das GitOrigin-RevId: 853989251f00d43ac71932fc64e25904ff51ba6c commit 62c205a7b1200f55f831e98b654f4710cee2f291 Author: Shixiong Zhu Date: Mon Apr 22 10:39:56 2019 -0700 [DELTA-OSS] Update build scripts and README ## What changes were proposed in this pull request? - Update build scripts to match the new created bintray repo. - Remove unused SBT plugins. - Fix minor issues in README. Closes #4971 from zsxwing/delta-oss-cleanup. Authored-by: Shixiong Zhu Signed-off-by: Michael Armbrust GitOrigin-RevId: e079220522f18c9bb57a0849656b3b504cc02ef6 commit e26435bcd787b232c1cf73eb118202971f1e18f1 Author: Tathagata Das Date: Sun Apr 21 20:21:06 2019 -0700 [SC-17340][SC-17271][SC-17340][DELTA-OSS] Fix license headers ## What changes were proposed in this pull request? - Added script to automatically update any existing license header from to new header - For a few files, manually updated their DB and Apache license and excluded them the script - Added new license headers to other non scala files - Replaced all instances of "Databricks Delta" to "Delta Lake" ## How was this patch tested? Existing DBR and OSS delta tests Closes #4970 from tdas/SC-17340. Authored-by: Tathagata Das Signed-off-by: Tathagata Das GitOrigin-RevId: 0d7eb26b553957af3e732e68bc1f4c269d6823bf commit 1e9475d9f1121f373d9cf355ee825cd60f363967 Author: Shixiong Zhu Date: Fri Apr 19 15:49:08 2019 -0700 [DELTA-OSS] Add Google Mirror of Maven Central and use it first ## What changes were proposed in this pull request? Google Mirror of Maven Central is more stable than Maven Central. Place it first to avoid hitting flaky Maven Central error. Closes #4966 from zsxwing/google-maven. 
Authored-by: Shixiong Zhu Signed-off-by: Shixiong Zhu GitOrigin-RevId: 14deb32c3c6ca2fda6cf8a92b62fe41dead8ad65 commit 51f681ad148341ae4cfcf61d23e6ec41dcb9476c Author: Tathagata Das Date: Fri Apr 19 14:30:53 2019 -0700 [DELTA-OSS] Update Notice and Contributing ## What changes were proposed in this pull request? Self-explanatory ## How was this patch tested? No need Closes #4964 from tdas/delta-lake-notic. Authored-by: Tathagata Das Signed-off-by: Tathagata Das GitOrigin-RevId: c6f90a323f20c65e4e8bca033a760b7823fd08c8 commit 28c219dc7ee7f3ea57ba08fa279e2b2f02a18f54 Author: Tathagata Das Date: Fri Apr 19 11:28:37 2019 -0700 [SC-17149][DELTA] Implemented default OSS LogStore that is correct for HDFS ## What changes were proposed in this pull request? - Implemented HDFSLogStoreImpl for OSS - Uses FileContext for all ops and FileContext.rename to ensure atomic writes (atomic for both overwrite = true and false) - Added a lot of scala docs. ## How was this patch tested? Fixed unit test to work with both HDFSLogStoreImpl and HDFSLogStore. Test the OSS part by locally running `delta/build/copybara.py --sbt-args "testOnly *LogStore*"` Authored-by: Tathagata Das Signed-off-by: Tathagata Das GitOrigin-RevId: 2377995a20aa2013a09fa14a400dd2e9b9a0292d commit 66a04995a6f05f408d1db88a1b7fa1a43fec3dd0 Author: Ahir Reddy Date: Fri Apr 19 10:10:05 2019 -0700 Coursier Dependency Resolution in OSS Delta Switch the OSS Delta build to Coursier to speed up dependency resolution. Closes #4952 from ahirreddy/oss-test-pr-trigger. Authored-by: Ahir Reddy Signed-off-by: Shixiong Zhu GitOrigin-RevId: e8959d233ef49ac7f506592b6391f09b991ae2eb commit e66dc4c6fe765d7c0481956044853bd7e0ec47e9 Author: Sean Owen Date: Thu Apr 18 16:32:06 2019 -0700 [DELTA-REFACTOR] Add NOTICE; add Spark LICENSE to LICENSE ## What changes were proposed in this pull request? 
- Add a NOTICE file contents from Spark per ALv2 requirement - Symlink NOTICE so it gets built into artifact - Add Spark LICENSE to LICENSE as possible overkill but complete ## How was this patch tested? N/A Closes #4951 from srowen/LicenseNotice. Authored-by: Sean Owen Signed-off-by: Tathagata Das GitOrigin-RevId: 02c33f94b9b26e9dd42cc912205fc3bc45d98a1b commit 7179624caf7d388957a1e724071997607a23e2cb Author: David Lewis Date: Thu Apr 18 15:51:37 2019 -0700 [SC-13048][SC-15987] ZORDER BY scalability: add a threshold for splitting rewrite of large partition into multiple jobs Authored-by: David Lewis Signed-off-by: Mukul Murthy GitOrigin-RevId: 8a39e20766696126aa7031ca2eb24d6b480914bb commit 2fcd4a1adc94a9c907f390c74ebd4664328fcc73 Author: Shixiong Zhu Date: Thu Apr 18 15:30:11 2019 -0700 [DELTA-OSS] Add spark-packages repo ## What changes were proposed in this pull request? Add `spark-packages` repo to fix the build. Authored-by: Shixiong Zhu Signed-off-by: Tathagata Das GitOrigin-RevId: db28ffc21c26b7893a1b2a7bd2f323e6652253c7 commit a2cb6970e31e8123184b9d66d6c4f3134a2c5a60 Author: Shixiong Zhu Date: Thu Apr 18 11:50:52 2019 -0700 [DELTA-OSS] Add release build and license ## What changes were proposed in this pull request? This PR updates SBT scripts to support release and also adds the missing license. ## How was this patch tested? Run `build/sbt "+ publish-local"` and verify the generated pom.xml Closes #4926 from zsxwing/delta-publish. Authored-by: Shixiong Zhu Signed-off-by: Shixiong Zhu GitOrigin-RevId: bf67f6d7be62d00abdd3684f099ece0e8fd4ed92 commit 268221ea6d40dd7b203e6e7f1e655f7e282f892e Author: Ahir Reddy Date: Wed Apr 17 16:07:30 2019 -0700 [PLAT-5541] Per Commit Copybara Export Exposes support for per-commit Copybara export. By default local workflows use the Squash flow (because it's faster). When exporting to the OSS repo, we will use the iterative mode to preserve commits, messages, and authorship. Closes #4930 from ahirreddy/no-squash-copybara. 
Authored-by: Ahir Reddy Signed-off-by: Tathagata Das GitOrigin-RevId: 891060da88d6c4604718134521b12dd20c37264b commit 4896d97f926130ad8eba35887912de2e2a5a2b85 Author: Jose Torres Date: Tue Apr 16 15:27:00 2019 -0700 [SC-17088][DELTA] Run DeltaSinkSuite in OSS. ## What changes were proposed in this pull request? Run DeltaSinkSuite in OSS. ## How was this patch tested? Copybaraed along with Liwen's upcoming transaction update() fix. (I won't merge this until that PR is in and I re-run with just this PR and master.) Closes #4898 from jose-torres/oss3. Authored-by: Jose Torres Signed-off-by: Jose Torres GitOrigin-RevId: 39b5a582284ac279c4bc92ff952be46d9c18ee07 commit 2519d68ad3bb00fc7870455f6a8d110b3b085994 Author: Tathagata Das Date: Tue Apr 16 14:19:42 2019 -0700 [SC-17091][DELTA] First draft of Delta OSS README.md ## What changes were proposed in this pull request? self-explanatory ## How was this patch tested? no tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Authored-by: Tathagata Das Signed-off-by: Tathagata Das GitOrigin-RevId: 37497a1cb88d3cc4cb4fe680bd67dbba5bffd2d1 commit d715c774dacda230291705333b3b6d863d374e28 Author: liwensun Date: Mon Apr 15 19:55:56 2019 -0700 [DELTA-REFACTOR] Add back deltaLog.update() during commit ## What changes were proposed in this pull request? We accidentally removed the `deltaLog.update()` in `commit()` during OSS refactoring. This PR adds it back. ## How was this patch tested? Enable the test that used to fail before this change. Authored-by: liwensun Signed-off-by: liwensun GitOrigin-RevId: 4410be739252989ac536aa57bedcb29229ec831f commit d5a03e6d1b475885283f075186054b57d471e159 Author: Burak Yavuz Date: Mon Apr 15 17:02:51 2019 -0700 [SC-17092][DELTA-REFACTOR] Update build.sbt and add circleci config file ## What changes were proposed in this pull request? Adds CircleCI support to our repo. ## How was this patch tested? 
Built in our prototype branch Closes #4896 from brkyvz/addCircleCIConfigs. Lead-authored-by: Burak Yavuz Co-authored-by: Burak Yavuz Signed-off-by: Burak Yavuz GitOrigin-RevId: 5226a4d3d623d4d54fcc4fad26d210a54cb05d16 commit 5fd67738b1121bd0c28f590a2778afb3a20b341c Author: Jose Torres Date: Mon Apr 15 16:35:08 2019 -0700 [SC-17137][DELTA] Put log4j.properties for OSS distribution ## What changes were proposed in this pull request? Copied log4j.properties from Spark SQL tests. (Is this ok or do we need to somehow write our own?) ## How was this patch tested? manually copybaraed and ran DeltaSuiteOSS Closes #4889 from jose-torres/oss. Authored-by: Jose Torres Signed-off-by: Jose Torres GitOrigin-RevId: 2235c6f9d99347a8df4271568f25663e2076c894 commit c49beebcf34a721326310a4d2849b1d189974948 Author: Ahir Reddy Date: Mon Apr 15 15:00:54 2019 -0700 [PLAT-5541] Separate Test Module for Delta Folder Setup a test module for the `delta/` folder so that changes don't trigger every runtime test. Closes #4888 from ahirreddy/delta-test-module. Authored-by: Ahir Reddy Signed-off-by: Ahir Reddy GitOrigin-RevId: eeb209a31b497c4585617e2dcaa9d15f230203be commit 06c5a49cf486065b465c0269d5ef348cb1a0a24b Author: Shixiong Zhu Date: Mon Apr 15 13:14:25 2019 -0700 [DELTA-REFACTOR] Copy Spark's git settings to Delta OSS ## What changes were proposed in this pull request? Copy Spark's git settings to Delta OSS to ignore files properly. ## How was this patch tested? Jenkins Closes #4887 from zsxwing/improve-git. Authored-by: Shixiong Zhu Signed-off-by: Shixiong Zhu GitOrigin-RevId: d12eb112ada4b50b7e0c6f10e13d7c9783391d11 commit 14cb4e0267cc188e0fdd47e5b4f0235baf87874e Author: Michael Armbrust Date: Fri Apr 12 08:03:23 2019 -0700 Project import generated by Copybara. 
GitOrigin-RevId: 62fbd592bead1a3592c258a3191e3e603b026377 --- .../skipping/MultiDimClusteringSuite.scala | 42 +++++++++++++++++++ 1 file changed, 42 insertions(+) diff --git a/spark/src/test/scala/org/apache/spark/sql/delta/skipping/MultiDimClusteringSuite.scala b/spark/src/test/scala/org/apache/spark/sql/delta/skipping/MultiDimClusteringSuite.scala index 28baf076870..7c7c038bd48 100644 --- a/spark/src/test/scala/org/apache/spark/sql/delta/skipping/MultiDimClusteringSuite.scala +++ b/spark/src/test/scala/org/apache/spark/sql/delta/skipping/MultiDimClusteringSuite.scala @@ -148,6 +148,48 @@ class MultiDimClusteringSuite extends QueryTest } } + test("ensure records are sorted according to Z-order values") { + withSQLConf( + MDC_SORT_WITHIN_PARTITIONS.key -> "true", + MDC_ADD_NOISE.key -> "false") { + val data = Seq( + // "c1" -> "c2", // (rangeId_c1, rangeId_c2) -> ZOrder (decimal Z-Order) + "a" -> 20, "a" -> 20, // (0, 1) -> 0x01 (1) + "b" -> 20, // (1, 1) -> 0x03 (3) + "c" -> 30, // (2, 2) -> 0x0C (12) + "d" -> 70, // (3, 3) -> 0x0F (15) + "e" -> 90, "e" -> 90, "e" -> 90, // (4, 4) -> 0x30 (48) + "f" -> 200, // (5, 5) -> 0x33 (51) + "g" -> 10, // (6, 0) -> 0x28 (40) + "h" -> 20) // (7, 1) -> 0x2B (43) + + // Randomize the data. Use seed for deterministic input. + val inputDf = new Random(seed = 101).shuffle(data) + .toDF("c1", "c2") + + // Cluster the data, range partition into one partition, and sort. + val outputDf = MultiDimClustering.cluster( + inputDf, + approxNumPartitions = 2, + colNames = Seq("c1", "c2"), + curve = "zorder") + + // Check that dataframe is sorted. + checkAnswer( + outputDf, + Seq( + "a" -> 20, "a" -> 20, + "b" -> 20, + "c" -> 30, + "d" -> 70, + "g" -> 10, + "h" -> 20, + "e" -> 90, "e" -> 90, "e" -> 90, + "f" -> 200 + ).toDF("c1", "c2").collect()) + } + } + test("noise is helpful in skew handling") { Seq("zorder", "hilbert").foreach { curve => Seq("true", "false").foreach { addNoise =>
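The Z-order values annotated in the test comments can be reproduced by hand. The following is a minimal sketch, not Delta's implementation: `ZOrderSketch`, `interleave`, and `sortedKeys` are hypothetical names, the range IDs are assumed to be the ones listed in the test comments (with c2 = 20 mapped to range ID 1 in every row), and the bit order (most significant bit first, c1 before c2) is an assumption chosen because it matches the hex values in those comments.

```scala
object ZOrderSketch {
  // Interleave the low `bits` bits of each coordinate, most significant bit first.
  // Within each bit position, coordinates contribute in the order given,
  // so the first coordinate lands in the higher of the interleaved bits.
  def interleave(coords: Seq[Int], bits: Int): Long = {
    var z = 0L
    for (bit <- (bits - 1) to 0 by -1; c <- coords) {
      z = (z << 1) | ((c >> bit) & 1)
    }
    z
  }

  // (c1 key, assumed rangeId_c1, assumed rangeId_c2), one entry per distinct row.
  val rows: Seq[(String, Int, Int)] = Seq(
    ("a", 0, 1), ("b", 1, 1), ("c", 2, 2), ("d", 3, 3),
    ("e", 4, 4), ("f", 5, 5), ("g", 6, 0), ("h", 7, 1))

  // Keys ordered by their interleaved value, i.e. the order the test expects.
  val sortedKeys: Seq[String] =
    rows.sortBy { case (_, r1, r2) => interleave(Seq(r1, r2), bits = 3) }.map(_._1)

  def main(args: Array[String]): Unit = {
    rows.foreach { case (k, r1, r2) =>
      println(s"$k ($r1, $r2) -> ${interleave(Seq(r1, r2), bits = 3)}")
    }
    println(sortedKeys.mkString(", "))
  }
}
```

Under these assumptions, "g" (40) and "h" (43) sort ahead of "e" (48) and "f" (51), which is why the expected rows in `checkAnswer` are not in alphabetical order of `c1`.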