[Spark] Handle type widening in Delta streaming source #4042
Description
Streaming reads from a Delta source currently don't allow any form of type change to be propagated: the stream fails with a non-retryable error. Type widening introduced the ability to change the type of an existing column or field in a Delta table.
This change allows widening type changes to be propagated during streaming reads from a Delta source. The same mechanism as non-additive schema changes from column mapping (Drop/Rename) is applied.
To allow widening type changes to propagate, the user must:
.option("schemaTrackingLocation", ..)
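As a rough illustration of the option above, here is a minimal PySpark-style sketch of a streaming read with schema tracking enabled. The helper name `read_delta_stream` and the example paths are hypothetical; only the `schemaTrackingLocation` option itself comes from this description.

```python
def read_delta_stream(spark, table_path, schema_tracking_location):
    """Build a streaming read of a Delta table with schema tracking enabled.

    `spark` is expected to be an active SparkSession with Delta configured.
    The helper name and its parameters are illustrative, not part of Delta.
    """
    return (
        spark.readStream
        .format("delta")
        # Without this option, a widening type change fails the stream.
        .option("schemaTrackingLocation", schema_tracking_location)
        .load(table_path)
    )


# Hypothetical usage (paths are placeholders):
# df = read_delta_stream(spark, "/tables/events", "/checkpoints/events/_schema_log")
```

Delta's documentation for schema tracking with column mapping discusses where this location should live relative to the stream's checkpoint; that guidance is worth checking for the version in use.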
Note that the current check for column mapping had a loophole: as long as a schema tracking location was provided, it only gated column drop/rename but allowed any type changes to get through. This is fixed here by properly checking and rejecting non-widening type changes.
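The gating described above can be pictured with a toy model. This is a simplified sketch, not the code in this PR: the function names, the outcome labels, and the reduced set of widening rules (integral types growing in size, plus float to double) are all assumptions made for illustration.

```python
# Toy model only: the real check operates on Spark schemas and supports
# more widening rules than the two modeled here.
INTEGRAL_ORDER = ["byte", "short", "int", "long"]


def is_widening(from_type, to_type):
    """Return True if from_type -> to_type is a widening change in this toy model."""
    if from_type in INTEGRAL_ORDER and to_type in INTEGRAL_ORDER:
        return INTEGRAL_ORDER.index(from_type) < INTEGRAL_ORDER.index(to_type)
    return (from_type, to_type) == ("float", "double")


def classify_type_changes(old_schema, new_schema, has_schema_tracking_location):
    """Classify type changes between two {column_name: type_name} dicts."""
    changed = [
        (old_schema[name], new_schema[name])
        for name in old_schema
        if name in new_schema and old_schema[name] != new_schema[name]
    ]
    if not changed:
        return "no_type_change"
    # Non-widening changes are rejected even when a schema tracking
    # location is provided; this is the loophole the description mentions.
    if not all(is_widening(src, dst) for src, dst in changed):
        return "rejected"
    if not has_schema_tracking_location:
        return "needs_schema_tracking_location"
    return "eligible_for_propagation"
```

In this model, `classify_type_changes({"id": "int"}, {"id": "long"}, True)` is eligible for propagation, while changing `long` back to `int` is rejected whether or not a schema tracking location is set.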
How was this patch tested?
DeltaSourceMetadataEvolutionSupportSuite, covering the logic to detect and gate non-additive schema changes.
TypeWideningStreamingSourceSuite, covering stream reads from a Delta source when a column is widened.
This PR introduces the following user-facing changes
When reading using a streaming query from a Delta table that had a column type widened:
Before this change:
The stream fails with a non-retryable error:
Note: users could get around this by providing a schema tracking location and applying a column drop or rename to their source. This allowed arbitrary type changes to go through unchecked.
After this change:
Users must provide a schema tracking location via .option("schemaTrackingLocation"), otherwise the stream fails with:
When a schema tracking location is provided, non-widening type changes are now properly rejected and fail with error [DELTA_SCHEMA_CHANGED_WITH_VERSION].
When reading the batch that contains the type change, the stream first fails and records the tracked schema:
On retry, the stream fails and the user is prompted with a call to action:
Users can set one of the proposed configs to resume processing.