-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-52462] [SQL] Enforce type coercion before children output deduplication in Union #51172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
e3f534a
to
318e232
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
Outdated
Show resolved
Hide resolved
...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
Outdated
Show resolved
Hide resolved
@mihailoale-db can you say more about how your example query gets a different type coercion result with different rule order? Let's describe "not consistent" clearly here. |
@cloud-fan Some third party data sources may add custom analyzer rules that will change the rule order here. Delta Lake is an example. Let me mention that in the description. Thanks! |
318e232
to
d0c0056
Compare
@@ -15,6 +15,24 @@ CreateViewCommand `t2`, VALUES (1.0, 1), (2.0, 4) tbl(c1, c2), false, true, Loca | |||
+- LocalRelation [c1#x, c2#x] | |||
|
|||
|
|||
-- !query | |||
CREATE TABLE parquetTable (col1 INT, col2 INT, col3 INT, col4 INT) USING parquet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the changes, I don't think which built-in file format matters here, why do we test with 3 formats? shall we just test one?
What changes were proposed in this pull request?
Right now, query the following query produces plans that are not consistent over different underlying table providers.
Mentioned can happen when introducing some third party data sources that may add custom analyzer rules that will change the rule order here. Delta Lake is an example.
See these examples:
For this one, we trigger
WidenSetOperationTypes
beforeResolveReferences
(deduplication ofUnion
children outputs) and thus plan looks like:In this case, we
ResolveReferences
(deduplication ofUnion
children outputs) beforeWidenSetOperationTypes
and thus plan looks like:In this issue I propose that we align those two by enforcing type coercion to happen before deduplication.
Why are the changes needed?
To make
UNION
with different underlying table providers producing consistent plans.Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added tests + existing ones.
Was this patch authored or co-authored using generative AI tooling?
No.