[SPARK-52462] [SQL] Enforce type coercion before children output deduplication in Union #51172

mihailoale-db · 2025-06-12T19:36:12Z

What changes were proposed in this pull request?

Right now, the following query produces plans that are not consistent across different underlying table providers.

This can happen when third-party data sources add custom analyzer rules that change the rule order here. Delta Lake is an example.

See these examples:

Delta table:

CREATE TABLE deltaTable (col1 INT, col2 INT, col3 INT, col4 INT) USING delta;
SELECT col1, col2, col3, NULLIF('','') AS col4
FROM deltaTable
UNION ALL
SELECT col2, col2, null AS col3, col4
FROM deltaTable;

For this one, we trigger WidenSetOperationTypes before ResolveReferences (deduplication of Union children outputs), and thus the plan looks like:

Union false, false
:- Project [col1#418, col2#419, col3#420, cast(col4#412 as bigint) AS col4#426L]
:  +- Project [col1#418, col2#419, col3#420, nullif(, ) AS col4#412]
:     +- SubqueryAlias spark_catalog.default.deltaTable
:        +- Relation spark_catalog.default.deltatable[col1#418,col2#419,col3#420,col4#421] parquet
+- Project [col2#423, col2#423 AS col2#431, col3#427, col4#428L]
   +- Project [col2#423, col2#423, cast(col3#413 as int) AS col3#427, cast(col4#425 as bigint) AS col4#428L]
      +- Project [col2#423, col2#423, null AS col3#413, col4#425]
         +- SubqueryAlias spark_catalog.default.deltaTable
            +- Relation spark_catalog.default.deltatable[col1#422,col2#423,col3#424,col4#425] parquet

Non-delta table:

CREATE TABLE parquetTable (col1 INT, col2 INT, col3 INT, col4 INT) USING parquet;
SELECT col1, col2, col3, NULLIF('','') AS col4
FROM parquetTable
UNION ALL
SELECT col2, col2, null AS col3, col4
FROM parquetTable;

In this case, we run ResolveReferences (deduplication of Union children outputs) before WidenSetOperationTypes, and thus the plan looks like:

Union false, false
:- Project [col1#2, col2#3, col3#4, cast(col4#0 as bigint) AS col4#11L]
:  +- Project [col1#2, col2#3, col3#4, nullif(, ) AS col4#0]
:     +- SubqueryAlias spark_catalog.default.parquettable
:        +- Relation spark_catalog.default.parquettable[col1#2,col2#3,col3#4,col4#5] parquet
+- Project [col2#7, col2#10, cast(col3#1 as int) AS col3#12, cast(col4#9 as bigint) AS col4#13L]
   +- Project [col2#7, col2#7 AS col2#10, col3#1, col4#9]
      +- Project [col2#7, col2#7, null AS col3#1, col4#9]
         +- SubqueryAlias spark_catalog.default.parquettable
            +- Relation spark_catalog.default.parquettable[col1#6,col2#7,col3#8,col4#9] parquet

In this issue, I propose that we align these two plans by enforcing that type coercion happens before deduplication.
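To make the ordering effect concrete, here is a toy model in plain Python. It is not Spark code and not the change made in this PR: `Attr`, `widen`, `deduplicate`, and `analyze` are hypothetical names, and the two passes are only simplified stand-ins for WidenSetOperationTypes and the Union output deduplication in ResolveReferences. It applies both passes to the second Union child from the examples above, in each order, and shows that the output types match while the Project nesting differs.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Attr:
    """A toy attribute reference: name, expression id, and data type."""
    name: str
    expr_id: int
    dtype: str

    def __str__(self):
        return f"{self.name}#{self.expr_id}"


def widen(output, targets, ids):
    """Toy stand-in for type coercion: cast columns whose type differs from the target."""
    rendered, new_output = [], []
    for attr, target in zip(output, targets):
        if attr.dtype == target:
            rendered.append(str(attr))
            new_output.append(attr)
        else:
            casted = Attr(attr.name, next(ids), target)
            rendered.append(f"cast({attr} as {target}) AS {casted}")
            new_output.append(casted)
    return rendered, new_output


def deduplicate(output, ids):
    """Toy stand-in for output deduplication: re-alias repeated expression ids."""
    seen, rendered, new_output = set(), [], []
    for attr in output:
        if attr.expr_id in seen:
            fresh = Attr(attr.name, next(ids), attr.dtype)
            rendered.append(f"{attr} AS {fresh}")
            new_output.append(fresh)
        else:
            seen.add(attr.expr_id)
            rendered.append(str(attr))
            new_output.append(attr)
    return rendered, new_output


def analyze(output, targets, coercion_first):
    """Apply both passes in the requested order; return Project layers, top-most first."""
    ids = count(100)  # fresh expression ids for this toy run (assumption)
    steps = [lambda o: widen(o, targets, ids), lambda o: deduplicate(o, ids)]
    if not coercion_first:
        steps.reverse()
    layers = []
    for step in steps:
        rendered, output = step(output)
        layers.append("Project [" + ", ".join(rendered) + "]")
    return list(reversed(layers)), output


# Second Union child from the examples: SELECT col2, col2, null AS col3, col4
right_child = [
    Attr("col2", 7, "int"),
    Attr("col2", 7, "int"),
    Attr("col3", 1, "void"),
    Attr("col4", 9, "int"),
]
# Widened column types of the Union, taken from the plans above.
targets = ["int", "int", "int", "bigint"]

coercion_first_plan, coercion_first_out = analyze(right_child, targets, True)
dedup_first_plan, dedup_first_out = analyze(right_child, targets, False)

for title, plan in [("coercion first", coercion_first_plan),
                    ("deduplication first", dedup_first_plan)]:
    print(title)
    for depth, layer in enumerate(plan):
        print("  " * depth + "+- " + layer)
```

In this toy, the coercion-first order leaves the aliasing Project on top, which is the shape shown for the Delta table, while the deduplication-first order leaves the casting Project on top, which is the shape shown for the parquet table. The expression ids are arbitrary and will not match the ones in the real plans.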

Why are the changes needed?

To make UNION produce consistent plans across different underlying table providers.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added tests + existing ones.

Was this patch authored or co-authored using generative AI tooling?

No.

mihailotim-db

LGTM

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

cloud-fan · 2025-06-13T23:07:12Z

@mihailoale-db can you say more about how your example query gets a different type coercion result with different rule order? Let's describe "not consistent" clearly here.

mihailoale-db · 2025-06-16T08:55:22Z

@cloud-fan Some third party data sources may add custom analyzer rules that will change the rule order here. Delta Lake is an example. Let me mention that in the description. Thanks!

cloud-fan · 2025-06-19T01:01:13Z

sql/core/src/test/resources/sql-tests/analyzer-results/union.sql.out

@@ -15,6 +15,24 @@ CreateViewCommand `t2`, VALUES (1.0, 1), (2.0, 4) tbl(c1, c2), false, true, Loca
      +- LocalRelation [c1#x, c2#x]


+-- !query
+CREATE TABLE parquetTable (col1 INT, col2 INT, col3 INT, col4 INT) USING parquet


Looking at the changes, I don't think it matters which built-in file format we use here. Why do we test with 3 formats? Shall we just test one?

github-actions bot added the SQL label Jun 12, 2025

mihailoale-db force-pushed the uniondeduplicationfix branch 2 times, most recently from e3f534a to 318e232 Compare June 13, 2025 08:32

mihailotim-db approved these changes Jun 13, 2025

vladimirg-db reviewed Jun 13, 2025

initial commit

d0c0056

mihailoale-db force-pushed the uniondeduplicationfix branch from 318e232 to d0c0056 Compare June 16, 2025 09:01

cloud-fan reviewed Jun 19, 2025
