[SPARK-52462] [SQL] Enforce type coercion before children output deduplication in Union #51172

Open: wants to merge 1 commit into base: master

Contributor

@mihailoale-db mihailoale-db commented Jun 12, 2025

What changes were proposed in this pull request?

Right now, the query below produces plans that are inconsistent across different underlying table providers.

This can happen when third-party data sources add custom analyzer rules that change the rule order. Delta Lake is an example.

See these examples:

  • Delta table:
CREATE TABLE deltaTable (col1 INT, col2 INT, col3 INT, col4 INT) USING delta;
SELECT col1, col2, col3, NULLIF('','') AS col4
FROM deltaTable
UNION ALL
SELECT col2, col2, null AS col3, col4
FROM deltaTable;

For this one, we trigger WidenSetOperationTypes before ResolveReferences (deduplication of Union children outputs), so the plan looks like:

Union false, false
:- Project [col1#418, col2#419, col3#420, cast(col4#412 as bigint) AS col4#426L]
:  +- Project [col1#418, col2#419, col3#420, nullif(, ) AS col4#412]
:     +- SubqueryAlias spark_catalog.default.deltaTable
:        +- Relation spark_catalog.default.deltatable[col1#418,col2#419,col3#420,col4#421] parquet
+- Project [col2#423, col2#423 AS col2#431, col3#427, col4#428L]
   +- Project [col2#423, col2#423, cast(col3#413 as int) AS col3#427, cast(col4#425 as bigint) AS col4#428L]
      +- Project [col2#423, col2#423, null AS col3#413, col4#425]
         +- SubqueryAlias spark_catalog.default.deltaTable
            +- Relation spark_catalog.default.deltatable[col1#422,col2#423,col3#424,col4#425] parquet
  • Non-delta table:
CREATE TABLE parquetTable (col1 INT, col2 INT, col3 INT, col4 INT) USING parquet;
SELECT col1, col2, col3, NULLIF('','') AS col4
FROM parquetTable
UNION ALL
SELECT col2, col2, null AS col3, col4
FROM parquetTable;

In this case, we run ResolveReferences (deduplication of Union children outputs) before WidenSetOperationTypes, so the plan looks like:

Union false, false
:- Project [col1#2, col2#3, col3#4, cast(col4#0 as bigint) AS col4#11L]
:  +- Project [col1#2, col2#3, col3#4, nullif(, ) AS col4#0]
:     +- SubqueryAlias spark_catalog.default.parquettable
:        +- Relation spark_catalog.default.parquettable[col1#2,col2#3,col3#4,col4#5] parquet
+- Project [col2#7, col2#10, cast(col3#1 as int) AS col3#12, cast(col4#9 as bigint) AS col4#13L]
   +- Project [col2#7, col2#7 AS col2#10, col3#1, col4#9]
      +- Project [col2#7, col2#7, null AS col3#1, col4#9]
         +- SubqueryAlias spark_catalog.default.parquettable
            +- Relation spark_catalog.default.parquettable[col1#6,col2#7,col3#8,col4#9] parquet

I propose that we align the two by enforcing that type coercion happens before deduplication.
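
To illustrate why the rule order changes the plan shape, here is a minimal toy model in Python. It is not Spark's analyzer: `Col`, `widen`, and `dedup` are hypothetical stand-ins for attributes, WidenSetOperationTypes, and the Union output deduplication in ResolveReferences, and the expression IDs are made up. It only models the second Union child from the examples above.

```python
from dataclasses import dataclass
from itertools import count
from typing import Iterator, List, Optional


@dataclass(frozen=True)
class Col:
    name: str                     # output column name
    expr_id: int                  # like the #N suffix in a Spark plan
    dtype: str                    # e.g. "int", "bigint", "void"
    source: Optional[str] = None  # defining expression, if this is an alias

    def render(self) -> str:
        ref = f"{self.name}#{self.expr_id}"
        return f"{self.source} AS {ref}" if self.source else ref


Projection = List[Col]
Plan = List[Projection]  # plan[0] is the topmost Project


def widen(plan: Plan, targets: List[str], ids: Iterator[int]) -> Plan:
    """Toy type coercion: add a Project that casts mismatched columns."""
    top = plan[0]
    if [c.dtype for c in top] == targets:
        return plan
    new_top = [
        Col(c.name, c.expr_id, c.dtype) if c.dtype == t
        else Col(c.name, next(ids), t, f"cast({c.name}#{c.expr_id} as {t})")
        for c, t in zip(top, targets)
    ]
    return [new_top] + plan


def dedup(plan: Plan, ids: Iterator[int]) -> Plan:
    """Toy deduplication: add a Project that re-aliases repeated attributes."""
    seen, new_top, changed = set(), [], False
    for c in plan[0]:
        if c.expr_id in seen:
            new_top.append(Col(c.name, next(ids), c.dtype, f"{c.name}#{c.expr_id}"))
            changed = True
        else:
            seen.add(c.expr_id)
            new_top.append(Col(c.name, c.expr_id, c.dtype))
    return [new_top] + plan if changed else plan


def base_child() -> Plan:
    # SELECT col2, col2, null AS col3, col4 (IDs are illustrative only)
    return [[Col("col2", 7, "int"), Col("col2", 7, "int"),
             Col("col3", 1, "void", "null"), Col("col4", 9, "int")]]


def show(plan: Plan) -> List[str]:
    return ["Project [" + ", ".join(c.render() for c in p) + "]" for p in plan]


TARGETS = ["int", "int", "int", "bigint"]  # widened types of the Union output

ids = count(100)
coerce_first = dedup(widen(base_child(), TARGETS, ids), ids)  # Delta-like order
ids = count(100)
dedup_first = widen(dedup(base_child(), ids), TARGETS, ids)   # parquet-like order

for label, plan in (("coerce first", coerce_first), ("dedup first", dedup_first)):
    print(label)
    for depth, line in enumerate(show(plan)):
        print("  " * (depth + 1) + line)
```

In this sketch both orders yield the same output types, but the topmost Project carries the alias in one case and the casts in the other, which mirrors the difference between the two plans above.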

Why are the changes needed?

To make UNION produce consistent plans across different underlying table providers.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added tests + existing ones.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jun 12, 2025
@mihailoale-db mihailoale-db force-pushed the uniondeduplicationfix branch 2 times, most recently from e3f534a to 318e232 Compare June 13, 2025 08:32
Contributor

@mihailotim-db mihailotim-db left a comment

LGTM

Contributor

cloud-fan commented Jun 13, 2025

@mihailoale-db can you say more about how your example query gets a different type coercion result with different rule order? Let's describe "not consistent" clearly here.

@mihailoale-db
Contributor Author

@cloud-fan Some third party data sources may add custom analyzer rules that will change the rule order here. Delta Lake is an example. Let me mention that in the description. Thanks!

@mihailoale-db mihailoale-db force-pushed the uniondeduplicationfix branch from 318e232 to d0c0056 Compare June 16, 2025 09:01
@@ -15,6 +15,24 @@ CreateViewCommand `t2`, VALUES (1.0, 1), (2.0, 4) tbl(c1, c2), false, true, Loca
+- LocalRelation [c1#x, c2#x]


-- !query
CREATE TABLE parquetTable (col1 INT, col2 INT, col3 INT, col4 INT) USING parquet
Contributor

Looking at the changes, I don't think the choice of built-in file format matters here. Why do we test with 3 formats? Shall we just test one?

4 participants