feat(sql): fuse `distinct` with other select nodes when possible #9923

jcrist · 2024-08-26T15:36:45Z

This generates more concise SQL when chaining relational operations with .distinct().

Fixes #9905.

ibis/backends/sql/rewrites.py

jcrist · 2024-08-27T17:03:28Z

Hmmm, this has turned up some SQL analysis bugs (err, I think they're bugs) in spark & risingwave. In both cases the presence of a column alias inside a SELECT DISTINCT is causing the original column name to not be in scope in the ORDER BY part of the query (it is still in scope in other parts like WHERE).

--- Order by the alias works
SELECT DISTINCT x AS z FROM test ORDER BY z
--- Order by the original name fails, saying no column named `x` found
SELECT DISTINCT x AS z FROM test ORDER BY x
--- Order by the original name works if you remove the DISTINCT
SELECT x AS z FROM test ORDER BY x

Is this a bug in the SQL we generate that most backends just happen to support, or is this a bug in pyspark/risingwave?
Is using the alias instead of the original name everywhere a valid fix?
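
For reference, a minimal sketch that runs the three queries above against SQLite through Python's standard `sqlite3` module. SQLite is only a stand-in here (an assumption; it is not one of the backends under discussion) and it accepts all three forms, so this illustrates the behaviour that most backends share rather than the spark/risingwave failure. The table contents are made up for illustration.

```python
import sqlite3

# In-memory database with a tiny, made-up `test` table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (x INTEGER)")
con.executemany("INSERT INTO test VALUES (?)", [(3,), (1,), (2,), (1,)])

# The three query shapes from the comment above.
queries = {
    "distinct, order by alias": "SELECT DISTINCT x AS z FROM test ORDER BY z",
    "distinct, order by original name": "SELECT DISTINCT x AS z FROM test ORDER BY x",
    "no distinct, order by original name": "SELECT x AS z FROM test ORDER BY x",
}

# Run each query and collect its rows.
results = {label: con.execute(sql).fetchall() for label, sql in queries.items()}
for label, rows in results.items():
    print(f"{label}: {rows}")
```

On an engine that resolves `ORDER BY x` against the underlying column, the first two queries return the same rows, which is why ordering by the alias looks like a safe substitute.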

jcrist · 2024-08-28T01:27:14Z

Ok, I have a fix for this (I think), and now know more dumb things about SQL dialects than I did before.

jcrist

Ok, provided tests pass (I didn't check the cloud backends myself), I think this is ready for review. Due to very dumb backend-specific reasons this was more work than initially thought. I've marked a few code spots below for review.

jcrist · 2024-08-29T16:53:30Z

ibis/backends/sql/rewrites.py

@@ -351,6 +408,46 @@ def wrap(node, _, **kwargs):
 # supplemental rewrites selectively used on a per-backend basis


+@replace(Select)
+def split_select_distinct_with_order_by(_):


There are dragons here.

jcrist · 2024-08-29T16:53:42Z

ibis/backends/sql/rewrites.py

@@ -244,6 +251,48 @@ def merge_select_select(_, **kwargs):
    if _.parent.find_below(blocking, filter=ops.Value):
        return _

+    if _.parent.distinct:


There are also dragons here.

ibis/backends/tests/test_generic.py

ibis/backends/tests/sql/test_select_sql.py

cpcloud · 2024-09-03T12:27:11Z

I'll run the clouds.

cpcloud · 2024-09-03T12:38:13Z

Snowflake is passing, but BigQuery is raising this exception:

E               google.api_core.exceptions.BadRequest: 400 ORDER BY clause expression references t0.id which is not visible after SELECT DISTINCT at [11:7]; reason: invalidQuery, location: query, message: ORDER BY clause expression references t0.id which is not visible after SELECT DISTINCT at [11:7]
E
E               Location: US
E               Job ID: d7ab4e0d-b30f-4afc-8ed7-286b71a895a5

@jcrist I'm guessing BigQuery needs the new rewrite rule applied?
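
A rough sketch of the query shape such a rewrite would be expected to produce, assuming it works by moving the `ORDER BY` out of the `SELECT DISTINCT` and into an outer query (an assumption based on the rule's name, not checked against the implementation). SQLite via `sqlite3` is used as a stand-in engine, and the table `t` and alias `uid` are made up. SQLite accepts both forms, so this shows only that the two return the same rows, not the BigQuery error itself.

```python
import sqlite3

# In-memory database with a made-up table `t`.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER, name TEXT)")
con.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(2, "b"), (1, "a"), (2, "b"), (3, "c")],
)

# Fused form: ORDER BY refers to t.id, which the select list exposes only as `uid`.
fused = "SELECT DISTINCT t.id AS uid, t.name FROM t ORDER BY t.id"

# Split form: DISTINCT stays in a subquery, and the outer ORDER BY
# references only the subquery's output columns.
split = """
SELECT * FROM (SELECT DISTINCT t.id AS uid, t.name FROM t) AS s
ORDER BY s.uid
"""

fused_rows = con.execute(fused).fetchall()
split_rows = con.execute(split).fetchall()
print(fused_rows)
print(split_rows)
```

The split form never asks the engine to see through the `DISTINCT` to the source column, which is the visibility that the BigQuery message above says is missing.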

jcrist · 2024-09-03T13:18:03Z

Huh, thought I'd run the cloud tests in this PR already, will fix up.

jcrist · 2024-09-03T14:01:49Z

Cloud tests are passing, should be good-to-go.

cpcloud

LGTM, thanks for chugging through the muck!

cpcloud · 2024-09-03T17:22:12Z

ibis/backends/sql/rewrites.py

+        # - Fusing in the presence of expensive calls in the select would lead to potential
+        #   performance pitfalls
+        if _.distinct and not all(
+            isinstance(v, ops.Field) for v in _.selections.values()


Does this cover the alias-in-outer-project case?

SELECT a, b AS c FROM ( SELECT DISTINCT a, b FROM t )

would become

SELECT DISTINCT a, b AS c FROM t

If not, fine to either handle later or perhaps never if it doesn't come up.

This branch is for both being distinct, so:

SELECT DISTINCT a, b as c FROM (SELECT DISTINCT a, b FROM t)

In this case, yes, the aliases are properly handled.

The branch on line 270 handles the SELECT ... FROM (SELECT DISTINCT ...) case. It currently only works when the outer query is a SELECT *; we might be able to extend it to outer queries that rename columns but otherwise select all of them, but I don't think that's worth it right now.
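
To make the queries in this thread concrete, a small check that the fused and unfused forms return the same rows. This is a sketch against SQLite (a stand-in engine, via Python's `sqlite3`) with a made-up table `t`; it says nothing about how the rewrite rule itself is implemented.

```python
import sqlite3

# In-memory database with a made-up table `t` containing duplicate rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b TEXT)")
con.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "x"), (1, "x"), (2, "y"), (2, "z"), (2, "z")],
)


def rows(sql):
    """Run a query and return its rows sorted, since none of these queries is ordered."""
    return sorted(con.execute(sql).fetchall())


# Outer projection that renames a column over an inner DISTINCT ...
unfused = rows("SELECT a, b AS c FROM (SELECT DISTINCT a, b FROM t)")
# ... versus the single fused SELECT DISTINCT.
fused = rows("SELECT DISTINCT a, b AS c FROM t")
# Both levels DISTINCT, as in the reply above.
both_distinct = rows("SELECT DISTINCT a, b AS c FROM (SELECT DISTINCT a, b FROM t)")

# Counter-example: an outer projection that drops a column is not
# equivalent to a fused DISTINCT over the remaining column.
dropped = rows("SELECT a FROM (SELECT DISTINCT a, b FROM t)")
dropped_fused = rows("SELECT DISTINCT a FROM t")

print(unfused, fused, both_distinct)
print(dropped, dropped_fused)
```

The last pair shows why a rename-only outer query has to select every column for fusion to be safe: once a column is dropped, the unfused query can return duplicates that a fused `SELECT DISTINCT` would remove.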

cpcloud · 2024-09-03T17:28:59Z

ibis/backends/sql/rewrites.py

@@ -351,6 +408,46 @@ def wrap(node, _, **kwargs):
 # supplemental rewrites selectively used on a per-backend basis


+@replace(Select)
+def split_select_distinct_with_order_by(_):


Missed opportunity to call this supplant_merged_with_ordered_distinct 😂

ibis/backends/tests/test_generic.py

cpcloud reviewed Aug 26, 2024

jcrist force-pushed the fuse-select-distinct branch from 94b2e95 to 913808e Compare August 27, 2024 14:45

jcrist force-pushed the fuse-select-distinct branch from 913808e to d8ea7c4 Compare August 29, 2024 16:47

jcrist added the ci-run-cloud label (triggers a CI run of the BigQuery, Snowflake, and Databricks backends) Aug 29, 2024

ibis-docs-bot (bot) removed the ci-run-cloud label Aug 29, 2024

jcrist force-pushed the fuse-select-distinct branch from d8ea7c4 to 1cc8e72 Compare August 29, 2024 17:00

jcrist added the ci-run-cloud label Aug 29, 2024

ibis-docs-bot (bot) removed the ci-run-cloud label Aug 29, 2024

jcrist force-pushed the fuse-select-distinct branch from 1cc8e72 to 622e62b Compare August 29, 2024 17:38

jcrist requested a review from cpcloud August 29, 2024 18:01

cpcloud reviewed Sep 3, 2024

cpcloud added this to the 9.4 milestone Sep 3, 2024

cpcloud added the feature and sql labels Sep 3, 2024

feat(sql): fuse distinct with other select nodes when possible

f71339d

jcrist force-pushed the fuse-select-distinct branch from 622e62b to f71339d Compare September 3, 2024 13:24

jcrist added the ci-run-cloud label Sep 3, 2024

ibis-docs-bot (bot) removed the ci-run-cloud label Sep 3, 2024

cpcloud approved these changes Sep 3, 2024

jcrist merged commit c31412b into ibis-project:main Sep 3, 2024
87 checks passed

jcrist deleted the fuse-select-distinct branch September 3, 2024 18:04

jcrist mentioned this pull request Sep 3, 2024

refactor(polars): use Select op within polars backend #10005

Open
