feat(sql): fuse distinct with other select nodes when possible #9923

Merged: 1 commit, Sep 3, 2024

Conversation

jcrist
Member

@jcrist jcrist commented Aug 26, 2024

This generates more concise SQL when chaining relational operations with .distinct().

Fixes #9905.

@jcrist force-pushed the fuse-select-distinct branch from 94b2e95 to 913808e on August 27, 2024 14:45
@jcrist
Member Author

jcrist commented Aug 27, 2024

Hmmm, this has turned up some SQL analysis bugs (err, I think they're bugs) in spark & risingwave. In both cases the presence of a column alias inside a SELECT DISTINCT is causing the original column name to not be in scope in the ORDER BY part of the query (it is still in scope in other parts like WHERE).

--- Order by the alias works
SELECT DISTINCT x AS z FROM test ORDER BY z
--- Order by the original name fails, saying no column named `x` found
SELECT DISTINCT x AS z FROM test ORDER BY x
--- Order by the original name works if you remove the DISTINCT
SELECT x AS z FROM test ORDER BY x
  • Is this a bug in our generated SQL that most backends just happen to tolerate, or is this a bug in pyspark/risingwave?
  • Is using the alias instead of the original name everywhere a valid fix?
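
For readers who want to poke at the three queries above, here is a minimal sketch using SQLite through Python's standard library. SQLite is only a convenient, permissive engine to experiment with; it says nothing about how Spark or RisingWave resolve names. The table contents are made up for illustration.

```python
import sqlite3

# In-memory database with a small table containing duplicate values.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (x INTEGER)")
con.executemany("INSERT INTO test VALUES (?)", [(3,), (1,), (3,), (2,), (1,)])

# Order by the alias.
by_alias = con.execute("SELECT DISTINCT x AS z FROM test ORDER BY z").fetchall()

# Order by the original column name. SQLite accepts this form;
# the comment above reports that Spark and RisingWave reject it.
by_original = con.execute("SELECT DISTINCT x AS z FROM test ORDER BY x").fetchall()

# Without DISTINCT, ordering by the original name keeps the duplicates.
no_distinct = con.execute("SELECT x AS z FROM test ORDER BY x").fetchall()

print(by_alias, by_original, no_distinct)
```

On an engine that accepts both spellings, the alias and the original name give the same ordering, which is consistent with "use the alias everywhere" being a safe rewrite for plain column references.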

@jcrist
Member Author

jcrist commented Aug 28, 2024

Ok, I have a fix for this (I think), and now know more dumb things about SQL dialects than I did before.

@jcrist force-pushed the fuse-select-distinct branch from 913808e to d8ea7c4 on August 29, 2024 16:47
@jcrist added the ci-run-cloud label (triggers a CI run of the BigQuery, Snowflake, and Databricks backends) Aug 29, 2024
Member Author

@jcrist jcrist left a comment

Ok, provided tests pass (I didn't check the cloud backends myself), I think this is ready for review. Due to very dumb backend-specific reasons this was more work than initially thought. I've marked a few code spots below for review.

@@ -351,6 +408,46 @@ def wrap(node, _, **kwargs):
# supplemental rewrites selectively used on a per-backend basis


@replace(Select)
def split_select_distinct_with_order_by(_):
Member Author

There are dragons here.

Member

Missed opportunity to call this supplant_merged_with_ordered_distinct 😂

@@ -244,6 +251,48 @@ def merge_select_select(_, **kwargs):
    if _.parent.find_below(blocking, filter=ops.Value):
        return _

    if _.parent.distinct:
Member Author

There are also dragons here.

@ibis-docs-bot removed the ci-run-cloud label Aug 29, 2024
@jcrist force-pushed the fuse-select-distinct branch from d8ea7c4 to 1cc8e72 on August 29, 2024 17:00
@jcrist added the ci-run-cloud label Aug 29, 2024
@ibis-docs-bot removed the ci-run-cloud label Aug 29, 2024
@jcrist force-pushed the fuse-select-distinct branch from 1cc8e72 to 622e62b on August 29, 2024 17:38
@jcrist requested a review from cpcloud August 29, 2024 18:01
@cpcloud added this to the 9.4 milestone Sep 3, 2024
@cpcloud added the feature (Features or general enhancements) and sql (Backends that generate SQL) labels Sep 3, 2024
@cpcloud
Member

cpcloud commented Sep 3, 2024

I'll run the clouds.

@cpcloud
Member

cpcloud commented Sep 3, 2024

Snowflake is passing, but BigQuery is raising this exception:

E               google.api_core.exceptions.BadRequest: 400 ORDER BY clause expression references t0.id which is not visible after SELECT DISTINCT at [11:7]; reason: invalidQuery, location: query, message: ORDER BY clause expression references t0.id which is not visible after SELECT DISTINCT at [11:7]
E
E               Location: US
E               Job ID: d7ab4e0d-b30f-4afc-8ed7-286b71a895a5

@jcrist I'm guessing BigQuery needs the new rewrite rule applied?
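
As a rough illustration of what a "split" rewrite of this kind could look like, here is a toy sketch. `ToySelect`, `split_distinct_order_by`, and `to_sql` are hypothetical names invented for this example; this is not ibis's `Select` node or its actual `split_select_distinct_with_order_by` rule. The idea sketched is simply to move everything except the `ORDER BY` into a subquery, so the sort keys only ever see the output column names.

```python
from dataclasses import dataclass, replace
from typing import Union


@dataclass(frozen=True)
class ToySelect:
    """Hypothetical stand-in for a SELECT node (not ibis's Select class)."""

    parent: Union[str, "ToySelect"]  # table name or nested select
    selections: tuple = ()  # (expression, output name) pairs; empty means SELECT *
    distinct: bool = False
    order_by: tuple = ()  # output column names to sort by


def split_distinct_order_by(node: ToySelect) -> ToySelect:
    """Move the DISTINCT projection into a subquery and keep ORDER BY outside."""
    if not (node.distinct and node.order_by):
        return node  # nothing to split
    inner = replace(node, order_by=())
    return ToySelect(parent=inner, order_by=node.order_by)


def to_sql(node: ToySelect) -> str:
    """Render the toy node as a SQL string."""
    cols = ", ".join(
        expr if expr == name else f"{expr} AS {name}"
        for expr, name in node.selections
    ) or "*"
    if isinstance(node.parent, str):
        source = node.parent
    else:
        source = f"({to_sql(node.parent)}) AS sub"
    sql = f"SELECT {'DISTINCT ' if node.distinct else ''}{cols} FROM {source}"
    if node.order_by:
        sql += " ORDER BY " + ", ".join(node.order_by)
    return sql


query = ToySelect("test", selections=(("x", "z"),), distinct=True, order_by=("z",))
print(to_sql(split_distinct_order_by(query)))
```

In this toy model the outer query contains only `SELECT *` and the sort, so no sort key can refer to a name that is hidden by the `SELECT DISTINCT`.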

@jcrist
Member Author

jcrist commented Sep 3, 2024

Huh, thought I'd run the cloud tests in this PR already, will fix up.

@jcrist force-pushed the fuse-select-distinct branch from 622e62b to f71339d on September 3, 2024 13:24
@jcrist added the ci-run-cloud label Sep 3, 2024
@ibis-docs-bot removed the ci-run-cloud label Sep 3, 2024
@jcrist
Member Author

jcrist commented Sep 3, 2024

Cloud tests are passing, should be good-to-go.

Member

@cpcloud cpcloud left a comment

LGTM, thanks for chugging through the muck!

    # - Fusing in the presence of expensive calls in the select would lead to potential
    #   performance pitfalls
    if _.distinct and not all(
        isinstance(v, ops.Field) for v in _.selections.values()
Member

Does this cover the alias-in-outer-project case?

SELECT a, b AS c
FROM (
  SELECT DISTINCT
    a, b
  FROM t
)

would become

SELECT DISTINCT
  a, b AS c
FROM t

If not, fine to either handle later or perhaps never if it doesn't come up.

Member Author

This branch is for both being distinct, so:

SELECT DISTINCT a, b as c
FROM (SELECT DISTINCT a, b FROM t)

In this case, yes, the aliases are properly handled.

The branch on line 270 handles the SELECT ... FROM (SELECT DISTINCT ...) case. Right now it only works with SELECT * cases, but we might be able to make it work with outer queries that rename columns but otherwise select all of them. Right now I don't think that's worth it.
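
The two cases discussed here can be checked against a small table. The sketch below uses SQLite via Python's standard library with made-up data; it only illustrates the SQL semantics and is not taken from the PR's test suite. It shows that the distinct-over-distinct query with a rename gives the same rows when fused, and why a non-distinct outer query that drops a column cannot simply be fused.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
con.executemany(
    "INSERT INTO t VALUES (?, ?)", [(1, 10), (1, 10), (1, 20), (2, 20)]
)


def rows(sql):
    # Sort in Python so the comparison does not depend on engine row order.
    return sorted(con.execute(sql).fetchall())


# Both levels DISTINCT, outer level renames b to c: fusing preserves the rows.
nested = rows("SELECT DISTINCT a, b AS c FROM (SELECT DISTINCT a, b FROM t)")
fused = rows("SELECT DISTINCT a, b AS c FROM t")

# Non-distinct outer query that drops column b: the nested form can still
# return duplicate values of a, while a fused SELECT DISTINCT a cannot.
nested_drop = rows("SELECT a FROM (SELECT DISTINCT a, b FROM t)")
fused_drop = rows("SELECT DISTINCT a FROM t")

print(nested == fused, nested_drop == fused_drop)
```

The second comparison is one reason to restrict the non-distinct outer case to queries that keep every column, as the `SELECT *` restriction does.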

@jcrist jcrist merged commit c31412b into ibis-project:main Sep 3, 2024
87 checks passed
@jcrist jcrist deleted the fuse-select-distinct branch September 3, 2024 18:04
Development

Successfully merging this pull request may close these issues.

feat: simplify generated select distinct expressions
2 participants