-
Notifications
You must be signed in to change notification settings - Fork 609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(sql): fuse distinct
with other select nodes when possible
#9923
Conversation
94b2e95
to
913808e
Compare
Hmmm, this has turned up some SQL analysis bugs (err, I think they're bugs) in spark & risingwave. In both cases the presence of a column alias inside a --- Order by the alias works
SELECT DISTINCT x AS z FROM test ORDER BY z
--- Order by the original name fails, saying no column named `z` found
SELECT DISTINCT x AS z FROM test ORDER BY x
--- Order by the original name works if you remove the DISTINCT
SELECT x AS z FROM test ORDER BY x
|
Ok, I have a fix for this (I think), and now know more dumb things about SQL dialects than I did before. |
913808e
to
d8ea7c4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, provided tests pass (I didn't check the cloud backends myself), I think this is ready for review. Due to very dumb backend-specific reasons this was more work than initially thought. I've marked a few code spots below for review.
@@ -351,6 +408,46 @@ def wrap(node, _, **kwargs): | |||
# supplemental rewrites selectively used on a per-backend basis | |||
|
|||
|
|||
@replace(Select) | |||
def split_select_distinct_with_order_by(_): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are dragons here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed opportunity to call this supplant_merged_with_ordered_distinct
😂
@@ -244,6 +251,48 @@ def merge_select_select(_, **kwargs): | |||
if _.parent.find_below(blocking, filter=ops.Value): | |||
return _ | |||
|
|||
if _.parent.distinct: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are also dragons here.
d8ea7c4
to
1cc8e72
Compare
1cc8e72
to
622e62b
Compare
I'll run the clouds. |
Snowflake is passing, but BigQuery is raising this exception:
@jcrist I'm guessing BigQuery needs the new rewrite rule applied? |
Huh, thought I'd run the cloud tests in this PR already, will fix up. |
622e62b
to
f71339d
Compare
Cloud tests are passing, should be good-to-go. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for chugging through the muck!
# - Fusing in the presence of expensive calls in the select would lead to potential | ||
# performance pitfalls | ||
if _.distinct and not all( | ||
isinstance(v, ops.Field) for v in _.selections.values() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this cover the alias-in-outer-project case?
SELECT a, b AS c
FROM (
SELECT DISTINCT
a, b
FROM t
)
would become
SELECT DISTINCT
a, b AS c
FROM t
If not, fine to either handle later or perhaps never if it doesn't come up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This branch is for both being distinct, so:
SELECT DISTINCT a, b as c
FROM (SELECT DISTINCT a, b FROM t)
In this case, yes, the aliases are properly handled.
The branch on line 270 handles the SELECT ... FROM (SELECT DISTINCT ...)
case. Right now it only works with SELECT *
cases, but we might be able to make it work with outer queries that rename columns but otherwise select all of them. Right now I don't think that's worth it.
@@ -351,6 +408,46 @@ def wrap(node, _, **kwargs): | |||
# supplemental rewrites selectively used on a per-backend basis | |||
|
|||
|
|||
@replace(Select) | |||
def split_select_distinct_with_order_by(_): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed opportunity to call this supplant_merged_with_ordered_distinct
😂
This generates more concise SQL when chaining relational operations with
.distinct()
.Fixes #9905.