-
Notifications
You must be signed in to change notification settings - Fork 609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(api): add distinct
option to collect
#10121
Conversation
df.id[(df.bigint_col == 10) & pd_cond].sort_values(ascending=False).astype(str) | ||
) | ||
assert result == expected | ||
def gen_test_collect_marks(distinct, filtered, ordered, include_null): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other option was to add a bunch of strict=False
checks everywhere, making the test less strict. In the long run I might want to add a new pytest shorthand for handling parametrizing a test with a cross-product of parameters with markers for certain parameter combos, but for now breaking the mark generation into a utility function didn't seem too bad.
# TODO: Flink supposedly supports `ARRAY_AGG(DISTINCT ...)`, but it | ||
# doesn't work with filtering (either `include_null=False` or | ||
# additional filtering). Their `array_distinct` function does maintain | ||
# ordering though, so we can use it here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flink does some really weird stuff here. One parameter combo somehow results in filtering out all but two null values for unknown reasons.
005f1f9
to
b54dd44
Compare
Cloud tests are passing fine, should be ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing blocking, thanks for adding this!
if include_null: | ||
raise com.UnsupportedOperationError( | ||
"`include_null=True` is not supported by the snowflake backend" | ||
) | ||
if where is not None and distinct: | ||
raise com.UnsupportedOperationError( | ||
"Combining `distinct=True` and `where` is not supported by snowflake" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Combining `distinct=True` and `where` is not supported by snowflake" | |
"Combining `distinct=True` and `where` is not supported by Snowflake" |
Only because it's a proper noun and not a generic one :)
Alternatively, we can leave this for a follow up and audit all the backends for proper noun capitalization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know there's more of these in the codebase, I'll leave this as a follow up.
if where is not None: | ||
if include_null: | ||
raise com.UnsupportedOperationError( | ||
"Combining `include_null=True` and `where` is not supported by bigquery" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Combining `include_null=True` and `where` is not supported by bigquery" | |
"Combining `include_null=True` and `where` is not supported by BigQuery" |
) | ||
if distinct: | ||
raise com.UnsupportedOperationError( | ||
"Combining `distinct=True` and `where` is not supported by bigquery" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Combining `distinct=True` and `where` is not supported by bigquery" | |
"Combining `distinct=True` and `where` is not supported by BigQuery" |
def visit_ArrayCollect(self, op, *, arg, where, order_by, include_null, distinct): | ||
if distinct: | ||
raise com.UnsupportedOperationError( | ||
"`collect` with `distinct=True` is not supported" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"`collect` with `distinct=True` is not supported" | |
"`collect` with `distinct=True` is not supported by DataFusion" |
This adds a new
distinct
option tocollect
(defaulting toFalse
). When True, only distinct elements are collected.Depending on the backend calling
x.collect(distinct=True)
may be more efficient thanx.collect().unique()
, and has the added benefit of not disrupting array ordering for backends whereunique
may not respect ordering.