feat(api): add distinct option to collect #10121

Merged
merged 1 commit into from
Sep 16, 2024

Conversation

@jcrist (Member) commented Sep 13, 2024

This adds a new `distinct` option to `collect` (defaulting to `False`). When `True`, only distinct elements are collected.

Depending on the backend, calling `x.collect(distinct=True)` may be more efficient than `x.collect().unique()`. It also has the added benefit of not disrupting array ordering on backends where `unique` may not respect ordering.
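As a rough illustration of the intended semantics, here is a pure-Python model. It is not how ibis or any backend implements the option: the `collect` helper below is hypothetical, and it assumes that first-occurrence order is kept when `distinct=True` and that nulls are dropped unless `include_null=True`. It contrasts that with deduplicating afterwards through a set, which may reorder elements.

```python
from typing import Hashable, Iterable, Optional


def collect(
    values: Iterable[Optional[Hashable]],
    *,
    distinct: bool = False,
    include_null: bool = False,
) -> list:
    """Hypothetical model of `collect`; names and defaults are assumptions."""
    out = []
    seen = set()
    for value in values:
        if value is None and not include_null:
            continue  # drop nulls unless explicitly requested
        if distinct:
            if value in seen:
                continue  # keep only the first occurrence
            seen.add(value)
        out.append(value)
    return out


data = ["b", "a", None, "b", "c", "a"]

ordered_distinct = collect(data, distinct=True)
print(ordered_distinct)  # ['b', 'a', 'c']

# Deduplicating after the fact through a set keeps the same elements,
# but their order is unspecified, so it is sorted here for a stable printout.
unordered_distinct = list(set(collect(data)))
print(sorted(unordered_distinct))  # ['a', 'b', 'c']
```

In this model the distinct pass happens while collecting, so there is no second step that could shuffle the result.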

@jcrist added the ci-run-cloud label (Add this label to trigger a run of BigQuery, Snowflake, and Databricks backends in CI) on Sep 13, 2024
@ibis-docs-bot (bot) removed the ci-run-cloud label on Sep 13, 2024
df.id[(df.bigint_col == 10) & pd_cond].sort_values(ascending=False).astype(str)
)
assert result == expected
def gen_test_collect_marks(distinct, filtered, ordered, include_null):
Member Author

The other option was to add a bunch of strict=False checks everywhere, making the test less strict. In the long run I might want to add a new pytest shorthand for handling parametrizing a test with a cross-product of parameters with markers for certain parameter combos, but for now breaking the mark generation into a utility function didn't seem too bad.
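A minimal sketch of that kind of utility, using only the standard library: one function decides which expected-failure reasons apply to a single parameter combination, and the cross-product is built around it. The `gen_marks` name and both rules are illustrative assumptions, not the real backend limitations or the helper from this PR.

```python
import itertools

FLAGS = ("distinct", "filtered", "ordered", "include_null")


def gen_marks(distinct, filtered, ordered, include_null):
    """Return the expected-failure reasons for one parameter combination.

    Both rules below are made up for illustration.
    """
    reasons = []
    if distinct and filtered:
        reasons.append("distinct + where unsupported")
    if include_null and ordered:
        reasons.append("include_null + order_by unsupported")
    return reasons


# Full cross-product of the four boolean flags, each paired with its reasons.
cases = [
    (combo, gen_marks(*combo))
    for combo in itertools.product([False, True], repeat=len(FLAGS))
]

for combo, reasons in cases:
    label = "-".join(name for name, on in zip(FLAGS, combo) if on) or "default"
    print(label, reasons)

# With pytest installed, each case could be wrapped as
#   pytest.param(*combo, marks=[pytest.mark.xfail(reason=r) for r in reasons])
# and the resulting list passed to @pytest.mark.parametrize.
```

Keeping the rules in one function means the test body stays strict, and only the combinations named by a rule are expected to fail.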

# TODO: Flink supposedly supports `ARRAY_AGG(DISTINCT ...)`, but it
# doesn't work with filtering (either `include_null=False` or
# additional filtering). Their `array_distinct` function does maintain
# ordering though, so we can use it here.
Member Author

Flink does some really weird stuff here. One parameter combo somehow results in filtering out all but two null values for unknown reasons.

ibis/expr/operations/reductions.py (resolved)
@jcrist (Member, Author) commented Sep 13, 2024

Cloud tests are passing fine, so this should be ready for review.

@cpcloud (Member) left a comment

Nothing blocking, thanks for adding this!

ibis/expr/tests/test_reductions.py (resolved)
if include_null:
    raise com.UnsupportedOperationError(
        "`include_null=True` is not supported by the snowflake backend"
    )
if where is not None and distinct:
    raise com.UnsupportedOperationError(
        "Combining `distinct=True` and `where` is not supported by snowflake"
Member

Suggested change
- "Combining `distinct=True` and `where` is not supported by snowflake"
+ "Combining `distinct=True` and `where` is not supported by Snowflake"

Only because it's a proper noun and not a generic one :)

Alternatively, we can leave this for a follow-up and audit all the backends for proper noun capitalization.

Member Author

I know there are more of these in the codebase, so I'll leave this as a follow-up.
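For reference, the guard-clause pattern in the Snowflake snippet under review can be sketched in isolation. This is only an illustration: the exception class is a stand-in for `com.UnsupportedOperationError`, the `check_collect_options` function is hypothetical, and the messages use the capitalized backend name from the suggestion.

```python
class UnsupportedOperationError(Exception):
    """Stand-in for the `com.UnsupportedOperationError` seen in the diff."""


def check_collect_options(*, where=None, include_null=False, distinct=False):
    """Hypothetical validator mirroring the Snowflake checks in the diff."""
    if include_null:
        raise UnsupportedOperationError(
            "`include_null=True` is not supported by the Snowflake backend"
        )
    if where is not None and distinct:
        raise UnsupportedOperationError(
            "Combining `distinct=True` and `where` is not supported by Snowflake"
        )


# Allowed: distinct on its own, with no filter involved.
check_collect_options(distinct=True)

# Rejected: distinct combined with a filter.
try:
    check_collect_options(where="x > 0", distinct=True)
except UnsupportedOperationError as exc:
    message = str(exc)
    print(message)
```

Failing early with a message that names the option combination tells the user which argument to drop, rather than surfacing a backend SQL error.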

if where is not None:
    if include_null:
        raise com.UnsupportedOperationError(
            "Combining `include_null=True` and `where` is not supported by bigquery"
Member

Suggested change
- "Combining `include_null=True` and `where` is not supported by bigquery"
+ "Combining `include_null=True` and `where` is not supported by BigQuery"

        )
    if distinct:
        raise com.UnsupportedOperationError(
            "Combining `distinct=True` and `where` is not supported by bigquery"
Member

Suggested change
- "Combining `distinct=True` and `where` is not supported by bigquery"
+ "Combining `distinct=True` and `where` is not supported by BigQuery"

def visit_ArrayCollect(self, op, *, arg, where, order_by, include_null, distinct):
    if distinct:
        raise com.UnsupportedOperationError(
            "`collect` with `distinct=True` is not supported"
Member

Suggested change
- "`collect` with `distinct=True` is not supported"
+ "`collect` with `distinct=True` is not supported by DataFusion"

ibis/expr/operations/reductions.py (resolved)
@cpcloud added this to the 10.0 milestone on Sep 14, 2024
@cpcloud added the feature (Features or general enhancements) and ux (User experience related issues) labels on Sep 14, 2024
@jcrist jcrist merged commit 13cf036 into ibis-project:main Sep 16, 2024
78 checks passed
@jcrist jcrist deleted the collect-distinct branch September 16, 2024 13:10