Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/aggregates as windows #871

Merged
merged 4 commits into from
Sep 18, 2024

Conversation

timsaucer
Copy link
Contributor

@timsaucer timsaucer commented Sep 12, 2024

Which issue does this PR close?

Closes #865
Closes #688

Rationale for this change

This function allows the user to turn any aggregate function into a window function. Previously using aggregates as windows was only expressed through functions.window() which took the aggregate's name. This only supports built in window functions or functions that have been registered with the session context. While this worked, it's a bit of an anti-pattern for people who don't need to register functions into the session context.

What changes are included in this PR?

Adds .over() to an expression which indicates it is to be used as a window function.

Are there any user-facing changes?

During testing I noticed that the default window frames for f.window() are set to unbounded preceeding to unbounded following. With .over() it uses the default frames, so if ordering is set it goes from unbounded preceeding to current row. This does match user expectation.

Example

Old version:

f.window(
    "first_value",
    [column("a")],
    order_by=[f.order_by(column("b"))],
    partition_by=[column("c")],

New version:

f.first_value(column("a")).over(
    partition_by=[column("c")],
    order_by=[column("b")]
)

@timsaucer timsaucer force-pushed the feature/aggregates_as_windows branch from bdbf77f to 64d3415 Compare September 17, 2024 21:09
@timsaucer timsaucer marked this pull request as ready for review September 17, 2024 21:09
Copy link
Contributor

@emgeee emgeee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me

@Michael-J-Ward
Copy link
Contributor

During testing I noticed that the default window frames for f.window() are set to unbounded preceeding to unbounded following. With .over() it uses the default frames, so if ordering is set it goes from unbounded preceeding to current row. This does match user expectation.

So does this PR also close #688?

@timsaucer
Copy link
Contributor Author

It does! Updated the description

@Michael-J-Ward
Copy link
Contributor

I had just verified with a test - if this test is not duplicative then maybe we should add it.

def test_window_frame_defaults_match_postgres(ctx):
    # ref: https://github.com/apache/datafusion-python/issues/688
    
    # create a RecordBatch and a new DataFrame from it
    batch = pa.RecordBatch.from_arrays(
        [pa.array([1.0, 10.0, 20.0])],
        names=["a"],
    )

    df = ctx.create_dataframe([[batch]])

    window_frame = WindowFrame("rows", None, None)

    col_a = column("a")

    # Using `f.window` with or without an unbounded window_frame produces the same results
    no_frame = f.window("avg", [col_a]).alias('no_frame')
    with_frame = f.window("avg", [col_a], window_frame=window_frame).alias('with_frame')
    df_1 = df.select(col_a, no_frame, with_frame)

    expected = """DataFrame()
+------+--------------------+--------------------+
| a    | no_frame           | with_frame         |
+------+--------------------+--------------------+
| 1.0  | 10.333333333333334 | 10.333333333333334 |
| 10.0 | 10.333333333333334 | 10.333333333333334 |
| 20.0 | 10.333333333333334 | 10.333333333333334 |
+------+--------------------+--------------------+
""".strip()

    assert str(df_1) == expected

    # When `order_by` is set, the default window should be `unbounded preceding` to `current row`
    no_order = f.avg(col_a).over(Window()).alias('over_no_order')
    with_order = f.avg(col_a).over(Window(order_by=[col_a])).alias('over_with_order')
    df_2 = df.select(col_a, no_order, with_order)

    expected = """DataFrame()
+------+--------------------+--------------------+
| a    | over_no_order      | over_with_order    |
+------+--------------------+--------------------+
| 1.0  | 10.333333333333334 | 1.0                |
| 10.0 | 10.333333333333334 | 5.5                |
| 20.0 | 10.333333333333334 | 10.333333333333334 |
+------+--------------------+--------------------+
""".strip()
    assert str(df_2) == expected

Copy link
Contributor

@Michael-J-Ward Michael-J-Ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One suggestion on an error report, but otherwise LGTM.

src/expr.rs Outdated
),
_ => Err(
DataFusionError::ExecutionError(datafusion::error::DataFusionError::Plan(
"Using `over` requires an aggregate function.".to_string(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Using `over` requires an aggregate function.".to_string(),
format!("Using {} with `over` is not allowed. Must use an aggregate or window function.", self.expr.variant_name())

@Michael-J-Ward Michael-J-Ward merged commit a00cfbf into apache:main Sep 18, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: ergonomic way to use aggregates as window functions Default window frames to not match PostgreSQL
3 participants