docs: Specify the semantics of empty Series aggregations #19739

Open
wants to merge 1 commit into
base: main
39 changes: 39 additions & 0 deletions docs/source/user-guide/expressions/aggregation.md
@@ -134,3 +134,42 @@ This means that if you were to use a `lambda` or a custom Python function to app
Polars will try to parallelize the computation of the aggregating functions over the groups, so it is recommended that you avoid using `lambda`s and custom Python functions as much as possible.
Instead, try to stay within the realm of the Polars expression API.
This is not always possible, though, so if you want to learn more about using `lambda`s you can go to [the user guide section on using user-defined functions](user-defined-python-functions.md).

## Behavior with empty `Series`
Collaborator

Preferred:

Suggested change
## Behavior with empty `Series`
## Aggregations on empty series

But this might do as well:

Suggested change
## Behavior with empty `Series`
## Behavior with empty series


Polars tries to follow aggregation semantics that match closely with [set theory](https://en.wikipedia.org/wiki/Empty_set#Operations_on_the_empty_set) and python semantics. This means that we might differ from SQL semantics for operations on operations on empty Series. For example, `pl.Series([], pl.Int32).sum()` is equal to `0` in Polars, where it would be a missing value or `NULL` when following SQL semantics. Consequently, `.group_by().agg()` on columns with `null` values might result in different results than would be given by an SQL engine. Below, we provide an overview of all aggregations and the return value when performed on an empty series.
Collaborator

(A sentence per line is good because it makes diffs cleaner and makes it easier to review the docs.)

“Consequently, .group_by().agg() on columns with null values might result in different results than would be given by an SQL ”

The table shows results for aggregations computed on empty series.
What do empty series have to do with series that contain null values?

Collaborator

You used the word “semantics” 4 times in the first 3 sentences and that's quite a heavy word for a user-friendly user guide.
Here's a possible rewrite in simpler English:

When computing aggregations on empty series, Polars tries to follow set theory and Python's behaviour.
This differs from SQL for some operations: for example, `pl.Series([], pl.Int32).sum()` is equal to `0` in Polars but it is `NULL` in SQL.
Below we provide an overview of all aggregations and the return value when performed on an empty series.
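
For reference, the "Python's behaviour" mentioned here can be sketched with built-ins only, no Polars involved. This is an illustration of plain Python, not of the Polars implementation.

```python
import math

empty = []

# Folding an empty collection returns the identity element of the operation.
print(sum(empty))        # 0, the identity for addition
print(math.prod(empty))  # 1, the identity for multiplication
print(all(empty))        # True (vacuous truth)
print(any(empty))        # False

# min/max have no identity element, so plain Python raises instead.
try:
    min(empty)
except ValueError as exc:
    print("min([]) raised:", exc)
```

Note that `min` and `max` are where the analogy stops: plain Python raises, whereas the table in this PR lists `null` for them.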

Collaborator

If you don't agree with my other subjective criticism on this paragraph, at least a couple of adjustments need to be made to fix typos and for consistency with the remainder of the docs:
(Again, one sentence / line would make it easier to review my suggested changes.)

Suggested change
Polars tries to follow aggregation semantics that match closely with [set theory](https://en.wikipedia.org/wiki/Empty_set#Operations_on_the_empty_set) and python semantics. This means that we might differ from SQL semantics for operations on operations on empty Series. For example, `pl.Series([], pl.Int32).sum()` is equal to `0` in Polars, where it would be a missing value or `NULL` when following SQL semantics. Consequently, `.group_by().agg()` on columns with `null` values might result in different results than would be given by an SQL engine. Below, we provide an overview of all aggregations and the return value when performed on an empty series.
Polars tries to follow aggregation semantics that match closely [set theory](https://en.wikipedia.org/wiki/Empty_set#Operations_on_the_empty_set) and Python semantics. This means that we might differ from SQL for operations on empty series. For example, `pl.Series([], dtype=pl.Int32).sum()` is equal to `0` in Polars, but it would be a missing value or `NULL` in SQL. Consequently, `.group_by().agg()` on columns with `null` values might result in different results than those that would be given by an SQL engine. Below, we provide an overview of all aggregations and the return value when performed on an empty series.

Or, “but it should be None if we followed SQL (semantics)”.


| Aggregation | Empty Series return value |
Collaborator

All of the `null`s in this table are actually `None`, aren't they?
And `true` should be `True` and `false` should be `False`.

|-------------------|---------------------------|
| `min` | `null` |
| `max` | `null` |
| `nan_min` | `null` |
| `nan_max` | `null` |
| `arg_max` | `null` |
| `arg_min` | `null` |
| `sum` | `0` |
| `product` | `1` |
| `mean` | `null` |
| `median` | `null` |
| `std` | `null` |
| `var` | `null` |
| `n_unique` | `0` |
| `approx_n_unique` | `0` |
| `null_count` | `0` |
| `has_nulls` | `false` |
| `first` | `null` |
| `last` | `null` |
| `quantile` | `null` |
| `get` | n/a |
Collaborator

I can't find the method `get` on a series:

>>> import polars as pl
>>> pl.Series([]).get()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Series' object has no attribute 'get'. Did you mean: 'ge'?

| `count` | `0` |
| `len` | `0` |
| `implode` | `[ ]` |
Collaborator

Suggested change
| `implode` | `[ ]` |
| `implode` | `[]` |

| `bitwise_and` | `null` |
| `bitwise_or` | `null` |
| `bitwise_xor` | `null` |
| `all` | `True` |
| `any` | `False` |
| `entropy` | `-0.0` |
| `kurtosis` | `null` |
| `lower_bound` | type dependent value |
| `upper_bound` | type dependent value |