docs: Specify the semantics of empty Series aggregations #19739
@@ -134,3 +134,42 @@ This means that if you were to use a `lambda` or a custom Python function to app
Polars will try to parallelize the computation of the aggregating functions over the groups, so it is recommended that you avoid using `lambda`s and custom Python functions as much as possible.
Instead, try to stay within the realm of the Polars expression API.
This is not always possible, though, so if you want to learn more about using `lambda`s you can go [the user guide section on using user-defined functions](user-defined-python-functions.md).

## Behavior with empty `Series`

Polars tries to follow aggregation semantics that match closely with [set theory](https://en.wikipedia.org/wiki/Empty_set#Operations_on_the_empty_set) and python semantics. This means that we might differ from SQL semantics for operations on operations on empty Series. For example, `pl.Series([], pl.Int32).sum()` is equal to `0` in Polars, where it would be a missing value or `NULL` when following SQL semantics. Consequently, `.group_by().agg()` on columns with `null` values might result in different results than would be given by an SQL engine. Below, we provide an overview of all aggregations and the return value when performed on an empty series.
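The contrast with SQL described in this paragraph can be reproduced without Polars. The sketch below is only an illustration under stated assumptions: Python's built-in `sum` stands in for the Python behaviour the paragraph refers to, SQLite (through the standard-library `sqlite3` module) stands in for "an SQL engine", and the empty table `t` with its column `x` is hypothetical. Other SQL engines may differ in detail.

```python
import sqlite3

# Hypothetical empty table `t` with one integer column `x` (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")

# Aggregate over zero rows in SQL.
sql_sum, sql_count = conn.execute("SELECT SUM(x), COUNT(x) FROM t").fetchone()
conn.close()

# Aggregate over zero elements in plain Python.
python_sum = sum([])

print(sql_sum)     # None: SQL NULL, i.e. "no value"
print(sql_count)   # 0
print(python_sum)  # 0: the additive identity
```

The SQL `SUM` over no rows comes back as `NULL`, while the Python sum over no elements is `0`, which is the value the proposed paragraph attributes to Polars.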
(A sentence per line is good because it makes diffs cleaner and makes it easier to review the docs.)

The table shows results for aggregations computed on empty series.

You used the word “semantics” 4 times in the first 3 sentences and that's quite a heavy word for a user-friendly user guide.

When computing aggregations on empty series, Polars tries to follow set theory and Python's behaviour.

If you don't agree with my other subjective criticism on this paragraph, at least a couple of adjustments need to be made to fix typos and for consistency with the remainder of the docs:
Suggested change
Or, “but it should be |
| Aggregation | Empty Series return value |
All of the
|-------------------|---------------------------|
| `min` | `null` |
| `max` | `null` |
| `nan_min` | `null` |
| `nan_max` | `null` |
| `arg_max` | `null` |
| `arg_min` | `null` |
| `sum` | `0` |
| `product` | `1` |
| `mean` | `null` |
| `median` | `null` |
| `std` | `null` |
| `var` | `null` |
| `n_unique` | `0` |
| `approx_n_unique` | `0` |
| `null_count` | `0` |
| `has_nulls` | `false` |
| `first` | `null` |
| `last` | `null` |
| `quantile` | `null` |
| `get` | n/a |
I can't find the method
>>> import polars as pl
>>> pl.Series([]).get()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Series' object has no attribute 'get'. Did you mean: 'ge'?
| `count` | `0` |
| `len` | `0` |
| `implode` | `[ ]` |
Suggested change
| `bitwise_and` | `null` |
| `bitwise_or` | `null` |
| `bitwise_xor` | `null` |
| `all` | `True` |
| `any` | `False` |
| `entropy` | `-0.0` |
| `kurtosis` | `null` |
| `lower_bound` | type dependent value |
| `upper_bound` | type dependent value |
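Several rows of the table line up with how plain Python treats an empty list, which may help explain the split between identity values and `null`. The sketch below uses only the Python standard library and does not call Polars. The helper `raises` and both dictionaries are hypothetical names introduced for illustration, and mapping a Python exception to the `null` rows of the table is an interpretation, not something the PR states.

```python
import math
import statistics

empty = []

# Aggregations with a natural identity element return it for empty input.
identity_results = {
    "sum": sum(empty),            # additive identity
    "product": math.prod(empty),  # multiplicative identity
    "all": all(empty),            # vacuously true
    "any": any(empty),            # no element is truthy
}


def raises(func, exc):
    """Return True if calling func on the empty list raises exc."""
    try:
        func(empty)
    except exc:
        return True
    return False


# Aggregations without an identity element have no answer for empty input;
# Python raises an exception where the table above lists `null`.
undefined_results = {
    "min": raises(min, ValueError),
    "max": raises(max, ValueError),
    "mean": raises(statistics.mean, statistics.StatisticsError),
}

print(identity_results)
print(undefined_results)
```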
Preferred:
But this might do as well: