docs(python): Improve filter documentation #17755

Merged: 1 commit, Jul 25, 2024
80 changes: 67 additions & 13 deletions py-polars/polars/dataframe/frame.py
@@ -4381,28 +4381,39 @@ def filter(
Each constraint will behave the same as `pl.col(name).eq(value)`, and
will be implicitly joined with the other filter conditions using `&`.

+Notes
+-----
+If you are transitioning from pandas and performing filter operations based on
+the comparison of two or more columns, please note that in Polars,
+any comparison involving null values will always result in null.
+As a result, these rows will be filtered out.
+Ensure to handle null values appropriately to avoid unintended filtering
Review comment by ritchie46 (Member), Jul 23, 2024:

I think you should mention ne_missing and eq_missing, and I'd rather see an example using those methods as those will include missing values in the comparison.

Contributor Author:

Yep, you are right. Will do, and will update the example.

Contributor Author:

Done

+(See examples below).


Examples
--------
>>> df = pl.DataFrame(
... {
-... "foo": [1, 2, 3],
-... "bar": [6, 7, 8],
-... "ham": ["a", "b", "c"],
+... "foo": [1, 2, 3, None, 4, None, 0],
+... "bar": [6, 7, 8, None, None, 9, 0],
+... "ham": ["a", "b", "c", None, "d", "e", "f"],
... }
... )

Filter on one condition:

>>> df.filter(pl.col("foo") > 1)
-shape: (2, 3)
-┌─────┬─────┬─────┐
-│ foo ┆ bar ┆ ham │
-│ --- ┆ --- ┆ --- │
-│ i64 ┆ i64 ┆ str │
-╞═════╪═════╪═════╡
-│ 2   ┆ 7   ┆ b   │
-│ 3   ┆ 8   ┆ c   │
-└─────┴─────┴─────┘
+shape: (3, 3)
+┌─────┬──────┬─────┐
+│ foo ┆ bar  ┆ ham │
+│ --- ┆ ---  ┆ --- │
+│ i64 ┆ i64  ┆ str │
+╞═════╪══════╪═════╡
+│ 2   ┆ 7    ┆ b   │
+│ 3   ┆ 8    ┆ c   │
+│ 4   ┆ null ┆ d   │
+└─────┴──────┴─────┘

Filter on multiple conditions, combined with and/or operators:

@@ -4433,13 +4444,14 @@ def filter(
... pl.col("foo") <= 2,
... ~pl.col("ham").is_in(["b", "c"]),
... )
-shape: (1, 3)
+shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
+│ 0   ┆ 0   ┆ f   │
└─────┴─────┴─────┘

Provide multiple filters using `**kwargs` syntax:
@@ -4453,6 +4465,48 @@ def filter(
╞═════╪═════╪═════╡
│ 2   ┆ 7   ┆ b   │
└─────┴─────┴─────┘

+Filter by comparing two columns against each other
+
+>>> df.filter(pl.col("foo") == pl.col("bar"))
+shape: (1, 3)
+┌─────┬─────┬─────┐
+│ foo ┆ bar ┆ ham │
+│ --- ┆ --- ┆ --- │
+│ i64 ┆ i64 ┆ str │
+╞═════╪═════╪═════╡
+│ 0   ┆ 0   ┆ f   │
+└─────┴─────┴─────┘
+
+>>> df.filter(pl.col("foo") != pl.col("bar"))
+shape: (3, 3)
+┌─────┬─────┬─────┐
+│ foo ┆ bar ┆ ham │
+│ --- ┆ --- ┆ --- │
+│ i64 ┆ i64 ┆ str │
+╞═════╪═════╪═════╡
+│ 1   ┆ 6   ┆ a   │
+│ 2   ┆ 7   ┆ b   │
+│ 3   ┆ 8   ┆ c   │
+└─────┴─────┴─────┘
+
+Notice how the row with `None` values is filtered out. In order to keep the
+same behavior as pandas, use:
+
+>>> df.filter(pl.col("foo").ne_missing(pl.col("bar")))
+shape: (5, 3)
+┌──────┬──────┬─────┐
+│ foo  ┆ bar  ┆ ham │
+│ ---  ┆ ---  ┆ --- │
+│ i64  ┆ i64  ┆ str │
+╞══════╪══════╪═════╡
+│ 1    ┆ 6    ┆ a   │
+│ 2    ┆ 7    ┆ b   │
+│ 3    ┆ 8    ┆ c   │
+│ 4    ┆ null ┆ d   │
+│ null ┆ 9    ┆ e   │
+└──────┴──────┴─────┘

"""
return self.lazy().filter(*predicates, **constraints).collect(_eager=True)

75 changes: 63 additions & 12 deletions py-polars/polars/lazyframe/frame.py
@@ -2983,28 +2983,38 @@ def filter(
Each constraint will behave the same as `pl.col(name).eq(value)`, and
will be implicitly joined with the other filter conditions using `&`.

+Notes
+-----
+If you are transitioning from pandas and performing filter operations based on
+the comparison of two or more columns, please note that in Polars,
+any comparison involving null values will always result in null.
+As a result, these rows will be filtered out.
+Ensure to handle null values appropriately to avoid unintended filtering
+(See examples below).

Examples
--------
>>> lf = pl.LazyFrame(
... {
-... "foo": [1, 2, 3],
-... "bar": [6, 7, 8],
-... "ham": ["a", "b", "c"],
+... "foo": [1, 2, 3, None, 4, None, 0],
+... "bar": [6, 7, 8, None, None, 9, 0],
+... "ham": ["a", "b", "c", None, "d", "e", "f"],
... }
... )

Filter on one condition:

>>> lf.filter(pl.col("foo") > 1).collect()
-shape: (2, 3)
-┌─────┬─────┬─────┐
-│ foo ┆ bar ┆ ham │
-│ --- ┆ --- ┆ --- │
-│ i64 ┆ i64 ┆ str │
-╞═════╪═════╪═════╡
-│ 2   ┆ 7   ┆ b   │
-│ 3   ┆ 8   ┆ c   │
-└─────┴─────┴─────┘
+shape: (3, 3)
+┌─────┬──────┬─────┐
+│ foo ┆ bar  ┆ ham │
+│ --- ┆ ---  ┆ --- │
+│ i64 ┆ i64  ┆ str │
+╞═════╪══════╪═════╡
+│ 2   ┆ 7    ┆ b   │
+│ 3   ┆ 8    ┆ c   │
+│ 4   ┆ null ┆ d   │
+└─────┴──────┴─────┘

Filter on multiple conditions:

@@ -3057,6 +3067,47 @@ def filter(
│ 1   ┆ 6   ┆ a   │
│ 3   ┆ 8   ┆ c   │
└─────┴─────┴─────┘

+Filter by comparing two columns against each other
+
+>>> lf.filter(pl.col("foo") == pl.col("bar")).collect()
+shape: (1, 3)
+┌─────┬─────┬─────┐
+│ foo ┆ bar ┆ ham │
+│ --- ┆ --- ┆ --- │
+│ i64 ┆ i64 ┆ str │
+╞═════╪═════╪═════╡
+│ 0   ┆ 0   ┆ f   │
+└─────┴─────┴─────┘
+
+>>> lf.filter(pl.col("foo") != pl.col("bar")).collect()
+shape: (3, 3)
+┌─────┬─────┬─────┐
+│ foo ┆ bar ┆ ham │
+│ --- ┆ --- ┆ --- │
+│ i64 ┆ i64 ┆ str │
+╞═════╪═════╪═════╡
+│ 1   ┆ 6   ┆ a   │
+│ 2   ┆ 7   ┆ b   │
+│ 3   ┆ 8   ┆ c   │
+└─────┴─────┴─────┘
+
+Notice how the row with `None` values is filtered out.
+In order to keep the same behavior as pandas, use:
+
+>>> lf.filter(pl.col("foo").ne_missing(pl.col("bar"))).collect()
+shape: (5, 3)
+┌──────┬──────┬─────┐
+│ foo  ┆ bar  ┆ ham │
+│ ---  ┆ ---  ┆ --- │
+│ i64  ┆ i64  ┆ str │
+╞══════╪══════╪═════╡
+│ 1    ┆ 6    ┆ a   │
+│ 2    ┆ 7    ┆ b   │
+│ 3    ┆ 8    ┆ c   │
+│ 4    ┆ null ┆ d   │
+│ null ┆ 9    ┆ e   │
+└──────┴──────┴─────┘
"""
all_predicates: list[pl.Expr] = []
boolean_masks = []