Handling of null values in the polars.Expr.rank #19415

hanepudding · 2024-10-24T03:19:26Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

This may be related to #18243.
I updated to 1.11, released today (2024-10-24), and the behavior is unchanged.

import datetime as dt

import numpy as np
import polars as pl

example = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate": [
            dt.date(1997, 1, 10),
            dt.date(1985, 2, 15),
            dt.date(1983, 3, 22),
            dt.date(1981, 4, 30),
        ],
        "weight": [57.9, np.nan, 53.6, 83.1],
        "height": [1.56, 1.77, 1.65, 1.75],
    }
)

print(example.with_columns(
    ranking=pl.col("weight").rank(method="min", descending=True),
))

Actual behavior:

shape: (4, 5)
┌────────────────┬────────────┬────────┬────────┬─────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ ranking │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---     │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ u32     │
╞════════════════╪════════════╪════════╪════════╪═════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 3       │
│ Ben Brown      ┆ 1985-02-15 ┆ NaN    ┆ 1.77   ┆ 1       │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ 4       │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ 2       │
└────────────────┴────────────┴────────┴────────┴─────────┘

Log output

No response

Issue description

When it encounters missing values (np.nan, None, etc.), the rank method treats them as the largest possible value rather than ignoring them.

This behavior can be problematic. For example, I do quant research and rank my factors. A NaN means the model has no opinion, so it should not influence the decision to trade or not. If rank treats NaN as the largest value, it can wrongly suggest a long trade.

Suggested improvement: an argument such as "ignore_na=True/False" might help.

Expected behavior

Ignore the NaN and rank based on the valid values:

shape: (4, 5)
┌────────────────┬────────────┬────────┬────────┬─────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ ranking │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---     │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ u32     │
╞════════════════╪════════════╪════════╪════════╪═════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 2       │
│ Ben Brown      ┆ 1985-02-15 ┆ NaN    ┆ 1.77   ┆ NaN     │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ 3       │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ 1       │
└────────────────┴────────────┴────────┴────────┴─────────┘

Installed versions

--------Version info---------
Polars:              1.10.0  [Reproducible in 1.11.0 too]
Index type:          UInt32
Platform:            Windows-11-10.0.22631-SP0
Python:              3.12.5 (tags/v3.12.5:ff3bc82, Aug  6 2024, 20:45:27) [MSC v.1940 64 bit (AMD64)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               5.4.1
cloudpickle          3.0.0
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.6.1
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                2.0.1
openpyxl             3.1.5
pandas               2.2.2
pyarrow              17.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           2.0.35
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

cmdlineluser · 2024-10-24T07:46:28Z

https://docs.pola.rs/user-guide/expressions/missing-data/#notanumber-or-nan-values

NaN is not considered to be missing data in Polars

Actual null values are ignored:

example.with_columns(
    ranking=pl.col.weight.fill_nan(None).rank(method="min", descending=True)
)

# shape: (4, 5)
# ┌────────────────┬────────────┬────────┬────────┬─────────┐
# │ name           ┆ birthdate  ┆ weight ┆ height ┆ ranking │
# │ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---     │
# │ str            ┆ date       ┆ f64    ┆ f64    ┆ u32     │
# ╞════════════════╪════════════╪════════╪════════╪═════════╡
# │ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 2       │
# │ Ben Brown      ┆ 1985-02-15 ┆ NaN    ┆ 1.77   ┆ null    │
# │ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ 3       │
# │ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ 1       │
# └────────────────┴────────────┴────────┴────────┴─────────┘

orlp · 2024-10-24T08:34:35Z

This works as intended.

hanepudding added the bug, needs triage, and python labels on Oct 24, 2024

orlp closed this as not planned on Oct 24, 2024
