Handling of null values in the polars.Expr.rank #19415

Closed
2 tasks done
hanepudding opened this issue Oct 24, 2024 · 2 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

hanepudding commented Oct 24, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

This may be related to #18243.
I updated to 1.11, released today (2024-10-24), and the problem persists.

import datetime as dt

import numpy as np
import polars as pl

example = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate": [
            dt.date(1997, 1, 10),
            dt.date(1985, 2, 15),
            dt.date(1983, 3, 22),
            dt.date(1981, 4, 30),
        ],
        "weight": [57.9, np.nan, 53.6, 83.1],
        "height": [1.56, 1.77, 1.65, 1.75],
    }
)

print(example.with_columns(
    ranking=pl.col("weight").rank(method="min", descending=True),
))

Actual behavior:

shape: (4, 5)
┌────────────────┬────────────┬────────┬────────┬─────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ ranking │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---     │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ u32     │
╞════════════════╪════════════╪════════╪════════╪═════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 3       │
│ Ben Brown      ┆ 1985-02-15 ┆ NaN    ┆ 1.77   ┆ 1       │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ 4       │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ 2       │
└────────────────┴────────────┴────────┴────────┴─────────┘

Log output

No response

Issue description

When encountering null values (np.nan, None, etc.), the rank method treats them as the biggest possible value, rather than ignoring them.

This behavior can be problematic. For example, suppose I am doing quant research and ranking my factors. A NaN means the model has no opinion, so it should give no signal about whether to trade. If the rank method treats NaN as the largest value, it may wrongly suggest a long trade.

Suggested improvement: an argument like "ignore_na=True/False" may help.

Expected behavior

Ignore the NaN and rank based on the valid values:

shape: (4, 5)
┌────────────────┬────────────┬────────┬────────┬─────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ ranking │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---     │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ u32     │
╞════════════════╪════════════╪════════╪════════╪═════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 2       │
│ Ben Brown      ┆ 1985-02-15 ┆ NaN    ┆ 1.77   ┆ NaN     │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ 3       │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ 1       │
└────────────────┴────────────┴────────┴────────┴─────────┘
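
For reference, the ranking semantics requested above can be sketched in plain Python. This is only an illustration of a "min" rank in descending order that skips missing values; the helper name `min_rank_descending` is hypothetical and it is not how Polars implements `rank`.

```python
import math
from typing import Optional, Sequence


def min_rank_descending(values: Sequence[Optional[float]]) -> list[Optional[int]]:
    """Rank values from largest to smallest using the "min" tie rule.

    None and NaN are treated as missing: they receive no rank and do not
    shift the ranks of the remaining values. Hypothetical helper, for
    illustration only.
    """

    def is_missing(v: Optional[float]) -> bool:
        return v is None or math.isnan(v)

    valid = [v for v in values if not is_missing(v)]
    ranks: list[Optional[int]] = []
    for v in values:
        if is_missing(v):
            ranks.append(None)
        else:
            # "min" method: 1 + the number of valid values strictly greater,
            # so tied values share the lowest rank of their group.
            ranks.append(1 + sum(1 for other in valid if other > v))
    return ranks


weights = [57.9, float("nan"), 53.6, 83.1]
print(min_rank_descending(weights))  # [2, None, 3, 1]
```

The printed ranks match the `ranking` column in the expected table, with the missing entry left unranked.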

Installed versions

--------Version info---------
Polars:              1.10.0  [Reproducible in 1.11.0 too]
Index type:          UInt32
Platform:            Windows-11-10.0.22631-SP0
Python:              3.12.5 (tags/v3.12.5:ff3bc82, Aug  6 2024, 20:45:27) [MSC v.1940 64 bit (AMD64)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               5.4.1
cloudpickle          3.0.0
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.6.1
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                2.0.1
openpyxl             3.1.5
pandas               2.2.2
pyarrow              17.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           2.0.35
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

cmdlineluser (Contributor) commented Oct 24, 2024

https://docs.pola.rs/user-guide/expressions/missing-data/#notanumber-or-nan-values

NaN is not considered to be missing data in Polars

Actual null values are ignored:

example.with_columns(
    ranking=pl.col.weight.fill_nan(None).rank(method="min", descending=True)
)

# shape: (4, 5)
# ┌────────────────┬────────────┬────────┬────────┬─────────┐
# │ name           ┆ birthdate  ┆ weight ┆ height ┆ ranking │
# │ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---     │
# │ str            ┆ date       ┆ f64    ┆ f64    ┆ u32     │
# ╞════════════════╪════════════╪════════╪════════╪═════════╡
# │ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 2       │
# │ Ben Brown      ┆ 1985-02-15 ┆ NaN    ┆ 1.77   ┆ null    │
# │ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ 3       │
# │ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ 1       │
# └────────────────┴────────────┴────────┴────────┴─────────┘

orlp (Collaborator) commented Oct 24, 2024

This works as intended.

orlp closed this as not planned on Oct 24, 2024