Add a way to test for list equality #10138

Closed
m-legrand opened this issue Jul 28, 2023 · 4 comments · Fixed by #10857
Labels
enhancement New feature or an improvement of an existing feature

Comments

@m-legrand

m-legrand commented Jul 28, 2023

Problem description

I'm currently working with a dataset that contains list columns, and I was surprised not to find an easy way to test for list equality:

>>> import polars as pl
>>> data = pl.DataFrame({"x": [[1], [1, 2], [2, 3]]}, schema={"x": pl.List(int)})
>>> data
shape: (3, 1)
┌───────────┐
│ x         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ [1]       │
│ [1, 2]    │
│ [2, 3]    │
└───────────┘
>>> data.filter(pl.col("x") == pl.lit([1]))
ArrowErrorException: NotYetImplemented("Casting from Int64 to 
    LargeList(Field { name: \"item\", data_type: Int64, is_nullable: true, metadata: {} }) 
    not supported")

Maybe I missed the right incantation from the Expr.list namespace?
In the meantime I went with the following utility function:

def filter_list_equal(df: pl.DataFrame, colname: str, values: list) -> pl.DataFrame:
    col = pl.col(colname)
    lf = df.lazy()
    lf = lf.filter(col.list.lengths() == pl.lit(len(values)))
    for i, v in enumerate(values):
        lf = lf.filter(col.list[i] == pl.lit(v))
    return lf.collect()

m-legrand added the enhancement label Jul 28, 2023
@ion-elgreco
Contributor

ion-elgreco commented Jul 28, 2023

You need to wrap [1] in another list; otherwise it's interpreted as an int (see #7879). You also need to add it as a column before you filter on it. I'm not sure why it doesn't work directly within the filter. I think I saw an issue about this before.

data.with_columns(pl.lit([[1]]).alias('y')).filter(pl.col('x') == pl.col('y'))

@ritchie46
Member

Something seems to go wrong when we inline the predicate. I will take a look later.

@m-legrand
Author

m-legrand commented Jul 29, 2023

Having to assign a new column (and delete it afterwards) also makes for a more cumbersome user experience.
Not to mention having to come up with column names that I'm sure my input dataframe doesn't already have!

@cmdlineluser
Contributor

This was asked again today on Stack Overflow: https://stackoverflow.com/questions/77002768/how-to-filter-a-polars-dataframe-with-list-type-columns

The current workaround to avoid adding a new column seems to be casting to numerics if necessary (e.g. str -> cat) and comparing with .hash():

df.filter(pl.col("x").hash() != pl.lit([[1]]).hash())

# shape: (2, 1)
# ┌───────────┐
# │ x         │
# │ ---       │
# │ list[i64] │
# ╞═══════════╡
# │ [1, 2]    │
# │ [2, 3]    │
# └───────────┘

It was also asked a couple of weeks ago: https://stackoverflow.com/questions/76875762/filter-on-listint64-dtype-in-polars
