Filter a list in a column #6596

Closed
1 of 2 tasks
EmmanuelLM opened this issue Jan 31, 2023 · 8 comments

Comments

@EmmanuelLM

EmmanuelLM commented Jan 31, 2023

Research

  • I have searched the above polars tags on Stack Overflow for similar questions.

  • I have asked my usage related question on Stack Overflow.

Link to question on Stack Overflow

No response

Question about Polars

Hi, I have a DataFrame with a column that looks like this:
┌─────────────────────────────────────┐
│ Statut                              │
│ ---                                 │
│ list[str]                           │
╞═════════════════════════════════════╡
│ ["Absent excusé", "Vu"]             │
│ ["Absent excusé", "Vu"]             │
│ ["Absent excusé", "Absent excusé... │
│ ["Vu", "Absent excusé", "Absent ... │
│ ...                                 │
│ ["Vu", "Absent non excusé"]         │
│ ["Absent excusé", "Vu"]             │
│ ["Vu", "Vu"]                        │
│ ["Absent excusé", "Vu"]             │
└─────────────────────────────────────┘

polar_statut_vu.select(pl.col("Statut"))

What I would like to do is exclude rows where the list exactly matches ["Vu", "Vu"], but I cannot figure out how to do that.
I have tried: polar_statut_vu.filter((pl.col("Statut") != ["Vu", "Vu"])) but this raises an error.

Any help would be appreciated, I am sure it is a simple solution :)

@mcrumiller
Contributor

I can't seem to get this to work: @ritchie46, how does one create a list literal?

pl.lit([1, 1], dtype=pl.List(pl.Int64))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\430015439\AppData\Local\Programs\Python\Python39\lib\site-packages\polars\internals\lazy_functions.py", line 1155, in lit
    return pli.wrap_expr(pylit(value, allow_object)).cast(dtype)
ValueError: could not convert value '[1, 1]' as a Literal

We can do a literal pl.Series, so the closest I can get is this, but the filter still fails:

import polars as pl

df = pl.DataFrame({'a': [[1, 1], [2, 2]]})
s = pl.lit(
    pl.Series([[1, 1]], dtype=pl.List(pl.Int64))
)
df.filter(pl.col('a') == s)
shape: (2, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ [1, 1]    │
│ [2, 2]    │
└───────────┘

...but the filter obviously does not work. I tried is_in and that panics with an unimplemented Exception:

df.filter(pl.col('a').is_in(s))
thread '<unnamed>' panicked at 'this operation is not implemented/valid for this dtype: List(Int64)', D:\a\polars\polars\polars\polars-core\src\series\series_trait.rs:633:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\430015439\AppData\Local\Programs\Python\Python39\lib\site-packages\polars\internals\dataframe\frame.py", line 2702, in filter
    self.lazy()
  File "C:\Users\430015439\AppData\Local\Programs\Python\Python39\lib\site-packages\polars\internals\lazyframe\frame.py", line 1143, in collect
    return pli.wrap_df(ldf.collect())
pyo3_runtime.PanicException: this operation is not implemented/valid for this dtype: List(Int64)

@stinodego
Member

stinodego commented Jan 31, 2023

The following works, but it's not pretty:

col = pl.col('Statut')
mask = (col.arr.lengths() == 2) & (col.arr.get(0) == pl.lit('Vu')) & (col.arr.get(1) == pl.lit('Vu'))
df.filter(~mask)

We should definitely improve the usability here.
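
The mask above can be read as a length check followed by one comparison per position. As a rough illustration only, here is a plain-Python model of that logic (no Polars calls), generalized to an arbitrary target list. The helper name `matches_exactly` and the three-element row are hypothetical additions, included to show why the length check matters.

```python
from typing import Optional, Sequence


def matches_exactly(row: Optional[Sequence[str]], target: Sequence[str]) -> bool:
    """Return True only when `row` equals `target` element by element."""
    if row is None:
        # A missing list never counts as a match.
        return False
    if len(row) != len(target):
        # Mirrors the length check in the mask: a longer or shorter list
        # must not match even if its first elements agree.
        return False
    # Mirrors the chained per-position comparisons.
    return all(value == expected for value, expected in zip(row, target))


# Sample rows modelled on the `Statut` column from the question.
# The last row is hypothetical and not taken from the original data.
rows = [
    ["Absent excusé", "Vu"],
    ["Vu", "Absent non excusé"],
    ["Vu", "Vu"],
    ["Vu", "Vu", "Vu"],
]

# Keep every row that is NOT exactly ["Vu", "Vu"].
kept = [row for row in rows if not matches_exactly(row, ["Vu", "Vu"])]
print(kept)
```

Only the exact `["Vu", "Vu"]` row is dropped; the three-element row survives because of the length check.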

@mcrumiller
Contributor

@stinodego that's a great solution, but I would say that's a workaround to the core issue which is that one cannot (AFAIK) supply a list as a literal.

I imagine this is something that has come up before, but my searching reveals nothing, so maybe this is indeed the first time. I don't know how Polars handles it internally. Python lists are not hashable (tuples are), and I doubt Polars would ever hash every list element in Python, so a list literal would probably need a lot of associated checks that would make it a pain to implement, although not prohibitively so.
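
For reference, hashability of the built-in containers can be checked directly in plain Python. This is a minimal sketch that says nothing about how Polars handles list values internally; the helper name `is_hashable` is hypothetical.

```python
def is_hashable(value: object) -> bool:
    """Report whether Python's built-in hash() accepts the value."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


print(is_hashable([1, 1]))  # list  -> False
print(is_hashable((1, 1)))  # tuple -> True
```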

@stinodego
Member

I'd say there are actually two issues here:

  • Cannot create a literal of nested types (neither list nor struct works at the moment)
  • Cannot compare nested types for equality. Although maybe we can use series_equal to achieve this?

@EmmanuelLM
Author

For posterity, I have found a way to do this filtering:
polar_statut_vu = polar_statut_vu.filter((pl.col("Statut").arr.get(0) != "Vu") & (pl.col("Statut").arr.get(1) != "Vu"))

@EmmanuelLM
Author

EmmanuelLM commented Feb 1, 2023

Actually, I spoke too soon (found a bug?), as the multiple-expression filter does not interpret & as one would expect, i.e.:

The & operator should behave as:
0 0 -> 0
0 1 -> 0
1 0 -> 0
1 1 -> 1

What the expression above does with & is:
0 0 -> 0
0 1 -> 1
1 0 -> 1
1 1 -> 1

(I have tried adding parentheses and it doesn't help: as soon as one expression matches, the row is filtered out.)
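
One possible reading of this behaviour, sketched in plain Python rather than Polars: the condition "not (first is Vu and second is Vu)" is equivalent by De Morgan's law to "(first is not Vu) or (second is not Vu)". Joining the two inequalities with AND instead keeps a row only when neither position is "Vu", which would match the observation that a single "Vu" is enough to drop the row. The sample rows are illustrative, and the sketch assumes the Polars `&` acts as an ordinary element-wise AND.

```python
from itertools import product

# De Morgan: not (a and b) == (not a) or (not b), for every input pair.
for a, b in product([False, True], repeat=2):
    assert (not (a and b)) == ((not a) or (not b))

# Illustrative two-element rows modelled on the `Statut` column.
rows = [
    ["Absent excusé", "Vu"],
    ["Vu", "Absent non excusé"],
    ["Vu", "Vu"],
    ["Absent excusé", "Absent excusé"],
]

# AND of the two inequalities: keeps a row only if NEITHER position is "Vu".
kept_and = [r for r in rows if (r[0] != "Vu") and (r[1] != "Vu")]

# OR of the two inequalities: drops a row only if BOTH positions are "Vu".
kept_or = [r for r in rows if (r[0] != "Vu") or (r[1] != "Vu")]

print(kept_and)
print(kept_or)
```

Under that reading, swapping `&` for `|` in the filter, or negating a combined equality mask, would exclude only the exact ["Vu", "Vu"] rows.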

@EmmanuelLM
Author

Looks like #6184 and #6311.

@stinodego
Member

I think the original question was answered. Please open a feature request if there is specific functionality you are still missing.
