[Python][C++] Issue with Filtering Using pc.equal and pc.count Expression in PyArrow #44961
Comments
Thanks for raising the issue, @rustyconover. Could you help me get a minimal reproducer of the issue? |
Counting values in a column is not actually a "scalar" expression (at least in the way we use that term, i.e. as an element-wise function that can be calculated for each element in the array independently). Counting is a kind of reduction, which requires considering the full array to get the result. |
As requested, here is a simple reproducer:
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds
# Example data
data = [["peanuts", "shellfish", "gluten"], ["dust", "pollen"], ["cats", "dogs", "feathers"], []]
# Create a PyArrow array
allergies_array = pa.array(data, type=pa.list_(pa.string()))
# Create a PyArrow table
table = pa.Table.from_arrays([allergies_array], names=["allergies"])
# Create a PyArrow dataset (in-memory)
dataset = ds.dataset(table)
# Filter for rows where the "allergies" column is empty
filter = pc.equal(pc.count(pc.field("allergies"), mode="all"), 0)
reader = dataset.scanner(filter=filter).to_table()
# This should be one row.
print(reader)
What the filter expression is trying to do is retrieve rows where the allergies column has zero entries. |
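To my previous comment, I think you might be misinterpreting what count is exactly doing: it is not a scalar function, i.e. it is not counting the values per element in the list array (so you can't use it to filter empty rows).
If you call the function on the actual data (so it executes eagerly and not through an expression):
In [29]: pc.count(allergies_array, mode="all")
Out[29]: <pyarrow.Int64Scalar: 4>
It counts that the full array has 4 elements (so kind of the length of the array, except that by default it only counts the non-null values). |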
Is there a scalar length function for arrays like there is for strings?
Rusty
…On Fri, Dec 13, 2024 at 01:21 Joris Van den Bossche < ***@***.***> wrote:
To my previous comment, I think you might misinterpret what count is
exactly doing, as it is not a scalar function, i.e. it is not counting the
values per element in the list array (so you can't use it to filter empty
*rows*).
If you call the function on the actual data (so it executes eagerly and
not through an expression);
In [28]: data = [["peanuts", "shellfish", "gluten"], ["dust", "pollen"], ["cats", "dogs", "feathers"], []]
...:
...: # Create a PyArrow array
...: allergies_array = pa.array(data, type=pa.list_(pa.string()))
In [29]: pc.count(allergies_array, mode="all")
Out[29]: <pyarrow.Int64Scalar: 4>
It counts that the full array has 4 elements (so kind of the length of the
array, except that by default it only counts the non-null values)
|
Maybe this?
(but it just checks the length, regardless of values being null or not) |
@jorisvandenbossche thank you, I guess I was looking for something with the semantics of SQL count(), which does ignore nulls. |
Describe the bug, including details regarding any error messages, version, and platform.
When using the pa.dataset.Expression with the following code:
This expression correctly evaluates as True when there are no elements in the allergies list column.
However, when attempting to use this expression to filter a dataset, the following error occurs:
Could you clarify why this filtering expression cannot be used? It appears to be a scalar expression, so I'm unsure why it results in this error. Any guidance would be greatly appreciated!
Component(s)
C++, Python