Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python][C++] Issue with Filtering Using pc.equal and pc.count Expression in PyArrow #44961

Closed
rustyconover opened this issue Dec 7, 2024 · 7 comments

Comments

@rustyconover
Copy link

Describe the bug, including details regarding any error messages, version, and platform.

When using the pa.dataset.Expression with the following code:

>>> pc.equal(pc.count(pc.field("allergies"), mode="all"), 0)
<pyarrow.compute.Expression (count(allergies, {mode=ALL}) == 0)>

This expression correctly evaluates as True when there are no elements in the allergies list column.

However, when attempting to use this expression to filter a dataset, the following error occurs:

pyarrow.lib.ArrowInvalid: ExecuteScalarExpression cannot Execute non-scalar expression (count(allergies, {mode=ALL}) == 0)

Could you clarify why this filtering expression cannot be used? It appears to be a scalar expression, so I'm unsure why it results in this error. Any guidance would be greatly appreciated!

Component(s)

C++, Python

@raulcd raulcd changed the title Issue with Filtering Using pc.equal and pc.count Expression in PyArrow [Python][C++] Issue with Filtering Using pc.equal and pc.count Expression in PyArrow Dec 9, 2024
@raulcd
Copy link
Member

raulcd commented Dec 9, 2024

Thanks for raising the issue @rustyconover . Could you help me get a minimal reproducer of the issue?
Create a simple ds column and execute the filter that raises the error?
I am trying to reproduce but I am struggling to get a minimal reproducer.
Thanks!

@jorisvandenbossche
Copy link
Member

Could you clarify why this filtering expression cannot be used? It appears to be a scalar expression,

Counting values in a column is not actually a "scalar" expression (at least in the way we use that term, i.e. as an element-wise function that can be calculated for each element in the array independently. Counting is a kind of reduction, and that requires to consider the full array to get the result)

@rustyconover
Copy link
Author

As requested here is a simple reproducer:

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

# Example data
data = [["peanuts", "shellfish", "gluten"], ["dust", "pollen"], ["cats", "dogs", "feathers"], []]

# Create a PyArrow array
allergies_array = pa.array(data, type=pa.list_(pa.string()))

# Create a PyArrow table
table = pa.Table.from_arrays([allergies_array], names=["allergies"])

# Create a PyArrow dataset (in-memory)
dataset = ds.dataset(table)

# Filter for rows where the "allergies" column is empty
filter = pc.equal(pc.count(pc.field("allergies"), mode="all"), 0)

reader = dataset.scanner(filter=filter).to_table()

# This should be one row.
print(reader)

What the filter expression does is trying to retry rows where the allergies column has zero entries.

@jorisvandenbossche
Copy link
Member

To my previous comment, I think you might misinterpret what count is exactly doing, as it is not a scalar function, i.e. it is not counting the values per element in the list array (so you can't use it to filter empty rows).

If you call the function on the actual data (so it executes eagerly and not through an expression);

In [28]: data = [["peanuts", "shellfish", "gluten"], ["dust", "pollen"], ["cats", "dogs", "feathers"], []]
    ...: 
    ...: # Create a PyArrow array
    ...: allergies_array = pa.array(data, type=pa.list_(pa.string()))

In [29]: pc.count(allergies_array, mode="all")
Out[29]: <pyarrow.Int64Scalar: 4>

It counts that the full array has 4 elements (so kind of the length of the array, except that by default it only counts the non-null values)

@rustyconover
Copy link
Author

rustyconover commented Dec 13, 2024 via email

@jorisvandenbossche
Copy link
Member

Maybe this?

In [21]: pc.list_value_length(allergies_array)
Out[21]: 
<pyarrow.lib.Int32Array object at 0x7fe753b6a920>
[
  3,
  2,
  3,
  0
]

(but it just checks the length, regardless of values being null or not)

@rustyconover
Copy link
Author

@jorisvandenbossche thank you, I guess I was looking for something with he semantic of SQL count() which does ignore nulls.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants