Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Python] pyarrow table group_by/aggregate results in multiple rows with the same group_by key #42231

Closed
FreekPaans opened this issue Jun 20, 2024 · 3 comments

Comments

@FreekPaans
Copy link

Describe the bug, including details regarding any error messages, version, and platform.

Originally posted here

I'm doing a simple group_by/aggregate on multiple keys, out of which one has null-values. This sometimes results in multiple result rows having the same values for the group_by keys, which I don't expect. Tested on pyarrow-16.1.0

Repro case:

import pyarrow as pa
def try_repro(size):
    repro = pa.table({"a": [0] * size,
                      "g": [None]*size},
                     schema=pa.schema([pa.field("a", "uint8"),
                                       pa.field("g", "date32")]))\
              .group_by(["a", "g"]).aggregate([([], "count_all")])

    if len(repro) != 1:
        print(f"{size} => {len(repro)}")
    return repro

for i in range(1,50):
    r = try_repro(i)

print()
print(r)

Output without AVX2 (expected):

$ ARROW_USER_SIMD_LEVEL=AVX python repro.py

pyarrow.Table
a: uint8
g: date32[day]
count_all: int64
----
a: [[0]]
g: [[null]]
count_all: [[49]]

Output with AVX2 (not expected):

$ ARROW_USER_SIMD_LEVEL=AVX2 python repro.py
33 => 2
...
40 => 2
41 => 3
...
48 => 3
49 => 4

pyarrow.Table
a: uint8
g: date32[day]
count_all: int64
----
a: [[0,0,0,0]]
g: [[null,null,null,null]]
count_all: [[32,8,8,1]]

Some observations:

  • Grouping on only g doesn't have the problem
  • Swapping the order a and g in the group_by also removes the issue.
  • Looks like this starts happening as soon as the size of the tables hits 33, and then we get an extra group for every 8 rows we add (so at 33, 41, 49)
  • Having g be an int does not exhibit the problem, a float does.
  • Non-null values don't have the issue
  • Macbook Pro M2 is also fine

Component(s)

Python

@amoeba amoeba changed the title pyarrow table group_by/aggregate results in multiple rows with the same group_by key [C++][Python] pyarrow table group_by/aggregate results in multiple rows with the same group_by key Jun 20, 2024
@amoeba
Copy link
Member

amoeba commented Jun 20, 2024

Thanks for filing this over here @FreekPaans. I can reproduce this on the latest wheel from PyPi (16.1.0) on my AVX2-capable Linux machine. However, on a source build from today, I can't reproduce it:

❯ git show --oneline --no-patch HEAD
a01fe038d (HEAD -> main, apache/main, apache/HEAD) GH-42130: [GLib] Fix building gir files with MSVC (#42131)
(venv)
~/src/apache/arrow on main •
❯ arrow ^C
(venv)
~/src/apache/arrow on main •
❯ python repro.py
17.0.0.dev370+ga01fe038d

pyarrow.Table
a: uint8
g: date32[day]
count_all: int64
----
a: [[0]]
g: [[null]]
count_all: [[49]]
(venv)

I'll go see if I can track down where this was fixed.

@amoeba
Copy link
Member

amoeba commented Jun 20, 2024

Using bisect, I was able to track down the fix to 5232137 which will be included in the 17.x release. I'm going to close this issue but feel free to re-open if needed. Thanks for the report.

@amoeba amoeba closed this as completed Jun 20, 2024
@FreekPaans
Copy link
Author

Great to hear it'll be fixed in the next release, thanks for looking into it!

For future and cross reference, it was fixed in this PR: #40997

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants