I'm doing a simple group_by/aggregate on multiple keys, one of which has null values. This sometimes produces multiple result rows with the same values for the group_by keys, which I don't expect. Tested on pyarrow 16.1.0.
Repro case:
import pyarrow as pa

def try_repro(size):
    repro = pa.table({"a": [0] * size,
                      "g": [None] * size},
                     schema=pa.schema([pa.field("a", "uint8"),
                                       pa.field("g", "date32")]))\
        .group_by(["a", "g"]).aggregate([([], "count_all")])
    if len(repro) != 1:
        print(f"{size} => {len(repro)}")
    return repro

for i in range(1, 50):
    r = try_repro(i)
print()
print(r)
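For comparison, the result expected from the repro can be stated without pyarrow at all: each distinct (a, g) pair should appear exactly once, with null treated as an ordinary key value. The following is a minimal pure-Python sketch of that expectation. The helper names `reference_group_count` and `duplicate_keys` are hypothetical and not part of pyarrow.

```python
from collections import Counter

def reference_group_count(a, g):
    """Count rows per distinct (a, g) pair; None counts as one ordinary key."""
    return Counter(zip(a, g))

def duplicate_keys(rows):
    """Return the (a, g) keys that occur more than once in (a, g, count) rows."""
    seen = Counter((a_val, g_val) for a_val, g_val, _ in rows)
    return [key for key, n in seen.items() if n > 1]

size = 49
expected = reference_group_count([0] * size, [None] * size)
print(dict(expected))  # {(0, None): 49}
```

To apply `duplicate_keys` to a pyarrow result, the rows could presumably be built by zipping the `to_pylist()` output of the `a`, `g` and `count_all` columns. A non-empty return value would indicate the behaviour reported here.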
amoeba changed the title from "pyarrow table group_by/aggregate results in multiple rows with the same group_by key" to "[C++][Python] pyarrow table group_by/aggregate results in multiple rows with the same group_by key" on Jun 20, 2024
Thanks for filing this over here @FreekPaans. I can reproduce this with the latest wheel from PyPI (16.1.0) on my AVX2-capable Linux machine. However, on a source build from today, I can't reproduce it:
❯ git show --oneline --no-patch HEAD
a01fe038d (HEAD -> main, apache/main, apache/HEAD) GH-42130: [GLib] Fix building gir files with MSVC (#42131)
(venv)
~/src/apache/arrow on main •
❯ arrow ^C
(venv)
~/src/apache/arrow on main •
❯ python repro.py
17.0.0.dev370+ga01fe038d
pyarrow.Table
a: uint8
g: date32[day]
count_all: int64
----
a: [[0]]
g: [[null]]
count_all: [[49]]
(venv)
I'll go see if I can track down where this was fixed.
Using bisect, I was able to track down the fix to 5232137 which will be included in the 17.x release. I'm going to close this issue but feel free to re-open if needed. Thanks for the report.
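Since the fix is said to ship with 17.x, a script that depends on correct group_by results could guard on the version string. This is only a rough sketch: `predates_fix` is a hypothetical helper, and it assumes that every build with major version 17 or later contains the fix. That assumption is wrong for 17.0.0.dev builds cut before the fixing commit, and the thread does not say which releases before 16.1.0 are affected.

```python
import re

def predates_fix(version):
    """Return True if the major version is below 17.

    Assumption (not confirmed in this thread): all builds with major
    version >= 17 include the fix.
    """
    match = re.match(r"(\d+)\.", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)) < 17

print(predates_fix("16.1.0"))                    # True
print(predates_fix("17.0.0.dev370+ga01fe038d"))  # False
```

In practice the argument would be `pyarrow.__version__`.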
Describe the bug, including details regarding any error messages, version, and platform.
Originally posted here
Output without AVX2 (expected):
Output with AVX2 (not expected):
Some observations:
- [...] g doesn't have the problem
- [...] a and g in the group_by also removes the issue.
- [...] g be an int does not exhibit the problem, a float does.
Component(s)
Python
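One way to compare the AVX2 and non-AVX2 outputs on a single machine might be to cap Arrow's SIMD level with the `ARROW_USER_SIMD_LEVEL` environment variable, which has to be set before pyarrow is loaded. The sketch below only builds the environment for a child process. `env_with_simd_level` is a hypothetical helper, and the thread does not confirm that capping the level changes the group_by result on 16.1.0.

```python
import os

def env_with_simd_level(level="SSE4_2", base=None):
    """Return a copy of the environment with Arrow's SIMD level capped.

    Assumption: `level` is one of the values Arrow documents for
    ARROW_USER_SIMD_LEVEL (NONE, SSE4_2, AVX, AVX2, AVX512).
    """
    env = dict(os.environ if base is None else base)
    env["ARROW_USER_SIMD_LEVEL"] = level
    return env

capped = env_with_simd_level("SSE4_2", base={"PATH": "/usr/bin"})
print(capped["ARROW_USER_SIMD_LEVEL"])  # SSE4_2
```

The resulting mapping could then be passed as `env=` to `subprocess.run([sys.executable, "repro.py"], ...)`, with `repro.py` holding the repro script, and the output compared against an uncapped run.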