[Python][C++?] Dataset "right anti" join gives incorrect result with large_string columns (11.0.0) #35354
Comments
I can confirm this with a unit test. I suspect we are overflowing something, but I don't see any ASan or UBSan violations, so it'll take a bit more digging.
This is also an issue with other join types, including […].

The end of the array should be […].
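For context, here is a minimal sweep over join types on a self-join, adapted from the reproduction below. Which join types are affected is exactly what this comment is probing, so the expected counts in the comments describe the correct behavior, not necessarily the observed one:

import pyarrow as pa
import pyarrow.dataset as ds

N = 1030  # just past the 1024-row threshold where the bug appears
strings = pa.array((str(i) for i in range(N)), type=pa.large_string())
dataset = ds.dataset(pa.table([strings], names=["a"]))

# For a self-join on a unique key, both anti joins should yield 0 rows
# and both semi joins should yield N rows.
for join_type in ["left anti", "right anti", "left semi", "right semi"]:
    result = dataset.join(dataset, keys=["a"], join_type=join_type).to_table()
    print(join_type, result.num_rows)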
This should be fixed. I ran both the Python and C++ cases. I think it's most likely fixed by #38147, as the related issues #38074 and #37729 have very similar symptoms. @AlenkaF I saw you verified #37729 a couple of days ago. Would you please help double-confirm this as well? Thanks.
I get an empty table as expected:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds  # needed for ds.dataset() below
>>> pa.__version__
'15.0.0.dev285+g32f13e893.d20231220'
>>> N = 1030
>>> strings = pa.array((str(i) for i in range(N)), type=pa.large_string())
>>> table = pa.table([strings], names=["a"])
>>> dataset = ds.dataset(table)
>>> result = dataset.join(dataset, keys=["a"], join_type="right anti")
>>> print(result.to_table())
pyarrow.Table
a: large_string
----
a: []

Closing this issue, thanks for the ping!
Describe the bug, including details regarding any error messages, version, and platform.
In this example, I'm doing an anti-join between a table and itself, which should result in an empty table. But once the table exceeds 1024 elements, the resulting table is non-empty:
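A minimal reproduction, mirroring the verification session in the comments above (N = 1030, a single large_string column named "a"):

import pyarrow as pa
import pyarrow.dataset as ds

N = 1030  # anything above 1024 elements triggers the bug
strings = pa.array((str(i) for i in range(N)), type=pa.large_string())
table = pa.table([strings], names=["a"])
dataset = ds.dataset(table)

# A right anti self-join on a unique key should produce an empty table.
result = dataset.join(dataset, keys=["a"], join_type="right anti")
print(result.to_table())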
Result: a non-empty table instead of the expected empty one.
The problem is also observed for large_binary, but not other types that I've tried, such as string. Interestingly, it also doesn't seem to be a problem for left anti (see the comparison sketch after the component list).

Component(s)
C++, Python
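For completeness, a sketch comparing the types mentioned above (string, large_string, large_binary, plus binary as an assumed control that the report did not test):

import pyarrow as pa
import pyarrow.dataset as ds

N = 1030
for typ in [pa.string(), pa.large_string(), pa.binary(), pa.large_binary()]:
    # Encode the same generated values as text or bytes to match the type.
    if typ in (pa.string(), pa.large_string()):
        values = [str(i) for i in range(N)]
    else:
        values = [str(i).encode() for i in range(N)]
    dataset = ds.dataset(pa.table([pa.array(values, type=typ)], names=["a"]))
    n = dataset.join(dataset, keys=["a"], join_type="right anti").to_table().num_rows
    print(typ, "empty" if n == 0 else f"NON-EMPTY ({n} rows, bug)")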