[Python][C++?] Dataset "right anti" join gives incorrect result with large_string columns (11.0.0) #35354

Closed
mattaubury opened this issue Apr 27, 2023 · 4 comments

@mattaubury

Describe the bug, including details regarding any error messages, version, and platform.

In this example, I'm doing an anti-join between a table and itself, which should produce an empty table. But once the table has more than 1024 rows, the result is non-empty:

import pyarrow as pa
import pyarrow.dataset as ds

N = 1030

strings = pa.array((str(i) for i in range(N)), type=pa.large_string())
table = pa.table([strings], names=["a"])
dataset = ds.dataset(table)

result = dataset.join(dataset, keys=["a"], join_type="right anti")

print(result.to_table())

Result:

pyarrow.Table
a: large_string
----
a: [["1024","1025","1026","1027","1028","1029"]]

The problem also occurs with large_binary, but not with the other types I've tried, such as string. Interestingly, it doesn't seem to affect left anti either.
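
For reference, the expected semantics can be sketched in plain Python. This is only an illustration of what a right anti join should return, not how Arrow implements it; the helper name `right_anti_join` is made up for this sketch.

```python
def right_anti_join(left_keys, right_keys):
    """Return the right-side keys that have no match on the left side."""
    left_set = set(left_keys)
    return [key for key in right_keys if key not in left_set]


N = 1030
keys = [str(i) for i in range(N)]

# Joining a column against itself: every right key has a match,
# so nothing should remain.
assert right_anti_join(keys, keys) == []

# For contrast, rows survive only when the left side lacks them.
print(right_anti_join(keys[:1024], keys))
# -> ['1024', '1025', '1026', '1027', '1028', '1029']
```

The second call happens to reproduce the reported output, which is consistent with only the first 1024 rows of one side being matched. That is an observation about the symptom, not a diagnosis of the cause.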

Component(s)

C++, Python

@westonpace
Member

I can confirm this with a unit test. I suspect we are overflowing something, but I don't see any ASan or UBSan violations, so it'll take a bit more digging:

TEST(HashJoin, LargeString) {
  LargeStringBuilder builder;
  ASSERT_OK(builder.Reserve(1030));
  for (int i = 0; i < 1030; i++) {
    ASSERT_OK(builder.Append(std::to_string(i)));
  }
  ASSERT_OK_AND_ASSIGN(auto arr, builder.Finish());
  ExecBatch batch({arr}, 1030);
  std::vector<ExecBatch> batches = {batch};

  Declaration left{"exec_batch_source", ExecBatchSourceNodeOptions(
                                            schema({field("x", large_utf8())}), batches)};
  Declaration right{
      "exec_batch_source",
      ExecBatchSourceNodeOptions(schema({field("x", large_utf8())}), batches)};

  HashJoinNodeOptions join_opts(JoinType::RIGHT_ANTI, {"x"}, {"x"});
  Declaration join{"hashjoin", {left, right}, join_opts};

  ASSERT_OK_AND_ASSIGN(std::shared_ptr<Table> result, DeclarationToTable(join));
  std::cout << result->ToString() << std::endl;
}

@mattaubury
Author

This is also an issue with other join types, including left semi, right semi, and inner. For example, in the Python code above, if join_type is set to "inner", we get truncated output:

pyarrow.Table
a: large_string
----
a: [["0","1","2","3","4",...,"1019","1020","1021","1022","1023"]]

The end of the array should be "1029".
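
A small sweep over the types and join types mentioned in this thread can make it easier to see which combinations are affected on a given build. This is a sketch, not part of the Arrow test suite: the helper names (`expected_rows`, `self_join_rows`, `sweep`) are made up here, and the expected counts assume a table with unique keys joined with itself. pyarrow is imported inside the helper so the expectation table can be read without it installed.

```python
JOIN_TYPES = ["left anti", "right anti", "left semi", "right semi", "inner"]
TYPE_NAMES = ["string", "large_string", "large_binary"]


def expected_rows(join_type, n):
    """Row count expected when a table of n unique keys is joined with itself."""
    if join_type in ("left anti", "right anti"):
        return 0  # every key has a match, so nothing is left over
    if join_type in ("left semi", "right semi", "inner"):
        return n  # every key matches exactly once
    raise ValueError(f"join type not covered by this sketch: {join_type!r}")


def self_join_rows(type_name, join_type, n=1030):
    """Run the self-join from the report and return the number of result rows."""
    import pyarrow as pa
    import pyarrow.dataset as ds

    arrow_type = getattr(pa, type_name)()  # e.g. pa.large_string()
    values = [str(i) for i in range(n)]
    if type_name.endswith("binary"):
        values = [v.encode() for v in values]
    table = pa.table([pa.array(values, type=arrow_type)], names=["a"])
    dataset = ds.dataset(table)
    result = dataset.join(dataset, keys=["a"], join_type=join_type)
    return result.to_table().num_rows


def sweep(n=1030):
    """Print actual vs. expected row counts for every type/join combination."""
    for type_name in TYPE_NAMES:
        for join_type in JOIN_TYPES:
            actual = self_join_rows(type_name, join_type, n)
            expected = expected_rows(join_type, n)
            status = "ok" if actual == expected else "MISMATCH"
            print(f"{type_name:12} {join_type:10} "
                  f"rows={actual} expected={expected} {status}")
```

Calling `sweep()` prints one line per combination; any line marked MISMATCH points at a type and join type worth reporting.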

@zanmato1984
Contributor

This should be fixed. I ran both the Python and C++ cases for right semi and inner on the latest dev branch, and the results are correct.

I think it was most likely fixed by #38147, as the related issues #38074 and #37729 have very similar symptoms.

@AlenkaF I saw you verified #37729 a couple of days ago. Would you please help confirm this one as well? Thanks.

@AlenkaF
Member

AlenkaF commented Dec 21, 2023

I get an empty table as expected:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> pa.__version__
'15.0.0.dev285+g32f13e893.d20231220'

>>> N = 1030

>>> strings = pa.array((str(i) for i in range(N)), type=pa.large_string())
>>> table = pa.table([strings], names=["a"])
>>> dataset = ds.dataset(table)

>>> result = dataset.join(dataset, keys=["a"], join_type="right anti")

>>> print(result.to_table())
pyarrow.Table
a: large_string
----
a: []

Closing this issue, thanks for the ping!

@AlenkaF AlenkaF closed this as completed Dec 21, 2023