GH-37655: [C++] Allow joins of large tables in Acero #37709
Conversation
[PR 35087](apache#35087) introduced an explicit fail in large joins with Acero when key data is larger than 4 GB (solving the problem reported by [issue 34474](apache#34474)). However, I think (but I'm not sure) that this quick fix is too restrictive, because the total size condition is applied to the total size of the tables to be joined, rather than to the size of the keys. As a consequence, Acero fails when trying to merge large tables, even when the size of the key data is well below 4 GB. This PR modifies the source code so that the logical test only verifies whether the total size of the _key_ data is below 4 GB.
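For readers skimming the thread, here is a minimal sketch of what the loosened condition amounts to; the helper name is hypothetical, and only the condition itself mirrors the one-line change to `RowArrayMerge::PrepareForMerge` shown in the diff further down.

```cpp
// Hypothetical helper illustrating the proposed scoping of the 4 GB guard:
// only *key* data larger than what a 32-bit offset can address is rejected,
// while non-key (payload) data of any size would pass through.
#include <cstdint>
#include <limits>

bool ExceedsRowOffsetLimit(uint64_t num_bytes, bool is_key_data) {
  return is_key_data && num_bytes > std::numeric_limits<uint32_t>::max();
}
```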
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
@oliviermeslin Thanks for this. This probably can't be tested efficiently in the test suite unfortunately, due to the test size required.
There seem to be some lint failures: https://github.com/apache/arrow/actions/runs/6470238300/job/17569780906?pr=37709#step:5:864
Remove trailing whitespace
@raulcd : I think I solved the problem (I forgot to remove a trailing whitespace). Can you please re-launch the checks?
@oliviermeslin I agree it's too big to test in any kind of reasonable unit test. Have you run any tests manually with payloads larger than 4GB and confirmed the results are correct?
I can't verify that joins with 4GiB or more of payload data will work correctly, I never tested for that. But I agree that the issue I fixed was specific to key and so, assuming we have manually done some testing of large payload data then it seems fine to loosen this restriction.
I've pushed a minor lint change; I do think that this should fix it. The GitHub UI did not allow me to propose the change to remove a trailing whitespace, which is why I just pushed it myself.
The only failing test now appears to be unrelated (s3). Conditional approval provided someone has at least done some basic manual testing to verify that this does actually work.
Thanks @westonpace ! I already prepared some tests (one using artificial data, the other one using real data). Unfortunately, I could not install Arrow built with my patch.
Hey @oliviermeslin, I ran your example from #37655 w/o this patch and got the expected error after the script printed "Doing the join with 9 variables":
I then built libarrow and the R package off your patch and actually got a segfault on a deref at arrow/cpp/src/arrow/acero/swiss_join.cc, line 604 (commit c5bce96).
Edit: Looks like it might be an off-by-one error from this output:
Full output below: Output with segfault
@amoeba : I could finally install arrow with my patch (on Linux/Ubuntu). I re-ran my patch using the
Line 605 of swiss_join.cc is part of the following loop, where the content of source rows is iteratively added to target rows:
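As a rough, simplified sketch (not the actual swiss_join.cc code; all names here are made up), the merge has this shape, which is why an under-sized target leads straight to a crash:

```cpp
// Simplified illustrative sketch of the merge step (hypothetical names):
// each source chunk is copied into the target at a running offset, so the
// target must have been allocated large enough up front.
#include <cstdint>
#include <cstring>
#include <vector>

struct RowChunk {
  std::vector<uint8_t> bytes;  // fixed-length row data for one source
};

void MergeInto(std::vector<uint8_t>& target, const std::vector<RowChunk>& sources) {
  size_t offset = 0;
  for (const RowChunk& source : sources) {
    // If `target` was sized from a truncated byte count, this memcpy
    // eventually writes past the end of the buffer (the observed segfault).
    std::memcpy(target.data() + offset, source.bytes.data(), source.bytes.size());
    offset += source.bytes.size();
  }
}
```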
As I am completely new to C++, I'm not sure what this bug means. My intuition is that we're trying to add too much to the target.
```diff
@@ -473,7 +474,7 @@ Status RowArrayMerge::PrepareForMerge(RowArray* target,
     (*first_target_row_id)[sources.size()] = num_rows;
   }

-  if (num_bytes > std::numeric_limits<uint32_t>::max()) {
+  if (is_key_data && num_bytes > std::numeric_limits<uint32_t>::max()) {
```
When non-key data greater than `std::numeric_limits<uint32_t>::max()` is allowed to bypass this check, `num_bytes` will be truncated to a much smaller value by the `static_cast<uint32_t>` in #486. Then the `target` won't allocate enough space, resulting in a segfault when copying data into it. The original check is necessary and there is unfortunately nothing to loosen.
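To make the failure mode concrete, here is a standalone illustration of the narrowing cast described above (assumed shape only; it is not the Arrow allocation code, and the numbers are arbitrary):

```cpp
// A byte count above 4 GiB silently wraps when cast to uint32_t, so any
// buffer sized from the cast value is far too small for the data that is
// later copied into it.
#include <cstdint>
#include <iostream>
#include <limits>

int main() {
  uint64_t num_bytes = 5ULL * 1024 * 1024 * 1024;          // ~5 GiB of row data
  uint32_t allocated = static_cast<uint32_t>(num_bytes);   // wraps to ~1 GiB
  std::cout << "requested: " << num_bytes << " bytes\n"
            << "allocated: " << allocated << " bytes\n";
  // Copying num_bytes into a buffer of `allocated` bytes overruns it,
  // which matches the segfault seen once the 4 GB check is bypassed.
  return 0;
}
```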
Hi @oliviermeslin @amoeba , I was trying to help with the issue you ran into. After looking a bit, I'd say it's unfortunately an inherent problem of the fix. In other words, the original issue just moves to a later place and surfaces in a less obvious fashion (the segfault). Please see my comment on the changed code about what is actually happening. Thanks. cc @westonpace
PR #43389 fixed the issue in a more thorough way, so I'm closing this one. Thanks.
Rationale for this change
In the current situation, joins with arrow fail when the tables to be joined are jointly larger than 4 GB, even if the key data is smaller than 4 GB. This is because the test on the size of key data is erroneously applied to the size of the tables to be joined.
What changes are included in this PR?
I slightly modify the C++ code so that the test on the size of the key data no longer applies to the size of the tables to be joined.
Are these changes tested?
Not so far.
Are there any user-facing changes?
No.