
GH-37655: [C++] Allow joins of large tables in Acero #37709

Closed
wants to merge 8 commits

Conversation


@oliviermeslin oliviermeslin commented Sep 13, 2023

[PR 35087](apache#35087) introduced an explicit failure in large joins with Acero when key data is larger than 4 GB (solving the problem reported by [issue 34474](apache#34474)). However, I think (though I am not sure) that this quick fix is too restrictive, because the size condition is applied to the total size of the tables to be joined rather than to the size of the keys. As a consequence, Acero fails when trying to merge large tables even when the size of the key data is well below 4 GB.

This PR modifies the source code so that the logical test only verifies whether the total size of the _key data_ is below 4 GB.

Rationale for this change

In the current situation, joins with arrow fail when the tables to be joined are jointly larger than 4 GB, even if the key data is smaller than 4 GB. This is because the test on the size of the key data is erroneously applied to the size of the tables to be joined.

What changes are included in this PR?

I slightly modify the C++ code so that the test on the size of the key data no longer applies to the total size of the tables to be joined, as sketched below.
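
For orientation, a minimal sketch of the shape of the change: is_key_data is the flag visible in the diff further down this conversation, and the error message is abbreviated from the one Acero actually raises.

// Before this PR: the 4 GB limit was enforced on all merged row data,
// key and payload alike.
if (num_bytes > std::numeric_limits<uint32_t>::max()) {
  return Status::Invalid("There are more than 2^32 bytes of key data");
}

// With this PR: the limit is only enforced when merging key data.
if (is_key_data && num_bytes > std::numeric_limits<uint32_t>::max()) {
  return Status::Invalid("There are more than 2^32 bytes of key data");
}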

Are these changes tested?

Not so far.

Are there any user-facing changes?

No.

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@oliviermeslin oliviermeslin changed the title Solve issue #37655 [C++]: Solve issue #37655 to allow Acero to merge large tables Sep 13, 2023
@oliviermeslin oliviermeslin changed the title [C++]: Solve issue #37655 to allow Acero to merge large tables GH-37655: [C++] Allow joins of large tables in Acero Sep 19, 2023
@github-actions
Copy link

⚠️ GitHub issue #37655 has been automatically assigned in GitHub to PR creator.

@pitrou
Member

pitrou commented Oct 10, 2023

@oliviermeslin Thanks for this. This probably can't be tested efficiently in the test suite unfortunately, due to the test size required.

Member

@raulcd raulcd left a comment


Remove trailing whitespace
@oliviermeslin
Author

@raulcd: I think I solved the problem (I had forgotten to remove some trailing whitespace). Can you please re-launch the checks?

@westonpace
Member

@oliviermeslin I agree it's too big to test in any kind of reasonable unit test. Have you run any tests manually with payloads larger than 4 GB and confirmed the results are correct?

Member

@westonpace westonpace left a comment


I can't verify that joins with 4 GiB or more of payload data will work correctly; I never tested for that. But I agree that the issue I fixed was specific to key data, so, assuming some manual testing of large payload data has been done, it seems fine to loosen this restriction.

@raulcd
Member

raulcd commented Oct 13, 2023

@raulcd: I think I solved the problem (I had forgotten to remove some trailing whitespace). Can you please re-launch the checks?

I've pushed a minor lint change; I think this should fix it. The GitHub UI did not allow me to propose the change to remove the trailing whitespace, which is why I pushed it myself.

Member

@westonpace westonpace left a comment


The only failing test now appears to be unrelated (S3). Conditional approval, provided someone has at least done some basic manual testing to verify that this actually works.

@oliviermeslin
Author

Thanks @westonpace! I have already prepared some tests (one using artificial data, the other using real data). Unfortunately, I could not install arrow as an R package from my own fork, because installing a development version of arrow seems to be quite complicated (see this issue). I'll try my best to find a solution and let you know the result.

@amoeba
Member

amoeba commented Oct 16, 2023

Hey @oliviermeslin, I ran your example from #37655 without this patch and got the expected error after the script printed "Doing the join with 9 variables":

! Invalid: There are more than 2^32 bytes of key data. Acero cannot process a join of this magnitude

I then built libarrow and the R package off your patch and actually got a segfault on the dereference of target_row_ptr:

*target_row_ptr++ = *source_row_ptr++;

Edit: Looks like it might be an off-by-one error from this output:

Error in tryCatchOne(expr, names, parentenv, handlers[[1L]]) :
R_Reprotect: only 41 protected items, can't reprotect index 42

Full output below:

Output with segfault
❯ Rscript acero-join-test.R
Welcome to R :)
This session's PID is 2890
Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.

Attaching package: ‘arrow’

The following object is masked from ‘package:utils’:

    timestamp


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

[1] "Doing the join with 2 variables"
[1] "id"        "variable1"
[1] "Doing the join with 3 variables"
[1] "id"        "variable1" "variable2"
[1] "Doing the join with 4 variables"
[1] "id"        "variable1" "variable2" "variable3"
[1] "Doing the join with 5 variables"
[1] "id"        "variable1" "variable2" "variable3" "variable4"
[1] "Doing the join with 6 variables"
[1] "id"        "variable1" "variable2" "variable3" "variable4" "variable5"
[1] "Doing the join with 7 variables"
[1] "id"        "variable1" "variable2" "variable3" "variable4" "variable5"
[7] "variable6"
[1] "Doing the join with 8 variables"
[1] "id"        "variable1" "variable2" "variable3" "variable4" "variable5"
[7] "variable6" "variable7"
[1] "Doing the join with 9 variables"
[1] "id"        "variable1" "variable2" "variable3" "variable4" "variable5"
[7] "variable6" "variable7" "variable8"

 *** caught segfault ***
address 0x7ed0457b20, cause 'invalid permissions'

Traceback:
 1: Table__from_ExecPlanReader(self)
 2: x$read_table()
 3: as_arrow_table.RecordBatchReader(reader)
 4: as_arrow_table(reader)
 5: as_arrow_table.arrow_dplyr_query(x)
 6: as_arrow_table(x)
 7: doTryCatch(return(expr), name, parentenv, handler)
 8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 9: tryCatchList(expr, classes, parentenv, handlers)
10: tryCatch(as_arrow_table(x), error = function(e, call = caller_env(n = 4)) {    augment_io_error_msg(e, call, schema = schema())})
11: compute.arrow_dplyr_query(left_join(data, select(data, all_of(vars_temp)),     by = c(id = "id")))
12: compute(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id")))
13: FUN(X[[i]], ...)
14: lapply(1:nb_var, function(n) {    print(paste0("Doing the join with ", n + 1, " variables"))    vars_temp <- c("id", vars[1:n])    print(vars_temp)    data_out <- compute(left_join(data, select(data, all_of(vars_temp)),         by = c(id = "id")))    return("Success!")})
An irrecoverable exception occurred. R is aborting now ...

[Identical segfaults and tracebacks from the other crashed worker threads, interleaved in the raw output, are omitted here.]
fish: Job 1, 'Rscript acero-join-test.R' terminated by signal SIGSEGV (Address boundary error)

@oliviermeslin
Author

oliviermeslin commented Oct 16, 2023

@amoeba: I could finally install arrow with my patch (on Linux/Ubuntu). I re-ran my test under the gdb debugger and got the same error as you. The debugger output is the following:

Thread 125 "R" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffcdd7fa640 (LWP 20471)]
arrow::acero::RowArrayMerge::CopyVaryingLength (target=<optimized out>, source=..., first_target_row_id=<optimized out>, first_target_row_offset=<optimized out>, source_rows_permutation=0x7ffcc8052020) at /home/onyxia/work/arrow/cpp/src/arrow/acero/swiss_join.cc:605
605             *target_row_ptr++ = *source_row_ptr++;

Line 605 of swiss_join.cc is part of the following loop, where the content of source rows is copied into target rows, one 64-bit word at a time:

// Copy the row's variable-length data in 64-bit words.
for (uint32_t word = 0; word < length / sizeof(uint64_t); ++word) {
  *target_row_ptr++ = *source_row_ptr++;
}

As I am completely new to C++, I'm not sure what this bug means. My intuition is that we are advancing target_row_ptr past the end of the buffer it points into. Would you or @westonpace have any clue about what is going wrong?

@@ -473,7 +474,7 @@ Status RowArrayMerge::PrepareForMerge(RowArray* target,
     (*first_target_row_id)[sources.size()] = num_rows;
   }

-  if (num_bytes > std::numeric_limits<uint32_t>::max()) {
+  if (is_key_data && num_bytes > std::numeric_limits<uint32_t>::max()) {
Contributor

@zanmato1984 zanmato1984 Jul 19, 2024


When non-key data larger than std::numeric_limits<uint32_t>::max() bypasses this check, num_bytes wraps around to a much smaller value in the static_cast<uint32_t> in #486. The target then does not allocate enough space, resulting in a segfault when copying data into it. The original check is necessary, and there is unfortunately nothing to loosen.
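
For illustration, a minimal standalone sketch (not Arrow code; the 5 GiB figure is an arbitrary example) of the narrowing described above:

#include <cstdint>
#include <iostream>

int main() {
  // Hypothetical total size of non-key row data: 5 GiB, above UINT32_MAX.
  int64_t num_bytes = 5LL * 1024 * 1024 * 1024;

  // With the 4 GiB check bypassed for non-key data, the 64-bit byte count
  // is later narrowed to 32 bits and silently wraps around.
  uint32_t allocated = static_cast<uint32_t>(num_bytes);

  std::cout << "requested bytes: " << num_bytes << "\n";   // 5368709120
  std::cout << "allocated bytes: " << allocated << "\n";   // 1073741824

  // The merge then copies the full num_bytes of row data into a buffer
  // sized from the wrapped value, runs past its end, and segfaults.
  return 0;
}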

@zanmato1984
Contributor

zanmato1984 commented Jul 19, 2024

Hi @oliviermeslin @amoeba, I was trying to help with the issue you hit. After looking a bit, I'd say it's unfortunately an inherent problem of the fix: the original issue just moves to a later place and explodes in a more implicit fashion (the segfault).

Please see my comment on the changed code about what is actually happening. Thanks.

cc @westonpace

@zanmato1984
Contributor

PR #43389 fixed the issue in a more thorough way, so I'm closing this one. Thanks.


Successfully merging this pull request may close these issues.

[C++] Acero cannot join large tables because of a misspecified test