GH-43040: [C++] Reduce the recursion of many-join test #43042
@github-actions crossbow submit -g cpp
Revision: 7251176 Submitted crossbow builds: ursacomputing/crossbow @ actions-5714627e25
@github-actions crossbow submit -g cpp
Revision: 3e6acd8 Submitted crossbow builds: ursacomputing/crossbow @ actions-d143fbd7c0
Ran with join recursion = 16.
Ran with join recursion = 72.
@github-actions crossbow submit -g cpp
Revision: 6ccaa6c Submitted crossbow builds: ursacomputing/crossbow @ actions-368846a25e
Ran with join recursion = 16 again.
Hi @pitrou @felipecrv, could you help take a look? This will fix two long-failing jobs.
@@ -3220,7 +3220,7 @@ TEST(HashJoin, ManyJoins) {
   // stack), which is essentially the recursive usage of the temp vector stack.

   // A fair number of joins to guarantee temp vector stack overflow before GH-41335.
-  const int num_joins = 64;
+  const int num_joins = 16;
To make sure this conservative value serves the same protection purpose, I've verified locally that, with commit 6c386da reverted, the test fails (with "temp stack overflow") at 16 joins (the minimum number of joins needed to fail is actually 14).
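As a rough illustration of why any join count above the failure threshold serves as a regression guard, here is a toy model of a fixed-capacity temp stack used recursively. It is only a sketch: `ToyTempStack`, `RunNestedJoins`, `MinJoinsToOverflow`, and all sizes are hypothetical and are not Arrow's `TempVectorStack` or its real numbers.

```cpp
#include <cstddef>

// Toy model only (hypothetical, NOT Arrow's TempVectorStack): a bump
// allocator with a fixed capacity that reports overflow instead of growing.
class ToyTempStack {
 public:
  explicit ToyTempStack(std::size_t capacity) : capacity_(capacity) {}

  // Reserve `bytes`; returns false if the reservation would overflow.
  bool Push(std::size_t bytes) {
    if (bytes > capacity_ - used_) return false;
    used_ += bytes;
    return true;
  }

  // Release the most recent reservation of `bytes`.
  void Pop(std::size_t bytes) { used_ -= bytes; }

 private:
  std::size_t capacity_;
  std::size_t used_ = 0;
};

// Each nested join holds its reservation while the next join runs, so the
// peak usage grows linearly with the number of joins.
bool RunNestedJoins(ToyTempStack& stack, int num_joins, std::size_t bytes_per_join) {
  if (num_joins == 0) return true;
  if (!stack.Push(bytes_per_join)) return false;  // "temp stack overflow"
  const bool ok = RunNestedJoins(stack, num_joins - 1, bytes_per_join);
  stack.Pop(bytes_per_join);
  return ok;
}

// Smallest join count that overflows (assumes bytes_per_join > 0).
int MinJoinsToOverflow(std::size_t capacity, std::size_t bytes_per_join) {
  for (int n = 1;; ++n) {
    ToyTempStack stack(capacity);
    if (!RunNestedJoins(stack, n, bytes_per_join)) return n;
  }
}
```

In this model, every join count at or above `MinJoinsToOverflow(...)` fails in the same way, so a test only needs a count comfortably above that threshold, not the largest count the platform can tolerate.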
Can you condition the reduction on the specific platforms that can't handle num_joins=64? That would ensure that possible bugs with a high number of joins are still caught in regression tests.
Yeah, that's a nice idea. It's just that the condition could be very tricky to identify. So far I've seen the following outcomes with the number of joins set to 64:
- Ubuntu w/ or w/o ASAN (the CI jobs): all good.
- macOS w/ ASAN: stack overflow; macOS w/o ASAN: good.
- Alpine and Emscripten w/o ASAN (the CI jobs): segfault or out-of-bounds memory access (presumably caused by stack overflow as well).
And I can't find macros to differentiate Linux distributions such as Alpine and Ubuntu. To have at least one build run 64 joins, it seems the only safe condition is to enable 64 joins on Linux w/ ASAN, but that only works because our only sanitizer build is on Ubuntu.
Any suggestions?
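For reference, the "Linux w/ ASAN" condition described above could be sketched roughly as follows. This is not what the PR does (it uses 16 everywhere): the helper name `ManyJoinsCount` and the `SKETCH_HAS_ASAN` macro are hypothetical, and the ASAN detection assumes the GCC (`__SANITIZE_ADDRESS__`) and Clang (`__has_feature(address_sanitizer)`) conventions.

```cpp
// Hypothetical sketch, not part of Arrow: pick the join count per build.

// Detect AddressSanitizer. GCC defines __SANITIZE_ADDRESS__; Clang exposes
// __has_feature(address_sanitizer).
#if defined(__SANITIZE_ADDRESS__)
#define SKETCH_HAS_ASAN 1
#elif defined(__has_feature)
#if __has_feature(address_sanitizer)
#define SKETCH_HAS_ASAN 1
#endif
#endif
#ifndef SKETCH_HAS_ASAN
#define SKETCH_HAS_ASAN 0
#endif

constexpr int ManyJoinsCount() {
  // Assumption from the discussion: the only combination known to be safe
  // with 64 joins and distinguishable by macros is Linux with ASAN. No
  // predefined macro separates Ubuntu from Alpine, so an ASAN build on
  // Alpine would also get 64 and could still overflow the program stack.
#if defined(__linux__) && !defined(__EMSCRIPTEN__) && SKETCH_HAS_ASAN
  return 64;
#else
  return 16;
#endif
}
```

The test body would then read `const int num_joins = ManyJoinsCount();`. The fragility of this condition is the reason the thread settles on a plain 16.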
I think we can stick with 16 if it's enough to reproduce the issue.
> I think we can stick with 16 if it's enough to reproduce the issue.

What do you mean by "enough to repro the issue"? Reducing to 16 is making the issue "go away".
I think the "issue" here means this:

> by reverting commit 6c386da, the test failed (with "temp stack overflow") with 16 joins

In other words, 16 joins serves the purpose that this test was originally designed to cover.
Ahh. Now I see it.
I will let @pitrou approve and merge this one.
LGTM, thanks for fixing this @zanmato1984
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 2a8fa3e. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
…43042)

### Rationale for this change

The current recursion 64 in many-join test is too aggressive so stack (the C program stack) overflow may happen on alpine or emscripten causing issues like apache#43040.

### What changes are included in this PR?

Reduce the recursion to 16, which is strong enough for the purpose of apache#41335 which introduced this test.

### Are these changes tested?

Change is test.

### Are there any user-facing changes?

None.

* GitHub Issue: apache#43040

Authored-by: Ruoxi Sun
Signed-off-by: Antoine Pitrou
Rationale for this change
The current recursion depth of 64 in the many-join test is too aggressive: the C program stack may overflow on Alpine or Emscripten, causing issues like #43040.
What changes are included in this PR?
Reduce the recursion to 16, which is still strong enough for the purpose of #41335, which introduced this test.
Are these changes tested?
The change itself is to a test.
Are there any user-facing changes?
None.