GH-43046: [C++] Fix avx2 gather rows more than 2^31 issue in CompareColumnsToRows #43065

Merged Jul 10, 2024 (3 commits)

Conversation

zanmato1984 (Contributor) commented Jun 26, 2024:

Rationale for this change

See #43046.

What changes are included in this PR?

Use the unsigned-offset-safe gather introduced in #42188, which fixed similar issues.

Are these changes tested?

Yes.

Are there any user-facing changes?

None.


// Compare columns to rows of row ids larger than 2^31 within a row table.
// Certain AVX2 instructions may behave unexpectedly, causing trouble such as GH-43046.
TEST(KeyCompare, CompareColumnsToRowsMany) {
zanmato1984 (Contributor, author):

This case takes about 3 min to run: it has to construct a row table with more than 2^31 rows, which is necessary to reproduce the issue. Is it proper to add it here? Or is there somewhere more suitable for long-running tests? @pitrou please help?

zanmato1984 (Contributor, author):

> Or somewhere else that is more suitable for long-running tests?

For example, a compiler flag that is only turned on for time-consuming tests, like ARROW_BUILD_DETAILED_BENCHMARKS for benchmarks?

Member:

We don't have such a flag, no. Hopefully we can shrink the test execution time?

Member:

Also, how much RAM does this require?

zanmato1984 (Contributor, author):

> Also, how much RAM does this require?

It takes 24+ GB of virtual memory.

zanmato1984 (Contributor, author):

> We don't have such a flag, no. Hopefully we can shrink the test execution time?

I can try to reduce the execution time by using more memory, though I'm not sure how much that will help. And we don't want to introduce more sophisticated speedups like parallelism, do we?

Member:

24+ GB of memory is impractical for most CI configurations (which are typically based on small VMs) and also for many local dev environments. So I think we'll unfortunately have to disable this test, perhaps using the LARGE_MEMORY_TEST annotation (you can grep through the sources to look for it).

zanmato1984 (Contributor, author):

OK, that's unfortunate but seems to be the only option. Shall I stop my (hopeless) attempts to speed up the test, then?

Member:

If speeding up the test requires consuming even more memory, then it doesn't sound like a good idea indeed.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 26, 2024
zanmato1984 (Contributor, author):

Some CI failures seem related. Looking into it; converting to draft until I figure it out.

@zanmato1984 zanmato1984 marked this pull request as draft June 27, 2024 03:12
zanmato1984 (Contributor, author):

> Some CI failures seem related. Looking into it; converting to draft until I figure it out.

No big deal, I changed one more line by mistake:
c1eb593 (#43065)

Fixed. Re-ready for review.

@zanmato1984 zanmato1984 marked this pull request as ready for review June 27, 2024 06:01
zanmato1984 (Contributor, author) commented Jun 27, 2024:

My investigation into how much RAM this test uses disclosed something new: the test was a false pass. There are uint32_t offset overflows happening (I'll open an issue for this later). But since the number of rows and the size of row_table_right are multiples of columns_left, the overflowed offsets wrap around to the beginning of the row table, so the comparison actually happens between the first num_rows_batch rows of the row table and the batch. Simply reducing the batch's row count by 1 makes the offset overflow bad enough to fail the test:

-  constexpr int64_t num_rows_batch = std::numeric_limits<uint16_t>::max() + 1ll;
+  constexpr int64_t num_rows_batch = std::numeric_limits<uint16_t>::max();

The further impact on this PR is that it is technically impossible to construct a valid test case reproducing the issue this PR aims to solve. Reproducing it requires all of the following conditions:

  1. A fixed length column exists;
  2. A var length column exists;
  3. The number of rows is greater than 2^31.

No matter how small the fixed-length and var-length columns' contents are, each row contains a 32-bit "offset" (for the var-length column). More than 2^31 rows therefore take 2^31 * 4 B > 4 GB of row data, so the offset within the row must overflow. In other words, one cannot construct a row table of more than 2^31 rows WITHOUT overflowing the row offset.

We can say that the issue this PR addresses won't actually happen today, because something even worse (the overflow) must have happened much earlier. But I still want to have this fix in (though I'll have to remove the test), because the fix is simple, safe, without overhead, and can prevent bad things from happening in the future (maybe when someday we improve the row table to allow valid bigger-than-2^31 rows?).

What's your opinion? @pitrou Thanks.

pitrou (Member) commented Jun 27, 2024:

Two things:

  1. Yes, I agree we should fix the bug even though we can't write a unit test to avoid regressions
  2. Are you saying the current fix is not enough, or am I misunderstanding you?

zanmato1984 (Contributor, author):
> Are you saying the current fix is not enough, or am I misunderstanding you?

No, the current fix is entirely sufficient to solve the "gather data from offsets for rows >= 2^31" issue, whenever that issue can actually occur.

What I was saying is that an "overflow" issue will always happen well before the gather issue can occur. Once we fix the overflow issue (I'm going to introduce an explicit error), the gather issue won't happen either, because the run will be interrupted by that explicit error. But, quoting myself:

> and can prevent bad things from happening in the future (maybe when someday we improve the row table to allow valid bigger-than-2^31 rows?).

zanmato1984 (Contributor, author) commented Jun 28, 2024:

The test has been removed from this PR. But it will come back, with minor modifications and a slightly different coverage intention, once this TODO is mitigated:
https://github.com/apache/arrow/pull/43065/files#diff-26ce8f4fb451b09705a8cadde87fbcc588d0cfe05a91b5677fbda95af215218cR328

@@ -325,6 +325,7 @@ Status RowTableImpl::AppendSelectionFrom(const RowTableImpl& from,
// Varying-length rows
auto from_offsets = reinterpret_cast<const uint32_t*>(from.offsets_->data());
auto to_offsets = reinterpret_cast<uint32_t*>(offsets_->mutable_data());
// TODO: The following two variables are possibly overflowing.
zanmato1984 (Contributor, author) commented Jun 28, 2024:

This is where the "overflow" mentioned in my other comments happens. I'll file another issue/PR for it.

Member:

Did you file the other issue? If so, can you update the TODO here to the proper GH issue number? Thanks!

zanmato1984 (Contributor, author):

No, I was planning to file it afterwards and refer to this TODO from it.

zanmato1984 (Contributor, author):

@pitrou Shall we move on with this? Thanks.

pitrou (Member) left a review:

LGTM except one change to make

pitrou (Member) commented Jul 9, 2024:

@zanmato1984 Sorry for the delay, I was away sick

pitrou (Member) commented Jul 9, 2024:

@zanmato1984 Also, can you rebase?

zanmato1984 (Contributor, author):

> @zanmato1984 Sorry for the delay, I was away sick

You have my best wishes. Take care and get better, my friend.

zanmato1984 (Contributor, author):

> @zanmato1984 Also, can you rebase?

Rebase done.

Also, if you think it's better to file the other issue ahead of this PR and have the TODO pointing to it, I'll do that. Thank you!

pitrou (Member) commented Jul 9, 2024:

> Also, if you think it's better to file the other issue ahead of this PR and have the TODO pointing to it, I'll do that. Thank you!

Yes, that would be nice. Thanks!

zanmato1984 (Contributor, author):

> > Also, if you think it's better to file the other issue ahead of this PR and have the TODO pointing to it, I'll do that. Thank you!
>
> Yes, that would be nice. Thanks!

#43202 filed and assigned to the TODO in the PR.

@pitrou pitrou merged commit 47b2fbd into apache:main Jul 10, 2024
39 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Jul 10, 2024
@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Jul 10, 2024

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 47b2fbd.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.
