GH-41813: [C++] Fix avx2 gather offset larger than 2GB in `CompareColumnsToRows` #42188

zanmato1984 · 2024-06-17T17:44:31Z

Rationale for this change

The AVX2 intrinsics _mm256_i32gather_epi32/_mm256_i32gather_epi64 are used in the CompareColumnsToRows API, and they treat the vindex as a signed integer. In our row table implementation, we use uint32_t to represent the offset within the row table. When an offset is 0x80000000 (2GB) or larger, these intrinsics treat it as a negative offset and gather the data from an undesired address. For more details, please see #41813 (comment).

Considering there are no unsigned-32bit-offset or 64bit-offset counterparts of those intrinsics in AVX2, this issue can be mitigated simply by translating the base address and the offset:

new_base = base + 0x80000000;
new_offset = offset - 0x80000000;
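
The translation can be checked with plain integer arithmetic. The sketch below is a scalar model, not the Arrow code: it assumes each gather lane computes `base + sign_extend(vindex)` with a scale of 1, and the helper names (`AsSigned32`, `NaiveLaneAddress`, `TranslatedLaneAddress`) are hypothetical. Addresses are modeled as `uint64_t` values, so nothing is dereferenced.

```cpp
#include <cstdint>

constexpr std::uint64_t kTwoGB = 0x80000000ull;

// Interpret 32 raw bits as a signed two's-complement value, which is how the
// gather is assumed to read each vindex lane in this model.
constexpr std::int64_t AsSigned32(std::uint32_t bits) {
  return bits < kTwoGB
             ? static_cast<std::int64_t>(bits)
             : static_cast<std::int64_t>(bits) - (std::int64_t{1} << 32);
}

// Address a lane would read if the uint32_t offset were passed unchanged.
constexpr std::uint64_t NaiveLaneAddress(std::uint64_t base, std::uint32_t offset) {
  return base + static_cast<std::uint64_t>(AsSigned32(offset));
}

// Address a lane would read after moving the base up and the offset down by 2GB.
// The 32-bit subtraction wraps modulo 2^32, like a per-lane integer subtract.
constexpr std::uint64_t TranslatedLaneAddress(std::uint64_t base, std::uint32_t offset) {
  return NaiveLaneAddress(base + kTwoGB, offset - static_cast<std::uint32_t>(kTwoGB));
}

constexpr std::uint64_t kBase = 0x100000000ull;  // arbitrary model address

// Below 2GB the unchanged offset works.
static_assert(NaiveLaneAddress(kBase, 0x10u) == kBase + 0x10u, "");
// At or above 2GB the unchanged offset lands 4GB too low.
static_assert(NaiveLaneAddress(kBase, 0x90000000u) ==
                  kBase + 0x90000000ull - (1ull << 32), "");
// The translated pair lands on base + offset in both cases.
static_assert(TranslatedLaneAddress(kBase, 0x10u) == kBase + 0x10u, "");
static_assert(TranslatedLaneAddress(kBase, 0x90000000u) == kBase + 0x90000000ull, "");
```

In this model the translated offset always falls in [-2G, 2G), so its signed interpretation equals `offset - 0x80000000` exactly, and adding it to the shifted base recovers `base + offset`.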

What changes are included in this PR?

Fix and UT that reproduces the issue.

Are these changes tested?

UT included.

Are there any user-facing changes?

None.

GitHub Issue: [R] Segfault when collecting parquet dataset query results #41813

github-actions · 2024-06-17T17:44:58Z

⚠️ GitHub issue #41813 has been automatically assigned in GitHub to PR creator.

zanmato1984 · 2024-06-17T17:47:46Z

cc @pitrou @amoeba @mrd0ll4r

cpp/src/arrow/compute/row/compare_test.cc

amoeba · 2024-06-17T23:36:40Z

Hi @zanmato1984, thanks for your work on this. I'm hoping others can review the implementation but I did just check that the new test passes (it does) and also fixes the original issue (it does). 👍

zanmato1984 · 2024-06-18T00:35:19Z

> Hi @zanmato1984, thanks for your work on this. I'm hoping others can review the implementation but I did just check that the new test passes (it does) and also fixes the original issue (it does). 👍

Thank you @amoeba for verifying, and the help on reproducing the issue!

mrd0ll4r · 2024-06-19T18:59:34Z

I can't give much feedback or test this out, unfortunately. But I'm very thankful for you all looking into this!

FreekPaans · 2024-06-20T19:21:32Z

Ran into this issue when I was debugging my own issue where running a group_by/aggregate on a table with null columns was failing to group some keys, i.e. some group key value tuples were duplicated in the result.

Why I mention it here:

My investigation also pointed to a problem with CompareColumnsToRows, but I don't fully understand the code.
Disabling AVX2 (i.e. `ARROW_USER_SIMD_LEVEL=AVX`) solved the issue

However, the table I use is only 3.8MB - so perhaps there is some other bug around the AVX2 related code here as well, unrelated to the size, but related to nulls.

~~Can unfortunately not share the table but if I figure a repro case I will drop it here.~~

Repro case:

import pyarrow as pa
def try_repro(size):
    repro = pa.table({"a": [0] * size,
                      "g": [None]*size},
                     schema=pa.schema([pa.field("a", "uint8"),
                                       pa.field("g", "date32")]))\
              .group_by(["a", "g"]).aggregate([([], "count_all")])

    if len(repro) != 1:
        print(f"{size} => {len(repro)}")
    return repro

for i in range(1,50):
    r = try_repro(i)

print()
print(r)

Output without AVX2 (expected):

$ ARROW_USER_SIMD_LEVEL=AVX python repro.py

pyarrow.Table
a: uint8
g: date32[day]
count_all: int64
----
a: [[0]]
g: [[null]]
count_all: [[49]]

Output with AVX2 (not expected):

$ ARROW_USER_SIMD_LEVEL=AVX2 python repro.py
33 => 2
...
40 => 2
41 => 3
...
48 => 3
49 => 4

pyarrow.Table
a: uint8
g: date32[day]
count_all: int64
----
a: [[0,0,0,0]]
g: [[null,null,null,null]]
count_all: [[32,8,8,1]]

Some observations:

Grouping on only g doesn't have the problem
Swapping the order of a and g in the group_by also removes the issue.
Looks like this starts happening as soon as the size of the tables hits 33, and then we get an extra group for every 8 rows we add (so at 33, 41, 49)
Having g be an int does not exhibit the problem, a float does.
Non-null values don't have the issue
Macbook Pro M2 is also fine

Let me know if you think I should open a new ticket for this.

amoeba · 2024-06-20T21:13:24Z

Hi @FreekPaans, can you please open a new issue for that? I think the issue will be fixed in the upcoming 17.x PyArrow release but it'd be good to make sure.

FreekPaans · 2024-06-20T21:26:47Z

@amoeba sure thing #42231

Any issue/PR you can point me to for where it's fixed?

amoeba · 2024-06-20T21:33:16Z

Thanks. I'll test and follow up over on #42231.

zanmato1984 · 2024-06-21T06:56:32Z

@pitrou @felipecrv @ZhangHuiGui @mapleFU Would you please help to take a look? Thanks.

zanmato1984 · 2024-06-24T12:48:58Z

cpp/src/arrow/compute/row/compare_internal_avx2.cc

@@ -236,6 +236,8 @@ uint32_t KeyCompare::CompareBinaryColumnToRowHelper_avx2(
        irow_right =
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(left_to_right_map) + i);
      }
+      // TODO: Need to test if this gather is OK when irow_right is larger than


I'll test in the future.

When you say "in the future", is it in this PR or another one?

Oh sorry, I meant in another PR.

pitrou · 2024-06-24T13:11:00Z

cpp/src/arrow/compute/row/compare_internal_avx2.cc

@@ -236,6 +236,8 @@ uint32_t KeyCompare::CompareBinaryColumnToRowHelper_avx2(
        irow_right =
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(left_to_right_map) + i);
      }
+      // TODO: Need to test if this gather is OK when irow_right is larger than


When you say "in the future", is it in this PR or another one?

pitrou · 2024-06-24T13:11:56Z

cpp/src/arrow/compute/row/compare_internal_avx2.cc

+/// `0x80000000` lower. This way, the offset is always in range of [-2G, 2G) and those
+/// intrinsics are safe.
+
+constexpr auto two_gb = 0x80000000ull;


Can we make sure we use an explicit width type here? I'm not even sure what it is expected to be for correctness of the code using this constant (uint32_t or uint64_t?)

Both uint32_t and uint64_t are OK. It only has to be unsigned and wide enough for 0x80000000. I'm declaring it uint64_t (the ull suffix) just so that all the arithmetic is promoted to 64 bits and we don't have to worry about potential underflow. The two subsequent usages are:

Being added to the pointer base after being divided by a specific sizeof(). The division is unsigned, so the addition moves the base "forward", as expected.

Being loaded into a signed __m256i register via an implicit conversion (after being divided by scale).

I'll update to make it, and its usages, more type- and width-explicit.
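
To make the width discussion concrete, here is a small compile-time sketch. It assumes a platform with a 32-bit `int` (the first `static_assert` fails otherwise), and the helpers `ElementsInTwoGB` and `IndexShift` are hypothetical names for the two usages described above, not functions from the PR. It uses `uint32_t` for the index shift instead of the PR's `static_cast<int>`, so that the value stays representable.

```cpp
#include <cstdint>
#include <type_traits>

// Without a suffix, a hex literal takes the first type that can hold it; with a
// 32-bit int, 0x80000000 does not fit in int and becomes unsigned int.
static_assert(std::is_same<decltype(0x80000000), unsigned int>::value,
              "this sketch assumes a 32-bit int");
// The ull suffix forces unsigned long long, which is at least 64 bits wide.
static_assert(std::is_same<decltype(0x80000000ull), unsigned long long>::value, "");

// Explicit-width spelling of the constant.
constexpr std::uint64_t kTwoGB = 0x80000000ull;

// Usage 1: number of elements a typed base pointer advances to move 2GB forward.
template <typename T>
constexpr std::uint64_t ElementsInTwoGB() {
  return kTwoGB / sizeof(T);
}

// Usage 2: amount subtracted from each 32-bit index for a given byte scale.
template <std::uint32_t kScale>
constexpr std::uint32_t IndexShift() {
  return static_cast<std::uint32_t>(kTwoGB / kScale);
}

static_assert(ElementsInTwoGB<std::uint64_t>() == 0x10000000ull, "");
static_assert(ElementsInTwoGB<std::uint32_t>() == 0x20000000ull, "");
static_assert(IndexShift<1>() == 0x80000000u, "");
static_assert(IndexShift<4>() == 0x20000000u, "");
```

Both quantities are computed in unsigned 64-bit arithmetic and only narrowed at the end, which is the property the `ull` suffix was meant to guarantee.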

pitrou · 2024-06-24T13:13:58Z

cpp/src/arrow/compute/row/compare_internal_avx2.cc

+}
+
+template <int scale>
+inline __m256i UnsignedOffsetSafeGather64(arrow::util::int64_for_gather_t const* base,


What is the use of int64_for_gather_t exactly?

It's just copied from the existing usage:
https://github.com/apache/arrow/pull/42188/files#diff-4dcfb2dd9add2d770dcbd924498120ab30e38186dd4bdc60708485c31074ce67L368

pitrou · 2024-06-24T13:13:59Z

cpp/src/arrow/compute/row/compare_internal_avx2.cc

+
+constexpr auto two_gb = 0x80000000ull;
+
+template <int scale>


Two things:

if we're using unsigned arithmetic below, the scale type should probably be unsigned for readability and sanity?

naming convention: can we make this kScale?

~~The type of the third formal parameter of _mm256_set1_epi32/64 is int so I'm just using int too.~~ Yeah, that's probably good.

Yeah, will do.

pitrou · 2024-06-24T13:14:50Z

cpp/src/arrow/compute/row/compare_internal_avx2.cc

@@ -251,6 +253,35 @@ uint32_t KeyCompare::CompareBinaryColumnToRowHelper_avx2(
  }
 }

+namespace {
+
+/// Intrinsics `_mm256_i32gather_epi32/64` treat the `vindex` as signed integer, and we


Can you use regular comments (//)? This isn't a docstring so shouldn't use the docstring-specific prefix (///)

Yeah, will do.

pitrou · 2024-06-24T13:16:40Z

cpp/src/arrow/compute/row/compare_test.cc

+  // number of rows.
+  constexpr int64_t num_rows = std::numeric_limits<uint16_t>::max() + 1;
+  const std::vector<std::shared_ptr<DataType>> fixed_length_types{uint64(), uint32()};
+  // The var length column should be a little smaller than 2GB to WAR the capacity


Sorry, I meant "workaround". Will update.

pitrou · 2024-06-24T13:18:31Z

cpp/src/arrow/compute/row/compare_test.cc

@@ -164,5 +166,128 @@ TEST(KeyCompare, CompareColumnsToRowsTempStackUsage) {
  }
 }

+// Compare columns to rows at offsets over 2GB within a row table.
+// Certain AVX2 instructions may behave unexpectedly causing troubles like GH-41813.
+TEST(KeyCompare, CompareColumnsToRowsLarge) {


What is the runtime of this test? Perhaps we need to disable it on Valgrind builds.

What do you mean by "runtime"? I can't think of a reason why Valgrind would complain (at least ASAN didn't).

Sorry, I meant "run time" or execution time :-)

Ah, got it! It takes about 20s with ASAN enabled. Perhaps it will be fine with Valgrind too?

I should take a quick look.

Ok, it takes 70s locally under Valgrind. That's a bit high for a single test, I would rather disable it under Valgrind.

OK. Updated to disable the test under Valgrind. Thanks for helping to run it locally!

pitrou · 2024-06-24T13:19:13Z

cpp/src/arrow/compute/row/compare_test.cc

+  ASSERT_OK(row_encoder.EncodeSelected(&row_table, static_cast<uint32_t>(num_rows),
+                                       row_ids_right.data()));
+
+  ASSERT_TRUE(row_table.offsets());


I'm not sure what that's supposed to check (offsets being "true"?). Do we want to make the test a bit more self-documenting, or perhaps add a comment?

This is asserting the address of row_table.offsets() is not null, like if (some_pointer). Perhaps I can refine it to ASSERT_NE(row_table.offsets(), NULLPTR).

And the point of this check is to make sure the row_table constructed has an internal offset buffer, i.e., it contains var length columns.

Yes, the ASSERT_NE suggestion would make this more easily understandable, thanks!

ZhangHuiGui · 2024-06-25T01:44:46Z

cpp/src/arrow/compute/row/compare_internal_avx2.cc

+      base + kTwoGB / sizeof(arrow::util::int64_for_gather_t);
+  __m128i normalized_offset =
+      _mm_sub_epi32(offset, _mm_set1_epi32(static_cast<int>(kTwoGB / kScale)));
+  return _mm256_i32gather_epi64(normalized_base, normalized_offset,


I have a question about instructions.

Why is the vindex parameter type of _mm256_i32gather_epi32 __m256i, while the vindex type of _mm256_i32gather_epi64 is __m128i?

This may not be related to the PR, I just want to understand it🫡

Both intrinsics gather "several" integers based on a base address and "several" 32b offsets (vindex), and store the results in a 256b register. The difference is that _mm256_i32gather_epi32 gathers 8 32b integers (8 * 32 = 256) at a time, so 8 32b indices are used, hence the 256b vindex, whereas _mm256_i32gather_epi64 gathers 4 64b integers at a time, so 4 32b indices are used, hence the 128b vindex.

I see. Thanks!
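
The lane counting in the answer above can be written down as a short sketch. This is a scalar model only: `Lanes`, `IndexVectorBits`, `Indices4`, `Result4` and `ScalarGather64` are hypothetical names, and the stand-in gather indexes by element instead of reproducing the intrinsic's byte scale.

```cpp
#include <cstddef>
#include <cstdint>

// Both gathers fill a 256-bit result and take 32-bit indices.
constexpr std::size_t kResultBits = 256;
constexpr std::size_t kIndexBits = 32;

// One index is needed per gathered element.
constexpr std::size_t Lanes(std::size_t element_bits) {
  return kResultBits / element_bits;
}
constexpr std::size_t IndexVectorBits(std::size_t element_bits) {
  return Lanes(element_bits) * kIndexBits;
}

static_assert(Lanes(32) == 8 && IndexVectorBits(32) == 256, "32b gather: 256b vindex");
static_assert(Lanes(64) == 4 && IndexVectorBits(64) == 128, "64b gather: 128b vindex");

// Scalar stand-in for the 64-bit gather: four 32-bit indices in (128 bits),
// four 64-bit values out (256 bits).
struct Indices4 { std::int32_t v[4]; };
struct Result4 { std::int64_t v[4]; };

constexpr Result4 ScalarGather64(const std::int64_t* base, Indices4 vindex) {
  Result4 out{};
  for (std::size_t lane = 0; lane < 4; ++lane) {
    out.v[lane] = base[vindex.v[lane]];  // signed index, element-sized stride
  }
  return out;
}

constexpr std::int64_t kData[] = {10, 20, 30, 40, 50};
static_assert(ScalarGather64(kData, Indices4{{4, 0, 2, 1}}).v[0] == 50, "");
static_assert(sizeof(Indices4) * 8 == 128 && sizeof(Result4) * 8 == 256, "");
```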

zanmato1984 · 2024-06-25T06:43:56Z

I've committed two changes containing code restructuring (moving, renaming, etc.) and a minor fix to the test, to make the test logic clearer and more readable. Hope it doesn't trouble your review, @pitrou. Thanks.

pitrou · 2024-06-25T14:26:29Z

Thanks a lot @zanmato1984 !

conbench-apache-arrow · 2024-06-25T21:38:33Z

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit e635cc2.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 7 possible false positives for unstable benchmarks that are known to sometimes produce them.

…ColumnsToRows` (#43065)

### Rationale for this change

See #43046.

### What changes are included in this PR?

Use unsigned offset safe gather introduced in #42188 which is to fix similar issues.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

None.

* GitHub Issue: #43046

Lead-authored-by: Ruoxi Sun <[email protected]>
Co-authored-by: Rossi Sun <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

github-actions bot added Component: C++ awaiting review Awaiting review labels Jun 17, 2024

zanmato1984 added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Jun 17, 2024

zanmato1984 added 11 commits June 18, 2024 01:58

Repro WIP

8528e00

Repro done

c01227c

UT

42781f4

Finish UT

135f295

Change fix length to uint32

2b43288

Fix

771ad49

Done

d998c45

Revert

dcb1306

Add comment

91b72fe

Format

474cd56

Fix lint

53f6d73

zanmato1984 force-pushed the fix-41813 branch from c96540d to 53f6d73 Compare June 17, 2024 17:58

Add todo

4266961

zanmato1984 commented Jun 17, 2024

github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 17, 2024

Fix warning on windows

7d56722

FreekPaans mentioned this pull request Jun 20, 2024

[C++][Python] pyarrow table group_by/aggregate results in multiple rows with the same group_by key #42231

Closed

zanmato1984 added 4 commits June 24, 2024 20:28

Fix scale for avx2 safe gather

0452684

No more todo - its a mis-use

0fc496d

Format

0ea8674

Add a new todo

f9c88bb

zanmato1984 commented Jun 24, 2024

zanmato1984 added 3 commits June 24, 2024 21:02

Support more scales for safe gather

e8a407c

Comment

a72dc25

Fix

6b085c7

pitrou reviewed Jun 24, 2024

zanmato1984 added 4 commits June 24, 2024 23:57

Address comments

7ed9f0f

Disable test for valgrind

9862d03

Pure readability refine

3dcdd9f

Fix a case that should have used selection but did not

956869f

ZhangHuiGui reviewed Jun 25, 2024

pitrou approved these changes Jun 25, 2024

pitrou merged commit e635cc2 into apache:main Jun 25, 2024
35 of 39 checks passed

pitrou removed the awaiting committer review Awaiting committer review label Jun 25, 2024

pitrou mentioned this pull request Jun 25, 2024

[R] Segfault when collecting parquet dataset query results #41813

Closed

github-actions bot added the awaiting committer review Awaiting committer review label Jun 25, 2024

zanmato1984 mentioned this pull request Jun 26, 2024

GH-43046: [C++] Fix avx2 gather rows more than 2^31 issue in CompareColumnsToRows #43065

Merged

zanmato1984 mentioned this pull request Jul 31, 2024

[C++][Compute] Consider widening the row offset of the row table to 64-bit #43495

Closed
