GH-44052: [C++][Compute] Reduce the complexity of row segmenter #44053
Conversation
The benchmark numbers, run on my Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz, show a maximum 50x speedup.

Before:

----------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------
BenchmarkRowSegmenter/Rows:32768/Segments:1/SegmentKeys:0 24934 ns 24906 ns 27871 bytes_per_second=9.80236Gi/s items_per_second=1.31565G/s
BenchmarkRowSegmenter/Rows:32768/Segments:4/SegmentKeys:0 24729 ns 24687 ns 27358 bytes_per_second=9.88957Gi/s items_per_second=1.32736G/s
BenchmarkRowSegmenter/Rows:32768/Segments:16/SegmentKeys:0 24630 ns 24616 ns 28100 bytes_per_second=9.91805Gi/s items_per_second=1.33118G/s
BenchmarkRowSegmenter/Rows:32768/Segments:64/SegmentKeys:0 24633 ns 24620 ns 28112 bytes_per_second=9.91645Gi/s items_per_second=1.33096G/s
BenchmarkRowSegmenter/Rows:32768/Segments:256/SegmentKeys:0 24679 ns 24656 ns 28651 bytes_per_second=9.90192Gi/s items_per_second=1.32901G/s
BenchmarkRowSegmenter/Rows:32768/Segments:1/SegmentKeys:1 199781 ns 199339 ns 3606 bytes_per_second=2.4495Gi/s items_per_second=164.383M/s
BenchmarkRowSegmenter/Rows:32768/Segments:4/SegmentKeys:1 222783 ns 220778 ns 2905 bytes_per_second=2.21164Gi/s items_per_second=148.421M/s
BenchmarkRowSegmenter/Rows:32768/Segments:16/SegmentKeys:1 254637 ns 254545 ns 2703 bytes_per_second=1.91825Gi/s items_per_second=128.732M/s
BenchmarkRowSegmenter/Rows:32768/Segments:64/SegmentKeys:1 452659 ns 452356 ns 1550 bytes_per_second=1.07942Gi/s items_per_second=72.4385M/s
BenchmarkRowSegmenter/Rows:32768/Segments:256/SegmentKeys:1 1241185 ns 1240526 ns 572 bytes_per_second=403.055Mi/s items_per_second=26.4146M/s
BenchmarkRowSegmenter/Rows:32768/Segments:1/SegmentKeys:2 351964 ns 351202 ns 1983 bytes_per_second=1.39031Gi/s items_per_second=93.3024M/s
BenchmarkRowSegmenter/Rows:32768/Segments:4/SegmentKeys:2 1076041 ns 1010477 ns 727 bytes_per_second=494.816Mi/s items_per_second=32.4282M/s
BenchmarkRowSegmenter/Rows:32768/Segments:16/SegmentKeys:2 4587149 ns 4584191 ns 152 bytes_per_second=109.071Mi/s items_per_second=7.14804M/s
BenchmarkRowSegmenter/Rows:32768/Segments:64/SegmentKeys:2 25132123 ns 25118536 ns 28 bytes_per_second=19.9056Mi/s items_per_second=1.30453M/s
BenchmarkRowSegmenter/Rows:32768/Segments:256/SegmentKeys:2 102196767 ns 102086286 ns 7 bytes_per_second=4.89782Mi/s items_per_second=320.983k/s
BenchmarkRowSegmenter/Rows:32768/Segments:1/SegmentKeys:3 424279 ns 423836 ns 1619 bytes_per_second=1.15205Gi/s items_per_second=77.3129M/s
BenchmarkRowSegmenter/Rows:32768/Segments:4/SegmentKeys:3 1180743 ns 1157215 ns 610 bytes_per_second=432.072Mi/s items_per_second=28.3163M/s
BenchmarkRowSegmenter/Rows:32768/Segments:16/SegmentKeys:3 5738171 ns 5729356 ns 118 bytes_per_second=87.2698Mi/s items_per_second=5.71932M/s
BenchmarkRowSegmenter/Rows:32768/Segments:64/SegmentKeys:3 30695547 ns 30656783 ns 23 bytes_per_second=16.3096Mi/s items_per_second=1.06887M/s
BenchmarkRowSegmenter/Rows:32768/Segments:256/SegmentKeys:3 125980210 ns 125906333 ns 6 bytes_per_second=3.97121Mi/s items_per_second=260.257k/s

After:

----------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------
BenchmarkRowSegmenter/Rows:32768/Segments:1/SegmentKeys:0 24717 ns 24709 ns 28098 bytes_per_second=9.8808Gi/s items_per_second=1.32618G/s
BenchmarkRowSegmenter/Rows:32768/Segments:4/SegmentKeys:0 24283 ns 24275 ns 28709 bytes_per_second=10.0574Gi/s items_per_second=1.34989G/s
BenchmarkRowSegmenter/Rows:32768/Segments:16/SegmentKeys:0 24426 ns 24413 ns 28328 bytes_per_second=10.0006Gi/s items_per_second=1.34226G/s
BenchmarkRowSegmenter/Rows:32768/Segments:64/SegmentKeys:0 24331 ns 24320 ns 28073 bytes_per_second=10.0385Gi/s items_per_second=1.34735G/s
BenchmarkRowSegmenter/Rows:32768/Segments:256/SegmentKeys:0 26190 ns 25537 ns 28471 bytes_per_second=9.56034Gi/s items_per_second=1.28317G/s
BenchmarkRowSegmenter/Rows:32768/Segments:1/SegmentKeys:1 197105 ns 196813 ns 3587 bytes_per_second=2.48093Gi/s items_per_second=166.493M/s
BenchmarkRowSegmenter/Rows:32768/Segments:4/SegmentKeys:1 207379 ns 207241 ns 3384 bytes_per_second=2.3561Gi/s items_per_second=158.115M/s
BenchmarkRowSegmenter/Rows:32768/Segments:16/SegmentKeys:1 254707 ns 254565 ns 2777 bytes_per_second=1.9181Gi/s items_per_second=128.721M/s
BenchmarkRowSegmenter/Rows:32768/Segments:64/SegmentKeys:1 439539 ns 439415 ns 1589 bytes_per_second=1.11121Gi/s items_per_second=74.5719M/s
BenchmarkRowSegmenter/Rows:32768/Segments:256/SegmentKeys:1 1225127 ns 1224517 ns 576 bytes_per_second=408.324Mi/s items_per_second=26.7599M/s
BenchmarkRowSegmenter/Rows:32768/Segments:1/SegmentKeys:2 355268 ns 355082 ns 1985 bytes_per_second=1.37512Gi/s items_per_second=92.283M/s
BenchmarkRowSegmenter/Rows:32768/Segments:4/SegmentKeys:2 410702 ns 410486 ns 1703 bytes_per_second=1.18952Gi/s items_per_second=79.8273M/s
BenchmarkRowSegmenter/Rows:32768/Segments:16/SegmentKeys:2 667262 ns 667087 ns 1035 bytes_per_second=749.527Mi/s items_per_second=49.121M/s
BenchmarkRowSegmenter/Rows:32768/Segments:64/SegmentKeys:2 1087405 ns 1086757 ns 651 bytes_per_second=460.084Mi/s items_per_second=30.1521M/s
BenchmarkRowSegmenter/Rows:32768/Segments:256/SegmentKeys:2 2133928 ns 2132308 ns 331 bytes_per_second=234.488Mi/s items_per_second=15.3674M/s
BenchmarkRowSegmenter/Rows:32768/Segments:1/SegmentKeys:3 437444 ns 437048 ns 1592 bytes_per_second=1.11722Gi/s items_per_second=74.9757M/s
BenchmarkRowSegmenter/Rows:32768/Segments:4/SegmentKeys:3 505824 ns 505355 ns 1385 bytes_per_second=989.403Mi/s items_per_second=64.8415M/s
BenchmarkRowSegmenter/Rows:32768/Segments:16/SegmentKeys:3 756347 ns 755999 ns 906 bytes_per_second=661.377Mi/s items_per_second=43.344M/s
BenchmarkRowSegmenter/Rows:32768/Segments:64/SegmentKeys:3 1294590 ns 1293676 ns 546 bytes_per_second=386.496Mi/s items_per_second=25.3294M/s
BenchmarkRowSegmenter/Rows:32768/Segments:256/SegmentKeys:3 2552917 ns 2551420 ns 274 bytes_per_second=195.969Mi/s items_per_second=12.843M/s
Hi friends, @pitrou @felipecrv @mapleFU , would you like to take a look? Thanks.
ARROW_ASSIGN_OR_RAISE(auto segment_exec_batch, batch.SelectValues(ids));
ExecSpan segment_batch(segment_exec_batch);

while (true) {
This is the only call site in non-testing code.
}
// Assert next is the last (empty) segment.
The new API generates fewer segments than before - there is no meaningless (length == 0) trailing segment.
@@ -629,61 +622,47 @@ TEST(RowSegmenter, Basics) {
auto batch2 = ExecBatchFromJSON(types2, "[[1, 1], [1, 2], [2, 2]]");
auto batch1 = ExecBatchFromJSON(types1, "[[1], [1], [2]]");
ExecBatch batch0({}, 3);
{
The new API doesn't accept an offset argument, so there is no need for such testing.
ExecSpan span0(batch0);
TestSegments(segmenter, span0, {{0, 3, true, true}, {3, 0, true, true}});
Here and below, the changes are just because the new API doesn't generate the zero-length ending segment.
@@ -54,13 +55,8 @@ using group_id_t = std::remove_const<decltype(kNoGroupId)>::type;
using GroupIdType = CTypeTraits<group_id_t>::ArrowType;
auto g_group_id_type = std::make_shared<GroupIdType>();

inline const uint8_t* GetValuesAsBytes(const ArraySpan& data, int64_t offset = 0) {
Moved w/o touching the code inside.
@@ -102,21 +109,6 @@ Segment MakeSegment(int64_t batch_length, int64_t offset, int64_t length, bool e
return Segment{offset, length, offset + length >= batch_length, extends};
}

// Used by SimpleKeySegmenter::GetNextSegment to find the match-length of a value within a
// fixed-width buffer
int64_t GetMatchLength(const uint8_t* match_bytes, int64_t match_width,
Moved w/o touching the code inside.
@@ -147,21 +152,15 @@ struct SimpleKeySegmenter : public BaseRowSegmenter {
save_key_data_(static_cast<size_t>(key_type_.type->byte_width())),
extend_was_called_(false) {}

Status CheckType(const DataType& type) {
Moved w/o touching the code inside.
@@ -97,7 +97,12 @@ class ARROW_EXPORT RowSegmenter {
virtual Status Reset() = 0;

/// \brief Get the next segment for the given batch starting from the given offset
/// DEPRECATED: Due to its inefficiency, use GetSegments instead.
ARROW_DEPRECATED("Deprecated in 18.0.0. Use GetSegments instead.")
Deprecate instead of removing it because I assume this is a public API (since grouper.h is a public header).
// Runs the grouper on a single row. This is used to determine the group id of the
// first row of a new segment to see if it extends the previous segment.
template <typename Batch>
Result<group_id_t> MapGroupIdAt(const Batch& batch, int64_t offset) {
Moved w/o touching the code inside.
template <typename... Args>
static void BenchmarkRowSegmenter(benchmark::State& state, Args&&...) {
Can we turn this into a more realistic GroupBy benchmark? For example:
- M=0 to 1 non-segmented keys
- N=1 to 2 segmented keys
- K=key cardinality (2,8,64)
Sure, will do. And the name should include GroupBy for sure.
Updated.
ArrayVector segments(num_segments);
for (int i = 0; i < num_segments; ++i) {
  ASSERT_OK_AND_ASSIGN(
      segments[i],
      Constant(std::make_shared<Int64Scalar>(i))->Generate(num_rows / num_segments));
}
// Concat all segments to form the segment key.
ASSERT_OK_AND_ASSIGN(auto segment_key, Concatenate(segments));
// num_segment_keys copies of the segment key.
ArrayVector segment_keys(num_segment_keys, segment_key);
Less trivially, you can, for each segmented key, 1) generate a random Int64 array of the right cardinality (using the min and max values) 2) sort it to make it segmented. Then the multi-column segments will not trivially map to the individual key segments.
Then the multi-column segments will not trivially map to the individual key segments.
Sorry, I don't follow this part :(
Could you elaborate a bit? Thank you.
I mean that you have the exact same segment lengths in each segment key column. By using random generation the patterns would be slightly different.
Ah, you mean introducing some randomness into the individual segment lengths. That sounds good given that this particular benchmark needs to be more realistic.
Will do. Thank you for elaborating.
Done.
std::vector<std::vector<int64_t>> row_segmenter_args = {
    {32 * 1024}, benchmark::CreateRange(1, 256, 4), benchmark::CreateDenseRange(0, 3, 1)};

BENCHMARK(BenchmarkRowSegmenter)
This is a GroupBy benchmark, so should probably have "GroupBy" in its name like other benchmarks?
Done.
BenchmarkGroupBy(state, {{"count", ""}}, {arg}, /*keys=*/{}, segment_keys);

state.SetItemsProcessed(num_rows * state.iterations());
Why don't we call SetItemsProcessed in BenchmarkGroupBy instead? It should probably be relevant for the other GroupBy benchmarks.
I guess they should. Moved into BenchmarkGroupBy. The other benchmarks sharing it will have items_per_second metrics now.
using arrow::Concatenate;
using arrow::ConstantArrayGenerator;
These look unnecessary?
Removed.
cpp/src/arrow/compute/row/grouper.cc
}

ARROW_DEPRECATED("Deprecated in 18.0.0. Use GetSegments instead.")
It's already in the declaration, is it useful to duplicate the deprecation here?
I guess not so useful. Removed.
I'm curious, is the […]
If you mean the […], I hacked the code and got the benchmark number.
I see, thanks. Are segment keys used often? If so, it might be worth improving […]
Not in traditional RDBMSes, but they could be essential in time-series scenarios. (I believe I saw the original author mention that somewhere, but I can't find it at the moment.) I can help with enhancing the simple key case anyway.
That said, it's quite low priority.
Co-authored-by: Antoine Pitrou <[email protected]>
auto segment_key = rng.Int64(num_rows, /*min=*/0, /*max=*/num_segments - 1);
int64_t* values = segment_key->data()->GetMutableValues<int64_t>(1);
std::sort(values, values + num_rows);
// num_segment_keys copies of the segment key.
ArrayVector segment_keys(num_segment_keys, segment_key);
This is not much better than before, is it? I would expect something like:
ArrayVector segment_keys(num_segment_keys);
for (auto& segment_key : segment_keys) {
  segment_key = rng.Int64(num_rows, /*min=*/0, /*max=*/num_segments - 1);
  int64_t* values = segment_key->data()->GetMutableValues<int64_t>(1);
  std::sort(values, values + num_rows);
}
Ah, I see your point. It's only that I want the number of segments to be exactly as specified. Combining independently random keys, even with the same distribution, will make the number of segments (potentially, much) bigger.
But that's also much more realistic, isn't it?
That said, as you prefer.
But that's also much more realistic, isn't it?
I think having an exact number of segments would help more with diagnosing performance issues than introducing yet another level of randomness, so I'd rather keep it as is :)
}
BENCHMARK(CountGroupByIntsSegmentedByInts)
    ->ArgNames({"Keys", "SegmentKeys", "Segments"})
    ->ArgsProduct({{1, 2}, {0, 1, 2}, benchmark::CreateRange(1, 256, 8)});
I've tried to use 1+ segment keys and 0 non-segment keys to get an idea of the purely-segmented performance, but I get:
Invalid: The provided function (hash_count) is a hash aggregate function. Since there are no keys to group by, a scalar aggregate function was expected (normally these do not start with hash_)
It's a bit of a bummer, isn't it?
As my other comment explained, an aggregation without any group-by keys (a.k.a. non-segment keys here) is a scalar aggregation and requires "count" rather than "hash_count".
This is a "group by aggregation" benchmark (the counterpart of the previous "CountScalar" one), so there has to be at least one group-by key.
}

template <typename... Args>
static void CountScalarSegmentedByInts(benchmark::State& state, Args&&...) {
So "CountScalar" is actually the case where Keys=0 but there are segment keys?
This is quite misleading.
Please let me explain a bit.
Though both are named "aggregation", most compute engines have two variants of them depending on whether there are "group by" keys or not (take SQL select count(*) from t group by c and select count(*) from t for instance). Acero calls them "scalar aggregation" and "group by aggregation", and they work with the aggregation functions "count/sum/..." and "hash_count/hash_sum/..." respectively. This is understandable because without a group-by key, the aggregation just needs to hold one "scalar" value (e.g., the current count/sum/...) during the whole computation, whereas a group-by key immediately requires structures like a hash table.
Segment keys, on the other hand, work orthogonally to group-by keys and apply to both scalar and group-by aggregations.
Back to your question: yes, "CountScalar" implies exactly that this is a "scalar aggregation" (i.e., without any group-by keys) on a "count" function, and "SegmentedByInts" implies that there are potential segment keys.
Hope this clears things up a bit.
But select count(*) from t is not what this is doing. It's still doing a "group by", it's just not using a hash table for it.
You are right, what this benchmark does is not simply a select count(*) from t. That was merely to explain the difference between a scalar agg and a group-by agg. But that doesn't mean this benchmark is a group by - at least not in Acero.
Though I'm not entirely sure how a segmented non-group-by agg maps to a SQL basis, in Acero, specifying segment keys doesn't actually require a group-by's existence. One just specifies segment keys and group-by keys independently within AggregateNodeOptions:
arrow/cpp/src/arrow/acero/options.h, lines 332 to 348 in 9576a41:
class ARROW_ACERO_EXPORT AggregateNodeOptions : public ExecNodeOptions {
 public:
  /// \brief create an instance from values
  explicit AggregateNodeOptions(std::vector<Aggregate> aggregates,
                                std::vector<FieldRef> keys = {},
                                std::vector<FieldRef> segment_keys = {})
      : aggregates(std::move(aggregates)),
        keys(std::move(keys)),
        segment_keys(std::move(segment_keys)) {}
  // aggregations which will be applied to the targeted fields
  std::vector<Aggregate> aggregates;
  // keys by which aggregations will be grouped (optional)
  std::vector<FieldRef> keys;
  // keys by which aggregations will be segmented (optional)
  std::vector<FieldRef> segment_keys;
};
So I would let the naming reflect the underlying implementation (scalar agg vs. group by agg).
(I understand the argument could be: a segment key still implies group by semantic. I agree. But that's a more end-to-end perspective and doesn't reflect how we handle the segment key).
int64_t num_keys = state.range(0);
ArrayVector keys(num_keys);
for (auto& key : keys) {
  key = rng.Int64(num_rows, /*min=*/0, /*max=*/64);
It's a bit weird that this doesn't use num_segments. For comparison purposes, I would expect a similar cardinality in segmented and non-segmented keys.
Generally the cardinalities of a segment key and a group-by key are relatively independent, so I chose not to reuse the segment key's cardinality for the group-by key as well. If we want to make the group-by cardinality variable, we can use another independent parameter.
But my concern is that the performance of segment keys and that of group-by keys are also independent; that is why this benchmark uses a fixed group-by key cardinality. What do you think?
The point here is to compare performance of segmented vs. non-segmented keys, right?
In these two particular benchmarks, they are to show how the number of segment keys and the number of segments perform with a predefined group by setup. In other words, the group by portion of the benchmark is merely to make the benchmark more realistic. They are not designed for the comparison you suggested.
IIUC, an apples-to-apples comparison might be: a segmented scalar agg (N segment keys and 0 group-by keys) vs. a non-segmented group-by agg (0 segment keys and N group-by keys), with the same key distribution. In this comparison, we can answer how a group by can be done more efficiently by taking advantage of the segmented nature of the key(s) (i.e., using a segmented agg). If you are suggesting such a comparison, yes, that makes sense. I think I can add a new benchmark for it.
What do you think? Thanks.
Ok, thanks for the explanation. A new benchmark can be added in a later PR if we want, this one is fine now that I understand the motivation.
+1, thanks for this
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 3d6d581. There was 1 benchmark result indicating a performance regression:
The full Conbench report has more details. It also includes information about 255 possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change

As described in #44052, AnyKeysSegmenter::GetNextSegment currently has O(n*m) complexity, where n is the number of rows in a batch and m is the number of segments in that batch (a "segment" is a group of contiguous rows that share the same segment key). This is because each invocation of the method computes the group ids of all the remaining rows in the batch, even though it is only interested in the first group, making the rest of the computation a waste. In this PR I introduce a new API, GetSegments (and subsequently deprecate the old GetNextSegment), to compute the group ids only once and iterate over all the segments, avoiding the duplicated computation. This reduces the complexity from O(n*m) to O(n).

What changes are included in this PR?

- grouper.h is a public header, so I assume RowSegmenter::GetNextSegment is a public API and only deprecate it instead of removing it.
- Introduce RowSegmenter::GetSegments and update the call-sites.
- Benchmarks show a maximum 50x speedup (nearly the O(n*m) to O(n) complexity reduction).

Are these changes tested?
Legacy tests are sufficient.
Are there any user-facing changes?
Yes.
This PR includes breaking changes to public APIs.
The API RowSegmenter::GetNextSegment is deprecated due to its inefficiency and replaced with a more efficient one, RowSegmenter::GetSegments.