-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Browse files
Browse the repository at this point in the history
### Rationale for this change As described in #44052, currently `AnyKeysSegmenter::GetNextSegment` has `O(n*m)` complexity, where `n` is the number of rows in a batch, and `m` is the number of segments in this batch (a "segment" is the group of contiguous rows who have the same segment key). This is because in each invocation of the method, it computes all the group ids of the remaining rows in this batch, where it's only interested in the first group, making the rest of the computation a waste. In this PR I introduced a new API `GetSegments` (and subsequently deprecated the old `GetNextSegment`) to compute the group ids only once and iterate all the segments outside to avoid the duplicated computation. This reduces the complexity from `O(n*m)` to `O(n)`. ### What changes are included in this PR? 1. Because `grouper.h` is a [public header](https://github.com/apache/arrow/blob/8556001e6a8b4c7f35d4e18c28704d7811005904/cpp/src/arrow/compute/api.h#L47), so I assume `RowSegmenter::GetNextSegment` is a public API and only deprecate it instead of removing it. 2. Implement new API `RowSegmenter::GetSegments` and update the call-sites. 3. Some code reorg of the segmenter code (mostly moving to inside a class). 4. A new benchmark for the segmented aggregation. (The benchmark result is listed in the comments below, which shows up to `50x` speedup, nearly `O(n*m)` to `O(n)` complexity reduction.) ### Are these changes tested? Legacy tests are sufficient. ### Are there any user-facing changes? Yes. **This PR includes breaking changes to public APIs.** The API `RowSegmenter::GetNextSegment` is deprecated due to its inefficiency and replaced with a more efficient one `RowSegmenter::GetSegments`. * GitHub Issue: #44052 Lead-authored-by: Ruoxi Sun <[email protected]> Co-authored-by: Rossi Sun <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
- Loading branch information
1 parent
9576a41
commit 3d6d581
Showing
5 changed files
with
332 additions
and
130 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.