[C++] Improve performance of sequential access of ChunkResolver #44641

anjakefala · 2024-11-05T01:28:42Z

Describe the enhancement requested

There has been recent work to move the ChunkResolver to public API.

ChunkResolver uses O(log(num_chunks)) binary search to identify chunks, which is optimised for random access. For sequential row-by-row access, using ChunkResolver would be inefficient.

Sometimes a user needs to be able to do row-major processing of the data. To that note, the proposal is to add these helper methods to the ChunKResolver API for more efficient sequential access traversal.

These helper methods were written by @felipecrv:

  /// \pre loc.chunk_index >= 0
  /// \pre loc.index_in_chunk is assumed valid if chunk_index is not the last one
  inline bool Valid(ChunkLocation loc) const {
    const int64_t last_chunk_index = static_cast<int64_t>(offsets_.size()) - 1;
    return loc.chunk_index + 1 < last_chunk_index ||
           (loc.chunk_index + 1 == last_chunk_index &&
            loc.index_in_chunk < offsets_[last_chunk_index]);
  }

  /// \pre Valid(loc)
  inline ChunkLocation Next(ChunkLocation loc) const {
    const int64_t next_index_in_chunk = loc.index_in_chunk + 1;
    return (next_index_in_chunk < offsets_[loc.chunk_index + 1])
               ? ChunkLocation{loc.chunk_index, next_index_in_chunk}
               : ChunkLocation{loc.chunk_index + 1, 0};
  }

with the resulting loops:

ChunkResolver resolver(batches);
for (ChunkLocation loc; resolver.Valid(loc); loc = resolved.Next(loc)) {
  // re-use loc for all the typed columns since they are split on the same offsets
}

Component(s)

C++

The text was updated successfully, but these errors were encountered:

felipecrv · 2024-11-05T14:16:21Z

+1 I think these would be nice additions to the API.

zeroshade · 2024-11-07T08:48:54Z

I think this is a great idea, though we might need to either rename ChunkResolver or otherwise just ensure that the public docs make all the limitations clear.

That's really my only concern, we want to avoid this becoming the "standard way to iterate a table" and instead ensure that users can easily know when they should or shouldn't use the ChunkResolver

anjakefala · 2024-11-07T23:54:19Z

instead ensure that users can easily know when they should or shouldn't use the ChunkResolver

This makes sense! So the steps for this PR are to add tests and appropriate documentation, on top of the changes.

anjakefala added the Type: enhancement label Nov 5, 2024

github-actions bot added the Component: C++ label Nov 5, 2024

anjakefala changed the title ~~[C++] Improve sequential access use of ChunkResolver~~ [C++] Improve performance of sequential access of ChunkResolver Nov 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++] Improve performance of sequential access of ChunkResolver #44641

[C++] Improve performance of sequential access of ChunkResolver #44641

anjakefala commented Nov 5, 2024

felipecrv commented Nov 5, 2024

zeroshade commented Nov 7, 2024

anjakefala commented Nov 7, 2024

[C++] Improve performance of sequential access of ChunkResolver #44641

[C++] Improve performance of sequential access of ChunkResolver #44641

Comments

anjakefala commented Nov 5, 2024

Describe the enhancement requested

Component(s)

felipecrv commented Nov 5, 2024

zeroshade commented Nov 7, 2024

anjakefala commented Nov 7, 2024