Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Improve performance of sequential access of ChunkResolver #44641

Open
anjakefala opened this issue Nov 5, 2024 · 3 comments
Open

[C++] Improve performance of sequential access of ChunkResolver #44641

anjakefala opened this issue Nov 5, 2024 · 3 comments

Comments

@anjakefala
Copy link
Collaborator

Describe the enhancement requested

There has been recent work to move the ChunkResolver to public API.

ChunkResolver uses O(log(num_chunks)) binary search to identify chunks, which is optimised for random access. For sequential row-by-row access, using ChunkResolver would be inefficient.

Sometimes a user needs to be able to do row-major processing of the data. To that note, the proposal is to add these helper methods to the ChunKResolver API for more efficient sequential access traversal.

These helper methods were written by @felipecrv:

  /// \pre loc.chunk_index >= 0
  /// \pre loc.index_in_chunk is assumed valid if chunk_index is not the last one
  inline bool Valid(ChunkLocation loc) const {
    const int64_t last_chunk_index = static_cast<int64_t>(offsets_.size()) - 1;
    return loc.chunk_index + 1 < last_chunk_index ||
           (loc.chunk_index + 1 == last_chunk_index &&
            loc.index_in_chunk < offsets_[last_chunk_index]);
  }

  /// \pre Valid(loc)
  inline ChunkLocation Next(ChunkLocation loc) const {
    const int64_t next_index_in_chunk = loc.index_in_chunk + 1;
    return (next_index_in_chunk < offsets_[loc.chunk_index + 1])
               ? ChunkLocation{loc.chunk_index, next_index_in_chunk}
               : ChunkLocation{loc.chunk_index + 1, 0};
  }

with the resulting loops:

ChunkResolver resolver(batches);
for (ChunkLocation loc; resolver.Valid(loc); loc = resolved.Next(loc)) {
  // re-use loc for all the typed columns since they are split on the same offsets
}

Component(s)

C++

@anjakefala anjakefala changed the title [C++] Improve sequential access use of ChunkResolver [C++] Improve performance of sequential access of ChunkResolver Nov 5, 2024
@felipecrv
Copy link
Contributor

+1 I think these would be nice additions to the API.

@zeroshade
Copy link
Member

I think this is a great idea, though we might need to either rename ChunkResolver or otherwise just ensure that the public docs make all the limitations clear.

That's really my only concern, we want to avoid this becoming the "standard way to iterate a table" and instead ensure that users can easily know when they should or shouldn't use the ChunkResolver

@anjakefala
Copy link
Collaborator Author

instead ensure that users can easily know when they should or shouldn't use the ChunkResolver

This makes sense! So the steps for this PR are to add tests and appropriate documentation, on top of the changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants