Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43408: [C++] IO: Make Advance to virtual and naive implement it #43409

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Jul 24, 2024

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Copy link

⚠️ GitHub issue #43408 has been automatically assigned in GitHub to PR creator.

@mapleFU
Copy link
Member Author

mapleFU commented Jul 24, 2024

@felipecrv @pitrou

This is just a naive impl, I'll test and refactor after interface is stable. Currently I think we may change

Status Advance(int64_t nbytes);

To

Result<int64_t> Advance(int64_t nbytes);

Comment on lines +453 to +455
// TODO(mwish): Considering using raw_->Advance if available,
// currently we don't have a way to know if the underlying stream supports fast
// skipping. So we just read and discard the data.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe some code like:

  /// \brief Return true if InputStream is capable of zero copy Buffer reads
  ///
  /// Zero copy reads imply the use of Buffer-returning Read() overloads.
  virtual bool supports_zero_copy() const;

Would help, a supports_fast_advance like https://github.com/ClickHouse/ClickHouse/blob/1b2fd51e090214deb340a76833bab7b4985eecfc/src/Disks/IO/ReadBufferFromRemoteFSGather.h#L19 might work

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jul 25, 2024
// TODO(mwish): Considering using raw_->Advance if available,
// currently we don't have a way to know if the underlying stream supports fast
// skipping. So we just read and discard the data.
auto result = Read(remain_skip_bytes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not simply call Advance? The default implementation calls Read anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not simply call Advance? The default implementation calls Read anyway.

If underlying don't have supports_fast_advance, it would call "read" without buffering, and might be a low-efficient direct read and less efficent than "do large buffer and read"

return Status::OK();
}

if (nbytes < bytes_buffered_) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (nbytes < bytes_buffered_) {
if (nbytes <= bytes_buffered_) {

@pitrou
Copy link
Member

pitrou commented Aug 5, 2024

@felipecrv @pitrou

This is just a naive impl, I'll test and refactor after interface is stable. Currently I think we may change

Status Advance(int64_t nbytes);

To

Result<int64_t> Advance(int64_t nbytes);

Yes, that sounds reasonable.

@mapleFU
Copy link
Member Author

mapleFU commented Aug 7, 2024

Yes, that sounds reasonable.

I would change this, thanks!

@pitrou
Copy link
Member

pitrou commented Aug 8, 2024

By the way, why do you need this?

@mapleFU
Copy link
Member Author

mapleFU commented Aug 8, 2024

Emmm I'm using stream to read a local file, when I trying to Advance I found severo ms is taken, and when I dive in I found Read is called

I can avoid calling read by re-construct a stream, however, maybe enhance advance is more intuitive

@pitrou
Copy link
Member

pitrou commented Aug 8, 2024

Hmm, I see we're using Advance in Parquet to skip data pages according to statistics, so perhaps this would be useful after all.

EncodedStatistics data_page_statistics;
if (ShouldSkipPage(&data_page_statistics)) {
PARQUET_THROW_NOT_OK(stream_->Advance(compressed_len));
continue;
}

@mapleFU
Copy link
Member Author

mapleFU commented Aug 9, 2024

Sigh, I don't like SkipPage api here, I found it might help decoding in arrow-ipc format

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants