
Assume Pages Delimit Records When Offset Index Loaded (#4921) #4943

Merged

Conversation

@tustvold tustvold commented Oct 16, 2023

Which issue does this PR close?

Closes #4921

Rationale for this change

The parquet specification, particularly before version 2, did not make it especially clear that records should not be split across pages, and prior to #4327 the Rust writer would split records across pages in certain situations.

As a result, #4376 attempted to not assume that page boundaries delimit records, relying instead only on row group boundaries to delimit records.

Unfortunately this logic is incomplete:

  • When reading or skipping the last record of a page, that record is not counted until the next read; this can cause the reader to skip too far if the next selection is at the end of the next page
  • When using the offset index to prune IO, the reader tries to read pages beyond those necessary for the RowSelection if a RowSelector ends on a page boundary, leading to "Invalid offset" errors (a sketch of such a selection follows this list)
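
To make that second failure mode concrete, here is a minimal sketch of such a selection using the crate's public RowSelection API (the 100-row page size is a hypothetical assumption for illustration, not taken from this PR):

```rust
use parquet::arrow::arrow_reader::{RowSelection, RowSelector};

fn main() {
    // Hypothetical layout: each data page holds exactly 100 records.
    // The first selector ends exactly on the first page boundary, which
    // is the shape that triggered both bugs described above.
    let selection = RowSelection::from(vec![
        RowSelector::select(100), // read all of page one, up to its boundary
        RowSelector::skip(50),    // the skip then resumes at the top of page two
        RowSelector::select(25),  // and 25 further rows are read
    ]);
    println!("{selection:?}");
}
```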

What changes are included in this PR?

This PR removes the workarounds and simply assumes that pages delimit records.
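
As a rough illustration of the invariant the reader now relies on: once pages are known to delimit records, the record count of every page can be derived purely from the offset index, without decoding repetition levels. The helper below is a sketch for illustration, not code from this PR:

```rust
use parquet::format::PageLocation;

/// Sketch of the invariant: if pages delimit records, the record count of
/// page `i` is the gap between consecutive `first_row_index` entries in
/// the offset index, with the final page running to the row group's total
/// row count.
fn page_row_counts(locations: &[PageLocation], row_group_rows: i64) -> Vec<i64> {
    let next_starts = locations
        .iter()
        .skip(1)
        .map(|loc| loc.first_row_index)
        .chain(std::iter::once(row_group_rows));
    locations
        .iter()
        .zip(next_starts)
        .map(|(page, next_start)| next_start - page.first_row_index)
        .collect()
}
```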

Are there any user-facing changes?

@github-actions github-actions bot added the parquet (Changes to the parquet crate) label Oct 16, 2023
@alamb alamb changed the title Assume Records Not Split Across Pages (#4921) Assume Parquet Records Not Split Across Pages (#4921) Oct 16, 2023

alamb commented Oct 16, 2023

> This PR removes the workarounds and simply assumes that pages delimit records.

What will happen if the reader gets a file that does split a record across a page boundary?

@tustvold replied:

> What will happen if the reader gets a file that does split a record across a page boundary?

Most likely it will error.

@alamb alamb left a comment

If we remove the "workaround" for reading data that spans pages, doesn't that mean the Rust parquet reader will not be able to read some files that were written by the writer?

Clearly, there is no test coverage for such page-spanning records (as no tests fail when the logic is removed), but I still worry this will be a regression for existing users.

Perhaps we could at least check for rows that span page boundaries and throw a helpful error message saying it isn't supported, referring to this PR / ticket?
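
For reference, such a guard might look roughly like the following (entirely hypothetical, not code from this PR): a record that spans pages would begin a page with a non-zero repetition level, so a check at the point where a page's first repetition level is decoded could produce the suggested error.

```rust
use parquet::errors::{ParquetError, Result};

/// Hypothetical guard, not part of this PR: a page whose first value has a
/// repetition level other than zero continues a record that started on the
/// previous page.
fn check_page_starts_new_record(first_rep_level: i16) -> Result<()> {
    if first_rep_level != 0 {
        return Err(ParquetError::General(
            "records spanning page boundaries are not supported when the \
             offset index is loaded (see apache/arrow-rs#4943)"
                .to_string(),
        ));
    }
    Ok(())
}
```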

@@ -1677,4 +1682,85 @@ mod tests {
assert!(sbbf.check(&"Hello"));
assert!(!sbbf.check(&"Hello_Not_Exists"));
}

#[tokio::test]

I verified this test fails as follows without the code changes in this PR:

thread 'arrow::async_reader::tests::test_nested_skip' panicked at parquet/src/arrow/async_reader/mod.rs:1760:29:
called `Result::unwrap()` on an `Err` value: ArrowError("Parquet argument error: Parquet error: StructArrayReader out of sync in read_records, expected 1 read, got 0")
stack backtrace:
   0: rust_begin_unwind
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
   1: core::panicking::panic_fmt
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:67:14
   2: core::result::unwrap_failed
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/result.rs:1652:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/result.rs:1077:23
   4: parquet::arrow::async_reader::tests::test_nested_skip::{{closure}}
             at ./src/arrow/async_reader/mod.rs:1760:26
   5: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/future/future.rs:125:9
   6: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/future/future.rs:125:9
   7: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}::{{closure}}
             at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/current_thread/mod.rs:665:57
   8: tokio::runtime::coop::with_budget
             at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/coop.rs:107:5
   9: tokio::runtime::coop::budget
             at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/coop.rs:73:5
  10: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}
             at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/current_thread/mod.rs:665:25
  11: tokio::runtime::scheduler::current_thread::Context::enter
             at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/current_thread/mod.rs:410:19
  12: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}
             at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/current_thread/mod.rs:664:36
  13: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
             at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/current_thread/mod.rs:743:68
  14: tokio::runtime::context::scoped::Scoped<T>::set
             at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/context/scoped.rs:40:9
  15: tokio::runtime::context::set_scheduler::{{closure}}
             at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/context.rs:176:26
  16: std::thread::local::LocalKey<T>::try_with
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/thread/local.rs:270:16
  17: std::thread::local::LocalKey<T>::with
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/thread/local.rs:246:9
  18: tokio::runtime::context::set_scheduler
             at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/context.rs:176:9
  19: tokio::runtime::scheduler::current_thread::CoreGuard::enter
             at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/current_thread/mod.rs:743:27
  20: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
             at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/current_thread/mod.rs:652:19
  21: tokio::runtime::scheduler::current_thread::CurrentThread::block_on::{{closure}}
             at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/current_thread/mod.rs:175:28
  22: tokio::runtime::context::runtime::enter_runtime
             at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/context/runtime.rs:65:16
  23: tokio::runtime::scheduler::current_thread::CurrentThread::block_on
             at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/current_thread/mod.rs:167:9
  24: tokio::runtime::runtime::Runtime::block_on
             at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/runtime.rs:347:47
  25: parquet::arrow::async_reader::tests::test_nested_skip
             at ./src/arrow/async_reader/mod.rs:1744:9
  26: parquet::arrow::async_reader::tests::test_nested_skip::{{closure}}
             at ./src/arrow/async_reader/mod.rs:1687:33
  27: core::ops::function::FnOnce::call_once
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/ops/function.rs:250:5
  28: core::ops::function::FnOnce::call_once
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


@tustvold tustvold force-pushed the assume-records-not-split-across-pages branch from ff217be to 684f749 on October 16, 2023 22:15
@tustvold tustvold changed the title Assume Parquet Records Not Split Across Pages (#4921) Assume Parquet Records Not Split Across Pages When Offset Index Loaded (#4921) Oct 16, 2023
@tustvold tustvold changed the title Assume Parquet Records Not Split Across Pages When Offset Index Loaded (#4921) Assume Pages Delimit Records When Offset Index Loaded (#4921) Oct 16, 2023
@tustvold commented:

I reworked this to only assume pages delimit records if the offset index is loaded. This restricts the change to the use-case where the behaviour was previously incorrect / buggy, and avoids regressing the case of no offset index, which will continue to work as before. It also serves as an opt-out mechanism should the new behaviour be problematic for any reason.
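
Since the new behaviour only applies when the offset index is loaded, the opt-out is simply to not request the page index when constructing the reader. A minimal sketch using the public sync reader API (the file path is a placeholder):

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
use parquet::errors::Result;

fn main() -> Result<()> {
    // "data.parquet" is a placeholder path.
    let file = File::open("data.parquet")?;

    // with_page_index(false) is the default; it is spelled out here to show
    // the opt-out: without the offset index loaded, the reader keeps its
    // previous behaviour.
    let options = ArrowReaderOptions::new().with_page_index(false);
    let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?;

    for batch in builder.build()? {
        let batch = batch?;
        println!("read {} rows", batch.num_rows());
    }
    Ok(())
}
```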

@alamb alamb left a comment

Thank you @tustvold

@tustvold tustvold merged commit d4d11fe into apache:master Oct 17, 2023
16 checks passed

Successfully merging this pull request may close these issues:

  • Loading page index breaks skipping of pages with nested types (#4921)