-
Notifications
You must be signed in to change notification settings - Fork 824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Assume Pages Delimit Records When Offset Index Loaded (#4921) #4943
Assume Pages Delimit Records When Offset Index Loaded (#4921) #4943
Conversation
What will happen if the reader gets a file that does split a record across a page boundary? |
Most likely it will error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we remove the "workaround" for reading data that spans pages, doesn't that mean the rust parquet reader will not be able to read some files that were written by the writer?
Clearly, there is no test coverage for such page spanning records (as no tests fail when you remove the logic) but I still worry this will be a regression for existing users.
Perhaps we could at least check for rows that span page boundaries and throw a helpful error message saying it isn't supported / referring to this PR / ticket?
@@ -1677,4 +1682,85 @@ mod tests { | |||
assert!(sbbf.check(&"Hello")); | |||
assert!(!sbbf.check(&"Hello_Not_Exists")); | |||
} | |||
|
|||
#[tokio::test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I verified this test fails as follows without the code changes in this PR
called `Result::unwrap()` on an `Err` value: ArrowError("Parquet argument error: Parquet error: StructArrayReader out of sync in read_records, expected 1 read, got 0")
thread 'arrow::async_reader::tests::test_nested_skip' panicked at parquet/src/arrow/async_reader/mod.rs:1760:29:
called `Result::unwrap()` on an `Err` value: ArrowError("Parquet argument error: Parquet error: StructArrayReader out of sync in read_records, expected 1 read, got 0")
stack backtrace:
0: rust_begin_unwind
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
1: core::panicking::panic_fmt
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:67:14
2: core::result::unwrap_failed
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/result.rs:1652:5
3: core::result::Result<T,E>::unwrap
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/result.rs:1077:23
4: parquet::arrow::async_reader::tests::test_nested_skip::{{closure}}
at ./src/arrow/async_reader/mod.rs:1760:26
5: <core::pin::Pin<P> as core::future::future::Future>::poll
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/future/future.rs:125:9
6: <core::pin::Pin<P> as core::future::future::Future>::poll
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/future/future.rs:125:9
7: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}::{{closure}}
at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/current_thread/mod.rs:665:57
8: tokio::runtime::coop::with_budget
at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/coop.rs:107:5
9: tokio::runtime::coop::budget
at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/coop.rs:73:5
10: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}
at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/current_thread/mod.rs:665:25
11: tokio::runtime::scheduler::current_thread::Context::enter
at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/current_thread/mod.rs:410:19
12: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}
at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/current_thread/mod.rs:664:36
13: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/current_thread/mod.rs:743:68
14: tokio::runtime::context::scoped::Scoped<T>::set
at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/context/scoped.rs:40:9
15: tokio::runtime::context::set_scheduler::{{closure}}
at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/context.rs:176:26
16: std::thread::local::LocalKey<T>::try_with
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/thread/local.rs:270:16
17: std::thread::local::LocalKey<T>::with
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/thread/local.rs:246:9
18: tokio::runtime::context::set_scheduler
at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/context.rs:176:9
19: tokio::runtime::scheduler::current_thread::CoreGuard::enter
at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/current_thread/mod.rs:743:27
20: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/current_thread/mod.rs:652:19
21: tokio::runtime::scheduler::current_thread::CurrentThread::block_on::{{closure}}
at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/current_thread/mod.rs:175:28
22: tokio::runtime::context::runtime::enter_runtime
at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/context/runtime.rs:65:16
23: tokio::runtime::scheduler::current_thread::CurrentThread::block_on
at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/current_thread/mod.rs:167:9
24: tokio::runtime::runtime::Runtime::block_on
at /Users/alamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/runtime.rs:347:47
25: parquet::arrow::async_reader::tests::test_nested_skip
at ./src/arrow/async_reader/mod.rs:1744:9
26: parquet::arrow::async_reader::tests::test_nested_skip::{{closure}}
at ./src/arrow/async_reader/mod.rs:1687:33
27: core::ops::function::FnOnce::call_once
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/ops/function.rs:250:5
28: core::ops::function::FnOnce::call_once
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
ff217be
to
684f749
Compare
I reworked this to only assume pages delimit records if the offset index is loaded, this will ensure this change is restricted to the use-case where the behaviour was previously incorrect / buggy and avoid regressing the case of no offset index - which will continue to work as before. This will also serve as an opt-out mechanism should the new behaviour be problematic for any reason. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @tustvold
Which issue does this PR close?
Closes #4921
Rationale for this change
The parquet specification, especially before version 2, did not make it especially clear that records should not be split across pages, and prior to #4327 the Rust writer would do this in certain situations.
As a result #4376 attempted to not assume page boundaries delimit records, and instead only relied on row group boundaries delimiting records.
Unfortunately this logic is incomplete:
Invalid offset
errorsWhat changes are included in this PR?
This PR removes the workarounds, and simply assumes pages delimit records.
Are there any user-facing changes?