Fix data page statistics when all rows are null in a data page #11295
…ull. Fixes most of the failing tests for iterators not handling this situation correctly.
@@ -600,6 +601,31 @@ make_data_page_stats_iterator!(
Index::DOUBLE,
f64
);
Just consolidating these together.
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Thank you @efredine -- this looks (really) nice.
Also thank you @Rachelint for the review
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
// There is one data page with 4 nulls
// The statistics should be present but null
Test {
I verified that this test covered the code changes by running the test without the code changes and it failed as expected.
thread 'parquet::arrow_statistics::test_data_page_stats_with_all_null_page' panicked at datafusion/core/tests/parquet/arrow_statistics.rs:276:13:
assertion `left == right` failed: col: Mismatch with expected data page minimums
left: PrimitiveArray<UInt64>
[
]
right: PrimitiveArray<UInt64>
[
null,
]
stack backtrace:
0: rust_begin_unwind
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652:5
1: core::panicking::panic_fmt
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72:14
2: core::panicking::assert_failed_inner
3: core::panicking::assert_failed
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:364:5
4: parquet_exec::parquet::arrow_statistics::Test::run_checks
at ./tests/parquet/arrow_statistics.rs:276:13
5: parquet_exec::parquet::arrow_statistics::Test::run
at ./tests/parquet/arrow_statistics.rs:229:9
6: parquet_exec::parquet::arrow_statistics::test_data_page_stats_with_all_null_page::{{closure}}
at ./tests/parquet/arrow_statistics.rs:567:9
7: <core::pin::Pin<P> as core::future::future::Future>::poll
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/future/future.rs:123:9
8: <core::pin::Pin<P> as core::future::future::Future>::poll
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/future/future.rs:123:9
9: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}::{{closure}}
at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:659:57
10: tokio::runtime::coop::with_budget
at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/coop.rs:107:5
11: tokio::runtime::coop::budget
at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/coop.rs:73:5
12: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}
at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:659:25
13: tokio::runtime::scheduler::current_thread::Context::enter
at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:404:19
14: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}
at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:658:36
15: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:737:68
16: tokio::runtime::context::scoped::Scoped<T>::set
at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context/scoped.rs:40:9
17: tokio::runtime::context::set_scheduler::{{closure}}
at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context.rs:180:26
18: std::thread::local::LocalKey<T>::try_with
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/thread/local.rs:286:12
19: std::thread::local::LocalKey<T>::with
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/thread/local.rs:262:9
20: tokio::runtime::context::set_scheduler
at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context.rs:180:9
21: tokio::runtime::scheduler::current_thread::CoreGuard::enter
at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:737:27
22: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:646:19
23: tokio::runtime::scheduler::current_thread::CurrentThread::block_on::{{closure}}
at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:175:28
24: tokio::runtime::context::runtime::enter_runtime
at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context/runtime.rs:65:16
25: tokio::runtime::scheduler::current_thread::CurrentThread::block_on
at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:167:9
26: tokio::runtime::runtime::Runtime::block_on
at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/runtime.rs:347:47
27: parquet_exec::parquet::arrow_statistics::test_data_page_stats_with_all_null_page
at ./tests/parquet/arrow_statistics.rs:517:5
28: parquet_exec::parquet::arrow_statistics::test_data_page_stats_with_all_null_page::{{closure}}
at ./tests/parquet/arrow_statistics.rs:516:51
29: core::ops::function::FnOnce::call_once
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:250:5
30: core::ops::function::FnOnce::call_once
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
FixedSizeBinaryArray::new(*size, vec![].into(), None)
})
))
let mut builder = FixedSizeBinaryBuilder::new(*size);
this looks like a nice change to me
Let's merge this one in so we can proceed with getting #11319 ready.
Thanks again!
…e#11295)
* Adds tests for data page statistics when all values on the page are null. Fixes most of the failing tests for iterators not handling this situation correctly.
* Fix handling of data page statistics for FixedBinaryArray using a builder.
* Fix data page all nulls stats test for Dictionary DataType.
* Fixes handling of None statistics for Decimal128 and Decimal256.
* Consolidate make_data_page_stats_iterator uses.
* Fix linting error.
* Remove unnecessary collect.
---------
Co-authored-by: Eric Fredine <[email protected]>
Which issue does this PR close?
Closes #11280.
Rationale for this change
When all rows in a data page are null, the min and max statistics should be present but null. Some of the data page statistics iterators were incorrectly omitting the statistics rather than setting them to null. This produced a statistics array whose length differed from the number of data pages.
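The length mismatch can be illustrated with a minimal, standard-library-only sketch. It is not the DataFusion implementation: `PageStats`, `page_mins_dropping_nulls`, and `page_mins_keeping_nulls` are hypothetical names invented for this example, and a plain `Vec<Option<u64>>` stands in for the Arrow array.

```rust
/// Hypothetical stand-in for one data page's statistics.
/// `min` is `None` when every row in the page is null.
struct PageStats {
    min: Option<u64>,
}

/// The problematic shape: pages without a minimum are skipped,
/// so the result can be shorter than the number of pages.
fn page_mins_dropping_nulls(pages: &[PageStats]) -> Vec<Option<u64>> {
    pages.iter().filter_map(|p| p.min).map(Some).collect()
}

/// The intended shape: exactly one entry per page, with `None`
/// (a null slot) for a page whose rows are all null.
fn page_mins_keeping_nulls(pages: &[PageStats]) -> Vec<Option<u64>> {
    pages.iter().map(|p| p.min).collect()
}

fn main() {
    // One data page in which every row is null.
    let pages = vec![PageStats { min: None }];

    let dropped = page_mins_dropping_nulls(&pages);
    let kept = page_mins_keeping_nulls(&pages);

    // The skipping variant loses the page entirely.
    assert_eq!(dropped.len(), 0);
    // The per-page variant keeps one null entry for the page.
    assert_eq!(kept, vec![None]);
}
```

With a single all-null page, the first function returns an empty result and the second returns one null entry, which mirrors the `left` and `right` values in the assertion failure quoted in the review comment.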
What changes are included in this PR?
Adds tests for data page statistics for all data types when all rows in a data page are null. Fixes the data page statistics iterators that failed these tests.
Are these changes tested?
Yes.
Are there any user-facing changes?
No.