Support parquet bloom filter length #4885
Signed-off-by: Letian Jiang <[email protected]>
351e1a9 to 1f04ca9
impl Default for Statistics {
    fn default() -> Self {
        Statistics {
            max: Some(Vec::new()),
It occurs to me that this is technically a breaking change
Actually, this is replaced by #[derive(Default)] at line 641
if I didn't misunderstand it
But that will default the fields to None
You are right. It is a breaking change.
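A small sketch of why the two forms differ. The struct names `ManualStats` and `DerivedStats` and the `min` field are hypothetical stand-ins, not the crate's actual definitions; only `max: Some(Vec::new())` is taken from the diff above.

```rust
// Hypothetical stand-in for the hand-written impl shown in the diff.
#[derive(Debug, PartialEq)]
struct ManualStats {
    max: Option<Vec<u8>>,
    min: Option<Vec<u8>>, // assumed field, for illustration only
}

impl Default for ManualStats {
    fn default() -> Self {
        ManualStats {
            max: Some(Vec::new()),
            min: Some(Vec::new()),
        }
    }
}

// Hypothetical stand-in for the derived version.
// The derive calls Option::default() for each field, which is None.
#[derive(Debug, Default, PartialEq)]
struct DerivedStats {
    max: Option<Vec<u8>>,
    min: Option<Vec<u8>>,
}
```

Callers that matched on `Some(..)` after calling `default()` would see `None` with the derived version, which is why the switch is an observable change.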
Just some minor nits, thank you for this
parquet/src/file/writer.rs
column_chunk
    .meta_data
    .as_mut()
    .expect("can't have bloom filter without column metadata")
    .bloom_filter_offset = Some(start_offset as i64);
column_chunk
    .meta_data
    .as_mut()
    .expect("can't have bloom filter without column metadata")
    .bloom_filter_length =
    Some((end_offset - start_offset) as i32)
let meta = column_chunk
    .meta_data
    .as_mut()
    .expect("can't have bloom filter without column metadata");
meta.bloom_filter_offset = Some(start_offset as i64);
meta.bloom_filter_length = Some((end_offset - start_offset) as i32);
Or something to that effect
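One possible variation on the suggestion, sketched with hypothetical stand-in types (`ColumnChunk` and `ColumnMetaData` here are minimal mock-ups, and `record_bloom_filter` is an invented helper, not a function in the crate). It borrows the metadata once, as suggested, and additionally swaps the `as` casts for checked conversions. The PR itself uses `as` casts; the checked form is only an assumption about what a stricter version might look like.

```rust
use std::convert::TryFrom;

// Minimal mock-ups of the metadata types; not the real definitions.
#[derive(Debug, Default)]
struct ColumnMetaData {
    bloom_filter_offset: Option<i64>,
    bloom_filter_length: Option<i32>,
}

#[derive(Debug, Default)]
struct ColumnChunk {
    meta_data: Option<ColumnMetaData>,
}

// Hypothetical helper: validate both values first, then borrow the
// metadata once and assign both fields.
fn record_bloom_filter(
    column_chunk: &mut ColumnChunk,
    start_offset: u64,
    end_offset: u64,
) -> Result<(), String> {
    let offset = i64::try_from(start_offset).map_err(|e| e.to_string())?;
    let length = end_offset
        .checked_sub(start_offset)
        .ok_or_else(|| "bloom filter end precedes start".to_string())
        .and_then(|len| i32::try_from(len).map_err(|e| e.to_string()))?;

    let meta = column_chunk
        .meta_data
        .as_mut()
        .ok_or_else(|| "can't have bloom filter without column metadata".to_string())?;
    meta.bloom_filter_offset = Some(offset);
    meta.bloom_filter_length = Some(length);
    Ok(())
}
```

Computing both values before touching `meta` means a failed conversion leaves the metadata unchanged.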
let length: usize = header.num_bytes.try_into().map_err(|_| {
    ParquetError::General("Bloom filter length is invalid".to_string())
})?;
We appear to have lost this check
fixed
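For reference, the check in question rejects a negative `num_bytes` before it is used as a buffer size. A minimal standalone sketch follows, assuming `num_bytes` is an `i32` as in the Thrift-generated header; the function name `bitset_length` and the `String` error type are placeholders rather than the crate's API.

```rust
use std::convert::TryFrom;

// Hypothetical helper mirroring the check from the diff: a negative
// length cannot be represented as usize, so the conversion fails.
fn bitset_length(num_bytes: i32) -> Result<usize, String> {
    usize::try_from(num_bytes).map_err(|_| "Bloom filter length is invalid".to_string())
}
```

Without this conversion, a plain `as usize` cast on a negative value would wrap to a very large length instead of producing an error.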
Signed-off-by: Letian Jiang <[email protected]>
@tustvold Hi, I have made some updates. Could you please take another look?
Looks good to me, will merge once CI is green
Since 2.10 is not released, can bloom filter length be used now? 🤔
I hadn't realised it wasn't released, but it's a fairly innocuous change so I'm inclined to think this is fine
Nice! This is helpful, but I think parquet 2.10 is currently not released, so users might not be able to write this...
🤔 I think this doesn't do any harm as long as old readers can read new files (with bloom filter length) correctly.
Thrift supports this form of forwards compatibility natively
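The idea behind that compatibility claim is that a reader skips fields whose ids it does not recognise. The sketch below illustrates the principle with a toy tag-length-value layout (one byte of field id, one byte of length, then the payload). This layout, the field ids, and both function names are invented for illustration and are not Thrift's actual wire format.

```rust
use std::convert::TryFrom;

// Toy encoder: [field id][payload length][payload]. Not Thrift.
fn encode_field(id: u8, payload: &[u8]) -> Vec<u8> {
    let mut out = vec![id, payload.len() as u8];
    out.extend_from_slice(payload);
    out
}

// Toy "old reader": it only knows field 1 (the offset) and skips
// every other field id, including the newer field 2 (the length).
fn read_offset_only(mut buf: &[u8]) -> Option<i64> {
    let mut offset = None;
    while buf.len() >= 2 {
        let (id, len) = (buf[0], buf[1] as usize);
        let rest = &buf[2..];
        if rest.len() < len {
            return None; // truncated input
        }
        let (payload, tail) = rest.split_at(len);
        if id == 1 {
            let bytes = <[u8; 8]>::try_from(payload).ok()?;
            offset = Some(i64::from_le_bytes(bytes));
        }
        buf = tail; // unknown ids fall through and are skipped
    }
    offset
}
```

A file written with the extra field decodes to the same offset as one written without it, which is the property the comments above rely on.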
Which issue does this PR close?
Closes #4398.
Rationale for this change
What changes are included in this PR?
If bloom_filter_length is available in ColumnMetaData, use a single IO to fetch both header and bitset.
Are there any user-facing changes?
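The single-IO read described in the change summary above can be sketched roughly as follows. Everything here is a simplified assumption: `CountingReader`, `read_at`, and `fetch_bloom_filter` are invented names, and the header is modelled as a fixed 4-byte little-endian bitset length, whereas the real Parquet bloom filter header is Thrift-encoded and variable in size.

```rust
use std::io::{self, Read, Seek, SeekFrom};

// Wrapper that counts read requests, standing in for remote IO calls.
struct CountingReader<R> {
    inner: R,
    reads: usize,
}

// Toy header size; an assumption made only for this sketch.
const HEADER_LEN: usize = 4;

fn read_at<R: Read + Seek>(
    r: &mut CountingReader<R>,
    offset: u64,
    len: usize,
) -> io::Result<Vec<u8>> {
    r.inner.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; len];
    r.inner.read_exact(&mut buf)?;
    r.reads += 1;
    Ok(buf)
}

// Returns the bitset bytes. `length` covers header plus bitset.
fn fetch_bloom_filter<R: Read + Seek>(
    r: &mut CountingReader<R>,
    offset: u64,
    length: Option<usize>,
) -> io::Result<Vec<u8>> {
    match length {
        // Length known: one read covers header and bitset together.
        Some(total) => {
            let buf = read_at(r, offset, total)?;
            buf.get(HEADER_LEN..).map(|b| b.to_vec()).ok_or_else(|| {
                io::Error::new(io::ErrorKind::InvalidData, "length shorter than header")
            })
        }
        // Length unknown: read the header first, then the bitset.
        None => {
            let h = read_at(r, offset, HEADER_LEN)?;
            let n = u32::from_le_bytes([h[0], h[1], h[2], h[3]]) as usize;
            read_at(r, offset + HEADER_LEN as u64, n)
        }
    }
}
```

With the length available the reader issues one request instead of two, which matters most when each request is a round trip to object storage.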