GH-31992: [C++][Parquet] Handling the special case when DataPageV2 values buffer is empty #45252
base: main
ASSERT_EQ(100, this->metadata_num_values());
this->ReadColumn(Compression::SNAPPY);
ASSERT_EQ(0, this->values_read_);
I've verified that the unit test covers this case.
page_buffer->data() + levels_byte_len,
uncompressed_len - levels_byte_len,
decompression_buffer_->mutable_data() + levels_byte_len));
// GH-31992: DataPageV2 may store only levels and no values
Can we fix this in the Snappy decompressor instead?
Yes, which one do you prefer?
It depends whether an empty buffer is a normal compression result or a shortcut taken by parquet-java. Let me see.
Ok, it's really a bug in parquet-java, because Snappy compresses a 0-byte buffer to a 1-byte buffer.
So should I handle the 0,0 case in snappy 🤔?
> So should I handle the 0,0 case in snappy 🤔?
No, sorry. We should work around it in Parquet.
Yes, I mean parquet-java is able to produce 0-sized compressed data, as in the example. How can I reproduce your case, where a 0-byte input is compressed to 1 byte?
It's not my case, it's just what Snappy produces when you ask it to compress 0 bytes.
> it's just what Snappy produces when you ask it to compress 0 bytes.
IMO the situation is:
- The "compressed" data is 0 bytes
- The levels actually hold k bytes?
I don't know how parquet-java works. I tried parquet rust and it failed to read it.
Again, 0-byte compressed data is invalid. It's probably a special case in the Parquet Java implementation.
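To illustrate the point above: the Snappy raw format starts with the uncompressed length encoded as a little-endian base-128 varint, so even an empty input yields a 1-byte stream (the single byte 0x00), and a 0-byte buffer cannot carry that preamble. The sketch below only models the preamble. `EncodeVarint32` and `ReadUncompressedLength` are hypothetical helper names, not Snappy or Arrow APIs, and overflow checking is omitted for brevity.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode v as a little-endian base-128 varint, the encoding used by the
// Snappy raw format for its uncompressed-length preamble.
std::vector<uint8_t> EncodeVarint32(uint32_t v) {
  std::vector<uint8_t> out;
  while (v >= 0x80) {
    out.push_back(static_cast<uint8_t>(v | 0x80));  // low 7 bits + continuation bit
    v >>= 7;
  }
  out.push_back(static_cast<uint8_t>(v));  // final byte, continuation bit clear
  return out;
}

// Hypothetical helper: parse the length preamble from a compressed buffer.
// Returns false when the buffer is empty or the varint is truncated, which is
// why a 0-byte "compressed" buffer is rejected as invalid input.
bool ReadUncompressedLength(const uint8_t* data, size_t size, uint32_t* result) {
  uint32_t value = 0;
  for (size_t i = 0; i < size && i < 5; ++i) {
    const uint32_t b = data[i];
    value |= (b & 0x7F) << (7 * i);
    if ((b & 0x80) == 0) {
      *result = value;
      return true;
    }
  }
  return false;
}
```

With these helpers, `EncodeVarint32(0)` yields one byte, while `ReadUncompressedLength` on an empty buffer fails, matching the (input_len == 0, output_len == 0) failure discussed in this thread.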
@pitrou so what do I need to do to get this merged? Should I check #45252 (comment) in compression to verify that "Compress" doesn't compress the zero-sized page? Or should I change other code?
@mapleFU Can we open a bug for Parquet-java and reference it here?
Created: apache/parquet-java#3122
Rationale for this change
In DataPageV2, the levels and the values are not compressed together: only the values are compressed. So we might get an "empty" values buffer in a data page.
When this happens, Snappy C++ fails to decompress the (input_len == 0, output_len == 0) data.
What changes are included in this PR?
Handle this case in column_reader.cc.
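A minimal sketch of the idea, not the actual code in column_reader.cc: the levels are copied verbatim, and the codec is skipped when both the compressed and the uncompressed values sizes are zero. `DecompressDataPageV2` and `DecompressFn` are hypothetical names introduced for illustration. The callback stands in for a codec's decompress call and is assumed to return the number of bytes written.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a codec's decompress call; returns bytes written.
using DecompressFn = std::function<size_t(const uint8_t* in, size_t in_len,
                                          uint8_t* out, size_t out_len)>;

// Rebuild the uncompressed DataPageV2 buffer from `page`, whose first
// `levels_byte_len` bytes are the uncompressed levels and whose remaining
// bytes are the compressed values.
std::vector<uint8_t> DecompressDataPageV2(const std::vector<uint8_t>& page,
                                          size_t levels_byte_len,
                                          size_t uncompressed_len,
                                          const DecompressFn& decompress) {
  if (levels_byte_len > page.size() || levels_byte_len > uncompressed_len) {
    throw std::invalid_argument("levels length exceeds page size");
  }
  std::vector<uint8_t> out(uncompressed_len);
  if (levels_byte_len > 0) {
    // Levels are never compressed in DataPageV2, so copy them as-is.
    std::memcpy(out.data(), page.data(), levels_byte_len);
  }
  const size_t compressed_values_len = page.size() - levels_byte_len;
  const size_t values_len = uncompressed_len - levels_byte_len;
  if (compressed_values_len == 0 && values_len == 0) {
    // The page stores only levels and no values: skip the codec, which
    // would reject a 0-byte input.
    return out;
  }
  const size_t written =
      decompress(page.data() + levels_byte_len, compressed_values_len,
                 out.data() + levels_byte_len, values_len);
  if (written != values_len) {
    throw std::runtime_error("unexpected decompressed size");
  }
  return out;
}
```

Handling the case in the page reader rather than in the codec keeps the codec strict about invalid input, in line with the review discussion.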
Are these changes tested?
Are there any user-facing changes?
Minor fix