Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-31992: [C++][Parquet] Handling the special case when DataPageV2 values buffer is empty #45252

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Jan 14, 2025

Rationale for this change

In DataPageV2, the levels and data will not be compressed together. So, we might get the "empty" data page buffer.

When meeting this, Snappy C++ will failed to decompress the (input_len == 0, output_len == 0) data.

What changes are included in this PR?

Handling the case in column_reader.cc

Are these changes tested?

  • Will add

Are there any user-facing changes?

Minor fix

@mapleFU mapleFU requested a review from wgtmac as a code owner January 14, 2025 08:41
Copy link

⚠️ GitHub issue #31992 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 14, 2025

ASSERT_EQ(100, this->metadata_num_values());
this->ReadColumn(Compression::SNAPPY);
ASSERT_EQ(0, this->values_read_);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've verified that the unittest test would test this case

@mapleFU
Copy link
Member Author

mapleFU commented Jan 14, 2025

@pitrou @wgtmac

page_buffer->data() + levels_byte_len,
uncompressed_len - levels_byte_len,
decompression_buffer_->mutable_data() + levels_byte_len));
// GH-31992: DataPageV2 may store only levels and no values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we fix this in the Snappy decompressor instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, which one do you prefer?

Copy link
Member

@pitrou pitrou Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends whether an empty buffer is a normal compression result or a shortcut taken by parquet-java. Let me see.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it's really a bug in parquet-java, because a 0-size buffer compressed to a 1-size buffer using Snappy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So should I handle the 0,0 case in snappy 🤔?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So should I handle the 0,0 case in snappy 🤔?

No, sorry. We should work around it in Parquet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I mean parquet-java is able to produce 0-sized compressed data as the example. How can I reproduce your case where 0 input is compressed to 1 byte?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not my case, it's just what Snappy produces when you ask it to compress 0 byte.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's just what Snappy produces when you ask it to compress 0 byte.

IMO that's:

  1. "Compressed" Data is 0 byte
  2. Actually, the levels holds k bytes ?

I don't know how parquet-java works, I tried parquet rust and it failed to read

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, a 0-byte compressed data is invalid. It's probably a special case in the Parquet Java implementation.

cpp/src/parquet/column_writer_test.cc Outdated Show resolved Hide resolved
@mapleFU
Copy link
Member Author

mapleFU commented Jan 15, 2025

@pitrou so what should I need to do to merge this? Should I check #45252 (comment) in compression to check that "Compress" doesn't compress the zero-sized page? Or I should change other code?

@pitrou
Copy link
Member

pitrou commented Jan 15, 2025

@mapleFU Can we open a bug for Parquet-java and reference it here?

@mapleFU
Copy link
Member Author

mapleFU commented Jan 15, 2025

Created: apache/parquet-java#3122

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants