
GH-45283: [C++][Parquet] Omit level histogram when max level is 0 #45285

Open
wants to merge 3 commits into main

Conversation

@wgtmac wgtmac (Member) commented Jan 16, 2025

Rationale for this change

The level histogram of size statistics can be omitted when its max level is 0. We haven't implemented this yet and instead enforce the histogram size to be equal to max_level + 1. As a result, an exception is thrown when reading a Parquet file whose level histogram has been omitted.

What changes are included in this PR?

Omit level histogram when max level is 0.
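A minimal sketch of the intended behavior (illustrative only; the function names below are hypothetical and the real changes are in the diffs quoted later in this conversation):

#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Write side: when the max level is 0, every value has level 0, so the
// histogram carries no information and can stay empty (omitted).
void UpdateLevelHistogram(std::vector<int64_t>& histogram,
                          const std::vector<int16_t>& levels,
                          int16_t max_level) {
  if (max_level == 0) {
    return;  // Leave the histogram empty; it is omitted when serializing.
  }
  if (histogram.empty()) {
    histogram.resize(static_cast<size_t>(max_level) + 1, 0);
  }
  for (int16_t level : levels) {
    histogram[static_cast<size_t>(level)] += 1;
  }
}

// Read side: accept an empty (omitted) histogram instead of throwing a
// "level histogram size mismatch" exception.
void ValidateLevelHistogram(const std::vector<int64_t>& histogram,
                            int16_t max_level, const std::string& name) {
  if (histogram.empty()) {
    return;  // Omitted histogram is valid.
  }
  if (histogram.size() != static_cast<size_t>(max_level) + 1) {
    throw std::runtime_error(name + " level histogram size mismatch");
  }
}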

Are these changes tested?

Yes, a test case has been added to reflect the change.

Are there any user-facing changes?

No.


⚠️ GitHub issue #45283 has been automatically assigned in GitHub to PR creator.

@wgtmac wgtmac (Member Author) commented Jan 16, 2025

cc @etseidl

@wgtmac wgtmac requested a review from pitrou January 16, 2025 16:07
@@ -1609,6 +1609,9 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<

auto add_levels = [](std::vector<int64_t>& level_histogram,
                     ::arrow::util::span<const int16_t> levels, int16_t max_level) {
  if (max_level == 0) {
Contributor

minor nit: should this be a constant?

Member Author

IMHO this is already clear.

@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label on Jan 16, 2025
Comment on lines 67 to 79
auto validate_histogram = [](const std::vector<int64_t>& histogram, int16_t max_level,
                             const std::string& name) {
  if (max_level == 0 && histogram.empty()) {
    return;
  }
  if (histogram.size() != static_cast<size_t>(max_level + 1)) {
    throw ParquetException(name + " level histogram size mismatch");
  }
};
validate_histogram(repetition_level_histogram, descr->max_repetition_level(),
                   "Repetition");
validate_histogram(definition_level_histogram, descr->max_definition_level(),
                   "Definition");
Contributor

You may want to add a threshold argument to validate_histogram, the reason being that the definition level histogram can be omitted for max def equal 0 or 1. arrow-rs and cudf both use the same cutoff of 0 on write (not sure what parquet-java does, but you wrote that 😄), but down the road someone might stick to the spec and this test will still fail.
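A rough sketch of that suggestion (hypothetical; the omit_threshold parameter name is made up and the snippet reuses the variables from the quoted test code above):

auto validate_histogram = [](const std::vector<int64_t>& histogram, int16_t max_level,
                             int16_t omit_threshold, const std::string& name) {
  // The histogram may legitimately be omitted whenever max_level is at or
  // below the spec's cutoff for this level type.
  if (max_level <= omit_threshold && histogram.empty()) {
    return;
  }
  if (histogram.size() != static_cast<size_t>(max_level + 1)) {
    throw ParquetException(name + " level histogram size mismatch");
  }
};
// Repetition histogram: may be omitted only when max_rep_level == 0.
validate_histogram(repetition_level_histogram, descr->max_repetition_level(),
                   /*omit_threshold=*/0, "Repetition");
// Definition histogram: may be omitted when max_def_level is 0 or 1.
validate_histogram(definition_level_histogram, descr->max_definition_level(),
                   /*omit_threshold=*/1, "Definition");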

@etseidl etseidl (Contributor) left a comment

Oof... it's still too early for my brain to work with GitHub >.<

Anyway, this looks good to me on the write side, but the read side needs to account for the asymmetry in the spec. Thanks!

@@ -376,4 +376,22 @@ TEST_F(SizeStatisticsRoundTripTest, LargePage) {
/*byte_array_bytes=*/{90000}}));
}

TEST_F(SizeStatisticsRoundTripTest, MaxLevelZero) {
Contributor

Should we add a file to parquet-testing that has max_def==1 and omits the def histogram?

Member Author

I don't think any open-source Parquet writer is able to create such a file yet...

Contributor

I could coerce one into doing so. But with the change allowing empty histograms, it's probably not worth creating one now.

@raulcd raulcd (Member) commented Jan 16, 2025

but the read side needs to account for the asymmetry in the spec.

I am just following the PR, what do you mean by this? I am not sure I understand what case you're describing.

@etseidl etseidl (Contributor) commented Jan 16, 2025

but the read side needs to account for the asymmetry in the spec.

I am just following the PR, what do you mean by this? I am not sure I understand what case you're describing.

@raulcd The parquet spec says the repetition level histogram can be omitted if max_rep_level==0, but for the definition levels it can be omitted for max_def_level == 0 or 1. https://github.com/apache/parquet-format/blob/a498aa9a377edcdbc5da802cf9f1763a2e409411/src/main/thrift/parquet.thrift#L231-L237

So the case I'm describing is a file with a column with max_def_level==1, where the repetition histogram is populated but the definition histogram is not.
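In code form, the asymmetry boils down to two different cutoffs (a standalone sketch with hypothetical helper names):

#include <cstdint>

// The repetition level histogram may be omitted only when max_rep_level == 0.
bool MayOmitRepLevelHistogram(int16_t max_rep_level) {
  return max_rep_level == 0;
}

// The definition level histogram may be omitted when max_def_level is 0 or 1;
// with max_def_level == 1 it duplicates the information already available
// from the null count.
bool MayOmitDefLevelHistogram(int16_t max_def_level) {
  return max_def_level <= 1;
}

// The case described above: a column with a non-zero max_rep_level and
// max_def_level == 1, where the repetition histogram must be present but the
// definition histogram may legitimately be absent.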

@pitrou pitrou (Member) commented Jan 16, 2025

The spec doesn't say anything about when it's allowed to be omitted.

It says in which situations it can be omitted without loss of information.

So we should assume that it can always be omitted, regardless of the max level values.

::arrow::TableFromJSON(schema, {R"([["foo"],["bar"]])"}),
/*max_row_group_length=*/2,
/*page_size=*/1024);
ReadSizeStatistics();
Member

Does it also read the full file? We should ensure we don't crash elsewhere.

Member Author

ReadSizeStatistics currently only reads the metadata, which is where the crash from the issue happens. Let me add a new function to verify scanning the data.
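Something along these lines could work (a sketch only; it assumes the fixture keeps the written file in a buffer_ member and uses the usual Arrow/Parquet test headers, which may not match the actual test code):

// Hypothetical helper for the test fixture: read the data back to make sure
// scanning does not crash when level histograms are omitted.
void ReadData() {
  auto source = std::make_shared<::arrow::io::BufferReader>(buffer_);
  auto file_reader = parquet::ParquetFileReader::Open(source);

  std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
  PARQUET_THROW_NOT_OK(parquet::arrow::FileReader::Make(
      ::arrow::default_memory_pool(), std::move(file_reader), &arrow_reader));

  std::shared_ptr<::arrow::Table> table;
  PARQUET_THROW_NOT_OK(arrow_reader->ReadTable(&table));
  ASSERT_OK(table->ValidateFull());
}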

@mapleFU mapleFU (Member) left a comment

So this patch omits the levels when writing and loosens the validation on reading? I forget a bit: can we always omit the levels, regardless of the max_def_level?

auto validate_histogram = [](const std::vector<int64_t>& histogram, int16_t max_level,
                             const std::string& name) {
  if (histogram.empty()) {
    // A levels histogram is always allowed to be missing.
Contributor

Should we also allow for missing unencoded_byte_array_data_bytes below? Maybe a different PR.

Member Author

Yes, I just thought about this and will loosen it as well.
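For reference, what that loosening might look like (a standalone sketch, not the eventual patch):

#include <cstdint>
#include <optional>

// Mirror the relaxed histogram validation: a missing
// unencoded_byte_array_data_bytes is simply accepted, since the field is
// optional in the SizeStatistics thrift definition.
bool IsValidUnencodedByteArrayBytes(
    const std::optional<int64_t>& unencoded_byte_array_data_bytes) {
  if (!unencoded_byte_array_data_bytes.has_value()) {
    return true;  // Omitted: nothing to validate.
  }
  return *unencoded_byte_array_data_bytes >= 0;  // Basic sanity check only.
}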
