-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-45283: [C++][Parquet] Omit level histogram when max level is 0 #45285
base: main
Are you sure you want to change the base?
Conversation
|
cc @etseidl |
@@ -1609,6 +1609,9 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter< | |||
|
|||
auto add_levels = [](std::vector<int64_t>& level_histogram, | |||
::arrow::util::span<const int16_t> levels, int16_t max_level) { | |||
if (max_level == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit: should this be a constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO this is already clear.
cpp/src/parquet/size_statistics.cc
Outdated
auto validate_histogram = [](const std::vector<int64_t>& histogram, int16_t max_level, | ||
const std::string& name) { | ||
if (max_level == 0 && histogram.empty()) { | ||
return; | ||
} | ||
if (histogram.size() != static_cast<size_t>(max_level + 1)) { | ||
throw ParquetException(name + " level histogram size mismatch"); | ||
} | ||
}; | ||
validate_histogram(repetition_level_histogram, descr->max_repetition_level(), | ||
"Repetition"); | ||
validate_histogram(definition_level_histogram, descr->max_definition_level(), | ||
"Definition"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may want to add a threshold argument to validate_histogram
, the reason being that the definition level histogram can be omitted for max def equal 0 or 1. arrow-rs and cudf both use the same cutoff of 0
on write (not sure what parquet-java does, but you wrote that 😄), but down the road someone might stick to the spec and this test will still fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oof...it's still too early for my brain to work github >.<
Anyway, this looks good to me on the write side, but the read side needs to account for the asymmetry in the spec. Thanks!
@@ -376,4 +376,22 @@ TEST_F(SizeStatisticsRoundTripTest, LargePage) { | |||
/*byte_array_bytes=*/{90000}})); | |||
} | |||
|
|||
TEST_F(SizeStatisticsRoundTripTest, MaxLevelZero) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a file to parquet-testing that has max_def==1 and omits the def histogram?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think if any open-sourced Parquet writer is able to create this yet...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could coerce one into doing so. But with the change allowing empty histograms, it's probably not worth creating one now.
I am just following the PR, what do you mean by this? I am not sure I understand what case you're describing. |
@raulcd The parquet spec says the repetition level histogram can be omitted if max_rep_level==0, but for the definition levels it can be omitted for max_def_level == 0 or 1. https://github.com/apache/parquet-format/blob/a498aa9a377edcdbc5da802cf9f1763a2e409411/src/main/thrift/parquet.thrift#L231-L237 So the case I'm describing is a file with a column with max_def_level==1, where the repetition histogram is populated but the definition histogram is not. |
The spec doesn't say anything about when it's allowed to be omitted. It says in which situations it can be omitted without loss of information. So we should assume that it can always be omitted, regardless of the max level values. |
::arrow::TableFromJSON(schema, {R"([["foo"],["bar"]])"}), | ||
/*max_row_group_length=*/2, | ||
/*page_size=*/1024); | ||
ReadSizeStatistics(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it also read the full file? We should ensure we don't crash elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReadSizeStatistics
currently only reads the metadata where it crashes from the issue. Let me add a new function to verify scanning data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this patch omit the levels when writing and loose the validation on reading? I forget a bit can we always omit the level whatever the max_def_level is...
auto validate_histogram = [](const std::vector<int64_t>& histogram, int16_t max_level, | ||
const std::string& name) { | ||
if (histogram.empty()) { | ||
// A levels histogram is always allowed to be missing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also allow for missing unencoded_byte_array_data_bytes
below? Maybe a different PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I just thought about this and will loose it as well.
Rationale for this change
The level histogram of size statistics can be omitted if its max level is 0. We haven't implemented this yet and enforces histogram size to be equal to
max_level + 1
. However, when reading a Parquet file with omitted level histogram, exception will be thrown.What changes are included in this PR?
Omit level histogram when max level is 0.
Are these changes tested?
Yes, a test case has been added to reflect the change.
Are there any user-facing changes?
No.