-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Write null counts in parquet files when they are present #6257
base: main
Are you sure you want to change the base?
Conversation
6070619
to
2d70413
Compare
2d70413
to
e00e160
Compare
parquet/tests/arrow_writer_layout.rs
Outdated
@@ -189,7 +189,7 @@ fn test_primitive() { | |||
pages: (0..8) | |||
.map(|_| Page { | |||
rows: 250, | |||
page_header_size: 36, | |||
page_header_size: 38, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The page headers (and files) are larger now because Some(0)
takes more space than None
parquet/src/file/statistics.rs
Outdated
|
||
#[test] | ||
fn test_count_encoding() { | ||
statistics_count_test(None, None); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test fails like this without the changes in this PR:
assertion `left == right` failed
left: Boolean({min: Some(true), max: Some(false), distinct_count: None, null_count: Some(0), ...
right: Boolean({min: Some(true), max: Some(false), distinct_count: None, null_count: None, ...
@@ -1195,6 +1197,23 @@ impl<'a> StatisticsConverter<'a> { | |||
self.arrow_field | |||
} | |||
|
|||
/// Set the statistics converter to treat missing null counts as missing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default reading null counts will work with files written with older versions of parquet-rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For breaking / major release, is it acceptable to include an upgrade instruction of "add this config to maintain to old behavior"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely
It is important to note that the default behavior in this PR is the old behavior (in other words there should be changes needed in downstream consumers of this code)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default in this PR is missing_null_counts_as_zero = true
, which maintains the old behavior, right?
If "add this config to maintain old behavior" is acceptable for a breaking release, then I would expect the default to be the new behavior.
IOW, I'd expect what you said on the parquet mailing list
Applications that use parquet-rs to read parquet_files and interpret the
null_count will need to be changed after the upgrade to explicitly continue
the old behavior of "treat no null_count as 0" which is also documented
now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default in this PR is missing_null_counts_as_zero = true, which maintains the old behavior, right?
Yes
then I would expect the default to be the new behavior.
My thinking was
Since there are two different apis:
Statistics::null_count
would now return Option<..>
so users of the library will ned to update their code anyways and thus can choose at that time which behavior they wanted
StatisticsConverter
's API didn't change and thus it keeps the previous behavior. This is what I would persoanlly want for a system -- no change for reading parquet files that were written with previous versions of the rust writer.
// Number of nulls recorded, when it is not available, we just mark it as 0. | ||
// TODO this should be `None` if there is no information about NULLS. | ||
// see https://github.com/apache/arrow-rs/pull/6216/files | ||
let null_count = stats.null_count.unwrap_or(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the removal of this unwrap_or
is what changes the semantics while reading
parquet/src/file/statistics.rs
Outdated
let null_count = stats | ||
.null_count_opt() | ||
.map(|value| value as i64) | ||
.filter(|&x| x > 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The removal of this filter is what fixes the statistics while writing
Maybe it was intended to be x >= 0
originally 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@etseidl I wonder if you have any thoughts on this code / the writing of null counts |
Sorry, I was away when this discussion started (interesting things always happen when I'm on vacation 😉). I think this PR is heading the right direction. Writing |
Unless someone else has time to review this PR and thinks it should go into 53.0.0 (#6016) my personal preference would be to merge this PR early in the next major release cycle (e.g. 54.0.0) so it gets maximum bake / testing time before release |
I've looked this PR over several times and haven't found any issues with it. The only thing giving me pause is this apache/parquet-format#449 (comment) It seems java and cpp ignore the definition levels and write Some(0) regardless, so I'm fine with merging this and worrying about micro optimizations down the road (in step with the rest of the parquet community). |
Thank you @etseidl -- I think
I am not going to merge this PR until after we have released 53.0.0 |
I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took another look and found no nits. I'd say ship it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @alamb
I go back and forth on this PR -- it isn't technically an API change in the sense of a breaking API change, but also it changes the content of written parquet files. Arguable the content better matches the parquet spec, I am just really worried about unintended consequences of doing this Maybe I am overly concerned. |
Then how about splitting this up? First change the read behavior so |
That is a (very) good idea -- I will plan to do that when I can find some time |
@alamb Let me know if you'd like some help with this. I have spare cycles right now. |
that would be super helpful @etseidl -- I do not have many spare cycles now (as you have probably guessed). All your help is most appreciated |
To clear up my own thinking on this, I made a table of what happens with a round trip of the statistics.
Currently we write So in theory we can write the null counts properly now, but should wait until 54.0.0 to make the last change on the read side. Does this sound right to you @alamb? |
I think it also would break reading of OLD files (aka files written with the older software / older versions of arrow-rs would now be interpreted as "unknown null count" rather than "0 null count") which I worry about a lot.
Yes, I suppose writing the null counts properly seems like a good idea. I am (likely overly) paranoid about the read side changes |
Yes, that's the "change read" column from my table, I suppose. So we agree it's the read side where the danger lies.
As I am often told, I am not paranoid enough, so maybe we balance out 😆 |
Update here is that @etseidl has broken out the non backwards compatible changes in two PRs: |
Gentle bump...do we want to merge this now? |
I am still quite worried about the subtle semantic implications of this change -- see for example the discussion on This is the kind of change that could easily lead to lots of debugging / head scratching I think, and even worse would be hard to catch with tests as it would only affect files written in the past So I am torn, to be honest |
Not my call, but I would just throw out there that with this change, there is still a way to get the old behavior if desired. But without this change, there is no way for a user to figure out if |
FWIW the current behaviour is a bug IMO and so my vote would be to proceed with this change. We've postponed it to a breaking release, and are calling it out as a major change in the changelog, so I think we've done all that we can reasonably be expected to. |
/// To preserve the prior behavior and read null counts properly from older files | ||
/// you should default to zero: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we should make it clearer that this behaviour is actually incorrect, it will claim a null count of 0, when it actually isn't known
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last nit if we're doing this
/// Note this API returns Some(0) even if the null count was not present | ||
/// in the statistics. | ||
/// See <https://github.com/apache/arrow-rs/pull/6216/files> | ||
/// Note: Versions of this library prior to `53.0.0` returned 0 if the null count was |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Note: Versions of this library prior to `53.0.0` returned 0 if the null count was | |
/// Note: Versions of this library prior to `54.0.0` returned 0 if the null count was |
Co-authored-by: Ed Seidl <[email protected]>
After more thought, I would like to hold off merging this into arrow 54.0.0 because:
Maybe it is just that I am burned out on the fallout from the DataFusion 43 release (where there were a bunch of regressions and other challenges) but I don't want to introduce some other potentially subtle issue for a while longer (I agree this may just me being weak) |
Fair enough...I'll check back in February or March 😉 |
Which issue does this PR close?
Closes #6256
Note this PR contains some of the code that was originally in @Michael-J-Ward 's PR #6256
Rationale for this change
See #6256.
Current behavior:
None
to thrift when the null count is zeroNone
) asSome(0)
(aka that it is known there are no nulls)This is inconsistent with the parquet spec as well as what parquet-java and parquet-cpp do
What changes are included in this PR?
i64
None
.Are there any user-facing changes?
Yes
Changes
Some(..)
to thriftNone
(aka that it is unknown if there are nulls) if there are no null counts in the thriftNone
if the null count / distinct count is too large to fit ini64
StatisticsConverter
code to read statistics consistently with older versions of parquet-rs (treat missing null counts as known zero) and added a flag to alter the behaviorThis change means the generated parquet files are slightly larger (as now they encode
Some(0)
for null counts) but the behavior is more correct and consistent.