-
Notifications
You must be signed in to change notification settings - Fork 434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify num-nulls handling in Statistics and ColumnIndex #449
Changes from 6 commits
dbfe60f
0ebe97f
d3f210f
2e0df3a
fe815ea
1508aa2
0bf50c5
4dc4502
aee6e1e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -257,7 +257,14 @@ struct Statistics { | |
*/ | ||
1: optional binary max; | ||
2: optional binary min; | ||
/** count of null value in the column */ | ||
/** | ||
* Count of null values in the column. | ||
* | ||
* Writers SHOULD always write this field even if it is zero (i.e. no null value) | ||
* or the column is not nullable. | ||
* Readers MUST distinguish between null_count not being present and null_count == 0. | ||
* If null_count is not present, readers MUST NOT assume null_count == 0. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with this statement (even though there is an issue with older parquet files written with parquet-rs) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What did parquet-rs do? Not write the null count in case it was null? If so, then readers will just not optimize out null handling in this case, which seems fine. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @JFinis apache/arrow-rs#6257 would support write 0 here |
||
*/ | ||
3: optional i64 null_count; | ||
/** count of distinct values occurring */ | ||
4: optional i64 distinct_count; | ||
|
@@ -1084,7 +1091,16 @@ struct ColumnIndex { | |
*/ | ||
4: required BoundaryOrder boundary_order | ||
|
||
/** A list containing the number of null values for each page **/ | ||
/** | ||
* A list containing the number of null values for each page | ||
* | ||
* Writers SHOULD always write this field even if no null value | ||
* or the column is not nullable. | ||
mapleFU marked this conversation as resolved.
Show resolved
Hide resolved
|
||
* Readers MUST distinguish between null_counts not being present | ||
* and null_count is 0. | ||
mapleFU marked this conversation as resolved.
Show resolved
Hide resolved
|
||
* If null_counts is not present, readers MUST NOT assume all | ||
mapleFU marked this conversation as resolved.
Show resolved
Hide resolved
|
||
* null_count is 0. | ||
mapleFU marked this conversation as resolved.
Show resolved
Hide resolved
|
||
*/ | ||
5: optional list<i64> null_counts | ||
|
||
/** | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should writers write this field if the column isn't nullable? That feels strictly redundant and therefore a waste of memory. If a reader assumes that their could be nulls even though the type is required in the schema, then this reader has a bug. There wouldn't even be a way to write nulls into a required column, as they cannot be represented in this case.
(Also, if you disregard this comment, at least change the wording from
not nullable
tonot required
, as that's the correct term w.rt. to the schema (there is nonullable
field in the schema, just a repetition type).)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it's not costless in this section, however most impl uses this. It's not a standard but it's widely used.
You're right, I'll rechecking this