Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify num-nulls handling in Statistics and ColumnIndex #449

Merged
merged 9 commits into from
Aug 23, 2024
20 changes: 18 additions & 2 deletions src/main/thrift/parquet.thrift
Original file line number Diff line number Diff line change
Expand Up @@ -257,7 +257,14 @@ struct Statistics {
*/
1: optional binary max;
2: optional binary min;
/** count of null value in the column */
/**
* Count of null values in the column.
*
* Writers SHOULD always write this field even if it is zero (i.e. no null value)
* or the column is not nullable.
Comment on lines +263 to +264
Copy link
Contributor

@JFinis JFinis Aug 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should writers write this field if the column isn't nullable? That feels strictly redundant and therefore a waste of memory. If a reader assumes that their could be nulls even though the type is required in the schema, then this reader has a bug. There wouldn't even be a way to write nulls into a required column, as they cannot be represented in this case.

(Also, if you disregard this comment, at least change the wording from not nullable to not required, as that's the correct term w.rt. to the schema (there is no nullable field in the schema, just a repetition type).)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should writers write this field if the column isn't nullable?

I agree it's not costless in this section, however most impl uses this. It's not a standard but it's widely used.

(Also, if you disregard this comment, at least change the wording from not nullable to not required, as that's the correct term w.rt. to the schema (there is no nullable field in the schema, just a repetition type).)

You're right, I'll rechecking this

* Readers MUST distinguish between null_count not being present and null_count == 0.
* If null_count is not present, readers MUST NOT assume null_count == 0.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this statement (even though there is an issue with older parquet files written with parquet-rs)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What did parquet-rs do? Not write the null count in case it was null? If so, then readers will just not optimize out null handling in this case, which seems fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JFinis apache/arrow-rs#6257 would support write 0 here
Before this nonthing is writen.

*/
3: optional i64 null_count;
/** count of distinct values occurring */
4: optional i64 distinct_count;
Expand Down Expand Up @@ -1084,7 +1091,16 @@ struct ColumnIndex {
*/
4: required BoundaryOrder boundary_order

/** A list containing the number of null values for each page **/
/**
* A list containing the number of null values for each page
*
* Writers SHOULD always write this field even if no null values
* are present or the column is not nullable.
* Readers MUST distinguish between null_counts not being present
* and null_count being 0.
Comment on lines +1099 to +1100
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null_count => null_counts

* If null_counts are not present, readers MUST NOT assume all
* null counts are 0.
*/
5: optional list<i64> null_counts

/**
Expand Down