Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify num-nulls handling in Statistics and ColumnIndex #449

Merged
merged 9 commits into from
Aug 23, 2024

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Aug 16, 2024

Rationale for this change

Clarify num-nulls handling. I've mentioned this in maillist: https://lists.apache.org/thread/oqd9lt7p0jtf217hg0667xhk0zvwbgvt

What changes are included in this PR?

Suggest writer to have num_nulls/null_count

Do these changes have PoC implementations?

no

@mapleFU mapleFU marked this pull request as ready for review August 16, 2024 17:16
/**
* count of null value in the column
*
* Writers should write this field even if it is zero or in non-null columns.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Writers should write this field even if it is zero or in non-null columns.
* Writers SHOULD always write this field even if it is zero (a.k.a. no null value)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might also add the expectation to the reader implementations?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might also add the expectation to the reader implementations?

I'd like to wait other replying the context in ML firstly

@mapleFU
Copy link
Member Author

mapleFU commented Aug 21, 2024

@pitrou @gszadovszky @emkornfield @alamb @etseidl @shangxinli Would you mind take a look?

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mapleFU, I think this is an important clarification. Will you also add this language to ColumnIndex?

src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
@mapleFU
Copy link
Member Author

mapleFU commented Aug 21, 2024

Will you also add this language to ColumnIndex?

Sure, wait 1h

@etseidl
Copy link
Contributor

etseidl commented Aug 21, 2024

Will you also add this language to ColumnIndex?

Sure, wait 1h

I'll be asleep 🤣

@mapleFU
Copy link
Member Author

mapleFU commented Aug 21, 2024

Aha, I may need sometime to check the parquet-java, parquet-rs ColumnIndex impl, so maybe would take some time, it would be finished before you wake up tomorrow. Nice dream!

@gszadovszky
Copy link
Contributor

@mapleFU, it looks good to me and is a proper change for the actual case.

However, the statement If null_count is not present, readers SHOULD NOT assume null_count == 0. should be a general one for each optional values in this file. We've had a couple of issues not checking whether a field is actually set and using its default value.
Maybe SHOULD would be better being a MUST.

@mapleFU mapleFU changed the title Clarify num-nulls handling Clarify num-nulls handling in Statistics and ColumnIndex Aug 21, 2024
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just two wording suggestions

src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @mapleFU -- this makes sense to me

* Writers SHOULD always write this field even if it is zero (i.e. no null value)
* or the column is not nullable.
* Readers MUST distinguish between null_count not being present and null_count == 0.
* If null_count is not present, readers MUST NOT assume null_count == 0.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this statement (even though there is an issue with older parquet files written with parquet-rs)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What did parquet-rs do? Not write the null count in case it was null? If so, then readers will just not optimize out null handling in this case, which seems fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JFinis apache/arrow-rs#6257 would support write 0 here
Before this nonthing is writen.

src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last nit 😄. Thanks again!

src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
@mapleFU
Copy link
Member Author

mapleFU commented Aug 23, 2024

@gszadovszky Would you mind take a look?

Besides, I didn't get any response in the maillist. What should I do before checking in this pr?

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mapleFU, I think this PR does not change the way an implementation should handle these values. It only articulates the correct handling.
Also, the PR has been approved by a quite broad audience already. You may safely push it.

@mapleFU mapleFU merged commit db68787 into apache:master Aug 23, 2024
3 checks passed
@mapleFU
Copy link
Member Author

mapleFU commented Aug 23, 2024

Merge now, thanks all!

Comment on lines +263 to +264
* Writers SHOULD always write this field even if it is zero (i.e. no null value)
* or the column is not nullable.
Copy link
Contributor

@JFinis JFinis Aug 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should writers write this field if the column isn't nullable? That feels strictly redundant and therefore a waste of memory. If a reader assumes that their could be nulls even though the type is required in the schema, then this reader has a bug. There wouldn't even be a way to write nulls into a required column, as they cannot be represented in this case.

(Also, if you disregard this comment, at least change the wording from not nullable to not required, as that's the correct term w.rt. to the schema (there is no nullable field in the schema, just a repetition type).)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should writers write this field if the column isn't nullable?

I agree it's not costless in this section, however most impl uses this. It's not a standard but it's widely used.

(Also, if you disregard this comment, at least change the wording from not nullable to not required, as that's the correct term w.rt. to the schema (there is no nullable field in the schema, just a repetition type).)

You're right, I'll rechecking this

Comment on lines +1099 to +1100
* Readers MUST distinguish between null_counts not being present
* and null_count being 0.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null_count => null_counts

@mapleFU
Copy link
Member Author

mapleFU commented Aug 25, 2024

@JFinis I totally agree the non-nullable ( including def-level in parquet) type can implicit deducing that not-null, however I found most implemention consumes that and produce that, so I sound it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants