Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2435: Clarify behavior of DELTA_BINARY_PACKED encoding #231

Merged
merged 4 commits into from
Feb 28, 2024

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Feb 20, 2024

Provide some guidance around the issue of how many bits may be used when encoding DELTA_BINARY_PACKED data.

Jira

  • My PR addresses the following Parquet Jira issues and references them in the PR title.

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

Explain when it is best to use DELTA_BINARY_PACKED encoding, and
address the issue of using more bits in the encoding than are used in
the underlying type being encoded.
@etseidl etseidl force-pushed the clarify_delta_encodings branch from 161444d to 3d1b1f7 Compare February 20, 2024 07:10
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Encodings.md Outdated
@@ -179,6 +179,12 @@ This encoding is adapted from the Binary packing described in
["Decoding billions of integers per second through vectorization"](http://arxiv.org/pdf/1209.2137v5.pdf)
by D. Lemire and L. Boytsov.

Delta encoding is best when used on sorted data, or data with runs of repeated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to move it to the Characteristics section below?

IMHO, we should not provide detail guidance like this because different users may have various opinions on this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this document should remain descriptive.

If we want to provide guidance and suggestions as to algorithm choice, we should create a separate section for that (in this document, or another one).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure about including this section, but did so as a result of the discussion of apache/arrow#37940. Rather than guidance, how about a sentence added to Characteristics to the effect that encoding random integer data will result in no space savings over PLAIN encoding and will incur increased metadata overhead. Just the facts, no opinion 😉

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same thing could be said about most encodings and compression algorithms, so is that actually useful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. I'll just remove this.

Encodings.md Outdated
@@ -247,6 +253,15 @@ and handled as wrapping around in 2's complement notation so that the original
values are correctly restituted. This may require explicit care in some programming
languages (for example by doing all arithmetic in the unsigned domain).

One strategy that might be employed to avoid the above mentioned overflow is to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
One strategy that might be employed to avoid the above mentioned overflow is to
One strategy that might be employed to reproduce the above mentioned overflow is to

Encodings.md Outdated
@@ -247,6 +253,15 @@ and handled as wrapping around in 2's complement notation so that the original
values are correctly restituted. This may require explicit care in some programming
languages (for example by doing all arithmetic in the unsigned domain).

One strategy that might be employed to avoid the above mentioned overflow is to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
One strategy that might be employed to avoid the above mentioned overflow is to
One strategy that might be employed to reproduce the above mentioned overflow is to

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General LGTM

Encodings.md Outdated
@@ -179,6 +179,12 @@ This encoding is adapted from the Binary packing described in
["Decoding billions of integers per second through vectorization"](http://arxiv.org/pdf/1209.2137v5.pdf)
by D. Lemire and L. Boytsov.

Delta encoding is best when used on sorted data, or data with runs of repeated
values. It can also be useful when the range of values is small, such as would
be the case with INT_8 data. It should *not* be used when the range of the data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change should not to should better not? Since though not recommended, user can also do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree this wording is a little too strong. I'll change it if this section survives 😄

Encodings.md Outdated
values. It can also be useful when the range of values is small, such as would
be the case with INT_8 data. It should *not* be used when the range of the data
would necessitate the use of large bitwidths, as could be the case with random
INT32 values.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious that int32 and int8 spelling is different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took INT_8 from the logical type name, and INT32 from the physical type name. That distinction could be made clearer.

Encodings.md Outdated
@@ -247,6 +253,15 @@ and handled as wrapping around in 2's complement notation so that the original
values are correctly restituted. This may require explicit care in some programming
languages (for example by doing all arithmetic in the unsigned domain).

One strategy that might be employed to avoid the above mentioned overflow is to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a rather wordy addition. I think the spec should remain concise. If we want to elaborate on this, we should move the discussion of signedness and bit width to a dedicated "Pitfalls" subsection, IMHO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I do that 😅. This can be reduced to a single sentence if we decide to mandate the use of no more bits than the physical type.

Encodings.md Outdated
while encoding INT32 data one might choose to perform arithmetic operations using
64-bit integers. This can lead to situtations where the number of bits used to encode
the resulting deltas is greater than the number of bits used to represent the input
values. While this behavior is allowed, data produced in this manner may not be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that this behavior is (or should be) allowed. The spec should IMHO prescribe that INT32 is encoded at most using 32-bit deltas, and INT64 using 64-bit deltas. Emitting deltas larger than the physical bitwidth should be considered a bug in the encoder.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually maybe some legacy encoder generate these kind of data?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, but I think we should still consider it a bug :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually agree with @pitrou, but after feedback on the mailing list (and watching other similar proposals), I thought the squishy middle ground of "writers should not do this, but readers should accept it" would be more palatable. I'm totally fine with simply adding a sentence to the end of the preceding paragraph. That adds the clarity I want as the developer of an implementation (and is much less wordy 😉).

@etseidl
Copy link
Contributor Author

etseidl commented Feb 26, 2024

Thank you for the comments @wgtmac @pitrou @mapleFU. I've added the prohibition language. If there is consensus on forbidding the use of extra bits, then I can remove the long paragraph.

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also cc @tustvold

@tustvold
Copy link
Contributor

tustvold commented Feb 27, 2024

I am likely missing some context here, but I would agree with @pitrou that an encoder producing data with more bits than the physical type is a bug in the encoder, and not to mention sub-optimal

@pitrou
Copy link
Member

pitrou commented Feb 27, 2024

The latest proposed changes look fine to me. I'll let others chime in before potentially merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants