-
Notifications
You must be signed in to change notification settings - Fork 434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2435: Clarify behavior of DELTA_BINARY_PACKED encoding #231
Conversation
Explain when it is best to use DELTA_BINARY_PACKED encoding, and address the issue of using more bits in the encoding than are used in the underlying type being encoded.
161444d
to
3d1b1f7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Encodings.md
Outdated
@@ -179,6 +179,12 @@ This encoding is adapted from the Binary packing described in | |||
["Decoding billions of integers per second through vectorization"](http://arxiv.org/pdf/1209.2137v5.pdf) | |||
by D. Lemire and L. Boytsov. | |||
|
|||
Delta encoding is best when used on sorted data, or data with runs of repeated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it better to move it to the Characteristics
section below?
IMHO, we should not provide detail guidance like this because different users may have various opinions on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this document should remain descriptive.
If we want to provide guidance and suggestions as to algorithm choice, we should create a separate section for that (in this document, or another one).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure about including this section, but did so as a result of the discussion of apache/arrow#37940. Rather than guidance, how about a sentence added to Characteristics
to the effect that encoding random integer data will result in no space savings over PLAIN
encoding and will incur increased metadata overhead. Just the facts, no opinion 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same thing could be said about most encodings and compression algorithms, so is that actually useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point. I'll just remove this.
Encodings.md
Outdated
@@ -247,6 +253,15 @@ and handled as wrapping around in 2's complement notation so that the original | |||
values are correctly restituted. This may require explicit care in some programming | |||
languages (for example by doing all arithmetic in the unsigned domain). | |||
|
|||
One strategy that might be employed to avoid the above mentioned overflow is to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One strategy that might be employed to avoid the above mentioned overflow is to | |
One strategy that might be employed to reproduce the above mentioned overflow is to |
Encodings.md
Outdated
@@ -247,6 +253,15 @@ and handled as wrapping around in 2's complement notation so that the original | |||
values are correctly restituted. This may require explicit care in some programming | |||
languages (for example by doing all arithmetic in the unsigned domain). | |||
|
|||
One strategy that might be employed to avoid the above mentioned overflow is to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One strategy that might be employed to avoid the above mentioned overflow is to | |
One strategy that might be employed to reproduce the above mentioned overflow is to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General LGTM
Encodings.md
Outdated
@@ -179,6 +179,12 @@ This encoding is adapted from the Binary packing described in | |||
["Decoding billions of integers per second through vectorization"](http://arxiv.org/pdf/1209.2137v5.pdf) | |||
by D. Lemire and L. Boytsov. | |||
|
|||
Delta encoding is best when used on sorted data, or data with runs of repeated | |||
values. It can also be useful when the range of values is small, such as would | |||
be the case with INT_8 data. It should *not* be used when the range of the data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change should not to should better not? Since though not recommended, user can also do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree this wording is a little too strong. I'll change it if this section survives 😄
Encodings.md
Outdated
values. It can also be useful when the range of values is small, such as would | ||
be the case with INT_8 data. It should *not* be used when the range of the data | ||
would necessitate the use of large bitwidths, as could be the case with random | ||
INT32 values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious that int32 and int8 spelling is different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took INT_8
from the logical type name, and INT32
from the physical type name. That distinction could be made clearer.
Encodings.md
Outdated
@@ -247,6 +253,15 @@ and handled as wrapping around in 2's complement notation so that the original | |||
values are correctly restituted. This may require explicit care in some programming | |||
languages (for example by doing all arithmetic in the unsigned domain). | |||
|
|||
One strategy that might be employed to avoid the above mentioned overflow is to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a rather wordy addition. I think the spec should remain concise. If we want to elaborate on this, we should move the discussion of signedness and bit width to a dedicated "Pitfalls" subsection, IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I do that 😅. This can be reduced to a single sentence if we decide to mandate the use of no more bits than the physical type.
Encodings.md
Outdated
while encoding INT32 data one might choose to perform arithmetic operations using | ||
64-bit integers. This can lead to situtations where the number of bits used to encode | ||
the resulting deltas is greater than the number of bits used to represent the input | ||
values. While this behavior is allowed, data produced in this manner may not be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that this behavior is (or should be) allowed. The spec should IMHO prescribe that INT32 is encoded at most using 32-bit deltas, and INT64 using 64-bit deltas. Emitting deltas larger than the physical bitwidth should be considered a bug in the encoder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually maybe some legacy encoder generate these kind of data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps, but I think we should still consider it a bug :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually agree with @pitrou, but after feedback on the mailing list (and watching other similar proposals), I thought the squishy middle ground of "writers should not do this, but readers should accept it" would be more palatable. I'm totally fine with simply adding a sentence to the end of the preceding paragraph. That adds the clarity I want as the developer of an implementation (and is much less wordy 😉).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also cc @tustvold
I am likely missing some context here, but I would agree with @pitrou that an encoder producing data with more bits than the physical type is a bug in the encoder, and not to mention sub-optimal |
The latest proposed changes look fine to me. I'll let others chime in before potentially merging. |
Provide some guidance around the issue of how many bits may be used when encoding
DELTA_BINARY_PACKED
data.Jira
Commits
Documentation