PARQUET-2435: Clarify behavior of DELTA_BINARY_PACKED encoding #231

etseidl · 2024-02-20T02:31:58Z

Provide some guidance around the issue of how many bits may be used when encoding DELTA_BINARY_PACKED data.

Jira

My PR addresses the following Parquet Jira issues and references them in the PR title.

Commits

My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and classes in the PR contain Javadoc that explains what they do

Explain when it is best to use DELTA_BINARY_PACKED encoding, and address the issue of using more bits in the encoding than are used in the underlying type being encoded.

wgtmac

cc @pitrou @mapleFU

wgtmac · 2024-02-26T08:35:19Z

Encodings.md

@@ -179,6 +179,12 @@ This encoding is adapted from the Binary packing described in
 ["Decoding billions of integers per second through vectorization"](http://arxiv.org/pdf/1209.2137v5.pdf)
 by D. Lemire and L. Boytsov.

+Delta encoding is best when used on sorted data, or data with runs of repeated


Is it better to move it to the Characteristics section below?

IMHO, we should not provide detailed guidance like this because different users may have different opinions on it.

I agree that this document should remain descriptive.

If we want to provide guidance and suggestions as to algorithm choice, we should create a separate section for that (in this document, or another one).

I wasn't sure about including this section, but did so as a result of the discussion of apache/arrow#37940. Rather than guidance, how about a sentence added to Characteristics to the effect that encoding random integer data will result in no space savings over PLAIN encoding and will incur increased metadata overhead. Just the facts, no opinion 😉

The same thing could be said about most encodings and compression algorithms, so is that actually useful?

Fair point. I'll just remove this.
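
To make the fact under discussion concrete, here is a minimal sketch (not part of the PR or of any Parquet implementation) that estimates the packed bit width DELTA_BINARY_PACKED would need for slowly increasing data versus random INT32 data. It assumes the whole input forms a single miniblock and ignores header overhead; the helper names `to_signed32` and `miniblock_bit_width` are hypothetical.

```python
import random
from itertools import accumulate

MASK32 = 0xFFFFFFFF

def to_signed32(x: int) -> int:
    """Interpret the low 32 bits of x as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def miniblock_bit_width(values):
    """Bits per packed delta, treating all of `values` as one miniblock (assumption)."""
    deltas = [to_signed32(b - a) for a, b in zip(values, values[1:])]
    min_delta = min(deltas)
    # Subtract the frame of reference; wrap so the result stays within 32 bits.
    adjusted = [(d - min_delta) & MASK32 for d in deltas]
    return max(adjusted).bit_length()

# Sorted data with small, slightly irregular steps (1..5).
steps = [1 + (i % 5) for i in range(127)]
sorted_values = list(accumulate(steps, initial=1000))

# Random data spanning the full INT32 range (fixed seed for repeatability).
rng = random.Random(0)
random_values = [rng.randrange(-2**31, 2**31) for _ in range(128)]

sorted_bits = miniblock_bit_width(sorted_values)
random_bits = miniblock_bit_width(random_values)
print("sorted:", sorted_bits, "bits per delta")
print("random:", random_bits, "bits per delta (PLAIN INT32 uses 32)")
```

Under these assumptions the sorted input packs into a few bits per value, while the random input needs roughly as many bits as PLAIN encoding before any header overhead is counted.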

wgtmac · 2024-02-26T08:36:50Z

Encodings.md

@@ -247,6 +253,15 @@ and handled as wrapping around in 2's complement notation so that the original
 values are correctly restituted. This may require explicit care in some programming
 languages (for example by doing all arithmetic in the unsigned domain).

+One strategy that might be employed to avoid the above mentioned overflow is to


Suggested change

One strategy that might be employed to avoid the above mentioned overflow is to

One strategy that might be employed to reproduce the above mentioned overflow is to

mapleFU

General LGTM

mapleFU · 2024-02-26T10:29:21Z

Encodings.md

@@ -179,6 +179,12 @@ This encoding is adapted from the Binary packing described in
 ["Decoding billions of integers per second through vectorization"](http://arxiv.org/pdf/1209.2137v5.pdf)
 by D. Lemire and L. Boytsov.

+Delta encoding is best when used on sorted data, or data with runs of repeated
+values. It can also be useful when the range of values is small, such as would
+be the case with INT_8 data. It should *not* be used when the range of the data


Could we change "should not" to "should preferably not"? Although it is not recommended, users can still do that.

Yes, I agree this wording is a little too strong. I'll change it if this section survives 😄

mapleFU · 2024-02-26T10:30:03Z

Encodings.md

+values. It can also be useful when the range of values is small, such as would
+be the case with INT_8 data. It should *not* be used when the range of the data
+would necessitate the use of large bitwidths, as could be the case with random
+INT32 values.


Just curious: why are INT32 and INT_8 spelled differently?

I took INT_8 from the logical type name, and INT32 from the physical type name. That distinction could be made clearer.

pitrou · 2024-02-26T11:01:58Z

Encodings.md

@@ -247,6 +253,15 @@ and handled as wrapping around in 2's complement notation so that the original
 values are correctly restituted. This may require explicit care in some programming
 languages (for example by doing all arithmetic in the unsigned domain).

+One strategy that might be employed to avoid the above mentioned overflow is to


This is a rather wordy addition. I think the spec should remain concise. If we want to elaborate on this, we should move the discussion of signedness and bit width to a dedicated "Pitfalls" subsection, IMHO.

Sorry, I do that 😅. This can be reduced to a single sentence if we decide to mandate the use of no more bits than the physical type.

pitrou · 2024-02-26T11:04:37Z

Encodings.md

+while encoding INT32 data one might choose to perform arithmetic operations using
+64-bit integers. This can lead to situations where the number of bits used to encode
+the resulting deltas is greater than the number of bits used to represent the input
+values. While this behavior is allowed, data produced in this manner may not be


I don't think that this behavior is (or should be) allowed. The spec should IMHO prescribe that INT32 is encoded at most using 32-bit deltas, and INT64 using 64-bit deltas. Emitting deltas larger than the physical bitwidth should be considered a bug in the encoder.

Actually, maybe some legacy encoders generate this kind of data?

Perhaps, but I think we should still consider it a bug :-)

I actually agree with @pitrou, but after feedback on the mailing list (and watching other similar proposals), I thought the squishy middle ground of "writers should not do this, but readers should accept it" would be more palatable. I'm totally fine with simply adding a sentence to the end of the preceding paragraph. That adds the clarity I want as the developer of an implementation (and is much less wordy 😉).
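
As an illustration of the bit-width concern in this thread, the following is a minimal sketch (not taken from the PR or from any Parquet implementation) comparing deltas computed in a wider integer type with deltas wrapped to 32 bits in two's complement. It assumes a single block with one min delta; the helper names `to_signed32`, `encode_block`, and `decode32` are hypothetical.

```python
MASK32 = 0xFFFFFFFF
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1

def to_signed32(x: int) -> int:
    """Interpret the low 32 bits of x as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def encode_block(deltas, wrap_bits=None):
    """Return (min_delta, adjusted deltas); optionally wrap to `wrap_bits` bits."""
    min_delta = min(deltas)
    adjusted = [d - min_delta for d in deltas]
    if wrap_bits is not None:
        mask = (1 << wrap_bits) - 1
        adjusted = [a & mask for a in adjusted]
    return min_delta, adjusted

def decode32(first, min_delta, adjusted):
    """Rebuild INT32 values, wrapping every addition to 32 bits."""
    out = [first]
    for a in adjusted:
        out.append(to_signed32(out[-1] + min_delta + a))
    return out

values = [INT32_MIN, INT32_MAX, INT32_MIN]

# Deltas as they would come out of 64-bit arithmetic (no wrapping).
wide_deltas = [b - a for a, b in zip(values, values[1:])]
_, wide_adjusted = encode_block(wide_deltas)
wide_bits = max(wide_adjusted).bit_length()

# Deltas wrapped to the 32-bit physical type.
wrapped_deltas = [to_signed32(b - a) for a, b in zip(values, values[1:])]
min_delta, adjusted = encode_block(wrapped_deltas, wrap_bits=32)
narrow_bits = max(adjusted).bit_length()

print("unwrapped bit width:", wide_bits)
print("wrapped bit width:  ", narrow_bits)
print("round trip ok:", decode32(values[0], min_delta, adjusted) == values)
```

For this input the unwrapped variant needs 33 bits per delta, more than the 32-bit physical type, while the wrapped variant needs 2 bits and still decodes to the original values.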

etseidl · 2024-02-26T18:30:08Z

Thank you for the comments @wgtmac @pitrou @mapleFU. I've added the prohibition language. If there is consensus on forbidding the use of extra bits, then I can remove the long paragraph.

mapleFU

Also cc @tustvold

tustvold · 2024-02-27T02:52:37Z

I am likely missing some context here, but I would agree with @pitrou that an encoder producing data with more bits than the physical type is a bug in the encoder, not to mention sub-optimal.

pitrou · 2024-02-27T13:44:34Z

The latest proposed changes look fine to me. I'll let others chime in before potentially merging.

PARQUET-2435: Clarify behavior of DELTA_BINARY_PACKED encoding

3d1b1f7

Explain when it is best to use DELTA_BINARY_PACKED encoding, and address the issue of using more bits in the encoding than are used in the underlying type being encoded.

etseidl force-pushed the clarify_delta_encodings branch from 161444d to 3d1b1f7 Compare February 20, 2024 07:10

wgtmac reviewed Feb 26, 2024

mapleFU reviewed Feb 26, 2024

pitrou reviewed Feb 26, 2024

etseidl added 2 commits February 26, 2024 10:07

remove suggested use cases

5c2932f

forbid using too many bits

9b9a1ca

mapleFU approved these changes Feb 27, 2024

remove the wordy explanation

ee04946

pitrou approved these changes Feb 27, 2024

pitrou merged commit f65d4e1 into apache:master Feb 28, 2024
3 checks passed

etseidl mentioned this pull request Jun 20, 2024

[Parquet] DELTA_BINARY_PACKED constraint on num_bits is too restrict? apache/arrow#20374

Closed

asfimport mentioned this pull request Jun 23, 2024

Clarify behavior of DELTA_BINARY_PACKED encoders/decoders #426

Closed
