-
Notifications
You must be signed in to change notification settings - Fork 435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2485: Be more consistent with BYTE_ARRAY types #251
Changes from 1 commit
969d02c
b6ccbd5
d0ee828
55a7d5c
6df2b55
488ce4d
6113446
d0e8e00
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -23,7 +23,8 @@ Parquet Logical Type Definitions | |||||||||||||||
Logical types are used to extend the types that parquet can be used to store, | ||||||||||||||||
by specifying how the primitive types should be interpreted. This keeps the set | ||||||||||||||||
of primitive types to a minimum and reuses parquet's efficient encodings. For | ||||||||||||||||
example, strings are stored as byte arrays (binary) with a UTF8 annotation. | ||||||||||||||||
example, strings are stored with the primitive type BYTE_ARRAY with a UTF8 | ||||||||||||||||
annotation. | ||||||||||||||||
|
||||||||||||||||
This file contains the specification for all logical types. | ||||||||||||||||
|
||||||||||||||||
|
@@ -59,7 +60,7 @@ Compatibility considerations are mentioned for each annotation in the correspond | |||||||||||||||
|
||||||||||||||||
### STRING | ||||||||||||||||
|
||||||||||||||||
`STRING` may only be used to annotate the binary primitive type and indicates | ||||||||||||||||
`STRING` may only be used to annotate the BYTE_ARRAY primitive type and indicates | ||||||||||||||||
that the byte array should be interpreted as a UTF-8 encoded character string. | ||||||||||||||||
|
||||||||||||||||
The sort order used for `STRING` strings is unsigned byte-wise comparison. | ||||||||||||||||
|
@@ -70,7 +71,7 @@ The sort order used for `STRING` strings is unsigned byte-wise comparison. | |||||||||||||||
|
||||||||||||||||
### ENUM | ||||||||||||||||
|
||||||||||||||||
`ENUM` annotates the binary primitive type and indicates that the value | ||||||||||||||||
`ENUM` annotates the BYTE_ARRAY primitive type and indicates that the value | ||||||||||||||||
was converted from an enumerated type in another data model (e.g. Thrift, Avro, Protobuf). | ||||||||||||||||
Applications using a data model lacking a native enum type should interpret `ENUM` | ||||||||||||||||
annotated field as a UTF-8 encoded string. | ||||||||||||||||
|
@@ -79,9 +80,9 @@ The sort order used for `ENUM` values is unsigned byte-wise comparison. | |||||||||||||||
|
||||||||||||||||
### UUID | ||||||||||||||||
|
||||||||||||||||
`UUID` annotates a 16-byte fixed-length binary. The value is encoded using | ||||||||||||||||
big-endian, so that `00112233-4455-6677-8899-aabbccddeeff` is encoded as the | ||||||||||||||||
bytes `00 11 22 33 44 55 66 77 88 99 aa bb cc dd ee ff` | ||||||||||||||||
`UUID` annotates a 16-byte FIXED_LEN_BYTE_ARRAY primitive type. The value is | ||||||||||||||||
encoded using big-endian, so that `00112233-4455-6677-8899-aabbccddeeff` is encoded | ||||||||||||||||
as the bytes `00 11 22 33 44 55 66 77 88 99 aa bb cc dd ee ff` | ||||||||||||||||
(This example is from [wikipedia's UUID page][wiki-uuid]). | ||||||||||||||||
|
||||||||||||||||
The sort order used for `UUID` values is unsigned byte-wise comparison. | ||||||||||||||||
|
@@ -211,8 +212,8 @@ unsigned integers with 8, 16, 32, or 64 bit width. | |||||||||||||||
`DECIMAL` annotation represents arbitrary-precision signed decimal numbers of | ||||||||||||||||
the form `unscaledValue * 10^(-scale)`. | ||||||||||||||||
|
||||||||||||||||
The primitive type stores an unscaled integer value. For byte arrays, binary | ||||||||||||||||
and fixed, the unscaled number must be encoded as two's complement using | ||||||||||||||||
The primitive type stores an unscaled integer value. For byte arrays, variable | ||||||||||||||||
and fixed-length, the unscaled number must be encoded as two's complement using | ||||||||||||||||
etseidl marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||
big-endian byte order (the most significant byte is the zeroth element). The | ||||||||||||||||
scale stores the number of digits of that value that are to the right of the | ||||||||||||||||
decimal point, and the precision stores the maximum number of digits supported | ||||||||||||||||
|
@@ -228,7 +229,7 @@ integer. A precision too large for the underlying type (see below) is an error. | |||||||||||||||
warning | ||||||||||||||||
* `fixed_len_byte_array`: precision is limited by the array size. Length `n` | ||||||||||||||||
can store <= `floor(log_10(2^(8*n - 1) - 1))` base-10 digits | ||||||||||||||||
* `binary`: `precision` is not limited, but is required. The minimum number of | ||||||||||||||||
* `byte_array`: `precision` is not limited, but is required. The minimum number of | ||||||||||||||||
Comment on lines
230
to
+232
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Perhaps we can change this to name the types using the same names as elswhere:
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, the formatting here is odd...all of the type names are lower case. Also, the use of back ticks for type names is inconsistent. Perhaps (as you suggest) for this PR we can solely worry about binary->byte_array, and then do a second pass to fix capitalization and quoting. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looking at this more, it seems I'm introducing more inconsistency with the quoting. I'll try to clean that up. |
||||||||||||||||
bytes to store the unscaled value should be used. | ||||||||||||||||
|
||||||||||||||||
The sort order used for `DECIMAL` values is signed comparison of the represented | ||||||||||||||||
|
@@ -251,7 +252,7 @@ The `FLOAT16` annotation represents half-precision floating-point numbers in the | |||||||||||||||
|
||||||||||||||||
Used in contexts where precision is traded off for smaller footprint and potentially better performance. | ||||||||||||||||
|
||||||||||||||||
The primitive type is a 2-byte fixed length binary. | ||||||||||||||||
The primitive type is a 2-byte FIXED_LEN_BYTE_ARRAY. | ||||||||||||||||
|
||||||||||||||||
The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`. | ||||||||||||||||
|
||||||||||||||||
|
@@ -544,8 +545,8 @@ Embedded types do not have type-specific orderings. | |||||||||||||||
|
||||||||||||||||
### JSON | ||||||||||||||||
|
||||||||||||||||
`JSON` is used for an embedded JSON document. It must annotate a `binary` | ||||||||||||||||
primitive type. The `binary` data is interpreted as a UTF-8 encoded character | ||||||||||||||||
`JSON` is used for an embedded JSON document. It must annotate a BYTE_ARRAY | ||||||||||||||||
primitive type. The byte array data is interpreted as a UTF-8 encoded character | ||||||||||||||||
etseidl marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||
string of valid JSON as defined by the [JSON specification][json-spec] | ||||||||||||||||
|
||||||||||||||||
[json-spec]: http://json.org/ | ||||||||||||||||
|
@@ -554,8 +555,8 @@ The sort order used for `JSON` is unsigned byte-wise comparison. | |||||||||||||||
|
||||||||||||||||
### BSON | ||||||||||||||||
|
||||||||||||||||
`BSON` is used for an embedded BSON document. It must annotate a `binary` | ||||||||||||||||
primitive type. The `binary` data is interpreted as an encoded BSON document as | ||||||||||||||||
`BSON` is used for an embedded BSON document. It must annotate a BYTE_ARRAY | ||||||||||||||||
primitive type. The byte array data is interpreted as an encoded BSON document as | ||||||||||||||||
etseidl marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||
defined by the [BSON specification][bson-spec]. | ||||||||||||||||
|
||||||||||||||||
[bson-spec]: http://bsonspec.org/spec.html | ||||||||||||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the spec is unclear about whether or not
STRING
andENUM
can annotateFIXED_LENGTH_BYTE_ARRAY
. Literally it is reasonable to annotateFIXED_LENGTH_BYTE_ARRAY
, right? I'm not sure if there is any use case in the wild.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know. For
DECIMAL
the spec calls out "byte arrays, binary and fixed" as valid physical types. I'd take the lack of mention of fixed-length here to indicated that only theBYTE_ARRAY
physical type is allowed. Do any current implementations allow fixed-length strings?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the existing wording is pretty clear of we take "binary primitive type" to mean BYTE_ARRAY and thus in my opinion this is not a change to the spec.
Perhaps we could send a note to [email protected] just highlighting this clarification in case someone wants to chime in and say they read it to mean FIXED_LENGTH_BYTE_ARRAY was also supported
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sense I'm getting from the M/L is that people are open to adding it, but haven't seen any evidence that any one actually supports adding
STRING
toFLBA
. IMO we should ship this as is, and start a new thread to gauge support for actually modifying the spec to allow for fixed length strings. Given the variable width of UTF-8 characters, I'd think padding would have to be added to account for up to 4 bytes per character.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this assesment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we'd better keep it as is for now. Let me merge this. We can always change the content when there is a consensus.