Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2485: Be more consistent with BYTE_ARRAY types #251

Merged
merged 8 commits into from
Jun 19, 2024
47 changes: 24 additions & 23 deletions LogicalTypes.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,8 @@ Parquet Logical Type Definitions
Logical types are used to extend the types that parquet can be used to store,
by specifying how the primitive types should be interpreted. This keeps the set
of primitive types to a minimum and reuses parquet's efficient encodings. For
example, strings are stored as byte arrays (binary) with a UTF8 annotation.
example, strings are stored with the primitive type BYTE_ARRAY with a STRING
annotation.

This file contains the specification for all logical types.

Expand Down Expand Up @@ -59,7 +60,7 @@ Compatibility considerations are mentioned for each annotation in the correspond

### STRING

`STRING` may only be used to annotate the binary primitive type and indicates
`STRING` may only be used to annotate the BYTE_ARRAY primitive type and indicates
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the spec is unclear about whether or not STRING and ENUM can annotate FIXED_LENGTH_BYTE_ARRAY. Literally it is reasonable to annotate FIXED_LENGTH_BYTE_ARRAY, right? I'm not sure if there is any use case in the wild.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. For DECIMAL the spec calls out "byte arrays, binary and fixed" as valid physical types. I'd take the lack of mention of fixed-length here to indicated that only the BYTE_ARRAY physical type is allowed. Do any current implementations allow fixed-length strings?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the existing wording is pretty clear of we take "binary primitive type" to mean BYTE_ARRAY and thus in my opinion this is not a change to the spec.

Perhaps we could send a note to [email protected] just highlighting this clarification in case someone wants to chime in and say they read it to mean FIXED_LENGTH_BYTE_ARRAY was also supported

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sense I'm getting from the M/L is that people are open to adding it, but haven't seen any evidence that any one actually supports adding STRING to FLBA. IMO we should ship this as is, and start a new thread to gauge support for actually modifying the spec to allow for fixed length strings. Given the variable width of UTF-8 characters, I'd think padding would have to be added to account for up to 4 bytes per character.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this assesment

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we'd better keep it as is for now. Let me merge this. We can always change the content when there is a consensus.

that the byte array should be interpreted as a UTF-8 encoded character string.

The sort order used for `STRING` strings is unsigned byte-wise comparison.
Expand All @@ -70,7 +71,7 @@ The sort order used for `STRING` strings is unsigned byte-wise comparison.

### ENUM

`ENUM` annotates the binary primitive type and indicates that the value
`ENUM` annotates the BYTE_ARRAY primitive type and indicates that the value
was converted from an enumerated type in another data model (e.g. Thrift, Avro, Protobuf).
Applications using a data model lacking a native enum type should interpret `ENUM`
annotated field as a UTF-8 encoded string.
Expand All @@ -79,9 +80,9 @@ The sort order used for `ENUM` values is unsigned byte-wise comparison.

### UUID

`UUID` annotates a 16-byte fixed-length binary. The value is encoded using
big-endian, so that `00112233-4455-6677-8899-aabbccddeeff` is encoded as the
bytes `00 11 22 33 44 55 66 77 88 99 aa bb cc dd ee ff`
`UUID` annotates a 16-byte FIXED_LEN_BYTE_ARRAY primitive type. The value is
encoded using big-endian, so that `00112233-4455-6677-8899-aabbccddeeff` is encoded
as the bytes `00 11 22 33 44 55 66 77 88 99 aa bb cc dd ee ff`
(This example is from [wikipedia's UUID page][wiki-uuid]).

The sort order used for `UUID` values is unsigned byte-wise comparison.
Expand Down Expand Up @@ -211,8 +212,8 @@ unsigned integers with 8, 16, 32, or 64 bit width.
`DECIMAL` annotation represents arbitrary-precision signed decimal numbers of
the form `unscaledValue * 10^(-scale)`.

The primitive type stores an unscaled integer value. For byte arrays, binary
and fixed, the unscaled number must be encoded as two's complement using
The primitive type stores an unscaled integer value. For byte arrays, variable
and fixed-length, the unscaled number must be encoded as two's complement using
etseidl marked this conversation as resolved.
Show resolved Hide resolved
big-endian byte order (the most significant byte is the zeroth element). The
scale stores the number of digits of that value that are to the right of the
decimal point, and the precision stores the maximum number of digits supported
Expand All @@ -228,7 +229,7 @@ integer. A precision too large for the underlying type (see below) is an error.
warning
* `fixed_len_byte_array`: precision is limited by the array size. Length `n`
can store <= `floor(log_10(2^(8*n - 1) - 1))` base-10 digits
* `binary`: `precision` is not limited, but is required. The minimum number of
* `byte_array`: `precision` is not limited, but is required. The minimum number of
Comment on lines 230 to +232
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can change this to name the types using the same names as elswhere:

Suggested change
* `fixed_len_byte_array`: precision is limited by the array size. Length `n`
can store <= `floor(log_10(2^(8*n - 1) - 1))` base-10 digits
* `binary`: `precision` is not limited, but is required. The minimum number of
* `byte_array`: `precision` is not limited, but is required. The minimum number of
* `FIXED_LEN_BYTE_ARRAY `: precision is limited by the array size. Length `n`
can store <= `floor(log_10(2^(8*n - 1) - 1))` base-10 digits
* `BYTE_ARRAY `: `precision` is not limited, but is required. The minimum number of

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the formatting here is odd...all of the type names are lower case. Also, the use of back ticks for type names is inconsistent. Perhaps (as you suggest) for this PR we can solely worry about binary->byte_array, and then do a second pass to fix capitalization and quoting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this more, it seems I'm introducing more inconsistency with the quoting. I'll try to clean that up.

bytes to store the unscaled value should be used.

The sort order used for `DECIMAL` values is signed comparison of the represented
Expand All @@ -251,7 +252,7 @@ The `FLOAT16` annotation represents half-precision floating-point numbers in the

Used in contexts where precision is traded off for smaller footprint and potentially better performance.

The primitive type is a 2-byte fixed length binary.
The primitive type is a 2-byte FIXED_LEN_BYTE_ARRAY.

The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`.

Expand Down Expand Up @@ -544,8 +545,8 @@ Embedded types do not have type-specific orderings.

### JSON

`JSON` is used for an embedded JSON document. It must annotate a `binary`
primitive type. The `binary` data is interpreted as a UTF-8 encoded character
`JSON` is used for an embedded JSON document. It must annotate a BYTE_ARRAY
primitive type. The byte array data is interpreted as a UTF-8 encoded character
etseidl marked this conversation as resolved.
Show resolved Hide resolved
string of valid JSON as defined by the [JSON specification][json-spec]

[json-spec]: http://json.org/
Expand All @@ -554,8 +555,8 @@ The sort order used for `JSON` is unsigned byte-wise comparison.

### BSON

`BSON` is used for an embedded BSON document. It must annotate a `binary`
primitive type. The `binary` data is interpreted as an encoded BSON document as
`BSON` is used for an embedded BSON document. It must annotate a BYTE_ARRAY
primitive type. The byte array data is interpreted as an encoded BSON document as
etseidl marked this conversation as resolved.
Show resolved Hide resolved
defined by the [BSON specification][bson-spec].

[bson-spec]: http://bsonspec.org/spec.html
Expand Down Expand Up @@ -604,14 +605,14 @@ The following examples demonstrate two of the possible lists of string values.
// List<String> (list non-null, elements nullable)
required group my_list (LIST) {
repeated group list {
optional binary element (UTF8);
optional binary element (STRING);
}
}

// List<String> (list nullable, elements non-null)
optional group my_list (LIST) {
repeated group list {
required binary element (UTF8);
required binary element (STRING);
}
}
```
Expand Down Expand Up @@ -642,7 +643,7 @@ even though the repeated group is named `element`.
```
optional group my_list (LIST) {
repeated group element {
required binary str (UTF8);
required binary str (STRING);
};
}
```
Expand Down Expand Up @@ -672,22 +673,22 @@ optional group my_list (LIST) {
// List<Tuple<String, Integer>> (nullable list, non-null elements)
optional group my_list (LIST) {
repeated group element {
required binary str (UTF8);
required binary str (STRING);
required int32 num;
};
}

// List<OneTuple<String>> (nullable list, non-null elements)
optional group my_list (LIST) {
repeated group array {
required binary str (UTF8);
required binary str (STRING);
};
}

// List<OneTuple<String>> (nullable list, non-null elements)
optional group my_list (LIST) {
repeated group my_list_tuple {
required binary str (UTF8);
required binary str (STRING);
};
}
```
Expand Down Expand Up @@ -723,7 +724,7 @@ nullable integers:
// Map<String, Integer>
required group my_map (MAP) {
repeated group key_value {
required binary key (UTF8);
required binary key (STRING);
optional int32 value;
}
}
Expand Down Expand Up @@ -752,15 +753,15 @@ Examples that can be interpreted using these rules:
// Map<String, Integer> (nullable map, non-null values)
optional group my_map (MAP) {
repeated group map {
required binary str (UTF8);
required binary str (STRING);
required int32 num;
}
}

// Map<String, Integer> (nullable map, nullable values)
optional group my_map (MAP_KEY_VALUE) {
repeated group map {
required binary key (UTF8);
required binary key (STRING);
optional int32 value;
}
}
Expand Down
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -143,8 +143,8 @@ readers and writers for the format. The types are:
Logical types are used to extend the types that parquet can be used to store,
by specifying how the primitive types should be interpreted. This keeps the set
of primitive types to a minimum and reuses parquet's efficient encodings. For
example, strings are stored as byte arrays (binary) with a UTF8 annotation.
These annotations define how to further decode and interpret the data.
example, strings are stored with the primitive type BYTE_ARRAY with a STRING
annotation. These annotations define how to further decode and interpret the data.
Annotations are stored as `LogicalType` fields in the file metadata and are
documented in [LogicalTypes.md][logical-types].

Expand Down
20 changes: 10 additions & 10 deletions src/main/thrift/parquet.thrift
Original file line number Diff line number Diff line change
Expand Up @@ -60,14 +60,14 @@ enum ConvertedType {
* values */
LIST = 3;

/** an enum is converted into a binary field */
/** an enum is converted into a BYTE_ARRAY field */
ENUM = 4;

/**
* A decimal value.
*
* This may be used to annotate binary or fixed primitive types. The
* underlying byte array stores the unscaled value encoded as two's
* This may be used to annotate BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY primitive
* types. The underlying byte array stores the unscaled value encoded as two's
* complement using big-endian byte order (the most significant byte is the
* zeroth element). The value of the decimal is the value * 10^{-scale}.
*
Expand Down Expand Up @@ -151,14 +151,14 @@ enum ConvertedType {
/**
* An embedded JSON document
*
* A JSON document embedded within a single UTF8 column.
* A JSON document embedded within a single BYTE_ARRAY(STRING) column.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep it as is? This is the deprecated ConvertedType section where UTF8 is used for string type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good point. I'll revert.

*/
JSON = 19;

/**
* An embedded BSON document
*
* A BSON document embedded within a single BINARY column.
* A BSON document embedded within a single BYTE_ARRAY column.
*/
BSON = 20;

Expand Down Expand Up @@ -282,11 +282,11 @@ struct Statistics {
}

/** Empty structs to use as logical type annotations */
struct StringType {} // allowed for BINARY, must be encoded with UTF-8
struct StringType {} // allowed for BYTE_ARRAY, must be encoded with UTF-8
struct UUIDType {} // allowed for FIXED[16], must encoded raw UUID bytes
struct MapType {} // see LogicalTypes.md
struct ListType {} // see LogicalTypes.md
struct EnumType {} // allowed for BINARY, must be encoded with UTF-8
struct EnumType {} // allowed for BYTE_ARRAY, must be encoded with UTF-8
struct DateType {} // allowed for INT32
struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes

Expand All @@ -308,7 +308,7 @@ struct NullType {} // allowed for any physical type, only null values stored
* To maintain forward-compatibility in v1, implementations using this logical
* type must also set scale and precision on the annotated SchemaElement.
*
* Allowed for physical types: INT32, INT64, FIXED, and BINARY
* Allowed for physical types: INT32, INT64, FIXED_LEN_BYTE_ARRAY, and BYTE_ARRAY.
*/
struct DecimalType {
1: required i32 scale
Expand Down Expand Up @@ -360,15 +360,15 @@ struct IntType {
/**
* Embedded JSON logical type annotation
*
* Allowed for physical types: BINARY
* Allowed for physical types: BYTE_ARRAY
*/
struct JsonType {
}

/**
* Embedded BSON logical type annotation
*
* Allowed for physical types: BINARY
* Allowed for physical types: BYTE_ARRAY
*/
struct BsonType {
}
Expand Down