Update from latest review comments.
rdblue committed Nov 26, 2024
1 parent ce706e0 commit 5cdd682
Showing 2 changed files with 35 additions and 29 deletions.
12 changes: 8 additions & 4 deletions VariantEncoding.md
@@ -39,6 +39,8 @@ Another motivation for the representation is that (aside from metadata) each nes
For example, in a Variant containing an Array of Variant values, the representation of an inner Variant value, when paired with the metadata of the full variant, is itself a valid Variant.

This document describes the Variant Binary Encoding scheme.
Variant fields can also be _shredded_.
Shredding refers to extracting some elements of the variant into separate columns for more efficient extraction/filter pushdown.
The [Variant Shredding specification](VariantShredding.md) describes the details of shredding Variant values as typed Parquet columns.

## Variant in Parquet
@@ -47,9 +49,8 @@ A Variant value in Parquet is represented by a group with 2 fields, named `value

* The Variant group must be annotated with the `VARIANT` logical type.
* Both fields `value` and `metadata` must be of type `binary` (called `BYTE_ARRAY` in the Parquet thrift definition).
* The `metadata` field is required and must be a valid Variant metadata, as defined below.
* The `value` field is required for unshredded Variant values.
* The `value` field is optional when parts of the Variant value are shredded according to the [Variant Shredding specification](VariantShredding.md).
* The `metadata` field is `required` and must be a valid Variant metadata, as defined below.
* The `value` field must be annotated as `required` for unshredded Variant values, or `optional` if parts of the value are [shredded](VariantShredding.md) as typed Parquet columns.
* When present, the `value` field must be a valid Variant value, as defined below.

This is the expected unshredded representation in Parquet:
@@ -473,7 +474,7 @@ To maximize compatibility with readers that can process JSON but not Variant, th
|---------------|-----------|----------------------------------------------------------|--------------------------------------|
| Null type | null | `null` | `null` |
| Boolean | boolean | `true` or `false` | `true` |
| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, 34.00 |
| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, `34.00` |
| Float | number | Fraction must be present | `14.20` |
| Double | number | Fraction must be present | `1.0` |
| Date | string | ISO-8601 formatted date | `"2017-11-16"` |
@@ -484,3 +485,6 @@ To maximize compatibility with readers that can process JSON but not Variant, th
| Array         | array     |                                                          | `[34, "abc", "2017-11-16"]`          |
| Object | object | | `{"id": 34, "data": "abc"}` |

Notes:

* For timestamp and timestampntz, values must use microsecond precision, and trailing 0s are required.
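
As an illustration of the exact-numeric and timestamp rules above, here is a minimal Python sketch. The function names are hypothetical. The `T` separator and `+00:00` offset that `isoformat` produces are an assumption about the timestamp layout, which this excerpt does not show.

```python
from datetime import datetime, timezone


def exact_numeric_to_json(unscaled: int, scale: int) -> str:
    """Render unscaled * 10**-scale with exactly `scale` fraction digits and no exponent."""
    sign = "-" if unscaled < 0 else ""
    # Left-pad so there is always at least one digit before the decimal point.
    digits = str(abs(unscaled)).rjust(scale + 1, "0")
    if scale == 0:
        return sign + digits
    return f"{sign}{digits[:-scale]}.{digits[-scale:]}"


def timestamp_to_json(ts: datetime) -> str:
    """Render a timestamp with all six microsecond digits, keeping trailing zeros."""
    # Assumption: ISO-8601 layout as produced by datetime.isoformat.
    return ts.isoformat(timespec="microseconds")


print(exact_numeric_to_json(3400, 2))
print(timestamp_to_json(datetime(2017, 11, 16, 22, 31, 8, tzinfo=timezone.utc)))
```

Working from the unscaled integer and the scale, rather than from a float, keeps the number of fraction digits tied to the scale, so `34` at scale 2 is written as `34.00`.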
52 changes: 27 additions & 25 deletions VariantShredding.md
@@ -84,25 +84,25 @@ For example, if a Variant is required (like `measurement` above) and both `value

Shredded values must use the following Parquet types:

| Variant Type | Equivalent Parquet Type |
|-----------------------------|------------------------------|
| boolean | BOOLEAN |
| int8 | INT(8, signed=true) |
| int16 | INT(16, signed=true) |
| int32 | INT32 / INT(32, signed=true) |
| int64 | INT64 / INT(64, signed=true) |
| float | FLOAT |
| double | DOUBLE |
| decimal4 | DECIMAL(precision, scale) |
| decimal8 | DECIMAL(precision, scale) |
| decimal16 | DECIMAL(precision, scale) |
| date | DATE |
| timestamp | TIMESTAMP(true, MICROS) |
| timestamp without time zone | TIMESTAMP(false, MICROS) |
| binary | BINARY |
| string | STRING |
| array | LIST; see Arrays below |
| object | GROUP; see Objects below |
| Variant Type | Equivalent Parquet Type |
|-----------------------------|-----------------------------------|
| boolean | BOOLEAN |
| int8 | INT(8, signed=true) |
| int16 | INT(16, signed=true) |
| int32 | INT32 / INT(32, signed=true) |
| int64 | INT64 / INT(64, signed=true) |
| float | FLOAT |
| double | DOUBLE |
| decimal4 | INT32 / DECIMAL(precision, scale) |
| decimal8 | INT64 / DECIMAL(precision, scale) |
| decimal16 | DECIMAL(precision, scale) |
| date | DATE |
| timestamp | TIMESTAMP(true, MICROS) |
| timestamp without time zone | TIMESTAMP(false, MICROS) |
| binary | BINARY |
| string | STRING |
| array | LIST; see Arrays below |
| object | GROUP; see Objects below |

#### Primitive Types

@@ -112,12 +112,13 @@ Unless the value is shredded as an object (see [Objects](#objects)), `typed_valu

#### Arrays

Arrays can be shredded using a 3-level Parquet list for `typed_value`.
Arrays can be shredded by using a 3-level Parquet list for `typed_value`.

If the value is not an array, `typed_value` must be null.
If the value is an array, `value` must be null.

The list `element` must be a required group that contains `value` and `typed_value` fields.
The list `element` must be a required group.
The `element` group can contain `value` and `typed_value` fields.
The element's `value` field stores the element as Variant-encoded `binary` when the `typed_value` is not present or cannot represent it.
The `typed_value` field may be omitted when not shredding elements as a specific type.
When `typed_value` is omitted, `value` must be `required`.
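
The array rules above could be sketched in Python roughly as follows. This is an illustration only, not part of the spec: it assumes list elements are shredded as `int64`, it treats any in-range Python `int` as an int64, and `placeholder_encode` is a hypothetical stand-in for a real Variant binary encoder.

```python
from typing import Any, Callable


def placeholder_encode(value: Any) -> bytes:
    """Hypothetical stand-in for a Variant encoder; NOT the real binary encoding."""
    return repr(value).encode("utf-8")


def shred_array(value: Any, encode: Callable[[Any], bytes]) -> dict:
    """Split a value into `value` / `typed_value` for a list-of-int64 shredding schema."""
    if not isinstance(value, list):
        # Not an array: typed_value must be null, so the value stays Variant-encoded.
        return {"value": encode(value), "typed_value": None}
    elements = []
    for item in value:
        is_int64 = (
            isinstance(item, int)
            and not isinstance(item, bool)
            and -(2**63) <= item < 2**63
        )
        if is_int64:
            # The element fits the shredded type.
            elements.append({"value": None, "typed_value": item})
        else:
            # typed_value cannot represent the element: fall back to encoded binary.
            elements.append({"value": encode(item), "typed_value": None})
    # The value is an array: the top-level `value` must be null.
    return {"value": None, "typed_value": elements}


print(shred_array([1, "abc"], placeholder_encode))
```

Each element is always a group (never null itself), which mirrors the requirement that the list `element` be a required group.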
@@ -183,12 +184,12 @@ optional group event (VARIANT) {
}
```

The group for each named field must be required.
The group for each named field must use repetition level `required`.

A field's `value` and `typed_value` are set to null (missing) to indicate that the field does not exist in the variant.
To encode a field that is present with a null value, the `value` must contain a Variant null: basic type 0 (primitive) and physical type 0 (null).
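
The difference between a missing field and a field that is present with a null value can be sketched in Python as below. The function name and the `encode` callable are hypothetical, and the sketch assumes the field is shredded as a string. Because both the basic type and the physical type of a Variant null are 0, the sketch assumes the Variant null is the single byte `0x00`.

```python
from typing import Any, Callable

# Variant null: basic type 0 (primitive) and physical type 0 (null).
# Assumption: both values share one header byte, which is therefore 0x00.
VARIANT_NULL = b"\x00"


def shred_string_field(obj: dict, name: str, encode: Callable[[Any], bytes]) -> dict:
    """Shred one object field against a schema that types the field as a string."""
    if name not in obj:
        # Field does not exist in the variant: both columns are null (missing).
        return {"value": None, "typed_value": None}
    field = obj[name]
    if field is None:
        # Field is present with a null value: store a Variant null in `value`.
        return {"value": VARIANT_NULL, "typed_value": None}
    if isinstance(field, str):
        # Field matches the shredded type.
        return {"value": None, "typed_value": field}
    # Field is present but not a string: keep it as Variant-encoded binary.
    return {"value": encode(field), "typed_value": None}


# Usage with a stand-in encoder (not the real Variant encoding).
stand_in = lambda v: repr(v).encode("utf-8")
print(shred_string_field({"event_type": "login"}, "event_type", stand_in))
print(shred_string_field({"event_type": None}, "event_type", stand_in))
print(shred_string_field({}, "event_type", stand_in))
```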

The series of objects below would be stored as:
The table below shows how the series of objects in the first column would be stored:

| Event object | `value` | `typed_value` | `typed_value.event_type.value` | `typed_value.event_type.typed_value` | `typed_value.event_ts.value` | `typed_value.event_ts.typed_value` | Notes |
|------------------------------------------------------------------------------------|-----------------------------------|---------------|--------------------------------|--------------------------------------|------------------------------|------------------------------------|--------------------------------------------------|
@@ -334,7 +335,8 @@ def primitive_to_variant(typed_value: Any): Variant:

Shredding is an optional feature of Variant, and readers must continue to be able to read a group containing only `value` and `metadata` fields.

Engines without shredding support are not expected to be able to read Parquet files that use shredding.
Different files may contain conflicting schemas.
Engines that do not write shredded values must be able to read shredded values according to this spec or must fail.

Different files may contain conflicting shredding schemas.
That is, files may contain different `typed_value` columns for the same Variant with incompatible types.
It may not be possible to infer or specify a single shredded schema that would allow all Parquet files for a table to be read without reconstructing the value as a Variant.
