alkis committed Aug 7, 2024
1 parent 68a78fb commit 852f912
Showing 2 changed files with 58 additions and 52 deletions.
58 changes: 54 additions & 4 deletions ExtensionExamples.md → Extensions.md
@@ -1,8 +1,35 @@
# Parquet extension examples
# Parquet Extensions

The extension mechanism of the `binary` Thrift field-id `32767` has some desirable properties:

* Existing readers will ignore these extensions without any modifications
* Existing readers will ignore the extension bytes with little processing overhead
* The content of the extension is freeform and can be encoded in any format. This format is not restricted to Thrift.
* Extensions can be appended to existing Thrift serialized structs [without requiring Thrift libraries](#appending-extensions-to-thrift) for manipulation (or changes to the thrift IDL).

Because only one field-id is reserved, the extension bytes themselves require disambiguation; otherwise readers will not be able to decode extensions safely. This is left to implementers, who MUST put enough unique state in their extension bytes for disambiguation. This can be achieved relatively easily by adding a [UUID](https://en.wikipedia.org/wiki/Universally_unique_identifier) at the start or end of the extension bytes. The extension mechanism does not specify a disambiguation scheme, to allow more flexibility to implementers.
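As a minimal sketch of UUID-based disambiguation, assuming the UUID is placed at the start of the extension bytes: the names `kVendorUuid`, `TagExtension` and `MatchExtension` are hypothetical and not part of any Parquet library, and the UUID value is made up for illustration.

```c++
#include <cassert>
#include <cstddef>
#include <string>

const std::size_t kUuidSize = 16;

// Hypothetical vendor UUID as 16 raw bytes; a real extension would generate its own.
const std::string kVendorUuid(
    "\x3f\x2a\x91\x5c\x07\xd4\x4e\x1b\x8a\x60\xc2\x19\x5e\x7b\x04\xaa", kUuidSize);

// Returns the extension bytes: the UUID followed by the vendor-defined payload.
std::string TagExtension(const std::string& uuid, const std::string& payload) {
  assert(uuid.size() == kUuidSize);
  return uuid + payload;
}

// If ext starts with uuid, stores the remaining bytes in *payload and returns true.
// Otherwise the extension belongs to someone else and the reader should ignore it.
bool MatchExtension(const std::string& ext, const std::string& uuid, std::string* payload) {
  assert(uuid.size() == kUuidSize);
  if (ext.size() < kUuidSize || ext.compare(0, kUuidSize, uuid) != 0) return false;
  payload->assign(ext, kUuidSize, std::string::npos);
  return true;
}
```

A reader that does not recognize the leading 16 bytes simply treats the extension as foreign and skips it, which is what keeps private extensions from different vendors from being misinterpreted.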

Putting everything together in an example, if we extended `FileMetaData`, it would look like this on the wire:

    N-1 bytes | Thrift compact protocol encoded FileMetaData (minus \0 thrift stop field)
    4 bytes   | 08 FE FF 03 (long form header for 32767: binary; the field-id is a zigzag varint)
    1-5 bytes | ULEB128(M) encoded size of the extension
    M bytes   | extension bytes
    1 byte    | \0 (thrift stop field)
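On the reading side, the `ULEB128(M)` size in the layout above has to be decoded before the `M` extension bytes can be sliced out. The following is a minimal sketch of such a decoder; `ReadUleb` is a hypothetical helper name, not an existing Thrift or Parquet API, and it deliberately does not check for values that overflow 32 bits.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes one ULEB128 value from bytes starting at *pos and advances *pos past it.
// Returns false if the input is truncated or the encoding is longer than 5 bytes;
// in that case *pos may already have been advanced and should not be reused.
bool ReadUleb(const std::string& bytes, std::size_t* pos, uint32_t* value) {
  assert(pos != nullptr && value != nullptr);
  uint32_t result = 0;
  for (int shift = 0; shift < 35; shift += 7) {
    if (*pos >= bytes.size()) return false;  // ran out of input
    const uint8_t b = static_cast<uint8_t>(bytes[(*pos)++]);
    result |= static_cast<uint32_t>(b & 0x7F) << shift;  // low 7 bits carry data
    if ((b & 0x80) == 0) {  // continuation bit clear: this was the last byte
      *value = result;
      return true;
    }
  }
  return false;  // continuation bit still set after 5 bytes
}
```

For instance, the three bytes `E5 8E 26` decode to 624485, and a lone `80` byte is rejected as truncated.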

The choice to reserve only one field-id has an additional (and frankly unintended) property. It creates scarcity in the extension space and disincentivizes vendors from keeping their extensions private. A vendor that has an extension cannot use it in tandem with extensions from other vendors, even if those extensions are publicly known. The easiest path to interoperability and further experimentation is to push an extension through standardization and continue experimenting with other ideas internally on top of the (now) standardized version.

#### Path to standardization

So far the above specification shows how different vendors can add extensions without stepping on each other's toes. As long as extensions are private this works out ok.

Unavoidably (and desirably) some extensions will make it into the official specification. Depending on the nature of the extension, migration can take different paths. While it is out of the scope of this document to design all such migrations, we illustrate some of these paths in the [examples](#examples).

## Examples

To illustrate the applicability of the extension mechanism we provide examples of fictional extensions to Parquet and how migration can play out if/when the community decides to adopt them in the official specification.

## Footer
### Footer

A variant of `FileMetaData` encoded in Flatbuffers is introduced. This variant is more performant and can scale to very wide tables, something that current Thrift `FileMetaData` struggles with.

@@ -57,7 +84,7 @@ In this example, we see several design decisions for the extension at play:
* The crc32 of the flatbuffer representation enhances Parquet to have crc32 for metadata as well which is arguably more important than crc32 for data.
* The new encoding itself, which MUST contain some way to be extended in the future (much like Thrift does with this specification).

## Encoding
### Encoding

The community experiments with a new encoding extension. At the same time they want to keep the newly encoded Parquet files open for everyone to read. So they add a new encoding via an extension to the ColumnMetaData struct. The extension stores offsets in the Parquet file where the new and duplicate encoded data for this column lives. The new writer carefully places all the new encodings at the start of the row group and all the old encodings at the end of the row group. This layout minimizes disruption for readers unaware of the new encodings.

@@ -74,4 +101,27 @@ In its private form Parquet files look like so:

The custom reader is compiled with a Thrift IDL that declares a `binary` field with id 32767. This makes the reader extension-aware: it inspects the extension bytes, looking for the UUID disambiguator. If the UUID is found, the reader decodes the offsets from the rest of the bytes and reads the region of the file containing the new encoding.

If/when the encoding is ratified, it is added to the official specification as an additional type in `Encodings` at which point the extension is no longer necessary, nor the duplicated data in the row group.
If/when the encoding is ratified, it is added to the official specification as an additional type in `Encodings` at which point the extension is no longer necessary, nor the duplicated data in the row group.

## Appending extensions to thrift

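```c++
#include <cstdint>
#include <string>

// Appends x to *s as an unsigned LEB128 varint.
void AppendUleb(uint32_t x, std::string* s) {
  while (true) {
    uint8_t c = x & 0x7F;
    if (x < 0x80) return s->push_back(c);  // last byte: continuation bit clear
    s->push_back(c + 0x80);                // more bytes follow
    x >>= 7;
  }
}

std::string AppendExtension(std::string thrift, std::string ext) {
  thrift.pop_back();                 // remove the stop field
  thrift += "\x08";                  // binary
  AppendUleb(32767u << 1, &thrift);  // field-id, zigzag encoded (FE FF 03)
  AppendUleb(static_cast<uint32_t>(ext.size()), &thrift);  // field size
  thrift += ext;
  thrift.push_back('\0');            // add the stop field (+= "\x00" would append nothing)
  return thrift;
}
```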
52 changes: 4 additions & 48 deletions README.md
@@ -290,54 +290,10 @@ There are many places in the format for compatible extensions:
- Encodings: Encodings are specified by enum and more can be added in the future.
- Page types: Additional page types can be added and safely skipped.

### Thrift extensions
Parquet Thrift IDL reserves field-id `32767` of every Thrift struct for extensions. The (Thrift) type of this field is always `binary`. These choices provide some desirable properties:

* Existing readers will ignore these extensions without any modifications
* Existing readers will ignore the extension bytes with little processing overhead
* The content of the extension is freeform and can be encoded in any format. This format is not restricted to Thrift.
* Extensions can be appended to existing Thrift serialized structs [without requiring Thrift libraries](#appending-extensions-to-thrift) for manipulation (or changes to the thrift IDL).

Because only one field-id is reserved, the extension bytes themselves require disambiguation; otherwise readers will not be able to decode extensions safely. This is left to implementers, who MUST put enough unique state in their extension bytes for disambiguation. This can be achieved relatively easily by adding a [UUID](https://en.wikipedia.org/wiki/Universally_unique_identifier) at the start or end of the extension bytes. The extension mechanism does not specify a disambiguation scheme, to allow more flexibility to implementers.

Putting everything together in an example, if we extended `FileMetaData`, it would look like this on the wire:

    N-1 bytes | Thrift compact protocol encoded FileMetaData (minus \0 thrift stop field)
    4 bytes   | 08 FE FF 03 (long form header for 32767: binary; the field-id is a zigzag varint)
    1-5 bytes | ULEB128(M) encoded size of the extension
    M bytes   | extension bytes
    1 byte    | \0 (thrift stop field)

The choice to reserve only one field-id has an additional (and frankly unintended) property. It creates scarcity in the extension space and disincentivizes vendors from keeping their extensions private. A vendor that has an extension cannot use it in tandem with extensions from other vendors, even if those extensions are publicly known. The easiest path to interoperability and further experimentation is to push an extension through standardization and continue experimenting with other ideas internally on top of the (now) standardized version.

#### Appending extensions to thrift

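```c++
#include <cstdint>
#include <string>

// Appends x to *s as an unsigned LEB128 varint.
void AppendUleb(uint32_t x, std::string* s) {
  while (true) {
    uint8_t c = x & 0x7F;
    if (x < 0x80) return s->push_back(c);  // last byte: continuation bit clear
    s->push_back(c + 0x80);                // more bytes follow
    x >>= 7;
  }
}

std::string AppendExtension(std::string thrift, std::string ext) {
  thrift.pop_back();                 // remove the stop field
  thrift += "\x08";                  // binary
  AppendUleb(32767u << 1, &thrift);  // field-id, zigzag encoded (FE FF 03)
  AppendUleb(static_cast<uint32_t>(ext.size()), &thrift);  // field size
  thrift += ext;
  thrift.push_back('\0');            // add the stop field (+= "\x00" would append nothing)
  return thrift;
}
```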
#### Path to standardization
So far the above specification shows how different vendors can add extensions without stepping on each other's toes. As long as extensions are private this works out ok.
Unavoidably (and desirably) some extensions will make it into the official specification. Depending on the nature of the extension, migration can take different paths. While it is out of the scope of this document to design all such migrations, we illustrate some of these paths in the [examples](ExtensionExamples.md).
### Thrift [extensions](Extensions.md)

Parquet Thrift IDL reserves field-id `32767` of every Thrift struct for extensions.
The (Thrift) type of this field is always `binary`.

## Testing
