[T3] Add extension points to all thrift messages
alkis committed Jun 5, 2024
1 parent 079a2df commit 5ef488c
Showing 1 changed file with 69 additions and 0 deletions.
69 changes: 69 additions & 0 deletions README.md
Expand Up @@ -285,6 +285,75 @@ There are many places in the format for compatible extensions:
- Encodings: Encodings are specified by enum and more can be added in the future.
- Page types: Additional page types can be added and safely skipped.

### Thrift extensions
Thrift is used for metadata. The Thrift spec mandates that unknown fields are
skipped. To facilitate extensions, Parquet reserves field-id `32767` of *every*
struct as an ignorable extension point. More specifically, Parquet guarantees
that field-id `32767` will *never* be used in the official Thrift IDL. The type
of this field is always `binary` for maximum extensibility and fast skipping by
Thrift parsers.
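
As an illustration of what the reserved field looks like on the wire, the
sketch below builds the field header for field-id `32767` of type `binary`. It
assumes the Thrift compact protocol, where a long-form field header is a type
byte followed by the field id as a zigzag varint. The helper names `ZigZag16`,
`AppendVarint`, and `ExtensionFieldHeader` are hypothetical and not part of any
Parquet or Thrift API.

```c++
#include <cstdint>
#include <string>

// Zigzag-encode a 16-bit field id (compact protocol stores ids this way).
inline uint32_t ZigZag16(int16_t v) {
  const int32_t w = v;
  return (static_cast<uint32_t>(w) << 1) ^ static_cast<uint32_t>(w >> 31);
}

// Append x as an unsigned LEB128 varint.
inline void AppendVarint(uint32_t x, std::string* out) {
  while (x >= 0x80) {
    out->push_back(static_cast<char>((x & 0x7F) | 0x80));
    x >>= 7;
  }
  out->push_back(static_cast<char>(x));
}

// Long-form header: type nibble with a zero id-delta, then the zigzag varint id.
inline std::string ExtensionFieldHeader() {
  constexpr uint8_t kCompactTypeBinary = 0x08;  // compact-protocol type code for binary
  constexpr int16_t kExtensionFieldId = 32767;  // the reserved extension field-id
  std::string out;
  out.push_back(static_cast<char>(kCompactTypeBinary));
  AppendVarint(ZigZag16(kExtensionFieldId), &out);
  return out;
}
```

Under these assumptions the header is the four bytes `08 FE FF 03`. A reader
that does not know the field sees an unknown `binary` field, reads its length,
and skips it.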

Such extensions can be appended to an existing serialized Thrift message
without any special APIs. A sample `C++` implementation:

```c++
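#include <cstdint>
#include <string>

// Appends `ext` as field 32767 (binary) to a compact-protocol Thrift struct.
std::string AppendExtension(std::string thrift, const std::string& ext) {
  // Append x as an unsigned LEB128 varint.
  auto append_uleb = [](uint32_t x, std::string* out) {
    while (true) {
      int c = x & 0x7F;
      if ((x >>= 7) == 0) {
        out->push_back(c);
        return;
      } else {
        out->push_back(c | 0x80);
      }
    }
  };
  thrift.pop_back();  // remove the trailing 0 (stop-field)
  // Long form field header for 32767: binary. The compact protocol writes the
  // field id as a zigzag varint: zigzag(32767) = 65534 = FE FF 03.
  thrift += "\x08\xFE\xFF\x03";
  append_uleb(static_cast<uint32_t>(ext.size()), &thrift);
  thrift += ext;
  // Add the trailing 0 back. `thrift += "\x00"` would append nothing, because
  // the literal is an empty C string.
  thrift.push_back('\x00');
  return thrift;
}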
```

Additionally the binary extension MUST have a specific form in order to be
unambiguously identifiable by parsers that know of it and, as a corollary,
impossible to be accidentally generated by user data.

- `<optional id>`
- N bytes: the extension data
- `<optional id>`
- 4 bytes: little endian crc32 of the previous N bytes
- 4 bytes: N in little endian
- 4 bytes: little endian crc32 of N
- 3 bytes: 3 byte magic extension (after this we have the Thrift stop-field)

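
The layout above can be sketched as a small wrapper that frames raw extension
data. This is an illustration under stated assumptions, not a reference
implementation: it assumes the CRC is the common CRC-32 used by zlib (reflected
polynomial `0xEDB88320`), that the "crc32 of N" is taken over the four
little-endian bytes of N, and that no optional ids are present. `Crc32`,
`LittleEndian32`, and `WrapExtension` are hypothetical helper names.

```c++
#include <cstdint>
#include <string>

// Bitwise CRC-32 (zlib variant); assumed to be the intended checksum.
inline uint32_t Crc32(const std::string& bytes) {
  uint32_t crc = 0xFFFFFFFFu;
  for (unsigned char b : bytes) {
    crc ^= b;
    for (int i = 0; i < 8; ++i) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;
}

// Serialize x as 4 little-endian bytes.
inline std::string LittleEndian32(uint32_t x) {
  std::string out(4, '\0');
  for (int i = 0; i < 4; ++i) {
    out[i] = static_cast<char>((x >> (8 * i)) & 0xFF);
  }
  return out;
}

// Frame `data` as: data, crc32(data), N, crc32(N), 3-byte magic.
// The result is the value of the `binary` extension field.
inline std::string WrapExtension(const std::string& data,
                                 const std::string& magic) {
  const std::string n = LittleEndian32(static_cast<uint32_t>(data.size()));
  std::string out = data;
  out += LittleEndian32(Crc32(data));
  out += n;
  out += LittleEndian32(Crc32(n));
  out += magic;  // caller must pass exactly 3 bytes, e.g. "EXP"
  return out;
}
```

For example, `WrapExtension(payload, "EXP")` yields `payload.size() + 15`
bytes, which would then be written as the value of field-id `32767`.
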
`optional id` is left to the implementors of the extension. It is highly
recommended to add an id before and/or after the extension data to make
it easier to interoperate with other organizations before the extension is
accepted into the official specification.

The magic is 3 bytes long so that, together with the `\x00` of the stop-field
that follows it, it forms a new 4 byte magic which can replace `PAR1` or `PARE`
in the future when all engines adopt a new format.
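
Because the extension is the last field before the stop-field, a reader that
knows a given magic can look for it from the end of the serialized struct. The
sketch below shows one way to do that. It assumes the zlib-style CRC-32, a CRC
of N computed over its four little-endian bytes, and no optional id between the
data and the checksums. `Crc32`, `ReadLittleEndian32`, and `FindExtension` are
hypothetical names.

```c++
#include <cstddef>
#include <cstdint>
#include <string>

// Bitwise CRC-32 (zlib variant); assumed to be the intended checksum.
inline uint32_t Crc32(const std::string& bytes) {
  uint32_t crc = 0xFFFFFFFFu;
  for (unsigned char b : bytes) {
    crc ^= b;
    for (int i = 0; i < 8; ++i) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;
}

// Read 4 little-endian bytes starting at `pos`.
inline uint32_t ReadLittleEndian32(const std::string& s, size_t pos) {
  uint32_t x = 0;
  for (int i = 0; i < 4; ++i) {
    x |= static_cast<uint32_t>(static_cast<unsigned char>(s[pos + i])) << (8 * i);
  }
  return x;
}

// Returns true and fills `*data` if `thrift` ends with a well-formed
// extension carrying `magic`, followed by the 0x00 stop-field.
inline bool FindExtension(const std::string& thrift, const std::string& magic,
                          std::string* data) {
  // Trailer: crc32(data), N, crc32(N), magic, stop-field.
  constexpr size_t kTrailer = 4 + 4 + 4 + 3 + 1;
  const size_t end = thrift.size();
  if (magic.size() != 3 || end < kTrailer) return false;
  if (thrift[end - 1] != '\0') return false;
  if (thrift.compare(end - 4, 3, magic) != 0) return false;
  // Validate N before trusting it as a length.
  if (ReadLittleEndian32(thrift, end - 8) != Crc32(thrift.substr(end - 12, 4))) {
    return false;
  }
  const size_t n = ReadLittleEndian32(thrift, end - 12);
  if (n > end - kTrailer) return false;
  std::string candidate = thrift.substr(end - kTrailer - n, n);
  if (ReadLittleEndian32(thrift, end - 16) != Crc32(candidate)) return false;
  *data = candidate;
  return true;
}
```

Checking the magic, then the CRC of N, and only then the CRC of the data keeps
the common "no extension present" case cheap and makes an accidental match on
arbitrary user data very unlikely.
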

Each organization/engine can reserve a magic extension here to avoid clashes.
To add your extension, file a JIRA and send a PR.

The current list of extensions is:

| Magic | Organization |
|-------|--------------|
| `PAR` | Reserved for the future when an extension replaces `PAR1` |
| `PER` | Reserved for the future when an extension replaces `PARE` |
| `ASF` | Apache |
| `CDH` | Cloudera |
| `CRM` | Salesforce |
| `DBR` | Databricks |
| `EXP` | Apache/Experimental |

## Contributing
Comment on the issue and/or contact [the parquet-dev mailing list](http://mail-archives.apache.org/mod_mbox/parquet-dev/) with your questions and ideas.
Changes to this core format definition are proposed and discussed in depth on the mailing list. You may also be interested in contributing to the Parquet-MR subproject, which contains all the Java-side implementation and APIs. See the "How To Contribute" section of the [Parquet-MR project](https://github.com/apache/parquet-mr#how-to-contribute).