From 8a9e2f3fdb6e90871d1a05e2a64dec1ee9e84efd Mon Sep 17 00:00:00 2001
From: emkornfield
Date: Sat, 25 May 2024 10:33:18 -0700
Subject: [PATCH] DRAFT: Alternative V3 metadata proposal.

Salient points:
1. Introduce a new encoding that allows random access for byte arrays.
2. Use page info as a structure for storing lists.
3. Store pages out of line of Thrift.
---
 README.md                      | 59 +++++++++++++++
 src/main/thrift/parquet.thrift | 134 ++++++++++++++++++++++++++++++---
 2 files changed, 183 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index 42578c7be..5ebafc8c7 100644
--- a/README.md
+++ b/README.md
@@ -118,6 +118,65 @@ chunks they are interested in. The columns chunks should then be read sequentially.
 
 ![File Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)
 
+### PAR3 File Footers
+
+The PAR3 file footer format is designed to better support wide schemas and to give more
+control over footer size vs. compute trade-offs. Its format is as follows:
+
+- Data pages containing serialized Thrift metadata objects that were modeled as lists
+  in PAR1. These are stored contiguously, with offsets stored in the FileMetadata. See
+  parquet.thrift for more details on each.
+- Serialized Thrift FileMetadata structure.
+- (Optional) 4-byte CRC32 of the serialized Thrift FileMetadata.
+- 4-byte length in bytes (little endian) of the serialized FileMetadata structure.
+- 4-byte length in bytes (little endian) of all preceding elements in the footer.
+- 1-byte flag field indicating features that require special parsing of the footer.
+  Readers MUST raise an error if there is an unrecognized flag. Current flags:
+
+  * 0x01 - Footer encryption enabled (when set, the encryption information is written
+    before the FileMetadata structure, as in the PAR1 footer).
+  * 0x02 - CRC32 of the FileMetadata is present.
+
+- 4-byte magic number "PAR3"
+
+When parsing the footer, implementations SHOULD read at least the last 10 bytes of the file,
+then read the entirety of the footer based on the length of all preceding elements. This avoids
+further I/O cost for accessing metadata stored in the data pages. PAR3 footers can fully replace
+PAR1 footers. If a file is written with only a PAR3 footer, implementations MUST write "PAR3" as
+the first four bytes of the file. PAR3 footers can also be written in a backwards-compatible way
+after the PAR1 metadata (see the next section for details).
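+
+The following non-normative sketch (Python; the helper name and error handling are illustrative,
+and footer encryption is not handled) shows how a reader might locate the serialized FileMetadata
+from the trailer described above:
+
+```python
+import struct
+import zlib
+
+FLAG_ENCRYPTED_FOOTER = 0x01
+FLAG_FOOTER_CRC32 = 0x02
+
+def read_par3_file_metadata(f, file_size):
+    # Fixed-size trailer: length of preceding elements (4) | flags (1) | "PAR3" (4).
+    f.seek(file_size - 9)
+    preceding_len, flags, magic = struct.unpack("<IB4s", f.read(9))
+    if magic != b"PAR3":
+        raise ValueError("not a PAR3 footer")
+    if flags & ~(FLAG_ENCRYPTED_FOOTER | FLAG_FOOTER_CRC32):
+        raise ValueError("unrecognized PAR3 footer flag")
+    if flags & FLAG_ENCRYPTED_FOOTER:
+        raise NotImplementedError("encrypted footers are not handled in this sketch")
+    # The full footer is the preceding elements plus the 9 trailer bytes.
+    f.seek(file_size - 9 - preceding_len)
+    footer = f.read(preceding_len)
+    # The preceding elements end with the 4-byte FileMetadata length.
+    metadata_len = struct.unpack("<I", footer[-4:])[0]
+    crc_len = 4 if flags & FLAG_FOOTER_CRC32 else 0
+    metadata = footer[-4 - crc_len - metadata_len : -4 - crc_len]
+    if crc_len:  # CRC byte order is assumed little endian here, like the length fields.
+        if struct.unpack("<I", footer[-8:-4])[0] != zlib.crc32(metadata):
+            raise ValueError("FileMetadata CRC mismatch")
+    return flags, metadata  # metadata pages occupy the start of `footer`
+```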
+
+#### Dual Mode PAR1 and PAR3 footers
+
+There is a desire to gradually roll out PAR3 footers so that newer readers can take advantage of
+them while older readers can still properly parse the file. This section outlines a strategy for
+doing this.
+
+As background, Thrift structs are always serialized with a trailing 0 byte to delimit their end.
+Therefore, PAR1 files written before PAR3 was introduced are always expected to have the following
+trailing 9 bytes: [0x00, x, x, x, x, P, A, R, 1] (where x can be any value). We also expect all
+compliant Thrift parsers to parse only the first available FileMetadata message and to stop
+consuming the stream once it is read. Today, we don't believe that any Parquet readers validate
+that the entire "length in bytes of file metadata" is consumed. Therefore, to allow both footers
+to exist simultaneously in the file, the following algorithm is used:
+
+1. Serialize and write the original (PAR1) FileMetadata thrift structure.
+2. Transform the original FileMetadata structure to conform to PAR3:
+   * Move data elements if necessary.
+   * Generate data pages for elements stored in metadata pages.
+   * Clear the lists that were transferred to metadata pages.
+3. Write out metadata pages.
+4. Serialize and write the updated Thrift FileMetadata structure.
+5. Write out the remainder of the PAR3 footer (the last bytes written are "PAR3").
+6. Write out the total size in bytes of the serialized (PAR1) structure plus the size of the
+   PAR3 footer as the final 4-byte length.
+7. Write "PAR1".
+
+When these steps are followed, readers wishing to use PAR3 footers SHOULD read the last 12 bytes
+of the file and look for "PAR3" (written in step 5) at the beginning of those 12 bytes; a sketch
+is given below. As noted above, there should be no ambiguity with files generated by Parquet
+reference implementations, since without PAR3 the first four of those 12 bytes are expected to be
+[x, x, x, 0x00]. Any remaining ambiguity can be completely eliminated if the CRC32 is written in
+PAR3 mode and verified by readers.
+
+When embedded into a PAR1 file, no modification to the magic number at the beginning of the file
+is mandated.
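+
+The following non-normative sketch (Python; the helper name is illustrative) shows how a reader
+might decide which footer to use when both may be present:
+
+```python
+# Assumes `f` is a seekable binary file handle of `file_size` bytes (file_size >= 12).
+def detect_footer_kind(f, file_size):
+    f.seek(file_size - 12)
+    tail = f.read(12)
+    if tail[8:12] == b"PAR3":
+        return "PAR3"          # PAR3-only file: the footer ends with "PAR3".
+    if tail[8:12] == b"PAR1" and tail[0:4] == b"PAR3":
+        return "PAR3_IN_PAR1"  # Dual-mode footer written after the PAR1 metadata (step 5).
+    if tail[8:12] == b"PAR1":
+        return "PAR1"          # Legacy PAR1-only footer.
+    raise ValueError("not a Parquet file")
+```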
 
 ## Metadata
 
 There are three types of metadata: file metadata, column (chunk) metadata and page header metadata. All thrift structures are serialized using the TCompactProtocol.
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index c928ad66b..3df09bbf3 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -537,6 +537,39 @@ enum Encoding {
      Support for INT32, INT64 and FIXED_LEN_BYTE_ARRAY added in 2.11.
    */
   BYTE_STREAM_SPLIT = 9;
+
+  /** Encoding for variable-length binary data that allows random access to values.
+   *
+   * This encoding is designed for random access of BYTE_ARRAY values. It is mostly useful
+   * for non-nullable BYTE_ARRAY columns, where determining the exact offset of a value does
+   * not require parsing definition levels.
+   *
+   * The layout consists of the following elements:
+   * 1. byte_arrays - BYTE_ARRAY values laid out contiguously. The BYTE_ARRAY values are
+   *    immediately followed by the cumulative offsets.
+   * 2. offsets - A contiguous set of N-byte little-endian unsigned integers representing
+   *    the end byte offset (exclusive) of a BYTE_ARRAY value from the beginning of the
+   *    page. For simplicity of implementation, the value at index 0 is always zero.
+   * 3. The last byte indicates the number of bytes (N) used for each offset (valid values
+   *    are 1, 2, 3 and 4). Implementations SHOULD use the smallest width that meets the
+   *    length requirements.
+   *
+   * Note that the order (values before offsets) is reversed from DELTA_BINARY_PACKED so that
+   * byte array values can potentially be compressed incrementally in the case of Data Page V2
+   * or other future data pages where values are compressed separately from nesting information.
+   *
+   * The position where the offsets begin can be determined from the final offset element,
+   * which equals the total length of byte_arrays.
+   *
+   * An individual byte array element can be found at an index using the following pseudo-code
+   * (real implementations SHOULD do bounds checking):
+   *
+   *   return byte_arrays[offsets[index] : offsets[index + 1]]
+   *
+   * Example encoding of "f", "oo", "bar1" (square brackets delimit the components listed):
+   *   [foobar1][0,1,3,7][1]
+   * e.g. the value at index 2 is byte_arrays[offsets[2] : offsets[3]] = bytes [3, 7) = "bar1".
+   */
+  RANDOM_ACCESS_BYTE_ARRAY = 10;
 }
 
 /**
@@ -779,8 +812,12 @@ struct ColumnMetaData {
    * whether we can decode those pages. **/
   2: required list<Encoding> encodings
 
-  /** Path in schema **/
-  3: required list<string> path_in_schema
+  /** Path in schema
+   * Example of a field deprecated for PAR3:
+   * PAR1 Footer: Required
+   * PAR3 Footer: Deprecated (don't populate)
+   */
+  3: optional list<string> path_in_schema
 
   /** Compression codec **/
   4: required CompressionCodec codec
@@ -792,7 +829,11 @@
   6: required i64 total_uncompressed_size
 
   /** total byte size of all compressed, and potentially encrypted, pages
-   * in this column chunk (including the headers) **/
+   * in this column chunk (including the headers)
+   *
+   * Fetching the range starting at min(dictionary_page_offset, data_page_offset) and spanning
+   * total_compressed_size bytes should fetch all data in the given column chunk.
+   */
   7: required i64 total_compressed_size
 
   /** Optional key/value metadata **/
@@ -812,7 +853,7 @@
 
   /** Set of all encodings used for pages in this column chunk.
    * This information can be used to determine if all data pages are
-   * dictionary encoded for example **/
+   * dictionary encoded for example **/
   13: optional list<PageEncodingStats> encoding_stats;
 
   /** Byte offset from beginning of file to Bloom filter data. **/
@@ -881,15 +922,21 @@ struct ColumnChunk {
   /** Crypto metadata of encrypted columns **/
   8: optional ColumnCryptoMetaData crypto_metadata
 
-  /** Encrypted column metadata for this chunk **/
+  /** Encrypted column metadata for this chunk
+   *
+   * PAR3: Not set; see columns_page on the FileMetaData struct.
+   **/
   9: optional binary encrypted_column_metadata
 }
 
 struct RowGroup {
   /** Metadata for each column chunk in this row group.
    * This list must have the same order as the SchemaElement list in FileMetaData.
+   *
+   * PAR1: Required
+   * PAR3: Not populated. Use columns_page on FileMetaData.
    **/
-  1: required list<ColumnChunk> columns
+  1: optional list<ColumnChunk> columns
 
   /** Total byte size of all the uncompressed column data in this row group **/
   2: required i64 total_byte_size
@@ -1115,6 +1162,33 @@ union EncryptionAlgorithm {
   2: AesGcmCtrV1 AES_GCM_CTR_V1
 }
 
+/**
+ * Description of the location of a metadata page.
+ *
+ * A metadata page is a data page used to store metadata about
+ * the data stored in the file. This is a key feature of PAR3
+ * footers, which allows for deferred decoding of metadata.
+ *
+ * For common use cases the current recommendation is to use
+ * an encoding that supports random access (e.g. PLAIN for fixed-size types
+ * and RANDOM_ACCESS_BYTE_ARRAY for variable-size types). Implementations
+ * SHOULD consider allowing configurability per page so that end-users can
+ * optimize the size vs. compute trade-offs that make sense for their use-case.
+ */
+struct MetadataPageLocation {
+  // Offset from the beginning of the PAR3 footer to the header
+  // of the data page.
+  1: optional i32 footer_offset
+
+  // The length of the serialized page (header + data) in bytes. This
+  // is redundant with information in the header but allows
+  // for more robust checks before doing any Thrift parsing.
+  2: optional i32 full_page_size
+
+  // Optional compression applied to the page.
+  3: optional CompressionCodec compression
+}
+
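+// Non-normative usage sketch (not part of the format): given a MetadataPageLocation `loc` and
+// the bytes of the PAR3 footer already read into memory as `footer`, a reader would locate and
+// decode a metadata page roughly as follows (`read_page` is a hypothetical helper that parses
+// the page header, decompresses the data using `loc.compression`, and decodes the values):
+//
+//   page_bytes = footer[loc.footer_offset : loc.footer_offset + loc.full_page_size]
+//   values = read_page(page_bytes, loc.compression)
+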
 /**
  * Description for file metadata
  */
@@ -1127,16 +1201,52 @@ struct FileMetaData {
   1: required i32 version
 
   /** Parquet schema for this file. This schema contains metadata for all the columns.
    * The schema is represented as a tree with a single root. The nodes of the tree
    * are flattened to a list by doing a depth-first traversal.
    * The column metadata contains the path in the schema for that column which can be
    * used to map columns to nodes in the schema.
-   * The first element is the root **/
-  2: required list<SchemaElement> schema;
+   * The first element is the root
+   *
+   * PAR1: Required
+   * PAR3: Use schema_page
+   *
+   * TODO: This might be too much (i.e. leave as a list for PAR3), but potentially useful for
+   *       wide schemas if a "schema index" is ever added.
+   **/
+  2: optional list<SchemaElement> schema;
+
+  /** A metadata page of BYTE_ARRAY data where each element is REQUIRED.
+   *
+   * Each element is a serialized SchemaElement. The order and content should
+   * have a one-to-one correspondence with schema.
+   *
+   * If encryption is applied to the footer, each element is encrypted individually.
+   */
+  10: optional MetadataPageLocation schema_page;
 
   /** Number of rows in this file **/
   3: required i64 num_rows
 
-  /** Row groups in this file **/
+  /** Row groups in this file
+   *
+   * TODO: Decide if this should be moved to a metadata page.
+   **/
   4: required list<RowGroup> row_groups
 
-  /** Optional key/value metadata **/
+  /** A metadata page of BYTE_ARRAY data where each element is REQUIRED.
+   *
+   * Each element is a serialized ColumnChunk. The number of elements is M * N,
+   * where M is the number of row groups in the file and N is the number of
+   * columns storing data. A column chunk's metadata object is stored at
+   * `m * N + column_index`, where m is the row-group index (e.g. with N = 3
+   * columns, row group 2 / column 1 is element 2 * 3 + 1 = 7).
+   *
+   * If encryption applies to the footer, each element in the page is encrypted
+   * individually.
+   *
+   * PAR1: Don't include
+   * PAR3: Required **/
+  11: optional MetadataPageLocation columns_page
+
+  /** Optional key/value metadata
+   * TODO: Consider if this should be moved to use a data page as well
+   **/
   5: optional list<KeyValue> key_value_metadata
 
   /** String for application that wrote this file. This should be in the format
@@ -1160,6 +1270,10 @@
    *
    * The obsolete min and max fields in the Statistics object are always sorted
    * by signed comparison regardless of column_orders.
+   *
+   * TODO: consider moving to a data page. While fast to decode, this potentially
+   * compresses/encodes extremely well since it is only a single value at the
+   * moment.
    */
   7: optional list<ColumnOrder> column_orders;