Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRAFT: Parquet 3 metadata with decoupled column metadata #242

Closed
wants to merge 1 commit into from

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented May 16, 2024

Parquet 3 metadata proposal

This is a very rough attempt at solving the problem of FileMetadata footprint and decoding cost, especially for Parquet files with many columns (think tens of thousands columns).

Context

This is in the context of the broader "Parquet v3" discussion on the mailing-list. A number of possible far-reaching changes are being collected in a document.

It is highly recommended that you read at least that document before commenting on this PR.

Specifically, some users would like to use Parquet files for data with tens of thousands of columns, and potentially hundreds or thousands of row groups. Reading the file-level metadata for such a file is prohibitively expensive given the current file structure where all column-level metadata is eagerly decoded as part of file-level metadata.

Contents

It includes a bunch of changes:

  1. a new "Parquet 3" file structure with backwards compatibility with legacy readers
  2. new Thrift structures allowing for decoupled decoding of file-level metadata and column metadata: file metadata is now O(n_columns + n_row_groups) instead of O(n_columns * n_row_groups)
  3. removal of outdated, redundant or undesirable fields from the new structures

Jira

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@pitrou pitrou force-pushed the v3-metadata branch 3 times, most recently from f6ad6e1 to 4adccbc Compare May 16, 2024 11:24
README.md Show resolved Hide resolved
src/main/thrift/parquet.thrift Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@tustvold
Copy link
Contributor

tustvold commented May 16, 2024

FWIW I'd be very interested to see how far we can push the current data structures with approaches like apache/arrow-rs#5775, before reaching for format changes.

I'd also observe that the column statistics can already be stored separately from FileMetadata, and if you do so you're really only left with a couple of integers... The schema strikes me as a bigger potential bottleneck, but also one that I can't help feeling is unavoidable...

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great initiative, @pitrou!

README.md Outdated Show resolved Hide resolved
src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
@@ -885,6 +971,44 @@ struct ColumnChunk {
9: optional binary encrypted_column_metadata
}

struct ColumnChunkV3 {
/** File where column data is stored. **/
1: optional string file_path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to keep this concept of having the metadata in a separate file? I did not see it working anywhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do support it in PyArrow (see tests here and here), and I think Dask makes use of it.

@jorisvandenbossche @mrocklin Am I right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be the summary metadata file, IIUC. cc @rdblue for more context.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferring this question from Dask's perspective to @fjetter and @phofl

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(thanks for pinging us)

Copy link
Member

@rok rok Jun 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyArrow provides write_metadata and parquet_dataset for such usecases. Original PR goes more in depth.
cc @jorisvandenbossche

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the benefits of using this _metadata index file should translate to cloud object stores too by reducing the number of objects/files to be read.

not really

  1. forces a full read of all generated files in job commit, which even if done in parallel is really slow. If it were to be done, it'd be better off done on demand in the first query. (note, faster reads would improve this)
  2. it doesn't work with the cloud committer design, which #1361 formalises without doing some bridging classes.

the reason for (2) is that the hadoop cloud-native committer design kept clear of making any changes to the superclass of ParquetOutputCommitter as it is a critical piece of code in so many existing workflows, and really hard to understand. Not just a co-recursive algorithm, but two intermingled algorithms, one of which lacks the correctness guarantees (failures during task commit can be recovered from).

with a move to table based formats rather than directory trees, that whole commit process becomes much easier as well as supporting atomic job commits on a table (including deletes!). And as you note, these formats can include schema info too.

Copy link
Member

@jorisvandenbossche jorisvandenbossche Jun 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For context, AFAIK the _metadata summary file was a practice originally used in Spark (and supported by Parquet-mr), and inspired on that for example also taken over in Dask. We then implemented support for this in Arrow C++ / PyArrow mostly based on the dask usage (as a downstream user of pyarrow). In the meantime though, Spark disabled writing those files by default a long time ago (https://issues.apache.org/jira/browse/SPARK-15719), and also dask stopped doing this 2 years ago (dask/dask#8901),

Another Parquet dev mailing list thread with some discussion about this: https://lists.apache.org/thread/142yj57c68s2ob5wkrs80xsjoksm7rb7

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adamreeve I see so the parquet file is one with all the metadata and all the data is in files pointed to by this singleton.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the benefits of using this _metadata index file should translate to cloud object stores too by reducing the number of objects/files to be read.

not really

Sorry I couldn't really follow this argument, that sounds like a Hadoop specific problem. To me a cloud object store means something like S3, and for our use case we're mostly concerned with reducing the number of objects that need to be read to satisfy read queries that filter data and don't need to read all files in a dataset, as we have many concurrent jobs running and adding load on storage.

Another Parquet dev mailing list thread with some discussion about this: https://lists.apache.org/thread/142yj57c68s2ob5wkrs80xsjoksm7rb7

Much of the discussion there seems to be related to issues users can run into if doing things like overwriting Parquet files or having heterogeneous schemas, which this feature was not designed for. But it sounds like others have also found this feature useful. I think this quote from Patrick Woody matches our experience: "outstandingly useful when you have well laid out data with a sort-order"

@adamreeve I see so the parquet file is one with all the metadata and all the data is in files pointed to by this singleton.

Yes, exactly.

The _metadata file format could have been designed so that the file_path field wasn't needed in the column chunk metadata. But it's there now and provides value to users while adding minimal overhead to those not using it (missing fields require zero space in serialized messages if I've understood the Thrift Compact Protocol correctly).

README.md Outdated Show resolved Hide resolved
@pitrou
Copy link
Member Author

pitrou commented May 16, 2024

FWIW I'd be very interested to see how far we can push the current data structures with approaches like apache/arrow-rs#5775, before reaching for format changes.

At first sight this would be a Rust-specific optimization. Also, while such improvements are good in themselves, they don't address the fundamental issue that file metadata size is currently O(n_row_groups * n_columns).

I'd also observe that the column statistics can already be stored separately from FileMetadata, and if you do so you're really only left with a couple of integers...

The main change in this PR is that a RowGroupV3 structure is O(1), instead of O(n_columns) for a RowGroup. The rest are assorted improvements.

@tustvold
Copy link
Contributor

tustvold commented May 16, 2024

they don't address the fundamental issue that file metadata size is currently O(n_row_groups * n_columns).

Is it not still - https://github.com/apache/parquet-format/pull/242/files#diff-834c5a8d91719350b20995ad99d1cb6d8d68332b9ac35694f40e375bdb2d3e7cR1337

Edit: Oh I see you lose the per row-groupness. Although there is nothing to prevent having one big row group...

@pitrou
Copy link
Member Author

pitrou commented May 16, 2024

Hmm. If SchemaElementV3 is an issue, we might further decouple things I suppose. Though I'm not sure how one would look up columns by names without decoding all the schema elements.

Writing one big row group is of course possible, but it probably comes with its own problems (such as RAM consumption in the writer?).

@tustvold
Copy link
Contributor

At first sight this would be a Rust-specific optimization

The same optimisation could be done in C++, borrows are just pointers with compiler enforced lifetimes, but I accept it might be harder to achieve something similar in managed languages like Java without at least some cost.

@pitrou
Copy link
Member Author

pitrou commented May 16, 2024

The same optimisation could be done in C++, borrows are just pointers with compiler enforced lifetimes

This assumes the Thrift C++ APIs allow this.

@pitrou pitrou force-pushed the v3-metadata branch 2 times, most recently from 9d4b0bd to e6a9088 Compare May 16, 2024 15:01
3: optional list<SortingColumn> sorting_columns

/** REMOVED from v1: file_offset.
* Use the OffsetIndex for each column instead.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is useful when we want to estimate the whole read range but do not want to read offset index.

Copy link
Member Author

@pitrou pitrou May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So my question would be: how do you estimate the read range if you only have the file offset, but not the on-disk length? Using total_compressed_size perhaps?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

total_compressed_size should give the range I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the current spec does not prohibit placing column chunks in random order. So we must use file_offset together with total_compressed_size to determine the read range. This is a trick used to place small column chunks together which may be fetched in a single I/O.

@pitrou pitrou force-pushed the v3-metadata branch 2 times, most recently from 9c71ce0 to 3d651b8 Compare May 16, 2024 15:26
@pitrou
Copy link
Member Author

pitrou commented May 16, 2024

I added a "PAR3 without legacy metadata" variation for the distant future.

will notice the "PAR3" magic number just before the File Metadata and will
instead read and decode the File Metadata v3.

### Parquet 3 without legacy metadata
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac @gszadovszky Let me know if it is a good idea to even mention this :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my 2 cents I think at some point we want to avoid the duplication.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. There will be a time when people will agree to remove the old metadata. I added the "PAR1" magic number specifically for this :)

@kiszk
Copy link
Member

kiszk commented May 16, 2024

This is not directly related to a new structure. However, it would be a good opportunity to explicitly declare the endianness of data and meta-data.

1: required i32 version

/** Parquet schema for this file **/
2: required list<SchemaElementV3> schema;
Copy link
Contributor

@emkornfield emkornfield May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lists cannot really be decoded in a random access manner. I suggest in V3 we should consider moving any list elements to a Page that has type byte_array where each element is a serialized struct (thrift or something else if we choose to move away from it).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For heavily nested nested lists we might want to separate type specific fields.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it matter here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, not here. I think I need to do a more formal auditing to see what makes sense.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skip lists do improve list random IO perf, or some variant where as the element list is built up somehow an offset to a later list element is added. But I don't know of any easy way to do that here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concrete suggestion is to introduce a page encoding that supports random access: #250 which IIUC is similar to the approach described here for Columns but allows for the solution to be more easily generalized.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you build random access at the encoding level, then how about the compression step?

<File-level Column N Metadata v3>

File Metadata v3
4-byte length in bytes of File Metadata v3 (little endian)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we might want a slightly extended. In particular, I think it is possible we want a digest (e.g. sha256) of the v3 header. This can serve two purposes:

  1. Be able to truly disambiguate the unlikely case that par3 ends here (e.g. unencoded data page where the last value in "PAR3".
  2. Additional ability to ensure contents are reliable.

Second I think we might want to record an offset to the "full footer" containing additional index information, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lastly, we might want to consider if compression is should be a setting, if we move enough stuff out of thrift this probably isn't a concern anymore.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. How many users have asked to sha256-protect the header, or how likely is it to get a corrupt header in the real world? We shouldn't start making gratuitous additions that are not backed by concrete use cases.

  2. I don't know what you mean with "full footer", are you talking about the FileMetadata struct (which is still here, just below)? Something else?

  3. As for a false positive "PAR3", there is indeed a small risk, though the additional "PAR3" at the beginning of the file should help disambiguate.

  4. What does compression have to do here? I'm not following you.

Copy link
Contributor

@emkornfield emkornfield May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. A very small fraction likely, a lighter-weight digest is also fine, we have digests in other parts of the spec, and I think the main reasion for likely not having it on the footer was to avoid compatibility issues.
  2. FileMetadata + Serialized metadata like indeces/bloom filters and anything we move to the data page. after all the column chunks is what I mean by "Full Footer"
  3. It isn't clear to me that everyone will check the header. This adds an additional IO for not too much benefit unless, the entire file is being retrieved from disk.
  4. Compressing the thrift serialized data to minimize size if consumers want the ultimate smallest file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can add a CRC32 here if other people want it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield IIUC the digest is not to protect for corruption but to make sure we do not mistakenly read a V3 footer in a file without one if we happen to see "PAR3" bytes before the V1 footer, correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can serve both purposes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I posit the latter is more important. Reading a valid encoded PAR1 file as PAR3 by accident is unlikely. But in zettabytes of data stored in parquet globally it will happen. When it happens someone is going to come here with a pitchfork.

Whereever we store a new metadata footer, its content hash needs to be stored somewhere for verification. A crc32 of the footer should be good enough.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whereever we store a new metadata footer, its content hash needs to be stored somewhere for verification. A crc32 of the footer should be good enough.

This is exactly what I'm suggesting, I think it solves both use-cases (sha1 is probably overkill). Today there is no digest on the footer as far as I'm know.

2: required i64 num_rows

/** If set, specifies a sort ordering of the rows in this row group. */
3: optional list<SortingColumn> sorting_columns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should be per file. I guess the downside is expense concatenating files with different sort orders.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea. Not knowing the precise use cases for this, I would err on the side of safety and not change this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to make the same comment. I would imagine the point of sorting_columns is to tip off a query engine as to which columns can be used for filtering (and perhaps also columns used as a compount key?). I can't see how it would make sense for this to vary per row group. Since we're blowing things up, I'd vote to move this to FileMetaData.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is kind of the same rationale as the codecs (easily being able to concatenate row groups with different sort orders)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is kind of the same rationale as the codecs (easily being able to concatenate row groups with different sort orders)

Yes, but it can be worked around by simply dropping sort order metadata if row groups have inconsistent values. This is different to codec.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is fair, but why lose the information per row-group it seems some engines could make sure of this, although it is probably superceded by page indexes?

*/

/** Total byte size of all the uncompressed column data in this row group **/
1: required i64 total_byte_size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this field is particularly useful, as it doesn't account for encodings. We might want a different field name with a clearer meaning.

Copy link
Member Author

@pitrou pitrou May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field was introduced in 2013, I'm not sure anyone still has a memory of the reasons. The author is named Nong Li, without a corresponding GitHub account. @nongli Is it you?
Also ping @julienledem, who was there too.

Copy link
Member Author

@pitrou pitrou May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be used by parquet-mr's ParquetInputFormat.getSplits, which hasn't changed much since 2014. It doesn't seemed used elsewhere in parquet-mr, and isn't used by Parquet C++ at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it might be used for a proxy as to actual data size, but it isn't a good one maybe we can keep it and always add another better field.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with removing this, FTR, I would just like to make sure this isn't actually useful for some unexpected use case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it was added before that. this commit is just referencing when we created the separate parquet-format repo. It might well be part of the original metadata. The redfile.thrift name refers to the old RedElm name before we rename it Parquet. It is the size of the decompressed (after decompression, before decoding) page. Agreed it's not super useful.

6: optional string created_by

/** NEW from v1: byte offset of FileColumnMetadataV3, for each column **/
7: required list<i64> file_column_metadata_offset;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this supposed to be at the beginning of the column chunk? Or is it part of the footer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having the column information in the footer is probably still useful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By "in the footer", you mean after all the data?
In the README above, it is laid out thusly, so I suppose it is in the footer :-)

[...]
<File-level Column 1 Metadata v3>
...
<File-level Column N Metadata v3>

File Metadata v3
[...]

Though, of course, this is purely advisory and writers are free to place them elsewhere.


/** NEW from v1: Optional key/value metadata for this column at the file level
**/
3: optional list<KeyValue> key_value_metadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might want to model this as a two data pages (key and value) or follow the suggestion above for simply using a page to individually store the encodings.

Copy link
Member Author

@pitrou pitrou May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is file-level metadata. Do we have reasons to believe there will be enough key-values to warrant the complexity of using dedicated data pages for this?

The more we deviate from Parquet 1 file organization, the more work it will create for implementors and the more potential for incompatibilites and bugs.

We should perhaps ask on the ML for opinions...

Edit: started a discussion on https://lists.apache.org/thread/9z4o0zbz34lt5jtllzfrr4gmxczddqyb

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is mostly due to the fact that I think we want to follow a policy of "lazy" decoding as much as possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that "lazy as much as possible" is that desirable. "Lazy where necessary" sounds more reasonable to me. Hence my question about typical metadata sizes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting all user-defined metadata in a list is subject to limitations from thrift. That's why we have to care about its size. Switching it to separate "page" or something may unblock other potentials like putting extra user-defined secondary index. For now, we can only choose a "black hole" from somewhere in the file and put its offset/length pair into the key_value_metadata if we want to add custom index.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting all user-defined metadata in a list is subject to limitations from thrift. That's why we have to care about its size.

Ok, but is it a practical concern? In Parquet C++ we have:

constexpr int32_t kDefaultThriftStringSizeLimit = 100 * 1000 * 1000;
// Structs in the thrift definition are relatively large (at least 300 bytes).
// This limits total memory to the same order of magnitude as
// kDefaultStringSizeLimit.
constexpr int32_t kDefaultThriftContainerSizeLimit = 1000 * 1000;

For now, we can only choose a "black hole" from somewhere in the file and put its offset/length pair into the key_value_metadata if we want to add custom index.

Well, you could also have a special-named column with 1 defined BYTE_ARRAY value for the piece of metadata you care about (or you could also model it more finely using Parquet types).

3: required i64 num_rows

/** Row groups in this file **/
4: required list<RowGroupV3> row_groups

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per discussion above, we would really like to move away from using a list for the row_groups so that individual row_groups can be read in a random access way. That is, we don't have to read data about all the row_groups if we just want a single row group from the parquet file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps that's true but is this something an engine would do other than when dealing with a sampling read or limit query?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that RowGroupV3 is heavily reduced compared to RowGroup. That said, I understand the concern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps that's true but is this something an engine would do other than when dealing with a sampling read or limit query?

I think yes? An engine might dispatch individual row-groups to different workers if there is external metadata to support knowing the number of row-group in a file before hand. Again given reduced size of row-groups this might be too much of a micro-optimization.

@corwinjoy
Copy link

@pitrou In conjunction with this change, if we want improved random access for row groups and columns I think this would also be a good time to upgrade the OffsetIndex / ColumnIndex in two key ways:

  1. Have OffsetIndex be stored in a random access way rather than using a list so that an individual page chunk can be loaded without needing to read the entire OffsetIndex array.
  2. Have OffsetIndex explicitly include the dictionary page in addition to any data pages so that column data can be directly loaded from the OffsetIndex without needing to get all offsets from the metadata.

I think this would make the ColumnIndex a lot more powerful as it could then be used for projection pushdown in a much faster way without the large overhead it has now.

@emkornfield
Copy link
Contributor

emkornfield commented May 16, 2024

@pitrou In conjunction with this change, if we want improved random access for row groups and columns I think this would also be a good time to upgrade the OffsetIndex / ColumnIndex in two key ways:

  1. Have OffsetIndex be stored in a random access way rather than using a list so that an individual page chunk can be loaded without needing to read the entire OffsetIndex array.
  2. Have OffsetIndex explicitly include the dictionary page in addition to any data pages so that column data can be directly loaded from the OffsetIndex without needing to get all offsets from the metadata.

I think this would make the ColumnIndex a lot more powerful as it could then be used for projection pushdown in a much faster way without the large overhead it has now.

@corwinjoy IMO, I think these are reasonable suggestions, but I think they can be handled as a follow-up once we align on design principles here. In general for dictionaries (and other "auxiliary") metadata we should maybe consider this more holistically, on how pages can be linked effectively.

@@ -467,6 +467,35 @@ struct SchemaElement {
10: optional LogicalType logicalType
}

struct SchemaElementV3 {
Copy link
Contributor

@emkornfield emkornfield May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Antoine, on the question of keeping implementation simpler, would it pay to not revise this and just reuse the existing one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 these changes are not necessary, we should mark the relevant bits in SchemaElement deprecated instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can, of course, but deprecated fields will probably never be removed from the format (except perhaps in 20 years).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are deprecated fields a problem?

Assuming a writer writes them, the writer wastes cpu and bytes.
A reader can chooce to parse them or ignore them (by removing/commenting them out) from the .thrift file. The latter means the deprecated fields will be ignored by the thrift parser.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mandating that readers remove fields from the official Parquet Thrift file sounds like a complicated and error-prone arrangement.

Copy link
Contributor

@alkis alkis May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no mandate. The official parquet thrift will comment them out.

  1. Writers compiled with old version of the official thrift file may write the fields.
  2. Writers compiled with new version of the official thrift file won't write the fields.
  3. Readers compiled with old version of the official thrift file may read the fields.
  4. Readers compiled with new version of the official thrift file will ignore the fields.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that, for any given implementation, we'll have either 1+3 (compiled with old version: backwards compatibility but no perf improvement when reading), or 2+4 (compiled with new version: better performance on read, but backwards compatibility is lost).

This doesn't satisfy the requirements we're trying to satisfy here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. The problem is old readers and new writers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with no cleanup of deprecated fields. Marking them semantically as logically required for PAR1 footers seems reasonable to me. I hope at some point standalone PAR3 would become the default and they would not need to be written if they present meaningful overhead once that happens.

from a wire compatibility perspective required->optional should be forward/backward compatible.

/** Nested fields. */
5: optional i32 num_children;

/** CHANGED from v1: from i32 to i64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you encountered a use-case for this? 2 Billion field ID seems quite high?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same. I don't think this change is needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. It seemed a bit arbitrary to limit the width of this, but I can undo the change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is backwards compatible and in the long run, it probably isn't a big deal either way. I as just curious if there as a reason?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No concrete reason actually :-)

@tustvold
Copy link
Contributor

tustvold commented May 16, 2024

Perhaps we could articulate the concrete use-cases we want to support with this? I understand that there is a desire to support extremely wide schemas of say 10,000 columns, but the precise nature of these columns eludes me?

The reason I ask this is if we stick with a standard page size of 1MB, then a 10,000 wide table with even distribution across the columns is unlikely to ever need multiple row groups - it will be 10GB just with just a single row group. This seems at odds with the stated motivation of this PR to avoid scaling per row group, which makes me think I am missing something.

Perhaps the use-case involves much smaller column chunks than normal, which would imply small pages, which might require changes beyond metadata if we want to support effectively? But at the same time I struggle to see why you would want to do this?

As an aside I did some toy benchmarking of parquet-rs, and confirmed that using thrift is perfectly fine, and can perform on par with flatbuffers - apache/arrow-rs#5770 (comment). It's a toy benchmark and should therefore be taken with a big grain of salt, but it at least would suggest 1us per column chunk is feasible


File Metadata v3
4-byte length in bytes of File Metadata v3 (little endian)
4-byte magic number "PAR3"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one more thought is potentially having a 16-bit or 32-bit flag map, initially set to zero if we do want to allow for future iterations on metadata intepretation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data point: we added this to the Arrow IPC format and nobody to my knowledge is making use of it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be superior to ParquetVersion of parquet-cpp in terms of representing a collection of enabled features.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what would be a concrete use case for those feature flags?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test if the reader is compatible and time to upgrade?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Readers can already scan the metadata and see if they recognize all encodings, etc.
Also, what would be the point of erroring out if a feature is optional?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of current discussion on evolution, some of these might come fruitition some of them won't.

As a concrete example if we have a bitmap, we not longer need different magic footer/header values for compression.

Other potential items that we might not want to close off in :

  1. Move to flatbuffers in the future.
  2. Allow for discontinous column chunks (i.e. require offset index to read data)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we can perhaps add 4 reserved bytes just before the magic number. But I would recommend against giving them any specific meaning until necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, for the concrete use case I meant to say "encryption" which I think we could use immediately?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that encryption would be an immediate use case (as you did in #250, btw).

@emkornfield
Copy link
Contributor

Perhaps we could articulate the concrete use-cases we want to support with this? I understand that there is a desire to support extremely wide schemas of say 10,000 columns, but the precise nature of these columns eludes me?

At least in datasets I've seen there are a small number of rows at least filtering (i.e. more columns then rows).

@julienledem
Copy link
Member

julienledem commented May 17, 2024

Thank you Antoine
On the mailing list Micah is collecting feedback in a document.
https://lists.apache.org/thread/61z98xgq2f76jxfjgn5xfq1jhxwm3jwf

Would you mind putting your feedback there?
We should collect the goals before jumping to solutions.
It is a bit difficult to discuss goals directly in the thrift metadata.

4-byte magic number "PAR1"
4-byte magic number "PAR3"

<Column 1 Chunk 1 + Column Metadata>
Copy link

@mhaseeb123 mhaseeb123 May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't <Column 1 Chunk 1 + Column 1 Chunk 1 Metadata> and so on be better here according to the V3 metadata format?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've literally copied this from the original snippet above. This proposal doesn't change anything in the layout here, so it would be confusing if I expressed it differently, IMHO.

Comment on lines +162 to +165
This file structure is backwards-compatible. Parquet 1 readers will read and
decode the legacy File Metadata in the file footer, while Parquet 3 readers
will notice the "PAR3" magic number just before the File Metadata and will
instead read and decode the File Metadata v3.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Efficient I/O can be challenging with this proposal. A reader needs to read the last 8 bytes of the file, then read 8 bytes before the legacy footer, figure out a v3 footer exists and then read that.

It would be better if the v3 metadata are at the end of the file, right before the 4-byte len + PAR1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you suggest a layout that preserves compatibility with PAR1 readers?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few options:

  1. Use FileMetadata.version to introduce a new version of the metadata. Starting from the minimal change that can be done in place (DRAFT: Incremental improvements to parquet metadata #248) we can bump the version and remove columns from RowGroup and decouple the column metadata completely.
  2. Add a binary field to FileMetadata named v3_metadata with tag number 10003. This field will encode the flatbuffer/thrift representation of the new footer. This field is going to be encoded last by thrift. Readers can manually look at the tail of the file and if they find this field, they can ignore the rest of the footer and parse these bytes only, ignoring the old style footer alltogether.

Copy link
Member Author

@pitrou pitrou May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal here is twofold:

  1. achieve better metadata parsing performance for PAR3 readers
  2. keep compatibility with PAR1 readers

This is why this proposal creates a separate array of structure types: so that PAR3 readers don't have to eagerly decode those pesky columns, while letting PAR1 readers correctly access column information.

I don't think any of your two proposals is able of achieving those two goals simultaneously, are they?
(admittedly, I'm not sure I understand proposal number 2, though it seems to require hand-coded Thrift parsing which doesn't sound like a tremendous idea)

Copy link
Contributor

@emkornfield emkornfield May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for 2, thrift could avoid parsing it assuming that we still follow the pattern of nested footer.

e.g. <field_marker 10003 and byte size><[serialized v3 metadatadata] + <v3 trailing bits (length, digest, feature bitmask)>"PAR3">0x0000<footer size>PAR1 as long as the byte size in the thrift header accounts for everything through PAR3 (as @alkis mentions below) it should work.

So the encoding/serialization would be manual but on decoding old readers should automatically drop the unknown field (it is possible some thrift implementations retain unknown fields, I know proto does) (i.e. the field ID 10003 should never actually be modeled in the schema).

"note 0x0000" is the stop field for structs if I am reading the thrift spec correctly

So the trade-offs of doing this approach are:

  1. A bit of extra data to be copied for readers accessing the original version.
  2. A guaranteed lower bound on amount of IO operations for V3 since it is incorporated into v2
  3. Potentially more memory utilization if accessing the original version if unknown fields are maintained by thrift implementation.

Effectively for doing the operation currently as proposed in V3 the trade-offs are reverse. I

Copy link
Contributor

@tustvold tustvold May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With fetch latencies of that order, does the decode latency of the existing thrift payload even matter? I would be interested to see any empirical data that would suggest the current metadata structures are an appreciable bottleneck for use-cases involving a catalog. I created a mailing list thread related to this here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the concrete numbers. This gives us more precise insight for us to work on (though not all Parquet files are read from S3; local reads and writes are a very common use case as well). However, I'm a bit surprised by your numbers because typical guidance for S3 involves larger reads (i.e. multi-megabyte).

Copy link
Contributor

@alkis alkis May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On S3 reads:

Guidance from S3 involves larger reads because of cost: reads from S3 typically cost an API call and bandwidth/transfer volume is free (same zone). 8MB reads will cost half the price of 4MB reads. Each connection can go up to ~100MB/sec so just transfering 4MB of data is at least 25ms. If one is reading large files, doing 100s of 8MB reads in parallel will saturate any network card - which is typically what engines do.

I posit that the vast majority of parquet encoded data is in some cloud somewhere (like >90% of all data). Hence working well with object stores (high latency, immutable files) is a requirement for any change. This is also lesson 4 from An Empirical Evaluation of Columnar Storage Formats.

With fetch latencies of that order, does the decode latency of the existing thrift payload even matter?

Yes it does. With the numbers above in mind:

  • cold reads: baseline 110ms (as above). Optimized metadata are down to 2mb and parsing in 5ms translates to 20ms + 30ms + 5ms = 55ms --> 2x speedup
  • warm reads: footer bytes are cached on disk/s3express/something-with-low-latency. It takes 5ms fetch + 40ms parse. Optimized, it takes 2ms to fetch + 5ms to parse = 7ms --> 6x speedup

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I posit that many PiB of parquet data lives in HDFS which also has read latencies, but not as bad as S3.
S3 standard has pretty bad latency for small byte reads plus you get billed. best to grab large amounts speculatively

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm late to this thread and haven't fully read it. I just want to give my +1 that the current proposal would require too many I/O requests and thus make this basically a deal breaker for high latency storage like S3. We would not use this due to this.

Any change that increases the number of data-dependent requests necessary to decode it basically a deal breaker for us and I guess is so for a lot of other data lake companies.


File Metadata v3
4-byte length in bytes of File Metadata v3 (little endian)
4-byte magic number "PAR3"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do encrypted footers work in this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully, the same way they work for PAR1, but I agree this will need to be spelled out more explicitly :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, might need some analysis. If we continue using "PARE", can the readers know if the file is v1 or v3?
Maybe we can use something like "PRE3" instead (PaRquet Encrypted, or "PQE3" - ParQuet Encrypted, or "PEF3" - parquet encrypted footer)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're right. I thought the "PARE" footer was in addition to the "PAR1" footer, but apparently it replaces it. Darn.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would using PAR3 + bit in a bitmap to represent encryption be a reasonable approach here to reduce the proliferation of magic footer values?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, as long as we can read the bitmap before decrypting the footer

@@ -885,6 +971,44 @@ struct ColumnChunk {
9: optional binary encrypted_column_metadata
}

struct ColumnChunkV3 {
/** File where column data is stored. **/
1: optional string file_path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on removing this if we are doing breaking changes.

3: required i64 num_rows

/** Row groups in this file **/
4: required list<RowGroupV3> row_groups
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps that's true but is this something an engine would do other than when dealing with a sampling read or limit query?

/** Nested fields. */
5: optional i32 num_children;

/** CHANGED from v1: from i32 to i64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same. I don't think this change is needed.

@@ -467,6 +467,35 @@ struct SchemaElement {
10: optional LogicalType logicalType
}

struct SchemaElementV3 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 these changes are not necessary, we should mark the relevant bits in SchemaElement deprecated instead.

@@ -835,6 +864,65 @@ struct ColumnMetaData {
16: optional SizeStatistics size_statistics;
}

struct ColumnChunkMetaDataV3 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to SchemaElement, we can reuse the existing struct and deprecate the useless fields.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also don't we need a index into the list of SchemaElement to reference it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm misunderstanding something, there should be one FileColumnMetadataV3 element per leaf SchemaElement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having an index referencing a SchemaElement means that:

  1. a writer can skip encoding columns that do not have values in a rowgroup range
  2. a writer can encode/write columns in different order than metadata

(1) is important when schemata are very wide but data is sparse.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, ignoring the fact that Parquet is currently not a sparse format, your proposal implies that readers have to do a O(n) search to find a given column?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readers have to do a O(n) search to find a given column?

Why would they need an O(n) search? The index indexes an SchemaElement[] in java or std::vector<SchemaElement> in C++ which is O(1).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's step back because I think I am not following you.

Here is the file organization as proposed here:

  1. FileMetaDataV3 points to one FileColumnMetadataV3 per column
  2. FileColumnMetadataV3 points to one ColumnChunkV3 per row group

I'm not sure what you're proposing exactly when you mean "can skip encoding columns that do not have values"? This is currently not possible using Parquet, as at least the corresponding definition levels would be expected (even if they are all 0).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, ignoring the fact that Parquet is currently not a sparse format, your proposal implies that readers have to do a O(n) search to find a given column?

IIUC, Finding a column via schema elements today is also O(N) assuming no nesting. I think the difference is today the first thing implementations do create an efficient dictionary structure to amortize lookup of further columns.

I think if we want fast lookups without building any additional dictionaries in memory we should be considering a new stored index structure (or reconsider how we organize schema elements instead of a straight BFS).

/** REMOVED from v1: index_page_offset (unused in practice?) */

/** Compression codec **/
1: required CompressionCodec codec
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could go in the SchemaElement since it should not in principle vary between row groups.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, right now it's possible to make it vary. I don't know if any writers make use of this possibility, but I'm not sure there's any harm in keeping it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enforcing same codec to all row groups will prohibit fast merging row groups of different parquet files without rewriting chunk data. So I vote for keeping it as is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The biggest culprit in parsing metadata is Statistics because every one of the values is a binary variable length thing. We could improve Statistics trivially and in place if we add a few fixed size fields:

struct Statistics {
   /**
    * DEPRECATED
    */
   1: optional binary max;
   2: optional binary min;
   /** count of null value in the column */
   3: optional i64 null_count;
   /** count of distinct values occurring */
   4: optional i64 distinct_count;
   /**
    * Only one pair of min/max will be populated. For fixed sized types, one of the minN/maxN variants will be used. Otherwise min_value/max_value is used.
    */
   5: optional binary max_value;
   6: optional binary min_value;

   7: optional byte max1;
   8: optional byte min1;
   9: optional i32 max4;
   10: optional i32 min4;
   11: optional i64 max8;
   12: optional i64 min8;
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since that structure is deprecated, do we really want to "improve" it, or should we focus our efforts on non-deprecated structures such as ColumnIndex?

More generally, this should probably be discussed in a separate ML thread, for better separation of concerns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't aware the whole structure was deprecated. I thought only max amd min fields are deprecated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is a separate topic. I still want to say that Apache ORC has defined separate statistics for different types: https://github.com/apache/orc-format/blob/6e30e63f5071b616ec5cedaac7b2e0a98ae377c5/src/main/proto/orc/proto/orc_proto.proto#L87-L99. But the dirty thing is that we might still need some sort of encoding for data types which do not have direct representation from protobuf or thrift, e.g. decimal type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having an encoding shouldn't be a problem as long as the encoding is fast enough to decode (e.g. PLAIN), should it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree separate topic. I think you still need variable length types that rely on byte_array

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you. Defining specific statistics for different data types adds complexity as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a union work here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't aware the whole structure was deprecated. I thought only max amd min fields are deprecated.

Ahah, sorry. I think you're right actually.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought only max amd min fields are deprecated.

Sigh, a unrelated issue is that currently min max might still be written even if min_value and max_value is being provided.

Copy link
Contributor

@alkis alkis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replied to comments. I also put a strawman in the comments for minimal changes we can do to improve FileMetadata decoding speed: #248.

Most of the bottleneck in decoding FileMetadata are variable length fields: list<> and binary. With those removed/elided we can start reducing the pain today and get to a much better state when we have a solid design for larger change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't aware the whole structure was deprecated. I thought only max amd min fields are deprecated.

@@ -835,6 +864,65 @@ struct ColumnMetaData {
16: optional SizeStatistics size_statistics;
}

struct ColumnChunkMetaDataV3 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having an index referencing a SchemaElement means that:

  1. a writer can skip encoding columns that do not have values in a rowgroup range
  2. a writer can encode/write columns in different order than metadata

(1) is important when schemata are very wide but data is sparse.

<File-level Column N Metadata v3>

File Metadata v3
4-byte length in bytes of File Metadata v3 (little endian)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield IIUC the digest is not to protect for corruption but to make sure we do not mistakenly read a V3 footer in a file without one if we happen to see "PAR3" bytes before the V1 footer, correct?

Comment on lines +162 to +165
This file structure is backwards-compatible. Parquet 1 readers will read and
decode the legacy File Metadata in the file footer, while Parquet 3 readers
will notice the "PAR3" magic number just before the File Metadata and will
instead read and decode the File Metadata v3.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few options:

  1. Use FileMetadata.version to introduce a new version of the metadata. Starting from the minimal change that can be done in place (DRAFT: Incremental improvements to parquet metadata #248) we can bump the version and remove columns from RowGroup and decouple the column metadata completely.
  2. Add a binary field to FileMetadata named v3_metadata with tag number 10003. This field will encode the flatbuffer/thrift representation of the new footer. This field is going to be encoded last by thrift. Readers can manually look at the tail of the file and if they find this field, they can ignore the rest of the footer and parse these bytes only, ignoring the old style footer alltogether.

Comment on lines +985 to +987
/** NEW from v1: Byte length in file_path of ColumnChunkMetaDataV3, optionally encrypted
**/
3: required i32 metadata_file_length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you :)

The naming is a little awkward...I read these as "metadata-file offset"/"metadata-file length". Perhaps instead just "metadata_offset" and "metadata_length"?

Aside: just curious, but in the current format how do you make use of the file_offset? Is there a way to deduce the length of the metadata? Or do you have to use a file based reader and seek to the offset?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, do we still want to put a copy ColumnMetaData at the end of column chunk and another copy here at 4: optional binary encoded_metadata? I know it is good to keep backward compatibility. Does any implementation actually read it from the end of column chunk?

/** NEW from v1: byte offset of FileColumnMetadataV3, for each column **/
7: required list<i64> file_column_metadata_offset;
/** NEW from v1: byte length of FileColumnMetadataV3, for each column **/
8: required list<i32> file_column_metadata_length;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it too late to add something like below for all offset/length pair?

struct Reference {
  1: i64 offset
  2: i32 length
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're looking to speed up thrift deserialization, I'd bet two int lists are going to be faster to parse than a list of structs. If the metadata objects are contiguous, maybe instead extend the offsets list by one and use deltas to calculate the lengths.

Copy link
Member Author

@pitrou pitrou May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going with the same intuition as @etseidl 's, though I don't have any precise insight into typical Thrift deserializer performance characteristics.

Mandating contiguous column metadata objects and using N+1 offsets is an intriguing idea. It could perhaps allow preloading many column metadata at once more easily.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to N+1 offsets. They are going to parse a lot faster (>2x) than structs.

8: required list<i32> file_column_metadata_length;

/** REMOVED from v1: column_orders.
** Use `FileColumnMetadataV3.column_order` instead.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

@emkornfield emkornfield May 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, I misread where this was being moved to, this seems fine.

src/main/thrift/parquet.thrift Show resolved Hide resolved

/** NEW from v1: Optional key/value metadata for this column at the file level
**/
3: optional list<KeyValue> key_value_metadata
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting all user-defined metadata in a list is subject to limitations from thrift. That's why we have to care about its size. Switching it to separate "page" or something may unblock other potentials like putting extra user-defined secondary index. For now, we can only choose a "black hole" from somewhere in the file and put its offset/length pair into the key_value_metadata if we want to add custom index.

3: optional list<SortingColumn> sorting_columns

/** REMOVED from v1: file_offset.
* Use the OffsetIndex for each column instead.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the current spec does not prohibit placing column chunks in random order. So we must use file_offset together with total_compressed_size to determine the read range. This is a trick used to place small column chunks together which may be fetched in a single I/O.

2: required i64 num_rows

/** If set, specifies a sort ordering of the rows in this row group. */
3: optional list<SortingColumn> sorting_columns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is kind of the same rationale as the codecs (easily being able to concatenate row groups with different sort orders)

Yes, but it can be worked around by simply dropping sort order metadata if row groups have inconsistent values. This is different to codec.

Comment on lines +985 to +987
/** NEW from v1: Byte length in file_path of ColumnChunkMetaDataV3, optionally encrypted
**/
3: required i32 metadata_file_length
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, do we still want to put a copy ColumnMetaData at the end of column chunk and another copy here at 4: optional binary encoded_metadata? I know it is good to keep backward compatibility. Does any implementation actually read it from the end of column chunk?

...
<Column N Chunk M + Column Metadata>

<File-level Column 1 Metadata v3>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we'll have a file-level column metadata, we can optimize the storage of key_metadata of encrypted columns - by keeping the column key_metadata here, instead of the ColumnChunk structures.
Note: this is a storage-only optimization [O(N) instead of O(NxM)]; the reader processes only one key_metadata object per column [O(N)] already today.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that each column uses a single key_metadata for all chunks, this sounds like a good idea.

## Encryption

Encryption with footer encryption enabled changes the above file structure slightly.
In particular, the "PAR1" magic number is replaced with "PARE".
Copy link
Contributor

@ggershinsky ggershinsky May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

8: optional i32 column_index_length

/** Crypto metadata of encrypted columns **/
9: optional ColumnCryptoMetaData crypto_metadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed if we keep this in FileColumnMetadataV3

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed especially if it can't be different for individual chunks

@@ -1165,6 +1317,62 @@ struct FileMetaData {
9: optional binary footer_signing_key_metadata
}

/** Metadata for a column in this file. */
struct FileColumnMetadataV3 {
/** All column chunks in this file (one per row group) **/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add in this column in the comment? also, rename the field to chunks?

/** Metadata for a column in this file. */
struct FileColumnMetadataV3 {
  /** All column chunks in this column (in this file - one per row group) **/
  1: required list<ColumnChunkV3> chunks

/** NEW from v1: Optional key/value metadata for this column at the file level
**/
3: optional list<KeyValue> key_value_metadata
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can add the column encryption key_metadata here,

/** Crypto metadata of encrypted column **/
  4: optional ColumnCryptoMetaData crypto_metadata

Copy link

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This design adds another GET/read to the footer read phase.

The offset to the v3 metadata needs to be stored at a fixed offset off the tail of the file, so a single read of the last N bytes will pull it in, just as is done for v1. Even that is inefficient as a single high-latency GET only gets back 8 bytes right now: apps should download 1+MB or more for the same cost.

Will v1 readers suffer if there is extra data at the tail of their normal metadata? If they all work today (first bit of regression testing...) then I'd actually propose a block at the end with more info, for this and later releases.

so you'd have something like

byte[8] v3_position (not offset, actual position)
byte[8] v3_length
byte[4] "PAR3"
byte[4] v1_offset
byte[4] magic ["PAR1" | "PARE"]

For this to work with existing readers which are given bytes [len-(v1-offset), len-8] to decode, the existing readers must be OK with some trailing bytes they don't recognise. Does this hold? including for encrypted metadata?

If does, a v3 reader could open a file, read at least the last 32 bytes, check for being par3 by the two magic numbers, then, if true, go to v3 metadata.
If it doesn't, things get worse when opening a file. It'd actually make sense to grab the last, say, 2-4 MB in the file and go from there.
This may be better anyway, which is why abfs/gcs connectors cache footers, ongoing s3a prefetch will probably go for the final 8 MB.
but do in parquet and it'd be consistent everywhere rather than hoping the layers underneath do the right thing.

(side issue, table indices such as iceberg should store this footer too, or at least key values)

This makes me realise there's another aspect to regression testing here: samples of v3 format files must be given to existing v2 readers to verify they don't break. Are there any repos which do this yet?

@@ -81,7 +81,13 @@ more pages.
- Encoding/Compression - Page

## File format
This file and the [Thrift definition](src/main/thrift/parquet.thrift) should be read together to understand the format.

This file and the [Thrift definition](src/main/thrift/parquet.thrift) should be read

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"MUST", RFC-2119 style, maybe

@@ -107,12 +113,97 @@ start locations. More details on what is contained in the metadata can be found
in the Thrift definition.

Metadata is written after the data to allow for single pass writing.
This is especially useful when writing to backends such as S3.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a requirement for HDFS too, S3 is simply a beneficiary of a design requirement from the outset of an append-only filesystem being the target from day 1

Comment on lines +162 to +165
This file structure is backwards-compatible. Parquet 1 readers will read and
decode the legacy File Metadata in the file footer, while Parquet 3 readers
will notice the "PAR3" magic number just before the File Metadata and will
instead read and decode the File Metadata v3.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I posit that many PiB of parquet data lives in HDFS which also has read latencies, but not as bad as S3.
S3 standard has pretty bad latency for small byte reads plus you get billed. best to grab large amounts speculatively

<File-level Column N Metadata v3>

File Metadata v3
4-byte length in bytes of File Metadata v3 (little endian)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

offset + length


Readers are expected to first read the file metadata to find all the column
chunks they are interested in. The columns chunks should then be read sequentially.

![File Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)

### Parquet 3

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

People, who are not involved in the discussion of Parquet 3, might wonder why suddenly Parquet '3'?

@pitrou
Copy link
Member Author

pitrou commented Jun 4, 2024

@alkis @JFinis and others, just a quick note that you've convinced me that this proposal is suboptimal for footer access latency, and something better is required. I hope we can design something that's reasonably clear and easy to implement.

@JFinis
Copy link
Contributor

JFinis commented Jun 6, 2024 via email

@emkornfield
Copy link
Contributor

@pitrou can we close this now?

@pitrou
Copy link
Member Author

pitrou commented Nov 8, 2024

@pitrou can we close this now?

Yes, we should! Thanks for the reminder.

@pitrou pitrou closed this Nov 8, 2024
@pitrou pitrou deleted the v3-metadata branch November 8, 2024 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.