DRAFT: Parquet 3 metadata with decoupled column metadata #242
Conversation
FWIW I'd be very interested to see how far we can push the current data structures with approaches like apache/arrow-rs#5775, before reaching for format changes. I'd also observe that the column statistics can already be stored separately from FileMetadata, and if you do so you're really only left with a couple of integers... The schema strikes me as a bigger potential bottleneck, but also one that I can't help feeling is unavoidable...
Great initiative, @pitrou!
@@ -885,6 +971,44 @@ struct ColumnChunk {
9: optional binary encrypted_column_metadata
}

struct ColumnChunkV3 {
/** File where column data is stored. **/
1: optional string file_path
Do we want to keep this concept of having the metadata in a separate file? I did not see it working anywhere.
I think we do support it in PyArrow (see tests here and here), and I think Dask makes use of it.
@jorisvandenbossche @mrocklin Am I right?
This should be the summary metadata file, IIUC. cc @rdblue for more context.
(thanks for pinging us)
PyArrow provides write_metadata and parquet_dataset for such usecases. Original PR goes more in depth.
cc @jorisvandenbossche
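For reference, a rough sketch of that PyArrow pattern, based on the PyArrow documentation (the paths are made up, and keyword support such as metadata_collector may vary between PyArrow versions):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})

# Collect the FileMetaData of every data file written into the dataset.
metadata_collector = []
pq.write_to_dataset(table, root_path="dataset_root",
                    metadata_collector=metadata_collector)

# Write the _metadata summary file combining the footers of all data files
# (each row group entry carries the file_path of the file it lives in).
pq.write_metadata(table.schema, "dataset_root/_metadata",
                  metadata_collector=metadata_collector)

# A reader can then open the whole dataset from the summary file alone,
# without listing or opening the individual data files.
dataset = pq.parquet_dataset("dataset_root/_metadata")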
And the benefits of using this _metadata index file should translate to cloud object stores too by reducing the number of objects/files to be read.
not really
1. It forces a full read of all generated files in job commit, which even if done in parallel is really slow. If it were to be done, it would be better done on demand in the first query. (Note: faster reads would improve this.)
2. It doesn't work with the cloud committer design, which #1361 formalises, without some bridging classes.

The reason for (2) is that the Hadoop cloud-native committer design kept clear of making any changes to the superclass of ParquetOutputCommitter, as it is a critical piece of code in so many existing workflows and really hard to understand: not just a co-recursive algorithm, but two intermingled algorithms, one of which lacks the correctness guarantee that failures during task commit can be recovered from.

With a move to table-based formats rather than directory trees, that whole commit process becomes much easier, as well as supporting atomic job commits on a table (including deletes!). And as you note, these formats can include schema info too.
For context, AFAIK the _metadata summary file was a practice originally used in Spark (and supported by Parquet-mr); inspired by that, it was for example also adopted in Dask. We then implemented support for this in Arrow C++ / PyArrow mostly based on the Dask usage (as a downstream user of pyarrow). In the meantime, though, Spark disabled writing those files by default a long time ago (https://issues.apache.org/jira/browse/SPARK-15719), and Dask also stopped doing this 2 years ago (dask/dask#8901).
Another Parquet dev mailing list thread with some discussion about this: https://lists.apache.org/thread/142yj57c68s2ob5wkrs80xsjoksm7rb7
@adamreeve I see, so the _metadata file is a Parquet file with all the metadata, and all the data is in files pointed to by this singleton.
And the benefits of using this _metadata index file should translate to cloud object stores too by reducing the number of objects/files to be read.
not really
Sorry I couldn't really follow this argument, that sounds like a Hadoop specific problem. To me a cloud object store means something like S3, and for our use case we're mostly concerned with reducing the number of objects that need to be read to satisfy read queries that filter data and don't need to read all files in a dataset, as we have many concurrent jobs running and adding load on storage.
Another Parquet dev mailing list thread with some discussion about this: https://lists.apache.org/thread/142yj57c68s2ob5wkrs80xsjoksm7rb7
Much of the discussion there seems to be related to issues users can run into if doing things like overwriting Parquet files or having heterogeneous schemas, which this feature was not designed for. But it sounds like others have also found this feature useful. I think this quote from Patrick Woody matches our experience: "outstandingly useful when you have well laid out data with a sort-order"
@adamreeve I see, so the _metadata file is a Parquet file with all the metadata, and all the data is in files pointed to by this singleton.
Yes, exactly.
The _metadata file format could have been designed so that the file_path field wasn't needed in the column chunk metadata. But it's there now and provides value to users while adding minimal overhead to those not using it (missing fields require zero space in serialized messages, if I've understood the Thrift Compact Protocol correctly).
At first sight this would be a Rust-specific optimization. Also, while such improvements are good in themselves, they don't address the fundamental issue that file metadata size is currently O(n_row_groups * n_columns).
The main change in this PR is that a [...]
Is it not still - https://github.com/apache/parquet-format/pull/242/files#diff-834c5a8d91719350b20995ad99d1cb6d8d68332b9ac35694f40e375bdb2d3e7cR1337 Edit: Oh I see, you lose the per-row-groupness. Although there is nothing to prevent having one big row group...

Hmm. Writing one big row group is of course possible, but it probably comes with its own problems (such as RAM consumption in the writer?).

The same optimisation could be done in C++; borrows are just pointers with compiler-enforced lifetimes, but I accept it might be harder to achieve something similar in managed languages like Java without at least some cost.

This assumes the Thrift C++ APIs allow this.
3: optional list<SortingColumn> sorting_columns

/** REMOVED from v1: file_offset.
* Use the OffsetIndex for each column instead.
I believe this is useful when we want to estimate the whole read range but do not want to read the offset index.

So my question would be: how do you estimate the read range if you only have the file offset, but not the on-disk length? Using total_compressed_size perhaps?

total_compressed_size should give the range I think.

IIUC, the current spec does not prohibit placing column chunks in random order. So we must use file_offset together with total_compressed_size to determine the read range. This is a trick used to place small column chunks together which may be fetched in a single I/O.
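A minimal sketch of that estimation (illustrative only; it assumes the RowGroup's file_offset and total_compressed_size are both populated and that its column chunks are laid out contiguously, which the current spec does not strictly guarantee):

def row_group_read_range(row_group):
    # RowGroup.file_offset marks the first byte of the row group's data;
    # total_compressed_size covers all of its column chunks, so together
    # they bound a single I/O that fetches the whole row group.
    start = row_group.file_offset
    return (start, start + row_group.total_compressed_size)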
I added a "PAR3 without legacy metadata" variation for the distant future.
will notice the "PAR3" magic number just before the File Metadata and will
instead read and decode the File Metadata v3.

### Parquet 3 without legacy metadata
@wgtmac @gszadovszky Let me know if it is a good idea to even mention this :-)
My 2 cents: I think at some point we want to avoid the duplication.
I agree. There will be a time when people will agree to remove the old metadata. I added the "PAR1" magic number specifically for this :)
This is not directly related to a new structure. However, it would be a good opportunity to explicitly declare the endianness of data and meta-data.
1: required i32 version

/** Parquet schema for this file **/
2: required list<SchemaElementV3> schema;
Lists cannot really be decoded in a random access manner. I suggest in V3 we should consider moving any list elements to a Page that has type byte_array where each element is a serialized struct (thrift or something else if we choose to move away from it).
For heavily nested lists we might want to separate type-specific fields.
Does it matter here?
No, not here. I think I need to do a more formal auditing to see what makes sense.
Skip lists do improve random-access performance for lists, or some variant where, as the element list is built up, an offset to a later list element is added. But I don't know of any easy way to do that here.
My concrete suggestion is to introduce a page encoding that supports random access: #250 which IIUC is similar to the approach described here for Columns but allows for the solution to be more easily generalized.
If you build random access at the encoding level, then how about the compression step?
<File-level Column N Metadata v3>

File Metadata v3
4-byte length in bytes of File Metadata v3 (little endian)
I think we might want something slightly extended here. In particular, I think it is possible we want a digest (e.g. sha256) of the v3 header. This can serve two purposes:
- Be able to truly disambiguate the unlikely case that "PAR3" appears here by accident (e.g. an unencoded data page whose last value ends in "PAR3").
- Additional ability to ensure contents are reliable.
Second, I think we might want to record an offset to the "full footer" containing additional index information, etc.
Lastly, we might want to consider whether compression should be a setting; if we move enough stuff out of thrift, this probably isn't a concern anymore.
- How many users have asked to sha256-protect the header, or how likely is it to get a corrupt header in the real world? We shouldn't start making gratuitous additions that are not backed by concrete use cases.
- I don't know what you mean with "full footer"; are you talking about the FileMetadata struct (which is still here, just below)? Something else?
- As for a false positive "PAR3", there is indeed a small risk, though the additional "PAR3" at the beginning of the file should help disambiguate.
- What does compression have to do here? I'm not following you.
- A very small fraction, likely. A lighter-weight digest is also fine; we have digests in other parts of the spec, and I think the main reason we likely don't have one on the footer was to avoid compatibility issues.
- By "full footer" I mean FileMetadata plus serialized metadata like indices/bloom filters and anything we move to data pages, after all the column chunks.
- It isn't clear to me that everyone will check the header. That adds an additional IO for not too much benefit, unless the entire file is being retrieved from disk.
- Compressing the thrift-serialized data to minimize size, if consumers want the ultimate smallest file.
I guess we can add a CRC32 here if other people want it.
@emkornfield IIUC the digest is not to protect for corruption but to make sure we do not mistakenly read a V3 footer in a file without one if we happen to see "PAR3" bytes before the V1 footer, correct?
I think it can serve both purposes.
I posit the latter is more important. Reading a valid encoded PAR1 file as PAR3 by accident is unlikely. But in zettabytes of data stored in parquet globally it will happen. When it happens, someone is going to come here with a pitchfork.
Wherever we store a new metadata footer, its content hash needs to be stored somewhere for verification. A crc32 of the footer should be good enough.
Wherever we store a new metadata footer, its content hash needs to be stored somewhere for verification. A crc32 of the footer should be good enough.
This is exactly what I'm suggesting; I think it solves both use cases (sha1 is probably overkill). Today there is no digest on the footer as far as I know.
2: required i64 num_rows

/** If set, specifies a sort ordering of the rows in this row group. */
3: optional list<SortingColumn> sorting_columns
I wonder if this should be per file. I guess the downside is the expense of concatenating files with different sort orders.
I have no idea. Not knowing the precise use cases for this, I would err on the side of safety and not change this.
I was going to make the same comment. I would imagine the point of sorting_columns is to tip off a query engine as to which columns can be used for filtering (and perhaps also columns used as a compound key?). I can't see how it would make sense for this to vary per row group. Since we're blowing things up, I'd vote to move this to FileMetaData.
I think this is kind of the same rationale as the codecs (easily being able to concatenate row groups with different sort orders)
I think this is kind of the same rationale as the codecs (easily being able to concatenate row groups with different sort orders)
Yes, but it can be worked around by simply dropping sort order metadata if row groups have inconsistent values. This is different to codec.
That is fair, but why lose the per-row-group information? It seems some engines could make use of it, although it is probably superseded by page indexes.
*/

/** Total byte size of all the uncompressed column data in this row group **/
1: required i64 total_byte_size
I'm not sure if this field is particularly useful, as it doesn't account for encodings. We might want a different field name with a clearer meaning.
This field was introduced in 2013, I'm not sure anyone still has a memory of the reasons. The author is named Nong Li, without a corresponding GitHub account. @nongli Is it you?
Also ping @julienledem, who was there too.
It seems to be used by parquet-mr's ParquetInputFormat.getSplits, which hasn't changed much since 2014. It doesn't seem to be used elsewhere in parquet-mr, and isn't used by Parquet C++ at all.
I guess it might be used as a proxy for actual data size, but it isn't a good one. Maybe we can keep it and add another, better field.
I'm ok with removing this, FTR, I would just like to make sure this isn't actually useful for some unexpected use case.
I believe it was added before that; this commit is just referencing when we created the separate parquet-format repo. It might well be part of the original metadata. The redfile.thrift name refers to the old RedElm name before we renamed it Parquet. It is the size of the decompressed (after decompression, before decoding) page. Agreed it's not super useful.
6: optional string created_by

/** NEW from v1: byte offset of FileColumnMetadataV3, for each column **/
7: required list<i64> file_column_metadata_offset;
is this supposed to be at the beginning of the column chunk? Or is it part of the footer?
I think having the column information in the footer is probably still useful.
By "in the footer", you mean after all the data?
In the README above, it is laid out thusly, so I suppose it is in the footer :-)
[...]
<File-level Column 1 Metadata v3>
...
<File-level Column N Metadata v3>
File Metadata v3
[...]
Though, of course, this is purely advisory and writers are free to place them elsewhere.
/** NEW from v1: Optional key/value metadata for this column at the file level
**/
3: optional list<KeyValue> key_value_metadata
We might want to model this as two data pages (key and value), or follow the suggestion above of simply using a page to individually store the encodings.
This is file-level metadata. Do we have reasons to believe there will be enough key-values to warrant the complexity of using dedicated data pages for this?
The more we deviate from the Parquet 1 file organization, the more work it will create for implementors and the more potential for incompatibilities and bugs.
We should perhaps ask on the ML for opinions...
Edit: started a discussion on https://lists.apache.org/thread/9z4o0zbz34lt5jtllzfrr4gmxczddqyb
it is mostly due to the fact that I think we want to follow a policy of "lazy" decoding as much as possible.
I'm not convinced that "lazy as much as possible" is that desirable. "Lazy where necessary" sounds more reasonable to me. Hence my question about typical metadata sizes.
Putting all user-defined metadata in a list is subject to limitations from thrift. That's why we have to care about its size. Switching it to a separate "page" or something may unblock other potential uses, like adding an extra user-defined secondary index. For now, we can only choose a "black hole" somewhere in the file and put its offset/length pair into the key_value_metadata if we want to add a custom index.
Putting all user-defined metadata in a list is subject to limitations from thrift. That's why we have to care about its size.
Ok, but is it a practical concern? In Parquet C++ we have:
constexpr int32_t kDefaultThriftStringSizeLimit = 100 * 1000 * 1000;
// Structs in the thrift definition are relatively large (at least 300 bytes).
// This limits total memory to the same order of magnitude as
// kDefaultStringSizeLimit.
constexpr int32_t kDefaultThriftContainerSizeLimit = 1000 * 1000;
For now, we can only choose a "black hole" somewhere in the file and put its offset/length pair into the key_value_metadata if we want to add a custom index.
Well, you could also have a special-named column with 1 defined BYTE_ARRAY value for the piece of metadata you care about (or you could also model it more finely using Parquet types).
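To illustrate that second option, here is a hedged sketch of storing a custom index blob as a regular BYTE_ARRAY column with a single defined value rather than in key_value_metadata (the column name, blob, and file path are hypothetical):

import pyarrow as pa
import pyarrow.parquet as pq

index_blob = b"...serialized user-defined secondary index..."

table = pa.table({
    "value": pa.array([1, 2, 3], type=pa.int64()),
    # One defined value holding the blob, padded with nulls so the
    # column has the same length as the data columns.
    "_custom_index": pa.array([index_blob, None, None], type=pa.binary()),
})
pq.write_table(table, "data_with_index.parquet")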
3: required i64 num_rows

/** Row groups in this file **/
4: required list<RowGroupV3> row_groups
As per discussion above, we would really like to move away from using a list for the row_groups so that individual row_groups can be read in a random access way. That is, we don't have to read data about all the row_groups if we just want a single row group from the parquet file.
Perhaps that's true but is this something an engine would do other than when dealing with a sampling read or limit query?
Note that RowGroupV3 is heavily reduced compared to RowGroup. That said, I understand the concern.
Perhaps that's true but is this something an engine would do other than when dealing with a sampling read or limit query?
I think yes? An engine might dispatch individual row groups to different workers if there is external metadata that makes the number of row groups in a file known beforehand. Again, given the reduced size of the row-group metadata, this might be too much of a micro-optimization.
@pitrou In conjunction with this change, if we want improved random access for row groups and columns I think this would also be a good time to upgrade the OffsetIndex / ColumnIndex in two key ways: [...]
I think this would make the ColumnIndex a lot more powerful as it could then be used for projection pushdown in a much faster way without the large overhead it has now.
@corwinjoy IMO, I think these are reasonable suggestions, but I think they can be handled as a follow-up once we align on design principles here. In general, for dictionaries (and other "auxiliary" metadata) we should maybe consider this more holistically: how pages can be linked effectively.
@@ -467,6 +467,35 @@ struct SchemaElement {
10: optional LogicalType logicalType
}

struct SchemaElementV3 {
Antoine, on the question of keeping implementation simpler, would it pay to not revise this and just reuse the existing one?
+1 these changes are not necessary, we should mark the relevant bits in SchemaElement deprecated instead.
We can, of course, but deprecated fields will probably never be removed from the format (except perhaps in 20 years).
Are deprecated fields a problem?
Assuming a writer writes them, the writer wastes cpu and bytes.
A reader can choose to parse them or ignore them (by removing/commenting them out of the .thrift file). The latter means the deprecated fields will be ignored by the thrift parser.
Mandating that readers remove fields from the official Parquet Thrift file sounds like a complicated and error-prone arrangement.
There is no mandate. The official parquet thrift will comment them out.
- Writers compiled with old version of the official thrift file may write the fields.
- Writers compiled with new version of the official thrift file won't write the fields.
- Readers compiled with old version of the official thrift file may read the fields.
- Readers compiled with new version of the official thrift file will ignore the fields.
The problem is that, for any given implementation, we'll have either 1+3 (compiled with old version: backwards compatibility but no perf improvement when reading), or 2+4 (compiled with new version: better performance on read, but backwards compatibility is lost).
This doesn't satisfy the requirements we're trying to satisfy here.
Oh I see. The problem is old readers and new writers.
I'm OK with no cleanup of deprecated fields. Marking them semantically as logically required for PAR1 footers seems reasonable to me. I hope that at some point standalone PAR3 will become the default, and the deprecated fields would not need to be written if they present meaningful overhead once that happens.
From a wire compatibility perspective, required -> optional should be forward/backward compatible.
/** Nested fields. */
5: optional i32 num_children;

/** CHANGED from v1: from i32 to i64
Have you encountered a use case for this? A 2-billion field ID limit seems quite high?
Same question!
Same. I don't think this change is needed.
Ok. It seemed a bit arbitrary to limit the width of this, but I can undo the change.
I think it is backwards compatible and, in the long run, it probably isn't a big deal either way. I was just curious if there was a reason?
No concrete reason actually :-)
Perhaps we could articulate the concrete use cases we want to support with this? I understand that there is a desire to support extremely wide schemas of say 10,000 columns, but the precise nature of these columns eludes me. The reason I ask is that if we stick with a standard page size of 1MB, then a 10,000-column table with even distribution across the columns is unlikely to ever need multiple row groups: it will be 10GB with just a single row group. This seems at odds with the stated motivation of this PR to avoid scaling per row group, which makes me think I am missing something. Perhaps the use case involves much smaller column chunks than normal, which would imply small pages, which might require changes beyond metadata if we want to support it effectively? But at the same time I struggle to see why you would want to do this.

As an aside, I did some toy benchmarking of parquet-rs and confirmed that using thrift is perfectly fine and can perform on par with flatbuffers - apache/arrow-rs#5770 (comment). It's a toy benchmark and should therefore be taken with a big grain of salt, but it at least would suggest 1us per column chunk is feasible.
File Metadata v3
4-byte length in bytes of File Metadata v3 (little endian)
4-byte magic number "PAR3"
One more thought is potentially having a 16-bit or 32-bit flag map, initially set to zero, if we do want to allow for future iterations on metadata interpretation.
Data point: we added this to the Arrow IPC format and nobody to my knowledge is making use of it.
It seems to be superior to ParquetVersion of parquet-cpp in terms of representing a collection of enabled features.
But what would be a concrete use case for those feature flags?
Test if the reader is compatible and time to upgrade?
Readers can already scan the metadata and see if they recognize all encodings, etc.
Also, what would be the point of erroring out if a feature is optional?
There is a lot of current discussion on evolution; some of these ideas might come to fruition, some of them won't.
As a concrete example, if we have a bitmap, we no longer need different magic footer/header values for compression.
Other potential items that we might not want to close off:
- Move to flatbuffers in the future.
- Allow for discontinuous column chunks (i.e. require the offset index to read data)
Ok, we can perhaps add 4 reserved bytes just before the magic number. But I would recommend against giving them any specific meaning until necessary.
Sorry, for the concrete use case I meant to say "encryption" which I think we could use immediately?
You're right that encryption would be an immediate use case (as you did in #250, btw).
At least in datasets I've seen, there are a small number of rows, at least after filtering (i.e. more columns than rows).

Thank you Antoine. Would you mind putting your feedback there?
4-byte magic number "PAR1"
4-byte magic number "PAR3"

<Column 1 Chunk 1 + Column Metadata>
Wouldn't <Column 1 Chunk 1 + Column 1 Chunk 1 Metadata> and so on be better here according to the V3 metadata format?
I've literally copied this from the original snippet above. This proposal doesn't change anything in the layout here, so it would be confusing if I expressed it differently, IMHO.
This file structure is backwards-compatible. Parquet 1 readers will read and
decode the legacy File Metadata in the file footer, while Parquet 3 readers
will notice the "PAR3" magic number just before the File Metadata and will
instead read and decode the File Metadata v3.
Efficient I/O can be challenging with this proposal. A reader needs to read the last 8 bytes of the file, then read 8 bytes before the legacy footer, figure out a v3 footer exists and then read that.
It would be better if the v3 metadata are at the end of the file, right before the 4-byte len + PAR1.
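To make the I/O pattern concrete, here is a rough reader-side sketch under the layout proposed in this PR (a hypothetical helper, not any existing implementation; it assumes a single speculative tail read is large enough to cover both footers):

import struct

def probe_footers(f, file_size, speculative_read=64 * 1024):
    # Assumed layout at the end of the file:
    #   ... <File Metadata v3> <4-byte v3 length> "PAR3"
    #       <legacy File Metadata> <4-byte legacy length> "PAR1"
    f.seek(max(0, file_size - speculative_read))
    tail = f.read()

    if tail[-4:] != b"PAR1":
        raise ValueError("not a Parquet file")
    legacy_len = struct.unpack("<I", tail[-8:-4])[0]

    # The v3 trailer, if present, sits immediately before the legacy footer.
    v3_end = len(tail) - 8 - legacy_len
    if v3_end >= 8 and tail[v3_end - 4:v3_end] == b"PAR3":
        v3_len = struct.unpack("<I", tail[v3_end - 8:v3_end - 4])[0]
        return ("PAR3", tail[v3_end - 8 - v3_len:v3_end - 8])
    return ("PAR1", tail[-8 - legacy_len:-8])

Without a speculative over-read, this is indeed at least two dependent reads (tail, then the v3 footer) against high-latency storage.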
Can you suggest a layout that preserves compatibility with PAR1 readers?
There are a few options:
1. Use FileMetadata.version to introduce a new version of the metadata. Starting from the minimal change that can be done in place (DRAFT: Incremental improvements to parquet metadata #248), we can bump the version, remove columns from RowGroup, and decouple the column metadata completely.
2. Add a binary field to FileMetadata named v3_metadata with tag number 10003. This field will encode the flatbuffer/thrift representation of the new footer. This field is going to be encoded last by thrift. Readers can manually look at the tail of the file and, if they find this field, they can ignore the rest of the footer and parse these bytes only, ignoring the old-style footer altogether.
The goal here is twofold:
- achieve better metadata parsing performance for PAR3 readers
- keep compatibility with PAR1 readers
This is why this proposal creates a separate array of structure types: so that PAR3 readers don't have to eagerly decode those pesky columns, while letting PAR1 readers correctly access column information.
I don't think either of your two proposals can achieve those two goals simultaneously, can they?
(admittedly, I'm not sure I understand proposal number 2, though it seems to require hand-coded Thrift parsing which doesn't sound like a tremendous idea)
I think for 2, thrift could avoid parsing it, assuming that we still follow the pattern of a nested footer, e.g.
<field_marker 10003 and byte size><[serialized v3 metadata] + <v3 trailing bits (length, digest, feature bitmask)>"PAR3">0x0000<footer size>PAR1
As long as the byte size in the thrift header accounts for everything through PAR3 (as @alkis mentions below), it should work.
So the encoding/serialization would be manual, but on decoding, old readers should automatically drop the unknown field (it is possible some thrift implementations retain unknown fields, I know proto does); i.e. the field ID 10003 should never actually be modeled in the schema.
(Note: 0x0000 is the stop field for structs, if I am reading the thrift spec correctly.)
So the trade-offs of this approach are:
- A bit of extra data to be copied for readers accessing the original version.
- A guaranteed lower bound on the number of IO operations for V3, since it is incorporated into v2.
- Potentially more memory utilization when accessing the original version, if unknown fields are retained by the thrift implementation.
Effectively, for the operation as currently proposed in V3, the trade-offs are reversed.
With fetch latencies of that order, does the decode latency of the existing thrift payload even matter? I would be interested to see any empirical data that would suggest the current metadata structures are an appreciable bottleneck for use-cases involving a catalog. I created a mailing list thread related to this here
Thanks for the concrete numbers. This gives us more precise insight for us to work on (though not all Parquet files are read from S3; local reads and writes are a very common use case as well). However, I'm a bit surprised by your numbers because typical guidance for S3 involves larger reads (i.e. multi-megabyte).
On S3 reads:
Guidance from S3 involves larger reads because of cost: reads from S3 typically cost an API call, and bandwidth/transfer volume is free (same zone). 8MB reads will cost half the price of 4MB reads. Each connection can go up to ~100MB/sec, so just transferring 4MB of data is at least 25ms. If one is reading large files, doing 100s of 8MB reads in parallel will saturate any network card - which is typically what engines do.
I posit that the vast majority of parquet encoded data is in some cloud somewhere (like >90% of all data). Hence working well with object stores (high latency, immutable files) is a requirement for any change. This is also lesson 4 from An Empirical Evaluation of Columnar Storage Formats.
With fetch latencies of that order, does the decode latency of the existing thrift payload even matter?
Yes it does. With the numbers above in mind:
- cold reads: baseline 110ms (as above). Optimized metadata are down to 2mb and parsing in 5ms translates to 20ms + 30ms + 5ms = 55ms --> 2x speedup
- warm reads: footer bytes are cached on disk/s3express/something-with-low-latency. It takes 5ms fetch + 40ms parse. Optimized, it takes 2ms to fetch + 5ms to parse = 7ms --> 6x speedup
I posit that many PiB of parquet data lives in HDFS, which also has read latencies, though not as bad as S3.
S3 standard has pretty bad latency for small byte reads, plus you get billed. Best to grab large amounts speculatively.
I'm late to this thread and haven't fully read it. I just want to give my +1 that the current proposal would require too many I/O requests and thus make this basically a deal breaker for high latency storage like S3. We would not use this due to this.
Any change that increases the number of data-dependent requests necessary to decode it is basically a deal breaker for us, and I guess for a lot of other data lake companies as well.
File Metadata v3
4-byte length in bytes of File Metadata v3 (little endian)
4-byte magic number "PAR3"
How do encrypted footers work in this case?
Hopefully, the same way they work for PAR1, but I agree this will need to be spelled out more explicitly :-)
Yep, might need some analysis. If we continue using "PARE", can the readers know if the file is v1 or v3?
Maybe we can use something like "PRE3" instead (PaRquet Encrypted, or "PQE3" - ParQuet Encrypted, or "PEF3" - parquet encrypted footer)
Ah, you're right. I thought the "PARE" footer was in addition to the "PAR1" footer, but apparently it replaces it. Darn.
Would using PAR3 + bit in a bitmap to represent encryption be a reasonable approach here to reduce the proliferation of magic footer values?
Sure, as long as we can read the bitmap before decrypting the footer
@@ -885,6 +971,44 @@ struct ColumnChunk {
9: optional binary encrypted_column_metadata
}

struct ColumnChunkV3 {
/** File where column data is stored. **/
1: optional string file_path
+1 on removing this if we are doing breaking changes.
@@ -835,6 +864,65 @@ struct ColumnMetaData {
16: optional SizeStatistics size_statistics;
}

struct ColumnChunkMetaDataV3 {
Similarly to SchemaElement, we can reuse the existing struct and deprecate the useless fields.
Also, don't we need an index into the list of SchemaElement to reference it?
Unless I'm misunderstanding something, there should be one FileColumnMetadataV3 element per leaf SchemaElement.
Having an index referencing a SchemaElement means that:
1. a writer can skip encoding columns that do not have values in a rowgroup range
2. a writer can encode/write columns in a different order than the metadata
(1) is important when schemata are very wide but data is sparse.
Well, ignoring the fact that Parquet is currently not a sparse format, your proposal implies that readers have to do a O(n) search to find a given column?
readers have to do a O(n) search to find a given column?
Why would they need an O(n) search? The index indexes a SchemaElement[] in Java or a std::vector<SchemaElement> in C++, which is O(1).
Ok, let's step back because I think I am not following you.
Here is the file organization as proposed here:
- FileMetaDataV3 points to one FileColumnMetadataV3 per column
- FileColumnMetadataV3 points to one ColumnChunkV3 per row group
I'm not sure what you're proposing exactly when you mean "can skip encoding columns that do not have values"? This is currently not possible using Parquet, as at least the corresponding definition levels would be expected (even if they are all 0).
Well, ignoring the fact that Parquet is currently not a sparse format, your proposal implies that readers have to do a O(n) search to find a given column?
IIUC, finding a column via schema elements today is also O(N), assuming no nesting. I think the difference is that today the first thing implementations do is build an efficient dictionary structure to amortize lookup of further columns.
I think if we want fast lookups without building any additional dictionaries in memory, we should be considering a new stored index structure (or reconsider how we organize schema elements, instead of a straight BFS).
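For illustration, roughly the kind of one-time dictionary an implementation builds today (a sketch only; it assumes SchemaElement-like objects with name and num_children in the usual depth-first flattening, and the dotted-path convention is mine):

def build_leaf_lookup(schema_elements):
    # Map dotted column path -> leaf ordinal, built once so that later
    # column lookups are O(1) instead of rescanning the element list.
    lookup = {}
    state = {"pos": 1, "leaf": 0}  # element 0 is the schema root

    def walk(prefix, count):
        for _ in range(count):
            elem = schema_elements[state["pos"]]
            state["pos"] += 1
            path = prefix + (elem.name,)
            children = elem.num_children or 0
            if children == 0:
                lookup[".".join(path)] = state["leaf"]
                state["leaf"] += 1
            else:
                walk(path, children)

    walk((), schema_elements[0].num_children or 0)
    return lookup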
/** REMOVED from v1: index_page_offset (unused in practice?) */

/** Compression codec **/
1: required CompressionCodec codec
This could go in the SchemaElement since it should not in principle vary between row groups.
Well, right now it's possible to make it vary. I don't know if any writers make use of this possibility, but I'm not sure there's any harm in keeping it.
Enforcing the same codec for all row groups would prohibit fast merging of row groups from different parquet files without rewriting chunk data. So I vote for keeping it as is.
The biggest culprit in parsing metadata is Statistics, because every one of the values is a binary variable-length thing. We could improve Statistics trivially and in place if we add a few fixed-size fields:

struct Statistics {
   /** DEPRECATED */
   1: optional binary max;
   2: optional binary min;
   /** count of null value in the column */
   3: optional i64 null_count;
   /** count of distinct values occurring */
   4: optional i64 distinct_count;
   /**
    * Only one pair of min/max will be populated. For fixed sized types, one of
    * the minN/maxN variants will be used. Otherwise min_value/max_value is used.
    */
   5: optional binary max_value;
   6: optional binary min_value;
   7: optional byte max1;
   8: optional byte min1;
   9: optional i32 max4;
   10: optional i32 min4;
   11: optional i64 max8;
   12: optional i64 min8;
}
Since that structure is deprecated, do we really want to "improve" it, or should we focus our efforts on non-deprecated structures such as ColumnIndex?
More generally, this should probably be discussed in a separate ML thread, for better separation of concerns.
I wasn't aware the whole structure was deprecated. I thought only the max and min fields are deprecated.
I agree this is a separate topic. I still want to say that Apache ORC has defined separate statistics for different types: https://github.com/apache/orc-format/blob/6e30e63f5071b616ec5cedaac7b2e0a98ae377c5/src/main/proto/orc/proto/orc_proto.proto#L87-L99. But the dirty thing is that we might still need some sort of encoding for data types which do not have direct representation from protobuf or thrift, e.g. decimal type.
Having an encoding shouldn't be a problem as long as the encoding is fast enough to decode (e.g. PLAIN), should it?
Agree separate topic. I think you still need variable length types that rely on byte_array
I agree with you. Defining specific statistics for different data types adds complexity as well.
Would a union work here?
I wasn't aware the whole structure was deprecated. I thought only the max and min fields are deprecated.
Ahah, sorry. I think you're right actually.
I thought only the max and min fields are deprecated.
Sigh, an unrelated issue is that currently min and max might still be written even if min_value and max_value are provided.
Replied to comments. I also put a strawman in the comments for minimal changes we can do to improve FileMetadata decoding speed: #248.
Most of the bottleneck in decoding FileMetadata is in variable-length fields: list<> and binary. With those removed or elided we can start reducing the pain today, and get to a much better state once we have a solid design for a larger change.
@@ -835,6 +864,65 @@ struct ColumnMetaData {
  16: optional SizeStatistics size_statistics;
}

struct ColumnChunkMetaDataV3 {
Having an index referencing a SchemaElement means that:
1. a writer can skip encoding columns that do not have values in a row group range
2. a writer can encode/write columns in a different order than the metadata
(1) is important when schemata are very wide but data is sparse; a sketch of that case follows below.
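A toy sketch of case (1), assuming each decoupled chunk record carries a schema index (the tuple layout below is made up for illustration): a very wide schema where a row group materializes metadata only for the columns that actually hold values.
# 50,000-column schema, but this row group only has data for two columns.
schema = [f"col_{i}" for i in range(50_000)]

# Hypothetical decoupled chunk metadata: (schema_index, byte_offset, byte_size).
row_group_chunks = [(17, 4_096, 12_345), (23_901, 65_536, 98_765)]

present = {schema[idx]: (off, size) for idx, off, size in row_group_chunks}
print(len(present), "of", len(schema), "columns have chunk metadata")  # 2 of 50000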
<File-level Column N Metadata v3>

File Metadata v3
4-byte length in bytes of File Metadata v3 (little endian)
@emkornfield IIUC the digest is not to protect against corruption but to make sure we do not mistakenly read a V3 footer in a file without one, if we happen to see "PAR3" bytes before the V1 footer, correct?
This file structure is backwards-compatible. Parquet 1 readers will read and
decode the legacy File Metadata in the file footer, while Parquet 3 readers
will notice the "PAR3" magic number just before the File Metadata and will
instead read and decode the File Metadata v3.
There are a few options:
- Use FileMetadata.version to introduce a new version of the metadata. Starting from the minimal change that can be done in place (DRAFT: Incremental improvements to parquet metadata #248), we can bump the version, remove columns from RowGroup, and decouple the column metadata completely.
- Add a binary field to FileMetadata named v3_metadata with tag number 10003. This field would encode the flatbuffer/thrift representation of the new footer and would be encoded last by thrift. Readers can manually look at the tail of the file and, if they find this field, parse only these bytes, ignoring the old-style footer altogether.
/** NEW from v1: Byte length in file_path of ColumnChunkMetaDataV3, optionally encrypted **/
3: required i32 metadata_file_length
Thank you :)
The naming is a little awkward... I read these as "metadata-file offset"/"metadata-file length". Perhaps instead just "metadata_offset" and "metadata_length"?
Aside: just curious, but in the current format how do you make use of the file_offset? Is there a way to deduce the length of the metadata? Or do you have to use a file-based reader and seek to the offset?
BTW, do we still want to put a copy of ColumnMetaData at the end of the column chunk and another copy here at 4: optional binary encoded_metadata? I know it is good to keep backward compatibility. Does any implementation actually read it from the end of the column chunk?
/** NEW from v1: byte offset of FileColumnMetadataV3, for each column **/
7: required list<i64> file_column_metadata_offset;
/** NEW from v1: byte length of FileColumnMetadataV3, for each column **/
8: required list<i32> file_column_metadata_length;
Is it too late to add something like the struct below for all offset/length pairs?
struct Reference {
1: i64 offset
2: i32 length
}
If we're looking to speed up thrift deserialization, I'd bet two int lists are going to be faster to parse than a list of structs. If the metadata objects are contiguous, maybe instead extend the offsets list by one and use deltas to calculate the lengths.
I was going with the same intuition as @etseidl's, though I don't have any precise insight into typical Thrift deserializer performance characteristics.
Mandating contiguous column metadata objects and using N+1 offsets is an intriguing idea. It could perhaps make it easier to preload many columns' metadata at once.
+1 to N+1 offsets. They are going to parse a lot faster (>2x) than structs.
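A sketch of the N+1 offsets layout (the numbers and names below are assumptions, not spec): lengths become deltas, and any contiguous run of per-column metadata can be fetched with one computed byte range.
# N + 1 offsets for N = 4 contiguous per-column metadata blocks.
offsets = [1_000, 1_250, 1_600, 2_480, 2_500]

def column_metadata_range(i, j=None):
    """Byte (offset, length) covering columns i..j inclusive."""
    j = i if j is None else j
    return offsets[i], offsets[j + 1] - offsets[i]

print(column_metadata_range(1))     # (1250, 350): a single column
print(column_metadata_range(1, 3))  # (1250, 1250): preload columns 1-3 in one read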
8: required list<i32> file_column_metadata_length;

/** REMOVED from v1: column_orders.
 ** Use `FileColumnMetadataV3.column_order` instead.
+1
Updated: I misread where this was being moved to; this seems fine.
/** NEW from v1: Optional key/value metadata for this column at the file level **/
3: optional list<KeyValue> key_value_metadata
Putting all user-defined metadata in a list is subject to limitations from thrift, which is why we have to care about its size. Switching it to a separate "page" or similar could unlock other possibilities, like storing an extra user-defined secondary index. For now, if we want to add a custom index, we can only carve out a "black hole" somewhere in the file and put its offset/length pair into the key_value_metadata; a sketch of that workaround follows below.
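A sketch of that workaround (the key name and payload format are purely illustrative): the index itself lives somewhere in the file body, and only its offset/length pair is recorded as a key/value entry.
import json

# Record the location of a user-defined secondary index as key/value metadata.
def index_location_entry(offset, length):
    return {"key": "my.engine.secondary_index",  # hypothetical key name
            "value": json.dumps({"offset": offset, "length": length})}

print(index_location_entry(1_234_567, 8_192))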
3: optional list<SortingColumn> sorting_columns

/** REMOVED from v1: file_offset.
 * Use the OffsetIndex for each column instead.
IIUC, the current spec does not prohibit placing column chunks in arbitrary order, so we must use file_offset together with total_compressed_size to determine the read range. This is a trick used to place small column chunks together so that they can be fetched in a single I/O.
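A sketch of that usage: the read range of a chunk is [file_offset, file_offset + total_compressed_size), and ranges that touch can be coalesced into a single I/O (the gap tolerance is an assumption of this sketch).
# (file_offset, total_compressed_size) pairs for three column chunks.
chunks = [(4_000, 1_000), (5_000, 2_000), (90_000, 500)]

def coalesce(ranges, gap=0):
    """Merge byte ranges that are adjacent or closer than `gap`."""
    merged = []
    for off, size in sorted(ranges):
        if merged and off <= merged[-1][1] + gap:
            merged[-1][1] = max(merged[-1][1], off + size)
        else:
            merged.append([off, off + size])
    return [(start, end - start) for start, end in merged]

print(coalesce(chunks))  # [(4000, 3000), (90000, 500)]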
2: required i64 num_rows

/** If set, specifies a sort ordering of the rows in this row group. */
3: optional list<SortingColumn> sorting_columns
I think this is kind of the same rationale as the codecs (easily being able to concatenate row groups with different sort orders)
Yes, but it can be worked around by simply dropping the sort order metadata if row groups have inconsistent values. This is different from the codec case.
...
<Column N Chunk M + Column Metadata>

<File-level Column 1 Metadata v3>
If we have file-level column metadata, we can optimize the storage of the key_metadata of encrypted columns by keeping the column key_metadata here, instead of in the ColumnChunk structures.
Note: this is a storage-only optimization [O(N) instead of O(N×M)]; the reader already processes only one key_metadata object per column [O(N)] today.
Assuming that each column uses a single key_metadata for all chunks, this sounds like a good idea.
## Encryption

Encryption with footer encryption enabled changes the above file structure slightly.
In particular, the "PAR1" magic number is replaced with "PARE".
TBD (#242 (comment))
8: optional i32 column_index_length

/** Crypto metadata of encrypted columns **/
9: optional ColumnCryptoMetaData crypto_metadata
not needed if we keep this in FileColumnMetadataV3
Agreed, especially if it can't differ between individual chunks.
@@ -1165,6 +1317,62 @@ struct FileMetaData {
  9: optional binary footer_signing_key_metadata
}

/** Metadata for a column in this file. */
struct FileColumnMetadataV3 {
  /** All column chunks in this file (one per row group) **/
Add "in this column" to the comment? Also, rename the field to chunks?
/** Metadata for a column in this file. */
struct FileColumnMetadataV3 {
/** All column chunks in this column (in this file - one per row group) **/
1: required list<ColumnChunkV3> chunks
/** NEW from v1: Optional key/value metadata for this column at the file level **/
3: optional list<KeyValue> key_value_metadata
}
We can add the column encryption key_metadata here:
/** Crypto metadata of encrypted column **/
4: optional ColumnCryptoMetaData crypto_metadata
This design adds another GET/read to the footer read phase.
The offset to the v3 metadata needs to be stored at a fixed offset off the tail of the file, so a single read of the last N bytes will pull it in, just as is done for v1. Even that is inefficient as a single high-latency GET only gets back 8 bytes right now: apps should download 1+MB or more for the same cost.
Will v1 readers suffer if there is extra data at the tail of their normal metadata? If they all work today (first bit of regression testing...) then I'd actually propose a block at the end with more info, for this and later releases.
So you'd have something like:
byte[8] v3_position (not offset, actual position)
byte[8] v3_length
byte[4] "PAR3"
byte[4] v1_offset
byte[4] magic ["PAR1" | "PARE"]
For this to work with existing readers, which are given bytes [len-(v1-offset), len-8] to decode, those readers must be OK with some trailing bytes they don't recognise. Does this hold, including for encrypted metadata?
If it does, a v3 reader could open a file, read at least the last 32 bytes, check for v3 via the two magic numbers, and, if present, go to the v3 metadata.
If it doesn't, things get worse when opening a file. It'd actually make sense to grab the last, say, 2-4 MB of the file and go from there.
This may be better anyway; it is why the abfs/gcs connectors cache footers, and the ongoing s3a prefetch work will probably grab the final 8 MB. But doing it in Parquet would be consistent everywhere rather than hoping the layers underneath do the right thing.
(Side issue: table indices such as Iceberg's should store this footer too, or at least the key/values.)
This makes me realise there's another aspect to regression testing here: samples of v3 format files must be given to existing v2 readers to verify they don't break. Are there any repos which do this yet?
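A sketch of how a reader might probe the trailing block laid out above (endianness is assumed little-endian, matching the existing 4-byte footer length; none of this is specified anywhere yet):
import struct

def probe_tail(tail: bytes):
    """tail: at least the last 28 bytes of the file."""
    magic = tail[-4:]
    if magic not in (b"PAR1", b"PARE"):
        raise ValueError("not a Parquet file")
    if tail[-12:-8] != b"PAR3":
        return None  # no trailing v3 block: plain v1 footer only
    v3_position, v3_length = struct.unpack("<qq", tail[-28:-12])
    v1_footer_length = struct.unpack("<i", tail[-8:-4])[0]
    return v3_position, v3_length, v1_footer_length

# Fabricated tail: v3 metadata at byte 10000, 512 bytes long, 4096-byte v1 footer.
tail = struct.pack("<qq", 10_000, 512) + b"PAR3" + struct.pack("<i", 4_096) + b"PAR1"
print(probe_tail(tail))  # (10000, 512, 4096)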
@@ -81,7 +81,13 @@ more pages.
- Encoding/Compression - Page

## File format
This file and the [Thrift definition](src/main/thrift/parquet.thrift) should be read together to understand the format.

This file and the [Thrift definition](src/main/thrift/parquet.thrift) should be read
"MUST", RFC-2119 style, maybe
@@ -107,12 +113,97 @@ start locations. More details on what is contained in the metadata can be found
in the Thrift definition.

Metadata is written after the data to allow for single pass writing.
This is especially useful when writing to backends such as S3.
It is a requirement for HDFS too; S3 is simply a beneficiary of a design requirement made at the outset, with an append-only filesystem being the target from day 1.
This file structure is backwards-compatible. Parquet 1 readers will read and
decode the legacy File Metadata in the file footer, while Parquet 3 readers
will notice the "PAR3" magic number just before the File Metadata and will
instead read and decode the File Metadata v3.
I posit that many PiB of Parquet data live in HDFS, which also has read latencies, though not as bad as S3's.
S3 Standard has pretty bad latency for small byte-range reads, and you get billed per request; it is best to grab large amounts speculatively.
<File-level Column N Metadata v3>

File Metadata v3
4-byte length in bytes of File Metadata v3 (little endian)
offset + length
Readers are expected to first read the file metadata to find all the column
chunks they are interested in. The columns chunks should then be read sequentially.

![File Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)

### Parquet 3
People who are not involved in the discussion of Parquet 3 might wonder: why suddenly Parquet '3'?
Interesting, so this feature is basically used to create a file comparable to an Iceberg manifest. I see that it can be used for that.
Design-wise, I'm not the biggest fan of special-casing this through an extra field instead of just storing a Parquet file that has all the information in normal Parquet columns (like a Delta Lake checkpoint Parquet file), but the design is the way it is. Therefore, I do see that this field can be used this way, and I guess there is a valid use case for it, so it probably needs to be maintained for backward compatibility.
Cheers,
Jan
@pitrou can we close this now?
Yes, we should! Thanks for the reminder.
Parquet 3 metadata proposal
This is a very rough attempt at solving the problem of FileMetadata footprint and decoding cost, especially for Parquet files with many columns (think tens of thousands of columns).
Context
This is in the context of the broader "Parquet v3" discussion on the mailing-list. A number of possible far-reaching changes are being collected in a document.
It is highly recommended that you read at least that document before commenting on this PR.
Specifically, some users would like to use Parquet files for data with tens of thousands of columns, and potentially hundreds or thousands of row groups. Reading the file-level metadata for such a file is prohibitively expensive given the current file structure where all column-level metadata is eagerly decoded as part of file-level metadata.
Contents
It includes a bunch of changes:
- the eagerly-decoded file-level metadata becomes O(n_columns + n_row_groups) instead of O(n_columns * n_row_groups); for example, with 10,000 columns and 1,000 row groups that is on the order of 11,000 entries rather than 10,000,000
Jira
Commits
Documentation