
[C++][Parquet][Python] New API to 'zip' or (vertically) 'attach' parquet metadata #40958

Open
mrbrahman opened this issue Apr 2, 2024 · 8 comments


mrbrahman commented Apr 2, 2024

Describe the enhancement requested

Hi,

One of the design principles of Parquet, from its GitHub page, is 'Separating metadata and column data':

> Separating metadata and column data.
>
> The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files.

In order to achieve 'columns in different files', we need to:

  1. Ensure each file has the same number of row-groups
  2. Ensure the corresponding row-groups across files have the same number of rows
  3. Grab the 'metadata' from each file, **'zip/attach' them vertically**, and write out the new metadata file
  4. Feed this metadata while reading the table

It looks like the Arrow APIs provide nearly everything needed to achieve this, except for the bolded portion in point 3 above.

This ticket requests a new API to 'zip'/'join'/'attach' the metadata from two files.

For example:

```python
import pyarrow.parquet as pq

m1 = pq.read_metadata('file1.parquet')  # say this has columns: col1, col2, col3
m1.set_file_path('file1.parquet')

m2 = pq.read_metadata('file2.parquet')  # say this has columns: col4, col5
m2.set_file_path('file2.parquet')

# requesting this new 'zip' API
m = m1.zip(m2)  # needs to ensure same number of row groups, and same number of rows within each row group

# m will now have metadata for col1, col2, col3, col4, col5, each pointing to the appropriate data file

m.write_metadata_file('_metadata')
```
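The preconditions in steps 1 and 2 can already be checked with the existing metadata API. A minimal sketch, assuming the same file names as above (the helper function is mine, not an existing API):

```python
import pyarrow.parquet as pq

def check_zip_compatible(*paths):
    """Check that all files share the same row-group structure: the same
    number of row groups, and the same row count in each corresponding
    row group."""
    metas = [pq.read_metadata(p) for p in paths]
    first = metas[0]
    for m in metas[1:]:
        if m.num_row_groups != first.num_row_groups:
            raise ValueError('row-group counts differ')
        for i in range(first.num_row_groups):
            if m.row_group(i).num_rows != first.row_group(i).num_rows:
                raise ValueError(f'row counts differ in row group {i}')

check_zip_compatible('file1.parquet', 'file2.parquet')
```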

Once this is done, the combined data can be read using:

```python
m = pq.read_metadata('_metadata')
data = pq.ParquetFile('file1.parquet', 'file2.parquet', metadata=m)

# data should now be able to show all columns
```
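For contrast, the row-wise analog of this operation already exists in pyarrow: FileMetaData.append_row_groups concatenates the row groups of a second file's metadata onto the first. The requested 'zip' would be its column-wise counterpart. A minimal sketch of the existing API, with placeholder file names:

```python
import pyarrow.parquet as pq

# Row-wise (horizontal) concatenation exists today:
# append_row_groups stacks the row groups of m2 after those of m1.
m1 = pq.read_metadata('part-0.parquet')
m1.set_file_path('part-0.parquet')

m2 = pq.read_metadata('part-1.parquet')
m2.set_file_path('part-1.parquet')

m1.append_row_groups(m2)
m1.write_metadata_file('_metadata')
```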

Component(s)

C++, Python, Parquet

@mrbrahman mrbrahman changed the title API to 'zip' or (vertically) 'attach' parquet metadata New API to 'zip' or (vertically) 'attach' parquet metadata Apr 2, 2024
@kou kou changed the title New API to 'zip' or (vertically) 'attach' parquet metadata [C++][Parquet][Python] New API to 'zip' or (vertically) 'attach' parquet metadata Apr 3, 2024
mapleFU (Member) commented Apr 9, 2024

@wgtmac does the spec currently allow this?

wgtmac (Member) commented Apr 10, 2024

I think this is the parquet summary metadata file. See parquet.summary.metadata.level from https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md#class-parquetoutputformat. But I don't know whether it is widely used.

mrbrahman (Author) commented

@wgtmac, no, I don't think the _metadata file is widely used in big-data systems like Hadoop/Spark. However, Apache Arrow does seem to have the required API (in ParquetFile) to read metadata separately from the data.

Of course, I'm also not sure whether Apache Arrow will specifically support the section below from my request (because we currently have no way to stitch two metadata files):

Once this is done, the combined data can be read using:

```python
m = pq.read_metadata('_metadata')
data = pq.ParquetFile('file1.parquet', 'file2.parquet', metadata=m)

# data should now be able to show all columns
```

where file1.parquet contains col1, col2, col3 and file2.parquet contains col4 and col5 (different sets of columns). Here only the _metadata file has the overarching information about the 'table' definition.

I'm only guessing that it would be supported, since the API to do so exists. However, it would be nice to confirm that as well.

wgtmac (Member) commented Apr 12, 2024

Before talking about the zip API, it still needs some refactoring work to support the metadata file. What is your use case, then? The possible use case for a metadata file is to combine parquet files of different columns into a larger logical parquet file. From my perspective, if the metadata file is not widely used, it seems not worth the effort to implement it.

mrbrahman (Author) commented
@wgtmac, sorry, I didn't quite understand:

> it still needs some refactoring work to support the metadata file

pyarrow already supports the metadata file, right?

https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files
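For reference, a condensed version of the pattern from that docs page (the example table and paths are placeholders): per-file metadata is collected at write time, then written out as a _metadata file:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'col1': [1, 2, 3]})

# Write a dataset and collect the metadata of every file written
metadata_collector = []
pq.write_to_dataset(table, 'dataset_root', metadata_collector=metadata_collector)

# Write the '_metadata' file with the row-group metadata of all files
pq.write_metadata(table.schema, 'dataset_root/_metadata',
                  metadata_collector=metadata_collector)
```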

wgtmac (Member) commented Apr 13, 2024

Awkward. I didn't even notice that this is already supported. Thanks for pointing it out!

mrbrahman (Author) commented
@wgtmac, any further insights on this?

I think the only place that would need to change is https://github.com/apache/arrow/blob/main/cpp/src/parquet/metadata.cc

wgtmac (Member) commented Jun 13, 2024

@mrbrahman This feature is controversial, and there is a recent related discussion: apache/parquet-format#242 (comment).
