
[C++][Parquet][Python] New API to 'zip' or (vertically) 'attach' parquet metadata #40958

Open
mrbrahman opened this issue Apr 2, 2024 · 8 comments


mrbrahman commented Apr 2, 2024

Describe the enhancement requested

Hi,

One of the design principles of Parquet, from its GitHub page, is 'Separating metadata and column data':

> Separating metadata and column data.
>
> The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files.

In order to achieve 'columns in different files', we need to:

  1. Ensure each file has the same number of row-groups
  2. Ensure the corresponding row-groups across files have the same number of rows
  3. Grab the 'metadata' from each file, **'zip/attach' them vertically**, and write out the new metadata file
  4. Feed this metadata while reading the table

It looks like the Arrow APIs provide nearly everything needed to achieve this, except for the bolded portion in point 3 above.

This ticket requests a new API to 'zip'/'join'/'attach' the metadata from two files.

For example:

```python
import pyarrow.parquet as pq

m1 = pq.read_metadata('file1.parquet')  # say this has columns: col1, col2, col3
m1.set_file_path('file1.parquet')

m2 = pq.read_metadata('file2.parquet')  # say this has columns: col4, col5
m2.set_file_path('file2.parquet')

# requesting this new 'zip' API
m = m1.zip(m2)  # needs to ensure same number of row groups, and same number of rows within each row group

# m will now have metadata for col1, col2, col3, col4, col5, each pointing to the appropriate data file

m.write_metadata_file('_metadata')
```
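The preconditions in steps 1 and 2 can already be checked with the existing metadata API. A minimal sketch, assuming the same file names as above (the helper function is mine, not an existing API):

```python
import pyarrow.parquet as pq

def check_zip_compatible(*paths):
    """Check that all files share the same row-group structure: the same
    number of row groups, and the same row count in each corresponding
    row group."""
    metas = [pq.read_metadata(p) for p in paths]
    first = metas[0]
    for m in metas[1:]:
        if m.num_row_groups != first.num_row_groups:
            raise ValueError('row-group counts differ')
        for i in range(first.num_row_groups):
            if m.row_group(i).num_rows != first.row_group(i).num_rows:
                raise ValueError(f'row counts differ in row group {i}')

check_zip_compatible('file1.parquet', 'file2.parquet')
```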

Once this is done, the combined data can be read using:

```python
m = pq.read_metadata('_metadata')
data = pq.ParquetFile('file1.parquet', 'file2.parquet', metadata=m)

# data should now be able to show all columns
```
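For contrast, the row-wise analog of this operation already exists in pyarrow: FileMetaData.append_row_groups concatenates the row groups of a second file's metadata onto the first. The requested 'zip' would be its column-wise counterpart. A minimal sketch of the existing API, with placeholder file names:

```python
import pyarrow.parquet as pq

# Row-wise (horizontal) concatenation exists today:
# append_row_groups stacks the row groups of m2 after those of m1.
m1 = pq.read_metadata('part-0.parquet')
m1.set_file_path('part-0.parquet')

m2 = pq.read_metadata('part-1.parquet')
m2.set_file_path('part-1.parquet')

m1.append_row_groups(m2)
m1.write_metadata_file('_metadata')
```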

Component(s)

C++, Python, Parquet

@mrbrahman mrbrahman changed the title API to 'zip' or (vertically) 'attach' parquet metadata New API to 'zip' or (vertically) 'attach' parquet metadata Apr 2, 2024
@kou kou changed the title New API to 'zip' or (vertically) 'attach' parquet metadata [C++][Parquet][Python] New API to 'zip' or (vertically) 'attach' parquet metadata Apr 3, 2024
mapleFU (Member) commented Apr 9, 2024

@wgtmac does the spec currently allow this?

wgtmac (Member) commented Apr 10, 2024

I think this is the parquet summary metadata file. See parquet.summary.metadata.level from https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md#class-parquetoutputformat. But I don't know whether it is widely used.

mrbrahman (Author) commented

@wgtmac, no, I don't think the _metadata file is widely used in big-data systems like Hadoop/Spark. However, Apache Arrow does seem to have the required API (in ParquetFile) to read metadata separately from the data.

Of course, I'm also not sure whether Apache Arrow will specifically support the section below from my request (because we currently have no way to stitch two metadata files):

Once this is done, the combined data can be read using:

```python
m = pq.read_metadata('_metadata')
data = pq.ParquetFile('file1.parquet', 'file2.parquet', metadata=m)

# data should now be able to show all columns
```

where file1.parquet contains col1, col2, col3 and file2.parquet contains col4 and col5 (different sets of columns). Here only the _metadata file has the overarching information about the 'table' definition.

I'm only guessing that it would be supported, since the API to do so exists. However, it would be nice to confirm that as well.

wgtmac (Member) commented Apr 12, 2024

Before talking about the zip API, it still needs some refactoring work to support the metadata file. What is your use case, then? The possible use case for a metadata file is to combine parquet files of different columns into a larger logical parquet file. From my perspective, if the metadata file is not widely used, it seems not worth the effort to implement it.

mrbrahman (Author) commented
@wgtmac, sorry, I didn't quite understand:

> it still needs some refactoring work to support the metadata file

pyarrow already supports the metadata file, right?

https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files
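For reference, a condensed version of the pattern from that docs page (the example table and paths are placeholders): per-file metadata is collected at write time, then written out as a _metadata file:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'col1': [1, 2, 3]})

# Write a dataset and collect the metadata of every file written
metadata_collector = []
pq.write_to_dataset(table, 'dataset_root', metadata_collector=metadata_collector)

# Write the '_metadata' file with the row-group metadata of all files
pq.write_metadata(table.schema, 'dataset_root/_metadata',
                  metadata_collector=metadata_collector)
```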

wgtmac (Member) commented Apr 13, 2024

Awkward. I didn't even notice that this is already supported. Thanks for pointing it out!

mrbrahman (Author) commented
@wgtmac, any further insights on this?

I think the only place that would need to change is https://github.com/apache/arrow/blob/main/cpp/src/parquet/metadata.cc

wgtmac (Member) commented Jun 13, 2024

@mrbrahman This feature is controversial, and there is a recent related discussion: apache/parquet-format#242 (comment).
