[C++][Parquet][Python] New API to 'zip' or (vertically) 'attach' parquet metadata #40958
Comments
@wgtmac does the spec currently allow this? |
I think this is the parquet summary metadata file. See |
@wgtmac, no, I don't think the _metadata file would be widely used in big data systems like Hadoop/Spark. However, Apache Arrow does seem to have the required API (in ParquetFile) to read metadata separately from the data. Of course, I'm also not sure whether Apache Arrow will specifically support the section below from my request (because we currently have no way to stitch 2 metadata files):
where file1.parquet contains col1, col2, col3 and file2.parquet contains col4 and col5 (a different set of columns). Here only the _metadata file has the overarching information about the 'table' definition. I'm only guessing that it would be supported, since the API to do so exists. However, it would be nice to confirm that as well. |
Before talking about the |
@wgtmac, sorry I didn't quite understand
pyarrow already supports metadata file, right? https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files |
Awkward. I didn't even notice that this is already supported. Thanks for pointing it out! |
@wgtmac, any further insights into this? I think the only place that would need to change is https://github.com/apache/arrow/blob/main/cpp/src/parquet/metadata.cc |
@mrbrahman This feature is controversial, and there was a related discussion recently: apache/parquet-format#242 (comment). |
Describe the enhancement requested
Hi,
One of the design principles of Parquet, from its GitHub page, is 'Separating metadata and column data':
In order to achieve the 'columns in different files', we need to
It looks like Arrow APIs provide nearly everything to achieve this, except for the bolded portion in point 3 above.
This ticket is requesting the addition of a new API to be able to 'zip'/'join'/'attach' metadata from 2 files.
For example:
Once this is done, a combined dataset can be created using:
Component(s)
C++, Python, Parquet