Keep arrow metadata in Delta Table metadata #1531

Open
j-bennet opened this issue Jul 12, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@j-bennet

j-bennet commented Jul 12, 2023

Environment

Delta-rs version:

0.10.0

Binding:

Python

Environment:

  • Cloud provider:
  • OS: macOS
  • Other:

Bug

delta-rs loses metadata for parquet written with pandas (example data is attached).

test.parquet.zip

from deltalake import DeltaTable
import pyarrow.parquet as pq

if __name__ == "__main__":
    # read it back with delta-rs
    dt = DeltaTable("test.parquet")
    print("\nDeltaTable schema:")
    print(dt.schema().to_pyarrow().to_string())

    # read it back with pyarrow
    table = pq.read_table("test.parquet")
    print("\nPyarrow schema:")
    print(table.schema.to_string())

This outputs:

DeltaTable schema:
col2: string
col1: int32

Pyarrow schema:
col2: dictionary<values=string, indices=int32, ordered=0>
col1: dictionary<values=int32, indices=int32, ordered=0>
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 509

The schema metadata section that pyarrow reports is nowhere to be found in the DeltaTable schema. Is it present but not public? How can it be accessed?

@j-bennet j-bennet added the bug label on Jul 12, 2023
@wjones127
Collaborator

We don't save this metadata in Delta Tables. We could perhaps store it under a custom key in the configuration field of the table metadata, and preserve it in the Arrow schemas we return (which get passed on to pandas).
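To make the idea above concrete, here is a minimal sketch of how Arrow-style schema metadata (a bytes-to-bytes mapping) might be packed into a single configuration entry and unpacked again. The key name `arrow.schema.metadata` and both helper functions are hypothetical. They are not part of delta-rs or the Delta protocol, and the sketch assumes every metadata key and value is valid UTF-8.

```python
import json
from typing import Dict, Mapping

# Hypothetical key name; neither delta-rs nor the Delta protocol defines it.
ARROW_METADATA_KEY = "arrow.schema.metadata"


def pack_schema_metadata(metadata: Mapping[bytes, bytes]) -> Dict[str, str]:
    """Encode Arrow-style schema metadata as one string-valued configuration entry.

    Assumes all keys and values are UTF-8 text (true for the pandas JSON blob).
    """
    as_text = {k.decode("utf-8"): v.decode("utf-8") for k, v in metadata.items()}
    return {ARROW_METADATA_KEY: json.dumps(as_text, sort_keys=True)}


def unpack_schema_metadata(configuration: Mapping[str, str]) -> Dict[bytes, bytes]:
    """Recover the bytes-to-bytes mapping; return an empty dict if the key is absent."""
    raw = configuration.get(ARROW_METADATA_KEY)
    if raw is None:
        return {}
    return {k.encode("utf-8"): v.encode("utf-8") for k, v in json.loads(raw).items()}
```

A client could presumably pass the packed dict through the `configuration` argument of `write_deltalake`, read it back from `DeltaTable.metadata().configuration`, and reattach it with pyarrow's `Schema.with_metadata`. Whether delta-rs accepts arbitrary configuration keys is an assumption here, and any other client would ignore the key unless it implemented the same convention.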

@ion-elgreco ion-elgreco added the enhancement label and removed the bug label on Nov 22, 2023
@kylebarron
Contributor

Is this a limitation of Delta Tables or of the client library? I'm specifically wondering in the context of GeoParquet, which uses schema metadata to declare per-column information like geometry type. With pyarrow we'd set schema metadata on the table and then write to Parquet, but it doesn't look like that works here.

@ion-elgreco
Collaborator

@kylebarron if you would like to do this, you could hijack the configuration key in the table metadata for it. But it still requires an implementation on the client side to use it, and it will only work for that specific client.

@ion-elgreco ion-elgreco changed the title from "PyArrow metadata lost in DeltaTable" to "Keep arrow metadata in Delta Table metadata" on Dec 7, 2024