Support round tripping extension types in parquet #7063

Open
Tracked by #6736
alamb opened this issue Feb 2, 2025 · 18 comments
Labels
arrow (Changes to the arrow crate) · enhancement (Any new improvement worthy of an entry in the changelog) · parquet (Changes to the parquet crate)

Comments

@alamb (Contributor) commented Feb 2, 2025

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

These extension type annotations apparently do not survive a round trip of writing to and reading back from Parquet, as described in this comment thread with @kylebarron: https://github.com/apache/arrow-rs/pull/5822/files#r1925627198

Describe the solution you'd like
Make extension types survive a round trip through Parquet.
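For context, the Arrow spec carries an extension type as field-level metadata under two well-known keys, and that metadata is exactly what needs to survive. A minimal std-only Rust sketch of the round-trip invariant this issue asks for (arrow.uuid is a canonical extension type name; the `clone` is only a stand-in for an actual Parquet write/read):

```rust
use std::collections::HashMap;

fn main() {
    // Per the Arrow spec, an extension type lives in field metadata
    // under these two well-known keys.
    let mut metadata = HashMap::new();
    metadata.insert("ARROW:extension:name".to_string(), "arrow.uuid".to_string());
    metadata.insert("ARROW:extension:metadata".to_string(), String::new());

    // Stand-in for writing the field to Parquet and reading it back;
    // the invariant is that this map survives unchanged.
    let read_back = metadata.clone();
    assert_eq!(read_back, metadata);
    assert_eq!(
        read_back.get("ARROW:extension:name").map(String::as_str),
        Some("arrow.uuid")
    );
}
```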

Describe alternatives you've considered

Additional context

@alamb alamb added the enhancement (Any new improvement worthy of an entry in the changelog) label Feb 2, 2025
@alamb alamb added parquet Changes to the parquet crate arrow Changes to the arrow crate labels Feb 2, 2025
@alamb (Contributor, Author) commented Apr 6, 2025

@mbrobbel and @kylebarron -- this item was brought up as something required to implement Variant support (#6736). Do you by any chance have any additional information you can share on what is required / how it might be implemented (so we can make progress)?

cc @PinkCrow007

@paleolimbot (Member) commented
We'll likely need this for Geometry/Geography as well! I can look more closely if I'm the one to start in on that change, but in C++ we handle something like that with an option to the reader properties:

https://github.com/apache/arrow/blob/c56ec12a2f38ed6fce2d87ea65345b6edf4234f5/cpp/src/parquet/properties.h#L988-L998

...although something more like a registry might be more flexible (e.g., so that a type could be registered with, for example, a DataFusion context like a UDT might be registered in Spark or DuckDB). (With apologies if I missed the major API challenge here 😬 )
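The registry idea could be sketched (std-only; all type and function names here are hypothetical, not an existing arrow-rs API) as a map from the extension name found in field metadata to a handler that reconstructs the logical type on read, with unregistered names simply falling back to the storage type:

```rust
use std::collections::HashMap;

// Hypothetical handler: given the serialized extension metadata,
// produce some description of the logical type.
type ExtensionHandler = fn(&str) -> String;

struct ExtensionRegistry {
    handlers: HashMap<String, ExtensionHandler>,
}

impl ExtensionRegistry {
    fn new() -> Self {
        Self { handlers: HashMap::new() }
    }

    fn register(&mut self, name: &str, handler: ExtensionHandler) {
        self.handlers.insert(name.to_string(), handler);
    }

    // Returns None for unknown names, leaving the storage type untouched.
    fn resolve(&self, name: &str, metadata: &str) -> Option<String> {
        self.handlers.get(name).map(|h| h(metadata))
    }
}

fn main() {
    let mut registry = ExtensionRegistry::new();
    // e.g. a geometry type registered much like a UDT in Spark or DuckDB
    registry.register("geoarrow.point", |_meta| "Point geometry".to_string());
    assert_eq!(
        registry.resolve("geoarrow.point", "{}").as_deref(),
        Some("Point geometry")
    );
    assert!(registry.resolve("unknown.ext", "{}").is_none());
}
```

A context-scoped registry like this would let downstream systems (a DataFusion context, for example) opt individual types in without arrow-rs hard-coding any of them.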

@mbrobbel (Member) commented Apr 8, 2025

> @mbrobbel and @kylebarron -- this item was brought up as something required to implement Variant support (#6736). Do you by any chance have any additional information you can share on what is required / how it might be implemented (so we can make progress)?
>
> cc @PinkCrow007

My thinking was: writing an Arrow field with an extension type to Parquet and then reading it again (using the provided Arrow writers/readers in the parquet crate) should retain the extension type information, i.e. after reading the field back from Parquet it should still have the same extension type.

From briefly looking at this I think we can implement this by replacing some parquet impls to operate on fields instead of types.

@paleolimbot (Member) commented
> replacing some parquet impls to operate on fields instead of types.

I'm experimenting with this in DataFusion, too ( apache/datafusion#15036 ) and there it's a fairly disruptive change (particularly for the UDFs). I know the version of DataType::Extension(name, metadata) was rejected but I wonder if DataType::Extension(extension_type) (never created without opt-in) would help minimize the surface area of supporting these types. (Although here there are probably sufficiently few internals that the change is not that difficult).
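To make the proposal concrete, here is a hypothetical std-only sketch of such a variant (the names and the Utf8 storage choice are purely illustrative; the real arrow_schema::DataType has no such variant today). The extra match arm that unwraps back to the storage type is exactly the surface area being weighed in this thread:

```rust
// Toy stand-in for arrow_schema::DataType with a hypothetical
// opt-in Extension variant wrapping a storage type.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    // Never created without opt-in; carries the extension name
    // alongside the storage type it is encoded as.
    Extension { name: String, storage: Box<DataType> },
}

// The arm every extension-agnostic callsite would need in order to
// fall back to storage-type behaviour.
fn storage_type(dt: &DataType) -> &DataType {
    match dt {
        DataType::Extension { storage, .. } => storage_type(storage),
        other => other,
    }
}

fn main() {
    let ext = DataType::Extension {
        name: "example.vendor_id".to_string(), // hypothetical extension name
        storage: Box::new(DataType::Utf8),
    };
    assert_eq!(storage_type(&ext), &DataType::Utf8);
    assert_eq!(storage_type(&DataType::Int32), &DataType::Int32);
}
```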

@alamb (Contributor, Author) commented Apr 8, 2025

> replacing some parquet impls to operate on fields instead of types.
>
> I'm experimenting with this in DataFusion, too ( apache/datafusion#15036 ) and there it's a fairly disruptive change (particularly for the UDFs). I know the version of DataType::Extension(name, metadata) was rejected but I wonder if DataType::Extension(extension_type) (never created without opt-in) would help minimize the surface area of supporting these types. (Although here there are probably sufficiently few internals that the change is not that difficult).

I would personally be very open to the idea of adding a new variant to DataType. While it may not be strictly necessary from an API point of view, if it reduces churn and makes it easier to roll out extension types across the ecosystem, it makes a lot of sense to me.

We could also potentially feature flag it for a while (feature=extension) as a way to roll it out before the next major release 🤔

@mbrobbel (Member) commented Apr 8, 2025

The spec defines extension types as metadata on Fields, so I think adding an Extension variant to DataType is going to cause confusing situations. But maybe I'm misunderstanding what you are suggesting?

@paleolimbot (Member) commented
> The spec defines extension types as metadata on Fields

I believe the language in the spec is limited to exchange via FFI and serialization via IPC (e.g., Java, Go, and Arrow C++ include extension types as objects with native fields and only serialize when specifically requested). I don't believe it has caused confusing situations in other languages but it could be worth asking others! I will give it a try anyway and see what kind of implementation effort/downstream impacts it may have.

@mbrobbel (Member) commented Apr 9, 2025

> The spec defines extension types as metadata on Fields
>
> I believe the language in the spec is limited to exchange via FFI and serialization via IPC (e.g., Java, Go, and Arrow C++ include extension types as objects with native fields and only serialize when specifically requested). I don't believe it has caused confusing situations in other languages but it could be worth asking others! I will give it a try anyway and see what kind of implementation effort/downstream impacts it may have.

The DataType approach requires all matches on instances of DataType to consider the extension type variant's inner datatype, which has major downstream impact. It means that consumers that don't care about extension types now need to deal with them. The spec states that implementations are not required to support canonical extension types.

@tustvold (Contributor) commented Apr 9, 2025

The consensus reached on #4472 was not to do this. I can appreciate there are tradeoffs to both approaches, but forcing all arrow code to care about extensions seems unfortunate and against the spirit of extension types.

Edit: I commented on the linked PR which I think highlights the challenge very well, every arrow kernel must now be updated to understand extension arrays, be able to downcast through them, preserve datatype, etc... This is a massive undertaking, and IMO makes the code more fragile

@paleolimbot (Member) commented
In #7398 and apache/datafusion#15663 I mostly found that I was adding an additional "this type isn't supported" arm to the end of enums that already had entries for other unsupported types.

The tradeoff, I think, is that arrow-rs would take on the responsibility of ensuring the hard-to-implement operations (filter, take, concatenate) propagate the type so that extension authors/frameworks built on Arrow don't have to (there are a few PRs in DataFusion that demonstrate that updating the DataType representation has quite a large footprint). I would argue that by excluding the option in arrow-rs, we aren't necessarily eliminating any complexity, we're just punting it to other projects.

I've tried building the PR implementing this against DataFusion to check the impact on downstream code ( apache/datafusion#15663 )...are there other projects with a high Arrow surface area that might have a larger disruption?

@tustvold (Contributor) commented Apr 9, 2025

> I mostly found that I was adding an additional "this type isn't supported" to the end of enums with one or more entries that already had other types that weren't supported.

Right but this regresses functionality that currently "just works", the same is likely true in many places in DataFusion. There is a big difference between it compiling, and it working as users expect. It seems inherently wrong to me that callsites that are agnostic to extension types, and which likely significantly outnumber the callsites that need to be aware of extension types, should be the ones needing to be updated.

@paleolimbot (Member) commented
> Right but this regresses functionality that currently "just works", the same is likely true in many places in DataFusion.

I had envisioned that the use of Extension would only be opt-in (i.e., never automatically created on import from IPC or FFI), which would perhaps limit changes to things that work today (via field metadata propagation where this is possible). Libraries built on arrow-rs could experiment with creating these types and opt-in to their existence slowly (or never!).

> It seems inherently wrong to me that callsites that are agnostic to extension types, and which likely significantly outnumber the callsites that need to be aware of extension types, should be the ones needing to be updated.

Yes, the top-level data type approach does involve implementing extension-aware code in some places. Mostly this is to just fall back to storage type behaviour, error, or use it as an opportunity for the extension author to inject behaviour (e.g., type equality). It's absolutely fair if the number of callsites required is inappropriate in scope in arrow-rs...I'm pursuing this as a comparison to the number of callsites required to update the array and/or data type representation in DataFusion to allow for extension type information to be propagated through expressions (which is also very high).

@tustvold (Contributor) commented Apr 10, 2025

IMO a design should stand on its own merits, not based on what is expedient for one particular downstream use-case.

> update the array and/or data type representation

I'm not familiar with the precise problem being discussed here, but I'm guessing DF is relying on DataType instead of propagating Field/Schema. Such an approach is inherently flawed, as not only does it discard metadata, but also things like nullability, etc... I'm a little surprised if this is the case, as DF at least used to fairly aggressively make use of Schema, etc... but perhaps things have changed since I was very involved...

@alamb (Contributor, Author) commented Apr 10, 2025

> I'm not familiar with the precise problem being discussed here, but I'm guessing DF is relying on DataType instead of propagating Field/Schema. Such an approach is inherently flawed, as not only does it discard metadata, but also things like nullability, etc... I'm a little surprised if this is the case, as DF at least used to fairly aggressively make use of Schema, etc... but perhaps things have changed since I was very involved...

Yes, there are quite a few APIs in DataFusion that use DataType and not Field (for example the type coercion logic). If we were starting again today they would all probably be in terms of Field. On the other hand I don't see huge problems in making it easier to use extension types downstream by making changes to arrow-rs

@paleolimbot what do you think about filing a ticket in DataFusion like "Better extension type support" where we can start listing the APIs that would need to be updated to Field instead of DataType?

@paleolimbot (Member) commented
> not based on what is expedient for one particular downstream use-case

Agreed! There are also a number of arrow-rs' own crates that make heavy use of the DataType and ArrayRef (e.g., the Parquet internals, casting). Another approach might be an arrow-extensions crate that includes an extension-aware cast(), CSV reader/writer (i.e., extension authors can choose how an array is represented in a CSV file), Parquet reader/writer (the issue here...how to propagate an extension type to Parquet and vice versa), extension-aware pretty printer, etc. All of these utilities are made use of by DataFusion but there is nothing strictly related to DataFusion about them and the work we do to (e.g.) support Variant, Geometry/Geography, or UUIDs doesn't have to only benefit DataFusion users.

> what do you think about filing a ticket in DataFusion like "Better extension type support" where we can start listing the APIs that would need to be updated to Field instead of DataType?

Yes, I'll continue my efforts there, too! In addition to requiring a Field a number of them also require a Context of some kind to look up extension-specific behaviour (e.g., whether a cast should or should not happen).

@alamb (Contributor, Author) commented Apr 10, 2025

A relevant PR from @timsaucer downstream in DataFusion in fact proposes exactly this (use a Field instead of a DataType) for scalar functions.

@tustvold (Contributor) commented

> There are also a number of arrow-rs' own crates that make heavy use of the DataType and ArrayRef (e.g., the Parquet internals, casting)

Indeed, and the benefit of the current design is that extension types automatically "just work". The only logic that needs to be concerned with them is logic that seeks to explicitly handle them; I am pretty strongly opposed to changing this.

@paleolimbot (Member) commented
> extension types automatically "just work"

This is definitely true for some operations (e.g., arrow-select), but for others (e.g., cast, parse, print, write to CSV, arithmetic) it is very easy to do the wrong thing because there is no built-in alternative to capture that context: the DataType in its current form is not really "how the values of an array should be interpreted", but more like ArrowStorageType or PhysicalTypeButAlsoTimestamps. You are absolutely right that adding Extension to the DataType was never the intention of the original DataType...I think it is also true that a lot of usage (including DataFusion and the higher-level arrow crates) is not consistent with that intention.

We have an internal workaround that we're pursuing for now, but I'll continue to play with/review alternatives both here and in DataFusion...if nothing else, it's a great way to get to know arrow-rs!
