Support round tripping extension types in parquet #7063

Open
Tracked by #6736
alamb opened this issue Feb 2, 2025 · 18 comments
Labels
arrow (Changes to the arrow crate) · enhancement (Any new improvement worthy of an entry in the changelog) · parquet (Changes to the parquet crate)

Comments

@alamb (Contributor) commented Feb 2, 2025

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

These extension type annotations apparently do not survive a round trip of writing to and reading back from Parquet, as described in this comment thread with @kylebarron: https://github.com/apache/arrow-rs/pull/5822/files#r1925627198

Describe the solution you'd like
Make extension types survive a round trip through Parquet.
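For context, the Arrow spec carries an extension type as field-level metadata under two well-known keys, and that metadata is exactly what needs to survive. A minimal std-only Rust sketch of the round-trip invariant this issue asks for (arrow.uuid is a canonical extension type name; the `clone` is only a stand-in for an actual Parquet write/read):

```rust
use std::collections::HashMap;

fn main() {
    // Per the Arrow spec, an extension type lives in field metadata
    // under these two well-known keys.
    let mut metadata = HashMap::new();
    metadata.insert("ARROW:extension:name".to_string(), "arrow.uuid".to_string());
    metadata.insert("ARROW:extension:metadata".to_string(), String::new());

    // Stand-in for writing the field to Parquet and reading it back;
    // the invariant is that this map survives unchanged.
    let read_back = metadata.clone();
    assert_eq!(read_back, metadata);
    assert_eq!(
        read_back.get("ARROW:extension:name").map(String::as_str),
        Some("arrow.uuid")
    );
}
```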

Describe alternatives you've considered

Additional context

@alamb alamb added the enhancement (Any new improvement worthy of an entry in the changelog) label Feb 2, 2025
@alamb alamb added parquet Changes to the parquet crate arrow Changes to the arrow crate labels Feb 2, 2025
@alamb (Contributor, Author) commented Apr 6, 2025

@mbrobbel and @kylebarron -- this item was brought up as something required to implement Variant support (#6736). Do you by any chance have any additional information you can share on what is required / how it might be implemented (so we can make progress)?

cc @PinkCrow007

@paleolimbot (Member) commented
We'll likely need this for Geometry/Geography as well! I can look more closely if I'm the one to start in on that change, but in C++ we handle something like that with an option to the reader properties:

https://github.com/apache/arrow/blob/c56ec12a2f38ed6fce2d87ea65345b6edf4234f5/cpp/src/parquet/properties.h#L988-L998

...although something more like a registry might be more flexible (e.g., so that a type could be registered with, for example, a DataFusion context like a UDT might be registered in Spark or DuckDB). (With apologies if I missed the major API challenge here 😬 )
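The registry idea could be sketched (std-only; all type and function names here are hypothetical, not an existing arrow-rs API) as a map from the extension name found in field metadata to a handler that reconstructs the logical type on read, with unregistered names simply falling back to the storage type:

```rust
use std::collections::HashMap;

// Hypothetical handler: given the serialized extension metadata,
// produce some description of the logical type.
type ExtensionHandler = fn(&str) -> String;

struct ExtensionRegistry {
    handlers: HashMap<String, ExtensionHandler>,
}

impl ExtensionRegistry {
    fn new() -> Self {
        Self { handlers: HashMap::new() }
    }

    fn register(&mut self, name: &str, handler: ExtensionHandler) {
        self.handlers.insert(name.to_string(), handler);
    }

    // Returns None for unknown names, leaving the storage type untouched.
    fn resolve(&self, name: &str, metadata: &str) -> Option<String> {
        self.handlers.get(name).map(|h| h(metadata))
    }
}

fn main() {
    let mut registry = ExtensionRegistry::new();
    // e.g. a geometry type registered much like a UDT in Spark or DuckDB
    registry.register("geoarrow.point", |_meta| "Point geometry".to_string());
    assert_eq!(
        registry.resolve("geoarrow.point", "{}").as_deref(),
        Some("Point geometry")
    );
    assert!(registry.resolve("unknown.ext", "{}").is_none());
}
```

A context-scoped registry like this would let downstream systems (a DataFusion context, for example) opt individual types in without arrow-rs hard-coding any of them.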

@mbrobbel (Member) commented Apr 8, 2025

> @mbrobbel and @kylebarron -- this item was brought up as something required to implement Variant support (#6736). Do you by any chance have any additional information you can share on what is required / how it might be implemented (so we can make progress)?
>
> cc @PinkCrow007

My thinking was: writing an Arrow field with an extension type to Parquet and then reading it again (using the provided Arrow writers/readers in the parquet crate) should retain the extension type information, i.e. after reading the field back from Parquet it should still have the same extension type.

From briefly looking at this I think we can implement this by replacing some parquet impls to operate on fields instead of types.

@paleolimbot (Member) commented
> replacing some parquet impls to operate on fields instead of types.

I'm experimenting with this in DataFusion, too ( apache/datafusion#15036 ) and there it's a fairly disruptive change (particularly for the UDFs). I know the version of DataType::Extension(name, metadata) was rejected but I wonder if DataType::Extension(extension_type) (never created without opt-in) would help minimize the surface area of supporting these types. (Although here there are probably sufficiently few internals that the change is not that difficult).
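To make the proposal concrete, here is a hypothetical std-only sketch of such a variant (the names and the Utf8 storage choice are purely illustrative; the real arrow_schema::DataType has no such variant today). The extra match arm that unwraps back to the storage type is exactly the surface area being weighed in this thread:

```rust
// Toy stand-in for arrow_schema::DataType with a hypothetical
// opt-in Extension variant wrapping a storage type.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    // Never created without opt-in; carries the extension name
    // alongside the storage type it is encoded as.
    Extension { name: String, storage: Box<DataType> },
}

// The arm every extension-agnostic callsite would need in order to
// fall back to storage-type behaviour.
fn storage_type(dt: &DataType) -> &DataType {
    match dt {
        DataType::Extension { storage, .. } => storage_type(storage),
        other => other,
    }
}

fn main() {
    let ext = DataType::Extension {
        name: "example.vendor_id".to_string(), // hypothetical extension name
        storage: Box::new(DataType::Utf8),
    };
    assert_eq!(storage_type(&ext), &DataType::Utf8);
    assert_eq!(storage_type(&DataType::Int32), &DataType::Int32);
}
```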

@alamb (Contributor, Author) commented Apr 8, 2025

> replacing some parquet impls to operate on fields instead of types.
>
> I'm experimenting with this in DataFusion, too ( apache/datafusion#15036 ) and there it's a fairly disruptive change (particularly for the UDFs). I know the version of DataType::Extension(name, metadata) was rejected but I wonder if DataType::Extension(extension_type) (never created without opt-in) would help minimize the surface area of supporting these types. (Although here there are probably sufficiently few internals that the change is not that difficult).

I would personally be very open to the idea of adding a new variant to DataType. While it may not be strictly necessary from an API point of view, if it reduces churn and makes it easier to roll out extension types across the ecosystem, it makes a lot of sense to me.

We could also potentially feature flag it for a while (feature=extension) as a way to roll it out before the next major release 🤔

@mbrobbel (Member) commented Apr 8, 2025

The spec defines extension types as metadata on Fields, so I think adding an Extension variant to DataType is going to cause confusing situations. But maybe I'm misunderstanding what you are suggesting?

@paleolimbot (Member) commented
> The spec defines extension types as metadata on Fields

I believe the language in the spec is limited to exchange via FFI and serialization via IPC (e.g., Java, Go, and Arrow C++ include extension types as objects with native fields and only serialize when specifically requested). I don't believe it has caused confusing situations in other languages but it could be worth asking others! I will give it a try anyway and see what kind of implementation effort/downstream impacts it may have.

@mbrobbel (Member) commented Apr 9, 2025

> The spec defines extension types as metadata on Fields
>
> I believe the language in the spec is limited to exchange via FFI and serialization via IPC (e.g., Java, Go, and Arrow C++ include extension types as objects with native fields and only serialize when specifically requested). I don't believe it has caused confusing situations in other languages but it could be worth asking others! I will give it a try anyway and see what kind of implementation effort/downstream impacts it may have.

The DataType approach requires all matches on instances of DataType to consider the extension type variant's inner datatype, which has major downstream impact. It means that consumers that don't care about extension types now need to deal with them. The spec states that implementations are not required to support canonical extension types.

@tustvold (Contributor) commented Apr 9, 2025

The consensus reached on #4472 was not to do this. I can appreciate there are tradeoffs to both approaches, but forcing all arrow code to care about extensions seems unfortunate and against the spirit of extension types.

Edit: I commented on the linked PR which I think highlights the challenge very well, every arrow kernel must now be updated to understand extension arrays, be able to downcast through them, preserve datatype, etc... This is a massive undertaking, and IMO makes the code more fragile

@paleolimbot (Member) commented
In #7398 and apache/datafusion#15663 I mostly found that I was adding an additional "this type isn't supported" arm to the end of enums that already had entries for other unsupported types.

The tradeoff, I think, is that arrow-rs would take on the responsibility of ensuring the hard-to-implement operations (filter, take, concatenate) propagate the type so that extension authors/frameworks built on Arrow don't have to (there are a few PRs in DataFusion that demonstrate that updating the DataType representation has quite a large footprint). I would argue that by excluding the option in arrow-rs, we aren't necessarily eliminating any complexity, we're just punting it to other projects.

I've tried building the PR implementing this against DataFusion to check the impact on downstream code ( apache/datafusion#15663 )...are there other projects with a high Arrow surface area that might have a larger disruption?

@tustvold (Contributor) commented Apr 9, 2025

> I mostly found that I was adding an additional "this type isn't supported" to the end of enums with one or more entries that already had other types that weren't supported.

Right but this regresses functionality that currently "just works", the same is likely true in many places in DataFusion. There is a big difference between it compiling, and it working as users expect. It seems inherently wrong to me that callsites that are agnostic to extension types, and which likely significantly outnumber the callsites that need to be aware of extension types, should be the ones needing to be updated.

@paleolimbot (Member) commented
> Right but this regresses functionality that currently "just works", the same is likely true in many places in DataFusion.

I had envisioned that the use of Extension would only be opt-in (i.e., never automatically created on import from IPC or FFI), which would perhaps limit changes to things that work today (via field metadata propagation where this is possible). Libraries built on arrow-rs could experiment with creating these types and opt-in to their existence slowly (or never!).

> It seems inherently wrong to me that callsites that are agnostic to extension types, and which likely significantly outnumber the callsites that need to be aware of extension types, should be the ones needing to be updated.

Yes, the top-level data type approach does involve implementing extension-aware code in some places. Mostly this is to just fall back to storage type behaviour, error, or use it as an opportunity for the extension author to inject behaviour (e.g., type equality). It's absolutely fair if the number of callsites required is inappropriate in scope in arrow-rs...I'm pursuing this as a comparison to the number of callsites required to update the array and/or data type representation in DataFusion to allow for extension type information to be propagated through expressions (which is also very high).

@tustvold (Contributor) commented Apr 10, 2025

IMO a design should stand on its own merits, not based on what is expedient for one particular downstream use-case.

> update the array and/or data type representation

I'm not familiar with the precise problem being discussed here, but I'm guessing DF is relying on DataType instead of propagating Field/Schema. Such an approach is inherently flawed, as not only does it discard metadata, but also things like nullability, etc... I'm a little surprised if this is the case, as DF at least used to fairly aggressively make use of Schema, etc... but perhaps things have changed since I was very involved...

@alamb (Contributor, Author) commented Apr 10, 2025

> I'm not familiar with the precise problem being discussed here, but I'm guessing DF is relying on DataType instead of propagating Field/Schema. Such an approach is inherently flawed, as not only does it discard metadata, but also things like nullability, etc... I'm a little surprised if this is the case, as DF at least used to fairly aggressively make use of Schema, etc... but perhaps things have changed since I was very involved...

Yes, there are quite a few APIs in DataFusion that use DataType and not Field (for example the type coercion logic). If we were starting again today they would all probably be in terms of Field. On the other hand I don't see huge problems in making it easier to use extension types downstream by making changes to arrow-rs

@paleolimbot what do you think about filing a ticket in DataFusion like "Better extension type support" where we can start listing the APIs that would need to be updated to Field instead of DataType?

@paleolimbot (Member) commented
> not based on what is expedient for one particular downstream use-case

Agreed! There are also a number of arrow-rs' own crates that make heavy use of the DataType and ArrayRef (e.g., the Parquet internals, casting). Another approach might be an arrow-extensions crate that includes an extension-aware cast(), CSV reader/writer (i.e., extension authors can choose how an array is represented in a CSV file), Parquet reader/writer (the issue here...how to propagate an extension type to Parquet and vice versa), extension-aware pretty printer, etc. All of these utilities are made use of by DataFusion but there is nothing strictly related to DataFusion about them and the work we do to (e.g.) support Variant, Geometry/Geography, or UUIDs doesn't have to only benefit DataFusion users.

> what do you think about filing a ticket in DataFusion like "Better extension type support" where we can start listing the APIs that would need to be updated to Field instead of DataType?

Yes, I'll continue my efforts there, too! In addition to requiring a Field a number of them also require a Context of some kind to look up extension-specific behaviour (e.g., whether a cast should or should not happen).

@alamb (Contributor, Author) commented Apr 10, 2025

A relevant PR from @timsaucer downstream in DataFusion in fact proposes exactly this (use a Field instead of a DataType) for scalar functions.

@tustvold (Contributor) commented

> There are also a number of arrow-rs' own crates that make heavy use of the DataType and ArrayRef (e.g., the Parquet internals, casting)

Indeed, and the benefit of the current design is that extension types automatically "just work". The only logic that needs to be concerned with them is logic that seeks to explicitly handle them; I am pretty strongly opposed to changing this.

@paleolimbot (Member) commented
> extension types automatically "just work"

This is definitely true for some operations (e.g., arrow-select), but for others (e.g., cast, parse, print, write to CSV, arithmetic) it is very easy to do the wrong thing because there is no built-in alternative to capture that context: the DataType in its current form is not really "how the values of an array should be interpreted", but more like ArrowStorageType or PhysicalTypeButAlsoTimestamps. You are absolutely right that adding Extension to the DataType was never the intention of the original DataType...I think it is also true that a lot of usage (including DataFusion and the higher-level arrow crates) is not consistent with that intention.

We have an internal workaround that we're pursuing for now, but I'll continue to play with/review alternatives both here and in DataFusion...if nothing else, it's a great way to get to know arrow-rs!
