Support round tripping extension types in parquet #7063
@mbrobbel and @kylebarron -- this item was brought up as something required to implement Variant support (#6736). Do you by any chance have any additional information on what is required / how it might be implemented that you can share (so we can make progress)? cc @PinkCrow007
We'll likely need this for Geometry/Geography as well! I can look more closely if I'm the one to start in on that change, but in C++ we handle something like this with an option to the reader properties: ...although something more like a registry might be more flexible (e.g., so that a type could be registered with, for example, a DataFusion context, like a UDT might be registered in Spark or DuckDB). (With apologies if I missed the major API challenge here 😬)
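To make the registry idea above concrete, here is a minimal std-only sketch. The trait, the `GeometryType` type, and the extension names are all illustrative assumptions, not arrow-rs API; the point is just lookup-by-name, in the spirit of registering a UDT with a query engine context.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical extension type trait: only the piece needed to
// illustrate registration and lookup by extension name.
trait ExtensionType: Send + Sync {
    /// The extension name carried in field metadata, e.g. "arrow.uuid".
    fn name(&self) -> &str;
}

// Illustrative extension type; the name is an assumption.
struct GeometryType;

impl ExtensionType for GeometryType {
    fn name(&self) -> &str {
        "geoarrow.geometry"
    }
}

/// A toy registry mapping extension names to implementations, similar
/// in spirit to registering a UDT with a DataFusion context.
#[derive(Default)]
struct ExtensionRegistry {
    types: HashMap<String, Arc<dyn ExtensionType>>,
}

impl ExtensionRegistry {
    fn register(&mut self, ext: Arc<dyn ExtensionType>) {
        self.types.insert(ext.name().to_string(), ext);
    }

    fn lookup(&self, name: &str) -> Option<Arc<dyn ExtensionType>> {
        self.types.get(name).cloned()
    }
}

fn main() {
    let mut registry = ExtensionRegistry::default();
    registry.register(Arc::new(GeometryType));
    assert!(registry.lookup("geoarrow.geometry").is_some());
    assert!(registry.lookup("unknown.type").is_none());
}
```

A reader could consult such a registry whenever it encounters a field annotated with an extension name, falling back to the plain storage type for unregistered names.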
My thinking was: given an Arrow field with an extension type, writing this field to Parquet and then reading it again (using the provided Arrow writers/readers in the …) should round trip the annotation. From briefly looking at this, I think we can implement this by replacing some …
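For context on what needs to survive the round trip: per the Arrow specification, an extension type is encoded as field-level metadata under the keys `ARROW:extension:name` and `ARROW:extension:metadata`. The sketch below uses a toy stand-in for `arrow_schema::Field` (the struct and helper names are assumptions); only the metadata keys are the real, spec-defined ones.

```rust
use std::collections::HashMap;

// Toy stand-in for arrow_schema::Field, just enough to show where the
// extension annotation lives. The metadata key below is the real one
// from the Arrow spec.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    metadata: HashMap<String, String>,
}

const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

// Hypothetical helper: attach an extension name to a field.
fn with_extension(name: &str, ext: &str) -> Field {
    let mut metadata = HashMap::new();
    metadata.insert(EXTENSION_NAME_KEY.to_string(), ext.to_string());
    Field {
        name: name.to_string(),
        metadata,
    }
}

// Round tripping through Parquet preserves the extension only if the
// writer and reader carry this metadata through; the issue here is
// that it is currently dropped.
fn extension_name(field: &Field) -> Option<&str> {
    field.metadata.get(EXTENSION_NAME_KEY).map(|s| s.as_str())
}

fn main() {
    let field = with_extension("id", "arrow.uuid");
    assert_eq!(extension_name(&field), Some("arrow.uuid"));
}
```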
I'm experimenting with this in DataFusion, too ( apache/datafusion#15036 ), and there it's a fairly disruptive change (particularly for the UDFs). I know the version of …
I would personally be very open to the idea of adding a new enum variant to DataType. While it may not be strictly necessary from an API point of view, if it reduces churn and makes it easier to roll out extension types across the ecosystem, it makes a lot of sense to me. We could also potentially feature flag it for a while (…)
The spec defines extension types as metadata on …
I believe the language in the spec is limited to exchange via FFI and serialization via IPC (e.g., Java, Go, and Arrow C++ include extension types as objects with native fields and only serialize when specifically requested). I don't believe it has caused confusing situations in other languages, but it could be worth asking others! I will give it a try anyway and see what kind of implementation effort/downstream impact it may have.
The …
The consensus reached on #4472 was not to do this. I can appreciate there are tradeoffs to both approaches, but forcing all arrow code to care about extensions seems unfortunate and against the spirit of extension types. Edit: I commented on the linked PR, which I think highlights the challenge very well: every arrow kernel must now be updated to understand extension arrays, be able to downcast through them, preserve the datatype, etc. This is a massive undertaking, and IMO makes the code more fragile.
In #7398 and apache/datafusion#15663 I mostly found that I was adding an additional "this type isn't supported" arm to the end of enums that already had one or more entries for other unsupported types. The tradeoff, I think, is that arrow-rs would take on the responsibility of ensuring the hard-to-implement operations (filter, take, concatenate) propagate the type, so that extension authors/frameworks built on arrow don't have to (there are a few PRs in DataFusion that demonstrate that updating the DataType representation has quite a large footprint). I would argue that by excluding the option in arrow-rs, we aren't necessarily eliminating any complexity; we're just punting it on to other projects. I've tried building the PR implementing this against DataFusion to check the impact on downstream code ( apache/datafusion#15663 )...are there other projects with a high Arrow surface area that might see a larger disruption?
Right, but this regresses functionality that currently "just works"; the same is likely true in many places in DataFusion. There is a big difference between it compiling and it working as users expect. It seems inherently wrong to me that callsites that are agnostic to extension types, and which likely significantly outnumber the callsites that need to be aware of them, should be the ones needing to be updated.
I had envisioned that the use of Extension would only be opt-in (i.e., never automatically created on import from IPC or FFI), which would perhaps limit changes to things that work today (via field metadata propagation where this is possible). Libraries built on arrow-rs could experiment with creating these types and opt-in to their existence slowly (or never!).
Yes, the top-level data type approach does involve implementing extension-aware code in some places. Mostly this is to fall back to storage type behaviour, error, or use it as an opportunity for the extension author to inject behaviour (e.g., type equality). It's absolutely fair if the number of callsites required is inappropriate in scope for arrow-rs...I'm pursuing this as a comparison to the number of callsites required to update the array and/or data type representation in DataFusion to allow extension type information to be propagated through expressions (which is also very high).
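The "fall back to storage type behaviour" idea can be sketched with a toy enum. This is NOT the arrow-rs `DataType`; the `Extension` variant is the hypothetical proposal under discussion, and the variant shape here is an assumption for illustration only.

```rust
// A toy DataType with a hypothetical Extension variant wrapping a
// storage type, to illustrate how extension-agnostic kernels could
// unwrap it and otherwise behave as before.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Binary,
    Extension {
        name: String,
        storage: Box<DataType>,
    },
}

impl DataType {
    /// Recursively unwrap to the underlying storage type. Kernels that
    /// don't care about extensions dispatch on this; kernels that do
    /// (equality, casting, display) can match Extension explicitly.
    fn storage_type(&self) -> &DataType {
        match self {
            DataType::Extension { storage, .. } => storage.storage_type(),
            other => other,
        }
    }
}

fn main() {
    let ext = DataType::Extension {
        name: "arrow.uuid".to_string(),
        storage: Box::new(DataType::Binary),
    };
    assert_eq!(ext.storage_type(), &DataType::Binary);
    assert_eq!(DataType::Int32.storage_type(), &DataType::Int32);
}
```

The cost mentioned in the thread is that every `match` over `DataType` in every kernel now needs either an `Extension` arm or a `storage_type()` call up front.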
IMO a design should stand on its own merits, not based on what is expedient for one particular downstream use-case.
I'm not familiar with the precise problem being discussed here, but I'm guessing DF is relying on DataType instead of propagating Field/Schema. Such an approach is inherently flawed, as not only does it discard metadata, but also things like nullability, etc. I'm a little surprised if this is the case, as DF at least used to fairly aggressively make use of Schema, etc., but perhaps things have changed since I was very involved...
Yes, there are quite a few APIs in DataFusion that use DataType and not Field (for example the type coercion logic). If we were starting again today they would all probably be in terms of Field. On the other hand, I don't see huge problems in making it easier to use extension types downstream by making changes to arrow-rs. @paleolimbot what do you think about filing a ticket in DataFusion like "Better extension type support" where we can start listing the APIs that would need to be updated to Field instead of DataType?
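The DataType-vs-Field distinction above can be made concrete with toy types (both structs and both `coerce_*` functions here are illustrative assumptions, not DataFusion APIs): a signature that only receives `DataType` physically cannot see or propagate extension metadata or nullability, while a `Field`-based signature can.

```rust
use std::collections::HashMap;

// Toy types contrasting the two API shapes under discussion.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Binary,
}

#[derive(Debug, Clone)]
struct Field {
    data_type: DataType,
    nullable: bool,
    metadata: HashMap<String, String>,
}

// DataType-based coercion: any extension annotation on the inputs is
// invisible here and therefore silently dropped from the result.
fn coerce_types(lhs: &DataType, _rhs: &DataType) -> DataType {
    lhs.clone()
}

// Field-based coercion: metadata and nullability can be propagated
// alongside the coerced type.
fn coerce_fields(lhs: &Field, rhs: &Field) -> Field {
    Field {
        data_type: coerce_types(&lhs.data_type, &rhs.data_type),
        nullable: lhs.nullable || rhs.nullable,
        metadata: lhs.metadata.clone(),
    }
}

fn main() {
    let mut metadata = HashMap::new();
    metadata.insert(
        "ARROW:extension:name".to_string(),
        "arrow.uuid".to_string(),
    );
    let lhs = Field { data_type: DataType::Binary, nullable: false, metadata };
    let rhs = Field { data_type: DataType::Binary, nullable: true, metadata: HashMap::new() };
    let out = coerce_fields(&lhs, &rhs);
    assert!(out.metadata.contains_key("ARROW:extension:name"));
    assert!(out.nullable);
}
```

Migrating a `DataType`-based API to a `Field`-based one is mechanical but wide-reaching, which is the footprint concern raised earlier in the thread.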
Agreed! There are also a number of arrow-rs' own crates that make heavy use of the …
Yes, I'll continue my efforts there, too! In addition to requiring a …
A relevant PR from @timsaucer downstream in DataFusion in fact proposes exactly this (using a Field instead of a DataType) for scalar functions.
Indeed, and the benefit of the current design is that extension types automatically "just work". The only logic that needs to be concerned with them is logic that seeks to explicitly handle them; I am pretty strongly opposed to changing this.
This is definitely true for some operations (e.g., arrow-select), but for others (e.g., cast, parse, print, write to CSV, arithmetic) it is very easy to do the wrong thing because there is no built-in alternative to capture that context: the … We have an internal workaround that we're pursuing for now, but I'll continue to play with/review alternatives both here and in DataFusion...if nothing else, it's a great way to get to know arrow-rs!
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In #5822 we added the `ExtensionType` trait and `CanonicalExtensionType` enum. However, these extension type annotations apparently do not survive a round trip of writing to and reading from Parquet, as described in this comment thread with @kylebarron: https://github.com/apache/arrow-rs/pull/5822/files#r1925627198
Describe the solution you'd like
Make extension types round trip through Parquet.
Describe alternatives you've considered
Additional context