[DRAFT] Extension Type Registry Draft #18552
Draft
+503
−7
Which issue does this PR close?
This is a draft for #18223. The APIs are not to be considered final (e.g., options are missing in the pretty printer).
The primary purpose for now is to spark discussion.
Happy to hear any input!
Rationale for this change
How cool would it be to just state that you should properly format my byte-encoded uuids? :)
What changes are included in this PR?
- `LogicalType` trait for some canonical extension types from arrow.
- `UnresolvedExtensionType`, a "DataFusion canonical extension type" that can be used to create a `LogicalType` instance even without a registry. For example, the `new_...` functions for `DFSchema` could make use of this type, as they currently have no access to a registry. Furthermore, these functions could directly instantiate the canonical arrow extension types, as they are known to the system. The functions could then resolve native and canonical extension types themselves without access to the registry and "delay" resolving the custom extension types. The idea is that a "Type Resolver Pass" with access to a registry then replaces all instances of this type with the actual one. While I hope that this is only a temporary solution until all places have access to a logical type registry, I think this has the potential to become a "permanent temporary solution". With this in mind, we could also consider making this explicit with an enum instead of hiding it behind dynamic dispatch.
- `ValuePrettyPrinter` for showcasing the UUID pretty printing.
- `ExtensionTypeRegistry` in `SessionState`

What is also important is what is not included: an integrative example of making use of the pretty printer. I tried several avenues but, as you can imagine, each change to the core data structure is a huge plumbing effort (hopefully reduced by the existence of `UnresolvedLogicalType`).

I really like the suggestion by @paleolimbot to use pretty-printing record batches as the first use case. You can see a mini example in the test that pretty-prints UUIDs. The nice thing is that this probably would not require much plumbing, as the [DataFrame] already has access to the [SessionState]. The only thing that's missing for me to actually include this example here is that `arrow-rs` does not currently support passing custom pretty printers in `pretty_format_batches_with_options`.

Imagine that the `to_string` function in the `DataFrame` could pass the registered pretty printers along to that function. If you think this is a worthwhile pursuit, we could add the capability to arrow-rs.
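To make the delayed-resolution idea and the UUID use case a bit more concrete, here is a minimal, std-only Rust sketch. It is not the code from this PR: the shape of the `ValuePrettyPrinter` trait, the `FieldType` enum, the `resolve_types` function, and the registry's `register` method are all assumptions made for illustration. Only the extension name `arrow.uuid` is taken from the Arrow canonical extension types.

```rust
use std::collections::HashMap;
use std::fmt::Write;
use std::sync::Arc;

/// Hypothetical stand-in for a value pretty printer (signature is an assumption).
trait ValuePrettyPrinter {
    fn pretty_print(&self, value: &[u8]) -> Result<String, String>;
}

/// Formats 16 raw bytes in the usual 8-4-4-4-12 hexadecimal UUID layout.
struct UuidPrettyPrinter;

impl ValuePrettyPrinter for UuidPrettyPrinter {
    fn pretty_print(&self, value: &[u8]) -> Result<String, String> {
        if value.len() != 16 {
            return Err(format!("expected 16 bytes, got {}", value.len()));
        }
        let mut out = String::with_capacity(36);
        for (i, byte) in value.iter().enumerate() {
            // Hyphens are placed before bytes 4, 6, 8 and 10.
            if matches!(i, 4 | 6 | 8 | 10) {
                out.push('-');
            }
            write!(out, "{byte:02x}").expect("writing to a String cannot fail");
        }
        Ok(out)
    }
}

/// Hypothetical registry: maps an extension type name to its printer.
#[derive(Default)]
struct ExtensionTypeRegistry {
    printers: HashMap<String, Arc<dyn ValuePrettyPrinter>>,
}

impl ExtensionTypeRegistry {
    fn register(&mut self, name: &str, printer: Arc<dyn ValuePrettyPrinter>) {
        self.printers.insert(name.to_string(), printer);
    }
}

/// The "explicit enum" variant of delayed resolution (an assumption, not the PR's type).
enum FieldType {
    /// A native type that needs no registry lookup.
    Native,
    /// Only the extension name is known; no registry was available yet.
    Unresolved(String),
    /// The extension name has been looked up in a registry.
    Resolved(Arc<dyn ValuePrettyPrinter>),
}

/// Sketch of a "type resolver pass": replaces every `Unresolved` entry
/// with its registered counterpart and fails on unknown names.
fn resolve_types(
    fields: Vec<FieldType>,
    registry: &ExtensionTypeRegistry,
) -> Result<Vec<FieldType>, String> {
    fields
        .into_iter()
        .map(|field| match field {
            FieldType::Unresolved(name) => registry
                .printers
                .get(&name)
                .cloned()
                .map(FieldType::Resolved)
                .ok_or_else(|| format!("unknown extension type: {name}")),
            other => Ok(other),
        })
        .collect()
}
```

In this sketch, code without registry access would emit `FieldType::Unresolved("arrow.uuid".to_string())`, and a later pass holding the registry would call `resolve_types` once. Using an enum rather than a trait object for the unresolved case makes the "not yet resolved" state visible in every `match`, which is the trade-off mentioned above.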
Are these changes tested?
Not really, as there is no integrative example yet.
Are there any user-facing changes?
There would be.