
@tobixdev (Contributor) commented Nov 8, 2025

Which issue does this PR close?

This is a draft for #18223. The APIs are not to be considered final (e.g., options are missing in the pretty printer).
The primary purpose for now is to spark discussion.

Happy to hear any input!

Rationale for this change

How cool would it be to just state that you should properly format my byte-encoded uuids? :)

What changes are included in this PR?

  • Defines the LogicalType trait for some canonical extension types from arrow.
  • Defines UnresolvedExtensionType, a "DataFusion canonical extension type" that can be used to create a LogicalType instance even without a registry. For example, the new_... functions for DFSchema could make use of this type, as they currently have no access to a registry. Furthermore, these functions could directly instantiate the canonical arrow extension types, as they are known to the system. The functions could then resolve native and canonical extension types themselves without access to the registry and "delay" the resolution of custom extension types. The idea is that a "Type Resolver Pass" with access to a registry then replaces all instances of this type with the actual one. While I hope that this is only a temporary solution until all places have access to a logical type registry, I think this has the potential to become a "permanent temporary solution". With this in mind, we could also consider making this explicit with an enum instead of hiding it behind dynamic dispatch.
  • Defines an incomplete ValuePrettyPrinter to showcase the UUID pretty printing.
  • Adds plumbing for having an ExtensionTypeRegistry in SessionState.
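To make the "delayed resolution" idea concrete, here is a minimal sketch of the explicit-enum variant mentioned above. It uses only the standard library and is not the code in this PR: `TypeRef`, `ExtensionRegistry`, `type_from_field`, and `resolve_types` are hypothetical names chosen for illustration, and the registry stores a plain description string where a real one would hold a logical type implementation.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a logical type reference (not a DataFusion API).
#[derive(Debug, Clone, PartialEq)]
enum TypeRef {
    /// A native type, resolvable without a registry.
    Native(&'static str),
    /// A canonical Arrow extension type that the system knows about.
    Canonical(&'static str),
    /// A custom extension type whose resolution is delayed.
    Unresolved { name: String },
    /// A custom extension type after the resolver pass has run.
    Resolved { name: String, description: String },
}

/// Hypothetical registry mapping extension type names to a description.
#[derive(Default)]
struct ExtensionRegistry {
    types: HashMap<String, String>,
}

impl ExtensionRegistry {
    fn register(&mut self, name: &str, description: &str) {
        self.types.insert(name.to_string(), description.to_string());
    }
}

/// Schema construction without a registry: native and canonical types are
/// resolved immediately, everything else becomes a placeholder.
fn type_from_field(extension_name: Option<&str>, storage: &'static str) -> TypeRef {
    match extension_name {
        None => TypeRef::Native(storage),
        Some("arrow.uuid") => TypeRef::Canonical("arrow.uuid"),
        Some(other) => TypeRef::Unresolved { name: other.to_string() },
    }
}

/// The "type resolver pass": replaces every placeholder with the registered
/// type, or fails if the registry does not know the name.
fn resolve_types(
    types: Vec<TypeRef>,
    registry: &ExtensionRegistry,
) -> Result<Vec<TypeRef>, String> {
    types
        .into_iter()
        .map(|t| match t {
            TypeRef::Unresolved { name } => {
                let found = registry.types.get(&name).cloned();
                match found {
                    Some(description) => Ok(TypeRef::Resolved { name, description }),
                    None => Err(format!("unknown extension type: {name}")),
                }
            }
            other => Ok(other),
        })
        .collect()
}
```

With an enum, the placeholder state is visible in the type itself, so a later pass can match on `Unresolved` exhaustively rather than downcasting a trait object. The trade-off is that every consumer has to handle the unresolved case explicitly.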

What is also important is what is not included: an integrated example that makes use of the pretty printer. I tried several avenues but, as you can imagine, each change to the core data structure is a huge plumbing effort (hopefully reduced by the existence of UnresolvedLogicalType).

I really like the suggestion by @paleolimbot to use pretty-printing record batches as the first use case. You can see a mini example in the test that pretty-prints UUIDs. The nice thing is that this probably would not require much plumbing as the [DataFrame] already has access to the [SessionState]. The only thing that's missing for me to actually include this example here is that arrow-rs does not currently support passing custom pretty printers in pretty_format_batches_with_options.

Imagine that the to_string function in the DataFrame does the following:

  1. Look up any extension type information from the schema (in a future world this would already be part of the schema and another lookup would not be necessary).
  2. Gather the pretty printers.
  3. Pass the pretty printers to arrow-rs for formatting.

If you think this is a worthwhile pursuit, we could add the capability to arrow-rs.
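The three steps could look roughly like the sketch below. It is standard-library Rust only and does not use the arrow-rs or DataFusion APIs: `ValuePrinter`, `UuidPrinter`, `Column`, and `format_column` are hypothetical names, and the actual ValuePrettyPrinter in this PR may have a different shape. The only real identifier is `arrow.uuid`, the name of the canonical Arrow UUID extension type.

```rust
use std::collections::HashMap;

/// Hypothetical printer trait (the PR's ValuePrettyPrinter may differ).
trait ValuePrinter {
    fn print(&self, raw: &[u8]) -> Result<String, String>;
}

/// Formats 16 raw bytes as a hyphenated, lowercase UUID string.
struct UuidPrinter;

impl ValuePrinter for UuidPrinter {
    fn print(&self, raw: &[u8]) -> Result<String, String> {
        if raw.len() != 16 {
            return Err(format!("expected 16 bytes, got {}", raw.len()));
        }
        let hex: String = raw.iter().map(|b| format!("{b:02x}")).collect();
        Ok(format!(
            "{}-{}-{}-{}-{}",
            &hex[0..8],
            &hex[8..12],
            &hex[12..16],
            &hex[16..20],
            &hex[20..32]
        ))
    }
}

/// Hypothetical column: an optional extension type name plus raw values.
struct Column {
    extension_name: Option<String>,
    values: Vec<Vec<u8>>,
}

fn format_column(
    column: &Column,
    printers: &HashMap<String, Box<dyn ValuePrinter>>,
) -> Result<Vec<String>, String> {
    // Steps 1 and 2: read the extension type from the column metadata and
    // look up the matching printer, if one is registered.
    let printer = column
        .extension_name
        .as_ref()
        .and_then(|name| printers.get(name));

    // Step 3: hand each value to the printer; fall back to plain hex
    // when no printer is registered for the column.
    column
        .values
        .iter()
        .map(|raw| match printer {
            Some(p) => p.print(raw),
            None => Ok(raw.iter().map(|b| format!("{b:02x}")).collect()),
        })
        .collect()
}
```

In this sketch a column without a registered printer falls back to a plain hex dump, which mirrors the idea that existing formatting stays unchanged unless an extension type opts in.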

Are these changes tested?

Not really, as there is no integrated example yet.

Are there any user-facing changes?

There would be.

@tobixdev changed the title from "[DISCUSSION] Extension Type Registry" to "[DISCUSSION] Extension Type Registry Draft" Nov 8, 2025
@github-actions bot added the logical-expr, core, common, and functions labels Nov 8, 2025
@tobixdev changed the title from "[DISCUSSION] Extension Type Registry Draft" to "[DRAFT] Extension Type Registry Draft" Nov 9, 2025
@tobixdev marked this pull request as draft November 9, 2025 08:09