-
Notifications
You must be signed in to change notification settings - Fork 915
Experiment with Extension as a DataType #7398
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@@ -276,6 +276,7 @@ fn build_extend(array: &ArrayData) -> Extend { | |||
UnionMode::Dense => union::build_extend_dense(array), | |||
}, | |||
DataType::RunEndEncoded(_, _) => todo!(), | |||
DataType::Extension(_) => unimplemented!("Extension not implemented"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This highlights the challenge with this approach, every kernel/function must now be updated to accommodate this new extension type. Not only does this stand a very high chance of regressing functionality, but also is IMO not in the spirit of extension types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the tradeoff is taking on the burden of extensibility (so that other projects don't have to implement it separately, for example, by redefining wrapper DataTypes or Arrays). This extra burden hasn't caused issues in Arrow C++/pyarrow that I'm aware of but absolutely fair that it might here.
Which issue does this PR close?
Experiments in pursuit of #7063 and apache/datafusion#12644 .
Rationale for this change
Customizing behaviour (e.g., sorting, signatures for user defined functions, import/export from/to formats like Parquet, Arrow IPC, and FFI) is increasingly requested. arrow-rs is also increasingly the implementation of choice for compute frameworks and many APIs have been built around the
DataType
andArrayRef
(e.g., GeoArrow, parquet, many parts of DataFusion). To support additional types, those frameworks have to either invent a new array type and rewrite substantial wrappers (e.g., GeoArrow), or use something other than aDataType
(e.g., a Field or MyOwnDataType) and forego having that type automatically pass through APIs from other arrow-rs-based libraries (or arrow itself).Incorporating an ExtensionType as a first-class data type provides a potentially less disruptive route for libraries like parquet and DataFusion to implement new Parquet types and other types other databases implement like JSON and UUID. I also have an experiment going for swapping out a
Field
for aDataType
in DataFusion ( apache/datafusion#15036 ).What changes are included in this PR?
A new
DataType::Extension()
enum member was added. I had hoped to have this beDataType::Extension(Arc<dyn ExtensionType>)
but the current design of the ExtensionType is not dyn-compatible. TheDynExtensionType
trait is really a placeholder to collect the requirements of the rest of arrow-rs for this experiment.This is currently just an experiment/minimum change that actually builds! I'm happy to continue the experiment if there is interest.
Are there any user-facing changes?
Yes, a new data type enum member at least (which is a breaking change).