Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Impl ArrowNativeType and ArrowNativeTypeOp for u128, usize and isize #5176

Closed
wants to merge 2 commits into from

Conversation

mbrobbel
Copy link
Contributor

@mbrobbel mbrobbel commented Dec 6, 2023

Which issue does this PR close?

None. Does it need one?

Rationale for this change

Arrow's Schema.fbs restricts logical integer type bit widths to a max of 64, however it should be fine to use wider ints for custom types in Arrow buffers.

Note that there already is an implementation for i128 (used for decimals).

What changes are included in this PR?

Implements ArrowNativeType and ArrowNativeTypeOp for u128, usize and isize.

Are there any user-facing changes?

Users can now use u128, usize and isize in ScalarBuffer and BufferBuilder.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 6, 2023
@tustvold
Copy link
Contributor

tustvold commented Dec 6, 2023

What is the motivation for this? Perhaps you could expand on your intended use-case?

@mbrobbel
Copy link
Contributor Author

mbrobbel commented Dec 6, 2023

I'm working on an arrow-rs interop feature PR for narrow. For the fixed size physical layout I'm trying to support "values each having the same physical slot width typically measured in bytes" via FixedSize (which is like ArrowNativeType for that crate). To make ArrowNativeType a supertrait of FixedSize (which is required to make the interop work) it needs these implementations.

In addition, what do you think about:

impl<const N: usize, T: ArrowNativeType> ArrowNativeType for [T; N] { ... }

@tustvold
Copy link
Contributor

tustvold commented Dec 6, 2023

Perhaps you could use the Buffer APIs directly instead of using ScalarBuffer? This feels a bit out of scope for these abstractions?

@mbrobbel
Copy link
Contributor Author

mbrobbel commented Dec 6, 2023

I prefer avoiding the unsafe buffer methods and instead use the strongly typed methods, however they are all generic over ArrowNativeType, which is why I'm opening this PR.

Unless I'm mistaken this is following the Arrow spec, so I think it would be a good addition regardles of my use-case.

@tustvold
Copy link
Contributor

tustvold commented Dec 6, 2023

Unless I'm mistaken this is following the Arrow spec

I'm not aware of a part of the arrow specification calling for a pointer width type, i.e. usize or isize, nor an unsigned 128-bit integer?

I prefer avoiding the unsafe buffer methods

I appreciate this, but these APIs are not intended as a general purpose aligned allocation API, rather the backing storage for the arrow-rs array abstractions. Keeping the scope narrow, especially in an area that makes extensive use of unsafe, is quite important.

@mbrobbel
Copy link
Contributor Author

mbrobbel commented Dec 6, 2023

Unless I'm mistaken this is following the Arrow spec

I'm not aware of a part of the arrow specification calling for a pointer width type, i.e. usize or isize, nor an unsigned 128-bit integer?

The physical fixed-size primitive memory layout supports: "values each having the same physical slot width typically measured in bytes". u128, isize, usize (and [T: ArrowNativeType; N]) are types for values each having the same physical slot width, meaning that custom logical types should be able to use them in that physical memory layout I think.

For example, consider a custom UUID extension type. It could be stored in a fixed-size list layout (as suggested in the docs), but it's also valid to store it in a fixed-size primitive layout (using u128s).

@tustvold
Copy link
Contributor

tustvold commented Dec 6, 2023

https://github.com/apache/arrow/blob/92723f34f8df40b35e5840e61011c00766808014/format/Schema.fbs is the canonical list of types within the arrow data model.

For example, consider a custom UUID extension type. It could be stored in a fixed-size list layout (as suggested in the docs), but it's also valid to store it in a fixed-size primitive layout (using u128s).

It could also be stored as a uuid::UUID which happens to transmute safely to [u8; 16], does this mean we should derive ArrowNativeType for UUID? I appreciate this a strawman, but the purpose of ArrowNativeType is to provide the types necessary to implement the arrow data model. It is not intended to be some general purpose byte transmutation tool, a la something like bytemuck.

If the intention is to use this for FixedSizeBinary I wonder if you could just use a normal Buffer and then use something like bytemuck to safely transmute the [u8] to the desired [T].

@mbrobbel
Copy link
Contributor Author

mbrobbel commented Dec 6, 2023

https://github.com/apache/arrow/blob/92723f34f8df40b35e5840e61011c00766808014/format/Schema.fbs is the canonical list of types within the arrow data model.

I'm interested in extension types.

For example, consider a custom UUID extension type. It could be stored in a fixed-size list layout (as suggested in the docs), but it's also valid to store it in a fixed-size primitive layout (using u128s).

It could also be stored as a uuid::UUID which happens to transmute safely to [u8; 16], does this mean we should derive ArrowNativeType for UUID?

No, I think we should only implement it for [T; N] where T: ArrowNativeType. I consider the conversion to a "logical type" (in this case an extension type for a uuid) another layer.

If the intention is to use this for FixedSizeBinary I wonder if you could just use a normal Buffer and then use something like bytemuck to safely transmute the [u8] to the desired [T].

I'm trying to use this for the physical fixed-size primitive layout, which in my mind is like a Vec<T> where T: ArrowNativeType.

@tustvold
Copy link
Contributor

tustvold commented Dec 6, 2023

I appreciate the context, but I think if narrow wants such abstractions it should look to provide them itself. ArrowNativeType exists to serve the requirements of arrow-rs' array abstractions and not as a general purpose transmutation API.

This is inherently a very unsafe part of the API and I would very much like to keep it focused to avoid unintentional soundness issues.

Perhaps @alamb or @viirya might have thoughts on this?

@alamb
Copy link
Contributor

alamb commented Dec 6, 2023

Perhaps @alamb or @viirya might have thoughts on this?

I think making u128 an "ArrowNativeType" when there are no actual arrays that use them as native types is confusing 🤔

I'm working on an arrow-rs interop feature mbrobbel/narrow#100 for narrow. For the fixed size physical layout I'm trying to support "values each having the same physical slot width typically measured in bytes" via FixedSize (which is like ArrowNativeType for that crate). To make ArrowNativeType a supertrait of FixedSize (which is required to make the interop work) it needs these implementations.

I don't fully understand this -- given there is no arrow-rs array that will have an ArrowNativeType of u128 etc. is changing ArrowNativeType the only way to implement an interop layer ?

Is there an approach where you could implement conversions for an array to FixedSize with dynamic dispatch (e.g. match data_type { .... } )? You would have to list all the types, but that is feasible

@viirya
Copy link
Member

viirya commented Dec 6, 2023

As u128, usize, isize are not in the type list of Arrow spec (i.e., Schema.fbs), and we also define it as Trait expressing a Rust type that has the same in-memory representation as Arrow, it makes me think that it sounds confusing to extend it for arbitrary types like above even the argument that they have fixed-size primitive memory layout.

@mbrobbel
Copy link
Contributor Author

mbrobbel commented Dec 6, 2023

Thank you all for the thoughtful comments.

I can see how it doesn't make sense to add this. I'll try to figure out another way to make it work with narrow.

@mbrobbel mbrobbel closed this Dec 6, 2023
@mbrobbel mbrobbel deleted the u128-native-type branch December 6, 2023 21:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants