-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-24868: [C++] Add a Tensor logical value type with varying dimensions, implemented using ExtensionType #37166
Conversation
|
31f9dc6
to
db8d764
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just looked through the rst
portion and have a few questions.
Looking at PyTorch This makes me assume strides are stored for caching (and not to allow variable data layout per tensor). Should we adjust to store permutations (or strides) per row too or rather leave that for another extension ( |
No strong opinion regarding storing strides. Cc @thomasw21 I'm not even sure that nested tensors are used that often no ? I'm also pinging other experts to get more insights. |
How much extra complexity is added if we enable each tensor in an array to have it's own permutation defined? |
@AlenkaF Storing permutations would allow per tensor memory layout, which I'm not sure is really needed in practice and could be confusing and complex. |
0c43cd1
to
3390915
Compare
Co-authored-by: Joris Van den Bossche <[email protected]>
@jorisvandenbossche Can I let you give this the final review? |
@jorisvandenbossche can we merge this? |
- Example of minimal metadata is: | ||
|
||
``{}`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, one more small nitpick: the minimal metadata is actually no metadata, which is typically represented as an empty string (I am actually not fully sure if in this case the metadata key could also just not be present in the field metadata), instead of an empty json dict (I don't think we should necessarily recommend using an empty dict)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a fair point! Empty string feels like the safer choice here. See my suggested change below.
Co-authored-by: Rok Mihevc <[email protected]>
Thanks a lot @rok ! |
…ns, implemented using ExtensionType (#37166) ### Rationale for this change For use cases where underlying datatype and number of dimensions in tensors are equal but not the actual shape we want to add a `VariableShapeTensorType`. See #24868 and huggingface/datasets#5272 ### What changes are included in this PR? This introduces definition of `arrow.variable_shape_tensor` extension and it's C++ implementation and a Python wrapper. ### Are these changes tested? Yes. ### Are there any user-facing changes? This introduces new extension type to the user. * Closes: #24868 Lead-authored-by: Rok Mihevc <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit a7fab04. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…mensions, implemented using ExtensionType (apache#37166) ### Rationale for this change For use cases where underlying datatype and number of dimensions in tensors are equal but not the actual shape we want to add a `VariableShapeTensorType`. See apache#24868 and huggingface/datasets#5272 ### What changes are included in this PR? This introduces definition of `arrow.variable_shape_tensor` extension and it's C++ implementation and a Python wrapper. ### Are these changes tested? Yes. ### Are there any user-facing changes? This introduces new extension type to the user. * Closes: apache#24868 Lead-authored-by: Rok Mihevc <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
…mensions, implemented using ExtensionType (apache#37166) ### Rationale for this change For use cases where underlying datatype and number of dimensions in tensors are equal but not the actual shape we want to add a `VariableShapeTensorType`. See apache#24868 and huggingface/datasets#5272 ### What changes are included in this PR? This introduces definition of `arrow.variable_shape_tensor` extension and it's C++ implementation and a Python wrapper. ### Are these changes tested? Yes. ### Are there any user-facing changes? This introduces new extension type to the user. * Closes: apache#24868 Lead-authored-by: Rok Mihevc <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
…mensions, implemented using ExtensionType (apache#37166) ### Rationale for this change For use cases where underlying datatype and number of dimensions in tensors are equal but not the actual shape we want to add a `VariableShapeTensorType`. See apache#24868 and huggingface/datasets#5272 ### What changes are included in this PR? This introduces definition of `arrow.variable_shape_tensor` extension and it's C++ implementation and a Python wrapper. ### Are these changes tested? Yes. ### Are there any user-facing changes? This introduces new extension type to the user. * Closes: apache#24868 Lead-authored-by: Rok Mihevc <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
Rationale for this change
For use cases where underlying datatype and number of dimensions in tensors are equal but not the actual shape we want to add a
VariableShapeTensorType
.See #24868 and huggingface/datasets#5272
What changes are included in this PR?
This introduces definition of
arrow.variable_shape_tensor
extension and it's C++ implementation and a Python wrapper.Are these changes tested?
Yes.
Are there any user-facing changes?
This introduces new extension type to the user.