-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-17682: [Format] Add Bool8 Canonical Extension Type #43234
Conversation
|
go/arrow/extensions/bool8.go
Outdated
"github.com/apache/arrow/go/v17/internal/json" | ||
) | ||
|
||
const ExtensionNameBool8 = "bool8" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any particular reason for this to be exported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no reason currently, but I did have a thought about how it could be used.
Right now we have the functions (Register|Unregister|Get)ExtensionType
for interacting with the global extension registry which developers can use to opt in to any extensions their application recognizes. As we make a clearer distinction between regular extension types and canonical extension types, I was thinking that we could automatically pre-register extensions that are canonical so that applications automatically recognize them.
Whether we do this or not, the exported name key could make it easier to interact with the registry using a consistent identifier.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, i think that's fine, not sure about automatically registering canonical types though. Does the C++ lib do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears it does actually: https://github.com/apache/arrow/blob/main/cpp/src/arrow/extension_type.cc#L144-L154
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this const entirely because it's only used in one place (in the Bool8Type.ExtensionName
impl) and it's simple enough to just get the type itself, which is exported, and just call ExtensionName()
if needed.
I'll also leave out auto-registering canonical extensions for now. Once Opaque merges and UUID gets moved out of internal
, we can set that up for them together.
FYI all subsequent code changes are in #43323
go/arrow/extensions/bool8.go
Outdated
|
||
func (b *Bool8Type) Serialize() string { return ExtensionNameBool8 } | ||
|
||
func (b *Bool8Type) String() string { return fmt.Sprintf("Bool8<storage=%s>", b.Storage) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
convention is all lower case
I also think that convention is extension_type<storage=%s>
for the String method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like that's exactly what the ExtensionBase
will fall back to if this is not implemented:
func (e *ExtensionBase) String() string { return fmt.Sprintf("extension_type<storage=%s>", e.Storage) }
Is there a convention for overriding this when the actual extension type is known, or should it just fall back to the default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the best would be to check the C++ lib implementations and see what they do. Ideally keeping the string representations close would be great
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The convention in C++ (current precedent is just FixedShapeTensor for now) appears to be extension<EXTENSION_NAME[TYPE_PARAM_NAME=TYPE_PARAM_VALUE, ...]>
.
For FixedShapeTensor that ends up looking like extension<arrow.fixed_shape_tensor[value_type=int32, shape=[2,2]]>
for example.
So then I guess in this case it would be extension<arrow.bool8>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed this change to #43323
go/arrow/array/builder.go
Outdated
case arrow.EXTENSION: | ||
typ := dtype.(arrow.ExtensionType) | ||
bldr := NewExtensionBuilder(mem, typ) | ||
if custom, ok := typ.(ExtensionBuilderWrapper); ok { | ||
return custom.NewBuilder(bldr) | ||
if custom, ok := dtype.(CustomExtensionBuilder); ok { | ||
return custom.NewBuilder(mem) | ||
} | ||
return bldr | ||
return NewExtensionBuilder(mem, dtype.(arrow.ExtensionType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you ensure this is being tested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good idea. I just added some additional tests comparing the behavior of an extension type with/without a custom builder to really make the distinction clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one LGTM!
|
||
* Description of the serialization: | ||
|
||
No metadata is required to interpret the type. Any metadata present should be ignored. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO we should not leave any ambiguity here. Perhaps reuse the same wording as for JSON above?
No metadata is required to interpret the type. Any metadata present should be ignored. | |
Metadata is either an empty string or a JSON string with an empty object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me. Any objection with just Metadata is an empty string
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it would be fine with me.
Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit c537700. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them. |
### Rationale for this change Go implementation of #43234 ### What changes are included in this PR? - Go implementation of the `Bool8` extension type - Minor refactor of existing extension builder interfaces ### Are these changes tested? Yes, unit tests and basic read/write benchmarks are included. ### Are there any user-facing changes? - A new extension type is added - Custom extension builders no longer need another builder created and released separately. * GitHub Issue: #17682 Authored-by: Joel Lubinitsky <[email protected]> Signed-off-by: Joel Lubinitsky <[email protected]>
### Rationale for this change C++ and Python implementations of #43234 ### What changes are included in this PR? - Implement C++ `Bool8Type`, `Bool8Array`, `Bool8Scalar`, and tests - Implement Python bindings to C++, as well as zero-copy numpy conversion methods - TODO: docs waiting for rebase on #43458 ### Are these changes tested? Yes ### Are there any user-facing changes? Bool8 extension type will be available in C++ and Python libraries * GitHub Issue: #17682 Authored-by: Joel Lubinitsky <[email protected]> Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
Rationale for this change
Closes: #17682
Arrow Boolean arrays store values as individual bits, which is a very compact representation but does not match the layout of many systems with which it interoperates. By adding an 8-bit Boolean extension type, zero-copy compatibility with many systems can be improved at the cost of large physical representation.
Go implementation: #43323
C++ / Python implementation: #43488
What changes are included in this PR?
Proposal and documentation for
Bool8
canonical extension type.Are these changes tested?
N/A
Are there any user-facing changes?
N/A