-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-17682: [C++][Python] Bool8 Extension Type Implementation #43488
Merged
Merged
Changes from all commits
Commits
Show all changes
21 commits
Select commit
Hold shift + click to select a range
c49815b
initial bool8
joellubi c51cb62
fix makearray and add comments
joellubi ca7b56d
add python bindings
joellubi 35d1ad9
handle non-zero-copy conversion to numpy
joellubi f07af50
add from_numpy, improve tests
joellubi 3f45772
simplify bool8 constructor
joellubi 03fa22f
fix precommit
joellubi 15fc298
fix docstring formatting
joellubi 84c161b
from_numpy using from_storage
joellubi 0cea571
impl as_py() for scalar
joellubi 71a5d96
fix pre-commit
joellubi b76013e
incorporate review comments
joellubi aa36147
fix whitespace
joellubi dc192f5
register extension by default
joellubi 735f591
more scalar tests
joellubi b2f83a1
run precommit
joellubi 8d07669
update docs and writable
joellubi aef60b1
fix doctest
joellubi 7084c19
pa.scalar working for extension types
joellubi 5bb387b
fix docstring
joellubi c697eda
apply python review suggestions
joellubi File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
// Licensed to the Apache Software Foundation (ASF) under one | ||
// or more contributor license agreements. See the NOTICE file | ||
// distributed with this work for additional information | ||
// regarding copyright ownership. The ASF licenses this file | ||
// to you under the Apache License, Version 2.0 (the | ||
// "License"); you may not use this file except in compliance | ||
// with the License. You may obtain a copy of the License at | ||
// | ||
// http://www.apache.org/licenses/LICENSE-2.0 | ||
// | ||
// Unless required by applicable law or agreed to in writing, | ||
// software distributed under the License is distributed on an | ||
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
// KIND, either express or implied. See the License for the | ||
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
#include <sstream> | ||
|
||
#include "arrow/extension/bool8.h" | ||
#include "arrow/util/logging.h" | ||
|
||
namespace arrow::extension { | ||
|
||
bool Bool8Type::ExtensionEquals(const ExtensionType& other) const { | ||
return extension_name() == other.extension_name(); | ||
} | ||
|
||
std::string Bool8Type::ToString(bool show_metadata) const { | ||
std::stringstream ss; | ||
ss << "extension<" << this->extension_name() << ">"; | ||
return ss.str(); | ||
} | ||
|
||
std::string Bool8Type::Serialize() const { return ""; } | ||
|
||
Result<std::shared_ptr<DataType>> Bool8Type::Deserialize( | ||
std::shared_ptr<DataType> storage_type, const std::string& serialized_data) const { | ||
if (storage_type->id() != Type::INT8) { | ||
return Status::Invalid("Expected INT8 storage type, got ", storage_type->ToString()); | ||
} | ||
joellubi marked this conversation as resolved.
Show resolved
Hide resolved
|
||
if (serialized_data != "") { | ||
return Status::Invalid("Serialize data must be empty, got ", serialized_data); | ||
} | ||
return bool8(); | ||
} | ||
|
||
std::shared_ptr<Array> Bool8Type::MakeArray(std::shared_ptr<ArrayData> data) const { | ||
DCHECK_EQ(data->type->id(), Type::EXTENSION); | ||
DCHECK_EQ("arrow.bool8", | ||
internal::checked_cast<const ExtensionType&>(*data->type).extension_name()); | ||
return std::make_shared<Bool8Array>(data); | ||
} | ||
|
||
Result<std::shared_ptr<DataType>> Bool8Type::Make() { | ||
return std::make_shared<Bool8Type>(); | ||
} | ||
|
||
std::shared_ptr<DataType> bool8() { return std::make_shared<Bool8Type>(); } | ||
|
||
} // namespace arrow::extension |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
// Licensed to the Apache Software Foundation (ASF) under one | ||
// or more contributor license agreements. See the NOTICE file | ||
// distributed with this work for additional information | ||
// regarding copyright ownership. The ASF licenses this file | ||
// to you under the Apache License, Version 2.0 (the | ||
// "License"); you may not use this file except in compliance | ||
// with the License. You may obtain a copy of the License at | ||
// | ||
// http://www.apache.org/licenses/LICENSE-2.0 | ||
// | ||
// Unless required by applicable law or agreed to in writing, | ||
// software distributed under the License is distributed on an | ||
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
// KIND, either express or implied. See the License for the | ||
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
#include "arrow/extension_type.h" | ||
|
||
namespace arrow::extension { | ||
|
||
/// \brief Bool8 is an alternate representation for boolean | ||
/// arrays using 8 bits instead of 1 bit per value. The underlying | ||
/// storage type is int8. | ||
class ARROW_EXPORT Bool8Array : public ExtensionArray { | ||
public: | ||
using ExtensionArray::ExtensionArray; | ||
}; | ||
|
||
/// \brief Bool8 is an alternate representation for boolean | ||
/// arrays using 8 bits instead of 1 bit per value. The underlying | ||
/// storage type is int8. | ||
class ARROW_EXPORT Bool8Type : public ExtensionType { | ||
public: | ||
/// \brief Construct a Bool8Type. | ||
Bool8Type() : ExtensionType(int8()) {} | ||
|
||
std::string extension_name() const override { return "arrow.bool8"; } | ||
std::string ToString(bool show_metadata = false) const override; | ||
|
||
bool ExtensionEquals(const ExtensionType& other) const override; | ||
|
||
std::string Serialize() const override; | ||
|
||
Result<std::shared_ptr<DataType>> Deserialize( | ||
std::shared_ptr<DataType> storage_type, | ||
const std::string& serialized_data) const override; | ||
|
||
/// Create a Bool8Array from ArrayData | ||
std::shared_ptr<Array> MakeArray(std::shared_ptr<ArrayData> data) const override; | ||
|
||
static Result<std::shared_ptr<DataType>> Make(); | ||
}; | ||
|
||
/// \brief Return a Bool8Type instance. | ||
ARROW_EXPORT std::shared_ptr<DataType> bool8(); | ||
|
||
} // namespace arrow::extension |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,91 @@ | ||
// Licensed to the Apache Software Foundation (ASF) under one | ||
// or more contributor license agreements. See the NOTICE file | ||
// distributed with this work for additional information | ||
// regarding copyright ownership. The ASF licenses this file | ||
// to you under the Apache License, Version 2.0 (the | ||
// "License"); you may not use this file except in compliance | ||
// with the License. You may obtain a copy of the License at | ||
// | ||
// http://www.apache.org/licenses/LICENSE-2.0 | ||
// | ||
// Unless required by applicable law or agreed to in writing, | ||
// software distributed under the License is distributed on an | ||
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
// KIND, either express or implied. See the License for the | ||
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
#include "arrow/extension/bool8.h" | ||
#include "arrow/io/memory.h" | ||
#include "arrow/ipc/reader.h" | ||
#include "arrow/ipc/writer.h" | ||
#include "arrow/testing/extension_type.h" | ||
#include "arrow/testing/gtest_util.h" | ||
|
||
namespace arrow { | ||
|
||
TEST(Bool8Type, Basics) { | ||
auto type = internal::checked_pointer_cast<extension::Bool8Type>(extension::bool8()); | ||
auto type2 = internal::checked_pointer_cast<extension::Bool8Type>(extension::bool8()); | ||
ASSERT_EQ("arrow.bool8", type->extension_name()); | ||
ASSERT_EQ(*type, *type); | ||
ASSERT_NE(*arrow::null(), *type); | ||
ASSERT_EQ(*type, *type2); | ||
ASSERT_EQ(*arrow::int8(), *type->storage_type()); | ||
ASSERT_EQ("", type->Serialize()); | ||
ASSERT_EQ("extension<arrow.bool8>", type->ToString(false)); | ||
} | ||
|
||
TEST(Bool8Type, CreateFromArray) { | ||
auto type = internal::checked_pointer_cast<extension::Bool8Type>(extension::bool8()); | ||
auto storage = ArrayFromJSON(int8(), "[-1,0,1,2,null]"); | ||
auto array = ExtensionType::WrapArray(type, storage); | ||
ASSERT_EQ(5, array->length()); | ||
ASSERT_EQ(1, array->null_count()); | ||
} | ||
|
||
TEST(Bool8Type, Deserialize) { | ||
auto type = internal::checked_pointer_cast<extension::Bool8Type>(extension::bool8()); | ||
ASSERT_OK_AND_ASSIGN(auto deserialized, type->Deserialize(type->storage_type(), "")); | ||
ASSERT_EQ(*type, *deserialized); | ||
ASSERT_NOT_OK(type->Deserialize(type->storage_type(), "must be empty")); | ||
ASSERT_EQ(*type, *deserialized); | ||
ASSERT_NOT_OK(type->Deserialize(uint8(), "")); | ||
ASSERT_EQ(*type, *deserialized); | ||
} | ||
|
||
TEST(Bool8Type, MetadataRoundTrip) { | ||
auto type = internal::checked_pointer_cast<extension::Bool8Type>(extension::bool8()); | ||
std::string serialized = type->Serialize(); | ||
ASSERT_OK_AND_ASSIGN(auto deserialized, | ||
type->Deserialize(type->storage_type(), serialized)); | ||
ASSERT_EQ(*type, *deserialized); | ||
} | ||
|
||
TEST(Bool8Type, BatchRoundTrip) { | ||
auto type = internal::checked_pointer_cast<extension::Bool8Type>(extension::bool8()); | ||
|
||
auto storage = ArrayFromJSON(int8(), "[-1,0,1,2,null]"); | ||
auto array = ExtensionType::WrapArray(type, storage); | ||
auto batch = | ||
RecordBatch::Make(schema({field("field", type)}), array->length(), {array}); | ||
|
||
std::shared_ptr<RecordBatch> written; | ||
{ | ||
ASSERT_OK_AND_ASSIGN(auto out_stream, io::BufferOutputStream::Create()); | ||
ASSERT_OK(ipc::WriteRecordBatchStream({batch}, ipc::IpcWriteOptions::Defaults(), | ||
out_stream.get())); | ||
|
||
ASSERT_OK_AND_ASSIGN(auto complete_ipc_stream, out_stream->Finish()); | ||
|
||
io::BufferReader reader(complete_ipc_stream); | ||
std::shared_ptr<RecordBatchReader> batch_reader; | ||
ASSERT_OK_AND_ASSIGN(batch_reader, ipc::RecordBatchStreamReader::Open(&reader)); | ||
ASSERT_OK(batch_reader->ReadNext(&written)); | ||
} | ||
|
||
ASSERT_EQ(*batch->schema(), *written->schema()); | ||
ASSERT_BATCHES_EQUAL(*batch, *written); | ||
} | ||
|
||
} // namespace arrow |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -174,6 +174,7 @@ def print_entry(label, value): | |
run_end_encoded, | ||
fixed_shape_tensor, | ||
opaque, | ||
bool8, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can keep the PyArrow changes for another PR to make reviewing simpler, IMHO. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you think it's ok to leave this PR as it is? |
||
field, | ||
type_for_alias, | ||
DataType, DictionaryType, StructType, | ||
|
@@ -184,7 +185,7 @@ def print_entry(label, value): | |
FixedSizeBinaryType, Decimal128Type, Decimal256Type, | ||
BaseExtensionType, ExtensionType, | ||
RunEndEncodedType, FixedShapeTensorType, OpaqueType, | ||
PyExtensionType, UnknownExtensionType, | ||
Bool8Type, PyExtensionType, UnknownExtensionType, | ||
register_extension_type, unregister_extension_type, | ||
DictionaryMemo, | ||
KeyValueMetadata, | ||
|
@@ -218,7 +219,7 @@ def print_entry(label, value): | |
MonthDayNanoIntervalArray, | ||
Decimal128Array, Decimal256Array, StructArray, ExtensionArray, | ||
RunEndEncodedArray, FixedShapeTensorArray, OpaqueArray, | ||
scalar, NA, _NULL as NULL, Scalar, | ||
Bool8Array, scalar, NA, _NULL as NULL, Scalar, | ||
NullScalar, BooleanScalar, | ||
Int8Scalar, Int16Scalar, Int32Scalar, Int64Scalar, | ||
UInt8Scalar, UInt16Scalar, UInt32Scalar, UInt64Scalar, | ||
|
@@ -235,7 +236,7 @@ def print_entry(label, value): | |
FixedSizeBinaryScalar, DictionaryScalar, | ||
MapScalar, StructScalar, UnionScalar, | ||
RunEndEncodedScalar, ExtensionScalar, | ||
FixedShapeTensorScalar, OpaqueScalar) | ||
FixedShapeTensorScalar, OpaqueScalar, Bool8Scalar) | ||
|
||
# Buffers, allocation | ||
from pyarrow.lib import (DeviceAllocationType, Device, MemoryManager, | ||
|
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Emm why is this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what's specified in "description of the serialization" for Bool8.
This method is generally used to encode type parameters, but for bool8 there are no parameters. The type is fully defined by its name and storage type.