GH-35627: [C++][Format][Integration] Add string view to the arrow format #35628

Closed
38 commits
6d490ef
Draft basic scaffolding for Binary/StringView types and get compiling
wesm Sep 9, 2022
21688eb
BinaryViewBuilder: fix duplicate values in null bitmap
zagto Oct 18, 2022
6b6cd95
enable JSON converter for StringView/BinaryView
zagto Oct 18, 2022
52eb446
add StringView/BinaryView to AllTypeIds
zagto Oct 18, 2022
8a75259
implement inline visitor for StringView/BinaryView
zagto Oct 18, 2022
1d81aea
fix formatting
zagto Oct 18, 2022
3931e25
fix formatting
zagto Oct 18, 2022
7fe8e2d
run binary data visitor tests on StringView/BinaryView
zagto Oct 18, 2022
6624acc
fixes in substrait, rename in LICENSE, owning scalars
bkietz Nov 15, 2022
8511bf1
delete potentially internal viewing members for rvalues
bkietz Nov 18, 2022
6df010f
Added validation for StringView arrays
bkietz Nov 18, 2022
5c24fd5
Adding comparison and concatenation
bkietz Nov 18, 2022
190648c
wrote <=, needed >=
bkietz Nov 20, 2022
018b49f
Extract visitation of views owning buffers
bkietz Nov 28, 2022
0ee9d89
add cast to/from string_view
bkietz Nov 29, 2022
6ed4ac0
Adding IPC serde of views by converting to/from dense strings
bkietz Nov 30, 2022
a8bf258
initial attempt at indices/offsets repr in arrow
bkietz Apr 13, 2023
bfe16b3
formatting, compilation fixes
bkietz May 23, 2023
1a01463
CI fixes
bkietz May 24, 2023
bdf7836
format python too
bkietz May 24, 2023
80a8758
more casts
bkietz May 24, 2023
4cd23f2
msvc: more casts
bkietz May 24, 2023
2d93d97
Extend R converter with binary view support
bkietz May 24, 2023
2c395c3
more casts
bkietz May 24, 2023
e626949
read/write hex encode prefix for string view too
bkietz May 24, 2023
71db0dc
review comments re: Columnar.rst description
bkietz May 24, 2023
c961942
msvc: explicit unreachable switch default
bkietz May 25, 2023
84f0723
r: exclude decimal from binary_like
bkietz May 25, 2023
175baac
go impl isn't here... yet
bkietz May 25, 2023
b5196a6
add benchmarks for filter, take, sort
bkietz Jun 8, 2023
dd4a5be
clang-format
bkietz Jun 8, 2023
aaad04f
msvc: doesn't have constexpr conversion operator?
bkietz Jun 8, 2023
d71c223
try fix for opentelemetry build failure
bkietz Jun 9, 2023
5f06a1c
review comments
bkietz Jun 15, 2023
def55c6
repair merge error
bkietz Jun 16, 2023
dddc3df
formatting again
bkietz Jun 16, 2023
fdda27c
DCHECK_EQ() not DCHECK(==)
bkietz Jun 19, 2023
46cf7e6
add BinaryViewLike matcher
bkietz Jun 19, 2023
16 changes: 15 additions & 1 deletion LICENSE.txt
@@ -1894,7 +1894,7 @@ This project includes code from the autobrew project.
The following files are based on code from the autobrew project:
* r/tools/autobrew
* dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb
* dev/tasks/homebrew-formulae/autobrew/apache-arrow-static.rb
* dev/tasks/homebrew-formulae/autobrew/apache-arrow-static.rb

Copyright (c) 2019, Jeroen Ooms
License: MIT
@@ -1976,6 +1976,20 @@ License: http://www.apache.org/licenses/LICENSE-2.0

--------------------------------------------------------------------------------

This project includes code from Velox.

* cpp/src/arrow/util/string_header.h

is based on Velox's

* velox/type/StringView.h

Copyright: Copyright (c) Facebook, Inc. and its affiliates.
Home page: https://github.com/facebookincubator/velox
License: http://www.apache.org/licenses/LICENSE-2.0

--------------------------------------------------------------------------------

The file cpp/src/arrow/vendored/musl/strptime.c has the following license

Copyright © 2005-2020 Rich Felker, et al.
1 change: 1 addition & 0 deletions cpp/cmake_modules/ThirdpartyToolchain.cmake
@@ -4593,6 +4593,7 @@ macro(build_opentelemetry)
-DWITH_OTLP=ON
-DWITH_OTLP_HTTP=ON
-DWITH_OTLP_GRPC=OFF
-DWITH_STL=ON
"-DProtobuf_INCLUDE_DIR=${OPENTELEMETRY_PROTOBUF_INCLUDE_DIR}"
"-DProtobuf_LIBRARY=${OPENTELEMETRY_PROTOBUF_INCLUDE_DIR}"
"-DProtobuf_PROTOC_EXECUTABLE=${OPENTELEMETRY_PROTOC_EXECUTABLE}")
1 change: 1 addition & 0 deletions cpp/src/arrow/CMakeLists.txt
@@ -226,6 +226,7 @@ set(ARROW_SRCS
util/ree_util.cc
util/string.cc
util/string_builder.cc
util/string_header.cc
util/task_group.cc
util/tdigest.cc
util/thread_pool.cc
4 changes: 4 additions & 0 deletions cpp/src/arrow/array/array_base.cc
@@ -87,6 +87,10 @@ struct ScalarFromArraySlotImpl {
return Finish(a.GetString(index_));
}

Status Visit(const BinaryViewArray& a) {
return Finish(std::string{a.GetView(index_)});
}

Status Visit(const FixedSizeBinaryArray& a) { return Finish(a.GetString(index_)); }

Status Visit(const DayTimeIntervalArray& a) { return Finish(a.Value(index_)); }
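Note on the `BinaryViewArray` visitor above: `GetView` returns a non-owning `std::string_view`, so the scalar is built from a `std::string` copy of the bytes. A minimal standalone sketch of why the copy matters (`OwnedCopy` is a hypothetical helper for illustration, not part of this PR):

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: materialize an owning value from a borrowed view,
// as the visitor does with std::string{a.GetView(index_)}.
// The returned string no longer depends on the storage behind `view`.
std::string OwnedCopy(std::string_view view) { return std::string{view}; }
```

A view keeps tracking later changes to the storage it borrows from, while the owned copy does not.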
22 changes: 22 additions & 0 deletions cpp/src/arrow/array/array_binary.cc
@@ -89,6 +89,28 @@ LargeStringArray::LargeStringArray(int64_t length,

Status LargeStringArray::ValidateUTF8() const { return internal::ValidateUTF8(*data_); }

BinaryViewArray::BinaryViewArray(const std::shared_ptr<ArrayData>& data) {
ARROW_CHECK_EQ(data->type->id(), Type::BINARY_VIEW);
SetData(data);
}

BinaryViewArray::BinaryViewArray(int64_t length, std::shared_ptr<Buffer> headers,
BufferVector char_buffers,
std::shared_ptr<Buffer> null_bitmap, int64_t null_count,
int64_t offset)
: PrimitiveArray(binary_view(), length, std::move(headers), std::move(null_bitmap),
null_count, offset) {
data_->buffers.resize(char_buffers.size() + 2);
std::move(char_buffers.begin(), char_buffers.end(), data_->buffers.begin() + 2);
}

StringViewArray::StringViewArray(const std::shared_ptr<ArrayData>& data) {
ARROW_CHECK_EQ(data->type->id(), Type::STRING_VIEW);
SetData(data);
}

Status StringViewArray::ValidateUTF8() const { return internal::ValidateUTF8(*data_); }

FixedSizeBinaryArray::FixedSizeBinaryArray(const std::shared_ptr<ArrayData>& data) {
SetData(data);
}
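Note on the `BinaryViewArray` constructor above: it appears to rely on `PrimitiveArray` having already stored the null bitmap and the headers as buffers 0 and 1, then appends the character buffers from index 2 onward. A rough standalone sketch of that layout, assuming this reading is right (`ToyBuffer` and `AssembleBuffers` are hypothetical stand-ins, not Arrow APIs):

```cpp
#include <algorithm>  // range overload of std::move
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for std::shared_ptr<arrow::Buffer>.
using ToyBuffer = std::shared_ptr<std::string>;

// Builds the buffer list in the order the constructor seems to expect:
//   [0] null bitmap (may be null), [1] headers, [2...] character buffers.
std::vector<ToyBuffer> AssembleBuffers(ToyBuffer null_bitmap, ToyBuffer headers,
                                       std::vector<ToyBuffer> char_buffers) {
  std::vector<ToyBuffer> buffers;
  buffers.push_back(std::move(null_bitmap));
  buffers.push_back(std::move(headers));
  // Same two steps as the constructor body: grow, then move into slots 2+.
  buffers.resize(char_buffers.size() + 2);
  std::move(char_buffers.begin(), char_buffers.end(), buffers.begin() + 2);
  return buffers;
}
```

This ordering is what lets `GetView` locate character buffer `k` at `buffers[k + 2]`.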
79 changes: 79 additions & 0 deletions cpp/src/arrow/array/array_binary.h
@@ -22,6 +22,7 @@

#include <cstdint>
#include <memory>
#include <optional>
#include <string>
#include <string_view>
#include <vector>
@@ -217,6 +218,84 @@ class ARROW_EXPORT LargeStringArray : public LargeBinaryArray {
Status ValidateUTF8() const;
};

// ----------------------------------------------------------------------
// BinaryView and StringView

/// Concrete Array class for variable-size binary view data using the
/// StringHeader struct to reference in-line or out-of-line string values
class ARROW_EXPORT BinaryViewArray : public PrimitiveArray {
public:
using TypeClass = BinaryViewType;
using IteratorType = stl::ArrayIterator<BinaryViewArray>;

explicit BinaryViewArray(const std::shared_ptr<ArrayData>& data);

BinaryViewArray(int64_t length, std::shared_ptr<Buffer> headers,
BufferVector char_buffers,
std::shared_ptr<Buffer> null_bitmap = NULLPTR,
int64_t null_count = kUnknownNullCount, int64_t offset = 0);

const StringHeader* raw_values() const {
return reinterpret_cast<const StringHeader*>(raw_values_) + data_->offset;
}

// For API compatibility with BinaryArray etc.
std::string_view GetView(int64_t i) const {
const auto& s = raw_values()[i];
if (raw_pointers_) {
return std::string_view{s};
}
if (s.IsInline()) {
return {s.GetInlineData(), s.size()};
}
auto* char_buffers = data_->buffers.data() + 2;
return {char_buffers[s.GetBufferIndex()]->data_as<char>() + s.GetBufferOffset(),
s.size()};
}

std::optional<std::string_view> operator[](int64_t i) const {
return *IteratorType(*this, i);
}

IteratorType begin() const { return IteratorType(*this); }
IteratorType end() const { return IteratorType(*this, length()); }

bool has_raw_pointers() const { return raw_pointers_; }

protected:
using PrimitiveArray::PrimitiveArray;

void SetData(const std::shared_ptr<ArrayData>& data) {
PrimitiveArray::SetData(data);
raw_pointers_ =
internal::checked_cast<const BinaryViewType&>(*type()).has_raw_pointers();
}

bool raw_pointers_ = false;
};

/// Concrete Array class for variable-size string view (utf-8) data using
/// StringHeader to reference in-line or out-of-line string values
class ARROW_EXPORT StringViewArray : public BinaryViewArray {
public:
using TypeClass = StringViewType;

explicit StringViewArray(const std::shared_ptr<ArrayData>& data);

StringViewArray(int64_t length, std::shared_ptr<Buffer> data, BufferVector char_buffers,
std::shared_ptr<Buffer> null_bitmap = NULLPTR,
int64_t null_count = kUnknownNullCount, int64_t offset = 0)
: BinaryViewArray(length, std::move(data), std::move(char_buffers),
std::move(null_bitmap), null_count, offset) {
data_->type = utf8_view();
}

/// \brief Validate that this array contains only valid UTF8 entries
///
/// This check is also implied by ValidateFull()
Status ValidateUTF8() const;
};

// ----------------------------------------------------------------------
// Fixed width binary
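Note on `BinaryViewArray::GetView` above: ignoring the raw-pointer case, a header either holds a short value inline or names a character buffer and an offset into it. The following is a simplified sketch of that lookup, not the PR's `StringHeader`: `ToyHeader`, `MakeHeader`, and the free function `GetView` are hypothetical, and the 12-byte inline threshold is an assumption borrowed from Velox's `StringView`. The toy struct also keeps the inline bytes and the index/offset pair in separate fields for clarity rather than packing them into a compact header.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical, unpacked stand-in for StringHeader.
struct ToyHeader {
  static constexpr uint32_t kInlineSize = 12;  // assumed threshold

  uint32_t size = 0;
  char inlined[kInlineSize] = {};  // used when size <= kInlineSize
  uint32_t buffer_index = 0;       // used otherwise
  uint32_t buffer_offset = 0;      // used otherwise

  bool IsInline() const { return size <= kInlineSize; }
};

// Stores short values in the header itself; appends long values to the last
// character buffer and records where they landed.
ToyHeader MakeHeader(std::string_view value, std::vector<std::string>& char_buffers) {
  ToyHeader h;
  h.size = static_cast<uint32_t>(value.size());
  if (h.IsInline()) {
    if (!value.empty()) std::memcpy(h.inlined, value.data(), value.size());
    return h;
  }
  if (char_buffers.empty()) char_buffers.emplace_back();
  h.buffer_index = static_cast<uint32_t>(char_buffers.size() - 1);
  h.buffer_offset = static_cast<uint32_t>(char_buffers.back().size());
  char_buffers.back().append(value);
  return h;
}

// Mirrors the two non-raw-pointer branches of BinaryViewArray::GetView.
// The result borrows from `h` (inline) or from `char_buffers` (out of line).
std::string_view GetView(const ToyHeader& h, const std::vector<std::string>& char_buffers) {
  if (h.IsInline()) return {h.inlined, h.size};
  return {char_buffers[h.buffer_index].data() + h.buffer_offset, h.size};
}
```

In this sketch a value such as "short" never touches the character buffers, while a longer value is resolved through its buffer index and offset.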
