Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-32723: [C++][Parquet] Add option to use LARGE* variants of binary types #35825

Closed
wants to merge 69 commits into from
Closed
Show file tree
Hide file tree
Changes from 11 commits
Commits
Show all changes
69 commits
Select commit Hold shift + click to select a range
e5e96ec
able to read the file
arthurpassos May 26, 2023
b9b48f8
remove diff out
arthurpassos May 26, 2023
ae62954
intermediate stage, not working properly anymore..
arthurpassos May 29, 2023
34917d5
still not working
arthurpassos May 29, 2023
835b07d
able to read the file again
arthurpassos May 29, 2023
50427c6
move use_binary_large_variants to arrowreaderproperties
arthurpassos May 30, 2023
df65ce7
cleanup a bit
arthurpassos May 30, 2023
e826b8e
back fromByteArray string & binary with setting
arthurpassos May 30, 2023
764ef98
some more adjustments
arthurpassos May 30, 2023
c6244ea
revert some stuff
arthurpassos May 30, 2023
5a4bbb0
revert some stuff
arthurpassos May 30, 2023
2d84e57
improvement
arthurpassos May 30, 2023
90f14df
remove dictionary64
arthurpassos May 30, 2023
b88b024
use 64bit on largebytearray class and initialize binary_large_variant…
arthurpassos May 31, 2023
0b53b05
add chunked string map test
arthurpassos May 31, 2023
f574e2e
add boolean comment
arthurpassos May 31, 2023
295e062
Make ChunkedRecordReader generic by using templates
arthurpassos May 31, 2023
25d7815
Make ByteArrayDictionaryReader generic with the use of templates
arthurpassos May 31, 2023
fe8d67b
make arrowbinaryhelper generic
arthurpassos May 31, 2023
35e5835
Make PlainByteArrayDecoder generic
arthurpassos May 31, 2023
9aff2f3
remove use_binary_large_variant from parquet reader properties
arthurpassos Jun 1, 2023
eb850c4
removed parquet::type::large_Byte_array
arthurpassos Jun 5, 2023
c2aab63
small adjustment
arthurpassos Jun 5, 2023
837ed6c
remove largebytearray class
arthurpassos Jun 6, 2023
35cdb99
simplify largebytearraytype a bit
arthurpassos Jun 6, 2023
a5000e1
simplify dictbytearraydecoderimpl a bit
arthurpassos Jun 6, 2023
eb71c17
remove one default argument
arthurpassos Jun 6, 2023
686a3f7
remove junk code
arthurpassos Jun 6, 2023
a61fc32
move use_binary_large_variant check inside frombytearray
arthurpassos Jun 6, 2023
e2600d0
simplify chunkedrecordreader a bit
arthurpassos Jun 6, 2023
3b86e23
simplify DictionaryRecordReaderImpl and fix DebugPrintState
arthurpassos Jun 6, 2023
cc027b7
simplify PlainByteArrayDecoderBase
arthurpassos Jun 6, 2023
177db7a
remove some todos
arthurpassos Jun 7, 2023
66223ee
Add comment explaining why struct LargeByteArrayType instead of alias
arthurpassos Jun 7, 2023
5cd39d8
address some pr comments
arthurpassos Jun 8, 2023
1089010
address a few more comments
arthurpassos Jun 8, 2023
a6c42ee
remove arrow-type include & move binarylimit trait
arthurpassos Jun 8, 2023
15be2a2
consolidate setdict
arthurpassos Jun 8, 2023
8d5ba3d
apply clangformat
arthurpassos Jun 8, 2023
fd8f979
removed todos
arthurpassos Jun 9, 2023
a5736d5
a bit more renaming
arthurpassos Jun 9, 2023
b4ecd0d
address one mor comment
arthurpassos Jun 9, 2023
9e9dff9
add overflow check in dict
arthurpassos Jun 9, 2023
ae1db20
address a few comments
arthurpassos Jun 12, 2023
09a9eaf
use int32_t explicitly
arthurpassos Jun 14, 2023
1664983
use template directly
arthurpassos Jun 14, 2023
322319e
use offset_type
arthurpassos Jun 15, 2023
1775a7a
address comments
arthurpassos Jun 15, 2023
7f6e2bf
address a few minor comments
arthurpassos Jun 16, 2023
75fb615
fix DictDecoderImpl
arthurpassos Jun 16, 2023
0801267
add non overflow test
arthurpassos Jun 16, 2023
7f09a16
string test
arthurpassos Jun 19, 2023
a8d20a4
address minor comments
arthurpassos Jun 20, 2023
5fcf4e1
use raw filereaderbuilder instead of adding a new openfile function
arthurpassos Jun 21, 2023
8901cbc
rename test
arthurpassos Jun 21, 2023
dff017a
update test file name
arthurpassos Jun 21, 2023
232e01f
update submodule?
arthurpassos Jun 21, 2023
d7d76c6
aply clang-format
arthurpassos Jun 21, 2023
90ceb07
address minor comments
arthurpassos Jun 22, 2023
0394963
delta & delta length for large*
arthurpassos Jun 22, 2023
a8df2e7
fix wrong if statements
arthurpassos Jun 22, 2023
2bb3b14
Template member variable as well
arthurpassos Jun 23, 2023
c114d44
add docstring
arthurpassos Jun 23, 2023
d1d5798
add LargeStringDictionary32Builder
arthurpassos Jun 23, 2023
0eaa60f
address a few comments
arthurpassos Jun 26, 2023
1e642fa
clang format
arthurpassos Jun 26, 2023
b299497
add binarypacked test for largebinaryvariant
arthurpassos Jun 26, 2023
2c23dd7
Revert "add binarypacked test for largebinaryvariant"
arthurpassos Jun 27, 2023
eca9d6f
only run largebinary tests if system is 64bit
arthurpassos Jul 6, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion cpp/examples/parquet/parquet_arrow/reader_writer.cc
Original file line number Diff line number Diff line change
Expand Up @@ -137,4 +137,4 @@ int main(int argc, char** argv) {
read_single_rowgroup();
read_single_column();
read_single_column_chunk();
}
}
arthurpassos marked this conversation as resolved.
Show resolved Hide resolved
24 changes: 24 additions & 0 deletions cpp/src/arrow/array/builder_dict.h
Original file line number Diff line number Diff line change
Expand Up @@ -715,6 +715,29 @@ class Dictionary32Builder : public internal::DictionaryBuilderBase<Int32Builder,
}
};

/// \brief A DictionaryArray builder that always returns int64 dictionary
/// indices so that data cast to dictionary form will have a consistent index
/// type, e.g. for creating a ChunkedArray
template <typename T>
class Dictionary64Builder : public internal::DictionaryBuilderBase<Int64Builder, T> {
public:
using BASE = internal::DictionaryBuilderBase<Int64Builder, T>;
using BASE::BASE;

/// \brief Append dictionary indices directly without modifying memo
///
/// NOTE: Experimental API
Status AppendIndices(const int64_t* values, int64_t length,
const uint8_t* valid_bytes = NULLPTR) {
int64_t null_count_before = this->indices_builder_.null_count();
ARROW_RETURN_NOT_OK(this->indices_builder_.AppendValues(values, length, valid_bytes));
this->capacity_ = this->indices_builder_.capacity();
this->length_ += length;
this->null_count_ += this->indices_builder_.null_count() - null_count_before;
return Status::OK();
}
};

// ----------------------------------------------------------------------
// Binary / Unicode builders
// (compatibility aliases; those used to be derived classes with additional
Expand All @@ -724,6 +747,7 @@ using BinaryDictionaryBuilder = DictionaryBuilder<BinaryType>;
using StringDictionaryBuilder = DictionaryBuilder<StringType>;
using BinaryDictionary32Builder = Dictionary32Builder<BinaryType>;
using StringDictionary32Builder = Dictionary32Builder<StringType>;
using BinaryDictionary64Builder = Dictionary64Builder<LargeBinaryType>;
arthurpassos marked this conversation as resolved.
Show resolved Hide resolved

/// @}

Expand Down
2 changes: 2 additions & 0 deletions cpp/src/arrow/type.h
Original file line number Diff line number Diff line change
Expand Up @@ -678,6 +678,8 @@ class ARROW_EXPORT BaseBinaryType : public DataType {

constexpr int64_t kBinaryMemoryLimit = std::numeric_limits<int32_t>::max() - 1;

constexpr int64_t kLargeBinaryMemoryLimit = std::numeric_limits<int64_t>::max() - 1;

/// \addtogroup binary-datatypes
///
/// @{
Expand Down
3 changes: 2 additions & 1 deletion cpp/src/parquet/arrow/reader.cc
Original file line number Diff line number Diff line change
Expand Up @@ -219,6 +219,7 @@ class FileReaderImpl : public FileReader {
ctx->iterator_factory = SomeRowGroupsFactory(row_groups);
ctx->filter_leaves = true;
ctx->included_leaves = included_leaves;
ctx->use_binary_large_variants = reader_properties_.use_binary_large_variants();
return GetReader(manifest_.schema_fields[i], ctx, out);
}

Expand Down Expand Up @@ -462,7 +463,7 @@ class LeafReader : public ColumnReaderImpl {
input_(std::move(input)),
descr_(input_->descr()) {
record_reader_ = RecordReader::Make(
arthurpassos marked this conversation as resolved.
Show resolved Hide resolved
descr_, leaf_info, ctx_->pool, field_->type()->id() == ::arrow::Type::DICTIONARY);
descr_, leaf_info, ctx_->pool, field_->type()->id() == ::arrow::Type::DICTIONARY, false, ctx_->use_binary_large_variants);
arthurpassos marked this conversation as resolved.
Show resolved Hide resolved
NextRowGroup();
}

Expand Down
1 change: 1 addition & 0 deletions cpp/src/parquet/arrow/reader_internal.cc
Original file line number Diff line number Diff line change
Expand Up @@ -85,6 +85,7 @@ using ::arrow::internal::SafeLeftShift;
using ::arrow::util::SafeLoadAs;

using parquet::internal::BinaryRecordReader;
using parquet::internal::LargeBinaryRecordReader;
using parquet::internal::DictionaryRecordReader;
using parquet::internal::RecordReader;
using parquet::schema::GroupNode;
Expand Down
1 change: 1 addition & 0 deletions cpp/src/parquet/arrow/reader_internal.h
Original file line number Diff line number Diff line change
Expand Up @@ -109,6 +109,7 @@ struct ReaderContext {
FileColumnIteratorFactory iterator_factory;
bool filter_leaves;
std::shared_ptr<std::unordered_set<int>> included_leaves;
bool use_binary_large_variants = false;
arthurpassos marked this conversation as resolved.
Show resolved Hide resolved

bool IncludesLeaf(int leaf_index) const {
if (this->filter_leaves) {
Expand Down
2 changes: 1 addition & 1 deletion cpp/src/parquet/arrow/schema.cc
Original file line number Diff line number Diff line change
Expand Up @@ -473,7 +473,7 @@ ::arrow::Result<std::shared_ptr<ArrowType>> GetTypeForNode(
SchemaTreeContext* ctx) {
ASSIGN_OR_RAISE(
std::shared_ptr<ArrowType> storage_type,
GetArrowType(primitive_node, ctx->properties.coerce_int96_timestamp_unit()));
GetArrowType(primitive_node, ctx->properties.coerce_int96_timestamp_unit(), ctx->properties.use_binary_large_variants()));
if (ctx->properties.read_dictionary(column_index) &&
IsDictionaryReadSupported(*storage_type)) {
return ::arrow::dictionary(::arrow::int32(), storage_type);
Expand Down
26 changes: 22 additions & 4 deletions cpp/src/parquet/arrow/schema_internal.cc
Original file line number Diff line number Diff line change
Expand Up @@ -127,6 +127,23 @@ Result<std::shared_ptr<ArrowType>> FromByteArray(const LogicalType& logical_type
}
}

Result<std::shared_ptr<ArrowType>> FromLargeByteArray(const LogicalType& logical_type) {
switch (logical_type.type()) {
case LogicalType::Type::STRING:
return ::arrow::large_utf8();
case LogicalType::Type::DECIMAL:
return MakeArrowDecimal(logical_type);
case LogicalType::Type::NONE:
case LogicalType::Type::ENUM:
case LogicalType::Type::JSON:
case LogicalType::Type::BSON:
return ::arrow::large_binary();
default:
return Status::NotImplemented("Unhandled logical logical_type ",
logical_type.ToString(), " for binary array");
}
}

Result<std::shared_ptr<ArrowType>> FromFLBA(const LogicalType& logical_type,
int32_t physical_length) {
switch (logical_type.type()) {
Expand Down Expand Up @@ -181,7 +198,7 @@ Result<std::shared_ptr<ArrowType>> FromInt64(const LogicalType& logical_type) {

Result<std::shared_ptr<ArrowType>> GetArrowType(
Type::type physical_type, const LogicalType& logical_type, int type_length,
const ::arrow::TimeUnit::type int96_arrow_time_unit) {
const ::arrow::TimeUnit::type int96_arrow_time_unit, bool use_binary_large_variant) {
if (logical_type.is_invalid() || logical_type.is_null()) {
return ::arrow::null();
}
Expand All @@ -200,7 +217,7 @@ Result<std::shared_ptr<ArrowType>> GetArrowType(
case ParquetType::DOUBLE:
return ::arrow::float64();
case ParquetType::BYTE_ARRAY:
return FromByteArray(logical_type);
return use_binary_large_variant ? FromLargeByteArray(logical_type) : FromByteArray(logical_type);
arthurpassos marked this conversation as resolved.
Show resolved Hide resolved
case ParquetType::FIXED_LEN_BYTE_ARRAY:
return FromFLBA(logical_type, type_length);
default: {
Expand All @@ -213,9 +230,10 @@ Result<std::shared_ptr<ArrowType>> GetArrowType(

Result<std::shared_ptr<ArrowType>> GetArrowType(
const schema::PrimitiveNode& primitive,
const ::arrow::TimeUnit::type int96_arrow_time_unit) {
const ::arrow::TimeUnit::type int96_arrow_time_unit,
bool use_binary_large_variant) {
return GetArrowType(primitive.physical_type(), *primitive.logical_type(),
primitive.type_length(), int96_arrow_time_unit);
primitive.type_length(), int96_arrow_time_unit, use_binary_large_variant);
}

} // namespace arrow
Expand Down
11 changes: 8 additions & 3 deletions cpp/src/parquet/arrow/schema_internal.h
Original file line number Diff line number Diff line change
Expand Up @@ -30,22 +30,27 @@ namespace arrow {
using ::arrow::Result;

Result<std::shared_ptr<::arrow::DataType>> FromByteArray(const LogicalType& logical_type);
Result<std::shared_ptr<::arrow::DataType>> FromLargeByteArray(const LogicalType& logical_type);

Result<std::shared_ptr<::arrow::DataType>> FromFLBA(const LogicalType& logical_type,
int32_t physical_length);
Result<std::shared_ptr<::arrow::DataType>> FromInt32(const LogicalType& logical_type);
Result<std::shared_ptr<::arrow::DataType>> FromInt64(const LogicalType& logical_type);

Result<std::shared_ptr<::arrow::DataType>> GetArrowType(Type::type physical_type,
const LogicalType& logical_type,
int type_length);
int type_length,
bool use_binary_large_variant = false);
arthurpassos marked this conversation as resolved.
Show resolved Hide resolved

Result<std::shared_ptr<::arrow::DataType>> GetArrowType(
Type::type physical_type, const LogicalType& logical_type, int type_length,
::arrow::TimeUnit::type int96_arrow_time_unit = ::arrow::TimeUnit::NANO);
::arrow::TimeUnit::type int96_arrow_time_unit = ::arrow::TimeUnit::NANO,
bool use_binary_large_variant = false);

Result<std::shared_ptr<::arrow::DataType>> GetArrowType(
const schema::PrimitiveNode& primitive,
::arrow::TimeUnit::type int96_arrow_time_unit = ::arrow::TimeUnit::NANO);
::arrow::TimeUnit::type int96_arrow_time_unit = ::arrow::TimeUnit::NANO,
bool use_binary_large_variant = false);

} // namespace arrow
} // namespace parquet
Loading