
[C++][Parquet] ByteArray Reader: Extend current DictReader to support building a LargeBinary #41104

Open

mapleFU opened this issue Apr 9, 2024 · 7 comments

mapleFU commented Apr 9, 2024

Describe the enhancement requested

A previous issue (#35825) showed that reading LargeBinary directly through the dictionary path is not supported.

When writing to Parquet, we don't allow a single ByteArray to exceed 2GB, so any individual binary value is smaller than 2GB.

The Parquet binary reader is split into two styles of API, shown below:

class BinaryRecordReader : virtual public RecordReader {
 public:
  virtual std::vector<std::shared_ptr<::arrow::Array>> GetBuilderChunks() = 0;
};

/// \brief Read records directly to dictionary-encoded Arrow form (int32
/// indices). Only valid for BYTE_ARRAY columns
class DictionaryRecordReader : virtual public RecordReader {
 public:
  virtual std::shared_ptr<::arrow::ChunkedArray> GetResult() = 0;
};

Neither of these APIs supports reading "LargeBinary" directly. However, the first API is able to split the data into multiple separate chunks: when a BinaryBuilder reaches 2GB, it rotates and switches to a new builder. The API below can then cast the resulting chunks to segments of LargeBinary:

Status TransferColumnData(RecordReader* reader, const std::shared_ptr<Field>& value_field,
                          const ColumnDescriptor* descr, MemoryPool* pool,
                          std::shared_ptr<ChunkedArray>* out) 
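
For illustration, a minimal sketch of that cast-then-concatenate step (the helper name is mine, and it uses the public compute::Cast / Concatenate functions rather than the actual TransferColumnData internals):

#include <arrow/api.h>
#include <arrow/array/concatenate.h>
#include <arrow/compute/api.h>

// Widen each <2GB Binary chunk to LargeBinary (int32 -> int64 offsets), then
// concatenate; with 64-bit offsets the result may hold more than 2GB.
arrow::Result<std::shared_ptr<arrow::Array>> ToSingleLargeBinary(
    const std::shared_ptr<arrow::ChunkedArray>& chunked) {
  ARROW_ASSIGN_OR_RAISE(
      auto casted,
      arrow::compute::Cast(arrow::Datum(chunked), arrow::large_binary()));
  return arrow::Concatenate(casted.chunked_array()->chunks());
}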

For the dictionary path, although the API returns a std::shared_ptr<::arrow::ChunkedArray>, only one dictionary builder is used internally. I think we can apply the same chunk-rotation approach there; a sketch follows the pros/cons below.

Pros: we can support reading more than 2GB of data into a dictionary column.
Cons: dictionary values might be repeated across the resulting chunks; the user may need to call "Concatenate" on the result.
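
For illustration, a minimal sketch of the rotation idea, mirroring what BinaryRecordReader does on the non-dictionary path. The class and member names are hypothetical, not existing Parquet internals, and it assumes a recent Arrow where the dictionary builder accepts std::string_view:

#include <arrow/api.h>

#include <limits>
#include <string_view>

// Hypothetical decoder-side accumulator: finish the current dictionary
// builder into a chunk before its accumulated value data could overflow the
// int32 offsets of the per-chunk dictionary.
class ChunkingDictDecoder {
 public:
  explicit ChunkingDictDecoder(arrow::MemoryPool* pool) : builder_(pool) {}

  arrow::Status Append(std::string_view value) {
    // Conservative: duplicate values don't actually grow the dictionary, so
    // this rotates a little earlier than strictly necessary.
    if (bytes_in_chunk_ + static_cast<int64_t>(value.size()) > kChunkLimit) {
      ARROW_RETURN_NOT_OK(RotateChunk());
    }
    bytes_in_chunk_ += static_cast<int64_t>(value.size());
    return builder_.Append(value);
  }

  arrow::Result<std::shared_ptr<arrow::ChunkedArray>> Finish() {
    ARROW_RETURN_NOT_OK(RotateChunk());
    return std::make_shared<arrow::ChunkedArray>(chunks_);
  }

 private:
  arrow::Status RotateChunk() {
    std::shared_ptr<arrow::Array> chunk;
    ARROW_RETURN_NOT_OK(builder_.Finish(&chunk));  // also resets the builder
    chunks_.push_back(std::move(chunk));
    bytes_in_chunk_ = 0;
    return arrow::Status::OK();
  }

  static constexpr int64_t kChunkLimit = std::numeric_limits<int32_t>::max();
  arrow::BinaryDictionary32Builder builder_;
  arrow::ArrayVector chunks_;
  int64_t bytes_in_chunk_ = 0;
};

Each finished chunk carries its own dictionary, which is exactly where the repetition mentioned in the cons comes from.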

Component(s)

C++, Parquet

mapleFU commented Apr 9, 2024

cc @pitrou @jorisvandenbossche @felipecrv

Also cc @jp0317 as a Dictionary Reader user

mapleFU commented Apr 9, 2024

A more complex case: for array<string>, the string / dictionary<string> chunks can each be cast to their "large" variants first and then concatenated into a single "large" dictionary array. This would cause a slight performance loss, but only when reading large binary data.
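
For illustration, a rough sketch of the per-chunk widening under the assumption that each chunk is a DictionaryArray (the helper name is hypothetical; unifying the per-chunk dictionaries afterwards is a separate step):

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Cast only the dictionary values from string (int32 offsets) to
// large_string, keeping the indices untouched.
arrow::Result<std::shared_ptr<arrow::Array>> WidenDictionaryChunk(
    const std::shared_ptr<arrow::Array>& chunk) {
  auto dict_array = std::static_pointer_cast<arrow::DictionaryArray>(chunk);
  ARROW_ASSIGN_OR_RAISE(
      auto large_values,
      arrow::compute::Cast(arrow::Datum(dict_array->dictionary()),
                           arrow::large_utf8()));
  auto out_type = arrow::dictionary(arrow::int32(), arrow::large_utf8());
  return arrow::DictionaryArray::FromArrays(out_type, dict_array->indices(),
                                            large_values.make_array());
}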

wgtmac commented Apr 9, 2024

I can't remember the exact problem in that PR, but the messy part was in the Parquet decoder, where it outputs to an Arrow array.

For the chunked-array solution, I think a challenge is how we deal with the higher-level APIs, like GetRecordBatchReader, where arrays must be concatenated into a single array.

mapleFU commented Apr 9, 2024

BinaryRecordReader already supports this; the user needs to pass a LargeBinary/LargeString type to read it.
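
For context, a minimal sketch of that path (helper name mine), assuming the file was written with a large_binary/large_utf8 Arrow schema: the reader reconstructs the Arrow schema stored in the file metadata, so the BinaryRecordReader path already yields large chunks:

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>

// Read a Parquet file back into an Arrow table; if the stored Arrow schema
// declares large_binary, the non-dictionary read path produces it directly.
arrow::Result<std::shared_ptr<arrow::Table>> ReadLargeBinary(
    const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
      input, arrow::default_memory_pool(), &reader));
  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
  return table;
}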

DictionaryRecordReader only has a single builder, which cannot store more than 2GB of data. And ListReader will concatenate the underlying ChunkedArray into a single Array, so I think we need to test that, but in theory we can read this.

However, this proposal may introduce higher peak memory usage and more memcpy than the previous patch (because multiple "large" buffers need to be concatenated together).

felipecrv commented

> However, this proposal may introduce higher peak memory usage and more memcpy than the previous patch (because multiple "large" buffers need to be concatenated together).

Better than not supporting the 64-bit-length strings.

mapleFU commented Apr 16, 2024

@pitrou What do you think of this? If it's OK, I'd like to add it later this month.

jonkeane commented

I'm doing a bit of cleanup before the release and noticed that this was still open. @pitrou do you have any thoughts on the above?
