
[C++][Parquet] ByteArray Reader: Extend current DictReader to support building a LargeBinary #41104

Open

mapleFU opened this issue Apr 9, 2024 · 7 comments

mapleFU commented Apr 9, 2024

Describe the enhancement requested

A previous issue (#35825) showed that reading LargeBinary directly through the dictionary path is not supported.

When writing to Parquet, we don't allow a single ByteArray to exceed 2GB, so any individual binary value is smaller than 2GB.

The Parquet binary reader is split into two styles of API, shown below:

class BinaryRecordReader : virtual public RecordReader {
 public:
  virtual std::vector<std::shared_ptr<::arrow::Array>> GetBuilderChunks() = 0;
};

/// \brief Read records directly to dictionary-encoded Arrow form (int32
/// indices). Only valid for BYTE_ARRAY columns
class DictionaryRecordReader : virtual public RecordReader {
 public:
  virtual std::shared_ptr<::arrow::ChunkedArray> GetResult() = 0;
};

Neither of these APIs supports reading "LargeBinary" directly. However, the first API is able to split the data into multiple separate chunks: when a BinaryBuilder reaches 2GB, it rotates and switches to a new builder. The API below can then cast the resulting chunks to segments of LargeBinary:

Status TransferColumnData(RecordReader* reader, const std::shared_ptr<Field>& value_field,
                          const ColumnDescriptor* descr, MemoryPool* pool,
                          std::shared_ptr<ChunkedArray>* out) 
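
For illustration, a minimal sketch of that cast-then-concatenate step (the helper name is mine, and it uses the public compute::Cast / Concatenate functions rather than the actual TransferColumnData internals):

#include <arrow/api.h>
#include <arrow/array/concatenate.h>
#include <arrow/compute/api.h>

// Widen each <2GB Binary chunk to LargeBinary (int32 -> int64 offsets), then
// concatenate; with 64-bit offsets the result may hold more than 2GB.
arrow::Result<std::shared_ptr<arrow::Array>> ToSingleLargeBinary(
    const std::shared_ptr<arrow::ChunkedArray>& chunked) {
  ARROW_ASSIGN_OR_RAISE(
      auto casted,
      arrow::compute::Cast(arrow::Datum(chunked), arrow::large_binary()));
  return arrow::Concatenate(casted.chunked_array()->chunks());
}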

For the dictionary path, although the API returns a std::shared_ptr<::arrow::ChunkedArray>, only one dictionary builder is used internally. I think we can apply the same chunk-rotation approach there; a sketch follows the pros/cons below.

Pros: we can support reading more than 2GB of data into a dictionary column.
Cons: dictionary values might be repeated across the resulting chunks; the user may need to call "Concatenate" on the result.
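
For illustration, a minimal sketch of the rotation idea, mirroring what BinaryRecordReader does on the non-dictionary path. The class and member names are hypothetical, not existing Parquet internals, and it assumes a recent Arrow where the dictionary builder accepts std::string_view:

#include <arrow/api.h>

#include <limits>
#include <string_view>

// Hypothetical decoder-side accumulator: finish the current dictionary
// builder into a chunk before its accumulated value data could overflow the
// int32 offsets of the per-chunk dictionary.
class ChunkingDictDecoder {
 public:
  explicit ChunkingDictDecoder(arrow::MemoryPool* pool) : builder_(pool) {}

  arrow::Status Append(std::string_view value) {
    // Conservative: duplicate values don't actually grow the dictionary, so
    // this rotates a little earlier than strictly necessary.
    if (bytes_in_chunk_ + static_cast<int64_t>(value.size()) > kChunkLimit) {
      ARROW_RETURN_NOT_OK(RotateChunk());
    }
    bytes_in_chunk_ += static_cast<int64_t>(value.size());
    return builder_.Append(value);
  }

  arrow::Result<std::shared_ptr<arrow::ChunkedArray>> Finish() {
    ARROW_RETURN_NOT_OK(RotateChunk());
    return std::make_shared<arrow::ChunkedArray>(chunks_);
  }

 private:
  arrow::Status RotateChunk() {
    std::shared_ptr<arrow::Array> chunk;
    ARROW_RETURN_NOT_OK(builder_.Finish(&chunk));  // also resets the builder
    chunks_.push_back(std::move(chunk));
    bytes_in_chunk_ = 0;
    return arrow::Status::OK();
  }

  static constexpr int64_t kChunkLimit = std::numeric_limits<int32_t>::max();
  arrow::BinaryDictionary32Builder builder_;
  arrow::ArrayVector chunks_;
  int64_t bytes_in_chunk_ = 0;
};

Each finished chunk carries its own dictionary, which is exactly where the repetition mentioned in the cons comes from.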

Component(s)

C++, Parquet

mapleFU commented Apr 9, 2024

cc @pitrou @jorisvandenbossche @felipecrv

Also cc @jp0317 as a Dictionary Reader user

mapleFU commented Apr 9, 2024

A more complex case: for array<string>, the string / dictionary<string> chunks can each be cast to their "large" variants first and then concatenated into a single "large" dictionary array. This would cause a slight performance loss, but only when reading large binary data.
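
For illustration, a rough sketch of the per-chunk widening under the assumption that each chunk is a DictionaryArray (the helper name is hypothetical; unifying the per-chunk dictionaries afterwards is a separate step):

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Cast only the dictionary values from string (int32 offsets) to
// large_string, keeping the indices untouched.
arrow::Result<std::shared_ptr<arrow::Array>> WidenDictionaryChunk(
    const std::shared_ptr<arrow::Array>& chunk) {
  auto dict_array = std::static_pointer_cast<arrow::DictionaryArray>(chunk);
  ARROW_ASSIGN_OR_RAISE(
      auto large_values,
      arrow::compute::Cast(arrow::Datum(dict_array->dictionary()),
                           arrow::large_utf8()));
  auto out_type = arrow::dictionary(arrow::int32(), arrow::large_utf8());
  return arrow::DictionaryArray::FromArrays(out_type, dict_array->indices(),
                                            large_values.make_array());
}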

wgtmac commented Apr 9, 2024

I can't remember the exact problem in that PR, but the messy part was in the Parquet decoder, where it outputs to an Arrow array.

For the chunked-array solution, I think a challenge is how we deal with the higher-level APIs, like GetRecordBatchReader, where arrays must be concatenated into a single array.

mapleFU commented Apr 9, 2024

BinaryRecordReader already supports this; the user needs to pass a LargeBinary/LargeString type to read it.
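
For context, a minimal sketch of that path (helper name mine), assuming the file was written with a large_binary/large_utf8 Arrow schema: the reader reconstructs the Arrow schema stored in the file metadata, so the BinaryRecordReader path already yields large chunks:

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>

// Read a Parquet file back into an Arrow table; if the stored Arrow schema
// declares large_binary, the non-dictionary read path produces it directly.
arrow::Result<std::shared_ptr<arrow::Table>> ReadLargeBinary(
    const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
      input, arrow::default_memory_pool(), &reader));
  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
  return table;
}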

DictionaryRecordReader only has a single builder, which cannot store more than 2GB of data. And ListReader will concatenate the underlying ChunkedArray into a single Array, so I think we need to test that, but in theory we can read this.

However, this proposal may introduce higher peak memory usage and more memcpy than the previous patch (because multiple "large" buffers need to be concatenated together).

felipecrv commented

> However, this proposal may introduce higher peak memory usage and more memcpy than the previous patch (because multiple "large" buffers need to be concatenated together).

Better than not supporting the 64-bit-length strings.

mapleFU commented Apr 16, 2024

@pitrou What do you think of this? If it's OK, I'd like to add it later this month.

jonkeane commented

I'm doing a bit of cleanup before the release and noticed that this was still open. @pitrou do you have any thoughts on the above?
