[C++][Parquet] ByteArray Reader: Extend current DictReader to support building a LargeBinary #41104
Comments
cc @pitrou @jorisvandenbossche @felipecrv. Also cc @jp0317 as a Dictionary Reader user.
More complexly, for …
I can't remember the exact problem in that PR, but the messy part was in the Parquet decoder where it outputs to an Arrow array. For the chunked-array solution, I think a challenge is how we deal with the higher-level API, like …
However, this proposal may introduce higher peak memory usage and more memcpy than the previous patch (because multiple "large" buffers need to be concatenated together).
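(For context, a minimal sketch of the concat step being discussed, assuming the reader hands back `large_binary` chunks. `arrow::Concatenate` allocates one contiguous output and copies every input buffer into it, which is where the extra memcpy and peak memory come from.)

```cpp
#include <arrow/api.h>
#include <arrow/array/concatenate.h>

// Hypothetical helper, not part of the proposal: merge the per-chunk
// results into one contiguous array. Peak memory is roughly the sum of
// the inputs plus the output, since all inputs stay alive during the copy.
arrow::Result<std::shared_ptr<arrow::Array>> ConcatLargeChunks(
    const arrow::ArrayVector& large_binary_chunks) {
  // Works past 2GB only because large_binary uses int64 offsets;
  // concatenating plain binary chunks would overflow int32 offsets.
  return arrow::Concatenate(large_binary_chunks, arrow::default_memory_pool());
}
```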
Better than not supporting 64-bit-length strings.
@pitrou What do you think of this? If it's OK, I'd like to add this later this month.
I'm doing a bit of cleanup before the release and noticed that this was here. @pitrou do you have any thoughts on the above?
Describe the enhancement requested
Previously, an issue (#35825) showed that directly reading LargeBinary through the dictionary reader is not supported.
When writing to Parquet, a single ByteArray is not allowed to exceed 2GB, so any individual binary value is smaller than 2GB.
The Parquet binary reader is split into two styles of API, shown below.
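Paraphrasing from `parquet/column_reader.h` in the Arrow C++ tree (not a verbatim quote; `RecordReader` is the base class declared in the same header), the two styles look roughly like this:

```cpp
// Style 1: accumulates BYTE_ARRAY values into plain binary chunks. The
// underlying BinaryBuilder rotates to a fresh chunk before its int32
// offsets would overflow at 2GB.
class BinaryRecordReader : virtual public RecordReader {
 public:
  virtual std::vector<std::shared_ptr<::arrow::Array>> GetBuilderChunks() = 0;
};

// Style 2: reads BYTE_ARRAY values directly into dictionary-encoded form;
// a single dictionary builder feeds every chunk of the returned result.
class DictionaryRecordReader : virtual public RecordReader {
 public:
  virtual std::shared_ptr<::arrow::ChunkedArray> GetResult() = 0;
};
```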
Neither of these APIs supports reading "LargeBinary" directly. However, the first API can split the data into multiple separate chunks: when a `BinaryBuilder` reaches 2GB, it rotates and switches to a new binary array, and the caller can then cast the resulting chunks to segments of LargeBinary (see the sketch after the pros/cons). For the `Dictionary` style, the API already returns a `std::shared_ptr<::arrow::ChunkedArray>`, but only one dictionary builder is used internally. I think we can apply the same rotation approach to it.

Pros: we can read more than 2GB of data into a dictionary column.
Cons: dictionary values might be repeated across the different dictionary chunks. Maybe the user should call "Concat" on the result.
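A minimal sketch of what the caller-side cast could look like, assuming the reader hands back 2GB-bounded binary chunks (the helper name and wrapper are hypothetical, not part of the proposal):

```cpp
#include <arrow/api.h>
#include <arrow/compute/api.h>

// Hypothetical helper: cast each (<2GB) binary chunk to large_binary so
// the pieces can later be concatenated past the int32 offset limit.
arrow::Result<std::shared_ptr<arrow::ChunkedArray>> ToLargeBinary(
    const std::shared_ptr<arrow::ChunkedArray>& chunks) {
  arrow::ArrayVector large_chunks;
  large_chunks.reserve(chunks->num_chunks());
  for (const auto& chunk : chunks->chunks()) {
    // Widens the offsets from int32 to int64; the value buffer is unchanged.
    ARROW_ASSIGN_OR_RAISE(
        std::shared_ptr<arrow::Array> casted,
        arrow::compute::Cast(*chunk, arrow::large_binary()));
    large_chunks.push_back(std::move(casted));
  }
  return std::make_shared<arrow::ChunkedArray>(std::move(large_chunks),
                                               arrow::large_binary());
}
```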
Component(s)
C++, Parquet