[Go] How to handle large lists? #33875
Comments
@minyoung You're absolutely correct that the solution here would be to leverage `LargeListOf`/`LargeString`. I'll add this to my list of stuff to do, but if you want to give it a shot to implement, just tag me to review the PR if you file one!
@zeroshade I had a go at implementing support for LargeString/LargeBinary: #33965. A potential gotcha I stumbled on, though, is that while we can now write out `LargeString`/`LargeBinary` columns, reading the file back doesn't give those types back by default. Presumably writing with the Arrow schema stored in the file metadata would let the reader restore them. Aside: I found it humorous that the C++ side of things also errors out here.
Updated the PR to now read into `LargeString`/`LargeBinary` when the Arrow schema stored in the file indicates those types. If the schema is not stored, then we'll read into the regular `String`/`Binary` types.
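For illustration, here is a minimal writer-side sketch of storing the Arrow schema in the file metadata so that the large types can round-trip on read. This assumes the v12 Go module path; `pqarrow.WithStoreSchema()` is the option that embeds the serialized schema, and the column name and file path are placeholders.

```go
package main

import (
	"os"

	"github.com/apache/arrow/go/v12/arrow"
	"github.com/apache/arrow/go/v12/parquet"
	"github.com/apache/arrow/go/v12/parquet/pqarrow"
)

func main() {
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "blob", Type: arrow.BinaryTypes.LargeString},
	}, nil)

	f, err := os.Create("large.parquet")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// WithStoreSchema serializes the Arrow schema into the parquet file
	// metadata; with it present, a later read can resolve this column back to
	// LargeString instead of falling back to the plain String type.
	arrProps := pqarrow.NewArrowWriterProperties(pqarrow.WithStoreSchema())

	w, err := pqarrow.NewFileWriter(schema, f, parquet.NewWriterProperties(), arrProps)
	if err != nil {
		panic(err)
	}
	defer w.Close()
	// ... build records with a LargeString builder and w.Write(...) as usual ...
}
```

Without that option, the reader only sees the physical parquet types and falls back to the regular String/Binary Arrow types.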
I think it's fine for now to have it store the schema and use that properly. If you feel like being fancy and want to automatically use the large types based on the column sizes, feel free to add that to the PR, but I don't think it's necessary for a first pass here. Thanks for the work!
### Rationale for this change

Handle writing `array.LargeString` and `array.LargeBinary` data types. This allows parquet files to contain more than 2G worth of binary data in a single column chunk.

### Are these changes tested?

Unit tests included.

### Are there any user-facing changes?

* Closes: #33875

Authored-by: Min-Young Wu <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
### Describe the usage question you have. Please include as many useful details as possible.
I'm encountering a panic when trying to generate a parquet file with a very large list of (nested) strings in it.
I can recreate the error with code along the lines of the sketch below.
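This is a minimal sketch of the kind of writer code that hits the problem, not the original reproduction: it assumes the v12 Go module path and the `pqarrow` file writer, and the column name, value size, and batch counts are illustrative. The point is simply to buffer more than 2 GiB of string data into a single row group.

```go
package main

import (
	"os"
	"strings"

	"github.com/apache/arrow/go/v12/arrow"
	"github.com/apache/arrow/go/v12/arrow/array"
	"github.com/apache/arrow/go/v12/arrow/memory"
	"github.com/apache/arrow/go/v12/parquet"
	"github.com/apache/arrow/go/v12/parquet/pqarrow"
)

func main() {
	// One list-of-string column, mirroring the "large list of (nested) strings"
	// described above.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "strs", Type: arrow.ListOf(arrow.BinaryTypes.String)},
	}, nil)

	f, err := os.Create("big.parquet")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	w, err := pqarrow.NewFileWriter(schema, f,
		parquet.NewWriterProperties(), pqarrow.DefaultWriterProps())
	if err != nil {
		panic(err)
	}
	defer w.Close()

	val := strings.Repeat("x", 1<<20) // 1 MiB per string value

	// Buffer everything into a single row group; once the column chunk holds
	// more than ~2 GiB of string bytes, the int32 offsets overflow.
	for batch := 0; batch < 10; batch++ {
		bldr := array.NewRecordBuilder(memory.DefaultAllocator, schema)
		lb := bldr.Field(0).(*array.ListBuilder)
		vb := lb.ValueBuilder().(*array.StringBuilder)
		for row := 0; row < 256; row++ { // ~256 MiB per record batch
			lb.Append(true)
			vb.Append(val)
		}
		rec := bldr.NewRecord()
		if err := w.WriteBuffered(rec); err != nil {
			panic(err)
		}
		rec.Release()
		bldr.Release()
	}
}
```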
From skimming through the code and some breakpointing, it seems like the int32 offsets used have overflowed, hence the error with a number that's around -2^31. It seems like maybe I should use `LargeListOf` or maybe `LargeString` in the schema, but when attempting to do so I get a "not implemented yet" error. How much work would it be to implement `LargeListOf`/`LargeString`? And if that were done, would it resolve the panic? Writing out smaller row groups does work (presumably because the buffers and offsets are reset), but we would like to keep the number of row groups small: we only access a very small number of columns at a time, and our datasets can get very, very wide.
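For reference, declaring the schema with the 64-bit-offset types would look roughly like the sketch below (assuming the v12 Go module path); whether the parquet writer accepts these types depends on having the support added in #33965.

```go
package main

import (
	"fmt"

	"github.com/apache/arrow/go/v12/arrow"
)

func main() {
	// The Large* types use int64 offsets, so a single array (and a single
	// parquet column chunk built from it) can hold more than 2 GiB of data.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "strs", Type: arrow.LargeListOf(arrow.BinaryTypes.LargeString)},
	}, nil)
	fmt.Println(schema)
}
```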
### Component(s)
Go