feat(parquet/pqarrow): Add ForceLarge option #197
Conversation
Force-pushed from 7f05b5a to 5ea2f7b
parquet/pqarrow/properties.go
Outdated
// for string and binary columns respectively, instead of the default variants. This
// can be necessary if you know that there are columns which contain more than 2GB of
// data, which would prevent use of int32 offsets.
ForceLarge bool
FYI there was an attempt in parquet-cpp but the solution might not be expected: apache/arrow#35825
I agree with the original poster that it's rather weird that we can generate Parquet files that we can't otherwise read. However, I also agree with Antoine that it might be good to make this option per-column if possible.
Well, if you don't use this option, you can still read the parquet file; it would just require manually shrinking the batch size. I can definitely change this to make it a per-column option. That's fine, albeit a larger change, since we don't currently expose which column we're processing to the function that determines the Arrow type.
Alternately, we could use the column metadata from the row groups and decide ahead of time to switch to the Large variant for a column if the metadata says it is large enough to warrant it, but that would get really complex with row groups that may or may not be large enough to require it, etc.
The other alternative would be to forcibly reduce the batch size when reading to accommodate the int32 offset limit?
Thoughts?
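For concreteness, a rough sketch of what a per-column knob could look like (all names here are hypothetical, not actual library API; the import path assumes the arrow-go v18 module layout):

```go
package pqarrowsketch

import "github.com/apache/arrow-go/v18/arrow"

// Hypothetical shape for a per-column option: a callback on the read
// properties that reports whether a given column should use the Large
// variants. Neither ReadProps nor ForceLargeFor exist in the library.
type ReadProps struct {
	BatchSize int64
	// ForceLargeFor receives the dotted column path and returns true if
	// that column should be mapped to LargeString/LargeBinary.
	ForceLargeFor func(colPath string) bool
}

// typeForColumn shows how schema conversion could consult the callback
// when picking the Arrow type for a string or binary column.
func typeForColumn(props ReadProps, colPath string, isBinary bool) arrow.DataType {
	large := props.ForceLargeFor != nil && props.ForceLargeFor(colPath)
	switch {
	case isBinary && large:
		return arrow.BinaryTypes.LargeBinary
	case isBinary:
		return arrow.BinaryTypes.Binary
	case large:
		return arrow.BinaryTypes.LargeString
	default:
		return arrow.BinaryTypes.String
	}
}
```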
Doing it automatically would be surprising to users IMO. It would also potentially produce inconsistent schemas when reading multiple files.
Reducing the batch size may make sense; alternatively, an option to use StringView?
I also agree that automatically changing the type could be confusing. Regardless of the approach used to convert types, automatically reducing the batch size instead of exceeding the max offset of a variable-width type would be very nice IMO.
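To make the batch-size idea concrete, here is a purely illustrative sketch (not actual library code) of clamping the row count so accumulated byte offsets stay within int32 range:

```go
package main

import (
	"fmt"
	"math"
)

// shrinkBatchSize is a hypothetical helper: clamp the requested batch size
// so that, given an estimated average value length for a column, the
// accumulated byte offsets stay within int32 range.
func shrinkBatchSize(requested, avgValueLen int64) int64 {
	if avgValueLen <= 0 {
		return requested
	}
	maxRows := int64(math.MaxInt32) / avgValueLen
	if maxRows < 1 {
		maxRows = 1
	}
	if requested > maxRows {
		return maxRows
	}
	return requested
}

func main() {
	// e.g. a column averaging 1 MiB per value: at most ~2047 rows per batch.
	fmt.Println(shrinkBatchSize(65536, 1<<20))
}
```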
The logic to do the auto-splitting for batches is going to be pretty complex, so I'll do it as a separate change. For now, adding the ForceLarge option to specify per-column usage of the Large variants is sufficient to fix the reported issue. I'll create an issue to track adding logic for automatically shrinking batches if the size of the data is too large for the int32 offset types.
Rationale for this change
closes #195
For parquet files that contain more than 2GB of data in a column, we should allow a user to force using the LargeString/LargeBinary variants without requiring a stored schema.
What changes are included in this PR?
Adds a ForceLarge option to pqarrow.ArrowReadProperties and enables it to force usage of the LargeString and LargeBinary data types.
Are these changes tested?
Yes, a unit test is added.
Are there any user-facing changes?
No breaking changes, only the addition of a new option.
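For reference, a minimal usage sketch of the option as proposed in this revision (the file path and batch size are placeholders, import paths assume the arrow-go v18 module layout, and the final shape of the option may change in later iterations):

```go
package main

import (
	"context"
	"fmt"

	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/apache/arrow-go/v18/parquet/file"
	"github.com/apache/arrow-go/v18/parquet/pqarrow"
)

func main() {
	// Open a parquet file known to contain >2GB of string/binary data
	// in a single column ("big.parquet" is a placeholder path).
	rdr, err := file.OpenParquetFile("big.parquet", false)
	if err != nil {
		panic(err)
	}
	defer rdr.Close()

	// ForceLarge (the field added in this PR revision) maps string/binary
	// columns to LargeString/LargeBinary so offsets use int64 instead of int32.
	props := pqarrow.ArrowReadProperties{
		BatchSize:  64 * 1024,
		ForceLarge: true,
	}

	fr, err := pqarrow.NewFileReader(rdr, props, memory.DefaultAllocator)
	if err != nil {
		panic(err)
	}

	tbl, err := fr.ReadTable(context.Background())
	if err != nil {
		panic(err)
	}
	defer tbl.Release()

	// String/binary columns should now appear as large_utf8/large_binary.
	fmt.Println(tbl.Schema())
}
```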