
feat(parquet/pqarrow): Add ForceLarge option #197

Merged: 2 commits into apache:main on Dec 9, 2024

Conversation

zeroshade (Member)

Rationale for this change

closes #195

For parquet files that contain more than 2GB of data in a column, we should allow a user to force using the LargeString/LargeBinary variants without requiring a stored schema.

What changes are included in this PR?

Adds a ForceLarge option to pqarrow.ArrowReadProperties that can be enabled to force use of the LargeString and LargeBinary data types.

Are these changes tested?

Yes, a unit test is added.

Are there any user-facing changes?

No breaking changes, only the addition of a new option.
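
For context, a minimal sketch of how the new option might be used on the read side, assuming ForceLarge stays a plain boolean field on pqarrow.ArrowReadProperties as shown in the review diff below (the file name is hypothetical, and the merged, per-column form of the option may differ):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/apache/arrow-go/v18/parquet/file"
	"github.com/apache/arrow-go/v18/parquet/pqarrow"
)

func main() {
	// "huge_strings.parquet" is a hypothetical file with >2GB of string data in a column.
	rdr, err := file.OpenParquetFile("huge_strings.parquet", false)
	if err != nil {
		log.Fatal(err)
	}
	defer rdr.Close()

	// Request the Large variants up front so the reader produces LargeString/LargeBinary
	// columns (int64 offsets) instead of the default int32-offset types.
	props := pqarrow.ArrowReadProperties{ForceLarge: true}

	arrowRdr, err := pqarrow.NewFileReader(rdr, props, memory.DefaultAllocator)
	if err != nil {
		log.Fatal(err)
	}

	tbl, err := arrowRdr.ReadTable(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer tbl.Release()

	fmt.Println(tbl.Schema()) // string columns report large_utf8 / large_binary
}
```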

@zeroshade force-pushed the add-force-large-parquet branch from 7f05b5a to 5ea2f7b on November 22, 2024 at 21:48
// for string and binary columns respectively, instead of the default variants. This
// can be necessary if you know that there are columns which contain more than 2GB of
// data, which would prevent use of int32 offsets.
ForceLarge bool
Member:
FYI, there was an attempt at this in parquet-cpp, but the solution might not be what you'd expect: apache/arrow#35825

Member:

I agree with the original poster that it's rather weird that we can generate Parquet files that we can't otherwise read. However I also agree with Antoine that it might be good to make this option per-column if possible.

Member Author (zeroshade):

Well, if you don't use this option, you can still read the parquet file; it would just require manually shrinking the batch size. I can definitely change this to make it a per-column option. That's fine, albeit a larger change, since we don't currently expose which column we're determining the type for to the function that resolves the Arrow type.

Alternatively, we could use the column metadata for the row groups and decide ahead of time to switch to the Large variant for a column if the metadata says it is large enough to warrant it, but that would get really complex with row groups that may or may not be large enough to require it, etc.

The other alternative would be to forcibly reduce the batch size when reading to accommodate?

Thoughts?
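
For reference, the "manually shrinking the batch size" workaround mentioned above might look roughly like the sketch below. It relies only on the existing BatchSize read property; the rdr variable and the 64k-row figure are illustrative, not from this PR:

```go
// Sketch: avoid int32 offset overflow by capping how many rows land in each
// record batch, so no single batch accumulates >2GB of string data in one column.
props := pqarrow.ArrowReadProperties{
	BatchSize: 64 * 1024, // rows per batch; tune to the average value size of your data
}

arrowRdr, err := pqarrow.NewFileReader(rdr, props, memory.DefaultAllocator)
if err != nil {
	log.Fatal(err)
}

recRdr, err := arrowRdr.GetRecordReader(context.Background(), nil, nil)
if err != nil {
	log.Fatal(err)
}
defer recRdr.Release()

for recRdr.Next() {
	rec := recRdr.Record()
	_ = rec // process each (small) batch before advancing to the next one
}
```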

Member:

Doing it automatically would be surprising to users, IMO. It would also potentially produce inconsistent schemas when reading multiple files.

Reducing the batch size may make sense; alternatively, an option to use StringView?

Member:

I also agree that automatically changing the type could be confusing. Regardless of the approach used to convert types, automatically reducing the batch size rather than exceeding the maximum offset of a variable-width type would be very nice, IMO.

@zeroshade (Member Author):

The logic to do the automatic splitting of batches is going to be pretty complex, so I'll do it as a separate change. For now, adding the ForceLarge option to specify use of the Large variants per column is sufficient to fix the reported issue. I'll create an issue to track adding logic for automatically shrinking batches when the size of the data is too large for the int32 offset types.

@zeroshade merged commit 370dc98 into apache:main on Dec 9, 2024
24 checks passed
@zeroshade deleted the add-force-large-parquet branch on December 9, 2024 at 19:20
@zeroshade restored the add-force-large-parquet branch on December 9, 2024 at 19:20
@zeroshade deleted the add-force-large-parquet branch on December 9, 2024 at 19:20
Successfully merging this pull request may close these issues.

array.Binary and array.String should use int64 offsets