
feat(parquet/pqarrow): Add ForceLarge option #197

Merged: 2 commits into apache:main on Dec 9, 2024

Conversation

zeroshade (Member)

Rationale for this change

closes #195

For parquet files that contain more than 2GB of data in a column, we should allow a user to force using the LargeString/LargeBinary variants without requiring a stored schema.

What changes are included in this PR?

Adds a ForceLarge option to pqarrow.ArrowReadProperties that can be enabled to force use of the LargeString and LargeBinary data types.

Are these changes tested?

Yes, a unit test is added.

Are there any user-facing changes?

No breaking changes, only the addition of a new option.
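
For context, a minimal sketch of how the new option might be used on the read side, assuming ForceLarge stays a plain boolean field on pqarrow.ArrowReadProperties as shown in the review diff below (the file name is hypothetical, and the merged, per-column form of the option may differ):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/apache/arrow-go/v18/parquet/file"
	"github.com/apache/arrow-go/v18/parquet/pqarrow"
)

func main() {
	// "huge_strings.parquet" is a hypothetical file with >2GB of string data in a column.
	rdr, err := file.OpenParquetFile("huge_strings.parquet", false)
	if err != nil {
		log.Fatal(err)
	}
	defer rdr.Close()

	// Request the Large variants up front so the reader produces LargeString/LargeBinary
	// columns (int64 offsets) instead of the default int32-offset types.
	props := pqarrow.ArrowReadProperties{ForceLarge: true}

	arrowRdr, err := pqarrow.NewFileReader(rdr, props, memory.DefaultAllocator)
	if err != nil {
		log.Fatal(err)
	}

	tbl, err := arrowRdr.ReadTable(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer tbl.Release()

	fmt.Println(tbl.Schema()) // string columns report large_utf8 / large_binary
}
```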

@zeroshade force-pushed the add-force-large-parquet branch from 7f05b5a to 5ea2f7b on November 22, 2024 at 21:48
// for string and binary columns respectively, instead of the default variants. This
// can be necessary if you know that there are columns which contain more than 2GB of
// data, which would prevent use of int32 offsets.
ForceLarge bool
Member:
FYI, there was an attempt at this in parquet-cpp, but the solution might not be what you'd expect: apache/arrow#35825

Member:

I agree with the original poster that it's rather weird that we can generate Parquet files that we can't otherwise read. However I also agree with Antoine that it might be good to make this option per-column if possible.

Member Author (zeroshade):

Well, if you don't use this option, you can still read the parquet file; it would just require manually shrinking the batch size. I can definitely change this to make it a per-column option. That's fine, albeit a larger change, since we don't currently expose which column we're determining the type for to the function that resolves the Arrow type.

Alternatively, we could use the column metadata for the row groups and decide ahead of time to switch to the Large variant for a column if the metadata says it is large enough to warrant it, but that would get really complex with row groups that may or may not be large enough to require it, etc.

The other alternative would be to forcibly reduce the batch size when reading to accommodate?

Thoughts?
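
For reference, the "manually shrinking the batch size" workaround mentioned above might look roughly like the sketch below. It relies only on the existing BatchSize read property; the rdr variable and the 64k-row figure are illustrative, not from this PR:

```go
// Sketch: avoid int32 offset overflow by capping how many rows land in each
// record batch, so no single batch accumulates >2GB of string data in one column.
props := pqarrow.ArrowReadProperties{
	BatchSize: 64 * 1024, // rows per batch; tune to the average value size of your data
}

arrowRdr, err := pqarrow.NewFileReader(rdr, props, memory.DefaultAllocator)
if err != nil {
	log.Fatal(err)
}

recRdr, err := arrowRdr.GetRecordReader(context.Background(), nil, nil)
if err != nil {
	log.Fatal(err)
}
defer recRdr.Release()

for recRdr.Next() {
	rec := recRdr.Record()
	_ = rec // process each (small) batch before advancing to the next one
}
```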

Member:

Doing it automatically would be surprising to users, IMO. It would also potentially produce inconsistent schemas when reading multiple files.

Reducing the batch size may make sense; alternatively, an option to use StringView?

Member:

I also agree that automatically changing the type could be confusing. Regardless of the approach used to convert types, automatically reducing the batch size rather than exceeding the maximum offset of a variable-width type would be very nice, IMO.

@zeroshade (Member Author):

The logic to do the automatic splitting of batches is going to be pretty complex, so I'll do it as a separate change. For now, adding the ForceLarge option to specify use of the Large variants per column is sufficient to fix the reported issue. I'll create an issue to track adding logic for automatically shrinking batches when the size of the data is too large for the int32 offset types.

@zeroshade merged commit 370dc98 into apache:main on Dec 9, 2024
24 checks passed
@zeroshade deleted the add-force-large-parquet branch on December 9, 2024 at 19:20
@zeroshade restored the add-force-large-parquet branch on December 9, 2024 at 19:20
@zeroshade deleted the add-force-large-parquet branch on December 9, 2024 at 19:20
Successfully merging this pull request may close these issues.

array.Binary and array.String should use int64 offsets