feat: Support Parquet writer options #1123
base: main
Conversation
Overall, I really like the idea. Right now this does include a breaking change to a very popular user-facing function. I think if we follow the suggestion below to allow two function signatures, we'll be able to include this in the next release.
data_pagesize_limit: int = 1024 * 1024,
write_batch_size: int = 1024,
writer_version: str = "1.0",
skip_arrow_metadata: bool = False,
compression: Optional[str] = "zstd(3)",
dictionary_enabled: Optional[bool] = True,
dictionary_page_size_limit: int = 1024 * 1024,
statistics_enabled: Optional[str] = "page",
max_row_group_size: int = 1024 * 1024,
created_by: str = "datafusion-python",
column_index_truncate_length: Optional[int] = 64,
statistics_truncate_length: Optional[int] = None,
data_page_row_count_limit: int = 20_000,
encoding: Optional[str] = None,
bloom_filter_on_write: bool = False,
bloom_filter_fpp: Optional[float] = None,
bloom_filter_ndv: Optional[int] = None,
allow_single_file_parallelism: bool = True,
maximum_parallel_row_group_writers: int = 1,
maximum_buffered_record_batches_per_stream: int = 2,
column_specific_options: Optional[dict[str, ParquetColumnOptions]] = None,
I think the full parquet writer options are going to be less commonly used for most users. I would suggest we have two different versions of the write_parquet signature: one that keeps the simple existing compression parameters, and one that looks like

def write_parquet(self, path: str | pathlib.Path, options: ParquetWriterOptions)

And then we have a ParquetWriterOptions class that holds this large set of initialization options.
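A minimal sketch of that shape, with illustrative names and only a few of the options from the diff shown (not the final API):

from __future__ import annotations  # keeps str | pathlib.Path legal on Python 3.9

import pathlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParquetWriterOptions:
    # The remaining writer options from the diff would follow the same pattern.
    data_pagesize_limit: int = 1024 * 1024
    compression: Optional[str] = "zstd(3)"
    max_row_group_size: int = 1024 * 1024
    bloom_filter_on_write: bool = False


class DataFrame:
    def write_parquet(self, path: str | pathlib.Path, options: ParquetWriterOptions) -> None:
        # The simple compression-only signature would be kept alongside this one.
        ...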
for data_type in data_types:
    match data_type:
        case "int":
            data["int"] = [1, 2, 3]
        case "float":
            data["float"] = [1.01, 2.02, 3.03]
        case "str":
            data["str"] = ["a", "b", "c"]
        case "bool":
            data["bool"] = [True, False, True]
This code failed for me because our minimum supported Python is 3.9, and match was introduced in 3.10. Python 3.9 doesn't reach end of life until October 2025.
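For reference, an equivalent block that runs on Python 3.9, using if/elif instead of match (same data and keys as the diff above):

# Python 3.9-compatible rewrite of the match statement above.
for data_type in data_types:
    if data_type == "int":
        data["int"] = [1, 2, 3]
    elif data_type == "float":
        data["float"] = [1.01, 2.02, 3.03]
    elif data_type == "str":
        data["str"] = ["a", "b", "c"]
    elif data_type == "bool":
        data["bool"] = [True, False, True]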
Which issue does this PR close?
N/A.
Rationale for this change
Supporting all Parquet writer options allows us more flexibility when creating data directly from datafusion-python. For consistency, it supports all writer options defined by ParquetOptions in datafusion, using the same defaults: https://github.com/apache/datafusion/blob/555fc2e24dd669e44ac23a9a1d8406f4ac58a9ed/datafusion/common/src/config.rs#L423

What changes are included in this PR?
write_parquet with all writer options, including column-specific options. (pyarrow does not expose page-level information, so some options could not be directly tested, like enabling bloom filters; an external tool confirmed that this option works. For this specific case, there is a test that compares the file sizes.)
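For a sense of the new surface, here is a sketch of a call exercising a few of the added options (the DataFrame and path are illustrative; the parameter names come from the new signature shown above):

from datafusion import SessionContext

ctx = SessionContext()
df = ctx.from_pydict({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Exercises a few of the new writer options from this PR; the full set
# mirrors datafusion's ParquetOptions.
df.write_parquet(
    "out.parquet",
    compression="zstd(3)",
    max_row_group_size=1024 * 1024,
    bloom_filter_on_write=True,
)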
Are there any user-facing changes?
The main difference relates to the existing compression field, which now uses a str like datafusion, instead of a custom enum. The main advantage is that future algorithms will not require updating the Python-side code. Additionally, the default compression was changed from zstd(4) to zstd(3), the same as datafusion.
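For example, under the new signature a compression setting is just a string that datafusion parses on its side (a minimal sketch; the DataFrame construction is illustrative):

from datafusion import SessionContext

ctx = SessionContext()
df = ctx.from_pydict({"a": [1, 2, 3]})

# "zstd(3)" is the new default; another algorithm is just another string,
# e.g. compression="snappy", with no Python-side enum to update.
df.write_parquet("data.parquet", compression="zstd(3)")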