feat: Support Parquet writer options #1123


Open · wants to merge 1 commit into main

Conversation

nuno-faria

@nuno-faria nuno-faria commented May 5, 2025

Which issue does this PR close?

N/A.

Rationale for this change

Supporting all Parquet writer options gives us more flexibility when creating data directly from datafusion-python.

For consistency, this PR supports all writer options defined by ParquetOptions in datafusion, using the same defaults: https://github.com/apache/datafusion/blob/555fc2e24dd669e44ac23a9a1d8406f4ac58a9ed/datafusion/common/src/config.rs#L423.

What changes are included in this PR?

  • Extended write_parquet with all writer options, including column-specific options.
  • Added relevant tests. Since pyarrow does not expose page-level information, some options, such as enabling bloom filters, could not be tested directly (an external tool confirmed that the option works); for that specific case, there is a test that compares file sizes instead.

Are there any user-facing changes?

The main difference relates to the existing compression field, which now accepts a str (matching datafusion) instead of a custom enum. The main advantage is that future compression algorithms will not require updating the Python-side code.

Additionally, the default compression was changed from zstd(4) to zstd(3), matching datafusion's default.
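For illustration, the new compression value embeds the optional level in the string, e.g. "zstd(3)" or plain "snappy". A minimal sketch of how such a string could be split into codec and level (a hypothetical helper for illustration, not part of the PR):

```python
import re


def parse_compression(value: str):
    """Split a DataFusion-style compression string like 'zstd(3)' into
    (codec, level); level is None when absent (e.g. 'snappy')."""
    m = re.fullmatch(r"(\w+)(?:\((\d+)\))?", value)
    if m is None:
        raise ValueError(f"invalid compression string: {value!r}")
    codec, level = m.group(1), m.group(2)
    return codec, int(level) if level is not None else None
```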

Contributor

@timsaucer timsaucer left a comment


Overall, I really like the idea. Right now this does include a breaking change to a very popular user-facing function. I think if we follow the suggestion to allow two function signatures, we'll be able to include this in the next release.

Comment on lines +697 to +717
    data_pagesize_limit: int = 1024 * 1024,
    write_batch_size: int = 1024,
    writer_version: str = "1.0",
    skip_arrow_metadata: bool = False,
    compression: Optional[str] = "zstd(3)",
    dictionary_enabled: Optional[bool] = True,
    dictionary_page_size_limit: int = 1024 * 1024,
    statistics_enabled: Optional[str] = "page",
    max_row_group_size: int = 1024 * 1024,
    created_by: str = "datafusion-python",
    column_index_truncate_length: Optional[int] = 64,
    statistics_truncate_length: Optional[int] = None,
    data_page_row_count_limit: int = 20_000,
    encoding: Optional[str] = None,
    bloom_filter_on_write: bool = False,
    bloom_filter_fpp: Optional[float] = None,
    bloom_filter_ndv: Optional[int] = None,
    allow_single_file_parallelism: bool = True,
    maximum_parallel_row_group_writers: int = 1,
    maximum_buffered_record_batches_per_stream: int = 2,
    column_specific_options: Optional[dict[str, ParquetColumnOptions]] = None,

I think the full Parquet writer options are going to be less commonly used by most users. I would suggest we have two different versions of the write_parquet signature: one that keeps the simple existing compression parameters, and one that looks like

    def write_parquet(self, path: str | pathlib.Path, options: ParquetWriterOptions)

And then we would have a ParquetWriterOptions class that holds this large set of initialization options.
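The proposal above could be sketched as a dataclass whose field names and defaults come from the write_parquet signature quoted in this PR (the class itself is hypothetical; the real implementation may differ, and ParquetColumnOptions is stubbed as a plain dict here):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParquetWriterOptions:
    """Bundle of Parquet writer options; defaults mirror DataFusion's ParquetOptions."""

    data_pagesize_limit: int = 1024 * 1024
    write_batch_size: int = 1024
    writer_version: str = "1.0"
    skip_arrow_metadata: bool = False
    compression: Optional[str] = "zstd(3)"
    dictionary_enabled: Optional[bool] = True
    dictionary_page_size_limit: int = 1024 * 1024
    statistics_enabled: Optional[str] = "page"
    max_row_group_size: int = 1024 * 1024
    created_by: str = "datafusion-python"
    column_index_truncate_length: Optional[int] = 64
    statistics_truncate_length: Optional[int] = None
    data_page_row_count_limit: int = 20_000
    encoding: Optional[str] = None
    bloom_filter_on_write: bool = False
    bloom_filter_fpp: Optional[float] = None
    bloom_filter_ndv: Optional[int] = None
    allow_single_file_parallelism: bool = True
    maximum_parallel_row_group_writers: int = 1
    maximum_buffered_record_batches_per_stream: int = 2
    # Mapping of column name -> per-column options (ParquetColumnOptions in the PR).
    column_specific_options: Optional[dict] = None
```

Callers would then write something like `df.write_parquet(path, options)`, leaving the simple existing signature untouched.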

Comment on lines +1766 to +1775
    for data_type in data_types:
        match data_type:
            case "int":
                data["int"] = [1, 2, 3]
            case "float":
                data["float"] = [1.01, 2.02, 3.03]
            case "str":
                data["str"] = ["a", "b", "c"]
            case "bool":
                data["bool"] = [True, False, True]
</match>

This code failed for me because our minimum supported Python is 3.9, and match was introduced in 3.10; 3.9 doesn't reach end of life until Oct 2025.
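On Python 3.9 the match block above can be replaced with a plain dictionary lookup; one 3.9-compatible sketch of the same test-data construction (illustrative only, not the PR's actual fix):

```python
# Same sample columns as the match/case version, keyed by type name.
SAMPLE_DATA = {
    "int": [1, 2, 3],
    "float": [1.01, 2.02, 3.03],
    "str": ["a", "b", "c"],
    "bool": [True, False, True],
}


def build_data(data_types):
    """Return only the sample columns requested in data_types."""
    return {t: SAMPLE_DATA[t] for t in data_types if t in SAMPLE_DATA}
```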
