-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Default to ZSTD compression when writing Parquet #981
base: main
Are you sure you want to change the base?
Conversation
… write_parquet method
python/datafusion/dataframe.py
Outdated
""" | ||
path (str | pathlib.Path): The file path to write the Parquet file. | ||
compression (str): The compression algorithm to use. Default is "ZSTD". | ||
compression_level (int | None): The compression level to use. For ZSTD, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should document that the compression level is different per algorithm. It's only zstd that has a 1-22 range IIRC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean like
compression_level (int | None): The compression level to use. For ZSTD, the
recommended range is 1 to 22, with the default being 3. Higher levels
provide better compression but slower speed.
python/datafusion/dataframe.py
Outdated
# default compression level to 3 for ZSTD | ||
if compression == "ZSTD": | ||
if compression_level is None: | ||
compression_level = 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3 seems like an awfully low compression default. We should evaluate what other libraries use as the default compression setting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be nice to dig into what DuckDB's defaults are: https://duckdb.org/docs/data/parquet/overview.html#writing-to-parquet-files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3 seems like an awfully low compression default. We should evaluate what other libraries use as the default compression setting.
I used the default compression level in the manual from Facebook (author of zstd) - https://facebook.github.io/zstd/zstd_manual.html
I could not find a default in DuckDB's documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @kylebarron ,
Shall we adopt delta-rs' default, and use 4 as the default ZSTD compression level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that sounds good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
I have amended the default to 4.
In delta-rs we have the default to use "snappy" compression, except our optimize operation which uses ZSTD(4) |
python/datafusion/dataframe.py
Outdated
recommended range is 1 to 22, with the default being 4. Higher levels | ||
provide better compression but slower speed. | ||
""" | ||
if compression == "ZSTD": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to reuse this code I added in deltalake a while ago:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ion-elgreco ,
I added the Compression Enum but omitted the check_valid_levels because these are already implemented in Rust DataFrame. Compression levels are tested in:
datafusion-python/python/tests/test_dataframe.py
Lines 1093 to 1106 in 63b13da
@pytest.mark.parametrize( | |
"compression, compression_level", | |
[("gzip", 12), ("brotli", 15), ("zstd", 23), ("wrong", 12)], | |
) | |
def test_write_compressed_parquet_wrong_compression_level( | |
df, tmp_path, compression, compression_level | |
): | |
path = tmp_path | |
with pytest.raises(ValueError): | |
df.write_parquet( | |
str(path), | |
compression=compression, | |
compression_level=compression_level, |
…s' write_parquet method
…t writing method doc
…t compression levels
4d3fd8d
to
e7ec09b
Compare
e7ec09b
to
41e1742
Compare
Which issue does this PR close?
Closes #978.
Rationale for this change
Currently, the write_parquet method defaults to "uncompressed" Parquet files, which can lead to inefficient storage and slower performance during I/O operations. This change sets the default compression method to "ZSTD", a modern compression algorithm that provides an excellent balance of compression speed and ratio. Additionally, it introduces a default compression level of 3 for ZSTD, which is optimal for many use cases.
What changes are included in this PR?
Updated the default compression parameter in the write_parquet method from "uncompressed" to "ZSTD".
Introduced a default compression level of 3 for ZSTD if no level is specified.
Added validation to ensure the compression level for ZSTD falls within the valid range (1 to 22) and raises a ValueError otherwise.
Updated the docstring to clarify the default values and provide guidance for users on compression levels.
Are there any user-facing changes?
Yes:
The default behavior of write_parquet now compresses output files using ZSTD with a default compression level of 3, instead of leaving files uncompressed.
Users specifying an invalid compression level for ZSTD will now encounter a ValueError.