docs/configuration.md: Documented table properties (#1231) #1232
base: main
Conversation
Thanks for the contribution! This is much needed. I added some minor comments.
mkdocs/docs/configuration.md
Outdated
| `write.delete.mode` | `{copy-on-write, merge-on-read}` | `copy-on-write` | Configures the delete mode (either Copy-on-Write or Merge-on-Read). |
| `schema.name-mapping.default` | Name mapping strategy | N/A | Default name mapping for schema evolution. |
| `format-version` | `{1, 2}` | 2 | The version of the Iceberg table format to use. |
| `write.metadata.previous-versions-max` | Integer | 100 | Maximum number of previous version metadata files to keep before deletion after commit. |
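As a minimal, purely illustrative sketch of how properties like these behave, the helper below reads a typed value out of a plain properties dict, falling back to the documented default. The function `property_as_int` is hypothetical and is not PyIceberg's actual API; the property names and defaults come from the table above.

```python
# Hypothetical sketch: reading typed values from a table-properties dict.
# property_as_int is illustrative, not PyIceberg's real API.

def property_as_int(properties: dict, name: str, default: int) -> int:
    """Return the named property as an int, falling back to the default."""
    value = properties.get(name)
    return int(value) if value is not None else default

props = {"write.metadata.previous-versions-max": "50"}

# An explicitly set value wins over the documented default of 100.
print(property_as_int(props, "write.metadata.previous-versions-max", 100))  # 50
# An unset property falls back to its documented default.
print(property_as_int(props, "format-version", 2))  # 2
```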
nit: group this with the other `write.metadata` properties
I noticed these 3 options are missing: see iceberg-python/pyiceberg/table/__init__.py, lines 197 to 204 in 7cf0c22.

Also, curious if you have suggestions to prevent documentation drift in the future.
These options are already under Table behavior options.
I believe we could use automated documentation generation tools or enforce strict documentation updates whenever a new feature is implemented.
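One lightweight version of that idea is a test that compares the property names the code knows about against the names mentioned in the docs. The sketch below is hypothetical: `PROPERTY_NAMES` stands in for the constants on PyIceberg's `TableProperties` class, and the inline `docs` string stands in for the contents of `configuration.md`.

```python
import re

# Illustrative documentation-drift check: every known property name
# must appear (backticked) somewhere in the docs.
# PROPERTY_NAMES is a stand-in for TableProperties' constants.

PROPERTY_NAMES = {
    "write.target-file-size-bytes",
    "write.parquet.compression-codec",
    "commit.manifest-merge.enabled",
}

docs = """
| `write.target-file-size-bytes` | ... |
| `write.parquet.compression-codec` | ... |
"""

# Collect every backticked identifier mentioned in the docs.
documented = set(re.findall(r"`([\w.\-]+)`", docs))
missing = PROPERTY_NAMES - documented
print(sorted(missing))  # ['commit.manifest-merge.enabled']
```

A CI job running such a check would fail whenever a new property lands without a matching docs entry.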
@kevinjqliu |
LGTM! Thanks for working on this
Thank you for the opportunity! Do let me know if you would like me to work on any other issue :))
@sikehish can you fix the CI lint issue? There are also other "good first issue"s, please take a look: https://github.com/apache/iceberg-python/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22
I can't find the CI lint issue. Could you share the link?
Yup, linting is in place now. Thanks for the reminder!
Hi all, I was trying `target-file-size-bytes` lately, and in the PyIceberg version we were using it somehow violates the principle of least surprise. As far as I understand, in PyIceberg it is not the file size on disk but the size in memory: a target of 512 MB resulted in files of 20 MB on disk for us. This caused a lot of trouble, first in understanding the behaviour, and second because other tools now pick the wrong value from metadata. If I am not mistaken, it would be great to document that behaviour, as it is not quite intuitive.
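The gap between in-memory size and on-disk size is easy to demonstrate in isolation. The snippet below is not PyIceberg code; it uses gzip as a stand-in for Parquet's columnar encoding plus compression to show how repetitive (low-cardinality) data shrinks dramatically when written out, which is why a 512 MB in-memory target can land as a much smaller file on disk.

```python
import gzip

# Stand-alone illustration: in-memory size vs. compressed on-disk size.
# gzip here is a stand-in for Parquet encoding + compression.

data = b"low-cardinality-value," * 100_000  # ~2.2 MB in memory
compressed = gzip.compress(data)

print(len(data), len(compressed))
print(f"on-disk size is {len(compressed) / len(data):.2%} of in-memory size")
```

The more repetitive the column values, the larger this gap gets, so the on-disk outcome of an in-memory size target is highly data-dependent.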
Hi @mths1, thanks for the feedback. You're right; this aligns with https://iceberg.apache.org/docs/latest/spark-writes/#controlling-file-sizes. Perhaps we can mention this behavior in the table, along the lines of what the Java docs mention.
I left a few comments on properties that are not supported. When setting them, PyIceberg will emit a warning. I think it would be confusing to suggest that they are supported in the docs. Apart from that, it looks good. Thanks @sikehish for working on this 🚀
@@ -47,6 +55,8 @@ Iceberg tables support table properties to configure table behavior.
| `commit.manifest.target-size-bytes` | Size in bytes | 8388608 (8MB) | Target size when merging manifest files |
| `commit.manifest.min-count-to-merge` | Number of manifests | 100 | Minimum number of manifests to accumulate before merging |
| `commit.manifest-merge.enabled` | Boolean | False | Controls whether to automatically merge manifests on writes |
| `schema.name-mapping.default` | Name mapping strategy | N/A | Default name mapping for schema evolution. |
| `format-version` | `{1, 2}` | 2 | The version of the Iceberg table format to use. |
This is interesting. Before, not aligning the markdown table would result in a lint error.
| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}` | zstd | Sets the Parquet compression codec. |
| `write.parquet.compression-level` | Integer | null | Parquet compression level for the codec. If not set, it is up to PyIceberg. |
| `write.parquet.row-group-limit` | Number of rows | 1,048,576 | The upper bound of the number of entries within a single row group. |
| `write.parquet.row-group-size-bytes` | Size in bytes | 128 MB | The maximum size (in bytes) of each Parquet row group. |
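As a hedged sketch of how properties like these could map onto a Parquet writer's keyword arguments (e.g. pyarrow's), the hypothetical helper below translates two of the documented property names into writer kwargs; the mapping and parsing are illustrative only, not PyIceberg's internals.

```python
# Hypothetical translation of Iceberg write.parquet.* properties into
# Parquet writer kwargs. Not PyIceberg's actual implementation.

def parquet_writer_kwargs(properties: dict) -> dict:
    kwargs = {}
    if codec := properties.get("write.parquet.compression-codec"):
        kwargs["compression"] = codec
    if level := properties.get("write.parquet.compression-level"):
        # Table properties are stored as strings, so parse the level.
        kwargs["compression_level"] = int(level)
    return kwargs

print(parquet_writer_kwargs({
    "write.parquet.compression-codec": "zstd",
    "write.parquet.compression-level": "3",
}))
# {'compression': 'zstd', 'compression_level': 3}
```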
This one is not supported: iceberg-python/pyiceberg/io/pyarrow.py, lines 2550 to 2556 in 583a7e9:

```python
for key_pattern in [
    TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES,
    TableProperties.PARQUET_BLOOM_FILTER_MAX_BYTES,
    f"{TableProperties.PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX}.*",
]:
    if unsupported_keys := fnmatch.filter(table_properties, key_pattern):
        warnings.warn(f"Parquet writer option(s) {unsupported_keys} not implemented")
```
We can also make that explicit in the docs.
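To see the warning mechanism in isolation, here is a self-contained reproduction of the pattern quoted above, with literal strings standing in for the `TableProperties` constants (the constant values shown are assumptions, not verified against the source):

```python
import fnmatch
import warnings

# Literal stand-ins for the TableProperties constants in the quoted snippet.
UNSUPPORTED_PATTERNS = [
    "write.parquet.row-group-size-bytes",
    "write.parquet.bloom-filter-max-bytes",
    "write.parquet.bloom-filter-enabled.column.*",
]

def warn_on_unsupported(table_properties: dict) -> None:
    # fnmatch.filter matches the dict's keys against each glob pattern.
    for key_pattern in UNSUPPORTED_PATTERNS:
        if unsupported_keys := fnmatch.filter(table_properties, key_pattern):
            warnings.warn(f"Parquet writer option(s) {unsupported_keys} not implemented")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_on_unsupported({"write.parquet.row-group-size-bytes": "134217728"})

print(str(caught[0].message))
```

The property is silently ignored apart from this warning, which is exactly why calling it out in the docs would help.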
| `write.parquet.bloom-filter-max-bytes` | Size in bytes | 1 MB | The maximum size (in bytes) of the Bloom filter for Parquet files. |
| `write.parquet.bloom-filter-enabled.column` | Column names | N/A | Enable Bloom filters for specific columns by prefixing the column name. |
These ones are not supported:
iceberg-python/pyiceberg/io/pyarrow.py, lines 2550 to 2556 in 583a7e9:

```python
for key_pattern in [
    TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES,
    TableProperties.PARQUET_BLOOM_FILTER_MAX_BYTES,
    f"{TableProperties.PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX}.*",
]:
    if unsupported_keys := fnmatch.filter(table_properties, key_pattern):
        warnings.warn(f"Parquet writer option(s) {unsupported_keys} not implemented")
```
We can also make that explicit in the docs.
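For the per-column variant specifically, the column name is encoded as a suffix of the property key. The following is a hypothetical sketch (the helper name and semantics are assumptions, not PyIceberg code) of how such prefixed properties could be interpreted:

```python
# Hypothetical interpretation of per-column bloom-filter properties:
# the column name is the suffix after the documented prefix.

PREFIX = "write.parquet.bloom-filter-enabled.column."

def bloom_filter_columns(properties: dict) -> list:
    """Return columns whose bloom-filter property is set to 'true'."""
    return [
        key[len(PREFIX):]
        for key, value in properties.items()
        if key.startswith(PREFIX) and value.lower() == "true"
    ]

props = {
    "write.parquet.bloom-filter-enabled.column.user_id": "true",
    "write.parquet.bloom-filter-enabled.column.event": "false",
}
print(bloom_filter_columns(props))  # ['user_id']
```

Note that, per the warning snippet above, keys matching this prefix currently only trigger a "not implemented" warning in PyIceberg.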
Thanks for the review, Fokko. +1 to adding a section to call out what is supported and unsupported.
This PR is for #1231.

Changes

- Updated the `configuration.md` file with properties from the `TableProperties` class, including:
    - `write.target-file-size-bytes`
    - `write.parquet.row-group-size-bytes`
    - `write.parquet.bloom-filter-max-bytes`

Files Modified

- `configuration.md`: Updated to reflect the complete list of properties.

Also, do let me know if any modifications are to be made. I had to make a few assumptions for the options column where data wasn't readily available.