
docs/configuration.md: Documented table properties (#1231) #1232

Open · wants to merge 5 commits into `main`
Conversation

@sikehish (Author)

This PR is for #1231.

Changes

  • Added missing table properties to configuration.md, including:
    • `write.target-file-size-bytes`
    • `write.parquet.row-group-size-bytes`
    • `write.parquet.bloom-filter-max-bytes`
    • Other relevant properties from the `TableProperties` class.

Files Modified

  • configuration.md: Updated to reflect the complete list of properties.

Do let me know if any modifications are needed. I had to make a few assumptions for the options column where data wasn't readily available.

@kevinjqliu (Contributor) left a comment:

Thanks for the contribution! This is much needed.

I added some minor comments.

mkdocs/docs/configuration.md
| `write.delete.mode` | `{copy-on-write, merge-on-read}` | `copy-on-write` | Configures the delete mode (either Copy-on-Write or Merge-on-Read). |
| `schema.name-mapping.default` | Name mapping strategy | N/A | Default name mapping for schema evolution. |
| `format-version` | `{1, 2}` | 2 | The version of the Iceberg table format to use. |
| `write.metadata.previous-versions-max` | Integer | 100 | Maximum number of previous version metadata files to keep before deletion after commit. |
@kevinjqliu (Contributor):

nit: group this with the other `write.metadata` properties

@kevinjqliu (Contributor)

I noticed these 3 options are missing:

MANIFEST_TARGET_SIZE_BYTES = "commit.manifest.target-size-bytes"
MANIFEST_TARGET_SIZE_BYTES_DEFAULT = 8 * 1024 * 1024 # 8 MB
MANIFEST_MIN_MERGE_COUNT = "commit.manifest.min-count-to-merge"
MANIFEST_MIN_MERGE_COUNT_DEFAULT = 100
MANIFEST_MERGE_ENABLED = "commit.manifest-merge.enabled"
MANIFEST_MERGE_ENABLED_DEFAULT = False
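
For context, these three properties control manifest compaction at commit time. A minimal sketch of setting one of them follows; the catalog and table names are placeholders, and `Transaction.set_properties` is the standard way to update table properties:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # placeholder catalog name
table = catalog.load_table("db.events")  # placeholder table name

# Opt in to automatic manifest merging on write; the target size (8 MB)
# and min merge count (100) keep the defaults listed above.
with table.transaction() as tx:
    tx.set_properties({"commit.manifest-merge.enabled": "true"})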

@kevinjqliu (Contributor)

Also curious if you have suggestions to prevent documentation drift in the future.

@sikehish (Author)

> I noticed these 3 options are missing:
>
> MANIFEST_TARGET_SIZE_BYTES = "commit.manifest.target-size-bytes"
> MANIFEST_TARGET_SIZE_BYTES_DEFAULT = 8 * 1024 * 1024  # 8 MB
> MANIFEST_MIN_MERGE_COUNT = "commit.manifest.min-count-to-merge"
> MANIFEST_MIN_MERGE_COUNT_DEFAULT = 100
> MANIFEST_MERGE_ENABLED = "commit.manifest-merge.enabled"
> MANIFEST_MERGE_ENABLED_DEFAULT = False

These options are already documented under "Table behavior options".

@sikehish (Author)

> Also curious if you have suggestions to prevent documentation drift in the future.

I believe we could use automated documentation generation, or enforce documentation updates in review whenever a new property is added.
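
One concrete way to catch drift, sketched below under a couple of assumptions (that property keys are the public non-`_DEFAULT` string attributes of `TableProperties`, and that the docs live at `mkdocs/docs/configuration.md`), is a test that fails when a property key is missing from the page:

from pathlib import Path

from pyiceberg.table import TableProperties


def test_all_table_properties_are_documented() -> None:
    docs = Path("mkdocs/docs/configuration.md").read_text()
    # Treat every public string-valued class attribute that is not a
    # *_DEFAULT constant as a property key (defaults hold values, not names).
    keys = [
        value
        for name, value in vars(TableProperties).items()
        if isinstance(value, str) and not name.startswith("_") and not name.endswith("_DEFAULT")
    ]
    missing = [key for key in keys if key not in docs]
    assert not missing, f"Undocumented table properties: {missing}"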

@sikehish (Author)

@kevinjqliu Hi, I've made the changes. Let me know if any other changes are needed.

@kevinjqliu (Contributor) left a comment:

LGTM! Thanks for working on this

@sikehish (Author)

> LGTM! Thanks for working on this

Thank you for the opportunity! Do let me know if you'd like me to work on any other issues :))

@kevinjqliu (Contributor)

@sikehish can you fix the CI lint issue? `make lint` should work.

There are other "good first issue"s, please take a look https://github.com/apache/iceberg-python/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22

@sikehish (Author)

> @sikehish can you fix the CI lint issue? `make lint` should work.
>
> There are other "good first issue"s, please take a look https://github.com/apache/iceberg-python/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22

I can't find the CI lint issue. Could you share the link?

@kevinjqliu (Contributor)

https://github.com/apache/iceberg-python/actions/runs/11389073844/job/31690659534?pr=1232

@sikehish (Author)

> https://github.com/apache/iceberg-python/actions/runs/11389073844/job/31690659534?pr=1232

Yup, linting is in place now. Thanks for the reminder!

@mths1 commented Oct 18, 2024

Hi all,

I was trying `write.target-file-size-bytes` recently, and in the PyIceberg version we were using it violates the principle of least surprise. As far as I understand, it is not the file size on disk but the size in memory: a target of 512 MB resulted in files of about 20 MB on disk for us. This caused a lot of trouble, first in understanding the behavior, and second because other tools now pick up the wrong value from the metadata. If I'm not mistaken, it would be great to document this behaviour, since it is not intuitive.

@kevinjqliu (Contributor)

Hi @mths1, thanks for the feedback. You're right: `write.target-file-size-bytes` does not represent the resulting file's size on disk. It is based on the size of the in-memory Arrow buffers, and since Parquet is compressed, the resulting file size can be smaller.

This aligns with https://iceberg.apache.org/docs/latest/spark-writes/#controlling-file-sizes

Perhaps we can mention this behavior in the table. For example, this is what the Java docs say:

> write.target-file-size-bytes | 536870912 (512 MB) | Controls the size of files generated to target about this many bytes

Maybe something like

> Controls the target size of in-memory buffers for writing files. The actual file size may be smaller due to compression.
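
To make the in-memory semantics concrete, here is a minimal sketch; the catalog name, table identifier, and schema are placeholders. The property value counts bytes of Arrow data held in memory, so the compressed Parquet files on disk typically come out smaller:

import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")  # placeholder catalog name

schema = pa.schema([("id", pa.int64()), ("payload", pa.string())])  # placeholder schema

table = catalog.create_table(
    "db.events",  # placeholder identifier
    schema=schema,
    properties={"write.target-file-size-bytes": str(512 * 1024 * 1024)},
)
# Writes are split at roughly 512 MB of in-memory Arrow data per file;
# after Parquet compression the on-disk files are usually much smaller.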

@Fokko (Contributor) left a comment:

I left a few comments on properties that are not supported. When setting them, PyIceberg will emit a warning. I think it would be confusing to suggest that they are supported in the docs. Apart from that, it looks good. Thanks @sikehish for working on this 🚀

@@ -47,6 +55,8 @@ Iceberg tables support table properties to configure table behavior.
| `commit.manifest.target-size-bytes` | Size in bytes | 8388608 (8MB) | Target size when merging manifest files |
| `commit.manifest.min-count-to-merge` | Number of manifests | 100 | Minimum number of manifest files to accumulate before merging |
| `commit.manifest-merge.enabled` | Boolean | False | Controls whether to automatically merge manifests on writes |
| `schema.name-mapping.default` | Name mapping strategy | N/A | Default name mapping for schema evolution. |
| `format-version` | `{1, 2}` | 2 | The version of the Iceberg table format to use. |
@Fokko (Contributor):

This is interesting. Previously, not aligning the markdown table would result in a lint error.

| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}` | zstd | Sets the Parquet compression codec. |
| `write.parquet.compression-level` | Integer | null | Parquet compression level for the codec. If not set, it is up to PyIceberg. |
| `write.parquet.row-group-limit` | Number of rows | 1,048,576 | The upper bound of the number of entries within a single row group. |
| `write.parquet.row-group-size-bytes` | Size in bytes | 128 MB | The maximum size (in bytes) of each Parquet row group. |
@Fokko (Contributor):

This one is not supported:

for key_pattern in [
    TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES,
    TableProperties.PARQUET_BLOOM_FILTER_MAX_BYTES,
    f"{TableProperties.PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX}.*",
]:
    if unsupported_keys := fnmatch.filter(table_properties, key_pattern):
        warnings.warn(f"Parquet writer option(s) {unsupported_keys} not implemented")

We can also make that explicit in the docs.

Comment on lines +42 to +43
| `write.parquet.bloom-filter-max-bytes` | Size in bytes | 1 MB | The maximum size (in bytes) of the Bloom filter for Parquet files. |
| `write.parquet.bloom-filter-enabled.column` | Column names | N/A | Enable Bloom filters for specific columns by prefixing the column name. |
@Fokko (Contributor):

These ones are not supported:

for key_pattern in [
    TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES,
    TableProperties.PARQUET_BLOOM_FILTER_MAX_BYTES,
    f"{TableProperties.PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX}.*",
]:
    if unsupported_keys := fnmatch.filter(table_properties, key_pattern):
        warnings.warn(f"Parquet writer option(s) {unsupported_keys} not implemented")

We can also make that explicit in the docs.

@kevinjqliu (Contributor)

Thanks for the review, @Fokko. +1 to adding a section to call out what is supported and unsupported.

I believe these 3 are currently the unsupported properties:

for key_pattern in [
    TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES,
    TableProperties.PARQUET_BLOOM_FILTER_MAX_BYTES,
    f"{TableProperties.PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX}.*",
]:
    if unsupported_keys := fnmatch.filter(table_properties, key_pattern):
        warnings.warn(f"Parquet writer option(s) {unsupported_keys} not implemented")
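
For readers following along, this check runs when the Parquet writer options are resolved, so the warning should surface when data is written rather than when the property is set. A minimal sketch of observing it, where the catalog name, table identifier, and schema are placeholders and the table is assumed to already have `write.parquet.row-group-size-bytes` set:

import warnings

import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # placeholder catalog name
table = catalog.load_table("db.events")  # placeholder table, assumed to have
                                         # write.parquet.row-group-size-bytes set

df = pa.table({"id": pa.array([1, 2, 3], type=pa.int64())})  # placeholder data

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    table.append(df)  # the unsupported-key check fires during the write

print([str(w.message) for w in caught])
# expected to include something like:
# "Parquet writer option(s) ['write.parquet.row-group-size-bytes'] not implemented"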

@kevinjqliu kevinjqliu added this to the PyIceberg 0.9.0 release milestone Oct 30, 2024