
docs/configuration.md: Documented table properties (#1231) #1232

Open · wants to merge 5 commits into `main`
Conversation

@sikehish (Author)

This PR is for #1231.

Changes

  • Added missing table properties to configuration.md, including:
    • `write.target-file-size-bytes`
    • `write.parquet.row-group-size-bytes`
    • `write.parquet.bloom-filter-max-bytes`
    • Other relevant properties from the `TableProperties` class.

Files Modified

  • configuration.md: Updated to reflect the complete list of properties.

Do let me know if any modifications are needed. I had to make a few assumptions for the options column where data wasn't readily available.

@kevinjqliu (Contributor) left a comment:

Thanks for the contribution! This is much needed.

I added some minor comments.

mkdocs/docs/configuration.md
| `write.delete.mode` | `{copy-on-write, merge-on-read}` | `copy-on-write` | Configures the delete mode (either Copy-on-Write or Merge-on-Read). |
| `schema.name-mapping.default` | Name mapping strategy | N/A | Default name mapping for schema evolution. |
| `format-version` | `{1, 2}` | 2 | The version of the Iceberg table format to use. |
| `write.metadata.previous-versions-max` | Integer | 100 | Maximum number of previous version metadata files to keep before deletion after commit. |
@kevinjqliu (Contributor):

nit: group this with the other `write.metadata` properties

@kevinjqliu (Contributor)

I noticed these 3 options are missing:

MANIFEST_TARGET_SIZE_BYTES = "commit.manifest.target-size-bytes"
MANIFEST_TARGET_SIZE_BYTES_DEFAULT = 8 * 1024 * 1024 # 8 MB
MANIFEST_MIN_MERGE_COUNT = "commit.manifest.min-count-to-merge"
MANIFEST_MIN_MERGE_COUNT_DEFAULT = 100
MANIFEST_MERGE_ENABLED = "commit.manifest-merge.enabled"
MANIFEST_MERGE_ENABLED_DEFAULT = False
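
For context, these three properties control manifest compaction at commit time. A minimal sketch of setting one of them follows; the catalog and table names are placeholders, and `Transaction.set_properties` is the standard way to update table properties:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # placeholder catalog name
table = catalog.load_table("db.events")  # placeholder table name

# Opt in to automatic manifest merging on write; the target size (8 MB)
# and min merge count (100) keep the defaults listed above.
with table.transaction() as tx:
    tx.set_properties({"commit.manifest-merge.enabled": "true"})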

@kevinjqliu (Contributor)

Also curious if you have suggestions to prevent documentation drift in the future.

@sikehish (Author)

> I noticed these 3 options are missing:
>
> MANIFEST_TARGET_SIZE_BYTES = "commit.manifest.target-size-bytes"
> MANIFEST_TARGET_SIZE_BYTES_DEFAULT = 8 * 1024 * 1024  # 8 MB
> MANIFEST_MIN_MERGE_COUNT = "commit.manifest.min-count-to-merge"
> MANIFEST_MIN_MERGE_COUNT_DEFAULT = 100
> MANIFEST_MERGE_ENABLED = "commit.manifest-merge.enabled"
> MANIFEST_MERGE_ENABLED_DEFAULT = False

These options are already documented under "Table behavior options".

@sikehish (Author)

> Also curious if you have suggestions to prevent documentation drift in the future.

I believe we could use automated documentation generation, or enforce documentation updates in review whenever a new property is added.
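
One concrete way to catch drift, sketched below under a couple of assumptions (that property keys are the public non-`_DEFAULT` string attributes of `TableProperties`, and that the docs live at `mkdocs/docs/configuration.md`), is a test that fails when a property key is missing from the page:

from pathlib import Path

from pyiceberg.table import TableProperties


def test_all_table_properties_are_documented() -> None:
    docs = Path("mkdocs/docs/configuration.md").read_text()
    # Treat every public string-valued class attribute that is not a
    # *_DEFAULT constant as a property key (defaults hold values, not names).
    keys = [
        value
        for name, value in vars(TableProperties).items()
        if isinstance(value, str) and not name.startswith("_") and not name.endswith("_DEFAULT")
    ]
    missing = [key for key in keys if key not in docs]
    assert not missing, f"Undocumented table properties: {missing}"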

@sikehish (Author)

@kevinjqliu Hi, I've made the changes. Let me know if any other changes are needed.

@kevinjqliu (Contributor) left a comment:

LGTM! Thanks for working on this

@sikehish (Author)

> LGTM! Thanks for working on this

Thank you for the opportunity! Do let me know if you'd like me to work on any other issues :))

@kevinjqliu (Contributor)

@sikehish can you fix the CI lint issue? `make lint` should work.

There are other "good first issue"s, please take a look https://github.com/apache/iceberg-python/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22

@sikehish (Author)

> @sikehish can you fix the CI lint issue? `make lint` should work.
>
> There are other "good first issue"s, please take a look https://github.com/apache/iceberg-python/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22

I can't find the CI lint issue. Could you share the link?

@kevinjqliu (Contributor)

https://github.com/apache/iceberg-python/actions/runs/11389073844/job/31690659534?pr=1232

@sikehish (Author)

> https://github.com/apache/iceberg-python/actions/runs/11389073844/job/31690659534?pr=1232

Yup, linting is in place now. Thanks for the reminder!

@mths1 commented Oct 18, 2024

Hi all,

I was trying `write.target-file-size-bytes` recently, and in the PyIceberg version we were using it violates the principle of least surprise. As far as I understand, it is not the file size on disk but the size in memory: a target of 512 MB resulted in files of about 20 MB on disk for us. This caused a lot of trouble, first in understanding the behavior, and second because other tools now pick up the wrong value from the metadata. If I'm not mistaken, it would be great to document this behaviour, since it is not intuitive.

@kevinjqliu (Contributor)

Hi @mths1, thanks for the feedback. You're right: `write.target-file-size-bytes` does not represent the resulting file's size on disk. It is based on the size of the in-memory Arrow buffers, and since Parquet is compressed, the resulting file size can be smaller.

This aligns with https://iceberg.apache.org/docs/latest/spark-writes/#controlling-file-sizes

Perhaps we can mention this behavior in the table. For example, this is what the Java docs say:

> write.target-file-size-bytes | 536870912 (512 MB) | Controls the size of files generated to target about this many bytes

Maybe something like

> Controls the target size of in-memory buffers for writing files. The actual file size may be smaller due to compression.
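
To make the in-memory semantics concrete, here is a minimal sketch; the catalog name, table identifier, and schema are placeholders. The property value counts bytes of Arrow data held in memory, so the compressed Parquet files on disk typically come out smaller:

import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")  # placeholder catalog name

schema = pa.schema([("id", pa.int64()), ("payload", pa.string())])  # placeholder schema

table = catalog.create_table(
    "db.events",  # placeholder identifier
    schema=schema,
    properties={"write.target-file-size-bytes": str(512 * 1024 * 1024)},
)
# Writes are split at roughly 512 MB of in-memory Arrow data per file;
# after Parquet compression the on-disk files are usually much smaller.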

@Fokko (Contributor) left a comment:

I left a few comments on properties that are not supported. When setting them, PyIceberg will emit a warning. I think it would be confusing to suggest that they are supported in the docs. Apart from that, it looks good. Thanks @sikehish for working on this 🚀

@@ -47,6 +55,8 @@ Iceberg tables support table properties to configure table behavior.
| `commit.manifest.target-size-bytes` | Size in bytes | 8388608 (8MB) | Target size when merging manifest files |
| `commit.manifest.min-count-to-merge` | Number of manifests | 100 | Minimum number of manifest files to accumulate before merging |
| `commit.manifest-merge.enabled` | Boolean | False | Controls whether to automatically merge manifests on writes |
| `schema.name-mapping.default` | Name mapping strategy | N/A | Default name mapping for schema evolution. |
| `format-version` | `{1, 2}` | 2 | The version of the Iceberg table format to use. |
@Fokko (Contributor):

This is interesting. Previously, not aligning the markdown table would result in a lint error.

| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}` | zstd | Sets the Parquet compression codec. |
| `write.parquet.compression-level` | Integer | null | Parquet compression level for the codec. If not set, it is up to PyIceberg. |
| `write.parquet.row-group-limit` | Number of rows | 1,048,576 | The upper bound of the number of entries within a single row group. |
| `write.parquet.row-group-size-bytes` | Size in bytes | 128 MB | The maximum size (in bytes) of each Parquet row group. |
@Fokko (Contributor):

This one is not supported:

for key_pattern in [
    TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES,
    TableProperties.PARQUET_BLOOM_FILTER_MAX_BYTES,
    f"{TableProperties.PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX}.*",
]:
    if unsupported_keys := fnmatch.filter(table_properties, key_pattern):
        warnings.warn(f"Parquet writer option(s) {unsupported_keys} not implemented")

We can also make that explicit in the docs.

Comment on lines +42 to +43
| `write.parquet.bloom-filter-max-bytes` | Size in bytes | 1 MB | The maximum size (in bytes) of the Bloom filter for Parquet files. |
| `write.parquet.bloom-filter-enabled.column` | Column names | N/A | Enable Bloom filters for specific columns by prefixing the column name. |
@Fokko (Contributor):

These ones are not supported:

for key_pattern in [
    TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES,
    TableProperties.PARQUET_BLOOM_FILTER_MAX_BYTES,
    f"{TableProperties.PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX}.*",
]:
    if unsupported_keys := fnmatch.filter(table_properties, key_pattern):
        warnings.warn(f"Parquet writer option(s) {unsupported_keys} not implemented")

We can also make that explicit in the docs.

@kevinjqliu (Contributor)

Thanks for the review, @Fokko. +1 to adding a section to call out what is supported and unsupported.

I believe these 3 are currently the unsupported properties:

for key_pattern in [
    TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES,
    TableProperties.PARQUET_BLOOM_FILTER_MAX_BYTES,
    f"{TableProperties.PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX}.*",
]:
    if unsupported_keys := fnmatch.filter(table_properties, key_pattern):
        warnings.warn(f"Parquet writer option(s) {unsupported_keys} not implemented")
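
For readers following along, this check runs when the Parquet writer options are resolved, so the warning should surface when data is written rather than when the property is set. A minimal sketch of observing it, where the catalog name, table identifier, and schema are placeholders and the table is assumed to already have `write.parquet.row-group-size-bytes` set:

import warnings

import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # placeholder catalog name
table = catalog.load_table("db.events")  # placeholder table, assumed to have
                                         # write.parquet.row-group-size-bytes set

df = pa.table({"id": pa.array([1, 2, 3], type=pa.int64())})  # placeholder data

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    table.append(df)  # the unsupported-key check fires during the write

print([str(w.message) for w in caught])
# expected to include something like:
# "Parquet writer option(s) ['write.parquet.row-group-size-bytes'] not implemented"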

@kevinjqliu kevinjqliu added this to the PyIceberg 0.9.0 release milestone Oct 30, 2024