feat: support S3 Table Buckets with S3TablesCatalog #1429

felixscherz · 2024-12-14T14:06:47Z

Hi, this is in regards to #1404.

I created a first draft of an S3TablesCatalog that uses the S3 Table Bucket API for catalog operations.

How to run tests

Since moto does not support mocking the S3 Tables API yet (WIP: getmoto/moto#8470), we have to run tests against a live AWS account. To do that, create an S3 Tables Bucket in one of the supported regions, then set the table bucket ARN and AWS Region as environment variables:

AWS_REGION=us-east-2 AWS_TEST_S3_BUCKET_ARN=... poetry run pytest tests/catalog/integration_test_s3tables.py
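As a rough illustration of the setup described above, a test module could read both variables up front and skip the live tests when either is missing. The helper name `live_test_settings` is hypothetical, and the variable names are simply taken from the command above; this is a sketch, not the code in this PR.

```python
import os
from typing import Mapping, Optional, Tuple

# Variable names copied from the command above; the helper itself is hypothetical.
ARN_VAR = "AWS_TEST_S3_BUCKET_ARN"
REGION_VAR = "AWS_REGION"


def live_test_settings(env: Optional[Mapping[str, str]] = None) -> Optional[Tuple[str, str]]:
    """Return (table_bucket_arn, region), or None when either variable is unset or empty."""
    env = os.environ if env is None else env
    arn = env.get(ARN_VAR)
    region = env.get(REGION_VAR)
    if not arn or not region:
        return None
    return arn, region
```

A pytest fixture could call this helper and skip the test when it returns None, so that a plain test run without AWS credentials stays green.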

felixscherz · 2024-12-14T16:46:16Z

I was able to work around the issue above by using FsspecFileIO instead of the default PyArrowFileIO. With FsspecFileIO, the catalog can now create new tables.

kevinjqliu · 2024-12-20T17:02:32Z

Thanks for working on this, @felixscherz! Feel free to tag me when it's ready for review :)

felixscherz · 2024-12-29T15:51:03Z

I think you can now review this PR if you have time @kevinjqliu :)
The biggest issue for now is that testing is only possible against AWS itself, since moto does not support the s3tables API yet. I created an issue on the moto side but have not had the time to implement it myself: getmoto/moto#8422.

I currently run tests by setting the ARN env variable to that of an s3 table bucket I created within my personal AWS account:

https://github.com/felixscherz/iceberg-python/blob/feat/s3tables-catalog/tests/catalog/test_s3tables.py#L24-L31

kevinjqliu

Thanks for the PR, I added a few comments to clarify the catalog behaviors.

I'm a little hesitant to merge this in given that we have to run tests against a production S3 endpoint. Maybe we can mock the endpoint?

kevinjqliu · 2025-01-04T00:20:16Z

I ran the tests locally with ARN=arn:aws:s3tables:us-east-2:... poetry run pytest tests/catalog/test_s3tables.py and had to manually add s3tables.region to the catalog config:

    properties = {"s3tables.table-bucket-arn": table_bucket_arn, "s3tables.region": "us-east-2", "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}

And these 3 tests failed, everything else is ✅

FAILED tests/catalog/test_s3tables.py::test_s3tables_api_raises_on_conflicting_version_tokens - botocore.exceptions.NoRegionError: You must specify a region.
FAILED tests/catalog/test_s3tables.py::test_s3tables_api_raises_on_preexisting_table - botocore.exceptions.NoRegionError: You must specify a region.
FAILED tests/catalog/test_s3tables.py::test_creating_catalog_validates_s3_table_bucket_exists - botocore.exceptions.NoRegionError: You must specify a region.
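For anyone reproducing this, the properties shown above could be assembled by a small helper that fails early when no region is given, rather than surfacing later as a NoRegionError. The function name `s3tables_properties` is hypothetical; the property keys are the ones from the snippet above.

```python
from typing import Dict


def s3tables_properties(table_bucket_arn: str, region: str) -> Dict[str, str]:
    """Build the catalog properties used in the local test run above."""
    if not region:
        # Fail before any client is created instead of deep inside botocore.
        raise ValueError("a region is required; none was provided")
    return {
        "s3tables.table-bucket-arn": table_bucket_arn,
        "s3tables.region": region,
        "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO",
    }
```

The resulting dict can be passed as keyword arguments to the catalog constructor, which takes a name followed by **properties.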

felixscherz · 2025-01-05T16:29:23Z

Thank you for the review!

I removed tests related to boto3 and set the AWS region explicitly for the test run.
I agree with you that we should not merge this as long as the only option for running tests is to run them against a live AWS account. I'm currently working on supporting s3tables with moto: getmoto/moto#8470.
Can we hold off on this PR until moto supports s3tables, so we can run tests against a mock endpoint?

kevinjqliu

Added a few more comments.

I was able to run the test locally

AWS_REGION=us-east-2 ARN=... poetry run pytest tests/catalog/test_s3tables.py

after making a few local changes:

poetry update boto3
add aws_region fixture
pass aws_region to catalog

Could you update the PR description so others can test this PR out?

HonahX

@felixscherz Thanks for the great contribution! Looking forward to adding this to PyIceberg! I left some comments. Please let me know what you think.

HonahX · 2025-01-06T20:03:13Z

pyiceberg/catalog/s3tables.py

+    def commit_table(
+        self, table: Table, requirements: Tuple[TableRequirement, ...], updates: Tuple[TableUpdate, ...]
+    ) -> CommitTableResponse:


I did not find the logic for the case where the table does not exist, which means create_table_transaction will not be supported in the current version.

iceberg-python/pyiceberg/catalog/__init__.py

Lines 754 to 765 in e41c428

    def create_table_transaction(
        self,
        identifier: Union[str, Identifier],
        schema: Union[Schema, "pa.Schema"],
        location: Optional[str] = None,
        partition_spec: PartitionSpec = UNPARTITIONED_PARTITION_SPEC,
        sort_order: SortOrder = UNSORTED_SORT_ORDER,
        properties: Properties = EMPTY_DICT,
    ) -> CreateTableTransaction:
        return CreateTableTransaction(
            self._create_staged_table(identifier, schema, location, partition_spec, sort_order, properties)
        )

We do not have to support everything in the initial PR, but it would be good to override create_table_transaction as "Not Implemented" for s3tables.

felixscherz · 2025-01-09

I added exceptions for this case for now, along with a test. I will look into how to implement this properly.
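The suggestion in this thread amounts to overriding the inherited method so that it fails loudly instead of running a code path the catalog cannot complete. A minimal sketch, using hypothetical stand-in classes (`BaseCatalog`, `S3TablesLikeCatalog`) and a simplified signature rather than the real PyIceberg types:

```python
class BaseCatalog:
    """Hypothetical stand-in for a generic catalog base class."""

    def create_table_transaction(self, identifier: str) -> str:
        # The real method returns a transaction object; a string keeps the sketch small.
        return f"staged:{identifier}"


class S3TablesLikeCatalog(BaseCatalog):
    """Hypothetical subclass that opts out of staged table creation for now."""

    def create_table_transaction(self, identifier: str) -> str:
        raise NotImplementedError("create_table_transaction is not supported by this catalog yet")
```

Callers then get an explicit NotImplementedError at the call site rather than a confusing failure at commit time.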

HonahX · 2025-01-07T03:52:36Z

pyiceberg/catalog/s3tables.py

+        try:
+            self.s3tables.create_table(
+                tableBucketARN=self.table_bucket_arn, namespace=namespace, name=table_name, format="ICEBERG"
+            )


If anything goes wrong after this point, I think we should clean up the created s3 table by s3tables' delete_table endpoint.

felixscherz · 2025-01-07

I added a try/except to delete the s3 table in case something goes wrong with writing the initial metadata.
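The cleanup pattern discussed here can be sketched roughly as follows. `FakeS3Tables` is a hypothetical in-memory stand-in with simplified method signatures, not the boto3 client, and `create_with_cleanup` is an illustrative name rather than a function from this PR.

```python
from typing import Callable, List, Tuple


class FakeS3Tables:
    """Hypothetical in-memory stand-in for the s3tables client."""

    def __init__(self) -> None:
        self.tables: List[Tuple[str, str]] = []

    def create_table(self, namespace: str, name: str) -> None:
        self.tables.append((namespace, name))

    def delete_table(self, namespace: str, name: str) -> None:
        self.tables.remove((namespace, name))


def create_with_cleanup(
    client: FakeS3Tables, namespace: str, name: str, write_metadata: Callable[[], None]
) -> None:
    """Register the table, then write metadata; undo the registration on failure."""
    client.create_table(namespace, name)
    try:
        write_metadata()
    except Exception:
        # Roll back the registration, then re-raise the original error unchanged.
        client.delete_table(namespace, name)
        raise
```

Re-raising with a bare `raise` keeps the original exception and traceback, so the caller still sees why the metadata write failed.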

kevinjqliu · 2025-01-08T04:06:02Z

can you run poetry lock --no-update for CI?

kevinjqliu · 2025-02-01T00:22:36Z

@felixscherz could you rebase this against main?

I see that getmoto/moto#8470 is now merged, thanks for driving that!
Waiting for getmoto/moto@4f565fb to make it into the next release (5.0.28).

felixscherz · 2025-02-01T15:01:21Z

I rebased onto main. I prepared the unit tests using the new moto features and will commit them once the new moto release is available :)

felixscherz · 2025-02-03T08:07:51Z

@kevinjqliu moto==5.0.28 just got released, I added unit tests and documentation, could you have a look when you have time?

kevinjqliu · 2025-02-12T20:19:09Z

It's green! I'll review this so we can include it in the upcoming 0.9.0 release.

kevinjqliu

LGTM! Thanks for adding both the unit test and integration test. And for driving downstream dependency to add support for S3Tables!! (getmoto/moto#8470)

I pushed a few changes to resolve merge conflicts. And I verified the integration test locally

AWS_REGION=us-east-2 AWS_TEST_S3_TABLE_BUCKET_ARN=arn:aws:s3tables:us-east-2:033327485438:bucket/s3-table poetry run pytest tests/catalog/integration_test_s3tables.py

kevinjqliu · 2025-02-17T00:53:38Z

pyiceberg/catalog/__init__.py

-    def _write_metadata(metadata: TableMetadata, io: FileIO, metadata_path: str) -> None:
-        ToOutputFile.table_metadata(metadata, io.new_output(metadata_path))
+    def _write_metadata(metadata: TableMetadata, io: FileIO, metadata_path: str, overwrite: bool = False) -> None:
+        ToOutputFile.table_metadata(metadata, io.new_output(metadata_path), overwrite=overwrite)


👍 default is False

iceberg-python/pyiceberg/serializers.py

Line 123 in 300b840

def table_metadata(metadata: TableMetadata, output_file: OutputFile, overwrite: bool = False) -> None:
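To illustrate why a default of False is the safe choice for an overwrite flag, here is a standard-library analogy. It is not PyIceberg's ToOutputFile; `write_text` is a hypothetical helper that relies on Python's exclusive-create file mode.

```python
import os
import tempfile


def write_text(path: str, data: str, overwrite: bool = False) -> None:
    """Write data to path; refuse to replace an existing file unless overwrite is True."""
    # Mode "x" raises FileExistsError if the file exists; "w" truncates it.
    mode = "w" if overwrite else "x"
    with open(path, mode, encoding="utf-8") as f:
        f.write(data)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "metadata.json")
    write_text(target, "v1")
    try:
        write_text(target, "v2")  # default overwrite=False
        second_write_failed = False
    except FileExistsError:
        second_write_failed = True
    write_text(target, "v3", overwrite=True)
    with open(target, encoding="utf-8") as f:
        final_content = f.read()
```

With the default, an accidental second write to the same metadata path fails instead of silently replacing the file; callers that do want replacement have to say so.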

geruh · 2025-02-18T19:51:51Z

pyiceberg/catalog/s3tables.py

+        try:
+            self.s3tables = session.client("s3tables", endpoint_url=properties.get(S3TABLES_ENDPOINT))
+        except boto3.session.UnknownServiceError as e:
+            raise S3TablesError("'s3tables' requires boto3>=1.35.74. Current version: {boto3.__version__}.") from e


Need to make this an f string so that boto3 version can be interpolated
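The difference is easy to see in isolation. In this small sketch, `version` is a placeholder value rather than an actual boto3 lookup:

```python
version = "1.35.0"  # placeholder value, not read from boto3

# Without the f prefix, the braces are kept literally.
plain = "Current version: {version}."

# With the f prefix, the expression inside the braces is evaluated.
formatted = f"Current version: {version}."

print(plain)      # Current version: {version}.
print(formatted)  # Current version: 1.35.0.
```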

geruh · 2025-02-18T20:00:03Z

pyiceberg/catalog/s3tables.py

+        raise NotImplementedError("Namespace properties are read only.")
+
+    def purge_table(self, identifier: Union[str, Identifier]) -> None:
+        # purge is not supported as s3tables doesn't support delete operations


IIRC, the S3 Tables API only wants users to drop tables when they specify a purge, because they will lose access to their data:

https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-delete.html
Java impl: https://github.com/awslabs/s3-tables-catalog/blob/main/src/software/amazon/s3tables/iceberg/S3TablesCatalog.java#L439

geruh · 2025-02-18T20:34:27Z

pyiceberg/catalog/s3tables.py

+        namespace = self._validate_namespace_identifier(namespace)
+        paginator = self.s3tables.get_paginator("list_tables")
+        tables: List[Identifier] = []
+        for page in paginator.paginate(tableBucketARN=self.table_bucket_arn, namespace=namespace):


NIT: catch and throw the NoSuchNamespaceError to align with catalog exceptions
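One way to read this suggestion is to wrap the pagination loop and translate the service-level "not found" error into the catalog-level one. In the sketch below, `ServiceNotFound` is a hypothetical stand-in for the client's exception, `NoSuchNamespaceError` is redefined locally so the example runs on its own, and the page shape (`{"tables": [{"name": ...}]}`) is an assumption.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Page = Dict[str, List[Dict[str, str]]]


class ServiceNotFound(Exception):
    """Hypothetical stand-in for the service client's not-found error."""


class NoSuchNamespaceError(Exception):
    """Local stand-in for the catalog-level exception of the same name."""


def list_table_identifiers(
    paginate: Callable[[str], Iterable[Page]], namespace: str
) -> List[Tuple[str, str]]:
    """Collect (namespace, table) pairs across pages, translating not-found errors."""
    tables: List[Tuple[str, str]] = []
    try:
        for page in paginate(namespace):
            for entry in page["tables"]:
                tables.append((namespace, entry["name"]))
    except ServiceNotFound as e:
        # Chain the original error so the service response stays visible in the traceback.
        raise NoSuchNamespaceError(f"Namespace does not exist: {namespace}") from e
    return tables
```

The try block covers the loop itself because a lazy paginator may only raise once the first page is requested.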

geruh · 2025-02-18T20:35:02Z

pyiceberg/catalog/s3tables.py

+
+    def load_namespace_properties(self, namespace: Union[str, Identifier]) -> Properties:
+        namespace = self._validate_namespace_identifier(namespace)
+        response = self.s3tables.get_namespace(tableBucketARN=self.table_bucket_arn, namespace=namespace)


NIT: same as above, catch and throw the NoSuchNamespaceError to align with catalog exceptions

geruh · 2025-02-18T20:39:02Z

pyiceberg/catalog/s3tables.py

+    def __init__(self, name: str, **properties: str):
+        super().__init__(name, **properties)
+
+        self.table_bucket_arn = self.properties[S3TABLES_TABLE_BUCKET_ARN]


What do you think about placing an assertion on this property to require the user to set the table bucket arn? Similar to the java implementation.
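A check along these lines could look roughly like the following. The helper name `require_table_bucket_arn` is hypothetical, and the property key is the one used in the test configuration earlier in this thread; whether to raise ValueError or a catalog-specific error is left open.

```python
from typing import Mapping

# Key as used in the test configuration earlier in this thread.
S3TABLES_TABLE_BUCKET_ARN = "s3tables.table-bucket-arn"


def require_table_bucket_arn(properties: Mapping[str, str]) -> str:
    """Return the table bucket ARN, or raise a clear error when it is missing or empty."""
    arn = properties.get(S3TABLES_TABLE_BUCKET_ARN)
    if not arn:
        raise ValueError(f"Missing required catalog property: {S3TABLES_TABLE_BUCKET_ARN}")
    return arn
```

Compared with indexing the properties dict directly, this replaces a bare KeyError with a message that names the property the user needs to set.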

felixscherz mentioned this pull request Dec 14, 2024

Support for S3 catalog to work with S3 Tables #1404

Open

felixscherz marked this pull request as draft December 14, 2024 16:43

felixscherz marked this pull request as ready for review December 29, 2024 15:51

felixscherz changed the title ~~WIP: feat: support S3 Table Buckets with S3TablesCatalog~~ feat: support S3 Table Buckets with S3TablesCatalog Dec 29, 2024

kevinjqliu reviewed Jan 4, 2025

kevinjqliu reviewed Jan 5, 2025

felixscherz force-pushed the feat/s3tables-catalog branch from 398e2d7 to 05e4dfd Compare January 6, 2025 16:27

HonahX reviewed Jan 6, 2025

HonahX reviewed Jan 7, 2025

felixscherz force-pushed the feat/s3tables-catalog branch from 894cbc9 to 2e1c383 Compare January 8, 2025 08:08

kevinjqliu mentioned this pull request Jan 10, 2025

Add REST catalog integration tests #1439

Open

kevinjqliu added this to the PyIceberg 0.9.0 release milestone Feb 1, 2025

felixscherz force-pushed the feat/s3tables-catalog branch from 99e569b to 54b8e87 Compare February 1, 2025 14:46

kevinjqliu mentioned this pull request Feb 3, 2025

Build: Bump moto from 5.0.27 to 5.0.28 #1603

Merged

felixscherz force-pushed the feat/s3tables-catalog branch from f30b7e6 to 804a468 Compare February 12, 2025 07:33

felixscherz added 5 commits February 12, 2025 16:10

feat: initial setup for S3TablesCatalog

0c8fa1b

feat: support create_table using FsspecFileIO

1ca5e86

feat: implement drop_table

e659da1

feat: implement drop_namespace

9973d12

test: validate how version conflict is handled with s3tables API

3c36450

felixscherz and others added 15 commits February 12, 2025 16:10

Apply suggestions from code review

9c828e3

Co-authored-by: Honah J.

feat: add link to naming-rules for invalid name errors

517f31d

feat: delete s3 table if writing new_table_metadata is unsuccessful

83739d5

chore: run linter

1475f5b

test: rename test_s3tables.py -> integration_test_s3tables.py

9ceea4b

fix: add license to files

930cc3e

fix: raise error when creating a table during a transaction

73cf922

test: mark create_table_transaction test wiht xfail

bbc5706

feat: raise NotImplementedError for view_exists

bad0eb5

test: use moto server for s3tables tests

38c4e6f

docs: add s3tables catalog

937d6af

chore: bump moto library

cf03cba

test: set region when creating table bucket

af6bce7

test: mock aws credentials

6af3391

chore: update poetry lock with s3tables

018bf0b

felixscherz force-pushed the feat/s3tables-catalog branch from c27b60d to 018bf0b Compare February 12, 2025 15:18

kevinjqliu added 3 commits February 16, 2025 13:57

Merge branch 'main' into feat/s3tables-catalog

cbfbedd

use new locationprovider for metadata location

f738e59

fix test region

df97205

kevinjqliu requested changes Feb 17, 2025

kevinjqliu approved these changes Feb 17, 2025

kevinjqliu requested review from HonahX and Fokko February 17, 2025 01:13

kevinjqliu removed this from the PyIceberg 0.9.0 release milestone Feb 17, 2025

geruh reviewed Feb 18, 2025
