feat: support S3 Table Buckets with S3TablesCatalog #1429

Open
wants to merge 66 commits into
base: main
0c8fa1b
feat: initial setup for S3TablesCatalog
felixscherz Dec 14, 2024
1ca5e86
feat: support create_table using FsspecFileIO
felixscherz Dec 14, 2024
e659da1
feat: implement drop_table
felixscherz Dec 14, 2024
9973d12
feat: implement drop_namespace
felixscherz Dec 14, 2024
3c36450
test: validate how version conflict is handled with s3tables API
felixscherz Dec 15, 2024
c876a37
feat: implement commit_table
felixscherz Dec 15, 2024
43deab4
feat: implement table_exists
felixscherz Dec 18, 2024
907ed28
feat: implement list_tables
felixscherz Dec 18, 2024
4438c8e
refactor: improve list_namespace
felixscherz Dec 18, 2024
3499643
fix: return Identifier from list_tables
felixscherz Dec 18, 2024
fef4e69
feat: implement rename table
felixscherz Dec 21, 2024
85d9bf5
feat: implement load_namespace_properties
felixscherz Dec 21, 2024
1360ead
refactor: move some methods around
felixscherz Dec 21, 2024
6951076
feat: raise NotImplementedError for views functionality
felixscherz Dec 21, 2024
28ef842
feat: raise NotImplementedError for purge_table
felixscherz Dec 21, 2024
ff168f2
feat: raise NotImplementedError for update_namespace_properties
felixscherz Dec 21, 2024
cf244bd
feat: raise NotImplementedError for register_table
felixscherz Dec 21, 2024
f1de32b
fix: don't override create_table_transaction
felixscherz Dec 21, 2024
9fbc402
chore: run formatter
felixscherz Dec 21, 2024
4639e1e
feat: raise exceptions if boto3 doesn't support s3tables
felixscherz Dec 23, 2024
6b4d258
feat: make endpoint configurable
felixscherz Dec 23, 2024
ed0cba6
feat: explicitly configure tableBucketARN
felixscherz Dec 23, 2024
49355e2
fix: remove defaulting to FsspecIO
felixscherz Dec 23, 2024
61c4e9c
feat: raise exceptions for invalid namespace/table name
felixscherz Dec 23, 2024
a164e77
feat: improve error handling for create_table
felixscherz Dec 29, 2024
37bd609
feat: improve error handling for delete_table
felixscherz Dec 29, 2024
63f8525
chore: cleanup comments
felixscherz Dec 29, 2024
da91a11
feat: catch missing metadata for load_table
felixscherz Dec 29, 2024
bd5be82
feat: handle missing namespace and preexisting table
felixscherz Dec 29, 2024
c15ffdb
feat: handle versionToken and table in an atomic operation
felixscherz Dec 29, 2024
dceb55d
chore: run formatter
felixscherz Dec 29, 2024
3114424
chore: add type hints for tests
felixscherz Dec 29, 2024
1dda96d
fix: no longer enforce FsspecFileIO
felixscherz Jan 4, 2025
99d272d
test: remove tests for boto3 behavior
felixscherz Jan 4, 2025
c060ad9
test: verify column was created on commit
felixscherz Jan 4, 2025
da6516b
test: verify new data can be committed to table
felixscherz Jan 4, 2025
9f890c2
docs: update documentation for create_table
felixscherz Jan 5, 2025
ee93da2
test: set AWS regions explicitly
felixscherz Jan 5, 2025
0952b55
Apply suggestions from code review
felixscherz Jan 6, 2025
27414e1
test: commit new data to table
felixscherz Jan 6, 2025
2a8c5c4
feat: clarify update_namespace_properties error
felixscherz Jan 6, 2025
80884a6
feat: raise error when setting custom namespace properties
felixscherz Jan 6, 2025
a6a112f
refactor: change S3TableCatalog -> S3TablesCatalog
felixscherz Jan 7, 2025
662b5ea
feat: raise error on specified table location
felixscherz Jan 7, 2025
1cb6f68
feat: return empty list when querying a hierarchical namespace
felixscherz Jan 7, 2025
c110d71
refactor: use get_table_metadata_location instead of get_table
felixscherz Jan 7, 2025
3d2f749
refactor: extract 'ICEBERG' table format into constant
felixscherz Jan 7, 2025
44d7a1f
feat: change s3tables.table-bucket-arn -> s3tables.warehouse
felixscherz Jan 7, 2025
9c828e3
Apply suggestions from code review
felixscherz Jan 7, 2025
517f31d
feat: add link to naming-rules for invalid name errors
felixscherz Jan 7, 2025
83739d5
feat: delete s3 table if writing new_table_metadata is unsuccessful
felixscherz Jan 7, 2025
1475f5b
chore: run linter
felixscherz Jan 7, 2025
9ceea4b
test: rename test_s3tables.py -> integration_test_s3tables.py
felixscherz Jan 7, 2025
930cc3e
fix: add license to files
felixscherz Jan 8, 2025
73cf922
fix: raise error when creating a table during a transaction
felixscherz Jan 9, 2025
bbc5706
test: mark create_table_transaction test wiht xfail
felixscherz Jan 9, 2025
bad0eb5
feat: raise NotImplementedError for view_exists
felixscherz Feb 1, 2025
38c4e6f
test: use moto server for s3tables tests
felixscherz Feb 2, 2025
937d6af
docs: add s3tables catalog
felixscherz Feb 2, 2025
cf03cba
chore: bump moto library
felixscherz Feb 3, 2025
af6bce7
test: set region when creating table bucket
felixscherz Feb 12, 2025
6af3391
test: mock aws credentials
felixscherz Feb 12, 2025
018bf0b
chore: update poetry lock with s3tables
felixscherz Feb 12, 2025
cbfbedd
Merge branch 'main' into feat/s3tables-catalog
kevinjqliu Feb 16, 2025
f738e59
use new locationprovider for metadata location
kevinjqliu Feb 17, 2025
df97205
fix test region
kevinjqliu Feb 17, 2025
98 changes: 98 additions & 0 deletions mkdocs/docs/configuration.md
@@ -553,6 +553,104 @@ catalog:

<!-- prettier-ignore-end -->

### S3Tables Catalog

The S3Tables Catalog leverages the catalog functionalities of the Amazon S3Tables service and requires an existing S3 Tables Bucket to operate.

To use Amazon S3Tables as your catalog, you can configure pyiceberg using one of the following methods. Additionally, refer to the [AWS documentation](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) on configuring credentials to set up your AWS account credentials locally.

If you intend to use the same credentials for both the S3Tables Catalog and S3 FileIO, you can configure the [`client.*` properties](configuration.md#unified-aws-credentials) to streamline the process.

Note that the S3Tables Catalog manages the underlying table locations internally, which makes it incompatible with S3-like storage systems such as MinIO. If you specify the `s3tables.endpoint`, ensure that the `s3.endpoint` is configured accordingly.

```yaml
catalog:
  default:
    type: s3tables
    warehouse: arn:aws:s3tables:us-east-1:012345678901:bucket/pyiceberg-catalog
```
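As a rough illustration of what the `warehouse` value encodes, the sketch below splits a table bucket ARN into region, account ID, and bucket name. The helper `parse_table_bucket_arn` and the `TableBucketArn` type are hypothetical names used only for this example; they are not part of pyiceberg, and the catalog may validate the ARN differently.

```python
from typing import NamedTuple


class TableBucketArn(NamedTuple):
    region: str
    account_id: str
    bucket_name: str


def parse_table_bucket_arn(arn: str) -> TableBucketArn:
    """Split an S3 Tables bucket ARN into its parts (hypothetical helper, not pyiceberg API)."""
    # Expected shape: arn:<partition>:s3tables:<region>:<account-id>:bucket/<name>
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "s3tables":
        raise ValueError(f"Not an S3 Tables ARN: {arn!r}")
    resource_type, _, bucket_name = parts[5].partition("/")
    if resource_type != "bucket" or not bucket_name:
        raise ValueError(f"Not a table bucket ARN: {arn!r}")
    return TableBucketArn(region=parts[3], account_id=parts[4], bucket_name=bucket_name)


parsed = parse_table_bucket_arn("arn:aws:s3tables:us-east-1:012345678901:bucket/pyiceberg-catalog")
print(parsed.region, parsed.account_id, parsed.bucket_name)
```

A check like this can be handy for confirming that the region embedded in the ARN matches the region configured for the client before any request is made.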

If you prefer to pass the credentials explicitly to the client instead of relying on environment variables, set them in the catalog configuration:

```yaml
catalog:
  default:
    type: s3tables
    s3tables.access-key-id: <ACCESS_KEY_ID>
    s3tables.secret-access-key: <SECRET_ACCESS_KEY>
    s3tables.session-token: <SESSION_TOKEN>
    s3tables.region: <REGION_NAME>
    s3tables.endpoint: http://localhost:9000
    s3.endpoint: http://localhost:9000
```

<!-- prettier-ignore-start -->

!!! Note "Client-specific Properties"
    `s3tables.*` properties are for S3TablesCatalog only. If you want to use the same credentials for both S3TablesCatalog and S3 FileIO, you can set the `client.*` properties. See the [Unified AWS Credentials](configuration.md#unified-aws-credentials) section for more details.

<!-- prettier-ignore-end -->
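The relationship between `s3tables.*` and `client.*` keys can be pictured as a simple fallback lookup. The sketch below is a stand-in written for illustration, assuming that a service-specific key takes precedence over its `client.*` counterpart; `first_property` is a hypothetical helper and the actual resolution logic in the catalog may differ.

```python
from typing import Dict, Optional


def first_property(properties: Dict[str, str], *keys: str) -> Optional[str]:
    """Return the value of the first key that is set (hypothetical helper for illustration)."""
    for key in keys:
        value = properties.get(key)
        if value is not None:
            return value
    return None


properties = {
    "client.region": "us-east-1",
    "client.access-key-id": "shared-key",
    "s3tables.region": "us-west-2",
}

# Assumed precedence: the s3tables.* key wins when both are present.
region = first_property(properties, "s3tables.region", "client.region")
# Only the client.* key is set here, so it is used as the fallback.
access_key = first_property(properties, "s3tables.access-key-id", "client.access-key-id")
print(region, access_key)
```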

<!-- markdown-link-check-disable -->

| Key | Example | Description |
| -------------------------- | ------------------- | -------------------------------------------------------------------------- |
| s3tables.profile-name | default | Configure the static profile used to access the S3Tables Catalog |
| s3tables.region | us-east-1 | Set the region of the S3Tables Catalog |
| s3tables.access-key-id | admin | Configure the static access key id used to access the S3Tables Catalog |
| s3tables.secret-access-key | password | Configure the static secret access key used to access the S3Tables Catalog |
| s3tables.session-token | AQoDYXdzEJr... | Configure the static session token used to access the S3Tables Catalog |
| s3tables.endpoint | <http://localhost>... | Configure the AWS endpoint |
| s3tables.warehouse | arn:aws:s3tables... | Set the underlying S3 Table Bucket |

<!-- markdown-link-check-enable-->

<!-- prettier-ignore-start -->

!!! warning "Removed Properties"
    The properties `profile_name`, `region_name`, `aws_access_key_id`, `aws_secret_access_key`, and `aws_session_token` were deprecated and removed in 0.8.0

<!-- prettier-ignore-end -->

An example usage of the S3Tables Catalog is shown below:

```python
from pyiceberg.catalog.s3tables import S3TablesCatalog
import pyarrow as pa


table_bucket_arn: str = "..."
aws_region: str = "..."

properties = {"s3tables.warehouse": table_bucket_arn, "s3tables.region": aws_region}
catalog = S3TablesCatalog(name="s3tables_catalog", **properties)

database_name = "prod"

catalog.create_namespace(namespace=database_name)

pyarrow_table = pa.Table.from_arrays(
    [
        pa.array([None, "A", "B", "C"]),
        pa.array([1, 2, 3, 4]),
        pa.array([True, None, False, True]),
        pa.array([None, "A", "B", "C"]),
    ],
    schema=pa.schema(
        [
            pa.field("foo", pa.large_string(), nullable=True),
            pa.field("bar", pa.int32(), nullable=False),
            pa.field("baz", pa.bool_(), nullable=True),
            pa.field("large", pa.large_string(), nullable=True),
        ]
    ),
)

identifier = (database_name, "orders")
table = catalog.create_table(identifier=identifier, schema=pyarrow_table.schema)
table.append(pyarrow_table)
```

### Custom Catalog Implementations

If you want to load any custom catalog implementation, you can set catalog configurations like the following:
11 changes: 6 additions & 5 deletions poetry.lock

Some generated files are not rendered by default.

15 changes: 13 additions & 2 deletions pyiceberg/catalog/__init__.py
@@ -120,6 +120,7 @@ class CatalogType(Enum):
    DYNAMODB = "dynamodb"
    SQL = "sql"
    IN_MEMORY = "in-memory"
    S3TABLES = "s3tables"


def load_rest(name: str, conf: Properties) -> Catalog:
@@ -175,13 +176,23 @@ def load_in_memory(name: str, conf: Properties) -> Catalog:
        raise NotInstalledError("SQLAlchemy support not installed: pip install 'pyiceberg[sql-sqlite]'") from exc


def load_s3tables(name: str, conf: Properties) -> Catalog:
    try:
        from pyiceberg.catalog.s3tables import S3TablesCatalog

        return S3TablesCatalog(name, **conf)
    except ImportError as exc:
        raise NotInstalledError("AWS S3Tables support not installed: pip install 'pyiceberg[s3tables]'") from exc


AVAILABLE_CATALOGS: dict[CatalogType, Callable[[str, Properties], Catalog]] = {
    CatalogType.REST: load_rest,
    CatalogType.HIVE: load_hive,
    CatalogType.GLUE: load_glue,
    CatalogType.DYNAMODB: load_dynamodb,
    CatalogType.SQL: load_sql,
    CatalogType.IN_MEMORY: load_in_memory,
    CatalogType.S3TABLES: load_s3tables,
}
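To show how a `type: s3tables` entry in the configuration could reach `load_s3tables` through this mapping, here is a simplified, self-contained sketch of the dispatch pattern. The loaders return plain strings instead of real catalogs, and the `load` function is a hypothetical stand-in; it is not the library's actual entry point.

```python
from enum import Enum
from typing import Callable, Dict

Properties = Dict[str, str]


class CatalogType(Enum):
    SQL = "sql"
    S3TABLES = "s3tables"


def load_sql(name: str, conf: Properties) -> str:
    # Stand-in loader: a real loader would construct a catalog object.
    return f"SqlCatalog({name})"


def load_s3tables(name: str, conf: Properties) -> str:
    # Stand-in loader: a real loader would import the backend lazily.
    return f"S3TablesCatalog({name}, warehouse={conf.get('warehouse')})"


AVAILABLE_CATALOGS: Dict[CatalogType, Callable[[str, Properties], str]] = {
    CatalogType.SQL: load_sql,
    CatalogType.S3TABLES: load_s3tables,
}


def load(name: str, conf: Properties) -> str:
    """Resolve the configured type to a loader (hypothetical helper for illustration)."""
    catalog_type = CatalogType(conf["type"].lower())  # raises ValueError for unknown types
    return AVAILABLE_CATALOGS[catalog_type](name, conf)


result = load("default", {"type": "s3tables", "warehouse": "arn:aws:s3tables:..."})
print(result)
```

Keeping the loaders behind an enum-keyed mapping means adding a catalog only requires one new enum member, one loader function, and one dictionary entry, which is the shape of the change in this hunk.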


@@ -945,8 +956,8 @@ def _get_default_warehouse_location(self, database_name: str, table_name: str) -
         raise ValueError("No default path is set, please specify a location when creating a table")

     @staticmethod
-    def _write_metadata(metadata: TableMetadata, io: FileIO, metadata_path: str) -> None:
-        ToOutputFile.table_metadata(metadata, io.new_output(metadata_path))
+    def _write_metadata(metadata: TableMetadata, io: FileIO, metadata_path: str, overwrite: bool = False) -> None:
+        ToOutputFile.table_metadata(metadata, io.new_output(metadata_path), overwrite=overwrite)

👍 default is False

def table_metadata(metadata: TableMetadata, output_file: OutputFile, overwrite: bool = False) -> None:
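The effect of defaulting `overwrite` to `False` can be sketched with plain local files. This is an analogy under stated assumptions, not the library's implementation: `write_metadata` is a hypothetical function, and it assumes that a non-overwriting write fails when the target already exists, which is how Python's exclusive-create mode behaves.

```python
import json
import os
import tempfile


def write_metadata(metadata: dict, path: str, overwrite: bool = False) -> None:
    """Write metadata as JSON; refuse to replace an existing file unless overwrite=True."""
    mode = "w" if overwrite else "x"  # "x" raises FileExistsError if the file exists
    with open(path, mode, encoding="utf-8") as f:
        json.dump(metadata, f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "00000-metadata.json")
    write_metadata({"format-version": 2}, path)

    # A second write without overwrite is rejected, so existing callers keep the old behavior.
    try:
        write_metadata({"format-version": 2}, path)
        second_write_failed = False
    except FileExistsError:
        second_write_failed = True

    # Opting in replaces the file.
    write_metadata({"format-version": 2, "note": "rewritten"}, path, overwrite=True)
    with open(path, encoding="utf-8") as f:
        final = json.load(f)

print(second_write_failed, final)
```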


    @staticmethod
    def _parse_metadata_version(metadata_location: str) -> int: