From c1aad9f9914223ae4d50511cd47f1089c6acf0c7 Mon Sep 17 00:00:00 2001
From: Richard Whaling
Date: Sun, 22 Sep 2024 09:12:06 -0500
Subject: [PATCH 1/2] wip: updating minio integration with working docker example; removed references to dynamodb from cloudflare/minio page

---
 docs/integrations/object-storage/s3-like.md | 57 +++++++++------------
 1 file changed, 25 insertions(+), 32 deletions(-)

diff --git a/docs/integrations/object-storage/s3-like.md b/docs/integrations/object-storage/s3-like.md
index 4d32f7c41b..8fcf80efda 100644
--- a/docs/integrations/object-storage/s3-like.md
+++ b/docs/integrations/object-storage/s3-like.md
@@ -1,6 +1,6 @@
 # CloudFlare R2 & Minio
 
-`delta-rs` offers native support for using Cloudflare R2 and Minio's as storage backend. R2 and Minio support conditional puts, however we have to pass this flag into the storage options. See the example blow
+`delta-rs` offers native support for using Cloudflare R2 or Minio as an S3-compatible storage backend. R2 and Minio support conditional puts, which removes the need for DynamoDB for safe concurrent writes. However, we have to pass this flag into the storage options. See the example below.
 
 You don’t need to install any extra dependencies to red/write Delta tables to S3 with engines that use `delta-rs`. You do need to configure your AWS access credentials correctly.
 
@@ -43,41 +43,34 @@ storage_options = {
     )
 ```
 
-## Delta Lake on S3: Safe Concurrent Writes
 
-You need a locking provider to ensure safe concurrent writes when writing Delta tables to S3. This is because S3 does not guarantee mutual exclusion.
+## Minio and Docker
 
-A locking provider guarantees that only one writer is able to create the same file. This prevents corrupted or conflicting data.
+Minio is straightforward to host locally with Docker and docker-compose, via the following `docker-compose.yml` file - just run `docker-compose up`:
 
-`delta-rs` uses DynamoDB to guarantee safe concurrent writes.
+```yaml
+version: '3.8'
 
-Run the code below in your terminal to create a DynamoDB table that will act as your locking provider.
-
-```
-    aws dynamodb create-table \
-    --table-name delta_log \
-    --attribute-definitions AttributeName=tablePath,AttributeType=S AttributeName=fileName,AttributeType=S \
-    --key-schema AttributeName=tablePath,KeyType=HASH AttributeName=fileName,KeyType=RANGE \
-    --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5
+services:
+  minio:
+    image: minio/minio
+    ports:
+      - "9000:9000"
+      - "9001:9001"
+    environment:
+      MINIO_ROOT_USER: ...
+      MINIO_ROOT_PASSWORD: ...
+    command: server /data --console-address ":9001"
 ```
 
-If for some reason you don't want to use DynamoDB as your locking mechanism you can choose to set the `AWS_S3_ALLOW_UNSAFE_RENAME` variable to `true` in order to enable S3 unsafe writes.
-
-Read more in the [Usage](../../usage/writing/writing-to-s3-with-locking-provider.md) section.
-
-## Delta Lake on S3: Required permissions
-
-You need to have permissions to get, put and delete objects in the S3 bucket you're storing your data in. Please note that you must be allowed to delete objects even if you're just appending to the Delta Lake, because there are temporary files into the log folder that are deleted after usage.
-
-In AWS S3, you will need the following permissions:
+With this configuration, Minio will host its S3-compatible API over HTTP, not HTTPS, on port 9000. This requires an additional flag in `storage_options`, `AWS_ALLOW_HTTP`, to be set to `true`:
 
-- s3:GetObject
-- s3:PutObject
-- s3:DeleteObject
-
-In DynamoDB, you will need the following permissions:
-
-- dynamodb:GetItem
-- dynamodb:Query
-- dynamodb:PutItem
-- dynamodb:UpdateItem
+```python
+storage_options = {
+    "AWS_ACCESS_KEY_ID": ...,
+    "AWS_SECRET_ACCESS_KEY": ...,
+    "AWS_ENDPOINT_URL": "http://localhost:9000",
+    "AWS_ALLOW_HTTP": "true",
+    "AWS_S3_ALLOW_UNSAFE_RENAME": "true"
+}
+```

From 2674cdaa20c43b94d150c0ced8a732f9081e110e Mon Sep 17 00:00:00 2001
From: Richard Whaling
Date: Sun, 22 Sep 2024 14:57:09 -0500
Subject: [PATCH 2/2] updated to use correct conditional put env var

---
 docs/integrations/object-storage/s3-like.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/integrations/object-storage/s3-like.md b/docs/integrations/object-storage/s3-like.md
index 8fcf80efda..6e85340e0a 100644
--- a/docs/integrations/object-storage/s3-like.md
+++ b/docs/integrations/object-storage/s3-like.md
@@ -30,7 +30,7 @@ Follow the steps below to use Delta Lake on S3 (R2/Minio) with Polars:
 ```python
 storage_options = {
     'AWS_SECRET_ACCESS_KEY': ,
-    'conditional_put': 'etag', # Here we say to use conditional put, this provides safe concurrency.
+    'aws_conditional_put': 'etag', # Here we say to use conditional put, this provides safe concurrency.
 }
 ```
 
@@ -67,10 +67,10 @@ With this configuration, Minio will host its S3-compatible API over HTTP, not HT
 
 ```python
 storage_options = {
-    "AWS_ACCESS_KEY_ID": ...,
-    "AWS_SECRET_ACCESS_KEY": ...,
-    "AWS_ENDPOINT_URL": "http://localhost:9000",
-    "AWS_ALLOW_HTTP": "true",
-    "AWS_S3_ALLOW_UNSAFE_RENAME": "true"
+    'AWS_ACCESS_KEY_ID': ...,
+    'AWS_SECRET_ACCESS_KEY': ...,
+    'AWS_ENDPOINT_URL': 'http://localhost:9000',
+    'AWS_ALLOW_HTTP': 'true',
+    'aws_conditional_put': 'etag'
 }
 ```
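
To exercise the final `storage_options` from this series against the local Minio container, something like the following sketch could be used. It is not part of the patch: `minio_storage_options` and `write_and_read_back` are hypothetical helper names, the rule of adding `AWS_ALLOW_HTTP` only for `http://` endpoints is an illustrative assumption, and the target bucket is assumed to already exist in Minio.

```python
def minio_storage_options(access_key, secret_key, endpoint="http://localhost:9000"):
    """Hypothetical helper: build the storage options shown in the patch."""
    options = {
        "AWS_ACCESS_KEY_ID": access_key,
        "AWS_SECRET_ACCESS_KEY": secret_key,
        "AWS_ENDPOINT_URL": endpoint,
        # Conditional put key as corrected in PATCH 2/2.
        "aws_conditional_put": "etag",
    }
    # Assumption for this sketch: only plain-HTTP endpoints need the flag.
    if endpoint.startswith("http://"):
        options["AWS_ALLOW_HTTP"] = "true"
    return options


def write_and_read_back(table_uri, storage_options):
    """Append a tiny DataFrame to a Delta table and read the table back."""
    # Imported lazily so the helper above works without these packages.
    import pandas as pd
    from deltalake import DeltaTable, write_deltalake

    df = pd.DataFrame({"id": [1, 2, 3]})
    write_deltalake(table_uri, df, mode="append", storage_options=storage_options)
    return DeltaTable(table_uri, storage_options=storage_options).to_pandas()
```

With the container from the `docker-compose.yml` above running, a call such as `write_and_read_back("s3://delta-demo/events", minio_storage_options(user, password))` would write to a bucket named `delta-demo`; that bucket name is made up for illustration, and `user` and `password` stand for whatever was set as `MINIO_ROOT_USER` and `MINIO_ROOT_PASSWORD`.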