-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ibis): introduce S3 File connector #1038
Conversation
WalkthroughThis pull request introduces support for S3 file data sources in the Ibis server. Key changes include the addition of new environment variables for AWS credentials in the CI workflow, modifications to the Changes
Possibly related PRs
Suggested Labels
Suggested Reviewers
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (1)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (5)
ibis-server/app/model/metadata/object_storage.py (1)
41-41
: Ensure robust path handling in_get_full_path
methodThe
_get_full_path
method is crucial for correctly formatting paths for object storage. Ensure that it handles edge cases, such as multiple leading slashes or empty paths, to prevent errors when accessing files.Suggested Improvement:
Consider normalizing the path to handle leading slashes and ensure consistency:
import posixpath def _get_full_path(self, path): normalized_path = posixpath.normpath(path.lstrip('/')) return f"{normalized_path}"This implementation helps maintain consistent path formatting across different storage systems.
ibis-server/app/model/__init__.py (1)
154-163
: Consider marking URL as SecretStr for enhanced security.While AWS credentials are properly secured using SecretStr, the URL field might contain sensitive path information that should also be protected.
- url: SecretStr = Field(description="the root path of the s3 bucket", default="/") + url: SecretStr = Field(description="the root path of the s3 bucket", default=SecretStr("/"))ibis-server/tests/routers/v2/connector/test_s3_file.py (3)
100-167
: Consider adding error case tests and using test data fixtures.While the happy path tests are well-implemented, consider:
- Adding tests for error cases (e.g., invalid SQL, missing permissions)
- Moving test data to fixtures for better maintainability
194-243
: Consider adding error cases for metadata retrieval.While the happy path is well-tested, consider adding tests for:
- Invalid bucket/path scenarios
- Permission denied cases
- Malformed metadata responses
245-261
: LGTM! Consider adding more format-related edge cases.The basic unsupported format test is good. Consider adding tests for:
- Empty format string
- Mixed case formats (e.g., "JSON" vs "json")
- Format with leading/trailing spaces
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
.github/workflows/ibis-ci.yml
(1 hunks)ibis-server/app/mdl/rewriter.py
(1 hunks)ibis-server/app/model/__init__.py
(3 hunks)ibis-server/app/model/connector.py
(3 hunks)ibis-server/app/model/data_source.py
(3 hunks)ibis-server/app/model/metadata/factory.py
(2 hunks)ibis-server/app/model/metadata/object_storage.py
(3 hunks)ibis-server/app/model/utils.py
(1 hunks)ibis-server/pyproject.toml
(1 hunks)ibis-server/tests/routers/v2/connector/test_s3_file.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci
🔇 Additional comments (15)
ibis-server/app/model/metadata/object_storage.py (2)
153-158
: Abstract methods enhance extensibility for different storage typesThe introduction of
_get_dal_operator
and_get_full_path
methods inObjectStorageMetadata
allows for easier extension to support additional storage types. This design promotes better code organization and reusability.
168-196
: Verify S3 configuration and secret handling inS3FileMetadata
The
S3FileMetadata
class correctly overrides methods to handle S3 connections. However, please verify the following:
- All required S3 parameters (
root
,bucket
,region
,access_key_id
, andsecret_access_key
) are correctly passed toopendal.Operator
.- Secret values are securely handled and not exposed in logs. Ensure that logging statements do not include sensitive information.
- Path formatting in
_get_full_path
correctly constructs the S3 URI, handling edge cases like empty paths or unexpected characters.Consider adding unit tests for these scenarios to ensure robustness.
ibis-server/app/model/metadata/factory.py (2)
8-8
: AddingS3FileMetadata
to imports for metadata mappingThe import statement appropriately includes
S3FileMetadata
, preparing it for use in the metadata factory mapping.
23-23
: Includes3_file
in metadata mapping to support S3 data sourcesBy adding
DataSource.s3_file
to themapping
dictionary, the system can now handle S3 file data sources, enhancing the versatility of the metadata factory.ibis-server/app/mdl/rewriter.py (1)
75-76
: LGTM! Consistent handling of S3 files with local files.The change appropriately extends DuckDB dialect support to S3 files, maintaining consistency with local file handling.
ibis-server/app/model/__init__.py (1)
59-61
: LGTM! Clean DTO implementation.The QueryS3FileDTO cleanly extends QueryDTO with S3-specific connection info.
ibis-server/app/model/connector.py (1)
41-42
: LGTM! Proper integration of S3 support.The S3 file handling is correctly integrated using DuckDBConnector.
ibis-server/app/model/data_source.py (1)
48-48
: LGTM! Consistent enum implementation.The S3 file data source is properly integrated into both DataSource and DataSourceExtension enums.
Also applies to: 73-73
ibis-server/tests/routers/v2/connector/test_s3_file.py (5)
1-80
: LGTM! Well-structured manifest with proper security practices.The environment variables are correctly used for AWS credentials, and the manifest is well-organized with clear model definitions and relationships.
83-98
: LGTM! Well-designed test fixtures.The fixtures provide clean separation of concerns and include all necessary parameters for S3 connectivity.
169-192
: LGTM! Comprehensive dry run test coverage.Tests cover both successful and error scenarios for dry runs with appropriate assertions.
263-331
: LGTM! Excellent type mapping coverage for Parquet files.The tests thoroughly validate data type mappings and column structures for Parquet files.
333-473
: LGTM! Comprehensive format-specific type handling.The tests effectively cover format-specific type mappings and edge cases, with good documentation of special behaviors (e.g., invalid CSV handling).
.github/workflows/ibis-ci.yml (1)
70-73
: LGTM! Secure handling of AWS credentials.AWS credentials are properly managed using GitHub secrets and consistently integrated with the existing environment variable pattern.
ibis-server/pyproject.toml (1)
67-67
: LGTM! Consistent test marker addition.The s3_file marker follows the established pattern for test categorization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
ibis-server/app/model/utils.py (1)
17-22
: Enhance error handling with more descriptive messages and logging.The current error handling could be more informative to help with troubleshooting. Consider:
- Including more context in error messages
- Adding debug logging
+from loguru import logger + def init_duckdb_s3( connection: DuckDBPyConnection, connection_info: S3FileConnectionInfo ): + logger.debug("Initializing DuckDB S3 connection") create_secret = f""" CREATE SECRET wren_s3 ( TYPE S3, KEY_ID '{connection_info.access_key.get_secret_value()}', SECRET '{connection_info.secret_key.get_secret_value()}', REGION '{connection_info.region.get_secret_value()}' ) """ try: result = connection.execute(create_secret).fetchone() if result is None or not result[0]: - raise Exception("Failed to create secret") + raise Exception("Failed to create DuckDB S3 secret: empty or invalid result") except HTTPException as e: - raise Exception("Failed to create secret", e) + raise Exception(f"HTTP error while creating DuckDB S3 secret: {str(e)}") from e + logger.debug("Successfully created DuckDB S3 secret")ibis-server/app/model/metadata/object_storage.py (2)
199-203
: Improve S3 path handling robustness.The current path handling could be more robust by:
- Validating the bucket name format
- Ensuring consistent path separators for S3 URIs
def _get_full_path(self, path): + # Validate and sanitize the path + path = path.replace('\\', '/') if path.startswith("/"): path = path[1:] + + bucket = self.connection_info.bucket.get_secret_value() + if not bucket or '/' in bucket: + raise ValueError("Invalid S3 bucket name") - return f"s3://{self.connection_info.bucket.get_secret_value()}/{path}" + return f"s3://{bucket}/{path}"
182-186
: Enhance logging for better debugging capabilities.Add more detailed logging to help troubleshoot S3 connection issues.
def _get_connection(self): + logger.debug(f"Creating S3 connection for bucket: {self.connection_info.bucket.get_secret_value()}") conn = duckdb.connect() init_duckdb_s3(conn, self.connection_info) - logger.debug("Initialized duckdb s3") + logger.debug(f"Successfully initialized DuckDB S3 connection for region: {self.connection_info.region.get_secret_value()}") return conn
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
ibis-server/app/model/connector.py
(3 hunks)ibis-server/app/model/metadata/object_storage.py
(2 hunks)ibis-server/app/model/utils.py
(1 hunks)ibis-server/tests/routers/v2/connector/test_s3_file.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- ibis-server/app/model/connector.py
- ibis-server/tests/routers/v2/connector/test_s3_file.py
🧰 Additional context used
📓 Learnings (1)
ibis-server/app/model/utils.py (1)
Learnt from: goldmedal
PR: Canner/wren-engine#1038
File: ibis-server/app/model/utils.py:6-19
Timestamp: 2025-01-16T10:16:30.138Z
Learning: DuckDB's CREATE SECRET syntax does not support prepared statements or parameter binding, requiring direct value interpolation.
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci
🔇 Additional comments (2)
ibis-server/app/model/metadata/object_storage.py (2)
188-197
: LGTM! Secure handling of S3 credentials.The implementation properly configures the S3 operator with secure credential handling using OpenDAL.
52-64
: Consider handling complex data types.The current implementation skips struct and array types. This limitation should be documented and tracked.
Let's check if there are any existing issues tracking this limitation:
Would you like me to create an issue to track the implementation of struct and array type support?
7f9f811
to
ea34d15
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
ibis-server/app/model/connector.py (1)
167-176
: Enhance error messages for better debugging.The error messages could be more specific by including the HTTP status code and response body when available.
- raise UnprocessableEntityError(f"Failed to execute query: {e!s}") + raise UnprocessableEntityError( + f"Failed to execute query: {e!s}. " + f"Status: {getattr(e, 'status_code', 'unknown')}, " + f"Response: {getattr(e, 'response', 'unknown')}" + )ibis-server/app/model/metadata/object_storage.py (1)
188-203
: Add input validation for S3 configuration.The S3 configuration could benefit from additional validation to ensure all required fields are properly formatted.
def _get_dal_operator(self): info: S3FileConnectionInfo = self.connection_info + # Validate S3 configuration + bucket = info.bucket.get_secret_value() + if not bucket or '/' in bucket: + raise UnprocessableEntityError("Invalid S3 bucket name") + region = info.region.get_secret_value() + if not region: + raise UnprocessableEntityError("Invalid S3 region") return opendal.Operator( "s3", root=info.url.get_secret_value(), - bucket=info.bucket.get_secret_value(), - region=info.region.get_secret_value(), + bucket=bucket, + region=region, secret_access_key=info.secret_key.get_secret_value(), access_key_id=info.access_key.get_secret_value(), )ibis-server/tests/routers/v2/connector/test_s3_file.py (3)
9-97
: Add test cases for edge cases and error scenarios.Consider adding test cases for:
- Empty bucket paths
- Invalid file formats
- Missing or corrupted files
- Rate limiting and timeout scenarios
100-192
: Enhance query test assertions.Consider adding assertions for:
- Column order in the result set
- Data type precision (especially for floating-point numbers)
- NULL value handling
- String encoding and special characters
297-507
: Improve test maintainability for type mappings.Consider refactoring the type mapping tests to use a data-driven approach:
TYPE_MAPPINGS = { 'parquet': { 'c_bigint': 'INT64', 'c_bit': 'STRING', # ... other mappings }, 'csv': { 'c_bigint': 'INT64', 'c_bit': 'STRING', # ... other mappings }, 'json': { 'c_bigint': 'INT64', 'c_bit': 'UUID', # ... other mappings } } @pytest.mark.parametrize('file_format,type_mappings', TYPE_MAPPINGS.items()) async def test_file_format_type_mappings(client, file_format, type_mappings): # Test implementation using the type_mappings dictionary
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
ibis-server/app/model/connector.py
(3 hunks)ibis-server/app/model/metadata/object_storage.py
(2 hunks)ibis-server/app/model/utils.py
(1 hunks)ibis-server/tests/routers/v2/connector/test_s3_file.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- ibis-server/app/model/utils.py
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci
🔇 Additional comments (5)
ibis-server/app/model/connector.py (2)
13-26
: LGTM!The imports and data source handling are correctly implemented to support the S3 file connector.
42-43
: LGTM!Efficient implementation that reuses the DuckDBConnector for S3 file data sources.
ibis-server/app/model/metadata/object_storage.py (2)
175-186
: LGTM!The S3FileMetadata class is well-implemented with proper initialization and version handling.
36-43
:⚠️ Potential issueImprove path handling for security and maintainability.
The current path construction using string concatenation could be vulnerable to path traversal attacks and is error-prone.
Apply this diff to improve path handling:
- table_name = os.path.basename(os.path.normpath(file.path)) - full_path = f"{self.connection_info.url.get_secret_value()}/{table_name}/*.{self.connection_info.format}" + table_name = os.path.basename(os.path.normpath(file.path)) + # Prevent path traversal attacks + if '..' in table_name or table_name.startswith('/'): + logger.warning(f"Skipping suspicious table name: {table_name}") + continue + # Use os.path.join for path construction + full_path = os.path.join( + self.connection_info.url.get_secret_value(), + table_name, + f"*.{self.connection_info.format}" + )Likely invalid or redundant comment.
ibis-server/tests/routers/v2/connector/test_s3_file.py (1)
228-294
: LGTM!The metadata tests are comprehensive and cover all necessary scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (6)
ibis-server/tests/routers/v2/connector/test_s3_file.py (6)
9-12
: Consider using AWS IAM role-based authentication.While using environment variables for credentials is acceptable for testing, consider supporting IAM role-based authentication for production environments. This would enhance security by eliminating the need to manage static credentials.
22-22
: Consider using path joining for S3 URIs.Instead of using f-strings for S3 URIs, consider using a helper function to construct S3 paths. This would ensure consistent path formatting and handle edge cases (e.g., extra slashes).
def build_s3_uri(bucket: str, path: str) -> str: return f"s3://{bucket.strip('/')}/{path.strip('/')}"Also applies to: 49-49
88-98
: Add validation for required connection parameters.The connection_info fixture should validate that all required S3 parameters are present and non-empty. This would help catch configuration issues early in the tests.
@pytest.fixture(scope="module") def connection_info() -> dict[str, str]: + required_params = ["bucket", "region", "access_key", "secret_key"] + missing_params = [param for param in required_params + if not os.getenv(f"AWS_{param.upper()}") + ] + if missing_params: + pytest.skip(f"Missing required S3 parameters: {', '.join(missing_params)}") return { "url": "/tpch/data", "format": "parquet", "bucket": bucket, "region": region, "access_key": access_key, "secret_key": secret_key, }
194-225
: Add more edge cases to invalid connection testing.While the test covers invalid credentials, consider adding tests for:
- Missing required parameters
- Invalid region
- Non-existent bucket
- Invalid URL format
367-436
: Add CSV-specific configuration tests.The CSV test should also verify handling of CSV-specific configurations like:
- Custom delimiters
- Header row options
- Quote characters
- Escape characters
Add a test case for CSV configuration:
async def test_list_csv_files_with_config(client): response = await client.post( url=f"{base_url}/metadata/tables", json={ "connectionInfo": { "url": "/test_file_source", "format": "csv", "bucket": bucket, "region": region, "access_key": access_key, "secret_key": secret_key, "csv_delimiter": "|", "csv_header": True, "csv_quote": '"', "csv_escape": "\\", }, }, ) assert response.status_code == 200 # Add assertions for CSV parsing with custom configuration
438-507
: Add tests for JSON array handling.The JSON tests should verify handling of:
- JSON arrays vs objects
- Nested JSON structures
- JSON Lines format
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
ibis-server/tests/routers/v2/connector/test_local_file.py
(1 hunks)ibis-server/tests/routers/v2/connector/test_s3_file.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci
🔇 Additional comments (1)
ibis-server/tests/routers/v2/connector/test_local_file.py (1)
246-247
: LGTM! Improved error handling for unsupported formats.The change from 501 to 422 status code is more appropriate as it indicates a client error (unsupported format) rather than a server implementation limitation. The error message is also more descriptive.
@wwwy3y3 There are still some issues about the secrets. I tested this PR with my personal CI goldmedal#2. It works great. |
Let's ignore the CI fails here for now. |
Description
URL
Connection Info
url
: The root path of the dataset. (It doesn't include the bucket name)format
: The specific file format.bucket
: The bucket name.region
: The aws region.access_key
: The access key ID. (AWS_ACCESS_KEY_ID
)secret_kety
: The secret key. (AWS_SECRET_ACCESS_KEY
)Summary by CodeRabbit
Release Notes
New Features
Improvements
Testing