✨ Source S3: remove streams.*.file_type from source-s3 configuration (
maxi297 authored Sep 18, 2023
1 parent 327d3c9 commit 2954cbb
Showing 10 changed files with 33 additions and 30 deletions.
2 changes: 1 addition & 1 deletion airbyte-integrations/connectors/source-s3/Dockerfile
@@ -17,5 +17,5 @@ COPY source_s3 ./source_s3
ENV AIRBYTE_ENTRYPOINT "python /airbyte/integration_code/main.py"
ENTRYPOINT ["python", "/airbyte/integration_code/main.py"]

LABEL io.airbyte.version=4.0.3
LABEL io.airbyte.version=4.0.4
LABEL io.airbyte.name=airbyte/source-s3
@@ -110,37 +110,37 @@ acceptance_tests:
tests:
- config_path: secrets/config.json
backward_compatibility_tests_config:
disable_for_version: "3.1.11" # Switch to v4 changed config shape
disable_for_version: "4.0.3" # removing the `streams.*.file_type` field which was redundant with `streams.*.format`
- config_path: secrets/v4_csv_custom_encoding_config.json
backward_compatibility_tests_config:
disable_for_version: "3.1.11" # Switch to v4 changed config shape
disable_for_version: "4.0.3" # removing the `streams.*.file_type` field which was redundant with `streams.*.format`
- config_path: secrets/v4_csv_custom_format_config.json
backward_compatibility_tests_config:
disable_for_version: "3.1.11" # Switch to v4 changed config shape
disable_for_version: "4.0.3" # removing the `streams.*.file_type` field which was redundant with `streams.*.format`
- config_path: secrets/v4_csv_user_schema_config.json
backward_compatibility_tests_config:
disable_for_version: "3.1.11" # Switch to v4 changed config shape
disable_for_version: "4.0.3" # removing the `streams.*.file_type` field which was redundant with `streams.*.format`
- config_path: secrets/v4_csv_no_header_config.json
backward_compatibility_tests_config:
disable_for_version: "3.1.11" # Switch to v4 changed config shape
disable_for_version: "4.0.3" # removing the `streams.*.file_type` field which was redundant with `streams.*.format`
- config_path: secrets/v4_csv_skip_rows_config.json
backward_compatibility_tests_config:
disable_for_version: "3.1.11" # Switch to v4 changed config shape
disable_for_version: "4.0.3" # removing the `streams.*.file_type` field which was redundant with `streams.*.format`
- config_path: secrets/v4_csv_with_nulls_config.json
backward_compatibility_tests_config:
disable_for_version: "3.1.11" # Switch to v4 changed config shape
disable_for_version: "4.0.3" # removing the `streams.*.file_type` field which was redundant with `streams.*.format`
- config_path: secrets/v4_parquet_config.json
backward_compatibility_tests_config:
disable_for_version: "3.1.11" # Switch to v4 changed config shape
disable_for_version: "4.0.3" # removing the `streams.*.file_type` field which was redundant with `streams.*.format`
- config_path: secrets/v4_avro_config.json
backward_compatibility_tests_config:
disable_for_version: "3.1.11" # Switch to v4 changed config shape
disable_for_version: "4.0.3" # removing the `streams.*.file_type` field which was redundant with `streams.*.format`
- config_path: secrets/v4_jsonl_config.json
backward_compatibility_tests_config:
disable_for_version: "3.1.11" # Switch to v4 changed config shape
disable_for_version: "4.0.3" # removing the `streams.*.file_type` field which was redundant with `streams.*.format`
- config_path: secrets/v4_jsonl_newlines_config.json
backward_compatibility_tests_config:
disable_for_version: "3.1.11" # Switch to v4 changed config shape
disable_for_version: "4.0.3" # removing the `streams.*.file_type` field which was redundant with `streams.*.format`
full_refresh:
tests:
- config_path: secrets/config.json
@@ -190,6 +190,6 @@ acceptance_tests:
tests:
- spec_path: integration_tests/spec.json
backward_compatibility_tests_config:
disable_for_version: "3.1.11" # Switch to v4 changed config shape
disable_for_version: "4.0.3" # removing the `streams.*.file_type` field which was redundant with `streams.*.format`
connector_image: airbyte/source-s3:dev
test_strictness_level: high
@@ -29,11 +29,6 @@
"description": "The name of the stream.",
"type": "string"
},
"file_type": {
"title": "File Type",
"description": "The data file type that is being extracted for a stream.",
"type": "string"
},
"globs": {
"title": "Globs",
"description": "The pattern used to specify which files should be selected from the file system. For more information on glob pattern matching look <a href=\"https://en.wikipedia.org/wiki/Glob_(programming)\">here</a>.",
@@ -283,7 +278,7 @@
"type": "boolean"
}
},
"required": ["name", "file_type"]
"required": ["name", "format"]
}
},
"bucket": {
5 changes: 4 additions & 1 deletion airbyte-integrations/connectors/source-s3/metadata.yaml
@@ -5,7 +5,7 @@ data:
connectorSubtype: file
connectorType: source
definitionId: 69589781-7828-43c5-9f63-8925b1c1ccc2
dockerImageTag: 4.0.3
dockerImageTag: 4.0.4
dockerRepository: airbyte/source-s3
githubIssueLabel: source-s3
icon: s3.svg
@@ -25,6 +25,9 @@ data:
4.0.0:
message: "UX improvement, multi-stream support and deprecation of some parsing features"
upgradeDeadline: "2023-10-05"
4.0.4:
message: "Following 4.0.0 config change, we are eliminating the `streams.*.file_type` field which was redundant with `streams.*.format`"
upgradeDeadline: "2023-10-18"
ab_internal:
sl: 300
ql: 400
2 changes: 1 addition & 1 deletion airbyte-integrations/connectors/source-s3/setup.py
@@ -6,7 +6,7 @@
from setuptools import find_packages, setup

MAIN_REQUIREMENTS = [
"airbyte-cdk>=0.51.14",
"airbyte-cdk>=0.51.17",
"pyarrow==12.0.1",
"smart-open[s3]==5.1.0",
"wcmatch==8.4",
@@ -29,7 +29,6 @@ def convert(cls, legacy_config: SourceS3Spec) -> Mapping[str, Any]:
"streams": [
{
"name": legacy_config.dataset,
"file_type": legacy_config.format.filetype,
"globs": cls._create_globs(legacy_config.path_pattern),
"legacy_prefix": legacy_config.provider.path_prefix,
"validation_policy": "Emit Record",
@@ -7,6 +7,7 @@
from unittest.mock import Mock

import pytest
from airbyte_cdk.sources.file_based.config.csv_format import CsvFormat
from airbyte_cdk.sources.file_based.config.file_based_stream_config import FileBasedStreamConfig
from airbyte_cdk.sources.file_based.remote_file import RemoteFile
from airbyte_cdk.sources.file_based.stream.cursor.default_file_based_cursor import DefaultFileBasedCursor
@@ -486,7 +487,7 @@ def test_get_adjusted_date_timestamp(cursor_datetime, file_datetime, expected_ad


def _init_cursor_with_state(input_state, max_history_size: Optional[int] = None) -> Cursor:
cursor = Cursor(stream_config=FileBasedStreamConfig(file_type="csv", name="test", validation_policy="Emit Record"))
cursor = Cursor(stream_config=FileBasedStreamConfig(name="test", validation_policy="Emit Record", format=CsvFormat()))
cursor.set_initial_state(input_state)
if max_history_size is not None:
cursor.DEFAULT_MAX_HISTORY_SIZE = max_history_size
@@ -37,7 +37,6 @@
"streams": [
{
"name": "test_data",
"file_type": "avro",
"globs": ["**/*.avro"],
"legacy_prefix": "a_folder/",
"validation_policy": "Emit Record",
@@ -65,7 +64,6 @@
"streams": [
{
"name": "test_data",
"file_type": "avro",
"globs": ["**/*.avro"],
"legacy_prefix": "",
"validation_policy": "Emit Record",
@@ -93,7 +91,6 @@
"streams": [
{
"name": "test_data",
"file_type": "avro",
"globs": ["*.csv", "**/*"],
"validation_policy": "Emit Record",
"legacy_prefix": "a_prefix/",
@@ -393,7 +390,6 @@ def test_convert_file_format(file_type, legacy_format_config, expected_format_co
"streams": [
{
"name": "test_data",
"file_type": file_type,
"globs": [f"**/*.{file_type}"],
"legacy_prefix": "",
"validation_policy": "Emit Record",
8 changes: 8 additions & 0 deletions docs/integrations/sources/s3-migrations.md
@@ -1,5 +1,12 @@
# S3 Migration Guide

## Upgrading to 4.0.4

Note: This change is only breaking if you created S3 sources using the API and did not provide `streams.*.format`.

Following the 4.0.0 config change, we are removing the `streams.*.file_type` field, which was redundant with `streams.*.format`. This is a breaking change because `format` is now required. Because the UI always populates `format`, only users who create actors through the API without providing `format` are affected. To fix this, set `streams.*.format` to `{"filetype": <file_type>}`.
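
The fix described above can be sketched in Python for configs that are built programmatically before being sent to the API. This is a minimal illustration, not part of the connector: the helper name `migrate_stream_config` and the sample config are hypothetical, and the only behavior taken from this guide is that each stream drops `file_type` and carries `format` as `{"filetype": <file_type>}`.

```python
import copy
from typing import Any, Dict


def migrate_stream_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `config` in which no stream has `file_type`
    and every stream has `format`.

    Hypothetical helper for illustration; it is not shipped with source-s3.
    """
    migrated = copy.deepcopy(config)  # leave the caller's config untouched
    for stream in migrated.get("streams", []):
        # Drop the removed field, remembering its value if it was present.
        file_type = stream.pop("file_type", None)
        if "format" not in stream:
            if file_type is None:
                raise ValueError(
                    f"Stream {stream.get('name')!r} has neither `format` nor `file_type`"
                )
            # Per the migration note: format = {"filetype": <file_type>}
            stream["format"] = {"filetype": file_type}
    return migrated


# Example input shaped like a pre-4.0.4 config created through the API
# (bucket name and globs are made up).
old_config = {
    "bucket": "example-bucket",
    "streams": [
        {"name": "test_data", "file_type": "csv", "globs": ["**/*.csv"]},
    ],
}
new_config = migrate_stream_config(old_config)
```

Streams that already define `format` keep it unchanged; only the redundant `file_type` key is dropped. Raising on a stream with neither field is a design choice of this sketch, so that an incomplete config fails early instead of being rejected later by the connector.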


## Upgrading to 4.0.0

We have revamped the implementation to use the File-Based CDK. The goal is to increase resiliency and reduce development time. Here are the breaking changes:
@@ -18,3 +25,4 @@ Other than breaking changes, we have changed the UI from which the user configur
* You can now configure multiple streams by clicking on `Add` under `Streams`.
* `Output Stream Name` has been renamed to `Name` when configuring a specific stream.
* `Pattern of files to replicate` field has been renamed `Globs` under the stream configuration.

7 changes: 4 additions & 3 deletions docs/integrations/sources/s3.md
@@ -236,9 +236,10 @@ There are currently no options for JSONL parsing.
## Changelog

| Version | Date | Pull Request | Subject |
|:--------|:-----------| :-------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------- |
| 4.0.3 | 2023-09-13 | [30387](https://github.com/airbytehq/airbyte/pull/30387) | Bump Airbyte-CDK version to improve messages for record parse errors
| 4.0.2 | 2023-09-07 | [28639](https://github.com/airbytehq/airbyte/pull/28639) | Always show S3 Key fields
|:--------|:-----------| :-------------------------------------------------------------------------------------------------------------- |:---------------------------------------------------------------------------------------------------------------------|
| 4.0.4 | 2023-09-18 | [30476](https://github.com/airbytehq/airbyte/pull/30476) | Remove streams.*.file_type from source-s3 configuration |
| 4.0.3 | 2023-09-13 | [30387](https://github.com/airbytehq/airbyte/pull/30387) | Bump Airbyte-CDK version to improve messages for record parse errors |
| 4.0.2 | 2023-09-07 | [28639](https://github.com/airbytehq/airbyte/pull/28639) | Always show S3 Key fields |
| 4.0.1 | 2023-09-06 | [30217](https://github.com/airbytehq/airbyte/pull/30217) | Migrate inference error to config errors and avoid sentry alerts |
| 4.0.0 | 2023-09-05 | [29757](https://github.com/airbytehq/airbyte/pull/29757) | New version using file-based CDK |
| 3.1.11 | 2023-08-30 | [29986](https://github.com/airbytehq/airbyte/pull/29986) | Add config error for conversion error |