diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/.dockerignore b/airbyte-integrations/connectors/source-azure-blob-storage/.dockerignore
new file mode 100644
index 000000000000..12815beba423
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/.dockerignore
@@ -0,0 +1,6 @@
+*
+!Dockerfile
+!main.py
+!source_azure_blob_storage
+!setup.py
+!secrets
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/Dockerfile b/airbyte-integrations/connectors/source-azure-blob-storage/Dockerfile
new file mode 100644
index 000000000000..6eb6b8df6c4f
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/Dockerfile
@@ -0,0 +1,30 @@
+FROM python:3.9-slim as base
+
+# build and load all requirements
+FROM base as builder
+RUN apt-get update
+WORKDIR /airbyte/integration_code
+
+COPY setup.py ./
+# install necessary packages to a temporary folder
+RUN pip install --prefix=/install .
+
+# build a clean environment
+FROM base
+WORKDIR /airbyte/integration_code
+
+# copy all loaded and built libraries to a pure basic image
+COPY --from=builder /install /usr/local
+# add default timezone settings
+COPY --from=builder /usr/share/zoneinfo/Etc/UTC /etc/localtime
+RUN echo "Etc/UTC" > /etc/timezone
+
+# copy payload code only
+COPY main.py ./
+COPY source_azure_blob_storage ./source_azure_blob_storage
+
+ENV AIRBYTE_ENTRYPOINT "python /airbyte/integration_code/main.py"
+ENTRYPOINT ["python", "/airbyte/integration_code/main.py"]
+
+LABEL io.airbyte.version=0.2.0
+LABEL io.airbyte.name=airbyte/source-azure-blob-storage
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/README.md b/airbyte-integrations/connectors/source-azure-blob-storage/README.md
index 855d13694a28..ffa457bc2b98 100644
--- a/airbyte-integrations/connectors/source-azure-blob-storage/README.md
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/README.md
@@ -1,10 +1,34 @@
-# Source Azure Blob Storage
+# Azure Blob Storage Source
-This is the repository for the Azure Blob Storage source connector in Java.
-For information about how to use this connector within Airbyte, see [the User Documentation](https://docs.airbyte.io/integrations/sources/azure-blob-storage).
+This is the repository for the Azure Blob Storage source connector, written in Python.
+For information about how to use this connector within Airbyte, see [the documentation](https://docs.airbyte.com/integrations/sources/azure-blob-storage).
## Local development
+### Prerequisites
+**To iterate on this connector, make sure to complete this prerequisites section.**
+
+#### Minimum Python version required `= 3.9.0`
+
+#### Build & Activate Virtual Environment and install dependencies
+From this connector directory, create a virtual environment:
+```
+python -m venv .venv
+```
+
+This will generate a virtualenv for this module in `.venv/`. Make sure this venv is active in your
+development environment of choice. To activate it from the terminal, run:
+```
+source .venv/bin/activate
+pip install -r requirements.txt
+```
+If you are in an IDE, follow your IDE's instructions to activate the virtualenv.
+
+Note that while we are installing dependencies from `requirements.txt`, you should only edit `setup.py` for your dependencies. `requirements.txt` is
+used for editable installs (`pip install -e`) to pull in Python dependencies from the monorepo and will call `setup.py`.
+If this is mumbo jumbo to you, don't worry about it: just put your deps in `setup.py`, install using `pip install -r requirements.txt`, and everything
+should work as you expect.
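+
+For reference, a connector's `requirements.txt` under this scheme is often just an editable install of the connector itself (the exact contents may vary):
+```
+-e .
+```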
+
#### Building via Gradle
From the Airbyte repository root, run:
```
@@ -12,15 +36,31 @@ From the Airbyte repository root, run:
```
#### Create credentials
-**If you are a community contributor**, generate the necessary credentials and place them in `secrets/config.json` conforming to the spec file in `src/main/resources/spec.json`.
-Note that the `secrets` directory is git-ignored by default, so there is no danger of accidentally checking in sensitive information.
+**If you are a community contributor**, follow the instructions in the [documentation](https://docs.airbyte.com/integrations/sources/azure-blob-storage)
+to generate the necessary credentials. Then create a file `secrets/config.json` conforming to the spec in `source_azure_blob_storage/spec.yaml`.
+Note that the `secrets` directory is gitignored by default, so there is no danger of accidentally checking in sensitive information.
+See `integration_tests/sample_config.json` for a sample config file.
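+
+For illustration only, a minimal `secrets/config.json` could look like the following; treat the field names as assumptions and verify them against `source_azure_blob_storage/spec.yaml`:
+```
+{
+  "azure_blob_storage_account_name": "<account name>",
+  "azure_blob_storage_account_key": "<account key>",
+  "azure_blob_storage_container_name": "<container name>",
+  "streams": [
+    { "name": "my_stream", "globs": ["**/*.csv"], "format": { "filetype": "csv" } }
+  ]
+}
+```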
+
+**If you are an Airbyte core member**, copy the credentials from Lastpass under the secret name `source azure-blob-storage test creds`
+and place them into `secrets/config.json`.
-**If you are an Airbyte core member**, follow the [instructions](https://docs.airbyte.io/connector-development#using-credentials-in-ci) to set up the credentials.
+### Locally running the connector
+```
+python main.py spec
+python main.py check --config secrets/config.json
+python main.py discover --config secrets/config.json
+python main.py read --config secrets/config.json --catalog integration_tests/configured_catalog.json
+```
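+These correspond to the four Airbyte protocol commands: `spec`, `check`, `discover`, and `read`.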
### Locally running the connector docker image
#### Build
-Build the connector image via Gradle:
+First, make sure you build the latest Docker image:
+```
+docker build . -t airbyte/source-azure-blob-storage:dev
+```
+
+You can also build the connector image via Gradle:
```
./gradlew :airbyte-integrations:connectors:source-azure-blob-storage:airbyteDocker
```
@@ -35,23 +75,39 @@ docker run --rm -v $(pwd)/secrets:/secrets airbyte/source-azure-blob-storage:dev
docker run --rm -v $(pwd)/secrets:/secrets airbyte/source-azure-blob-storage:dev discover --config /secrets/config.json
docker run --rm -v $(pwd)/secrets:/secrets -v $(pwd)/integration_tests:/integration_tests airbyte/source-azure-blob-storage:dev read --config /secrets/config.json --catalog /integration_tests/configured_catalog.json
```
-
## Testing
-We use `JUnit` for Java tests.
-
-### Unit and Integration Tests
-Place unit tests under `src/test/...`
-Place integration tests in `src/test-integration/...`
+Make sure to familiarize yourself with [pytest test discovery](https://docs.pytest.org/en/latest/goodpractices.html#test-discovery) to know how your test files and methods should be named.
+First install test dependencies into your virtual environment:
+```
+pip install .[tests]
+```
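+(In shells like zsh that treat square brackets as glob patterns, quote the argument: `pip install '.[tests]'`.)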
+### Unit Tests
+To run unit tests locally, from the connector directory run:
+```
+python -m pytest unit_tests
+```
+### Integration Tests
+There are two types of integration tests: Acceptance Tests (Airbyte's test suite for all source connectors) and custom integration tests (which are specific to this connector).
+#### Custom Integration tests
+Place custom tests inside the `integration_tests/` folder, then, from the connector root, run:
+```
+python -m pytest integration_tests
+```
#### Acceptance Tests
-Airbyte has a standard test suite that all source connectors must pass. Implement the `TODO`s in
-`src/test-integration/java/io/airbyte/integrations/sources/AzureBlobStorageSourceAcceptanceTest.java`.
+Customize the `acceptance-test-config.yml` file to configure the tests. See [Connector Acceptance Tests](https://docs.airbyte.com/connector-development/testing-connectors/connector-acceptance-tests-reference) for more information.
+If your connector requires creating or destroying resources for use during acceptance tests, create fixtures for them and place them inside `integration_tests/acceptance.py`, as sketched below.
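+A minimal sketch of such a fixture, assuming hypothetical `create_test_blobs`/`delete_test_blobs` helpers that upload and remove the files the tests expect:
+```
+import pytest
+
+
+@pytest.fixture(scope="session", autouse=True)
+def connector_setup():
+    create_test_blobs()  # hypothetical helper: upload the test files to the container
+    yield
+    delete_test_blobs()  # hypothetical helper: remove them afterwards
+```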
+To run your integration tests together with the acceptance tests, from the connector root, run:
+```
+python -m pytest integration_tests -p integration_tests.acceptance
+```
+To run your integration tests with Docker, run `./acceptance-test-docker.sh` from the connector root.
### Using gradle to run tests
All commands should be run from airbyte project root.
To run unit tests:
```
-./gradlew :airbyte-integrations:connectors:source-azure-blob-storage:unitTest
+./gradlew :airbyte-integrations:connectors:source-azure-blob-storage:check
```
To run acceptance and custom integration tests:
```
@@ -59,6 +115,10 @@ To run acceptance and custom integration tests:
```
## Dependency Management
+All of your dependencies should go in `setup.py`, NOT `requirements.txt`. The requirements file is only used to connect internal Airbyte dependencies in the monorepo for local development.
+We split dependencies between two groups (see the sketch after this list):
+* dependencies required for your connector to work go in the `MAIN_REQUIREMENTS` list.
+* dependencies required for testing go in the `TEST_REQUIREMENTS` list.
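+
+A minimal sketch of how `setup.py` wires these up (the dependency names shown are illustrative, not the connector's actual pins):
+```
+from setuptools import find_packages, setup
+
+MAIN_REQUIREMENTS = ["airbyte-cdk"]  # illustrative runtime dependencies
+TEST_REQUIREMENTS = ["pytest", "pytest-mock"]  # illustrative test-only dependencies
+
+setup(
+    name="source_azure_blob_storage",
+    packages=find_packages(),
+    install_requires=MAIN_REQUIREMENTS,
+    extras_require={"tests": TEST_REQUIREMENTS},
+)
+```
+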
### Publishing a new version of the connector
You've checked out the repo, implemented a million dollar feature, and you're ready to share your changes with the world. Now what?
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/acceptance-test-config.yml b/airbyte-integrations/connectors/source-azure-blob-storage/acceptance-test-config.yml
index 80579ba60f35..8a06b3818f0f 100644
--- a/airbyte-integrations/connectors/source-azure-blob-storage/acceptance-test-config.yml
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/acceptance-test-config.yml
@@ -1,7 +1,137 @@
-# See [Source Acceptance Tests](https://docs.airbyte.com/connector-development/testing-connectors/source-acceptance-tests-reference)
-# for more information about how to configure these tests
-connector_image: airbyte/source-azure-blob-storage:dev
-acceptance-tests:
+acceptance_tests:
+ basic_read:
+ tests:
+ - config_path: secrets/config.json
+ expect_records:
+ path: integration_tests/expected_records/csv.jsonl
+ exact_order: true
+ - config_path: secrets/csv_custom_encoding_config.json
+ expect_records:
+ path: integration_tests/expected_records/csv_custom_encoding.jsonl
+ exact_order: true
+ - config_path: secrets/csv_custom_format_config.json
+ expect_records:
+ path: integration_tests/expected_records/csv_custom_format.jsonl
+ exact_order: true
+ - config_path: secrets/csv_user_schema_config.json
+ expect_records:
+ path: integration_tests/expected_records/csv_user_schema.jsonl
+ exact_order: true
+ - config_path: secrets/csv_no_header_config.json
+ expect_records:
+ path: integration_tests/expected_records/csv_no_header.jsonl
+ exact_order: true
+ - config_path: secrets/csv_skip_rows_config.json
+ expect_records:
+ path: integration_tests/expected_records/csv_skip_rows.jsonl
+ exact_order: true
+ - config_path: secrets/csv_skip_rows_no_header_config.json
+ expect_records:
+ path: integration_tests/expected_records/csv_skip_rows_no_header.jsonl
+ exact_order: true
+ - config_path: secrets/csv_with_nulls_config.json
+ expect_records:
+ path: integration_tests/expected_records/csv_with_nulls.jsonl
+ exact_order: true
+ - config_path: secrets/csv_with_null_bools_config.json
+ expect_records:
+ path: integration_tests/expected_records/csv_with_null_bools.jsonl
+ exact_order: true
+ - config_path: secrets/parquet_config.json
+ expect_records:
+ path: integration_tests/expected_records/parquet.jsonl
+ exact_order: true
+ - config_path: secrets/avro_config.json
+ expect_records:
+ path: integration_tests/expected_records/avro.jsonl
+ exact_order: true
+ - config_path: secrets/jsonl_config.json
+ expect_records:
+ path: integration_tests/expected_records/jsonl.jsonl
+ exact_order: true
+ - config_path: secrets/jsonl_newlines_config.json
+ expect_records:
+ path: integration_tests/expected_records/jsonl_newlines.jsonl
+ exact_order: true
+ connection:
+ tests:
+ - config_path: secrets/config.json
+ status: succeed
+ - config_path: secrets/csv_custom_encoding_config.json
+ status: succeed
+ - config_path: secrets/csv_custom_format_config.json
+ status: succeed
+ - config_path: secrets/csv_user_schema_config.json
+ status: succeed
+ - config_path: secrets/csv_no_header_config.json
+ status: succeed
+ - config_path: secrets/csv_skip_rows_config.json
+ status: succeed
+ - config_path: secrets/csv_skip_rows_no_header_config.json
+ status: succeed
+ - config_path: secrets/csv_with_nulls_config.json
+ status: succeed
+ - config_path: secrets/csv_with_null_bools_config.json
+ status: succeed
+ - config_path: secrets/parquet_config.json
+ status: succeed
+ - config_path: secrets/avro_config.json
+ status: succeed
+ - config_path: secrets/jsonl_config.json
+ status: succeed
+ - config_path: secrets/jsonl_newlines_config.json
+ status: succeed
+ discovery:
+ tests:
+ - config_path: secrets/config.json
+ - config_path: secrets/csv_custom_encoding_config.json
+ - config_path: secrets/csv_custom_format_config.json
+ - config_path: secrets/csv_user_schema_config.json
+ - config_path: secrets/csv_no_header_config.json
+ - config_path: secrets/csv_skip_rows_config.json
+ - config_path: secrets/csv_skip_rows_no_header_config.json
+ - config_path: secrets/csv_with_nulls_config.json
+ - config_path: secrets/csv_with_null_bools_config.json
+ - config_path: secrets/parquet_config.json
+ - config_path: secrets/avro_config.json
+ - config_path: secrets/jsonl_config.json
+ - config_path: secrets/jsonl_newlines_config.json
+ full_refresh:
+ tests:
+ - config_path: secrets/config.json
+ configured_catalog_path: integration_tests/configured_catalogs/csv.json
+ - config_path: secrets/parquet_config.json
+ configured_catalog_path: integration_tests/configured_catalogs/parquet.json
+ - config_path: secrets/avro_config.json
+ configured_catalog_path: integration_tests/configured_catalogs/avro.json
+ - config_path: secrets/jsonl_config.json
+ configured_catalog_path: integration_tests/configured_catalogs/jsonl.json
+ - config_path: secrets/jsonl_newlines_config.json
+ configured_catalog_path: integration_tests/configured_catalogs/jsonl.json
+ incremental:
+ tests:
+ - config_path: secrets/config.json
+ configured_catalog_path: integration_tests/configured_catalogs/csv.json
+ future_state:
+ future_state_path: integration_tests/abnormal_states/csv.json
+ - config_path: secrets/parquet_config.json
+ configured_catalog_path: integration_tests/configured_catalogs/parquet.json
+ future_state:
+ future_state_path: integration_tests/abnormal_states/parquet.json
+ - config_path: secrets/avro_config.json
+ configured_catalog_path: integration_tests/configured_catalogs/avro.json
+ future_state:
+ future_state_path: integration_tests/abnormal_states/avro.json
+ - config_path: secrets/jsonl_config.json
+ configured_catalog_path: integration_tests/configured_catalogs/jsonl.json
+ future_state:
+ future_state_path: integration_tests/abnormal_states/jsonl.json
+ - config_path: secrets/jsonl_newlines_config.json
+ configured_catalog_path: integration_tests/configured_catalogs/jsonl.json
+ future_state:
+ future_state_path: integration_tests/abnormal_states/jsonl_newlines.json
spec:
tests:
- - spec_path: "main/resources/spec.json"
+ - spec_path: integration_tests/spec.json
+connector_image: airbyte/source-azure-blob-storage:dev
+test_strictness_level: low
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/acceptance-test-docker.sh b/airbyte-integrations/connectors/source-azure-blob-storage/acceptance-test-docker.sh
index 5797d20fe9a7..b6d65deeccb4 100755
--- a/airbyte-integrations/connectors/source-azure-blob-storage/acceptance-test-docker.sh
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/acceptance-test-docker.sh
@@ -1,2 +1,3 @@
#!/usr/bin/env sh
+
source "$(git rev-parse --show-toplevel)/airbyte-integrations/bases/connector-acceptance-test/acceptance-test-docker.sh"
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/build.gradle b/airbyte-integrations/connectors/source-azure-blob-storage/build.gradle
deleted file mode 100644
index 15d120de4812..000000000000
--- a/airbyte-integrations/connectors/source-azure-blob-storage/build.gradle
+++ /dev/null
@@ -1,28 +0,0 @@
-plugins {
- id 'application'
- id 'airbyte-java-connector'
-}
-
-airbyteJavaConnector {
- cdkVersionRequired = '0.1.0'
- features = ['db-sources']
- useLocalCdk = false
-}
-
-airbyteJavaConnector.addCdkDependencies()
-
-application {
- mainClass = 'io.airbyte.integrations.source.azureblobstorage.AzureBlobStorageSource'
-}
-
-
-dependencies {
- implementation libs.airbyte.protocol
-
- implementation "com.azure:azure-storage-blob:12.20.2"
- implementation "com.github.saasquatch:json-schema-inferrer:0.1.5"
-
- testImplementation "org.assertj:assertj-core:3.23.1"
- testImplementation "org.testcontainers:junit-jupiter:1.17.5"
- testImplementation 'org.skyscreamer:jsonassert:1.5.1'
-}
\ No newline at end of file
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/__init__.py b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/__init__.py
new file mode 100644
index 000000000000..c941b3045795
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/__init__.py
@@ -0,0 +1,3 @@
+#
+# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
+#
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/abnormal_states/avro.json b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/abnormal_states/avro.json
new file mode 100644
index 000000000000..f9428f77d8c8
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/abnormal_states/avro.json
@@ -0,0 +1,12 @@
+[
+ {
+ "type": "STREAM",
+ "stream": {
+ "stream_state": {
+ "_ab_source_file_last_modified": "2999-01-01T00:00:00.000000Z_test_sample.avro",
+ "history": { "test_sample.avro": "2999-01-01T00:00:00.000000Z" }
+ },
+ "stream_descriptor": { "name": "airbyte-source-azure-blob-storage-test" }
+ }
+ }
+]
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/abnormal_states/csv.json b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/abnormal_states/csv.json
new file mode 100644
index 000000000000..10347a3c9e7a
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/abnormal_states/csv.json
@@ -0,0 +1,12 @@
+[
+ {
+ "type": "STREAM",
+ "stream": {
+ "stream_state": {
+ "_ab_source_file_last_modified": "2999-01-01T00:00:00.000000Z_simple_test.csv",
+ "history": { "simple_test.csv": "2999-01-01T00:00:00.000000Z" }
+ },
+ "stream_descriptor": { "name": "airbyte-source-azure-blob-storage-test" }
+ }
+ }
+]
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/abnormal_states/jsonl.json b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/abnormal_states/jsonl.json
new file mode 100644
index 000000000000..99aab040ba62
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/abnormal_states/jsonl.json
@@ -0,0 +1,12 @@
+[
+ {
+ "type": "STREAM",
+ "stream": {
+ "stream_state": {
+ "_ab_source_file_last_modified": "2999-01-01T00:00:00.000000Z_simple_test.jsonl",
+ "history": { "simple_test.jsonl": "2999-01-01T00:00:00.000000Z" }
+ },
+ "stream_descriptor": { "name": "airbyte-source-azure-blob-storage-test" }
+ }
+ }
+]
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/abnormal_states/jsonl_newlines.json b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/abnormal_states/jsonl_newlines.json
new file mode 100644
index 000000000000..c645ceeab9f5
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/abnormal_states/jsonl_newlines.json
@@ -0,0 +1,14 @@
+[
+ {
+ "type": "STREAM",
+ "stream": {
+ "stream_state": {
+ "_ab_source_file_last_modified": "2999-01-01T00:00:00.000000Z_simple_test_newlines.jsonl",
+ "history": {
+ "simple_test_newlines.jsonl": "2999-01-01T00:00:00.000000Z"
+ }
+ },
+ "stream_descriptor": { "name": "airbyte-source-azure-blob-storage-test" }
+ }
+ }
+]
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/abnormal_states/parquet.json b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/abnormal_states/parquet.json
new file mode 100644
index 000000000000..be24e222e3d3
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/abnormal_states/parquet.json
@@ -0,0 +1,18 @@
+[
+ {
+ "type": "STREAM",
+ "stream": {
+ "stream_state": {
+ "_ab_source_file_last_modified": "2999-01-01T00:00:00.000000Z_simple_test.csv",
+ "history": {
+ "test_payroll/Fiscal_Year=2021/Leave_Status_as_of_June_30=ACTIVE/Pay_Basis=per%20Annum/4e0ea65c5a074c0592e43f7b950f3ce8-0.parquet": "2999-01-01T00:00:00.000000Z",
+ "test_payroll/Fiscal_Year=2021/Leave_Status_as_of_June_30=ACTIVE/Pay_Basis=per%20Hour/4e0ea65c5a074c0592e43f7b950f3ce8-0.parquet": "2999-01-01T00:00:00.000000Z",
+ "test_payroll/Fiscal_Year=2021/Leave_Status_as_of_June_30=ON%20LEAVE/Pay_Basis=per%20Annum/4e0ea65c5a074c0592e43f7b950f3ce8-0.parquet": "2999-01-01T00:00:00.000000Z",
+ "test_payroll/Fiscal_Year=2022/Leave_Status_as_of_June_30=ACTIVE/Pay_Basis=per%20Annum/4e0ea65c5a074c0592e43f7b950f3ce8-0.parquet": "2999-01-01T00:00:00.000000Z",
+ "test_payroll/Fiscal_Year=2022/Leave_Status_as_of_June_30=ON%20LEAVE/Pay_Basis=per%20Annum/4e0ea65c5a074c0592e43f7b950f3ce8-0.parquet": "2999-01-01T00:00:00.000000Z"
+ }
+ },
+ "stream_descriptor": { "name": "airbyte-source-azure-blob-storage-test" }
+ }
+ }
+]
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/acceptance.py b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/acceptance.py
new file mode 100644
index 000000000000..43ce950d77ca
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/acceptance.py
@@ -0,0 +1,16 @@
+#
+# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
+#
+
+
+import pytest
+
+pytest_plugins = ("connector_acceptance_test.plugin",)
+
+
+@pytest.fixture(scope="session", autouse=True)
+def connector_setup():
+ """This fixture is a placeholder for external resources that acceptance test might require."""
+ # TODO: setup test dependencies
+ yield
+ # TODO: clean up test dependencies
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/configured_catalog.json b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/configured_catalog.json
new file mode 100644
index 000000000000..d5ec74c6346f
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/configured_catalog.json
@@ -0,0 +1,14 @@
+{
+ "streams": [
+ {
+ "stream": {
+ "name": "airbyte-source-azure-blob-storage-test",
+ "json_schema": {},
+ "supported_sync_modes": ["full_refresh"],
+ "source_defined_cursor": false
+ },
+ "sync_mode": "full_refresh",
+ "destination_sync_mode": "overwrite"
+ }
+ ]
+}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/configured_catalogs/avro.json b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/configured_catalogs/avro.json
new file mode 100644
index 000000000000..85f11913e421
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/configured_catalogs/avro.json
@@ -0,0 +1,38 @@
+{
+ "streams": [
+ {
+ "stream": {
+ "name": "airbyte-source-azure-blob-storage-test",
+ "json_schema": {
+ "type": "object",
+ "properties": {
+ "id": {
+ "type": ["integer", "null"]
+ },
+ "fullname_and_valid": {
+ "type": ["object", "null"],
+ "fullname": {
+ "type": ["string", "null"]
+ },
+ "valid": {
+ "type": ["boolean", "null"]
+ }
+ },
+ "_ab_source_file_last_modified": {
+ "type": "string",
+ "format": "date-time"
+ },
+ "_ab_source_file_url": {
+ "type": "string"
+ }
+ }
+ },
+ "supported_sync_modes": ["full_refresh", "incremental"],
+ "source_defined_cursor": true,
+ "default_cursor_field": ["_ab_source_file_last_modified"]
+ },
+ "sync_mode": "incremental",
+ "destination_sync_mode": "append"
+ }
+ ]
+}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/configured_catalogs/csv.json b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/configured_catalogs/csv.json
new file mode 100644
index 000000000000..009fe4584cc9
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/configured_catalogs/csv.json
@@ -0,0 +1,35 @@
+{
+ "streams": [
+ {
+ "stream": {
+ "name": "airbyte-source-azure-blob-storage-test",
+ "json_schema": {
+ "type": "object",
+ "properties": {
+ "id": {
+ "type": ["null", "integer"]
+ },
+ "name": {
+ "type": ["null", "string"]
+ },
+ "valid": {
+ "type": ["null", "boolean"]
+ },
+ "_ab_source_file_last_modified": {
+ "type": "string",
+ "format": "date-time"
+ },
+ "_ab_source_file_url": {
+ "type": "string"
+ }
+ }
+ },
+ "supported_sync_modes": ["full_refresh", "incremental"],
+ "source_defined_cursor": true,
+ "default_cursor_field": ["_ab_source_file_last_modified"]
+ },
+ "sync_mode": "incremental",
+ "destination_sync_mode": "append"
+ }
+ ]
+}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/configured_catalogs/jsonl.json b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/configured_catalogs/jsonl.json
new file mode 100644
index 000000000000..102bebaf253e
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/configured_catalogs/jsonl.json
@@ -0,0 +1,41 @@
+{
+ "streams": [
+ {
+ "stream": {
+ "name": "airbyte-source-azure-blob-storage-test",
+ "json_schema": {
+ "type": "object",
+ "properties": {
+ "id": {
+ "type": ["null", "integer"]
+ },
+ "name": {
+ "type": ["null", "string"]
+ },
+ "valid": {
+ "type": ["null", "boolean"]
+ },
+ "value": {
+ "type": ["null", "number"]
+ },
+ "event_date": {
+ "type": ["null", "string"]
+ },
+ "_ab_source_file_last_modified": {
+ "type": "string",
+ "format": "date-time"
+ },
+ "_ab_source_file_url": {
+ "type": "string"
+ }
+ }
+ },
+ "supported_sync_modes": ["full_refresh", "incremental"],
+ "source_defined_cursor": true,
+ "default_cursor_field": ["_ab_source_file_last_modified"]
+ },
+ "sync_mode": "incremental",
+ "destination_sync_mode": "append"
+ }
+ ]
+}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/configured_catalogs/parquet.json b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/configured_catalogs/parquet.json
new file mode 100644
index 000000000000..013465e64d42
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/configured_catalogs/parquet.json
@@ -0,0 +1,74 @@
+{
+ "streams": [
+ {
+ "stream": {
+ "name": "airbyte-source-azure-blob-storage-test",
+ "json_schema": {
+ "type": "object",
+ "properties": {
+ "Payroll_Number": {
+ "type": ["null", "number"]
+ },
+ "Last_Name": {
+ "type": ["null", "string"]
+ },
+ "First_Name": {
+ "type": ["null", "string"]
+ },
+ "Mid_Init": {
+ "type": ["null", "string"]
+ },
+ "Agency_Start_Date": {
+ "type": ["null", "string"]
+ },
+ "Work_Location_Borough": {
+ "type": ["null", "number"]
+ },
+ "Title_Description": {
+ "type": ["null", "string"]
+ },
+ "Base_Salary": {
+ "type": ["null", "number"]
+ },
+ "Regular_Hours": {
+ "type": ["null", "number"]
+ },
+ "Regular_Gross_Paid": {
+ "type": ["null", "number"]
+ },
+ "OT_Hours": {
+ "type": ["null", "number"]
+ },
+ "Total_OT_Paid": {
+ "type": ["null", "number"]
+ },
+ "Total_Other_Pay": {
+ "type": ["null", "number"]
+ },
+ "Fiscal_Year": {
+ "type": ["null", "string"]
+ },
+ "Leave_Status_as_of_June_30": {
+ "type": ["null", "string"]
+ },
+ "Pay_Basis": {
+ "type": ["null", "string"]
+ },
+ "_ab_source_file_last_modified": {
+ "type": "string",
+ "format": "date-time"
+ },
+ "_ab_source_file_url": {
+ "type": "string"
+ }
+ }
+ },
+ "supported_sync_modes": ["full_refresh", "incremental"],
+ "source_defined_cursor": true,
+ "default_cursor_field": ["_ab_source_file_last_modified"]
+ },
+ "sync_mode": "incremental",
+ "destination_sync_mode": "append"
+ }
+ ]
+}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/avro.jsonl b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/avro.jsonl
new file mode 100644
index 000000000000..6f7a2a884d1e
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/avro.jsonl
@@ -0,0 +1,10 @@
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 0, "fullname_and_valid": {"fullname": "cfjwIzCRTL", "valid": false}, "_ab_source_file_last_modified": "2023-10-12T15:27:33.000000Z", "_ab_source_file_url": "test_sample.avro"}, "emitted_at": 1697137055360}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 1, "fullname_and_valid": {"fullname": "LYOnPyuTWw", "valid": true}, "_ab_source_file_last_modified": "2023-10-12T15:27:33.000000Z", "_ab_source_file_url": "test_sample.avro"}, "emitted_at": 1697137055363}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 2, "fullname_and_valid": {"fullname": "hyTFbsxlRB", "valid": false}, "_ab_source_file_last_modified": "2023-10-12T15:27:33.000000Z", "_ab_source_file_url": "test_sample.avro"}, "emitted_at": 1697137055363}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 3, "fullname_and_valid": {"fullname": "ooEUiFcFqp", "valid": true}, "_ab_source_file_last_modified": "2023-10-12T15:27:33.000000Z", "_ab_source_file_url": "test_sample.avro"}, "emitted_at": 1697137055364}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 4, "fullname_and_valid": {"fullname": "pveENwAvOg", "valid": true}, "_ab_source_file_last_modified": "2023-10-12T15:27:33.000000Z", "_ab_source_file_url": "test_sample.avro"}, "emitted_at": 1697137055365}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 5, "fullname_and_valid": {"fullname": "pPhWgQgZFq", "valid": true}, "_ab_source_file_last_modified": "2023-10-12T15:27:33.000000Z", "_ab_source_file_url": "test_sample.avro"}, "emitted_at": 1697137055365}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 6, "fullname_and_valid": {"fullname": "MRNMXFkXZo", "valid": true}, "_ab_source_file_last_modified": "2023-10-12T15:27:33.000000Z", "_ab_source_file_url": "test_sample.avro"}, "emitted_at": 1697137055366}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 7, "fullname_and_valid": {"fullname": "MXvEWMgnIr", "valid": true}, "_ab_source_file_last_modified": "2023-10-12T15:27:33.000000Z", "_ab_source_file_url": "test_sample.avro"}, "emitted_at": 1697137055367}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 8, "fullname_and_valid": {"fullname": "rqmFGqZqdF", "valid": true}, "_ab_source_file_last_modified": "2023-10-12T15:27:33.000000Z", "_ab_source_file_url": "test_sample.avro"}, "emitted_at": 1697137055367}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 9, "fullname_and_valid": {"fullname": "lmPpQTcPFM", "valid": true}, "_ab_source_file_last_modified": "2023-10-12T15:27:33.000000Z", "_ab_source_file_url": "test_sample.avro"}, "emitted_at": 1697137055368}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv.jsonl b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv.jsonl
new file mode 100644
index 000000000000..e0ce199ad2b4
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv.jsonl
@@ -0,0 +1,8 @@
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 1, "name": "PVdhmjb1", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:28.000000Z", "_ab_source_file_url": "simple_test.csv"}, "emitted_at": 1697137323277}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 2, "name": "j4DyXTS7", "valid": true, "_ab_source_file_last_modified": "2023-10-12T15:27:28.000000Z", "_ab_source_file_url": "simple_test.csv"}, "emitted_at": 1697137323280}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 3, "name": "v0w8fTME", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:28.000000Z", "_ab_source_file_url": "simple_test.csv"}, "emitted_at": 1697137323280}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 4, "name": "1q6jD8Np", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:28.000000Z", "_ab_source_file_url": "simple_test.csv"}, "emitted_at": 1697137323281}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 5, "name": "77h4aiMP", "valid": true, "_ab_source_file_last_modified": "2023-10-12T15:27:28.000000Z", "_ab_source_file_url": "simple_test.csv"}, "emitted_at": 1697137323282}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 6, "name": "Le35Wyic", "valid": true, "_ab_source_file_last_modified": "2023-10-12T15:27:28.000000Z", "_ab_source_file_url": "simple_test.csv"}, "emitted_at": 1697137323282}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 7, "name": "xZhh1Kyl", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:28.000000Z", "_ab_source_file_url": "simple_test.csv"}, "emitted_at": 1697137323283}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 8, "name": "M2t286iJ", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:28.000000Z", "_ab_source_file_url": "simple_test.csv"}, "emitted_at": 1697137323283}
\ No newline at end of file
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_custom_encoding.jsonl b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_custom_encoding.jsonl
new file mode 100644
index 000000000000..6b2c285748b7
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_custom_encoding.jsonl
@@ -0,0 +1,8 @@
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 1, "name": "PVdhmjb1\u20ac", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/csv_encoded_as_cp1252.csv"}, "emitted_at": 1697137941724}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 2, "name": "j4DyXTS7", "valid": true, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/csv_encoded_as_cp1252.csv"}, "emitted_at": 1697137941726}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 3, "name": "v0w8fTME", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/csv_encoded_as_cp1252.csv"}, "emitted_at": 1697137941727}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 4, "name": "1q6jD8Np", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/csv_encoded_as_cp1252.csv"}, "emitted_at": 1697137941727}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 5, "name": "77h4aiMP", "valid": true, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/csv_encoded_as_cp1252.csv"}, "emitted_at": 1697137941728}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 6, "name": "Le35Wyic", "valid": true, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/csv_encoded_as_cp1252.csv"}, "emitted_at": 1697137941729}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 7, "name": "xZhh1Kyl", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/csv_encoded_as_cp1252.csv"}, "emitted_at": 1697137941729}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 8, "name": "M2t286iJ", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/csv_encoded_as_cp1252.csv"}, "emitted_at": 1697137941730}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_custom_format.jsonl b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_custom_format.jsonl
new file mode 100644
index 000000000000..908ba5255dae
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_custom_format.jsonl
@@ -0,0 +1,8 @@
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 1, "name": "PVdhmj|b1", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:53.000000Z", "_ab_source_file_url": "csv_tests/custom_format.csv"}, "emitted_at": 1697137871167}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 2, "name": "j4DyXTS7", "valid": true, "_ab_source_file_last_modified": "2023-10-12T15:27:53.000000Z", "_ab_source_file_url": "csv_tests/custom_format.csv"}, "emitted_at": 1697137871168}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 3, "name": "v0w8fTME", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:53.000000Z", "_ab_source_file_url": "csv_tests/custom_format.csv"}, "emitted_at": 1697137871168}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 4, "name": "1q6jD8Np", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:53.000000Z", "_ab_source_file_url": "csv_tests/custom_format.csv"}, "emitted_at": 1697137871168}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 5, "name": "77h4aiMP", "valid": true, "_ab_source_file_last_modified": "2023-10-12T15:27:53.000000Z", "_ab_source_file_url": "csv_tests/custom_format.csv"}, "emitted_at": 1697137871168}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 6, "name": "Le35Wyic", "valid": true, "_ab_source_file_last_modified": "2023-10-12T15:27:53.000000Z", "_ab_source_file_url": "csv_tests/custom_format.csv"}, "emitted_at": 1697137871168}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 7, "name": "xZhh1Kyl", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:53.000000Z", "_ab_source_file_url": "csv_tests/custom_format.csv"}, "emitted_at": 1697137871168}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 8, "name": "M2t286iJ", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:53.000000Z", "_ab_source_file_url": "csv_tests/custom_format.csv"}, "emitted_at": 1697137871169}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_no_header.jsonl b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_no_header.jsonl
new file mode 100644
index 000000000000..2814616406a3
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_no_header.jsonl
@@ -0,0 +1,8 @@
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"f0": 1, "f1": "PVdhmjb1", "f2": false, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/no_header.csv"}, "emitted_at": 1697190339868}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"f0": 2, "f1": "j4DyXTS7", "f2": true, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/no_header.csv"}, "emitted_at": 1697190339871}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"f0": 3, "f1": "v0w8fTME", "f2": false, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/no_header.csv"}, "emitted_at": 1697190339872}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"f0": 4, "f1": "1q6jD8Np", "f2": false, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/no_header.csv"}, "emitted_at": 1697190339873}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"f0": 5, "f1": "77h4aiMP", "f2": true, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/no_header.csv"}, "emitted_at": 1697190339873}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"f0": 6, "f1": "Le35Wyic", "f2": true, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/no_header.csv"}, "emitted_at": 1697190339874}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"f0": 7, "f1": "xZhh1Kyl", "f2": false, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/no_header.csv"}, "emitted_at": 1697190339875}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"f0": 8, "f1": "M2t286iJ", "f2": false, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/no_header.csv"}, "emitted_at": 1697190339876}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_skip_rows.jsonl b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_skip_rows.jsonl
new file mode 100644
index 000000000000..26e6740250a8
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_skip_rows.jsonl
@@ -0,0 +1,8 @@
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 1, "name": "PVdhmjb1", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:45.000000Z", "_ab_source_file_url": "csv_tests/skip_rows.csv"}, "emitted_at": 1697138054160}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 2, "name": "j4DyXTS7", "valid": true, "_ab_source_file_last_modified": "2023-10-12T15:27:45.000000Z", "_ab_source_file_url": "csv_tests/skip_rows.csv"}, "emitted_at": 1697138054163}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 3, "name": "v0w8fTME", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:45.000000Z", "_ab_source_file_url": "csv_tests/skip_rows.csv"}, "emitted_at": 1697138054163}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 4, "name": "1q6jD8Np", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:45.000000Z", "_ab_source_file_url": "csv_tests/skip_rows.csv"}, "emitted_at": 1697138054164}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 5, "name": "77h4aiMP", "valid": true, "_ab_source_file_last_modified": "2023-10-12T15:27:45.000000Z", "_ab_source_file_url": "csv_tests/skip_rows.csv"}, "emitted_at": 1697138054165}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 6, "name": "Le35Wyic", "valid": true, "_ab_source_file_last_modified": "2023-10-12T15:27:45.000000Z", "_ab_source_file_url": "csv_tests/skip_rows.csv"}, "emitted_at": 1697138054165}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 7, "name": "xZhh1Kyl", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:45.000000Z", "_ab_source_file_url": "csv_tests/skip_rows.csv"}, "emitted_at": 1697138054166}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 8, "name": "M2t286iJ", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:45.000000Z", "_ab_source_file_url": "csv_tests/skip_rows.csv"}, "emitted_at": 1697138054166}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_skip_rows_no_header.jsonl b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_skip_rows_no_header.jsonl
new file mode 100644
index 000000000000..fdf068745cd4
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_skip_rows_no_header.jsonl
@@ -0,0 +1,8 @@
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"f0": 1, "f1": "PVdhmjb1", "f2": false, "_ab_source_file_last_modified": "2023-10-12T17:22:14.000000Z", "_ab_source_file_url": "csv_tests/skip_rows_no_header.csv"}, "emitted_at": 1697190448512}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"f0": 2, "f1": "j4DyXTS7", "f2": true, "_ab_source_file_last_modified": "2023-10-12T17:22:14.000000Z", "_ab_source_file_url": "csv_tests/skip_rows_no_header.csv"}, "emitted_at": 1697190448514}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"f0": 3, "f1": "v0w8fTME", "f2": false, "_ab_source_file_last_modified": "2023-10-12T17:22:14.000000Z", "_ab_source_file_url": "csv_tests/skip_rows_no_header.csv"}, "emitted_at": 1697190448515}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"f0": 4, "f1": "1q6jD8Np", "f2": false, "_ab_source_file_last_modified": "2023-10-12T17:22:14.000000Z", "_ab_source_file_url": "csv_tests/skip_rows_no_header.csv"}, "emitted_at": 1697190448516}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"f0": 5, "f1": "77h4aiMP", "f2": true, "_ab_source_file_last_modified": "2023-10-12T17:22:14.000000Z", "_ab_source_file_url": "csv_tests/skip_rows_no_header.csv"}, "emitted_at": 1697190448516}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"f0": 6, "f1": "Le35Wyic", "f2": true, "_ab_source_file_last_modified": "2023-10-12T17:22:14.000000Z", "_ab_source_file_url": "csv_tests/skip_rows_no_header.csv"}, "emitted_at": 1697190448517}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"f0": 7, "f1": "xZhh1Kyl", "f2": false, "_ab_source_file_last_modified": "2023-10-12T17:22:14.000000Z", "_ab_source_file_url": "csv_tests/skip_rows_no_header.csv"}, "emitted_at": 1697190448517}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"f0": 8, "f1": "M2t286iJ", "f2": false, "_ab_source_file_last_modified": "2023-10-12T17:22:14.000000Z", "_ab_source_file_url": "csv_tests/skip_rows_no_header.csv"}, "emitted_at": 1697190448518}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_user_schema.jsonl b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_user_schema.jsonl
new file mode 100644
index 000000000000..4d4e8c269680
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_user_schema.jsonl
@@ -0,0 +1,8 @@
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 1.0, "name": "PVdhmjb1", "valid": false, "valid_string": "False", "array": "[\"a\", \"b\", \"c\"]", "dict": "{\"key\": \"value\"}", "_ab_source_file_last_modified": "2023-10-12T15:27:45.000000Z", "_ab_source_file_url": "csv_tests/user_schema.csv"}, "emitted_at": 1697190171210}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 2.0, "name": "j4DyXTS7", "valid": true, "valid_string": "True", "array": "[\"a\", \"b\"]", "dict": "{\"key\": \"value_with_comma\\,\"}", "_ab_source_file_last_modified": "2023-10-12T15:27:45.000000Z", "_ab_source_file_url": "csv_tests/user_schema.csv"}, "emitted_at": 1697190171213}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 3.0, "name": "v0w8fTME", "valid": false, "valid_string": "False", "array": "[\"a\"]", "dict": "{\"key\": \"value\"}", "_ab_source_file_last_modified": "2023-10-12T15:27:45.000000Z", "_ab_source_file_url": "csv_tests/user_schema.csv"}, "emitted_at": 1697190171214}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 4.0, "name": "1q6jD8Np", "valid": false, "valid_string": "False", "array": "[]", "dict": "{}", "_ab_source_file_last_modified": "2023-10-12T15:27:45.000000Z", "_ab_source_file_url": "csv_tests/user_schema.csv"}, "emitted_at": 1697190171214}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 5.0, "name": "77h4aiMP", "valid": true, "valid_string": "True", "array": "[\"b\", \"c\"]", "dict": "{}", "_ab_source_file_last_modified": "2023-10-12T15:27:45.000000Z", "_ab_source_file_url": "csv_tests/user_schema.csv"}, "emitted_at": 1697190171215}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 6.0, "name": "Le35Wyic", "valid": true, "valid_string": "True", "array": "[\"a\", \"c\"]", "dict": "{}", "_ab_source_file_last_modified": "2023-10-12T15:27:45.000000Z", "_ab_source_file_url": "csv_tests/user_schema.csv"}, "emitted_at": 1697190171216}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 7.0, "name": "xZhh1Kyl", "valid": false, "valid_string": "False", "array": "[\"b\"]", "dict": "{}", "_ab_source_file_last_modified": "2023-10-12T15:27:45.000000Z", "_ab_source_file_url": "csv_tests/user_schema.csv"}, "emitted_at": 1697190171216}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 8.0, "name": "M2t286iJ", "valid": false, "valid_string": "False", "array": "[\"c\"]", "dict": "{}", "_ab_source_file_last_modified": "2023-10-12T15:27:45.000000Z", "_ab_source_file_url": "csv_tests/user_schema.csv"}, "emitted_at": 1697190171217}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_with_null_bools.jsonl b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_with_null_bools.jsonl
new file mode 100644
index 000000000000..5f67bca607c9
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_with_null_bools.jsonl
@@ -0,0 +1,8 @@
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 1, "name": "null", "valid": null, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/csv_with_null_bools.csv"}, "emitted_at": 1697138273346}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 2, "name": "j4DyXTS7", "valid": true, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/csv_with_null_bools.csv"}, "emitted_at": 1697138273348}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 3, "name": "NULL", "valid": null, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/csv_with_null_bools.csv"}, "emitted_at": 1697138273349}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 4, "name": "1q6jD8Np", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/csv_with_null_bools.csv"}, "emitted_at": 1697138273350}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 5, "name": "77h4aiMP", "valid": null, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/csv_with_null_bools.csv"}, "emitted_at": 1697138273351}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 6, "name": "", "valid": true, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/csv_with_null_bools.csv"}, "emitted_at": 1697138273352}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 7, "name": "xZhh1Kyl", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/csv_with_null_bools.csv"}, "emitted_at": 1697138273352}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 8, "name": "M2t286iJ", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:43.000000Z", "_ab_source_file_url": "csv_tests/csv_with_null_bools.csv"}, "emitted_at": 1697138273353}
\ No newline at end of file
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_with_nulls.jsonl b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_with_nulls.jsonl
new file mode 100644
index 000000000000..c5e78a965126
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/csv_with_nulls.jsonl
@@ -0,0 +1,8 @@
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 1, "name": null, "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:53.000000Z", "_ab_source_file_url": "csv_tests/csv_with_nulls.csv"}, "emitted_at": 1697138317244}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 2, "name": "j4DyXTS7", "valid": true, "_ab_source_file_last_modified": "2023-10-12T15:27:53.000000Z", "_ab_source_file_url": "csv_tests/csv_with_nulls.csv"}, "emitted_at": 1697138317246}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 3, "name": null, "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:53.000000Z", "_ab_source_file_url": "csv_tests/csv_with_nulls.csv"}, "emitted_at": 1697138317247}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 4, "name": "1q6jD8Np", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:53.000000Z", "_ab_source_file_url": "csv_tests/csv_with_nulls.csv"}, "emitted_at": 1697138317247}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 5, "name": "77h4aiMP", "valid": true, "_ab_source_file_last_modified": "2023-10-12T15:27:53.000000Z", "_ab_source_file_url": "csv_tests/csv_with_nulls.csv"}, "emitted_at": 1697138317248}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 6, "name": "Le35Wyic", "valid": true, "_ab_source_file_last_modified": "2023-10-12T15:27:53.000000Z", "_ab_source_file_url": "csv_tests/csv_with_nulls.csv"}, "emitted_at": 1697138317248}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 7, "name": "xZhh1Kyl", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:53.000000Z", "_ab_source_file_url": "csv_tests/csv_with_nulls.csv"}, "emitted_at": 1697138317249}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 8, "name": "M2t286iJ", "valid": false, "_ab_source_file_last_modified": "2023-10-12T15:27:53.000000Z", "_ab_source_file_url": "csv_tests/csv_with_nulls.csv"}, "emitted_at": 1697138317250}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/jsonl.jsonl b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/jsonl.jsonl
new file mode 100644
index 000000000000..363228ec8e7c
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/jsonl.jsonl
@@ -0,0 +1,2 @@
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 1, "name": "PVdhmjb1", "valid": false, "value": 1.2, "event_date": "2022-01-01T00:00:00Z", "_ab_source_file_last_modified": "2023-10-12T15:27:30.000000Z", "_ab_source_file_url": "simple_test.jsonl"}, "emitted_at": 1697137760681}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 2, "name": "ABCDEF", "valid": true, "value": 1, "event_date": "2023-01-01T00:00:00Z", "_ab_source_file_last_modified": "2023-10-12T15:27:30.000000Z", "_ab_source_file_url": "simple_test.jsonl"}, "emitted_at": 1697137760683}
\ No newline at end of file
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/jsonl_newlines.jsonl b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/jsonl_newlines.jsonl
new file mode 100644
index 000000000000..502cd837cc66
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/jsonl_newlines.jsonl
@@ -0,0 +1,2 @@
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 1, "name": "PVdhmjb1", "valid": false, "value": 1.2, "event_date": "2022-01-01T00:00:00Z", "_ab_source_file_last_modified": "2023-10-12T15:27:32.000000Z", "_ab_source_file_url": "simple_test_newlines.jsonl"}, "emitted_at": 1697137820278}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"id": 2, "name": "ABCDEF", "valid": true, "value": 1, "event_date": "2023-01-01T00:00:00Z", "_ab_source_file_last_modified": "2023-10-12T15:27:32.000000Z", "_ab_source_file_url": "simple_test_newlines.jsonl"}, "emitted_at": 1697137820280}
\ No newline at end of file
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/parquet.jsonl b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/parquet.jsonl
new file mode 100644
index 000000000000..75df4b4a4752
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/expected_records/parquet.jsonl
@@ -0,0 +1,15 @@
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"Payroll_Number": 820.0, "Last_Name": "SCHWARTZ", "First_Name": "CHANA", "Mid_Init": "H", "Agency_Start_Date": "07/05/2010", "Work_Location_Borough": null, "Title_Description": "*ATTORNEY AT LAW", "Base_Salary": 77015.0, "Regular_Hours": 1046.25, "Regular_Gross_Paid": 47316.74, "OT_Hours": 0.0, "Total_OT_Paid": 0.0, "Total_Other_Pay": 8230.31, "Fiscal_Year": "2021", "Leave_Status_as_of_June_30": "ON LEAVE", "Pay_Basis": "per Annum", "_ab_source_file_last_modified": "2023-10-12T15:28:44.000000Z", "_ab_source_file_url": "test_payroll/Fiscal_Year=2021/Leave_Status_as_of_June_30=ON%20LEAVE/Pay_Basis=per%20Annum/4e0ea65c5a074c0592e43f7b950f3ce8-0.parquet"}, "emitted_at": 1697190604125}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"Payroll_Number": 820.0, "Last_Name": "WASHINGTON", "First_Name": "DOROTHY", "Mid_Init": null, "Agency_Start_Date": "07/05/2010", "Work_Location_Borough": null, "Title_Description": "ADM MANAGER-NON-MGRL FROM M1/M2", "Base_Salary": 53373.0, "Regular_Hours": 1825.0, "Regular_Gross_Paid": 47436.44, "OT_Hours": 0.0, "Total_OT_Paid": 0.0, "Total_Other_Pay": 1723.17, "Fiscal_Year": "2021", "Leave_Status_as_of_June_30": "ON LEAVE", "Pay_Basis": "per Annum", "_ab_source_file_last_modified": "2023-10-12T15:28:44.000000Z", "_ab_source_file_url": "test_payroll/Fiscal_Year=2021/Leave_Status_as_of_June_30=ON%20LEAVE/Pay_Basis=per%20Annum/4e0ea65c5a074c0592e43f7b950f3ce8-0.parquet"}, "emitted_at": 1697190604128}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"Payroll_Number": 820.0, "Last_Name": "DU", "First_Name": "MARK", "Mid_Init": null, "Agency_Start_Date": "03/24/2014", "Work_Location_Borough": null, "Title_Description": "HEARING OFFICER", "Base_Salary": 36.6, "Regular_Hours": 188.75, "Regular_Gross_Paid": 5334.45, "OT_Hours": 0.0, "Total_OT_Paid": 0.0, "Total_Other_Pay": 0.0, "Fiscal_Year": "2021", "Leave_Status_as_of_June_30": "ACTIVE", "Pay_Basis": "per Hour", "_ab_source_file_last_modified": "2023-10-12T15:28:45.000000Z", "_ab_source_file_url": "test_payroll/Fiscal_Year=2021/Leave_Status_as_of_June_30=ACTIVE/Pay_Basis=per%20Hour/4e0ea65c5a074c0592e43f7b950f3ce8-0.parquet"}, "emitted_at": 1697190608928}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"Payroll_Number": 820.0, "Last_Name": "BIEBEL", "First_Name": "ANN", "Mid_Init": "M", "Agency_Start_Date": "07/05/2010", "Work_Location_Borough": null, "Title_Description": "*ATTORNEY AT LAW", "Base_Salary": 77015.0, "Regular_Hours": 1825.0, "Regular_Gross_Paid": 76804.0, "OT_Hours": 0.0, "Total_OT_Paid": 0.0, "Total_Other_Pay": 13750.36, "Fiscal_Year": "2021", "Leave_Status_as_of_June_30": "ACTIVE", "Pay_Basis": "per Annum", "_ab_source_file_last_modified": "2023-10-12T15:28:46.000000Z", "_ab_source_file_url": "test_payroll/Fiscal_Year=2021/Leave_Status_as_of_June_30=ACTIVE/Pay_Basis=per%20Annum/4e0ea65c5a074c0592e43f7b950f3ce8-0.parquet"}, "emitted_at": 1697190612045}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"Payroll_Number": 820.0, "Last_Name": "CARROLL", "First_Name": "FRAN", "Mid_Init": null, "Agency_Start_Date": "07/05/2010", "Work_Location_Borough": null, "Title_Description": "*ATTORNEY AT LAW", "Base_Salary": 77015.0, "Regular_Hours": 1825.0, "Regular_Gross_Paid": 76804.0, "OT_Hours": 0.0, "Total_OT_Paid": 0.0, "Total_Other_Pay": 13750.36, "Fiscal_Year": "2021", "Leave_Status_as_of_June_30": "ACTIVE", "Pay_Basis": "per Annum", "_ab_source_file_last_modified": "2023-10-12T15:28:46.000000Z", "_ab_source_file_url": "test_payroll/Fiscal_Year=2021/Leave_Status_as_of_June_30=ACTIVE/Pay_Basis=per%20Annum/4e0ea65c5a074c0592e43f7b950f3ce8-0.parquet"}, "emitted_at": 1697190612046}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"Payroll_Number": 820.0, "Last_Name": "BROWNSTEIN", "First_Name": "ELFREDA", "Mid_Init": "G", "Agency_Start_Date": "07/05/2010", "Work_Location_Borough": null, "Title_Description": "*ATTORNEY AT LAW", "Base_Salary": 83504.0, "Regular_Hours": 1825.0, "Regular_Gross_Paid": 83275.15, "OT_Hours": 0.0, "Total_OT_Paid": 0.0, "Total_Other_Pay": 13750.36, "Fiscal_Year": "2021", "Leave_Status_as_of_June_30": "ACTIVE", "Pay_Basis": "per Annum", "_ab_source_file_last_modified": "2023-10-12T15:28:46.000000Z", "_ab_source_file_url": "test_payroll/Fiscal_Year=2021/Leave_Status_as_of_June_30=ACTIVE/Pay_Basis=per%20Annum/4e0ea65c5a074c0592e43f7b950f3ce8-0.parquet"}, "emitted_at": 1697190612048}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"Payroll_Number": 820.0, "Last_Name": "WARD", "First_Name": "RENEE", "Mid_Init": null, "Agency_Start_Date": "07/05/2010", "Work_Location_Borough": null, "Title_Description": "ADM MANAGER-NON-MGRL FROM M1/M2", "Base_Salary": 53373.0, "Regular_Hours": 1825.0, "Regular_Gross_Paid": 46588.76, "OT_Hours": 0.0, "Total_OT_Paid": 0.0, "Total_Other_Pay": 3409.69, "Fiscal_Year": "2021", "Leave_Status_as_of_June_30": "ACTIVE", "Pay_Basis": "per Annum", "_ab_source_file_last_modified": "2023-10-12T15:28:46.000000Z", "_ab_source_file_url": "test_payroll/Fiscal_Year=2021/Leave_Status_as_of_June_30=ACTIVE/Pay_Basis=per%20Annum/4e0ea65c5a074c0592e43f7b950f3ce8-0.parquet"}, "emitted_at": 1697190612050}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"Payroll_Number": 820.0, "Last_Name": "SPIVEY", "First_Name": "NATASHA", "Mid_Init": "L", "Agency_Start_Date": "07/05/2010", "Work_Location_Borough": null, "Title_Description": "ADM MANAGER-NON-MGRL FROM M1/M2", "Base_Salary": 53436.0, "Regular_Hours": 1825.0, "Regular_Gross_Paid": 53289.6, "OT_Hours": 0.0, "Total_OT_Paid": 0.0, "Total_Other_Pay": 0.0, "Fiscal_Year": "2021", "Leave_Status_as_of_June_30": "ACTIVE", "Pay_Basis": "per Annum", "_ab_source_file_last_modified": "2023-10-12T15:28:46.000000Z", "_ab_source_file_url": "test_payroll/Fiscal_Year=2021/Leave_Status_as_of_June_30=ACTIVE/Pay_Basis=per%20Annum/4e0ea65c5a074c0592e43f7b950f3ce8-0.parquet"}, "emitted_at": 1697190612052}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"Payroll_Number": 820.0, "Last_Name": "SAMUEL", "First_Name": "GRACE", "Mid_Init": "Y", "Agency_Start_Date": "07/05/2010", "Work_Location_Borough": null, "Title_Description": "ADM MANAGER-NON-MGRL FROM M1/M2", "Base_Salary": 55337.0, "Regular_Hours": 1825.0, "Regular_Gross_Paid": 55185.52, "OT_Hours": 0.0, "Total_OT_Paid": 0.0, "Total_Other_Pay": 0.0, "Fiscal_Year": "2022", "Leave_Status_as_of_June_30": "ON LEAVE", "Pay_Basis": "per Annum", "_ab_source_file_last_modified": "2023-10-12T15:28:48.000000Z", "_ab_source_file_url": "test_payroll/Fiscal_Year=2022/Leave_Status_as_of_June_30=ON%20LEAVE/Pay_Basis=per%20Annum/4e0ea65c5a074c0592e43f7b950f3ce8-0.parquet"}, "emitted_at": 1697190614835}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"Payroll_Number": 820.0, "Last_Name": "THEIL", "First_Name": "JOANNE", "Mid_Init": "F", "Agency_Start_Date": "07/05/2010", "Work_Location_Borough": null, "Title_Description": "*ATTORNEY AT LAW", "Base_Salary": 80438.0, "Regular_Hours": 1825.0, "Regular_Gross_Paid": 80217.55, "OT_Hours": 0.0, "Total_OT_Paid": 0.0, "Total_Other_Pay": 13635.42, "Fiscal_Year": "2022", "Leave_Status_as_of_June_30": "ACTIVE", "Pay_Basis": "per Annum", "_ab_source_file_last_modified": "2023-10-12T15:28:49.000000Z", "_ab_source_file_url": "test_payroll/Fiscal_Year=2022/Leave_Status_as_of_June_30=ACTIVE/Pay_Basis=per%20Annum/4e0ea65c5a074c0592e43f7b950f3ce8-0.parquet"}, "emitted_at": 1697190617470}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"Payroll_Number": 820.0, "Last_Name": "DEMAIO", "First_Name": "DEIRDRE", "Mid_Init": null, "Agency_Start_Date": "07/05/2010", "Work_Location_Borough": null, "Title_Description": "ADM MANAGER-NON-MGRL FROM M1/M2", "Base_Salary": 53512.0, "Regular_Hours": 1780.0, "Regular_Gross_Paid": 48727.47, "OT_Hours": 0.0, "Total_OT_Paid": 0.0, "Total_Other_Pay": 3318.35, "Fiscal_Year": "2022", "Leave_Status_as_of_June_30": "ACTIVE", "Pay_Basis": "per Annum", "_ab_source_file_last_modified": "2023-10-12T15:28:49.000000Z", "_ab_source_file_url": "test_payroll/Fiscal_Year=2022/Leave_Status_as_of_June_30=ACTIVE/Pay_Basis=per%20Annum/4e0ea65c5a074c0592e43f7b950f3ce8-0.parquet"}, "emitted_at": 1697190617472}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"Payroll_Number": 820.0, "Last_Name": "MCLAURIN TRAPP", "First_Name": "CELESTINE", "Mid_Init": "T", "Agency_Start_Date": "07/05/2010", "Work_Location_Borough": null, "Title_Description": "ADM MANAGER-NON-MGRL FROM M1/M2", "Base_Salary": 58951.0, "Regular_Hours": 1818.0, "Regular_Gross_Paid": 58563.27, "OT_Hours": 0.0, "Total_OT_Paid": 0.0, "Total_Other_Pay": 8.25, "Fiscal_Year": "2022", "Leave_Status_as_of_June_30": "ACTIVE", "Pay_Basis": "per Annum", "_ab_source_file_last_modified": "2023-10-12T15:28:49.000000Z", "_ab_source_file_url": "test_payroll/Fiscal_Year=2022/Leave_Status_as_of_June_30=ACTIVE/Pay_Basis=per%20Annum/4e0ea65c5a074c0592e43f7b950f3ce8-0.parquet"}, "emitted_at": 1697190617472}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"Payroll_Number": 820.0, "Last_Name": "BUNDRANT", "First_Name": "TROY", "Mid_Init": null, "Agency_Start_Date": "07/05/2010", "Work_Location_Borough": null, "Title_Description": "ADM MANAGER-NON-MGRL FROM M1/M2", "Base_Salary": 64769.0, "Regular_Hours": 1825.0, "Regular_Gross_Paid": 61817.94, "OT_Hours": 62.0, "Total_OT_Paid": 2576.58, "Total_Other_Pay": 106.68, "Fiscal_Year": "2022", "Leave_Status_as_of_June_30": "ACTIVE", "Pay_Basis": "per Annum", "_ab_source_file_last_modified": "2023-10-12T15:28:49.000000Z", "_ab_source_file_url": "test_payroll/Fiscal_Year=2022/Leave_Status_as_of_June_30=ACTIVE/Pay_Basis=per%20Annum/4e0ea65c5a074c0592e43f7b950f3ce8-0.parquet"}, "emitted_at": 1697190617473}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"Payroll_Number": 820.0, "Last_Name": "CHASE JONES", "First_Name": "DIANA", "Mid_Init": null, "Agency_Start_Date": "07/05/2010", "Work_Location_Borough": null, "Title_Description": "ADM MANAGER-NON-MGRL FROM M1/M2", "Base_Salary": 66000.0, "Regular_Hours": 1825.0, "Regular_Gross_Paid": 65819.25, "OT_Hours": 0.0, "Total_OT_Paid": 0.0, "Total_Other_Pay": 0.0, "Fiscal_Year": "2022", "Leave_Status_as_of_June_30": "ACTIVE", "Pay_Basis": "per Annum", "_ab_source_file_last_modified": "2023-10-12T15:28:49.000000Z", "_ab_source_file_url": "test_payroll/Fiscal_Year=2022/Leave_Status_as_of_June_30=ACTIVE/Pay_Basis=per%20Annum/4e0ea65c5a074c0592e43f7b950f3ce8-0.parquet"}, "emitted_at": 1697190617473}
+{"stream": "airbyte-source-azure-blob-storage-test", "data": {"Payroll_Number": 820.0, "Last_Name": "JORDAN", "First_Name": "REGINALD", "Mid_Init": null, "Agency_Start_Date": "07/05/2010", "Work_Location_Borough": null, "Title_Description": "ADM MANAGER-NON-MGRL FROM M1/M2", "Base_Salary": 75000.0, "Regular_Hours": 1825.0, "Regular_Gross_Paid": 74794.46, "OT_Hours": 0.0, "Total_OT_Paid": 0.0, "Total_Other_Pay": 0.0, "Fiscal_Year": "2022", "Leave_Status_as_of_June_30": "ACTIVE", "Pay_Basis": "per Annum", "_ab_source_file_last_modified": "2023-10-12T15:28:49.000000Z", "_ab_source_file_url": "test_payroll/Fiscal_Year=2022/Leave_Status_as_of_June_30=ACTIVE/Pay_Basis=per%20Annum/4e0ea65c5a074c0592e43f7b950f3ce8-0.parquet"}, "emitted_at": 1697190617474}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/spec.json b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/spec.json
new file mode 100644
index 000000000000..156729699c45
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/integration_tests/spec.json
@@ -0,0 +1,323 @@
+{
+ "documentationUrl": "https://docs.airbyte.com/integrations/sources/azure-blob-storage",
+ "connectionSpecification": {
+ "title": "Config",
+ "description": "NOTE: When this Spec is changed, legacy_config_transformer.py must also be modified to uptake the changes\nbecause it is responsible for converting legacy Azure Blob Storage v0 configs into v1 configs using the File-Based CDK.",
+ "type": "object",
+ "properties": {
+ "start_date": {
+ "title": "Start Date",
+ "description": "UTC date and time in the format 2017-01-25T00:00:00.000000Z. Any file modified before this date will not be replicated.",
+ "examples": ["2021-01-01T00:00:00.000000Z"],
+ "format": "date-time",
+ "pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]{6}Z$",
+ "pattern_descriptor": "YYYY-MM-DDTHH:mm:ss.SSSSSSZ",
+ "order": 1,
+ "type": "string"
+ },
+ "streams": {
+ "title": "The list of streams to sync",
+ "description": "Each instance of this configuration defines a stream. Use this to define which files belong in the stream, their format, and how they should be parsed and validated. When sending data to warehouse destination such as Snowflake or BigQuery, each stream is a separate table.",
+ "order": 10,
+ "type": "array",
+ "items": {
+ "title": "FileBasedStreamConfig",
+ "type": "object",
+ "properties": {
+ "name": {
+ "title": "Name",
+ "description": "The name of the stream.",
+ "type": "string"
+ },
+ "globs": {
+ "title": "Globs",
+ "description": "The pattern used to specify which files should be selected from the file system. For more information on glob pattern matching look here.",
+ "type": "array",
+ "items": {
+ "type": "string"
+ }
+ },
+ "legacy_prefix": {
+ "title": "Legacy Prefix",
+ "description": "The path prefix configured in v3 versions of the S3 connector. This option is deprecated in favor of a single glob.",
+ "airbyte_hidden": true,
+ "type": "string"
+ },
+ "validation_policy": {
+ "title": "Validation Policy",
+ "description": "The name of the validation policy that dictates sync behavior when a record does not adhere to the stream schema.",
+ "default": "Emit Record",
+ "enum": ["Emit Record", "Skip Record", "Wait for Discover"]
+ },
+ "input_schema": {
+ "title": "Input Schema",
+ "description": "The schema that will be used to validate records extracted from the file. This will override the stream schema that is auto-detected from incoming files.",
+ "type": "string"
+ },
+ "primary_key": {
+ "title": "Primary Key",
+ "description": "The column or columns (for a composite key) that serves as the unique identifier of a record.",
+ "type": "string"
+ },
+ "days_to_sync_if_history_is_full": {
+ "title": "Days To Sync If History Is Full",
+ "description": "When the state history of the file store is full, syncs will only read files that were last modified in the provided day range.",
+ "default": 3,
+ "type": "integer"
+ },
+ "format": {
+ "title": "Format",
+ "description": "The configuration options that are used to alter how to read incoming files that deviate from the standard formatting.",
+ "type": "object",
+ "oneOf": [
+ {
+ "title": "Avro Format",
+ "type": "object",
+ "properties": {
+ "filetype": {
+ "title": "Filetype",
+ "default": "avro",
+ "const": "avro",
+ "type": "string"
+ },
+ "double_as_string": {
+ "title": "Convert Double Fields to Strings",
+ "description": "Whether to convert double fields to strings. This is recommended if you have decimal numbers with a high degree of precision because there can be a loss precision when handling floating point numbers.",
+ "default": false,
+ "type": "boolean"
+ }
+ }
+ },
+ {
+ "title": "CSV Format",
+ "type": "object",
+ "properties": {
+ "filetype": {
+ "title": "Filetype",
+ "default": "csv",
+ "const": "csv",
+ "type": "string"
+ },
+ "delimiter": {
+ "title": "Delimiter",
+ "description": "The character delimiting individual cells in the CSV data. This may only be a 1-character string. For tab-delimited data enter '\\t'.",
+ "default": ",",
+ "type": "string"
+ },
+ "quote_char": {
+ "title": "Quote Character",
+ "description": "The character used for quoting CSV values. To disallow quoting, make this field blank.",
+ "default": "\"",
+ "type": "string"
+ },
+ "escape_char": {
+ "title": "Escape Character",
+ "description": "The character used for escaping special characters. To disallow escaping, leave this field blank.",
+ "type": "string"
+ },
+ "encoding": {
+ "title": "Encoding",
+ "description": "The character encoding of the CSV data. Leave blank to default to UTF8. See list of python encodings for allowable options.",
+ "default": "utf8",
+ "type": "string"
+ },
+ "double_quote": {
+ "title": "Double Quote",
+ "description": "Whether two quotes in a quoted CSV value denote a single quote in the data.",
+ "default": true,
+ "type": "boolean"
+ },
+ "null_values": {
+ "title": "Null Values",
+ "description": "A set of case-sensitive strings that should be interpreted as null values. For example, if the value 'NA' should be interpreted as null, enter 'NA' in this field.",
+ "default": [],
+ "type": "array",
+ "items": {
+ "type": "string"
+ },
+ "uniqueItems": true
+ },
+ "strings_can_be_null": {
+ "title": "Strings Can Be Null",
+ "description": "Whether strings can be interpreted as null values. If true, strings that match the null_values set will be interpreted as null. If false, strings that match the null_values set will be interpreted as the string itself.",
+ "default": true,
+ "type": "boolean"
+ },
+ "skip_rows_before_header": {
+ "title": "Skip Rows Before Header",
+ "description": "The number of rows to skip before the header row. For example, if the header row is on the 3rd row, enter 2 in this field.",
+ "default": 0,
+ "type": "integer"
+ },
+ "skip_rows_after_header": {
+ "title": "Skip Rows After Header",
+ "description": "The number of rows to skip after the header row.",
+ "default": 0,
+ "type": "integer"
+ },
+ "header_definition": {
+ "title": "CSV Header Definition",
+ "description": "How headers will be defined. `User Provided` assumes the CSV does not have a header row and uses the headers provided and `Autogenerated` assumes the CSV does not have a header row and the CDK will generate headers using for `f{i}` where `i` is the index starting from 0. Else, the default behavior is to use the header from the CSV file. If a user wants to autogenerate or provide column names for a CSV having headers, they can skip rows.",
+ "default": {
+ "header_definition_type": "From CSV"
+ },
+ "oneOf": [
+ {
+ "title": "From CSV",
+ "type": "object",
+ "properties": {
+ "header_definition_type": {
+ "title": "Header Definition Type",
+ "default": "From CSV",
+ "const": "From CSV",
+ "type": "string"
+ }
+ }
+ },
+ {
+ "title": "Autogenerated",
+ "type": "object",
+ "properties": {
+ "header_definition_type": {
+ "title": "Header Definition Type",
+ "default": "Autogenerated",
+ "const": "Autogenerated",
+ "type": "string"
+ }
+ }
+ },
+ {
+ "title": "User Provided",
+ "type": "object",
+ "properties": {
+ "header_definition_type": {
+ "title": "Header Definition Type",
+ "default": "User Provided",
+ "const": "User Provided",
+ "type": "string"
+ },
+ "column_names": {
+ "title": "Column Names",
+ "description": "The column names that will be used while emitting the CSV records",
+ "type": "array",
+ "items": {
+ "type": "string"
+ }
+ }
+ },
+ "required": ["column_names"]
+ }
+ ],
+ "type": "object"
+ },
+ "true_values": {
+ "title": "True Values",
+ "description": "A set of case-sensitive strings that should be interpreted as true values.",
+ "default": ["y", "yes", "t", "true", "on", "1"],
+ "type": "array",
+ "items": {
+ "type": "string"
+ },
+ "uniqueItems": true
+ },
+ "false_values": {
+ "title": "False Values",
+ "description": "A set of case-sensitive strings that should be interpreted as false values.",
+ "default": ["n", "no", "f", "false", "off", "0"],
+ "type": "array",
+ "items": {
+ "type": "string"
+ },
+ "uniqueItems": true
+ },
+ "inference_type": {
+ "title": "Inference Type",
+ "description": "How to infer the types of the columns. If none, inference default to strings.",
+ "default": "None",
+ "airbyte_hidden": true,
+ "enum": ["None", "Primitive Types Only"]
+ }
+ }
+ },
+ {
+ "title": "Jsonl Format",
+ "type": "object",
+ "properties": {
+ "filetype": {
+ "title": "Filetype",
+ "default": "jsonl",
+ "const": "jsonl",
+ "type": "string"
+ }
+ }
+ },
+ {
+ "title": "Parquet Format",
+ "type": "object",
+ "properties": {
+ "filetype": {
+ "title": "Filetype",
+ "default": "parquet",
+ "const": "parquet",
+ "type": "string"
+ },
+ "decimal_as_float": {
+ "title": "Convert Decimal Fields to Floats",
+ "description": "Whether to convert decimal fields to floats. There is a loss of precision when converting decimals to floats, so this is not recommended.",
+ "default": false,
+ "type": "boolean"
+ }
+ }
+ }
+ ]
+ },
+ "schemaless": {
+ "title": "Schemaless",
+ "description": "When enabled, syncs will not validate or structure records against the stream's schema.",
+ "default": false,
+ "type": "boolean"
+ }
+ },
+ "required": ["name", "format"]
+ }
+ },
+ "azure_blob_storage_account_name": {
+ "title": "Azure Blob Storage account name",
+ "description": "The account's name of the Azure Blob Storage.",
+ "examples": ["airbyte5storage"],
+ "order": 2,
+ "type": "string"
+ },
+ "azure_blob_storage_account_key": {
+ "title": "Azure Blob Storage account key",
+ "description": "The Azure blob storage account key.",
+ "airbyte_secret": true,
+ "examples": [
+ "Z8ZkZpteggFx394vm+PJHnGTvdRncaYS+JhLKdj789YNmD+iyGTnG+PV+POiuYNhBg/ACS+LKjd%4FG3FHGN12Nd=="
+ ],
+ "order": 3,
+ "type": "string"
+ },
+ "azure_blob_storage_container_name": {
+ "title": "Azure blob storage container (Bucket) Name",
+ "description": "The name of the Azure blob storage container.",
+ "examples": ["airbytetescontainername"],
+ "order": 4,
+ "type": "string"
+ },
+ "azure_blob_storage_endpoint": {
+ "title": "Endpoint Domain Name",
+ "description": "This is Azure Blob Storage endpoint domain name. Leave default value (or leave it empty if run container from command line) to use Microsoft native from example.",
+ "examples": ["blob.core.windows.net"],
+ "order": 11,
+ "type": "string"
+ }
+ },
+ "required": [
+ "streams",
+ "azure_blob_storage_account_name",
+ "azure_blob_storage_account_key",
+ "azure_blob_storage_container_name"
+ ]
+ }
+}
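A config that satisfies this spec can be checked locally against the connection specification. A sketch, using the `jsonschema` package with placeholder account values:

```
# Sketch: validate a made-up config against connectionSpecification.
# All account values below are placeholders.
import json

import jsonschema

with open("integration_tests/spec.json") as f:
    spec = json.load(f)

config = {
    "azure_blob_storage_account_name": "airbyte5storage",
    "azure_blob_storage_account_key": "placeholder-key",
    "azure_blob_storage_container_name": "testcontainer",
    "streams": [
        {
            "name": "my_stream",
            "globs": ["**/*.jsonl"],
            "validation_policy": "Emit Record",
            "format": {"filetype": "jsonl"},
        }
    ],
}

# Raises jsonschema.ValidationError if the config does not conform.
jsonschema.validate(instance=config, schema=spec["connectionSpecification"])
```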
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/main.py b/airbyte-integrations/connectors/source-azure-blob-storage/main.py
new file mode 100644
index 000000000000..b3361a6556d7
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/main.py
@@ -0,0 +1,33 @@
+#
+# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
+#
+
+import sys
+import traceback
+from datetime import datetime
+
+from airbyte_cdk.entrypoint import AirbyteEntrypoint, launch
+from airbyte_cdk.models import AirbyteErrorTraceMessage, AirbyteMessage, AirbyteTraceMessage, TraceType, Type
+from source_azure_blob_storage import Config, SourceAzureBlobStorage, SourceAzureBlobStorageStreamReader
+
+if __name__ == "__main__":
+ args = sys.argv[1:]
+ catalog_path = AirbyteEntrypoint.extract_catalog(args)
+ try:
+ source = SourceAzureBlobStorage(SourceAzureBlobStorageStreamReader(), Config, catalog_path)
+ except Exception:
+ print(
+ AirbyteMessage(
+ type=Type.TRACE,
+ trace=AirbyteTraceMessage(
+ type=TraceType.ERROR,
+ emitted_at=int(datetime.now().timestamp() * 1000),
+ error=AirbyteErrorTraceMessage(
+ message="Error starting the sync. This could be due to an invalid configuration or catalog. Please contact Support for assistance.",
+ stack_trace=traceback.format_exc(),
+ ),
+ ),
+ ).json()
+ )
+ else:
+ launch(source, args)
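Because the source is constructed before `launch`, a broken catalog surfaces as an `AirbyteTraceMessage` on stdout rather than an unhandled traceback. A hypothetical way to observe this from the connector directory (file names below are illustrative):

```
# Sketch: a malformed catalog should make the except-branch above print a
# TRACE message instead of crashing. Paths are placeholders.
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "main.py", "read", "--config", "secrets/config.json", "--catalog", "malformed_catalog.json"],
    capture_output=True,
    text=True,
)
print(result.stdout)  # expect a JSON line with "type": "TRACE"
```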
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/metadata.yaml b/airbyte-integrations/connectors/source-azure-blob-storage/metadata.yaml
index 395b78ee078c..85c12ee3040d 100644
--- a/airbyte-integrations/connectors/source-azure-blob-storage/metadata.yaml
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/metadata.yaml
@@ -2,7 +2,7 @@ data:
connectorSubtype: file
connectorType: source
definitionId: fdaaba68-4875-4ed9-8fcd-4ae1e0a25093
- dockerImageTag: 0.1.0
+ dockerImageTag: 0.2.0
dockerRepository: airbyte/source-azure-blob-storage
githubIssueLabel: source-azure-blob-storage
icon: azureblobstorage.svg
@@ -16,7 +16,7 @@ data:
releaseStage: alpha
documentationUrl: https://docs.airbyte.com/integrations/sources/azure-blob-storage
tags:
- - language:java
+ - language:python
ab_internal:
sl: 100
ql: 100
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/requirements.txt b/airbyte-integrations/connectors/source-azure-blob-storage/requirements.txt
new file mode 100644
index 000000000000..7b9114ed5867
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/requirements.txt
@@ -0,0 +1,2 @@
+# This file is autogenerated -- only edit if you know what you are doing. Use setup.py for declaring dependencies.
+-e .
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/setup.py b/airbyte-integrations/connectors/source-azure-blob-storage/setup.py
new file mode 100644
index 000000000000..cfcb2ebfb9e1
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/setup.py
@@ -0,0 +1,23 @@
+#
+# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
+#
+
+
+from setuptools import find_packages, setup
+
+MAIN_REQUIREMENTS = ["airbyte-cdk>=0.51.17", "smart_open[azure]", "pytz", "fastavro==1.4.11", "pyarrow"]
+
+TEST_REQUIREMENTS = ["requests-mock~=1.9.3", "pytest-mock~=3.6.1", "pytest~=6.2"]
+
+setup(
+ name="source_azure_blob_storage",
+ description="Source implementation for Azure Blob Storage.",
+ author="Airbyte",
+ author_email="contact@airbyte.io",
+ packages=find_packages(),
+ install_requires=MAIN_REQUIREMENTS,
+ package_data={"": ["*.json", "*.yaml"]},
+ extras_require={
+ "tests": TEST_REQUIREMENTS,
+ },
+)
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/source_azure_blob_storage/__init__.py b/airbyte-integrations/connectors/source-azure-blob-storage/source_azure_blob_storage/__init__.py
new file mode 100644
index 000000000000..5ec5c4024c72
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/source_azure_blob_storage/__init__.py
@@ -0,0 +1,10 @@
+#
+# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
+#
+
+
+from .config import Config
+from .source import SourceAzureBlobStorage
+from .stream_reader import SourceAzureBlobStorageStreamReader
+
+__all__ = ["SourceAzureBlobStorage", "SourceAzureBlobStorageStreamReader", "Config"]
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/source_azure_blob_storage/config.py b/airbyte-integrations/connectors/source-azure-blob-storage/source_azure_blob_storage/config.py
new file mode 100644
index 000000000000..da222c774fbc
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/source_azure_blob_storage/config.py
@@ -0,0 +1,46 @@
+#
+# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
+#
+
+from typing import Optional
+
+from airbyte_cdk.sources.file_based.config.abstract_file_based_spec import AbstractFileBasedSpec
+from pydantic import AnyUrl, Field
+
+
+class Config(AbstractFileBasedSpec):
+ """
+    NOTE: When this Spec is changed, legacy_config_transformer.py must also be modified to incorporate the changes
+ because it is responsible for converting legacy Azure Blob Storage v0 configs into v1 configs using the File-Based CDK.
+ """
+
+ @classmethod
+ def documentation_url(cls) -> AnyUrl:
+ return AnyUrl("https://docs.airbyte.com/integrations/sources/azure-blob-storage", scheme="https")
+
+ azure_blob_storage_account_name: str = Field(
+ title="Azure Blob Storage account name",
+ description="The account's name of the Azure Blob Storage.",
+ examples=["airbyte5storage"],
+ order=2,
+ )
+ azure_blob_storage_account_key: str = Field(
+ title="Azure Blob Storage account key",
+ description="The Azure blob storage account key.",
+ airbyte_secret=True,
+ examples=["Z8ZkZpteggFx394vm+PJHnGTvdRncaYS+JhLKdj789YNmD+iyGTnG+PV+POiuYNhBg/ACS+LKjd%4FG3FHGN12Nd=="],
+ order=3,
+ )
+ azure_blob_storage_container_name: str = Field(
+ title="Azure blob storage container (Bucket) Name",
+ description="The name of the Azure blob storage container.",
+ examples=["airbytetescontainername"],
+ order=4,
+ )
+ azure_blob_storage_endpoint: Optional[str] = Field(
+ title="Endpoint Domain Name",
+ description="This is Azure Blob Storage endpoint domain name. Leave default value (or leave it empty if run container from "
+ "command line) to use Microsoft native from example.",
+ examples=["blob.core.windows.net"],
+ order=11,
+ )
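The spec in `integration_tests/spec.json` is generated from this pydantic model, so field metadata such as `order`, `airbyte_secret`, and the examples flow straight into the JSON schema. A minimal sketch, assuming pydantic v1 as used by the CDK here:

```
# Sketch: dump the JSON schema derived from the Config model; the output
# should match connectionSpecification in integration_tests/spec.json.
import json

from source_azure_blob_storage import Config

print(json.dumps(Config.schema(), indent=2))
```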
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/source_azure_blob_storage/legacy_config_transformer.py b/airbyte-integrations/connectors/source-azure-blob-storage/source_azure_blob_storage/legacy_config_transformer.py
new file mode 100644
index 000000000000..e3c316d3ec0d
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/source_azure_blob_storage/legacy_config_transformer.py
@@ -0,0 +1,31 @@
+#
+# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
+#
+
+from typing import Any, Mapping, MutableMapping
+
+
+class LegacyConfigTransformer:
+ """
+ Class that takes in Azure Blob Storage source configs in the legacy format and transforms them into
+ configs that can be used by the new Azure Blob Storage source built with the file-based CDK.
+ """
+
+ @classmethod
+ def convert(cls, legacy_config: Mapping) -> MutableMapping[str, Any]:
+ azure_blob_storage_blobs_prefix = legacy_config.get("azure_blob_storage_blobs_prefix", "")
+
+ return {
+ "azure_blob_storage_endpoint": legacy_config.get("azure_blob_storage_endpoint", None),
+ "azure_blob_storage_account_name": legacy_config["azure_blob_storage_account_name"],
+ "azure_blob_storage_account_key": legacy_config["azure_blob_storage_account_key"],
+ "azure_blob_storage_container_name": legacy_config["azure_blob_storage_container_name"],
+ "streams": [
+ {
+ "name": legacy_config["azure_blob_storage_container_name"],
+ "legacy_prefix": azure_blob_storage_blobs_prefix,
+ "validation_policy": "Emit Record",
+ "format": {"filetype": "jsonl"},
+ }
+ ],
+ }
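For illustration, converting a hypothetical v0 config yields a single JSONL stream named after the container, with the old blobs prefix carried over as `legacy_prefix`:

```
# Sketch: all values below are placeholders.
from source_azure_blob_storage.legacy_config_transformer import LegacyConfigTransformer

legacy = {
    "azure_blob_storage_account_name": "airbyte5storage",
    "azure_blob_storage_account_key": "placeholder-key",
    "azure_blob_storage_container_name": "testcontainer",
    "azure_blob_storage_blobs_prefix": "FolderA/FolderB/",
}

converted = LegacyConfigTransformer.convert(legacy)
assert converted["streams"][0]["name"] == "testcontainer"
assert converted["streams"][0]["legacy_prefix"] == "FolderA/FolderB/"
assert converted["streams"][0]["format"] == {"filetype": "jsonl"}
```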
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/source_azure_blob_storage/source.py b/airbyte-integrations/connectors/source-azure-blob-storage/source_azure_blob_storage/source.py
new file mode 100644
index 000000000000..419119bb3ef8
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/source_azure_blob_storage/source.py
@@ -0,0 +1,29 @@
+#
+# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
+#
+
+from typing import Any, Mapping
+
+from airbyte_cdk.config_observation import emit_configuration_as_airbyte_control_message
+from airbyte_cdk.sources.file_based.file_based_source import FileBasedSource
+
+from .legacy_config_transformer import LegacyConfigTransformer
+
+
+class SourceAzureBlobStorage(FileBasedSource):
+ def read_config(self, config_path: str) -> Mapping[str, Any]:
+ """
+ Used to override the default read_config so that when the new file-based Azure Blob Storage connector processes a config
+ in the legacy format, it can be transformed into the new config. This happens in entrypoint before we
+ validate the config against the new spec.
+ """
+ config = super().read_config(config_path)
+ if not self._is_v1_config(config):
+ converted_config = LegacyConfigTransformer.convert(config)
+ emit_configuration_as_airbyte_control_message(converted_config)
+ return converted_config
+ return config
+
+ @staticmethod
+ def _is_v1_config(config: Mapping[str, Any]) -> bool:
+ return "streams" in config
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/source_azure_blob_storage/stream_reader.py b/airbyte-integrations/connectors/source-azure-blob-storage/source_azure_blob_storage/stream_reader.py
new file mode 100644
index 000000000000..0f6ff2081174
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/source_azure_blob_storage/stream_reader.py
@@ -0,0 +1,78 @@
+import logging
+from contextlib import contextmanager
+from io import IOBase
+from typing import Iterable, List, Optional
+
+import pytz
+from airbyte_cdk.sources.file_based.file_based_stream_reader import AbstractFileBasedStreamReader, FileReadMode
+from airbyte_cdk.sources.file_based.remote_file import RemoteFile
+from azure.storage.blob import BlobServiceClient, ContainerClient
+from smart_open import open
+
+from .config import Config
+
+
+class SourceAzureBlobStorageStreamReader(AbstractFileBasedStreamReader):
+ def __init__(self, *args, **kwargs):
+ super().__init__(*args, **kwargs)
+ self._config = None
+
+ @property
+ def config(self) -> Config:
+ return self._config
+
+ @config.setter
+ def config(self, value: Config) -> None:
+ self._config = value
+
+ @property
+ def account_url(self) -> str:
+ if not self.config.azure_blob_storage_endpoint:
+ return f"https://{self.config.azure_blob_storage_account_name}.blob.core.windows.net"
+ return self.config.azure_blob_storage_endpoint
+
+ @property
+ def azure_container_client(self):
+ return ContainerClient(
+ self.account_url,
+ container_name=self.config.azure_blob_storage_container_name,
+ credential=self.config.azure_blob_storage_account_key,
+ )
+
+ @property
+ def azure_blob_service_client(self):
+ return BlobServiceClient(self.account_url, credential=self.config.azure_blob_storage_account_key)
+
+ def get_matching_files(
+ self,
+ globs: List[str],
+ prefix: Optional[str],
+ logger: logging.Logger,
+ ) -> Iterable[RemoteFile]:
+ prefixes = [prefix] if prefix else self.get_prefixes_from_globs(globs)
+ prefixes = prefixes or [None]
+ for prefix in prefixes:
+ for blob in self.azure_container_client.list_blobs(name_starts_with=prefix):
+ remote_file = RemoteFile(uri=blob.name, last_modified=blob.last_modified.astimezone(pytz.utc).replace(tzinfo=None))
+ if not globs or self.file_matches_globs(remote_file, globs):
+ yield remote_file
+
+ @contextmanager
+ def open_file(self, file: RemoteFile, mode: FileReadMode, encoding: Optional[str], logger: logging.Logger) -> IOBase:
+ try:
+ result = open(
+ f"azure://{self.config.azure_blob_storage_container_name}/{file.uri}",
+ transport_params={"client": self.azure_blob_service_client},
+ mode=mode.value,
+ encoding=encoding,
+ )
+        except OSError:
+            logger.warning(
+                f"We don't have access to {file.uri}. The file appears to have become unreachable during sync. "
+                f"Check whether key {file.uri} exists in the `{self.config.azure_blob_storage_container_name}` container and/or has proper ACL permissions."
+            )
+            # re-raise: without this, `result` would be unbound in the yield below
+            raise
+ # see https://docs.python.org/3/library/contextlib.html#contextlib.contextmanager for why we do this
+ try:
+ yield result
+ finally:
+ result.close()
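A usage sketch for the reader with placeholder credentials; in a real sync the CDK parses the config and assigns it via the `config` setter, and listing blobs performs real network calls against the storage account:

```
# Sketch: list blobs matching a glob. Credentials are placeholders, so this
# only runs against an actual storage account.
import logging

from source_azure_blob_storage import Config, SourceAzureBlobStorageStreamReader

reader = SourceAzureBlobStorageStreamReader()
reader.config = Config(
    azure_blob_storage_account_name="airbyte5storage",
    azure_blob_storage_account_key="placeholder-key",
    azure_blob_storage_container_name="testcontainer",
    streams=[],
)

for remote_file in reader.get_matching_files(globs=["**/*.csv"], prefix=None, logger=logging.getLogger(__name__)):
    print(remote_file.uri, remote_file.last_modified)
```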
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/AzureBlob.java b/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/AzureBlob.java
deleted file mode 100644
index 3ae6c38ad433..000000000000
--- a/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/AzureBlob.java
+++ /dev/null
@@ -1,39 +0,0 @@
-/*
- * Copyright (c) 2023 Airbyte, Inc., all rights reserved.
- */
-
-package io.airbyte.integrations.source.azureblobstorage;
-
-import java.time.OffsetDateTime;
-
-public record AzureBlob(
-
- String name,
-
- OffsetDateTime lastModified
-
-) {
-
- public static class Builder {
-
- private String name;
-
- private OffsetDateTime lastModified;
-
- public Builder withName(String name) {
- this.name = name;
- return this;
- }
-
- public Builder withLastModified(OffsetDateTime lastModified) {
- this.lastModified = lastModified;
- return this;
- }
-
- public AzureBlob build() {
- return new AzureBlob(name, lastModified);
- }
-
- }
-
-}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobAdditionalProperties.java b/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobAdditionalProperties.java
deleted file mode 100644
index 7d2514ab5810..000000000000
--- a/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobAdditionalProperties.java
+++ /dev/null
@@ -1,19 +0,0 @@
-/*
- * Copyright (c) 2023 Airbyte, Inc., all rights reserved.
- */
-
-package io.airbyte.integrations.source.azureblobstorage;
-
-public class AzureBlobAdditionalProperties {
-
- private AzureBlobAdditionalProperties() {
-
- }
-
- public static final String LAST_MODIFIED = "_ab_source_file_last_modified";
-
- public static final String BLOB_NAME = "_ab_source_blob_name";
-
- public static final String ADDITIONAL_PROPERTIES = "_ab_additional_properties";
-
-}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageConfig.java b/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageConfig.java
deleted file mode 100644
index bfa058620f38..000000000000
--- a/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageConfig.java
+++ /dev/null
@@ -1,76 +0,0 @@
-/*
- * Copyright (c) 2023 Airbyte, Inc., all rights reserved.
- */
-
-package io.airbyte.integrations.source.azureblobstorage;
-
-import com.azure.storage.blob.BlobContainerClient;
-import com.azure.storage.blob.BlobContainerClientBuilder;
-import com.azure.storage.common.StorageSharedKeyCredential;
-import com.fasterxml.jackson.databind.JsonNode;
-
-public record AzureBlobStorageConfig(
-
- String endpoint,
-
- String accountName,
-
- String accountKey,
-
- String containerName,
-
- String prefix,
-
- Long schemaInferenceLimit,
-
- FormatConfig formatConfig
-
-) {
-
- public record FormatConfig(Format format) {
-
- public enum Format {
-
- JSONL
-
- }
-
- }
-
- public static AzureBlobStorageConfig createAzureBlobStorageConfig(JsonNode jsonNode) {
- return new AzureBlobStorageConfig(
- jsonNode.has("azure_blob_storage_endpoint") ? jsonNode.get("azure_blob_storage_endpoint").asText() : null,
- jsonNode.get("azure_blob_storage_account_name").asText(),
- jsonNode.get("azure_blob_storage_account_key").asText(),
- jsonNode.get("azure_blob_storage_container_name").asText(),
- jsonNode.has("azure_blob_storage_blobs_prefix") ? jsonNode.get("azure_blob_storage_blobs_prefix").asText() : null,
- jsonNode.has("azure_blob_storage_schema_inference_limit") ? jsonNode.get("azure_blob_storage_schema_inference_limit").asLong() : null,
- formatConfig(jsonNode));
- }
-
- public BlobContainerClient createBlobContainerClient() {
- StorageSharedKeyCredential credential = new StorageSharedKeyCredential(
- this.accountName(),
- this.accountKey());
-
- var builder = new BlobContainerClientBuilder()
- .credential(credential)
- .containerName(this.containerName());
-
- if (this.endpoint() != null) {
- builder.endpoint(this.endpoint());
- }
-
- return builder.buildClient();
- }
-
- private static FormatConfig formatConfig(JsonNode config) {
- JsonNode formatConfig = config.get("format");
-
- FormatConfig.Format formatType = FormatConfig.Format
- .valueOf(formatConfig.get("format_type").asText().toUpperCase());
-
- return new FormatConfig(formatType);
- }
-
-}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageOperations.java b/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageOperations.java
deleted file mode 100644
index 705440a597a7..000000000000
--- a/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageOperations.java
+++ /dev/null
@@ -1,64 +0,0 @@
-/*
- * Copyright (c) 2023 Airbyte, Inc., all rights reserved.
- */
-
-package io.airbyte.integrations.source.azureblobstorage;
-
-import com.azure.storage.blob.BlobContainerClient;
-import com.azure.storage.blob.models.BlobListDetails;
-import com.azure.storage.blob.models.ListBlobsOptions;
-import com.fasterxml.jackson.databind.JsonNode;
-import io.airbyte.commons.functional.CheckedFunction;
-import java.io.IOException;
-import java.io.UncheckedIOException;
-import java.time.OffsetDateTime;
-import java.util.ArrayList;
-import java.util.List;
-import org.apache.commons.lang3.StringUtils;
-
-public abstract class AzureBlobStorageOperations {
-
- protected final BlobContainerClient blobContainerClient;
-
- protected final AzureBlobStorageConfig azureBlobStorageConfig;
-
- protected AzureBlobStorageOperations(AzureBlobStorageConfig azureBlobStorageConfig) {
- this.azureBlobStorageConfig = azureBlobStorageConfig;
- this.blobContainerClient = azureBlobStorageConfig.createBlobContainerClient();
- }
-
- public abstract JsonNode inferSchema();
-
- public abstract List readBlobs(OffsetDateTime offsetDateTime);
-
- public List listBlobs() {
-
- var listBlobsOptions = new ListBlobsOptions();
- listBlobsOptions.setDetails(new BlobListDetails()
- .setRetrieveMetadata(true)
- .setRetrieveDeletedBlobs(false));
-
- if (!StringUtils.isBlank(azureBlobStorageConfig.prefix())) {
- listBlobsOptions.setPrefix(azureBlobStorageConfig.prefix());
- }
-
- var pagedIterable = blobContainerClient.listBlobs(listBlobsOptions, null);
-
- List azureBlobs = new ArrayList<>();
- pagedIterable.forEach(blobItem -> azureBlobs.add(new AzureBlob.Builder()
- .withName(blobItem.getName())
- .withLastModified(blobItem.getProperties().getLastModified())
- .build()));
- return azureBlobs;
-
- }
-
- protected R handleCheckedIOException(CheckedFunction checkedFunction, T parameter) {
- try {
- return checkedFunction.apply(parameter);
- } catch (IOException e) {
- throw new UncheckedIOException(e);
- }
- }
-
-}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageSource.java b/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageSource.java
deleted file mode 100644
index 9151cbae24f1..000000000000
--- a/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageSource.java
+++ /dev/null
@@ -1,173 +0,0 @@
-/*
- * Copyright (c) 2023 Airbyte, Inc., all rights reserved.
- */
-
-package io.airbyte.integrations.source.azureblobstorage;
-
-import com.fasterxml.jackson.databind.JsonNode;
-import io.airbyte.cdk.integrations.BaseConnector;
-import io.airbyte.cdk.integrations.base.AirbyteTraceMessageUtility;
-import io.airbyte.cdk.integrations.base.IntegrationRunner;
-import io.airbyte.cdk.integrations.base.Source;
-import io.airbyte.cdk.integrations.source.relationaldb.CursorInfo;
-import io.airbyte.cdk.integrations.source.relationaldb.StateDecoratingIterator;
-import io.airbyte.cdk.integrations.source.relationaldb.state.StateManager;
-import io.airbyte.cdk.integrations.source.relationaldb.state.StateManagerFactory;
-import io.airbyte.commons.features.EnvVariableFeatureFlags;
-import io.airbyte.commons.features.FeatureFlags;
-import io.airbyte.commons.stream.AirbyteStreamUtils;
-import io.airbyte.commons.util.AutoCloseableIterator;
-import io.airbyte.commons.util.AutoCloseableIterators;
-import io.airbyte.integrations.source.azureblobstorage.format.JsonlAzureBlobStorageOperations;
-import io.airbyte.protocol.models.JsonSchemaPrimitiveUtil;
-import io.airbyte.protocol.models.v0.AirbyteCatalog;
-import io.airbyte.protocol.models.v0.AirbyteConnectionStatus;
-import io.airbyte.protocol.models.v0.AirbyteMessage;
-import io.airbyte.protocol.models.v0.AirbyteRecordMessage;
-import io.airbyte.protocol.models.v0.AirbyteStream;
-import io.airbyte.protocol.models.v0.AirbyteStreamNameNamespacePair;
-import io.airbyte.protocol.models.v0.ConfiguredAirbyteCatalog;
-import io.airbyte.protocol.models.v0.SyncMode;
-import java.time.Instant;
-import java.time.OffsetDateTime;
-import java.util.List;
-import java.util.Optional;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
-public class AzureBlobStorageSource extends BaseConnector implements Source {
-
- private static final Logger LOGGER = LoggerFactory.getLogger(AzureBlobStorageSource.class);
-
- private final FeatureFlags featureFlags = new EnvVariableFeatureFlags();
-
- public static void main(final String[] args) throws Exception {
- final Source source = new AzureBlobStorageSource();
- LOGGER.info("starting Source: {}", AzureBlobStorageSource.class);
- new IntegrationRunner(source).run(args);
- LOGGER.info("completed Source: {}", AzureBlobStorageSource.class);
- }
-
- @Override
- public AirbyteConnectionStatus check(JsonNode config) {
- var azureBlobStorageConfig = AzureBlobStorageConfig.createAzureBlobStorageConfig(config);
- try {
- var azureBlobStorageOperations = switch (azureBlobStorageConfig.formatConfig().format()) {
- case JSONL -> new JsonlAzureBlobStorageOperations(azureBlobStorageConfig);
- };
- azureBlobStorageOperations.listBlobs();
-
- return new AirbyteConnectionStatus()
- .withStatus(AirbyteConnectionStatus.Status.SUCCEEDED);
- } catch (Exception e) {
- LOGGER.error("Error while listing Azure Blob Storage blobs with reason: ", e);
- return new AirbyteConnectionStatus()
- .withStatus(AirbyteConnectionStatus.Status.FAILED);
- }
-
- }
-
- @Override
- public AirbyteCatalog discover(JsonNode config) {
- var azureBlobStorageConfig = AzureBlobStorageConfig.createAzureBlobStorageConfig(config);
-
- var azureBlobStorageOperations = switch (azureBlobStorageConfig.formatConfig().format()) {
- case JSONL -> new JsonlAzureBlobStorageOperations(azureBlobStorageConfig);
- };
-
- JsonNode schema = azureBlobStorageOperations.inferSchema();
-
- return new AirbyteCatalog()
- .withStreams(List.of(new AirbyteStream()
- .withName(azureBlobStorageConfig.containerName())
- .withJsonSchema(schema)
- .withSourceDefinedCursor(true)
- .withDefaultCursorField(List.of(AzureBlobAdditionalProperties.LAST_MODIFIED))
- .withSupportedSyncModes(List.of(SyncMode.INCREMENTAL, SyncMode.FULL_REFRESH))));
- }
-
- @Override
- public AutoCloseableIterator read(JsonNode config, ConfiguredAirbyteCatalog catalog, JsonNode state) {
-
- final var streamState =
- AzureBlobStorageStateManager.deserializeStreamState(state, featureFlags.useStreamCapableState());
-
- final StateManager stateManager = StateManagerFactory
- .createStateManager(streamState.airbyteStateType(), streamState.airbyteStateMessages(), catalog);
-
- var azureBlobStorageConfig = AzureBlobStorageConfig.createAzureBlobStorageConfig(config);
- var azureBlobStorageOperations = switch (azureBlobStorageConfig.formatConfig().format()) {
- case JSONL -> new JsonlAzureBlobStorageOperations(azureBlobStorageConfig);
- };
-
- // only one stream per connection
- var streamIterators = catalog.getStreams().stream()
- .map(cas -> switch (cas.getSyncMode()) {
- case INCREMENTAL -> readIncremental(azureBlobStorageOperations, cas.getStream(), cas.getCursorField().get(0),
- stateManager);
- case FULL_REFRESH -> readFullRefresh(azureBlobStorageOperations, cas.getStream());
- })
- .toList();
-
- return AutoCloseableIterators.concatWithEagerClose(streamIterators, AirbyteTraceMessageUtility::emitStreamStatusTrace);
-
- }
-
- private AutoCloseableIterator readIncremental(AzureBlobStorageOperations azureBlobStorageOperations,
- AirbyteStream airbyteStream,
- String cursorField,
- StateManager stateManager) {
- var streamPair = new AirbyteStreamNameNamespacePair(airbyteStream.getName(), airbyteStream.getNamespace());
-
- Optional cursorInfo = stateManager.getCursorInfo(streamPair);
-
- var messageStream = cursorInfo
- .map(cursor -> {
- var offsetDateTime = cursor.getCursor() != null ? OffsetDateTime.parse(cursor.getCursor()) : null;
- return azureBlobStorageOperations.readBlobs(offsetDateTime);
- })
- .orElse(azureBlobStorageOperations.readBlobs(null))
- .stream()
- .map(jn -> new AirbyteMessage()
- .withType(AirbyteMessage.Type.RECORD)
- .withRecord(new AirbyteRecordMessage()
- .withStream(airbyteStream.getName())
- .withEmittedAt(Instant.now().toEpochMilli())
- .withData(jn)));
-
- final io.airbyte.protocol.models.AirbyteStreamNameNamespacePair airbyteStreamNameNamespacePair =
- AirbyteStreamUtils.convertFromAirbyteStream(airbyteStream);
-
- return AutoCloseableIterators.transform(autoCloseableIterator -> new StateDecoratingIterator(
- autoCloseableIterator,
- stateManager,
- streamPair,
- cursorField,
- cursorInfo.map(CursorInfo::getCursor).orElse(null),
- JsonSchemaPrimitiveUtil.JsonSchemaPrimitive.TIMESTAMP_WITH_TIMEZONE_V1,
- // TODO (itaseski) emit state after every record/blob since they can be sorted in increasing order
- 0),
- AutoCloseableIterators.fromStream(messageStream, airbyteStreamNameNamespacePair),
- airbyteStreamNameNamespacePair);
- }
-
- private AutoCloseableIterator readFullRefresh(AzureBlobStorageOperations azureBlobStorageOperations,
- AirbyteStream airbyteStream) {
-
- final io.airbyte.protocol.models.AirbyteStreamNameNamespacePair airbyteStreamNameNamespacePair =
- AirbyteStreamUtils.convertFromAirbyteStream(airbyteStream);
-
- var messageStream = azureBlobStorageOperations
- .readBlobs(null)
- .stream()
- .map(jn -> new AirbyteMessage()
- .withType(AirbyteMessage.Type.RECORD)
- .withRecord(new AirbyteRecordMessage()
- .withStream(airbyteStream.getName())
- .withEmittedAt(Instant.now().toEpochMilli())
- .withData(jn)));
-
- return AutoCloseableIterators.fromStream(messageStream, airbyteStreamNameNamespacePair);
- }
-
-}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageStateManager.java b/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageStateManager.java
deleted file mode 100644
index 204f75a2cca2..000000000000
--- a/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageStateManager.java
+++ /dev/null
@@ -1,59 +0,0 @@
-/*
- * Copyright (c) 2023 Airbyte, Inc., all rights reserved.
- */
-
-package io.airbyte.integrations.source.azureblobstorage;
-
-import com.fasterxml.jackson.databind.JsonNode;
-import io.airbyte.cdk.integrations.source.relationaldb.models.DbState;
-import io.airbyte.commons.json.Jsons;
-import io.airbyte.configoss.StateWrapper;
-import io.airbyte.configoss.helpers.StateMessageHelper;
-import io.airbyte.protocol.models.v0.AirbyteStateMessage;
-import io.airbyte.protocol.models.v0.AirbyteStreamState;
-import java.util.List;
-import java.util.Optional;
-
-public class AzureBlobStorageStateManager {
-
- private AzureBlobStorageStateManager() {
-
- }
-
- public static StreamState deserializeStreamState(final JsonNode state, final boolean useStreamCapableState) {
- final Optional typedState =
- StateMessageHelper.getTypedState(state, useStreamCapableState);
- return typedState.map(stateWrapper -> switch (stateWrapper.getStateType()) {
- case STREAM:
- yield new StreamState(AirbyteStateMessage.AirbyteStateType.STREAM, stateWrapper.getStateMessages().stream()
- .map(sm -> Jsons.object(Jsons.jsonNode(sm), AirbyteStateMessage.class)).toList());
- case LEGACY:
- yield new StreamState(AirbyteStateMessage.AirbyteStateType.LEGACY, List.of(
- new AirbyteStateMessage().withType(AirbyteStateMessage.AirbyteStateType.LEGACY)
- .withData(stateWrapper.getLegacyState())));
- case GLOBAL:
- throw new UnsupportedOperationException("Unsupported stream state");
- }).orElseGet(() -> {
- // create empty initial state
- if (useStreamCapableState) {
- return new StreamState(AirbyteStateMessage.AirbyteStateType.STREAM, List.of(
- new AirbyteStateMessage().withType(AirbyteStateMessage.AirbyteStateType.STREAM)
- .withStream(new AirbyteStreamState())));
- } else {
- // TODO (itaseski) remove support for DbState
- return new StreamState(AirbyteStateMessage.AirbyteStateType.LEGACY, List.of(
- new AirbyteStateMessage().withType(AirbyteStateMessage.AirbyteStateType.LEGACY)
- .withData(Jsons.jsonNode(new DbState()))));
- }
- });
- }
-
- record StreamState(
-
- AirbyteStateMessage.AirbyteStateType airbyteStateType,
-
- List airbyteStateMessages) {
-
- }
-
-}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/format/JsonlAzureBlobStorageOperations.java b/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/format/JsonlAzureBlobStorageOperations.java
deleted file mode 100644
index fcfe9b786b08..000000000000
--- a/airbyte-integrations/connectors/source-azure-blob-storage/src/main/java/io/airbyte/integrations/source/azureblobstorage/format/JsonlAzureBlobStorageOperations.java
+++ /dev/null
@@ -1,101 +0,0 @@
-/*
- * Copyright (c) 2023 Airbyte, Inc., all rights reserved.
- */
-
-package io.airbyte.integrations.source.azureblobstorage.format;
-
-import com.azure.storage.blob.BlobClient;
-import com.fasterxml.jackson.databind.JsonNode;
-import com.fasterxml.jackson.databind.ObjectMapper;
-import com.fasterxml.jackson.databind.node.ObjectNode;
-import com.saasquatch.jsonschemainferrer.AdditionalPropertiesPolicies;
-import com.saasquatch.jsonschemainferrer.JsonSchemaInferrer;
-import com.saasquatch.jsonschemainferrer.SpecVersion;
-import io.airbyte.integrations.source.azureblobstorage.AzureBlob;
-import io.airbyte.integrations.source.azureblobstorage.AzureBlobAdditionalProperties;
-import io.airbyte.integrations.source.azureblobstorage.AzureBlobStorageConfig;
-import io.airbyte.integrations.source.azureblobstorage.AzureBlobStorageOperations;
-import java.io.BufferedReader;
-import java.io.IOException;
-import java.io.InputStreamReader;
-import java.io.UncheckedIOException;
-import java.nio.charset.Charset;
-import java.time.OffsetDateTime;
-import java.util.List;
-import java.util.Map;
-
-public class JsonlAzureBlobStorageOperations extends AzureBlobStorageOperations {
-
- private final ObjectMapper objectMapper;
-
- private final JsonSchemaInferrer jsonSchemaInferrer;
-
- public JsonlAzureBlobStorageOperations(AzureBlobStorageConfig azureBlobStorageConfig) {
- super(azureBlobStorageConfig);
- this.objectMapper = new ObjectMapper();
- this.jsonSchemaInferrer = JsonSchemaInferrer.newBuilder()
- .setSpecVersion(SpecVersion.DRAFT_07)
- .setAdditionalPropertiesPolicy(AdditionalPropertiesPolicies.allowed())
- .build();
- }
-
- @Override
- public JsonNode inferSchema() {
- var blobs = readBlobs(null, azureBlobStorageConfig.schemaInferenceLimit());
-
- // create super schema inferred from all blobs in the container
- var jsonSchema = jsonSchemaInferrer.inferForSamples(blobs);
-
- if (!jsonSchema.has("properties")) {
- jsonSchema.putObject("properties");
- }
-
- ((ObjectNode) jsonSchema.get("properties")).putPOJO(AzureBlobAdditionalProperties.BLOB_NAME,
- Map.of("type", "string"));
- ((ObjectNode) jsonSchema.get("properties")).putPOJO(AzureBlobAdditionalProperties.LAST_MODIFIED,
- Map.of("type", "string"));
- return jsonSchema;
- }
-
- @Override
- public List readBlobs(OffsetDateTime offsetDateTime) {
- return readBlobs(offsetDateTime, null);
- }
-
- private List readBlobs(OffsetDateTime offsetDateTime, Long limit) {
- record DecoratedAzureBlob(AzureBlob azureBlob, BlobClient blobClient) {}
-
- var blobsStream = limit == null ? listBlobs().stream() : listBlobs().stream().limit(limit);
-
- return blobsStream
- .filter(ab -> {
- if (offsetDateTime != null) {
- return ab.lastModified().isAfter(offsetDateTime);
- } else {
- return true;
- }
- })
- .map(ab -> new DecoratedAzureBlob(ab, blobContainerClient.getBlobClient(ab.name())))
- .map(dab -> {
- try (
- var br = new BufferedReader(
- new InputStreamReader(dab.blobClient().downloadContent().toStream(), Charset.defaultCharset()))) {
- return br.lines().map(line -> {
- var jsonNode =
- handleCheckedIOException(objectMapper::readTree, line);
- ((ObjectNode) jsonNode).put(AzureBlobAdditionalProperties.BLOB_NAME, dab.azureBlob().name());
- ((ObjectNode) jsonNode).put(AzureBlobAdditionalProperties.LAST_MODIFIED,
- dab.azureBlob().lastModified().toString());
- return jsonNode;
- })
- // need to materialize stream otherwise reader gets closed on return
- .toList();
- } catch (IOException e) {
- throw new UncheckedIOException(e);
- }
- })
- .flatMap(List::stream)
- .toList();
- }
-
-}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/src/main/resources/spec.json b/airbyte-integrations/connectors/source-azure-blob-storage/src/main/resources/spec.json
deleted file mode 100644
index 9e4d74450a36..000000000000
--- a/airbyte-integrations/connectors/source-azure-blob-storage/src/main/resources/spec.json
+++ /dev/null
@@ -1,74 +0,0 @@
-{
- "documentationUrl": "https://docs.airbyte.com/integrations/destinations/azureblobstorage",
- "connectionSpecification": {
- "$schema": "http://json-schema.org/draft-07/schema#",
- "title": "AzureBlobStorage Source Spec",
- "type": "object",
- "required": [
- "azure_blob_storage_account_name",
- "azure_blob_storage_account_key",
- "azure_blob_storage_container_name",
- "format"
- ],
- "additionalProperties": true,
- "properties": {
- "azure_blob_storage_endpoint": {
- "title": "Endpoint Domain Name",
- "type": "string",
- "default": "blob.core.windows.net",
- "description": "This is Azure Blob Storage endpoint domain name. Leave default value (or leave it empty if run container from command line) to use Microsoft native from example.",
- "examples": ["blob.core.windows.net"]
- },
- "azure_blob_storage_container_name": {
- "title": "Azure blob storage container (Bucket) Name",
- "type": "string",
- "description": "The name of the Azure blob storage container.",
- "examples": ["airbytetescontainername"]
- },
- "azure_blob_storage_account_name": {
- "title": "Azure Blob Storage account name",
- "type": "string",
- "description": "The account's name of the Azure Blob Storage.",
- "examples": ["airbyte5storage"]
- },
- "azure_blob_storage_account_key": {
- "title": "Azure Blob Storage account key",
- "description": "The Azure blob storage account key.",
- "airbyte_secret": true,
- "type": "string",
- "examples": [
- "Z8ZkZpteggFx394vm+PJHnGTvdRncaYS+JhLKdj789YNmD+iyGTnG+PV+POiuYNhBg/ACS+LKjd%4FG3FHGN12Nd=="
- ]
- },
- "azure_blob_storage_blobs_prefix": {
- "title": "Azure Blob Storage blobs prefix",
- "description": "The Azure blob storage prefix to be applied",
- "type": "string",
- "examples": ["FolderA/FolderB/"]
- },
- "azure_blob_storage_schema_inference_limit": {
- "title": "Azure Blob Storage schema inference limit",
- "description": "The Azure blob storage blobs to scan for inferring the schema, useful on large amounts of data with consistent structure",
- "type": "integer",
- "examples": ["500"]
- },
- "format": {
- "title": "Input Format",
- "type": "object",
- "description": "Input data format",
- "oneOf": [
- {
- "title": "JSON Lines: newline-delimited JSON",
- "required": ["format_type"],
- "properties": {
- "format_type": {
- "type": "string",
- "const": "JSONL"
- }
- }
- }
- ]
- }
- }
- }
-}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/src/test-integration/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageContainer.java b/airbyte-integrations/connectors/source-azure-blob-storage/src/test-integration/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageContainer.java
deleted file mode 100644
index 3acb639e2b8e..000000000000
--- a/airbyte-integrations/connectors/source-azure-blob-storage/src/test-integration/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageContainer.java
+++ /dev/null
@@ -1,17 +0,0 @@
-/*
- * Copyright (c) 2023 Airbyte, Inc., all rights reserved.
- */
-
-package io.airbyte.integrations.source.azureblobstorage;
-
-import org.testcontainers.containers.GenericContainer;
-
-// Azurite emulator for easier local azure storage development and testing
-// https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azurite?tabs=docker-hub
-public class AzureBlobStorageContainer extends GenericContainer<AzureBlobStorageContainer> {
-
- public AzureBlobStorageContainer() {
- super("mcr.microsoft.com/azure-storage/azurite");
- }
-
-}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/src/test-integration/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageDataFactory.java b/airbyte-integrations/connectors/source-azure-blob-storage/src/test-integration/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageDataFactory.java
deleted file mode 100644
index fd9126a32ad8..000000000000
--- a/airbyte-integrations/connectors/source-azure-blob-storage/src/test-integration/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageDataFactory.java
+++ /dev/null
@@ -1,55 +0,0 @@
-/*
- * Copyright (c) 2023 Airbyte, Inc., all rights reserved.
- */
-
-package io.airbyte.integrations.source.azureblobstorage;
-
-import com.fasterxml.jackson.databind.JsonNode;
-import io.airbyte.commons.json.Jsons;
-import io.airbyte.protocol.models.Field;
-import io.airbyte.protocol.models.JsonSchemaType;
-import io.airbyte.protocol.models.v0.CatalogHelpers;
-import io.airbyte.protocol.models.v0.ConfiguredAirbyteCatalog;
-import io.airbyte.protocol.models.v0.ConfiguredAirbyteStream;
-import io.airbyte.protocol.models.v0.DestinationSyncMode;
-import io.airbyte.protocol.models.v0.SyncMode;
-import java.util.List;
-import java.util.Map;
-
-public class AzureBlobStorageDataFactory {
-
- private AzureBlobStorageDataFactory() {
-
- }
-
- static JsonNode createAzureBlobStorageConfig(String host, String container) {
- return Jsons.jsonNode(Map.of(
- "azure_blob_storage_endpoint", host + "/devstoreaccount1",
- "azure_blob_storage_account_name", "devstoreaccount1",
- "azure_blob_storage_account_key",
- "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==",
- "azure_blob_storage_container_name", container,
- "azure_blob_storage_blobs_prefix", "FolderA/FolderB/",
- "azure_blob_storage_schema_inference_limit", 10L,
- "format", Jsons.deserialize("""
- {
- "format_type": "JSONL"
- }""")));
- }
-
- static ConfiguredAirbyteCatalog createConfiguredAirbyteCatalog(String streamName) {
- return new ConfiguredAirbyteCatalog().withStreams(List.of(
- new ConfiguredAirbyteStream()
- .withSyncMode(SyncMode.INCREMENTAL)
- .withCursorField(List.of(AzureBlobAdditionalProperties.LAST_MODIFIED))
- .withDestinationSyncMode(DestinationSyncMode.APPEND)
- .withStream(CatalogHelpers.createAirbyteStream(
- streamName,
- Field.of("attr_1", JsonSchemaType.STRING),
- Field.of("attr_2", JsonSchemaType.INTEGER),
- Field.of(AzureBlobAdditionalProperties.LAST_MODIFIED, JsonSchemaType.TIMESTAMP_WITH_TIMEZONE_V1),
- Field.of(AzureBlobAdditionalProperties.BLOB_NAME, JsonSchemaType.STRING))
- .withSupportedSyncModes(List.of(SyncMode.FULL_REFRESH, SyncMode.INCREMENTAL)))));
- }
-
-}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/src/test-integration/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageSourceAcceptanceTest.java b/airbyte-integrations/connectors/source-azure-blob-storage/src/test-integration/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageSourceAcceptanceTest.java
deleted file mode 100644
index d3e0f9a7de86..000000000000
--- a/airbyte-integrations/connectors/source-azure-blob-storage/src/test-integration/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageSourceAcceptanceTest.java
+++ /dev/null
@@ -1,69 +0,0 @@
-/*
- * Copyright (c) 2023 Airbyte, Inc., all rights reserved.
- */
-
-package io.airbyte.integrations.source.azureblobstorage;
-
-import com.azure.core.util.BinaryData;
-import com.fasterxml.jackson.databind.JsonNode;
-import io.airbyte.cdk.integrations.standardtest.source.SourceAcceptanceTest;
-import io.airbyte.cdk.integrations.standardtest.source.TestDestinationEnv;
-import io.airbyte.commons.json.Jsons;
-import io.airbyte.commons.resources.MoreResources;
-import io.airbyte.protocol.models.v0.ConfiguredAirbyteCatalog;
-import io.airbyte.protocol.models.v0.ConnectorSpecification;
-import java.util.HashMap;
-
-public class AzureBlobStorageSourceAcceptanceTest extends SourceAcceptanceTest {
-
- private static final String STREAM_NAME = "airbyte-container";
-
- private AzureBlobStorageContainer azureBlobStorageContainer;
-
- private JsonNode jsonConfig;
-
- @Override
- protected String getImageName() {
- return "airbyte/source-azure-blob-storage:dev";
- }
-
- @Override
- protected JsonNode getConfig() throws Exception {
- return jsonConfig;
- }
-
- @Override
- protected void setupEnvironment(TestDestinationEnv environment) {
- azureBlobStorageContainer = new AzureBlobStorageContainer().withExposedPorts(10000);
- azureBlobStorageContainer.start();
- jsonConfig = AzureBlobStorageDataFactory.createAzureBlobStorageConfig(
- "http://127.0.0.1:" + azureBlobStorageContainer.getMappedPort(10000), STREAM_NAME);
-
- var azureBlobStorageConfig = AzureBlobStorageConfig.createAzureBlobStorageConfig(jsonConfig);
- var blobContainerClient = azureBlobStorageConfig.createBlobContainerClient();
- blobContainerClient.createIfNotExists();
- blobContainerClient.getBlobClient("FolderA/FolderB/blob1.json").upload(BinaryData.fromString("{\"attr1\":\"str_1\",\"attr2\":1}\n"));
- }
-
- @Override
- protected void tearDown(TestDestinationEnv testEnv) {
- azureBlobStorageContainer.stop();
- azureBlobStorageContainer.close();
- }
-
- @Override
- protected ConnectorSpecification getSpec() throws Exception {
- return Jsons.deserialize(MoreResources.readResource("spec.json"), ConnectorSpecification.class);
- }
-
- @Override
- protected ConfiguredAirbyteCatalog getConfiguredCatalog() {
- return AzureBlobStorageDataFactory.createConfiguredAirbyteCatalog(STREAM_NAME);
- }
-
- @Override
- protected JsonNode getState() {
- return Jsons.jsonNode(new HashMap<>());
- }
-
-}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/src/test-integration/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageSourceTest.java b/airbyte-integrations/connectors/source-azure-blob-storage/src/test-integration/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageSourceTest.java
deleted file mode 100644
index 873c3f7e12a2..000000000000
--- a/airbyte-integrations/connectors/source-azure-blob-storage/src/test-integration/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageSourceTest.java
+++ /dev/null
@@ -1,118 +0,0 @@
-/*
- * Copyright (c) 2023 Airbyte, Inc., all rights reserved.
- */
-
-package io.airbyte.integrations.source.azureblobstorage;
-
-import static org.assertj.core.api.Assertions.assertThat;
-
-import com.azure.core.util.BinaryData;
-import com.fasterxml.jackson.databind.JsonNode;
-import io.airbyte.commons.json.Jsons;
-import io.airbyte.protocol.models.v0.AirbyteConnectionStatus;
-import io.airbyte.protocol.models.v0.AirbyteMessage;
-import io.airbyte.protocol.models.v0.SyncMode;
-import java.util.Iterator;
-import java.util.List;
-import java.util.stream.Stream;
-import org.junit.jupiter.api.AfterEach;
-import org.junit.jupiter.api.BeforeEach;
-import org.junit.jupiter.api.Test;
-
-class AzureBlobStorageSourceTest {
-
- private AzureBlobStorageSource azureBlobStorageSource;
-
- private AzureBlobStorageContainer azureBlobStorageContainer;
-
- private JsonNode jsonConfig;
-
- private static final String STREAM_NAME = "airbyte-container";
-
- @BeforeEach
- void setup() {
- azureBlobStorageContainer = new AzureBlobStorageContainer().withExposedPorts(10000);
- azureBlobStorageContainer.start();
- azureBlobStorageSource = new AzureBlobStorageSource();
- jsonConfig = AzureBlobStorageDataFactory.createAzureBlobStorageConfig(
- "http://127.0.0.1:" + azureBlobStorageContainer.getMappedPort(10000), STREAM_NAME);
-
- var azureBlobStorageConfig = AzureBlobStorageConfig.createAzureBlobStorageConfig(jsonConfig);
- var blobContainerClient = azureBlobStorageConfig.createBlobContainerClient();
- blobContainerClient.createIfNotExists();
- blobContainerClient.getBlobClient("FolderA/FolderB/blob1.json")
- .upload(BinaryData.fromString("{\"attr_1\":\"str_1\"}\n"));
- blobContainerClient.getBlobClient("FolderA/FolderB/blob2.json")
- .upload(BinaryData.fromString("{\"attr_2\":\"str_2\"}\n"));
- // blob in ignored path
- blobContainerClient.getBlobClient("FolderA/blob3.json").upload(BinaryData.fromString("{}"));
- }
-
- @AfterEach
- void tearDown() {
- azureBlobStorageContainer.stop();
- azureBlobStorageContainer.close();
- }
-
- @Test
- void testCheckConnectionWithSucceeded() {
- var airbyteConnectionStatus = azureBlobStorageSource.check(jsonConfig);
-
- assertThat(airbyteConnectionStatus.getStatus()).isEqualTo(AirbyteConnectionStatus.Status.SUCCEEDED);
-
- }
-
- @Test
- void testCheckConnectionWithFailed() {
-
- var failingConfig = AzureBlobStorageDataFactory.createAzureBlobStorageConfig(
- "http://127.0.0.1:" + azureBlobStorageContainer.getMappedPort(10000), "missing-container");
-
- var airbyteConnectionStatus = azureBlobStorageSource.check(failingConfig);
-
- assertThat(airbyteConnectionStatus.getStatus()).isEqualTo(AirbyteConnectionStatus.Status.FAILED);
-
- }
-
- @Test
- void testDiscover() {
- var airbyteCatalog = azureBlobStorageSource.discover(jsonConfig);
-
- assertThat(airbyteCatalog.getStreams())
- .hasSize(1)
- .element(0)
- .hasFieldOrPropertyWithValue("name", STREAM_NAME)
- .hasFieldOrPropertyWithValue("sourceDefinedCursor", true)
- .hasFieldOrPropertyWithValue("defaultCursorField", List.of(AzureBlobAdditionalProperties.LAST_MODIFIED))
- .hasFieldOrPropertyWithValue("supportedSyncModes", List.of(SyncMode.INCREMENTAL, SyncMode.FULL_REFRESH))
- .extracting("jsonSchema")
- .isNotNull();
-
- }
-
- @Test
- void testRead() {
- var configuredAirbyteCatalog = AzureBlobStorageDataFactory.createConfiguredAirbyteCatalog(STREAM_NAME);
-
- Iterator<AirbyteMessage> iterator =
- azureBlobStorageSource.read(jsonConfig, configuredAirbyteCatalog, Jsons.emptyObject());
-
- var airbyteRecordMessages = Stream.generate(() -> null)
- .takeWhile(x -> iterator.hasNext())
- .map(n -> iterator.next())
- .filter(am -> am.getType() == AirbyteMessage.Type.RECORD)
- .map(AirbyteMessage::getRecord)
- .toList();
-
- assertThat(airbyteRecordMessages)
- .hasSize(2)
- .anyMatch(arm -> arm.getStream().equals(STREAM_NAME) &&
- Jsons.serialize(arm.getData()).contains(
- "\"attr_1\":\"str_1\""))
- .anyMatch(arm -> arm.getStream().equals(STREAM_NAME) &&
- Jsons.serialize(arm.getData()).contains(
- "\"attr_2\":\"str_2\""));
-
- }
-
-}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/src/test-integration/java/io/airbyte/integrations/source/azureblobstorage/JsonlAzureBlobStorageOperationsTest.java b/airbyte-integrations/connectors/source-azure-blob-storage/src/test-integration/java/io/airbyte/integrations/source/azureblobstorage/JsonlAzureBlobStorageOperationsTest.java
deleted file mode 100644
index 69e7fa9daf12..000000000000
--- a/airbyte-integrations/connectors/source-azure-blob-storage/src/test-integration/java/io/airbyte/integrations/source/azureblobstorage/JsonlAzureBlobStorageOperationsTest.java
+++ /dev/null
@@ -1,140 +0,0 @@
-/*
- * Copyright (c) 2023 Airbyte, Inc., all rights reserved.
- */
-
-package io.airbyte.integrations.source.azureblobstorage;
-
-import static org.assertj.core.api.Assertions.assertThat;
-
-import com.azure.core.util.BinaryData;
-import com.azure.storage.blob.BlobContainerClient;
-import com.fasterxml.jackson.core.JsonProcessingException;
-import com.fasterxml.jackson.databind.JsonNode;
-import com.fasterxml.jackson.databind.ObjectMapper;
-import io.airbyte.integrations.source.azureblobstorage.format.JsonlAzureBlobStorageOperations;
-import java.time.OffsetDateTime;
-import org.json.JSONException;
-import org.junit.jupiter.api.AfterEach;
-import org.junit.jupiter.api.BeforeEach;
-import org.junit.jupiter.api.Test;
-import org.skyscreamer.jsonassert.JSONAssert;
-
-class JsonlAzureBlobStorageOperationsTest {
-
- private AzureBlobStorageContainer azureBlobStorageContainer;
-
- private AzureBlobStorageOperations azureBlobStorageOperations;
-
- private BlobContainerClient blobContainerClient;
-
- private ObjectMapper objectMapper;
-
- private static final String STREAM_NAME = "airbyte-container";
-
- @BeforeEach
- void setup() {
- azureBlobStorageContainer = new AzureBlobStorageContainer().withExposedPorts(10000);
- azureBlobStorageContainer.start();
- JsonNode jsonConfig = AzureBlobStorageDataFactory.createAzureBlobStorageConfig(
- "http://127.0.0.1:" + azureBlobStorageContainer.getMappedPort(10000), STREAM_NAME);
-
- var azureBlobStorageConfig = AzureBlobStorageConfig.createAzureBlobStorageConfig(jsonConfig);
- blobContainerClient = azureBlobStorageConfig.createBlobContainerClient();
- blobContainerClient.createIfNotExists();
- blobContainerClient.getBlobClient("FolderA/FolderB/blob1.json").upload(BinaryData
- .fromString("""
- {"name":"Molecule Man","age":29,"secretIdentity":"Dan Jukes","powers":["Radiation resistance","Turning tiny","Radiation blast"]}
- {"name":"Bat Man","secretIdentity":"Bruce Wayne","powers":["Agility", "Detective skills", "Determination"]}
- """));
- blobContainerClient.getBlobClient("FolderA/FolderB/blob2.json").upload(BinaryData.fromString(
- "{\"name\":\"Molecule Man\",\"surname\":\"Powers\",\"powers\":[\"Radiation resistance\",\"Turning tiny\",\"Radiation blast\"]}\n"));
- // should be ignored since its in ignored path
- blobContainerClient.getBlobClient("FolderA/blob3.json").upload(BinaryData.fromString("{\"ignored\":true}\n"));
- azureBlobStorageOperations = new JsonlAzureBlobStorageOperations(azureBlobStorageConfig);
- objectMapper = new ObjectMapper();
- }
-
- @AfterEach
- void tearDown() {
- azureBlobStorageContainer.stop();
- azureBlobStorageContainer.close();
- }
-
- @Test
- void testListBlobs() {
- var azureBlobs = azureBlobStorageOperations.listBlobs();
-
- assertThat(azureBlobs)
- .hasSize(2)
- .anyMatch(ab -> ab.name().equals("FolderA/FolderB/blob1.json"))
- .anyMatch(ab -> ab.name().equals("FolderA/FolderB/blob2.json"));
- }
-
- @Test
- void testInferSchema() throws JsonProcessingException, JSONException {
-
- var jsonSchema = azureBlobStorageOperations.inferSchema();
-
- JSONAssert.assertEquals(objectMapper.writeValueAsString(jsonSchema), """
- {
- "$schema": "http://json-schema.org/draft-07/schema#",
- "type": "object",
- "properties": {
- "name": {
- "type": "string"
- },
- "age": {
- "type": "integer"
- },
- "secretIdentity": {
- "type": "string"
- },
- "powers": {
- "type": "array",
- "items": {
- "type": "string"
- }
- },
- "surname": {
- "type": "string"
- },
- "_ab_source_blob_name": {
- "type": "string"
- },
- "_ab_source_file_last_modified": {
- "type": "string"
- }
- },
- "additionalProperties": true
- }
- """, true);
-
- }
-
- @Test
- void testReadBlobs() throws InterruptedException, JsonProcessingException, JSONException {
- var now = OffsetDateTime.now();
-
- Thread.sleep(1000);
-
- blobContainerClient.getBlobClient("FolderA/FolderB/blob1.json").upload(BinaryData.fromString(
- "{\"name\":\"Super Man\",\"secretIdentity\":\"Clark Kent\",\"powers\":[\"Lightning fast\",\"Super strength\",\"Laser vision\"]}\n"),
- true);
-
- var messages = azureBlobStorageOperations.readBlobs(now);
-
- var azureBlob = azureBlobStorageOperations.listBlobs().stream()
- .filter(ab -> ab.name().equals("FolderA/FolderB/blob1.json"))
- .findAny()
- .orElseThrow();
-
- assertThat(messages)
- .hasSize(1);
-
- JSONAssert.assertEquals(objectMapper.writeValueAsString(messages.get(0)), String.format(
- "{\"name\":\"Super Man\",\"secretIdentity\":\"Clark Kent\",\"powers\":[\"Lightning fast\",\"Super strength\",\"Laser vision\"],\"_ab_source_blob_name\":\"%s\",\"_ab_source_file_last_modified\":\"%s\"}\n",
- azureBlob.name(), azureBlob.lastModified().toString()), true);
-
- }
-
-}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/src/test/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageConfigTest.java b/airbyte-integrations/connectors/source-azure-blob-storage/src/test/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageConfigTest.java
deleted file mode 100644
index 003a395bbd6e..000000000000
--- a/airbyte-integrations/connectors/source-azure-blob-storage/src/test/java/io/airbyte/integrations/source/azureblobstorage/AzureBlobStorageConfigTest.java
+++ /dev/null
@@ -1,45 +0,0 @@
-/*
- * Copyright (c) 2023 Airbyte, Inc., all rights reserved.
- */
-
-package io.airbyte.integrations.source.azureblobstorage;
-
-import static org.assertj.core.api.Assertions.assertThat;
-
-import io.airbyte.commons.json.Jsons;
-import java.util.Map;
-import org.junit.jupiter.api.Test;
-
-class AzureBlobStorageConfigTest {
-
- @Test
- void testAzureBlobStorageConfig() {
- var jsonConfig = Jsons.jsonNode(Map.of(
- "azure_blob_storage_endpoint", "http://127.0.0.1:10000/devstoreaccount1",
- "azure_blob_storage_account_name", "devstoreaccount1",
- "azure_blob_storage_account_key",
- "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==",
- "azure_blob_storage_container_name", "airbyte-container",
- "azure_blob_storage_blobs_prefix", "FolderA/FolderB/",
- "azure_blob_storage_schema_inference_limit", 10L,
- "format", Jsons.deserialize("""
- {
- "format_type": "JSONL"
- }""")));
-
- var azureBlobStorageConfig = AzureBlobStorageConfig.createAzureBlobStorageConfig(jsonConfig);
-
- assertThat(azureBlobStorageConfig)
- .hasFieldOrPropertyWithValue("endpoint", "http://127.0.0.1:10000/devstoreaccount1")
- .hasFieldOrPropertyWithValue("accountName", "devstoreaccount1")
- .hasFieldOrPropertyWithValue("accountKey",
- "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==")
- .hasFieldOrPropertyWithValue("containerName", "airbyte-container")
- .hasFieldOrPropertyWithValue("prefix", "FolderA/FolderB/")
- .hasFieldOrPropertyWithValue("schemaInferenceLimit", 10L)
- .hasFieldOrPropertyWithValue("formatConfig", new AzureBlobStorageConfig.FormatConfig(
- AzureBlobStorageConfig.FormatConfig.Format.JSONL));
-
- }
-
-}
diff --git a/airbyte-integrations/connectors/source-azure-blob-storage/unit_tests/unit_tests.py b/airbyte-integrations/connectors/source-azure-blob-storage/unit_tests/unit_tests.py
new file mode 100644
index 000000000000..17b5f7fc5913
--- /dev/null
+++ b/airbyte-integrations/connectors/source-azure-blob-storage/unit_tests/unit_tests.py
@@ -0,0 +1,28 @@
+from source_azure_blob_storage.legacy_config_transformer import LegacyConfigTransformer
+
+
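+# Check that LegacyConfigTransformer maps a legacy (pre-file-based CDK) config
+# onto the new stream-based configuration layout.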
+def test_config_conversion():
+ legacy_config = {
+ "azure_blob_storage_endpoint": "https://airbyteteststorage.blob.core.windows.net",
+ "azure_blob_storage_account_name": "airbyteteststorage",
+ "azure_blob_storage_account_key": "secret/key==",
+ "azure_blob_storage_container_name": "airbyte-source-azure-blob-storage-test",
+ "azure_blob_storage_blobs_prefix": "subfolder/",
+ "azure_blob_storage_schema_inference_limit": 500,
+ "format": "jsonl",
+ }
+ new_config = LegacyConfigTransformer.convert(legacy_config)
+ assert new_config == {
+ "azure_blob_storage_account_key": "secret/key==",
+ "azure_blob_storage_account_name": "airbyteteststorage",
+ "azure_blob_storage_container_name": "airbyte-source-azure-blob-storage-test",
+ "azure_blob_storage_endpoint": "https://airbyteteststorage.blob.core.windows.net",
+ "streams": [
+ {
+ "format": {"filetype": "jsonl"},
+ "legacy_prefix": "subfolder/",
+ "name": "airbyte-source-azure-blob-storage-test",
+ "validation_policy": "Emit Record",
+ }
+ ],
+ }
diff --git a/docs/integrations/sources/azure-blob-storage.md b/docs/integrations/sources/azure-blob-storage.md
index 3c0896f97c55..e86a0fa80980 100644
--- a/docs/integrations/sources/azure-blob-storage.md
+++ b/docs/integrations/sources/azure-blob-storage.md
@@ -14,46 +14,169 @@ Cloud storage may incur egress costs. Egress refers to data that is transferred
### Step 2: Set up the Azure Blob Storage connector in Airbyte
-
-1. Create a new Azure Blob Storage source with a suitable name.
-2. Set `container` appropriately. This will be the name of the container where the blobs are located.
-3. If you are only interested in blobs containing some prefix in the container set the blobs prefix property
-4. Set schema inference limit if you want to limit the number of blobs being considered for constructing the schema
-5. Choose the format corresponding to the format of your files.
-
+1. [Log in to your Airbyte Cloud](https://cloud.airbyte.com/workspaces) account, or navigate to your Airbyte Open Source dashboard.
+2. In the left navigation bar, click **Sources**. In the top-right corner, click **+ New source**.
+3. Find and select **Azure Blob Storage** from the list of available sources.
+4. Enter the name of your Azure **Account**.
+5. Enter the **Azure Blob Storage account key**, which grants access to your account.
+6. Enter the name of the **Container** containing your files to replicate.
+7. Add a stream
+    1. Enter the **File Type**
+ 2. In the **Format** box, use the dropdown menu to select the format of the files you'd like to replicate. The supported formats are **CSV**, **Parquet**, **Avro** and **JSONL**. Toggling the **Optional fields** button within the **Format** box will allow you to enter additional configurations based on the selected format. For a detailed breakdown of these settings, refer to the [File Format section](#file-format-settings) below.
+ 3. Give a **Name** to the stream
+    4. (Optional) If you want to enforce a specific schema, you can enter an **Input schema**. By default, this value is set to `{}`, and the schema will be inferred automatically from the file\(s\) you are replicating. For details on providing a custom schema, refer to the [User Schema section](#user-schema).
+    5. Optionally, enter the **Globs** pattern, which dictates which files will be synced. This is a glob-style pattern that allows Airbyte to match the specific files to replicate. If you are replicating all the files within your container, use `**` as the pattern. For more precise pattern matching options, refer to the [Path Patterns section](#path-patterns) below.
+8. Optionally, enter the endpoint to use for the data replication.
+9. Optionally, enter the desired start date from which to begin replicating data.
## Supported sync modes
The Azure Blob Storage source connector supports the following [sync modes](https://docs.airbyte.com/cloud/core-concepts#connection-sync-modes):
| Feature | Supported? |
-|:-----------------------------------------------| :--------- |
+| :--------------------------------------------- |:-----------|
| Full Refresh Sync | Yes |
| Incremental Sync | Yes |
| Replicate Incremental Deletes | No |
-| Replicate Multiple Files \(blob prefix\) | Yes |
-| Replicate Multiple Streams \(distinct tables\) | No |
+| Replicate Multiple Files \(pattern matching\) | Yes |
+| Replicate Multiple Streams \(distinct tables\) | Yes |
| Namespaces | No |
+## File Compressions
+
+| Compression | Supported? |
+| :---------- | :--------- |
+| Gzip | Yes |
+| Zip | No |
+| Bzip2 | Yes |
+| Lzma | No |
+| Xz | No |
+| Snappy | No |
+
+Please let us know any specific compressions you'd like to see support for next!
+
+## Path Patterns
+
+\(tl;dr -> path pattern syntax using [wcmatch.glob](https://facelessuser.github.io/wcmatch/glob/). GLOBSTAR and SPLIT flags are enabled.\)
+
+This connector can sync multiple files by using glob-style patterns, rather than requiring a specific path for every file. This enables:
+
+- Referencing many files with just one pattern, e.g. `**` would indicate every file in the container.
+- Referencing future files that don't exist yet \(and therefore don't have a specific path\).
+
+You must provide a path pattern. You can also provide many patterns split with \| for more complex directory layouts.
+
+Each path pattern is a reference from the _root_ of the container, so don't include the container name in the pattern\(s\).
+
+Some example patterns:
+
+- `**` : match everything.
+- `**/*.csv` : match all files with specific extension.
+- `myFolder/**/*.csv` : match all csv files anywhere under myFolder.
+- `*/**` : match everything at least one folder deep.
+- `*/*/*/**` : match everything at least three folders deep.
+- `**/file.*|**/file` : match every file called "file" with any extension \(or no extension\).
+- `x/*/y/*` : match all files that sit in folder x -> any folder -> folder y.
+- `**/prefix*.csv` : match all csv files with specific prefix.
+- `**/prefix*.parquet` : match all parquet files with specific prefix.
+
+Let's look at a specific example, matching the following container layout:
+
+```text
+myContainer
+ -> log_files
+ -> some_table_files
+ -> part1.csv
+ -> part2.csv
+ -> images
+ -> more_table_files
+ -> part3.csv
+ -> extras
+ -> misc
+ -> another_part1.csv
+```
+
+We want to pick up part1.csv, part2.csv and part3.csv \(excluding another_part1.csv for now\). We could do this a few different ways:
+
+- We could pick up every csv file called "partX" with the single pattern `**/part*.csv`.
+- To be a bit more robust, we could use the dual pattern `some_table_files/*.csv|more_table_files/*.csv` to pick up relevant files only from those exact folders.
+- We could achieve the above in a single pattern by using the pattern `*table_files/*.csv`. This could however cause problems in the future if new unexpected folders started being created.
+- We can also recursively wildcard, so adding the pattern `extras/**/*.csv` would pick up any csv files nested in folders below "extras", such as "extras/misc/another_part1.csv".
+
+As you can probably tell, there are many ways to achieve the same goal with path patterns. We recommend using a pattern that ensures clarity and is robust against future additions to the directory structure.
+
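+To sanity-check a pattern locally, you can experiment with the same [wcmatch.glob](https://facelessuser.github.io/wcmatch/glob/) library. The sketch below is illustrative only — it is not the connector's code, and the sample blob names are assumptions based on the layout above:
+
+```python
+from wcmatch import glob
+
+# Blob names are matched relative to the container root, so no container name here.
+blobs = [
+    "some_table_files/part1.csv",
+    "some_table_files/part2.csv",
+    "more_table_files/part3.csv",
+    "extras/misc/another_part1.csv",
+]
+
+# GLOBSTAR enables `**`; SPLIT lets one string carry several patterns joined by `|`.
+flags = glob.GLOBSTAR | glob.SPLIT
+
+patterns = [
+    "**/part*.csv",
+    "*table_files/*.csv",
+    "some_table_files/*.csv|more_table_files/*.csv",
+]
+for pattern in patterns:
+    matches = [b for b in blobs if glob.globmatch(b, pattern, flags=flags)]
+    print(pattern, "->", matches)
+```
+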
+## User Schema
+
+Providing a schema allows for more control over the output of this stream. Without a provided schema, columns and datatypes will be inferred from the first created file in the container matching your path pattern and suffix. This will probably be fine in most cases, but there may be situations where you want to enforce a schema instead, e.g.:
+
+- You only care about a specific known subset of the columns. The other columns would all still be included, but packed into the `_ab_additional_properties` map.
+- Your initial dataset is quite small \(in terms of number of records\), and you think the automatic type inference from this sample might not be representative of the data in the future.
+- You want to purposely define types for every column.
+- You know the names of columns that will be added to future data and want to include these in the core schema as columns rather than have them appear in the `_ab_additional_properties` map.
+
+Or any other reason! The schema must be provided as a valid JSON map of `{"column": "datatype"}` entries, where each datatype is one of:
+
+- string
+- number
+- integer
+- object
+- array
+- boolean
+- null
+
+For example:
+
+- {"id": "integer", "location": "string", "longitude": "number", "latitude": "number"}
+- {"username": "string", "friends": "array", "information": "object"}
+
+## File Format Settings
+
+### CSV
+
+Since CSV files are effectively plain text, providing specific reader options is often required for correct parsing of the files. These settings are applied when a CSV is created or exported so please ensure that this process happens consistently over time.
+
+- **Header Definition**: How headers will be defined. `User Provided` assumes the CSV does not have a header row and uses the headers you provide, while `Autogenerated` assumes the CSV does not have a header row and has the CDK generate headers of the form `f{i}`, where `i` is the column index starting from 0. Otherwise, the default behavior is to use the header row from the CSV file. If you want to autogenerate or provide column names for a CSV that does have a header row, set the "Skip rows before header" option to ignore the header row.
+- **Delimiter**: Even though CSV is an acronym for Comma Separated Values, it is used more generally as a term for flat file data that may or may not be comma separated. The delimiter field lets you specify which character acts as the separator. To use [tab-delimiters](https://en.wikipedia.org/wiki/Tab-separated_values), you can set this value to `\t`. By default, this value is set to `,`.
+- **Double Quote**: This option determines whether two quotes in a quoted CSV value denote a single quote in the data. Set to True by default.
+- **Encoding**: Some data may use a different character set \(typically when different alphabets are involved\). See the [list of allowable encodings here](https://docs.python.org/3/library/codecs.html#standard-encodings). By default, this is set to `utf8`.
+- **Escape Character**: An escape character can be used to prefix a reserved character and ensure correct parsing. A commonly used character is the backslash (`\`). For example, given the following data:
+
+```
+Product,Description,Price
+Jeans,"Navy Blue, Bootcut, 34\"",49.99
+```
+
+The backslash (`\`) is used directly before the second double quote (`"`) to indicate that it is _not_ the closing quote for the field, but rather a literal double quote character that should be included in the value (in this example, denoting the size of the jeans in inches: `34"`).
+
+Leaving this field blank (default option) will disallow escaping; a short parsing sketch after this list shows how these options interact.
+
+- **False Values**: A set of case-sensitive strings that should be interpreted as false values.
+- **Null Values**: A set of case-sensitive strings that should be interpreted as null values. For example, if the value 'NA' should be interpreted as null, enter 'NA' in this field.
+- **Quote Character**: In some cases, data values may contain instances of reserved characters \(like a comma, if that's the delimiter\). CSVs can handle this by wrapping a value in defined quote characters so that on read it can parse it correctly. By default, this is set to `"`.
+- **Skip Rows After Header**: The number of rows to skip after the header row.
+- **Skip Rows Before Header**: The number of rows to skip before the header row.
+- **Strings Can Be Null**: Whether strings can be interpreted as null values. If true, strings that match the null_values set will be interpreted as null. If false, such strings are read as the literal string.
+- **True Values**: A set of case-sensitive strings that should be interpreted as true values.
+
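+The sketch below uses Python's standard `csv` module, not the connector's actual parser, to show how the Delimiter, Quote Character, Escape Character and Double Quote options interact on the jeans example above:
+
+```python
+import csv
+import io
+
+# The jeans example from the Escape Character section above.
+data = 'Product,Description,Price\nJeans,"Navy Blue, Bootcut, 34\\"",49.99\n'
+
+reader = csv.reader(
+    io.StringIO(data),
+    delimiter=",",      # Delimiter option
+    quotechar='"',      # Quote Character option
+    escapechar="\\",    # Escape Character option
+    doublequote=False,  # Double Quote disabled, so the backslash escape applies
+)
+for row in reader:
+    print(row)
+# ['Product', 'Description', 'Price']
+# ['Jeans', 'Navy Blue, Bootcut, 34"', '49.99']
+```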
+
+### Parquet
+
+Apache Parquet is a column-oriented data storage format of the Apache Hadoop ecosystem. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. At the moment, partitioned parquet datasets are unsupported. The following settings are available:
-## Azure Blob Storage Settings
+- **Convert Decimal Fields to Floats**: Whether to convert decimal fields to floats. There is a loss of precision when converting decimals to floats, so this is not recommended.
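+
+To see why this conversion can lose precision, here is a plain-Python illustration (unrelated to the connector's internals):
+
+```python
+from decimal import Decimal
+
+d = Decimal("0.1234567890123456789")  # 19 significant digits
+print(float(d))  # a double keeps only ~17 significant digits; the trailing digits are lost
+```
+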
-* `azure_blob_storage_endpoint` : azure blob storage endpoint to connect to
-* `azure_blob_storage_container_name` : name of the container where your blobs are located
-* `azure_blob_storage_account_name` : name of your account
-* `azure_blob_storage_account_key` : key of your account
-* `azure_blob_storage_blobs_prefix` : prefix for getting files which contain that prefix i.e. FolderA/FolderB/ will get files named FolderA/FolderB/blob.json but not FolderA/blob.json
-* `azure_blob_storage_schema_inference_limit` : Limits the number of files being scanned for schema inference and can increase speed and efficiency
-* `format` : File format of the blobs in the container
+### Avro
-**File Format Settings**
+The Avro parser uses the [Fastavro library](https://fastavro.readthedocs.io/en/latest/). The following settings are available:
+- **Convert Double Fields to Strings**: Whether to convert double fields to strings. This is recommended if you have decimal numbers with a high degree of precision, because there can be a loss of precision when handling floating point numbers.
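+
+As a minimal illustration of reading Avro with the Fastavro library named above (this is not the connector's code, and `example.avro` is a hypothetical local file):
+
+```python
+from fastavro import reader
+
+# Iterate over the records of a local Avro file.
+with open("example.avro", "rb") as f:  # hypothetical file
+    for record in reader(f):
+        print(record)
+```
+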
-### Jsonl
+### JSONL
-Only the line-delimited [JSON](https://jsonlines.org/) format is supported for now
+There are currently no options for JSONL parsing.
-## Changelog 21210
+## Changelog
-| Version | Date | Pull Request | Subject |
-|:--------|:-----------|:-------------------------------------------------|:------------------------------------------------------------------------|
-| 0.1.0 | 2023-02-17 | https://github.com/airbytehq/airbyte/pull/23222 | Initial release with full-refresh and incremental sync with JSONL files |
+| Version | Date | Pull Request | Subject |
+|:--------|:-----------|:------------------------------------------------|:------------------------------------------------------------------------|
+| 0.2.0 | 2023-10-10 | https://github.com/airbytehq/airbyte/pull/31336 | Migrate to File-based CDK. Add support of CSV, Parquet and Avro files |
+| 0.1.0 | 2023-02-17 | https://github.com/airbytehq/airbyte/pull/23222 | Initial release with full-refresh and incremental sync with JSONL files |