From 7bff33fc5d47c9b169e0e8d06f1e34dedc61cae3 Mon Sep 17 00:00:00 2001
From: Vlada Dusek
Date: Sat, 7 Oct 2023 01:05:19 +0200
Subject: [PATCH] Source Apify Dataset: fix broken stream, manifest refactor (#30428)

Co-authored-by: Joe Reuter
Co-authored-by: flash1293
---
 .../source-apify-dataset/Dockerfile           |   2 +-
 .../connectors/source-apify-dataset/README.md |  56 +++++-
 .../acceptance-test-config.yml                |   6 +-
 .../integration_tests/configured_catalog.json |   4 +-
 .../source-apify-dataset/metadata.yaml        |   5 +-
 .../source_apify_dataset/manifest.yaml        | 164 +++++++++---------
 .../source_apify_dataset/schemas/dataset.json |   6 +-
 ...{datasets.json => dataset_collection.json} |   4 +
 .../schemas/item_collection.json              |  21 ---
 .../schemas/item_collection_wcc.json          |  59 +++++++
 .../source_apify_dataset/source.py            |   2 +-
 .../source_apify_dataset/spec.yaml            |  30 ----
 .../sources/apify-dataset-migrations.md       |   4 +
 docs/integrations/sources/apify-dataset.md    |  65 ++++---
 14 files changed, 262 insertions(+), 166 deletions(-)
 rename airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/schemas/{datasets.json => dataset_collection.json} (93%)
 delete mode 100644 airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/schemas/item_collection.json
 create mode 100644 airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/schemas/item_collection_wcc.json
 delete mode 100644 airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/spec.yaml

diff --git a/airbyte-integrations/connectors/source-apify-dataset/Dockerfile b/airbyte-integrations/connectors/source-apify-dataset/Dockerfile
index 0e4fe3cd8143..e83b72f6ecf0 100644
--- a/airbyte-integrations/connectors/source-apify-dataset/Dockerfile
+++ b/airbyte-integrations/connectors/source-apify-dataset/Dockerfile
@@ -34,5 +34,5 @@ COPY source_apify_dataset ./source_apify_dataset
ENV AIRBYTE_ENTRYPOINT "python /airbyte/integration_code/main.py"
ENTRYPOINT ["python", "/airbyte/integration_code/main.py"]

-LABEL io.airbyte.version=1.0.0
+LABEL io.airbyte.version=2.0.0
LABEL io.airbyte.name=airbyte/source-apify-dataset
diff --git a/airbyte-integrations/connectors/source-apify-dataset/README.md b/airbyte-integrations/connectors/source-apify-dataset/README.md
index 6d679120e45c..2a3da072031c 100644
--- a/airbyte-integrations/connectors/source-apify-dataset/README.md
+++ b/airbyte-integrations/connectors/source-apify-dataset/README.md
@@ -5,15 +5,50 @@ For information about how to use this connector within Airbyte, see [the documen

## Local development

+#### Building via Python
+
+Create a Python virtual environment
+
+```
+virtualenv --python $(which python3.10) .venv
+```
+
+Source it
+
+```
+source .venv/bin/activate
+```
+
+Check connector specifications/definition
+
+```
+python main.py spec
+```
+
+Basic check - check connection to the API
+
+```
+python main.py check --config secrets/config.json
+```
+
+Integration tests - read operation from the API
+
+```
+python main.py read --config secrets/config.json --catalog integration_tests/configured_catalog.json
+```
+
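Both the `check` and `read` commands above expect a `secrets/config.json`. Going by the spec added later in this patch (`token` and `dataset_id` are the only required fields), a minimal config might look like the following sketch, reusing the spec's own example values:

```json
{
  "token": "apify_api_PbVwb1cBbuvbfg2jRmAIHZKgx3NQyfEMG7uk",
  "dataset_id": "rHuMdwm6xCFt6WiGU"
}
```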
#### Building via Gradle

You can also build the connector in Gradle. This is typically used in CI and not needed for your development workflow.
To build using Gradle, from the Airbyte repository root, run:
+
```
./gradlew :airbyte-integrations:connectors:source-apify-dataset:build
```

#### Create credentials
+
**If you are a community contributor**, follow the instructions in the [documentation](https://docs.airbyte.com/integrations/sources/apify-dataset) to generate the necessary credentials. Then create a file `secrets/config.json` conforming to the `source_apify_dataset/spec.yaml` file. Note that any directory named `secrets` is gitignored across the entire Airbyte repo, so there is no danger of accidentally checking in sensitive information.
@@ -25,56 +60,73 @@ and place them into `secrets/config.json`.
### Locally running the connector docker image

#### Build
+
First, make sure you build the latest Docker image:
+
```
docker build . -t airbyte/source-apify-dataset:dev
```

You can also build the connector image via Gradle:
+
```
./gradlew :airbyte-integrations:connectors:source-apify-dataset:airbyteDocker
```
+
When building via Gradle, the docker image name and tag, respectively, are the values of the `io.airbyte.name` and `io.airbyte.version` `LABEL`s in the Dockerfile.

#### Run
+
Then run any of the connector commands as follows:
+
```
docker run --rm airbyte/source-apify-dataset:dev spec
docker run --rm -v $(pwd)/secrets:/secrets airbyte/source-apify-dataset:dev check --config /secrets/config.json
docker run --rm -v $(pwd)/secrets:/secrets airbyte/source-apify-dataset:dev discover --config /secrets/config.json
docker run --rm -v $(pwd)/secrets:/secrets -v $(pwd)/integration_tests:/integration_tests airbyte/source-apify-dataset:dev read --config /secrets/config.json --catalog /integration_tests/configured_catalog.json
```
+
## Testing

#### Acceptance Tests
+
Customize the `acceptance-test-config.yml` file to configure tests. See [Connector Acceptance Tests](https://docs.airbyte.com/connector-development/testing-connectors/connector-acceptance-tests-reference) for more information.
If your connector requires creating or destroying resources for use during acceptance tests, create fixtures for them and place them inside integration_tests/acceptance.py.
To run your integration tests with Docker, run:
+
```
./acceptance-test-docker.sh
```

### Using gradle to run tests
+
All commands should be run from the Airbyte project root.
To run unit tests:
+
```
./gradlew :airbyte-integrations:connectors:source-apify-dataset:unitTest
```
+
To run acceptance and custom integration tests:
+
```
./gradlew :airbyte-integrations:connectors:source-apify-dataset:integrationTest
```

## Dependency Management
+
All of your dependencies should go in `setup.py`, NOT `requirements.txt`. The requirements file is only used to connect internal Airbyte dependencies in the monorepo for local development.
We split dependencies between two groups, dependencies that are:
-* required for your connector to work need to go to `MAIN_REQUIREMENTS` list.
-* required for the testing need to go to `TEST_REQUIREMENTS` list
+
+- required for your connector to work, which go in the `MAIN_REQUIREMENTS` list.
+- required for testing, which go in the `TEST_REQUIREMENTS` list.

### Publishing a new version of the connector
+
You've checked out the repo, implemented a million-dollar feature, and you're ready to share your changes with the world. Now what?
+
1. Make sure your changes are passing unit and integration tests.
1. Bump the connector version in `Dockerfile` -- just increment the value of the `LABEL io.airbyte.version` appropriately (we use [SemVer](https://semver.org/)).
1. Create a Pull Request.

diff --git a/airbyte-integrations/connectors/source-apify-dataset/acceptance-test-config.yml b/airbyte-integrations/connectors/source-apify-dataset/acceptance-test-config.yml
index 3d4eb90bf608..71a772f72426 100644
--- a/airbyte-integrations/connectors/source-apify-dataset/acceptance-test-config.yml
+++ b/airbyte-integrations/connectors/source-apify-dataset/acceptance-test-config.yml
@@ -4,7 +4,7 @@ acceptance_tests:
  tests:
    - spec_path: "source_apify_dataset/spec.yaml"
      backward_compatibility_tests_config:
-        disable_for_version: 0.2.0
+        disable_for_version: 2.0.0
  connection:
    tests:
      - config_path: "secrets/config.json"
@@ -15,7 +15,7 @@ acceptance_tests:
  tests:
    - config_path: "secrets/config.json"
      backward_compatibility_tests_config:
-        disable_for_version: 0.2.0
+        disable_for_version: 2.0.0
  basic_read:
    tests:
      - config_path: "secrets/config.json"
@@ -32,7 +32,7 @@ acceptance_tests:
      - config_path: "secrets/config.json"
        configured_catalog_path: "integration_tests/configured_catalog.json"
        ignored_fields:
-          datasets:
+          dataset_collection:
            - name: "accessedAt"
              bypass_reason: "Changes every time"
            - name: "stats/readCount"
diff --git a/airbyte-integrations/connectors/source-apify-dataset/integration_tests/configured_catalog.json b/airbyte-integrations/connectors/source-apify-dataset/integration_tests/configured_catalog.json
index a7db5266dd74..6a76c95e966e 100644
--- a/airbyte-integrations/connectors/source-apify-dataset/integration_tests/configured_catalog.json
+++ b/airbyte-integrations/connectors/source-apify-dataset/integration_tests/configured_catalog.json
@@ -2,7 +2,7 @@
  "streams": [
    {
      "stream": {
-        "name": "datasets",
+        "name": "dataset_collection",
        "json_schema": {},
        "supported_sync_modes": ["full_refresh"]
      },
@@ -20,7 +20,7 @@
    },
    {
      "stream": {
-        "name": "item_collection",
+        "name": "item_collection_website_content_crawler",
        "json_schema": {},
        "supported_sync_modes": ["full_refresh"]
      },
diff --git a/airbyte-integrations/connectors/source-apify-dataset/metadata.yaml b/airbyte-integrations/connectors/source-apify-dataset/metadata.yaml
index 4b3349457147..3c165d9154d0 100644
--- a/airbyte-integrations/connectors/source-apify-dataset/metadata.yaml
+++ b/airbyte-integrations/connectors/source-apify-dataset/metadata.yaml
@@ -11,7 +11,7 @@ data:
  connectorSubtype: api
  connectorType: source
  definitionId: 47f17145-fe20-4ef5-a548-e29b048adf84
-  dockerImageTag: 1.0.0
+  dockerImageTag: 2.0.0
  dockerRepository: airbyte/source-apify-dataset
  githubIssueLabel: source-apify-dataset
  icon: apify-dataset.svg
@@ -24,6 +24,9 @@ data:
    1.0.0:
      upgradeDeadline: 2023-08-30
      message: "Update spec to use token and ingest all 3 streams correctly"
+    2.0.0:
+      upgradeDeadline: 2023-09-18
+      message: "Fix broken stream, manifest refactor"
  supportLevel: community
  documentationUrl: https://docs.airbyte.com/integrations/sources/apify-dataset
  tags:
diff --git a/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/manifest.yaml b/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/manifest.yaml
index 314758404801..600dfa99356a 100644
--- a/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/manifest.yaml
+++ b/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/manifest.yaml
@@ -1,109 +1,115 @@
-version: "0.29.0"
+version: "0.51.11"
+type: DeclarativeSource

definitions:
  selector:
    type: RecordSelector
    extractor:
      type: DpathExtractor
      field_path: ["data"]
  requester:
    type: HttpRequester
    url_base:
"https://api.apify.com/v2/" - http_method: "GET" - authenticator: - type: NoAuth - request_parameters: - token: "{{ config['token'] }}" +spec: + type: Spec + documentation_url: https://docs.airbyte.com/integrations/sources/apify-dataset + connection_specification: + $schema: http://json-schema.org/draft-07/schema# + title: Apify Dataset Spec + type: object + required: + - token + - dataset_id + properties: + token: + type: string + title: API token + description: >- + Personal API token of your Apify account. In Apify Console, you can find your API token in the + Settings section under the Integrations tab + after you login. See the Apify Docs + for more information. + examples: + - apify_api_PbVwb1cBbuvbfg2jRmAIHZKgx3NQyfEMG7uk + airbyte_secret: true + dataset_id: + type: string + title: Dataset ID + description: >- + ID of the dataset you would like to load to Airbyte. In Apify Console, you can view your datasets in the + Storage section under the Datasets tab + after you login. See the Apify Docs + for more information. + examples: + - rHuMdwm6xCFt6WiGU + additionalProperties: true +definitions: retriever: type: SimpleRetriever - record_selector: - $ref: "#/definitions/selector" - paginator: - type: "NoPagination" requester: - $ref: "#/definitions/requester" - - base_paginator: - type: "DefaultPaginator" - page_size_option: - type: "RequestOption" - inject_into: "request_parameter" - field_name: "limit" - pagination_strategy: - type: "OffsetIncrement" - page_size: 50 - page_token_option: - type: "RequestOption" - field_name: "offset" - inject_into: "request_parameter" - - base_stream: - type: DeclarativeStream - retriever: - $ref: "#/definitions/retriever" + type: HttpRequester + url_base: "https://api.apify.com/v2/" + http_method: "GET" + authenticator: + type: BearerAuthenticator + api_token: "{{ config['token'] }}" + paginator: + type: "DefaultPaginator" + page_size_option: + type: "RequestOption" + inject_into: "request_parameter" + field_name: "limit" + pagination_strategy: + type: "OffsetIncrement" + page_size: 50 + page_token_option: + type: "RequestOption" + field_name: "offset" + inject_into: "request_parameter" - datasets_stream: - $ref: "#/definitions/base_stream" +streams: + - type: DeclarativeStream + name: dataset_collection + primary_key: "id" $parameters: - name: "datasets" - primary_key: "id" path: "datasets" + schema_loader: + type: JsonFileSchemaLoader + file_path: "./source_apify_dataset/schemas/dataset_collection.json" retriever: $ref: "#/definitions/retriever" - paginator: - $ref: "#/definitions/base_paginator" record_selector: - $ref: "#/definitions/selector" + type: RecordSelector extractor: type: DpathExtractor field_path: ["data", "items"] - datasets_partition_router: - type: SubstreamPartitionRouter - parent_stream_configs: - - stream: "#/definitions/datasets_stream" - parent_key: "id" - partition_field: "parent_id" - - dataset_stream: - $ref: "#/definitions/base_stream" + - type: DeclarativeStream + name: dataset + primary_key: "id" $parameters: - name: "dataset" - primary_key: "id" - path: "datasets/{{ stream_partition.parent_id }}" + path: "datasets/{{ config['dataset_id'] }}" + schema_loader: + type: JsonFileSchemaLoader + file_path: "./source_apify_dataset/schemas/dataset.json" retriever: $ref: "#/definitions/retriever" - paginator: - $ref: "#/definitions/base_paginator" - partition_router: - $ref: "#/definitions/datasets_partition_router" + record_selector: + type: RecordSelector + extractor: + type: DpathExtractor + field_path: ["data"] - 
item_collection_stream: - $ref: "#/definitions/base_stream" + - type: DeclarativeStream + name: item_collection_website_content_crawler $parameters: - name: "item_collection" - path: "datasets/{{ stream_partition.parent_id }}/items" + path: "datasets/{{ config['dataset_id'] }}/items" + schema_loader: + type: JsonFileSchemaLoader + file_path: "./source_apify_dataset/schemas/item_collection_wcc.json" retriever: $ref: "#/definitions/retriever" - paginator: - $ref: "#/definitions/base_paginator" record_selector: - $ref: "#/definitions/selector" + type: RecordSelector extractor: type: DpathExtractor field_path: [] - partition_router: - $ref: "#/definitions/datasets_partition_router" - -streams: - - "#/definitions/datasets_stream" - - "#/definitions/dataset_stream" - - "#/definitions/item_collection_stream" check: type: CheckStream stream_names: - - "datasets" - - "dataset" - - "item_collection" + - dataset_collection + - dataset + - item_collection_website_content_crawler diff --git a/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/schemas/dataset.json b/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/schemas/dataset.json index b9a6b4240908..c98d9e2d81e4 100644 --- a/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/schemas/dataset.json +++ b/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/schemas/dataset.json @@ -18,6 +18,7 @@ }, "stats": { "type": ["null", "object"], + "additionalProperties": true, "properties": { "readCount": { "type": ["null", "number"] @@ -31,7 +32,7 @@ } }, "schema": { - "type": ["null", "string"] + "type": ["null", "string", "object"] }, "modifiedAt": { "type": ["null", "string"] @@ -51,6 +52,9 @@ "actRunId": { "type": ["null", "string"] }, + "title": { + "type": ["null", "string"] + }, "fields": { "anyOf": [ { diff --git a/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/schemas/datasets.json b/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/schemas/dataset_collection.json similarity index 93% rename from airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/schemas/datasets.json rename to airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/schemas/dataset_collection.json index 4a98370556a6..ed494c694ff2 100644 --- a/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/schemas/datasets.json +++ b/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/schemas/dataset_collection.json @@ -30,6 +30,7 @@ }, "stats": { "type": ["null", "object"], + "additionalProperties": true, "properties": { "readCount": { "type": ["null", "number"] @@ -54,6 +55,9 @@ "actRunId": { "type": ["null", "string"] }, + "title": { + "type": ["null", "string"] + }, "fields": { "anyOf": [ { diff --git a/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/schemas/item_collection.json b/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/schemas/item_collection.json deleted file mode 100644 index edb421509303..000000000000 --- a/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/schemas/item_collection.json +++ /dev/null @@ -1,21 +0,0 @@ -{ - "$schema": "http://json-schema.org/draft-07/schema#", - "type": ["null", "object"], - "title": "Item collection schema", - "additionalProperties": true, - "properties": { - "url": { - "type": ["null", "string"] - }, - "#debug": { - "type": ["null", "object"], - 
"additionalProperties": true - }, - "pageTitle": { - "type": ["null", "string"] - }, - "#error": { - "type": ["null", "boolean"] - } - } -} diff --git a/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/schemas/item_collection_wcc.json b/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/schemas/item_collection_wcc.json new file mode 100644 index 000000000000..dc7c8a68ab47 --- /dev/null +++ b/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/schemas/item_collection_wcc.json @@ -0,0 +1,59 @@ +{ + "$schema": "http://json-schema.org/draft-07/schema#", + "title": "Item collection - Website Content Crawler (WCC)", + "type": ["null", "object"], + "additionalProperties": true, + "properties": { + "crawl": { + "additionalProperties": true, + "properties": { + "depth": { + "type": ["null", "number"] + }, + "httpStatusCode": { + "type": ["null", "number"] + }, + "loadedTime": { + "type": ["null", "string"] + }, + "loadedUrl": { + "type": ["null", "string"] + }, + "referrerUrl": { + "type": ["null", "string"] + } + }, + "type": ["null", "object"] + }, + "markdown": { + "type": ["null", "string"] + }, + "metadata": { + "additionalProperties": true, + "properties": { + "canonicalUrl": { + "type": ["null", "string"] + }, + "description": { + "type": ["null", "string"] + }, + "languageCode": { + "type": ["null", "string"] + }, + "title": { + "type": ["null", "string"] + } + }, + "type": ["null", "object"] + }, + "text": { + "type": ["null", "string"] + }, + "url": { + "type": ["null", "string"] + }, + "screenshotUrl": { + "type": ["null", "string"] + } + } +} diff --git a/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/source.py b/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/source.py index a65846dcdc8b..5b99be176ad1 100644 --- a/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/source.py +++ b/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/source.py @@ -15,4 +15,4 @@ # Declarative Source class SourceApifyDataset(YamlDeclarativeSource): def __init__(self): - super().__init__(**{"path_to_yaml": "manifest.yaml"}) + super().__init__(path_to_yaml="manifest.yaml") diff --git a/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/spec.yaml b/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/spec.yaml deleted file mode 100644 index a5d0acadad36..000000000000 --- a/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/spec.yaml +++ /dev/null @@ -1,30 +0,0 @@ -documentationUrl: https://docs.airbyte.com/integrations/sources/apify-dataset -connectionSpecification: - $schema: http://json-schema.org/draft-07/schema# - title: Apify Dataset Spec - type: object - required: - - token - additionalProperties: true - properties: - token: - title: Personal API tokens - description: >- - Your application's Client Secret. You can find this value on the console integrations tab - after you login. - type: string - examples: - - "Personal API tokens" - airbyte_secret: true - datasetId: - type: string - title: Dataset ID - description: ID of the dataset you would like to load to Airbyte. - clean: - type: boolean - title: Clean - description: - If set to true, only clean items will be downloaded from the dataset. - See description of what clean means in Apify - API docs. If not sure, set clean to false. 
diff --git a/docs/integrations/sources/apify-dataset-migrations.md b/docs/integrations/sources/apify-dataset-migrations.md
index 9e1c23898419..e2c4a948a077 100644
--- a/docs/integrations/sources/apify-dataset-migrations.md
+++ b/docs/integrations/sources/apify-dataset-migrations.md
@@ -1,5 +1,9 @@
# Apify Dataset Migration Guide

+## Upgrading to 2.0.0
+
+Major update: the old, broken Item Collection stream has been removed and replaced with a new Item Collection (WCC) stream specific to the datasets produced by the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor. Please update your connector configuration accordingly. Note: the schema of an Apify dataset is Actor-specific, so we cannot provide a general stream with a static schema for reading data from an arbitrary dataset.
+
## Upgrading to 1.0.0

A major update fixing the data ingestion to properly retrieve data from Apify.
diff --git a/docs/integrations/sources/apify-dataset.md b/docs/integrations/sources/apify-dataset.md
index 5d3e5fbb4359..69f060a2ab12 100644
--- a/docs/integrations/sources/apify-dataset.md
+++ b/docs/integrations/sources/apify-dataset.md
@@ -6,49 +6,64 @@ description: Web scraping and automation platform.

## Overview

-[Apify](https://www.apify.com) is a web scraping and web automation platform providing both ready-made and custom solutions, an open-source [SDK](https://sdk.apify.com/) for web scraping, proxies, and many other tools to help you build and run web automation jobs at scale.
+[Apify](https://apify.com/) is a web scraping and web automation platform providing both ready-made and custom solutions, an open-source [JavaScript SDK](https://docs.apify.com/sdk/js/) and [Python SDK](https://docs.apify.com/sdk/python/) for web scraping, proxies, and many other tools to help you build and run web automation jobs at scale.

-The results of a scraping job are usually stored in [Apify Dataset](https://docs.apify.com/storage/dataset). This Airbyte connector allows you to automatically sync the contents of a dataset to your chosen destination using Airbyte.
+The results of a scraping job are usually stored in an [Apify Dataset](https://docs.apify.com/storage/dataset). This Airbyte connector provides streams to work with datasets, including syncing their contents to your chosen destination using Airbyte.

To sync data from a dataset, all you need to know is its ID. You will find it in the [Apify console](https://my.apify.com/) under Storage.

+Currently, only datasets produced by the Website Content Crawler Actor are supported. Streams for other Actors, and a stream for a general dataset with a dynamic schema, will be added soon.
+
### Running Airbyte sync from Apify webhook

-When your Apify job \(aka [actor run](https://docs.apify.com/actors/running)\) finishes, it can trigger an Airbyte sync by calling the Airbyte [API](https://airbyte-public-api-docs.s3.us-east-2.amazonaws.com/rapidoc-api-docs.html#post-/v1/connections/sync) manual connection trigger \(`POST /v1/connections/sync`\). The API can be called from Apify [webhook](https://docs.apify.com/webhooks) which is executed when your Apify run finishes.
+When your Apify job (aka [Actor run](https://docs.apify.com/platform/actors/running)) finishes, it can trigger an Airbyte sync by calling the Airbyte [API](https://airbyte-public-api-docs.s3.us-east-2.amazonaws.com/rapidoc-api-docs.html#post-/v1/connections/sync) manual connection trigger (`POST /v1/connections/sync`). The API can be called from an Apify [webhook](https://docs.apify.com/platform/integrations/webhooks), which is executed when your Apify run finishes.
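For illustration, the webhook's HTTP request could be as simple as the following sketch. It assumes a self-hosted Airbyte instance exposing its config API at `localhost:8000`; the connection ID is a placeholder you would copy from your Airbyte workspace:

```bash
# Trigger a manual sync of one Airbyte connection when the Apify run finishes.
curl -X POST "http://localhost:8000/api/v1/connections/sync" \
  -H "Content-Type: application/json" \
  -d '{"connectionId": "00000000-0000-0000-0000-000000000000"}'
```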
![](../../.gitbook/assets/apify_trigger_airbyte_connection.png)

-### Output schema
-
-Since the dataset items do not have strongly typed schema, they are synced as objects stored in the `data` field, without any assumption on their content.
-
### Features

-| Feature | Supported? |
-| :------------------------ | :--------------- |
-| Full Refresh Sync | Yes |
-| Incremental Sync | Yes |
+| Feature           | Supported? |
+| :---------------- | :--------- |
+| Full Refresh Sync | Yes        |
+| Incremental Sync  | Yes        |

### Performance considerations

The Apify dataset connector uses the [Apify Python Client](https://docs.apify.com/apify-client-python) under the hood and should handle any API limitations under normal usage.

-## Getting started
+## Streams
+
+### `dataset_collection`
+
+- Calls `api.apify.com/v2/datasets` ([docs](https://docs.apify.com/api/v2#/reference/datasets/dataset-collection/get-list-of-datasets))
+- Properties:
+  - Apify Personal API token (you can find it [here](https://console.apify.com/account/integrations))
+
+### `dataset`

-### Requirements
+- Calls `https://api.apify.com/v2/datasets/{datasetId}` ([docs](https://docs.apify.com/api/v2#/reference/datasets/dataset/get-dataset))
+- Properties:
+  - Apify Personal API token (you can find it [here](https://console.apify.com/account/integrations))
+  - Dataset ID (check the [docs](https://docs.apify.com/platform/storage/dataset))

-* Apify [token](https://console.apify.com/account/integrations) token
-* Parameter clean: true or false
+### `item_collection_website_content_crawler`

-### Changelog
+- Calls `api.apify.com/v2/datasets/{datasetId}/items` ([docs](https://docs.apify.com/api/v2#/reference/datasets/item-collection/get-items))
+- Properties:
+  - Apify Personal API token (you can find it [here](https://console.apify.com/account/integrations))
+  - Dataset ID (check the [docs](https://docs.apify.com/platform/storage/dataset))
+- Limitations:
+  - Currently works only for the datasets produced by [Website Content Crawler](https://apify.com/apify/website-content-crawler).
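To make the stream definitions above concrete: per the manifest in this patch, each stream is a plain bearer-authenticated GET with `limit`/`offset` pagination at a page size of 50. The first page of the items stream, with a placeholder token, would be fetched roughly like this:

```bash
# First page of dataset items, as the connector's paginator requests it.
curl -s "https://api.apify.com/v2/datasets/rHuMdwm6xCFt6WiGU/items?limit=50&offset=0" \
  -H "Authorization: Bearer <your-apify-token>"
```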
| -| 0.1.4 | 2021-12-23 | [PR\#8434](https://github.com/airbytehq/airbyte/pull/8434) | Update fields in source-connectors specifications | -| 0.1.2 | 2021-11-08 | [PR\#7499](https://github.com/airbytehq/airbyte/pull/7499) | Remove base-python dependencies | -| 0.1.0 | 2021-07-29 | [PR\#5069](https://github.com/airbytehq/airbyte/pull/5069) | Initial version of the connector | +## Changelog +| Version | Date | Pull Request | Subject | +| :------ | :--------- | :----------------------------------------------------------- | :-------------------------------------------------------------------------- | +| 2.0.0 | 2023-09-18 | [30428](https://github.com/airbytehq/airbyte/pull/30428) | Fix broken stream, manifest refactor | +| 1.0.0 | 2023-08-25 | [29859](https://github.com/airbytehq/airbyte/pull/29859) | Migrate to lowcode | +| 0.2.0 | 2022-06-20 | [28290](https://github.com/airbytehq/airbyte/pull/28290) | Make connector work with platform changes not syncing empty stream schemas. | +| 0.1.11 | 2022-04-27 | [12397](https://github.com/airbytehq/airbyte/pull/12397) | No changes. Used connector to test publish workflow changes. | +| 0.1.9 | 2022-04-05 | [PR\#11712](https://github.com/airbytehq/airbyte/pull/11712) | No changes from 0.1.4. Used connector to test publish workflow changes. | +| 0.1.4 | 2021-12-23 | [PR\#8434](https://github.com/airbytehq/airbyte/pull/8434) | Update fields in source-connectors specifications | +| 0.1.2 | 2021-11-08 | [PR\#7499](https://github.com/airbytehq/airbyte/pull/7499) | Remove base-python dependencies | +| 0.1.0 | 2021-07-29 | [PR\#5069](https://github.com/airbytehq/airbyte/pull/5069) | Initial version of the connector |