CI Update apify connector #31380

Closed · wants to merge 9 commits

**airbyte-integrations/connectors/source-apify-dataset/Dockerfile**

```diff
@@ -34,5 +34,5 @@ COPY source_apify_dataset ./source_apify_dataset
 ENV AIRBYTE_ENTRYPOINT "python /airbyte/integration_code/main.py"
 ENTRYPOINT ["python", "/airbyte/integration_code/main.py"]
 
-LABEL io.airbyte.version=2.0.0
+LABEL io.airbyte.version=2.1.0
 LABEL io.airbyte.name=airbyte/source-apify-dataset
```

**airbyte-integrations/connectors/source-apify-dataset/metadata.yaml**

```diff
@@ -7,14 +7,13 @@ data:
       enabled: true
     cloud:
       enabled: true
-      dockerImageTag: 0.2.0 # https://github.com/airbytehq/airbyte/issues/30478
   connectorSubtype: api
   connectorType: source
   definitionId: 47f17145-fe20-4ef5-a548-e29b048adf84
-  dockerImageTag: 2.0.0
+  dockerImageTag: 2.1.0
   dockerRepository: airbyte/source-apify-dataset
   githubIssueLabel: source-apify-dataset
-  icon: apify-dataset.svg
+  icon: apify.svg
   license: MIT
   name: Apify Dataset
   releaseDate: 2023-08-25
@@ -27,6 +26,9 @@ data:
       2.0.0:
         upgradeDeadline: 2023-09-18
         message: "Fix broken stream, manifest refactor"
+      2.1.0:
+        upgradeDeadline: 2023-10-27
+        message: "Rename dataset streams"
   supportLevel: community
   documentationUrl: https://docs.airbyte.com/integrations/sources/apify-dataset
   tags:
```

**airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/manifest.yaml**

```diff
@@ -107,9 +107,25 @@ streams:
           type: DpathExtractor
           field_path: []
 
+  - type: DeclarativeStream
+    name: item_collection
+    $parameters:
+      path: "datasets/{{ config['dataset_id'] }}/items"
+    schema_loader:
+      type: JsonFileSchemaLoader
+      file_path: "./source_apify_dataset/schemas/item_collection.json"
+    retriever:
+      $ref: "#/definitions/retriever"
+      record_selector:
+        type: RecordSelector
+        extractor:
+          class_name: source_apify_dataset.wrapping_dpath_extractor.WrappingDpathExtractor
+          field_path: []
+
 check:
   type: CheckStream
   stream_names:
     - dataset_collection
     - dataset
     - item_collection_website_content_crawler
+    - item_collection
```
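
For context, a minimal sketch of loading the updated manifest with the Airbyte CDK to confirm the new stream is registered. The config field names below are assumptions (only `dataset_id` is confirmed by the manifest's interpolation), and the script must run from the connector directory so the custom extractor module resolves:

```python
# Sketch: load the declarative manifest and list its stream names.
# Assumes airbyte-cdk is installed and the working directory is the
# connector root, so `source_apify_dataset.wrapping_dpath_extractor`
# is importable when the custom extractor is instantiated.
from airbyte_cdk.sources.declarative.yaml_declarative_source import YamlDeclarativeSource

source = YamlDeclarativeSource(path_to_yaml="source_apify_dataset/manifest.yaml")

# Hypothetical config values; `dataset_id` matches the manifest's
# `config['dataset_id']` reference, the token field name is assumed.
config = {"token": "YOUR_APIFY_TOKEN", "dataset_id": "YOUR_DATASET_ID"}

print([stream.name for stream in source.streams(config=config)])
# Expected to include: dataset_collection, dataset,
# item_collection_website_content_crawler, item_collection
```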

**airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/schemas/item_collection.json** (new file)

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Item collection",
  "type": ["null", "object"],
  "additionalProperties": true,
  "properties": {
    "data": {
      "additionalProperties": true,
      "type": ["null", "object"]
    }
  }
}
```
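
To illustrate why this permissive schema accepts arbitrary Actor output, a small validation sketch (assumes the third-party `jsonschema` package is installed; the record contents are hypothetical):

```python
# Sketch: any JSON object nested under "data" satisfies the schema,
# because "data" is a free-form object with additionalProperties: true.
import json

from jsonschema import validate

with open("source_apify_dataset/schemas/item_collection.json") as f:
    schema = json.load(f)

# A hypothetical record from an arbitrary Actor's dataset.
record = {"data": {"url": "https://example.com", "score": 0.97, "tags": ["a", "b"]}}
validate(instance=record, schema=schema)  # raises ValidationError only on mismatch
```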

**airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/wrapping_dpath_extractor.py** (new file)

```python
#
# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
#

from dataclasses import dataclass

import requests
from airbyte_cdk.sources.declarative.extractors.dpath_extractor import DpathExtractor
from airbyte_cdk.sources.declarative.types import Record


@dataclass
class WrappingDpathExtractor(DpathExtractor):
    """
    Record extractor that wraps each extracted value into a dict, with the value stored under the key `data`.
    This is done because the actual shape of the data is dynamic, so by wrapping everything into a `data` object
    it can be specified as a generic object in the schema.

    Note that this will cause fields to not be normalized in the destination.
    """

    def extract_records(self, response: requests.Response) -> list[Record]:
        records = super().extract_records(response)
        return [{"data": record} for record in records]
```
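
A minimal sketch of the transformation this extractor applies, using hypothetical records in place of a real API response:

```python
# Hypothetical items as returned by the Apify dataset items endpoint;
# the real shape depends entirely on the Actor that produced the dataset.
records = [
    {"url": "https://example.com", "text": "Page content..."},
    {"pageTitle": "Pricing", "prices": [9, 29, 99]},
]

# WrappingDpathExtractor nests each record under a generic `data` key,
# so heterogeneous records all fit the static item_collection.json schema.
wrapped = [{"data": record} for record in records]

print(wrapped[0])
# {'data': {'url': 'https://example.com', 'text': 'Page content...'}}
```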

**docs/integrations/sources/apify-dataset-migrations.md** (5 additions & 1 deletion)

# Apify Dataset Migration Guide

## Upgrading to 2.1.0

A minor update adding a new stream, `item_collection`, for general datasets. No action is required for your current connector configuration setup.

## Upgrading to 2.0.0

Major update: The old broken Item Collection stream has been removed and replaced with a new Item Collection (WCC) stream specific to datasets produced by the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor. Please update your connector configuration setup. Note: the schema of an Apify Dataset is at least Actor-specific, so we cannot provide a general stream with a static schema covering every Dataset.

## Upgrading to 1.0.0

A major update fixing data ingestion so that data is properly retrieved from Apify.
Please update your connector configuration setup.

**docs/integrations/sources/apify-dataset.md** (9 additions & 2 deletions)

- Apify Personal API token (you can find it [here](https://console.apify.com/account/integrations))
- Dataset ID (check the [docs](https://docs.apify.com/platform/storage/dataset))

### `item_collection`

- Calls `api.apify.com/v2/datasets/{datasetId}/items` ([docs](https://docs.apify.com/api/v2#/reference/datasets/item-collection/get-items))
- Properties:
  - Apify Personal API token (you can find it [here](https://console.apify.com/account/integrations))
  - Dataset ID (check the [docs](https://docs.apify.com/platform/storage/dataset))
- Limitations:
  - The stream uses a dynamic schema (all the data is stored under the `"data"` key), so it should support all Apify Datasets, produced by whatever Actor; see the sketch below.
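
A minimal sketch of the request this stream reads from, assuming a valid personal API token and dataset ID (`YOUR_TOKEN` and `YOUR_DATASET_ID` are placeholders):

```python
import requests

# Placeholders; substitute a real Apify token and dataset ID.
API_TOKEN = "YOUR_TOKEN"
DATASET_ID = "YOUR_DATASET_ID"

# The same endpoint the `item_collection` stream calls:
# GET https://api.apify.com/v2/datasets/{datasetId}/items
response = requests.get(
    f"https://api.apify.com/v2/datasets/{DATASET_ID}/items",
    params={"token": API_TOKEN, "format": "json"},
)
response.raise_for_status()
items = response.json()  # a list of Actor-specific records
```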

### `item_collection_website_content_crawler`

- Calls the same endpoint and uses the same properties as the `item_collection` stream.
- Limitations:
  - The stream uses a static schema that corresponds to datasets produced by the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor, so only datasets produced by this Actor are supported.

## Changelog

| Version | Date | Pull Request | Subject |
| :------ | :--------- | :----------------------------------------------------------- | :-------------------------------------------------------------------------- |
| 2.1.0 | 2023-10-13 | [31333](https://github.com/airbytehq/airbyte/pull/31333) | Add stream for arbitrary datasets |
| 2.0.0 | 2023-09-18 | [30428](https://github.com/airbytehq/airbyte/pull/30428) | Fix broken stream, manifest refactor |
| 1.0.0 | 2023-08-25 | [29859](https://github.com/airbytehq/airbyte/pull/29859) | Migrate to lowcode |
| 0.2.0 | 2022-06-20 | [28290](https://github.com/airbytehq/airbyte/pull/28290) | Make connector work with platform changes not syncing empty stream schemas. |