diff --git a/docs/website/docs/dlt-ecosystem/file-formats/insert-format.md b/docs/website/docs/dlt-ecosystem/file-formats/insert-format.md index 3e58b5a25d..743936af72 100644 --- a/docs/website/docs/dlt-ecosystem/file-formats/insert-format.md +++ b/docs/website/docs/dlt-ecosystem/file-formats/insert-format.md @@ -5,7 +5,7 @@ keywords: [insert values, file formats] --- import SetTheFormat from './_set_the_format.mdx'; -# SQL INSERT File Format +# SQL INSERT file format This file format contains an INSERT...VALUES statement to be executed on the destination during the `load` stage. @@ -18,12 +18,13 @@ Additional data types are stored as follows: This file format is [compressed](../../reference/performance.md#disabling-and-enabling-file-compression) by default. -## Supported Destinations +## Supported destinations This format is used by default by: **DuckDB**, **Postgres**, **Redshift**, **Synapse**, **MSSQL**, **Motherduck** -It is also supported by: **Filesystem** if you'd like to store INSERT VALUES statements for some reason +It is also supported by: **Filesystem** if you'd like to store INSERT VALUES statements for some reason. ## How to configure + diff --git a/docs/website/docs/dlt-ecosystem/file-formats/jsonl.md b/docs/website/docs/dlt-ecosystem/file-formats/jsonl.md index 5957ccc8ad..54e5b1cbd2 100644 --- a/docs/website/docs/dlt-ecosystem/file-formats/jsonl.md +++ b/docs/website/docs/dlt-ecosystem/file-formats/jsonl.md @@ -5,10 +5,9 @@ keywords: [jsonl, file formats] --- import SetTheFormat from './_set_the_format.mdx'; -# jsonl - JSON Delimited +# jsonl - JSON delimited -JSON Delimited is a file format that stores several JSON documents in one file. The JSON -documents are separated by a new line. +JSON delimited is a file format that stores several JSON documents in one file. The JSON documents are separated by a new line. Additional data types are stored as follows: @@ -18,13 +17,13 @@ Additional data types are stored as follows: - `HexBytes` is stored as a hex encoded string; - `json` is serialized as a string. -This file format is -[compressed](../../reference/performance.md#disabling-and-enabling-file-compression) by default. +This file format is [compressed](../../reference/performance.md#disabling-and-enabling-file-compression) by default. -## Supported Destinations +## Supported destinations This format is used by default by: **BigQuery**, **Snowflake**, **Filesystem**. ## How to configure + diff --git a/docs/website/docs/dlt-ecosystem/file-formats/parquet.md b/docs/website/docs/dlt-ecosystem/file-formats/parquet.md index 30f7051386..3830a45ff1 100644 --- a/docs/website/docs/dlt-ecosystem/file-formats/parquet.md +++ b/docs/website/docs/dlt-ecosystem/file-formats/parquet.md @@ -9,13 +9,13 @@ import SetTheFormat from './_set_the_format.mdx'; [Apache Parquet](https://en.wikipedia.org/wiki/Apache_Parquet) is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. `dlt` is capable of storing data in this format when configured to do so. -To use this format, you need a `pyarrow` package. You can get this package as a `dlt` extra as well: +To use this format, you need the `pyarrow` package. 
You can get this package as a `dlt` extra as well: ```sh pip install "dlt[parquet]" ``` -## Supported Destinations +## Supported destinations Supported by: **BigQuery**, **DuckDB**, **Snowflake**, **Filesystem**, **Athena**, **Databricks**, **Synapse** @@ -23,7 +23,7 @@ Supported by: **BigQuery**, **DuckDB**, **Snowflake**, **Filesystem**, **Athena* -## Destination AutoConfig +## Destination autoconfig `dlt` uses [destination capabilities](../../walkthroughs/create-new-destination.md#3-set-the-destination-capabilities) to configure the parquet writer: * It uses decimal and wei precision to pick the right **decimal type** and sets precision and scale. * It uses timestamp precision to pick the right **timestamp type** resolution (seconds, micro, or nano). @@ -32,17 +32,17 @@ Supported by: **BigQuery**, **DuckDB**, **Snowflake**, **Filesystem**, **Athena* Under the hood, `dlt` uses the [pyarrow parquet writer](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html) to create the files. The following options can be used to change the behavior of the writer: -- `flavor`: Sanitize schema or set other compatibility options to work with various target systems. Defaults to None which is **pyarrow** default. +- `flavor`: Sanitize schema or set other compatibility options to work with various target systems. Defaults to None, which is the **pyarrow** default. - `version`: Determine which Parquet logical types are available for use, whether the reduced set from the Parquet 1.x.x format or the expanded logical types added in later format versions. Defaults to "2.6". -- `data_page_size`: Set a target threshold for the approximate encoded size of data pages within a column chunk (in bytes). Defaults to None which is **pyarrow** default. +- `data_page_size`: Set a target threshold for the approximate encoded size of data pages within a column chunk (in bytes). Defaults to None, which is the **pyarrow** default. - `row_group_size`: Set the number of rows in a row group. [See here](#row-group-size) how this can optimize parallel processing of queries on your destination over the default setting of `pyarrow`. -- `timestamp_timezone`: A string specifying timezone, default is UTC. -- `coerce_timestamps`: resolution to which coerce timestamps, choose from **s**, **ms**, **us**, **ns** -- `allow_truncated_timestamps` - will raise if precision is lost on truncated timestamp. +- `timestamp_timezone`: A string specifying the timezone, default is UTC. +- `coerce_timestamps`: resolution to which to coerce timestamps, choose from **s**, **ms**, **us**, **ns** +- `allow_truncated_timestamps` - will raise if precision is lost on truncated timestamps. :::tip -Default parquet version used by `dlt` is 2.4. It coerces timestamps to microseconds and truncates nanoseconds silently. Such setting -provides best interoperability with database systems, including loading panda frames which have nanosecond resolution by default +The default parquet version used by `dlt` is 2.4. It coerces timestamps to microseconds and truncates nanoseconds silently. Such a setting +provides the best interoperability with database systems, including loading panda frames which have nanosecond resolution by default. ::: Read the [pyarrow parquet docs](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html) to learn more about these settings. 
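
To see these writer options in context, here is a minimal sketch of overriding a couple of them via environment variables before running a pipeline with parquet output. The `NORMALIZE__DATA_WRITER__FLAVOR` and `NORMALIZE__DATA_WRITER__ROW_GROUP_SIZE` variable names are assumed to follow the same `NORMALIZE__DATA_WRITER__*` pattern as the `TIMESTAMP_TIMEZONE` example below; the destination, pipeline name, and sample data are placeholders.

```py
import os
import dlt

# Assumed to follow the NORMALIZE__DATA_WRITER__* pattern used for timestamp_timezone below.
os.environ["NORMALIZE__DATA_WRITER__FLAVOR"] = "spark"
os.environ["NORMALIZE__DATA_WRITER__ROW_GROUP_SIZE"] = "100000"

pipeline = dlt.pipeline(
    pipeline_name="parquet_writer_demo",  # placeholder name
    destination="duckdb",                 # any parquet-capable destination works here
    dataset_name="parquet_demo_data",
)

# Request parquet explicitly for this run; the writer options above are applied
# during the normalize stage, when the parquet files are written.
load_info = pipeline.run(
    [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}],
    table_name="items",
    loader_file_format="parquet",
)
print(load_info)
```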
@@ -68,28 +68,27 @@ NORMALIZE__DATA_WRITER__TIMESTAMP_TIMEZONE ``` ### Timestamps and timezones -`dlt` adds timezone (UTC adjustment) to all timestamps regardless of a precision (from seconds to nanoseconds). `dlt` will also create TZ aware timestamp columns in -the destinations. [duckdb is an exception here](../destinations/duckdb.md#supported-file-formats) +`dlt` adds timezone (UTC adjustment) to all timestamps regardless of the precision (from seconds to nanoseconds). `dlt` will also create TZ-aware timestamp columns in +the destinations. [DuckDB is an exception here](../destinations/duckdb.md#supported-file-formats). -### Disable timezones / utc adjustment flags +### Disable timezones / UTC adjustment flags You can generate parquet files without timezone adjustment information in two ways: -1. Set the **flavor** to spark. All timestamps will be generated via deprecated `int96` physical data type, without the logical one -2. Set the **timestamp_timezone** to empty string (ie. `DATA_WRITER__TIMESTAMP_TIMEZONE=""`) to generate logical type without UTC adjustment. +1. Set the **flavor** to spark. All timestamps will be generated via the deprecated `int96` physical data type, without the logical one. +2. Set the **timestamp_timezone** to an empty string (i.e., `DATA_WRITER__TIMESTAMP_TIMEZONE=""`) to generate a logical type without UTC adjustment. -To our best knowledge, arrow will convert your timezone aware DateTime(s) to UTC and store them in parquet without timezone information. +To our best knowledge, Arrow will convert your timezone-aware DateTime(s) to UTC and store them in parquet without timezone information. ### Row group size -The `pyarrow` parquet writer writes each item, i.e. table or record batch, in a separate row group. -This may lead to many small row groups which may not be optimal for certain query engines. For example, `duckdb` parallelizes on a row group. -`dlt` allows controlling the size of the row group by -[buffering and concatenating tables](../../reference/performance.md#controlling-in-memory-buffers) and batches before they are written. The concatenation is done as a zero-copy to save memory. -You can control the size of the row group by setting the maximum number of rows kept in the buffer. + +The `pyarrow` parquet writer writes each item, i.e., table or record batch, in a separate row group. This may lead to many small row groups, which may not be optimal for certain query engines. For example, `duckdb` parallelizes on a row group. `dlt` allows controlling the size of the row group by [buffering and concatenating tables](../../reference/performance.md#controlling-in-memory-buffers) and batches before they are written. The concatenation is done as a zero-copy to save memory. You can control the size of the row group by setting the maximum number of rows kept in the buffer. + ```toml [extract.data_writer] buffer_max_items=10e6 ``` -Mind that `dlt` holds the tables in memory. Thus, 1,000,000 rows in the example above may consume a significant amount of RAM. -`row_group_size` configuration setting has limited utility with `pyarrow` writer. It may be useful when you write single very large pyarrow tables -or when your in memory buffer is really large. \ No newline at end of file +Keep in mind that `dlt` holds the tables in memory. Thus, 1,000,000 rows in the example above may consume a significant amount of RAM. + +The `row_group_size` configuration setting has limited utility with the `pyarrow` writer. 
It may be useful when you write single very large pyarrow tables or when your in-memory buffer is really large. + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/_source-info-header.md b/docs/website/docs/dlt-ecosystem/verified-sources/_source-info-header.md index 112dcf06bf..2d41b6612c 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/_source-info-header.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/_source-info-header.md @@ -1,6 +1,7 @@ import Admonition from "@theme/Admonition"; import Link from '../../_book-onboarding-call.md'; - + Join our Slack community or . - \ No newline at end of file + + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/amazon_kinesis.md b/docs/website/docs/dlt-ecosystem/verified-sources/amazon_kinesis.md index b3328c6897..c4e4268647 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/amazon_kinesis.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/amazon_kinesis.md @@ -9,13 +9,9 @@ import Header from './_source-info-header.md';
-[Amazon Kinesis](https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html) is a cloud-based -service for real-time data streaming and analytics, enabling the processing and analysis of large -streams of data in real time. +[Amazon Kinesis](https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html) is a cloud-based service for real-time data streaming and analytics, enabling the processing and analysis of large streams of data in real time. -Our AWS Kinesis [verified source](https://github.com/dlt-hub/verified-sources/tree/master/sources/kinesis) -loads messages from Kinesis streams to your preferred -[destination](../../dlt-ecosystem/destinations/). +Our AWS Kinesis [verified source](https://github.com/dlt-hub/verified-sources/tree/master/sources/kinesis) loads messages from Kinesis streams to your preferred [destination](../../dlt-ecosystem/destinations/). Resources that can be loaded using this verified source are: @@ -25,16 +21,14 @@ Resources that can be loaded using this verified source are: :::tip -You can check out our pipeline example -[here](https://github.com/dlt-hub/verified-sources/blob/master/sources/kinesis_pipeline.py). +You can check out our pipeline example [here](https://github.com/dlt-hub/verified-sources/blob/master/sources/kinesis_pipeline.py). ::: -## Setup Guide +## Setup guide ### Grab credentials -To use this verified source, you need an AWS `Access key` and `Secret access key`, which can be obtained -as follows: +To use this verified source, you need an AWS `Access key` and `Secret access key`, which can be obtained as follows: 1. Sign in to your AWS Management Console. 1. Navigate to the IAM (Identity and Access Management) dashboard. @@ -44,8 +38,7 @@ as follows: 1. Download or copy the Access Key ID and Secret Access Key for future use. :::info -The AWS UI, which is described here, might change. The full guide is available at this -[link](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html). +The AWS UI, which is described here, might change. The full guide is available at this [link](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html). ::: ### Initialize the verified source @@ -58,24 +51,17 @@ To get started with your data pipeline, follow these steps: dlt init kinesis duckdb ``` - [This command](../../reference/command-line-interface) will initialize - [the pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/kinesis_pipeline.py) - with Kinesis as the [source](../../general-usage/source) and [duckdb](../destinations/duckdb.md) - as the [destination](../destinations). + [This command](../../reference/command-line-interface) will initialize [the pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/kinesis_pipeline.py) with Kinesis as the [source](../../general-usage/source) and [duckdb](../destinations/duckdb.md) as the [destination](../destinations). -1. If you'd like to use a different destination, simply replace `duckdb` with the name of your - preferred [destination](../destinations). +1. If you'd like to use a different destination, simply replace `duckdb` with the name of your preferred [destination](../destinations). -1. After running this command, a new directory will be created with the necessary files and - configuration settings to get started. +1. After running this command, a new directory will be created with the necessary files and configuration settings to get started. 
For more information, read [Add a verified source.](../../walkthroughs/add-a-verified-source) ### Add credentials -1. In the `.dlt` folder, there's a file called `secrets.toml`. It's where you store sensitive - information securely, like access tokens. Keep this file safe. Here's its format for service - account authentication: +1. In the `.dlt` folder, there's a file called `secrets.toml`. It's where you store sensitive information securely, like access tokens. Keep this file safe. Here's its format for service account authentication: ```toml # Put your secret values and credentials here. @@ -93,13 +79,9 @@ For more information, read [Add a verified source.](../../walkthroughs/add-a-ver stream_name = "please set me up!" # Stream name (Optional). ``` -1. Replace the value of `aws_access_key_id` and `aws_secret_access_key` with the one that - [you copied above](#grab-credentials). This will ensure that the verified source can access - your Kinesis resource securely. +1. Replace the value of `aws_access_key_id` and `aws_secret_access_key` with the one that [you copied above](#grab-credentials). This will ensure that the verified source can access your Kinesis resource securely. -1. Next, follow the instructions in [Destinations](../destinations/duckdb) to add credentials for - your chosen destination. This will ensure that your data is properly routed to its final - destination. +1. Next, follow the instructions in [Destinations](../destinations/duckdb) to add credentials for your chosen destination. This will ensure that your data is properly routed to its final destination. For more information, read [Credentials](../../general-usage/credentials). @@ -110,11 +92,11 @@ For more information, read [Credentials](../../general-usage/credentials). ```sh pip install -r requirements.txt ``` -1. You're now ready to run the pipeline! To get started, run the following command: +2. You're now ready to run the pipeline! To get started, run the following command: ```sh python kinesis_pipeline.py ``` -1. Once the pipeline has finished running, you can verify that everything loaded correctly by using +3. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: ```sh dlt pipeline show @@ -132,7 +114,7 @@ For more information, read [Run a pipeline.](../../walkthroughs/run-a-pipeline) ### Resource `kinesis_stream` This resource reads a Kinesis stream and yields messages. It supports -[incremental loading](../../general-usage/incremental-loading) and parses messages as json by +[incremental loading](../../general-usage/incremental-loading) and parses messages as JSON by default. ```py @@ -180,14 +162,14 @@ resource will have the same name as the stream. When you iterate this resource ( shard, it will create an iterator to read messages: 1. If `initial_at_timestamp` is present, the resource will read all messages after this timestamp. -1. If `initial_at_timestamp` is 0, only the messages at the tip of the stream are read. -1. If no initial timestamp is provided, all messages will be retrieved (from the TRIM HORIZON). +2. If `initial_at_timestamp` is 0, only the messages at the tip of the stream are read. +3. If no initial timestamp is provided, all messages will be retrieved (from the TRIM HORIZON). The resource stores all message sequences per shard in the state. If you run the resource again, it will load messages incrementally: 1. For all shards that had messages, only messages after the last message are retrieved. -1. 
For shards that didn't have messages (or new shards), the last run time is used to get messages. +2. For shards that didn't have messages (or new shards), the last run time is used to get messages. Please check the `kinesis_stream` [docstring](https://github.com/dlt-hub/verified-sources/blob/master/sources/kinesis/__init__.py#L31-L46) for additional options, i.e., to limit the number of messages @@ -202,13 +184,13 @@ if False, `data` is returned as bytes. ## Customization + + ### Create your own pipeline -If you wish to create your own pipelines, you can leverage source and resource methods from this -verified source. +If you wish to create your own pipelines, you can leverage source and resource methods from this verified source. -1. Configure the [pipeline](../../general-usage/pipeline) by specifying the pipeline name, - destination, and dataset as follows: +1. Configure the [pipeline](../../general-usage/pipeline) by specifying the pipeline name, destination, and dataset as follows: ```py pipeline = dlt.pipeline( @@ -221,9 +203,9 @@ verified source. 1. To load messages from a stream from the last one hour: ```py - # the resource below will take its name from the stream name, - # it can be used multiple times by default it assumes that Data is json and parses it, - # here we disable that to just get bytes in data elements of the message + # The resource below will take its name from the stream name, + # it can be used multiple times. By default, it assumes that data is JSON and parses it, + # here we disable that to just get bytes in data elements of the message. kinesis_stream_data = kinesis_stream( "kinesis_source_name", parse_json=False, @@ -236,7 +218,7 @@ verified source. 1. For incremental Kinesis streams, to fetch only new messages: ```py - #running pipeline will get only new messages + # Running pipeline will get only new messages. info = pipeline.run(kinesis_stream_data) message_counts = pipeline.last_trace.last_normalize_info.row_counts if "kinesis_source_name" not in message_counts: @@ -245,7 +227,7 @@ verified source. print(pipeline.last_trace.last_normalize_info) ``` -1. To parse json with a simple decoder: +1. To parse JSON with a simple decoder: ```py def _maybe_parse_json(item: TDataItem) -> TDataItem: @@ -267,23 +249,23 @@ verified source. STATE_FILE = "kinesis_source_name.state.json" - # load the state if it exists + # Load the state if it exists. if os.path.exists(STATE_FILE): with open(STATE_FILE, "rb") as f: state = json.typed_loadb(f.read()) else: - # provide new state + # Provide new state. state = {} with Container().injectable_context( StateInjectableContext(state=state) ) as managed_state: - # dlt resources/source is just an iterator + # dlt resources/source is just an iterator. for message in kinesis_stream_data: - # here you can send the message somewhere + # Here you can send the message somewhere. print(message) - # save state after each message to have full transaction load - # dynamodb is also OK + # Save state after each message to have full transaction load. + # DynamoDB is also OK. 
with open(STATE_FILE, "wb") as f: json.typed_dump(managed_state.state, f) print(managed_state.state) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/arrow-pandas.md b/docs/website/docs/dlt-ecosystem/verified-sources/arrow-pandas.md index c92c9c6f6b..29b5e5618c 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/arrow-pandas.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/arrow-pandas.md @@ -5,20 +5,20 @@ keywords: [arrow, pandas, parquet, source] --- import Header from './_source-info-header.md'; -# Arrow Table / Pandas +# Arrow table / Pandas
You can load data directly from an Arrow table or Pandas dataframe. -This is supported by all destinations, but recommended especially when using destinations that support the `parquet` file format natively (e.g. [Snowflake](../destinations/snowflake.md) and [Filesystem](../destinations/filesystem.md)). +This is supported by all destinations, but it is especially recommended when using destinations that support the `parquet` file format natively (e.g., [Snowflake](../destinations/snowflake.md) and [Filesystem](../destinations/filesystem.md)). See the [destination support](#destination-support-and-fallback) section for more information. -When used with a `parquet` supported destination this is a more performant way to load structured data since `dlt` bypasses many processing steps normally involved in passing JSON objects through the pipeline. -`dlt` automatically translates the Arrow table's schema to the destination table's schema and writes the table to a parquet file which gets uploaded to the destination without any further processing. +When used with a `parquet` supported destination, this is a more performant way to load structured data since `dlt` bypasses many processing steps normally involved in passing JSON objects through the pipeline. +`dlt` automatically translates the Arrow table's schema to the destination table's schema and writes the table to a parquet file, which gets uploaded to the destination without any further processing. ## Usage -To write an Arrow source, pass any `pyarrow.Table`, `pyarrow.RecordBatch` or `pandas.DataFrame` object (or list of thereof) to the pipeline's `run` or `extract` method, or yield table(s)/dataframe(s) from a `@dlt.resource` decorated function. +To write an Arrow source, pass any `pyarrow.Table`, `pyarrow.RecordBatch`, or `pandas.DataFrame` object (or list thereof) to the pipeline's `run` or `extract` method, or yield table(s)/dataframe(s) from a `@dlt.resource` decorated function. This example loads a Pandas dataframe to a Snowflake table: @@ -58,10 +58,10 @@ Note: The data in the table must be compatible with the destination database as Destinations that support the `parquet` format natively will have the data files uploaded directly as possible. Rewriting files can be avoided completely in many cases. -When the destination does not support `parquet`, the rows are extracted from the table and written in the destination's native format (usually `insert_values`) and this is generally much slower +When the destination does not support `parquet`, the rows are extracted from the table and written in the destination's native format (usually `insert_values`), and this is generally much slower as it requires processing the table row by row and rewriting data to disk. -The output file format is chosen automatically based on the destination's capabilities, so you can load arrow or pandas frames to any destination but performance will vary. +The output file format is chosen automatically based on the destination's capabilities, so you can load arrow or pandas frames to any destination, but performance will vary. ### Destinations that support parquet natively for direct loading * duckdb & motherduck @@ -89,13 +89,13 @@ add_dlt_id = true Keep in mind that enabling these incurs some performance overhead: -- `add_dlt_load_id` has minimal overhead since the column is added to arrow table in memory during `extract` stage, before parquet file is written to disk -- `add_dlt_id` adds the column during `normalize` stage after file has been extracted to disk. 
The file needs to be read back from disk in chunks, processed and rewritten with new columns +- `add_dlt_load_id` has minimal overhead since the column is added to the arrow table in memory during the `extract` stage, before the parquet file is written to disk +- `add_dlt_id` adds the column during the `normalize` stage after the file has been extracted to disk. The file needs to be read back from disk in chunks, processed, and rewritten with new columns ## Incremental loading with Arrow tables You can use incremental loading with Arrow tables as well. -Usage is the same as without other dlt resources. Refer to the [incremental loading](../../general-usage/incremental-loading.md) guide for more information. +Usage is the same as with other dlt resources. Refer to the [incremental loading](../../general-usage/incremental-loading.md) guide for more information. Example: @@ -104,12 +104,12 @@ import dlt from dlt.common import pendulum import pandas as pd -# Create a resource using that yields a dataframe, using the `ordered_at` field as an incremental cursor +# Create a resource that yields a dataframe, using the `ordered_at` field as an incremental cursor @dlt.resource(primary_key="order_id") def orders(ordered_at = dlt.sources.incremental('ordered_at')): - # Get dataframe/arrow table from somewhere + # Get a dataframe/arrow table from somewhere # If your database supports it, you can use the last_value to filter data at the source. - # Otherwise it will be filtered automatically after loading the data. + # Otherwise, it will be filtered automatically after loading the data. df = get_orders(since=ordered_at.last_value) yield df @@ -124,9 +124,9 @@ Look at the [Connector X + Arrow Example](../../examples/connector_x_arrow/) to ::: ## Loading JSON documents -If you want to skip default `dlt` JSON normalizer, you can use any available method to convert JSON documents into tabular data. +If you want to skip the default `dlt` JSON normalizer, you can use any available method to convert JSON documents into tabular data. * **pandas** has `read_json` and `json_normalize` methods -* **pyarrow** can infer table schema and convert JSON files into tables with `read_json` +* **pyarrow** can infer the table schema and convert JSON files into tables with `read_json` * **duckdb** can do the same with `read_json_auto` ```py @@ -153,15 +153,15 @@ The Arrow data types are translated to dlt data types as follows: | `int` | `bigint` | Precision is determined by the bit width. | | `binary` | `binary` | | | `decimal` | `decimal` | Precision and scale are determined by the type properties. | -| `struct` | `json` | | +| `struct` | `json` | | | | | | ## Loading nested types -All struct types are represented as `json` and will be loaded as JSON (if destination permits) or a string. Currently we do not support **struct** types, +All struct types are represented as `json` and will be loaded as JSON (if the destination permits) or a string. Currently, we do not support **struct** types, even if they are present in the destination (except **BigQuery** which can be [configured to handle them](../destinations/bigquery.md#use-bigquery-schema-autodetect-for-nested-fields)) -If you want to represent nested data as separated tables, you must yield panda frames and arrow tables as records. In the examples above: +If you want to represent nested data as separate tables, you must yield panda frames and arrow tables as records. 
In the examples above: ```py # yield panda frame as records pipeline.run(df.to_dict(orient='records'), table_name="orders") @@ -169,4 +169,5 @@ pipeline.run(df.to_dict(orient='records'), table_name="orders") # yield arrow table pipeline.run(table.to_pylist(), table_name="orders") ``` -Both Pandas and Arrow allow to stream records in batches. +Both Pandas and Arrow allow streaming records in batches. + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/github.md b/docs/website/docs/dlt-ecosystem/verified-sources/github.md index 830f4035d8..221a2c3009 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/github.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/github.md @@ -9,21 +9,20 @@ import Header from './_source-info-header.md';
-This verified source can be used to load data on issues or pull requests from any GitHub repository -onto a [destination](../../dlt-ecosystem/destinations) of your choice using [GitHub API](https://docs.github.com/en/rest?apiVersion=2022-11-28). +This verified source can be used to load data on issues or pull requests from any GitHub repository onto a [destination](../../dlt-ecosystem/destinations) of your choice using the [GitHub API](https://docs.github.com/en/rest?apiVersion=2022-11-28). Resources that can be loaded using this verified source are: | Name | Description | | ---------------- |----------------------------------------------------------------------------------| -| github_reactions | Retrieves all issues, pull requests, comments and reactions associated with them | +| github_reactions | Retrieves all issues, pull requests, comments, and reactions associated with them | | github_repo_events | Gets all the repo events associated with the repository | -## Setup Guide +## Setup guide ### Grab credentials -To get the API token, sign-in to your GitHub account and follow these steps: +To get the API token, sign in to your GitHub account and follow these steps: 1. Click on your profile picture in the top right corner. @@ -31,8 +30,7 @@ To get the API token, sign-in to your GitHub account and follow these steps: 1. Select "Developer settings" on the left panel. -1. Under "Personal access tokens", click on "Generate a personal access token (preferably under - Tokens(classic))". +1. Under "Personal access tokens", click on "Generate a personal access token (preferably under Tokens(classic))". 1. Grant at least the following scopes to the token by checking them. @@ -42,7 +40,7 @@ To get the API token, sign-in to your GitHub account and follow these steps: | read:repo_hook | Grants read and ping access to hooks in public or private repositories | | read:org | Read-only access to organization membership, organization projects, and team membership | | read:user | Grants access to read a user's profile data | - | read:project | Grants read only access to user and organization projects | + | read:project | Grants read-only access to user and organization projects | | read:discussion | Allows read access for team discussions | 1. Finally, click "Generate token". @@ -52,11 +50,11 @@ To get the API token, sign-in to your GitHub account and follow these steps: > You can optionally add API access tokens to avoid making requests as an unauthorized user. > If you wish to load data using the github_reaction source, the access token is mandatory. -More information you can see in the +For more information, see the [GitHub authentication](https://docs.github.com/en/rest/overview/authenticating-to-the-rest-api?apiVersion=2022-11-28#basic-authentication) and [GitHub API token scopes](https://docs.github.com/en/apps/oauth-apps/building-oauth-apps/scopes-for-oauth-apps) -documentations. +documentation. ### Initialize the verified source @@ -83,30 +81,24 @@ For more information, read the guide on [how to add a verified source](../../wal ### Add credentials -1. In `.dlt/secrets.toml`, you can securely store your access tokens and other sensitive - information. It's important to handle this file with care and keep it safe. Here's what the file - looks like: +1. In `.dlt/secrets.toml`, you can securely store your access tokens and other sensitive information. It's important to handle this file with care and keep it safe. 
Here's what the file looks like: ```toml # Put your secret values and credentials here - # Github access token (must be classic for reactions source) + # GitHub access token (must be classic for reactions source) [sources.github] access_token="please set me up!" # use GitHub access token here ``` -1. Replace the API token value with the [previously copied one](#grab-credentials) to ensure secure - access to your GitHub resources. +1. Replace the API token value with the [previously copied one](#grab-credentials) to ensure secure access to your GitHub resources. -1. Next, follow the [destination documentation](../../dlt-ecosystem/destinations) instructions to - add credentials for your chosen destination, ensuring proper routing of your data to the final - destination. +1. Next, follow the [destination documentation](../../dlt-ecosystem/destinations) instructions to add credentials for your chosen destination, ensuring proper routing of your data to the final destination. For more information, read the [General Usage: Credentials.](../../general-usage/credentials) ## Run the pipeline -1. Before running the pipeline, ensure that you have installed all the necessary dependencies by - running the command: +1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: ```sh pip install -r requirements.txt ``` @@ -114,25 +106,21 @@ For more information, read the [General Usage: Credentials.](../../general-usage ```sh python github_pipeline.py ``` -1. Once the pipeline has finished running, you can verify that everything loaded correctly by using - the following command: +1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: ```sh dlt pipeline show ``` - For example, the `pipeline_name` for the above pipeline example is `github_reactions`, you may - also use any custom name instead. + For example, the `pipeline_name` for the above pipeline example is `github_reactions`; you may also use any custom name instead. For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline). ## Sources and resources -`dlt` works on the principle of [sources](../../general-usage/source) and -[resources](../../general-usage/resource). +`dlt` works on the principle of [sources](../../general-usage/source) and [resources](../../general-usage/resource). ### Source `github_reactions` -This `dlt.source` function uses GraphQL to fetch DltResource objects: issues and pull requests along -with associated reactions, comments, and reactions to comments. +This `dlt.source` function uses GraphQL to fetch DltResource objects: issues and pull requests along with associated reactions, comments, and reactions to comments. ```py @dlt.source @@ -151,21 +139,17 @@ def github_reactions( `name`: Refers to the name of the repository. -`access_token`: Classic access token should be utilized and is stored in the `.dlt/secrets.toml` -file. +`access_token`: A classic access token should be utilized and is stored in the `.dlt/secrets.toml` file. `items_per_page`: The number of issues/pull requests to retrieve in a single page. Defaults to 100. -`max_items`: The maximum number of issues/pull requests to retrieve in total. If set to None, it -means all items will be retrieved. Defaults to None. +`max_items`: The maximum number of issues/pull requests to retrieve in total. If set to None, it means all items will be retrieved. Defaults to None. 
-`max_item_age_seconds`: The feature to restrict retrieval of items older than a specific duration is -yet to be implemented. Defaults to None. +`max_item_age_seconds`: The feature to restrict retrieval of items older than a specific duration is yet to be implemented. Defaults to None. ### Resource `_get_reactions_data` ("issues") -The `dlt.resource` function employs the `_get_reactions_data` method to retrieve data about issues, -their associated comments, and subsequent reactions. +The `dlt.resource` function employs the `_get_reactions_data` method to retrieve data about issues, their associated comments, and subsequent reactions. ```py dlt.resource( @@ -185,11 +169,9 @@ dlt.resource( ### Source `github_repo_events` -This `dlt.source` fetches repository events incrementally, dispatching them to separate tables based -on event type. It loads new events only and appends them to tables. +This `dlt.source` fetches repository events incrementally, dispatching them to separate tables based on event type. It loads new events only and appends them to tables. -> Note: Github allows retrieving up to 300 events for public repositories, so frequent updates are -> recommended for active repos. +> Note: GitHub allows retrieving up to 300 events for public repositories, so frequent updates are recommended for active repos. ```py @dlt.source(max_table_nesting=2) @@ -203,8 +185,7 @@ def github_repo_events( `name`: Denotes the name of the repository. -`access_token`: Optional classic or fine-grained access token. If not provided, calls are made -anonymously. +`access_token`: Optional classic or fine-grained access token. If not provided, calls are made anonymously. `max_table_nesting=2` sets the maximum nesting level to 2. @@ -212,8 +193,7 @@ Read more about [nesting levels](../../general-usage/source#reduce-the-nesting-l ### Resource `repo_events` -This `dlt.resource` function serves as the resource for the `github_repo_events` source. It yields -repository events as data items. +This `dlt.resource` function serves as the resource for the `github_repo_events` source. It yields repository events as data items. ```py dlt.resource(primary_key="id", table_name=lambda i: i["type"]) # type: ignore @@ -229,9 +209,7 @@ def repo_events( `table_name`: Routes data to appropriate tables based on the data type. -`last_created_at`: This parameter determines the initial value for "last_created_at" in -dlt.sources.incremental. If no value is given, the default "initial_value" is used. The function -"last_value_func" determines the most recent 'created_at' value. +`last_created_at`: This parameter determines the initial value for "last_created_at" in dlt.sources.incremental. If no value is given, the default "initial_value" is used. The function "last_value_func" determines the most recent 'created_at' value. Read more about [incremental loading](../../general-usage/incremental-loading#incremental_loading-with-last-value). @@ -239,8 +217,7 @@ Read more about [incremental loading](../../general-usage/incremental-loading#in ### Create your own pipeline -If you wish to create your own pipelines, you can leverage source and resource methods from this -verified source. +If you wish to create your own pipelines, you can leverage source and resource methods from this verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: @@ -252,18 +229,16 @@ verified source. 
) ``` - To read more about pipeline configuration, please refer to our - [documentation](../../general-usage/pipeline). + To read more about pipeline configuration, please refer to our [documentation](../../general-usage/pipeline). -1. To load all the data from repo on issues, pull requests, their comments and reactions, you can do - the following: +1. To load all the data from the repo on issues, pull requests, their comments, and reactions, you can do the following: ```py load_data = github_reactions("duckdb", "duckdb") load_info = pipeline.run(load_data) print(load_info) ``` - here, "duckdb" is the owner of the repository and the name of the repository. + Here, "duckdb" is the owner of the repository and the name of the repository. 1. To load only the first 100 issues, you can do the following: @@ -273,8 +248,7 @@ verified source. print(load_info) ``` -1. You can use fetch and process repo events data incrementally. It loads all data during the first - run and incrementally in subsequent runs. +1. You can fetch and process repo events data incrementally. It loads all data during the first run and incrementally in subsequent runs. ```py load_data = github_repo_events( @@ -287,3 +261,4 @@ verified source. It is optional to use `access_token` or make anonymous API calls. + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/google_analytics.md b/docs/website/docs/dlt-ecosystem/verified-sources/google_analytics.md index b475ae00b7..b94606a7e9 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/google_analytics.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/google_analytics.md @@ -5,7 +5,7 @@ keywords: [google analytics api, google analytics verified source, google analyt --- import Header from './_source-info-header.md'; -# Google Analytics +# Google analytics
@@ -25,7 +25,7 @@ Sources and resources that can be loaded using this verified source are: | metrics_table | Assembles and presents data relevant to the report's metrics | | dimensions_table | Compiles and displays data related to the report's dimensions | -## Setup Guide +## Setup guide ### Grab credentials @@ -103,7 +103,9 @@ python google_analytics/setup_script_gcp_oauth.py Once you have executed the script and completed the authentication, you will receive a "refresh token" that can be used to set up the "secrets.toml". -### Share the Google Analytics Property with the API: + + +### Share the Google Analytics property with the API > Note: For service account authentication, use the client_email. For OAuth authentication, use the > email associated with the app creation and refresh token generation. @@ -185,7 +187,7 @@ For more information, read the guide on [how to add a verified source](../../wal 1. `property_id` is a unique number that identifies a particular property. You will need to explicitly pass it to get data from the property that you're interested in. For example, if the - property that you want to get data from is “GA4-Google Merch Shop” then you will need to pass its + property that you want to get data from is “GA4-Google Merch Shop,” then you will need to pass its property id "213025502". ![Property ID](./docs_images/GA4_Property_ID_size.png) @@ -198,7 +200,7 @@ For more information, read the guide on [how to add a verified source](../../wal ```toml [sources.google_analytics] - property_id = "213025502" # this is example property id, please use yours + property_id = "213025502" # this is an example property id, please use yours queries = [ {"resource_name"= "sample_analytics_data1", "dimensions"= ["browser", "city"], "metrics"= ["totalUsers", "transactions"]}, {"resource_name"= "sample_analytics_data2", "dimensions"= ["browser", "city", "dateHour"], "metrics"= ["totalUsers"]} @@ -230,7 +232,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage dlt pipeline show ``` For example, the `pipeline_name` for the above pipeline example is - `dlt_google_analytics_pipeline`, you may also use any custom name instead. + `dlt_google_analytics_pipeline`, but you may also use any custom name instead. For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline). @@ -286,8 +288,7 @@ def get_metadata(client: Resource, property_id: int) -> Iterator[Metadata]: ### Transformer `metrics_table` -This transformer function extracts data using metadata and populates a table called "metrics" with -the data from each metric. +This transformer function extracts data using metadata and populates a table called "metrics" with the data from each metric. ```py @dlt.transformer(data_from=get_metadata, write_disposition="replace", name="metrics") @@ -298,14 +299,12 @@ def metrics_table(metadata: Metadata) -> Iterator[TDataItem]: `metadata`: GA4 metadata is stored in this "Metadata" class object. -Similarly, there is a transformer function called `dimensions_table` that populates a table called -"dimensions" with the data from each dimension. +Similarly, there is a transformer function called `dimensions_table` that populates a table called "dimensions" with the data from each dimension. ## Customization ### Create your own pipeline -If you wish to create your own pipelines, you can leverage source and resource methods from this -verified source. 
+If you wish to create your own pipelines, you can leverage source and resource methods from this verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: @@ -317,8 +316,7 @@ verified source. ) ``` - To read more about pipeline configuration, please refer to our - [documentation](../../general-usage/pipeline). + To read more about pipeline configuration, please refer to our [documentation](../../general-usage/pipeline). 1. To load all the data from metrics and dimensions: @@ -328,8 +326,7 @@ verified source. print(load_info) ``` - > Loads all the data till date in the first run, and then - > [incrementally](../../general-usage/incremental-loading) in subsequent runs. + > Loads all the data to date in the first run, and then [incrementally](../../general-usage/incremental-loading) in subsequent runs. 1. To load data from a specific start date: @@ -339,8 +336,7 @@ verified source. print(load_info) ``` - > Loads data starting from the specified date during the first run, and then - > [incrementally](../../general-usage/incremental-loading) in subsequent runs. + > Loads data starting from the specified date during the first run, and then [incrementally](../../general-usage/incremental-loading) in subsequent runs. diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/google_sheets.md b/docs/website/docs/dlt-ecosystem/verified-sources/google_sheets.md index 48320e0331..b672ac7a27 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/google_sheets.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/google_sheets.md @@ -5,7 +5,7 @@ keywords: [google sheets api, google sheets verified source, google sheets] --- import Header from './_source-info-header.md'; -# Google Sheets +# Google sheets
@@ -14,7 +14,7 @@ offered by Google as part of its Google Workspace suite. This Google Sheets `dlt` verified source and [pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/google_sheets_pipeline.py) -loads data using “Google Sheets API” to the destination of your choice. +loads data using the “Google Sheets API” to the destination of your choice. Sources and resources that can be loaded using this verified source are: @@ -24,14 +24,14 @@ Sources and resources that can be loaded using this verified source are: | range_names | Processes the range and yields data from each range | | spreadsheet_info | Information about the spreadsheet and the ranges processed | -## Setup Guide +## Setup guide ### Grab credentials There are two methods to get authenticated for using this verified source: - OAuth credentials -- Service account credential +- Service account credentials Here we'll discuss how to set up both OAuth tokens and service account credentials. In general, OAuth tokens are preferred when user consent is required, while service account credentials are @@ -41,7 +41,7 @@ credentials. You can choose the method of authentication as per your requirement #### Google service account credentials You need to create a GCP service account to get API credentials if you don't have one. To create - one, follow these steps: +one, follow these steps: 1. Sign in to [console.cloud.google.com](http://console.cloud.google.com/). @@ -69,17 +69,17 @@ follow these steps: 1. Enable the Sheets API in the project. -1. Search credentials in the search bar and go to Credentials. +1. Search for credentials in the search bar and go to Credentials. 1. Go to Credentials -> OAuth client ID -> Select Desktop App from the Application type and give an appropriate name. -1. Download the credentials and fill "client_id", "client_secret" and "project_id" in +1. Download the credentials and fill "client_id", "client_secret", and "project_id" in "secrets.toml". 1. Go back to credentials and select the OAuth consent screen on the left. -1. Fill in the App name, user support email(your email), authorized domain (localhost.com), and dev +1. Fill in the App name, user support email (your email), authorized domain (localhost.com), and dev contact info (your email again). 1. Add the following scope: @@ -92,7 +92,7 @@ follow these steps: 1. Generate `refresh_token`: - After configuring "client_id", "client_secret" and "project_id" in "secrets.toml". To generate + After configuring "client_id", "client_secret", and "project_id" in "secrets.toml". To generate the refresh token, run the following script from the root folder: ```sh @@ -104,6 +104,8 @@ follow these steps: ### Prepare your data + + #### Share Google Sheet with the email > Note: For service account authentication, use the client_email. For OAuth authentication, use the @@ -129,48 +131,46 @@ When setting up the pipeline, you can use either the browser-copied URL of your https://docs.google.com/spreadsheets/d/1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4/edit?usp=sharing ``` -or spreadsheet id (which is a part of the url) +or the spreadsheet ID (which is a part of the URL) ```sh 1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4 ``` -typically you pass it directly to the [google_spreadsheet function](#create-your-own-pipeline) or in [config.toml](#add-credentials) as defined here. +Typically, you pass it directly to the [google_spreadsheet function](#create-your-own-pipeline) or in [config.toml](#add-credentials) as defined here. 
- -You can provide specific ranges to `google_spreadsheet` pipeline, as detailed in following. +You can provide specific ranges to the `google_spreadsheet` pipeline, as detailed in the following. #### Guidelines about headers -Make sure your data has headers and is in the form of well-structured table. +Make sure your data has headers and is in the form of a well-structured table. The first row of any extracted range should contain headers. Please make sure: 1. The header names are strings and are unique. 1. All the columns that you intend to extract have a header. -1. The data starts exactly at the origin of the range - otherwise a source will remove padding, but it +1. The data starts exactly at the origin of the range - otherwise, a source will remove padding, but it is a waste of resources. > When a source detects any problems with headers or table layout, it will issue a WARNING in the > log. Hence, we advise running your pipeline script manually/locally and fixing all the problems. 1. Columns without headers will be removed and not extracted. 1. Columns with headers that do not contain any data will be removed. -1. If there are any problems with reading headers (i.e. header is not string or is empty or not +1. If there are any problems with reading headers (i.e., the header is not a string or is empty or not unique): the headers row will be extracted as data and automatic header names will be used. -1. Empty rows are ignored +1. Empty rows are ignored. 1. `dlt` will normalize range names and headers into table and column names - so they may be different in the database than in Google Sheets. Prefer small cap names without special characters. - #### Guidelines about named ranges -We recommend to use +We recommend using [Named Ranges](https://support.google.com/docs/answer/63175?hl=en&co=GENIE.Platform%3DDesktop) to indicate which data should be extracted from a particular spreadsheet, and this is how this source will work by default - when called without setting any other options. All the named ranges will be -converted into tables, named after them and stored in the destination. +converted into tables, named after them, and stored in the destination. -1. You can let the spreadsheet users add and remove tables by just adding/removing the ranges, +1. You can let the spreadsheet users add and remove tables by just adding or removing the ranges; you do not need to configure the pipeline again. 1. You can indicate exactly the fragments of interest, and only this data will be retrieved, so it is @@ -194,16 +194,16 @@ converted into tables, named after them and stored in the destination. If you are not happy with the workflow above, you can: -1. Disable it by setting `get_named_ranges` option to `False`. +1. Disable it by setting the `get_named_ranges` option to `False`. -1. Enable retrieving all sheets/tabs with get_sheets option set to `True`. +1. Enable retrieving all sheets/tabs with the get_sheets option set to `True`. 1. Pass a list of ranges as supported by Google Sheets in range_names. > Note: To retrieve all named ranges with "get_named_ranges" or all sheets with "get_sheets" > methods, pass an empty `range_names` list as `range_names = []`. Even when you use a set - > "get_named_ranges" to false pass the range_names as an empty list to get all the sheets with - > "get_sheets" method. + > "get_named_ranges" to false, pass the range_names as an empty list to get all the sheets with + > the "get_sheets" method. 
### Initialize the verified source @@ -260,7 +260,7 @@ For more information, read the guide on [how to add a verified source](../../wal 1. Finally, enter credentials for your chosen destination as per the [docs](../destinations/). -1. Next you need to configure ".dlt/config.toml", which looks like: +1. Next, you need to configure ".dlt/config.toml", which looks like: ```toml [sources.google_sheets] @@ -277,13 +277,13 @@ For more information, read the guide on [how to add a verified source](../../wal spreadsheet_identifier = "https://docs.google.com/spreadsheets/d/1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4/edit?usp=sharing" ``` - or spreadsheet id (which is a part of the url) + or the spreadsheet ID (which is a part of the URL) ```toml spreadsheet_identifier="1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4" ``` -> Note: You have an option to pass "range_names" and "spreadsheet_identifier" directly to the +> Note: You have the option to pass "range_names" and "spreadsheet_identifier" directly to the > google_spreadsheet function or in ".dlt/config.toml" For more information, read the [General Usage: Credentials.](../../general-usage/credentials) @@ -320,7 +320,7 @@ For more information, read the guide on [how to run a pipeline](../../walkthroug The `dlt` normalizer uses the first row of data to infer types and attempts to coerce subsequent rows, creating variant columns if unsuccessful. This is standard behavior. If `dlt` did not correctly determine the data type in the column, or you want to change the data type for other reasons, then you can provide a type hint for the affected column in the resource. -Also, since recently `dlt`'s no longer recognizing date and time types, so you have to designate it yourself as `timestamp`. +Also, since recently, `dlt` no longer recognizes date and time types, so you have to designate it yourself as `timestamp`. Use the `apply_hints` method on the resource to achieve this. Here's how you can do it: @@ -332,11 +332,11 @@ for resource in resources: "date": {"data_type": "timestamp"}, }) ``` -In this example, the `total_amount` column is enforced to be of type double and `date` is enforced to be of type timestamp. +In this example, the `total_amount` column is enforced to be of type double, and `date` is enforced to be of type timestamp. This will ensure that all values in the `total_amount` column are treated as `double`, regardless of whether they are integers or decimals in the original Google Sheets data. -And `date` column will be represented as dates, not integers. +And the `date` column will be represented as dates, not integers. -For a single resource (e.g. `Sheet1`), you can simply use: +For a single resource (e.g., `Sheet1`), you can simply use: ```py source.Sheet1.apply_hints(columns={ "total_amount": {"data_type": "double"}, @@ -387,9 +387,9 @@ def google_spreadsheet( `credentials`: GCP credentials with Google Sheets API access. -`get_sheets`: If True, imports all spreadsheet sheets into the database. +`get_sheets`: If true, imports all spreadsheet sheets into the database. -`get_named_ranges`: If True, imports either all named ranges or those +`get_named_ranges`: If true, imports either all named ranges or those [specified](google_sheets.md#guidelines-about-named-ranges) into the database. ### Resource `range_names` @@ -412,7 +412,7 @@ headers, and data types as arguments. `write_disposition`: Dictates how data is loaded to the destination. -> Please Note: +> Please note: > > 1. Empty rows are ignored. > 1. 
Empty cells are converted to None (and then to NULL by dlt). @@ -420,7 +420,7 @@ headers, and data types as arguments. ### Resource `spreadsheet_info` -This resource loads the info about the sheets and range names into the destination as a table. +This resource loads the information about the sheets and range names into the destination as a table. This table refreshes after each load, storing information on loaded ranges: - Spreadsheet ID and title. @@ -440,18 +440,18 @@ dlt.resource( `name`: Denotes the table name, set here as "spreadsheet_info". -`write_disposition`: Dictates how data is loaded to the destination. +`write_disposition`: Dictates how data is loaded into the destination. [Read more](../../general-usage/incremental-loading#the-3-write-dispositions). -`merge_key`: Parameter is used to specify the column used to identify records for merging. In this -case,"spreadsheet_id", means that the records will be merged based on the values in this column. +`merge_key`: This parameter is used to specify the column used to identify records for merging. In this +case, "spreadsheet_id" means that the records will be merged based on the values in this column. [Read more](../../general-usage/incremental-loading#merge-incremental_loading). ## Customization + ### Create your own pipeline -If you wish to create your own pipelines, you can leverage source and resource methods from this -verified source. +If you wish to create your own pipelines, you can leverage source and resource methods from this verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: @@ -467,7 +467,7 @@ verified source. ```py load_data = google_spreadsheet( - "https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL + "https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", # Spreadsheet URL range_names=["range_name1", "range_name2"], # Range names get_sheets=False, get_named_ranges=False, @@ -476,14 +476,13 @@ verified source. print(load_info) ``` - > Note: You can pass the URL or spreadsheet ID and range names explicitly or in - > ".dlt/config.toml". + > Note: You can pass the URL or spreadsheet ID and range names explicitly or in ".dlt/config.toml". -1. To load all the range_names from spreadsheet: +1. To load all the range_names from the spreadsheet: ```py load_data = google_spreadsheet( - "https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL + "https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", # Spreadsheet URL get_sheets=False, get_named_ranges=True, ) @@ -493,11 +492,11 @@ verified source. > Pass an empty list to range_names in ".dlt/config.toml" to retrieve all range names. -1. To load all the sheets from spreadsheet: +1. To load all the sheets from the spreadsheet: ```py load_data = google_spreadsheet( - "https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL + "https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", # Spreadsheet URL get_sheets=True, get_named_ranges=False, ) @@ -511,7 +510,7 @@ verified source. 
```py load_data = google_spreadsheet( - "https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL + "https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", # Spreadsheet URL get_sheets=True, get_named_ranges=True, ) @@ -525,17 +524,17 @@ verified source. ```py load_data1 = google_spreadsheet( - "https://docs.google.com/spreadsheets/d/43lkHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL + "https://docs.google.com/spreadsheets/d/43lkHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", # Spreadsheet URL range_names=["Sheet 1!A1:B10"], get_named_ranges=False, ) load_data2 = google_spreadsheet( - "https://docs.google.com/spreadsheets/d/3jo4HjqouQnnCIZAFa2rL6vT91YRN8aIhts22SKKO390/edit#gid=0", #Spreadsheet URL + "https://docs.google.com/spreadsheets/d/3jo4HjqouQnnCIZAFa2rL6vT91YRN8aIhts22SKKO390/edit#gid=0", # Spreadsheet URL range_names=["Sheet 1!B1:C10"], get_named_ranges=True, ) - load_info = pipeline.run([load_data1,load_data2]) + load_info = pipeline.run([load_data1, load_data2]) print(load_info) ``` @@ -543,7 +542,7 @@ verified source. ```py load_data = google_spreadsheet( - "https://docs.google.com/spreadsheets/d/43lkHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL + "https://docs.google.com/spreadsheets/d/43lkHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", # Spreadsheet URL range_names=["Sheet 1!A1:B10"], get_named_ranges=False, ) @@ -554,29 +553,29 @@ verified source. print(load_info) ``` -### Using Airflow with Google Spreadsheets: +### Using Airflow with Google spreadsheets -Consider the following when using Google Spreadsheets with Airflow: +Consider the following when using Google spreadsheets with Airflow: -`Efficient Data Retrieval` +`Efficient data retrieval` - Our source fetches all required data with just two API calls, regardless of the number of specified data ranges. This allows for swift data loading from google_spreadsheet before executing the pipeline. -`Airflow Specificity` +`Airflow specificity` - With Airflow, data source creation and execution are distinct processes. - If your execution environment (runner) is on a different machine, this might cause the data to be loaded twice, leading to inefficiencies. -`Airflow Helper Caution` +`Airflow helper caution` - Avoid using `scc decomposition` because it unnecessarily creates a new source instance for every specified data range. This is not efficient and can cause redundant tasks. -#### Recommended Airflow Deployment +#### Recommended Airflow deployment -Below is the correct way to set up an Airflow DAG for this purpose: +Below is the correct way to set up an Airflow DAG for this purpose: -- Define a DAG to run daily, starting from say February 1, 2023. It avoids catching up for missed runs and ensures only one instance runs at a time. +- Define a DAG to run daily, starting from February 1, 2023. It avoids catching up for missed runs and ensures only one instance runs at a time. -- Data is imported from Google Spreadsheets and directed BigQuery. +- Data is imported from Google spreadsheets and directed to BigQuery. - When adding the Google Spreadsheet task to the pipeline, avoid decomposing it; run it as a single task for efficiency. 
@@ -591,7 +590,7 @@ Below is the correct way to set up an Airflow DAG for this purpose: def get_named_ranges(): tasks = PipelineTasksGroup("get_named_ranges", use_data_folder=False, wipe_local_data=True) - # import your source from pipeline script + # Import your source from pipeline script from google_sheets import google_spreadsheet pipeline = dlt.pipeline( @@ -600,8 +599,9 @@ def get_named_ranges(): destination='bigquery', ) - # do not use decompose to run `google_spreadsheet` in single task + # Do not use decompose to run `google_spreadsheet` in single task tasks.add_run(pipeline, google_spreadsheet("1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580"), decompose="none", trigger_rule="all_done", retries=0, provide_context=True) ``` + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md b/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md index aac77b9b0a..3d7b577c0f 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md @@ -24,29 +24,29 @@ Sources and resources that can be loaded using this verified source are: | get_messages | resource-transformer | Retrieves emails from the mailbox using given UIDs | | get_attachments | resource-transformer | Downloads attachments from emails using given UIDs | -## Setup Guide +## Setup guide ### Grab credentials 1. For verified source configuration, you need: - "host": IMAP server hostname (e.g., Gmail: imap.gmail.com, Outlook: imap-mail.outlook.com). - - "email_account": Associated email account name (e.g. dlthub@dlthub.com). + - "email_account": Associated email account name (e.g., dlthub@dlthub.com). - "password": APP password (for third-party clients) from the email provider. 2. Host addresses and APP password procedures vary by provider and can be found via a quick Google search. For Google Mail's app password, read [here](https://support.google.com/mail/answer/185833?hl=en#:~:text=An%20app%20password%20is%20a,2%2DStep%20Verification%20turned%20on). 3. However, this guide covers Gmail inbox configuration; similar steps apply to other providers. -### Accessing Gmail Inbox +### Accessing Gmail inbox 1. SMTP server DNS: 'imap.gmail.com' for Gmail. 2. Port: 993 (for internet messaging access protocol over TLS/SSL). -### Grab App password for Gmail +### Grab app password for Gmail 1. An app password is a 16-digit code allowing less secure apps/devices to access your Google Account, available only with 2-Step Verification activated. -#### Steps to Create and Use App Passwords: +#### Steps to create and use app passwords: 1. Visit your Google Account > Security. 2. Under "How you sign in to Google", enable 2-Step Verification. @@ -84,31 +84,25 @@ For more information, read the ### Add credential -1. Inside the `.dlt` folder, you'll find a file called `secrets.toml`, which is where you can - securely store your access tokens and other sensitive information. It's important to handle this - file with care and keep it safe. Here's what the file looks like: +1. Inside the `.dlt` folder, you'll find a file called `secrets.toml`, which is where you can securely store your access tokens and other sensitive information. It's important to handle this file with care and keep it safe. Here's what the file looks like: ```toml # put your secret values and credentials here - # do not share this file and do not push it to github + # do not share this file and do not push it to GitHub [sources.inbox] host = "Please set me up!" 
# The host address of the email service provider. email_account = "Please set me up!" # Email account associated with the service. - password = "Please set me up!" # # APP Password for the above email account. + password = "Please set me up!" # APP Password for the above email account. ``` -2. Replace the host, email, and password value with the [previously copied one](#grab-credentials) - to ensure secure access to your Inbox resources. +2. Replace the host, email, and password value with the [previously copied one](#grab-credentials) to ensure secure access to your Inbox resources. > When adding the App Password, remove any spaces. For instance, "abcd efgh ijkl mnop" should be "abcdefghijklmnop". -3. Next, follow the [destination documentation](../../dlt-ecosystem/destinations) instructions to - add credentials for your chosen destination, ensuring proper routing of your data to the final - destination. +3. Next, follow the [destination documentation](../../dlt-ecosystem/destinations) instructions to add credentials for your chosen destination, ensuring proper routing of your data to the final destination. ## Run the pipeline -1. Before running the pipeline, ensure that you have installed all the necessary dependencies by - running the command: +1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: ```sh pip install -r requirements.txt ``` @@ -123,20 +117,17 @@ For more information, read the For pdf parsing: - PyPDF2: `pip install PyPDF2` -2. Once the pipeline has finished running, you can verify that everything loaded correctly by using - the following command: +2. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: ```sh dlt pipeline show ``` - For example, the `pipeline_name` for the above pipeline example is `standard_inbox`, you may also - use any custom name instead. + For example, the `pipeline_name` for the above pipeline example is `standard_inbox`, you may also use any custom name instead. For more information, read the [Walkthrough: Run a pipeline.](../../walkthroughs/run-a-pipeline) ## Sources and resources -`dlt` works on the principle of [sources](../../general-usage/source) and -[resources](../../general-usage/resource). +`dlt` works on the principle of [sources](../../general-usage/source) and [resources](../../general-usage/resource). ### Source `inbox_source` @@ -170,7 +161,7 @@ def inbox_source( `start_date`: Start date to collect emails. Default: `/inbox/settings.py` 'DEFAULT_START_DATE'. -`filter_emails`:Email addresses for 'FROM' filtering. Default: `/inbox/settings.py` 'FILTER_EMAILS'. +`filter_emails`: Email addresses for 'FROM' filtering. Default: `/inbox/settings.py` 'FILTER_EMAILS'. `filter_by_mime_type`: MIME types for attachment filtering. Default: None. @@ -207,7 +198,7 @@ def get_messages( `items`: An iterable containing dictionaries with 'message_uid' representing the email message UIDs. -`include_body`: Includes email body if True. Default: True. +`include_body`: Includes the email body if True. Default: True. ### Resource `get_attachments_by_uid` @@ -261,7 +252,7 @@ verified source. # Print the loading details. print(load_info) ``` - > Please refer to inbox_source() docstring for email filtering options by sender, date, or mime type. + > Please refer to the inbox_source() docstring for email filtering options by sender, date, or mime type. 3. 
To load messages from multiple emails, including "community@dlthub.com": ```py @@ -271,7 +262,7 @@ verified source. ``` 4. In `inbox_pipeline.py`, the `pdf_to_text` transformer extracts text from PDFs, treating each page as a separate data item. - Using the `pdf_to_text` function to load parsed pdfs from mail to the database: + Using the `pdf_to_text` function to load parsed PDFs from mail to the database: ```py filter_emails = ["mycreditcard@bank.com", "community@dlthub.com."] # Email senders diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/jira.md b/docs/website/docs/dlt-ecosystem/verified-sources/jira.md index d83d10f834..a5bdcee64e 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/jira.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/jira.md @@ -21,14 +21,14 @@ The endpoints that this verified source supports are: | Name | Description | | --------- | ---------------------------------------------------------------------------------------- | | issues | Individual pieces of work to be completed | -| users | Administrators of a given project | +| users | Administrators of a given project | | workflows | The key aspect of managing and tracking the progress of issues or tasks within a project | | projects | A collection of tasks that need to be completed to achieve a certain outcome | To get a complete list of sub-endpoints that can be loaded, see [jira/settings.py.](https://github.com/dlt-hub/verified-sources/blob/master/sources/jira/settings.py) -## Setup Guide +## Setup guide ### Grab credentials @@ -73,9 +73,7 @@ For more information, read the guide on [how to add a verified source](../../wal ### Add credentials -1. Inside the `.dlt` folder, you'll find a file called `secrets.toml`, where you can securely store - your access tokens and other sensitive information. It's important to handle this file with care - and keep it safe. +1. Inside the `.dlt` folder, you'll find a file called `secrets.toml`, where you can securely store your access tokens and other sensitive information. It's important to handle this file with care and keep it safe. Here's what the file looks like: @@ -87,24 +85,19 @@ For more information, read the guide on [how to add a verified source](../../wal api_token = "set me up!" # please set me up! ``` -1. A subdomain in a URL identifies your Jira account. For example, in - "https://example.atlassian.net", "example" is the subdomain. +1. A subdomain in a URL identifies your Jira account. For example, in "https://example.atlassian.net", "example" is the subdomain. 1. Use the email address associated with your Jira account. -1. Replace the "access_token" value with the [previously copied one](jira.md#grab-credentials) to - ensure secure access to your Jira account. +1. Replace the "api_token" value with the [previously copied one](jira.md#grab-credentials) to ensure secure access to your Jira account. -1. Next, follow the [destination documentation](../../dlt-ecosystem/destinations) instructions to - add credentials for your chosen destination, ensuring proper routing of your data to the final - destination. +1. Next, follow the [destination documentation](../../dlt-ecosystem/destinations) instructions to add credentials for your chosen destination, ensuring proper routing of your data to the final destination. For more information, read [General Usage: Credentials.](../../general-usage/credentials) ## Run the pipeline -1. 
Before running the pipeline, ensure that you have installed all the necessary dependencies by - running the command: +1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: ```sh pip install -r requirements.txt ``` @@ -112,26 +105,21 @@ For more information, read [General Usage: Credentials.](../../general-usage/cre ```sh python jira_pipeline.py ``` -1. Once the pipeline has finished running, you can verify that everything loaded correctly by using - the following command: +1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: ```sh dlt pipeline show ``` - For example, the `pipeline_name` for the above pipeline example is `jira_pipeline`. You may also - use any custom name instead. + For example, the `pipeline_name` for the above pipeline example is `jira_pipeline`. You may also use any custom name instead. For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline). ## Sources and resources -`dlt` works on the principle of [sources](../../general-usage/source) and -[resources](../../general-usage/resource). +`dlt` works on the principle of [sources](../../general-usage/source) and [resources](../../general-usage/resource). ### Default endpoints -You can write your own pipelines to load data to a destination using this verified source. However, -it is important to note the complete list of the default endpoints given in -[jira/settings.py.](https://github.com/dlt-hub/verified-sources/blob/master/sources/jira/settings.py) +You can write your own pipelines to load data to a destination using this verified source. However, it is important to note the complete list of the default endpoints given in [jira/settings.py.](https://github.com/dlt-hub/verified-sources/blob/master/sources/jira/settings.py) ### Source `jira` @@ -153,8 +141,7 @@ def jira( ### Source `jira_search` -This function returns a resource for querying issues using JQL -[(Jira Query Language)](https://support.atlassian.com/jira-service-management-cloud/docs/use-advanced-search-with-jira-query-language-jql/). +This function returns a resource for querying issues using JQL [(Jira Query Language)](https://support.atlassian.com/jira-service-management-cloud/docs/use-advanced-search-with-jira-query-language-jql/). ```py @dlt.source @@ -166,8 +153,7 @@ def jira_search( ... ``` -The above function uses the same arguments `subdomain`, `email`, and `api_token` as described above -for the [jira source](jira.md#source-jira). +The above function uses the same arguments `subdomain`, `email`, and `api_token` as described above for the [jira source](jira.md#source-jira). ### Resource `issues` @@ -183,14 +169,12 @@ def issues(jql_queries: List[str]) -> Iterable[TDataItem]: `jql_queries`: Accepts a list of JQL queries. ## Customization + ### Create your own pipeline -If you wish to create your own pipelines, you can leverage source and resource methods as discussed -above. +If you wish to create your own pipelines, you can leverage source and resource methods as discussed above. -1. Configure the pipeline by specifying the pipeline name, destination, and dataset. To read more - about pipeline configuration, please refer to our documentation - [here](../../general-usage/pipeline): +1. Configure the pipeline by specifying the pipeline name, destination, and dataset. 
To read more about pipeline configuration, please refer to our documentation [here](../../general-usage/pipeline): ```py pipeline = dlt.pipeline( @@ -200,16 +184,15 @@ above. ) ``` -2. To load custom endpoints such as “issues” and “users” using the jira source function: +2. To load custom endpoints such as "issues" and "users" using the jira source function: ```py - #Run the pipeline - load_info = pipeline.run(jira().with_resources("issues","users")) + # Run the pipeline + load_info = pipeline.run(jira().with_resources("issues", "users")) print(f"Load Information: {load_info}") ``` -3. To load the custom issues using JQL queries, you can use custom queries. Here is an example - below: +3. To load custom issues using JQL queries, you can use custom queries. Here is an example below: ```py # Define the JQL queries as follows diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/kafka.md b/docs/website/docs/dlt-ecosystem/verified-sources/kafka.md index fe3c426819..a402e2c5f0 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/kafka.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/kafka.md @@ -20,7 +20,7 @@ The resource that can be loaded: | ----------------- |--------------------------------------------| | kafka_consumer | Extracts messages from Kafka topics | -## Setup Guide +## Setup guide ### Grab Kafka cluster credentials @@ -96,7 +96,7 @@ sasl_password="example_secret" For more information, read the [Walkthrough: Run a pipeline](../../walkthroughs/run-a-pipeline). -:::info If you created a topic and start reading from it immedately, the brokers may be not yet synchronized and offset from which `dlt` reads messages may become invalid. In this case the resource will return no messages. Pending messages will be received on next run (or when brokers synchronize) +:::info If you created a topic and start reading from it immediately, the brokers may not yet be synchronized, and the offset from which `dlt` reads messages may become invalid. In this case, the resource will return no messages. Pending messages will be received on the next run (or when brokers synchronize). ## Sources and resources @@ -126,7 +126,7 @@ def kafka_consumer( the `secrets.toml`. It may be used explicitly to pass an initialized Kafka Consumer object. -`msg_processor`: A function, which will be used to process every message +`msg_processor`: A function that will be used to process every message read from the given topics before saving them in the destination. It can be used explicitly to pass a custom processor. See the [default processor](https://github.com/dlt-hub/verified-sources/blob/fe8ed7abd965d9a0ca76d100551e7b64a0b95744/sources/kafka/helpers.py#L14-L50) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/matomo.md b/docs/website/docs/dlt-ecosystem/verified-sources/matomo.md index ee866ab086..e1230ad8b6 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/matomo.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/matomo.md @@ -13,16 +13,16 @@ Matomo is a free and open-source web analytics platform that provides detailed i This Matomo `dlt` verified source and [pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/matomo_pipeline.py) -loads data using “Matomo API” to the destination of your choice. +loads data using the “Matomo API” to the destination of your choice. 
The endpoints that this verified source supports are: | Name | Description | | ----------------- |---------------------------------------------------------------------------------| -| matomo_reports | Detailed analytics summaries of website traffic, visitor behavior, and more | -| matomo_visits | Individual user sessions on your website, pages viewed, visit duration and more | +| matomo_reports | Detailed analytics summaries of website traffic, visitor behavior, and more | +| matomo_visits | Individual user sessions on your website, pages viewed, visit duration, and more | -## Setup Guide +## Setup guide ### Grab credentials @@ -35,8 +35,8 @@ The endpoints that this verified source supports are: 1. Click "Create New Token." 1. Your token is displayed. 1. Copy the access token and update it in the `.dlt/secrets.toml` file. -1. Your Matomo URL is the web address in your browser when logged into Matomo, typically "https://mycompany.matomo.cloud/". Update it in the `.dlt/config.toml`. -1. The site_id is a unique ID for each monitored site in Matomo, found in the URL or via Administration > Measureables > Manage under ID. +1. Your Matomo URL is the web address in your browser when logged into Matomo, typically "https://mycompany.matomo.cloud/". Update it in the `.dlt/config.toml`. +1. The site_id is a unique ID for each monitored site in Matomo, found in the URL or via Administration > Measurables > Manage under ID. > Note: The Matomo UI, which is described here, might change. The full guide is available at [this link.](https://developer.matomo.org/guides/authentication-in-depth) @@ -66,23 +66,18 @@ For more information, read the guide on [how to add a verified source](../../wal ### Add credential -1. Inside the `.dlt` folder, you'll find a file called `secrets.toml`, which is where you can - securely store your access tokens and other sensitive information. It's important to handle this - file with care and keep it safe. Here's what the file looks like: +1. Inside the `.dlt` folder, you'll find a file called `secrets.toml`, which is where you can securely store your access tokens and other sensitive information. It's important to handle this file with care and keep it safe. Here's what the file looks like: ```toml # put your secret values and credentials here - # do not share this file and do not push it to github + # do not share this file and do not push it to GitHub [sources.matomo] api_token= "access_token" # please set me up!" ``` -1. Replace the api_token value with the [previously copied one](matomo.md#grab-credentials) - to ensure secure access to your Matomo resources. +1. Replace the api_token value with the [previously copied one](matomo.md#grab-credentials) to ensure secure access to your Matomo resources. -1. Next, follow the [destination documentation](../../dlt-ecosystem/destinations) instructions to - add credentials for your chosen destination, ensuring proper routing of your data to the final - destination. +1. Next, follow the [destination documentation](../../dlt-ecosystem/destinations) instructions to add credentials for your chosen destination, ensuring proper routing of your data to the final destination. 1. Next, store your pipeline configuration details in the `.dlt/config.toml`. @@ -95,16 +90,15 @@ For more information, read the guide on [how to add a verified source](../../wal site_id = 0 # please set me up! live_events_site_id = 0 # please set me up! ``` -1. 
Replace the value of `url` and `site_id` with the one that [you copied above](matomo.md#grab-url-and-site_id). +1. Replace the value of `url` and `site_id` with the one that [you copied above](matomo.md#grab-url-and-site_id). -1. To monitor live events on a website, enter the `live_event_site_id` (usually it is same as `site_id`). +1. To monitor live events on a website, enter the `live_event_site_id` (usually it is the same as `site_id`). For more information, read the [General Usage: Credentials.](../../general-usage/credentials) ## Run the pipeline -1. Before running the pipeline, ensure that you have installed all the necessary dependencies by - running the command: +1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: ```sh pip install -r requirements.txt ``` @@ -112,20 +106,17 @@ For more information, read the [General Usage: Credentials.](../../general-usage ```sh python matomo_pipeline.py ``` -1. Once the pipeline has finished running, you can verify that everything loaded correctly by using - the following command: +1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: ```sh dlt pipeline show ``` - For example, the `pipeline_name` for the above pipeline example is `matomo`, you may also - use any custom name instead. + For example, the `pipeline_name` for the above pipeline example is `matomo`, you may also use any custom name instead. For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline). ## Sources and resources -`dlt` works on the principle of [sources](../../general-usage/source) and -[resources](../../general-usage/resource). +`dlt` works on the principle of [sources](../../general-usage/source) and [resources](../../general-usage/resource). ### Source `matomo_reports` @@ -144,17 +135,17 @@ def matomo_reports( `api_token`: API access token for Matomo server authentication, defaults to "./dlt/secrets.toml" -`url` : Matomo server URL, defaults to "./dlt/config.toml" +`url`: Matomo server URL, defaults to "./dlt/config.toml" `queries`: List of dictionaries containing info on what data to retrieve from Matomo API. `site_id`: Website's Site ID as per Matomo account. ->Note: This is an [incremental](../../general-usage/incremental-loading) source method and loads the "last_date" from the state of last pipeline run. +>Note: This is an [incremental](../../general-usage/incremental-loading) source method and loads the "last_date" from the state of the last pipeline run. -### Source `matomo_visits`: +### Source `matomo_visits` -The function loads visits from current day and the past `initial_load_past_days` in first run. In subsequent runs it continues from last load and skips active visits until closed. +The function loads visits from the current day and the past `initial_load_past_days` on the first run. In subsequent runs, it continues from the last load and skips active visits until they are closed. ```py def matomo_visits( @@ -183,7 +174,7 @@ def matomo_visits( `get_live_event_visitors`: Retrieve unique visitor data, defaulting to False. ->Note: This is an [incremental](../../general-usage/incremental-loading) source method and loads the "last_date" from the state of last pipeline run. +>Note: This is an [incremental](../../general-usage/incremental-loading) source method and loads the "last_date" from the state of the last pipeline run. 
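+
+For example, a minimal sketch of running this source could look like the following (the import path is assumed to match the example `matomo_pipeline.py` script, and `duckdb` is used purely as an example destination; `api_token` and `url` are read from `.dlt/secrets.toml` and `.dlt/config.toml`):
+
+```py
+import dlt
+from matomo import matomo_visits  # assumed import, as in the example pipeline script
+
+pipeline = dlt.pipeline(
+    pipeline_name="matomo_visits_demo",
+    destination="duckdb",
+    dataset_name="matomo_data",
+)
+
+# First run: backfills visits from the current day and the past `initial_load_past_days`.
+print(pipeline.run(matomo_visits(initial_load_past_days=7)))
+
+# Subsequent runs reuse the stored "last_date" state and load only new, closed visits.
+print(pipeline.run(matomo_visits(initial_load_past_days=7)))
+```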
### Resource `get_last_visits` @@ -206,7 +197,7 @@ def get_last_visits( `site_id`: Unique ID for each Matomo site. -`last_date`: Last resource load date, if exists. +`last_date`: Last resource load date, if it exists. `visit_timeout_seconds`: Time (in seconds) until a session is inactive and deemed closed. Default: 1800. @@ -214,12 +205,11 @@ def get_last_visits( `rows_per_page`: Number of rows on each page. ->Note: This is an [incremental](../../general-usage/incremental-loading) resource method and loads the "last_date" from the state of last pipeline run. - +>Note: This is an [incremental](../../general-usage/incremental-loading) resource method and loads the "last_date" from the state of the last pipeline run. ### Transformer `visitors` -This function, retrieves unique visit information from get_last_visits. +This function retrieves unique visit information from get_last_visits. ```py @dlt.transformer( @@ -244,8 +234,7 @@ def get_unique_visitors( ### Create your own pipeline -If you wish to create your own pipelines, you can leverage source and resource methods from this -verified source. +If you wish to create your own pipelines, you can leverage source and resource methods from this verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: @@ -257,8 +246,7 @@ verified source. ) ``` - To read more about pipeline configuration, please refer to our - [documentation](../../general-usage/pipeline). + To read more about pipeline configuration, please refer to our [documentation](../../general-usage/pipeline). 1. To load the data from reports. @@ -267,7 +255,7 @@ verified source. load_info = pipeline_reports.run(data_reports) print(load_info) ``` - >"site_id" defined in ".dlt/config.toml" + > "site_id" defined in ".dlt/config.toml" 1. To load custom data from reports using queries. @@ -278,17 +266,17 @@ verified source. "methods": ["CustomReports.getCustomReport"], "date": "2023-01-01", "period": "day", - "extra_params": {"idCustomReport": 1}, #id of the report + "extra_params": {"idCustomReport": 1}, # ID of the report }, ] - site_id = 1 #id of the site for which reports are being loaded + site_id = 1 # ID of the site for which reports are being loaded load_data = matomo_reports(queries=queries, site_id=site_id) load_info = pipeline_reports.run(load_data) print(load_info) ``` - >You can pass queries and site_id in the ".dlt/config.toml" as well. + > You can pass queries and site_id in the ".dlt/config.toml" as well. 1. To load data from reports and visits. @@ -308,3 +296,4 @@ verified source. ``` + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/notion.md b/docs/website/docs/dlt-ecosystem/verified-sources/notion.md index 69e66ed2aa..061b2c565b 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/notion.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/notion.md @@ -14,7 +14,7 @@ professional tasks, offering customizable notes, documents, databases, and more. This Notion `dlt` verified source and [pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/notion_pipeline.py) -loads data using “Notion API” to the destination of your choice. +loads data using the “Notion API” to the destination of your choice. Sources that can be loaded using this verified source are: @@ -22,7 +22,7 @@ Sources that can be loaded using this verified source are: |------------------|---------------------------------------| | notion_databases | Retrieves data from Notion databases. 
| -## Setup Guide +## Setup guide ### Grab credentials @@ -90,7 +90,7 @@ For more information, read the guide on [how to add a verified source.](../../wa your chosen destination. This will ensure that your data is properly routed to its final destination. -For more information, read the [General Usage: Credentials.](../../general-usage/credentials) +For more information, read the [General usage: Credentials.](../../general-usage/credentials) ## Run the pipeline @@ -99,11 +99,11 @@ For more information, read the [General Usage: Credentials.](../../general-usage ```sh pip install -r requirements.txt ``` -1. You're now ready to run the pipeline! To get started, run the following command: +2. You're now ready to run the pipeline! To get started, run the following command: ```sh python notion_pipeline.py ``` -1. Once the pipeline has finished running, you can verify that everything loaded correctly by using +3. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: ```sh dlt pipeline show @@ -120,7 +120,7 @@ For more information, read the guide on [how to run a pipeline](../../walkthroug ### Source `notion_databases` -This function loads notion databases from notion into the destination. +This function loads notion databases from Notion into the destination. ```py @dlt.source @@ -131,7 +131,7 @@ def notion_databases( ... ``` -`database_ids`: A list of dictionaries each containing a database id and a name. +`database_ids`: A list of dictionaries, each containing a database ID and a name. `api_key`: The Notion API secret key. @@ -161,7 +161,7 @@ verified source. To read more about pipeline configuration, please refer to our [documentation](../../general-usage/pipeline). -1. To load all the integrated databases: +2. To load all the integrated databases: ```py load_data = notion_databases() @@ -169,7 +169,7 @@ verified source. print(load_info) ``` -1. To load the custom databases: +3. To load the custom databases: ```py selected_database_ids = [{"id": "0517dae9409845cba7d","use_name":"db_one"}, {"id": "d8ee2d159ac34cfc"}] @@ -178,7 +178,7 @@ verified source. print(load_info) ``` - The Database ID can be retrieved from the URL. For example if the URL is: + The Database ID can be retrieved from the URL. For example, if the URL is: ```sh https://www.notion.so/d8ee2d159ac34cfc85827ba5a0a8ae71?v=c714dec3742440cc91a8c38914f83b6b @@ -193,3 +193,4 @@ The database name ("use_name") is optional; if skipped, the pipeline will fetch automatically. + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/openapi-generator.md b/docs/website/docs/dlt-ecosystem/verified-sources/openapi-generator.md index 8c6533b246..a00a59a055 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/openapi-generator.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/openapi-generator.md @@ -66,7 +66,7 @@ We will create a simple example pipeline from a [PokeAPI spec](https://pokeapi.c dlt pipeline pokemon_pipeline show ``` -9. You can go to our docs at https://dlthub.com/docs to learn how to modify the generated pipeline to load to many destinations, place schema contracts on your pipeline, and many other things. +9. You can go to our docs at https://dlthub.com/docs to learn how to modify the generated pipeline to load to many destinations, place schema contracts on your pipeline, and many other things. 
:::note We used the `--global-limit 2` CLI flag to limit the requests to the PokeAPI @@ -94,12 +94,12 @@ pokemon_pipeline/ ``` :::warning -If you re-generate your pipeline, you will be prompted to continue if this folder exists. If you select yes, all generated files will be overwritten. All other files you may have created will remain in this folder. In non-interactive mode you will not be asked, and the generated files will be overwritten. +If you re-generate your pipeline, you will be prompted to continue if this folder exists. If you select yes, all generated files will be overwritten. All other files you may have created will remain in this folder. In non-interactive mode, you will not be asked, and the generated files will be overwritten. ::: ## A closer look at your `rest_api` dictionary in `pokemon/__init__.py` -This file contains the [configuration dictionary](./rest_api#source-configuration) for the rest_api source which is the main result of running this generator. For our Pokemon example, we have used an OpenAPI 3 spec that works out of the box. The result of this dictionary depends on the quality of the spec you are using, whether the API you are querying actually adheres to this spec, and whether our heuristics manage to find the right values. +This file contains the [configuration dictionary](./rest_api#source-configuration) for the rest_api source, which is the main result of running this generator. For our Pokemon example, we have used an OpenAPI 3 spec that works out of the box. The result of this dictionary depends on the quality of the spec you are using, whether the API you are querying actually adheres to this spec, and whether our heuristics manage to find the right values. The generated dictionary will look something like this: @@ -168,7 +168,7 @@ dlt-init-openapi pokemon --path ./path/to/my_spec.yml --no-interactive --output- **Options**: -_The only required options are either to supply a path or a URL to a spec_ +_The only required options are either to supply a path or a URL to a spec._ - `--url URL`: A URL to read the OpenAPI JSON or YAML file from. - `--path PATH`: A path to read the OpenAPI JSON or YAML file from locally. @@ -178,14 +178,14 @@ _The only required options are either to supply a path or a URL to a spec_ - `--log-level`: Set the logging level for stdout output, defaults to 20 (INFO). - `--global-limit`: Set a global limit on the generated source. - `--update-rest-api-source`: Update the locally cached rest_api verified source. -- `--allow-openapi-2`: Allows the use of OpenAPI v2. specs. Migration of the spec to 3.0 is recommended for better results though. +- `--allow-openapi-2`: Allows the use of OpenAPI v2 specs. Migration of the spec to 3.0 is recommended for better results, though. - `--version`: Show the installed version of the generator and exit. - `--help`: Show this message and exit. ## Config options You can pass a path to a config file with the `--config PATH` argument. To see available config values, go to https://github.com/dlt-hub/dlt-init-openapi/blob/devel/dlt_init_openapi/config.py and read the information below each field on the `Config` class. -The config file can be supplied as JSON or YAML dictionary. For example, to change the package name, you can create a YAML file: +The config file can be supplied as a JSON or YAML dictionary. For example, to change the package name, you can create a YAML file: ```yaml # config.yml @@ -199,7 +199,7 @@ $ dlt-init-openapi pokemon --url ... 
--config config.yml ``` ## Telemetry -We track your usage of this tool similar to how we track other commands in the dlt core library. Read more about this and how to disable it [here](../../reference/telemetry). +We track your usage of this tool similarly to how we track other commands in the dlt core library. Read more about this and how to disable it [here](../../reference/telemetry). ## Prior work This project started as a fork of [openapi-python-client](https://github.com/openapi-generators/openapi-python-client). Pretty much all parts are heavily changed or completely replaced, but some lines of code still exist, and we like to acknowledge the many good ideas we got from the original project :) @@ -207,4 +207,5 @@ This project started as a fork of [openapi-python-client](https://github.com/ope ## Implementation notes * OAuth Authentication currently is not natively supported. You can supply your own. * Per endpoint authentication currently is not supported by the generator. Only the first globally set securityScheme will be applied. You can add your own per endpoint if you need to. -* Basic OpenAPI 2.0 support is implemented. We recommend updating your specs at https://editor.swagger.io before using `dlt-init-openapi`. \ No newline at end of file +* Basic OpenAPI 2.0 support is implemented. We recommend updating your specs at https://editor.swagger.io before using `dlt-init-openapi`. + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/personio.md b/docs/website/docs/dlt-ecosystem/verified-sources/personio.md index 9829c94786..58816040ba 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/personio.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/personio.md @@ -12,7 +12,7 @@ import Header from './_source-info-header.md'; Personio is a human resources management software that helps businesses streamline HR processes, including recruitment, employee data management, and payroll, in one platform. -Our [Personio verified](https://github.com/dlt-hub/verified-sources/blob/master/sources/personio) source loads data using Perosnio API to your preferred +Our [Personio verified](https://github.com/dlt-hub/verified-sources/blob/master/sources/personio) source loads data using the Personio API to your preferred [destination](../destinations). 
:::tip @@ -23,17 +23,17 @@ Resources that can be loaded using this verified source are: | Name | Description | Endpoint | |----------------------------|-----------------------------------------------------------------------------------|---------------------------------------------------| -| employees | Retrieves company employees details | /company/employees | +| employees | Retrieves company employees' details | /company/employees | | absences | Retrieves absence periods for absences tracked in days | /company/time-offs | -| absences_types | Retrieves list of various types of employee absences | /company/time-off-types | +| absences_types | Retrieves a list of various types of employee absences | /company/time-off-types | | attendances | Retrieves attendance records for each employee | /company/attendances | | projects | Retrieves a list of all company projects | /company/attendances/projects | | document_categories | Retrieves all document categories of the company | /company/document-categories | -| employees_absences_balance | The transformer, retrieves the absence balance for a specific employee | /company/employees/{employee_id}/absences/balance | +| employees_absences_balance | The transformer retrieves the absence balance for a specific employee | /company/employees/{employee_id}/absences/balance | | custom_reports_list | Retrieves metadata about existing custom reports (name, report type, report date) | /company/custom-reports/reports | | custom_reports | The transformer for custom reports | /company/custom-reports/reports/{report_id} | -## Setup Guide +## Setup guide ### Grab credentials @@ -42,7 +42,7 @@ To load data from Personio, you need to obtain API credentials, `client_id` and 1. Sign in to your Personio account, and ensure that your user account has API access rights. 1. Navigate to Settings > Integrations > API credentials. 1. Click on "Generate new credentials." -1. Assign necessary permissions to credentials, i.e. read access. +1. Assign necessary permissions to credentials, i.e., read access. :::info The Personio UI, which is described here, might change. The full guide is available at this [link.](https://developer.personio.de/docs#21-employee-attendance-and-absence-endpoints) @@ -173,13 +173,12 @@ def employees( `items_per_page`: Maximum number of items per page, defaults to 200. -`allow_external_schedulers`: A boolean that, if True, permits [external schedulers](../../general-usage/incremental-loading#using-airflow-schedule-for-backfill-and-incremental-loading) to manage incremental loading. - +`allow_external_schedulers`: A boolean that, if true, permits [external schedulers](../../general-usage/incremental-loading#using-airflow-schedule-for-backfill-and-incremental-loading) to manage incremental loading. Like the `employees` resource discussed above, other resources `absences` and `attendances` load data incrementally from the Personio API to your preferred destination. -### Resource `absence_types` +### Resource `absence types` Simple resource, which retrieves a list of various types of employee absences. ```py @@ -195,16 +194,16 @@ It is important to note that the data is loaded in `replace` mode where the exis completely replaced. In addition to the mentioned resource, -there are three more resources `projects`, `custom_reports_list` and `document_categories` -with similar behaviour. +there are three more resources `projects`, `custom_reports_list`, and `document_categories` +with similar behavior. 
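+
+For example, a minimal sketch of loading only these full-refresh resources could look like the following (the source function name and import path are assumed to match the verified source repository, and `duckdb` is used purely as an example destination):
+
+```py
+import dlt
+from personio import personio_source  # assumed import, as in the example pipeline script
+
+pipeline = dlt.pipeline(
+    pipeline_name="personio_simple_resources",
+    destination="duckdb",
+    dataset_name="personio_data",
+)
+
+# Select only the simple resources; each of them is loaded in `replace` mode.
+load_info = pipeline.run(
+    personio_source().with_resources("projects", "custom_reports_list", "document_categories")
+)
+print(load_info)
+```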
-### Resource-transformer `employees_absences_balance` +### Resource-transformer `employees absences balance` -Besides of these source and resource functions, there are two transformer functions +Besides these source and resource functions, there are two transformer functions for endpoints like `/company/employees/{employee_id}/absences/balance` and `/company/custom-reports/reports/{report_id}`. The transformer functions transform or process data from resources. -The transformer function `employees_absences_balance` process data from the `employees` resource. +The transformer function `employees_absences_balance` processes data from the `employees` resource. It fetches and returns a list of the absence balances for each employee. ```py @@ -219,7 +218,7 @@ def employees_absences_balance(employees_item: TDataItem) -> Iterable[TDataItem] ``` `employees_item`: The data item from the 'employees' resource. -It uses `@dlt.defer` decorator to enable parallel run in thread pool. +It uses the `@dlt.defer` decorator to enable parallel run in thread pool. ## Customization @@ -252,4 +251,3 @@ verified source. print(pipeline.run(load_data)) ``` - diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/pipedrive.md b/docs/website/docs/dlt-ecosystem/verified-sources/pipedrive.md index d571e5d386..ae99bc3f18 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/pipedrive.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/pipedrive.md @@ -30,7 +30,7 @@ Sources and resources that can be loaded using this verified source are: | stage | Specific step in a sales process where a deal resides based on its progress | | user | Individual with a unique login credential who can access and use the platform | -## Setup Guide +## Setup guide ### Grab API token @@ -77,7 +77,7 @@ For more information, read the guide on [how to add a verified source.](../../wa ```toml [sources.pipedrive.credentials] # Note: Do not share this file and do not push it to GitHub! - pipedrive_api_key = "PIPEDRIVE_API_TOKEN" # please set me up ! + pipedrive_api_key = "PIPEDRIVE_API_TOKEN" # please set me up! ``` 1. Replace `PIPEDRIVE_API_TOKEN` with the API token you [copied above](#grab-api-token). @@ -93,11 +93,11 @@ For more information, read the [General Usage: Credentials.](../../general-usage ```sh pip install -r requirements.txt ``` -1. You're now ready to run the pipeline! To get started, run the following command: +2. You're now ready to run the pipeline! To get started, run the following command: ```sh python pipedrive_pipeline.py ``` -1. Once the pipeline has finished running, you can verify that everything loaded correctly by using +3. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: ```sh dlt pipeline show @@ -132,8 +132,8 @@ Pipedrive API. ### Source `pipedrive_source` -This function returns a list of resources including activities, deals, custom_fields_mapping and -other resources data from Pipedrive API. +This function returns a list of resources including activities, deals, custom_fields_mapping, and +other resources data from the Pipedrive API. ```py @dlt.source(name="pipedrive") @@ -146,8 +146,8 @@ def pipedrive_source( `pipedrive_api_key`: Authentication token for Pipedrive, configured in ".dlt/secrets.toml". -`since_timestamp`: Starting timestamp for incremental loading. By default, complete history is loaded - on the first run. And new data in subsequent runs. +`since_timestamp`: Starting timestamp for incremental loading. 
By default, the complete history is loaded + on the first run, and new data in subsequent runs. > Note: Incremental loading can be enabled or disabled depending on user preferences. @@ -167,7 +167,7 @@ for entity, resource_name in RECENTS_ENTITIES.items(): write_disposition="merge", )(entity, **resource_kwargs) - #yields endpoint_resources.values + # yields endpoint_resources.values ``` `entity and resource_name`: Key-value pairs from RECENTS_ENTITIES. @@ -198,8 +198,8 @@ def pipedrive_source(args): `write_disposition`: Sets the transformer to merge new data with existing data in the destination. -Similar to the transformer function "deals_participants" is another transformer function named -"deals_flow" that gets the flow of deals from the Pipedrive API, and then yields the result for +Similar to the transformer function "deals_participants," another transformer function named +"deals_flow" gets the flow of deals from the Pipedrive API and then yields the result for further processing or loading. ### Resource `create_state` @@ -225,7 +225,7 @@ entity exists. This updated state is then saved for future pipeline runs. Similar to the above functions, there are the following: `custom_fields_mapping`: Transformer function that parses and yields custom fields' mapping in order -to be stored in destination by dlt. +to be stored in the destination by dlt. `leads`: Resource function that incrementally loads Pipedrive leads by update_time. @@ -292,5 +292,3 @@ verified source. print(load_info) ``` - - diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/scrapy.md b/docs/website/docs/dlt-ecosystem/verified-sources/scrapy.md index 2e6b588c18..054193d77a 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/scrapy.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/scrapy.md @@ -9,7 +9,7 @@ keywords: [scraping, scraping verified source, scrapy] This verified source utilizes Scrapy, an open-source and collaborative framework for web scraping. Scrapy enables efficient extraction of required data from websites. -## Setup Guide +## Setup guide ### Initialize the verified source @@ -44,8 +44,8 @@ For more information, read the guide on start_urls = ["URL to be scraped"] # please set me up! start_urls_file = "/path/to/urls.txt" # please set me up! ``` - > When both `start_urls` and `start_urls_file` are provided they will be merged and deduplicated - > to ensure a Scrapy gets a unique set of start URLs. + > When both `start_urls` and `start_urls_file` are provided, they will be merged and deduplicated + > to ensure Scrapy gets a unique set of start URLs. 1. Inside the `.dlt` folder, you'll find a file called `secrets.toml`, which is where you can securely store your access tokens and other sensitive information. It's important to handle this @@ -61,7 +61,7 @@ For more information, read [Secrets and Configs.](../../general-usage/credential In this section, we demonstrate how to use the `MySpider` class defined in "scraping_pipeline.py" to scrape data from "https://quotes.toscrape.com/page/1/". -1. Start with configuring the `config.toml` as follows: +1. Start by configuring the `config.toml` as follows: ```toml [sources.scraping] @@ -85,12 +85,14 @@ scrape data from "https://quotes.toscrape.com/page/1/". ## Customization + + ### Create your own pipeline If you wish to create your data pipeline, follow these steps: 1. The first step requires creating a spider class that scrapes data - from the website. For example, class `Myspider` below scrapes data from + from the website. 
For example, the class `Myspider` below scrapes data from URL: "https://quotes.toscrape.com/page/1/". ```py @@ -153,7 +155,7 @@ If you wish to create your data pipeline, follow these steps: In the above example, scrapy settings are passed as a parameter. For more information about scrapy settings, please refer to the - [Scrapy documentation.](https://docs.scrapy.org/en/latest/topics/settings.html). + [Scrapy documentation](https://docs.scrapy.org/en/latest/topics/settings.html). 1. To limit the number of items processed, use the "on_before_start" function to set a limit on the resources the pipeline processes. For instance, setting the resource limit to two allows @@ -187,3 +189,4 @@ If you wish to create your data pipeline, follow these steps: scraping_host.pipeline_runner.scraping_resource.add_limit(2) scraping_host.run(dataset_name="quotes", write_disposition="append") ``` + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/slack.md b/docs/website/docs/dlt-ecosystem/verified-sources/slack.md index 38eda15c94..35b12bb64f 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/slack.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/slack.md @@ -25,7 +25,7 @@ Sources and resources that can be loaded using this verified source are: | get_messages_resource | Retrieves all the messages for a given channel | | access_logs | Retrieves the access logs | -## Setup Guide +## Setup guide ### Grab user OAuth token @@ -204,7 +204,7 @@ def get_messages_resource( - `end_value`: Timestamp range end, defaulting to end_dt in slack_source. - - `allow_external_schedulers`: A boolean that, if True, permits [external schedulers](../../general-usage/incremental-loading#using-airflow-schedule-for-backfill-and-incremental-loading) to manage incremental loading. + - `allow_external_schedulers`: A boolean that, if true, permits [external schedulers](../../general-usage/incremental-loading#using-airflow-schedule-for-backfill-and-incremental-loading) to manage incremental loading. ### Resource `access_logs` @@ -217,7 +217,7 @@ This method retrieves access logs from the Slack API. primary_key="user_id", write_disposition="append", ) -# it is not an incremental resource it just has a end_date filter +# It is not an incremental resource; it just has an end_date filter. def logs_resource() -> Iterable[TDataItem]: ... ``` @@ -232,8 +232,7 @@ def logs_resource() -> Iterable[TDataItem]: ## Customization ### Create your own pipeline -If you wish to create your own pipelines, you can leverage source and resource methods from this -verified source. +If you wish to create your own pipelines, you can leverage source and resource methods from this verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: @@ -252,7 +251,7 @@ verified source. # Enable below to load only 'access_logs', available for paid accounts only. # source.access_logs.selected = True - # It loads data starting from 1st September 2023 to 8th Sep 2023. + # It loads data starting from 1st September 2023 to 8th September 2023. load_info = pipeline.run(source) print(load_info) ``` @@ -270,7 +269,7 @@ verified source. start_date=datetime(2023, 9, 1), end_date=datetime(2023, 9, 8), ) - # It loads data starting from 1st September 2023 to 8th Sep 2023 from the channels: "general" and "random". + # It loads data starting from 1st September 2023 to 8th September 2023 from the channels: "general" and "random". 
load_info = pipeline.run(source) print(load_info) ``` diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/stripe.md b/docs/website/docs/dlt-ecosystem/verified-sources/stripe.md index fdbefeddf1..15f75ac313 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/stripe.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/stripe.md @@ -13,25 +13,25 @@ import Header from './_source-info-header.md'; This Stripe `dlt` verified source and [pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/stripe_pipeline.py) -loads data using Stripe API to the destination of your choice. +loads data using the Stripe API to the destination of your choice. This verified source loads data from the following endpoints: -| Name | Description | +| Name | Description | |--------------------|--------------------------------------------| | Subscription | Recurring payment on Stripe | -| Account | User profile on Stripe | -| Coupon | Discount codes offered by businesses | -| Customer | Buyers using Stripe | -| Product | Items or services for sale | -| Price | Cost details for products or plans | -| Event | Significant activities in a Stripe account | -| Invoice | Payment request document | +| Account | User profile on Stripe | +| Coupon | Discount codes offered by businesses | +| Customer | Buyers using Stripe | +| Product | Items or services for sale | +| Price | Cost details for products or plans | +| Event | Significant activities in a Stripe account | +| Invoice | Payment request document | | BalanceTransaction | Funds movement record in Stripe | Please note that endpoints in the verified source can be customized as per the Stripe API [reference documentation.](https://stripe.com/docs/api) -## Setup Guide +## Setup guide ### Grab credentials @@ -89,8 +89,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage ## Run the pipeline -1. Before running the pipeline, ensure that you have installed all the necessary dependencies by - running the command: +1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: ```sh pip install -r requirements.txt @@ -102,26 +101,22 @@ For more information, read the [General Usage: Credentials.](../../general-usage python stripe_analytics_pipeline.py ``` -1. Once the pipeline has finished running, you can verify that everything loaded correctly by using - the following command: +1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: ```sh dlt pipeline show ``` - For example, the `pipeline_name` for the above pipeline example is `stripe_analytics`, you - may also use any custom name instead. + For example, the `pipeline_name` for the above pipeline example is `stripe_analytics`. You may also use any custom name instead. For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline). ## Sources and resources -`dlt` works on the principle of [sources](../../general-usage/source) and -[resources](../../general-usage/resource). +`dlt` works on the principle of [sources](../../general-usage/source) and [resources](../../general-usage/resource). ### Default endpoints -You can write your own pipelines to load data to a destination using this verified source. -However, it is important to note is how the `ENDPOINTS` and `INCREMENTAL_ENDPOINTS` tuples are defined in `stripe_analytics/settings.py`. 
+You can write your own pipelines to load data to a destination using this verified source. However, it is important to note how the `ENDPOINTS` and `INCREMENTAL_ENDPOINTS` tuples are defined in `stripe_analytics/settings.py`. ```py # The most popular Stripe API's endpoints @@ -168,20 +163,19 @@ def incremental_stripe_source( ``` `endpoints`: Tuple containing incremental endpoint names. -`initial_start_date`: Parameter for incremental loading; data after initial_start_date is loaded on the first run (default: None). +`initial_start_date`: Parameter for incremental loading; data after the initial_start_date is loaded on the first run (default: None). `end_date`: End datetime for data loading (default: None). - After each run, 'initial_start_date' updates to the last loaded date. Subsequent runs then retrieve only new data using append mode, streamlining the process and preventing redundant data downloads. For more information, read the [Incremental loading](../../general-usage/incremental-loading). ## Customization + ### Create your own pipeline -If you wish to create your own pipelines, you can leverage source and resource methods from this -verified source. +If you wish to create your own pipelines, you can leverage source and resource methods from this verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: @@ -235,3 +229,4 @@ verified source. > To load data, maintain the pipeline name and destination dataset name. The pipeline name is vital for accessing the last run's [state](../../general-usage/state), which determines the incremental data load's end date. Altering these names can trigger a [“dev_mode”](../../general-usage/pipeline#do-experiments-with-dev-mode), disrupting the metadata (state) tracking for [incremental data loading](../../general-usage/incremental-loading). +
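Because the end date for the next run is derived from that stored state, it can help to check the state between runs. A quick way to do this, assuming the `stripe_analytics` pipeline name used above, is the `dlt pipeline` command:

```sh
dlt pipeline stripe_analytics info
```

The command prints the pipeline's locally stored state and recent load packages.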
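Putting this together, a minimal script that loads only the incremental endpoints might look like the sketch below. It assumes `incremental_stripe_source` is importable from the local `stripe_analytics` package referenced above, that `Event` and `Invoice` are listed in `INCREMENTAL_ENDPOINTS`, and it uses DuckDB and `pendulum` purely as examples:

```py
import dlt
import pendulum

from stripe_analytics import incremental_stripe_source

pipeline = dlt.pipeline(
    pipeline_name="stripe_analytics",
    destination="duckdb",
    dataset_name="stripe_incremental",
)

# Append-only endpoints: the first run loads everything created after
# initial_start_date; subsequent runs fetch only newly created objects.
source = incremental_stripe_source(
    endpoints=("Event", "Invoice"),
    initial_start_date=pendulum.datetime(2024, 1, 1),
)

load_info = pipeline.run(source)
print(load_info)
```

Keeping the same `pipeline_name` and `dataset_name` across runs preserves the stored state, so each run appends only the new records.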