diff --git a/docs/website/docs/dlt-ecosystem/destinations/athena.md b/docs/website/docs/dlt-ecosystem/destinations/athena.md index 9fc5dc15f9..b376337e77 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/athena.md +++ b/docs/website/docs/dlt-ecosystem/destinations/athena.md @@ -6,7 +6,7 @@ keywords: [aws, athena, glue catalog] # AWS Athena / Glue Catalog -The athena destination stores data as parquet files in s3 buckets and creates [external tables in aws athena](https://docs.aws.amazon.com/athena/latest/ug/creating-tables.html). You can then query those tables with athena sql commands which will then scan the whole folder of parquet files and return the results. This destination works very similar to other sql based destinations, with the exception of the merge write disposition not being supported at this time. dlt metadata will be stored in the same bucket as the parquet files, but as iceberg tables. Athena additionally supports writing individual data tables as iceberg tables, so the may be manipulated later, a common use-case would be to strip gdpr data from them. +The Athena destination stores data as Parquet files in S3 buckets and creates [external tables in AWS Athena](https://docs.aws.amazon.com/athena/latest/ug/creating-tables.html). You can then query those tables with Athena SQL commands, which will scan the entire folder of Parquet files and return the results. This destination works very similarly to other SQL-based destinations, with the exception that the merge write disposition is not supported at this time. The `dlt` metadata will be stored in the same bucket as the Parquet files, but as iceberg tables. Athena also supports writing individual data tables as Iceberg tables, so they may be manipulated later. A common use case would be to strip GDPR data from them. ## Install dlt with Athena **To install the DLT library with Athena dependencies:** @@ -17,35 +17,34 @@ pip install dlt[athena] ## Setup Guide ### 1. Initialize the dlt project -Let's start by initializing a new dlt project as follows: +Let's start by initializing a new `dlt` project as follows: ```bash dlt init chess athena ``` - > 💡 This command will initialise your pipeline with chess as the source and aws athena as the destination using the filesystem staging destination + > 💡 This command will initialize your pipeline with chess as the source and AWS Athena as the destination using the filesystem staging destination. -### 2. Setup bucket storage and athena credentials +### 2. Setup bucket storage and Athena credentials -First install dependencies by running: +First, install dependencies by running: ``` pip install -r requirements.txt ``` -or with `pip install dlt[athena]` which will install `s3fs`, `pyarrow`, `pyathena` and `botocore` packages. +or with `pip install dlt[athena]`, which will install `s3fs`, `pyarrow`, `pyathena`, and `botocore` packages. :::caution -You may also install the dependencies independently -try +You may also install the dependencies independently. Try ```sh pip install dlt pip install s3fs pip install pyarrow pip install pyathena ``` -so pip does not fail on backtracking +so pip does not fail on backtracking. ::: -To edit the `dlt` credentials file with your secret info, open `.dlt/secrets.toml`. You will need to provide a `bucket_url` which holds the uploaded parquet files, a `query_result_bucket` which athena uses to write query results too, and credentials that have write and read access to these two buckets as well as the full athena access aws role. 
+To edit the `dlt` credentials file with your secret info, open `.dlt/secrets.toml`. You will need to provide a `bucket_url`, which holds the uploaded parquet files, a `query_result_bucket`, which Athena uses to write query results to, and credentials that have write and read access to these two buckets as well as the full Athena access AWS role. The toml file looks like this: @@ -63,10 +62,10 @@ query_result_bucket="s3://[results_bucket_name]" # replace with your query resul [destination.athena.credentials] aws_access_key_id="please set me up!" # same as credentials for filesystem aws_secret_access_key="please set me up!" # same as credentials for filesystem -region_name="please set me up!" # set your aws region, for example "eu-central-1" for frankfurt +region_name="please set me up!" # set your AWS region, for example "eu-central-1" for Frankfurt ``` -if you have your credentials stored in `~/.aws/credentials` just remove the **[destination.filesystem.credentials]** and **[destination.athena.credentials]** section above and `dlt` will fall back to your **default** profile in local credentials. If you want to switch the profile, pass the profile name as follows (here: `dlt-ci-user`): +If you have your credentials stored in `~/.aws/credentials`, just remove the **[destination.filesystem.credentials]** and **[destination.athena.credentials]** section above and `dlt` will fall back to your **default** profile in local credentials. If you want to switch the profile, pass the profile name as follows (here: `dlt-ci-user`): ```toml [destination.filesystem.credentials] profile_name="dlt-ci-user" @@ -77,7 +76,7 @@ profile_name="dlt-ci-user" ## Additional Destination Configuration -You can provide an athena workgroup like so: +You can provide an Athena workgroup like so: ```toml [destination.athena] athena_work_group="my_workgroup" @@ -85,45 +84,43 @@ athena_work_group="my_workgroup" ## Write disposition -`athena` destination handles the write dispositions as follows: -- `append` - files belonging to such tables are added to dataset folder -- `replace` - all files that belong to such tables are deleted from dataset folder and then current set of files is added. -- `merge` - falls back to `append` +The `athena` destination handles the write dispositions as follows: +- `append` - files belonging to such tables are added to the dataset folder. +- `replace` - all files that belong to such tables are deleted from the dataset folder, and then the current set of files is added. +- `merge` - falls back to `append`. ## Data loading -Data loading happens by storing parquet files in an s3 bucket and defining a schema on athena. If you query data via SQL queries on athena, the returned data is read by -scanning your bucket and reading all relevant parquet files in there. +Data loading happens by storing parquet files in an S3 bucket and defining a schema on Athena. If you query data via SQL queries on Athena, the returned data is read by scanning your bucket and reading all relevant parquet files in there. `dlt` internal tables are saved as Iceberg tables. ### Data types -Athena tables store timestamps with millisecond precision and with that precision we generate parquet files. Mind that Iceberg tables have microsecond precision. +Athena tables store timestamps with millisecond precision, and with that precision, we generate parquet files. Keep in mind that Iceberg tables have microsecond precision. -Athena does not support JSON fields so JSON is stored as string. 
+Athena does not support JSON fields, so JSON is stored as a string. > ❗**Athena does not support TIME columns in parquet files**. `dlt` will fail such jobs permanently. Convert `datetime.time` objects to `str` or `datetime.datetime` to load them. ### Naming Convention -We follow our snake_case name convention. Mind the following: -* DDL use HIVE escaping with `````` +We follow our snake_case name convention. Keep the following in mind: +* DDL uses HIVE escaping with `````` * Other queries use PRESTO and regular SQL escaping. ## Staging support -Using a staging destination is mandatory when using the athena destination. If you do not set staging to `filesystem`, dlt will automatically do this for you. +Using a staging destination is mandatory when using the Athena destination. If you do not set staging to `filesystem`, `dlt` will automatically do this for you. If you decide to change the [filename layout](./filesystem#data-loading) from the default value, keep the following in mind so that Athena can reliably build your tables: - - You need to provide the `{table_name}` placeholder and this placeholder needs to be followed by a forward slash - - You need to provide the `{file_id}` placeholder and it needs to be somewhere after the `{table_name}` placeholder. - - {table_name} must be the first placeholder in the layout. + - You need to provide the `{table_name}` placeholder, and this placeholder needs to be followed by a forward slash. + - You need to provide the `{file_id}` placeholder, and it needs to be somewhere after the `{table_name}` placeholder. + - `{table_name}` must be the first placeholder in the layout. ## Additional destination options -### iceberg data tables -You can save your tables as iceberg tables to athena. This will enable you to for example delete data from them later if you need to. To switch a resouce to the iceberg table-format, -supply the table_format argument like this: +### Iceberg data tables +You can save your tables as Iceberg tables to Athena. This will enable you, for example, to delete data from them later if you need to. To switch a resource to the iceberg table format, supply the table_format argument like this: ```python @dlt.resource(table_format="iceberg") @@ -131,29 +128,26 @@ def data() -> Iterable[TDataItem]: ... ``` -Alternatively you can set all tables to use the iceberg format with a config variable: +Alternatively, you can set all tables to use the iceberg format with a config variable: ```toml [destination.athena] force_iceberg = "True" ``` -For every table created as an iceberg table, the athena destination will create a regular athena table in the staging dataset of both the filesystem as well as the athena glue catalog and then -copy all data into the final iceberg table that lives with the non-iceberg tables in the same dataset on both filesystem and the glue catalog. Switching from iceberg to regular table or vice versa -is not supported. +For every table created as an iceberg table, the Athena destination will create a regular Athena table in the staging dataset of both the filesystem and the Athena glue catalog, and then copy all data into the final iceberg table that lives with the non-iceberg tables in the same dataset on both the filesystem and the glue catalog. Switching from iceberg to regular table or vice versa is not supported. ### dbt support -Athena is supported via `dbt-athena-community`. Credentials are passed into `aws_access_key_id` and `aws_secret_access_key` of generated dbt profile. 
Iceberg tables are supported but you need to make sure that you materialize your models as iceberg tables if your source table is iceberg. We encountered problems with materializing -date time columns due to different precision on iceberg (nanosecond) and regular Athena tables (millisecond). -The Athena adapter requires that you setup **region_name** in Athena configuration below. You can also setup table catalog name to change the default: **awsdatacatalog** +Athena is supported via `dbt-athena-community`. Credentials are passed into `aws_access_key_id` and `aws_secret_access_key` of the generated dbt profile. Iceberg tables are supported, but you need to make sure that you materialize your models as iceberg tables if your source table is iceberg. We encountered problems with materializing date time columns due to different precision on iceberg (nanosecond) and regular Athena tables (millisecond). +The Athena adapter requires that you set up **region_name** in the Athena configuration below. You can also set up the table catalog name to change the default: **awsdatacatalog** ```toml [destination.athena] aws_data_catalog="awsdatacatalog" ``` ### Syncing of `dlt` state -- This destination fully supports [dlt state sync.](../../general-usage/state#syncing-state-with-destination). The state is saved in athena iceberg tables in your s3 bucket. +- This destination fully supports [dlt state sync.](../../general-usage/state#syncing-state-with-destination). The state is saved in Athena iceberg tables in your S3 bucket. ## Supported file formats diff --git a/docs/website/docs/dlt-ecosystem/destinations/bigquery.md b/docs/website/docs/dlt-ecosystem/destinations/bigquery.md index 25b01923b5..e852bfa9e5 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/bigquery.md +++ b/docs/website/docs/dlt-ecosystem/destinations/bigquery.md @@ -28,7 +28,7 @@ dlt init chess bigquery pip install -r requirements.txt ``` -This will install dlt with **bigquery** extra, which contains all the dependencies required by the bigquery client. +This will install dlt with the `bigquery` extra, which contains all the dependencies required by the bigquery client. **3. Log in to or create a Google Cloud account** @@ -58,7 +58,7 @@ You don't need to grant users access to this service account now, so click the ` In the service accounts table page that you're redirected to after clicking `Done` as instructed above, select the three dots under the `Actions` column for the service account you created and select `Manage keys`. -This will take you to page where you can click the `Add key` button, then the `Create new key` button, +This will take you to a page where you can click the `Add key` button, then the `Create new key` button, and finally the `Create` button, keeping the preselected `JSON` option. A `JSON` file that includes your service account private key will then be downloaded. @@ -83,11 +83,11 @@ private_key = "private_key" # please set me up! client_email = "client_email" # please set me up! ``` -You can specify the location of the data i.e. `EU` instead of `US` which is a default. +You can specify the location of the data i.e. `EU` instead of `US` which is the default. ### OAuth 2.0 Authentication -You can use the OAuth 2.0 authentication. You'll need to generate a **refresh token** with right scopes (I suggest to ask our GPT-4 assistant for details). +You can use OAuth 2.0 authentication. You'll need to generate a **refresh token** with the right scopes (we suggest asking our GPT-4 assistant for details). 
Then you can fill the following information in `secrets.toml` ```toml @@ -103,9 +103,9 @@ refresh_token = "refresh_token" # please set me up! ### Using Default Credentials -Google provides several ways to get default credentials i.e. from `GOOGLE_APPLICATION_CREDENTIALS` environment variable or metadata services. +Google provides several ways to get default credentials i.e. from the `GOOGLE_APPLICATION_CREDENTIALS` environment variable or metadata services. VMs available on GCP (cloud functions, Composer runners, Colab notebooks) have associated service accounts or authenticated users. -Will try to use default credentials if nothing is explicitly specified in the secrets. +`dlt` will try to use default credentials if nothing is explicitly specified in the secrets. ```toml [destination.bigquery] @@ -114,16 +114,16 @@ location = "US" ## Write Disposition -All write dispositions are supported +All write dispositions are supported. -If you set the [`replace` strategy](../../general-usage/full-loading.md) to `staging-optimized` the destination tables will be dropped and +If you set the [`replace` strategy](../../general-usage/full-loading.md) to `staging-optimized`, the destination tables will be dropped and recreated with a [clone command](https://cloud.google.com/bigquery/docs/table-clones-create) from the staging tables. ## Data Loading -`dlt` uses `BigQuery` load jobs that send files from local filesystem or gcs buckets. -Loader follows [Google recommendations](https://cloud.google.com/bigquery/docs/error-messages) when retrying and terminating jobs. -Google BigQuery client implements elaborate retry mechanism and timeouts for queries and file uploads, which may be configured in destination options. +`dlt` uses `BigQuery` load jobs that send files from the local filesystem or GCS buckets. +The loader follows [Google recommendations](https://cloud.google.com/bigquery/docs/error-messages) when retrying and terminating jobs. +The Google BigQuery client implements an elaborate retry mechanism and timeouts for queries and file uploads, which may be configured in destination options. ## Supported File Formats @@ -143,36 +143,36 @@ When staging is enabled: BigQuery supports the following [column hints](https://dlthub.com/docs/general-usage/schema#tables-and-columns): -* `partition` - creates a partition with a day granularity on decorated column (`PARTITION BY DATE`). - May be used with `datetime`, `date` and `bigint` data types. +* `partition` - creates a partition with a day granularity on the decorated column (`PARTITION BY DATE`). + May be used with `datetime`, `date`, and `bigint` data types. Only one column per table is supported and only when a new table is created. For more information on BigQuery partitioning, read the [official docs](https://cloud.google.com/bigquery/docs/partitioned-tables). > ❗ `bigint` maps to BigQuery's **INT64** data type. > Automatic partitioning requires converting an INT64 column to a UNIX timestamp, which `GENERATE_ARRAY` doesn't natively support. > With a 10,000 partition limit, we can’t cover the full INT64 range. - > Instead, we set 86,400 second boundaries to enable daily partitioning. + > Instead, we set 86,400-second boundaries to enable daily partitioning. > This captures typical values, but extremely large/small outliers go to an `__UNPARTITIONED__` catch-all partition. * `cluster` - creates a cluster column(s). Many columns per table are supported and only when a new table is created. 
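+
+As a quick sketch of the hints above (the `events` resource and its columns are illustrative, not part of dlt's API), you can declare `partition` and `cluster` directly in a resource's column schema; the `bigquery_adapter` shown further down this page is the more explicit alternative:
+
+```python
+from datetime import date
+
+import dlt
+
+@dlt.resource(
+    columns={
+        "event_date": {"data_type": "date", "partition": True},
+        "user_id": {"data_type": "bigint", "cluster": True},
+    }
+)
+def events():
+    # illustrative rows only
+    yield {"event_date": date(2023, 9, 30), "user_id": 1}
+```
+
+As noted above, both hints only take effect when the table is first created.
+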
## Staging Support -BigQuery supports gcs as a file staging destination. dlt will upload files in the parquet format to gcs and ask BigQuery to copy their data directly into the db. -Please refer to the [Google Storage filesystem documentation](./filesystem.md#google-storage) to learn how to set up your gcs bucket with the bucket_url and credentials. -If you use the same service account for gcs and your redshift deployment, you do not need to provide additional authentication for BigQuery to be able to read from your bucket. +BigQuery supports GCS as a file staging destination. `dlt` will upload files in the parquet format to GCS and ask BigQuery to copy their data directly into the database. +Please refer to the [Google Storage filesystem documentation](./filesystem.md#google-storage) to learn how to set up your GCS bucket with the bucket_url and credentials. +If you use the same service account for GCS and your Redshift deployment, you do not need to provide additional authentication for BigQuery to be able to read from your bucket. -Alternatively to parquet files, you can specify jsonl as the staging file format. For this set the `loader_file_format` argument of the `run` command of the pipeline to `jsonl`. +Alternatively to parquet files, you can specify jsonl as the staging file format. For this, set the `loader_file_format` argument of the `run` command of the pipeline to `jsonl`. ### BigQuery/GCS Staging Example ```python # Create a dlt pipeline that will load # chess player data to the BigQuery destination -# via a gcs bucket. +# via a GCS bucket. pipeline = dlt.pipeline( pipeline_name='chess_pipeline', - destination='biquery', + destination='bigquery', staging='filesystem', # Add this to activate the staging location. dataset_name='player_data' ) @@ -180,7 +180,7 @@ pipeline = dlt.pipeline( ## Additional Destination Options -You can configure the data location and various timeouts as shown below. This information is not a secret so can be placed in `config.toml` as well: +You can configure the data location and various timeouts as shown below. This information is not a secret so it can be placed in `config.toml` as well: ```toml [destination.bigquery] @@ -191,15 +191,15 @@ retry_deadline=60.0 ``` * `location` sets the [BigQuery data location](https://cloud.google.com/bigquery/docs/locations) (default: **US**) -* `http_timeout` sets the timeout when connecting and getting a response from BigQuery API (default: **15 seconds**) -* `file_upload_timeout` a timeout for file upload when loading local files: the total time of the upload may not exceed this value (default: **30 minutes**, set in seconds) -* `retry_deadline` a deadline for a [DEFAULT_RETRY used by Google](https://cloud.google.com/python/docs/reference/storage/1.39.0/retry_timeout) +* `http_timeout` sets the timeout when connecting and getting a response from the BigQuery API (default: **15 seconds**) +* `file_upload_timeout` is a timeout for file upload when loading local files: the total time of the upload may not exceed this value (default: **30 minutes**, set in seconds) +* `retry_deadline` is a deadline for a [DEFAULT_RETRY used by Google](https://cloud.google.com/python/docs/reference/storage/1.39.0/retry_timeout) ### dbt Support This destination [integrates with dbt](../transformations/dbt/dbt.md) via [dbt-bigquery](https://github.com/dbt-labs/dbt-bigquery). Credentials, if explicitly defined, are shared with `dbt` along with other settings like **location** and retries and timeouts. 
-In case of implicit credentials (i.e. available in cloud function), `dlt` shares the `project_id` and delegates obtaining credentials to `dbt` adapter. +In the case of implicit credentials (i.e. available in a cloud function), `dlt` shares the `project_id` and delegates obtaining credentials to the `dbt` adapter. ### Syncing of `dlt` State @@ -215,7 +215,7 @@ The adapter updates the DltResource with metadata about the destination column a ### Use an Adapter to Apply Hints to a Resource -Here is an example of how to use the `bigquery_adapter` method to apply hints to a resource on both column level and table level: +Here is an example of how to use the `bigquery_adapter` method to apply hints to a resource on both the column level and table level: ```python from datetime import date, timedelta @@ -246,9 +246,9 @@ bigquery_adapter( bigquery_adapter(event_data, table_description="Dummy event data.") ``` -Above, the adapter specifies that `event_date` should be used for partitioning and both `event_date` and `user_id` should be used for clustering (in the given order) when the table is created. +In the example above, the adapter specifies that `event_date` should be used for partitioning and both `event_date` and `user_id` should be used for clustering (in the given order) when the table is created. -Some things to note with the adapter's behaviour: +Some things to note with the adapter's behavior: - You can only partition on one column (refer to [supported hints](#supported-column-hints)). - You can cluster on as many columns as you would like. diff --git a/docs/website/docs/dlt-ecosystem/destinations/databricks.md b/docs/website/docs/dlt-ecosystem/destinations/databricks.md index fc100e41e2..d00c603c14 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/databricks.md +++ b/docs/website/docs/dlt-ecosystem/destinations/databricks.md @@ -7,7 +7,7 @@ keywords: [Databricks, destination, data warehouse] --- # Databricks -*Big thanks to Evan Phillips and [swishbi.com](https://swishbi.com/) for contributing code, time and test environment* +*Big thanks to Evan Phillips and [swishbi.com](https://swishbi.com/) for contributing code, time, and a test environment.* ## Install dlt with Databricks **To install the DLT library with Databricks dependencies:** @@ -28,7 +28,7 @@ If you already have your Databricks workspace set up, you can skip to the [Loade 1. Create a Databricks workspace in Azure - In your Azure Portal search for Databricks and create a new workspace. In the "Pricing Tier" section, select "Premium" to be able to use the Unity Catalog. + In your Azure Portal, search for Databricks and create a new workspace. In the "Pricing Tier" section, select "Premium" to be able to use the Unity Catalog. 2. Create an ADLS Gen 2 storage account @@ -42,7 +42,7 @@ If you already have your Databricks workspace set up, you can skip to the [Loade 4. Create an Access Connector for Azure Databricks This will allow Databricks to access your storage account. - In the Azure Portal search for "Access Connector for Azure Databricks" and create a new connector. + In the Azure Portal, search for "Access Connector for Azure Databricks" and create a new connector. 5. Grant access to your storage container @@ -54,16 +54,16 @@ If you already have your Databricks workspace set up, you can skip to the [Loade 1. Now go to your Databricks workspace - To get there from the Azure Portal, search for "Databricks" and select your Databricks and click "Launch Workspace". 
+ To get there from the Azure Portal, search for "Databricks", select your Databricks, and click "Launch Workspace". 2. In the top right corner, click on your email address and go to "Manage Account" 3. Go to "Data" and click on "Create Metastore" Name your metastore and select a region. - If you'd like to set up a storage container for the whole metastore you can add your ADLS URL and Access Connector Id here. You can also do this on a granular level when creating the catalog. + If you'd like to set up a storage container for the whole metastore, you can add your ADLS URL and Access Connector Id here. You can also do this on a granular level when creating the catalog. - In the next step assign your metastore to your workspace. + In the next step, assign your metastore to your workspace. 4. Go back to your workspace and click on "Catalog" in the left-hand menu @@ -77,7 +77,7 @@ If you already have your Databricks workspace set up, you can skip to the [Loade Set the URL of our storage container. This should be in the form: `abfss://@.dfs.core.windows.net/` - Once created you can test the connection to make sure the container is accessible from databricks. + Once created, you can test the connection to make sure the container is accessible from Databricks. 7. Now you can create a catalog @@ -113,7 +113,7 @@ Example: [destination.databricks.credentials] server_hostname = "MY_DATABRICKS.azuredatabricks.net" http_path = "/sql/1.0/warehouses/12345" -access_token "MY_ACCESS_TOKEN" +access_token = "MY_ACCESS_TOKEN" catalog = "my_catalog" ``` @@ -123,7 +123,7 @@ All write dispositions are supported ## Data loading Data is loaded using `INSERT VALUES` statements by default. -Efficient loading from a staging filesystem is also supported by configuring an Amazon S3 or Azure Blob Storage bucket as a staging destination. When staging is enabled `dlt` will upload data in `parquet` files to the bucket and then use `COPY INTO` statements to ingest the data into Databricks. +Efficient loading from a staging filesystem is also supported by configuring an Amazon S3 or Azure Blob Storage bucket as a staging destination. When staging is enabled, `dlt` will upload data in `parquet` files to the bucket and then use `COPY INTO` statements to ingest the data into Databricks. For more information on staging, see the [staging support](#staging-support) section below. ## Supported file formats @@ -133,7 +133,7 @@ For more information on staging, see the [staging support](#staging-support) sec The `jsonl` format has some limitations when used with Databricks: -1. Compression must be disabled to load jsonl files in databricks. Set `data_writer.disable_compression` to `true` in dlt config when using this format. +1. Compression must be disabled to load jsonl files in Databricks. Set `data_writer.disable_compression` to `true` in dlt config when using this format. 2. The following data types are not supported when using `jsonl` format with `databricks`: `decimal`, `complex`, `date`, `binary`. Use `parquet` if your data contains these types. 3. `bigint` data type with precision is not supported with `jsonl` format @@ -144,16 +144,16 @@ Databricks supports both Amazon S3 and Azure Blob Storage as staging locations. ### Databricks and Amazon S3 -Please refer to the [S3 documentation](./filesystem.md#aws-s3) for details on connecting your s3 bucket with the bucket_url and credentials. +Please refer to the [S3 documentation](./filesystem.md#aws-s3) for details on connecting your S3 bucket with the bucket_url and credentials. 
-Example to set up Databricks with s3 as a staging destination: +Example to set up Databricks with S3 as a staging destination: ```python import dlt # Create a dlt pipeline that will load # chess player data to the Databricks destination -# via staging on s3 +# via staging on S3 pipeline = dlt.pipeline( pipeline_name='chess_pipeline', destination='databricks', @@ -195,4 +195,4 @@ This destination fully supports [dlt state sync](../../general-usage/state#synci - [Load data from Google Analytics to Databricks in python with dlt](https://dlthub.com/docs/pipelines/google_analytics/load-data-with-python-from-google_analytics-to-databricks) - [Load data from Google Sheets to Databricks in python with dlt](https://dlthub.com/docs/pipelines/google_sheets/load-data-with-python-from-google_sheets-to-databricks) - [Load data from Chess.com to Databricks in python with dlt](https://dlthub.com/docs/pipelines/chess/load-data-with-python-from-chess-to-databricks) - \ No newline at end of file + diff --git a/docs/website/docs/dlt-ecosystem/destinations/duckdb.md b/docs/website/docs/dlt-ecosystem/destinations/duckdb.md index db7428dcc9..9452a80c50 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/duckdb.md +++ b/docs/website/docs/dlt-ecosystem/destinations/duckdb.md @@ -7,38 +7,38 @@ keywords: [duckdb, destination, data warehouse] # DuckDB ## Install dlt with DuckDB -**To install the DLT library with DuckDB dependencies:** +**To install the DLT library with DuckDB dependencies, run:** ``` pip install dlt[duckdb] ``` ## Setup Guide -**1. Initialize a project with a pipeline that loads to DuckDB by running** +**1. Initialize a project with a pipeline that loads to DuckDB by running:** ``` dlt init chess duckdb ``` -**2. Install the necessary dependencies for DuckDB by running** +**2. Install the necessary dependencies for DuckDB by running:** ``` pip install -r requirements.txt ``` -**3. Run the pipeline** +**3. Run the pipeline:** ``` python3 chess_pipeline.py ``` ## Write disposition -All write dispositions are supported +All write dispositions are supported. ## Data loading -`dlt` will load data using large INSERT VALUES statements by default. Loading is multithreaded (20 threads by default). If you are ok with installing `pyarrow` we suggest to switch to `parquet` as file format. Loading is faster (and also multithreaded). +`dlt` will load data using large INSERT VALUES statements by default. Loading is multithreaded (20 threads by default). If you are okay with installing `pyarrow`, we suggest switching to `parquet` as the file format. Loading is faster (and also multithreaded). ### Names normalization -`dlt` uses standard **snake_case** naming convention to keep identical table and column identifiers across all destinations. If you want to use **duckdb** wide range of characters (ie. emojis) for table and column names, you can switch to **duck_case** naming convention which accepts almost any string as an identifier: +`dlt` uses the standard **snake_case** naming convention to keep identical table and column identifiers across all destinations. 
If you want to use the **duckdb** wide range of characters (i.e., emojis) for table and column names, you can switch to the **duck_case** naming convention, which accepts almost any string as an identifier: * `\n` `\r` and `" are translated to `_` -* multiple `_` are translated to single `_` +* multiple `_` are translated to a single `_` Switch the naming convention using `config.toml`: ```toml @@ -46,31 +46,31 @@ Switch the naming convention using `config.toml`: naming="duck_case" ``` -or via env variable `SCHEMA__NAMING` or directly in code: +or via the env variable `SCHEMA__NAMING` or directly in the code: ```python dlt.config["schema.naming"] = "duck_case" ``` :::caution -**duckdb** identifiers are **case insensitive** but display names preserve case. This may create name clashes if for example you load json with -`{"Column": 1, "column": 2}` will map data to a single column. +**duckdb** identifiers are **case insensitive** but display names preserve case. This may create name clashes if, for example, you load JSON with +`{"Column": 1, "column": 2}` as it will map data to a single column. ::: ## Supported file formats -You can configure the following file formats to load data to duckdb +You can configure the following file formats to load data to duckdb: * [insert-values](../file-formats/insert-format.md) is used by default * [parquet](../file-formats/parquet.md) is supported :::note -`duckdb` cannot COPY many parquet files to a single table from multiple threads. In this situation `dlt` serializes the loads. Still - that may be faster than INSERT +`duckdb` cannot COPY many parquet files to a single table from multiple threads. In this situation, `dlt` serializes the loads. Still, that may be faster than INSERT. ::: -* [jsonl](../file-formats/jsonl.md) **is supported but does not work if JSON fields are optional. the missing keys fail the COPY instead of being interpreted as NULL** +* [jsonl](../file-formats/jsonl.md) **is supported but does not work if JSON fields are optional. The missing keys fail the COPY instead of being interpreted as NULL.** ## Supported column hints -`duckdb` may create unique indexes for all columns with `unique` hints but this behavior **is disabled by default** because it slows the loading down significantly. +`duckdb` may create unique indexes for all columns with `unique` hints, but this behavior **is disabled by default** because it slows the loading down significantly. ## Destination Configuration -By default, a DuckDB database will be created in the current working directory with a name `.duckdb` (`chess.duckdb` in the example above). After loading, it is available in `read/write` mode via `with pipeline.sql_client() as con:` which is a wrapper over `DuckDBPyConnection`. See [duckdb docs](https://duckdb.org/docs/api/python/overview#persistent-storage) for details. +By default, a DuckDB database will be created in the current working directory with a name `.duckdb` (`chess.duckdb` in the example above). After loading, it is available in `read/write` mode via `with pipeline.sql_client() as con:`, which is a wrapper over `DuckDBPyConnection`. See [duckdb docs](https://duckdb.org/docs/api/python/overview#persistent-storage) for details. The `duckdb` credentials do not require any secret values. You are free to pass the configuration explicitly via the `credentials` parameter to `dlt.pipeline` or `pipeline.run` methods. 
For example: ```python @@ -88,17 +88,17 @@ db = duckdb.connect() p = dlt.pipeline(pipeline_name='chess', destination='duckdb', dataset_name='chess_data', full_refresh=False, credentials=db) ``` -This destination accepts database connection strings in format used by [duckdb-engine](https://github.com/Mause/duckdb_engine#configuration). +This destination accepts database connection strings in the format used by [duckdb-engine](https://github.com/Mause/duckdb_engine#configuration). -You can configure a DuckDB destination with [secret / config values](../../general-usage/credentials) (e.g. using a `secrets.toml` file) +You can configure a DuckDB destination with [secret / config values](../../general-usage/credentials) (e.g., using a `secrets.toml` file) ```toml destination.duckdb.credentials=duckdb:///_storage/test_quack.duckdb ``` -**duckdb://** url above creates a **relative** path to `_storage/test_quack.duckdb`. To define **absolute** path you need to specify four slashes ie. `duckdb:////_storage/test_quack.duckdb`. +The **duckdb://** URL above creates a **relative** path to `_storage/test_quack.duckdb`. To define an **absolute** path, you need to specify four slashes, i.e., `duckdb:////_storage/test_quack.duckdb`. A few special connection strings are supported: -* **:pipeline:** creates the database in the working directory of the pipeline with name `quack.duckdb`. -* **:memory:** creates in memory database. This may be useful for testing. +* **:pipeline:** creates the database in the working directory of the pipeline with the name `quack.duckdb`. +* **:memory:** creates an in-memory database. This may be useful for testing. ### Additional configuration @@ -109,10 +109,10 @@ create_indexes=true ``` ### dbt support -This destination [integrates with dbt](../transformations/dbt/dbt.md) via [dbt-duckdb](https://github.com/jwills/dbt-duckdb) which is a community supported package. The `duckdb` database is shared with `dbt`. In rare cases you may see information that binary database format does not match the database format expected by `dbt-duckdb`. You may avoid that by updating the `duckdb` package in your `dlt` project with `pip install -U`. +This destination [integrates with dbt](../transformations/dbt/dbt.md) via [dbt-duckdb](https://github.com/jwills/dbt-duckdb), which is a community-supported package. The `duckdb` database is shared with `dbt`. In rare cases, you may see information that the binary database format does not match the database format expected by `dbt-duckdb`. You can avoid that by updating the `duckdb` package in your `dlt` project with `pip install -U`. ### Syncing of `dlt` state -This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination) +This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination). 
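+
+After a load, you can inspect the data through the same connection `dlt` uses. A minimal sketch (the `players` table name is illustrative and depends on what your pipeline actually loaded); as mentioned in the Destination Configuration section above, `sql_client()` wraps a `DuckDBPyConnection`:
+
+```python
+import dlt
+
+pipeline = dlt.pipeline(pipeline_name="chess", destination="duckdb", dataset_name="chess_data")
+
+# query the duckdb file created by the pipeline
+with pipeline.sql_client() as client:
+    rows = client.execute_sql("SELECT COUNT(*) FROM players")
+    print(rows)
+```
+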
## Additional Setup guides @@ -124,4 +124,4 @@ This destination fully supports [dlt state sync](../../general-usage/state#synci - [Load data from Chess.com to DuckDB in python with dlt](https://dlthub.com/docs/pipelines/chess/load-data-with-python-from-chess-to-duckdb) - [Load data from HubSpot to DuckDB in python with dlt](https://dlthub.com/docs/pipelines/hubspot/load-data-with-python-from-hubspot-to-duckdb) - [Load data from GitHub to DuckDB in python with dlt](https://dlthub.com/docs/pipelines/github/load-data-with-python-from-github-to-duckdb) - \ No newline at end of file + diff --git a/docs/website/docs/dlt-ecosystem/destinations/index.md b/docs/website/docs/dlt-ecosystem/destinations/index.md index 5d26c0f138..2c24d14312 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/index.md +++ b/docs/website/docs/dlt-ecosystem/destinations/index.md @@ -5,11 +5,11 @@ keywords: ['destinations'] --- import DocCardList from '@theme/DocCardList'; -Pick one of our high quality destinations and load your data to a local database, warehouse or a data lake. Append, replace or merge your data. Apply performance hints like partitions, clusters or indexes. Load directly or via staging. Each of our destinations goes through few hundred automated tests every day. +Pick one of our high-quality destinations and load your data into a local database, warehouse, or data lake. Append, replace, or merge your data. Apply performance hints like partitions, clusters, or indexes. Load directly or via staging. Each of our destinations undergoes several hundred automated tests every day. -* Destination or feature missing? [Join our Slack community](https://dlthub.com/community) and ask for it -* Need more info? [Join our Slack community](https://dlthub.com/community) and ask in the tech help channel or [Talk to an engineer](https://calendar.app.google/kiLhuMsWKpZUpfho6) +* Is a destination or feature missing? [Join our Slack community](https://dlthub.com/community) and ask for it. +* Need more info? [Join our Slack community](https://dlthub.com/community) and ask in the tech help channel or [Talk to an engineer](https://calendar.app.google/kiLhuMsWKpZUpfho6). -Otherwise pick a destination below: +Otherwise, pick a destination below: diff --git a/docs/website/docs/dlt-ecosystem/destinations/motherduck.md b/docs/website/docs/dlt-ecosystem/destinations/motherduck.md index b002286bcf..1288b9caac 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/motherduck.md +++ b/docs/website/docs/dlt-ecosystem/destinations/motherduck.md @@ -5,7 +5,7 @@ keywords: [MotherDuck, duckdb, destination, data warehouse] --- # MotherDuck -> 🧪 MotherDuck is still invitation only and intensively tested. Please see the limitations / problems at the end. +> 🧪 MotherDuck is still invitation-only and is being intensively tested. Please see the limitations/problems at the end. ## Install dlt with MotherDuck **To install the DLT library with MotherDuck dependencies:** @@ -14,12 +14,12 @@ pip install dlt[motherduck] ``` :::tip -Decrease the number of load workers to 3-5 depending on the quality of your internet connection if you see a lot of retries in your logs with various timeout, add the following to your `config.toml`: +If you see a lot of retries in your logs with various timeouts, decrease the number of load workers to 3-5 depending on the quality of your internet connection. Add the following to your `config.toml`: ```toml [load] workers=3 ``` -or export **LOAD__WORKERS=3** env variable. 
See more in [performance](../../reference/performance.md) +or export the **LOAD__WORKERS=3** env variable. See more in [performance](../../reference/performance.md) ::: ## Setup Guide @@ -34,7 +34,7 @@ dlt init chess motherduck pip install -r requirements.txt ``` -This will install dlt with **motherduck** extra which contains **duckdb** and **pyarrow** dependencies +This will install dlt with the **motherduck** extra which contains **duckdb** and **pyarrow** dependencies. **3. Add your MotherDuck token to `.dlt/secrets.toml`** ```toml @@ -42,63 +42,61 @@ This will install dlt with **motherduck** extra which contains **duckdb** and ** database = "dlt_data_3" password = "" ``` -Paste your **service token** into password. The `database` field is optional but we recommend to set it. MotherDuck will create this database (in this case `dlt_data_3`) for you. +Paste your **service token** into the password field. The `database` field is optional, but we recommend setting it. MotherDuck will create this database (in this case `dlt_data_3`) for you. -Alternatively you can use the connection string syntax +Alternatively, you can use the connection string syntax. ```toml [destination] motherduck.credentials="md:///dlt_data_3?token=" ``` -**3. Run the pipeline** +**4. Run the pipeline** ``` python3 chess_pipeline.py ``` ## Write disposition -All write dispositions are supported +All write dispositions are supported. ## Data loading -By default **parquet** files and `COPY` command is used to move files to remote duckdb database. All write dispositions are supported. +By default, Parquet files and the `COPY` command are used to move files to the remote duckdb database. All write dispositions are supported. -**INSERT** format is also supported and will execute a large INSERT queries directly into the remote database. This is way slower and may exceed maximum query size - so not advised. +The **INSERT** format is also supported and will execute large INSERT queries directly into the remote database. This method is significantly slower and may exceed the maximum query size, so it is not advised. ## dbt support -This destination [integrates with dbt](../transformations/dbt/dbt.md) via [dbt-duckdb](https://github.com/jwills/dbt-duckdb) which is a community supported package. `dbt` version >= 1.5 is required (which is current `dlt` default.) +This destination [integrates with dbt](../transformations/dbt/dbt.md) via [dbt-duckdb](https://github.com/jwills/dbt-duckdb), which is a community-supported package. `dbt` version >= 1.5 is required (which is the current `dlt` default.) ## Syncing of `dlt` state -This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination) +This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination). ## Automated tests -Each destination must pass few hundred automatic tests. MotherDuck is passing those tests (except the transactions OFC). However we encountered issues with ATTACH timeouts when connecting which makes running such number of tests unstable. Tests on CI are disabled. +Each destination must pass a few hundred automatic tests. MotherDuck is passing these tests (except for the transactions, of course). However, we have encountered issues with ATTACH timeouts when connecting, which makes running such a number of tests unstable. Tests on CI are disabled. 
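+
+If you prefer not to keep the connection string in `secrets.toml`, the same `md:` credentials shown above can also be passed directly in code. A sketch (the token value is a placeholder):
+
+```python
+import dlt
+
+pipeline = dlt.pipeline(
+    pipeline_name="chess",
+    destination="motherduck",
+    dataset_name="chess_data",
+    credentials="md:///dlt_data_3?token=<my_service_token>",
+)
+```
+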
## Troubleshooting / limitations ### I see a lot of errors in the log like DEADLINE_EXCEEDED or Connection timed out -Motherduck is very sensitive to quality of the internet connection and **number of workers used to load data**. Decrease the number of workers and make sure your internet connection really works. We could not find any way to increase those timeouts yet. - +MotherDuck is very sensitive to the quality of the internet connection and the **number of workers used to load data**. Decrease the number of workers and ensure your internet connection is stable. We have not found any way to increase these timeouts yet. ### MotherDuck does not support transactions. -Do not use `begin`, `commit` and `rollback` on `dlt` **sql_client** or on duckdb dbapi connection. It has no effect for DML statements (they are autocommit). It is confusing the query engine for DDL (tables not found etc.). -If your connection if of poor quality and you get a time out when executing DML query it may happen that your transaction got executed, - +Do not use `begin`, `commit`, and `rollback` on `dlt` **sql_client** or on the duckdb dbapi connection. It has no effect on DML statements (they are autocommit). It confuses the query engine for DDL (tables not found, etc.). +If your connection is of poor quality and you get a timeout when executing a DML query, it may happen that your transaction got executed. ### I see some exception with home_dir missing when opening `md:` connection. -Some internal component (HTTPS) requires **HOME** env variable to be present. Export such variable to the command line. Here is what we do in our tests: +Some internal component (HTTPS) requires the **HOME** env variable to be present. Export such a variable to the command line. Here is what we do in our tests: ```python os.environ["HOME"] = "/tmp" ``` -before opening connection +before opening the connection. ### I see some watchdog timeouts. We also see them. ``` 'ATTACH_DATABASE': keepalive watchdog timeout ``` -My observation is that if you write a lot of data into the database then close the connection and then open it again to write, there's a chance of such timeout. Possible **WAL** file is being written to the remote duckdb database. +Our observation is that if you write a lot of data into the database, then close the connection and then open it again to write, there's a chance of such a timeout. A possible **WAL** file is being written to the remote duckdb database. ### Invalid Input Error: Initialization function "motherduck_init" from file Use `duckdb 0.8.1` or above. - \ No newline at end of file + diff --git a/docs/website/docs/dlt-ecosystem/destinations/mssql.md b/docs/website/docs/dlt-ecosystem/destinations/mssql.md index 9d216a52a3..5ed4b69707 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/mssql.md +++ b/docs/website/docs/dlt-ecosystem/destinations/mssql.md @@ -7,7 +7,7 @@ keywords: [mssql, sqlserver, destination, data warehouse] # Microsoft SQL Server ## Install dlt with MS SQL -**To install the DLT library with MS SQL dependencies:** +**To install the DLT library with MS SQL dependencies, use:** ``` pip install dlt[mssql] ``` @@ -16,23 +16,23 @@ pip install dlt[mssql] ### Prerequisites -_Microsoft ODBC Driver for SQL Server_ must be installed to use this destination. -This can't be included with `dlt`'s python dependencies, so you must install it separately on your system. 
You can find the official installation instructions [here](https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server?view=sql-server-ver16). +The _Microsoft ODBC Driver for SQL Server_ must be installed to use this destination. +This cannot be included with `dlt`'s python dependencies, so you must install it separately on your system. You can find the official installation instructions [here](https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server?view=sql-server-ver16). Supported driver versions: * `ODBC Driver 18 for SQL Server` * `ODBC Driver 17 for SQL Server` -You can [configure driver name](#additional-destination-options) explicitly as well. +You can also [configure the driver name](#additional-destination-options) explicitly. ### Create a pipeline -**1. Initalize a project with a pipeline that loads to MS SQL by running** +**1. Initialize a project with a pipeline that loads to MS SQL by running:** ``` dlt init chess mssql ``` -**2. Install the necessary dependencies for MS SQL by running** +**2. Install the necessary dependencies for MS SQL by running:** ``` pip install -r requirements.txt ``` @@ -40,11 +40,11 @@ or run: ``` pip install dlt[mssql] ``` -This will install dlt with **mssql** extra which contains all the dependencies required by the SQL server client. +This will install `dlt` with the `mssql` extra, which contains all the dependencies required by the SQL server client. **3. Enter your credentials into `.dlt/secrets.toml`.** -Example, replace with your database connection info: +For example, replace with your database connection info: ```toml [destination.mssql.credentials] database = "dlt_data" @@ -61,34 +61,34 @@ You can also pass a SQLAlchemy-like database connection: destination.mssql.credentials="mssql://loader:@loader.database.windows.net/dlt_data?connect_timeout=15" ``` -To pass credentials directly you can use `credentials` argument passed to `dlt.pipeline` or `pipeline.run` methods. +To pass credentials directly, you can use the `credentials` argument passed to `dlt.pipeline` or `pipeline.run` methods. ```python pipeline = dlt.pipeline(pipeline_name='chess', destination='postgres', dataset_name='chess_data', credentials="mssql://loader:@loader.database.windows.net/dlt_data?connect_timeout=15") ``` ## Write disposition -All write dispositions are supported +All write dispositions are supported. -If you set the [`replace` strategy](../../general-usage/full-loading.md) to `staging-optimized` the destination tables will be dropped and +If you set the [`replace` strategy](../../general-usage/full-loading.md) to `staging-optimized`, the destination tables will be dropped and recreated with an `ALTER SCHEMA ... TRANSFER`. The operation is atomic: mssql supports DDL transactions. ## Data loading -Data is loaded via INSERT statements by default. MSSQL has a limit of 1000 rows per INSERT and this is what we use. +Data is loaded via INSERT statements by default. MSSQL has a limit of 1000 rows per INSERT, and this is what we use. ## Supported file formats * [insert-values](../file-formats/insert-format.md) is used by default ## Supported column hints -**mssql** will create unique indexes for all columns with `unique` hints. This behavior **may be disabled** +**mssql** will create unique indexes for all columns with `unique` hints. This behavior **may be disabled**. 
## Syncing of `dlt` state -This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination) +This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination). ## Data types -MS SQL does not support JSON columns, so JSON objects are stored as strings in `nvarchar` column. +MS SQL does not support JSON columns, so JSON objects are stored as strings in `nvarchar` columns. ## Additional destination options -**mssql** destination **does not** creates UNIQUE indexes by default on columns with `unique` hint (ie. `_dlt_id`). To enable this behavior +The **mssql** destination **does not** create UNIQUE indexes by default on columns with the `unique` hint (i.e., `_dlt_id`). To enable this behavior: ```toml [destination.mssql] create_indexes=true @@ -108,7 +108,7 @@ destination.mssql.credentials="mssql://loader:@loader.database.windows ``` ### dbt support -No dbt support yet +No dbt support yet. ## Additional Setup guides @@ -120,4 +120,4 @@ No dbt support yet - [Load data from GitHub to Microsoft SQL Server in python with dlt](https://dlthub.com/docs/pipelines/github/load-data-with-python-from-github-to-mssql) - [Load data from Notion to Microsoft SQL Server in python with dlt](https://dlthub.com/docs/pipelines/notion/load-data-with-python-from-notion-to-mssql) - [Load data from HubSpot to Microsoft SQL Server in python with dlt](https://dlthub.com/docs/pipelines/hubspot/load-data-with-python-from-hubspot-to-mssql) - \ No newline at end of file + diff --git a/docs/website/docs/dlt-ecosystem/destinations/postgres.md b/docs/website/docs/dlt-ecosystem/destinations/postgres.md index cd0ea08929..10b935c083 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/postgres.md +++ b/docs/website/docs/dlt-ecosystem/destinations/postgres.md @@ -7,47 +7,47 @@ keywords: [postgres, destination, data warehouse] # Postgres ## Install dlt with PostgreSQL -**To install the DLT library with PostgreSQL dependencies:** +**To install the DLT library with PostgreSQL dependencies, run:** ``` pip install dlt[postgres] ``` ## Setup Guide -**1. Initialize a project with a pipeline that loads to Postgres by running** +**1. Initialize a project with a pipeline that loads to Postgres by running:** ``` dlt init chess postgres ``` -**2. Install the necessary dependencies for Postgres by running** +**2. Install the necessary dependencies for Postgres by running:** ``` pip install -r requirements.txt ``` -This will install dlt with **postgres** extra which contains `psycopg2` client. +This will install dlt with the `postgres` extra, which contains the `psycopg2` client. -**3. Create a new database after setting up a Postgres instance and `psql` / query editor by running** +**3. After setting up a Postgres instance and `psql` / query editor, create a new database by running:** ``` CREATE DATABASE dlt_data; ``` -Add `dlt_data` database to `.dlt/secrets.toml`. +Add the `dlt_data` database to `.dlt/secrets.toml`. -**4. Create a new user by running** +**4. Create a new user by running:** ``` CREATE USER loader WITH PASSWORD ''; ``` -Add `loader` user and `` password to `.dlt/secrets.toml`. +Add the `loader` user and `` password to `.dlt/secrets.toml`. -**5. Give the `loader` user owner permissions by running** +**5. Give the `loader` user owner permissions by running:** ``` ALTER DATABASE dlt_data OWNER TO loader; ``` -It is possible to set more restrictive permissions (e.g. give user access to a specific schema). 
+You can set more restrictive permissions (e.g., give user access to a specific schema). **6. Enter your credentials into `.dlt/secrets.toml`.** -It should now look like +It should now look like this: ```toml [destination.postgres.credentials] @@ -59,33 +59,33 @@ port = 5432 connect_timeout = 15 ``` -You can also pass a database connection string similar to the one used by `psycopg2` library or [SQLAlchemy](https://docs.sqlalchemy.org/en/20/core/engines.html#postgresql). Credentials above will look like this: +You can also pass a database connection string similar to the one used by the `psycopg2` library or [SQLAlchemy](https://docs.sqlalchemy.org/en/20/core/engines.html#postgresql). The credentials above will look like this: ```toml # keep it at the top of your toml file! before any section starts destination.postgres.credentials="postgresql://loader:@localhost/dlt_data?connect_timeout=15" ``` -To pass credentials directly you can use `credentials` argument passed to `dlt.pipeline` or `pipeline.run` methods. +To pass credentials directly, you can use the `credentials` argument passed to the `dlt.pipeline` or `pipeline.run` methods. ```python pipeline = dlt.pipeline(pipeline_name='chess', destination='postgres', dataset_name='chess_data', credentials="postgresql://loader:@localhost/dlt_data") ``` ## Write disposition -All write dispositions are supported +All write dispositions are supported. -If you set the [`replace` strategy](../../general-usage/full-loading.md) to `staging-optimized` the destination tables will be dropped and replaced by the staging tables. +If you set the [`replace` strategy](../../general-usage/full-loading.md) to `staging-optimized`, the destination tables will be dropped and replaced by the staging tables. ## Data loading `dlt` will load data using large INSERT VALUES statements by default. Loading is multithreaded (20 threads by default). ## Supported file formats -* [insert-values](../file-formats/insert-format.md) is used by default +* [insert-values](../file-formats/insert-format.md) is used by default. ## Supported column hints -`postgres` will create unique indexes for all columns with `unique` hints. This behavior **may be disabled** +`postgres` will create unique indexes for all columns with `unique` hints. This behavior **may be disabled**. ## Additional destination options -Postgres destination creates UNIQUE indexes by default on columns with `unique` hint (ie. `_dlt_id`). To disable this behavior +The Postgres destination creates UNIQUE indexes by default on columns with the `unique` hint (i.e., `_dlt_id`). To disable this behavior: ```toml [destination.postgres] create_indexes=false @@ -95,16 +95,16 @@ create_indexes=false This destination [integrates with dbt](../transformations/dbt/dbt.md) via dbt-postgres. ### Syncing of `dlt` state -This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination) +This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination). 
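+
+To tie the steps above together, here is a minimal sketch that loads a small Python list into the `dlt_data` database configured in `.dlt/secrets.toml` (the `players` table name and the sample rows are purely illustrative):
+
+```python
+import dlt
+
+pipeline = dlt.pipeline(
+    pipeline_name="chess",
+    destination="postgres",
+    dataset_name="chess_data",
+)
+
+# credentials are read from .dlt/secrets.toml as shown above
+load_info = pipeline.run(
+    [{"name": "magnus", "rating": 2850}, {"name": "hikaru", "rating": 2789}],
+    table_name="players",
+)
+print(load_info)
+```
+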
-## Additional Setup guides - -- [Load data from HubSpot to PostgreSQL in python with dlt](https://dlthub.com/docs/pipelines/hubspot/load-data-with-python-from-hubspot-to-postgres) -- [Load data from GitHub to PostgreSQL in python with dlt](https://dlthub.com/docs/pipelines/github/load-data-with-python-from-github-to-postgres) -- [Load data from Chess.com to PostgreSQL in python with dlt](https://dlthub.com/docs/pipelines/chess/load-data-with-python-from-chess-to-postgres) -- [Load data from Notion to PostgreSQL in python with dlt](https://dlthub.com/docs/pipelines/notion/load-data-with-python-from-notion-to-postgres) -- [Load data from Google Analytics to PostgreSQL in python with dlt](https://dlthub.com/docs/pipelines/google_analytics/load-data-with-python-from-google_analytics-to-postgres) -- [Load data from Google Sheets to PostgreSQL in python with dlt](https://dlthub.com/docs/pipelines/google_sheets/load-data-with-python-from-google_sheets-to-postgres) -- [Load data from Stripe to PostgreSQL in python with dlt](https://dlthub.com/docs/pipelines/stripe_analytics/load-data-with-python-from-stripe_analytics-to-postgres) - \ No newline at end of file +## Additional Setup Guides + +- [Load data from HubSpot to PostgreSQL in Python with dlt](https://dlthub.com/docs/pipelines/hubspot/load-data-with-python-from-hubspot-to-postgres) +- [Load data from GitHub to PostgreSQL in Python with dlt](https://dlthub.com/docs/pipelines/github/load-data-with-python-from-github-to-postgres) +- [Load data from Chess.com to PostgreSQL in Python with dlt](https://dlthub.com/docs/pipelines/chess/load-data-with-python-from-chess-to-postgres) +- [Load data from Notion to PostgreSQL in Python with dlt](https://dlthub.com/docs/pipelines/notion/load-data-with-python-from-notion-to-postgres) +- [Load data from Google Analytics to PostgreSQL in Python with dlt](https://dlthub.com/docs/pipelines/google_analytics/load-data-with-python-from-google_analytics-to-postgres) +- [Load data from Google Sheets to PostgreSQL in Python with dlt](https://dlthub.com/docs/pipelines/google_sheets/load-data-with-python-from-google_sheets-to-postgres) +- [Load data from Stripe to PostgreSQL in Python with dlt](https://dlthub.com/docs/pipelines/stripe_analytics/load-data-with-python-from-stripe_analytics-to-postgres) + diff --git a/docs/website/docs/dlt-ecosystem/destinations/qdrant.md b/docs/website/docs/dlt-ecosystem/destinations/qdrant.md index 04b5cac19b..ff37252852 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/qdrant.md +++ b/docs/website/docs/dlt-ecosystem/destinations/qdrant.md @@ -92,7 +92,7 @@ qdrant_adapter(data, embed) It accepts the following arguments: -- `data`: a dlt resource object or a Python data structure (e.g. a list of dictionaries). +- `data`: a dlt resource object or a Python data structure (e.g., a list of dictionaries). - `embed`: a name of the field or a list of names to generate embeddings for. Returns: [DLT resource](../../general-usage/resource.md) object that you can pass to the `pipeline.run()`. @@ -135,7 +135,7 @@ info = pipeline.run( ### Merge The [merge](../../general-usage/incremental-loading.md) write disposition merges the data from the resource with the data at the destination. -For `merge` disposition, you would need to specify a `primary_key` for the resource: +For the `merge` disposition, you need to specify a `primary_key` for the resource: ```python info = pipeline.run( @@ -166,7 +166,7 @@ Qdrant uses collections to categorize and identify data. 
To avoid potential nami For example, if you have a dataset named `movies_dataset` and a table named `actors`, the Qdrant collection name would be `movies_dataset_actors` (the default separator is an underscore). -However, if you prefer to have class names without the dataset prefix, skip `dataset_name` argument. +However, if you prefer to have class names without the dataset prefix, skip the `dataset_name` argument. For example: @@ -185,7 +185,7 @@ pipeline = dlt.pipeline( - `upload_batch_size`: (int) The batch size for data uploads. The default value is 64. -- `upload_parallelism`: (int) The maximal number of concurrent threads to run data uploads. The default value is 1. +- `upload_parallelism`: (int) The maximum number of concurrent threads to run data uploads. The default value is 1. - `upload_max_retries`: (int) The number of retries to upload data in case of failure. The default value is 3. @@ -222,4 +222,4 @@ You can find the setup instructions to run Qdrant [here](https://qdrant.tech/doc Qdrant destination supports syncing of the `dlt` state. - \ No newline at end of file + diff --git a/docs/website/docs/dlt-ecosystem/destinations/redshift.md b/docs/website/docs/dlt-ecosystem/destinations/redshift.md index cb220a31fc..bc03dbbbeb 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/redshift.md +++ b/docs/website/docs/dlt-ecosystem/destinations/redshift.md @@ -29,7 +29,7 @@ pip install -r requirements.txt or with `pip install dlt[redshift]`, which installs the `dlt` library and the necessary dependencies for working with Amazon Redshift as a destination. ### 2. Setup Redshift cluster -To load data into Redshift, it is necessary to create a Redshift cluster and enable access to your IP address through the VPC inbound rules associated with the cluster. While we recommend asking our GPT-4 assistant for details, we have provided a general outline of the process below: +To load data into Redshift, you need to create a Redshift cluster and enable access to your IP address through the VPC inbound rules associated with the cluster. While we recommend asking our GPT-4 assistant for details, we have provided a general outline of the process below: 1. You can use an existing cluster or create a new one. 2. To create a new cluster, navigate to the 'Provisioned Cluster Dashboard' and click 'Create Cluster'. @@ -59,9 +59,9 @@ To load data into Redshift, it is necessary to create a Redshift cluster and ena redshift-cluster-1.cv3cmsy7t4il.us-east-1.redshift.amazonaws.com ``` -3. The `connect_timeout` is the number of minutes the pipeline will wait before the timeout. +3. The `connect_timeout` is the number of minutes the pipeline will wait before timing out. -You can also pass a database connection string similar to the one used by `psycopg2` library or [SQLAlchemy](https://docs.sqlalchemy.org/en/20/core/engines.html#postgresql). Credentials above will look like this: +You can also pass a database connection string similar to the one used by the `psycopg2` library or [SQLAlchemy](https://docs.sqlalchemy.org/en/20/core/engines.html#postgresql). The credentials above will look like this: ```toml # keep it at the top of your toml file! before any section starts destination.redshift.credentials="redshift://loader:@localhost/dlt_data?connect_timeout=15" @@ -82,25 +82,24 @@ When staging is enabled: > ❗ **Redshift cannot load `TIME` columns from `json` or `parquet` files**. `dlt` will fail such jobs permanently. Switch to direct `insert_values` to load time columns. 
-> ❗ **Redshift cannot detect compression type from `json` files**. `dlt` assumes that `jsonl` files are gzip compressed which is the default. - -> ❗ **Redshift loads `complex` types as strings into SUPER with `parquet`**. Use `jsonl` format to store JSON in SUPER natively or transform your SUPER columns with `PARSE_JSON``. +> ❗ **Redshift cannot detect compression type from `json` files**. `dlt` assumes that `jsonl` files are gzip compressed, which is the default. +> ❗ **Redshift loads `complex` types as strings into SUPER with `parquet`**. Use `jsonl` format to store JSON in SUPER natively or transform your SUPER columns with `PARSE_JSON`. ## Supported column hints Amazon Redshift supports the following column hints: -- `cluster` - hint is a Redshift term for table distribution. Applying it to a column makes it the "DISTKEY," affecting query and join performance. Check the following [documentation](https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-best-dist-key.html) for more info. -- `sort` - creates SORTKEY to order rows on disk physically. It is used to improve a query and join speed in Redshift, please read the [sort key docs](https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html) to learn more. +- `cluster` - This hint is a Redshift term for table distribution. Applying it to a column makes it the "DISTKEY," affecting query and join performance. Check the following [documentation](https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-best-dist-key.html) for more info. +- `sort` - This hint creates a SORTKEY to order rows on disk physically. It is used to improve query and join speed in Redshift. Please read the [sort key docs](https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html) to learn more. ## Staging support -Redshift supports s3 as a file staging destination. dlt will upload files in the parquet format to s3 and ask redshift to copy their data directly into the db. Please refere to the [S3 documentation](./filesystem.md#aws-s3) to learn how to set up your s3 bucket with the bucket_url and credentials. The `dlt` Redshift loader will use the aws credentials provided for s3 to access the s3 bucket if not specified otherwise (see config options below). Alternatively to parquet files, you can also specify jsonl as the staging file format. For this set the `loader_file_format` argument of the `run` command of the pipeline to `jsonl`. +Redshift supports s3 as a file staging destination. dlt will upload files in the parquet format to s3 and ask Redshift to copy their data directly into the db. Please refer to the [S3 documentation](./filesystem.md#aws-s3) to learn how to set up your s3 bucket with the bucket_url and credentials. The `dlt` Redshift loader will use the AWS credentials provided for s3 to access the s3 bucket if not specified otherwise (see config options below). Alternatively to parquet files, you can also specify jsonl as the staging file format. For this, set the `loader_file_format` argument of the `run` command of the pipeline to `jsonl`. 
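+A rough sketch of this setup is shown below. It assumes the S3 bucket and its credentials are configured for the `filesystem` destination as described in the linked S3 documentation; the dataset and table names are placeholders:
+
+```python
+import dlt
+
+# Redshift as the destination with S3 (the filesystem destination) as the staging area.
+pipeline = dlt.pipeline(
+    pipeline_name="chess_pipeline",
+    destination="redshift",
+    staging="filesystem",  # files are first uploaded to the configured S3 bucket
+    dataset_name="player_data",
+)
+
+load_info = pipeline.run(
+    [{"id": 1, "name": "player_one"}],
+    table_name="players",
+    loader_file_format="jsonl",  # omit this to keep the default parquet staging format
+)
+print(load_info)
+```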
-### Authentication iam Role +### Authentication IAM Role -If you would like to load from s3 without forwarding the aws staging credentials but authorize with an iam role connected to Redshift, follow the [Redshift documentation](https://docs.aws.amazon.com/redshift/latest/mgmt/authorizing-redshift-service.html) to create a role with access to s3 linked to your redshift cluster and change your destination settings to use the iam role: +If you would like to load from s3 without forwarding the AWS staging credentials but authorize with an IAM role connected to Redshift, follow the [Redshift documentation](https://docs.aws.amazon.com/redshift/latest/mgmt/authorizing-redshift-service.html) to create a role with access to s3 linked to your Redshift cluster and change your destination settings to use the IAM role: ```toml [destination] @@ -143,4 +142,4 @@ Supported loader file formats for Redshift are `sql` and `insert_values` (defaul - [Load data from GitHub to Redshift in python with dlt](https://dlthub.com/docs/pipelines/github/load-data-with-python-from-github-to-redshift) - [Load data from Stripe to Redshift in python with dlt](https://dlthub.com/docs/pipelines/stripe_analytics/load-data-with-python-from-stripe_analytics-to-redshift) - [Load data from Google Sheets to Redshift in python with dlt](https://dlthub.com/docs/pipelines/google_sheets/load-data-with-python-from-google_sheets-to-redshift) - \ No newline at end of file + diff --git a/docs/website/docs/dlt-ecosystem/destinations/snowflake.md b/docs/website/docs/dlt-ecosystem/destinations/snowflake.md index 34efb0df39..a6058a255e 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/snowflake.md +++ b/docs/website/docs/dlt-ecosystem/destinations/snowflake.md @@ -7,30 +7,30 @@ keywords: [Snowflake, destination, data warehouse] # Snowflake ## Install dlt with Snowflake -**To install the DLT library with Snowflake dependencies:** +**To install the DLT library with Snowflake dependencies, run:** ``` pip install dlt[snowflake] ``` ## Setup Guide -**1. Initialize a project with a pipeline that loads to snowflake by running** +**1. Initialize a project with a pipeline that loads to Snowflake by running:** ``` dlt init chess snowflake ``` -**2. Install the necessary dependencies for snowflake by running** +**2. Install the necessary dependencies for Snowflake by running:** ``` pip install -r requirements.txt ``` -This will install dlt with **snowflake** extra which contains Snowflake Python dbapi client. +This will install `dlt` with the `snowflake` extra, which contains the Snowflake Python dbapi client. -**3. Create a new database, user and give dlt access** +**3. Create a new database, user, and give dlt access.** Read the next chapter below. **4. Enter your credentials into `.dlt/secrets.toml`.** -It should now look like +It should now look like this: ```toml [destination.snowflake.credentials] database = "dlt_data" @@ -40,14 +40,13 @@ host = "kgiotue-wn98412" warehouse = "COMPUTE_WH" role = "DLT_LOADER_ROLE" ``` -In case of snowflake **host** is your [Account Identifier](https://docs.snowflake.com/en/user-guide/admin-account-identifier). You can get in **Admin**/**Accounts** by copying account url: -https://kgiotue-wn98412.snowflakecomputing.com and extracting the host name (**kgiotue-wn98412**) +In the case of Snowflake, the **host** is your [Account Identifier](https://docs.snowflake.com/en/user-guide/admin-account-identifier). 
You can get it in **Admin**/**Accounts** by copying the account URL: https://kgiotue-wn98412.snowflakecomputing.com and extracting the host name (**kgiotue-wn98412**). -The **warehouse** and **role** are optional if you assign defaults to your user. In the example below we do not do that, so we set them explicitly. +The **warehouse** and **role** are optional if you assign defaults to your user. In the example below, we do not do that, so we set them explicitly. ### Setup the database user and permissions -Instructions below assume that you use the default account setup that you get after creating Snowflake account. You should have default warehouse named **COMPUTE_WH** and snowflake account. Below we create a new database, user and assign permissions. The permissions are very generous. A more experienced user can easily reduce `dlt` permissions to just one schema in the database. +The instructions below assume that you use the default account setup that you get after creating a Snowflake account. You should have a default warehouse named **COMPUTE_WH** and a Snowflake account. Below, we create a new database, user, and assign permissions. The permissions are very generous. A more experienced user can easily reduce `dlt` permissions to just one schema in the database. ```sql --create database with standard settings CREATE DATABASE dlt_data; @@ -67,17 +66,17 @@ GRANT ALL PRIVILEGES ON FUTURE SCHEMAS IN DATABASE dlt_data TO DLT_LOADER_ROLE; GRANT ALL PRIVILEGES ON FUTURE TABLES IN DATABASE dlt_data TO DLT_LOADER_ROLE; ``` -Now you can use the user named `LOADER` to access database `DLT_DATA` and log in with specified password. +Now you can use the user named `LOADER` to access the database `DLT_DATA` and log in with the specified password. You can also decrease the suspend time for your warehouse to 1 minute (**Admin**/**Warehouses** in Snowflake UI) ### Authentication types -Snowflake destination accepts three authentication types +Snowflake destination accepts three authentication types: - password authentication - [key pair authentication](https://docs.snowflake.com/en/user-guide/key-pair-auth) - external authentication -The **password authentication** is not any different from other databases like Postgres or Redshift. `dlt` follows the same syntax as [SQLAlchemy dialect](https://docs.snowflake.com/en/developer-guide/python-connector/sqlalchemy#required-parameters). +The **password authentication** is not any different from other databases like Postgres or Redshift. `dlt` follows the same syntax as the [SQLAlchemy dialect](https://docs.snowflake.com/en/developer-guide/python-connector/sqlalchemy#required-parameters). You can also pass credentials as a database connection string. For example: ```toml @@ -85,7 +84,7 @@ You can also pass credentials as a database connection string. For example: destination.snowflake.credentials="snowflake://loader:@kgiotue-wn98412/dlt_data?warehouse=COMPUTE_WH&role=DLT_LOADER_ROLE" ``` -In **key pair authentication** you replace password with a private key string that should be in Base64-encoded DER format ([DBT also recommends](https://docs.getdbt.com/docs/core/connect-data-platform/snowflake-setup#key-pair-authentication) base64-encoded private keys for Snowflake connections). The private key may also be encrypted. In that case you must provide a passphrase alongside with the private key. 
+In **key pair authentication**, you replace the password with a private key string that should be in Base64-encoded DER format ([DBT also recommends](https://docs.getdbt.com/docs/core/connect-data-platform/snowflake-setup#key-pair-authentication) base64-encoded private keys for Snowflake connections). The private key may also be encrypted. In that case, you must provide a passphrase alongside the private key. ```toml [destination.snowflake.credentials] database = "dlt_data" @@ -96,13 +95,13 @@ private_key_passphrase="passphrase" ``` > You can easily get the base64-encoded value of your private key by running `base64 -i .pem` in your terminal -If you pass a passphrase in the connection string, please url encode it. +If you pass a passphrase in the connection string, please URL encode it. ```toml # keep it at the top of your toml file! before any section starts destination.snowflake.credentials="snowflake://loader:@kgiotue-wn98412/dlt_data?private_key=&private_key_passphrase=" ``` -In **external authentication** you can use oauth provider like Okta or external browser to authenticate. You pass your authenticator and refresh token as below: +In **external authentication**, you can use an OAuth provider like Okta or an external browser to authenticate. You pass your authenticator and refresh token as below: ```toml [destination.snowflake.credentials] database = "dlt_data" @@ -110,17 +109,17 @@ username = "loader" authenticator="..." token="..." ``` -or in connection string as query parameters. +or in the connection string as query parameters. Refer to Snowflake [OAuth](https://docs.snowflake.com/en/user-guide/oauth-intro) for more details. ## Write disposition -All write dispositions are supported +All write dispositions are supported. -If you set the [`replace` strategy](../../general-usage/full-loading.md) to `staging-optimized` the destination tables will be dropped and +If you set the [`replace` strategy](../../general-usage/full-loading.md) to `staging-optimized`, the destination tables will be dropped and recreated with a [clone command](https://docs.snowflake.com/en/sql-reference/sql/create-clone) from the staging tables. ## Data loading -The data is loaded using internal Snowflake stage. We use `PUT` command and per-table built-in stages by default. Stage files are immediately removed (if not specified otherwise). +The data is loaded using an internal Snowflake stage. We use the `PUT` command and per-table built-in stages by default. Stage files are immediately removed (if not specified otherwise). ## Supported file formats * [insert-values](../file-formats/insert-format.md) is used by default @@ -131,47 +130,47 @@ When staging is enabled: * [jsonl](../file-formats/jsonl.md) is used by default * [parquet](../file-formats/parquet.md) is supported -> ❗ When loading from `parquet`, Snowflake will store `complex` types (JSON) in `VARIANT` as string. Use `jsonl` format instead or use `PARSE_JSON` to update the `VARIANT`` field after loading. +> ❗ When loading from `parquet`, Snowflake will store `complex` types (JSON) in `VARIANT` as a string. Use the `jsonl` format instead or use `PARSE_JSON` to update the `VARIANT` field after loading. ## Supported column hints Snowflake supports the following [column hints](https://dlthub.com/docs/general-usage/schema#tables-and-columns): -* `cluster` - creates a cluster column(s). Many column per table are supported and only when a new table is created. +* `cluster` - creates a cluster column(s). 
Many columns per table are supported and only when a new table is created.

### Table and column identifiers
-Snowflake makes all unquoted identifiers uppercase and then resolves them case-insensitive in SQL statements. `dlt` (effectively) does not quote identifies in DDL preserving default behavior.
+Snowflake makes all unquoted identifiers uppercase and then resolves them case-insensitively in SQL statements. `dlt` (effectively) does not quote identifiers in DDL, preserving default behavior.

-Names of tables and columns in [schemas](../../general-usage/schema.md) are kept in lower case like for all other destinations. This is the pattern we observed in other tools ie. `dbt`. In case of `dlt` it is however trivial to define your own uppercase [naming convention](../../general-usage/schema.md#naming-convention)
+Names of tables and columns in [schemas](../../general-usage/schema.md) are kept in lower case like for all other destinations. This is the pattern we observed in other tools, e.g., `dbt`. In the case of `dlt`, it is, however, trivial to define your own uppercase [naming convention](../../general-usage/schema.md#naming-convention).

## Staging support

-Snowflake supports s3 and gcs as a file staging destinations. dlt will upload files in the parquet format to the bucket provider and will ask snowflake to copy their data directly into the db.
+Snowflake supports S3 and GCS as file staging destinations. dlt will upload files in the parquet format to the bucket provider and will ask Snowflake to copy their data directly into the db.

-Alternavitely to parquet files, you can also specify jsonl as the staging file format. For this set the `loader_file_format` argument of the `run` command of the pipeline to `jsonl`.
+Alternatively to parquet files, you can also specify jsonl as the staging file format. For this, set the `loader_file_format` argument of the `run` command of the pipeline to `jsonl`.

### Snowflake and Amazon S3

-Please refer to the [S3 documentation](./filesystem.md#aws-s3) to learn how to set up your bucket with the bucket_url and credentials. For s3 The dlt Redshift loader will use the aws credentials provided for s3 to access the s3 bucket if not specified otherwise (see config options below). Alternatively you can create a stage for your S3 Bucket by following the instructions provided in the [Snowflake S3 documentation](https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration).
+Please refer to the [S3 documentation](./filesystem.md#aws-s3) to learn how to set up your bucket with the bucket_url and credentials. For S3, the dlt Snowflake loader will use the AWS credentials provided for S3 to access the S3 bucket if not specified otherwise (see config options below). Alternatively, you can create a stage for your S3 Bucket by following the instructions provided in the [Snowflake S3 documentation](https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration).

The basic steps are as follows:
* Create a storage integration linked to GCS and the right bucket
-* Grant access to this storage integration to the snowflake role you are using to load the data into snowflake.
+* Grant access to this storage integration to the Snowflake role you are using to load the data into Snowflake.
* Create a stage from this storage integration in the PUBLIC namespace, or the namespace of the schema of your data.
-* Also grant access to this stage for the role you are using to load data into snowflake.
+* Also grant access to this stage for the role you are using to load data into Snowflake. * Provide the name of your stage (including the namespace) to dlt like so: -To prevent dlt from forwarding the s3 bucket credentials on every command, and set your s3 stage, change these settings: +To prevent dlt from forwarding the S3 bucket credentials on every command, and set your S3 stage, change these settings: ```toml [destination] stage_name=PUBLIC.my_s3_stage ``` -To run Snowflake with s3 as staging destination: +To run Snowflake with S3 as the staging destination: ```python # Create a dlt pipeline that will load -# chess player data to the snowflake destination -# via staging on s3 +# chess player data to the Snowflake destination +# via staging on S3 pipeline = dlt.pipeline( pipeline_name='chess_pipeline', destination='snowflake', @@ -182,12 +181,12 @@ pipeline = dlt.pipeline( ### Snowflake and Google Cloud Storage -Please refer to the [Google Storage filesystem documentation](./filesystem.md#google-storage) to learn how to set up your bucket with the bucket_url and credentials. For gcs you can define a stage in Snowflake and provide the stage identifier in the configuration (see config options below.) Please consult the snowflake Documentation on [how to create a stage for your GCS Bucket](https://docs.snowflake.com/en/user-guide/data-load-gcs-config). The basic steps are as follows: +Please refer to the [Google Storage filesystem documentation](./filesystem.md#google-storage) to learn how to set up your bucket with the bucket_url and credentials. For GCS, you can define a stage in Snowflake and provide the stage identifier in the configuration (see config options below.) Please consult the Snowflake Documentation on [how to create a stage for your GCS Bucket](https://docs.snowflake.com/en/user-guide/data-load-gcs-config). The basic steps are as follows: * Create a storage integration linked to GCS and the right bucket -* Grant access to this storage integration to the snowflake role you are using to load the data into snowflake. +* Grant access to this storage integration to the Snowflake role you are using to load the data into Snowflake. * Create a stage from this storage integration in the PUBLIC namespace, or the namespace of the schema of your data. -* Also grant access to this stage for the role you are using to load data into snowflake. +* Also grant access to this stage for the role you are using to load data into Snowflake. * Provide the name of your stage (including the namespace) to dlt like so: ```toml @@ -195,12 +194,12 @@ Please refer to the [Google Storage filesystem documentation](./filesystem.md#go stage_name=PUBLIC.my_gcs_stage ``` -To run Snowflake with gcs as staging destination: +To run Snowflake with GCS as the staging destination: ```python # Create a dlt pipeline that will load -# chess player data to the snowflake destination -# via staging on gcs +# chess player data to the Snowflake destination +# via staging on GCS pipeline = dlt.pipeline( pipeline_name='chess_pipeline', destination='snowflake', @@ -211,14 +210,14 @@ pipeline = dlt.pipeline( ### Snowflake and Azure Blob Storage -Please refer to the [Azure Blob Storage filesystem documentation](./filesystem.md#azure-blob-storage) to learn how to set up your bucket with the bucket_url and credentials. For azure the Snowflake loader will use -the filesystem credentials for your azure blob storage container if not specified otherwise (see config options below). 
Alternatively you can define an external stage in Snowflake and provide the stage identifier. -Please consult the snowflake Documentation on [how to create a stage for your Azure Blob Storage Container](https://docs.snowflake.com/en/user-guide/data-load-azure). The basic steps are as follows: +Please refer to the [Azure Blob Storage filesystem documentation](./filesystem.md#azure-blob-storage) to learn how to set up your bucket with the bucket_url and credentials. For Azure, the Snowflake loader will use +the filesystem credentials for your Azure Blob Storage container if not specified otherwise (see config options below). Alternatively, you can define an external stage in Snowflake and provide the stage identifier. +Please consult the Snowflake Documentation on [how to create a stage for your Azure Blob Storage Container](https://docs.snowflake.com/en/user-guide/data-load-azure). The basic steps are as follows: * Create a storage integration linked to Azure Blob Storage and the right container -* Grant access to this storage integration to the snowflake role you are using to load the data into snowflake. +* Grant access to this storage integration to the Snowflake role you are using to load the data into Snowflake. * Create a stage from this storage integration in the PUBLIC namespace, or the namespace of the schema of your data. -* Also grant access to this stage for the role you are using to load data into snowflake. +* Also grant access to this stage for the role you are using to load data into Snowflake. * Provide the name of your stage (including the namespace) to dlt like so: ```toml @@ -226,12 +225,12 @@ Please consult the snowflake Documentation on [how to create a stage for your Az stage_name=PUBLIC.my_azure_stage ``` -To run Snowflake with azure as staging destination: +To run Snowflake with Azure as the staging destination: ```python # Create a dlt pipeline that will load -# chess player data to the snowflake destination -# via staging on azure +# chess player data to the Snowflake destination +# via staging on Azure pipeline = dlt.pipeline( pipeline_name='chess_pipeline', destination='snowflake', @@ -241,7 +240,7 @@ pipeline = dlt.pipeline( ``` ## Additional destination options -You can define your own stage to PUT files and disable removing of the staged files after loading. +You can define your own stage to PUT files and disable the removal of the staged files after loading. ```toml [destination.snowflake] # Use an existing named stage instead of the default. Default uses the implicit table stage per table @@ -251,7 +250,7 @@ keep_staged_files=true ``` ### dbt support -This destination [integrates with dbt](../transformations/dbt/dbt.md) via [dbt-snowflake](https://github.com/dbt-labs/dbt-snowflake). Both password and key pair authentication is supported and shared with dbt runners. +This destination [integrates with dbt](../transformations/dbt/dbt.md) via [dbt-snowflake](https://github.com/dbt-labs/dbt-snowflake). Both password and key pair authentication are supported and shared with dbt runners. 
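+As a rough sketch of how the dbt runner can be driven from Python (the dbt package location is a placeholder, and the `dlt.dbt` helpers follow the dbt transformations page linked above; exact signatures may differ between `dlt` versions):
+
+```python
+import dlt
+
+# The pipeline's Snowflake credentials are shared with the dbt runner.
+pipeline = dlt.pipeline(
+    pipeline_name="chess_pipeline",
+    destination="snowflake",
+    dataset_name="chess_players_games_data",
+)
+
+# Create a virtual environment with dbt-snowflake and run all models of a local dbt package.
+venv = dlt.dbt.get_venv(pipeline)
+dbt = dlt.dbt.package(pipeline, "path/to/your/dbt_package", venv=venv)
+models = dbt.run_all()
+for model in models:
+    print(model)  # each entry describes one materialized model
+```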
### Syncing of `dlt` state This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination) @@ -266,4 +265,4 @@ This destination fully supports [dlt state sync](../../general-usage/state#synci - [Load data from HubSpot to Snowflake in python with dlt](https://dlthub.com/docs/pipelines/hubspot/load-data-with-python-from-hubspot-to-snowflake) - [Load data from Chess.com to Snowflake in python with dlt](https://dlthub.com/docs/pipelines/chess/load-data-with-python-from-chess-to-snowflake) - [Load data from Google Sheets to Snowflake in python with dlt](https://dlthub.com/docs/pipelines/google_sheets/load-data-with-python-from-google_sheets-to-snowflake) - \ No newline at end of file + diff --git a/docs/website/docs/dlt-ecosystem/destinations/synapse.md b/docs/website/docs/dlt-ecosystem/destinations/synapse.md index 6ace1ac5a8..bac184fd41 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/synapse.md +++ b/docs/website/docs/dlt-ecosystem/destinations/synapse.md @@ -18,13 +18,13 @@ pip install dlt[synapse] * **Microsoft ODBC Driver for SQL Server** - _Microsoft ODBC Driver for SQL Server_ must be installed to use this destination. + The _Microsoft ODBC Driver for SQL Server_ must be installed to use this destination. This can't be included with `dlt`'s python dependencies, so you must install it separately on your system. You can find the official installation instructions [here](https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server?view=sql-server-ver16). Supported driver versions: * `ODBC Driver 18 for SQL Server` - > 💡 Older driver versions don't properly work, because they don't support the `LongAsMax` keyword that got [introduced](https://learn.microsoft.com/en-us/sql/connect/odbc/windows/features-of-the-microsoft-odbc-driver-for-sql-server-on-windows?view=sql-server-ver15#microsoft-odbc-driver-180-for-sql-server-on-windows) in `ODBC Driver 18 for SQL Server`. Synapse does not support the legacy ["long data types"](https://learn.microsoft.com/en-us/sql/t-sql/data-types/ntext-text-and-image-transact-sql), and requires "max data types" instead. `dlt` uses the `LongAsMax` keyword to automatically do the conversion. + > 💡 Older driver versions don't work properly because they don't support the `LongAsMax` keyword that was [introduced](https://learn.microsoft.com/en-us/sql/connect/odbc/windows/features-of-the-microsoft-odbc-driver-for-sql-server-on-windows?view=sql-server-ver15#microsoft-odbc-driver-180-for-sql-server-on-windows) in `ODBC Driver 18 for SQL Server`. Synapse does not support the legacy ["long data types"](https://learn.microsoft.com/en-us/sql/t-sql/data-types/ntext-text-and-image-transact-sql), and requires "max data types" instead. `dlt` uses the `LongAsMax` keyword to automatically do the conversion. * **Azure Synapse Workspace and dedicated SQL pool** You need an Azure Synapse workspace with a dedicated SQL pool to load data into. If you don't have one yet, you can use this [quickstart](https://learn.microsoft.com/en-us/azure/synapse-analytics/quickstart-create-sql-pool-studio). @@ -67,7 +67,7 @@ GRANT ADMINISTER DATABASE BULK OPERATIONS TO loader; -- only required when loadi Optionally, you can create a `WORKLOAD GROUP` and add the `loader` user as a member to manage [workload isolation](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-workload-isolation). 
See the [instructions](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/data-loading-best-practices#create-a-loading-user) on setting up a loader user for an example of how to do this. -**3. Enter your credentials into `.dlt/secrets.toml`.** +**4. Enter your credentials into `.dlt/secrets.toml`.** Example, replace with your database connection info: ```toml @@ -97,7 +97,7 @@ pipeline = dlt.pipeline( ``` ## Write disposition -All write dispositions are supported +All write dispositions are supported. If you set the [`replace` strategy](../../general-usage/full-loading.md) to `staging-optimized`, the destination tables will be dropped and replaced by the staging tables with an `ALTER SCHEMA ... TRANSFER` command. Please note that this operation is **not** atomic—it involves multiple DDL commands and Synapse does not support DDL transactions. @@ -134,12 +134,11 @@ Possible values: > ❗ Important: >* **Set `default_table_index_type` to `"clustered_columnstore_index"` if you want to change the default** (see [additional destination options](#additional-destination-options)). >* **CLUSTERED COLUMNSTORE INDEX tables do not support the `varchar(max)`, `nvarchar(max)`, and `varbinary(max)` data types.** If you don't specify the `precision` for columns that map to any of these types, `dlt` will use the maximum lengths `varchar(4000)`, `nvarchar(4000)`, and `varbinary(8000)`. ->* **While Synapse creates CLUSTERED COLUMNSTORE INDEXES by default, `dlt` creates HEAP tables by default.** HEAP is a more robust choice, because it supports all data types and doesn't require conversions. ->* **When using the `insert-from-staging` [`replace` strategy](../../general-usage/full-loading.md), the staging tables are always created as HEAP tables**—any configuration of the table index types is ignored. The HEAP strategy makes sense - for staging tables for reasons explained [here](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index#heap-tables). ->* **When using the `staging-optimized` [`replace` strategy](../../general-usage/full-loading.md), the staging tables are already created with the configured table index type**, because the staging table becomes the final table. ->* **`dlt` system tables are always created as HEAP tables, regardless of any configuration.** This is in line with Microsoft's recommendation that "for small lookup tables, less than 60 million rows, consider using HEAP or clustered index for faster query performance." ->* Child tables, if any, inherent the table index type of their parent table. +>* **While Synapse creates CLUSTERED COLUMNSTORE INDEXES by default, `dlt` creates HEAP tables by default.** HEAP is a more robust choice because it supports all data types and doesn't require conversions. +>* **When using the `insert-from-staging` [`replace` strategy](../../general-usage/full-loading.md), the staging tables are always created as HEAP tables**—any configuration of the table index types is ignored. The HEAP strategy makes sense for staging tables for reasons explained [here](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index#heap-tables). +>* **When using the `staging-optimized` [`replace` strategy](../../general-usage/full-loading.md), the staging tables are already created with the configured table index type**, because the staging table becomes the final table. 
+>* **`dlt` system tables are always created as HEAP tables, regardless of any configuration.** This is in line with Microsoft's recommendation that "for small lookup tables, less than 60 million rows, consider using HEAP or clustered index for faster query performance." +>* Child tables, if any, inherit the table index type of their parent table. ## Supported column hints @@ -148,7 +147,7 @@ Synapse supports the following [column hints](https://dlthub.com/docs/general-us * `primary_key` - creates a `PRIMARY KEY NONCLUSTERED NOT ENFORCED` constraint on the column * `unique` - creates a `UNIQUE NOT ENFORCED` constraint on the column -> ❗ These hints are **disabled by default**. This is because the `PRIMARY KEY` and `UNIQUE` [constraints](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-table-constraints) are tricky in Synapse: they are **not enforced** and can lead to innacurate results if the user does not ensure all column values are unique. For the column hints to take effect, the `create_indexes` configuration needs to be set to `True`, see [additional destination options](#additional-destination-options). +> ❗ These hints are **disabled by default**. This is because the `PRIMARY KEY` and `UNIQUE` [constraints](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-table-constraints) are tricky in Synapse: they are **not enforced** and can lead to inaccurate results if the user does not ensure all column values are unique. For the column hints to take effect, the `create_indexes` configuration needs to be set to `True`, see [additional destination options](#additional-destination-options). ## Staging support Synapse supports Azure Blob Storage (both standard and [ADLS Gen2](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction)) as a file staging destination. `dlt` first uploads Parquet files to the blob container, and then instructs Synapse to read the Parquet file and load its data into a Synapse table using the [COPY INTO](https://learn.microsoft.com/en-us/sql/t-sql/statements/copy-into-transact-sql) statement. @@ -190,9 +189,9 @@ destination.synapse.credentials = "synapse://loader:your_loader_password@your_sy ``` Descriptions: -- `default_table_index_type` sets the [table index type](#table-index-type) that is used if no table index type is specified on the resource. +- `default_table_index_type` sets the [table index type](#table-index-type) that is used if no table index type is specified on the resource. - `create_indexes` determines if `primary_key` and `unique` [column hints](#supported-column-hints) are applied. -- `staging_use_msi` determines if the Managed Identity of the Synapse workspace is used to authorize access to the [staging](#staging-support) Storage Account. Ensure the Managed Identity has the [Storage Blob Data Reader](https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles#storage-blob-data-reader) role (or a higher-priviliged role) assigned on the blob container if you set this option to `"true"`. +- `staging_use_msi` determines if the Managed Identity of the Synapse workspace is used to authorize access to the [staging](#staging-support) Storage Account. 
Ensure the Managed Identity has the [Storage Blob Data Reader](https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles#storage-blob-data-reader) role (or a higher-privileged role) assigned on the blob container if you set this option to `"true"`. - `port` used for the ODBC connection. - `connect_timeout` sets the timeout for the `pyodbc` connection attempt, in seconds. @@ -212,4 +211,4 @@ This destination fully supports [dlt state sync](../../general-usage/state#synci - [Load data from GitHub to Azure Synapse in python with dlt](https://dlthub.com/docs/pipelines/github/load-data-with-python-from-github-to-synapse) - [Load data from Stripe to Azure Synapse in python with dlt](https://dlthub.com/docs/pipelines/stripe_analytics/load-data-with-python-from-stripe_analytics-to-synapse) - [Load data from Chess.com to Azure Synapse in python with dlt](https://dlthub.com/docs/pipelines/chess/load-data-with-python-from-chess-to-synapse) - \ No newline at end of file + diff --git a/docs/website/docs/dlt-ecosystem/destinations/weaviate.md b/docs/website/docs/dlt-ecosystem/destinations/weaviate.md index 2ec09e9c24..6bd52acd35 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/weaviate.md +++ b/docs/website/docs/dlt-ecosystem/destinations/weaviate.md @@ -6,8 +6,8 @@ keywords: [weaviate, vector database, destination, dlt] # Weaviate -[Weaviate](https://weaviate.io/) is an open source vector database. It allows you to store data objects and perform similarity searches over them. -This destination helps you to load data into Weaviate from [dlt resources](../../general-usage/resource.md). +[Weaviate](https://weaviate.io/) is an open-source vector database. It allows you to store data objects and perform similarity searches over them. +This destination helps you load data into Weaviate from [dlt resources](../../general-usage/resource.md). ## Setup Guide @@ -30,13 +30,13 @@ X-OpenAI-Api-Key = "your-openai-api-key" In this setup guide, we are using the [Weaviate Cloud Services](https://console.weaviate.cloud/) to get a Weaviate instance and [OpenAI API](https://platform.openai.com/) for generating embeddings through the [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) module. -You can host your own weaviate instance using docker compose, kubernetes or embedded. Refer to Weaviate's [How-to: Install](https://weaviate.io/developers/weaviate/installation) or [dlt recipe we use for our tests](#run-weaviate-fully-standalone). In that case you can skip the credentials part altogether: +You can host your own Weaviate instance using Docker Compose, Kubernetes, or embedded. Refer to Weaviate's [How-to: Install](https://weaviate.io/developers/weaviate/installation) or [dlt recipe we use for our tests](#run-weaviate-fully-standalone). In that case, you can skip the credentials part altogether: ```toml [destination.weaviate.credentials.additional_headers] X-OpenAI-Api-Key = "your-openai-api-key" ``` -The `url` will default to **http://localhost:8080** and `api_key` is not defined - which are the defaults for Weaviate container. +The `url` will default to **http://localhost:8080** and `api_key` is not defined - which are the defaults for the Weaviate container. 3. Define the source of the data. For starters, let's load some data from a simple data structure: @@ -101,7 +101,7 @@ weaviate_adapter(data, vectorize, tokenization) ``` It accepts the following arguments: -- `data`: a dlt resource object or a Python data structure (e.g. 
a list of dictionaries). +- `data`: a dlt resource object or a Python data structure (e.g., a list of dictionaries). - `vectorize`: a name of the field or a list of names that should be vectorized by Weaviate. - `tokenization`: the dictionary containing the tokenization configuration for a field. The dictionary should have the following structure `{'field_name': 'method'}`. Valid methods are "word", "lowercase", "whitespace", "field". The default is "word". See [Property tokenization](https://weaviate.io/developers/weaviate/config-refs/schema#property-tokenization) in Weaviate documentation for more details. @@ -146,7 +146,7 @@ info = pipeline.run( ### Merge The [merge](../../general-usage/incremental-loading.md) write disposition merges the data from the resource with the data in the destination. -For `merge` disposition you would need to specify a `primary_key` for the resource: +For the `merge` disposition, you would need to specify a `primary_key` for the resource: ```python info = pipeline.run( @@ -159,18 +159,18 @@ info = pipeline.run( ) ``` -Internally dlt will use `primary_key` (`document_id` in the example above) to generate a unique identifier ([UUID](https://weaviate.io/developers/weaviate/manage-data/create#id)) for each object in Weaviate. If the object with the same UUID already exists in Weaviate, it will be updated with the new data. Otherwise, a new object will be created. +Internally, dlt will use `primary_key` (`document_id` in the example above) to generate a unique identifier ([UUID](https://weaviate.io/developers/weaviate/manage-data/create#id)) for each object in Weaviate. If the object with the same UUID already exists in Weaviate, it will be updated with the new data. Otherwise, a new object will be created. :::caution -If you are using the merge write disposition, you must set it from the first run of your pipeline, otherwise the data will be duplicated in the database on subsequent loads. +If you are using the `merge` write disposition, you must set it from the first run of your pipeline; otherwise, the data will be duplicated in the database on subsequent loads. ::: ### Append -This is the default disposition. It will append the data to the existing data in the destination ignoring the `primary_key` field. +This is the default disposition. It will append the data to the existing data in the destination, ignoring the `primary_key` field. ## Data loading @@ -199,7 +199,7 @@ Weaviate uses classes to categorize and identify data. To avoid potential naming For example, if you have a dataset named `movies_dataset` and a table named `actors`, the Weaviate class name would be `MoviesDataset_Actors` (the default separator is an underscore). -However, if you prefer to have class names without the dataset prefix, skip `dataset_name` argument. +However, if you prefer to have class names without the dataset prefix, skip the `dataset_name` argument. For example: @@ -241,7 +241,7 @@ The default naming convention described above will preserve the casing of the pr in Weaviate but also requires that your input data does not have clashing property names when comparing case insensitive ie. (`caseName` == `casename`). In such case Weaviate destination will fail to create classes and report a conflict. -You can configure alternative naming convention which will lowercase all properties. The clashing properties will be merged and the classes created. 
Still if you have a document where clashing properties like:
+You can configure an alternative naming convention which will lowercase all properties. The clashing properties will be merged and the classes created. Still, if you have a document with clashing properties like:
```json
{"camelCase": 1, "CamelCase": 2}
```
@@ -249,7 +249,7 @@ it will be normalized to:
```
{"camelcase": 2}
```
-so your best course of action is to clean up the data yourself before loading and use default naming convention. Nevertheless you can configure the alternative in `config.toml`:
+so your best course of action is to clean up the data yourself before loading and use the default naming convention. Nevertheless, you can configure the alternative in `config.toml`:
```toml
[schema]
naming="dlt.destinations.weaviate.impl.ci_naming"
```
@@ -291,12 +291,12 @@ Below is an example that configures the **contextionary** vectorizer. You can pu
vectorizer="text2vec-contextionary"
module_config={text2vec-contextionary = { vectorizeClassName = false, vectorizePropertyName = true}}
```
-You can find docker composer with the instructions to run [here](https://github.com/dlt-hub/dlt/tree/devel/dlt/destinations/weaviate/README.md)
+You can find a Docker Compose setup with instructions to run it [here](https://github.com/dlt-hub/dlt/tree/devel/dlt/destinations/weaviate/README.md).


### dbt support

-Currently Weaviate destination does not support dbt.
+Currently, the Weaviate destination does not support dbt.

### Syncing of `dlt` state

@@ -304,4 +304,4 @@ Weaviate destination supports syncing of the `dlt` state.



- \ No newline at end of file
+