From b6922d5880df62b8e6901c22e22cea0805463c22 Mon Sep 17 00:00:00 2001 From: Dave Date: Tue, 17 Sep 2024 10:51:20 +0200 Subject: [PATCH] first 15 pages --- docs/website/docs/_book-onboarding-call.md | 3 +- .../website/docs/build-a-pipeline-tutorial.md | 204 +++----- .../user_agent_device_data_enrichment.md | 163 +++---- .../docs/general-usage/destination-tables.md | 114 ++--- .../website/docs/general-usage/destination.md | 86 ++-- .../docs/general-usage/full-loading.md | 31 +- docs/website/docs/general-usage/glossary.md | 25 +- .../docs/general-usage/incremental-loading.md | 434 ++++++++---------- .../docs/general-usage/naming-convention.md | 60 +-- docs/website/docs/general-usage/pipeline.md | 129 +++--- docs/website/docs/general-usage/resource.md | 113 ++--- .../docs/general-usage/schema-contracts.md | 122 +++-- .../docs/general-usage/schema-evolution.md | 36 +- docs/website/docs/general-usage/schema.md | 241 ++++------ docs/website/docs/general-usage/source.md | 88 ++-- docs/website/docs/general-usage/state.md | 86 ++-- docs/website/docs/intro.md | 9 +- 17 files changed, 808 insertions(+), 1136 deletions(-) diff --git a/docs/website/docs/_book-onboarding-call.md b/docs/website/docs/_book-onboarding-call.md index 4725128bf0..561a479299 100644 --- a/docs/website/docs/_book-onboarding-call.md +++ b/docs/website/docs/_book-onboarding-call.md @@ -1 +1,2 @@ -book a call with a dltHub Solutions Engineer +Book a call with a dltHub Solutions Engineer + diff --git a/docs/website/docs/build-a-pipeline-tutorial.md b/docs/website/docs/build-a-pipeline-tutorial.md index 128f4ccc88..0541bb2eab 100644 --- a/docs/website/docs/build-a-pipeline-tutorial.md +++ b/docs/website/docs/build-a-pipeline-tutorial.md @@ -7,33 +7,31 @@ keywords: [getting started, quick start, basics] # Building data pipelines with `dlt`, from basic to advanced This in-depth overview will take you through the main areas of pipelining with `dlt`. Go to the -related pages you are instead looking for the [quickstart](./intro.md). +related pages if you are instead looking for the [quickstart](./intro.md). ## Why build pipelines with `dlt`? -`dlt` offers functionality to support the entire extract and load process. Let's look at the high level diagram: +`dlt` offers functionality to support the entire extract and load process. Let's look at the high-level diagram: ![dlt source resource pipe diagram](/img/dlt-high-level.png) +First, we have a `pipeline` function that can infer a schema from data and load the data to the destination. +We can use this pipeline with JSON data, dataframes, or other iterable objects such as generator functions. -First, we have a `pipeline` function, that can infer a schema from data and load the data to the destination. -We can use this pipeline with json data, dataframes, or other iterable objects such as generator functions. +This pipeline provides effortless loading via a schema discovery, versioning, and evolution +engine that ensures you can "just load" any data with row and column-level lineage. -This pipeline provides effortless loading via a schema discovery, versioning and evolution -engine that ensures you can "just load" any data with row and column level lineage. - -By utilizing a `dlt pipeline`, we can easily adapt and structure data as it evolves, reducing the time spent on +By utilizing a `dlt` pipeline, we can easily adapt and structure data as it evolves, reducing the time spent on maintenance and development. 
-This allows our data team to focus on leveraging the data and driving value, while ensuring +This allows our data team to focus on leveraging the data and driving value while ensuring effective governance through timely notifications of any changes. -For extract, `dlt` also provides `source` and `resource` decorators that enable defining -how extracted data should be loaded, while supporting graceful, +For extraction, `dlt` also provides `source` and `resource` decorators that enable defining +how extracted data should be loaded while supporting graceful, scalable extraction via micro-batching and parallelism. - -## The simplest pipeline: 1 liner to load data with schema evolution +## The simplest pipeline: 1-liner to load data with schema evolution ```py import dlt @@ -49,7 +47,7 @@ For example, let's consider a scenario where you want to load a list of objects named "three". With `dlt`, you can create a pipeline and run it with just a few lines of code: 1. [Create a pipeline](walkthroughs/create-a-pipeline.md) to the [destination](dlt-ecosystem/destinations). -1. Give this pipeline data and [run it](walkthroughs/run-a-pipeline.md). +2. Give this pipeline data and [run it](walkthroughs/run-a-pipeline.md). ```py import dlt @@ -76,20 +74,22 @@ The data you can pass to it should be iterable: lists of rows, generators, or `d just fine. If you want to configure how the data is loaded, you can choose between `write_disposition`s -such as `replace`, `append` and `merge` in the pipeline function. +such as `replace`, `append`, and `merge` in the pipeline function. -Here is an example where we load some data to duckdb by `upserting` or `merging` on the id column found in the data. +Here is an example where we load some data to DuckDB by `upserting` or `merging` on the id column found in the data. In this example, we also run a dbt package and then load the outcomes of the load jobs into their respective tables. -This will enable us to log when schema changes occurred and match them to the loaded data for lineage, granting us both column and row level lineage. +This will enable us to log when schema changes occurred and match them to the loaded data for lineage, granting us both column and row-level lineage. We also alert the schema change to a Slack channel where hopefully the producer and consumer are subscribed. ```py import dlt -# have data? dlt likes data +# Have data? dlt likes data data = [{'id': 1, 'name': 'John'}] +``` -# open connection +# Open connection +```py pipeline = dlt.pipeline( destination='duckdb', dataset_name='raw_data' @@ -115,7 +115,7 @@ models_info = dbt.run_all() # Load metadata for monitoring and load package lineage. # This allows for both row and column level lineage, -# as it contains schema update info linked to the loaded data +# as it contains schema update info linked to the loaded data. pipeline.run([load_info], table_name="loading_status", write_disposition='append') pipeline.run([models_info], table_name="transform_status", write_disposition='append') ``` @@ -142,7 +142,7 @@ or incremental extraction metadata, which enables `dlt` to extract and load by y Technically, two key aspects contribute to `dlt`'s effectiveness: -- Scalability through iterators, chunking, parallelization. +- Scalability through iterators, chunking, and parallelization. - The utilization of implicit extraction DAGs that allow efficient API calls for data enrichments or transformations. @@ -181,32 +181,19 @@ the correct order, accounting for any dependencies and transformations. 
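To make the implicit DAG concrete, here is a minimal sketch of a dependent pair of resources; the names and the fake enrichment step are illustrative, not part of this tutorial:

```py
import dlt

@dlt.resource
def repos():
    # Parent resource: a static list standing in for an API listing call
    yield from [{"repo": "dlt"}, {"repo": "dlt-docs"}]

@dlt.transformer(data_from=repos)
def repo_stats(repo):
    # Declared dependency: dlt extracts `repos` first and feeds each item here,
    # which is where a per-item enrichment call would normally go
    yield {**repo, "stars": len(repo["repo"]) * 100}

pipeline = dlt.pipeline(
    pipeline_name="dag_example",
    destination="duckdb",
    dataset_name="repos_data",
)
# Running the transformer pulls its parent automatically, following the implicit DAG
load_info = pipeline.run(repo_stats)
print(load_info)
```
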
When deploying to Airflow, the internal DAG is unpacked into Airflow tasks in such a way to ensure consistency and allow granular loading. -## Defining Incremental Loading +## Defining incremental loading -[Incremental loading](general-usage/incremental-loading.md) is a crucial concept in data pipelines that involves loading only new or changed -data instead of reloading the entire dataset. This approach provides several benefits, including -low-latency data transfer and cost savings. +[Incremental loading](general-usage/incremental-loading.md) is a crucial concept in data pipelines that involves loading only new or changed data instead of reloading the entire dataset. This approach provides several benefits, including low-latency data transfer and cost savings. ### Declarative loading -Declarative loading allows you to specify the desired state of the data in the target destination, -enabling efficient incremental updates. With `dlt`, you can define the incremental loading -behavior using the `write_disposition` parameter. There are three options available: - -1. Full load: This option replaces the entire destination dataset with the data produced by the - source on the current run. You can achieve this by setting `write_disposition='replace'` in - your resources. It is suitable for stateless data that doesn't change, such as recorded events - like page views. -1. Append: The append option adds new data to the existing destination dataset. By using - `write_disposition='append'`, you can ensure that only new records are loaded. This is - suitable for stateless data that can be easily appended without any conflicts. -1. Merge: The merge option is used when you want to merge new data with the existing destination - dataset while also handling deduplication or upserts. It requires the use of `merge_key` - and/or `primary_key` to identify and update specific records. By setting - `write_disposition='merge'`, you can perform merge-based incremental loading. - -For example, let's say you want to load GitHub events and update them in the destination, ensuring -that only one instance of each event is present. +Declarative loading allows you to specify the desired state of the data in the target destination, enabling efficient incremental updates. With `dlt`, you can define the incremental loading behavior using the `write_disposition` parameter. There are three options available: + +1. Full load: This option replaces the entire destination dataset with the data produced by the source on the current run. You can achieve this by setting `write_disposition='replace'` in your resources. It is suitable for stateless data that doesn't change, such as recorded events like page views. +2. Append: The append option adds new data to the existing destination dataset. By using `write_disposition='append'`, you can ensure that only new records are loaded. This is suitable for stateless data that can be easily appended without any conflicts. +3. Merge: The merge option is used when you want to merge new data with the existing destination dataset while also handling deduplication or upserts. It requires the use of `merge_key` and/or `primary_key` to identify and update specific records. By setting `write_disposition='merge'`, you can perform merge-based incremental loading. + +For example, let's say you want to load GitHub events and update them in the destination, ensuring that only one instance of each event is present. 
You can use the merge write disposition as follows: @@ -216,54 +203,35 @@ def github_repo_events(): yield from _get_event_pages() ``` -In this example, the `github_repo_events` resource uses the merge write disposition with -`primary_key="id"`. This ensures that only one copy of each event, identified by its unique ID, -is present in the `github_repo_events` table. `dlt` takes care of loading the data -incrementally, deduplicating it, and performing the necessary merge operations. +In this example, the `github_repo_events` resource uses the merge write disposition with `primary_key="id"`. This ensures that only one copy of each event, identified by its unique ID, is present in the `github_repo_events` table. `dlt` takes care of loading the data incrementally, deduplicating it, and performing the necessary merge operations. ### Advanced state management -Advanced state management in `dlt` allows you to store and retrieve values across pipeline runs -by persisting them at the destination but accessing them in a dictionary in code. This enables you -to track and manage incremental loading effectively. By leveraging the pipeline state, you can -preserve information, such as last values, checkpoints or column renames, and utilize them later in -the pipeline. +Advanced state management in `dlt` allows you to store and retrieve values across pipeline runs by persisting them at the destination but accessing them in a dictionary in code. This enables you to track and manage incremental loading effectively. By leveraging the pipeline state, you can preserve information, such as last values, checkpoints, or column renames, and utilize them later in the pipeline. -## Transforming the Data +## Transforming the data -Data transformation plays a crucial role in the data loading process. You can perform -transformations both before and after loading the data. Here's how you can achieve it: +Data transformation plays a crucial role in the data loading process. You can perform transformations both before and after loading the data. Here's how you can achieve it: -### Before Loading +### Before loading -Before loading the data, you have the flexibility to perform transformations using Python. You can -leverage Python's extensive libraries and functions to manipulate and preprocess the data as needed. -Here's an example of -[pseudonymizing columns](general-usage/customising-pipelines/pseudonymizing_columns.md) before -loading the data. +Before loading the data, you have the flexibility to perform transformations using Python. You can leverage Python's extensive libraries and functions to manipulate and preprocess the data as needed. Here's an example of [pseudonymizing columns](general-usage/customising-pipelines/pseudonymizing_columns.md) before loading the data. -In the above example, the `pseudonymize_name` function pseudonymizes the `name` column by -generating a deterministic hash using SHA256. It adds a salt to the column value to ensure -consistent mapping. The `dummy_source` generates dummy data with an `id` and `name` -column, and the `add_map` function applies the `pseudonymize_name` transformation to each -record. +In the above example, the `pseudonymize_name` function pseudonymizes the `name` column by generating a deterministic hash using SHA256. It adds a salt to the column value to ensure consistent mapping. The `dummy_source` generates dummy data with an `id` and `name` column, and the `add_map` function applies the `pseudonymize_name` transformation to each record. 
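For reference, a minimal sketch of such a step could look like the following; the salt, the sample data, and the resource definition are placeholders rather than the linked example's exact code:

```py
import hashlib
import dlt

def pseudonymize_name(doc):
    # Deterministic pseudonymization: the same name plus salt always yields the same hash
    salt = "my_secret_salt"  # placeholder, keep a real salt in secrets
    doc["name"] = hashlib.sha256((doc["name"] + salt).encode()).hexdigest()
    return doc

@dlt.resource
def dummy_source():
    # Dummy data with an `id` and a `name` column
    for i in range(3):
        yield {"id": i, "name": f"user_{i}"}

pipeline = dlt.pipeline(pipeline_name="pseudo_demo", destination="duckdb", dataset_name="raw")
# add_map applies the function to every record before it is normalized and loaded
pipeline.run(dummy_source().add_map(pseudonymize_name))
```
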
-### After Loading +### After loading For transformations after loading the data, you have several options available: #### [Using dbt](dlt-ecosystem/transformations/dbt/dbt.md) -dbt is a powerful framework for transforming data. It enables you to structure your transformations -into DAGs, providing cross-database compatibility and various features such as templating, -backfills, testing, and troubleshooting. You can use the dbt runner in `dlt` to seamlessly -integrate dbt into your pipeline. Here's an example of running a dbt package after loading the data: +dbt is a powerful framework for transforming data. It enables you to structure your transformations into DAGs, providing cross-database compatibility and various features such as templating, backfills, testing, and troubleshooting. You can use the dbt runner in `dlt` to seamlessly integrate dbt into your pipeline. Here's an example of running a dbt package after loading the data: ```py import dlt from pipedrive import pipedrive_source -# load to raw +# Load to raw pipeline = dlt.pipeline( pipeline_name='pipedrive', destination='bigquery', @@ -281,29 +249,23 @@ pipeline = dlt.pipeline( dataset_name='pipedrive_dbt' ) -# make venv and install dbt in it. +# Make venv and install dbt in it. venv = dlt.dbt.get_venv(pipeline) -# get package from local or github link and run +# Get package from local or GitHub link and run dbt = dlt.dbt.package(pipeline, "pipedrive/dbt_pipedrive/pipedrive", venv=venv) models = dbt.run_all() -# show outcome +# Show outcome for m in models: print(f"Model {m.model_name} materialized in {m.time} with status {m.status} and message {m.message}") ``` -In this example, the first pipeline loads the data using `pipedrive_source()`. The second -pipeline performs transformations using a dbt package called `pipedrive` after loading the data. -The `dbt.package` function sets up the dbt runner, and `dbt.run_all()` executes the dbt -models defined in the package. +In this example, the first pipeline loads the data using `pipedrive_source()`. The second pipeline performs transformations using a dbt package called `pipedrive` after loading the data. The `dbt.package` function sets up the dbt runner, and `dbt.run_all()` executes the dbt models defined in the package. #### [Using the `dlt` SQL client](dlt-ecosystem/transformations/sql.md) -Another option is to leverage the `dlt` SQL client to query the loaded data and perform -transformations using SQL statements. You can execute SQL statements that change the database schema -or manipulate data within tables. Here's an example of inserting a row into the `customers` -table using the `dlt` SQL client: +Another option is to leverage the `dlt` SQL client to query the loaded data and perform transformations using SQL statements. You can execute SQL statements that change the database schema or manipulate data within tables. Here's an example of inserting a row into the `customers` table using the `dlt` SQL client: ```py pipeline = dlt.pipeline(destination="bigquery", dataset_name="crm") @@ -314,14 +276,11 @@ with pipeline.sql_client() as client: ) ``` -In this example, the `execute_sql` method of the SQL client allows you to execute SQL -statements. The statement inserts a row with values into the `customers` table. +In this example, the `execute_sql` method of the SQL client allows you to execute SQL statements. The statement inserts a row with values into the `customers` table. 
#### [Using Pandas](dlt-ecosystem/transformations/pandas.md) -You can fetch query results as Pandas data frames and perform transformations using Pandas -functionalities. Here's an example of reading data from the `issues` table in DuckDB and -counting reaction types using Pandas: +You can fetch query results as Pandas data frames and perform transformations using Pandas functionalities. Here's an example of reading data from the `issues` table in DuckDB and counting reaction types using Pandas: ```py pipeline = dlt.pipeline( @@ -340,90 +299,63 @@ with pipeline.sql_client() as client: counts = reactions.sum(0).sort_values(0, ascending=False) ``` -By leveraging these transformation options, you can shape and manipulate the data before or after -loading it, allowing you to meet specific requirements and ensure data quality and consistency. +By leveraging these transformation options, you can shape and manipulate the data before or after loading it, allowing you to meet specific requirements and ensure data quality and consistency. ## Adjusting the automated normalisation -To streamline the process, `dlt` recommends attaching schemas to sources implicitly instead of -creating them explicitly. You can provide a few global schema settings and let the table and column -schemas be generated from the resource hints and the data itself. The `dlt.source` decorator accepts a -schema instance that you can create and modify within the source function. Additionally, you can -store schema files with the source Python module and have them automatically loaded and used as the -schema for the source. +To streamline the process, `dlt` recommends attaching schemas to sources implicitly instead of creating them explicitly. You can provide a few global schema settings and let the table and column schemas be generated from the resource hints and the data itself. The `dlt.source` decorator accepts a schema instance that you can create and modify within the source function. Additionally, you can store schema files with the source Python module and have them automatically loaded and used as the schema for the source. -By adjusting the automated normalization process in `dlt`, you can ensure that the generated database -schema meets your specific requirements and aligns with your preferred naming conventions, data -types, and other customization needs. +By adjusting the automated normalization process in `dlt`, you can ensure that the generated database schema meets your specific requirements and aligns with your preferred naming conventions, data types, and other customization needs. -### Customizing the Normalization Process +### Customizing the normalization process Customizing the normalization process in `dlt` allows you to adapt it to your specific requirements. -You can adjust table and column names, configure column properties, define data type autodetectors, -apply performance hints, specify preferred data types, or change how ids are propagated in the -unpacking process. +You can adjust table and column names, configure column properties, define data type autodetectors, apply performance hints, specify preferred data types, or change how ids are propagated in the unpacking process. -These customization options enable you to create a schema that aligns with your desired naming -conventions, data types, and overall data structure. With `dlt`, you have the flexibility to tailor -the normalization process to meet your unique needs and achieve optimal results. 
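As a small, hedged illustration, column-level hints can be set directly on a resource so that the inferred schema uses your preferred types; the table and field names below are made up:

```py
import dlt

@dlt.resource(
    table_name="payments",
    columns={
        "amount": {"data_type": "decimal"},        # preferred type instead of an inferred double
        "created_at": {"data_type": "timestamp"},  # parse the ISO string into a timestamp column
        "customer_id": {"nullable": False},        # column-level property
    },
)
def payments():
    yield {"customer_id": 1, "amount": "120.50", "created_at": "2023-09-12T16:45:51+00:00"}
```
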
+These customization options enable you to create a schema that aligns with your desired naming conventions, data types, and overall data structure. With `dlt`, you have the flexibility to tailor the normalization process to meet your unique needs and achieve optimal results. Read more about how to configure [schema generation.](general-usage/schema.md) -### Exporting and Importing Schema Files +### Exporting and importing schema files -`dlt` allows you to export and import schema files, which contain the structure and instructions for -processing and loading the data. Exporting schema files enables you to modify them directly, making -adjustments to the schema as needed. You can then import the modified schema files back into `dlt` to -use them in your pipeline. +`dlt` allows you to export and import schema files, which contain the structure and instructions for processing and loading the data. Exporting schema files enables you to modify them directly, making adjustments to the schema as needed. You can then import the modified schema files back into `dlt` to use them in your pipeline. Read more: [Adjust a schema docs.](walkthroughs/adjust-a-schema.md) -## Governance Support in `dlt` Pipelines +## Governance support in `dlt` pipelines -`dlt` pipelines offer robust governance support through three key mechanisms: pipeline metadata -utilization, schema enforcement and curation, and schema change alerts. +`dlt` pipelines offer robust governance support through three key mechanisms: pipeline metadata utilization, schema enforcement and curation, and schema change alerts. -### Pipeline Metadata +### Pipeline metadata -`dlt` pipelines leverage metadata to provide governance capabilities. This metadata includes load IDs, -which consist of a timestamp and pipeline name. Load IDs enable incremental transformations and data -vaulting by tracking data loads and facilitating data lineage and traceability. +`dlt` pipelines leverage metadata to provide governance capabilities. This metadata includes load IDs, which consist of a timestamp and pipeline name. Load IDs enable incremental transformations and data vaulting by tracking data loads and facilitating data lineage and traceability. Read more about [lineage](general-usage/destination-tables.md#data-lineage). -### Schema Enforcement and Curation +### Schema enforcement and curation -`dlt` empowers users to enforce and curate schemas, ensuring data consistency and quality. Schemas -define the structure of normalized data and guide the processing and loading of data. By adhering to -predefined schemas, pipelines maintain data integrity and facilitate standardized data handling -practices. +`dlt` empowers users to enforce and curate schemas, ensuring data consistency and quality. Schemas define the structure of normalized data and guide the processing and loading of data. By adhering to predefined schemas, pipelines maintain data integrity and facilitate standardized data handling practices. Read more: [Adjust a schema docs.](walkthroughs/adjust-a-schema.md) ### Schema evolution -`dlt` enables proactive governance by alerting users to schema changes. When modifications occur in -the source data’s schema, such as table or column alterations, `dlt` notifies stakeholders, allowing -them to take necessary actions, such as reviewing and validating the changes, updating downstream -processes, or performing impact analysis. +`dlt` enables proactive governance by alerting users to schema changes. 
When modifications occur in the source data’s schema, such as table or column alterations, `dlt` notifies stakeholders, allowing them to take necessary actions, such as reviewing and validating the changes, updating downstream processes, or performing impact analysis. -These governance features in `dlt` pipelines contribute to better data management practices, -compliance adherence, and overall data governance, promoting data consistency, traceability, and -control throughout the data processing lifecycle. +These governance features in `dlt` pipelines contribute to better data management practices, compliance adherence, and overall data governance, promoting data consistency, traceability, and control throughout the data processing lifecycle. ### Scaling and finetuning -`dlt` offers several mechanism and configuration options to scale up and finetune pipelines: +`dlt` offers several mechanisms and configuration options to scale up and finetune pipelines: -- Running extraction, normalization and load in parallel. +- Running extraction, normalization, and load in parallel. - Writing sources and resources that are run in parallel via thread pools and async execution. -- Finetune the memory buffers, intermediary file sizes and compression options. +- Finetune the memory buffers, intermediary file sizes, and compression options. Read more about [performance.](reference/performance.md) ### Other advanced topics -`dlt` is a constantly growing library that supports many features and use cases needed by the -community. [Join our Slack](https://dlthub.com/community) -to find recent releases or discuss what you can build with `dlt`. +`dlt` is a constantly growing library that supports many features and use cases needed by the community. [Join our Slack](https://dlthub.com/community) to find recent releases or discuss what you can build with `dlt`. + diff --git a/docs/website/docs/general-usage/data-enrichments/user_agent_device_data_enrichment.md b/docs/website/docs/general-usage/data-enrichments/user_agent_device_data_enrichment.md index 2448d31a06..b24050c9df 100644 --- a/docs/website/docs/general-usage/data-enrichments/user_agent_device_data_enrichment.md +++ b/docs/website/docs/general-usage/data-enrichments/user_agent_device_data_enrichment.md @@ -1,44 +1,38 @@ --- -title: User-agent device data enrichment +title: User-agent device data enrichment description: Enriching the user-agent device data with average device price. keywords: [data enrichment, user-agent data, device enrichment] --- # Data enrichment part one: User-agent device data enrichment -Data enrichment enhances raw data with valuable information from multiple sources, increasing its -analytical and decision-making value. +Data enrichment enhances raw data with valuable information from multiple sources, increasing its analytical and decision-making value. -This part covers enriching sample data with device price. Understanding the price segment -of the device that the user used to access your service can be helpful in personalized marketing, -customer segmentation, and many more. +This part covers enriching sample data with device price. Understanding the price segment of the device that the user used to access your service can be helpful in personalized marketing, customer segmentation, and many more. -This documentation will discuss how to enrich the user device information with the average market -price. +This documentation will discuss how to enrich the user device information with the average market price. 
-## Setup Guide +## Setup guide -We use SerpAPI to retrieve device prices using Google Shopping, but alternative services or APIs are -viable. +We use SerpAPI to retrieve device prices using Google Shopping, but alternative services or APIs are viable. :::note -SerpAPI free tier offers 100 free calls monthly. For production, consider upgrading to a higher -plan. +SerpAPI free tier offers 100 free calls monthly. For production, consider upgrading to a higher plan. ::: - ## Creating data enrichment pipeline -You can either follow the example in the linked Colab notebook or follow this documentation to -create the user-agent device data enrichment pipeline. + +You can either follow the example in the linked Colab notebook or follow this documentation to create the user-agent device data enrichment pipeline. ### A. Colab notebook -The Colab notebook combines three data enrichment processes for a sample dataset, starting with "Data -enrichment part one: User-agent device data". + +The Colab notebook combines three data enrichment processes for a sample dataset, starting with "Data enrichment part one: User-agent device data." Here's the link to the notebook: **[Colab Notebook](https://colab.research.google.com/drive/1ZKEkf1LRSld7CWQFS36fUXjhJKPAon7P?usp=sharing).** ### B. Create a pipeline + Alternatively, to create a data enrichment pipeline, you can start by creating the following directory structure: ```text @@ -47,80 +41,72 @@ user_device_enrichment/ │ └── secrets.toml └── device_enrichment_pipeline.py ``` + ### 1. Creating resource - `dlt` works on the principle of [sources](https://dlthub.com/docs/general-usage/source) - and [resources.](https://dlthub.com/docs/general-usage/resource) +`dlt` works on the principle of [sources](https://dlthub.com/docs/general-usage/source) and [resources.](https://dlthub.com/docs/general-usage/resource) - This data resource yields data typical of what many web analytics and - tracking tools can collect. However, the specifics of what data is collected - and how it's used can vary significantly among different tracking services. +This data resource yields data typical of what many web analytics and tracking tools can collect. However, the specifics of what data is collected and how it's used can vary significantly among different tracking services. - Let's examine a synthetic dataset created for this article. It includes: +Let's examine a synthetic dataset created for this article. It includes: - `user_id`: Web trackers typically assign unique ID to users for - tracking their journeys and interactions over time. +`user_id`: Web trackers typically assign a unique ID to users for tracking their journeys and interactions over time. - `device_name`: User device information helps in understanding the user base's device. +`device_name`: User device information helps in understanding the user base's device. - `page_refer`: The referer URL is tracked to analyze traffic sources and user navigation behavior. +`page_refer`: The referer URL is tracked to analyze traffic sources and user navigation behavior. - Here's the resource that yields the sample data as discussed above: +Here's the resource that yields the sample data as discussed above: - ```py - import dlt - - @dlt.resource(write_disposition="append") - def tracked_data(): - """ - A generator function that yields a series of dictionaries, each representing - user tracking data. - - This function is decorated with `dlt.resource` to integrate into the DLT (Data - Loading Tool) pipeline. 
The `write_disposition` parameter is set to "append" to - ensure that data from this generator is appended to the existing data in the - destination table. - - Yields: - dict: A dictionary with keys 'user_id', 'device_name', and 'page_referer', - representing the user's tracking data including their device and the page - they were referred from. - """ - - # Sample data representing tracked user data - sample_data = [ - {"user_id": 1, "device_name": "Sony Experia XZ", "page_referer": - "https://b2venture.lightning.force.com/"}, - {"user_id": 2, "device_name": "Samsung Galaxy S23 Ultra 5G", - "page_referer": "https://techcrunch.com/2023/07/20/can-dlthub-solve-the-python-library-problem-for-ai-dig-ventures-thinks-so/"}, - {"user_id": 3, "device_name": "Apple iPhone 14 Pro Max", - "page_referer": "https://dlthub.com/success-stories/freelancers-perspective/"}, - {"user_id": 4, "device_name": "OnePlus 11R", - "page_referer": "https://www.reddit.com/r/dataengineering/comments/173kp9o/ideas_for_data_validation_on_data_ingestion/"}, - {"user_id": 5, "device_name": "Google Pixel 7 Pro", "page_referer": "https://pypi.org/"}, - ] - - # Yielding each user's data as a dictionary - for user_data in sample_data: - yield user_data - ``` +```py +import dlt + +@dlt.resource(write_disposition="append") +def tracked_data(): + """ + A generator function that yields a series of dictionaries, each representing + user tracking data. + + This function is decorated with `dlt.resource` to integrate into the DLT (Data + Loading Tool) pipeline. The `write_disposition` parameter is set to "append" to + ensure that data from this generator is appended to the existing data in the + destination table. + + Yields: + dict: A dictionary with keys 'user_id', 'device_name', and 'page_referer', + representing the user's tracking data including their device and the page + they were referred from. + """ + + # Sample data representing tracked user data + sample_data = [ + {"user_id": 1, "device_name": "Sony Experia XZ", "page_referer": + "https://b2venture.lightning.force.com/"}, + {"user_id": 2, "device_name": "Samsung Galaxy S23 Ultra 5G", + "page_referer": "https://techcrunch.com/2023/07/20/can-dlthub-solve-the-python-library-problem-for-ai-dig-ventures-thinks-so/"}, + {"user_id": 3, "device_name": "Apple iPhone 14 Pro Max", + "page_referer": "https://dlthub.com/success-stories/freelancers-perspective/"}, + {"user_id": 4, "device_name": "OnePlus 11R", + "page_referer": "https://www.reddit.com/r/dataengineering/comments/173kp9o/ideas_for_data_validation_on_data_ingestion/"}, + {"user_id": 5, "device_name": "Google Pixel 7 Pro", "page_referer": "https://pypi.org/"}, + ] + + # Yielding each user's data as a dictionary + for user_data in sample_data: + yield user_data +``` ### 2. Create `fetch_average_price` function -This particular function retrieves the average price of a device by utilizing SerpAPI and Google -shopping listings. To filter the data, the function uses `dlt` state, and only fetches prices -from SerpAPI for devices that have not been updated in the most recent run or for those that were -loaded more than 180 days in the past. +This particular function retrieves the average price of a device by utilizing SerpAPI and Google shopping listings. To filter the data, the function uses `dlt` state and only fetches prices from SerpAPI for devices that have not been updated in the most recent run or for those that were loaded more than 180 days in the past. 
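Conceptually, the state check can be pictured with a small sketch like the one below; the helper name and the state layout are made up for illustration and are not the implementation used in this guide:

```py
from datetime import datetime, timedelta, timezone

import dlt

def should_fetch_price(device_name, max_age_days=180):
    # Must run inside a resource or transformer, where dlt exposes per-resource state
    device_info = dlt.current.resource_state().setdefault("devices", {})
    last_updated = device_info.get(device_name)
    if last_updated is None:
        return True  # never priced before, so call SerpAPI
    age = datetime.now(timezone.utc) - datetime.fromisoformat(last_updated)
    return age > timedelta(days=max_age_days)

# After a successful SerpAPI call, the fetch time would be recorded, e.g.:
# device_info[device_name] = datetime.now(timezone.utc).isoformat()
```
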
The first step is to register on [SerpAPI](https://serpapi.com/) and obtain the API token key. -1. In the `.dlt`folder, there's a file called `secrets.toml`. It's where you store sensitive - information securely, like access tokens. Keep this file safe. Here's its format for service - account authentication: +1. In the `.dlt` folder, there's a file called `secrets.toml`. It's where you store sensitive information securely, like access tokens. Keep this file safe. Here's its format for service account authentication: ```py [sources] - api_key= "Please set me up!" #Serp Api key. + api_key= "Please set me up!" # Serp API key. ``` 1. Replace the value of the `api_key`. @@ -229,21 +215,13 @@ The first step is to register on [SerpAPI](https://serpapi.com/) and obtain the - Add map function - Transformer function + The `dlt` library's `transformer` and `add_map` functions serve distinct purposes in data processing. - The `dlt` library's `transformer` and `add_map` functions serve distinct purposes in data - processing. - - `Transformers` used to process a resource and are ideal for post-load data transformations in a - pipeline, compatible with tools like `dbt`, the `dlt SQL client`, or Pandas for intricate data - manipulation. To read more: + `Transformers` are used to process a resource and are ideal for post-load data transformations in a pipeline, compatible with tools like `dbt`, the `dlt SQL client`, or Pandas for intricate data manipulation. To read more: [Click here.](../../general-usage/resource#process-resources-with-dlttransformer) - Conversely, `add_map` used to customize a resource applies transformations at an item level - within a resource. It's useful for tasks like anonymizing individual data records. More on this - can be found under - [Customize resources](../../general-usage/resource#customize-resources) in the - documentation. - + Conversely, `add_map` is used to customize a resource and applies transformations at an item level within a resource. It's useful for tasks like anonymizing individual data records. More on this can be found under + [Customize resources](../../general-usage/resource#customize-resources) in the documentation. 1. Here, we create the pipeline and use the `add_map` functionality: @@ -262,9 +240,7 @@ The first step is to register on [SerpAPI](https://serpapi.com/) and obtain the ``` :::info - Please note that the same outcome can be achieved by using the transformer function. To - do so, you need to add the transformer decorator at the top of the `fetch_average_price` function. - For `pipeline.run`, you can use the following code: + Please note that the same outcome can be achieved by using the transformer function. To do so, you need to add the transformer decorator at the top of the `fetch_average_price` function. For `pipeline.run`, you can use the following code: ```py # using fetch_average_price as a transformer function @@ -274,14 +250,13 @@ The first step is to register on [SerpAPI](https://serpapi.com/) and obtain the ) ``` - This will execute the `fetch_average_price` function with the tracked data and return the average - price. + This will execute the `fetch_average_price` function with the tracked data and return the average price. ::: ### Run the pipeline 1. 
Install necessary dependencies for the preferred - [destination](https://dlthub.com/docs/dlt-ecosystem/destinations/), For example, duckdb: + [destination](https://dlthub.com/docs/dlt-ecosystem/destinations/), for example, duckdb: ```sh pip install "dlt[duckdb]" @@ -299,7 +274,5 @@ The first step is to register on [SerpAPI](https://serpapi.com/) and obtain the dlt pipeline show ``` - For example, the "pipeline_name" for the above pipeline example is `data_enrichment_one`; you can use - any custom name instead. - + For example, the "pipeline_name" for the above pipeline example is `data_enrichment_one`; you can use any custom name instead. diff --git a/docs/website/docs/general-usage/destination-tables.md b/docs/website/docs/general-usage/destination-tables.md index 405fd4379d..9c6ae4a1b2 100644 --- a/docs/website/docs/general-usage/destination-tables.md +++ b/docs/website/docs/general-usage/destination-tables.md @@ -6,9 +6,7 @@ keywords: [destination tables, loaded data, data structure, schema, table, neste # Destination tables -When you run a [pipeline](pipeline.md), dlt creates tables in the destination database and loads the data -from your [source](source.md) into these tables. In this section, we will take a closer look at what -destination tables look like and how they are organized. +When you run a [pipeline](pipeline.md), dlt creates tables in the destination database and loads the data from your [source](source.md) into these tables. In this section, we will take a closer look at what destination tables look like and how they are organized. We start with a simple dlt pipeline: @@ -30,8 +28,7 @@ load_info = pipeline.run(data, table_name="users") :::note -Here we are using the [DuckDb destination](../dlt-ecosystem/destinations/duckdb.md), which is an in-memory database. Other database destinations -will behave similarly and have similar concepts. +Here we are using the [DuckDb destination](../dlt-ecosystem/destinations/duckdb.md), which is an in-memory database. Other database destinations will behave similarly and have similar concepts. ::: @@ -39,22 +36,13 @@ Running this pipeline will create a database schema in the destination database ## Database schema -The database schema is a collection of tables that represent the data you loaded into the database. -The schema name is the same as the `dataset_name` you provided in the pipeline definition. -In the example above, we explicitly set the `dataset_name` to `mydata`. If you don't set it, -it will be set to the pipeline name with a suffix `_dataset`. +The database schema is a collection of tables that represent the data you loaded into the database. The schema name is the same as the `dataset_name` you provided in the pipeline definition. In the example above, we explicitly set the `dataset_name` to `mydata`. If you don't set it, it will be set to the pipeline name with a suffix `_dataset`. -Be aware that the schema referred to in this section is distinct from the [dlt Schema](schema.md). -The database schema pertains to the structure and organization of data within the database, including table -definitions and relationships. On the other hand, the "dlt Schema" specifically refers to the format -and structure of normalized data within the dlt pipeline. +Be aware that the schema referred to in this section is distinct from the [dlt Schema](schema.md). The database schema pertains to the structure and organization of data within the database, including table definitions and relationships. 
On the other hand, the "dlt Schema" specifically refers to the format and structure of normalized data within the dlt pipeline. ## Tables -Each [resource](resource.md) in your pipeline definition will be represented by a table in -the destination. In the example above, we have one resource, `users`, so we will have one table, `mydata.users`, -in the destination. Where `mydata` is the schema name, and `users` is the table name. Here also, we explicitly set -the `table_name` to `users`. When `table_name` is not set, the table name will be set to the resource name. +Each [resource](resource.md) in your pipeline definition will be represented by a table in the destination. In the example above, we have one resource, `users`, so we will have one table, `mydata.users`, in the destination. Where `mydata` is the schema name, and `users` is the table name. Here also, we explicitly set the `table_name` to `users`. When `table_name` is not set, the table name will be set to the resource name. For example, we can rewrite the pipeline above as: @@ -78,9 +66,7 @@ The result will be the same; note that we do not explicitly pass `table_name="us :::note -Special tables are created to track the pipeline state. These tables are prefixed with `_dlt_` -and are not shown in the `show` command of the `dlt pipeline` CLI. However, you can see them when -connecting to the database directly. +Special tables are created to track the pipeline state. These tables are prefixed with `_dlt_` and are not shown in the `show` command of the `dlt pipeline` CLI. However, you can see them when connecting to the database directly. ::: @@ -134,32 +120,28 @@ Running this pipeline will create two tables in the destination, `users` (**root | 2 | Spot | dog | 9uxh36VU9lqKpw | wX3f5vn801W16A | 1 | | 3 | Fido | dog | pe3FVtCWz8VuNA | rX8ybgTeEmAmmA | 0 | -When inferring a database schema, `dlt` maps the structure of Python objects (ie. from parsed JSON files) into nested tables and creates -references between them. +When inferring a database schema, `dlt` maps the structure of Python objects (i.e., from parsed JSON files) into nested tables and creates references between them. This is how it works: 1. Each row in all (root and nested) data tables created by `dlt` contains a unique column named `_dlt_id` (**row key**). -1. Each nested table contains column named `_dlt_parent_id` referencing to a particular row (`_dlt_id`) of a parent table (**parent key**). +1. Each nested table contains a column named `_dlt_parent_id` referencing a particular row (`_dlt_id`) of a parent table (**parent key**). 1. Rows in nested tables come from the Python lists: `dlt` stores the position of each item in the list in `_dlt_list_idx`. 1. For nested tables that are loaded with the `merge` write disposition, we add a **root key** column `_dlt_root_id`, which references the child table to a row in the root table. -[Learn more on nested references, row keys and parent keys](schema.md#nested-references-root-and-nested-tables) +[Learn more on nested references, row keys, and parent keys](schema.md#nested-references-root-and-nested-tables) ## Naming convention: tables and columns During a pipeline run, dlt [normalizes both table and column names](schema.md#naming-convention) to ensure compatibility with the destination database's accepted format. All names from your source data will be transformed into snake_case and will only include alphanumeric characters. 
Please be aware that the names in the destination database may differ somewhat from those in your original input. ### Variant columns -If your data has inconsistent types, `dlt` will dispatch the data to several **variant columns**. For example, if you have a resource (i.e., JSON file) with a field with name `answer` and your data contains boolean values, you will get a column with name `answer` of type `BOOLEAN` in your destination. If for some reason, on the next load, you get integer and string values in `answer`, the inconsistent data will go to `answer__v_bigint` and `answer__v_text` columns respectively. -The general naming rule for variant columns is `__v_` where `original_name` is the existing column name (with data type clash) and `type` is the name of the data type stored in the variant. -## Load Packages and Load IDs +If your data has inconsistent types, `dlt` will dispatch the data to several **variant columns**. For example, if you have a resource (i.e., JSON file) with a field named `answer` and your data contains boolean values, you will get a column named `answer` of type `BOOLEAN` in your destination. If for some reason, on the next load, you get integer and string values in `answer`, the inconsistent data will go to `answer__v_bigint` and `answer__v_text` columns respectively. The general naming rule for variant columns is `__v_` where `original_name` is the existing column name (with data type clash) and `type` is the name of the data type stored in the variant. -Each execution of the pipeline generates one or more load packages. A load package typically contains data retrieved from -all the [resources](glossary.md#resource) of a particular [source](glossary.md#source). -These packages are uniquely identified by a `load_id`. The `load_id` of a particular package is added to the top data tables -(referenced as `_dlt_load_id` column in the example above) and to the special `_dlt_loads` table with a status of 0 (when the load process is fully completed). +## Load packages and load IDs + +Each execution of the pipeline generates one or more load packages. A load package typically contains data retrieved from all the [resources](glossary.md#resource) of a particular [source](glossary.md#source). These packages are uniquely identified by a `load_id`. The `load_id` of a particular package is added to the top data tables (referenced as the `_dlt_load_id` column in the example above) and to the special `_dlt_loads` table with a status of 0 (when the load process is fully completed). To illustrate this, let's load more data into the same destination: @@ -173,8 +155,7 @@ data = [ ] ``` -The rest of the pipeline definition remains the same. Running this pipeline will create a new load -package with a new `load_id` and add the data to the existing tables. The `users` table will now look like this: +The rest of the pipeline definition remains the same. Running this pipeline will create a new load package with a new `load_id` and add the data to the existing tables. The `users` table will now look like this: **mydata.users** @@ -193,39 +174,21 @@ The `_dlt_loads` table will look like this: | 1234562350.98417 | quick_start | 0 | 2023-09-12 16:45:51.17865+00 | aOEb...Qekd/58= | | **1234563456.12345** | quick_start | 0 | 2023-09-12 16:46:03.10662+00 | aOEb...Qekd/58= | -The `_dlt_loads` table tracks complete loads and allows chaining transformations on top of them. -Many destinations do not support distributed and long-running transactions (e.g., Amazon Redshift). 
-In that case, the user may see the partially loaded data. It is possible to filter such data out: any -row with a `load_id` that does not exist in `_dlt_loads` is not yet completed. The same procedure may be used to identify -and delete data for packages that never got completed. +The `_dlt_loads` table tracks complete loads and allows chaining transformations on top of them. Many destinations do not support distributed and long-running transactions (e.g., Amazon Redshift). In that case, the user may see the partially loaded data. It is possible to filter such data out: any row with a `load_id` that does not exist in `_dlt_loads` is not yet completed. The same procedure may be used to identify and delete data for packages that never got completed. -For each load, you can test and [alert](../running-in-production/alerting.md) on anomalies (e.g., -no data, too much loaded to a table). There are also some useful load stats in the `Load info` tab -of the [Streamlit app](../dlt-ecosystem/visualizations/exploring-the-data.md#exploring-the-data) -mentioned above. +For each load, you can test and [alert](../running-in-production/alerting.md) on anomalies (e.g., no data, too much loaded to a table). There are also some useful load stats in the `Load info` tab of the [Streamlit app](../dlt-ecosystem/visualizations/exploring-the-data.md#exploring-the-data) mentioned above. -You can add [transformations](../dlt-ecosystem/transformations/) and chain them together -using the `status` column. You start the transformation for all the data with a particular -`load_id` with a status of 0 and then update it to 1. The next transformation starts with the status -of 1 and is then updated to 2. This can be repeated for every additional transformation. +You can add [transformations](../dlt-ecosystem/transformations/) and chain them together using the `status` column. You start the transformation for all the data with a particular `load_id` with a status of 0 and then update it to 1. The next transformation starts with the status of 1 and is then updated to 2. This can be repeated for every additional transformation. ### Data lineage -Data lineage can be super relevant for architectures like the -[data vault architecture](https://www.data-vault.co.uk/what-is-data-vault/) or when troubleshooting. -The data vault architecture is a data warehouse that large organizations use when representing the -same process across multiple systems, which adds data lineage requirements. Using the pipeline name -and `load_id` provided out of the box by `dlt`, you are able to identify the source and time of data. +Data lineage can be super relevant for architectures like the [data vault architecture](https://www.data-vault.co.uk/what-is-data-vault/) or when troubleshooting. The data vault architecture is a data warehouse that large organizations use when representing the same process across multiple systems, which adds data lineage requirements. Using the pipeline name and `load_id` provided out of the box by `dlt`, you are able to identify the source and time of data. -You can [save](../running-in-production/running.md#inspect-and-save-the-load-info-and-trace) -complete lineage info for a particular `load_id` including a list of loaded files, error messages -(if any), elapsed times, schema changes. This can be helpful, for example, when troubleshooting -problems. 
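Building on the `_dlt_loads` table described above, a query can be restricted to fully completed load packages by joining on it; the table names follow the earlier `users` example:

```py
with pipeline.sql_client() as client:
    # Keep only rows whose load package completed successfully (status = 0)
    completed_users = client.execute_sql(
        """
        SELECT u.*
        FROM users AS u
        JOIN _dlt_loads AS l ON l.load_id = u._dlt_load_id
        WHERE l.status = 0
        """
    )
```
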
+You can [save](../running-in-production/running.md#inspect-and-save-the-load-info-and-trace) complete lineage info for a particular `load_id` including a list of loaded files, error messages (if any), and elapsed times, and schema changes. This can be helpful, for example, when troubleshooting problems. ## Staging dataset -So far we've been using the `append` write disposition in our example pipeline. This means that -each time we run the pipeline, the data is appended to the existing tables. When you use the [merge write disposition](incremental-loading.md), dlt creates a staging database schema for staging data. This schema is named `_staging` [by default](https://dlthub.com/docs/devel/dlt-ecosystem/staging#staging-dataset) and contains the same tables as the destination schema. When you run the pipeline, the data from the staging tables is loaded into the destination tables in a single atomic transaction. +So far, we've been using the `append` write disposition in our example pipeline. This means that each time we run the pipeline, the data is appended to the existing tables. When you use the [merge write disposition](incremental-loading.md), dlt creates a staging database schema for staging data. This schema is named `_staging` [by default](https://dlthub.com/docs/devel/dlt-ecosystem/staging#staging-dataset) and contains the same tables as the destination schema. When you run the pipeline, the data from the staging tables is loaded into the destination tables in a single atomic transaction. Let's illustrate this with an example. We change our pipeline to use the `merge` write disposition: @@ -249,7 +212,7 @@ load_info = pipeline.run(users) ``` Running this pipeline will create a schema in the destination database with the name `mydata_staging`. -If you inspect the tables in this schema, you will find the `mydata_staging.users` table identical to the`mydata.users` table in the previous example. +If you inspect the tables in this schema, you will find the `mydata_staging.users` table identical to the `mydata.users` table in the previous example. Here is what the tables may look like after running the pipeline: @@ -272,10 +235,7 @@ Notice that the `mydata.users` table now contains the data from both the previou ## Dev mode (versioned) datasets -When you set the `dev_mode` argument to `True` in `dlt.pipeline` call, dlt creates a versioned dataset. -This means that each time you run the pipeline, the data is loaded into a new dataset (a new database schema). -The dataset name is the same as the `dataset_name` you provided in the pipeline definition with a -datetime-based suffix. +When you set the `dev_mode` argument to `True` in the `dlt.pipeline` call, dlt creates a versioned dataset. This means that each time you run the pipeline, the data is loaded into a new dataset (a new database schema). The dataset name is the same as the `dataset_name` you provided in the pipeline definition with a datetime-based suffix. We modify our pipeline to use the `dev_mode` option to see how this works: @@ -296,41 +256,27 @@ pipeline = dlt.pipeline( load_info = pipeline.run(data, table_name="users") ``` -Every time you run this pipeline, a new schema will be created in the destination database with a -datetime-based suffix. The data will be loaded into tables in this schema. -For example, the first time you run the pipeline, the schema will be named -`mydata_20230912064403`, the second time it will be named `mydata_20230912064407`, and so on. 
+Every time you run this pipeline, a new schema will be created in the destination database with a datetime-based suffix. The data will be loaded into tables in this schema. For example, the first time you run the pipeline, the schema will be named `mydata_20230912064403`, the second time it will be named `mydata_20230912064407`, and so on. ## Loading data into existing tables not created by dlt -You can also load data from `dlt` into tables that already exist in the destination dataset and were not created by `dlt`. -There are a few things to keep in mind when you are doing this: +You can also load data from `dlt` into tables that already exist in the destination dataset and were not created by `dlt`. There are a few things to keep in mind when you are doing this: -If you load data to a table that exists but does not contain any data, in most cases your load will succeed without problems. -`dlt` will create the needed columns and insert the incoming data. `dlt` will only be aware of columns that exist on the -discovered or provided internal schema, so if you have columns in your destination, that are not anticipated by `dlt`, they -will remain in the destination but stay unknown to `dlt`. This will generally not be a problem. +If you load data to a table that exists but does not contain any data, in most cases your load will succeed without problems. `dlt` will create the needed columns and insert the incoming data. `dlt` will only be aware of columns that exist on the discovered or provided internal schema, so if you have columns in your destination that are not anticipated by `dlt`, they will remain in the destination but stay unknown to `dlt`. This will generally not be a problem. -If your destination table already exists and contains columns that have the same name as columns discovered by `dlt` but -do not have matching datatypes, your load will fail and you will have to fix the column on the destination table first, -or change the column name in your incoming data to something else to avoid a collision. +If your destination table already exists and contains columns that have the same name as columns discovered by `dlt` but do not have matching data types, your load will fail, and you will have to fix the column on the destination table first or change the column name in your incoming data to something else to avoid a collision. -If your destination table exists and already contains data, your load might also initially fail, since `dlt` creates -special `non-nullable` columns that contains required mandatory metadata. Some databases will not allow you to create -`non-nullable` columns on tables that have data, since the initial value for these columns of the existing rows can -not be inferred. You will have to manually create these columns with the correct type on your existing tables and -make them `nullable`, then fill in values for the existing rows. Some databases may allow you to create a new column -that is `non-nullable` and take a default value for existing rows in the same command. The columns you will need to -create are: +If your destination table exists and already contains data, your load might also initially fail since `dlt` creates special `non-nullable` columns that contain required mandatory metadata. Some databases will not allow you to create `non-nullable` columns on tables that have data since the initial value for these columns of the existing rows cannot be inferred. 
You will have to manually create these columns with the correct type on your existing tables and make them `nullable`, then fill in values for the existing rows. Some databases may allow you to create a new column that is `non-nullable` and take a default value for existing rows in the same command. The columns you will need to create are: | name | type | | --- | --- | | _dlt_load_id | text/string/varchar | | _dlt_id | text/string/varchar | -For nested tables you may also need to create: +For nested tables, you may also need to create: | name | type | | --- | --- | | _dlt_parent_id | text/string/varchar | -| _dlt_root_id | text/string/varchar | \ No newline at end of file +| _dlt_root_id | text/string/varchar | + diff --git a/docs/website/docs/general-usage/destination.md b/docs/website/docs/general-usage/destination.md index d88a0b53f2..d09e10d469 100644 --- a/docs/website/docs/general-usage/destination.md +++ b/docs/website/docs/general-usage/destination.md @@ -6,44 +6,42 @@ keywords: [destination, load data, configure destination, name destination] # Destination -[Destination](glossary.md#destination) is a location in which `dlt` creates and maintains the current version of the schema and loads your data. Destinations come in various forms: databases, datalakes, vector stores or files. `dlt` deals with this variety via modules which you declare when creating a pipeline. +[Destination](glossary.md#destination) is a location in which `dlt` creates and maintains the current version of the schema and loads your data. Destinations come in various forms: databases, datalakes, vector stores, or files. `dlt` deals with this variety via modules which you declare when creating a pipeline. We maintain a set of [built-in destinations](../dlt-ecosystem/destinations/) that you can use right away. ## Declare the destination type -We recommend that you declare the destination type when creating a pipeline instance with `dlt.pipeline`. This allows the `run` method to synchronize your local pipeline state with destination and `extract` and `normalize` to create compatible load packages and schemas. You can also pass the destination to `run` and `load` methods. +We recommend that you declare the destination type when creating a pipeline instance with `dlt.pipeline`. This allows the `run` method to synchronize your local pipeline state with the destination and `extract` and `normalize` to create compatible load packages and schemas. You can also pass the destination to `run` and `load` methods. * Use destination **shorthand type** -Above we want to use **filesystem** built-in destination. You can use shorthand types only for built-ins. +Above we want to use the **filesystem** built-in destination. You can use shorthand types only for built-ins. * Use full **destination factory type** -Above we use built in **filesystem** destination by providing a factory type `filesystem` from module `dlt.destinations`. You can pass [destinations from external modules](#declare-external-destination) as well. +Above we use the built-in **filesystem** destination by providing a factory type `filesystem` from the module `dlt.destinations`. You can pass [destinations from external modules](#declare-external-destination) as well. * Import **destination factory** -Above we import destination factory for **filesystem** and pass it to the pipeline. +Above we import the destination factory for **filesystem** and pass it to the pipeline. 
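+For quick reference, the three equivalent declarations above can be put side by side. This is a minimal sketch; the pipeline name `my_pipeline` is illustrative:
+
+```py
+import dlt
+from dlt.destinations import filesystem
+
+# 1. Built-in shorthand type
+pipeline = dlt.pipeline("my_pipeline", destination="filesystem")
+
+# 2. Full factory type given as a string
+pipeline = dlt.pipeline("my_pipeline", destination="dlt.destinations.filesystem")
+
+# 3. Imported destination factory
+pipeline = dlt.pipeline("my_pipeline", destination=filesystem)
+```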
All examples above will create the same destination class with default parameters and pull required config and secret values from [configuration](credentials/index.md) - they are equivalent. - ### Pass explicit parameters and a name to a destination -You can instantiate **destination factory** yourself to configure it explicitly. When doing this you work with destinations the same way you work with [sources](source.md) +You can instantiate the **destination factory** yourself to configure it explicitly. When doing this you work with destinations the same way you work with [sources](source.md). -Above we import and instantiate the `filesystem` destination factory. We pass explicit url of the bucket and name the destination to `production_az_bucket`. - -If destination is not named, its shorthand type (the Python factory name) serves as a destination name. Name your destination explicitly if you need several separate configurations of destinations of the same type (i.e. you wish to maintain credentials for development, staging and production storage buckets in the same config file). Destination name is also stored in the [load info](../running-in-production/running.md#inspect-and-save-the-load-info-and-trace) and pipeline traces so use them also when you need more descriptive names (other than, for example, `filesystem`). +Above we import and instantiate the `filesystem` destination factory. We pass the explicit URL of the bucket and name the destination `production_az_bucket`. +If the destination is not named, its shorthand type (the Python factory name) serves as a destination name. Name your destination explicitly if you need several separate configurations of destinations of the same type (i.e. you wish to maintain credentials for development, staging, and production storage buckets in the same config file). The destination name is also stored in the [load info](../running-in-production/running.md#inspect-and-save-the-load-info-and-trace) and pipeline traces, so use them also when you need more descriptive names (other than, for example, `filesystem`). ## Configure a destination -We recommend to pass the credentials and other required parameters to configuration via TOML files, environment variables or other [config providers](credentials/setup). This allows you, for example, to easily switch to production destinations after deployment. +We recommend passing the credentials and other required parameters to configuration via TOML files, environment variables, or other [config providers](credentials/setup). This allows you, for example, to easily switch to production destinations after deployment. -We recommend to use the [default config section layout](credentials/setup#structure-of-secrets.toml-and-config.toml) as below: +We recommend using the [default config section layout](credentials/setup#structure-of-secrets.toml-and-config.toml) as below: or via environment variables: @@ -53,30 +51,29 @@ DESTINATION__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_NAME=dltdata DESTINATION__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_KEY="storage key" ``` -For named destinations you use their names in the config section +For named destinations, you use their names in the config section. - -Note that when you use [`dlt init` command](../walkthroughs/add-a-verified-source.md) to create or add a data source, `dlt` creates a sample configuration for selected destination. 
+Note that when you use the [`dlt init` command](../walkthroughs/add-a-verified-source.md) to create or add a data source, `dlt` creates a sample configuration for the selected destination. ### Pass explicit credentials -You can pass credentials explicitly when creating destination factory instance. This replaces the `credentials` argument in `dlt.pipeline` and `pipeline.load` methods - which is now deprecated. You can pass the required credentials object, its dictionary representation or the supported native form like below: +You can pass credentials explicitly when creating a destination factory instance. This replaces the `credentials` argument in `dlt.pipeline` and `pipeline.load` methods, which are now deprecated. You can pass the required credentials object, its dictionary representation, or the supported native form as shown below: :::tip -You can create and pass partial credentials and `dlt` will fill the missing data. Below we pass postgres connection string but without password and expect that it will be present in environment variables (or any other [config provider](credentials/setup)) +You can create and pass partial credentials, and `dlt` will fill in the missing data. Below, we pass a Postgres connection string without a password and expect that it will be present in environment variables (or any other [config provider](credentials/setup)). -Please read how to use [various built in credentials types](credentials/complex_types). +Please read how to use [various built-in credentials types](credentials/complex_types). ::: ### Inspect destination capabilities -[Destination capabilities](../walkthroughs/create-new-destination.md#3-set-the-destination-capabilities) tell `dlt` what given destination can and cannot do. For example it tells which file formats it can load, what is maximum query or identifier length. Inspect destination capabilities as follows: +[Destination capabilities](../walkthroughs/create-new-destination.md#3-set-the-destination-capabilities) tell `dlt` what a given destination can and cannot do. For example, it tells which file formats it can load and what the maximum query or identifier length is. Inspect destination capabilities as follows: ```py import dlt pipeline = dlt.pipeline("snowflake_test", destination="snowflake") @@ -84,13 +81,13 @@ print(dict(pipeline.destination.capabilities())) ``` ### Pass additional parameters and change destination capabilities -Destination factory accepts additional parameters that will be used to pre-configure it and change destination capabilities. +The destination factory accepts additional parameters that will be used to pre-configure it and change destination capabilities. ```py import dlt duck_ = dlt.destinations.duckdb(naming_convention="duck_case", recommended_file_size=120000) print(dict(duck_.capabilities())) ``` -Example above is overriding `naming_convention` and `recommended_file_size` in the destination capabilities. +The example above overrides `naming_convention` and `recommended_file_size` in the destination capabilities. ### Configure multiple destinations in a pipeline To configure multiple destinations within a pipeline, you need to provide the credentials for each destination in the "secrets.toml" file. This example demonstrates how to configure a BigQuery destination named `destination_one`: @@ -124,56 +121,56 @@ Similarly, you can assign multiple destinations to the same or different drivers ## Access a destination When loading data, `dlt` will access the destination in two cases: 1. 
At the beginning of the `run` method to sync the pipeline state with the destination (or if you call `pipeline.sync_destination` explicitly). -2. In the `pipeline.load` method - to migrate schema and load the load package. +2. In the `pipeline.load` method - to migrate the schema and load the load package. -Obviously, dlt will access the destination when you instantiate [sql_client](../dlt-ecosystem/transformations/sql.md). +Obviously, `dlt` will access the destination when you instantiate [sql_client](../dlt-ecosystem/transformations/sql.md). :::note -`dlt` will not import the destination dependencies or access destination configuration if access is not needed. You can build multi-stage pipelines where steps are executed in separate processes or containers - the `extract` and `normalize` step do not need destination dependencies, configuration and actual connection. +`dlt` will not import the destination dependencies or access the destination configuration if access is not needed. You can build multi-stage pipelines where steps are executed in separate processes or containers - the `extract` and `normalize` steps do not need destination dependencies, configuration, and actual connection. ::: + ## Control how `dlt` creates table, column and other identifiers -`dlt` maps identifiers found in the source data into destination identifiers (ie. table and columns names) using [naming conventions](naming-convention.md) which ensure that -character set, identifier length and other properties fit into what given destination can handle. For example our [default naming convention (**snake case**)](naming-convention.md#default-naming-convention-snake_case) converts all names in the source (ie. JSON document fields) into snake case, case insensitive identifiers. +`dlt` maps identifiers found in the source data into destination identifiers (i.e., table and column names) using [naming conventions](naming-convention.md) which ensure that the character set, identifier length, and other properties fit into what a given destination can handle. For example, our [default naming convention (**snake case**)](naming-convention.md#default-naming-convention-snake_case) converts all names in the source (i.e., JSON document fields) into snake case, case-insensitive identifiers. -Each destination declares its preferred naming convention, support for case sensitive identifiers and case folding function that case insensitive identifiers follow. For example: -1. Redshift - by default does not support case sensitive identifiers and converts all of them to lower case. -2. Snowflake - supports case sensitive identifiers and considers upper cased identifiers as case insensitive (which is the default case folding) -3. DuckDb - does not support case sensitive identifiers but does not case fold them so it preserves the original casing in the information schema. -4. Athena - does not support case sensitive identifiers and converts all of them to lower case. -5. BigQuery - all identifiers are case sensitive, there's no case insensitive mode available via case folding (but it can be enabled in dataset level). +Each destination declares its preferred naming convention, support for case-sensitive identifiers, and case folding function that case-insensitive identifiers follow. For example: +1. Redshift - by default, does not support case-sensitive identifiers and converts all of them to lower case. +2. 
Snowflake - supports case-sensitive identifiers and considers upper-cased identifiers as case-insensitive (which is the default case folding). +3. DuckDb - does not support case-sensitive identifiers but does not case fold them, so it preserves the original casing in the information schema. +4. Athena - does not support case-sensitive identifiers and converts all of them to lower case. +5. BigQuery - all identifiers are case-sensitive; there's no case-insensitive mode available via case folding (but it can be enabled at the dataset level). -You can change the naming convention used in [many different ways](naming-convention.md#configure-naming-convention), below we set the preferred naming convention on the Snowflake destination to `sql_cs` to switch Snowflake to case sensitive mode: +You can change the naming convention used in [many different ways](naming-convention.md#configure-naming-convention). Below, we set the preferred naming convention on the Snowflake destination to `sql_cs` to switch Snowflake to case-sensitive mode: ```py import dlt snow_ = dlt.destinations.snowflake(naming_convention="sql_cs_v1") ``` -Setting naming convention will impact all new schemas being created (ie. on first pipeline run) and will re-normalize all existing identifiers. +Setting the naming convention will impact all new schemas being created (i.e., on the first pipeline run) and will re-normalize all existing identifiers. :::caution -`dlt` prevents re-normalization of identifiers in tables that were already created at the destination. Use [refresh](pipeline.md#refresh-pipeline-data-and-state) mode to drop the data. You can also disable this behavior via [configuration](naming-convention.md#avoid-identifier-collisions) +`dlt` prevents re-normalization of identifiers in tables that were already created at the destination. Use [refresh](pipeline.md#refresh-pipeline-data-and-state) mode to drop the data. You can also disable this behavior via [configuration](naming-convention.md#avoid-identifier-collisions). ::: :::note -Destinations that support case sensitive identifiers but use case folding convention to enable case insensitive identifiers are configured in case insensitive mode by default. Examples: Postgres, Snowflake, Oracle. +Destinations that support case-sensitive identifiers but use case folding convention to enable case-insensitive identifiers are configured in case-insensitive mode by default. Examples: Postgres, Snowflake, Oracle. ::: :::caution -If you use case sensitive naming convention with case insensitive destination, `dlt` will: -1. Fail the load if it detects identifier collision due to case folding +If you use a case-sensitive naming convention with a case-insensitive destination, `dlt` will: +1. Fail the load if it detects an identifier collision due to case folding. 2. Warn if any case folding is applied by the destination. ::: -### Enable case sensitive identifiers support -Selected destinations may be configured so they start accepting case sensitive identifiers. For example, it is possible to set case sensitive collation on **mssql** database and then tell `dlt` about it. +### Enable case-sensitive identifiers support +Selected destinations may be configured so they start accepting case-sensitive identifiers. For example, it is possible to set case-sensitive collation on an **mssql** database and then tell `dlt` about it. 
```py from dlt.destinations import mssql dest_ = mssql(has_case_sensitive_identifiers=True, naming_convention="sql_cs_v1") ``` -Above we can safely use case sensitive naming convention without worrying of name collisions. +Above, we can safely use a case-sensitive naming convention without worrying about name collisions. You can configure the case sensitivity, **but configuring destination capabilities is not currently supported**. ```toml @@ -182,10 +179,11 @@ has_case_sensitive_identifiers=true ``` :::note -In most cases setting the flag above just indicates to `dlt` that you switched the case sensitive option on a destination. `dlt` will not do that for you. Refer to destination documentation for details. +In most cases, setting the flag above just indicates to `dlt` that you switched the case-sensitive option on a destination. `dlt` will not do that for you. Refer to the destination documentation for details. ::: ## Create new destination You have two ways to implement a new destination: -1. You can use `@dlt.destination` decorator and [implement a sink function](../dlt-ecosystem/destinations/destination.md). This is perfect way to implement reverse ETL destinations that push data back to REST APIs. -2. You can implement [a full destination](../walkthroughs/create-new-destination.md) where you have a full control over load jobs and schema migration. +1. You can use the `@dlt.destination` decorator and [implement a sink function](../dlt-ecosystem/destinations/destination.md). This is a perfect way to implement reverse ETL destinations that push data back to REST APIs. +2. You can implement [a full destination](../walkthroughs/create-new-destination.md) where you have full control over load jobs and schema migration. + diff --git a/docs/website/docs/general-usage/full-loading.md b/docs/website/docs/general-usage/full-loading.md index 434615fecf..1c4d46a6a2 100644 --- a/docs/website/docs/general-usage/full-loading.md +++ b/docs/website/docs/general-usage/full-loading.md @@ -5,9 +5,7 @@ keywords: [full loading, loading methods, replace] --- # Full loading -Full loading is the act of fully reloading the data of your tables. All existing data -will be removed and replaced by whatever the source produced on this run. Resources -that are not selected while performing a full load will not replace any data in the destination. +Full loading is the act of fully reloading the data of your tables. All existing data will be removed and replaced by whatever the source produced on this run. Resources that are not selected while performing a full load will not replace any data in the destination. ## Performing a full load @@ -27,7 +25,7 @@ p.run(issues, write_disposition="replace", primary_key="id", table_name="issues" ## Choosing the correct replace strategy for your full load -`dlt` implements three different strategies for doing a full load on your table: `truncate-and-insert`, `insert-from-staging` and `staging-optimized`. The exact behaviour of these strategies can also vary between the available destinations. +`dlt` implements three different strategies for doing a full load on your table: `truncate-and-insert`, `insert-from-staging`, and `staging-optimized`. The exact behavior of these strategies can also vary between the available destinations. You can select a strategy with a setting in your `config.toml` file. If you do not select a strategy, dlt will default to `truncate-and-insert`. 
@@ -39,32 +37,19 @@ replace_strategy = "staging-optimized" ### The `truncate-and-insert` strategy -The `truncate-and-insert` replace strategy is the default and the fastest of all three strategies. If you load data with this setting, then the -destination tables will be truncated at the beginning of the load and the new data will be inserted consecutively but not within the same transaction. -The downside of this strategy is, that your tables will have no data for a while until the load is completed. You -may end up with new data in some tables and no data in other tables if the load fails during the run. Such incomplete load may be however detected by checking if the -[_dlt_loads table contains load id](destination-tables.md#load-packages-and-load-ids) from _dlt_load_id of the replaced tables. If you prefer to have no data downtime, please use one of the other strategies. +The `truncate-and-insert` replace strategy is the default and the fastest of all three strategies. If you load data with this setting, then the destination tables will be truncated at the beginning of the load and the new data will be inserted consecutively but not within the same transaction. The downside of this strategy is that your tables will have no data for a while until the load is completed. You may end up with new data in some tables and no data in other tables if the load fails during the run. Such an incomplete load may, however, be detected by checking if the [_dlt_loads table contains load id](destination-tables.md#load-packages-and-load-ids) from _dlt_load_id of the replaced tables. If you prefer to have no data downtime, please use one of the other strategies. ### The `insert-from-staging` strategy -The `insert-from-staging` is the slowest of all three strategies. It will load all new data into staging tables away from your final destination tables and will then truncate and insert the new data in one transaction. -It also maintains a consistent state between nested and root tables at all times. Use this strategy if you have the requirement for consistent destination datasets with zero downtime and the `optimized` strategy does not work for you. -This strategy behaves the same way across all destinations. +The `insert-from-staging` is the slowest of all three strategies. It will load all new data into staging tables away from your final destination tables and will then truncate and insert the new data in one transaction. It also maintains a consistent state between nested and root tables at all times. Use this strategy if you have the requirement for consistent destination datasets with zero downtime and the `optimized` strategy does not work for you. This strategy behaves the same way across all destinations. ### The `staging-optimized` strategy -The `staging-optimized` strategy has all the upsides of the `insert-from-staging` but implements certain optimizations for faster loading on some destinations. -This comes at the cost of destination tables being dropped and recreated in some cases, which will mean that any views or other constraints you have -placed on those tables will be dropped with the table. If you have a setup where you need to retain your destination tables, do not use the `staging-optimized` -strategy. If you do not care about tables being dropped but need the upsides of the `insert-from-staging` with some performance (and cost) saving -opportunities, you should use this strategy. 
The `staging-optimized` strategy behaves differently across destinations: +The `staging-optimized` strategy has all the upsides of the `insert-from-staging` but implements certain optimizations for faster loading on some destinations. This comes at the cost of destination tables being dropped and recreated in some cases, which will mean that any views or other constraints you have placed on those tables will be dropped with the table. If you have a setup where you need to retain your destination tables, do not use the `staging-optimized` strategy. If you do not care about tables being dropped but need the upsides of the `insert-from-staging` with some performance (and cost) saving opportunities, you should use this strategy. The `staging-optimized` strategy behaves differently across destinations: * Postgres: After loading the new data into the staging tables, the destination tables will be dropped and replaced by the staging tables. No data needs to be moved, so this strategy is almost as fast as `truncate-and-insert`. -* bigquery: After loading the new data into the staging tables, the destination tables will be dropped and - recreated with a [clone command](https://cloud.google.com/bigquery/docs/table-clones-create) from the staging tables. This is a low cost and fast way to create a second independent table from the data of another. Learn - more about [table cloning on bigquery](https://cloud.google.com/bigquery/docs/table-clones-intro). -* snowflake: After loading the new data into the staging tables, the destination tables will be dropped and - recreated with a [clone command](https://docs.snowflake.com/en/sql-reference/sql/create-clone) from the staging tables. This is a low cost and fast way to create a second independent table from the data of another. Learn - more about [table cloning on snowflake](https://docs.snowflake.com/en/user-guide/object-clone). +* bigquery: After loading the new data into the staging tables, the destination tables will be dropped and recreated with a [clone command](https://cloud.google.com/bigquery/docs/table-clones-create) from the staging tables. This is a low-cost and fast way to create a second independent table from the data of another. Learn more about [table cloning on bigquery](https://cloud.google.com/bigquery/docs/table-clones-intro). +* snowflake: After loading the new data into the staging tables, the destination tables will be dropped and recreated with a [clone command](https://docs.snowflake.com/en/sql-reference/sql/create-clone) from the staging tables. This is a low-cost and fast way to create a second independent table from the data of another. Learn more about [table cloning on snowflake](https://docs.snowflake.com/en/user-guide/object-clone). For all other [destinations](../dlt-ecosystem/destinations/index.md), please look at their respective documentation pages to see if and how the `staging-optimized` strategy is implemented. If it is not implemented, `dlt` will fall back to the `insert-from-staging` strategy. + diff --git a/docs/website/docs/general-usage/glossary.md b/docs/website/docs/general-usage/glossary.md index 5ae256b268..ba854bb33d 100644 --- a/docs/website/docs/general-usage/glossary.md +++ b/docs/website/docs/general-usage/glossary.md @@ -8,13 +8,13 @@ keywords: [glossary, resource, source, pipeline] ## [Source](source) -Location that holds data with certain structure. Organized into one or more resources. +Location that holds data with a certain structure. Organized into one or more resources. 
- If endpoints in an API are the resources, then the API is the source. - If tabs in a spreadsheet are the resources, then the source is the spreadsheet. - If tables in a database are the resources, then the source is the database. -Within this documentation, **source** refers also to the software component (i.e. Python function) +Within this documentation, **source** also refers to the software component (i.e., Python function) that **extracts** data from the source location using one or more resource components. ## [Resource](resource) @@ -26,38 +26,39 @@ origin. - If the source is a spreadsheet, then a resource is a tab in that spreadsheet. - If the source is a database, then a resource is a table in that database. -Within this documentation, **resource** refers also to the software component (i.e. Python function) -that **extracts** the data from source location. +Within this documentation, **resource** also refers to the software component (i.e., Python function) +that **extracts** the data from the source location. ## [Destination](../dlt-ecosystem/destinations) -The data store where data from the source is loaded (e.g. Google BigQuery). +The data store where data from the source is loaded (e.g., Google BigQuery). ## [Pipeline](pipeline) Moves the data from the source to the destination, according to instructions provided in the schema -(i.e. extracting, normalizing, and loading the data). +(i.e., extracting, normalizing, and loading the data). ## [Verified source](../walkthroughs/add-a-verified-source) A Python module distributed with `dlt init` that allows creating pipelines that extract data from a -particular **Source**. Such module is intended to be published in order for others to use it to +particular **Source**. Such a module is intended to be published in order for others to use it to build pipelines. -A source must be published to become "verified": which means that it has tests, test data, -demonstration scripts, documentation and the dataset produces was reviewed by a data engineer. +A source must be published to become "verified," which means that it has tests, test data, +demonstration scripts, documentation, and the dataset produced was reviewed by a data engineer. ## [Schema](schema) -Describes the structure of normalized data (e.g. unpacked tables, column types, etc.) and provides -instructions on how the data should be processed and loaded (i.e. it tells `dlt` about the content +Describes the structure of normalized data (e.g., unpacked tables, column types, etc.) and provides +instructions on how the data should be processed and loaded (i.e., it tells `dlt` about the content of the data and how to load it into the destination). ## [Config](credentials/setup#secrets.toml-and-config.toml) -A set of values that are passed to the pipeline at run time (e.g. to change its behavior locally vs. +A set of values that are passed to the pipeline at runtime (e.g., to change its behavior locally vs. in production). ## [Credentials](credentials/complex_types) A subset of configuration whose elements are kept secret and never shared in plain text. 
+ diff --git a/docs/website/docs/general-usage/incremental-loading.md b/docs/website/docs/general-usage/incremental-loading.md index 819ac2fb0c..4270b88d6f 100644 --- a/docs/website/docs/general-usage/incremental-loading.md +++ b/docs/website/docs/general-usage/incremental-loading.md @@ -7,10 +7,10 @@ keywords: [incremental loading, loading methods, append, merge] # Incremental loading Incremental loading is the act of loading only new or changed data and not old records that we -already loaded. It enables low-latency and low cost data transfer. +already loaded. It enables low-latency and low-cost data transfer. The challenge of incremental pipelines is that if we do not keep track of the state of the load -(i.e. which increments were loaded and which are to be loaded). Read more about state +(i.e., which increments were loaded and which are to be loaded). Read more about state [here](state.md). ## Choosing a write disposition @@ -18,11 +18,11 @@ The challenge of incremental pipelines is that if we do not keep track of the st ### The 3 write dispositions: - **Full load**: replaces the destination dataset with whatever the source produced on this run. To -achieve this, use `write_disposition='replace'` in your resources. Learn more in the [full loading docs](./full-loading.md) +achieve this, use `write_disposition='replace'` in your resources. Learn more in the [full loading docs](./full-loading.md). - **Append**: appends the new data to the destination. Use `write_disposition='append'`. -- **Merge**: Merges new data to the destination using `merge_key` and/or deduplicates/upserts new data +- **Merge**: merges new data to the destination using `merge_key` and/or deduplicates/upserts new data using `primary_key`. Use `write_disposition='merge'`. ### Two simple questions determine the write disposition you use @@ -33,18 +33,18 @@ using `primary_key`. Use `write_disposition='merge'`. -The "write disposition" you choose depends on the data set and how you can extract it. +The "write disposition" you choose depends on the dataset and how you can extract it. To find the "write disposition" you should use, the first question you should ask yourself is "Is my -data stateful or stateless"? Stateful data has a state that is subject to change - for example a -user's profile Stateless data cannot change - for example, a recorded event, such as a page view. +data stateful or stateless?" Stateful data has a state that is subject to change - for example, a +user's profile. Stateless data cannot change - for example, a recorded event, such as a page view. Because stateless data does not need to be updated, we can just append it. For stateful data, comes a second question - Can I extract it incrementally from the source? If yes, you should use [slowly changing dimensions (Type-2)](#scd2-strategy), which allow you to maintain historical records of data changes over time. -If not, then we need to replace the entire data set. If however we can request the data incrementally such -as "all users added or modified since yesterday" then we can simply apply changes to our existing +If not, then we need to replace the entire dataset. If, however, we can request the data incrementally such +as "all users added or modified since yesterday," then we can simply apply changes to our existing dataset with the merge write disposition. ## Merge incremental loading @@ -59,19 +59,12 @@ The `merge` write disposition can be used with three different strategies: The default `delete-insert` strategy is used in two scenarios: -1. 
You want to keep only one instance of certain record i.e. you receive updates of the `user` state - from an API and want to keep just one record per `user_id`. -1. You receive data in daily batches, and you want to make sure that you always keep just a single - instance of a record for each batch even in case you load an old batch or load the current batch - several times a day (i.e. to receive "live" updates). +1. You want to keep only one instance of a certain record, i.e., you receive updates of the `user` state from an API and want to keep just one record per `user_id`. +2. You receive data in daily batches, and you want to make sure that you always keep just a single instance of a record for each batch, even in case you load an old batch or load the current batch several times a day (i.e., to receive "live" updates). -The `delete-insert` strategy loads data to a `staging` dataset, deduplicates the staging data if a -`primary_key` is provided, deletes the data from the destination using `merge_key` and `primary_key`, -and then inserts the new records. All of this happens in a single atomic transaction for a root and all -nested tables. +The `delete-insert` strategy loads data to a `staging` dataset, deduplicates the staging data if a `primary_key` is provided, deletes the data from the destination using `merge_key` and `primary_key`, and then inserts the new records. All of this happens in a single atomic transaction for a root and all nested tables. -Example below loads all the GitHub events and updates them in the destination using "id" as primary -key, making sure that only a single copy of event is present in `github_repo_events` table: +The example below loads all the GitHub events and updates them in the destination using "id" as the primary key, making sure that only a single copy of the event is present in the `github_repo_events` table: ```py @dlt.resource(primary_key="id", write_disposition="merge") @@ -99,8 +92,7 @@ def resource(): ... ``` -Example below merges on a column `batch_day` that holds the day for which given record is valid. -Merge keys also can be compound: +The example below merges on a column `batch_day` that holds the day for which a given record is valid. Merge keys can also be compound: ```py @dlt.resource(merge_key="batch_day", write_disposition="merge") @@ -108,9 +100,7 @@ def get_daily_batch(day): yield _get_batch_from_bucket(day) ``` -As with any other write disposition you can use it to load data ad hoc. Below we load issues with -top reactions for `duckdb` repo. The lists have, obviously, many overlapping issues, but we want to -keep just one instance of each. +As with any other write disposition, you can use it to load data ad hoc. Below we load issues with top reactions for the `duckdb` repo. The lists have, obviously, many overlapping issues, but we want to keep just one instance of each. ```py p = dlt.pipeline(destination="bigquery", dataset_name="github") @@ -124,33 +114,30 @@ for reaction in reactions: p.run(issues, write_disposition="merge", primary_key="id", table_name="issues") ``` -Example below dispatches GitHub events to several tables by event type, keeps one copy of each event -by "id" and skips loading of past records using "last value" incremental. As you can see, all of -this we can just declare in our resource. +The example below dispatches GitHub events to several tables by event type, keeps one copy of each event by "id" and skips loading of past records using "last value" incremental. 
As you can see, all of this can be declared in our resource. ```py @dlt.resource(primary_key="id", write_disposition="merge", table_name=lambda i: i['type']) def github_repo_events(last_created_at = dlt.sources.incremental("created_at", "1970-01-01T00:00:00Z")): - """A resource taking a stream of github events and dispatching them to tables named by event type. Deduplicates be 'id'. Loads incrementally by 'created_at' """ + """A resource taking a stream of GitHub events and dispatching them to tables named by event type. Deduplicates by 'id'. Loads incrementally by 'created_at' """ yield from _get_rest_pages("events") ``` :::note -If you use the `merge` write disposition, but do not specify merge or primary keys, merge will fallback to `append`. -The appended data will be inserted from a staging table in one transaction for most destinations in this case. +If you use the `merge` write disposition but do not specify merge or primary keys, merge will fall back to `append`. The appended data will be inserted from a staging table in one transaction for most destinations in this case. ::: #### Delete records The `hard_delete` column hint can be used to delete records from the destination dataset. The behavior of the delete mechanism depends on the data type of the column marked with the hint: -1) `bool` type: only `True` leads to a delete—`None` and `False` values are disregarded -2) other types: each `not None` value leads to a delete +1) `bool` type: only `True` leads to a delete—`None` and `False` values are disregarded. +2) Other types: each `not None` value leads to a delete. Each record in the destination table with the same `primary_key` or `merge_key` as a record in the source dataset that's marked as a delete will be deleted. Deletes are propagated to any nested table that might exist. For each record that gets deleted in the root table, all corresponding records in the nested table(s) will also be deleted. Records in parent and nested tables are linked through the `root key` that is explained in the next section. -##### Example: with primary key and boolean delete column +##### Example: With primary key and boolean delete column ```py @dlt.resource( primary_key="id", @@ -158,54 +145,54 @@ Deletes are propagated to any nested table that might exist. For each record tha columns={"deleted_flag": {"hard_delete": True}} ) def resource(): - # this will insert a record (assuming a record with id = 1 does not yet exist) + # This will insert a record (assuming a record with id = 1 does not yet exist) yield {"id": 1, "val": "foo", "deleted_flag": False} - # this will update the record + # This will update the record yield {"id": 1, "val": "bar", "deleted_flag": None} - # this will delete the record + # This will delete the record yield {"id": 1, "val": "foo", "deleted_flag": True} - # similarly, this would have also deleted the record - # only the key and the column marked with the "hard_delete" hint suffice to delete records + # Similarly, this would have also deleted the record + # Only the key and the column marked with the "hard_delete" hint suffice to delete records yield {"id": 1, "deleted_flag": True} ... 
``` -##### Example: with merge key and non-boolean delete column +##### Example: With merge key and non-boolean delete column ```py @dlt.resource( merge_key="id", write_disposition="merge", columns={"deleted_at_ts": {"hard_delete": True}}) def resource(): - # this will insert two records + # This will insert two records yield [ {"id": 1, "val": "foo", "deleted_at_ts": None}, {"id": 1, "val": "bar", "deleted_at_ts": None} ] - # this will delete two records + # This will delete two records yield {"id": 1, "val": "foo", "deleted_at_ts": "2024-02-22T12:34:56Z"} ... ``` -##### Example: with primary key and "dedup_sort" hint +##### Example: With primary key and "dedup_sort" hint ```py @dlt.resource( primary_key="id", write_disposition="merge", columns={"deleted_flag": {"hard_delete": True}, "lsn": {"dedup_sort": "desc"}}) def resource(): - # this will insert one record (the one with lsn = 3) + # This will insert one record (the one with lsn = 3) yield [ {"id": 1, "val": "foo", "lsn": 1, "deleted_flag": None}, {"id": 1, "val": "baz", "lsn": 3, "deleted_flag": None}, {"id": 1, "val": "bar", "lsn": 2, "deleted_flag": True} ] - # this will insert nothing, because the "latest" record is a delete + # This will insert nothing, because the "latest" record is a delete yield [ {"id": 2, "val": "foo", "lsn": 1, "deleted_flag": False}, {"id": 2, "lsn": 2, "deleted_flag": True} @@ -219,11 +206,7 @@ Indexing is important for doing lookups by column value, especially for merge wr #### Forcing root key propagation -Merge write disposition requires that the `_dlt_id` (`row_key`) of root table is propagated to nested -tables. This concept is similar to foreign key but it always references the root (top level) table, skipping any intermediate parents -We call it `root key`. Root key is automatically propagated for all tables that have `merge` write disposition -set. We do not enable it everywhere because it takes storage space. Nevertheless, is some cases you -may want to permanently enable root key propagation. +Merge write disposition requires that the `_dlt_id` (`row_key`) of the root table is propagated to nested tables. This concept is similar to a foreign key, but it always references the root (top-level) table, skipping any intermediate parents. We call it `root key`. The root key is automatically propagated for all tables that have `merge` write disposition set. We do not enable it everywhere because it takes storage space. Nevertheless, in some cases, you may want to permanently enable root key propagation. ```py pipeline = dlt.pipeline( @@ -233,51 +216,54 @@ pipeline = dlt.pipeline( dev_mode=True ) fb_ads = facebook_ads_source() -# enable root key propagation on a source that is not a merge one by default. -# this is not required if you always use merge but below we start with replace +# Enable root key propagation on a source that is not a merge one by default. +# This is not required if you always use merge but below we start with replace fb_ads.root_key = True -# load only disapproved ads +# Load only disapproved ads fb_ads.ads.bind(states=("DISAPPROVED", )) info = pipeline.run(fb_ads.with_resources("ads"), write_disposition="replace") -# merge the paused ads. the disapproved ads stay there! +``` + +# Merge the paused ads. The disapproved ads stay there! fb_ads = facebook_ads_source() fb_ads.ads.bind(states=("PAUSED", )) info = pipeline.run(fb_ads.with_resources("ads"), write_disposition="merge") ``` -In example above we enforce the root key propagation with `fb_ads.root_key = True`. 
This ensures -that correct data is propagated on initial `replace` load so the future `merge` load can be +In the example above, we enforce the root key propagation with `fb_ads.root_key = True`. This ensures +that correct data is propagated on the initial `replace` load so the future `merge` load can be executed. You can achieve the same in the decorator `@dlt.source(root_key=True)`. ### `scd2` strategy -`dlt` can create [Slowly Changing Dimension Type 2](https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row) (SCD2) destination tables for dimension tables that change in the source. The resource is expected to provide a full extract of the source table each run. A row hash is stored in `_dlt_id` and used as surrogate key to identify source records that have been inserted, updated, or deleted. A `NULL` value is used by default to indicate an active record, but it's possible to use a configurable high timestamp (e.g. 9999-12-31 00:00:00.000000) instead. +`dlt` can create [Slowly Changing Dimension Type 2](https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row) (SCD2) destination tables for dimension tables that change in the source. The resource is expected to provide a full extract of the source table each run. A row hash is stored in `_dlt_id` and used as a surrogate key to identify source records that have been inserted, updated, or deleted. A `NULL` value is used by default to indicate an active record, but it's possible to use a configurable high timestamp (e.g., 9999-12-31 00:00:00.000000) instead. :::note -The `unique` hint for `_dlt_id` in the root table is set to `false` when using `scd2`. This differs from [default behavior](./destination-tables.md#child-and-parent-tables). The reason is that the surrogate key stored in `_dlt_id` contains duplicates after an _insert-delete-reinsert_ pattern: -1. record with surrogate key X is inserted in a load at `t1` -2. record with surrogate key X is deleted in a later load at `t2` -3. record with surrogate key X is reinserted in an even later load at `t3` +The `unique` hint for `_dlt_id` in the root table is set to `false` when using `scd2`. This differs from [default behavior](./destination-tables.md#child-and-parent-tables). The reason is that the surrogate key stored in `_dlt_id` contains duplicates after an _insert-delete-reinsert_ pattern: +1. A record with surrogate key X is inserted in a load at `t1`. +2. The record with surrogate key X is deleted in a later load at `t2`. +3. The record with surrogate key X is reinserted in an even later load at `t3`. -After this pattern, the `scd2` table in the destination has two records for surrogate key X: one for validity window `[t1, t2]`, and one for `[t3, NULL]`. A duplicate value exists in `_dlt_id` because both records have the same surrogate key. +After this pattern, the `scd2` table in the destination has two records for surrogate key X: one for the validity window `[t1, t2]`, and one for `[t3, NULL]`. A duplicate value exists in `_dlt_id` because both records have the same surrogate key. Note that: -- the composite key `(_dlt_id, _dlt_valid_from)` is unique -- `_dlt_id` remains unique for nested tables—`scd2` does not affect this +- The composite key `(_dlt_id, _dlt_valid_from)` is unique. +- `_dlt_id` remains unique for nested tables—`scd2` does not affect this. 
::: + #### Example: `scd2` merge strategy ```py @dlt.resource( write_disposition={"disposition": "merge", "strategy": "scd2"} ) def dim_customer(): - # initial load + # Initial load yield [ {"customer_key": 1, "c1": "foo", "c2": 1}, {"customer_key": 2, "c1": "bar", "c2": 2} ] -pipeline.run(dim_customer()) # first run — 2024-04-09 18:27:53.734235 +pipeline.run(dim_customer()) # First run — 2024-04-09 18:27:53.734235 ... ``` @@ -291,13 +277,13 @@ pipeline.run(dim_customer()) # first run — 2024-04-09 18:27:53.734235 ```py ... def dim_customer(): - # second load — record for customer_key 1 got updated + # Second load — record for customer_key 1 got updated yield [ {"customer_key": 1, "c1": "foo_updated", "c2": 1}, {"customer_key": 2, "c1": "bar", "c2": 2} ] -pipeline.run(dim_customer()) # second run — 2024-04-09 22:13:07.943703 +pipeline.run(dim_customer()) # Second run — 2024-04-09 22:13:07.943703 ``` *`dim_customer` destination table after second run—inserted new record for `customer_key` 1 and retired old record by updating `_dlt_valid_to`:* @@ -311,12 +297,12 @@ pipeline.run(dim_customer()) # second run — 2024-04-09 22:13:07.943703 ```py ... def dim_customer(): - # third load — record for customer_key 2 got deleted + # Third load — record for customer_key 2 got deleted yield [ {"customer_key": 1, "c1": "foo_updated", "c2": 1}, ] -pipeline.run(dim_customer()) # third run — 2024-04-10 06:45:22.847403 +pipeline.run(dim_customer()) # Third run — 2024-04-10 06:45:22.847403 ``` *`dim_customer` destination table after third run—retired deleted record by updating `_dlt_valid_to`:* @@ -334,7 +320,7 @@ pipeline.run(dim_customer()) # third run — 2024-04-10 06:45:22.847403 write_disposition={ "disposition": "merge", "strategy": "scd2", - "validity_column_names": ["from", "to"], # will use "from" and "to" instead of default values + "validity_column_names": ["from", "to"], # Will use "from" and "to" instead of default values } ) def dim_customer(): @@ -349,7 +335,7 @@ You can configure the literal used to indicate an active record with `active_rec write_disposition={ "disposition": "merge", "strategy": "scd2", - # accepts various types of date/datetime objects + # Accepts various types of date/datetime objects "active_record_timestamp": "9999-12-31", } ) @@ -364,7 +350,7 @@ You can configure the "boundary timestamp" used for record validity windows with write_disposition={ "disposition": "merge", "strategy": "scd2", - # accepts various types of date/datetime objects + # Accepts various types of date/datetime objects "boundary_timestamp": "2024-08-21T12:15:00+00:00", } ) @@ -372,8 +358,8 @@ def dim_customer(): ... ``` -#### Example: use your own row hash -By default, `dlt` generates a row hash based on all columns provided by the resource and stores it in `_dlt_id`. You can use your own hash instead by specifying `row_version_column_name` in the `write_disposition` dictionary. You might already have a column present in your resource that can naturally serve as row hash, in which case it's more efficient to use those pre-existing hash values than to generate new artificial ones. This option also allows you to use hashes based on a subset of columns, in case you want to ignore changes in some of the columns. When using your own hash, values for `_dlt_id` are randomly generated. +#### Example: Use your own row hash +By default, `dlt` generates a row hash based on all columns provided by the resource and stores it in `_dlt_id`. 
You can use your own hash instead by specifying `row_version_column_name` in the `write_disposition` dictionary. You might already have a column present in your resource that can naturally serve as a row hash, in which case it's more efficient to use those pre-existing hash values than to generate new artificial ones. This option also allows you to use hashes based on a subset of columns, in case you want to ignore changes in some of the columns. When using your own hash, values for `_dlt_id` are randomly generated. ```py @dlt.resource( write_disposition={ @@ -387,9 +373,9 @@ def dim_customer(): ... ``` -#### 🧪 Use scd2 with Arrow Tables and Panda frames -`dlt` will not add **row hash** column to the tabular data automatically (we are working on it). -You need to do that yourself by adding a transform function to `scd2` resource that computes row hashes (using pandas.util, should be fairly fast). +#### 🧪 Use scd2 with Arrow tables and Panda frames +`dlt` will not add a **row hash** column to the tabular data automatically (we are working on it). +You need to do that yourself by adding a transform function to the `scd2` resource that computes row hashes (using pandas.util, should be fairly fast). ```py import dlt from dlt.sources.helpers.transform import add_row_hash_to_table @@ -404,10 +390,10 @@ scd2_r = dlt.resource( }, ).add_map(add_row_hash_to_table("row_hash")) ``` -`add_row_hash_to_table` is the name of the transform function that will compute and create `row_hash` column that is declared as holding the hash by `row_version_column_name`. +`add_row_hash_to_table` is the name of the transform function that will compute and create the `row_hash` column that is declared as holding the hash by `row_version_column_name`. :::tip -You can modify existing resources that yield data in tabular form by calling `apply_hints` and passing `scd2` config in `write_disposition` and then by +You can modify existing resources that yield data in tabular form by calling `apply_hints` and passing the `scd2` config in `write_disposition` and then by adding the transform with `add_map`. ::: @@ -416,9 +402,9 @@ Nested tables, if any, do not contain validity columns. Validity columns are onl #### Limitations -* You cannot use columns like `updated_at` or integer `version` of a record that are unique within a `primary_key` (even if it is defined). Hash column -must be unique for a root table. We are working to allow `updated_at` style tracking -* We do not detect changes in nested tables (except new records) if row hash of the corresponding parent row does not change. Use `updated_at` or similar +* You cannot use columns like `updated_at` or an integer `version` of a record that are unique within a `primary_key` (even if it is defined). The hash column +must be unique for a root table. We are working to allow `updated_at` style tracking. +* We do not detect changes in nested tables (except new records) if the row hash of the corresponding parent row does not change. Use `updated_at` or a similar column in the root table to stamp changes in nested data. * `merge_key(s)` are (for now) ignored. 
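+The tip above about reusing an existing tabular resource can look like the following minimal sketch; `my_arrow_source().customers` is an illustrative resource yielding Arrow tables or pandas frames, not part of the library:
+
+```py
+from dlt.sources.helpers.transform import add_row_hash_to_table
+
+# hypothetical existing resource that yields Arrow tables or pandas frames
+customers = my_arrow_source().customers
+
+# switch it to the scd2 merge strategy and declare which column holds the row hash
+customers.apply_hints(
+    write_disposition={
+        "disposition": "merge",
+        "strategy": "scd2",
+        "row_version_column_name": "row_hash",
+    }
+)
+
+# compute the row hash with the same transform used in the example above
+customers.add_map(add_row_hash_to_table("row_hash"))
+```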
@@ -432,12 +418,12 @@ The `upsert` merge strategy is currently supported for these destinations: - `mssql` - `postgres` - `snowflake` -- 🧪 `filesytem` with `delta` table format (see limitations [here](../dlt-ecosystem/destinations/filesystem.md#known-limitations)) +- 🧪 `filesystem` with `delta` table format (see limitations [here](../dlt-ecosystem/destinations/filesystem.md#known-limitations)) ::: The `upsert` merge strategy does primary-key based *upserts*: -- *update* record if key exists in target table -- *insert* record if key does not exist in target table +- *update* record if the key exists in the target table +- *insert* record if the key does not exist in the target table You can [delete records](#delete-records) with the `hard_delete` hint. @@ -462,18 +448,14 @@ def my_upsert_resource(): ## Incremental loading with a cursor field -In most of the REST APIs (and other data sources i.e. database tables) you can request new or updated -data by passing a timestamp or id of the "last" record to a query. The API/database returns just the -new/updated records from which you take maximum/minimum timestamp/id for the next load. +In most of the REST APIs (and other data sources, i.e., database tables), you can request new or updated data by passing a timestamp or ID of the "last" record to a query. The API/database returns just the new/updated records from which you take the maximum/minimum timestamp/ID for the next load. -To do incremental loading this way, we need to +To do incremental loading this way, we need to: -- figure which field is used to track changes (the so called **cursor field**) (e.g. “inserted_at”, "updated_at”, etc.); -- how to past the "last" (maximum/minimum) value of cursor field to an API to get just new / modified data (how we do this depends on the source API). +- Figure out which field is used to track changes (the so-called **cursor field**) (e.g., “inserted_at”, "updated_at”, etc.); +- Determine how to pass the "last" (maximum/minimum) value of the cursor field to an API to get just new/modified data (how we do this depends on the source API). -Once you've figured that out, `dlt` takes care of finding maximum/minimum cursor field values, removing -duplicates and managing the state with last values of cursor. Take a look at GitHub example below, where we -request recently created issues. +Once you've figured that out, `dlt` takes care of finding maximum/minimum cursor field values, removing duplicates, and managing the state with the last values of the cursor. Take a look at the GitHub example below, where we request recently created issues. ```py @dlt.resource(primary_key="id") @@ -489,31 +471,24 @@ def repo_issues( print(updated_at.last_value) ``` -Here we add `updated_at` argument that will receive incremental state, initialized to -`1970-01-01T00:00:00Z`. It is configured to track `updated_at` field in issues yielded by -`repo_issues` resource. It will store the newest `updated_at` value in `dlt` -[state](state.md) and make it available in `updated_at.start_value` on next pipeline -run. 
This value is inserted in `_get_issues_page` function into request query param **since** to [Github API](https://docs.github.com/en/rest/issues/issues?#list-repository-issues) - -In essence, `dlt.sources.incremental` instance above -* **updated_at.initial_value** which is always equal to "1970-01-01T00:00:00Z" passed in constructor -* **updated_at.start_value** a maximum `updated_at` value from the previous run or the **initial_value** on first run -* **updated_at.last_value** a "real time" `updated_at` value updated with each yielded item or page. before first yield it equals **start_value** -* **updated_at.end_value** (here not used) [marking end of backfill range](#using-end_value-for-backfill) +Here we add the `updated_at` argument that will receive incremental state, initialized to `1970-01-01T00:00:00Z`. It is configured to track the `updated_at` field in issues yielded by the `repo_issues` resource. It will store the newest `updated_at` value in `dlt` [state](state.md) and make it available in `updated_at.start_value` on the next pipeline run. This value is inserted in the `_get_issues_page` function into the request query param **since** to [GitHub API](https://docs.github.com/en/rest/issues/issues?#list-repository-issues). -When paginating you probably need **start_value** which does not change during the execution of the resource, however -most paginators will return a **next page** link which you should use. +In essence, the `dlt.sources.incremental` instance above: +* **updated_at.initial_value** which is always equal to "1970-01-01T00:00:00Z" passed in the constructor +* **updated_at.start_value** a maximum `updated_at` value from the previous run or the **initial_value** on the first run +* **updated_at.last_value** a "real-time" `updated_at` value updated with each yielded item or page. Before the first yield, it equals **start_value** +* **updated_at.end_value** (here not used) [marking the end of the backfill range](#using-end_value-for-backfill) -Behind the scenes, `dlt` will deduplicate the results ie. in case the last issue is returned again -(`updated_at` filter is inclusive) and skip already loaded ones. +When paginating, you probably need **start_value**, which does not change during the execution of the resource. However, most paginators will return a **next page** link which you should use. +Behind the scenes, `dlt` will deduplicate the results, i.e., in case the last issue is returned again (`updated_at` filter is inclusive) and skip already loaded ones. -In the example below we -incrementally load the GitHub events, where API does not let us filter for the newest events - it -always returns all of them. Nevertheless, `dlt` will load only the new items, filtering out all the -duplicates and past issues. +In the example below, we incrementally load the GitHub events, where the API does not let us filter for the newest events - it always returns all of them. Nevertheless, `dlt` will load only the new items, filtering out all the duplicates and past issues. 
 ```py
-# use naming function in table name to generate separate tables for each event
+# Use naming function in table name to generate separate tables for each event
 @dlt.resource(primary_key="id", table_name=lambda i: i['type'])    # type: ignore
 def repo_events(
     last_created_at = dlt.sources.incremental("created_at", initial_value="1970-01-01T00:00:00Z", last_value_func=max), row_order="desc"
@@ -523,35 +498,35 @@ def repo_events(
     yield page
 ```
 
-We just yield all the events and `dlt` does the filtering (using `id` column declared as
+We just yield all the events and `dlt` does the filtering (using the `id` column declared as
 `primary_key`).
 
-Github returns events ordered from newest to oldest. So we declare the `rows_order` as **descending** to [stop requesting more pages once the incremental value is out of range](#declare-row-order-to-not-request-unnecessary-data). We stop requesting more data from the API after finding the first event with `created_at` earlier than `initial_value`.
+GitHub returns events ordered from newest to oldest. So we declare the `rows_order` as **descending** to [stop requesting more pages once the incremental value is out of range](#declare-row-order-to-not-request-unnecessary-data). We stop requesting more data from the API after finding the first event with `created_at` earlier than `initial_value`.
 
 :::note
 `dlt.sources.incremental` is implemented as a [filter function](resource.md#filter-transform-and-pivot-data) that is executed **after** all other transforms
-you add with `add_map` / `add_filter`. This means that you can manipulate the data item before incremental filter sees it. For example:
-* you can create surrogate primary key from other columns
-* you can modify cursor value or create a new field composed from other fields
-* dump Pydantic models to Python dicts to allow incremental to find custost values
+you add with `add_map` / `add_filter`. This means that you can manipulate the data item before the incremental filter sees it. For example:
+* You can create a surrogate primary key from other columns.
+* You can modify the cursor value or create a new field composed of other fields.
+* Dump Pydantic models to Python dicts to allow incremental to find custom values.
 
 [Data validation with Pydantic](schema-contracts.md#use-pydantic-models-for-data-validation) happens **before** incremental filtering.
 :::
 
-### max, min or custom `last_value_func`
+### Max, min or custom `last_value_func`
 
-`dlt.sources.incremental` allows to choose a function that orders (compares) cursor values to current `last_value`.
-* The default function is built-in `max` which returns bigger value of the two
-* Another built-in `min` returns smaller value.
+`dlt.sources.incremental` allows choosing a function that orders (compares) cursor values to the current `last_value`.
+* The default function is the built-in `max` which returns the bigger value of the two.
+* Another built-in `min` returns the smaller value.
 
 You can pass your custom function as well. This lets you define
-`last_value` on nested types i.e. dictionaries and store indexes of last values, not just simple
+`last_value` on nested types, i.e., dictionaries, and store indexes of last values, not just simple
 types. The `last_value` argument is a [JSON Path](https://github.com/json-path/JsonPath#operators) and
 lets you select nested data (including the whole data item when `$` is used). 
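Before moving on to custom functions, here is a minimal sketch of the built-in `min` in action. `_get_events_before` is a hypothetical helper that pages through events older than a given timestamp:

```py
import dlt

@dlt.resource(primary_key="id")
def events_backwards(
    created_at=dlt.sources.incremental(
        "created_at",
        initial_value="2024-07-01T00:00:00Z",
        last_value_func=min,  # track the smallest cursor value seen so far
    )
):
    # _get_events_before is a hypothetical helper that returns pages of events
    # older than the passed timestamp
    for page in _get_events_before(created_at.last_value):
        yield page
```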
-Example below creates last value which is a dictionary holding a max `created_at` value for each +The example below creates a last value which is a dictionary holding a max `created_at` value for each created table name: -```py +```python def by_event_type(event): last_value = None if len(event) == 1: @@ -575,7 +550,7 @@ def get_events(last_created_at = dlt.sources.incremental("$", last_value_func=by ### Using `last_value_func` for lookback The example below uses the `last_value_func` to load data from the past month. -```py +```python def lookback(event): last_value = None if len(event) == 1: @@ -598,7 +573,9 @@ def get_events(last_created_at = dlt.sources.incremental("created_at", last_valu ``` ### Using `end_value` for backfill -You can specify both initial and end dates when defining incremental loading. Let's go back to our Github example: + +You can specify both initial and end dates when defining incremental loading. Let's go back to our GitHub example: + ```py @dlt.resource(primary_key="id") def repo_issues( @@ -606,16 +583,16 @@ def repo_issues( repository, created_at = dlt.sources.incremental("created_at", initial_value="1970-01-01T00:00:00Z", end_value="2022-07-01T00:00:00Z") ): - # get issues from created from last "created_at" value + # get issues created from the last "created_at" value for page in _get_issues_page(access_token, repository, since=created_at.start_value, until=created_at.end_value): yield page ``` -Above we use `initial_value` and `end_value` arguments of the `incremental` to define the range of issues that we want to retrieve -and pass this range to the Github API (`since` and `until`). As in the examples above, `dlt` will make sure that only the issues from -defined range are returned. + +Above, we use the `initial_value` and `end_value` arguments of the `incremental` to define the range of issues that we want to retrieve and pass this range to the GitHub API (`since` and `until`). As in the examples above, `dlt` will make sure that only the issues from the defined range are returned. Please note that when `end_date` is specified, `dlt` **will not modify the existing incremental state**. The backfill is **stateless** and: -1. You can run backfill and incremental load in parallel (ie. in Airflow DAG) in a single pipeline. + +1. You can run backfill and incremental load in parallel (i.e., in an Airflow DAG) in a single pipeline. 2. You can partition your backfill into several smaller chunks and run them in parallel as well. To define specific ranges to load, you can simply override the incremental argument in the resource, for example: @@ -634,36 +611,33 @@ august_issues = repo_issues( ... ``` -Note that `dlt`'s incremental filtering considers the ranges half closed. `initial_value` is inclusive, `end_value` is exclusive, so chaining ranges like above works without overlaps. - +Note that `dlt`'s incremental filtering considers the ranges half-closed. `initial_value` is inclusive, `end_value` is exclusive, so chaining ranges like above works without overlaps. ### Declare row order to not request unnecessary data -With `row_order` argument set, `dlt` will stop getting data from the data source (ie. Github API) if it detect that values of cursor field are out of range of **start** and **end** values. +With the `row_order` argument set, `dlt` will stop getting data from the data source (i.e., GitHub API) if it detects that values of the cursor field are out of range of **start** and **end** values. 
In particular: * `dlt` stops processing when the resource yields any item with an _equal or greater_ cursor value than the `end_value` and `row_order` is set to **asc**. (`end_value` is not included) * `dlt` stops processing when the resource yields any item with a _lower_ cursor value than the `last_value` and `row_order` is set to **desc**. (`last_value` is included) :::note -"higher" and "lower" here refers to when the default `last_value_func` is used (`max()`), -when using `min()` "higher" and "lower" are inverted. +"higher" and "lower" here refer to when the default `last_value_func` is used (`max()`). +When using `min()`, "higher" and "lower" are inverted. ::: :::caution -If you use `row_order`, **make sure that the data source returns ordered records** (ascending / descending) on the cursor field, -e.g. if an API returns results both higher and lower -than the given `end_value` in no particular order, data reading stops and you'll miss the data items that were out of order. +If you use `row_order`, **make sure that the data source returns ordered records** (ascending/descending) on the cursor field. +For example, if an API returns results both higher and lower than the given `end_value` in no particular order, data reading stops and you'll miss the data items that were out of order. ::: -Row order is the most useful when: +Row order is most useful when: -1. The data source does **not** offer start/end filtering of results (e.g. there is no `start_time/end_time` query parameter or similar) -2. The source returns results **ordered by the cursor field** +1. The data source does **not** offer start/end filtering of results (e.g., there is no `start_time/end_time` query parameter or similar). +2. The source returns results **ordered by the cursor field**. -The github events example is exactly such case. The results are ordered on cursor value descending but there's no way to tell API to limit returned items to those created before certain date. Without the `row_order` setting, we'd be getting all events, each time we extract the `github_events` resource. +The GitHub events example is exactly such a case. The results are ordered on cursor value descending, but there's no way to tell the API to limit returned items to those created before a certain date. Without the `row_order` setting, we'd be getting all events each time we extract the `github_events` resource. -In the same fashion the `row_order` can be used to **optimize backfill** so we don't continue -making unnecessary API requests after the end of range is reached. For example: +In the same fashion, the `row_order` can be used to **optimize backfill** so we don't continue making unnecessary API requests after the end of the range is reached. For example: ```py @dlt.resource(primary_key="id") @@ -682,22 +656,14 @@ def tickets( yield page ``` -In this example we're loading tickets from Zendesk. The Zendesk API yields items paginated and ordered by oldest to newest, -but only offers a `start_time` parameter for filtering so we cannot tell it to -stop getting data at `end_value`. Instead we set `row_order` to `asc` and `dlt` wil stop -getting more pages from API after first page with cursor value `updated_at` is found older -than `end_value`. +In this example, we're loading tickets from Zendesk. The Zendesk API yields items paginated and ordered from oldest to newest, but only offers a `start_time` parameter for filtering, so we cannot tell it to stop getting data at `end_value`. 
Instead, we set `row_order` to `asc` and `dlt` will stop getting more pages from the API after the first page with a cursor value `updated_at` is found older than `end_value`. :::caution -In rare cases when you use Incremental with a transformer, `dlt` will not be able to automatically close -generator associated with a row that is out of range. You can still call the `can_close()` method on -incremental and exit yield loop when true. +In rare cases when you use Incremental with a transformer, `dlt` will not be able to automatically close the generator associated with a row that is out of range. You can still call the `can_close()` method on incremental and exit the yield loop when true. ::: :::tip -The `dlt.sources.incremental` instance provides `start_out_of_range` and `end_out_of_range` -attributes which are set when the resource yields an element with a higher/lower cursor value than the -initial or end values. If you do not want `dlt` to stop processing automatically and instead to handle such events yourself, do not specify `row_order`: +The `dlt.sources.incremental` instance provides `start_out_of_range` and `end_out_of_range` attributes, which are set when the resource yields an element with a higher/lower cursor value than the initial or end values. If you do not want `dlt` to stop processing automatically and instead handle such events yourself, do not specify `row_order`: ```py @dlt.transformer(primary_key="id") def tickets( @@ -716,27 +682,16 @@ def tickets( # Stop loading when we reach the end value if updated_at.end_out_of_range: return - ``` ::: + + ### Deduplicate overlapping ranges with primary key -`Incremental` **does not** deduplicate datasets like **merge** write disposition does. It however -makes sure than when another portion of data is extracted, records that were previously loaded won't be -included again. `dlt` assumes that you load a range of data, where the lower bound is inclusive (ie. greater than equal). -This makes sure that you never lose any data but will also re-acquire some rows. -For example: you have a database table with an cursor field on `updated_at` which has a day resolution, then there's a high -chance that after you extract data on a given day, still more records will be added. When you extract on the next day, you -should reacquire data from the last day to make sure all records are present, this will however create overlap with data -from previous extract. - -By default, content hash (a hash of `json` representation of a row) will be used to deduplicate. -This may be slow so`dlt.sources.incremental` will inherit the primary key that is set on the resource. -You can optionally set a `primary_key` that is used exclusively to -deduplicate and which does not become a table hint. The same setting lets you disable the -deduplication altogether when empty tuple is passed. Below we pass `primary_key` directly to -`incremental` to disable deduplication. That overrides `delta` primary_key set in the resource: +`Incremental` **does not** deduplicate datasets like **merge** write disposition does. However, it ensures that when another portion of data is extracted, records that were previously loaded won't be included again. `dlt` assumes that you load a range of data where the lower bound is inclusive (i.e., greater than or equal to). This ensures that you never lose any data but will also re-acquire some rows. 
For example, if you have a database table with a cursor field on `updated_at` which has a day resolution, there's a high chance that after you extract data on a given day, more records will still be added. When you extract on the next day, you should reacquire data from the last day to ensure all records are present. This will, however, create overlap with data from the previous extract. + +By default, a content hash (a hash of the `json` representation of a row) will be used to deduplicate. This may be slow, so `dlt.sources.incremental` will inherit the primary key that is set on the resource. You can optionally set a `primary_key` that is used exclusively to deduplicate and which does not become a table hint. The same setting lets you disable the deduplication altogether when an empty tuple is passed. Below we pass `primary_key` directly to `incremental` to disable deduplication. That overrides the `delta` primary key set in the resource: ```py @dlt.resource(primary_key="delta") @@ -748,8 +703,7 @@ def some_data(last_timestamp=dlt.sources.incremental("item.ts", primary_key=())) ### Using `dlt.sources.incremental` with dynamically created resources -When resources are [created dynamically](source.md#create-resources-dynamically) it is possible to -use `dlt.sources.incremental` definition as well. +When resources are [created dynamically](source.md#create-resources-dynamically), it is possible to use the `dlt.sources.incremental` definition as well. ```py @dlt.source @@ -757,7 +711,7 @@ def stripe(): # declare a generator function def get_resource( endpoint: Endpoints, - created: dlt.sources.incremental=dlt.sources.incremental("created") + created: dlt.sources.incremental = dlt.sources.incremental("created") ): ... yield data @@ -772,21 +726,20 @@ def stripe(): )(endpoint) ``` -Please note that in the example above, `get_resource` is passed as a function to `dlt.resource` to -which we bind the endpoint: **dlt.resource(...)(endpoint)**. +Please note that in the example above, `get_resource` is passed as a function to `dlt.resource` to which we bind the endpoint: **dlt.resource(...)(endpoint)**. :::caution -The typical mistake is to pass a generator (not a function) as below: +A typical mistake is to pass a generator (not a function) as below: `yield dlt.resource(get_resource(endpoint), name=endpoint.value, write_disposition="merge", primary_key="id")`. -Here we call **get_resource(endpoint)** and that creates un-evaluated generator on which resource -is created. That prevents `dlt` from controlling the **created** argument during runtime and will -result in `IncrementalUnboundError` exception. +Here we call **get_resource(endpoint)**, and that creates an un-evaluated generator on which the resource is created. That prevents `dlt` from controlling the **created** argument during runtime and will result in an `IncrementalUnboundError` exception. ::: ### Using Airflow schedule for backfill and incremental loading -When [running in Airflow task](../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer.md#2-modify-dag-file), you can opt-in your resource to get the `initial_value`/`start_value` and `end_value` from Airflow schedule associated with your DAG. Let's assume that **Zendesk tickets** resource contains a year of data with thousands of tickets. We want to backfill the last year of data week by week and then continue incremental loading daily. 
+ +When [running in Airflow task](../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer.md#2-modify-dag-file), you can opt-in your resource to get the `initial_value`/`start_value` and `end_value` from the Airflow schedule associated with your DAG. Let's assume that the **Zendesk tickets** resource contains a year of data with thousands of tickets. We want to backfill the last year of data week by week and then continue incremental loading daily. + ```py @dlt.resource(primary_key="id") def tickets( @@ -801,11 +754,13 @@ def tickets( ): yield page ``` -We opt-in to Airflow scheduler by setting `allow_external_schedulers` to `True`: + +We opt-in to the Airflow scheduler by setting `allow_external_schedulers` to `True`: 1. When running on Airflow, the start and end values are controlled by Airflow and `dlt` [state](state.md) is not used. 2. In all other environments, the `incremental` behaves as usual, maintaining `dlt` state. -Let's generate a deployment with `dlt deploy zendesk_pipeline.py airflow-composer` and customize the dag: +Let's generate a deployment with `dlt deploy zendesk_pipeline.py airflow-composer` and customize the DAG: + ```py @dag( schedule_interval='@weekly', @@ -828,21 +783,22 @@ def zendesk_backfill_bigquery(): ) # select only incremental endpoints in support api data = zendesk_support().with_resources("tickets", "ticket_events", "ticket_metric_events") - # create the source, the "serialize" decompose option will converts dlt resources into Airflow tasks. use "none" to disable it + # create the source, the "serialize" decompose option will convert dlt resources into Airflow tasks. use "none" to disable it tasks.add_run(pipeline, data, decompose="serialize", trigger_rule="all_done", retries=0, provide_context=True) zendesk_backfill_bigquery() ``` + What got customized: -1. We use weekly schedule, and want to get the data from February 2023 (`start_date`) until end of July ('end_date'). -2. We make Airflow to generate all weekly runs (`catchup` is True). -2. We create `zendesk_support` resources where we select only the incremental resources we want to backfill. +1. We use a weekly schedule and want to get the data from February 2023 (`start_date`) until the end of July (`end_date`). +2. We make Airflow generate all weekly runs (`catchup` is True). +3. We create `zendesk_support` resources where we select only the incremental resources we want to backfill. -When you enable the DAG in Airflow, it will generate several runs and start executing them, starting in February and ending in August. Your resource will receive -subsequent weekly intervals starting with `2023-02-12, 00:00:00 UTC` to `2023-02-19, 00:00:00 UTC`. +When you enable the DAG in Airflow, it will generate several runs and start executing them, starting in February and ending in August. Your resource will receive subsequent weekly intervals starting with `2023-02-12, 00:00:00 UTC` to `2023-02-19, 00:00:00 UTC`. You can repurpose the DAG above to start loading new data incrementally after (or during) the backfill: + ```py @dag( schedule_interval='@daily', @@ -864,19 +820,19 @@ def zendesk_new_bigquery(): ) tasks.add_run(pipeline, zendesk_support(), decompose="serialize", trigger_rule="all_done", retries=0, provide_context=True) ``` -Above, we switch to daily schedule and disable catchup and end date. We also load all the support resources to the same dataset as backfill (`zendesk_support_data`). -If you want to run this DAG parallel with the backfill DAG, change the pipeline name ie. 
to `zendesk_support_new` as above. + +Above, we switch to a daily schedule and disable catchup and end date. We also load all the support resources to the same dataset as backfill (`zendesk_support_data`). If you want to run this DAG in parallel with the backfill DAG, change the pipeline name, i.e., to `zendesk_support_new` as above. **Under the hood** -Before `dlt` starts executing incremental resources, it looks for `data_interval_start` and `data_interval_end` Airflow task context variables. Those got mapped to `initial_value` and `end_value` of the -`Incremental` class: -1. `dlt` is smart enough to convert Airflow datetime to iso strings or unix timestamps if your resource is using them. In our example we instantiate `updated_at=dlt.sources.incremental[int]`, where we declare the last value type to be **int**. `dlt` can also infer type if you provide `initial_value` argument. + +Before `dlt` starts executing incremental resources, it looks for `data_interval_start` and `data_interval_end` Airflow task context variables. Those get mapped to `initial_value` and `end_value` of the `Incremental` class: +1. `dlt` is smart enough to convert Airflow datetime to ISO strings or Unix timestamps if your resource is using them. In our example, we instantiate `updated_at=dlt.sources.incremental[int]`, where we declare the last value type to be **int**. `dlt` can also infer type if you provide the `initial_value` argument. 2. If `data_interval_end` is in the future or is None, `dlt` sets the `end_value` to **now**. -3. If `data_interval_start` == `data_interval_end` we have a manually triggered DAG run. In that case `data_interval_end` will also be set to **now**. +3. If `data_interval_start` == `data_interval_end`, we have a manually triggered DAG run. In that case, `data_interval_end` will also be set to **now**. **Manual runs** -You can run DAGs manually but you must remember to specify the Airflow logical date of the run in the past (use Run with config option). For such run `dlt` will load all data from that past date until now. -If you do not specify the past date, a run with a range (now, now) will happen yielding no data. + +You can run DAGs manually, but you must remember to specify the Airflow logical date of the run in the past (use the Run with config option). For such a run, `dlt` will load all data from that past date until now. If you do not specify the past date, a run with a range (now, now) will happen, yielding no data. ### Reading incremental loading parameters from configuration @@ -913,8 +869,8 @@ Consider the example below for reading incremental loading parameters from "conf You can customize the incremental processing of dlt by setting the parameter `on_cursor_value_missing`. When loading incrementally with the default settings, there are two assumptions: -1. each row contains the cursor path -2. each row is expected to contain a value at the cursor path that is not `None`. +1. Each row contains the cursor path. +2. Each row is expected to contain a value at the cursor path that is not `None`. 
For example, the two following source data will raise an error: ```py @@ -937,8 +893,7 @@ def some_data_without_cursor_value(updated_at=dlt.sources.incremental("updated_a list(some_data_without_cursor_value()) ``` - -To process a data set where some records do not include the incremental cursor path or where the values at the cursor path are `None,` there are the following four options: +To process a data set where some records do not include the incremental cursor path or where the values at the cursor path are `None`, there are the following four options: 1. Configure the incremental load to raise an exception in case there is a row where the cursor path is missing or has the value `None` using `incremental(..., on_cursor_value_missing="raise")`. This is the default behavior. 2. Configure the incremental load to tolerate the missing cursor path and `None` values using `incremental(..., on_cursor_value_missing="include")`. @@ -961,7 +916,7 @@ assert result[1] == {"id": 2, "created_at": 2} assert result[2] == {"id": 3, "created_at": 4, "updated_at": None} ``` -If you do not want to import records without the cursor path or where the value at the cursor path is `None` use the following incremental configuration: +If you do not want to import records without the cursor path or where the value at the cursor path is `None`, use the following incremental configuration: ```py @dlt.resource @@ -977,7 +932,7 @@ assert len(result) == 1 ``` ### Transform records before incremental processing -If you want to load data that includes `None` values you can transform the records before the incremental processing. +If you want to load data that includes `None` values, you can transform the records before the incremental processing. You can add steps to the pipeline that [filter, transform, or pivot your data](../general-usage/resource.md#filter-transform-and-pivot-data). :::caution @@ -1013,15 +968,12 @@ result_filtered = list(without_none) assert len(result_filtered) == 2 ``` - ## Doing a full refresh You may force a full refresh of a `merge` and `append` pipelines: -1. In case of a `merge` the data in the destination is deleted and loaded fresh. Currently we do not - deduplicate data during the full refresh. -1. In case of `dlt.sources.incremental` the data is deleted and loaded from scratch. The state of - the incremental is reset to the initial value. +1. In the case of a `merge`, the data in the destination is deleted and loaded fresh. Currently, we do not deduplicate data during the full refresh. +1. In the case of `dlt.sources.incremental`, the data is deleted and loaded from scratch. The state of the incremental is reset to the initial value. Example: @@ -1035,31 +987,25 @@ p.run(merge_source().with_resources("merge_table"), write_disposition="replace") p.run(merge_source()) ``` -Passing write disposition to `replace` will change write disposition on all the resources in -`repo_events` during the run of the pipeline. +Passing write disposition to `replace` will change the write disposition on all the resources in `repo_events` during the run of the pipeline. ## Custom incremental loading with pipeline state -The pipeline state is a Python dictionary which gets committed atomically with the data; you can set -values in it in your resources and on next pipeline run, request them back. +The pipeline state is a Python dictionary that gets committed atomically with the data; you can set values in it in your resources and on the next pipeline run, request them back. 
-The pipeline state is in principle scoped to the resource - all values of the state set by resource -are private and isolated from any other resource. You can also access the source-scoped state which -can be shared across resources. +The pipeline state is, in principle, scoped to the resource - all values of the state set by a resource are private and isolated from any other resource. You can also access the source-scoped state, which can be shared across resources. [You can find more information on pipeline state here](state.md#pipeline-state). ### Preserving the last value in resource state. -For the purpose of preserving the “last value” or similar loading checkpoints, we can open a dlt -state dictionary with a key and a default value as below. When the resource is executed and the data -is loaded, the yielded resource data will be loaded at the same time with the update to the state. +For the purpose of preserving the “last value” or similar loading checkpoints, we can open a dlt state dictionary with a key and a default value as below. When the resource is executed and the data is loaded, the yielded resource data will be loaded at the same time with the update to the state. -In the two examples below you see how the `dlt.sources.incremental` is working under the hood. +In the two examples below, you see how the `dlt.sources.incremental` is working under the hood. ```py @resource() def tweets(): - # Get a last value from loaded metadata. If not exist, get None + # Get the last value from loaded metadata. If it does not exist, get None last_val = dlt.current.resource_state().setdefault("last_updated", None) # get data and yield it data = get_data(start_from=last_val) @@ -1068,13 +1014,12 @@ def tweets(): dlt.current.resource_state()["last_updated"] = data["last_timestamp"] ``` -If we keep a list or a dictionary in the state, we can modify the underlying values in the objects, -and thus we do not need to set the state back explicitly. +If we keep a list or a dictionary in the state, we can modify the underlying values in the objects, and thus we do not need to set the state back explicitly. ```py @resource() def tweets(): - # Get a last value from loaded metadata. If not exist, get None + # Get the last value from loaded metadata. If it does not exist, get None loaded_dates = dlt.current.resource_state().setdefault("days_loaded", []) # do stuff: get data and add new values to the list # `loaded_date` is a reference to the `dlt.current.resource_state()["days_loaded"]` list @@ -1083,24 +1028,18 @@ def tweets(): loaded_dates.append('2023-01-01') ``` -Step by step explanation of how to get or set the state: +Step-by-step explanation of how to get or set the state: -1. We can use the function `var = dlt.current.resource_state().setdefault("key", [])`. This allows - us to retrieve the values of `key`. If `key` was not set yet, we will get the default value `[]` - instead. -1. We now can treat `var` as a python list - We can append new values to it, or if applicable we can - read the values from previous loads. -1. On pipeline run, the data will load, and the new `var`'s value will get saved in the state. The - state is stored at the destination, so it will be available on subsequent runs. +1. We can use the function `var = dlt.current.resource_state().setdefault("key", [])`. This allows us to retrieve the values of `key`. If `key` was not set yet, we will get the default value `[]` instead. +2. 
We now can treat `var` as a Python list - We can append new values to it, or if applicable, we can read the values from previous loads. +3. On pipeline run, the data will load, and the new `var`'s value will get saved in the state. The state is stored at the destination, so it will be available on subsequent runs. ### Advanced state usage: storing a list of processed entities -Let’s look at the `player_games` resource from the chess pipeline. The chess API has a method to -request games archives for a given month. The task is to prevent the user to load the same month -data twice - even if the user makes a mistake and requests the same months range again: +Let’s look at the `player_games` resource from the chess pipeline. The chess API has a method to request game archives for a given month. The task is to prevent the user from loading the same month's data twice - even if the user makes a mistake and requests the same month's range again: - Our data is requested in 2 steps: - - Get all available archives URLs. + - Get all available archive URLs. - Get the data from each URL. - We will add the “chess archives” URLs to this list we created. - This will allow us to track what data we have loaded. @@ -1116,7 +1055,7 @@ def players_games(chess_url, players, start_month=None, end_month=None): # as far as python is concerned, this variable behaves like # loaded_archives_cache = state['archives'] or [] - # afterwards we can modify list, and finally + # afterwards we can modify the list, and finally # when the data is loaded, the cache is updated with our loaded_archives_cache # get archives for a given player @@ -1133,14 +1072,14 @@ def players_games(chess_url, players, start_month=None, end_month=None): print(f"skipping archive {url}") ``` -### Advanced state usage: tracking the last value for all search terms in Twitter API +### Advanced state usage: Tracking the last value for all search terms in Twitter API ```py @dlt.resource(write_disposition="append") def search_tweets(twitter_bearer_token=dlt.secrets.value, search_terms=None, start_time=None, end_time=None, last_value=None): headers = _headers(twitter_bearer_token) for search_term in search_terms: - # make cache for each term + # Make cache for each term last_value_cache = dlt.current.resource_state().setdefault(f"last_value_{search_term}", None) print(f'last_value_cache: {last_value_cache}') params = {...} @@ -1149,9 +1088,9 @@ def search_tweets(twitter_bearer_token=dlt.secrets.value, search_terms=None, sta for page in response: page['search_term'] = search_term last_id = page.get('meta', {}).get('newest_id', 0) - #set it back - not needed if we + # Set it back - not needed if we dlt.current.resource_state()[f"last_value_{search_term}"] = max(last_value_cache or 0, int(last_id)) - # print the value for each search term + # Print the value for each search term print(f'new_last_value_cache for term {search_term}: {last_value_cache}') yield page @@ -1161,11 +1100,11 @@ def search_tweets(twitter_bearer_token=dlt.secrets.value, search_terms=None, sta If you see that the incremental loading is not working as expected and the incremental values are not modified between pipeline runs, check the following: -1. Make sure the `destination`, `pipeline_name` and `dataset_name` are the same between pipeline runs. +1. Make sure the `destination`, `pipeline_name`, and `dataset_name` are the same between pipeline runs. 2. Check if `dev_mode` is `False` in the pipeline configuration. Check if `refresh` for associated sources and resources is not enabled. 
-3. Check the logs for `Bind incremental on ...` message. This message indicates that the incremental value was bound to the resource and shows the state of the incremental value. +3. Check the logs for the `Bind incremental on ...` message. This message indicates that the incremental value was bound to the resource and shows the state of the incremental value. 4. After the pipeline run, check the state of the pipeline. You can do this by running the following command: @@ -1217,3 +1156,4 @@ sources: ``` Verify that the `last_value` is updated between pipeline runs. + diff --git a/docs/website/docs/general-usage/naming-convention.md b/docs/website/docs/general-usage/naming-convention.md index 16898cf8d1..caca7290b3 100644 --- a/docs/website/docs/general-usage/naming-convention.md +++ b/docs/website/docs/general-usage/naming-convention.md @@ -1,12 +1,12 @@ --- title: Naming Convention -description: Control how dlt creates table, column and other identifiers +description: Control how dlt creates table, column, and other identifiers keywords: [identifiers, snake case, case sensitive, case insensitive, naming] --- -# Naming Convention -`dlt` creates table and column identifiers from the data. The data source, i.e. a stream of JSON documents, may have identifiers (i.e. key names in a dictionary) with any Unicode characters, of any length, and naming style. On the other hand, destinations require that you follow strict rules when you name tables, columns, or collections. -A good example is [Redshift](../dlt-ecosystem/destinations/redshift.md#naming-convention) that accepts case-insensitive alphanumeric identifiers with a maximum of 127 characters. +# Naming convention +`dlt` creates table and column identifiers from the data. The data source, i.e., a stream of JSON documents, may have identifiers (i.e., key names in a dictionary) with any Unicode characters, of any length, and naming style. On the other hand, destinations require that you follow strict rules when you name tables, columns, or collections. +A good example is [Redshift](../dlt-ecosystem/destinations/redshift.md#naming-convention), which accepts case-insensitive alphanumeric identifiers with a maximum of 127 characters. `dlt` groups tables from a single [source](source.md) in a [schema](schema.md). Each schema defines a **naming convention** that tells `dlt` how to translate identifiers to the namespace that the destination understands. Naming conventions are, in essence, functions that map strings from the source identifier format into the destination identifier format. For example, our **snake_case** (default) naming convention will translate the `DealFlow` source identifier into the `deal_flow` destination identifier. @@ -20,19 +20,19 @@ The standard behavior of `dlt` is to **use the same naming convention for all de ### Use default naming convention (snake_case) **snake_case** is a case-insensitive naming convention, converting source identifiers into lower-case snake case identifiers with a reduced alphabet. -- Spaces around identifiers are trimmed -- Keeps ASCII alphanumerics and underscores, replaces all other characters with underscores (with the exceptions below) -- Replaces `+` and `*` with `x`, `-` with `_`, `@` with `a`, and `|` with `l` +- Spaces around identifiers are trimmed. +- Keeps ASCII alphanumerics and underscores, replaces all other characters with underscores (with the exceptions below). +- Replaces `+` and `*` with `x`, `-` with `_`, `@` with `a`, and `|` with `l`. 
- Prepends `_` if the name starts with a number. - Multiples of `_` are converted into a single `_`. -- Replaces all trailing `_` with `x` +- Replaces all trailing `_` with `x`. Uses __ as a nesting separator for tables and flattened column names. :::tip If you do not like **snake_case**, your next safe option is **sql_ci**, which generates SQL-safe, lowercase, case-insensitive identifiers without any other transformations. To permanently change the default naming convention on a given machine: -1. set an environment variable `SCHEMA__NAMING` to `sql_ci_v1` OR -2. add the following line to your global `config.toml` (the one in your home dir, i.e. `~/.dlt/config.toml`) +1. Set an environment variable `SCHEMA__NAMING` to `sql_ci_v1` OR +2. Add the following line to your global `config.toml` (the one in your home dir, i.e., `~/.dlt/config.toml`): ```toml [schema] naming="sql_ci_v1" @@ -43,15 +43,15 @@ naming="sql_ci_v1" ### Pick the right identifier form when defining resources `dlt` keeps source (not normalized) identifiers during data [extraction](../reference/explainers/how-dlt-works.md#extract) and translates them during [normalization](../reference/explainers/how-dlt-works.md#normalize). For you, it means: 1. If you write a [transformer](resource.md#process-resources-with-dlttransformer) or a [mapping/filtering function](resource.md#filter-transform-and-pivot-data), you will see the original data, without any normalization. Use the source identifiers to access the dicts! -2. If you define a `primary_key` or `cursor` that participate in [cursor field incremental loading](incremental-loading.md#incremental-loading-with-a-cursor-field), use the source identifiers (`dlt` uses them to inspect source data, `Incremental` class is just a filtering function). -3. When defining any other hints, i.e. `columns` or `merge_key`, you can pick source or destination identifiers. `dlt` normalizes all hints together with your data. -4. The `Schema` object (i.e. obtained from the pipeline or from `dlt` source via `discover_schema`) **always contains destination (normalized) identifiers**. +2. If you define a `primary_key` or `cursor` that participates in [cursor field incremental loading](incremental-loading.md#incremental-loading-with-a-cursor-field), use the source identifiers (`dlt` uses them to inspect source data, `Incremental` class is just a filtering function). +3. When defining any other hints, i.e., `columns` or `merge_key`, you can pick source or destination identifiers. `dlt` normalizes all hints together with your data. +4. The `Schema` object (i.e., obtained from the pipeline or from `dlt` source via `discover_schema`) **always contains destination (normalized) identifiers**. ### Understand the identifier normalization Identifiers are translated from source to destination form in the **normalize** step. Here's how `dlt` picks the naming convention: * The default naming convention is **snake_case**. -* Each destination may define a preferred naming convention in [destination capabilities](destination.md#pass-additional-parameters-and-change-destination-capabilities). Some destinations (i.e. Weaviate) need a specialized naming convention and will override the default. +* Each destination may define a preferred naming convention in [destination capabilities](destination.md#pass-additional-parameters-and-change-destination-capabilities). Some destinations (i.e., Weaviate) need a specialized naming convention and will override the default. 
* You can [configure a naming convention explicitly](#set-and-adjust-naming-convention-explicitly). Such configuration overrides the destination settings. * This naming convention is used when new schemas are created. It happens when the pipeline is run for the first time. * Schemas preserve the naming convention when saved. Your running pipelines will maintain existing naming conventions if not requested otherwise. @@ -62,22 +62,24 @@ If you change the naming convention and `dlt` detects a change in the destinatio ::: ### Case-sensitive and insensitive destinations -Naming conventions declare if the destination identifiers they produce are case-sensitive or insensitive. This helps `dlt` to [generate case-sensitive / insensitive identifiers for the destinations that support both](destination.md#control-how-dlt-creates-table-column-and-other-identifiers). For example: if you pick a case-insensitive naming like **snake_case** or **sql_ci_v1**, with Snowflake, `dlt` will generate all uppercase identifiers that Snowflake sees as case-insensitive. If you pick a case-sensitive naming like **sql_cs_v1**, `dlt` will generate quoted case-sensitive identifiers that preserve identifier capitalization. +Naming conventions declare if the destination identifiers they produce are case-sensitive or insensitive. This helps `dlt` to [generate case-sensitive / insensitive identifiers for the destinations that support both](destination.md#control-how-dlt-creates-table-column-and-other-identifiers). For example, if you pick a case-insensitive naming like **snake_case** or **sql_ci_v1**, with Snowflake, `dlt` will generate all uppercase identifiers that Snowflake sees as case-insensitive. If you pick a case-sensitive naming like **sql_cs_v1**, `dlt` will generate quoted case-sensitive identifiers that preserve identifier capitalization. -Note that many destinations are exclusively case-insensitive, of which some preserve the casing of identifiers (i.e. **duckdb**) and some will case-fold identifiers when creating tables (i.e. **Redshift**, **Athena** do lowercase on the names). `dlt` is able to detect resulting identifier [collisions](#avoid-identifier-collisions) and stop the load process before data is mangled. +Note that many destinations are exclusively case-insensitive, of which some preserve the casing of identifiers (i.e., **duckdb**) and some will case-fold identifiers when creating tables (i.e., **Redshift**, **Athena** do lowercase on the names). `dlt` is able to detect resulting identifier [collisions](#avoid-identifier-collisions) and stop the load process before data is mangled. ### Identifier shortening Identifier shortening happens during normalization. `dlt` takes the maximum length of the identifier from the destination capabilities and will trim the identifiers that are too long. The default shortening behavior generates short deterministic hashes of the source identifiers and places them in the middle of the destination identifier. This (with a high probability) avoids shortened identifier collisions. ### 🚧 [WIP] Name convention changes are lossy -`dlt` does not store the source identifiers in the schema so when the naming convention changes (or we increase the maximum identifier length), it is not able to generate a fully correct set of new identifiers. Instead, it will re-normalize already normalized identifiers. We are currently working to store the full identifier lineage - source identifiers will be stored and mapped to the destination in the schema. 
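Before picking or changing a naming convention, it can help to check directly what a given convention will do to your identifiers. This is a rough sketch: the module path and the `max_length` argument are assumptions based on the module layout described in the next section and may differ between `dlt` versions:

```py
# Rough sketch: inspect how the default snake_case convention normalizes source identifiers.
from dlt.common.normalizers.naming.snake_case import NamingConvention

naming = NamingConvention(max_length=127)  # e.g., Redshift's identifier length limit
print(naming.normalize_identifier("DealFlow"))        # -> deal_flow
print(naming.normalize_identifier("Deal Flow 2024"))  # -> deal_flow_2024
```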
+`dlt` does not store the source identifiers in the schema, so when the naming convention changes (or we increase the maximum identifier length), it is not able to generate a fully correct set of new identifiers. Instead, it will re-normalize already normalized identifiers. We are currently working to store the full identifier lineage - source identifiers will be stored and mapped to the destination in the schema. ## Pick your own naming convention + + ### Configure naming convention -You can use `config.toml`, environment variables, or any other configuration provider to set the naming convention name. Configured naming convention **overrides all other settings** -- changes the naming convention stored in the already created schema -- overrides the destination capabilities preference. +You can use `config.toml`, environment variables, or any other configuration provider to set the naming convention name. The configured naming convention **overrides all other settings**: +- Changes the naming convention stored in the already created schema. +- Overrides the destination capabilities preference. ```toml [schema] naming="sql_ci_v1" @@ -139,27 +141,27 @@ Depending on the destination, certain names may not be allowed. To ensure your d ## Avoid identifier collisions `dlt` detects various types of identifier collisions and ignores the others. -1. `dlt` detects collisions if a case-sensitive naming convention is used on a case-insensitive destination -2. `dlt` detects collisions if a change of naming convention changes the identifiers of tables already created in the destination -3. `dlt` detects collisions when the naming convention is applied to column names of arrow tables +1. `dlt` detects collisions if a case-sensitive naming convention is used on a case-insensitive destination. +2. `dlt` detects collisions if a change of naming convention changes the identifiers of tables already created in the destination. +3. `dlt` detects collisions when the naming convention is applied to column names of arrow tables. `dlt` will not detect a collision when normalizing source data. If you have a dictionary, keys will be merged if they collide after being normalized. -You can create a custom naming convention that does not generate collisions on data, see examples below. - +You can create a custom naming convention that does not generate collisions on data; see examples below. ## Write your own naming convention Custom naming conventions are classes that derive from `NamingConvention` that you can import from `dlt.common.normalizers.naming`. We recommend the following module layout: -1. Each naming convention resides in a separate Python module (file) -2. The class is always named `NamingConvention` +1. Each naming convention resides in a separate Python module (file). +2. The class is always named `NamingConvention`. In that case, you can use a fully qualified module name in [schema configuration](#configure-naming-convention) or pass the module [explicitly](#set-and-adjust-naming-convention-explicitly). We include [two examples](../examples/custom_naming) of naming conventions that you may find useful: 1. A variant of `sql_ci` that generates identifier collisions with a low (user-defined) probability by appending a deterministic tag to each name. -2. A variant of `sql_cs` that allows for LATIN (i.e. umlaut) characters +2. A variant of `sql_cs` that allows for LATIN (i.e. umlaut) characters. 
:::note Note that a fully qualified name of your custom naming convention will be stored in the `Schema` and `dlt` will attempt to import it when the schema is loaded from storage. You should distribute your custom naming conventions with your pipeline code or via a pip package from which it can be imported. -::: \ No newline at end of file +::: + diff --git a/docs/website/docs/general-usage/pipeline.md b/docs/website/docs/general-usage/pipeline.md index 40f9419bc2..e8bd9b2e58 100644 --- a/docs/website/docs/general-usage/pipeline.md +++ b/docs/website/docs/general-usage/pipeline.md @@ -8,12 +8,12 @@ keywords: [pipeline, source, full refresh, dev mode] A [pipeline](glossary.md#pipeline) is a connection that moves the data from your Python code to a [destination](glossary.md#destination). The pipeline accepts `dlt` [sources](source.md) or -[resources](resource.md) as well as generators, async generators, lists and any iterables. -Once the pipeline runs, all resources get evaluated and the data is loaded at destination. +[resources](resource.md) as well as generators, async generators, lists, and any iterables. +Once the pipeline runs, all resources get evaluated and the data is loaded at the destination. Example: -This pipeline will load a list of objects into `duckdb` table with a name "three": +This pipeline will load a list of objects into a `duckdb` table with the name "three": ```py import dlt @@ -25,30 +25,30 @@ info = pipeline.run([{'id':1}, {'id':2}, {'id':3}], table_name="three") print(info) ``` -You instantiate a pipeline by calling `dlt.pipeline` function with following arguments: +You instantiate a pipeline by calling the `dlt.pipeline` function with the following arguments: -- `pipeline_name` a name of the pipeline that will be used to identify it in trace and monitoring +- `pipeline_name`: a name of the pipeline that will be used to identify it in trace and monitoring events and to restore its state and data schemas on subsequent runs. If not provided, `dlt` will - create pipeline name from the file name of currently executing Python module. -- `destination` a name of the [destination](../dlt-ecosystem/destinations) to which dlt - will load the data. May also be provided to `run` method of the `pipeline`. -- `dataset_name` a name of the dataset to which the data will be loaded. A dataset is a logical - group of tables i.e. `schema` in relational databases or folder grouping many files. May also be - provided later to the `run` or `load` methods of the pipeline. If not provided at all then + create a pipeline name from the file name of the currently executing Python module. +- `destination`: a name of the [destination](../dlt-ecosystem/destinations) to which dlt + will load the data. May also be provided to the `run` method of the `pipeline`. +- `dataset_name`: a name of the dataset to which the data will be loaded. A dataset is a logical + group of tables, i.e., `schema` in relational databases or a folder grouping many files. May also be + provided later to the `run` or `load` methods of the pipeline. If not provided at all, then defaults to the `pipeline_name`. -To load the data you call the `run` method and pass your data in `data` argument. +To load the data, you call the `run` method and pass your data in the `data` argument. Arguments: - `data` (the first argument) may be a dlt source, resource, generator function, or any Iterator / - Iterable (i.e. a list or the result of `map` function). + Iterable (i.e., a list or the result of the `map` function). 
- `write_disposition` controls how to write data to a table. Defaults to "append". - `append` will always add new data at the end of the table. - `replace` will replace existing data with new data. - `skip` will prevent data from loading. - `merge` will deduplicate and merge data based on `primary_key` and `merge_key` hints. -- `table_name` - specified in case when table name cannot be inferred i.e. from the resources or name +- `table_name` - specified in cases when the table name cannot be inferred, i.e., from the resources or the name of the generator function. Example: This pipeline will load the data the generator `generate_rows(10)` produces: @@ -70,24 +70,25 @@ print(info) ## Pipeline working directory Each pipeline that you create with `dlt` stores extracted files, load packages, inferred schemas, -execution traces and the [pipeline state](state.md) in a folder in the local filesystem. The default -location for such folders is in user home directory: `~/.dlt/pipelines/`. +execution traces, and the [pipeline state](state.md) in a folder in the local filesystem. The default +location for such folders is in the user home directory: `~/.dlt/pipelines/`. You can inspect stored artifacts using the command [dlt pipeline info](../reference/command-line-interface.md#dlt-pipeline) and [programmatically](../walkthroughs/run-a-pipeline.md#4-inspect-a-load-process). -> 💡 A pipeline with given name looks for its working directory in location above - so if you have two +> 💡 A pipeline with a given name looks for its working directory in the location above - so if you have two > pipeline scripts that create a pipeline with the same name, they will see the same working folder -> and share all the possible state. You may override the default location using `pipelines_dir` +> and share all the possible states. You may override the default location using the `pipelines_dir` > argument when creating the pipeline. -> 💡 You can attach `Pipeline` instance to an existing working folder, without creating a new +> 💡 You can attach a `Pipeline` instance to an existing working folder, without creating a new > pipeline with `dlt.attach`. -### Separate working environments with `pipelines_dir`. -You can run several pipelines with the same name but with different configuration ie. to target development / staging / production environments. -Set the `pipelines_dir` argument to store all the working folders in specific place. For example: +### Separate working environments with `pipelines_dir` + +You can run several pipelines with the same name but with different configurations, i.e., to target development/staging/production environments. Set the `pipelines_dir` argument to store all the working folders in a specific place. For example: + ```py import dlt from dlt.common.pipeline import get_dlt_pipelines_dir @@ -95,36 +96,29 @@ from dlt.common.pipeline import get_dlt_pipelines_dir dev_pipelines_dir = os.path.join(get_dlt_pipelines_dir(), "dev") pipeline = dlt.pipeline(destination="duckdb", dataset_name="sequence", pipelines_dir=dev_pipelines_dir) ``` -stores pipeline working folder in `~/.dlt/pipelines/dev/`. Mind that you need to pass this `~/.dlt/pipelines/dev/` -in to all cli commands to get info/trace for that pipeline. + +This stores the pipeline working folder in `~/.dlt/pipelines/dev/`. Note that you need to pass this `~/.dlt/pipelines/dev/` into all CLI commands to get info/trace for that pipeline. 
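To later inspect or reuse that working folder programmatically, you can attach to it instead of creating a new pipeline. A small sketch, assuming the pipeline above was created with the name `my_pipeline`:

```py
import os
import dlt
from dlt.common.pipeline import get_dlt_pipelines_dir

dev_pipelines_dir = os.path.join(get_dlt_pipelines_dir(), "dev")

# Attach to the existing working folder; "my_pipeline" is an assumed name -
# use whatever pipeline_name the pipeline was created with.
pipeline = dlt.attach(pipeline_name="my_pipeline", pipelines_dir=dev_pipelines_dir)
print(pipeline.last_trace)
```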
## Do experiments with dev mode -If you [create a new pipeline script](../walkthroughs/create-a-pipeline.md) you will be -experimenting a lot. If you want that each time the pipeline resets its state and loads data to a -new dataset, set the `dev_mode` argument of the `dlt.pipeline` method to True. Each time the -pipeline is created, `dlt` adds datetime-based suffix to the dataset name. +If you [create a new pipeline script](../walkthroughs/create-a-pipeline.md), you will be experimenting a lot. If you want the pipeline to reset its state and load data to a new dataset each time, set the `dev_mode` argument of the `dlt.pipeline` method to True. Each time the pipeline is created, `dlt` adds a datetime-based suffix to the dataset name. ## Refresh pipeline data and state -You can reset parts or all of your sources by using the `refresh` argument to `dlt.pipeline` or the pipeline's `run` or `extract` method. -That means when you run the pipeline the sources/resources being processed will have their state reset and their tables either dropped or truncated -depending on which refresh mode is used. +You can reset parts or all of your sources by using the `refresh` argument to `dlt.pipeline` or the pipeline's `run` or `extract` method. This means that when you run the pipeline, the sources/resources being processed will have their state reset and their tables either dropped or truncated depending on which refresh mode is used. -`refresh` option works with all relational/sql destinations and file buckets (`filesystem`). it does not work with vector databases (we are working on that) and -with custom destinations. +The `refresh` option works with all relational/SQL destinations and file buckets (`filesystem`). It does not work with vector databases (we are working on that) and with custom destinations. The `refresh` argument should have one of the following string values to decide the refresh mode: ### Drop tables and pipeline state for a source with `drop_sources` -All sources being processed in `pipeline.run` or `pipeline.extract` are refreshed. -That means all tables listed in their schemas are dropped and state belonging to those sources and all their resources is completely wiped. -The tables are deleted both from pipeline's schema and from the destination database. -If you only have one source or run with all your sources together, then this is practically like running the pipeline again for the first time +All sources being processed in `pipeline.run` or `pipeline.extract` are refreshed. This means all tables listed in their schemas are dropped, and the state belonging to those sources and all their resources is completely wiped. The tables are deleted both from the pipeline's schema and from the destination database. + +If you only have one source or run with all your sources together, then this is practically like running the pipeline again for the first time. :::caution -This erases schema history for the selected sources and only the latest version is stored +This erases schema history for the selected sources, and only the latest version is stored. ::: ```py @@ -133,26 +127,26 @@ import dlt pipeline = dlt.pipeline("airtable_demo", destination="duckdb") pipeline.run(airtable_emojis(), refresh="drop_sources") ``` -In example above we instruct `dlt` to wipe pipeline state belonging to `airtable_emojis` source and drop all the database tables in `duckdb` to -which data was loaded. 
The `airtable_emojis` source had two resources named "📆 Schedule" and "💰 Budget" loading to tables "_schedule" and "_budget". Here's -what `dlt` does step by step: -1. collects a list of tables to drop by looking for all the tables in the schema that are created in the destination. -2. removes existing pipeline state associated with `airtable_emojis` source -3. resets the schema associated with `airtable_emojis` source -4. executes `extract` and `normalize` steps. those will create fresh pipeline state and a schema -5. before it executes `load` step, the collected tables are dropped from staging and regular dataset -6. schema `airtable_emojis` (associated with the source) is be removed from `_dlt_version` table -7. executes `load` step as usual so tables are re-created and fresh schema and pipeline state are stored. + +In the example above, we instruct `dlt` to wipe the pipeline state belonging to the `airtable_emojis` source and drop all the database tables in `duckdb` to which data was loaded. The `airtable_emojis` source had two resources named "📆 Schedule" and "💰 Budget" loading to tables "_schedule" and "_budget". Here's what `dlt` does step by step: +1. Collects a list of tables to drop by looking for all the tables in the schema that are created in the destination. +2. Removes the existing pipeline state associated with the `airtable_emojis` source. +3. Resets the schema associated with the `airtable_emojis` source. +4. Executes `extract` and `normalize` steps. These will create a fresh pipeline state and a schema. +5. Before it executes the `load` step, the collected tables are dropped from the staging and regular dataset. +6. The schema `airtable_emojis` (associated with the source) is removed from the `_dlt_version` table. +7. Executes the `load` step as usual, so tables are re-created and fresh schema and pipeline state are stored. ### Selectively drop tables and resource state with `drop_resources` -Limits the refresh to the resources being processed in `pipeline.run` or `pipeline.extract` (.e.g by using `source.with_resources(...)`). -Tables belonging to those resources are dropped and their resource state is wiped (that includes incremental state). -The tables are deleted both from pipeline's schema and from the destination database. -Source level state keys are not deleted in this mode (i.e. `dlt.state()[<'my_key>'] = ''`) +Limits the refresh to the resources being processed in `pipeline.run` or `pipeline.extract` (e.g., by using `source.with_resources(...)`). +Tables belonging to those resources are dropped, and their resource state is wiped (that includes incremental state). +The tables are deleted both from the pipeline's schema and from the destination database. + +Source-level state keys are not deleted in this mode (i.e., `dlt.state()[<'my_key>'] = ''`). :::caution -This erases schema history for all affected sources and only the latest schema version is stored. +This erases schema history for all affected sources, and only the latest schema version is stored. ::: ```py @@ -161,44 +155,48 @@ import dlt pipeline = dlt.pipeline("airtable_demo", destination="duckdb") pipeline.run(airtable_emojis().with_resources("📆 Schedule"), refresh="drop_resources") ``` -Above we request that the state associated with "📆 Schedule" resource is reset and the table generated by it ("_schedule") is dropped. Other resources, -tables and state are not affected. Please check `drop_sources` for step by step description of what `dlt` does internally. 
+ +Above, we request that the state associated with the "📆 Schedule" resource is reset and the table generated by it ("_schedule") is dropped. Other resources, +tables, and state are not affected. Please check `drop_sources` for a step-by-step description of what `dlt` does internally. ### Selectively truncate tables and reset resource state with `drop_data` -Same as `drop_resources` but instead of dropping tables from schema only the data is deleted from them (i.e. by `TRUNCATE ` in sql destinations). Resource state for selected resources is also wiped. In case of [incremental resources](incremental-loading.md#incremental-loading-with-a-cursor-field) this will + +Same as `drop_resources`, but instead of dropping tables from the schema, only the data is deleted from them (i.e., by `TRUNCATE ` in SQL destinations). Resource state for selected resources is also wiped. In the case of [incremental resources](incremental-loading.md#incremental-loading-with-a-cursor-field), this will reset the cursor state and fully reload the data from the `initial_value`. The schema remains unmodified in this case. + ```py import dlt pipeline = dlt.pipeline("airtable_demo", destination="duckdb") pipeline.run(airtable_emojis().with_resources("📆 Schedule"), refresh="drop_data") ``` -Above the incremental state of the "📆 Schedule" is reset before `extract` step so data is fully reacquired. Just before `load` step starts, - the "_schedule" is truncated and new (full) table data will be inserted/copied. + +Above, the incremental state of the "📆 Schedule" is reset before the `extract` step, so data is fully reacquired. Just before the `load` step starts, +the "_schedule" is truncated, and new (full) table data will be inserted/copied. ## Display the loading progress -You can add a progress monitor to the pipeline. Typically, its role is to visually assure user that -pipeline run is progressing. `dlt` supports 4 progress monitors out of the box: +You can add a progress monitor to the pipeline. Typically, its role is to visually assure the user that +the pipeline run is progressing. `dlt` supports 4 progress monitors out of the box: - [enlighten](https://github.com/Rockhopper-Technologies/enlighten) - a status bar with progress bars that also allows for logging. -- [tqdm](https://github.com/tqdm/tqdm) - most popular Python progress bar lib, proven to work in +- [tqdm](https://github.com/tqdm/tqdm) - the most popular Python progress bar library, proven to work in Notebooks. - [alive_progress](https://github.com/rsalmei/alive-progress) - with the most fancy animations. -- **log** - dumps the progress information to log, console or text stream. **the most useful on - production** optionally adds memory and cpu usage stats. +- **log** - dumps the progress information to log, console, or text stream. **The most useful in + production**, optionally adds memory and CPU usage stats. > 💡 You must install the required progress bar library yourself. -You pass the progress monitor in `progress` argument of the pipeline. You can use a name from the +You pass the progress monitor in the `progress` argument of the pipeline. 
You can use a name from the list above as in the following example: ```py # create a pipeline loading chess data that dumps -# progress to stdout each 10 seconds (the default) +# progress to stdout every 10 seconds (the default) pipeline = dlt.pipeline( pipeline_name="chess_pipeline", destination='duckdb', @@ -232,3 +230,4 @@ pipeline = dlt.pipeline( Note that the value of the `progress` argument is [configurable](../walkthroughs/run-a-pipeline.md#2-see-the-progress-during-loading). + diff --git a/docs/website/docs/general-usage/resource.md b/docs/website/docs/general-usage/resource.md index d4dedd42bd..d9638d9d39 100644 --- a/docs/website/docs/general-usage/resource.md +++ b/docs/website/docs/general-usage/resource.md @@ -8,14 +8,12 @@ keywords: [resource, api endpoint, dlt.resource] ## Declare a resource -A [resource](glossary.md#resource) is an ([optionally async](../reference/performance.md#parallelism)) function that yields data. To create a -resource, we add the `@dlt.resource` decorator to that function. +A [resource](glossary.md#resource) is an ([optionally async](../reference/performance.md#parallelism)) function that yields data. To create a resource, we add the `@dlt.resource` decorator to that function. Commonly used arguments: - `name` The name of the table generated by this resource. Defaults to the decorated function name. -- `write_disposition` How should the data be loaded at the destination? Currently supported: `append`, - `replace`, and `merge`. Defaults to `append.` +- `write_disposition` How should the data be loaded at the destination? Currently supported: `append`, `replace`, and `merge`. Defaults to `append.` Example: @@ -40,22 +38,15 @@ for row in source_name().resources.get('table_name'): print(row) ``` -Typically, resources are declared and grouped with related resources within a [source](source.md) -function. +Typically, resources are declared and grouped with related resources within a [source](source.md) function. ### Define schema -`dlt` will infer [schema](schema.md) for tables associated with resources from the resource's data. -You can modify the generation process by using the table and column hints. Resource decorator -accepts the following arguments: +`dlt` will infer [schema](schema.md) for tables associated with resources from the resource's data. You can modify the generation process by using the table and column hints. The resource decorator accepts the following arguments: -1. `table_name` the name of the table, if different from the resource name. -1. `primary_key` and `merge_key` define the name of the columns (compound keys are allowed) that will - receive those hints. Used in [incremental loading](incremental-loading.md). -1. `columns` let's you define one or more columns, including the data types, nullability, and other - hints. The column definition is a `TypedDict`: `TTableSchemaColumns`. In the example below, we tell - `dlt` that column `tags` (containing a list of tags) in the `user` table should have type `json`, - which means that it will be loaded as JSON/struct and not as a separate nested table. +1. `table_name` The name of the table, if different from the resource name. +1. `primary_key` and `merge_key` Define the name of the columns (compound keys are allowed) that will receive those hints. Used in [incremental loading](incremental-loading.md). +1. `columns` Lets you define one or more columns, including the data types, nullability, and other hints. The column definition is a `TypedDict`: `TTableSchemaColumns`. 
In the example below, we tell `dlt` that column `tags` (containing a list of tags) in the `user` table should have type `json`, which means that it will be loaded as JSON/struct and not as a separate nested table. ```py @dlt.resource(name="user", columns={"tags": {"data_type": "json"}}) @@ -67,8 +58,7 @@ accepts the following arguments: ``` :::note -You can pass dynamic hints which are functions that take the data item as input and return a -hint value. This lets you create table and column schemas depending on the data. See an [example below](#adjust-schema-when-you-yield-data). +You can pass dynamic hints which are functions that take the data item as input and return a hint value. This lets you create table and column schemas depending on the data. See an [example below](#adjust-schema-when-you-yield-data). ::: :::tip @@ -76,7 +66,8 @@ You can mark some resource arguments as [configuration and credentials](credenti ::: ### Put a contract on tables, columns, and data -Use the `schema_contract` argument to tell dlt how to [deal with new tables, data types, and bad data types](schema-contracts.md). For example, if you set it to **freeze**, `dlt` will not allow for any new tables, columns, or data types to be introduced to the schema - it will raise an exception. Learn more on available contract modes [here](schema-contracts.md#setting-up-the-contract) + +Use the `schema_contract` argument to tell dlt how to [deal with new tables, data types, and bad data types](schema-contracts.md). For example, if you set it to **freeze**, `dlt` will not allow for any new tables, columns, or data types to be introduced to the schema - it will raise an exception. Learn more about available contract modes [here](schema-contracts.md#setting-up-the-contract). ### Define a schema with Pydantic @@ -114,11 +105,11 @@ Pydantic models integrate well with [schema contracts](schema-contracts.md) as d Things to note: -- Fields with an `Optional` type are marked as `nullable` +- Fields with an `Optional` type are marked as `nullable`. - Fields with a `Union` type are converted to the first (not `None`) type listed in the union. For example, `status: Union[int, str]` results in a `bigint` column. -- `list`, `dict`, and nested Pydantic model fields will use the `json` type which means they'll be stored as a JSON object in the database instead of creating nested tables. +- `list`, `dict`, and nested Pydantic model fields will use the `json` type, which means they'll be stored as a JSON object in the database instead of creating nested tables. -You can override this by configuring the Pydantic model +You can override this by configuring the Pydantic model: ```py from typing import ClassVar @@ -135,12 +126,12 @@ def get_users(): `"skip_nested_types"` omits any `dict`/`list`/`BaseModel` type fields from the schema, so dlt will fall back on the default behavior of creating nested tables for these fields. -We do not support `RootModel` that validate simple types. You can add such a validator yourself, see [data filtering section](#filter-transform-and-pivot-data). +We do not support `RootModel` that validate simple types. You can add such a validator yourself; see [data filtering section](#filter-transform-and-pivot-data). ### Dispatch data to many tables You can load data to many tables from a single resource. The most common case is a stream of events -of different types, each with different data schema. To deal with this, you can use the `table_name` +of different types, each with a different data schema. 
To deal with this, you can use the `table_name` argument on `dlt.resource`. You could pass the table name as a function with the data item as an argument and the `table_name` string as a return value. @@ -191,10 +182,7 @@ so `dlt` can pass them automatically to your functions. ### Process resources with `dlt.transformer` -You can feed data from a resource into another one. The most common case is when you have an API -that returns a list of objects (i.e. users) in one endpoint and user details in another. You can deal -with this by declaring a resource that obtains a list of users and another resource that receives -items from the list and downloads the profiles. +You can feed data from a resource into another one. The most common case is when you have an API that returns a list of objects (i.e., users) in one endpoint and user details in another. You can deal with this by declaring a resource that obtains a list of users and another resource that receives items from the list and downloads the profiles. ```py @dlt.resource(write_disposition="replace") @@ -202,7 +190,7 @@ def users(limit=None): for u in _get_users(limit): yield u -# feed data from users as user_item below, +# Feed data from users as user_item below, # all transformers must have at least one # argument that will receive data from the parent resource @dlt.transformer(data_from=users) @@ -210,22 +198,21 @@ def users_details(user_item): for detail in _get_details(user_item["user_id"]): yield detail -# just load the user_details. +# Just load the user_details. # dlt figures out dependencies for you. pipeline.run(user_details) ``` -In the example above, `user_details` will receive data from the default instance of the `users` resource (with `limit` set to `None`). You can also use -**pipe |** operator to bind resources dynamically +In the example above, `user_details` will receive data from the default instance of the `users` resource (with `limit` set to `None`). You can also use the **pipe |** operator to bind resources dynamically. ```py -# you can be more explicit and use a pipe operator. -# with it you can create dynamic pipelines where the dependencies -# are set at run time and resources are parametrized i.e. -# below we want to load only 100 users from `users` endpoint +# You can be more explicit and use a pipe operator. +# With it, you can create dynamic pipelines where the dependencies +# are set at runtime and resources are parametrized, i.e., +# below we want to load only 100 users from the `users` endpoint pipeline.run(users(limit=100) | user_details) ``` :::tip -Transformers are allowed not only to **yield** but also to **return** values and can decorate **async** functions and [**async generators**](../reference/performance.md#extract). Below we decorate an async function and request details on two pokemons. Http calls are made in parallel via httpx library. +Transformers are allowed not only to **yield** but also to **return** values and can decorate **async** functions and [**async generators**](../reference/performance.md#extract). Below we decorate an async function and request details on two pokemons. HTTP calls are made in parallel via the httpx library. 
```py import dlt import httpx @@ -237,26 +224,24 @@ async def pokemon(id): r = await client.get(f"https://pokeapi.co/api/v2/pokemon/{id}") return r.json() -# get bulbasaur and ivysaur (you need dlt 0.4.6 for pipe operator working with lists) +# Get Bulbasaur and Ivysaur (you need dlt 0.4.6 for the pipe operator to work with lists) print(list([1,2] | pokemon())) ``` ::: ### Declare a standalone resource -A standalone resource is defined on a function that is top level in a module (not an inner function) that accepts config and secrets values. Additionally, -if the `standalone` flag is specified, the decorated function signature and docstring will be preserved. `dlt.resource` will just wrap the -decorated function, and the user must call the wrapper to get the actual resource. Below we declare a `filesystem` resource that must be called before use. +A standalone resource is defined on a function that is top-level in a module (not an inner function) that accepts config and secrets values. Additionally, if the `standalone` flag is specified, the decorated function signature and docstring will be preserved. `dlt.resource` will just wrap the decorated function, and the user must call the wrapper to get the actual resource. Below we declare a `filesystem` resource that must be called before use. ```py @dlt.resource(standalone=True) def filesystem(bucket_url=dlt.config.value): - """list and yield files in `bucket_url`""" + """List and yield files in `bucket_url`""" ... # `filesystem` must be called before it is extracted or used in any other way pipeline.run(filesystem("s3://my-bucket/reports"), table_name="reports") ``` -Standalone may have a dynamic name that depends on the arguments passed to the decorated function. For example: +Standalone resources may have a dynamic name that depends on the arguments passed to the decorated function. For example: ```py @dlt.resource(standalone=True, name=lambda args: args["stream_name"]) def kinesis(stream_name: str): @@ -264,13 +249,10 @@ def kinesis(stream_name: str): kinesis_stream = kinesis("telemetry_stream") ``` -`kinesis_stream` resource has a name **telemetry_stream** - +The `kinesis_stream` resource has the name **telemetry_stream**. ### Declare parallel and async resources -You can extract multiple resources in parallel threads or with async IO. -To enable this for a sync resource you can set the `parallelized` flag to `True` in the resource decorator: - +You can extract multiple resources in parallel threads or with async IO. To enable this for a sync resource, you can set the `parallelized` flag to `True` in the resource decorator: ```py @dlt.resource(parallelized=True) @@ -282,8 +264,9 @@ def get_users(): def get_orders(): for o in _get_orders(): yield o +``` -# users and orders will be iterated in parallel in two separate threads +# Users and orders will be iterated in parallel in two separate threads pipeline.run([get_users(), get_orders()]) ``` @@ -302,7 +285,7 @@ Please find more details in [extract performance](../reference/performance.md#ex ### Filter, transform and pivot data -You can attach any number of transformations that are evaluated on an item per item basis to your +You can attach any number of transformations that are evaluated on an item-per-item basis to your resource. The available transformation types: - **map** - transform the data item (`resource.add_map`). @@ -350,8 +333,8 @@ You can limit how deep `dlt` goes when generating nested tables and flattening d and generate nested tables for all nested lists, without limit. 
:::note -`max_table_nesting` is optional so you can skip it, in this case dlt will -use it from the source if it is specified there or fallback to the default +`max_table_nesting` is optional so you can skip it. In this case, dlt will +use it from the source if it is specified there or fall back to the default value which has 1000 as the maximum nesting level. ::: @@ -396,15 +379,15 @@ resource = my_resource() resource.max_table_nesting = 0 ``` -Several data sources are prone to contain semi-structured documents with very deep nesting i.e. +Several data sources are prone to contain semi-structured documents with very deep nesting, i.e., MongoDB databases. Our practical experience is that setting the `max_nesting_level` to 2 or 3 -produces the clearest and human-readable schemas. +produces the clearest and most human-readable schemas. ### Sample from large data If your resource loads thousands of pages of data from a REST API or millions of rows from a db table, you may want to just sample a fragment of it in order to quickly see the dataset with example data and test your transformations, etc. In order to do that, you limit how many items will be yielded by a resource (or source) by calling the `add_limit` method. This method will close the generator which produces the data after the limit is reached. -In the example below, we load just 10 first items from an infinite counter - that would otherwise never end. +In the example below, we load just the first 10 items from an infinite counter - that would otherwise never end. ```py r = dlt.resource(itertools.count(), name="infinity").add_limit(10) @@ -458,7 +441,7 @@ tables.users.table_name = "other_users" ### Adjust schema when you yield data -You can set or update the table name, columns, and other schema elements when your resource is executed and you already yield data. Such changes will be merged with the existing schema in the same way the `apply_hints` method above works. There are many reasons to adjust the schema at runtime. For example, when using Airflow, you should avoid lengthy operations (i.e. reflecting database tables) during the creation of the DAG, so it is better to do it when the DAG executes. You may also emit partial hints (i.e. precision and scale for decimal types) for columns to help `dlt` type inference. +You can set or update the table name, columns, and other schema elements when your resource is executed and you already yield data. Such changes will be merged with the existing schema in the same way the `apply_hints` method above works. There are many reasons to adjust the schema at runtime. For example, when using Airflow, you should avoid lengthy operations (i.e., reflecting database tables) during the creation of the DAG, so it is better to do it when the DAG executes. You may also emit partial hints (i.e., precision and scale for decimal types) for columns to help `dlt` type inference. ```py @dlt.resource @@ -487,11 +470,13 @@ def sql_table(credentials, schema, table): In the example above, we use `dlt.mark.with_hints` and `dlt.mark.make_hints` to emit columns and primary key with the first extracted item. The table schema will be adjusted after the `batch` is processed in the extract pipeline but before any schema contracts are applied and data is persisted in the load package. :::tip -You can emit columns as a Pydantic model and use dynamic hints (i.e. lambda for table name) as well. You should avoid redefining `Incremental` this way. 
+You can emit columns as a Pydantic model and use dynamic hints (i.e., lambda for table name) as well. You should avoid redefining `Incremental` this way. ::: ### Import external files -You can import external files i.e. `csv`, `parquet`, and `jsonl` by yielding items marked with `with_file_import`, optionally passing a table schema corresponding to the imported file. `dlt` will not read, parse, and normalize any names (i.e. `csv` or `arrow` headers) and will attempt to copy the file into the destination as is. + +You can import external files, i.e., `csv`, `parquet`, and `jsonl` by yielding items marked with `with_file_import`, optionally passing a table schema corresponding to the imported file. `dlt` will not read, parse, and normalize any names (i.e., `csv` or `arrow` headers) and will attempt to copy the file into the destination as is. + ```py import os import dlt @@ -517,9 +502,9 @@ def orders(items: Iterator[FileItemDict]): item.fsspec.download(item["file_url"], dest_file) # tell dlt to import the dest_file as `csv` yield dlt.mark.with_file_import(dest_file, "csv") +``` - -# use the filesystem verified source to glob a bucket +# Use the filesystem verified source to glob a bucket downloader = filesystem( bucket_url="s3://my_bucket/csv", file_glob="today/*.csv.gz") | orders @@ -536,7 +521,7 @@ include_header=false on_error_continue=true ``` -You can sniff the schema from the data i.e. using `duckdb` to infer the table schema from a `csv` file. `dlt.mark.with_file_import` accepts additional arguments that you can use to pass hints at runtime. +You can sniff the schema from the data, i.e., using `duckdb` to infer the table schema from a `csv` file. `dlt.mark.with_file_import` accepts additional arguments that you can use to pass hints at runtime. :::note * If you do not define any columns, the table will not be created in the destination. `dlt` will still attempt to load data into it, so if you create a fitting table upfront, the load process will succeed. @@ -544,7 +529,7 @@ You can sniff the schema from the data i.e. using `duckdb` to infer the table sc ::: ### Duplicate and rename resources -There are cases when your resources are generic (i.e. bucket filesystem) and you want to load several instances of it (i.e. files from different folders) to separate tables. In the example below, we use the `filesystem` source to load csvs from two different folders into separate tables: +There are cases when your resources are generic (i.e., bucket filesystem) and you want to load several instances of it (i.e., files from different folders) to separate tables. In the example below, we use the `filesystem` source to load csvs from two different folders into separate tables: ```py @dlt.resource(standalone=True) def filesystem(bucket_url): @@ -567,7 +552,7 @@ pipeline.run( ) ``` -The `with_name` method returns a deep copy of the original resource, its data pipe, and the data pipes of a parent resource. A renamed clone is fully separated from the original resource (and other clones) when loading: it maintains a separate [resource state](state.md#read-and-write-pipeline-state-in-a-resource) and will load to a table +The `with_name` method returns a deep copy of the original resource, its data pipe, and the data pipes of a parent resource. A renamed clone is fully separated from the original resource (and other clones) when loading: it maintains a separate [resource state](state.md#read-and-write-pipeline-state-in-a-resource) and will load to a table. 
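To see why that separation matters, here is a minimal sketch of the same pattern combined with incremental loading; the `read_folder` helper and the bucket paths are hypothetical. Each renamed clone stores its own `updated_at` cursor under its new name, so the two folders are tracked independently:

```py
import dlt

@dlt.resource(standalone=True)
def updated_files(
    bucket_url,
    updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01T00:00:00Z"),
):
    # read_folder is a hypothetical helper that yields items newer than the cursor
    yield from read_folder(bucket_url, newer_than=updated_at.last_value)

pipeline.run(
    [
        # each clone loads to its own table and keeps its own incremental state
        updated_files("s3://my_bucket/csv").with_name("csv_files"),
        updated_files("s3://my_bucket/json").with_name("json_files"),
    ]
)
```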
## Load resources @@ -604,6 +589,7 @@ The resource above will be saved and loaded from a `parquet` file (if the destin A special `file_format`: **preferred** will load the resource using a format that is preferred by a destination. This setting supersedes the `loader_file_format` passed to the `run` method. ::: + ### Do a full refresh To do a full refresh of an `append` or `merge` resource, you set the `refresh` argument on the `run` method to `drop_data`. This will truncate the tables without dropping them. @@ -616,3 +602,4 @@ You can also [fully drop the tables](pipeline.md#refresh-pipeline-data-and-state ```py p.run(merge_source(), refresh="drop_sources") ``` + diff --git a/docs/website/docs/general-usage/schema-contracts.md b/docs/website/docs/general-usage/schema-contracts.md index e48fe979fd..c96236085d 100644 --- a/docs/website/docs/general-usage/schema-contracts.md +++ b/docs/website/docs/general-usage/schema-contracts.md @@ -4,9 +4,7 @@ description: Controlling schema evolution and validating data keywords: [data contracts, schema, dlt schema, pydantic] --- -`dlt` will evolve the schema at the destination by following the structure and data types of the extracted data. There are several modes -that you can use to control this automatic schema evolution, from the default modes where all changes to the schema are accepted to -a frozen schema that does not change at all. +`dlt` will evolve the schema at the destination by following the structure and data types of the extracted data. There are several modes that you can use to control this automatic schema evolution, from the default modes where all changes to the schema are accepted to a frozen schema that does not change at all. Consider this example: @@ -16,53 +14,50 @@ def items(): ... ``` -This resource will allow new tables (both nested tables and [tables with dynamic names](resource.md#dispatch-data-to-many-tables)) to be created, but will throw an exception if data is extracted for an existing table which contains a new column. +This resource will allow new tables (both nested tables and [tables with dynamic names](resource.md#dispatch-data-to-many-tables)) to be created, but will throw an exception if data is extracted for an existing table that contains a new column. ### Setting up the contract You can control the following **schema entities**: -* `tables` - contract is applied when a new table is created -* `columns` - contract is applied when a new column is created on an existing table -* `data_type` - contract is applied when data cannot be coerced into a data type associate with existing column. +* `tables` - the contract is applied when a new table is created +* `columns` - the contract is applied when a new column is created on an existing table +* `data_type` - the contract is applied when data cannot be coerced into a data type associated with an existing column. -You can use **contract modes** to tell `dlt` how to apply contract for a particular entity: +You can use **contract modes** to tell `dlt` how to apply the contract for a particular entity: * `evolve`: No constraints on schema changes. -* `freeze`: This will raise an exception if data is encountered that does not fit the existing schema, so no data will be loaded to the destination +* `freeze`: This will raise an exception if data is encountered that does not fit the existing schema, so no data will be loaded to the destination. 
* `discard_row`: This will discard any extracted row if it does not adhere to the existing schema, and this row will not be loaded to the destination. -* `discard_value`: This will discard data in an extracted row that does not adhere to the existing schema and the row will be loaded without this data. +* `discard_value`: This will discard data in an extracted row that does not adhere to the existing schema, and the row will be loaded without this data. :::note The default mode (**evolve**) works as follows: -1. New tables may be always created -2. New columns may be always appended to the existing table -3. Data that do not coerce to existing data type of a particular column will be sent to a [variant column](schema.md#variant-columns) created for this particular type. +1. New tables may always be created. +2. New columns may always be appended to the existing table. +3. Data that does not coerce to the existing data type of a particular column will be sent to a [variant column](schema.md#variant-columns) created for this particular type. ::: #### Passing schema_contract argument -The `schema_contract` exists on the [dlt.source](source.md) decorator as a default for all resources in that source and on the -[dlt.resource](source.md) decorator as a directive for the individual resource - and as a consequence - on all tables created by this resource. -Additionally it exists on the `pipeline.run()` method, which will override all existing settings. +The `schema_contract` exists on the [dlt.source](source.md) decorator as a default for all resources in that source and on the [dlt.resource](source.md) decorator as a directive for the individual resource - and as a consequence - on all tables created by this resource. Additionally, it exists on the `pipeline.run()` method, which will override all existing settings. The `schema_contract` argument accepts two forms: 1. **full**: a mapping of schema entities to contract modes -2. **shorthand** a contract mode (string) that will be applied to all schema entities. +2. **shorthand**: a contract mode (string) that will be applied to all schema entities. -For example setting `schema_contract` to *freeze* will expand to the full form: +For example, setting `schema_contract` to *freeze* will expand to the full form: ```py {"tables": "freeze", "columns": "freeze", "data_type": "freeze"} ``` -You can change the contract on the **source** instance via `schema_contract` property. For **resource** you can use [apply_hints](resource#set-table-name-and-adjust-schema). +You can change the contract on the **source** instance via the `schema_contract` property. For **resource**, you can use [apply_hints](resource#set-table-name-and-adjust-schema). - -#### Nuances of contract modes. +#### Nuances of contract modes 1. Contracts are applied **after names of tables and columns are normalized**. -2. Contract defined on a resource is applied to all root tables and nested tables created by that resource. -3. `discard_row` works on table level. So for example if you have two tables in nested relationship ie. *users* and *users__addresses* and contract is violated in *users__addresses* table, the row of that table is discarded while the parent row in *users* table will be loaded. +2. A contract defined on a resource is applied to all root tables and nested tables created by that resource. +3. `discard_row` works on the table level. 
So, for example, if you have two tables in a nested relationship, i.e., *users* and *users__addresses*, and the contract is violated in the *users__addresses* table, the row of that table is discarded while the parent row in the *users* table will be loaded. ### Use Pydantic models for data validation Pydantic models can be used to [define table schemas and validate incoming data](resource.md#define-a-schema-with-pydantic). You can use any model you already have. `dlt` will internally synthesize (if necessary) new models that conform with the **schema contract** on the resource. -Just passing a model in `column` argument of the [dlt.resource](resource.md#define-a-schema-with-pydantic) sets a schema contract that conforms to default Pydantic behavior: +Just passing a model in the `column` argument of the [dlt.resource](resource.md#define-a-schema-with-pydantic) sets a schema contract that conforms to default Pydantic behavior: ```py { "tables": "evolve", @@ -70,18 +65,18 @@ Just passing a model in `column` argument of the [dlt.resource](resource.md#defi "data_type": "freeze" } ``` -New tables are allowed, extra fields are ignored and invalid data raises an exception. +New tables are allowed, extra fields are ignored, and invalid data raises an exception. -If you pass schema contract explicitly the following happens to schema entities: -1. **tables** do not impact the Pydantic models +If you pass the schema contract explicitly, the following happens to schema entities: +1. **tables** do not impact the Pydantic models. 2. **columns** modes are mapped into the **extra** modes of Pydantic (see below). `dlt` will apply this setting recursively if models contain other models. -3. **data_type** supports following modes for Pydantic: **evolve** will synthesize lenient model that allows for any data type. This may result with variant columns upstream. +3. **data_type** supports the following modes for Pydantic: **evolve** will synthesize a lenient model that allows for any data type. This may result in variant columns upstream. **freeze** will re-raise `ValidationException`. **discard_row** will remove the non-validating data items. **discard_value** is not currently supported. We may eventually do that on Pydantic v2. `dlt` maps column contract modes into the extra fields settings as follows. -Note that this works in two directions. If you use a model with such setting explicitly configured, `dlt` sets the column contract mode accordingly. This also avoids synthesizing modified models. +Note that this works in two directions. If you use a model with such a setting explicitly configured, `dlt` sets the column contract mode accordingly. This also avoids synthesizing modified models. | column mode | pydantic extra | | ------------- | -------------- | @@ -99,25 +94,24 @@ Model validation is added as a [transform step](resource.md#filter-transform-and :::note Pydantic models work on the **extracted** data **before names are normalized or nested tables are created**. Make sure to name model fields as in your input data and handle nested data with the nested models. -As a consequence, `discard_row` will drop the whole data item - even if nested model was affected. +As a consequence, `discard_row` will drop the whole data item - even if a nested model was affected. ::: ### Set contracts on Arrow Tables and Pandas All contract settings apply to [arrow tables and panda frames](../dlt-ecosystem/verified-sources/arrow-pandas.md) as well. -1. **tables** mode the same - no matter what is the data item type -2. 
**columns** will allow new columns, raise an exception or modify tables/frames still in extract step to avoid re-writing parquet files. -3. **data_type** changes to data types in tables/frames are not allowed and will result in data type schema clash. We could allow for more modes (evolving data types in Arrow tables sounds weird but ping us on Slack if you need it.) +1. **tables** mode the same - no matter what is the data item type. +2. **columns** will allow new columns, raise an exception, or modify tables/frames still in the extract step to avoid re-writing parquet files. +3. **data_type** changes to data types in tables/frames are not allowed and will result in a data type schema clash. We could allow for more modes (evolving data types in Arrow tables sounds weird but ping us on Slack if you need it). Here's how `dlt` deals with column modes: -1. **evolve** new columns are allowed (table may be reordered to put them at the end) -2. **discard_value** column will be deleted -3. **discard_row** rows with the column present will be deleted and then column will be deleted -4. **freeze** exception on a new column - +1. **evolve** new columns are allowed (the table may be reordered to put them at the end). +2. **discard_value** column will be deleted. +3. **discard_row** rows with the column present will be deleted and then the column will be deleted. +4. **freeze** exception on a new column. ### Get context from DataValidationError in freeze mode -When contract is violated in freeze mode, `dlt` raises `DataValidationError` exception. This exception gives access to the full context and passes the evidence to the caller. -As with any other exception coming from pipeline run, it will be re-raised via `PipelineStepFailed` exception which you should catch in except: + +When a contract is violated in freeze mode, `dlt` raises a `DataValidationError` exception. This exception gives access to the full context and passes the evidence to the caller. As with any other exception coming from a pipeline run, it will be re-raised via a `PipelineStepFailed` exception, which you should catch in the except block: ```py try: @@ -129,27 +123,26 @@ except PipelineStepFailed as pip_ex: if pip_ex.step == "extract": if isinstance(pip_ex.__context__, DataValidationError): ... - - ``` `DataValidationError` provides the following context: -1. `schema_name`, `table_name` and `column_name` provide the logical "location" at which the contract was violated. -2. `schema_entity` and `contract_mode` tell which contract was violated -3. `table_schema` contains the schema against which the contract was validated. May be Pydantic model or `dlt` TTableSchema instance -4. `schema_contract` the full, expanded schema contract -5. `data_item` causing data item (Python dict, arrow table, pydantic model or list of there of) - +1. `schema_name`, `table_name`, and `column_name` provide the logical "location" at which the contract was violated. +2. `schema_entity` and `contract_mode` tell which contract was violated. +3. `table_schema` contains the schema against which the contract was validated. It may be a Pydantic model or a `dlt` TTableSchema instance. +4. `schema_contract` is the full, expanded schema contract. +5. `data_item` is the causing data item (Python dict, arrow table, Pydantic model, or list of these). ### Contracts on new tables -If a table is a **new table** that has not been created on the destination yet, dlt will allow the creation of new columns. 
For a single pipeline run, the column mode is changed (internally) to **evolve** and then reverted back to the original mode. This allows for initial schema inference to happen and then on subsequent run, the inferred contract will be applied to a new data. -Following tables are considered new: -1. Child tables inferred from the nested data -2. Dynamic tables created from the data during extraction -3. Tables containing **incomplete** columns - columns without data type bound to them. +If a table is a **new table** that has not been created on the destination yet, dlt will allow the creation of new columns. For a single pipeline run, the column mode is changed (internally) to **evolve** and then reverted back to the original mode. This allows for initial schema inference to happen, and then on a subsequent run, the inferred contract will be applied to new data. + +The following tables are considered new: +1. Child tables inferred from the nested data. +2. Dynamic tables created from the data during extraction. +3. Tables containing **incomplete** columns - columns without a data type bound to them. + +For example, such a table is considered new because the column **number** is incomplete (defines primary key and NOT null but no data type): -For example such table is considered new because column **number** is incomplete (define primary key and NOT null but no data type) ```yaml blocks: description: Ethereum blocks @@ -162,17 +155,17 @@ blocks: ``` What tables are not considered new: -1. Those with columns defined by Pydantic modes +1. Those with columns defined by Pydantic models. ### Working with datasets that have manually added tables and columns on the first load -In some cases you might be working with datasets that have tables or columns created outside of dlt. If you are loading to a table not created by `dlt` for the first time, `dlt` will not know about this table while enforcing schema contracts. This means that if you do a load where the `tables` are set to `evolve`, all will work as planned. If you have `tables` set to `freeze`, dlt will raise an exception because it thinks you are creating a new table (which you are from dlts perspective). You can allow `evolve` for one load and then switch back to `freeze`. +In some cases, you might be working with datasets that have tables or columns created outside of dlt. If you are loading to a table not created by `dlt` for the first time, `dlt` will not know about this table while enforcing schema contracts. This means that if you do a load where the `tables` are set to `evolve`, all will work as planned. If you have `tables` set to `freeze`, dlt will raise an exception because it thinks you are creating a new table (which you are from dlt's perspective). You can allow `evolve` for one load and then switch back to `freeze`. The same thing will happen if `dlt` knows your table, but you have manually added a column to your destination and you have `columns` set to `freeze`. -### Code Examples +### Code examples -The below code will silently ignore new subtables, allow new columns to be added to existing tables and raise an error if a variant of a column is discovered. +The below code will silently ignore new subtables, allow new columns to be added to existing tables, and raise an error if a variant of a column is discovered. ```py @dlt.resource(schema_contract={"tables": "discard_row", "columns": "evolve", "data_type": "freeze"}) @@ -180,15 +173,13 @@ def items(): ... ``` -The below Code will raise on any encountered schema change. 
Note: You can always set a string which will be interpreted as though all keys are set to these values. +The below code will raise an error on any encountered schema change. Note: You can always set a string which will be interpreted as though all keys are set to these values. ```py pipeline.run(my_source(), schema_contract="freeze") ``` -The below code defines some settings on the source which can be overwritten on the resource which in turn can be overwritten by the global override on the `run` method. -Here for all resources variant columns are frozen and raise an error if encountered, on `items` new columns are allowed but `other_items` inherits the `freeze` setting from -the source, thus new columns are frozen there. New tables are allowed. +The below code defines some settings on the source which can be overwritten on the resource, which in turn can be overwritten by the global override on the `run` method. Here, for all resources, variant columns are frozen and raise an error if encountered. On `items`, new columns are allowed, but `other_items` inherits the `freeze` setting from the source, thus new columns are frozen there. New tables are allowed. ```py @dlt.resource(schema_contract={"columns": "evolve"}) @@ -204,10 +195,11 @@ def source(): return [items(), other_items()] -# this will use the settings defined by the decorators +# This will use the settings defined by the decorators pipeline.run(source()) -# this will freeze the whole schema, regardless of the decorator settings +# This will freeze the whole schema, regardless of the decorator settings pipeline.run(source(), schema_contract="freeze") -``` \ No newline at end of file +``` + diff --git a/docs/website/docs/general-usage/schema-evolution.md b/docs/website/docs/general-usage/schema-evolution.md index b2b81cfdca..051841dcd1 100644 --- a/docs/website/docs/general-usage/schema-evolution.md +++ b/docs/website/docs/general-usage/schema-evolution.md @@ -8,7 +8,7 @@ keywords: [schema evolution, schema, dlt schema] Schema evolution is a best practice when ingesting most data. It’s simply a way to get data across a format barrier. -It separates the technical challenge of “loading” data, from the business challenge of “curating” data. This enables us to have pipelines that are maintainable by different individuals at different stages. +It separates the technical challenge of “loading” data from the business challenge of “curating” data. This enables us to have pipelines that are maintainable by different individuals at different stages. However, for cases where schema evolution might be triggered by malicious events, such as in web tracking, data contracts are advised. Read more about how to implement data contracts [here](https://dlthub.com/docs/general-usage/schema-contracts). @@ -20,9 +20,9 @@ As the structure of data changes, such as the addition of new columns, changing ## Inferring a schema from nested data -The first run of a pipeline will scan the data that goes through it and generate a schema. To convert nested data into relational format, `dlt` flattens dictionaries and unpacks nested lists into sub-tables. +The first run of a pipeline will scan the data that goes through it and generate a schema. To convert nested data into a relational format, `dlt` flattens dictionaries and unpacks nested lists into sub-tables. -We’ll review some examples here and figure out how `dlt` creates initial schema and how normalisation works. 
Consider a pipeline that loads the following schema: +We’ll review some examples here and figure out how `dlt` creates the initial schema and how normalization works. Consider a pipeline that loads the following schema: ```py data = [{ @@ -42,23 +42,23 @@ data = [{ dlt.pipeline("organizations_pipeline", destination="duckdb").run(data, table_name="org") ``` -The schema of data above is loaded to the destination as follows: +The schema of the data above is loaded to the destination as follows: ### What did the schema inference engine do? -As you can see above the `dlt's` inference engine generates the structure of the data based on the source and provided hints. It normalizes the data, creates tables and columns, and infers data types. +As you can see above, the `dlt's` inference engine generates the structure of the data based on the source and provided hints. It normalizes the data, creates tables and columns, and infers data types. For more information, you can refer to the **[Schema](https://dlthub.com/docs/general-usage/schema)** and **[Adjust a Schema](https://dlthub.com/docs/walkthroughs/adjust-a-schema)** sections in the documentation. ## Evolving the schema -For a typical data source schema tends to change with time, and `dlt` handles this changing schema seamlessly. +For a typical data source, the schema tends to change with time, and `dlt` handles this changing schema seamlessly. Let’s add the following 4 cases: -- A column is added : a field named “CEO” was added. -- A column type is changed: Datatype of column named “inventory_nr” was changed from integer to string. +- A column is added: a field named “CEO” was added. +- A column type is changed: The datatype of the column named “inventory_nr” was changed from integer to string. - A column is removed: a field named “room” was commented out/removed. - A column is renamed: a field “building” was renamed to “main_block”. @@ -71,7 +71,7 @@ data = [{ "address": { # 'building' renamed to 'main_block' 'main_block': 'r&d', - # Removed room column + # Removed room column # "room": 7890, }, "Inventory": [ @@ -81,6 +81,7 @@ data = [{ {"name": "Type-inferrer", "inventory nr": "AR3621"} ] }] +``` # Run `dlt` pipeline dlt.pipeline("organizations_pipeline", destination="duckdb").run(data, table_name="org") @@ -110,7 +111,7 @@ The column lineage can be tracked by loading the 'load_info' to the destination. **Getting notifications** -We can read the load outcome and send it to slack webhook with `dlt`. +We can read the load outcome and send it to a Slack webhook with `dlt`. ```py # Import the send_slack_message function from the dlt library from dlt.common.runtime.slack import send_slack_message @@ -125,7 +126,7 @@ for package in info.load_packages: # Iterate over each column in the current table for column_name, column in table["columns"].items(): # Send a message to the Slack channel with the table - # and column update information + # and column update information send_slack_message( hook, message=( @@ -143,7 +144,7 @@ This script sends Slack notifications for schema updates using the `send_slack_m ### How to test for removed columns - applying “not null” constraint -A column not existing, and a column being null, are two different things. However, when it comes to APIs and json, it’s usually all treated the same - the key-value pair will simply not exist. +A column not existing, and a column being null, are two different things. 
However, when it comes to APIs and JSON, it’s usually all treated the same - the key-value pair will simply not exist. To remove a column, exclude it from the output of the resource function. Subsequent data inserts will treat this column as null. Verify column removal by applying a not null constraint. For instance, after removing the "room" column, apply a not null constraint to confirm its exclusion. @@ -166,7 +167,7 @@ pipeline = dlt.pipeline("organizations_pipeline", destination="duckdb") # Adding not null constraint pipeline.run(data, table_name="org", columns={"room": {"data_type": "bigint", "nullable": False}}) ``` -During pipeline execution a data validation error indicates that a removed column is being passed as null. +During pipeline execution, a data validation error indicates that a removed column is being passed as null. ## Some schema changes in the data @@ -204,12 +205,13 @@ The schema of the data above is loaded to the destination as follows: The schema evolution engine in the `dlt` library is designed to handle changes in the structure of your data over time. For example: -- As above in continuation of the inferred schema, the “specifications” are nested in "details”, which are nested in “Inventory”, all under table name “org”. So the table created for projects is `org__inventory__details__specifications`. +- As above in continuation of the inferred schema, the “specifications” are nested in "details,” which are nested in “Inventory,” all under the table name “org.” So the table created for projects is `org__inventory__details__specifications`. -These is a simple examples of how schema evolution works. +This is a simple example of how schema evolution works. ## Schema evolution using schema and data contracts -Demonstrating schema evolution without talking about schema and data contracts is only one side of the coin. Schema and data contracts dictate the terms of how the schema being written to destination should evolve. +Demonstrating schema evolution without talking about schema and data contracts is only one side of the coin. Schema and data contracts dictate the terms of how the schema being written to the destination should evolve. + +Schema and data contracts can be applied to entities ‘tables,’ ‘columns,’ and ‘data_types’ using contract modes ‘evolve,’ ‘freeze,’ ‘discard_rows,’ and ‘discard_columns’ to tell `dlt` how to apply a contract for a particular entity. To read more about **schema and data contracts** read our [documentation](https://dlthub.com/docs/general-usage/schema-contracts). -Schema and data contracts can be applied to entities ‘tables’ , ‘columns’ and ‘data_types’ using contract modes ‘evolve’, freeze’, ‘discard_rows’ and ‘discard_columns’ to tell `dlt` how to apply contract for a particular entity. To read more about **schema and data contracts** read our [documentation](https://dlthub.com/docs/general-usage/schema-contracts). \ No newline at end of file diff --git a/docs/website/docs/general-usage/schema.md b/docs/website/docs/general-usage/schema.md index 534d3ca3bd..aed16abc28 100644 --- a/docs/website/docs/general-usage/schema.md +++ b/docs/website/docs/general-usage/schema.md @@ -6,66 +6,44 @@ keywords: [schema, dlt schema, yaml] # Schema -The schema describes the structure of normalized data (e.g. tables, columns, data types, etc.) and -provides instructions on how the data should be processed and loaded. `dlt` generates schemas from -the data during the normalization process. 
User can affect this standard behavior by providing -**hints** that change how tables, columns and other metadata is generated and how the data is -loaded. Such hints can be passed in the code ie. to `dlt.resource` decorator or `pipeline.run` -method. Schemas can be also exported and imported as files, which can be directly modified. +The schema describes the structure of normalized data (e.g., tables, columns, data types, etc.) and provides instructions on how the data should be processed and loaded. `dlt` generates schemas from the data during the normalization process. Users can affect this standard behavior by providing **hints** that change how tables, columns, and other metadata are generated and how the data is loaded. Such hints can be passed in the code, i.e., to the `dlt.resource` decorator or `pipeline.run` method. Schemas can also be exported and imported as files, which can be directly modified. -> 💡 `dlt` associates a schema with a [source](source.md) and a table schema with a -> [resource](resource.md). +> 💡 `dlt` associates a schema with a [source](source.md) and a table schema with a [resource](resource.md). ## Schema content hash and version -Each schema file contains content based hash `version_hash` that is used to: +Each schema file contains a content-based hash `version_hash` that is used to: -1. Detect manual changes to schema (ie. user edits content). +1. Detect manual changes to the schema (i.e., user edits content). 1. Detect if the destination database schema is synchronized with the file schema. Each time the schema is saved, the version hash is updated. -Each schema contains a numeric version which increases automatically whenever schema is updated and -saved. Numeric version is meant to be human-readable. There are cases (parallel processing) where -the order is lost. +Each schema contains a numeric version that increases automatically whenever the schema is updated and saved. The numeric version is meant to be human-readable. There are cases (parallel processing) where the order is lost. -> 💡 Schema in the destination is migrated if its hash is not stored in `_dlt_versions` table. In -> principle many pipelines may send data to a single dataset. If table name clash then a single -> table with the union of the columns will be created. If columns clash, and they have different -> types etc. then the load may fail if the data cannot be coerced. +> 💡 The schema in the destination is migrated if its hash is not stored in the `_dlt_versions` table. In principle, many pipelines may send data to a single dataset. If table names clash, then a single table with the union of the columns will be created. If columns clash, and they have different types, etc., then the load may fail if the data cannot be coerced. ## Naming convention -`dlt` creates tables, nested tables and column schemas from the data. The data being loaded, -typically JSON documents, contains identifiers (i.e. key names in a dictionary) with any Unicode -characters, any lengths and naming styles. On the other hand the destinations accept very strict -namespaces for their identifiers. Like Redshift that accepts case-insensitive alphanumeric -identifiers with maximum 127 characters. +`dlt` creates tables, nested tables, and column schemas from the data. The data being loaded, typically JSON documents, contains identifiers (i.e., key names in a dictionary) with any Unicode characters, any lengths, and naming styles. 
On the other hand, the destinations accept very strict namespaces for their identifiers, like Redshift, which accepts case-insensitive alphanumeric identifiers with a maximum of 127 characters. -Each schema contains [naming convention](naming-convention.md) that tells `dlt` how to translate identifiers to the -namespace that the destination understands. This convention can be configured, changed in code or enforced via -destination. +Each schema contains a [naming convention](naming-convention.md) that tells `dlt` how to translate identifiers to the namespace that the destination understands. This convention can be configured, changed in code, or enforced via the destination. The default naming convention: -1. Converts identifiers to snake_case, small caps. Removes all ascii characters except ascii - alphanumerics and underscores. -1. Adds `_` if name starts with number. -1. Multiples of `_` are converted into single `_`. +1. Converts identifiers to snake_case, small caps. Removes all ASCII characters except ASCII alphanumerics and underscores. +1. Adds `_` if the name starts with a number. +1. Multiples of `_` are converted into a single `_`. 1. Nesting is expressed as double `_` in names. -1. It shorts the identifier if it exceed the length at the destination. +1. It shortens the identifier if it exceeds the length at the destination. -> 💡 Standard behavior of `dlt` is to **use the same naming convention for all destinations** so -> users see always the same tables and columns in their databases. +> 💡 The standard behavior of `dlt` is to **use the same naming convention for all destinations** so users always see the same tables and columns in their databases. -> 💡 If you provide any schema elements that contain identifiers via decorators or arguments (i.e. -> `table_name` or `columns`) all the names used will be converted via the naming convention when -> adding to the schema. For example if you execute `dlt.run(... table_name="CamelCase")` the data -> will be loaded into `camel_case`. +> 💡 If you provide any schema elements that contain identifiers via decorators or arguments (i.e., `table_name` or `columns`), all the names used will be converted via the naming convention when adding to the schema. For example, if you execute `dlt.run(... table_name="CamelCase")`, the data will be loaded into `camel_case`. -> 💡 Use simple, short small caps identifiers for everything! +> 💡 Use simple, short, small caps identifiers for everything! -To retain the original naming convention (like keeping `"createdAt"` as it is instead of converting it to `"created_at"`), you can use the direct naming convention, in "config.toml" as follows: +To retain the original naming convention (like keeping `"createdAt"` as it is instead of converting it to `"created_at"`), you can use the direct naming convention in "config.toml" as follows: ```toml [schema] naming="direct" @@ -74,82 +52,69 @@ naming="direct" Opting for `"direct"` naming bypasses most name normalization processes. This means any unusual characters present will be carried over unchanged to database tables and columns. Please be aware of this behavior to avoid potential issues. ::: -The naming convention is configurable and users can easily create their own -conventions that i.e. pass all the identifiers unchanged if the destination accepts that (i.e. -DuckDB). +The naming convention is configurable, and users can easily create their own conventions that, i.e., pass all the identifiers unchanged if the destination accepts that (i.e., DuckDB). 
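To make the default rules above concrete, here is a rough, illustrative sketch of how a single identifier could be normalized under the default snake_case convention. This is not `dlt`'s actual normalizer (the real implementation also tags shortened names and applies the `__` nesting separator when joining path segments); it only mirrors the rules listed above:

```py
import re

def snake_case_sketch(identifier: str, max_length: int = 127) -> str:
    # split CamelCase boundaries, then lowercase ("small caps")
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", identifier).lower()
    # keep only ASCII alphanumerics and underscores
    name = re.sub(r"[^a-z0-9_]", "_", name)
    # collapse runs of "_" into a single "_"
    name = re.sub(r"_+", "_", name)
    # prefix with "_" if the name starts with a number
    if name[:1].isdigit():
        name = "_" + name
    # shorten the identifier if it exceeds the destination limit
    return name[:max_length]

print(snake_case_sketch("CamelCase"))   # camel_case
print(snake_case_sketch("1 subject"))   # _1_subject
```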
## Data normalizer -Data normalizer changes the structure of the input data, so it can be loaded into destination. The -standard `dlt` normalizer creates a relational structure from Python dictionaries and lists. -Elements of that structure: table and column definitions, are added to the schema. +Data normalizer changes the structure of the input data so it can be loaded into the destination. The standard `dlt` normalizer creates a relational structure from Python dictionaries and lists. Elements of that structure, such as table and column definitions, are added to the schema. -The data normalizer is configurable and users can plug their own normalizers i.e. to handle the -nested table linking differently or generate parquet-like data structs instead of nested -tables. +The data normalizer is configurable, and users can plug in their own normalizers, i.e., to handle the nested table linking differently or generate parquet-like data structures instead of nested tables. ## Tables and columns -The key components of a schema are tables and columns. You can find a dictionary of tables in -`tables` key or via `tables` property of Schema object. +The key components of a schema are tables and columns. You can find a dictionary of tables in the `tables` key or via the `tables` property of the Schema object. A table schema has the following properties: 1. `name` and `description`. -1. `columns` with dictionary of table schemas. +1. `columns` with a dictionary of table schemas. 1. `write_disposition` hint telling `dlt` how new data coming to the table is loaded. -1. `schema_contract` - describes a [contract on the table](schema-contracts.md) -1. `parent` a part of the nested reference, defined on a nested table and points to the parent table. +1. `schema_contract` - describes a [contract on the table](schema-contracts.md). +1. `parent` - a part of the nested reference, defined on a nested table and points to the parent table. -Table schema is extended by data normalizer. Standard data normalizer adds propagated columns to it. +The table schema is extended by the data normalizer. The standard data normalizer adds propagated columns to it. -A column schema contains following properties: +A column schema contains the following properties: 1. `name` and `description` of a column in a table. Data type information: 1. `data_type` with a column data type. -1. `precision` a precision for **text**, **timestamp**, **time**, **bigint**, **binary**, and **decimal** types -1. `scale` a scale for **decimal** type -1. `timezone` a flag indicating TZ aware or NTZ **timestamp** and **time**. Default value is **true** -1. `nullable` tells if column is nullable or not. -1. `is_variant` telling that column was generated as variant of another column. +1. `precision` - a precision for **text**, **timestamp**, **time**, **bigint**, **binary**, and **decimal** types. +1. `scale` - a scale for **decimal** type. +1. `timezone` - a flag indicating TZ aware or NTZ **timestamp** and **time**. The default value is **true**. +1. `nullable` - tells if the column is nullable or not. +1. `is_variant` - tells that the column was generated as a variant of another column. -A column schema contains following basic hints: +A column schema contains the following basic hints: -1. `primary_key` marks a column as a part of primary key. -1. `unique` tells that column is unique. on some destination that generates unique index. -1. `merge_key` marks a column as a part of merge key used by +1. 
`primary_key` - marks a column as a part of the primary key. +1. `unique` - tells that the column is unique. On some destinations, that generates a unique index. +1. `merge_key` - marks a column as a part of the merge key used by [incremental load](./incremental-loading.md#merge-incremental_loading). -Hints below are used to create [nested references](#root-and-nested-tables-nested-references) -1. `row_key` a special form of primary key created by `dlt` to uniquely identify rows of data -1. `parent_key` a special form of foreign key used by nested tables to refer to parent tables -1. `root_key` marks a column as a part of root key which is a type of foreign key always referring to the - root table. -1. `_dlt_list_idx` index on a nested list from which nested table is created. +Hints below are used to create [nested references](#root-and-nested-tables-nested-references): +1. `row_key` - a special form of primary key created by `dlt` to uniquely identify rows of data. +1. `parent_key` - a special form of foreign key used by nested tables to refer to parent tables. +1. `root_key` - marks a column as a part of the root key, which is a type of foreign key always referring to the root table. +1. `_dlt_list_idx` - index on a nested list from which the nested table is created. `dlt` lets you define additional performance hints: -1. `partition` marks column to be used to partition data. -1. `cluster` marks column to be part to be used to cluster data -1. `sort` marks column as sortable/having order. on some destinations that non-unique generates - index. +1. `partition` - marks a column to be used to partition data. +1. `cluster` - marks a column to be used to cluster data. +1. `sort` - marks a column as sortable/having order. On some destinations, that non-unique generates an index. :::note -Each destination can interpret the hints in its own way. For example `cluster` hint is used by -Redshift to define table distribution and by BigQuery to specify cluster column. DuckDB and -Postgres ignore it when creating tables. +Each destination can interpret the hints in its own way. For example, the `cluster` hint is used by Redshift to define table distribution and by BigQuery to specify the cluster column. DuckDB and Postgres ignore it when creating tables. ::: ### Variant columns -Variant columns are generated by a normalizer when it encounters data item with type that cannot be -coerced in existing column. Please see our [`coerce_row`](https://github.com/dlt-hub/dlt/blob/7d9baf1b8fdf2813bcf7f1afe5bb3558993305ca/dlt/common/schema/schema.py#L205) if you are interested to see how internally it works. +Variant columns are generated by a normalizer when it encounters a data item with a type that cannot be coerced into an existing column. Please see our [`coerce_row`](https://github.com/dlt-hub/dlt/blob/7d9baf1b8fdf2813bcf7f1afe5bb3558993305ca/dlt/common/schema/schema.py#L205) if you are interested in seeing how it works internally. 
-Let's consider our [getting started](../intro) example with slightly different approach, -where `id` is an integer type at the beginning +Let's consider our [getting started](../intro) example with a slightly different approach, where `id` is an integer type at the beginning: ```py data = [ @@ -157,14 +122,14 @@ data = [ ] ``` -once pipeline runs we will have the following schema: +Once the pipeline runs, we will have the following schema: | name | data_type | nullable | | ------------- | ------------- | -------- | | id | bigint | true | | human_name | text | true | -Now imagine the data has changed and `id` field also contains strings +Now imagine the data has changed and the `id` field also contains strings: ```py data = [ @@ -173,8 +138,7 @@ data = [ ] ``` -So after you run the pipeline `dlt` will automatically infer type changes and will add a new field in the schema `id__v_text` -to reflect that new data type for `id` so for any type which is not compatible with integer it will create a new field. +So after you run the pipeline, `dlt` will automatically infer type changes and will add a new field in the schema `id__v_text` to reflect that new data type for `id`. For any type which is not compatible with an integer, it will create a new field. | name | data_type | nullable | | ------------- | ------------- | -------- | @@ -182,10 +146,9 @@ to reflect that new data type for `id` so for any type which is not compatible w | human_name | text | true | | id__v_text | text | true | -On the other hand if `id` field was already a string then introducing new data with `id` containing other types -will not change schema because they can be coerced to string. +On the other hand, if the `id` field was already a string, then introducing new data with `id` containing other types will not change the schema because they can be coerced to a string. -Now go ahead and try to add a new record where `id` is float number, you should see a new field `id__v_double` in the schema. +Now go ahead and try to add a new record where `id` is a float number. You should see a new field `id__v_double` in the schema. ### Data types @@ -197,70 +160,66 @@ Now go ahead and try to add a new record where `id` is float number, you should | timestamp | `'2023-07-26T14:45:00Z'`, `datetime.datetime.now()` | Supports precision expressed as parts of a second | | date | `datetime.date(2023, 7, 26)` | | | time | `'14:01:02'`, `datetime.time(14, 1, 2)` | Supports precision - see **timestamp** | -| bigint | `9876543210` | Supports precision as number of bits | +| bigint | `9876543210` | Supports precision as the number of bits | | binary | `b'\x00\x01\x02\x03'` | Supports precision, like **text** | | json | `[4, 5, 6]`, `{'a': 1}` | | | decimal | `Decimal('4.56')` | Supports precision and scale | | wei | `2**56` | | -`wei` is a datatype tries to best represent native Ethereum 256bit integers and fixed point +`wei` is a datatype that tries to best represent native Ethereum 256-bit integers and fixed-point decimals. It works correctly on Postgres and BigQuery. All the other destinations have insufficient precision. -`json` data type tells `dlt` to load that element as JSON or string and do not attempt to flatten +The `json` data type tells `dlt` to load that element as JSON or string and not attempt to flatten or create a nested table out of it. Note that structured types like arrays or maps are not supported by `dlt` at this point. -`time` data type is saved in destination without timezone info, if timezone is included it is stripped. 
E.g. `'14:01:02+02:00` -> `'14:01:02'`. +The `time` data type is saved in the destination without timezone info; if a timezone is included, it is stripped. E.g. `'14:01:02+02:00` -> `'14:01:02'`. :::tip -The precision and scale are interpreted by particular destination and are validated when a column is created. Destinations that +The precision and scale are interpreted by the particular destination and are validated when a column is created. Destinations that do not support precision for a given data type will ignore it. -The precision for **timestamp** is useful when creating **parquet** files. Use 3 - for milliseconds, 6 for microseconds, 9 for nanoseconds +The precision for **timestamp** is useful when creating **parquet** files. Use 3 for milliseconds, 6 for microseconds, and 9 for nanoseconds. -The precision for **bigint** is mapped to available integer types ie. TINYINT, INT, BIGINT. The default is 64 bits (8 bytes) precision (BIGINT) +The precision for **bigint** is mapped to available integer types, i.e., TINYINT, INT, BIGINT. The default is 64 bits (8 bytes) precision (BIGINT). ::: ## Table references -`dlt` tables to refer to other tables. It supports two types of such references. -1. **nested reference** created automatically when nested data (ie. `json` document containing nested list) is converted into relational form. Those -references use specialized column and table hints and are used ie. when [merging data](incremental-loading.md). -2. **table references** are optional, user-defined annotations that are not verified and enforced but may be used by downstream tools ie. +`dlt` tables refer to other tables. It supports two types of such references. +1. **Nested reference** created automatically when nested data (i.e., `json` document containing a nested list) is converted into relational form. Those +references use specialized column and table hints and are used, i.e., when [merging data](incremental-loading.md). +2. **Table references** are optional, user-defined annotations that are not verified and enforced but may be used by downstream tools, i.e., to generate automatic tests or models for the loaded data. -### Nested references: root and nested tables -When `dlt` normalizes nested data into relational schema it will automatically create [**root** and **nested** tables](destination-tables.md) and link them using **nested references**. +### Nested references: Root and nested tables -1. All tables get a column with `row_key` hint (named `_dlt_id` by default) to uniquely identify each row of data. -2. Nested tables get `parent` table hint with a name of the parent table. Root table does not have `parent` hint defined. -3. Nested tables get a column with `parent_key` hint (named `_dlt_parent_id` by default) that refers to `row_key` of the `parent` table. +When `dlt` normalizes nested data into a relational schema, it will automatically create [**root** and **nested** tables](destination-tables.md) and link them using **nested references**. -`parent` + `row_key` + `parent_key` form a **nested reference**: from nested table to `parent` table and are extensively used when loading data. Both `replace` and `merge` write dispositions +1. All tables get a column with the `row_key` hint (named `_dlt_id` by default) to uniquely identify each row of data. +2. Nested tables get the `parent` table hint with the name of the parent table. The root table does not have the `parent` hint defined. +3. 
Nested tables get a column with the `parent_key` hint (named `_dlt_parent_id` by default) that refers to the `row_key` of the `parent` table. + +`parent` + `row_key` + `parent_key` form a **nested reference**: from the nested table to the `parent` table and are extensively used when loading data. Both `replace` and `merge` write dispositions. `row_key` is created as follows: -1. Random string on **root** tables, except for [`upsert`](incremental-loading.md#upsert-strategy) and -[`scd2`](incremental-loading.md#scd2-strategy) merge strategies, where it is a deterministic hash of `primary_key` (or whole row, so called `content_hash`, if PK is not defined). -2. A deterministic hash of `parent_key`, `parent` table name and position in the list (`_dlt_list_idx`) -for **nested** tables. +1. A random string on **root** tables, except for [`upsert`](incremental-loading.md#upsert-strategy) and [`scd2`](incremental-loading.md#scd2-strategy) merge strategies, where it is a deterministic hash of `primary_key` (or the whole row, so-called `content_hash`, if PK is not defined). +2. A deterministic hash of `parent_key`, `parent` table name, and position in the list (`_dlt_list_idx`) for **nested** tables. -You are able to bring your own `row_key` by adding `_dlt_id` column/field to your data (both root and nested). All data types with equal operator are supported. +You are able to bring your own `row_key` by adding the `_dlt_id` column/field to your data (both root and nested). All data types with the equal operator are supported. -`merge` write disposition requires additional nested reference that goes from **nested** to **root** table, skipping all parent tables in between. This reference is created by [adding a column with hint](incremental-loading.md#forcing-root-key-propagation) `root_key` (named `_dlt_root_id` by default) to nested tables. +`merge` write disposition requires an additional nested reference that goes from **nested** to **root** table, skipping all parent tables in between. This reference is created by [adding a column with the hint](incremental-loading.md#forcing-root-key-propagation) `root_key` (named `_dlt_root_id` by default) to nested tables. ### Table references + You can annotate tables with table references. This feature is coming soon. ## Schema settings -The `settings` section of schema file lets you define various global rules that impact how tables -and columns are inferred from data. For example you can assign **primary_key** hint to all columns with name `id` or force **timestamp** data type on all columns containing `timestamp` with an use of regex pattern. +The `settings` section of the schema file lets you define various global rules that impact how tables and columns are inferred from data. For example, you can assign the **primary_key** hint to all columns with the name `id` or force the **timestamp** data type on all columns containing `timestamp` with the use of a regex pattern. ### Data type autodetectors -You can define a set of functions that will be used to infer the data type of the column from a -value. The functions are run from top to bottom on the lists. Look in `detections.py` to see what is -available. **iso_timestamp** detector that looks for ISO 8601 strings and converts them to **timestamp** -is enabled by default. +You can define a set of functions that will be used to infer the data type of the column from a value. The functions are run from top to bottom on the lists. Look in `detections.py` to see what is available. 
The **iso_timestamp** detector that looks for ISO 8601 strings and converts them to **timestamp** is enabled by default. ```yaml settings: @@ -273,24 +232,21 @@ settings: - wei_to_double ``` -Alternatively you can add and remove detections from code: +Alternatively, you can add and remove detections from code: ```py source = data_source() # remove iso time detector source.schema.remove_type_detection("iso_timestamp") - # convert UNIX timestamp (float, withing a year from NOW) into timestamp + # convert UNIX timestamp (float, within a year from NOW) into timestamp source.schema.add_type_detection("timestamp") ``` -Above we modify a schema that comes with a source to detect UNIX timestamps with **timestamp** detector. +Above, we modify a schema that comes with a source to detect UNIX timestamps with the **timestamp** detector. ### Column hint rules -You can define a global rules that will apply hints of a newly inferred columns. Those rules apply -to normalized column names. You can use column names directly or with regular expressions. `dlt` is matching -the column names **after they got normalized with naming convention**. +You can define global rules that will apply hints to newly inferred columns. These rules apply to normalized column names. You can use column names directly or with regular expressions. `dlt` matches the column names **after they are normalized with naming conventions**. -By default, schema adopts hints rules from json(relational) normalizer to support correct hinting -of columns added by normalizer: +By default, the schema adopts hint rules from the json(relational) normalizer to support the correct hinting of columns added by the normalizer: ```yaml settings: @@ -310,13 +266,13 @@ settings: root_key: - _dlt_root_id ``` -Above we require exact column name match for a hint to apply. You can also use regular expression (which we call `SimpleRegex`) as follows: +Above, we require an exact column name match for a hint to apply. You can also use regular expressions (which we call `SimpleRegex`) as follows: ```yaml settings: partition: - re:_timestamp$ ``` -Above we add `partition` hint to all columns ending with `_timestamp`. You can do same thing in the code +Above, we add the `partition` hint to all columns ending with `_timestamp`. You can do the same thing in the code: ```py source = data_source() # this will update existing hints with the hints passed @@ -325,10 +281,7 @@ Above we add `partition` hint to all columns ending with `_timestamp`. You can d ### Preferred data types -You can define rules that will set the data type for newly created columns. Put the rules under -`preferred_types` key of `settings`. On the left side there's a rule on a column name, on the right -side is the data type. You can use column names directly or with regular expressions. -`dlt` is matching the column names **after they got normalized with naming convention**. +You can define rules that will set the data type for newly created columns. Put the rules under the `preferred_types` key of `settings`. On the left side, there's a rule on a column name; on the right side is the data type. You can use column names directly or with regular expressions. `dlt` matches the column names **after they are normalized with naming conventions**. Example: @@ -341,8 +294,7 @@ settings: updated_at: timestamp ``` -Above we prefer `timestamp` data type for all columns containing **timestamp** substring and define a few exact matches ie. **created_at**. 
-Here's same thing in code +Above, we prefer the `timestamp` data type for all columns containing the **timestamp** substring and define a few exact matches, i.e., **created_at**. Here's the same thing in code: ```py source = data_source() source.schema.update_preferred_types( @@ -390,7 +342,7 @@ load_info = pipeline.run(source_data) This example iterates through MongoDB collections, applying the **json** [data type](schema#data-types) to a specified column, and then processes the data with `pipeline.run`. ## View and print the schema -To view and print the default schema in a clear YAML format use the command: +To view and print the default schema in a clear YAML format, use the command: ```py pipeline.default_schema.to_pretty_yaml() @@ -419,16 +371,16 @@ schema files in your pipeline. ## Attaching schemas to sources -We recommend to not create schemas explicitly. Instead, user should provide a few global schema -settings and then let the table and column schemas to be generated from the resource hints and the +We recommend not creating schemas explicitly. Instead, users should provide a few global schema +settings and then let the table and column schemas be generated from the resource hints and the data itself. The `dlt.source` decorator accepts a schema instance that you can create yourself and modify in -whatever way you wish. The decorator also support a few typical use cases: +whatever way you wish. The decorator also supports a few typical use cases: ### Schema created implicitly by decorator -If no schema instance is passed, the decorator creates a schema with the name set to source name and +If no schema instance is passed, the decorator creates a schema with the name set to the source name and all the settings to default. ### Automatically load schema file stored with source python module @@ -437,16 +389,16 @@ If no schema instance is passed, and a file with a name `{source name}_schema.ym same folder as the module with the decorated function, it will be automatically loaded and used as the schema. -This should make easier to bundle a fully specified (or pre-configured) schema with a source. +This should make it easier to bundle a fully specified (or pre-configured) schema with a source. ### Schema is modified in the source function body -What if you can configure your schema or add some tables only inside your schema function, when i.e. -you have the source credentials and user settings available? You could for example add detailed -schemas of all the database tables when someone requests a table data to be loaded. This information -is available only at the moment source function is called. +What if you can configure your schema or add some tables only inside your schema function, when, for example, +you have the source credentials and user settings available? You could, for example, add detailed +schemas of all the database tables when someone requests table data to be loaded. This information +is available only at the moment the source function is called. -Similarly to the `source_state()` and `resource_state()` , source and resource function has current +Similarly to the `source_state()` and `resource_state()`, the source and resource function has the current schema available via `dlt.current.source_schema()`. 
Example: @@ -458,8 +410,9 @@ def textual(nesting_level: int): schema = dlt.current.source_schema() # remove date detector schema.remove_type_detection("iso_timestamp") - # convert UNIX timestamp (float, withing a year from NOW) into timestamp + # convert UNIX timestamp (float, within a year from NOW) into timestamp schema.add_type_detection("timestamp") return dlt.resource([]) ``` + diff --git a/docs/website/docs/general-usage/source.md b/docs/website/docs/general-usage/source.md index e94cc2bd30..9a2dd30392 100644 --- a/docs/website/docs/general-usage/source.md +++ b/docs/website/docs/general-usage/source.md @@ -6,25 +6,20 @@ keywords: [source, api, dlt.source] # Source -A [source](glossary.md#source) is a logical grouping of resources i.e. endpoints of a -single API. The most common approach is to define it in a separate Python module. +A [source](glossary.md#source) is a logical grouping of resources, i.e., endpoints of a single API. The most common approach is to define it in a separate Python module. - A source is a function decorated with `@dlt.source` that returns one or more resources. -- A source can optionally define a [schema](schema.md) with tables, columns, performance hints and - more. +- A source can optionally define a [schema](schema.md) with tables, columns, performance hints, and more. - The source Python module typically contains optional customizations and data transformations. -- The source Python module typically contains the authentication and pagination code for particular - API. +- The source Python module typically contains the authentication and pagination code for a particular API. ## Declare sources -You declare source by decorating an (optionally async) function that return or yields one or more resource with `dlt.source`. Our -[Create a pipeline](../walkthroughs/create-a-pipeline.md) how to guide teaches you how to do that. +You declare a source by decorating an (optionally async) function that returns or yields one or more resources with `dlt.source`. Our [Create a pipeline](../walkthroughs/create-a-pipeline.md) how-to guide teaches you how to do that. ### Create resources dynamically -You can create resources by using `dlt.resource` as a function. In an example below we reuse a -single generator function to create a list of resources for several Hubspot endpoints. +You can create resources by using `dlt.resource` as a function. In the example below, we reuse a single generator function to create a list of resources for several Hubspot endpoints. ```py @dlt.source @@ -43,21 +38,19 @@ def hubspot(api_key=dlt.secrets.value): ### Attach and configure schemas -You can [create, attach and configure schema](schema.md#attaching-schemas-to-sources) that will be -used when loading the source. +You can [create, attach, and configure schema](schema.md#attaching-schemas-to-sources) that will be used when loading the source. -### Avoid long lasting operations in source function -Do not extract data in source function. Leave that task to your resources if possible. Source function is executed immediately when called (contrary to resources which delay execution - like Python generators). There are several benefits (error handling, execution metrics, parallelization) you get when you extract data in `pipeline.run` or `pipeline.extract`. +### Avoid long-lasting operations in source function -If this is impractical (for example you want to reflect a database to create resources for tables) make sure you do not call source function too often. 
[See this note if you plan to deploy on Airflow](../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer.md#2-modify-dag-file) +Do not extract data in the source function. Leave that task to your resources if possible. The source function is executed immediately when called (contrary to resources which delay execution - like Python generators). There are several benefits (error handling, execution metrics, parallelization) you get when you extract data in `pipeline.run` or `pipeline.extract`. +If this is impractical (for example, you want to reflect a database to create resources for tables), make sure you do not call the source function too often. [See this note if you plan to deploy on Airflow](../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer.md#2-modify-dag-file). ## Customize sources ### Access and select resources to load -You can access resources present in a source and select which of them you want to load. In case of -`hubspot` resource above we could select and load "companies", "deals" and "products" resources: +You can access resources present in a source and select which of them you want to load. In the case of the `hubspot` resource above, we could select and load "companies", "deals", and "products" resources: ```py from hubspot import hubspot @@ -84,10 +77,9 @@ print(source.deals.selected) source.deals.selected = False ``` -### Filter, transform and pivot data +### Filter, transform, and pivot data -You can modify and filter data in resources, for example if we want to keep only deals after certain -date: +You can modify and filter data in resources. For example, if we want to keep only deals after a certain date: ```py source.deals.add_filter(lambda deal: deal["created_at"] > yesterday) @@ -97,11 +89,7 @@ Find more on transforms [here](resource.md#filter-transform-and-pivot-data). ### Load data partially -You can limit the number of items produced by each resource by calling a `add_limit` method on a -source. This is useful for testing, debugging and generating sample datasets for experimentation. -You can easily get your test dataset in a few minutes, when otherwise you'd need to wait hours for -the full loading to complete. Below we limit the `pipedrive` source to just get **10 pages** of data -from each endpoint. Mind that the transformers will be evaluated fully: +You can limit the number of items produced by each resource by calling an `add_limit` method on a source. This is useful for testing, debugging, and generating sample datasets for experimentation. You can easily get your test dataset in a few minutes, whereas otherwise, you'd need to wait hours for the full loading to complete. Below we limit the `pipedrive` source to just get **10 pages** of data from each endpoint. Note that the transformers will be evaluated fully: ```py from pipedrive import pipedrive_source @@ -119,11 +107,7 @@ Find more on sampling data [here](resource.md#sample-from-large-data). ### Add more resources to existing source -You can add a custom resource to source after it was created. Imagine that you want to score all the -deals with a keras model that will tell you if the deal is a fraud or not. In order to do that you -declare a new -[transformer that takes the data from](resource.md#feeding-data-from-one-resource-into-another) `deals` -resource and add it to the source. +You can add a custom resource to a source after it is created. Imagine that you want to score all the deals with a keras model that will tell you if the deal is a fraud or not. 
To do that, you declare a new [transformer that takes the data from](resource.md#feeding-data-from-one-resource-into-another) the `deals` resource and add it to the source. ```py import dlt @@ -143,7 +127,7 @@ source.resources.add(source.deals | deal_scores) # load the data: you'll see the new table `deal_scores` in your destination! pipeline.run(source) ``` -You can also set the resources in the source as follows +You can also set the resources in the source as follows: ```py source.deal_scores = source.deals | deal_scores ``` @@ -152,13 +136,12 @@ or source.resources["deal_scores"] = source.deals | deal_scores ``` :::note -When adding resource to the source, `dlt` clones the resource so your existing instance is not affected. +When adding a resource to the source, `dlt` clones the resource so your existing instance is not affected. ::: ### Reduce the nesting level of generated tables -You can limit how deep `dlt` goes when generating nested tables and flattening dicts into columns. By default, the library will descend -and generate nested tables for all nested lists and columns form dicts, without limit. +You can limit how deep `dlt` goes when generating nested tables and flattening dicts into columns. By default, the library will descend and generate nested tables for all nested lists and columns from dicts, without limit. ```py @dlt.source(max_table_nesting=1) @@ -166,13 +149,10 @@ def mongo_db(): ... ``` -In the example above, we want only 1 level of nested tables to be generated (so there are no nested -tables of a nested table). Typical settings: +In the example above, we want only 1 level of nested tables to be generated (so there are no nested tables of a nested table). Typical settings: -- `max_table_nesting=0` will not generate nested tables and will not flatten dicts into columns at all. All nested data will be - represented as JSON. -- `max_table_nesting=1` will generate nested tables of root tables and nothing more. All nested - data in nested tables will be represented as JSON. +- `max_table_nesting=0` will not generate nested tables and will not flatten dicts into columns at all. All nested data will be represented as JSON. +- `max_table_nesting=1` will generate nested tables of root tables and nothing more. All nested data in nested tables will be represented as JSON. You can achieve the same effect after the source instance is created: @@ -183,17 +163,12 @@ source = mongo_db() source.max_table_nesting = 0 ``` -Several data sources are prone to contain semi-structured documents with very deep nesting i.e. -MongoDB databases. Our practical experience is that setting the `max_nesting_level` to 2 or 3 -produces the clearest and human-readable schemas. +Several data sources are prone to contain semi-structured documents with very deep nesting, e.g., MongoDB databases. Our practical experience is that setting the `max_nesting_level` to 2 or 3 produces the clearest and most human-readable schemas. :::tip -The `max_table_nesting` parameter at the source level doesn't automatically apply to individual -resources when accessed directly (e.g., using `source.resources["resource_1"])`. To make sure it -works, either use `source.with_resources("resource_1")` or set the parameter directly on the resource. +The `max_table_nesting` parameter at the source level doesn't automatically apply to individual resources when accessed directly (e.g., using `source.resources["resource_1"]`). 
To make sure it works, either use `source.with_resources("resource_1")` or set the parameter directly on the resource. ::: - You can directly configure the `max_table_nesting` parameter on the resource level as: ```py @@ -209,28 +184,28 @@ my_source.my_resource.max_table_nesting = 0 ### Modify schema -The schema is available via `schema` property of the source. -[You can manipulate this schema i.e. add tables, change column definitions etc. before the data is loaded.](schema.md#schema-is-modified-in-the-source-function-body) +The schema is available via the `schema` property of the source. +[You can manipulate this schema, i.e., add tables, change column definitions, etc., before the data is loaded.](schema.md#schema-is-modified-in-the-source-function-body) -Source provides two other convenience properties: +The source provides two other convenience properties: -1. `max_table_nesting` to set the maximum nesting level for nested tables and flattened columns -1. `root_key` to propagate the `_dlt_id` of from a root table to all nested tables. +1. `max_table_nesting` to set the maximum nesting level for nested tables and flattened columns. +2. `root_key` to propagate the `_dlt_id` from a root table to all nested tables. ## Load sources -You can pass individual sources or list of sources to the `dlt.pipeline` object. By default, all the -sources will be loaded to a single dataset. +You can pass individual sources or a list of sources to the `dlt.pipeline` object. By default, all the +sources will be loaded into a single dataset. You are also free to decompose a single source into several ones. For example, you may want to break -down a 50 table copy job into an airflow dag with high parallelism to load the data faster. To do +down a 50-table copy job into an airflow dag with high parallelism to load the data faster. To do so, you could get the list of resources as: ```py # get a list of resources' names resource_list = sql_source().resources.keys() -#now we are able to make a pipeline for each resource +# now we are able to make a pipeline for each resource for res in resource_list: pipeline.run(sql_source().with_resources(res)) ``` @@ -249,3 +224,4 @@ With selected resources: ```py p.run(tables.with_resources("users"), write_disposition="replace") ``` + diff --git a/docs/website/docs/general-usage/state.md b/docs/website/docs/general-usage/state.md index b34d37c8b1..7251017689 100644 --- a/docs/website/docs/general-usage/state.md +++ b/docs/website/docs/general-usage/state.md @@ -6,14 +6,11 @@ keywords: [state, metadata, dlt.current.resource_state, dlt.current.source_state # State -The pipeline state is a Python dictionary which lives alongside your data; you can store values in -it and, on next pipeline run, request them back. +The pipeline state is a Python dictionary that lives alongside your data; you can store values in it and, on the next pipeline run, request them back. ## Read and write pipeline state in a resource -You read and write the state in your resources. Below we use the state to create a list of chess -game archives which we then use to -[prevent requesting duplicates](incremental-loading.md#advanced-state-usage-storing-a-list-of-processed-entities). +You read and write the state in your resources. Below we use the state to create a list of chess game archives, which we then use to [prevent requesting duplicates](incremental-loading.md#advanced-state-usage-storing-a-list-of-processed-entities). 
```py @dlt.resource(write_disposition="append") @@ -35,55 +32,41 @@ def players_games(chess_url, player, start_month=None, end_month=None): yield r.json().get("games", []) ``` -Above, we request the resource-scoped state. The `checked_archives` list stored under `archives` -dictionary key is private and visible only to the `players_games` resource. +Above, we request the resource-scoped state. The `checked_archives` list stored under the `archives` dictionary key is private and visible only to the `players_games` resource. -The pipeline state is stored locally in -[pipeline working directory](pipeline.md#pipeline-working-directory) and as a consequence - it -cannot be shared with pipelines with different names. You must also make sure that data written into -the state is JSON Serializable. Except standard Python types, `dlt` handles `DateTime`, `Decimal`, -`bytes` and `UUID`. +The pipeline state is stored locally in the [pipeline working directory](pipeline.md#pipeline-working-directory) and, as a consequence, it cannot be shared with pipelines with different names. You must also make sure that data written into the state is JSON serializable. Except for standard Python types, `dlt` handles `DateTime`, `Decimal`, `bytes`, and `UUID`. ## Share state across resources and read state in a source -You can also access the source-scoped state with `dlt.current.source_state()` which can be shared -across resources of a particular source and is also available read-only in the source-decorated -functions. The most common use case for the source-scoped state is to store mapping of custom fields -to their displayable names. You can take a look at our -[pipedrive source](https://github.com/dlt-hub/verified-sources/blob/master/sources/pipedrive/__init__.py#L118) -for an example of state passed across resources. +You can also access the source-scoped state with `dlt.current.source_state()`, which can be shared across resources of a particular source and is also available read-only in the source-decorated functions. The most common use case for the source-scoped state is to store a mapping of custom fields to their displayable names. You can take a look at our [pipedrive source](https://github.com/dlt-hub/verified-sources/blob/master/sources/pipedrive/__init__.py#L118) for an example of state passed across resources. :::tip -[decompose your source](../reference/performance.md#source-decomposition-for-serial-and-parallel-resource-execution) -in order to, for example run it on Airflow in parallel. If you cannot avoid that, designate one of -the resources as state writer and all the other as state readers. This is exactly what `pipedrive` -pipeline does. With such structure you will still be able to run some of your resources in -parallel. +[Decompose your source](../reference/performance.md#source-decomposition-for-serial-and-parallel-resource-execution) in order to, for example, run it on Airflow in parallel. If you cannot avoid that, designate one of the resources as the state writer and all the others as state readers. This is exactly what the `pipedrive` pipeline does. With such a structure, you will still be able to run some of your resources in parallel. ::: :::caution -The `dlt.state()` is a deprecated alias to `dlt.current.source_state()` and will be soon -removed. +The `dlt.state()` is a deprecated alias to `dlt.current.source_state()` and will be soon removed. 
::: + ## Syncing state with destination What if you run your pipeline on, for example, Airflow where every task gets a clean filesystem and [pipeline working directory](pipeline.md#pipeline-working-directory) is always deleted? `dlt` loads -your state into the destination together with all other data and when faced with a clean start, it -will try to restore state from the destination. +your state into the destination together with all other data, and when faced with a clean start, it +will try to restore the state from the destination. -The remote state is identified by pipeline name, the destination location (as given by the -credentials) and destination dataset. To re-use the same state, use the same pipeline name and +The remote state is identified by the pipeline name, the destination location (as given by the +credentials), and the destination dataset. To re-use the same state, use the same pipeline name and destination. The state is stored in the `_dlt_pipeline_state` table at the destination and contains information -about the pipeline, pipeline run (that the state belongs to) and state blob. +about the pipeline, pipeline run (that the state belongs to), and state blob. -`dlt` has `dlt pipeline sync` command where you can +`dlt` has a `dlt pipeline sync` command where you can [request the state back from that table](../reference/command-line-interface.md#sync-pipeline-with-the-destination). > 💡 If you can keep the pipeline working directory across the runs, you can disable the state sync -> by setting `restore_from_destination=false` i.e. in your `config.toml`. +> by setting `restore_from_destination=false` in your `config.toml`. ## When to use pipeline state @@ -94,71 +77,71 @@ about the pipeline, pipeline run (that the state belongs to) and state blob. if the list is not much bigger than 100k elements. - [Store large dictionaries of last values](incremental-loading.md#advanced-state-usage-tracking-the-last-value-for-all-search-terms-in-twitter-api) if you are not able to implement it with the standard incremental construct. -- Store the custom fields dictionaries, dynamic configurations and other source-scoped state. +- Store custom fields dictionaries, dynamic configurations, and other source-scoped state. ## Do not use pipeline state if it can grow to millions of records Do not use dlt state when it may grow to millions of elements. Do you plan to store modification -timestamps of all of your millions of user records? This is probably a bad idea! In that case you +timestamps of all of your millions of user records? This is probably a bad idea! In that case, you could: -- Store the state in dynamo-db, redis etc. taking into the account that if the extract stage fails - you'll end with invalid state. +- Store the state in dynamo-db, redis, etc., taking into account that if the extract stage fails, + you'll end up with an invalid state. - Use your loaded data as the state. `dlt` exposes the current pipeline via `dlt.current.pipeline()` from which you can obtain [sqlclient](../dlt-ecosystem/transformations/sql.md) - and load the data of interest. In that case try at least to process your user records in batches. + and load the data of interest. In that case, try at least to process your user records in batches. ### Access data in the destination instead of pipeline state -In the example below, we load recent comments made by given `user_id`. We access `user_comments` table to select -maximum comment id for a given user. 
+ +In the example below, we load recent comments made by a given `user_id`. We access the `user_comments` table to select the maximum comment id for a given user. + ```py import dlt @dlt.resource(name="user_comments") def comments(user_id: str): current_pipeline = dlt.current.pipeline() - # find last comment id for given user_id by looking in destination + # find the last comment id for the given user_id by looking in the destination max_id: int = 0 - # on first pipeline run, user_comments table does not yet exist so do not check at all - # alternatively catch DatabaseUndefinedRelation which is raised when unknown table is selected + # on the first pipeline run, the user_comments table does not yet exist, so do not check at all + # alternatively, catch DatabaseUndefinedRelation which is raised when an unknown table is selected if not current_pipeline.first_run: with current_pipeline.sql_client() as client: - # we may get last user comment or None which we replace with 0 + # we may get the last user comment or None, which we replace with 0 max_id = ( client.execute_sql( "SELECT MAX(_id) FROM user_comments WHERE user_id=?", user_id )[0][0] or 0 ) - # use max_id to filter our results (we simulate API query) + # use max_id to filter our results (we simulate an API query) yield from [ {"_id": i, "value": letter, "user_id": user_id} for i, letter in zip([1, 2, 3], ["A", "B", "C"]) if i > max_id ] ``` -When pipeline is first run, the destination dataset and `user_comments` table do not yet exist. We skip the destination -query by using `first_run` property of the pipeline. We also handle a situation where there are no comments for a user_id -by replacing None with 0 as `max_id`. + +When the pipeline is first run, the destination dataset and `user_comments` table do not yet exist. We skip the destination query by using the `first_run` property of the pipeline. We also handle a situation where there are no comments for a user_id by replacing None with 0 as `max_id`. ## Inspect the pipeline state -You can inspect pipeline state with +You can inspect the pipeline state with the [`dlt pipeline` command](../reference/command-line-interface.md#dlt-pipeline): ```sh dlt pipeline -v chess_pipeline info ``` -will display source and resource state slots for all known sources. +This will display source and resource state slots for all known sources. ## Reset the pipeline state: full or partial **To fully reset the state:** - Drop the destination dataset to fully reset the pipeline. -- [Set the `dev_mode` flag when creating pipeline](pipeline.md#do-experiments-with-dev-mode). +- [Set the `dev_mode` flag when creating the pipeline](pipeline.md#do-experiments-with-dev-mode). - Use the `dlt pipeline drop --drop-all` command to [drop state and tables for a given schema name](../reference/command-line-interface.md#selectively-drop-tables-and-reset-state). @@ -167,4 +150,5 @@ will display source and resource state slots for all known sources. - Use the `dlt pipeline drop ` command to [drop state and tables for a given resource](../reference/command-line-interface.md#selectively-drop-tables-and-reset-state). - Use the `dlt pipeline drop --state-paths` command to - [reset the state at given path without touching the tables and data](../reference/command-line-interface.md#selectively-drop-tables-and-reset-state). + [reset the state at a given path without touching the tables and data](../reference/command-line-interface.md#selectively-drop-tables-and-reset-state). 
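For example, for the chess pipeline used above, the partial-reset commands could look roughly like this (the pipeline, resource, and state path names are illustrative; adjust them to your own pipeline):

```sh
# drop the tables and reset the state of a single resource
dlt pipeline chess_pipeline drop players_games

# reset only the "archives" state path, leaving tables and data untouched
dlt pipeline chess_pipeline drop --state-paths archives
```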
+ diff --git a/docs/website/docs/intro.md b/docs/website/docs/intro.md index 6660696cfb..5d1843e28f 100644 --- a/docs/website/docs/intro.md +++ b/docs/website/docs/intro.md @@ -18,7 +18,7 @@ dlt is designed to be easy to use, flexible, and scalable: - dlt infers [schemas](./general-usage/schema) and [data types](./general-usage/schema/#data-types), [normalizes the data](./general-usage/schema/#data-normalizer), and handles nested data structures. - dlt supports a variety of [popular destinations](./dlt-ecosystem/destinations/) and has an interface to add [custom destinations](./dlt-ecosystem/destinations/destination) to create reverse ETL pipelines. -- dlt can be deployed anywhere Python runs, be it on [Airflow](./walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer), [serverless functions](./walkthroughs/deploy-a-pipeline/deploy-with-google-cloud-functions) or any other cloud deployment of your choice. +- dlt can be deployed anywhere Python runs, be it on [Airflow](./walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer), [serverless functions](./walkthroughs/deploy-a-pipeline/deploy-with-google-cloud-functions), or any other cloud deployment of your choice. - dlt automates pipeline maintenance with [schema evolution](./general-usage/schema-evolution) and [schema and data contracts](./general-usage/schema-contracts). To get started with dlt, install the library using pip: @@ -43,7 +43,7 @@ We recommend using a clean virtual environment for your experiments! Read the [d ]}> -Use dlt's [REST API source](./tutorial/rest-api) to extract data from any REST API. Define API endpoints you’d like to fetch data from, pagination method and authentication and dlt will handle the rest: +Use dlt's [REST API source](./tutorial/rest-api) to extract data from any REST API. Define API endpoints you’d like to fetch data from, pagination method, and authentication, and dlt will handle the rest: ```py import dlt @@ -76,7 +76,7 @@ Follow the [REST API source tutorial](./tutorial/rest-api) to learn more about t -Use the [SQL source](./tutorial/sql-database) to extract data from the database like PostgreSQL, MySQL, SQLite, Oracle and more. +Use the [SQL source](./tutorial/sql-database) to extract data from databases like PostgreSQL, MySQL, SQLite, Oracle, and more. ```py from dlt.sources.sql_database import sql_database @@ -155,4 +155,5 @@ If you'd like to try out dlt without installing it on your machine, check out th 1. Give the library a ⭐ and check out the code on [GitHub](https://github.com/dlt-hub/dlt). 1. Ask questions and share how you use the library on [Slack](https://dlthub.com/community). -1. Report problems and make feature requests [here](https://github.com/dlt-hub/dlt/issues/new/choose). \ No newline at end of file +1. Report problems and make feature requests [here](https://github.com/dlt-hub/dlt/issues/new/choose). +