diff --git a/docs/website/docs/_book-onboarding-call.md b/docs/website/docs/_book-onboarding-call.md index 4725128bf0..a6941e5f3b 100644 --- a/docs/website/docs/_book-onboarding-call.md +++ b/docs/website/docs/_book-onboarding-call.md @@ -1 +1,2 @@ -book a call with a dltHub Solutions Engineer +Book a call with a dltHub Solutions Engineer. + diff --git a/docs/website/docs/build-a-pipeline-tutorial.md b/docs/website/docs/build-a-pipeline-tutorial.md index de1a4d647f..0fe483c944 100644 --- a/docs/website/docs/build-a-pipeline-tutorial.md +++ b/docs/website/docs/build-a-pipeline-tutorial.md @@ -6,21 +6,19 @@ keywords: [getting started, quick start, basics] # Building data pipelines with `dlt`, from basic to advanced -This in-depth overview will take you through the main areas of pipelining with `dlt`. Go to the -related pages you are instead looking for the [quickstart](./intro.md). +This in-depth overview will take you through the main areas of pipelining with `dlt`. If you are looking for the [quickstart](./intro.md), go to the related pages. ## Why build pipelines with `dlt`? -`dlt` offers functionality to support the entire extract and load process. Let's look at the high level diagram: +`dlt` offers functionality to support the entire extract and load process. Let's look at the high-level diagram: ![dlt source resource pipe diagram](/img/dlt-high-level.png) +First, we have a `pipeline` function that can infer a schema from data and load the data to the destination. +We can use this pipeline with JSON data, dataframes, or other iterable objects such as generator functions. -First, we have a `pipeline` function, that can infer a schema from data and load the data to the destination. -We can use this pipeline with json data, dataframes, or other iterable objects such as generator functions. - -This pipeline provides effortless loading via a schema discovery, versioning and evolution -engine that ensures you can "just load" any data with row and column level lineage. +This pipeline provides effortless loading via a schema discovery, versioning, and evolution +engine that ensures you can "just load" any data with row and column-level lineage. By utilizing a `dlt pipeline`, we can easily adapt and structure data as it evolves, reducing the time spent on maintenance and development. @@ -28,11 +26,10 @@ maintenance and development. This allows our data team to focus on leveraging the data and driving value, while ensuring effective governance through timely notifications of any changes. -For extract, `dlt` also provides `source` and `resource` decorators that enable defining +For extraction, `dlt` also provides `source` and `resource` decorators that enable defining how extracted data should be loaded, while supporting graceful, scalable extraction via micro-batching and parallelism. - ## The simplest pipeline: 1 liner to load data with schema evolution ```py @@ -41,12 +38,9 @@ import dlt dlt.pipeline(destination='duckdb', dataset_name='mydata').run([{'id': 1, 'name': 'John'}], table_name="users") ``` -A pipeline in the `dlt` library is a powerful tool that allows you to move data from your Python code -to a destination with a single function call. By defining a pipeline, you can easily load, -normalize, and evolve your data schemas, enabling seamless data integration and analysis. +A pipeline in the `dlt` library is a powerful tool that allows you to move data from your Python code to a destination with a single function call. 
By defining a pipeline, you can easily load, normalize, and evolve your data schemas, enabling seamless data integration and analysis. -For example, let's consider a scenario where you want to load a list of objects into a DuckDB table -named "three". With `dlt`, you can create a pipeline and run it with just a few lines of code: +For example, let's consider a scenario where you want to load a list of objects into a DuckDB table named "three". With `dlt`, you can create a pipeline and run it with just a few lines of code: 1. [Create a pipeline](./walkthroughs/create-a-pipeline.md) to the [destination](dlt-ecosystem/destinations). 1. Give this pipeline data and [run it](./walkthroughs/run-a-pipeline.md). @@ -67,20 +61,15 @@ info = pipeline.run(data, table_name="countries") print(info) ``` -In this example, the `pipeline` function is used to create a pipeline with the specified -destination (DuckDB) and dataset name ("country_data"). The `run` method is then called to load -the data from a list of objects into the table named "countries". The `info` variable stores -information about the loaded data, such as package IDs and job metadata. +In this example, the `pipeline` function is used to create a pipeline with the specified destination (DuckDB) and dataset name ("country_data"). The `run` method is then called to load the data from a list of objects into the table named "countries". The `info` variable stores information about the loaded data, such as package IDs and job metadata. -The data you can pass to it should be iterable: lists of rows, generators, or `dlt` sources will do -just fine. +The data you can pass to it should be iterable: lists of rows, generators, or `dlt` sources will do just fine. -If you want to configure how the data is loaded, you can choose between `write_disposition`s -such as `replace`, `append` and `merge` in the pipeline function. +If you want to configure how the data is loaded, you can choose between `write_disposition`s such as `replace`, `append`, and `merge` in the pipeline function. Here is an example where we load some data to duckdb by `upserting` or `merging` on the id column found in the data. In this example, we also run a dbt package and then load the outcomes of the load jobs into their respective tables. -This will enable us to log when schema changes occurred and match them to the loaded data for lineage, granting us both column and row level lineage. +This will enable us to log when schema changes occurred and match them to the loaded data for lineage, granting us both column and row-level lineage. We also alert the schema change to a Slack channel where hopefully the producer and consumer are subscribed. ```py @@ -137,55 +126,34 @@ for package in load_info.load_packages: ## Extracting data with `dlt` -Extracting data with `dlt` is simple - you simply decorate your data-producing functions with loading -or incremental extraction metadata, which enables `dlt` to extract and load by your custom logic. +Extracting data with `dlt` is simple - you simply decorate your data-producing functions with loading or incremental extraction metadata, which enables `dlt` to extract and load by your custom logic. Technically, two key aspects contribute to `dlt`'s effectiveness: -- Scalability through iterators, chunking, parallelization. -- The utilization of implicit extraction DAGs that allow efficient API calls for data - enrichments or transformations. +- Scalability through iterators, chunking, and parallelization. 
+- The utilization of implicit extraction DAGs that allow efficient API calls for data enrichments or transformations. ### Scalability via iterators, chunking, and parallelization -`dlt` offers scalable data extraction by leveraging iterators, chunking, and parallelization -techniques. This approach allows for efficient processing of large datasets by breaking them down -into manageable chunks. +`dlt` offers scalable data extraction by leveraging iterators, chunking, and parallelization techniques. This approach allows for efficient processing of large datasets by breaking them down into manageable chunks. -For example, consider a scenario where you need to extract data from a massive database with -millions of records. Instead of loading the entire dataset at once, `dlt` allows you to use -iterators to fetch data in smaller, more manageable portions. This technique enables incremental -processing and loading, which is particularly useful when dealing with limited memory resources. +For example, consider a scenario where you need to extract data from a massive database with millions of records. Instead of loading the entire dataset at once, `dlt` allows you to use iterators to fetch data in smaller, more manageable portions. This technique enables incremental processing and loading, which is particularly useful when dealing with limited memory resources. -Furthermore, `dlt` facilitates parallelization during the extraction process. By processing -multiple data chunks simultaneously, `dlt` takes advantage of parallel processing capabilities, -resulting in significantly reduced extraction times. This parallelization enhances performance, -especially when dealing with high-volume data sources. +Furthermore, `dlt` facilitates parallelization during the extraction process. By processing multiple data chunks simultaneously, `dlt` takes advantage of parallel processing capabilities, resulting in significantly reduced extraction times. This parallelization enhances performance, especially when dealing with high-volume data sources. ### Implicit extraction DAGs -`dlt` incorporates the concept of implicit extraction DAGs to handle the dependencies between -data sources and their transformations automatically. A DAG represents a directed graph without -cycles, where each node represents a data source or transformation step. +`dlt` incorporates the concept of implicit extraction DAGs to handle the dependencies between data sources and their transformations automatically. A DAG represents a directed graph without cycles, where each node represents a data source or transformation step. -When using `dlt`, the tool automatically generates an extraction DAG based on the dependencies -identified between the data sources and their transformations. This extraction DAG determines the -optimal order for extracting the resources to ensure data consistency and integrity. +When using `dlt`, the tool automatically generates an extraction DAG based on the dependencies identified between the data sources and their transformations. This extraction DAG determines the optimal order for extracting the resources to ensure data consistency and integrity. -For instance, imagine a pipeline where data needs to be extracted from multiple API endpoints and -undergo certain transformations or enrichments via additional calls before loading it into a -database. `dlt` analyzes the dependencies between the API endpoints and transformations and -generates an extraction DAG accordingly. 
The extraction DAG ensures that the data is extracted in -the correct order, accounting for any dependencies and transformations. +For instance, imagine a pipeline where data needs to be extracted from multiple API endpoints and undergo certain transformations or enrichments via additional calls before loading it into a database. `dlt` analyzes the dependencies between the API endpoints and transformations and generates an extraction DAG accordingly. The extraction DAG ensures that the data is extracted in the correct order, accounting for any dependencies and transformations. -When deploying to Airflow, the internal DAG is unpacked into Airflow tasks in such a way to ensure -consistency and allow granular loading. +When deploying to Airflow, the internal DAG is unpacked into Airflow tasks in such a way to ensure consistency and allow granular loading. -## Defining Incremental Loading +## Defining incremental loading -[Incremental loading](general-usage/incremental-loading.md) is a crucial concept in data pipelines that involves loading only new or changed -data instead of reloading the entire dataset. This approach provides several benefits, including -low-latency data transfer and cost savings. +[Incremental loading](general-usage/incremental-loading.md) is a crucial concept in data pipelines that involves loading only new or changed data instead of reloading the entire dataset. This approach provides several benefits, including low-latency data transfer and cost savings. ### Declarative loading @@ -197,10 +165,10 @@ behavior using the `write_disposition` parameter. There are three options availa source on the current run. You can achieve this by setting `write_disposition='replace'` in your resources. It is suitable for stateless data that doesn't change, such as recorded events like page views. -1. Append: The append option adds new data to the existing destination dataset. By using +2. Append: The append option adds new data to the existing destination dataset. By using `write_disposition='append'`, you can ensure that only new records are loaded. This is suitable for stateless data that can be easily appended without any conflicts. -1. Merge: The merge option is used when you want to merge new data with the existing destination +3. Merge: The merge option is used when you want to merge new data with the existing destination dataset while also handling deduplication or upserts. It requires the use of `merge_key` and/or `primary_key` to identify and update specific records. By setting `write_disposition='merge'`, you can perform merge-based incremental loading. @@ -226,15 +194,15 @@ incrementally, deduplicating it, and performing the necessary merge operations. Advanced state management in `dlt` allows you to store and retrieve values across pipeline runs by persisting them at the destination but accessing them in a dictionary in code. This enables you to track and manage incremental loading effectively. By leveraging the pipeline state, you can -preserve information, such as last values, checkpoints or column renames, and utilize them later in +preserve information, such as last values, checkpoints, or column renames, and utilize them later in the pipeline. -## Transforming the Data +## Transforming the data Data transformation plays a crucial role in the data loading process. You can perform transformations both before and after loading the data. 
Here's how you can achieve it: -### Before Loading +### Before loading Before loading the data, you have the flexibility to perform transformations using Python. You can leverage Python's extensive libraries and functions to manipulate and preprocess the data as needed. @@ -248,16 +216,13 @@ consistent mapping. The `dummy_source` generates dummy data with an `id` and `na column, and the `add_map` function applies the `pseudonymize_name` transformation to each record. -### After Loading +### After loading For transformations after loading the data, you have several options available: #### [Using dbt](dlt-ecosystem/transformations/dbt/dbt.md) -dbt is a powerful framework for transforming data. It enables you to structure your transformations -into DAGs, providing cross-database compatibility and various features such as templating, -backfills, testing, and troubleshooting. You can use the dbt runner in `dlt` to seamlessly -integrate dbt into your pipeline. Here's an example of running a dbt package after loading the data: +dbt is a powerful framework for transforming data. It enables you to structure your transformations into DAGs, providing cross-database compatibility and various features such as templating, backfills, testing, and troubleshooting. You can use the dbt runner in `dlt` to seamlessly integrate dbt into your pipeline. Here's an example of running a dbt package after loading the data: ```py import dlt @@ -284,7 +249,7 @@ pipeline = dlt.pipeline( # make venv and install dbt in it. venv = dlt.dbt.get_venv(pipeline) -# get package from local or github link and run +# get package from local or GitHub link and run dbt = dlt.dbt.package(pipeline, "pipedrive/dbt_pipedrive/pipedrive", venv=venv) models = dbt.run_all() @@ -293,17 +258,11 @@ for m in models: print(f"Model {m.model_name} materialized in {m.time} with status {m.status} and message {m.message}") ``` -In this example, the first pipeline loads the data using `pipedrive_source()`. The second -pipeline performs transformations using a dbt package called `pipedrive` after loading the data. -The `dbt.package` function sets up the dbt runner, and `dbt.run_all()` executes the dbt -models defined in the package. +In this example, the first pipeline loads the data using `pipedrive_source()`. The second pipeline performs transformations using a dbt package called `pipedrive` after loading the data. The `dbt.package` function sets up the dbt runner, and `dbt.run_all()` executes the dbt models defined in the package. #### [Using the `dlt` SQL client](dlt-ecosystem/transformations/sql.md) -Another option is to leverage the `dlt` SQL client to query the loaded data and perform -transformations using SQL statements. You can execute SQL statements that change the database schema -or manipulate data within tables. Here's an example of inserting a row into the `customers` -table using the `dlt` SQL client: +Another option is to leverage the `dlt` SQL client to query the loaded data and perform transformations using SQL statements. You can execute SQL statements that change the database schema or manipulate data within tables. Here's an example of inserting a row into the `customers` table using the `dlt` SQL client: ```py pipeline = dlt.pipeline(destination="bigquery", dataset_name="crm") @@ -314,14 +273,11 @@ with pipeline.sql_client() as client: ) ``` -In this example, the `execute_sql` method of the SQL client allows you to execute SQL -statements. The statement inserts a row with values into the `customers` table. 
+In this example, the `execute_sql` method of the SQL client allows you to execute SQL statements. The statement inserts a row with values into the `customers` table. #### [Using Pandas](dlt-ecosystem/transformations/pandas.md) -You can fetch query results as Pandas data frames and perform transformations using Pandas -functionalities. Here's an example of reading data from the `issues` table in DuckDB and -counting reaction types using Pandas: +You can fetch query results as Pandas data frames and perform transformations using Pandas functionalities. Here's an example of reading data from the `issues` table in DuckDB and counting reaction types using Pandas: ```py pipeline = dlt.pipeline( @@ -340,10 +296,9 @@ with pipeline.sql_client() as client: counts = reactions.sum(0).sort_values(0, ascending=False) ``` -By leveraging these transformation options, you can shape and manipulate the data before or after -loading it, allowing you to meet specific requirements and ensure data quality and consistency. +By leveraging these transformation options, you can shape and manipulate the data before or after loading it, allowing you to meet specific requirements and ensure data quality and consistency. -## Adjusting the automated normalisation +## Adjusting the automated normalization To streamline the process, `dlt` recommends attaching schemas to sources implicitly instead of creating them explicitly. You can provide a few global schema settings and let the table and column @@ -356,12 +311,12 @@ By adjusting the automated normalization process in `dlt`, you can ensure that t schema meets your specific requirements and aligns with your preferred naming conventions, data types, and other customization needs. -### Customizing the Normalization Process +### Customizing the normalization process Customizing the normalization process in `dlt` allows you to adapt it to your specific requirements. You can adjust table and column names, configure column properties, define data type autodetectors, -apply performance hints, specify preferred data types, or change how ids are propagated in the +apply performance hints, specify preferred data types, or change how IDs are propagated in the unpacking process. These customization options enable you to create a schema that aligns with your desired naming @@ -370,7 +325,7 @@ the normalization process to meet your unique needs and achieve optimal results. Read more about how to configure [schema generation.](general-usage/schema.md) -### Exporting and Importing Schema Files +### Exporting and importing schema files `dlt` allows you to export and import schema files, which contain the structure and instructions for processing and loading the data. Exporting schema files enables you to modify them directly, making @@ -379,12 +334,12 @@ use them in your pipeline. Read more: [Adjust a schema docs.](./walkthroughs/adjust-a-schema.md) -## Governance Support in `dlt` Pipelines +## Governance support in `dlt` pipelines `dlt` pipelines offer robust governance support through three key mechanisms: pipeline metadata utilization, schema enforcement and curation, and schema change alerts. -### Pipeline Metadata +### Pipeline metadata `dlt` pipelines leverage metadata to provide governance capabilities. This metadata includes load IDs, which consist of a timestamp and pipeline name. Load IDs enable incremental transformations and data @@ -392,7 +347,7 @@ vaulting by tracking data loads and facilitating data lineage and traceability. 
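+
+As a minimal sketch of how this metadata can be used (reusing the toy `users` table from the first example and assuming the standard `_dlt_load_id` column that `dlt` adds to every loaded row), you could tie rows back to the load that produced them:
+
+```py
+import dlt
+
+pipeline = dlt.pipeline(destination="duckdb", dataset_name="mydata")
+load_info = pipeline.run([{"id": 1, "name": "John"}], table_name="users")
+
+# each load package is identified by a timestamp-based load ID
+for package in load_info.load_packages:
+    print(package.load_id)
+
+# loaded rows carry the same ID in the _dlt_load_id column, which links them
+# to the _dlt_loads table for lineage and incremental transformations
+with pipeline.sql_client() as client:
+    with client.execute_query(
+        "SELECT _dlt_load_id, COUNT(*) FROM users GROUP BY 1"
+    ) as cursor:
+        print(cursor.fetchall())
+```
+
+Because every row records which load produced it, downstream transformations can process only the load packages they have not seen yet.
+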
Read more about [lineage](general-usage/destination-tables.md#data-lineage). -### Schema Enforcement and Curation +### Schema enforcement and curation `dlt` empowers users to enforce and curate schemas, ensuring data consistency and quality. Schemas define the structure of normalized data and guide the processing and loading of data. By adhering to @@ -414,16 +369,15 @@ control throughout the data processing lifecycle. ### Scaling and finetuning -`dlt` offers several mechanism and configuration options to scale up and finetune pipelines: +`dlt` offers several mechanisms and configuration options to scale up and finetune pipelines: -- Running extraction, normalization and load in parallel. +- Running extraction, normalization, and load in parallel. - Writing sources and resources that are run in parallel via thread pools and async execution. -- Finetune the memory buffers, intermediary file sizes and compression options. +- Finetuning the memory buffers, intermediary file sizes, and compression options. Read more about [performance.](reference/performance.md) ### Other advanced topics -`dlt` is a constantly growing library that supports many features and use cases needed by the -community. [Join our Slack](https://dlthub.com/community) -to find recent releases or discuss what you can build with `dlt`. +`dlt` is a constantly growing library that supports many features and use cases needed by the community. [Join our Slack](https://dlthub.com/community) to find recent releases or discuss what you can build with `dlt`. + diff --git a/docs/website/docs/general-usage/glossary.md b/docs/website/docs/general-usage/glossary.md index 5ae256b268..b2fa2c6fa1 100644 --- a/docs/website/docs/general-usage/glossary.md +++ b/docs/website/docs/general-usage/glossary.md @@ -8,13 +8,13 @@ keywords: [glossary, resource, source, pipeline] ## [Source](source) -Location that holds data with certain structure. Organized into one or more resources. +A location that holds data with a certain structure, organized into one or more resources. - If endpoints in an API are the resources, then the API is the source. -- If tabs in a spreadsheet are the resources, then the source is the spreadsheet. -- If tables in a database are the resources, then the source is the database. +- If tabs in a spreadsheet are the resources, then the spreadsheet is the source. +- If tables in a database are the resources, then the database is the source. -Within this documentation, **source** refers also to the software component (i.e. Python function) +Within this documentation, **source** also refers to the software component (i.e., a Python function) that **extracts** data from the source location using one or more resource components. ## [Resource](resource) @@ -26,38 +26,39 @@ origin. - If the source is a spreadsheet, then a resource is a tab in that spreadsheet. - If the source is a database, then a resource is a table in that database. -Within this documentation, **resource** refers also to the software component (i.e. Python function) -that **extracts** the data from source location. +Within this documentation, **resource** also refers to the software component (i.e., a Python function) +that **extracts** the data from the source location. ## [Destination](../dlt-ecosystem/destinations) -The data store where data from the source is loaded (e.g. Google BigQuery). +The data store where data from the source is loaded (e.g., Google BigQuery). 
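+
+To see how source, resource, and destination fit together, here is a minimal, hypothetical sketch (the function and dataset names are made up for illustration):
+
+```py
+import dlt
+
+@dlt.resource  # resource: one endpoint / tab / table worth of data
+def users():
+    yield [{"id": 1, "name": "John"}]
+
+@dlt.source  # source: a group of related resources
+def demo_source():
+    return users()
+
+# destination: where the extracted data is loaded, here duckdb
+pipeline = dlt.pipeline(destination="duckdb", dataset_name="demo")
+pipeline.run(demo_source())
+```
+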
## [Pipeline](pipeline) Moves the data from the source to the destination, according to instructions provided in the schema -(i.e. extracting, normalizing, and loading the data). +(i.e., extracting, normalizing, and loading the data). ## [Verified source](../walkthroughs/add-a-verified-source) A Python module distributed with `dlt init` that allows creating pipelines that extract data from a -particular **Source**. Such module is intended to be published in order for others to use it to +particular **Source**. Such a module is intended to be published in order for others to use it to build pipelines. -A source must be published to become "verified": which means that it has tests, test data, -demonstration scripts, documentation and the dataset produces was reviewed by a data engineer. +A source must be published to become "verified," which means that it has tests, test data, +demonstration scripts, documentation, and the dataset produced was reviewed by a data engineer. ## [Schema](schema) -Describes the structure of normalized data (e.g. unpacked tables, column types, etc.) and provides -instructions on how the data should be processed and loaded (i.e. it tells `dlt` about the content +Describes the structure of normalized data (e.g., unpacked tables, column types, etc.) and provides +instructions on how the data should be processed and loaded (i.e., it tells `dlt` about the content of the data and how to load it into the destination). ## [Config](credentials/setup#secrets.toml-and-config.toml) -A set of values that are passed to the pipeline at run time (e.g. to change its behavior locally vs. +A set of values that are passed to the pipeline at run time (e.g., to change its behavior locally vs. in production). ## [Credentials](credentials/complex_types) A subset of configuration whose elements are kept secret and never shared in plain text. + diff --git a/docs/website/docs/general-usage/naming-convention.md b/docs/website/docs/general-usage/naming-convention.md index 16898cf8d1..f04a8b91fd 100644 --- a/docs/website/docs/general-usage/naming-convention.md +++ b/docs/website/docs/general-usage/naming-convention.md @@ -4,8 +4,8 @@ description: Control how dlt creates table, column and other identifiers keywords: [identifiers, snake case, case sensitive, case insensitive, naming] --- -# Naming Convention -`dlt` creates table and column identifiers from the data. The data source, i.e. a stream of JSON documents, may have identifiers (i.e. key names in a dictionary) with any Unicode characters, of any length, and naming style. On the other hand, destinations require that you follow strict rules when you name tables, columns, or collections. +# Naming convention +`dlt` creates table and column identifiers from the data. The data source, i.e., a stream of JSON documents, may have identifiers (i.e., key names in a dictionary) with any Unicode characters, of any length, and naming style. On the other hand, destinations require that you follow strict rules when you name tables, columns, or collections. A good example is [Redshift](../dlt-ecosystem/destinations/redshift.md#naming-convention) that accepts case-insensitive alphanumeric identifiers with a maximum of 127 characters. `dlt` groups tables from a single [source](source.md) in a [schema](schema.md). 
Each schema defines a **naming convention** that tells `dlt` how to translate identifiers to the @@ -20,19 +20,19 @@ The standard behavior of `dlt` is to **use the same naming convention for all de ### Use default naming convention (snake_case) **snake_case** is a case-insensitive naming convention, converting source identifiers into lower-case snake case identifiers with a reduced alphabet. -- Spaces around identifiers are trimmed -- Keeps ASCII alphanumerics and underscores, replaces all other characters with underscores (with the exceptions below) -- Replaces `+` and `*` with `x`, `-` with `_`, `@` with `a`, and `|` with `l` +- Spaces around identifiers are trimmed. +- Keeps ASCII alphanumerics and underscores, replaces all other characters with underscores (with the exceptions below). +- Replaces `+` and `*` with `x`, `-` with `_`, `@` with `a`, and `|` with `l`. - Prepends `_` if the name starts with a number. - Multiples of `_` are converted into a single `_`. -- Replaces all trailing `_` with `x` +- Replaces all trailing `_` with `x`. -Uses __ as a nesting separator for tables and flattened column names. +Uses `__` as a nesting separator for tables and flattened column names. :::tip If you do not like **snake_case**, your next safe option is **sql_ci**, which generates SQL-safe, lowercase, case-insensitive identifiers without any other transformations. To permanently change the default naming convention on a given machine: -1. set an environment variable `SCHEMA__NAMING` to `sql_ci_v1` OR -2. add the following line to your global `config.toml` (the one in your home dir, i.e. `~/.dlt/config.toml`) +1. Set an environment variable `SCHEMA__NAMING` to `sql_ci_v1` OR +2. Add the following line to your global `config.toml` (the one in your home dir, i.e., `~/.dlt/config.toml`) ```toml [schema] naming="sql_ci_v1" @@ -43,48 +43,52 @@ naming="sql_ci_v1" ### Pick the right identifier form when defining resources `dlt` keeps source (not normalized) identifiers during data [extraction](../reference/explainers/how-dlt-works.md#extract) and translates them during [normalization](../reference/explainers/how-dlt-works.md#normalize). For you, it means: 1. If you write a [transformer](resource.md#process-resources-with-dlttransformer) or a [mapping/filtering function](resource.md#filter-transform-and-pivot-data), you will see the original data, without any normalization. Use the source identifiers to access the dicts! -2. If you define a `primary_key` or `cursor` that participate in [cursor field incremental loading](incremental-loading.md#incremental-loading-with-a-cursor-field), use the source identifiers (`dlt` uses them to inspect source data, `Incremental` class is just a filtering function). -3. When defining any other hints, i.e. `columns` or `merge_key`, you can pick source or destination identifiers. `dlt` normalizes all hints together with your data. -4. The `Schema` object (i.e. obtained from the pipeline or from `dlt` source via `discover_schema`) **always contains destination (normalized) identifiers**. +2. If you define a `primary_key` or `cursor` that participates in [cursor field incremental loading](incremental-loading.md#incremental-loading-with-a-cursor-field), use the source identifiers (`dlt` uses them to inspect source data, `Incremental` class is just a filtering function). +3. When defining any other hints, i.e., `columns` or `merge_key`, you can pick source or destination identifiers. `dlt` normalizes all hints together with your data. +4. 
The `Schema` object (i.e., obtained from the pipeline or from `dlt` source via `discover_schema`) **always contains destination (normalized) identifiers**. ### Understand the identifier normalization + Identifiers are translated from source to destination form in the **normalize** step. Here's how `dlt` picks the naming convention: * The default naming convention is **snake_case**. -* Each destination may define a preferred naming convention in [destination capabilities](destination.md#pass-additional-parameters-and-change-destination-capabilities). Some destinations (i.e. Weaviate) need a specialized naming convention and will override the default. +* Each destination may define a preferred naming convention in [destination capabilities](destination.md#pass-additional-parameters-and-change-destination-capabilities). Some destinations (i.e., Weaviate) need a specialized naming convention and will override the default. * You can [configure a naming convention explicitly](#set-and-adjust-naming-convention-explicitly). Such configuration overrides the destination settings. * This naming convention is used when new schemas are created. It happens when the pipeline is run for the first time. * Schemas preserve the naming convention when saved. Your running pipelines will maintain existing naming conventions if not requested otherwise. * `dlt` applies the final naming convention in the `normalize` step. Jobs (files) in the load package now have destination identifiers. The pipeline schema is duplicated, locked, and saved in the load package and will be used by the destination. :::caution -If you change the naming convention and `dlt` detects a change in the destination identifiers for tables/collections/files that already exist and store data, the normalize process will fail. This prevents an unwanted schema migration. New columns and tables will be created for identifiers that changed. +If you change the naming convention and `dlt` detects a change in the destination identifiers for tables/collections/files that already exist and store data, the normalize process will fail. This prevents unwanted schema migration. New columns and tables will be created for identifiers that have changed. ::: ### Case-sensitive and insensitive destinations -Naming conventions declare if the destination identifiers they produce are case-sensitive or insensitive. This helps `dlt` to [generate case-sensitive / insensitive identifiers for the destinations that support both](destination.md#control-how-dlt-creates-table-column-and-other-identifiers). For example: if you pick a case-insensitive naming like **snake_case** or **sql_ci_v1**, with Snowflake, `dlt` will generate all uppercase identifiers that Snowflake sees as case-insensitive. If you pick a case-sensitive naming like **sql_cs_v1**, `dlt` will generate quoted case-sensitive identifiers that preserve identifier capitalization. -Note that many destinations are exclusively case-insensitive, of which some preserve the casing of identifiers (i.e. **duckdb**) and some will case-fold identifiers when creating tables (i.e. **Redshift**, **Athena** do lowercase on the names). `dlt` is able to detect resulting identifier [collisions](#avoid-identifier-collisions) and stop the load process before data is mangled. +Naming conventions declare whether the destination identifiers they produce are case-sensitive or insensitive. 
This helps `dlt` to [generate case-sensitive / insensitive identifiers for the destinations that support both](destination.md#control-how-dlt-creates-table-column-and-other-identifiers). For example, if you pick a case-insensitive naming like **snake_case** or **sql_ci_v1**, with Snowflake, `dlt` will generate all uppercase identifiers that Snowflake sees as case-insensitive. If you pick a case-sensitive naming like **sql_cs_v1**, `dlt` will generate quoted case-sensitive identifiers that preserve identifier capitalization. + +Note that many destinations are exclusively case-insensitive, of which some preserve the casing of identifiers (i.e., **duckdb**) and some will case-fold identifiers when creating tables (i.e., **Redshift**, **Athena** do lowercase on the names). `dlt` is able to detect resulting identifier [collisions](#avoid-identifier-collisions) and stop the load process before data is mangled. ### Identifier shortening + Identifier shortening happens during normalization. `dlt` takes the maximum length of the identifier from the destination capabilities and will trim the identifiers that are too long. The default shortening behavior generates short deterministic hashes of the source identifiers and places them in the middle of the destination identifier. This (with a high probability) avoids shortened identifier collisions. ### šŸš§ [WIP] Name convention changes are lossy + `dlt` does not store the source identifiers in the schema so when the naming convention changes (or we increase the maximum identifier length), it is not able to generate a fully correct set of new identifiers. Instead, it will re-normalize already normalized identifiers. We are currently working to store the full identifier lineage - source identifiers will be stored and mapped to the destination in the schema. ## Pick your own naming convention ### Configure naming convention -You can use `config.toml`, environment variables, or any other configuration provider to set the naming convention name. Configured naming convention **overrides all other settings** -- changes the naming convention stored in the already created schema -- overrides the destination capabilities preference. +You can use `config.toml`, environment variables, or any other configuration provider to set the naming convention name. The configured naming convention **overrides all other settings**: +- Changes the naming convention stored in the already created schema. +- Overrides the destination capabilities preference. ```toml [schema] naming="sql_ci_v1" ``` The configuration above will request **sql_ci_v1** for all pipelines (schemas). An environment variable `SCHEMA__NAMING` set to `sql_ci_v1` has the same effect. -You have an option to set the naming convention per source: +You have the option to set the naming convention per source: ```toml [sources.zendesk] config="prop" @@ -93,7 +97,7 @@ naming="sql_cs_v1" [sources.zendesk.credentials] password="pass" ``` -The snippet above demonstrates how to apply certain naming for an example `zendesk` source. +The snippet above demonstrates how to apply a certain naming for an example `zendesk` source. You can use naming conventions that you created yourself or got from other users. In that case, you should pass a full Python import path to the [module that contains the naming convention](#write-your-own-naming-convention): ```toml @@ -106,11 +110,11 @@ naming="tests.common.cases.normalizers.sql_upper" ### Available naming conventions You can pick from a few built-in naming conventions. 
-* `snake_case` - the default -* `duck_case` - case-sensitive, allows all Unicode characters like emoji šŸ’„ -* `direct` - case-sensitive, allows all Unicode characters, does not contract underscores -* `sql_cs_v1` - case-sensitive, generates SQL-safe identifiers -* `sql_ci_v1` - case-insensitive, generates SQL-safe lowercase identifiers +* `snake_case` - the default. +* `duck_case` - case-sensitive, allows all Unicode characters like emoji šŸ’„. +* `direct` - case-sensitive, allows all Unicode characters, does not contract underscores. +* `sql_cs_v1` - case-sensitive, generates SQL-safe identifiers. +* `sql_ci_v1` - case-insensitive, generates SQL-safe lowercase identifiers. ### Ignore naming convention for `dataset_name` @@ -130,7 +134,7 @@ pipeline = dlt.pipeline(dataset_name="MyCamelCaseName") The default value for the `enable_dataset_name_normalization` configuration option is `true`. :::note -The same setting would be applied to [staging dataset](../dlt-ecosystem/staging#staging-dataset). Thus, if you set `enable_dataset_name_normalization` to `false`, the staging dataset name would also **not** be normalized. +The same setting would be applied to the [staging dataset](../dlt-ecosystem/staging#staging-dataset). Thus, if you set `enable_dataset_name_normalization` to `false`, the staging dataset name would also **not** be normalized. ::: :::caution @@ -139,27 +143,28 @@ Depending on the destination, certain names may not be allowed. To ensure your d ## Avoid identifier collisions `dlt` detects various types of identifier collisions and ignores the others. -1. `dlt` detects collisions if a case-sensitive naming convention is used on a case-insensitive destination -2. `dlt` detects collisions if a change of naming convention changes the identifiers of tables already created in the destination -3. `dlt` detects collisions when the naming convention is applied to column names of arrow tables +1. `dlt` detects collisions if a case-sensitive naming convention is used on a case-insensitive destination. +2. `dlt` detects collisions if a change of naming convention changes the identifiers of tables already created in the destination. +3. `dlt` detects collisions when the naming convention is applied to column names of arrow tables. `dlt` will not detect a collision when normalizing source data. If you have a dictionary, keys will be merged if they collide after being normalized. You can create a custom naming convention that does not generate collisions on data, see examples below. - ## Write your own naming convention -Custom naming conventions are classes that derive from `NamingConvention` that you can import from `dlt.common.normalizers.naming`. We recommend the following module layout: -1. Each naming convention resides in a separate Python module (file) -2. The class is always named `NamingConvention` + +Custom naming conventions are classes that derive from `NamingConvention`, which you can import from `dlt.common.normalizers.naming`. We recommend the following module layout: +1. Each naming convention resides in a separate Python module (file). +2. The class is always named `NamingConvention`. In that case, you can use a fully qualified module name in [schema configuration](#configure-naming-convention) or pass the module [explicitly](#set-and-adjust-naming-convention-explicitly). We include [two examples](../examples/custom_naming) of naming conventions that you may find useful: 1. 
A variant of `sql_ci` that generates identifier collisions with a low (user-defined) probability by appending a deterministic tag to each name. -2. A variant of `sql_cs` that allows for LATIN (i.e. umlaut) characters +2. A variant of `sql_cs` that allows for LATIN (i.e., umlaut) characters. :::note -Note that a fully qualified name of your custom naming convention will be stored in the `Schema` and `dlt` will attempt to import it when the schema is loaded from storage. +Note that the fully qualified name of your custom naming convention will be stored in the `Schema`, and `dlt` will attempt to import it when the schema is loaded from storage. You should distribute your custom naming conventions with your pipeline code or via a pip package from which it can be imported. -::: \ No newline at end of file +::: + diff --git a/docs/website/docs/general-usage/pipeline.md b/docs/website/docs/general-usage/pipeline.md index 40f9419bc2..dd05092625 100644 --- a/docs/website/docs/general-usage/pipeline.md +++ b/docs/website/docs/general-usage/pipeline.md @@ -6,14 +6,14 @@ keywords: [pipeline, source, full refresh, dev mode] # Pipeline -A [pipeline](glossary.md#pipeline) is a connection that moves the data from your Python code to a +A [pipeline](glossary.md#pipeline) is a connection that moves data from your Python code to a [destination](glossary.md#destination). The pipeline accepts `dlt` [sources](source.md) or -[resources](resource.md) as well as generators, async generators, lists and any iterables. -Once the pipeline runs, all resources get evaluated and the data is loaded at destination. +[resources](resource.md), as well as generators, async generators, lists, and any iterables. +Once the pipeline runs, all resources are evaluated and the data is loaded at the destination. Example: -This pipeline will load a list of objects into `duckdb` table with a name "three": +This pipeline will load a list of objects into a `duckdb` table named "three": ```py import dlt @@ -25,30 +25,30 @@ info = pipeline.run([{'id':1}, {'id':2}, {'id':3}], table_name="three") print(info) ``` -You instantiate a pipeline by calling `dlt.pipeline` function with following arguments: +You instantiate a pipeline by calling the `dlt.pipeline` function with the following arguments: -- `pipeline_name` a name of the pipeline that will be used to identify it in trace and monitoring +- `pipeline_name`: a name of the pipeline that will be used to identify it in trace and monitoring events and to restore its state and data schemas on subsequent runs. If not provided, `dlt` will - create pipeline name from the file name of currently executing Python module. -- `destination` a name of the [destination](../dlt-ecosystem/destinations) to which dlt - will load the data. May also be provided to `run` method of the `pipeline`. -- `dataset_name` a name of the dataset to which the data will be loaded. A dataset is a logical - group of tables i.e. `schema` in relational databases or folder grouping many files. May also be - provided later to the `run` or `load` methods of the pipeline. If not provided at all then - defaults to the `pipeline_name`. + create a pipeline name from the file name of the currently executing Python module. +- `destination`: a name of the [destination](../dlt-ecosystem/destinations) to which dlt + will load the data. It may also be provided to the `run` method of the `pipeline`. +- `dataset_name`: a name of the dataset to which the data will be loaded. 
A dataset is a logical + group of tables, i.e., `schema` in relational databases or a folder grouping many files. It may also be + provided later to the `run` or `load` methods of the pipeline. If not provided at all, then + it defaults to the `pipeline_name`. -To load the data you call the `run` method and pass your data in `data` argument. +To load the data, you call the `run` method and pass your data in the `data` argument. Arguments: - `data` (the first argument) may be a dlt source, resource, generator function, or any Iterator / - Iterable (i.e. a list or the result of `map` function). + Iterable (i.e., a list or the result of the `map` function). - `write_disposition` controls how to write data to a table. Defaults to "append". - `append` will always add new data at the end of the table. - `replace` will replace existing data with new data. - `skip` will prevent data from loading. - `merge` will deduplicate and merge data based on `primary_key` and `merge_key` hints. -- `table_name` - specified in case when table name cannot be inferred i.e. from the resources or name +- `table_name`: specified in cases when the table name cannot be inferred, i.e., from the resources or name of the generator function. Example: This pipeline will load the data the generator `generate_rows(10)` produces: @@ -70,24 +70,24 @@ print(info) ## Pipeline working directory Each pipeline that you create with `dlt` stores extracted files, load packages, inferred schemas, -execution traces and the [pipeline state](state.md) in a folder in the local filesystem. The default -location for such folders is in user home directory: `~/.dlt/pipelines/`. +execution traces, and the [pipeline state](state.md) in a folder in the local filesystem. The default +location for such folders is in the user's home directory: `~/.dlt/pipelines/`. You can inspect stored artifacts using the command [dlt pipeline info](../reference/command-line-interface.md#dlt-pipeline) and [programmatically](../walkthroughs/run-a-pipeline.md#4-inspect-a-load-process). -> šŸ’” A pipeline with given name looks for its working directory in location above - so if you have two +> šŸ’” A pipeline with a given name looks for its working directory in the location above - so if you have two > pipeline scripts that create a pipeline with the same name, they will see the same working folder -> and share all the possible state. You may override the default location using `pipelines_dir` +> and share all the possible state. You may override the default location using the `pipelines_dir` > argument when creating the pipeline. -> šŸ’” You can attach `Pipeline` instance to an existing working folder, without creating a new +> šŸ’” You can attach a `Pipeline` instance to an existing working folder, without creating a new > pipeline with `dlt.attach`. -### Separate working environments with `pipelines_dir`. -You can run several pipelines with the same name but with different configuration ie. to target development / staging / production environments. -Set the `pipelines_dir` argument to store all the working folders in specific place. For example: +### Separate working environments with `pipelines_dir` +You can run several pipelines with the same name but with different configurations, for example, to target development, staging, or production environments. +Set the `pipelines_dir` argument to store all the working folders in a specific place. 
For example: ```py import dlt from dlt.common.pipeline import get_dlt_pipelines_dir @@ -95,36 +95,36 @@ from dlt.common.pipeline import get_dlt_pipelines_dir dev_pipelines_dir = os.path.join(get_dlt_pipelines_dir(), "dev") pipeline = dlt.pipeline(destination="duckdb", dataset_name="sequence", pipelines_dir=dev_pipelines_dir) ``` -stores pipeline working folder in `~/.dlt/pipelines/dev/`. Mind that you need to pass this `~/.dlt/pipelines/dev/` -in to all cli commands to get info/trace for that pipeline. +This stores the pipeline working folder in `~/.dlt/pipelines/dev/`. Note that you need to pass this `~/.dlt/pipelines/dev/` +into all CLI commands to get info/trace for that pipeline. ## Do experiments with dev mode -If you [create a new pipeline script](../walkthroughs/create-a-pipeline.md) you will be -experimenting a lot. If you want that each time the pipeline resets its state and loads data to a +If you [create a new pipeline script](../walkthroughs/create-a-pipeline.md), you will be +experimenting a lot. If you want each time the pipeline resets its state and loads data to a new dataset, set the `dev_mode` argument of the `dlt.pipeline` method to True. Each time the -pipeline is created, `dlt` adds datetime-based suffix to the dataset name. +pipeline is created, `dlt` adds a datetime-based suffix to the dataset name. ## Refresh pipeline data and state You can reset parts or all of your sources by using the `refresh` argument to `dlt.pipeline` or the pipeline's `run` or `extract` method. -That means when you run the pipeline the sources/resources being processed will have their state reset and their tables either dropped or truncated +That means when you run the pipeline, the sources/resources being processed will have their state reset and their tables either dropped or truncated, depending on which refresh mode is used. -`refresh` option works with all relational/sql destinations and file buckets (`filesystem`). it does not work with vector databases (we are working on that) and +The `refresh` option works with all relational/SQL destinations and file buckets (`filesystem`). It does not work with vector databases (we are working on that) and with custom destinations. The `refresh` argument should have one of the following string values to decide the refresh mode: ### Drop tables and pipeline state for a source with `drop_sources` All sources being processed in `pipeline.run` or `pipeline.extract` are refreshed. -That means all tables listed in their schemas are dropped and state belonging to those sources and all their resources is completely wiped. -The tables are deleted both from pipeline's schema and from the destination database. +That means all tables listed in their schemas are dropped and the state belonging to those sources and all their resources is completely wiped. +The tables are deleted both from the pipeline's schema and from the destination database. -If you only have one source or run with all your sources together, then this is practically like running the pipeline again for the first time +If you only have one source or run with all your sources together, then this is practically like running the pipeline again for the first time. :::caution -This erases schema history for the selected sources and only the latest version is stored +This erases schema history for the selected sources and only the latest version is stored. 
::: ```py @@ -133,26 +133,27 @@ import dlt pipeline = dlt.pipeline("airtable_demo", destination="duckdb") pipeline.run(airtable_emojis(), refresh="drop_sources") ``` -In example above we instruct `dlt` to wipe pipeline state belonging to `airtable_emojis` source and drop all the database tables in `duckdb` to +In the example above, we instruct `dlt` to wipe the pipeline state belonging to the `airtable_emojis` source and drop all the database tables in `duckdb` to which data was loaded. The `airtable_emojis` source had two resources named "šŸ“† Schedule" and "šŸ’° Budget" loading to tables "_schedule" and "_budget". Here's what `dlt` does step by step: -1. collects a list of tables to drop by looking for all the tables in the schema that are created in the destination. -2. removes existing pipeline state associated with `airtable_emojis` source -3. resets the schema associated with `airtable_emojis` source -4. executes `extract` and `normalize` steps. those will create fresh pipeline state and a schema -5. before it executes `load` step, the collected tables are dropped from staging and regular dataset -6. schema `airtable_emojis` (associated with the source) is be removed from `_dlt_version` table -7. executes `load` step as usual so tables are re-created and fresh schema and pipeline state are stored. +1. Collects a list of tables to drop by looking for all the tables in the schema that are created in the destination. +2. Removes existing pipeline state associated with the `airtable_emojis` source. +3. Resets the schema associated with the `airtable_emojis` source. +4. Executes `extract` and `normalize` steps. These will create fresh pipeline state and a schema. +5. Before it executes the `load` step, the collected tables are dropped from staging and regular dataset. +6. Schema `airtable_emojis` (associated with the source) is removed from the `_dlt_version` table. +7. Executes the `load` step as usual so tables are re-created and fresh schema and pipeline state are stored. ### Selectively drop tables and resource state with `drop_resources` -Limits the refresh to the resources being processed in `pipeline.run` or `pipeline.extract` (.e.g by using `source.with_resources(...)`). -Tables belonging to those resources are dropped and their resource state is wiped (that includes incremental state). -The tables are deleted both from pipeline's schema and from the destination database. -Source level state keys are not deleted in this mode (i.e. `dlt.state()[<'my_key>'] = ''`) +Limits the refresh to the resources being processed in `pipeline.run` or `pipeline.extract` (e.g., by using `source.with_resources(...)`). +Tables belonging to those resources are dropped, and their resource state is wiped (that includes incremental state). +The tables are deleted both from the pipeline's schema and from the destination database. + +Source level state keys are not deleted in this mode (i.e., `dlt.state()[<'my_key>'] = ''`) :::caution -This erases schema history for all affected sources and only the latest schema version is stored. +This erases schema history for all affected sources, and only the latest schema version is stored. ::: ```py @@ -161,11 +162,12 @@ import dlt pipeline = dlt.pipeline("airtable_demo", destination="duckdb") pipeline.run(airtable_emojis().with_resources("šŸ“† Schedule"), refresh="drop_resources") ``` -Above we request that the state associated with "šŸ“† Schedule" resource is reset and the table generated by it ("_schedule") is dropped. 
Other resources, -tables and state are not affected. Please check `drop_sources` for step by step description of what `dlt` does internally. +Above, we request that the state associated with the "šŸ“† Schedule" resource is reset, and the table generated by it ("_schedule") is dropped. Other resources, +tables, and state are not affected. Please check `drop_sources` for a step-by-step description of what `dlt` does internally. ### Selectively truncate tables and reset resource state with `drop_data` -Same as `drop_resources` but instead of dropping tables from schema only the data is deleted from them (i.e. by `TRUNCATE ` in sql destinations). Resource state for selected resources is also wiped. In case of [incremental resources](incremental-loading.md#incremental-loading-with-a-cursor-field) this will + +Same as `drop_resources`, but instead of dropping tables from the schema, only the data is deleted from them (i.e., by `TRUNCATE ` in SQL destinations). Resource state for selected resources is also wiped. In the case of [incremental resources](incremental-loading.md#incremental-loading-with-a-cursor-field), this will reset the cursor state and fully reload the data from the `initial_value`. The schema remains unmodified in this case. @@ -175,30 +177,30 @@ import dlt pipeline = dlt.pipeline("airtable_demo", destination="duckdb") pipeline.run(airtable_emojis().with_resources("šŸ“† Schedule"), refresh="drop_data") ``` -Above the incremental state of the "šŸ“† Schedule" is reset before `extract` step so data is fully reacquired. Just before `load` step starts, - the "_schedule" is truncated and new (full) table data will be inserted/copied. +Above, the incremental state of the "šŸ“† Schedule" is reset before the `extract` step so data is fully reacquired. Just before the `load` step starts, +the "_schedule" is truncated, and new (full) table data will be inserted/copied. ## Display the loading progress -You can add a progress monitor to the pipeline. Typically, its role is to visually assure user that -pipeline run is progressing. `dlt` supports 4 progress monitors out of the box: +You can add a progress monitor to the pipeline. Typically, its role is to visually assure the user that +the pipeline run is progressing. `dlt` supports 4 progress monitors out of the box: - [enlighten](https://github.com/Rockhopper-Technologies/enlighten) - a status bar with progress bars that also allows for logging. -- [tqdm](https://github.com/tqdm/tqdm) - most popular Python progress bar lib, proven to work in +- [tqdm](https://github.com/tqdm/tqdm) - the most popular Python progress bar lib, proven to work in Notebooks. - [alive_progress](https://github.com/rsalmei/alive-progress) - with the most fancy animations. -- **log** - dumps the progress information to log, console or text stream. **the most useful on - production** optionally adds memory and cpu usage stats. +- **log** - dumps the progress information to log, console, or text stream. **the most useful on + production** optionally adds memory and CPU usage stats. > šŸ’” You must install the required progress bar library yourself. -You pass the progress monitor in `progress` argument of the pipeline. You can use a name from the +You pass the progress monitor in the `progress` argument of the pipeline. 
You can use a name from the list above as in the following example: ```py # create a pipeline loading chess data that dumps -# progress to stdout each 10 seconds (the default) +# progress to stdout every 10 seconds (the default) pipeline = dlt.pipeline( pipeline_name="chess_pipeline", destination='duckdb', @@ -232,3 +234,4 @@ pipeline = dlt.pipeline( Note that the value of the `progress` argument is [configurable](../walkthroughs/run-a-pipeline.md#2-see-the-progress-during-loading). + diff --git a/docs/website/docs/general-usage/schema-evolution.md b/docs/website/docs/general-usage/schema-evolution.md index b2b81cfdca..24ed3bc557 100644 --- a/docs/website/docs/general-usage/schema-evolution.md +++ b/docs/website/docs/general-usage/schema-evolution.md @@ -6,9 +6,9 @@ keywords: [schema evolution, schema, dlt schema] ## When to use schema evolution? -Schema evolution is a best practice when ingesting most data. Itā€™s simply a way to get data across a format barrier. +Schema evolution is a best practice when ingesting most data. It's simply a way to get data across a format barrier. -It separates the technical challenge of ā€œloadingā€ data, from the business challenge of ā€œcuratingā€ data. This enables us to have pipelines that are maintainable by different individuals at different stages. +It separates the technical challenge of "loading" data from the business challenge of "curating" data. This enables us to have pipelines that are maintainable by different individuals at different stages. However, for cases where schema evolution might be triggered by malicious events, such as in web tracking, data contracts are advised. Read more about how to implement data contracts [here](https://dlthub.com/docs/general-usage/schema-contracts). @@ -16,13 +16,13 @@ However, for cases where schema evolution might be triggered by malicious events `dlt` automatically infers the initial schema for your first pipeline run. However, in most cases, the schema tends to change over time, which makes it critical for downstream consumers to adapt to schema changes. -As the structure of data changes, such as the addition of new columns, changing data types, etc., `dlt` handles these schema changes, enabling you to adapt to changes without losing velocity. +As the structure of data changes, such as the addition of new columns or changing data types, `dlt` handles these schema changes, enabling you to adapt to changes without losing velocity. ## Inferring a schema from nested data -The first run of a pipeline will scan the data that goes through it and generate a schema. To convert nested data into relational format, `dlt` flattens dictionaries and unpacks nested lists into sub-tables. +The first run of a pipeline will scan the data that goes through it and generate a schema. To convert nested data into a relational format, `dlt` flattens dictionaries and unpacks nested lists into sub-tables. -Weā€™ll review some examples here and figure out how `dlt` creates initial schema and how normalisation works. Consider a pipeline that loads the following schema: +We'll review some examples here and figure out how `dlt` creates the initial schema and how normalization works. Consider a pipeline that loads the following schema: ```py data = [{ @@ -47,18 +47,18 @@ The schema of data above is loaded to the destination as follows: ### What did the schema inference engine do? -As you can see above the `dlt's` inference engine generates the structure of the data based on the source and provided hints. 
It normalizes the data, creates tables and columns, and infers data types. +As you can see above, the `dlt's` inference engine generates the structure of the data based on the source and provided hints. It normalizes the data, creates tables and columns, and infers data types. -For more information, you can refer to theĀ **[Schema](https://dlthub.com/docs/general-usage/schema)**Ā andĀ **[Adjust a Schema](https://dlthub.com/docs/walkthroughs/adjust-a-schema)**Ā sections in the documentation. +For more information, you can refer to the **[Schema](https://dlthub.com/docs/general-usage/schema)** and **[Adjust a Schema](https://dlthub.com/docs/walkthroughs/adjust-a-schema)** sections in the documentation. ## Evolving the schema -For a typical data source schema tends to change with time, and `dlt` handles this changing schema seamlessly. +For a typical data source, the schema tends to change over time, and `dlt` handles this changing schema seamlessly. Letā€™s add the following 4 cases: -- A column is added : a field named ā€œCEOā€ was added. -- A column type is changed: Datatype of column named ā€œinventory_nrā€ was changed from integer to string. +- A column is added: a field named ā€œCEOā€ was added. +- A column type is changed: The datatype of the column named ā€œinventory_nrā€ was changed from integer to string. - A column is removed: a field named ā€œroomā€ was commented out/removed. - A column is renamed: a field ā€œbuildingā€ was renamed to ā€œmain_blockā€. @@ -106,11 +106,11 @@ By separating the technical process of loading data from curation, you free the **Tracking column lineage** -The column lineage can be tracked by loading the 'load_info' to the destination. The 'load_info' contains information about columns ā€˜data typesā€™, ā€˜add timesā€™, and ā€˜load idā€™. To read more please see [the data lineage article](https://dlthub.com/docs/blog/dlt-data-lineage) we have on the blog. +The column lineage can be tracked by loading the 'load_info' to the destination. The 'load_info' contains information about columnsā€™ data types, add times, and load id. To read more, please see [the data lineage article](https://dlthub.com/docs/blog/dlt-data-lineage) we have on the blog. **Getting notifications** -We can read the load outcome and send it to slack webhook with `dlt`. +We can read the load outcome and send it to a Slack webhook with `dlt`. ```py # Import the send_slack_message function from the dlt library from dlt.common.runtime.slack import send_slack_message @@ -141,14 +141,13 @@ This script sends Slack notifications for schema updates using the `send_slack_m `dlt` allows schema evolution control via its schema and data contracts. Refer to ourĀ **[documentation](https://dlthub.com/docs/general-usage/schema-contracts)**Ā for details. -### How to test for removed columns - applying ā€œnot nullā€ constraint +### How to test for removed columns - applying "not null" constraint -A column not existing, and a column being null, are two different things. However, when it comes to APIs and json, itā€™s usually all treated the same - the key-value pair will simply not exist. +A column not existing and a column being null are two different things. However, when it comes to APIs and JSON, it's usually all treated the same - the key-value pair will simply not exist. To remove a column, exclude it from the output of the resource function. Subsequent data inserts will treat this column as null. Verify column removal by applying a not null constraint. 
For instance, after removing the "room" column, apply a not null constraint to confirm its exclusion. ```py - data = [{ "organization": "Tech Innovations Inc.", "address": { @@ -166,7 +165,7 @@ pipeline = dlt.pipeline("organizations_pipeline", destination="duckdb") # Adding not null constraint pipeline.run(data, table_name="org", columns={"room": {"data_type": "bigint", "nullable": False}}) ``` -During pipeline execution a data validation error indicates that a removed column is being passed as null. +During pipeline execution, a data validation error indicates that a removed column is being passed as null. ## Some schema changes in the data @@ -202,14 +201,15 @@ The schema of the data above is loaded to the destination as follows: ## What did the schema evolution engine do? -The schema evolution engine in theĀ `dlt`Ā library is designed to handle changes in the structure of your data over time. For example: +The schema evolution engine in the `dlt` library is designed to handle changes in the structure of your data over time. For example: -- As above in continuation of the inferred schema, the ā€œspecificationsā€ are nested in "detailsā€, which are nested in ā€œInventoryā€, all under table name ā€œorgā€. So the table created for projects is `org__inventory__details__specifications`. +- As above in continuation of the inferred schema, the "specifications" are nested in "details", which are nested in "Inventory", all under table name "org". So the table created for projects is `org__inventory__details__specifications`. -These is a simple examples of how schema evolution works. +This is a simple example of how schema evolution works. ## Schema evolution using schema and data contracts -Demonstrating schema evolution without talking about schema and data contracts is only one side of the coin. Schema and data contracts dictate the terms of how the schema being written to destination should evolve. +Demonstrating schema evolution without talking about schema and data contracts is only one side of the coin. Schema and data contracts dictate the terms of how the schema being written to the destination should evolve. + +Schema and data contracts can be applied to entities like 'tables', 'columns', and 'data_types' using contract modes such as 'evolve', 'freeze', 'discard_rows', and 'discard_columns' to tell `dlt` how to apply contracts for a particular entity. To read more about **schema and data contracts**, read our [documentation](https://dlthub.com/docs/general-usage/schema-contracts). -Schema and data contracts can be applied to entities ā€˜tablesā€™ , ā€˜columnsā€™ and ā€˜data_typesā€™ using contract modes ā€˜evolveā€™, freezeā€™, ā€˜discard_rowsā€™ and ā€˜discard_columnsā€™Ā to tellĀ `dlt`Ā how to apply contract for a particular entity. To read more about **schema and data contracts** read our [documentation](https://dlthub.com/docs/general-usage/schema-contracts). \ No newline at end of file diff --git a/docs/website/docs/general-usage/schema.md b/docs/website/docs/general-usage/schema.md index 534d3ca3bd..0efe281db6 100644 --- a/docs/website/docs/general-usage/schema.md +++ b/docs/website/docs/general-usage/schema.md @@ -6,66 +6,66 @@ keywords: [schema, dlt schema, yaml] # Schema -The schema describes the structure of normalized data (e.g. tables, columns, data types, etc.) and +The schema describes the structure of normalized data (e.g., tables, columns, data types, etc.) and provides instructions on how the data should be processed and loaded. 
`dlt` generates schemas from -the data during the normalization process. User can affect this standard behavior by providing -**hints** that change how tables, columns and other metadata is generated and how the data is -loaded. Such hints can be passed in the code ie. to `dlt.resource` decorator or `pipeline.run` -method. Schemas can be also exported and imported as files, which can be directly modified. +the data during the normalization process. Users can affect this standard behavior by providing +**hints** that change how tables, columns, and other metadata are generated and how the data is +loaded. Such hints can be passed in the code, i.e., to the `dlt.resource` decorator or the `pipeline.run` +method. Schemas can also be exported and imported as files, which can be directly modified. > šŸ’” `dlt` associates a schema with a [source](source.md) and a table schema with a > [resource](resource.md). ## Schema content hash and version -Each schema file contains content based hash `version_hash` that is used to: +Each schema file contains a content-based hash `version_hash` that is used to: -1. Detect manual changes to schema (ie. user edits content). +1. Detect manual changes to the schema (i.e., user edits content). 1. Detect if the destination database schema is synchronized with the file schema. Each time the schema is saved, the version hash is updated. -Each schema contains a numeric version which increases automatically whenever schema is updated and -saved. Numeric version is meant to be human-readable. There are cases (parallel processing) where +Each schema contains a numeric version which increases automatically whenever the schema is updated and +saved. The numeric version is meant to be human-readable. There are cases (parallel processing) where the order is lost. -> šŸ’” Schema in the destination is migrated if its hash is not stored in `_dlt_versions` table. In -> principle many pipelines may send data to a single dataset. If table name clash then a single +> šŸ’” The schema in the destination is migrated if its hash is not stored in the `_dlt_versions` table. In +> principle, many pipelines may send data to a single dataset. If table names clash, then a single > table with the union of the columns will be created. If columns clash, and they have different -> types etc. then the load may fail if the data cannot be coerced. +> types, etc., then the load may fail if the data cannot be coerced. ## Naming convention -`dlt` creates tables, nested tables and column schemas from the data. The data being loaded, -typically JSON documents, contains identifiers (i.e. key names in a dictionary) with any Unicode -characters, any lengths and naming styles. On the other hand the destinations accept very strict -namespaces for their identifiers. Like Redshift that accepts case-insensitive alphanumeric -identifiers with maximum 127 characters. +`dlt` creates tables, nested tables, and column schemas from the data. The data being loaded, +typically JSON documents, contains identifiers (i.e., key names in a dictionary) with any Unicode +characters, any lengths, and naming styles. On the other hand, the destinations accept very strict +namespaces for their identifiers. Like Redshift, that accepts case-insensitive alphanumeric +identifiers with a maximum of 127 characters. -Each schema contains [naming convention](naming-convention.md) that tells `dlt` how to translate identifiers to the -namespace that the destination understands. 
This convention can be configured, changed in code or enforced via +Each schema contains a [naming convention](naming-convention.md) that tells `dlt` how to translate identifiers to the +namespace that the destination understands. This convention can be configured, changed in code, or enforced via destination. The default naming convention: -1. Converts identifiers to snake_case, small caps. Removes all ascii characters except ascii +1. Converts identifiers to snake_case, small caps. Removes all ASCII characters except ASCII alphanumerics and underscores. -1. Adds `_` if name starts with number. -1. Multiples of `_` are converted into single `_`. +1. Adds `_` if the name starts with a number. +1. Multiples of `_` are converted into a single `_`. 1. Nesting is expressed as double `_` in names. -1. It shorts the identifier if it exceed the length at the destination. +1. It shortens the identifier if it exceeds the length at the destination. -> šŸ’” Standard behavior of `dlt` is to **use the same naming convention for all destinations** so -> users see always the same tables and columns in their databases. +> šŸ’” The standard behavior of `dlt` is to **use the same naming convention for all destinations** so +> users always see the same tables and columns in their databases. -> šŸ’” If you provide any schema elements that contain identifiers via decorators or arguments (i.e. -> `table_name` or `columns`) all the names used will be converted via the naming convention when -> adding to the schema. For example if you execute `dlt.run(... table_name="CamelCase")` the data +> šŸ’” If you provide any schema elements that contain identifiers via decorators or arguments (i.e., +> `table_name` or `columns`), all the names used will be converted via the naming convention when +> adding to the schema. For example, if you execute `dlt.run(... table_name="CamelCase")`, the data > will be loaded into `camel_case`. -> šŸ’” Use simple, short small caps identifiers for everything! +> šŸ’” Use simple, short, small caps identifiers for everything! -To retain the original naming convention (like keeping `"createdAt"` as it is instead of converting it to `"created_at"`), you can use the direct naming convention, in "config.toml" as follows: +To retain the original naming convention (like keeping `"createdAt"` as it is instead of converting it to `"created_at"`), you can use the direct naming convention in "config.toml" as follows: ```toml [schema] naming="direct" @@ -74,82 +74,70 @@ naming="direct" Opting for `"direct"` naming bypasses most name normalization processes. This means any unusual characters present will be carried over unchanged to database tables and columns. Please be aware of this behavior to avoid potential issues. ::: -The naming convention is configurable and users can easily create their own -conventions that i.e. pass all the identifiers unchanged if the destination accepts that (i.e. +The naming convention is configurable, and users can easily create their own +conventions that, i.e., pass all the identifiers unchanged if the destination accepts that (i.e., DuckDB). ## Data normalizer -Data normalizer changes the structure of the input data, so it can be loaded into destination. The -standard `dlt` normalizer creates a relational structure from Python dictionaries and lists. -Elements of that structure: table and column definitions, are added to the schema. +The data normalizer changes the structure of the input data so it can be loaded into the destination. 
The standard `dlt` normalizer creates a relational structure from Python dictionaries and lists. Elements of that structure, such as table and column definitions, are added to the schema. -The data normalizer is configurable and users can plug their own normalizers i.e. to handle the -nested table linking differently or generate parquet-like data structs instead of nested -tables. +The data normalizer is configurable, and users can plug in their own normalizers, for example, to handle nested table linking differently or to generate parquet-like data structures instead of nested tables. ## Tables and columns -The key components of a schema are tables and columns. You can find a dictionary of tables in -`tables` key or via `tables` property of Schema object. +The key components of a schema are tables and columns. You can find a dictionary of tables in the `tables` key or via the `tables` property of the Schema object. A table schema has the following properties: 1. `name` and `description`. -1. `columns` with dictionary of table schemas. -1. `write_disposition` hint telling `dlt` how new data coming to the table is loaded. -1. `schema_contract` - describes a [contract on the table](schema-contracts.md) -1. `parent` a part of the nested reference, defined on a nested table and points to the parent table. +2. `columns` with a dictionary of table schemas. +3. `write_disposition` hint telling `dlt` how new data coming to the table is loaded. +4. `schema_contract` - describes a [contract on the table](schema-contracts.md). +5. `parent` a part of the nested reference, defined on a nested table and points to the parent table. -Table schema is extended by data normalizer. Standard data normalizer adds propagated columns to it. +The table schema is extended by the data normalizer. The standard data normalizer adds propagated columns to it. -A column schema contains following properties: +A column schema contains the following properties: 1. `name` and `description` of a column in a table. Data type information: 1. `data_type` with a column data type. -1. `precision` a precision for **text**, **timestamp**, **time**, **bigint**, **binary**, and **decimal** types -1. `scale` a scale for **decimal** type -1. `timezone` a flag indicating TZ aware or NTZ **timestamp** and **time**. Default value is **true** -1. `nullable` tells if column is nullable or not. -1. `is_variant` telling that column was generated as variant of another column. - -A column schema contains following basic hints: - -1. `primary_key` marks a column as a part of primary key. -1. `unique` tells that column is unique. on some destination that generates unique index. -1. `merge_key` marks a column as a part of merge key used by - [incremental load](./incremental-loading.md#merge-incremental_loading). - -Hints below are used to create [nested references](#root-and-nested-tables-nested-references) -1. `row_key` a special form of primary key created by `dlt` to uniquely identify rows of data -1. `parent_key` a special form of foreign key used by nested tables to refer to parent tables -1. `root_key` marks a column as a part of root key which is a type of foreign key always referring to the - root table. -1. `_dlt_list_idx` index on a nested list from which nested table is created. +2. `precision` a precision for **text**, **timestamp**, **time**, **bigint**, **binary**, and **decimal** types. +3. `scale` a scale for the **decimal** type. +4. `timezone` a flag indicating TZ aware or NTZ **timestamp** and **time**. 
The default value is **true**. +5. `nullable` tells if the column is nullable or not. +6. `is_variant` indicating that the column was generated as a variant of another column. + +A column schema contains the following basic hints: + +1. `primary_key` marks a column as part of the primary key. +2. `unique` indicates that the column is unique. On some destinations, this generates a unique index. +3. `merge_key` marks a column as part of the merge key used by [incremental load](./incremental-loading.md#merge-incremental_loading). + +Hints below are used to create [nested references](#root-and-nested-tables-nested-references): +1. `row_key` a special form of primary key created by `dlt` to uniquely identify rows of data. +2. `parent_key` a special form of foreign key used by nested tables to refer to parent tables. +3. `root_key` marks a column as part of the root key, which is a type of foreign key always referring to the root table. +4. `_dlt_list_idx` index on a nested list from which the nested table is created. `dlt` lets you define additional performance hints: -1. `partition` marks column to be used to partition data. -1. `cluster` marks column to be part to be used to cluster data -1. `sort` marks column as sortable/having order. on some destinations that non-unique generates - index. +1. `partition` marks a column to be used to partition data. +2. `cluster` marks a column to be used to cluster data. +3. `sort` marks a column as sortable/having order. On some destinations, this non-unique generates an index. :::note -Each destination can interpret the hints in its own way. For example `cluster` hint is used by -Redshift to define table distribution and by BigQuery to specify cluster column. DuckDB and -Postgres ignore it when creating tables. +Each destination can interpret the hints in its own way. For example, the `cluster` hint is used by Redshift to define table distribution and by BigQuery to specify a cluster column. DuckDB and Postgres ignore it when creating tables. ::: ### Variant columns -Variant columns are generated by a normalizer when it encounters data item with type that cannot be -coerced in existing column. Please see our [`coerce_row`](https://github.com/dlt-hub/dlt/blob/7d9baf1b8fdf2813bcf7f1afe5bb3558993305ca/dlt/common/schema/schema.py#L205) if you are interested to see how internally it works. +Variant columns are generated by a normalizer when it encounters a data item with a type that cannot be coerced into an existing column. Please see our [`coerce_row`](https://github.com/dlt-hub/dlt/blob/7d9baf1b8fdf2813bcf7f1afe5bb3558993305ca/dlt/common/schema/schema.py#L205) if you are interested in seeing how it works internally. 
-Let's consider our [getting started](../intro) example with slightly different approach, -where `id` is an integer type at the beginning +Let's consider our [getting started](../intro) example with a slightly different approach, where `id` is an integer type at the beginning: ```py data = [ @@ -157,14 +145,14 @@ data = [ ] ``` -once pipeline runs we will have the following schema: +Once the pipeline runs, we will have the following schema: | name | data_type | nullable | | ------------- | ------------- | -------- | | id | bigint | true | | human_name | text | true | -Now imagine the data has changed and `id` field also contains strings +Now imagine the data has changed and the `id` field also contains strings: ```py data = [ @@ -173,8 +161,7 @@ data = [ ] ``` -So after you run the pipeline `dlt` will automatically infer type changes and will add a new field in the schema `id__v_text` -to reflect that new data type for `id` so for any type which is not compatible with integer it will create a new field. +So after you run the pipeline, `dlt` will automatically infer type changes and will add a new field in the schema `id__v_text` to reflect that new data type for `id`. For any type that is not compatible with integer, it will create a new field. | name | data_type | nullable | | ------------- | ------------- | -------- | @@ -182,10 +169,9 @@ to reflect that new data type for `id` so for any type which is not compatible w | human_name | text | true | | id__v_text | text | true | -On the other hand if `id` field was already a string then introducing new data with `id` containing other types -will not change schema because they can be coerced to string. +On the other hand, if the `id` field was already a string, then introducing new data with `id` containing other types will not change the schema because they can be coerced to string. -Now go ahead and try to add a new record where `id` is float number, you should see a new field `id__v_double` in the schema. +Now go ahead and try to add a new record where `id` is a float number; you should see a new field `id__v_double` in the schema. ### Data types @@ -203,63 +189,57 @@ Now go ahead and try to add a new record where `id` is float number, you should | decimal | `Decimal('4.56')` | Supports precision and scale | | wei | `2**56` | | -`wei` is a datatype tries to best represent native Ethereum 256bit integers and fixed point -decimals. It works correctly on Postgres and BigQuery. All the other destinations have insufficient -precision. +`wei` is a datatype that tries to best represent native Ethereum 256-bit integers and fixed-point decimals. It works correctly on Postgres and BigQuery. All other destinations have insufficient precision. -`json` data type tells `dlt` to load that element as JSON or string and do not attempt to flatten -or create a nested table out of it. Note that structured types like arrays or maps are not supported by `dlt` at this point. +`json` data type tells `dlt` to load that element as JSON or string and not attempt to flatten or create a nested table out of it. Note that structured types like arrays or maps are not supported by `dlt` at this point. -`time` data type is saved in destination without timezone info, if timezone is included it is stripped. E.g. `'14:01:02+02:00` -> `'14:01:02'`. +`time` data type is saved in the destination without timezone info; if timezone is included, it is stripped. E.g., `'14:01:02+02:00` -> `'14:01:02'`. 
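If you prefer not to rely on type inference for columns like these, you can pin the types explicitly with column hints. Below is a minimal sketch that does so through the `columns` argument of `pipeline.run`; the table and field names are hypothetical, and how the precision and timezone flags are honored depends on the destination (see the tip below):

```py
import dlt
from decimal import Decimal

# hypothetical payment records - the field names are illustrative only
payments = [
    {
        "payment_id": 1,
        "amount": Decimal("4.56"),
        "paid_at": "2024-05-01T12:00:00",
        "metadata": {"channel": "web", "attempts": 1},
    }
]

pipeline = dlt.pipeline("payments_pipeline", destination="duckdb", dataset_name="payments_data")

# pin data types instead of relying on inference:
# - amount: decimal with explicit precision and scale
# - paid_at: timestamp stored without timezone info (NTZ)
# - metadata: kept as json, not flattened into columns or nested tables
info = pipeline.run(
    payments,
    table_name="payments",
    columns={
        "amount": {"data_type": "decimal", "precision": 12, "scale": 2},
        "paid_at": {"data_type": "timestamp", "timezone": False},
        "metadata": {"data_type": "json"},
    },
)
print(info)
```

Columns that are not listed in `columns` still go through normal type inference, so you only need to pin the types you care about.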
:::tip -The precision and scale are interpreted by particular destination and are validated when a column is created. Destinations that -do not support precision for a given data type will ignore it. +The precision and scale are interpreted by the particular destination and are validated when a column is created. Destinations that do not support precision for a given data type will ignore it. -The precision for **timestamp** is useful when creating **parquet** files. Use 3 - for milliseconds, 6 for microseconds, 9 for nanoseconds +The precision for **timestamp** is useful when creating **parquet** files. Use 3 for milliseconds, 6 for microseconds, and 9 for nanoseconds. -The precision for **bigint** is mapped to available integer types ie. TINYINT, INT, BIGINT. The default is 64 bits (8 bytes) precision (BIGINT) +The precision for **bigint** is mapped to available integer types, i.e., TINYINT, INT, BIGINT. The default is 64 bits (8 bytes) precision (BIGINT). ::: ## Table references -`dlt` tables to refer to other tables. It supports two types of such references. -1. **nested reference** created automatically when nested data (ie. `json` document containing nested list) is converted into relational form. Those -references use specialized column and table hints and are used ie. when [merging data](incremental-loading.md). -2. **table references** are optional, user-defined annotations that are not verified and enforced but may be used by downstream tools ie. -to generate automatic tests or models for the loaded data. +`dlt` tables refer to other tables. It supports two types of such references: +1. **Nested reference** created automatically when nested data (i.e., a `json` document containing a nested list) is converted into relational form. These references use specialized column and table hints and are used, for example, when [merging data](incremental-loading.md). +2. **Table references** are optional, user-defined annotations that are not verified and enforced but may be used by downstream tools, for example, to generate automatic tests or models for the loaded data. ### Nested references: root and nested tables -When `dlt` normalizes nested data into relational schema it will automatically create [**root** and **nested** tables](destination-tables.md) and link them using **nested references**. +When `dlt` normalizes nested data into a relational schema, it automatically creates [**root** and **nested** tables](destination-tables.md) and links them using **nested references**. -1. All tables get a column with `row_key` hint (named `_dlt_id` by default) to uniquely identify each row of data. -2. Nested tables get `parent` table hint with a name of the parent table. Root table does not have `parent` hint defined. -3. Nested tables get a column with `parent_key` hint (named `_dlt_parent_id` by default) that refers to `row_key` of the `parent` table. +1. All tables receive a column with the `row_key` hint (named `_dlt_id` by default) to uniquely identify each row of data. +2. Nested tables receive a `parent` table hint with the name of the parent table. The root table does not have a `parent` hint defined. +3. Nested tables receive a column with the `parent_key` hint (named `_dlt_parent_id` by default) that refers to the `row_key` of the `parent` table. -`parent` + `row_key` + `parent_key` form a **nested reference**: from nested table to `parent` table and are extensively used when loading data. 
Both `replace` and `merge` write dispositions +`parent` + `row_key` + `parent_key` form a **nested reference**: from the nested table to the `parent` table and are extensively used when loading data. Both `replace` and `merge` write dispositions. `row_key` is created as follows: -1. Random string on **root** tables, except for [`upsert`](incremental-loading.md#upsert-strategy) and -[`scd2`](incremental-loading.md#scd2-strategy) merge strategies, where it is a deterministic hash of `primary_key` (or whole row, so called `content_hash`, if PK is not defined). -2. A deterministic hash of `parent_key`, `parent` table name and position in the list (`_dlt_list_idx`) +1. A random string on **root** tables, except for [`upsert`](incremental-loading.md#upsert-strategy) and +[`scd2`](incremental-loading.md#scd2-strategy) merge strategies, where it is a deterministic hash of `primary_key` (or the whole row, so-called `content_hash`, if PK is not defined). +2. A deterministic hash of `parent_key`, `parent` table name, and position in the list (`_dlt_list_idx`) for **nested** tables. -You are able to bring your own `row_key` by adding `_dlt_id` column/field to your data (both root and nested). All data types with equal operator are supported. +You can bring your own `row_key` by adding the `_dlt_id` column/field to your data (both root and nested). All data types with an equal operator are supported. -`merge` write disposition requires additional nested reference that goes from **nested** to **root** table, skipping all parent tables in between. This reference is created by [adding a column with hint](incremental-loading.md#forcing-root-key-propagation) `root_key` (named `_dlt_root_id` by default) to nested tables. +`merge` write disposition requires an additional nested reference that goes from **nested** to **root** table, skipping all parent tables in between. This reference is created by [adding a column with a hint](incremental-loading.md#forcing-root-key-propagation) `root_key` (named `_dlt_root_id` by default) to nested tables. ### Table references You can annotate tables with table references. This feature is coming soon. ## Schema settings -The `settings` section of schema file lets you define various global rules that impact how tables -and columns are inferred from data. For example you can assign **primary_key** hint to all columns with name `id` or force **timestamp** data type on all columns containing `timestamp` with an use of regex pattern. +The `settings` section of the schema file lets you define various global rules that impact how tables +and columns are inferred from data. For example, you can assign a **primary_key** hint to all columns named `id` or force a **timestamp** data type on all columns containing `timestamp` with the use of a regex pattern. ### Data type autodetectors You can define a set of functions that will be used to infer the data type of the column from a value. The functions are run from top to bottom on the lists. Look in `detections.py` to see what is -available. **iso_timestamp** detector that looks for ISO 8601 strings and converts them to **timestamp** +available. The **iso_timestamp** detector that looks for ISO 8601 strings and converts them to **timestamp** is enabled by default. 
```yaml @@ -273,24 +253,24 @@ settings: - wei_to_double ``` -Alternatively you can add and remove detections from code: +Alternatively, you can add and remove detections from the code: ```py source = data_source() # remove iso time detector source.schema.remove_type_detection("iso_timestamp") - # convert UNIX timestamp (float, withing a year from NOW) into timestamp + # convert UNIX timestamp (float, within a year from NOW) into timestamp source.schema.add_type_detection("timestamp") ``` -Above we modify a schema that comes with a source to detect UNIX timestamps with **timestamp** detector. +Above, we modify a schema that comes with a source to detect UNIX timestamps with the **timestamp** detector. ### Column hint rules -You can define a global rules that will apply hints of a newly inferred columns. Those rules apply -to normalized column names. You can use column names directly or with regular expressions. `dlt` is matching -the column names **after they got normalized with naming convention**. +You can define global rules that will apply hints to newly inferred columns. These rules apply +to normalized column names. You can use column names directly or with regular expressions. `dlt` matches +the column names **after they have been normalized with naming convention**. -By default, schema adopts hints rules from json(relational) normalizer to support correct hinting -of columns added by normalizer: +By default, the schema adopts hint rules from the json(relational) normalizer to support correct hinting +of columns added by the normalizer: ```yaml settings: @@ -310,13 +290,13 @@ settings: root_key: - _dlt_root_id ``` -Above we require exact column name match for a hint to apply. You can also use regular expression (which we call `SimpleRegex`) as follows: +Above, we require an exact column name match for a hint to apply. You can also use a regular expression (which we call `SimpleRegex`) as follows: ```yaml settings: partition: - re:_timestamp$ ``` -Above we add `partition` hint to all columns ending with `_timestamp`. You can do same thing in the code +Above, we add a `partition` hint to all columns ending with `_timestamp`. You can do the same thing in the code: ```py source = data_source() # this will update existing hints with the hints passed @@ -326,9 +306,9 @@ Above we add `partition` hint to all columns ending with `_timestamp`. You can d ### Preferred data types You can define rules that will set the data type for newly created columns. Put the rules under -`preferred_types` key of `settings`. On the left side there's a rule on a column name, on the right +the `preferred_types` key of `settings`. On the left side, there's a rule on a column name; on the right side is the data type. You can use column names directly or with regular expressions. -`dlt` is matching the column names **after they got normalized with naming convention**. +`dlt` matches the column names **after they have been normalized with naming convention**. Example: @@ -341,8 +321,8 @@ settings: updated_at: timestamp ``` -Above we prefer `timestamp` data type for all columns containing **timestamp** substring and define a few exact matches ie. **created_at**. -Here's same thing in code +Above, we prefer the `timestamp` data type for all columns containing the **timestamp** substring and define a few exact matches, i.e., **created_at**. 
+Here's the same thing in code: ```py source = data_source() source.schema.update_preferred_types( @@ -390,7 +370,7 @@ load_info = pipeline.run(source_data) This example iterates through MongoDB collections, applying the **json** [data type](schema#data-types) to a specified column, and then processes the data with `pipeline.run`. ## View and print the schema -To view and print the default schema in a clear YAML format use the command: +To view and print the default schema in a clear YAML format, use the command: ```py pipeline.default_schema.to_pretty_yaml() @@ -419,16 +399,16 @@ schema files in your pipeline. ## Attaching schemas to sources -We recommend to not create schemas explicitly. Instead, user should provide a few global schema -settings and then let the table and column schemas to be generated from the resource hints and the +We recommend not creating schemas explicitly. Instead, users should provide a few global schema +settings and then let the table and column schemas be generated from the resource hints and the data itself. The `dlt.source` decorator accepts a schema instance that you can create yourself and modify in -whatever way you wish. The decorator also support a few typical use cases: +whatever way you wish. The decorator also supports a few typical use cases: ### Schema created implicitly by decorator -If no schema instance is passed, the decorator creates a schema with the name set to source name and +If no schema instance is passed, the decorator creates a schema with the name set to the source name and all the settings to default. ### Automatically load schema file stored with source python module @@ -437,16 +417,16 @@ If no schema instance is passed, and a file with a name `{source name}_schema.ym same folder as the module with the decorated function, it will be automatically loaded and used as the schema. -This should make easier to bundle a fully specified (or pre-configured) schema with a source. +This should make it easier to bundle a fully specified (or pre-configured) schema with a source. ### Schema is modified in the source function body -What if you can configure your schema or add some tables only inside your schema function, when i.e. -you have the source credentials and user settings available? You could for example add detailed -schemas of all the database tables when someone requests a table data to be loaded. This information -is available only at the moment source function is called. +What if you can configure your schema or add some tables only inside your schema function, when, for example, +you have the source credentials and user settings available? You could, for example, add detailed +schemas of all the database tables when someone requests table data to be loaded. This information +is available only at the moment the source function is called. -Similarly to the `source_state()` and `resource_state()` , source and resource function has current +Similarly to the `source_state()` and `resource_state()`, the source and resource function has the current schema available via `dlt.current.source_schema()`. 
Example: @@ -458,8 +438,9 @@ def textual(nesting_level: int): schema = dlt.current.source_schema() # remove date detector schema.remove_type_detection("iso_timestamp") - # convert UNIX timestamp (float, withing a year from NOW) into timestamp + # convert UNIX timestamp (float, within a year from NOW) into timestamp schema.add_type_detection("timestamp") return dlt.resource([]) ``` + diff --git a/docs/website/docs/general-usage/source.md b/docs/website/docs/general-usage/source.md index e94cc2bd30..240d6933b6 100644 --- a/docs/website/docs/general-usage/source.md +++ b/docs/website/docs/general-usage/source.md @@ -6,58 +6,59 @@ keywords: [source, api, dlt.source] # Source -A [source](glossary.md#source) is a logical grouping of resources i.e. endpoints of a +A [source](glossary.md#source) is a logical grouping of resources, i.e., endpoints of a single API. The most common approach is to define it in a separate Python module. - A source is a function decorated with `@dlt.source` that returns one or more resources. -- A source can optionally define a [schema](schema.md) with tables, columns, performance hints and +- A source can optionally define a [schema](schema.md) with tables, columns, performance hints, and more. - The source Python module typically contains optional customizations and data transformations. -- The source Python module typically contains the authentication and pagination code for particular +- The source Python module typically contains the authentication and pagination code for a particular API. ## Declare sources -You declare source by decorating an (optionally async) function that return or yields one or more resource with `dlt.source`. Our -[Create a pipeline](../walkthroughs/create-a-pipeline.md) how to guide teaches you how to do that. +You declare a source by decorating an (optionally async) function that returns or yields one or more resources with `@dlt.source`. Our +[Create a pipeline](../walkthroughs/create-a-pipeline.md) how-to guide teaches you how to do that. ### Create resources dynamically -You can create resources by using `dlt.resource` as a function. In an example below we reuse a +You can create resources by using `dlt.resource` as a function. In the example below, we reuse a single generator function to create a list of resources for several Hubspot endpoints. ```py @dlt.source def hubspot(api_key=dlt.secrets.value): - endpoints = ["companies", "deals", "product"] + endpoints = ["companies", "deals", "products"] def get_resource(endpoint): yield requests.get(url + "/" + endpoint).json() for endpoint in endpoints: - # calling get_resource creates generator, + # calling get_resource creates a generator, # the actual code of the function will be executed in pipeline.run yield dlt.resource(get_resource(endpoint), name=endpoint) ``` ### Attach and configure schemas -You can [create, attach and configure schema](schema.md#attaching-schemas-to-sources) that will be +You can [create, attach, and configure schemas](schema.md#attaching-schemas-to-sources) that will be used when loading the source. -### Avoid long lasting operations in source function -Do not extract data in source function. Leave that task to your resources if possible. Source function is executed immediately when called (contrary to resources which delay execution - like Python generators). There are several benefits (error handling, execution metrics, parallelization) you get when you extract data in `pipeline.run` or `pipeline.extract`. 
+### Avoid long-lasting operations in source function -If this is impractical (for example you want to reflect a database to create resources for tables) make sure you do not call source function too often. [See this note if you plan to deploy on Airflow](../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer.md#2-modify-dag-file) +Do not extract data in the source function. Leave that task to your resources if possible. The source function is executed immediately when called (contrary to resources which delay execution - like Python generators). There are several benefits (error handling, execution metrics, parallelization) you get when you extract data in `pipeline.run` or `pipeline.extract`. + +If this is impractical (for example, you want to reflect a database to create resources for tables), make sure you do not call the source function too often. [See this note if you plan to deploy on Airflow](../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer.md#2-modify-dag-file) ## Customize sources ### Access and select resources to load -You can access resources present in a source and select which of them you want to load. In case of -`hubspot` resource above we could select and load "companies", "deals" and "products" resources: +You can access resources present in a source and select which of them you want to load. In the case of +the `hubspot` resource above, we could select and load the "companies", "deals", and "products" resources: ```py from hubspot import hubspot @@ -67,7 +68,7 @@ source = hubspot() print(source.resources.keys()) # print names of all resources # print resources that are selected to load print(source.resources.selected.keys()) -# load only "companies" and "deals" using "with_resources" convenience method +# load only "companies" and "deals" using the "with_resources" convenience method pipeline.run(source.with_resources("companies", "deals")) ``` @@ -75,7 +76,7 @@ Resources can be individually accessed and selected: ```py # resources are accessible as attributes of a source -for c in source.companies: # enumerate all data in companies resource +for c in source.companies: # enumerate all data in the companies resource print(c) # check if deals are selected to load @@ -84,9 +85,9 @@ print(source.deals.selected) source.deals.selected = False ``` -### Filter, transform and pivot data +### Filter, transform, and pivot data -You can modify and filter data in resources, for example if we want to keep only deals after certain +You can modify and filter data in resources, for example, if we want to keep only deals after a certain date: ```py @@ -97,11 +98,7 @@ Find more on transforms [here](resource.md#filter-transform-and-pivot-data). ### Load data partially -You can limit the number of items produced by each resource by calling a `add_limit` method on a -source. This is useful for testing, debugging and generating sample datasets for experimentation. -You can easily get your test dataset in a few minutes, when otherwise you'd need to wait hours for -the full loading to complete. Below we limit the `pipedrive` source to just get **10 pages** of data -from each endpoint. Mind that the transformers will be evaluated fully: +You can limit the number of items produced by each resource by calling the `add_limit` method on a source. This is useful for testing, debugging, and generating sample datasets for experimentation. You can easily get your test dataset in a few minutes, when otherwise you would need to wait hours for the full loading to complete. 
Below, we limit the `pipedrive` source to just get **10 pages** of data from each endpoint. Mind that the transformers will be evaluated fully: ```py from pipedrive import pipedrive_source @@ -119,11 +116,7 @@ Find more on sampling data [here](resource.md#sample-from-large-data). ### Add more resources to existing source -You can add a custom resource to source after it was created. Imagine that you want to score all the -deals with a keras model that will tell you if the deal is a fraud or not. In order to do that you -declare a new -[transformer that takes the data from](resource.md#feeding-data-from-one-resource-into-another) `deals` -resource and add it to the source. +You can add a custom resource to a source after it was created. Imagine that you want to score all the deals with a keras model that will tell you if the deal is a fraud or not. In order to do that, you declare a new [transformer that takes the data from](resource.md#feeding-data-from-one-resource-into-another) `deals` resource and add it to the source. ```py import dlt @@ -143,7 +136,7 @@ source.resources.add(source.deals | deal_scores) # load the data: you'll see the new table `deal_scores` in your destination! pipeline.run(source) ``` -You can also set the resources in the source as follows +You can also set the resources in the source as follows: ```py source.deal_scores = source.deals | deal_scores ``` @@ -152,13 +145,12 @@ or source.resources["deal_scores"] = source.deals | deal_scores ``` :::note -When adding resource to the source, `dlt` clones the resource so your existing instance is not affected. +When adding a resource to the source, `dlt` clones the resource so your existing instance is not affected. ::: ### Reduce the nesting level of generated tables -You can limit how deep `dlt` goes when generating nested tables and flattening dicts into columns. By default, the library will descend -and generate nested tables for all nested lists and columns form dicts, without limit. +You can limit how deep `dlt` goes when generating nested tables and flattening dicts into columns. By default, the library will descend and generate nested tables for all nested lists and columns from dicts, without limit. ```py @dlt.source(max_table_nesting=1) @@ -166,13 +158,10 @@ def mongo_db(): ... ``` -In the example above, we want only 1 level of nested tables to be generated (so there are no nested -tables of a nested table). Typical settings: +In the example above, we want only 1 level of nested tables to be generated (so there are no nested tables of a nested table). Typical settings: -- `max_table_nesting=0` will not generate nested tables and will not flatten dicts into columns at all. All nested data will be - represented as JSON. -- `max_table_nesting=1` will generate nested tables of root tables and nothing more. All nested - data in nested tables will be represented as JSON. +- `max_table_nesting=0` will not generate nested tables and will not flatten dicts into columns at all. All nested data will be represented as JSON. +- `max_table_nesting=1` will generate nested tables of root tables and nothing more. All nested data in nested tables will be represented as JSON. You can achieve the same effect after the source instance is created: @@ -183,17 +172,12 @@ source = mongo_db() source.max_table_nesting = 0 ``` -Several data sources are prone to contain semi-structured documents with very deep nesting i.e. -MongoDB databases. 
Our practical experience is that setting the `max_nesting_level` to 2 or 3 -produces the clearest and human-readable schemas. +Several data sources are prone to contain semi-structured documents with very deep nesting, e.g., MongoDB databases. Our practical experience is that setting the `max_nesting_level` to 2 or 3 produces the clearest and human-readable schemas. :::tip -The `max_table_nesting` parameter at the source level doesn't automatically apply to individual -resources when accessed directly (e.g., using `source.resources["resource_1"])`. To make sure it -works, either use `source.with_resources("resource_1")` or set the parameter directly on the resource. +The `max_table_nesting` parameter at the source level doesn't automatically apply to individual resources when accessed directly (e.g., using `source.resources["resource_1"]`). To make sure it works, either use `source.with_resources("resource_1")` or set the parameter directly on the resource. ::: - You can directly configure the `max_table_nesting` parameter on the resource level as: ```py @@ -209,28 +193,28 @@ my_source.my_resource.max_table_nesting = 0 ### Modify schema -The schema is available via `schema` property of the source. -[You can manipulate this schema i.e. add tables, change column definitions etc. before the data is loaded.](schema.md#schema-is-modified-in-the-source-function-body) +The schema is available via the `schema` property of the source. +[You can manipulate this schema, i.e., add tables, change column definitions, etc., before the data is loaded.](schema.md#schema-is-modified-in-the-source-function-body) -Source provides two other convenience properties: +The source provides two other convenience properties: -1. `max_table_nesting` to set the maximum nesting level for nested tables and flattened columns -1. `root_key` to propagate the `_dlt_id` of from a root table to all nested tables. +1. `max_table_nesting` to set the maximum nesting level for nested tables and flattened columns. +1. `root_key` to propagate the `_dlt_id` from a root table to all nested tables. ## Load sources -You can pass individual sources or list of sources to the `dlt.pipeline` object. By default, all the -sources will be loaded to a single dataset. +You can pass individual sources or a list of sources to the `dlt.pipeline` object. By default, all the +sources will be loaded into a single dataset. You are also free to decompose a single source into several ones. For example, you may want to break -down a 50 table copy job into an airflow dag with high parallelism to load the data faster. To do -so, you could get the list of resources as: +down a 50-table copy job into an Airflow DAG with high parallelism to load the data faster. 
To do +so, you could get the list of resources as follows: ```py # get a list of resources' names resource_list = sql_source().resources.keys() -#now we are able to make a pipeline for each resource +# now we are able to make a pipeline for each resource for res in resource_list: pipeline.run(sql_source().with_resources(res)) ``` @@ -249,3 +233,4 @@ With selected resources: ```py p.run(tables.with_resources("users"), write_disposition="replace") ``` + diff --git a/docs/website/docs/general-usage/state.md b/docs/website/docs/general-usage/state.md index b34d37c8b1..4895fa4c33 100644 --- a/docs/website/docs/general-usage/state.md +++ b/docs/website/docs/general-usage/state.md @@ -6,21 +6,21 @@ keywords: [state, metadata, dlt.current.resource_state, dlt.current.source_state # State -The pipeline state is a Python dictionary which lives alongside your data; you can store values in -it and, on next pipeline run, request them back. +The pipeline state is a Python dictionary that lives alongside your data; you can store values in +it and, on the next pipeline run, retrieve them. ## Read and write pipeline state in a resource -You read and write the state in your resources. Below we use the state to create a list of chess -game archives which we then use to +You read and write the state in your resources. Below, we use the state to create a list of chess +game archives, which we then use to [prevent requesting duplicates](incremental-loading.md#advanced-state-usage-storing-a-list-of-processed-entities). ```py @dlt.resource(write_disposition="append") def players_games(chess_url, player, start_month=None, end_month=None): - # create or request a list of archives from resource scoped state + # create or request a list of archives from resource-scoped state checked_archives = dlt.current.resource_state().setdefault("archives", []) - # get list of archives for a particular player + # get a list of archives for a particular player archives = player_archives(chess_url, player) for url in archives: if url in checked_archives: @@ -35,55 +35,55 @@ def players_games(chess_url, player, start_month=None, end_month=None): yield r.json().get("games", []) ``` -Above, we request the resource-scoped state. The `checked_archives` list stored under `archives` +Above, we request the resource-scoped state. The `checked_archives` list stored under the `archives` dictionary key is private and visible only to the `players_games` resource. -The pipeline state is stored locally in -[pipeline working directory](pipeline.md#pipeline-working-directory) and as a consequence - it -cannot be shared with pipelines with different names. You must also make sure that data written into -the state is JSON Serializable. Except standard Python types, `dlt` handles `DateTime`, `Decimal`, -`bytes` and `UUID`. +The pipeline state is stored locally in the +[pipeline working directory](pipeline.md#pipeline-working-directory) and, as a consequence, it +cannot be shared with pipelines with different names. You must also ensure that data written into +the state is JSON serializable. Except for standard Python types, `dlt` handles `DateTime`, `Decimal`, +`bytes`, and `UUID`. ## Share state across resources and read state in a source -You can also access the source-scoped state with `dlt.current.source_state()` which can be shared +You can also access the source-scoped state with `dlt.current.source_state()`, which can be shared across resources of a particular source and is also available read-only in the source-decorated -functions. 
The most common use case for the source-scoped state is to store mapping of custom fields
+functions. The most common use case for the source-scoped state is to store a mapping of custom fields
 to their displayable names. You can take a look at our
 [pipedrive source](https://github.com/dlt-hub/verified-sources/blob/master/sources/pipedrive/__init__.py#L118)
 for an example of state passed across resources.

 :::tip
-[decompose your source](../reference/performance.md#source-decomposition-for-serial-and-parallel-resource-execution)
-in order to, for example run it on Airflow in parallel. If you cannot avoid that, designate one of
-the resources as state writer and all the other as state readers. This is exactly what `pipedrive`
-pipeline does. With such structure you will still be able to run some of your resources in
+[Decompose your source](../reference/performance.md#source-decomposition-for-serial-and-parallel-resource-execution)
+to, for example, run it on Airflow in parallel. If you cannot avoid sharing state across the decomposed resources, designate one of
+the resources as the state writer and all others as state readers. This is exactly what the `pipedrive`
+pipeline does. With such a structure, you will still be able to run some of your resources in
 parallel.
 :::

 :::caution
-The `dlt.state()` is a deprecated alias to `dlt.current.source_state()` and will be soon
+`dlt.state()` is a deprecated alias to `dlt.current.source_state()` and will soon be
 removed.
 :::

 ## Syncing state with destination

-What if you run your pipeline on, for example, Airflow where every task gets a clean filesystem and
-[pipeline working directory](pipeline.md#pipeline-working-directory) is always deleted? `dlt` loads
-your state into the destination together with all other data and when faced with a clean start, it
-will try to restore state from the destination.
+What if you run your pipeline on, for example, Airflow, where every task gets a clean filesystem and
+the [pipeline working directory](pipeline.md#pipeline-working-directory) is always deleted? `dlt` loads
+your state into the destination along with all other data, and when faced with a clean start, it
+will try to restore the state from the destination.

-The remote state is identified by pipeline name, the destination location (as given by the
-credentials) and destination dataset. To re-use the same state, use the same pipeline name and
+The remote state is identified by the pipeline name, the destination location (as given by the
+credentials), and the destination dataset. To reuse the same state, use the same pipeline name and
 destination.

 The state is stored in the `_dlt_pipeline_state` table at the destination and contains information
-about the pipeline, pipeline run (that the state belongs to) and state blob.
+about the pipeline, the pipeline run (to which the state belongs), and the state blob.

-`dlt` has `dlt pipeline sync` command where you can
+`dlt` has a `dlt pipeline sync` command where you can
 [request the state back from that table](../reference/command-line-interface.md#sync-pipeline-with-the-destination).

 > šŸ’” If you can keep the pipeline working directory across the runs, you can disable the state sync
-> by setting `restore_from_destination=false` i.e. in your `config.toml`.
+> by setting `restore_from_destination=false` in your `config.toml`.

 ## When to use pipeline state

@@ -94,77 +94,76 @@ about the pipeline, pipeline run (that the state belongs to) and state blob.
   if the list is not much bigger than 100k elements.
- [Store large dictionaries of last values](incremental-loading.md#advanced-state-usage-tracking-the-last-value-for-all-search-terms-in-twitter-api) if you are not able to implement it with the standard incremental construct. -- Store the custom fields dictionaries, dynamic configurations and other source-scoped state. +- Store custom fields dictionaries, dynamic configurations, and other source-scoped state. ## Do not use pipeline state if it can grow to millions of records -Do not use dlt state when it may grow to millions of elements. Do you plan to store modification -timestamps of all of your millions of user records? This is probably a bad idea! In that case you +Do not use `dlt` state when it may grow to millions of elements. Do you plan to store modification +timestamps of all your millions of user records? This is probably a bad idea! In that case, you could: -- Store the state in dynamo-db, redis etc. taking into the account that if the extract stage fails - you'll end with invalid state. +- Store the state in DynamoDB, Redis, etc., taking into account that if the extract stage fails, + you'll end up with an invalid state. - Use your loaded data as the state. `dlt` exposes the current pipeline via `dlt.current.pipeline()` from which you can obtain [sqlclient](../dlt-ecosystem/transformations/sql.md) - and load the data of interest. In that case try at least to process your user records in batches. + and load the data of interest. In that case, try at least to process your user records in batches. ### Access data in the destination instead of pipeline state -In the example below, we load recent comments made by given `user_id`. We access `user_comments` table to select -maximum comment id for a given user. + +In the example below, we load recent comments made by a given `user_id`. We access the `user_comments` table to select the maximum comment id for a given user. ```py import dlt @dlt.resource(name="user_comments") def comments(user_id: str): current_pipeline = dlt.current.pipeline() - # find last comment id for given user_id by looking in destination + # find the last comment id for the given user_id by looking in the destination max_id: int = 0 - # on first pipeline run, user_comments table does not yet exist so do not check at all - # alternatively catch DatabaseUndefinedRelation which is raised when unknown table is selected + # on the first pipeline run, the user_comments table does not yet exist so do not check at all + # alternatively, catch DatabaseUndefinedRelation which is raised when an unknown table is selected if not current_pipeline.first_run: with current_pipeline.sql_client() as client: - # we may get last user comment or None which we replace with 0 + # we may get the last user comment or None which we replace with 0 max_id = ( client.execute_sql( "SELECT MAX(_id) FROM user_comments WHERE user_id=?", user_id )[0][0] or 0 ) - # use max_id to filter our results (we simulate API query) + # use max_id to filter our results (we simulate an API query) yield from [ {"_id": i, "value": letter, "user_id": user_id} for i, letter in zip([1, 2, 3], ["A", "B", "C"]) if i > max_id ] ``` -When pipeline is first run, the destination dataset and `user_comments` table do not yet exist. We skip the destination -query by using `first_run` property of the pipeline. We also handle a situation where there are no comments for a user_id -by replacing None with 0 as `max_id`. +When the pipeline is first run, the destination dataset and `user_comments` table do not yet exist. 
We skip the destination query by using the `first_run` property of the pipeline. We also handle a situation where there are no comments for a user_id by replacing None with 0 as `max_id`. ## Inspect the pipeline state -You can inspect pipeline state with +You can inspect the pipeline state with the [`dlt pipeline` command](../reference/command-line-interface.md#dlt-pipeline): ```sh dlt pipeline -v chess_pipeline info ``` -will display source and resource state slots for all known sources. +This will display the source and resource state slots for all known sources. ## Reset the pipeline state: full or partial **To fully reset the state:** - Drop the destination dataset to fully reset the pipeline. -- [Set the `dev_mode` flag when creating pipeline](pipeline.md#do-experiments-with-dev-mode). +- [Set the `dev_mode` flag when creating the pipeline](pipeline.md#do-experiments-with-dev-mode). - Use the `dlt pipeline drop --drop-all` command to - [drop state and tables for a given schema name](../reference/command-line-interface.md#selectively-drop-tables-and-reset-state). + [drop the state and tables for a given schema name](../reference/command-line-interface.md#selectively-drop-tables-and-reset-state). **To partially reset the state:** - Use the `dlt pipeline drop ` command to - [drop state and tables for a given resource](../reference/command-line-interface.md#selectively-drop-tables-and-reset-state). + [drop the state and tables for a given resource](../reference/command-line-interface.md#selectively-drop-tables-and-reset-state). - Use the `dlt pipeline drop --state-paths` command to - [reset the state at given path without touching the tables and data](../reference/command-line-interface.md#selectively-drop-tables-and-reset-state). + [reset the state at a given path without touching the tables and data](../reference/command-line-interface.md#selectively-drop-tables-and-reset-state). + diff --git a/docs/website/docs/intro.md b/docs/website/docs/intro.md index 6660696cfb..650c47920b 100644 --- a/docs/website/docs/intro.md +++ b/docs/website/docs/intro.md @@ -18,7 +18,7 @@ dlt is designed to be easy to use, flexible, and scalable: - dlt infers [schemas](./general-usage/schema) and [data types](./general-usage/schema/#data-types), [normalizes the data](./general-usage/schema/#data-normalizer), and handles nested data structures. - dlt supports a variety of [popular destinations](./dlt-ecosystem/destinations/) and has an interface to add [custom destinations](./dlt-ecosystem/destinations/destination) to create reverse ETL pipelines. -- dlt can be deployed anywhere Python runs, be it on [Airflow](./walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer), [serverless functions](./walkthroughs/deploy-a-pipeline/deploy-with-google-cloud-functions) or any other cloud deployment of your choice. +- dlt can be deployed anywhere Python runs, be it on [Airflow](./walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer), [serverless functions](./walkthroughs/deploy-a-pipeline/deploy-with-google-cloud-functions), or any other cloud deployment of your choice. - dlt automates pipeline maintenance with [schema evolution](./general-usage/schema-evolution) and [schema and data contracts](./general-usage/schema-contracts). To get started with dlt, install the library using pip: @@ -43,7 +43,7 @@ We recommend using a clean virtual environment for your experiments! Read the [d ]}> -Use dlt's [REST API source](./tutorial/rest-api) to extract data from any REST API. 
Define API endpoints you’d like to fetch data from, pagination method and authentication and dlt will handle the rest:
+Use dlt's [REST API source](./tutorial/rest-api) to extract data from any REST API. Define the API endpoints you’d like to fetch data from, the pagination method, and authentication, and dlt will handle the rest:

 ```py
 import dlt

@@ -76,7 +76,7 @@ Follow the [REST API source tutorial](./tutorial/rest-api) to learn more about t

-Use the [SQL source](./tutorial/sql-database) to extract data from the database like PostgreSQL, MySQL, SQLite, Oracle and more.
+Use the [SQL source](./tutorial/sql-database) to extract data from databases like PostgreSQL, MySQL, SQLite, Oracle, and more.

 ```py
 from dlt.sources.sql_database import sql_database

@@ -99,7 +99,7 @@ Follow the [SQL source tutorial](./tutorial/sql-database) to learn more about th

-[Filesystem](./tutorial/filesystem) source extracts data from AWS S3, Google Cloud Storage, Google Drive, Azure, or a local file system.
+The [Filesystem](./tutorial/filesystem) source extracts data from AWS S3, Google Cloud Storage, Google Drive, Azure, or a local file system.

 ```py
 from dlt.sources.filesystem import filesystem

@@ -155,4 +155,5 @@ If you'd like to try out dlt without installing it on your machine, check out th

 1. Give the library a ⭐ and check out the code on [GitHub](https://github.com/dlt-hub/dlt).
 1. Ask questions and share how you use the library on [Slack](https://dlthub.com/community).
-1. Report problems and make feature requests [here](https://github.com/dlt-hub/dlt/issues/new/choose).
\ No newline at end of file
+1. Report problems and make feature requests [here](https://github.com/dlt-hub/dlt/issues/new/choose).
+
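The filesystem hunk above shows only the first line of its Python snippet, so here is a minimal sketch of how the `filesystem` source is typically wired up end to end. It is an illustrative aside rather than part of the diff: the bucket URL, file glob, pipeline name, and dataset name are invented for the example, and it assumes a `dlt` version where `read_csv` is exported from `dlt.sources.filesystem`.

```py
import dlt
from dlt.sources.filesystem import filesystem, read_csv

# List CSV files in a bucket; a local path such as "file://data" works the same way.
# The bucket URL and glob below are placeholders, not values from this repository.
files = filesystem(bucket_url="s3://example-bucket/data", file_glob="*.csv")

# Pipe the listed files into the CSV reader transformer and give the resource a name.
csv_rows = (files | read_csv()).with_name("example_rows")

pipeline = dlt.pipeline(
    pipeline_name="filesystem_example",
    destination="duckdb",
    dataset_name="files_data",
)

# Schema inference and normalization happen automatically on load.
load_info = pipeline.run(csv_rows)
print(load_info)
```

As with the REST API and SQL snippets, the composed resource is passed straight to `pipeline.run()`; swapping `read_csv` for another reader (e.g., JSONL or Parquet, where available) follows the same pattern.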