From 496eb3541f5018e7363b42cc9dcf6b54c2a674af Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Wed, 13 Sep 2023 10:37:52 +0300 Subject: [PATCH 01/17] Rework 'Understanding the tables' --- .../understanding-the-tables.md | 142 ++++++++++++++++-- 1 file changed, 126 insertions(+), 16 deletions(-) diff --git a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md b/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md index e14ef554f5..42ea7ea36b 100644 --- a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md +++ b/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md @@ -6,31 +6,132 @@ keywords: [understanding tables, loaded data, data structure] # Understanding the tables -## Show tables and data in the destination +In [Exploring the data](./exploring-the-data.md) you have seen the data that has been loaded into the +database. Let's take a closer look at the tables that have been created. +We start with a simple dlt pipeline: + +```py +import dlt + +data = [ + {'id': 1, 'name': 'Alice'}, + {'id': 2, 'name': 'Bob'} +] + +pipeline = dlt.pipeline( + pipeline_name='quick_start', + destination='duckdb', + dataset_name='mydata' +) +load_info = pipeline.run(data, table_name="users") ``` -dlt pipeline show + +:::note + +Here we are using the `duckdb` destination, which is an in-memory database. Other database [destinations](../destinations) +will behave similarly and have similar concepts. + +::: + +## Schema + +When you run the pipeline, dlt creates a schema in the destination database. The schema is a +collection of tables that represent the data you loaded. The schema name is the same as the +`dataset_name` you provided in the pipeline definition. In the example above, we explicitly set the +`dataset_name` to `mydata`, if you don't set it, it will be set to the pipeline name with a suffix `_dataset`. 
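The fallback rule described above can be sketched as a tiny helper — illustrative only; the helper name is made up and dlt applies this rule internally:

```py
from typing import Optional

def resolve_dataset_name(pipeline_name: str, dataset_name: Optional[str] = None) -> str:
    # An explicit dataset_name wins; otherwise the pipeline name
    # plus the "_dataset" suffix is used, as described above.
    return dataset_name or f"{pipeline_name}_dataset"

print(resolve_dataset_name("quick_start", "mydata"))  # mydata
print(resolve_dataset_name("quick_start"))            # quick_start_dataset
```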
+ +## Tables + +Each [resource](../../general-usage/resource.md) in your pipeline definition will be represented by a table in +the destination. In the example above, we have one resource, `users`, so we will have one table, `users`, +in the destination. Here also, we explicitly set the `table_name` to `users`, if you don't set it, it will be +set to the resource name. + +For example, we can rewrite the pipeline above as: + +```py +@dlt.resource +def users(): + yield [ + {'id': 1, 'name': 'Balice'}, + {'id': 2, 'name': 'Bob'} + ] + +pipeline = dlt.pipeline( + pipeline_name='quick_start', + destination='duckdb', + dataset_name='mydata' +) +load_info = pipeline.run(users) ``` -[This command](../../reference/command-line-interface.md#show-tables-and-data-in-the-destination) -generates and launches a simple Streamlit app that you can use to inspect the schemas -and data in the destination as well as your pipeline state and loading status / stats. It should be -executed from the same folder where you ran the pipeline script to access destination credentials. -It requires `streamlit` and `pandas` to be installed. +The result will be the same, but the table is implicitly named `users` based on the resource name. -## Table and column names +::: note -We [normalize table and column names,](../../general-usage/schema.md#naming-convention) so they fit -what the destination database allows. We convert all the names in your source data into -`snake_case`, alphanumeric identifiers. Please note that in many cases the names you had in your -input document will be (slightly) different from identifiers you see in the database. +Special tables are created to track the pipeline state. These tables are prefixed with `_dlt_` +and are not shown in the `show` command of the `dlt pipeline` CLI. However, you can see them when +connecting to the database directly. 
+ +::: ## Child and parent tables -When creating a schema during normalization, `dlt` recursively unpacks this nested structure into -relational tables, creating and linking children and parent tables. +Now let's look at a more complex example: + +```py +import dlt + +data = [ + { + 'id': 1, + 'name': 'Alice', + 'pets': [ + {'id': 1, 'name': 'Fluffy', 'type': 'cat'}, + {'id': 2, 'name': 'Spot', 'type': 'dog'} + ] + }, + { + 'id': 2, + 'name': 'Bob', + 'pets': [ + {'id': 3, 'name': 'Fido', 'type': 'dog'} + ] + } +] + +pipeline = dlt.pipeline( + pipeline_name='quick_start', + destination='duckdb', + dataset_name='mydata' +) +load_info = pipeline.run(data, table_name="users") +``` + +Running this pipeline will create two tables in the destination, `users` and `users__pets`. The +`users` table will contain the top level data, and the `users__pets` table will contain the child +data. Here is what the tables may look like: + +**users** + +| id | name | _dlt_id | _dlt_load_id | +| --- | --- | --- | --- | +| 1 | Alice | wX3f5vn801W16A | 1234562350.98417 | +| 2 | Bob | rX8ybgTeEmAmmA | 1234562350.98417 | + +**users__pets** + +| id | name | type | _dlt_id | _dlt_parent_id | _dlt_list_idx | +| --- | --- | --- | --- | --- | --- | +| 1 | Fluffy | cat | w1n0PEDzuP3grw | wX3f5vn801W16A | 0 | +| 2 | Spot | dog | 9uxh36VU9lqKpw | wX3f5vn801W16A | 1 | +| 3 | Fido | dog | pe3FVtCWz8VuNA | rX8ybgTeEmAmmA | 0 | -This is how table linking works: +When creating a database schema, dlt recursively unpacks nested structures into relational tables, +creating and linking children and parent tables. + +This is how it works: 1. Each row in all (top level and child) data tables created by `dlt` contains UNIQUE column named `_dlt_id`. @@ -41,10 +142,19 @@ This is how table linking works: 1. For tables that are loaded with the `merge` write disposition, we add a ROOT KEY column `_dlt_root_id`, which links child table to a row in top level table. 
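The linking rules above can be sketched in plain Python — an illustrative toy, not dlt's actual normalizer:

```py
import uuid

def unpack(rows, table_name):
    # Toy version of the unpacking described above: scalar fields stay on the
    # parent row, and each nested list becomes a child table named
    # "<parent>__<field>", linked via _dlt_parent_id and ordered by _dlt_list_idx.
    tables = {table_name: []}
    for row in rows:
        parent = {k: v for k, v in row.items() if not isinstance(v, list)}
        parent["_dlt_id"] = uuid.uuid4().hex[:14]
        tables[table_name].append(parent)
        for field, value in row.items():
            if not isinstance(value, list):
                continue
            child_rows = tables.setdefault(f"{table_name}__{field}", [])
            for idx, item in enumerate(value):
                child_rows.append({
                    **item,
                    "_dlt_id": uuid.uuid4().hex[:14],
                    "_dlt_parent_id": parent["_dlt_id"],
                    "_dlt_list_idx": idx,
                })
    return tables

tables = unpack(
    [{'id': 1, 'name': 'Alice', 'pets': [{'id': 1, 'name': 'Fluffy', 'type': 'cat'}]}],
    "users",
)
# tables["users__pets"][0]["_dlt_parent_id"] equals tables["users"][0]["_dlt_id"]
```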
-> 💡 Note: If you define your own primary key in a child table, it will be used to link to parent table + +:::note + +If you define your own primary key in a child table, it will be used to link to parent table and the `_dlt_parent_id` and `_dlt_list_idx` will not be added. `_dlt_id` is always added even in case the primary key or other unique columns are defined. +::: + +## Naming convention: tables and columns + +During a pipeline run, dlt [normalizes both table and column names](../../general-usage/schema.md#naming-convention) to ensure compatibility with the destination database's accepted format. All names from your source data will be transformed into snake_case and will only include alphanumeric characters. Please be aware that the names in the destination database may differ somewhat from those in your original input. + ## Load IDs Each pipeline run creates one or more load packages, which can be identified by their `load_id`. A load From 322a6ee9c4b24471103417bd529a2d3de48f48c8 Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Wed, 13 Sep 2023 10:47:53 +0300 Subject: [PATCH 02/17] Fix typo --- .../dlt-ecosystem/visualizations/understanding-the-tables.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md b/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md index 42ea7ea36b..ea85fe0e66 100644 --- a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md +++ b/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md @@ -68,7 +68,7 @@ load_info = pipeline.run(users) The result will be the same, but the table is implicitly named `users` based on the resource name. -::: note +:::note Special tables are created to track the pipeline state. These tables are prefixed with `_dlt_` and are not shown in the `show` command of the `dlt pipeline` CLI. 
However, you can see them when
connecting to the database directly.

:::

From 7f1c44a3ade041bf4aafde484090bf570c6ca427 Mon Sep 17 00:00:00 2001
From: Anton Burnashev
Date: Wed, 13 Sep 2023 10:56:39 +0300
Subject: [PATCH 03/17] Relink destinations section

---
 .../dlt-ecosystem/visualizations/understanding-the-tables.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md b/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md
index ea85fe0e66..5bbd721162 100644
--- a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md
+++ b/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md
@@ -29,7 +29,7 @@ load_info = pipeline.run(data, table_name="users")

:::note

-Here we are using the `duckdb` destination, which is an in-memory database. Other database [destinations](../destinations)
+Here we are using the [DuckDB destination](../destinations/duckdb.md), which is an in-process database. Other database destinations
will behave similarly and have similar concepts.

:::

From 02f927453f7ae2bcf37131c4f52ac4b956b06a5b Mon Sep 17 00:00:00 2001
From: Anton Burnashev
Date: Wed, 13 Sep 2023 21:51:14 +0300
Subject: [PATCH 04/17] Add staging and versioned datasets

---
 .../understanding-the-tables.md | 150 ++++++++++++++++--
 1 file changed, 135 insertions(+), 15 deletions(-)

diff --git a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md b/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md
index 5bbd721162..b56688a755 100644
--- a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md
+++ b/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md
@@ -37,16 +37,21 @@ will behave similarly and have similar concepts.

## Schema

When you run the pipeline, dlt creates a schema in the destination database. The schema is a
-collection of tables that represent the data you loaded.
The schema name is the same as the +collection of tables that represent the data you loaded into the database. The schema name is the same as the `dataset_name` you provided in the pipeline definition. In the example above, we explicitly set the `dataset_name` to `mydata`, if you don't set it, it will be set to the pipeline name with a suffix `_dataset`. +Be aware that the schema referred to in this section is distinct from the [dlt Schema](../../general-usage/schema.md). +The database schema pertains to the structure and organization of data within the database, including table +definitions and relationships. On the other hand, the "dlt Schema" specifically refers to the format +and structure of normalized data within the dlt pipeline. + ## Tables Each [resource](../../general-usage/resource.md) in your pipeline definition will be represented by a table in -the destination. In the example above, we have one resource, `users`, so we will have one table, `users`, -in the destination. Here also, we explicitly set the `table_name` to `users`, if you don't set it, it will be -set to the resource name. +the destination. In the example above, we have one resource, `users`, so we will have one table, `mydata.users`, +in the destination. `mydata` is the schema name, and `users` is the table name. Here also, we explicitly set +the `table_name` to `users`, if you don't set it, it will be set to the resource name. For example, we can rewrite the pipeline above as: @@ -113,14 +118,14 @@ Running this pipeline will create two tables in the destination, `users` and `us `users` table will contain the top level data, and the `users__pets` table will contain the child data. 
Here is what the tables may look like: -**users** +**mydata.users** | id | name | _dlt_id | _dlt_load_id | | --- | --- | --- | --- | | 1 | Alice | wX3f5vn801W16A | 1234562350.98417 | | 2 | Bob | rX8ybgTeEmAmmA | 1234562350.98417 | -**users__pets** +**mydata.users__pets** | id | name | type | _dlt_id | _dlt_parent_id | _dlt_list_idx | | --- | --- | --- | --- | --- | --- | @@ -155,23 +160,55 @@ case the primary key or other unique columns are defined. During a pipeline run, dlt [normalizes both table and column names](../../general-usage/schema.md#naming-convention) to ensure compatibility with the destination database's accepted format. All names from your source data will be transformed into snake_case and will only include alphanumeric characters. Please be aware that the names in the destination database may differ somewhat from those in your original input. -## Load IDs +## Load Packages and Load IDs + +Each execution of the pipeline generates one or more load packages. A load package typically contains data retrieved from +all the [resources](../../general-usage/glossary.md#resource) of a particular [source](../../general-usage/glossary.md#source). +These packages are uniquely identified by a `load_id`. The `load_id` of a particular package is added to the top data tables +(referenced as `_dlt_load_id` column in the example above) and to the special `_dlt_loads` table with a status 0 +(when the load process is fully completed). + +To illustrate this, let's load more data into the same destination: + +```py +data = [ + { + 'id': 3, + 'name': 'Charlie', + 'pets': [] + }, +] +``` + +The rest of the pipeline definition remains the same. Running this pipeline will create a new load +package with a new `load_id` and add the data to the existing tables. 
The `users` table will now +look like this: + +**mydata.users** +| id | name | _dlt_id | _dlt_load_id | +| --- | --- | --- | --- | +| 1 | Alice | wX3f5vn801W16A | 1234562350.98417 | +| 2 | Bob | rX8ybgTeEmAmmA | 1234562350.98417 | +| 3 | Charlie | h8lehZEvT3fASQ | **1234563456.12345** | + +The `_dlt_loads` table will look like this: -Each pipeline run creates one or more load packages, which can be identified by their `load_id`. A load -package typically contains data from all [resources](../../general-usage/glossary.md#resource) of a -particular [source](../../general-usage/glossary.md#source). The `load_id` of a particular package -is added to the top data tables (`_dlt_load_id` column) and to the `_dlt_loads` table with a status 0 (when the load process -is fully completed). +**mydata._dlt_loads** + +| load_id | schema_name | status | inserted_at | schema_version_hash | +| --- | --- | --- | --- | --- | +| 1234562350.98417 | quick_start | 0 | 2023-09-12 16:45:51.17865+00 | aOEbMXCa6yHWbBM56qhLlx209rHoe35X1ZbnQekd/58= | +| **1234563456.12345** | quick_start | 0 | 2023-09-12 16:46:03.10662+00 | aOEbMXCa6yHWbBM56qhLlx209rHoe35X1ZbnQekd/58= | The `_dlt_loads` table tracks complete loads and allows chaining transformations on top of them. Many destinations do not support distributed and long-running transactions (e.g. Amazon Redshift). -In that case, the user may see the partially loaded data. It is possible to filter such data out—any -row with a `load_id` that does not exist in `_dlt_loads` is not yet completed. The same procedure may be used to delete and identify +In that case, the user may see the partially loaded data. It is possible to filter such data out: any +row with a `load_id` that does not exist in `_dlt_loads` is not yet completed. The same procedure may be used to identify and delete data for packages that never got completed. For each load, you can test and [alert](../../running-in-production/alerting.md) on anomalies (e.g. 
no data, too much loaded to a table). There are also some useful load stats in the `Load info` tab
of the [Streamlit app](exploring-the-data.md#exploring-the-data)
mentioned above.

You can add [transformations](../transformations) and chain them together

You can [save](../../running-in-production/running.md#inspect-and-save-the-load-info-and-trace)
complete lineage info for a particular `load_id` including a list of loaded files, error messages
(if any), elapsed times, schema changes. This can be helpful, for example, when troubleshooting
problems.

## Staging dataset

So far we've been using the `append` write disposition in our example pipeline. This means that
each time we run the pipeline, the data is appended to the existing tables. When you use [the
merge write disposition](../../general-usage/incremental-loading.md), `dlt` creates a staging database schema for
staging data. This schema is named `<dataset_name>_staging` and contains the same tables as the
destination schema. When you run the pipeline, the data from the staging tables is loaded into the
destination tables in a single atomic transaction.

Let's illustrate this with an example. We change our pipeline to use the `merge` write disposition:

```py
import dlt

@dlt.resource(primary_key="id", write_disposition="merge")
def users():
    yield [
        {'id': 1, 'name': 'Alice 2'},
        {'id': 2, 'name': 'Bob 2'}
    ]

pipeline = dlt.pipeline(
    pipeline_name='quick_start',
    destination='duckdb',
    dataset_name='mydata'
)

load_info = pipeline.run(users)
```

Running this pipeline will create a schema in the destination database with the name `mydata_staging`.
If you inspect the tables in this schema, you will find `mydata_staging.users` table identical to the
`mydata.users` table in the previous example.
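Conceptually, the merge step is an upsert by primary key. A rough sketch of what it achieves — not dlt's actual SQL:

```py
def merge_by_key(destination_rows, staging_rows, key="id"):
    # Staged rows replace destination rows that share the same primary key;
    # destination rows without a staged counterpart are kept unchanged.
    merged = {row[key]: row for row in destination_rows}
    for row in staging_rows:
        merged[row[key]] = row
    return sorted(merged.values(), key=lambda row: row[key])

users = merge_by_key(
    [{'id': 1, 'name': 'Alice'}, {'id': 2, 'name': 'Bob'}, {'id': 3, 'name': 'Charlie'}],
    [{'id': 1, 'name': 'Alice 2'}, {'id': 2, 'name': 'Bob 2'}],
)
# users now holds Alice 2, Bob 2 and Charlie
```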
+ +Here is what the tables may look like after running the pipeline: + +**mydata_staging.users** +| id | name | _dlt_id | _dlt_load_id | +| --- | --- | --- | --- | +| 1 | Alice 2 | wX3f5vn801W16A | 2345672350.98417 | +| 2 | Bob 2 | rX8ybgTeEmAmmA | 2345672350.98417 | + +**mydata.users** +| id | name | _dlt_id | _dlt_load_id | +| --- | --- | --- | --- | +| 1 | Alice 2 | wX3f5vn801W16A | 2345672350.98417 | +| 2 | Bob 2 | rX8ybgTeEmAmmA | 2345672350.98417 | +| 3 | Charlie | h8lehZEvT3fASQ | 1234563456.12345 | + +Notice that the `mydata.users` table now contains the data from both the previous pipeline run and +the current one. + +## Versioned datasets + +When you use the `full_refresh` option, `dlt` creates a versioned dataset. This means that each +time you run the pipeline, the data is loaded into a new dataset (a new database schema). +The dataset name is the same as the `dataset_name` you provided in the pipeline definition with a +datetime-based suffix. + +We modify our pipeline to use the `full_refresh` option to see how this works: + +```py +import dlt + +data = [ + {'id': 1, 'name': 'Alice'}, + {'id': 2, 'name': 'Bob'} +] + +pipeline = dlt.pipeline( + pipeline_name='quick_start', + destination='duckdb', + dataset_name='mydata', + full_refresh=True # <-- add this line +) +load_info = pipeline.run(data, table_name="users") +``` + +Every time you run this pipeline, a new schema will be created in the destination database with a +datetime-based suffix. The data will be loaded into tables in this schema. +For example, the first time you run the pipeline, the schema will be named +`mydata_20230912064403`, the second time it will be named `mydata_20230912064407`, and so on. 
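The suffix pattern can be sketched as follows — illustrative only, since dlt generates the suffix internally:

```py
from datetime import datetime, timezone

def versioned_dataset_name(dataset_name: str) -> str:
    # The configured dataset name plus a datetime-based suffix,
    # e.g. "mydata_20230912064403", as described above.
    suffix = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{dataset_name}_{suffix}"

print(versioned_dataset_name("mydata"))
```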
From 5ef616bd99a8c2a270d923bb3869103db7fa6c6b Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Wed, 13 Sep 2023 21:55:11 +0300 Subject: [PATCH 05/17] Fix wording --- .../dlt-ecosystem/visualizations/understanding-the-tables.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md b/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md index b56688a755..c310af4a54 100644 --- a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md +++ b/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md @@ -284,8 +284,8 @@ the current one. ## Versioned datasets -When you use the `full_refresh` option, `dlt` creates a versioned dataset. This means that each -time you run the pipeline, the data is loaded into a new dataset (a new database schema). +When you set the `full_refresh` argument to `True` in `dlt.pipeline` call, dlt creates a versioned dataset. +This means that each time you run the pipeline, the data is loaded into a new dataset (a new database schema). The dataset name is the same as the `dataset_name` you provided in the pipeline definition with a datetime-based suffix. From 18cf6a346e898e3f7481e50c1476011a3bb54148 Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Wed, 13 Sep 2023 22:10:12 +0300 Subject: [PATCH 06/17] Rephrase repetitive sentences --- .../visualizations/understanding-the-tables.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md b/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md index c310af4a54..28ec429272 100644 --- a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md +++ b/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md @@ -39,7 +39,7 @@ will behave similarly and have similar concepts. 
When you run the pipeline, dlt creates a schema in the destination database. The schema is a collection of tables that represent the data you loaded into the database. The schema name is the same as the `dataset_name` you provided in the pipeline definition. In the example above, we explicitly set the -`dataset_name` to `mydata`, if you don't set it, it will be set to the pipeline name with a suffix `_dataset`. +`dataset_name` to `mydata`. If you don't set it, it will be set to the pipeline name with a suffix `_dataset`. Be aware that the schema referred to in this section is distinct from the [dlt Schema](../../general-usage/schema.md). The database schema pertains to the structure and organization of data within the database, including table @@ -50,8 +50,8 @@ and structure of normalized data within the dlt pipeline. Each [resource](../../general-usage/resource.md) in your pipeline definition will be represented by a table in the destination. In the example above, we have one resource, `users`, so we will have one table, `mydata.users`, -in the destination. `mydata` is the schema name, and `users` is the table name. Here also, we explicitly set -the `table_name` to `users`, if you don't set it, it will be set to the resource name. +in the destination. Where `mydata` is the schema name, and `users` is the table name. Here also, we explicitly set +the `table_name` to `users`. When `table_name` is not set, the table name will be set to the resource name. 
For example, we can rewrite the pipeline above as: From 9c01d71f15a4b2552be92708d249d9d34fd16a1e Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Wed, 13 Sep 2023 22:16:24 +0300 Subject: [PATCH 07/17] Fix broken tables --- .../visualizations/understanding-the-tables.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md b/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md index 28ec429272..36621597de 100644 --- a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md +++ b/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md @@ -185,6 +185,7 @@ package with a new `load_id` and add the data to the existing tables. The `users look like this: **mydata.users** + | id | name | _dlt_id | _dlt_load_id | | --- | --- | --- | --- | | 1 | Alice | wX3f5vn801W16A | 1234562350.98417 | @@ -197,8 +198,8 @@ The `_dlt_loads` table will look like this: | load_id | schema_name | status | inserted_at | schema_version_hash | | --- | --- | --- | --- | --- | -| 1234562350.98417 | quick_start | 0 | 2023-09-12 16:45:51.17865+00 | aOEbMXCa6yHWbBM56qhLlx209rHoe35X1ZbnQekd/58= | -| **1234563456.12345** | quick_start | 0 | 2023-09-12 16:46:03.10662+00 | aOEbMXCa6yHWbBM56qhLlx209rHoe35X1ZbnQekd/58= | +| 1234562350.98417 | quick_start | 0 | 2023-09-12 16:45:51.17865+00 | aOEb...Qekd/58= | +| **1234563456.12345** | quick_start | 0 | 2023-09-12 16:46:03.10662+00 | aOEb...Qekd/58= | The `_dlt_loads` table tracks complete loads and allows chaining transformations on top of them. Many destinations do not support distributed and long-running transactions (e.g. Amazon Redshift). 
@@ -267,12 +268,14 @@ If you inspect the tables in this schema, you will find `mydata_staging.users` t Here is what the tables may look like after running the pipeline: **mydata_staging.users** + | id | name | _dlt_id | _dlt_load_id | | --- | --- | --- | --- | | 1 | Alice 2 | wX3f5vn801W16A | 2345672350.98417 | | 2 | Bob 2 | rX8ybgTeEmAmmA | 2345672350.98417 | **mydata.users** + | id | name | _dlt_id | _dlt_load_id | | --- | --- | --- | --- | | 1 | Alice 2 | wX3f5vn801W16A | 2345672350.98417 | From e327a312aea348d3f79526dbdb8d48d5fb683d8d Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Fri, 15 Sep 2023 20:44:02 +0300 Subject: [PATCH 08/17] Rename the schema heading; fix typo --- .../dlt-ecosystem/visualizations/understanding-the-tables.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md b/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md index 36621597de..afd7615b6f 100644 --- a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md +++ b/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md @@ -34,7 +34,7 @@ will behave similarly and have similar concepts. ::: -## Schema +## Database schema When you run the pipeline, dlt creates a schema in the destination database. The schema is a collection of tables that represent the data you loaded into the database. 
The schema name is the same as the @@ -59,7 +59,7 @@ For example, we can rewrite the pipeline above as: @dlt.resource def users(): yield [ - {'id': 1, 'name': 'Balice'}, + {'id': 1, 'name': 'Alice'}, {'id': 2, 'name': 'Bob'} ] From ad2df37d5859b302b0fbde77d825f0e70e2b0df5 Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Fri, 15 Sep 2023 20:47:37 +0300 Subject: [PATCH 09/17] Rename the page --- .../visualizations/understanding-the-tables.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md b/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md index afd7615b6f..5083d7dc98 100644 --- a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md +++ b/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md @@ -1,10 +1,10 @@ --- -title: Understanding the tables -description: Understanding the tables that have been loaded -keywords: [understanding tables, loaded data, data structure] +title: Destination tables +description: Understanding the tables created in the destination database +keywords: [destination tables, loaded data, data structure, schema, table, child table, load package, load id, lineage, staging dataset, versioned dataset] --- -# Understanding the tables +# Destination tables In [Exploring the data](./exploring-the-data.md) you have seen the data that has been loaded into the database. Let's take a closer look at the tables that have been created. 
From 115fdbea71ab1123fedf20ec45ea14f4439fd39e Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Fri, 15 Sep 2023 21:04:09 +0300 Subject: [PATCH 10/17] Rename the page and update the links --- docs/website/docs/build-a-pipeline-tutorial.md | 2 +- .../docs/dlt-ecosystem/destinations/filesystem.md | 2 +- .../destination-tables.md} | 0 docs/website/docs/general-usage/full-loading.md | 10 +++++----- docs/website/docs/getting-started.mdx | 2 +- docs/website/docs/user-guides/data-beginner.md | 2 +- docs/website/docs/user-guides/data-scientist.md | 4 ++-- docs/website/docs/user-guides/engineering-manager.md | 4 ++-- docs/website/sidebars.js | 2 +- 9 files changed, 14 insertions(+), 14 deletions(-) rename docs/website/docs/{dlt-ecosystem/visualizations/understanding-the-tables.md => general-usage/destination-tables.md} (100%) diff --git a/docs/website/docs/build-a-pipeline-tutorial.md b/docs/website/docs/build-a-pipeline-tutorial.md index 14c3a78411..2462a9a32a 100644 --- a/docs/website/docs/build-a-pipeline-tutorial.md +++ b/docs/website/docs/build-a-pipeline-tutorial.md @@ -391,7 +391,7 @@ utilization, schema enforcement and curation, and schema change alerts. which consist of a timestamp and pipeline name. Load IDs enable incremental transformations and data vaulting by tracking data loads and facilitating data lineage and traceability. -Read more about [lineage.](dlt-ecosystem/visualizations/understanding-the-tables.md#load-ids) +Read more about [lineage](general-usage/destination-tables.md#data-lineage). 
### Schema Enforcement and Curation

diff --git a/docs/website/docs/dlt-ecosystem/destinations/filesystem.md b/docs/website/docs/dlt-ecosystem/destinations/filesystem.md
index 8db9a35514..32bf561a82 100644
--- a/docs/website/docs/dlt-ecosystem/destinations/filesystem.md
+++ b/docs/website/docs/dlt-ecosystem/destinations/filesystem.md
@@ -155,7 +155,7 @@ All the files are stored in a single folder with the name of the dataset that yo
The name of each file contains essential metadata on the content:

- **schema_name** and **table_name** identify the [schema](../../general-usage/schema.md) and table that define the file structure (column names, data types etc.)
- **load_id** is the [id of the load package](../../general-usage/destination-tables.md#load-packages-and-load-ids) from which the file comes.
- **file_id** if there are many files with data for a single table, they are copied with a different file id.
- **ext** the format of the file, e.g. `jsonl` or `parquet`

diff --git a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md b/docs/website/docs/general-usage/destination-tables.md
similarity index 100%
rename from docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md
rename to docs/website/docs/general-usage/destination-tables.md
diff --git a/docs/website/docs/general-usage/full-loading.md b/docs/website/docs/general-usage/full-loading.md
index f6f359914d..92fdf064fd 100644
--- a/docs/website/docs/general-usage/full-loading.md
+++ b/docs/website/docs/general-usage/full-loading.md
@@ -40,15 +40,15 @@ replace_strategy = "staging-optimized"

### The `truncate-and-insert` strategy

The `truncate-and-insert` replace strategy is the default and the fastest of all three strategies.
If you load data with this setting, then the
-destination tables will be truncated at the beginning of the load and the new data will be inserted consecutively but not within the same transaction.
+destination tables will be truncated at the beginning of the load and the new data will be inserted consecutively but not within the same transaction.
The downside of this strategy is that your tables will have no data for a while until the load is completed. You
-may end up with new data in some tables and no data in other tables if the load fails during the run. Such incomplete load may be however detected by checking if the
-[_dlt_loads table contains load id](../dlt-ecosystem/visualizations/understanding-the-tables.md#load-ids) from _dlt_load_id of the replaced tables. If you prefer to have no data downtime, please use one of the other strategies.
+may end up with new data in some tables and no data in other tables if the load fails during the run. Such an incomplete load can, however, be detected by checking whether the
+[_dlt_loads table contains the load id](destination-tables.md#load-packages-and-load-ids) from the _dlt_load_id column of the replaced tables. If you prefer to have no data downtime, please use one of the other strategies.

### The `insert-from-staging` strategy

The `insert-from-staging` is the slowest of all three strategies. It will load all new data into staging tables away from your final destination tables and will then truncate and insert the new data in one transaction.
+It also maintains a consistent state between child and parent tables at all times. Use this strategy if you require consistent destination datasets with zero downtime and the `optimized` strategy does not work for you. This strategy behaves the same way across all destinations. ### The `staging-optimized` strategy diff --git a/docs/website/docs/getting-started.mdx b/docs/website/docs/getting-started.mdx index 5afa0fb0da..cdbeac8eab 100644 --- a/docs/website/docs/getting-started.mdx +++ b/docs/website/docs/getting-started.mdx @@ -109,7 +109,7 @@ Learn more: - [The full list of available destinations.](dlt-ecosystem/destinations/) - [Exploring the data](dlt-ecosystem/visualizations/exploring-the-data). - What happens after loading? - [Understanding the tables](dlt-ecosystem/visualizations/understanding-the-tables). + [Destination tables](general-usage/destination-tables). ## Load your data diff --git a/docs/website/docs/user-guides/data-beginner.md b/docs/website/docs/user-guides/data-beginner.md index 69b8b8bdee..e6dd8b8d22 100644 --- a/docs/website/docs/user-guides/data-beginner.md +++ b/docs/website/docs/user-guides/data-beginner.md @@ -116,7 +116,7 @@ Good docs pages to check out: - [Create a pipeline.](../walkthroughs/create-a-pipeline) - [Run a pipeline.](../walkthroughs/run-a-pipeline) - [Deploy a pipeline with GitHub Actions.](../walkthroughs/deploy-a-pipeline/deploy-with-github-actions) -- [Understand the loaded data.](../dlt-ecosystem/visualizations/understanding-the-tables) +- [Understand the loaded data.](../general-usage/destination-tables.md) - [Explore the loaded data in Streamlit.](../dlt-ecosystem/visualizations/exploring-the-data.md) - [Transform the data with SQL or python.](../dlt-ecosystem/transformations) - [Contribute a pipeline.](https://github.com/dlt-hub/verified-sources/blob/master/CONTRIBUTING.md) diff --git a/docs/website/docs/user-guides/data-scientist.md b/docs/website/docs/user-guides/data-scientist.md index
c0bcf289be..b8415937e4 100644 --- a/docs/website/docs/user-guides/data-scientist.md +++ b/docs/website/docs/user-guides/data-scientist.md @@ -53,7 +53,7 @@ with the production environment, leading to smoother integration and deployment ### `dlt` is optimized for local use on laptops - It offers a seamless - [integration with Streamlit](../dlt-ecosystem/visualizations/understanding-the-tables#show-tables-and-data-in-the-destination). + [integration with Streamlit](../dlt-ecosystem/visualizations/exploring-the-data.md). This integration enables a smooth and interactive data analysis experience, where Data Scientists can leverage the power of `dlt` alongside Streamlit's intuitive interface and visualization capabilities. @@ -107,7 +107,7 @@ analysis process. Besides, having a schema imposed on the data acts as a technical description of the data, accelerating the discovery process. -See [Understanding the tables](../dlt-ecosystem/visualizations/understanding-the-tables), +See [Destination tables](../general-usage/destination-tables.md) and [Exploring the data](../dlt-ecosystem/visualizations/exploring-the-data) in our documentation. ## Use case #3: Data Preprocessing and Transformation diff --git a/docs/website/docs/user-guides/engineering-manager.md b/docs/website/docs/user-guides/engineering-manager.md index cdd8ad4172..70e23eb2c1 100644 --- a/docs/website/docs/user-guides/engineering-manager.md +++ b/docs/website/docs/user-guides/engineering-manager.md @@ -102,7 +102,7 @@ open source communities can. involved in curation. This makes both the engineer and the others happy. - Better governance with end to end pipelining via dbt: [run dbt packages on the fly](../dlt-ecosystem/transformations/dbt.md), - [lineage out of the box](../dlt-ecosystem/visualizations/understanding-the-tables). + [lineage out of the box](../general-usage/destination-tables.md#data-lineage). - Zero learning curve: Declarative loading, simple functional programming. 
By using `dlt`'s declarative, standard approach to loading data, there is no complicated code to maintain, and the analysts can thus maintain the code. @@ -144,7 +144,7 @@ The implications: - Rapid Data Exploration and Prototyping: By running in Colab with DuckDB, you can explore semi-structured data much faster by structuring it with `dlt` and analysing it in SQL. [Schema inference](../general-usage/schema#data-normalizer), - [exploring the loaded data](../dlt-ecosystem/visualizations/understanding-the-tables#show-tables-and-data-in-the-destination). + [exploring the loaded data](../dlt-ecosystem/visualizations/exploring-the-data.md). - No vendor limits: `dlt` is forever free, with no vendor strings. We do not create value by creating a pain for you and solving it. We create value by supporting you beyond. - `dlt` removes complexity: You can use `dlt` in your existing stack, no overheads, no race conditions, diff --git a/docs/website/sidebars.js b/docs/website/sidebars.js index 9d17d6646b..8865feb2f7 100644 --- a/docs/website/sidebars.js +++ b/docs/website/sidebars.js @@ -140,7 +140,6 @@ const sidebars = { label: 'Visualizations', items: [ 'dlt-ecosystem/visualizations/exploring-the-data', - 'dlt-ecosystem/visualizations/understanding-the-tables' ] }, ], @@ -215,6 +214,7 @@ const sidebars = { 'general-usage/state', 'general-usage/incremental-loading', 'general-usage/full-loading', + 'general-usage/destination-tables', 'general-usage/credentials', 'general-usage/schema', 'general-usage/configuration', From d2132f135f732eff6d31097c54a65272419d7cd3 Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Fri, 15 Sep 2023 21:04:27 +0300 Subject: [PATCH 11/17] Remove an extra comma --- docs/website/sidebars.js | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/website/sidebars.js b/docs/website/sidebars.js index 8865feb2f7..568c6a0e0c 100644 --- a/docs/website/sidebars.js +++ b/docs/website/sidebars.js @@ -132,7 +132,6 @@ const sidebars = { 
'dlt-ecosystem/transformations/dbt', 'dlt-ecosystem/transformations/sql', 'dlt-ecosystem/transformations/pandas', - , ] }, { From b7cfa43871c9e095bfb886490ec4a5e653347d12 Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Fri, 15 Sep 2023 21:27:46 +0300 Subject: [PATCH 12/17] Fix the relative link --- docs/website/docs/general-usage/destination-tables.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/website/docs/general-usage/destination-tables.md b/docs/website/docs/general-usage/destination-tables.md index 5083d7dc98..0107930208 100644 --- a/docs/website/docs/general-usage/destination-tables.md +++ b/docs/website/docs/general-usage/destination-tables.md @@ -6,7 +6,7 @@ keywords: [destination tables, loaded data, data structure, schema, table, child # Destination tables -In [Exploring the data](./exploring-the-data.md) you have seen the data that has been loaded into the +In [Exploring the data](../dlt-ecosystem/visualizations/exploring-the-data.md) you have seen the data that has been loaded into the database. Let's take a closer look at the tables that have been created. We start with a simple dlt pipeline: From 1dec20f3d606eff06abd8eee02e36fb9417fcdf3 Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Fri, 15 Sep 2023 21:32:48 +0300 Subject: [PATCH 13/17] Fix the rest relative links --- .../docs/general-usage/destination-tables.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/website/docs/general-usage/destination-tables.md b/docs/website/docs/general-usage/destination-tables.md index 0107930208..b655998507 100644 --- a/docs/website/docs/general-usage/destination-tables.md +++ b/docs/website/docs/general-usage/destination-tables.md @@ -29,7 +29,7 @@ load_info = pipeline.run(data, table_name="users") :::note -Here we are using the [DuckDb destination](../destinations/duckdb.md), which is an in-memory database. 
Other database destinations +Here we are using the [DuckDb destination](../dlt-ecosystem/destinations/duckdb.md), which is an in-memory database. Other database destinations will behave similarly and have similar concepts. ::: @@ -41,14 +41,14 @@ collection of tables that represent the data you loaded into the database. The s `dataset_name` you provided in the pipeline definition. In the example above, we explicitly set the `dataset_name` to `mydata`. If you don't set it, it will be set to the pipeline name with a suffix `_dataset`. -Be aware that the schema referred to in this section is distinct from the [dlt Schema](../../general-usage/schema.md). +Be aware that the schema referred to in this section is distinct from the [dlt Schema](schema.md). The database schema pertains to the structure and organization of data within the database, including table definitions and relationships. On the other hand, the "dlt Schema" specifically refers to the format and structure of normalized data within the dlt pipeline. ## Tables -Each [resource](../../general-usage/resource.md) in your pipeline definition will be represented by a table in +Each [resource](resource.md) in your pipeline definition will be represented by a table in the destination. In the example above, we have one resource, `users`, so we will have one table, `mydata.users`, in the destination. Where `mydata` is the schema name, and `users` is the table name. Here also, we explicitly set the `table_name` to `users`. When `table_name` is not set, the table name will be set to the resource name. @@ -158,12 +158,12 @@ case the primary key or other unique columns are defined. ## Naming convention: tables and columns -During a pipeline run, dlt [normalizes both table and column names](../../general-usage/schema.md#naming-convention) to ensure compatibility with the destination database's accepted format. 
All names from your source data will be transformed into snake_case and will only include alphanumeric characters. Please be aware that the names in the destination database may differ somewhat from those in your original input. +During a pipeline run, dlt [normalizes both table and column names](schema.md#naming-convention) to ensure compatibility with the destination database's accepted format. All names from your source data will be transformed into snake_case and will only include alphanumeric characters. Please be aware that the names in the destination database may differ somewhat from those in your original input. ## Load Packages and Load IDs Each execution of the pipeline generates one or more load packages. A load package typically contains data retrieved from -all the [resources](../../general-usage/glossary.md#resource) of a particular [source](../../general-usage/glossary.md#source). +all the [resources](glossary.md#resource) of a particular [source](glossary.md#source). These packages are uniquely identified by a `load_id`. The `load_id` of a particular package is added to the top data tables (referenced as `_dlt_load_id` column in the example above) and to the special `_dlt_loads` table with a status 0 (when the load process is fully completed). @@ -207,12 +207,12 @@ In that case, the user may see the partially loaded data. It is possible to filt row with a `load_id` that does not exist in `_dlt_loads` is not yet completed. The same procedure may be used to identify and delete data for packages that never got completed. -For each load, you can test and [alert](../../running-in-production/alerting.md) on anomalies (e.g. +For each load, you can test and [alert](../running-in-production/alerting.md) on anomalies (e.g. no data, too much loaded to a table). 
There are also some useful load stats in the `Load info` tab -of the [Streamlit app](exploring-the-data.md#exploring-the-data) +of the [Streamlit app](../dlt-ecosystem/visualizations/exploring-the-data.md#exploring-the-data) mentioned above. -You can add [transformations](../transformations) and chain them together +You can add [transformations](../dlt-ecosystem/transformations/) and chain them together using the `status` column. You start the transformation for all the data with a particular `load_id` with a status of 0 and then update it to 1. The next transformation starts with the status of 1 and is then updated to 2. This can be repeated for every additional transformation. @@ -226,7 +226,7 @@ same process across multiple systems, which adds data lineage requirements. Usin and `load_id` provided out of the box by `dlt`, you are able to identify the source and time of data. -You can [save](../../running-in-production/running.md#inspect-and-save-the-load-info-and-trace) +You can [save](../running-in-production/running.md#inspect-and-save-the-load-info-and-trace) complete lineage info for a particular `load_id` including a list of loaded files, error messages (if any), elapsed times, schema changes. This can be helpful, for example, when troubleshooting problems. From 1fa0a70e3c5737e7f722fa5478ab52ca3ae0aa66 Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Fri, 15 Sep 2023 21:43:08 +0300 Subject: [PATCH 14/17] Fix another relative link --- docs/website/docs/general-usage/destination-tables.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/website/docs/general-usage/destination-tables.md b/docs/website/docs/general-usage/destination-tables.md index b655998507..f3efbaebcb 100644 --- a/docs/website/docs/general-usage/destination-tables.md +++ b/docs/website/docs/general-usage/destination-tables.md @@ -235,7 +235,7 @@ problems. So far we've been using the `append` write disposition in our example pipeline. 
This means that each time we run the pipeline, the data is appended to the existing tables. When you use [the -merge write disposition](../../general-usage/incremental-loading.md), `dlt` creates a staging database schema for +merge write disposition](incremental-loading.md), dlt creates a staging database schema for staging data. This schema is named `_staging` and contains the same tables as the destination schema. When you run the pipeline, the data from the staging tables is loaded into the destination tables in a single atomic transaction. From 196aee445d8669805a7427c0b4e8806e85e4ca46 Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Fri, 15 Sep 2023 22:20:56 +0300 Subject: [PATCH 15/17] Put destination tables after pipeline --- docs/website/sidebars.js | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/website/sidebars.js b/docs/website/sidebars.js index 568c6a0e0c..f7d29993dc 100644 --- a/docs/website/sidebars.js +++ b/docs/website/sidebars.js @@ -210,10 +210,10 @@ const sidebars = { 'general-usage/resource', 'general-usage/source', 'general-usage/pipeline', + 'general-usage/destination-tables', 'general-usage/state', 'general-usage/incremental-loading', 'general-usage/full-loading', - 'general-usage/destination-tables', 'general-usage/credentials', 'general-usage/schema', 'general-usage/configuration', From a21abec2845a0ff26f2a145b21d684c48505baf2 Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Fri, 15 Sep 2023 22:36:50 +0300 Subject: [PATCH 16/17] Update page content so it fits the new page name --- .../docs/general-usage/destination-tables.md | 21 +++++++++++++------ 1 file changed, 15 insertions(+), 6 deletions(-) diff --git a/docs/website/docs/general-usage/destination-tables.md b/docs/website/docs/general-usage/destination-tables.md index f3efbaebcb..613bee1d4b 100644 --- a/docs/website/docs/general-usage/destination-tables.md +++ b/docs/website/docs/general-usage/destination-tables.md @@ -6,8 +6,9 @@ keywords: [destination 
tables, loaded data, data structure, schema, table, child # Destination tables -In [Exploring the data](../dlt-ecosystem/visualizations/exploring-the-data.md) you have seen the data that has been loaded into the -database. Let's take a closer look at the tables that have been created. +When you run a [pipeline](pipeline.md), dlt creates tables in the destination database and loads the data +from your [source](source.md) into these tables. In this section, we will take a closer look at what +destination tables look like and how they are organized. We start with a simple dlt pipeline: @@ -27,6 +28,8 @@ pipeline = dlt.pipeline( load_info = pipeline.run(data, table_name="users") ``` +Running this pipeline will create a database schema in the destination database (DuckDB) along with a table named `users`. + :::note Here we are using the [DuckDb destination](../dlt-ecosystem/destinations/duckdb.md), which is an in-memory database. Other database destinations @@ -34,12 +37,18 @@ will behave similarly and have similar concepts. ::: +:::tip + +You can use the [Streamlit app](../dlt-ecosystem/visualizations/exploring-the-data.md#exploring-the-data) to explore the data in your destination database. + +::: + ## Database schema -When you run the pipeline, dlt creates a schema in the destination database. The schema is a -collection of tables that represent the data you loaded into the database. The schema name is the same as the -`dataset_name` you provided in the pipeline definition. In the example above, we explicitly set the -`dataset_name` to `mydata`. If you don't set it, it will be set to the pipeline name with a suffix `_dataset`. +The database schema is a collection of tables that represent the data you loaded into the database. +The schema name is the same as the `dataset_name` you provided in the pipeline definition. +In the example above, we explicitly set the `dataset_name` to `mydata`. 
If you don't set it, +it will be set to the pipeline name with a suffix `_dataset`. Be aware that the schema referred to in this section is distinct from the [dlt Schema](schema.md). The database schema pertains to the structure and organization of data within the database, including table From 5e156eddb532133d4f35f12c5f5f9e0f948726b6 Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Fri, 15 Sep 2023 22:55:26 +0300 Subject: [PATCH 17/17] Rearrange the sections --- docs/website/docs/general-usage/destination-tables.md | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/docs/website/docs/general-usage/destination-tables.md b/docs/website/docs/general-usage/destination-tables.md index 613bee1d4b..8f95639f87 100644 --- a/docs/website/docs/general-usage/destination-tables.md +++ b/docs/website/docs/general-usage/destination-tables.md @@ -28,8 +28,6 @@ pipeline = dlt.pipeline( load_info = pipeline.run(data, table_name="users") ``` -Running this pipeline will create a database schema in the destination database (DuckDB) along with a table named `users`. - :::note Here we are using the [DuckDb destination](../dlt-ecosystem/destinations/duckdb.md), which is an in-memory database. Other database destinations @@ -37,11 +35,7 @@ will behave similarly and have similar concepts. ::: -:::tip - -You can use the [Streamlit app](../dlt-ecosystem/visualizations/exploring-the-data.md#exploring-the-data) to explore the data in your destination database. - -::: +Running this pipeline will create a database schema in the destination database (DuckDB) along with a table named `users`. Quick tip: you can use the `show` command of the `dlt pipeline` CLI [to see the tables](../dlt-ecosystem/visualizations/exploring-the-data.md#exploring-the-data) in the destination database. ## Database schema
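A recurring theme in the reworked docs above is the `_dlt_loads` bookkeeping: each load package gets a `load_id`, every top-level table row carries it in `_dlt_load_id`, and a row belongs to a completed load only if that `load_id` appears in `_dlt_loads` with status 0. The following is a minimal sketch of that filtering idea using stdlib `sqlite3` stand-in tables rather than a real dlt destination; the table layouts follow the docs, while the load ids and rows are illustrative:

```python
import sqlite3

# Stand-in for a dlt destination dataset: the special _dlt_loads table records
# completed load packages; every top-level table carries a _dlt_load_id column.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE _dlt_loads (load_id TEXT, status INTEGER)")
con.execute("CREATE TABLE users (id INTEGER, name TEXT, _dlt_load_id TEXT)")

# One load completed (status 0 recorded); a second load wrote rows but failed
# before its load_id reached _dlt_loads.
con.execute("INSERT INTO _dlt_loads VALUES ('1694600000.0', 0)")
con.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Alice", "1694600000.0"), (2, "Bob", "1694700000.0")],
)

# Keep only rows whose load_id made it into _dlt_loads, filtering out data
# from incomplete loads as the full-loading docs suggest.
complete_rows = con.execute(
    "SELECT id, name FROM users"
    " WHERE _dlt_load_id IN (SELECT load_id FROM _dlt_loads WHERE status = 0)"
).fetchall()
print(complete_rows)  # [(1, 'Alice')]
```

The same `IN (SELECT load_id FROM _dlt_loads ...)` predicate can be run against any real destination database; the docs also note that the status column can be bumped past 0 to chain transformations over a package.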