
# Understanding the tables

In [Exploring the data](./exploring-the-data.md), you saw the data that was loaded into the
database. Let's take a closer look at the tables that have been created.

We start with a simple dlt pipeline:

```py
import dlt

data = [
    {'id': 1, 'name': 'Alice'},
    {'id': 2, 'name': 'Bob'}
]

pipeline = dlt.pipeline(
    pipeline_name='quick_start',
    destination='duckdb',
    dataset_name='mydata'
)
load_info = pipeline.run(data, table_name="users")
```

:::note

Here we are using the [DuckDB destination](../destinations/duckdb.md), which is an in-process database. Other database destinations
will behave similarly and have similar concepts.

:::

## Schema

When you run the pipeline, dlt creates a schema in the destination database. The schema is a
collection of tables that represent the data you loaded into the database. The schema name is the same as the
`dataset_name` you provided in the pipeline definition. In the example above, we explicitly set the
`dataset_name` to `mydata`. If you don't set it, it will be set to the pipeline name with a suffix `_dataset`.

Be aware that the schema referred to in this section is distinct from the [dlt Schema](../../general-usage/schema.md).
The database schema pertains to the structure and organization of data within the database, including table
definitions and relationships. On the other hand, the "dlt Schema" specifically refers to the format
and structure of normalized data within the dlt pipeline.
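
For instance, here is a minimal sketch of the same pipeline without an explicit `dataset_name`; following the rule above, the data would land in a schema called `quick_start_dataset`:

```py
import dlt

data = [
    {'id': 1, 'name': 'Alice'},
    {'id': 2, 'name': 'Bob'}
]

# No dataset_name is given, so dlt derives it from the pipeline name:
# the destination schema is expected to be named `quick_start_dataset`.
pipeline = dlt.pipeline(
    pipeline_name='quick_start',
    destination='duckdb',
)
load_info = pipeline.run(data, table_name="users")
```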

## Tables

Each [resource](../../general-usage/resource.md) in your pipeline definition is represented by a table in
the destination. In the example above, we have one resource, `users`, so we have one table, `mydata.users`,
in the destination, where `mydata` is the schema name and `users` is the table name. Here, too, we explicitly set
the `table_name` to `users`; when `table_name` is not set, the table name defaults to the resource name.

For example, we can rewrite the pipeline above as:

```py
@dlt.resource
def users():
    yield [
        {'id': 1, 'name': 'Alice'},
        {'id': 2, 'name': 'Bob'}
    ]

pipeline = dlt.pipeline(
    pipeline_name='quick_start',
    destination='duckdb',
    dataset_name='mydata'
)
load_info = pipeline.run(users)
```

The result will be the same, but the table is implicitly named `users` based on the resource name.

:::note

Special tables are created to track the pipeline state. These tables are prefixed with `_dlt_`
and are not shown in the `show` command of the `dlt pipeline` CLI. However, you can see them when
connecting to the database directly.

:::
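
For example, here is a minimal sketch of inspecting those internal tables directly with the `duckdb` client (assuming the default DuckDB file `quick_start.duckdb` created next to your pipeline script):

```py
import duckdb

# Connect to the database file created by the pipeline above.
conn = duckdb.connect("quick_start.duckdb")

# List every table in the destination, including the internal ones prefixed with `_dlt_`.
print(conn.sql("SELECT table_schema, table_name FROM information_schema.tables ORDER BY 1, 2"))
```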

## Child and parent tables

Now let's look at a more complex example:

```py
import dlt

data = [
    {
        'id': 1,
        'name': 'Alice',
        'pets': [
            {'id': 1, 'name': 'Fluffy', 'type': 'cat'},
            {'id': 2, 'name': 'Spot', 'type': 'dog'}
        ]
    },
    {
        'id': 2,
        'name': 'Bob',
        'pets': [
            {'id': 3, 'name': 'Fido', 'type': 'dog'}
        ]
    }
]

pipeline = dlt.pipeline(
    pipeline_name='quick_start',
    destination='duckdb',
    dataset_name='mydata'
)
load_info = pipeline.run(data, table_name="users")
```

Running this pipeline will create two tables in the destination, `users` and `users__pets`. The
`users` table will contain the top level data, and the `users__pets` table will contain the child
data. Here is what the tables may look like:

**mydata.users**

| id | name | _dlt_id | _dlt_load_id |
| --- | --- | --- | --- |
| 1 | Alice | wX3f5vn801W16A | 1234562350.98417 |
| 2 | Bob | rX8ybgTeEmAmmA | 1234562350.98417 |

**mydata.users__pets**

| id | name | type | _dlt_id | _dlt_parent_id | _dlt_list_idx |
| --- | --- | --- | --- | --- | --- |
| 1 | Fluffy | cat | w1n0PEDzuP3grw | wX3f5vn801W16A | 0 |
| 2 | Spot | dog | 9uxh36VU9lqKpw | wX3f5vn801W16A | 1 |
| 3 | Fido | dog | pe3FVtCWz8VuNA | rX8ybgTeEmAmmA | 0 |

When creating a database schema, dlt recursively unpacks nested structures into relational tables,
creating and linking children and parent tables.

This is how it works:

1. Each row in all (top level and child) data tables created by `dlt` contains a UNIQUE column named
   `_dlt_id`.
1. Each child table contains a FOREIGN KEY column `_dlt_parent_id` linking to a particular row
   (`_dlt_id`) of the parent table.
1. Rows in child tables come from lists: the `_dlt_list_idx` column keeps the position of the row
   within that list.
1. For tables that are loaded with the `merge` write disposition, we add a ROOT KEY column
   `_dlt_root_id`, which links the child table to a row in the top level table.


:::note

If you define your own primary key in a child table, it will be used to link to the parent table,
and `_dlt_parent_id` and `_dlt_list_idx` will not be added. `_dlt_id` is always added, even when
a primary key or other unique columns are defined.

:::
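
To make the linking concrete, here is a sketch of a query that reassembles the nested data by joining the child table back to its parent (again assuming the default `quick_start.duckdb` file and the column names shown above):

```py
import duckdb

conn = duckdb.connect("quick_start.duckdb")

# A child row points at its parent row via `_dlt_parent_id` -> `_dlt_id`,
# and `_dlt_list_idx` preserves the original order of the list items.
print(conn.sql("""
    SELECT u.name AS owner, p.name AS pet, p.type AS pet_type, p._dlt_list_idx
    FROM mydata.users AS u
    JOIN mydata.users__pets AS p
      ON p._dlt_parent_id = u._dlt_id
    ORDER BY u.name, p._dlt_list_idx
"""))
```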

## Naming convention: tables and columns

During a pipeline run, dlt [normalizes both table and column names](../../general-usage/schema.md#naming-convention) to ensure compatibility with the destination database's accepted format. All names from your source data will be transformed into snake_case identifiers containing only alphanumeric characters and underscores. Please be aware that the names in the destination database may differ somewhat from those in your original input.
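
As a small illustrative sketch (reusing the pipeline object from above; `users_raw` is just a hypothetical table name), keys that are not valid snake_case identifiers are rewritten on load:

```py
data = [{'User Name': 'Alice', 'user-email': 'alice@example.com'}]

# With the default snake_case naming convention, these keys are expected to
# end up as `user_name` and `user_email` columns in the destination.
load_info = pipeline.run(data, table_name="users_raw")
```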

## Load Packages and Load IDs

Each execution of the pipeline generates one or more load packages. A load package typically contains data retrieved from
all the [resources](../../general-usage/glossary.md#resource) of a particular [source](../../general-usage/glossary.md#source).
These packages are uniquely identified by a `load_id`. The `load_id` of a particular package is added to the top data tables
(referenced as `_dlt_load_id` column in the example above) and to the special `_dlt_loads` table with a status 0
(when the load process is fully completed).

To illustrate this, let's load more data into the same destination:

```py
data = [
    {
        'id': 3,
        'name': 'Charlie',
        'pets': []
    },
]
```

The rest of the pipeline definition remains the same. Running this pipeline will create a new load
package with a new `load_id` and add the data to the existing tables. The `users` table will now
look like this:

**mydata.users**

| id | name | _dlt_id | _dlt_load_id |
| --- | --- | --- | --- |
| 1 | Alice | wX3f5vn801W16A | 1234562350.98417 |
| 2 | Bob | rX8ybgTeEmAmmA | 1234562350.98417 |
| 3 | Charlie | h8lehZEvT3fASQ | **1234563456.12345** |

The `_dlt_loads` table will look like this:

**mydata._dlt_loads**

| load_id | schema_name | status | inserted_at | schema_version_hash |
| --- | --- | --- | --- | --- |
| 1234562350.98417 | quick_start | 0 | 2023-09-12 16:45:51.17865+00 | aOEb...Qekd/58= |
| **1234563456.12345** | quick_start | 0 | 2023-09-12 16:46:03.10662+00 | aOEb...Qekd/58= |

The `_dlt_loads` table tracks complete loads and allows chaining transformations on top of them.
Many destinations do not support distributed and long-running transactions (e.g. Amazon Redshift).
In that case, the user may see the partially loaded data. It is possible to filter such data out: any
row with a `load_id` that does not exist in `_dlt_loads` is not yet completed. The same procedure may be used to identify
and delete data for packages that never got completed.
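
A sketch of such a filter, expressed as plain SQL against the example tables (run here through the `duckdb` client, assuming the default `quick_start.duckdb` file):

```py
import duckdb

conn = duckdb.connect("quick_start.duckdb")

# Keep only rows whose load package completed, i.e. whose load_id is
# recorded in _dlt_loads with status 0.
print(conn.sql("""
    SELECT *
    FROM mydata.users
    WHERE _dlt_load_id IN (
        SELECT load_id FROM mydata._dlt_loads WHERE status = 0
    )
"""))
```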

For each load, you can test and [alert](../../running-in-production/alerting.md) on anomalies (e.g.
no data, too much loaded to a table). There are also some useful load stats in the `Load info` tab
of the [Streamlit app](exploring-the-data.md#exploring-the-data)
mentioned above.
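
For instance, a minimal check right after a run might look like the sketch below (assuming the `load_info` returned by `pipeline.run` as in the examples above):

```py
load_info = pipeline.run(users)

# Print a human readable summary of the load packages and their jobs.
print(load_info)

# Raise an exception if any job in the load packages failed,
# so the failure can be caught and alerted on.
load_info.raise_on_failed_jobs()
```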

You can add [transformations](../transformations) and chain them together.

You can [save](../../running-in-production/running.md#inspect-and-save-the-load-info)
complete lineage info for a particular `load_id` including a list of loaded files, error messages
(if any), elapsed times, schema changes. This can be helpful, for example, when troubleshooting
problems.

## Staging dataset

So far we've been using the `append` write disposition in our example pipeline. This means that
each time we run the pipeline, the data is appended to the existing tables. When you use [the
merge write disposition](../../general-usage/incremental-loading.md), `dlt` creates a staging database schema for
staging data. This schema is named `<dataset_name>_staging` and contains the same tables as the
destination schema. When you run the pipeline, the data from the staging tables is loaded into the
destination tables in a single atomic transaction.

Let's illustrate this with an example. We change our pipeline to use the `merge` write disposition:

```py
import dlt

@dlt.resource(primary_key="id", write_disposition="merge")
def users():
    yield [
        {'id': 1, 'name': 'Alice 2'},
        {'id': 2, 'name': 'Bob 2'}
    ]

pipeline = dlt.pipeline(
    pipeline_name='quick_start',
    destination='duckdb',
    dataset_name='mydata'
)

load_info = pipeline.run(users)
```

Running this pipeline will create a schema in the destination database with the name `mydata_staging`.
If you inspect the tables in this schema, you will find the `mydata_staging.users` table, identical in
structure to the `mydata.users` table from the previous example.

Here is what the tables may look like after running the pipeline:

**mydata_staging.users**

| id | name | _dlt_id | _dlt_load_id |
| --- | --- | --- | --- |
| 1 | Alice 2 | wX3f5vn801W16A | 2345672350.98417 |
| 2 | Bob 2 | rX8ybgTeEmAmmA | 2345672350.98417 |

**mydata.users**

| id | name | _dlt_id | _dlt_load_id |
| --- | --- | --- | --- |
| 1 | Alice 2 | wX3f5vn801W16A | 2345672350.98417 |
| 2 | Bob 2 | rX8ybgTeEmAmmA | 2345672350.98417 |
| 3 | Charlie | h8lehZEvT3fASQ | 1234563456.12345 |

Notice that the `mydata.users` table now contains the data from both the previous pipeline run and
the current one.

## Versioned datasets

When you set the `full_refresh` argument to `True` in the `dlt.pipeline` call, dlt creates a versioned dataset.
This means that each time you run the pipeline, the data is loaded into a new dataset (a new database schema).
The dataset name is the same as the `dataset_name` you provided in the pipeline definition with a
datetime-based suffix.

We modify our pipeline to use the `full_refresh` option to see how this works:

```py
import dlt

data = [
    {'id': 1, 'name': 'Alice'},
    {'id': 2, 'name': 'Bob'}
]

pipeline = dlt.pipeline(
    pipeline_name='quick_start',
    destination='duckdb',
    dataset_name='mydata',
    full_refresh=True  # <-- add this line
)
load_info = pipeline.run(data, table_name="users")
```

Every time you run this pipeline, a new schema will be created in the destination database with a
datetime-based suffix. The data will be loaded into tables in this schema.
For example, the first time you run the pipeline, the schema will be named
`mydata_20230912064403`, the second time it will be named `mydata_20230912064407`, and so on.
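
A sketch of how you could list these versioned schemas directly in DuckDB (assuming the default `quick_start.duckdb` file):

```py
import duckdb

conn = duckdb.connect("quick_start.duckdb")

# Each full_refresh run adds a schema such as mydata_20230912064403.
print(conn.sql(
    "SELECT schema_name FROM information_schema.schemata WHERE schema_name LIKE 'mydata%' ORDER BY 1"
))
```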