create data contracts page
sh-rp committed Sep 26, 2023
1 parent 9ba2496 commit 038d03a
Showing 3 changed files with 78 additions and 1 deletion.
77 changes: 77 additions & 0 deletions docs/website/docs/general-usage/data-contracts.md
@@ -0,0 +1,77 @@
---
title: Data Contracts
description: Data contracts and controlling schema evolution
keywords: [data contracts, schema, dlt schema, pydantic]
---

## Data contracts and controlling schema evolution

`dlt` will evolve the schema of the destination to accommodate the structure and data types of the extracted data. There are several settings
that you can use to control this automatic schema evolution, ranging from the default, where all changes to the schema are accepted, to
a frozen schema that does not change at all.

Consider this example:

```py
@dlt.resource(schema_contract_settings={"table": "evolve", "columns": "freeze"})
def items():
...
```

This resource will allow new subtables to be created, but will raise an exception if the extracted data contains a new column
for an existing table.

The `schema_contract_settings` argument exists on the `source` decorator as a directive for all resources of that source, and on the
`resource` decorator as a directive for the individual resource. It is also accepted by the `pipeline.run()` method, where it overrides all other settings.
`schema_contract_settings` is a dictionary whose keys control the following:

* `table` controls the creation of new tables and subtables
* `columns` controls the creation of new columns on an existing table
* `data_type` controls the creation of new variant columns, which happens when a data type is discovered in the extracted data that differs from the one in the schema (see the sketch below)
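
To make these cases concrete, here is a minimal, hypothetical sketch of rows that would trigger each kind of schema change (the resource shape and field names are illustrative, not from the `dlt` docs):

```py
import dlt

@dlt.resource
def items():
    # baseline row: establishes the `items` table with `id` and `price` columns
    yield {"id": 1, "price": 100}
    # `children` is a nested list -> would create a new subtable (governed by `table`)
    yield {"id": 2, "price": 200, "children": [{"name": "a"}]}
    # `color` is an unseen key -> would create a new column (governed by `columns`)
    yield {"id": 3, "price": 300, "color": "red"}
    # `price` arrives as a string -> would create a variant column, e.g. `price__v_text`
    # (governed by `data_type`)
    yield {"id": 4, "price": "400 EUR"}
```

With the default settings, each of these rows simply evolves the schema; the contract settings decide what happens instead.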

Each property can be set to one of three values:
* `freeze`: raises an exception if data is encountered that does not fit the existing schema, so no data will be loaded to the destination
* `discard_row`: discards any extracted row that does not adhere to the existing schema; such rows are not loaded to the destination, while all conforming rows are
* `discard_value`: discards the values in an extracted row that do not adhere to the existing schema; the row is loaded without this data

### Code Examples

The code below silently discards rows that would create new subtables, allows new columns to be added to existing tables and raises an error if a variant of a column is discovered.

```py
@dlt.resource(schema_contract_settings={"table": "discard_row", "columns": "evolve", "data_type": "freeze"})
def items():
...
```

The code below will raise on any encountered schema change. Note: instead of a dictionary, you can always pass a single string, which is interpreted as though all keys were set to that value.

```py
pipeline.run(my_source(), schema_contract_settings="freeze")
```
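
If you want to handle such a failure programmatically, a sketch along these lines could work. One assumption here: the violation surfaces as a `PipelineStepFailed` from `dlt.pipeline.exceptions`; the exact exception type may differ between `dlt` versions.

```py
from dlt.pipeline.exceptions import PipelineStepFailed

try:
    pipeline.run(my_source(), schema_contract_settings="freeze")
except PipelineStepFailed as ex:
    # the run aborts before loading, so the destination is left untouched
    print(f"schema contract violated in step {ex.step}: {ex}")
```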

The code below defines settings on the source which can be overridden on the resource, which in turn can be overridden by the global setting on the `run` method.
Here, variant columns are frozen for all resources and raise an error if encountered. On `items` new columns are allowed, while `other_items` inherits the `freeze` setting from
the source, so new columns are frozen there. New tables are allowed.

```py
@dlt.resource(schema_contract_settings={"columns": "evolve"})
def items():
...

@dlt.resource()
def other_items():
...

@dlt.source(schema_contract_settings={"columns": "freeze", "data_type": "freeze"})
def source():
return [items(), other_items()]


# this will use the settings defined by the decorators
pipeline.run(source())

# this will freeze the whole schema, regardless of the decorator settings
pipeline.run(source(), schema_contract_settings="freeze")

```
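
The only mode not demonstrated above is `discard_value`. A minimal sketch (the resource and field names are hypothetical):

```py
@dlt.resource(schema_contract_settings={"columns": "discard_value"})
def items():
    # if `color` is not part of the existing schema, only that value is dropped;
    # the rest of the row is still loaded
    yield {"id": 1, "color": "red"}
```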
1 change: 0 additions & 1 deletion docs/website/docs/reference/performance.md
@@ -309,7 +309,6 @@ def read_table(limit):
now = pendulum.now().isoformat()
yield [{"row": _id, "description": f"this is row with id {_id}", "timestamp": now} for _id in item_slice]


# this prevents the process pool from running the initialization code again
if __name__ == "__main__" or "PYTEST_CURRENT_TEST" in os.environ:
pipeline = dlt.pipeline("parallel_load", destination="duckdb", full_refresh=True)
1 change: 1 addition & 0 deletions docs/website/sidebars.js
@@ -190,6 +190,7 @@ const sidebars = {
'general-usage/full-loading',
'general-usage/credentials',
'general-usage/schema',
'general-usage/data-contracts',
'general-usage/configuration',
'general-usage/glossary',
{
