Docs/fix walkthroughs (#456)
AstrakhantsevaAA authored Jun 28, 2023
1 parent 96633e2 commit 7e902d4
Showing 9 changed files with 493 additions and 281 deletions.
99 changes: 65 additions & 34 deletions docs/website/docs/walkthroughs/add-a-verified-source.md
@@ -6,106 +6,137 @@ keywords: [how to, add a verified source]

# Add a verified source

Follow the steps below to create a [pipeline](../general-usage/glossary.md#pipeline) from a
[verified source](../general-usage/glossary.md#verified-source) contributed by `dlt` users.

Please make sure you have [installed `dlt`](../reference/installation.mdx) before following the
steps below.

## 1. Initialize project

Create a new empty directory for your `dlt` project by running:

```shell
mkdir various_pipelines
cd various_pipelines
```

List available verified sources to see their names and descriptions:

```bash
dlt init --list-verified-sources
```

Now pick one of the source names, for example `pipedrive`, and a destination, e.g. `bigquery`:

```bash
dlt init pipedrive bigquery
```

The command will create your pipeline project by copying over the `pipedrive` folder and creating a
`.dlt` folder:

```
├── .dlt
│   ├── config.toml
│   └── secrets.toml
├── pipedrive
│   ├── helpers
│   ├── __init__.py
│   ├── settings.py
│   └── typing.py
├── .gitignore
├── pipedrive_pipeline.py
└── requirements.txt
```

After running the command, read the command output for the instructions on how to install the
dependencies:

```
Verified source pipedrive was added to your project!
* See the usage examples and code snippets to copy from pipedrive_pipeline.py
* Add credentials for bigquery and other secrets in .dlt/secrets.toml
* requirements.txt was created. Install it with:
pip3 install -r requirements.txt
* Add the required dependencies to pyproject.toml:
dlt[bigquery]>=0.3.1
If the dlt dependency is already added, make sure you install the extra for bigquery to it
If you are using poetry you may issue the following command:
poetry add dlt -E bigquery
* Read https://dlthub.com/docs/walkthroughs/create-a-pipeline for more information
```
So make sure you install the requirements with `pip3 install -r requirements.txt`. When deploying to
an online orchestrator, you can install the requirements from `requirements.txt` in whatever way the
orchestrator supports.

Finally, fill `secrets.toml` with your credentials (or place your credentials in one of the
supported locations) and run the pipeline.

## 2. Adding credentials

For adding credentials locally or on your orchestrator, please see the
[credentials guide](../general-usage/credentials.md).
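
For example, if you picked `bigquery` as the destination, the service account details would go into
`.dlt/secrets.toml`. The exact keys depend on the chosen destination, so treat the snippet below as a
rough sketch with placeholder values and check the credentials guide for the authoritative layout:

```toml
# placeholder values - replace with your own service account details
[destination.bigquery.credentials]
project_id = "my-gcp-project"
private_key = "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n"
client_email = "loader@my-gcp-project.iam.gserviceaccount.com"
```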

## 3. Customize or write a pipeline script

Once you have initialized the pipeline, you will have a sample file, `pipedrive_pipeline.py`.

This is the developer's suggested way to use the pipeline, so you can use it as a starting point. In
our case, we can choose to run a method that loads all data, or we can choose which endpoints should
be loaded.

You can also use this file as a suggestion and write your own instead.
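
As a sketch of what such a customization might look like, the snippet below assumes that the copied
`pipedrive` package exposes a `pipedrive_source()` function (as the generated sample script suggests)
and that `deals` and `persons` are valid resource names; check `pipedrive_pipeline.py` for the exact
names before relying on them:

```python
import dlt

# assumption: the `pipedrive` folder created by `dlt init` exposes this source function
from pipedrive import pipedrive_source

pipeline = dlt.pipeline(
    pipeline_name="pipedrive",
    destination="bigquery",
    dataset_name="pipedrive_data",
)

# load only the selected endpoints instead of everything the source offers
load_info = pipeline.run(pipedrive_source().with_resources("deals", "persons"))
print(load_info)
```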

## 4. Hack a verified source

You can modify an existing verified source in place.

- If that modification is generally useful for anyone using this source, consider contributing it
back via a PR. This way, we can ensure it is tested and maintained.
- If that modification is not a generally shared case, then you are responsible for maintaining it.
We suggest making any of your own customisations modular if possible, so you can keep pulling the
updated source from the community repo when the source is updated.

## 5. Add more sources to your project

```bash
dlt init chess duckdb
```

To add another verified source, just run the `dlt init` command at the same location as the first
pipeline:

- The shared files will be updated (secrets, config).
- A new folder will be created for the new source (see the layout sketch after this list).
- Do not forget to install the requirements for the second source!
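
After the second `init`, the project layout might look roughly like this (a sketch; the exact files
inside the `chess` folder depend on that source):

```
├── .dlt
│   ├── config.toml
│   └── secrets.toml
├── chess
├── pipedrive
├── .gitignore
├── chess_pipeline.py
├── pipedrive_pipeline.py
└── requirements.txt
```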

## 6. Update the verified source with the newest version

To update the verified source you have to the newest online version, just run the same `dlt init`
command in the parent folder:

```bash
dlt init pipedrive bigquery
```

## 7. Advanced: Using dlt init with branches, local folders or git repos

To find out more about this command, use `--help`:

```bash
dlt init --help
```


To deploy from a branch of the `verified-sources` repo, you can use the following:

```bash
dlt init source destination --branch <branch_name>
```

To deploy from another repo, you could fork the `verified-sources` repo and then provide the new
repo URL as below, replacing `dlt-hub` with your fork name:

```bash
dlt init pipedrive bigquery --location "https://github.com/dlt-hub/verified-sources"
```
156 changes: 93 additions & 63 deletions docs/website/docs/walkthroughs/adjust-a-schema.md
@@ -6,27 +6,37 @@ keywords: [how to, adjust a schema]

# Adjust a schema

When you [create](create-a-pipeline.md) and then [run](run-a-pipeline.md) a pipeline, you may want
to manually inspect and change the [schema](../general-usage/schema.md) that `dlt` generated for
you. Here's how you do it.

## 1. Export your schemas on each run

Set up an export folder by providing the `export_schema_path` argument to `dlt.pipeline` to save the
schema. Set up an import folder, from which `dlt` will read your modifications, by providing the
`import_schema_path` argument.

Following our example in [run a pipeline](run-a-pipeline.md):

```python
dlt.pipeline(
    import_schema_path="schemas/import",
    export_schema_path="schemas/export",
    pipeline_name="chess_pipeline",
    destination='duckdb',
    dataset_name="games_data"
)
```

The following folder structure will be created in the project root folder:

```
schemas
|---import/
|---export/
```

Instead of modifying the code, you can put those settings in `config.toml`:

```toml
export_schema_path="schemas/export"
import_schema_path="schemas/import"
```

## 2. Run the pipeline to see the schemas

To see the schemas, you must run your pipeline again. The `schemas` and `import`/`export`
directories will be created. In each directory, you'll see a `yaml` file named `chess.schema.yaml`.

Look at the export schema (in the export folder): this is the schema that got inferred from the data
and was used to load it into the destination (i.e. `duckdb`).

## 3. Make changes in import schema

Now look at the import schema (in the import folder): it contains only the tables, columns, and
hints that were explicitly declared in the `chess` source. You'll use this schema to make
modifications, typically by pasting relevant snippets from your export schema and modifying them.
You should keep the import schema as simple as possible and let `dlt` do the rest.

> 💡 How importing a schema works:
>
> 1. When a new pipeline is created and the source function is extracted for the first time, a new
> schema is added to the pipeline. This schema is created out of global hints and resource hints
> present in the source extractor function.
> 1. Every such new schema will be saved to the `import` folder (if it does not exist there already)
> and used as the initial version for all future pipeline runs.
> 1. Once a schema is present in the `import` folder, **it is writable by the user only**.
> 1. Any changes to the schemas in that folder are detected and propagated to the pipeline
> automatically on the next run. It means that after a user update, the schema in the `import`
> folder reverts all the automatic updates from the data.

In the next steps we'll experiment a lot, so you will want to **set `full_refresh=True` in
`dlt.pipeline` until we are done experimenting**.
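
A minimal sketch of that setup, reusing the pipeline from step 1 with the `full_refresh` flag added:

```python
import dlt

# full_refresh=True loads into a fresh dataset on each run, so schema experiments start clean
pipeline = dlt.pipeline(
    import_schema_path="schemas/import",
    export_schema_path="schemas/export",
    pipeline_name="chess_pipeline",
    destination="duckdb",
    dataset_name="games_data",
    full_refresh=True,
)
```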

### Change the data type

In the export schema, we see that the `end_time` column in `players_games` has a `text` data type,
while we know that it is a timestamp. Let's change it and see if it works.

Copy the column:

```yaml
end_time:
  nullable: true
  data_type: text
```
from the export schema to the import schema and change the data type to get:
```yaml
players_games:
  columns:
    end_time:
      nullable: true
      data_type: timestamp
```
Run the pipeline script again and make sure that the change is visible in the export schema. Then,
[launch the Streamlit app](../dlt-ecosystem/visualizations/exploring-the-data.md) to see the changed
data.
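
If you followed the chess example, the app can usually be launched with the `dlt pipeline ... show`
command (this assumes the `streamlit` package is installed in your environment):

```bash
dlt pipeline chess_pipeline show
```
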
### Load data as json instead of generating child table or columns from flattened dicts
In the export schema, you can see that the white and black player properties got flattened into:
```yaml
white__rating:
  nullable: true
  data_type: bigint
white__result:
  nullable: true
  data_type: text
white__aid:
  nullable: true
  data_type: text
```
Say that for some reason you'd rather deal with a single JSON (or struct) column. Just declare the
`white` column as `complex`, which will instruct `dlt` not to flatten it (or not to convert it into a
child table in the case of a list). Do the same with the `black` column:

```yaml
players_games:
  columns:
    end_time:
      nullable: true
      data_type: timestamp
    white:
      nullable: false
      data_type: complex
    black:
      nullable: false
      data_type: complex
```

Run the pipeline script again, and now you can query the `black` and `white` columns with JSON
expressions.
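
As a rough sketch of such a query, assuming the default local database file created by the chess
pipeline (`chess_pipeline.duckdb`) and the dataset and table names used in this walkthrough
(depending on your DuckDB version, you may need to load the `json` extension first):

```python
import duckdb

# connect to the duckdb file the pipeline wrote (default name: <pipeline_name>.duckdb)
conn = duckdb.connect("chess_pipeline.duckdb")

# pull a single field out of the JSON stored in the `white` column
rows = conn.execute(
    """
    SELECT json_extract_string(white, '$.rating') AS white_rating
    FROM games_data.players_games
    LIMIT 5
    """
).fetchall()
print(rows)
```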

### Add performance hints

Let's say you are done with local experimentation and want to load your data to `BigQuery` instead
of `duckdb`. You'd like to partition your data to save on query costs. The `end_time` column we just
fixed looks like a good candidate.

```yaml
players_games:
  columns:
    end_time:
      nullable: false
      data_type: timestamp
      partition: true
    white:
      nullable: false
      data_type: complex
    black:
      nullable: false
      data_type: complex
```

## 4. Keep your import schema

Just add and push the import folder to git. It will be used automatically when cloned. Alternatively,
[bundle the schema with your source](../general-usage/schema.md#attaching-schemas-to-sources).