Docs/fix walkthroughs (#456)
AstrakhantsevaAA authored Jun 28, 2023
1 parent 96633e2 commit 7e902d4
Showing 9 changed files with 493 additions and 281 deletions.
99 changes: 65 additions & 34 deletions docs/website/docs/walkthroughs/add-a-verified-source.md
@@ -6,106 +6,137 @@ keywords: [how to, add a verified source]

# Add a verified source

Follow the steps below to create a [pipeline](../general-usage/glossary.md#pipeline) from a
[verified source](../general-usage/glossary.md#verified-source) contributed by `dlt` users.

Please make sure you have [installed `dlt`](../reference/installation.mdx) before following the
steps below.

## 1. Initialize project

Create a new empty directory for your `dlt` project by running:

```shell
mkdir various_pipelines
cd various_pipelines
```

List available verified sources to see their names and descriptions:

```bash
dlt init --list-verified-sources
```

Now pick one of the source names, for example `pipedrive`, and a destination, e.g. `bigquery`:

```bash
dlt init pipedrive bigquery
```

The command will create your pipeline project by copying over the `pipedrive` folder and creating a
`.dlt` folder:

```
├── .dlt
│   ├── config.toml
│   └── secrets.toml
├── pipedrive
│   ├── helpers
│   ├── __init__.py
│   ├── settings.py
│   └── typing.py
├── .gitignore
├── pipedrive_pipeline.py
└── requirements.txt
```

After running the command, read the command output for the instructions on how to install the
dependencies:

```
Verified source pipedrive was added to your project!
* See the usage examples and code snippets to copy from pipedrive_pipeline.py
* Add credentials for bigquery and other secrets in .dlt/secrets.toml
* requirements.txt was created. Install it with:
pip3 install -r requirements.txt
* Add the required dependencies to pyproject.toml:
dlt[bigquery]>=0.3.1
If the dlt dependency is already added, make sure you install the extra for bigquery to it
If you are using poetry you may issue the following command:
poetry add dlt -E bigquery
* Read https://dlthub.com/docs/walkthroughs/create-a-pipeline for more information
```
So make sure you install the requirements with `pip3 install -r requirements.txt`. When deploying to
an online orchestrator, you can install the requirements from `requirements.txt` in whatever way the
orchestrator supports.

Finally, fill `secrets.toml` with your credentials (or place your credentials in one of the
supported locations) and run the pipeline.

## 2. Adding credentials

For adding credentials locally or on your orchestrator, please see the
[credentials guide](../general-usage/credentials.md).
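
For example, if you picked `bigquery` as the destination, the service account details would go into
`.dlt/secrets.toml`. The exact keys depend on the chosen destination, so treat the snippet below as a
rough sketch with placeholder values and check the credentials guide for the authoritative layout:

```toml
# placeholder values - replace with your own service account details
[destination.bigquery.credentials]
project_id = "my-gcp-project"
private_key = "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n"
client_email = "loader@my-gcp-project.iam.gserviceaccount.com"
```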

## 3. Customize or write a pipeline script

Once you have initialized the pipeline, you will have a sample file, `pipedrive_pipeline.py`.

This is the developer's suggested way to use the pipeline, so you can use it as a starting point. In
our case, we can choose to run a method that loads all data, or we can choose which endpoints should
be loaded.

You can also use this file as a suggestion and write your own instead.
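
As a sketch of what such a customization might look like, the snippet below assumes that the copied
`pipedrive` package exposes a `pipedrive_source()` function (as the generated sample script suggests)
and that `deals` and `persons` are valid resource names; check `pipedrive_pipeline.py` for the exact
names before relying on them:

```python
import dlt

# assumption: the `pipedrive` folder created by `dlt init` exposes this source function
from pipedrive import pipedrive_source

pipeline = dlt.pipeline(
    pipeline_name="pipedrive",
    destination="bigquery",
    dataset_name="pipedrive_data",
)

# load only the selected endpoints instead of everything the source offers
load_info = pipeline.run(pipedrive_source().with_resources("deals", "persons"))
print(load_info)
```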

## 4. Hack a verified source

You can modify an existing verified source in place.

- If that modification is generally useful for anyone using this source, consider contributing it
back via a PR. This way, we can ensure it is tested and maintained.
- If that modification is not a generally shared case, then you are responsible for maintaining it.
We suggest making any of your own customisations modular if possible, so you can keep pulling the
updated source from the community repo when the source is updated.

## 5. Add more sources to your project

```bash
dlt init chess duckdb
```

To add another verified source, just run the `dlt init` command at the same location as the first
pipeline:

- The shared files will be updated (secrets, config).
- A new folder will be created for the new source (see the layout sketch after this list).
- Do not forget to install the requirements for the second source!
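
After the second `init`, the project layout might look roughly like this (a sketch; the exact files
inside the `chess` folder depend on that source):

```
├── .dlt
│   ├── config.toml
│   └── secrets.toml
├── chess
├── pipedrive
├── .gitignore
├── chess_pipeline.py
├── pipedrive_pipeline.py
└── requirements.txt
```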

## 6. Update the verified source with the newest version

To update the verified source you have to the newest online version, just run the same `dlt init`
command in the parent folder:

```bash
dlt init pipedrive bigquery
```

## 7. Advanced: Using dlt init with branches, local folders or git repos

To find out more about this command, use `--help`:

```bash
dlt init --help
```


To deploy from a branch of the `verified-sources` repo, you can use the following:

```bash
dlt init source destination --branch <branch_name>
```

To deploy from another repo, you could fork the `verified-sources` repo and then provide the new
repo URL as below, replacing `dlt-hub` with your fork name:

```bash
dlt init pipedrive bigquery --location "https://github.com/dlt-hub/verified-sources"
```
156 changes: 93 additions & 63 deletions docs/website/docs/walkthroughs/adjust-a-schema.md
@@ -6,27 +6,37 @@ keywords: [how to, adjust a schema]

# Adjust a schema

When you [create](create-a-pipeline.md) and then [run](run-a-pipeline.md) a pipeline, you may want
to manually inspect and change the [schema](../general-usage/schema.md) that `dlt` generated for
you. Here's how you do it.

## 1. Export your schemas on each run

Set up an export folder by providing the `export_schema_path` argument to `dlt.pipeline` to save the
schema. Set up an import folder, from which `dlt` will read your modifications, by providing the
`import_schema_path` argument.

Following our example in [run a pipeline](run-a-pipeline.md):

```python
dlt.pipeline(
    import_schema_path="schemas/import",
    export_schema_path="schemas/export",
    pipeline_name="chess_pipeline",
    destination='duckdb',
    dataset_name="games_data"
)
```

The following folder structure will be created in the project root folder:

```
schemas
|---import/
|---export/
```

Instead of modifying the code, you can put those settings in `config.toml`:

```toml
export_schema_path="schemas/export"
import_schema_path="schemas/import"
```

## 2. Run the pipeline to see the schemas

To see the schemas, you must run your pipeline again. The `schemas` and `import`/`export`
directories will be created. In each directory, you'll see a `yaml` file named `chess.schema.yaml`.

Look at the export schema (in the export folder): this is the schema that got inferred from the data
and was used to load it into the destination (i.e. `duckdb`).

## 3. Make changes in import schema

Now look at the import schema (in the import folder): it contains only the tables, columns, and
hints that were explicitly declared in the `chess` source. You'll use this schema to make
modifications, typically by pasting relevant snippets from your export schema and modifying them.
You should keep the import schema as simple as possible and let `dlt` do the rest.

> 💡 How importing a schema works:
>
> 1. When a new pipeline is created and the source function is extracted for the first time, a new
> schema is added to the pipeline. This schema is created out of global hints and resource hints
> present in the source extractor function.
> 1. Every such new schema will be saved to the `import` folder (if it does not exist there already)
> and used as the initial version for all future pipeline runs.
> 1. Once a schema is present in the `import` folder, **it is writable by the user only**.
> 1. Any changes to the schemas in that folder are detected and propagated to the pipeline
> automatically on the next run. It means that after a user update, the schema in the `import`
> folder reverts all the automatic updates from the data.

In the next steps we'll experiment a lot, so you will want to **set `full_refresh=True` in
`dlt.pipeline` until we are done experimenting**.
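
A minimal sketch of that setup, reusing the pipeline from step 1 with the `full_refresh` flag added:

```python
import dlt

# full_refresh=True loads into a fresh dataset on each run, so schema experiments start clean
pipeline = dlt.pipeline(
    import_schema_path="schemas/import",
    export_schema_path="schemas/export",
    pipeline_name="chess_pipeline",
    destination="duckdb",
    dataset_name="games_data",
    full_refresh=True,
)
```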

### Change the data type

In the export schema, we see that the `end_time` column in `players_games` has a `text` data type,
while we know that it is a timestamp. Let's change it and see if it works.

Copy the column:

```yaml
end_time:
  nullable: true
  data_type: text
```
from the export schema to the import schema and change the data type to get:
```yaml
players_games:
  columns:
    end_time:
      nullable: true
      data_type: timestamp
```
Run the pipeline script again and make sure that the change is visible in the export schema. Then,
[launch the Streamlit app](../dlt-ecosystem/visualizations/exploring-the-data.md) to see the changed
data.
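
If you followed the chess example, the app can usually be launched with the `dlt pipeline ... show`
command (this assumes the `streamlit` package is installed in your environment):

```bash
dlt pipeline chess_pipeline show
```
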
### Load data as json instead of generating child table or columns from flattened dicts
In the export schema, you can see that the white and black player properties got flattened into:
```yaml
white__rating:
  nullable: true
  data_type: bigint
white__result:
  nullable: true
  data_type: text
white__aid:
  nullable: true
  data_type: text
```
Say that for some reason you'd rather deal with a single JSON (or struct) column. Just declare the
`white` column as `complex`, which will instruct `dlt` not to flatten it (or not to convert it into a
child table in the case of a list). Do the same with the `black` column:

```yaml
players_games:
  columns:
    end_time:
      nullable: true
      data_type: timestamp
    white:
      nullable: false
      data_type: complex
    black:
      nullable: false
      data_type: complex
```

Run the pipeline script again, and now you can query the `black` and `white` columns with JSON
expressions.
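
As a rough sketch of such a query, assuming the default local database file created by the chess
pipeline (`chess_pipeline.duckdb`) and the dataset and table names used in this walkthrough
(depending on your DuckDB version, you may need to load the `json` extension first):

```python
import duckdb

# connect to the duckdb file the pipeline wrote (default name: <pipeline_name>.duckdb)
conn = duckdb.connect("chess_pipeline.duckdb")

# pull a single field out of the JSON stored in the `white` column
rows = conn.execute(
    """
    SELECT json_extract_string(white, '$.rating') AS white_rating
    FROM games_data.players_games
    LIMIT 5
    """
).fetchall()
print(rows)
```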

### Add performance hints

Let's say you are done with local experimentation and want to load your data to `BigQuery` instead
of `duckdb`. You'd like to partition your data to save on query costs. The `end_time` column we just
fixed looks like a good candidate.

```yaml
players_games:
  columns:
    end_time:
      nullable: false
      data_type: timestamp
      partition: true
    white:
      nullable: false
      data_type: complex
    black:
      nullable: false
      data_type: complex
```

## 4. Keep your import schema

Just add and push the import folder to git. It will be used automatically when cloned. Alternatively,
[bundle the schema with your source](../general-usage/schema.md#attaching-schemas-to-sources).