From af215b0431c3b4b6214be485e5b88a22fb0f5a0c Mon Sep 17 00:00:00 2001 From: Sultan Iman Date: Mon, 13 May 2024 12:54:17 +0200 Subject: [PATCH 01/25] Replace weather api example with github in create a pipeline tutorial --- .../docs/walkthroughs/create-a-pipeline.md | 97 +++++++++++-------- 1 file changed, 59 insertions(+), 38 deletions(-) diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md index 1d5974efbe..fe8cdb4cdd 100644 --- a/docs/website/docs/walkthroughs/create-a-pipeline.md +++ b/docs/website/docs/walkthroughs/create-a-pipeline.md @@ -6,26 +6,37 @@ keywords: [how to, create a pipeline] # Create a pipeline -Follow the steps below to create a [pipeline](../general-usage/glossary.md#pipeline) from the -WeatherAPI.com API to DuckDB from scratch. The same steps can be repeated for any source and +Follow the steps below to create a [pipeline](../general-usage/glossary.md#pipeline) using the +our rest API client to DuckDB from scratch. The same steps can be repeated for any source and destination of your choice—use `dlt init ` and then build the pipeline for that API instead. Please make sure you have [installed `dlt`](../reference/installation.md) before following the steps below. + +## Task +Let's suppose you have a github project and would like to download all issues to analyze them in +your local machine, thus you need to write some code which does the following things: + +1. Authenticates requests, +2. Fetches and paginates over the issues, +3. Saves the data somewhere. + +With this in mind let's continue. + ## 1. 
Initialize project Create a new empty directory for your `dlt` project by running: ```sh -mkdir weatherapi_duckdb && cd weatherapi_duckdb +mkdir githubapi_duckdb && cd githubapi_duckdb ``` Start a `dlt` project with a pipeline template that loads data to DuckDB by running: ```sh -dlt init weatherapi duckdb +dlt init githubapi duckdb ``` Install the dependencies necessary for DuckDB: @@ -34,98 +45,107 @@ Install the dependencies necessary for DuckDB: pip install -r requirements.txt ``` -## 2. Add WeatherAPI.com API credentials - -You will need to [sign up for the WeatherAPI.com API](https://www.weatherapi.com/signup.aspx). +## 2. Obtain and Add API credentials from GitHub -Once you do this, you should see your `API Key` at the top of your -[user page](https://www.weatherapi.com/my/). +You will need to [sign in](https://github.com/login) to your github account and create your access token via [Personal access tokens page](https://github.com/settings/tokens). -Copy the value of the API key into `.dlt/secrets.toml`: +Copy your new access token over to `.dlt/secrets.toml`: ```toml [sources] api_secret_key = '' ``` -The **secret name** corresponds to the **argument name** in the source function. Below `api_secret_key` [will get its value](../general-usage/credentials/configuration.md#general-usage-and-an-example) from `secrets.toml` when `weatherapi_source()` is called. + +The **secret name** corresponds to the **argument name** in the source function. +Below `api_secret_key` [will get its value](../general-usage/credentials/configuration.md#general-usage-and-an-example) from `secrets.toml` when `githubapi_source()` is called. + ```py @dlt.source -def weatherapi_source(api_secret_key=dlt.secrets.value): - ... 
+def githubapi_source(api_secret_key=dlt.secrets.value): + return repo_issues_resource(api_secret_key=api_secret_key) ``` -Run the `weatherapi.py` pipeline script to test that authentication headers look fine: +Run the `githubapi.py` pipeline script to test that authentication headers look fine: ```sh -python3 weatherapi.py +python3 githubapi.py ``` Your API key should be printed out to stdout along with some test data. -## 3. Request data from the WeatherAPI.com API +## 3. Request project issues from then GitHub API -Replace the definition of the `weatherapi_resource` function definition in the `weatherapi.py` -pipeline script with a call to the WeatherAPI.com API: +Replace the definition of the `githubapi_resource` function definition in the `githubapi.py` +pipeline script with a call to the GitHub API: + +>[!NOTE] +> We will use dlt as an example project https://github.com/dlt-hub/dlt, feel free to replace it with your own repository. ```py +from dlt.sources.helpers.rest_client import paginate +from dlt.sources.helpers.rest_client.auth import BearerTokenAuth +from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator + @dlt.resource(write_disposition="append") -def weatherapi_resource(api_secret_key=dlt.secrets.value): - url = "https://api.weatherapi.com/v1/current.json" - params = { - "q": "NYC", - "key": api_secret_key - } - response = requests.get(url, params=params) - response.raise_for_status() - yield response.json() +def repo_issues_resource(api_secret_key=dlt.secrets.value): + url = "https://api.github.com/repos/dlt-hub/dlt/issues" + + for page in paginate( + url, + auth=BearerTokenAuth(api_secret_key), + paginator=HeaderLinkPaginator(), + params={"state": "open"} + ): + print(page) + yield page ``` -Run the `weatherapi.py` pipeline script to test that the API call works: +Run the `githubapi.py` pipeline script to test that the API call works: ```sh -python3 weatherapi.py +python3 githubapi.py ``` This should print out the weather in New 
York City right now. ## 4. Load the data -Remove the `exit()` call from the `main` function in `weatherapi.py`, so that running the -`python3 weatherapi.py` command will now also run the pipeline: +Remove the `exit()` call from the `main` function in `githubapi.py`, so that running the +`python3 githubapi.py` command will now also run the pipeline: ```py if __name__=='__main__': # configure the pipeline with your destination details pipeline = dlt.pipeline( - pipeline_name='weatherapi', + pipeline_name='githubapi_issues', destination='duckdb', - dataset_name='weatherapi_data' + dataset_name='githubapi_issues_data' ) # print credentials by running the resource - data = list(weatherapi_resource()) + data = list(repo_issues_resource()) # print the data yielded from resource print(data) # run the pipeline with your parameters - load_info = pipeline.run(weatherapi_source()) + load_info = pipeline.run(githubapi_source()) # pretty print the information on data that was loaded print(load_info) ``` -Run the `weatherapi.py` pipeline script to load data into DuckDB: +Run the `githubapi.py` pipeline script to load data into DuckDB: ```sh -python3 weatherapi.py +python3 githubapi.py ``` Then this command to see that the data loaded: ```sh -dlt pipeline weatherapi show +dlt pipeline githubapi show ``` This will open a Streamlit app that gives you an overview of the data loaded. @@ -134,6 +154,7 @@ This will open a Streamlit app that gives you an overview of the data loaded. Now that you have a working pipeline, you have options for what to learn next: +- Learn more about our [rest client](https://dlthub.com/devel/general-usage/http/rest-client). - [Deploy this pipeline with GitHub Actions](deploy-a-pipeline/deploy-with-github-actions), so that the data is automatically loaded on a schedule. 
- Transform the [loaded data](../dlt-ecosystem/transformations) with dbt or in From f89bff473aba9f427cf4b27cd02e3fc2cb1ad840 Mon Sep 17 00:00:00 2001 From: Sultan Iman Date: Mon, 13 May 2024 13:04:20 +0200 Subject: [PATCH 02/25] Adjust pipeline name and dataset name --- docs/website/docs/walkthroughs/create-a-pipeline.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md index fe8cdb4cdd..9d965c377d 100644 --- a/docs/website/docs/walkthroughs/create-a-pipeline.md +++ b/docs/website/docs/walkthroughs/create-a-pipeline.md @@ -115,12 +115,11 @@ Remove the `exit()` call from the `main` function in `githubapi.py`, so that run ```py if __name__=='__main__': - # configure the pipeline with your destination details pipeline = dlt.pipeline( - pipeline_name='githubapi_issues', + pipeline_name='githubapi_repo_issues', destination='duckdb', - dataset_name='githubapi_issues_data' + dataset_name='repo_issues_data' ) # print credentials by running the resource From cc3a531202b1758a200a862ccbdf2ae2991bca4a Mon Sep 17 00:00:00 2001 From: Sultan Iman Date: Mon, 13 May 2024 13:12:20 +0200 Subject: [PATCH 03/25] Improve text --- .../docs/walkthroughs/create-a-pipeline.md | 44 ++++++++++++------- 1 file changed, 27 insertions(+), 17 deletions(-) diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md index 9d965c377d..b6102b44a1 100644 --- a/docs/website/docs/walkthroughs/create-a-pipeline.md +++ b/docs/website/docs/walkthroughs/create-a-pipeline.md @@ -1,29 +1,28 @@ --- title: Create a pipeline description: How to create a pipeline -keywords: [how to, create a pipeline] +keywords: [how to, create a pipeline, rest client] --- # Create a pipeline -Follow the steps below to create a [pipeline](../general-usage/glossary.md#pipeline) using the -our rest API client to DuckDB from scratch. 
The same steps can be repeated for any source and -destination of your choice—use `dlt init ` and then build the pipeline for -that API instead. +This guide walks you through creating a pipeline that utilizes our REST API client to connect to DuckDB. +Although this example uses DuckDB, you can adapt the steps to any source and destination by +using the command `dlt init ` and tweaking the pipeline accordingly. Please make sure you have [installed `dlt`](../reference/installation.md) before following the steps below. +## Task Overview -## Task -Let's suppose you have a github project and would like to download all issues to analyze them in -your local machine, thus you need to write some code which does the following things: +Imagine you want to analyze issues from a GitHub project locally. +To achieve this, you need to write code that accomplishes the following: -1. Authenticates requests, -2. Fetches and paginates over the issues, -3. Saves the data somewhere. +1. Build correct requests. +1. Authenticates your requests. +2. Fetches and handles paginated issue data. +3. Stores the data for analysis. -With this in mind let's continue. ## 1. Initialize project @@ -56,6 +55,8 @@ Copy your new access token over to `.dlt/secrets.toml`: api_secret_key = '' ``` +This token will be used by `githubapi_source()` to authenticate requests. + The **secret name** corresponds to the **argument name** in the source function. Below `api_secret_key` [will get its value](../general-usage/credentials/configuration.md#general-usage-and-an-example) from `secrets.toml` when `githubapi_source()` is called. @@ -75,12 +76,12 @@ Your API key should be printed out to stdout along with some test data. ## 3. 
Request project issues from then GitHub API -Replace the definition of the `githubapi_resource` function definition in the `githubapi.py` -pipeline script with a call to the GitHub API: >[!NOTE] > We will use dlt as an example project https://github.com/dlt-hub/dlt, feel free to replace it with your own repository. +Modify `repo_issues_resource` in `githubapi.py` to request issues data from your GitHub project's API: + ```py from dlt.sources.helpers.rest_client import paginate from dlt.sources.helpers.rest_client.auth import BearerTokenAuth @@ -106,7 +107,16 @@ Run the `githubapi.py` pipeline script to test that the API call works: python3 githubapi.py ``` -This should print out the weather in New York City right now. +This should print out json data containig the issues in the GitHub project. + +Then, confirm the data is loaded + +>[!NOTE] +> Make sure you have `streamlit` installed `pip install streamlit` + +```sh +dlt pipeline githubapi show +``` ## 4. Load the data @@ -135,7 +145,7 @@ if __name__=='__main__': print(load_info) ``` -Run the `githubapi.py` pipeline script to load data into DuckDB: +Load your GitHub issues into DuckDB: ```sh python3 githubapi.py @@ -151,7 +161,7 @@ This will open a Streamlit app that gives you an overview of the data loaded. ## 5. Next steps -Now that you have a working pipeline, you have options for what to learn next: +With a functioning pipeline, consider exploring: - Learn more about our [rest client](https://dlthub.com/devel/general-usage/http/rest-client). 
- [Deploy this pipeline with GitHub Actions](deploy-a-pipeline/deploy-with-github-actions), so that From 18ceb8f1af338e360d7045f3301218e5a0a6447c Mon Sep 17 00:00:00 2001 From: Sultan Iman Date: Mon, 13 May 2024 13:20:21 +0200 Subject: [PATCH 04/25] Align with default template --- docs/website/docs/walkthroughs/create-a-pipeline.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md index b6102b44a1..3b474182b6 100644 --- a/docs/website/docs/walkthroughs/create-a-pipeline.md +++ b/docs/website/docs/walkthroughs/create-a-pipeline.md @@ -63,7 +63,7 @@ Below `api_secret_key` [will get its value](../general-usage/credentials/configu ```py @dlt.source def githubapi_source(api_secret_key=dlt.secrets.value): - return repo_issues_resource(api_secret_key=api_secret_key) + return githubapi_my_repo_issues(api_secret_key=api_secret_key) ``` Run the `githubapi.py` pipeline script to test that authentication headers look fine: @@ -80,7 +80,7 @@ Your API key should be printed out to stdout along with some test data. >[!NOTE] > We will use dlt as an example project https://github.com/dlt-hub/dlt, feel free to replace it with your own repository. 
-Modify `repo_issues_resource` in `githubapi.py` to request issues data from your GitHub project's API: +Modify `githubapi_my_repo_issues` in `githubapi.py` to request issues data from your GitHub project's API: ```py from dlt.sources.helpers.rest_client import paginate @@ -88,7 +88,7 @@ from dlt.sources.helpers.rest_client.auth import BearerTokenAuth from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator @dlt.resource(write_disposition="append") -def repo_issues_resource(api_secret_key=dlt.secrets.value): +def githubapi_my_repo_issues(api_secret_key=dlt.secrets.value): url = "https://api.github.com/repos/dlt-hub/dlt/issues" for page in paginate( @@ -133,7 +133,7 @@ if __name__=='__main__': ) # print credentials by running the resource - data = list(repo_issues_resource()) + data = list(githubapi_my_repo_issues()) # print the data yielded from resource print(data) From 219d03b5702e9c1b79d950129e4f88ae0f61f0d6 Mon Sep 17 00:00:00 2001 From: Sultan Iman Date: Mon, 13 May 2024 14:42:26 +0200 Subject: [PATCH 05/25] Update numbered list --- docs/website/docs/walkthroughs/create-a-pipeline.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md index 3b474182b6..343e67c1f2 100644 --- a/docs/website/docs/walkthroughs/create-a-pipeline.md +++ b/docs/website/docs/walkthroughs/create-a-pipeline.md @@ -19,9 +19,9 @@ Imagine you want to analyze issues from a GitHub project locally. To achieve this, you need to write code that accomplishes the following: 1. Build correct requests. -1. Authenticates your requests. -2. Fetches and handles paginated issue data. -3. Stores the data for analysis. +2. Authenticates your requests. +3. Fetches and handles paginated issue data. +4. Stores the data for analysis. ## 1. 
Initialize project From 5727f6d2003c21d4b6410a46bf30b201fc014ffd Mon Sep 17 00:00:00 2001 From: Sultan Iman Date: Mon, 13 May 2024 14:42:46 +0200 Subject: [PATCH 06/25] Update section title --- docs/website/docs/walkthroughs/create-a-pipeline.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md index 343e67c1f2..df4e1bb265 100644 --- a/docs/website/docs/walkthroughs/create-a-pipeline.md +++ b/docs/website/docs/walkthroughs/create-a-pipeline.md @@ -13,7 +13,7 @@ using the command `dlt init ` and tweaking the pipeline ac Please make sure you have [installed `dlt`](../reference/installation.md) before following the steps below. -## Task Overview +## Task overview Imagine you want to analyze issues from a GitHub project locally. To achieve this, you need to write code that accomplishes the following: From 5b99689ec76a444bdfabe90138884b6eacf0e1e7 Mon Sep 17 00:00:00 2001 From: Sultan Iman Date: Mon, 13 May 2024 14:43:28 +0200 Subject: [PATCH 07/25] Reword bullet point --- docs/website/docs/walkthroughs/create-a-pipeline.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md index df4e1bb265..861b185b39 100644 --- a/docs/website/docs/walkthroughs/create-a-pipeline.md +++ b/docs/website/docs/walkthroughs/create-a-pipeline.md @@ -18,7 +18,7 @@ steps below. Imagine you want to analyze issues from a GitHub project locally. To achieve this, you need to write code that accomplishes the following: -1. Build correct requests. +1. Constructs a correct request. 2. Authenticates your requests. 3. Fetches and handles paginated issue data. 4. Stores the data for analysis. 
From 5704cb38b06a8ac42109bd016f3b39cef3d2fa18 Mon Sep 17 00:00:00 2001 From: Sultan Iman Date: Mon, 13 May 2024 14:46:17 +0200 Subject: [PATCH 08/25] Add benefits --- docs/website/docs/walkthroughs/create-a-pipeline.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md index 861b185b39..9ddf629930 100644 --- a/docs/website/docs/walkthroughs/create-a-pipeline.md +++ b/docs/website/docs/walkthroughs/create-a-pipeline.md @@ -23,6 +23,8 @@ To achieve this, you need to write code that accomplishes the following: 3. Fetches and handles paginated issue data. 4. Stores the data for analysis. +This sounds complicated and it is indeed complicated, but we offer you our [rest client](https://dlthub.com/devel/general-usage/http/rest-client) which let's you put more focus on your data. + ## 1. Initialize project From 635a1acb8cfff345ddfd41c9f4fd43dc07d48432 Mon Sep 17 00:00:00 2001 From: Sultan Iman Date: Mon, 13 May 2024 14:47:13 +0200 Subject: [PATCH 09/25] Adjust section title --- docs/website/docs/walkthroughs/create-a-pipeline.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md index 9ddf629930..4642bdbe4d 100644 --- a/docs/website/docs/walkthroughs/create-a-pipeline.md +++ b/docs/website/docs/walkthroughs/create-a-pipeline.md @@ -46,7 +46,7 @@ Install the dependencies necessary for DuckDB: pip install -r requirements.txt ``` -## 2. Obtain and Add API credentials from GitHub +## 2. Obtain and add API credentials from GitHub You will need to [sign in](https://github.com/login) to your github account and create your access token via [Personal access tokens page](https://github.com/settings/tokens). 
From f6dae7b4c5e3e72517c4e56c335117f884ebf47c Mon Sep 17 00:00:00 2001 From: Sultan Iman Date: Mon, 13 May 2024 15:46:01 +0200 Subject: [PATCH 10/25] Adjust resource name to match the one from default template --- docs/website/docs/walkthroughs/create-a-pipeline.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md index 4642bdbe4d..68a7341523 100644 --- a/docs/website/docs/walkthroughs/create-a-pipeline.md +++ b/docs/website/docs/walkthroughs/create-a-pipeline.md @@ -65,7 +65,7 @@ Below `api_secret_key` [will get its value](../general-usage/credentials/configu ```py @dlt.source def githubapi_source(api_secret_key=dlt.secrets.value): - return githubapi_my_repo_issues(api_secret_key=api_secret_key) + return githubapi_resource(api_secret_key=api_secret_key) ``` Run the `githubapi.py` pipeline script to test that authentication headers look fine: @@ -82,7 +82,7 @@ Your API key should be printed out to stdout along with some test data. >[!NOTE] > We will use dlt as an example project https://github.com/dlt-hub/dlt, feel free to replace it with your own repository. 
-Modify `githubapi_my_repo_issues` in `githubapi.py` to request issues data from your GitHub project's API: +Modify `githubapi_resource` in `githubapi.py` to request issues data from your GitHub project's API: ```py from dlt.sources.helpers.rest_client import paginate @@ -90,7 +90,7 @@ from dlt.sources.helpers.rest_client.auth import BearerTokenAuth from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator @dlt.resource(write_disposition="append") -def githubapi_my_repo_issues(api_secret_key=dlt.secrets.value): +def githubapi_resource(api_secret_key=dlt.secrets.value): url = "https://api.github.com/repos/dlt-hub/dlt/issues" for page in paginate( @@ -135,7 +135,7 @@ if __name__=='__main__': ) # print credentials by running the resource - data = list(githubapi_my_repo_issues()) + data = list(githubapi_resource()) # print the data yielded from resource print(data) From 7bce0f687d6baf0ae5b8e06826e440c2d12a9c00 Mon Sep 17 00:00:00 2001 From: Sultan Iman Date: Mon, 13 May 2024 15:50:41 +0200 Subject: [PATCH 11/25] Re-arrange instructions --- .../docs/walkthroughs/create-a-pipeline.md | 33 +++++++------------ 1 file changed, 11 insertions(+), 22 deletions(-) diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md index 68a7341523..feb5ad3cde 100644 --- a/docs/website/docs/walkthroughs/create-a-pipeline.md +++ b/docs/website/docs/walkthroughs/create-a-pipeline.md @@ -103,35 +103,18 @@ def githubapi_resource(api_secret_key=dlt.secrets.value): yield page ``` -Run the `githubapi.py` pipeline script to test that the API call works: - -```sh -python3 githubapi.py -``` - -This should print out json data containig the issues in the GitHub project. - -Then, confirm the data is loaded - ->[!NOTE] -> Make sure you have `streamlit` installed `pip install streamlit` - -```sh -dlt pipeline githubapi show -``` - ## 4. 
Load the data -Remove the `exit()` call from the `main` function in `githubapi.py`, so that running the +Uncomment the commented out code in `main` function in `githubapi.py`, so that running the `python3 githubapi.py` command will now also run the pipeline: ```py if __name__=='__main__': # configure the pipeline with your destination details pipeline = dlt.pipeline( - pipeline_name='githubapi_repo_issues', + pipeline_name='githubapi', destination='duckdb', - dataset_name='repo_issues_data' + dataset_name='githubapi_data' ) # print credentials by running the resource @@ -147,13 +130,19 @@ if __name__=='__main__': print(load_info) ``` -Load your GitHub issues into DuckDB: + +Run the `githubapi.py` pipeline script to test that the API call works: ```sh python3 githubapi.py ``` -Then this command to see that the data loaded: +This should print out json data containig the issues in the GitHub project. + +Then, confirm the data is loaded + +>[!NOTE] +> Make sure you have `streamlit` installed `pip install streamlit` ```sh dlt pipeline githubapi show From 11321175afb30a8fca9fefc7f3b7002858a257a6 Mon Sep 17 00:00:00 2001 From: Sultan Iman Date: Mon, 13 May 2024 15:52:35 +0200 Subject: [PATCH 12/25] Update links --- docs/website/docs/walkthroughs/create-a-pipeline.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md index feb5ad3cde..18c5f9c20a 100644 --- a/docs/website/docs/walkthroughs/create-a-pipeline.md +++ b/docs/website/docs/walkthroughs/create-a-pipeline.md @@ -154,7 +154,7 @@ This will open a Streamlit app that gives you an overview of the data loaded. With a functioning pipeline, consider exploring: -- Learn more about our [rest client](https://dlthub.com/devel/general-usage/http/rest-client). +- Our [rest client](https://dlthub.com/devel/general-usage/http/rest-client). 
- [Deploy this pipeline with GitHub Actions](deploy-a-pipeline/deploy-with-github-actions), so that the data is automatically loaded on a schedule. - Transform the [loaded data](../dlt-ecosystem/transformations) with dbt or in From 6265674a38d60eea40eeeef976f1e286b20898d9 Mon Sep 17 00:00:00 2001 From: Sultan Iman Date: Mon, 13 May 2024 15:53:43 +0200 Subject: [PATCH 13/25] Add type hints --- docs/website/docs/walkthroughs/create-a-pipeline.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md index 18c5f9c20a..eeb4e465e7 100644 --- a/docs/website/docs/walkthroughs/create-a-pipeline.md +++ b/docs/website/docs/walkthroughs/create-a-pipeline.md @@ -64,7 +64,7 @@ Below `api_secret_key` [will get its value](../general-usage/credentials/configu ```py @dlt.source -def githubapi_source(api_secret_key=dlt.secrets.value): +def githubapi_source(api_secret_key: str = dlt.secrets.value): return githubapi_resource(api_secret_key=api_secret_key) ``` @@ -90,7 +90,7 @@ from dlt.sources.helpers.rest_client.auth import BearerTokenAuth from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator @dlt.resource(write_disposition="append") -def githubapi_resource(api_secret_key=dlt.secrets.value): +def githubapi_resource(api_secret_key: str = dlt.secrets.value): url = "https://api.github.com/repos/dlt-hub/dlt/issues" for page in paginate( From 7b3f3dab2419388d2fb6a8936da9d3357b387e01 Mon Sep 17 00:00:00 2001 From: AstrakhantsevaAA Date: Tue, 14 May 2024 18:25:45 +0200 Subject: [PATCH 14/25] small changes --- .../docs/walkthroughs/create-a-pipeline.md | 83 ++++++++++--------- 1 file changed, 45 insertions(+), 38 deletions(-) diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md index eeb4e465e7..1acfc1fb68 100644 --- a/docs/website/docs/walkthroughs/create-a-pipeline.md +++ 
b/docs/website/docs/walkthroughs/create-a-pipeline.md @@ -6,11 +6,12 @@ keywords: [how to, create a pipeline, rest client] # Create a pipeline -This guide walks you through creating a pipeline that utilizes our REST API client to connect to DuckDB. +This guide walks you through creating a pipeline that utilizes our [REST API Client](../general-usage/http/rest-client) +to connect to [DuckDB](../dlt-ecosystem/destinations/duckdb). Although this example uses DuckDB, you can adapt the steps to any source and destination by -using the command `dlt init ` and tweaking the pipeline accordingly. +using the [command](../reference/command-line-interface#dlt-init) `dlt init ` and tweaking the pipeline accordingly. -Please make sure you have [installed `dlt`](../reference/installation.md) before following the +Please make sure you have [installed `dlt`](../reference/installation) before following the steps below. ## Task overview @@ -23,7 +24,9 @@ To achieve this, you need to write code that accomplishes the following: 3. Fetches and handles paginated issue data. 4. Stores the data for analysis. -This sounds complicated and it is indeed complicated, but we offer you our [rest client](https://dlthub.com/devel/general-usage/http/rest-client) which let's you put more focus on your data. +This sounds complicated, and it is indeed complicated, +but we offer you our [REST API Client,](../general-usage/http/rest-client) +which lets you put more focus on your data. ## 1. 
Initialize project @@ -31,13 +34,13 @@ This sounds complicated and it is indeed complicated, but we offer you our [rest Create a new empty directory for your `dlt` project by running: ```sh -mkdir githubapi_duckdb && cd githubapi_duckdb +mkdir github_api_duckdb && cd github_api_duckdb ``` Start a `dlt` project with a pipeline template that loads data to DuckDB by running: ```sh -dlt init githubapi duckdb +dlt init github_api duckdb ``` Install the dependencies necessary for DuckDB: @@ -48,7 +51,7 @@ pip install -r requirements.txt ## 2. Obtain and add API credentials from GitHub -You will need to [sign in](https://github.com/login) to your github account and create your access token via [Personal access tokens page](https://github.com/settings/tokens). +You will need to [sign in](https://github.com/login) to your GitHub account and create your access token via [Personal access tokens page](https://github.com/settings/tokens). Copy your new access token over to `.dlt/secrets.toml`: @@ -57,21 +60,22 @@ Copy your new access token over to `.dlt/secrets.toml`: api_secret_key = '' ``` -This token will be used by `githubapi_source()` to authenticate requests. +This token will be used by `github_api_source()` to authenticate requests. The **secret name** corresponds to the **argument name** in the source function. -Below `api_secret_key` [will get its value](../general-usage/credentials/configuration.md#general-usage-and-an-example) from `secrets.toml` when `githubapi_source()` is called. +Below `api_secret_key` [will get its value](../general-usage/credentials/configuration#allow-dlt-to-pass-the-config-and-secrets-automatically) +from `secrets.toml` when `github_api_source()` is called. 
```py @dlt.source -def githubapi_source(api_secret_key: str = dlt.secrets.value): - return githubapi_resource(api_secret_key=api_secret_key) +def github_api_source(api_secret_key: str = dlt.secrets.value): + return github_api_resource(api_secret_key=api_secret_key) ``` -Run the `githubapi.py` pipeline script to test that authentication headers look fine: +Run the `github_api.py` pipeline script to test that authentication headers look fine: ```sh -python3 githubapi.py +python github_api.py ``` Your API key should be printed out to stdout along with some test data. @@ -79,18 +83,19 @@ Your API key should be printed out to stdout along with some test data. ## 3. Request project issues from then GitHub API ->[!NOTE] -> We will use dlt as an example project https://github.com/dlt-hub/dlt, feel free to replace it with your own repository. +:::tip +We will use `dlt` as an example project https://github.com/dlt-hub/dlt, feel free to replace it with your own repository. +::: -Modify `githubapi_resource` in `githubapi.py` to request issues data from your GitHub project's API: +Modify `github_api_resource` in `github_api.py` to request issues data from your GitHub project's API: ```py from dlt.sources.helpers.rest_client import paginate from dlt.sources.helpers.rest_client.auth import BearerTokenAuth from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator -@dlt.resource(write_disposition="append") -def githubapi_resource(api_secret_key: str = dlt.secrets.value): +@dlt.resource(write_disposition="replace") +def github_api_resource(api_secret_key: str = dlt.secrets.value): url = "https://api.github.com/repos/dlt-hub/dlt/issues" for page in paginate( @@ -99,53 +104,55 @@ def githubapi_resource(api_secret_key: str = dlt.secrets.value): paginator=HeaderLinkPaginator(), params={"state": "open"} ): - print(page) yield page ``` ## 4. 
Load the data -Uncomment the commented out code in `main` function in `githubapi.py`, so that running the -`python3 githubapi.py` command will now also run the pipeline: +Uncomment the commented out code in `main` function in `github_api.py`, so that running the +`python github_api.py` command will now also run the pipeline: ```py if __name__=='__main__': # configure the pipeline with your destination details pipeline = dlt.pipeline( - pipeline_name='githubapi', + pipeline_name='github_api_pipeline', destination='duckdb', - dataset_name='githubapi_data' + dataset_name='github_api_data' ) # print credentials by running the resource - data = list(githubapi_resource()) + data = list(github_api_resource()) # print the data yielded from resource print(data) # run the pipeline with your parameters - load_info = pipeline.run(githubapi_source()) + load_info = pipeline.run(github_api_source()) # pretty print the information on data that was loaded print(load_info) ``` -Run the `githubapi.py` pipeline script to test that the API call works: +Run the `github_api.py` pipeline script to test that the API call works: ```sh -python3 githubapi.py +python github_api.py ``` This should print out json data containig the issues in the GitHub project. -Then, confirm the data is loaded +Then, confirm the data is loaded with printing `load_info` object. ->[!NOTE] -> Make sure you have `streamlit` installed `pip install streamlit` +Let's explore the loaded data with the [command](../reference/command-line-interface#show-tables-and-data-in-the-destination) `dlt pipeline show`. + +:::warning +Make sure you have `streamlit` installed `pip install streamlit` +::: ```sh -dlt pipeline githubapi show +dlt pipeline github_api_pipeline show ``` This will open a Streamlit app that gives you an overview of the data loaded. @@ -154,15 +161,15 @@ This will open a Streamlit app that gives you an overview of the data loaded. 
 With a functioning pipeline, consider exploring:

-- Our [rest client](https://dlthub.com/devel/general-usage/http/rest-client).
+- Our [REST Client](../general-usage/http/rest-client).
 - [Deploy this pipeline with GitHub Actions](deploy-a-pipeline/deploy-with-github-actions), so that the data is automatically loaded on a schedule.
 - Transform the [loaded data](../dlt-ecosystem/transformations) with dbt or in Pandas DataFrames.
-- Learn how to [run](../running-in-production/running.md),
-  [monitor](../running-in-production/monitoring.md), and
-  [alert](../running-in-production/alerting.md) when you put your pipeline in production.
+- Learn how to [run](../running-in-production/running),
+  [monitor](../running-in-production/monitoring), and
+  [alert](../running-in-production/alerting) when you put your pipeline in production.
 - Try loading data to a different destination like
-  [Google BigQuery](../dlt-ecosystem/destinations/bigquery.md),
-  [Amazon Redshift](../dlt-ecosystem/destinations/redshift.md), or
-  [Postgres](../dlt-ecosystem/destinations/postgres.md).
+  [Google BigQuery](../dlt-ecosystem/destinations/bigquery),
+  [Amazon Redshift](../dlt-ecosystem/destinations/redshift), or
+  [Postgres](../dlt-ecosystem/destinations/postgres).

From a5b4ac8955545bd2cc651d4e80f11e5ee2a962de Mon Sep 17 00:00:00 2001
From: AstrakhantsevaAA
Date: Tue, 14 May 2024 18:36:53 +0200
Subject: [PATCH 15/25] change warning to info

---
 docs/website/docs/walkthroughs/create-a-pipeline.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md
index 1acfc1fb68..bec21d7341 100644
--- a/docs/website/docs/walkthroughs/create-a-pipeline.md
+++ b/docs/website/docs/walkthroughs/create-a-pipeline.md
@@ -147,7 +147,7 @@ Then, confirm the data is loaded with printing `load_info` object.
 Let's explore the loaded data with the [command](../reference/command-line-interface#show-tables-and-data-in-the-destination) `dlt pipeline show`.

-:::warning
+:::info
 Make sure you have `streamlit` installed `pip install streamlit`
 :::

From 4296181a3b0784e0fa53fe6d703c79bf425c880b Mon Sep 17 00:00:00 2001
From: Anton Burnashev
Date: Wed, 15 May 2024 16:00:34 +0200
Subject: [PATCH 16/25] Using a simpler alternative

---
 docs/website/docs/walkthroughs/create-a-pipeline.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md
index bec21d7341..91e296ce20 100644
--- a/docs/website/docs/walkthroughs/create-a-pipeline.md
+++ b/docs/website/docs/walkthroughs/create-a-pipeline.md
@@ -6,7 +6,7 @@ keywords: [how to, create a pipeline, rest client]

 # Create a pipeline

-This guide walks you through creating a pipeline that utilizes our [REST API Client](../general-usage/http/rest-client)
+This guide walks you through creating a pipeline that uses our [REST API Client](../general-usage/http/rest-client)
 to connect to [DuckDB](../dlt-ecosystem/destinations/duckdb).
 Although this example uses DuckDB, you can adapt the steps to any source and destination by
 using the [command](../reference/command-line-interface#dlt-init) `dlt init ` and tweaking the pipeline accordingly.
From 64970def61a80055af1f034971461e24dd83e506 Mon Sep 17 00:00:00 2001
From: Sultan Iman <354868+sultaniman@users.noreply.github.com>
Date: Wed, 15 May 2024 16:46:58 +0200
Subject: [PATCH 17/25] Update docs/website/docs/walkthroughs/create-a-pipeline.md

Co-authored-by: Anton Burnashev
---
 docs/website/docs/walkthroughs/create-a-pipeline.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md
index 91e296ce20..e7611ce58d 100644
--- a/docs/website/docs/walkthroughs/create-a-pipeline.md
+++ b/docs/website/docs/walkthroughs/create-a-pipeline.md
@@ -143,7 +143,7 @@ python github_api.py

 This should print out json data containig the issues in the GitHub project.

-Then, confirm the data is loaded with printing `load_info` object.
+It also prints `load_info` object.

 Let's explore the loaded data with the [command](../reference/command-line-interface#show-tables-and-data-in-the-destination) `dlt pipeline show`.

From 0b07e0498f694624a825ada3505a95ecba624676 Mon Sep 17 00:00:00 2001
From: Sultan Iman <354868+sultaniman@users.noreply.github.com>
Date: Wed, 15 May 2024 16:47:05 +0200
Subject: [PATCH 18/25] Update docs/website/docs/walkthroughs/create-a-pipeline.md

Co-authored-by: Anton Burnashev
---
 docs/website/docs/walkthroughs/create-a-pipeline.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md
index e7611ce58d..b82c7adcba 100644
--- a/docs/website/docs/walkthroughs/create-a-pipeline.md
+++ b/docs/website/docs/walkthroughs/create-a-pipeline.md
@@ -141,7 +141,7 @@ Run the `github_api.py` pipeline script to test that the API call works:
 python github_api.py
 ```

-This should print out json data containig the issues in the GitHub project.
+This should print out JSON data containing the issues in the GitHub project.
 It also prints `load_info` object.

From c0d2ecc4783dd6cf8374c206f12ad39d4d616eee Mon Sep 17 00:00:00 2001
From: Sultan Iman <354868+sultaniman@users.noreply.github.com>
Date: Wed, 15 May 2024 16:47:27 +0200
Subject: [PATCH 19/25] Update docs/website/docs/walkthroughs/create-a-pipeline.md

Co-authored-by: Anton Burnashev
---
 docs/website/docs/walkthroughs/create-a-pipeline.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md
index b82c7adcba..2f5809ed2f 100644
--- a/docs/website/docs/walkthroughs/create-a-pipeline.md
+++ b/docs/website/docs/walkthroughs/create-a-pipeline.md
@@ -24,9 +24,7 @@ To achieve this, you need to write code that accomplishes the following:
 3. Fetches and handles paginated issue data.
 4. Stores the data for analysis.

-This sounds complicated, and it is indeed complicated,
-but we offer you our [REST API Client,](../general-usage/http/rest-client)
-which lets you put more focus on your data.
+This may sound complicated, but dlt provides a [REST API Client](../general-usage/http/rest-client) that allows you to focus more on your data rather than on managing API interactions.

 ## 1. Initialize project

From 4795d55a37ea4dba62032f46bbe6f6c2d71b12e0 Mon Sep 17 00:00:00 2001
From: Sultan Iman <354868+sultaniman@users.noreply.github.com>
Date: Wed, 15 May 2024 16:47:37 +0200
Subject: [PATCH 20/25] Update docs/website/docs/walkthroughs/create-a-pipeline.md

Co-authored-by: Anton Burnashev
---
 docs/website/docs/walkthroughs/create-a-pipeline.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md
index 2f5809ed2f..e8577b54ce 100644
--- a/docs/website/docs/walkthroughs/create-a-pipeline.md
+++ b/docs/website/docs/walkthroughs/create-a-pipeline.md
@@ -82,7 +82,7 @@ Your API key should be printed out to stdout along with some test data.


 :::tip
-We will use `dlt` as an example project https://github.com/dlt-hub/dlt, feel free to replace it with your own repository.
+We will use `dlt` repository as an example GitHub project https://github.com/dlt-hub/dlt, feel free to replace it with your own repository.
 :::

 Modify `github_api_resource` in `github_api.py` to request issues data from your GitHub project's API:

From cb43a107a34b07a7116ff8073f8ee7a257ee9599 Mon Sep 17 00:00:00 2001
From: Sultan Iman
Date: Wed, 15 May 2024 16:48:48 +0200
Subject: [PATCH 21/25] Add links to sources and destinations

---
 docs/website/docs/walkthroughs/create-a-pipeline.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md
index e8577b54ce..bcd6df0a5c 100644
--- a/docs/website/docs/walkthroughs/create-a-pipeline.md
+++ b/docs/website/docs/walkthroughs/create-a-pipeline.md
@@ -8,7 +8,7 @@ keywords: [how to, create a pipeline, rest client]

 This guide walks you through creating a pipeline that uses our [REST API Client](../general-usage/http/rest-client)
 to connect to [DuckDB](../dlt-ecosystem/destinations/duckdb).
-Although this example uses DuckDB, you can adapt the steps to any source and destination by
+Although this example uses DuckDB, you can adapt the steps to any [source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/) and [destination]( https://dlthub.com/docs/dlt-ecosystem/destinations/) by
 using the [command](../reference/command-line-interface#dlt-init) `dlt init ` and tweaking the pipeline accordingly.

 Please make sure you have [installed `dlt`](../reference/installation) before following the

From aabcd11c9aa8d11deccfa6c5aa85ff400a5e1915 Mon Sep 17 00:00:00 2001
From: Sultan Iman
Date: Wed, 15 May 2024 16:49:46 +0200
Subject: [PATCH 22/25] Remove leading space

---
 docs/website/docs/walkthroughs/create-a-pipeline.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md
index bcd6df0a5c..54ff834b5b 100644
--- a/docs/website/docs/walkthroughs/create-a-pipeline.md
+++ b/docs/website/docs/walkthroughs/create-a-pipeline.md
@@ -8,7 +8,7 @@ keywords: [how to, create a pipeline, rest client]

 This guide walks you through creating a pipeline that uses our [REST API Client](../general-usage/http/rest-client)
 to connect to [DuckDB](../dlt-ecosystem/destinations/duckdb).
-Although this example uses DuckDB, you can adapt the steps to any [source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/) and [destination]( https://dlthub.com/docs/dlt-ecosystem/destinations/) by
+Although this example uses DuckDB, you can adapt the steps to any [source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/) and [destination](https://dlthub.com/docs/dlt-ecosystem/destinations/) by
 using the [command](../reference/command-line-interface#dlt-init) `dlt init ` and tweaking the pipeline accordingly.
 Please make sure you have [installed `dlt`](../reference/installation) before following the

From d5e5b5b1ce427efcf30d318e3b9ab138c6c355c3 Mon Sep 17 00:00:00 2001
From: Anton Burnashev
Date: Thu, 16 May 2024 12:01:44 +0200
Subject: [PATCH 23/25] Update docs/website/docs/walkthroughs/create-a-pipeline.md

---
 docs/website/docs/walkthroughs/create-a-pipeline.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md
index 54ff834b5b..e2cd3b88e0 100644
--- a/docs/website/docs/walkthroughs/create-a-pipeline.md
+++ b/docs/website/docs/walkthroughs/create-a-pipeline.md
@@ -8,8 +8,10 @@ keywords: [how to, create a pipeline, rest client]

 This guide walks you through creating a pipeline that uses our [REST API Client](../general-usage/http/rest-client)
 to connect to [DuckDB](../dlt-ecosystem/destinations/duckdb).
-Although this example uses DuckDB, you can adapt the steps to any [source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/) and [destination](https://dlthub.com/docs/dlt-ecosystem/destinations/) by
+:::tip
+This example uses DuckDB, but you can adapt the steps to any [source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/) and [destination](https://dlthub.com/docs/dlt-ecosystem/destinations/) by
 using the [command](../reference/command-line-interface#dlt-init) `dlt init ` and tweaking the pipeline accordingly.
+:::

 Please make sure you have [installed `dlt`](../reference/installation) before following the
 steps below.
From f7ee428023dca916178c93e0d52ce296fc15dfe9 Mon Sep 17 00:00:00 2001
From: Anton Burnashev
Date: Thu, 16 May 2024 13:31:24 +0200
Subject: [PATCH 24/25] Update docs/website/docs/walkthroughs/create-a-pipeline.md

---
 docs/website/docs/walkthroughs/create-a-pipeline.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md
index e2cd3b88e0..31dc994a32 100644
--- a/docs/website/docs/walkthroughs/create-a-pipeline.md
+++ b/docs/website/docs/walkthroughs/create-a-pipeline.md
@@ -9,7 +9,7 @@ keywords: [how to, create a pipeline, rest client]
 This guide walks you through creating a pipeline that uses our [REST API Client](../general-usage/http/rest-client)
 to connect to [DuckDB](../dlt-ecosystem/destinations/duckdb).
 :::tip
-This example uses DuckDB, but you can adapt the steps to any [source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/) and [destination](https://dlthub.com/docs/dlt-ecosystem/destinations/) by
+We're using DuckDB as a destination here, but you can adapt the steps to any [source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/) and [destination](https://dlthub.com/docs/dlt-ecosystem/destinations/) by
 using the [command](../reference/command-line-interface#dlt-init) `dlt init ` and tweaking the pipeline accordingly.
 :::

From 060fcce00eb205f5d264a9b73425c0566af8f9bc Mon Sep 17 00:00:00 2001
From: Anton Burnashev
Date: Thu, 16 May 2024 13:31:57 +0200
Subject: [PATCH 25/25] Update docs/website/docs/walkthroughs/create-a-pipeline.md

---
 docs/website/docs/walkthroughs/create-a-pipeline.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/website/docs/walkthroughs/create-a-pipeline.md b/docs/website/docs/walkthroughs/create-a-pipeline.md
index 31dc994a32..bba78dc6cb 100644
--- a/docs/website/docs/walkthroughs/create-a-pipeline.md
+++ b/docs/website/docs/walkthroughs/create-a-pipeline.md
@@ -22,7 +22,7 @@ Imagine you want to analyze issues from a GitHub project locally.
 To achieve this, you need to write code that accomplishes the following:

 1. Constructs a correct request.
-2. Authenticates your request.
+2. Authenticates your requests.
 3. Fetches and handles paginated issue data.
 4. Stores the data for analysis.
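
Note (not part of the patch): the four steps listed in the walkthrough (construct the request, authenticate it, follow pagination, store the result) can be sketched with the Python standard library alone. This is only an illustration of the idea behind `Link`-header pagination, not dlt's implementation; in the walkthrough itself, `paginate` with `HeaderLinkPaginator` and `BearerTokenAuth` does this work. The names `build_headers`, `next_link`, `iter_pages`, and the injected `fetch` callable are hypothetical and exist only for this sketch.

```python
import re
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# Hypothetical transport: takes (url, headers) and returns
# (decoded JSON items, response headers). Injected so the sketch needs no network.
Fetcher = Callable[[str, Dict[str, str]], Tuple[List[dict], Dict[str, str]]]


def build_headers(token: str) -> Dict[str, str]:
    """Steps 1-2: construct request headers, including bearer authentication."""
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }


def next_link(link_header: Optional[str]) -> Optional[str]:
    """Return the URL marked rel="next" in a Link header, or None if there is none."""
    if not link_header:
        return None
    for match in re.finditer(r'<([^>]+)>\s*;\s*rel="([^"]+)"', link_header):
        if match.group(2) == "next":
            return match.group(1)
    return None


def iter_pages(url: str, token: str, fetch: Fetcher) -> Iterator[List[dict]]:
    """Step 3: yield one page of items at a time until no rel="next" link remains."""
    headers = build_headers(token)
    next_url: Optional[str] = url
    while next_url:
        items, response_headers = fetch(next_url, headers)
        yield items
        next_url = next_link(response_headers.get("Link"))


def collect(url: str, token: str, fetch: Fetcher) -> List[dict]:
    """Step 4 (stand-in): gather every item in memory; a real pipeline would load them."""
    return [item for page in iter_pages(url, token, fetch) for item in page]
```

Passing the transport in as `fetch` keeps the pagination logic separate from the HTTP client, so the loop can be exercised with canned responses before it touches a real API.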