Add missing example + fix typos
burnash committed Oct 30, 2023
1 parent 7d13b7d commit c48c46b
Showing 3 changed files with 44 additions and 24 deletions.
47 changes: 30 additions & 17 deletions docs/website/docs/tutorial/grouping-resources.md
@@ -5,7 +5,7 @@ keywords: [api, source, decorator, dynamic resource, github, tutorial]
---

This tutorial continues the [previous](load-data-from-an-api) part. We'll use the same GitHub API example to show you how to:
1. Load from other GitHub API endpoints.
1. Load data from other GitHub API endpoints.
1. Group your resources into sources for easier management.
1. Handle secrets and configuration.

@@ -113,7 +113,7 @@ load_info = pipeline.run(github_source())
print(load_info)
```

### Dynamic resource
### Dynamic resources

You've noticed that there's a lot of code duplication in the `get_issues` and `get_comments` functions. We can reduce it by extracting the common fetching code into a separate function and using it in both resources. Even better, we can use `dlt.resource` as a function and pass it the `fetch_github_data()` generator function directly. Here's the refactored code:

@@ -159,14 +159,14 @@ row_counts = pipeline.last_trace.last_normalize_info

## Handle secrets

For the next step we'd want to get the [number of repository clones](https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-repository-clones) for our dlt repo from the GitHub API. However, the `traffic/clones` endpoint that returns the data requires authentication.
For the next step we'd want to get the [number of repository clones](https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-repository-clones) for our dlt repo from the GitHub API. However, the `traffic/clones` endpoint that returns the data requires [authentication](https://docs.github.com/en/rest/overview/authenticating-to-the-rest-api?apiVersion=2022-11-28).

Let's handle this by changing our `fetch_github_data()` first:

```python
def fetch_github_data(endpoint, params={}, api_token=None):
def fetch_github_data(endpoint, params={}, access_token=None):
    """Fetch data from GitHub API based on endpoint and params."""
    headers = {"Authorization": f"Bearer {api_token}"} if api_token else {}
    headers = {"Authorization": f"Bearer {access_token}"} if access_token else {}

    url = f"{BASE_GITHUB_URL}/{endpoint}"

@@ -179,15 +179,28 @@ def fetch_github_data(endpoint, params={}, api_token=None):
        if "next" not in response.links:
            break
        url = response.links["next"]["url"]

@dlt.source
def github_source(access_token):
    for endpoint in ["issues", "comments", "traffic/clones"]:
        params = {"per_page": 100}
        yield dlt.resource(
            fetch_github_data(endpoint, params, access_token),
            name=endpoint,
            write_disposition="merge",
            primary_key="id",
        )

...
```
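The fetching loop above follows GitHub's `Link`-header pagination: keep requesting the `next` URL until there isn't one. The same idea in isolation (an illustrative helper, not part of dlt's API):

```python
def paginate(get_page, first_url):
    """Yield items from all pages, following next-page links.

    `get_page` is any callable that takes a URL and returns a tuple of
    (items, next_url_or_None). This is a sketch of the loop used in
    fetch_github_data above.
    """
    url = first_url
    while url is not None:
        items, url = get_page(url)
        yield from items
```

With real HTTP calls, `get_page` would return `response.json()` together with `response.links.get("next", {}).get("url")`.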

Here, we added `api_token` parameter and used it to pass the authentication token to the request:
Here, we added the `access_token` parameter and can now use it to pass the access token to the request:

```python
load_info = pipeline.run(github_source(access_token="ghp_A...M"))
load_info = pipeline.run(github_source(access_token="ghp_XXXXX"))
```

Good. But we'd want to follow the best practices and not hardcode the token in the script. One option is to set the token as an environment variable, load it with `os.getenv()` and pass it around as a parameter. dlt offers a more convenient way to handle secrets and credentials: it lets you inject the arguments using a special `dlt.secrets.value` argument value.
It's a good start. But we want to follow best practices and avoid hardcoding the token in the script. One option is to set the token as an environment variable, load it with `os.getenv()`, and pass it around as a parameter. dlt offers a more convenient way to handle secrets and credentials: it lets you inject the arguments using a special `dlt.secrets.value` argument value.
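For comparison, the plain environment-variable option might look like this (a minimal sketch; the variable name `GITHUB_ACCESS_TOKEN` is our own choice, not a dlt convention):

```python
import os

def get_access_token():
    # Read the token from the environment instead of hardcoding it.
    # GITHUB_ACCESS_TOKEN is an arbitrary name chosen for this example.
    token = os.getenv("GITHUB_ACCESS_TOKEN")
    if not token:
        raise RuntimeError("Set the GITHUB_ACCESS_TOKEN environment variable first")
    return token
```

You would then call `pipeline.run(github_source(access_token=get_access_token()))`. The `dlt.secrets.value` approach removes even this boilerplate.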

To use it, change the `github_source()` function to:

@@ -199,12 +212,12 @@ def github_source(
...
```

When you add `dlt.secrets.value` as a default value for an argument, `dlt` will try to load this value from different configuration sources in the following order:
When you add `dlt.secrets.value` as a default value for an argument, `dlt` will try to load and inject this value from different configuration sources in the following order:

1. Special environment variables.
2. `secrets.toml` file.

The `secret.toml` file is located in the `~/.dlt` folder (for global configuration) or in the project folder (for project-specific configuration).
The `secrets.toml` file is located in the `~/.dlt` folder (for global configuration) or in the `.dlt` folder in the project folder (for project-specific configuration).

Let's add the token to the `~/.dlt/secrets.toml` file:

@@ -222,9 +235,9 @@ from dlt.sources.helpers import requests
BASE_GITHUB_URL = "https://api.github.com/repos/dlt-hub/dlt"


def fetch_github_data(endpoint, params={}, api_token=None):
def fetch_github_data(endpoint, params={}, access_token=None):
    """Fetch data from GitHub API based on endpoint and params."""
    headers = {"Authorization": f"Bearer {api_token}"} if api_token else {}
    headers = {"Authorization": f"Bearer {access_token}"} if access_token else {}

    url = f"{BASE_GITHUB_URL}/{endpoint}"

@@ -272,9 +285,9 @@ from dlt.sources.helpers import requests
BASE_GITHUB_URL = "https://api.github.com/repos/{repo_name}"


def fetch_github_data(repo_name, endpoint, params={}, api_token=None):
def fetch_github_data(repo_name, endpoint, params={}, access_token=None):
    """Fetch data from GitHub API based on repo_name, endpoint, and params."""
    headers = {"Authorization": f"Bearer {api_token}"} if api_token else {}
    headers = {"Authorization": f"Bearer {access_token}"} if access_token else {}

    url = BASE_GITHUB_URL.format(repo_name=repo_name) + f"/{endpoint}"

@@ -312,7 +325,7 @@ pipeline = dlt.pipeline(
load_info = pipeline.run(github_source())
```

Next, create a `config.toml` file in the project folder and add the `repo_name` parameter to it:
Next, create a `.dlt/config.toml` file in the project folder and add the `repo_name` parameter to it:

```toml
[github_with_source_secrets]
@@ -323,7 +336,7 @@ That's it! Now you have a reusable source that can load data from any GitHub rep

## What’s next

Congratulations on completing the tutorial! You've come a long way since the Getting started guide. By now, you've mastered loading data from various GitHub API endpoints, organizing resources into sources, managing secrets securely, and creatig reusable sources. You can use these skills to build your own pipelines and load data from any source.
Congratulations on completing the tutorial! You've come a long way since the [getting started](../getting-started) guide. By now, you've mastered loading data from various GitHub API endpoints, organizing resources into sources, managing secrets securely, and creating reusable sources. You can use these skills to build your own pipelines and load data from any source.

Interested in learning more? Here are some suggestions:
1. You've been running your pipelines locally. Learn how to [deploy and run them in the cloud](../walkthroughs/deploy-a-pipeline/).
@@ -335,4 +348,4 @@ Interested in learning more? Here are some suggestions:
- [Run in production: inspecting, tracing, retry policies and cleaning up](../running-in-production/running).
- [Run resources in parallel, optimize buffers and local storage](../reference/performance.md)
3. Check out our [how-to guides](../walkthroughs) to get answers to some common questions.
4. Explore [Examples](../examples) section to see how dlt can be used in real-world scenarios.
4. Explore the [Examples](../examples) section to see how dlt can be used in real-world scenarios.
19 changes: 13 additions & 6 deletions docs/website/docs/tutorial/load-data-from-an-api.md
@@ -18,7 +18,7 @@ Need help with this tutorial? Join our [Slack community](https://join.slack.com/

## Create a pipeline

First, we need to create a pipeline. Pipelines are the main building blocks of `dlt` and are used to load data from sources to destinations. Open your favorite text editor and create a file called `github_issues.py`. Add the following code to it:
First, we need to create a [pipeline](../general-usage/pipeline). Pipelines are the main building blocks of `dlt` and are used to load data from sources to destinations. Open your favorite text editor and create a file called `github_issues.py`. Add the following code to it:

<!--@@@DLT_SNIPPET_START basic_api-->
```py
@@ -43,6 +43,13 @@ print(load_info)
```
<!--@@@DLT_SNIPPET_END basic_api-->

Here's what the code above does:
1. It makes a request to the GitHub API endpoint and checks if the response is successful.
2. Then it creates a dlt pipeline with the name `github_issues` and specifies that the data should be loaded to the `duckdb` destination and the `github_data` dataset. Nothing gets loaded yet.
3. Finally, it runs the pipeline with the data from the API response (`response.json()`) and specifies that the data should be loaded to the `issues` table. The `run` method returns a `LoadInfo` object that contains information about the loaded data.

## Run the pipeline

Save `github_issues.py` and run the following command:

```bash
@@ -55,7 +62,7 @@ Once the data has been loaded, you can inspect the created dataset using the Str
dlt pipeline github_issues show
```

### Append or replace your data
## Append or replace your data

Try running the pipeline again with `python github_issues.py`. You will notice that the **issues** table contains two copies of the same data. This happens because the default load mode is `append`. It is very useful, for example, when you have a new folder created daily with `json` file logs, and you want to ingest them.

@@ -161,9 +168,9 @@ print(load_info)

Let's take a closer look at the code above.

We request issues for dlt-hub/dlt repository ordered by **created_at** field (descending) and yield them page by page in `get_issues` generator function.
We use the `@dlt.resource` decorator to declare the table name into which data will be loaded and specify the `append` write disposition.

We use the `@dlt.resource` decorator to declare table name to which data will be loaded and write disposition, which is `append`.
We request issues for the dlt-hub/dlt repository ordered by the **created_at** field (descending) and yield them page by page in the `get_issues` generator function.

We also use `dlt.sources.incremental` to track the `created_at` field present in each issue, so that only newly created issues are loaded on subsequent runs.
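Conceptually, `dlt.sources.incremental` remembers the highest `created_at` value seen so far and drops older items on the next run. A plain-Python sketch of that idea (not dlt's actual implementation — dlt also persists the cursor in pipeline state for you):

```python
def filter_new_issues(issues, last_created_at=None):
    """Keep only issues created after the stored cursor value.

    ISO 8601 timestamps compare correctly as strings, which is why
    a plain ">" works here.
    """
    return [
        issue for issue in issues
        if last_created_at is None or issue["created_at"] > last_created_at
    ]

issues = [
    {"id": 1, "created_at": "2023-10-01T00:00:00Z"},
    {"id": 2, "created_at": "2023-10-15T00:00:00Z"},
]
first_run = filter_new_issues(issues)                            # no cursor yet: all pass
second_run = filter_new_issues(issues, "2023-10-01T00:00:00Z")   # only newer issues pass
```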

@@ -240,9 +247,9 @@ print(load_info)
```
<!--@@@DLT_SNIPPET_END incremental_merge-->

Above we add `primary_key` hint that tells `dlt` how to identify the issues in the database to find duplicates which content it will merge.
Above, we add the `primary_key` argument to `dlt.resource()`, which tells `dlt` how to identify the issues in the database so it can find duplicates whose content it will merge.

Note that we now track the `updated_at` field - so we filter in all issues **updated** since the last pipeline run (which also includes those newly created).
Note that we now track the `updated_at` field, so we load all issues **updated** since the last pipeline run (which also includes newly created ones).

Pay attention to how we use the **since** parameter from the [GitHub API](https://docs.github.com/en/rest/issues/issues?apiVersion=2022-11-28#list-repository-issues)
and `updated_at.last_value` to tell GitHub to return issues updated only **after** the date we pass. `updated_at.last_value` holds the last `updated_at` value from the previous run.
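Building the request parameters from that cursor might look like this (a sketch; the `since`, `sort`, and `direction` parameter names are from the GitHub issues API, while the helper itself is our own illustration):

```python
def build_issue_params(last_updated_at=None, per_page=100):
    """Query params for an incremental issues request.

    `since` asks GitHub to return only issues updated after the given
    ISO 8601 timestamp -- this is where updated_at.last_value goes.
    """
    params = {"per_page": per_page, "sort": "updated", "direction": "desc"}
    if last_updated_at is not None:
        params["since"] = last_updated_at
    return params
```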
2 changes: 1 addition & 1 deletion docs/website/sidebars.js
@@ -156,7 +156,7 @@ const sidebars = {
      'dlt-ecosystem/visualizations/exploring-the-data',
      {
        type: 'category',
        label: 'Transform the data'
        label: 'Transform the data',
        link: {
          type: 'generated-index',
          title: 'Transform the data',
