diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md b/docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md
index 8500ef6cbb..197b1f9160 100644
--- a/docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md
+++ b/docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md
@@ -5,30 +5,44 @@ keywords: [rest api, restful api]
---
import Header from './_source-info-header.md';
-# REST API Generic Source
-
This is a generic dlt source you can use to extract data from any REST API. It uses declarative configuration to define the API endpoints, their relationships, parameters, pagination, and authentication.
-## Setup Guide
+## Setup guide
### Initialize the verified source
-Enter the following command:
+Enter the following command in your terminal:
- ```sh
- dlt init rest_api duckdb
- ```
+```sh
+dlt init rest_api duckdb
+```
-[dlt init](../../reference/command-line-interface) will initialize the pipeline example with REST API as the [source](../../general-usage/source) and [duckdb](../destinations/duckdb.md) as the [destination](../destinations).
+[dlt init](../../reference/command-line-interface) will initialize the pipeline examples for REST API as the [source](../../general-usage/source) and [duckdb](../destinations/duckdb.md) as the [destination](../destinations).
-## Add credentials
+Running `dlt init` creates the following in the current folder:
+- `rest_api_pipeline.py` file with a sample pipelines definition:
+ - GitHub API example
+ - Pokemon API example
+- `.dlt` folder with:
+ - `secrets.toml` file to store your access tokens and other sensitive information
+ - `config.toml` file to store the configuration settings
+- `requirements.txt` file with the required dependencies
+
+Change the REST API source to your needs by modifying the `rest_api_pipeline.py` file. See the detailed [source configuration](#source-configuration) section below.
+
+:::note
+For the rest of the guide, we will use the [GitHub API](https://docs.github.com/en/rest?apiVersion=2022-11-28) and [Pokemon API](https://pokeapi.co/) as example sources.
+:::
+
+### Add credentials
In the `.dlt` folder, you'll find a file called `secrets.toml`, where you can securely store your access tokens and other sensitive information. It's important to handle this file with care and keep it safe.
-The GitHub API requires an access token to be set in the `secrets.toml` file.
-Here is an example of how to set the token in the `secrets.toml` file:
+The GitHub API [requires an access token](https://docs.github.com/en/rest/authentication/authenticating-to-the-rest-api?apiVersion=2022-11-28) to access some of its endpoints and to increase the rate limit for the API calls. To get a GitHub token, follow the GitHub documentation on [managing your personal access tokens](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens).
+
+After you get the token, add it to the `secrets.toml` file:
```toml
[sources.rest_api.github]
@@ -55,7 +69,9 @@ github_token = "your_github_token"
dlt pipeline rest_api show
```
-## Source Configuration
+## Source configuration
+
+### Quick example
Let's take a look at the GitHub example in `rest_api_pipeline.py` file:
@@ -123,66 +139,305 @@ def load_github() -> None:
print(load_info)
```
-The declarative configuration is defined in the `github_config` dictionary. It contains the following key components:
+The declarative resource configuration is defined in the `github_config` dictionary. It contains the following key components:
1. `client`: Defines the base URL and authentication method for the API. In this case it uses token-based authentication. The token is stored in the `secrets.toml` file.
-2. `resource_defaults`: Contains default settings for all resources.
+2. `resource_defaults`: Contains default settings for all resources. In this example, we define that all resources:
+ - Have `id` as the [primary key](../../general-usage/resource#define-schema)
+ - Use the `merge` [write disposition](../../general-usage/incremental-loading#choosing-a-write-disposition) to merge the data with the existing data in the destination.
+ - Send a `per_page` query parameter with each request to 100 to get more results per page.
+
+3. `resources`: A list of resources to be loaded. In this example, we have two resources: `issues` and `issue_comments`, which correspond to the GitHub API endpoints for [repository issues](https://docs.github.com/en/rest/issues/issues?apiVersion=2022-11-28#list-repository-issues) and [issue comments](https://docs.github.com/en/rest/issues/comments?apiVersion=2022-11-28#list-issue-comments). Note that we need a in issue number to fetch comments for each issue. This number is taken from the `issues` resource. More on this in the [resource relationships](#define-resource-relationships) section.
+
+Let's break down the configuration in more detail.
+
+### Configuration structure
+
+:::tip
+Import the `RESTAPIConfig` type from the `rest_api` module to have convenient hints in your editor/IDE:
+
+```python
+from rest_api import RESTAPIConfig
+```
+:::
+
+
+The configuration object passed to the REST API Generic Source has three main elements:
+
+```py
+config: RESTAPIConfig = {
+ "client": {
+ ...
+ },
+ "resource_defaults": {
+ ...
+ },
+ "resources": [
+ ...
+ ],
+}
+```
+
+#### `client`
+
+`client` contains the configuration to connect to the API's endpoints. It includes the following fields:
+
+- `base_url` (str): The base URL of the API. This string is prepended to all endpoint paths. For example, if the base URL is `https://api.example.com/v1/`, and the endpoint path is `users`, the full URL will be `https://api.example.com/v1/users`.
+- `headers` (dict, optional): Additional headers to be sent with each request.
+- `auth` (optional): Authentication configuration. It can be a simple token, a `AuthConfigBase` object, or a more complex authentication method.
+- `paginator` (optional): Configuration for the default pagination to be used for resources that support pagination. See the [pagination](#pagination) section for more details.
+
+#### `resource_defaults` (optional)
+
+`resource_defaults` contains the default values to configure the dlt resources. This configuration is applied to all resources unless overridden by the resource-specific configuration.
+
+For example, you can set the primary key, write disposition, and other default settings here:
+
+```py
+config = {
+ "client": {
+ ...
+ },
+ "resource_defaults": {
+ "primary_key": "id",
+ "write_disposition": "merge",
+ "endpoint": {
+ "params": {
+ "per_page": 100,
+ },
+ },
+ },
+ "resources": [
+ "resource1",
+ "resource2": {
+ "name": "resource2_name",
+ "write_disposition": "append",
+ "endpoint": {
+ "params": {
+ "param1": "value1",
+ },
+ },
+ },
+ ],
+}
+```
+
+Above, all resources will have `primary_key` set to `id`, `resource1` will have `write_disposition` set to `merge`, and `resource2` will override the default `write_disposition` with `append`.
+Both `resource1` and `resource2` will have the `per_page` parameter set to 100.
+
+#### `resources`
+
+This is a list of resource configurations that define the API endpoints to be loaded. Each resource configuration can be:
+- a dictionary with the [resource configuration](#resource-configuration).
+- a string. In this case, the string is used as the both as the endpoint path and the resource name, and the resource configuration is taken from the `resource_defaults` configuration if it exists.
+
+### Resource configuration
-3. `resources`: A list of resources to be loaded. In this example, we have two resources: `issues` and `issue_comments`. Which correspond to the GitHub API endpoints for issues and issue comments.
+A resource configuration has the following fields:
-Each resource has a name and an endpoint configuration. The endpoint configuration includes:
+- `endpoint`: The endpoint configuration for the resource. It can be a string or a dict representing the endpoint settings. See the [endpoint configuration](#endpoint-configuration) section for more details.
+- `write_disposition`: The write disposition for the resource.
+- `primary_key`: The primary key for the resource.
+- `include_from_parent`: A list of fields from the parent resource to be included in the resource output.
+- `selected`: A flag to indicate if the resource is selected for loading. This could be useful when you want to load data only from child resources and not from the parent resource.
+
+### Endpoint configuration
+
+The endpoint configuration defines how to query the API endpoint. Quick example:
+
+```py
+{
+ "path": "issues",
+ "method": "GET",
+ "params": {
+ "sort": "updated",
+ "direction": "desc",
+ "state": "open",
+ "since": {
+ "type": "incremental",
+ "cursor_path": "updated_at",
+ "initial_value": "2024-01-25T11:21:28Z",
+ },
+ },
+ "data_selector": "results",
+}
+```
+
+The fields in the endpoint configuration are:
- `path`: The path to the API endpoint.
- `method`: The HTTP method to be used. Default is `GET`.
- `params`: Query parameters to be sent with each request. For example, `sort` to order the results.
- `json`: The JSON payload to be sent with the request (for POST and PUT requests).
-- `paginator`: Configuration for paginating the results.
-- `data_selector`: A JSON path to select the data from the response.
+- `paginator`: Pagination configuration for the endpoint. See the [pagination](#pagination) section for more details.
+- `data_selector`: A JSONPath to select the data from the response. See the [data selection](#data-selection) section for more details.
- `response_actions`: A list of actions that define how to process the response data.
- `incremental`: Configuration for incremental loading.
-When you pass this configuration to the `rest_api_source` function, it creates a dlt source object that can be used with the pipeline.
+### Pagination
-`rest_api_source` function takes the following arguments:
+The REST API source will try to automatically handle pagination for you. This works by detecting the pagination details from the first API response.
-- `config`: The REST API configuration dictionary.
-- `name`: An optional name for the source.
-- `section`: An optional section name in the configuration file.
-- `max_table_nesting`: Sets the maximum depth of nested table above which the remaining nodes are loaded as structs or JSON.
-- `root_key` (bool): Enables merging on all resources by propagating root foreign key to child tables. This option is most useful if you plan to change write disposition of a resource to disable/enable merge. Defaults to False.
-- `schema_contract`: Schema contract settings that will be applied to this resource.
-- `spec`: A specification of configuration and secret values required by the source.
+In some special cases, you may need to specify the pagination configuration explicitly.
-## Define Resource Relationships
+These are the available paginator types:
-When you have a resource that depends on another resource, you can define the relationship using the resolve field type.
+| Paginator type | String Alias | Description |
+| -------------- | ------------ | ----------- |
+| JSONResponsePaginator | `json_links` | The links to the next page are in the body (JSON) of the response. |
+| HeaderLinkPaginator | `header_links` | The links to the next page are in the response headers. |
+| OffsetPaginator | `offset` | The pagination is based on an offset parameter. With total items count either in the response body or explicitly provided. |
+| PageNumberPaginator | `page_number` | The pagination is based on a page number parameter. With total pages count either in the response body or explicitly provided. |
+| JSONCursorPaginator | `json_cursor` | The pagination is based on a cursor parameter. The value of the cursor is in the response body (JSON). |
+| SinglePagePaginator | `single_page` | The response will be interpreted as a single-page response, ignoring possible pagination metadata. |
-In the GitHub example, the `issue_comments` resource depends on the `issues` resource. The `issue_number` parameter in the `issue_comments` endpoint configuration is resolved from the `number` field of the `issues` resource.
+To specify the pagination configuration, you can use the `paginator` field in the endpoint configuration:
```python
{
- "name": "issue_comments",
- "endpoint": {
- "path": "issues/{issue_number}/comments",
- "params": {
- "issue_number": {
- "type": "resolve",
- "resource": "issues",
- "field": "number",
- }
+ "path": "issues",
+ "paginator": {
+ "type": "json_links",
+ "next_url_path": "paging.next",
+ },
+}
+```
+### Data selection
+
+The `data_selector` field in the endpoint configuration allows you to specify a JSONPath to select the data from the response. By default, the source will try to detect locations of the data automatically.
+
+Use this field when you need to specify the location of the data in the response explicitly.
+
+For example, if the API response looks like this:
+
+```json
+{
+ "posts": [
+ {"id": 1, "title": "Post 1"},
+ {"id": 2, "title": "Post 2"},
+ {"id": 3, "title": "Post 3"}
+ ]
+}
+```
+
+You can use the following endpoint configuration:
+
+```python
+{
+ "path": "posts",
+ "data_selector": "posts",
+}
+```
+
+For a nested structure like this:
+
+```json
+{
+ "results": {
+ "posts": [
+ {"id": 1, "title": "Post 1"},
+ {"id": 2, "title": "Post 2"},
+ {"id": 3, "title": "Post 3"}
+ ]
+ }
+}
+```
+
+You can use the following endpoint configuration:
+
+```python
+{
+ "path": "posts",
+ "data_selector": "results.posts",
+}
+```
+
+Read more about [JSONPath syntax](https://github.com/h2non/jsonpath-ng?tab=readme-ov-file#jsonpath-syntax) to learn how to write selectors.
+
+
+### Authentication
+
+Many APIs require authentication to access their endpoints. The REST API source supports various authentication methods, such as token-based, query parameters, basic auth, etc.
+
+#### Quick example
+
+One of the most common method is token-based authentication. To authenticate with a token, you can use the `token` field in the `auth` configuration:
+
+```python
+{
+ "client": {
+ ...
+ "auth": {
+ "token": dltd.secrets["your_api_token"],
},
+ ...
},
-},
+}
+```
+
+:::warning
+Make sure to store your access tokens and other sensitive information in the `secrets.toml` file and never commit it to the version control system.
+:::
+
+Available authentication methods:
+
+| Authentication type | Description |
+| ------------------- | ----------- |
+| BearTokenAuth | Bearer token authentication. |
+| HTTPBasicAuth | Basic HTTP authentication. |
+| APIKeyAuth | API key authentication with key defined in the query parameters or in the headers. |
+
+### Define resource relationships
+
+When you have a resource that depends on another resource, you can define the relationship using the resolve field type.
+
+In the GitHub example, the `issue_comments` resource depends on the `issues` resource. The `issue_number` parameter in the `issue_comments` endpoint configuration is resolved from the `number` field of the `issues` resource:
+
+```py
+{
+ "resources": [
+ {
+ "name": "issues",
+ "endpoint": {
+ "path": "issues",
+ ...
+ },
+ },
+ {
+ "name": "issue_comments",
+ "endpoint": {
+ "path": "issues/{issue_number}/comments",
+ "params": {
+ "issue_number": {
+ "type": "resolve",
+ "resource": "issues",
+ "field": "number",
+ }
+ },
+ },
+ },
+ ],
+}
```
This configuration tells the source to get issue numbers from the `issues` resource and use them to fetch comments for each issue.
-## Incremental Loading
+The syntax for the `resolve` field in parameter configuration is:
+
+```py
+"": {
+ "type": "resolve",
+ "resource": "",
+ "field": "",
+}
+```
+
+## Incremental loading
To set up incremental loading for a resource, you can use two options:
-1. Defining a special parameter in the `params` section of the endpoint configuration:
+1. Defining a special parameter in the `params` section of the [endpoint configuration](#endpoint-configuration):
```python
"": {
@@ -204,7 +459,7 @@ To set up incremental loading for a resource, you can use two options:
This configuration tells the source to create an incremental object that will keep track of the `updated_at` field in the response and use it as a value for the `since` parameter in subsequent requests.
-2. Specifying the `incremental` field in the endpoint configuration:
+2. Specifying the `incremental` field in the [endpoint configuration](#endpoint-configuration):
```python
"incremental": {
@@ -218,3 +473,14 @@ To set up incremental loading for a resource, you can use two options:
This configuration is more flexible and allows you to specify the start and end conditions for the incremental loading.
+## `rest_api_source()` function
+
+`rest_api_source` function takes the following arguments:
+
+- `config`: The REST API configuration dictionary.
+- `name`: An optional name for the source.
+- `section`: An optional section name in the configuration file.
+- `max_table_nesting`: Sets the maximum depth of nested table above which the remaining nodes are loaded as structs or JSON.
+- `root_key` (bool): Enables merging on all resources by propagating root foreign key to child tables. This option is most useful if you plan to change write disposition of a resource to disable/enable merge. Defaults to False.
+- `schema_contract`: Schema contract settings that will be applied to this resource.
+- `spec`: A specification of configuration and secret values required by the source.