Added info about how to reorder the columns to adjust a schema #1364

Merged
merged 6 commits into from
May 23, 2024
Changes from 2 commits
83 changes: 47 additions & 36 deletions docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md
@@ -174,13 +174,13 @@ The configuration object passed to the REST API Generic Source has three main el
```py
config: RESTAPIConfig = {
"client": {
...
# ...
},
"resource_defaults": {
...
# ...
},
"resources": [
...
# ...
],
}
```
@@ -203,7 +203,9 @@ For example, you can set the primary key, write disposition, and other default s
```py
config = {
"client": {
...
"api_key": "your_api_key_here",
"base_url": "https://api.example.com",
# Add other client configurations here
},
"resource_defaults": {
"primary_key": "id",
@@ -216,7 +218,7 @@ config = {
},
"resources": [
"resource1",
"resource2": {
{
"name": "resource2_name",
"write_disposition": "append",
"endpoint": {
@@ -309,7 +311,7 @@ To specify the pagination configuration, use the `paginator` field in the [clien

```py
{
...
# ...
"paginator": {
"type": "json_links",
"next_url_path": "paging.next",
@@ -321,7 +323,7 @@ Or using the paginator instance:

```py
{
...
# ...
"paginator": JSONResponsePaginator(
next_url_path="paging.next"
),
@@ -394,11 +396,11 @@ One of the most common methods is token-based authentication. To authenticate wit
```py
{
"client": {
...
# ...
"auth": {
"token": dlt.secrets["your_api_token"],
},
...
# ...
},
}
```
@@ -424,7 +426,7 @@ To specify the authentication configuration, use the `auth` field in the [client
"type": "bearer",
"token": dlt.secrets["your_api_token"],
},
...
# ...
},
}
```
@@ -438,7 +440,7 @@ config = {
"client": {
"auth": BearerTokenAuth(dlt.secrets["your_api_token"]),
},
...
# ...
}
```
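
For orientation, bearer-token authentication amounts to sending the token in the `Authorization` header of every request. The following is a standard-library sketch only: the URL and token are placeholders, `bearer_request` is a hypothetical helper, and it does not show how dlt builds requests internally.

```py
from urllib.request import Request


def bearer_request(url: str, token: str) -> Request:
    """Build a request carrying the token as a bearer credential.

    Hypothetical helper for illustration; not part of dlt.
    """
    return Request(url, headers={"Authorization": f"Bearer {token}"})


# Placeholder URL and token, not real credentials.
req = bearer_request("https://api.example.com/issues", "your_api_token")
print(req.get_header("Authorization"))  # Bearer your_api_token
```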

@@ -455,7 +457,7 @@ In the GitHub example, the `issue_comments` resource depends on the `issues` res
"name": "issues",
"endpoint": {
"path": "issues",
...
# ...
},
},
{
@@ -495,13 +497,15 @@ The `issue_comments` resource will make requests to the following endpoints:
The syntax for the `resolve` field in parameter configuration is:

```py
"<parameter_name>": {
"type": "resolve",
"resource": "<parent_resource_name>",
"field": "<parent_resource_field_name>",
}
({
"{parameter_name}" :
{
"type": "resolve",
"resource": "{parent_resource_name}",
"field": "{parent_resource_field_name}",
}
})
```

Under the hood, dlt handles this by using a [transformer resource](../../general-usage/resource.md#process-resources-with-dlttransformer).
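
As a rough illustration of what resolving a parameter amounts to, each parent row contributes one value that is substituted into the child resource's path. This is a plain-Python sketch, not dlt's transformer code: the `resolve_paths` helper, the `issues/{issue_number}/comments` path template, and the sample rows are all assumptions made for the example.

```py
from typing import Any, Dict, Iterable, List


def resolve_paths(
    path_template: str,
    param_name: str,
    parent_rows: Iterable[Dict[str, Any]],
    field: str,
) -> List[str]:
    """Fill `param_name` in the path with `field` from each parent row.

    Hypothetical helper for illustration; not part of dlt.
    """
    return [path_template.format(**{param_name: row[field]}) for row in parent_rows]


# Sample parent rows standing in for data loaded by the `issues` resource.
issues = [{"number": 12, "title": "a"}, {"number": 14, "title": "b"}]
paths = resolve_paths("issues/{issue_number}/comments", "issue_number", issues, "number")
print(paths)  # ['issues/12/comments', 'issues/14/comments']
```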

#### Include fields from the parent resource
@@ -512,7 +516,7 @@ You can include data from the parent resource in the child resource by using the
{
"name": "issue_comments",
"endpoint": {
...
# ...
},
"include_from_parent": ["id", "title", "created_at"],
}
@@ -530,35 +534,42 @@ When the API endpoint supports incremental loading, you can configure the source
1. Defining a special parameter in the `params` section of the [endpoint configuration](#endpoint-configuration):

```py
"<parameter_name>": {
"type": "incremental",
"cursor_path": "<path_to_cursor_field>",
"initial_value": "<initial_value>",
},

({
"<parameter_name>": {
"type": "incremental",
"cursor_path": "<path_to_cursor_field>",
"initial_value": "<initial_value>",
}
})
```

For example, in the `issues` resource configuration in the GitHub example, we have:

```py
"since": {
"type": "incremental",
"cursor_path": "updated_at",
"initial_value": "2024-01-25T11:21:28Z",
},
({
"since": {
"type": "incremental",
"cursor_path": "updated_at",
"initial_value": "2024-01-25T11:21:28Z",
}
})
```

This configuration tells the source to create an incremental object that will keep track of the `updated_at` field in the response and use it as a value for the `since` parameter in subsequent requests.

2. Specifying the `incremental` field in the [endpoint configuration](#endpoint-configuration):

```py
"incremental": {
"start_param": "<parameter_name>",
"end_param": "<parameter_name>",
"cursor_path": "<path_to_cursor_field>",
"initial_value": "<initial_value>",
"end_value": "<end_value>",
},
({
"incremental": {
"start_param": "<parameter_name>",
"end_param": "<parameter_name>",
"cursor_path": "<path_to_cursor_field>",
"initial_value": "<initial_value>",
"end_value": "<end_value>",
}
})
```

This configuration is more flexible and allows you to specify the start and end conditions for the incremental loading.
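
To make the bookkeeping behind the first form (a parameter of type `incremental`) concrete, here is a rough simulation in plain Python. It is not dlt's implementation. The `next_since` helper and the sample rows are hypothetical, the cursor path is treated as a top-level key, and the cursor values are assumed to be ISO 8601 UTC timestamps in one format, so comparing them as strings orders them correctly.

```py
from typing import Any, Dict, Iterable


def next_since(rows: Iterable[Dict[str, Any]], cursor_path: str, last_value: str) -> str:
    """Return the highest cursor value seen so far.

    Hypothetical helper for illustration; not part of dlt.
    """
    for row in rows:
        value = row.get(cursor_path)
        # Keep the largest timestamp; rows without the field are ignored.
        if value is not None and value > last_value:
            last_value = value
    return last_value


initial_value = "2024-01-25T11:21:28Z"
page = [
    {"id": 1, "updated_at": "2024-02-01T08:00:00Z"},
    {"id": 2, "updated_at": "2024-01-30T17:45:10Z"},
]
# The result would be sent as the `since` parameter of the next request.
since = next_since(page, "updated_at", initial_value)
print(since)  # 2024-02-01T08:00:00Z
```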
26 changes: 26 additions & 0 deletions docs/website/docs/walkthroughs/adjust-a-schema.md
@@ -121,6 +121,32 @@ Do not rename the tables or columns in the yaml file. `dlt` infers those from th
You can [adjust the schema](../general-usage/resource.md#adjust-schema) in Python before the resource is loaded.
:::

### Reorder columns
To reorder the columns in your dataset, follow these steps:

1. Initial Run: Execute the pipeline to obtain the import and export schemas.
1. Modify Export Schema: Adjust the column order as desired in the export schema.
1. Sync Import Schema: Ensure that these changes are mirrored in the import schema to maintain consistency.
1. Delete Dataset: Remove the existing dataset to prepare for the reload.
1. Reload Data: Reload the data. The dataset should now reflect the column order specified in the import YAML file.

These steps ensure that the column order in your dataset matches your specifications.

**Another approach** to reorder columns is to use the `add_map` function. For instance, to rearrange `column1`, `column2`, and `column3`, you can proceed as follows:

```py
# Define the data source and reorder columns using add_map.
# `resource` and `pipeline` are assumed to be defined elsewhere.
data_source = resource().add_map(lambda row: {
    'column3': row['column3'],
    'column1': row['column1'],
    'column2': row['column2'],
})

# Run the pipeline
load_info = pipeline.run(data_source)
```

In this example, the `add_map` function reorders columns by defining a new mapping. The lambda function specifies the desired order by rearranging the key-value pairs. When the pipeline runs, the data will load with the columns in the new order.
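
Note that the lambda above returns only the three listed keys, so any other column in the row is dropped. As a hedged sketch, a reusable mapper can put the listed columns first and keep the remaining ones in their original order. The helper name `make_reorder_map` is hypothetical and not part of dlt; it relies only on Python dictionaries preserving insertion order.

```py
from typing import Any, Callable, Dict, Sequence

Row = Dict[str, Any]


def make_reorder_map(first: Sequence[str]) -> Callable[[Row], Row]:
    """Return a row mapper that moves the `first` columns to the front.

    Hypothetical helper for illustration; not part of dlt.
    """

    def reorder(row: Row) -> Row:
        # Listed columns come first, in the requested order (if present).
        ordered = {name: row[name] for name in first if name in row}
        # Remaining columns keep their original relative order.
        ordered.update((k, v) for k, v in row.items() if k not in ordered)
        return ordered

    return reorder


reorder = make_reorder_map(["column3", "column1"])
row = {"column1": 1, "column2": 2, "column3": 3, "extra": 4}
print(list(reorder(row)))  # ['column3', 'column1', 'column2', 'extra']
```

Such a mapper could then be passed to `add_map` in place of the lambda, for example `resource().add_map(make_reorder_map(["column3", "column1"]))`.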

### Load data as json instead of generating child table or columns from flattened dicts
