refactored config providers
AstrakhantsevaAA committed Oct 20, 2023
1 parent a7001c2 commit 713f55b
Showing 1 changed file with 79 additions and 32 deletions.
111 changes: 79 additions & 32 deletions docs/website/docs/general-usage/configuration/config_providers.md
@@ -1,61 +1,108 @@
---
title: Secrets and Config Providers
description: Secrets and Config Providers
keywords: [credentials, secrets.toml, environment variables]
title: Config Providers
description: dlt Configuration Providers
keywords: [credentials, secrets.toml, secrets, config, configuration, environment
variables, provider]
---

## Providers
If function signature has arguments that may be injected, `dlt` looks for the argument values in providers. **The argument name is a key in the lookup**. In case of `google_sheets()` it will look for: `tab_names`, `credentials` and `strings_only`.
# Config Providers
## The provider hierarchy

Each provider has its own key naming convention and dlt is able to translate between them.
If a function signature has arguments that may be injected, `dlt` looks for the argument values in
providers. **The argument name is a key in the lookup**.

Providers form a hierarchy. At the top are environment variables, then `secrets.toml` and `config.toml` files. Providers like google, aws, azure vaults can be inserted after the environment provider.
Example:

```python
import dlt


@dlt.source
def google_sheets(
    spreadsheet_id,
    tab_names=dlt.config.value,
    credentials=dlt.secrets.value,
    only_strings=False
):
    sheets = build('sheets', 'v4', credentials=Services.from_json(credentials))
    tabs = []
    for tab_name in tab_names:
        data = sheets.get(spreadsheet_id, tab_name).execute().values()
        tabs.append(dlt.resource(data, name=tab_name))
    return tabs
```

For example if `spreadsheet_id` is in environment, dlt does not look into other providers.
In the case of `google_sheets()`, it will look
for `tab_names`, `credentials` and `only_strings`.

The values passed in the code explitly are the **highest** in provider hierarchy.
The default values of the arguments have the **lowest** priority in the provider hierarchy.
Each provider has its own key naming convention, and dlt is able to translate between them.

> **Summary of the hierarchy**
> explicit args > env variables > ...vaults, airflow etc > secrets.toml > config.toml > default arg values
Providers form a hierarchy. At the top are environment variables, then `secrets.toml` and
`config.toml` files. Providers like Google/AWS/Azure Vaults can be inserted after the environment
provider.

Secrets are handled only by the providers supporting them. Some of the providers support only secrets (to reduce the number of requests done by `dlt` when searching sections)
1. `secrets.toml` and environment may hold both config and secret values
2. `config.toml` may hold only config values, no secrets
3. various vaults providers hold only secrets, `dlt` skips them when looking for values that are not secrets.
For example, if `spreadsheet_id` is in environment, dlt does not look into other providers.

⛔ Context aware providers will activate in right environments ie. on Airflow or AWS/GCP VMachines
The values passed **explicitly** in the code have the **highest** priority in the provider hierarchy. The **default values**
of the arguments have the **lowest** priority in the provider hierarchy.
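
The lookup order described above can be illustrated with a minimal sketch. This is not how `dlt` implements
resolution; `resolve` and its parameters are hypothetical names, the toml files are stood in for by plain
dictionaries, and vault providers are left out for brevity.

```python
import os

_MISSING = object()  # sentinel, so that None can still be passed explicitly


def resolve(key, explicit=_MISSING, secrets_toml=None, config_toml=None,
            default=None, environ=None):
    """Return (value, origin) from the first place that defines `key`."""
    environ = os.environ if environ is None else environ
    # 1. A value passed explicitly in code always wins.
    if explicit is not _MISSING:
        return explicit, "explicit"
    # 2. Environment variables use capitalized names.
    env_key = key.upper()
    if env_key in environ:
        return environ[env_key], "environment"
    # 3. Then secrets.toml, then config.toml (plain dicts in this sketch).
    for origin, table in (("secrets.toml", secrets_toml or {}),
                          ("config.toml", config_toml or {})):
        if key in table:
            return table[key], origin
    # 4. The default argument value has the lowest priority.
    return default, "default"


value, origin = resolve(
    "spreadsheet_id",
    environ={"SPREADSHEET_ID": "from-env"},
    secrets_toml={"spreadsheet_id": "from-secrets"},
)
print(value, origin)
```

Because `spreadsheet_id` is found in the environment, the `secrets.toml` entry is never consulted.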

### Provider key formats. toml vs. environment variable
> **Summary of the hierarchy:**
>
> explicit args > env variables > ...vaults, airflow etc. > secrets.toml > config.toml > default arg values
Providers may use diffent formats for the keys. `dlt` will translate the standard format where sections and key names are separated by "." into the provider specific formats.
Secrets are handled only by the providers supporting them. Some providers support only
secrets (to reduce the number of requests done by `dlt` when searching sections).

1. for `toml` names are case sensitive and sections are separated with "."
2. for environment variables all names are capitalized and sections are separated with double underscore "__"
1. `secrets.toml` and environment may hold both config and secret values.
1. `config.toml` may hold only config values, no secrets.
1. Various vault providers hold only secrets; `dlt` skips them when looking for values that are not
secrets.
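
The three rules above can be pictured as a filter over the provider list. The sketch below is only an
illustration under assumptions: the `Provider` class, its flags, and the `vault` entry are hypothetical and
do not correspond to `dlt` classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    supports_secrets: bool
    secrets_only: bool = False


# Ordered as in the hierarchy; "vault" is a placeholder for any vault provider.
PROVIDERS = [
    Provider("environment", supports_secrets=True),
    Provider("vault", supports_secrets=True, secrets_only=True),
    Provider("secrets.toml", supports_secrets=True),
    Provider("config.toml", supports_secrets=False),
]


def providers_to_query(is_secret):
    """List the providers consulted for a secret or a plain config value."""
    if is_secret:
        # Secrets are looked up only in providers that support them.
        return [p.name for p in PROVIDERS if p.supports_secrets]
    # Secrets-only providers are skipped for plain config values.
    return [p.name for p in PROVIDERS if not p.secrets_only]


print(providers_to_query(is_secret=True))
print(providers_to_query(is_secret=False))
```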

Example:
When `dlt` evaluates the request `dlt.secrets["my_section.gcp_credentials"]` it must find the `private_key` for google credentials. It will look
1. first in env variable `MY_SECTION__GCP_CREDENTIALS__PRIVATE_KEY` and if not found
2. in `secrets.toml` with key `my_section.gcp_credentials.private_key`
⛔ Context-aware providers will activate in the right environments, i.e. on Airflow or on AWS/GCP virtual machines.

## Provider key formats

### `toml` vs. Environment Variables

Providers may use different formats for the keys. `dlt` will translate the standard format where
sections and key names are separated by "." into the provider-specific formats.

1. For `toml`, names are case-sensitive and sections are separated with ".".
1. For Environment Variables, all names are capitalized and sections are separated with double
underscore "\_\_".

Example: When `dlt` evaluates the request `dlt.secrets["my_section.gcp_credentials"]` it must find
the `private_key` for Google credentials. It will look

1. first in env variable `MY_SECTION__GCP_CREDENTIALS__PRIVATE_KEY` and if not found,
1. in `secrets.toml` with key `my_section.gcp_credentials.private_key`.
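
The translation between the two key formats can be sketched in a few lines. `to_env_key` is a hypothetical
helper written for this illustration, not a `dlt` function; it assumes the dotted key contains only section
and key names.

```python
def to_env_key(dotted_key):
    """Translate a dotted key into the environment variable format."""
    # Capitalize every part and join sections with a double underscore.
    return "__".join(part.upper() for part in dotted_key.split("."))


toml_key = "my_section.gcp_credentials.private_key"
print(to_env_key(toml_key))  # MY_SECTION__GCP_CREDENTIALS__PRIVATE_KEY
```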

### Environment provider
Looks for the values in the environment variables

Looks for the values in the environment variables.

### Toml provider
Tomls provider uses two `toml` files: `secrets.toml` to store secrets and `config.toml` to store configuration values. The default `.gitignore` file prevents secrets from being added to source control and pushed. The `config.toml` may be freely added.

**Toml provider always loads those files from `.dlt` folder** which is looked **relative to the current working directory**. Example:
if your working dir is `my_dlt_project` and you have:
The Toml provider uses two `toml` files: `secrets.toml` to store secrets and `config.toml` to store
configuration values. The default `.gitignore` file prevents secrets from being added to source
control and pushed. The `config.toml` file may be freely added to source control.

> **The Toml provider always loads those files from the `.dlt` folder**, which is looked up **relative to the
> current working directory**.

Example: If your working directory is `my_dlt_project` and your project has the following structure:

```
my_dlt_project:
  |
  pipelines/
    |---- .dlt/secrets.toml
    |---- google_sheets.py
```
in it and you run `python pipelines/google_sheets.py` then `dlt` will look for `secrets.toml` in `my_dlt_project/.dlt/secrets.toml` and ignore the existing `my_dlt_project/pipelines/.dlt/secrets.toml`

if you change your working dir to `pipelines` and run `python google_sheets.py` it will look for `my_dlt_project/pipelines/.dlt/secrets.toml` a (probably) expected.
and you run `python pipelines/google_sheets.py` then `dlt` will look for `secrets.toml` in
`my_dlt_project/.dlt/secrets.toml` and ignore the existing
`my_dlt_project/pipelines/.dlt/secrets.toml`.

*that was common problem on our workshop - but believe me all other layouts are even worse I've tried*
If you change your working directory to `pipelines` and run `python google_sheets.py` it will look for
`my_dlt_project/pipelines/.dlt/secrets.toml` as (probably) expected.
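
To check which files would be picked up from a given working directory, the path rule can be reproduced
with `pathlib`. This is a sketch of the rule as described here, not a call into `dlt`; `toml_locations` is a
hypothetical helper.

```python
from pathlib import Path


def toml_locations(working_dir=None):
    """Return the (secrets.toml, config.toml) paths for a working directory."""
    # The `.dlt` folder is resolved against the working directory,
    # not against the location of the script being run.
    base = Path.cwd() if working_dir is None else Path(working_dir)
    settings_dir = base / ".dlt"
    return settings_dir / "secrets.toml", settings_dir / "config.toml"


# Running `python pipelines/google_sheets.py` from `my_dlt_project`:
print(toml_locations("my_dlt_project")[0])
# Running `python google_sheets.py` from `my_dlt_project/pipelines`:
print(toml_locations("my_dlt_project/pipelines")[0])
```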
