diff --git a/docs/technical/secrets_and_config.md b/docs/technical/secrets_and_config.md index 82a0e66e12..423767293d 100644 --- a/docs/technical/secrets_and_config.md +++ b/docs/technical/secrets_and_config.md @@ -77,14 +77,14 @@ You should type your function signatures! The effort is very low and it gives `d ```python @dlt.source -def google_sheets(spreadsheet_id: str, tab_names: List[str] = dlt.config.value, credentials: GcpClientCredentialsWithDefault = dlt.secrets.value, only_strings: bool = False): +def google_sheets(spreadsheet_id: str, tab_names: List[str] = dlt.config.value, credentials: GcpServiceAccountCredentials = dlt.secrets.value, only_strings: bool = False): ... ``` Now: 1. you are sure that you get a list of strings as `tab_names` 2. you will get actual google credentials (see `CredentialsConfiguration` later) and your users can pass them in many different forms. -In case of `GcpClientCredentialsWithDefault` +In case of `GcpServiceAccountCredentials` * you may just pass the `service_json` as string or dictionary (in code and via config providers) * you may pass a connection string (used in sql alchemy) (in code and via config providers) * or default credentials will be used @@ -331,7 +331,7 @@ It tells you exactly which paths `dlt` looked at, via which config providers and ## Working with credentials (and other complex configuration values) -`GcpClientCredentialsWithDefault` is an example of a **spec**: a Python `dataclass` that describes the configuration fields, their types and default values. It also allows to parse various native representations of the configuration. Credentials marked with `WithDefaults` mixin are also to instantiate itself from the machine/user default environment ie. googles `default()` or AWS `.aws/credentials`. +`GcpServiceAccountCredentials` is an example of a **spec**: a Python `dataclass` that describes the configuration fields, their types and default values. 
It can also parse various native representations of the configuration. Credentials marked with the `WithDefaults` mixin are also able to instantiate themselves from the machine/user default environment, i.e. Google's `default()` or AWS `.aws/credentials`.
 
 As an example, let's use `ConnectionStringCredentials` which represents a database connection string.
 
@@ -421,7 +421,7 @@ In fact for each decorated function a spec is synthesized. In case of `google_sh
 @configspec
 class GoogleSheetsConfiguration:
     tab_names: List[str] = None # manadatory
-    credentials: GcpClientCredentialsWithDefault = None # mandatory secret
+    credentials: GcpServiceAccountCredentials = None # mandatory secret
     only_strings: Optional[bool] = False
 ```
diff --git a/docs/website/docs/dlt-ecosystem/destinations/duckdb.md b/docs/website/docs/dlt-ecosystem/destinations/duckdb.md
index c5e9dd1f14..40bfa8f5ef 100644
--- a/docs/website/docs/dlt-ecosystem/destinations/duckdb.md
+++ b/docs/website/docs/dlt-ecosystem/destinations/duckdb.md
@@ -84,7 +84,7 @@ p = dlt.pipeline(pipeline_name='chess', destination='duckdb', dataset_name='ches
 
 This destination accepts database connection strings in format used by [duckdb-engine](https://github.com/Mause/duckdb_engine#configuration).
 
-You can configure a DuckDB destination with [secret / config values](../../general-usage/credentials.md) (e.g. using a `secrets.toml` file)
+You can configure a DuckDB destination with [secret / config values](../../general-usage/credentials) (e.g. using a `secrets.toml` file)
 ```toml
 destination.duckdb.credentials=duckdb:///_storage/test_quack.duckdb
 ```
diff --git a/docs/website/docs/general-usage/configuration.md b/docs/website/docs/general-usage/configuration.md
deleted file mode 100644
index d72c7976f2..0000000000
--- a/docs/website/docs/general-usage/configuration.md
+++ /dev/null
@@ -1,4 +0,0 @@
-# Configuration
-
-This page is a work in progress.
If you have a question about configuration, please send us an email
-at community@dlthub.com. We'd be happy to help you!
diff --git a/docs/website/docs/general-usage/credentials/config_providers.md b/docs/website/docs/general-usage/credentials/config_providers.md
new file mode 100644
index 0000000000..b3da2979a9
--- /dev/null
+++ b/docs/website/docs/general-usage/credentials/config_providers.md
@@ -0,0 +1,146 @@
+---
+title: Configuration Providers
+description: Configuration providers for dlt
+keywords: [credentials, secrets.toml, secrets, config, configuration, environment
+  variables, provider]
+---
+
+# Configuration Providers
+
+Configuration providers in the context of the `dlt` library
+refer to the different sources from which configuration values
+and secrets can be retrieved for a data pipeline.
+These providers form a hierarchy, with each having its own
+priority in determining the values for function arguments.
+
+## The provider hierarchy
+
+If a function signature has arguments that may be injected, `dlt` looks for the argument values in
+providers.
+
+### Providers
+
+1. **Environment Variables**: At the top of the hierarchy are environment variables.
+   If a value for a specific argument is found in an environment variable,
+   `dlt` will use it and will not proceed to search in lower-priority providers.
+
+2. **Vaults (Airflow/Google/AWS/Azure)**: These are specialized providers that come
+   after environment variables. They can provide configuration values and secrets.
+   However, they typically focus on handling sensitive information.
+
+3. **`secrets.toml` and `config.toml` Files**: These files are used for storing both
+   configuration values and secrets. `secrets.toml` is dedicated to sensitive information,
+   while `config.toml` contains non-sensitive configuration data.
+
+4. **Default Argument Values**: These are the values specified in the function's signature.
+   They have the lowest priority in the provider hierarchy.
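The lookup order above can be modeled as a small toy resolver. This is illustrative only; `resolve_value` and its arguments are hypothetical and not part of the `dlt` API:

```python
import os

def resolve_value(key: str, toml_secrets: dict, toml_config: dict, default=None):
    # Toy model of the provider hierarchy: environment variables win,
    # then secrets.toml, then config.toml, then the default argument value.
    env_key = key.replace(".", "__").upper()
    if env_key in os.environ:
        return os.environ[env_key]
    for store in (toml_secrets, toml_config):
        if key in store:
            return store[key]
    return default

# The environment variable takes precedence over the TOML value.
os.environ["SPREADSHEET_ID"] = "from-env"
print(resolve_value("spreadsheet_id", {}, {"spreadsheet_id": "from-toml"}))  # from-env
```

The real resolution also involves sections and vault providers, but the precedence principle is the same.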
+
+### Example
+
+```python
+@dlt.source
+def google_sheets(
+    spreadsheet_id=dlt.config.value,
+    tab_names=dlt.config.value,
+    credentials=dlt.secrets.value,
+    only_strings=False
+):
+    sheets = build('sheets', 'v4', credentials=Services.from_json(credentials))
+    tabs = []
+    for tab_name in tab_names:
+        data = sheets.get(spreadsheet_id, tab_name).execute().values()
+        tabs.append(dlt.resource(data, name=tab_name))
+    return tabs
+```
+
+In the case of `google_sheets()`, it will look
+for: `spreadsheet_id`, `tab_names` and `credentials`.
+
+Each provider has its own key naming convention, and `dlt` is able to translate between them.
+
+**The argument name is a key in the lookup**.
+
+At the top of the hierarchy are Environment Variables, then `secrets.toml` and
+`config.toml` files. Providers like Airflow/Google/AWS/Azure Vaults will be inserted **after** the Environment
+provider but **before** TOML providers.
+
+For example, if `spreadsheet_id` is found in the environment variable `SPREADSHEET_ID`, `dlt` will not look in TOML files
+or in any lower-priority providers.
+
+The values passed in the code **explicitly** are the **highest** in the provider hierarchy. The **default values**
+of the arguments have the **lowest** priority in the provider hierarchy.
+
+:::info
+Explicit Args **>** ENV Variables **>** Vaults: Airflow etc. **>** `secrets.toml` **>** `config.toml` **>** Default Arg Values
+:::
+
+Secrets are handled only by the providers supporting them. Some providers support only
+secrets (to reduce the number of requests done by `dlt` when searching sections).
+
+1. `secrets.toml` and environment may hold both config and secret values.
+1. `config.toml` may hold only config values, no secrets.
+1. Various vault providers hold only secrets; `dlt` skips them when looking for values that are not
+   secrets.
+
+:::info
+Context-aware providers will activate in the right environments, e.g. on Airflow or AWS/GCP VMs.
+:::
+
+## Provider key formats
+
+### TOML vs.
Environment Variables
+
+Providers may use different formats for the keys. `dlt` will translate the standard format, where
+sections and key names are separated by ".", into the provider-specific formats.
+
+1. For TOML, names are case-sensitive and sections are separated with ".".
+1. For Environment Variables, all names are capitalized and sections are separated with a double
+   underscore "__".
+
+Example: When `dlt` evaluates the request `dlt.secrets["my_section.gcp_credentials"]`, it must find
+the `private_key` for Google credentials. It will look:
+
+1. first in the env variable `MY_SECTION__GCP_CREDENTIALS__PRIVATE_KEY` and, if not found,
+1. in `secrets.toml` with the key `my_section.gcp_credentials.private_key`.
+
+### Environment provider
+
+Looks for the values in the environment variables.
+
+### TOML provider
+
+The TOML provider in `dlt` utilizes two TOML files:
+
+- `secrets.toml` - This file is intended for storing sensitive information, often referred to as "secrets".
+- `config.toml` - This file is used for storing configuration values.
+
+By default, the `.gitignore` file in the project prevents `secrets.toml` from being added to
+version control and pushed. However, `config.toml` can be freely added to version control.
+
+:::info
+**The TOML provider always loads those files from the `.dlt` folder**, which is looked up **relative to the
+current Working Directory**.
+:::
+
+Example: If your working directory is `my_dlt_project` and your project has the following structure:
+
+```
+my_dlt_project:
+  |
+  pipelines/
+    |---- .dlt/secrets.toml
+    |---- google_sheets.py
+```
+
+and you run `python pipelines/google_sheets.py`, then `dlt` will look for `secrets.toml` in
+`my_dlt_project/.dlt/secrets.toml` and ignore the existing
+`my_dlt_project/pipelines/.dlt/secrets.toml`.
+
+If you change your working directory to `pipelines` and run `python google_sheets.py`, it will look for
+`my_dlt_project/pipelines/.dlt/secrets.toml` as (probably) expected.
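A quick way to debug which `secrets.toml` a run will pick up is to print the locations the provider checks. A minimal sketch based on the working-directory rule above, plus the global home-directory location; `candidate_secrets_paths` is a hypothetical helper, not a `dlt` API:

```python
from pathlib import Path

def candidate_secrets_paths() -> list:
    # Sketch: the project-level location is resolved relative to the
    # current working directory; a global fallback lives in ~/.dlt/.
    return [
        Path.cwd() / ".dlt" / "secrets.toml",
        Path.home() / ".dlt" / "secrets.toml",
    ]

for path in candidate_secrets_paths():
    print(path)
```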
+
+:::caution
+It's worth mentioning that the TOML provider also has the capability to read files from `~/.dlt/`
+(located in the user's home directory) in addition to the local project-specific `.dlt` folder.
+:::
\ No newline at end of file
diff --git a/docs/website/docs/general-usage/credentials/config_specs.md b/docs/website/docs/general-usage/credentials/config_specs.md
new file mode 100644
index 0000000000..328d18d2a0
--- /dev/null
+++ b/docs/website/docs/general-usage/credentials/config_specs.md
@@ -0,0 +1,468 @@
+---
+title: Configuration Specs
+description: Overview of configuration specs and how to create custom specs
+keywords: [credentials, secrets.toml, secrets, config, configuration, environment
+  variables, specs]
+---
+
+# Configuration Specs
+
+Configuration Specs in `dlt` are Python dataclasses that define how complex configuration values,
+particularly credentials, should be handled.
+They specify the types, defaults, and parsing methods for these values.
+
+## Working with credentials (and other complex configuration values)
+
+For example, a spec like `GcpServiceAccountCredentials` manages Google Cloud Platform
+service account credentials, while `ConnectionStringCredentials` handles database connection strings.
+
+### Example
+
+As an example, let's use `ConnectionStringCredentials` which represents a database connection
+string.
+
+```python
+import dlt
+from dlt.sources.credentials import ConnectionStringCredentials
+
+@dlt.source
+def query(sql: str, dsn: ConnectionStringCredentials = dlt.secrets.value):
+    ...
+```
+
+The source above executes the `sql` against the database defined in `dsn`. `ConnectionStringCredentials`
+makes sure you get the correct values with the correct types and understands the relevant native form of
+the credentials.
+
+Below are examples of how you can set credentials in `secrets.toml` and `config.toml` files.
+
+Example 1. Use the **dictionary** form.
+ +```toml +[dsn] +database="dlt_data" +password="loader" +username="loader" +host="localhost" +``` + +Example 2. Use the **native** form. + +```toml +dsn="postgres://loader:loader@localhost:5432/dlt_data" +``` + +Example 3. Use the **mixed** form: the password is missing in explicit dsn and will be taken from the +`secrets.toml`. + +```toml +dsn.password="loader" +``` + +You can explicitly provide credentials in various forms: + +```python +query("SELECT * FROM customers", "postgres://loader@localhost:5432/dlt_data") +# or +query("SELECT * FROM customers", {"database": "dlt_data", "username": "loader"...}) +``` + +## Built in credentials + +We have some ready-made credentials you can reuse: + +```python +from dlt.sources.credentials import ConnectionStringCredentials +from dlt.sources.credentials import OAuth2Credentials +from dlt.sources.credentials import GcpServiceAccountCredentials, GcpOAuthCredentials +from dlt.sources.credentials import AwsCredentials +from dlt.sources.credentials import AzureCredentials +``` + +### ConnectionStringCredentials + +The `ConnectionStringCredentials` class handles connection string +credentials for SQL database connections. +It includes attributes for the driver name, database name, username, password, host, port, +and additional query parameters. +This class provides methods for parsing and generating connection strings. 
+
+#### Usage
+```python
+credentials = ConnectionStringCredentials()
+
+# Set the necessary attributes
+credentials.drivername = "postgresql"
+credentials.database = "my_database"
+credentials.username = "my_user"
+credentials.password = "my_password"
+credentials.host = "localhost"
+credentials.port = 5432
+
+# Convert credentials to connection string
+connection_string = credentials.to_native_representation()
+
+# Parse a connection string and update credentials
+native_value = "postgresql://my_user:my_password@localhost:5432/my_database"
+credentials.parse_native_representation(native_value)
+
+# Get a URL representation of the connection
+url_representation = credentials.to_url()
+```
+Above, you can find an example of how to use this spec with sources and TOML files.
+
+### OAuth2Credentials
+
+The `OAuth2Credentials` class handles OAuth 2.0 credentials, including client ID,
+client secret, refresh token, and access token.
+It also allows for the addition of scopes and provides methods for client authentication.
+
+Usage:
+```python
+credentials = OAuth2Credentials(
+    client_id="CLIENT_ID",
+    client_secret="CLIENT_SECRET",
+    refresh_token="REFRESH_TOKEN",
+    scopes=["scope1", "scope2"]
+)
+
+# Authorize the client
+credentials.auth()
+
+# Add additional scopes
+credentials.add_scopes(["scope3", "scope4"])
+```
+
+`OAuth2Credentials` is a base class for implementing concrete OAuth flows; for example,
+it is the base class for [GcpOAuthCredentials](#gcpoauthcredentials).
+
+### GCP Credentials
+
+- [GcpServiceAccountCredentials](#gcpserviceaccountcredentials).
+- [GcpOAuthCredentials](#gcpoauthcredentials).
+
+[Google Analytics verified source](https://github.com/dlt-hub/verified-sources/blob/master/sources/google_analytics/__init__.py):
+an example of how to use GCP Credentials.
+
+#### GcpServiceAccountCredentials
+
+The `GcpServiceAccountCredentials` class manages GCP Service Account credentials.
+This class provides methods to retrieve native credentials for Google clients.
+
+##### Usage
+
+- You may just pass the `service.json` as a string or dictionary (in code and via config providers).
+- Or default credentials will be used.
+
+```python
+credentials = GcpServiceAccountCredentials()
+# Parse a native value (ServiceAccountCredentials)
+# Accepts a native value, which can be either an instance of ServiceAccountCredentials
+# or a serialized services.json.
+# Parses the native value and updates the credentials.
+native_value = {"private_key": ".."} # or "path/to/services.json"
+credentials.parse_native_representation(native_value)
+```
+or, the preferred use:
+```python
+import dlt
+from dlt.sources.credentials import GcpServiceAccountCredentials
+
+@dlt.source
+def google_analytics(
+    property_id: str = dlt.config.value,
+    credentials: GcpServiceAccountCredentials = dlt.secrets.value,
+):
+    # Retrieve native credentials for Google clients
+    # For example, build the service object for the Google Analytics API.
+    client = BetaAnalyticsDataClient(credentials=credentials.to_native_credentials())
+
+    # Get a string representation of the credentials
+    # Returns a string representation of the credentials in the format client_email@project_id.
+    credentials_str = str(credentials)
+    ...
+```
+while `secrets.toml` looks as follows:
+```toml
+[sources.google_analytics.credentials]
+client_email = "client_email" # please set me up!
+private_key = "private_key" # please set me up!
+project_id = "project_id" # please set me up!
+```
+and `config.toml`:
+```toml
+[sources.google_analytics]
+property_id = "213025502"
+```
+
+##### Usage
+```python
+oauth_credentials = GcpOAuthCredentials()
+
+# Accepts a native value, which can be either an instance of GoogleOAuth2Credentials
+# or serialized OAuth client secrets JSON.
+# Parses the native value and updates the credentials.
+native_value_oauth = {"client_secret": ...}
+oauth_credentials.parse_native_representation(native_value_oauth)
+```
+or, the preferred use:
+```python
+import dlt
+from dlt.sources.credentials import GcpOAuthCredentials
+
+@dlt.source
+def google_analytics(
+    property_id: str = dlt.config.value,
+    credentials: GcpOAuthCredentials = dlt.secrets.value,
+):
+    # Authenticate and get access token
+    credentials.auth(scopes=["scope1", "scope2"])
+
+    # Retrieve native credentials for Google clients
+    # For example, build the service object for the Google Analytics API.
+    client = BetaAnalyticsDataClient(credentials=credentials.to_native_credentials())
+
+    # Get a string representation of the credentials
+    # Returns a string representation of the credentials in the format client_id@project_id.
+    credentials_str = str(credentials)
+    ...
+```
+while `secrets.toml` looks as follows:
+```toml
+[sources.google_analytics.credentials]
+client_id = "client_id" # please set me up!
+client_secret = "client_secret" # please set me up!
+refresh_token = "refresh_token" # please set me up!
+project_id = "project_id" # please set me up!
+```
+and `config.toml`:
+```toml
+[sources.google_analytics]
+property_id = "213025502"
+```
+
+In order for the `auth()` method to succeed:
+
+- You must provide valid `client_id`, `client_secret`,
+  `refresh_token` and `project_id` in order to get a current
+  **access token** and authenticate with OAuth.
+  Mind that the `refresh_token` must contain all the scopes that you require for your access.
+- If `refresh_token` is not provided, and you run the pipeline from a console or a notebook,
+  `dlt` will use InstalledAppFlow to run the desktop authentication flow.
+ +[Google Analytics example](https://github.com/dlt-hub/verified-sources/blob/master/sources/google_analytics/setup_script_gcp_oauth.py): how you can get the refresh token using `dlt.secrets.value`. + +#### Defaults + +If configuration values are missing, `dlt` will use the default Google credentials (from `default()`) if available. +Read more about [Google defaults.](https://googleapis.dev/python/google-auth/latest/user-guide.html#application-default-credentials) + +- `dlt` will try to fetch the `project_id` from default credentials. + If the project id is missing, it will look for `project_id` in the secrets. + So it is normal practice to pass partial credentials (just `project_id`) and take the rest from defaults. + +### AwsCredentials + +The `AwsCredentials` class is responsible for handling AWS credentials, +including access keys, session tokens, profile names, region names, and endpoint URLs. +It inherits the ability to manage default credentials and extends it with methods +for handling partial credentials and converting credentials to a botocore session. + +#### Usage +```python +credentials = AwsCredentials() +# Set the necessary attributes +credentials.aws_access_key_id = "ACCESS_KEY_ID" +credentials.aws_secret_access_key = "SECRET_ACCESS_KEY" +credentials.region_name = "us-east-1" +``` +or +```python +# Imports an external boto3 session and sets the credentials properties accordingly. +import botocore.session + +credentials = AwsCredentials() +session = botocore.session.get_session() +credentials.parse_native_representation(session) +print(credentials.aws_access_key_id) +``` +or more preferred use: +```python +@dlt.source +def aws_readers( + bucket_url: str = dlt.config.value, + credentials: AwsCredentials = dlt.secrets.value, +): + ... 
+    # Convert credentials to s3fs format
+    s3fs_credentials = credentials.to_s3fs_credentials()
+    print(s3fs_credentials["key"])
+
+    # Get AWS credentials from botocore session
+    aws_credentials = credentials.to_native_credentials()
+    print(aws_credentials.access_key)
+    ...
+```
+while `secrets.toml` looks as follows:
+```toml
+[sources.aws_readers.credentials]
+aws_access_key_id = "key_id"
+aws_secret_access_key = "access_key"
+region_name = "region"
+```
+and `config.toml`:
+```toml
+[sources.aws_readers]
+bucket_url = "bucket_url"
+```
+
+#### Defaults
+
+If configuration is not provided, `dlt` uses the default AWS credentials (from `.aws/credentials`) as present on the machine:
+- It works by creating an instance of a botocore Session.
+- If `profile_name` is specified, the credentials for that profile are used.
+  If not, the default profile is used.
+
+### AzureCredentials
+
+The `AzureCredentials` class is responsible for handling Azure Blob Storage credentials,
+including account name, account key, Shared Access Signature (SAS) token, and SAS token permissions.
+It inherits the ability to manage default credentials and extends it with methods for
+handling partial credentials and converting credentials to a format suitable
+for interacting with Azure Blob Storage using the `adlfs` library.
+
+#### Usage
+```python
+credentials = AzureCredentials()
+# Set the necessary attributes
+credentials.azure_storage_account_name = "ACCOUNT_NAME"
+credentials.azure_storage_account_key = "ACCOUNT_KEY"
+```
+or, the preferred use:
+```python
+@dlt.source
+def azure_readers(
+    bucket_url: str = dlt.config.value,
+    credentials: AzureCredentials = dlt.secrets.value,
+):
+    ...
+    # Generate a SAS token
+    credentials.create_sas_token()
+    print(credentials.azure_storage_sas_token)
+
+    # Convert credentials to adlfs format
+    adlfs_credentials = credentials.to_adlfs_credentials()
+    print(adlfs_credentials["account_name"])
+
+    # to_native_credentials() is not yet implemented
+    ...
+```
+while `secrets.toml` looks as follows:
+```toml
+[sources.azure_readers.credentials]
+azure_storage_account_name = "account_name"
+azure_storage_account_key = "account_key"
+```
+and `config.toml`:
+```toml
+[sources.azure_readers]
+bucket_url = "bucket_url"
+```
+#### Defaults
+
+If configuration is not provided, `dlt` uses the default credentials obtained via `DefaultAzureCredential`.
+
+## Working with alternatives of credentials (Union types)
+
+If your source/resource allows for many authentication methods, you can support those seamlessly for
+your user. The user just passes the right credentials and `dlt` will inject the right type into your
+decorated function.
+
+Example:
+
+```python
+@dlt.source
+def zen_source(credentials: Union[ZenApiKeyCredentials, ZenEmailCredentials, str] = dlt.secrets.value, some_option: bool = False):
+    # depending on what the user provides in config, ZenApiKeyCredentials or ZenEmailCredentials will be injected into the `credentials` argument
+    # both classes implement `auth` so you can always call it
+    credentials.auth()
+    return dlt.resource([credentials], name="credentials")

+# pass native value
+os.environ["CREDENTIALS"] = "email:mx:pwd"
+assert list(zen_source())[0].email == "mx"
+
+# pass explicit native value
+assert list(zen_source("secret:🔑:secret"))[0].api_secret == "secret"
+
+# pass explicit dict
+assert list(zen_source(credentials={"email": "emx", "password": "pass"}))[0].email == "emx"
+```
+
+> This applies not only to credentials but to all specs (see next chapter).
+
+Read the [whole test](https://github.com/dlt-hub/dlt/blob/devel/tests/common/configuration/test_spec_union.py); it shows how to create unions
+of credentials that derive from a common class, so you can handle them seamlessly in your code.
+
+## Writing custom specs
+
+**Specs** let you take full control over the function arguments:
+
+- Which values should be injected, their types, and default values.
+- You can specify optional and final fields.
+- Form hierarchical configurations (specs in specs).
+- Provide your own handlers for `on_partial` (called before failing on a missing config key) or `on_resolved`.
+- Provide your own native value parsers.
+- Provide your own default credentials logic.
+- Use all Python dataclass goodies.
+- Use all Python `dict` goodies (spec instances can be created from dicts and serialized
+  to dicts).
+
+This is used a lot in the `dlt` core and may become useful for complicated sources.
+
+In fact, for each decorated function a spec is synthesized. In the case of `google_sheets`, the following
+class is created:
+
+```python
+from dlt.sources.config import configspec, with_config
+
+@configspec
+class GoogleSheetsConfiguration(BaseConfiguration):
+    tab_names: List[str] = None # mandatory
+    credentials: GcpServiceAccountCredentials = None # mandatory secret
+    only_strings: Optional[bool] = False
+```
+
+### All specs derive from [BaseConfiguration](https://github.com/dlt-hub/dlt/blob/devel/dlt/common/configuration/specs/base_configuration.py#L170)
+
+This class serves as a foundation for creating configuration objects with specific characteristics:
+
+- It provides methods to parse and represent the configuration
+  in native form (`parse_native_representation` and `to_native_representation`).
+
+- It defines methods for accessing and manipulating configuration fields.
+
+- It implements a dictionary-compatible interface on top of the dataclass.
+  This allows instances of this class to be treated like dictionaries.
+
+- It defines helper functions for checking if a certain attribute is present,
+  if a field is valid, and for calling methods in the method resolution order (MRO).
+
+More information about this class can be found in the class docstrings.
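The core idea of a spec (typed fields plus a parser for a native representation) can be approximated with plain dataclasses. The sketch below is conceptual, not the real `dlt` base classes; `ZenApiKeyCredentials` echoes the hypothetical union example earlier and assumes a `"secret:<api_key>:<api_secret>"` native form:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ZenApiKeyCredentials:
    # Typed fields with defaults, as in a dlt spec
    api_key: Optional[str] = None
    api_secret: Optional[str] = None

    def parse_native_representation(self, native_value: str) -> None:
        # Accept a one-string "native" form: "secret:<api_key>:<api_secret>"
        scheme, key, secret = native_value.split(":", 2)
        if scheme != "secret":
            raise ValueError(f"not a zen api key: {native_value!r}")
        self.api_key, self.api_secret = key, secret

creds = ZenApiKeyCredentials()
creds.parse_native_representation("secret:my_key:my_secret")
print(creds.api_key)  # my_key
```

A real spec would additionally participate in the provider lookup and validation machinery described above.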
+ +### All credentials derive from [CredentialsConfiguration](https://github.com/dlt-hub/dlt/blob/devel/dlt/common/configuration/specs/base_configuration.py#L307) + +This class is a subclass of `BaseConfiguration` +and is meant to serve as a base class for handling various types of credentials. +It defines methods for initializing credentials, converting them to native representations, +and generating string representations while ensuring sensitive information is appropriately handled. + +More information about this class can be found in the class docstrings. \ No newline at end of file diff --git a/docs/website/docs/general-usage/credentials/configuration.md b/docs/website/docs/general-usage/credentials/configuration.md new file mode 100644 index 0000000000..a92fb6fd0c --- /dev/null +++ b/docs/website/docs/general-usage/credentials/configuration.md @@ -0,0 +1,462 @@ +--- +title: Secrets and Configs +description: Overview secrets and configs +keywords: [credentials, secrets.toml, secrets, config, configuration, environment + variables] +--- + +# Secrets and Configs + +Secrets and configs are two types of sensitive and non-sensitive information used in a data pipeline: + +1. **Configs**: + - Configs refer to non-sensitive configuration data. These are settings, parameters, or options that define the behavior of a data pipeline. + - They can include things like file paths, database connection strings, API endpoints, or any other settings that affect the pipeline's behavior. +2. **Secrets**: + - Secrets are sensitive information that should be kept confidential, such as passwords, API keys, private keys, and other confidential data. + - It's crucial to never hard-code secrets directly into the code, as it can pose a security risk. Instead, they should be stored securely and accessed via a secure mechanism. + + +**Design Principles**: + +1. Adding configuration and secrets to [sources](../source) and [resources](../resource) should be no-effort. +2. 
You can reconfigure the pipeline for production after it is deployed. Deployed and local code should + be identical. +3. You can always pass configuration values explicitly and override any default behavior (i.e. naming of the configuration keys). + +We invite you to learn how `dlt` helps you adhere to +these principles and easily operate secrets and +configurations using `dlt.secrets.value` and `dlt.config.value` instances. + +## General Usage and an Example + +In the example below, the `google_sheets` source function is used to read selected tabs from Google Sheets. +It takes several arguments, including `spreadsheet_id`, `tab_names`, and `credentials`. + +```python +@dlt.source +def google_sheets( + spreadsheet_id=dlt.config.value, + tab_names=dlt.config.value, + credentials=dlt.secrets.value, + only_strings=False +): + sheets = build('sheets', 'v4', credentials=Services.from_json(credentials)) + tabs = [] + for tab_name in tab_names: + data = sheets.get(spreadsheet_id, tab_name).execute().values() + tabs.append(dlt.resource(data, name=tab_name)) + return tabs +``` + +- `spreadsheet_id`: The unique identifier of the Google Sheets document. +- `tab_names`: A list of tab names to read from the spreadsheet. +- `credentials`: Google Sheets credentials as a dictionary ({"private_key": ...}). +- `only_strings`: Flag to specify if only string data should be retrieved. + +`spreadsheet_id` and `tab_names` are configuration values that can be provided directly +when calling the function. `credentials` is a sensitive piece of information. + +`dlt.secrets.value` and `dlt.config.value` are instances of classes that provide +dictionary-like access to configuration values and secrets, respectively. +These objects allow for convenient retrieval and modification of configuration +values and secrets used by the application. + +Below, we will demonstrate the correct and wrong approaches to providing these values. 
+
+### Wrong approach
+The wrong approach includes providing secret values directly in the code,
+which is not recommended for security reasons.
+
+```python
+# WRONG!:
+# provide all values directly - wrong but possible.
+# secret values should never be present in the code!
+data_source = google_sheets(
+    "23029402349032049",
+    ["tab1", "tab2"],
+    credentials={"private_key": ""}
+)
+```
+:::caution
+Be careful not to put your credentials directly in the code.
+:::
+
+### Correct approach
+
+The correct approach involves providing config values directly and secrets via
+the automatic [injection mechanism](#injection-mechanism),
+or passing everything via configuration.
+
+1. Option A
+   ```python
+   # `only_strings` will get the default value False
+   data_source = google_sheets("23029402349032049", ["tab1", "tab2"])
+   ```
+   `credentials` value will be injected by the `@source` decorator (e.g. from `secrets.toml`).
+
+   `spreadsheet_id` and `tab_names` take values from the provided arguments.
+
+2. Option B
+   ```python
+   # `only_strings` will get the default value False
+   data_source = google_sheets()
+   ```
+   `credentials` value will be injected by the `@source` decorator (e.g. from `secrets.toml`).
+
+   `spreadsheet_id` and `tab_names` will also be injected by the `@source` decorator (e.g. from `config.toml`).
+
+We use `dlt.secrets.value` and `dlt.config.value` to set secrets and configurations via:
+- [TOML files](config_providers#toml-provider) (`secrets.toml` & `config.toml`):
+  ```toml
+  [sources.google_sheets.credentials]
+  client_email =
+  private_key =
+  project_id =
+  ```
+  Read more about [TOML layouts](#secret-and-config-values-layout-and-name-lookup).
+- [Environment Variables](config_providers#environment-provider):
+  ```shell
+  SOURCES__GOOGLE_SHEETS__CREDENTIALS__CLIENT_EMAIL
+  SOURCES__GOOGLE_SHEETS__CREDENTIALS__PRIVATE_KEY
+  SOURCES__GOOGLE_SHEETS__CREDENTIALS__PROJECT_ID
+  ```
+
+:::caution
+**The [TOML provider](config_providers#toml-provider) always loads the `secrets.toml` and `config.toml` files from the `.dlt` folder**, which is looked up relative to the
+**current [Working Directory](https://en.wikipedia.org/wiki/Working_directory)**. The TOML provider can also read files from `~/.dlt/`
+(located in the user's [Home Directory](https://en.wikipedia.org/wiki/Home_directory)).
+:::
+
+
+### Add typing to your sources and resources
+
+We highly recommend adding types to your function signatures.
+The effort is very low, and it gives `dlt` much more
+information on what the source/resource expects.
+
+Doing so provides several benefits:
+
+1. You'll never receive invalid data types in your code.
+1. We can generate nice sample config and secret files for your source.
+1. You can request dictionaries or special values (e.g. connection strings, service json) to be
+   passed.
+1. You can specify a set of possible types via `Union`, e.g. OAuth or API key authorization.
+
+```python
+@dlt.source
+def google_sheets(
+    spreadsheet_id: str = dlt.config.value,
+    tab_names: List[str] = dlt.config.value,
+    credentials: GcpServiceAccountCredentials = dlt.secrets.value,
+    only_strings: bool = False
+):
+    ...
+```
+
+Now:
+
+1. You are sure that you get a list of strings as `tab_names`.
+1. You will get actual Google credentials (see [GCP Credential Configuration](config_specs#gcp-credentials)), and your users can
+   pass them in many different forms.
+
+In case of `GcpServiceAccountCredentials`:
+
+- You may just pass the `service.json` as a string or a dictionary (in code and via config providers).
+- You may pass a connection string (as used in SQLAlchemy) (in code and via config providers).
+- Or the default credentials will be used.
+
+### Pass config values and credentials explicitly
+We suggest a [default layout](#default-layout-and-default-key-lookup-during-injection) of secret and config values, but you can fully ignore it and use your own:
+
+```python
+# use `dlt.secrets` and `dlt.config` to explicitly retrieve
+# those values from the providers under explicit keys
+data_source = google_sheets(
+    dlt.config["sheet_id"],
+    dlt.config["my_section.tabs"],
+    dlt.secrets["my_section.gcp_credentials"]
+)
+
+data_source.run(destination="bigquery")
+```
+`dlt.config` and `dlt.secrets` behave like dictionaries from which you can request a value with any key name. `dlt` will look in all [config providers](#injection-mechanism) - TOML files, env variables etc. - just like it does with the standard key name layout. You can also use `dlt.config.get()` / `dlt.secrets.get()` to
+request a value cast to a desired type. For example:
+```python
+credentials = dlt.secrets.get("my_section.gcp_credentials", GcpServiceAccountCredentials)
+```
+This creates a `GcpServiceAccountCredentials` instance out of the values (typically a dictionary) under the **my_section.gcp_credentials** key.
+
+See [example](https://github.com/dlt-hub/dlt/blob/devel/docs/examples/archive/credentials/explicit.py).
+
+### Pass credentials as code
+
+You can see that the `google_sheets` source expects `credentials`. So you could pass them as below.
+
+```python
+from airflow.hooks.base_hook import BaseHook
+
+# get it from airflow connections or another credential store
+credentials = BaseHook.get_connection('gcp_credentials').extra
+data_source = google_sheets(credentials=credentials)
+```
+:::caution
+Be careful not to put your credentials directly in code - use your own credential vault instead.
+:::
+
+### Pass explicit destination credentials
+You can pass destination credentials and ignore the default lookup:
+```python
+pipeline = dlt.pipeline(destination="postgres", credentials=dlt.secrets["postgres_dsn"])
+```
+
+## Injection mechanism
+
+Config and secret values are injected into the function arguments if the function is decorated with
+`@dlt.source` or `@dlt.resource` (also `@with_config`, which you can apply to any function - used
+heavily in the dlt core).
+
+The signature of the function `google_sheets` **explicitly accepts all the necessary configuration and secrets in its arguments**.
+During runtime, `dlt` tries to supply (`inject`) the required values via various config providers.
+
+The injection rules are:
+
+1. If you call the decorated function, the arguments that are passed explicitly are **never injected**;
+   this makes the injection mechanism optional.
+
+1. Required arguments (i.e. those without default values) are not injected and must be passed explicitly.
+
+1. Arguments with default values are injected if present in config providers; otherwise the default is used.
+
+1. Arguments with the special default values `dlt.secrets.value` and `dlt.config.value` **must be injected**
+   (or explicitly passed). If they are not found by the config providers, the code raises an
+   exception. The code in the functions always receives those arguments.
+
+Additionally, `dlt.secrets.value` tells `dlt` that the supplied value is a secret, and it will be injected
+only from secure config providers.
+
+## Secret and config values layout and name lookup
+
+`dlt` uses a layout of hierarchical sections to organize the config and secret values. This makes
+configurations and secrets easy to manage and disambiguates values with the same keys by placing
+them in different sections.
+
+:::note
+If you know how TOML files are organized -> this is the same concept!
+:::
+
+A lot of config values are dictionaries themselves (i.e.
most of the credentials) and you want the
+values corresponding to one component to be close together.
+
+You can have separate credentials for your destination and for each of the sources your pipeline uses.
+If you have many pipelines in a single project, you can group them in separate sections.
+
+Here is the simplest default layout for our `google_sheets` example.
+
+### OPTION A (default layout)
+
+**secrets.toml**
+
+```toml
+[credentials]
+client_email =
+private_key =
+project_id =
+```
+
+**config.toml**
+
+```toml
+tab_names=["tab1", "tab2"]
+```
+
+As you can see, the details of the GCP credentials are placed under `credentials`, which is the argument name
+of the source function.
+
+### OPTION B (explicit layout)
+
+Here the user has full control over the layout.
+
+**secrets.toml**
+
+```toml
+[my_section]
+
+  [my_section.gcp_credentials]
+  client_email =
+  private_key =
+```
+
+**config.toml**
+
+```toml
+[my_section]
+tabs=["tab1", "tab2"]
+
+  [my_section.gcp_credentials]
+  # I prefer to keep my project id in config file and private key in secrets
+  project_id =
+```
+
+### Default layout and default key lookup during injection
+
+`dlt` arranges the sections into a **default layout** that is expected by the injection mechanism. This layout
+makes it easy to configure simple cases but also provides room for more explicit sections and
+complex cases, e.g. having several sources with different credentials or even hosting several pipelines
+in the same project sharing the same config and credentials.
+
+```
+pipeline_name
+    |
+    |-sources
+        |-<source module name>
+            |-<source function name>
+                |- {all source and resource options and secrets}
+            |-<source function name>
+                |- {all source and resource options and secrets}
+        |-<source module name>
+        |...
+
+    |-extract
+        |- extract options for resources ie. parallelism settings, maybe retries
+    |-destination
+        |-<destination name>
+            |- {destination options}
+            |-credentials
+                |-{credentials options}
+    |-schema
+        |-<schema name>
+            |-schema settings: not implemented but I'll let people set nesting level, name convention, normalizer etc.
here
+    |-load
+    |-normalize
+```
+
+Lookup rules:
+
+**Rule 1:** All the sections above are optional. You are free to arrange your credentials and config
+without any additional sections.
+
+**Rule 2:** The lookup starts with the most specific possible path, and if the value is not found there,
+it removes the right-most section and tries again.
+
+Example: In the case of option A we have just one set of credentials.
+But what if the `bigquery` credentials are different from the `google sheets` ones? Then we need
+some sections to separate them.
+
+```toml
+# google sheet credentials
+[credentials]
+client_email =
+private_key =
+project_id =
+
+# bigquery credentials
+[destination.credentials]
+client_email =
+private_key =
+project_id =
+```
+
+Now when `dlt` looks for the destination credentials, it will encounter the `destination` section and
+stop there. When looking for the `sources` credentials, it will go directly to the `credentials` key
+(corresponding to the function argument).
+
+> We could also rename the argument in the source function! But then we would be **forcing** the user to
+> keep two copies of the credentials.
+
+Example: let's be even more explicit and use the fullest section path possible.
+
+```toml
+# google sheet credentials
+[sources.google_sheets.credentials]
+client_email =
+private_key =
+project_id =
+
+# bigquery credentials
+[destination.bigquery.credentials]
+client_email =
+private_key =
+project_id =
+```
+
+Here we add the destination and source names to be fully explicit.
+
+**Rule 3:** You can use your pipeline name to have separate configurations for each pipeline in your
+project.
+
+A pipeline created/obtained with `dlt.pipeline()` creates a global and optional namespace with the
+value of `pipeline_name`. All config values will be looked up with the pipeline name first and then again
+without it.
+
+Example: the pipeline is named `ML_sheets`.
+
+```toml
+[ML_sheets.credentials]
+client_email =
+private_key =
+project_id =
+```
+
+or use the maximum path:
+
+```toml
+[ML_sheets.sources.google_sheets.credentials]
+client_email =
+private_key =
+project_id =
+```
+
+### The `sources` section
+
+Config and secrets for decorated sources and resources are kept in the
+`sources.<source module name>.<source function name>` section. **All sections are optional during lookup**. For example,
+if the source module is named `pipedrive` and the function decorated with `@dlt.source` is
+`deals(api_key: str=...)`, then `dlt` will look for the API key in:
+
+1. `sources.pipedrive.deals.api_key`
+1. `sources.pipedrive.api_key`
+1. `sources.api_key`
+1. `api_key`
+
+Step 2 in the search path allows all the sources/resources in a module to share the same set of
+credentials.
+
+Also look at the [following test](https://github.com/dlt-hub/dlt/blob/devel/tests/extract/test_decorators.py#L303), `test_source_sections`.
+
+## Understanding the exceptions
+
+Now we can finally understand the `ConfigFieldMissingException`.
+
+Let's run the `chess.py` example without providing the password:
+
+```
+$ CREDENTIALS="postgres://loader@localhost:5432/dlt_data" python chess.py
+...
+dlt.common.configuration.exceptions.ConfigFieldMissingException: Following fields are missing: ['password'] in configuration with spec PostgresCredentials
+    for field "password" config providers and keys were tried in following order:
+        In Environment Variables key CHESS_GAMES__DESTINATION__POSTGRES__CREDENTIALS__PASSWORD was not found.
+        In Environment Variables key CHESS_GAMES__DESTINATION__CREDENTIALS__PASSWORD was not found.
+        In Environment Variables key CHESS_GAMES__CREDENTIALS__PASSWORD was not found.
+        In secrets.toml key chess_games.destination.postgres.credentials.password was not found.
+        In secrets.toml key chess_games.destination.credentials.password was not found.
+        In secrets.toml key chess_games.credentials.password was not found.
+        In Environment Variables key DESTINATION__POSTGRES__CREDENTIALS__PASSWORD was not found.
+        In Environment Variables key DESTINATION__CREDENTIALS__PASSWORD was not found.
+        In Environment Variables key CREDENTIALS__PASSWORD was not found.
+        In secrets.toml key destination.postgres.credentials.password was not found.
+        In secrets.toml key destination.credentials.password was not found.
+        In secrets.toml key credentials.password was not found.
+Please refer to https://dlthub.com/docs/general-usage/credentials for more information
+```
+
+It tells you exactly which paths `dlt` looked at, via which config providers, and in which order.
+
+In the example above:
+
+1. First it looked in the big section `chess_games`, which is the name of the pipeline.
+1. In each case it starts with the full path and goes down to the minimal path `credentials.password`.
+1. It looks into environment variables first, then into `secrets.toml`. It displays the exact keys tried.
+1. Note that `config.toml` was skipped! It may not contain any secrets.
+
+Read more about [Provider Hierarchy](./config_providers).
\ No newline at end of file
diff --git a/docs/website/docs/general-usage/glossary.md b/docs/website/docs/general-usage/glossary.md
index 38bf4ee01b..fd88bc1e5f 100644
--- a/docs/website/docs/general-usage/glossary.md
+++ b/docs/website/docs/general-usage/glossary.md
@@ -6,7 +6,7 @@ keywords: [glossary, resource, source, pipeline]
 
 # Glossary
 
-## [Source](source.md)
+## [Source](source)
 
 Location that holds data with certain structure. Organized into one or more resources.
 
@@ -17,7 +17,7 @@ Location that holds data with certain structure. Organized into one or more reso
 
 Within this documentation, **source** refers also to the software component (i.e. Python function)
 that **extracts** data from the source location using one or more resource components.
 
-## [Resource](resource.md)
+## [Resource](resource)
 
 A logical grouping of data within a data source, typically holding data of similar structure and origin.
@@ -33,12 +33,12 @@ that **extracts** the data from source location. The data store where data from the source is loaded (e.g. Google BigQuery). -## [Pipeline](pipeline.md) +## [Pipeline](pipeline) Moves the data from the source to the destination, according to instructions provided in the schema (i.e. extracting, normalizing, and loading the data). -## [Verified Source](../walkthroughs/add-a-verified-source.md) +## [Verified Source](../walkthroughs/add-a-verified-source) A Python module distributed with `dlt init` that allows creating pipelines that extract data from a particular **Source**. Such module is intended to be published in order for others to use it to @@ -47,17 +47,17 @@ build pipelines. A source must be published to become "verified": which means that it has tests, test data, demonstration scripts, documentation and the dataset produces was reviewed by a data engineer. -## [Schema](schema.md) +## [Schema](schema) Describes the structure of normalized data (e.g. unpacked tables, column types, etc.) and provides instructions on how the data should be processed and loaded (i.e. it tells `dlt` about the content of the data and how to load it into the destination). -## [Config](configuration.md) +## [Config](credentials/configuration) A set of values that are passed to the pipeline at run time (e.g. to change its behavior locally vs. in production). -## [Credentials](credentials.md) +## [Credentials](credentials/config_specs) A subset of configuration whose elements are kept secret and never shared in plain text. diff --git a/docs/website/docs/general-usage/resource.md b/docs/website/docs/general-usage/resource.md index 77df24d592..e203b3d93a 100644 --- a/docs/website/docs/general-usage/resource.md +++ b/docs/website/docs/general-usage/resource.md @@ -70,7 +70,7 @@ accepts following arguments: > hint value. This let's you create table and column schemas depending on the data. See example in > next section. 
-> 💡 You can mark some resource arguments as configuration and [credentials](credentials.md) +> 💡 You can mark some resource arguments as [configuration and credentials](credentials) > values so `dlt` can pass them automatically to your functions. ### Define a schema with Pydantic @@ -174,7 +174,7 @@ for row in generate_rows(20): print(row) ``` -You can mark some resource arguments as configuration and [credentials](credentials.md) values +You can mark some resource arguments as [configuration and credentials](credentials) values so `dlt` can pass them automatically to your functions. ### Process resources with `dlt.transformer` diff --git a/docs/website/docs/general-usage/schema.md b/docs/website/docs/general-usage/schema.md index ee73aea54e..492e1a9117 100644 --- a/docs/website/docs/general-usage/schema.md +++ b/docs/website/docs/general-usage/schema.md @@ -64,7 +64,7 @@ The default naming convention: > 💡 Use simple, short small caps identifiers for everything! -The naming convention is [configurable](configuration.md) and users can easily create their own +The naming convention is configurable and users can easily create their own conventions that i.e. pass all the identifiers unchanged if the destination accepts that (i.e. DuckDB). diff --git a/docs/website/docs/running-in-production/running.md b/docs/website/docs/running-in-production/running.md index 96f9f7e071..4d8cc581d7 100644 --- a/docs/website/docs/running-in-production/running.md +++ b/docs/website/docs/running-in-production/running.md @@ -111,7 +111,7 @@ load.delete_completed_jobs=true ## Using slack to send messages `dlt` provides basic support for sending slack messages. You can configure Slack incoming hook via -[secrets.toml or environment variables](../general-usage/credentials.md). Please note that **Slack +[secrets.toml or environment variables](../general-usage/credentials/config_providers). 
Please note that **Slack incoming hook is considered a secret and will be immediately blocked when pushed to github repository**. In `secrets.toml`: diff --git a/docs/website/docs/walkthroughs/add-a-verified-source.md b/docs/website/docs/walkthroughs/add-a-verified-source.md index ed3701d8b5..bd7bd9894e 100644 --- a/docs/website/docs/walkthroughs/add-a-verified-source.md +++ b/docs/website/docs/walkthroughs/add-a-verified-source.md @@ -76,7 +76,7 @@ the supported locations. ## 2. Adding credentials For adding them locally or on your orchestrator, please see the following guide -[credentials](../general-usage/credentials.md). +[credentials](add_credentials). ## 3. Customize or write a pipeline script diff --git a/docs/website/docs/general-usage/credentials.md b/docs/website/docs/walkthroughs/add_credentials.md similarity index 61% rename from docs/website/docs/general-usage/credentials.md rename to docs/website/docs/walkthroughs/add_credentials.md index d0627ca527..748b8c6d8a 100644 --- a/docs/website/docs/general-usage/credentials.md +++ b/docs/website/docs/walkthroughs/add_credentials.md @@ -1,10 +1,10 @@ --- -title: Credentials +title: Add credentials description: How to use dlt credentials keywords: [credentials, secrets.toml, environment variables] --- -# Credentials +# How to add credentials ## Adding credentials locally @@ -34,48 +34,15 @@ For Verified Source credentials, read the [Setup Guides](../dlt-ecosystem/verifi Once you have credentials for the source and destination, add them to the file above and save them. +Read more about [credential configuration.](../general-usage/credentials) + ## Adding credentials to your deployment To add credentials to your deployment, - either use one of the `dlt deploy` commands; -- or follow the below instructions to pass credentials via code or environment. - -### Passing credentials as code - -A usual dlt pipeline passes a dlt source to a dlt pipeline as below. 
It is here that we could pass -credentials to the `pipedrive_source()`: - -```python -from pipedrive import pipedrive_source - -pipeline = dlt.pipeline( - pipeline_name='pipedrive', - destination='bigquery', - dataset_name='pipedrive_data' -) -load_info = pipeline.run(pipedrive_source()) -print(load_info) -``` - -When a source is defined, you define how credentials are passed. So it is here that you could look -to understand how to pass custom credentials. - -Example: - -```python -@dlt.source(name='pipedrive') -def pipedrive_source(pipedrive_api_key: str = dlt.secrets.value) -> Sequence[DltResource]: - #code goes here -``` - -You can see that the pipedrive source expects a `pipedrive_api_key`. So you could pass it as below. - -```python -api_key = BaseHook.get_connection('pipedrive_api_key').extra # get it from airflow or other credential store -load_info = pipeline.run(pipedrive_source(pipedrive_api_key=api_key)) -``` -> ❗ Note: be careful not to put your credentials directly in code - use your own credential vault instead. +- or follow the instructions to [pass credentials via code](../general-usage/credentials/configuration#pass-credentials-as-code) +or [environment](../general-usage/credentials/config_providers#environment-provider). ### Reading credentials from environment variables @@ -96,9 +63,9 @@ client_email = "client_email" # please set me up! If dlt tries to read this from environment variables, it will use a different naming convention. -For environment variables all names are capitalized and sections are separated with double underscore "\_\_". +For environment variables all names are capitalized and sections are separated with a double underscore "__". 
-For example for the above secrets, we would need to put into environment: +For example, for the secrets mentioned above, we would need to set them in the environment: ```shell SOURCES__PIPEDRIVE__PIPEDRIVE_API_KEY diff --git a/docs/website/sidebars.js b/docs/website/sidebars.js index a82b7acee6..106162c50e 100644 --- a/docs/website/sidebars.js +++ b/docs/website/sidebars.js @@ -104,9 +104,21 @@ const sidebars = { 'general-usage/state', 'general-usage/incremental-loading', 'general-usage/full-loading', - 'general-usage/credentials', 'general-usage/schema', - 'general-usage/configuration', + { + type: 'category', + label: 'Configuration', + link: { + type: 'generated-index', + title: 'Configuration', + slug: 'general-usage/credentials', + }, + items: [ + 'general-usage/credentials/configuration', + 'general-usage/credentials/config_providers', + 'general-usage/credentials/config_specs', + ] + }, 'reference/performance', { type: 'category', @@ -139,6 +151,7 @@ const sidebars = { items: [ 'walkthroughs/create-a-pipeline', 'walkthroughs/add-a-verified-source', + 'walkthroughs/add_credentials', 'walkthroughs/run-a-pipeline', 'walkthroughs/adjust-a-schema', 'walkthroughs/share-a-dataset',