From bcafd62107f3474924064398a431c34d6e57584b Mon Sep 17 00:00:00 2001
From: Steinthor Palsson
Date: Wed, 18 Sep 2024 18:33:37 -0400
Subject: [PATCH] Sqlalchemy destination docs

---
 .../dlt-ecosystem/destinations/sqlalchemy.md | 158 ++++++++++++++++++
 1 file changed, 158 insertions(+)
 create mode 100644 docs/website/docs/dlt-ecosystem/destinations/sqlalchemy.md

diff --git a/docs/website/docs/dlt-ecosystem/destinations/sqlalchemy.md b/docs/website/docs/dlt-ecosystem/destinations/sqlalchemy.md
new file mode 100644
index 0000000000..46de7cb96c
--- /dev/null
+++ b/docs/website/docs/dlt-ecosystem/destinations/sqlalchemy.md
@@ -0,0 +1,158 @@
---
title: SQL databases (powered by SQLAlchemy)
description: SQLAlchemy destination
keywords: [sql, sqlalchemy, database, destination]
---

# SQLAlchemy destination

The SQLAlchemy destination allows you to use any database which has an [SQLAlchemy dialect](https://docs.sqlalchemy.org/en/20/dialects/) implemented as a destination.

Currently, MySQL and SQLite are considered fully supported and are tested as part of the `dlt` CI suite. Other dialects are not tested but should generally work.

## Install dlt with SQLAlchemy

Install dlt with the `sqlalchemy` extra dependency:

```sh
pip install "dlt[sqlalchemy]"
```

Note that database drivers are not included and need to be installed separately for the database you plan on using. For example, for MySQL:

```sh
pip install mysqlclient
```

Refer to the [SQLAlchemy documentation on dialects](https://docs.sqlalchemy.org/en/20/dialects/) for information about the client libraries required for supported databases.

### Create a pipeline

**1. Initialize a project with a pipeline that loads to the SQLAlchemy destination by running:**
```sh
dlt init chess sqlalchemy
```

**2. Install the necessary dependencies for SQLAlchemy by running:**
```sh
pip install -r requirements.txt
```
or run:
```sh
pip install "dlt[sqlalchemy]"
```

**3. Install your database client library.**

E.g., for MySQL:
```sh
pip install mysqlclient
```

**4. Enter your credentials into `.dlt/secrets.toml`.**

For example, replace with your database connection info:
```toml
[destination.sqlalchemy.credentials]
database = "dlt_data"
username = "loader"
password = ""
host = "localhost"
port = 3306
driver_name = "mysql"
```

Alternatively, a valid SQLAlchemy database URL can be used, either in `secrets.toml` or as an environment variable. E.g.:

```toml
[destination.sqlalchemy]
credentials = "mysql://loader:@localhost:3306/dlt_data"
```

or

```sh
export DESTINATION__SQLALCHEMY__CREDENTIALS="mysql://loader:@localhost:3306/dlt_data"
```

An SQLAlchemy `Engine` can also be passed directly by creating an instance of the destination:

```py
import sqlalchemy as sa
import dlt

engine = sa.create_engine('sqlite:///chess_data.db')

pipeline = dlt.pipeline(
    pipeline_name='chess',
    destination=dlt.destinations.sqlalchemy(engine),
    dataset_name='main'
)
```

## Notes on SQLite

### Dataset files
When using an SQLite database file, each dataset is stored in a separate file, since SQLite does not support multiple schemas in a single database file. Under the hood, this uses [`ATTACH DATABASE`](https://www.sqlite.org/lang_attach.html).

The file is stored in the same directory as the main database file (provided by your database URL).

E.g., if your SQLite URL is `sqlite:////home/me/data/chess_data.db` and your `dataset_name` is `games`, the data is stored in `/home/me/data/chess_data__games.db`.

**Note**: if the dataset name is `main`, no additional file is created, as this is the default SQLite database.

### In-memory databases
In-memory databases require a persistent connection, as the database is destroyed when the connection is closed. Normally, connections are opened and closed for each load job and in other stages during the pipeline run. To make sure the database persists throughout the pipeline run, you need to pass in an SQLAlchemy `Engine` object instead of credentials. This engine is not disposed of automatically by `dlt`. Example:

```py
import dlt
import sqlalchemy as sa

# Create the sqlite engine
engine = sa.create_engine('sqlite:///:memory:')

# Configure the destination instance and create pipeline
pipeline = dlt.pipeline('my_pipeline', destination=dlt.destinations.sqlalchemy(engine), dataset_name='main')

# Run the pipeline with some data
pipeline.run([1,2,3], table_name='my_table')

# The engine is still open and you can query the database
with engine.connect() as conn:
    result = conn.execute(sa.text('SELECT * FROM my_table'))
    print(result.fetchall())
```

## Write dispositions

The following write dispositions are supported:

- `append`
- `replace` with the `truncate-and-insert` and `insert-from-staging` replace strategies. `staging-optimized` falls back to `insert-from-staging`.

The `merge` disposition is not supported and falls back to `append`.
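For example, a minimal sketch of selecting a write disposition per run (the table name and sample rows are illustrative; credentials are assumed to be configured in `secrets.toml`):

```py
import dlt

pipeline = dlt.pipeline(
    pipeline_name='my_pipeline',
    destination='sqlalchemy',
    dataset_name='main',
)

# Replace the table contents on each run instead of appending
pipeline.run(
    [{'id': 1}, {'id': 2}],
    table_name='my_table',
    write_disposition='replace',
)
```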
## Data loading
Data is loaded in a dialect-agnostic manner with an `insert` statement generated by SQLAlchemy's Core API. Rows are inserted in batches as long as the underlying database driver supports it. By default, the batch size is 10,000 rows.

## Syncing of `dlt` state
This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination).

## Data types
All `dlt` data types are supported, but how they are stored in the database depends on the SQLAlchemy dialect. For example, SQLite does not have a `DATETIME`/`TIMESTAMP` type, so `timestamp` columns are stored as `TEXT` in ISO 8601 format.

## Supported file formats
* [typed-jsonl](../file-formats/jsonl.md) is used by default. JSON-encoded data with typing information included.
* [parquet](../file-formats/parquet.md) is supported.

## Supported column hints
* `unique` hints are translated to `UNIQUE` constraints via SQLAlchemy (provided the database supports them), as shown in the sketch below.
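For example, a minimal sketch of attaching a `unique` hint to a column via resource-level column hints (the resource name and sample data are illustrative):

```py
import dlt

# The 'unique' hint on the 'id' column is emitted as a UNIQUE
# constraint when the table is created, provided the dialect supports it.
@dlt.resource(columns={'id': {'unique': True}})
def users():
    yield [{'id': 1, 'name': 'alice'}, {'id': 2, 'name': 'bob'}]

pipeline = dlt.pipeline(
    pipeline_name='my_pipeline',
    destination='sqlalchemy',
    dataset_name='main',
)
pipeline.run(users())
```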