From 65c9cec0926c0c9c4186bafa5247217979e72920 Mon Sep 17 00:00:00 2001
From: Dave
Date: Tue, 23 Apr 2024 18:21:28 +0200
Subject: [PATCH] update docs a bit

---
 .../dlt-ecosystem/destinations/clickhouse.md | 23 ++++++++++++++-----
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/docs/website/docs/dlt-ecosystem/destinations/clickhouse.md b/docs/website/docs/dlt-ecosystem/destinations/clickhouse.md
index 811166d400..57f23db83b 100644
--- a/docs/website/docs/dlt-ecosystem/destinations/clickhouse.md
+++ b/docs/website/docs/dlt-ecosystem/destinations/clickhouse.md
@@ -14,11 +14,6 @@ keywords: [ clickhouse, destination, data warehouse ]
 pip install dlt[clickhouse]
 ```
 
-## Dev Todos for docs
-* Clickhouse uses string for time
-* bytes are converted to base64 strings when using jsonl and regular strings when using parquet
-* JSON / complex fields are experimental currently, they are not supported when loading from parquet and nested structures will be changed when loading from jsonl
-
 ## Setup Guide
 
 ### 1. Initialize the dlt project
@@ -93,11 +88,27 @@ Data is loaded into ClickHouse using the most efficient method depending on the
 
 - For files in remote storage like S3, Google Cloud Storage, or Azure Blob Storage, ClickHouse table functions like `s3`, `gcs` and `azureBlobStorage` are used to read the files and insert the data into tables.
 
+## Datasets
+
+`Clickhouse` does not support multiple datasets in one database, while `dlt` relies on datasets for multiple reasons.
+To make `clickhouse` work with `dlt`, tables generated by `dlt` in your `clickhouse` database will have their names prefixed with the dataset name, separated by
+the configurable `dataset_table_separator`. Additionally, a special sentinel table that does not contain any data will be created, so `dlt` knows which virtual
+datasets already exist in a `clickhouse` destination.
+
 ## Supported file formats
 
 - [jsonl](../file-formats/jsonl.md) is the preferred format for both direct loading and staging.
 - [parquet](../file-formats/parquet.md) is supported for both direct loading and staging.
 
+The `clickhouse` destination has a few specific deviations from the default SQL destinations:
+
+1. `Clickhouse` has an experimental `object` datatype, but we have found it to be a bit unpredictable, so the `dlt` `clickhouse` destination loads the `complex` datatype into a `text` column.
+If you need this feature, please get in touch in our Slack community and we will consider adding it.
+2. `Clickhouse` does not support the `time` datatype. Time will be loaded into a `text` column.
+3. `Clickhouse` does not support the `binary` datatype. Binary will be loaded into a `text` column: a base64-encoded string when loading from `jsonl`, and the
+`binary` object converted to `text` when loading from `parquet`.
+4. `Clickhouse` accepts adding non-nullable columns to a populated table.
+
 ## Supported column hints
 
 ClickHouse supports the following [column hints](https://dlthub.com/docs/general-usage/schema#tables-and-columns):
@@ -149,7 +160,7 @@ pipeline = dlt.pipeline(
 
 ### dbt support
 
-Integration with [dbt](../transformations/dbt/dbt.md) is supported.
+Integration with [dbt](../transformations/dbt/dbt.md) is generally supported via `dbt-clickhouse`, but has not been tested by us at this time.
 
 ### Syncing of `dlt` state
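A minimal sketch illustrating the dataset prefixing described in the `## Datasets` section added above, using dlt's standard pipeline API. The pipeline, dataset, and table names are hypothetical, and the separator value `___` is an assumption; only the setting name `dataset_table_separator` comes from the docs text itself:

```py
import dlt

# Sketch of the "virtual dataset" behavior on ClickHouse. With a separator of
# "___" (assumed here), a table "players" in dataset "chess_data" would be
# created in the ClickHouse database as "chess_data___players", alongside an
# empty sentinel table that marks the virtual dataset "chess_data" as existing.
pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline",
    destination="clickhouse",
    dataset_name="chess_data",
)
```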
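Similarly, a hedged sketch of how the type deviations listed above would surface in practice; the resource and field names are hypothetical:

```py
from datetime import time

import dlt

@dlt.resource
def events():
    # Per deviations 1-3 in the list above, the nested `payload` dict (dlt's
    # `complex` type), the `duration` time value, and the `raw` bytes would all
    # land in ClickHouse as `text` (String) columns; when staged as jsonl, `raw`
    # would be written as a base64-encoded string.
    yield {
        "id": 1,
        "payload": {"a": [1, 2]},
        "duration": time(12, 30),
        "raw": b"\x00\x01",
    }
```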