-
Notifications
You must be signed in to change notification settings - Fork 196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Clickhouse Destination #1055
Comments
@Pipboyguy what you need is a db-api client: https://clickhouse-driver.readthedocs.io/en/latest/dbapi.html the base class (sql_job_client and sql_client) will handle 90% of things for you, type mapping is just (mostly) writing a dict. my biggest question is how people load data to clickhouse so they can use this on production. this will affect the scope the most:
we should answer the above. for all those options we have scaffolding code that should be quick to adapt. we can also greatly help with writting tests once you have prototype that loads data with a few test pipelines |
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
@rudolfix ClickHouse's primary method for bulk data ingestion is through the use of table functions, for example the s3 table function. There doesn't seem to be any concept of staging tables like in Snowflake, just direct inserts from source in a streaming fashion:
It is possible to save credentials and associated metadata from a source in a table engines, but this is simply a convenience utility. ClickHouse has a collection of table functions that build tables. The main ingestion pattern in production, according the tutorials, example datasets and docs, is to insert the response from a table function into a table using the command INSERT INTO table_name select from table_function(...); Asynchronous, direct inserts are supported as well, but again not the main method. This considered, I've opted to adapt the snowflake implementation (without staging).
I'll also reuse as much of SF's tests considering its similarity. |
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Hey - Clickhouse user here. Thanks a lot for providing this destination❤️. To give a short background, we use Clickhouse partly as a traditional data warehouse for BI, partly for analyzing medium-large scale time series data sets (~100s GB each) and partly (experimentally) as a feature store for ML features. Features that are important to us (sorted by importance):
My base assumption from looking at the destination implementations is that we will fairly easily be able to modify at least some of these features, but they are probably also important to the broader community. |
Hi @arniwesth Thank you for your input! You're right that
I'll see if I can add |
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
* Add clickhouse driver dependency #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Preliminary wireframe #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Update preliminary Clickhouse configurations #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Format #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Finalize wireframing #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Wireframe ClickhouseSqlClient #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Refactor Clickhouse SqlClient wireframing and update capabilities #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Update error messages and transaction capability in Clickhouse Signed-off-by: Marcel Coetzee <[email protected]> * Update Clickhouse configuration and factory Signed-off-by: Marcel Coetzee <[email protected]> * Update Signed-off-by: Marcel Coetzee <[email protected]> * Update identifier escaping logic #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Refactor ClickhouseSqlClient for better error handling #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Refine clickhouse destination and basic adapter #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Finish ClickhouseClient Signed-off-by: Marcel Coetzee <[email protected]> * Add escape_clickhouse_literal function for Clickhouse Signed-off-by: Marcel Coetzee <[email protected]> * Add "insert_values" to supported loader file formats Signed-off-by: Marcel Coetzee <[email protected]> * Add `wei` to "from_db_type" Signed-off-by: Marcel Coetzee <[email protected]> * Improve Clickhouse loader code and update comments #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Preliminary CH tests and utility module Signed-off-by: Marcel Coetzee <[email protected]> * Refactor Clickhouse utilities and update tests Signed-off-by: Marcel Coetzee <[email protected]> * Refactor URL conversion and staging for Clickhouse Signed-off-by: Marcel Coetzee <[email protected]> * Tests don't pass, but they are there #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Fix poetry collision #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Swap engine/primary key clause ordering #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Pass basic tests #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Add Clickhouse tests and improve its configuration handling Signed-off-by: Marcel Coetzee <[email protected]> * Format and Lint Signed-off-by: Marcel Coetzee <[email protected]> * Improve DRYness #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Remove old comment #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Update pyproject.toml #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Fix secure connection settings #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Minor config parsing amendments #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Fix ssl connection (correct port) #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Reuse Athena destination pyformat converter #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Filesystem Auth issues #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Fix incorrect arguments in render_object_storage_table_function #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Pass all providers append tests #1055 * Add merge test #1055 * Resolved driver parameter substitution issues #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Merge Disposition #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Fall back to append disposition for merge #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Clickhouse CI #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Update active destinations to ClickHouse Signed-off-by: Marcel Coetzee <[email protected]> * Expand Clickhouse dependencies in pyproject.toml Signed-off-by: Marcel Coetzee <[email protected]> * Update lock file #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Revert back to merge implementation #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Add default sql test #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Support jsonlines Signed-off-by: Marcel Coetzee <[email protected]> * Revert non-applicable changes #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Fix 'from_db_type' #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Remove unused tests #1055 Signed-off-by: Marcel Coetzee <[email protected]> * No staging test case #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Minor changes Signed-off-by: Marcel Coetzee <[email protected]> * Refactor Clickhouse loader Signed-off-by: Marcel Coetzee <[email protected]> * WIP Signed-off-by: Marcel Coetzee <[email protected]> * Remove from standard sql tests Signed-off-by: Marcel Coetzee <[email protected]> * Remove unnecessary compression #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Dataset prefix and dataset-table seperator #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Remove DATASET_PREFIX from sql_client.py Signed-off-by: Marcel Coetzee <[email protected]> * Add clickhouse connect as local fallback #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Set settings on local #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Update lock Signed-off-by: Marcel Coetzee <[email protected]> * Revert some files back to devel Signed-off-by: Marcel Coetzee <[email protected]> * Remove redundant merge logic #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Spelling fix Signed-off-by: Marcel Coetzee <[email protected]> * Don't synthesise CH credentials __init__. #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Optimize import and type hinting in Clickhouse factory #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Revert back to temp table Signed-off-by: Marcel Coetzee <[email protected]> * Refactor Clickhouse to ClickHouse for consistency Signed-off-by: Marcel Coetzee <[email protected]> * Support compression codec for azure and local #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Merge Signed-off-by: Marcel Coetzee <[email protected]> * Fix table-name separator config resolution #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Format Signed-off-by: Marcel Coetzee <[email protected]> * Set compression parameter for local #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Set compression method to 'auto' for s3 table function #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Typing Signed-off-by: Marcel Coetzee <[email protected]> * Typing Signed-off-by: Marcel Coetzee <[email protected]> * Initial draft doc Signed-off-by: Marcel Coetzee <[email protected]> * auto compression for parquet, detects compression of local files * fixes has_dataset, recognizes more exceptions * fixes some tests * aligns clickhouse config with dataclasses * Remove empty dataset default #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Update clickhouse configuration and docs sidebar Signed-off-by: Marcel Coetzee <[email protected]> * Clickhouse docs #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Don't use Jinja #1055 Signed-off-by: Marcel Coetzee <[email protected]> * udpate clickhouse workflow file * add missing secrets to clickhouse workflow * Add test for clickhouse config settings #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Set experimental session in DSN #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Update data mapping and test for ClickHouse Signed-off-by: Marcel Coetzee <[email protected]> * Revert previous Signed-off-by: Marcel Coetzee <[email protected]> * Fix table aliasing issue Signed-off-by: Marcel Coetzee <[email protected]> * remove additional clickhouse destinations from test setup * fix lockfile * fix merging fix timestamp allow datetime to be parse as time * slightly clean up clickhouse load job * fix merge job a bit more * Refactor key table clause generation in ClickHouse Signed-off-by: Marcel Coetzee <[email protected]> * fixes a bunch of tests * simplify clickhouse load job a bit * add sentinel table for dataset existence check * post merge lockfile update * add support for scd2 * add correct high_ts for clickhouse * remove corelated query from scd2 implementation * fix merge sql for clickhouse * fix merge tests * some further fixes * fix athena tests * disable dbt for now * smaller changes * add clickhouse adapter tests, update small part of docs and correct imports * update scd2 sql based on jorrits suggestions * change merge change test to make it pass * use text for json in clickhouse * remove some unrelated unneeded stuff * update docs a bit * fix json to string tests and implementation * move gcp access credentials into proper config * Add GCS Clickhouse staging docs #1055 Signed-off-by: Marcel Coetzee <[email protected]> * Add `http_port` to docs #1055 Signed-off-by: Marcel Coetzee <[email protected]> * fix import after merge * small changes to the docs * tolerate rounding errors when loading from jsonl * post devel merge fix * remove unneeded stuff from scd2 merge --------- Signed-off-by: Marcel Coetzee <[email protected]> Co-authored-by: Marcin Rudolf <[email protected]> Co-authored-by: Dave <[email protected]>
Objective
Implement Clickhouse as a destination in the dlt library.
Features:
Default Clickhouse configuration via dlt resources adapter:
clickhouse_connect.driver
)Schema mapping and data type compatibility:
clickhouse_connect.cc_sqlalchemy
) could be used to define these mappings. The docs mention that it is limited though.Efficient data loading:
Support for Clickhouse-specific features:
SQL client integration:
Development Tasks:
test/load/pipelines
.Requested Community Feedback
We would appreciate any feedback or suggestions from the community regarding:
Your input will help us prioritize the development efforts and ensure that the Clickhouse destination meets the needs of the dlt community. Thank you for your feedback.
The text was updated successfully, but these errors were encountered: