
Implement Clickhouse Destination #1055

Closed · Fixed by #1097
Pipboyguy opened this issue Mar 5, 2024 · 4 comments
Assignees
Labels
destination Issue related to new destinations

Comments


Pipboyguy commented Mar 5, 2024

Objective

Implement Clickhouse as a destination in the dlt library.

Features:

  1. Default Clickhouse configuration via dlt resources adapter:

    • Provide a way to configure the Clickhouse connection details through dlt's configuration system, using the clickhouse-connect driver (clickhouse_connect.driver).
    • Allow users to specify additional connection-specific settings if needed.
  2. Schema mapping and data type compatibility:

    • Map dlt's schema to the appropriate Clickhouse data types.
    • The SQLAlchemy dialect in clickhouse_connect.cc_sqlalchemy could be used to define these mappings, though the docs note that it is limited.
    • Handle necessary data type conversions to ensure compatibility between dlt's data and Clickhouse's supported types.
    • Handle Clickhouse's definition of primary keys.
  3. Efficient data loading:

    • Utilize Clickhouse's bulk loading capabilities to optimize data ingestion performance.
    • Explore options for parallel data loading to leverage Clickhouse's distributed architecture.
  4. Support for Clickhouse-specific features:

    • Provide support for Clickhouse-specific features that can enhance data loading and querying, such as:
      • Partitioning and sharding strategies
      • Materialized views
      • Aggregation and compression settings
  5. SQL client integration:

    • Ensure integration with dlt's SQL client to enable users to query and interact with the loaded data in Clickhouse.
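The type mapping in feature 2 can be sketched as a simple lookup table. This is an illustrative draft only, not the final implementation; the dlt-side type names follow dlt's schema data types, and the ClickHouse targets shown are one plausible choice:

```python
# Illustrative dlt -> ClickHouse type mapping (feature 2 above).
# The ClickHouse target types are assumptions, not the final implementation.
DLT_TO_CLICKHOUSE = {
    "text": "String",
    "double": "Float64",
    "bool": "Boolean",
    "timestamp": "DateTime64(6)",
    "bigint": "Int64",
    "binary": "String",
    "date": "Date32",
    "decimal": "Decimal(38, 9)",
}

def to_clickhouse_type(dlt_type: str, nullable: bool = False) -> str:
    """Resolve a dlt column type to a ClickHouse type, wrapping in Nullable if needed."""
    ch_type = DLT_TO_CLICKHOUSE[dlt_type]
    return f"Nullable({ch_type})" if nullable else ch_type
```

For example, a nullable dlt `bigint` column would render as `Nullable(Int64)`.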

Development Tasks:

  • Define the interface and configuration options for the Clickhouse destination.
  • Implement the core functionality for schema mapping, data loading, and destination-specific features.
  • Write unit and integration tests to ensure the correctness and reliability of the Clickhouse destination in test/load/pipelines.
  • Update documentation and provide examples showcasing the usage of the Clickhouse destination in dlt pipelines.

Requested Community Feedback

We would appreciate any feedback or suggestions from the community regarding:

  • Clickhouse features or optimizations that would be valuable to support in the dlt destination.
  • Use cases or scenarios where a Clickhouse destination would be beneficial.
  • Any potential challenges or considerations to keep in mind when implementing the Clickhouse destination.

Your input will help us prioritize the development efforts and ensure that the Clickhouse destination meets the needs of the dlt community. Thank you for your feedback.

@Pipboyguy Pipboyguy added the destination Issue related to new destinations label Mar 5, 2024
@Pipboyguy Pipboyguy self-assigned this Mar 5, 2024
@Pipboyguy Pipboyguy moved this from Todo to In Progress in dlt core library Mar 5, 2024

rudolfix commented Mar 6, 2024

@Pipboyguy
please take a look at the updated guide on new destinations. I bet you'll be able to reuse most of the code from other similar warehouses, e.g. Databricks or Snowflake: https://dlthub.com/docs/walkthroughs/create-new-destination

what you need is a db-api client: https://clickhouse-driver.readthedocs.io/en/latest/dbapi.html

the base classes (sql_job_client and sql_client) will handle 90% of things for you; type mapping is (mostly) just writing a dict.

my biggest question is how people load data into clickhouse so they can use this in production. this will affect the scope the most:

  • do people use staging and load files with COPY jobs
  • maybe clickhouse cloud has internal staging?
  • maybe people do INSERT statements?

we should answer the above. for all those options we have scaffolding code that should be quick to adapt. we can also greatly help with writing tests once you have a prototype that loads data with a few test pipelines
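The plain-INSERT option from the list above can be sketched as a small statement renderer for a DB-API client. This is a hypothetical helper, not code from the library; the `%s` placeholder style is an assumption (it matches what clickhouse-driver's DB-API module accepts):

```python
# Hypothetical sketch of the "INSERT statements" loading option: render a
# parameterized INSERT for a DB-API client such as clickhouse-driver's dbapi.
# The %s placeholder style is an assumption.
from typing import Sequence

def render_insert(table: str, columns: Sequence[str]) -> str:
    """Build a parameterized INSERT statement for the given columns."""
    col_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})"
```

The client would then execute this with `cursor.executemany(sql, rows)` or similar.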

Pipboyguy added commits that referenced this issue (Mar 9 – Mar 16, 2024)
@Pipboyguy Pipboyguy linked a pull request Mar 16, 2024 that will close this issue

Pipboyguy commented Mar 17, 2024

@rudolfix ClickHouse's primary method for bulk data ingestion is through table functions, for example the s3 table function. There doesn't seem to be any concept of staging tables as in Snowflake, just direct inserts from the source in a streaming fashion:

When using the s3 table function with INSERT INTO...SELECT, data is read and inserted in a streaming fashion. Only a few blocks of data reside in memory while the blocks are continuously read from S3 and pushed into the destination table.

It is possible to save credentials and associated metadata from a source in a table engine, but this is simply a convenience utility.

ClickHouse has a collection of table functions that build tables. The main ingestion pattern in production, according to the tutorials, example datasets and docs, is to insert the response from a table function into a table using the command

INSERT INTO table_name SELECT * FROM table_function(...);

Asynchronous, direct inserts are supported as well, but again not the main method.

With this in mind, I've opted to adapt the Snowflake implementation (without staging).

class ClickhouseLoadJob(LoadJob, FollowupJob):

I'll also reuse as many of Snowflake's tests as possible, given the similarity.
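The INSERT INTO ... SELECT pattern described above can be sketched as a small renderer around the s3 table function. The helper name and argument handling are hypothetical (credentials are omitted for brevity); only the shape of the generated SQL follows the pattern in the comment:

```python
# Illustrative sketch of the streaming ingestion pattern described above,
# using ClickHouse's s3 table function. Helper name is hypothetical and
# credential arguments are omitted for brevity.
def render_s3_insert(table: str, url: str, file_format: str = "Parquet") -> str:
    """Render a streaming INSERT from an S3 object into a target table."""
    return (
        f"INSERT INTO {table} "
        f"SELECT * FROM s3('{url}', '{file_format}')"
    )
```

With this, data is read and inserted block by block rather than staged in an intermediate table.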

Pipboyguy added commits that referenced this issue (Mar 17 – Mar 19, 2024)
@arniwesth

Hey - Clickhouse user here. Thanks a lot for providing this destination❤️.

To give a short background, we use Clickhouse partly as a traditional data warehouse for BI, partly for analyzing medium-large scale time series data sets (~100s GB each) and partly (experimentally) as a feature store for ML features.

Features that are important to us (sorted by importance):

  • Being able to specify Settings: this is where storage_policy is defined, and we use both local disks and S3 for storage depending on the type of data
  • Being able to specify compression codecs: careful selection of compression codecs can result in 20x+ compression depending on the nature of the data. We are seeing huge compression ratios when moving data from Postgres to Clickhouse
  • Being able to specify ORDER BY: this has implications for joins of large tables and query performance

My base assumption from looking at the destination implementations is that we will fairly easily be able to modify at least some of these features, but they are probably also important to the broader community.
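The three knobs requested above would ultimately drive the generated DDL. A rough sketch of what that could look like, with all clause values and the helper itself being illustrative (the destination's actual option names were still open at this point):

```python
# Rough sketch of the DDL the requested knobs (Settings, codecs, ORDER BY)
# could drive. Helper name, signature, and clause values are illustrative.
from typing import Dict, Optional, Sequence

def render_create_table(
    table: str,
    columns: Dict[str, str],  # column name -> type, optionally with CODEC
    order_by: Sequence[str],
    storage_policy: Optional[str] = None,
) -> str:
    cols = ",\n  ".join(f"{name} {typ}" for name, typ in columns.items())
    ddl = (
        f"CREATE TABLE {table} (\n  {cols}\n) "
        f"ENGINE = MergeTree ORDER BY ({', '.join(order_by)})"
    )
    if storage_policy:
        ddl += f" SETTINGS storage_policy = '{storage_policy}'"
    return ddl
```

For example, `columns={"ts": "DateTime64(6) CODEC(Delta, ZSTD)", "value": "Float64"}` with `order_by=("ts",)` and `storage_policy="s3_main"` would emit a MergeTree table with a per-column codec and an S3-backed storage policy.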


Pipboyguy commented Mar 22, 2024

Hi @arniwesth

Thank you for your input!

You're right that Settings and compression codecs are an easy change.

ORDER BY (the sorting key) is a little trickier to wrap, because the primary key must be a prefix of the sorting key if both are specified. The sorting key is implicitly set in any case to be equal to the primary key.

I'll see if I can add Settings and compression codecs to the destination, but can't promise the sorting key (for now). Please see #1097 for further details.
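The constraint described above can be expressed as a small validation check (a sketch, not library code): when both keys are given, the primary key columns must be a prefix of the sorting key columns.

```python
# The ClickHouse key constraint discussed above, as a standalone check:
# PRIMARY KEY must be a prefix of ORDER BY when both are specified.
from typing import Sequence

def is_valid_key_pair(primary_key: Sequence[str], sorting_key: Sequence[str]) -> bool:
    """Return True if primary_key is a prefix of sorting_key."""
    return list(sorting_key[: len(primary_key)]) == list(primary_key)
```

So `(id)` with `ORDER BY (id, ts)` is valid, while `(ts)` with `ORDER BY (id, ts)` is not.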

Pipboyguy added commits that referenced this issue (Mar 23 – Apr 24, 2024)
sh-rp added a commit that referenced this issue Apr 26, 2024

* Add clickhouse driver dependency #1055
* Preliminary wireframe #1055
* Update preliminary Clickhouse configurations #1055
* Format #1055
* Finalize wireframing #1055
* Wireframe ClickhouseSqlClient #1055
* Refactor Clickhouse SqlClient wireframing and update capabilities #1055
* Update error messages and transaction capability in Clickhouse
* Update Clickhouse configuration and factory
* Update
* Update identifier escaping logic #1055
* Refactor ClickhouseSqlClient for better error handling #1055
* Refine clickhouse destination and basic adapter #1055
* Finish ClickhouseClient
* Add escape_clickhouse_literal function for Clickhouse
* Add "insert_values" to supported loader file formats
* Add `wei` to "from_db_type"
* Improve Clickhouse loader code and update comments #1055
* Preliminary CH tests and utility module
* Refactor Clickhouse utilities and update tests
* Refactor URL conversion and staging for Clickhouse
* Tests don't pass, but they are there #1055
* Fix poetry collision #1055
* Swap engine/primary key clause ordering #1055
* Pass basic tests #1055
* Add Clickhouse tests and improve its configuration handling
* Format and Lint
* Improve DRYness #1055
* Remove old comment #1055
* Update pyproject.toml #1055
* Fix secure connection settings #1055
* Minor config parsing amendments #1055
* Fix ssl connection (correct port) #1055
* Reuse Athena destination pyformat converter #1055
* Filesystem Auth issues #1055
* Fix incorrect arguments in render_object_storage_table_function #1055
* Pass all providers append tests #1055
* Add merge test #1055
* Resolved driver parameter substitution issues #1055
* Merge Disposition #1055
* Fall back to append disposition for merge #1055
* Clickhouse CI #1055
* Update active destinations to ClickHouse
* Expand Clickhouse dependencies in pyproject.toml
* Update lock file #1055
* Revert back to merge implementation #1055
* Add default sql test #1055
* Support jsonlines
* Revert non-applicable changes #1055
* Fix 'from_db_type' #1055
* Remove unused tests #1055
* No staging test case #1055
* Minor changes
* Refactor Clickhouse loader
* WIP
* Remove from standard sql tests
* Remove unnecessary compression #1055
* Dataset prefix and dataset-table seperator #1055
* Remove DATASET_PREFIX from sql_client.py
* Add clickhouse connect as local fallback #1055
* Set settings on local #1055
* Update lock
* Revert some files back to devel
* Remove redundant merge logic #1055
* Spelling fix
* Don't synthesise CH credentials __init__. #1055
* Optimize import and type hinting in Clickhouse factory #1055
* Revert back to temp table
* Refactor Clickhouse to ClickHouse for consistency
* Support compression codec for azure and local #1055
* Merge
* Fix table-name separator config resolution #1055
* Format
* Set compression parameter for local #1055
* Set compression method to 'auto' for s3 table function #1055
* Typing
* Typing
* Initial draft doc
* auto compression for parquet, detects compression of local files
* fixes has_dataset, recognizes more exceptions
* fixes some tests
* aligns clickhouse config with dataclasses
* Remove empty dataset default #1055
* Update clickhouse configuration and docs sidebar
* Clickhouse docs #1055
* Don't use Jinja #1055
* udpate clickhouse workflow file
* add missing secrets to clickhouse workflow
* Add test for clickhouse config settings #1055
* Set experimental session in DSN #1055
* Update data mapping and test for ClickHouse
* Revert previous
* Fix table aliasing issue
* remove additional clickhouse destinations from test setup
* fix lockfile
* fix merging
fix timestamp
allow datetime to be parse as time
* slightly clean up clickhouse load job
* fix merge job a bit more
* Refactor key table clause generation in ClickHouse
* fixes a bunch of tests
* simplify clickhouse load job a bit
* add sentinel table for dataset existence check
* post merge lockfile update
* add support for scd2
* add correct high_ts for clickhouse
* remove corelated query from scd2 implementation
* fix merge sql for clickhouse
* fix merge tests
* some further fixes
* fix athena tests
* disable dbt for now
* smaller changes
* add clickhouse adapter tests, update small part of docs and correct imports
* update scd2 sql based on jorrits suggestions
* change merge change test to make it pass
* use text for json in clickhouse
* remove some unrelated unneeded stuff
* update docs a bit
* fix json to string tests and implementation
* move gcp access credentials into proper config
* Add GCS Clickhouse staging docs #1055
* Add `http_port` to docs #1055
* fix import after merge
* small changes to the docs
* tolerate rounding errors when loading from jsonl
* post devel merge fix
* remove unneeded stuff from scd2 merge

---------

Signed-off-by: Marcel Coetzee <[email protected]>
Co-authored-by: Marcin Rudolf <[email protected]>
Co-authored-by: Dave <[email protected]>
@github-project-automation github-project-automation bot moved this from In Progress to Done in dlt core library Apr 26, 2024