Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(sink): Support Deletes for ClickHouse ReplacingMergeTree #17283

Merged
merged 4 commits into from
Jun 20, 2024

Conversation

rickysaltzer
Copy link
Contributor

@rickysaltzer rickysaltzer commented Jun 17, 2024

  • ClickHouse now supports [1] a is_deleted column (optional) when creating a ReplacingMergeTree.

  • Adds new clickhouse.delete.column config option, which when set will enable upsert to a ReplacingMergeTree

  • When the value is deleted the value of this column is set to 1 instead of the default 0.

[1] https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replacingmergetree#is_deleted

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

This patch adds the ability for propagating DELETE operations to a ClickHouse table backed by a ReplacingMergeTree. When using the type='upsert' you will need to specify a clickhouse.delete.column corresponding to the respective column in ClickHouse. If you are using the is_deleted feature then you will need to use a UInt8 type for ClickHouse.

When a DELETE occurs, the value of this column will be flipped to a value of 1 which will allow ClickHouse to ignore this row (and all previous versions) automatically when performing a FROM <table> FINAL query.

Example

1.) Create ClickHouse Table

The following table is a simple K/V store. We use the ver column to denote the latest version, and the del column to indicate whether or not the key was marked for deletion.

CREATE TABLE kv_store
(
    `k` String,
    `v` UInt32,
    `ver` DateTime64,
    `del` UInt8
)
ENGINE = ReplacingMergeTree(ver, del)
ORDER BY k

2.) Create the RisingWave table

The table below will act as our source table for testing.

CREATE TABLE kv_store (
    k VARCHAR,
    v INTEGER,
    ver Timestamptz
);

3.) Create the Sink Connection

The following sink will write all data from the RisingWave kv_store into our ReplacingMergeTree while using the del column to indicate when a key should marked deleted.

CREATE SINK kv_store_sink AS 
SELECT 
    k,
    v,
    ver
FROM kv_store
WITH (
    connector = 'clickhouse',
    type = 'upsert',
    clickhouse.url = 'http://localhost:8123',
    clickhouse.user = 'default',
    clickhouse.password = '',
    clickhouse.database = 'default',
    clickhouse.table='kv_store',
    clickhouse.delete.column='del',
    primary_key='k'
);

4.) Insert / Delete Data

dev=> INSERT INTO kv_store VALUES ('foo', 1, NOW());
INSERT 0 1
dev=> DELETE FROM kv_store WHERE k = 'foo';
DELETE 1

5.) Verify Delete Propagates to ClickHouse

SELECT *
FROM kv_store

   ┌─k───┬─v─┬─────────────────────ver─┬─del─┐
1. │ foo │ 1 │ 2024-06-17 15:13:30.712 │   1 │ <-- automatically set by new patch
   └─────┴───┴─────────────────────────┴─────┘
   ┌─k───┬─v─┬─────────────────────ver─┬─del─┐
2. │ foo │ 1 │ 2024-06-17 15:13:30.712 │   0 │
   └─────┴───┴─────────────────────────┴─────┘

2 rows in set. Elapsed: 0.002 sec.

6.) Validate Scan With FINAL Respects Delete

SELECT *
FROM kv_store
FINAL

Ok.

0 rows in set. Elapsed: 0.002 sec.

- ClickHouse now supports [1] a `is_deleted` column (optional) when creating
  a `ReplacingMergeTree`.

- Adds new `clickhouse.delete.column` config option, which when set will
  enable `upsert` to a `ReplacingMergeTree`

- When the value is deleted the value of this column is set to `1`
  instead of the default `0`.
@rickysaltzer rickysaltzer changed the title feat(sink) Support Deletes for ClickHouse ReplacingMergeTree feat(sink): Support Deletes for ClickHouse ReplacingMergeTree Jun 17, 2024
@rickysaltzer
Copy link
Contributor Author

I thought about trying to use the approach that the Collarpsing merge trees and infer the delete column from the query DDL itself, but there were some drawbacks

  • Felt really hacky with the DDL, error prone
  • Disallows users to use a delete column without actually specifying it on the table (if they want to preserve visibility)

Copy link
Collaborator

@hzxa21 hzxa21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! Generally LGTM.

One minor comment on the RW sink example in the Release Note: I think clickhouse.delete.column option is missing in the example. Would you mind updating that?

@xxhZs PTAL.

@rickysaltzer
Copy link
Contributor Author

Thanks for the contribution! Generally LGTM.

One minor comment on the RW sink example in the Release Note: I think clickhouse.delete.column option is missing in the example. Would you mind updating that?

@xxhZs PTAL.

Good catch, updated!

@hzxa21
Copy link
Collaborator

hzxa21 commented Jun 18, 2024

I thought about trying to use the approach that the Collarpsing merge trees and infer the delete column from the query DDL itself, but there were some drawbacks

  • Felt really hacky with the DDL, error prone

Thanks for bringing this up. This is hacky indeed. I think we should deprecate hacking around DDL and add an option for the sign column like clickhouse.delete.column for collapsing merge engine.

Btw, FYI we are developing a feature call stream changelog (#17132) to convert a retractable stream into an append-only changelog stream with the additional op column, which is a generic way to solve similar issues. With that, we can use the following syntax to support fill-in the delete column for CK ReplacingMergeTree:

# kv_store_changelog's schema is (k, v, ver, op)
# op = 1: INSERT
# op = 2: DELETE
# op = 3: UPDATE_INSERT
# op = 4: UPDATE_DELETE

CREATE SINK kv_store_sink AS 
WITH kv_store_changelog as CHANGELOG FROM kv_store
SELECT 
    k,
    v,
    ver,
   (CASE WHEN op == 2 or op == 4 THEN 1 ELSE 0) as del
FROM kv_store_changelog
WITH (
    connector = 'clickhouse',
    type = 'append-only', -- kv_store_changelog is append only
    clickhouse.url = 'http://localhost:8123',
    clickhouse.user = 'default',
    clickhouse.password = '',
    clickhouse.database = 'default',
    clickhouse.table='kv_store',
    primary_key='k'
);

I think we can still merge this PR given that the change is small and it is easier to use from user's perspective.

Copy link
Contributor

@xxhZs xxhZs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, Thank you for your contribution,

src/connector/src/sink/clickhouse.rs Outdated Show resolved Hide resolved
src/connector/src/sink/clickhouse.rs Show resolved Hide resolved
@rickysaltzer
Copy link
Contributor Author

I thought about trying to use the approach that the Collarpsing merge trees and infer the delete column from the query DDL itself, but there were some drawbacks

  • Felt really hacky with the DDL, error prone

Thanks for bringing this up. This is hacky indeed. I think we should deprecate hacking around DDL and add an option for the sign column like clickhouse.delete.column for collapsing merge engine.

Btw, FYI we are developing a feature call stream changelog (#17132) to convert a retractable stream into an append-only changelog stream with the additional op column, which is a generic way to solve similar issues. With that, we can use the following syntax to support fill-in the delete column for CK ReplacingMergeTree:

# kv_store_changelog's schema is (k, v, ver, op)
# op = 1: INSERT
# op = 2: DELETE
# op = 3: UPDATE_INSERT
# op = 4: UPDATE_DELETE

CREATE SINK kv_store_sink AS 
WITH kv_store_changelog as CHANGELOG FROM kv_store
SELECT 
    k,
    v,
    ver,
   (CASE WHEN op == 2 or op == 4 THEN 1 ELSE 0) as del
FROM kv_store_changelog
WITH (
    connector = 'clickhouse',
    type = 'append-only', -- kv_store_changelog is append only
    clickhouse.url = 'http://localhost:8123',
    clickhouse.user = 'default',
    clickhouse.password = '',
    clickhouse.database = 'default',
    clickhouse.table='kv_store',
    primary_key='k'
);

I think we can still merge this PR given that the change is small and it is easier to use from user's perspective.

This will be great!

- Both the `get_sign_name().is_some()` and
  `.get_delete_column.is_some()` where incurring unnecessary memory
  overhead.

- These methods were being called on a per-row basis from the
  `StreamChunk` so the overhead was potentially large.

- Calling the `.is_collapsing_engine()` and
  `.is_delete_replacing_engine()` instead.
@rickysaltzer rickysaltzer requested a review from xxhZs June 18, 2024 23:40
Copy link
Contributor

@xxhZs xxhZs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, If you have no other changes, we will merge it.

@rickysaltzer
Copy link
Contributor Author

:shipit:

@xxhZs xxhZs added this pull request to the merge queue Jun 20, 2024
Merged via the queue into risingwavelabs:main with commit c710309 Jun 20, 2024
30 of 31 checks passed
@xxhZs xxhZs added the user-facing-changes Contains changes that are visible to users label Jun 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci/run-e2e-clickhouse-sink-tests user-facing-changes Contains changes that are visible to users
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants