Introduce hard_delete and dedup_sort columns hint for merge #960

Merged
merged 32 commits into devel on Feb 24, 2024

Conversation

@jorritsandbrink (Collaborator) commented Feb 12, 2024

Description

This PR introduces a new write disposition replicate that can be used to propagate captured change data (INSERTs, UPDATEs, and DELETEs) from the source to the target destination.

  • requires a primary_key hint—raises SchemaException if not provided
  • requires a cdc_config hint—raises SchemaException if not provided
    • cdc_config holds information on how the change data is organized, such as which column holds the operation type and which values in that column correspond to which DML operation
  • extended SqlMergeJob to implement replicate; SqlMergeJob now handles both merge and replicate write dispositions
  • mechanism: first load change data to staging table, then propagate changes in final table using simple "delete-and-insert" logic (similar to how merge works, but here we have to filter out delete records before we insert from the staging table); see the sketch below this list
    • all primary key values present in the staging table (corresponding with insert, update, and delete operations) are deleted from the final table
    • records in the staging table corresponding with insert and update operations are inserted in the final table
  • also works with child tables
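A minimal sketch of the delete-and-insert logic described above; table and column names are placeholders, not the identifiers dlt generates:

# Hypothetical sketch of the originally proposed staging-to-final propagation.
staging, final = "staging.orders", "dataset.orders"
statements = [
    # 1) all primary key values present in staging are deleted from the final table
    f"DELETE FROM {final} WHERE id IN (SELECT id FROM {staging});",
    # 2) only rows whose change type is insert or update are (re)inserted
    f"INSERT INTO {final} SELECT * FROM {staging} WHERE op IN ('i', 'u');",
]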

The above no longer applies—see #960 (comment)

Potentially useful functionality not yet implemented:

  • Partial updates. We now expect all columns of an updated record to be available (including the ones that didn't change). Most source systems (Postgres, MySQL, SQL Server, Delta tables...) seem to provide this info so perhaps this isn't so important.
  • Conflict detection and resolution. The current logic implicitly implements a source wins resolution strategy, but it would be nice to make this configurable. I.e. to add support for target wins or the option to raise an error in case of a conflict. This page provides a useful overview.
  • Soft deletes. Add a configuration option that lets users choose between hard deletes (current behavior) and soft deletes (extra column is added to final table that marks a record as deleted or not).

Related Issues

Additional Context

@jorritsandbrink jorritsandbrink linked an issue Feb 12, 2024 that may be closed by this pull request

@jorritsandbrink (Collaborator, Author):

@rudolfix can you have a look at this and see if it aligns with your ideas?

@rudolfix (Collaborator) left a comment

we are going in a good direction. my biggest issue is with the way SqlMergeJob is hacked. IMO we can radically simplify this if:

  • we really want to modify existing SqlMergeJob
  • we assume that updates are not partial but full

in that case the only thing we need is to define a hard delete column hint and give it a special treatment.

the distinction of insert/update is superfluous if updates are full. you delete records anyway so everything is "i" or "d"

and frankly this is what I'd do. we may even drop a separate write disposition and start interpreting hard delete flag in "merge".

partial deletes:
if we have partial deletes we'll need a new write disposition and a completely separate merge job based on a SQL MERGE statement.
if we do not need this now - let's do it later.
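A sketch of how such a column-level hint might be declared on a resource, assuming dlt's columns argument; the hint name (hard_delete) and the column names are illustrative only, not a finalized API:

import dlt

# Sketch: a hard-delete flag declared as a column hint instead of a separate
# write disposition; names are assumptions.
@dlt.resource(
    write_disposition="merge",
    primary_key="id",
    columns={"deleted": {"hard_delete": True}},
)
def orders():
    yield {"id": 1, "value": 10, "deleted": False}
    yield {"id": 2, "value": None, "deleted": True}  # flagged row: id=2 gets deleted downstream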

@@ -317,6 +318,19 @@ def validate_stored_schema(stored_schema: TStoredSchema) -> None:
if parent_table_name not in stored_schema["tables"]:
raise ParentTableNotFoundException(table_name, parent_table_name)

# check for "replicate" tables that miss a primary key or "cdc_config"
rudolfix (Collaborator):

this makes sense but we should move it to

  1. end of normalize stage OR
  2. beginning of load stage
    at this moment the schema may still be partial. not all columns may be present (100% after extract stage)

also we should check merge disposition.

also take a look at _verify_schema in JobClientBase - looks like our place :)

@@ -166,6 +198,7 @@ class TTableSchema(TypedDict, total=False):
columns: TTableSchemaColumns
resource: Optional[str]
table_format: Optional[TTableFormat]
cdc_config: Optional[TCdcConfig]
rudolfix (Collaborator):

this is one way to go. but IMO a better way would be to define a column level hint.
cdc_op which could be integer or single char (u/d/i)

do we really need a sequence? if so we could reuse sort or add a new hint, i.e. cdc_seq. There are helper methods to find column(s) with hints

it looks simpler to me.


insert_condition = "1 = 1"
write_disposition = root_table["write_disposition"]
if write_disposition == "replicate":
rudolfix (Collaborator):

IMO SqlMergeJob should not be aware of write disposition... Is it possible to create a base class and two subclasses for merge and replicate?

this is very hacky

@rudolfix (Collaborator):

Partial updates. We now expect all columns of an updated record to be available (also the ones that didn't change). Most source systems (Postgres, MySQL, SQL Server, Delta tables...) seem to provide this info so perhaps this isn't so important.

see my comment above. we skip it for now. it is way more work.

Conflict detection and resolution. The current logic implicitly implements a source wins resolution strategy, but it would be nice to make this configurable. I.e. to add support for target wins or the option to raise an error in case of a conflict. This page provides a useful overview.

OK so here we'd need an "i" and "u" distinction. IMO part of the advanced replication above

Soft deletes. Add a configuration option that lets users choose between hard deletes (current behavior) and soft deletes (extra column is added to final table that marks a record as deleted or not).

yeah! look at these maybe: #923 and #828. definitely the next thing we do

@jorritsandbrink (Collaborator, Author):

@rudolfix I undid a lot of the changes based on your feedback.

What's new:

  • a hard_delete column hint that gets interpreted in merge
  • interpretation of the sort column hint in merge so changes can be processed in order (e.g. if a record gets inserted then deleted in the same load, it should not be inserted in the destination—if it gets deleted then inserted in the same load, it should be inserted)
  • append and replace ignore the hard_delete hint
  • merge now has three "steps" if a hard_delete and/or sort column hint is provided: 1) delete from destination dataset 2) delete from staging dataset 3) copy staging to destination (see the sketch after the questions below)
  1. Is deleting from the staging dataset a good approach or is it better to add filters in the copy-staging-to-destination step?
  2. If it's a good approach, can the existing primary key deduplication also be done in that way rather than using temp tables?
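A condensed sketch of the three steps at this stage of the PR; identifiers are placeholders, not the SQL dlt actually generates:

# Rough illustration of the three-step flow described above.
steps = [
    # 1) delete affected rows from the destination dataset
    "DELETE FROM dataset.orders WHERE id IN (SELECT id FROM staging.orders);",
    # 2) delete hard-deleted rows from the staging dataset
    "DELETE FROM staging.orders WHERE deleted = TRUE;",
    # 3) copy what remains in staging to the destination
    "INSERT INTO dataset.orders SELECT * FROM staging.orders;",
]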

@jorritsandbrink changed the title from "Introduce replicate write disposition" to "Introduce hard_delete column hint and sorted deduplication in merge" on Feb 14, 2024
@rudolfix (Collaborator) left a comment

on top of the review:
you must test data where there are child tables. look at other merge tests. we need a test with one and two nesting levels (can be the same dataset)

@@ -333,6 +337,44 @@ def gen_merge_sql(
)
)

# remove "non-latest" records from staging table (deduplicate) if a sort column is provided
if len(primary_keys) > 0:
if has_column_with_prop(root_table, "sort"):
rudolfix (Collaborator):

as you point out this does deduplication on top of the dedup done when generating temp tables (or when inserting at the end when there are no child tables). my take: use the sort column in those clauses below if a sort column is present, otherwise ORDER BY (SELECT NULL)
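For reference, a sketch of the ROW_NUMBER-based deduplication pattern being discussed, with placeholder names; the sort column drives the ordering when present, otherwise ORDER BY (SELECT NULL) keeps an arbitrary row:

sort_column = "lsn"  # None when no sort hint is provided
order_by = "(SELECT NULL)" if sort_column is None else f"{sort_column} DESC"
dedup_sql = f"""
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY {order_by}) AS _rn
    FROM staging.orders
) AS s
WHERE s._rn = 1;
"""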

jorritsandbrink (Collaborator, Author):

Done.

""")

# remove deleted records from staging tables if a hard_delete column is provided
if has_column_with_prop(root_table, "hard_delete"):
rudolfix (Collaborator):

I think (I hope) there's a simpler way to handle hard_deletes. The code below does not need any modifications. It will delete all rows from the destination dataset (using primary and merge keys) that are present in the staging dataset. it does not matter if the hard delete flag is set or not. we must delete those rows anyway.

we only need to change how we insert, from here:

# insert from staging to dataset, truncate staging table
        for table in table_chain:

the only thing you need to do is to filter out rows that have the deleted flag set, so this is another clause in the where
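A sketch of that suggestion, assuming a placeholder "deleted" column: the delete step stays untouched and only the insert gains an extra condition.

insert_cond = "deleted IS NULL OR deleted = FALSE"  # skip rows flagged as deleted
insert_sql = f"INSERT INTO dataset.orders SELECT * FROM staging.orders WHERE {insert_cond};"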

rudolfix (Collaborator):

overall it should be way less code, we do not interfere with any edge cases by deleting and deduplicating the staging dataset + it looks like fewer row reads

jorritsandbrink (Collaborator, Author):

I have changed the approach to extending the where clause in the insert stage, rather than deleting from the staging dataset. It didn't turn out to be less code but it makes more sense nonetheless.

# first delete from root staging table
sql.append(f"""
DELETE FROM {staging_root_table_name}
WHERE {hard_delete_column} IS NOT DISTINCT FROM {escape_literal(True)};
rudolfix (Collaborator):

ok so you assume that the hard delete column is boolean. probably makes the most sense. but then you must check the type somewhere. my take:
delete if the value IS NOT NULL, OR (only in case of boolean) when true as above. maybe someone wants to have the deleted flag as a timestamp?
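A sketch of a type-aware condition along those lines; the helper name and the type label are assumptions, not dlt's actual code:

def deleted_condition(column_name: str, data_type: str) -> str:
    if data_type == "bool":
        # boolean flag: only rows explicitly set to true count as deleted
        return f"{column_name} = TRUE"
    # any other type (e.g. a deleted-at timestamp): a non-NULL value marks the row as deleted
    return f"{column_name} IS NOT NULL"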

jorritsandbrink (Collaborator, Author):

Good point, I implemented your suggestion.

) -> Sequence[TColumnSchema]:
updates = super()._create_table_update(table_name, storage_columns)
table = self.schema.get_table(table_name)
if has_column_with_prop(table, "hard_delete"):
rudolfix (Collaborator):

this is a really good point. but my take would be to have identical schemas in the staging and destination datasets. also: what about append and replace? this data won't be dropped from parquet/json files so just dropping it from the schema won't help.

I'd say let's remove it. also all the code in the merge job that skips deleted columns

jorritsandbrink (Collaborator, Author):

Removed. Staging and destination datasets now have identical schemas.

dlt/load/load.py Outdated
if not table_jobs and top_merged_table["write_disposition"] != "replace":
# if there are no jobs for the table, skip it, unless child tables need to be replaced
needs_replacement = False
if top_merged_table["write_disposition"] == "replace" or (
rudolfix (Collaborator):

why is it changed?

jorritsandbrink (Collaborator, Author):

This is needed to propagate deletes to child tables. If we provide only a primary key and the hard_delete column for a nested table, such as happens on lines 584 and 599 of test_merge_disposition.py, the child tables wouldn't get included in the table chain, and those deletes would only be executed on the root table.

rudolfix (Collaborator):

I still do not get it. We have jobs for this table because in both those lines we declare some data. The exception for replace is only for the case that there is no data at all, which does not happen here. IMO you should try to remove it and find the problem elsewhere, or ping me on slack to discuss it

@jorritsandbrink changed the title from "Introduce hard_delete column hint and sorted deduplication in merge" to "Introduce hard_delete and dedup_sort columns hint for merge" on Feb 15, 2024
@jorritsandbrink (Collaborator, Author) commented Feb 15, 2024

@rudolfix I addressed your feedback and the PR is ready for another review!

See my replies on your comments for details regarding the changes.

On top of those replies:

  • I introduced the dedup_sort column hint to not conflate different usages under the sort hint
  • I added a test with two layers of nesting as you asked for
  • I will extend the user docs to describe the new column hints after all code changes have been approved

Edit: I see the new tests failing on some of the destinations, probably because IS DISTINCT FROM isn't generally supported SQL. I tested only mssql, duckdb, and postgres locally. I'll look into it and add a new commit to support all destinations. Fixed.
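One possible portable rewrite (shown for illustration only, not necessarily the fix in the actual commit) folds NULL into FALSE with COALESCE so the comparison stays NULL-safe without IS (NOT) DISTINCT FROM:

deleted_cond = "COALESCE(deleted, FALSE) = TRUE"        # row is hard-deleted
not_deleted_cond = "COALESCE(deleted, FALSE) = FALSE"   # row survives the merge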

@rudolfix (Collaborator) left a comment

OK! we are almost there. our sql jobs code is almost unreadable now (if it ever was)

we are missing a test where you have a merge key on a non-unique column and you have a hard delete flag (you should be able to delete a whole day of data with just one flag)

also a question: does the hard delete flag make sense if there's no primary key? if the answer is no and a primary key is required we can simplify the code even more

also maybe a test case where we have a dedup sort and two rows, one with the deleted flag and one without (could run on duckdb only to make it faster)

@@ -253,28 +302,34 @@ def gen_merge_sql(
sql: List[str] = []
root_table = table_chain[0]

escape_id = sql_client.capabilities.escape_identifier
escape_lit = sql_client.capabilities.escape_literal
if escape_id is None:
rudolfix (Collaborator):

is this really possible? how could this code work before?

jorritsandbrink (Collaborator, Author):

sql_client.capabilities.escape_literal is None for snowflake. sql_client.capabilities.escape_identifier always has a value (at least with the current set of destinations), but I included the if escape_id is None: for consistency.

rudolfix (Collaborator):

right, it is not defined on snowflake because we never process literals. we need sqlglot to generate those statements. but it does not have all dialects and does not support DDL very well (maybe that changed)

if sort_column is None:
order_by = "(SELECT NULL)"
else:
order_by = f"{sort_column} DESC"
rudolfix (Collaborator):

is DESC what users expect? what is more typical?

jorritsandbrink (Collaborator, Author):

Higher values typically indicate more recent (think timestamps, or the LSN in a Postgres WAL). So if we sort in descending order, we get the most recent value, which makes sense for most typical use cases.

I could also change the dedup_sort column hint from boolean to string and have it accept "asc" or "desc" values to make it configurable for the user. What do you think?
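For illustration, the configurable variant could look roughly like this on a resource; column names are examples, and the string-valued dedup_sort hint is the proposal being discussed here, not yet an implemented API at this point:

import dlt

@dlt.resource(
    write_disposition="merge",
    primary_key="id",
    columns={
        "lsn": {"dedup_sort": "desc"},     # keep the row with the highest LSN
        "deleted": {"hard_delete": True},  # rows flagged as deleted are removed
    },
)
def change_data():
    yield {"id": 1, "lsn": 10, "deleted": False, "value": "a"}
    yield {"id": 1, "lsn": 11, "deleted": False, "value": "b"}  # wins the dedup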

rudolfix (Collaborator):

good idea! (if not too much work)

insert_sql += ";"
sql.append(insert_sql)
# -- DELETE FROM {staging_table_name} WHERE 1=1;
insert_cond = copy(not_deleted_cond) if hard_delete_col is not None else "1 = 1"
rudolfix (Collaborator):

you do not need to copy. strings are immutable. you won't change not_deleted_cond


insert_temp_table_name: str = None
if len(table_chain) > 1:
if len(primary_keys) > 0 or (len(primary_keys) == 0 and hard_delete_col is not None):
rudolfix (Collaborator):

if len(primary_keys) > 0 or hard_delete_col is not None: should be sufficient.

if condition is None:
condition = "1 = 1"
col_str = ", ".join(columns)
inner_col_str = copy(col_str)
rudolfix (Collaborator):

do not need to copy! strings are immutable

insert_temp_table_name: str = None
if len(table_chain) > 1:
if len(primary_keys) > 0 or (len(primary_keys) == 0 and hard_delete_col is not None):
condition_colummns = [hard_delete_col] if not_deleted_cond is not None else None
rudolfix (Collaborator):

condition_colummns typo

@jorritsandbrink (Collaborator, Author):

@rudolfix I addressed all your points and added the test cases you mentioned. The only remaining point is the one about child table skipping, which we are discussing on Slack.

also question: does hard deleted flag make sense if there's no primary key? if the answer is no and primary key is required we can simplify code even more

I think it makes as much sense as doing inserts/updates without a primary key. I would say deleting multiple records sharing the same key is a valid use case we should support.
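For example, such a scenario might look roughly like this (column names are illustrative): a merge key on a non-unique column plus a hard-delete flag, so a single flagged row removes a whole day of data.

import dlt

@dlt.resource(
    write_disposition="merge",
    merge_key="day",
    columns={"deleted": {"hard_delete": True}},
)
def events():
    # one record with the flag set removes every destination row for that day
    yield {"day": "2024-02-15", "deleted": True}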

@rudolfix (Collaborator) left a comment

looks like asc/desc on update is left.

I've found and fixed a lot of heresy in the code. you can see my two commits. there were edge cases where tables were not created even if they should have been, or vice versa.

now the normalizer marks tables which have seen data so the loader knows better which tables to create.


dlt/load/load.py Outdated
continue
result.add(table["name"])
return result
with self.get_destination_client(schema) as job_client:
rudolfix (Collaborator):

we can't go to the destination to check if a table exists, this costs a lot. we avoid any unnecessary database reflection at all costs.

we should return tables without this check. if we do it right, only tables that were created will be returned here.

dlt/load/load.py Outdated
):
with job_client.with_staging_dataset():
self._init_dataset_and_update_schema(
job_client,
expected_update,
staging_tables | {schema.version_table_name},
order_deduped(staging_tables + [schema.version_table_name]),
rudolfix (Collaborator):

dlt tables are never included in staging tables. no need to dedup

@rudolfix (Collaborator) left a comment

bugs fixed & some tests added. LGTM!

@rudolfix merged commit 88f2722 into devel on Feb 24, 2024
56 of 65 checks passed
@rudolfix deleted the 947-core-extensions-to-support-database-replication branch on February 24, 2024 09:08

Successfully merging this pull request may close these issues.

Core extensions to support database replication
2 participants