Introduce hard_delete and dedup_sort columns hint for merge #960

Merged
merged 32 commits into devel on Feb 24, 2024

Conversation

@jorritsandbrink (Collaborator) commented Feb 12, 2024

Description

This PR introduces a new write disposition replicate that can be used to propagate captured change data (INSERTs, UPDATEs, and DELETEs) from the source to the target destination.

  • requires a primary_key hint—raises SchemaException if not provided
  • requires a cdc_config hint—raises SchemaException if not provided
    • cdc_config holds information on how the change data is organized, such as which column holds the operation type and which values in that column correspond to which DML operation
  • extended SqlMergeJob to implement replicate; SqlMergeJob now handles both merge and replicate write dispositions
  • mechanism: first load change data to staging table, then propagate changes in final table using simple "delete-and-insert" logic (similar to how merge works, but here we have to filter out delete records before we insert from the staging table); see the sketch below this list
    • all primary key values present in the staging table (corresponding with insert, update, and delete operations) are deleted from the final table
    • records in the staging table corresponding with insert and update operations are inserted in the final table
  • also works with child tables
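A minimal sketch of the delete-and-insert logic described above; table and column names are placeholders, not the identifiers dlt generates:

# Hypothetical sketch of the originally proposed staging-to-final propagation.
staging, final = "staging.orders", "dataset.orders"
statements = [
    # 1) all primary key values present in staging are deleted from the final table
    f"DELETE FROM {final} WHERE id IN (SELECT id FROM {staging});",
    # 2) only rows whose change type is insert or update are (re)inserted
    f"INSERT INTO {final} SELECT * FROM {staging} WHERE op IN ('i', 'u');",
]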

The above no longer applies—see #960 (comment)

Potentially useful functionality not yet implemented:

  • Partial updates. We now expect all columns of an updated record to be available (including the ones that didn't change). Most source systems (Postgres, MySQL, SQL Server, Delta tables...) seem to provide this info so perhaps this isn't so important.
  • Conflict detection and resolution. The current logic implicitly implements a source wins resolution strategy, but it would be nice to make this configurable. I.e. to add support for target wins or the option to raise an error in case of a conflict. This page provides a useful overview.
  • Soft deletes. Add a configuration option that lets users choose between hard deletes (current behavior) and soft deletes (extra column is added to final table that marks a record as deleted or not).

Related Issues

Additional Context

@jorritsandbrink jorritsandbrink linked an issue Feb 12, 2024 that may be closed by this pull request

@jorritsandbrink (Collaborator, Author):

@rudolfix can you have a look at this and see if it aligns with your ideas?

@rudolfix (Collaborator) left a comment

we are going in a good direction. my biggest issue is with the way SqlMergeJob is hacked. IMO we can radically simplify this if:

  • we really want to modify existing SqlMergeJob
  • we assume that updates are not partial but full

in that case the only thing we need is to define a hard delete column hint and give it a special treatment.

the distinction of insert/update is superfluous if updates are full. you delete records anyway so everything is "i" or "d"

and frankly this is what I'd do. we may even drop a separate write disposition and start interpreting hard delete flag in "merge".

partial deletes:
if we have partial deletes we'll need a new write disposition and a completely separate merge job based on a SQL MERGE statement.
if we do not need this now - let's do it later.
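A sketch of how such a column-level hint might be declared on a resource, assuming dlt's columns argument; the hint name (hard_delete) and the column names are illustrative only, not a finalized API:

import dlt

# Sketch: a hard-delete flag declared as a column hint instead of a separate
# write disposition; names are assumptions.
@dlt.resource(
    write_disposition="merge",
    primary_key="id",
    columns={"deleted": {"hard_delete": True}},
)
def orders():
    yield {"id": 1, "value": 10, "deleted": False}
    yield {"id": 2, "value": None, "deleted": True}  # flagged row: id=2 gets deleted downstream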

@@ -317,6 +318,19 @@ def validate_stored_schema(stored_schema: TStoredSchema) -> None:
if parent_table_name not in stored_schema["tables"]:
raise ParentTableNotFoundException(table_name, parent_table_name)

# check for "replicate" tables that miss a primary key or "cdc_config"
rudolfix (Collaborator):

this makes sense but we should move it to

  1. end of normalize stage OR
  2. beginning of load stage
    at this moment the schema may still be partial. not all columns may be present (100% after extract stage)

also we should check merge disposition.

also take a look at _verify_schema in JobClientBase - looks like our place :)

@@ -166,6 +198,7 @@ class TTableSchema(TypedDict, total=False):
columns: TTableSchemaColumns
resource: Optional[str]
table_format: Optional[TTableFormat]
cdc_config: Optional[TCdcConfig]
rudolfix (Collaborator):

this is one way to go. but IMO a better way would be to define a column level hint.
cdc_op which could be integer or single char (u/d/i)

do we really need a sequence? if so we could reuse sort or add a new hint, i.e. cdc_seq. There are helper methods to find column(s) with hints

it looks simpler to me.


insert_condition = "1 = 1"
write_disposition = root_table["write_disposition"]
if write_disposition == "replicate":
rudolfix (Collaborator):

IMO SqlMergeJob should not be aware of write disposition... Is it possible to create a base class and two subclasses for merge and replicate?

this is very hacky

@rudolfix (Collaborator):

Partial updates. We now expect all columns of an updated record to be available (also the ones that didn't change). Most source systems (Postgres, MySQL, SQL Server, Delta tables...) seem to provide this info so perhaps this isn't so important.

see my comment above. we skip it for now. it is way more work.

Conflict detection and resolution. The current logic implicitly implements a source wins resolution strategy, but it would be nice to make this configurable. I.e. to add support for target wins or the option to raise an error in case of a conflict. This page provides a useful overview.

OK so here we'd need an "i" and "u" distinction. IMO part of the advanced replication above

Soft deletes. Add a configuration option that lets users choose between hard deletes (current behavior) and soft deletes (extra column is added to final table that marks a record as deleted or not).

yeah! look at these maybe: #923 and #828. definitely the next thing we do

@jorritsandbrink (Collaborator, Author):

@rudolfix I undid a lot of the changes based on your feedback.

What's new:

  • a hard_delete column hint that gets interpreted in merge
  • interpretation of the sort column hint in merge so changes can be processed in order (e.g. if a record gets inserted then deleted in the same load, it should not be inserted in the destination—if it gets deleted then inserted in the same load, it should be inserted)
  • append and replace ignore the hard_delete hint
  • merge now has three "steps" if a hard_delete and/or sort column hint is provided: 1) delete from destination dataset 2) delete from staging dataset 3) copy staging to destination (see the sketch after the questions below)
  1. Is deleting from the staging dataset a good approach or is it better to add filters in the copy-staging-to-destination step?
  2. If it's a good approach, can the existing primary key deduplication also be done in that way rather than using temp tables?
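A condensed sketch of the three steps at this stage of the PR; identifiers are placeholders, not the SQL dlt actually generates:

# Rough illustration of the three-step flow described above.
steps = [
    # 1) delete affected rows from the destination dataset
    "DELETE FROM dataset.orders WHERE id IN (SELECT id FROM staging.orders);",
    # 2) delete hard-deleted rows from the staging dataset
    "DELETE FROM staging.orders WHERE deleted = TRUE;",
    # 3) copy what remains in staging to the destination
    "INSERT INTO dataset.orders SELECT * FROM staging.orders;",
]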

@jorritsandbrink changed the title from "Introduce replicate write disposition" to "Introduce hard_delete column hint and sorted deduplication in merge" on Feb 14, 2024
@rudolfix (Collaborator) left a comment

on top of the review:
you must test data where there are child tables. look at other merge tests. we need a test with one and two nesting levels (can be the same dataset)

@@ -333,6 +337,44 @@ def gen_merge_sql(
)
)

# remove "non-latest" records from staging table (deduplicate) if a sort column is provided
if len(primary_keys) > 0:
if has_column_with_prop(root_table, "sort"):
rudolfix (Collaborator):

as you point out this does deduplication on top of the dedup done when generating temp tables (or when inserting at the end when there are no child tables). my take: use the sort column in those clauses below if a sort column is present, otherwise ORDER BY (SELECT NULL)
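For reference, a sketch of the ROW_NUMBER-based deduplication pattern being discussed, with placeholder names; the sort column drives the ordering when present, otherwise ORDER BY (SELECT NULL) keeps an arbitrary row:

sort_column = "lsn"  # None when no sort hint is provided
order_by = "(SELECT NULL)" if sort_column is None else f"{sort_column} DESC"
dedup_sql = f"""
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY {order_by}) AS _rn
    FROM staging.orders
) AS s
WHERE s._rn = 1;
"""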

jorritsandbrink (Collaborator, Author):

Done.

""")

# remove deleted records from staging tables if a hard_delete column is provided
if has_column_with_prop(root_table, "hard_delete"):
rudolfix (Collaborator):

I think (I hope) there's a simpler way to handle hard_deletes. The code below does not need any modifications. It will delete all rows from the destination dataset (using primary and merge keys) that are present in the staging dataset. it does not matter if the hard delete flag is set or not. we must delete those rows anyway.

we only need to change how we insert, from here:

# insert from staging to dataset, truncate staging table
        for table in table_chain:

the only thing you need to do is to filter out rows that have the deleted flag set, so this is another clause in the where
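A sketch of that suggestion, assuming a placeholder "deleted" column: the delete step stays untouched and only the insert gains an extra condition.

insert_cond = "deleted IS NULL OR deleted = FALSE"  # skip rows flagged as deleted
insert_sql = f"INSERT INTO dataset.orders SELECT * FROM staging.orders WHERE {insert_cond};"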

rudolfix (Collaborator):

overall it should be way less code, we do not interfere with any edge cases by deleting and deduplicating the staging dataset + it looks like fewer row reads

jorritsandbrink (Collaborator, Author):

I have changed the approach to extending the where clause in the insert stage, rather than deleting from the staging dataset. It didn't turn out to be less code but it makes more sense nonetheless.

# first delete from root staging table
sql.append(f"""
DELETE FROM {staging_root_table_name}
WHERE {hard_delete_column} IS NOT DISTINCT FROM {escape_literal(True)};
rudolfix (Collaborator):

ok so you assume that the hard delete column is boolean. probably makes the most sense. but then you must check the type somewhere. my take:
delete if the value IS NOT NULL, OR (only in case of boolean) when true as above. maybe someone wants to have the deleted flag as a timestamp?
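A sketch of a type-aware condition along those lines; the helper name and the type label are assumptions, not dlt's actual code:

def deleted_condition(column_name: str, data_type: str) -> str:
    if data_type == "bool":
        # boolean flag: only rows explicitly set to true count as deleted
        return f"{column_name} = TRUE"
    # any other type (e.g. a deleted-at timestamp): a non-NULL value marks the row as deleted
    return f"{column_name} IS NOT NULL"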

jorritsandbrink (Collaborator, Author):

Good point, I implemented your suggestion.

) -> Sequence[TColumnSchema]:
updates = super()._create_table_update(table_name, storage_columns)
table = self.schema.get_table(table_name)
if has_column_with_prop(table, "hard_delete"):
rudolfix (Collaborator):

this is a really good point. but my take would be to have identical schemas in the staging and destination datasets. also: what about append and replace? this data won't be dropped from parquet/json files so just dropping it from the schema won't help.

I'd say let's remove it. also all the code in the merge job that skips deleted columns

jorritsandbrink (Collaborator, Author):

Removed. Staging and destination datasets now have identical schemas.

dlt/load/load.py Outdated
if not table_jobs and top_merged_table["write_disposition"] != "replace":
# if there are no jobs for the table, skip it, unless child tables need to be replaced
needs_replacement = False
if top_merged_table["write_disposition"] == "replace" or (
rudolfix (Collaborator):

why is it changed?

jorritsandbrink (Collaborator, Author):

This is needed to propagate deletes to child tables. If we provide only a primary key and the hard_delete column for a nested table, such as happens on lines 584 and 599 of test_merge_disposition.py, the child tables wouldn't get included in the table chain, and those deletes would only be executed on the root table.

rudolfix (Collaborator):

I still do not get it. We have jobs for this table because in both those lines we declare some data. The exception for replace is only for the case that there is no data at all, which does not happen here. IMO you should try to remove it and find the problem elsewhere, or ping me on slack to discuss it

@jorritsandbrink changed the title from "Introduce hard_delete column hint and sorted deduplication in merge" to "Introduce hard_delete and dedup_sort columns hint for merge" on Feb 15, 2024
@jorritsandbrink (Collaborator, Author) commented Feb 15, 2024

@rudolfix I addressed your feedback and the PR is ready for another review!

See my replies on your comments for details regarding the changes.

On top of those replies:

  • I introduced the dedup_sort column hint to not conflate different usages under the sort hint
  • I added a test with two layers of nesting as you asked for
  • I will extend the user docs to describe the new column hints after all code changes have been approved

Edit: I see the new tests failing on some of the destinations, probably because IS DISTINCT FROM isn't generally supported SQL. I tested only mssql, duckdb, and postgres locally. I'll look into it and add a new commit to support all destinations. Fixed.
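One possible portable rewrite (shown for illustration only, not necessarily the fix in the actual commit) folds NULL into FALSE with COALESCE so the comparison stays NULL-safe without IS (NOT) DISTINCT FROM:

deleted_cond = "COALESCE(deleted, FALSE) = TRUE"        # row is hard-deleted
not_deleted_cond = "COALESCE(deleted, FALSE) = FALSE"   # row survives the merge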

@rudolfix (Collaborator) left a comment

OK! we are almost there. our sql jobs code is almost unreadable now (if it ever was)

we are missing a test where you have a merge key on a non-unique column and you have a hard delete flag (you should be able to delete a whole day of data with just one flag)

also a question: does the hard delete flag make sense if there's no primary key? if the answer is no and a primary key is required we can simplify the code even more

also maybe a test case where we have a dedup sort and two rows, one with the deleted flag and one without (could run on duckdb only to make it faster)

@@ -253,28 +302,34 @@ def gen_merge_sql(
sql: List[str] = []
root_table = table_chain[0]

escape_id = sql_client.capabilities.escape_identifier
escape_lit = sql_client.capabilities.escape_literal
if escape_id is None:
rudolfix (Collaborator):

is this really possible? how could this code work before?

jorritsandbrink (Collaborator, Author):

sql_client.capabilities.escape_literal is None for snowflake. sql_client.capabilities.escape_identifier always has a value (at least with the current set of destinations), but I included the if escape_id is None: for consistency.

rudolfix (Collaborator):

right, it is not defined on snowflake because we never process literals. we need sqlglot to generate those statements. but it does not have all dialects and does not support DDL very well (maybe that changed)

if sort_column is None:
order_by = "(SELECT NULL)"
else:
order_by = f"{sort_column} DESC"
rudolfix (Collaborator):

is DESC what users expect? what is more typical?

jorritsandbrink (Collaborator, Author):

Higher values typically indicate more recent (think timestamps, or the LSN in a Postgres WAL). So if we sort in descending order, we get the most recent value, which makes sense for most typical use cases.

I could also change the dedup_sort column hint from boolean to string and have it accept "asc" or "desc" values to make it configurable for the user. What do you think?
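For illustration, the configurable variant could look roughly like this on a resource; column names are examples, and the string-valued dedup_sort hint is the proposal being discussed here, not yet an implemented API at this point:

import dlt

@dlt.resource(
    write_disposition="merge",
    primary_key="id",
    columns={
        "lsn": {"dedup_sort": "desc"},     # keep the row with the highest LSN
        "deleted": {"hard_delete": True},  # rows flagged as deleted are removed
    },
)
def change_data():
    yield {"id": 1, "lsn": 10, "deleted": False, "value": "a"}
    yield {"id": 1, "lsn": 11, "deleted": False, "value": "b"}  # wins the dedup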

rudolfix (Collaborator):

good idea! (if not too much work)

insert_sql += ";"
sql.append(insert_sql)
# -- DELETE FROM {staging_table_name} WHERE 1=1;
insert_cond = copy(not_deleted_cond) if hard_delete_col is not None else "1 = 1"
rudolfix (Collaborator):

you do not need to copy. strings are immutable. you won't change not_deleted_cond


insert_temp_table_name: str = None
if len(table_chain) > 1:
if len(primary_keys) > 0 or (len(primary_keys) == 0 and hard_delete_col is not None):
rudolfix (Collaborator):

if len(primary_keys) > 0 or hard_delete_col is not None: should be sufficient.

if condition is None:
condition = "1 = 1"
col_str = ", ".join(columns)
inner_col_str = copy(col_str)
rudolfix (Collaborator):

do not need to copy! strings are immutable

insert_temp_table_name: str = None
if len(table_chain) > 1:
if len(primary_keys) > 0 or (len(primary_keys) == 0 and hard_delete_col is not None):
condition_colummns = [hard_delete_col] if not_deleted_cond is not None else None
rudolfix (Collaborator):

condition_colummns typo

@jorritsandbrink (Collaborator, Author):

@rudolfix I addressed all your points and added the test cases you mentioned. The only remaining point is the one about child table skipping, which we are discussing on Slack.

also question: does hard deleted flag make sense if there's no primary key? if the answer is no and primary key is required we can simplify code even more

I think it makes as much sense as doing inserts/updates without a primary key. I would say deleting multiple records sharing the same key is a valid use case we should support.
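For example, such a scenario might look roughly like this (column names are illustrative): a merge key on a non-unique column plus a hard-delete flag, so a single flagged row removes a whole day of data.

import dlt

@dlt.resource(
    write_disposition="merge",
    merge_key="day",
    columns={"deleted": {"hard_delete": True}},
)
def events():
    # one record with the flag set removes every destination row for that day
    yield {"day": "2024-02-15", "deleted": True}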

@rudolfix (Collaborator) left a comment

looks like asc/desc on update is left.

I've found and fixed a lot of heresy in the code. you can see my two commits. there were edge cases where tables were not created even if they should have been, or vice versa.

now the normalizer marks tables which have seen data so the loader knows better which tables to create.


dlt/load/load.py Outdated
continue
result.add(table["name"])
return result
with self.get_destination_client(schema) as job_client:
rudolfix (Collaborator):

we can't go to the destination to check if a table exists, this costs a lot. we avoid any unnecessary database reflection at all costs.

we should return tables without this check. if we do it right, only tables that were created will be returned here.

dlt/load/load.py Outdated
):
with job_client.with_staging_dataset():
self._init_dataset_and_update_schema(
job_client,
expected_update,
staging_tables | {schema.version_table_name},
order_deduped(staging_tables + [schema.version_table_name]),
rudolfix (Collaborator):

dlt tables are never included in staging tables. no need to dedup

@rudolfix (Collaborator) left a comment

bugs fixed & some tests added. LGTM!

@rudolfix merged commit 88f2722 into devel on Feb 24, 2024
56 of 65 checks passed
@rudolfix deleted the 947-core-extensions-to-support-database-replication branch on February 24, 2024 09:08

Successfully merging this pull request may close these issues.

Core extensions to support database replication
2 participants