Introduce hard_delete and dedup_sort columns hint for merge #960
Changes from 6 commits
```diff
@@ -4,7 +4,7 @@
 from dlt.common.runtime.logger import pretty_format_exception
 from dlt.common.schema.typing import TTableSchema
-from dlt.common.schema.utils import get_columns_names_with_prop
+from dlt.common.schema.utils import get_columns_names_with_prop, has_column_with_prop
 from dlt.common.storages.load_storage import ParsedLoadJobFileName
 from dlt.common.utils import uniq_id
 from dlt.destinations.exceptions import MergeDispositionException
```
```diff
@@ -147,6 +147,8 @@ def generate_sql(
         First we store the root_keys of root table elements to be deleted in the temp table. Then we use the temp table to delete records from root and all child tables in the destination dataset.
         At the end we copy the data from the staging dataset into destination dataset.
+
+        If sort and/or hard_delete column hints are provided, records are deleted from the staging dataset before its data is copied to the destination dataset.
         """
         return cls.gen_merge_sql(table_chain, sql_client)
```
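For orientation, a hypothetical sketch of how these column hints might be attached to a resource. The declaration syntax and hint values are assumptions, not taken from this PR; note that at this commit the deduplication hint is still named `sort` (the PR title calls it `dedup_sort`).

```python
import dlt

@dlt.resource(
    write_disposition="merge",
    primary_key="id",
    # hypothetical hint declarations; the hint names follow this commit's diff
    columns={
        "deleted": {"hard_delete": True},  # boolean flag: True removes the row downstream
        "updated_at": {"sort": True},      # used to keep only the latest row per primary key
    },
)
def users():
    yield {"id": 1, "updated_at": "2023-01-01T00:00:00Z", "deleted": False}
```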
```diff
@@ -252,6 +254,8 @@ def gen_merge_sql(
     ) -> List[str]:
         sql: List[str] = []
         root_table = table_chain[0]
+        escape_identifier = sql_client.capabilities.escape_identifier
+        escape_literal = sql_client.capabilities.escape_literal

         # get top level table full identifiers
         root_table_name = sql_client.make_qualified_table_name(root_table["name"])
```
```diff
@@ -260,13 +264,13 @@ def gen_merge_sql(
         # get merge and primary keys from top level
         primary_keys = list(
             map(
-                sql_client.capabilities.escape_identifier,
+                escape_identifier,
                 get_columns_names_with_prop(root_table, "primary_key"),
             )
         )
         merge_keys = list(
             map(
-                sql_client.capabilities.escape_identifier,
+                escape_identifier,
                 get_columns_names_with_prop(root_table, "merge_key"),
             )
         )
```
```diff
@@ -298,7 +302,7 @@ def gen_merge_sql(
                     " it is not possible to link child tables to it.",
                 )
             # get first unique column
-            unique_column = sql_client.capabilities.escape_identifier(unique_columns[0])
+            unique_column = escape_identifier(unique_columns[0])
             # create temp table with unique identifier
             create_delete_temp_table_sql, delete_temp_table_name = cls.gen_delete_temp_table_sql(
                 unique_column, key_table_clauses
```
```diff
@@ -319,7 +323,7 @@ def gen_merge_sql(
                         f" {table['name']} so it is not possible to refer to top level table"
                         f" {root_table['name']} unique column {unique_column}",
                     )
-                root_key_column = sql_client.capabilities.escape_identifier(root_key_columns[0])
+                root_key_column = escape_identifier(root_key_columns[0])
                 sql.append(
                     cls.gen_delete_from_sql(
                         table_name, root_key_column, delete_temp_table_name, unique_column
```
```diff
@@ -333,6 +337,44 @@ def gen_merge_sql(
                 )
             )

+        # remove "non-latest" records from staging table (deduplicate) if a sort column is provided
+        if len(primary_keys) > 0:
+            if has_column_with_prop(root_table, "sort"):
```
**Reviewer:** As you point out, this does deduplication on top of the dedup done when generating temp tables (or inserting at the end when there are no child tables). My take: use […]

**Author:** Done.
```diff
+                sort_column = escape_identifier(get_columns_names_with_prop(root_table, "sort")[0])
+                sql.append(f"""
+                    DELETE FROM {staging_root_table_name}
+                    WHERE {sort_column} IN (
+                        SELECT {sort_column} FROM (
+                            SELECT {sort_column}, ROW_NUMBER() OVER (partition BY {", ".join(primary_keys)} ORDER BY {sort_column} DESC) AS _rn
+                            FROM {staging_root_table_name}
+                        ) AS a
+                        WHERE a._rn > 1
+                    );
+                """)
```
```diff
+
+        # remove deleted records from staging tables if a hard_delete column is provided
+        if has_column_with_prop(root_table, "hard_delete"):
```
**Reviewer:** I think (I hope) there's a simpler way to handle this: we only must change how we insert, from here: […] The only thing you need to do is to filter out rows that have the deleted flag set, so this is another clause in the WHERE.

**Reviewer:** Overall it should be way less code, we do not interfere with any edge cases by deleting and deduplicating the staging dataset, plus it looks like fewer row reads.

**Author:** I have changed the approach to extending the WHERE clause in the insert stage, rather than deleting from the staging dataset. It didn't turn out to be less code but it makes more sense nonetheless.
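A rough sketch of that suggestion, using a hypothetical helper rather than the code that eventually landed: build an extra predicate from the hard_delete hint and append it to the insert's WHERE clause.

```python
from dlt.common.schema.typing import TTableSchema
from dlt.common.schema.utils import get_columns_names_with_prop, has_column_with_prop

def hard_delete_filter(root_table: TTableSchema, escape_identifier, escape_literal) -> str:
    """Hypothetical helper: WHERE predicate that skips hard-deleted rows at insert time."""
    if not has_column_with_prop(root_table, "hard_delete"):
        return ""
    hard_delete_column = escape_identifier(
        get_columns_names_with_prop(root_table, "hard_delete")[0]
    )
    # NULL flags count as "not deleted", mirroring the IS (NOT) DISTINCT FROM logic below
    return f" AND {hard_delete_column} IS DISTINCT FROM {escape_literal(True)}"

# inside gen_merge_sql the predicate would extend the insert, e.g. (assumed shape):
# insert_sql = (
#     f"INSERT INTO {table_name}({columns}) SELECT {columns} "
#     f"FROM {staging_table_name} WHERE 1 = 1{hard_delete_filter(...)};"
# )
```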
```diff
+            hard_delete_column = escape_identifier(
+                get_columns_names_with_prop(root_table, "hard_delete")[0]
+            )
+            # first delete from root staging table
+            sql.append(f"""
+                DELETE FROM {staging_root_table_name}
+                WHERE {hard_delete_column} IS NOT DISTINCT FROM {escape_literal(True)};
```
**Reviewer:** OK, so you assume that the hard_delete column is boolean. Probably makes the most sense, but then you must check the type somewhere. My take: […]

**Author:** Good point, I implemented your suggestion.
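For illustration, a minimal sketch of such a type check; the helper name and error message are invented, and the reviewer's concrete suggestion is truncated above.

```python
from dlt.common.schema.typing import TTableSchema
from dlt.common.schema.utils import get_columns_names_with_prop

def get_hard_delete_column(table: TTableSchema) -> str:
    """Hypothetical validation: the hard_delete hint must sit on a boolean column."""
    column_name = get_columns_names_with_prop(table, "hard_delete")[0]
    if table["columns"][column_name].get("data_type") != "bool":
        raise ValueError(
            f"hard_delete column {column_name} in table {table['name']}"
            " must have data_type 'bool'"
        )
    return column_name
```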
```diff
+            """)
+            # then delete from child staging tables
+            for table in table_chain[1:]:
+                with sql_client.with_staging_dataset(staging=True):
+                    staging_table_name = sql_client.make_qualified_table_name(table["name"])
+                sql.append(f"""
+                    DELETE FROM {staging_table_name}
+                    WHERE NOT EXISTS (
+                        SELECT 1 FROM {staging_root_table_name} AS p
+                        WHERE {staging_table_name}.{root_key_column} = p.{unique_column}
+                    );
+                """)
```
```diff
+
         if len(table_chain) > 1:
             # create temp table used to deduplicate, only when we have primary keys
             if primary_keys:
                 (
```
```diff
@@ -343,15 +385,19 @@ def gen_merge_sql(
             )
             sql.extend(create_insert_temp_table_sql)

-        # insert from staging to dataset, truncate staging table
+        # insert from staging to dataset
         for table in table_chain:
             table_name = sql_client.make_qualified_table_name(table["name"])
             with sql_client.with_staging_dataset(staging=True):
                 staging_table_name = sql_client.make_qualified_table_name(table["name"])
             columns = ", ".join(
                 map(
-                    sql_client.capabilities.escape_identifier,
-                    get_columns_names_with_prop(table, "name"),
+                    escape_identifier,
+                    [
+                        c
+                        for c in get_columns_names_with_prop(table, "name")
+                        if c not in get_columns_names_with_prop(table, "hard_delete")
+                    ],
                 )
             )
             insert_sql = (
```
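To make the column filtering concrete, a toy evaluation of the list comprehension above, with invented column names and plain double-quoting standing in for `escape_identifier`:

```python
# invented example values for the two get_columns_names_with_prop calls above
all_columns = ["id", "updated_at", "deleted", "name"]  # ...(table, "name")
hard_delete_columns = ["deleted"]                      # ...(table, "hard_delete")

columns = ", ".join(
    f'"{c}"' for c in all_columns if c not in hard_delete_columns
)
print(columns)  # "id", "updated_at", "name" -- the deletion flag is not copied
```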
```diff
@@ -374,6 +420,5 @@ def gen_merge_sql(
             if insert_sql.strip()[-1] != ";":
                 insert_sql += ";"
             sql.append(insert_sql)
-            # -- DELETE FROM {staging_table_name} WHERE 1=1;

         return sql
```
```diff
@@ -21,6 +21,7 @@
 )
 from dlt.common.schema import Schema, TSchemaTables
 from dlt.common.schema.typing import TTableSchema, TWriteDisposition
+from dlt.common.schema.utils import has_column_with_prop
 from dlt.common.storages import LoadStorage
 from dlt.common.destination.reference import (
     DestinationClientDwhConfiguration,
```
```diff
@@ -246,8 +247,14 @@ def get_completed_table_chain(
                 for job in table_jobs
             ):
                 return None
-            # if there are no jobs for the table, skip it, unless the write disposition is replace, as we need to create and clear the child tables
-            if not table_jobs and top_merged_table["write_disposition"] != "replace":
+            # if there are no jobs for the table, skip it, unless child tables need to be replaced
+            needs_replacement = False
+            if top_merged_table["write_disposition"] == "replace" or (
```
**Reviewer:** Why is it changed?

**Author:** This is needed to propagate deletes to child tables. If we provide only a primary key and the […]

**Reviewer:** I still do not get it. We have jobs for this table, because in both those lines we declare some data. The exception for replace is only for the case that there is no data at all, which does not happen here. IMO you should try to remove it and find the problem elsewhere, or ping me on Slack to discuss it.
```diff
+                top_merged_table["write_disposition"] == "merge"
+                and has_column_with_prop(top_merged_table, "hard_delete")
+            ):
+                needs_replacement = True
+            if not table_jobs and not needs_replacement:
                 continue
             table_chain.append(table)
         # there must be at least table
```
**Reviewer:** This is a really good point, but my take would be to have identical schemas in staging and destination datasets. Also: what about append and replace? This data won't be dropped from parquet/json files, so just dropping it from the schema won't help. I'd say let's remove it, along with all the code in the merge job that skips deleted columns.

**Author:** Removed. Staging and destination datasets now have identical schemas.