
SCD2 Experiment #923

Closed · wants to merge 3 commits

Conversation

@sh-rp (Collaborator) commented Jan 31, 2024

Description

Proof of concept for the scd2 implementation. This is an implementation of scd2 as described here: https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row, with a valid_from and a valid_until column. In this prototype we treat scd2 as a merge_strategy rather than its own write_disposition; given the similarities between scd2 and merge this might make sense, but we could also introduce a new write disposition.
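To make the idea concrete, here is a minimal sketch of how a resource might opt into the strategy. The hint shape (a "strategy" key on the merge write disposition) and the destination/dataset names are assumptions for this prototype, not a settled API:

```python
import dlt

# Hypothetical hint shape: scd2 as a strategy of the merge write disposition.
@dlt.resource(
    table_name="customers",
    write_disposition={"disposition": "merge", "strategy": "scd2"},
)
def customers():
    # A full snapshot of the source table is expected on every run (see below).
    yield [
        {"id": 1, "name": "Alice"},
        {"id": 2, "name": "Bob"},
    ]

pipeline = dlt.pipeline(pipeline_name="scd2_demo", destination="duckdb", dataset_name="scd2_demo")
pipeline.run(customers())
```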

Loading with this strategy always assumes a full dataset sync from the source; this way we can detect missing records and mark them as no longer valid.
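As an illustration of that behaviour (column names follow the description above; whether the open interval is NULL or a max timestamp is an assumption):

```python
# State after load 1 (2024-01-30): both records are currently valid.
customers = [
    {"id": 1, "name": "Alice", "valid_from": "2024-01-30", "valid_until": None},
    {"id": 2, "name": "Bob",   "valid_from": "2024-01-30", "valid_until": None},
]

# Load 2 (2024-01-31): Alice's name changed and Bob is missing from the snapshot.
# The old Alice version and the Bob record are retired, a new Alice row is added.
customers = [
    {"id": 1, "name": "Alicia", "valid_from": "2024-01-31", "valid_until": None},
    {"id": 1, "name": "Alice",  "valid_from": "2024-01-30", "valid_until": "2024-01-31"},
    {"id": 2, "name": "Bob",    "valid_from": "2024-01-30", "valid_until": "2024-01-31"},
]
```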

Child tables get handled the same way as their parents. All tables are retained but gain validity columns. This part still needs to be hammered out.

Open questions and ideas

  • Where is the right place to extend the schema if additional columns are needed as in this case?
  • Is it safe to convert the _dlt_id into a hash of the row in this scenario? Since we merge on this key in the merge job, there will never be two of the same _dlt_ids.
  • How do we integrate soft delete columns into this?
  • As a compromise, we could also have an scd2 mode that does not require a full sync, but it would then not mark records as deleted when no new value arrives.

    @classmethod
    def gen_scd2_sql(

@sh-rp (Collaborator, Author) commented:

This will probably work for most SQL destinations; I am not quite sure about the code that copies column values into the staging dataset, but we will see in the tests.
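For reference, a minimal sketch of the two statements such a job could emit, assuming _dlt_id is a hash of the full row and a single boundary timestamp per load; table and column names are placeholders, not the SQL this branch actually generates:

```python
from typing import List

def gen_scd2_sql(table: str, staging_table: str, boundary_ts: str) -> List[str]:
    """Sketch only: retire vanished or changed records, then insert new versions."""
    # A currently valid row whose row hash no longer appears in staging was
    # either changed or deleted in the source, so its validity window is closed.
    retire = f"""
        UPDATE {table} SET valid_until = '{boundary_ts}'
        WHERE valid_until IS NULL
          AND _dlt_id NOT IN (SELECT _dlt_id FROM {staging_table});
    """
    # Staging rows whose hash is not among the currently valid rows are new
    # records or new versions and start a fresh validity window.
    insert = f"""
        INSERT INTO {table}
        SELECT s.*, '{boundary_ts}' AS valid_from, NULL AS valid_until
        FROM {staging_table} AS s
        WHERE s._dlt_id NOT IN (
            SELECT _dlt_id FROM {table} WHERE valid_until IS NULL
        );
    """
    return [retire, insert]
```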

@@ -224,7 +224,14 @@ def _create_append_followup_jobs(self, table_chain: Sequence[TTableSchema]) -> L
        return []

    def _create_merge_followup_jobs(self, table_chain: Sequence[TTableSchema]) -> List[NewLoadJob]:
        return [SqlMergeJob.from_table_chain(table_chain, self.sql_client)]
        now = pendulum.now()

@sh-rp (Collaborator, Author) commented:

I would say we need a cutoff date that is the same for the whole load. Probably we should take the timestamp at which the load package was created, or something like this; I am just not sure if this is saved somewhere at the moment.
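A minimal sketch of that idea, assuming the load package creation time can be passed down to the follow-up jobs (where exactly that timestamp is stored is the open question here):

```python
import pendulum

def get_load_boundary_ts(load_package_created_at=None):
    """Return one cutoff timestamp to be shared by every scd2 job in a load.

    Falling back to pendulum.now() keeps the current prototype behaviour but
    loses the per-load consistency asked for in this comment.
    """
    return load_package_created_at or pendulum.now()

# All table chains in the same load package get the same boundary, so
# valid_from / valid_until line up across parent and child tables.
boundary_ts = get_load_boundary_ts()
```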

@@ -240,6 +240,15 @@ def coerce_row(
        updated_table_partial["columns"] = {}
        updated_table_partial["columns"][new_col_name] = new_col_def

        # insert columns defs for scd2 (TODO: where to do this properly, maybe in a step after the normalization?)

@sh-rp (Collaborator, Author) commented:

Where is the right place to update the schema based on destination settings? Maybe we should do this at the beginning of the load step? I am not sure.
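A sketch of what that schema extension could look like, wherever it ends up living; the column names and the timestamp data type mirror the description above, and the dict shapes are simplified:

```python
from typing import Any, Dict

def add_scd2_validity_columns(table_partial: Dict[str, Any]) -> Dict[str, Any]:
    """Add valid_from / valid_until column definitions to a table schema partial."""
    columns = table_partial.setdefault("columns", {})
    for name in ("valid_from", "valid_until"):
        columns.setdefault(name, {"name": name, "data_type": "timestamp", "nullable": True})
    return table_partial

# Could be called from the normalizer, from coerce_row, or at the start of the
# load step, which is exactly what this comment asks about.
add_scd2_validity_columns({"name": "customers", "columns": {}})
```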

@@ -148,6 +149,14 @@ def _extend_row(extend: DictStrAny, row: TDataItemRow) -> None:
    def _add_row_id(
        self, table: str, row: TDataItemRow, parent_row_id: str, pos: int, _r_lvl: int
    ) -> str:
        # sometimes row id needs to be hash for now hardcode here

@sh-rp (Collaborator, Author) commented:

For scd2 it makes the most sense to have a hash of the row in the row_id.
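A sketch of such a row hash, using plain json + sha256 for illustration; dlt's own digest and id-shortening helpers would presumably be used instead:

```python
import hashlib
import json

def row_hash(row: dict) -> str:
    """Deterministic hash of a row's content, usable as a content-based _dlt_id."""
    # Sorted keys make the digest independent of key order; identical content
    # always yields the same id, so the merge key never collides for one version.
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

assert row_hash({"id": 1, "name": "Alice"}) == row_hash({"name": "Alice", "id": 1})
```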

@sh-rp linked an issue Jan 31, 2024 that may be closed by this pull request
@sh-rp self-assigned this Mar 18, 2024
@rudolfix closed this Apr 2, 2024
@sh-rp deleted the d#/scd2_experiment branch April 17, 2024 13:36
Successfully merging this pull request may close these issues: SCD2 write disposition.

2 participants