
SCD2 write disposition #828

Closed
adrianbr opened this issue Dec 15, 2023 · 7 comments · Fixed by #1168

adrianbr (Contributor) commented Dec 15, 2023

Feature description

Just as we have the "merge" write disposition, we should have an SCD2 write disposition.

Are you a dlt user?

Yes, I'm already a dlt user.

Use case

Multiple users have requested slowly changing dimension support.

Proposed solution

We should implement SCD2, as it is the most common, most versatile, and sufficient variant (users can add their own flags or derive the other types from it via a view):
https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row

Implementation notes / ideas (by @sh-rp)

  • Compute a hash per row and base updates on that hash (a sketch follows this list)
  • Have configurable names for the validity columns (from/until)
  • Figure out how subtables behave; also one hash per row?
  • Respect soft-delete columns (maybe as a second step?)
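
A minimal sketch of the row-hash idea in Python (hypothetical helper names, not dlt's actual implementation): a record counts as changed when its content hash differs from the hash of the currently valid version; on change, the old version is retired and a new one is inserted.

import hashlib
import json
from datetime import datetime, timezone

HIGH_TS = datetime(9999, 12, 31, tzinfo=timezone.utc)  # marks the currently valid row

def content_hash(record: dict) -> str:
    # stable serialization so that key order does not affect the hash
    return hashlib.md5(json.dumps(record, sort_keys=True).encode()).hexdigest()

def apply_scd2(versions: list[dict], incoming: dict, now: datetime) -> None:
    # "versions" holds all historical rows for a single entity
    h = content_hash(incoming)
    current = next((r for r in versions if r["valid_to"] == HIGH_TS), None)
    if current is not None and current["row_hash"] == h:
        return  # nothing changed, keep the current version
    if current is not None:
        current["valid_to"] = now  # retire the old version
    versions.append({**incoming, "row_hash": h, "valid_from": now, "valid_to": HIGH_TS})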

Related issues

No response

euanjohnston-dev commented Dec 15, 2023

  • dbt handles this with snapshots (implements type-2 slowly changing dimensions over mutable source tables).
  • The dbt runner currently seems to accept seeds and models, but not sources.

You can currently work around this by running an incremental model in dbt, but it's messy:

e.g. (BigQuery code)

{{
    config({
        "materialized": "incremental",
        "unique_key": "__etl_row_unique_id",
        "on_schema_change": "append_new_columns"
    })
}}

WITH get_base_incremental AS (

    -- dynamically establish base columns from the input
    SELECT
        t.*,
        (
            SELECT STRING_AGG(TRIM(val, '"'), '-' ORDER BY offset)
            FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"[^"]+":("[^"]+"|null)')) AS val WITH OFFSET
        ) AS concatenated_columns
    FROM (

        -- exclude fields not necessary to establish increments
        SELECT * EXCEPT (LIST_COLUMNS_TO_DROP)
        FROM SOURCE_TABLE
        WHERE id IS NOT NULL
    ) AS t
)

-- id creation, including the __etl_row_unique_id required for increments;
-- note: new columns are always appended to the end of the table despite the explicit ordering below
SELECT
    * EXCEPT (concatenated_columns),
    '{{ var("job_id") }}' AS __etl_job_id,
    concatenated_columns AS __etl_row_unique_id,
    CURRENT_DATETIME() AS __etl_last_seen_at_utc
FROM get_base_incremental

rudolfix (Collaborator) commented

@euanjohnston-dev what do you mean by

"The dbt runner currently seems to accept seeds and models, but not sources."?

You can define sources in your dbt package; we just run it with the dbt command. I'm probably missing something here, and possibly we can fix your issue quickly.

rudolfix (Collaborator) commented

@adrianbr Per the wiki article, we already support the "version" flavor of SCD2 with our load_id and append mode. What is not clear is how to identify when a record has changed. Is it some kind of content hash, or do we just rely on incremental loads and always add a new version?
Handling of deleted records is not clear to me either. Should we detect some special field in the data, or do we assume that all records are present in every load?

By the way, SCD4 looks much easier to implement (current state + history table); it could be just a flag in the loader module.
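
For illustration, a plain-Python sketch of that SCD4 shape (current state plus an append-only history table; the names here are made up, this is not a dlt feature):

from datetime import datetime, timezone

current: dict[int, dict] = {}   # natural key -> latest version (the "current" table)
history: list[dict] = []        # every version ever loaded (the "history" table)

def apply_scd4(record: dict, key: int) -> None:
    # only write when the state actually changed
    if current.get(key) != record:
        history.append({**record, "loaded_at": datetime.now(timezone.utc)})
        current[key] = record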

euanjohnston-dev commented

@rudolfix apologies, I meant to write "snapshot" instead of "source". It's dbt's way of handling SCD1 -> SCD2.

sh-rp moved this from Todo to Planned in dlt core library, Jan 18, 2024
rudolfix assigned sh-rp and unassigned rudolfix, Jan 28, 2024
sh-rp moved this from Planned to In Progress in dlt core library, Jan 30, 2024
sh-rp linked a pull request Jan 31, 2024 that will close this issue
rudolfix moved this from In Progress to Planned in dlt core library, Feb 6, 2024
VioletM added the label "community" (this issue came from the Slack community workspace), Feb 27, 2024
jorritsandbrink (Collaborator) commented
As discussed, I'll pick up this issue from here.

Most of what Dave has already suggested and implemented in #923 makes sense to me.

I think these are the main cases seen in the wild:

Source types:

  1. full extraction ➜ entire dimension table is extracted from source each pipeline run
    • can always capture deleted records by identifying "missing records"
  2. incremental extraction (CDC) ➜ only changes (INSERTs, UPDATEs, DELETEs) are extracted from source
    • can only capture deleted records if DELETEs are included in the extracted changes

Reference date types (values to use in validity columns):

  1. process time ➜ always possible
    • pros: easy to implement
    • cons: validity columns are only indirectly related to actual record validity in the source
  2. source data time ➜ only possible if source data has applicable date/timestamp column
    • pros: validity columns are arguably more meaningful from a "business perspective"
    • cons: requires user config (i.e. which column to use), more code complexity

Proposed scope:

To keep the change small, I propose to support only source type (1) and reference date type (1) at first. We can extend later based on community interest.

This would be supported:

  • configurable names for validity columns
  • soft deletes are naturally supported by SCD2 by retiring the latest record and not inserting a new one

Implementation details:

  • use the merge write disposition with a new merge strategy scd2 (a usage sketch follows this list)
  • use timestamps for the validity columns (to support sub-daily change capture)
  • a high/max timestamp in "valid to" indicates the current record
  • use _dlt_id as the row hash to identify record changes, also in child tables (this is the surrogate key needed for SCD2)
  • no validity columns in child tables, to limit data redundancy (I think this opposes Dave's idea)
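
A hypothetical usage sketch of the proposed strategy (the option names here, in particular validity_column_names, are my assumptions; the final interface is whatever ships with the linked PR):

import dlt

@dlt.resource(
    write_disposition={
        "disposition": "merge",
        "strategy": "scd2",  # the new merge strategy proposed above
        # assumed spelling of the configurable validity column names:
        # "validity_column_names": ["valid_from", "valid_to"],
    },
)
def dim_customers():
    yield {"customer_key": 1, "name": "alice", "city": "berlin"}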

It seems this is mostly what's already in #923, so I'd use that as an example; things should be fairly straightforward.

@sh-rp @rudolfix Let me know what you think.

jorritsandbrink self-assigned this Mar 24, 2024
rudolfix moved this from Planned to In Progress in dlt core library, Mar 25, 2024
jorritsandbrink (Collaborator) commented
Some notes about parent/child tables:

  • child tables are insert-only
  • parent tables can also be updated, but the only possible update is a change in the "valid to" column
  • new records will always be inserted into the child table, even if the update on the source record did not affect the complex column
    • simple to operate
    • complex columns are treated similarly to simple columns (which are also copied if they are not changed)
    • any alternative approach that doesn't insert new rows but instead reuses the existing ones would have to maintain links to multiple parent ids, probably ending up with a complex column to store those, which is exactly what we're trying to prevent with child tables
  • the row hash in the parent table should also include the complex column, even though that column is not present in the parent table (it became a child table); see the sketch after this list
    • if the only changed field in a source record is the complex column, the parent table would end up with duplicate records if the row hash didn't include that column
    • with the complex column included in the hash, we get near-identical records in the parent table, with the only difference being the hash value
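
A small sketch of that last point (hypothetical hash helper, not dlt internals): the hash is computed over the record including the nested field that becomes a child table, so a change that only touches the nested data still produces a new parent version.

import hashlib
import json

def row_hash(record: dict) -> str:
    # hash the full record, nested fields included
    return hashlib.md5(json.dumps(record, sort_keys=True).encode()).hexdigest()

v1 = {"id": 1, "name": "alice", "orders": [{"sku": "a"}]}
v2 = {"id": 1, "name": "alice", "orders": [{"sku": "b"}]}  # only the child data changed
assert row_hash(v1) != row_hash(v2)  # the parent still gets a new version row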

jorritsandbrink linked a pull request Apr 1, 2024 that will close this issue
github-project-automation bot moved this from In Progress to Done in dlt core library, Apr 14, 2024