
SCD2 write disposition #828

Closed
adrianbr opened this issue Dec 15, 2023 · 7 comments · Fixed by #1168

adrianbr (Contributor) commented Dec 15, 2023

Feature description

Just as we have the "merge" write disposition, we should have an SCD2 write disposition.

Are you a dlt user?

Yes, I'm already a dlt user.

Use case

Multiple users have requested slowly changing dimension support.

Proposed solution

We should implement SCD2, as it is the most common, most versatile, and sufficient variant (users can add their own flags or derive the other types from it via a view):
https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row

Implementation notes / ideas (by @sh-rp)

  • Compute a hash per row and base updates on that hash (a sketch follows this list)
  • Have configurable names for the validity columns (from/until)
  • Figure out how subtables behave; also one hash per row?
  • Respect soft-delete columns (maybe as a second step?)
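
A minimal sketch of the row-hash idea in Python (hypothetical helper names, not dlt's actual implementation): a record counts as changed when its content hash differs from the hash of the currently valid version; on change, the old version is retired and a new one is inserted.

import hashlib
import json
from datetime import datetime, timezone

HIGH_TS = datetime(9999, 12, 31, tzinfo=timezone.utc)  # marks the currently valid row

def content_hash(record: dict) -> str:
    # stable serialization so that key order does not affect the hash
    return hashlib.md5(json.dumps(record, sort_keys=True).encode()).hexdigest()

def apply_scd2(versions: list[dict], incoming: dict, now: datetime) -> None:
    # "versions" holds all historical rows for a single entity
    h = content_hash(incoming)
    current = next((r for r in versions if r["valid_to"] == HIGH_TS), None)
    if current is not None and current["row_hash"] == h:
        return  # nothing changed, keep the current version
    if current is not None:
        current["valid_to"] = now  # retire the old version
    versions.append({**incoming, "row_hash": h, "valid_from": now, "valid_to": HIGH_TS})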

Related issues

No response

euanjohnston-dev commented Dec 15, 2023

  • dbt handles this with snapshots (implements type-2 slowly changing dimensions over mutable source tables).
  • The dbt runner currently seems to accept seeds and models, but not sources.

You can currently work around this by running an incremental model in dbt, but it's messy:

e.g. (BigQuery code)

{{
    config({
        "materialized": "incremental",
        "unique_key": "__etl_row_unique_id",
        "on_schema_change": "append_new_columns"
    })
}}

WITH get_base_incremental AS (

    -- dynamically establish base columns from the input
    SELECT
        t.*,
        (
            SELECT STRING_AGG(TRIM(val, '"'), '-' ORDER BY offset)
            FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"[^"]+":("[^"]+"|null)')) AS val WITH OFFSET
        ) AS concatenated_columns
    FROM (

        -- exclude fields not necessary to establish increments
        SELECT * EXCEPT (LIST_COLUMNS_TO_DROP)
        FROM SOURCE_TABLE
        WHERE id IS NOT NULL
    ) AS t
)

-- id creation, including the __etl_row_unique_id required for increments;
-- note: new columns are always appended to the end of the table despite the explicit ordering below
SELECT
    * EXCEPT (concatenated_columns),
    '{{ var("job_id") }}' AS __etl_job_id,
    concatenated_columns AS __etl_row_unique_id,
    CURRENT_DATETIME() AS __etl_last_seen_at_utc
FROM get_base_incremental

rudolfix (Collaborator) commented

@euanjohnston-dev what do you mean by

"The dbt runner currently seems to accept seeds and models, but not sources."?

You can define sources in your dbt package; we just run it with the dbt command. I'm probably missing something here, and possibly we can fix your issue quickly.

rudolfix (Collaborator) commented

@adrianbr Per the wiki article, we already support the "version" flavor of SCD2 with our load_id and append mode. What is not clear is how to identify when a record has changed. Is it some kind of content hash, or do we just rely on incremental loads and always add a new version?
Handling of deleted records is not clear to me either. Should we detect some special field in the data, or do we assume that all records are present in every load?

By the way, SCD4 looks much easier to implement (current state + history table); it could be just a flag in the loader module.
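
For illustration, a plain-Python sketch of that SCD4 shape (current state plus an append-only history table; the names here are made up, this is not a dlt feature):

from datetime import datetime, timezone

current: dict[int, dict] = {}   # natural key -> latest version (the "current" table)
history: list[dict] = []        # every version ever loaded (the "history" table)

def apply_scd4(record: dict, key: int) -> None:
    # only write when the state actually changed
    if current.get(key) != record:
        history.append({**record, "loaded_at": datetime.now(timezone.utc)})
        current[key] = record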

euanjohnston-dev commented

@rudolfix apologies, I meant to write "snapshot" instead of "source". It's dbt's way of handling SCD1 -> SCD2.

sh-rp moved this from Todo to Planned in dlt core library, Jan 18, 2024
rudolfix assigned sh-rp and unassigned rudolfix, Jan 28, 2024
sh-rp moved this from Planned to In Progress in dlt core library, Jan 30, 2024
sh-rp linked a pull request Jan 31, 2024 that will close this issue
rudolfix moved this from In Progress to Planned in dlt core library, Feb 6, 2024
VioletM added the label "community" (this issue came from the Slack community workspace), Feb 27, 2024
jorritsandbrink (Collaborator) commented
As discussed, I'll pick up this issue from here.

Most of what Dave has already suggested and implemented in #923 makes sense to me.

I think these are the main cases seen in the wild:

Source types:

  1. full extraction ➜ entire dimension table is extracted from source each pipeline run
    • can always capture deleted records by identifying "missing records"
  2. incremental extraction (CDC) ➜ only changes (INSERTs, UPDATEs, DELETEs) are extracted from source
    • can only capture deleted records if DELETEs are included in the extracted changes

Reference date types (values to use in validity columns):

  1. process time ➜ always possible
    • pros: easy to implement
    • cons: validity columns are only indirectly related to actual record validity in the source
  2. source data time ➜ only possible if source data has applicable date/timestamp column
    • pros: validity columns are arguably more meaningful from a "business perspective"
    • cons: requires user config (i.e. which column to use), more code complexity

Proposed scope:

To keep the change small, I propose to support only source type (1) and reference date type (1) at first. We can extend later based on community interest.

This would be supported:

  • configurable names for validity columns
  • soft deletes are naturally supported by SCD2 by retiring the latest record and not inserting a new one

Implementation details:

  • use the merge write disposition with a new merge strategy scd2 (a usage sketch follows this list)
  • use timestamps for the validity columns (to support sub-daily change capture)
  • a high/max timestamp in "valid to" indicates the current record
  • use _dlt_id as the row hash to identify record changes, also in child tables (this is the surrogate key needed for SCD2)
  • no validity columns in child tables, to limit data redundancy (I think this opposes Dave's idea)
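
A hypothetical usage sketch of the proposed strategy (the option names here, in particular validity_column_names, are my assumptions; the final interface is whatever ships with the linked PR):

import dlt

@dlt.resource(
    write_disposition={
        "disposition": "merge",
        "strategy": "scd2",  # the new merge strategy proposed above
        # assumed spelling of the configurable validity column names:
        # "validity_column_names": ["valid_from", "valid_to"],
    },
)
def dim_customers():
    yield {"customer_key": 1, "name": "alice", "city": "berlin"}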

It seems this is mostly what's already in #923, so I'd use that as an example; things should be fairly straightforward.

@sh-rp @rudolfix Let me know what you think.

jorritsandbrink self-assigned this Mar 24, 2024
rudolfix moved this from Planned to In Progress in dlt core library, Mar 25, 2024
jorritsandbrink (Collaborator) commented
Some notes about parent/child tables:

  • child tables are insert-only
  • parent tables can also be updated, but the only possible update is a change in the "valid to" column
  • new records will always be inserted into the child table, even if the update on the source record did not affect the complex column
    • simple to operate
    • complex columns are treated similarly to simple columns (which are also copied if they are not changed)
    • any alternative approach that doesn't insert new rows but instead reuses the existing ones would have to maintain links to multiple parent ids, probably ending up with a complex column to store those, which is exactly what we're trying to prevent with child tables
  • the row hash in the parent table should also include the complex column, even though that column is not present in the parent table (it became a child table); see the sketch after this list
    • if the only changed field in a source record is the complex column, the parent table would end up with duplicate records if the row hash didn't include that column
    • with the complex column included in the hash, we get near-identical records in the parent table, with the only difference being the hash value
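
A small sketch of that last point (hypothetical hash helper, not dlt internals): the hash is computed over the record including the nested field that becomes a child table, so a change that only touches the nested data still produces a new parent version.

import hashlib
import json

def row_hash(record: dict) -> str:
    # hash the full record, nested fields included
    return hashlib.md5(json.dumps(record, sort_keys=True).encode()).hexdigest()

v1 = {"id": 1, "name": "alice", "orders": [{"sku": "a"}]}
v2 = {"id": 1, "name": "alice", "orders": [{"sku": "b"}]}  # only the child data changed
assert row_hash(v1) != row_hash(v2)  # the parent still gets a new version row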

jorritsandbrink linked a pull request Apr 1, 2024 that will close this issue
github-project-automation bot moved this from In Progress to Done in dlt core library, Apr 14, 2024