SCD2 support #1168

jorritsandbrink · 2024-04-01T22:13:49Z

Description

This PR adds basic support for SCD2. Scope is defined here.

Treats scd2 as a special case of the merge write disposition.
Supports configurable names for the validity columns ("valid from" / "valid to").
Validity column names default to _dlt_valid_from and _dlt_valid_to if the user does not specify names.
Uses timestamps (not dates) to support sub-daily changes.
Uses a high timestamp (9999-12-31T00:00:00+00:00) in the "valid to" column to indicate active records.
Includes validity columns only in the root table. Validity dates for records in child tables can be linked through the root id.
Does not use "standard" configuration because it doesn't support resource-scoped configuration (correct me if I'm wrong). We want resource-scoped configuration because a user might want to load one table with the scd2 strategy and another one with "regular" merge behavior, or use different validity column names for different tables.
~~Introduces merge_config argument on the resource decorator~~ Extends the write_disposition argument so that it also accepts configuration dictionaries, letting the user specify the merge strategy and validity column names.
~~Adds merge_config key to TResourceHints, but not to TTableSchema.~~
Uses x-hints to store user configuration in the schema:
- x-merge-strategy at the table level
- x-valid-from and x-valid-to at the column level
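
To make the validity-column behaviour described above concrete, here is a minimal in-memory sketch. It is not the SQL job from this PR: `scd2_merge` is a hypothetical helper, it assumes every load delivers a full snapshot of the source, and it compares whole rows where the real implementation relies on a row hash.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List

# Values taken from the description above: high timestamp and default column names.
HIGH_TS = datetime(9999, 12, 31, tzinfo=timezone.utc)
VALID_FROM = "_dlt_valid_from"
VALID_TO = "_dlt_valid_to"


def _payload(row: Dict[str, Any]) -> tuple:
    """Row content without the validity columns, usable as a set member."""
    return tuple(sorted((k, v) for k, v in row.items() if k not in (VALID_FROM, VALID_TO)))


def scd2_merge(
    table: List[Dict[str, Any]],
    snapshot: List[Dict[str, Any]],
    load_ts: datetime,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: applies one full snapshot to an SCD2 table held in memory."""
    incoming = {_payload(row) for row in snapshot}
    result: List[Dict[str, Any]] = []
    unchanged = set()
    for record in table:
        record = dict(record)
        if record[VALID_TO] == HIGH_TS:  # record is currently active
            if _payload(record) in incoming:
                unchanged.add(_payload(record))  # still present, keep it active
            else:
                record[VALID_TO] = load_ts  # retire the old version
        result.append(record)
    for row in snapshot:
        if _payload(row) not in unchanged:
            # new or changed row: insert it as the active version
            result.append({**row, VALID_FROM: load_ts, VALID_TO: HIGH_TS})
    return result


t1 = datetime(2024, 4, 1, tzinfo=timezone.utc)
t2 = datetime(2024, 4, 2, tzinfo=timezone.utc)
table = scd2_merge([], [{"id": 1, "city": "Berlin"}], t1)
table = scd2_merge(table, [{"id": 1, "city": "Paris"}], t2)
```

After the second load the Berlin record is closed at `t2` and the Paris record is the active one, carrying the high timestamp in the "valid to" column.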

Related Issues

Closes SCD2 write disposition #828

netlify · 2024-04-01T22:14:06Z

✅ Deploy Preview for dlt-hub-docs canceled.

Name	Link
🔨 Latest commit	`6944828`
🔍 Latest deploy log	https://app.netlify.com/sites/dlt-hub-docs/deploys/661b6cce698afb000817e765

sultaniman · 2024-04-02T07:44:16Z

tests/load/pipeline/test_scd2.py

+
+
+ACTIVE_TS = datetime.fromisoformat(HIGH_TS.isoformat()).replace(tzinfo=None)
+h = DataItemNormalizer.get_row_hash


maybe alias this to get_row_hash?

Changed in 7726d98

sultaniman · 2024-04-02T07:46:18Z

dlt/common/schema/typing.py

@@ -155,6 +155,14 @@ class NormalizerInfo(TypedDict, total=True):
    new_table: bool


+class TMergeConfig(TypedDict, total=False):
+    strategy: Optional[TLoaderMergeStrategy]
+    validity_column_names: Optional[List[str]]


Should it allow duplicate column names?

It shouldn't, didn't think of that. Added validity column name checking in 30bb2e0. An exception is raised if a configured validity column name appears in the data.
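
A rough sketch of what such a check could look like, for readers following the thread. The function and exception names are hypothetical, and rejecting identical "valid from" / "valid to" names is an assumption added for illustration; the reply above only confirms the check against names that appear in the data.

```python
from typing import Any, Dict, Iterable, List


class ValidityColumnNameError(ValueError):
    """Hypothetical exception type for this sketch."""


def check_validity_column_names(
    validity_column_names: List[str], rows: Iterable[Dict[str, Any]]
) -> None:
    # Assumption: the two configured names must differ from each other.
    if len(set(validity_column_names)) != len(validity_column_names):
        raise ValidityColumnNameError(
            f"validity column names must be unique: {validity_column_names}"
        )
    # Confirmed by the thread: a configured name must not appear in the data.
    for row in rows:
        clash = set(validity_column_names) & set(row)
        if clash:
            raise ValidityColumnNameError(
                f"validity column name(s) already present in data: {sorted(clash)}"
            )


# Passes silently: no configured name occurs in the row.
check_validity_column_names(["_dlt_valid_from", "_dlt_valid_to"], [{"id": 1}])
```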

sultaniman · 2024-04-02T07:58:05Z

tests/load/pipeline/test_scd2.py

+)
+
+
+ACTIVE_TS = datetime.fromisoformat(HIGH_TS.isoformat()).replace(tzinfo=None)


Can't we just use HIGH_TS.replace(tzinfo=None)?

HIGH_TS is a pendulum DateTime because the datetime module is a banned import. However, when fetching data from the destination tables to assert its content, timestamps are returned as datetime objects. Hence the conversion from DateTime to datetime.
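
The conversion can be illustrated with the standard library alone. `DateTimeLike` below is a hypothetical stand-in for a datetime subclass such as pendulum's DateTime; the point of the sketch is only that the round trip through an ISO string produces a plain `datetime`.

```python
from datetime import datetime, timezone


class DateTimeLike(datetime):
    """Hypothetical stand-in for a datetime subclass (e.g. pendulum's DateTime)."""


HIGH_TS = DateTimeLike(9999, 12, 31, tzinfo=timezone.utc)

# Serialize to ISO 8601, parse with the base class, then drop the timezone.
ACTIVE_TS = datetime.fromisoformat(HIGH_TS.isoformat()).replace(tzinfo=None)

print(type(ACTIVE_TS).__name__, ACTIVE_TS.isoformat())
```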

sh-rp · 2024-04-03T12:08:03Z

tests/load/pipeline/test_scd2.py

+    ids=lambda x: x.name,
+)
+@pytest.mark.parametrize("simple", [True, False])
+def test_child_table(destination_config: DestinationTestConfiguration, simple: bool) -> None:


i think we should test the child table functionality on all destinations, so either enable all dests here or add a child table to the basic test. the reason is, that we should run all sql code that we have against all destinations, just to make sure it works everywhere.

Done in 11748a6

sh-rp

Very nice work, thanks! Only a few small changes and we still need the docs. Also I want to discuss two points with @rudolfix before we can merge.

sh-rp · 2024-04-03T12:24:17Z

@rudolfix two questions on this PR:

Do you agree with the interface? We now have additional fields on the resource to configure this, as opposed to a setting on the destination, which is how we do it for the replace strategy. I think this makes sense, but it also looks a bit weird given that the replace strategy is set in a different way.
Right now the "cutoff date" is the created date of the load package. I am thinking we might want to make this overridable with a setting for certain cases, but maybe it is good like this now and we can add this feature when people request it.

jorritsandbrink · 2024-04-04T07:27:07Z

> @rudolfix two questions on this PR:
>
> Do you agree with the interface? We now have additional fields on the resource to configure this, as opposed to a setting on the destination, which is how we do it for the replace strategy. I think this makes sense, but it also looks a bit weird given that the replace strategy is set in a different way.
>
> Right now the "cutoff date" is the created date of the load package. I am thinking we might want to make this overridable with a setting for certain cases, but maybe it is good like this now and we can add this feature when people request it.

Note that this is why I chose a different interface than the one used to configure replace strategy:

rudolfix

this is really good! @jorritsandbrink congrats on plugging the work into our normalizer and hints perfectly.

I think we can do the user interface better by extending write_disposition instead of adding a new argument. @sh-rp exactly what we did for data contracts with shorthand and full notation - see my review

also: I trust you that SQL transformation is good and properly tested :)

rudolfix · 2024-04-04T10:38:08Z

dlt/common/schema/typing.py

@@ -65,6 +64,7 @@
 ]
 """Known hints of a column used to declare hint regexes."""
 TWriteDisposition = Literal["skip", "append", "replace", "merge"]
+TLoaderMergeStrategy = Literal["scd2"]


we should have a default strategy which is "delete-insert". and we'll add one more merge to support MERGE sql statements #1129

Added in 396ec59

rudolfix · 2024-04-04T10:45:13Z

dlt/extract/decorators.py

@@ -297,6 +297,7 @@ def resource(
    columns: TTableHintTemplate[TAnySchemaColumns] = None,
    primary_key: TTableHintTemplate[TColumnNames] = None,
    merge_key: TTableHintTemplate[TColumnNames] = None,
+    merge_config: TMergeConfig = None,


@jorritsandbrink @sh-rp maybe instead of doing this, we could extend write_disposition to

`write_disposition: TTableHintTemplate[Union[TWriteDisposition, TMergeConfig, TReplaceConfig]] = None`

exactly what we do with `schema_contract` below? then we can support shorthand strings and full definitions and still use the same parameter. in that case I'd rename TMergeConfig to TMergeDispositionConfig

and we also handle the replace strategies in there? (not in the pr but later)? i think it's a good idea

Extended write_disposition in 8f4d4ce and c8d84d8
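
For illustration, the shorthand versus full notation idea can be sketched as a small normalization step. This is not the code from 8f4d4ce or c8d84d8: `resolve_write_disposition` is a hypothetical helper and the `"disposition"` key is an assumption, while `strategy`, `validity_column_names`, the delete-insert default and the default column names come from this thread.

```python
from typing import Any, Dict, Union

DEFAULT_MERGE_STRATEGY = "delete-insert"
DEFAULT_VALIDITY_COLUMN_NAMES = ["_dlt_valid_from", "_dlt_valid_to"]


def resolve_write_disposition(value: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: expands shorthand strings into a full configuration dict."""
    if isinstance(value, str):
        config: Dict[str, Any] = {"disposition": value}  # shorthand notation
    else:
        config = dict(value)  # full notation; copy so the caller's dict is untouched
    if config.get("disposition") == "merge":
        config.setdefault("strategy", DEFAULT_MERGE_STRATEGY)
        if config["strategy"] == "scd2":
            config.setdefault("validity_column_names", list(DEFAULT_VALIDITY_COLUMN_NAMES))
    return config


shorthand = resolve_write_disposition("merge")
full = resolve_write_disposition(
    {"disposition": "merge", "strategy": "scd2", "validity_column_names": ["from", "to"]}
)
```

Both forms end up as the same kind of dictionary, so downstream code only has to handle one shape.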

jorritsandbrink · 2024-04-07T23:13:18Z

@rudolfix I adjusted the user interface according to your suggestion and introduced a default delete-insert merge strategy. WDYT?

I'll add docs after we've settled on the user interface.

rudolfix

this is good! I have two optimizations that IMO are OK to add.
thx for adding apply_hints tests!

rudolfix · 2024-04-08T10:39:30Z

dlt/common/normalizers/json/relational.py

@@ -296,10 +319,18 @@ def normalize_data_item(
        row = cast(TDataItemRowRoot, item)
        # identify load id if loaded data must be processed after loading incrementally
        row["_dlt_load_id"] = load_id
+        # determine if row hash should be used as dlt id


could you precompute a list of all "scd2" tables in _reset method? this part of schema remains constant during normalization.
and this method is called for each normalized row. so it makes sense to optimize it

Solved with caching as you suggested on Slack: 6b24378
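
The caching idea can be sketched as follows. The class and method names are hypothetical and this is not the change from 6b24378; it only shows a per-table lookup that is memoized and rebuilt in `_reset`, so the schema is consulted once per table rather than once per row.

```python
from functools import lru_cache
from typing import Any, Dict


class RowNormalizer:
    """Hypothetical normalizer that memoizes a schema lookup per table name."""

    def __init__(self, tables: Dict[str, Dict[str, Any]]) -> None:
        self._tables = tables
        self.lookups = 0  # counts real schema lookups, for demonstration only
        self._reset()

    def _reset(self) -> None:
        # Build a fresh cache; the schema is assumed constant until the next reset.
        self._is_scd2_table = lru_cache(maxsize=None)(self._lookup_is_scd2_table)

    def _lookup_is_scd2_table(self, table_name: str) -> bool:
        self.lookups += 1
        return self._tables.get(table_name, {}).get("x-merge-strategy") == "scd2"

    def normalize_row(self, table_name: str, row: Dict[str, Any]) -> Dict[str, Any]:
        out = dict(row)
        # Called for every row, but only the first call per table hits the schema.
        out["_uses_row_hash_as_id"] = self._is_scd2_table(table_name)
        return out


normalizer = RowNormalizer({"customers": {"x-merge-strategy": "scd2"}, "events": {}})
rows = [normalizer.normalize_row("customers", {"id": i}) for i in range(1000)]
```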

rudolfix · 2024-04-08T10:46:40Z

dlt/extract/hints.py

+    def _create_table_schema(resource_hints: TResourceHints, resource_name: str) -> TTableSchema:
+        """Creates table schema from resource hints and resource name."""
+
+        dict_ = cast(Dict[str, Any], deepcopy(resource_hints))


why do we need to deepcopy hints? it was not copied in merge_keys. are you getting errors in tests?
it was already done in compute_table():

    # resolve a copy of a held template
    table_template = self._clone_hints(table_template)

note that deepcopy will also clone pydantic models and other things that were used as original column definition. and that may be quite costly

deepcopy was indeed obsolete—removed it in 1f399bc

rudolfix

LGTM!

oh dremio does not like ZULU timestamps in sql merge job... do you generate them for SCD2? I think Z timestamps are not really part of ISO

we have a dremio container in tests if you want to fix it

Jorrit Sandbrink added 3 commits March 31, 2024 23:30

format examples

720b115

add core functionality for scd2 merge strategy

115b4c9

make scd2 validity column names configurable

37befbc

jorritsandbrink self-assigned this Apr 1, 2024

jorritsandbrink linked an issue Apr 1, 2024 that may be closed by this pull request

SCD2 write disposition #828

Closed

jorritsandbrink requested a review from sh-rp April 1, 2024 22:15

sultaniman reviewed Apr 2, 2024

Jorrit Sandbrink added 2 commits April 2, 2024 23:33

make alias descriptive

7726d98

add validity column name conflict checking

30bb2e0

sh-rp reviewed Apr 3, 2024

sh-rp reviewed Apr 3, 2024

sh-rp requested changes Apr 3, 2024

rudolfix requested changes Apr 4, 2024

Jorrit Sandbrink added 5 commits April 7, 2024 10:52

Merge branch 'devel' of https://github.com/dlt-hub/dlt into 828-scd2

765d652

extend write disposition with dictionary configuration option

8f4d4ce

add default delete-insert merge strategy

396ec59

update write_disposition type hints

c8d84d8

extend tested destinations

11748a6

jorritsandbrink requested a review from rudolfix April 7, 2024 23:13

rudolfix requested changes Apr 8, 2024

adrianbr and others added 6 commits April 9, 2024 14:02

2nd time setup (#1202)

e9c8f61

remove obsolete deepcopy

1f399bc

Merge branch 'devel' of https://github.com/dlt-hub/dlt into 828-scd2

c8f4173

Merge pull request #1200 from dlt-hub/devel

c99d612

master merge for 0.4.8

add scd2 docs

0fa603b

add write_disposition existence condition

c110aae

Jorrit Sandbrink added 3 commits April 11, 2024 13:03

default to default merge strategy

4236f20

replace hardcoded column name with variable to fix test

0e7f8c0

fix doc snippets

93e7f45

rudolfix previously approved these changes Apr 11, 2024

rudolfix added 9 commits April 13, 2024 14:20

compares records without order and with caps timestamps precision in scd2 tests

36da1f2

defines create load id, stores package state typed, allows package state to be passed on, uses load_id as created_at if possible

0d6919a

creates new package to normalize from extracted package so state is carried on

2195ebf

bans direct pendulum import

52f0d7b

uses timestamps with properly reduced precision in scd2

caa9ae7

selects newest state by load_id, not created_at. this will not affect execution as long as packages are processed in order

64baf2d

adds formating datetime literal to escape

8cb24af

renames x-row-hash to x-row-version

6039f1c

corrects json and pendulum imports

dee8e08

rudolfix dismissed their stale review via dee8e08 April 13, 2024 12:31

rudolfix added 14 commits April 13, 2024 22:47

uses unique column in scd2 sql generation

e63ffe1

renames arrow items literal

12bdf2b

adds limitations to docs

c1614b9

passes only complete columns to arrow normalize

a3f47fc

renames mode to disposition

e1c53b8

Merge branch 'master' into 828-scd2

c815792

Merge branch 'devel' into 828-scd2

900cf06

saves parquet with timestamp precision corresponding to the destination and updates schema in the normalizer

6ed8000

adds transform that computes hashes of tables

4fde3cc

tests arrow/pandas + scd2

7796b00

allows scd2 columns to be added to arrow items

feca9cd

various renames

e5c78fd

uses generic caps when writing parquet if no destination context

d462a5b

disables coercing timestamps in parquet arrow writer

6944828

rudolfix merged commit 05aa413 into devel Apr 14, 2024
45 of 46 checks passed

rudolfix deleted the 828-scd2 branch April 14, 2024 11:40
