SCD2 support #1168

Merged: 45 commits into devel from 828-scd2 on Apr 14, 2024

Conversation

@jorritsandbrink (Collaborator) commented Apr 1, 2024

Description

This PR adds basic support for SCD2. Scope is defined here.

  • Treats scd2 as a special case of the merge write disposition.
  • Supports configurable names for the validity columns ("valid from" / "valid to").
  • Validity column names default to _dlt_valid_from and _dlt_valid_to if the user does not specify names.
  • Uses timestamps (not dates) to support sub-daily changes.
  • Uses a high timestamp (9999-12-31T00:00:00+00:00) in the "valid to" column to indicate active records.
  • Includes validity columns only in the root table. Validity dates for records in child tables can be linked through the root id.
  • Does not use "standard" configuration because it doesn't support resource-scoped configuration (correct me if I'm wrong). We want resource-scoped configuration because a user might want to load one table with the scd2 strategy and another one with "regular" merge behavior, or use different validity column names for different tables.
  • Extends the write_disposition argument so that it also accepts configuration dictionaries, letting the user specify the merge strategy and validity column names. This replaces the originally proposed merge_config argument on the resource decorator.
  • Adds merge_config key to TResourceHints, but not to TTableSchema.
  • Uses x-hints to store user configuration in the schema:
    • x-merge-strategy at the table level
    • x-valid-from and x-valid-to at the column level
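
To make the validity-column behaviour in the list above concrete, here is a minimal in-memory sketch in Python. It is not the SQL that this PR generates: `scd2_merge`, its parameters, and the choice to retire records that are missing from the incoming snapshot are assumptions made for illustration, and it compares whole rows rather than row hashes. Only the default column names and the high timestamp come from the description.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List

# Values taken from the PR description.
HIGH_TS = datetime(9999, 12, 31, tzinfo=timezone.utc)  # "valid to" of active records
VALID_FROM, VALID_TO = "_dlt_valid_from", "_dlt_valid_to"

Row = Dict[str, Any]


def scd2_merge(table: List[Row], snapshot: List[Row], key: str, load_ts: datetime) -> List[Row]:
    """Hypothetical helper: apply one full snapshot to an SCD2 table (list of dicts)."""
    incoming = {row[key]: row for row in snapshot}
    active = {rec[key]: rec for rec in table if rec[VALID_TO] == HIGH_TS}

    # Retire active records whose payload changed or that are absent (assumption).
    for k, rec in active.items():
        payload = {c: v for c, v in rec.items() if c not in (VALID_FROM, VALID_TO)}
        if incoming.get(k) != payload:
            rec[VALID_TO] = load_ts

    # Insert a new active version for every new or changed key.
    for k, row in incoming.items():
        rec = active.get(k)
        if rec is None or rec[VALID_TO] != HIGH_TS:
            table.append({**row, VALID_FROM: load_ts, VALID_TO: HIGH_TS})
    return table
```

After any number of calls, at most one row per key carries the high timestamp in `_dlt_valid_to`; superseded versions keep the load timestamp at which they were retired.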

Related Issues

@jorritsandbrink self-assigned this Apr 1, 2024
@jorritsandbrink linked an issue Apr 1, 2024 that may be closed by this pull request


@jorritsandbrink requested a review from sh-rp April 1, 2024 22:15


ACTIVE_TS = datetime.fromisoformat(HIGH_TS.isoformat()).replace(tzinfo=None)
h = DataItemNormalizer.get_row_hash
Contributor

maybe alias this to get_row_hash?

Collaborator, Author

Changed in 7726d98

@@ -155,6 +155,14 @@ class NormalizerInfo(TypedDict, total=True):
    new_table: bool


class TMergeConfig(TypedDict, total=False):
    strategy: Optional[TLoaderMergeStrategy]
    validity_column_names: Optional[List[str]]
Contributor

Should it allow duplicate column names?

Collaborator, Author

It shouldn't, didn't think of that. Added validity column name checking in 30bb2e0. An exception is raised if a configured validity column name appears in the data.
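
A check of this kind might look roughly like the following sketch. The function name, the exception type, and the messages are hypothetical and not taken from commit 30bb2e0; the distinctness test reflects the reviewer's question and is an assumption about what the real check covers.

```python
from typing import Any, Dict, Sequence


def check_validity_column_names(names: Sequence[str], row: Dict[str, Any]) -> None:
    """Hypothetical check: validity columns must be distinct and absent from the data."""
    # Assumption: "valid from" and "valid to" may not share a name.
    if len(set(names)) != len(names):
        raise ValueError(f"Validity column names must be distinct, got {list(names)}")
    # Described in the reply above: a configured name must not appear in the data.
    clashes = [name for name in names if name in row]
    if clashes:
        raise ValueError(f"Validity column name(s) {clashes} already appear in the data")
```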

dlt/destinations/sql_jobs.py (resolved thread)
)


ACTIVE_TS = datetime.fromisoformat(HIGH_TS.isoformat()).replace(tzinfo=None)
Contributor

Can't we just use HIGH_TS.replace(tzinfo=None)?

Collaborator, Author

HIGH_TS is a pendulum DateTime because the datetime module is a banned import. However, when fetching data from the destination tables to assert its content, timestamps are returned as datetime objects. Hence the conversion from DateTime to datetime.
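
The conversion can be illustrated with the standard library alone. The sketch below starts from the ISO string of the high timestamp instead of a pendulum object (an assumption, to keep it dependency-free) and shows that round-tripping through `datetime.fromisoformat` yields a plain, naive `datetime`.

```python
from datetime import datetime, timezone

# ISO form of the high timestamp from the PR description; in the PR this
# string would come from HIGH_TS.isoformat() on a pendulum DateTime.
HIGH_TS_ISO = "9999-12-31T00:00:00+00:00"

aware = datetime.fromisoformat(HIGH_TS_ISO)  # stdlib datetime, timezone-aware (UTC)
ACTIVE_TS = aware.replace(tzinfo=None)       # naive datetime used for comparisons
```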

dlt/destinations/sql_jobs.py (resolved thread)
dlt/destinations/sql_jobs.py (outdated, resolved thread)
ids=lambda x: x.name,
)
@pytest.mark.parametrize("simple", [True, False])
def test_child_table(destination_config: DestinationTestConfiguration, simple: bool) -> None:
Collaborator

i think we should test the child table functionality on all destinations, so either enable all dests here or add a child table to the basic test. the reason is, that we should run all sql code that we have against all destinations, just to make sure it works everywhere.

Collaborator, Author

Done in 11748a6

@sh-rp (Collaborator) left a comment

Very nice work, thanks! Only a few small changes and we still need the docs. Also I want to discuss two points with @rudolfix before we can merge.

@sh-rp (Collaborator) commented Apr 3, 2024

@rudolfix two questions on this PR:

  • Do you agree with the interface? We now have additional fields on the resource to configure this, as opposed to a setting on the destination, which is how we do it with the replace strategy. I think this makes sense, but it also looks a bit odd given that the replace strategy is set in a different way.
  • Right now the "cutoff date" is the created date of the load package. I am thinking we might want to make this overridable with a setting for certain cases, but maybe it is good like this now and we can add this feature when people request it.

@jorritsandbrink (Collaborator, Author) replied to the two questions above:

Note that this is why I chose a different interface than the one used to configure replace strategy:
[image]

@rudolfix (Collaborator) left a comment

this is really good! @jorritsandbrink congrats on plugging the work into our normalizer and hints perfectly.

I think we can do the user interface better by extending write_disposition instead of adding a new argument. @sh-rp exactly what we did for data contracts with shorthand and full notation - see my review

also: I trust you that SQL transformation is good and properly tested :)

dlt/destinations/sql_jobs.py (resolved thread)
dlt/destinations/sql_jobs.py (outdated, resolved thread)
@@ -65,6 +64,7 @@
]
"""Known hints of a column used to declare hint regexes."""
TWriteDisposition = Literal["skip", "append", "replace", "merge"]
TLoaderMergeStrategy = Literal["scd2"]
Collaborator

we should have a default strategy, "delete-insert", and we'll add one more merge strategy to support MERGE SQL statements (#1129)

Collaborator, Author

Added in 396ec59

@@ -297,6 +297,7 @@ def resource(
columns: TTableHintTemplate[TAnySchemaColumns] = None,
primary_key: TTableHintTemplate[TColumnNames] = None,
merge_key: TTableHintTemplate[TColumnNames] = None,
merge_config: TMergeConfig = None,
Collaborator

@jorritsandbrink @sh-rp maybe instead of doing this, we could extend write_disposition to

    write_disposition: TTableHintTemplate[Union[TWriteDisposition, TMergeConfig, TReplaceConfig]] = None

exactly what we do with schema_contract below? then we can support shorthand strings and full definitions, and still use the same parameter.
in that case I'd rename TMergeConfig to TMergeDispositionConfig
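
The shorthand versus full notation idea can be sketched as a small normalization step. Everything named here is hypothetical: `normalize_write_disposition` and the `"disposition"` dictionary key are illustrative assumptions, not the interface this PR ships. The default strategy and default column names are the ones mentioned elsewhere in this conversation.

```python
from typing import Any, Dict, Union

# Defaults named in the PR description and review comments.
DEFAULT_MERGE_STRATEGY = "delete-insert"
DEFAULT_VALIDITY_COLUMNS = ["_dlt_valid_from", "_dlt_valid_to"]


def normalize_write_disposition(value: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: expand a shorthand string into a full config dict."""
    # Shorthand: "append", "replace", "merge", ... becomes a one-key dict.
    config = {"disposition": value} if isinstance(value, str) else dict(value)
    if config["disposition"] == "merge":
        config.setdefault("strategy", DEFAULT_MERGE_STRATEGY)
        if config["strategy"] == "scd2":
            config.setdefault("validity_column_names", list(DEFAULT_VALIDITY_COLUMNS))
    return config
```

With this shape, a plain `"merge"` string and a dictionary that only sets the strategy both end up as complete configurations, so downstream code handles a single form.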

Collaborator

and we also handle the replace strategies in there? (not in the pr but later)? i think it's a good idea

Collaborator, Author

Extended write_disposition in 8f4d4ce and c8d84d8

@jorritsandbrink (Collaborator, Author) commented

@rudolfix I adjusted the user interface according to your suggestion and introduced a default delete-insert merge strategy. WDYT?

I'll add docs after we've settled on the user interface.

@jorritsandbrink requested a review from rudolfix April 7, 2024 23:13
@rudolfix (Collaborator) left a comment

this is good! I have two optimizations that IMO are OK to add.
thx for adding apply_hints tests!

@@ -296,10 +319,18 @@ def normalize_data_item(
row = cast(TDataItemRowRoot, item)
# identify load id if loaded data must be processed after loading incrementally
row["_dlt_load_id"] = load_id
# determine if row hash should be used as dlt id
Collaborator

could you precompute a list of all "scd2" tables in _reset method? this part of schema remains constant during normalization.
and this method is called for each normalized row. so it makes sense to optimize it

Collaborator, Author

Solved with caching as you suggested on Slack: 6b24378
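
One way such caching could look is sketched below. This is not the code from commit 6b24378: `SketchNormalizer`, its methods, and the JSON-based row hash are hypothetical. The sketch only relies on the `x-merge-strategy` table hint from the description and on the reviewer's point that the schema stays constant while rows are normalized.

```python
import hashlib
import json
from functools import lru_cache
from typing import Any, Dict


class SketchNormalizer:
    """Hypothetical normalizer that looks up the merge strategy once per table."""

    def __init__(self, tables: Dict[str, Dict[str, Any]]) -> None:
        self._tables = tables
        self.lookups = 0  # counts uncached schema lookups, for demonstration only
        # Per-instance cache keyed by table name.
        self._is_scd2 = lru_cache(maxsize=None)(self._lookup_is_scd2)

    def _lookup_is_scd2(self, table_name: str) -> bool:
        self.lookups += 1
        return self._tables.get(table_name, {}).get("x-merge-strategy") == "scd2"

    def normalize_row(self, table_name: str, row: Dict[str, Any]) -> Dict[str, Any]:
        out = dict(row)
        if self._is_scd2(table_name):  # cached after the first row of each table
            digest = hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()
            out["_dlt_id"] = digest  # row hash used as the dlt id (illustrative hash)
        return out
```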

def _create_table_schema(resource_hints: TResourceHints, resource_name: str) -> TTableSchema:
    """Creates table schema from resource hints and resource name."""

    dict_ = cast(Dict[str, Any], deepcopy(resource_hints))
Collaborator

why do we need to deepcopy hints? it was not copied in merge_keys. are you getting errors in tests?
it was already done in compute_table():

    # resolve a copy of a held template
    table_template = self._clone_hints(table_template)

note that deepcopy will also clone pydantic models and other things that were used as original column definition. and that may be quite costly
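
The difference between the two kinds of copy can be seen with a small standalone example. The `hints` dictionary is a made-up stand-in for resource hints, not the real `TResourceHints` structure.

```python
from copy import copy, deepcopy

# Hypothetical resource hints: a table name plus a nested column definition.
hints = {"name": "customers", "columns": {"id": {"data_type": "bigint"}}}

shallow = copy(hints)
deep = deepcopy(hints)

# The shallow copy shares the nested "columns" dict with the original.
shares_nested = shallow["columns"] is hints["columns"]
# The deep copy rebuilds every nested object, so nothing is shared.
deep_is_independent = deep["columns"] is not hints["columns"]
```

Because `deepcopy` walks the whole object graph, its cost grows with everything reachable from the hints, which is the concern raised above for column definitions backed by larger objects.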

Collaborator, Author

deepcopy was indeed obsolete—removed it in 1f399bc

rudolfix previously approved these changes Apr 11, 2024

@rudolfix (Collaborator) left a comment

LGTM!

oh dremio does not like ZULU timestamps in sql merge job... do you generate them for SCD2? I think Z timestamps are not really part of ISO

we have a dremio container in tests if you want to fix it
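
If a destination rejects the "Z" suffix, one possible workaround is to emit an explicit +00:00 offset instead. The helper below is a hypothetical sketch and not the fix applied in dlt; it only rewrites the suffix of an already formatted UTC timestamp string.

```python
from datetime import datetime, timezone


def replace_zulu_suffix(text: str) -> str:
    """Hypothetical helper: rewrite a trailing 'Z' as an explicit UTC offset."""
    return text[:-1] + "+00:00" if text.endswith("Z") else text


# Python's own isoformat() already uses the offset form for UTC datetimes.
example = datetime(2024, 4, 14, tzinfo=timezone.utc).isoformat()
```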

@rudolfix merged commit 05aa413 into devel Apr 14, 2024
45 of 46 checks passed
@rudolfix deleted the 828-scd2 branch April 14, 2024 11:40
Linked issue that may be closed by this pull request: SCD2 write disposition
5 participants