SCD2 support #1168

Merged: 45 commits, Apr 14, 2024
720b115
format examples
Mar 31, 2024
115b4c9
add core functionality for scd2 merge strategy
Mar 31, 2024
37befbc
make scd2 validity column names configurable
Apr 1, 2024
7726d98
make alias descriptive
Apr 2, 2024
30bb2e0
add validity column name conflict checking
Apr 2, 2024
765d652
Merge branch 'devel' of https://github.com/dlt-hub/dlt into 828-scd2
Apr 7, 2024
8f4d4ce
extend write disposition with dictionary configuration option
Apr 7, 2024
396ec59
add default delete-insert merge strategy
Apr 7, 2024
c8d84d8
update write_disposition type hints
Apr 7, 2024
11748a6
extend tested destinations
Apr 7, 2024
e9c8f61
2nd time setup (#1202)
adrianbr Apr 9, 2024
1f399bc
remove obsolete deepcopy
Apr 9, 2024
c8f4173
Merge branch 'devel' of https://github.com/dlt-hub/dlt into 828-scd2
Apr 9, 2024
c99d612
Merge pull request #1200 from dlt-hub/devel
rudolfix Apr 9, 2024
0fa603b
add scd2 docs
Apr 9, 2024
c110aae
add write_disposition existence condition
Apr 9, 2024
55df900
add nullability hints to validity columns
Apr 9, 2024
6b24378
cache functions to limit schema lookups
Apr 10, 2024
4124d61
add row_hash_column_name config option
Apr 10, 2024
4236f20
default to default merge strategy
Apr 11, 2024
0e7f8c0
replace hardcoded column name with variable to fix test
Apr 11, 2024
93e7f45
fix doc snippets
Apr 11, 2024
36da1f2
compares records without order and with caps timestamps precision in …
rudolfix Apr 13, 2024
0d6919a
defines create load id, stores package state typed, allows package st…
rudolfix Apr 13, 2024
2195ebf
creates new package to normalize from extracted package so state is c…
rudolfix Apr 13, 2024
52f0d7b
bans direct pendulum import
rudolfix Apr 13, 2024
caa9ae7
uses timestamps with properly reduced precision in scd2
rudolfix Apr 13, 2024
64baf2d
selects newest state by load_id, not created_at. this will not affect…
rudolfix Apr 13, 2024
8cb24af
adds formating datetime literal to escape
rudolfix Apr 13, 2024
6039f1c
renames x-row-hash to x-row-version
rudolfix Apr 13, 2024
dee8e08
corrects json and pendulum imports
rudolfix Apr 13, 2024
e63ffe1
uses unique column in scd2 sql generation
rudolfix Apr 13, 2024
12bdf2b
renames arrow items literal
rudolfix Apr 13, 2024
c1614b9
adds limitations to docs
rudolfix Apr 13, 2024
a3f47fc
passes only complete columns to arrow normalize
rudolfix Apr 13, 2024
e1c53b8
renames mode to disposition
rudolfix Apr 13, 2024
c815792
Merge branch 'master' into 828-scd2
rudolfix Apr 13, 2024
900cf06
Merge branch 'devel' into 828-scd2
rudolfix Apr 13, 2024
6ed8000
saves parquet with timestamp precision corresponding to the destinati…
rudolfix Apr 13, 2024
4fde3cc
adds transform that computes hashes of tables
rudolfix Apr 13, 2024
7796b00
tests arrow/pandas + scd2
rudolfix Apr 13, 2024
feca9cd
allows scd2 columns to be added to arrow items
rudolfix Apr 13, 2024
e5c78fd
various renames
rudolfix Apr 13, 2024
d462a5b
uses generic caps when writing parquet if no destination context
rudolfix Apr 14, 2024
6944828
disables coercing timestamps in parquet arrow writer
rudolfix Apr 14, 2024
renames x-row-hash to x-row-version
rudolfix committed Apr 13, 2024
commit 6039f1cea136a46051431238b47f19154e11e0de
4 changes: 2 additions & 2 deletions dlt/common/normalizers/json/relational.py
@@ -1,6 +1,6 @@
 from functools import lru_cache
 from typing import Dict, List, Mapping, Optional, Sequence, Tuple, cast, TypedDict, Any
-from dlt.common import json
+from dlt.common.json import json
 from dlt.common.normalizers.exceptions import InvalidJsonNormalizer
 from dlt.common.normalizers.typing import TJSONNormalizer
 from dlt.common.normalizers.utils import generate_dlt_id, DLT_ID_LENGTH_BYTES
@@ -382,7 +382,7 @@ def _get_validity_column_names(schema: Schema, table_name: str) -> List[Optional
     @staticmethod
     @lru_cache(maxsize=None)
     def _dlt_id_is_row_hash(schema: Schema, table_name: str) -> bool:
-        return schema.get_table(table_name)["columns"].get("_dlt_id", dict()).get("x-row-hash", False)  # type: ignore[return-value]
+        return schema.get_table(table_name)["columns"].get("_dlt_id", dict()).get("x-row-version", False)  # type: ignore[return-value]

     @staticmethod
     def _validate_validity_column_names(
4 changes: 2 additions & 2 deletions dlt/extract/hints.py
@@ -1,6 +1,7 @@
 from copy import copy, deepcopy
 from typing import TypedDict, cast, Any, Optional, Dict

+from dlt.common import logger
 from dlt.common.schema.typing import (
     TColumnNames,
     TColumnProp,
@@ -14,7 +15,6 @@
     TTableFormat,
     TSchemaContract,
 )
-from dlt.common import logger
 from dlt.common.schema.utils import (
     DEFAULT_WRITE_DISPOSITION,
     DEFAULT_MERGE_STRATEGY,
@@ -452,7 +452,7 @@ def _merge_merge_disposition_dict(dict_: Dict[str, Any]) -> None:
                 dict_["columns"][hash_] = {
                     "name": hash_,
                     "nullable": False,
-                    "x-row-hash": True,
+                    "x-row-version": True,
                 }

     @staticmethod
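For reviewers, a minimal sketch of what the rename means for the hint lookup changed in `relational.py`. It assumes a table schema can be modelled as a plain nested dict; `TableSchema` and `dlt_id_is_row_version` are hypothetical names used only for illustration and are not part of dlt. The lookup chain mirrors the one in the diff: a missing `_dlt_id` column or a missing hint falls back to `False`.

```python
from typing import Any, Dict

# Hypothetical stand-in for a table schema: a plain dict, not dlt's Schema class.
TableSchema = Dict[str, Any]


def dlt_id_is_row_version(table: TableSchema) -> bool:
    """Return True if the `_dlt_id` column carries the `x-row-version` hint."""
    # Same chained .get() pattern as in the diff, applied to a plain dict.
    return bool(table["columns"].get("_dlt_id", {}).get("x-row-version", False))


# Column hint as written after this commit.
new_style: TableSchema = {
    "columns": {"_dlt_id": {"name": "_dlt_id", "nullable": False, "x-row-version": True}}
}
# Column hint as written before this commit.
old_style: TableSchema = {
    "columns": {"_dlt_id": {"name": "_dlt_id", "nullable": False, "x-row-hash": True}}
}

print(dlt_id_is_row_version(new_style))
print(dlt_id_is_row_version(old_style))
```

Under these assumptions, a table that still carries only the old `x-row-hash` key is no longer recognised by the lookup, so both sides of the rename (the writer in `hints.py` and the reader in `relational.py`) have to change together, as they do in this commit.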