-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
source and schema changes #769
Conversation
sh-rp
commented
Nov 16, 2023
•
edited
Loading
edited
- Keep ancestry information in schema as the 10 previous hashes
- Remove name from DLTSource initializer and properties.
✅ Deploy Preview for dlt-hub-docs canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we probably had some scope misunderstanding!
if you look at here #758
we should drop the name in DltSource
not in decorator. the point of the ticket is to use schema name as the source name. keeping the name in decorator which was enforcing the schema name made perfect sense... probably my fault with describing the scope?
you could also link the relevant issues to PR - the template has a section for that
thanks!
decorator should stay intact, you also do not need to change verified sources.
2437983
to
33a64fb
Compare
Ah alright, I was going to ask anyway because it did not seem useful to remove the decorator arg. I have updated the PR and it should be ready for review again @rudolfix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thx for the tests! I have a few questions and I'd like to rename ancestors to something else... then we merge this because it LGTM
dlt/common/schema/utils.py
Outdated
@@ -240,12 +246,18 @@ def compile_simple_regexes(r: Iterable[TSimpleRegex]) -> REPattern: | |||
|
|||
|
|||
def validate_stored_schema(stored_schema: TStoredSchema) -> None: | |||
# exclude validation of keys added later | |||
ignored_keys = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this is required? first we migrate the dictionary and then we validate the schema. so the engine should be always 7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in test_upgrade_engine_v1_schema many different schema versions are validated. we could alternatively change that test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
really I was sure we migrate and then validate. We should validate only the version after migration. So you can change the test.
but you can keep this filter_required
providing that you add tests for it
dlt/common/schema/schema.py
Outdated
@@ -37,6 +37,7 @@ class Schema: | |||
_dlt_tables_prefix: str | |||
_stored_version: int # version at load/creation time | |||
_stored_version_hash: str # version hash at load/creation time | |||
_stored_ancestors: Optional[List[str]] # list of ancestor hashes of the schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not ancestor. this is previous version hash. maybe let's name it like that. we do not have any schema derivation scheme like we have in pydantic models (ie. base schema etc.) this is just a version (probably revision would be better - but it is too late to change it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_previous_version_hashes or _previous_hashes would be better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
dlt/common/schema/utils.py
Outdated
# unshift previous hash to ancestors and limit array to 10 entries | ||
if previous_hash not in stored_schema["ancestors"]: | ||
stored_schema["ancestors"].insert(0, previous_hash) | ||
stored_schema["ancestors"] = stored_schema["ancestors"][:10] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we could have it in the contants so it is easy to find and change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean?
dlt/common/validation.py
Outdated
@@ -9,7 +9,7 @@ | |||
TCustomValidator = Callable[[str, str, Any, Any], bool] | |||
|
|||
|
|||
def validate_dict(spec: Type[_TypedDict], doc: StrAny, path: str, filter_f: TFilterFunc = None, validator_f: TCustomValidator = None) -> None: | |||
def validate_dict(spec: Type[_TypedDict], doc: StrAny, path: str, filter_f: TFilterFunc = None, validator_f: TCustomValidator = None, filter_required: TFilterFunc = None) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filter_required - what it does? you added new argument without docstrings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe rename to ignored_props_f
. And maybe it is time to consider Pydantic instead? honestly I wrote this code because pydantic handling of TypedDicts was weak (forgot the detaiuls already).
on the other hand I do not want Pydantic in core deps
dlt/extract/decorators.py
Outdated
# if generator, consume it immediately | ||
if inspect.isgenerator(rv): | ||
rv = list(rv) | ||
|
||
# convert to source | ||
s = _impl_cls.from_data(name, source_section, schema.clone(update_normalizers=True), rv) | ||
s = _impl_cls.from_data(source_section, schema.clone(update_normalizers=True), rv) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
dlt/extract/source.py
Outdated
@@ -642,22 +642,17 @@ class DltSource(Iterable[TDataItem]): | |||
* You can use a `run` method to load the data with a default instance of dlt pipeline. | |||
* You can get source read only state for the currently active Pipeline instance | |||
""" | |||
def __init__(self, name: str, section: str, schema: Schema, resources: Sequence[DltResource] = None) -> None: | |||
self.name = name | |||
def __init__(self, section: str, schema: Schema, resources: Sequence[DltResource] = None) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we swap schema with section? it feels more natural - schema is the most important arg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
dlt/extract/source.py
Outdated
if resources: | ||
self.resources.add(*resources) | ||
|
||
@classmethod | ||
def from_data(cls, name: str, section: str, schema: Schema, data: Any) -> Self: | ||
def from_data(cls, section: str, schema: Schema, data: Any) -> Self: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
# Conflicts: # dlt/common/schema/schema.py # dlt/common/schema/utils.py # dlt/pipeline/pipeline.py # tests/common/cases/schemas/eth/ethereum_schema_v7.yml # tests/common/schema/test_schema.py # tests/common/schema/test_versioning.py # tests/common/utils.py # tests/extract/test_incremental.py
fb9d731
to
0799197
Compare
dlt/common/validation.py
Outdated
@@ -9,7 +9,7 @@ | |||
TCustomValidator = Callable[[str, str, Any, Any], bool] | |||
|
|||
|
|||
def validate_dict(spec: Type[_TypedDict], doc: StrAny, path: str, filter_f: TFilterFunc = None, validator_f: TCustomValidator = None) -> None: | |||
def validate_dict(spec: Type[_TypedDict], doc: StrAny, path: str, filter_f: TFilterFunc = None, validator_f: TCustomValidator = None, filter_required: TFilterFunc = None) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe rename to ignored_props_f
. And maybe it is time to consider Pydantic instead? honestly I wrote this code because pydantic handling of TypedDicts was weak (forgot the detaiuls already).
on the other hand I do not want Pydantic in core deps
dlt/common/schema/utils.py
Outdated
@@ -240,12 +246,18 @@ def compile_simple_regexes(r: Iterable[TSimpleRegex]) -> REPattern: | |||
|
|||
|
|||
def validate_stored_schema(stored_schema: TStoredSchema) -> None: | |||
# exclude validation of keys added later | |||
ignored_keys = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
really I was sure we migrate and then validate. We should validate only the version after migration. So you can change the test.
but you can keep this filter_required
providing that you add tests for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
totally LGTM! thanks!