source and schema changes #769

Merged
merged 8 commits into from
Nov 21, 2023

Conversation

sh-rp
Collaborator

@sh-rp sh-rp commented Nov 16, 2023

  • Keep ancestry information in schema as the 10 previous hashes
  • Remove name from the DltSource initializer and properties.

netlify bot commented Nov 16, 2023

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit cb251e9
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/655cbdd62f7bcb0008e466d5

@sh-rp sh-rp changed the title from "D#/source and schema changes" to "/source and schema changes" Nov 18, 2023
@sh-rp sh-rp changed the title from "/source and schema changes" to "source and schema changes" Nov 18, 2023
@sh-rp sh-rp marked this pull request as ready for review November 18, 2023 08:50
Collaborator

@rudolfix rudolfix left a comment

we probably had a scope misunderstanding!
if you look at #758:
we should drop the name in DltSource, not in the decorator. The point of the ticket is to use the schema name as the source name. Keeping the name in the decorator, which enforced the schema name, made perfect sense... probably my fault in describing the scope?

you could also link the relevant issues to the PR - the template has a section for that

thanks!

The decorator should stay intact; you also do not need to change the verified sources.

@sh-rp sh-rp force-pushed the d#/source_and_schema_changes branch from 2437983 to 33a64fb Compare November 19, 2023 17:48
@sh-rp
Collaborator Author

sh-rp commented Nov 19, 2023

Ah alright, I was going to ask anyway because it did not seem useful to remove the decorator arg. I have updated the PR and it should be ready for review again @rudolfix

Collaborator

@rudolfix rudolfix left a comment

thx for the tests! I have a few questions and I'd like to rename ancestors to something else... then we merge this because it LGTM

@@ -240,12 +246,18 @@ def compile_simple_regexes(r: Iterable[TSimpleRegex]) -> REPattern:


def validate_stored_schema(stored_schema: TStoredSchema) -> None:
    # exclude validation of keys added later
    ignored_keys = []
Collaborator

Why is this required? First we migrate the dictionary and then we validate the schema, so the engine should always be 7.

Collaborator Author

in test_upgrade_engine_v1_schema many different schema versions are validated. we could alternatively change that test.

Collaborator

Really? I was sure we migrate and then validate. We should validate only the version after migration, so you can change the test.

But you can keep this filter_required provided that you add tests for it.
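
The two options weighed in this thread (validate only after migration, or relax the required-key check for older documents) can be sketched roughly as follows. This is a minimal illustration, not dlt's implementation: `REQUIRED_KEYS`, `validate_required`, `migrate`, and the key name `previous_hashes` are hypothetical stand-ins, and only the engine version 7 comes from the discussion.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical stand-ins for illustration; not dlt's real spec or helpers.
CURRENT_ENGINE = 7
REQUIRED_KEYS = ("name", "engine_version", "tables", "previous_hashes")

TFilterFunc = Callable[[str], bool]


def validate_required(doc: Dict[str, Any], filter_required: Optional[TFilterFunc] = None) -> None:
    """Raise if a required key is missing; keys rejected by the filter are skipped."""
    for key in REQUIRED_KEYS:
        if filter_required is not None and not filter_required(key):
            continue
        if key not in doc:
            raise ValueError(f"missing required key: {key}")


def migrate(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Bring an older document to the current engine, adding keys introduced later."""
    migrated = dict(doc)
    migrated.setdefault("previous_hashes", [])
    migrated["engine_version"] = CURRENT_ENGINE
    return migrated


old_doc = {"name": "eth", "engine_version": 6, "tables": {}}

# Option 1: migrate first, then validate strictly.
validate_required(migrate(old_doc))

# Option 2: validate the old document, ignoring keys that were added later.
validate_required(old_doc, filter_required=lambda k: k != "previous_hashes")
```

With option 1 the validator never needs to know about older engine versions, which is the argument made above for changing the test instead.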

@@ -37,6 +37,7 @@ class Schema:
      _dlt_tables_prefix: str
      _stored_version: int  # version at load/creation time
      _stored_version_hash: str  # version hash at load/creation time
+     _stored_ancestors: Optional[List[str]]  # list of ancestor hashes of the schema
Collaborator

This is not an ancestor. This is a previous version hash, so maybe let's name it like that. We do not have any schema derivation scheme like we have in pydantic models (i.e. base schema etc.); this is just a version (revision would probably be better, but it is too late to change it).

Collaborator

_previous_version_hashes or _previous_hashes would be better

Collaborator Author

done!

# unshift previous hash to ancestors and limit array to 10 entries
if previous_hash not in stored_schema["ancestors"]:
    stored_schema["ancestors"].insert(0, previous_hash)
    stored_schema["ancestors"] = stored_schema["ancestors"][:10]
Collaborator

maybe we could have it in the constants so it is easy to find and change?

Collaborator Author

I'm not sure what you mean?
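
One reading of the suggestion is that the literal `10` in the slice becomes a named module-level constant. A minimal sketch under that assumption: the names `MAX_PREVIOUS_HASHES` and `remember_previous_hash` are hypothetical, and the function returns a new list instead of mutating the stored schema in place as the snippet in the diff does.

```python
from typing import List

# Hypothetical constant name; the source only says "limit array to 10 entries".
MAX_PREVIOUS_HASHES = 10


def remember_previous_hash(previous_hashes: List[str], previous_hash: str) -> List[str]:
    """Return a new list with previous_hash at the front, capped at MAX_PREVIOUS_HASHES."""
    if previous_hash in previous_hashes:
        # Already recorded: keep the order, only enforce the cap.
        return list(previous_hashes)[:MAX_PREVIOUS_HASHES]
    return [previous_hash, *previous_hashes][:MAX_PREVIOUS_HASHES]


hashes: List[str] = []
for i in range(12):
    hashes = remember_previous_hash(hashes, f"hash-{i}")
```

After the loop only the ten most recent hashes remain, newest first, and changing the limit means editing a single line.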

@@ -9,7 +9,7 @@
TCustomValidator = Callable[[str, str, Any, Any], bool]


- def validate_dict(spec: Type[_TypedDict], doc: StrAny, path: str, filter_f: TFilterFunc = None, validator_f: TCustomValidator = None) -> None:
+ def validate_dict(spec: Type[_TypedDict], doc: StrAny, path: str, filter_f: TFilterFunc = None, validator_f: TCustomValidator = None, filter_required: TFilterFunc = None) -> None:
Collaborator

filter_required - what does it do? You added a new argument without a docstring.

Collaborator

maybe rename to ignored_props_f. And maybe it is time to consider Pydantic instead? Honestly, I wrote this code because Pydantic's handling of TypedDicts was weak (I forgot the details already).

On the other hand, I do not want Pydantic in core deps.

  # if generator, consume it immediately
  if inspect.isgenerator(rv):
      rv = list(rv)

  # convert to source
- s = _impl_cls.from_data(name, source_section, schema.clone(update_normalizers=True), rv)
+ s = _impl_cls.from_data(source_section, schema.clone(update_normalizers=True), rv)
Collaborator

👍

@@ -642,22 +642,17 @@ class DltSource(Iterable[TDataItem]):
      * You can use a `run` method to load the data with a default instance of dlt pipeline.
      * You can get source read only state for the currently active Pipeline instance
      """
-     def __init__(self, name: str, section: str, schema: Schema, resources: Sequence[DltResource] = None) -> None:
-         self.name = name
+     def __init__(self, section: str, schema: Schema, resources: Sequence[DltResource] = None) -> None:
Collaborator

could we swap schema with section? it feels more natural - schema is the most important arg

Collaborator Author

done

          if resources:
              self.resources.add(*resources)

      @classmethod
-     def from_data(cls, name: str, section: str, schema: Schema, data: Any) -> Self:
+     def from_data(cls, section: str, schema: Schema, data: Any) -> Self:
Collaborator

also here

Collaborator Author

done
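
Taken together, the two requests in this part of the review (drop `name`, put `schema` before `section`) suggest a shape like the sketch below. It is an assumption-laden illustration rather than the merged code: `SketchSchema` and `SketchSource` are hypothetical stand-ins for `Schema` and `DltSource`, and deriving `name` from the schema follows the stated goal of using the schema name as the source name.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Sequence


@dataclass
class SketchSchema:
    """Hypothetical stand-in for dlt's Schema; only the name matters here."""
    name: str


class SketchSource:
    """Hypothetical stand-in for DltSource after the change discussed above."""

    def __init__(self, schema: SketchSchema, section: str, resources: Optional[Sequence[Any]] = None) -> None:
        self.schema = schema
        self.section = section
        self.resources: List[Any] = list(resources or [])

    @property
    def name(self) -> str:
        # The source name is not stored separately; it follows the schema name.
        return self.schema.name

    @classmethod
    def from_data(cls, schema: SketchSchema, section: str, data: Any) -> "SketchSource":
        # Wrap a single item in a list so resources is always a sequence.
        items = data if isinstance(data, (list, tuple)) else [data]
        return cls(schema, section, items)


source = SketchSource.from_data(SketchSchema("github"), "sources.github", ["issues", "pulls"])
```

Because `name` is a read-only property here, renaming the schema is the only way to rename the source, so the two cannot drift apart.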

# Conflicts:
#	dlt/common/schema/schema.py
#	dlt/common/schema/utils.py
#	dlt/pipeline/pipeline.py
#	tests/common/cases/schemas/eth/ethereum_schema_v7.yml
#	tests/common/schema/test_schema.py
#	tests/common/schema/test_versioning.py
#	tests/common/utils.py
#	tests/extract/test_incremental.py
@sh-rp sh-rp force-pushed the d#/source_and_schema_changes branch from fb9d731 to 0799197 Compare November 21, 2023 11:26
Collaborator

@rudolfix rudolfix left a comment

totally LGTM! thanks!

@rudolfix rudolfix merged commit 28dbba6 into devel Nov 21, 2023
41 of 44 checks passed
@rudolfix rudolfix deleted the d#/source_and_schema_changes branch November 21, 2023 18:19