fix import schema yaml #1013
Conversation
✅ Deploy Preview for dlt-hub-docs ready!
Force-pushed from e79744f to f75b072
@rudolfix I think this got lost during the contracts merge.
@@ -143,6 +143,13 @@ def _maybe_import_schema(self, name: str, storage_schema: DictStrAny = None) ->
        assert rv_schema is not None
        return rv_schema

    def maybe_load_import_schema(self, name: str) -> Schema:
The return type hint will then need to be Optional[Schema].
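To illustrate the point of the review comment: a "maybe" loader that can come back empty forces an Optional return hint. This is a hypothetical sketch, not dlt's actual implementation; the registry and class body are stand-ins.

```python
from typing import Optional

class Schema:
    """Stand-in for dlt's Schema class (illustrative only)."""
    def __init__(self, name: str) -> None:
        self.name = name

# hypothetical registry of import schemas keyed by name
_IMPORT_SCHEMAS = {"github": Schema("github")}

def maybe_load_import_schema(name: str) -> Optional[Schema]:
    # a loader that can miss is exactly what forces Optional[Schema]
    return _IMPORT_SCHEMAS.get(name)
```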
@sh-rp I'm almost 100% sure I've seen this code before. Is this restoring the old code somehow? This part is tricky, but it was working correctly before. We need to review this together.
@rudolfix OK, here is what I think is going on: the import schema got lost during the transition to the load packages. The suggestion I made in my PR is that the import schema gets respected during extraction of a source, but only if no other schema is present yet. I think this makes the most sense and should give consistent behavior.
Force-pushed from 85b7c42 to b30d048
@rudolfix one additional comment: right now the import schema only gets used if there is no other schema present. I can change it to always be used, which somehow makes sense to me; we should get an error when trying to merge it if it conflicts with an existing schema, so that should be safe.
…ed schema when no exception. uses load_schema to import schema on extract
@sh-rp import schema is implemented at the level of storage. There's a built-in mechanism that:
- when loading a schema from storage, checks if an import schema exists
- loads the import schema if it was modified (it remembers the original imported schema hash)
- lets the schema in storage evolve otherwise

I think we lost the import schema a long time ago. You are right that normalize lost the ability to import schemas, but it looks to me like extract lost it even longer ago.
I restored that (hopefully) by using load_schema. I also did it in a way that changes introduced during extract stay in memory until extract ends and are committed in the decorator.
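The hash-based mechanism described above could be sketched roughly like this. This is a minimal illustrative model, not dlt's actual code: the key name `_imported_version_hash` and the helper names are assumptions.

```python
import hashlib
import json
from typing import Optional

def schema_hash(schema: dict) -> str:
    # stable content hash of a schema document
    return hashlib.sha256(json.dumps(schema, sort_keys=True).encode()).hexdigest()

def load_with_import(stored: dict, import_schema: Optional[dict]) -> dict:
    """Reload from the import schema only when it changed since it was last imported."""
    if import_schema is None:
        # no import schema: the storage schema evolves freely
        return stored
    if stored.get("_imported_version_hash") != schema_hash(import_schema):
        # import schema was modified: take it over and remember its hash
        fresh = dict(import_schema)
        fresh["_imported_version_hash"] = schema_hash(import_schema)
        return fresh
    # unchanged import schema: keep the evolved storage schema
    return stored
```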
The tests are amazing; there's not much to add. What we must test above all:
- only changes introduced in extract (i.e. coming from source and resource hints) are present in the import schema; normalize changes are not present there!
- a second run of the pipeline with other resource hints will not overwrite the import schema
- but the pipeline schema will evolve freely
- until someone changes the import schema, and then everything gets reset
Force-pushed from ddcd49e to c3c3e66
# drop all changes in live schemas
for name in list(self._schema_storage.live_schemas.keys()):
    try:
        schema = self._schema_storage.load_schema(name)
I am wondering if we can load schemas in parallel if there are more than N schemas?
@sultaniman what is the question exactly?
I mean the cases when there are dozens or even >100 schemas returned by self._schema_storage.live_schemas.keys(), and I was wondering if we should load them in parallel.
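The parallel loading the reviewer asks about could look like the sketch below. This is hypothetical: `load_schema` is a stand-in for `self._schema_storage.load_schema`, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

def load_schema(name: str) -> dict:
    # stand-in for self._schema_storage.load_schema (I/O-bound in reality)
    return {"name": name}

def load_all_schemas(names: List[str], max_workers: int = 8) -> Dict[str, dict]:
    # a thread pool helps for I/O-bound loads; for a handful of schemas
    # the plain sequential loop is simpler and likely just as fast
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs names correctly
        return dict(zip(names, pool.map(load_schema, names)))
```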
dlt/extract/extract.py
Outdated
elif pipeline.default_schema_name:
    schema_ = pipeline.schemas[pipeline.default_schema_name].clone()
    schema_ = Schema(pipeline.default_schema_name)
let's see if this passes
there's an issue with detecting changes in committed schemas, but somehow the tests are passing?
And we still clone the pipeline schema, but the comment says otherwise.
raise SchemaNotFoundError(name, f"live-schema://{name}")
try:
    stored_schema_json = self._load_schema_json(name)
    return live_schema.stored_version_hash == cast(
we must use version_hash here.
stored_version_hash is the hash of the last commit to storage; version_hash incorporates all the changes.
I wonder how the tests are passing. We must somehow bump the schema version (which should only happen when saving).
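The distinction between the two hashes can be demonstrated with a toy model. This is not dlt's actual class; it only models the behavior the reviewer describes: a hash frozen at save time versus a hash recomputed from current content.

```python
import hashlib

class LiveSchema:
    """Toy model of the two hashes (not dlt's actual Schema class)."""
    def __init__(self, content: str) -> None:
        self.content = content
        # hash frozen at the moment of the last save to storage
        self.stored_version_hash = self._hash()

    def _hash(self) -> str:
        return hashlib.sha256(self.content.encode()).hexdigest()

    @property
    def version_hash(self) -> str:
        # recomputed from current content, so it reflects unsaved changes
        return self._hash()
```

Comparing against `stored_version_hash` alone would report "unchanged" even after in-memory modifications, which is the bug the comment points at.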
dlt/extract/extract.py
Outdated
@@ -75,6 +75,7 @@ def choose_schema() -> Schema:
    """Except of explicitly passed schema, use a clone that will get discarded if extraction fails"""
    if schema:
        schema_ = schema
    # if we have a default schema name, use it but do not collect any info from the existing schema
I think this is a wrong comment? We still clone the pipeline schema.
ah yes, I had to revert this but forgot to remove the comment
raise SchemaNotFoundError(name, f"live-schema://{name}")
try:
    stored_schema_json = self._load_schema_json(name)
    return live_schema.version_hash == cast(str, stored_schema_json.get("version_hash"))
is using cast relevant here? I mean, we are not using the value for any type hints or runtime detection?
Description
Schemas referenced by the import_schema_path are completely ignored by the current dlt version, which is a bug. This PR makes sure these schemas get loaded if no other schema was discovered locally or in the destination.
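The precedence the description implies can be sketched as a simple fallback chain. The helper name and the dict-based schema representation are hypothetical, for illustration only.

```python
from typing import Optional

def discover_schema(
    local: Optional[dict],
    destination: Optional[dict],
    import_schema: Optional[dict],
) -> Optional[dict]:
    # the import schema is the fallback: it is used only when nothing
    # was discovered locally or in the destination
    return local or destination or import_schema
```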