
filesystem state sync #1184

Merged: 30 commits merged into devel from d#/filesystem_state_sync on Apr 17, 2024
Conversation

@sh-rp (Collaborator) commented Apr 3, 2024

Description

Implements support for syncing state and schema with a filesystem destination. Also saves the info to the loads table the way it is saved in other destinations. The saving of the state is done a bit differently, mostly to always have the same file format and to keep things simple. I'm not sure if extracting the state at such a late time creates any problems?

Notes:

  • This approach does not respect a custom layout for the dlt tables for now and always uses jsonl (I think this is fine). Maybe we could provide a configurable path for the dlt internal tables that the user can change, defaulting to _dlt/ (in which all the dlt table folders then live).
  • I touch an "init" file in each dlt table folder to differentiate between "table does not exist" and "table exists but no state found" (see the sketch after this list).
  • Should we gzip the schema and state files?
  • Don't upload files if they already exist (if the hash is the same, the contents should be the same)?
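
As a small illustration of the init-file note above, a minimal sketch (the function name and fs_client argument are placeholders for illustration, not the PR's actual API):

    import posixpath

    INIT_FILE_NAME = "init"

    def mark_table_initialized(fs_client, table_dir: str) -> None:
        # touch a zero-byte marker so "table folder exists but holds no state yet"
        # can be told apart from "table folder does not exist at all"
        fs_client.touch(posixpath.join(table_dir, INIT_FILE_NAME))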

ToDo:

  • Write a couple of filesystem-specific state tests. This may not be strictly necessary, but I had one instance where the schema was created too often, so a simple test for this would be good.

netlify bot commented Apr 3, 2024

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit abfc170
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/661fec5fd232f0000805326b
😎 Deploy Preview https://deploy-preview-1184--dlt-hub-docs.netlify.app

@sh-rp force-pushed the d#/filesystem_state_sync branch from 8791aaa to f6d5c9c on April 3, 2024 16:45
@sh-rp (Collaborator, Author) commented Apr 3, 2024

[Screenshot 2024-04-03 at 18:49:08: file layout produced by the example pipeline]
This is what you end up with when running the example pipeline currently included. You can always find the current state and current schema with a direct key, as well as all the previous versions via hash. I think this is the easiest and fastest way of doing this.

@rudolfix (Collaborator) left a comment:

also make sure that it works when state sync is disabled. otherwise it is really good!

    schema_name=self.schema.name, table_name=table_name
)
# dlt tables do not respect layout (for now)
if table_name in self.schema.dlt_table_names():
Collaborator:

I think it is totally fine and should stay like that (we need to document this)

#

def _write_to_json_file(self, filepath: str, data: DictStrAny) -> None:
    dirname = os.path.dirname(filepath)
Collaborator:
Don't use os.path, use posixpath; paths here are normalized by fsspec.
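
A minimal sketch of the suggested fix, assuming the surrounding method from the diff above (the makedirs call is an extra assumption on my part; the write_text call appears later in this PR):

    import posixpath
    from dlt.common import json

    def _write_to_json_file(self, filepath: str, data: DictStrAny) -> None:
        # fsspec normalizes every path to forward slashes, so posixpath
        # behaves correctly on all operating systems where os.path would not
        dirname = posixpath.dirname(filepath)
        self.fs_client.makedirs(dirname, exist_ok=True)
        self.fs_client.write_text(filepath, json.dumps(data), "utf-8")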


def get_stored_state(self, pipeline_name: str) -> Optional[StateInfo]:
    # raise if dir not initialized
    filepath = self._get_state_file_name(pipeline_name, "current")
Collaborator:
you need to reproduce the WHERE clause of other destinations.

  • must start with pipeline_name
  • find the one with the highest load_id and return it

Collaborator (Author):

This is what it does, no? I mean, it starts with the pipeline name, so you can look up the state by pipeline, and instead of looking for the highest load_id (which we shouldn't do anyway, since we should not rely on load ids being timestamps) it has this current marker. I have a screenshot above of the file layout this PR produces. This way we can avoid iterating through a list of files to find the newest one, which would eventually slow down on destinations with many loads; at least that would be my expectation.

Collaborator (Author):

changed this now.
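
For illustration, picking the newest state file by load_id could be sketched like this, assuming the {pipeline_name}__{load_id}__{hash}.jsonl naming used later in this PR (the helper itself is hypothetical, and it assumes load ids sort lexicographically in creation order):

    from typing import List, Optional

    def newest_state_file(files: List[str], pipeline_name: str) -> Optional[str]:
        # keep only state files belonging to this pipeline
        candidates = [f for f in files if f.split("/")[-1].startswith(pipeline_name + "__")]
        # the load_id is the second "__"-separated part; max() finds the newest
        return max(candidates, key=lambda f: f.split("__")[1], default=None)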

safe_hash = "".join(
[c for c in version_hash if re.match(r"\w", c)]
) # remove all special chars from hash
return f"{self.dataset_path}/{self.schema.version_table_name}/{self.schema.name}__{safe_hash}.jsonl"
Collaborator:
IMO the hash is enough. Also, it would be good to have the load_id.

Collaborator (Author):

I need the name in the filepath so I can find the right schema when looking for the newest version of a schema, so I will keep it.


def get_stored_schema(self) -> Optional[StorageSchemaInfo]:
    """Retrieves newest schema from destination storage"""
    return self.get_stored_schema_by_hash("current")
Collaborator:

same thing as in state: find the oldest load id

Collaborator (Author):

done (assuming you mean the newest load id :) )

@@ -0,0 +1,20 @@
import dlt
Collaborator:

keep those in some ignored folder ;)

Collaborator (Author):

Yeah, I have that also, but since I am working on two different machines I need to do this sometimes ;)

@sh-rp (Collaborator, Author) commented Apr 15, 2024

[Screenshot 2024-04-03 at 18:49:08] This is what you end up with when running the example pipeline currently included. You can always find the current state and current schema with a direct key as well as all the previous versions via hash. I think this is the easiest and fastest way of doing this.

This image is not quite up to date; there will now also be the init files in each folder.

self.fs_client.touch(posixpath.join(directory, INIT_FILE_NAME))

# write schema to destination
self.store_current_schema(load_id or "1")
Collaborator (Author):

When "sync_destination" is called, we are not inside the context of a load. I am not quite sure how to handle this case. I first just did not store the schema, but there is a test that verifies that there is a schema in the destination after "sync_destination" is called on a pipeline with nothing in the versions folder. Either we change the the tests or think of some default value. I am not sure..

Collaborator (Author):

We can also find the newest load_id for this schema present and increase it by one, but that also does not feel right..

Collaborator (Author):

Or we use the current timestamp; then it should be in line with the other destinations.

Collaborator (Author):

I have changed it to work like this now. For lineage purposes it would be interesting to also have the load_id (if available) in the file/table, but for now it is in line with the other destinations.
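
A rough sketch of the timestamp fallback described above (the method names are illustrative, not the PR's exact code):

    import pendulum

    def store_current_schema(self, load_id: str = None) -> None:
        # outside of a load there is no load_id; fall back to the current
        # timestamp so the stored value is in line with other destinations
        version_key = load_id or str(pendulum.now().timestamp())
        self._write_schema_file(version_key)  # hypothetical helper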

@sh-rp force-pushed the d#/filesystem_state_sync branch from c6a65f3 to e7e0192 on April 15, 2024 16:27
@sh-rp (Collaborator, Author) commented Apr 15, 2024

There is a lot of state and schema testing going on in test_job_client; I will think about whether we can adapt these tests to also run for the filesystem destinations.

sh-rp added 3 commits April 16, 2024 12:11
also fixes table count loading to work for all bucket destinations
@sh-rp (Collaborator, Author) commented Apr 16, 2024

@rudolfix I have added some more tests for the state sync. IMHO the best approach would be to standardize state loading and retrieval further and run all the standard SQL tests for the filesystem destination as well. Right now many standard SQL tests rely on executing SQL, so this would be a bigger refactor, probably including an extension of the "WithStateSync" interface. Additionally, the drop command will currently not work on the filesystem, since it also relies on the SQL client; but I don't think it needs to, if the destinations had an improved interface that both the SQL and filesystem destinations could implement. Maybe you could have a look at whether the coverage seems sufficient at this point, and we tackle the harmonization of (possibly all?) destinations when we also add the ability to route state sync to a separate place.

self.fs_client.write_text(filepath, json.dumps(data), "utf-8")

def _to_path_safe_string(self, s: str) -> str:
    return "".join([c for c in s if re.match(r"\w", c)]) if s else None
Contributor:

should we instead use c in string.ascii_letters?

In [2]: import string
In [3]: string.ascii_letters
Out[3]: 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

Collaborator (Author):

changed it to a hex string now
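
The hex variant might look roughly like this (a sketch; the exact code in the PR may differ):

    from typing import Optional

    def _to_path_safe_string(self, s: str) -> Optional[str]:
        # hex-encoding is reversible and collision-free, unlike stripping
        # special characters, and every output character is path-safe
        return s.encode("utf-8").hex() if s else None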

sultaniman previously approved these changes Apr 16, 2024
def _list_dlt_dir(self, dirname: str) -> Iterator[Tuple[str, List[str]]]:
    if not self.fs_client.exists(posixpath.join(dirname, INIT_FILE_NAME)):
        raise DestinationUndefinedEntity({"dir": dirname})
    for filepath in self.fs_client.listdir(dirname, detail=False):
Contributor:

Is listdir just an alias for fs_client.ls, or does it implement some additional behavior?

Collaborator:

when listing directories, always request a refresh!

all_files = self.fs_client.ls(truncate_dir, detail=False, refresh=True)

IMO this is the only listing command you should use. fsspec is a mess and this one is proven to work. Please replace all commands that list directories.

Collaborator (Author):

ok, done!

Collaborator (Author):

Maybe we could have our own fs client wrapper that only exposes the stuff we think is reliable.

Contributor:

+1 on having our own abstraction with things we really need
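
Sketching that idea (the class and its method names are invented here, not dlt API), the wrapper would expose only the listing call the review above considers reliable:

    from typing import List
    from fsspec import AbstractFileSystem

    class ReliableFsClient:
        """Hypothetical facade exposing only the fsspec calls we trust."""

        def __init__(self, fs: AbstractFileSystem) -> None:
            self._fs = fs

        def list_files(self, path: str) -> List[str]:
            # always bypass fsspec's directory cache, per the review note above
            return self._fs.ls(path, detail=False, refresh=True)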

"inserted_at": pendulum.now().isoformat(),
"schema_version_hash": self.schema.version_hash,
}
filepath = f"{self.dataset_path}/{self.schema.loads_table_name}/{self.schema.name}__{load_id}.jsonl"
Contributor:

Should we use path.join instead?


def _get_state_file_name(self, pipeline_name: str, version_hash: str, load_id: str) -> str:
    """gets full path for a state file for a given hash"""
    return f"{self.dataset_path}/{self.schema.state_table_name}/{pipeline_name}__{load_id}__{self._to_path_safe_string(version_hash)}.jsonl"
Contributor:

Should we use path.join here as well?

Collaborator:

this is only used to "encode" the hash. Why not convert the hash to hex? Still not perfect, but less hacky.
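
Combining both suggestions (posixpath.join plus a hex-encoded hash), the method could be sketched as:

    import posixpath

    def _get_state_file_name(self, pipeline_name: str, version_hash: str, load_id: str) -> str:
        """gets full path for a state file for a given hash"""
        safe_hash = version_hash.encode("utf-8").hex()  # hex instead of stripping chars
        return posixpath.join(
            self.dataset_path,
            self.schema.state_table_name,
            f"{pipeline_name}__{load_id}__{safe_hash}.jsonl",
        )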

@@ -155,7 +155,12 @@ def test_replace_write_disposition(layout: str, default_buckets_env: str) -> None:
for basedir, _dirs, files in client.fs_client.walk(
    client.dataset_path, detail=False, refresh=True
):
    # remove internal paths
    if "_dlt" in basedir or "init" in basedir:
@sultaniman (Contributor) commented Apr 16, 2024:

"init" maybe should be a reference to constant INIT_FILE_NAME

managed in the regular way by the final destination you have configured.

You will also notice `init` files being present in the root folder and the special `dlt` folders. In the absence of the concepts of schemas and tables
in blob storages and directories, `dlt` uses these special files to harmonize the behavior of the `filesystem` destination with the other implemented destinations.
Contributor:

is this much gpt 🙂 ?

Collaborator (Author):

I'm not sure what you mean.

@@ -19,6 +19,8 @@
from tests.common.utils import load_json_case
from tests.utils import ALL_TEST_DATA_ITEM_FORMATS, TestDataItemFormat, skip_if_not_active
from dlt.destinations.path_utils import create_path
from tests.load.pipeline.utils import destinations_configs, DestinationTestConfiguration
from tests.load.pipeline.utils import load_table_counts
Contributor:

these two imports can be merged

@rudolfix (Collaborator) left a comment:

LGTM! We can remove the mark. If we are running filesystem through the standard state tests, it should be good.






def state_resource(state: TPipelineState) -> DltResource:
    doc = dlt.mark.with_package_state(state_doc(state), LOAD_PACKAGE_STATE_KEY)
Collaborator:

OK! If you want, you can commit the pipeline state when we commit the package here, and you can decide if we remove this mark... I think the package state will also be available in dlt.resource, right? So you can always write to it. If so, you can remove the mark... sorry for the stupid idea ;>

@sh-rp force-pushed the d#/filesystem_state_sync branch from 2857a5c to cd4dd23 on April 17, 2024 10:17
# commit load packages
extract_step.commit_packages()
# commit load packages with state
extract_step.commit_packages(state)
Collaborator (Author):

The state only gets committed to the load package when it has changed now; this is how it is done for other destinations too. But I think we should probably always add the pipeline state to the load package and then check in the destination whether a state entry with this hash exists, right? Eventually (I think) all other destinations will also use this mechanism to store the state; this would make sense to me at least.
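
A sketch of the check described above (the "_version_hash" key and the helper signature are assumptions for illustration, not the PR's actual code):

    from typing import Any, Dict, Optional

    def commit_with_state_if_changed(
        extract_step: Any, state: Dict[str, Any], last_committed_hash: Optional[str]
    ) -> None:
        # attach the pipeline state to the load package only when its
        # version hash differs from the last committed one
        if state.get("_version_hash") != last_committed_hash:
            extract_step.commit_packages(state)
        else:
            extract_step.commit_packages()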

Collaborator:

Here we could add info on whether the version was bumped; it is easy to check pre- and post-bump and extract the state. But for now it is OK.

@sh-rp requested a review from rudolfix on April 17, 2024 10:27
@sh-rp marked this pull request as ready for review on April 17, 2024 10:27
@sh-rp (Collaborator, Author) commented Apr 17, 2024

Follow-up ticket for syncing state to a separate place: #1233

rudolfix previously approved these changes Apr 17, 2024
@rudolfix (Collaborator) left a comment:

LGTM!


# Conflicts:
#	docs/website/docs/dlt-ecosystem/destinations/filesystem.md
@sh-rp merged commit 02daa10 into devel on Apr 17, 2024
45 of 48 checks passed
@sh-rp deleted the d#/filesystem_state_sync branch on April 17, 2024 17:29
zem360 pushed a commit that referenced this pull request Apr 17, 2024
* clean some stuff

* first messy version of filesystem state sync

* clean up a bit

* fix bug in state sync

* enable state tests for all bucket providers

* do not store state to uninitialized dataset folders

* fix linter errors

* get current pipeline from pipeline context

* fix bug in filesystem table init

* update testing pipe

* move away from "current" file, rather iterator bucket path contents

* store pipeline state in load package state and send to filesystem destination from there

* fix tests for changed number of files in filesystem destination

* remove dev code

* create init file also to mark datasets

* fix tests to respect new init file
change filesystem to fallback, to old state loading when used as staging destination

* update filesystem docs

* fix incoming tests of placeholders

* small fixes

* adds some tests for filesystem state
also fixes table count loading to work for all bucket destinations

* fix test helper

* save schema with timestamp instead of load_id

* pr fixes and move pipeline state saving to committing of extracted packages

* ensure pipeline state is only saved to load package if it has changed

* adds missing state injection into state package

* fix athena iceberg locations

* fix google drive filesystem with missing argument