feat: filesystem delete old pipeline state files #1838

Merged
merged 16 commits into from
Sep 25, 2024

Conversation

donotpush

Description

Keep only the latest 100 pipeline state files

Related Issues


netlify bot commented Sep 18, 2024

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit 06bf3f8
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/66f2b5932e7d3c0009285bdc
😎 Deploy Preview https://deploy-preview-1838--dlt-hub-docs.netlify.app

@donotpush donotpush requested a review from sh-rp September 18, 2024 19:30
@donotpush donotpush marked this pull request as ready for review September 19, 2024 00:17
@donotpush donotpush requested a review from rudolfix September 19, 2024 12:12
@@ -520,6 +520,31 @@ def _get_state_file_name(self, pipeline_name: str, version_hash: str, load_id: s
f"{pipeline_name}{FILENAME_SEPARATOR}{load_id}{FILENAME_SEPARATOR}{self._to_path_safe_string(version_hash)}.jsonl",
)

def _cleanup_pipeline_states(self, pipeline_name: str) -> None:
In the ticket it says that we should make sure to not delete state files attached to failed loads, but we are not saving state on failed loads, so we should be good here.

@donotpush donotpush Sep 20, 2024

Yes, indeed. @rudolfix specified:

delete only the state files that correspond to finished loads (they have a corresponding completed entry).
this is to prevent a rare case when we have 100 unsuccessful partial loads and we delete the last right state

Will partial loads store a state file? If not, we can keep the code as it is.

@donotpush

@sh-rp I have implemented the requested changes and added additional tests. I have also documented these tests.

The only divergence from your comments concerns the use of None for the integer max_state_files. I encountered some errors with None, so I decided to disable the cleanup process when max_state_files is set to 0 or a negative value. This change is documented in the codebase.
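
For readers following the thread, the retention rule described above could be sketched roughly as below. This is not the code merged in this PR: the function name `select_state_files_to_delete`, the `"__"` separator value, and the sample file names are assumptions made for illustration. Only the file name layout (`pipeline_name`, `load_id`, hash) and the "0 or negative disables cleanup" behaviour come from the discussion.

```python
from typing import List

# Assumption: the real value of FILENAME_SEPARATOR is not shown in this thread.
FILENAME_SEPARATOR = "__"


def select_state_files_to_delete(filenames: List[str], max_state_files: int) -> List[str]:
    """Return the state files that fall outside the newest `max_state_files`."""
    # Zero or a negative value disables cleanup entirely.
    if max_state_files <= 0:
        return []

    def load_id_of(name: str) -> float:
        # File names look like {pipeline_name}{SEP}{load_id}{SEP}{hash}.jsonl
        # Assumption: load ids are numeric, so they can be ordered as floats.
        return float(name.split(FILENAME_SEPARATOR)[1])

    # Newest first, so everything past the first max_state_files entries is old.
    newest_first = sorted(filenames, key=load_id_of, reverse=True)
    return newest_first[max_state_files:]


# Hypothetical file names for a pipeline called "my_pipeline".
files = [f"my_pipeline__{1726000000 + i}.5__abc123.jsonl" for i in range(5)]
to_delete = select_state_files_to_delete(files, max_state_files=2)
```

With five files and `max_state_files=2`, the three oldest files are selected for deletion. The sketch assumes the pipeline name itself does not contain the separator.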

@donotpush donotpush requested a review from sh-rp September 20, 2024 14:54
@@ -476,7 +476,9 @@ def _to_path_safe_string(self, s: str) -> str:
"""for base64 strings"""
return base64.b64decode(s).hex() if s else None

- def _list_dlt_table_files(self, table_name: str) -> Iterator[Tuple[str, List[str]]]:
+ def _list_dlt_table_files(
You have changed the behavior of this function but not updated the other places where it is used. It will no longer list only the files of the current pipeline by default, so please double check that there are no surprising side effects :)

The function's behaviour hasn't changed for the other call sites because pipeline_name defaults to None.

The condition is always True when pipeline_name is None, so all files are returned without filtering. The functions _iter_stored_schema_files and _list_dlt_table_files aren't affected by it.

# Filters only if pipeline_name provided
if pipeline_name is None or fileparts[0] == pipeline_name:
    yield filepath, fileparts
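
A self-contained sketch of this optional filter, for illustration only: `list_table_files`, the `"__"` separator value, and the sample paths are assumptions, and the real method lists files from the filesystem client rather than from a list passed in.

```python
import os
from typing import Iterable, Iterator, List, Optional, Tuple

# Assumption: the real value of FILENAME_SEPARATOR is not shown in this thread.
FILENAME_SEPARATOR = "__"


def list_table_files(
    filepaths: Iterable[str], pipeline_name: Optional[str] = None
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (filepath, fileparts), optionally restricted to one pipeline."""
    for filepath in filepaths:
        # Drop the directory and the extension, then split the name into parts.
        filename = os.path.splitext(os.path.basename(filepath))[0]
        fileparts = filename.split(FILENAME_SEPARATOR)
        if len(fileparts) != 3:
            continue  # not a {name}{SEP}{load_id}{SEP}{hash} file
        # Filters only if pipeline_name is provided.
        if pipeline_name is None or fileparts[0] == pipeline_name:
            yield filepath, fileparts


# Hypothetical paths for two pipelines plus one unrelated marker file.
paths = [
    "_dlt_pipeline_state/pipe_a__1.0__aa.jsonl",
    "_dlt_pipeline_state/pipe_b__2.0__bb.jsonl",
    "_dlt_pipeline_state/init",
]
all_files = list(list_table_files(paths))
only_a = list(list_table_files(paths, pipeline_name="pipe_a"))
```

Calling it without `pipeline_name` returns both pipelines' files, which matches the point made above: callers that do not pass the argument see no filtering.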

@sh-rp sh-rp left a comment

Looks very good, just one concern mentioned above. Please also update the filesystem docs. Note that this is a breaking change, as old state files will be deleted automatically after this merge.

@sh-rp sh-rp added the breaking This issue introduces breaking change label Sep 24, 2024
@donotpush donotpush requested a review from sh-rp September 24, 2024 11:58
@sh-rp sh-rp merged commit 3aadc32 into devel Sep 25, 2024
61 checks passed
@sh-rp sh-rp deleted the feat/1657-filesystem-clean-state-files branch September 25, 2024 11:13
Labels
breaking This issue introduces breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

compact dlt pipeline state table in filesystem destination
2 participants