feat: resume archive extraction by skipping existing files #786

SiQube · 2024-08-24T22:10:58Z

resolves #488 eventually

currently, tox does not work locally for whatever reason.

TODO:

- time vs old extraction
- coverage

codecov · 2024-08-24T22:13:43Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (a364c46) to head (3d39eeb).

Additional details and impacted files

@@            Coverage Diff            @@
##              main      #786   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           74        74           
  Lines         3419      3439   +20     
  Branches       613       621    +8     
=========================================
+ Hits          3419      3439   +20


SiQube · 2024-08-25T00:48:40Z

Using this script:

from pathlib import Path
import pymovements as pm

pm.utils.archives.extract_archive(Path('gazebasevr.zip'))

and a pre-downloaded copy of GazeBase-VR, I ran this command:

time python t.py

for this branch:

Extracting gazebasevr.zip to .

real	6m53,025s
user	0m31,852s
sys	0m8,898s

for current main:

Extracting gazebasevr.zip to .

real	7m12,829s
user	0m33,634s
sys	0m9,835s

Different workloads might affect the performance, but the new variant should not be much slower. (In the example above, the loop variant was even faster?!)

dkrako

great, thanks a lot! that was what I had in mind when creating #488.

We should introduce a new argument that lets the user decide whether to continue a previous extraction or to extract all members again. Something like continue: bool = True would already be sufficient.

src/pymovements/utils/archives.py

dkrako

Great, thanks a lot for your work! There are some small issues left to work out, but we're getting there.

dkrako · 2024-12-08T13:10:57Z

src/pymovements/utils/archives.py

@@ -138,6 +138,7 @@ def _extract_tar(
        source_path: Path,
        destination_path: Path,
        compression: str | None,
+        skip: bool = True,


I'd say that skip is a bit confusing as a name. I suggested continue but that's a reserved keyword, so let's call it resume. this way it's already clear what the argument does without reading the docs.

done -- resume

dkrako · 2024-12-08T13:14:50Z

src/pymovements/utils/archives.py

+            if (
+                    os.path.exists(os.path.join(destination_path, member)) and
+                    member[-4:] not in _ARCHIVE_EXTRACTORS and
+                    tarfile.TarInfo(os.path.join(destination_path, member)).size > 0 and


this won't really check for correct size. it's just checking if the archive member's size is greater than zero.

you need to check that the size of the member is the same as the size of the already existing file. Otherwise a partially extracted file (i.e. the existing file is smaller than the archive member) will be skipped.
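
To illustrate the size comparison described above, here is a minimal sketch, not the code from this PR. The helper name `should_skip_member` is hypothetical, and the nested-archive check from the diff (`_ARCHIVE_EXTRACTORS`) is left out for brevity.

```python
import os
import tarfile


def should_skip_member(member: tarfile.TarInfo, destination_path: str) -> bool:
    """Return True if the member already exists on disk with the same size.

    Hypothetical helper: it only compares sizes and does not verify content.
    """
    target = os.path.join(destination_path, member.name)
    return (
        member.isfile()  # only regular files can be compared by size
        and os.path.isfile(target)
        # A partially extracted file is smaller than the archive member,
        # so it fails this check and gets extracted again.
        and os.path.getsize(target) == member.size
    )
```

A check of the form `size > 0` would also accept a truncated file, which is why the equality with `member.size` matters here.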


dkrako

Great! A few cosmetic creases to iron out and we're good to go

dkrako · 2025-01-08T11:13:01Z

src/pymovements/utils/archives.py

@@ -149,18 +157,37 @@ def _extract_tar(
        Path to the directory the file will be extracted to.
    compression: str | None
        Compression filename suffix.
+    resume: bool
+        Resume if archive was already previous extracted.
+    verbose: int


as it's basically just a boolean switch I would use bool instead of int. otherwise there's potential confusion about the effect of the verbosity level.

we can still "upgrade" to int in case we have different verbosity levels, as positive int is "backwards compatible" to bool.
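
A small sketch of why a later "upgrade" from bool to int would be compatible: in Python, `bool` is a subclass of `int`, and any positive int is truthy. The function `skip_message` is a hypothetical stand-in for the print logic in the diff.

```python
def skip_message(member_name: str, verbose: bool = True) -> str:
    """Return the message for a skipped member, or '' when output is disabled."""
    if verbose:
        return f'Skipping {member_name} due to previous extraction'
    return ''


# bool is a subclass of int, so True/False and 1/0 behave the same here.
print(isinstance(True, int))
# A positive verbosity level enables the message just like True does.
print(skip_message('a.csv', verbose=True) == skip_message('a.csv', verbose=2))
# Zero disables the message just like False does.
print(skip_message('a.csv', verbose=False) == skip_message('a.csv', verbose=0))
```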

dkrako · 2025-01-08T11:14:24Z

src/pymovements/utils/archives.py

-        else:  # pragma: >=3.12 cover
-            archive.extractall(destination_path, filter='tar')
+        for member in tqdm(archive.getmembers()):
+            member_name = member.name


I doubt these two lines increase readability. They are probably just a leftover that should be refactored away.

dkrako · 2025-01-08T11:18:08Z

src/pymovements/utils/archives.py

-
-_ARCHIVE_EXTRACTORS: dict[str, Callable[[Path, Path, str | None], None]] = {
+        for member in tqdm(archive.filelist):
+            member_filename = member.filename


same as above

dkrako · 2025-01-08T11:18:20Z

src/pymovements/utils/archives.py

@@ -172,13 +199,32 @@ def _extract_zip(
        Path to the directory the file will be extracted to.
    compression: str | None
        Compression filename suffix.
+    resume: bool
+        Resume if archive was already previous extracted.
+    verbose: int


same as above

dkrako · 2025-01-08T11:20:46Z

src/pymovements/utils/archives.py

+                        print(f'Skipping {member_filename} due to previous extraction')
+                    continue
+                archive.extract(member_filename, destination_path)
+            else:


in contrast to the tar implementation, there is no need for an additional if-else

dkrako · 2025-01-08T11:21:01Z

src/pymovements/utils/archives.py

+                    if verbose:
+                        print(f'Skipping {member_filename} due to previous extraction')
+                    continue
+                archive.extract(member_filename, destination_path)


this line should be outside the if clause
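
One way the zip loop could look with the extract call outside the if clause and no extra if-else, as a rough sketch only. The function name `extract_zip_resumable` and its return value are assumptions made for illustration; the progress bar and the nested-archive check from the diff are omitted.

```python
import os
import zipfile


def extract_zip_resumable(source_path, destination_path, resume=True, verbose=True):
    """Extract a zip archive, optionally skipping members that already exist.

    Hypothetical sketch: returns the names of the members that were extracted.
    """
    extracted = []
    with zipfile.ZipFile(source_path) as archive:
        for member in archive.infolist():
            target = os.path.join(destination_path, member.filename)
            if (
                resume
                and not member.is_dir()
                and os.path.isfile(target)
                and os.path.getsize(target) == member.file_size
            ):
                if verbose:
                    print(f'Skipping {member.filename} due to previous extraction')
                continue
            # Reached for every member that was not skipped above.
            archive.extract(member, destination_path)
            extracted.append(member.filename)
    return extracted
```

With `resume=False` the condition is never true, so every member is extracted and no separate branch is needed.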

dkrako · 2025-01-08T11:21:37Z

src/pymovements/utils/archives.py

-_ARCHIVE_EXTRACTORS: dict[str, Callable[[Path, Path, str | None], None]] = {
+        for member in tqdm(archive.filelist):
+            member_filename = member.filename
+            member_dest_path = os.path.join(destination_path, member_filename)


same as above

dkrako · 2025-01-08T11:22:36Z

src/pymovements/utils/archives.py

+        for member in tqdm(archive.getmembers()):
+            member_name = member.name
+            member_size = member.size
+            member_dest_path = os.path.join(destination_path, member_name)


while you're at it, you can probably move this line into the following if-clause

dkrako · 2025-01-08T11:29:47Z

src/pymovements/utils/archives.py

@@ -42,6 +44,7 @@ def extract_archive(
        remove_finished: bool = False,
        remove_top_level: bool = True,
        verbose: int = 1,
+        resume: bool = False,


verbose should come after resume

which means this change would be a breaking change as verbose can be passed as a positional argument.

I strongly prefer verbose as the last argument. We should probably add a *, after the recursive parameter and enforce passing remove_finished, remove_top_level, resume and verbose as keyword arguments.

We can either introduce the breaking change in this PR or in a separate one. What do you think?

Also we should keep this issue in mind when enhancing signatures in the future. To be future-proof and backwards compatible it is necessary to keep the number of positional arguments to a comfortable minimum.
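
A sketch of what the keyword-only signature could look like. This is a stub, not the library's function: the name `extract_archive_sketch`, the `destination_path` parameter, its default, and the default for `recursive` are assumptions; only the option names after the `*` are taken from the comment above.

```python
from __future__ import annotations

from pathlib import Path


def extract_archive_sketch(
        source_path: Path,
        destination_path: Path | None = None,  # assumed parameter and default
        recursive: bool = True,  # assumed default
        *,  # everything after this marker must be passed by keyword
        remove_finished: bool = False,
        remove_top_level: bool = True,
        resume: bool = False,
        verbose: int = 1,
) -> dict[str, object]:
    """Stub that only returns the options it received."""
    return {
        'remove_finished': remove_finished,
        'remove_top_level': remove_top_level,
        'resume': resume,
        'verbose': verbose,
    }


# Keyword usage keeps working when new options are added before verbose.
print(extract_archive_sketch(Path('gazebasevr.zip'), resume=True, verbose=0))
```

Passing a fourth positional argument raises a `TypeError`, so callers cannot come to depend on the order of the options.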

tests/unit/utils/archives_test.py

dkrako

One additional thing I forgot:

Please include this functionality also in Dataset.extract() and Dataset.download()

dkrako · 2025-01-08T11:54:47Z

src/pymovements/utils/archives.py

@@ -65,6 +68,8 @@ def extract_archive(
        Verbosity levels: (1) Print messages for extracting each dataset resource without printing
        messages for recursive archives. (2) Print additional messages for each recursive archive
        extract. (default: 1)
+    resume: bool
+        Resume previous extraction. (default: True)


Please add a bit more information, like this:

Resume previous extraction by skipping existing files. Checks for correct size of existing files but not integrity. (default: True)

SiQube force-pushed the prevent-unnecessary-extraction branch 2 times, most recently from 071e0bc to c65939f Compare August 25, 2024 00:37

SiQube marked this pull request as ready for review August 25, 2024 00:48

SiQube requested review from dkrako and prassepaul as code owners August 25, 2024 00:48

SiQube added the bug and enhancement labels Aug 26, 2024

SiQube force-pushed the prevent-unnecessary-extraction branch 2 times, most recently from 161b3ec to 265687f Compare September 30, 2024 06:24

dkrako requested changes Oct 25, 2024


SiQube force-pushed the prevent-unnecessary-extraction branch 2 times, most recently from 3bd6e20 to 75b0b34 Compare November 17, 2024 20:08

SiQube force-pushed the prevent-unnecessary-extraction branch from e8eec21 to 4f4f0ff Compare December 5, 2024 22:02

dkrako requested changes Dec 8, 2024


dkrako removed the bug label Dec 10, 2024

SiQube added 6 commits January 2, 2025 12:55

serial extract_archive to prevent unnecessary extractions

f957049

add tests

bee6f0b

test for size of the file before skipping

b74619f

add skip argument to extract_archive

617a45e

test for actual size and not only > 0

c06ded8

add verbosity and tests

1477ee3

SiQube force-pushed the prevent-unnecessary-extraction branch from c04d614 to 1477ee3 Compare January 2, 2025 11:55

[pre-commit.ci] auto fixes from pre-commit.com hooks

3d39eeb

for more information, see https://pre-commit.ci

dkrako requested changes Jan 8, 2025


dkrako changed the title ~~serial extract_archive to prevent unnecessary extractions~~ feat: resume unfinished extractions by skipping existing files Jan 8, 2025

dkrako requested changes Jan 8, 2025


dkrako changed the title ~~feat: resume unfinished extractions by skipping existing files~~ feat: resume archive extraction by skipping existing files Jan 8, 2025
