feat: resume archive extraction by skipping existing files #786

Open
wants to merge 7 commits into
base: main
from

Conversation

SiQube
Member

@SiQube SiQube commented Aug 24, 2024

resolves #488 eventually

currently, tox does not work locally for whatever reason.

TODO:

  • time vs old extraction
  • coverage

codecov bot commented Aug 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (a364c46) to head (3d39eeb).

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #786   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           74        74           
  Lines         3419      3439   +20     
  Branches       613       621    +8     
=========================================
+ Hits          3419      3439   +20     


@SiQube SiQube force-pushed the prevent-unnecessary-extraction branch 2 times, most recently from 071e0bc to c65939f Compare August 25, 2024 00:37
@SiQube
Member Author

SiQube commented Aug 25, 2024

using this script

from pathlib import Path
import pymovements as pm

pm.utils.archives.extract_archive(Path('gazebasevr.zip'))

and predownloaded gazebase-vr -- I executed this command

time python t.py

for this branch:

Extracting gazebasevr.zip to .

real	6m53,025s
user	0m31,852s
sys	0m8,898s

for current main:

Extracting gazebasevr.zip to .

real	7m12,829s
user	0m33,634s
sys	0m9,835s

Different workloads might affect performance, but this branch should not be much slower. (In the example above, the loop variant was even faster (?!?).)

@SiQube SiQube marked this pull request as ready for review August 25, 2024 00:48
@SiQube SiQube added the bug (Something isn't working) and enhancement (New feature or request) labels Aug 26, 2024
@SiQube SiQube force-pushed the prevent-unnecessary-extraction branch 2 times, most recently from 161b3ec to 265687f Compare September 30, 2024 06:24
Contributor

@dkrako dkrako left a comment

great, thanks a lot! that was what I had in mind when creating #488.

We should introduce a new argument to let the user decide whether to continue a previous extraction or to extract all members again. Something like continue: bool = True would already be sufficient.

@SiQube SiQube force-pushed the prevent-unnecessary-extraction branch 2 times, most recently from 3bd6e20 to 75b0b34 Compare November 17, 2024 20:08
@SiQube SiQube force-pushed the prevent-unnecessary-extraction branch from e8eec21 to 4f4f0ff Compare December 5, 2024 22:02
Contributor

@dkrako dkrako left a comment

Great, thanks a lot for your work! There are some small issues left to work out, but we're getting there

@@ -138,6 +138,7 @@ def _extract_tar(
source_path: Path,
destination_path: Path,
compression: str | None,
skip: bool = True,
Contributor

I'd say that skip is a bit confusing as a name. I suggested continue but that's a reserved keyword, so let's call it resume. this way it's already clear what the argument does without reading the docs.
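
To illustrate the naming point: a small check, using only the standard library, that `continue` cannot be used as a parameter name while `resume` can. The helper name `is_valid_parameter_name` is made up for this sketch and is not part of pymovements.

```python
import keyword


def is_valid_parameter_name(name: str) -> bool:
    """Return True if `name` can be used as a Python parameter name."""
    # Reserved keywords are valid identifiers lexically, but the parser
    # rejects them as names, so both conditions have to hold.
    return name.isidentifier() and not keyword.iskeyword(name)


print(is_valid_parameter_name('continue'))
print(is_valid_parameter_name('resume'))
```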

Member Author

done -- resume

if (
os.path.exists(os.path.join(destination_path, member)) and
member[-4:] not in _ARCHIVE_EXTRACTORS and
tarfile.TarInfo(os.path.join(destination_path, member)).size > 0 and
Contributor

this won't really check for correct size. it's just checking if the archive member's size is greater than zero.

you need to check that the size of the member is the same as the size of the already existing file. Otherwise a partially extracted file (i.e. the existing file is smaller than the archive member) will be skipped.
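
A minimal sketch of the size check described above, not the code from this PR: the member is skipped only if the file on disk has exactly the size recorded in the archive. The names `is_fully_extracted` and `extract_tar_resumable` are hypothetical, and passing `filter='data'` when available is an assumption made here to avoid the deprecation warning on newer Python versions.

```python
from __future__ import annotations

import tarfile
from pathlib import Path


def is_fully_extracted(member: tarfile.TarInfo, destination_path: Path) -> bool:
    """Return True if the member exists on disk with the archived size."""
    target = destination_path / member.name
    # A partially written file exists but is smaller than the archive member,
    # so checking for existence alone is not enough.
    return target.is_file() and target.stat().st_size == member.size


def extract_tar_resumable(
        source_path: Path,
        destination_path: Path,
        resume: bool = True,
) -> list[str]:
    """Extract a tar archive and return the names of the skipped members."""
    # Extraction filters only exist on newer Python versions.
    kwargs = {'filter': 'data'} if hasattr(tarfile, 'data_filter') else {}
    skipped = []
    with tarfile.open(source_path) as archive:
        for member in archive.getmembers():
            # Only regular files are size-checked; everything else is re-extracted.
            if resume and member.isfile() and is_fully_extracted(member, destination_path):
                skipped.append(member.name)
                continue
            archive.extract(member, destination_path, **kwargs)
    return skipped
```

Note that this compares sizes only; a file of the right size with corrupted content would still be skipped.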

Member Author

done

@dkrako dkrako removed the bug (Something isn't working) label Dec 10, 2024
@SiQube SiQube force-pushed the prevent-unnecessary-extraction branch from c04d614 to 1477ee3 Compare January 2, 2025 11:55
Contributor

@dkrako dkrako left a comment

Great! A few cosmetic creases to iron out and we're good to go

@@ -149,18 +157,37 @@ def _extract_tar(
Path to the directory the file will be extracted to.
compression: str | None
Compression filename suffix.
resume: bool
Resume if archive was already previously extracted.
verbose: int
Contributor

as it's basically just a boolean switch I would use bool instead of int. otherwise there's potential confusion about the effect of the verbosity level.

we can still "upgrade" to int in case we have different verbosity levels, as positive int is "backwards compatible" to bool.
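
The compatibility argument can be sketched as follows. `skip_message` is a hypothetical helper, not code from this PR; only the message text is taken from the diff. Since `bool` is a subclass of `int` and every positive integer is truthy, a plain `if verbose:` check behaves the same for `True`, `1` and `2`.

```python
from __future__ import annotations


def skip_message(member_name: str, verbose: bool = True) -> str | None:
    """Return the skip message if `verbose` is truthy, otherwise None."""
    if verbose:
        return f'Skipping {member_name} due to previous extraction'
    return None


# Callers that already pass an int keep working with a bool-typed switch,
# and the annotation can later be widened to int without changing behavior.
print(issubclass(bool, int))
print(skip_message('a.csv', verbose=True) == skip_message('a.csv', verbose=2))
print(skip_message('a.csv', verbose=0))
```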

else: # pragma: >=3.12 cover
archive.extractall(destination_path, filter='tar')
for member in tqdm(archive.getmembers()):
member_name = member.name
Contributor

I doubt these two lines increase readability. They are probably just a leftover that should be refactored away.


_ARCHIVE_EXTRACTORS: dict[str, Callable[[Path, Path, str | None], None]] = {
for member in tqdm(archive.filelist):
member_filename = member.filename
Contributor

same as above

@@ -172,13 +199,32 @@ def _extract_zip(
Path to the directory the file will be extracted to.
compression: str | None
Compression filename suffix.
resume: bool
Resume if archive was already previously extracted.
verbose: int
Contributor

same as above

print(f'Skipping {member_filename} due to previous extraction')
continue
archive.extract(member_filename, destination_path)
else:
Contributor

in contrast to the tar implementation, there is no need for an additional if-else

if verbose:
print(f'Skipping {member_filename} due to previous extraction')
continue
archive.extract(member_filename, destination_path)
Contributor

this line should be outside the if clause
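
One way the zip loop could look with the extract call placed after the skip check, so that no else branch is needed. This is a sketch under assumptions, not the code from this PR: the function name `extract_zip_resumable` and its return value are invented for illustration, and the skip condition compares the on-disk size with `ZipInfo.file_size`.

```python
from __future__ import annotations

import zipfile
from pathlib import Path


def extract_zip_resumable(
        source_path: Path,
        destination_path: Path,
        resume: bool = True,
        verbose: bool = True,
) -> list[str]:
    """Extract a zip archive and return the names of the skipped members."""
    skipped = []
    with zipfile.ZipFile(source_path) as archive:
        for member in archive.filelist:
            target = destination_path / member.filename
            # file_size is the uncompressed size stored in the archive.
            if resume and target.is_file() and target.stat().st_size == member.file_size:
                if verbose:
                    print(f'Skipping {member.filename} due to previous extraction')
                skipped.append(member.filename)
                continue
            # Reached for every member that was not skipped, so a single
            # extract call outside the if clause is enough.
            archive.extract(member, destination_path)
    return skipped
```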

_ARCHIVE_EXTRACTORS: dict[str, Callable[[Path, Path, str | None], None]] = {
for member in tqdm(archive.filelist):
member_filename = member.filename
member_dest_path = os.path.join(destination_path, member_filename)
Contributor

same as above

for member in tqdm(archive.getmembers()):
member_name = member.name
member_size = member.size
member_dest_path = os.path.join(destination_path, member_name)
Contributor

while you're at it, you can probably move this line into the following if-clause

@@ -42,6 +44,7 @@ def extract_archive(
remove_finished: bool = False,
remove_top_level: bool = True,
verbose: int = 1,
resume: bool = False,
Contributor

verbose should come after resume

Contributor

@dkrako dkrako Jan 8, 2025

which means this change would be a breaking change as verbose can be passed as a positional argument.

I strongly prefer verbose as the last argument. We should probably add a *, after the recursive parameter and enforce passing remove_finished, remove_top_level, resume and verbose as keyword arguments.

We can either introduce the breaking change in this PR or in a separate one. What do you think?

Also we should keep this issue in mind when enhancing signatures in the future. To be future-proof and backwards compatible it is necessary to keep the number of positional arguments to a comfortable minimum.
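
A sketch of the proposed keyword-only signature. The stub is named `extract_archive_sketch` to make clear it is not the library function; the defaults for `destination_path` and `recursive` are assumptions, while the other defaults are taken from the diff above. The bare `*` makes every parameter after it keyword-only, so new options can be added or reordered later without breaking positional callers.

```python
from __future__ import annotations

from pathlib import Path


def extract_archive_sketch(
        source_path: Path,
        destination_path: Path | None = None,  # assumed default
        recursive: bool = True,  # assumed default
        *,  # everything below must be passed by keyword
        remove_finished: bool = False,
        remove_top_level: bool = True,
        resume: bool = False,
        verbose: int = 1,
) -> dict[str, object]:
    """Stub that only echoes its options instead of extracting anything."""
    return {
        'remove_finished': remove_finished,
        'remove_top_level': remove_top_level,
        'resume': resume,
        'verbose': verbose,
    }


# Keyword usage works regardless of the order of the keyword-only parameters.
print(extract_archive_sketch(Path('gazebasevr.zip'), resume=True, verbose=0))

# A fourth positional argument is rejected with a TypeError.
try:
    extract_archive_sketch(Path('gazebasevr.zip'), None, True, False)
except TypeError as error:
    print(f'TypeError: {error}')
```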

Contributor

@dkrako dkrako left a comment

One additional thing I forgot:

Please include this functionality also in Dataset.extract() and Dataset.download()

@dkrako dkrako changed the title from "serial extract_archive to prevent unnecessary extractions" to "feat: resume unfinished extractions by skipping existing files" Jan 8, 2025
@@ -65,6 +68,8 @@ def extract_archive(
Verbosity levels: (1) Print messages for extracting each dataset resource without printing
messages for recursive archives. (2) Print additional messages for each recursive archive
extract. (default: 1)
resume: bool
Resume previous extraction. (default: True)
Contributor

Please add a bit more info, like this:

Resume previous extraction by skipping existing files. Checks for correct size of existing files but not integrity. (default: True)

@dkrako dkrako changed the title from "feat: resume unfinished extractions by skipping existing files" to "feat: resume archive extraction by skipping existing files" Jan 8, 2025
Labels
enhancement (New feature or request)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't extract dataset archives twice
2 participants