feat: resume archive extraction by skipping existing files #786
base: main
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@           Coverage Diff            @@
##              main      #786   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           74        74
  Lines         3419      3439   +20
  Branches       613       621    +8
=========================================
+ Hits          3419      3439   +20
071e0bc to c65939f
Using this script

    from pathlib import Path
    import pymovements as pm
    pm.utils.archives.extract_archive(Path('gazebasevr.zip'))

and a pre-downloaded gazebase-vr archive, I executed the command time python t.py.

For this branch:

    Extracting gazebasevr.zip to .
    real 6m53,025s
    user 0m31,852s
    sys 0m8,898s

For current main:

    Extracting gazebasevr.zip to .
    real 7m12,829s
    user 0m33,634s
    sys 0m9,835s

Different workloads might affect the performance, but the loop variant should not be much slower. (In the example above the loop variant was even faster (?!?))
161b3ec to 265687f
Great, thanks a lot! That was what I had in mind when creating #488.
We should introduce a new argument to let the user decide whether to continue a previous extraction or to extract all members again. Something like continue: bool = True would already be sufficient.
3bd6e20 to 75b0b34
e8eec21 to 4f4f0ff
Great, thanks a lot for your work! There are some small issues left to work out, but we're getting there
src/pymovements/utils/archives.py
Outdated
@@ -138,6 +138,7 @@ def _extract_tar(
source_path: Path,
destination_path: Path,
compression: str | None,
skip: bool = True,
I'd say that skip is a bit confusing as a name. I suggested continue, but that's a reserved keyword, so let's call it resume. This way it's already clear what the argument does without reading the docs.
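The naming constraint mentioned above can be confirmed with the standard library's keyword module. This is only an illustrative check and not part of the PR; the helper name is_valid_parameter_name is made up for this sketch.

```python
import keyword


def is_valid_parameter_name(name: str) -> bool:
    """Return True if `name` can be used as a parameter name (hypothetical helper)."""
    # A usable name must be a syntactically valid identifier and not a reserved keyword.
    return name.isidentifier() and not keyword.iskeyword(name)


# `continue` is reserved, so `def f(continue=True)` is a syntax error.
print(is_valid_parameter_name('continue'))
# `resume` and `skip` are ordinary identifiers.
print(is_valid_parameter_name('resume'))
print(is_valid_parameter_name('skip'))
```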
done -- resume
src/pymovements/utils/archives.py
Outdated
if (
os.path.exists(os.path.join(destination_path, member)) and
member[-4:] not in _ARCHIVE_EXTRACTORS and
tarfile.TarInfo(os.path.join(destination_path, member)).size > 0 and
This won't really check for the correct size. It's just checking whether the archive member's size is greater than zero.
You need to check that the size of the member is the same as the size of the already existing file. Otherwise a partially extracted file (i.e. the existing file is smaller than the archive member) will be skipped.
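A minimal sketch of the size comparison requested here, assuming the member is a tarfile.TarInfo obtained from the archive. The function name is_fully_extracted is hypothetical and not taken from the PR. Note also that constructing a fresh tarfile.TarInfo(name) reads nothing from disk; such an object starts with size 0.

```python
import os
import tarfile


def is_fully_extracted(member: tarfile.TarInfo, destination_path: str) -> bool:
    """Return True if the member exists on disk with the size recorded in the archive.

    Hypothetical helper for illustration; it compares sizes only, not content.
    """
    member_dest_path = os.path.join(destination_path, member.name)
    return (
        os.path.isfile(member_dest_path)
        # A partially extracted file is smaller than the archived member,
        # so an equality check catches it where `size > 0` would not.
        and os.path.getsize(member_dest_path) == member.size
    )
```

As the later docstring suggestion in this review also points out, matching sizes say nothing about integrity; a checksum would be needed for that.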
done
c04d614 to 1477ee3
for more information, see https://pre-commit.ci
Great! A few cosmetic creases to iron out and we're good to go
@@ -149,18 +157,37 @@ def _extract_tar(
Path to the directory the file will be extracted to.
compression: str | None
Compression filename suffix.
resume: bool
Resume if archive was already previous extracted.
verbose: int
As it's basically just a boolean switch, I would use bool instead of int. Otherwise there's potential confusion about the effect of the verbosity level.
We can still "upgrade" to int in case we have different verbosity levels, as a positive int is "backwards compatible" with bool.
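The compatibility argument rests on bool being a subclass of int in Python, so a flag typed as bool today can be widened to int later without breaking callers. A small sketch, with skip_message as a hypothetical function that is not part of the PR:

```python
def skip_message(name: str, verbose: bool = True) -> str:
    """Return the message that would be printed, or '' when verbose is off."""
    if verbose:
        return f'Skipping {name} due to previous extraction'
    return ''


# bool is a subclass of int, and True == 1, False == 0.
print(issubclass(bool, int))

# Callers passing True/False and callers passing 1/0 get the same behaviour,
# which is what makes a later switch to integer verbosity levels non-breaking.
print(skip_message('a.csv', True) == skip_message('a.csv', 1))
print(skip_message('a.csv', False) == skip_message('a.csv', 0))
```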
else: # pragma: >=3.12 cover
archive.extractall(destination_path, filter='tar')
for member in tqdm(archive.getmembers()):
member_name = member.name
I doubt these two lines increase readability. They are probably just a leftover which should be refactored away.
_ARCHIVE_EXTRACTORS: dict[str, Callable[[Path, Path, str | None], None]] = {
for member in tqdm(archive.filelist):
member_filename = member.filename
same as above
@@ -172,13 +199,32 @@ def _extract_zip(
Path to the directory the file will be extracted to.
compression: str | None
Compression filename suffix.
resume: bool
Resume if archive was already previous extracted.
verbose: int
same as above
print(f'Skipping {member_filename} due to previous extraction')
continue
archive.extract(member_filename, destination_path)
else:
In contrast to the tar implementation, there is no need for an additional if-else here.
if verbose:
print(f'Skipping {member_filename} due to previous extraction')
continue
archive.extract(member_filename, destination_path)
This line should be outside the if clause.
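One way the zip loop could look once the extract call sits outside the if clause and no extra if-else is needed. This is a sketch under assumptions, not the PR's final code: the function name extract_zip_resumable and the returned list of skipped members are made up for illustration, and the tqdm progress bar and the check for nested archives from the diff are left out.

```python
import os
import zipfile


def extract_zip_resumable(source_path, destination_path, resume=True, verbose=False):
    """Extract a zip archive, optionally skipping members that already exist in full.

    Returns the names of the skipped members (illustrative addition).
    """
    skipped = []
    with zipfile.ZipFile(source_path) as archive:
        for member in archive.infolist():
            member_dest_path = os.path.join(destination_path, member.filename)
            if (
                resume
                and os.path.isfile(member_dest_path)
                # file_size is the uncompressed size recorded in the archive.
                and os.path.getsize(member_dest_path) == member.file_size
            ):
                if verbose:
                    print(f'Skipping {member.filename} due to previous extraction')
                skipped.append(member.filename)
                continue
            # Reached for every member that was not skipped, with or without resume.
            archive.extract(member, destination_path)
    return skipped
```

With resume=False the condition is never true, so every member is extracted again by the same single extract call.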
_ARCHIVE_EXTRACTORS: dict[str, Callable[[Path, Path, str | None], None]] = {
for member in tqdm(archive.filelist):
member_filename = member.filename
member_dest_path = os.path.join(destination_path, member_filename)
same as above
for member in tqdm(archive.getmembers()):
member_name = member.name
member_size = member.size
member_dest_path = os.path.join(destination_path, member_name)
While you're at it, you can probably move this line into the following if-clause.
@@ -42,6 +44,7 @@ def extract_archive(
remove_finished: bool = False,
remove_top_level: bool = True,
verbose: int = 1,
resume: bool = False,
verbose should come after resume
Which means this change would be a breaking change, as verbose can be passed as a positional argument.
I strongly prefer verbose as the last argument. We should probably add a *, after the recursive parameter and enforce passing remove_finished, remove_top_level, resume and verbose as keyword arguments.
We can either introduce the breaking change in this PR or in a separate one. What do you think?
Also, we should keep this issue in mind when enhancing signatures in the future. To be future-proof and backwards compatible, it is necessary to keep the number of positional arguments to a comfortable minimum.
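To illustrate the proposal, here is a stand-in with a bare * after recursive. The function extract_archive_stub is hypothetical; the parameter names follow this thread, but the defaults for the positional parameters are assumptions and the body only echoes its arguments instead of extracting anything.

```python
def extract_archive_stub(
        source_path,
        destination_path=None,
        recursive=True,
        *,  # everything after this marker must be passed by keyword
        remove_finished=False,
        remove_top_level=True,
        resume=False,
        verbose=1,
):
    """Hypothetical stand-in that returns its keyword-only arguments."""
    return {
        'remove_finished': remove_finished,
        'remove_top_level': remove_top_level,
        'resume': resume,
        'verbose': verbose,
    }


# Keyword arguments work in any order, so new parameters can be added
# anywhere after the marker without breaking existing calls.
print(extract_archive_stub('data.zip', verbose=2, resume=True))

# A fourth positional argument is rejected with a TypeError.
try:
    extract_archive_stub('data.zip', None, True, False)
except TypeError as error:
    print(f'TypeError: {error}')
```

The cost is the one-time breaking change discussed above: existing callers that pass remove_finished, remove_top_level or verbose positionally would have to switch to keywords.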
One additional thing I forgot:
Please include this functionality also in Dataset.extract() and Dataset.download()
@@ -65,6 +68,8 @@ def extract_archive(
Verbosity levels: (1) Print messages for extracting each dataset resource without printing
messages for recursive archives. (2) Print additional messages for each recursive archive
extract. (default: 1)
resume: bool
Resume previous extraction. (default: True)
Please add a bit more info, like this:
Resume previous extraction by skipping existing files. Checks for correct size of existing files but not integrity. (default: True)
resolves #488 eventually
currently, tox does not work locally for whatever reason.
TODO: