MRG: improve restart by optionally writing batched zipfiles #102

Merged
merged 48 commits into from
Oct 4, 2024

Collaborator

@bluegenes bluegenes commented Sep 29, 2024

This PR introduces a new optional parameter, --batch-size, which lets users build smaller zipfiles with gbsketch or urlsketch. The zipfiles are populated sequentially, and each one holds all signatures associated with batch_size accessions (not batch_size signatures). If gbsketch/urlsketch fails, it can read any zipfiles that were finished and restart from there. Zip names are generated from --output, so if the output is output.zip, the batches will be output.1.zip, output.2.zip, etc. I'm not really sure what batch_size to recommend, but I think the overhead of creating new small zips is fairly low; the main issue will be if users later want to concatenate them into a single zip.
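
To make the naming and batching rule concrete, here is a minimal sketch of how batch names could be derived from the --output path and how accessions could map to batches. The helpers `batch_path` and `batch_index` are hypothetical names used for illustration only; they are not taken from this PR's code, and the sketch assumes batch_size is greater than zero.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: derive the name of batch `index` from the `--output`
/// path, e.g. `output.zip` -> `output.1.zip`.
fn batch_path(output: &Path, index: usize) -> PathBuf {
    let stem = output.file_stem().and_then(|s| s.to_str()).unwrap_or("output");
    let ext = output.extension().and_then(|e| e.to_str()).unwrap_or("zip");
    // Keep the parent directory and swap in the numbered file name.
    output.with_file_name(format!("{stem}.{index}.{ext}"))
}

/// Hypothetical helper: 1-based batch number for the accession at position
/// `accession_idx` (0-based). Batches are counted in accessions, not in
/// signatures, so every signature for an accession lands in the same zip.
/// Assumes `batch_size > 0`.
fn batch_index(accession_idx: usize, batch_size: usize) -> usize {
    accession_idx / batch_size + 1
}

fn main() {
    let output = Path::new("output.zip");
    // Five accessions with a batch size of 2 fall into batches 1, 1, 2, 2, 3.
    for acc in 0..5 {
        let idx = batch_index(acc, 2);
        println!("accession {acc} -> {}", batch_path(output, idx).display());
    }
}
```

Counting by accession rather than by signature means a batch may hold more than batch_size signatures when several parameter sets are sketched per accession.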

Uses the changes from #101 to enable writing batched zipfiles as a way to improve restart.

  • make batch_size a user modifiable parameter
  • For cases where the total number of signatures is less than the batch_size, we could write the regular *zip file, with no .1, etc.
  • functions to enable reading from existing batched zips to allow restart
    • build a filename → paramset HashMap, and use it to filter the template sigs for each filename
  • add tests for batched zipfile writing, recovery from existing batches
  • move zip_writer creation inside writing loop to avoid empty final zip
  • check what happens if we have an unclosed zip (i.e. from unexpected failure)
    • sourmash panics on invalid zips (ZipStorage::from_file panics). Here I've caught the panic and ignored it, but it may ultimately be better to handle this and return an error at the sourmash level.
    • Note that we will likely have an invalid zip on any restart after a failure, because the zip file would not have been properly closed/finished.
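
The "catch the panic and ignore it" approach from the checklist can be sketched with the standard library's `std::panic::catch_unwind`. This is only an illustration under stated assumptions: `load_batch` is a made-up stand-in for a loader that panics on an unfinished zip (it is not the sourmash API), and `try_load_batch` is a hypothetical wrapper name. The stand-in treats a missing end-of-central-directory signature as the sign of a zip that was never closed.

```rust
use std::panic;

/// Stand-in for a loader that panics on an invalid zip. The real sourmash
/// call is not reproduced here; this only mimics "panics when the zip was
/// never finished" by looking for the end-of-central-directory signature.
fn load_batch(bytes: &[u8]) -> usize {
    let finished = bytes.windows(4).any(|w| w == b"PK\x05\x06");
    if !finished {
        panic!("invalid zip: missing end of central directory");
    }
    bytes.len()
}

/// Hypothetical wrapper: Some(size) for a readable batch, None when the
/// loader panicked. Requires the default `panic = "unwind"` setting.
fn try_load_batch(bytes: &[u8]) -> Option<usize> {
    panic::catch_unwind(|| load_batch(bytes)).ok()
}

fn main() {
    let finished: &[u8] = b"PK\x03\x04dataPK\x05\x06";
    let truncated: &[u8] = b"PK\x03\x04data";
    for (name, bytes) in [("output.1.zip", finished), ("output.2.zip", truncated)] {
        match try_load_batch(bytes) {
            Some(n) => println!("{name}: usable ({n} bytes)"),
            None => println!("{name}: unreadable, ignored on restart"),
        }
    }
}
```

The default panic hook still prints the panic message to stderr even when the panic is caught, which is one reason returning an error from the library would be cleaner.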

Issue for later:

Fixes:

Base automatically changed from improve-restart-batched-zip to main September 30, 2024 20:19
@bluegenes bluegenes changed the title WIP: batched zip WIP: improve restart by optionally writing batched zipfiles Oct 1, 2024
@bluegenes
Collaborator Author

@ctb ready for review

@ccbaumler - better later than never, right? Does this fit your use cases? Do you have a recommendation for the # of files we should recommend in each batch?

@bluegenes bluegenes changed the title WIP: improve restart by optionally writing batched zipfiles MRG: improve restart by optionally writing batched zipfiles Oct 1, 2024
@@ -844,3 +869,337 @@ pub fn parse_params_str(params_strs: String) -> Result<Vec<Params>, String> {

Ok(unique_params.into_iter().collect())
}

// this should be replaced with branchwater's MultiCollection when it's ready
Contributor

maybe create an issue in branchwater repo? I didn't really design around re-use 😅 - you could also link to sourmash-bio/sourmash#3321.

so, ok, suggest:

Collaborator Author

sure, will do. Note, it doesn't really need to be replaced with MultiCollection - just, if we eventually get MultiCollection in sourmash core, it'd be good to just have one supported implementation.

Contributor

@ctb ctb left a comment

only skimmed, but looks great to me ;)

@bluegenes bluegenes merged commit dd71f14 into main Oct 4, 2024
1 check passed
@bluegenes bluegenes deleted the batched-zip branch October 4, 2024 18:35
2 participants