-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MRG: improve restart by optionally writing batched zipfiles #102
Conversation
@ctb ready for review @ccbaumler - better later than never, right? Does this fit your use cases? Do you have a recommendation for the # of files we should recommend in each batch? |
@@ -844,3 +869,337 @@ pub fn parse_params_str(params_strs: String) -> Result<Vec<Params>, String> { | |||
|
|||
Ok(unique_params.into_iter().collect()) | |||
} | |||
|
|||
// this should be replaced with branchwater's MultiCollection when it's ready |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe create an issue in branchwater repo? I didn't really design around re-use 😅 - you could also link to sourmash-bio/sourmash#3321.
so, ok, suggest:
- create issue in this repo
- link to sourmash issue consider how to support more flexible
Collection
inRevIndex
for external storage sourmash#3321
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, will do. Note, it doesn't really need to be replaced with MultiCollection
- just, if we eventually get MultiCollection
in sourmash core, it'd be good to just have one supported implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only skimmed, but looks great to me ;)
This PR introduces a new optional param
--batch-size
, which allows users to build smaller zipfiles withgbsketch
orurlsketch
. These zipfiles are populated sequentially, with all signatures associated withbatch_size
accessions (notbatch_size
signatures). Ifgbsketch
/urlsketch
fail, they can read any zipfiles that were finished in order to restart. Zip names will be generated from the--output
, so if output isoutput.zip
, batches will beoutput.1.zip
,output.2.zip
, etc. I'm not really sure whatbatch_size
to recommend, but I think the overhead is fairly low for creating new small zips -- the main issue will be if users later want to concatenate them into a single zip.Uses the changes from #101 to enable writing batched zipfiles as a way to improve restart.
batch_size
, we could write the regular*zip
file, with no.1
, etc.filter
ZipStorage::from_file
panics)Issue for later:
Fixes: