WIP: debug multisigfile test #445
#434)
* preliminary victory
* compiles and mostly runs
* cleanup, split to new module
* cleanup and comment
* more cleanup of diff
* cargo fmt
* fix fmt
* restore n_failed
* comment failing test
* cleanup and de-vec
* create module/submodule structure
* comment for later
* get rid of vec
* beg for help
* cleanup and doc
Dear journal, note also that literally anything other than using a pathlist seems to work fine - i.e. it's the pathlist loading code that's causing the problem. this breaks:
this succeeds:
and... ahh, I see, the idea is that pathlists are not necessarily loaded into memory, unlike the sig.gz file. So the problem is that we are constructing
Verified this directly in A few thoughts and ideas -
ok, this is probably where the problem is: the rust layer takes just the first sketch.

```rust
impl Storage for FSStorage {
    ...
    fn load_sig(&self, path: &str) -> Result<SigStore> {
        let raw = self.load(path)?;
        let sig = Signature::from_reader(&mut &raw[..])?
            // TODO: select the right sig?
            .swap_remove(0);
        Ok(sig.into())
    }
}
```
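To make the failure mode concrete, here is a minimal sketch of "take the first record" next to a checked alternative. `SigRecord`, `take_first`, and `take_only` are hypothetical stand-ins written for this comment, not sourmash types or functions; the only thing borrowed from the real code is the `swap_remove(0)` call.

```rust
/// Hypothetical stand-in for a loaded signature record (not sourmash's `Signature`).
#[derive(Debug, Clone, PartialEq)]
pub struct SigRecord {
    pub name: String,
}

/// Mirrors the behaviour shown above: keep record 0 and silently drop the rest.
/// Like `swap_remove(0)` itself, this panics on an empty vector.
pub fn take_first(mut sigs: Vec<SigRecord>) -> SigRecord {
    sigs.swap_remove(0)
}

/// Hypothetical checked variant: succeed only when exactly one record is present,
/// so extra records produce an error instead of disappearing.
pub fn take_only(mut sigs: Vec<SigRecord>) -> Result<SigRecord, String> {
    match sigs.len() {
        1 => Ok(sigs.swap_remove(0)),
        n => Err(format!("expected exactly 1 signature, found {}", n)),
    }
}
```

With two records in the input, `take_first` returns the first and gives no hint that a second existed, while `take_only` returns an `Err`. Whether an error, a panic, or real multi-record support is the right answer is a separate question.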
ok, and of course the problem here is that The ...dumbest hack we could put in place would be to munge
ok, so now I’m kind of stuck… The fundamental problem is that the So there’s the idea of allowing it anyway, by munging path to include a record number of some kind (as in comment linked above). Alternatively, could “just” load into memory and keep in But then again, maybe that’s not a problem? With zip files, and standalone manifests pointing at zip files, we fully support random access into, & subsetting of, large collections. So anyone wanting to access large collections of sketches using pathlists that then runs into memory problems … can just use those mechanisms.
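For the "munge the path to include a record number" idea, a rough sketch of what the parsing side might look like. The `path#index` syntax and the `split_locator` name are assumptions invented for illustration; nothing in sourmash or branchwater defines this format.

```rust
/// Split a hypothetical `path#index` locator into (path, record index).
/// A locator without a numeric `#index` suffix is treated as record 0 of the
/// whole string, so ordinary paths keep working unchanged.
pub fn split_locator(locator: &str) -> (&str, usize) {
    match locator.rsplit_once('#') {
        Some((path, idx)) => match idx.parse::<usize>() {
            Ok(i) => (path, i),
            // The suffix is not a number, so the '#' was part of the file name.
            Err(_) => (locator, 0),
        },
        None => (locator, 0),
    }
}
```

The obvious downside is that the storage layer would still have to parse the whole JSON file to reach record N, so this fixes correctness but not memory or I/O cost.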
Lots of conversation to relay, much of it from Slack -- First, note that loading a multiple-signature JSON file from a pathlist is broken in Now, a few backs and forths to my questions and proposals -
Excellent points - there's a lot to disentangle here 😭 My proposal above was the easy "load everything in a pathlist into memory" which would break your use case. Ooops.
Here we are facing a few legacy situations.

Legacy #1 - the wort signatures are "3-in-one" files. There's exactly one Signature with multiple Sketches underneath - see sourmash-bio/sourmash#616 (comment) and the comment in the sourmash Python code here. This means that

Legacy #2 - the wort signatures don't have manifests and don't support direct loading of specific sketches, unlike zip files (because they predate zip files at the Python layer, as well as zip files at the Rust layer). We could significantly speed things up by switching to zip files - in addition to having manifests already there, we wouldn't need to read all three sketches worth of data to get a specific sketch.

As noted above, this bug has been present since at least the last big refactoring of collection loading sourmash_plugin_branchwater#197.
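As a toy model of the "3-in-one" layout in Legacy #1, the sketch below has one signature holding several sketches and picks one by ksize. `StandInSketch`, `ThreeInOneSig`, and `select_ksize` are hypothetical names for illustration only and are not the sourmash API.

```rust
/// Hypothetical stand-in for one sketch inside a signature.
#[derive(Debug, Clone, PartialEq)]
pub struct StandInSketch {
    pub ksize: u32,
    pub hashes: Vec<u64>,
}

/// Hypothetical stand-in for a "3-in-one" signature: one record, several sketches.
#[derive(Debug, Clone, PartialEq)]
pub struct ThreeInOneSig {
    pub name: String,
    pub sketches: Vec<StandInSketch>,
}

/// Return the sketch with the requested ksize, if the signature has one.
pub fn select_ksize(sig: &ThreeInOneSig, ksize: u32) -> Option<&StandInSketch> {
    sig.sketches.iter().find(|s| s.ksize == ksize)
}
```

Note that in this model every sketch is already in memory before the selection happens, which is the cost Legacy #2 describes: without a manifest, getting one sketch means reading all three.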
See below!
If you run A better approach would be to transfer all of those files over to
Agreed :)
Yes, this is exactly the problem :)
ahh, ok! Since I'm using
Excellent idea and definitely one of the options I was thinking about over the last few days.
It doesn't seem to fit into Looking at luizirber
cool!
Good to know! I will maybe document it in the code a bit. Other thoughts - I think adding a
One big TODO item: move all of this over to an issue or issues, perhaps one in sourmash and one here in branchwater...
over in the manysearch benchmarking repo, I tried out switching things to lists of .sig.zip files using the `MultiCollection` stuff in #430 and it worked great! see dib-lab/2022-branchwater-benchmarking#11 for numbers and details and revised snakemake workflow. The 10,000 sketches for the 'd' list of wort sigs are here on farm: In that directory there is a Snakefile that does the conversion and builds the mf.csv files. See issue sourmash-bio/sourmash#3349
a clean PR that provides an
…ture` in a JSON record (#3333)

This PR was originally about debugging sourmash-bio/sourmash_plugin_branchwater#445, but that's going to require more work to fix properly. For now, I would like to nominate it for merge because sourmash fails silently in this situation, and that's Bad.

In brief, the main thing this PR does is panic with an `unimplemented!` when `FSStorage::load_sig` encounters more than one `Signature` in a JSON record.

This PR also adds a bit of documentation to `InnerStorage`, per the bottom of [this comment](sourmash-bio/sourmash_plugin_branchwater#445 (comment)).

---

The problem at hand: when loading a `SigStore`/`Signature` from a `Storage`, sourmash only loads the first one and ignores any others.

https://github.com/sourmash-bio/sourmash/blob/26b50f3e3566006fd6356a4f8b4d47c5e381aeec/src/core/src/storage/mod.rs#L34-L38

This results from the concept of a `Signature` as containing one or more sketches; the history of this is described [here](#616 (comment)), and it leads to some interesting silliness [in the Python layer](https://github.com/sourmash-bio/sourmash/blob/d63c464e825529fa54bb7e8b81faa53b858b09de/src/sourmash/save_load.py#L297).

The flip side is that, in Rust, a single `Signature` can include multiple sketches, e.g. with different ksizes. So this works fine for the wort case where we have a single `.sig` file with k=21, k=31, k=51.

Note that the Python layer (and hence the entire sourmash CLI) fully supports multiple `Signature`s in JSON: this is well tested and well covered behavior. The branchwater plugin runs into it because it is using the Rust layer and the API is not fully fleshed out there.

---
This PR is debugging the mystifying failure of `test_fastgather::test_against_multisigfile`.