Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: debug multisigfile test #445

Open
wants to merge 94 commits into
base: main
Choose a base branch
from
Open

WIP: debug multisigfile test #445

wants to merge 94 commits into from

Conversation

ctb
Copy link
Collaborator

@ctb ctb commented Aug 27, 2024

This PR is debugging the mystifying failure of test_fastgather::test_against_multisigfile.

ctb added 30 commits August 17, 2024 10:14
#434)

* preliminary victory

* compiles and mostly runs

* cleanup, split to new module

* cleanup and comment

* more cleanup of diff

* cargo fmt

* fix fmt

* restore n_failed

* comment failing test

* cleanup and de-vec

* create module/submodule structure

* comment for later

* get rid of vec

* beg for help

* cleanup and doc
@ctb
Copy link
Collaborator Author

ctb commented Sep 17, 2024

Dear journal,

note also that literally anything other than using a pathlist seems to work fine - i.e. it's the pathlist loading code that's causing the problem.

this breaks:

sourmash scripts fastgather src/python/tests/test-data/SRR606249.sig.gz list-combo.txt -o xxx.csv -s 100000

this succeeds:

sourmash scripts fastgather src/python/tests/test-data/SRR606249.sig.gz combined.sig.gz -o xxx.csv -s 100000

and... ahh, I see, the idea is that pathlists are not loaded into memory necessarily, unlike the sig.gz file. So the problem is that we are constructing Records from the ones in the pathlist, and then trying to refer back to the Record to load the signature.

@ctb
Copy link
Collaborator Author

ctb commented Sep 17, 2024

Verified this directly in fastgather.rs. So, the question is, what are the fix(es)?

A few thoughts and ideas -

  • this is probably happening elsewhere, in addition to load_sketches_above_threshold; check other places where we are using sig_from_record.
  • in load_sketches_above_threshold, we probably should create PrefetchResult directly from the loaded signature, so as to avoid future problems with discrepancies between Record and Signature.
  • in addition, load_sketches_above_threshold (and other similar code) should probably directly confirm the right thing was loaded, and maybe panic if it is not?
  • and last but not least, sig_from_record should be doing the Right Thing, but I'm not sure exactly what is going wrong :)

@ctb
Copy link
Collaborator Author

ctb commented Sep 17, 2024

ok, this is probably where the problem is: the rust layer takes just the first sketch.

https://github.com/sourmash-bio/sourmash/blob/26b50f3e3566006fd6356a4f8b4d47c5e381aeec/src/core/src/storage/mod.rs#L34-L38

impl Storage for FSStorage {
    ...

    fn load_sig(&self, path: &str) -> Result<SigStore> {
        let raw = self.load(path)?;
        let sig = Signature::from_reader(&mut &raw[..])?
            // TODO: select the right sig? 
            .swap_remove(0);

        Ok(sig.into())
    }

@ctb
Copy link
Collaborator Author

ctb commented Sep 17, 2024

ok, and of course the problem here is that load_sig returns precisely one sig, and we don't have information necessary to return the right one.

The ...dumbest hack we could put in place would be to munge path to include some kind of internal numbering, e.g. /path/to/combined.sig.gz:0 or 0:/path/to/combined.sig.gz, in internal_location.

@ctb
Copy link
Collaborator Author

ctb commented Sep 17, 2024

ok, so now I’m kind of stuck…

The fundamental problem is that the Storage trait doesn’t (seem to) really provide an interface where a single path or location can store multiple sketches, and the right (single) sketch can be retrieved from Storage specifically. There are actually some good reasons for this - you generally do not want to load JSON sketches multiple times, picking out a different record each time. But, y’know, nor should we be doing what we’re currently doing with just picking the first signature in the file :).

So there’s the idea of allowing it anyway, by munging path to include a record number of some kind (as in comment linked above). Alternatively, could “just” load into memory and keep in MemStorage, which is how it works when we load a single .sig.gz file over in the branchwater plugin; problem is if you have a pathlist with tons of .sig.gz files in it, you’ll load them all at once into memory (under this new mechanism).

But then again, maybe that’s not a problem? With zip files, and standalone manifests pointing at zip files, we fully support random access into, & subsetting of, large collections. So anyone wanting to access large collections of sketches using pathlists that then runs into memory problems … can just use those mechanisms.

@ctb ctb changed the title WIP: debugging multisigfile test WIP: debug multisigfile test Sep 24, 2024
@ctb
Copy link
Collaborator Author

ctb commented Sep 25, 2024

Lots of conversation to relay, much of it from slack --

First, note that loading a multiple-signature JSON file from a pathlist is broken in main and doesn't work in the most recent release, ; see test code and comment.

Now, a few backs and forths to my questions and proposals -

3:55 PM
bluegenes

🐾 First, it sounds like behavior would only change for commands/situations where we load everything into memory in branchwater, right? Not something like fastmultigather, where we load a single query per thread at a time? If yes, then I think my use case will be fine and I agree re: just allowing loading into MemStorage and pointing folks to manifests if they run into memory issues. If no (as in, if this would happen when loading the collection), then I think it would at least temporarily disable our ability to run with wort metagenomes.

Excellent points - there's a lot to disentangle here 😭

My proposal above was the easy "load everything in a pathlist into memory" which would break your use case. Ooops.

Some logistics thoughts for standalone manifests for very large collections of large sketches:

  • I believe sig manifest tries to load all sketches at once --> lots of memory, but sig collect instead iterates through, right? But metagenomes are large and sig collect is quite time consuming for them. For 10 wort metagenomes, it took 7min and 5GB of memory. For 100 wort metagenomes, 1 hour, 34GB.

Here we are facing a few legacy situations.

Legacy #1 - the wort signatures are "3-in-one" files. There's exactly one Signature with multiple Sketches underneath - see sourmash-bio/sourmash#616 (comment) and the comment in the sourmash Python code here. This means that load_sig is working fine for this exact situation but is broken for everything else, including the output of sig cat ... -o combined.sig.gz. So, I think ... we need to figure this out :).

Legacy #2 - the wort signatures don't have manifests and don't support direct loading of specific sketches, unlike zip files (because they predate zip files at the Python layer, as well as zip files at the Rust layer). We could significantly speed things up by switching to zip files - in addition to having manifests already there, we wouldn't need to read all three sketches worth of data to get a specific sketch.

As noted above, this bug has been present since at least the last big refactoring of collection loading sourmash_plugin_branchwater#197.

  • There are currently 3,376,595 wort-sra sigs in wort on farm. A full manifest doesn't seem worth attempting with sig collect. Note mostly to self: this would be straightforward to port to rust for improved manifest generation from pathlists.

See below!

  • I was able to multithread fastmultigather using a query pathlist of 18k wort metagenomes (the ones with matches to my query) in ~2 days against a small database. So while loading from a pathlist took quite a while, it was still essential here, and faster/lower memory than sig collect. (edited)

If you run sig collect on the stored .sig files from branchwater, it will need to read all three sketches in order to generate a manifest.

A better approach would be to transfer all of those files over to .sig.zip files; are we only searching them via the branchwater plugin, or somewhere else? That is, do we need to worry about properly handling zip files somewhere other than the branchwater plugin?

🐾 To be clear - I think loading from pathlist directly is far inferior and the MemStorage idea is a fine hackaround, I just think that we need to improve manifest generation from pathlists in order to support the wort-sra use case!

Agreed :)

🐾 hmm thinking about it a bit more, this might yield a circular issue. To build a manifest from a pathlist, we would like to load sigs using the MultiCollection loading framework, in case there are different file types. However, if this loads into MemStorage, we will not be able to.

Yes, this is exactly the problem :)

luizirber
couple of hot takes, in lieu of deeper reading on the code:

  • if load_sig is not doing what you need, you can call load directly on the path and select the right sig out of it. the Storage doesn't know anything about what it is storing, and load_sig was written with the assumption that 1 path = 1 sig.

ahh, ok!

Since I'm using Collection::load_from_record as an intermediary, though, I would need to add something to the Collection struct to use load instead of load_sig.

  • we can also modify load_sig to take a Selection too (maybe optional?), and let the selection get the right sig

Excellent idea and definitely one of the options I was thinking about over the last few days.

  • isn't the pathlist naturally a FSStorage?

It doesn't seem to fit into MultiCollection that way. I'll dig and try to explain why (and/or just fix my code when I discover I'm wrong!)

Looking at Collection::from_paths, I might be able to use that here.

luizirber

cool!

oh, comment about InnerStorage: that's kind of a catch-all type that allows using any Storage in parallel contexts
hence the weird type definition:
pub struct InnerStorage(Arc<RwLock<dyn Storage + Send + Sync + 'static>>);
Arc allows ref counting to share it between threads
RwLock makes sure there is only one writer possible (and a lot of readers)
dyn Storage so we can init with anything that implements the Storage trait
Send + Sync + 'static is kind of a cheat to avoid lifetimes issues, I particularly would like to get rid of that 'static if possible...

Good to know! I will maybe document it in the code a bit.

Other thoughts -

I think adding a panic! in load_sig when there is more than one Signature might be a good idea...

@ctb
Copy link
Collaborator Author

ctb commented Sep 25, 2024

One big TODO item: move all of this over to an issue or issues, perhaps one in sourmash and one here in branchwater...

@ctb
Copy link
Collaborator Author

ctb commented Oct 13, 2024

over in the manysearch benchmarking repo, I tried out switching things to lists of .sig.zip files using the `MultiCollection stuff in #430 and it worked great! see dib-lab/2022-branchwater-benchmarking#11 for numbers and details and revised snakemake workflow.

The 10,000 sketches for the 'd' list of wort sigs are here on farm:/home/ctbrown/scratch/2022-branchwater-benchmarking/wort-list-d.d

In that directory there is a Snakefile that does the conversion and builds the mf.csv files. See issue sourmash-bio/sourmash#3349

@ctb
Copy link
Collaborator Author

ctb commented Oct 27, 2024

a clean PR that provides an unimplemented! panic + associated test in sourmash is here: sourmash-bio/sourmash#3333

ctb added a commit to sourmash-bio/sourmash that referenced this pull request Nov 4, 2024
…ture` in a JSON record (#3333)

This PR was originally about debugging
sourmash-bio/sourmash_plugin_branchwater#445,
but that's going to require more work to fix properly. For now, I would
like to nominate it for merge because sourmash fails silently in this
situation, and that's Bad.

In brief, the main thing this PR does is panic with an `unimplemented!`
when `FSStorage::load_sig` encounters more than one `Signature` in a
JSON record.

This PR also adds a bit of documentation to `InnerStorage`, per the
bottom of [this
comment](sourmash-bio/sourmash_plugin_branchwater#445 (comment)).

---

The problem at hand: when loading a `SigStore`/`Signature` from a
`Storage`, sourmash only loads the first one and ignores any others.


https://github.com/sourmash-bio/sourmash/blob/26b50f3e3566006fd6356a4f8b4d47c5e381aeec/src/core/src/storage/mod.rs#L34-L38

This results from the concept of a `Signature` as containing one or more
sketches; the history of this is described
[here](#616 (comment)),
and it leads to some interesting silliness [in the Python
layer](https://github.com/sourmash-bio/sourmash/blob/d63c464e825529fa54bb7e8b81faa53b858b09de/src/sourmash/save_load.py#L297).

The contrapositive is that, in Rust, a single `Signature` can include
multiple sketches, e.g. with different ksizes. So this works fine for
the wort case where we have a single `.sig` file with k=21, k=31, k51.

Note that the Python layer (and hence the entire sourmash CLI) fully
supports multiple `Signature`s in JSON: this is well tested and well
covered behavior. The branchwater plugin runs into it because it is
using the Rust layer and the API is not fully fleshed out there.

---
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant