MRG: use core Collection and Select for sig loading #197

Merged
merged 58 commits into from
Feb 13, 2024
58 commits
f1edc2c
init changes
bluegenes Sep 27, 2023
cb702b3
compiling code using newer mastiff branch
bluegenes Sep 27, 2023
db318ba
use selection
bluegenes Oct 2, 2023
d4393b1
Merge branch 'main' into upd-smash-core
bluegenes Nov 5, 2023
07c8362
rustfmt
bluegenes Nov 5, 2023
6975a72
Merge branch 'main' into upd-smash-core
bluegenes Nov 17, 2023
ff48469
update deps
luizirber Nov 22, 2023
ea88f20
Merge branch 'main' into upd-smash-core
bluegenes Nov 22, 2023
ddf6c8c
update to sourmash 0.12.0
luizirber Dec 1, 2023
b48ac88
fix index
luizirber Dec 1, 2023
f1145a1
rm reporting line checks not in smash core idx
bluegenes Dec 2, 2023
a6785dc
use selection instead of template
bluegenes Dec 7, 2023
5dd52f2
Merge branch 'main' into upd-smash-core
bluegenes Jan 23, 2024
21f20ca
rustfmt
bluegenes Jan 23, 2024
45b598f
fix query file no exist errs
bluegenes Jan 23, 2024
34ed1bb
update mastiff_manygather
bluegenes Jan 23, 2024
243d106
rustfmt
bluegenes Jan 23, 2024
cee5597
Merge branch 'main' into upd-smash-core
bluegenes Jan 23, 2024
87219aa
add cargo lock
bluegenes Jan 23, 2024
2fcf684
switch to commit in latest br
bluegenes Jan 24, 2024
86d6c16
cleanup unused imports and code
bluegenes Jan 24, 2024
13940cd
init use collection for query loading
bluegenes Jan 25, 2024
cd8be99
...collection loading in progress
bluegenes Jan 26, 2024
32fc2d5
fix fastgather
bluegenes Jan 26, 2024
39fb7dc
re-enable more permissive pathlist loading
bluegenes Jan 26, 2024
15f7dba
clean up ms
bluegenes Jan 26, 2024
4e3b7ee
harmonize errors
bluegenes Jan 26, 2024
14ee1bd
harmonize error text and output filenames
bluegenes Jan 27, 2024
b5a175e
Merge branch 'main' into upd-smash-core
bluegenes Jan 30, 2024
363b90d
re-allow load from sig; upd manysearch
bluegenes Jan 30, 2024
0ea39b5
fix all except moltype selection
bluegenes Jan 31, 2024
912f717
update fastgather and multisearch
bluegenes Jan 31, 2024
c518e83
Merge branch 'upd-smash-core' into use-core-more-broadly
bluegenes Jan 31, 2024
f5216f8
update pairwise
bluegenes Jan 31, 2024
893e0a7
clean up a little
bluegenes Jan 31, 2024
dbdff4a
clean up; unify sketch loading for pairwise/multisearch
bluegenes Feb 1, 2024
ab339ba
...cleaner
bluegenes Feb 2, 2024
f769aee
unify more code
bluegenes Feb 2, 2024
8d7781c
rm unused save_paths option
bluegenes Feb 2, 2024
b6ebc7a
use updated mh loading
bluegenes Feb 2, 2024
a463ac8
standardize indexed writing using local struct for now
bluegenes Feb 2, 2024
c7b865b
clean up sketch loading and file opening/writing
bluegenes Feb 2, 2024
14af130
apply clippy suggestions
bluegenes Feb 2, 2024
13c96d1
add back SmallSignature and use
bluegenes Feb 2, 2024
2453c9b
rename fn back to load_sketches
bluegenes Feb 2, 2024
0086b3a
use serde serialize for writing instead of custom traits
bluegenes Feb 2, 2024
f6989fd
upd to 0.12.1 sourmash core; clean up index tests
bluegenes Feb 10, 2024
5119961
Merge branch 'main' into use-core-more-broadly
bluegenes Feb 10, 2024
43b1b5d
Merge branch 'main' into use-core-more-broadly
bluegenes Feb 10, 2024
970c434
fix issues from merge conflicts
bluegenes Feb 10, 2024
4169e50
clean up tests
bluegenes Feb 10, 2024
994f0ed
rustfmt
bluegenes Feb 10, 2024
b7807b5
Merge branch 'main' into use-core-more-broadly
ctb Feb 12, 2024
862fb65
make fmg output filenaming robust to spaces in signame
bluegenes Feb 13, 2024
c9db4e0
Merge branch 'use-core-more-broadly' of github.com:sourmash-bio/sourm…
bluegenes Feb 13, 2024
3c97baa
minor doc updates
bluegenes Feb 13, 2024
7ce7ff9
disable rocksdb for fastgather
bluegenes Feb 13, 2024
c9cda17
instead, disable rocksdb reading within load_collection
bluegenes Feb 13, 2024
26 changes: 8 additions & 18 deletions doc/README.md
@@ -2,29 +2,17 @@

This repository implements five sourmash plugins, `manysketch`, `fastgather`, `fastmultigather`, `multisearch`, and `manysearch`. These plugins make use of multithreading in Rust to provide very fast implementations of `sketch`, `search`, and `gather`. With large databases, these commands can be hundreds to thousands of times faster, and 10-50x lower memory, than sourmash.

-The main *drawback* to these plugin commands is that their inputs and outputs are not as rich as the native sourmash commands. In particular, this means that input databases need to be prepared differently. Moreover, the output may be most useful as a prefilter in conjunction with regular sourmash commands - see the instructions below for using `fastgather` to create picklists for sourmash.
+The main *drawback* to these plugin commands is that their inputs and outputs are not as rich as the native sourmash commands. This may mean that your input files need to be prepared differently. The output may currently be most useful as a prefilter in conjunction with regular sourmash commands - see the instructions below for using `fastgather` to create picklists for sourmash.

## Input file formats

-All four search/gather commands use either zip files or _text files containing lists of signature files_ ("fromfiles") for the search database. `multisearch`, `manysearch` and `fastmultigather` also use either zips or "fromfiles" for queries, too.
+All four search/gather commands accept zip files or _text files containing lists of signature files_ ("fromfiles") for the search database. `multisearch`, `manysearch` and `fastmultigather` also use either zips or "fromfiles" for queries, too. All commands now accept single signature files as well, though this is only useful for single-query input.

`manysketch` takes as input a CSV file with columns `name,genome_filename,protein_filename`. If you don't have `protein_filename` entries, be sure to include the trailing comma so the CSV reader can process the file correctly.

### Using zip files

-Zip files are used in two ways, depending on how the command works.
-
-If the command loads a collection of sketches into memory at the start, then the sketches from the zip file are simply loaded into memory! So,
-* `multisearch` loads both query and database into memory;
-* `manysearch` loads the queries into memory;
-* `fastmultigather` loads the search database into memory;
-
-If the command loads a collection of sketches throughout execution, then the zip file is _unpacked_ to a temporary directory and the sketches are loaded from there. (This can consume a lot of extra disk space!) So,
-* `manysearch` loads the sketches being searched this way;
-* `fastgather` loads the database sketches this way;
-* `fastmultigather` loads the query sketches this way;
-
-Note that the temp directory is created under the path specified in the `TMPDIR` environment variable if it is set, otherwise it returns `/tmp`.
+Signature zip files are the most efficient file to load, as they contain 'manifest' files with parameter information for each included sketch. When loading the zipfile, we can select relevant signatures without loading the sketches themselves into memory. We then only load the actual sketches (and optionally, downsample to a lower scaled value) when we're ready to use them.
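
The manifest-first loading described in the added paragraph can be pictured with a small standalone sketch. It uses only the Rust standard library; `ManifestRow`, `Wanted`, and `select_rows` are hypothetical names invented for illustration and are not the sourmash `Collection`/`Selection` API. It only shows the idea of filtering on per-sketch parameters before any sketch data is read.

```rust
// Hypothetical, simplified manifest row: parameters only, no hash data.
#[derive(Debug, Clone, PartialEq)]
struct ManifestRow {
    internal_location: String, // path of the sketch inside the archive
    ksize: u32,
    moltype: String,
}

// Hypothetical selection criteria, standing in for a real selection object.
struct Wanted {
    ksize: u32,
    moltype: String,
}

// Filter manifest rows by their parameters; no sketch is read at this point.
fn select_rows<'a>(manifest: &'a [ManifestRow], want: &Wanted) -> Vec<&'a ManifestRow> {
    manifest
        .iter()
        .filter(|row| row.ksize == want.ksize && row.moltype == want.moltype)
        .collect()
}

fn main() {
    let manifest = vec![
        ManifestRow {
            internal_location: "signatures/a.sig.gz".to_string(),
            ksize: 21,
            moltype: "DNA".to_string(),
        },
        ManifestRow {
            internal_location: "signatures/b.sig.gz".to_string(),
            ksize: 31,
            moltype: "DNA".to_string(),
        },
    ];
    let want = Wanted {
        ksize: 31,
        moltype: "DNA".to_string(),
    };

    for row in select_rows(&manifest, &want) {
        // Only here would the sketch itself be read from the archive.
        println!("would load {}", row.internal_location);
    }
}
```

A "fromfile" has no such manifest, which is why the README change notes that those signatures are all loaded up front.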

### Using "fromfiles"

@@ -41,6 +29,8 @@ and then build a "fromfile":
find gtdb-reps-rs214-k21/ -name "*.sig.gz" -type f > list.gtdb-reps-rs214-k21.txt
```

+When using these files for search, we have no a priori information about the parameters used for each sketch, so we load all signatures into memory at the start.
+
## Running the commands

### Running `manysketch`
@@ -101,7 +91,7 @@ The results file here, `query.x.gtdb-reps.csv`, will have 8 columns: `query` and

The `fastgather` command is a much faster version of `sourmash gather`.

-`fastgather` takes a query metagenome and an input collection (zip or "fromfile") as database, and outputs a CSV:
+`fastgather` takes a single query metagenome (in any file format) and an input collection (zip or "fromfile") as database, and outputs a CSV:
```
sourmash scripts fastgather query.sig.gz podar-ref-list.txt -o results.csv --cores 4
```
@@ -144,9 +134,9 @@ The main advantage that `fastmultigather` has over running `fastgather` on multi

`fastmultigather` will output two CSV files for each query, a `prefetch` file containing all overlapping matches between that query and the database, and a `gather` file containing the minimum metagenome cover for that query in the database.

-The prefetch CSV will be named `{basename}.prefetch.csv`, and the gather CSV will be named `{basename}.gather.csv`. Here, `{basename}` is the filename, stripped of its path. If zipfiles are used, `{basename}` will be the md5sum.
+The prefetch CSV will be named `{signame}.prefetch.csv`, and the gather CSV will be named `{signame}.gather.csv`. Here, `{signame}` is the name of your sourmash signature.

-**Warning:** At the moment, if two different queries have the same `{basename}`, the CSVs for one of the queries will be overwritten by the other query. The behavior here is undefined in practice, because of multithreading: we don't know what queries will be executed when or files will be written first.
+**Warning:** At the moment, if two different queries have the same `{signame}`, the CSVs for one of the queries will be overwritten by the other query. The behavior here is undefined in practice, because of multithreading: we don't know what queries will be executed when or files will be written first.
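
To make the naming rule and the overwrite warning concrete, here is a minimal standalone sketch. Both helpers, `csv_names` and `duplicate_names`, are hypothetical and written only for illustration. Replacing whitespace and path separators with underscores is an assumption about one way to make filenames robust to spaces; it is not confirmed to be what the plugin does.

```rust
use std::collections::HashSet;

// Hypothetical helper: derive the two per-query CSV names from a signature name.
// Assumption: whitespace and path separators are replaced with '_'.
fn csv_names(signame: &str) -> (String, String) {
    let safe: String = signame
        .chars()
        .map(|c| if c.is_whitespace() || c == '/' || c == '\\' { '_' } else { c })
        .collect();
    (
        format!("{}.prefetch.csv", safe),
        format!("{}.gather.csv", safe),
    )
}

// Hypothetical pre-flight check: report each signature name that occurs more
// than once, since such queries would write to the same output files.
fn duplicate_names<'a>(signames: &[&'a str]) -> Vec<&'a str> {
    let mut seen = HashSet::new();
    let mut dups = Vec::new();
    for name in signames {
        if !seen.insert(*name) && !dups.contains(name) {
            dups.push(*name);
        }
    }
    dups
}

fn main() {
    let (prefetch, gather) = csv_names("GCF_000 E. coli");
    println!("{} {}", prefetch, gather);

    let dups = duplicate_names(&["q1", "q2", "q1"]);
    if !dups.is_empty() {
        eprintln!("WARNING: duplicate signature names: {:?}", dups);
    }
}
```

Running a check like `duplicate_names` over the query names before starting would surface collisions up front instead of leaving the outcome to thread scheduling.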

### Running `manysearch`

13 changes: 4 additions & 9 deletions src/check.rs
@@ -1,19 +1,14 @@
-use std::path::Path;
-
 use crate::utils::is_revindex_database;

 use sourmash::index::revindex::{RevIndex, RevIndexOps};

-pub fn check<P: AsRef<Path>>(index: P, quick: bool) -> Result<(), Box<dyn std::error::Error>> {
-    if !is_revindex_database(index.as_ref()) {
-        bail!(
-            "'{}' is not a valid RevIndex database",
-            index.as_ref().display()
-        );
+pub fn check(index: camino::Utf8PathBuf, quick: bool) -> Result<(), Box<dyn std::error::Error>> {
+    if !is_revindex_database(&index) {
+        bail!("'{}' is not a valid RevIndex database", index);
     }

     println!("Opening DB");
-    let db = RevIndex::open(index.as_ref(), true)?;
+    let db = RevIndex::open(index, true)?;

     println!("Starting check");
     db.check(quick);
85 changes: 41 additions & 44 deletions src/fastgather.rs
@@ -1,49 +1,52 @@
/// fastgather: Run gather with a query against a list of files.
use anyhow::Result;

use sourmash::signature::Signature;
use sourmash::sketch::Sketch;
use std::path::Path;
use sourmash::prelude::Select;
use sourmash::selection::Selection;

use crate::utils::{
consume_query_by_gather, load_sigpaths_from_zip_or_pathlist, load_sketches_above_threshold,
prepare_query, write_prefetch, ReportType,
consume_query_by_gather, load_collection, load_sketches_above_threshold, write_prefetch,
ReportType,
};

pub fn fastgather<P: AsRef<Path> + std::fmt::Debug + std::fmt::Display + Clone>(
query_filename: P,
matchlist_filename: P,
pub fn fastgather(
query_filepath: String,
against_filepath: String,
threshold_bp: usize,
ksize: u8,
scaled: usize,
template: Sketch,
gather_output: Option<P>,
prefetch_output: Option<P>,
selection: &Selection,
gather_output: Option<String>,
prefetch_output: Option<String>,
allow_failed_sigpaths: bool,
) -> Result<()> {
let location = query_filename.to_string();
eprintln!("Loading query from '{}'", location);
let query = {
let sigs = Signature::from_path(query_filename)?;
let query_collection = load_collection(
&query_filepath,
selection,
ReportType::Query,
allow_failed_sigpaths,
)?;

prepare_query(&sigs, &template, &location)
};
// did we find anything matching the desired template?
let query = match query {
Some(query) => query,
None => bail!("No sketch found with scaled={}, k={}", scaled, ksize),
if query_collection.len() != 1 {
bail!(
"Fastgather requires a single query sketch. Check input: '{:?}'",
&query_filepath
)
}
// get single query sig and minhash
let query_sig = query_collection.sig_for_dataset(0)?; // need this for original md5sum
let query_sig_ds = query_sig.clone().select(selection)?; // downsample
let query_mh = match query_sig_ds.minhash() {
Some(query_mh) => query_mh,
None => {
bail!("No query sketch matching selection parameters.");
}
};

// build the list of paths to match against.
eprintln!(
"Loading matchlist from '{}'",
matchlist_filename.as_ref().display()
);

let matchlist_filename = matchlist_filename.as_ref().to_string_lossy().to_string();
let (matchlist_paths, _temp_dir) =
load_sigpaths_from_zip_or_pathlist(matchlist_filename, &template, ReportType::Against)?;

eprintln!("Loaded {} sig paths in matchlist", matchlist_paths.len());
// load collection to match against.
let against_collection = load_collection(
&against_filepath,
selection,
ReportType::Against,
allow_failed_sigpaths,
)?;

// calculate the minimum number of hashes based on desired threshold
let threshold_hashes: u64 = {
@@ -62,16 +65,10 @@ pub fn fastgather<P: AsRef<Path> + std::fmt::Debug + std::fmt::Display + Clone>(
);

// load a set of sketches, filtering for those with overlaps > threshold
let result = load_sketches_above_threshold(
matchlist_paths,
&template,
&query.minhash,
threshold_hashes,
)?;
let result = load_sketches_above_threshold(against_collection, query_mh, threshold_hashes)?;
let matchlist = result.0;
let skipped_paths = result.1;
let failed_paths = result.2;

if skipped_paths > 0 {
eprintln!(
"WARNING: skipped {} search paths - no compatible signatures.",
@@ -91,10 +88,10 @@ pub fn fastgather<P: AsRef<Path> + std::fmt::Debug + std::fmt::Display + Clone>(
}

if prefetch_output.is_some() {
write_prefetch(&query, prefetch_output, &matchlist).ok();
write_prefetch(&query_sig, prefetch_output, &matchlist).ok();
}

// run the gather!
consume_query_by_gather(query, matchlist, threshold_hashes, gather_output).ok();
consume_query_by_gather(query_sig, matchlist, threshold_hashes, gather_output).ok();
Ok(())
}
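
For readers unfamiliar with the prefetch step, the "filter for overlaps above a threshold" idea in the diff can be sketched with plain hash sets. This is a standalone illustration using only the standard library: `Candidate`, `overlap`, and `above_threshold` are hypothetical names, real sketches are MinHash objects rather than `HashSet<u64>`, and keeping candidates whose overlap is at least the threshold (`>=`) is an assumption, not a statement about `load_sketches_above_threshold`.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for a database sketch: a name plus its hash values.
struct Candidate {
    name: String,
    hashes: HashSet<u64>,
}

// Number of hashes shared between the query and one candidate.
fn overlap(query: &HashSet<u64>, other: &HashSet<u64>) -> u64 {
    query.intersection(other).count() as u64
}

// Keep only candidates sharing at least `threshold_hashes` hashes with the query.
fn above_threshold<'a>(
    query: &HashSet<u64>,
    candidates: &'a [Candidate],
    threshold_hashes: u64,
) -> Vec<(&'a str, u64)> {
    candidates
        .iter()
        .map(|c| (c.name.as_str(), overlap(query, &c.hashes)))
        .filter(|(_, n)| *n >= threshold_hashes)
        .collect()
}

fn main() {
    let query: HashSet<u64> = vec![1u64, 2, 3, 4].into_iter().collect();
    let candidates = vec![
        Candidate {
            name: "a".to_string(),
            hashes: vec![1u64, 2, 3].into_iter().collect(),
        },
        Candidate {
            name: "b".to_string(),
            hashes: vec![4u64, 9].into_iter().collect(),
        },
    ];

    for (name, shared) in above_threshold(&query, &candidates, 2) {
        println!("{} shares {} hashes with the query", name, shared);
    }
}
```

Only the candidates that pass this filter need to be kept for the gather step, which is what keeps memory use low on large databases.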