Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: add Manifest::intersect_manifest to Rust core #3305

Merged
merged 40 commits into from
Sep 21, 2024
Merged

MRG: add Manifest::intersect_manifest to Rust core #3305

merged 40 commits into from
Sep 21, 2024

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Aug 21, 2024

This PR implements Manifest::intersect_manifest and Collection::intersect_manifest for the Rust layer, which is needed to support standalone manifests over in sourmash-bio/sourmash_plugin_branchwater#430.

As part of this, the PR implements Eq and Hash traits for Record so that HashSet can be used for efficient intersections.

Related PRs:

Copy link

codecov bot commented Aug 21, 2024

Codecov Report

Attention: Patch coverage is 70.83333% with 7 lines in your changes missing coverage. Please review.

Project coverage is 86.49%. Comparing base (26b50f3) to head (aa109f8).
Report is 1 commits behind head on latest.

Files with missing lines Patch % Lines
src/core/src/manifest.rs 72.72% 6 Missing ⚠️
src/core/src/collection.rs 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           latest    #3305      +/-   ##
==========================================
- Coverage   86.53%   86.49%   -0.04%     
==========================================
  Files         137      137              
  Lines       16023    16047      +24     
  Branches     2728     2728              
==========================================
+ Hits        13865    13880      +15     
- Misses       1849     1858       +9     
  Partials      309      309              
Flag Coverage Δ
hypothesis-py 25.40% <ø> (ø)
python 92.37% <ø> (ø)
rust 62.30% <70.83%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ctb ctb changed the title EXP: propagate error from opening invalid RocksDB database WIP: RevIndex creation and access minor code updates Aug 21, 2024
@ctb ctb changed the base branch from latest to update_revindex August 24, 2024 14:10
@ctb ctb changed the title WIP: RevIndex creation and access minor code updates EXP: trial code updates for branchwater Aug 24, 2024
ctb added a commit that referenced this pull request Aug 26, 2024
Extracted from #3305

This PR adjusts `RevIndex::open` to propagate a `RocksDBError` when
opening a non-RocksDB directory.

It also modifies the creation printout in `RevIndex::create` to
distinguish index completion from interim output, for slightly easier
debugging.
Base automatically changed from update_revindex to latest August 26, 2024 18:46
@ctb ctb changed the title EXP: trial code updates for branchwater MRG: add Manifest::intersect_manifest to Rust core Sep 16, 2024
@@ -215,6 +215,14 @@ impl Collection {
assert_eq!(sig.signatures.len(), 1);
Ok(sig)
}

pub fn intersect_manifest(&self, mf: &Manifest) -> Self {
Copy link
Contributor

@bluegenes bluegenes Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be simpler (or complementary) to enable intersect of the collections directly i.e. pass a ref to the Collection (other) and then use self.manifest.intersect_manifest(other.manifest)?

I guess it probably doesn't matter as long as we're passing in a reference to either the collection or manifest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comes directly from the MultiCollection code use case, where we read a manifest in directly; would be frustrating to have to create a Collection first.

we might be able to coerce a Collection into a Manifest, though, which would make it work more smoothly when you do have a Collection... I'll look!

let manifest = self.manifest.intersect_manifest(mf);
Self {
manifest,
storage: self.storage.clone(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does just cloning the storage mean that we're carrying around all sigs anyway, but we just don't have access to them via the manifest?

This is probably best since we avoid having to find and copy sigs individually?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is really a question for @luizirber at the moment. The alternative would involve references and lifetime annotations, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does just cloning the storage mean that we're carrying around all sigs anyway, but we just don't have access to them via the manifest?

This is probably best since we avoid having to find and copy sigs individually?

To be clear: I think the answer is yes, esp if it's e.g. a MemStorage.

I am worried about duplicating memory unnecessarily. I don't really know how cloning a storage works in practice. I don't like the idea of cloning (which presumably would double the MemStorage).

Maybe a better approach here would be to have the result of Collection::intersect_manifest take ownership of storage without a clone. I'd need to change the signature of intersect_manifest here, will think on't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in 9b9e17e and f4e89f4!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MemStorage is the odd one, because it does allocate a lot of new memory, but Storage in general doesn't store much (or any) data, mostly metadata to access the actual data.

@ctb
Copy link
Contributor Author

ctb commented Sep 18, 2024

@luizirber any strong objections to the code here?

@ctb
Copy link
Contributor Author

ctb commented Sep 20, 2024

@luizirber @bluegenes can we merge this? :)

@ctb
Copy link
Contributor Author

ctb commented Sep 20, 2024

(not sure why codecov is acting up... tests do cover that code?!)

@bluegenes
Copy link
Contributor

I'm ok with merging 🤷🏼‍♀️

@luizirber
Copy link
Member

(not sure why codecov is acting up... tests do cover that code?!)

Sometimes cargo-tarpaulin doesn't get everything, I've tried #2993 some time ago, maybe time to revisit

@ctb ctb merged commit ada039a into latest Sep 21, 2024
42 of 46 checks passed
@ctb ctb deleted the remove_unwrap branch September 21, 2024 16:29
@ctb
Copy link
Contributor Author

ctb commented Sep 21, 2024

🎉

ctb added a commit to sourmash-bio/sourmash_plugin_branchwater that referenced this pull request Sep 21, 2024
@ctb ctb mentioned this pull request Sep 25, 2024
4 tasks
ctb added a commit that referenced this pull request Sep 25, 2024
from
#2987 (comment):

- [x] verify minimum supported rust version (MSRV) for writing release
notes (see #2988 for an
example); MSRV is checked by CI in `.github/workflows/rust.yml`,
`minimum_rust_version`
- [x] write release notes using `git log --oneline r0.12.0..latest
src/core | cut -d\ -f2- > /tmp/out.txt`
- [x] verify that the ChangeLog is up to date:
https://github.com/sourmash-bio/sourmash/blob/latest/src/core/CHANGELOG.md
- [x] bump version in `src/core/Cargo.toml`

# Release notes:

## [0.15.2] - 2024-09-25

MSRV: 1.65

Changes/additions:
* add `Manifest::intersect_manifest` to Rust core (#3305)
* propagate error from `RocksDB::open` on bad directory (#3306, #3307)

Updates:

* Bump getset from 0.1.2 to 0.1.3 (#3328)
* Bump memmap2 from 0.9.4 to 0.9.5 (#3326)
* Bump codspeed-criterion-compat from 2.6.0 to 2.7.2 (#3324)
* Bump serde_json from 1.0.127 to 1.0.128 (#3316)
* Bump serde from 1.0.209 to 1.0.210 (#3318)
* Bump serde from 1.0.208 to 1.0.209 (#3310)
* Bump serde_json from 1.0.125 to 1.0.127 (#3309)
ctb added a commit to sourmash-bio/sourmash_plugin_branchwater that referenced this pull request Oct 15, 2024
…r database (#430)

* refactor & rename & consolidate

* remove 'lower'

* add cargo doc output for private fn

* add a few comments/docs

* switch to dev version of sourmash

* tracking

* cleaner

* cleanup

* load rocksdb natively

* foo

* cargo fmt

* upd

* upd

* fix fmt

* MRG: create `MultiCollection` for collections that span multiple files (#434)

* preliminary victory

* compiles and mostly runs

* cleanup, split to new module

* cleanup and comment

* more cleanup of diff

* cargo fmt

* fix fmt

* restore n_failed

* comment failing test

* cleanup and de-vec

* create module/submodule structure

* comment for later

* get rid of vec

* beg for help

* cleanup and doc

* clippy fixes

* compiling again

* cleanup

* bump sourmash to v0.15.1

* check if is rocksdb

* weird error

* use remove_unwrap branch of sourmash

* get index to work with MultiCollection

* old bug now fixed

* clippy, format, and fix

* make names clearer

* ditch MultiCollection for index, at least for now

* testy testy

* getting closer

* update sourmash

* mark failing tests

* upd

* cargo fmt

* MRG: test exit from `pairwise` and `multisearch` if no loaded sketches (#437)

* upd

* check for appropriate multisearch error exits

* add more tests for pairwise, too

* cargo fmt

* MRG: switch to more efficient use of `Collection` by removing cloning (#438)

* remove unnecessary clones by switch to references in SmallSignature

* switch away from references for collections => avoid clones

* remove MultiCollection::iter

* MRG: add tests for RocksDB/RevIndex, standalone manifests, and flexible pathlists (#436)

* test using rocksdb as source of sketches

* test file lists of zips

* cargo fmt

* hackity hack hack a picklist

* ok that makes more sense

* it works

* comments around future par_iter

* support loading from a .sig.gz for index

* test pairwise loading from rocksdb

* add test for queries from Rocksdb

* decide not to implement lists of manifests :)

* reenable and fix test_fastgather.py::test_indexed_against

* impl Deref for MultiCollection

* clippy

* switch to using load_sketches method

* deref doesn't actually make sense for MultiCollection

* update to latest sourmash code

* update to latest sourmash code

* simplify

* update to latest sourmash code

* remove unnecessary flag

* MRG: support & test loading of standalone manifests within pathlists (#450)

* use recursion to load paths into a MultiCollection => mf support

* MRG: clean up index to use `MultiCollection` (#451)

* try making index work with standard code

* kinda working

* fmt

* refactor

* clear up the tests

* refactor/clean up

* cargo fmt

* add tests for index warning & error

* comment

* MRG: documentation updates based on new collection loading (#444)

* update docs for #430

* upd documentation

* upd

* Update src/lib.rs

Co-authored-by: Tessa Pierce Ward <[email protected]>

* switch unwrap to expect

* move unwrap to expect

* minor cleanup

* cargo fmt

* provide legacy method to avoid xfail on index loading

* switch to using reference

* update docs to reflect pathlist behavior

* test recursive nature of MultiCollection

* re-enable test that is now passing

* update to latest sourmash

* upd sourmash

* update sourmash

* mut MultiCollection

* cleanup

* update after merge of sourmash-bio/sourmash#3305

* fix contains_revindex

* add trace commands for tracing loading

* use released version of sourmash

* add support for ignoring abundance

* cargo fmt

* avoid downsampling until we know there is overlap

* change downsample to true; add panic assertion

* move downsampling side guard

* eliminate redundant overlap check

* move calc_abund_stats

* extract abundance code into own function; avoid downsampling if poss

* cleanup

* fmt

* update to next sourmash release

* cargo fmt

* upd sourmash

* correct numbers

* upd sourmash

* upd sourmash

* upd sourmash

* upd sourmash

* use new try_into() and eliminate several clone()s

* refactor a bit more

* deallocate collection?

* upd sourmash

* cargo fmt

* fix merge foo

---------

Co-authored-by: Tessa Pierce Ward <[email protected]>
ctb added a commit to sourmash-bio/sourmash_plugin_branchwater that referenced this pull request Oct 15, 2024
…#471)

* refactor & rename & consolidate

* remove 'lower'

* add cargo doc output for private fn

* add a few comments/docs

* switch to dev version of sourmash

* tracking

* cleaner

* cleanup

* load rocksdb natively

* foo

* cargo fmt

* upd

* upd

* fix fmt

* MRG: create `MultiCollection` for collections that span multiple files (#434)

* preliminary victory

* compiles and mostly runs

* cleanup, split to new module

* cleanup and comment

* more cleanup of diff

* cargo fmt

* fix fmt

* restore n_failed

* comment failing test

* cleanup and de-vec

* create module/submodule structure

* comment for later

* get rid of vec

* beg for help

* cleanup and doc

* clippy fixes

* compiling again

* cleanup

* bump sourmash to v0.15.1

* check if is rocksdb

* weird error

* use remove_unwrap branch of sourmash

* get index to work with MultiCollection

* old bug now fixed

* clippy, format, and fix

* make names clearer

* ditch MultiCollection for index, at least for now

* testy testy

* getting closer

* update sourmash

* mark failing tests

* upd

* cargo fmt

* MRG: test exit from `pairwise` and `multisearch` if no loaded sketches (#437)

* upd

* check for appropriate multisearch error exits

* add more tests for pairwise, too

* cargo fmt

* MRG: switch to more efficient use of `Collection` by removing cloning (#438)

* remove unnecessary clones by switch to references in SmallSignature

* switch away from references for collections => avoid clones

* remove MultiCollection::iter

* MRG: add tests for RocksDB/RevIndex, standalone manifests, and flexible pathlists (#436)

* test using rocksdb as source of sketches

* test file lists of zips

* cargo fmt

* hackity hack hack a picklist

* ok that makes more sense

* it works

* comments around future par_iter

* support loading from a .sig.gz for index

* test pairwise loading from rocksdb

* add test for queries from Rocksdb

* decide not to implement lists of manifests :)

* reenable and fix test_fastgather.py::test_indexed_against

* impl Deref for MultiCollection

* clippy

* switch to using load_sketches method

* deref doesn't actually make sense for MultiCollection

* update to latest sourmash code

* update to latest sourmash code

* simplify

* update to latest sourmash code

* remove unnecessary flag

* MRG: support & test loading of standalone manifests within pathlists (#450)

* use recursion to load paths into a MultiCollection => mf support

* MRG: clean up index to use `MultiCollection` (#451)

* try making index work with standard code

* kinda working

* fmt

* refactor

* clear up the tests

* refactor/clean up

* cargo fmt

* add tests for index warning & error

* comment

* MRG: documentation updates based on new collection loading (#444)

* update docs for #430

* upd documentation

* upd

* Update src/lib.rs

Co-authored-by: Tessa Pierce Ward <[email protected]>

* switch unwrap to expect

* move unwrap to expect

* minor cleanup

* cargo fmt

* provide legacy method to avoid xfail on index loading

* switch to using reference

* update docs to reflect pathlist behavior

* test recursive nature of MultiCollection

* re-enable test that is now passing

* update to latest sourmash

* upd sourmash

* update sourmash

* mut MultiCollection

* cleanup

* update after merge of sourmash-bio/sourmash#3305

* fix contains_revindex

* add trace commands for tracing loading

* use released version of sourmash

* add support for ignoring abundance

* cargo fmt

* avoid downsampling until we know there is overlap

* change downsample to true; add panic assertion

* move downsampling side guard

* eliminate redundant overlap check

* move calc_abund_stats

* extract abundance code into own function; avoid downsampling if poss

* cleanup

* fmt

* update to next sourmash release

* cargo fmt

* upd sourmash

* correct numbers

* upd sourmash

* upd sourmash

* upd sourmash

* upd sourmash

* use new try_into() and eliminate several clone()s

* refactor a bit more

* use new try_into() in manysearch; flag clones

* avoid a few more clones

* eliminate more clone

* fix mismatched clauses

* note minhash

* fix mastiff_manygather

* avoid more clone

* resolve comments

* microchange

* microchange 2

* eliminate more clone: fastgather

* avoid more clone: fastmultigather

* refactor to avoid more clones

* rm one more clone

* cleanup

* cargo fmt

* cargo fmt

* deallocate collection?

* deallocate collection?

* upd sourmash

* cargo fmt

* fix merge foo

* try out new sourmash PR

* upd latest sourmash branch

* upd sourmash

---------

Co-authored-by: Tessa Pierce Ward <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants