Skip to content

Commit

Permalink
Feat: on-disk RevIndex based on RocksDB (#2230)
Browse files Browse the repository at this point in the history
On-disk RevIndex based on RocksDB, initially implemented in
https://github.com/luizirber/2022-06-26-rocksdb-eval
This is the index/core data structure backing
https://mastiff.sourmash.bio

There are many changes in the Rust code, so bumping the version to
`0.12.0`.

This is mostly not exposed thru the FFI yet. Tests from the from the
in-memory `RevIndex` (greyhound) from #1238 were kept working, but it is
not well supported (doesn't allow saving/loading from disk, for
example), and should be wholly replaced by
`sourmash::index::revindex::disk_revindex` (the on-disk RevIndex) in the
future.
It is confusing to have these different RevIndex impls in Rust, and I
started converging them, but the work is not completely done yet.

#2727 is a better starting point for how `Index` abc/trait should work
acrosss Python/Rust, and I started moving the Rust indices to start from
a `LinearIndex` and later specialize into a `RevIndex`, which will make
easier to translate the work from #2727 for future indices across FFI.

A couple of new concepts introduced in this PR:

- a `Collection` is a `Manifest` + `Storage`. So a zip file like the
ones for GTDB databases fit this easily (storage = `ZipStorage`,
manifest is read from the zipfile), but a file paths list does too
(manifest built from the file paths, storage = `FSStorage`). This goes
in a bit of different direction than #1901, which was extending
`Storage` to support more functionality. I think `Storage` should stay
pretty bare and mostly deal with loading/saving data, but not much
knowledge of **what** data is there (this is covered with `Manifest`).

- a `CollectionSet` is a consistent collection of signatures. Consistent
here means: same k-size, downsample-compatible for scaled, same moltype.
You can create a `CollectionSet` by running `.select()` on a
`Collection`. `CollectionSet` is required for building indices (because
we shouldn't be building indices mixing k-size/moltype), and greatly
simplifies the logic in many places because it is not necessary to check
for compatibility.

- `LinearIndex` was rewritten based on `Collection` (and also the
`greyhound`/`branchwater` parallelism), and this supports the "parallel
search without an index" use case. There is no index construction per se
here, pretty much just a thin layer on top of `Collection` implementing
functionality expected from indices.

- `Manifest`, `Selection`, and `Picklist` are still incomplete, but the
relevant function definitions are in place, need to barrage it with
tests (and potentially exposing to Python and reusing the ones there in
#2726)

## Feature

- Initial implementation for `Manifest`, `Selection`, and `Picklist`
following
  the Python API.
- `Collection` is a new abstraction for working with a set of
signatures. A
collection needs a `Storage` for holding the signatures (on-disk,
in-memory,
or remotely), and a `Manifest` to describe the metadata for each
signature.
- Expose CSV parsing and RocksDB errors.
- New module `sourmash::index::revindex::disk_revindex` with the on-disk
  RevIndex implementation based on RocksDB.
- Add `iter` and `iter_mut` methods for `Signature`.
- Add `load_sig` and `save_sig` methods to `Storage` trait for
higher-level data
  manipulation and caching.
- Add `spec` method to `Storage` to allow constructing a concrete
`Storage` from
  a string description.
- Add `InnerStorage` for synchronizing parallel access to `Storage`
  implementations.
- Add `MemStorage` for keeping signatures in-memory (mostly for
debugging and
  testing).

## Refactor

- Rename `HashFunctions` variants to follow camel-case, so
`Murmur64Protein` instead of `murmur64_protein`
- `LinearIndex` is now implemented as a thin layer on top of
`Collection`.
- Move `GatherResult` to `sourmash::index` module.
- Move `sourmash::index::revindex` to `sourmash::index::mem_revindex`
(this is
the Greyhound version of revindex, in-memory only). It was also
refactored
internally to build a version of a `LinearIndex` that will be merged in
the
  future with `sourmash::index::LinearIndex`
- Move `select` method from `Index` trait into a separate `Select`
trait,
  and implement it for `Signature` based on the new `Selection` API.
- Move `SigStore` into `sourmash::storage` module, and remove the
generic. Now
  it always stores `Signature`. Also implement `Select` for it.

## Build

- Add new `branchwater` feature (enabled by default), which can be
disabled by downstream projects to limit bringing heavy dependencies
like rocksdb
- Add new `rkyv` feature (disabled by default), making `MinHash`
serializable
  with the `rkyv` crate.
- Add semver checks for CI (so we bump versions accordingly, or avoid
breaking changes)
- Reduce features combinations on Rust checks (takes much less time to
run)
- Disable `musllinux` wheels (need to figure out how to build rocksdb
for it)

---------

Co-authored-by: Tessa Pierce Ward <[email protected]>
Co-authored-by: C. Titus Brown <[email protected]>
  • Loading branch information
3 people authored Dec 1, 2023
1 parent d6994a1 commit 22ec122
Show file tree
Hide file tree
Showing 45 changed files with 3,967 additions and 1,419 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/dev_envs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -57,7 +57,7 @@ jobs:

- name: install dependencies
shell: bash -l {0}
run: mamba install 'tox>=3.27,<4' tox-conda rust git compilers pandoc
run: mamba install 'tox>=3.27,<4' tox-conda rust git compilers pandoc libstdcxx-ng

- name: run tests for 3.10
shell: bash -l {0}
Expand Down
6 changes: 6 additions & 0 deletions .github/workflows/rust.yml
Original file line number Diff line number Diff line change
Expand Up @@ -229,6 +229,12 @@ jobs:
toolchain: stable
override: true

- name: Check semver
uses: obi1kenobi/cargo-semver-checks-action@v2
with:
crate-name: sourmash
version-tag-prefix: r

- name: Make sure we can publish the sourmash crate
uses: actions-rs/cargo@v1
with:
Expand Down
4 changes: 4 additions & 0 deletions .readthedocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,10 @@ build:
tools:
python: "3.10"
rust: "1.64"
apt_packages:
- llvm-dev
- libclang-dev
- clang

# Build documentation in the docs/ directory with Sphinx
sphinx:
Expand Down
Loading

0 comments on commit 22ec122

Please sign in to comment.