From 22ec12205af61dfed77ab807bb19d6bcd325f354 Mon Sep 17 00:00:00 2001
From: Luiz Irber
Date: Thu, 30 Nov 2023 16:56:18 -0800
Subject: [PATCH] Feat: on-disk RevIndex based on RocksDB (#2230)

On-disk RevIndex based on RocksDB, initially implemented in
https://github.com/luizirber/2022-06-26-rocksdb-eval

This is the index/core data structure backing https://mastiff.sourmash.bio

There are many changes in the Rust code, so bumping the version to `0.12.0`.
This is mostly not exposed through the FFI yet.

Tests from the in-memory `RevIndex` (greyhound) from #1238 were kept working,
but that implementation is not well supported (it doesn't allow saving/loading
from disk, for example) and should be wholly replaced by
`sourmash::index::revindex::disk_revindex` (the on-disk RevIndex) in the
future. It is confusing to have these different RevIndex impls in Rust, and I
started converging them, but the work is not completely done yet. #2727 is a
better starting point for how the `Index` abc/trait should work across
Python/Rust, and I started moving the Rust indices to start from a
`LinearIndex` and later specialize into a `RevIndex`, which will make it
easier to translate the work from #2727 for future indices across the FFI.

A couple of new concepts introduced in this PR (a usage sketch follows this
list):

- a `Collection` is a `Manifest` + `Storage`. So a zip file like the ones for
  GTDB databases fits this easily (storage = `ZipStorage`, manifest is read
  from the zipfile), but a list of file paths does too (manifest built from
  the file paths, storage = `FSStorage`). This goes in a bit of a different
  direction than #1901, which was extending `Storage` to support more
  functionality. I think `Storage` should stay pretty bare and mostly deal
  with loading/saving data, without much knowledge of **what** data is there
  (that is covered by `Manifest`).
- a `CollectionSet` is a consistent collection of signatures. Consistent here
  means: same k-size, downsample-compatible for scaled, same moltype. You can
  create a `CollectionSet` by running `.select()` on a `Collection`.
  `CollectionSet` is required for building indices (because we shouldn't be
  building indices mixing k-size/moltype), and it greatly simplifies the
  logic in many places because it is not necessary to check for
  compatibility.
- `LinearIndex` was rewritten based on `Collection` (and also the
  `greyhound`/`branchwater` parallelism), and it supports the "parallel
  search without an index" use case. There is no index construction per se
  here; it is pretty much just a thin layer on top of `Collection`
  implementing the functionality expected from indices.
- `Manifest`, `Selection`, and `Picklist` are still incomplete, but the
  relevant function definitions are in place; they need to be barraged with
  tests (and potentially exposed to Python, reusing the tests there, in
  #2726).
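To make the new pieces concrete, here is a minimal sketch of the intended
flow, going from a zip file to a `LinearIndex`. It is based on the APIs added
in this PR (`Collection::from_zipfile`, `Select::select`,
`TryFrom<Collection> for CollectionSet`, `LinearIndex::from_collection`), but
the module paths and the exact `Selection` builder fields are assumptions, so
treat it as illustrative rather than a stable API:

```rust
use sourmash::collection::{Collection, CollectionSet};
use sourmash::index::linear::LinearIndex;
use sourmash::prelude::*; // brings the `Select` trait into scope
use sourmash::selection::Selection;

fn linear_index_from_zip(zipfile: &str) -> sourmash::Result<LinearIndex> {
    // A Collection is a Manifest + Storage: here the storage is a
    // ZipStorage, and the manifest is read from the standard location
    // inside the zipfile.
    let collection = Collection::from_zipfile(zipfile)?;

    // Narrow the collection to a consistent view: one ksize, one moltype,
    // downsample-compatible scaled (num = 0 selects scaled signatures).
    let selection = Selection::builder().ksize(31).num(0).scaled(1000).build();
    let collection = collection.select(&selection)?;

    // Converting into a CollectionSet verifies that all records are
    // compatible; indices are built from a CollectionSet, not from a bare
    // Collection, so the compatibility checks happen exactly once.
    let cset: CollectionSet = collection.try_into()?;

    // LinearIndex is a thin layer over the CollectionSet: no index
    // construction per se, just (optionally parallel) search.
    Ok(LinearIndex::from_collection(cset))
}
```

On the build side, a downstream crate that wants to skip the heavy
RocksDB/rkyv dependencies can leave the `branchwater` feature off (see the
Build section below); a hypothetical downstream `Cargo.toml` fragment:

```toml
[dependencies]
# rocksdb and rkyv stay out of the build unless `branchwater` is enabled;
# `parallel` keeps the rayon-based parallelism.
sourmash = { version = "0.12.0", default-features = false, features = ["parallel"] }
```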
## Feature

- Initial implementation for `Manifest`, `Selection`, and `Picklist`
  following the Python API.
- `Collection` is a new abstraction for working with a set of signatures. A
  collection needs a `Storage` for holding the signatures (on-disk,
  in-memory, or remotely), and a `Manifest` to describe the metadata for
  each signature.
- Expose CSV parsing and RocksDB errors.
- New module `sourmash::index::revindex::disk_revindex` with the on-disk
  RevIndex implementation based on RocksDB.
- Add `iter` and `iter_mut` methods for `Signature`.
- Add `load_sig` and `save_sig` methods to the `Storage` trait for
  higher-level data manipulation and caching.
- Add a `spec` method to `Storage` to allow constructing a concrete
  `Storage` from a string description.
- Add `InnerStorage` for synchronizing parallel access to `Storage`
  implementations.
- Add `MemStorage` for keeping signatures in-memory (mostly for debugging
  and testing).

## Refactor

- Rename `HashFunctions` variants to follow camel-case, so `Murmur64Protein`
  instead of `murmur64_protein`.
- `LinearIndex` is now implemented as a thin layer on top of `Collection`.
- Move `GatherResult` to the `sourmash::index` module.
- Move `sourmash::index::revindex` to `sourmash::index::mem_revindex` (this
  is the Greyhound version of revindex, in-memory only). It was also
  refactored internally to build a version of a `LinearIndex` that will be
  merged in the future with `sourmash::index::LinearIndex`.
- Move the `select` method from the `Index` trait into a separate `Select`
  trait, and implement it for `Signature` based on the new `Selection` API.
- Move `SigStore` into the `sourmash::storage` module, and remove the
  generic. Now it always stores `Signature`. Also implement `Select` for it.

## Build

- Add a new `branchwater` feature (enabled by default), which can be
  disabled by downstream projects to avoid bringing in heavy dependencies
  like rocksdb.
- Add a new `rkyv` feature (disabled by default), making `MinHash`
  serializable with the `rkyv` crate.
- Add semver checks to CI (so we bump versions accordingly, or avoid
  breaking changes).
- Reduce feature combinations on Rust checks (takes much less time to run).
- Disable `musllinux` wheels (need to figure out how to build rocksdb for
  them).

---------

Co-authored-by: Tessa Pierce Ward
Co-authored-by: C. Titus Brown
---
 .github/workflows/dev_envs.yml | 2 +-
 .github/workflows/rust.yml | 6 +
 .readthedocs.yml | 4 +
 Cargo.lock | 426 +++++++++--
 Makefile | 4 +-
 deny.toml | 1 +
 doc/developer.md | 2 +-
 flake.nix | 4 +
 include/sourmash.h | 2 +
 pyproject.toml | 16 +-
 src/core/CHANGELOG.md | 74 +-
 src/core/Cargo.toml | 38 +-
 src/core/build.rs | 6 +-
 src/core/cbindgen.toml | 2 +-
 src/core/src/cmd.rs | 8 +-
 src/core/src/collection.rs | 190 +++++
 src/core/src/encodings.rs | 66 +-
 src/core/src/errors.rs | 15 +
 src/core/src/ffi/index/mod.rs | 2 +
 src/core/src/ffi/index/revindex.rs | 39 +-
 src/core/src/ffi/minhash.rs | 9 +-
 src/core/src/ffi/mod.rs | 39 +-
 src/core/src/ffi/storage.rs | 9 +-
 src/core/src/from.rs | 19 +-
 src/core/src/index/linear.rs | 458 +++++++-----
 src/core/src/index/mod.rs | 305 ++------
 src/core/src/index/revindex.rs | 699 -------------------
 src/core/src/index/revindex/disk_revindex.rs | 513 ++++++++++++++
 src/core/src/index/revindex/mem_revindex.rs | 461 ++++++++++++
 src/core/src/index/revindex/mod.rs | 590 ++++++++++++++++
 src/core/src/lib.rs | 6 +-
 src/core/src/manifest.rs | 279 ++++++++
 src/core/src/prelude.rs | 11 +-
 src/core/src/selection.rs | 133 ++++
 src/core/src/signature.rs | 139 +++-
 src/core/src/sketch/hyperloglog/mod.rs | 18 +-
 src/core/src/sketch/minhash.rs | 72 +-
 src/core/src/sketch/mod.rs | 4 +
 src/core/src/sketch/nodegraph.rs | 11 +-
 src/core/src/storage.rs | 448 ++++++++++--
 src/core/src/wasm.rs | 8 +-
 src/core/tests/minhash.rs | 110 +--
 src/core/tests/storage.rs | 127 +++-
 src/sourmash/sbt_storage.py | 4 +-
 tox.ini | 7 +-
 45 files changed, 3967 insertions(+), 1419 deletions(-)
 create mode 100644 src/core/src/collection.rs
 delete mode 100644 src/core/src/index/revindex.rs
 create mode 100644 src/core/src/index/revindex/disk_revindex.rs
 create mode 100644 src/core/src/index/revindex/mem_revindex.rs
 create mode 100644 src/core/src/index/revindex/mod.rs
 create mode 100644 src/core/src/manifest.rs
 create mode 100644 src/core/src/selection.rs

diff --git
a/.github/workflows/dev_envs.yml b/.github/workflows/dev_envs.yml index a2eab66ba8..8b5a1503d6 100644 --- a/.github/workflows/dev_envs.yml +++ b/.github/workflows/dev_envs.yml @@ -57,7 +57,7 @@ jobs: - name: install dependencies shell: bash -l {0} - run: mamba install 'tox>=3.27,<4' tox-conda rust git compilers pandoc + run: mamba install 'tox>=3.27,<4' tox-conda rust git compilers pandoc libstdcxx-ng - name: run tests for 3.10 shell: bash -l {0} diff --git a/.github/workflows/rust.yml b/.github/workflows/rust.yml index c417f81793..998210e454 100644 --- a/.github/workflows/rust.yml +++ b/.github/workflows/rust.yml @@ -229,6 +229,12 @@ jobs: toolchain: stable override: true + - name: Check semver + uses: obi1kenobi/cargo-semver-checks-action@v2 + with: + crate-name: sourmash + version-tag-prefix: r + - name: Make sure we can publish the sourmash crate uses: actions-rs/cargo@v1 with: diff --git a/.readthedocs.yml b/.readthedocs.yml index 5b33921869..5479606af7 100644 --- a/.readthedocs.yml +++ b/.readthedocs.yml @@ -9,6 +9,10 @@ build: tools: python: "3.10" rust: "1.64" + apt_packages: + - llvm-dev + - libclang-dev + - clang # Build documentation in the docs/ directory with Sphinx sphinx: diff --git a/Cargo.lock b/Cargo.lock index d476ee825e..dc7b40a938 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -8,6 +8,17 @@ version = "1.0.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "f26201604c87b1e01bd3d98f8d5d9a8fcbb815e8cedb41ffccbeb4bf593a35fe" +[[package]] +name = "ahash" +version = "0.7.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fcb51a0695d8f838b1ee009b3fbf66bda078cd64590202a864a8f3e8c4315c47" +dependencies = [ + "getrandom", + "once_cell", + "version_check", +] + [[package]] name = "aliasable" version = "0.1.3" @@ -41,12 +52,6 @@ version = "1.0.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "41ed9a86bf92ae6580e0a31281f65a1b1d867c0cc68d5346e2ae128dddfa6a7d" -[[package]] -name = "assert_matches" -version = "1.5.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9b34d609dfbaf33d6889b2b7106d3ca345eacad44200913df5ba02bfd31d2ba9" - [[package]] name = "autocfg" version = "1.1.0" @@ -59,6 +64,12 @@ version = "1.2.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "7b7e4c2464d97fe331d41de9d5db0def0a96f4d823b8b32a2efd503578988973" +[[package]] +name = "binary-merge" +version = "0.1.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "597bb81c80a54b6a4381b23faba8d7774b144c94cbd1d6fe3f1329bd776554ab" + [[package]] name = "bincode" version = "1.3.3" @@ -68,6 +79,27 @@ dependencies = [ "serde", ] +[[package]] +name = "bindgen" +version = "0.65.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "cfdf7b466f9a4903edc73f95d6d2bcd5baf8ae620638762244d3f60143643cc5" +dependencies = [ + "bitflags 1.3.2", + "cexpr", + "clang-sys", + "lazy_static", + "lazycell", + "peeking_take_while", + "prettyplease", + "proc-macro2", + "quote", + "regex", + "rustc-hash", + "shlex", + "syn 2.0.23", +] + [[package]] name = "bitflags" version = "1.3.2" @@ -80,18 +112,6 @@ version = "2.4.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "327762f6e5a765692301e5bb513e0d9fef63be86bbc14528052b1cd3e6f03e07" -[[package]] -name = "bstr" -version = "0.2.17" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ba3569f383e8f1598449f1a423e72e99569137b47740b1da11ef19af3d5c3223" 
-dependencies = [ - "lazy_static", - "memchr", - "regex-automata", - "serde", -] - [[package]] name = "buffer-redux" version = "1.0.0" @@ -108,17 +128,44 @@ version = "3.12.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "0d261e256854913907f67ed06efbc3338dfe6179796deefc1ff763fc1aee5535" +[[package]] +name = "bytecheck" +version = "0.6.8" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3a31f923c2db9513e4298b72df143e6e655a759b3d6a0966df18f81223fff54f" +dependencies = [ + "bytecheck_derive", + "ptr_meta", +] + +[[package]] +name = "bytecheck_derive" +version = "0.6.8" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "edb17c862a905d912174daa27ae002326fff56dc8b8ada50a0a5f0976cb174f0" +dependencies = [ + "proc-macro2", + "quote", + "syn 1.0.104", +] + [[package]] name = "bytecount" version = "0.6.7" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e1e5f035d16fc623ae5f74981db80a439803888314e3a555fd6f04acd51a3205" +[[package]] +name = "bytemuck" +version = "1.12.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2f5715e491b5a1598fc2bef5a606847b5dc1d48ea625bd3c02c00de8285591da" + [[package]] name = "byteorder" -version = "1.5.0" +version = "1.4.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1fd0f2584146f6f2ef48085050886acf353beff7305ebd1ae69500e27c67f64b" +checksum = "14c189c53d098945499cdfa7ecc63567cf3886b3332b312a5b4585d8d3a6a610" [[package]] name = "bzip2" @@ -146,6 +193,9 @@ name = "camino" version = "1.1.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c59e92b5a388f549b863a7bea62612c09f24c8393560709a54558a9abdfb3b9c" +dependencies = [ + "serde", +] [[package]] name = "capnp" @@ -164,6 +214,18 @@ name = "cc" version = "1.0.73" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "2fff2a6927b3bb87f9595d67196a70493f627687a71d87a0d692242c33f58c11" +dependencies = [ + "jobserver", +] + +[[package]] +name = "cexpr" +version = "0.6.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6fac387a98bb7c37292057cffc56d62ecb629900026402633ae9160df93a8766" +dependencies = [ + "nom", +] [[package]] name = "cfg-if" @@ -212,6 +274,17 @@ dependencies = [ "half", ] +[[package]] +name = "clang-sys" +version = "1.3.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5a050e2153c5be08febd6734e29298e844fdb0fa21aeddd63b4eb7baa106c69b" +dependencies = [ + "glob", + "libc", + "libloading", +] + [[package]] name = "clap" version = "4.3.0" @@ -363,13 +436,12 @@ dependencies = [ [[package]] name = "csv" -version = "1.1.6" +version = "1.2.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "22813a6dc45b335f9bade10bf7271dc477e81113e89eb251a0bc2a8a81c536e1" +checksum = "af91f40b7355f82b0a891f50e70399475945bb0b0da4f1700ce60761c9d3e359" dependencies = [ - "bstr", "csv-core", - "itoa 0.4.8", + "itoa", "ryu", "serde", ] @@ -385,9 +457,9 @@ dependencies = [ [[package]] name = "cxx" -version = "1.0.85" +version = "1.0.91" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5add3fc1717409d029b20c5b6903fc0c0b02fa6741d820054f4a2efa5e5816fd" +checksum = "86d3488e7665a7a483b57e25bdd90d0aeb2bc7608c8d0346acf2ad3f1caf1d62" dependencies = [ "cc", "cxxbridge-flags", @@ -397,9 +469,9 @@ dependencies = [ [[package]] name = "cxx-build" -version = "1.0.85" +version = "1.0.91" source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "b4c87959ba14bc6fbc61df77c3fcfe180fc32b93538c4f1031dd802ccb5f2ff0" +checksum = "48fcaf066a053a41a81dfb14d57d99738b767febb8b735c3016e469fac5da690" dependencies = [ "cc", "codespan-reporting", @@ -412,15 +484,15 @@ dependencies = [ [[package]] name = "cxxbridge-flags" -version = "1.0.85" +version = "1.0.91" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "69a3e162fde4e594ed2b07d0f83c6c67b745e7f28ce58c6df5e6b6bef99dfb59" +checksum = "a2ef98b8b717a829ca5603af80e1f9e2e48013ab227b68ef37872ef84ee479bf" [[package]] name = "cxxbridge-macro" -version = "1.0.85" +version = "1.0.91" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3e7e2adeb6a0d4a282e581096b06e1791532b7d576dcde5ccd9382acf55db8e6" +checksum = "086c685979a698443656e5cf7856c95c642295a38599f12fb1ff76fb28d19892" dependencies = [ "proc-macro2", "quote", @@ -433,6 +505,18 @@ version = "1.6.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e78d4f1cc4ae33bbfc157ed5d5a5ef3bc29227303d595861deb238fcec4e9457" +[[package]] +name = "enum_dispatch" +version = "0.3.12" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8f33313078bb8d4d05a2733a94ac4c2d8a0df9a2b84424ebf4f33bfc224a890e" +dependencies = [ + "once_cell", + "proc-macro2", + "quote", + "syn 2.0.23", +] + [[package]] name = "errno" version = "0.3.1" @@ -520,12 +604,27 @@ dependencies = [ "syn 1.0.104", ] +[[package]] +name = "glob" +version = "0.3.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9b919933a397b79c37e33b77bb2aa3dc8eb6e165ad809e58ff75bc7db2e34574" + [[package]] name = "half" version = "1.8.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "eabb4a44450da02c90444cf74558da904edde8fb4e9035a9a6a4e15445af0bd7" +[[package]] +name = "hashbrown" +version = "0.12.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "db0d4cf898abf0081f964436dc980e96670a0f36863e4b83aaacdb65c9d7ccc3" +dependencies = [ + "ahash", +] + [[package]] name = "heck" version = "0.4.1" @@ -538,6 +637,12 @@ version = "0.3.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "443144c8cdadd93ebf52ddb4056d257f5b52c04d3c804e657d19eb73fc33668b" +[[package]] +name = "histogram" +version = "0.6.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "12cb882ccb290b8646e554b157ab0b71e64e8d5bef775cd66b6531e52d302669" + [[package]] name = "iana-time-zone" version = "0.1.53" @@ -562,6 +667,15 @@ dependencies = [ "cxx-build", ] +[[package]] +name = "inplace-vec-builder" +version = "0.1.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "cf64c2edc8226891a71f127587a2861b132d2b942310843814d5001d99a1d307" +dependencies = [ + "smallvec", +] + [[package]] name = "io-lifetimes" version = "1.0.11" @@ -605,15 +719,18 @@ dependencies = [ [[package]] name = "itoa" -version = "0.4.8" +version = "1.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b71991ff56294aa922b450139ee08b3bfc70982c6b2c7562771375cf73542dd4" +checksum = "1aab8fc367588b89dcee83ab0fd66b72b50b72fa1904d7095045ace2b0c81c35" [[package]] -name = "itoa" -version = "1.0.1" +name = "jobserver" +version = "0.1.24" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1aab8fc367588b89dcee83ab0fd66b72b50b72fa1904d7095045ace2b0c81c35" +checksum = 
"af25a77299a7f711a01975c35a6a424eb6862092cc2d6c72c4ed6cbc56dfc1fa" +dependencies = [ + "libc", +] [[package]] name = "js-sys" @@ -630,18 +747,61 @@ version = "1.4.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e2abad23fbc42b3700f2f279844dc832adb2b2eb069b2df918f455c4e18cc646" +[[package]] +name = "lazycell" +version = "1.3.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "830d08ce1d1d941e6b30645f1a0eb5643013d835ce3779a5fc208261dbe10f55" + [[package]] name = "libc" version = "0.2.149" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "a08173bc88b7955d1b3145aa561539096c421ac8debde8cbc3612ec635fee29b" +[[package]] +name = "libloading" +version = "0.7.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b67380fd3b2fbe7527a606e18729d21c6f3951633d0500574c4dc22d2d638b9f" +dependencies = [ + "cfg-if", + "winapi", +] + [[package]] name = "libm" version = "0.2.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "348108ab3fba42ec82ff6e9564fc4ca0247bdccdc68dd8af9764bbc79c3c8ffb" +[[package]] +name = "librocksdb-sys" +version = "0.11.0+8.1.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d3386f101bcb4bd252d8e9d2fb41ec3b0862a15a62b478c355b2982efa469e3e" +dependencies = [ + "bindgen", + "bzip2-sys", + "cc", + "glob", + "libc", + "libz-sys", + "lz4-sys", + "zstd-sys", +] + +[[package]] +name = "libz-sys" +version = "1.1.8" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9702761c3935f8cc2f101793272e202c72b99da8f4224a19ddcf1279a6450bbf" +dependencies = [ + "cc", + "pkg-config", + "vcpkg", +] + [[package]] name = "link-cplusplus" version = "1.0.8" @@ -669,6 +829,16 @@ version = "0.4.20" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b5e6163cb8c49088c2c36f57875e58ccd8c87c7427f7fbd50ea6710b2f3f2e8f" +[[package]] +name = "lz4-sys" +version = "1.9.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "57d27b317e207b10f69f5e75494119e391a96f48861ae870d1da6edac98ca900" +dependencies = [ + "cc", + "libc", +] + [[package]] name = "lzma-sys" version = "0.1.17" @@ -720,6 +890,12 @@ dependencies = [ "autocfg", ] +[[package]] +name = "minimal-lexical" +version = "0.2.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "68354c5c6bd36d73ff3feceb05efa59b6acb7626617f4962be322a825e61f79a" + [[package]] name = "miniz_oxide" version = "0.4.4" @@ -752,9 +928,9 @@ dependencies = [ [[package]] name = "niffler" -version = "2.4.0" +version = "2.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "68c7ffd42bdba05fc9fbfda31283d44c5c8a88fed1a191f68795dba23cc8204b" +checksum = "470dd05a938a5ad42c2cb80ceea4255e275990ee530b86ca164e6d8a19fa407f" dependencies = [ "cfg-if", "flate2", @@ -767,6 +943,16 @@ version = "0.2.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "2bf50223579dc7cdcfb3bfcacf7069ff68243f8c363f62ffa99cf000a6b9c451" +[[package]] +name = "nom" +version = "7.1.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d273983c5a657a70a3e8f2a01329822f3b8c8172b73826411a55751e404a0a4a" +dependencies = [ + "memchr", + "minimal-lexical", +] + [[package]] name = "num-integer" version = "0.1.44" @@ -835,6 +1021,12 @@ dependencies = [ "syn 2.0.23", ] +[[package]] +name = "peeking_take_while" +version = "0.1.2" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "19b17cddbe7ec3f8bc800887bab5e717348c95ea2ca0b1bf0837fb964dc67099" + [[package]] name = "piz" version = "0.5.1" @@ -891,6 +1083,16 @@ version = "0.2.16" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "eb9f9e6e233e5c4a35559a617bf40a4ec447db2e84c20b55a6f83167b7e57872" +[[package]] +name = "prettyplease" +version = "0.2.12" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6c64d9ba0963cdcea2e1b2230fbae2bab30eb25a174be395c41e764bfb65dd62" +dependencies = [ + "proc-macro2", + "syn 2.0.23", +] + [[package]] name = "primal-check" version = "0.3.3" @@ -949,6 +1151,26 @@ dependencies = [ "unarray", ] +[[package]] +name = "ptr_meta" +version = "0.1.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0738ccf7ea06b608c10564b31debd4f5bc5e197fc8bfe088f68ae5ce81e7a4f1" +dependencies = [ + "ptr_meta_derive", +] + +[[package]] +name = "ptr_meta_derive" +version = "0.1.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "16b845dbfca988fa33db069c0e230574d15a3088f147a87b64c7589eb662c9ac" +dependencies = [ + "proc-macro2", + "quote", + "syn 1.0.104", +] + [[package]] name = "quote" version = "1.0.29" @@ -1035,12 +1257,6 @@ dependencies = [ "regex-syntax 0.6.26", ] -[[package]] -name = "regex-automata" -version = "0.1.10" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6c230d73fb8d8c1b9c0b3135c5142a8acee3a0558fb8db5cf1cb65f8d7862132" - [[package]] name = "regex-syntax" version = "0.6.26" @@ -1053,6 +1269,73 @@ version = "0.7.5" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "dbb5fb1acd8a1a18b3dd5be62d25485eb770e05afb408a9627d14d451bae12da" +[[package]] +name = "rend" +version = "0.4.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "581008d2099240d37fb08d77ad713bcaec2c4d89d50b5b21a8bb1996bbab68ab" +dependencies = [ + "bytecheck", +] + +[[package]] +name = "retain_mut" +version = "0.1.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8c31b5c4033f8fdde8700e4657be2c497e7288f01515be52168c631e2e4d4086" + +[[package]] +name = "rkyv" +version = "0.7.40" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c30f1d45d9aa61cbc8cd1eb87705470892289bb2d01943e7803b873a57404dc3" +dependencies = [ + "bytecheck", + "hashbrown", + "ptr_meta", + "rend", + "rkyv_derive", + "seahash", +] + +[[package]] +name = "rkyv_derive" +version = "0.7.40" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ff26ed6c7c4dfc2aa9480b86a60e3c7233543a270a680e10758a507c5a4ce476" +dependencies = [ + "proc-macro2", + "quote", + "syn 1.0.104", +] + +[[package]] +name = "roaring" +version = "0.10.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ef0fb5e826a8bde011ecae6a8539dd333884335c57ff0f003fbe27c25bbe8f71" +dependencies = [ + "bytemuck", + "byteorder", + "retain_mut", +] + +[[package]] +name = "rocksdb" +version = "0.21.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "bb6f170a4041d50a0ce04b0d2e14916d6ca863ea2e422689a5b694395d299ffe" +dependencies = [ + "libc", + "librocksdb-sys", +] + +[[package]] +name = "rustc-hash" +version = "1.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "08d43f7aa6b08d49f382cde6a7982047c3426db949b1424bc4b7ec9ae12c6ce2" + [[package]] name = "rustix" version = "0.37.25" @@ 
-1119,6 +1402,12 @@ version = "1.0.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ddccb15bcce173023b3fedd9436f882a0739b8dfb45e4f6b6002bee5929f61b2" +[[package]] +name = "seahash" +version = "4.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1c107b6f4780854c8b126e228ea8869f4d7b71260f962fefb57b996b8959ba6b" + [[package]] name = "serde" version = "1.0.168" @@ -1145,11 +1434,17 @@ version = "1.0.108" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "3d1c7e3eac408d115102c4c24ad393e0821bb3a5df4d506a80f85f7a742a526b" dependencies = [ - "itoa 1.0.1", + "itoa", "ryu", "serde", ] +[[package]] +name = "shlex" +version = "1.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "43b2853a4d09f215c24cc5489c992ce46052d359b5109343cbafbf26bc62f8a3" + [[package]] name = "smallvec" version = "1.8.0" @@ -1164,20 +1459,22 @@ checksum = "9f1341053f34bb13b5e9590afb7d94b48b48d4b87467ec28e3c238693bb553de" [[package]] name = "sourmash" -version = "0.11.0" +version = "0.12.0" dependencies = [ - "assert_matches", "az", - "bytecount", "byteorder", + "camino", "cfg-if", "chrono", "counter", "criterion", + "csv", + "enum_dispatch", "finch", "fixedbitset", "getrandom", "getset", + "histogram", "log", "md5", "memmap2", @@ -1193,6 +1490,9 @@ dependencies = [ "proptest", "rand", "rayon", + "rkyv", + "roaring", + "rocksdb", "serde", "serde_json", "tempfile", @@ -1321,16 +1621,25 @@ checksum = "6ceab39d59e4c9499d4e5a8ee0e2735b891bb7308ac83dfb4e80cad195c9f6f3" [[package]] name = "unicode-width" -version = "0.1.9" +version = "0.1.10" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c0edd1e5b14653f783770bce4a4dabb4a5108a5370a5f5d8cfe8710c361f6c8b" + +[[package]] +name = "vcpkg" +version = "0.2.15" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3ed742d4ea2bd1176e236172c8429aaf54486e7ac098db29ffe6529e0ce50973" +checksum = "accd4ea62f7bb7a82fe23066fb0957d48ef677f6eeb8215f372f52e48bb32426" [[package]] name = "vec-collections" -version = "0.3.6" +version = "0.4.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4f2390c4dc8ae8640c57d067b1a3d40bc05c124cc6bc7394d761b53435d41b76" +checksum = "3c9965c8f2ffed1dbcd16cafe18a009642f540fa22661c6cfd6309ddb02e4982" dependencies = [ + "binary-merge", + "inplace-vec-builder", + "lazy_static", "num-traits", "serde", "smallvec", @@ -1568,3 +1877,14 @@ checksum = "c179869f34fc7c01830d3ce7ea2086bc3a07e0d35289b667d0a8bf910258926c" dependencies = [ "lzma-sys", ] + +[[package]] +name = "zstd-sys" +version = "2.0.7+zstd.1.5.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "94509c3ba2fe55294d752b79842c530ccfab760192521df74a081a78d2b3c7f5" +dependencies = [ + "cc", + "libc", + "pkg-config", +] diff --git a/Makefile b/Makefile index 4c2ef69abb..9b26d91331 100644 --- a/Makefile +++ b/Makefile @@ -25,6 +25,7 @@ doc: .PHONY tox -e docs include/sourmash.h: src/core/src/lib.rs \ + src/core/src/ffi/mod.rs \ src/core/src/ffi/hyperloglog.rs \ src/core/src/ffi/minhash.rs \ src/core/src/ffi/signature.rs \ @@ -32,7 +33,8 @@ include/sourmash.h: src/core/src/lib.rs \ src/core/src/ffi/index/mod.rs \ src/core/src/ffi/index/revindex.rs \ src/core/src/ffi/storage.rs \ - src/core/src/errors.rs + src/core/src/errors.rs \ + src/core/cbindgen.toml cd src/core && \ RUSTC_BOOTSTRAP=1 cbindgen -c cbindgen.toml . 
-o ../../$@ diff --git a/deny.toml b/deny.toml index 29d148d50b..99f3b442c7 100644 --- a/deny.toml +++ b/deny.toml @@ -29,6 +29,7 @@ default = "deny" confidence-threshold = 0.8 exceptions = [ { allow = ["Zlib"], name = "piz", version = "*" }, + { allow = ["ISC"], name = "libloading", version = "*" }, ] [bans] diff --git a/doc/developer.md b/doc/developer.md index eb8466a5c9..218f24cab7 100644 --- a/doc/developer.md +++ b/doc/developer.md @@ -25,7 +25,7 @@ and the [`conda-forge`](https://conda-forge.org/) channel by default). Once `mamba` is installed, run ``` -mamba create -n sourmash_dev 'tox>=3.27,<4' tox-conda rust git compilers pandoc +mamba create -n sourmash_dev 'tox>=3.27,<4' tox-conda rust git compilers pandoc libstdcxx-ng ``` to create an environment called `sourmash_dev` containing the programs needed for development. diff --git a/flake.nix b/flake.nix index 9a4390fd70..ea8839b4a7 100644 --- a/flake.nix +++ b/flake.nix @@ -103,17 +103,21 @@ wasmtime wasm-pack nodejs_20 + #emscripten #py-spy #heaptrack + cargo-all-features cargo-watch cargo-limit cargo-outdated cargo-udeps cargo-deny + cargo-semver-checks nixpkgs-fmt ]; + # Needed for matplotlib LD_LIBRARY_PATH = lib.makeLibraryPath [ pkgs.stdenv.cc.cc.lib ]; # workaround for https://github.com/NixOS/nixpkgs/blob/48dfc9fa97d762bce28cc8372a2dd3805d14c633/doc/languages-frameworks/python.section.md#python-setuppy-bdist_wheel-cannot-create-whl diff --git a/include/sourmash.h b/include/sourmash.h index 6fa7854880..d647378da7 100644 --- a/include/sourmash.h +++ b/include/sourmash.h @@ -42,6 +42,8 @@ enum SourmashErrorCode { SOURMASH_ERROR_CODE_PARSE_INT = 100003, SOURMASH_ERROR_CODE_SERDE_ERROR = 100004, SOURMASH_ERROR_CODE_NIFFLER_ERROR = 100005, + SOURMASH_ERROR_CODE_CSV_ERROR = 100006, + SOURMASH_ERROR_CODE_ROCKS_DB_ERROR = 100007, }; typedef uint32_t SourmashErrorCode; diff --git a/pyproject.toml b/pyproject.toml index 59b4c26ce1..a08d2ae03e 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -144,7 +144,7 @@ include = [ exclude = [ { path = "**/__pycache__/*", format = ["sdist", "wheel"] }, ] -features = ["maturin"] +features = ["maturin", "branchwater"] locked = true module-name = "sourmash._lowlevel" @@ -164,7 +164,7 @@ known_first_party = ["sourmash"] [tool.cibuildwheel] build = "cp310-*" -skip = "*-win32 *-manylinux_i686 *-musllinux_ppc64le *-musllinux_s390x" +skip = "*-win32 *-manylinux_i686 *-musllinux_*" before-all = [ "curl https://sh.rustup.rs -sSf | sh -s -- -y --default-toolchain=stable", "cargo update --dry-run", @@ -178,6 +178,18 @@ build-verbosity = 3 CARGO_REGISTRIES_CRATES_IO_PROTOCOL="sparse" PATH="$HOME/.cargo/bin:$PATH" +[tool.cibuildwheel.linux] +before-all = [ + "curl https://sh.rustup.rs -sSf | sh -s -- -y --default-toolchain=stable", + "cargo update --dry-run", + "if [ -f /etc/system-release ]; then yum -y install centos-release-scl; fi", + "if [ -f /etc/system-release ]; then yum -y install llvm-toolset-7.0; fi", +] +before-build = [ + "if [ -f /etc/system-release ]; then source scl_source enable llvm-toolset-7.0; fi", + "if [ -f /etc/system-release ]; then source scl_source enable devtoolset-10; fi", +] + [tool.cibuildwheel.linux.environment] CARGO_REGISTRIES_CRATES_IO_PROTOCOL="sparse" PATH="$HOME/.cargo/bin:$PATH" diff --git a/src/core/CHANGELOG.md b/src/core/CHANGELOG.md index 3915a62086..d9807e8ebf 100644 --- a/src/core/CHANGELOG.md +++ b/src/core/CHANGELOG.md @@ -5,11 +5,81 @@ All notable changes to this project will be documented in this file. 
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). -## Unreleased +## [0.12.0] - 2023-11-26 + +MSRV: 1.64 Added: -- An inverted index, codename Greyhound (#1238) +- Initial implementation for `Manifest`, `Selection`, and `Picklist` following + the Python API. (#2230) +- `Collection` is a new abstraction for working with a set of signatures. A + collection needs a `Storage` for holding the signatures (on-disk, in-memory, + or remotely), and a `Manifest` to describe the metadata for each signature. (#2230) +- Expose CSV parsing and RocksDB errors. (#2230) +- New module `sourmash::index::revindex::disk_revindex` with the on-disk + RevIndex implementation based on RocksDB. (#2230) +- Add `iter` and `iter_mut` methods for `Signature`. (#2230) +- Add `load_sig` and `save_sig` methods to `Storage` trait for higher-level data + manipulation and caching. (#2230) +- Add `spec` method to `Storage` to allow constructing a concrete `Storage` from + a string description. (#2230) +- Add `InnerStorage` for synchronizing parallel access to `Storage` + implementations. (#2230) +- Add `MemStorage` for keeping signatures in-memory (mostly for debugging and + testing). (#2230) +- Add new `branchwater` feature (enabled by default), which can be disabled by + downstream projects to limit bringing heavy dependencies like rocksdb. (#2230) +- Add new `rkyv` feature (disabled by default), making `MinHash` serializable + with the `rkyv` crate. (#2230) +- Add semver checks for CI (so we bump versions accordingly, or avoid breaking + changes). (#2230) +- Add cargo deny config. (#2724) +- Benchmarks for seq_to_hashes in protein mode. (#1944) +- Oxidize ZipStorage. (#1909) +- Move greyhound-core into sourmash. (#1238) +- add `MinHash.kmers_and_hashes(...)` and `sourmash sig kmers`. (#1695) +- Produce list of hashes from a sequence. (#1653) + +Changed: + +- Rename `HashFunctions` variants to follow camel-case, so `Murmur64Protein` + instead of `murmur64_protein`. (#2230) +- `LinearIndex` is now implemented as a thin layer on top of `Collection`. (#2230) +- Move `GatherResult` to `sourmash::index` module. (#2230) +- Move `sourmash::index::revindex` to `sourmash::index::mem_revindex` (this is + the Greyhound version of revindex, in-memory only). It was also refactored + internally to build a version of a `LinearIndex` that will be merged in the + future with `sourmash::index::LinearIndex`. (#2230) +- Move `select` method from `Index` trait into a separate `Select` trait, + and implement it for `Signature` based on the new `Selection` API. (#2230) +- Move `SigStore` into `sourmash::storage` module, and remove the generic. Now + it always stores `Signature`. Also implement `Select` for it. (#2230) +- Disable `musllinux` wheels (need to figure out how to build rocksdb for it). (#2230) +- Reorganize traits for easier wasm and native compilation. (#1836) +- Adjust dayhoff and hp encodings to tolerate stop codons in the protein sequence. (#1673) + +Fixed: + +- Reduce features combinations on Rust checks (takes much less time to run). (#2230) +- Build: MSRV check for 1.64. (#2680) +- maturin: move deprecated definition from Cargo.toml to pyproject.toml. (#2597) +- Fix broken crates.io badge. (#2556) +- Fix unnecessary typecasts in Rust. (#2366) +- Fix `Signature.minhash` API during `sourmash sketch`. (#2329) +- Return Err for angular_similarity when abundance tracking is off. 
(#2327) +- Update various descriptions to talk about k-mers, not just DNA. (#2137) +- Fix downsample_scaled in `core`. (#2108) +- speed up `SeqToHashes` `translate`. (#1946) +- Speed-up `SeqToHashes()`. (#1938) +- Fix containment calculation for nodegraphs. (#1862) +- Fix panic bug in `sourmash sketch` dna with bad input and `--check-sequence`. (#1702) +- Fix Rust panic in `MinHash.seq_to_hashes`. (#1701) +- Beta lints. (#2841 #2630 #2596 #2298 #1791 #1786 #1760) + +Removed: + +- Remove BIGSI and SBT code. (#2732) ## [0.11.0] - 2021-07-07 diff --git a/src/core/Cargo.toml b/src/core/Cargo.toml index a8a2ddca95..d1f0706a8a 100644 --- a/src/core/Cargo.toml +++ b/src/core/Cargo.toml @@ -1,6 +1,6 @@ [package] name = "sourmash" -version = "0.11.0" +version = "0.12.0" authors = ["Luiz Irber "] description = "MinHash sketches for genomic data" repository = "https://github.com/sourmash-bio/sourmash" @@ -22,38 +22,44 @@ bench = false from-finch = ["finch"] parallel = ["rayon"] maturin = [] +branchwater = ["rocksdb", "rkyv", "parallel"] +default = [] [dependencies] az = "1.0.0" -bytecount = "0.6.7" -byteorder = "1.5.0" +byteorder = "1.4.3" +camino = { version = "1.1.6", features = ["serde1"] } cfg-if = "1.0" counter = "0.5.7" +csv = "1.1.6" +enum_dispatch = "0.3.12" finch = { version = "0.6.0", optional = true } fixedbitset = "0.4.0" getrandom = { version = "0.2", features = ["js"] } getset = "0.1.1" +histogram = "0.6.9" log = "0.4.20" md5 = "0.7.0" +memmap2 = "0.9.0" murmurhash3 = "0.0.5" niffler = { version = "2.3.1", default-features = false, features = [ "gz" ] } nohash-hasher = "0.2.0" num-iter = "0.1.43" once_cell = "1.18.0" +ouroboros = "0.18.0" +piz = "0.5.0" +primal-check = "0.3.1" +rkyv = { version = "0.7.39", optional = true } +roaring = "0.10.0" rayon = { version = "1.8.0", optional = true } serde = { version = "1.0.168", features = ["derive"] } serde_json = "1.0.108" -primal-check = "0.3.1" thiserror = "1.0" -typed-builder = "0.14.0" twox-hash = "1.6.0" -vec-collections = "0.3.4" -piz = "0.5.0" -memmap2 = "0.9.0" -ouroboros = "0.18.0" +typed-builder = "0.14.0" +vec-collections = "0.4.3" [dev-dependencies] -assert_matches = "1.3.0" criterion = "0.5.1" needletail = { version = "0.5.1", default-features = false } proptest = { version = "1.3.1", default-features = false, features = ["std"]} @@ -72,6 +78,13 @@ harness = false name = "minhash" harness = false +[package.metadata.cargo-all-features] +skip_optional_dependencies = true +denylist = ["maturin"] +skip_feature_sets = [ + ["branchwater", "parallel"], # branchwater implies parallel +] + ## Wasm section. 
Crates only used for WASM, as well as specific configurations [target.'cfg(all(target_arch = "wasm32", target_os="unknown"))'.dependencies.wasm-bindgen] @@ -83,11 +96,12 @@ version = "0.3.65" features = ["console", "File"] [target.'cfg(all(target_arch = "wasm32"))'.dependencies.chrono] -version = "0.4.31" +version = "0.4.28" features = ["wasmbind"] [target.'cfg(all(target_arch = "wasm32", target_os="unknown"))'.dev-dependencies] wasm-bindgen-test = "0.3.39" ### These crates don't compile on wasm -[target.'cfg(not(all(target_arch = "wasm32", target_os="unknown")))'.dependencies] +[target.'cfg(not(target_arch = "wasm32"))'.dependencies] +rocksdb = { version = "0.21.0", optional = true } diff --git a/src/core/build.rs b/src/core/build.rs index a22396c25a..f067828d50 100644 --- a/src/core/build.rs +++ b/src/core/build.rs @@ -55,12 +55,12 @@ fn copy_c_bindings(crate_dir: &str) { let new_header: String = header .lines() .filter_map(|s| { - if s.starts_with("#") { + if s.starts_with('#') { None } else { Some({ let mut s = s.to_owned(); - s.push_str("\n"); + s.push('\n'); s }) } @@ -71,5 +71,5 @@ fn copy_c_bindings(crate_dir: &str) { let target_dir = find_target_dir(&out_dir); std::fs::create_dir_all(&target_dir).expect("error creating target dir"); let out_path = target_dir.join("header.h"); - std::fs::write(out_path, &new_header).expect("error writing header"); + std::fs::write(out_path, new_header).expect("error writing header"); } diff --git a/src/core/cbindgen.toml b/src/core/cbindgen.toml index cd6cd781c2..1a0a81af47 100644 --- a/src/core/cbindgen.toml +++ b/src/core/cbindgen.toml @@ -8,7 +8,7 @@ clean = true [parse.expand] crates = ["sourmash"] -features = [] +features = ["branchwater"] [enum] rename_variants = "QualifiedScreamingSnakeCase" diff --git a/src/core/src/cmd.rs b/src/core/src/cmd.rs index 436c2ca7df..a760e0f79d 100644 --- a/src/core/src/cmd.rs +++ b/src/core/src/cmd.rs @@ -119,7 +119,7 @@ pub fn build_template(params: &ComputeParameters) -> Vec { KmerMinHashBTree::builder() .num(params.num_hashes) .ksize(*k) - .hash_function(HashFunctions::murmur64_protein) + .hash_function(HashFunctions::Murmur64Protein) .max_hash(max_hash) .seed(params.seed) .abunds(if params.track_abundance { @@ -136,7 +136,7 @@ pub fn build_template(params: &ComputeParameters) -> Vec { KmerMinHashBTree::builder() .num(params.num_hashes) .ksize(*k) - .hash_function(HashFunctions::murmur64_dayhoff) + .hash_function(HashFunctions::Murmur64Dayhoff) .max_hash(max_hash) .seed(params.seed) .abunds(if params.track_abundance { @@ -153,7 +153,7 @@ pub fn build_template(params: &ComputeParameters) -> Vec { KmerMinHashBTree::builder() .num(params.num_hashes) .ksize(*k) - .hash_function(HashFunctions::murmur64_hp) + .hash_function(HashFunctions::Murmur64Hp) .max_hash(max_hash) .seed(params.seed) .abunds(if params.track_abundance { @@ -170,7 +170,7 @@ pub fn build_template(params: &ComputeParameters) -> Vec { KmerMinHashBTree::builder() .num(params.num_hashes) .ksize(*k) - .hash_function(HashFunctions::murmur64_DNA) + .hash_function(HashFunctions::Murmur64Dna) .max_hash(max_hash) .seed(params.seed) .abunds(if params.track_abundance { diff --git a/src/core/src/collection.rs b/src/core/src/collection.rs new file mode 100644 index 0000000000..c00b2fd288 --- /dev/null +++ b/src/core/src/collection.rs @@ -0,0 +1,190 @@ +use std::ops::{Deref, DerefMut}; + +use camino::Utf8Path as Path; +use camino::Utf8PathBuf as PathBuf; + +use crate::encodings::Idx; +use crate::manifest::{Manifest, Record}; +use crate::prelude::*; +use 
crate::signature::Signature; +use crate::storage::{FSStorage, InnerStorage, MemStorage, SigStore, Storage, ZipStorage}; +use crate::{Error, Result}; + +#[cfg(feature = "parallel")] +use rayon::prelude::*; + +pub struct Collection { + manifest: Manifest, + storage: InnerStorage, +} + +pub struct CollectionSet { + collection: Collection, +} + +impl Deref for CollectionSet { + type Target = Collection; + + fn deref(&self) -> &Self::Target { + &self.collection + } +} + +impl DerefMut for CollectionSet { + fn deref_mut(&mut self) -> &mut Self::Target { + &mut self.collection + } +} + +impl TryFrom for CollectionSet { + type Error = crate::Error; + + fn try_from(collection: Collection) -> Result { + let first = if let Some(first) = collection.manifest.first() { + first + } else { + // empty collection is consistent ¯\_(ツ)_/¯ + return Ok(Self { collection }); + }; + + collection + .manifest + .iter() + .skip(1) + .try_for_each(|c| first.check_compatible(c))?; + + Ok(Self { collection }) + } +} + +impl CollectionSet { + pub fn into_inner(self) -> Collection { + self.collection + } + + pub fn selection(&self) -> Selection { + todo!("Extract selection from first sig") + } +} + +impl Collection { + pub fn new(manifest: Manifest, storage: InnerStorage) -> Self { + Self { manifest, storage } + } + + pub fn iter(&self) -> impl Iterator { + self.manifest.iter().enumerate().map(|(i, r)| (i as Idx, r)) + } + + #[cfg(feature = "parallel")] + pub fn par_iter(&self) -> impl IndexedParallelIterator { + self.manifest + .par_iter() + .enumerate() + .map(|(i, r)| (i as Idx, r)) + } + + pub fn len(&self) -> usize { + self.manifest.len() + } + + pub fn is_empty(&self) -> bool { + self.manifest.len() == 0 + } + + pub fn manifest(&self) -> &Manifest { + &self.manifest + } + + pub fn storage(&self) -> &InnerStorage { + &self.storage + } + + pub fn check_superset(&self, other: &Collection) -> Result { + self.iter() + .zip(other.iter()) + .all(|((id1, rec1), (id2, rec2))| id1 == id2 && rec1 == rec2) + .then(|| self.len()) + // TODO: right error here + .ok_or(Error::MismatchKSizes) + } + + pub fn from_zipfile>(zipfile: P) -> Result { + let storage = ZipStorage::from_file(zipfile)?; + // Load manifest from standard location in zipstorage + let manifest = Manifest::from_reader(storage.load("SOURMASH-MANIFEST.csv")?.as_slice())?; + Ok(Self { + manifest, + storage: InnerStorage::new(storage), + }) + } + + pub fn from_sigs(sigs: Vec) -> Result { + let storage = MemStorage::new(); + + #[cfg(feature = "parallel")] + let iter = sigs.into_par_iter(); + + #[cfg(not(feature = "parallel"))] + let iter = sigs.into_iter(); + + let records: Vec<_> = iter + .enumerate() + .flat_map(|(i, sig)| { + let path = format!("{}", i); + let mut record = Record::from_sig(&sig, &path); + let path = storage.save_sig(&path, sig).expect("Error saving sig"); + record.iter_mut().for_each(|rec| { + rec.set_internal_location(path.clone().into()); + }); + record + }) + .collect(); + + Ok(Self { + manifest: records.into(), + storage: InnerStorage::new(storage), + }) + } + + pub fn from_paths(paths: &[PathBuf]) -> Result { + // TODO: + // - figure out if there is a common path between sigs for FSStorage? 
+ + Ok(Self { + manifest: paths.into(), + storage: InnerStorage::new( + FSStorage::builder() + .fullpath("".into()) + .subdir("".into()) + .build(), + ), + }) + } + + pub fn record_for_dataset(&self, dataset_id: Idx) -> Result<&Record> { + Ok(&self.manifest[dataset_id as usize]) + } + + pub fn sig_for_dataset(&self, dataset_id: Idx) -> Result { + let match_path = if self.manifest.is_empty() { + "" + } else { + self.manifest[dataset_id as usize] + .internal_location() + .as_str() + }; + + let selection = Selection::from_record(&self.manifest[dataset_id as usize])?; + let sig = self.storage.load_sig(match_path)?.select(&selection)?; + assert_eq!(sig.signatures.len(), 1); + Ok(sig) + } +} + +impl Select for Collection { + fn select(mut self, selection: &Selection) -> Result { + self.manifest = self.manifest.select(selection)?; + Ok(self) + } +} diff --git a/src/core/src/encodings.rs b/src/core/src/encodings.rs index 6010cf2f6d..ac69cd58eb 100644 --- a/src/core/src/encodings.rs +++ b/src/core/src/encodings.rs @@ -7,6 +7,7 @@ use std::str; use nohash_hasher::BuildNoHashHasher; use once_cell::sync::Lazy; +use vec_collections::AbstractVecSet; use crate::Error; @@ -17,35 +18,39 @@ use crate::Error; // and a `Slab`. This might be very useful if K is something // heavy such as a `String`. pub type Color = u64; -pub type Idx = u64; -type IdxTracker = (vec_collections::VecSet<[Idx; 4]>, u64); +pub type Idx = u32; +type IdxTracker = (vec_collections::VecSet<[Idx; 8]>, u64); type ColorToIdx = HashMap>; -#[allow(non_camel_case_types)] -#[derive(Debug, Clone, Copy, PartialEq, Eq)] -#[repr(u32)] +#[derive(Debug, Clone, PartialEq, Eq)] +#[cfg_attr( + feature = "rkyv", + derive(rkyv::Serialize, rkyv::Deserialize, rkyv::Archive) +)] +#[non_exhaustive] pub enum HashFunctions { - murmur64_DNA = 1, - murmur64_protein = 2, - murmur64_dayhoff = 3, - murmur64_hp = 4, + Murmur64Dna, + Murmur64Protein, + Murmur64Dayhoff, + Murmur64Hp, + Custom(String), } impl HashFunctions { pub fn dna(&self) -> bool { - *self == HashFunctions::murmur64_DNA + *self == HashFunctions::Murmur64Dna } pub fn protein(&self) -> bool { - *self == HashFunctions::murmur64_protein + *self == HashFunctions::Murmur64Protein } pub fn dayhoff(&self) -> bool { - *self == HashFunctions::murmur64_dayhoff + *self == HashFunctions::Murmur64Dayhoff } pub fn hp(&self) -> bool { - *self == HashFunctions::murmur64_hp + *self == HashFunctions::Murmur64Hp } } @@ -55,10 +60,11 @@ impl std::fmt::Display for HashFunctions { f, "{}", match self { - HashFunctions::murmur64_DNA => "dna", - HashFunctions::murmur64_protein => "protein", - HashFunctions::murmur64_dayhoff => "dayhoff", - HashFunctions::murmur64_hp => "hp", + HashFunctions::Murmur64Dna => "dna", + HashFunctions::Murmur64Protein => "protein", + HashFunctions::Murmur64Dayhoff => "dayhoff", + HashFunctions::Murmur64Hp => "hp", + HashFunctions::Custom(v) => v, } ) } @@ -69,11 +75,11 @@ impl TryFrom<&str> for HashFunctions { fn try_from(moltype: &str) -> Result { match moltype.to_lowercase().as_ref() { - "dna" => Ok(HashFunctions::murmur64_DNA), - "dayhoff" => Ok(HashFunctions::murmur64_dayhoff), - "hp" => Ok(HashFunctions::murmur64_hp), - "protein" => Ok(HashFunctions::murmur64_protein), - _ => unimplemented!(), + "dna" => Ok(HashFunctions::Murmur64Dna), + "dayhoff" => Ok(HashFunctions::Murmur64Dayhoff), + "hp" => Ok(HashFunctions::Murmur64Hp), + "protein" => Ok(HashFunctions::Murmur64Protein), + v => unimplemented!("{v}"), } } } @@ -507,16 +513,16 @@ mod test { fn colors_update() { let mut colors = 
Colors::new(); - let color = colors.update(None, &[1_u64]).unwrap(); + let color = colors.update(None, &[1_u32]).unwrap(); assert_eq!(colors.len(), 1); dbg!("update"); - let new_color = colors.update(Some(color), &[1_u64]).unwrap(); + let new_color = colors.update(Some(color), &[1_u32]).unwrap(); assert_eq!(colors.len(), 1); assert_eq!(color, new_color); dbg!("upgrade"); - let new_color = colors.update(Some(color), &[2_u64]).unwrap(); + let new_color = colors.update(Some(color), &[2_u32]).unwrap(); assert_eq!(colors.len(), 2); assert_ne!(color, new_color); } @@ -525,20 +531,20 @@ mod test { fn colors_retain() { let mut colors = Colors::new(); - let color1 = colors.update(None, &[1_u64]).unwrap(); + let color1 = colors.update(None, &[1_u32]).unwrap(); assert_eq!(colors.len(), 1); // used_colors: // color1: 1 dbg!("update"); - let same_color = colors.update(Some(color1), &[1_u64]).unwrap(); + let same_color = colors.update(Some(color1), &[1_u32]).unwrap(); assert_eq!(colors.len(), 1); assert_eq!(color1, same_color); // used_colors: // color1: 2 dbg!("upgrade"); - let color2 = colors.update(Some(color1), &[2_u64]).unwrap(); + let color2 = colors.update(Some(color1), &[2_u32]).unwrap(); assert_eq!(colors.len(), 2); assert_ne!(color1, color2); // used_colors: @@ -546,7 +552,7 @@ mod test { // color2: 1 dbg!("update"); - let same_color = colors.update(Some(color2), &[2_u64]).unwrap(); + let same_color = colors.update(Some(color2), &[2_u32]).unwrap(); assert_eq!(colors.len(), 2); assert_eq!(color2, same_color); // used_colors: @@ -554,7 +560,7 @@ mod test { // color1: 2 dbg!("upgrade"); - let color3 = colors.update(Some(color1), &[3_u64]).unwrap(); + let color3 = colors.update(Some(color1), &[3_u32]).unwrap(); assert_ne!(color1, color3); assert_ne!(color2, color3); // used_colors: diff --git a/src/core/src/errors.rs b/src/core/src/errors.rs index cd4ddcfaf1..c43b104bee 100644 --- a/src/core/src/errors.rs +++ b/src/core/src/errors.rs @@ -63,9 +63,17 @@ pub enum SourmashError { #[error(transparent)] IOError(#[from] std::io::Error), + #[error(transparent)] + CsvError(#[from] csv::Error), + #[cfg(not(all(target_arch = "wasm32", target_os = "unknown")))] #[error(transparent)] Panic(#[from] crate::ffi::utils::Panic), + + #[cfg(not(target_arch = "wasm32"))] + #[cfg(feature = "branchwater")] + #[error(transparent)] + RocksDBError(#[from] rocksdb::Error), } #[derive(Debug, Error)] @@ -108,6 +116,8 @@ pub enum SourmashErrorCode { ParseInt = 100_003, SerdeError = 100_004, NifflerError = 100_005, + CsvError = 100_006, + RocksDBError = 100_007, } #[cfg(not(all(target_arch = "wasm32", target_os = "unknown")))] @@ -137,6 +147,11 @@ impl SourmashErrorCode { SourmashError::IOError { .. } => SourmashErrorCode::Io, SourmashError::NifflerError { .. } => SourmashErrorCode::NifflerError, SourmashError::Utf8Error { .. } => SourmashErrorCode::Utf8Error, + SourmashError::CsvError { .. } => SourmashErrorCode::CsvError, + + #[cfg(not(target_arch = "wasm32"))] + #[cfg(feature = "branchwater")] + SourmashError::RocksDBError { .. 
} => SourmashErrorCode::RocksDBError, } } } diff --git a/src/core/src/ffi/index/mod.rs b/src/core/src/ffi/index/mod.rs index 932a97b222..a2f03f222f 100644 --- a/src/core/src/ffi/index/mod.rs +++ b/src/core/src/ffi/index/mod.rs @@ -1,3 +1,5 @@ +#[cfg(not(target_arch = "wasm32"))] +#[cfg(feature = "branchwater")] pub mod revindex; use crate::signature::Signature; diff --git a/src/core/src/ffi/index/revindex.rs b/src/core/src/ffi/index/revindex.rs index 3597121bce..e38bdef7fb 100644 --- a/src/core/src/ffi/index/revindex.rs +++ b/src/core/src/ffi/index/revindex.rs @@ -1,16 +1,17 @@ -use std::path::PathBuf; use std::slice; -use crate::index::revindex::RevIndex; -use crate::index::Index; -use crate::signature::{Signature, SigsTrait}; -use crate::sketch::minhash::KmerMinHash; -use crate::sketch::Sketch; +use camino::Utf8PathBuf as PathBuf; use crate::ffi::index::SourmashSearchResult; use crate::ffi::minhash::SourmashKmerMinHash; use crate::ffi::signature::SourmashSignature; use crate::ffi::utils::{ForeignObject, SourmashStr}; +use crate::index::revindex::mem_revindex::RevIndex; +use crate::index::Index; +use crate::prelude::*; +use crate::signature::{Signature, SigsTrait}; +use crate::sketch::minhash::KmerMinHash; +use crate::sketch::Sketch; pub struct SourmashRevIndex; @@ -18,6 +19,21 @@ impl ForeignObject for SourmashRevIndex { type RustObject = RevIndex; } +// TODO: remove this when it is possible to pass Selection thru the FFI +fn from_template(template: &Sketch) -> Selection { + let (num, scaled) = match template { + Sketch::MinHash(mh) => (mh.num(), mh.scaled() as u32), + Sketch::LargeMinHash(mh) => (mh.num(), mh.scaled() as u32), + _ => unimplemented!(), + }; + + Selection::builder() + .ksize(template.ksize() as u32) + .num(num) + .scaled(scaled) + .build() +} + ffi_fn! 
{ unsafe fn revindex_new_with_paths( search_sigs_ptr: *const *const SourmashStr, @@ -58,13 +74,16 @@ unsafe fn revindex_new_with_paths( .collect(); Some(queries_vec.as_ref()) }; + + let selection = from_template(&template); + let revindex = RevIndex::new( search_sigs.as_ref(), - &template, + &selection, threshold, queries, keep_sigs, - ); + )?; Ok(SourmashRevIndex::from_rust(revindex)) } } @@ -105,7 +124,9 @@ unsafe fn revindex_new_with_sigs( .collect(); Some(queries_vec.as_ref()) }; - let revindex = RevIndex::new_with_sigs(search_sigs, &template, threshold, queries); + + let selection = from_template(&template); + let revindex = RevIndex::new_with_sigs(search_sigs, &selection, threshold, queries)?; Ok(SourmashRevIndex::from_rust(revindex)) } } diff --git a/src/core/src/ffi/minhash.rs b/src/core/src/ffi/minhash.rs index 45890b81d9..11863ba265 100644 --- a/src/core/src/ffi/minhash.rs +++ b/src/core/src/ffi/minhash.rs @@ -2,8 +2,9 @@ use std::ffi::CStr; use std::os::raw::c_char; use std::slice; -use crate::encodings::{aa_to_dayhoff, aa_to_hp, translate_codon, HashFunctions}; +use crate::encodings::{aa_to_dayhoff, aa_to_hp, translate_codon}; use crate::ffi::utils::{ForeignObject, SourmashStr}; +use crate::ffi::HashFunctions; use crate::signature::SeqToHashes; use crate::signature::SigsTrait; use crate::sketch::minhash::KmerMinHash; @@ -23,7 +24,7 @@ pub unsafe extern "C" fn kmerminhash_new( track_abundance: bool, n: u32, ) -> *mut SourmashKmerMinHash { - let mh = KmerMinHash::new(scaled, k, hash_function, seed, track_abundance, n); + let mh = KmerMinHash::new(scaled, k, hash_function.into(), seed, track_abundance, n); SourmashKmerMinHash::from_rust(mh) } @@ -367,13 +368,13 @@ pub unsafe extern "C" fn kmerminhash_hash_function( ptr: *const SourmashKmerMinHash, ) -> HashFunctions { let mh = SourmashKmerMinHash::as_rust(ptr); - mh.hash_function() + mh.hash_function().into() } ffi_fn! { unsafe fn kmerminhash_hash_function_set(ptr: *mut SourmashKmerMinHash, hash_function: HashFunctions) -> Result<()> { let mh = SourmashKmerMinHash::as_rust_mut(ptr); - mh.set_hash_function(hash_function) + mh.set_hash_function(hash_function.into()) } } diff --git a/src/core/src/ffi/mod.rs b/src/core/src/ffi/mod.rs index a67de37176..6f1dff78e4 100644 --- a/src/core/src/ffi/mod.rs +++ b/src/core/src/ffi/mod.rs @@ -1,6 +1,6 @@ //! # Foreign Function Interface for calling sourmash from a C API //! -//! Primary client for now is the Python version, using CFFI and milksnake. +//! Primary client for now is the Python version, using CFFI and maturin. 
#![allow(clippy::missing_safety_doc)] #[macro_use] @@ -29,3 +29,40 @@ pub unsafe extern "C" fn hash_murmur(kmer: *const c_char, seed: u64) -> u64 { _hash_murmur(c_str.to_bytes(), seed) } + +#[repr(u32)] +pub enum HashFunctions { + Murmur64Dna = 1, + Murmur64Protein = 2, + Murmur64Dayhoff = 3, + Murmur64Hp = 4, +} + +impl From for crate::encodings::HashFunctions { + fn from(v: HashFunctions) -> crate::encodings::HashFunctions { + use crate::encodings::HashFunctions::{ + Murmur64Dayhoff, Murmur64Dna, Murmur64Hp, Murmur64Protein, + }; + match v { + HashFunctions::Murmur64Dna => Murmur64Dna, + HashFunctions::Murmur64Protein => Murmur64Protein, + HashFunctions::Murmur64Dayhoff => Murmur64Dayhoff, + HashFunctions::Murmur64Hp => Murmur64Hp, + } + } +} + +impl From for HashFunctions { + fn from(v: crate::encodings::HashFunctions) -> HashFunctions { + use crate::encodings::HashFunctions::{ + Murmur64Dayhoff, Murmur64Dna, Murmur64Hp, Murmur64Protein, + }; + match v { + Murmur64Dna => HashFunctions::Murmur64Dna, + Murmur64Protein => HashFunctions::Murmur64Protein, + Murmur64Dayhoff => HashFunctions::Murmur64Dayhoff, + Murmur64Hp => HashFunctions::Murmur64Hp, + _ => todo!("Not supported, probably custom"), + } + } +} diff --git a/src/core/src/ffi/storage.rs b/src/core/src/ffi/storage.rs index 86d3834201..7479e983e5 100644 --- a/src/core/src/ffi/storage.rs +++ b/src/core/src/ffi/storage.rs @@ -1,5 +1,6 @@ use std::os::raw::c_char; use std::slice; +use std::sync::Arc; use crate::ffi::utils::{ForeignObject, SourmashStr}; use crate::prelude::*; @@ -8,7 +9,7 @@ use crate::storage::ZipStorage; pub struct SourmashZipStorage; impl ForeignObject for SourmashZipStorage { - type RustObject = ZipStorage; + type RustObject = Arc; } ffi_fn! { @@ -20,7 +21,7 @@ unsafe fn zipstorage_new(ptr: *const c_char, insize: usize) -> Result<*mut Sourm }; let zipstorage = ZipStorage::from_file(path)?; - Ok(SourmashZipStorage::from_rust(zipstorage)) + Ok(SourmashZipStorage::from_rust(Arc::new(zipstorage))) } } @@ -110,7 +111,7 @@ unsafe fn zipstorage_set_subdir( std::str::from_utf8(path)? 
@@ -120,7 +121,7 @@ unsafe fn zipstorage_path(ptr: *const SourmashZipStorage) -> Result<SourmashStr> let storage = SourmashZipStorage::as_rust(ptr); if let Some(ref path) = storage.path() { - Ok(path.clone().into()) + Ok(path.clone().into_string().into()) } else { Ok("".into()) } diff --git a/src/core/src/from.rs b/src/core/src/from.rs index dfc384236e..dbeeb58a2f 100644 --- a/src/core/src/from.rs +++ b/src/core/src/from.rs @@ -17,16 +17,14 @@ impl From<MashSketcher> for KmerMinHash { let mut new_mh = KmerMinHash::new( 0, values.get(0).unwrap().kmer.len() as u32, - HashFunctions::murmur64_DNA, + HashFunctions::Murmur64Dna, 42, true, values.len() as u32, ); - let hash_with_abunds: Vec<(u64, u64)> = values - .iter() - .map(|x| (x.hash as u64, x.count as u64)) - .collect(); + let hash_with_abunds: Vec<(u64, u64)> = + values.iter().map(|x| (x.hash, x.count as u64)).collect(); new_mh .add_many_with_abund(&hash_with_abunds) @@ -53,7 +51,7 @@ mod test { #[test] fn finch_behavior() { - let mut a = KmerMinHash::new(0, 10, HashFunctions::murmur64_DNA, 42, true, 20); + let mut a = KmerMinHash::new(0, 10, HashFunctions::Murmur64Dna, 42, true, 20); let mut b = MashSketcher::new(20, 10, 42); let seq = b"TGCCGCCCAGCACCGGGTGACTAGGTTGAGCCATGATTAACCTGCAATGA"; @@ -68,7 +66,7 @@ mod test { let b_hashes = b.to_vec(); let s1: HashSet<_> = a.mins().into_iter().collect(); - let s2: HashSet<_> = b_hashes.iter().map(|x| x.hash as u64).collect(); + let s2: HashSet<_> = b_hashes.iter().map(|x| x.hash).collect(); let i1 = &s1 & &s2; assert!(i1.len() == a.size()); @@ -79,10 +77,9 @@ mod test { let smap: HashMap<_, _> = mins.iter().zip(abunds.iter()).collect(); println!("{:?}", smap); for item in b_hashes.iter() { - assert!(smap.contains_key(&(item.hash as u64))); + assert!(smap.contains_key(&{ item.hash })); assert!( - **smap.get(&(item.hash as u64)).unwrap() - == ((item.count + item.extra_count) as u64) + **smap.get(&{ item.hash }).unwrap() == ((item.count + item.extra_count) as u64) ); } } @@ -90,7 +87,7 @@ mod test { #[test] fn from_finch() { - let mut a = KmerMinHash::new(0, 10, HashFunctions::murmur64_DNA, 42, true, 20); + let mut a = KmerMinHash::new(0, 10, HashFunctions::Murmur64Dna, 42, true, 20); let mut b = MashSketcher::new(20, 10, 42); let seq = b"TGCCGCCCAGCACCGGGTGACTAGGTTGAGCCATGATTAACCTGCAATGA"; diff --git a/src/core/src/index/linear.rs b/src/core/src/index/linear.rs index 78b2c6f1f5..ff919b6f57 100644 --- a/src/core/src/index/linear.rs +++ b/src/core/src/index/linear.rs @@ -1,185 +1,325 @@ -use std::fs::File; -use std::io::{BufReader, Read}; -use std::path::Path; -use std::path::PathBuf; - -use serde::{Deserialize, Serialize}; -use typed_builder::TypedBuilder; - -use crate::index::{Comparable, DatasetInfo, Index, SigStore}; -use crate::prelude::*; -use crate::storage::{FSStorage, InnerStorage, Storage, StorageInfo}; -use crate::Error; - -#[derive(TypedBuilder)] -pub struct LinearIndex<L> { - #[builder(default)] - storage: Option<InnerStorage>, - - #[builder(default)] - datasets: Vec<SigStore<L>>, -} +use std::collections::HashSet; +use std::sync::atomic::{AtomicUsize, Ordering}; + +use camino::Utf8PathBuf as PathBuf; +use log::info; + +#[cfg(feature = "parallel")] +use rayon::prelude::*; -#[derive(Serialize, Deserialize)] -struct LinearInfo<L> { - version: u32, - storage: StorageInfo, - leaves: Vec<L>, +use crate::collection::CollectionSet; +use crate::encodings::Idx; +use crate::index::{GatherResult, Index, Selection, SigCounter};
+use crate::selection::Select; +use crate::signature::{Signature, SigsTrait}; +use crate::sketch::minhash::KmerMinHash; +use crate::sketch::Sketch; +use crate::storage::SigStore; +use crate::Result; + +pub struct LinearIndex { + collection: CollectionSet, + template: Sketch, } -impl<'a, L> Index<'a> for LinearIndex<L> -where - L: Clone + Comparable<L> + 'a, - SigStore<L>: From<L>, -{ - type Item = L; - //type SignatureIterator = std::slice::Iter<'a, Self::Item>; - - fn insert(&mut self, node: L) -> Result<(), Error> { - self.datasets.push(node.into()); - Ok(()) - } - - fn save<P: AsRef<Path>>(&self, _path: P) -> Result<(), Error> { - /* - let file = File::create(path)?; - match serde_json::to_writer(file, &self) { - Ok(_) => Ok(()), - Err(_) => Err(SourmashError::SerdeError.into()), +impl LinearIndex { + pub fn from_collection(collection: CollectionSet) -> Self { + let sig = collection.sig_for_dataset(0).unwrap(); + let template = sig.sketches().swap_remove(0); + Self { + collection, + template, } - */ - unimplemented!() } - fn load<P: AsRef<Path>>(_path: P) -> Result<(), Error> { + pub fn sig_for_dataset(&self, dataset_id: Idx) -> Result<SigStore> { + self.collection.sig_for_dataset(dataset_id) + } + + pub fn collection(&self) -> &CollectionSet { + &self.collection + } + + pub fn template(&self) -> &Sketch { + &self.template + } + + pub fn location(&self) -> Option<String> { unimplemented!() } - fn signatures(&self) -> Vec<Self::Item> { - self.datasets - .iter() - .map(|x| x.data.get().unwrap().clone()) - .collect() + pub fn counter_for_query(&self, query: &KmerMinHash) -> SigCounter { + let processed_sigs = AtomicUsize::new(0); + + let template = self.template(); + + #[cfg(feature = "parallel")] + let sig_iter = self.collection.par_iter(); + + #[cfg(not(feature = "parallel"))] + let sig_iter = self.collection.iter(); + + let counters = sig_iter.filter_map(|(dataset_id, record)| { + let filename = record.internal_location(); + + let i = processed_sigs.fetch_add(1, Ordering::SeqCst); + if i % 1000 == 0 { + info!("Processed {} reference sigs", i); + } + + let search_sig = self + .collection + .sig_for_dataset(dataset_id) + .unwrap_or_else(|_| panic!("error loading {:?}", filename)); + + let mut search_mh = None; + if let Some(Sketch::MinHash(mh)) = search_sig.select_sketch(template) { + search_mh = Some(mh); + }; + let search_mh = search_mh.expect("Couldn't find a compatible MinHash"); + + let (large_mh, small_mh) = if query.size() > search_mh.size() { + (query, search_mh) + } else { + (search_mh, query) + }; + + let (size, _) = small_mh + .intersection_size(large_mh) + .unwrap_or_else(|_| panic!("error computing intersection for {:?}", filename)); + + if size == 0 { + None + } else { + let mut counter: SigCounter = Default::default(); + counter[&(dataset_id as Idx)] += size as usize; + Some(counter) + } + }); + + let reduce_counters = |mut a: SigCounter, b: SigCounter| { + a.extend(&b); + a + }; + + #[cfg(feature = "parallel")] + let counter = counters.reduce(SigCounter::new, reduce_counters); + + #[cfg(not(feature = "parallel"))] + let counter = counters.fold(SigCounter::new(), reduce_counters); + + counter } - fn signature_refs(&self) -> Vec<&Self::Item> { - self.datasets - .iter() - .map(|x| x.data.get().unwrap()) - .collect() + pub fn search( + &self, + counter: SigCounter, + similarity: bool, + threshold: usize, + ) -> Result<Vec<String>> { + let mut matches = vec![]; + if similarity { + unimplemented!("TODO: threshold correction") + } + + for (dataset_id, size) in counter.most_common() { + if size >= threshold { + matches.push( + self.collection
.record_for_dataset(dataset_id)? + .internal_location() + .to_string(), + ); + } else { + break; + }; + } + Ok(matches) } - /* - fn iter_signatures(&'a self) -> Self::SignatureIterator { - self.datasets.iter() + pub fn gather_round( + &self, + dataset_id: Idx, + match_size: usize, + query: &KmerMinHash, + round: usize, + ) -> Result<GatherResult> { + let match_path = self + .collection + .record_for_dataset(dataset_id)? + .internal_location() + .into(); + let match_sig = self.collection.sig_for_dataset(dataset_id)?; + let result = self.stats_for_match(&match_sig, query, match_size, match_path, round)?; + Ok(result) } - */ -} -impl<L> LinearIndex<L> -where - L: ToWriter, - SigStore<L>: ReadData<L>, -{ - pub fn save_file<P: AsRef<Path>>( - &mut self, - path: P, - storage: Option<InnerStorage>, - ) -> Result<(), Error> { - let ref_path = path.as_ref(); - let mut basename = ref_path.file_name().unwrap().to_str().unwrap().to_owned(); - if basename.ends_with(".sbt.json") { - basename = basename.replace(".sbt.json", ""); + fn stats_for_match( + &self, + match_sig: &Signature, + query: &KmerMinHash, + match_size: usize, + match_path: PathBuf, + gather_result_rank: usize, + ) -> Result<GatherResult> { + let template = self.template(); + + let mut match_mh = None; + if let Some(Sketch::MinHash(mh)) = match_sig.select_sketch(template) { + match_mh = Some(mh); } - let location = ref_path.parent().unwrap(); + let match_mh = match_mh.expect("Couldn't find a compatible MinHash"); + + // Calculate stats + let f_orig_query = match_size as f64 / query.size() as f64; + let f_match = match_size as f64 / match_mh.size() as f64; + let filename = match_path.into_string(); + let name = match_sig.name(); + let unique_intersect_bp = match_mh.scaled() as usize * match_size; + + let (intersect_orig, _) = match_mh.intersection_size(query)?; + let intersect_bp = (match_mh.scaled() * intersect_orig) as usize; - let storage = match storage { - Some(s) => s, - None => { - let subdir = format!(".linear.{}", basename); - InnerStorage::new(FSStorage::new(location.to_str().unwrap(), &subdir)) + let f_unique_to_query = intersect_orig as f64 / query.size() as f64; + let match_ = match_sig.clone(); + + // TODO: all of these + let f_unique_weighted = 0.; + let average_abund = 0; + let median_abund = 0; + let std_abund = 0; + let md5 = "".into(); + let f_match_orig = 0.; + let remaining_bp = 0; + + Ok(GatherResult { + intersect_bp, + f_orig_query, + f_match, + f_unique_to_query, + f_unique_weighted, + average_abund, + median_abund, + std_abund, + filename, + name, + md5, + match_, + f_match_orig, + unique_intersect_bp, + gather_result_rank, + remaining_bp, + }) + } + + pub fn gather( + &self, + mut counter: SigCounter, + threshold: usize, + query: &KmerMinHash, + ) -> std::result::Result<Vec<GatherResult>, Box<dyn std::error::Error>> { + let mut match_size = usize::max_value(); + let mut matches = vec![]; + let template = self.template(); + + while match_size > threshold && !counter.is_empty() { + let (dataset_id, size) = counter.most_common()[0]; + if threshold == 0 && size == 0 { + break; } - }; - let args = storage.args(); - let storage_info = StorageInfo { - backend: "FSStorage".into(), - args, - }; + match_size = if size >= threshold { + size + } else { + break; + }; - let info: LinearInfo<DatasetInfo> = LinearInfo { - storage: storage_info, - version: 5, - leaves: self - .datasets - .iter_mut() - .map(|l| { - // Trigger data loading - let _: &L = (*l).data().unwrap(); - - // set storage to new one - l.storage = Some(storage.clone()); - - let filename = (*l).save(&l.filename).unwrap(); - - DatasetInfo { - filename, - name: l.name.clone(), -
metadata: l.metadata.clone(), - } - }) - .collect(), - }; + let result = self.gather_round(dataset_id, match_size, query, matches.len())?; + + // Prepare counter for finding the next match by decrementing + // all hashes found in the current match in other datasets + // TODO: maybe par_iter? + let mut to_remove: HashSet<Idx> = Default::default(); + to_remove.insert(dataset_id); + + for (dataset, value) in counter.iter_mut() { + let dataset_sig = self.collection.sig_for_dataset(*dataset)?; + let mut match_mh = None; + if let Some(Sketch::MinHash(mh)) = dataset_sig.select_sketch(template) { + match_mh = Some(mh); + } + let match_mh = match_mh.expect("Couldn't find a compatible MinHash"); + + let (intersection, _) = query.intersection_size(match_mh)?; + if intersection as usize > *value { + to_remove.insert(*dataset); + } else { + *value -= intersection as usize; + }; + } + to_remove.iter().for_each(|dataset_id| { + counter.remove(dataset_id); + }); + matches.push(result); + } + Ok(matches) + } + + pub fn signatures_iter(&self) -> impl Iterator<Item = SigStore> + '_ { + (0..self.collection.len()).map(move |dataset_id| { + self.collection + .sig_for_dataset(dataset_id as Idx) + .expect("error loading sig") + }) + } +} + +impl Select for LinearIndex { + fn select(self, selection: &Selection) -> Result<Self> { + let Self { + collection, + template, + } = self; + let collection = collection.into_inner().select(selection)?.try_into()?; - let file = File::create(path)?; - serde_json::to_writer(file, &info)?; - - Ok(()) - } - - pub fn from_path<P: AsRef<Path>>(path: P) -> Result<LinearIndex<L>, Error> { - let file = File::open(&path)?; - let mut reader = BufReader::new(file); - - // TODO: match with available Storage while we don't - // add a function to build a Storage from a StorageInfo - let mut basepath = PathBuf::new(); - basepath.push(path); - basepath.canonicalize()?; - - let linear = LinearIndex::<L>::from_reader(&mut reader, basepath.parent().unwrap())?; - Ok(linear) - } - - pub fn from_reader<R, P>(rdr: R, path: P) -> Result<LinearIndex<L>, Error> - where - R: Read, - P: AsRef<Path>, - { - // TODO: check https://serde.rs/enum-representations.html for a - // solution for loading v4 and v5 - let linear: LinearInfo<DatasetInfo> = serde_json::from_reader(rdr)?; - - // TODO: support other storages - let mut st: FSStorage = (&linear.storage.args).into(); - st.set_base(path.as_ref().to_str().unwrap()); - let storage = InnerStorage::new(st); - - Ok(LinearIndex { - storage: Some(storage.clone()), - datasets: linear - .leaves - .into_iter() - .map(|l| { - let mut v: SigStore<L> = l.into(); - v.storage = Some(storage.clone()); - v - }) - .collect(), + Ok(Self { + collection, + template, }) } +} + +impl<'a> Index<'a> for LinearIndex { + type Item = SigStore; + + fn insert(&mut self, _node: Self::Item) -> Result<()> { + unimplemented!() + } + + fn save<P: AsRef<Path>>(&self, _path: P) -> Result<()> { + unimplemented!() + } + + fn load<P: AsRef<Path>>(_path: P) -> Result<()> { + unimplemented!() + } + + fn len(&self) -> usize { + self.collection.len() + } + + fn signatures(&self) -> Vec<Self::Item> { + self.collection() + .iter() + .map(|(i, p)| { + self.collection() + .sig_for_dataset(i as Idx) + .unwrap_or_else(|_| panic!("Error processing {}", p.internal_location())) + }) + .collect() + } - pub fn storage(&self) -> Option<InnerStorage> { - self.storage.clone() + fn signature_refs(&self) -> Vec<&Self::Item> { + unimplemented!() } }
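Taken together, the new `LinearIndex` is a thin façade over a `CollectionSet`: build a collection, `select()` it down to one k-size/moltype, wrap it, then reuse one hash counter per query. A hypothetical driver, with placeholder paths and parameters that are not part of the patch:

use sourmash::collection::Collection;
use sourmash::index::linear::LinearIndex;
use sourmash::selection::Selection;
use sourmash::sketch::minhash::KmerMinHash;

fn demo(query: &KmerMinHash) -> sourmash::Result<()> {
    let selection = Selection::builder().ksize(31).scaled(10000).build();
    // A zip collection, like the GTDB database zips:
    let collection = Collection::from_zipfile("refs.zip")?.select(&selection)?;
    let index = LinearIndex::from_collection(collection.try_into()?);

    let counter = index.counter_for_query(query);
    let matches = index.search(counter, false, 0)?; // threshold in shared hashes
    println!("{} matches", matches.len());
    Ok(())
}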
diff --git a/src/core/src/index/mod.rs b/src/core/src/index/mod.rs index 832fdf9091..ec55249b04 100644 --- a/src/core/src/index/mod.rs +++ b/src/core/src/index/mod.rs @@ -4,35 +4,73 @@ //! Some indices also support containment searches. pub mod linear; + +#[cfg(not(target_arch = "wasm32"))] +#[cfg(feature = "branchwater")] pub mod revindex; pub mod search; -use std::ops::Deref; use std::path::Path; -use once_cell::sync::OnceCell; +use getset::{CopyGetters, Getters, Setters}; + use serde::{Deserialize, Serialize}; use typed_builder::TypedBuilder; -use crate::errors::ReadDataError; +use crate::encodings::Idx; use crate::index::search::{search_minhashes, search_minhashes_containment}; use crate::prelude::*; -use crate::signature::SigsTrait; -use crate::sketch::Sketch; -use crate::storage::{InnerStorage, Storage}; -use crate::Error; +use crate::Result; + +#[derive(TypedBuilder, CopyGetters, Getters, Setters, Serialize, Deserialize, Debug, PartialEq)] +pub struct GatherResult { + #[getset(get_copy = "pub")] + intersect_bp: usize, + + #[getset(get_copy = "pub")] + f_orig_query: f64, + + #[getset(get_copy = "pub")] + f_match: f64, + + f_unique_to_query: f64, + f_unique_weighted: f64, + average_abund: usize, + median_abund: usize, + std_abund: usize, + + #[getset(get = "pub")] + filename: String, + + #[getset(get = "pub")] + name: String, + + #[getset(get = "pub")] + md5: String, + + #[serde(skip)] + match_: Signature, + + f_match_orig: f64, + unique_intersect_bp: usize, + gather_result_rank: usize, + remaining_bp: usize, +} + +impl GatherResult { + pub fn get_match(&self) -> Signature { + self.match_.clone() + } +} + +type SigCounter = counter::Counter<Idx>; pub trait Index<'a> { type Item: Comparable<Self::Item>; //type SignatureIterator: Iterator<Item = Self::Item>; - fn find<F>( - &self, - search_fn: F, - sig: &Self::Item, - threshold: f64, - ) -> Result<Vec<&Self::Item>, Error> + fn find<F>(&self, search_fn: F, sig: &Self::Item, threshold: f64) -> Result<Vec<&Self::Item>> where F: Fn(&dyn Comparable<Self::Item>, &Self::Item, f64) -> bool, { @@ -54,7 +92,7 @@ pub trait Index<'a> { sig: &Self::Item, threshold: f64, containment: bool, - ) -> Result<Vec<&Self::Item>, Error> { + ) -> Result<Vec<&Self::Item>> { if containment { self.find(search_minhashes_containment, sig, threshold) } else { @@ -62,11 +100,11 @@ pub trait Index<'a> { } } - //fn gather(&self, sig: &Self::Item, threshold: f64) -> Result<Vec<Self::Item>, Error>; + //fn gather(&self, sig: &Self::Item, threshold: f64) -> Result<Vec<Self::Item>>; - fn insert(&mut self, node: Self::Item) -> Result<(), Error>; + fn insert(&mut self, node: Self::Item) -> Result<()>; - fn batch_insert(&mut self, nodes: Vec<Self::Item>) -> Result<(), Error> { + fn batch_insert(&mut self, nodes: Vec<Self::Item>) -> Result<()> { for node in nodes { self.insert(node)?; } @@ -74,9 +112,9 @@ pub trait Index<'a> { Ok(()) } - fn save<P: AsRef<Path>>(&self, path: P) -> Result<(), Error>; + fn save<P: AsRef<Path>>(&self, path: P) -> Result<()>; - fn load<P: AsRef<Path>>(path: P) -> Result<(), Error>; + fn load<P: AsRef<Path>>(path: P) -> Result<()>; fn signatures(&self) -> Vec<Self::Item>; @@ -107,232 +145,3 @@ where (*self).containment(other) } }
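`GatherResult` now derives `TypedBuilder` plus `getset` accessors, so callers construct it fluently and read fields through generated getters rather than touching fields directly. An illustrative construction with placeholder values (the same builder call appears in `disk_revindex::gather` later in this patch):

use sourmash::index::GatherResult;
use sourmash::signature::Signature;

fn demo_result(match_sig: Signature) -> GatherResult {
    let result = GatherResult::builder()
        .intersect_bp(4200)
        .f_orig_query(0.42)
        .f_match(0.9)
        .f_unique_to_query(0.4)
        .f_unique_weighted(0.)
        .average_abund(0)
        .median_abund(0)
        .std_abund(0)
        .filename("ref.sig".into())
        .name("ref genome".into())
        .md5("".into())
        .match_(match_sig)
        .f_match_orig(0.)
        .unique_intersect_bp(4200)
        .gather_result_rank(0)
        .remaining_bp(0)
        .build();
    assert_eq!(result.intersect_bp(), 4200); // `get_copy` accessor from getset
    result
}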
- -#[derive(Serialize, Deserialize, Debug)] -pub struct DatasetInfo { - pub filename: String, - pub name: String, - pub metadata: String, -} - -#[derive(TypedBuilder, Default, Clone)] -pub struct SigStore<T> { - #[builder(setter(into))] - filename: String, - - #[builder(setter(into))] - name: String, - - #[builder(setter(into))] - metadata: String, - - storage: Option<InnerStorage>, - - #[builder(setter(into), default)] - data: OnceCell<T>, -} - -impl<T> SigStore<T> { - pub fn name(&self) -> String { - self.name.clone() - } -} - -impl<T> std::fmt::Debug for SigStore<T> { - fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { - write!( - f, - "SigStore [filename: {}, name: {}, metadata: {}]", - self.filename, self.name, self.metadata - ) - } -} - -impl ReadData<Signature> for SigStore<Signature> { - fn data(&self) -> Result<&Signature, Error> { - if let Some(sig) = self.data.get() { - Ok(sig) - } else if let Some(storage) = &self.storage { - let sig = self.data.get_or_init(|| { - let raw = storage.load(&self.filename).unwrap(); - let sigs: Result<Vec<Signature>, _> = serde_json::from_reader(&mut &raw[..]); - if let Ok(sigs) = sigs { - // TODO: select the right sig? - sigs[0].to_owned() - } else { - let sig: Signature = serde_json::from_reader(&mut &raw[..]).unwrap(); - sig - } - }); - - Ok(sig) - } else { - Err(ReadDataError::LoadError.into()) - } - } -} - -impl<T> SigStore<T> -where - T: ToWriter, -{ - pub fn save(&self, path: &str) -> Result<String, Error> { - if let Some(storage) = &self.storage { - if let Some(data) = self.data.get() { - let mut buffer = Vec::new(); - data.to_writer(&mut buffer)?; - - Ok(storage.save(path, &buffer)?) - } else { - unimplemented!() - } - } else { - unimplemented!() - } - } -} - -impl SigStore<Signature> { - pub fn count_common(&self, other: &SigStore<Signature>) -> u64 { - let ng: &Signature = self.data().unwrap(); - let ong: &Signature = other.data().unwrap(); - - // TODO: select the right signatures... - // TODO: better matching here, what if it is not a mh? - if let Sketch::MinHash(mh) = &ng.signatures[0] { - if let Sketch::MinHash(omh) = &ong.signatures[0] { - return mh.count_common(omh, false).unwrap(); - } - } - unimplemented!(); - } - - pub fn mins(&self) -> Vec<u64> { - let ng: &Signature = self.data().unwrap(); - - // TODO: select the right signatures... - // TODO: better matching here, what if it is not a mh? - if let Sketch::MinHash(mh) = &ng.signatures[0] { - mh.mins() - } else { - unimplemented!() - } - } -} - -impl From<SigStore<Signature>> for Signature { - fn from(other: SigStore<Signature>) -> Signature { - other.data.get().unwrap().to_owned() - } -} - -impl Deref for SigStore<Signature> { - type Target = Signature; - - fn deref(&self) -> &Signature { - self.data.get().unwrap() - } -} - -impl From<Signature> for SigStore<Signature> { - fn from(other: Signature) -> SigStore<Signature> { - let name = other.name(); - let filename = other.filename(); - - SigStore::builder() - .name(name) - .filename(filename) - .data(other) - .metadata("") - .storage(None) - .build() - } -} - -impl Comparable<SigStore<Signature>> for SigStore<Signature> { - fn similarity(&self, other: &SigStore<Signature>) -> f64 { - let ng: &Signature = self.data().unwrap(); - let ong: &Signature = other.data().unwrap(); - - // TODO: select the right signatures... - // TODO: better matching here, what if it is not a mh? - if let Sketch::MinHash(mh) = &ng.signatures[0] { - if let Sketch::MinHash(omh) = &ong.signatures[0] { - return mh.similarity(omh, true, false).unwrap(); - } - } - - /* FIXME: bring back after boomphf changes - if let Sketch::UKHS(mh) = &ng.signatures[0] { - if let Sketch::UKHS(omh) = &ong.signatures[0] { - return 1. - mh.distance(&omh); - } - } - */ - - unimplemented!() - } - - fn containment(&self, other: &SigStore<Signature>) -> f64 { - let ng: &Signature = self.data().unwrap(); - let ong: &Signature = other.data().unwrap(); - - // TODO: select the right signatures... - // TODO: better matching here, what if it is not a mh? - if let Sketch::MinHash(mh) = &ng.signatures[0] { - if let Sketch::MinHash(omh) = &ong.signatures[0] { - let common = mh.count_common(omh, false).unwrap(); - let size = mh.size(); - return common as f64 / size as f64; - } - } - unimplemented!() - } -} - -impl Comparable<Signature> for Signature { - fn similarity(&self, other: &Signature) -> f64 { - // TODO: select the right signatures... - // TODO: better matching here, what if it is not a mh?
- if let Sketch::MinHash(mh) = &self.signatures[0] { - if let Sketch::MinHash(omh) = &other.signatures[0] { - return mh.similarity(omh, true, false).unwrap(); - } - } - - /* FIXME: bring back after boomphf changes - if let Sketch::UKHS(mh) = &self.signatures[0] { - if let Sketch::UKHS(omh) = &other.signatures[0] { - return 1. - mh.distance(&omh); - } - } - */ - - unimplemented!() - } - - fn containment(&self, other: &Signature) -> f64 { - // TODO: select the right signatures... - // TODO: better matching here, what if it is not a mh? - if let Sketch::MinHash(mh) = &self.signatures[0] { - if let Sketch::MinHash(omh) = &other.signatures[0] { - let common = mh.count_common(omh, false).unwrap(); - let size = mh.size(); - return common as f64 / size as f64; - } - } - unimplemented!() - } -} - -impl From for SigStore { - fn from(other: DatasetInfo) -> SigStore { - SigStore { - filename: other.filename, - name: other.name, - metadata: other.metadata, - storage: None, - data: OnceCell::new(), - } - } -} diff --git a/src/core/src/index/revindex.rs b/src/core/src/index/revindex.rs deleted file mode 100644 index 0a1fc25d18..0000000000 --- a/src/core/src/index/revindex.rs +++ /dev/null @@ -1,699 +0,0 @@ -use std::collections::{HashMap, HashSet}; -use std::path::{Path, PathBuf}; -use std::sync::atomic::{AtomicUsize, Ordering}; - -use getset::{CopyGetters, Getters, Setters}; -use log::{debug, info}; -use nohash_hasher::BuildNoHashHasher; -use serde::{Deserialize, Serialize}; - -#[cfg(feature = "parallel")] -use rayon::prelude::*; - -use crate::encodings::{Color, Colors, Idx}; -use crate::index::Index; -use crate::signature::{Signature, SigsTrait}; -use crate::sketch::minhash::KmerMinHash; -use crate::sketch::Sketch; -use crate::Error; -use crate::HashIntoType; - -type SigCounter = counter::Counter; - -#[derive(Serialize, Deserialize)] -struct HashToColor(HashMap>); - -impl HashToColor { - fn new() -> Self { - HashToColor(HashMap::< - HashIntoType, - Color, - BuildNoHashHasher, - >::with_hasher(BuildNoHashHasher::default())) - } - - fn get(&self, hash: &HashIntoType) -> Option<&Color> { - self.0.get(hash) - } - - fn retain(&mut self, hashes: &HashSet) { - self.0.retain(|hash, _| hashes.contains(hash)) - } - - fn len(&self) -> usize { - self.0.len() - } - - fn is_empty(&self) -> bool { - self.0.is_empty() - } - - fn add_to(&mut self, colors: &mut Colors, dataset_id: usize, matched_hashes: Vec) { - let mut color = None; - - matched_hashes.into_iter().for_each(|hash| { - color = Some(colors.update(color, &[dataset_id as Idx]).unwrap()); - self.0.insert(hash, color.unwrap()); - }); - } - - fn reduce_hashes_colors( - a: (HashToColor, Colors), - b: (HashToColor, Colors), - ) -> (HashToColor, Colors) { - let ((small_hashes, small_colors), (mut large_hashes, mut large_colors)) = - if a.0.len() > b.0.len() { - (b, a) - } else { - (a, b) - }; - - small_hashes.0.into_iter().for_each(|(hash, color)| { - large_hashes - .0 - .entry(hash) - .and_modify(|entry| { - // Hash is already present. - // Update the current color by adding the indices from - // small_colors. - let ids = small_colors.indices(&color); - let new_color = large_colors.update(Some(*entry), ids).unwrap(); - *entry = new_color; - }) - .or_insert_with(|| { - // In this case, the hash was not present yet. - // we need to create the same color from small_colors - // into large_colors. 
- let ids = small_colors.indices(&color); - let new_color = large_colors.update(None, ids).unwrap(); - assert_eq!(new_color, color); - new_color - }); - }); - - (large_hashes, large_colors) - } -} - -// Use rkyv for serialization? -// https://davidkoloski.me/rkyv/ -#[derive(Serialize, Deserialize)] -pub struct RevIndex { - hash_to_color: HashToColor, - - sig_files: Vec, - - #[serde(skip)] - ref_sigs: Option>, - - template: Sketch, - colors: Colors, - //#[serde(skip)] - //storage: Option, -} - -impl RevIndex { - pub fn load>( - index_path: P, - queries: Option<&[KmerMinHash]>, - ) -> Result> { - let (rdr, _) = niffler::from_path(index_path)?; - let revindex = if let Some(qs) = queries { - // TODO: avoid loading full revindex if query != None - /* - struct PartialRevIndex { - hashes_to_keep: Option>, - marker: PhantomData T>, - } - - impl PartialRevIndex { - pub fn new(hashes_to_keep: HashSet) -> Self { - PartialRevIndex { - hashes_to_keep: Some(hashes_to_keep), - marker: PhantomData, - } - } - } - */ - - let mut hashes: HashSet = HashSet::new(); - for q in qs { - hashes.extend(q.iter_mins()); - } - - //let mut revindex: RevIndex = PartialRevIndex::new(hashes).deserialize(&rdr).unwrap(); - - let mut revindex: RevIndex = serde_json::from_reader(rdr)?; - revindex.hash_to_color.retain(&hashes); - revindex - } else { - // Load the full revindex - serde_json::from_reader(rdr)? - }; - - Ok(revindex) - } - - pub fn new( - search_sigs: &[PathBuf], - template: &Sketch, - threshold: usize, - queries: Option<&[KmerMinHash]>, - keep_sigs: bool, - ) -> RevIndex { - // If threshold is zero, let's merge all queries and save time later - let merged_query = queries.and_then(|qs| Self::merge_queries(qs, threshold)); - - let processed_sigs = AtomicUsize::new(0); - - #[cfg(feature = "parallel")] - let sig_iter = search_sigs.par_iter(); - - #[cfg(not(feature = "parallel"))] - let sig_iter = search_sigs.iter(); - - let filtered_sigs = sig_iter.enumerate().filter_map(|(dataset_id, filename)| { - let i = processed_sigs.fetch_add(1, Ordering::SeqCst); - if i % 1000 == 0 { - info!("Processed {} reference sigs", i); - } - - let search_sig = Signature::from_path(filename) - .unwrap_or_else(|_| panic!("Error processing {:?}", filename)) - .swap_remove(0); - - RevIndex::map_hashes_colors( - dataset_id, - &search_sig, - queries, - &merged_query, - threshold, - template, - ) - }); - - #[cfg(feature = "parallel")] - let (hash_to_color, colors) = filtered_sigs.reduce( - || (HashToColor::new(), Colors::default()), - HashToColor::reduce_hashes_colors, - ); - - #[cfg(not(feature = "parallel"))] - let (hash_to_color, colors) = filtered_sigs.fold( - (HashToColor::new(), Colors::default()), - HashToColor::reduce_hashes_colors, - ); - - // TODO: build this together with hash_to_idx? - let ref_sigs = if keep_sigs { - #[cfg(feature = "parallel")] - let sigs_iter = search_sigs.par_iter(); - - #[cfg(not(feature = "parallel"))] - let sigs_iter = search_sigs.iter(); - - Some( - sigs_iter - .map(|ref_path| { - Signature::from_path(ref_path) - .unwrap_or_else(|_| panic!("Error processing {:?}", ref_path)) - .swap_remove(0) - }) - .collect(), - ) - } else { - None - }; - - RevIndex { - hash_to_color, - sig_files: search_sigs.into(), - ref_sigs, - template: template.clone(), - colors, - // storage: Some(InnerStorage::new(MemStorage::default())), - } - } - - fn merge_queries(qs: &[KmerMinHash], threshold: usize) -> Option { - if threshold == 0 { - let mut merged = qs[0].clone(); - for query in &qs[1..] 
{ - merged.merge(query).unwrap(); - } - Some(merged) - } else { - None - } - } - - pub fn new_with_sigs( - search_sigs: Vec, - template: &Sketch, - threshold: usize, - queries: Option<&[KmerMinHash]>, - ) -> RevIndex { - // If threshold is zero, let's merge all queries and save time later - let merged_query = queries.and_then(|qs| Self::merge_queries(qs, threshold)); - - let processed_sigs = AtomicUsize::new(0); - - #[cfg(feature = "parallel")] - let sigs_iter = search_sigs.par_iter(); - #[cfg(not(feature = "parallel"))] - let sigs_iter = search_sigs.iter(); - - let filtered_sigs = sigs_iter.enumerate().filter_map(|(dataset_id, sig)| { - let i = processed_sigs.fetch_add(1, Ordering::SeqCst); - if i % 1000 == 0 { - info!("Processed {} reference sigs", i); - } - - RevIndex::map_hashes_colors( - dataset_id, - sig, - queries, - &merged_query, - threshold, - template, - ) - }); - - #[cfg(feature = "parallel")] - let (hash_to_color, colors) = filtered_sigs.reduce( - || (HashToColor::new(), Colors::default()), - HashToColor::reduce_hashes_colors, - ); - - #[cfg(not(feature = "parallel"))] - let (hash_to_color, colors) = filtered_sigs.fold( - (HashToColor::new(), Colors::default()), - HashToColor::reduce_hashes_colors, - ); - - RevIndex { - hash_to_color, - sig_files: vec![], - ref_sigs: search_sigs.into(), - template: template.clone(), - colors, - //storage: None, - } - } - - fn map_hashes_colors( - dataset_id: usize, - search_sig: &Signature, - queries: Option<&[KmerMinHash]>, - merged_query: &Option, - threshold: usize, - template: &Sketch, - ) -> Option<(HashToColor, Colors)> { - let mut search_mh = None; - if let Some(Sketch::MinHash(mh)) = search_sig.select_sketch(template) { - search_mh = Some(mh); - } - - let search_mh = search_mh.expect("Couldn't find a compatible MinHash"); - let mut hash_to_color = HashToColor::new(); - let mut colors = Colors::default(); - - if let Some(qs) = queries { - if let Some(ref merged) = merged_query { - let (matched_hashes, intersection) = merged.intersection(search_mh).unwrap(); - if !matched_hashes.is_empty() || intersection > threshold as u64 { - hash_to_color.add_to(&mut colors, dataset_id, matched_hashes); - } - } else { - for query in qs { - let (matched_hashes, intersection) = query.intersection(search_mh).unwrap(); - if !matched_hashes.is_empty() || intersection > threshold as u64 { - hash_to_color.add_to(&mut colors, dataset_id, matched_hashes); - } - } - } - } else { - let matched = search_mh.mins(); - let size = matched.len() as u64; - if !matched.is_empty() || size > threshold as u64 { - hash_to_color.add_to(&mut colors, dataset_id, matched); - } - }; - - if hash_to_color.is_empty() { - None - } else { - Some((hash_to_color, colors)) - } - } - - pub fn search( - &self, - counter: SigCounter, - similarity: bool, - threshold: usize, - ) -> Result, Box> { - let mut matches = vec![]; - if similarity { - unimplemented!("TODO: threshold correction") - } - - for (dataset_id, size) in counter.most_common() { - if size >= threshold { - matches.push(self.sig_files[dataset_id as usize].to_str().unwrap().into()); - } else { - break; - }; - } - Ok(matches) - } - - pub fn gather( - &self, - mut counter: SigCounter, - threshold: usize, - query: &KmerMinHash, - ) -> Result, Box> { - let mut match_size = usize::max_value(); - let mut matches = vec![]; - - while match_size > threshold && !counter.is_empty() { - let (dataset_id, size) = counter.most_common()[0]; - match_size = if size >= threshold { size } else { break }; - - let p; - let match_path = if 
self.sig_files.is_empty() { - p = PathBuf::new(); // TODO: Fix somehow? - &p - } else { - &self.sig_files[dataset_id as usize] - }; - - let ref_match; - let match_sig = if let Some(refsigs) = &self.ref_sigs { - &refsigs[dataset_id as usize] - } else { - // TODO: remove swap_remove - ref_match = Signature::from_path(match_path)?.swap_remove(0); - &ref_match - }; - - let mut match_mh = None; - if let Some(Sketch::MinHash(mh)) = match_sig.select_sketch(&self.template) { - match_mh = Some(mh); - } - let match_mh = match_mh.expect("Couldn't find a compatible MinHash"); - - // Calculate stats - let f_orig_query = match_size as f64 / query.size() as f64; - let f_match = match_size as f64 / match_mh.size() as f64; - let filename = match_path.to_str().unwrap().into(); - let name = match_sig.name(); - let unique_intersect_bp = match_mh.scaled() as usize * match_size; - let gather_result_rank = matches.len(); - - let (intersect_orig, _) = match_mh.intersection_size(query)?; - let intersect_bp = (match_mh.scaled() * intersect_orig) as usize; - - let f_unique_to_query = intersect_orig as f64 / query.size() as f64; - let match_ = match_sig.clone(); - - // TODO: all of these - let f_unique_weighted = 0.; - let average_abund = 0; - let median_abund = 0; - let std_abund = 0; - let md5 = "".into(); - let f_match_orig = 0.; - let remaining_bp = 0; - - let result = GatherResult { - intersect_bp, - f_orig_query, - f_match, - f_unique_to_query, - f_unique_weighted, - average_abund, - median_abund, - std_abund, - filename, - name, - md5, - match_, - f_match_orig, - unique_intersect_bp, - gather_result_rank, - remaining_bp, - }; - matches.push(result); - - // Prepare counter for finding the next match by decrementing - // all hashes found in the current match in other datasets - for hash in match_mh.iter_mins() { - if let Some(color) = self.hash_to_color.get(hash) { - for dataset in self.colors.indices(color) { - counter.entry(*dataset).and_modify(|e| { - if *e > 0 { - *e -= 1 - } - }); - } - } - } - counter.remove(&dataset_id); - } - Ok(matches) - } - - pub fn counter_for_query(&self, query: &KmerMinHash) -> SigCounter { - query - .iter_mins() - .filter_map(|hash| self.hash_to_color.get(hash)) - .flat_map(|color| self.colors.indices(color)) - .cloned() - .collect() - } - - pub fn template(&self) -> Sketch { - self.template.clone() - } - - // TODO: mh should be a sketch, or even a sig... 
- pub(crate) fn find_signatures( - &self, - mh: &KmerMinHash, - threshold: f64, - containment: bool, - _ignore_scaled: bool, - ) -> Result, Error> { - /* - let template_mh = None; - if let Sketch::MinHash(mh) = self.template { - template_mh = Some(mh); - }; - // TODO: throw error - let template_mh = template_mh.unwrap(); - - let tmp_mh; - let mh = if template_mh.scaled() > mh.scaled() { - // TODO: proper error here - tmp_mh = mh.downsample_scaled(self.scaled)?; - &tmp_mh - } else { - mh - }; - - if self.scaled < mh.scaled() && !ignore_scaled { - return Err(LcaDBError::ScaledMismatchError { - db: self.scaled, - query: mh.scaled(), - } - .into()); - } - */ - - // TODO: proper threshold calculation - let threshold: usize = (threshold * (mh.size() as f64)) as _; - - let counter = self.counter_for_query(mh); - - debug!( - "number of matching signatures for hashes: {}", - counter.len() - ); - - let mut results = vec![]; - for (dataset_id, size) in counter.most_common() { - let match_size = if size >= threshold { size } else { break }; - - let p; - let match_path = if self.sig_files.is_empty() { - p = PathBuf::new(); // TODO: Fix somehow? - &p - } else { - &self.sig_files[dataset_id as usize] - }; - - let ref_match; - let match_sig = if let Some(refsigs) = &self.ref_sigs { - &refsigs[dataset_id as usize] - } else { - // TODO: remove swap_remove - ref_match = Signature::from_path(match_path)?.swap_remove(0); - &ref_match - }; - - let mut match_mh = None; - if let Some(Sketch::MinHash(mh)) = match_sig.select_sketch(&self.template) { - match_mh = Some(mh); - } - let match_mh = match_mh.unwrap(); - - if size >= threshold { - let score = if containment { - size as f64 / mh.size() as f64 - } else { - size as f64 / (mh.size() + match_size - size) as f64 - }; - let filename = match_path.to_str().unwrap().into(); - let mut sig = match_sig.clone(); - sig.reset_sketches(); - sig.push(Sketch::MinHash(match_mh.clone())); - results.push((score, sig, filename)); - } else { - break; - }; - } - Ok(results) - } -} - -#[derive(CopyGetters, Getters, Setters, Serialize, Deserialize, Debug)] -pub struct GatherResult { - #[getset(get_copy = "pub")] - intersect_bp: usize, - - #[getset(get_copy = "pub")] - f_orig_query: f64, - - #[getset(get_copy = "pub")] - f_match: f64, - - f_unique_to_query: f64, - f_unique_weighted: f64, - average_abund: usize, - median_abund: usize, - std_abund: usize, - - #[getset(get = "pub")] - filename: String, - - #[getset(get = "pub")] - name: String, - - md5: String, - match_: Signature, - f_match_orig: f64, - unique_intersect_bp: usize, - gather_result_rank: usize, - remaining_bp: usize, -} - -impl GatherResult { - pub fn get_match(&self) -> Signature { - self.match_.clone() - } -} - -impl<'a> Index<'a> for RevIndex { - type Item = Signature; - - fn insert(&mut self, _node: Self::Item) -> Result<(), Error> { - unimplemented!() - } - - fn save>(&self, _path: P) -> Result<(), Error> { - unimplemented!() - } - - fn load>(_path: P) -> Result<(), Error> { - unimplemented!() - } - - fn len(&self) -> usize { - if let Some(refs) = &self.ref_sigs { - refs.len() - } else { - self.sig_files.len() - } - } - - fn signatures(&self) -> Vec { - if let Some(ref sigs) = self.ref_sigs { - sigs.to_vec() - } else { - unimplemented!() - } - } - - fn signature_refs(&self) -> Vec<&Self::Item> { - unimplemented!() - } -} - -#[cfg(test)] -mod test { - use super::*; - - use crate::sketch::minhash::max_hash_for_scaled; - - #[test] - fn revindex_new() { - let max_hash = max_hash_for_scaled(10000); - let template = 
Sketch::MinHash( - KmerMinHash::builder() - .num(0u32) - .ksize(31) - .max_hash(max_hash) - .build(), - ); - let search_sigs = [ - "../../tests/test-data/gather/GCF_000006945.2_ASM694v2_genomic.fna.gz.sig".into(), - "../../tests/test-data/gather/GCF_000007545.1_ASM754v1_genomic.fna.gz.sig".into(), - ]; - let index = RevIndex::new(&search_sigs, &template, 0, None, false); - assert_eq!(index.colors.len(), 3); - } - - #[test] - fn revindex_many() { - let max_hash = max_hash_for_scaled(10000); - let template = Sketch::MinHash( - KmerMinHash::builder() - .num(0u32) - .ksize(31) - .max_hash(max_hash) - .build(), - ); - let search_sigs = [ - "../../tests/test-data/gather/GCF_000006945.2_ASM694v2_genomic.fna.gz.sig".into(), - "../../tests/test-data/gather/GCF_000007545.1_ASM754v1_genomic.fna.gz.sig".into(), - "../../tests/test-data/gather/GCF_000008105.1_ASM810v1_genomic.fna.gz.sig".into(), - ]; - - let index = RevIndex::new(&search_sigs, &template, 0, None, false); - /* - dbg!(&index.colors.colors); - 0: 86 - 1: 132 - 2: 91 - (0, 1): 53 - (0, 2): 90 - (1, 2): 26 - (0, 1, 2): 261 - union: 739 - */ - //assert_eq!(index.colors.len(), 3); - assert_eq!(index.colors.len(), 7); - } -} diff --git a/src/core/src/index/revindex/disk_revindex.rs b/src/core/src/index/revindex/disk_revindex.rs new file mode 100644 index 0000000000..05efad9ecb --- /dev/null +++ b/src/core/src/index/revindex/disk_revindex.rs @@ -0,0 +1,513 @@ +use std::hash::{BuildHasher, BuildHasherDefault, Hash, Hasher}; +use std::path::Path; +use std::sync::atomic::{AtomicUsize, Ordering}; +use std::sync::Arc; + +use byteorder::{LittleEndian, WriteBytesExt}; +use log::{info, trace}; +use rayon::prelude::*; +use rocksdb::{ColumnFamilyDescriptor, MergeOperands, Options}; + +use crate::collection::{Collection, CollectionSet}; +use crate::encodings::{Color, Idx}; +use crate::index::revindex::{ + self as module, prepare_query, stats_for_cf, Datasets, DbStats, HashToColor, QueryColors, + RevIndexOps, DB, HASHES, MANIFEST, METADATA, STORAGE_SPEC, VERSION, +}; +use crate::index::{GatherResult, SigCounter}; +use crate::manifest::Manifest; +use crate::prelude::*; +use crate::signature::SigsTrait; +use crate::sketch::minhash::KmerMinHash; +use crate::sketch::Sketch; +use crate::storage::{InnerStorage, Storage}; +use crate::Result; + +const DB_VERSION: u8 = 1; + +fn compute_color(idxs: &Datasets) -> Color { + let s = BuildHasherDefault::<twox_hash::Xxh3Hash128>::default(); + let mut hasher = s.build_hasher(); + idxs.hash(&mut hasher); + hasher.finish() +} + +#[derive(Clone)] +pub struct RevIndex { + db: Arc<DB>, + collection: Arc<CollectionSet>, +} + +fn merge_datasets( + _: &[u8], + existing_val: Option<&[u8]>, + operands: &MergeOperands, +) -> Option<Vec<u8>> { + let mut datasets = existing_val + .and_then(Datasets::from_slice) + .unwrap_or_default(); + + for op in operands { + let new_vals = Datasets::from_slice(op).unwrap(); + datasets.union(new_vals); + } + // TODO: optimization! if nothing changed, skip as_bytes() + datasets.as_bytes() +} + +/* TODO: need the repair_cf variant, not available in rocksdb-rust yet +pub fn repair(path: &Path) { + let opts = db_options(); + + DB::repair(&opts, path).unwrap() +} +*/
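RocksDB may hand `merge_datasets` any grouping of operands (full merges, partial merges, compaction-time batches), so the operator has to behave like an associative, commutative set union over `Datasets`. A property-style check of that requirement against a plain `HashSet` model (hypothetical test, not part of the patch):

#[cfg(test)]
mod merge_properties {
    use std::collections::HashSet;

    // Stand-in for Datasets::union over serialized operands.
    fn union(a: &HashSet<u32>, b: &HashSet<u32>) -> HashSet<u32> {
        a.union(b).copied().collect()
    }

    #[test]
    fn union_is_associative_and_commutative() {
        let a: HashSet<u32> = [1, 2].into();
        let b: HashSet<u32> = [2, 3].into();
        let c: HashSet<u32> = [3, 4].into();
        assert_eq!(union(&union(&a, &b), &c), union(&a, &union(&b, &c)));
        assert_eq!(union(&a, &b), union(&b, &a));
    }
}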
+impl RevIndex { + pub fn create(path: &Path, collection: CollectionSet) -> Result<module::RevIndex> { + let mut opts = module::RevIndex::db_options(); + opts.create_if_missing(true); + opts.create_missing_column_families(true); + opts.prepare_for_bulk_load(); + + // prepare column family descriptors + let cfs = cf_descriptors(); + + let db = Arc::new(DB::open_cf_descriptors(&opts, path, cfs).unwrap()); + + let processed_sigs = AtomicUsize::new(0); + + let index = Self { + db, + collection: Arc::new(collection), + }; + + index.collection.par_iter().for_each(|(dataset_id, _)| { + let i = processed_sigs.fetch_add(1, Ordering::SeqCst); + if i % 1000 == 0 { + info!("Processed {} reference sigs", i); + } + + index.map_hashes_colors(dataset_id as Idx); + }); + + index.save_collection().expect("Error saving collection"); + + info!("Compact SSTs"); + index.compact(); + info!("Processed {} reference sigs", processed_sigs.into_inner()); + + Ok(module::RevIndex::Plain(index)) + } + + pub fn open<P: AsRef<Path>>(path: P, read_only: bool) -> Result<module::RevIndex> { + let mut opts = module::RevIndex::db_options(); + if !read_only { + opts.prepare_for_bulk_load(); + } + + // prepare column family descriptors + let cfs = cf_descriptors(); + + let db = if read_only { + Arc::new(DB::open_cf_descriptors_read_only( + &opts, + path.as_ref(), + cfs, + false, + )?) + } else { + Arc::new(DB::open_cf_descriptors(&opts, path.as_ref(), cfs)?) + }; + + let collection = Arc::new(Self::load_collection_from_rocksdb(db.clone())?); + + Ok(module::RevIndex::Plain(Self { db, collection })) + } + + fn load_collection_from_rocksdb(db: Arc<DB>) -> Result<CollectionSet> { + let cf_metadata = db.cf_handle(METADATA).unwrap(); + + let rdr = db.get_cf(&cf_metadata, VERSION)?.unwrap(); + assert_eq!(rdr[0], DB_VERSION); + + let rdr = db.get_cf(&cf_metadata, MANIFEST)?.unwrap(); + let manifest = Manifest::from_reader(&rdr[..])?; + + let spec = String::from_utf8(db.get_cf(&cf_metadata, STORAGE_SPEC)?.unwrap()) + .expect("invalid utf-8"); + + let storage = if spec == "rocksdb://" { + todo!("init storage from db") + } else { + InnerStorage::from_spec(spec)? + }; + + Collection::new(manifest, storage).try_into() + } + + fn save_collection(&self) -> Result<()> { + let cf_metadata = self.db.cf_handle(METADATA).unwrap(); + + // save DB version + // TODO: probably should go together with a more general + // saving procedure used in create/update + self.db.put_cf(&cf_metadata, VERSION, &[DB_VERSION])?; + + // write manifest + let mut wtr = vec![]; + { + self.collection.manifest().to_writer(&mut wtr)?; + } + self.db.put_cf(&cf_metadata, MANIFEST, &wtr[..])?; + + // write storage spec + let spec = self.collection.storage().spec(); + + // TODO: check if spec is memstorage, would probably have to + // save into rocksdb in that case! + + self.db.put_cf(&cf_metadata, STORAGE_SPEC, spec)?; + + Ok(()) + }
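`save_collection` and `load_collection_from_rocksdb` round-trip everything needed to reopen the index without the original file list. Summarizing the resulting METADATA column family layout from the code above:

// METADATA column family (keys are the constants from revindex/mod.rs):
//   VERSION      ("version")      -> [DB_VERSION], a single format byte
//   MANIFEST     ("manifest")     -> Manifest bytes from Manifest::to_writer
//   STORAGE_SPEC ("storage_spec") -> Storage::spec() string describing how to
//                                    rebuild the InnerStorage on open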
+ fn map_hashes_colors(&self, dataset_id: Idx) { + let search_sig = self + .collection + .sig_for_dataset(dataset_id) + .expect("Couldn't find a compatible Signature"); + let search_mh = &search_sig.sketches()[0]; + + let colors = Datasets::new(&[dataset_id]).as_bytes().unwrap(); + + let cf_hashes = self.db.cf_handle(HASHES).unwrap(); + + let hashes = match search_mh { + Sketch::MinHash(mh) => mh.mins(), + Sketch::LargeMinHash(mh) => mh.mins(), + _ => unimplemented!(), + }; + + let mut hash_bytes = [0u8; 8]; + for hash in hashes { + (&mut hash_bytes[..]) + .write_u64::<LittleEndian>(hash) + .expect("error writing bytes"); + self.db + .merge_cf(&cf_hashes, &hash_bytes[..], colors.as_slice()) + .expect("error merging"); + } + } +} + +impl RevIndexOps for RevIndex { + fn counter_for_query(&self, query: &KmerMinHash) -> SigCounter { + info!("Collecting hashes"); + let cf_hashes = self.db.cf_handle(HASHES).unwrap(); + let hashes_iter = query.iter_mins().map(|hash| { + let mut v = vec![0_u8; 8]; + (&mut v[..]) + .write_u64::<LittleEndian>(*hash) + .expect("error writing bytes"); + (&cf_hashes, v) + }); + + info!("Multi get"); + self.db + .multi_get_cf(hashes_iter) + .into_iter() + .filter_map(|r| r.ok().unwrap_or(None)) + .flat_map(|raw_datasets| { + let new_vals = Datasets::from_slice(&raw_datasets).unwrap(); + new_vals.into_iter() + }) + .collect() + }
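Hash keys are fixed-width 8-byte little-endian values, so the exact same encoding must be used on write (`merge_cf` in `map_hashes_colors`) and on lookup (`multi_get_cf` above); mixing endianness would silently return no matches. A round-trip sketch using the `byteorder` API this file already imports:

use byteorder::{ByteOrder, LittleEndian};

// Encode a MinHash value as a RocksDB key, exactly as the index does.
fn encode_hash(hash: u64) -> [u8; 8] {
    let mut key = [0u8; 8];
    LittleEndian::write_u64(&mut key, hash);
    key
}

fn main() {
    let key = encode_hash(0xdead_beef);
    assert_eq!(LittleEndian::read_u64(&key), 0xdead_beef);
}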
+ fn prepare_gather_counters( + &self, + query: &KmerMinHash, + ) -> (SigCounter, QueryColors, HashToColor) { + let cf_hashes = self.db.cf_handle(HASHES).unwrap(); + let hashes_iter = query.iter_mins().map(|hash| { + let mut v = vec![0_u8; 8]; + (&mut v[..]) + .write_u64::<LittleEndian>(*hash) + .expect("error writing bytes"); + (&cf_hashes, v) + }); + + /* + build a HashToColors for query, + and a QueryColors (Color -> Datasets) mapping. + Loading Datasets from rocksdb for every hash takes too long. + */ + let mut query_colors: QueryColors = Default::default(); + let mut counter: SigCounter = Default::default(); + + info!("Building hash_to_colors and query_colors"); + let hash_to_colors = query + .iter_mins() + .zip(self.db.multi_get_cf(hashes_iter)) + .filter_map(|(k, r)| { + let raw = r.ok().unwrap_or(None); + raw.map(|raw| { + let new_vals = Datasets::from_slice(&raw).unwrap(); + let color = compute_color(&new_vals); + query_colors + .entry(color) + .or_insert_with(|| new_vals.clone()); + counter.update(new_vals); + (*k, color) + }) + }) + .collect(); + + (counter, query_colors, hash_to_colors) + } + + fn matches_from_counter(&self, counter: SigCounter, threshold: usize) -> Vec<(String, usize)> { + info!("get matches from counter"); + counter + .most_common() + .into_iter() + .filter_map(|(dataset_id, size)| { + if size >= threshold { + let row = &self + .collection + .record_for_dataset(dataset_id) + .expect("dataset not found"); + Some((row.name().into(), size)) + } else { + None + } + }) + .collect() + } + + fn gather( + &self, + mut counter: SigCounter, + query_colors: QueryColors, + hash_to_color: HashToColor, + threshold: usize, + orig_query: &KmerMinHash, + selection: Option<Selection>, + ) -> Result<Vec<GatherResult>> { + let mut match_size = usize::max_value(); + let mut matches = vec![]; + //let mut query: KmerMinHashBTree = orig_query.clone().into(); + let selection = selection.unwrap_or_else(|| self.collection.selection()); + + while match_size > threshold && !counter.is_empty() { + trace!("counter len: {}", counter.len()); + trace!("match size: {}", match_size); + + let (dataset_id, size) = counter.k_most_common_ordered(1)[0]; + match_size = if size >= threshold { size } else { break }; + + let match_sig = self.collection.sig_for_dataset(dataset_id)?; + + // Calculate stats + let f_orig_query = match_size as f64 / orig_query.size() as f64; + let name = match_sig.name(); + let gather_result_rank = matches.len(); + let match_ = match_sig.clone(); + let md5 = match_sig.md5sum(); + + let match_mh = prepare_query(match_sig.into(), &selection) + .expect("Couldn't find a compatible MinHash"); + let f_match = match_size as f64 / match_mh.size() as f64; + let unique_intersect_bp = match_mh.scaled() as usize * match_size; + let (intersect_orig, _) = match_mh.intersection_size(orig_query)?; + let intersect_bp = (match_mh.scaled() * intersect_orig) as usize; + let f_unique_to_query = intersect_orig as f64 / orig_query.size() as f64; + + // TODO: all of these + let filename = "".into(); + let f_unique_weighted = 0.; + let average_abund = 0; + let median_abund = 0; + let std_abund = 0; + let f_match_orig = 0.; + let remaining_bp = 0; + + let result = GatherResult::builder() + .intersect_bp(intersect_bp) + .f_orig_query(f_orig_query) + .f_match(f_match) + .f_unique_to_query(f_unique_to_query) + .f_unique_weighted(f_unique_weighted) + .average_abund(average_abund) + .median_abund(median_abund) + .std_abund(std_abund) + .filename(filename) + .name(name) + .md5(md5) + .match_(match_.into()) + .f_match_orig(f_match_orig) + .unique_intersect_bp(unique_intersect_bp) + .gather_result_rank(gather_result_rank) + .remaining_bp(remaining_bp) + .build(); + matches.push(result); + + trace!("Preparing counter for next round"); + // Prepare counter for finding the next match by decrementing + // all hashes found in the current match in other datasets + // TODO: not used at the moment, so just skip. + //query.remove_many(match_mh.to_vec().as_slice())?; + + // TODO: Use HashesToColors here instead.
If not initialized, + // build it. + match_mh + .iter_mins() + .filter_map(|hash| hash_to_color.get(hash)) + .flat_map(|color| { + // TODO: remove this clone + query_colors.get(color).unwrap().clone().into_iter() + }) + .for_each(|dataset| { + // TODO: collect the flat_map into a Counter, and remove more + // than one at a time... + counter.entry(dataset).and_modify(|e| { + if *e > 0 { + *e -= 1 + } + }); + }); + + counter.remove(&dataset_id); + } + Ok(matches) + } + + fn update(mut self, collection: CollectionSet) -> Result<module::RevIndex> { + // TODO: verify new collection manifest is a superset of current one, + // and the initial chunk is the same + let to_skip = self.collection.check_superset(&collection)?; + + // process the remainder + let processed_sigs = AtomicUsize::new(0); + + self.collection = Arc::new(collection); + + self.collection + .par_iter() + .skip(to_skip) + .for_each(|(dataset_id, _)| { + let i = processed_sigs.fetch_add(1, Ordering::SeqCst); + if i % 1000 == 0 { + info!("Processed {} reference sigs", i); + } + + self.map_hashes_colors(dataset_id as Idx); + }); + + self.save_collection().expect("Error saving collection"); + + info!("Compact SSTs"); + self.compact(); + + info!( + "Processed additional {} reference sigs", + processed_sigs.into_inner() + ); + + Ok(module::RevIndex::Plain(self)) + } + + fn check(&self, quick: bool) -> DbStats { + stats_for_cf(self.db.clone(), HASHES, true, quick) + } + + fn compact(&self) { + for cf_name in [HASHES, METADATA] { + let cf = self.db.cf_handle(cf_name).unwrap(); + self.db.compact_range_cf(&cf, None::<&[u8]>, None::<&[u8]>) + } + } + + fn flush(&self) -> Result<()> { + self.db.flush_wal(true)?; + + for cf_name in [HASHES, METADATA] { + let cf = self.db.cf_handle(cf_name).unwrap(); + self.db.flush_cf(&cf)?; + } + + Ok(()) + } + + fn convert(&self, _output_db: module::RevIndex) -> Result<()> { + todo!() + /* + if let RevIndex::Color(db) = output_db { + let other_db = db.db; + + let cf_hashes = self.db.cf_handle(HASHES).unwrap(); + + info!("start converting colors"); + let mut color_bytes = [0u8; 8]; + let iter = self + .db + .iterator_cf(&cf_hashes, rocksdb::IteratorMode::Start); + for (key, value) in iter { + let datasets = Datasets::from_slice(&value).unwrap(); + let new_idx: Vec<_> = datasets.into_iter().collect(); + let new_color = Colors::update(other_db.clone(), None, new_idx.as_slice()).unwrap(); + + (&mut color_bytes[..]) + .write_u64::<LittleEndian>(new_color) + .expect("error writing bytes"); + other_db + .put_cf(&cf_hashes, &key[..], &color_bytes[..]) + .unwrap(); + } + info!("finished converting colors"); + + info!("copying sigs to output"); + let cf_sigs = self.db.cf_handle(SIGS).unwrap(); + let iter = self.db.iterator_cf(&cf_sigs, rocksdb::IteratorMode::Start); + for (key, value) in iter { + other_db.put_cf(&cf_sigs, &key[..], &value[..]).unwrap(); + } + info!("finished copying sigs to output"); + + Ok(()) + } else { + todo!() + } + */ + } +}
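The `RevIndexOps` methods above assume a build-once, query-many lifecycle: `create` bulk-loads and compacts, later runs reopen read-only. A hypothetical end-to-end flow (paths and threshold are placeholders; `RevIndexOps` must be in scope for the enum-dispatched calls):

use std::path::Path;

use sourmash::collection::CollectionSet;
use sourmash::index::revindex::{disk_revindex, RevIndexOps};
use sourmash::sketch::minhash::KmerMinHash;

fn demo(collection: CollectionSet, query: &KmerMinHash) -> sourmash::Result<()> {
    // One-time build (compacts SSTs before returning):
    disk_revindex::RevIndex::create(Path::new("index.rocksdb"), collection)?;

    // Later: cheap read-only open, then counter-based matching.
    let index = disk_revindex::RevIndex::open("index.rocksdb", true)?;
    let counter = index.counter_for_query(query);
    for (name, size) in index.matches_from_counter(counter, 10) {
        println!("{name}: {size} hashes in common");
    }
    Ok(())
}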
+fn cf_descriptors() -> Vec<ColumnFamilyDescriptor> { + let mut cfopts = Options::default(); + cfopts.set_max_write_buffer_number(16); + cfopts.set_merge_operator_associative("datasets operator", merge_datasets); + cfopts.set_min_write_buffer_number_to_merge(10); + + // Updated default from + // https://github.com/facebook/rocksdb/wiki/Setup-Options-and-Basic-Tuning#other-general-options + cfopts.set_level_compaction_dynamic_level_bytes(true); + + let cf_hashes = ColumnFamilyDescriptor::new(HASHES, cfopts); + + let mut cfopts = Options::default(); + cfopts.set_max_write_buffer_number(16); + // Updated default + cfopts.set_level_compaction_dynamic_level_bytes(true); + //cfopts.set_merge_operator_associative("colors operator", merge_colors); + + let cf_metadata = ColumnFamilyDescriptor::new(METADATA, cfopts); + + let mut cfopts = Options::default(); + cfopts.set_max_write_buffer_number(16); + // Updated default + cfopts.set_level_compaction_dynamic_level_bytes(true); + //cfopts.set_merge_operator_associative("colors operator", merge_colors); + + vec![cf_hashes, cf_metadata] +} diff --git a/src/core/src/index/revindex/mem_revindex.rs b/src/core/src/index/revindex/mem_revindex.rs new file mode 100644 index 0000000000..5264c8550d --- /dev/null +++ b/src/core/src/index/revindex/mem_revindex.rs @@ -0,0 +1,461 @@ +use std::sync::atomic::{AtomicUsize, Ordering}; + +use camino::Utf8Path as Path; +use camino::Utf8PathBuf as PathBuf; +use log::{debug, info}; + +#[cfg(feature = "parallel")] +use rayon::prelude::*; + +use crate::collection::Collection; +use crate::encodings::{Colors, Idx}; +use crate::index::linear::LinearIndex; +use crate::index::revindex::HashToColor; +use crate::index::{GatherResult, Index, SigCounter}; +use crate::prelude::*; +use crate::signature::{Signature, SigsTrait}; +use crate::sketch::minhash::KmerMinHash; +use crate::sketch::Sketch; +use crate::Result; + +pub struct RevIndex { + linear: LinearIndex, + hash_to_color: HashToColor, + colors: Colors, +} + +impl LinearIndex { + fn index( + self, + threshold: usize, + merged_query: Option<KmerMinHash>, + queries: Option<&[KmerMinHash]>, + ) -> RevIndex { + let processed_sigs = AtomicUsize::new(0); + + #[cfg(feature = "parallel")] + let sig_iter = self.collection().par_iter(); + + #[cfg(not(feature = "parallel"))] + let sig_iter = self.collection().iter(); + + let filtered_sigs = sig_iter.enumerate().filter_map(|(dataset_id, _)| { + let i = processed_sigs.fetch_add(1, Ordering::SeqCst); + if i % 1000 == 0 { + info!("Processed {} reference sigs", i); + } + + let search_sig = self + .collection() + .sig_for_dataset(dataset_id as Idx) + .expect("Error loading sig") + .into(); + + RevIndex::map_hashes_colors( + dataset_id as Idx, + &search_sig, + queries, + &merged_query, + threshold, + self.template(), + ) + }); + + #[cfg(feature = "parallel")] + let (hash_to_color, colors) = filtered_sigs.reduce( + || (HashToColor::new(), Colors::default()), + HashToColor::reduce_hashes_colors, + ); + + #[cfg(not(feature = "parallel"))] + let (hash_to_color, colors) = filtered_sigs.fold( + (HashToColor::new(), Colors::default()), + HashToColor::reduce_hashes_colors, + ); + + RevIndex { + hash_to_color, + colors, + linear: self, + } + } +} + +impl RevIndex { + pub fn new( + search_sigs: &[PathBuf], + selection: &Selection, + threshold: usize, + queries: Option<&[KmerMinHash]>, + _keep_sigs: bool, + ) -> Result<RevIndex> { + // If threshold is zero, let's merge all queries and save time later + let merged_query = queries.and_then(|qs| Self::merge_queries(qs, threshold)); + + let collection = Collection::from_paths(search_sigs)?.select(&selection)?; + let linear = LinearIndex::from_collection(collection.try_into()?); + + Ok(linear.index(threshold, merged_query, queries)) + } + + pub fn from_zipfile<P: AsRef<Path>>( + zipfile: P, + selection: &Selection, + threshold: usize, + queries: Option<&[KmerMinHash]>, + _keep_sigs: bool, + ) -> Result<RevIndex> { + // If threshold is zero, let's merge all queries and save time later + let merged_query = queries.and_then(|qs| Self::merge_queries(qs, threshold)); + + let collection = Collection::from_zipfile(zipfile)?.select(&selection)?;
+ let linear = LinearIndex::from_collection(collection.try_into()?); + + Ok(linear.index(threshold, merged_query, queries)) + } + + fn merge_queries(qs: &[KmerMinHash], threshold: usize) -> Option<KmerMinHash> { + if threshold == 0 { + let mut merged = qs[0].clone(); + for query in &qs[1..] { + merged.merge(query).unwrap(); + } + Some(merged) + } else { + None + } + } + + pub fn new_with_sigs( + search_sigs: Vec<Signature>, + selection: &Selection, + threshold: usize, + queries: Option<&[KmerMinHash]>, + ) -> Result<RevIndex> { + // If threshold is zero, let's merge all queries and save time later + let merged_query = queries.and_then(|qs| Self::merge_queries(qs, threshold)); + + let collection = Collection::from_sigs(search_sigs)?.select(selection)?; + let linear = LinearIndex::from_collection(collection.try_into()?); + + let idx = linear.index(threshold, merged_query, queries); + + Ok(idx) + } + + fn map_hashes_colors( + dataset_id: Idx, + search_sig: &Signature, + queries: Option<&[KmerMinHash]>, + merged_query: &Option<KmerMinHash>, + threshold: usize, + template: &Sketch, + ) -> Option<(HashToColor, Colors)> { + let mut search_mh = None; + if let Some(Sketch::MinHash(mh)) = search_sig.select_sketch(template) { + search_mh = Some(mh); + } + + let search_mh = search_mh.expect("Couldn't find a compatible MinHash"); + let mut hash_to_color = HashToColor::new(); + let mut colors = Colors::default(); + + if let Some(qs) = queries { + if let Some(ref merged) = merged_query { + let (matched_hashes, intersection) = merged.intersection(search_mh).unwrap(); + if !matched_hashes.is_empty() || intersection > threshold as u64 { + hash_to_color.add_to(&mut colors, dataset_id, matched_hashes); + } + } else { + for query in qs { + let (matched_hashes, intersection) = query.intersection(search_mh).unwrap(); + if !matched_hashes.is_empty() || intersection > threshold as u64 { + hash_to_color.add_to(&mut colors, dataset_id, matched_hashes); + } + } + } + } else { + let matched = search_mh.mins(); + let size = matched.len() as u64; + if !matched.is_empty() || size > threshold as u64 { + hash_to_color.add_to(&mut colors, dataset_id, matched); + } + }; + + if hash_to_color.is_empty() { + None + } else { + Some((hash_to_color, colors)) + } + } + + pub fn search( + &self, + counter: SigCounter, + similarity: bool, + threshold: usize, + ) -> Result<Vec<String>> { + self.linear.search(counter, similarity, threshold) + } + + pub fn gather( + &self, + mut counter: SigCounter, + threshold: usize, + query: &KmerMinHash, + ) -> Result<Vec<GatherResult>> { + let mut match_size = usize::max_value(); + let mut matches = vec![]; + + while match_size > threshold && !counter.is_empty() { + let (dataset_id, size) = counter.most_common()[0]; + match_size = if size >= threshold { size } else { break }; + let result = self + .linear + .gather_round(dataset_id, match_size, query, matches.len())?; + if let Some(Sketch::MinHash(match_mh)) = + result.match_.select_sketch(self.linear.template()) + { + // Prepare counter for finding the next match by decrementing + // all hashes found in the current match in other datasets + for hash in match_mh.iter_mins() { + if let Some(color) = self.hash_to_color.get(hash) { + counter.subtract(self.colors.indices(color).cloned()); + } + } + counter.remove(&dataset_id); + matches.push(result); + } else { + unimplemented!() + } + } + Ok(matches) + } + + pub fn template(&self) -> Sketch { + self.linear.template().clone() + } + + // TODO: mh should be a sketch, or even a sig...
+ pub(crate) fn find_signatures( + &self, + mh: &KmerMinHash, + threshold: f64, + containment: bool, + _ignore_scaled: bool, + ) -> Result> { + // TODO: proper threshold calculation + let threshold: usize = (threshold * (mh.size() as f64)) as _; + + let counter = self.counter_for_query(mh); + + debug!( + "number of matching signatures for hashes: {}", + counter.len() + ); + + let mut results = vec![]; + for (dataset_id, size) in counter.most_common() { + let match_size = if size >= threshold { size } else { break }; + + let match_sig = self.linear.sig_for_dataset(dataset_id)?; + let match_path = self + .linear + .collection() + .record_for_dataset(dataset_id)? + .internal_location(); + + let mut match_mh = None; + if let Some(Sketch::MinHash(mh)) = match_sig.select_sketch(self.linear.template()) { + match_mh = Some(mh); + } + let match_mh = match_mh.unwrap(); + + if size >= threshold { + let score = if containment { + size as f64 / mh.size() as f64 + } else { + size as f64 / (mh.size() + match_size - size) as f64 + }; + let filename = match_path.to_string(); + let mut sig: Signature = match_sig.clone().into(); + sig.reset_sketches(); + sig.push(Sketch::MinHash(match_mh.clone())); + results.push((score, sig, filename)); + } else { + break; + }; + } + Ok(results) + } + + pub fn counter_for_query(&self, query: &KmerMinHash) -> SigCounter { + query + .iter_mins() + .filter_map(|hash| self.hash_to_color.get(hash)) + .flat_map(|color| self.colors.indices(color)) + .cloned() + .collect() + } +} + +impl<'a> Index<'a> for RevIndex { + type Item = Signature; + + fn insert(&mut self, _node: Self::Item) -> Result<()> { + unimplemented!() + } + + fn save>(&self, _path: P) -> Result<()> { + unimplemented!() + } + + fn load>(_path: P) -> Result<()> { + unimplemented!() + } + + fn len(&self) -> usize { + self.linear.len() + } + + fn signatures(&self) -> Vec { + self.linear + .signatures() + .into_iter() + .map(|sig| sig.into()) + .collect() + } + + fn signature_refs(&self) -> Vec<&Self::Item> { + unimplemented!() + } +} + +#[cfg(test)] +mod test { + use super::*; + + use crate::index::revindex::prepare_query; + use crate::Result; + + #[test] + fn revindex_new() -> Result<()> { + let selection = Selection::builder().ksize(31).scaled(10000).build(); + let search_sigs = [ + "../../tests/test-data/gather/GCF_000006945.2_ASM694v2_genomic.fna.gz.sig".into(), + "../../tests/test-data/gather/GCF_000007545.1_ASM754v1_genomic.fna.gz.sig".into(), + ]; + let index = RevIndex::new(&search_sigs, &selection, 0, None, false)?; + assert_eq!(index.colors.len(), 3); + + Ok(()) + } + + #[test] + fn revindex_many() -> Result<()> { + let selection = Selection::builder().ksize(31).scaled(10000).build(); + let search_sigs = [ + "../../tests/test-data/gather/GCF_000006945.2_ASM694v2_genomic.fna.gz.sig".into(), + "../../tests/test-data/gather/GCF_000007545.1_ASM754v1_genomic.fna.gz.sig".into(), + "../../tests/test-data/gather/GCF_000008105.1_ASM810v1_genomic.fna.gz.sig".into(), + ]; + + let index = RevIndex::new(&search_sigs, &selection, 0, None, false)?; + //dbg!(&index.linear.collection().manifest); + /* + dbg!(&index.colors.colors); + 0: 86 + 1: 132 + 2: 91 + (0, 1): 53 + (0, 2): 90 + (1, 2): 26 + (0, 1, 2): 261 + union: 739 + + */ + //assert_eq!(index.colors.len(), 3); + assert_eq!(index.colors.len(), 7); + + Ok(()) + } + + #[test] + fn revindex_from_sigs() -> Result<()> { + let selection = Selection::builder().ksize(31).scaled(10000).build(); + let search_sigs: Vec = [ + 
"../../tests/test-data/gather/GCF_000006945.2_ASM694v2_genomic.fna.gz.sig", + "../../tests/test-data/gather/GCF_000007545.1_ASM754v1_genomic.fna.gz.sig", + "../../tests/test-data/gather/GCF_000008105.1_ASM810v1_genomic.fna.gz.sig", + ] + .into_iter() + .map(|path| Signature::from_path(path).unwrap().swap_remove(0)) + .collect(); + + let index = RevIndex::new_with_sigs(search_sigs, &selection, 0, None)?; + /* + dbg!(&index.colors.colors); + 0: 86 + 1: 132 + 2: 91 + (0, 1): 53 + (0, 2): 90 + (1, 2): 26 + (0, 1, 2): 261 + union: 739 + */ + //assert_eq!(index.colors.len(), 3); + assert_eq!(index.colors.len(), 7); + + Ok(()) + } + + #[test] + fn revindex_from_zipstorage() -> Result<()> { + let selection = Selection::builder() + .ksize(19) + .scaled(100) + .moltype(crate::encodings::HashFunctions::Murmur64Protein) + .build(); + let index = RevIndex::from_zipfile( + "../../tests/test-data/prot/protein.zip", + &selection, + 0, + None, + false, + ) + .expect("error building from ziptorage"); + + assert_eq!(index.colors.len(), 3); + + let query_sig = Signature::from_path( + "../../tests/test-data/prot/protein/GCA_001593925.1_ASM159392v1_protein.faa.gz.sig", + ) + .expect("Error processing query") + .swap_remove(0) + .select(&selection)?; + + let mut query_mh = None; + if let Some(q) = prepare_query(query_sig, &selection) { + query_mh = Some(q); + } + let query_mh = query_mh.expect("Couldn't find a compatible MinHash"); + + let counter_rev = index.counter_for_query(&query_mh); + let counter_lin = index.linear.counter_for_query(&query_mh); + + let results_rev = index.search(counter_rev, false, 0).unwrap(); + let results_linear = index.linear.search(counter_lin, false, 0).unwrap(); + assert_eq!(results_rev, results_linear); + + let counter_rev = index.counter_for_query(&query_mh); + let counter_lin = index.linear.counter_for_query(&query_mh); + + let results_rev = index.gather(counter_rev, 0, &query_mh).unwrap(); + let results_linear = index.linear.gather(counter_lin, 0, &query_mh).unwrap(); + assert_eq!(results_rev.len(), 1); + assert_eq!(results_rev, results_linear); + + Ok(()) + } +} diff --git a/src/core/src/index/revindex/mod.rs b/src/core/src/index/revindex/mod.rs new file mode 100644 index 0000000000..0765ee71d9 --- /dev/null +++ b/src/core/src/index/revindex/mod.rs @@ -0,0 +1,590 @@ +pub mod disk_revindex; +pub mod mem_revindex; + +use std::collections::HashMap; +use std::hash::{Hash, Hasher}; +use std::path::Path; +use std::sync::Arc; + +use byteorder::{LittleEndian, WriteBytesExt}; +use enum_dispatch::enum_dispatch; +use getset::{Getters, Setters}; +use nohash_hasher::BuildNoHashHasher; +use roaring::RoaringBitmap; +use serde::{Deserialize, Serialize}; + +use crate::collection::CollectionSet; +use crate::encodings::{Color, Colors, Idx}; +use crate::index::{GatherResult, SigCounter}; +use crate::prelude::*; +use crate::signature::Signature; +use crate::sketch::minhash::KmerMinHash; +use crate::sketch::Sketch; +use crate::HashIntoType; +use crate::Result; + +type DB = rocksdb::DBWithThreadMode; + +type QueryColors = HashMap; +type HashToColorT = HashMap>; +#[derive(Serialize, Deserialize)] +pub struct HashToColor(HashToColorT); + +// Column families +const HASHES: &str = "hashes"; +const COLORS: &str = "colors"; +const METADATA: &str = "metadata"; + +// DB metadata saved in the METADATA column family +const MANIFEST: &str = "manifest"; +const STORAGE_SPEC: &str = "storage_spec"; +const VERSION: &str = "version"; + +#[enum_dispatch(RevIndexOps)] +pub enum RevIndex { + 
+    //Color(color_revindex::ColorRevIndex),
+    Plain(disk_revindex::RevIndex),
+    //Mem(mem_revindex::RevIndex),
+}
+
+#[enum_dispatch]
+pub trait RevIndexOps {
+    /* TODO: need the repair_cf variant, not available in rocksdb-rust yet
+    pub fn repair(index: &Path, colors: bool);
+    */
+
+    fn counter_for_query(&self, query: &KmerMinHash) -> SigCounter;
+
+    fn matches_from_counter(&self, counter: SigCounter, threshold: usize) -> Vec<(String, usize)>;
+
+    fn prepare_gather_counters(
+        &self,
+        query: &KmerMinHash,
+    ) -> (SigCounter, QueryColors, HashToColor);
+
+    fn update(self, collection: CollectionSet) -> Result<RevIndex>
+    where
+        Self: Sized;
+
+    fn compact(&self);
+
+    fn flush(&self) -> Result<()>;
+
+    fn convert(&self, output_db: RevIndex) -> Result<()>;
+
+    fn check(&self, quick: bool) -> DbStats;
+
+    fn gather(
+        &self,
+        counter: SigCounter,
+        query_colors: QueryColors,
+        hash_to_color: HashToColor,
+        threshold: usize,
+        query: &KmerMinHash,
+        selection: Option<Selection>,
+    ) -> Result<Vec<GatherResult>>;
+}
+
+impl HashToColor {
+    fn new() -> Self {
+        HashToColor(HashMap::<
+            HashIntoType,
+            Color,
+            BuildNoHashHasher<HashIntoType>,
+        >::with_hasher(BuildNoHashHasher::default()))
+    }
+
+    fn get(&self, hash: &HashIntoType) -> Option<&Color> {
+        self.0.get(hash)
+    }
+
+    fn len(&self) -> usize {
+        self.0.len()
+    }
+
+    fn is_empty(&self) -> bool {
+        self.0.is_empty()
+    }
+
+    fn add_to(&mut self, colors: &mut Colors, dataset_id: Idx, matched_hashes: Vec<u64>) {
+        let mut color = None;
+
+        matched_hashes.into_iter().for_each(|hash| {
+            color = Some(colors.update(color, &[dataset_id]).unwrap());
+            self.0.insert(hash, color.unwrap());
+        });
+    }
+
+    fn reduce_hashes_colors(
+        a: (HashToColor, Colors),
+        b: (HashToColor, Colors),
+    ) -> (HashToColor, Colors) {
+        let ((small_hashes, small_colors), (mut large_hashes, mut large_colors)) =
+            if a.0.len() > b.0.len() {
+                (b, a)
+            } else {
+                (a, b)
+            };
+
+        small_hashes.0.into_iter().for_each(|(hash, color)| {
+            large_hashes
+                .0
+                .entry(hash)
+                .and_modify(|entry| {
+                    // Hash is already present.
+                    // Update the current color by adding the indices from
+                    // small_colors.
+                    let ids = small_colors.indices(&color);
+                    let new_color = large_colors.update(Some(*entry), ids).unwrap();
+                    *entry = new_color;
+                })
+                .or_insert_with(|| {
+                    // In this case, the hash was not present yet.
+                    // We need to create the same color from small_colors
+                    // into large_colors.
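+                    // The same set of dataset ids must map to the same color
+                    // id in both tables, so re-adding the ids to large_colors
+                    // is expected to reproduce `color` (checked by the assert
+                    // just below).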
+ let ids = small_colors.indices(&color); + let new_color = large_colors.update(None, ids).unwrap(); + assert_eq!(new_color, color); + new_color + }); + }); + + (large_hashes, large_colors) + } +} + +impl FromIterator<(HashIntoType, Color)> for HashToColor { + fn from_iter(iter: T) -> Self + where + T: IntoIterator, + { + HashToColor(HashToColorT::from_iter(iter)) + } +} + +impl RevIndex { + /* TODO: need the repair_cf variant, not available in rocksdb-rust yet + pub fn repair(index: &Path, colors: bool) { + if colors { + color_revindex::repair(index); + } else { + disk_revindex::repair(index); + } + } + */ + + pub fn create>( + index: P, + collection: CollectionSet, + colors: bool, + ) -> Result { + if colors { + todo!() //color_revindex::ColorRevIndex::create(index) + } else { + disk_revindex::RevIndex::create(index.as_ref(), collection) + } + } + + pub fn open>(index: P, read_only: bool) -> Result { + let opts = Self::db_options(); + let cfs = DB::list_cf(&opts, index.as_ref()).unwrap(); + + if cfs.into_iter().any(|c| c == COLORS) { + // TODO: ColorRevIndex can't be read-only for now, + // due to pending unmerged colors + todo!() //color_revindex::ColorRevIndex::open(index, false) + } else { + disk_revindex::RevIndex::open(index, read_only) + } + } + + fn db_options() -> rocksdb::Options { + let mut opts = rocksdb::Options::default(); + opts.set_max_open_files(500); + + // Updated defaults from + // https://github.com/facebook/rocksdb/wiki/Setup-Options-and-Basic-Tuning#other-general-options + opts.set_bytes_per_sync(1048576); + let mut block_opts = rocksdb::BlockBasedOptions::default(); + block_opts.set_block_size(16 * 1024); + block_opts.set_cache_index_and_filter_blocks(true); + block_opts.set_pin_l0_filter_and_index_blocks_in_cache(true); + block_opts.set_format_version(6); + opts.set_block_based_table_factory(&block_opts); + // End of updated defaults + + opts.increase_parallelism(rayon::current_num_threads() as i32); + //opts.max_background_jobs = 6; + // opts.optimize_level_style_compaction(); + // opts.optimize_universal_style_compaction(); + + opts + } +} + +pub fn prepare_query(search_sig: Signature, selection: &Selection) -> Option { + let sig = search_sig.select(selection).ok(); + + sig.and_then(|sig| { + if let Sketch::MinHash(mh) = sig.sketches().swap_remove(0) { + Some(mh) + } else { + None + } + }) +} + +#[derive(Debug, Default, PartialEq, Clone)] +pub enum Datasets { + #[default] + Empty, + Unique(Idx), + Many(RoaringBitmap), +} + +impl Hash for Datasets { + fn hash(&self, state: &mut H) + where + H: Hasher, + { + match self { + Self::Empty => todo!(), + Self::Unique(v) => v.hash(state), + Self::Many(v) => { + for value in v.iter() { + value.hash(state); + } + } + } + } +} + +impl IntoIterator for Datasets { + type Item = Idx; + type IntoIter = Box>; + + fn into_iter(self) -> Self::IntoIter { + match self { + Self::Empty => Box::new(std::iter::empty()), + Self::Unique(v) => Box::new(std::iter::once(v)), + Self::Many(v) => Box::new(v.into_iter()), + } + } +} + +impl Extend for Datasets { + fn extend(&mut self, iter: T) + where + T: IntoIterator, + { + if let Self::Many(v) = self { + v.extend(iter); + return; + } + + let mut it = iter.into_iter(); + while let Some(value) = it.next() { + match self { + Self::Empty => *self = Datasets::Unique(value), + Self::Unique(v) => { + if *v != value { + *self = Self::Many([*v, value].iter().copied().collect()); + } + } + Self::Many(v) => { + v.extend(it); + return; + } + } + } + } +} + +impl Datasets { + fn new(vals: &[Idx]) -> Self 
{ + if vals.is_empty() { + Self::Empty + } else if vals.len() == 1 { + Self::Unique(vals[0]) + } else { + Self::Many(RoaringBitmap::from_sorted_iter(vals.iter().copied()).unwrap()) + } + } + + fn from_slice(slice: &[u8]) -> Option { + use byteorder::ReadBytesExt; + + if slice.len() == 8 { + // Unique + Some(Self::Unique( + (&slice[..]).read_u32::().unwrap(), + )) + } else if slice.len() == 1 { + // Empty + Some(Self::Empty) + } else { + // Many + Some(Self::Many(RoaringBitmap::deserialize_from(slice).unwrap())) + } + } + + fn as_bytes(&self) -> Option> { + match self { + Self::Empty => Some(vec![42_u8]), + Self::Unique(v) => { + let mut buf = vec![0u8; 8]; + (&mut buf[..]) + .write_u32::(*v) + .expect("error writing bytes"); + Some(buf) + } + Self::Many(v) => { + let mut buf = vec![]; + v.serialize_into(&mut buf).unwrap(); + Some(buf) + } + } + } + + fn union(&mut self, other: Datasets) { + match self { + Datasets::Empty => match other { + Datasets::Empty => (), + Datasets::Unique(_) | Datasets::Many(_) => *self = other, + }, + Datasets::Unique(v) => match other { + Datasets::Empty => (), + Datasets::Unique(o) => { + if *v != o { + *self = Datasets::Many([*v, o].iter().copied().collect()) + } + } + Datasets::Many(mut o) => { + o.extend([*v]); + *self = Datasets::Many(o); + } + }, + Datasets::Many(ref mut v) => v.extend(other), + } + } + + fn len(&self) -> usize { + match self { + Self::Empty => 0, + Self::Unique(_) => 1, + Self::Many(ref v) => v.len() as usize, + } + } + + /* + fn contains(&self, value: &Idx) -> bool { + match self { + Self::Empty => false, + Self::Unique(v) => v == value, + Self::Many(ref v) => v.contains(*value), + } + } + */ +} + +#[derive(Getters, Setters, Debug)] +pub struct DbStats { + #[getset(get = "pub")] + total_datasets: usize, + + #[getset(get = "pub")] + total_keys: usize, + + #[getset(get = "pub")] + kcount: usize, + + #[getset(get = "pub")] + vcount: usize, + + #[getset(get = "pub")] + vcounts: histogram::Histogram, +} + +fn stats_for_cf(db: Arc, cf_name: &str, deep_check: bool, quick: bool) -> DbStats { + use byteorder::ReadBytesExt; + use histogram::Histogram; + + let cf = db.cf_handle(cf_name).unwrap(); + + let iter = db.iterator_cf(&cf, rocksdb::IteratorMode::Start); + let mut kcount = 0; + let mut vcount = 0; + let mut vcounts = Histogram::new(); + let mut datasets: Datasets = Default::default(); + + for result in iter { + let (key, value) = result.unwrap(); + let _k = (&key[..]).read_u64::().unwrap(); + kcount += key.len(); + + //println!("Saw {} {:?}", k, Datasets::from_slice(&value)); + vcount += value.len(); + + if !quick && deep_check { + let v = Datasets::from_slice(&value).expect("Error with value"); + vcounts.increment(v.len() as u64).unwrap(); + datasets.union(v); + } + //println!("Saw {} {:?}", k, value); + } + + DbStats { + total_datasets: datasets.len(), + total_keys: kcount / 8, + kcount, + vcount, + vcounts, + } +} + +#[cfg(test)] +mod test { + + use camino::Utf8PathBuf as PathBuf; + use tempfile::TempDir; + + use crate::collection::Collection; + use crate::prelude::*; + use crate::selection::Selection; + use crate::Result; + + use super::{prepare_query, RevIndex, RevIndexOps}; + + #[test] + fn revindex_index() -> Result<()> { + let mut basedir = PathBuf::from(env!("CARGO_MANIFEST_DIR")); + basedir.push("../../tests/test-data/scaled/"); + + let siglist: Vec<_> = (10..=12) + .map(|i| { + let mut filename = basedir.clone(); + filename.push(format!("genome-s{}.fa.gz.sig", i)); + filename + }) + .collect(); + + let selection = 
Selection::builder().ksize(31).scaled(10000).build(); + let output = TempDir::new()?; + + let mut query = None; + let query_sig = Signature::from_path(&siglist[0])? + .swap_remove(0) + .select(&selection)?; + if let Some(q) = prepare_query(query_sig, &selection) { + query = Some(q); + } + let query = query.unwrap(); + + let collection = Collection::from_paths(&siglist)?.select(&selection)?; + let index = RevIndex::create(output.path(), collection.try_into()?, false)?; + + let counter = index.counter_for_query(&query); + let matches = index.matches_from_counter(counter, 0); + + assert_eq!(matches, [("../genome-s10.fa.gz".into(), 48)]); + + Ok(()) + } + + #[test] + fn revindex_update() -> Result<()> { + let mut basedir = PathBuf::from(env!("CARGO_MANIFEST_DIR")); + basedir.push("../../tests/test-data/scaled/"); + + let siglist: Vec<_> = (10..=11) + .map(|i| { + let mut filename = basedir.clone(); + filename.push(format!("genome-s{}.fa.gz.sig", i)); + filename + }) + .collect(); + + let selection = Selection::builder().ksize(31).scaled(10000).build(); + let output = TempDir::new()?; + + let mut new_siglist = siglist.clone(); + { + let collection = Collection::from_paths(&siglist)?.select(&selection)?; + RevIndex::create(output.path(), collection.try_into()?, false)?; + } + + let mut filename = basedir.clone(); + filename.push("genome-s12.fa.gz.sig"); + new_siglist.push(filename); + + let mut query = None; + let query_sig = Signature::from_path(&new_siglist[2])? + .swap_remove(0) + .select(&selection)?; + if let Some(q) = prepare_query(query_sig, &selection) { + query = Some(q); + } + let query = query.unwrap(); + + let new_collection = Collection::from_paths(&new_siglist)?.select(&selection)?; + let index = RevIndex::open(output.path(), false)?.update(new_collection.try_into()?)?; + + let counter = index.counter_for_query(&query); + let matches = index.matches_from_counter(counter, 0); + + assert!(matches[0].0.ends_with("/genome-s12.fa.gz")); + assert_eq!(matches[0].1, 45); + + Ok(()) + } + + #[test] + fn revindex_load_and_gather() -> Result<()> { + let mut basedir = PathBuf::from(env!("CARGO_MANIFEST_DIR")); + basedir.push("../../tests/test-data/scaled/"); + + let siglist: Vec<_> = (10..=12) + .map(|i| { + let mut filename = basedir.clone(); + filename.push(format!("genome-s{}.fa.gz.sig", i)); + filename + }) + .collect(); + + let selection = Selection::builder().ksize(31).scaled(10000).build(); + let output = TempDir::new()?; + + let mut query = None; + let query_sig = Signature::from_path(&siglist[0])? 
+ .swap_remove(0) + .select(&selection)?; + if let Some(q) = prepare_query(query_sig, &selection) { + query = Some(q); + } + let query = query.unwrap(); + + { + let collection = Collection::from_paths(&siglist)?.select(&selection)?; + let _index = RevIndex::create(output.path(), collection.try_into()?, false); + } + + let index = RevIndex::open(output.path(), true)?; + + let (counter, query_colors, hash_to_color) = index.prepare_gather_counters(&query); + + let matches = index.gather( + counter, + query_colors, + hash_to_color, + 0, + &query, + Some(selection), + )?; + + assert_eq!(matches.len(), 1); + assert_eq!(matches[0].name(), "../genome-s10.fa.gz"); + assert_eq!(matches[0].f_match(), 1.0); + + Ok(()) + } +} diff --git a/src/core/src/lib.rs b/src/core/src/lib.rs index 66de82e6a0..da383372a0 100644 --- a/src/core/src/lib.rs +++ b/src/core/src/lib.rs @@ -21,11 +21,16 @@ pub mod errors; pub use errors::SourmashError as Error; +pub type Result = std::result::Result; pub mod prelude; pub mod cmd; +pub mod collection; +pub mod index; +pub mod manifest; +pub mod selection; pub mod signature; pub mod sketch; pub mod storage; @@ -44,7 +49,6 @@ cfg_if! { pub mod wasm; } else { pub mod ffi; - pub mod index; } } diff --git a/src/core/src/manifest.rs b/src/core/src/manifest.rs new file mode 100644 index 0000000000..5bad8ec81b --- /dev/null +++ b/src/core/src/manifest.rs @@ -0,0 +1,279 @@ +use std::convert::TryInto; +use std::io::{Read, Write}; +use std::ops::Deref; + +use camino::Utf8PathBuf as PathBuf; +use getset::{CopyGetters, Getters, Setters}; +#[cfg(feature = "parallel")] +use rayon::prelude::*; +use serde::de; +use serde::{Deserialize, Serialize}; + +use crate::encodings::HashFunctions; +use crate::prelude::*; +use crate::signature::{Signature, SigsTrait}; +use crate::sketch::Sketch; +use crate::Result; + +#[derive(Debug, Serialize, Deserialize, Clone, CopyGetters, Getters, Setters, PartialEq)] +pub struct Record { + #[getset(get = "pub", set = "pub")] + internal_location: PathBuf, + + #[getset(get = "pub", set = "pub")] + md5: String, + + md5short: String, + + #[getset(get = "pub", set = "pub")] + ksize: u32, + + moltype: String, + + num: u32, + scaled: u64, + n_hashes: usize, + + #[getset(get = "pub", set = "pub")] + #[serde(deserialize_with = "to_bool")] + with_abundance: bool, + + #[getset(get = "pub", set = "pub")] + name: String, + + filename: String, +} + +fn to_bool<'de, D>(deserializer: D) -> std::result::Result +where + D: de::Deserializer<'de>, +{ + match String::deserialize(deserializer)? 
+ .to_ascii_lowercase() + .as_ref() + { + "0" | "false" => Ok(false), + "1" | "true" => Ok(true), + other => Err(de::Error::invalid_value( + de::Unexpected::Str(other), + &"0/1 or true/false are the only supported values", + )), + } +} + +#[derive(Debug, Default, Serialize, Deserialize, Clone)] +pub struct Manifest { + records: Vec, +} + +impl Record { + pub fn from_sig(sig: &Signature, path: &str) -> Vec { + sig.iter() + .map(|sketch| { + let (ksize, md5, with_abundance, moltype, n_hashes, num, scaled) = match sketch { + Sketch::MinHash(mh) => ( + mh.ksize() as u32, + mh.md5sum(), + mh.track_abundance(), + mh.hash_function(), + mh.size(), + mh.num(), + mh.scaled(), + ), + Sketch::LargeMinHash(mh) => ( + mh.ksize() as u32, + mh.md5sum(), + mh.track_abundance(), + mh.hash_function(), + mh.size(), + mh.num(), + mh.scaled(), + ), + _ => unimplemented!(), + }; + + let md5short = md5[0..8].into(); + + Self { + internal_location: path.into(), + moltype: moltype.to_string(), + name: sig.name(), + ksize, + md5, + md5short, + with_abundance, + filename: sig.filename(), + n_hashes, + num, + scaled, + } + }) + .collect() + } + + pub fn moltype(&self) -> HashFunctions { + self.moltype.as_str().try_into().unwrap() + } + + pub fn check_compatible(&self, other: &Record) -> Result<()> { + /* + if self.num != other.num { + return Err(Error::MismatchNum { + n1: self.num, + n2: other.num, + } + .into()); + } + */ + use crate::Error; + + if self.ksize() != other.ksize() { + return Err(Error::MismatchKSizes); + } + if self.moltype() != other.moltype() { + // TODO: fix this error + return Err(Error::MismatchDNAProt); + } + /* + if self.scaled() < other.scaled() { + return Err(Error::MismatchScaled); + } + if self.seed() != other.seed() { + return Err(Error::MismatchSeed); + } + */ + Ok(()) + } +} + +impl Manifest { + pub fn from_reader(rdr: R) -> Result { + let mut records = vec![]; + + let mut rdr = csv::ReaderBuilder::new() + .comment(Some(b'#')) + .from_reader(rdr); + for result in rdr.deserialize() { + let record: Record = result?; + records.push(record); + } + Ok(Manifest { records }) + } + + pub fn to_writer(&self, mut wtr: W) -> Result<()> { + wtr.write_all(b"# SOURMASH-MANIFEST-VERSION: 1.0\n")?; + + let mut wtr = csv::Writer::from_writer(wtr); + + for record in &self.records { + wtr.serialize(record)?; + } + + Ok(()) + } + + pub fn internal_locations(&self) -> impl Iterator { + self.records.iter().map(|r| r.internal_location.as_str()) + } + + pub fn iter(&self) -> impl Iterator { + self.records.iter() + } +} + +impl Select for Manifest { + fn select(self, selection: &Selection) -> Result { + let rows = self.records.iter().filter(|row| { + let mut valid = true; + valid = if let Some(ksize) = selection.ksize() { + row.ksize == ksize + } else { + valid + }; + valid = if let Some(abund) = selection.abund() { + valid && *row.with_abundance() == abund + } else { + valid + }; + valid = if let Some(moltype) = selection.moltype() { + valid && row.moltype() == moltype + } else { + valid + }; + valid + }); + + Ok(Manifest { + records: rows.cloned().collect(), + }) + + /* + matching_rows = self.rows + if ksize: + matching_rows = ( row for row in matching_rows + if row['ksize'] == ksize ) + if moltype: + matching_rows = ( row for row in matching_rows + if row['moltype'] == moltype ) + if scaled or containment: + if containment and not scaled: + raise ValueError("'containment' requires 'scaled' in Index.select'") + + matching_rows = ( row for row in matching_rows + if row['scaled'] and not row['num'] ) + if num: + 
matching_rows = ( row for row in matching_rows + if row['num'] and not row['scaled'] ) + + if abund: + # only need to concern ourselves if abundance is _required_ + matching_rows = ( row for row in matching_rows + if row['with_abundance'] ) + + if picklist: + matching_rows = ( row for row in matching_rows + if picklist.matches_manifest_row(row) ) + + # return only the internal filenames! + for row in matching_rows: + yield row + */ + } +} + +impl From> for Manifest { + fn from(records: Vec) -> Self { + Manifest { records } + } +} + +impl From<&[PathBuf]> for Manifest { + fn from(paths: &[PathBuf]) -> Self { + #[cfg(feature = "parallel")] + let iter = paths.par_iter(); + + #[cfg(not(feature = "parallel"))] + let iter = paths.iter(); + + let records: Vec = iter + .flat_map(|p| { + let recs: Vec = Signature::from_path(p) + .unwrap_or_else(|_| panic!("Error processing {:?}", p)) + .into_iter() + .flat_map(|v| Record::from_sig(&v, p.as_str())) + .collect(); + recs + }) + .collect(); + + Manifest { records } + } +} + +impl Deref for Manifest { + type Target = Vec; + + fn deref(&self) -> &Self::Target { + &self.records + } +} diff --git a/src/core/src/prelude.rs b/src/core/src/prelude.rs index ef7d4aa27b..90598186c4 100644 --- a/src/core/src/prelude.rs +++ b/src/core/src/prelude.rs @@ -1,27 +1,28 @@ use std::io::Write; -use crate::Error; +use crate::Result; +pub use crate::selection::{Select, Selection}; pub use crate::signature::Signature; pub use crate::storage::Storage; pub trait ToWriter { - fn to_writer(&self, writer: &mut W) -> Result<(), Error> + fn to_writer(&self, writer: &mut W) -> Result<()> where W: Write; } pub trait Update { - fn update(&self, other: &mut O) -> Result<(), Error>; + fn update(&self, other: &mut O) -> Result<()>; } pub trait FromFactory { - fn factory(&self, name: &str) -> Result; + fn factory(&self, name: &str) -> Result; } /// Implemented by anything that wants to read specific data from a storage. pub trait ReadData { - fn data(&self) -> Result<&D, Error>; + fn data(&self) -> Result<&D>; } // TODO: split into two traits, Similarity and Containment? 
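+
+// With these re-exports, downstream code only needs `use sourmash::prelude::*;`
+// to get Select/Selection, Signature, Storage and the ToWriter/Update traits.
+// A minimal sketch (the signature path is hypothetical):
+//
+//   use sourmash::prelude::*;
+//   let selection = Selection::builder().ksize(31).scaled(10000).build();
+//   let sig = Signature::from_path("query.sig")?.swap_remove(0).select(&selection)?;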
diff --git a/src/core/src/selection.rs b/src/core/src/selection.rs
new file mode 100644
index 0000000000..1add597173
--- /dev/null
+++ b/src/core/src/selection.rs
@@ -0,0 +1,133 @@
+use getset::{CopyGetters, Getters, Setters};
+use typed_builder::TypedBuilder;
+
+use crate::encodings::HashFunctions;
+use crate::manifest::Record;
+use crate::Result;
+
+#[derive(Default, Debug, TypedBuilder, Clone)]
+pub struct Selection {
+    #[builder(default, setter(strip_option))]
+    ksize: Option<u32>,
+
+    #[builder(default, setter(strip_option))]
+    abund: Option<bool>,
+
+    #[builder(default, setter(strip_option))]
+    num: Option<u32>,
+
+    #[builder(default, setter(strip_option))]
+    scaled: Option<u32>,
+
+    #[builder(default, setter(strip_option))]
+    containment: Option<bool>,
+
+    #[builder(default, setter(strip_option))]
+    moltype: Option<HashFunctions>,
+
+    #[builder(default, setter(strip_option))]
+    picklist: Option<Picklist>,
+}
+
+#[derive(Default, TypedBuilder, CopyGetters, Getters, Setters, Clone, Debug)]
+pub struct Picklist {
+    #[getset(get = "pub", set = "pub")]
+    #[builder(default = "".into())]
+    coltype: String,
+
+    #[getset(get = "pub", set = "pub")]
+    #[builder(default = "".into())]
+    pickfile: String,
+
+    #[getset(get = "pub", set = "pub")]
+    #[builder(default = "".into())]
+    column_name: String,
+
+    #[getset(get = "pub", set = "pub")]
+    #[builder]
+    pickstyle: PickStyle,
+}
+
+#[derive(Clone, Default, Debug)]
+#[repr(u32)]
+pub enum PickStyle {
+    #[default]
+    Include = 1,
+    Exclude = 2,
+}
+
+pub trait Select {
+    fn select(self, selection: &Selection) -> Result<Self>
+    where
+        Self: Sized;
+}
+
+impl Selection {
+    pub fn ksize(&self) -> Option<u32> {
+        self.ksize
+    }
+
+    pub fn set_ksize(&mut self, ksize: u32) {
+        self.ksize = Some(ksize);
+    }
+
+    pub fn abund(&self) -> Option<bool> {
+        self.abund
+    }
+
+    pub fn set_abund(&mut self, value: bool) {
+        self.abund = Some(value);
+    }
+
+    pub fn num(&self) -> Option<u32> {
+        self.num
+    }
+
+    pub fn set_num(&mut self, num: u32) {
+        self.num = Some(num);
+    }
+
+    pub fn scaled(&self) -> Option<u32> {
+        self.scaled
+    }
+
+    pub fn set_scaled(&mut self, scaled: u32) {
+        self.scaled = Some(scaled);
+    }
+
+    pub fn containment(&self) -> Option<bool> {
+        self.containment
+    }
+
+    pub fn set_containment(&mut self, containment: bool) {
+        self.containment = Some(containment);
+    }
+
+    pub fn moltype(&self) -> Option<HashFunctions> {
+        self.moltype.clone()
+    }
+
+    pub fn set_moltype(&mut self, value: HashFunctions) {
+        self.moltype = Some(value);
+    }
+
+    pub fn picklist(&self) -> Option<Picklist> {
+        self.picklist.clone()
+    }
+
+    pub fn set_picklist(&mut self, value: Picklist) {
+        self.picklist = Some(value);
+    }
+
+    pub fn from_record(row: &Record) -> Result<Self> {
+        Ok(Self {
+            ksize: Some(*row.ksize()),
+            abund: Some(*row.with_abundance()),
+            moltype: Some(row.moltype()),
+            num: None,
+            scaled: None,
+            containment: None,
+            picklist: None,
+        })
+    }
+}
diff --git a/src/core/src/signature.rs b/src/core/src/signature.rs
index db2a85ea05..f5cb9a2b4e 100644
--- a/src/core/src/signature.rs
+++ b/src/core/src/signature.rs
@@ -2,6 +2,8 @@
 //!
 //! A signature is a collection of sketches for a genomic dataset.
 
+use core::iter::FusedIterator;
+
 use std::fs::File;
 use std::io;
 use std::iter::Iterator;
@@ -16,10 +18,13 @@ use typed_builder::TypedBuilder;
 
 use crate::encodings::{aa_to_dayhoff, aa_to_hp, revcomp, to_aa, HashFunctions, VALID};
 use crate::prelude::*;
+use crate::selection::{Select, Selection};
 use crate::sketch::Sketch;
 use crate::Error;
 use crate::HashIntoType;
 
+// TODO: this is the behavior expected from Sketch, but that name is already
+// used. Sketchable?
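+//
+// SigsTrait is the interface shared by the sketch types (KmerMinHash,
+// KmerMinHashBTree, HyperLogLog): size/hash accessors, compatibility
+// checks, and ingestion via add_hash/add_sequence/add_protein.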
pub trait SigsTrait { fn size(&self) -> usize; fn to_vec(&self) -> Vec; @@ -366,11 +371,11 @@ impl Iterator for SeqToHashes { Some(Ok(hash)) } else { if !self.prot_configured { - self.aa_seq = match self.hash_function { - HashFunctions::murmur64_dayhoff => { + self.aa_seq = match &self.hash_function { + HashFunctions::Murmur64Dayhoff => { self.sequence.iter().cloned().map(aa_to_dayhoff).collect() } - HashFunctions::murmur64_hp => { + HashFunctions::Murmur64Hp => { self.sequence.iter().cloned().map(aa_to_hp).collect() } invalid => { @@ -395,6 +400,10 @@ impl Iterator for SeqToHashes { } #[derive(Serialize, Deserialize, Debug, Clone, TypedBuilder)] +#[cfg_attr( + feature = "rkyv", + derive(rkyv::Serialize, rkyv::Deserialize, rkyv::Archive) +)] pub struct Signature { #[serde(default = "default_class")] #[builder(default = default_class())] @@ -575,9 +584,9 @@ impl Signature { } }; - match moltype { + match &moltype { Some(x) => { - if mh.hash_function() == x { + if mh.hash_function() == *x { return true; } } @@ -591,9 +600,9 @@ impl Signature { } }; - match moltype { + match &moltype { Some(x) => { - if mh.hash_function() == x { + if mh.hash_function() == *x { return true; } } @@ -654,6 +663,92 @@ impl Signature { Ok(()) } + + pub fn iter_mut(&mut self) -> IterMut<'_> { + let length = self.signatures.len(); + IterMut { + iter: self.signatures.iter_mut(), + length, + } + } + + pub fn iter(&self) -> Iter<'_> { + let length = self.signatures.len(); + Iter { + iter: self.signatures.iter(), + length, + } + } +} + +pub struct IterMut<'a> { + iter: std::slice::IterMut<'a, Sketch>, + length: usize, +} + +impl<'a> IntoIterator for &'a mut Signature { + type Item = &'a mut Sketch; + type IntoIter = IterMut<'a>; + + fn into_iter(self) -> IterMut<'a> { + self.iter_mut() + } +} + +impl<'a> Iterator for IterMut<'a> { + type Item = &'a mut Sketch; + + fn next(&mut self) -> Option<&'a mut Sketch> { + if self.length == 0 { + None + } else { + self.length -= 1; + self.iter.next() + } + } + + fn size_hint(&self) -> (usize, Option) { + (self.length, Some(self.length)) + } +} + +pub struct Iter<'a> { + iter: std::slice::Iter<'a, Sketch>, + length: usize, +} + +impl<'a> Iterator for Iter<'a> { + type Item = &'a Sketch; + + fn next(&mut self) -> Option<&'a Sketch> { + if self.length == 0 { + None + } else { + self.length -= 1; + self.iter.next() + } + } + + fn size_hint(&self) -> (usize, Option) { + (self.length, Some(self.length)) + } +} + +impl FusedIterator for Iter<'_> {} + +impl ExactSizeIterator for Iter<'_> { + fn len(&self) -> usize { + self.length + } +} + +impl Clone for Iter<'_> { + fn clone(&self) -> Self { + Iter { + iter: self.iter.clone(), + length: self.length, + } + } } impl ToWriter for Signature { @@ -666,6 +761,36 @@ impl ToWriter for Signature { } } +impl Select for Signature { + fn select(mut self, selection: &Selection) -> Result { + self.signatures.retain(|s| { + let mut valid = true; + valid = if let Some(ksize) = selection.ksize() { + let k = s.ksize() as u32; + k == ksize || k == ksize * 3 + } else { + valid + }; + // TODO: execute downsample if needed + + /* + valid = if let Some(abund) = selection.abund() { + valid && *s.with_abundance() == abund + } else { + valid + }; + valid = if let Some(moltype) = selection.moltype() { + valid && s.moltype() == moltype + } else { + valid + }; + */ + valid + }); + Ok(self) + } +} + impl Default for Signature { fn default() -> Signature { Signature { diff --git a/src/core/src/sketch/hyperloglog/mod.rs b/src/core/src/sketch/hyperloglog/mod.rs index 
409d2a2c44..ee09caa6e5 100644 --- a/src/core/src/sketch/hyperloglog/mod.rs +++ b/src/core/src/sketch/hyperloglog/mod.rs @@ -26,6 +26,10 @@ pub mod estimators; use estimators::CounterType; #[derive(Debug, Default, Clone, PartialEq, Eq, Serialize, Deserialize)] +#[cfg_attr( + feature = "rkyv", + derive(rkyv::Serialize, rkyv::Deserialize, rkyv::Archive) +)] pub struct HyperLogLog { registers: Vec, p: usize, @@ -180,7 +184,7 @@ impl SigsTrait for HyperLogLog { fn hash_function(&self) -> HashFunctions { //TODO support other hash functions - HashFunctions::murmur64_DNA + HashFunctions::Murmur64Dna } fn add_hash(&mut self, hash: HashIntoType) { @@ -318,15 +322,15 @@ mod test { assert!(abs_error < ERR_RATE, "{}", abs_error); let similarity = hll1.similarity(&hll2); - let abs_error = (1. - (similarity / SIMILARITY as f64)).abs(); + let abs_error = (1. - (similarity / SIMILARITY)).abs(); assert!(abs_error < ERR_RATE, "{} {}", similarity, SIMILARITY); let containment = hll1.containment(&hll2); - let abs_error = (1. - (containment / CONTAINMENT_H1 as f64)).abs(); + let abs_error = (1. - (containment / CONTAINMENT_H1)).abs(); assert!(abs_error < ERR_RATE, "{} {}", containment, CONTAINMENT_H1); let containment = hll2.containment(&hll1); - let abs_error = (1. - (containment / CONTAINMENT_H2 as f64)).abs(); + let abs_error = (1. - (containment / CONTAINMENT_H2)).abs(); assert!(abs_error < ERR_RATE, "{} {}", containment, CONTAINMENT_H2); let intersection = hll1.intersection(&hll2) as f64; @@ -335,13 +339,13 @@ mod test { hll1.merge(&hll2).unwrap(); - let abs_error = (1. - (hllu.similarity(&hll1) as f64 / 1.)).abs(); + let abs_error = (1. - (hllu.similarity(&hll1) / 1.)).abs(); assert!(abs_error < ERR_RATE, "{}", abs_error); - let abs_error = (1. - (hllu.containment(&hll1) as f64 / 1.)).abs(); + let abs_error = (1. - (hllu.containment(&hll1) / 1.)).abs(); assert!(abs_error < ERR_RATE, "{}", abs_error); - let abs_error = (1. - (hll1.containment(&hllu) as f64 / 1.)).abs(); + let abs_error = (1. 
- (hll1.containment(&hllu) / 1.)).abs(); assert!(abs_error < ERR_RATE, "{}", abs_error); let intersection = hll1.intersection(&hllu) as f64; diff --git a/src/core/src/sketch/minhash.rs b/src/core/src/sketch/minhash.rs index 5c5f1114f8..36f11a589e 100644 --- a/src/core/src/sketch/minhash.rs +++ b/src/core/src/sketch/minhash.rs @@ -33,11 +33,15 @@ pub fn scaled_for_max_hash(max_hash: u64) -> u64 { } #[derive(Debug, TypedBuilder)] +#[cfg_attr( + feature = "rkyv", + derive(rkyv::Serialize, rkyv::Deserialize, rkyv::Archive) +)] pub struct KmerMinHash { num: u32, ksize: u32, - #[builder(setter(into), default = HashFunctions::murmur64_DNA)] + #[builder(setter(into), default = HashFunctions::Murmur64Dna)] hash_function: HashFunctions, #[builder(default = 42u64)] @@ -53,6 +57,8 @@ pub struct KmerMinHash { abunds: Option>, #[builder(default)] + //#[cfg_attr(feature = "rkyv", with(rkyv::with::Lock))] + #[cfg_attr(feature = "rkyv", with(rkyv::with::Skip))] md5sum: Mutex>, } @@ -68,7 +74,7 @@ impl Clone for KmerMinHash { KmerMinHash { num: self.num, ksize: self.ksize, - hash_function: self.hash_function, + hash_function: self.hash_function.clone(), seed: self.seed, max_hash: self.max_hash, mins: self.mins.clone(), @@ -83,7 +89,7 @@ impl Default for KmerMinHash { KmerMinHash { num: 1000, ksize: 21, - hash_function: HashFunctions::murmur64_DNA, + hash_function: HashFunctions::Murmur64Dna, seed: 42, max_hash: 0, mins: Vec::with_capacity(1000), @@ -142,10 +148,10 @@ impl<'de> Deserialize<'de> for KmerMinHash { let num = if tmpsig.max_hash != 0 { 0 } else { tmpsig.num }; let hash_function = match tmpsig.molecule.to_lowercase().as_ref() { - "protein" => HashFunctions::murmur64_protein, - "dayhoff" => HashFunctions::murmur64_dayhoff, - "hp" => HashFunctions::murmur64_hp, - "dna" => HashFunctions::murmur64_DNA, + "protein" => HashFunctions::Murmur64Protein, + "dayhoff" => HashFunctions::Murmur64Dayhoff, + "hp" => HashFunctions::Murmur64Hp, + "dna" => HashFunctions::Murmur64Dna, _ => unimplemented!(), // TODO: throw error here }; @@ -216,7 +222,7 @@ impl KmerMinHash { } pub fn is_protein(&self) -> bool { - self.hash_function == HashFunctions::murmur64_protein + self.hash_function == HashFunctions::Murmur64Protein } pub fn max_hash(&self) -> u64 { @@ -573,7 +579,7 @@ impl KmerMinHash { let mut combined_mh = KmerMinHash::new( self.scaled(), self.ksize, - self.hash_function, + self.hash_function.clone(), self.seed, self.abunds.is_some(), self.num, @@ -606,7 +612,7 @@ impl KmerMinHash { let mut combined_mh = KmerMinHash::new( self.scaled(), self.ksize, - self.hash_function, + self.hash_function.clone(), self.seed, self.abunds.is_some(), self.num, @@ -709,11 +715,11 @@ impl KmerMinHash { } pub fn dayhoff(&self) -> bool { - self.hash_function == HashFunctions::murmur64_dayhoff + self.hash_function == HashFunctions::Murmur64Dayhoff } pub fn hp(&self) -> bool { - self.hash_function == HashFunctions::murmur64_hp + self.hash_function == HashFunctions::Murmur64Hp } pub fn mins(&self) -> Vec { @@ -735,7 +741,7 @@ impl KmerMinHash { let mut new_mh = KmerMinHash::new( scaled, self.ksize, - self.hash_function, + self.hash_function.clone(), self.seed, self.abunds.is_some(), self.num, @@ -799,7 +805,7 @@ impl SigsTrait for KmerMinHash { } fn hash_function(&self) -> HashFunctions { - self.hash_function + self.hash_function.clone() } fn add_hash(&mut self, hash: u64) { @@ -823,6 +829,8 @@ impl SigsTrait for KmerMinHash { // TODO: fix this error return Err(Error::MismatchDNAProt); } + // TODO: if supporting downsampled to be 
compatible + //if self.max_hash < other.max_hash { if self.max_hash != other.max_hash { return Err(Error::MismatchScaled); } @@ -927,11 +935,15 @@ mod test { // A MinHash implementation for low scaled or large cardinalities #[derive(Debug, TypedBuilder)] +#[cfg_attr( + feature = "rkyv", + derive(rkyv::Serialize, rkyv::Deserialize, rkyv::Archive) +)] pub struct KmerMinHashBTree { num: u32, ksize: u32, - #[builder(setter(into), default = HashFunctions::murmur64_DNA)] + #[builder(setter(into), default = HashFunctions::Murmur64Dna)] hash_function: HashFunctions, #[builder(default = 42u64)] @@ -950,6 +962,8 @@ pub struct KmerMinHashBTree { current_max: u64, #[builder(default)] + //#[cfg_attr(feature = "rkyv", with(rkyv::with::Lock))] + #[cfg_attr(feature = "rkyv", with(rkyv::with::Skip))] md5sum: Mutex>, } @@ -965,7 +979,7 @@ impl Clone for KmerMinHashBTree { KmerMinHashBTree { num: self.num, ksize: self.ksize, - hash_function: self.hash_function, + hash_function: self.hash_function.clone(), seed: self.seed, max_hash: self.max_hash, mins: self.mins.clone(), @@ -981,7 +995,7 @@ impl Default for KmerMinHashBTree { KmerMinHashBTree { num: 1000, ksize: 21, - hash_function: HashFunctions::murmur64_DNA, + hash_function: HashFunctions::Murmur64Dna, seed: 42, max_hash: 0, mins: Default::default(), @@ -1042,10 +1056,10 @@ impl<'de> Deserialize<'de> for KmerMinHashBTree { let num = if tmpsig.max_hash != 0 { 0 } else { tmpsig.num }; let hash_function = match tmpsig.molecule.to_lowercase().as_ref() { - "protein" => HashFunctions::murmur64_protein, - "dayhoff" => HashFunctions::murmur64_dayhoff, - "hp" => HashFunctions::murmur64_hp, - "dna" => HashFunctions::murmur64_DNA, + "protein" => HashFunctions::Murmur64Protein, + "dayhoff" => HashFunctions::Murmur64Dayhoff, + "hp" => HashFunctions::Murmur64Hp, + "dna" => HashFunctions::Murmur64Dna, _ => unimplemented!(), // TODO: throw error here }; @@ -1115,7 +1129,7 @@ impl KmerMinHashBTree { } pub fn is_protein(&self) -> bool { - self.hash_function == HashFunctions::murmur64_protein + self.hash_function == HashFunctions::Murmur64Protein } pub fn max_hash(&self) -> u64 { @@ -1358,7 +1372,7 @@ impl KmerMinHashBTree { let mut combined_mh = KmerMinHashBTree::new( self.scaled(), self.ksize, - self.hash_function, + self.hash_function.clone(), self.seed, self.abunds.is_some(), self.num, @@ -1390,7 +1404,7 @@ impl KmerMinHashBTree { let mut combined_mh = KmerMinHashBTree::new( self.scaled(), self.ksize, - self.hash_function, + self.hash_function.clone(), self.seed, self.abunds.is_some(), self.num, @@ -1478,15 +1492,15 @@ impl KmerMinHashBTree { } pub fn dayhoff(&self) -> bool { - self.hash_function == HashFunctions::murmur64_dayhoff + self.hash_function == HashFunctions::Murmur64Dayhoff } pub fn hp(&self) -> bool { - self.hash_function == HashFunctions::murmur64_hp + self.hash_function == HashFunctions::Murmur64Hp } pub fn hash_function(&self) -> HashFunctions { - self.hash_function + self.hash_function.clone() } pub fn mins(&self) -> Vec { @@ -1510,7 +1524,7 @@ impl KmerMinHashBTree { let mut new_mh = KmerMinHashBTree::new( scaled, self.ksize, - self.hash_function, + self.hash_function.clone(), self.seed, self.abunds.is_some(), self.num, @@ -1560,7 +1574,7 @@ impl SigsTrait for KmerMinHashBTree { } fn hash_function(&self) -> HashFunctions { - self.hash_function + self.hash_function.clone() } fn add_hash(&mut self, hash: u64) { diff --git a/src/core/src/sketch/mod.rs b/src/core/src/sketch/mod.rs index 09bd51085c..3ef04e43df 100644 --- a/src/core/src/sketch/mod.rs +++ 
b/src/core/src/sketch/mod.rs @@ -10,6 +10,10 @@ use crate::sketch::minhash::{KmerMinHash, KmerMinHashBTree}; #[derive(Debug, Clone, Serialize, Deserialize)] #[serde(untagged)] +#[cfg_attr( + feature = "rkyv", + derive(rkyv::Serialize, rkyv::Deserialize, rkyv::Archive) +)] pub enum Sketch { MinHash(KmerMinHash), LargeMinHash(KmerMinHashBTree), diff --git a/src/core/src/sketch/nodegraph.rs b/src/core/src/sketch/nodegraph.rs index cbca8915ba..bbfef5cd0d 100644 --- a/src/core/src/sketch/nodegraph.rs +++ b/src/core/src/sketch/nodegraph.rs @@ -7,7 +7,7 @@ use byteorder::{BigEndian, ByteOrder, LittleEndian, ReadBytesExt, WriteBytesExt} use fixedbitset::FixedBitSet; use crate::prelude::*; -use crate::sketch::minhash::KmerMinHash; +use crate::sketch::minhash::{KmerMinHash, KmerMinHashBTree}; use crate::Error; use crate::HashIntoType; @@ -58,6 +58,15 @@ impl Update for KmerMinHash { } } +impl Update for KmerMinHashBTree { + fn update(&self, other: &mut Nodegraph) -> Result<(), Error> { + for h in self.mins() { + other.count(h); + } + Ok(()) + } +} + impl Nodegraph { pub fn new(tablesizes: &[usize], ksize: usize) -> Nodegraph { let mut bs = Vec::with_capacity(tablesizes.len()); diff --git a/src/core/src/storage.rs b/src/core/src/storage.rs index f4f942d330..ad017e65a7 100644 --- a/src/core/src/storage.rs +++ b/src/core/src/storage.rs @@ -1,47 +1,47 @@ -use std::collections::BTreeMap; +use std::collections::{BTreeMap, HashMap}; use std::ffi::OsStr; use std::fs::{DirBuilder, File}; use std::io::{BufReader, BufWriter, Read, Write}; -use std::path::{Path, PathBuf}; -use std::rc::Rc; -use std::sync::RwLock; +use std::ops::Deref; +use std::sync::{Arc, RwLock}; +use camino::Utf8Path as Path; +use camino::Utf8PathBuf as PathBuf; +use once_cell::sync::OnceCell; use serde::{Deserialize, Serialize}; use thiserror::Error; use typed_builder::TypedBuilder; -use crate::Error; +use crate::errors::ReadDataError; +use crate::prelude::*; +use crate::signature::SigsTrait; +use crate::sketch::Sketch; +use crate::{Error, Result}; /// An abstraction for any place where we can store data. 
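+/// Implementations move raw bytes (`save`/`load`) and whole signatures
+/// (`load_sig`/`save_sig`), and expose a `spec` string from which an
+/// equivalent storage can be re-opened later via `InnerStorage::from_spec`.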
 pub trait Storage {
     /// Save bytes into path
-    fn save(&self, path: &str, content: &[u8]) -> Result<String, Error>;
+    fn save(&self, path: &str, content: &[u8]) -> Result<String>;
 
     /// Load bytes from path
-    fn load(&self, path: &str) -> Result<Vec<u8>, Error>;
+    fn load(&self, path: &str) -> Result<Vec<u8>>;
 
     /// Args for initializing a new Storage
     fn args(&self) -> StorageArgs;
-}
 
-#[derive(Clone)]
-pub struct InnerStorage(Rc<RwLock<dyn Storage>>);
+    /// Load signature from internal path
+    fn load_sig(&self, path: &str) -> Result<SigStore>;
 
-impl InnerStorage {
-    pub fn new(inner: impl Storage + 'static) -> InnerStorage {
-        InnerStorage(Rc::new(RwLock::new(inner)))
-    }
-}
+    /// Return a spec for creating/opening a storage
+    fn spec(&self) -> String;
 
-impl Storage for InnerStorage {
-    fn save(&self, path: &str, content: &[u8]) -> Result<String, Error> {
-        self.0.save(path, content)
-    }
-    fn load(&self, path: &str) -> Result<Vec<u8>, Error> {
-        self.0.load(path)
-    }
-    fn args(&self) -> StorageArgs {
-        self.0.args()
+    /// Save signature to internal path
+    fn save_sig(&self, path: &str, sig: Signature) -> Result<String> {
+        let mut buffer = vec![];
+        {
+            sig.to_writer(&mut buffer).unwrap();
+        }
+        self.save(path, &buffer)
     }
 }
 
@@ -57,6 +57,26 @@ pub enum StorageError {
     DataReadError(String),
 }
 
+#[derive(Clone)]
+pub struct InnerStorage(Arc<RwLock<dyn Storage + Send + Sync>>);
+
+#[derive(TypedBuilder, Default, Clone)]
+pub struct SigStore {
+    #[builder(setter(into))]
+    filename: String,
+
+    #[builder(setter(into))]
+    name: String,
+
+    #[builder(setter(into))]
+    metadata: String,
+
+    storage: Option<InnerStorage>,
+
+    #[builder(setter(into), default)]
+    data: OnceCell<Signature>,
+}
+
 #[derive(Serialize, Deserialize)]
 pub(crate) struct StorageInfo {
     pub backend: String,
@@ -69,6 +89,86 @@ pub enum StorageArgs {
     FSStorage { path: String },
 }
 
+/// Store files locally into a directory
+#[derive(TypedBuilder, Debug, Clone, Default)]
+pub struct FSStorage {
+    /// absolute path for the directory where data is saved.
+    fullpath: PathBuf,
+    subdir: String,
+}
+
+#[ouroboros::self_referencing]
+pub struct ZipStorage {
+    mapping: Option<memmap2::Mmap>,
+
+    #[borrows(mapping)]
+    #[covariant]
+    archive: piz::ZipArchive<'this>,
+
+    subdir: Option<String>,
+    path: Option<PathBuf>,
+
+    #[borrows(archive)]
+    #[covariant]
+    metadata: Metadata<'this>,
+}
+
+/// Store data in memory (no permanent storage)
+#[derive(TypedBuilder, Debug, Clone, Default)]
+pub struct MemStorage {
+    //store: HashMap<String, Vec<u8>>,
+    sigs: Arc<RwLock<HashMap<String, SigStore>>>,
+}
+
+pub type Metadata<'a> = BTreeMap<&'a OsStr, &'a piz::read::FileMetadata<'a>>;
+
+// =========================================
+
+impl InnerStorage {
+    pub fn new(inner: impl Storage + Send + Sync + 'static) -> InnerStorage {
+        InnerStorage(Arc::new(RwLock::new(inner)))
+    }
+
+    pub fn from_spec(spec: String) -> Result<Self> {
+        Ok(match spec {
+            x if x.starts_with("fs") => {
+                let path = x.split("://").last().expect("not a valid path");
+                InnerStorage::new(FSStorage::new("", path))
+            }
+            x if x.starts_with("memory") => InnerStorage::new(MemStorage::new()),
+            x if x.starts_with("zip") => {
+                let path = x.split("://").last().expect("not a valid path");
+                InnerStorage::new(ZipStorage::from_file(path)?)
+ } + _ => todo!("storage not supported, throw error"), + }) + } +} + +impl Storage for InnerStorage { + fn save(&self, path: &str, content: &[u8]) -> Result { + self.0.save(path, content) + } + + fn load(&self, path: &str) -> Result> { + self.0.load(path) + } + + fn args(&self) -> StorageArgs { + self.0.args() + } + + fn load_sig(&self, path: &str) -> Result { + let mut store = self.0.load_sig(path)?; + store.storage = Some(self.clone()); + Ok(store) + } + + fn spec(&self) -> String { + self.0.spec() + } +} + impl From<&StorageArgs> for FSStorage { fn from(other: &StorageArgs) -> FSStorage { match other { @@ -90,25 +190,25 @@ impl Storage for RwLock where L: ?Sized + Storage, { - fn save(&self, path: &str, content: &[u8]) -> Result { + fn save(&self, path: &str, content: &[u8]) -> Result { self.read().unwrap().save(path, content) } - fn load(&self, path: &str) -> Result, Error> { + fn load(&self, path: &str) -> Result> { self.read().unwrap().load(path) } fn args(&self) -> StorageArgs { self.read().unwrap().args() } -} -/// Store files locally into a directory -#[derive(TypedBuilder, Debug, Clone, Default)] -pub struct FSStorage { - /// absolute path for the directory where data is saved. - fullpath: PathBuf, - subdir: String, + fn load_sig(&self, path: &str) -> Result { + self.read().unwrap().load_sig(path) + } + + fn spec(&self) -> String { + self.read().unwrap().spec() + } } impl FSStorage { @@ -132,7 +232,7 @@ impl FSStorage { } impl Storage for FSStorage { - fn save(&self, path: &str, content: &[u8]) -> Result { + fn save(&self, path: &str, content: &[u8]) -> Result { if path.is_empty() { return Err(StorageError::EmptyPathError.into()); } @@ -148,7 +248,7 @@ impl Storage for FSStorage { Ok(path.into()) } - fn load(&self, path: &str) -> Result, Error> { + fn load(&self, path: &str) -> Result> { let path = self.fullpath.join(path); let file = File::open(path)?; let mut buf_reader = BufReader::new(file); @@ -162,38 +262,33 @@ impl Storage for FSStorage { path: self.subdir.clone(), } } -} -#[ouroboros::self_referencing] -pub struct ZipStorage { - mapping: Option, + fn load_sig(&self, path: &str) -> Result { + let raw = self.load(path)?; + let sig = Signature::from_reader(&mut &raw[..])? + // TODO: select the right sig? 
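+            // from_reader returns a Vec<Signature>, since a file may hold
+            // several signatures; swap_remove(0) takes the first one in O(1),
+            // filling the hole with the last element instead of shifting.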
+ .swap_remove(0); - #[borrows(mapping)] - #[covariant] - archive: piz::ZipArchive<'this>, - - subdir: Option, - path: Option, + Ok(sig.into()) + } - #[borrows(archive)] - #[covariant] - metadata: Metadata<'this>, + fn spec(&self) -> String { + format!("fs://{}", self.subdir) + } } -pub type Metadata<'a> = BTreeMap<&'a OsStr, &'a piz::read::FileMetadata<'a>>; - fn lookup<'a, P: AsRef>( metadata: &'a Metadata, path: P, -) -> Result<&'a piz::read::FileMetadata<'a>, Error> { +) -> Result<&'a piz::read::FileMetadata<'a>> { let path = path.as_ref(); metadata .get(&path.as_os_str()) - .ok_or_else(|| StorageError::PathNotFoundError(path.to_str().unwrap().into()).into()) + .ok_or_else(|| StorageError::PathNotFoundError(path.to_string()).into()) .map(|entry| *entry) } -fn find_subdirs<'a>(archive: &'a piz::ZipArchive<'a>) -> Result, Error> { +fn find_subdirs<'a>(archive: &'a piz::ZipArchive<'a>) -> Result> { let subdirs: Vec<_> = archive .entries() .iter() @@ -207,11 +302,11 @@ fn find_subdirs<'a>(archive: &'a piz::ZipArchive<'a>) -> Result, } impl Storage for ZipStorage { - fn save(&self, _path: &str, _content: &[u8]) -> Result { + fn save(&self, _path: &str, _content: &[u8]) -> Result { unimplemented!(); } - fn load(&self, path: &str) -> Result, Error> { + fn load(&self, path: &str) -> Result> { let metadata = self.borrow_metadata(); let entry = lookup(metadata, path).or_else(|_| { @@ -237,11 +332,24 @@ impl Storage for ZipStorage { fn args(&self) -> StorageArgs { unimplemented!(); } + + fn load_sig(&self, path: &str) -> Result { + let raw = self.load(path)?; + let sig = Signature::from_reader(&mut &raw[..])? + // TODO: select the right sig? + .swap_remove(0); + + Ok(sig.into()) + } + + fn spec(&self) -> String { + format!("zip://{}", self.path().unwrap_or("".into())) + } } impl ZipStorage { - pub fn from_file(location: &str) -> Result { - let zip_file = File::open(location)?; + pub fn from_file>(location: P) -> Result { + let zip_file = File::open(location.as_ref())?; let mapping = unsafe { memmap2::Mmap::map(&zip_file)? 
}; let mut storage = ZipStorageBuilder { @@ -257,7 +365,7 @@ impl ZipStorage { .collect() }, subdir: None, - path: Some(location.to_owned()), + path: Some(location.as_ref().into()), } .build(); @@ -267,7 +375,7 @@ impl ZipStorage { Ok(storage) } - pub fn path(&self) -> Option { + pub fn path(&self) -> Option { self.borrow_path().clone() } @@ -279,7 +387,7 @@ impl ZipStorage { self.with_mut(|fields| *fields.subdir = Some(path)) } - pub fn list_sbts(&self) -> Result, Error> { + pub fn list_sbts(&self) -> Result> { Ok(self .borrow_archive() .entries() @@ -295,7 +403,7 @@ impl ZipStorage { .collect()) } - pub fn filenames(&self) -> Result, Error> { + pub fn filenames(&self) -> Result> { Ok(self .borrow_archive() .entries() @@ -304,3 +412,221 @@ impl ZipStorage { .collect()) } } + +impl SigStore { + pub fn new_with_storage(sig: Signature, storage: InnerStorage) -> Self { + let name = sig.name(); + let filename = sig.filename(); + + SigStore::builder() + .name(name) + .filename(filename) + .data(sig) + .metadata("") + .storage(Some(storage)) + .build() + } + + pub fn name(&self) -> String { + self.name.clone() + } +} + +impl Select for SigStore { + fn select(mut self, selection: &Selection) -> Result { + // TODO: find better error + let sig = self.data.take().ok_or(Error::MismatchKSizes)?; + self.data = OnceCell::with_value(sig.select(selection)?); + Ok(self) + } +} + +impl std::fmt::Debug for SigStore { + fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { + write!( + f, + "SigStore [filename: {}, name: {}, metadata: {}]", + self.filename, self.name, self.metadata + ) + } +} + +impl ReadData for SigStore { + fn data(&self) -> Result<&Signature> { + if let Some(sig) = self.data.get() { + Ok(sig) + } else if let Some(storage) = &self.storage { + let sig = self.data.get_or_init(|| { + let raw = storage.load(&self.filename).unwrap(); + Signature::from_reader(&mut &raw[..]) + .unwrap() + // TODO: select the right sig? + .swap_remove(0) + }); + + Ok(sig) + } else { + Err(ReadDataError::LoadError.into()) + } + } +} + +impl SigStore { + pub fn save(&self, path: &str) -> Result { + if let Some(storage) = &self.storage { + if let Some(data) = self.data.get() { + let mut buffer = Vec::new(); + data.to_writer(&mut buffer)?; + + Ok(storage.save(path, &buffer)?) + } else { + unimplemented!() + } + } else { + unimplemented!() + } + } +} + +impl From for Signature { + fn from(other: SigStore) -> Signature { + other.data.get().unwrap().to_owned() + } +} + +impl Deref for SigStore { + type Target = Signature; + + fn deref(&self) -> &Signature { + self.data.get().unwrap() + } +} + +impl From for SigStore { + fn from(other: Signature) -> SigStore { + let name = other.name(); + let filename = other.filename(); + + SigStore::builder() + .name(name) + .filename(filename) + .data(other) + .metadata("") + .storage(None) + .build() + } +} + +impl Comparable for SigStore { + fn similarity(&self, other: &SigStore) -> f64 { + let ng: &Signature = self.data().unwrap(); + let ong: &Signature = other.data().unwrap(); + + // TODO: select the right signatures... + // TODO: better matching here, what if it is not a mh? + if let Sketch::MinHash(mh) = &ng.signatures[0] { + if let Sketch::MinHash(omh) = &ong.signatures[0] { + return mh.similarity(omh, true, false).unwrap(); + } + } + + unimplemented!() + } + + fn containment(&self, other: &SigStore) -> f64 { + let ng: &Signature = self.data().unwrap(); + let ong: &Signature = other.data().unwrap(); + + // TODO: select the right signatures... 
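+        // Containment here is |A ∩ B| / |A|: count_common between the two
+        // first sketches, divided by the size of self's minhash.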
+ // TODO: better matching here, what if it is not a mh? + if let Sketch::MinHash(mh) = &ng.signatures[0] { + if let Sketch::MinHash(omh) = &ong.signatures[0] { + let common = mh.count_common(omh, false).unwrap(); + let size = mh.size(); + return common as f64 / size as f64; + } + } + unimplemented!() + } +} + +#[derive(Serialize, Deserialize, Debug)] +pub struct DatasetInfo { + pub filename: String, + pub name: String, + pub metadata: String, +} +impl From for SigStore { + fn from(other: DatasetInfo) -> SigStore { + SigStore { + filename: other.filename, + name: other.name, + metadata: other.metadata, + storage: None, + data: OnceCell::new(), + } + } +} + +impl Comparable for Signature { + fn similarity(&self, other: &Signature) -> f64 { + // TODO: select the right signatures... + // TODO: better matching here, what if it is not a mh? + if let Sketch::MinHash(mh) = &self.signatures[0] { + if let Sketch::MinHash(omh) = &other.signatures[0] { + return mh.similarity(omh, true, false).unwrap(); + } + } + unimplemented!() + } + + fn containment(&self, other: &Signature) -> f64 { + // TODO: select the right signatures... + // TODO: better matching here, what if it is not a mh? + if let Sketch::MinHash(mh) = &self.signatures[0] { + if let Sketch::MinHash(omh) = &other.signatures[0] { + let common = mh.count_common(omh, false).unwrap(); + let size = mh.size(); + return common as f64 / size as f64; + } + } + unimplemented!() + } +} + +impl MemStorage { + pub fn new() -> Self { + Self { + sigs: Arc::new(RwLock::new(HashMap::default())), + } + } +} + +impl Storage for MemStorage { + fn save(&self, _path: &str, _content: &[u8]) -> Result { + unimplemented!() + } + + fn load(&self, _path: &str) -> Result> { + unimplemented!() + } + + fn args(&self) -> StorageArgs { + unimplemented!() + } + + fn load_sig(&self, path: &str) -> Result { + Ok(self.sigs.read().unwrap().get(path).unwrap().clone()) + } + + fn save_sig(&self, path: &str, sig: Signature) -> Result { + // side-step saving to store + let sig_store: SigStore = sig.into(); + self.sigs.write().unwrap().insert(path.into(), sig_store); + Ok(path.into()) + } + + fn spec(&self) -> String { + "memory://".into() + } +} diff --git a/src/core/src/wasm.rs b/src/core/src/wasm.rs index ad656d9955..c2a0eb6c30 100644 --- a/src/core/src/wasm.rs +++ b/src/core/src/wasm.rs @@ -37,13 +37,13 @@ impl KmerMinHash { // TODO: at most one of (prot, dayhoff, hp) should be true let hash_function = if dayhoff { - HashFunctions::murmur64_dayhoff + HashFunctions::Murmur64Dayhoff } else if hp { - HashFunctions::murmur64_hp + HashFunctions::Murmur64Hp } else if is_protein { - HashFunctions::murmur64_protein + HashFunctions::Murmur64Protein } else { - HashFunctions::murmur64_DNA + HashFunctions::Murmur64Dna }; KmerMinHash(_KmerMinHash::new( diff --git a/src/core/tests/minhash.rs b/src/core/tests/minhash.rs index bcb3fdb4fa..12477ed0d2 100644 --- a/src/core/tests/minhash.rs +++ b/src/core/tests/minhash.rs @@ -18,7 +18,7 @@ const EPSILON: f64 = 0.01; #[test] fn throws_error() { - let mut mh = KmerMinHash::new(0, 4, HashFunctions::murmur64_DNA, 42, false, 1); + let mut mh = KmerMinHash::new(0, 4, HashFunctions::Murmur64Dna, 42, false, 1); assert!( mh.add_sequence(b"ATGR", false).is_err(), @@ -28,8 +28,8 @@ fn throws_error() { #[test] fn merge() { - let mut a = KmerMinHash::new(0, 10, HashFunctions::murmur64_DNA, 42, false, 20); - let mut b = KmerMinHash::new(0, 10, HashFunctions::murmur64_DNA, 42, false, 20); + let mut a = KmerMinHash::new(0, 10, HashFunctions::Murmur64Dna, 42, 
diff --git a/src/core/src/wasm.rs b/src/core/src/wasm.rs
index ad656d9955..c2a0eb6c30 100644
--- a/src/core/src/wasm.rs
+++ b/src/core/src/wasm.rs
@@ -37,13 +37,13 @@ impl KmerMinHash {
 
         // TODO: at most one of (prot, dayhoff, hp) should be true
         let hash_function = if dayhoff {
-            HashFunctions::murmur64_dayhoff
+            HashFunctions::Murmur64Dayhoff
         } else if hp {
-            HashFunctions::murmur64_hp
+            HashFunctions::Murmur64Hp
         } else if is_protein {
-            HashFunctions::murmur64_protein
+            HashFunctions::Murmur64Protein
         } else {
-            HashFunctions::murmur64_DNA
+            HashFunctions::Murmur64Dna
         };
 
         KmerMinHash(_KmerMinHash::new(
diff --git a/src/core/tests/minhash.rs b/src/core/tests/minhash.rs
index bcb3fdb4fa..12477ed0d2 100644
--- a/src/core/tests/minhash.rs
+++ b/src/core/tests/minhash.rs
@@ -18,7 +18,7 @@ const EPSILON: f64 = 0.01;
 
 #[test]
 fn throws_error() {
-    let mut mh = KmerMinHash::new(0, 4, HashFunctions::murmur64_DNA, 42, false, 1);
+    let mut mh = KmerMinHash::new(0, 4, HashFunctions::Murmur64Dna, 42, false, 1);
 
     assert!(
         mh.add_sequence(b"ATGR", false).is_err(),
@@ -28,8 +28,8 @@ fn throws_error() {
 
 #[test]
 fn merge() {
-    let mut a = KmerMinHash::new(0, 10, HashFunctions::murmur64_DNA, 42, false, 20);
-    let mut b = KmerMinHash::new(0, 10, HashFunctions::murmur64_DNA, 42, false, 20);
+    let mut a = KmerMinHash::new(0, 10, HashFunctions::Murmur64Dna, 42, false, 20);
+    let mut b = KmerMinHash::new(0, 10, HashFunctions::Murmur64Dna, 42, false, 20);
 
     a.add_sequence(b"TGCCGCCCAGCA", false).unwrap();
     b.add_sequence(b"TGCCGCCCAGCA", false).unwrap();
@@ -55,20 +55,20 @@ fn merge() {
 
 #[test]
 fn invalid_dna() {
-    let mut a = KmerMinHash::new(0, 3, HashFunctions::murmur64_DNA, 42, false, 20);
+    let mut a = KmerMinHash::new(0, 3, HashFunctions::Murmur64Dna, 42, false, 20);
 
     a.add_sequence(b"AAANNCCCTN", true).unwrap();
     assert_eq!(a.mins().len(), 3);
 
-    let mut b = KmerMinHash::new(0, 3, HashFunctions::murmur64_DNA, 42, false, 20);
+    let mut b = KmerMinHash::new(0, 3, HashFunctions::Murmur64Dna, 42, false, 20);
 
     b.add_sequence(b"NAAA", true).unwrap();
     assert_eq!(b.mins().len(), 1);
 }
 
 #[test]
 fn similarity() -> Result<(), Box<dyn std::error::Error>> {
-    let mut a = KmerMinHash::new(0, 20, HashFunctions::murmur64_hp, 42, true, 5);
-    let mut b = KmerMinHash::new(0, 20, HashFunctions::murmur64_hp, 42, true, 5);
+    let mut a = KmerMinHash::new(0, 20, HashFunctions::Murmur64Hp, 42, true, 5);
+    let mut b = KmerMinHash::new(0, 20, HashFunctions::Murmur64Hp, 42, true, 5);
 
     a.add_hash(1);
     b.add_hash(1);
@@ -82,8 +82,8 @@ fn similarity() -> Result<(), Box<dyn std::error::Error>> {
 
 #[test]
 fn similarity_2() -> Result<(), Box<dyn std::error::Error>> {
-    let mut a = KmerMinHash::new(0, 5, HashFunctions::murmur64_DNA, 42, true, 5);
-    let mut b = KmerMinHash::new(0, 5, HashFunctions::murmur64_DNA, 42, true, 5);
+    let mut a = KmerMinHash::new(0, 5, HashFunctions::Murmur64Dna, 42, true, 5);
+    let mut b = KmerMinHash::new(0, 5, HashFunctions::Murmur64Dna, 42, true, 5);
 
     a.add_sequence(b"ATGGA", false)?;
     a.add_sequence(b"GGACA", false)?;
@@ -102,8 +102,8 @@ fn similarity_2() -> Result<(), Box<dyn std::error::Error>> {
 
 #[test]
 fn similarity_3() -> Result<(), Box<dyn std::error::Error>> {
-    let mut a = KmerMinHash::new(0, 20, HashFunctions::murmur64_dayhoff, 42, true, 5);
-    let mut b = KmerMinHash::new(0, 20, HashFunctions::murmur64_dayhoff, 42, true, 5);
+    let mut a = KmerMinHash::new(0, 20, HashFunctions::Murmur64Dayhoff, 42, true, 5);
+    let mut b = KmerMinHash::new(0, 20, HashFunctions::Murmur64Dayhoff, 42, true, 5);
 
     a.add_hash(1);
     a.add_hash(1);
@@ -126,8 +126,8 @@ fn similarity_3() -> Result<(), Box<dyn std::error::Error>> {
 
 #[test]
 fn angular_similarity_requires_abundance() -> Result<(), Box<dyn std::error::Error>> {
-    let mut a = KmerMinHash::new(0, 20, HashFunctions::murmur64_dayhoff, 42, false, 5);
-    let mut b = KmerMinHash::new(0, 20, HashFunctions::murmur64_dayhoff, 42, false, 5);
+    let mut a = KmerMinHash::new(0, 20, HashFunctions::Murmur64Dayhoff, 42, false, 5);
+    let mut b = KmerMinHash::new(0, 20, HashFunctions::Murmur64Dayhoff, 42, false, 5);
 
     a.add_hash(1);
     b.add_hash(1);
@@ -139,8 +139,8 @@ fn angular_similarity_requires_abundance() -> Result<(), Box<dyn std::error::Error>> {
 
 #[test]
 fn angular_similarity_btree_requires_abundance() -> Result<(), Box<dyn std::error::Error>> {
-    let mut a = KmerMinHashBTree::new(0, 20, HashFunctions::murmur64_dayhoff, 42, false, 5);
-    let mut b = KmerMinHashBTree::new(0, 20, HashFunctions::murmur64_dayhoff, 42, false, 5);
+    let mut a = KmerMinHashBTree::new(0, 20, HashFunctions::Murmur64Dayhoff, 42, false, 5);
+    let mut b = KmerMinHashBTree::new(0, 20, HashFunctions::Murmur64Dayhoff, 42, false, 5);
 
     a.add_hash(1);
     b.add_hash(1);
@@ -152,8 +152,8 @@ fn angular_similarity_btree_requires_abundance() -> Result<(), Box<dyn std::error::Error>> {
     let mut hashes: Vec<u64> = Vec::new();
@@ -769,7 +769,7 @@ fn seq_to_hashes(seq in "ACGTGTAGCTAGACACTGACTGACTGAC") {
 fn seq_to_hashes_2(seq in "QRMTHINK") {
     let scaled = 1;
 
-    let mut mh = KmerMinHash::new(scaled, 3, HashFunctions::murmur64_protein, 42, true, 0);
+    let mut mh = KmerMinHash::new(scaled, 3, HashFunctions::Murmur64Protein, 42, true, 0);
 
     mh.add_protein(seq.as_bytes())?; // .unwrap();
     let mut hashes: Vec<u64> = Vec::new();
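// The two hunks above are mechanical renames to the camel-case
// `HashFunctions` variants. A small sketch of the new spelling
// (illustrative only; it assumes the enum stays reachable via
// `sourmash::encodings::HashFunctions`, with a catch-all arm for any
// variants not shown in this diff):
fn moltype_label(hf: &sourmash::encodings::HashFunctions) -> &'static str {
    use sourmash::encodings::HashFunctions::*;

    match hf {
        Murmur64Dna => "DNA",
        Murmur64Protein => "protein",
        Murmur64Dayhoff => "dayhoff",
        Murmur64Hp => "hp",
        _ => "other",
    }
}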
diff --git a/src/core/tests/storage.rs b/src/core/tests/storage.rs
index 5a60e02fcc..e0d355d6b0 100644
--- a/src/core/tests/storage.rs
+++ b/src/core/tests/storage.rs
@@ -1,6 +1,9 @@
 use std::path::PathBuf;
 
-use sourmash::storage::{Storage, ZipStorage};
+use tempfile::TempDir;
+
+use sourmash::signature::Signature;
+use sourmash::storage::{FSStorage, InnerStorage, Storage, StorageArgs, ZipStorage};
 
 #[test]
 fn zipstorage_load_file() -> Result<(), Box<dyn std::error::Error>> {
@@ -42,3 +45,125 @@ fn zipstorage_list_sbts() -> Result<(), Box<dyn std::error::Error>> {
 
     Ok(())
 }
+
+#[cfg(feature = "parallel")]
+#[test]
+fn zipstorage_parallel_access() -> Result<(), Box<dyn std::error::Error>> {
+    use rayon::prelude::*;
+    use sourmash::signature::SigsTrait;
+
+    let mut filename = PathBuf::from(env!("CARGO_MANIFEST_DIR"));
+    filename.push("../../tests/test-data/v6.sbt.zip");
+
+    let zs = ZipStorage::from_file(filename.to_str().unwrap())?;
+
+    let total_hashes: usize = [
+        ".sbt.v3/f71e78178af9e45e6f1d87a0c53c465c",
+        ".sbt.v3/f0c834bc306651d2b9321fb21d3e8d8f",
+        ".sbt.v3/4e94e60265e04f0763142e20b52c0da1",
+        ".sbt.v3/6d6e87e1154e95b279e5e7db414bc37b",
+        ".sbt.v3/0107d767a345eff67ecdaed2ee5cd7ba",
+        ".sbt.v3/b59473c94ff2889eca5d7165936e64b3",
+        ".sbt.v3/60f7e23c24a8d94791cc7a8680c493f9",
+    ]
+    .par_iter()
+    .map(|path| {
+        let data = zs.load(path).unwrap();
+        let sigs: Vec<Signature> = serde_json::from_reader(&data[..]).expect("Loading error");
+        sigs.iter()
+            .map(|v| v.sketches().iter().map(|mh| mh.size()).sum::<usize>())
+            .sum::<usize>()
+    })
+    .sum();
+
+    assert_eq!(total_hashes, 3500);
+
+    Ok(())
+}
+
+#[test]
+fn innerstorage_save_sig() -> Result<(), Box<dyn std::error::Error>> {
+    let output = TempDir::new()?;
+
+    let fst = FSStorage::new("".into(), output.path().as_os_str().to_str().unwrap());
+
+    let instorage = InnerStorage::new(fst);
+
+    let mut filename = PathBuf::from(env!("CARGO_MANIFEST_DIR"));
+    filename.push("../../tests/test-data/genome-s10.fa.gz.sig");
+
+    let sig = Signature::from_path(filename)?.swap_remove(0);
+    let new_path = instorage.save_sig("test", sig.clone())?;
+    dbg!(new_path);
+
+    let loaded_sig = instorage.load_sig("test")?;
+
+    assert_eq!(sig.name(), loaded_sig.name());
+    assert_eq!(sig.md5sum(), loaded_sig.md5sum());
+
+    Ok(())
+}
+
+#[test]
+fn innerstorage_load() -> Result<(), Box<dyn std::error::Error>> {
+    let output = TempDir::new()?;
+
+    let fst = FSStorage::new("".into(), output.path().as_os_str().to_str().unwrap());
+
+    let instorage = InnerStorage::new(fst);
+
+    let mut filename = PathBuf::from(env!("CARGO_MANIFEST_DIR"));
+    filename.push("../../tests/test-data/genome-s10.fa.gz.sig");
+
+    let sig = Signature::from_path(filename)?.swap_remove(0);
+    let new_path = instorage.save_sig("test", sig.clone())?;
+    dbg!(new_path);
+
+    let raw_data = instorage.load("test")?;
+    let loaded_sig = Signature::from_reader(raw_data.as_slice())?.swap_remove(0);
+
+    assert_eq!(sig.name(), loaded_sig.name());
+    assert_eq!(sig.md5sum(), loaded_sig.md5sum());
+
+    Ok(())
+}
+
+#[test]
+fn innerstorage_args() -> Result<(), Box<dyn std::error::Error>> {
+    let output = TempDir::new()?;
+    let path = output.path().as_os_str().to_str().unwrap();
+
+    let fst = FSStorage::new("".into(), path);
+
+    let instorage = InnerStorage::new(fst);
+
+    let args = instorage.args();
+
+    assert!(matches!(args, StorageArgs::FSStorage { .. }));
+    let StorageArgs::FSStorage { path: p } = args;
+    assert_eq!(p, path);
+
+    Ok(())
+}
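// A sketch of why InnerStorage exists at all (illustrative only; it assumes
// `InnerStorage` is Sync, which its stated purpose, synchronizing parallel
// access to the wrapped Storage, implies): several threads can read through
// one shared handle by reference.
fn shared_access_sketch(instorage: &sourmash::storage::InnerStorage) {
    use sourmash::storage::Storage;

    std::thread::scope(|scope| {
        for path in ["a", "b", "c"] {
            scope.spawn(move || {
                // every thread goes through the same synchronized handle
                let _sig = instorage.load_sig(path);
            });
        }
    });
}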
+
+#[test]
+fn innerstorage_from_args() -> Result<(), Box<dyn std::error::Error>> {
+    let output = TempDir::new()?;
+    let path = output.path().as_os_str().to_str().unwrap();
+
+    let fst = FSStorage::new("".into(), path);
+    let args = fst.args();
+
+    let instorage = InnerStorage::new(FSStorage::from(&args));
+    let inargs = instorage.args();
+
+    assert!(matches!(inargs, StorageArgs::FSStorage { .. }));
+    let StorageArgs::FSStorage { path: p1 } = inargs;
+    assert_eq!(p1, path);
+
+    assert!(matches!(args, StorageArgs::FSStorage { .. }));
+    let StorageArgs::FSStorage { path: p2 } = args;
+    assert_eq!(p2, path);
+
+    Ok(())
+}
diff --git a/src/sourmash/sbt_storage.py b/src/sourmash/sbt_storage.py
index a22e782d69..42a4fceaa6 100644
--- a/src/sourmash/sbt_storage.py
+++ b/src/sourmash/sbt_storage.py
@@ -130,7 +130,7 @@ def subdir(self, value):
         self._methodcall(lib.zipstorage_set_subdir, to_bytes(value), len(value))
 
     def _filenames(self):
-        if self.__inner:
+        if not self._objptr:
             return self.__inner._filenames()
 
         size = ffi.new("uintptr_t *")
@@ -150,7 +150,7 @@ def save(self, path, content, *, overwrite=False, compress=False):
         raise NotImplementedError()
 
     def load(self, path):
-        if self.__inner:
+        if not self._objptr:
             return self.__inner.load(path)
 
         try:
diff --git a/tox.ini b/tox.ini
index 03dc2e79f2..0e5602628c 100644
--- a/tox.ini
+++ b/tox.ini
@@ -50,6 +50,11 @@ commands = pytest \
     --junitxml {toxworkdir}/junit.{envname}.xml \
     {posargs:doc tests}
 
+[testenv:.pkg]
+pass_env =
+    LIBCLANG_PATH
+    BINDGEN_EXTRA_CLANG_ARGS
+
 [testenv:pypy3]
 deps =
     pip >= 19.3.1
@@ -104,7 +109,7 @@ commands =
 description = invoke sphinx-build to build the HTML docs
 basepython = python3.10
 extras = doc
-whitelist_externals = pandoc
+allowlist_externals = pandoc
 pass_env =
     HOME
 change_dir = {toxinidir}
 #commands = sphinx-build -d "{toxworkdir}/docs_doctree" doc "{toxworkdir}/docs_out" --color -W -bhtml {posargs}