consider how to support more flexible Collection in RevIndex for external storage #3321

Open
ctb opened this issue Sep 10, 2024 · 1 comment

ctb commented Sep 10, 2024

Currently, RevIndex only supports a single Collection for use as external storage. This limits it to things like Zip files and .sig.gz files, and maybe manifests and pathlists of .sig.gz files.

In sourmash-bio/sourmash_plugin_branchwater#430, we are adding MultiCollection to the branchwater plugin, so that we can support a variety of nice features, such as standalone manifests and pathlists pointing at zip files. MultiCollection recursively loads itself as needed.

However, MultiCollection can't be used as a Collection for RevIndex. This is unfortunate and leads to some contortions, the most notable of which is that the sourmash scripts index command can only use supported Collection types for external storage.

It would be nice to enable a larger subset of MultiCollection loading functionality for RevIndex.

Note that Storage is a trait, so perhaps one of the simplest ways forward is to implement a MultiStorage that supports the needed flexibility, and then instantiate a Collection with that MultiStorage.
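To make the MultiStorage idea concrete, here is a rough sketch of the dispatch pattern. Everything in it is an assumption for illustration: the `Storage` trait below is a deliberately simplified stand-in with a single `load` method, not sourmash's actual trait, and `MemBackend`, `StorageError`, and the `add` method are hypothetical names. The only point is that a MultiStorage can itself implement the trait while delegating to several inner backends.

```rust
use std::collections::HashMap;

/// Hypothetical error type for this sketch (not sourmash's error type).
#[derive(Debug, PartialEq)]
pub enum StorageError {
    NotFound(String),
}

/// Simplified stand-in for a storage trait: load raw bytes by path.
/// The real sourmash `Storage` trait has more methods than this.
pub trait Storage {
    fn load(&self, path: &str) -> Result<Vec<u8>, StorageError>;
}

/// Toy in-memory backend, standing in for something like a zip-backed
/// or filesystem-backed storage.
#[derive(Default)]
pub struct MemBackend {
    entries: HashMap<String, Vec<u8>>,
}

impl MemBackend {
    pub fn insert(&mut self, path: &str, data: &[u8]) {
        self.entries.insert(path.to_string(), data.to_vec());
    }
}

impl Storage for MemBackend {
    fn load(&self, path: &str) -> Result<Vec<u8>, StorageError> {
        self.entries
            .get(path)
            .cloned()
            .ok_or_else(|| StorageError::NotFound(path.to_string()))
    }
}

/// Hypothetical MultiStorage: holds several backends and implements the
/// same trait, so callers see a single storage.
#[derive(Default)]
pub struct MultiStorage {
    backends: Vec<Box<dyn Storage>>,
}

impl MultiStorage {
    pub fn add(&mut self, backend: Box<dyn Storage>) {
        self.backends.push(backend);
    }
}

impl Storage for MultiStorage {
    fn load(&self, path: &str) -> Result<Vec<u8>, StorageError> {
        // Try each backend in insertion order; the first hit wins.
        for backend in &self.backends {
            if let Ok(data) = backend.load(path) {
                return Ok(data);
            }
        }
        Err(StorageError::NotFound(path.to_string()))
    }
}
```

With this shape, one backend could hold sketches from a zip file and another could hold sketches found via a pathlist, and a lookup on the MultiStorage would find either. Trying backends in order is the simplest choice, but it silently shadows duplicate paths; a real implementation would probably want to route each path to a known backend (for example via the manifest) rather than probe.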


ctb commented Sep 14, 2024

File types that can be loaded as a single collection, per https://github.com/sourmash-bio/sourmash_plugin_branchwater/blob/fa7ab221baa9a8437353f2e894892fbd7545a479/src/utils.rs#L660 (which, yes, is potentially not inclusive of everything sourmash can actually do, but seems like a good starting point ;)

(checkbox indicates tested for use in external storage in the branchwater plugin)

  • single zip file - test_index::test_index_protein
  • a single JSON file - e.g. test_index::test_index_sig
  • manifest pointing at JSON files - e.g. test_index::test_index_manifest
  • pathlist pointing at JSON files - tested in a lot of places, e.g. test_index::test_index.

In an interesting nod to Luiz's point about Storage being the key, the manifest and pathlist ones work because they can both be supported by a single Storage implementation, FSStorage. The zipfile collection is supported by ZipStorage. The single JSON file is supported by creating a new storage (?), InnerStorage, that presumably copies the sketches - gotta look into that.

I think now I need to understand what InnerStorage does vs Storage... 🤔

I'm also curious: can we use one RocksDB as an external storage for another RocksDB? Then maybe we could efficiently index part of one RocksDB in another RocksDB index...
