Make chroma-db less resource intensive #179

Open
aritraghsh09 opened this issue Jan 23, 2025 · 3 comments
Labels
enhancement (New feature or request) · Nice to have

Comments

@aritraghsh09 (Collaborator)

chroma-db, as currently implemented, has the following issues when scaling up to millions of images:

  • runs end with an OOM error unless the entire chroma-db can fit in RAM (for reference, a chroma-db with a million vectors is about 100 GB, so this quickly gets out of hand as you scale up to 10 million). As @drewoldag was pointing out, this is probably due to the chroma-db being loaded into memory for the write operation to happen?

  • runs where I set chroma-db to True take nearly an order of magnitude longer at the scale of millions (for inference with 1 million images on two A40 GPUs, runtime is 3 hours without chroma-db and > 24 hours with chroma-db). I am assuming a lot of time is being spent on I/O as data is written to the chroma-db.

aritraghsh09 added the enhancement (New feature or request) and Nice to have labels on Jan 23, 2025
@drewoldag (Collaborator)

In the short term, either a sqlite database or a manifest file that maps input_file_name to batched_output_file_name seems like it would be the way to go, if the use case is "Aritra knows the filename of the image that he wants the vector for".
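As a minimal sketch of the sqlite option (the table and helper names here are made up for illustration, not anything in the codebase), the manifest only needs to map each input file to the batched output file and the row of its vector within that file:

```python
import sqlite3

# Hypothetical manifest schema: one row per input image, pointing at the
# batched output file and the row offset of its vector within that file.
con = sqlite3.connect("vector_manifest.db")
con.execute(
    """CREATE TABLE IF NOT EXISTS manifest (
           input_file_name TEXT PRIMARY KEY,
           batched_output_file_name TEXT NOT NULL,
           row_index INTEGER NOT NULL
       )"""
)

def record(input_file_name, batched_output_file_name, row_index):
    con.execute(
        "INSERT OR REPLACE INTO manifest VALUES (?, ?, ?)",
        (input_file_name, batched_output_file_name, row_index),
    )
    con.commit()

def lookup(input_file_name):
    """Return (batched_output_file_name, row_index), or None if unknown."""
    return con.execute(
        "SELECT batched_output_file_name, row_index FROM manifest "
        "WHERE input_file_name = ?",
        (input_file_name,),
    ).fetchone()
```

Fetching the actual vector would then just be a numpy load of the batched .npy file plus an index into it, with nothing large held in RAM.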

However, this clearly doesn't support the desire to do similarity search. As Michael noted, we should look again at Faiss - it seems quite potent and even provides a nice breakdown of configuration options for various numbers of records in the index: https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors, https://github.com/facebookresearch/faiss/wiki/Indexing-1G-vectors

Just to jot them down, some of the initial concerns around using Faiss included:

  • doesn't support pip installation, conda install only (or build from source)
  • no immediately obvious way to store additional data with the vectors (i.e. no "id" column for mapping to an input file)

On that last point, searching again, I was able to find this: https://github.com/facebookresearch/faiss/wiki/Pre--and-post-processing#faiss-id-mapping, which seems to imply that some indexes natively support an id. That being said, the documentation doesn't make it immediately obvious how you request a vector associated with an id.
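For reference, a small sketch of how that id mapping looks in practice with IndexIDMap (untested in our pipeline; the reconstruct-by-id step at the end is my reading of the API rather than documented behavior):

```python
import numpy as np
import faiss

d = 64  # vector dimensionality (placeholder)
rng = np.random.default_rng(0)
vectors = rng.random((1000, d)).astype(np.float32)
ids = np.arange(10_000, 11_000, dtype=np.int64)  # our own 64-bit ids

# Wrap a flat index so we can attach our own ids to each vector.
index = faiss.IndexIDMap(faiss.IndexFlatL2(d))
index.add_with_ids(vectors, ids)

# Similarity search returns *our* ids, not internal positions.
distances, result_ids = index.search(vectors[:1], 5)

# Requesting a vector by id is the non-obvious part: IndexIDMap keeps a
# position -> id table (id_map), so we invert it ourselves and then
# reconstruct by internal position.
id_map = faiss.vector_to_array(index.id_map)
pos = int(np.where(id_map == 10_042)[0][0])
vec = index.index.reconstruct(pos)
```

Note the id -> position lookup is a linear scan here, which may be another argument for keeping a sqlite/manifest lookup alongside the index rather than leaning on Faiss for it.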

@drewoldag (Collaborator)

Another option might be sharding the chromadb into multiple smaller dbs.

For similarity search, we could multiprocess the request and have each worker query one shard, then bring all the results back for a final filtering on distance to get the requested number of nearest neighbors.

It would likely also require a manifest file of some kind that maps object id to shard, if one of the use cases is getting the vector from the db using the object id. But that might be better served by a sqlite db, and/or by just getting the vector directly from the output .npy file.
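To make the scatter-gather concrete, here is a rough sketch (the shard layout, collection name, and worker count are all assumptions):

```python
import heapq
from concurrent.futures import ProcessPoolExecutor

import chromadb

# Hypothetical shard layout: one persistent chroma directory per shard.
SHARD_PATHS = [f"./chroma_shards/shard_{i}" for i in range(8)]

def query_shard(shard_path, query_vector, k):
    # Each worker opens its own client; handles shouldn't be shared
    # across process boundaries.
    client = chromadb.PersistentClient(path=shard_path)
    collection = client.get_collection("embeddings")  # assumed name
    result = collection.query(query_embeddings=[query_vector], n_results=k)
    # Pair up (distance, id) for the merge step.
    return list(zip(result["distances"][0], result["ids"][0]))

def query_all_shards(query_vector, k):
    with ProcessPoolExecutor(max_workers=len(SHARD_PATHS)) as pool:
        futures = [
            pool.submit(query_shard, path, query_vector, k)
            for path in SHARD_PATHS
        ]
        candidates = [hit for f in futures for hit in f.result()]
    # Final filtering on distance: the k nearest across all shards.
    return heapq.nsmallest(k, candidates)
```

Asking every shard for the full k keeps the merged result exact (the global top k must lie in the union of the per-shard top k), at the cost of k × n_shards candidates per query.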
