Make chroma-db less resource intensive #179

Open
aritraghsh09 opened this issue Jan 23, 2025 · 3 comments
Labels
enhancement (New feature or request) · Nice to have

Comments

@aritraghsh09 (Collaborator)

chroma-db, as currently implemented, has the following issues when scaling up to millions of images:

  • runs end with an OOM error unless the entire chroma-db can fit in RAM (for reference, a chroma-db with a million vectors is about 100 GB, so this quickly gets out of hand as you scale up to 10 million). As @drewoldag was pointing out, this is probably due to the chroma-db being loaded into memory for the write operation to happen?

  • runs where I set chroma-db to True take nearly an order of magnitude longer at the scale of millions (for inference with 1 million images on two A40 GPUs, runtime is 3 hours without chroma-db and > 24 hours with chroma-db). I am assuming a lot of time is being spent on I/O as data is written to the chroma-db.

aritraghsh09 added the enhancement (New feature or request) and Nice to have labels on Jan 23, 2025
@drewoldag (Collaborator)

In the short term, either a sqlite database or a manifest file that maps input_file_name to batched_output_file_name seems like it would be the way to go, if the use case is "Aritra knows the filename of the image that he wants the vector for".
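As a minimal sketch of the sqlite option (the table and helper names here are made up for illustration, not anything in the codebase), the manifest only needs to map each input file to the batched output file and the row of its vector within that file:

```python
import sqlite3

# Hypothetical manifest schema: one row per input image, pointing at the
# batched output file and the row offset of its vector within that file.
con = sqlite3.connect("vector_manifest.db")
con.execute(
    """CREATE TABLE IF NOT EXISTS manifest (
           input_file_name TEXT PRIMARY KEY,
           batched_output_file_name TEXT NOT NULL,
           row_index INTEGER NOT NULL
       )"""
)

def record(input_file_name, batched_output_file_name, row_index):
    con.execute(
        "INSERT OR REPLACE INTO manifest VALUES (?, ?, ?)",
        (input_file_name, batched_output_file_name, row_index),
    )
    con.commit()

def lookup(input_file_name):
    """Return (batched_output_file_name, row_index), or None if unknown."""
    return con.execute(
        "SELECT batched_output_file_name, row_index FROM manifest "
        "WHERE input_file_name = ?",
        (input_file_name,),
    ).fetchone()
```

Fetching the actual vector would then just be a numpy load of the batched .npy file plus an index into it, with nothing large held in RAM.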

However, this clearly doesn't support the desire to do similarity search. As Michael noted, we should look again at Faiss - it seems quite potent and even provides a nice breakdown of configuration options for various numbers of records in the index: https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors, https://github.com/facebookresearch/faiss/wiki/Indexing-1G-vectors

Just to jot them down, some of the initial concerns around using Faiss included:

  • doesn't support pip installation, conda install only (or build from source)
  • no immediately obvious way to store additional data with the vectors (i.e. no "id" column for mapping to an input file)

On that last point, searching again, I was able to find this: https://github.com/facebookresearch/faiss/wiki/Pre--and-post-processing#faiss-id-mapping, which seems to imply that some indexes natively support an id. That being said, the documentation doesn't make it immediately obvious how you request a vector associated with an id.
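For reference, a small sketch of how that id mapping looks in practice with IndexIDMap (untested in our pipeline; the reconstruct-by-id step at the end is my reading of the API rather than documented behavior):

```python
import numpy as np
import faiss

d = 64  # vector dimensionality (placeholder)
rng = np.random.default_rng(0)
vectors = rng.random((1000, d)).astype(np.float32)
ids = np.arange(10_000, 11_000, dtype=np.int64)  # our own 64-bit ids

# Wrap a flat index so we can attach our own ids to each vector.
index = faiss.IndexIDMap(faiss.IndexFlatL2(d))
index.add_with_ids(vectors, ids)

# Similarity search returns *our* ids, not internal positions.
distances, result_ids = index.search(vectors[:1], 5)

# Requesting a vector by id is the non-obvious part: IndexIDMap keeps a
# position -> id table (id_map), so we invert it ourselves and then
# reconstruct by internal position.
id_map = faiss.vector_to_array(index.id_map)
pos = int(np.where(id_map == 10_042)[0][0])
vec = index.index.reconstruct(pos)
```

Note the id -> position lookup is a linear scan here, which may be another argument for keeping a sqlite/manifest lookup alongside the index rather than leaning on Faiss for it.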

@drewoldag (Collaborator)

Another option might be sharding the chromadb into multiple smaller dbs.

For similarity search, we could multiprocess the request and have each worker query one shard, then bring all the results back for a final filtering on distance to get the requested number of nearest neighbors.

It would likely also require a manifest file of some kind that maps object id to shard, if one of the use cases is getting the vector from the db using the object id. But that might be better served by a sqlite db, and/or by just getting the vector directly from the output .npy file.
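To make the scatter-gather concrete, here is a rough sketch (the shard layout, collection name, and worker count are all assumptions):

```python
import heapq
from concurrent.futures import ProcessPoolExecutor

import chromadb

# Hypothetical shard layout: one persistent chroma directory per shard.
SHARD_PATHS = [f"./chroma_shards/shard_{i}" for i in range(8)]

def query_shard(shard_path, query_vector, k):
    # Each worker opens its own client; handles shouldn't be shared
    # across process boundaries.
    client = chromadb.PersistentClient(path=shard_path)
    collection = client.get_collection("embeddings")  # assumed name
    result = collection.query(query_embeddings=[query_vector], n_results=k)
    # Pair up (distance, id) for the merge step.
    return list(zip(result["distances"][0], result["ids"][0]))

def query_all_shards(query_vector, k):
    with ProcessPoolExecutor(max_workers=len(SHARD_PATHS)) as pool:
        futures = [
            pool.submit(query_shard, path, query_vector, k)
            for path in SHARD_PATHS
        ]
        candidates = [hit for f in futures for hit in f.result()]
    # Final filtering on distance: the k nearest across all shards.
    return heapq.nsmallest(k, candidates)
```

Asking every shard for the full k keeps the merged result exact (the global top k must lie in the union of the per-shard top k), at the cost of k × n_shards candidates per query.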
