chroma-db as it is currently implemented has the following issues when scaling up to millions of images:

- Runs end with OOM unless the entire chroma-db can fit in RAM. (For reference, a chroma-db with a million vectors is about 100 GB, so this quickly gets out of hand as you scale up to 10 million.) As @drewoldag was pointing out, this is probably because the chroma-db is loaded into memory for the write operation to happen?
- Runs where I set chroma-db to True take almost an order of magnitude longer at the scale of millions (for inference with 1 million images on two A40 GPUs, runtime is 3 hours without chroma-db and > 24 hours with chroma-db). I am assuming a lot of the time is being spent on I/O as data is written to the chroma-db.
In the short term, either a sqlite database or a manifest file that maps input_file_name to batched_output_file_name seems like it would be the way to go, if the use case is "Aritra knows the filename of the image that he wants the vector for". A rough sketch of what that could look like is below.
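A minimal sketch of that sqlite manifest, assuming a hypothetical schema (table and column names here are placeholders, not anything that exists in the repo yet):

```python
import sqlite3

import numpy as np

# Hypothetical manifest schema: one row per input image, pointing at the
# batched output file and the row offset of its vector within that file.
conn = sqlite3.connect("manifest.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS manifest (
           input_file_name TEXT PRIMARY KEY,
           batched_output_file_name TEXT NOT NULL,
           row_index INTEGER NOT NULL
       )"""
)

def get_vector(input_file_name: str) -> np.ndarray:
    """Look up which batched .npy file holds the vector for this image and slice it out."""
    output_file, row_index = conn.execute(
        "SELECT batched_output_file_name, row_index FROM manifest WHERE input_file_name = ?",
        (input_file_name,),
    ).fetchone()
    # mmap keeps us from loading the whole batch file just to read one row
    return np.load(output_file, mmap_mode="r")[row_index]
```

That keeps the lookup path entirely out of chroma, so memory stays bounded by a single batched output file rather than the whole index.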
Another option might be sharding the chromadb into multiple smaller dbs.
For similarity search, we could multiprocess the request so that each worker queries one shard, then bring all the results back for a final filtering on distance to get the requested number of nearest neighbors (sketch after this comment).
It would likely also require a manifest file of some kind that would map object id to shard if one of the use cases is getting the vector from the db using the object id. But that might be better served via a sqlite db. And/or just getting the vector directly from the output .npy file.
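A rough sketch of the shard-and-merge query, assuming each shard is its own chroma PersistentClient directory and the standard collection.query API; the shard paths and collection name are placeholders:

```python
import heapq
from concurrent.futures import ProcessPoolExecutor

import chromadb

SHARD_PATHS = ["shards/chroma_00", "shards/chroma_01"]  # placeholder shard locations

def query_shard(shard_path: str, query_embedding: list, k: int):
    # Each worker opens only its own shard, so only that shard's index is resident in memory.
    client = chromadb.PersistentClient(path=shard_path)
    collection = client.get_or_create_collection("vectors")  # placeholder collection name
    res = collection.query(query_embeddings=[query_embedding], n_results=k)
    return list(zip(res["distances"][0], res["ids"][0]))

def knn_over_shards(query_embedding: list, k: int):
    # Ask every shard for its local top-k, then keep the k globally smallest distances.
    with ProcessPoolExecutor() as pool:
        per_shard = pool.map(
            query_shard,
            SHARD_PATHS,
            [query_embedding] * len(SHARD_PATHS),
            [k] * len(SHARD_PATHS),
        )
    merged = [hit for shard_hits in per_shard for hit in shard_hits]
    return heapq.nsmallest(k, merged)  # (distance, id) pairs
```

The merge step is cheap (at most shards × k candidates), so the cost is dominated by the per-shard queries running in parallel; the same object-id-to-shard manifest mentioned above would let point lookups skip the fan-out entirely.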