
embedding query does not return the id which is known to exist in the db #3580

Open
malee1382 opened this issue Jan 28, 2025 · 3 comments

malee1382 commented Jan 28, 2025

I am using chromadb 0.4.14, with which I created a large vectordb containing around 21M ids. Now I query some sample texts (confidential) that were already ingested when the vectordb was created. However, some of them are not returned, no matter how large an n_results (e.g., 10000) I use.

To double check, I did the following:

import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

client_pat = chromadb.PersistentClient(path='my path/chromadb3')
collection_pat = client_pat.get_or_create_collection(name='Titles',
                                                     metadata={"hnsw:space": "cosine"})
collection_pat.count()

# Fetch the stored embedding for an id that is known to be in the collection
emb1 = collection_pat.get(ids=['585446416'], include=['embeddings']).get('embeddings')[0]

# Re-embed the original text with the same model
t = """my text"""
emb2 = list(model.encode(t))

At this point, I could already confirm that the embeddings are identical and the id does exist in the vectordb.

And the following gives a similarity of 1:

import numpy as np

def cosine_similarity(list1, list2):
    A = np.array(list1)
    B = np.array(list2)
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

similarity = cosine_similarity(emb1, emb2)
print(f"Cosine Similarity: {similarity}")

Note that I have run the same search for around 3,000 other texts, and for all of them I got the correct result. But somehow for this one, the expected id is not returned at the top.

So what should I do? Should I expect this because, in the end, HNSW is an approximation algorithm, even though a very high n_results is used?
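One way to tell HNSW recall loss apart from a real bug is to compute the exact nearest neighbors by brute force (feasible on a sample or a shard, not on all 21M vectors at once). A minimal sketch with numpy, assuming you have already pulled a batch of embeddings and their ids out of the collection:

```python
import numpy as np

def exact_top_k(query_emb, embeddings, ids, k=10):
    """Exact cosine-similarity top-k, bypassing the HNSW index (O(N) scan)."""
    q = np.asarray(query_emb, dtype=np.float32)
    q = q / np.linalg.norm(q)
    M = np.asarray(embeddings, dtype=np.float32)
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    sims = M @ q                   # cosine similarity against every stored vector
    top = np.argsort(-sims)[:k]    # indices of the k largest similarities
    return [(ids[i], float(sims[i])) for i in top]
```

If the missing id shows up in the exact top-k but not in `collection.query` with a comparable n_results, the miss is an HNSW recall issue rather than missing data.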

Back in the day, I downgraded my chromadb version due to some other bugs. Now I am facing this issue, and I am not sure whether it is a bug or not.

Additionally, my vectordb is approximately 83GB in size, but when I query, not all 83GB is loaded into memory (256GB, so there is enough space). Does that have something to do with how HNSW works? E.g., it traverses the graph toward the final neighbors, so not all of the data is needed?

Now I created a smaller vectordb of 1M ids, making sure that the id I had queried is in it. And when I query this smaller db, I do get the id at the top. So it seems that working with several smaller DBs instead of one very big one would decrease the chance of misses due to the HNSW algorithm?
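If you do shard into several smaller collections, the per-collection results can be merged client-side by distance. A minimal sketch, assuming Chroma-style query results for a single query text (parallel lists of ids and cosine distances, where smaller distance means more similar):

```python
def merge_query_results(results_list, n_results):
    """Merge Chroma-style query results from several collections for one query.

    Each element of results_list is assumed to look like:
        {"ids": [["id1", ...]], "distances": [[0.12, ...]]}
    """
    merged = []
    for res in results_list:
        merged.extend(zip(res["ids"][0], res["distances"][0]))
    # Cosine distance: smaller means more similar, so sort ascending.
    merged.sort(key=lambda pair: pair[1])
    return merged[:n_results]
```

Each shard's HNSW graph is smaller, which tends to improve per-shard recall, at the cost of running one query per shard.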

Many thanks in advance!


bryan-dailabs commented Jan 29, 2025

Hi @malee1382 I ran into the same issue. Possibly related to this one: #3113

The suggestion there is to increase the cluster size when creating the database, which is kind of a hassle but a possible solution.


malee1382 commented Jan 29, 2025

Many thanks @bryan-dailabs! I will try to recreate my db with a larger cluster size.

@bryan-dailabs

@malee1382 OK, good luck. I tried with just the cluster size (construction_ef), and it didn't help much. You may need to adjust the other hnsw settings here: https://docs.trychroma.com/docs/collections/configure
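For reference, in chromadb 0.4.x the HNSW settings are passed via collection metadata at creation time. A configuration sketch (the numeric values are illustrative, not recommendations; tune them against your own recall measurements):

```python
import chromadb

client = chromadb.PersistentClient(path="my_chroma_dir")  # hypothetical path

# HNSW index parameters are fixed when the collection is first created;
# changing construction_ef or M later requires rebuilding the index.
collection = client.get_or_create_collection(
    name="Titles",
    metadata={
        "hnsw:space": "cosine",        # distance metric
        "hnsw:construction_ef": 400,   # candidate list while building: higher = better graph, slower build
        "hnsw:search_ef": 400,         # candidate list at query time: higher = better recall, slower queries
        "hnsw:M": 32,                  # max links per node: higher = better recall, more memory
    },
)
```

Raising search_ef is usually the cheapest lever for recall, since it does not require re-ingesting the data into a new collection the way construction_ef and M do.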
