
embedding query does not return the id which is known to exist in the db #3580

Open
malee1382 opened this issue Jan 28, 2025 · 3 comments

malee1382 commented Jan 28, 2025

I am using chromadb 0.4.14, with which I created a large vectordb containing around 21M ids. Now I query some sample texts (confidential) that were already ingested when the vectordb was created. However, some of them are not returned, no matter how large an n_results (e.g., 10000) I use.

To double check, I did the following:

import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

client_pat = chromadb.PersistentClient(path='my path/chromadb3')
collection_pat = client_pat.get_or_create_collection(name='Titles',
                                                     metadata={"hnsw:space": "cosine"})
collection_pat.count()

# Fetch the stored embedding for an id that is known to be in the collection
emb1 = collection_pat.get(ids=['585446416'], include=['embeddings']).get('embeddings')[0]

# Re-embed the original text with the same model
t = """my text"""
emb2 = list(model.encode(t))

At this point, I could already confirm that the embeddings are identical and the id does exist in the vectordb.

And the following gives a similarity of 1:

import numpy as np

def cosine_similarity(list1, list2):
    A = np.array(list1)
    B = np.array(list2)
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

similarity = cosine_similarity(emb1, emb2)
print(f"Cosine Similarity: {similarity}")

Note that I have run the same search for around 3,000 other texts, and for all of them I got the correct result. But somehow for this one, the expected id is not returned at the top.

So what should I do? Should I expect this because, in the end, HNSW is an approximation algorithm, even though a very high n_results is used?
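One way to tell HNSW recall loss apart from a real bug is to compute the exact nearest neighbors by brute force (feasible on a sample or a shard, not on all 21M vectors at once). A minimal sketch with numpy, assuming you have already pulled a batch of embeddings and their ids out of the collection:

```python
import numpy as np

def exact_top_k(query_emb, embeddings, ids, k=10):
    """Exact cosine-similarity top-k, bypassing the HNSW index (O(N) scan)."""
    q = np.asarray(query_emb, dtype=np.float32)
    q = q / np.linalg.norm(q)
    M = np.asarray(embeddings, dtype=np.float32)
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    sims = M @ q                   # cosine similarity against every stored vector
    top = np.argsort(-sims)[:k]    # indices of the k largest similarities
    return [(ids[i], float(sims[i])) for i in top]
```

If the missing id shows up in the exact top-k but not in `collection.query` with a comparable n_results, the miss is an HNSW recall issue rather than missing data.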

Back in the day, I downgraded my chromadb version due to some other bugs. Now I am facing this issue, and I am not sure whether it is a bug or not.

Additionally, my vectordb is approximately 83GB in size, but when I query, not all 83GB is loaded into memory (256GB, so there is enough space). Does that have something to do with how HNSW works? E.g., it traverses the graph toward the final neighbors, so not all of the data is needed?

Now I created a smaller vectordb of 1M ids, making sure that the id I had queried is in it. And when I query this smaller db, I do get the id at the top. So it seems that working with several smaller DBs instead of one very big one would decrease the chance of misses due to the HNSW algorithm?
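If you do shard into several smaller collections, the per-collection results can be merged client-side by distance. A minimal sketch, assuming Chroma-style query results for a single query text (parallel lists of ids and cosine distances, where smaller distance means more similar):

```python
def merge_query_results(results_list, n_results):
    """Merge Chroma-style query results from several collections for one query.

    Each element of results_list is assumed to look like:
        {"ids": [["id1", ...]], "distances": [[0.12, ...]]}
    """
    merged = []
    for res in results_list:
        merged.extend(zip(res["ids"][0], res["distances"][0]))
    # Cosine distance: smaller means more similar, so sort ascending.
    merged.sort(key=lambda pair: pair[1])
    return merged[:n_results]
```

Each shard's HNSW graph is smaller, which tends to improve per-shard recall, at the cost of running one query per shard.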

Many thanks in advance!


bryan-dailabs commented Jan 29, 2025

Hi @malee1382 I ran into the same issue. Possibly related to this one: #3113

The suggestion there is to increase the cluster size when creating the database, which is kind of a hassle but a possible solution.


malee1382 commented Jan 29, 2025

Many thanks @bryan-dailabs! I will try to recreate my db with a larger cluster size.

@bryan-dailabs

@malee1382 OK, good luck. I tried with just the cluster size (construction_ef), and it didn't help much. You may need to adjust the other hnsw settings here: https://docs.trychroma.com/docs/collections/configure
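For reference, in chromadb 0.4.x the HNSW settings are passed via collection metadata at creation time. A configuration sketch (the numeric values are illustrative, not recommendations; tune them against your own recall measurements):

```python
import chromadb

client = chromadb.PersistentClient(path="my_chroma_dir")  # hypothetical path

# HNSW index parameters are fixed when the collection is first created;
# changing construction_ef or M later requires rebuilding the index.
collection = client.get_or_create_collection(
    name="Titles",
    metadata={
        "hnsw:space": "cosine",        # distance metric
        "hnsw:construction_ef": 400,   # candidate list while building: higher = better graph, slower build
        "hnsw:search_ef": 400,         # candidate list at query time: higher = better recall, slower queries
        "hnsw:M": 32,                  # max links per node: higher = better recall, more memory
    },
)
```

Raising search_ef is usually the cheapest lever for recall, since it does not require re-ingesting the data into a new collection the way construction_ef and M do.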
