I am using chromadb 0.4.14, with which I created a large vectordb containing around 21M ids. I am now querying it with some sample texts (confidential) that were already used when the vectordb was created. However, some of them are never returned, no matter how large an n_results (e.g., 10000) I use.
At this point, I have already confirmed that the embeddings are identical and that the id does exist in the vectordb.
Computing the cosine similarity between the two embeddings as follows gives 1:
import numpy as np

def cosine_similarity(list1, list2):
    A = np.array(list1)
    B = np.array(list2)
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

similarity = cosine_similarity(emb1, emb2)
print(f"Cosine Similarity: {similarity}")
Note that I have run the same search for around 3000 other texts, and for all of them I got the correct result. For this one, however, the expected id is not returned at the top.
So what should I do? Is this to be expected because, in the end, it is an approximate algorithm, even when a very high n_results is used?
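One way to tell whether a miss comes from the approximate index rather than from the data is to compare against an exact, index-free ranking. Below is a minimal sketch of that check in NumPy. The function name `exact_top_k` and the random stand-in data are hypothetical and not part of chromadb; with the real 21M embeddings the same computation would have to be run in chunks.

```python
import numpy as np

def exact_top_k(query, embeddings, ids, k=10):
    """Rank every stored embedding by cosine similarity to the query (brute force, no index)."""
    q = np.asarray(query, dtype=np.float64)
    E = np.asarray(embeddings, dtype=np.float64)
    # Normalise so that a dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E @ q
    # Indices of the k most similar rows, best first.
    order = np.argsort(-sims)[:k]
    return [(ids[i], float(sims[i])) for i in order]

# Stand-in data (assumption): 1000 random 16-dimensional vectors.
rng = np.random.default_rng(0)
stored = rng.normal(size=(1000, 16))
stored_ids = [f"doc-{i}" for i in range(1000)]

# Query with a vector that is known to be stored.
query = stored[123]
top = exact_top_k(query, stored, stored_ids, k=5)
print(top[0])
```

If the expected id ranks first in the exact ranking but is missing from the vectordb query result, the miss would point at the approximate search (or at the index contents) rather than at the embeddings themselves.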
Some time ago I downgraded my chromadb version because of some other bugs. Now I am facing this problem, and I am not sure whether it is a bug or not.
Additionally, my vectordb is approximately 83GB in size, but when I query it, I see that not all 83GB is loaded into memory, even though my machine (256GB RAM) has enough space. Does that have something to do with how HNSW works? For example, does the search only follow a path through the graph toward its final neighbors, so that not all of the data is needed?
I then created a smaller vectordb with 1M ids, making sure that the id I had queried is in it. When I query this smaller DB, the id comes back on top. So it seems that working with several smaller DBs instead of one very big one would decrease the chance of a miss caused by the HNSW algorithm?
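If the data were split across several smaller DBs, each query would have to be sent to every shard and the per-shard hits merged into one global ranking. A minimal sketch of that merge step follows. The function `merge_shard_results` and the sample hits are hypothetical, and it assumes every shard uses the same embedding model and distance metric so that distances are comparable across shards.

```python
import heapq

def merge_shard_results(shard_results, n_results):
    """Merge per-shard lists of (id, distance) into one global top-n, smallest distance first."""
    all_hits = (hit for hits in shard_results for hit in hits)
    return heapq.nsmallest(n_results, all_hits, key=lambda hit: hit[1])

# Made-up hits from two shards (assumption): (id, cosine distance).
shard_a = [("a1", 0.02), ("a2", 0.40)]
shard_b = [("b1", 0.00), ("b2", 0.35)]

merged = merge_shard_results([shard_a, shard_b], n_results=3)
print(merged)
```

Each shard should return at least n_results hits of its own, since the global top-n could in principle come entirely from a single shard.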
Many thanks in advance!