I have checked that my calculated embeddings are deterministic. Therefore, when querying the collection with the same embedding, I would expect the nearest result to be the id that was just added (as the distance should be zero or very near zero). I am querying like this:
After running this test for each of my embeddings, I found that 56 of approximately 50,000 embeddings return an incorrect file path.
Further, this doesn't seem to be random: no matter how many times I run the same query with an embedding that returns an incorrect result, it returns the same incorrect result each time.
I am now unsure why this is happening and whether it is a potential bug.
Here is my full test code:
import chromadb
from ollama import Client

# Parameters
file_path = 'ProgramData/5/63.txt'
model_name = "llama3.2"
COLLECTION_NAME = "POJ_DATASET_ollama_embedding"

client = chromadb.HttpClient(host='localhost', port=8000)
collection = client.get_collection(name=COLLECTION_NAME)

# Read the file content
with open(file_path, "r") as file:
    text_content = file.read()

# Generate embedding for file text
ollama_client = Client(host='http://localhost:11434')
embedding = ollama_client.embed(model=model_name, input=text_content)['embeddings']

# Add embedding to collection
try:
    collection.add(
        embeddings=embedding,
        ids=[file_path]
    )
    print("Added to collection")
    doc_count = collection.count()
    print(f"Total documents in collection after add: {doc_count}")
except Exception as e:
    print(f"Error adding to collection: {e}")

# Query the collection with the same embedding
try:
    results = collection.query(
        query_embeddings=embedding,
        n_results=5,
    )
except Exception as e:
    print(f"Error querying collection: {e}")

result_paths = results["ids"][0]
result_distances = results["distances"][0]
print(f"Query path: {file_path}")
print(f"Result paths: {result_paths}")
print(f"Result distances: {result_distances}")
Total documents in collection after add: 51752
Query path: ProgramData/5/63.txt
Result paths: ['ProgramData/31/2034.txt', 'ProgramData/51/396.txt', 'ProgramData/22/544.txt', 'ProgramData/71/68.txt', 'ProgramData/48/666.txt']
Result distances: [0.3231008052825928, 0.3234265446662903, 0.3266177773475647, 0.3289269804954529, 0.3314428925514221]
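One way to narrow this down is to compare the index's answers against an exact brute-force search over the same vectors: if the exact search returns the expected id but the index does not, the miss comes from approximate search rather than from the data. The following is a minimal sketch of that comparison, not part of the original test code. The helper names (`squared_l2`, `exact_nearest_id`, `self_retrieval_failures`) are hypothetical, and squared L2 is assumed as the distance because the collection's configured space is not shown here.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (assumed metric; the real collection may differ)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_nearest_id(query: Vector, ids: List[str], vectors: List[Vector]) -> str:
    """Brute-force exact nearest neighbour: scans every stored vector."""
    best = min(range(len(ids)), key=lambda i: squared_l2(query, vectors[i]))
    return ids[best]


def self_retrieval_failures(
    ids: List[str],
    vectors: List[Vector],
    top1: Callable[[Vector], str],
) -> List[str]:
    """Return the ids whose own embedding does not come back as the top hit.

    `top1` is any function mapping a query vector to the best-matching id,
    for example a thin wrapper around an index query with n_results=1.
    """
    return [doc_id for doc_id, vec in zip(ids, vectors) if top1(vec) != doc_id]


# Toy data standing in for the real embeddings (hypothetical values).
ids = ["a.txt", "b.txt", "c.txt"]
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Exact search finds every vector's own id.
exact_failures = self_retrieval_failures(
    ids, vectors, lambda v: exact_nearest_id(v, ids, vectors)
)

# A deliberately broken lookup that always answers "a.txt" shows what a miss looks like.
broken_failures = self_retrieval_failures(ids, vectors, lambda v: "a.txt")

print(exact_failures)
print(broken_failures)
```

Running the same failing ids through both the exact search and the index query separates two cases: duplicate or near-duplicate vectors in the data (exact search also returns a different id) and approximate-search misses (only the index does).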
I think I'm having the same issue.
I loaded my vdb with 60000+ docs and their embeddings using a custom embedding function.
When inspecting the DB, the embeddings look normal and .query returns accurate values with the correct distances.
After compressing the folder (I'm using a persistent client) and transferring it to my local machine, all my embeddings are missing.
I have updated my Chroma version to 0.5.18 but this still results in the same output as before:
Total documents in collection after add: 51752
Query path: ProgramData/5/63.txt
Result paths: ['ProgramData/31/2034.txt', 'ProgramData/51/396.txt', 'ProgramData/22/544.txt', 'ProgramData/71/68.txt', 'ProgramData/48/666.txt']
Result distances: [0.3231008052825928, 0.3234265446662903, 0.3266177773475647, 0.3289269804954529, 0.3314428925514221]
As for your other suggestion, I already have ef_search set to a value of 100. My collection is set up as:
Hi @HughStanway. Sorry for not responding sooner. Ultimately, to get better recall and accuracy you would have to tweak the various HNSW parameters until they fit your needs. For example, I just reproduced your setup by making a collection with 50,000 embeddings of 2048 dimensions, and was able to get accurate results (querying using an embedding from the collection itself) by setting construction_ef to 1,000. Hope this helps. If trying many different search/construction values does not work for you, please feel free to submit a new issue and we can coordinate a time for us to take a closer look.
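For readers who want to try the suggestion, here is a minimal sketch of passing HNSW settings when a collection is created. It assumes the `hnsw:construction_ef` and `hnsw:search_ef` metadata keys used by chromadb 0.5.x; other versions may configure this differently. The helper names and the collection name in the usage note are hypothetical. Because construction settings apply when the index is built, this assumes a new collection that is then repopulated, not an in-place change.

```python
from typing import Any, Dict


def hnsw_metadata(construction_ef: int = 1000, search_ef: int = 100) -> Dict[str, Any]:
    """Build collection metadata carrying HNSW tuning values.

    The key names are assumed from chromadb 0.5.x; verify them against
    the version you are running.
    """
    return {
        "hnsw:construction_ef": construction_ef,
        "hnsw:search_ef": search_ef,
    }


def create_tuned_collection(client: Any, name: str, **hnsw_kwargs: int) -> Any:
    """Create a new collection with the HNSW settings above.

    `client` is expected to expose create_collection(name=..., metadata=...),
    as chromadb's HttpClient and PersistentClient do.
    """
    return client.create_collection(name=name, metadata=hnsw_metadata(**hnsw_kwargs))
```

With a real client this might be called as `create_tuned_collection(chromadb.HttpClient(host='localhost', port=8000), "POJ_DATASET_tuned")`, after which the embeddings are added again and the self-query test is repeated. Higher values generally trade indexing and query time for recall.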
What happened?
I have populated a Chroma collection with approximately 50,000 embeddings, which are pre-calculated using llama3.2 and then added as such:
Versions
Python 3.11.5
chromadb 0.5.11
llama3.2
MacOS Sonoma 14.1.1
Relevant log output