[Bug]: Collection not returning id when querying using the exact same embedding used to add to collection #3113

HughStanway · 2024-11-11T12:27:55Z

What happened?

I have populated a Chroma collection with approximately 50,000 embeddings, which are pre-calculated using llama3.2 and then added as follows:

embedding = ollama_client.embed(model=model_name, input=text_content)['embeddings']
collection.add(
   embeddings=embedding,
   ids=[file_path]
)

I have checked that my calculated embeddings are deterministic. Therefore, when querying the collection with the same embedding, I would expect the nearest result to be the id that was just added (as the distance should be zero or very near zero). I am querying like this:

results = collection.query(
   query_embeddings=embedding,
   n_results=5,
)
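One way to check the "distance should be zero" expectation independently of the index is an exhaustive (exact) search over the stored vectors. The following is a minimal sketch, assuming cosine distance is defined as 1 minus cosine similarity; `cosine_distance`, `exact_nearest`, and the `stored` dict are hypothetical names for illustration and are not part of Chroma. If the exact search ranks the added id first but the collection query does not, that points at approximate-search recall rather than at the embeddings themselves.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors.

    Assumes neither vector is all zeros (that would divide by zero).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_nearest(query, stored, k=5):
    """Return the k closest (distance, id) pairs by exhaustive search.

    `stored` is a hypothetical dict mapping id -> embedding (list of floats).
    """
    scored = [(cosine_distance(query, vec), item_id) for item_id, vec in stored.items()]
    return sorted(scored)[:k]


# Toy data standing in for real embeddings
stored = {"a.txt": [1.0, 0.0], "b.txt": [0.0, 1.0], "c.txt": [1.0, 1.0]}
nearest = exact_nearest(stored["a.txt"], stored, k=2)
print(nearest)
```

An exhaustive scan over roughly 50,000 vectors is slow in pure Python but is only meant as a one-off sanity check for the handful of ids that fail.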

After running this test for each of my embeddings, I found that 56 of approximately 50,000 embeddings return an incorrect file path.

Further, this doesn't seem to be random: no matter how many times I run the same query with an embedding that returns an incorrect result, it returns the same incorrect result each time.

I am unsure why this is happening and whether it is a potential bug.
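The per-embedding test described above can be sketched as a small loop that records every id missing from its own query results. This is an illustrative sketch only: `find_self_retrieval_misses` and `query_fn` are hypothetical names, and `query_fn` is assumed to take an embedding and a result count and return a list of ids, nearest first.

```python
def find_self_retrieval_misses(items, query_fn, n_results=5):
    """Return the ids that do not appear in the results of their own query.

    items:    iterable of (id, embedding) pairs
    query_fn: hypothetical callable (embedding, n_results) -> list of ids
    """
    misses = []
    for item_id, embedding in items:
        returned_ids = query_fn(embedding, n_results)
        if item_id not in returned_ids:
            misses.append(item_id)
    return misses


# Toy stand-in for a real index: "b" is never returned for its own vector
fake_results = {(1.0, 0.0): ["a", "c"], (0.0, 1.0): ["c", "a"]}
items = [("a", (1.0, 0.0)), ("b", (0.0, 1.0))]
misses = find_self_retrieval_misses(items, lambda emb, k: fake_results[emb][:k])
print(misses)
```

With a Chroma collection, `query_fn` could wrap the query call, for example `lambda emb, k: collection.query(query_embeddings=[emb], n_results=k)["ids"][0]`.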

Here is my full test code:

import chromadb
from ollama import Client

# Parameters
file_path = 'ProgramData/5/63.txt'
model_name = "llama3.2"

# Connect to the running Chroma server and open the existing collection
COLLECTION_NAME = "POJ_DATASET_ollama_embedding"
client = chromadb.HttpClient(host='localhost', port=8000)
collection = client.get_collection(name=COLLECTION_NAME)

# Read the file content
with open(file_path, "r") as file:
    text_content = file.read()

# Generate embedding for file text
ollama_client = Client(host='http://localhost:11434')
embedding = ollama_client.embed(model=model_name, input=text_content)['embeddings']

# Add embedding to collection
try:
    collection.add(
        embeddings=embedding,
        ids=[file_path]
    )
    print("Added to collection")

    doc_count = collection.count()
    print(f"Total documents in collection after add: {doc_count}")

except Exception as e:
    print(f"Error adding to collection: {e}")


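# Query with the same embedding; the id added above should be the nearest result
try:
    results = collection.query(
        query_embeddings=embedding,
        n_results=5,
    )
except Exception as e:
    print(f"Error querying collection: {e}")
    raise  # without results there is nothing to report below

result_paths = results["ids"][0]
result_distances = results["distances"][0]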

print(f"Query path: {file_path}")
print(f"Result paths: {result_paths}")
print(f"Result distances: {result_distances}")

Versions

Python 3.11.5
chromadb 0.5.11
llama3.2
MacOS Sonoma 14.1.1

Relevant log output

Total documents in collection after add: 51752
Query path: ProgramData/5/63.txt
Result paths: ['ProgramData/31/2034.txt', 'ProgramData/51/396.txt', 'ProgramData/22/544.txt', 'ProgramData/71/68.txt', 'ProgramData/48/666.txt']
Result distances: [0.3231008052825928, 0.3234265446662903, 0.3266177773475647, 0.3289269804954529, 0.3314428925514221]

aseem-eduport · 2024-11-13T19:54:47Z

I think I'm having the same issue.
I loaded my vector DB with 60,000+ docs and their embeddings using a custom embedding function.
When inspecting the DB, the embeddings look normal and .query returns accurate values with correct distances.
After compressing the folder (I'm using a persistent client) and transferring it to my local machine, all my embeddings are missing.

itaismith · 2024-11-15T18:35:03Z

Hi there, we are taking a look at this. In the meantime I recommend updating your Chroma version to 0.5.18, and also check out #2675.

HughStanway · 2024-11-15T22:33:38Z

Hi, thanks for your response.

I have updated my Chroma version to 0.5.18 but this still results in the same output as before:

Total documents in collection after add: 51752
Query path: ProgramData/5/63.txt
Result paths: ['ProgramData/31/2034.txt', 'ProgramData/51/396.txt', 'ProgramData/22/544.txt', 'ProgramData/71/68.txt', 'ProgramData/48/666.txt']
Result distances: [0.3231008052825928, 0.3234265446662903, 0.3266177773475647, 0.3289269804954529, 0.3314428925514221]

As for your other suggestion, I already have ef_search set to 100. My collection is set up as:

collection = client.create_collection(
    name=COLLECTION_NAME,
    metadata={
        "hnsw:search_ef": 100,
        "hnsw:space": "cosine",
    },
)

itaismith · 2025-01-09T00:39:25Z

Hi @HughStanway. Sorry for not responding sooner. Ultimately, to get better recall and accuracy you would have to tweak the various HNSW parameters until they fit your needs. For example, I just reproduced your setup by making a collection with 50,000 embeddings of 2048 dimensions, and was able to get accurate results (querying using an embedding from the collection itself) by setting construction_ef to 1,000. Hope this helps. If trying many different search/construction values does not work for you, please feel free to submit a new issue and we can coordinate a time for us to take a closer look.
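A rough sketch of how the construction-time setting mentioned above might be passed when creating a collection. It assumes the 0.5.x-style metadata keys, with "hnsw:construction_ef" used alongside the "hnsw:search_ef" and "hnsw:space" keys shown earlier in this thread. The helper names `hnsw_metadata` and `create_tuned_collection` are hypothetical, and the defaults simply mirror the values quoted in this thread rather than a recommendation. As far as I know, construction parameters only take effect when the index is built, so an existing collection would likely need to be recreated and repopulated.

```python
def hnsw_metadata(construction_ef=1000, search_ef=100, space="cosine"):
    """Build a collection metadata dict with HNSW settings (assumed 0.5.x-style keys)."""
    return {
        "hnsw:space": space,
        "hnsw:construction_ef": construction_ef,
        "hnsw:search_ef": search_ef,
    }


def create_tuned_collection(client, name, **hnsw_kwargs):
    """Create a collection on an existing Chroma client with the settings above.

    `client` is assumed to be something like chromadb.HttpClient(...).
    """
    return client.create_collection(name=name, metadata=hnsw_metadata(**hnsw_kwargs))


print(hnsw_metadata())
```

After recreating the collection, re-adding the embeddings and rerunning the self-query test would show whether the 56 misses go away.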

HughStanway added the bug label (Something isn't working) Nov 11, 2024

itaismith self-assigned this Nov 15, 2024

itaismith added the 2025-review label Jan 3, 2025

itaismith closed this as completed Jan 9, 2025
