[Bug]: Collection not returning id when querying using the exact same embedding used to add to collection #3113

Closed
HughStanway opened this issue Nov 11, 2024 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@HughStanway

HughStanway commented Nov 11, 2024

What happened?

I have populated a Chroma collection with approximately 50,000 embeddings, which are pre-calculated with llama3.2 and then added as follows:

embedding = ollama_client.embed(model=model_name, input=text_content)['embeddings']
collection.add(
   embeddings=embedding,
   ids=[file_path]
) 

I have checked that my calculated embeddings are deterministic. Therefore, when querying the collection with the same embedding, I would expect the nearest result to be the id that was just added (as the distance should be zero or very near zero). I am querying like this:

results = collection.query(
   query_embeddings=embedding,
   n_results=5,
)

After running this test for each of my embeddings, I have found that 56 of approximately 50,000 embeddings result in an incorrect file path being returned.

Further, this doesn't seem to be random: no matter how many times I run the same query with an embedding that returns an incorrect result, it still returns the incorrect result each time.
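The self-query check described above could be automated along the following lines. This is only a sketch: `find_self_query_misses` and `chroma_top_id` are hypothetical helper names that do not appear in the original post, and the sketch assumes the embeddings are already held in a dict keyed by id.

```python
from typing import Callable, Dict, List, Sequence


def find_self_query_misses(
    embeddings_by_id: Dict[str, Sequence[float]],
    query_top_id: Callable[[Sequence[float]], str],
) -> List[str]:
    """Return the ids whose own embedding is NOT returned as the top hit."""
    misses = []
    for item_id, embedding in embeddings_by_id.items():
        if query_top_id(embedding) != item_id:
            misses.append(item_id)
    return misses


def chroma_top_id(collection) -> Callable[[Sequence[float]], str]:
    """Adapt a Chroma collection to the callable expected above (hypothetical helper)."""
    def _top(embedding: Sequence[float]) -> str:
        # query() returns one list of ids per query embedding; take the first hit.
        result = collection.query(query_embeddings=[list(embedding)], n_results=1)
        return result["ids"][0][0]
    return _top
```

Calling `find_self_query_misses(embeddings_by_id, chroma_top_id(collection))` would then list the ids that fail the check, which makes it easy to re-run the same measurement after changing index settings.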

I am now unsure why this could be happening and whether this is a potential bug.

Here is my full test code:

import chromadb
from ollama import Client

# Parameters
file_path = 'ProgramData/5/63.txt' 
model_name = "llama3.2" 

COLLECTION_NAME = "POJ_DATASET_ollama_embedding"
client = chromadb.HttpClient(host='localhost', port=8000)
collection = client.get_collection(name=COLLECTION_NAME)

# Read the file content
with open(file_path, "r") as file:
    text_content = file.read()

# Generate embedding for file text
ollama_client = Client(host='http://localhost:11434')
embedding = ollama_client.embed(model=model_name, input=text_content)['embeddings']

# Add embedding to collection
try:
    collection.add(
        embeddings=embedding,
        ids=[file_path]
    )
    print("Added to collection")

    doc_count = collection.count()
    print(f"Total documents in collection after add: {doc_count}")

except Exception as e:
    print(f"Error adding to collection: {e}")


try:
    results = collection.query(
        query_embeddings=embedding,
        n_results=5,
    )
except Exception as e:
    print(f"Error querying collection: {e}")
    raise  # results is undefined past this point, so stop here

result_paths = results["ids"][0]
result_distances = results["distances"][0]

print(f"Query path: {file_path}")
print(f"Result paths: {result_paths}")
print(f"Result distances: {result_distances}")

Versions

Python 3.11.5
chromadb 0.5.11
llama3.2
MacOS Sonoma 14.1.1

Relevant log output

Total documents in collection after add: 51752
Query path: ProgramData/5/63.txt
Result paths: ['ProgramData/31/2034.txt', 'ProgramData/51/396.txt', 'ProgramData/22/544.txt', 'ProgramData/71/68.txt', 'ProgramData/48/666.txt']
Result distances: [0.3231008052825928, 0.3234265446662903, 0.3266177773475647, 0.3289269804954529, 0.3314428925514221]
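For comparison, an exact (brute-force) search can show what the top result should be. The sketch below assumes cosine space, as the collection is configured later in this thread, where the distance is 1 minus the cosine similarity, so a vector queried against itself should score 0 up to floating-point rounding. `cosine_distance` and `exact_nearest` are illustrative names, not part of Chroma, and the vectors are assumed to be non-zero.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance: 1 - cosine similarity. Assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_nearest(
    query: Sequence[float],
    embeddings_by_id: Dict[str, Sequence[float]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Scan every stored vector and return the k closest (id, distance) pairs."""
    scored = [
        (item_id, cosine_distance(query, emb))
        for item_id, emb in embeddings_by_id.items()
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]
```

If the exact scan puts the queried id first with a distance near 0 while `collection.query` returns a top distance around 0.32, that would suggest the approximate index did not reach the stored vector, rather than the vector having been stored incorrectly. The stored vectors could be fetched for such a comparison with `collection.get(ids=[...], include=["embeddings"])`.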
@HughStanway HughStanway added the bug Something isn't working label Nov 11, 2024
@aseem-eduport

I think I'm having the same issue.
I loaded my vector DB with 60,000+ docs and their embeddings using a custom embedding function.
When inspecting the DB, the embeddings look normal and .query returns accurate values with correct distances.
After compressing the folder (I'm using the persistent client) and transferring it to my local machine, all my embeddings are missing.

@itaismith
Contributor

Hi there, we are taking a look at this. In the meantime I recommend updating your Chroma version to 0.5.18, and also check out #2675.

@itaismith itaismith self-assigned this Nov 15, 2024
@HughStanway
Author

HughStanway commented Nov 15, 2024

Hi, thanks for your response.

I have updated my Chroma version to 0.5.18 but this still results in the same output as before:

Total documents in collection after add: 51752
Query path: ProgramData/5/63.txt
Result paths: ['ProgramData/31/2034.txt', 'ProgramData/51/396.txt', 'ProgramData/22/544.txt', 'ProgramData/71/68.txt', 'ProgramData/48/666.txt']
Result distances: [0.3231008052825928, 0.3234265446662903, 0.3266177773475647, 0.3289269804954529, 0.3314428925514221]

As for your other suggestion, I already have hnsw:search_ef set to 100. My collection is set up as:

collection = client.create_collection(
    name=COLLECTION_NAME,
    metadata={
        "hnsw:search_ef": 100,
        "hnsw:space": "cosine",
    },
)

@itaismith
Contributor

Hi @HughStanway. Sorry for not responding sooner. Ultimately, to get better recall and accuracy, you would have to tweak the various HNSW parameters until they fit your needs. For example, I just reproduced your setup by making a collection with 50,000 embeddings of 2048 dimensions, and was able to get accurate results (querying using an embedding from the collection itself) by setting construction_ef to 1,000. Hope this helps. If trying many different search/construction values does not work for you, please feel free to submit a new issue and we can coordinate a time for us to take a closer look.
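A minimal sketch of what that suggestion might look like in code, reusing the metadata style shown earlier in the thread. The helper names `hnsw_metadata` and `create_tuned_collection` are hypothetical, and the values are the ones mentioned in this thread rather than recommended settings. Because construction parameters shape the index as it is built, this sketch assumes the collection is created fresh and re-populated instead of being modified in place.

```python
from typing import Dict, Union


def hnsw_metadata(
    construction_ef: int = 1000,
    search_ef: int = 100,
    space: str = "cosine",
) -> Dict[str, Union[int, str]]:
    """Build collection metadata with HNSW settings (values taken from the thread)."""
    return {
        "hnsw:space": space,
        "hnsw:construction_ef": construction_ef,
        "hnsw:search_ef": search_ef,
    }


def create_tuned_collection(client, name: str, **hnsw_kwargs):
    """Create a new collection with the HNSW metadata above (hypothetical helper)."""
    return client.create_collection(name=name, metadata=hnsw_metadata(**hnsw_kwargs))
```

For example, `create_tuned_collection(client, "POJ_DATASET_ollama_embedding_v2", construction_ef=1000)` would create a separate collection (the `_v2` name is made up here) that can be filled with the same embeddings and compared against the original using the self-query test.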
