
fix(chromadb): queries without embeddings result in program crash #11328

Merged

Conversation

@poxrud (Contributor) commented Feb 23, 2024

Description

When using `VectorStoreQuery` to query chromadb by metadata only, without an embedding, the program crashes with the following error:

```
ValueError: You must provide one of query_embeddings, query_texts, query_images, or query_uris.
```

Why is this a problem?

When using the `VectorIndexAutoRetriever`, it will sometimes suggest a metadata-only search, without an embedding, and this crashes the program with the above error message.

The fix is to use the chromadb collection's `get()` method instead of `query()` for cases where a metadata-only search is required.
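The change can be sketched as follows. This is a minimal illustration of the dispatch only, not the actual `ChromaVectorStore` code; `run_query` is a hypothetical helper invented for the sketch:

```python
# Minimal sketch of the fix's dispatch logic (illustrative only; the real
# change lives in ChromaVectorStore.query in llama-index-vector-stores-chroma).
def run_query(collection, query_embedding, where, top_k):
    if not query_embedding:
        # Metadata-only search: chromadb's get() accepts a metadata filter
        # without an embedding, avoiding the ValueError raised by query().
        return collection.get(where=where, limit=top_k)
    return collection.query(
        query_embeddings=[query_embedding], where=where, n_results=top_k
    )
```

With a dispatch like this, an auto-retriever request that carries only metadata filters falls through to `get()` instead of crashing.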

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Added a new test `test_add_to_chromadb_and_query_by_metafilters_only` to `test_chromadb.py`.
Run it with:

```
pytest llama-index-integrations/vector_stores/llama-index-vector-stores-chroma/tests/test_chromadb.py
```
  • Added new unit/integration tests
  • I stared at the code and made sure it makes sense
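The shape of such a test can be sketched with a stand-in store (a hedged illustration: the real test runs `ChromaVectorStore` against an actual chromadb collection, and `StubStore` here is invented for the sketch):

```python
# Illustrative shape of a metadata-only query test; StubStore is a
# stand-in and not part of llama-index or chromadb.
class StubStore:
    def __init__(self):
        self.nodes = []

    def add(self, nodes):
        self.nodes.extend(nodes)

    def query(self, filters, top_k):
        # Metadata-only path: no embedding is involved in the match.
        hits = [n for n in self.nodes
                if all(n["metadata"].get(k) == v for k, v in filters.items())]
        return hits[:top_k]


def test_query_by_metafilters_only():
    store = StubStore()
    store.add([
        {"id": "1", "metadata": {"author": "marie"}},
        {"id": "2", "metadata": {"author": "pierre"}},
    ])
    result = store.query(filters={"author": "marie"}, top_k=2)
    assert [n["id"] for n in result] == ["1"]
```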

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Feb 23, 2024
@logan-markewich (Collaborator) commented:

I think the idea is to provide an "empty vector" for the auto retriever to query with. But this works too.

```diff
@@ -291,13 +291,29 @@ def query(self, query: VectorStoreQuery, **kwargs: Any) -> VectorStoreQueryResult:
         else:
             where = kwargs.pop("where", {})

-        results = self._collection.query(
+        if not query.query_embedding:
+            return self._get(limit=query.similarity_top_k, where=where, **kwargs)
```
Review comment (Collaborator):

Should there still be a limit if it's purely a filter query?

@poxrud (Contributor, author) replied:

I think so. From the perspective of the end user, they don't know whether the `query()` or `get()` method gets used, so when they set `similarity_top_k` they expect at most that number of results.
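That expectation amounts to a small invariant: filter first, then cap at `similarity_top_k`. An illustrative sketch, where `get_by_metadata` is a made-up stand-in for the `get(where=..., limit=...)` behavior discussed here:

```python
# Stand-in for the get(where=..., limit=...) semantics discussed above:
# exact metadata match, then cap the result count at `limit`.
def get_by_metadata(records, where, limit):
    matched = [r for r in records
               if all(r["metadata"].get(k) == v for k, v in where.items())]
    return matched[:limit]

records = [{"id": str(i), "metadata": {"kind": "a" if i % 2 == 0 else "b"}}
           for i in range(6)]
# Three records match kind == "a" (ids 0, 2, 4); limit=2 caps the output.
assert [r["id"] for r in get_by_metadata(records, {"kind": "a"}, limit=2)] == ["0", "2"]
```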

Further in the diff:

```python
print(f"QUERY RES: {results}")
print(f"EMBEDDINGS: {query_embeddings}")
```

Review comment (Collaborator): remove debug prints

@poxrud (Contributor, author) commented Feb 27, 2024

> I think the idea is to provide an "empty vector" for the auto retriever to query with. But this works too.

As I understand it, the "empty vector" is used when the query string is empty and is more of a default search string in the form of an embedding. It doesn't allow for metadata-only searches.

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Feb 27, 2024
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Feb 28, 2024
@logan-markewich logan-markewich merged commit 381afe6 into run-llama:main Feb 28, 2024
8 checks passed
Dominastorm pushed a commit to uptrain-ai/llama_index that referenced this pull request Feb 28, 2024
anoopshrma pushed a commit to anoopshrma/llama_index that referenced this pull request Mar 2, 2024
Izukimat pushed a commit to Izukimat/llama_index that referenced this pull request Mar 29, 2024