Conversation

@jairad26
Contributor

@jairad26 jairad26 commented Jun 20, 2025

Description of changes

This PR adds a new method, embed_query, to the base EmbeddingFunction type so that embedding functions can define a separate path for embedding input on the query path. By default, embed_query invokes the __call__ method.
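To illustrate the default described above, here is a minimal sketch, not the actual chromadb implementation. `BaseEmbeddingFunction`, `LengthEmbedding`, and `PrefixedQueryEmbedding` are hypothetical names made up for this example; only `embed_query` and the `input=` keyword come from the PR.

```python
from typing import List

Vector = List[float]


class BaseEmbeddingFunction:
    """Hypothetical stand-in for the base embedding function type."""

    def __call__(self, input: List[str]) -> List[Vector]:
        raise NotImplementedError

    def embed_query(self, input: List[str]) -> List[Vector]:
        # Default behavior: the query path reuses the document path.
        return self(input=input)


class LengthEmbedding(BaseEmbeddingFunction):
    """Toy function that embeds each text as [character count]."""

    def __call__(self, input: List[str]) -> List[Vector]:
        return [[float(len(text))] for text in input]


class PrefixedQueryEmbedding(LengthEmbedding):
    """Toy function that adds a prefix on the query path only."""

    def embed_query(self, input: List[str]) -> List[Vector]:
        # Override: queries are embedded differently from documents.
        return self(input=["query: " + text for text in input])
```

A function that does not override `embed_query` returns the same vectors on both paths, which is why existing functions should keep working unchanged.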

This also adds a new protocol for SparseEmbeddingFunctions and two new embedding functions: huggingface_sparse_embedding_function and fastembed_sparse_embedding_function.

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

Contributor Author

jairad26 commented Jun 20, 2025

@github-actions

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

@jairad26 jairad26 force-pushed the jai/query-coll-config branch 5 times, most recently from 2d66a17 to 69696fd June 21, 2025 00:37
@jairad26 jairad26 force-pushed the jai/query-coll-config branch 3 times, most recently from a339c38 to 34225f8 June 24, 2025 17:34
@jairad26 jairad26 force-pushed the jai/query-coll-config branch 6 times, most recently from 7eac9c8 to bd49feb June 30, 2025 23:53
@jairad26 jairad26 force-pushed the jai/query-coll-config branch 2 times, most recently from 36a2766 to 0a6cc75 July 8, 2025 18:34
@jairad26 jairad26 force-pushed the jai/query-coll-config branch from 0a6cc75 to a219bb3 July 11, 2025 21:54
@jairad26 jairad26 force-pushed the jai/query-coll-config branch from a219bb3 to e67eb4f July 25, 2025 19:48
@jairad26 jairad26 force-pushed the jai/query-coll-config branch 2 times, most recently from 5f002bb to 5100775 September 9, 2025 19:03
@jairad26 jairad26 marked this pull request as ready for review September 9, 2025 19:04
    self,
    record_set: BaseRecordSet,
    embeddable_fields: Optional[Set[str]] = None,
    is_query: bool = False,
Contributor

This works, but for the sake of discussion, an alternative approach is to have separate methods for the read and write paths. Any thoughts?

Collaborator

Yeah +1
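For reference, the alternative raised in this thread could look roughly like the sketch below: two thin methods, one per path, instead of a single method taking an `is_query` flag. `ToyCollection`, `ToyEmbedding`, `_embed_for_write`, and `_embed_for_read` are hypothetical names invented for illustration and are not part of the PR.

```python
from typing import List

Vector = List[float]


class ToyEmbedding:
    """Toy embedding function with distinct document and query outputs."""

    def __call__(self, input: List[str]) -> List[Vector]:
        return [[float(len(text))] for text in input]

    def embed_query(self, input: List[str]) -> List[Vector]:
        # Negated so the two paths are easy to tell apart in a test.
        return [[-float(len(text))] for text in input]


class ToyCollection:
    """Hypothetical collection wrapper; not Chroma's CollectionCommon."""

    def __init__(self, embedding_function: ToyEmbedding) -> None:
        self._embedding_function = embedding_function

    def _embed_for_write(self, input: List[str]) -> List[Vector]:
        # Write path (add, update, upsert): document embedding.
        return self._embedding_function(input=input)

    def _embed_for_read(self, input: List[str]) -> List[Vector]:
        # Read path (query): query-specific embedding.
        return self._embedding_function.embed_query(input=input)
```

The trade-off is mostly at the call sites: separate methods make the chosen path explicit, while a flag keeps a single entry point for shared validation.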

Comment on lines +571 to +574
if is_query:
    return self._embedding_function.embed_query(input=input)
else:
    return self._embedding_function(input=input)
Contributor

How does this work for all embedding functions we support?

Contributor Author

Since there's a default that assumes the query config doesn't exist, none of the existing EFs will break.

Contributor

@Sicheng-Pan Sicheng-Pan left a comment

Lgtm

@jairad26 jairad26 force-pushed the jai/query-coll-config branch from 5100775 to 5edc33f September 15, 2025 17:58
@propel-code-bot
Contributor

propel-code-bot bot commented Sep 15, 2025

Add Query Config for Embedding Functions, Introduce Sparse Embedding Support

This PR introduces major extensibility to the embedding function API by adding a standardized embed_query method and optional query_config parameter to embedding functions. The change enables different code paths or configurations for query-time versus document-time embedding, improving support for multi-vector models and sparse embedding scenarios. Additionally, the PR implements a new SparseEmbeddingFunction protocol and provides initial integrations with HuggingFace SPLADE and FastEmbed, supporting sparse vector representations. Various core files are updated to propagate and use the new embed_query method and protocol, and extensive new and updated tests validate this extended configuration support.

Key Changes

• Added embed_query method to the EmbeddingFunction and new SparseEmbeddingFunction protocol in chromadb/api/types.py.
• Introduced support and full protocol for declaring sparse embedding functions along with validation utilities.
• Implemented new embedding function classes: HuggingFaceSparseEmbeddingFunction and FastembedSparseEmbeddingFunction, supporting model/parameter selection and query/document disambiguation.
• Expanded and refactored tests in chromadb/test/configurations/test_collection_configuration.py for serialization, deserialization, config propagation, and behavior of embedding functions with/without query_config.
• Updated CollectionCommon.py to call embed_query for query operations (when appropriate), ensuring backward compatibility with legacy embeddings.
• Modified core configuration loading, registration, and serialization paths to handle legacy and modern function signatures, properly propagate query_config, and support new types.
• Extended JinaEmbeddingFunction and related tests to support query_config.
• Added sparse embedding validation utilities and improved embedding function interface robustness.
• Updated initialization and registry in chromadb/utils/embedding_functions/__init__.py and related test coverage for new sparse classes.
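As a rough illustration of what a sparse embedding protocol and its validation might involve, here is a self-contained sketch. It assumes a sparse vector is a pair of parallel `indices` and `values` lists; that layout, along with the names `SparseVector`, `SparseEmbedder`, `validate_sparse_vector`, and `TermCountEmbedder`, is an assumption for this example and may differ from the types added in chromadb/api/types.py.

```python
from typing import Dict, List, Protocol, TypedDict


class SparseVector(TypedDict):
    indices: List[int]   # positions of the non-zero dimensions
    values: List[float]  # weights, parallel to indices


class SparseEmbedder(Protocol):
    """Hypothetical protocol: document path plus query path."""

    def __call__(self, input: List[str]) -> List[SparseVector]: ...

    def embed_query(self, input: List[str]) -> List[SparseVector]: ...


def validate_sparse_vector(vector: SparseVector) -> SparseVector:
    indices, values = vector["indices"], vector["values"]
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    if any(index < 0 for index in indices):
        raise ValueError("indices must be non-negative")
    if len(set(indices)) != len(indices):
        raise ValueError("indices must be unique")
    return vector


class TermCountEmbedder:
    """Toy sparse embedder: counts tokens found in a fixed vocabulary."""

    def __init__(self, vocabulary: Dict[str, int]) -> None:
        self._vocabulary = vocabulary

    def __call__(self, input: List[str]) -> List[SparseVector]:
        vectors: List[SparseVector] = []
        for text in input:
            counts: Dict[int, float] = {}
            for token in text.lower().split():
                if token in self._vocabulary:
                    index = self._vocabulary[token]
                    counts[index] = counts.get(index, 0.0) + 1.0
            ordered = sorted(counts)
            vectors.append(validate_sparse_vector(
                {"indices": ordered, "values": [counts[i] for i in ordered]}
            ))
        return vectors

    def embed_query(self, input: List[str]) -> List[SparseVector]:
        # Same path for queries; a real model might weight them differently.
        return self(input)
```

Only the dimensions that actually occur are stored, which is the main difference from a dense embedding with one float per dimension.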

Affected Areas

chromadb/api/types.py (embedding interfaces and core validation)
chromadb/api/models/CollectionCommon.py (embedding pathways, handling of query/document logic)
chromadb/utils/embedding_functions/ (embedding function classes, registry, imports)
chromadb/test/configurations/test_collection_configuration.py (tests for embedding function configs and serialization)
chromadb/test/ef/test_ef.py (import-level embedding function coverage)
chromadb/api/collection_configuration.py and related code (config load/store/validation)

This summary was automatically generated by @propel-code-bot

@jairad26 jairad26 force-pushed the jai/query-coll-config branch from 5edc33f to 341a219 September 16, 2025 00:17
@jairad26 jairad26 changed the title from "[ENH] add query config on collection configuration" to "[ENH] add query config on collection configuration, splade, and bm25 efs" Sep 16, 2025
class FastembedSparseEmbeddingFunction(SparseEmbeddingFunction[Documents]):
    def __init__(
        self,
        model_name: str,
Contributor

nit: maybe put a list of models for common use cases somewhere

@jairad26 jairad26 force-pushed the jai/query-coll-config branch from 341a219 to 6c083cf September 16, 2025 07:31
@jairad26 jairad26 force-pushed the jai/query-coll-config branch from 6c083cf to 255e7f2 September 16, 2025 15:26
@blacksmith-sh blacksmith-sh bot deleted a comment from jairad26 Sep 16, 2025
@jairad26 jairad26 merged commit 702afa5 into main Sep 16, 2025
59 checks passed
4 participants