[Feature Request]: Use Milvus 2.5 analyzers for hybrid search #17504

Open
petros94 opened this issue Jan 13, 2025 · 0 comments
Labels
enhancement New feature or request triage Issue needs to be triaged/prioritized

Feature Description

Milvus 2.5 introduced full text search, which lets built-in analyzers drive sparse-embedding retrieval with the BM25 algorithm. From the Milvus documentation:

Full text search simplifies the process of text-based searching by eliminating the need for manual embedding. This feature operates through the following workflow:
Text input: You insert raw text documents or provide query text without needing to manually embed them.
Text analysis: Milvus uses an analyzer to tokenize the input text into individual, searchable terms. For more information on analyzers, refer to Analyzer Overview.
Function processing: The built-in function receives tokenized terms and converts them into sparse vector representations.
Collection store: Milvus stores these sparse embeddings in a collection for efficient retrieval.
BM25 scoring: During a search, Milvus applies the BM25 algorithm to calculate scores for the stored documents and ranks matched results based on their relevance to the query text.
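
To make the quoted workflow concrete, here is a minimal pure-Python sketch of the same steps (tokenize, build a sparse term vector, score with BM25). It is only an illustration of the idea, not how Milvus implements it: the regex tokenizer stands in for a real analyzer, and the function names and the `k1`/`b` defaults are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Stand-in for an analyzer: lowercase, then split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def to_sparse(tokens: list[str]) -> dict[str, int]:
    """Sparse term-frequency vector: only terms that occur are stored."""
    return dict(Counter(tokens))


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query (assumes docs is non-empty)."""
    doc_vectors = [to_sparse(tokenize(d)) for d in docs]
    doc_lengths = [sum(v.values()) for v in doc_vectors]
    avg_len = sum(doc_lengths) / len(docs)
    n_docs = len(docs)

    scores = []
    for vec, length in zip(doc_vectors, doc_lengths):
        score = 0.0
        for term in set(tokenize(query)):
            tf = vec.get(term, 0)
            if tf == 0:
                continue  # term absent from this document
            df = sum(1 for v in doc_vectors if term in v)  # document frequency
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf + k1 * (1 - b + b * length / avg_len)  # length normalization
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Milvus supports full text search with BM25",
    "Dense vectors capture semantic similarity",
    "BM25 ranks documents by term frequency",
]
scores = bm25_scores("bm25 full text search", docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

With full text search enabled, all of this happens inside Milvus: the caller only supplies raw text at insert and query time.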

This means that sparse embedding creation and retrieval can now be handled internally by Milvus, so LlamaIndex no longer needs to create the embeddings itself (the current behavior could remain available as an option):

def get_default_sparse_embedding_function() -> BGEM3SparseEmbeddingFunction:

This would require changes to how the collection schema is created, so that it includes the text field with the analyzer enabled:

schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000, enable_analyzer=True)

and the text -> sparse embedding function:

bm25_function = Function(
    name="text_bm25_emb", # Function name
    input_field_names=["text"], # Name of the VARCHAR field containing raw text data
    output_field_names=["sparse"], # Name of the SPARSE_FLOAT_VECTOR field reserved to store generated embeddings
    function_type=FunctionType.BM25,
)

schema.add_function(bm25_function)

Also, since the raw text would be defined in a new static field, the ingestion pipeline and query engine (TextNode) would need some changes so that the raw data is not duplicated. Right now the text is stored inside node_content, a JSON (dynamic) field.
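
One possible way to avoid the duplication, sketched with plain dictionaries: strip the text out of the serialized node before writing node_content, and put it back when reading. The helper names `split_node_payload` and `restore_node` are hypothetical and are not part of LlamaIndex or pymilvus. The sketch also assumes the node serializes to a dict with a top-level "text" key.

```python
import json


def split_node_payload(node: dict) -> tuple[str, str]:
    """Return (raw_text, node_content_json) with the text removed from the JSON.

    Hypothetical helper: the raw text would go to the static VARCHAR field,
    and the remaining JSON to node_content.
    """
    content = dict(node)  # shallow copy so the caller's dict is left untouched
    text = content.pop("text", "")
    return text, json.dumps(content)


def restore_node(text: str, node_content: str) -> dict:
    """Rebuild the node dict at query time from the two stored fields."""
    content = json.loads(node_content)
    content["text"] = text
    return content


node = {"id_": "n1", "text": "Milvus 2.5 adds full text search.", "metadata": {"page": 3}}
text, node_content = split_node_payload(node)
rebuilt = restore_node(text, node_content)
```

The round trip keeps the node intact while the text is stored only once, in the field that the BM25 function reads.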

Reason

No response

Value of Feature

It moves the responsibility for creating sparse embeddings to Milvus, reducing the code and logic on LlamaIndex's side. Furthermore, Milvus may be better optimized for this operation.

@petros94 petros94 added enhancement New feature or request triage Issue needs to be triaged/prioritized labels Jan 13, 2025