[Feature Request]: Use Milvus 2.5 analyzers for hybrid search #17504

petros94 · 2025-01-13T18:34:12Z

Feature Description

Milvus 2.5 introduced full text search, which lets built-in analyzers handle sparse-embedding retrieval using the BM25 algorithm. From the Milvus documentation:

Full text search simplifies the process of text-based searching by eliminating the need for manual embedding. This feature operates through the following workflow:
Text input: You insert raw text documents or provide query text without needing to manually embed them.
Text analysis: Milvus uses an analyzer to tokenize the input text into individual, searchable terms. For more information on analyzers, refer to Analyzer Overview.
Function processing: The built-in function receives tokenized terms and converts them into sparse vector representations.
Collection store: Milvus stores these sparse embeddings in a collection for efficient retrieval.
BM25 scoring: During a search, Milvus applies the BM25 algorithm to calculate scores for the stored documents and ranks matched results based on their relevance to the query text.
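To make the workflow above concrete, here is a minimal pure-Python sketch of the same steps: tokenize, build sparse term vectors, then rank with BM25. It is only an illustration of the idea, not Milvus's internal implementation. The `tokenize` function is a simplistic stand-in for an analyzer, and the `k1`/`b` defaults and the IDF variant are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Stand-in for an analyzer: lowercase and split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_sparse_vectors(docs):
    """Map each document to a sparse {term_id: term_frequency} vector."""
    vocab = {}
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        vectors.append(
            {vocab.setdefault(term, len(vocab)): tf for term, tf in counts.items()}
        )
    return vocab, vectors


def bm25_scores(query, vocab, vectors, k1=1.2, b=0.75):
    """Score every stored document against the query text with BM25."""
    n_docs = len(vectors)
    doc_lens = [sum(vec.values()) for vec in vectors]
    avg_len = sum(doc_lens) / n_docs
    scores = [0.0] * n_docs
    for term in set(tokenize(query)):
        term_id = vocab.get(term)
        if term_id is None:
            continue  # term never seen at ingestion time
        df = sum(1 for vec in vectors if term_id in vec)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for i, vec in enumerate(vectors):
            tf = vec.get(term_id, 0)
            norm = tf + k1 * (1 - b + b * doc_lens[i] / avg_len)
            scores[i] += idf * tf * (k1 + 1) / norm
    return scores


docs = [
    "Milvus supports full text search",
    "Sparse vectors store term weights",
    "Hybrid search combines dense and sparse retrieval",
]
vocab, vectors = build_sparse_vectors(docs)
scores = bm25_scores("full text search", vocab, vectors)
# Document indices ordered from most to least relevant.
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

With full text search enabled, Milvus would perform the equivalent of all three functions server-side, so the client only sends raw text.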

This means that sparse embedding creation and retrieval can now be handled internally by Milvus, so there is no need to create the embeddings manually in LlamaIndex code (we could still support that as an option):

llama_index/llama-index-integrations/vector_stores/llama-index-vector-stores-milvus/llama_index/vector_stores/milvus/utils.py

Line 176 in 2668cb7

class BGEM3SparseEmbeddingFunction(BaseSparseEmbeddingFunction):

llama_index/llama-index-integrations/vector_stores/llama-index-vector-stores-milvus/llama_index/vector_stores/milvus/utils.py

Line 210 in 2668cb7

def get_default_sparse_embedding_function() -> BGEM3SparseEmbeddingFunction:

This would require changes to how the collection's schema is created, so that it includes the text field:

schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000, enable_analyzer=True)

and the text -> sparse embedding function:

from pymilvus import Function, FunctionType

bm25_function = Function(
    name="text_bm25_emb", # Function name
    input_field_names=["text"], # Name of the VARCHAR field containing raw text data
    output_field_names=["sparse"], # Name of the SPARSE_FLOAT_VECTOR field reserved to store generated embeddings
    function_type=FunctionType.BM25,
)

schema.add_function(bm25_function)

Also, since the raw text would be defined in a new static field, some changes would be required to the ingestion pipeline and query engine (TextNode) so that the raw data is not duplicated. Right now it is stored as a JSON (dynamic) field in node_content.
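One possible way to avoid the duplication, sketched with plain dicts rather than real TextNode objects: strip the text out of the serialized node_content on insert and restore it from the static text field on read. The helper names `to_milvus_row` and `from_milvus_row`, and the row layout, are hypothetical and are not existing LlamaIndex or Milvus APIs.

```python
import json


def to_milvus_row(node: dict) -> dict:
    """Build a row whose raw text lives only in the static `text` field."""
    content = dict(node)        # shallow copy so the caller's node is untouched
    text = content.pop("text")  # keep the text out of the JSON payload
    return {
        "id": node["id"],
        "text": text,
        "node_content": json.dumps(content),
    }


def from_milvus_row(row: dict) -> dict:
    """Rebuild the node by re-attaching the text to the stored JSON payload."""
    node = json.loads(row["node_content"])
    node["text"] = row["text"]
    return node


node = {
    "id": "n1",
    "text": "Milvus 2.5 adds full text search.",
    "metadata": {"source": "docs"},
}
row = to_milvus_row(node)
restored = from_milvus_row(row)
```

The text is stored once, in the field the BM25 function reads from, and the query side still gets a complete node back.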

Reason

No response

Value of Feature

It moves the responsibility for creating sparse embeddings to Milvus, reducing code and logic on the LlamaIndex side. Furthermore, Milvus may be better optimized for this operation.

petros94 added the enhancement (New feature or request) and triage (Issue needs to be triaged/prioritized) labels on Jan 13, 2025
