feat: add milvus vector db integration #419

Open: wants to merge 3 commits into base: main

AbhishekRP2002 commented Dec 19, 2024

  • Initial draft PR for a basic Milvus vector DB integration, addressing [Feature Request]: Add support for Milvus Vector Database #415.
  • Fixed minor typos that I noticed in backend/modules/vector_db/mongo.py.
  • @chiragjn Could you please help me understand the test setup you follow for functional and unit testing?
  • This PR includes a basic implementation of the Milvus vector DB integration (predominantly via Milvus Lite with Quick Setup; ref: https://milvus.io/docs/create-collection-instantly.md#Quick-Setup).
  • Once the basic implementation is merged, I am thinking we can scope out additional extended configs for Milvus, as is done for Qdrant, to provide more flexibility with a customized setup.

Contributor

mnvsk97 left a comment

Thanks for taking the time to add a new vector DB integration and for making corrections to the mongo file. Please have a look at the requested changes.


logger.debug(f"[Milvus] Deleted {len(data_point_vectors)} data point vectors")

def get_embedding_dimensions(self, embeddings: Embeddings) -> int:
Contributor

This method is already present in the base class. Please check if you can use that directly.

Author

Yeah, I did think about leveraging the method from the base class, but
logger.debug(f"[VectorDB] Embedding a dummy doc to get vector dimensions") made me question it a little.
It's a single literal string, right? And [VectorDB] is not a variable?
I was also wondering if there's any specific reason for using an f-string without any {} or expressions within {}.

Contributor

I was also thinking if there's any specific reason for adding f-string without any {} or expressions within {}

No reason. Looks like a mistake. Feel free to edit this.

logger.debug(f"[VectorDB] Embedding a dummy doc to get vector dimensions") made me question it a little.

You can change this message and make it generic across all vector DBs. Let me know if you need any help.

Author

sure
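
For readers following this thread, here is a minimal sketch of the two points discussed: reusing the base-class method instead of redefining it, and logging a message that is generic across vector DBs without an empty f-string. The names FakeEmbeddings, BaseVectorDBSketch, and MilvusVectorDBSketch are hypothetical stand-ins, not the project's actual classes, and the real base-class implementation may differ.

```python
import logging

logger = logging.getLogger("vector_db_sketch")


class FakeEmbeddings:
    """Hypothetical stand-in for an embeddings model (assumption, not project code)."""

    def embed_documents(self, texts):
        # Return one fixed-size vector per input text.
        return [[0.0] * 8 for _ in texts]


class BaseVectorDBSketch:
    """Hypothetical base class holding the shared helper."""

    def get_embedding_dimensions(self, embeddings) -> int:
        # The message has no placeholders of its own, so no f-prefix is needed.
        # Passing the class name as a logging argument keeps the message
        # generic across every subclass.
        logger.debug(
            "[%s] Embedding a dummy doc to get vector dimensions",
            type(self).__name__,
        )
        vectors = embeddings.embed_documents(["Initial document"])
        return len(vectors[0])


class MilvusVectorDBSketch(BaseVectorDBSketch):
    """Inherits get_embedding_dimensions instead of redefining it."""


dimensions = MilvusVectorDBSketch().get_embedding_dimensions(FakeEmbeddings())
```

With this shape, the Milvus class needs no copy of the method, and the log line names whichever subclass is running.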


logger.debug(f"[Milvus] Created new collection {collection_name}")

# TODO: take a look at this again. Have Doubts
Contributor

Can you elaborate on what the doubts are? We can resolve them together.

Author

Initially I was picking the list of ids from the LangChain Document objects and using them to delete via the delete method defined by the Milvus client. But I noticed that when the collection is created, the id field is set with auto_id=True, which assigns an auto-generated number as the id of each document. To handle this, I thought of deleting by metadata instead, taking reference from the other vector DB implementations.

Contributor

Yes, it's preferable to delete documents by fields in the metadata.

Best way would be to accept both params:

  1. If id is provided, delete by id
  2. If id is not provided and metadata is not empty, delete documents by any of the attributes in metadata
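
The two rules above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: build_delete_kwargs is a hypothetical helper, and it assumes the metadata is stored in a JSON field named "metadata" and that values are strings or numbers.

```python
import json
from typing import Any, Dict, Optional, Sequence


def build_delete_kwargs(
    ids: Optional[Sequence[Any]] = None,
    metadata: Optional[Dict[str, Any]] = None,
    json_field: str = "metadata",  # assumed name of the JSON metadata field
) -> Dict[str, Any]:
    """Choose between delete-by-id and delete-by-metadata."""
    # Rule 1: ids take precedence when they are provided.
    if ids:
        return {"ids": list(ids)}
    # Rule 2: otherwise match documents on ANY of the metadata attributes,
    # so the per-attribute clauses are joined with "or".
    if metadata:
        clauses = [
            f"{json_field}[{json.dumps(key)}] == {json.dumps(value)}"
            for key, value in metadata.items()
        ]
        return {"filter": " or ".join(clauses)}
    raise ValueError("Provide either ids or non-empty metadata")


by_id = build_delete_kwargs(ids=[1, 2], metadata={"source": "x"})
by_metadata = build_delete_kwargs(metadata={"source": "x", "page": 3})
```

The returned dict is meant to be unpacked into the client's delete call next to collection_name, in the same way the PR passes filter=delete_expr.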

self.milvus_client.create_collection(
    collection_name=collection_name,
    dimension=vector_size,
    metric_type="COSINE",  # https://milvus.io/docs/metric.md#Metric-Types : check for other supported metrics
Contributor

Is cosine the best metric_type?

Also, please make this configurable from the class initialization.

Author

AbhishekRP2002, Dec 20, 2024

Sure. By "make this configurable", do you mean metric_type?

The best metric is subjective; as I understand it, it depends on the data and the use case. If, say, the embedding vectors are sparse and the use case demands exact full-text search, then BM25 is preferable; if the data is not normalized, Inner Product (IP) would make more sense. In general, though, for dense vectors cosine similarity captures the orientation of the vectors, and hence the semantic meaning, and it is the default choice in a lot of vector DBs.
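
A small worked example of the cosine versus IP distinction, using made-up vectors: cosine similarity depends only on orientation, so rescaling a vector leaves it unchanged, while the inner product grows with magnitude. On unit-normalized vectors the two agree.

```python
import math
from typing import List


def inner_product(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: List[float], b: List[float]) -> float:
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


def normalize(a: List[float]) -> List[float]:
    norm = math.sqrt(inner_product(a, a))
    return [x / norm for x in a]


# Illustrative vectors only; b_scaled points the same way as b but is 10x longer.
a = [1.0, 2.0, 3.0]
b = [2.0, 0.0, 1.0]
b_scaled = [10.0 * x for x in b]

cosine_unchanged = math.isclose(cosine_similarity(a, b), cosine_similarity(a, b_scaled))
ip_scales = math.isclose(inner_product(a, b_scaled), 10.0 * inner_product(a, b))
same_on_unit_vectors = math.isclose(
    inner_product(normalize(a), normalize(b)), cosine_similarity(a, b)
)
```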

Contributor

Sure, by make this configurable do you mean metric_type ?

Yes, what do you think?

Author

Yes, my initial thought was to make it configurable, and there are multiple ways to do it. Since this is a generic field common to all vector DBs, it makes more sense to me to add it directly under VectorDBConfig as an attribute, but that would require changes in the rest of the vector DB implementations. To avoid that for now, I am thinking of leveraging the config attribute, which is meant for additional params.

I will update the code in the next commit; please let me know if that implementation looks good to start with.
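
A minimal sketch of that approach, under stated assumptions: VectorDBConfigSketch is a hypothetical stand-in for the project's VectorDBConfig, the "metric_type" key inside config is an assumed name, and the supported set below is limited to three common dense-vector metrics for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Illustrative subset of dense-vector metrics; extend as needed.
SUPPORTED_METRICS = {"COSINE", "L2", "IP"}


@dataclass
class VectorDBConfigSketch:
    """Hypothetical stand-in for VectorDBConfig with a free-form config dict."""

    provider: str
    config: Dict[str, Any] = field(default_factory=dict)


def resolve_metric_type(cfg: VectorDBConfigSketch, default: str = "COSINE") -> str:
    """Read metric_type from the additional params, falling back to a default."""
    metric = str((cfg.config or {}).get("metric_type", default)).upper()
    if metric not in SUPPORTED_METRICS:
        raise ValueError(
            f"Unsupported metric_type {metric!r}; expected one of {sorted(SUPPORTED_METRICS)}"
        )
    return metric


default_metric = resolve_metric_type(VectorDBConfigSketch(provider="milvus"))
custom_metric = resolve_metric_type(
    VectorDBConfigSketch(provider="milvus", config={"metric_type": "ip"})
)
```

Resolving the value once in the class initializer and storing it on the instance keeps create_collection free of hard-coded metrics, and leaves the other vector DB implementations untouched.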

backend/modules/vector_db/milvus.py

if incremental and len(documents) > 0:
    # Instead of using document IDs, we'll delete based on metadata matching
    for doc in documents:
Contributor

Do you know if batch delete is possible here?

Contributor

I think extracting the deletion logic into another method makes sense here to keep the method simple and readable.

Author

AbhishekRP2002, Dec 20, 2024

Yeah, that's a good point. We can make it modular, following the separation-of-concerns principle. Will do.
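
One possible shape for the extracted helper, combining both review suggestions: instead of one delete per document, collect the distinct values of a metadata key and emit one filter expression per batch. This is a sketch under assumptions: batched_delete_filters is a hypothetical name, the metadata is assumed to live in a JSON field called "metadata", and the use of an `in` expression on a JSON key should be checked against the Milvus filtering docs before relying on it.

```python
import json
from typing import Any, Dict, Iterator, List


def batched_delete_filters(
    metadatas: List[Dict[str, Any]],
    key: str,
    batch_size: int = 100,
    json_field: str = "metadata",  # assumed name of the JSON metadata field
) -> Iterator[str]:
    """Yield one filter expression per batch of distinct metadata values."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    values = list(dict.fromkeys(md[key] for md in metadatas if key in md))
    for start in range(0, len(values), batch_size):
        chunk = values[start : start + batch_size]
        yield f"{json_field}[{json.dumps(key)}] in {json.dumps(chunk)}"


# Illustrative metadata; "source" is a made-up key.
docs_metadata = [{"source": "a"}, {"source": "b"}, {"source": "a"}, {"source": "c"}]
filters = list(batched_delete_filters(docs_metadata, key="source", batch_size=2))
```

Each yielded string would then be passed as the filter argument of a single delete call, so the number of round trips depends on the batch count rather than the document count.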

filter=delete_expr,
)

Milvus(
Contributor

You can use self.get_vector_store here

Contributor

mnvsk97 commented Dec 22, 2024

@AbhishekRP2002 Please add an example config in the compose.env and comment so anyone who wants to use Milvus can refer to it.

3 participants