Make ANNLite the go-to Vector Search library to be scaled by Jina using the StatefulExecutor feature #20

Open
Nick17t opened this issue Feb 24, 2023 · 7 comments
@Nick17t
Contributor

Nick17t commented Feb 24, 2023

Project idea 5: Make ANNLite the go-to Vector Search library to be scaled by Jina using the StatefulExecutor feature

Info | Details
Skills needed | ANN, C++, Python, Databases
Project size | 350 hours
Difficulty level | Hard
Mentors | @Felix Wang, @Joan Martínez

Project Description

  • Jina is developing a stateful executor feature that enables Deployments with a state to be replicated and scaled. This opens the door to having a Vector Database in our ecosystem effectively and robustly. Iterating on ANNLite to act as the "Lucene" for Jina would be a great opportunity.

Expected outcomes

  • Build and validate an Executor in our Hub that uses ANNLite, or DocArray with ANNLite as a backend, to serve as the default Vector Database for all our examples with mid-sized data requirements.

More Info

  • In DocArray (v2) Document Stores (soon to be renamed to "Document Index"), we want to support multiple vector DBs and ANN libraries to give more options to the user
  • You can read more about DocArray here
  • And about Document Stores
  • But note that we are currently working on v2 of DocArray, which will be quite different. You can read more here
  • And for Document Stores in v2, see this PR: feat: hnswlib document index docarray/docarray#1124
@Nick17t Nick17t added the ideas label Feb 24, 2023
@JoanFM
Member

JoanFM commented Feb 27, 2023

ANNLite is a vector search library developed by Jina that uses HNSW as its search algorithm. On top of this, it supports filtering on Documents.
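
To illustrate what "vector search plus filtering" means, here is a minimal brute-force sketch in plain Python. It does not use ANNLite's API or HNSW; the function `filtered_search`, the `price` field, and the sample documents are hypothetical and exist only for this example.

```python
import math


def filtered_search(query, docs, predicate, limit=1):
    """Return the `limit` docs closest to `query` among those passing `predicate`."""
    # Apply the metadata filter first, then rank the survivors by distance.
    candidates = [d for d in docs if predicate(d)]
    candidates.sort(key=lambda d: math.dist(query, d["embedding"]))
    return candidates[:limit]


# Hypothetical documents: an id, an embedding, and one filterable field.
docs = [
    {"id": "a", "embedding": [1.0, 3.0], "price": 10},
    {"id": "b", "embedding": [1.0, 1.0], "price": 80},
    {"id": "c", "embedding": [2.0, 1.0], "price": 20},
]

# Without the filter, "b" is the exact match for [1, 1]; with the
# price filter, "b" is excluded and "c" becomes the closest hit.
hits = filtered_search([1.0, 1.0], docs, lambda d: d["price"] <= 50)
print([h["id"] for h in hits])
```

A real ANN library would replace the linear scan with an index such as HNSW; the sketch only shows how a filter narrows the candidate set before ranking.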

As a simple library, however, it has limited scalability. By wrapping it in a Jina Executor, one may be able to add a replication and sharding layer easily. The scalability and performance of this solution remain to be seen. The aim of this project is to make sure ANNLite can be used with an Executor in this way.
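
As a rough idea of what a sharding layer around a single-node search library does, the sketch below splits documents across shards, queries every shard, and merges the partial results. It is plain Python under stated assumptions, not Jina's or ANNLite's implementation: the classes `Shard` and `ShardedIndex`, the hash-based routing, and the brute-force per-shard search are all hypothetical.

```python
import heapq
import math
import zlib


class Shard:
    """One partition of the data; stands in for a single library instance."""

    def __init__(self):
        self.vectors = {}

    def index(self, doc_id, vector):
        self.vectors[doc_id] = vector

    def search(self, query, limit):
        # Brute-force scan; returns (distance, doc_id) pairs, closest first.
        scored = ((math.dist(query, v), doc_id) for doc_id, v in self.vectors.items())
        return heapq.nsmallest(limit, scored)


class ShardedIndex:
    """Routes writes to one shard and fans reads out to all shards."""

    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def index(self, doc_id, vector):
        # Deterministic routing (assumption): hash of the id modulo shard count.
        shard = self.shards[zlib.crc32(doc_id.encode()) % len(self.shards)]
        shard.index(doc_id, vector)

    def search(self, query, limit):
        # Each shard returns its own top `limit`; merging them gives the global top.
        partial = [hit for shard in self.shards for hit in shard.search(query, limit)]
        return heapq.nsmallest(limit, partial)


index = ShardedIndex(num_shards=2)
for doc_id, vec in [("a", [1.0, 3.0]), ("b", [1.0, 1.0]), ("c", [2.0, 1.0])]:
    index.index(doc_id, vec)

results = index.search([1.0, 1.0], limit=2)
print([doc_id for _, doc_id in results])
```

Asking every shard for its own top `limit` results is what makes the merged answer equal to the global top `limit`, regardless of how documents were routed. Replication, which the sketch leaves out, would additionally keep several copies of each shard in sync.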

Relevant documentation to follow:

@kronsbein

@JoanFM @numb3r3 @Nick17t

Hey everyone,
I'm a CS student from Berlin and interested in contributing to this project. I briefly went through the provided references and have a couple of general questions first:

  • I read about HNSW here on arxiv and was wondering if ANNLite's similarity search logic is based on this paper?
  • I am trying to understand the stateful executor through the provided link to the PR. Is a comprehensive performance analysis of a potential approach also within the scope of this project?
  • Lastly, out of curiosity, are there any plans to add other open source vector similarity searches like Qdrant, Elastic?

Thank you and I'm happy to discuss further!

Best,
Marvin

@JoanFM
Member

JoanFM commented Mar 1, 2023

Hello @kronsbein ,

Answers to your questions:

  1. Yes, it is based on this paper

  2. This will be a beta feature, and we will potentially analyze the performance and see the scale at which the solution works.

  3. No, that is out of scope. The point is that with Jina we want to be able to handle stateful payloads with our Executor abstraction, without needing external services that require their own orchestration. We want to evaluate to what extent and scale this can be achieved with StatefulExecutor plus a vector search library such as ANNLite.

Thanks,

Joan

@Ahmed-Emad10

Ahmed-Emad10 commented Mar 8, 2023

@JoanFM @numb3r3
Hello,
I'm Ahmed from Egypt, a student in the Computer Engineering department of the Faculty of Engineering at Cairo University. I'm interested in this project and would like to participate. I wanted to know what to do and whether there is anything I have to learn before joining.
Also, I don't know a lot about this project, so are the links provided above sufficient?
I appreciate your thoughts and time. I hope to hear from you soon.
Best regards

@Hansolo1103

@JoanFM @numb3r3

Hello! I am Sohan Mishra, a student at the National Institute of Technology (NIT). I have read the DocArray docs and would like to work on this project. I have experience with Python and C++ and have been learning a little about ANNLite over the past few days.

From what I understand from reading the first link under "More Info":
The doc discusses using a document store (e.g., SQLite or Redis) as a storage backend for DocumentArrays to provide longer persistence and faster retrieval. The DocumentArrays with a document store look and feel almost the same as a regular in-memory DocumentArray, allowing easy switching between backends. The section explains how to initialize a DocumentArray with an external storage backend and how to create, retrieve, update, and delete Documents. It also introduces the concept of subindices for multimodal or nested data, and it summarizes the key functionalities of document stores, including vector search, vector search + filter, and filter.

@amangupta201

With storage='annlite', AnnLiteIndexer indexes Documents into a DocumentArray. Here, the DocumentArray makes effective use of AnnLite to store and search Documents.
The following shows the code snippet for the vector search:

from jina import Flow
from docarray import Document
import numpy as np

f = Flow().add(
    uses='jinahub://AnnLiteIndexer',
    uses_with={'n_dim': 2},
)

with f:
    f.post(
        on='/index',
        inputs=[
            Document(id='a', embedding=np.array([1, 3])),
            Document(id='b', embedding=np.array([1, 1])),
        ],
    )

    docs = f.post(
        on='/search',
        inputs=[Document(embedding=np.array([1, 1]))],
    )

    # will print "The ID of the best match of [1,1] is: b"
    print('The ID of the best match of [1,1] is: ', docs[0].matches[0].id)

@Nick17t
Contributor Author

Nick17t commented Mar 22, 2023

Hi @kronsbein @Ahmed-Emad10 @Hansolo1103

I am delighted to hear that you are interested in contributing to the Jina AI community! 🎉

To get started, please take a moment to fill out our survey so that we can learn more about you and your skills.

Also, don't forget to mark your calendars for the GSoC x Jina AI webinar on March 23rd at 2 pm (CET). This is an excellent opportunity to learn more about the projects and ask any questions you have about the requirements and expectations.

Our mentors will provide an in-depth overview of the projects and answer any questions you may have. So please don't hesitate to ask any questions or seek clarification on any aspect of the project.

Is there anything specific you would like to learn from the webinar? Do you have any questions about the Make ANNLite the go-to Vector Search library to be scaled by Jina using the StatefulExecutor feature project that you would like to see clarified during the Q&A session? Let me know, and I'll be happy to help!

Looking forward to seeing you at the webinar, and thank you for your interest in the Jina AI community! 😊
