Implement Pipeline Collection smart search #43
Hey @ddematheu, can you elaborate on this? I would like to contribute to it.
Sure. At a high level, we have Pipelines, each with a description associated with it (https://github.com/NeumTry/NeumAI/blob/main/neumai/neumai/Pipelines/Pipeline.py). A pipeline represents a collection of data, since it has both a data source and a vector DB associated with it. We introduced PipelineCollection (https://github.com/NeumTry/NeumAI/blob/main/neumai-tools/neumai_tools/PipelineCollection/PipelineCollection.py) as an easy way to query multiple pipelines at the same time. E.g. I want to query data both from a user record in Postgres and general info from files in S3.

This is somewhat helpful, but the main piece of feedback we have heard is that people would prefer to dynamically decide which data collection to query based on the question. E.g. to check the status of a user I would query Postgres, whereas to get the information for a mortgage they are getting I would go to S3, where the mortgage document is stored.

With this in mind, I have stubbed out a search_routed method. The method is designed to take a collection of pipelines (1+) and use the description field to decide which one to query. For the decision, there are two approaches I have in mind:
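The routing idea described above can be sketched as follows. This is a hypothetical illustration, not the actual NeumAI API: embed each pipeline's description once, embed the incoming query the same way, and route to the pipeline whose description is most similar. A toy bag-of-words cosine similarity stands in for a real embedding model.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search_routed(pipelines: dict, query: str) -> str:
    # Pick the pipeline whose description best matches the query.
    scores = {name: cosine(embed(desc), embed(query))
              for name, desc in pipelines.items()}
    return max(scores, key=scores.get)

pipelines = {
    "postgres_users": "user records account status from postgres",
    "s3_mortgages": "mortgage documents and loan files stored in s3",
}
print(search_routed(pipelines, "what is the status of this user"))  # → postgres_users
```

With a real embedding model the description vectors would be pre-computed once per pipeline, so the only per-query cost is embedding the query and a handful of dot products.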
I was leaning towards #1 to start, given that it is more lightweight and will provide faster responses. But #2 might provide better quality.
@ddematheu Thanks, that made it clearer to me. I can also think of an approach like comparing the query with a pre-computed cluster center for each pipeline/sink, something that is representative of the data in the pipeline/sink, in addition to the pipeline description. So it is the same as your point 1, with the addition of these pipeline representatives. I thought of this because when the data in the pipeline changes, the representative would also update and stay relevant. Would that be useful?
That is actually a great idea. How are you thinking about calculating the center? Do some vector DBs provide it out of the box? Alternatively, it might be something that we calculate at ingestion time and update over time as new data is added.
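The simplest version of the "pipeline representative" idea is the element-wise mean of all vectors already ingested into a sink, so it tracks the data as it changes. A minimal sketch, with a hypothetical helper name rather than an existing neumai method:

```python
def compute_cluster_center(vectors: list[list[float]]) -> list[float]:
    # Element-wise mean of all stored vectors in a sink.
    if not vectors:
        raise ValueError("sink has no vectors yet")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

center = compute_cluster_center([[1.0, 0.0], [3.0, 2.0]])
print(center)  # [2.0, 1.0]
```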
@ddematheu I will have to see if each DB offers this; I will get back on that. Regarding implementation, as an initial idea, I had thought of it the way you mentioned:

*Approach*
- Every SinkConnector would have to define the methods compute_cluster_center and update_cluster_center.
- Whenever new data is added to a sink, it would trigger the update_cluster_center method.

*Doubts*
1. In each sink, a single data unit can have multiple fields, some or all of them vectorized, so which fields should be used for the cluster-center calculation?
2. I am not sure, but it might happen that a user has vectorized data in one sink with 512-dimensional embeddings and in another sink with 1024-dimensional embeddings; what do we do in that case?
3. What method should be used for the cluster-center calculation? Would simple averaging suffice?

*Discussion*
- I would love to know if there are more approaches to this. Please share if you come across any; I will also do some research on it.
- We can also explore simpler approaches instead of embedding similarity, because the semantic route raises multiple options (which model to use, what embedding size to consider, what the threshold should be, etc.) and would clutter the config. We can discuss this in detail.

Something that might be tricky is that after calculating / updating the center, it would need to be stored somewhere. Maybe it is just added into the sink with some metadata / IDs so it can be retrieved later.

Doubt #1: that should be transparent to the sink. Data sources are broken down and translated to individual vectors.

Doubt #2: the center would be calculated at the dimensionality of its sink (each pipeline has an embed model associated with it). For embedding the descriptions of the pipelines themselves, we would need to standardize.

Doubt #3: still thinking about this.

In general, my vibe is to start with the most straightforward method and "benchmark" the results before over-investing in a much more complicated approach.
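One way to keep the center fresh at ingestion time without recomputing over the whole sink is to store the current center plus a count, and fold each new batch in as a running mean. update_cluster_center here is the hypothetical hook name from the discussion above, not an existing SinkConnector method:

```python
def update_cluster_center(center: list[float], count: int,
                          new_vectors: list[list[float]]):
    # Incremental mean: new_center = (center*count + sum(new)) / (count + n)
    n = len(new_vectors)
    if n == 0:
        return center, count
    total = count + n
    new_center = [
        (c * count + sum(v[i] for v in new_vectors)) / total
        for i, c in enumerate(center)
    ]
    return new_center, total

center, count = [2.0, 1.0], 2          # e.g. from two previously ingested vectors
center, count = update_cluster_center(center, count, [[5.0, 4.0]])
print(center, count)  # [3.0, 2.0] 3
```

The (center, count) pair is small enough to store in the sink itself under a reserved metadata key / ID, matching the storage idea raised above.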
@ddematheu Okay, I will start with Marqo and try to implement a basic working version.
Sounds good. Feel free to open a PR and I can take a look and provide feedback.
@ddematheu How will the query be vectorized? In the separate search, each time we are using the respective pipeline's embed_query.
This is where it gets hard with the representative vector, as that vector will be determined by the embedding model used within each pipeline. Comparing those directly will be hard. Unless, for the comparison, we embed the query using the embed_query of each pipeline and then just compare the scores. So yeah, I think using embed_query makes sense.
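The direction settled on above can be sketched like this: embed the query with each pipeline's own embed model (so dimensions may differ per sink), score it in that pipeline's space, and route by comparing the scores. embed_query and representative here are illustrative stand-ins for the per-pipeline machinery, not the neumai API:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def route_by_score(pipelines, query):
    scores = {}
    for name, p in pipelines.items():
        qvec = p["embed_query"](query)               # pipeline-specific embedding
        scores[name] = cosine(qvec, p["representative"])
    return max(scores, key=scores.get)

# Toy 2-d and 3-d "models" to show that dimensions need not match across sinks.
pipelines = {
    "postgres_users": {
        "embed_query": lambda q: [q.count("user"), q.count("status")],
        "representative": [1.0, 1.0],
    },
    "s3_mortgages": {
        "embed_query": lambda q: [q.count("mortgage"), q.count("loan"), 1.0],
        "representative": [1.0, 1.0, 0.0],
    },
}
print(route_by_score(pipelines, "user status check"))  # → postgres_users
```

Note the caveat from the comment above still applies: cosine scores from different embedding models are not strictly calibrated against each other, so comparing them across pipelines is a heuristic.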
Okay then, I will go ahead with the embed_query approach.
@ddematheu I have implemented the first version of smart search, and it works well; I tested it using two data sources and two sinks. Excited for this feature and its further improvements!
Awesome, taking a look at the PR.
We currently support unified (results re-ranked into a single list) and separate (results for each pipeline returned separately) searches for a collection.
This adds smart search, which will do smart routing to identify which collections are worth searching based on the query, using the description of each pipeline to match it to the query.
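A minimal sketch of the two existing modes described above, assuming each pipeline's search returns (text, score) hits: "separate" keeps one ranked list per pipeline, while "unified" merges everything and re-ranks by score. Function and key names are illustrative only:

```python
def search_separate(results_by_pipeline):
    # One ranked hit list per pipeline, returned as-is.
    return results_by_pipeline

def search_unified(results_by_pipeline):
    # Flatten all pipelines' hits and re-rank into a single list by score.
    merged = [hit for hits in results_by_pipeline.values() for hit in hits]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)

results = {
    "postgres": [("user row", 0.9), ("audit row", 0.4)],
    "s3": [("mortgage doc", 0.7)],
}
print(search_unified(results))
```

Smart search would sit in front of either mode, first narrowing down which pipelines to query at all.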