First version of smart search #61
Conversation
pipe_to_similarity = {}
for pipe in self.pipelines:
    pipe_representative = pipe.sink.get_representative_vector()
This seems good. In the future, the representative vector could be extended to also use the pipeline's description, for cases where it is not possible to compute one from the sink, or where you would prefer to use the description instead.
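A minimal sketch of what that fallback could look like. The function name, the mean-of-sink-vectors choice, and the `embed_fn` hook are all illustrative assumptions, not the actual Neum AI API:

```python
import numpy as np

def get_representative_vector(sink_vectors, description=None, embed_fn=None):
    """Return a representative vector for a sink.

    Sketch only: uses the mean of the sink's stored vectors, and falls
    back to embedding the pipeline description when the sink is empty.
    """
    if sink_vectors:
        return np.mean(np.asarray(sink_vectors, dtype=float), axis=0)
    if description is not None and embed_fn is not None:
        # Fallback: embed the human-written description instead.
        return np.asarray(embed_fn(description), dtype=float)
    raise ValueError("no vectors in sink and no description fallback available")
```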
# Currently, we are only selecting the single pipeline whose
# representative is most similar to the query.
pipe_to_similarity = dict(sorted(pipe_to_similarity.items(), key=lambda x: x[1], reverse=True)[:1])
We can probably make the number of pipelines to use configurable. I assume in most cases you'd want to pick the highest-scoring one. It might also be worth making it configurable to use any pipeline with a similarity score above some threshold X.
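A sketch of a selector supporting both modalities; the function and parameter names (`top_n`, `min_similarity`) are hypothetical, not part of the PR:

```python
def select_pipelines(pipe_to_similarity, top_n=1, min_similarity=None):
    """Pick which pipelines to search.

    Illustrative sketch: `top_n` keeps the N highest-scoring pipelines;
    `min_similarity`, when set, instead keeps every pipeline whose score
    clears the threshold.
    """
    ranked = sorted(pipe_to_similarity.items(), key=lambda kv: kv[1], reverse=True)
    if min_similarity is not None:
        return {pipe: score for pipe, score in ranked if score >= min_similarity}
    return dict(ranked[:top_n])
```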
The more I think about it, the more I think making it configurable to use similarities above a certain threshold is best.
@ddematheu Yep, sounds good. But deciding that threshold is difficult: even the most similar pipe can end up with a low-sounding score like 0.4. We need some way to derive the threshold from the computed similarities themselves.
100% agree with this. Maybe keeping the top result to start is fine. My thinking is that we could provide different modalities: one where we pass results for the N highest-scoring pipelines (default N=1), and another, threshold-based, where we pass results for pipelines whose similarity is higher than N (default N=0.5 or whatever).
One additional modality might be to look at the distribution, where N is the maximum distance or percentage away from the highest-scoring pipeline.
Btw, these are just thoughts; happy to commit to starting with the highest and adding other modalities based on feedback.
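The distribution modality could be sketched like this. The `margin` parameter (a fraction away from the best score) and its 0.2 default are arbitrary illustrations:

```python
def select_within_margin(pipe_to_similarity, margin=0.2):
    """Keep every pipeline scoring within `margin` (as a fraction) of the
    highest-scoring pipeline. Sketch of the 'distribution' idea above."""
    if not pipe_to_similarity:
        return {}
    best = max(pipe_to_similarity.values())
    cutoff = best * (1 - margin)
    return {pipe: score for pipe, score in pipe_to_similarity.items() if score >= cutoff}
```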
@ddematheu These are interesting ideas, especially the distribution one. It sounds similar to (though not exactly the same as) p-sampling in language models; maybe we can use that idea, e.g. considering pipelines, in descending order of similarity, until their similarities add up to at least some value. But yes, it comes down to the user feedback you get.
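The nucleus-sampling-style idea could look like the following. Normalizing raw similarity scores into a probability-like mass is an assumption made for illustration:

```python
def select_cumulative(pipe_to_similarity, mass=0.8):
    """Walk pipelines in descending similarity and keep them until their
    normalized scores sum to at least `mass` (p-sampling-like sketch)."""
    total = sum(pipe_to_similarity.values())
    if total == 0:
        return {}
    selected, cumulative = {}, 0.0
    for pipe, score in sorted(pipe_to_similarity.items(), key=lambda kv: kv[1], reverse=True):
        selected[pipe] = score
        cumulative += score / total
        if cumulative >= mass:
            break
    return selected
```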
neumai-tools/neumai_tools/PipelineCollection/PipelineCollection.py
@ddematheu one issue still remains here, when we use the
Open to discussion on this.
In the case where we use the old one, it still needs to be stored somewhere. Initially it might just be locally, but I'm wondering whether we could store it back into the vector DB (with the caveat that we would need to be careful not to return it in a normal search; it also feels a bit hacky, but it means there is no additional storage requirement). If we store it back into the sink, we can also check whether a sink already has a computed vector and skip recalculating. Recomputing with just the newly added vectors makes sense.
@ddematheu Can't we just store it as a property? Anyway, we have a limited number of sinks, so will it really become a problem to store the representatives?
Yeah, it can be stored as a property for simplicity, but that means the dev needs to keep track of it (they might need to store it in their own database or something). I think this is fine initially.
@sky-2002 let me know if you think it's ready to merge. Overall there are still a couple of high-level questions, but I think it is at a working point.
@ddematheu I think we can merge it. Just to remind, currently it only supports
Adds an initial basic version of smart search.

Description - As per the discussion in issue #43, we need a method to route queries to the appropriate/relevant collection from the `PipelineCollection` interface. Using collection representatives and their similarity with the query, we decide which collection/pipe to search in.

Approach - For each pipe in the `PipelineCollection`, compute the representative vector of its respective sink. Then use the respective pipe's embed method to embed the query and compute its similarity with each pipe's representative. This gives a similarity score for each pipe. Then search in descending order of similarity to get the most relevant results.

Note - Currently supported on `Marqo` and `LanceDB` sinks. Support for other sinks is coming soon.

Demo - I have prepared a notebook which I have tested on my local system. I created a custom embedding class in it to test quickly. Make sure to spin up your Marqo instance before running the notebook.

Looking for review and feedback.
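The routing step described in the approach can be sketched end to end with plain cosine similarity. This is a toy example with hand-made vectors; the function names here are illustrative and are not the Neum AI interface:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def route_query(query_vec, representatives):
    """Return pipeline names in descending order of similarity between the
    query embedding and each pipeline's representative vector."""
    scores = {name: cosine(query_vec, rep) for name, rep in representatives.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Searching the pipelines in the returned order mirrors the "descending order of similarity" behaviour the PR describes.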