Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support multi-valued keys in all indexes #430

Open
cjnolet opened this issue Oct 28, 2024 · 0 comments
Open

[FEA] Support multi-valued keys in all indexes #430

cjnolet opened this issue Oct 28, 2024 · 0 comments
Labels
feature request New feature or request

Comments

@cjnolet
Copy link
Member

cjnolet commented Oct 28, 2024

This feature is growing in popularity of LLMs where many "term" embeddings may come from the same "document" but we want to make sure the set of resulting "document" indices returned for a neighborhood query are unique for each query vector. At the moment, it seems this is being implemented at a higher layer in many systems as a de-duplication or filtering step. We should try and support such a feature as generally as possible. The ultimate goal being to apply a constraint, potentially in the k-selection step, if it can be done efficiently.

Thinking throught his a bit further, we have had an idea to support returning document ids as the k-closest neighbors, along with the corresponding distances. This method involves using a pre-filtering function and a global atomic, maintaining the following 2 arrays in global memory:

  1. An array of n_index_vectors, which maps each vector (value) to its doc id (key).
  2. An array of size n_documents, which maintains the ongoing sum of distances

The idea here is that we go through potential closest neighbors and each time the pre-filtering predicate function is invoked, we atomically add the distance to the corresponding doc id. This is a fairly naive approach, which leads to increased random writes and atomics, but I think we could find ways to reduce these with priors such as using a filtering threshold and ignoring distances beyond that threshold.

We should continue to find better ways to implement this, but this approach could yield reasonable results initially as something to build on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
Status: In Progress
Development

No branches or pull requests

1 participant