This feature is growing in popularity with LLMs, where many "term" embeddings may come from the same "document", but we want to make sure the set of resulting "document" indices returned for a neighborhood query is unique for each query vector. At the moment, many systems seem to implement this at a higher layer as a de-duplication or filtering step. We should try to support such a feature as generally as possible. The ultimate goal is to apply the constraint, potentially in the k-selection step, if it can be done efficiently.
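For reference, the higher-layer de-duplication step mentioned above typically looks something like the following sketch: over-fetch `k' > k` raw neighbors, map each vector index back to its document, and keep only the closest hit per document. All names here (`dedup_by_document`, `vec_to_doc`) are illustrative, not an existing API.

```python
def dedup_by_document(neighbor_ids, distances, vec_to_doc, k):
    """Return up to k (doc_id, distance) pairs with unique doc ids.

    neighbor_ids / distances are assumed sorted by ascending distance,
    as returned by a typical ANN search for a single query vector.
    vec_to_doc maps each index-vector id to its document id.
    """
    seen = set()
    out = []
    for vec_id, dist in zip(neighbor_ids, distances):
        doc = vec_to_doc[vec_id]
        if doc not in seen:          # first (closest) hit per doc wins
            seen.add(doc)
            out.append((doc, dist))
            if len(out) == k:
                break
    return out
```

This works, but it requires over-fetching and an extra pass per query, which is part of the motivation for pushing the constraint down into the search itself.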
Thinking through this a bit further, we have had an idea to support returning document ids as the k closest neighbors, along with the corresponding distances. This method involves using a pre-filtering function and a global atomic, maintaining the following two arrays in global memory:
An array of size n_index_vectors, which maps each vector index (key) to its doc id (value).
An array of size n_documents, which maintains the running sum of distances for each document.
The idea here is that we iterate through potential closest neighbors, and each time the pre-filtering predicate function is invoked, we atomically add the distance to the entry for the corresponding doc id. This is a fairly naive approach, which leads to increased random writes and atomics, but I think we could find ways to reduce these with priors, such as using a filtering threshold and ignoring distances beyond it.
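To make the two-array scheme concrete, here is a minimal sequential sketch of the accumulation step for one query. In the actual GPU implementation the two arrays would live in global memory and the `+=` would be an `atomicAdd`; the function name, the per-document count array, and the threshold default are assumptions added for illustration.

```python
def accumulate_doc_distances(candidates, vec_to_doc, n_documents,
                             threshold=float("inf")):
    """Accumulate per-document distance sums for one query.

    candidates: iterable of (vector_id, distance) pairs produced while
    scanning potential closest neighbors.
    vec_to_doc: array of size n_index_vectors mapping vector id -> doc id.
    """
    doc_sums = [0.0] * n_documents   # array of size n_documents
    doc_counts = [0] * n_documents   # optional: hits per doc, e.g. for averaging
    for vec_id, dist in candidates:
        # pre-filtering predicate: skip candidates beyond the threshold
        # to reduce random writes and atomic traffic
        if dist > threshold:
            continue
        doc = vec_to_doc[vec_id]     # lookup in the size-n_index_vectors array
        doc_sums[doc] += dist        # atomicAdd on the GPU
        doc_counts[doc] += 1
    return doc_sums, doc_counts
```

A final k-selection over `doc_sums` (restricted to documents with nonzero counts) would then yield the k closest unique documents.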
We should continue to find better ways to implement this, but this approach could yield reasonable results initially as something to build on.