This feature is growing in popularity with LLMs, where many "term" embeddings may come from the same "document", but we want to make sure the set of resulting "document" indices returned for a neighborhood query is unique for each query vector. At the moment, many systems seem to implement this at a higher layer as a de-duplication or filtering step. We should try to support such a feature as generally as possible. The ultimate goal is to apply the constraint, potentially in the k-selection step, if it can be done efficiently.
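For reference, the higher-layer de-duplication step mentioned above typically looks something like the following sketch: over-fetch `k' > k` raw neighbors, map each vector index back to its document, and keep only the closest hit per document. All names here (`dedup_by_document`, `vec_to_doc`) are illustrative, not an existing API.

```python
def dedup_by_document(neighbor_ids, distances, vec_to_doc, k):
    """Return up to k (doc_id, distance) pairs with unique doc ids.

    neighbor_ids / distances are assumed sorted by ascending distance,
    as returned by a typical ANN search for a single query vector.
    vec_to_doc maps each index-vector id to its document id.
    """
    seen = set()
    out = []
    for vec_id, dist in zip(neighbor_ids, distances):
        doc = vec_to_doc[vec_id]
        if doc not in seen:          # first (closest) hit per doc wins
            seen.add(doc)
            out.append((doc, dist))
            if len(out) == k:
                break
    return out
```

This works, but it requires over-fetching and an extra pass per query, which is part of the motivation for pushing the constraint down into the search itself.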
Thinking through this a bit further, we have had an idea to support returning document ids as the k closest neighbors, along with the corresponding distances. This method involves using a pre-filtering function and a global atomic, maintaining the following two arrays in global memory:
An array of size n_index_vectors, which maps each vector index (key) to its doc id (value).
An array of size n_documents, which maintains the running sum of distances for each document.
The idea here is that we iterate through potential closest neighbors, and each time the pre-filtering predicate function is invoked, we atomically add the distance to the entry for the corresponding doc id. This is a fairly naive approach, which leads to increased random writes and atomics, but I think we could find ways to reduce these with priors, such as using a filtering threshold and ignoring distances beyond it.
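To make the two-array scheme concrete, here is a minimal sequential sketch of the accumulation step for one query. In the actual GPU implementation the two arrays would live in global memory and the `+=` would be an `atomicAdd`; the function name, the per-document count array, and the threshold default are assumptions added for illustration.

```python
def accumulate_doc_distances(candidates, vec_to_doc, n_documents,
                             threshold=float("inf")):
    """Accumulate per-document distance sums for one query.

    candidates: iterable of (vector_id, distance) pairs produced while
    scanning potential closest neighbors.
    vec_to_doc: array of size n_index_vectors mapping vector id -> doc id.
    """
    doc_sums = [0.0] * n_documents   # array of size n_documents
    doc_counts = [0] * n_documents   # optional: hits per doc, e.g. for averaging
    for vec_id, dist in candidates:
        # pre-filtering predicate: skip candidates beyond the threshold
        # to reduce random writes and atomic traffic
        if dist > threshold:
            continue
        doc = vec_to_doc[vec_id]     # lookup in the size-n_index_vectors array
        doc_sums[doc] += dist        # atomicAdd on the GPU
        doc_counts[doc] += 1
    return doc_sums, doc_counts
```

A final k-selection over `doc_sums` (restricted to documents with nonzero counts) would then yield the k closest unique documents.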
We should continue to find better ways to implement this, but this approach could yield reasonable results initially as something to build on.