Vectors in Search

Dice.com code implementing the ideas discussed in the talks listed under "Links to Talks" below.

This extends my earlier work on 'Conceptual Search', which can be found here - https://github.com/DiceTechJobs/ConceptualSearch (including slides and video links). In these talks, I present a number of different approaches for searching vectors at scale using an inverted index. The code here implements approaches to Approximate k-Nearest Neighbor Search, including:

  • LSH (using the Sim Hash)
  • K-Means Tree
  • Vector Thresholding

and shows how these ideas can be implemented and queried efficiently within an inverted index (a rough sketch of the SimHash token idea follows below).
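
A minimal sketch of the LSH (SimHash) idea, assuming each vector is reduced to a set of hash-bit tokens that a standard inverted index can store and query as ordinary terms. The function names, bit-token naming scheme and parameters here are illustrative assumptions, not the exact implementation in the `python` directory.

```python
# SimHash LSH sketch: vectors that agree on many hash bits are likely close
# in cosine distance, so their bit tokens overlap heavily in the index.
import numpy as np

def build_hyperplanes(num_bits, num_dims, seed=42):
    """Random hyperplanes, shared between indexing and querying."""
    rng = np.random.RandomState(seed)
    return rng.randn(num_bits, num_dims)

def simhash_tokens(vector, hyperplanes):
    """Map a dense vector to tokens like 'bit_17_1' for an inverted index."""
    bits = (hyperplanes @ vector) >= 0
    return ["bit_{}_{}".format(i, int(b)) for i, b in enumerate(bits)]

# Index time: store the tokens in a plain text field.
# Query time: build a large boolean OR over the same tokens; documents that
# match the most tokens (highest Hamming similarity) rank first.
planes = build_hyperplanes(num_bits=128, num_dims=300)
doc_vector = np.random.randn(300)            # e.g. an averaged word embedding
query = " OR ".join(simhash_tokens(doc_vector, planes))
```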

UPDATE: After talking with Trey Grainger and Erik Hatcher from LucidWorks, they recommended using term frequency in place of payloads for the solutions where I embed term weights into the index via a special payload-aware similarity function (which would then no longer be needed), since payloads incur a significant performance penalty. The challenge with this is negative weights: I assume it is not possible to encode negative term frequencies, but this can be worked around by emitting separate tokens for positively and negatively weighted terms, and making corresponding adjustments at query time (where negative boosts can be applied in Solr as needed).

Lucene Documentation: Lucene Delimited Term Frequency Filter
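
A minimal sketch of this workaround, assuming a field analyzed with the delimited term frequency filter and a `|` delimiter; the token suffix, scaling factor and field setup are illustrative assumptions rather than the exact configuration in this repository.

```python
# Encode term weights as term frequencies ("token|freq") instead of payloads.
# Negative weights get a separate "_neg"-suffixed token so the frequency stays
# a positive integer; a negative boost can then be applied to *_neg tokens at
# query time.

def encode_weights(term_weights, scale=10, delimiter="|"):
    """term_weights: dict of token -> float weight (may be negative)."""
    tokens = []
    for term, weight in term_weights.items():
        freq = max(1, int(round(abs(weight) * scale)))   # tf must be a positive int
        name = term if weight >= 0 else term + "_neg"    # separate token for negatives
        tokens.append("{}{}{}".format(name, delimiter, freq))
    return " ".join(tokens)

# e.g. {'java': 0.8, 'cobol': -0.3} -> "java|8 cobol_neg|3"
print(encode_weights({"java": 0.8, "cobol": -0.3}))
```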

There has also been a recent update to Lucene core that is applicable here and, at the time of writing, is soon to make its way into Elasticsearch: Block Max WAND. This produces a significant speed-up for large boolean OR queries where you don't need to know the exact number of results and just care about getting the top-N results as fast as possible. All of the approaches I discuss here generate relatively large OR queries, so this is very relevant. I have also read that the current implementation of minimum-should-match includes similar optimizations, so the same sort of performance gain may already be attainable with appropriate mm settings, something I was already experimenting with in my code.
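
For illustration, this is the kind of large boolean OR query with an mm setting that benefits from such optimizations; the collection name, field name and mm value below are assumptions, not values taken from this repository.

```python
# Send a large OR query of vector tokens to Solr, asking only for the top-N
# results and requiring a minimum fraction of the tokens to match.
import requests

tokens = ["bit_{}_1".format(i) for i in range(128)]        # e.g. SimHash bit tokens
params = {
    "defType": "edismax",
    "q": " ".join("vector_tokens:{}".format(t) for t in tokens),
    "q.op": "OR",
    "mm": "30%",          # only consider docs matching at least 30% of the tokens
    "rows": 10,           # we only care about the top-N results
    "fl": "id,score",
}
resp = requests.get("http://localhost:8983/solr/jobs/select", params=params)
print(resp.json()["response"]["docs"])
```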

Directory Structure

  • python
Code for implementing the k-means tree, LSH SimHash and vector thresholding algorithms, and for indexing and searching vectors in Solr using these techniques.
  • solr_plugins
    • Java code for implementing the custom similarity classes and payloadEdismax parser described in the talk.
  • solr_configs
XML snippets for configuring the Solr plugins from the 'solr_vectors_in_search_plugins' Java code.

Implementation Details

  • Solr Version - 7.5
  • Python Version - 3.x+ (3.5 used)

Links to Talks

Author

Simon Hughes (Chief Data Scientist, Dice.com)