[Proposal] Vector Similarity Search Indexing #2287
Replies: 4 comments 2 replies
-
Thank you for providing excellent preliminary research and suggestions, which have given us direction to implement vector search. I also agree that we can first try to implement HNSW (including designing an efficient encoding for reducing HNSW index to rocksdb key-values). And in the later phase, we can introduce other indexes according to the situation. If anyone in the community are interested, welcome to join the discussion! cc @git-hulk @mapleFU @Yangsx-1 |
Beta Was this translation helpful? Give feedback.
-
DiskANN++:使用查询敏感度入口顶点对同构映射图索引进行高效的基于页面的搜索 |
Beta Was this translation helpful? Give feedback.
-
FreshDiskANN:一种用于流式相似性搜索的快速、准确的基于图的 ANN 索引 演示了简单的图更新规则如何导致 HNSW 和 NSG 等流行的基于图的算法在插入和删除流上的索引质量下降。 开发了 FreshVamana,这是第一个支持插入和删除的基于图的索引,并实证了其在长时间更新流中的稳定性。 系统将大部分图形索引存储在 SSD 上,仅将最新更新存储在内存中。为了支持这一点,设计了一种新颖的两遍 StreamingMerge 算法,该算法以一种非常高效的写入方式将内存中索引与 SSD 索引合并。合并过程的时间和空间复杂度与更改集成正比,从而可以使用比从头开始重建大型索引少一个数量级的计算和内存,在 RAM 有限的机器上更新大型十亿点索引。 设计了 FreshDiskANN 系统,其中包含一个覆盖大多数数据点的长期驻留 SSD 的索引,以及一个用于聚合最近更新的短期内存索引。FreshDiskANN 会定期在后台使用 StreamingMerge 算法将短期索引合并到长期索引中,以限制短期索引的内存占用,从而限制整个系统的内存占用。 FreshVamana 因为流行的基于图的算法在构图时采用非常激进的裁边策略来构建高度稀疏的图结构,所以当更新图时,图结构会变得稀疏,降低图的可导航性,导致了图索引质量下降。FreshVamana 采用了 Vamana 中的 RobustPrune 以构建更密集的图,确保了图的持续导航性和在多次修改后保持稳定的召回率的能力。 |
Beta Was this translation helpful? Give feedback.
-
DuckDB 新扩展:向量相似度搜索 |
Beta Was this translation helpful? Give feedback.
-
Kvrocks Vector Similarity Search Indexing Proposal
Background
Redis Vector Search[1] enables real-time indexing, updating, and querying of vectors using two methods: FLAT, which performs brute-force indexing, and HNSW (Hierarchical Navigable Small World) graphs[2].
With the development of the Search module in KVrocks, integrating vector indexing capabilities will empower users to conduct vector similarity searches using KVrocks, supporting real-time processing and efficient large-scale vector data management.
This proposal will explore potential implementations to vector similarity search that prioritize disk access patterns.
Potential Indexing Solutions
HNSW
Algorithm[3][6]
Pros
Cons
Vamana(diskANN) indexing
Algorithm[4][7]
Pros[5]
Cons
Related Work and Impl
IVFFlat
Algorithm[11]
Pros
Cons
In a short, I think HNSW should be implemented first as it’s more compatible with Redis protocol and well-proven in many existing frameworks. We could consider supporting Vamana for future use cases involving static datasets. Additionally, the inclusion of IVFFlat should be evaluated, particularly for scenarios where index size and build time are critical, even though it may require more frequent rebuilding with data updates. To further improve HNSW, we could try HNSW + PQ [9] or SPANN [8], which, in a high level, clusters vectors first and then performs a more fine-grained search within the closest clusters. However, the first milestone is to successfully implement HNSW.
Similar Discussion
apache/lucene#12615
Appendix
ANN benchmarking tool: https://ann-benchmarks.com/glove-100-angular_10_angular.html
References
[1] Redis Vector Database: https://redis.io/docs/latest/develop/get-started/vector-database/
[2] Redis Search Reference: https://redis.io/docs/latest/develop/interact/search-and-query/advanced-concepts/vectors/
[3] Write You a Vector Database: https://skyzh.github.io/write-you-a-vector-db/cpp-06-01-nsw.html
[4] Zilliz Engineering Blog**.** DiskANN: A Disk-based ANNS Solution with High Recall and High QPS on Billion-scale Dataset: https://zilliz.com/blog/diskann-a-disk-based-anns-solution-with-high-recall-and-high-qps-on-billion-scale-dataset
[5] Vamana vs. HNSW - Exploring ANN algorithms Part 1: https://weaviate.io/blog/ann-algorithms-vamana-vs-hnsw
[6] Hierarchical Navigable Small Worlds (HNSW):https://www.pinecone.io/learn/series/faiss/hnsw/
[7] DiskANN and the Vamana Algorithm: https://zilliz.com/learn/DiskANN-and-the-Vamana-Algorithm
[8] SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search:https://arxiv.org/pdf/2111.08566.pdf
[9] HNSW+PQ - Exploring ANN algorithms: https://weaviate.io/blog/ann-algorithms-hnsw-pq
[10] Vector Indexes in Postgres using pgvector: IVFFlat vs HNSW: https://tembo.io/blog/vector-indexes-in-pgvector
[11] Everything You Need to Know about Vector Index Basics: https://zilliz.com/learn/vector-index
I'm new to vector database field, any corrections, thoughts and/or insights are welcome!
(p.s. The encoding part is not in the scope of this proposal, but there will be a new post talking about index encoding once the indexing method is determined here. )
Beta Was this translation helpful? Give feedback.
All reactions