[Proposal] Vector Similarity Search Indexing #2287

Beihao-Zhou · 2024-05-02T17:31:52Z

Beihao-Zhou
May 2, 2024
Collaborator

Kvrocks Vector Similarity Search Indexing Proposal

Background

Redis Vector Search[1] enables real-time indexing, updating, and querying of vectors using two methods: FLAT, which performs brute-force indexing, and HNSW (Hierarchical Navigable Small World) graphs[2].

With the development of the Search module in Kvrocks, integrating vector indexing capabilities will let users run vector similarity searches in Kvrocks, with real-time processing and efficient management of large-scale vector data.

This proposal explores potential implementations of vector similarity search that prioritize disk access patterns.

Potential Indexing Solutions

HNSW

Algorithm[3][6]

Build a hierarchy of layers to speed up the traversal of the nearest neighbor graph.
In this graph, the top layers contain only long-range edges.
The deeper the search traverses through the hierarchy, the shorter the distance between vectors captured in the edges.
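
The layered descent described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not a proposed Kvrocks design: the graph is hand-built (no insertion logic), each layer uses a plain greedy walk instead of the candidate-list search that HNSW runs on the bottom layer, and the names `greedy_search`, `hnsw_search`, and `layers` are hypothetical.

```python
import math

def greedy_search(graph, vectors, entry, query):
    """Move to a neighbor closer to the query until no neighbor improves."""
    current = entry
    improved = True
    while improved:
        improved = False
        for neighbor in graph.get(current, ()):
            if math.dist(vectors[neighbor], query) < math.dist(vectors[current], query):
                current = neighbor
                improved = True
    return current

def hnsw_search(layers, vectors, entry, query):
    """layers[0] is the top (sparsest) layer; the last layer holds every node."""
    current = entry
    for graph in layers:
        # The result of one layer becomes the entry point of the next one.
        current = greedy_search(graph, vectors, current, query)
    return current

# Toy data: six points on a line, ids 0..5 (illustrative only).
vectors = {i: (float(i),) for i in range(6)}
layers = [
    {0: [5], 5: [0]},                # top layer: one long-range edge
    {0: [3], 3: [0, 5], 5: [3]},     # middle layer: medium-range edges
    {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)},  # bottom: short edges
]

nearest = hnsw_search(layers, vectors, entry=0, query=(3.8,))
print(nearest)
```

In this toy graph the search jumps from node 0 to node 5 on the top layer, steps back to node 3 on the middle layer, and settles on node 4 on the bottom layer, so the upper layers do the coarse routing and the bottom layer does the fine adjustment.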

Pros

Compatible with the original Redis protocol
Real-time insertion
A well-proven, performant vector indexing algorithm used in many frameworks
- On-disk HNSW index for Postgres with pg_embedding
- Faiss HNSW impl

Cons

Requires additional indexing layers or metadata to manage disk-based graph traversal, which increases the number of disk round trips and the amount of metadata stored.

Vamana (DiskANN) indexing

Algorithm[4][7]

Build a random graph.
Optimize the graph, so it only connects vectors close to each other.
Modify the graph by removing some short connections and adding some long-range edges to speed up the traversal of the graph.
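
The pruning step can be illustrated with a minimal sketch, assuming the RobustPrune rule from the DiskANN paper: repeatedly keep the closest remaining candidate, then drop every candidate that the kept neighbor already covers. The function name `robust_prune`, its parameters, and the toy data are illustrative assumptions, not an existing API. A larger `alpha` prunes less aggressively, so more long-range edges survive.

```python
import math

def robust_prune(point, candidates, vectors, alpha, max_degree):
    """Choose up to max_degree out-neighbors for `point` from `candidates`."""
    remaining = set(candidates) - {point}
    neighbors = []
    while remaining and len(neighbors) < max_degree:
        # Keep the candidate closest to `point`.
        closest = min(remaining, key=lambda c: math.dist(vectors[point], vectors[c]))
        neighbors.append(closest)
        # Drop candidates that are reachable "cheaply enough" through `closest`:
        # c is removed when alpha * d(closest, c) <= d(point, c).
        remaining = {
            c for c in remaining
            if alpha * math.dist(vectors[closest], vectors[c])
            > math.dist(vectors[point], vectors[c])
        }
    return neighbors

# Toy data (illustrative only): node 2 lies behind node 1, node 3 is far away
# in a different direction.
vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (0.0, 5.0)}
pruned = robust_prune(0, [1, 2, 3], vectors, alpha=1.2, max_degree=2)
print(pruned)
```

Here node 2 is dropped because node 1 already leads towards it, while the far-away node 3 is kept as a long-range edge.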

Pros[5]

Minimizes the footprint of each index and reduces redundancy.
Designed with disk-based systems in mind, reducing the number of disk seeks during queries.

Cons

Static nature: The initial design and common implementations of DiskANN are generally static. Once the index is built, it is not designed to incorporate new data points dynamically. Potentially, we could prune newly inserted data points; however, I found no research or blog posts that actually implemented and benchmarked this.
Parameter mismatch: Redis explicitly supports HNSW, and the parameters for Vamana differ from those of HNSW, although there may be a corresponding mapping between the parameters of the two models.

Related Work and Impl

IVFFlat

Algorithm[11]

IVFFlat divides vectors into multiple lists based on a number of computed centroids, forming clusters around these centroids.
Each list corresponds to a cluster and contains vectors close to that centroid.
During search, instead of comparing to all vectors, the algorithm narrows down to subsets of lists based on the proximity of their centroids to the query vector.
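
The build and search steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the centroids are taken as given (in practice they would come from a clustering step such as k-means), and the names `build_ivfflat`, `search_ivfflat`, and `nprobe` (the number of lists scanned per query) are hypothetical.

```python
import math

def nearest_centroids(centroids, vector, count):
    """Return the indexes of the `count` centroids closest to `vector`."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(centroids[i], vector))
    return order[:count]

def build_ivfflat(centroids, vectors):
    """Assign every vector id to the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for vec_id, vec in vectors.items():
        lists[nearest_centroids(centroids, vec, 1)[0]].append(vec_id)
    return lists

def search_ivfflat(centroids, lists, vectors, query, k, nprobe):
    """Scan only the `nprobe` lists whose centroids are closest to the query."""
    candidates = [
        vec_id
        for list_index in nearest_centroids(centroids, query, nprobe)
        for vec_id in lists[list_index]
    ]
    return sorted(candidates, key=lambda v: math.dist(vectors[v], query))[:k]

# Toy data (illustrative only): two clusters with precomputed centroids.
centroids = [(0.0, 0.0), (10.0, 10.0)]
vectors = {"a": (1.0, 0.0), "b": (0.0, 2.0), "c": (9.0, 10.0), "d": (11.0, 9.0)}

lists = build_ivfflat(centroids, vectors)
top = search_ivfflat(centroids, lists, vectors, query=(8.0, 9.0), k=2, nprobe=1)
print(lists, top)
```

With `nprobe=1` only the second list is scanned, so "a" and "b" are never compared against the query. Raising `nprobe` trades speed for recall.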

Pros

Limits search to relevant clusters, reducing the number of distance calculations.
Because vectors are simply grouped into lists and the index has little space overhead compared with a graph index, this can potentially reduce storage needs.

Cons

Changes in Recall upon Updates: Significant impact on recall if vectors are added or modified, as it might require recalculating centroids and redistributing vectors.
Potential Need for Re-indexing: Regular updates or additions may necessitate frequent re-indexing to maintain efficiency and accuracy.

💡 Comparative Analysis with HNSW [10]

Robustness to Updates: HNSW handles updates and modifications with minimal impact on recall.
Index Size: IVFFlat has a smaller storage footprint.
Query Speed: HNSW is substantially faster in terms of queries per second.
Build Time: IVFFlat is significantly faster to build compared to HNSW.

In short, I think HNSW should be implemented first, as it is more compatible with the Redis protocol and well proven in many existing frameworks. We could consider supporting Vamana for future use cases involving static datasets. Additionally, the inclusion of IVFFlat should be evaluated, particularly for scenarios where index size and build time are critical, even though it may require more frequent rebuilding as data is updated. To further improve HNSW, we could try HNSW + PQ [9] or SPANN [8], which, at a high level, clusters vectors first and then performs a more fine-grained search within the closest clusters. However, the first milestone is to successfully implement HNSW.

Similar Discussion

apache/lucene#12615

Appendix

ANN benchmarking tool: https://ann-benchmarks.com/glove-100-angular_10_angular.html

References

[1] Redis Vector Database: https://redis.io/docs/latest/develop/get-started/vector-database/

[2] Redis Search Reference: https://redis.io/docs/latest/develop/interact/search-and-query/advanced-concepts/vectors/

[3] Write You a Vector Database: https://skyzh.github.io/write-you-a-vector-db/cpp-06-01-nsw.html

[4] Zilliz Engineering Blog. DiskANN: A Disk-based ANNS Solution with High Recall and High QPS on Billion-scale Dataset: https://zilliz.com/blog/diskann-a-disk-based-anns-solution-with-high-recall-and-high-qps-on-billion-scale-dataset

[5] Vamana vs. HNSW - Exploring ANN algorithms Part 1: https://weaviate.io/blog/ann-algorithms-vamana-vs-hnsw

[6] Hierarchical Navigable Small Worlds (HNSW): https://www.pinecone.io/learn/series/faiss/hnsw/

[7] DiskANN and the Vamana Algorithm: https://zilliz.com/learn/DiskANN-and-the-Vamana-Algorithm

[8] SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search: https://arxiv.org/pdf/2111.08566.pdf

[9] HNSW+PQ - Exploring ANN algorithms: https://weaviate.io/blog/ann-algorithms-hnsw-pq

[10] Vector Indexes in Postgres using pgvector: IVFFlat vs HNSW: https://tembo.io/blog/vector-indexes-in-pgvector

[11] Everything You Need to Know about Vector Index Basics: https://zilliz.com/learn/vector-index

I'm new to the vector database field, so any corrections, thoughts, and/or insights are welcome!
(P.S. The encoding part is not in the scope of this proposal, but there will be a new post about index encoding once the indexing method is determined here.)

PragmaTwice · 2024-05-04T03:01:29Z

PragmaTwice
May 4, 2024
Collaborator

Thank you for providing excellent preliminary research and suggestions, which have given us direction to implement vector search.

I also agree that we can first try to implement HNSW (including designing an efficient encoding that maps the HNSW index to RocksDB key-values). In a later phase, we can introduce other indexes as the situation requires.

If anyone in the community is interested, you are welcome to join the discussion! cc @git-hulk @mapleFU @Yangsx-1

2 replies

git-hulk May 4, 2024
Collaborator

Very excited to see this proposal. I also think it will make many users happy even if it only supports HNSW.

Beihao-Zhou May 5, 2024
Collaborator Author

Sounds good!! Then I'll come up with the design for the HNSW encoding soon. I look forward to all your feedback!

i18nsite · 2024-06-04T03:12:51Z

i18nsite
Jun 4, 2024

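DiskANN++: Efficient Page-based Search over Isomorphic Mapped Graph Index using Query-sensitivity Entry Vertex
https://www.semanticscholar.org/paper/DiskANN%2B%2B%3A-Efficient-Page-based-Search-over-Mapped-Ni-Xu/dadc18320a7dea60ec8fe6dfd3595943c78952e2
Given a vector dataset $\mathcal{X}$ and a query vector $\vec{x}_q$, graph-based approximate nearest neighbor search (ANNS) aims to build a graph index $G$ and approximately return the vectors with minimum distance to $\vec{x}_q$ by searching over $G$. The main drawback of graph-based ANNS is that the graph index is too large to fit into memory, especially for a large-scale $\mathcal{X}$. To solve this, a product quantization (PQ)-based hybrid method called DiskANN was proposed, which stores a low-dimensional PQ index in memory and keeps the graph index on SSD, reducing memory overhead while ensuring high search accuracy. However, it suffers from two I/O issues that significantly affect overall efficiency: (1) a long routing path from the entry vertex to the query's neighborhood, which results in a large number of I/O requests, and (2) redundant I/O requests during the routing process. We propose an optimized DiskANN++ to overcome these issues. Specifically, for the first issue, we present a query-sensitive entry vertex selection strategy that replaces DiskANN's static graph-central entry vertex with a dynamically determined entry vertex close to the query. For the second I/O issue, we present an isomorphic mapping on DiskANN's graph index to optimize the SSD layout, and propose an asynchronously optimized page search based on the optimized SSD layout as an alternative to DiskANN's beam search. Comprehensive experimental studies on eight real-world datasets demonstrate the efficiency advantage of DiskANN++. Under the same accuracy constraint, we achieve a notable 1.5x to 2.2x improvement in QPS compared with DiskANN.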

i18nsite · 2024-06-04T03:14:28Z

i18nsite
Jun 4, 2024

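FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search
https://www.modb.pro/db/1719906010751115264
The article presents the FreshDiskANN system, which addresses the fresh-ANNS problem in Euclidean space with real-time fresh data points and requires 5 to 10 times fewer machines than other state-of-the-art techniques. The article makes the following technical contributions:

It demonstrates how simple graph update rules cause the index quality of popular graph-based algorithms such as HNSW and NSG to degrade over a stream of insertions and deletions.

It develops FreshVamana, the first graph-based index that supports insertions and deletions, and empirically demonstrates its stability over long update streams.

The system stores most of the graph index on SSD and keeps only the most recent updates in memory. To support this, it designs a novel two-pass StreamingMerge algorithm that merges the in-memory index with the SSD index in a very write-efficient manner. The time and space complexity of the merge is proportional to the size of the change set, so a large billion-point index can be updated on a machine with limited RAM using an order of magnitude less compute and memory than rebuilding the index from scratch.

It designs the FreshDiskANN system, which consists of a long-term, SSD-resident index covering most data points and a short-term, in-memory index that aggregates recent updates. FreshDiskANN periodically merges the short-term index into the long-term index in the background using the StreamingMerge algorithm, which bounds the memory footprint of the short-term index and therefore of the whole system.

FreshVamana

Popular graph-based algorithms prune edges very aggressively at build time to construct highly sparse graphs, so when the graph is updated it becomes even sparser, which reduces its navigability and degrades index quality. FreshVamana adopts RobustPrune from Vamana to build a denser graph, which keeps the graph navigable and keeps recall stable after many modifications.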

i18nsite · 2024-06-04T03:15:34Z

i18nsite
Jun 4, 2024

New DuckDB extension: vector similarity search
https://www.modb.pro/db/1787290240796413952
