[RFC] Introducing Loading/Writing Layer in Native KNN Engines #2033
@jmazanec15 |
Interim Progress Report
After PR #2139 was merged, two more PRs will follow shortly.
Once those two PRs are merged, the loading layer will be officially available in OpenSearch. |
PR for introducing a loading layer in NMSLIB |
[NMSLIB] Loading Time Comparison
The numbers below were measured through
Observation
Unlike FAISS, it took almost 81% more time when loading a system cached file.
Experiment
Index size : 30G
1. Baseline (Using fread)
2. Using Stream (4KB)
|
[NMSLIB] Performance Benchmark
Machine :
|
1. NMSLIB Loading Perf Issue Analysis
2. Performance Degradation In FAISS
After switching from direct file API usage to an abstract IO loading layer, additional overhead was introduced due to JNI calls and buffer copying.
In NMSLIB, we expected a similar level of performance regression as seen in FAISS. However, we're observing an 81% increase in load time when loading a 6GB vector index (baseline = 4.144 sec, modified = 7.503 sec).
3. Why is it more than twice as severe as in FAISS?
The key performance difference in index loading between FAISS and NMSLIB stems from their file formats. FAISS stores chunks of the neighbor list in a single location and loads them all at once. See the code below:
In NMSLIB, each neighbor list is stored individually, requiring O(N) reads, where N is the total number of vectors.
4. Solution 1. Patch in NMSLIB
We can patch NMSLIB to avoid making JNI calls for each vector element. The idea is to load data in bulk, then parse the neighbor lists from that buffer, rather than reading bytes individually. This approach would reduce the number of JNI calls to O(Index size / Buffer size). For example, with a 6GB vector index containing 1 million vectors and a 64KB buffer size, the required JNI calls would be reduced to O(6GB / 64KB) = 98,304, which is a significant improvement over 1 million calls, achieving nearly a 90% reduction in operations.
Result: Surprisingly, it is 8% faster than the baseline. (Note: I reindexed on a new single node, which is why the loading time differs from the one mentioned earlier in the issue.)
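The call-count estimate for Solution 1 can be checked with a few lines of arithmetic. This is only a sketch under the same simplifying assumptions as the text: one JNI call per buffer refill in the patched path, and one call per vector in the unpatched path. The function names are illustrative and do not come from NMSLIB or the k-NN plugin.

```cpp
#include <cassert>
#include <cstdint>

// Number of bulk reads (assumed: one JNI call each) needed to stream an index
// of `index_bytes` through a buffer of `buffer_bytes`, rounding up.
constexpr std::uint64_t bulk_read_calls(std::uint64_t index_bytes,
                                        std::uint64_t buffer_bytes) {
    return (index_bytes + buffer_bytes - 1) / buffer_bytes;
}

// Fractional reduction in call count relative to a per-element baseline.
constexpr double call_reduction(std::uint64_t baseline_calls,
                                std::uint64_t bulk_calls) {
    return 1.0 - static_cast<double>(bulk_calls) /
                     static_cast<double>(baseline_calls);
}
```

With a 6GB index and a 64KB buffer, `bulk_read_calls` gives 98,304 calls, and `call_reduction(1000000, 98304)` comes out at roughly 0.90, matching the "nearly 90%" figure.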
4.1 Pros
4.2 Cons
4.3. Patch in hnsw.cc
5. Solution 2. Disable Streaming When FSDirectory
Since we're deprecating NMSLIB in version 3.x, we can disable the loading layer in NMSLIB until then.
5.1. Pros :
5.2. Cons :
6. Solution 3. Live with it :)
Since we're deprecating NMSLIB in version 3.x, we can tolerate this issue in the short term.
7. Micro Tuning Results
|
We decided to go with Solution 1 for NMSLIB. |
Because the issue is linked to the PR, closing the PR also closes the issue, which we don't want. Thanks @dblock for removing the untriaged label. |
@0ctopus13prime just checking, what items are left for this feature to complete? |
@navneet1v |
Writing Layer Latency Analysis
1. Goal
This analysis report explores the potential impact of replacing the
2. Conclusion
After multiple rounds of benchmarks and a deep-dive impact analysis, I concluded that the writing layer’s contribution to latency overhead is minimal, accounting for at most 1% of the shard indexing time. While some sections of the benchmark results suggested potential performance degradation due to the change, a subsequent series of benchmarks indicated that those inconsistent numbers are likely noise. Therefore, I do not expect severe performance degradation from the writing layer.
3. Writing Layer’s Contribution in Shard Indexing Latency
What proportion does the writing layer contribute to the overall shard-level indexing process?
3.1. Test Set-up (For FAISS only)
3.1.1. Testing Environment
I created a standalone program to load prepared vector data and feed it to the JNIService, which manages the vector index as a whole. In my standalone testing, I used 1 million vectors, each with 128 dimensions (random-s-128-10m-euclidean.hdf5). In this program, I tested the following three scenarios:
The total indexing time was 5 minutes and 50 seconds, with at most 1 second spent on flushing the vector index (the final step). One second is the worst case I can imagine; the actual numbers I measured were below 1 second. Flushing therefore accounts for about 0.2% of the total time. This indicates that the majority of the time in vector indexing is spent constructing the in-memory vector index, while only 0.2% is allocated to I/O processing. Even though flushing takes more than twice as long after the introduction of the writing layer, its overall impact is marginal, resulting in a total time of 5 minutes and 51 seconds.
The resulting file size is 634 MB. The baseline took approximately 0.45 seconds to flush, while the streaming approach took 0.65 seconds. The hybrid approach was nearly identical to the baseline, taking 0.457 seconds. As observed, there was indeed an increase in flushing time from 0.45 seconds to 0.65 seconds. However, since both vector transfer and constructing the vector index account for 99.8% of the total time, I anticipate that the writing layer will have minimal impact on overall performance.
4. Benchmark Results
I excluded sections from the benchmark results according to the conditions below.
4.1. Faiss
4.1.1. Conclusion
From the results, I can see that cumulative indexing time and merging time are almost identical between the baseline and the writing layer. Only p100 latency is likely impacted by the writing layer.
4.1.2. Benchmark Results
Note that the values in the writing layer represent the average of five benchmark results.
4.2. NMSLIB
4.2.1. Conclusion
Surprisingly, the overall performance has improved compared to the baseline. I believe this enhancement is due to the use of 64 KB I/O buffering for writing, as opposed to the 4 KB buffer size used by
4.2.2. Benchmark Results
I applied the same filtering rule mentioned in Faiss to exclude specific sections.
5. Alternatives - Hybrid
Based on the results, I believe we can safely go with the writing layer. However, if we want to be more conservative, we can adopt the hybrid approach (see Case 3. Use hybrid approach). In this approach, we implement the
From the micro-performance testing, we can see that the time spent is nearly identical to the baseline (baseline = 0.45 sec, hybrid = 0.457 sec). One of the greatest advantages of this approach is that it allows us to have a single codebase for I/O operations using the writing layer, while also being simple and easy to implement. Additionally, it guarantees users an identical indexing experience. Overall, I don't see any downsides to this approach.
Appendix 1. Stand alone measuring vector indexing.
Main
HDF5 Data Dump
Case 1. Use stream
Case 2. Use
|
@navneet1v Please feel free to share your thoughts on it. Thank you! |
@0ctopus13prime thanks for sharing the results. I am aligned with going with the writing layer; we shouldn't build a hybrid approach. The minimal degradation we are seeing is just noise, which is very prominent with indexing in general. Thanks for sharing the detailed analysis. From my side I would say let's write the code. In a separate PR, please also include these micro-benchmarks in the repo so that they can be used later. |
Final Benchmark Environment (Cluster)
|
Faiss benchmark results conclusion (3 shards)
From the results, total cumulative indexing time is expected to increase by up to 2% (67.67 min -> 68.98 min), which means bulk indexing throughput may decrease by roughly 1.5% (7861 -> 7744).
Faiss benchmark details
|
NMSLIB benchmark conclusion (3 shards)
Unlike Faiss, a slight improvement in indexing-related metrics is expected for NMSLIB.
Benchmark details
|
It's merged into both 2.x and main branch! |
Introducing Loading Layer in Native KNN Engines
1. Goal
FAISS and Nmslib, two native engines, have been integral to delivering advanced vector search capabilities in OpenSearch. Alongside the official Lucene vector format, these engines have played a significant role in meeting the growing vector search needs of customers, especially in scenarios where Lucene alone might not suffice.
However, the tight coupling in the way vector indexes are loaded during searches has made it challenging for OpenSearch to scale as a vector search solution across various Directory implementations. As of this writing, OpenSearch only supports FSDirectory, limiting its compatibility with other network-based Directory implementations, such as those backed by S3 or NFS.
This document provides an overview of a solution designed to cut this dependency, making OpenSearch compatible with multiple Directory implementations. In the following sections, it will guide the audience through the importance of introducing an abstract loading layer within each native engine. This layer will enable transparent loading of vector indexes, regardless of the specific Directory implementation used.
Related Issues
2. Scope
In this document, we focus exclusively on two types of native engines: FAISS and Nmslib. Lucene vector search is not covered here, as it is already integrated with Directory implementations.
Among the native engines, we will delve deeper into FAISS, while providing only high-level conceptual sketches for Nmslib. The primary reason for this is that, unlike FAISS, Nmslib lacks a loading interface (e.g., FAISS’s IOReader). However, the approach in Nmslib will closely mirror the work in FAISS, where we first introduce a loading interface, then build a mediator that indirectly calls IndexInput to copy bytes upon it.
As we are still in the proposal phase, detailed performance impacts will be addressed in the next phase, after benchmarks have been conducted and real data analyzed. In this initial phase, our focus is solely on creating a scalable interface that allows OpenSearch to integrate with multiple Directory implementations, while keeping the native engines unchanged. We will not be modifying any of the assumptions made by the native engines at this stage. Although further optimizations could be achieved by adjusting these assumptions, that will be the subject of future discussions and is beyond the scope of this document.
For example, FAISS loads all data into physical memory before performing a search, and the same is true for Nmslib. Changing this behavior is out of scope for now and is left as a future improvement. Now, imagine a scenario where a user configures an S3-backed directory in OpenSearch. Due to the way FAISS operates, the S3-backed directory would need to download the requested vector index from S3, which could significantly worsen the p99 query time, as KNNWeight lazily loads the index (Code). As a result, query execution would be delayed until the entire vector index has been fully downloaded from S3.
3. Problem Definitions
3.1. Problem We Are Solving
To enable compatibility with various Directory implementations, we need to decouple from FSDirectory and make it extensible in OpenSearch.
The current implementation in OpenSearch assumes that a vector index exists in a regular file system (e.g., ext4) and passes the absolute path of the vector index to the underlying native engines.
For example, in FAISS, it ends up invoking the Index* read_index(const char* fname, int io_flags = 0) method, and in Nmslib, the void LoadIndex(const string& location) method is eventually called. In both cases, the native engine reads bytes and loads the entire index into physical memory. Although there appear to be better strategies for loading an index (lazy loading, mmap loading, etc.), as we aligned in 2. Scope, we will not attempt to alter their design philosophy.
FSDirectory is the only Directory supported in OpenSearch. Due to this tight coupling, many network-based implementations (such as S3-backed or NFS-based Directories) and potential future Directory implementations cannot be integrated with native vector search engines in OpenSearch.
3.2. [Optional] Lucene Directory Design
Let’s briefly review Lucene’s Directory design before we continue with the discussion. If you're already familiar with its design and functionalities, feel free to skip ahead to the next section — 4. Requirements
This overview is included here rather than in the appendix, as it provides essential background before delving into the proposal.
3.2.1. Directory
Directory represents a logical file system that stores a flat list of files (there is no sub-directory hierarchy) and lets users create, read, and delete files through its APIs. The underlying Directory implementation must not only support creation and deletion but also provide an IndexInput stream, enabling the caller to read content from a specific offset.
For now, think of IndexInput as a random access interface, which we will revisit in section 3.2.3. IndexInput.
Well-Known Directory Implementations
3.2.2. DataInput
DataInput provides sequential read APIs and serves as the base class for IndexInput, which will be covered in the next section. Each DataInput implementation is expected to internally track the last read offset. However, DataInput itself does not offer an API for updating this offset. To modify the offset, users must inherit from DataInput and define their own IndexInput class.
3.2.3. IndexInput
IndexInput inherits from DataInput and includes an API for resetting the offset, in addition to all the features provided by DataInput. The caller can use this API to update the internal offset. Once updated, all read operations in DataInput will start from the new reset offset.
Directory provides an IndexInput as the read stream. While loading a vector index into physical memory typically involves sequential reading and does not necessitate random access, we will use IndexInput for sequential reading, as it is the type returned by the Directory.
3.2.4. DataOutput
An abstract base class for performing write operations on bytes. This serves as the foundation for the IndexOutput class, which will be discussed in detail in the next section. Each DataOutput implementation must internally track the next byte offset for writing.
3.2.5. IndexOutput
IndexOutput inherits from the DataOutput class with extra getter methods that return the internal offset where the next byte will be written. It provides two APIs: 1. A basic getter method. 2. An aligned offset adjustment method.
Note that the aligned offset method appends dummy bytes to ensure the offset is a multiple of the given alignment.
For example, if the current offset is 121 and the required alignment is 8, the method will append 7 dummy bytes to adjust the offset to 128, which is a multiple of 8.
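The padding arithmetic in that example can be written down directly. This is a generic sketch of rounding an offset up to a multiple of an alignment; the function names are illustrative, the alignment is assumed to be positive, and this is not Lucene's implementation.

```cpp
#include <cassert>
#include <cstdint>

// Dummy bytes needed so that `offset` becomes a multiple of `alignment`.
// Assumes alignment > 0.
constexpr std::uint64_t padding_for(std::uint64_t offset, std::uint64_t alignment) {
    return (alignment - offset % alignment) % alignment;
}

// Offset after the padding has been appended.
constexpr std::uint64_t align_up(std::uint64_t offset, std::uint64_t alignment) {
    return offset + padding_for(offset, alignment);
}
```

For an offset of 121 and an alignment of 8, `padding_for` yields 7 and `align_up` yields 128; an offset that is already aligned gets no padding.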
4. Requirements
4.1. Functional Requirements
4.2. Non-functional Requirements.
5. Solution Proposal
5.1. High Level Overview
5.1.1. Loading Vector Index (FAISS only)
[Image: Image.jpg]
5.1.2. Constructing Vector Index (FAISS only)
[Image: Image.jpg]
5.2. Low Level Details - FAISS
5.2.1. [Reading] Vector Index Loading Low Level Details
5.2.1.1. Define C++ Mediator.
The mediator component is responsible for invoking the IndexInput instance to obtain bytes and then copying them into the specified memory location in C++ (e.g., performing a Java-to-C++ byte copy).
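A minimal sketch of that read path is shown below, with the JNI boundary replaced by a plain callback so the example stays self-contained. `ReadMediatorSketch`, `FetchFn`, and `make_memory_source` are hypothetical names introduced for illustration; the actual mediator calls into a Java IndexInput through JNI, which is not modeled here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for the Java side: fills `buf` with up to `max_bytes`
// bytes and returns how many were written (0 means end of input).
using FetchFn = std::function<std::size_t(std::uint8_t* buf, std::size_t max_bytes)>;

class ReadMediatorSketch {
public:
    ReadMediatorSketch(FetchFn fetch, std::size_t buffer_size)
        : fetch_(std::move(fetch)), buffer_(buffer_size) {}

    // Copies up to `n_bytes` into `dest`, one buffer-sized chunk per fetch.
    // Returns the number of bytes actually copied.
    std::size_t copy_bytes(std::size_t n_bytes, std::uint8_t* dest) {
        std::size_t copied = 0;
        while (copied < n_bytes) {
            const std::size_t want = std::min(buffer_.size(), n_bytes - copied);
            const std::size_t got = fetch_(buffer_.data(), want);
            ++fetch_calls_;
            if (got == 0) break;  // source exhausted
            std::memcpy(dest + copied, buffer_.data(), got);
            copied += got;
        }
        return copied;
    }

    std::size_t fetch_calls() const { return fetch_calls_; }

private:
    FetchFn fetch_;
    std::vector<std::uint8_t> buffer_;
    std::size_t fetch_calls_ = 0;
};

// In-memory source used only to exercise the sketch; `data` must outlive it.
inline FetchFn make_memory_source(const std::vector<std::uint8_t>& data) {
    auto pos = std::make_shared<std::size_t>(0);
    return [&data, pos](std::uint8_t* buf, std::size_t max_bytes) -> std::size_t {
        const std::size_t n = std::min(max_bytes, data.size() - *pos);
        std::memcpy(buf, data.data() + *pos, n);
        *pos += n;
        return n;
    };
}
```

The number of fetches grows with the requested size divided by the buffer size, which is why the buffer size is the main knob for trading memory against boundary crossings.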
5.2.1.2. NativeMemoryEntryContext.IndexEntryContext
IndexEntryContext contains essential information for loading the vector index. The current implementation only includes the logical index path, which it uses to construct a physical absolute path in the file system for access.
The constructor will be updated to include an additional parameter, Directory, which will serve as the source of IndexInput.
5.2.1.3. NativeMemoryLoadStrategy.IndexLoadStrategy
IndexLoadStrategy now falls back to the baseline approach if the given Directory is file-based. Otherwise, it allows the native engine to fetch bytes from IndexInput. This change helps prevent potential performance degradation due to JNI call overheads and redundant byte copying.
5.2.1.4. File Watcher Integration
Whenever NativeMemoryLoadStrategy delegates a task of loading an index to native engines, it attaches a monitor object to remove the corresponding entry from the cached map managed by NativeMemoryCacheManager when a vector file is removed from the Directory. (Code) This behavior will remain unchanged even after the current implementation is extended to pass IndexInput to native engines. The cached pair in the map will continue to be properly removed and cleaned up as before.
5.2.1.5. JNIService
JNIService serves as the entry point for interacting with the underlying native engines. Similar to how the current implementation passes the Java string value of the index path, it will now pass the reference to the provided IndexInput.
5.2.1.6. FaissService, Glue Component
The glue component is responsible for creating an adapter for [IOReader](https://github.com/facebookresearch/faiss/blob/924c24db23b00053fc1c49e67d8787f0a3460ceb/faiss/impl/io.h#L27) and passing it to the FAISS API.
It first creates a NativeEngineIndexInputMediator on the local stack, then wraps it with a FaissMediatorWrapper.
5.2.2. [Writing] Constructing Vector Index Low Level Details
By the time a vector index is requested to be written to underlying storage, the vector graph structure should already be properly trained and reside in memory. The low-level details below are intended to abstract away the IO processing logic, making the persistence of data in a file system seamless and transparent.
As a result, the vector transfer processing logic will remain unchanged, even after the proposed introduction of an intermediate layer in native engines.
5.2.2.1. Define IndexOutput wrapper
5.2.2.2. Define C++ mediator
To minimize frequent context switching between C++ and Java, the writer mediator first attempts to retain bytes in its buffer. Once the buffer is full, it copies the bytes (e.g. uint8_t[]) to a Java buffer (e.g. byte[]) and then triggers IndexOutputWithBuffer to flush the bytes via the underlying IndexOutput.
As a result, it needs double the memory copies compared to the baseline (copying bytes in C++ first, and then a second copy to the Java byte[]).
However, I don’t believe this will significantly impact performance, as glibc’s memcpy typically achieves throughput between 10-30GB/sec. In the worst case, this would likely add only a few seconds to the process of building a vector index.
Most of the performance degradation will come from Lucene’s IndexOutput implementation.
For the performance, please refer to 7.1. Analysis
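The buffer-then-flush behavior of the writer mediator can be sketched as follows. The Java side is replaced by a callback, and `WriteMediatorSketch` and `FlushFn` are hypothetical names used only for this illustration; the second copy into a Java byte[] would happen inside the callback in the real design.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for handing a filled buffer to the Java side.
using FlushFn = std::function<void(const std::uint8_t* buf, std::size_t n_bytes)>;

class WriteMediatorSketch {
public:
    // `buffer_size` is assumed to be greater than zero.
    WriteMediatorSketch(FlushFn flush, std::size_t buffer_size)
        : flush_(std::move(flush)), buffer_(buffer_size) {
        assert(buffer_size > 0);
    }

    // First copy: caller bytes into the native buffer. Flushes when it fills up.
    void write(const std::uint8_t* src, std::size_t n_bytes) {
        while (n_bytes > 0) {
            const std::size_t room = buffer_.size() - used_;
            const std::size_t n = std::min(room, n_bytes);
            std::memcpy(buffer_.data() + used_, src, n);
            used_ += n;
            src += n;
            n_bytes -= n;
            if (used_ == buffer_.size()) flush();
        }
    }

    // Hands the buffered bytes to the callback; call once more at the end
    // to push out a partially filled buffer.
    void flush() {
        if (used_ == 0) return;
        flush_(buffer_.data(), used_);
        ++flush_calls_;
        used_ = 0;
    }

    std::size_t flush_calls() const { return flush_calls_; }

private:
    FlushFn flush_;
    std::vector<std::uint8_t> buffer_;
    std::size_t used_ = 0;
    std::size_t flush_calls_ = 0;
};
```

Writing 150,000 bytes through a 64KB buffer triggers two automatic flushes, and a final explicit `flush()` pushes out the remaining 18,928 bytes, so the boundary is crossed three times instead of once per write call.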
[Image: Image.jpg]
5.2.2.3. Expanding IndexService
5.3. Miscellaneous
Since we will be fully relying on Lucene’s Directory, a few DTO classes now need to include a Directory instance as a member field.
Also, a few classes should be modified to use the passed Directory instance for building and loading vector indices instead of using the casted FSDirectory.
5.4. Pros and Cons
5.4.1. Pros
5.4.2. Cons
5.5. [P1] Nmslib - Introducing Similar Concept of IOReader in FAISS.
We need to implement some changes in Nmslib to make the layers available. Since only two index types are currently in use (Hnsw and Hnsw), the definition of Hnsw is the only place requiring modification.
Currently, Hnsw has methods that accept std::istream and std::ostream for reading and writing bytes. However, these methods are private, which prevents JNI from passing a stream object and letting it use Lucene’s IndexInput and IndexOutput for IO operations.
5.5.1. Required Patches [DONE]
5.5.1.1. Loading With Stream
5.5.1.2. Writing With Stream
5.5.2. Read Stream Buffer
5.5.3. Write Stream Buffer
6. Backward Compatibility And Miscellaneous Performance Issues
It should fall back to the existing implementation when the given Directory is file-based, ensuring no backward compatibility issues.
Apart from the inherent overhead of IndexInput (note that we cannot prevent users from importing inefficient IndexInput implementations in OpenSearch!), what are the costs of the proposed solution?
From a performance perspective, the primary impact will come from context switching between Java and C++ due to JNI calls. However, since each JNI call transition typically takes only nanoseconds, the overall performance degradation is expected to be minimal.
From a memory consumption perspective, additional memory allocation will be limited to a constant factor, with a maximum of approximately 4KB for the copy buffer. Additionally, because GetPrimitiveArrayCritical provides a pointer to the primitive array without performing a data copy in OpenJDK, we don’t need extra memory allocations other than the 4KB buffer.
6.1. File Watching Mechanism Issue
With this change, we need to modify the file-watching mechanism, as we are no longer relying on OS-level files.
The purpose of the file watcher is to allow us to evict outdated vector indexes as soon as the corresponding file is deleted. This ensures we maintain the necessary memory space efficiently.
I realized that including this section would make the document overly lengthy, so I will move this topic to a separate sub-document where all alternatives will be explored in detail.
7. Performance Benchmark
7.1. Benchmark Environment
7.1.1. Traffic Loader
7.1.2. OpenSearch Engine
vm.max_map_count=262144
7.1. Analysis
For the details, please refer to Appendix 1. Performance Details.
The best benchmark result for each case was selected after three trials. Overall search performance remained consistent, with the baseline occasionally outperforming the candidate and vice versa. This variation is expected, as the loading layer merely involves reading bytes into memory. Since the query process operates on in-memory data, introducing this layer should not significantly impact search performance.
In the worst case, we see a 20% performance degradation in bulk ingestion, which resulted in 2 seconds of added latency. This is somewhat expected, as noted earlier: "As a result, it needs double the memory copies compared to the baseline. (Copying bytes in C++ first, and then…"
Milestones
Prepare the next round of design meeting for write part.
POC (Reading part only)
Productionize (Reading part only)
Propose an introduction of loading layer in Nmslib.
9. Demo : S3 Vector Index Snapshot
[Image: Image.jpg]
9.1. Demo Steps
For the detailed commands, please refer to Appendix 3. Demo Scripts.
Appendix 1. Performance Details
Appendix 2. Loading Time Comparison
The numbers below were measured through time curl -X GET http://localhost:9200/_plugins/_knn/warmup/target_index.
I ran two different experiments of loading a FAISS vector index with different buffer sizes.
(sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches')
Conclusion:
Buffer size in InputIndexWithBuffer does not impact loading time. There is therefore no reason to use a buffer larger than 4KB. If anything, a larger buffer costs more space and spends more time inside the JNI critical section.
When a new index file has just been created and is not yet in the system cache, there is no meaningful difference in loading time between the baseline and the streaming approach.
But when the file is already in the system cache, the baseline (i.e. the one using fread) is slightly faster than the streaming approach (3.584 vs. 4.664).
Considering that this case is rare (except when rebooting an engine during a rolling restart), and that in most cases a newly created vector index is expected to be loaded, I think it would not seriously deteriorate performance overall.
Once an index is loaded, queries are processed against the in-memory data structure, so there was no difference in search performance between the baseline and the streaming version. (Refer to the table above for more details.)
Experiment
Index size : 6.4G
1. Baseline (Using fread)
2. Using Stream
2.1. 4KB
2.2. 64KB
2.3. 1M
Appendix 3. Demo Scripts
0. Set up AWS credential
./bin/opensearch-keystore create
./bin/opensearch-keystore add s3.client.default.access_key
./bin/opensearch-keystore add s3.client.default.secret_key
1. Make sure S3 bucket is empty.
https://us-east-1.console.aws.amazon.com/s3/buckets/kdooyong-opensearch?region=us-east-1&bucketType=general&tab=objects
2. Create an index.
3. Bulk ingest vector data.
4. Run a query and make sure we are getting a valid result.
5. Create a repository.
6. Look up the repository we created.
7. Take the snapshot
8. Get the snapshot info.
9. Delete the index from OpenSearch. Confirm that we deleted the index in OpenSearch.
10. Confirm we don’t have any indices
11. Restore the searchable index. Confirm that now we have a vector index restored.
12. Run a query against the restored vector index. Make sure we are getting a valid result.