
Introduce a loading layer in NMSLIB. #2185

Merged

Conversation

@0ctopus13prime 0ctopus13prime commented Oct 3, 2024

Description

This PR is the first commit introducing the loading layer in NMSLIB.
Please refer to this issue for more details. - #2033

FYI : FAISS Loading Layer PR - #2139

Related Issues

Resolves #2033

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@@ -27,301 +27,354 @@
#include <string>

#include "hnswquery.h"
#include "method/hnsw.h"

Contributor Author:

Please search 'LoadIndexWithStream' in nmslib_wrapper.cpp and ignore the formatting changes. 🫠

return (jlong) indexWrapper;
}

jlong knn_jni::nmslib_wrapper::LoadIndexWithStream(knn_jni::JNIUtilInterface *jniUtil,
Contributor Author:

This is the place for using an input stream to load index within NMSLIB.

Collaborator:

The other load function will be removed in the next iterations, right?

@@ -12,79 +12,102 @@
#include "org_opensearch_knn_jni_NmslibService.h"

Contributor Author:

Please ignore the formatting changes and directly go to Java_org_opensearch_knn_jni_NmslibService_loadIndexWithStream.

}

JNIEXPORT jobjectArray JNICALL Java_org_opensearch_knn_jni_NmslibService_queryIndex(JNIEnv * env, jclass cls,
JNIEXPORT jlong JNICALL Java_org_opensearch_knn_jni_NmslibService_loadIndexWithStream(JNIEnv *env,
Contributor Author:

This is the new method added in this PR

@@ -182,7 +157,7 @@ class TrainingLoadStrategy
NativeMemoryLoadStrategy<NativeMemoryAllocation.TrainingDataAllocation, NativeMemoryEntryContext.TrainingDataEntryContext>,
Closeable {

private static TrainingLoadStrategy INSTANCE;
private static volatile TrainingLoadStrategy INSTANCE;
Contributor Author:

We need to make it volatile in the singleton pattern to avoid possible instruction reordering.
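
A quick illustration of the reordering hazard (not code from this repository; the field under discussion is Java, where volatile plays the role the atomic pointer plays below). In double-checked locking, the instance must be published only after construction completes:

#include <atomic>
#include <mutex>

// Illustrative C++ analogue of a double-checked-locking singleton.
// Without an atomic (or, in Java, volatile) instance field, the
// "publish pointer" store could be reordered before the object's fields
// are fully written, letting another thread see a half-constructed instance.
class LoadStrategyLike {
 public:
  static LoadStrategyLike& getInstance() {
    LoadStrategyLike* instance = instance_.load(std::memory_order_acquire);
    if (instance == nullptr) {                      // first check, lock-free
      std::lock_guard<std::mutex> lock(mutex_);
      instance = instance_.load(std::memory_order_relaxed);
      if (instance == nullptr) {                    // second check, under the lock
        instance = new LoadStrategyLike();
        instance_.store(instance, std::memory_order_release);  // publish after construction
      }
    }
    return *instance;
  }

 private:
  LoadStrategyLike() = default;
  static std::atomic<LoadStrategyLike*> instance_;
  static std::mutex mutex_;
};

std::atomic<LoadStrategyLike*> LoadStrategyLike::instance_{nullptr};
std::mutex LoadStrategyLike::mutex_;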

Collaborator:

Let's add this as a Javadoc.

Contributor Author:

will do in the next update

@@ -48,7 +48,7 @@ private void tripCb() throws Exception {
createKnnIndex(indexName2, settings, createKnnIndexMapping(FIELD_NAME, 2));

Float[] vector = { 1.3f, 2.2f };
int docsInIndex = 5; // through testing, 7 is minimum number of docs to trip circuit breaker at 1kb
int docsInIndex = 7; // through testing, 7 is minimum number of docs to trip circuit breaker at 1kb
Contributor Author:

It's weird though: when this was set to 5, the total size was slightly less than 1KB, which made this test fail.

Member

@jmazanec15 jmazanec15 left a comment

I think overall it looks good. A couple of comments:

}

protected:
std::streamsize xsgetn(std::streambuf::char_type *destination, std::streamsize count) final {
Member:

What does this name mean? Also, why is this access protected?

Contributor Author:

Please refer to the official doc: https://en.cppreference.com/w/cpp/io/basic_streambuf/sgetn
In short, xsgetn is a protected virtual method defined in std::basic_streambuf; it is called whenever sgetn is called, delegating the read to the implementation.

x -> extended
s -> sequence

xs + getn = extended sequence getter function.
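
For reference, a minimal sketch of the idea (assumed names, not the exact class added in this PR): a std::streambuf subclass whose xsgetn override forwards bulk reads to a caller-supplied callback, which in the k-NN case would be the function that reads from Lucene's IndexInput over JNI. Only sgetn is serviced here; single-character reads would still need underflow.

#include <functional>
#include <streambuf>
#include <utility>

// Minimal sketch: bulk reads requested via sgetn() are delegated to readFn_.
class CallbackStreamBuf final : public std::streambuf {
 public:
  using ReadFn = std::function<std::streamsize(char*, std::streamsize)>;

  explicit CallbackStreamBuf(ReadFn readFn) : readFn_(std::move(readFn)) {}

 protected:
  // sgetn(dest, count) on the public interface ends up calling this hook.
  std::streamsize xsgetn(char_type* destination, std::streamsize count) final {
    return readFn_(destination, count);
  }

 private:
  ReadFn readFn_;
};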

Member:

Oh this is really cool

Collaborator:

Please add this explanation as a comment on the function. It will be hard to remember in the future.

@0ctopus13prime
Contributor Author

Memory monitoring results comparison

From a memory standpoint, there were no major changes that I could observe in the benchmark.

Baseline

nmslib_baseline

Candidate

nmslib-candidate-1

@0ctopus13prime
Contributor Author

Loading Time Comparison

The numbers below were measured through time curl -X GET http://localhost:9200/_plugins/_knn/warmup/target_index.
I ran two experiments loading an NMSLIB vector index with different buffer sizes:

  1. After dropping all file cache from memory (sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches').
  2. With the file cache in memory.

Observation

Unlike FAISS, it took almost 81% more time to load a file that was already in the system cache.
Of course, this case will be rare, as k-NN is expected to load a vector index whenever a new segment file is baked.
And the newly baked segment file is likely not system cached.
Increasing the buffer size didn't help. We need to find a better way to transfer data from JNI to Java.

Experiment

Index size : 30G

1. Baseline (using fread)

  1. After dropping caches : 50.448 seconds
  2. With cache : 4.144 seconds

2. Using a stream (4KB buffer)

  1. After dropping caches : 52.779 seconds
  2. With cache : 7.503 seconds
  3. With cache, 64KB buffer : 6.919 seconds
  4. With cache, 1MB buffer : 6.99 seconds 🤔

@0ctopus13prime
Contributor Author

Performance Benchmark

Machine : c5ad.12xlarge
JVM Args : -Xms63G -Xmx63G
Data : random-s-128-10m-euclidean.hdf5

Metric Task Baseline-Value Candidate-Value Change Unit
Cumulative indexing time of primary shards 33.7147 34.2115 1.47% min
Min cumulative indexing time across primary shards 0.000133333 0.00015 12.50% min
Median cumulative indexing time across primary shards 16.8573 17.1057 1.47% min
Max cumulative indexing time across primary shards 33.7146 34.2113 1.47% min
Cumulative indexing throttle time of primary shards 0 0 0.00% min
Min cumulative indexing throttle time across primary shards 0 0 0.00% min
Median cumulative indexing throttle time across primary shards 0 0 0.00% min
Max cumulative indexing throttle time across primary shards 0 0 0.00% min
Cumulative merge time of primary shards 282.601 282.996 0.14% min
Cumulative merge count of primary shards 125 122 2.40%
Min cumulative merge time across primary shards 0 0 0.00% min
Median cumulative merge time across primary shards 141.3 141.498 0.14% min
Max cumulative merge time across primary shards 282.601 282.996 0.14% min
Cumulative merge throttle time of primary shards 1.04818 1.61307 53.89% min
Min cumulative merge throttle time across primary shards 0 0 0.00% min
Median cumulative merge throttle time across primary shards 0.524092 0.806533 53.89% min
Max cumulative merge throttle time across primary shards 1.04818 1.61307 53.89% min
Cumulative refresh time of primary shards 1.1042 1.14667 3.85% min
Cumulative refresh count of primary shards 88 85 3.41%
Min cumulative refresh time across primary shards 0.000333333 0.000383333 15.00% min
Median cumulative refresh time across primary shards 0.5521 0.573333 3.85% min
Max cumulative refresh time across primary shards 1.10387 1.14628 3.84% min
Cumulative flush time of primary shards 11.5126 10.9446 4.93% min
Cumulative flush count of primary shards 56 53 5.36%
Min cumulative flush time across primary shards 0 0 0.00% min
Median cumulative flush time across primary shards 5.75628 5.47229 4.93% min
Max cumulative flush time across primary shards 11.5126 10.9446 4.93% min
Total Young Gen GC time 0.338 0.342 1.18% s
Total Young Gen GC count 19 19 0.00%
Total Old Gen GC time 0 0 0.00% s
Total Old Gen GC count 0 0 0.00%
Store size 29.8586 29.8584 0.00% GB
Translog size 5.83E-07 5.83E-07 0.00% GB
Heap used for segments 0 0 0.00% MB
Heap used for doc values 0 0 0.00% MB
Heap used for terms 0 0 0.00% MB
Heap used for norms 0 0 0.00% MB
Heap used for points 0 0 0.00% MB
Heap used for stored fields 0 0 0.00% MB
Segment count 2 2 0.00%
Min Throughput custom-vector-bulk 5390.31 5466.23 1.41% docs/s
Mean Throughput custom-vector-bulk 11041.1 10866 1.59% docs/s
Median Throughput custom-vector-bulk 10377.9 10065.6 3.01% docs/s
Max Throughput custom-vector-bulk 20105.1 19337.8 3.82% docs/s
50th percentile latency custom-vector-bulk 78.4349 75.8132 3.34% ms
90th percentile latency custom-vector-bulk 165.667 158.129 4.55% ms
99th percentile latency custom-vector-bulk 331.043 318.269 3.86% ms
99.9th percentile latency custom-vector-bulk 1486.47 1487.08 0.04% ms
99.99th percentile latency custom-vector-bulk 2300.48 2598.04 12.93% ms
100th percentile latency custom-vector-bulk 5049.72 4535.08 10.19% ms
50th percentile service time custom-vector-bulk 78.4349 75.8132 3.34% ms
90th percentile service time custom-vector-bulk 165.667 158.129 4.55% ms
99th percentile service time custom-vector-bulk 331.043 318.269 3.86% ms
99.9th percentile service time custom-vector-bulk 1486.47 1487.08 0.04% ms
99.99th percentile service time custom-vector-bulk 2300.48 2598.04 12.93% ms
100th percentile service time custom-vector-bulk 5049.72 4535.08 10.19% ms
error rate custom-vector-bulk 0 0 0.00% %
Min Throughput force-merge-segments 0 0 0.00% ops/s
Mean Throughput force-merge-segments 0 0 0.00% ops/s
Median Throughput force-merge-segments 0 0 0.00% ops/s
Max Throughput force-merge-segments 0 0 0.00% ops/s
100th percentile latency force-merge-segments 1.16E+07 1.13E+07 2.84% ms
100th percentile service time force-merge-segments 1.16E+07 1.13E+07 2.84% ms
error rate force-merge-segments 0 0 0.00% %
Min Throughput warmup-indices 0.24 0.14 41.67% ops/s
Mean Throughput warmup-indices 0.24 0.14 41.67% ops/s
Median Throughput warmup-indices 0.24 0.14 41.67% ops/s
Max Throughput warmup-indices 0.24 0.14 41.67% ops/s
100th percentile latency warmup-indices 4162.87 7127.78 71.22% ms
100th percentile service time warmup-indices 4162.87 7127.78 71.22% ms
error rate warmup-indices 0 0 0.00% %
Min Throughput prod-queries 0.66 0.64 3.03% ops/s
Mean Throughput prod-queries 0.66 0.64 3.03% ops/s
Median Throughput prod-queries 0.66 0.64 3.03% ops/s
Max Throughput prod-queries 0.66 0.64 3.03% ops/s
50th percentile latency prod-queries 3.5832 3.83349 6.99% ms
90th percentile latency prod-queries 4.75317 4.64172 2.34% ms
99th percentile latency prod-queries 22.1628 23.8439 7.59% ms
100th percentile latency prod-queries 1508.36 1571.86 4.21% ms
50th percentile service time prod-queries 3.5832 3.83349 6.99% ms
90th percentile service time prod-queries 4.75317 4.64172 2.34% ms
99th percentile service time prod-queries 22.1628 23.8439 7.59% ms
100th percentile service time prod-queries 1508.36 1571.86 4.21% ms
error rate prod-queries 0 0 0.00% %
Mean recall@k prod-queries 0.42 0.43 2.38%
Mean recall@1 prod-queries 0.6 0.63 5.00%

@navneet1v
Collaborator

This PR is the first commit making the loading layer in native engines available.

You might want to update the description to say "loading layer for NMSLIB".

@0ctopus13prime
Contributor Author

Will hold off merging until we root-cause the big gap in warm-up time.
Compared to FAISS, an 84% increase is a bit worrisome.

Collaborator

@navneet1v navneet1v left a comment

Code looks good to me. A few minor comments. For me, the biggest blocker to approving the code right now is the increase in warm-up latency. Once we fix that, I will approve.

Comment on lines +331 to -332
writeLock();
try {
writeLock();
Collaborator:

any reason for moving this out?

Contributor Author:

If an exception were thrown there, the finally block would make an unmatched unlock call.
I know the current implementation doesn't throw any exceptions; I was just trying to make the code robust no matter how other parts change.

}

protected:
std::streamsize xsgetn(std::streambuf::char_type *destination, std::streamsize count) final {
Collaborator:

Please add this explanation as a comment on the function. It will be hard to remember in the future.

return (jlong) indexWrapper;
}

jlong knn_jni::nmslib_wrapper::LoadIndexWithStream(knn_jni::JNIUtilInterface *jniUtil,
Collaborator:

The other load function will be removed in the next iterations, right?

@@ -182,7 +157,7 @@ class TrainingLoadStrategy
NativeMemoryLoadStrategy<NativeMemoryAllocation.TrainingDataAllocation, NativeMemoryEntryContext.TrainingDataEntryContext>,
Closeable {

private static TrainingLoadStrategy INSTANCE;
private static volatile TrainingLoadStrategy INSTANCE;
Collaborator:

Let's add this as a Javadoc.

@0ctopus13prime
Contributor Author

0ctopus13prime commented Oct 9, 2024

Streaming Flamegraph

Screenshot 2024-10-08 at 5 44 14 PM

@0ctopus13prime
Contributor Author

0ctopus13prime commented Oct 9, 2024

Performance tuning plan

Planning to continue with the two tuning ideas below.
I expect them to cut about 23.16% of the latency.
I hardly think we can make the other parts (e.g. JNIEnv_::CallIntMethod and IndexInput) much faster.
It would also be worth trying a bigger buffer size and seeing how it goes.

  1. __memmove_avx_unaligned_erms :
    This indicates that the buffer memory is not properly aligned for the internal memcpy.
    We can allocate a 64-byte-aligned buffer and retry. 64-byte alignment works for both AVX2 and AVX-512 (see the aligned-allocation sketch after the snippets below).

  2. Remove the critical JNI calls (JNIEnv_::CallIntMethod, jni_ReleasePrimitiveArrayCritical) entirely.
    We can have the Java side own native memory via a direct ByteBuffer, then acquire the pointer once in JNI with GetDirectBufferAddress.

TODO : Can we allocate an aligned memory layout?

In Java

ByteBuffer nativeBuffer = ByteBuffer.allocateDirect(size);
In C++

// Get the pointer to the native memory from the buffer once, at the beginning.
void *nativePtr = env->GetDirectBufferAddress(buffer);
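
For the aligned-buffer idea from item 1 above: as far as I know, ByteBuffer.allocateDirect does not guarantee 64-byte alignment, but on the native side a 64-byte-aligned staging buffer can be obtained with std::aligned_alloc (C++17). A hedged sketch, not what this PR does:

#include <cstddef>
#include <cstdlib>
#include <memory>

// Deleter so the aligned allocation is released with std::free.
struct AlignedFree {
  void operator()(void* p) const { std::free(p); }
};

// Returns a 64-byte-aligned buffer; 64-byte alignment satisfies both
// AVX2 (32-byte) and AVX-512 (64-byte) loads/stores.
inline std::unique_ptr<char, AlignedFree> makeAlignedBuffer(std::size_t size) {
  constexpr std::size_t alignment = 64;
  // std::aligned_alloc expects the size to be a multiple of the alignment.
  const std::size_t rounded = (size + alignment - 1) / alignment * alignment;
  return std::unique_ptr<char, AlignedFree>(
      static_cast<char*>(std::aligned_alloc(alignment, rounded)));
}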

@jmazanec15
Member

And the newly baked segment file likely is not system cached.

Won't the page cache typically be write-through? In that case, if the graph is created and written on the same node it is searched on, won't it be cached?

@0ctopus13prime
Contributor Author

Baseline flamegraph

Screenshot 2024-10-08 at 7 54 06 PM

@0ctopus13prime
Contributor Author

1. NMSLIB Loading Perf Issue Analysis

2. Performance Degradation In FAISS

After switching from direct file API usage to an abstract IO loading layer, additional overhead was introduced due to JNI calls and buffer copying via std::memcpy. This change resulted in a 30% increase in loading time compared to the baseline in FAISS. The baseline took 3.584 seconds to load a 6GB vector index, while the modified version increased the load time to 4.664 seconds.

In NMSLIB, we expected a similar level of performance regression as seen in FAISS. However, we're observing a 70% increase in load time when loading a 6GB vector index. (baseline=4.144 sec, the modified one=7.503 sec)
Why is the performance impact in NMSLIB more than twice as severe as in FAISS?

3. Why is it more than twice as severe as in FAISS?

The key performance difference in index loading between FAISS and NMSLIB stems from their file formats.
In NMSLIB, this difference results in JNI calls being made O(N) times, where N is the number of vectors, whereas in FAISS, the number of JNI calls is O(1).

FAISS stores chunks of the neighbor list in a single location and loads them all at once. See the code below:

static void read_HNSW(HNSW* hnsw, IOReader* f) {
    READVECTOR(hnsw->assign_probas);
    READVECTOR(hnsw->cum_nneighbor_per_level);
    READVECTOR(hnsw->levels);
    READVECTOR(hnsw->offsets);
    READVECTOR(hnsw->neighbors);

    READ1(hnsw->entry_point);
    READ1(hnsw->max_level);
    READ1(hnsw->efConstruction);
    READ1(hnsw->efSearch);
    READ1(hnsw->upper_beam);
}

In NMSLIB, each neighbor list is stored individually, requiring O(N) reads, where N is the total number of vectors.
As shown in the code below, we need totalElementsStored_ read operations.
Note that, thanks to the introduced loading layer, input.read() ultimately makes a JNI call that delegates to Lucene's IndexInput to read bytes. As a result, the number of input.read() calls directly corresponds to the number of JNI calls.

for (size_t i = 0; i < totalElementsStored_; i++) {
   ...
    } else {
        linkLists_[i] = (char *)malloc(linkListSize);
        CHECK(linkLists_[i]);
        input.read(linkLists_[i], linkListSize); <--------- THIS!
    }
    data_rearranged_[i] = new Object(data_level0_memory_ + (i)*memoryPerObject_ + offsetData_);
}

4. Solution 1. Patch in NMSLIB

We can patch NMSLIB to avoid making JNI calls for each vector element. The idea is to load data in bulk, then parse the neighbor lists from that buffer, rather than reading bytes individually. This approach would reduce the number of JNI calls to O(Index size / Buffer size).

For example, with a 6GB vector index containing 1 million vectors and a 64KB buffer size, the required JNI calls would be reduced to O(6GB / 64KB) = 98,304, which is a significant improvement over 1 million calls, achieving nearly a 90% reduction in operations.

Result: Surprisingly, it is 8% faster than the baseline. (Note: I reindexed on a new single node, which is why the loading time differs from the one mentioned earlier in the issue.)

  1. Baseline : 4.538 sec
  2. Modified version with 64KB buffer : 4.19 sec

4.1 Pros

  1. No performance degradation. If anything, it is even faster than the baseline.
  2. We can maintain a unified set of loading APIs for both NMSLIB and FAISS.

4.2 Cons

  1. A medium-sized patch is required in NMSLIB. This may increase the code-maintenance burden.

4.3. Patch in hnsw.cc

template <typename dist_t>
void Hnsw<dist_t>::LoadOptimizedIndex(NmslibIOReader& input) {
    ...

    const size_t bufferSize = 64 * 1024;  // 64KB
    std::unique_ptr<char[]> buffer (new char[bufferSize]);
    uint32_t end = 0;
    uint32_t pos = 0;
    const bool isLTE = _isLittleEndian();
    
    for (size_t i = 0, remainingBytes = input.remaining(); i < totalElementsStored_; i++) {
        // Read linkList size integer.
        if ((pos + sizeof(SIZEMASS_TYPE)) >= end) {
            // Underflow, load bytes in bulk.
            const auto firstPartLen = end - pos;
            if (firstPartLen > 0) {
                std::memcpy(buffer.get(), buffer.get() + pos, firstPartLen);
            }
            const auto copyBytes = std::min(remainingBytes, bufferSize - firstPartLen);
            input.read(buffer.get() + firstPartLen, copyBytes);
            remainingBytes -= copyBytes;
            end = copyBytes + firstPartLen;
            pos = 0;
        }
    
        // Read data size. SIZEMASS_TYPE -> uint32_t
        SIZEMASS_TYPE linkListSize = 0;
        if (isLTE) {
            linkListSize = _readIntLittleEndian(buffer[pos], buffer[pos + 1], buffer[pos + 2], buffer[pos + 3]);
        } else {
            linkListSize = _readIntBigEndian(buffer[pos], buffer[pos + 1], buffer[pos + 2], buffer[pos + 3]);
        }
        pos += 4;
    
        if (linkListSize == 0) {
            linkLists_[i] = nullptr;
        } else {
            // Now we load neighbor list.
            linkLists_[i] = (char *) malloc(linkListSize);
            CHECK(linkLists_[i]);
    
            SIZEMASS_TYPE leftLinkListData = linkListSize;
            auto dataPtr = linkLists_[i];
            while (leftLinkListData > 0) {
                if (pos >= end) {
                    // Underflow, load bytes in bulk.
                    const auto copyBytes = std::min(remainingBytes, bufferSize);
                    input.read(buffer.get(), copyBytes);
                    remainingBytes -= copyBytes;
                    end = copyBytes;
                    pos = 0;
                }
        
                const auto copyBytes = std::min(leftLinkListData, end - pos);
                std::memcpy(dataPtr, buffer.get() + pos, copyBytes);
                dataPtr += copyBytes;
                leftLinkListData -= copyBytes;
                pos += copyBytes;
            }  // End while
        }  // End if
    
        data_rearranged_[i] = new Object(data_level0_memory_ + (i)*memoryPerObject_ + offsetData_);
    }  // End for
                
...            
              

5. Solution 2. Disable Streaming When FSDirectory

Since we're deprecating NMSLIB in version 3.x, we can disable the loading layer in NMSLIB until then.
Or, we can selectively allow streaming in NMSLIB depending on whether the given Directory is an FSDirectory implementation.

if (directory instanceof FSDirectory) {
  loadIndexByFilePath(...);
} else {
  loadIndexByStreaming(...);
}

5.1. Pros :

  1. Simple.

5.2. Cons :

  1. Until 3.x, we need to maintain duplicated, near-identical versions of the APIs in both Java and JNI.

6. Solution 3. Live with it :)

Since we're deprecating NMSLIB in version 3.x, we can tolerate this issue in the short term.
However, I personally don't favor this approach, as it impacts the p99 latency metrics; such cases are rare but could still affect overall cluster performance in the worst case.

7. Micro Tuning Results

  1. CallNonvirtualIntMethod → No impact.
  2. AVX2 intrinsic copy → No impact.
  3. Using a native ByteBuffer + one additional byte copy → made it worse.
  4. Increasing the buffer size → going from 4KB to 64KB at least reduced the warm-up time by 0.8 seconds in NMSLIB.

@0ctopus13prime
Contributor Author

@navneet1v @jmazanec15
Could you share your thoughts on the above analysis?
Thanks

@0ctopus13prime
Contributor Author

0ctopus13prime commented Oct 10, 2024

And the newly baked segment file likely is not system cached.

Wont page cache typically be write through? In which case, if graph is created and written on same node it is searched on, wont it be cached?

Sorry, I just saw it.

Yes, it is configured that way by default in pretty much every common file system.
But the default ratio is 10% of memory, which means the write-back cache size is bounded by 10% of the physical memory.

The reasons I assumed the write-back cached pages would 'likely' no longer exist are twofold:

  1. In the normal case, writing and reading happen simultaneously. Since Linux uses LRU page eviction, write-back cached pages are very likely to be evicted soon.
  2. On top of point 1, NRT in Lucene periodically exposes a newly baked segment. This typically takes a few seconds, so I assumed most write-back cached pages are evicted in the meantime.

Please share your thoughts on it! You can call me an aggressive dreamer. 😛

@navneet1v
Collaborator

@0ctopus13prime the approach of patching NMSLIB looks good to me. If it provides good latency, I think we should do that: the pros of having the patch are an improved load time and getting away from the FSDirectory dependency.

Collaborator

@navneet1v navneet1v left a comment

Approving. Please answer some of the clarifying questions.

+
+ template <typename dist_t>
+ void Hnsw<dist_t>::LoadIndexWithStream(NmslibIOReader& input) {
+ LOG(LIB_INFO) << "Loading index from an input stream(NmslibIOReader).";
Collaborator:

[Question]:
Is this info log enabled by default? I'm asking since I cannot see all of the NMSLIB code in this PR.

Contributor Author:

It's disabled by default. (LIB_LOGNONE)

In nmslib_wrapper

void knn_jni::nmslib_wrapper::InitLibrary() {
  similarity::initLibrary();
}

and in NMSLIB

void initLibrary(int seed = 0, LogChoice choice = LIB_LOGNONE, const char*pLogFile = NULL);

+ }
+
+ template <typename dist_t>
+ void Hnsw<dist_t>::LoadIndexWithStream(NmslibIOReader& input) {
Collaborator:

[Question]:
In this function, what extra step are we doing to ensure that we are able to fetch more data from NmslibIOReader? I can see some places where we might be doing it, but it would be good if you could point out exactly which lines improved the loading logic, so that part becomes easier to review.

Contributor Author:

I added a remainingBytes method that returns the number of unread bytes in IndexInputWithBuffer.

private long remainingBytes() {
    return contentLength - indexInput.getFilePointer();
}

And in NMSLIB, within the void Hnsw<dist_t>::LoadOptimizedIndex(NmslibIOReader& input) method, we keep track of the safe amount of bytes left to read.

        for (size_t i = 0, remainingBytes = input.remainingBytes(); i < totalElementsStored_; i++) {
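
For clarity, based only on how the patch calls it (input.read(...) above and input.remaining()/remainingBytes() in the snippets), the reader abstraction presumably looks roughly like this; a hedged reconstruction with assumed names, not the PR's exact header:

#include <cstddef>

// Inferred shape of the reader handed to LoadOptimizedIndex. In the k-NN
// implementation, read() would delegate to Lucene's IndexInput through JNI,
// and remaining() would report how many bytes of the index are left unread.
class NmslibIOReader {
 public:
  virtual ~NmslibIOReader() = default;

  // Copy `size` bytes from the underlying index stream into `buffer`.
  virtual void read(char* buffer, std::size_t size) = 0;

  // Number of bytes not yet consumed from the underlying index file.
  virtual std::size_t remaining() const = 0;
};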

Contributor

@Vikasht34 Vikasht34 left a comment

Hi, I am reviewing this PR. Please wait till EOD to merge it.

Collaborator

@shatejas shatejas left a comment

Looks good to me

@navneet1v
Collaborator

navneet1v commented Oct 15, 2024

Merging the PR, as we have 2 approvals and the author is requesting the merge.

@navneet1v navneet1v merged commit 7cf45c8 into opensearch-project:main Oct 15, 2024
30 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 15, 2024
* Introduce a loading layer in NMSLIB.

Signed-off-by: Dooyong Kim <[email protected]>

* Added NMSLIB istream implementation.

Signed-off-by: Dooyong Kim <[email protected]>

* Fix integer overflow issue when passing read size for loading NMSLIB vector index.

Signed-off-by: Dooyong Kim <[email protected]>

* Added unit test for NMSLIB loading layer.

Signed-off-by: Dooyong Kim <[email protected]>

* Made a patch in NMSLIB to avoid frequently calling JNI for better loading index performance.

Signed-off-by: Dooyong Kim <[email protected]>

* Compliance constexpr function in C++11 having nullstatement.

Signed-off-by: Dooyong Kim <[email protected]>

---------

Signed-off-by: Dooyong Kim <[email protected]>
Co-authored-by: Dooyong Kim <[email protected]>
(cherry picked from commit 7cf45c8)
@0ctopus13prime 0ctopus13prime deleted the nmslib-loading-layer branch October 15, 2024 19:15
0ctopus13prime pushed a commit to 0ctopus13prime/k-NN that referenced this pull request Oct 15, 2024
* Introduce a loading layer in NMSLIB.

Signed-off-by: Dooyong Kim <[email protected]>

* Added NMSLIB istream implementation.

Signed-off-by: Dooyong Kim <[email protected]>

* Fix integer overflow issue when passing read size for loading NMSLIB vector index.

Signed-off-by: Dooyong Kim <[email protected]>

* Added unit test for NMSLIB loading layer.

Signed-off-by: Dooyong Kim <[email protected]>

* Made a patch in NMSLIB to avoid frequently calling JNI for better loading index performance.

Signed-off-by: Dooyong Kim <[email protected]>

* Compliance constexpr function in C++11 having nullstatement.

Signed-off-by: Dooyong Kim <[email protected]>
0ctopus13prime pushed a commit to 0ctopus13prime/k-NN that referenced this pull request Oct 18, 2024
* Introduce a loading layer in NMSLIB.

Signed-off-by: Dooyong Kim <[email protected]>

* Added NMSLIB istream implementation.

Signed-off-by: Dooyong Kim <[email protected]>

* Fix integer overflow issue when passing read size for loading NMSLIB vector index.

Signed-off-by: Dooyong Kim <[email protected]>

* Added unit test for NMSLIB loading layer.

Signed-off-by: Dooyong Kim <[email protected]>

* Made a patch in NMSLIB to avoid frequently calling JNI for better loading index performance.

Signed-off-by: Dooyong Kim <[email protected]>

* Compliance constexpr function in C++11 having nullstatement.

Signed-off-by: Dooyong Kim <[email protected]>

---------

Signed-off-by: Dooyong Kim <[email protected]>
Co-authored-by: Dooyong Kim <[email protected]>
shatejas pushed a commit that referenced this pull request Oct 21, 2024
* Introduce a loading layer in NMSLIB. (#2185)

* Introduce a loading layer in NMSLIB.

Signed-off-by: Dooyong Kim <[email protected]>

* Added NMSLIB istream implementation.

Signed-off-by: Dooyong Kim <[email protected]>

* Fix integer overflow issue when passing read size for loading NMSLIB vector index.

Signed-off-by: Dooyong Kim <[email protected]>

* Added unit test for NMSLIB loading layer.

Signed-off-by: Dooyong Kim <[email protected]>

* Made a patch in NMSLIB to avoid frequently calling JNI for better loading index performance.

Signed-off-by: Dooyong Kim <[email protected]>

* Compliance constexpr function in C++11 having nullstatement.

Signed-off-by: Dooyong Kim <[email protected]>

---------

Signed-off-by: Dooyong Kim <[email protected]>
Co-authored-by: Dooyong Kim <[email protected]>

* Fixed that it's failing to resolve a package in import statement.

Signed-off-by: Dooyong Kim <[email protected]>

* Move the element in the changelog from 3.x to 2.x.

Signed-off-by: Dooyong Kim <[email protected]>

---------

Signed-off-by: Dooyong Kim <[email protected]>
Co-authored-by: Dooyong Kim <[email protected]>