
Introduce a loading layer in NMSLIB. #2185

Merged

Conversation

@0ctopus13prime 0ctopus13prime commented Oct 3, 2024

Description

This PR is the first commit introducing the loading layer in NMSLIB.
Please refer to this issue for more details. - #2033

FYI : FAISS Loading Layer PR - #2139

Related Issues

Resolves #2033

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@@ -27,301 +27,354 @@
#include <string>

#include "hnswquery.h"
#include "method/hnsw.h"

Contributor Author:

Please search 'LoadIndexWithStream' in nmslib_wrapper.cpp and ignore the formatting changes. 🫠

return (jlong) indexWrapper;
}

jlong knn_jni::nmslib_wrapper::LoadIndexWithStream(knn_jni::JNIUtilInterface *jniUtil,
Contributor Author:

This is the place for using an input stream to load index within NMSLIB.

Collaborator:

The other load function will be removed in the next iterations, right?

@@ -12,79 +12,102 @@
#include "org_opensearch_knn_jni_NmslibService.h"

Contributor Author:

Please ignore the formatting changes and directly go to Java_org_opensearch_knn_jni_NmslibService_loadIndexWithStream.

}

JNIEXPORT jobjectArray JNICALL Java_org_opensearch_knn_jni_NmslibService_queryIndex(JNIEnv * env, jclass cls,
JNIEXPORT jlong JNICALL Java_org_opensearch_knn_jni_NmslibService_loadIndexWithStream(JNIEnv *env,
Contributor Author:

This is the new method added in this PR

@@ -182,7 +157,7 @@ class TrainingLoadStrategy
NativeMemoryLoadStrategy<NativeMemoryAllocation.TrainingDataAllocation, NativeMemoryEntryContext.TrainingDataEntryContext>,
Closeable {

private static TrainingLoadStrategy INSTANCE;
private static volatile TrainingLoadStrategy INSTANCE;
Contributor Author:

We need to make it volatile in the singleton pattern to avoid possible instruction reordering.
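
A quick illustration of the reordering hazard (not code from this repository; the field under discussion is Java, where volatile plays the role the atomic pointer plays below). In double-checked locking, the instance must be published only after construction completes:

#include <atomic>
#include <mutex>

// Illustrative C++ analogue of a double-checked-locking singleton.
// Without an atomic (or, in Java, volatile) instance field, the
// "publish pointer" store could be reordered before the object's fields
// are fully written, letting another thread see a half-constructed instance.
class LoadStrategyLike {
 public:
  static LoadStrategyLike& getInstance() {
    LoadStrategyLike* instance = instance_.load(std::memory_order_acquire);
    if (instance == nullptr) {                      // first check, lock-free
      std::lock_guard<std::mutex> lock(mutex_);
      instance = instance_.load(std::memory_order_relaxed);
      if (instance == nullptr) {                    // second check, under the lock
        instance = new LoadStrategyLike();
        instance_.store(instance, std::memory_order_release);  // publish after construction
      }
    }
    return *instance;
  }

 private:
  LoadStrategyLike() = default;
  static std::atomic<LoadStrategyLike*> instance_;
  static std::mutex mutex_;
};

std::atomic<LoadStrategyLike*> LoadStrategyLike::instance_{nullptr};
std::mutex LoadStrategyLike::mutex_;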

Collaborator:

Let's add this as a Javadoc.

Contributor Author:

will do in the next update

@@ -48,7 +48,7 @@ private void tripCb() throws Exception {
createKnnIndex(indexName2, settings, createKnnIndexMapping(FIELD_NAME, 2));

Float[] vector = { 1.3f, 2.2f };
int docsInIndex = 5; // through testing, 7 is minimum number of docs to trip circuit breaker at 1kb
int docsInIndex = 7; // through testing, 7 is minimum number of docs to trip circuit breaker at 1kb
Contributor Author:

It's weird though: when this was set to 5, the total size was slightly less than 1KB, which made this test fail.

Member

@jmazanec15 jmazanec15 left a comment

I think overall it looks good. A couple of comments:

}

protected:
std::streamsize xsgetn(std::streambuf::char_type *destination, std::streamsize count) final {
Member:

What does this name mean? Also, why is this access protected?

Contributor Author:

Please refer to the official doc: https://en.cppreference.com/w/cpp/io/basic_streambuf/sgetn
In short, xsgetn is a protected virtual method defined in std::basic_streambuf; it is called whenever sgetn is called, delegating the read to the implementation.

x -> extended
s -> sequence

xs + getn = extended sequence getter function.
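
For reference, a minimal sketch of the idea (assumed names, not the exact class added in this PR): a std::streambuf subclass whose xsgetn override forwards bulk reads to a caller-supplied callback, which in the k-NN case would be the function that reads from Lucene's IndexInput over JNI. Only sgetn is serviced here; single-character reads would still need underflow.

#include <functional>
#include <streambuf>
#include <utility>

// Minimal sketch: bulk reads requested via sgetn() are delegated to readFn_.
class CallbackStreamBuf final : public std::streambuf {
 public:
  using ReadFn = std::function<std::streamsize(char*, std::streamsize)>;

  explicit CallbackStreamBuf(ReadFn readFn) : readFn_(std::move(readFn)) {}

 protected:
  // sgetn(dest, count) on the public interface ends up calling this hook.
  std::streamsize xsgetn(char_type* destination, std::streamsize count) final {
    return readFn_(destination, count);
  }

 private:
  ReadFn readFn_;
};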

Member:

Oh this is really cool

Collaborator:

Please add this explanation as a comment on the function. It will be hard to remember in the future.

@0ctopus13prime
Contributor Author

Memory monitoring results comparison

From a memory standpoint, there were no major changes that I could observe in the benchmark.

Baseline

nmslib_baseline

Candidate

nmslib-candidate-1

@0ctopus13prime
Contributor Author

Loading Time Comparison

The numbers below were measured through time curl -X GET http://localhost:9200/_plugins/_knn/warmup/target_index.
I ran two experiments loading an NMSLIB vector index with different buffer sizes:

  1. After dropping all file cache from memory (sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches').
  2. With the file cache in memory.

Observation

Unlike FAISS, it took almost 81% more time to load a file that was already in the system cache.
Of course, this case will be rare, as k-NN is expected to load a vector index whenever a new segment file is baked.
And the newly baked segment file is likely not system cached.
Increasing the buffer size didn't help. We need to find a better way to transfer data from JNI to Java.

Experiment

Index size : 30G

1. Baseline (using fread)

  1. After dropping caches : 50.448 seconds
  2. With cache : 4.144 seconds

2. Using a stream (4KB buffer)

  1. After dropping caches : 52.779 seconds
  2. With cache : 7.503 seconds
  3. With cache, 64KB buffer : 6.919 seconds
  4. With cache, 1MB buffer : 6.99 seconds 🤔

@0ctopus13prime
Contributor Author

Performance Benchmark

Machine : c5ad.12xlarge
JVM Args : -Xms63G -Xmx63G
Data : random-s-128-10m-euclidean.hdf5

Metric Task Baseline-Value Candidate-Value Change Unit
Cumulative indexing time of primary shards 33.7147 34.2115 1.47% min
Min cumulative indexing time across primary shards 0.000133333 0.00015 12.50% min
Median cumulative indexing time across primary shards 16.8573 17.1057 1.47% min
Max cumulative indexing time across primary shards 33.7146 34.2113 1.47% min
Cumulative indexing throttle time of primary shards 0 0 0.00% min
Min cumulative indexing throttle time across primary shards 0 0 0.00% min
Median cumulative indexing throttle time across primary shards 0 0 0.00% min
Max cumulative indexing throttle time across primary shards 0 0 0.00% min
Cumulative merge time of primary shards 282.601 282.996 0.14% min
Cumulative merge count of primary shards 125 122 2.40%
Min cumulative merge time across primary shards 0 0 0.00% min
Median cumulative merge time across primary shards 141.3 141.498 0.14% min
Max cumulative merge time across primary shards 282.601 282.996 0.14% min
Cumulative merge throttle time of primary shards 1.04818 1.61307 53.89% min
Min cumulative merge throttle time across primary shards 0 0 0.00% min
Median cumulative merge throttle time across primary shards 0.524092 0.806533 53.89% min
Max cumulative merge throttle time across primary shards 1.04818 1.61307 53.89% min
Cumulative refresh time of primary shards 1.1042 1.14667 3.85% min
Cumulative refresh count of primary shards 88 85 3.41%
Min cumulative refresh time across primary shards 0.000333333 0.000383333 15.00% min
Median cumulative refresh time across primary shards 0.5521 0.573333 3.85% min
Max cumulative refresh time across primary shards 1.10387 1.14628 3.84% min
Cumulative flush time of primary shards 11.5126 10.9446 4.93% min
Cumulative flush count of primary shards 56 53 5.36%
Min cumulative flush time across primary shards 0 0 0.00% min
Median cumulative flush time across primary shards 5.75628 5.47229 4.93% min
Max cumulative flush time across primary shards 11.5126 10.9446 4.93% min
Total Young Gen GC time 0.338 0.342 1.18% s
Total Young Gen GC count 19 19 0.00%
Total Old Gen GC time 0 0 0.00% s
Total Old Gen GC count 0 0 0.00%
Store size 29.8586 29.8584 0.00% GB
Translog size 5.83E-07 5.83E-07 0.00% GB
Heap used for segments 0 0 0.00% MB
Heap used for doc values 0 0 0.00% MB
Heap used for terms 0 0 0.00% MB
Heap used for norms 0 0 0.00% MB
Heap used for points 0 0 0.00% MB
Heap used for stored fields 0 0 0.00% MB
Segment count 2 2 0.00%
Min Throughput custom-vector-bulk 5390.31 5466.23 1.41% docs/s
Mean Throughput custom-vector-bulk 11041.1 10866 1.59% docs/s
Median Throughput custom-vector-bulk 10377.9 10065.6 3.01% docs/s
Max Throughput custom-vector-bulk 20105.1 19337.8 3.82% docs/s
50th percentile latency custom-vector-bulk 78.4349 75.8132 3.34% ms
90th percentile latency custom-vector-bulk 165.667 158.129 4.55% ms
99th percentile latency custom-vector-bulk 331.043 318.269 3.86% ms
99.9th percentile latency custom-vector-bulk 1486.47 1487.08 0.04% ms
99.99th percentile latency custom-vector-bulk 2300.48 2598.04 12.93% ms
100th percentile latency custom-vector-bulk 5049.72 4535.08 10.19% ms
50th percentile service time custom-vector-bulk 78.4349 75.8132 3.34% ms
90th percentile service time custom-vector-bulk 165.667 158.129 4.55% ms
99th percentile service time custom-vector-bulk 331.043 318.269 3.86% ms
99.9th percentile service time custom-vector-bulk 1486.47 1487.08 0.04% ms
99.99th percentile service time custom-vector-bulk 2300.48 2598.04 12.93% ms
100th percentile service time custom-vector-bulk 5049.72 4535.08 10.19% ms
error rate custom-vector-bulk 0 0 0.00% %
Min Throughput force-merge-segments 0 0 0.00% ops/s
Mean Throughput force-merge-segments 0 0 0.00% ops/s
Median Throughput force-merge-segments 0 0 0.00% ops/s
Max Throughput force-merge-segments 0 0 0.00% ops/s
100th percentile latency force-merge-segments 1.16E+07 1.13E+07 2.84% ms
100th percentile service time force-merge-segments 1.16E+07 1.13E+07 2.84% ms
error rate force-merge-segments 0 0 0.00% %
Min Throughput warmup-indices 0.24 0.14 41.67% ops/s
Mean Throughput warmup-indices 0.24 0.14 41.67% ops/s
Median Throughput warmup-indices 0.24 0.14 41.67% ops/s
Max Throughput warmup-indices 0.24 0.14 41.67% ops/s
100th percentile latency warmup-indices 4162.87 7127.78 71.22% ms
100th percentile service time warmup-indices 4162.87 7127.78 71.22% ms
error rate warmup-indices 0 0 0.00% %
Min Throughput prod-queries 0.66 0.64 3.03% ops/s
Mean Throughput prod-queries 0.66 0.64 3.03% ops/s
Median Throughput prod-queries 0.66 0.64 3.03% ops/s
Max Throughput prod-queries 0.66 0.64 3.03% ops/s
50th percentile latency prod-queries 3.5832 3.83349 6.99% ms
90th percentile latency prod-queries 4.75317 4.64172 2.34% ms
99th percentile latency prod-queries 22.1628 23.8439 7.59% ms
100th percentile latency prod-queries 1508.36 1571.86 4.21% ms
50th percentile service time prod-queries 3.5832 3.83349 6.99% ms
90th percentile service time prod-queries 4.75317 4.64172 2.34% ms
99th percentile service time prod-queries 22.1628 23.8439 7.59% ms
100th percentile service time prod-queries 1508.36 1571.86 4.21% ms
error rate prod-queries 0 0 0.00% %
Mean recall@k prod-queries 0.42 0.43 2.38%
Mean recall@1 prod-queries 0.6 0.63 5.00%

@navneet1v
Collaborator

This PR is the first commit making the loading layer in native engines available.

You might want to update the description to say "loading layer for NMSLIB".

@0ctopus13prime
Contributor Author

Will hold off merging until we root-cause the big gap in warm-up time.
Compared to FAISS, an 84% increase is a bit worrisome.

Collaborator

@navneet1v navneet1v left a comment

Code looks good to me. A few minor comments. For me, the biggest blocker to approving the code right now is the increase in warm-up latency. Once we fix that, I will approve.

Comment on lines +331 to -332
writeLock();
try {
writeLock();
Collaborator:

any reason for moving this out?

Contributor Author:

If an exception were thrown there, the finally block would make an unmatched unlock call.
I know the current implementation doesn't throw any exceptions; I was just trying to make the code robust no matter how other parts change.

}

protected:
std::streamsize xsgetn(std::streambuf::char_type *destination, std::streamsize count) final {
Collaborator:

Please add this explanation as a comment on the function. It will be hard to remember in the future.

return (jlong) indexWrapper;
}

jlong knn_jni::nmslib_wrapper::LoadIndexWithStream(knn_jni::JNIUtilInterface *jniUtil,
Collaborator:

The other load function will be removed in the next iterations, right?

@@ -182,7 +157,7 @@ class TrainingLoadStrategy
NativeMemoryLoadStrategy<NativeMemoryAllocation.TrainingDataAllocation, NativeMemoryEntryContext.TrainingDataEntryContext>,
Closeable {

private static TrainingLoadStrategy INSTANCE;
private static volatile TrainingLoadStrategy INSTANCE;
Collaborator:

Let's add this as a Javadoc.

@0ctopus13prime
Contributor Author

0ctopus13prime commented Oct 9, 2024

Streaming Flamegraph

Screenshot 2024-10-08 at 5 44 14 PM

@0ctopus13prime
Contributor Author

0ctopus13prime commented Oct 9, 2024

Performance tuning plan

Planning to continue with the two tuning ideas below.
I expect them to cut about 23.16% of the latency.
I hardly think we can make the other parts (e.g. JNIEnv_::CallIntMethod and IndexInput) much faster.
It would also be worth trying a bigger buffer size and seeing how it goes.

  1. __memmove_avx_unaligned_erms :
    This indicates that the buffer memory is not properly aligned for the internal memcpy.
    We can allocate a 64-byte-aligned buffer and retry. 64-byte alignment works for both AVX2 and AVX-512 (see the aligned-allocation sketch after the snippets below).

  2. Remove the critical JNI calls (JNIEnv_::CallIntMethod, jni_ReleasePrimitiveArrayCritical) entirely.
    We can have the Java side own native memory via a direct ByteBuffer, then acquire the pointer once in JNI with GetDirectBufferAddress.

TODO : Can we allocate an aligned memory layout?

In Java

ByteBuffer nativeBuffer = ByteBuffer.allocateDirect(size);
In C++

// Get the pointer to the native memory from the buffer once, at the beginning.
void *nativePtr = env->GetDirectBufferAddress(buffer);
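
For the aligned-buffer idea from item 1 above: as far as I know, ByteBuffer.allocateDirect does not guarantee 64-byte alignment, but on the native side a 64-byte-aligned staging buffer can be obtained with std::aligned_alloc (C++17). A hedged sketch, not what this PR does:

#include <cstddef>
#include <cstdlib>
#include <memory>

// Deleter so the aligned allocation is released with std::free.
struct AlignedFree {
  void operator()(void* p) const { std::free(p); }
};

// Returns a 64-byte-aligned buffer; 64-byte alignment satisfies both
// AVX2 (32-byte) and AVX-512 (64-byte) loads/stores.
inline std::unique_ptr<char, AlignedFree> makeAlignedBuffer(std::size_t size) {
  constexpr std::size_t alignment = 64;
  // std::aligned_alloc expects the size to be a multiple of the alignment.
  const std::size_t rounded = (size + alignment - 1) / alignment * alignment;
  return std::unique_ptr<char, AlignedFree>(
      static_cast<char*>(std::aligned_alloc(alignment, rounded)));
}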

@jmazanec15
Member

And the newly baked segment file likely is not system cached.

Won't the page cache typically be write-through? In that case, if the graph is created and written on the same node it is searched on, won't it be cached?

@0ctopus13prime
Contributor Author

Baseline flamegraph

Screenshot 2024-10-08 at 7 54 06 PM

@0ctopus13prime
Contributor Author

1. NMSLIB Loading Perf Issue Analysis

2. Performance Degradation In FAISS

After switching from direct file API usage to an abstract IO loading layer, additional overhead was introduced due to JNI calls and buffer copying via std::memcpy. This change resulted in a 30% increase in loading time compared to the baseline in FAISS. The baseline took 3.584 seconds to load a 6GB vector index, while the modified version increased the load time to 4.664 seconds.

In NMSLIB, we expected a similar level of performance regression as seen in FAISS. However, we're observing a 70% increase in load time when loading a 6GB vector index. (baseline=4.144 sec, the modified one=7.503 sec)
Why is the performance impact in NMSLIB more than twice as severe as in FAISS?

3. Why is it more than twice as severe as in FAISS?

The key performance difference in index loading between FAISS and NMSLIB stems from their file formats.
In NMSLIB, this difference results in JNI calls being made O(N) times, where N is the number of vectors, whereas in FAISS, the number of JNI calls is O(1).

FAISS stores chunks of the neighbor list in a single location and loads them all at once. See the code below:

static void read_HNSW(HNSW* hnsw, IOReader* f) {
    READVECTOR(hnsw->assign_probas);
    READVECTOR(hnsw->cum_nneighbor_per_level);
    READVECTOR(hnsw->levels);
    READVECTOR(hnsw->offsets);
    READVECTOR(hnsw->neighbors);

    READ1(hnsw->entry_point);
    READ1(hnsw->max_level);
    READ1(hnsw->efConstruction);
    READ1(hnsw->efSearch);
    READ1(hnsw->upper_beam);
}

In NMSLIB, each neighbor list is stored individually, requiring O(N) reads, where N is the total number of vectors.
As shown in the code below, we need totalElementsStored_ read operations.
Note that, thanks to the introduced loading layer, input.read() ultimately makes a JNI call that delegates to Lucene's IndexInput to read bytes. As a result, the number of input.read() calls directly corresponds to the number of JNI calls.

for (size_t i = 0; i < totalElementsStored_; i++) {
   ...
    } else {
        linkLists_[i] = (char *)malloc(linkListSize);
        CHECK(linkLists_[i]);
        input.read(linkLists_[i], linkListSize); <--------- THIS!
    }
    data_rearranged_[i] = new Object(data_level0_memory_ + (i)*memoryPerObject_ + offsetData_);
}

4. Solution 1. Patch in NMSLIB

We can patch NMSLIB to avoid making JNI calls for each vector element. The idea is to load data in bulk, then parse the neighbor lists from that buffer, rather than reading bytes individually. This approach would reduce the number of JNI calls to O(Index size / Buffer size).

For example, with a 6GB vector index containing 1 million vectors and a 64KB buffer size, the required JNI calls would be reduced to O(6GB / 64KB) = 98,304, which is a significant improvement over 1 million calls, achieving nearly a 90% reduction in operations.

Result: Surprisingly, it is 8% faster than the baseline. (Note: I reindexed on a new single node, which is why the loading time differs from the one mentioned earlier in the issue.)

  1. Baseline : 4.538 sec
  2. Modified version with 64KB buffer : 4.19 sec

4.1 Pros

  1. No performance degradation. If anything, it is even faster than the baseline.
  2. We can maintain a unified set of loading APIs for both NMSLIB and FAISS.

4.2 Cons

  1. A medium-sized patch is required in NMSLIB. This may increase the code-maintenance burden.

4.3. Patch in hnsw.cc

template <typename dist_t>
void Hnsw<dist_t>::LoadOptimizedIndex(NmslibIOReader& input) {
    ...

    const size_t bufferSize = 64 * 1024;  // 64KB
    std::unique_ptr<char[]> buffer (new char[bufferSize]);
    uint32_t end = 0;
    uint32_t pos = 0;
    const bool isLTE = _isLittleEndian();
    
    for (size_t i = 0, remainingBytes = input.remaining(); i < totalElementsStored_; i++) {
        // Read linkList size integer.
        if ((pos + sizeof(SIZEMASS_TYPE)) >= end) {
            // Underflow, load bytes in bulk.
            const auto firstPartLen = end - pos;
            if (firstPartLen > 0) {
                std::memcpy(buffer.get(), buffer.get() + pos, firstPartLen);
            }
            const auto copyBytes = std::min(remainingBytes, bufferSize - firstPartLen);
            input.read(buffer.get() + firstPartLen, copyBytes);
            remainingBytes -= copyBytes;
            end = copyBytes + firstPartLen;
            pos = 0;
        }
    
        // Read data size. SIZEMASS_TYPE -> uint32_t
        SIZEMASS_TYPE linkListSize = 0;
        if (isLTE) {
            linkListSize = _readIntLittleEndian(buffer[pos], buffer[pos + 1], buffer[pos + 2], buffer[pos + 3]);
        } else {
            linkListSize = _readIntBigEndian(buffer[pos], buffer[pos + 1], buffer[pos + 2], buffer[pos + 3]);
        }
        pos += 4;
    
        if (linkListSize == 0) {
            linkLists_[i] = nullptr;
        } else {
            // Now we load neighbor list.
            linkLists_[i] = (char *) malloc(linkListSize);
            CHECK(linkLists_[i]);
    
            SIZEMASS_TYPE leftLinkListData = linkListSize;
            auto dataPtr = linkLists_[i];
            while (leftLinkListData > 0) {
                if (pos >= end) {
                    // Underflow, load bytes in bulk.
                    const auto copyBytes = std::min(remainingBytes, bufferSize);
                    input.read(buffer.get(), copyBytes);
                    remainingBytes -= copyBytes;
                    end = copyBytes;
                    pos = 0;
                }
        
                const auto copyBytes = std::min(leftLinkListData, end - pos);
                std::memcpy(dataPtr, buffer.get() + pos, copyBytes);
                dataPtr += copyBytes;
                leftLinkListData -= copyBytes;
                pos += copyBytes;
            }  // End while
        }  // End if
    
        data_rearranged_[i] = new Object(data_level0_memory_ + (i)*memoryPerObject_ + offsetData_);
    }  // End for
                
...            
              

5. Solution 2. Disable Streaming When FSDirectory

Since we're deprecating NMSLIB in version 3.x, we can disable the loading layer in NMSLIB until then.
Or, we can selectively allow streaming in NMSLIB depending on whether the given Directory is an FSDirectory implementation.

if (directory instanceof FSDirectory) {
  loadIndexByFilePath(...);
} else {
  loadIndexByStreaming(...);
}

5.1. Pros :

  1. Simple.

5.2. Cons :

  1. Until 3.x, we need to maintain duplicated, near-identical versions of the APIs in both Java and JNI.

6. Solution 3. Live with it :)

Since we're deprecating NMSLIB in version 3.x, we can tolerate this issue in the short term.
However, I personally don't favor this approach, as it impacts the p99 latency metrics; such cases are rare but could still affect overall cluster performance in the worst case.

7. Micro Tuning Results

  1. CallNonvirtualIntMethod → No impact.
  2. AVX2 intrinsic copy → No impact.
  3. Using a native ByteBuffer + one additional byte copy → made it worse.
  4. Increasing the buffer size → going from 4KB to 64KB at least reduced the warm-up time by 0.8 seconds in NMSLIB.

@0ctopus13prime
Contributor Author

@navneet1v @jmazanec15
Could you share your thoughts on the above analysis?
Thanks

@0ctopus13prime
Contributor Author

0ctopus13prime commented Oct 10, 2024

And the newly baked segment file likely is not system cached.

Wont page cache typically be write through? In which case, if graph is created and written on same node it is searched on, wont it be cached?

Sorry, I just saw it.

Yes, it is configured that way by default in pretty much every common file system.
But the default ratio is 10% of memory, which means the write-back cache size is bounded by 10% of the physical memory.

The reasons I assumed the write-back cached pages would 'likely' no longer exist are twofold:

  1. In the normal case, writing and reading happen simultaneously. Since Linux uses LRU page eviction, write-back cached pages are very likely to be evicted soon.
  2. On top of point 1, NRT in Lucene periodically exposes a newly baked segment. This typically takes a few seconds, so I assumed most write-back cached pages are evicted in the meantime.

Please share your thoughts on it! You can call me an aggressive dreamer. 😛

@navneet1v
Collaborator

@0ctopus13prime the approach of patching NMSLIB looks good to me. If it provides good latency, I think we should do that: the pros of having the patch are an improved load time and getting away from the FSDirectory dependency.

Collaborator

@navneet1v navneet1v left a comment

Approving. Please answer some of the clarifying questions.

+
+ template <typename dist_t>
+ void Hnsw<dist_t>::LoadIndexWithStream(NmslibIOReader& input) {
+ LOG(LIB_INFO) << "Loading index from an input stream(NmslibIOReader).";
Collaborator:

[Question]:
Is this info log enabled by default? I'm asking since I cannot see all of the NMSLIB code in this PR.

Contributor Author:

It's disabled by default. (LIB_LOGNONE)

In nmslib_wrapper

void knn_jni::nmslib_wrapper::InitLibrary() {
  similarity::initLibrary();
}

and in NMSLIB

void initLibrary(int seed = 0, LogChoice choice = LIB_LOGNONE, const char*pLogFile = NULL);

+ }
+
+ template <typename dist_t>
+ void Hnsw<dist_t>::LoadIndexWithStream(NmslibIOReader& input) {
Collaborator:

[Question]:
In this function, what extra step are we doing to ensure that we are able to fetch more data from NmslibIOReader? I can see some places where we might be doing it, but it would be good if you could point out exactly which lines improved the loading logic, so that part becomes easier to review.

Contributor Author:

I added a remainingBytes method that returns the number of unread bytes in IndexInputWithBuffer.

private long remainingBytes() {
    return contentLength - indexInput.getFilePointer();
}

And in NMSLIB, within the void Hnsw<dist_t>::LoadOptimizedIndex(NmslibIOReader& input) method, we keep track of the safe amount of bytes left to read.

        for (size_t i = 0, remainingBytes = input.remainingBytes(); i < totalElementsStored_; i++) {
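
For clarity, based only on how the patch calls it (input.read(...) above and input.remaining()/remainingBytes() in the snippets), the reader abstraction presumably looks roughly like this; a hedged reconstruction with assumed names, not the PR's exact header:

#include <cstddef>

// Inferred shape of the reader handed to LoadOptimizedIndex. In the k-NN
// implementation, read() would delegate to Lucene's IndexInput through JNI,
// and remaining() would report how many bytes of the index are left unread.
class NmslibIOReader {
 public:
  virtual ~NmslibIOReader() = default;

  // Copy `size` bytes from the underlying index stream into `buffer`.
  virtual void read(char* buffer, std::size_t size) = 0;

  // Number of bytes not yet consumed from the underlying index file.
  virtual std::size_t remaining() const = 0;
};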

Contributor

@Vikasht34 Vikasht34 left a comment

Hi, I am reviewing this PR. Please wait till EOD to merge it.

Collaborator

@shatejas shatejas left a comment

Looks good to me

@navneet1v
Collaborator

navneet1v commented Oct 15, 2024

Merging the PR, as we have 2 approvals and the author is requesting the merge.

@navneet1v navneet1v merged commit 7cf45c8 into opensearch-project:main Oct 15, 2024
30 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 15, 2024
* Introduce a loading layer in NMSLIB.

Signed-off-by: Dooyong Kim <[email protected]>

* Added NMSLIB istream implementation.

Signed-off-by: Dooyong Kim <[email protected]>

* Fix integer overflow issue when passing read size for loading NMSLIB vector index.

Signed-off-by: Dooyong Kim <[email protected]>

* Added unit test for NMSLIB loading layer.

Signed-off-by: Dooyong Kim <[email protected]>

* Made a patch in NMSLIB to avoid frequently calling JNI for better loading index performance.

Signed-off-by: Dooyong Kim <[email protected]>

* Compliance constexpr function in C++11 having nullstatement.

Signed-off-by: Dooyong Kim <[email protected]>

---------

Signed-off-by: Dooyong Kim <[email protected]>
Co-authored-by: Dooyong Kim <[email protected]>
(cherry picked from commit 7cf45c8)
@0ctopus13prime 0ctopus13prime deleted the nmslib-loading-layer branch October 15, 2024 19:15
0ctopus13prime pushed a commit to 0ctopus13prime/k-NN that referenced this pull request Oct 15, 2024
* Introduce a loading layer in NMSLIB.

Signed-off-by: Dooyong Kim <[email protected]>

* Added NMSLIB istream implementation.

Signed-off-by: Dooyong Kim <[email protected]>

* Fix integer overflow issue when passing read size for loading NMSLIB vector index.

Signed-off-by: Dooyong Kim <[email protected]>

* Added unit test for NMSLIB loading layer.

Signed-off-by: Dooyong Kim <[email protected]>

* Made a patch in NMSLIB to avoid frequently calling JNI for better loading index performance.

Signed-off-by: Dooyong Kim <[email protected]>

* Compliance constexpr function in C++11 having nullstatement.

Signed-off-by: Dooyong Kim <[email protected]>
0ctopus13prime pushed a commit to 0ctopus13prime/k-NN that referenced this pull request Oct 18, 2024
* Introduce a loading layer in NMSLIB.

Signed-off-by: Dooyong Kim <[email protected]>

* Added NMSLIB istream implementation.

Signed-off-by: Dooyong Kim <[email protected]>

* Fix integer overflow issue when passing read size for loading NMSLIB vector index.

Signed-off-by: Dooyong Kim <[email protected]>

* Added unit test for NMSLIB loading layer.

Signed-off-by: Dooyong Kim <[email protected]>

* Made a patch in NMSLIB to avoid frequently calling JNI for better loading index performance.

Signed-off-by: Dooyong Kim <[email protected]>

* Compliance constexpr function in C++11 having nullstatement.

Signed-off-by: Dooyong Kim <[email protected]>

---------

Signed-off-by: Dooyong Kim <[email protected]>
Co-authored-by: Dooyong Kim <[email protected]>
shatejas pushed a commit that referenced this pull request Oct 21, 2024
* Introduce a loading layer in NMSLIB. (#2185)

* Introduce a loading layer in NMSLIB.

Signed-off-by: Dooyong Kim <[email protected]>

* Added NMSLIB istream implementation.

Signed-off-by: Dooyong Kim <[email protected]>

* Fix integer overflow issue when passing read size for loading NMSLIB vector index.

Signed-off-by: Dooyong Kim <[email protected]>

* Added unit test for NMSLIB loading layer.

Signed-off-by: Dooyong Kim <[email protected]>

* Made a patch in NMSLIB to avoid frequently calling JNI for better loading index performance.

Signed-off-by: Dooyong Kim <[email protected]>

* Compliance constexpr function in C++11 having nullstatement.

Signed-off-by: Dooyong Kim <[email protected]>

---------

Signed-off-by: Dooyong Kim <[email protected]>
Co-authored-by: Dooyong Kim <[email protected]>

* Fixed that it's failing to resolve a package in import statement.

Signed-off-by: Dooyong Kim <[email protected]>

* Move the element in the changelog from 3.x to 2.x.

Signed-off-by: Dooyong Kim <[email protected]>

---------

Signed-off-by: Dooyong Kim <[email protected]>
Co-authored-by: Dooyong Kim <[email protected]>