[Remote Vector Index Build] Introduce RemoteIndexClient skeleton and Build Request construction #2560

owenhalpert · 2025-02-25T21:13:48Z

Description

First PR for #2518 (Client Skeleton, Build Request)

This PR introduces the initial implementation of the Remote Index client.
In this PR, I have implemented:

Client interface + HTTP implementation skeleton, showing overview of the client creation, code flow, etc
- Updated code flow in RemoteIndexBuildStrategy based on finalized method parameters/return types
Programmatic build request construction
All client-related settings
Extendable RemoteBuildRequest interface and class
Client retry strategy
Unit tests

In the next PR's, I will implement:

Polling mechanism
Metric collection
Complete unit testing

Related Issues

#2391 Meta Issue
#2393 HLD by @jed326
#2518 My LLD

Check List

New functionality includes testing.
~~- [ ] New functionality has been documented.~~
~~- [ ] API changes companion pull request created.~~
Commits are signed per the DCO using --signoff.
~~- [ ] Public documentation issue/PR created.~~

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

src/main/java/org/opensearch/knn/index/KNNSettings.java

src/main/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategy.java

src/main/java/org/opensearch/knn/index/remote/RemoteIndexClient.java

src/main/java/org/opensearch/knn/index/remote/RemoteIndexClientRetryStrategy.java

src/main/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategy.java

...st/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategyTests.java

src/main/java/org/opensearch/knn/index/KNNSettings.java

jed326

Thanks @owenhalpert , the approach here LGTM, think we're just missing some tests related to submitVectorBuild and the settings.

src/main/java/org/opensearch/knn/index/KNNSettings.java

src/main/java/org/opensearch/knn/index/remote/RemoteIndexClient.java

src/main/java/org/opensearch/knn/index/remote/RemoteIndexHTTPClient.java

src/test/java/org/opensearch/knn/index/remote/RemoteIndexClientTests.java

src/main/java/org/opensearch/knn/index/KNNSettings.java

src/main/java/org/opensearch/knn/index/remote/RemoteIndexClient.java

src/main/java/org/opensearch/knn/index/remote/RemoteIndexHTTPClient.java

src/main/java/org/opensearch/knn/index/KNNSettings.java

src/main/java/org/opensearch/knn/index/remote/RemoteIndexClient.java

src/main/java/org/opensearch/knn/index/remote/RemoteIndexHTTPClient.java

src/main/java/org/opensearch/knn/plugin/KNNPlugin.java

src/main/java/org/opensearch/knn/index/remote/RemoteBuildRequest.java

src/main/java/org/opensearch/knn/index/remote/RemoteIndexClientRetryStrategy.java

src/main/java/org/opensearch/knn/index/remote/RemoteIndexHTTPClient.java

src/main/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategy.java

src/main/java/org/opensearch/knn/index/remote/RemoteIndexClient.java

src/main/java/org/opensearch/knn/index/remote/RemoteIndexHTTPClient.java

jed326 · 2025-02-28T07:55:29Z

src/main/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategy.java

+    private RemoteIndexClient getRemoteIndexClient() {
+        String endpoint = KNNSettings.state().getSettingValue(KNN_REMOTE_BUILD_SERVICE_ENDPOINT_SETTING.getKey());
+        if (endpoint == null || endpoint.isEmpty()) {
+            throw new IllegalArgumentException("No endpoint set for RemoteIndexClient");
+        }
+        return RemoteIndexHTTPClient.getInstance();


This doesn't look quite right, we're just evaluating the endpoint without using it. Should RemoteIndexHTTPClient actually be Singleton? Or just the underlying apache http client?

I think the idea here was to use the presence of an endpoint to detect that the HTTPClient should be used, and for this first iteration verify the endpoint isn't empty. I don't immediately see how that connects to your second point, but that I don't have too strong of an opinion about. From a configuration standpoint it may make sense to have a Singleton client that has configuration controls over the behavior/usage of an HttpClient which should be consistent across all client(s) in the cluster. I don't see an immediate benefit in creating new RemoteIndexClient's per build that share an underlying HttpClient, unless we run into issues with concurrency.

I think there are a few overlapping ideas here:

I agree that the apache client should be singleton, but I don't see the benefit of having RemoteIndexHTTPClient as singleton as it doesn't have any configurations related to the apache client anyways

If RemoteIndexHTTPClient was not Singleton, then we could instantiate it per flush/merge, and then pass the endpoint to it. This same endpoint would then be "cached" for the operations

We can still use the presence of the endpoint to determine if the httpclient should be used, but we should try to only access the setting once to mitigate the case where it changes during the operation.

I really like the idea of client-per-build for the endpoint logic. I'll make that change. Should I have the build strategy hold the client reference and set it once upon checking the endpoint? this.client = new RemoteIndexHTTPAccessor(endpoint) Or can I just have the factory method return a new HTTPAccessor with no other logic for this first iteration

I guess this means the request object doesn't need to hold the endpoint anymore — it makes sense to have the Accessor keep track of the endpoint and the request is just to be sent to any endpoint.

src/main/java/org/opensearch/knn/index/engine/lucene/Lucene.java

src/main/java/org/opensearch/knn/index/remote/RemoteBuildRequest.java

jed326 · 2025-02-28T08:03:18Z

src/main/java/org/opensearch/knn/index/remote/RemoteIndexClient.java

+/**
+ * Interface which dictates how we interact with a remote index build service.
+ */
+public interface RemoteIndexClient {


I still don't really like calling this interface client. I think something like RemoteBuildProcessor or RemoteBuildServiceAccessor is better. Then we would have a HTTPClientRemoteBuildServiceAccessor implementation, and that implementation could be non-singleton, only the apache client itself would be singleton.

I will keep this open for other comments, but I disagree on the naming side. I don't have a strong opinion on the Singleton Client vs Singleton ApacheClient, but I prefer naming this component a Remote Index Client. The purpose of a client is indeed just to make requests and receive responses, but there are levels of processing/abstraction at every level, even with the underlying Apache client. It may not be perfect with the HTTP implementation because we are working with an underlying protocol that also has a concept of the same name "client", but for any other implementation, this class being the class that submits a build and awaits a response perfectly fits what I would describe as a "client" component. Yes, the HTTP implementation is adding extra methods and logic to make that simple communication work for its chosen protocol, but I would argue that the "Accessor" in this case is the RemoteIndexBuildStrategy that is calling on some component (a client) to submit a build request and wait for its response.

but I would argue that the "Accessor" in this case is the RemoteIndexBuildStrategy that is calling on some component (a client) to submit a build request and wait for its response.

The point is that without an accessor layer, either the RemoteIndexBuildStrategy or the RemoteIndexClient is being overloaded with functionality. The term "client" should refer to specifically to the component that is making requests from/to the remote index build service, and the client functionality should be isolated to such. At a high level:

NativeIndexBuildStrategy (RemoteIndexBuildStrategy) -- This is responsible for coordinating the build / write of the index during flush/merge

VectorRepositoryAccessor -- This is responsible for coordinating interactions with the vector repository

BlobContainer -- This is responsible for reading/writing to the repository

RemoteBuildServiceAccessor -- This is responsible for coordinating interactions with the remote index build service

Apache http client -- This is responsible for making requests to the remote index build service

I also think not calling this layer "client" addresses the concerns about the client interface including constructBuildRequest

In my opinion , It has to be RemoteIndexClient and not the accessor , Ideal Situation is whenever we are buiulding service , that service gives the client that represents that contacts for that service , but here are doing in opposite way. Let's take example for s3Client uses underhood HTTPS and java 11+ HttpClient or Apache HttpClient (depending on SDK version), but naming is still S3Client not an accessor , and generally we create accessorr by wrapping s3Client . So taking same analogy we have remoteIndexBuildService and there has to be someOne Representing that service and that can not be HTTP or APache becuase they are transport menchanism not the representative.

Add RemoteIndexClient initial implementation, its accompanying dependencies, and Build Request, Retry Strategy, and test files Signed-off-by: owenhalpert <[email protected]> # Conflicts: # CHANGELOG.md # src/main/java/org/opensearch/knn/index/KNNSettings.java # src/main/java/org/opensearch/knn/index/codec/nativeindex/NativeIndexBuildStrategyFactory.java # src/main/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategy.java # src/test/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategyTests.java # Conflicts: # CHANGELOG.md # src/main/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategy.java

Extract interface, remove out of scope methods, move build request to client, refactors Signed-off-by: owenhalpert <[email protected]>

Signed-off-by: owenhalpert <[email protected]>

…parameters, secure setting testing Signed-off-by: owenhalpert <[email protected]>

Signed-off-by: owenhalpert <[email protected]>

Vikasht34 · 2025-02-28T18:12:25Z

src/main/java/org/opensearch/knn/index/remote/RemoteIndexClient.java

+     * @return RemoteBuildRequest to use to submit the build
+     * @throws IOException if there is an error constructing the request
+     */
+    RemoteBuildRequest constructBuildRequest(


Why this has to be part of this inteface?

Each Client implementation will have to construct a build request to pass as a parameter to submitVectorBuild, and the index build strategy expects to be able to call client.constructBuildRequest to trigger the client to build its compatible request. Otherwise we'd have to check something like (if HTTPClient return HTTPBuildRequest)

There are two example of how this can be sorted.

Example 1:- Take an example of S3 client, It does not force user to create a full fleshed request , instead it takes parameters and create those request internally. Example ..PutObject(It takes couple of parameters) but S3 intenally create CreatePutRequest.

Example 2:- The others aws Services Clinet , thet give a builder to create the request and then pass the request using client menthods but the builder is not part of Client.

here we can either take 1 or 2 , but giving building object should not be part of Any Client.

src/main/java/org/opensearch/knn/index/remote/RemoteIndexClientRetryStrategy.java

Vikasht34 · 2025-02-28T18:24:40Z

src/main/java/org/opensearch/knn/index/remote/RemoteBuildRequest.java

+     * @param repositoryMetadata Metadata of the repository containing the index
+     * @param blobName File name generated by the Build Strategy with a UUID
+     */
+    public RemoteBuildRequest(


This is too concentrated , It's complex to extend this abstract class to work with multiple type of repository , Constructors responsibility has been increased , It has to figure out many things to assign the variables.

This is ideal abstraction we can think

package org.opensearch.knn.index.remote; import lombok.Getter; import java.io.IOException; import java.util.Map; /** * Abstract class representing a remote index build request. */ @Getter public abstract class RemoteBuildRequest { protected final String repositoryType; protected final String containerName; protected final String vectorPath; protected final String docIdPath; protected final String tenantId; protected final int dimension; protected final int docCount; protected final String dataType; protected final String engine; protected final Map<String, Object> indexParameters; protected RemoteBuildRequest(RemoteBuildRequestBuilder<?> builder) { this.repositoryType = builder.repositoryType; this.containerName = builder.containerName; this.vectorPath = builder.vectorPath; this.docIdPath = builder.docIdPath; this.tenantId = builder.tenantId; this.dimension = builder.dimension; this.docCount = builder.docCount; this.dataType = builder.dataType; this.engine = builder.engine; this.indexParameters = builder.indexParameters; } /** * Convert the request to JSON. */ public abstract String toJson() throws IOException; }

The idea here was that the build service depends on a concrete set of parameters from k-NN. Regardless of the protocol/client implementation, using the BuildIndexParams to fetch the method name, index settings to get the cluster name, these are things that will need to happen for every build request, so I think it's important to have in the super constructor. The repository point is valid, but it might be the only part of the request that is taking too strong of a dependency on one implementation. For this, we could make a helper method that gets the repository container name (bucket for s3 or location for fs, etc) based on the RepositoryMetadata passed in. As of now since we're only supporting s3 I think this is fine to leave as a fast followup/long term improvement, but importantly I think it's not worth removing the rest of the logic which to me is independent of implementation.

We should be abstract class not dependent of any concrete , here we have repositoryMetadata, RemoteIndexBuildStrategy.

I think it's fine to take the parsing logic out of the constructor, if we want to re-use the parsing logic then we can provide a helper / static method to do so

owenhalpert changed the title ~~Introduce RemoteIndexClient~~ [Remote Vector Index Build] Introduce RemoteIndexClient Feb 25, 2025

owenhalpert force-pushed the vector-client-skeleton branch from 3fb30f9 to 2095bba Compare February 25, 2025 21:16

owenhalpert marked this pull request as ready for review February 25, 2025 21:16

owenhalpert requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, ryanbogan, luyuncheng, shatejas, 0ctopus13prime and Vikasht34 as code owners February 25, 2025 21:16

jed326 mentioned this pull request Feb 25, 2025

[Meta] Remote Vector Index Build Component in OpenSearch Vector Engine #2391

Open

18 tasks

jed326 reviewed Feb 25, 2025

View reviewed changes

src/main/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategy.java Outdated Show resolved Hide resolved

...st/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategyTests.java Outdated Show resolved Hide resolved

jed326 added the v3.0.0 label Feb 25, 2025

owenhalpert force-pushed the vector-client-skeleton branch 2 times, most recently from 8d43195 to 1f8b36d Compare February 26, 2025 01:27

navneet1v reviewed Feb 26, 2025

View reviewed changes

src/main/java/org/opensearch/knn/index/KNNSettings.java Show resolved Hide resolved

jed326 reviewed Feb 26, 2025

View reviewed changes

owenhalpert force-pushed the vector-client-skeleton branch 4 times, most recently from 7076818 to a8a4400 Compare February 26, 2025 23:05

jed326 reviewed Feb 26, 2025

View reviewed changes

owenhalpert force-pushed the vector-client-skeleton branch from 2b069b8 to 5176e70 Compare February 27, 2025 05:11

owenhalpert force-pushed the vector-client-skeleton branch 3 times, most recently from 78c56f5 to ab98581 Compare February 27, 2025 16:43

jmazanec15 reviewed Feb 27, 2025

View reviewed changes

owenhalpert force-pushed the vector-client-skeleton branch from 560bdc0 to 2721a15 Compare February 27, 2025 17:07

navneet1v reviewed Feb 27, 2025

View reviewed changes

src/main/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategy.java Show resolved Hide resolved

navneet1v reviewed Feb 27, 2025

View reviewed changes

src/main/java/org/opensearch/knn/index/remote/RemoteIndexClient.java Outdated Show resolved Hide resolved

navneet1v reviewed Feb 27, 2025

View reviewed changes

src/main/java/org/opensearch/knn/index/remote/RemoteIndexHTTPClient.java Outdated Show resolved Hide resolved

jed326 reviewed Feb 28, 2025

View reviewed changes

owenhalpert force-pushed the vector-client-skeleton branch from dbddb29 to ce31623 Compare February 28, 2025 09:32

owenhalpert added 12 commits February 28, 2025 01:38

Initial revisions

173b20b

Extract interface, remove out of scope methods, move build request to client, refactors Signed-off-by: owenhalpert <[email protected]>

Add more build request work

4e48a1a

Signed-off-by: owenhalpert <[email protected]>

Add more build request work, Javadoc

e233c1c

Signed-off-by: owenhalpert <[email protected]>

Add submitVectorBuild unit test

c1b12f9

Signed-off-by: owenhalpert <[email protected]>

Remove authentication until further testing

ab346af

Signed-off-by: owenhalpert <[email protected]>

Add dimension

0123ea2

Signed-off-by: owenhalpert <[email protected]>

Remove auth settings for now

c4a2644

Signed-off-by: owenhalpert <[email protected]>

Add secure settings, access control logic for http request

2677f6f

Signed-off-by: owenhalpert <[email protected]>

Test full build request

2512589

Signed-off-by: owenhalpert <[email protected]>

Remove fp16 vs. fp32 logic, other initial revisions

ffd86c0

Signed-off-by: owenhalpert <[email protected]>

Refactor, add remote build request interface, complete FAISS request …

8b6f12a

…parameters, secure setting testing Signed-off-by: owenhalpert <[email protected]>

owenhalpert force-pushed the vector-client-skeleton branch from ce31623 to 8b6f12a Compare February 28, 2025 09:40

Drop Jackson dependency for OpenSearch XContent

56bb771

Signed-off-by: owenhalpert <[email protected]>

owenhalpert changed the title ~~[Remote Vector Index Build] Introduce RemoteIndexClient~~ [Remote Vector Index Build] Introduce RemoteIndexClient skeleton and Build Request construction Feb 28, 2025

Vikasht34 reviewed Feb 28, 2025

View reviewed changes

src/main/java/org/opensearch/knn/index/remote/RemoteIndexClientRetryStrategy.java Show resolved Hide resolved

Vikasht34 reviewed Feb 28, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Remote Vector Index Build] Introduce RemoteIndexClient skeleton and Build Request construction #2560

[Remote Vector Index Build] Introduce RemoteIndexClient skeleton and Build Request construction #2560

owenhalpert commented Feb 25, 2025 •

edited

Loading

jed326 left a comment

jed326 Feb 28, 2025

owenhalpert Feb 28, 2025

jed326 Feb 28, 2025

owenhalpert Feb 28, 2025 •

edited

Loading

owenhalpert Feb 28, 2025

jed326 Feb 28, 2025

owenhalpert Feb 28, 2025 •

edited

Loading

jed326 Feb 28, 2025

jed326 Feb 28, 2025

Vikasht34 Feb 28, 2025 •

edited

Loading

Vikasht34 Feb 28, 2025

owenhalpert Feb 28, 2025

Vikasht34 Feb 28, 2025

Vikasht34 Feb 28, 2025 •

edited

Loading

owenhalpert Feb 28, 2025

Vikasht34 Feb 28, 2025

jed326 Feb 28, 2025

[Remote Vector Index Build] Introduce RemoteIndexClient skeleton and Build Request construction #2560

Are you sure you want to change the base?

[Remote Vector Index Build] Introduce RemoteIndexClient skeleton and Build Request construction #2560

Conversation

owenhalpert commented Feb 25, 2025 • edited Loading

Description

Related Issues

Check List

jed326 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

owenhalpert Feb 28, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

owenhalpert Feb 28, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Vikasht34 Feb 28, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Vikasht34 Feb 28, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

owenhalpert commented Feb 25, 2025 •

edited

Loading

owenhalpert Feb 28, 2025 •

edited

Loading

owenhalpert Feb 28, 2025 •

edited

Loading

Vikasht34 Feb 28, 2025 •

edited

Loading

Vikasht34 Feb 28, 2025 •

edited

Loading