Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Remote Vector Index Build] Introduce RemoteIndexClient skeleton and Build Request construction #2560

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

owenhalpert
Copy link
Contributor

@owenhalpert owenhalpert commented Feb 25, 2025

Description

First PR for #2518 (Client Skeleton, Build Request)

This PR introduces the initial implementation of the Remote Index client.
In this PR, I have implemented:

  • Client interface + HTTP implementation skeleton, showing overview of the client creation, code flow, etc
    • Updated code flow in RemoteIndexBuildStrategy based on finalized method parameters/return types
  • Programmatic build request construction
  • All client-related settings
  • Extendable RemoteBuildRequest interface and class
  • Client retry strategy
  • Unit tests

In the next PR's, I will implement:

  • Polling mechanism
  • Metric collection
  • Complete unit testing

Related Issues

#2391 Meta Issue
#2393 HLD by @jed326
#2518 My LLD

Check List

  • New functionality includes testing.
    - [ ] New functionality has been documented.
    - [ ] API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
    - [ ] Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@owenhalpert owenhalpert changed the title Introduce RemoteIndexClient [Remote Vector Index Build] Introduce RemoteIndexClient Feb 25, 2025
@owenhalpert owenhalpert force-pushed the vector-client-skeleton branch from 3fb30f9 to 2095bba Compare February 25, 2025 21:16
@owenhalpert owenhalpert marked this pull request as ready for review February 25, 2025 21:16
@jed326 jed326 added the v3.0.0 label Feb 25, 2025
@owenhalpert owenhalpert force-pushed the vector-client-skeleton branch 2 times, most recently from 8d43195 to 1f8b36d Compare February 26, 2025 01:27
Copy link
Contributor

@jed326 jed326 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @owenhalpert , the approach here LGTM, think we're just missing some tests related to submitVectorBuild and the settings.

@owenhalpert owenhalpert force-pushed the vector-client-skeleton branch 4 times, most recently from 7076818 to a8a4400 Compare February 26, 2025 23:05
@owenhalpert owenhalpert force-pushed the vector-client-skeleton branch from 2b069b8 to 5176e70 Compare February 27, 2025 05:11
@owenhalpert owenhalpert force-pushed the vector-client-skeleton branch 3 times, most recently from 78c56f5 to ab98581 Compare February 27, 2025 16:43
@owenhalpert owenhalpert force-pushed the vector-client-skeleton branch from 560bdc0 to 2721a15 Compare February 27, 2025 17:07
Comment on lines +182 to +187
private RemoteIndexClient getRemoteIndexClient() {
String endpoint = KNNSettings.state().getSettingValue(KNN_REMOTE_BUILD_SERVICE_ENDPOINT_SETTING.getKey());
if (endpoint == null || endpoint.isEmpty()) {
throw new IllegalArgumentException("No endpoint set for RemoteIndexClient");
}
return RemoteIndexHTTPClient.getInstance();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look quite right, we're just evaluating the endpoint without using it. Should RemoteIndexHTTPClient actually be Singleton? Or just the underlying apache http client?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea here was to use the presence of an endpoint to detect that the HTTPClient should be used, and for this first iteration verify the endpoint isn't empty. I don't immediately see how that connects to your second point, but that I don't have too strong of an opinion about. From a configuration standpoint it may make sense to have a Singleton client that has configuration controls over the behavior/usage of an HttpClient which should be consistent across all client(s) in the cluster. I don't see an immediate benefit in creating new RemoteIndexClient's per build that share an underlying HttpClient, unless we run into issues with concurrency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are a few overlapping ideas here:

  • I agree that the apache client should be singleton, but I don't see the benefit of having RemoteIndexHTTPClient as singleton as it doesn't have any configurations related to the apache client anyways
  • If RemoteIndexHTTPClient was not Singleton, then we could instantiate it per flush/merge, and then pass the endpoint to it. This same endpoint would then be "cached" for the operations
  • We can still use the presence of the endpoint to determine if the httpclient should be used, but we should try to only access the setting once to mitigate the case where it changes during the operation.

Copy link
Contributor Author

@owenhalpert owenhalpert Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the idea of client-per-build for the endpoint logic. I'll make that change. Should I have the build strategy hold the client reference and set it once upon checking the endpoint? this.client = new RemoteIndexHTTPAccessor(endpoint) Or can I just have the factory method return a new HTTPAccessor with no other logic for this first iteration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this means the request object doesn't need to hold the endpoint anymore — it makes sense to have the Accessor keep track of the endpoint and the request is just to be sent to any endpoint.

/**
* Interface which dictates how we interact with a remote index build service.
*/
public interface RemoteIndexClient {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't really like calling this interface client. I think something like RemoteBuildProcessor or RemoteBuildServiceAccessor is better. Then we would have a HTTPClientRemoteBuildServiceAccessor implementation, and that implementation could be non-singleton, only the apache client itself would be singleton.

Copy link
Contributor Author

@owenhalpert owenhalpert Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will keep this open for other comments, but I disagree on the naming side. I don't have a strong opinion on the Singleton Client vs Singleton ApacheClient, but I prefer naming this component a Remote Index Client. The purpose of a client is indeed just to make requests and receive responses, but there are levels of processing/abstraction at every level, even with the underlying Apache client. It may not be perfect with the HTTP implementation because we are working with an underlying protocol that also has a concept of the same name "client", but for any other implementation, this class being the class that submits a build and awaits a response perfectly fits what I would describe as a "client" component. Yes, the HTTP implementation is adding extra methods and logic to make that simple communication work for its chosen protocol, but I would argue that the "Accessor" in this case is the RemoteIndexBuildStrategy that is calling on some component (a client) to submit a build request and wait for its response.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I would argue that the "Accessor" in this case is the RemoteIndexBuildStrategy that is calling on some component (a client) to submit a build request and wait for its response.

The point is that without an accessor layer, either the RemoteIndexBuildStrategy or the RemoteIndexClient is being overloaded with functionality. The term "client" should refer to specifically to the component that is making requests from/to the remote index build service, and the client functionality should be isolated to such. At a high level:

  • NativeIndexBuildStrategy (RemoteIndexBuildStrategy) -- This is responsible for coordinating the build / write of the index during flush/merge
    • VectorRepositoryAccessor -- This is responsible for coordinating interactions with the vector repository
      • BlobContainer -- This is responsible for reading/writing to the repository
    • RemoteBuildServiceAccessor -- This is responsible for coordinating interactions with the remote index build service
      • Apache http client -- This is responsible for making requests to the remote index build service

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think not calling this layer "client" addresses the concerns about the client interface including constructBuildRequest

Copy link
Collaborator

@Vikasht34 Vikasht34 Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion , It has to be RemoteIndexClient and not the accessor , Ideal Situation is whenever we are buiulding service , that service gives the client that represents that contacts for that service , but here are doing in opposite way. Let's take example for s3Client uses underhood HTTPS and java 11+ HttpClient or Apache HttpClient (depending on SDK version), but naming is still S3Client not an accessor , and generally we create accessorr by wrapping s3Client . So taking same analogy we have remoteIndexBuildService and there has to be someOne Representing that service and that can not be HTTP or APache becuase they are transport menchanism not the representative.

@owenhalpert owenhalpert force-pushed the vector-client-skeleton branch from dbddb29 to ce31623 Compare February 28, 2025 09:32
Add RemoteIndexClient initial implementation, its accompanying dependencies, and Build Request, Retry Strategy, and test files

Signed-off-by: owenhalpert <[email protected]>

# Conflicts:
#	CHANGELOG.md
#	src/main/java/org/opensearch/knn/index/KNNSettings.java
#	src/main/java/org/opensearch/knn/index/codec/nativeindex/NativeIndexBuildStrategyFactory.java
#	src/main/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategy.java
#	src/test/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategyTests.java

# Conflicts:
#	CHANGELOG.md
#	src/main/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategy.java
Extract interface, remove out of scope methods, move build request to client, refactors

Signed-off-by: owenhalpert <[email protected]>
Signed-off-by: owenhalpert <[email protected]>
Signed-off-by: owenhalpert <[email protected]>
Signed-off-by: owenhalpert <[email protected]>
Signed-off-by: owenhalpert <[email protected]>
…parameters, secure setting testing

Signed-off-by: owenhalpert <[email protected]>
@owenhalpert owenhalpert force-pushed the vector-client-skeleton branch from ce31623 to 8b6f12a Compare February 28, 2025 09:40
@owenhalpert owenhalpert changed the title [Remote Vector Index Build] Introduce RemoteIndexClient [Remote Vector Index Build] Introduce RemoteIndexClient skeleton and Build Request construction Feb 28, 2025
* @return RemoteBuildRequest to use to submit the build
* @throws IOException if there is an error constructing the request
*/
RemoteBuildRequest constructBuildRequest(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this has to be part of this inteface?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each Client implementation will have to construct a build request to pass as a parameter to submitVectorBuild, and the index build strategy expects to be able to call client.constructBuildRequest to trigger the client to build its compatible request. Otherwise we'd have to check something like (if HTTPClient return HTTPBuildRequest)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two example of how this can be sorted.

  1. Example 1:- Take an example of S3 client, It does not force user to create a full fleshed request , instead it takes parameters and create those request internally. Example ..PutObject(It takes couple of parameters) but S3 intenally create CreatePutRequest.
  2. Example 2:- The others aws Services Clinet , thet give a builder to create the request and then pass the request using client menthods but the builder is not part of Client.

here we can either take 1 or 2 , but giving building object should not be part of Any Client.

* @param repositoryMetadata Metadata of the repository containing the index
* @param blobName File name generated by the Build Strategy with a UUID
*/
public RemoteBuildRequest(
Copy link
Collaborator

@Vikasht34 Vikasht34 Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too concentrated , It's complex to extend this abstract class to work with multiple type of repository , Constructors responsibility has been increased , It has to figure out many things to assign the variables.

This is ideal abstraction we can think

package org.opensearch.knn.index.remote;

import lombok.Getter;
import java.io.IOException;
import java.util.Map;

/**
 * Abstract class representing a remote index build request.
 */
@Getter
public abstract class RemoteBuildRequest {
    protected final String repositoryType;
    protected final String containerName;
    protected final String vectorPath;
    protected final String docIdPath;
    protected final String tenantId;
    protected final int dimension;
    protected final int docCount;
    protected final String dataType;
    protected final String engine;
    protected final Map<String, Object> indexParameters;

    protected RemoteBuildRequest(RemoteBuildRequestBuilder<?> builder) {
        this.repositoryType = builder.repositoryType;
        this.containerName = builder.containerName;
        this.vectorPath = builder.vectorPath;
        this.docIdPath = builder.docIdPath;
        this.tenantId = builder.tenantId;
        this.dimension = builder.dimension;
        this.docCount = builder.docCount;
        this.dataType = builder.dataType;
        this.engine = builder.engine;
        this.indexParameters = builder.indexParameters;
    }

    /**
     * Convert the request to JSON.
     */
    public abstract String toJson() throws IOException;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here was that the build service depends on a concrete set of parameters from k-NN. Regardless of the protocol/client implementation, using the BuildIndexParams to fetch the method name, index settings to get the cluster name, these are things that will need to happen for every build request, so I think it's important to have in the super constructor. The repository point is valid, but it might be the only part of the request that is taking too strong of a dependency on one implementation. For this, we could make a helper method that gets the repository container name (bucket for s3 or location for fs, etc) based on the RepositoryMetadata passed in. As of now since we're only supporting s3 I think this is fine to leave as a fast followup/long term improvement, but importantly I think it's not worth removing the rest of the logic which to me is independent of implementation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be abstract class not dependent of any concrete , here we have repositoryMetadata, RemoteIndexBuildStrategy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to take the parsing logic out of the constructor, if we want to re-use the parsing logic then we can provide a helper / static method to do so

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants