-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Remote Vector Index Build] Introduce RemoteIndexClient skeleton and Build Request construction #2560
base: main
Are you sure you want to change the base?
[Remote Vector Index Build] Introduce RemoteIndexClient skeleton and Build Request construction #2560
Conversation
3fb30f9
to
2095bba
Compare
src/main/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategy.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategy.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategy.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexClient.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexClient.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexClient.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexClient.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexClientRetryStrategy.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexClientRetryStrategy.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategy.java
Outdated
Show resolved
Hide resolved
...st/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategyTests.java
Outdated
Show resolved
Hide resolved
8d43195
to
1f8b36d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @owenhalpert , the approach here LGTM, think we're just missing some tests related to submitVectorBuild
and the settings.
src/main/java/org/opensearch/knn/index/remote/RemoteIndexClient.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexHTTPClient.java
Outdated
Show resolved
Hide resolved
src/test/java/org/opensearch/knn/index/remote/RemoteIndexClientTests.java
Outdated
Show resolved
Hide resolved
src/test/java/org/opensearch/knn/index/remote/RemoteIndexClientTests.java
Outdated
Show resolved
Hide resolved
7076818
to
a8a4400
Compare
src/main/java/org/opensearch/knn/index/remote/RemoteIndexClient.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexHTTPClient.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexHTTPClient.java
Outdated
Show resolved
Hide resolved
2b069b8
to
5176e70
Compare
78c56f5
to
ab98581
Compare
src/main/java/org/opensearch/knn/index/remote/RemoteIndexClient.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexHTTPClient.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexHTTPClient.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexHTTPClient.java
Outdated
Show resolved
Hide resolved
560bdc0
to
2721a15
Compare
src/main/java/org/opensearch/knn/index/remote/RemoteBuildRequest.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteBuildRequest.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexClientRetryStrategy.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexHTTPClient.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexHTTPClient.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexHTTPClient.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexHTTPClient.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexHTTPClient.java
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexHTTPClient.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategy.java
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexClient.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexHTTPClient.java
Outdated
Show resolved
Hide resolved
private RemoteIndexClient getRemoteIndexClient() { | ||
String endpoint = KNNSettings.state().getSettingValue(KNN_REMOTE_BUILD_SERVICE_ENDPOINT_SETTING.getKey()); | ||
if (endpoint == null || endpoint.isEmpty()) { | ||
throw new IllegalArgumentException("No endpoint set for RemoteIndexClient"); | ||
} | ||
return RemoteIndexHTTPClient.getInstance(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look quite right, we're just evaluating the endpoint without using it. Should RemoteIndexHTTPClient
actually be Singleton? Or just the underlying apache http client?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea here was to use the presence of an endpoint to detect that the HTTPClient should be used, and for this first iteration verify the endpoint isn't empty. I don't immediately see how that connects to your second point, but that I don't have too strong of an opinion about. From a configuration standpoint it may make sense to have a Singleton client that has configuration controls over the behavior/usage of an HttpClient which should be consistent across all client(s) in the cluster. I don't see an immediate benefit in creating new RemoteIndexClient's per build that share an underlying HttpClient, unless we run into issues with concurrency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are a few overlapping ideas here:
- I agree that the apache client should be singleton, but I don't see the benefit of having
RemoteIndexHTTPClient
as singleton as it doesn't have any configurations related to the apache client anyways - If
RemoteIndexHTTPClient
was not Singleton, then we could instantiate it per flush/merge, and then pass the endpoint to it. This same endpoint would then be "cached" for the operations - We can still use the presence of the endpoint to determine if the httpclient should be used, but we should try to only access the setting once to mitigate the case where it changes during the operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the idea of client-per-build for the endpoint logic. I'll make that change. Should I have the build strategy hold the client reference and set it once upon checking the endpoint? this.client = new RemoteIndexHTTPAccessor(endpoint)
Or can I just have the factory method return a new HTTPAccessor with no other logic for this first iteration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this means the request object doesn't need to hold the endpoint anymore — it makes sense to have the Accessor keep track of the endpoint and the request is just to be sent to any endpoint.
src/main/java/org/opensearch/knn/index/engine/lucene/Lucene.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteBuildRequest.java
Outdated
Show resolved
Hide resolved
/** | ||
* Interface which dictates how we interact with a remote index build service. | ||
*/ | ||
public interface RemoteIndexClient { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't really like calling this interface client. I think something like RemoteBuildProcessor
or RemoteBuildServiceAccessor
is better. Then we would have a HTTPClientRemoteBuildServiceAccessor
implementation, and that implementation could be non-singleton, only the apache client itself would be singleton.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will keep this open for other comments, but I disagree on the naming side. I don't have a strong opinion on the Singleton Client vs Singleton ApacheClient, but I prefer naming this component a Remote Index Client. The purpose of a client is indeed just to make requests and receive responses, but there are levels of processing/abstraction at every level, even with the underlying Apache client. It may not be perfect with the HTTP implementation because we are working with an underlying protocol that also has a concept of the same name "client", but for any other implementation, this class being the class that submits a build and awaits a response perfectly fits what I would describe as a "client" component. Yes, the HTTP implementation is adding extra methods and logic to make that simple communication work for its chosen protocol, but I would argue that the "Accessor" in this case is the RemoteIndexBuildStrategy that is calling on some component (a client) to submit a build request and wait for its response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but I would argue that the "Accessor" in this case is the RemoteIndexBuildStrategy that is calling on some component (a client) to submit a build request and wait for its response.
The point is that without an accessor layer, either the RemoteIndexBuildStrategy
or the RemoteIndexClient
is being overloaded with functionality. The term "client" should refer to specifically to the component that is making requests from/to the remote index build service, and the client functionality should be isolated to such. At a high level:
NativeIndexBuildStrategy
(RemoteIndexBuildStrategy
) -- This is responsible for coordinating the build / write of the index during flush/mergeVectorRepositoryAccessor
-- This is responsible for coordinating interactions with the vector repositoryBlobContainer
-- This is responsible for reading/writing to the repository
RemoteBuildServiceAccessor
-- This is responsible for coordinating interactions with the remote index build service- Apache http client -- This is responsible for making requests to the remote index build service
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think not calling this layer "client" addresses the concerns about the client interface including constructBuildRequest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion , It has to be RemoteIndexClient and not the accessor , Ideal Situation is whenever we are buiulding service , that service gives the client that represents that contacts for that service , but here are doing in opposite way. Let's take example for s3Client uses underhood HTTPS and java 11+ HttpClient or Apache HttpClient (depending on SDK version), but naming is still S3Client not an accessor , and generally we create accessorr by wrapping s3Client . So taking same analogy we have remoteIndexBuildService and there has to be someOne Representing that service and that can not be HTTP or APache becuase they are transport menchanism not the representative.
dbddb29
to
ce31623
Compare
Add RemoteIndexClient initial implementation, its accompanying dependencies, and Build Request, Retry Strategy, and test files Signed-off-by: owenhalpert <[email protected]> # Conflicts: # CHANGELOG.md # src/main/java/org/opensearch/knn/index/KNNSettings.java # src/main/java/org/opensearch/knn/index/codec/nativeindex/NativeIndexBuildStrategyFactory.java # src/main/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategy.java # src/test/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategyTests.java # Conflicts: # CHANGELOG.md # src/main/java/org/opensearch/knn/index/codec/nativeindex/remote/RemoteIndexBuildStrategy.java
Extract interface, remove out of scope methods, move build request to client, refactors Signed-off-by: owenhalpert <[email protected]>
Signed-off-by: owenhalpert <[email protected]>
Signed-off-by: owenhalpert <[email protected]>
Signed-off-by: owenhalpert <[email protected]>
Signed-off-by: owenhalpert <[email protected]>
Signed-off-by: owenhalpert <[email protected]>
Signed-off-by: owenhalpert <[email protected]>
Signed-off-by: owenhalpert <[email protected]>
Signed-off-by: owenhalpert <[email protected]>
Signed-off-by: owenhalpert <[email protected]>
…parameters, secure setting testing Signed-off-by: owenhalpert <[email protected]>
ce31623
to
8b6f12a
Compare
Signed-off-by: owenhalpert <[email protected]>
* @return RemoteBuildRequest to use to submit the build | ||
* @throws IOException if there is an error constructing the request | ||
*/ | ||
RemoteBuildRequest constructBuildRequest( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this has to be part of this inteface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each Client implementation will have to construct a build request to pass as a parameter to submitVectorBuild
, and the index build strategy expects to be able to call client.constructBuildRequest
to trigger the client to build its compatible request. Otherwise we'd have to check something like (if HTTPClient return HTTPBuildRequest
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two example of how this can be sorted.
- Example 1:- Take an example of S3 client, It does not force user to create a full fleshed request , instead it takes parameters and create those request internally. Example ..PutObject(It takes couple of parameters) but S3 intenally create CreatePutRequest.
- Example 2:- The others aws Services Clinet , thet give a builder to create the request and then pass the request using client menthods but the builder is not part of Client.
here we can either take 1 or 2 , but giving building object should not be part of Any Client.
src/main/java/org/opensearch/knn/index/remote/RemoteIndexClientRetryStrategy.java
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexClientRetryStrategy.java
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/remote/RemoteIndexClientRetryStrategy.java
Show resolved
Hide resolved
* @param repositoryMetadata Metadata of the repository containing the index | ||
* @param blobName File name generated by the Build Strategy with a UUID | ||
*/ | ||
public RemoteBuildRequest( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is too concentrated , It's complex to extend this abstract class to work with multiple type of repository , Constructors responsibility has been increased , It has to figure out many things to assign the variables.
This is ideal abstraction we can think
package org.opensearch.knn.index.remote;
import lombok.Getter;
import java.io.IOException;
import java.util.Map;
/**
* Abstract class representing a remote index build request.
*/
@Getter
public abstract class RemoteBuildRequest {
protected final String repositoryType;
protected final String containerName;
protected final String vectorPath;
protected final String docIdPath;
protected final String tenantId;
protected final int dimension;
protected final int docCount;
protected final String dataType;
protected final String engine;
protected final Map<String, Object> indexParameters;
protected RemoteBuildRequest(RemoteBuildRequestBuilder<?> builder) {
this.repositoryType = builder.repositoryType;
this.containerName = builder.containerName;
this.vectorPath = builder.vectorPath;
this.docIdPath = builder.docIdPath;
this.tenantId = builder.tenantId;
this.dimension = builder.dimension;
this.docCount = builder.docCount;
this.dataType = builder.dataType;
this.engine = builder.engine;
this.indexParameters = builder.indexParameters;
}
/**
* Convert the request to JSON.
*/
public abstract String toJson() throws IOException;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea here was that the build service depends on a concrete set of parameters from k-NN
. Regardless of the protocol/client implementation, using the BuildIndexParams to fetch the method name, index settings to get the cluster name, these are things that will need to happen for every build request, so I think it's important to have in the super constructor. The repository point is valid, but it might be the only part of the request that is taking too strong of a dependency on one implementation. For this, we could make a helper method that gets the repository container name (bucket
for s3
or location
for fs
, etc) based on the RepositoryMetadata passed in. As of now since we're only supporting s3
I think this is fine to leave as a fast followup/long term improvement, but importantly I think it's not worth removing the rest of the logic which to me is independent of implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be abstract class not dependent of any concrete , here we have repositoryMetadata, RemoteIndexBuildStrategy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to take the parsing logic out of the constructor, if we want to re-use the parsing logic then we can provide a helper / static method to do so
Description
First PR for #2518 (Client Skeleton, Build Request)
This PR introduces the initial implementation of the Remote Index client.
In this PR, I have implemented:
In the next PR's, I will implement:
Related Issues
#2391 Meta Issue
#2393 HLD by @jed326
#2518 My LLD
Check List
- [ ] New functionality has been documented.- [ ] API changes companion pull request created.--signoff
.- [ ] Public documentation issue/PR created.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.