-
Notifications
You must be signed in to change notification settings - Fork 101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Opensearch embedding store implementation #213
base: main
Are you sure you want to change the base?
Conversation
formatting and doc
import software.amazon.awssdk.http.SdkHttpClient; | ||
import software.amazon.awssdk.http.apache.ApacheHttpClient; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really have to use these clients?
I would much rather use the Quarkus HTTP stack since it leads to lower resource usage while also guaranteeing things will work in native mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option would be the JDK's HttpClient
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I need to dive a bit into this, tbh, this class is just a copy/paste from langchain4j, I just replaced the logger for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Except that the JDK one can leak connection - so I would rather use the quarkus one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is to connect to an AWS OpenSearch server, the opensearch extension also uses those https://github.com/quarkiverse/quarkus-opensearch/blob/main/opensearch-java-client/runtime/src/main/java/io/quarkiverse/opensearch/client/runtime/OpenSearchTransportHelper.java , this is where it would be nice to have @sboeckelmann feedback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really have to use these clients?
I would much rather use the Quarkus HTTP stack since it leads to lower resource usage while also guaranteeing things will work in native mode.
If you mean it would lead to not duplicating thread pools, sure, but I'd like to mention Apache HTTP clients have an async/reactive implementation as well, so you don't necessarily need to rely on blocking code.
FWIW in Hibernate Search we just use the Elasticsearch low-level client (a thin wrapper on top of Apache HTTP client -- the async one) to connect to either Elasticsearch and OpenSearch, and have a custom Apache HTTP plugin based on the AWS SDK to support signing. That way we handle it all with a single impl: Elasticsearch or OpenSearch, AWS or non-AWS. Though I'll admit it feels a bit messy. Another advantage is that AWS request signing in that case is well-tested (it's harder than you'd think, especially with chunked requests).
I would recommend that you rely on the existing quarkus-elasticsearch-rest-client
extension but... then you won't get AWS signing (since the only support we have right now is specific to Hibernate Search).
Eventually we'll need a more consolidated approach for the Elasticsearch low-level client. quarkusio/quarkus#26991 will get us one step closer, as it will force me to move AWS signing to something not specific to Hibernate Search.
We could also consider dropping the Elasticsearch low-level client and shipping a low-level client of our own based on the Quarkus HTTP stack, but I really would only do this if we can agree to use it on all extensions related to Elasticearch/OpenSearch, and if we can afford the API break.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you mean it would lead to not duplicating thread pools, sure, but I'd like to mention Apache HTTP clients have an async/reactive implementation as well, so you don't necessarily need to rely on blocking code
I'm more thinking about the impact loading 100s of classes of yet another HTTP library, than the impact of another thread pool which is smaller (but of course ideally would not exist)
@yrodiere might also be interested in this :) |
import software.amazon.awssdk.http.SdkHttpClient; | ||
import software.amazon.awssdk.http.apache.ApacheHttpClient; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Except that the JDK one can leak connection - so I would rather use the quarkus one.
Co-authored-by: Clement Escoffier <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pinging me @geoand, here are my two cents.
As explained below, consolidating our approach to connecting to OpenSearch/Elasticsearch is certainly going to be an issue.
In the long run (not possible right now), I wonder if we wouldn't want to use Hibernate Search in standalone mode (not yet supported in Quarkus, see quarkusio/quarkus#26182 ) for this store and the Elasticsearch one?
Hibernate Search supports connecting to multiple versions of Elasticsearch/OpenSearch transparently, and it has built-in support for mass indexing, search (full-text, spatial, ... and vector in 7.1). Perhaps more interesting, it supports keeping the Elasticsearch/OpenSearch index in sync with a primary source of truth (relational database) without reindexing everything all the time. What it does not (and will not) support is transforming content into embeddings (vectors), so that part would certainly still be langchain4j's job.
Or perhaps we'd just want to keep this embedding store "pristine", and simply have the Hibernate Search extension provide an integration with Lang4j to transform. I.e. expect people to either use the langchain4j API directly without Hibernate Search, or the Hibernate Search API which would delegate some things to langchain4j? At this point I'm not sure what's best, we'd need to discuss it.
import software.amazon.awssdk.http.SdkHttpClient; | ||
import software.amazon.awssdk.http.apache.ApacheHttpClient; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really have to use these clients?
I would much rather use the Quarkus HTTP stack since it leads to lower resource usage while also guaranteeing things will work in native mode.
If you mean it would lead to not duplicating thread pools, sure, but I'd like to mention Apache HTTP clients have an async/reactive implementation as well, so you don't necessarily need to rely on blocking code.
FWIW in Hibernate Search we just use the Elasticsearch low-level client (a thin wrapper on top of Apache HTTP client -- the async one) to connect to either Elasticsearch and OpenSearch, and have a custom Apache HTTP plugin based on the AWS SDK to support signing. That way we handle it all with a single impl: Elasticsearch or OpenSearch, AWS or non-AWS. Though I'll admit it feels a bit messy. Another advantage is that AWS request signing in that case is well-tested (it's harder than you'd think, especially with chunked requests).
I would recommend that you rely on the existing quarkus-elasticsearch-rest-client
extension but... then you won't get AWS signing (since the only support we have right now is specific to Hibernate Search).
Eventually we'll need a more consolidated approach for the Elasticsearch low-level client. quarkusio/quarkus#26991 will get us one step closer, as it will force me to move AWS signing to something not specific to Hibernate Search.
We could also consider dropping the Elasticsearch low-level client and shipping a low-level client of our own based on the Quarkus HTTP stack, but I really would only do this if we can agree to use it on all extensions related to Elasticearch/OpenSearch, and if we can afford the API break.
don't use the old rest-client, those are deprecated. |
import dev.langchain4j.store.embedding.EmbeddingStore; | ||
import io.quarkus.logging.Log; | ||
import software.amazon.awssdk.http.SdkHttpClient; | ||
import software.amazon.awssdk.http.apache.ApacheHttpClient; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha, it just occurred to me that by bypassing the existing quarkus-elasticsearch-rest-client
extension, you're also bypassing the dev service support for OpenSearch... Might be a bit annoying?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, very good point!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah IMO OpenSearch dev service is a "must have" (and it's also used by the integration test)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And @sboeckelmann mentioned that the rest client is deprecated.
So, what is the conclusion for this part ? Do we keep it as is for now ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And @sboeckelmann mentioned that the rest client is deprecated.
Only the one being used in this PR. There's another one with async/reactive support in Apache HTTP Client, and I'd be surprised if the AWS SDK didn't provide a wrapper around that, too.
Anyway, here's my suggestion: I'd recommend relying on quarkus-elasticsearch-rest-client
. It is quite low level (you'll have to write JSON), but it's consistent with what we expose in other Elasticsearch/OpenSearch extensions, and using it will get you dev services for (almost) free. Almost, because you'll probably still need to use a build item somewhere to let the Dev Services know you target OpenSearch, not Elasticsearch: see DevservicesElasticsearchBuildItem
.
Users would have to add AWS request signing manually for now, though. We can add built-in support to that client, though ideally we'd put that in https://github.com/quarkiverse/quarkus-amazon-services somewhere.
FWIW, you can find a relatively simple implementation of that stuff here (the entry point is ElasticsearchAwsHttpClientConfigurer
); it's LGPL, but I'm the author, and I'm hereby licensing this code to anyone interested as ASL2.
I'd personally suggest dropping AWS request signing for now and adding separately, but it's your call.
If nobody works on AWS request signing for quarkus-elasticsearch-rest-client
, I'll eventually have to do it when dealing with quarkusio/quarkus#26991, but I can't guarantee this will happen soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid that if we do leave it as is, we'll never get around to improving it :).
List<String> ids = embeddings.stream() | ||
.map(ignored -> randomUUID()) | ||
.collect(toList()); | ||
addAllInternal(ids, embeddings, embedded); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's not already handled by the caller, you might want to limit the number of documents you add to your bulk request.
I have no idea of the size and number of documents involved here, but I know that for more than a few hundred large documents on a low-RAM install of OpenSearch (e.g. what we provide with Dev Services, ~1GB heap), the request buffer of OpenSearch/Elasticsearch can easily get overloaded and then OpenSearch/Elasticsearch will start rejecting requests.
In Hibernate Search we have a configurable amount of parallel queues, each sending at most one bulk request at a time with a maximum number of documents in each bulk request (see https://github.com/quarkusio/search.quarkus.io/blob/8accfb39e06372805ce41d108a927e00828b1e25/src/main/resources/application.properties#L52C1-L56). This helps users tune their set up to avoid problems.
That being said, if you know your documents are small and the number of documents is reasonable (i.e. not millions of documents), you might be able to do away with such complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the time it should not be that many documents and their size should be pretty small but we can add maybe a "nice to have" for this.
@geoand what should we do about this one? The HTTP client issue is concerning. |
I think we can close this |
I was hoping for the apache httpclient 5 to become officially supported by Quarkus some day. So in my opinion there are 2 options:
|
That is highly unlikely unfortunately as it would be a suboptimal use of our (limited) resources |
Yes, exactly, in addition to that, I think having support for httpclient5 within Quarkus would not really make much of a difference either I actually don't see such a big isse with the httpclient5 dependency within opensearch. It's not nice to have two artifacts doing the same thing - but they are totally separate and independent and except for bloating things up with additional packages, it's not really doing any harm: |
@sboeckelmann There is a bit more to the story. We don't know if it works in native OOTB. Generally, when we have extensions, it's because we have to tune things. |
I really understand that. But like mentioned above: there's no httpcient 4 support in opensearch. The only option I can think of is moving the TranportProducer for httpclient5 out into an optional additional artifact. AWS should be fine. |
Even httpclient4 is not great. Is there any possibility of plugging an HTTP client managed by the framework itself? |
Thank you @cescoffier ! I’m really happy to see that the discussion has been picked up again and is moving forward. Regarding the possibility of plugging in an HTTP client managed by the framework: I can definitely look into this. It would be very helpful if you could provide a hint or reference to where this has already been implemented. |
What @cescoffier is referring is what for example what the Kubernetes Client or Testcontainers do, where the is an HTTP related SPI that various modules (like OkHttp, Vertx Http Client, JDK Http Client etc) can implement. |
Can you pinpoint me to where this was done in Kubernetes Client and what the SPI looks like over there? Would be nice to be able to follow some sort of blueprint. |
You can start looking at |
@sboeckelmann, any update on your side? |
I am planning on providing httpclient5 and aws as separate HttpTransport provider modules. |
Maybe @sboeckelmann can also review this since you are the author of the opensearch extension (and also added support in langchain4j)