OpenSearch integration improvements #139

filipecosta90 · 2024-05-16T00:26:16Z

The changes below aim to:

Ease the OpenSearch integration while reusing the connection creation steps as much as possible.
Add support for anonymous auth on the OpenSearch client
Add support for HTTPS traffic and detect SSL settings from the connection details
Ensure that the cluster is at least in yellow state before querying
Enable inner product distance (supported for Lucene in OpenSearch version 2.13 and later)
Enable vector dimensions up to 16K (Source: https://opensearch.org/docs/latest/search-plugins/knn/approximate-knn/)
Use a backoff strategy in the search_one method when required (due to 429 errors). Fixes "OpenSearch search run should handle rate-limiting / 429 HTTP errors" #142
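
The backoff behaviour mentioned in the last item can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `RateLimited`, `with_backoff`, and the delay parameters are hypothetical names chosen for the example. In a real client, the exception to catch would be whatever the OpenSearch client raises for an HTTP 429 response.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the error a client raises on HTTP 429."""


def with_backoff(fn, max_tries=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff and jitter on RateLimited.

    Re-raises the last RateLimited error once max_tries attempts have failed.
    """
    for attempt in range(max_tries):
        try:
            return fn()
        except RateLimited:
            if attempt == max_tries - 1:
                raise
            # The delay doubles on each attempt (capped at max_delay);
            # jitter spreads retries out so clients do not retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * random.uniform(0.5, 1.0))
```

A search call would then be wrapped as `with_backoff(lambda: search_once(query))`, where `search_once` is a placeholder for the actual request. Passing `sleep` as a parameter keeps the helper easy to test without real waiting.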

LukasWestholt

Thank you for your commits. I want to compare the search and upload performance of the old and new implementations, and I hope this gets merged soon.

LukasWestholt · 2024-10-02T12:44:09Z

experiments/configurations/opensearch-single-node-single-shard.json

@@ -0,0 +1,87 @@
+[
+  {
+    "name": "opensearch-default",


You need to rename the single-shard experiments from "opensearch-" to "opensearch-single-shard-" so that both are accessible.

number_of_shards=1 is, according to this, already the default. Do these experiments then even do anything different from opensearch-single-node-default-index.json?

Should we maybe set number_of_replicas=0 as an optimization? See https://repost.aws/knowledge-center/opensearch-indexing-performance

Ah, I see: number_of_replicas=0 is already set.

LukasWestholt · 2024-10-02T12:44:15Z

engine/clients/opensearch/configure.py

+        if "number_of_shards" in index_config:
+            index_settings["number_of_shards"] = 1


"Tuples don't support item assignment"

I suggest:

    index_settings = {
        "knn": True,
        "number_of_replicas": 0,
        "refresh_interval": -1,  # no refresh is required because we index all the data at once
    }
    index_config = collection_params.get("index")
    # if we specify the number_of_shards on the config, enforce it. otherwise use the default
    if "number_of_shards" in index_config:
        index_settings["number_of_shards"] = 1
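
The error quoted above is what Python reports when `index_settings` ends up bound to a tuple rather than a dict. One common way this happens is a stray trailing comma after the dict literal. That cause is an assumption here, since the original assignment is not shown in the quoted diff. A minimal reproduction, with `broken_settings` as an illustrative name:

```python
# A trailing comma after the closing brace wraps the dict in a 1-tuple.
broken_settings = {
    "knn": True,
    "number_of_replicas": 0,
},

try:
    broken_settings["number_of_shards"] = 1
except TypeError as exc:
    # Python refuses because tuples are immutable.
    error_message = str(exc)

# Without the trailing comma the name is bound to a plain dict,
# so item assignment works as intended.
index_settings = {
    "knn": True,
    "number_of_replicas": 0,
}
index_settings["number_of_shards"] = 1
```

If `collection_params.get("index")` can return `None`, the membership test in the suggestion would also need a fallback such as `or {}`. Whether that case occurs in this code base is not visible from the diff.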

LukasWestholt · 2024-10-02T14:00:06Z

No real performance difference between the old and new implementations.

--- ../results/opensearch-old/output.json       2024-10-02 15:05:57.197185400 +0200
+++ ../results/opensearch-new/output.json       2024-10-02 15:52:09.001055200 +0200
@@ -1,59 +1,59 @@
 [
   {
     "engine_name": "opensearch",
     "setup_name": "opensearch-m-16-ef-128",
     "dataset_name": "glove-100-angular",
-    "upload_time": 505.1072022999997,
-    "total_upload_time": 506.65230710000014,
+    "upload_time": 571.6861197999997,
+    "total_upload_time": 571.9920295000002,
     "parallel": 1,
     "engine_params": {
       "knn.algo_param.ef_search": 128
     },
-    "mean_time": 0.08039978975999684,
-    "mean_precisions": 0.819351,
-    "std_time": 0.07328582483094778,
-    "min_time": 0.021495099999810918,
-    "max_time": 4.868069999999534,
-    "rps": 12.39949293721894,
-    "p95_time": 0.10517231999988325,
-    "p99_time": 0.18999490199986213
+    "mean_time": 0.07982564825000299,
+    "mean_precisions": 0.8167789999999999,
+    "std_time": 0.07186505068791486,
+    "min_time": 0.008940600000642007,
+    "max_time": 6.208381400001599,
+    "rps": 12.489320049897197,
+    "p95_time": 0.10064215000093098,
+    "p99_time": 0.21390265599933658
   },
   {
     "engine_name": "opensearch",
     "setup_name": "opensearch-m-16-ef-128",
     "dataset_name": "glove-100-angular",
-    "upload_time": 505.1072022999997,
-    "total_upload_time": 506.65230710000014,
+    "upload_time": 571.6861197999997,
+    "total_upload_time": 571.9920295000002,
     "parallel": 1,
     "engine_params": {
       "knn.algo_param.ef_search": 256
     },
-    "mean_time": 0.08500446872999737,
-    "mean_precisions": 0.816762,
-    "std_time": 0.07344455333790761,
-    "min_time": 0.026552799999990384,
-    "max_time": 4.793960500000139,
-    "rps": 11.71986972688328,
-    "p95_time": 0.1365703799996481,
-    "p99_time": 0.28884295299986684
+    "mean_time": 0.08107645404999203,
+    "mean_precisions": 0.814407,
+    "std_time": 0.020966374585830613,
+    "min_time": 0.028347600000415696,
+    "max_time": 1.7499939999997878,
+    "rps": 12.30329778171997,
+    "p95_time": 0.09037325000108466,
+    "p99_time": 0.09874180600041654
   },
   {
     "engine_name": "opensearch",
     "setup_name": "opensearch-m-16-ef-128",
     "dataset_name": "glove-100-angular",
-    "upload_time": 505.1072022999997,
-    "total_upload_time": 506.65230710000014,
+    "upload_time": 571.6861197999997,
+    "total_upload_time": 571.9920295000002,
     "parallel": 1,
     "engine_params": {
       "knn.algo_param.ef_search": 512
     },
-    "mean_time": 0.07950227081000166,
-    "mean_precisions": 0.816762,
-    "std_time": 0.027110040761532377,
-    "min_time": 0.027317600000060338,
-    "max_time": 1.4170608000003995,
-    "rps": 12.539646869542393,
-    "p95_time": 0.09963219500077684,
-    "p99_time": 0.14407165300047384
+    "mean_time": 0.07992414966999531,
+    "mean_precisions": 0.814407,
+    "std_time": 0.010356757441751704,
+    "min_time": 0.025327800000013667,
+    "max_time": 0.42032800000015413,
+    "rps": 12.479517153224734,
+    "p95_time": 0.08913765000097555,
+    "p99_time": 0.09884917200071869
   }
 ]
\ No newline at end of file
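
To make a comparison like the one above easier to read, the relative change per metric can be computed directly. The sketch below is illustrative only: `relative_change` is a hypothetical helper, and the two dictionaries copy a few rounded values from the first entry of the diff (ef_search = 128).

```python
def relative_change(old, new):
    """Return (new - old) / old for each numeric key present in both runs."""
    return {
        key: (new[key] - old[key]) / old[key]
        for key in old
        if key in new and isinstance(old[key], (int, float)) and old[key] != 0
    }


# Values taken (rounded) from the first entry of the diff above.
old_run = {"upload_time": 505.107, "mean_time": 0.080400, "rps": 12.3995}
new_run = {"upload_time": 571.686, "mean_time": 0.079826, "rps": 12.4893}

for metric, change in relative_change(old_run, new_run).items():
    print(f"{metric}: {change:+.1%}")
```

With these inputs, mean latency and requests per second each move by less than 1%, while upload time grows by roughly 13%. A single run per implementation does not show how much of that is noise.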

LukasWestholt · 2024-10-02T14:02:06Z

engine/clients/opensearch/upload.py

+        # Update the index settings back to the default
+        refresh_interval = "1s"
+        cls.client.indices.put_settings(
            index=OPENSEARCH_INDEX,
-            params={
-                "timeout": 300,
-            },
+            body={"index": {"refresh_interval": refresh_interval}},


Is cls.client.indices.refresh(index=OPENSEARCH_INDEX) better?

filipecosta90 · Oct 3, 2024

I believe it's best as is, meaning:

we disable refresh during indexing

we re-enable it afterwards
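
The pattern described in this reply (turn refresh off while indexing, then restore it) can be sketched as a context manager. This is an illustration under assumptions, not the code from this PR: `refresh_disabled` is a hypothetical helper, and `client` is assumed to expose `indices.put_settings(index=..., body=...)` in the way the quoted diff uses it. The `"1s"` default mirrors the value in that diff.

```python
from contextlib import contextmanager


@contextmanager
def refresh_disabled(client, index, restore_to="1s"):
    """Turn off index refresh for the duration of a bulk load, then restore it."""
    client.indices.put_settings(
        index=index, body={"index": {"refresh_interval": "-1"}}
    )
    try:
        yield
    finally:
        # Restore the interval even if the upload fails, so newly indexed
        # documents do not stay invisible to search indefinitely.
        client.indices.put_settings(
            index=index, body={"index": {"refresh_interval": restore_to}}
        )
```

Usage would look like `with refresh_disabled(client, OPENSEARCH_INDEX): upload_batches(...)`, where `upload_batches` is a placeholder. By contrast, a one-off `indices.refresh` call makes the documents indexed so far searchable but does not change the `refresh_interval` setting, so periodic refresh would stay disabled.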

navneet1v · 2024-11-05T00:32:46Z

Please check PR #214 for OpenSearch improvements and let me know if anything else is needed. I am one of the maintainers of OpenSearch and am happy to contribute improvements for OpenSearch in this tool.

@filipecosta90 , @LukasWestholt

filipecosta90 added 3 commits May 16, 2024 01:20

opensearch improvements

4964b7b

Fixes per PR linter

fbe689e

Fixes per ruff linter

91d3dd0

filipecosta90 changed the title ~~opensearch improvements~~ OpenSearch integration improvements May 16, 2024

filipecosta90 added 14 commits May 17, 2024 12:11

Increase the vector limit to 16K given the latest docs

acb11f1

Removed the dotproduct incompatibility error given opensearch now supports it (supported for Lucene in OpenSearch version 2.13 and later)

e05c145

Added source for IncompatibilityError on vector size

c82c5e9

Only using basic_auth when we have opensearch login data (this allows for anonymous_auth feature of opensearch)

d98196d

Only using basic_auth when we have opensearch login data (this allows for anonymous_auth feature of opensearch)

3fb15b9

Detecting ssl features from url on opensearch client

0555aa4

Fixed OpenSearch connection setup

fa3eb76

Waiting for yellow status at least on opensearch post upload stage

8ea777a

Fixes per PR pre-commit: isort

aec4967

Fixed forcemerge api usage on opensearch

e8b5764

Renamed references to ES

e500000

Added backoff strategy for search_one method on opensearch client

4f5937a

Fixes per PR pre-commit: isort

f83bc75

Added backoff strategy for search_one method on opensearch client

abd8637

filipecosta90 mentioned this pull request May 17, 2024

OpenSearch search run should handle rate-limiting / 429 HTTP errors #142

Open

filipecosta90 added 5 commits May 19, 2024 09:40

Improved index and search performance based upon docs recommendation

ae5b620

Collecting index stats at end of ingestion

5679a19

Using backoff on opensearch ingestion

276892c

Included single shard experiment for opensearch. Added backoff to post upload stage

2c54762

Fixes per PR pre-commit: isort

7292e92

LukasWestholt approved these changes Oct 2, 2024

LukasWestholt reviewed Oct 2, 2024