Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenSearch integration improvements #139

Open
wants to merge 22 commits into
base: master
Choose a base branch
from

Conversation

filipecosta90
Copy link
Contributor

@filipecosta90 filipecosta90 commented May 16, 2024

The bellow changes aim to:

@filipecosta90 filipecosta90 changed the title opensearch improvements OpenSearch integration improvements May 16, 2024
Copy link

@LukasWestholt LukasWestholt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your commits. I want to compare search and upload performance from old and new implementation and hope this will get merged soon.

@@ -0,0 +1,87 @@
[
{
"name": "opensearch-default",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to rename the single-shard experiments from
"opensearch-" to "opensearch-single-shard-" so both are accessable.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

number_of_shards=1 is according to this already the default. Does those experiments then even do something different to opensearch-single-node-default-index.json?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we maybe say number_of_replicas=0 for optimization? see https://repost.aws/knowledge-center/opensearch-indexing-performance

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah i see: number_of_replicas=0 is already set.

Comment on lines +57 to +58
if "number_of_shards" in index_config:
index_settings["number_of_shards"] = 1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Tuples don't support item assignment"

I suggest:

    index_settings = {
        "knn": True,
        "number_of_replicas": 0,
        "refresh_interval": -1,  # no refresh is required because we index all the data at once
    }
    index_config = collection_params.get("index")

    # if we specify the number_of_shards on the config, enforce it. otherwise use the default
    if "number_of_shards" in index_config:
        index_settings["number_of_shards"] = 1

@LukasWestholt
Copy link

No real performace difference between old and new implementation.

--- ../results/opensearch-old/output.json       2024-10-02 15:05:57.197185400 +0200
+++ ../results/opensearch-new/output.json       2024-10-02 15:52:09.001055200 +0200
@@ -1,59 +1,59 @@
 [
   {
     "engine_name": "opensearch",
     "setup_name": "opensearch-m-16-ef-128",
     "dataset_name": "glove-100-angular",
-    "upload_time": 505.1072022999997,
-    "total_upload_time": 506.65230710000014,
+    "upload_time": 571.6861197999997,
+    "total_upload_time": 571.9920295000002,
     "parallel": 1,
     "engine_params": {
       "knn.algo_param.ef_search": 128
     },
-    "mean_time": 0.08039978975999684,
-    "mean_precisions": 0.819351,
-    "std_time": 0.07328582483094778,
-    "min_time": 0.021495099999810918,
-    "max_time": 4.868069999999534,
-    "rps": 12.39949293721894,
-    "p95_time": 0.10517231999988325,
-    "p99_time": 0.18999490199986213
+    "mean_time": 0.07982564825000299,
+    "mean_precisions": 0.8167789999999999,
+    "std_time": 0.07186505068791486,
+    "min_time": 0.008940600000642007,
+    "max_time": 6.208381400001599,
+    "rps": 12.489320049897197,
+    "p95_time": 0.10064215000093098,
+    "p99_time": 0.21390265599933658
   },
   {
     "engine_name": "opensearch",
     "setup_name": "opensearch-m-16-ef-128",
     "dataset_name": "glove-100-angular",
-    "upload_time": 505.1072022999997,
-    "total_upload_time": 506.65230710000014,
+    "upload_time": 571.6861197999997,
+    "total_upload_time": 571.9920295000002,
     "parallel": 1,
     "engine_params": {
       "knn.algo_param.ef_search": 256
     },
-    "mean_time": 0.08500446872999737,
-    "mean_precisions": 0.816762,
-    "std_time": 0.07344455333790761,
-    "min_time": 0.026552799999990384,
-    "max_time": 4.793960500000139,
-    "rps": 11.71986972688328,
-    "p95_time": 0.1365703799996481,
-    "p99_time": 0.28884295299986684
+    "mean_time": 0.08107645404999203,
+    "mean_precisions": 0.814407,
+    "std_time": 0.020966374585830613,
+    "min_time": 0.028347600000415696,
+    "max_time": 1.7499939999997878,
+    "rps": 12.30329778171997,
+    "p95_time": 0.09037325000108466,
+    "p99_time": 0.09874180600041654
   },
   {
     "engine_name": "opensearch",
     "setup_name": "opensearch-m-16-ef-128",
     "dataset_name": "glove-100-angular",
-    "upload_time": 505.1072022999997,
-    "total_upload_time": 506.65230710000014,
+    "upload_time": 571.6861197999997,
+    "total_upload_time": 571.9920295000002,
     "parallel": 1,
     "engine_params": {
       "knn.algo_param.ef_search": 512
     },
-    "mean_time": 0.07950227081000166,
-    "mean_precisions": 0.816762,
-    "std_time": 0.027110040761532377,
-    "min_time": 0.027317600000060338,
-    "max_time": 1.4170608000003995,
-    "rps": 12.539646869542393,
-    "p95_time": 0.09963219500077684,
-    "p99_time": 0.14407165300047384
+    "mean_time": 0.07992414966999531,
+    "mean_precisions": 0.814407,
+    "std_time": 0.010356757441751704,
+    "min_time": 0.025327800000013667,
+    "max_time": 0.42032800000015413,
+    "rps": 12.479517153224734,
+    "p95_time": 0.08913765000097555,
+    "p99_time": 0.09884917200071869
   }
 ]
\ No newline at end of file

Comment on lines +81 to +85
# Update the index settings back to the default
refresh_interval = "1s"
cls.client.indices.put_settings(
index=OPENSEARCH_INDEX,
params={
"timeout": 300,
},
body={"index": {"refresh_interval": refresh_interval}},

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is cls.client.indices.refresh(index=OPENSEARCH_INDEX) better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i believe it's best as is, meaning:

  • we disable refresh during indexing
  • we enable it back after it

@navneet1v
Copy link

navneet1v commented Nov 5, 2024

Please check this PR: #214 for Opensearch improvements and let me know if anything else is needed. I am one of the maintainer of Opensearch and happy to contribute on the improvements for Opensearch in this tool

@filipecosta90 , @LukasWestholt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OpenSearch search run should handle rate-limiting / 429 HTTP errors
3 participants