-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OpenSearch integration improvements #139
base: master
Are you sure you want to change the base?
OpenSearch integration improvements #139
Conversation
…ports it (supported for Lucene in OpenSearch version 2.13 and later)
… for anonymous_auth feature of opensearch)
… for anonymous_auth feature of opensearch)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your commits. I want to compare search and upload performance from old and new implementation and hope this will get merged soon.
@@ -0,0 +1,87 @@ | |||
[ | |||
{ | |||
"name": "opensearch-default", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to rename the single-shard experiments from
"opensearch-" to "opensearch-single-shard-" so both are accessable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
number_of_shards=1
is according to this already the default. Does those experiments then even do something different to opensearch-single-node-default-index.json
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we maybe say number_of_replicas=0
for optimization? see https://repost.aws/knowledge-center/opensearch-indexing-performance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah i see: number_of_replicas=0 is already set.
if "number_of_shards" in index_config: | ||
index_settings["number_of_shards"] = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Tuples don't support item assignment"
I suggest:
index_settings = {
"knn": True,
"number_of_replicas": 0,
"refresh_interval": -1, # no refresh is required because we index all the data at once
}
index_config = collection_params.get("index")
# if we specify the number_of_shards on the config, enforce it. otherwise use the default
if "number_of_shards" in index_config:
index_settings["number_of_shards"] = 1
No real performace difference between old and new implementation. --- ../results/opensearch-old/output.json 2024-10-02 15:05:57.197185400 +0200
+++ ../results/opensearch-new/output.json 2024-10-02 15:52:09.001055200 +0200
@@ -1,59 +1,59 @@
[
{
"engine_name": "opensearch",
"setup_name": "opensearch-m-16-ef-128",
"dataset_name": "glove-100-angular",
- "upload_time": 505.1072022999997,
- "total_upload_time": 506.65230710000014,
+ "upload_time": 571.6861197999997,
+ "total_upload_time": 571.9920295000002,
"parallel": 1,
"engine_params": {
"knn.algo_param.ef_search": 128
},
- "mean_time": 0.08039978975999684,
- "mean_precisions": 0.819351,
- "std_time": 0.07328582483094778,
- "min_time": 0.021495099999810918,
- "max_time": 4.868069999999534,
- "rps": 12.39949293721894,
- "p95_time": 0.10517231999988325,
- "p99_time": 0.18999490199986213
+ "mean_time": 0.07982564825000299,
+ "mean_precisions": 0.8167789999999999,
+ "std_time": 0.07186505068791486,
+ "min_time": 0.008940600000642007,
+ "max_time": 6.208381400001599,
+ "rps": 12.489320049897197,
+ "p95_time": 0.10064215000093098,
+ "p99_time": 0.21390265599933658
},
{
"engine_name": "opensearch",
"setup_name": "opensearch-m-16-ef-128",
"dataset_name": "glove-100-angular",
- "upload_time": 505.1072022999997,
- "total_upload_time": 506.65230710000014,
+ "upload_time": 571.6861197999997,
+ "total_upload_time": 571.9920295000002,
"parallel": 1,
"engine_params": {
"knn.algo_param.ef_search": 256
},
- "mean_time": 0.08500446872999737,
- "mean_precisions": 0.816762,
- "std_time": 0.07344455333790761,
- "min_time": 0.026552799999990384,
- "max_time": 4.793960500000139,
- "rps": 11.71986972688328,
- "p95_time": 0.1365703799996481,
- "p99_time": 0.28884295299986684
+ "mean_time": 0.08107645404999203,
+ "mean_precisions": 0.814407,
+ "std_time": 0.020966374585830613,
+ "min_time": 0.028347600000415696,
+ "max_time": 1.7499939999997878,
+ "rps": 12.30329778171997,
+ "p95_time": 0.09037325000108466,
+ "p99_time": 0.09874180600041654
},
{
"engine_name": "opensearch",
"setup_name": "opensearch-m-16-ef-128",
"dataset_name": "glove-100-angular",
- "upload_time": 505.1072022999997,
- "total_upload_time": 506.65230710000014,
+ "upload_time": 571.6861197999997,
+ "total_upload_time": 571.9920295000002,
"parallel": 1,
"engine_params": {
"knn.algo_param.ef_search": 512
},
- "mean_time": 0.07950227081000166,
- "mean_precisions": 0.816762,
- "std_time": 0.027110040761532377,
- "min_time": 0.027317600000060338,
- "max_time": 1.4170608000003995,
- "rps": 12.539646869542393,
- "p95_time": 0.09963219500077684,
- "p99_time": 0.14407165300047384
+ "mean_time": 0.07992414966999531,
+ "mean_precisions": 0.814407,
+ "std_time": 0.010356757441751704,
+ "min_time": 0.025327800000013667,
+ "max_time": 0.42032800000015413,
+ "rps": 12.479517153224734,
+ "p95_time": 0.08913765000097555,
+ "p99_time": 0.09884917200071869
}
]
\ No newline at end of file |
# Update the index settings back to the default | ||
refresh_interval = "1s" | ||
cls.client.indices.put_settings( | ||
index=OPENSEARCH_INDEX, | ||
params={ | ||
"timeout": 300, | ||
}, | ||
body={"index": {"refresh_interval": refresh_interval}}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is cls.client.indices.refresh(index=OPENSEARCH_INDEX)
better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i believe it's best as is, meaning:
- we disable refresh during indexing
- we enable it back after it
Please check this PR: #214 for Opensearch improvements and let me know if anything else is needed. I am one of the maintainer of Opensearch and happy to contribute on the improvements for Opensearch in this tool |
The bellow changes aim to: