Add concurrent segment search follow-up blog and concurrent search nightly benchmark dashboards #3031
Conversation
Signed-off-by: Ganesh Ramadurai <[email protected]>
@kolchfa-aws could you please help review this? Thanks!
@jed326 - Hi Jay, thanks for contributing a blog. As the blog manager, I ask that you not reach out to Fanit directly to review your blog and instead follow the process I outlined in the Slack message I sent you earlier today.
Signed-off-by: Jay Deng <[email protected]>
Signed-off-by: Jay Deng <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
In concurrent segment search, each shard-level search request on a node is divided into multiple execution tasks called slices. Slices can be executed concurrently on separate threads in the index_searcher threadpool, separate from the search threadpool. Each slice searches within the segments associated with it. Once all slice executions are complete, the collected results from all slices are combined (reduced) and returned to the coordinator node. The index_searcher threadpool is used to execute the slices of each shard search request and is shared across all shard search requests on a node. By default, the index_searcher threadpool has twice as many threads as the number of available processors.
In concurrent segment search, each shard-level search request on a node is divided into multiple execution tasks called _slices_. Slices can be executed concurrently on separate threads in the index searcher thread pool, separate from the search thread pool. Each slice searches within the segments associated with it. Once all slice executions are complete, the collected results from all slices are combined (reduced) and returned to the coordinator node. The index searcher thread pool is used to execute the slices of each shard search request and is shared across all shard search requests on a node. By default, the index searcher thread pool has twice as many threads as the number of available processors.
I meant to format it as `index_searcher` and `search` thread pools, as those are the names of the thread pools.
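For readers following along with the blog's explanation, here is a minimal sketch of enabling concurrent segment search and pinning the slice count through the dynamic cluster settings API. The endpoint URL and the `search.concurrent.max_slice_count` setting name are assumptions drawn from the concurrent segment search documentation, not from this PR:

```python
import requests  # assumes a local, unauthenticated cluster at localhost:9200

# Sketch only: enable concurrent segment search and pin the slice count.
# Setting max_slice_count to 0 falls back to Lucene's default slice computation.
settings = {
    "persistent": {
        "search.concurrent_segment_search.enabled": True,
        "search.concurrent.max_slice_count": 4,
    }
}
resp = requests.put("http://localhost:9200/_cluster/settings", json=settings)
resp.raise_for_status()
print(resp.json())
```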
| Cluster Configuration | % perf improvement going from cs disabled to 2 slices | % additional CPU utilization | % perf improvement going from 2 slices to 4 slices | % additional CPU utilization | % perf improvement going from 4 slices to Lucene default | % additional CPU utilization |
#### Setup comparison for `range-auto-date-histo-with-metrics`
| Clsuter configuration | % Performance improvement from CS disabled to 2 slices | % Additional CPU utilization | % Performance improvement from 2 slices to 4 slices | % Additional CPU utilization | % Performance improvement from 4 slices to Lucene default | % Additional CPU utilization |
Typo: "Clsuter configuration" should be "Cluster configuration".
Third, the specific implementation of queries can greatly impact the performance when increasing concurrency as some queries may end up performing more duplicate work as the number of slices increases. For example, significant terms aggregations perform count queries on each bucket key to determine the term background frequencies so duplicated bucket keys across segment slices will result in duplicated count queries across slices as well.
Third, the specific query implementation can greatly impact the performance when increasing concurrency because some queries may perform more duplicate work as the number of slices increases. For example, significant terms aggregations count queries for each bucket key to determine the term background frequencies. Thus, duplicated bucket keys across segment slices result in duplicated count queries across slices as well.
Missing verb here: this should read "significant terms aggregations perform count queries".
Signed-off-by: Fanit Kolchina <[email protected]>
@jed326 @kolchfa-aws @pajuric Editorial review complete. Please see my comments and changes and let me know if you have any questions. Thanks!
Second, whenever the number of active threads is higher than the number of CPU cores, each individual thread may spend more time processing because the CPU cores are multiplexing tasks. By default, the `r5.2xlarge` instance with 8 CPU cores has 16 threads in the `index_searcher` thread pool and 13 threads in the `search` thread pool. If all 29 threads are concurrently processing search tasks, then each individual thread will encounter a longer processing time because there are only 8 CPU cores to serve these 29 threads.
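As a quick sanity check of those numbers, here is a small sketch that reproduces the 16 + 13 = 29 thread count for an 8-core instance. The `((3 * processors) / 2) + 1` formula for the `search` pool is an assumption based on the standard default sizing, not something stated in this PR:

```python
# Back-of-the-envelope thread counts for an r5.2xlarge (8 vCPUs).
def default_pool_sizes(processors: int) -> dict[str, int]:
    return {
        "index_searcher": 2 * processors,     # 2x available processors
        "search": (3 * processors) // 2 + 1,  # assumed default sizing formula
    }

sizes = default_pool_sizes(8)
print(sizes)                # {'index_searcher': 16, 'search': 13}
print(sum(sizes.values()))  # 29 threads contending for 8 CPU cores
```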
Third, the specific query implementation can greatly impact the performance when increasing concurrency because some queries may perform more duplicate work as the number of slices increases. For example, significant terms aggregations run count queries for each bucket key to determine the term background frequencies. Thus, duplicated bucket keys across segment slices result in duplicated count queries across slices as well.
Third, the specific query implementation can greatly impact the performance when increasing concurrency because some queries may perform more duplicate work as the number of slices increases. For example, significant terms aggregations run count queries for each bucket key to determine the term background frequencies. Thus, duplicated bucket keys across segment slices result in duplicated count queries across slices as well.
Third, the specific query implementation can greatly impact performance when increasing concurrency because some queries may perform more duplicate work as the number of slices increases. For example, significant terms aggregations run count queries for each bucket key to determine the term background frequencies. Thus, duplicated bucket keys across segment slices result in duplicated count queries across slices as well.
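To make the duplicated-work point concrete, here is an illustrative sketch (the bucket keys are hypothetical, not real query output) of how bucket keys repeated across slices translate into repeated count queries:

```python
# Each slice issues a background-frequency count query per bucket key it
# collected, so keys that appear in multiple slices are counted multiple times.
slice_bucket_keys = [
    {"error", "timeout"},    # keys collected by slice 0
    {"error", "disk_full"},  # keys collected by slice 1 ("error" repeats)
]
total_queries = sum(len(keys) for keys in slice_bucket_keys)  # 4 count queries
unique_keys = set().union(*slice_bucket_keys)                 # 3 distinct keys
print(total_queries - len(unique_keys), "duplicated count query")
```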
Fourth, the reduce phase is performed sequentially on all segment slices. If the reduce overhead is large, it can offset the gains from searching documents concurrently. For example, for aggregations, a new `Aggregator` instance is created for each segment slice. Each `Aggregator` creates an `InternalAggregation` object, which represents the buckets created during document collection. These `InternalAggregation` object instances are then processed sequentially during the reduce phase. As a result, a simple `terms` aggregation can create up to `slice_count * shard_size` buckets per shard, which are then processed sequentially during the reduce phase.
Fourth, the reduce phase is performed sequentially on all segment slices. If the reduce overhead is large, it can offset the gains from searching documents concurrently. For example, for aggregations, a new `Aggregator` instance is created for each segment slice. Each `Aggregator` creates an `InternalAggregation` object, which represents the buckets created during document collection. These `InternalAggregation` object instances are then processed sequentially during the reduce phase. As a result, a simple `terms` aggregation can create up to `slice_count * shard_size` buckets per shard, which are then processed sequentially during the reduce phase.
Fourth, the reduce phase is performed sequentially on all segment slices. If the reduce overhead is large, it can offset the gains realized from searching documents concurrently. For example, for aggregations, a new `Aggregator` instance is created for each segment slice. Each `Aggregator` creates an `InternalAggregation` object, which represents the buckets created during document collection. These `InternalAggregation` object instances are then processed sequentially during the reduce phase. As a result, a simple `terms` aggregation can create up to `slice_count * shard_size` buckets per shard, which are then processed sequentially during the reduce phase.
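A rough worked example of that `slice_count * shard_size` bound, assuming the common `(size * 1.5) + 10` default for a terms aggregation's `shard_size` (an assumption for illustration, not stated in this PR):

```python
# Upper bound on buckets a single shard can hand to the sequential reduce phase.
def max_buckets_per_shard(slice_count: int, size: int) -> int:
    shard_size = int(size * 1.5) + 10  # assumed terms-agg default for shard_size
    return slice_count * shard_size

# 4 slices and size=10 -> shard_size=25 -> up to 100 buckets reduced sequentially
print(max_buckets_per_shard(4, 10))
```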
## Wrapping up
In summary, when choosing a segment slice count to use, it’s important to run your own benchmarking to determine if the additional parallelization from adding more segment slices outweighs the additional processing overhead. While concurrent segment search is ready for use in production environments, you can continue to track its further improvements on this [project board](https://github.com/orgs/opensearch-project/projects/117).
In summary, when choosing a segment slice count to use, it’s important to run your own benchmarking to determine if the additional parallelization from adding more segment slices outweighs the additional processing overhead. While concurrent segment search is ready for use in production environments, you can continue to track its further improvements on this [project board](https://github.com/orgs/opensearch-project/projects/117).
In summary, when choosing a segment slice count to use, it's important to run your own benchmarking to determine whether the additional parallelization produced by adding more segment slices outweighs the additional processing overhead. Concurrent segment search is ready for use in production environments, and you can continue to track its ongoing improvements on this [project board](https://github.com/orgs/opensearch-project/projects/117). |
Additionally, in order to provide performance visibility over time, we will publish nightly performance runs for concurrent segment search in [OpenSearch Performance Benchmarks] (https://opensearch.org/benchmarks), covering all the test workloads mentioned in this post. |
Additionally, in order to provide performance visibility over time, we will publish nightly performance runs for concurrent segment search in [OpenSearch Performance Benchmarks] (https://opensearch.org/benchmarks), covering all the test workloads mentioned in this post.
Additionally, to provide visibility into performance over time, we will publish nightly performance runs for concurrent segment search in [OpenSearch Performance Benchmarks](https://opensearch.org/benchmarks), covering all the test workloads mentioned in this post. |
For guidelines when getting started with concurrent segment search, see [General guidelines](https://opensearch.org/docs/latest/search-plugins/concurrent-segment-search/#general-guidelines). |
For guidelines when getting started with concurrent segment search, see [General guidelines](https://opensearch.org/docs/latest/search-plugins/concurrent-segment-search/#general-guidelines).
For guidelines on getting started with concurrent segment search, see [General guidelines](https://opensearch.org/docs/latest/search-plugins/concurrent-segment-search/#general-guidelines). |
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
LGTM
- search
- technical-post
meta_keywords:
meta_description:
Please use this updated meta:
meta_keywords: concurrent segment search, search concurrency, control search concurrency, searching segments concurrently in OpenSearch
meta_description: Learn how to benchmark, track performance, and improve latency across your large workloads by searching segments concurrently in OpenSearch.
authors:
- jaydeng
- sohami
date: 2024-06-30
If you are able to get all the updates done today, we can push it live this afternoon. Please update the date to today if you feel this is possible.
Signed-off-by: Jay Deng <[email protected]>
This blog is ready to push live @nateynateynate @krisfreedain.
Let's go!
Thanks @nateynateynate! Out of curiosity, how long does it take for changes to go live on the website after merge?
Description
Add concurrent segment search follow-up blog and concurrent search nightly benchmark dashboards
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the BSD-3-Clause License.