Add concurrent segment search follow-up blog and concurrent search nightly benchmark dashboards #3031
Conversation
Signed-off-by: Ganesh Ramadurai <[email protected]>
@kolchfa-aws could you please help review this? Thanks!
@jed326 - Hi Jay, thanks for contributing a blog. As the blog manager, I ask that you not reach out to Fanit directly to review your blog and instead follow the process I outlined in the Slack message I sent you earlier today.
Signed-off-by: Jay Deng <[email protected]>
Signed-off-by: Jay Deng <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
In concurrent segment search, each shard-level search request on a node is divided into multiple execution tasks called slices. Slices can be executed concurrently on separate threads in the index_searcher threadpool, separate from the search threadpool. Each slice searches within the segments associated with it. Once all slice executions are complete, the collected results from all slices are combined (reduced) and returned to the coordinator node. The index_searcher threadpool is used to execute the slices of each shard search request and is shared across all shard search requests on a node. By default, the index_searcher threadpool has twice as many threads as the number of available processors.
In concurrent segment search, each shard-level search request on a node is divided into multiple execution tasks called _slices_. Slices can be executed concurrently on separate threads in the index searcher thread pool, separate from the search thread pool. Each slice searches within the segments associated with it. Once all slice executions are complete, the collected results from all slices are combined (reduced) and returned to the coordinator node. The index searcher thread pool is used to execute the slices of each shard search request and is shared across all shard search requests on a node. By default, the index searcher thread pool has twice as many threads as the number of available processors.
I meant to format it as `index_searcher` and `search` thread pools, as those are the names of the thread pools.
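For readers following along with the blog's explanation, here is a minimal sketch of enabling concurrent segment search and pinning the slice count through the dynamic cluster settings API. The endpoint URL and the `search.concurrent.max_slice_count` setting name are assumptions drawn from the concurrent segment search documentation, not from this PR:

```python
import requests  # assumes a local, unauthenticated cluster at localhost:9200

# Sketch only: enable concurrent segment search and pin the slice count.
# Setting max_slice_count to 0 falls back to Lucene's default slice computation.
settings = {
    "persistent": {
        "search.concurrent_segment_search.enabled": True,
        "search.concurrent.max_slice_count": 4,
    }
}
resp = requests.put("http://localhost:9200/_cluster/settings", json=settings)
resp.raise_for_status()
print(resp.json())
```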
| Cluster Configuration | % perf improvement going from cs disabled to 2 slices | % additional CPU utilization | % perf improvement going from 2 slices to 4 slices | % additional CPU utilization | % perf improvement going from 4 slices to Lucene default | % additional CPU utilization |
#### Setup comparison for `range-auto-date-histo-with-metrics`
| Clsuter configuration | % Performance improvement from CS disabled to 2 slices | % Additional CPU utilization | % Performance improvement from 2 slices to 4 slices | % Additional CPU utilization | % Performance improvement from 4 slices to Lucene default | % Additional CPU utilization |
Typo: "Clsuter configuration" should be "Cluster configuration".
Third, the specific implementation of queries can greatly impact the performance when increasing concurrency as some queries may end up performing more duplicate work as the number of slices increases. For example, significant terms aggregations perform count queries on each bucket key to determine the term background frequencies so duplicated bucket keys across segment slices will result in duplicated count queries across slices as well.
Third, the specific query implementation can greatly impact the performance when increasing concurrency because some queries may perform more duplicate work as the number of slices increases. For example, significant terms aggregations count queries for each bucket key to determine the term background frequencies. Thus, duplicated bucket keys across segment slices result in duplicated count queries across slices as well.
Missing verb here: this should read "significant terms aggregations perform count queries".
Signed-off-by: Fanit Kolchina <[email protected]>
@jed326 @kolchfa-aws @pajuric Editorial review complete. Please see my comments and changes and let me know if you have any questions. Thanks!
Second, whenever the number of active threads is higher than the number of CPU cores, each individual thread may spend more time processing because the CPU cores are multiplexing tasks. By default, the `r5.2xlarge` instance with 8 CPU cores has 16 threads in the `index_searcher` thread pool and 13 threads in the `search` thread pool. If all 29 threads are concurrently processing search tasks, then each individual thread will encounter a longer processing time because there are only 8 CPU cores to serve these 29 threads.
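As a quick sanity check of those numbers, here is a small sketch that reproduces the 16 + 13 = 29 thread count for an 8-core instance. The `((3 * processors) / 2) + 1` formula for the `search` pool is an assumption based on the standard default sizing, not something stated in this PR:

```python
# Back-of-the-envelope thread counts for an r5.2xlarge (8 vCPUs).
def default_pool_sizes(processors: int) -> dict[str, int]:
    return {
        "index_searcher": 2 * processors,     # 2x available processors
        "search": (3 * processors) // 2 + 1,  # assumed default sizing formula
    }

sizes = default_pool_sizes(8)
print(sizes)                # {'index_searcher': 16, 'search': 13}
print(sum(sizes.values()))  # 29 threads contending for 8 CPU cores
```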
Third, the specific query implementation can greatly impact the performance when increasing concurrency because some queries may perform more duplicate work as the number of slices increases. For example, significant terms aggregations run count queries for each bucket key to determine the term background frequencies. Thus, duplicated bucket keys across segment slices result in duplicated count queries across slices as well.
Third, the specific query implementation can greatly impact the performance when increasing concurrency because some queries may perform more duplicate work as the number of slices increases. For example, significant terms aggregations run count queries for each bucket key to determine the term background frequencies. Thus, duplicated bucket keys across segment slices result in duplicated count queries across slices as well.
Third, the specific query implementation can greatly impact performance when increasing concurrency because some queries may perform more duplicate work as the number of slices increases. For example, significant terms aggregations run count queries for each bucket key to determine the term background frequencies. Thus, duplicated bucket keys across segment slices result in duplicated count queries across slices as well.
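To make the duplicated-work point concrete, here is an illustrative sketch (the bucket keys are hypothetical, not real query output) of how bucket keys repeated across slices translate into repeated count queries:

```python
# Each slice issues a background-frequency count query per bucket key it
# collected, so keys that appear in multiple slices are counted multiple times.
slice_bucket_keys = [
    {"error", "timeout"},    # keys collected by slice 0
    {"error", "disk_full"},  # keys collected by slice 1 ("error" repeats)
]
total_queries = sum(len(keys) for keys in slice_bucket_keys)  # 4 count queries
unique_keys = set().union(*slice_bucket_keys)                 # 3 distinct keys
print(total_queries - len(unique_keys), "duplicated count query")
```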
Fourth, the reduce phase is performed sequentially on all segment slices. If the reduce overhead is large, it can offset the gains from searching documents concurrently. For example, for aggregations, a new `Aggregator` instance is created for each segment slice. Each `Aggregator` creates an `InternalAggregation` object, which represents the buckets created during document collection. These `InternalAggregation` object instances are then processed sequentially during the reduce phase. As a result, a simple `terms` aggregation can create up to `slice_count * shard_size` buckets per shard, which are then processed sequentially during the reduce phase.
Fourth, the reduce phase is performed sequentially on all segment slices. If the reduce overhead is large, it can offset the gains from searching documents concurrently. For example, for aggregations, a new `Aggregator` instance is created for each segment slice. Each `Aggregator` creates an `InternalAggregation` object, which represents the buckets created during document collection. These `InternalAggregation` object instances are then processed sequentially during the reduce phase. As a result, a simple `terms` aggregation can create up to `slice_count * shard_size` buckets per shard, which are then processed sequentially during the reduce phase.
Fourth, the reduce phase is performed sequentially on all segment slices. If the reduce overhead is large, it can offset the gains realized from searching documents concurrently. For example, for aggregations, a new `Aggregator` instance is created for each segment slice. Each `Aggregator` creates an `InternalAggregation` object, which represents the buckets created during document collection. These `InternalAggregation` object instances are then processed sequentially during the reduce phase. As a result, a simple `terms` aggregation can create up to `slice_count * shard_size` buckets per shard, which are then processed sequentially during the reduce phase.
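A rough worked example of that `slice_count * shard_size` bound, assuming the common `(size * 1.5) + 10` default for a terms aggregation's `shard_size` (an assumption for illustration, not stated in this PR):

```python
# Upper bound on buckets a single shard can hand to the sequential reduce phase.
def max_buckets_per_shard(slice_count: int, size: int) -> int:
    shard_size = int(size * 1.5) + 10  # assumed terms-agg default for shard_size
    return slice_count * shard_size

# 4 slices and size=10 -> shard_size=25 -> up to 100 buckets reduced sequentially
print(max_buckets_per_shard(4, 10))
```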
## Wrapping up
In summary, when choosing a segment slice count to use, it’s important to run your own benchmarking to determine if the additional parallelization from adding more segment slices outweighs the additional processing overhead. While concurrent segment search is ready for use in production environments, you can continue to track its further improvements on this [project board](https://github.com/orgs/opensearch-project/projects/117).
In summary, when choosing a segment slice count to use, it’s important to run your own benchmarking to determine if the additional parallelization from adding more segment slices outweighs the additional processing overhead. While concurrent segment search is ready for use in production environments, you can continue to track its further improvements on this [project board](https://github.com/orgs/opensearch-project/projects/117).
In summary, when choosing a segment slice count to use, it's important to run your own benchmarking to determine whether the additional parallelization produced by adding more segment slices outweighs the additional processing overhead. Concurrent segment search is ready for use in production environments, and you can continue to track its ongoing improvements on this [project board](https://github.com/orgs/opensearch-project/projects/117). |
Additionally, in order to provide performance visibility over time, we will publish nightly performance runs for concurrent segment search in [OpenSearch Performance Benchmarks] (https://opensearch.org/benchmarks), covering all the test workloads mentioned in this post. |
Additionally, in order to provide performance visibility over time, we will publish nightly performance runs for concurrent segment search in [OpenSearch Performance Benchmarks] (https://opensearch.org/benchmarks), covering all the test workloads mentioned in this post.
Additionally, to provide visibility into performance over time, we will publish nightly performance runs for concurrent segment search in [OpenSearch Performance Benchmarks](https://opensearch.org/benchmarks), covering all the test workloads mentioned in this post. |
For guidelines when getting started with concurrent segment search, see [General guidelines](https://opensearch.org/docs/latest/search-plugins/concurrent-segment-search/#general-guidelines). |
For guidelines when getting started with concurrent segment search, see [General guidelines](https://opensearch.org/docs/latest/search-plugins/concurrent-segment-search/#general-guidelines).
For guidelines on getting started with concurrent segment search, see [General guidelines](https://opensearch.org/docs/latest/search-plugins/concurrent-segment-search/#general-guidelines). |
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
LGTM
- search
- technical-post
meta_keywords:
meta_description:
Please use this updated meta:
meta_keywords: concurrent segment search, search concurrency, control search concurrency, searching segments concurrently in OpenSearch
meta_description: Learn how to benchmark, track performance, and improve latency across your large workloads by searching segments concurrently in OpenSearch.
authors:
- jaydeng
- sohami
date: 2024-06-30
If you are able to get all the updates done today, we can push it live this afternoon. Please update the date to today if you feel this is possible.
Signed-off-by: Jay Deng <[email protected]>
This blog is ready to push live @nateynateynate @krisfreedain.
Let's go!
Thanks @nateynateynate! Out of curiosity, how long does it take for changes to go live on the website after merge?
Description
Add concurrent segment search follow-up blog and concurrent search nightly benchmark dashboards
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the BSD-3-Clause License.