[BUG] Do not use the 'search' queue for everything #875
Comments
Hi @mvanderlee, we have a bunch of performance fixes we're planning to release for |
also, some of the optimizations which you can already try out is using an
use the alias |
@sbcd90 glad to hear it. |
And we already have aliases, but they don't show up as options in the |
@mvanderlee we are already working on showing the aliases in the dropdown, and this should be available in 2.13 |
v 2.11.1
Our cluster was stable on an r5.2xlarge instance, hovering at ~10% CPU usage. Then we enabled the Windows detectors, and even an r5.8xlarge isn't enough.
We were experimenting with detectors, but they essentially brought down our entire instance.
The main issue can be boiled down to the fact that it's all running in the same 'search' queue. The detector UI is backed by 'search', the detectors themselves are backed by 'search' etc.
Why is this the worst idea ever?
Because the detectors fill up the queue and cause literally millions of searches to be rejected. We observed ~48 million rejections per hour overnight.
While this is a tuning and scaling issue, it also completely killed ingestion (our Spark pipeline kept failing to write to OpenSearch and dropped the data into our DLQ), and none of the dashboards work anymore since the UI also uses the 'search' queue.
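For anyone wanting to measure this on their own cluster, here is a minimal sketch of turning the cumulative `rejected` counter of the 'search' thread pool into an hourly rate. It assumes two samples of `GET _cat/thread_pool/search?h=node_name,name,active,queue,rejected&format=json` taken a known number of seconds apart. The function names and the sample numbers below are made up for illustration and are not our actual readings.

```python
from typing import Dict, List


def total_rejected(rows: List[Dict[str, str]]) -> int:
    """Sum the cumulative `rejected` counter across all nodes.

    The _cat API returns every value as a string when format=json is used.
    """
    return sum(int(row["rejected"]) for row in rows)


def rejections_per_hour(before: List[Dict[str, str]],
                        after: List[Dict[str, str]],
                        elapsed_seconds: float) -> float:
    """Extrapolate the rejection rate between two samples to one hour."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = total_rejected(after) - total_rejected(before)
    return delta * 3600.0 / elapsed_seconds


# Hypothetical samples taken 60 seconds apart on a single node.
before = [{"node_name": "node-1", "name": "search", "rejected": "1200000"}]
after = [{"node_name": "node-1", "name": "search", "rejected": "2000000"}]

rate = rejections_per_hour(before, after, 60.0)
print(f"{rate:,.0f} rejections/hour")  # prints "48,000,000 rejections/hour"
```

Note that the counter is cumulative per node and resets when a node restarts, so a negative delta means one of the samples spans a restart and should be discarded.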
So it wasn't just detectors that were failing. Everything started to fail. We couldn't even stop the detector because that request kept failing as well.
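To illustrate why sharing one queue is the problem, here is a toy model of a bounded queue that rejects work when full. This is not how OpenSearch implements its thread pools; `BoundedQueue`, `simulate`, and the numbers are hypothetical, and nothing drains the queue, which roughly models detectors submitting faster than searches complete.

```python
from collections import deque


class BoundedQueue:
    """A queue that rejects new tasks once it holds `capacity` items."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items = deque()
        self.rejected = 0

    def submit(self, task) -> bool:
        """Return True if the task was queued, False if it was rejected."""
        if len(self.items) >= self.capacity:
            self.rejected += 1
            return False
        self.items.append(task)
        return True


def simulate(shared: bool, capacity: int = 100, detector_tasks: int = 10_000) -> bool:
    """Flood the detector queue, then report whether a UI request is accepted."""
    detector_queue = BoundedQueue(capacity)
    # With shared=True the UI competes with the detectors for the same slots.
    ui_queue = detector_queue if shared else BoundedQueue(capacity)
    for i in range(detector_tasks):
        detector_queue.submit(("detector", i))
    return ui_queue.submit(("ui", "stop detector"))


print(simulate(shared=True))   # False: the "stop detector" request is rejected
print(simulate(shared=False))  # True: the UI queue is unaffected by the flood
```

With a shared queue, the one request that could relieve the pressure is rejected along with everything else. With a separate queue, the detectors can fall behind without taking the UI down with them.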
We have tried tuning the queues, but even a queue size of 100K still fills up, and we are still running into memory issues.
Management wanted us to try Detectors as they were hoping we'd no longer have to maintain our own rules engine with Sigma rules. But our own engine can do the job with far fewer resources on the exact same data set, and it doesn't affect anything else if it falls behind.
We are no longer moving forward with OpenSearch Security Analytics.