[BUG] Do not use the 'search' queue for everything #875

Open
mvanderlee opened this issue Mar 1, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@mvanderlee

mvanderlee commented Mar 1, 2024

OpenSearch v2.11.1

Our cluster was stable on an r5.2xlarge instance, hovering at ~10% CPU usage. Then we enabled Windows detectors, and now even an r5.8xlarge isn't enough.

We were only experimenting with detectors, but they essentially brought down our entire cluster.
The main issue boils down to the fact that everything runs on the same 'search' queue: the detector UI is backed by 'search', the detectors themselves are backed by 'search', and so on.

Why is this such a bad idea?
Because the detectors fill up the queue and cause literally millions of searches to be rejected; we observed ~48 million rejections per hour overnight.
While this is partly a tuning and scaling issue, it also completely killed ingestion (our Spark pipeline kept failing to write to OpenSearch and dropped the data into our DLQ), and all dashboards stopped working since the UI also uses the 'search' queue.
So it wasn't just the detectors that were failing; everything started to fail. We couldn't even stop the detector, because that request kept failing as well.
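
For anyone hitting the same wall: the rejections are easy to watch while they happen with the cat thread pool API (a minimal check, not specific to detectors; the column list here is just one reasonable choice):

GET /_cat/thread_pool/search?v&h=node_name,name,active,queue,queue_size,rejected

The rejected counter is cumulative per node, so watching it climb between two calls gives you the rejection rate.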

We have tried tuning the queues, but even a queue size of 100K still fills up, and we're still running into memory issues.
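
For reference, the queue tuning we tried was along these lines in opensearch.yml (a sketch; thread pool queue sizes are static per-node settings, so they need a restart, and 100K is just where we stopped, not a recommendation):

# opensearch.yml (static setting; applies per node after a restart)
thread_pool.search.queue_size: 100000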

Management wanted us to try Detectors, hoping we would no longer have to maintain our own Sigma-based rules engine. But our own engine can do the job with far fewer resources on the exact same data set, and it doesn't affect anything else if it falls behind.

We are no longer moving forward with OpenSearch Security Analytics.

mvanderlee added the bug and untriaged labels Mar 1, 2024
@sbcd90
Collaborator

sbcd90 commented Mar 1, 2024

Hi @mvanderlee, we have a number of performance fixes we're planning to release in 2.13. We're aware of the high CPU and high JVM memory pressure issues caused by running security-analytics detectors.
These issues should go away once the 2.13 release is out.

@sbcd90
Collaborator

sbcd90 commented Mar 1, 2024

Also, one of the optimizations you can already try is using an index alias to configure a detector instead of an index pattern. Here are the steps to do it.

1. ISM Changes

Define Component Template with mappings

PUT /_component_template/test-alias-template458
{
  "template": {
    "mappings": {
      "properties": {
        "hello": {
          "type": "text"
        }
      }
    }
  }
}
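
If you want to double-check that the component template was stored, a quick GET works:

GET /_component_template/test-alias-template458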


Define Index template with the component template

POST /_index_template/test-index-template458
{
  "index_patterns": [
    "test-index458-*"
  ],
  "composed_of": [
    "test-alias-template458"
  ]
}


Create Initial Index

PUT /test-index458-1
{
  "aliases": {
    "test-alias458": {
      "is_write_index": true
    }
  }
}


Index data via the alias

POST /test-alias458/_doc
{
  "hello": "world"
}

Use the alias test-alias458 to create the detector now.
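
The alias is the write target, so an ISM rollover policy (or a manual rollover, sketched below with an arbitrary condition purely for illustration) will create new backing indices that match the index template, inherit the mappings, and keep the detector's alias pointing at the latest write index:

POST /test-alias458/_rollover
{
  "conditions": {
    "max_docs": 1000000
  }
}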

@mvanderlee
Author

@sbcd90 glad to hear it.
Until then, can you confirm whether rejected tasks mean that events are not being analyzed by the detector, and thus will not be alerted upon?

sbcd90 removed the untriaged label Mar 1, 2024
@mvanderlee
Author

We already have aliases, but they don't show up as options in the Data source dropdown; we'll try entering one manually.
It would be great if the UI could show aliases and preferably prioritize them.
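
In case it helps anyone else entering an alias by hand: the cat aliases API shows which aliases exist and which backing index is the write index (using the alias name from the example above):

GET /_cat/aliases/test-alias458?v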

@amsiglan
Collaborator

amsiglan commented Mar 1, 2024

@mvanderlee we are already working on showing the aliases in the dropdown; it should be available in 2.13.

riysaxen-amzn pushed a commit to riysaxen-amzn/security-analytics that referenced this issue Mar 25, 2024

* Fix getAlerts API for standard Alerting monitors (opensearch-project#875)

Signed-off-by: Ashish Agrawal <[email protected]>