Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Security Solution] Add pagination to the upgrade/_review endpoint #208361

Closed
Tracked by #201502
xcrzx opened this issue Jan 27, 2025 · 3 comments · Fixed by #211045
Closed
Tracked by #201502

[Security Solution] Add pagination to the upgrade/_review endpoint #208361

xcrzx opened this issue Jan 27, 2025 · 3 comments · Fixed by #211045
Assignees
Labels
8.18 candidate bug Fixes for quality problems that affect the customer experience Feature:Prebuilt Detection Rules Security Solution Prebuilt Detection Rules area impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. performance Team:Detection Rule Management Security Detection Rule Management Team Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. v8.18.0

Comments

@xcrzx
Copy link
Contributor

xcrzx commented Jan 27, 2025

Epic: #174168
Related to: https://github.com/elastic/security-team/issues/11822, #209551, #210544

Summary

This is related to OOM errors investigation (see #incident-1039-multiple-projects-oomkilled-when-installing-updating-fleet-pa in Slack for more details).

Here's a short summary of what happened.
  • This failure is unrelated to Fleet’s package installation. And external memory consumption wasn’t the issue in that case, but heap memory was.
  • There are a couple of /detection_engine/prebuilt_rules/_bootstrap calls in the logs, but since the endpoint responded in less than 1 second, that means the package installation didn’t occur. The endpoint skips installation if the package is already installed, which was the case here.
  • There were two concurrent detection_engine/prebuilt_rules/installation/_perform requests in the logs. These requests are used to install all prebuilt Elastic rules. Normally, browsers don’t send parallel requests like this, and it’s not something that can be easily triggered through the UI. Given there are many other duplicate requests in the proxy logs, I strongly suspect this was a testing environment where two rule installation requests were deliberately sent at the same time from separate browser tabs to test the workflow.
  • Both concurrent _perform requests were handled successfully, even though each took almost one minute to complete.
  • Then, after rules package installation, Kibana frontend sends a detection_engine/prebuilt_rules/installation/_review request to fetch updated information about prebuilt Elastic rules. This request is memory-intensive, because all rules (more than 1,000) are loaded into memory on the server side to calculate their installation status.
  • With two browser tabs open, two concurrent _review requests were sent, both logged with a 499 status (client closed connection). Immediately after, the browser sent two additional _review requests, resulting in Kibana handling four concurrent _review requests at one point.
  • The heavy memory load from handling multiple simultaneous _review requests appears to have caused the OOM. This conclusion can also be confirmed by proxy logs that show many heavy Elasticsearch responses (around 80Mb summed up).

To avoid OOM errors in the future, we need to implement pagination for the _review endpoint so all rules are not fetched into memory. This might be not that straightforward, and considered a long-term solution. A short term solution would be introduce a caching mechanism to skip redundant calculations: #208357

@xcrzx xcrzx added bug Fixes for quality problems that affect the customer experience Feature:Prebuilt Detection Rules Security Solution Prebuilt Detection Rules area impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. performance Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Detection Rule Management Security Detection Rule Management Team Team:Detections and Resp Security Detection Response Team triage_needed labels Jan 27, 2025
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@elasticmachine
Copy link
Contributor

Pinging @elastic/security-detection-rule-management (Team:Detection Rule Management)

@elasticmachine
Copy link
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

xcrzx added a commit that referenced this issue Mar 3, 2025
… size (#211045)

**Resolves: #208361
**Resolves: #210544

## Summary

This PR introduces significant memory consumption improvements to the
prebuilt rule endpoints, ensuring users won't encounter OOM errors on
memory-limited Kibana instances.

Memory consumption testing results provided in
#211045 (comment).

## Details

This PR implements a number of memory usage optimizations to the
prebuilt rule endpoints with the final goal reducing chances of getting
OOM errors. The changes are extensive and require thorough testing
before merging.

The changes are described by the following bullets

- The most significant change is the addition of pagination to the
`upgrade/_review` endpoint. This endpoint was known for causing OOM
errors due to its large and ever-growing response size. With pagination,
it now returns upgrade information for no more than 20-100 rules at a
time, significantly reducing its memory footprint.
- New backend methods, such as
`ruleObjectsClient.fetchInstalledRuleVersions`, have been introduced.
These methods return rule IDs with their corresponding installed
versions, allowing to build a map of outdated rules without loading all
available rules into memory. Previously, all installed rules, along with
their base and target versions, were fetched unconditionally before
filtering for updates.
- The `stats` data structure of the review endpoint has been deprecated
(it can be safely removed after one Serverless release cycle). Since the
endpoint now returns paginated results, building stats is no longer
feasible due to the limited rule set size fetched on the server side. As
the side effect it required removing related Cypress tests asserting
`Update All` disabled when rules can't be updated.
- All changes to the endpoints are backward-compatible. All previously
required returned structures still present in response. All newly added
structures are optional.
- Upgradeable rule tags are now returned from the prebuilt rule status
endpoint.
- The frontend logic has been updated to move sorting and filtering of
prebuilt rules from the client side to the server side.
- The `upgrade/_perform` endpoint has been rewritten to use lightweight
rule version information rather than full rules to determine upgradeable
rules. Additionally, upgrades are now performed in batches of up to 100
rules, further reducing memory usage.
- A dry run option has been added to the upgrade perform endpoint. This
is needed for the "Update all" rules scenario to determine if any rules
contain conflicts and display a confirmation modal to the user.
- An option to skip conflicting rules has been added to the upgrade
endpoint when called with the `ALL_RULES` mode.
- The `install/_review` endpoint's memory consumption has been optimized
by avoiding loading all rules into memory to determine available rules
for installation. Redundant fetching of all base versions has also been
removed, as they do not participate in the calculation.

---------

Co-authored-by: Maxim Palenov <[email protected]>
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Mar 3, 2025
… size (elastic#211045)

**Resolves: elastic#208361
**Resolves: elastic#210544

## Summary

This PR introduces significant memory consumption improvements to the
prebuilt rule endpoints, ensuring users won't encounter OOM errors on
memory-limited Kibana instances.

Memory consumption testing results provided in
elastic#211045 (comment).

## Details

This PR implements a number of memory usage optimizations to the
prebuilt rule endpoints with the final goal reducing chances of getting
OOM errors. The changes are extensive and require thorough testing
before merging.

The changes are described by the following bullets

- The most significant change is the addition of pagination to the
`upgrade/_review` endpoint. This endpoint was known for causing OOM
errors due to its large and ever-growing response size. With pagination,
it now returns upgrade information for no more than 20-100 rules at a
time, significantly reducing its memory footprint.
- New backend methods, such as
`ruleObjectsClient.fetchInstalledRuleVersions`, have been introduced.
These methods return rule IDs with their corresponding installed
versions, allowing to build a map of outdated rules without loading all
available rules into memory. Previously, all installed rules, along with
their base and target versions, were fetched unconditionally before
filtering for updates.
- The `stats` data structure of the review endpoint has been deprecated
(it can be safely removed after one Serverless release cycle). Since the
endpoint now returns paginated results, building stats is no longer
feasible due to the limited rule set size fetched on the server side. As
the side effect it required removing related Cypress tests asserting
`Update All` disabled when rules can't be updated.
- All changes to the endpoints are backward-compatible. All previously
required returned structures still present in response. All newly added
structures are optional.
- Upgradeable rule tags are now returned from the prebuilt rule status
endpoint.
- The frontend logic has been updated to move sorting and filtering of
prebuilt rules from the client side to the server side.
- The `upgrade/_perform` endpoint has been rewritten to use lightweight
rule version information rather than full rules to determine upgradeable
rules. Additionally, upgrades are now performed in batches of up to 100
rules, further reducing memory usage.
- A dry run option has been added to the upgrade perform endpoint. This
is needed for the "Update all" rules scenario to determine if any rules
contain conflicts and display a confirmation modal to the user.
- An option to skip conflicting rules has been added to the upgrade
endpoint when called with the `ALL_RULES` mode.
- The `install/_review` endpoint's memory consumption has been optimized
by avoiding loading all rules into memory to determine available rules
for installation. Redundant fetching of all base versions has also been
removed, as they do not participate in the calculation.

---------

Co-authored-by: Maxim Palenov <[email protected]>
(cherry picked from commit c4a016e)
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Mar 3, 2025
… size (elastic#211045)

**Resolves: elastic#208361
**Resolves: elastic#210544

## Summary

This PR introduces significant memory consumption improvements to the
prebuilt rule endpoints, ensuring users won't encounter OOM errors on
memory-limited Kibana instances.

Memory consumption testing results provided in
elastic#211045 (comment).

## Details

This PR implements a number of memory usage optimizations to the
prebuilt rule endpoints with the final goal reducing chances of getting
OOM errors. The changes are extensive and require thorough testing
before merging.

The changes are described by the following bullets

- The most significant change is the addition of pagination to the
`upgrade/_review` endpoint. This endpoint was known for causing OOM
errors due to its large and ever-growing response size. With pagination,
it now returns upgrade information for no more than 20-100 rules at a
time, significantly reducing its memory footprint.
- New backend methods, such as
`ruleObjectsClient.fetchInstalledRuleVersions`, have been introduced.
These methods return rule IDs with their corresponding installed
versions, allowing to build a map of outdated rules without loading all
available rules into memory. Previously, all installed rules, along with
their base and target versions, were fetched unconditionally before
filtering for updates.
- The `stats` data structure of the review endpoint has been deprecated
(it can be safely removed after one Serverless release cycle). Since the
endpoint now returns paginated results, building stats is no longer
feasible due to the limited rule set size fetched on the server side. As
the side effect it required removing related Cypress tests asserting
`Update All` disabled when rules can't be updated.
- All changes to the endpoints are backward-compatible. All previously
required returned structures still present in response. All newly added
structures are optional.
- Upgradeable rule tags are now returned from the prebuilt rule status
endpoint.
- The frontend logic has been updated to move sorting and filtering of
prebuilt rules from the client side to the server side.
- The `upgrade/_perform` endpoint has been rewritten to use lightweight
rule version information rather than full rules to determine upgradeable
rules. Additionally, upgrades are now performed in batches of up to 100
rules, further reducing memory usage.
- A dry run option has been added to the upgrade perform endpoint. This
is needed for the "Update all" rules scenario to determine if any rules
contain conflicts and display a confirmation modal to the user.
- An option to skip conflicting rules has been added to the upgrade
endpoint when called with the `ALL_RULES` mode.
- The `install/_review` endpoint's memory consumption has been optimized
by avoiding loading all rules into memory to determine available rules
for installation. Redundant fetching of all base versions has also been
removed, as they do not participate in the calculation.

---------

Co-authored-by: Maxim Palenov <[email protected]>
(cherry picked from commit c4a016e)
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Mar 3, 2025
… size (elastic#211045)

**Resolves: elastic#208361
**Resolves: elastic#210544

## Summary

This PR introduces significant memory consumption improvements to the
prebuilt rule endpoints, ensuring users won't encounter OOM errors on
memory-limited Kibana instances.

Memory consumption testing results provided in
elastic#211045 (comment).

## Details

This PR implements a number of memory usage optimizations to the
prebuilt rule endpoints with the final goal reducing chances of getting
OOM errors. The changes are extensive and require thorough testing
before merging.

The changes are described by the following bullets

- The most significant change is the addition of pagination to the
`upgrade/_review` endpoint. This endpoint was known for causing OOM
errors due to its large and ever-growing response size. With pagination,
it now returns upgrade information for no more than 20-100 rules at a
time, significantly reducing its memory footprint.
- New backend methods, such as
`ruleObjectsClient.fetchInstalledRuleVersions`, have been introduced.
These methods return rule IDs with their corresponding installed
versions, allowing to build a map of outdated rules without loading all
available rules into memory. Previously, all installed rules, along with
their base and target versions, were fetched unconditionally before
filtering for updates.
- The `stats` data structure of the review endpoint has been deprecated
(it can be safely removed after one Serverless release cycle). Since the
endpoint now returns paginated results, building stats is no longer
feasible due to the limited rule set size fetched on the server side. As
the side effect it required removing related Cypress tests asserting
`Update All` disabled when rules can't be updated.
- All changes to the endpoints are backward-compatible. All previously
required returned structures still present in response. All newly added
structures are optional.
- Upgradeable rule tags are now returned from the prebuilt rule status
endpoint.
- The frontend logic has been updated to move sorting and filtering of
prebuilt rules from the client side to the server side.
- The `upgrade/_perform` endpoint has been rewritten to use lightweight
rule version information rather than full rules to determine upgradeable
rules. Additionally, upgrades are now performed in batches of up to 100
rules, further reducing memory usage.
- A dry run option has been added to the upgrade perform endpoint. This
is needed for the "Update all" rules scenario to determine if any rules
contain conflicts and display a confirmation modal to the user.
- An option to skip conflicting rules has been added to the upgrade
endpoint when called with the `ALL_RULES` mode.
- The `install/_review` endpoint's memory consumption has been optimized
by avoiding loading all rules into memory to determine available rules
for installation. Redundant fetching of all base versions has also been
removed, as they do not participate in the calculation.

---------

Co-authored-by: Maxim Palenov <[email protected]>
(cherry picked from commit c4a016e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
8.18 candidate bug Fixes for quality problems that affect the customer experience Feature:Prebuilt Detection Rules Security Solution Prebuilt Detection Rules area impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. performance Team:Detection Rule Management Security Detection Rule Management Team Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. v8.18.0
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants