[Security Solution] Memory leak during prebuilt rule installation and upgrade #204800

Open
Tracked by #201502
xcrzx opened this issue Dec 18, 2024 · 3 comments
Labels
8.18 candidate, bug, Feature:Prebuilt Detection Rules, impact:high, performance, Team:Detection Rule Management, Team:Detections and Resp, Team: SecuritySolution, v8.18.0

Comments

@xcrzx
Contributor

xcrzx commented Dec 18, 2024

Summary

While testing the prebuilt rule installation workflow locally, I encountered an Out of Memory (OOM) error. The script I was running fetched Fleet packages containing prebuilt rules one by one, installed the rules from the packages, and then upgraded them to the latest version.

script.mjs
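// packages.json is the output of:
//   curl 'https://epr.elastic.co/search?package=security_detection_engine&all=true' > packages.json
// `with` is the standardized import-attributes syntax; the older `assert`
// keyword is deprecated and rejected by newer Node.js releases.
import packages from './packages.json' with { type: 'json' };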

import semver from 'semver';
import axios from 'axios';
import fs from 'fs';
import path from 'path';

const headers = {
  'Content-Type': 'application/json',
  Accept: 'application/json',
  'kbn-xsrf': 'f62a882f-7df0-4cf2-bd46-13566b795301',
  Authorization: 'Basic ZWxhc3RpYzpjaGFuZ2VtZQ==',
};

const installPackage = async (version) => {
  let data = JSON.stringify({
    force: true,
  });

  return axios.request({
    method: 'post',
    maxBodyLength: Infinity,
    url: `http://localhost:5601/kbn/api/fleet/epm/packages/security_detection_engine/${version}`,
    headers: {
      ...headers,
      'elastic-api-version': '2023-10-31',
    },
    data,
  });
};

const installAllRules = async () => {
  let data = JSON.stringify({
    mode: 'ALL_RULES',
  });

  let config = {
    method: 'post',
    maxBodyLength: Infinity,
    url: 'http://localhost:5601/kbn/internal/detection_engine/prebuilt_rules/installation/_perform',
    headers: {
      ...headers,
      'elastic-api-version': '1',
    },
    data,
  };

  return axios.request(config);
};

const reviewUpgrade = async () => {
  let config = {
    method: 'post',
    maxBodyLength: Infinity,
    url: 'http://localhost:5601/kbn/internal/detection_engine/prebuilt_rules/upgrade/_review',
    headers: {
      ...headers,
      'elastic-api-version': '1',
    },
  };

  return axios.request(config);
};

const performUpgrade = async () => {
  let data = JSON.stringify({
    mode: 'ALL_RULES',
  });

  let config = {
    method: 'post',
    maxBodyLength: Infinity,
    url: 'http://localhost:5601/kbn/internal/detection_engine/prebuilt_rules/upgrade/_perform',
    headers: {
      ...headers,
      'elastic-api-version': '1',
    },
    data: data,
  };

  return axios.request(config);
};

const deleteAllRules = async () => {
  let data = JSON.stringify({
    action: 'delete',
    query: '',
  });

  let config = {
    method: 'post',
    maxBodyLength: Infinity,
    url: 'http://localhost:5601/kbn/api/detection_engine/rules/_bulk_action?dry_run=false',
    headers: {
      ...headers,
      'elastic-api-version': '2023-10-31',
    },
    data: data,
  };

  return axios.request(config);
};

const filtered = packages
  .filter((pkg) => pkg.conditions.kibana.version.startsWith('^8'))
  .sort((a, b) => semver.compare(a.version, b.version));

const resultsDir = path.resolve('./results');
if (!fs.existsSync(resultsDir)) {
  fs.mkdirSync(resultsDir);
}

for (const pkg of filtered) {
  const { version } = pkg;
  const performFilePath = path.join(resultsDir, `${version}_perform.json`);

  if (fs.existsSync(performFilePath)) {
    console.log(`Skipping version ${version} as it has already been processed.`);
    continue;
  }

  try {
    console.log('Deleting all rules');
    await deleteAllRules();

    console.log(`Installing package ${pkg.name} version ${version}`);
    await installPackage(version);

    console.log(`Installing all rules`);
    const installAllRulesResponse = await installAllRules();
    const installAllRulesData = JSON.stringify(installAllRulesResponse.data, null, 2);
    fs.writeFileSync(path.join(resultsDir, `${version}_install.json`), installAllRulesData);

    console.log(`Installing the latest package`);
    await installPackage('8.17.1');

    console.log('Reviewing upgrade');
    const reviewResponse = await reviewUpgrade();
    const reviewData = JSON.stringify(reviewResponse.data, null, 2);
    fs.writeFileSync(path.join(resultsDir, `${version}_review.json`), reviewData);

    console.log('Performing upgrade');
    const performResponse = await performUpgrade();
    const performData = JSON.stringify(performResponse.data, null, 2);
    fs.writeFileSync(performFilePath, performData);
  } catch (error) {
    console.error(`Error processing version ${version}:`, error);
  }
}

After running for a couple of hours, the script failed with an OOM error. Below is the script output at the moment of failure:

Installing package security_detection_engine version 8.13.1
Installing all rules
Installing the latest package
Reviewing upgrade
Performing upgrade
Error processing version 8.13.1: AxiosError: Request failed with status code 502
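
A 502 status typically comes from a proxy in front of the server rather than from an API handler, which suggests the Kibana process itself had gone away (an assumption about this local setup). Because the script's catch block only logs the error, the loop would then keep sending requests for the remaining versions. A minimal sketch of a check that could stop it early follows. `isServerGone` is a hypothetical helper, not part of the original script, and it assumes the error shape axios uses (`isAxiosError`, plus `response.status` when a response arrived):

```javascript
// Hypothetical helper, not part of the original script.
// Returns true when a failure suggests the server is unreachable,
// rather than the request having been rejected by an API handler.
function isServerGone(error) {
  // Non-axios errors (for example a failed file write) say nothing about the server.
  if (!error || !error.isAxiosError) return false;
  // No response at all: connection refused, reset, or timed out.
  if (!error.response) return true;
  // Gateway-style statuses are what a proxy returns when its upstream is down.
  return [502, 503, 504].includes(error.response.status);
}

// Possible use inside the script's catch block:
//   if (isServerGone(error)) {
//     console.error('Kibana appears to be down, stopping.');
//     break;
//   }
```

Treating 503 and 504 the same way as 502 is a judgment call; a stricter variant could match 502 only.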

Error Details

The error message itself was:

<--- Last few GCs --->

[38238:0x148008000] 10550599 ms: Scavenge 8004.3 (8197.0) -> 7996.3 (8197.2) MB, 9.25 / 0.00 ms  (average mu = 0.054, current mu = 0.010) allocation failure;
[38238:0x148008000] 10550624 ms: Scavenge 8009.1 (8197.5) -> 8001.2 (8199.2) MB, 10.29 / 0.00 ms  (average mu = 0.054, current mu = 0.010) allocation failure;
[38238:0x148008000] 10551378 ms: Scavenge (reduce) 8013.8 (8199.2) -> 8006.0 (8179.0) MB, 13.96 / 0.00 ms  (average mu = 0.054, current mu = 0.010) allocation failure;


<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

Findings

I reran the script while monitoring memory consumption over time and confirmed a steady increase in memory usage, from ~2.5 GB at the start to ~6 GB within a couple of hours. Additionally, even after stopping the package test script, Kibana's memory consumption remained at a stable, elevated level.

This strongly indicates a memory leak in Kibana.
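
The kind of monitoring described above could be scripted roughly as follows. This is a sketch under assumptions, not the method actually used for the numbers in this issue: `readRssMb` and `growthMbPerHour` are hypothetical helpers, the sampler assumes macOS or Linux (where `ps -o rss=` prints the resident set size in KiB), and RSS is only a proxy for the V8 heap.

```javascript
// Hypothetical helpers for tracking a process's memory over time.

// Read the resident set size (RSS) of a process in MB using `ps`.
// Assumes macOS or Linux, where `ps -o rss=` prints a value in KiB.
async function readRssMb(pid) {
  const { execFile } = await import('node:child_process');
  const { promisify } = await import('node:util');
  const { stdout } = await promisify(execFile)('ps', ['-o', 'rss=', '-p', String(pid)]);
  return Number(stdout.trim()) / 1024;
}

// Least-squares slope of memory over time, in MB per hour.
// `samples` is an array of { timeMs, rssMb } objects.
function growthMbPerHour(samples) {
  if (samples.length < 2) return 0;
  const hours = samples.map((s) => s.timeMs / 3_600_000);
  const meanX = hours.reduce((a, b) => a + b, 0) / hours.length;
  const meanY = samples.reduce((a, s) => a + s.rssMb, 0) / samples.length;
  let sxy = 0;
  let sxx = 0;
  samples.forEach((s, i) => {
    sxy += (hours[i] - meanX) * (s.rssMb - meanY);
    sxx += (hours[i] - meanX) ** 2;
  });
  return sxx === 0 ? 0 : sxy / sxx;
}

// Possible usage (kibanaPid is a placeholder for the server's process id):
//   const samples = [];
//   setInterval(async () => {
//     samples.push({ timeMs: Date.now(), rssMb: await readRssMb(kibanaPid) });
//     console.log(growthMbPerHour(samples).toFixed(1), 'MB/hour');
//   }, 60_000);
```

A slope that stays positive across many samples, and memory that does not fall back once the load stops, is the pattern that points to a leak rather than to ordinary garbage-collection noise.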

Additional Notes

The error occurred during the upgrade perform step, but this does not necessarily mean that the memory leak exists in that specific API handler. It could be in any of the following API handlers used by the script:

  1. POST /api/fleet/epm/packages/security_detection_engine/${version}
  2. POST /internal/detection_engine/prebuilt_rules/installation/_perform
  3. POST /internal/detection_engine/prebuilt_rules/upgrade/_review
  4. POST /internal/detection_engine/prebuilt_rules/upgrade/_perform
  5. POST /api/detection_engine/rules/_bulk_action?dry_run=false

Further investigation is required to pinpoint the exact source of the memory leak.
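
One way to narrow this down is to exercise each handler on its own and compare memory growth per endpoint. The following is a minimal sketch of such a harness, not a verified reproduction. `measureEndpoint` is a hypothetical name, and both the request function and the memory reader are supplied by the caller. Note that the upgrade endpoints only do meaningful work when older rules are installed, so some setup between iterations may still be needed, and individual deltas are noisy because of garbage-collection timing.

```javascript
// Hypothetical harness: call ONE endpoint repeatedly and report memory growth.
// `callOnce` performs a single request; `readMemoryMb` resolves to the
// server's current memory in MB. Both are supplied by the caller.
async function measureEndpoint(name, callOnce, readMemoryMb, iterations = 20) {
  const beforeMb = await readMemoryMb();
  let failures = 0;
  for (let i = 0; i < iterations; i++) {
    try {
      await callOnce();
    } catch {
      failures += 1; // keep going; count failures instead of aborting
    }
  }
  const afterMb = await readMemoryMb();
  return { name, iterations, failures, deltaMb: afterMb - beforeMb };
}

// Possible usage with a function from the script above
// (readKibanaMemoryMb is a placeholder the caller would implement):
//   const result = await measureEndpoint('upgrade/_review', reviewUpgrade, readKibanaMemoryMb);
//   console.log(result);
```

Running the same number of iterations against each of the five handlers and comparing `deltaMb` would give a rough ranking of suspects, which a heap snapshot comparison could then confirm.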

xcrzx added the 8.18 candidate, bug, Feature:Prebuilt Detection Rules, Team: SecuritySolution, Team:Detection Rule Management, Team:Detections and Resp, and triage_needed labels on Dec 18, 2024
@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Contributor

Pinging @elastic/security-detection-rule-management (Team:Detection Rule Management)

banderror added the performance, impact:high, and v8.18.0 labels and removed the triage_needed label on Dec 20, 2024
banderror removed their assignment on Dec 20, 2024