Elasticsearch: Instrumentation adds potentially dangerous GET _cluster/health request to every initial request #2360

Closed
erikkessler1 opened this issue Dec 12, 2023 · 11 comments · Fixed by #2377

@erikkessler1

Description

We recently updated a Rails service to a version of this agent that includes the Elasticsearch instrumentation added in #1525. We have a large Elasticsearch cluster backing the service, and when the service with the updated agent version was deployed, our Elasticsearch master nodes began to experience extreme CPU and network usage, which caused the cluster to become unstable.

When sampling the hot threads on the cluster, we found the following trace sample indicating that the node was spending a lot of time generating a ClusterHealthResponse:

app//org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse.(ClusterHealthResponse.java:188)

After investigation, we found that this agent injects a GET _cluster/health request before the first request from the application in order to resolve the cluster_name:

@nr_cluster_name ||= perform_request('GET', '_cluster/health').body['cluster_name']

In a high throughput environment with many clients, these _cluster/health checks can overwhelm the single master that has to handle the requests and make requests to other nodes in the cluster to gather the status.
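For clarity, here is roughly how that lookup behaves (a sketch for illustration only; the perform_request line is taken from the agent code quoted above, the surrounding method is made up):

# Sketch only: the memoized lookup runs once per memoizing object (e.g. per
# client/transport), so every freshly booted process issues its own
# GET _cluster/health before its first real request.
def nr_cluster_name
  @nr_cluster_name ||= perform_request('GET', '_cluster/health').body['cluster_name']
end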

Expected Behavior

Adding instrumentation to our service should not cause the Elasticsearch cluster to become unstable. Given that _cluster/health can be so expensive and disruptive in certain cases, it is dangerous to have the instrumentation silently making the request by default.

Steps to Reproduce

The following script demonstrates what is happening (the GET lines are the client's request log):

c = Elasticsearch::Client.new(url: "...", log: true)
c.cluster.stats # or any other action

GET https://***/_cluster/health [status:200, request:0.033s, query:n/a] <- from the NewRelic agent
GET https://***/ [status:200, request:0.010s, query:n/a] <- from `Elasticsearch::Client#verify_elasticsearch`
GET https://***/_cluster/stats [status:200, request:0.028s, query:n/a] <- the real request

Your Environment

  • Ruby Version: 3.2.2
  • Agent Version: 9.6.0
  • Elasticsearch Version: 7.17.5
@kaylareopelle
Contributor

Hi @erikkessler1! Thanks for bringing this to our attention. We're discussing some fixes as a team and I hope to have the next steps to share with you soon.

In the meantime, if you'd like to benefit from the newer version of the agent while temporarily disabling Elasticsearch instrumentation, you can do so by updating your newrelic.yml config to the following:

common: &default_settings
  instrumentation.elasticsearch: disabled

Alternatively, you can set this value using the environment variable NEW_RELIC_INSTRUMENTATION_ELASTICSEARCH=disabled.

@erikkessler1
Author

Great! Thank you for the update and disabling instructions, @kaylareopelle.

@kaylareopelle
Contributor

Hi @erikkessler1, we have a few ideas about how to solve this problem. This is a little long, so please bear with me!

First idea

The first idea would be to introduce a configuration option, instrumentation.elasticsearch.perform_health_check, that controls whether the agent makes the health check request. The option would default to true (so the health check still happens by default) and might default to false in a future major version. When the option is set to false, the agent would not make the request and would not retrieve the cluster name.

#2369 contains that work on the elasticsearch-perform-health-check-config branch. This could be tested in your environment by installing the newrelic_rpm gem from that branch:

gem 'newrelic_rpm', git: 'https://github.com/newrelic/newrelic-ruby-agent', branch: 'elasticsearch-perform-health-check-config'

And updating your configuration to turn off the health check:

common: &default_settings
  instrumentation.elasticsearch.perform_health_check: false
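For illustration, here is a minimal sketch of how such an option could gate the lookup (not the actual #2369 implementation; only the config key above and the existing perform_request call are taken from this thread):

# Sketch only: skip the cluster-name lookup entirely when the option is false.
def nr_cluster_name
  return unless NewRelic::Agent.config[:'instrumentation.elasticsearch.perform_health_check']

  @nr_cluster_name ||= perform_request('GET', '_cluster/health').body['cluster_name']
end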

Second idea

The second idea, suggested by @gremerritt, would be to use a different endpoint, /_cluster/stats, with master:false and see if it has less of a performance impact than /_cluster/health.

#2374 contains that work, on the branch test_cluster_stats.

This could be tested by updating the installation of newrelic_rpm in your Gemfile to the following:

gem 'newrelic_rpm', git: 'https://github.com/newrelic/newrelic-ruby-agent', branch: 'test_cluster_stats'
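For a rough idea of what that lookup could look like (illustrative only, not the actual #2374 code), the cluster name could be read from the cluster stats API with a master:false node filter, since that response also includes cluster_name:

# Sketch only: _cluster/stats also returns cluster_name; the node filter path
# is an assumption about how master:false would be applied.
@nr_cluster_name ||= perform_request('GET', '_cluster/stats/nodes/master:false').body['cluster_name']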

Request

Would you be willing to test either of these solutions in your environment?

@erikkessler1
Author

I'll test those out!

@erikkessler1
Author

Hey @kaylareopelle, here are my findings from the various branches:

elasticsearch-perform-health-check-config

With the following configuration:

common: &default_settings
  elasticsearch.perform_health_check: false

It performs as expected and there is no extra load on the masters.

test_cluster_stats

As the master:false suggests, this doesn't put extra load on the masters, but it does result in significant load on all the data nodes. As a result, I don't think this option should be pursued.

@joshbranham

Can I make an alternative suggestion on getting the cluster name?

❯ curl http://localhost:9200
{
  "name" : "190de96e2f2c",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "MDchoarLRxyvsgQlSIJaLg",
  "version" : {
    "number" : "7.17.3",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "5ad023604c8d7416c9eb6c0eadb62b14e766caff",
    "build_date" : "2022-04-19T08:11:19.070913226Z",
    "build_snapshot" : false,
    "lucene_version" : "8.11.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
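A minimal Ruby sketch of that suggestion (illustrative only, not the agent's actual change): the root endpoint already includes cluster_name and, unlike _cluster/health, answering it should not require the elected master to gather cluster-wide state.

# Sketch only: read the cluster name from the root path instead of _cluster/health.
@nr_cluster_name ||= perform_request('GET', '/').body['cluster_name']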

@kaylareopelle
Contributor

Hi @erikkessler1, thanks for sharing your results of those tests. We'll keep elasticsearch-perform-health-check-config on standby for now.

Excellent suggestion, @joshbranham! This, plus the discussion on #2374, led me to open #2377, which accesses the cluster name from the root path.

If either of you would like to test that option, you can do so using the test_root_for_cluster_name branch to install the agent:

gem 'newrelic_rpm', git: 'https://github.com/newrelic/newrelic-ruby-agent', branch: 'test_root_for_cluster_name'

If you decide to test it, please let me know how it goes!

@erikkessler1
Author

Using the root to get the cluster name seems like the best option. I didn't see any additional load on either the master or data nodes when load testing with the test_root_for_cluster_name branch.

@kaylareopelle
Contributor

That's fantastic news! Thank you for testing the branch. We'll go with that option and close the other PRs.

This change will be included in our next release, which we expect to happen in early January. Until then, we'll keep the test_root_for_cluster_name branch around so you can continue installing newrelic_rpm from it if you'd like to do so.

@kaylareopelle kaylareopelle linked a pull request Jan 2, 2024 that will close this issue
@kaylareopelle
Contributor

Hi @erikkessler1! Ruby agent version 9.7.0 is hot off the press! 📰

This new version contains the fix for this issue. I'll delete the test_root_for_cluster_name branch tomorrow.
