Elasticsearch: Instrumentation adds potentially dangerous GET _cluster/health request to every initial request #2360

Closed
erikkessler1 opened this issue Dec 12, 2023 · 11 comments · Fixed by #2377

@erikkessler1

Description

We recently updated a Rails service to a version of this agent that includes the Elasticsearch instrumentation added in #1525. We have a large Elasticsearch cluster backing the service, and when the service with the updated agent version was deployed, our Elasticsearch master nodes began to experience extreme CPU and network usage, which caused the cluster to become unstable.

When sampling the hot threads on the cluster, we found the following trace sample indicating that the node was spending a lot of time generating a ClusterHealthResponse:

app//org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse.(ClusterHealthResponse.java:188)

After investigation, we found that this agent injects a GET _cluster/health request before the first request from the application in order to resolve the cluster_name:

@nr_cluster_name ||= perform_request('GET', '_cluster/health').body['cluster_name']

In a high throughput environment with many clients, these _cluster/health checks can overwhelm the single master that has to handle the requests and make requests to other nodes in the cluster to gather the status.
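For clarity, here is roughly how that lookup behaves (a sketch for illustration only; the perform_request line is taken from the agent code quoted above, the surrounding method is made up):

# Sketch only: the memoized lookup runs once per memoizing object (e.g. per
# client/transport), so every freshly booted process issues its own
# GET _cluster/health before its first real request.
def nr_cluster_name
  @nr_cluster_name ||= perform_request('GET', '_cluster/health').body['cluster_name']
end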

Expected Behavior

Adding instrumentation to our service should not cause the Elasticsearch cluster to become unstable. Given that _cluster/health can be so expensive and disruptive in certain cases, it is dangerous to have the instrumentation silently making the request by default.

Steps to Reproduce

The following script demonstrates what is happening (the GET lines are the client's request log):

c = Elasticsearch::Client.new(url: "...", log: true)
c.cluster.stats # or any other action

GET https://***/_cluster/health [status:200, request:0.033s, query:n/a] <- from the NewRelic agent
GET https://***/ [status:200, request:0.010s, query:n/a] <- from `Elasticsearch::Client#verify_elasticsearch`
GET https://***/_cluster/stats [status:200, request:0.028s, query:n/a] <- the real request

Your Environment

  • Ruby Version: 3.2.2
  • Agent Version: 9.6.0
  • Elasticsearch Version: 7.17.5
@kaylareopelle
Contributor

Hi @erikkessler1! Thanks for bringing this to our attention. We're discussing some fixes as a team and I hope to have the next steps to share with you soon.

In the meantime, if you'd like to benefit from the newer version of the agent while temporarily disabling Elasticsearch instrumentation, you can do so by updating your newrelic.yml config to the following:

common: &default_settings
  instrumentation.elasticsearch: disabled

Alternatively, you can set this value using the environment variable NEW_RELIC_INSTRUMENTATION_ELASTICSEARCH=disabled.

@erikkessler1
Author

Great! Thank you for the update and disabling instructions, @kaylareopelle.

@kaylareopelle
Contributor

Hi @erikkessler1, we have a few ideas about how to solve this problem. This is a little long, so please bear with me!

First idea

The first idea would be to introduce a configuration option, instrumentation.elasticsearch.perform_health_check, that controls whether the agent makes the health check request. The option would default to true (so the health check still happens by default) and might default to false in a future major version. When the option is set to false, the agent would not make the request and would not retrieve the cluster name.

#2369 contains that work on the elasticsearch-perform-health-check-config branch. This could be tested in your environment by installing the newrelic_rpm gem from that branch:

gem 'newrelic_rpm', git: 'https://github.com/newrelic/newrelic-ruby-agent', branch: 'elasticsearch-perform-health-check-config'

And updating your configuration to turn off the health check:

common: &default_settings
  instrumentation.elasticsearch.perform_health_check: false
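For illustration, here is a minimal sketch of how such an option could gate the lookup (not the actual #2369 implementation; only the config key above and the existing perform_request call are taken from this thread):

# Sketch only: skip the cluster-name lookup entirely when the option is false.
def nr_cluster_name
  return unless NewRelic::Agent.config[:'instrumentation.elasticsearch.perform_health_check']

  @nr_cluster_name ||= perform_request('GET', '_cluster/health').body['cluster_name']
end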

Second idea

The second idea, suggested by @gremerritt, would be to use a different endpoint, /_cluster/stats, with master:false and see if it has less of a performance impact than /_cluster/health.

#2374 contains that work, on the branch test_cluster_stats.

This could be tested by updating the installation of newrelic_rpm in your Gemfile to the following:

gem 'newrelic_rpm', git: 'https://github.com/newrelic/newrelic-ruby-agent', branch: 'test_cluster_stats'
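For a rough idea of what that lookup could look like (illustrative only, not the actual #2374 code), the cluster name could be read from the cluster stats API with a master:false node filter, since that response also includes cluster_name:

# Sketch only: _cluster/stats also returns cluster_name; the node filter path
# is an assumption about how master:false would be applied.
@nr_cluster_name ||= perform_request('GET', '_cluster/stats/nodes/master:false').body['cluster_name']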

Request

Would you be willing to test either of these solutions in your environment?

@erikkessler1
Author

I'll test those out!

@erikkessler1
Author

Hey @kaylareopelle, here are my findings from the various branches:

elasticsearch-perform-health-check-config

With the following configuration:

common: &default_settings
  elasticsearch.perform_health_check: false

It performs as expected and there is no extra load on the masters.

test_cluster_stats

As the master:false suggests, this doesn't put extra load on the masters, but it does result in significant load on all the data nodes. As a result, I don't think this option should be pursued.

@joshbranham

Can I make an alternative suggestion on getting the cluster name?

❯ curl http://localhost:9200
{
  "name" : "190de96e2f2c",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "MDchoarLRxyvsgQlSIJaLg",
  "version" : {
    "number" : "7.17.3",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "5ad023604c8d7416c9eb6c0eadb62b14e766caff",
    "build_date" : "2022-04-19T08:11:19.070913226Z",
    "build_snapshot" : false,
    "lucene_version" : "8.11.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
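A minimal Ruby sketch of that suggestion (illustrative only, not the agent's actual change): the root endpoint already includes cluster_name and, unlike _cluster/health, answering it should not require the elected master to gather cluster-wide state.

# Sketch only: read the cluster name from the root path instead of _cluster/health.
@nr_cluster_name ||= perform_request('GET', '/').body['cluster_name']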

@kaylareopelle
Contributor

Hi @erikkessler1, thanks for sharing your results of those tests. We'll keep elasticsearch-perform-health-check-config on standby for now.

Excellent suggestion, @joshbranham! This, plus the discussion on #2374, led me to open #2377, which accesses the cluster name from the root path.

If either of you would like to test that option, you can do so using the test_root_for_cluster_name branch to install the agent:

gem 'newrelic_rpm', git: 'https://github.com/newrelic/newrelic-ruby-agent', branch: 'test_root_for_cluster_name'

If you decide to test it, please let me know how it goes!

@erikkessler1
Author

Using the root to get the cluster name seems like the best option. I didn't see any additional load on either the master or data nodes when load testing with the test_root_for_cluster_name branch.

@kaylareopelle
Contributor

That's fantastic news! Thank you for testing the branch. We'll go with that option and close the other PRs.

This change will be included in our next release, which we expect to happen in early January. Until then, we'll keep the test_root_for_cluster_name branch around so you can continue installing newrelic_rpm from it if you'd like to do so.

@kaylareopelle kaylareopelle linked a pull request Jan 2, 2024 that will close this issue
@kaylareopelle
Contributor

Hi @erikkessler1! Ruby agent version 9.7.0 is hot off the press! 📰

This new version contains the fix for this issue. I'll delete the test_root_for_cluster_name branch tomorrow.
