Elasticsearch: Instrumentation adds potentially dangerous GET _cluster/health request to every initial request
#2360
Comments
Hi @erikkessler1! Thanks for bringing this to our attention. We're discussing some fixes as a team and I hope to have the next steps to share with you soon. In the meantime, if you'd like to benefit from the newer version of the agent while temporarily disabling Elasticsearch instrumentation, you can do so by updating your
common: &default_settings
  instrumentation.elasticsearch: disabled
Alternatively, you can set this value using the environment variable
Great! Thank you for the update and disabling instructions, @kaylareopelle.
Hi @erikkessler1, we have a few ideas about how to solve this problem. This is a little long, so please bear with me!
First idea
The first idea would be to introduce a configuration option that stops the health check request. This would be off by default and might be turned on by default in a future major version. In this case, when the configuration is set to #2369 contains that work using the branch. This could be tested in your environment by installing the
gem 'newrelic_rpm', git: 'https://github.com/newrelic/newrelic-ruby-agent', branch: 'elasticsearch-perform-health-check-config'
And updating your configuration to turn off the health check:
common: &default_settings
  instrumentation.elasticsearch.perform_health_check: false
Second idea
The second idea, suggested by @gremerritt, would be to use a different endpoint, #2374 contains that work, on the branch This could be tested by updating the installation of
gem 'newrelic_rpm', git: 'https://github.com/newrelic/newrelic-ruby-agent', branch: 'elasticsearch-perform-health-check-config'
Request
Would you be willing to test either of these solutions in your environment?
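The first idea above, a boolean option that gates the cluster name lookup, can be illustrated with a minimal Ruby sketch. This is not the agent's code: `Settings` and `resolve_cluster_name` are hypothetical names, and returning `nil` when the option is off is an assumption about what a disabled lookup might yield.

```ruby
# Hypothetical settings object; the real agent reads its configuration differently.
Settings = Struct.new(:perform_health_check)

# Returns the cluster name, or nil when the lookup is disabled (an assumption).
# The block stands in for whatever request actually fetches the name.
def resolve_cluster_name(settings)
  return nil unless settings.perform_health_check

  yield
end

calls = 0
lookup = lambda do
  calls += 1 # count how often the (simulated) request is made
  'docker-cluster'
end

enabled  = resolve_cluster_name(Settings.new(true), &lookup)
disabled = resolve_cluster_name(Settings.new(false), &lookup)
puts "enabled: #{enabled.inspect}, disabled: #{disabled.inspect}, lookups: #{calls}"
```

With the option off, the block is never invoked, so no request reaches the cluster.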
I'll test those out!
Hey @kaylareopelle, here are my findings from the various branches:
elasticsearch-perform-health-check-config
With the following configuration:
common: &default_settings
  elasticsearch.perform_health_check: false
It performs as expected and there is no extra load on the masters.
test_cluster_stats
As the
Can I make an alternative suggestion on getting the cluster name?
❯ curl http://localhost:9200
{
"name" : "190de96e2f2c",
"cluster_name" : "docker-cluster",
"cluster_uuid" : "MDchoarLRxyvsgQlSIJaLg",
"version" : {
"number" : "7.17.3",
"build_flavor" : "default",
"build_type" : "docker",
"build_hash" : "5ad023604c8d7416c9eb6c0eadb62b14e766caff",
"build_date" : "2022-04-19T08:11:19.070913226Z",
"build_snapshot" : false,
"lucene_version" : "8.11.1",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}
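As a rough illustration of this suggestion, the cluster name can be read straight from the root-path response body. The Ruby sketch below parses an abridged copy of the response shown above; `cluster_name_from_root` is a hypothetical helper, not part of the agent or of any Elasticsearch client.

```ruby
require 'json'

# Sample root-path response, abridged from the curl output above.
ROOT_RESPONSE = <<~JSON
  {
    "name" : "190de96e2f2c",
    "cluster_name" : "docker-cluster",
    "version" : { "number" : "7.17.3" },
    "tagline" : "You Know, for Search"
  }
JSON

# Hypothetical helper: returns the cluster name from a root-path response body,
# or nil if the body is not valid JSON or does not carry the field.
def cluster_name_from_root(body)
  parsed = JSON.parse(body)
  parsed.is_a?(Hash) ? parsed['cluster_name'] : nil
rescue JSON::ParserError
  nil
end

puts cluster_name_from_root(ROOT_RESPONSE) # prints "docker-cluster"
```

Any node can answer the root path from its own state, which is consistent with the load-test results reported later in the thread.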
Hi @erikkessler1, thanks for sharing your results of those tests. We'll keep
Excellent suggestion, @joshbranham! This, plus the discussion on #2374, has brought me to make #2377 to access the cluster name from the root path. If either of you would like to test that option, you can do so using the
gem 'newrelic_rpm', git: 'https://github.com/newrelic/newrelic-ruby-agent', branch: 'test_root_for_cluster_name'
If you decide to test it, please let me know how it goes!
Using the root to get the cluster name seems like the best option. I didn't see any additional load on either the master or data nodes when load testing with the
That's fantastic news! Thank you for testing the branch. We'll go with that option and close the other PRs. This change will be included in our next release, which we expect to happen in early January. Until then, we'll keep the
Hi @erikkessler1! Ruby agent version 9.7.0 is hot off the press! 📰 This new version contains the fix for this issue. I'll delete the
Description
We recently updated a Rails service to a version of this agent that includes the Elasticsearch instrumentation added in #1525. We have a large Elasticsearch cluster backing the service, and when the service with the updated agent version was deployed, our Elasticsearch master nodes began to experience extreme CPU and network usage which caused the cluster to become unstable.
When sampling the hot threads on the cluster we found the following trace sample indicating the node was spending a lot of time generating a ClusterHealthResponse:
After investigation, we found that this agent injects a GET _cluster/health request before the first request from the application in order to resolve the cluster_name:
newrelic-ruby-agent/lib/new_relic/agent/instrumentation/elasticsearch/instrumentation.rb
Line 59 in 713200b
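The pattern described above can be sketched in Ruby to show why the extra request scales with the number of clients. This is an illustration only: `FakeClient` and `InstrumentedClient` are hypothetical stand-ins, and the sketch assumes the cluster name is memoized per client instance, which the referenced source line would need to confirm.

```ruby
# Stand-in for an Elasticsearch client; it only records the requests it receives.
class FakeClient
  attr_reader :requests

  def initialize
    @requests = []
  end

  def perform_request(method, path)
    @requests << "#{method} #{path}"
    path == '_cluster/health' ? { 'cluster_name' => 'docker-cluster' } : {}
  end
end

# Hypothetical wrapper: resolves the cluster name lazily, once per client,
# by issuing an extra request ahead of the first application request.
class InstrumentedClient
  def initialize(client)
    @client = client
    @cluster_name = nil
  end

  def perform_request(method, path)
    @cluster_name ||= @client.perform_request('GET', '_cluster/health')['cluster_name']
    @client.perform_request(method, path)
  end
end

fakes = Array.new(3) { FakeClient.new }
fakes.each do |fake|
  wrapped = InstrumentedClient.new(fake)
  2.times { wrapped.perform_request('GET', 'my-index/_search') }
end

health_checks = fakes.sum { |fake| fake.requests.count('GET _cluster/health') }
puts "health checks issued: #{health_checks}" # prints "health checks issued: 3"
```

Under this assumption, each new client instance costs one health request no matter how many application requests follow.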
In a high throughput environment with many clients, these _cluster/health checks can overwhelm the single master that has to handle the requests and make requests to other nodes in the cluster to gather the status.
Expected Behavior
Adding instrumentation to our service should not cause the Elasticsearch cluster to become unstable. Given that _cluster/health can be so expensive and disruptive in certain cases, it is dangerous to have the instrumentation silently making the request by default.
Steps to Reproduce
The following script gives a demonstration of what is happening:
Your Environment