AWS endpoint with varying IP address #1091
Hmm... if I understand it correctly, with AWS you should simply disable sniffing and health checks, as AWS does load-balancing for you, and use the hostname provided by AWS as a single endpoint. What I don't understand is why the … I'm sorry if I misunderstood; I'm not an AWS customer. |
Thank you for responding so quickly. Your suggestion to disable both sniffing and healthchecks sounds good, and I'll try it in a moment.
Yes, and even with healthchecks & sniffing disabled I'm guessing that this problem will appear. I'll post my findings here soon. |
Alright, thank you! That seems to have solved the problem 🎈 Not sure why though 🤔 I was hoping to learn something from this so I figure I should probably give some more context as to what I was going through. And so the problem that I have been trying to resolve happens when the cluster is re-provisioned. This is what I think happens during that phase:
Here's what I don't understand: after step 3, in the case where healthchecks are enabled, why would the healthcheck requests start failing - as opposed to when healthchecks are disabled, why would the normal requests not fail? |
Hmm... let's see. First of all, the whole idea of sniffing and health checks is only necessary because in the early days of ES, load-balancing was done client-side. If you have a server-side solution, which I think is the right solution, you shouldn't need to do any of these things. Just let the server do the right thing and keep the client dumb.

Now, sniffing is the process of initially and periodically finding the list of nodes in the connected cluster. Let's say you initially have a 1-node cluster and use elastic to connect to that cluster with a URL. The client will use that URL to find all nodes in the cluster (1 node only) via the Cluster State API. It will then throw away the initial URL and use the IPs/hostnames reported by the cluster API. Once in a while, this process is re-executed to find new nodes that may have been added to the cluster by an admin. So, eventually, elastic will have a full list of IPs/hostnames to connect to and will use them via round-robin. Notice there are a few edge cases, like re-running this process if we end up with an empty list of nodes for some reason. But let's try to keep it simple.

Health checks serve another purpose. They periodically check the list of nodes and manage the individual state of those nodes. E.g. if elastic tried to send a request to a node that didn't respond, it is marked as dead and no longer used. However, that could just be a blip in the network, so the health check runs periodically and will eventually mark it as alive again. Again, there are some edge cases.

I currently don't see why one would disable sniffing but keep health checks enabled. So maybe they should be disabled as well, automatically, when sniffing is disabled. |
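As a rough illustration of the knobs described above (a sketch only; the import path assumes the v7 branch of this library, and the URL and intervals are placeholders):

```go
package main

import (
	"log"
	"time"

	elastic "github.com/olivere/elastic/v7"
)

func main() {
	// Sniffing periodically discovers the cluster's nodes via the Cluster
	// State API; health checks periodically ping the known nodes and mark
	// them dead or alive. Both intervals below are placeholder values.
	client, err := elastic.NewClient(
		elastic.SetURL("http://127.0.0.1:9200"),        // seed URL (placeholder)
		elastic.SetSniff(true),                         // client-side node discovery
		elastic.SetSnifferInterval(15*time.Minute),     // how often to re-sniff
		elastic.SetHealthcheck(true),                   // periodic liveness checks
		elastic.SetHealthcheckInterval(60*time.Second), // how often to check
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = client
}
```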
That sounds reasonable to me. In any case it might be a good idea to put it into the AWS section of the wiki, to not use either sniffing or healthchecks. |
I changed the docs in the Wiki and advised to disable both sniffing and health checks for AWS Elasticsearch Service. |
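For reference, a minimal sketch of the configuration recommended for AWS Elasticsearch Service here, assuming the v7 import path and a hypothetical endpoint:

```go
package main

import (
	"log"

	elastic "github.com/olivere/elastic/v7"
)

func main() {
	// With AWS Elasticsearch Service, use the single AWS-provided endpoint and
	// let AWS do the load-balancing: disable both sniffing and health checks.
	client, err := elastic.NewClient(
		elastic.SetURL("https://my-domain.eu-west-1.es.amazonaws.com"), // placeholder endpoint
		elastic.SetSniff(false),
		elastic.SetHealthcheck(false),
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = client
}
```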
Great, thanks for helping out 🥇 |
I'm running into the same problem as David, even with healthcheck and sniff turned off. @dwickstrom do you remember if you changed anything on the underlying HTTP client instance maybe? |
Hi @iandees, no I didn't change anything on the HTTP client. Lately however there has been some issue with this, again. Back in May, the way I tested it was by toggling some parameter in the cluster settings, to trigger a cluster "rollover". Recently however, when AWS themselves were triggering an elasticsearch upgrade on their side, that "rollover" did not go well - clients were not able to connect without intervention, just like the incidents I had ~6 months ago. |
Maybe there's still a problem. Reopening. |
There was a change quite recently that addresses an issue on AWS ES with nodes changing IPs particularly. Don't know if this has anything to do with it. #1125 |
Hi all, resurrecting this thread to shed some more light on it. We're seeing this issue as well. After some pretty thorough testing I can replicate the issue, and I don't think the issue is with this library. AWS ES uses DNS-based load-balancing to resolve the hostname to the ES nodes; it's not an EC2-style load balancer. If an HTTP client with keep-alive connections is used (…), then when AWS rotates the nodes and changes the DNS records, an application using this library is none the wiser: it won't do another DNS lookup until the connections are left idle and then terminated. Eventually this library does recognise that requests are failing and resets everything, however this does cause a fairly significant interruption of service. This issue is described well here: golang/go#23427 |
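Since the root cause described above is keep-alive connection re-use, one hedged mitigation is to let idle connections expire quickly, so that new dials (and therefore fresh DNS lookups) happen soon after a rotation. A sketch using a custom transport passed via `SetHttpClient`; the endpoint and timeout are placeholders. Note that under constant traffic connections may never go idle, which is why the thread later converges on explicitly closing idle connections.

```go
package main

import (
	"log"
	"net/http"
	"time"

	elastic "github.com/olivere/elastic/v7"
)

func main() {
	// Close keep-alive connections after 30s of idleness so that subsequent
	// requests dial fresh connections and pick up the current DNS records.
	httpClient := &http.Client{
		Transport: &http.Transport{
			IdleConnTimeout:     30 * time.Second, // placeholder; the default is 90s
			MaxIdleConnsPerHost: 10,
		},
	}

	client, err := elastic.NewClient(
		elastic.SetURL("https://my-domain.eu-west-1.es.amazonaws.com"), // placeholder
		elastic.SetHttpClient(httpClient),
		elastic.SetSniff(false),
		elastic.SetHealthcheck(false),
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = client
}
```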
Thanks for reporting your findings, @g-wilson. |
In this case it seems like clients would benefit from sniffing. |
@Sovietaced I'm not sure that's correct. Sniffing is a process by which the client library asks the ES cluster (not the DNS) for the IP addresses of the nodes, then uses those and watches for changes; that's effectively client-side LB. In the case of DNS-based LB, the ES cluster usually doesn't know nor update its internal IP addresses. Hence, I think, disabling sniffing and healthchecks is the right way to use Elastic on AWS. Again, I'm not an active user of Elastic on AWS ES. The problem, though, is that Go's HTTP transport keeps connections alive and doesn't re-resolve the hostname for each and every request, hence the reference to golang/go#23427. |
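To observe that the rotation happens purely at the DNS level while the hostname stays stable, one can periodically resolve the AWS endpoint; a minimal sketch with a placeholder hostname:

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	const endpoint = "my-domain.eu-west-1.es.amazonaws.com" // placeholder AWS hostname

	// During a blue/green rotation the returned IPs change over time,
	// even though the hostname itself never does.
	for range time.Tick(1 * time.Minute) {
		ips, err := net.LookupHost(endpoint)
		if err != nil {
			log.Printf("lookup failed: %v", err)
			continue
		}
		log.Printf("resolved IPs: %v", ips)
	}
}
```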
@olivere The Java library has the same problem. It resolves an IP address from the ES cluster domain name and caches the IP address of the data node indefinitely. What we notice is that if the node whose IP we hold is no longer a data node, our applications are essentially broken (receiving 503s) until we restart them and they get a new IP address from the AWS ES cluster DNS. This is obviously a pretty terrible user experience that seems ripe for the use of sniffing. |
I ended up testing sniffing with the AWS ES cluster and it appears that the … |
Interesting. Maybe we should accommodate that and, at least, log a warning. I will have to test this out on AWS ES. |
Here is a comparison of how the AWS ES response differs from a normal ES deployment: elastic/elasticsearch-js#1178 (comment). One way to mitigate this might be a custom sniffer that does … |
Thanks for the links. Very helpful. |
For what it's worth, we ended up writing our own custom sniffer and it appears to work well. I forced a blue/green deployment of an AWS ES cluster and watched the IP addresses flip with no downtime. I realize this is a Go library, but folks may find this generally useful. This is the basic logic for a periodic task that runs in the background. Note: this approach depends on having a DNS cache TTL set. The following code is in Kotlin:

```kotlin
val addresses: List<InetAddress>
try {
    // host.hostName is the cluster domain name provided by AWS
    addresses = InetAddress.getAllByName(host.hostName).asList()
} catch (e: UnknownHostException) {
    throw AwsSnifferException("Failed to resolve addresses for ${host.hostName}", e)
}
logger.debug("Sniffed addresses: $addresses")

if (addresses.isEmpty()) {
    logger.warn("No nodes to set")
} else {
    val nodes = addresses.stream()
        // Generate new hosts with the address swapped in. Retain port/scheme.
        .map { HttpHost(it.hostAddress, host.port, host.schemeName) }
        .map { Node(it) }
        .toList()
    logger.debug("Calculated nodes: $nodes")
    restClient.setNodes(nodes)
}
``` |
I'm also running into the exact same issue with AWS. Is there an easy way with this library to force a reconnection to the cluster maybe? -- edit |
Instead of doing a full reconnect / new client, you can call … I'm not proud of this, but we do that on a 15 second interval and it works a treat 🤦♂️ |
Yup this works also. A little bit cleaner than the fresh client approach I guess. |
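A sketch of the interval-based workaround described in the last two comments, assuming the client is built with a custom `*http.Transport` so the same transport can be flushed periodically; the 15-second interval mirrors the comment above and the endpoint is a placeholder:

```go
package main

import (
	"log"
	"net/http"
	"time"

	elastic "github.com/olivere/elastic/v7"
)

func main() {
	transport := &http.Transport{}
	httpClient := &http.Client{Transport: transport}

	client, err := elastic.NewClient(
		elastic.SetURL("https://my-domain.eu-west-1.es.amazonaws.com"), // placeholder
		elastic.SetHttpClient(httpClient),
		elastic.SetSniff(false),
		elastic.SetHealthcheck(false),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Periodically drop idle keep-alive connections so the next request dials
	// a new connection and picks up the current DNS records for the endpoint.
	go func() {
		for range time.Tick(15 * time.Second) {
			transport.CloseIdleConnections()
		}
	}()

	_ = client
	select {} // block forever; real code would tie this to the app lifecycle
}
```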
This commit adds a configuration option `SetCloseIdleConnections` to a client. The effect of enabling it is that whenever the Client finds a dead node, it will call `CloseIdleConnections` on the underlying HTTP transport. This is useful for e.g. AWS Elasticsearch Service: when AWS ES reconfigures the cluster, it may change the underlying IP addresses while keeping the DNS entry stable. If the Client did _not_ close idle connections, the underlying HTTP client would re-use existing HTTP connections and keep using the old IP addresses. See #1091 for a discussion of this problem. The commit also illustrates how to connect to an AWS ES cluster in the recipes in [`recipes/aws-mapping-v4`](https://github.com/olivere/elastic/tree/release-branch.v7/recipes/aws-mapping-v4) and [`recipes/aws-es-client`](https://github.com/olivere/elastic/tree/release-branch.v7/recipes/aws-es-client). See the `ConnectToAWS` method for a blueprint of how to connect to an AWS ES cluster. See #1091
I've been looking into this and am experimenting with an additional option, `SetCloseIdleConnections`, described in the commit above. If some of you could look into this and give it a thumbs up, #1507 might land in one of the next releases. |
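Based on the commit message above, enabling the new option would presumably look like the sketch below; the exact option signature is assumed from the commit and #1507, and the endpoint is a placeholder:

```go
package main

import (
	"log"

	elastic "github.com/olivere/elastic/v7"
)

func main() {
	// With the option enabled, the client calls CloseIdleConnections on the
	// underlying HTTP transport whenever it marks a node as dead, dropping
	// stale keep-alive connections to IPs that no longer serve the cluster.
	client, err := elastic.NewClient(
		elastic.SetURL("https://my-domain.eu-west-1.es.amazonaws.com"), // placeholder
		elastic.SetSniff(false),
		elastic.SetHealthcheck(false),
		elastic.SetCloseIdleConnections(true), // assumed signature, per the commit above
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = client
}
```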
Please use the following questions as a guideline to help me answer
your issue/question without further inquiry. Thank you.
Which version of Elastic are you using?
[x] elastic.v6 (for Elasticsearch 6.x)
Please describe the expected behavior
Hello 👋 We're trying to use this library with an AWS cluster of 3 nodes, specifying the endpoint hostname from AWS as a single entry in the `hosts` key in the library config file. The ideal situation would be one where the client can detect when the IP address changes, re-resolve the hostname and retry the request, so that no requests are dropped during the re-provisioning phase.
Please describe the actual behavior
Requests will fail during the provisioning phase and then, in our case after about 15 minutes, the client will heal itself and requests stop failing.
Because of AWS not exposing the node IPs on the `/_nodes` endpoint, these are my thoughts so far:
With sniffing disabled we see that the single node connection won't be `MarkAsDead`, due to:
elastic/client.go, lines 1204 to 1209 in 60d62e5
With sniffing enabled it's not going to work because sniffing can't be done due to AWS only exposing the load balancer IP. The client won't be able to detect any other nodes:
elastic/client.go, lines 964 to 978 in 60d62e5
Any steps to reproduce the behavior?