AWS endpoint with varying IP address #1091
Hmm... if I understand it correctly, with AWS you should simply disable sniffing and health checks, as AWS does load-balancing for you, and use the hostname provided by AWS as a single endpoint. What I don't understand is why the … I'm sorry if I misunderstood; I'm not an AWS customer. |
Thank you for responding so quickly. Your suggestion to disable both sniffing and healthchecks sounds good, and I'll try it in a moment.
Yes, and even with healthchecks & sniffing disabled I'm guessing that this problem will appear. I'll post my findings here soon. |
Alright, thank you! That seems to have solved the problem 🎈 Not sure why though 🤔 I was hoping to learn something from this so I figure I should probably give some more context as to what I was going through. And so the problem that I have been trying to resolve happens when the cluster is re-provisioned. This is what I think happens during that phase:
Here's what I don't understand: after step 3, in the case where healthchecks are enabled, why would the healthcheck requests start failing - as opposed to when healthchecks are disabled, why would the normal requests not fail? |
Hmm... let's see. First of all, the whole idea of sniffing and health checks is only necessary because in the early days of ES, load-balancing was done client-side. If you have a server-side solution, which I think is the right solution, you shouldn't need to do any of these things. Just let the server do the right thing and keep the client dumb.

Now, sniffing is the process of initially and periodically finding the list of nodes in the connected cluster. Let's say you initially have a 1-node cluster and use elastic to connect to that cluster with a URL. The client will use that URL to find all nodes in the cluster (1 node only) via the Cluster State API. It will then throw away the initial URL and use the IPs/hostnames reported by the cluster API. Once in a while, this process is re-executed to find new nodes that may have been added to the cluster by an admin. So, eventually, elastic will have a full list of IPs/hostnames to connect to and will use them via round-robin. Notice there are a few edge cases, like re-running this process if we end up with an empty list of nodes for some reason. But let's try to keep it simple.

Health checks serve another purpose. They periodically check the list of nodes and manage the individual state of those nodes. E.g. if elastic tried to send a request to a node that didn't respond, it is marked as dead and no longer used. However, that could just be a blip in the network, so the health check runs periodically and will eventually mark it as alive again. Again, there are some edge cases.

I currently don't see why one would disable sniffing but keep health checks enabled. So maybe they should be disabled as well, automatically, when sniffing is disabled. |
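As a rough illustration of the knobs described above (a sketch only; the import path assumes the v7 branch of this library, and the URL and intervals are placeholders):

```go
package main

import (
	"log"
	"time"

	elastic "github.com/olivere/elastic/v7"
)

func main() {
	// Sniffing periodically discovers the cluster's nodes via the Cluster
	// State API; health checks periodically ping the known nodes and mark
	// them dead or alive. Both intervals below are placeholder values.
	client, err := elastic.NewClient(
		elastic.SetURL("http://127.0.0.1:9200"),        // seed URL (placeholder)
		elastic.SetSniff(true),                         // client-side node discovery
		elastic.SetSnifferInterval(15*time.Minute),     // how often to re-sniff
		elastic.SetHealthcheck(true),                   // periodic liveness checks
		elastic.SetHealthcheckInterval(60*time.Second), // how often to check
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = client
}
```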
That sounds reasonable to me. In any case it might be a good idea to put it into the AWS section of the wiki, to not use either sniffing or healthchecks. |
I changed the docs in the Wiki and advised to disable both sniffing and health checks for AWS Elasticsearch Service. |
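For reference, a minimal sketch of the configuration recommended for AWS Elasticsearch Service here, assuming the v7 import path and a hypothetical endpoint:

```go
package main

import (
	"log"

	elastic "github.com/olivere/elastic/v7"
)

func main() {
	// With AWS Elasticsearch Service, use the single AWS-provided endpoint and
	// let AWS do the load-balancing: disable both sniffing and health checks.
	client, err := elastic.NewClient(
		elastic.SetURL("https://my-domain.eu-west-1.es.amazonaws.com"), // placeholder endpoint
		elastic.SetSniff(false),
		elastic.SetHealthcheck(false),
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = client
}
```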
Great, thanks for helping out 🥇 |
I'm running into the same problem as David, even with healthcheck and sniff turned off. @dwickstrom do you remember if you changed anything on the underlying HTTP client instance maybe? |
Hi @iandees, no I didn't change anything on the HTTP client. Lately however there has been some issue with this, again. Back in May, the way I tested it was by toggling some parameter in the cluster settings, to trigger a cluster "rollover". Recently however, when AWS themselves were triggering an elasticsearch upgrade on their side, that "rollover" did not go well - clients were not able to connect without intervention, just like the incidents I had ~6 months ago. |
Maybe there's still a problem. Reopening. |
There was a change quite recently that addresses an issue on AWS ES with nodes changing IPs particularly. Don't know if this has anything to do with it. #1125 |
Hi all, resurrecting this thread to shed some more light on it. We're seeing this issue as well. After some pretty thorough testing I can replicate the issue, and I don't think the issue is with this library. AWS ES uses DNS-based load-balancing to resolve the hostname to the ES nodes; it's not an EC2-style load balancer. If an HTTP client with keep-alive connections is used (…), then when AWS rotates the nodes and changes the DNS records, an application using this library is none the wiser: it won't do another DNS lookup until the connections are left idle and then terminated. Eventually this library does recognise that requests are failing and resets everything, however this does cause a fairly significant interruption of service. This issue is described well here: golang/go#23427 |
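Since the root cause described above is keep-alive connection re-use, one hedged mitigation is to let idle connections expire quickly, so that new dials (and therefore fresh DNS lookups) happen soon after a rotation. A sketch using a custom transport passed via `SetHttpClient`; the endpoint and timeout are placeholders. Note that under constant traffic connections may never go idle, which is why the thread later converges on explicitly closing idle connections.

```go
package main

import (
	"log"
	"net/http"
	"time"

	elastic "github.com/olivere/elastic/v7"
)

func main() {
	// Close keep-alive connections after 30s of idleness so that subsequent
	// requests dial fresh connections and pick up the current DNS records.
	httpClient := &http.Client{
		Transport: &http.Transport{
			IdleConnTimeout:     30 * time.Second, // placeholder; the default is 90s
			MaxIdleConnsPerHost: 10,
		},
	}

	client, err := elastic.NewClient(
		elastic.SetURL("https://my-domain.eu-west-1.es.amazonaws.com"), // placeholder
		elastic.SetHttpClient(httpClient),
		elastic.SetSniff(false),
		elastic.SetHealthcheck(false),
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = client
}
```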
Thanks for reporting your findings, @g-wilson. |
In this case it seems like clients would benefit from sniffing. |
@Sovietaced I'm not sure that's correct. Sniffing is a process by which the client library asks the ES cluster (not the DNS) for the IP addresses of the nodes, then uses those and watches for changes; that's effectively client-side LB. In the case of DNS-based LB, the ES cluster usually doesn't know nor update its internal IP addresses. Hence, I think, disabling sniffing and healthchecks is the right way to use Elastic on AWS. Again, I'm not an active user of Elastic on AWS ES. The problem, though, is that Go's HTTP transport keeps connections alive and doesn't re-resolve the hostname for each and every request, hence the reference to golang/go#23427. |
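To observe that the rotation happens purely at the DNS level while the hostname stays stable, one can periodically resolve the AWS endpoint; a minimal sketch with a placeholder hostname:

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	const endpoint = "my-domain.eu-west-1.es.amazonaws.com" // placeholder AWS hostname

	// During a blue/green rotation the returned IPs change over time,
	// even though the hostname itself never does.
	for range time.Tick(1 * time.Minute) {
		ips, err := net.LookupHost(endpoint)
		if err != nil {
			log.Printf("lookup failed: %v", err)
			continue
		}
		log.Printf("resolved IPs: %v", ips)
	}
}
```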
@olivere The Java library has the same problem. It resolves an IP address from the ES cluster domain name and caches the IP address of the data node indefinitely. What we notice is that if the node whose IP we hold is no longer a data node, our applications are essentially broken (receiving 503s) until we restart them and they get a new IP address from the AWS ES cluster DNS. This is obviously a pretty terrible user experience that seems ripe for the use of sniffing. |
I ended up testing sniffing with the AWS ES cluster and it appears that the … |
Interesting. Maybe we should accommodate that and, at least, log a warning. I will have to test this out on AWS ES. |
Here is a comparison of how the AWS ES response differs from a normal ES deployment: elastic/elasticsearch-js#1178 (comment). One way to mitigate this might be a custom sniffer that does … |
Thanks for the links. Very helpful. |
For what it's worth, we ended up writing our own custom sniffer and it appears to work well. I forced a blue/green deployment of an AWS ES cluster and watched the IP addresses flip with no downtime. I realize this is a Go library, but folks may find this generally useful. This is the basic logic for a periodic task that runs in the background. Note: this approach depends on having a DNS cache TTL set. The following code is in Kotlin:

```kotlin
val addresses: List<InetAddress>
try {
    // host.hostName is the cluster domain name provided by AWS
    addresses = InetAddress.getAllByName(host.hostName).asList()
} catch (e: UnknownHostException) {
    throw AwsSnifferException("Failed to resolve addresses for ${host.hostName}", e)
}
logger.debug("Sniffed addresses: $addresses")

if (addresses.isEmpty()) {
    logger.warn("No nodes to set")
} else {
    val nodes = addresses.stream()
        // Generate new hosts with the address swapped in. Retain port/scheme.
        .map { HttpHost(it.hostAddress, host.port, host.schemeName) }
        .map { Node(it) }
        .toList()
    logger.debug("Calculated nodes: $nodes")
    restClient.setNodes(nodes)
}
``` |
I'm also running into the exact same issue with AWS. Is there an easy way with this library to force a reconnection to the cluster maybe? -- edit |
Instead of doing a full reconnect / new client, you can call … I'm not proud of this, but we do that on a 15 second interval and it works a treat 🤦♂️ |
Yup this works also. A little bit cleaner than the fresh client approach I guess. |
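A sketch of the interval-based workaround described in the last two comments, assuming the client is built with a custom `*http.Transport` so the same transport can be flushed periodically; the 15-second interval mirrors the comment above and the endpoint is a placeholder:

```go
package main

import (
	"log"
	"net/http"
	"time"

	elastic "github.com/olivere/elastic/v7"
)

func main() {
	transport := &http.Transport{}
	httpClient := &http.Client{Transport: transport}

	client, err := elastic.NewClient(
		elastic.SetURL("https://my-domain.eu-west-1.es.amazonaws.com"), // placeholder
		elastic.SetHttpClient(httpClient),
		elastic.SetSniff(false),
		elastic.SetHealthcheck(false),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Periodically drop idle keep-alive connections so the next request dials
	// a new connection and picks up the current DNS records for the endpoint.
	go func() {
		for range time.Tick(15 * time.Second) {
			transport.CloseIdleConnections()
		}
	}()

	_ = client
	select {} // block forever; real code would tie this to the app lifecycle
}
```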
This commit adds a configuration option `SetCloseIdleConnections` to a client. The effect of enabling it is that whenever the Client finds a dead node, it will call `CloseIdleConnections` on the underlying HTTP transport. This is useful for e.g. AWS Elasticsearch Service: when AWS ES reconfigures the cluster, it may change the underlying IP addresses while keeping the DNS entry stable. If the Client did _not_ close idle connections, the underlying HTTP client would re-use existing HTTP connections and keep using the old IP addresses. See #1091 for a discussion of this problem. The commit also illustrates how to connect to an AWS ES cluster in the recipes in [`recipes/aws-mapping-v4`](https://github.com/olivere/elastic/tree/release-branch.v7/recipes/aws-mapping-v4) and [`recipes/aws-es-client`](https://github.com/olivere/elastic/tree/release-branch.v7/recipes/aws-es-client). See the `ConnectToAWS` method for a blueprint of how to connect to an AWS ES cluster. See #1091
I've been looking into this and am experimenting with an additional option, `SetCloseIdleConnections`, described in the commit above. If some of you could look into this and give it a thumbs up, #1507 might land in one of the next releases. |
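Based on the commit message above, enabling the new option would presumably look like the sketch below; the exact option signature is assumed from the commit and #1507, and the endpoint is a placeholder:

```go
package main

import (
	"log"

	elastic "github.com/olivere/elastic/v7"
)

func main() {
	// With the option enabled, the client calls CloseIdleConnections on the
	// underlying HTTP transport whenever it marks a node as dead, dropping
	// stale keep-alive connections to IPs that no longer serve the cluster.
	client, err := elastic.NewClient(
		elastic.SetURL("https://my-domain.eu-west-1.es.amazonaws.com"), // placeholder
		elastic.SetSniff(false),
		elastic.SetHealthcheck(false),
		elastic.SetCloseIdleConnections(true), // assumed signature, per the commit above
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = client
}
```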
Please use the following questions as a guideline to help me answer
your issue/question without further inquiry. Thank you.
Which version of Elastic are you using?
[x] elastic.v6 (for Elasticsearch 6.x)
Please describe the expected behavior
Hello 👋 We're trying to use this library with an AWS cluster of 3 nodes, specifying the endpoint hostname from AWS as a single entry in the `hosts` key in the library config file. The ideal situation would be one where the client can detect when the IP address changes, re-resolve the hostname and retry the request, so that no requests are dropped during the re-provisioning phase.
Please describe the actual behavior
Requests will fail during the provisioning phase and then, in our case after about 15 minutes, the client will heal itself and requests stop failing.
Because of AWS not exposing the node IPs on the `/_nodes` endpoint, these are my thoughts so far:
With sniffing disabled we see that the single node connection won't be `MarkAsDead`, due to:
elastic/client.go, lines 1204 to 1209 in 60d62e5
With sniffing enabled it's not going to work because sniffing can't be done due to AWS only exposing the load balancer IP. The client won't be able to detect any other nodes:
elastic/client.go, lines 964 to 978 in 60d62e5
Any steps to reproduce the behavior?