I have ~100 logstash-forwarder 0.4.0 clients sending to Logstash 1.5.4 with logstash-input-lumberjack 1.0.5 behind an AWS ELB.
When Logstash is unavailable (e.g. it is restarting, or has died), the ELB continues to accept connections from logstash-forwarder clients but closes the connection when it realises it cannot proxy to Logstash.
This shows up in the client logs as:
2015/10/05 09:40:40.943298 Connecting to [x.y.a.b]:5043 (the-elb)
2015/10/05 09:40:40.952560 Connected to x.y.a.b
2015/10/05 09:40:40.952692 Read error looking for ack: EOF
2015/10/05 09:40:40.952738 Setting trusted CA from file: /etc/logstash-forwarder/the-elb.pem
2015/10/05 09:40:40.953257 Connecting to [x.y.c.d]:5043 (the-elb)
2015/10/05 09:40:40.959790 Connected to x.y.c.d
2015/10/05 09:40:40.959946 Read error looking for ack: EOF
2015/10/05 09:40:40.959992 Setting trusted CA from file: /etc/logstash-forwarder/the-elb.pem
2015/10/05 09:40:40.960627 Connecting to [x.y.a.b]:5043 (the-elb)
2015/10/05 09:40:40.970599 Connected to x.y.a.b
2015/10/05 09:40:40.970704 Read error looking for ack: EOF
2015/10/05 09:40:40.970749 Setting trusted CA from file: /etc/logstash-forwarder/the-elb.pem
2015/10/05 09:40:40.971342 Connecting to [x.y.c.d]:5043 (the-elb)
2015/10/05 09:40:40.977973 Connected to x.y.c.d
This is essentially the same behaviour as described in issue #293, but I'm less concerned about the cause and much more concerned about the failure mode.
As you can see from the timestamps in the log snippet above, logstash-forwarder enters a tight retry loop in this failure mode, which results in very high CPU usage on the client. Left long enough, this causes a cascading failure of other processes competing for CPU on the same machine as logstash-forwarder.
For now I will attempt to mitigate this with nice(1), but logstash-forwarder should probably limit its retry aggressiveness.
There are several places in the current code where a time.Sleep() is used following a communication failure, and one probably needs to be inserted into the "Read error looking for ack" code path too (see the sketch below).

Existing uses of time.Sleep():
https://github.com/elastic/logstash-forwarder/blob/master/publisher1.go#L65
https://github.com/elastic/logstash-forwarder/blob/master/publisher1.go#L181
https://github.com/elastic/logstash-forwarder/blob/master/publisher1.go#L200
https://github.com/elastic/logstash-forwarder/blob/master/publisher1.go#L211

Likely place for a new time.Sleep() call:
https://github.com/elastic/logstash-forwarder/blob/master/publisher1.go#L112-L113
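To illustrate what I mean, here is a minimal, self-contained sketch of a bounded backoff before re-dialling. The readAck stand-in, the delays, and the loop shape are all assumptions made for this example; this is not the actual publisher1.go code, just the shape of the pause that seems to be missing:

```go
package main

import (
	"fmt"
	"io"
	"time"
)

// readAck stands in for the ack read that currently fails fast; here it always
// returns io.EOF, mimicking the ELB closing the proxied connection.
func readAck() error { return io.EOF }

func main() {
	backoff := 1 * time.Second          // assumed starting delay
	const maxBackoff = 30 * time.Second // assumed cap

	for attempt := 0; attempt < 5; attempt++ { // bounded only so the sketch terminates
		err := readAck()
		if err == nil {
			backoff = 1 * time.Second // reset once an ack is read successfully
			continue
		}
		fmt.Printf("Read error looking for ack: %v (sleeping %v before reconnect)\n", err, backoff)
		time.Sleep(backoff) // the missing pause: avoids the tight reconnect loop
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```

Even a fixed one-second sleep in that path would help; the exponential growth is just a suggestion so that ~100 clients don't hammer the ELB in lockstep during a long Logstash outage.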
Thanks for helping make logstash-forwarder better!
Logstash-forwarder is going away and is being replaced by filebeat and its friend, libbeat. If this is still an issue, would you mind opening a ticket there?