
Logstash 2.1.1 immediate OOM when running simple lumberjack input (regression) #4333

Closed
jamesblackburn opened this issue Dec 10, 2015 · 16 comments

@jamesblackburn

Trying to upgrade from 1.5.5 to 2.1.1 for a simple Logstash lumberjack -> file pipeline, I'm observing an immediate OOM on startup. The OOM appears to be in org.jruby.ext.zlib.JZlibInflate.

Running

logstash -f "/ahlinfra/monit/conf/logs.dev.man.com/init.d/../etc/fwd-receiver/*.conf" -w1

$ ps aux|grep 2.1.1
logstash 36294  255  0.7 5537560 123456 pts/1  Sl+  14:43   0:10 /apps/research/tools/java/current//bin/java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/heapdump.hprof -Xmx1g -Xss2048k -Djffi.boot.library.path=/apps/research/tools/logstash/2.1.1/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/heapdump.hprof -XX:HeapDumpPath=/apps/research/tools/logstash/2.1.1/heapdump.hprof -Xbootclasspath/a:/apps/research/tools/logstash/2.1.1/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/apps/research/tools/logstash/2.1.1/vendor/jruby -Djruby.lib=/apps/research/tools/logstash/2.1.1/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /apps/research/tools/logstash/2.1.1/lib/bootstrap/environment.rb logstash/runner.rb agent -f /ahlinfra/monit/conf/logs.dev.man.com/init.d/../etc/fwd-receiver/*.conf -w1

Configuration

$ cat /ahlinfra/monit/conf/logs.dev.man.com/init.d/../etc/fwd-receiver/*.conf
input {
    lumberjack {
       port => 5043
       ssl_certificate => "/ahlinfra/monit/bin/../conf/logs.dev.man.com/../common/logs.dev.man.com.crt"
       ssl_key => "/ahlinfra/monit/bin/../conf/logs.dev.man.com/../common/logs.dev.man.com.key"
       congestion_threshold => 30
    }
}
filter {
    # Strip the base-path prefixes from the filenames
    mutate { 
        gsub => [
            "file", "^/var/log/", "",
            "file", "^/data/logs/AHL/", "",
            "file", "^/data/logs/", "",
            "file", "^/local/mongo/logs/", ""
            ]
    }

    # If there are no slashes left on name after strip above
    # insert the hostname before the filename
    mutate { add_field => { "[@metadata][svc_slash]" => "%{service}/" } }
    if [@metadata][svc_slash] not in [file] {
        mutate { replace => { "file" => "%{service}/%{file}" } }
    }
}
output {
    # End of NFS service tracking
    file {
        path => "/data/logs/AHL/%{file}-%{+YYYY-MM-dd}"
        message_format => "%{message}"
        flush_interval => 60
    }
}
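
The message_format option in the file output above is deprecated; per the startup warning in the Output section below, the same behavior can be expressed with the line codec. A rough, untested equivalent of that output block would be:

output {
    file {
        path => "/data/logs/AHL/%{file}-%{+YYYY-MM-dd}"
        codec => line { format => "%{message}" }
        flush_interval => 60
    }
}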

Output

You are using a deprecated config setting "message_format" set in file. Deprecated settings will continue to work, but are scheduled for removal from logstash in the future. You can achieve the same behavior with the 'line' codec If you have any questions about this, please visit the #logstash channel on freenode irc. {:name=>"message_format", :plugin=><LogStash::Outputs::File path=>"/data/logs/AHL/%{file}-%{+YYYY-MM-dd}", message_format=>"%{message}", flush_interval=>60>, :level=>:warn}
Settings: User set filter workers: 1, Default filter workers: 12
Logstash startup completed
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /apps/research/tools/logstash/2.1.1/heapdump.hprof ...
Heap dump file created [556095097 bytes in 4.058 secs]
^CSIGINT received. Shutting down the pipeline. {:level=>:warn}
Logstash shutdown completed

HProf analysis

[Two HProf screenshots showing the heap dump analysis]

@ph
Contributor

ph commented Dec 10, 2015

@jamesblackburn can you put the heap dump somewhere? I would like to peek into it.

@jamesblackburn
Author

I need to check what's in that decompressed byte-array string, @ph

Interestingly, one of the reasons I'm trying to upgrade is logstash-forwarder#293. We have a bunch of servers running lumberjack (including tick1b in the screenshot) where forwarding just doesn't work, while it runs reliably for all the other hosts.

Something that might be related (or not) is that on 1.5.5 we sometimes see this exception in JZlibInflate:

{:timestamp=>"2015-12-10T00:18:15.525000+0000", :message=>"Lumberjack input: unhandled exception", :exception=>#<Zlib::BufError: buffer error>, :backtrace=>["org/jruby/ext/zlib/ZStream.java:134:in `finish'", "org/jruby/ext/zlib/JZlibInflate.java:72:in `inflate'", "/apps/research/tools/logstash/current/vendor/bundle/jruby/1.9/gems/jls-lumberjack-0.0.26/lib/lumberjack/server.rb:292:in `compressed_payload'", "/apps/research/tools/logstash/current/vendor/bundle/jruby/1.9/gems/jls-lumberjack-0.0.26/lib/lumberjack/server.rb:163:in `feed'", "/apps/research/tools/logstash/current/vendor/bundle/jruby/1.9/gems/jls-lumberjack-0.0.26/lib/lumberjack/server.rb:332:in `read_socket'", "/apps/research/tools/logstash/current/vendor/bundle/jruby/1.9/gems/jls-lumberjack-0.0.26/lib/lumberjack/server.rb:315:in `run'", "/apps/research/tools/logstash/current/vendor/bundle/jruby/1.9/gems/logstash-input-lumberjack-1.0.6/lib/logstash/inputs/lumberjack.rb:104:in `invoke'", "org/jruby/RubyProc.java:271:in `call'", "/apps/research/tools/logstash/current/vendor/bundle/jruby/1.9/gems/concurrent-ruby-0.9.1-java/lib/concurrent/executor/executor_service.rb:515:in `run'", "Concurrent$$JavaExecutorService$$Job_754841046.gen:13:in `run'"], :level=>:error}

@ph
Contributor

ph commented Dec 10, 2015

I think it might be related. I don't think we have changed how we handle the zlib stuff between 1.5 and 2.1; also, all the memory you see retained is inside the JRuby core. Is it possible you are sending really huge payloads between LSF and logstash from time to time?

If you are upgrading because of that issue, I suggest you take a look at filebeat and logstash-input-beats: https://github.com/elastic/filebeat
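
A minimal filebeat configuration pointing at the same endpoint might look roughly like the sketch below (assuming filebeat 1.x syntax, a beats-capable input listening on the Logstash side, and placeholder log/CA paths):

filebeat:
  prospectors:
    - paths:
        - /var/log/*.log
output:
  logstash:
    hosts: ["logs.dev.man.com:5043"]
    tls:
      certificate_authorities: ["/path/to/logs.dev.man.com.crt"]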

@ph
Contributor

ph commented Dec 10, 2015

@jamesblackburn I believe you should see a path in the data; this might help us find which data is making logstash OOM.

I think LSF doesn't check the size of the payload before sending it; it just sends it regardless. Then the input just buffers the content before extracting it, so this might be the source of the OOM.
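
To illustrate the missing check, a hypothetical guard on the receiving side could look like the sketch below (this is not the actual jls-lumberjack code; MAX_COMPRESSED_BYTES is an arbitrary cap, and capping the compressed size alone still would not bound how far highly repetitive data expands once inflated):

require 'zlib'

# Hypothetical guard, not the actual jls-lumberjack server code:
# refuse to inflate frames whose compressed size exceeds an arbitrary cap.
MAX_COMPRESSED_BYTES = 16 * 1024 * 1024

def inflate_with_cap(payload)
  if payload.bytesize > MAX_COMPRESSED_BYTES
    raise IOError, "compressed payload too large: #{payload.bytesize} bytes"
  end
  # Note: this does not bound the *inflated* size; repetitive log data
  # can still expand far beyond the cap once decompressed.
  Zlib::Inflate.inflate(payload)
end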

@jamesblackburn
Author

I don't think there's an easy way for me to upload the hprof file from here. To me it appears that all the memory has gone to a single JZlibInflate buffer.

Looking at the strings, they look highly repetitive and likely compress very well.

Do you believe that filebeat has different behaviour in this regard?

@ph
Contributor

ph commented Dec 10, 2015

@jamesblackburn I am not sure whether filebeat implements different behavior, but since that is where the development is focused, I would migrate to it to get all the latest features. LSF will eventually be unsupported.

In the byte buffer, do you see the path to the original file? It might be worth a look at the content.

How much memory are you starting Logstash with?

@jamesblackburn
Author

It was with the defaults - 1G. It fails with 2G too, and seems to stay up with 8G.

I notice on the logstash-forwarder, I'm seeing:

2015/12/10 15:24:30.099212 Socket error, will reconnect: write tcp 10.220.53.55:5043: i/o timeout
2015/12/10 15:24:31.100081 Setting trusted CA from file: /app/AHL/monit/bin/../conf/tick1b/../common/logs.dev.man.com.crt
2015/12/10 15:24:31.100635 Connecting to [10.220.53.55]:5043 (logs.dev.man.com) 
2015/12/10 15:24:46.103537 Failed to tls handshake with 10.220.53.55 read tcp 10.220.53.55:5043: i/o timeout
2015/12/10 15:24:47.104454 Connecting to [10.220.53.55]:5043 (logs.dev.man.com) 
2015/12/10 15:25:02.106729 Failed to tls handshake with 10.220.53.55 read tcp 10.220.53.55:5043: i/o timeout
2015/12/10 15:25:03.107971 Connecting to [10.220.53.55]:5043 (logs.dev.man.com) 
2015/12/10 15:25:18.111267 Failed to tls handshake with 10.220.53.55 read tcp 10.220.53.55:5043: i/o timeout
2015/12/10 15:25:19.112278 Connecting to [10.220.53.55]:5043 (logs.dev.man.com) 
2015/12/10 15:25:34.114476 Failed to tls handshake with 10.220.53.55 read tcp 10.220.53.55:5043: i/o timeout
2015/12/10 15:25:35.115543 Connecting to [10.220.53.55]:5043 (logs.dev.man.com) 
2015/12/10 15:25:50.117858 Failed to tls handshake with 10.220.53.55 read tcp 10.220.53.55:5043: i/o timeout
2015/12/10 15:25:51.119049 Connecting to [10.220.53.55]:5043 (logs.dev.man.com) 
2015/12/10 15:26:06.121292 Failed to tls handshake with 10.220.53.55 read tcp 10.220.53.55:5043: i/o timeout
2015/12/10 15:26:07.122333 Connecting to [10.220.53.55]:5043 (logs.dev.man.com) 
2015/12/10 15:26:22.124701 Failed to tls handshake with 10.220.53.55 read tcp 10.220.53.55:5043: i/o timeout
2015/12/10 15:26:23.125701 Connecting to [10.220.53.55]:5043 (logs.dev.man.com) 

The file being sent isn't too large:

-bash-3.2$ ls -lah /data/logs/AHL/tick1b/logstash.log
-rw-r--r-- 1 twsadmin systems 36M Dec 10 15:27 /data/logs/AHL/tick1b/logstash.log

@ph
Contributor

ph commented Dec 10, 2015

@jamesblackburn that is a really small file! How many LSF clients are connected to that logstash instance?

@jamesblackburn
Author

Quite a few... I'm running 8 instances of logstash to receive data, and have configured the forwarder like:

{
  "network": {
    "servers": [ "logs.dev.man.com:5043",
                 "logs.dev.man.com:5044",
                 "logs.dev.man.com:5045",
                 "logs.dev.man.com:5046",
                 "logs.dev.man.com:5047",
                 "logs.dev.man.com:5048",
                 "logs.dev.man.com:5049",
                 "logs.dev.man.com:5050"
                  ],
    "timeout": 30,

My logstash receiver configuration is the one above - I'm only writing to a file output, as simply as possible, but I find that logstash really can't cope with more than a handful of servers sending logs...

Is there a Go version of a receiver for the lumberjack / filebeat protocol? I don't think the JRuby version scales :(

@ph
Contributor

ph commented Dec 10, 2015

The JRuby version should scale to what you want to do; the bottleneck should only be disk IO in this case.

What you are encountering is a bug and we can fix these; I don't have access to the data, so I'm asking the questions I can to debug this. From what I see in the screenshot, I think Thread-2 is a connection thread and the data keeps growing in the buffer more than it should. Why a 36 MB file would generate a ~500 MB heap is a bit concerning.

So either LSF is sending the wrong data, or lumberjack doesn't know when a frame stops and just keeps buffering; this might explain the duplication of data in the byte buffer.

Filebeat has some improvements over LSF in how we transmit data between the node and logstash; I think that might help in this case. So if it's possible for you to try it out, I would look into it.

The LSF i/o timeouts you see are from the logstash server being stuck in a GC loop :(

If it's not possible to upgrade: with the content of the byte buffer you should be able to find the host that generated the log and the problematic file, and with that information it might be possible to recreate a simple test case to debug this issue.

@ph
Contributor

ph commented Dec 10, 2015

By simple test case, I meant 1 LSF + 1 Logstash.

@jamesblackburn
Author

It's actually an issue when sending a large file that has been corrupted by logrotate / java. See the repro here:
elastic/logstash-forwarder#293 (comment)

$ LOGSTASH_VERSION=2.1.1 logstash --config=./logstash-receiver.conf
You are using a deprecated config setting "message_format" set in file. Deprecated settings will continue to work, but are scheduled for removal from logstash in the future. You can achieve the same behavior with the 'line' codec If you have any questions about this, please visit the #logstash channel on freenode irc. {:name=>"message_format", :plugin=><LogStash::Outputs::File path=>"/users/is/jblackburn/logstash-issue/out/%{file}-%{+YYYY-MM-dd}", message_format=>"%{message}", flush_interval=>60>, :level=>:warn}
Settings: Default filter workers: 12
Logstash startup completed
java.lang.OutOfMemoryError: Java heap space

2.1.1 always OOMs, whereas 1.5.5 doesn't.

@jamesblackburn
Author

Just to add, @suyograo: I believe repro details have been provided?

@ph
Contributor

ph commented Jan 12, 2016

I will take a look as soon as possible; thanks for providing a test case for that.

@ph
Contributor

ph commented Feb 17, 2017

I will keep this one open; I am not sure if it's still an issue with FB / the beats input.

@kares
Contributor

kares commented Feb 4, 2021

Closing as defunct; org.jruby.ext.zlib.JZlibInflate is no longer used in the beats input.

@kares kares closed this as completed Feb 4, 2021