
Logstash 2.1.1 immediate OOM when running simple lumberjack input (regression) #4333

Closed
jamesblackburn opened this issue Dec 10, 2015 · 16 comments

@jamesblackburn

Trying to upgrade from 1.5.5 to 2.1.1 for a simple Logstash lumberjack -> file pipeline, I'm observing an immediate OOM on startup. The OOM appears to be in org.jruby.ext.zlib.JZlibInflate.

Running

logstash -f "/ahlinfra/monit/conf/logs.dev.man.com/init.d/../etc/fwd-receiver/*.conf" -w1

$ ps aux|grep 2.1.1
logstash 36294  255  0.7 5537560 123456 pts/1  Sl+  14:43   0:10 /apps/research/tools/java/current//bin/java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/heapdump.hprof -Xmx1g -Xss2048k -Djffi.boot.library.path=/apps/research/tools/logstash/2.1.1/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/heapdump.hprof -XX:HeapDumpPath=/apps/research/tools/logstash/2.1.1/heapdump.hprof -Xbootclasspath/a:/apps/research/tools/logstash/2.1.1/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/apps/research/tools/logstash/2.1.1/vendor/jruby -Djruby.lib=/apps/research/tools/logstash/2.1.1/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /apps/research/tools/logstash/2.1.1/lib/bootstrap/environment.rb logstash/runner.rb agent -f /ahlinfra/monit/conf/logs.dev.man.com/init.d/../etc/fwd-receiver/*.conf -w1

Configuration

$ cat /ahlinfra/monit/conf/logs.dev.man.com/init.d/../etc/fwd-receiver/*.conf
input {
    lumberjack {
       port => 5043
       ssl_certificate => "/ahlinfra/monit/bin/../conf/logs.dev.man.com/../common/logs.dev.man.com.crt"
       ssl_key => "/ahlinfra/monit/bin/../conf/logs.dev.man.com/../common/logs.dev.man.com.key"
       congestion_threshold => 30
    }
}
filter {
    # Strip the base-path prefixes from the filenames
    mutate { 
        gsub => [
            "file", "^/var/log/", "",
            "file", "^/data/logs/AHL/", "",
            "file", "^/data/logs/", "",
            "file", "^/local/mongo/logs/", ""
            ]
    }

    # If there are no slashes left on name after strip above
    # insert the hostname before the filename
    mutate { add_field => { "[@metadata][svc_slash]" => "%{service}/" } }
    if [@metadata][svc_slash] not in [file] {
        mutate { replace => { "file" => "%{service}/%{file}" } }
    }
}
output {
    # End of NFS service tracking
    file {
        path => "/data/logs/AHL/%{file}-%{+YYYY-MM-dd}"
        message_format => "%{message}"
        flush_interval => 60
    }
}
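
The message_format option in the file output above is deprecated; per the startup warning in the Output section below, the same behavior can be expressed with the line codec. A rough, untested equivalent of that output block would be:

output {
    file {
        path => "/data/logs/AHL/%{file}-%{+YYYY-MM-dd}"
        codec => line { format => "%{message}" }
        flush_interval => 60
    }
}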

Output

You are using a deprecated config setting "message_format" set in file. Deprecated settings will continue to work, but are scheduled for removal from logstash in the future. You can achieve the same behavior with the 'line' codec If you have any questions about this, please visit the #logstash channel on freenode irc. {:name=>"message_format", :plugin=><LogStash::Outputs::File path=>"/data/logs/AHL/%{file}-%{+YYYY-MM-dd}", message_format=>"%{message}", flush_interval=>60>, :level=>:warn}
Settings: User set filter workers: 1, Default filter workers: 12
Logstash startup completed
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /apps/research/tools/logstash/2.1.1/heapdump.hprof ...
Heap dump file created [556095097 bytes in 4.058 secs]
^CSIGINT received. Shutting down the pipeline. {:level=>:warn}
Logstash shutdown completed

HProf analysis

[Two HProf screenshots showing the heap dump analysis]

@ph
Contributor

ph commented Dec 10, 2015

@jamesblackburn can you put the heap dump somewhere? I would like to peek into it.

@jamesblackburn
Author

I need to check what's in that decompressed byte-array string, @ph

Interestingly, one of the reasons I'm trying to upgrade is logstash-forwarder#293. We have a bunch of servers running lumberjack (including tick1b in the screenshot) where forwarding just doesn't work, while it runs reliably for all the other hosts.

Something that might be related (or not) is that on 1.5.5 we sometimes see this exception in JZlibInflate:

{:timestamp=>"2015-12-10T00:18:15.525000+0000", :message=>"Lumberjack input: unhandled exception", :exception=>#<Zlib::BufError: buffer error>, :backtrace=>["org/jruby/ext/zlib/ZStream.java:134:in `finish'", "org/jruby/ext/zlib/JZlibInflate.java:72:in `inflate'", "/apps/research/tools/logstash/current/vendor/bundle/jruby/1.9/gems/jls-lumberjack-0.0.26/lib/lumberjack/server.rb:292:in `compressed_payload'", "/apps/research/tools/logstash/current/vendor/bundle/jruby/1.9/gems/jls-lumberjack-0.0.26/lib/lumberjack/server.rb:163:in `feed'", "/apps/research/tools/logstash/current/vendor/bundle/jruby/1.9/gems/jls-lumberjack-0.0.26/lib/lumberjack/server.rb:332:in `read_socket'", "/apps/research/tools/logstash/current/vendor/bundle/jruby/1.9/gems/jls-lumberjack-0.0.26/lib/lumberjack/server.rb:315:in `run'", "/apps/research/tools/logstash/current/vendor/bundle/jruby/1.9/gems/logstash-input-lumberjack-1.0.6/lib/logstash/inputs/lumberjack.rb:104:in `invoke'", "org/jruby/RubyProc.java:271:in `call'", "/apps/research/tools/logstash/current/vendor/bundle/jruby/1.9/gems/concurrent-ruby-0.9.1-java/lib/concurrent/executor/executor_service.rb:515:in `run'", "Concurrent$$JavaExecutorService$$Job_754841046.gen:13:in `run'"], :level=>:error}

@ph
Contributor

ph commented Dec 10, 2015

I think it might be related. I don't think we have changed how we handle the zlib stuff between 1.5 and 2.1; also, all the memory you see retained is inside the JRuby core. Is it possible you are sending really huge payloads between LSF and logstash from time to time?

If you are upgrading because of that issue, I suggest you take a look at filebeat and logstash-input-beats: https://github.com/elastic/filebeat
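
A minimal filebeat configuration pointing at the same endpoint might look roughly like the sketch below (assuming filebeat 1.x syntax, a beats-capable input listening on the Logstash side, and placeholder log/CA paths):

filebeat:
  prospectors:
    - paths:
        - /var/log/*.log
output:
  logstash:
    hosts: ["logs.dev.man.com:5043"]
    tls:
      certificate_authorities: ["/path/to/logs.dev.man.com.crt"]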

@ph
Contributor

ph commented Dec 10, 2015

@jamesblackburn I believe you should see a path in the data; this might help us find which data is making logstash OOM.

I think LSF doesn't check the size of the payload before sending it; it just sends it regardless. Then the input just buffers the content before extracting it, so this might be the source of the OOM.
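
To illustrate the missing check, a hypothetical guard on the receiving side could look like the sketch below (this is not the actual jls-lumberjack code; MAX_COMPRESSED_BYTES is an arbitrary cap, and capping the compressed size alone still would not bound how far highly repetitive data expands once inflated):

require 'zlib'

# Hypothetical guard, not the actual jls-lumberjack server code:
# refuse to inflate frames whose compressed size exceeds an arbitrary cap.
MAX_COMPRESSED_BYTES = 16 * 1024 * 1024

def inflate_with_cap(payload)
  if payload.bytesize > MAX_COMPRESSED_BYTES
    raise IOError, "compressed payload too large: #{payload.bytesize} bytes"
  end
  # Note: this does not bound the *inflated* size; repetitive log data
  # can still expand far beyond the cap once decompressed.
  Zlib::Inflate.inflate(payload)
end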

@jamesblackburn
Author

I don't think there's an easy way for me to upload the hprof file from here. To me it appears that all the memory has gone to a single JZlibInflate buffer.

Looking at the strings, they look highly repetitive and likely compress very well.

Do you believe that filebeat has different behaviour in this regard?

@ph
Contributor

ph commented Dec 10, 2015

@jamesblackburn I am not sure whether filebeat implements different behavior, but since that is where the development is focused, I would migrate to it to get all the latest features. LSF will eventually be unsupported.

In the byte buffer, do you see the path to the original file? It might be worth a look at the content.

How much memory are you starting Logstash with?

@jamesblackburn
Author

It was with the defaults - 1G. It fails with 2G too, and seems to stay up with 8G.

I notice on the logstash-forwarder, I'm seeing:

2015/12/10 15:24:30.099212 Socket error, will reconnect: write tcp 10.220.53.55:5043: i/o timeout
2015/12/10 15:24:31.100081 Setting trusted CA from file: /app/AHL/monit/bin/../conf/tick1b/../common/logs.dev.man.com.crt
2015/12/10 15:24:31.100635 Connecting to [10.220.53.55]:5043 (logs.dev.man.com) 
2015/12/10 15:24:46.103537 Failed to tls handshake with 10.220.53.55 read tcp 10.220.53.55:5043: i/o timeout
2015/12/10 15:24:47.104454 Connecting to [10.220.53.55]:5043 (logs.dev.man.com) 
2015/12/10 15:25:02.106729 Failed to tls handshake with 10.220.53.55 read tcp 10.220.53.55:5043: i/o timeout
2015/12/10 15:25:03.107971 Connecting to [10.220.53.55]:5043 (logs.dev.man.com) 
2015/12/10 15:25:18.111267 Failed to tls handshake with 10.220.53.55 read tcp 10.220.53.55:5043: i/o timeout
2015/12/10 15:25:19.112278 Connecting to [10.220.53.55]:5043 (logs.dev.man.com) 
2015/12/10 15:25:34.114476 Failed to tls handshake with 10.220.53.55 read tcp 10.220.53.55:5043: i/o timeout
2015/12/10 15:25:35.115543 Connecting to [10.220.53.55]:5043 (logs.dev.man.com) 
2015/12/10 15:25:50.117858 Failed to tls handshake with 10.220.53.55 read tcp 10.220.53.55:5043: i/o timeout
2015/12/10 15:25:51.119049 Connecting to [10.220.53.55]:5043 (logs.dev.man.com) 
2015/12/10 15:26:06.121292 Failed to tls handshake with 10.220.53.55 read tcp 10.220.53.55:5043: i/o timeout
2015/12/10 15:26:07.122333 Connecting to [10.220.53.55]:5043 (logs.dev.man.com) 
2015/12/10 15:26:22.124701 Failed to tls handshake with 10.220.53.55 read tcp 10.220.53.55:5043: i/o timeout
2015/12/10 15:26:23.125701 Connecting to [10.220.53.55]:5043 (logs.dev.man.com) 

The file being sent isn't too large:

-bash-3.2$ ls -lah /data/logs/AHL/tick1b/logstash.log
-rw-r--r-- 1 twsadmin systems 36M Dec 10 15:27 /data/logs/AHL/tick1b/logstash.log

@ph
Contributor

ph commented Dec 10, 2015

@jamesblackburn that is a really small file! How many LSF clients are connected to that logstash instance?

@jamesblackburn
Author

Quite a few... I'm running 8 instances of logstash to receive data, and have configured the forwarder like:

{
  "network": {
    "servers": [ "logs.dev.man.com:5043",
                 "logs.dev.man.com:5044",
                 "logs.dev.man.com:5045",
                 "logs.dev.man.com:5046",
                 "logs.dev.man.com:5047",
                 "logs.dev.man.com:5048",
                 "logs.dev.man.com:5049",
                 "logs.dev.man.com:5050"
                  ],
    "timeout": 30,

My logstash receiver configuration is the one above - I'm only writing to a file output, as simply as possible, but I find that logstash really can't cope with more than a handful of servers sending logs...

Is there a Go version of a receiver for the lumberjack / filebeat protocol? I don't think the JRuby version scales :(

@ph
Contributor

ph commented Dec 10, 2015

The JRuby version should scale to what you want to do; the bottleneck should only be disk IO in this case.

What you are encountering is a bug and we can fix these; I don't have access to the data, so I'm asking the questions I can to debug this. From what I see in the screenshot, I think Thread-2 is a connection thread and the data keeps growing in the buffer more than it should. Why a 36 MB file would generate a ~500 MB heap is a bit concerning.

So either LSF is sending the wrong data, or lumberjack doesn't know when a frame stops and just keeps buffering; this might explain the duplication of data in the byte buffer.

Filebeat has some improvements over LSF in how we transmit data between the node and logstash; I think that might help in this case. So if it's possible for you to try it out, I would look into it.

The LSF i/o timeouts you see are from the logstash server being stuck in a GC loop :(

If it's not possible to upgrade: with the content of the byte buffer you should be able to find the host that generated the log and the problematic file, and with that information it might be possible to recreate a simple test case to debug this issue.

@ph
Contributor

ph commented Dec 10, 2015

By simple test case, I meant 1 LSF + 1 Logstash.

@jamesblackburn
Author

It's actually an issue when sending a large file that has been corrupted by logrotate / java. See the repro here:
elastic/logstash-forwarder#293 (comment)

$ LOGSTASH_VERSION=2.1.1 logstash --config=./logstash-receiver.conf
You are using a deprecated config setting "message_format" set in file. Deprecated settings will continue to work, but are scheduled for removal from logstash in the future. You can achieve the same behavior with the 'line' codec If you have any questions about this, please visit the #logstash channel on freenode irc. {:name=>"message_format", :plugin=><LogStash::Outputs::File path=>"/users/is/jblackburn/logstash-issue/out/%{file}-%{+YYYY-MM-dd}", message_format=>"%{message}", flush_interval=>60>, :level=>:warn}
Settings: Default filter workers: 12
Logstash startup completed
java.lang.OutOfMemoryError: Java heap space

2.1.1 always OOMs, whereas 1.5.5 doesn't.

@jamesblackburn
Author

Just to add, @suyograo: I believe repro details have been provided?

@ph
Contributor

ph commented Jan 12, 2016

I will take a look as soon as possible; thanks for providing a test case for that.

@ph
Contributor

ph commented Feb 17, 2017

I will keep this one open; I am not sure if it's still an issue with FB / the beats input.

@kares
Contributor

kares commented Feb 4, 2021

Closing as defunct; org.jruby.ext.zlib.JZlibInflate is no longer used in the beats input.

@kares kares closed this as completed Feb 4, 2021