Add decode_size_limit_bytes option. #30

Open · wants to merge 1 commit into base: main
Conversation

@andrewvc (Contributor) commented Aug 5, 2016

Resolves #29. This is most useful when people accidentally use this codec on input that is not properly newline delimited, which can easily lead to an OOM.

Superseded by #43

it "should raise an error if the max bytes are exceeded" do
  expect {
    subject.decode(maximum_payload << "z")
  }.to raise_error(RuntimeError, "input buffer full")
end
Contributor:

is there a way to provide more context to this exception? input buffer full feels a little vague, or am I missing something?

Contributor Author:

Unfortunately that exception is provided by the FileWatch library, so it would require a patch there. I should probably wrap and re-raise it.
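Wrapping and re-raising could look like the following sketch. This is illustrative only: `DecodeSizeLimitExceeded` and `extract_with_context` are hypothetical names, not part of the actual plugin or FileWatch API.

```ruby
# Hypothetical sketch: wrap FileWatch's vague "input buffer full"
# RuntimeError in a codec-level error that carries the configured limit.
class DecodeSizeLimitExceeded < StandardError; end

def extract_with_context(buffer, data, limit_bytes)
  buffer.extract(data)
rescue RuntimeError => e
  # Only translate the specific buffer-overflow error; re-raise anything else.
  raise e unless e.message == "input buffer full"
  raise DecodeSizeLimitExceeded,
        "JSON line exceeded decode_size_limit_bytes (#{limit_bytes} bytes): #{e.message}"
end
```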

Contributor Author:

I just submitted https://github.com/jordansissel/ruby-filewatch/pull/82/files to enable us to catch a more precise exception.

# Maximum number of bytes for a single line before a fatal exception is raised,
# which will stop Logstash.
# The default is 20MB, which is quite large for a JSON document.
config :decode_size_limit_bytes, :validate => :number, :default => 20 * (1024 * 1024) # 20MB
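As a rough illustration of the guard this option is meant to provide, a size-limited line tokenizer could look like the following standalone sketch. This is not the plugin's or FileWatch's actual implementation; `BoundedLineBuffer` is a hypothetical name.

```ruby
# Standalone sketch of a size-limited line tokenizer (illustrative only).
# Complete lines are returned; if the unterminated remainder grows past
# the limit, an error is raised instead of letting the buffer grow unbounded.
class BoundedLineBuffer
  DEFAULT_LIMIT = 20 * 1024 * 1024 # mirrors the 20MB default above

  def initialize(limit_bytes = DEFAULT_LIMIT)
    @limit = limit_bytes
    @buffer = +""
  end

  def extract(data)
    @buffer << data
    lines = @buffer.split("\n", -1)
    @buffer = lines.pop # the unterminated tail stays buffered
    raise "input buffer full" if @buffer.bytesize > @limit
    lines
  end
end
```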
Contributor:

  • Is decode the right name? The description says this is bytes for a line, but then we call it decode which isn't something mentioned elsewhere in the docs.
  • Exceeding this will cause a fatal error in Logstash and stop the process? Is this the desired behavior?

Contributor:

If the size limit is exceeded, where do we show that this exception will terminate Logstash? I don't see it when I read through the code.

@andsel (Contributor) commented Jun 7, 2024

Test plan

Used a single-line big JSON file (~1GB), limited the Java heap to 512MB, and processed it with the file input plugin. It results in an OOM.

Generate one big JSON file

Use this script to generate it:

require "json"

part = [
    {:name => "Jannik", :surname => "Sinner"},
    {:name => "Novak", :surname => "Djokovic"},
    {:name => "Rafa", :surname => "Nadal"},
    {:name => "Roger", :surname => "Federer"},
    {:name => "Pete", :surname => "Sampras"},
    {:name => "André", :surname => "Agassi"},
    {:name => "Rod", :surname => "Laver"},
    {:name => "Ivan", :surname => "Lendl"},
    {:name => "Bjorn", :surname => "Borg"},
    {:name => "John", :surname => "McEnroe"},
    {:name => "Jimmy", :surname => "Connors"}
]

json_part = JSON.generate(part)
out_file = File.open("big_single_line.json", "a")
out_file.write "{"

counter = 1
desired_size = 1024 * 1024 * 1024
actual_size = 0
while actual_size < desired_size do
  json_fragment = "\"field_#{counter}\": #{json_part}"
  actual_size += json_fragment.size
  if actual_size < desired_size
    json_fragment += ","
  end
  counter += 1
  out_file.write json_fragment
end
out_file.write "}\r\n"
out_file.flush

puts "Done! output file is #{out_file.size} bytes"
out_file.close

Configure Logstash

In config/jvm.options set

-Xms512m
-Xmx512m

and execute the pipeline

input {
  file {
    path => "/path/to/big_single_line.json"
    sincedb_path => "/tmp/sincedb"
    mode => "read"
    file_completed_action => "log"
    file_completed_log_path => "/tmp/processed.log"
  
    codec => json_lines {
      decode_size_limit_bytes => 32768
    }
  }
}

output {
  stdout {
    codec => rubydebug
  }
}

Configure this PR's code: in the Logstash Gemfile, replace

"logstash-codec-json_lines"

with

"logstash-codec-json_lines", :path => "/Users/andrea/workspace/logstash_plugins/logstash-codec-json_lines"

and execute

bin/logstash-plugin install --no-verify

Result

It fails with the following logs:

[2024-06-07T16:09:54,017][INFO ][filewatch.readmode.handlers.readfile][main][0cdeaf0672f90b760dedf003f2c0dcbca174fd7200057d0d92fa085651619d3f] buffer_extract: a delimiter can't be found in current chunk, maybe there are no more delimiters or the delimiter is incorrect or the text before the delimiter, a 'line', is very large, if this message is logged often try increasing the `file_chunk_size` setting. {"delimiter"=>"\n", "read_position"=>413007872, "bytes_read_count"=>32768, "last_known_file_size"=>1078985215, "file_path"=>"/Users/andrea/workspace/logstash_plugins/logstash-codec-json_lines/big_single_line.json"}
[2024-06-07T16:09:54,164][FATAL][org.logstash.Logstash    ][main][0cdeaf0672f90b760dedf003f2c0dcbca174fd7200057d0d92fa085651619d3f] uncaught error (in thread [main]<file)
java.lang.OutOfMemoryError: Java heap space
	at org.jruby.util.ByteList.<init>(ByteList.java:95) ~[jruby.jar:?]
	at org.jruby.RubyString.newStringLight(RubyString.java:466) ~[jruby.jar:?]
	at org.jruby.util.io.EncodingUtils.setStrBuf(EncodingUtils.java:1281) ~[jruby.jar:?]
	at org.jruby.RubyIO.sysreadCommon(RubyIO.java:3277) ~[jruby.jar:?]
	at org.jruby.RubyIO.sysread(RubyIO.java:3266) ~[jruby.jar:?]
	at java.lang.invoke.LambdaForm$DMH/0x00000008007d2000.invokeVirtual(LambdaForm$DMH) ~[?:?]
	at java.lang.invoke.LambdaForm$MH/0x00000008007e6800.invoke(LambdaForm$MH) ~[?:?]
	at java.lang.invoke.DelegatingMethodHandle$Holder.delegate(DelegatingMethodHandle$Holder) ~[?:?]
	at java.lang.invoke.LambdaForm$MH/0x00000008007d9800.guard(LambdaForm$MH) ~[?:?]
	at java.lang.invoke.DelegatingMethodHandle$Holder.delegate(DelegatingMethodHandle$Holder) ~[?:?]
	at java.lang.invoke.LambdaForm$MH/0x00000008007d9800.guard(LambdaForm$MH) ~[?:?]
	at java.lang.invoke.Invokers$Holder.linkToCallSite(Invokers$Holder) ~[?:?]
	at Users.andrea.workspace.logstash_andsel.vendor.bundle.jruby.$3_dot_1_dot_0.gems.logstash_minus_input_minus_file_minus_4_dot_4_dot_6.lib.filewatch.watched_file.RUBY$method$file_read$0(/Users/andrea/workspace/logstash_andsel/vendor/bundle/jruby/3.1.0/gems/logstash-input-file-4.4.6/lib/filewatch/watched_file.rb:229) ~[?:?]
	at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(DirectMethodHandle$Holder) ~[?:?]
	at java.lang.invoke.LambdaForm$MH/0x000000080083d800.invoke(LambdaForm$MH) ~[?:?]
	at java.lang.invoke.DelegatingMethodHandle$Holder.delegate(DelegatingMethodHandle$Holder) ~[?:?]
	at java.lang.invoke.LambdaForm$MH/0x00000008007c0800.guard(LambdaForm$MH) ~[?:?]
	at java.lang.invoke.DelegatingMethodHandle$Holder.delegate(DelegatingMethodHandle$Holder) ~[?:?]
	at java.lang.invoke.LambdaForm$MH/0x00000008007c0800.guard(LambdaForm$MH) ~[?:?]
	at java.lang.invoke.Invokers$Holder.linkToCallSite(Invokers$Holder) ~[?:?]
	at Users.andrea.workspace.logstash_andsel.vendor.bundle.jruby.$3_dot_1_dot_0.gems.logstash_minus_input_minus_file_minus_4_dot_4_dot_6.lib.filewatch.watched_file.RUBY$method$read_extract_lines$0(/Users/andrea/workspace/logstash_andsel/vendor/bundle/jruby/3.1.0/gems/logstash-input-file-4.4.6/lib/filewatch/watched_file.rb:241) ~[?:?]
	at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(DirectMethodHandle$Holder) ~[?:?]
	at java.lang.invoke.LambdaForm$MH/0x0000000800846c00.invoke(LambdaForm$MH) ~[?:?]
	at java.lang.invoke.DelegatingMethodHandle$Holder.delegate(DelegatingMethodHandle$Holder) ~[?:?]
	at java.lang.invoke.LambdaForm$MH/0x00000008007d9800.guard(LambdaForm$MH) ~[?:?]
	at java.lang.invoke.DelegatingMethodHandle$Holder.delegate(DelegatingMethodHandle$Holder) ~[?:?]
	at java.lang.invoke.LambdaForm$MH/0x00000008007d9800.guard(LambdaForm$MH) ~[?:?]
	at java.lang.invoke.Invokers$Holder.linkToCallSite(Invokers$Holder) ~[?:?]
	at Users.andrea.workspace.logstash_andsel.vendor.bundle.jruby.$3_dot_1_dot_0.gems.logstash_minus_input_minus_file_minus_4_dot_4_dot_6.lib.filewatch.read_mode.handlers.read_file.RUBY$block$controlled_read$0(/Users/andrea/workspace/logstash_andsel/vendor/bundle/jruby/3.1.0/gems/logstash-input-file-4.4.6/lib/filewatch/read_mode/handlers/read_file.rb:50) ~[?:?]
	at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(DirectMethodHandle$Holder) ~[?:?]
	at java.lang.invoke.LambdaForm$MH/0x0000000800f88000.invoke(LambdaForm$MH) ~[?:?]
	at java.lang.invoke.LambdaForm$MH/0x00000008007f8800.invokeExact_MT(LambdaForm$MH) ~[?:?]

@andsel (Contributor) left a review:

As described in my comment above, this doesn't work.
If that's not the correct way to test it, please provide guidance on how to verify the fix.

@andsel (Contributor) commented Aug 27, 2024

The problem with this approach is that the size limit is checked only once the data has been fully loaded into memory:
https://github.com/jordansissel/ruby-filewatch/blob/4ae6ce52e069553516759c4e49389f19f65ec0dd/lib/filewatch/buftok.rb#L68-L74

In this case, as the exception's stack trace shows:

java.lang.OutOfMemoryError: Java heap space
	at org.jruby.util.ByteList.<init>(ByteList.java:95) ~[jruby.jar:?]
	at org.jruby.RubyString.newStringLight(RubyString.java:466) ~[jruby.jar:?]
	at org.jruby.util.io.EncodingUtils.setStrBuf(EncodingUtils.java:1281) ~[jruby.jar:?]
	at org.jruby.RubyIO.sysreadCommon(RubyIO.java:3277) ~[jruby.jar:?]
	at org.jruby.RubyIO.sysread(RubyIO.java:3266) ~[jruby.jar:?]

the error is thrown earlier, while the data is still being read from the IO. It can't be caught by this codec because it is raised by the input, at logstash-input-file-4.4.6/lib/filewatch/watched_file.rb:229:

https://github.com/logstash-plugins/logstash-input-file/blob/55a4a7099f05f29351672417036c1342850c7adc/lib/filewatch/watched_file.rb#L229
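The accumulate-then-check ordering described above can be modeled with a minimal sketch (illustrative only, not the actual buftok.rb code): the chunk is appended before the size test runs, so the memory for oversized data is already allocated by the time the limit fires.

```ruby
# Illustrative model (not the actual buftok.rb): the append happens first,
# so an oversized chunk is fully held in memory before the limit check.
class AccumulateThenCheck
  def initialize(size_limit)
    @size_limit = size_limit
    @input = +""
  end

  def extract(data)
    @input << data                                          # allocation happens here
    raise "input buffer full" if @input.size > @size_limit  # check only afterwards
    lines = @input.split("\n", -1)
    @input = lines.pop
    lines
  end
end
```

Under this model, a codec-level limit cannot prevent the OOM seen in the test plan: the oversized data is materialized before the check can run, and in the reported failure it is materialized even earlier, by the input's file reader.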


Successfully merging this pull request may close these issues.

Add max_line_size option
5 participants