
BufferedTokenizer doesn't dice the payload correctly when restarting processing after a buffer-full error #16483

Closed
andsel opened this issue Oct 1, 2024 · 0 comments · Fixed by #16482
andsel commented Oct 1, 2024

Logstash information:

Please include the following information:

  1. Logstash version (e.g. bin/logstash --version): any
  2. Logstash installation source (e.g. built from source, with a package manager: DEB/RPM, expanded from tar or zip archive, docker)
  3. How is Logstash being run (e.g. as a service/service manager: systemd, upstart, etc. Via command line, docker/kubernetes)

Plugins installed: (bin/logstash-plugin list --verbose)

JVM (e.g. java -version):

If the affected version of Logstash is 7.9 (or earlier), or if it is NOT using the bundled JDK or using the 'no-jdk' version in 7.10 (or higher), please provide the following information:

  1. JVM version (java -version)
  2. JVM installation source (e.g. from the Operating System's package manager, from source, etc).
  3. Value of the LS_JAVA_HOME environment variable if set.

OS version (uname -a if on a Unix-like system):

Description of the problem including expected versus actual behavior:
When the BufferedTokenizer is used to dice the input, then after a buffer-full error the input should be consumed up to the next separator, and processing should restart correctly with the data that follows that separator.
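To make the expected semantics concrete, here is a minimal, self-contained Ruby model of that behaviour. It is only an illustration: the class name ExpectedTokenizer, the extract signature, and the returned [tokens, overflowed] pair are assumptions made for this sketch, not Logstash's actual BufferedTokenizer API.

# Illustrative model only - NOT Logstash's BufferedTokenizer implementation.
class ExpectedTokenizer
  def initialize(separator: "\n", size_limit: 100_000)
    @separator  = separator
    @size_limit = size_limit
    @buffer     = +""
    @dropping   = false # true while discarding the tail of an oversized payload
  end

  # Appends data and returns [tokens, overflowed]. When an entity grows past
  # size_limit it is discarded up to the next separator, and tokenizing
  # resumes with the data that follows.
  def extract(data)
    @buffer << data
    tokens = []
    overflowed = false
    while (idx = @buffer.index(@separator))
      entity = @buffer.slice!(0, idx + @separator.length).chomp(@separator)
      if @dropping
        @dropping = false       # the oversized tail ends at this separator
      elsif entity.bytesize > @size_limit
        overflowed = true       # report the overflow, skip the entity
      else
        tokens << entity
      end
    end
    if @buffer.bytesize > @size_limit  # unterminated and already over the limit
      @buffer.clear
      @dropping = true                 # keep discarding until the next separator
      overflowed = true
    end
    [tokens, overflowed]
  end
end

Feeding the two writes from the script below into this model yields [[], false] for the first 90,000-byte fragment and [["{\"b\": \"bbbbbbbbbbbbbbbbbbb\"}"], true] for the second: one overflow notification plus the single valid b token, which is the behaviour this issue asks for.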

Steps to reproduce:
Mostly inspired by logstash-plugins/logstash-codec-json_lines#45 (comment)

  1. Configure Logstash to use the json_lines codec from PR logstash-plugins/logstash-codec-json_lines#45 ("Re-established previous behaviour without a default limit for 'decode_size_limit_bytes'"). In the Gemfile add:
gem "logstash-codec-json_lines", :path => "/path/to/logstash-codec-json_lines"
  2. From a shell, run bin/logstash-plugin install --no-verify
  3. Start Logstash with the following pipeline:
input {
  tcp {
    port => 1234

    codec => json_lines {
      decode_size_limit_bytes => 100000
    }
  }
}

output {
  stdout {
    codec => rubydebug
  }
}
  4. Use the following script to generate some load (a byte-size breakdown follows the script):
require 'socket'
require 'json'

hostname = 'localhost'
port = 1234

socket = TCPSocket.open(hostname, port)

data = {"a" => "a"*105_000}.to_json + "\n"; socket.write(data[0...90_000])
data = {"a" => "a"*105_000}.to_json + "\n"; socket.write(data[90_000..] + "{\"b\": \"bbbbbbbbbbbbbbbbbbb\"}\n")

socket.close
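
For reference, here is the byte arithmetic that makes this script trip the limit (verifiable in irb):

require 'json'

data = {"a" => "a" * 105_000}.to_json + "\n"
data.bytesize                # => 105009 ('{"a":"' + 105000 a's + '"}' + newline)
data[0...90_000].bytesize    # => 90000  first write: no separator, fully buffered
data[90_000..].bytesize      # => 15009  rest of the payload, ends with "\n"

When the second write arrives, the tokenizer already holds 90,000 buffered bytes and receives another 15,008 before the first separator, 105,008 in total, which exceeds decode_size_limit_bytes (100,000) and triggers the buffer-full error mid-payload.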

Provide logs (if relevant):
Logstash generates 3 events:

{
  "message" => "Payload bigger than 100000 bytes",
  "@version" => "1",
  "@timestamp" => 2024-10-01T10:49:55.755601Z,
  "tags" => [
    [0] "_jsonparsetoobigfailure"
  ]
}
{
  "b" => "bbbbbbbbbbbbbbbbbbb",
  "@version" => "1",
  "@timestamp" => 2024-10-01T10:49:55.774574Z
}
{
  "a" => "aaaaa......a"
  "@version" => "1",
  "@timestamp" => 2024-10-01T10:49:55.774376Z
}

Instead, 2 events were expected, as shown below: one with the _jsonparsetoobigfailure error for the message made of as, followed by the valid event with the bs.
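For contrast, the expected output is just the first two of the events above (timestamps illustrative):

{
  "message" => "Payload bigger than 100000 bytes",
  "@version" => "1",
  "@timestamp" => 2024-10-01T10:49:55.755601Z,
  "tags" => [
    [0] "_jsonparsetoobigfailure"
  ]
}
{
  "b" => "bbbbbbbbbbbbbbbbbbb",
  "@version" => "1",
  "@timestamp" => 2024-10-01T10:49:55.774574Z
}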
The extended motivation is explained in logstash-plugins/logstash-codec-json_lines#45 (comment)
