
BufferedTokenizer doesn't dice the payload correctly when restarting processing after a buffer-full error #16483

Closed
andsel opened this issue Oct 1, 2024 · 0 comments · Fixed by #16482
andsel commented Oct 1, 2024

Logstash information:

Please include the following information:

  1. Logstash version (e.g. bin/logstash --version): any
  2. Logstash installation source (e.g. built from source, with a package manager: DEB/RPM, expanded from tar or zip archive, docker)
  3. How is Logstash being run (e.g. as a service/service manager: systemd, upstart, etc. Via command line, docker/kubernetes)

Plugins installed: (bin/logstash-plugin list --verbose)

JVM (e.g. java -version):

If the affected version of Logstash is 7.9 (or earlier), or if it is NOT using the bundled JDK or using the 'no-jdk' version in 7.10 (or higher), please provide the following information:

  1. JVM version (java -version)
  2. JVM installation source (e.g. from the Operating System's package manager, from source, etc).
  3. Value of the LS_JAVA_HOME environment variable if set.

OS version (uname -a if on a Unix-like system):

Description of the problem including expected versus actual behavior:
When the BufferedTokenizer is used to dice the input, then after a buffer-full error the input should be consumed up to the next separator, and processing should restart correctly with the data that follows that separator.
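To make the expected semantics concrete, here is a minimal, self-contained Ruby model of that behaviour. It is only an illustration: the class name ExpectedTokenizer, the extract signature, and the returned [tokens, overflowed] pair are assumptions made for this sketch, not Logstash's actual BufferedTokenizer API.

# Illustrative model only - NOT Logstash's BufferedTokenizer implementation.
class ExpectedTokenizer
  def initialize(separator: "\n", size_limit: 100_000)
    @separator  = separator
    @size_limit = size_limit
    @buffer     = +""
    @dropping   = false # true while discarding the tail of an oversized payload
  end

  # Appends data and returns [tokens, overflowed]. When an entity grows past
  # size_limit it is discarded up to the next separator, and tokenizing
  # resumes with the data that follows.
  def extract(data)
    @buffer << data
    tokens = []
    overflowed = false
    while (idx = @buffer.index(@separator))
      entity = @buffer.slice!(0, idx + @separator.length).chomp(@separator)
      if @dropping
        @dropping = false       # the oversized tail ends at this separator
      elsif entity.bytesize > @size_limit
        overflowed = true       # report the overflow, skip the entity
      else
        tokens << entity
      end
    end
    if @buffer.bytesize > @size_limit  # unterminated and already over the limit
      @buffer.clear
      @dropping = true                 # keep discarding until the next separator
      overflowed = true
    end
    [tokens, overflowed]
  end
end

Feeding the two writes from the script below into this model yields [[], false] for the first 90,000-byte fragment and [["{\"b\": \"bbbbbbbbbbbbbbbbbbb\"}"], true] for the second: one overflow notification plus the single valid b token, which is the behaviour this issue asks for.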

Steps to reproduce:
Mostly inspired by logstash-plugins/logstash-codec-json_lines#45 (comment)

  1. Configure Logstash to use the json_lines codec from PR logstash-plugins/logstash-codec-json_lines#45 ("Re-established previous behaviour without a default limit for 'decode_size_limit_bytes'"). In the Gemfile add:
gem "logstash-codec-json_lines", :path => "/path/to/logstash-codec-json_lines"
  2. From a shell, run bin/logstash-plugin install --no-verify
  3. Start Logstash with the following pipeline:
input {
  tcp {
    port => 1234

    codec => json_lines {
      decode_size_limit_bytes => 100000
    }
  }
}

output {
  stdout {
    codec => rubydebug
  }
}
  4. Use the following script to generate some load (a byte-size breakdown follows the script):
require 'socket'
require 'json'

hostname = 'localhost'
port = 1234

socket = TCPSocket.open(hostname, port)

data = {"a" => "a"*105_000}.to_json + "\n"; socket.write(data[0...90_000])
data = {"a" => "a"*105_000}.to_json + "\n"; socket.write(data[90_000..] + "{\"b\": \"bbbbbbbbbbbbbbbbbbb\"}\n")

socket.close
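
For reference, here is the byte arithmetic that makes this script trip the limit (verifiable in irb):

require 'json'

data = {"a" => "a" * 105_000}.to_json + "\n"
data.bytesize                # => 105009 ('{"a":"' + 105000 a's + '"}' + newline)
data[0...90_000].bytesize    # => 90000  first write: no separator, fully buffered
data[90_000..].bytesize      # => 15009  rest of the payload, ends with "\n"

When the second write arrives, the tokenizer already holds 90,000 buffered bytes and receives another 15,008 before the first separator, 105,008 in total, which exceeds decode_size_limit_bytes (100,000) and triggers the buffer-full error mid-payload.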

Provide logs (if relevant):
Logstash generates 3 events:

{
  "message" => "Payload bigger than 100000 bytes",
  "@version" => "1",
  "@timestamp" => 2024-10-01T10:49:55.755601Z,
  "tags" => [
    [0] "_jsonparsetoobigfailure"
  ]
}
{
  "b" => "bbbbbbbbbbbbbbbbbbb",
  "@version" => "1",
  "@timestamp" => 2024-10-01T10:49:55.774574Z
}
{
  "a" => "aaaaa......a"
  "@version" => "1",
  "@timestamp" => 2024-10-01T10:49:55.774376Z
}

Instead, 2 events were expected, as shown below: one with the _jsonparsetoobigfailure error for the message made of as, followed by the valid event with the bs.
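For contrast, the expected output is just the first two of the events above (timestamps illustrative):

{
  "message" => "Payload bigger than 100000 bytes",
  "@version" => "1",
  "@timestamp" => 2024-10-01T10:49:55.755601Z,
  "tags" => [
    [0] "_jsonparsetoobigfailure"
  ]
}
{
  "b" => "bbbbbbbbbbbbbbbbbbb",
  "@version" => "1",
  "@timestamp" => 2024-10-01T10:49:55.774574Z
}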
The extended motivation is explained in logstash-plugins/logstash-codec-json_lines#45 (comment)
