Log lost when task_id is occupied #9997

Open
duj4 opened this issue Feb 24, 2025 · 2 comments

duj4 commented Feb 24, 2025

Bug Report

Describe the bug
We have Loki deployed as the output of FluentBit, and owing to maintenance on the K8s cluster, the Loki stack was down as well (roughly 36 hours). After bringing Loki back online this morning, we found that some of the logs had been lost.

Below is the configuration in our environment:

service:
  storage.metrics: on
  storage.sync: normal
  storage.checksum: on
  storage.path: <path_to_data>
  **storage.max_chunks_up: 256**
  storage.backlog.mem_limit: 1G
  storage.delete_irrecoverable_chunks: on
  scheduler.base: 2
  scheduler.cap: 30

pipeline:
  inputs:
    - name: tail
      path: <path_to_file1>
      ...
      storage.type: filesystem
    - name: tail
      path: <path_to_file2>
      ...
      **storage.type: filesystem**

  outputs:
    - name: loki
      ...
      **storage.total_limit_size: 5G**
      retry_limit: no_limits

Please correct me if I have misunderstood anything:
We noticed that the task_id stopped at 2047 even though new chunks were still being created; this appears to be by design: https://github.com/fluent/fluent-bit/blob/v3.2.6/include/fluent-bit/flb_config.h#L291. Since each chunk is handled by one task_id, and taking the chunk size into account, the maximum buffered data should be roughly:
2048 tasks * 2 MB per chunk = 4096 MB ≈ 4 GB (or even less), even though storage.total_limit_size is set to 5 GB or larger.
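
A quick way to cross-check this, assuming the same storage.path placeholder as in the service config above, is to measure how much data actually ended up buffered on disk and compare it with the ~4 GB figure:

# total size of the filesystem buffer (placeholder path from the config above)
du -sh <path_to_data>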

Similar issues:
#8503
#8395

To Reproduce

  • Set storage.type to filesystem
  • Shut down the output destination (Loki in our case)
  • Wait until the task_id reaches 2047
  • Check whether any log files are missing after bringing the output destination back up (a minimal config sketch follows this list)
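
For convenience, here is a minimal config sketch for the reproduction, keeping only the settings relevant to the buffering behaviour (same placeholders as the full config above; <loki_host> is also just a placeholder):

service:
  storage.path: <path_to_data>
  storage.max_chunks_up: 256
  storage.backlog.mem_limit: 1G

pipeline:
  inputs:
    - name: tail
      path: <path_to_file1>
      storage.type: filesystem
  outputs:
    - name: loki
      match: '*'
      host: <loki_host>
      storage.total_limit_size: 5G
      retry_limit: no_limits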

Expected behavior
FluentBit should keep buffering log data up to the configured storage.total_limit_size while the output service is down.

Your Environment

Version used: 3.2.6
Environment name and version (e.g. Kubernetes? What version?): Linux
Server type and version: Linux
Operating System and version: RHEL 8.0

duj4 commented Feb 24, 2025

@edsiper @patrick-stephens @cosmo0920 it would be much appreciated if you could help look into this issue.

pkqsun commented Feb 24, 2025

Just some updates:
1> From what we observed, when using storage.type: filesystem, the local chunk file sizes vary from 4k to 36k up to a maximum of 2M. Do you know why this happens, given that we keep tailing the same log files?

2> As mentioned in #9966 (comment), question 2, the curl command shows total_chunks is 2099 this time, after about 40 hours without the output. Since the total file size (data/tail.1/*) is only about 132M, we are still confused by the mismatch between the number of task_ids and the local chunk files (a sample of the curl check is sketched after this list).

3> Could we add a new configurable parameter, such as max_flb_task, in the service section? The default would be 2048, as defined in https://github.com/fluent/fluent-bit/blob/v3.2.6/include/fluent-bit/flb_config.h#L291, and it could be changed as needed. The actual number of task_ids would then depend on whichever of storage.total_limit_size and max_flb_task is reached first. Any concerns or suggestions about this change?
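
For reference on point 2, a sketch of the curl check, assuming the built-in HTTP server is enabled in the service section (http_server: on, default port 2020) together with the storage.metrics: on setting from the config above:

# total_chunks in the response is the 2099 figure quoted above
curl -s http://127.0.0.1:2020/api/v1/storage

And a purely hypothetical sketch of how the proposal in point 3 might look (max_flb_task is not an existing option, just the suggested name; the default would stay at 2048 to match today's hard-coded task map size):

service:
  max_flb_task: 4096   # proposed option, not implemented today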
