Log lost when task_id is occupied #9997

Open
duj4 opened this issue Feb 24, 2025 · 2 comments

duj4 commented Feb 24, 2025

Bug Report

Describe the bug
We have Loki deployed as the output of FluentBit, and owing to maintenance on the K8s cluster, the Loki stack was down as well (roughly 36 hours). After bringing Loki back online this morning, we found that some of the logs had been lost.

Below is the configuration in our environment:

service:
  storage.metrics: on
  storage.sync: normal
  storage.checksum: on
  storage.path: <path_to_data>
  **storage.max_chunks_up: 256**
  storage.backlog.mem_limit: 1G
  storage.delete_irrecoverable_chunks: on
  scheduler.base: 2
  scheduler.cap: 30

pipeline:
  inputs:
    - name: tail
      path: <path_to_file1>
      ...
      storage.type: filesystem
    - name: tail
      path: <path_to_file2>
      ...
      **storage.type: filesystem**

  outputs:
    - name: loki
      ...
      **storage.total_limit_size: 5G**
      retry_limit: no_limits

Please correct me if I have misunderstood anything:
We noticed that the task_id stopped at 2047 even though new chunks were still being created; this appears to be by design: https://github.com/fluent/fluent-bit/blob/v3.2.6/include/fluent-bit/flb_config.h#L291. Since each chunk is handled by one task_id, and taking the chunk size into account, the maximum buffered data should be roughly:
2048 tasks * 2 MB per chunk = 4096 MB ≈ 4 GB (or even less), even though storage.total_limit_size is set to 5 GB or larger.
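
A quick way to cross-check this, assuming the same storage.path placeholder as in the service config above, is to measure how much data actually ended up buffered on disk and compare it with the ~4 GB figure:

# total size of the filesystem buffer (placeholder path from the config above)
du -sh <path_to_data>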

Similar issues:
#8503
#8395

To Reproduce

  • Set storage.type to filesystem
  • Shut down the output destination (Loki in our case)
  • Wait until the task_id reaches 2047
  • Check whether any log files are missing after bringing the output destination back up (a minimal config sketch follows this list)
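
For convenience, here is a minimal config sketch for the reproduction, keeping only the settings relevant to the buffering behaviour (same placeholders as the full config above; <loki_host> is also just a placeholder):

service:
  storage.path: <path_to_data>
  storage.max_chunks_up: 256
  storage.backlog.mem_limit: 1G

pipeline:
  inputs:
    - name: tail
      path: <path_to_file1>
      storage.type: filesystem
  outputs:
    - name: loki
      match: '*'
      host: <loki_host>
      storage.total_limit_size: 5G
      retry_limit: no_limits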

Expected behavior
FluentBit should keep buffering log data up to the configured storage.total_limit_size while the output service is down.

Your Environment

Version used: 3.2.6
Environment name and version (e.g. Kubernetes? What version?): Linux
Server type and version: Linux
Operating System and version: RHEL 8.0

duj4 commented Feb 24, 2025

@edsiper @patrick-stephens @cosmo0920 it would be much appreciated if you could help look into this issue.

pkqsun commented Feb 24, 2025

Just some updates:
1> From what we observed, when using storage.type: filesystem, the local chunk file sizes vary from 4k to 36k up to a maximum of 2M. Do you know why this happens, given that we keep tailing the same log files?

2> As mentioned in #9966 (comment), question 2, the curl command shows total_chunks is 2099 this time, after about 40 hours without the output. Since the total file size (data/tail.1/*) is only about 132M, we are still confused by the mismatch between the number of task_ids and the local chunk files (a sample of the curl check is sketched after this list).

3> Could we add a new configurable parameter, such as max_flb_task, in the service section? The default would be 2048, as defined in https://github.com/fluent/fluent-bit/blob/v3.2.6/include/fluent-bit/flb_config.h#L291, and it could be changed as needed. The actual number of task_ids would then depend on whichever of storage.total_limit_size and max_flb_task is reached first. Any concerns or suggestions about this change?
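
For reference on point 2, a sketch of the curl check, assuming the built-in HTTP server is enabled in the service section (http_server: on, default port 2020) together with the storage.metrics: on setting from the config above:

# total_chunks in the response is the 2099 figure quoted above
curl -s http://127.0.0.1:2020/api/v1/storage

And a purely hypothetical sketch of how the proposal in point 3 might look (max_flb_task is not an existing option, just the suggested name; the default would stay at 2048 to match today's hard-coded task map size):

service:
  max_flb_task: 4096   # proposed option, not implemented today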
