
Fluent-bit input plugin tail doesn't process all logs: scan_blog add(): dismissed: #4155

Closed
rguske opened this issue Oct 3, 2021 · 96 comments



rguske commented Oct 3, 2021

Bug Report

Describe the bug
Fluent Bit is not processing all logs located in /var/log/containers/.

To Reproduce
The following messages are displayed:

[2021/10/01 14:40:05] [debug] [input:tail:tail.0] scanning path /var/log/containers/*.log
[2021/10/01 14:40:05] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/activator-85cd6f6f9-nrncf_knative-serving_activator-3b631f27f6667599ae940f94afe6a65a4d1d488e7979fced513fa910082a5ae1.log, inode 404768
[2021/10/01 14:40:05] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/activator-85cd6f6f9-nrncf_knative-serving_activator-ca32320178170fe4198ce1b0bd57d8ea031c7c886a7b0e3d66bb1b78b67613b8.log, inode 921337
[2021/10/01 14:40:05] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/antrea-agent-gql5r_kube-system_antrea-agent-63659cdc8e5ddba3eaf729be280661b45fd198e6d2c7195965be85cdca81f41a.log, inode 536837
[2021/10/01 14:40:05] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/antrea-agent-gql5r_kube-system_antrea-agent-8726abf73577f597e15716176cfcdce442b159d00ec12f59e439719d824a9585.log, inode 1190181
[2021/10/01 14:40:05] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/antrea-agent-gql5r_kube-system_antrea-ovs-08045b767f2f8ee421b3b4d8d5b646b49b4e12199ae957cad178dd3d11670ec6.log, inode 663855
Steps to reproduce the problem (configuration details):

ServiceAccount:

rules:
- apiGroups:
  - ""
  resources:
  - namespaces
  - pods
  verbs:
  - get
  - list
  - watch

ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: vmware-system
  labels:
    k8s-app: fluent-bit
data:
  filter-kubernetes.conf: |
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
    [FILTER]
        Name                  modify
        Match                 kube.*
        Copy                  kubernetes k8s

    [FILTER]
        Name                  nest
        Match                 kube.*
        Operation             lift
        Nested_Under          kubernetes
  filter-record.conf: |
    [FILTER]
        Name                record_modifier
        Match               *
        Record tkg_cluster veba-demo.jarvis.tanzu
        Record tkg_instance veba-demo.jarvis.tanzu
    [FILTER]
        Name                  nest
        Match                 kube.*
        Operation             nest
        Wildcard              tkg_instance*
        Nest_Under            tkg

    [FILTER]
        Name                  nest
        Match                 kube_systemd.*
        Operation             nest
        Wildcard              SYSTEMD*
        Nest_Under            systemd
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     debug
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020

    @INCLUDE input-kubernetes.conf
    @INCLUDE input-systemd.conf
    @INCLUDE input-kube-apiserver.conf
    @INCLUDE input-auditd.conf
    @INCLUDE filter-kubernetes.conf
    @INCLUDE filter-record.conf
    @INCLUDE output-syslog.conf
  input-auditd.conf: |
    [INPUT]
        Name              tail
        Tag               audit.*
        Path              /var/log/audit/audit.log
        Parser            logfmt
        DB                /var/log/flb_system_audit.db
        Mem_Buf_Limit     50MB
        Refresh_Interval  10
        Skip_Long_Lines   On
  input-kube-apiserver.conf: |
    [INPUT]
        Name              tail
        Tag               apiserver_audit.*
        Path              /var/log/kubernetes/audit.log
        Parser            json
        DB                /var/log/flb_kube_audit.db
        Mem_Buf_Limit     50MB
        Refresh_Interval  10
        Skip_Long_Lines   On
  input-kubernetes.conf: |
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On
        Refresh_Interval  10
  input-systemd.conf: |
    [INPUT]
        Name                systemd
        Tag                 kube_systemd.*
        Path                /var/log/journal
        DB                  /var/log/flb_kube_systemd.db
        Systemd_Filter      _SYSTEMD_UNIT=kubelet.service
        Systemd_Filter      _SYSTEMD_UNIT=containerd.service
        Read_From_Tail      On
        Strip_Underscores   On
  output-syslog.conf: |
    [OUTPUT]
        Name   syslog
        Match  kube.*
        Host   10.197.79.57
        Port   514
        Mode   tcp
        Syslog_Format        rfc5424
        Syslog_Hostname_key  tkg_cluster
        Syslog_Appname_key   pod_name
        Syslog_Procid_key    container_name
        Syslog_Message_key   message
        Syslog_SD_key        k8s
        Syslog_SD_key        labels
        Syslog_SD_key        annotations
        Syslog_SD_key        tkg

    [OUTPUT]
        Name   syslog
        Match  kube_systemd.*
        Host   10.197.79.57
        Port   514
        Mode   tcp
        Syslog_Format        rfc5424
        Syslog_Hostname_key  tkg_cluster
        Syslog_Appname_key   tkg_instance
        Syslog_Message_key   MESSAGE
        Syslog_SD_key        systemd
  parsers.conf: |
    [PARSER]
        Name   apache
        Format regex
        Regex  ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name   apache2
        Format regex
        Regex  ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name   apache_error
        Format regex
        Regex  ^\[[^ ]* (?<time>[^\]]*)\] \[(?<level>[^\]]*)\](?: \[pid (?<pid>[^\]]*)\])?( \[client (?<client>[^\]]*)\])? (?<message>.*)$

    [PARSER]
        Name   nginx
        Format regex
        Regex ^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name   json
        Format json
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On

    [PARSER]
        Name        docker-daemon
        Format      regex
        Regex       time="(?<time>[^ ]*)" level=(?<level>[^ ]*) msg="(?<msg>[^ ].*)"
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On

    [PARSER]
        # http://rubular.com/r/tjUt3Awgg4
        Name cri
        Format regex
        Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z

    [PARSER]
        Name        logfmt
        Format      logfmt

    [PARSER]
        Name        syslog-rfc5424
        Format      regex
        Regex       ^\<(?<pri>[0-9]{1,5})\>1 (?<time>[^ ]+) (?<host>[^ ]+) (?<ident>[^ ]+) (?<pid>[-0-9]+) (?<msgid>[^ ]+) (?<extradata>(\[(.*)\]|-)) (?<message>.+)$
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On

    [PARSER]
        Name        syslog-rfc3164-local
        Format      regex
        Regex       ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
        Time_Key    time
        Time_Format %b %d %H:%M:%S
        Time_Keep   On

    [PARSER]
        Name        syslog-rfc3164
        Format      regex
        Regex       /^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$/
        Time_Key    time
        Time_Format %b %d %H:%M:%S
        Time_Keep   On

    [PARSER]
        Name    kube-custom
        Format  regex
        Regex   (?<tag>[^.]+)?\.?(?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$

DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: vmware-system
  labels:
    k8s-app: fluent-bit
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: fluent-bit
  template:
    metadata:
      labels:
        k8s-app: fluent-bit
    spec:
      containers:
      - image: projects.registry.vmware.com/tkg/fluent-bit:v1.6.9_vmware.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 2020
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: fluent-bit
        ports:
        - containerPort: 2020
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/v1/metrics/prometheus
            port: 2020
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 80m
            memory: 200Mi
          requests:
            cpu: 50m
            memory: 100Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/log
          name: var-log
        - mountPath: /var/log/pods
          name: var-log-pods
        - mountPath: /var/log/containers
          name: var-log-containers
        - mountPath: /var/lib/docker/containers
          name: var-lib-docker-containers
          readOnly: true
        - mountPath: /fluent-bit/etc/
          name: fluent-bit-config
        - mountPath: /run/log
          name: systemd-log
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: fluent-bit
      serviceAccountName: fluent-bit
      terminationGracePeriodSeconds: 10
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoExecute
        operator: Exists
      - effect: NoSchedule
        operator: Exists
      volumes:
      - hostPath:
          path: /var/log
          type: ""
        name: var-log
      - hostPath:
          path: /var/log/pods
          type: ""
        name: var-log-pods
      - hostPath:
          path: /var/log/containers
          type: ""
        name: var-log-containers
      - hostPath:
          path: /var/lib/docker/containers
          type: ""
        name: var-lib-docker-containers
      - hostPath:
          path: /run/log
          type: ""
        name: systemd-log
      - configMap:
          defaultMode: 420
          name: fluent-bit-config
        name: fluent-bit-config
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate

Expected behavior
All logs in /var/log/containers/ should be processed.

Your Environment

Additional context
Running tail -f manually from within the system on a specific pod log, which is writing to stdout, works.

{"log":"10/03/2021 14:47:13 - Handler Processing Completed ...\n","stream":"stdout","time":"2021-10-03T14:47:13.829672574Z"}
{"log":"\n","stream":"stdout","time":"2021-10-03T14:47:13.829772103Z"}

Examples of logs which aren't processed:

root@veba-kn [ /var/log/containers ]# ls -rtl
total 376
lrwxrwxrwx 1 root root 100 Sep 13 21:31 antrea-agent-gql5r_kube-system_antrea-agent-8726abf73577f597e15716176cfcdce442b159d00ec12f59e439719d824a9585.log -> /var/log/pods/kube-system_antrea-agent-gql5r_31aa406a-286c-495b-9dcf-e4036c2a4756/antrea-agent/3.log
lrwxrwxrwx 1 root root  98 Sep 13 21:31 antrea-agent-gql5r_kube-system_antrea-ovs-3f300f1d7b28c069df1f34cf37ff89be95d69fc3dc4ea0f269b5bd07ce5d56c1.log -> /var/log/pods/kube-system_antrea-agent-gql5r_31aa406a-286c-495b-9dcf-e4036c2a4756/antrea-ovs/3.log
lrwxrwxrwx 1 root root 102 Sep 13 21:31 envoy-89vct_contour-external_shutdown-manager-c8ed97927c25d465f31cce5ab8bd91d02742504f8cf73ad53e493738d0a17f74.log -> /var/log/pods/contour-external_envoy-89vct_1c947a55-2b86-48bd-b442-c6c51ec2dd3a/shutdown-manager/3.log
lrwxrwxrwx 1 root root  91 Sep 13 21:31 envoy-89vct_contour-external_envoy-0ea7a33d12105058f74eae9653dd0266ac99ef2ba7f6cb3a3b04a8ec3bc02525.log -> /var/log/pods/contour-external_envoy-89vct_1c947a55-2b86-48bd-b442-c6c51ec2dd3a/envoy/3.log
lrwxrwxrwx 1 root root 104 Sep 13 21:31 contour-5869594b-7jm89_contour-external_contour-803e6591f657fae9539b64ae4f86fa44cce99b409c5f92979c6045cf4b98b52c.log -> /var/log/pods/contour-external_contour-5869594b-7jm89_cc6cf243-7d3f-4839-91e8-741ab87f6488/contour/3.log
lrwxrwxrwx 1 root root 106 Sep 13 21:31 contour-5d47766fd8-n24mz_contour-internal_contour-ae34a8ae0b8398da294c5061ec5c0ef1e9be8cb2979f07077e5e9df12f2bab67.log -> /var/log/pods/contour-internal_contour-5d47766fd8-n24mz_a87131ad-d73a-4371-a47b-dcc410f3b6e4/contour/3.log
lrwxrwxrwx 1 root root 100 Sep 13 21:31 coredns-74ff55c5b-mjdlr_kube-system_coredns-60bd5f49def85a0ddc929e2c2da5c793a3c6de00cd6a81bdcfdb21f3d4f45129.log -> /var/log/pods/kube-system_coredns-74ff55c5b-mjdlr_7ef260c1-308e-4162-8a84-231d560f8023/coredns/3.log

I've also tried running the DS with:

securityContext:
  privileged: true

Similar issues I found, but which don't provide a solution for this issue:
#3857
#4014

Your help would be much appreciated. Thanks

@zy286938348

@rguske I also have this problem, how did you solve it?


rguske commented Nov 16, 2021

@rguske I also have this problem, how did you solve it?

Hi @zy286938348, I have no clue at the moment, but I will follow some of the suggestions made here: https://fluent-all.slack.com/archives/C0CTQGHKJ/p1637005220493300
Please keep me posted if you have more luck finding a solution to this particular problem. Thanks


rguske commented Nov 19, 2021

Hi @zy286938348,
which container runtime are you using?


kc-dot-io commented Nov 27, 2021

@rguske @zy286938348 I am also seeing this issue using containerd as my runtime.
It came up ~16 days after my first deployment of fluentbit 1.8.1 on a fresh k8s 1.21 cluster on EKS.
My fluentbit configuration is basically the default helm chart install with the kubernetes parser, and that's about it.
It looks like I can't access the Slack thread above, @rguske; let me know if I can share more info to help figure this out.


kc-dot-io commented Nov 27, 2021

I think this is the RCA for me: #3873

When I removed Ignore_Older, the issues I was seeing went away quite quickly.
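
For anyone checking their own config: Ignore_Older is an option on the tail input, so the change described here amounts to dropping a line like the commented one below (a minimal sketch; the path and value are illustrative, not taken from any reporter's config):

```
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    # Ignore_Older    1d
    # ^ files with a modification time older than this are never picked up;
    #   removing the option is the change that made the dismissals stop
```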

Please tag me if you are able to try this and confirm it works for you too.


rguske commented Nov 27, 2021

Hi @kc-dot-io,
I appreciate your comment and help. I wanted to update this issue last week with my latest observations, but it was a busy week. However, it's interesting that you mention seeing this issue with containerd, since I found that changing the runtime from Docker to containerd finally produced positive results. Consequently, I was fairly sure it was due to the old dockershim.
Regarding the use of Ignore_Older: I'm sure I never added it to the ConfigMap, but I will double-check next week.
Again, thanks a lot.


teru01 commented Dec 3, 2021

I also have this issue. Upgrading Fargate 1.3 (which uses Docker as the runtime) to 1.4 (which uses containerd) did not help.


kc-dot-io commented Dec 7, 2021

Circling back to this: I'm no longer certain that Ignore_Older is the RCA for me, and yes, I can confirm I'm using containerd.

The issue still appears to occur from time to time, and while it's usually not limited to a single node, I have observed cases where one node works while another doesn't. I'm now testing increasing the resources allocated to the fluent-bit replicas, since the issue appears to come and go.

If anyone has any more ideas, I'm happy to troubleshoot it with my setup. Please do ping me in follow ups.

@srideok2

Are there any updates on this issue? I am facing the same on a K8s 1.21 deployment with Fluent Bit v1.8.12 and Elasticsearch.


kc-dot-io commented Feb 18, 2022

@srideok2 I don't think there is a permanent solution yet, but I've talked to a few other people who had this issue, and we've had luck reconfiguring our Fluent Bit instances to a point where things were stable.

From what I can see, the issue mostly happens when buffer sizes are exceeded due to large log lines.

The solution I used, which worked for at least one other person, was to turn on the option for skipping long lines and then increase the buffer sizes on the inputs and filters.

IIRC there is another open issue that describes this at a lower level, and I think there are multiple symptoms of the same root cause here: when the memory buffer is exceeded, fluent-bit just silently stops working until it can be restarted.
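
As a sketch of that reconfiguration (Skip_Long_Lines, Buffer_Chunk_Size, Buffer_Max_Size and the kubernetes filter's Buffer_Size are real options; the sizes below are illustrative, not tested recommendations):

```
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    Skip_Long_Lines   On
    # raise the read buffers so long lines fit before Skip_Long_Lines kicks in
    Buffer_Chunk_Size 1MB
    Buffer_Max_Size   5MB
    Mem_Buf_Limit     50MB

[FILTER]
    Name        kubernetes
    Match       kube.*
    # raise the filter's metadata buffer as well; 0 disables the size limit
    Buffer_Size 1MB
```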

Ping me on Slack community if you have any questions. I hope this helps.


rguske commented Feb 20, 2022

Hi all,
it finally worked for our project after replacing Docker with containerd and applying the following config:
https://github.com/vmware-samples/vcenter-event-broker-appliance/tree/development/files/configs/fluentbit/templates


ankitkr commented Mar 10, 2022

Has this been resolved?
This seems to be related to the inode entry not being removed when a file is removed, i.e. the entry stays in files_static / files_event (https://github.com/fluent/fluent-bit/blob/master/plugins/in_tail/tail_config.h#L129, https://github.com/fluent/fluent-bit/blob/master/plugins/in_tail/tail_file.c#L1108).

@ramatudaan

@edsiper can you please give us some pointers to debug this issue? As far as I understand,

int flb_tail_file_append(char *path, struct stat *st, int mode,

this method is returning -1 and the files are getting dismissed. Can you please help with some pointers on when this method returns -1?


edsiper commented Mar 17, 2022

There is a bug in v1.8.13, we have just published the new container image for v1.8.14, please upgrade:

fluent/fluent-bit:1.8.14

@danielserrao

I'm using version 1.9.0 and I'm still getting dismissed logs. This doesn't really seem fixed.


andutt commented Mar 25, 2022

Same issue here. OpenShift 4 CoreOS and cri-o. Any input regarding this would be great.

@ramatudaan

@edsiper can you please give us some pointers to debug this issue? As far as I understand,

int flb_tail_file_append(char *path, struct stat *st, int mode,

this method is returning -1 and the files are getting dismissed. Can you please help with some pointers on when this method returns -1?

This got fixed when we started using 1.9.0. @danielserrao


andutt commented Mar 25, 2022

@ramatudaan I switched my fluent-bit image to 1.9.0, and my issue now seems to be security/permission related, or a misconfiguration of the templates to match OpenShift 4 CoreOS/cri-o. I created an issue on it here.


Lord-Y commented Apr 22, 2022

Hello guys, is there any ETA on this @edsiper?
It sounds like a lot of issues have been opened about this topic, and the 1.8.X Debian packages are not available anymore.
None of the 1.9.X packages fix the issue.

@ahvahsky2008

Same issue here, tail logging does not work.

@kashirindynatech

Having the same issue. Tried both the http output to Logstash and the es plugin to Elasticsearch.


sxyandapp commented Jun 10, 2022

fluent-bit version: 1.9.3
Problem: when collecting logs with fluent-bit's tail plugin, fluent-bit stops collecting after running for a while.
Root cause: when merging logs with multiline.parser, the cont state can merge many log lines into a single line, which causes tail to stop collecting.
Approach: multiline.parser first uses start_state to decide whether a line is a start line, then uses cont to decide whether subsequent lines should be merged, so start_state and cont should use mutually exclusive logic.
Correct example:
rule "start_state" "/^(\d{4})(.*)/" "cont"
rule "cont" "/^(?!\d{4})(.*)/" "cont"
Incorrect example:
rule "start_state" "/^(\d{4})(.*)/" "cont"
rule "cont" "/(.*)/" "cont"

So far I have not run into tail stopping collection again.
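
The mutually exclusive rules described above can be written as a standalone multiline parser in parsers.conf (a minimal sketch; the parser name and flush_timeout are illustrative) and referenced from the tail input with multiline.parser:

```
[MULTILINE_PARSER]
    name          multiline-yearstart
    type          regex
    flush_timeout 1000
    # start_state matches lines that begin with a 4-digit year;
    # cont matches only lines that do NOT, so the two states never overlap
    rule      "start_state"  "/^\d{4}(.*)/"        "cont"
    rule      "cont"         "/^(?!\d{4})(.*)/"    "cont"
```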

@Jonathan-w6d

Has anyone tried turning Skip_Long_Lines off?
Some of my logs were dismissed, I guess because they were too long; this did the trick.


zulrang commented Aug 5, 2022

I'm having this issue as well on 1.9.6. The problem is that I NEED the long lines to be processed.

@kc-dot-io

@Jonathan-w6d yes, this helped for me, as did increasing the max buffer size. That said, you'll still run into a memory leak and eventually need to restart the process; I have to do this every 90-120 days.

@zulrang Try increasing your max buffer size to something very large. You'll probably still have to restart the processes every so often, but in theory, if the buffer is large enough, it should be able to process long lines until it can't again.


zulrang commented Aug 6, 2022

@zulrang Try increasing your max buffer size to something very large. You'll probably still have to restart the processes every so often, but in theory, if the buffer is large enough, it should be able to process long lines until it can't again.

I have all buffers set to 2M, and it has never successfully flushed chunks from one specific node, the one that is logging stack traces (about 8k long). The other 13 work fine.

@Jonathan-w6d

@leonardo-albertovich I can give you the last values.yaml containing the configs I used before switching to another solution, if you'd like?


Jonathan-w6d commented May 9, 2023

Here you go. It's actually pretty straightforward: just adding k8s metadata and modifying the tag on some logs so I can differentiate them, then outputting to OpenSearch directly.


# replicaCount -- Only applicable if kind=Deployment
replicaCount: 1

image:
  repository: cr.fluentbit.io/fluent/fluent-bit
  tag: "1.9.6-debug"
  pullPolicy: Always

#image:
#  repository: gke.gcr.io/gke-metrics-agent
#  tag: "1.8.3-gke.0"
#  pullPolicy: Always

testFramework:
  image:
    repository: busybox
    pullPolicy: Always
    tag: latest

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""

serviceAccount:
  create: true
  annotations: {}
  name:

rbac:
  create: true
  nodeAccess: true

podSecurityPolicy:
  create: false
  annotations: {}

openShift:
  enabled: false
  securityContextConstraints:
    create: true
    annotations: {}

podSecurityContext: {}

hostNetwork: false
dnsPolicy: ClusterFirst

dnsConfig: {}

hostAliases: []

securityContext: {}

service:
  type: ClusterIP
  port: 2020
  labels: {}
  annotations: {}

serviceMonitor:
  enabled: false

prometheusRule:
  enabled: false

dashboards:
  enabled: false
  labelKey: grafana_dashboard
  annotations: {}
  namespace: ""

lifecycle: {}

livenessProbe:
  httpGet:
    path: /
    port: http

readinessProbe:
  httpGet:
    path: /api/v1/health
    port: http

resources: {}
#   limits:
#     cpu: 100m
#     memory: 128Mi
#   requests:
#     cpu: 100m
#     memory: 128Mi

## only available if kind is Deployment
ingress:
  enabled: false
  className: ""
  annotations: {}
  hosts: []
  extraHosts: []
  tls: []

## only available if kind is Deployment
autoscaling:
  enabled: false
  minReplicas: 1
  maxReplicas: 3
  targetCPUUtilizationPercentage: 75
  customRules: []
  behavior: {}

## only available if kind is Deployment
podDisruptionBudget:
  enabled: false
  annotations: {}
  maxUnavailable: "30%"

nodeSelector: {}

tolerations:
  - operator: Exists
    effect: NoExecute
  - operator: Exists
    effect: NoSchedule
#  - effect: NoSchedule
#    key: dedicated
#    operator: Equal
#    value: appli-strada
#  - effect: NoSchedule
#    key: dedicated
#    operator: Equal
#    value: infra
#  - effect: NoSchedule
#    key: dedicated
#    operator: Equal
#    value: kafka

affinity: {}

labels: {}

annotations: {}

podAnnotations: {}

podLabels: {}

priorityClassName: ""

env: []
#  - name: FOO
#    value: "bar"

# The envWithTpl array below has the same usage as "env", but is using the tpl function to support templatable string.
# This can be useful when you want to pass dynamic values to the Chart using the helm argument "--set <variable>=<value>"
# https://helm.sh/docs/howto/charts_tips_and_tricks/#using-the-tpl-function
envWithTpl: []
#  - name: FOO_2
#    value: "{{ .Values.foo2 }}"
#
# foo2: bar2

envFrom: []

extraContainers: []

flush: 5

metricsPort: 2020

extraPorts: []

extraVolumes:
  - name: certs
    secret:
      secretName: tls-for-fluentbit-key-pair

extraVolumeMounts:
  - name: certs
    mountPath: /fluent-bit/etc/certs/ca.crt
    subPath: ca.crt
  - name: certs
    mountPath: /fluent-bit/etc/certs/tls.key
    subPath: tls.key
  - name: certs
    mountPath: /fluent-bit/etc/certs/tls.crt
    subPath: tls.crt

updateStrategy: {}

existingConfigMap: ""

networkPolicy:
  enabled: false

luaScripts: {}

config:
  service: |
    [SERVICE]
        Daemon Off
        Flush {{ .Values.flush }}
        Grace 10
        Log_Level {{ .Values.logLevel }}
        Log_File      /dev/stdout
        Parsers_File custom_parsers.conf
        HTTP_Server On
        HTTP_Listen 0.0.0.0
        HTTP_Port {{ .Values.metricsPort }}
        Storage.path  /var/log/fluentbit
        Storage.sync  normal
        Storage.metrics on

  inputs: |
    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        multiline.parser docker, cri
        Tag kube.*
        Skip_Long_Lines Off
        Mem_Buf_Limit 100MB

  filters: |
    [FILTER]
        Name kubernetes
        Match kube.*
        Merge_Log On
        Buffer_size False

    [FILTER]
        Name        grep
        Match       kube.*

        Exclude     $kubernetes['container_name'] fluent-bit
        Exclude     $kubernetes['container_name'] opensearch
        Exclude     $kubernetes['namespace_name'] logs

    [FILTER]
        Name        modify
        Match       kube.*
        Hard_copy log message

    [FILTER]
        Name          rewrite_tag
        Match         kube.*
        Emitter_Mem_Buf_Limit 100MB

        Rule          $kubernetes['labels']['createdfor'] strada  strada.$TAG false
        Rule          $kubernetes['labels']['app'] gitlab-gitlab  strada.$TAG false
        Rule          $kubernetes['container_image'] strada  strada.$TAG false

    [FILTER]
        Name         parser
        Match_Regex  ^(kube|strada)      
        Key_Name     message
        Reserve_Data True
        Parser       pg
        Parser       glog
        Parser       nginx
        Parser       json
        Parser       kfk

  outputs: |
    [OUTPUT]
        Name opensearch
        Match kube.*
        Host opensearch-cluster-master-headless.logs.svc
        Port 9200
        HTTP_User admin
        HTTP_Passwd admin
        Logstash_Format On
        Logstash_Prefix logk8s
        Logstash_DateFormat %Y.%m.%d
        Retry_Limit False
        Suppress_Type_Name On
        tls On
        tls.verify On
        tls.ca_file /fluent-bit/etc/certs/ca.crt
        tls.crt_file /fluent-bit/etc/certs/tls.crt
        tls.key_file /fluent-bit/etc/certs/tls.key
        Buffer_size False
        Trace_Output On
        Trace_Error On
        Replace_Dots On

    [OUTPUT]
        Name opensearch
        Match strada.*
        Host opensearch-cluster-master-headless.logs.svc
        Port 9200
        HTTP_User admin
        HTTP_Passwd admin
        Logstash_Format On
        Logstash_Prefix logk8s-strada
        Logstash_DateFormat %Y.%m.%d
        Retry_Limit False
        Suppress_Type_Name On
        tls On
        tls.verify On
        tls.ca_file /fluent-bit/etc/certs/ca.crt
        tls.crt_file /fluent-bit/etc/certs/tls.crt
        tls.key_file /fluent-bit/etc/certs/tls.key
        Buffer_size False
        Trace_Output On
        Trace_Error On
        Replace_Dots On

  customParsers: |
    [PARSER]
        Name        cri
        Format      regex
        Regex       ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<log>.*)$
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z

    [PARSER]
        Name        nginx
        Format      regex
        Regex       (?<host>[^ ]*) - (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*) "(?<referer>[^\"]*)" "(?<agent>[^\"]*)" (?<request_length>[^ ]*) (?<request_time>[^ ]*) \[(?<proxy_upstream_name>[^ ]*)\] (\[(?<proxy_alternative_upstream_name>[^ ]*)\] )?(?<upstream_addr>[^ ]*) (?<upstream_response_length>[^ ]*) (?<upstream_response_time>[^ ]*) (?<upstream_status>[^ ]*) (?<reg_id>[^ ]*).*$
        Time_Key    time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name   json
        Format json
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name pg
        Format regex
        Regex (?<time>[^ ]+[ ]+[^ ]+[ ]+[^ ]*) \[(?<test>[^\]]*)\] (STATEMENT:\s*(?<statement>[^$].*?)|ERROR:\s*(?<error>[^$].*?)|DETAIL:\s*(?<detail>[^$].*?))$
        Time_Key time
        Time_Format %Y-%m-%d %H:%M:%S.%L %Z

    [PARSER]
        Name        glog
        Format      regex
        Regex       ^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source_file>[^ \]]+)\:(?<source_line>\d+)\]\s(?<message>.*)$
        Time_Key    time
        Time_Format %m%d %H:%M:%S.%L%z

    [PARSER]
        Name        kfk
        Format      regex
        Regex       ^(?<time>[^ ]+[ ]+[^ ]*) \[(?<id>[^\]]*)\] - (?<level>[^ ]*)  \[(?<thread>[^\]]*)\] - (?<message>[^$]*)$
        Time_Key    time
        Time_Format %Y-%m-%d %H:%M:%S,%L

    [PARSER]
        Name        logfmt
        Format      logfmt

  extraFiles: {}

# The config volume is mounted by default, either to the existingConfigMap value, or the default of "fluent-bit.fullname"
volumeMounts:
  - name: config
    mountPath: /fluent-bit/etc/fluent-bit.conf
    subPath: fluent-bit.conf
  - name: config
    mountPath: /fluent-bit/etc/custom_parsers.conf
    subPath: custom_parsers.conf

daemonSetVolumes:
  - name: varlog
    hostPath:
      path: /var/log
  - name: varlibdockercontainers
    hostPath:
      path: /var/lib/docker/containers
  - name: etcmachineid
    hostPath:
      path: /etc/machine-id
      type: File

daemonSetVolumeMounts:
  - name: varlog
    mountPath: /var/log
  - name: varlibdockercontainers
    mountPath: /var/lib/docker/containers
    readOnly: true
  - name: etcmachineid
    mountPath: /etc/machine-id
    readOnly: true

args: []

command: []

initContainers: []

logLevel: trace
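Several of the custom parsers above are easiest to debug outside the pipeline. For instance, the kfk regex can be sanity-checked against a sample line with GNU grep -P (a minimal sketch; the sample line is a hypothetical Kafka/ZooKeeper-style entry, not taken from this issue, and `-P` requires GNU grep):

```shell
# Feed a sample log line through the kfk parser regex; MATCH/NO_MATCH tells you
# whether fluent-bit's parser would even apply to lines shaped like this.
echo '2023-10-11 12:00:00,123 [myid:1] - INFO  [main:QuorumPeer@123] - Started' |
grep -P '^(?<time>[^ ]+[ ]+[^ ]*) \[(?<id>[^\]]*)\] - (?<level>[^ ]*)  \[(?<thread>[^\]]*)\] - (?<message>[^$]*)$' \
  && echo MATCH || echo NO_MATCH
```

Note the two spaces after the level field: a line with a single space there will not match, which is exactly the kind of silent mismatch that is hard to see inside fluent-bit.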

@leonardo-albertovich
Collaborator

Thank you @Jonathan-w6d, we'll try to have someone from the team reproduce the issue.
There's no ETA at the moment but I'll send an update as soon as we have some insight.

@msolters

We are encountering this in v2.1.2 in EKS 1.22.
For us it manifests as follows: log files that already exist when Fluent Bit starts up on a node are tailed and processed.
New files created on that node moving forward yield the scan_blog add() debug log message and can't be processed until Fluent Bit is restarted -- so that the files are "pre-existing" by construction. Not a tenable solution but hopefully a useful symptom diagnostically?

@benjaminhuo
Member

benjaminhuo commented May 16, 2023

New files created on that node moving forward yield the scan_blog add() debug log message and can't be processed until Fluent Bit is restarted -- so that the files are "pre-existing" by construction.

@msolters That's indeed a good clue to follow; hopefully this can shed some light for debugging @leonardo-albertovich

Does fluent-bit have difficulty detecting new files?

@leonardo-albertovich
Collaborator

Fluent-bit shouldn't have issues detecting new files; at least I've never seen that happen. Could you please share more information about your context, @msolters?

I'm interested in knowing which platform it is, whether you are using inotify, how many files there are in the paths where this is happening (or how many glob matches), what your configuration file looks like, and a copy of your log if possible.

Feel free to share things with me privately on Slack if you don't feel comfortable doing it in public.

@wanjunlei
Contributor

wanjunlei commented Jul 7, 2023

environment.zip

This is some information about the cluster, including the YAML of the Fluent Bit daemonset, the Fluent Bit configuration file, the version information of k8s and docker, the OS version, and the kernel parameters. I hope this helps to reproduce the problem; if you need other information, please let me know.

PS: The k8s version is v1.19.9, and I tried fluent-bit 1.6.9, 1.8.3, and 2.0.11; they all have the same problem.

@benjaminhuo
Member

benjaminhuo commented Jul 7, 2023

environment.zip

This is some information about the cluster, including the yaml of the Fluent Bit daemonset, the configuration file of Fluent Bit, the version information of k8s and docker, the OS version, and kernel parameters. I hope this helps to reproduce the problem, If you need other information please let me know.

@leonardo-albertovich @edsiper @agup006 @patrick-stephens we hope this helps to narrow down the investigation scope and locate the root cause. Let us know what else you need.

@IbrahimMCode

Same issue using the latest version; has this been resolved?

@emmacz

emmacz commented Aug 2, 2023

Looks like 'scan_blog add(): dismissed' is a generic message; the root cause can be in different places, such as the parser.

It would be great to add more detail in debug mode to help the user uncover the issue, for example the error message coming from the parser.

Would that be possible?

In general, where can one find the debug information related to parser processing?

@benjaminhuo
Member

Looks like 'scan_blog add(): dismissed' is a generic message; the root cause can be in different places, such as the parser.

It would be great to add more detail in debug mode to help the user uncover the issue, for example the error message coming from the parser.

Would that be possible?

In general, where can one find the debug information related to parser processing?

@leonardo-albertovich what do you think? Is it possible to add more debug messages to help narrow the scope?

@studyhuang1996

Has this been resolved in fluent-bit version 2.1.6?

@benjaminhuo
Member

Has this been resolved in fluent-bit version 2.1.6?

@studyhuang1996 does fluent-bit v2.1.6 no longer skip any logs? Is the issue resolved in the environment that had it before?

@dntosas

dntosas commented Sep 13, 2023

FYI team, issue is still present on v2.1.9 ^

@dpallagolla

I am facing the same issue, with one additional observation: the dismissed: log seems to appear only in the window when fluent-bit scans for new log files. Once the scan for files is done, logs are forwarded from fluent-bit.

@praveenkumarp893

I tried the latest version of fluent-bit (2.1.10) and am still facing the same error. Is there any news on this?

[2023/10/10 09:28:34] [debug] [input:tail:tail.0] scan_blog add(): dismissed:

Please see the input config used below.

inputs: |
  [INPUT]
      Name              tail
      Path              /var/log/containers/default.log
      multiline.parser  docker, cri, java, multiline-regex-java
      Tag               kube.*
      Mem_Buf_Limit     5MB
      Skip_Long_Lines   Off
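To rule out the output side of the pipeline, the same tail input can be pointed at a stdout output in a minimal standalone config (a sketch assuming the path and parsers from the snippet above):

```
[SERVICE]
    Log_Level  debug

[INPUT]
    Name              tail
    Path              /var/log/containers/default.log
    multiline.parser  docker, cri
    Tag               kube.*

[OUTPUT]
    Name   stdout
    Match  kube.*
```

If records still reach stdout while the scan_blog add(): dismissed message appears, the message itself is not the cause of the loss.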

@studyhuang1996

studyhuang1996 commented Oct 11, 2023 via email

@omidraha

omidraha commented Oct 27, 2023

I have the same issue fluent/helm-charts#415:

kubectl logs  -n fluent-bit -f fluent-bit-6341f45a-j8h2x | grep -i volume-test
[2023/10/27 17:04:05] [debug] [input:tail:tail.0] inode=1062251 with offset=2026 appended as /var/log/containers/volume-test_exp_volume-test-2c25a3b1342924cffab9fffb48b2f0a971fcf9f10009e4c382a37bc09075134b.log
[2023/10/27 17:04:05] [debug] [input:tail:tail.0] scan_glob add(): /var/log/containers/volume-test_exp_volume-test-2c25a3b1342924cffab9fffb48b2f0a971fcf9f10009e4c382a37bc09075134b.log, inode 1062251
[2023/10/27 17:04:05] [debug] [input:tail:tail.0] inode=1062251 file=/var/log/containers/volume-test_exp_volume-test-2c25a3b1342924cffab9fffb48b2f0a971fcf9f10009e4c382a37bc09075134b.log promote to TAIL_EVENT
[2023/10/27 17:04:05] [ info] [input:tail:tail.0] inotify_fs_add(): inode=1062251 watch_fd=23 name=/var/log/containers/volume-test_exp_volume-test-2c25a3b1342924cffab9fffb48b2f0a971fcf9f10009e4c382a37bc09075134b.log
[2023/10/27 17:05:04] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/volume-test_exp_volume-test-2c25a3b1342924cffab9fffb48b2f0a971fcf9f10009e4c382a37bc09075134b.log, inode 1062251
[2023/10/27 17:06:04] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/volume-test_exp_volume-test-2c25a3b1342924cffab9fffb48b2f0a971fcf9f10009e4c382a37bc09075134b.log, inode 1062251
[2023/10/27 17:07:04] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/volume-test_exp_volume-test-2c25a3b1342924cffab9fffb48b2f0a971fcf9f10009e4c382a37bc09075134b.log, inode 1062251
[2023/10/27 17:08:04] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/volume-test_exp_volume-test-2c25a3b1342924cffab9fffb48b2f0a971fcf9f10009e4c382a37bc09075134b.log, inode 1062251

@nulldoot2k

Still seeing this error: I'm using the latest version and still getting scan_blog add(): dismissed.

@nagyzekkyandras

Same here with the fluent/fluent-bit helm chart 0.42.0 and image version 2.2.1.

Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label Apr 18, 2024
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 23, 2024
@benjaminhuo
Member

@kc-dot-io reopen

@lecaros
Contributor

lecaros commented Apr 24, 2024

Hello. If you have this issue, please build your binary using this branch: https://github.com/lecaros/fluent-bit/tree/master
I've added a few debug messages to narrow down the issue.
Set the log_level to debug, run your scenario, and share the log file along with your configuration.

@lecaros
Contributor

lecaros commented Apr 24, 2024

I'll gladly reopen the ticket if you can provide the requested info.

@lorenzobenvenuti

Hello. If you have this issue, please build your binary using this: https://github.com/lecaros/fluent-bit/tree/master I've added a few debug messages to narrow down the issue. Set the log_level to debug. Run your scenario and share the log file along with your configuration.

Hi, we had a few scenarios where some logs weren't delivered (using Fluent Bit 2.2.2, the "tail" input plugin, and the "splunk" output plugin). Looking at the Fluent Bit logs I noticed a scan_blog add(): dismissed, so I thought it was related to the missing events and came here. I followed your advice and rebuilt the executable from your fork (BTW, it's missing a couple of ;), but as far as I can see, in our case dismissed was a red herring: it's printed even for legit use cases, for example

[2024/07/25 11:51:12] [debug] [in_tail] file /var/log/app/access.log already registered
[2024/07/25 11:51:12] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/app/access.log, inode 165821060

Basically, after Fluent Bit starts monitoring a file it will always print "dismissed" because it doesn't need to register the file again (although TBH the message is misleading: it makes you think the file is not monitored anymore).
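A rough way to triage an existing debug log along these lines (a sketch: the sample lines are modeled on messages quoted in this thread, the log path is an assumption, the path matching is naive, and GNU grep is required for `-oP`):

```shell
#!/bin/sh
# For each "dismissed" path in a fluent-bit debug log, check whether it was ever
# registered/tailed; if so, "dismissed" is benign, otherwise the file may really
# have been skipped. Replace the sample log below with your real log file.
LOG=fluent-bit.log
cat > "$LOG" <<'EOF'
[2024/07/25 11:51:12] [debug] [in_tail] file /var/log/app/access.log already registered
[2024/07/25 11:51:12] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/app/access.log, inode 16582
[2024/07/25 11:51:13] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/app/error.log, inode 999
EOF

# Extract the unique dismissed paths, then look for any prior registration message.
grep -oP 'scan_blog add\(\): dismissed: \K[^,]+' "$LOG" | sort -u |
while read -r f; do
  if grep -E '(already registered|inotify_fs_add|promote to TAIL_EVENT)' "$LOG" | grep -q "$f"; then
    echo "OK (already tailed): $f"
  else
    echo "SUSPECT (never registered): $f"
  fi
done
```

With the sample log above, access.log is reported as already tailed and error.log as suspect, which matches the distinction described in this comment.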

TL;DR: if you came here because you've lost some events and you're seeing dismissed in the logs, I'd recommend using the fork to print the root cause (although it'd be great if this info were available in the main development branch), because the actual issue could be unrelated.

Thanks!

@AstritCepele

AstritCepele commented Aug 6, 2024

Can we reopen this issue? We're seeing that this message is misleading rather than pointing to the proper root cause.

@srajappa

srajappa commented Sep 19, 2024

Still seeing this issue even on 3.1.8.

@leonardo-albertovich Logs should paint an appropriate picture of what's wrong.

An abrupt event out of nowhere that "dismissed" your file sends out panic. And when we see an issue opened several years ago come to the conclusion that fluent-bit will just keep emitting that message, it feels frivolous.

Contemplating switching to a different tool

OR

using the forked version
