[Bug] Admission reports piled up, causing etcd to turn into read-only mode #8974

Closed
2 tasks done
chaochn47 opened this issue Nov 21, 2023 · 17 comments
Labels
bug (Something isn't working), load testing, reports (Issues related to policy reports)


@chaochn47

Kyverno Version

1.10.3

Description

Follow-up issue report from the Slack discussion (linked below).


On a 1.24 EKS cluster:

# HELP apiserver_storage_objects [STABLE] Number of stored objects at the time of last check split by kind.
# TYPE apiserver_storage_objects gauge
apiserver_storage_objects{resource="admissionreports.kyverno.io"} 1.601408e+06

Millions of Kyverno admission reports have piled up since June 2023, occupying most of the space in the etcd database. The database breached the upstream recommended maximum size quota (8 GB), which turned etcd into read-only mode.

Entries by 'Kind' (total 9.5 GB):
+--------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+--------+
|                                                                       KEY GROUP                                                                        |              KIND               |  SIZE  |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+--------+
| /registry/kyverno.io/admissionreports/monitoring,/registry/kyverno.io/admissionreports/monitoring,/registry/kyverno.io/admissionreports/monitoring,/re | AdmissionReport                 | 9.4 GB |
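For reference, a minimal sketch of how the object count and the etcd database size can be checked (assuming direct access to the apiserver /metrics endpoint and to etcd):

# Stored-object count per resource, as reported by the kube-apiserver
kubectl get --raw /metrics | grep 'apiserver_storage_objects{resource="admissionreports.kyverno.io"}'

# Physical etcd database size and any active alarms (NOSPACE appears once the quota is breached)
etcdctl endpoint status --write-out=table
etcdctl alarm list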

kyverno-app-controller-pod-spec.yaml is the pod spec from when the database filled up, though I am not sure whether the user ever upgraded the controller after June 2023. The Kyverno version 1.10.3 comes from the image ghcr.io/kyverno/kyverno:v1.10.3 referenced in this spec.

kyverno-admission-report-sample.json is one example of the admission report custom resources.

Please let me know if the Kyverno community wants more information, such as the apiserver audit log or additional admission report samples.

Slack discussion

https://kubernetes.slack.com/archives/CLGR9BJU9/p1700252421515759

Troubleshooting

  • I have read and followed the documentation AND the troubleshooting guide.
  • I have searched other issues in this repository and mine is not recorded.
@chaochn47 added the bug and triage labels on Nov 21, 2023

welcome bot commented Nov 21, 2023

Thanks for opening your first issue here! Be sure to follow the issue template!

@realshuting added the reports label, removed the triage label, and added this to the Kyverno Release 1.11.1 milestone on Nov 22, 2023
@realshuting
Member

Thank you @chaochn47 !

Pasting the snippet of the admission report for easier reading:

{
    "apiVersion": "kyverno.io/v1alpha2",
    "kind": "AdmissionReport",
    "metadata": {
        "creationTimestamp": "2023-11-04T01:52:18Z",
        "generation": 1,
        "labels": {
            "app.kubernetes.io/managed-by": "kyverno",
            "audit.kyverno.io/resource.gvr": "deployments.v1.apps",
            "audit.kyverno.io/resource.name": "vmagent-site3",
            "audit.kyverno.io/resource.namespace": "monitoring",
            "audit.kyverno.io/resource.uid": "c479bb6a-3bed-4853-a200-6e89af25d795",
            "cpol.kyverno.io/disallow-capabilities": "661837242",
            "cpol.kyverno.io/disallow-host-namespaces": "661837249",
            "cpol.kyverno.io/disallow-host-path": "661837203",
            "cpol.kyverno.io/disallow-host-ports": "661837251",
            "cpol.kyverno.io/disallow-host-process": "661837250",
            "cpol.kyverno.io/disallow-privileged-containers": "661837233",
            "cpol.kyverno.io/disallow-proc-mount": "661837221",
            "cpol.kyverno.io/disallow-selinux": "661837223",
            "cpol.kyverno.io/restrict-apparmor-profiles": "661837207",
            "cpol.kyverno.io/restrict-seccomp": "661837239",
            "cpol.kyverno.io/restrict-sysctls": "661837183"
        },
        "managedFields": [
...
]
        "name": "00000fad-1bc8-4e61-ab11-ba8ba10a602b",
        "namespace": "monitoring",
        "uid": "74034264-bc6f-4dc9-a119-d78aa2537597"
    },
    "spec": {
        "owner": {
            "apiVersion": "",
            "kind": "",
            "name": "",
            "uid": ""
        },
        "results": [
            {
                "category": "Pod Security Standards (Baseline)",
                "message": "validation rule 'autogen-adding-capabilities' passed.",
                "policy": "disallow-capabilities",
                "result": "pass",
                "rule": "autogen-adding-capabilities",
                "scored": true,
                "severity": "medium",
                "source": "kyverno",
                "timestamp": {
                    "nanos": 0,
                    "seconds": 1699062738
                }
            },
            ...
        ],
        "summary": {
            "error": 0,
            "fail": 0,
            "pass": 12,
            "skip": 0,
            "warn": 0
        }
    }
}

@mshanmu

mshanmu commented Nov 22, 2023

The problem of report objects destabilizing the Kubernetes control plane has been ongoing for a very long time. It requires a fundamental design change, i.e., using a separate data store such as Postgres or MongoDB instead of the Kubernetes store.

Please don't try patching a fundamentally wrong design choice.

There is a real customer cost because of this issue. We have completely stopped using the reports feature.

@JimBugwadia
Member

JimBugwadia commented Nov 22, 2023

@mshanmu - please see kyverno/KDP#51. It is a proposal to use API aggregation and support alternate storage backends for reports. Feel free to provide feedback and contribute there.

Reports pile up when they are produced faster than they can be consumed (processed). The processing rate is throttled by the default, or otherwise inadequate, settings for --clientRateLimitQPS and --clientRateLimitBurst.

In 1.11.0 we changed the QPS defaults, but they still need to be tuned for your installation. 1.11.0 also adds a cleanup cronjob as a safety net in case the configuration is not correct. We will look at porting these back to 1.10.x.
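For illustration (a sketch, not documented defaults): the flags are plain container arguments on the Kyverno controllers, so one way to check and then tune them is along these lines (the namespace, deployment names, and values are assumptions):

# Check whether the controllers already run with tuned client rate limits
# (assumes a default Helm install in the "kyverno" namespace)
kubectl -n kyverno get deploy -o yaml | grep -E 'clientRateLimit(QPS|Burst)'

# Example values to pass via the controller container args / your Helm chart values;
# the numbers are illustrative and must be sized for your cluster:
#   --clientRateLimitQPS=300
#   --clientRateLimitBurst=300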

@mshanmu

mshanmu commented Nov 23, 2023

Thanks for the response @JimBugwadia! Will contribute to KDP#51.

@realshuting
Member

Hi @chaochn47 - there was a cleanup cronjob in 1.10.3 to periodically delete admission reports without the aggregation label. This option is enabled by default.

How was Kyverno installed and configured in your scenario? Were there any custom configurations?

@realshuting
Member

@chaochn47 - Please help us identify if this is a configuration issue⬆️

@chaochn47
Author

Thanks @realshuting for the pointer. Would you mind giving an example cronjob name so I can try to match the cronjob key in etcd?

@realshuting
Member

Thanks @realshuting for the pointer. Would you mind giving an example cronjob name so I can try to match the cronjob key in etcd?

@chaochn47 - you can search for the cronjob "kyverno-cleanup-admission-reports" in the namespace where Kyverno was deployed.
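For reference, a quick way to check it and its recent runs (assuming the default "kyverno" namespace):

kubectl -n kyverno get cronjob kyverno-cleanup-admission-reports
# Recent jobs and their pods, to spot failed or OOMKilled cleanup runs
kubectl -n kyverno get jobs,pods | grep cleanup-admission-reports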

@JimBugwadia
Member

I am moving this to 1.11.2, as there seem to be no changes we are targeting for 1.10.x.

@chaochn47
Author

chaochn47 commented Dec 4, 2023

kyverno-cleanup-admission-reports cronjob spec.
$ etcdctl get /registry/cronjobs/kyverno/kyverno-cleanup-admission-reports --print-value-only | auger decode
apiVersion: batch/v1
kind: CronJob
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"batch/v1","kind":"CronJob","metadata":{"annotations":{},"labels":{"app.kubernetes.io/component":"cleanup","app.kubernetes.io/instance":"kyverno","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/part-of":"kyverno","app.kubernetes.io/version":"3.0.5","argocd.argoproj.io/instance":"sintral-s3-kyverno","helm.sh/chart":"kyverno-3.0.5"},"name":"kyverno-cleanup-admission-reports","namespace":"kyverno"},"spec":{"concurrencyPolicy":"Forbid","failedJobsHistoryLimit":1,"jobTemplate":{"spec":{"template":{"metadata":null,"spec":{"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"karpenter.k8s.aws/instance-family","operator":"In","values":["m6i","r6i"]}]}]}}},"containers":[{"command":["/bin/sh","-c","COUNT=$(kubectl get admissionreports.kyverno.io -A | wc -l)\nif [ \"$COUNT\" -gt 10000 ]; then\n  echo \"too many reports found ($COUNT), cleaning up...\"\n  kubectl delete admissionreports.kyverno.io -A -l='!audit.kyverno.io/report.aggregate'\nelse\n  echo \"($COUNT) reports found, no clean up needed\"\nfi\n"],"image":"bitnami/kubectl:1.26.4","imagePullPolicy":null,"name":"cleanup","resources":{"limits":{"cpu":1,"memory":"12Gi"},"requests":{"cpu":"256m","memory":"2Gi"}},"securityContext":{"allowPrivilegeEscalation":false,"capabilities":{"drop":["ALL"]},"privileged":false,"readOnlyRootFilesystem":true,"runAsNonRoot":true,"seccompProfile":{"type":"RuntimeDefault"}}}],"restartPolicy":"OnFailure","serviceAccountName":"kyverno-cleanup-jobs","tolerations":[{"effect":"NoSchedule","key":"provisioner-type","operator":"Equal","value":"high-memory"}]}}}},"schedule":"*/10 * * * *","successfulJobsHistoryLimit":1}}
  creationTimestamp: "2023-06-14T07:33:03Z"
  generation: 4
  labels:
    app.kubernetes.io/component: cleanup
    app.kubernetes.io/instance: kyverno
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kyverno
    app.kubernetes.io/version: 3.0.5
    argocd.argoproj.io/instance: sintral-s3-kyverno
    helm.sh/chart: kyverno-3.0.5
  managedFields:
  - apiVersion: batch/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:active: {}
        f:lastScheduleTime: {}
        f:lastSuccessfulTime: {}
    manager: kube-controller-manager
    operation: Update
    subresource: status
    time: "2023-10-03T08:30:00Z"
  - apiVersion: batch/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/instance: {}
          f:app.kubernetes.io/managed-by: {}
          f:app.kubernetes.io/part-of: {}
          f:app.kubernetes.io/version: {}
          f:argocd.argoproj.io/instance: {}
          f:helm.sh/chart: {}
      f:spec:
        f:concurrencyPolicy: {}
        f:failedJobsHistoryLimit: {}
        f:jobTemplate:
          f:spec:
            f:template:
              f:spec:
                f:affinity:
                  .: {}
                  f:nodeAffinity:
                    .: {}
                    f:requiredDuringSchedulingIgnoredDuringExecution: {}
                f:containers:
                  k:{"name":"cleanup"}:
                    .: {}
                    f:command: {}
                    f:image: {}
                    f:imagePullPolicy: {}
                    f:name: {}
                    f:resources:
                      .: {}
                      f:limits:
                        .: {}
                        f:cpu: {}
                        f:memory: {}
                      f:requests:
                        .: {}
                        f:cpu: {}
                        f:memory: {}
                    f:securityContext:
                      .: {}
                      f:allowPrivilegeEscalation: {}
                      f:capabilities:
                        .: {}
                        f:drop: {}
                      f:privileged: {}
                      f:readOnlyRootFilesystem: {}
                      f:runAsNonRoot: {}
                      f:seccompProfile:
                        .: {}
                        f:type: {}
                    f:terminationMessagePath: {}
                    f:terminationMessagePolicy: {}
                f:dnsPolicy: {}
                f:restartPolicy: {}
                f:schedulerName: {}
                f:securityContext: {}
                f:serviceAccount: {}
                f:serviceAccountName: {}
                f:terminationGracePeriodSeconds: {}
                f:tolerations: {}
        f:schedule: {}
        f:successfulJobsHistoryLimit: {}
        f:suspend: {}
    manager: argocd-controller
    operation: Update
    time: "2023-10-03T08:58:19Z"
  name: kyverno-cleanup-admission-reports
  namespace: kyverno
  uid: 0579b2da-2238-4bdd-9e42-f18da832eb86
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        metadata:
          creationTimestamp: null
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: karpenter.k8s.aws/instance-family
                    operator: In
                    values:
                    - m6i
                    - r6i
          containers:
          - command:
            - /bin/sh
            - -c
            - |
              COUNT=$(kubectl get admissionreports.kyverno.io -A | wc -l)
              if [ "$COUNT" -gt 10000 ]; then
                echo "too many reports found ($COUNT), cleaning up..."
                kubectl delete admissionreports.kyverno.io -A -l='!audit.kyverno.io/report.aggregate'
              else
                echo "($COUNT) reports found, no clean up needed"
              fi
            image: bitnami/kubectl:1.26.4
            imagePullPolicy: IfNotPresent
            name: cleanup
            resources:
              limits:
                cpu: "1"
                memory: 12Gi
              requests:
                cpu: 256m
                memory: 2Gi
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop:
                - ALL
              privileged: false
              readOnlyRootFilesystem: true
              runAsNonRoot: true
              seccompProfile:
                type: RuntimeDefault
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: OnFailure
          schedulerName: default-scheduler
          securityContext: {}
          serviceAccount: kyverno-cleanup-jobs
          serviceAccountName: kyverno-cleanup-jobs
          terminationGracePeriodSeconds: 30
          tolerations:
          - effect: NoSchedule
            key: provisioner-type
            operator: Equal
            value: high-memory
  schedule: '*/10 * * * *'
  successfulJobsHistoryLimit: 1
  suspend: false
status:
  active:
  - apiVersion: batch/v1
    kind: Job
    name: kyverno-cleanup-admission-reports-28325940
    namespace: kyverno
    resourceVersion: "692561028"
    uid: e987ce5c-0920-4b68-acc1-454e12b5d66b
  lastScheduleTime: "2023-11-09T19:00:00Z"
  lastSuccessfulTime: "2023-11-06T16:24:43Z"

@realshuting
Member

Thanks for the update @chaochn47 !

We noticed that the cronjob status is in an odd state: the lastSuccessfulTime is before the lastScheduleTime, which indicates the job might not have completed successfully.

Per Kubernetes CronJob limitations, the controller will not start a new job if more than 100 schedules were missed:

For every CronJob, the CronJob Controller checks how many schedules it missed in the duration from its last scheduled time until now. If there are more than 100 missed schedules, then it does not start the job and logs the error

Cannot determine if job needs to be started. Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.

Can you confirm the job status and share the logs, if possible?
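If the missed-schedule limit turns out to be the problem, one hedged workaround is to set .spec.startingDeadlineSeconds on the cronjob so the controller only counts recent misses (the value below is illustrative):

kubectl -n kyverno patch cronjob kyverno-cleanup-admission-reports \
  --type merge -p '{"spec":{"startingDeadlineSeconds":3600}}'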

@realshuting
Member

BTW, the resource settings for the cronjob seem to be enough for the cleanup, but it's still worth checking whether there were any OOM restarts or related errors.

            resources:
              limits:
                cpu: "1"
                memory: 12Gi
              requests:
                cpu: 256m
                memory: 2Gi

@chaochn47
Author

All of the cleanup report pods were either OOMKilled or terminated with Error:
dev-dsk-chaochn-2c-a26acd76 % etcdctl get /registry/pods/kyverno/kyverno-cleanup-admission-reports-28325940-7pkd5 | auger decode --output json | jq -r '.status.containerStatuses[0].lastState.terminated.reason'
Error

dev-dsk-chaochn-2c-a26acd76 % etcdctl get /registry/pods/kyverno/kyverno-cleanup-admission-reports-28257080-wx9rc | auger decode --output json | jq -r '.status.containerStatuses[0].state.terminated.reason'
OOMKilled

dev-dsk-chaochn-2c-a26acd76 % etcdctl get /registry/pods/kyverno/kyverno-cleanup-admission-reports-28245020-2s6hn | auger decode --output json | jq -r '.status.containerStatuses[0].state.terminated.reason'
OOMKilled
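For clusters without direct etcd access, roughly the same check can be done through the API server; a sketch:

kubectl -n kyverno get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].state.terminated.reason}{"\n"}{end}' \
  | grep cleanup-admission-reports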

@chaochn47
Author

chaochn47 commented Dec 5, 2023

Can you confirm the job status and share the logs, if possible?

Thanks @realshuting for the pointer. As the platform provider, we do not retain historical logs for the applications our customers deploy.

That makes it challenging to figure out when the cronjob started failing.

From 2023-11-01T00:00:00Z to 2023-11-02T23:59:59Z, I can tell the reports-controller issued hundreds of list requests for admission reports that failed with 500 and 410 responses.

Every 10 minutes, there was a 500 error on a "list all admission reports" call, timing out after 1 minute with "context deadline exceeded".

This is an indicator that the cleanup cronjob could also be failing for the same reason. So I conclude:

  1. It is too late to delete those objects.
  2. The deletion script in the cronjob spec is also not as efficient as the approach in the EKS blog post about managing etcd db size (the "How to reclaim etcd database space?" section); see the sketch below.
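For context, the generic etcd space-reclamation sequence behind that blog section, after the stale objects have been deleted through the API, is compaction, defragmentation, and disarming the NOSPACE alarm. A rough sketch (not the blog's exact procedure):

rev=$(etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')
etcdctl compact "$rev"
etcdctl defrag        # run per member; the member blocks while defragmenting
etcdctl alarm disarm  # clear NOSPACE so writes are accepted again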

However, I do think this kind of manual cleanup should only be executed as a last resort.

Longer term, to prevent unbounded admission report growth, we should consider a resource quota.
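As a sketch of that idea, an object-count quota on admission reports per namespace could look like this (the command and limit are illustrative; a quota set too low would itself block report creation):

kubectl -n monitoring create quota kyverno-admission-reports \
  --hard=count/admissionreports.kyverno.io=10000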

@realshuting
Member

All of the cleanup report pods were either OOMKilled or terminated with Error

Great, now we know why the admission reports piled up.

@KhaledEmaraDev - can we perform load testing against Kyverno 1.10.x and capture the cronjob's resource usage under various loads?

The deletion script in the cronjob spec is also not as efficient as the approach in the EKS blog post about managing etcd db size (the "How to reclaim etcd database space?" section).

Thanks for the pointer. In Kyverno 1.10.x, there are "aggregate" and "non-aggregate" admission reports. The stale non-aggregate admission reports are cleaned up using the label, as you can see here. With 1.11.x, admission reports have been changed to short-lived resources and are garbage collected right after aggregation.

We are continuously working on optimizing the reporting system. As Jim mentioned above, we are working towards leveraging API aggregation to support alternate storage backends for reports in the Kyverno 1.12 release, see kyverno/KDP#51.
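To see how much of a backlog is non-aggregated, and therefore eligible for that label-based cleanup, a quick check along these lines should work (note that with millions of objects the list itself can time out, as seen above):

# Reports still missing the aggregation label (cleanup candidates in 1.10.x)
kubectl get admissionreports.kyverno.io -A -l '!audit.kyverno.io/report.aggregate' --no-headers | wc -l
# Reports that have already been aggregated
kubectl get admissionreports.kyverno.io -A -l 'audit.kyverno.io/report.aggregate' --no-headers | wc -l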

@realshuting
Member

can we perform load testing against Kyverno 1.10.x and capture the cronjob's resource usage under various loads?

I performed a test with 1.10.7 to check the resource usage of the cronjob pod. The max memory usage was around 650Mi to clean up 10k admission reports.


So with 1.10.x, users need to tune the cronjob resource allocations properly to prevent job failures.
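For example (purely illustrative value; the right sizing depends on how many reports have accumulated), the limit can be raised through the Helm chart values or patched directly on the cronjob:

kubectl -n kyverno patch cronjob kyverno-cleanup-admission-reports --type json \
  -p '[{"op":"replace","path":"/spec/jobTemplate/spec/template/spec/containers/0/resources/limits/memory","value":"4Gi"}]'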

With 1.12, the cronjob will be removed, and a threshold will be added when creating admissionreports, see #9241.
