[Bug] Admission reports piled up, causing etcd to turn into read-only mode #8974

Closed
2 tasks done
chaochn47 opened this issue Nov 21, 2023 · 17 comments
Labels
bug (Something isn't working), load testing, reports (Issues related to policy reports)


@chaochn47

Kyverno Version

1.10.3

Description

Follow-up issue report from the Slack discussion (linked below).


On a 1.24 EKS cluster:

# HELP apiserver_storage_objects [STABLE] Number of stored objects at the time of last check split by kind.
# TYPE apiserver_storage_objects gauge
apiserver_storage_objects{resource="admissionreports.kyverno.io"} 1.601408e+06

Millions of Kyverno admission reports have piled up since June 2023, occupying most of the space in the etcd database. The database breached the upstream recommended maximum size quota (8 GB), which turned etcd into read-only mode.

Entries by 'Kind' (total 9.5 GB):
+--------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+--------+
|                                                                       KEY GROUP                                                                        |              KIND               |  SIZE  |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+--------+
| /registry/kyverno.io/admissionreports/monitoring,/registry/kyverno.io/admissionreports/monitoring,/registry/kyverno.io/admissionreports/monitoring,/re | AdmissionReport                 | 9.4 GB |
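For reference, a minimal sketch of how the object count and the etcd database size can be checked (assuming direct access to the apiserver /metrics endpoint and to etcd):

# Stored-object count per resource, as reported by the kube-apiserver
kubectl get --raw /metrics | grep 'apiserver_storage_objects{resource="admissionreports.kyverno.io"}'

# Physical etcd database size and any active alarms (NOSPACE appears once the quota is breached)
etcdctl endpoint status --write-out=table
etcdctl alarm list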

kyverno-app-controller-pod-spec.yaml is the pod spec from when the database filled up, though I am not sure whether the user ever upgraded the controller after June 2023. The Kyverno version 1.10.3 comes from the image ghcr.io/kyverno/kyverno:v1.10.3 referenced in this spec.

kyverno-admission-report-sample.json is one example of the admission report custom resources.

Please let me know if the Kyverno community wants more information, such as the apiserver audit log or additional admission report samples.

Slack discussion

https://kubernetes.slack.com/archives/CLGR9BJU9/p1700252421515759

Troubleshooting

  • I have read and followed the documentation AND the troubleshooting guide.
  • I have searched other issues in this repository and mine is not recorded.
@chaochn47 added the bug and triage labels on Nov 21, 2023

welcome bot commented Nov 21, 2023

Thanks for opening your first issue here! Be sure to follow the issue template!

@realshuting added the reports label, removed the triage label, and added this to the Kyverno Release 1.11.1 milestone on Nov 22, 2023
@realshuting
Member

Thank you @chaochn47 !

Pasting the snippet of the admission report for easier reading:

{
    "apiVersion": "kyverno.io/v1alpha2",
    "kind": "AdmissionReport",
    "metadata": {
        "creationTimestamp": "2023-11-04T01:52:18Z",
        "generation": 1,
        "labels": {
            "app.kubernetes.io/managed-by": "kyverno",
            "audit.kyverno.io/resource.gvr": "deployments.v1.apps",
            "audit.kyverno.io/resource.name": "vmagent-site3",
            "audit.kyverno.io/resource.namespace": "monitoring",
            "audit.kyverno.io/resource.uid": "c479bb6a-3bed-4853-a200-6e89af25d795",
            "cpol.kyverno.io/disallow-capabilities": "661837242",
            "cpol.kyverno.io/disallow-host-namespaces": "661837249",
            "cpol.kyverno.io/disallow-host-path": "661837203",
            "cpol.kyverno.io/disallow-host-ports": "661837251",
            "cpol.kyverno.io/disallow-host-process": "661837250",
            "cpol.kyverno.io/disallow-privileged-containers": "661837233",
            "cpol.kyverno.io/disallow-proc-mount": "661837221",
            "cpol.kyverno.io/disallow-selinux": "661837223",
            "cpol.kyverno.io/restrict-apparmor-profiles": "661837207",
            "cpol.kyverno.io/restrict-seccomp": "661837239",
            "cpol.kyverno.io/restrict-sysctls": "661837183"
        },
        "managedFields": [
...
]
        "name": "00000fad-1bc8-4e61-ab11-ba8ba10a602b",
        "namespace": "monitoring",
        "uid": "74034264-bc6f-4dc9-a119-d78aa2537597"
    },
    "spec": {
        "owner": {
            "apiVersion": "",
            "kind": "",
            "name": "",
            "uid": ""
        },
        "results": [
            {
                "category": "Pod Security Standards (Baseline)",
                "message": "validation rule 'autogen-adding-capabilities' passed.",
                "policy": "disallow-capabilities",
                "result": "pass",
                "rule": "autogen-adding-capabilities",
                "scored": true,
                "severity": "medium",
                "source": "kyverno",
                "timestamp": {
                    "nanos": 0,
                    "seconds": 1699062738
                }
            },
            ...
        ],
        "summary": {
            "error": 0,
            "fail": 0,
            "pass": 12,
            "skip": 0,
            "warn": 0
        }
    }
}

@mshanmu

mshanmu commented Nov 22, 2023

The problem of report objects destabilizing the Kubernetes control plane has been ongoing for a very long time. It requires a fundamental design change, i.e., using a separate data store such as Postgres or MongoDB instead of the Kubernetes store.

Please don't try patching a fundamentally wrong design choice.

There is a real customer cost because of this issue. We have completely stopped using the reports feature.

@JimBugwadia
Member

JimBugwadia commented Nov 22, 2023

@mshanmu - please see kyverno/KDP#51. It is a proposal to use API aggregation and support alternate storage backends for reports. Feel free to provide feedback and contribute there.

Reports pile up when they are produced faster than they can be consumed (processed). The processing rate is throttled by the default, or otherwise inadequate, settings for --clientRateLimitQPS and --clientRateLimitBurst.

In 1.11.0 we changed the QPS defaults, but they still need to be tuned for your installation. 1.11.0 also adds a cleanup cronjob as a safety net in case the configuration is not correct. We will look at porting these back to 1.10.x.
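For illustration (a sketch, not documented defaults): the flags are plain container arguments on the Kyverno controllers, so one way to check and then tune them is along these lines (the namespace, deployment names, and values are assumptions):

# Check whether the controllers already run with tuned client rate limits
# (assumes a default Helm install in the "kyverno" namespace)
kubectl -n kyverno get deploy -o yaml | grep -E 'clientRateLimit(QPS|Burst)'

# Example values to pass via the controller container args / your Helm chart values;
# the numbers are illustrative and must be sized for your cluster:
#   --clientRateLimitQPS=300
#   --clientRateLimitBurst=300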

@mshanmu

mshanmu commented Nov 23, 2023

Thanks for the response @JimBugwadia! Will contribute to KDP#51.

@realshuting
Member

Hi @chaochn47 - there was a cleanup cronjob in 1.10.3 to periodically delete admission reports without the aggregation label. This option is enabled by default.

How was Kyverno installed and configured in your scenario? Were there any custom configurations?

@realshuting
Member

@chaochn47 - Please help us identify if this is a configuration issue⬆️

@chaochn47
Author

Thanks @realshuting for the pointer. Would you mind giving an example cronjob name so I can try to match the cronjob key in etcd?

@realshuting
Member

Thanks @realshuting for the pointer. Would you mind giving an example cronjob name so I can try to match the cronjob key in etcd?

@chaochn47 - you can search for the cronjob "kyverno-cleanup-admission-reports" in the namespace where Kyverno was deployed.
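For reference, a quick way to check it and its recent runs (assuming the default "kyverno" namespace):

kubectl -n kyverno get cronjob kyverno-cleanup-admission-reports
# Recent jobs and their pods, to spot failed or OOMKilled cleanup runs
kubectl -n kyverno get jobs,pods | grep cleanup-admission-reports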

@JimBugwadia
Member

I am moving this to 1.11.2, as there seem to be no changes we are targeting for 1.10.x.

@chaochn47
Author

chaochn47 commented Dec 4, 2023

kyverno-cleanup-admission-reports cronjob spec.
$ etcdctl get /registry/cronjobs/kyverno/kyverno-cleanup-admission-reports --print-value-only | auger decode
apiVersion: batch/v1
kind: CronJob
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"batch/v1","kind":"CronJob","metadata":{"annotations":{},"labels":{"app.kubernetes.io/component":"cleanup","app.kubernetes.io/instance":"kyverno","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/part-of":"kyverno","app.kubernetes.io/version":"3.0.5","argocd.argoproj.io/instance":"sintral-s3-kyverno","helm.sh/chart":"kyverno-3.0.5"},"name":"kyverno-cleanup-admission-reports","namespace":"kyverno"},"spec":{"concurrencyPolicy":"Forbid","failedJobsHistoryLimit":1,"jobTemplate":{"spec":{"template":{"metadata":null,"spec":{"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"karpenter.k8s.aws/instance-family","operator":"In","values":["m6i","r6i"]}]}]}}},"containers":[{"command":["/bin/sh","-c","COUNT=$(kubectl get admissionreports.kyverno.io -A | wc -l)\nif [ \"$COUNT\" -gt 10000 ]; then\n  echo \"too many reports found ($COUNT), cleaning up...\"\n  kubectl delete admissionreports.kyverno.io -A -l='!audit.kyverno.io/report.aggregate'\nelse\n  echo \"($COUNT) reports found, no clean up needed\"\nfi\n"],"image":"bitnami/kubectl:1.26.4","imagePullPolicy":null,"name":"cleanup","resources":{"limits":{"cpu":1,"memory":"12Gi"},"requests":{"cpu":"256m","memory":"2Gi"}},"securityContext":{"allowPrivilegeEscalation":false,"capabilities":{"drop":["ALL"]},"privileged":false,"readOnlyRootFilesystem":true,"runAsNonRoot":true,"seccompProfile":{"type":"RuntimeDefault"}}}],"restartPolicy":"OnFailure","serviceAccountName":"kyverno-cleanup-jobs","tolerations":[{"effect":"NoSchedule","key":"provisioner-type","operator":"Equal","value":"high-memory"}]}}}},"schedule":"*/10 * * * *","successfulJobsHistoryLimit":1}}
  creationTimestamp: "2023-06-14T07:33:03Z"
  generation: 4
  labels:
    app.kubernetes.io/component: cleanup
    app.kubernetes.io/instance: kyverno
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kyverno
    app.kubernetes.io/version: 3.0.5
    argocd.argoproj.io/instance: sintral-s3-kyverno
    helm.sh/chart: kyverno-3.0.5
  managedFields:
  - apiVersion: batch/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:active: {}
        f:lastScheduleTime: {}
        f:lastSuccessfulTime: {}
    manager: kube-controller-manager
    operation: Update
    subresource: status
    time: "2023-10-03T08:30:00Z"
  - apiVersion: batch/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/instance: {}
          f:app.kubernetes.io/managed-by: {}
          f:app.kubernetes.io/part-of: {}
          f:app.kubernetes.io/version: {}
          f:argocd.argoproj.io/instance: {}
          f:helm.sh/chart: {}
      f:spec:
        f:concurrencyPolicy: {}
        f:failedJobsHistoryLimit: {}
        f:jobTemplate:
          f:spec:
            f:template:
              f:spec:
                f:affinity:
                  .: {}
                  f:nodeAffinity:
                    .: {}
                    f:requiredDuringSchedulingIgnoredDuringExecution: {}
                f:containers:
                  k:{"name":"cleanup"}:
                    .: {}
                    f:command: {}
                    f:image: {}
                    f:imagePullPolicy: {}
                    f:name: {}
                    f:resources:
                      .: {}
                      f:limits:
                        .: {}
                        f:cpu: {}
                        f:memory: {}
                      f:requests:
                        .: {}
                        f:cpu: {}
                        f:memory: {}
                    f:securityContext:
                      .: {}
                      f:allowPrivilegeEscalation: {}
                      f:capabilities:
                        .: {}
                        f:drop: {}
                      f:privileged: {}
                      f:readOnlyRootFilesystem: {}
                      f:runAsNonRoot: {}
                      f:seccompProfile:
                        .: {}
                        f:type: {}
                    f:terminationMessagePath: {}
                    f:terminationMessagePolicy: {}
                f:dnsPolicy: {}
                f:restartPolicy: {}
                f:schedulerName: {}
                f:securityContext: {}
                f:serviceAccount: {}
                f:serviceAccountName: {}
                f:terminationGracePeriodSeconds: {}
                f:tolerations: {}
        f:schedule: {}
        f:successfulJobsHistoryLimit: {}
        f:suspend: {}
    manager: argocd-controller
    operation: Update
    time: "2023-10-03T08:58:19Z"
  name: kyverno-cleanup-admission-reports
  namespace: kyverno
  uid: 0579b2da-2238-4bdd-9e42-f18da832eb86
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        metadata:
          creationTimestamp: null
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: karpenter.k8s.aws/instance-family
                    operator: In
                    values:
                    - m6i
                    - r6i
          containers:
          - command:
            - /bin/sh
            - -c
            - |
              COUNT=$(kubectl get admissionreports.kyverno.io -A | wc -l)
              if [ "$COUNT" -gt 10000 ]; then
                echo "too many reports found ($COUNT), cleaning up..."
                kubectl delete admissionreports.kyverno.io -A -l='!audit.kyverno.io/report.aggregate'
              else
                echo "($COUNT) reports found, no clean up needed"
              fi
            image: bitnami/kubectl:1.26.4
            imagePullPolicy: IfNotPresent
            name: cleanup
            resources:
              limits:
                cpu: "1"
                memory: 12Gi
              requests:
                cpu: 256m
                memory: 2Gi
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop:
                - ALL
              privileged: false
              readOnlyRootFilesystem: true
              runAsNonRoot: true
              seccompProfile:
                type: RuntimeDefault
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: OnFailure
          schedulerName: default-scheduler
          securityContext: {}
          serviceAccount: kyverno-cleanup-jobs
          serviceAccountName: kyverno-cleanup-jobs
          terminationGracePeriodSeconds: 30
          tolerations:
          - effect: NoSchedule
            key: provisioner-type
            operator: Equal
            value: high-memory
  schedule: '*/10 * * * *'
  successfulJobsHistoryLimit: 1
  suspend: false
status:
  active:
  - apiVersion: batch/v1
    kind: Job
    name: kyverno-cleanup-admission-reports-28325940
    namespace: kyverno
    resourceVersion: "692561028"
    uid: e987ce5c-0920-4b68-acc1-454e12b5d66b
  lastScheduleTime: "2023-11-09T19:00:00Z"
  lastSuccessfulTime: "2023-11-06T16:24:43Z"

@realshuting
Member

Thanks for the update @chaochn47 !

We noticed that the cronjob status is in an odd state: the lastSuccessfulTime is before the lastScheduleTime, which indicates the job might not have completed successfully.

Per Kubernetes CronJob limitations, the controller will not start a new job if more than 100 schedules were missed:

For every CronJob, the CronJob Controller checks how many schedules it missed in the duration from its last scheduled time until now. If there are more than 100 missed schedules, then it does not start the job and logs the error

Cannot determine if job needs to be started. Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.

Can you confirm the job status and share the logs, if possible?
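If the missed-schedule limit turns out to be the problem, one hedged workaround is to set .spec.startingDeadlineSeconds on the cronjob so the controller only counts recent misses (the value below is illustrative):

kubectl -n kyverno patch cronjob kyverno-cleanup-admission-reports \
  --type merge -p '{"spec":{"startingDeadlineSeconds":3600}}'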

@realshuting
Member

BTW, the resource settings for the cronjob seem to be enough for the cleanup, but it's still worth checking whether there were any OOM restarts or related errors.

            resources:
              limits:
                cpu: "1"
                memory: 12Gi
              requests:
                cpu: 256m
                memory: 2Gi

@chaochn47
Author

All of the cleanup report pods were either OOMKilled or terminated with Error:
dev-dsk-chaochn-2c-a26acd76 % etcdctl get /registry/pods/kyverno/kyverno-cleanup-admission-reports-28325940-7pkd5 | auger decode --output json | jq -r '.status.containerStatuses[0].lastState.terminated.reason'
Error

dev-dsk-chaochn-2c-a26acd76 % etcdctl get /registry/pods/kyverno/kyverno-cleanup-admission-reports-28257080-wx9rc | auger decode --output json | jq -r '.status.containerStatuses[0].state.terminated.reason'
OOMKilled

dev-dsk-chaochn-2c-a26acd76 % etcdctl get /registry/pods/kyverno/kyverno-cleanup-admission-reports-28245020-2s6hn | auger decode --output json | jq -r '.status.containerStatuses[0].state.terminated.reason'
OOMKilled
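For clusters without direct etcd access, roughly the same check can be done through the API server; a sketch:

kubectl -n kyverno get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].state.terminated.reason}{"\n"}{end}' \
  | grep cleanup-admission-reports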

@chaochn47
Author

chaochn47 commented Dec 5, 2023

Can you confirm the job status and share the logs, if possible?

Thanks @realshuting for the pointer. As the platform provider, we do not retain historical logs for the applications our customers deploy.

That makes it challenging to figure out when the cronjob started failing.

From 2023-11-01T00:00:00Z to 2023-11-02T23:59:59Z, I can tell the reports-controller issued hundreds of list requests for admission reports that failed with 500 and 410 responses.

Every 10 minutes, there was a 500 error on a "list all admission reports" call, timing out after 1 minute with "context deadline exceeded".

This is an indicator that the cleanup cronjob could also be failing for the same reason. So I conclude:

  1. It is too late to delete those objects.
  2. The deletion script in the cronjob spec is also not as efficient as the approach in the EKS blog post about managing etcd db size (the "How to reclaim etcd database space?" section); see the sketch below.
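For context, the generic etcd space-reclamation sequence behind that blog section, after the stale objects have been deleted through the API, is compaction, defragmentation, and disarming the NOSPACE alarm. A rough sketch (not the blog's exact procedure):

rev=$(etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')
etcdctl compact "$rev"
etcdctl defrag        # run per member; the member blocks while defragmenting
etcdctl alarm disarm  # clear NOSPACE so writes are accepted again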

However, I do think this kind of manual cleanup should only be executed as a last resort.

Longer term, to prevent unbounded admission report growth, we should consider a resource quota.
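As a sketch of that idea, an object-count quota on admission reports per namespace could look like this (the command and limit are illustrative; a quota set too low would itself block report creation):

kubectl -n monitoring create quota kyverno-admission-reports \
  --hard=count/admissionreports.kyverno.io=10000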

@realshuting
Member

All of the cleanup report pods were either OOMKilled or terminated with Error

Great, now we know why the admission reports piled up.

@KhaledEmaraDev - can we perform load testing against Kyverno 1.10.x and capture the cronjob's resource usage under various loads?

The deletion script in the cronjob spec is also not as efficient as the approach in the EKS blog post about managing etcd db size (the "How to reclaim etcd database space?" section).

Thanks for the pointer. In Kyverno 1.10.x, there are "aggregate" and "non-aggregate" admission reports. The stale non-aggregate admission reports are cleaned up using the label, as you can see here. With 1.11.x, admission reports have been changed to short-lived resources and are garbage collected right after aggregation.

We are continuously working on optimizing the reporting system. As Jim mentioned above, we are working towards leveraging API aggregation to support alternate storage backends for reports in the Kyverno 1.12 release, see kyverno/KDP#51.
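To see how much of a backlog is non-aggregated, and therefore eligible for that label-based cleanup, a quick check along these lines should work (note that with millions of objects the list itself can time out, as seen above):

# Reports still missing the aggregation label (cleanup candidates in 1.10.x)
kubectl get admissionreports.kyverno.io -A -l '!audit.kyverno.io/report.aggregate' --no-headers | wc -l
# Reports that have already been aggregated
kubectl get admissionreports.kyverno.io -A -l 'audit.kyverno.io/report.aggregate' --no-headers | wc -l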

@realshuting
Member

can we perform load testing against Kyverno 1.10.x and capture the cronjob's resource usage under various loads?

I performed a test with 1.10.7 to check the resource usage of the cronjob pod. The max memory usage was around 650Mi to clean up 10k admission reports.


So with 1.10.x, users need to tune the cronjob resource allocations properly to prevent job failures.
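For example (purely illustrative value; the right sizing depends on how many reports have accumulated), the limit can be raised through the Helm chart values or patched directly on the cronjob:

kubectl -n kyverno patch cronjob kyverno-cleanup-admission-reports --type json \
  -p '[{"op":"replace","path":"/spec/jobTemplate/spec/template/spec/containers/0/resources/limits/memory","value":"4Gi"}]'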

With 1.12, the cronjob will be removed, and a threshold will be added when creating admissionreports, see #9241.
