[Bug] admission reports piled up causing etcd turned into read-only mode #8974
Comments
Thanks for opening your first issue here! Be sure to follow the issue template! |
Thank you @chaochn47 ! Pasting the snippet of the admission report for easier reading:
|
This issue of report objects causing k8s control plane instability has been ongoing for a very long time. It requires a fundamental design change: use a separate data store, such as postgres or mongodb, instead of the k8s store. Please don't try patching a fundamentally wrong design choice. There is a real customer cost because of this issue. We have completely stopped using the reports feature. |
@mshanmu - please see kyverno/KDP#51. This is a proposal for using API aggregation and supporting alternate storage backends for reports. Feel free to provide feedback and contribute there. Reports pile up when they are produced faster than they can be consumed (processed). The processing rate is throttled by default or inadequate settings. In 1.11.0 we have changed the defaults for QPS, but they still need to be tuned based on your installation. In 1.11.0 there is also a cronjob for cleanup, in case the configuration is not correct. We will look at porting these back to 1.10.x. |
Thanks for the response @JimBugwadia !! Will contribute to the KDP#51. |
Hi @chaochn47 - there was a cleanup cronjob in 1.10.3 to periodically delete admission reports without the aggregation label. This option is enabled by default. How was Kyverno installed and configured in your scenario? Were there any custom configurations? |
@chaochn47 - Please help us identify if this is a configuration issue⬆️ |
Thanks @realshuting for the pointer, would you mind giving an example cronjob name so I can try to match a cronjob key in etcd? |
@chaochn47 - you can search for the cronjob "kyverno-cleanup-admission-reports" in the namespace where Kyverno was deployed. |
I am moving this to 1.11.2, as there seem to be no changes we are targeting for 1.10.x. |
kyverno-cleanup-admission-reports cronjob spec.
|
Thanks for the update @chaochn47 ! We noticed that the cronjob status is in an odd state. Per Kubernetes CronJob limitations, the controller will not start a new job if it has missed more than 100 schedules:
Can you confirm the job status and share the logs, if possible? |
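The 100-missed-schedules limit mentioned above can be illustrated with a small calculation. This is a minimal sketch, not the controller's actual code: it assumes a fixed-interval schedule, and the 10-minute interval is an assumption based on the 10-minute cadence of failed list requests reported later in this thread. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Limit described in the Kubernetes CronJob documentation.
MAX_MISSED_SCHEDULES = 100


def missed_schedules(last_schedule: datetime, now: datetime, interval: timedelta) -> int:
    """Count fixed-interval schedule times that fall in (last_schedule, now]."""
    if now <= last_schedule:
        return 0
    # timedelta // timedelta yields an int: the number of whole intervals elapsed.
    return (now - last_schedule) // interval


def controller_would_skip(last_schedule: datetime, now: datetime, interval: timedelta) -> bool:
    """True when more than 100 schedules were missed, so no new job is started."""
    return missed_schedules(last_schedule, now, interval) > MAX_MISSED_SCHEDULES


if __name__ == "__main__":
    interval = timedelta(minutes=10)  # assumed schedule, see the note above
    last = datetime(2023, 11, 1, tzinfo=timezone.utc)
    for hours in (16, 17):
        now = last + timedelta(hours=hours)
        print(hours, missed_schedules(last, now, interval), controller_would_skip(last, now, interval))
```

Under that assumed 10-minute schedule, a little under 17 hours without a successfully scheduled run would be enough to cross the limit.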
BTW the resource settings for the cronjob seem to be enough for the cleanup, but it's still worth checking whether there were any OOM restarts or related errors.
|
All of the cleanup report pods were either OOMKilled or in Error status
|
Thanks @realshuting for the pointer. As the platform provider, we do not retain historical logs for applications that our customers have deployed, which makes it hard to figure out when the cronjob started to fail. Every 10 minutes, a request to list all admission reports failed with a 500 error: it timed out after 1 minute with context deadline exceeded. This indicates that the cleanup cronjob could also have failed for the same reason. So I conclude
However, I do think this script should only be executed as a last resort. Longer term, to better prevent unbounded admission report growth, we should consider a resource quota. |
Great, now we know why admission reports piled up. @KhaledEmaraDev - can we perform load testing against Kyverno 1.10.x and capture the cronjob resource usage under various loads?
Thanks for the pointer. In Kyverno 1.10.x, there are "aggregate" and "non-aggregate" admission reports. The stale non-aggregate admission reports are cleaned up by using the label, as you can see here. With 1.11.x, admission reports have been changed to short-lived resources and are garbage collected right after their aggregation. We are continuously working on optimizing the reporting system. As Jim mentioned above, we are working towards leveraging API aggregation to support alternate storage backends for reports in the Kyverno 1.12 release, see kyverno/KDP#51. |
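The list timeout described above is what one would expect when millions of objects are requested in a single call. The Kubernetes list API accepts `limit` and `continue` parameters so that a client can walk a large collection in chunks. The sketch below only illustrates that pattern against an in-memory stand-in. The `fetch_page` signature and the fake fetcher are hypothetical, and nothing here is taken from Kyverno or from the cleanup job.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical page-fetching signature: (limit, continue_token) -> (items, next_token).
PageFetcher = Callable[[int, Optional[str]], Tuple[List[str], Optional[str]]]


def iter_in_chunks(fetch_page: PageFetcher, limit: int = 500) -> Iterator[List[str]]:
    """Yield one bounded page at a time instead of listing everything at once."""
    token: Optional[str] = None
    while True:
        items, token = fetch_page(limit, token)
        if items:
            yield items
        if not token:  # no continue token means the last page was reached
            break


def make_fake_fetcher(names: List[str]) -> PageFetcher:
    """In-memory stand-in for an API server; the token is just an offset."""
    def fetch(limit: int, token: Optional[str]) -> Tuple[List[str], Optional[str]]:
        start = int(token) if token else 0
        end = start + limit
        next_token = str(end) if end < len(names) else None
        return names[start:end], next_token
    return fetch


if __name__ == "__main__":
    reports = [f"report-{i}" for i in range(1250)]
    for page in iter_in_chunks(make_fake_fetcher(reports), limit=500):
        # A real cleanup would delete the stale reports in this page here.
        print(len(page))
```

Each request then carries a bounded amount of data, so neither the server-side request deadline nor the client's memory has to accommodate the full collection.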
Performed a test to check the resource usage of the cronjob pod with 1.10.7. The max memory usage was around 650Mi to clean up 10k admission reports. So with 1.10.x, users need to tune the cronjob resource allocations properly to prevent job failures. With 1.12, the cronjob will be removed, and a threshold will be added when creating admissionreports, see #9241. |
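As a rough aid for that tuning, the single measurement above (about 650Mi for 10k reports) can be extrapolated. This is only a back-of-the-envelope sketch: it assumes memory grows linearly with the number of reports, which the test did not verify, and the 1.5x headroom factor and the function name are hypothetical.

```python
def estimate_memory_mib(
    report_count: int,
    measured_reports: int = 10_000,  # from the test above
    measured_mib: float = 650.0,     # from the test above
    headroom: float = 1.5,           # hypothetical safety factor
) -> float:
    """Linearly extrapolate the cleanup pod's memory need (assumption: linear scaling)."""
    return report_count * measured_mib / measured_reports * headroom


if __name__ == "__main__":
    for count in (10_000, 100_000, 1_000_000):
        print(f"{count} reports: about {estimate_memory_mib(count):.0f} Mi")
```

If the linear assumption holds even approximately, a backlog of millions of reports is far beyond what a single cleanup pod could load at once, which is consistent with the OOMKilled pods reported earlier in the thread.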
Kyverno Version
1.10.3
Description
Follow up issue report from slack discussion
1.24 EKS cluster
Millions of kyverno admission reports piled up since June 2023 and occupied most of the space in the etcd db. This breached the upstream recommended maximum db size quota (8G) and turned etcd into read-only mode.
kyverno-app-controller-pod-spec.yaml was the pod spec at the time the db filled up, though I am not sure whether the user has upgraded the controller version since June 2023. The 1.10.3 Kyverno Version is fetched from ghcr.io/kyverno/kyverno:v1.10.3 in this spec.
kyverno-admission-report-sample.json was one of the example admission report custom resources.
Please let me know if the kyverno community wants more information, like the apiserver audit log or other admission report samples.
Slack discussion
https://kubernetes.slack.com/archives/CLGR9BJU9/p1700252421515759
Troubleshooting