[prometheus-operator-admission-webhook] job-patchWebhook runs too late #5607
Comments
cc @zeritti
Thanks, @TheRealNoob, for raising this issue. This is indeed a known problem. Once a mutating/validating webhook configuration is present, the API server will contact the admission webhook just after a PrometheusRule has been created. At install time, this will fail (assuming a custom CA is being used) as the patch job has not yet run and the configuration has not been patched with the CA bundle. I used to work around this problem with two steps when installing: 1. install a release with the PrometheusRule disabled and 2. upgrade the release with the rule enabled. I always preferred using cert-manager wherever possible, though. Using hook weight would be elegant. In this case, though, the patch job has to patch an existing resource: if the job runs as a pre-install hook at install time, the resource, i.e. a validating/mutating webhook configuration, is not present and the job is going to fail.
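In values terms, the two steps look roughly like this (file names are illustrative; the second file is applied with a subsequent `helm upgrade` against the same release):

```yaml
# step1-values.yaml (illustrative): install with the rule disabled so the API
# server never calls the not-yet-patched webhook during the initial install.
prometheusRule:
  enabled: false
---
# step2-values.yaml (illustrative): once the patch job has injected the CA
# bundle, upgrade the same release with the rule enabled.
prometheusRule:
  enabled: true
```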
Interesting problem... I spent some time thinking about the different approaches. The cleanest solution I came up with was to include the validating/mutating webhook configuration objects in the pre-sync hooks. And it turns out someone has already added support for just that in kube-prometheus-stack. It's possible that this could cause errors if someone is applying prometheusRule/Alertmanager objects in another, parallel release, but I'm chalking this up as an acceptable risk to get a clean release. Thoughts?
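A rough sketch of what I mean, assuming the configuration templates gain Helm hook annotations (the name and weight below are illustrative):

```yaml
# Render the ValidatingWebhookConfiguration as a hook so that it already exists
# by the time the patch job (also a hook, with a higher weight) runs.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: prometheus-operator-admission-webhook   # illustrative name
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "-5"   # lower weights run first, so this precedes the patch job
```

The MutatingWebhookConfiguration would get the same annotations.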
So I updated the MR with the proposed changes above. I tested it out on minikube and found that it does work, but a new problem arose. When helm gets to applying the prometheusRule objects, the apiserver makes the call to the operator, and the request times out because the operator is still starting up.
Not sure how to address that issue. Helm doesn't have any concept of a forced sleep unless you go crazy with a Job that runs a sleep command. Even so, I still consider the above change an improvement: at least now, when we re-run the release with the same values file, it works. We use ArgoCD at my day job, not sure how familiar you are with it. If Argo does a release and it fails, it retries up to four more times before marking it as failed. I figure that at least in this setup it will now succeed on one of those attempts.
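For reference, this is roughly what the retry looks like on our Argo CD Applications (the numbers are just our settings, not something the chart requires):

```yaml
# Argo CD retries the sync a few times, which covers the window in which the
# operator is still starting up and the admission call times out.
spec:
  syncPolicy:
    retry:
      limit: 4            # up to 4 additional attempts before the sync is marked failed
      backoff:
        duration: 30s
        factor: 2
        maxDuration: 5m
```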
Come to think of it, the chart probably has the same issue in the certManager approach |
Yes, this is indeed what happens. It will be OK in the end if the install gets retried. The thing with making the configurations into hooks is that, as hooks, they get deleted, i.e. there will be a moment during upgrades when the configurations are absent. I also think that the configurations won't be deleted on deletion of the release, as they are not part of the release. These points should also be considered.
Sorry, I'm struggling to follow. Are you saying that the ValidatingWebhookConfiguration and MutatingWebhookConfiguration objects get deleted and recreated during a chart upgrade? Why would that be? I can't even imagine there being changes to those objects.
Perhaps you're referring to the `helm.sh/hook-delete-policy` annotation?
Yes, this is what I meant. If you do not set it, the default applies (Ref.): if no hook deletion policy annotation is specified, the before-hook-creation behaviour applies by default.
Ah, I see, I didn't know that. I assumed it defaulted to blank. So we need to define it as blank then. I'll give this a try and make sure this works. Assuming it does, are you on board with this approach?
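For the record, there doesn't seem to be a literal blank/"never" value; the documented policies are `before-hook-creation` (the default), `hook-succeeded` and `hook-failed`. What I plan to test is something along these lines, on the assumption that dropping `before-hook-creation` stops Helm from deleting the configuration right before re-creating it on upgrade:

```yaml
# Candidate annotation block for the webhook configurations; whether Helm then
# handles the already-existing resource cleanly on upgrade is what needs testing.
metadata:
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": hook-failed   # only delete the resource if the hook fails
```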
Describe the bug
When using the default `jobs.enabled=true` to create the TLS key pairs, the job-patchWebhook runs too late in the release life cycle. If you deploy any `prometheusRules` as part of the release, they will fail with the error `unknown Certificate Authority`, causing the helm release to error out. At present the patch job is applied with `"helm.sh/hook": post-install,post-upgrade`, which means it runs after all other resources are applied.

What are the scenarios where a prometheusRule could be deployed with the release?
- `prometheusRule.enabled`, but it is disabled by default

Proposed solution
I think it makes the most sense to run `job-patchWebhook` using `"helm.sh/hook": pre-install`. This is what the `job-createSecret` uses. The only thing we need to ensure is that the patch runs after the create, and we can do that with `helm.sh/hook-weight: <int>`.
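Roughly what I have in mind for the two jobs (the weights are illustrative; lower weights run first):

```yaml
# job-createSecret: generates the TLS key pair first
annotations:
  "helm.sh/hook": pre-install
  "helm.sh/hook-weight": "1"
---
# job-patchWebhook: patches the webhook configurations with the CA bundle afterwards
annotations:
  "helm.sh/hook": pre-install
  "helm.sh/hook-weight": "2"
```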
What's your helm version?
version.BuildInfo{Version:"v3.12.0", GitCommit:"c9f554d75773799f72ceef38c51210f1842a1dea", GitTreeState:"clean", GoVersion:"go1.20.3"}
What's your kubectl version?
mike@EliteBook-840:~$ k version
Client Version: v1.30.11
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.31.7-eks-bcf3d70
Which chart?
prometheus-operator-admission-webhook
What's the chart version?
0.24.1
What happened?
helm release fails
What you expected to happen?
`caBundle` needs to be defined before `prometheusRule` objects are applied.
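For context, the field in question is `clientConfig.caBundle` on the webhook configurations; until the patch job fills it in, the API server cannot verify the webhook's serving certificate. A sketch (the webhook name is illustrative):

```yaml
# Excerpt of a webhook configuration after the patch job has run; without the
# caBundle, the admission call fails with "unknown Certificate Authority".
webhooks:
  - name: prometheusrulemutate.monitoring.coreos.com   # illustrative webhook name
    clientConfig:
      caBundle: <base64-encoded CA certificate>        # injected by job-patchWebhook
```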
How to reproduce it?
NA
Enter the changed values of values.yaml?
prometheusRule.enabled=true
Enter the command that you execute and failing/misfunctioning.
helm install prometheus-operator-admission-webhook prometheus-community/prometheus-operator-admission-webhook --set prometheusRule.enabled=true
Anything else we need to know?
Nope!