[prometheus-operator-admission-webhook] job-patchWebhook runs too late #5607


Open
TheRealNoob opened this issue May 5, 2025 · 11 comments · May be fixed by #5608
Labels
bug Something isn't working

Comments

@TheRealNoob
Contributor

Describe the bug

When using the default jobs.enabled=true to create the TLS key pairs, the job-patchWebhook runs too late in the release life-cycle. If you deploy any PrometheusRule objects as part of the release, they fail with the error `unknown Certificate Authority`, causing the helm release to error out. At present the patch job is applied with `"helm.sh/hook": post-install,post-upgrade`, which means it runs after all other resources are applied.

What are the scenarios where a prometheusRule could be deployed with the release?

  • This chart has native support for it via prometheusRule.enabled, but it is disabled by default
  • This chart is used as part of another chart (i.e. as a subchart), and one of the other charts deploys PrometheusRule objects

Proposed solution

I think it makes the most sense to run job-patchWebhook with `"helm.sh/hook": pre-install`, which is what job-createSecret uses. The only thing we need to ensure is that the patch job runs after the create job, and we can do that with `helm.sh/hook-weight: <int>`.
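A sketch of what the proposed annotations could look like; the hook phases and weight values here are illustrative, not taken from the chart:

```yaml
# job-createSecret: generates the TLS key pair, so it must run first.
metadata:
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "-5"   # lower weight runs earlier
---
# job-patchWebhook: injects the caBundle, so it must run after createSecret.
metadata:
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "5"    # higher weight runs later
```

Helm sorts hooks of the same phase by ascending weight, so any pair of values with createSecret lower than patchWebhook would give the required ordering.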

What's your helm version?

version.BuildInfo{Version:"v3.12.0", GitCommit:"c9f554d75773799f72ceef38c51210f1842a1dea", GitTreeState:"clean", GoVersion:"go1.20.3"}

What's your kubectl version?

mike@EliteBook-840:~$ k version
Client Version: v1.30.11
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.31.7-eks-bcf3d70

Which chart?

prometheus-operator-admission-webhook

What's the chart version?

0.24.1

What happened?

helm release fails

What you expected to happen?

caBundle needs to be defined before prometheusRule objects are applied

How to reproduce it?

NA

Enter the changed values of values.yaml?

prometheusRule.enabled=true

Enter the command that you execute and failing/misfunctioning.

helm install prometheus-operator-admission-webhook prometheus-community/prometheus-operator-admission-webhook --set prometheusRule.enabled=true

Anything else we need to know?

Nope!

@TheRealNoob TheRealNoob added the bug Something isn't working label May 5, 2025
@TheRealNoob
Contributor Author

cc @zeritti

@zeritti
Contributor

zeritti commented May 5, 2025

Thanks, @TheRealNoob, for raising this issue. This is indeed a known problem. Once a mutating/validating webhook configuration is present, the API server will contact the admission webhook as soon as a PrometheusRule is submitted. At install time, this fails (assuming a custom CA is being used) because the patch job has not yet run and the configuration has not been patched with the CA bundle.
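For context, the field the patch job has to fill in is `caBundle` under `clientConfig`; a trimmed, illustrative sketch (the names here are assumptions, not taken from the chart):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: prometheus-operator-admission-webhook   # illustrative name
webhooks:
  - name: prometheusrulemutate.monitoring.coreos.com
    clientConfig:
      service:
        name: prometheus-operator-admission-webhook
        namespace: default
        path: /admission-prometheusrules/validate
      caBundle: ""   # empty until the patch job injects the CA certificate
```

Until `caBundle` is populated, the API server cannot verify the webhook's serving certificate, which is what surfaces as the `unknown Certificate Authority` error.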

I used to work around this problem in two steps when installing: 1. install a release with the PrometheusRule disabled and 2. upgrade the release with the rule enabled. I always preferred using cert-manager wherever possible, though.

Using a hook weight would be elegant. In this case, though, the patch job has to patch an existing resource - if the job runs as a pre-install hook at install time, the resource, i.e. the validating/mutating webhook configuration, is not yet present and the job will fail.

@TheRealNoob
Contributor Author

Interesting problem... I spent some time thinking about the different approaches. The cleanest solution I came up with was to make the validating/mutating WebhookConfiguration objects pre-install hooks as well. And it turns out someone has already added support for just that in kube-prometheus-stack.

It's possible that this could cause errors if someone is applying PrometheusRule/Alertmanager objects in another parallel release, but I'm chalking that up as an acceptable risk to get a clean release.

Thoughts?

@TheRealNoob
Contributor Author

So I updated the MR with the proposed changes above. I tested it out on minikube and found that it does work, but a new problem arose: when helm gets to applying the PrometheusRule objects, the API server makes the admission call to the operator, and the request times out because the operator is still starting up.

@TheRealNoob
Contributor Author

Not sure how to address that issue. Helm doesn't have any concept of a forced sleep unless you go crazy with a Job that runs a sleep command. Even so, I still consider the above change an improvement: at least now, when we re-run the release with the same values file, it works.

We use ArgoCD at my day job; not sure how familiar you are with it. If Argo does a release and it fails, it retries 4 more times before marking it as failed. I figure at least in this setup it will now succeed on one of those attempts.

@TheRealNoob
Contributor Author

Come to think of it, the chart probably has the same issue in the cert-manager approach.

@zeritti
Contributor

zeritti commented May 7, 2025

So I updated the MR with the proposed changes above. I tested it out on minikube and found that it does work, but a new problem arose: when helm gets to applying the PrometheusRule objects, the API server makes the admission call to the operator, and the request times out because the operator is still starting up.

Yes, this is what happens, indeed. It will be ok in the end if the install gets retried.

The thing with making the configurations into hooks is that, as hooks, they get deleted, i.e. there will be a moment on upgrades when the configurations are absent. I also think the configurations won't be deleted when the release is deleted, as they are not part of the release. These points should also be considered.

@TheRealNoob
Contributor Author

Sorry, I'm struggling to follow. Are you saying that the ValidatingWebhookConfiguration and MutatingWebhookConfiguration objects get deleted and recreated during a chart upgrade? Why would that be? I can't even imagine there being changes to those objects.

@TheRealNoob
Contributor Author

Perhaps you're referring to the `"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded` annotation? I did not include that annotation on the *WebhookConfiguration objects.

@zeritti
Contributor

zeritti commented May 7, 2025

Perhaps you're referring to the `"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded` annotation? I did not include that annotation on the *WebhookConfiguration objects.

Yes, this is what I meant. If you do not set it, the default applies (Ref.):

If no hook deletion policy annotation is specified, the before-hook-creation behavior applies by default.
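To make the default concrete, here is what the annotation looks like when set explicitly on a hook resource (the hook phase shown is illustrative); per the Helm docs, omitting the annotation entirely behaves the same as the first line shown in the comment:

```yaml
metadata:
  annotations:
    "helm.sh/hook": pre-install
    # If "helm.sh/hook-delete-policy" is omitted, Helm behaves as if it were
    # set to "before-hook-creation": the previous hook resource is deleted
    # just before a new one is created on the next install/upgrade.
    "helm.sh/hook-delete-policy": before-hook-creation
```

So there is no window where the annotation's absence means "never delete" - the before-hook-creation behavior applies either way.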

@TheRealNoob
Contributor Author

Ah, I see, I didn't know that. I assumed it defaulted to blank, so we would need to define it as blank explicitly. I'll give this a try and make sure it works. Assuming it does, are you on board with this approach?
