Webhook Race Condition with TLS handshake error: tls: bad certificate #2560

njtran · 2022-07-25T19:09:53Z

Hoping to get some insight on the following issue. Happy to hop on a call or slack huddle in the knative slack to give more info.

Expected Behavior

The webhook should work and not require a non-deterministic amount of container restarts for it to work.

Actual Behavior

Using defaulting and validating webhooks for Karpenter CRDs. When first installing Karpenter, we get the following error in the webhook container logs. Even after receiving a failure, the webhook container stays ready. The issue is resolved sometimes by restarting the container a non-deterministic amount of times.

karpenter-6f84d7c89d-4246h webhook 2022/07/14 17:47:02 http: TLS handshake error from 192.168.139.165:44004: remote error: tls: bad certificate
karpenter-6f84d7c89d-4246h webhook 2022/07/14 17:47:02 http: TLS handshake error from 192.168.139.165:44006: remote error: tls: bad certificate

This webhook is further proven broken when it blocks creation of the CRD because the certificate is signed by an unknown authority.

Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "validation.webhook.karpenter.k8s.aws": Post "[https://karpenter.karpenter.svc:443/?timeout=10s](https://karpenter.karpenter.svc/?timeout=10s)": x509: certificate signed by unknown authority

Here’s where this webhook was created in code that we started to see this issue after.

Steps to Reproduce the Problem

Install Karpenter with the webhook mentioned.
Watch webhook logs.

Additional Info

karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.145Z	INFO	webhook.DefaultingWebhook	Starting controller and workers	{"commit": "2c98771"}
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.145Z	INFO	webhook.DefaultingWebhook	Started workers	{"commit": "2c98771"}
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.145Z	DEBUG	webhook.DefaultingWebhook	Processing from queue defaulting.webhook.karpenter.k8s.aws (depth: 1)	{"commit": "2c98771"}
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.145Z	DEBUG	webhook.DefaultingWebhook	Processing from queue karpenter/karpenter-cert (depth: 0)	{"commit": "2c98771"}
karpenter-6f84d7c89d-4246h webhook I0714 17:46:46.145610       1 leaderelection.go:243] attempting to acquire leader lease karpenter/webhook.validationwebhook.00-of-01...
karpenter-6f84d7c89d-4246h webhook I0714 17:46:46.146009       1 leaderelection.go:243] attempting to acquire leader lease karpenter/webhook.webhookcertificates.00-of-01...
karpenter-6f84d7c89d-4246h webhook I0714 17:46:46.146311       1 leaderelection.go:243] attempting to acquire leader lease karpenter/webhook.defaultingwebhook.00-of-01...
karpenter-6f84d7c89d-4246h webhook I0714 17:46:46.146585       1 leaderelection.go:243] attempting to acquire leader lease karpenter/webhook.validationwebhook.00-of-01...
karpenter-6f84d7c89d-4246h webhook I0714 17:46:46.146866       1 leaderelection.go:243] attempting to acquire leader lease karpenter/webhook.configmapwebhook.00-of-01...
karpenter-6f84d7c89d-4246h webhook I0714 17:46:46.147154       1 leaderelection.go:243] attempting to acquire leader lease karpenter/webhook.defaultingwebhook.00-of-01...
karpenter-6f84d7c89d-4246h webhook I0714 17:46:46.158486       1 leaderelection.go:253] successfully acquired lease karpenter/webhook.defaultingwebhook.00-of-01
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.158Z	INFO	webhook	"karpenter-6f84d7c89d-4246h_7349c3d9-0ba3-4bf1-94b3-0952c56f9be9" has started leading "webhook.defaultingwebhook.00-of-01"	{"commit": "2c98771"}
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.158Z	DEBUG	webhook.DefaultingWebhook	Adding to the slow queue defaulting.webhook.provisioners.karpenter.sh (depth(total/slow): 1/1)	{"commit": "2c98771", "knative.dev/key": "/defaulting.webhook.provisioners.karpenter.sh"}
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.158Z	DEBUG	webhook.DefaultingWebhook	Processing from queue defaulting.webhook.provisioners.karpenter.sh (depth: 0)	{"commit": "2c98771"}
karpenter-6f84d7c89d-4246h webhook I0714 17:46:46.159316       1 leaderelection.go:253] successfully acquired lease karpenter/webhook.configmapwebhook.00-of-01
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.159Z	INFO	webhook	"karpenter-6f84d7c89d-4246h_3422fdfa-830a-42e1-bfe5-323689e5641e" has started leading "webhook.configmapwebhook.00-of-01"	{"commit": "2c98771"}
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.159Z	DEBUG	webhook.ConfigMapWebhook	Adding to the slow queue validation.webhook.config.karpenter.sh (depth(total/slow): 1/1)	{"commit": "2c98771", "knative.dev/key": "/validation.webhook.config.karpenter.sh"}
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.159Z	DEBUG	webhook.ConfigMapWebhook	Processing from queue validation.webhook.config.karpenter.sh (depth: 0)	{"commit": "2c98771"}
karpenter-6f84d7c89d-4246h webhook E0714 17:46:46.159762       1 leaderelection.go:361] Failed to update lock: Operation cannot be fulfilled on leases.coordination.k8s.io "webhook.defaultingwebhook.00-of-01": the object has been modified; please apply your changes to the latest version and try again
karpenter-6f84d7c89d-4246h webhook I0714 17:46:46.159844       1 leaderelection.go:253] successfully acquired lease karpenter/webhook.validationwebhook.00-of-01
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.159Z	INFO	webhook	"karpenter-6f84d7c89d-4246h_a0c9c1f8-5016-4ebb-b738-f455dd699c48" has started leading "webhook.validationwebhook.00-of-01"	{"commit": "2c98771"}
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.160Z	DEBUG	webhook.ValidationWebhook	Adding to the slow queue validation.webhook.karpenter.k8s.aws (depth(total/slow): 1/1)	{"commit": "2c98771", "knative.dev/key": "/validation.webhook.karpenter.k8s.aws"}
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.160Z	DEBUG	webhook.ValidationWebhook	Processing from queue validation.webhook.karpenter.k8s.aws (depth: 0)	{"commit": "2c98771"}
karpenter-6f84d7c89d-4246h webhook I0714 17:46:46.160306       1 leaderelection.go:253] successfully acquired lease karpenter/webhook.webhookcertificates.00-of-01
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.160Z	INFO	webhook	"karpenter-6f84d7c89d-4246h_dafed312-0c03-4c85-9afc-9ab7842483ed" has started leading "webhook.webhookcertificates.00-of-01"	{"commit": "2c98771"}
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.160Z	DEBUG	webhook.WebhookCertificates	Adding to the slow queue karpenter/karpenter-cert (depth(total/slow): 1/1)	{"commit": "2c98771", "knative.dev/key": "karpenter/karpenter-cert"}
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.160Z	DEBUG	webhook.WebhookCertificates	Processing from queue karpenter/karpenter-cert (depth: 0)	{"commit": "2c98771"}
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.160Z	INFO	webhook.WebhookCertificates	Reconcile succeeded	{"commit": "2c98771", "knative.dev/traceid": "36a77d14-41da-4ef7-bcaf-585fcbf0e45b", "knative.dev/key": "karpenter/karpenter-cert", "duration": "248.666µs"}
karpenter-6f84d7c89d-4246h webhook E0714 17:46:46.160932       1 leaderelection.go:361] Failed to update lock: Operation cannot be fulfilled on leases.coordination.k8s.io "webhook.validationwebhook.00-of-01": the object has been modified; please apply your changes to the latest version and try again
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.213Z	INFO	webhook.DefaultingWebhook	Updating webhook	{"commit": "2c98771", "knative.dev/traceid": "17354ae1-5148-439f-bde6-2c0bd820c9a8", "knative.dev/key": "defaulting.webhook.provisioners.karpenter.sh"}
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.218Z	INFO	webhook.ConfigMapWebhook	Webhook is valid	{"commit": "2c98771", "knative.dev/traceid": "54b1e742-648e-4d9e-ba37-dae50fbdf615", "knative.dev/key": "validation.webhook.config.karpenter.sh"}
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.218Z	INFO	webhook.ConfigMapWebhook	Reconcile succeeded	{"commit": "2c98771", "knative.dev/traceid": "54b1e742-648e-4d9e-ba37-dae50fbdf615", "knative.dev/key": "validation.webhook.config.karpenter.sh", "duration": "58.836021ms"}
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.310Z	INFO	webhook.ValidationWebhook	Updating webhook	{"commit": "2c98771", "knative.dev/traceid": "41feaed3-40f0-405c-afea-e44d1dc78c3d", "knative.dev/key": "validation.webhook.karpenter.k8s.aws"}
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.311Z	INFO	webhook.DefaultingWebhook	Reconcile succeeded	{"commit": "2c98771", "knative.dev/traceid": "17354ae1-5148-439f-bde6-2c0bd820c9a8", "knative.dev/key": "defaulting.webhook.provisioners.karpenter.sh", "duration": "152.813399ms"}
karpenter-6f84d7c89d-4246h webhook 2022-07-14T17:46:46.319Z	INFO	webhook.ValidationWebhook	Reconcile succeeded	{"commit": "2c98771", "knative.dev/traceid": "41feaed3-40f0-405c-afea-e44d1dc78c3d", "knative.dev/key": "validation.webhook.karpenter.k8s.aws", "duration": "159.026253ms"}
karpenter-6f84d7c89d-4246h webhook 2022/07/14 17:47:02 http: TLS handshake error from 192.168.139.165:44004: remote error: tls: bad certificate
karpenter-6f84d7c89d-4246h webhook 2022/07/14 17:47:02 http: TLS handshake error from 192.168.139.165:44006: remote error: tls: bad certificate

The text was updated successfully, but these errors were encountered:

dprotaso · 2022-07-26T18:32:09Z

Typically knative webhooks on startup go through the following steps

Setup the TLS config of the webhook to look up the k8s secret from the informer (the secret name is hard coded as part of the webhook options)
Spin up a controller to reconcile the webhook certificate
Spin up a controller to reconcile the webhook configuration
Each controller tries to acquire a lease to become the leader
Once the lease is acquired perform the mutations (ie. cert controller creates a cert and updates the secret)

For this flow the webhooks need to be constructed with the certificate controller in addition with the default/validating admission controllers

ie. see: https://github.com/knative/serving/blob/c82be271867f137d0923be34acd18b6aca452446/cmd/webhook/main.go#L158-L172

If you're getting bad certificate I'm curious what cert is the webhook presenting and see what's defined in your CA bundle of the configured webhook (ie. ValidatingWebhookConfiguration and MutatingWebhookConfiguration)

dprotaso · 2022-07-26T18:34:17Z

The typical misconfiguration we see is if the liveness probe timeout of the webhook deployment is too low - it never gets a chance to become the leader and create the certificate. This is because K8s kills the container.

ie. vmware-tanzu/sources-for-knative#356

It's interesting to see bad certificate - that's a new one.

github-actions · 2022-10-25T01:32:54Z

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

knative-prow-robot · 2022-11-24T01:42:45Z

This issue or pull request is stale because it has been open for 90 days with no activity.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close

/lifecycle stale

github-actions · 2023-02-23T01:31:43Z

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

dprotaso · 2023-03-27T18:44:18Z

/reopen
/lifecycle frozen

knative-prow · 2023-03-27T18:44:21Z

@dprotaso: Reopened this issue.

In response to this:

/reopen
/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

dprotaso · 2023-03-27T18:45:49Z

@njtran just following up if this is still an issue. I haven't had time to dig into this but this might be a good-first-issue

njtran · 2023-03-27T21:34:49Z

Hey @dprotaso, thanks for following up! I hadn't added more to this just because in our later releases we haven't seen this problem occur. I don't have a reproduction I can give you, but I know that I saw it not uncommonly in our earlier releases. Happy to dive into this with you if you want?

dprotaso · 2023-03-28T00:57:31Z

Did anything change in your later releases?

njtran · 2023-03-28T17:32:16Z

Yep. Here are the webhook definitions for what I believe had the issue:

And here are the webhook definitions now, where I haven't seen the issue in a while.

Sometimes the issue has been because of an old unreachable webhook left around due to Argo CR syncing.

Maybe you see something different though?

njtran · 2023-04-03T20:38:16Z

We think it might be something to do with Argo syncing some old versions of webhooks. Have you heard anything like this?

nalshamaajc · 2023-04-18T11:59:21Z

@dprotaso Is the below error a related issue?

Karpenter: v0.27.1
EKS: v1.25.6-eks-232056e
ArgoCD: v2.6.7+5bcd846

ERROR	webhook.WebhookCertificates	Reconcile error	{"commit": "7131be2-dirty", "knative.dev/traceid": "ce9f2f51-3ee5-4fdc-a1ce-19caa7807db5", "knative.dev/key": "karpenter/karpenter-cert", "duration": "45.405825ms", "error": "Operation cannot be fulfilled on secrets \"karpenter-cert\": the object has been modified; please apply your changes to the latest version and try again"}

BryanStenson-okta · 2023-09-27T06:35:48Z

I'm pretty sure I'm seeing this too:

Karpenter: v0.27.6
EKS: v1.25.12-eks-2d98532
ArgoCD: v2.8.3+77556d9

I'm seeing this while building out this cluster -- everything is "new" (no old versions of eks, karpenter, etc):

$ kubectl logs karpenter-provisioners-5cd99796cf-lrnbs -f
{"level":"info","ts":1695796080.7865567,"logger":"fallback","caller":"injection/injection.go:63","msg":"Starting informers..."}
2023/09/27 06:28:03 Registering 2 clients
2023/09/27 06:28:03 Registering 2 informer factories
2023/09/27 06:28:03 Registering 3 informers
2023/09/27 06:28:03 Registering 5 controllers
{"level":"INFO","time":"2023-09-27T06:28:03.864Z","logger":"controller","message":"Starting server","commit":"5a2fe84-dirty","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"INFO","time":"2023-09-27T06:28:03.867Z","logger":"controller","message":"Starting server","commit":"5a2fe84-dirty","kind":"health probe","addr":"[::]:8081"}
I0927 06:28:03.970155       1 leaderelection.go:248] attempting to acquire leader lease infra/karpenter-leader-election...
{"level":"INFO","time":"2023-09-27T06:28:04.009Z","logger":"controller","message":"Starting informers...","commit":"5a2fe84-dirty"}
2023/09/27 06:28:05 http: TLS handshake error from 10.42.172.163:50520: remote error: tls: bad certificate
...

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 25, 2022

github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 25, 2022

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 23, 2023

github-actions bot closed this as completed Mar 26, 2023

knative-prow bot reopened this Mar 27, 2023

knative-prow bot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 27, 2023

BryanStenson-okta mentioned this issue Sep 27, 2023

webhook calls fails with "remote error: tls: bad certificate" and prevents management of Provisioners aws/karpenter-provider-aws#4709

Closed

pierDipi mentioned this issue May 14, 2024

Eventing webhook fails to start / stuck in crash loop knative/eventing#7885

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Webhook Race Condition with TLS handshake error: tls: bad certificate #2560

Webhook Race Condition with TLS handshake error: tls: bad certificate #2560

njtran commented Jul 25, 2022

dprotaso commented Jul 26, 2022 •

edited

Loading

dprotaso commented Jul 26, 2022 •

edited

Loading

github-actions bot commented Oct 25, 2022

knative-prow-robot commented Nov 24, 2022

github-actions bot commented Feb 23, 2023

dprotaso commented Mar 27, 2023

knative-prow bot commented Mar 27, 2023

dprotaso commented Mar 27, 2023

njtran commented Mar 27, 2023

dprotaso commented Mar 28, 2023

njtran commented Mar 28, 2023

njtran commented Apr 3, 2023

nalshamaajc commented Apr 18, 2023 •

edited

Loading

BryanStenson-okta commented Sep 27, 2023

Webhook Race Condition with TLS handshake error: tls: bad certificate #2560

Webhook Race Condition with TLS handshake error: tls: bad certificate #2560

Comments

njtran commented Jul 25, 2022

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

Additional Info

dprotaso commented Jul 26, 2022 • edited Loading

dprotaso commented Jul 26, 2022 • edited Loading

github-actions bot commented Oct 25, 2022

knative-prow-robot commented Nov 24, 2022

github-actions bot commented Feb 23, 2023

dprotaso commented Mar 27, 2023

knative-prow bot commented Mar 27, 2023

dprotaso commented Mar 27, 2023

njtran commented Mar 27, 2023

dprotaso commented Mar 28, 2023

njtran commented Mar 28, 2023

njtran commented Apr 3, 2023

nalshamaajc commented Apr 18, 2023 • edited Loading

BryanStenson-okta commented Sep 27, 2023

dprotaso commented Jul 26, 2022 •

edited

Loading

dprotaso commented Jul 26, 2022 •

edited

Loading

nalshamaajc commented Apr 18, 2023 •

edited

Loading