admission workload doesn't restart after charm's reconfiguration. #123

Open
orfeas-k opened this issue Feb 15, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@orfeas-k
Contributor

Bug Description

The workload doesn't restart after the charm is reconfigured. This means that if the workload is misconfigured, it will be restarted once it hits the health check's failure threshold, around 3 and a half minutes after deployment (threshold * (period + timeout)). However, the workload's unhealthiness will only become visible to the user once an update-status is fired, which is 5 minutes after deployment. If a user then reconfigures the workload, that won't have an effect on the actual workload, since the workload won't be restarted. Instead, it will keep its initial configuration and will log health check failures while trying to hit the previous port.

The opposite scenario exposes the same issue: if the port configuration is modified after the charm is deployed, this doesn't affect the workload and the charm remains healthy (even though the charm was reconfigured in an inappropriate way). This shows that the workload isn't restarted and still uses the port initially provided.
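
For reference, here's a minimal sketch of the kind of Pebble layer involved. The names and values are assumptions inferred from the logs below (30s period, threshold 4, restart action), not the charm's actual layer:

```python
from ops.pebble import Layer

# All names and values below are inferred from the logs in this issue.
layer = Layer({
    "services": {
        "admission-webhook": {
            "override": "replace",
            "command": "/webhook",
            "startup": "enabled",
            # Pebble runs this action when the check's failure threshold
            # is hit; per the behaviour described above, the restart
            # appears to happen only that first time.
            "on-check-failure": {"admission-webhook-up": "restart"},
        }
    },
    "checks": {
        "admission-webhook-up": {
            "override": "replace",
            "period": "30s",
            "threshold": 4,
            # TCP check against the configured workload port.
            "tcp": {"port": 4443},
        }
    },
})
```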

Update-status

During update-status, the charm's status is set to Maintenance (Workload failed health check) and it logs the following, which is inaccurate, since the workload won't be restarted every time an update-status is received, only when the failure threshold is hit.

unit-admission-webhook-0: 14:54:48 ERROR unit.admission-webhook/0.juju-log Container admission-webhook failed health check. It will be restarted.
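
For context, here's a minimal sketch of an update-status handler that would produce this kind of status and log line. The check name and messages are assumptions, not the charm's actual code:

```python
import logging

import ops

logger = logging.getLogger(__name__)


class AdmissionWebhookCharm(ops.CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.update_status, self._on_update_status)

    def _on_update_status(self, event: ops.EventBase) -> None:
        container = self.unit.get_container("admission-webhook")
        check = container.get_checks("admission-webhook-up")["admission-webhook-up"]
        if check.status != ops.pebble.CheckStatus.UP:
            # Misleading: Pebble only restarts the service when the failure
            # threshold is hit, not on every failed check or update-status.
            logger.error("Container admission-webhook failed health check. It will be restarted.")
            self.unit.status = ops.MaintenanceStatus("Workload failed health check")
```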

Questions

  1. Is Pebble expected to restart the workload only once, when the health check's failure threshold is hit, and never again? Asked about this in a Matrix thread
  2. Should config-changed events restart the workload? I understand that they should, since the charm updates its layer. For comparison, observing oidc-gatekeeper's behaviour, once its public-url is reconfigured, the workload is actually restarted; a sketch of such a handler follows this list. I'm not sure how this may interact with its service_patch, which is configured during init().
  3. What should we log?
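
Regarding question 2, here's a minimal sketch of a config-changed handler (as a method on the sketch class above) that forces the workload to pick up new configuration. The `_pebble_layer()` helper is hypothetical, and this is not the charm's actual code:

```python
    def _on_config_changed(self, event: ops.EventBase) -> None:
        container = self.unit.get_container("admission-webhook")
        if not container.can_connect():
            event.defer()
            return
        # _pebble_layer() is a hypothetical helper that rebuilds the layer
        # (service command and check port) from the current charm config.
        container.add_layer("admission-webhook", self._pebble_layer(), combine=True)
        # replan() only restarts services whose own configuration changed; if
        # only the check's port changed, an explicit restart is needed for the
        # workload to actually pick up the new settings.
        container.restart("admission-webhook")
```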

To Reproduce

  • Deploy the admission-webhook charm with its port misconfigured, e.g. with --config port=3333
    juju deploy admission-webhook --channel latest/edge --trust --config port=3333  
    
  • Wait until the health check is down. This takes around 3 and a half minutes. You can view its checks with
    kubectl -n kubeflow exec admission-webhook-0 -c admission-webhook -- /charm/bin/pebble checks
    
  • Reconfigure its port
    juju config admission-webhook port=4443
    
  • Observe the workload logs

Environment

╰─$ microk8s version
MicroK8s v1.26.11 revision 6237

╰─$ juju version --all
version: 3.1.7-genericlinux-amd64
git-commit: 0cd207d999fef1fc8b965c410e9f58fafe7ee335
git-tree-state: archive

Relevant Log Output

╰─$ kubectl -n kubeflow logs admission-webhook-0 -c admission-webhook
2024-02-15T12:52:03.914Z [pebble] HTTP API server listening on ":38813".
2024-02-15T12:52:03.914Z [pebble] Started daemon.
2024-02-15T12:52:07.731Z [pebble] POST /v1/files 3.479094ms 200
2024-02-15T12:52:07.735Z [pebble] POST /v1/files 3.194301ms 200
2024-02-15T12:52:09.290Z [pebble] GET /v1/plan?format=yaml 176.341µs 200
2024-02-15T12:52:09.292Z [pebble] POST /v1/layers 390.801µs 200
2024-02-15T12:52:09.304Z [pebble] POST /v1/services 4.234492ms 202
2024-02-15T12:52:09.307Z [pebble] Service "admission-webhook" starting: /webhook
2024-02-15T12:52:09.327Z [admission-webhook] {"level":"info","ts":"2024-02-15T12:52:09Z","logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
2024-02-15T12:52:09.327Z [admission-webhook] I0215 12:52:09.327074      14 main.go:771] About to start serving webhooks: &http.Server{Addr:":4443", Handler:http.Handler(nil), DisableGeneralOptionsHandler:false, TLSConfig:(*tls.Config)(0xc0006829c0), ReadTimeout:0, ReadHeaderTimeout:0, WriteTimeout:0, IdleTimeout:0, MaxHeaderBytes:0, TLSNextProto:map[string]func(*http.Server, *tls.Conn, http.Handler)(nil), ConnState:(func(net.Conn, http.ConnState))(nil), ErrorLog:(*log.Logger)(nil), BaseContext:(func(net.Listener) context.Context)(nil), ConnContext:(func(context.Context, net.Conn) context.Context)(nil), inShutdown:atomic.Bool{_:atomic.noCopy{}, v:0x0}, disableKeepAlives:atomic.Bool{_:atomic.noCopy{}, v:0x0}, nextProtoOnce:sync.Once{done:0x0, m:sync.Mutex{state:0, sema:0x0}}, nextProtoErr:error(nil), mu:sync.Mutex{state:0, sema:0x0}, listeners:map[*net.Listener]struct {}(nil), activeConn:map[*http.conn]struct {}(nil), onShutdown:[]func()(nil), listenerGroup:sync.WaitGroup{noCopy:sync.noCopy{}, state:atomic.Uint64{_:atomic.noCopy{}, _:atomic.align64{}, v:0x0}, sema:0x0}}
2024-02-15T12:52:09.327Z [admission-webhook] {"level":"info","ts":"2024-02-15T12:52:09Z","logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
2024-02-15T12:52:10.313Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.008495385s 200
2024-02-15T12:52:12.305Z [pebble] GET /v1/plan?format=yaml 155.159µs 200
2024-02-15T12:52:39.292Z [pebble] Check "admission-webhook-up" failure 1 (threshold 4): dial tcp [::1]:3333: connect: connection refused
2024-02-15T12:53:09.294Z [pebble] Check "admission-webhook-up" failure 2 (threshold 4): dial tcp [::1]:3333: connect: connection refused
2024-02-15T12:53:39.293Z [pebble] Check "admission-webhook-up" failure 3 (threshold 4): dial tcp [::1]:3333: connect: connection refused
2024-02-15T12:54:09.294Z [pebble] Check "admission-webhook-up" failure 4 (threshold 4): dial tcp [::1]:3333: connect: connection refused


2024-02-15T12:54:09.294Z [pebble] Check "admission-webhook-up" failure threshold 4 hit, triggering action
2024-02-15T12:54:09.294Z [pebble] Service "admission-webhook" on-check-failure action is "restart", terminating process before restarting
2024-02-15T12:54:09.297Z [pebble] Service "admission-webhook" exited after check failure, restarting
2024-02-15T12:54:09.297Z [pebble] Service "admission-webhook" on-check-failure action is "restart", waiting ~500ms before restart (backoff 1)
2024-02-15T12:54:09.826Z [pebble] Service "admission-webhook" starting: /webhook
2024-02-15T12:54:09.850Z [admission-webhook] {"level":"info","ts":"2024-02-15T12:54:09Z","logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
2024-02-15T12:54:09.850Z [admission-webhook] I0215 12:54:09.850506      24 main.go:771] About to start serving webhooks: &http.Server{Addr:":4443", Handler:http.Handler(nil), DisableGeneralOptionsHandler:false, TLSConfig:(*tls.Config)(0xc0002871e0), ReadTimeout:0, ReadHeaderTimeout:0, WriteTimeout:0, IdleTimeout:0, MaxHeaderBytes:0, TLSNextProto:map[string]func(*http.Server, *tls.Conn, http.Handler)(nil), ConnState:(func(net.Conn, http.ConnState))(nil), ErrorLog:(*log.Logger)(nil), BaseContext:(func(net.Listener) context.Context)(nil), ConnContext:(func(context.Context, net.Conn) context.Context)(nil), inShutdown:atomic.Bool{_:atomic.noCopy{}, v:0x0}, disableKeepAlives:atomic.Bool{_:atomic.noCopy{}, v:0x0}, nextProtoOnce:sync.Once{done:0x0, m:sync.Mutex{state:0, sema:0x0}}, nextProtoErr:error(nil), mu:sync.Mutex{state:0, sema:0x0}, listeners:map[*net.Listener]struct {}(nil), activeConn:map[*http.conn]struct {}(nil), onShutdown:[]func()(nil), listenerGroup:sync.WaitGroup{noCopy:sync.noCopy{}, state:atomic.Uint64{_:atomic.noCopy{}, _:atomic.align64{}, v:0x0}, sema:0x0}}
2024-02-15T12:54:09.850Z [admission-webhook] {"level":"info","ts":"2024-02-15T12:54:09Z","logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
2024-02-15T12:54:39.293Z [pebble] Check "admission-webhook-up" failure 5 (threshold 4): dial tcp [::1]:3333: connect: connection refused
2024-02-15T12:54:48.114Z [pebble] GET /v1/checks?names=admission-webhook-up 175.272µs 200
2024-02-15T12:55:03.823Z [pebble] GET /v1/plan?format=yaml 195.019µs 200
2024-02-15T12:55:09.294Z [pebble] Check "admission-webhook-up" failure 6 (threshold 4): dial tcp [::1]:3333: connect: connection refused
2024-02-15T12:55:39.294Z [pebble] Check "admission-webhook-up" failure 7 (threshold 4): dial tcp [::1]:3333: connect: connection refused


# ran `juju config admission-webhook port=4443` here
2024-02-15T12:56:09.295Z [pebble] Check "admission-webhook-up" failure 8 (threshold 4): dial tcp [::1]:3333: connect: connection refused
2024-02-15T12:56:39.294Z [pebble] Check "admission-webhook-up" failure 9 (threshold 4): dial tcp [::1]:3333: connect: connection refused
2024-02-15T12:56:44.325Z [pebble] GET /v1/checks?names=admission-webhook-up 82.388µs 200

Additional Context

No response

@orfeas-k orfeas-k added the bug Something isn't working label Feb 15, 2024

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5343.

This message was autogenerated
