admission workload doesn't restart after charm's reconfiguration. #123

Open
orfeas-k opened this issue Feb 15, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@orfeas-k
Contributor

Bug Description

The workload doesn't restart after the charm is reconfigured. This means that if the workload is misconfigured, it will be restarted once it hits the health check's failure threshold, around 3 and a half minutes after deployment (threshold * (period + timeout)). However, the workload's unhealthiness will only become visible to the user once an update-status is fired, which is 5 minutes after deployment. If a user then reconfigures the workload, that won't have an effect on the actual workload, since the workload won't be restarted. Instead, it will keep its initial configuration and will log health check failures while trying to hit the previous port.

The opposite scenario exposes the same issue: if the port configuration is modified after the charm is deployed, this doesn't affect the workload and the charm remains healthy (even though the charm was reconfigured in an inappropriate way). This shows that the workload isn't restarted and still uses the port initially provided.
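
For reference, here's a minimal sketch of the kind of Pebble layer involved. The names and values are assumptions inferred from the logs below (30s period, threshold 4, restart action), not the charm's actual layer:

```python
from ops.pebble import Layer

# All names and values below are inferred from the logs in this issue.
layer = Layer({
    "services": {
        "admission-webhook": {
            "override": "replace",
            "command": "/webhook",
            "startup": "enabled",
            # Pebble runs this action when the check's failure threshold
            # is hit; per the behaviour described above, the restart
            # appears to happen only that first time.
            "on-check-failure": {"admission-webhook-up": "restart"},
        }
    },
    "checks": {
        "admission-webhook-up": {
            "override": "replace",
            "period": "30s",
            "threshold": 4,
            # TCP check against the configured workload port.
            "tcp": {"port": 4443},
        }
    },
})
```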

Update-status

During update-status, the charm's status is set to Maintenance (Workload failed health check) and it logs the following, which is inaccurate, since the workload won't be restarted every time an update-status is received, only when the failure threshold is hit.

unit-admission-webhook-0: 14:54:48 ERROR unit.admission-webhook/0.juju-log Container admission-webhook failed health check. It will be restarted.
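
For context, here's a minimal sketch of an update-status handler that would produce this kind of status and log line. The check name and messages are assumptions, not the charm's actual code:

```python
import logging

import ops

logger = logging.getLogger(__name__)


class AdmissionWebhookCharm(ops.CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.update_status, self._on_update_status)

    def _on_update_status(self, event: ops.EventBase) -> None:
        container = self.unit.get_container("admission-webhook")
        check = container.get_checks("admission-webhook-up")["admission-webhook-up"]
        if check.status != ops.pebble.CheckStatus.UP:
            # Misleading: Pebble only restarts the service when the failure
            # threshold is hit, not on every failed check or update-status.
            logger.error("Container admission-webhook failed health check. It will be restarted.")
            self.unit.status = ops.MaintenanceStatus("Workload failed health check")
```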

Questions

  1. Is Pebble expected to restart the workload only once, when the health check's failure threshold is hit, and never again? Asked about this in a Matrix thread
  2. Should config-changed events restart the workload? I understand that they should, since the charm updates its layer. For comparison, observing oidc-gatekeeper's behaviour, once its public-url is reconfigured, the workload is actually restarted; a sketch of such a handler follows this list. I'm not sure how this may interact with its service_patch, which is configured during init().
  3. What should we log?
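
Regarding question 2, here's a minimal sketch of a config-changed handler (as a method on the sketch class above) that forces the workload to pick up new configuration. The `_pebble_layer()` helper is hypothetical, and this is not the charm's actual code:

```python
    def _on_config_changed(self, event: ops.EventBase) -> None:
        container = self.unit.get_container("admission-webhook")
        if not container.can_connect():
            event.defer()
            return
        # _pebble_layer() is a hypothetical helper that rebuilds the layer
        # (service command and check port) from the current charm config.
        container.add_layer("admission-webhook", self._pebble_layer(), combine=True)
        # replan() only restarts services whose own configuration changed; if
        # only the check's port changed, an explicit restart is needed for the
        # workload to actually pick up the new settings.
        container.restart("admission-webhook")
```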

To Reproduce

  • Deploy the admission-webhook charm with its port misconfigured, e.g. with --config port=3333
    juju deploy admission-webhook --channel latest/edge --trust --config port=3333  
    
  • Wait until the health check is down. This takes around 3 and a half minutes. You can view its checks with
    kubectl -n kubeflow exec admission-webhook-0 -c admission-webhook -- /charm/bin/pebble checks
    
  • Reconfigure its port
    juju config admission-webhook port=4443
    
  • Observe the workload logs

Environment

╰─$ microk8s version
MicroK8s v1.26.11 revision 6237

╰─$ juju version --all
version: 3.1.7-genericlinux-amd64
git-commit: 0cd207d999fef1fc8b965c410e9f58fafe7ee335
git-tree-state: archive

Relevant Log Output

╰─$ kubectl -n kubeflow logs admission-webhook-0 -c admission-webhook
2024-02-15T12:52:03.914Z [pebble] HTTP API server listening on ":38813".
2024-02-15T12:52:03.914Z [pebble] Started daemon.
2024-02-15T12:52:07.731Z [pebble] POST /v1/files 3.479094ms 200
2024-02-15T12:52:07.735Z [pebble] POST /v1/files 3.194301ms 200
2024-02-15T12:52:09.290Z [pebble] GET /v1/plan?format=yaml 176.341µs 200
2024-02-15T12:52:09.292Z [pebble] POST /v1/layers 390.801µs 200
2024-02-15T12:52:09.304Z [pebble] POST /v1/services 4.234492ms 202
2024-02-15T12:52:09.307Z [pebble] Service "admission-webhook" starting: /webhook
2024-02-15T12:52:09.327Z [admission-webhook] {"level":"info","ts":"2024-02-15T12:52:09Z","logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
2024-02-15T12:52:09.327Z [admission-webhook] I0215 12:52:09.327074      14 main.go:771] About to start serving webhooks: &http.Server{Addr:":4443", Handler:http.Handler(nil), DisableGeneralOptionsHandler:false, TLSConfig:(*tls.Config)(0xc0006829c0), ReadTimeout:0, ReadHeaderTimeout:0, WriteTimeout:0, IdleTimeout:0, MaxHeaderBytes:0, TLSNextProto:map[string]func(*http.Server, *tls.Conn, http.Handler)(nil), ConnState:(func(net.Conn, http.ConnState))(nil), ErrorLog:(*log.Logger)(nil), BaseContext:(func(net.Listener) context.Context)(nil), ConnContext:(func(context.Context, net.Conn) context.Context)(nil), inShutdown:atomic.Bool{_:atomic.noCopy{}, v:0x0}, disableKeepAlives:atomic.Bool{_:atomic.noCopy{}, v:0x0}, nextProtoOnce:sync.Once{done:0x0, m:sync.Mutex{state:0, sema:0x0}}, nextProtoErr:error(nil), mu:sync.Mutex{state:0, sema:0x0}, listeners:map[*net.Listener]struct {}(nil), activeConn:map[*http.conn]struct {}(nil), onShutdown:[]func()(nil), listenerGroup:sync.WaitGroup{noCopy:sync.noCopy{}, state:atomic.Uint64{_:atomic.noCopy{}, _:atomic.align64{}, v:0x0}, sema:0x0}}
2024-02-15T12:52:09.327Z [admission-webhook] {"level":"info","ts":"2024-02-15T12:52:09Z","logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
2024-02-15T12:52:10.313Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.008495385s 200
2024-02-15T12:52:12.305Z [pebble] GET /v1/plan?format=yaml 155.159µs 200
2024-02-15T12:52:39.292Z [pebble] Check "admission-webhook-up" failure 1 (threshold 4): dial tcp [::1]:3333: connect: connection refused
2024-02-15T12:53:09.294Z [pebble] Check "admission-webhook-up" failure 2 (threshold 4): dial tcp [::1]:3333: connect: connection refused
2024-02-15T12:53:39.293Z [pebble] Check "admission-webhook-up" failure 3 (threshold 4): dial tcp [::1]:3333: connect: connection refused
2024-02-15T12:54:09.294Z [pebble] Check "admission-webhook-up" failure 4 (threshold 4): dial tcp [::1]:3333: connect: connection refused


2024-02-15T12:54:09.294Z [pebble] Check "admission-webhook-up" failure threshold 4 hit, triggering action
2024-02-15T12:54:09.294Z [pebble] Service "admission-webhook" on-check-failure action is "restart", terminating process before restarting
2024-02-15T12:54:09.297Z [pebble] Service "admission-webhook" exited after check failure, restarting
2024-02-15T12:54:09.297Z [pebble] Service "admission-webhook" on-check-failure action is "restart", waiting ~500ms before restart (backoff 1)
2024-02-15T12:54:09.826Z [pebble] Service "admission-webhook" starting: /webhook
2024-02-15T12:54:09.850Z [admission-webhook] {"level":"info","ts":"2024-02-15T12:54:09Z","logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
2024-02-15T12:54:09.850Z [admission-webhook] I0215 12:54:09.850506      24 main.go:771] About to start serving webhooks: &http.Server{Addr:":4443", Handler:http.Handler(nil), DisableGeneralOptionsHandler:false, TLSConfig:(*tls.Config)(0xc0002871e0), ReadTimeout:0, ReadHeaderTimeout:0, WriteTimeout:0, IdleTimeout:0, MaxHeaderBytes:0, TLSNextProto:map[string]func(*http.Server, *tls.Conn, http.Handler)(nil), ConnState:(func(net.Conn, http.ConnState))(nil), ErrorLog:(*log.Logger)(nil), BaseContext:(func(net.Listener) context.Context)(nil), ConnContext:(func(context.Context, net.Conn) context.Context)(nil), inShutdown:atomic.Bool{_:atomic.noCopy{}, v:0x0}, disableKeepAlives:atomic.Bool{_:atomic.noCopy{}, v:0x0}, nextProtoOnce:sync.Once{done:0x0, m:sync.Mutex{state:0, sema:0x0}}, nextProtoErr:error(nil), mu:sync.Mutex{state:0, sema:0x0}, listeners:map[*net.Listener]struct {}(nil), activeConn:map[*http.conn]struct {}(nil), onShutdown:[]func()(nil), listenerGroup:sync.WaitGroup{noCopy:sync.noCopy{}, state:atomic.Uint64{_:atomic.noCopy{}, _:atomic.align64{}, v:0x0}, sema:0x0}}
2024-02-15T12:54:09.850Z [admission-webhook] {"level":"info","ts":"2024-02-15T12:54:09Z","logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
2024-02-15T12:54:39.293Z [pebble] Check "admission-webhook-up" failure 5 (threshold 4): dial tcp [::1]:3333: connect: connection refused
2024-02-15T12:54:48.114Z [pebble] GET /v1/checks?names=admission-webhook-up 175.272µs 200
2024-02-15T12:55:03.823Z [pebble] GET /v1/plan?format=yaml 195.019µs 200
2024-02-15T12:55:09.294Z [pebble] Check "admission-webhook-up" failure 6 (threshold 4): dial tcp [::1]:3333: connect: connection refused
2024-02-15T12:55:39.294Z [pebble] Check "admission-webhook-up" failure 7 (threshold 4): dial tcp [::1]:3333: connect: connection refused


# ran `juju config admission-webhook port=4443` here
2024-02-15T12:56:09.295Z [pebble] Check "admission-webhook-up" failure 8 (threshold 4): dial tcp [::1]:3333: connect: connection refused
2024-02-15T12:56:39.294Z [pebble] Check "admission-webhook-up" failure 9 (threshold 4): dial tcp [::1]:3333: connect: connection refused
2024-02-15T12:56:44.325Z [pebble] GET /v1/checks?names=admission-webhook-up 82.388µs 200

Additional Context

No response

@orfeas-k orfeas-k added the bug Something isn't working label Feb 15, 2024

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5343.

This message was autogenerated
