COS apps unreachable after setting tls-* options #330

Open
przemeklal opened this issue Apr 23, 2024 · 0 comments

Bug Description

I tried to replace the self-signed-certificates :certificates relations with the newly introduced tls-* config options, but it didn't work (an approximate command sketch is included after the list below).

  1. I set only tls-cert and tls-key, since the cert is signed by a well-trusted 3rd-party CA. I ended up with the status message "Please set tls-cert, tls-key, and tls-ca".
  2. I set the tls-ca anyway. I ended up with:
  • catalogue working fine with the new cert
  • "Bad Gateway" on /cos-prometheus-0
  • alertmanager working but with the old cert
  • "Internal Server Error" on /cos-grafana
  3. I removed the :certificates relations, which resulted in:
alertmanager/0*  error     idle   10.1.233.221         hook failed: "certificates-relation-broken" for ca:certificates
grafana/0*       error     idle   10.1.233.223         hook failed: "certificates-relation-broken" for ca:certificates
loki/0*          error     idle   10.1.233.222         hook failed: "certificates-relation-broken" for ca:certificates
prometheus/0*    error     idle   10.1.233.224         hook failed: "certificates-relation-broken" for ca:certificates
traefik/0*       error     idle   10.1.233.230         hook failed: "certificates-relation-broken" for ca:certificates
  4. After repeatedly running juju resolve --no-retry and recreating the COS app pods a few times (prometheus, grafana, traefik, catalogue, alertmanager), I ended up with:
  • catalogue, alertmanager, prometheus reachable and serving the new cert
  • grafana endpoint throwing "Bad Gateway" and printing this in its logs:
2024-04-23T14:26:26.539Z [grafana] Error: ✗ *api.HTTPServer run error: cert_file cannot be empty when using HTTPS
  5. I tried recreating the pods and unsetting and re-setting the tls-* options on traefik, but I was not able to restore it.
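
Roughly, the sequence of commands was along these lines. The certificate file names and relation endpoints below are placeholders/assumptions; the tls-* option names and juju resolve --no-retry are as used above:

# 1-2. Set the TLS options on traefik (file names are placeholders)
juju config traefik tls-cert="$(cat server.crt)" tls-key="$(cat server.key)" tls-ca="$(cat ca.crt)"
# 3. Remove the self-signed-certificates relations (repeated for each app; endpoint names assumed)
juju remove-relation ca:certificates traefik:certificates
# 4. Clear the failed certificates-relation-broken hooks (repeated for each unit in error)
juju resolve --no-retry traefik/0
# 5. Reset the tls-* options on traefik and set them again
juju config traefik --reset tls-cert,tls-key,tls-ca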

To Reproduce

Listed above.

Environment

juju 3.4.0

Versions:

App           Version  Status  Scale  Charm                     Channel        Rev  Address         Exposed  Message
alertmanager  0.26.0   active      1  alertmanager-k8s          latest/edge    105  10.152.183.185  no       
ca                     active      1  self-signed-certificates  stable          72  10.152.183.220  no       
catalogue              active      1  catalogue-k8s             latest/stable   33  10.152.183.206  no       
grafana       9.5.3    active      1  grafana-k8s               latest/stable  105  10.152.183.219  no       
loki          2.9.4    active      1  loki-k8s                  latest/stable  118  10.152.183.233  no       
prometheus    2.49.1   active      1  prometheus-k8s            latest/stable  170  10.152.183.49   no       
traefik       v2.11.0  active      1  traefik-k8s               latest/edge    180  10.5.1.15       no       

Relevant log output

Traefik when trying to open /cos-grafana:

2024-04-23T14:25:21.427Z [traefik] time="2024-04-23T14:25:21Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 10.1.233.234:3000: connect: connection refused"

2024-04-23T14:26:26.531Z [grafana] logger=provisioning.dashboard type=file name=Default t=2024-04-23T14:26:26.531084435Z level=error msg="failed to save dashboard" file=/etc/grafana/provisioning/dashboards/juju_alertmanager-k8s_e9224b0.json error="SQL query for existing dashboard by UID failed: context canceled"
2024-04-23T14:26:26.535Z [grafana] logger=provisioning.dashboard type=file name=Default t=2024-04-23T14:26:26.535152504Z level=error msg="failed to save dashboard" file=/etc/grafana/provisioning/dashboards/juju_loki-k8s_0804127.json error="SQL query for existing dashboard by UID failed: context canceled"
2024-04-23T14:26:26.538Z [grafana] logger=provisioning.dashboard type=file name=Default t=2024-04-23T14:26:26.538180317Z level=error msg="failed to save dashboard" file=/etc/grafana/provisioning/dashboards/juju_prometheus-k8s_35dd368.json error="SQL query for existing dashboard by UID failed: context canceled"
2024-04-23T14:26:26.539Z [grafana] Error: ✗ *api.HTTPServer run error: cert_file cannot be empty when using HTTPS
2024-04-23T14:26:26.550Z [pebble] Service "grafana" stopped unexpectedly with code 1
2024-04-23T14:26:26.550Z [pebble] Service "grafana" on-failure action is "restart", waiting ~30s before restart (backoff 30)
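
The last two Grafana lines suggest the workload is being restarted with HTTPS enabled but no certificate file configured, which would also explain the "connection refused"/502 seen by Traefik. Some checks that may help triage (the workload container name, the Grafana config path, and the external port below are assumptions on my part):

# Follow the grafana unit's logs
juju debug-log --include grafana/0
# Inspect the [server] section of the rendered Grafana config (path is a guess)
juju ssh --container grafana grafana/0 -- grep -A5 "\[server\]" /etc/grafana/grafana.ini
# Check which certificate traefik is actually serving on its external address
openssl s_client -connect 10.5.1.15:443 </dev/null 2>/dev/null | openssl x509 -noout -issuer -dates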

Additional context

No response
