Record metrics for render failures #7

Niksko · 2021-04-27T23:52:07Z

At the moment, it's hard to monitor this service, because render failures aren't recorded. It would be nice if the existing metrics endpoint exposed these metrics.

Happy to submit a PR

Niksko · 2021-04-28T06:07:36Z

Never mind, this is already recorded as part of the standard controller runtime reconcile metrics. Just make sure you're looking at the instance that is the leader, otherwise they won't show up 😅

Niksko · 2021-04-28T07:08:16Z

Actually, the currently exported metrics don't really fit the purpose of alerting on render failures. What we need is some sort of condition gauge like Flux has: https://github.com/fluxcd/pkg/blob/main/runtime/metrics/recorder.go

Again, happy to submit a PR for this
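
For readers unfamiliar with the idea, a condition gauge exports one series per object, condition type and status, and sets the series for the current status to 1 and the others to 0. The following is a minimal stdlib-only sketch of that idea, not Flux's implementation and not code from this project. The type and function names and the `team-a`/`app-config` values are hypothetical, and a real controller would more likely back this with a Prometheus gauge vector.

```go
package main

import (
	"fmt"
	"sync"
)

// conditionKey identifies one gauge series (hypothetical label set).
type conditionKey struct {
	Namespace, Name, Type, Status string
}

// conditionGauge is an in-memory stand-in for a labelled gauge.
type conditionGauge struct {
	mu     sync.Mutex
	values map[conditionKey]float64
}

func newConditionGauge() *conditionGauge {
	return &conditionGauge{values: make(map[conditionKey]float64)}
}

// statuses lists the condition statuses that each get their own series.
var statuses = []string{"True", "False", "Unknown"}

// Record sets the series matching the current status to 1 and the
// series for every other status to 0, so stale states never linger.
func (g *conditionGauge) Record(namespace, name, condType, status string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	for _, s := range statuses {
		v := 0.0
		if s == status {
			v = 1
		}
		k := conditionKey{Namespace: namespace, Name: name, Type: condType, Status: s}
		g.values[k] = v
	}
}

// Value returns the current value of one series (0 if never recorded).
func (g *conditionGauge) Value(namespace, name, condType, status string) float64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	k := conditionKey{Namespace: namespace, Name: name, Type: condType, Status: status}
	return g.values[k]
}

func main() {
	g := newConditionGauge()
	// A render failure would be recorded as Ready=False for that object.
	g.Record("team-a", "app-config", "Ready", "False")
	fmt.Println(g.Value("team-a", "app-config", "Ready", "False"))
}
```

With a gauge shaped like this, an alert can fire while the Ready/False series is 1, regardless of how often the object is being reconciled.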

Fixes: #7 maybe?

abursavich · 2021-04-28T18:46:29Z

For a little bit of context, the intended design was to allow the separation of (human) operators and users. For instance, one team may run one CMS controller for an entire cluster and multiple other teams may use CMS objects in the cluster.

As you found, there is a built-in controller_runtime_reconcile_total metric that includes a result label (success, error, requeue, or requeue_after). It was a conscious decision not to treat cases where a required secret/configmap object or key is missing as a reconcile error; instead, an info-level warning is logged and the CMS is requeued. An error indicates that something is actually wrong (e.g. RBAC) and may need (human) operator intervention. A requeue indicates that a user hasn't configured their CMS properly (and operators should sleep through the night without getting paged).

I went ahead and added an explicit configmapsecret_controller_missing_value_render_errors_total metric, which includes a label for the namespace of the CMS. A gauge, as you suggested, would probably be nicer since the reconcile retries will eventually back off to ~15m. Another possible solution would be to add kube-state-metrics-like support for all CMS instances with their current status.

Niksko closed this as completed Apr 28, 2021

abursavich added a commit that referenced this issue Apr 28, 2021

controllers: add missing value render errors metric

4546b16

Fixes: #7 maybe?

abursavich reopened this Apr 28, 2021
