Record metrics for render failures #7

Open
Niksko opened this issue Apr 27, 2021 · 3 comments


Niksko commented Apr 27, 2021

At the moment, it's hard to monitor this service, because render failures aren't recorded. It would be nice if the existing metrics endpoint exposed these metrics.

Happy to submit a PR


Niksko commented Apr 28, 2021

Never mind, this is already recorded as part of the standard controller runtime reconcile metrics. Just make sure you're looking at the instance that is the leader, otherwise they won't show up 😅

@Niksko Niksko closed this as completed Apr 28, 2021

Niksko commented Apr 28, 2021

Actually, the currently exported metrics don't really fit the purpose of alerting on render failures. What we need is some sort of condition gauge like Flux has: https://github.com/fluxcd/pkg/blob/main/runtime/metrics/recorder.go
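
To make the "condition gauge" idea concrete, here is a minimal, language-agnostic sketch in Python (standard library only). It models the pattern rather than Flux's actual API: one series per object, condition type, and status, with the current status set to 1 and the others to 0. The class name, the `Rendered` condition type, and the label set are hypothetical; a real implementation in this controller would presumably use a Prometheus gauge vector.

```python
from typing import Dict, List, Tuple

# Possible condition statuses. This mirrors Kubernetes-style conditions;
# the exact set a real recorder exports is an assumption here.
STATUSES = ("True", "False", "Unknown")

# Label tuple: (namespace, name, condition type, status)
Labels = Tuple[str, str, str, str]


class ConditionGauge:
    """Hypothetical in-memory model of a condition gauge."""

    def __init__(self) -> None:
        self.series: Dict[Labels, float] = {}

    def record(self, namespace: str, name: str, cond_type: str, status: str) -> None:
        """Set the series for the current status to 1 and all others to 0."""
        if status not in STATUSES:
            raise ValueError(f"unknown status: {status!r}")
        for candidate in STATUSES:
            value = 1.0 if candidate == status else 0.0
            self.series[(namespace, name, cond_type, candidate)] = value

    def failing(self, cond_type: str) -> List[Tuple[str, str]]:
        """Return (namespace, name) pairs whose condition is currently False."""
        return sorted(
            (ns, name)
            for (ns, name, ctype, status), value in self.series.items()
            if ctype == cond_type and status == "False" and value == 1.0
        )
```

Because the gauge holds the current state, an alert can simply check for any series with status `False` equal to 1, no matter how long ago the last reconcile attempt happened.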

Again, happy to submit a PR for this

abursavich added a commit that referenced this issue Apr 28, 2021
@abursavich abursavich reopened this Apr 28, 2021
@abursavich
Contributor

For a little bit of context, the intended design was to allow the separation of (human) operators and users. For instance, one team may run a single CMS controller for an entire cluster while multiple other teams use CMS objects in that cluster.

As you found, there is a built-in controller_runtime_reconcile_total metric that includes a result label (success, error, requeue, or requeue_after). It was a conscious decision not to treat a missing required secret/configmap object or key as a reconcile error. Instead, an info-level warning is logged and the CMS is requeued. An error indicates that something is actually wrong (e.g. RBAC) and may need (human) operator intervention. A requeue indicates that a user hasn't configured their CMS properly (and operators should sleep through the night without getting paged).
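
The split described above can be sketched as a small classification function. This is an illustrative Python model, not the controller's code: `MissingValueError`, `classify`, and `should_page` are hypothetical names, and mapping a missing value to `requeue` (rather than `requeue_after`) is an assumption based on the wording above.

```python
from enum import Enum
from typing import Callable


class Result(str, Enum):
    """Values of the result label on controller_runtime_reconcile_total."""
    SUCCESS = "success"
    ERROR = "error"
    REQUEUE = "requeue"
    REQUEUE_AFTER = "requeue_after"


class MissingValueError(Exception):
    """Hypothetical: a referenced secret/configmap object or key is missing."""


def classify(render: Callable[[], None]) -> Result:
    """Map the outcome of one render attempt to a reconcile result label."""
    try:
        render()
    except MissingValueError:
        # User misconfiguration: log at info level and retry later.
        return Result.REQUEUE
    except Exception:
        # Anything else (e.g. an RBAC failure) is treated as a real error.
        return Result.ERROR
    return Result.SUCCESS


def should_page(result: Result) -> bool:
    """Only real errors are meant to wake a (human) operator."""
    return result is Result.ERROR
```

Under this model, an alert keyed on `result="error"` never fires for a user's missing key, which is why the built-in metric alone cannot surface render failures to the users who own the CMS objects.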

I went ahead and added an explicit configmapsecret_controller_missing_value_render_errors_total metric, which includes a label for the namespace of the CMS. Since the reconcile retries will eventually back off to ~15m, a gauge, as you suggested, would probably be nicer. Another possible solution would be to add kube-state-metrics-like support for all CMS instances with their current status.
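
To illustrate why the backoff makes a counter awkward for alerting, here is a rough Python sketch of per-item exponential backoff. The 5ms base delay and 1000s cap are assumptions (I believe they match the default workqueue rate limiter, and 1000s is roughly the ~15m mentioned above), and the function names are hypothetical.

```python
from typing import List


def backoff_delays(failures: int, base: float = 0.005, cap: float = 1000.0) -> List[float]:
    """Delay in seconds before each retry: base * 2**n, capped at `cap`.

    base and cap are assumed defaults, not values read from this controller.
    """
    return [min(base * (2 ** n), cap) for n in range(failures)]


def steady_state_increments_per_hour(cap: float = 1000.0) -> float:
    """Once the delay is capped, the counter grows by one per `cap` seconds."""
    return 3600.0 / cap


if __name__ == "__main__":
    delays = backoff_delays(20)
    # Index of the first retry whose delay has reached the cap.
    first_capped = next(i for i, d in enumerate(delays) if d >= 1000.0)
    print("first capped retry index:", first_capped)
    print("increments per hour at steady state:", steady_state_increments_per_hour())
```

With these assumed values, a persistently broken CMS increments the counter only a few times per hour once the delay is capped, so a rate-based alert over a short window can go quiet even though the object is still failing. A gauge that stays at 1 until the render succeeds does not have that problem.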
