diff --git a/content/en/flux/monitoring/metrics.md b/content/en/flux/monitoring/metrics.md
index 27403f018..e251a265d 100644
--- a/content/en/flux/monitoring/metrics.md
+++ b/content/en/flux/monitoring/metrics.md
@@ -5,65 +5,198 @@ description: "How to monitor Flux with Prometheus Operator and Grafana"
 weight: 1
 ---

-## Reconcile metrics
+Flux has native support for [Prometheus][prometheus] metrics to provide insights
+into the state of the Flux components. These can be used to set up monitoring
+for the Flux controllers. In addition, Flux custom resource metrics can be
+collected using tools like [kube-state-metrics][kube-state-metrics].
+This document provides information about Flux metrics that can be used to set up
+monitoring, with some examples.

-Ready status metrics:
+The [fluxcd/flux2-monitoring-example][monitoring-example-repo] repository
+provides a ready-made example setup to get started with monitoring Flux. It is
+recommended to set up the monitoring example before continuing with this
+document to follow along. Before getting into the monitoring setup, the
+following sections describe the kinds of metrics that can be collected for
+Flux.

-```sh
-gotk_reconcile_condition{kind, name, namespace, type="Ready", status="True"}
-gotk_reconcile_condition{kind, name, namespace, type="Ready", status="False"}
-gotk_reconcile_condition{kind, name, namespace, type="Ready", status="Unknown"}
-```
-
-Suspend status metrics:
+## Controller metrics

-```sh
-gotk_suspend_status{kind, name, namespace}
-```
+The default installation of the Flux controllers exports Prometheus metrics on
+port `8080` at the standard `/metrics` path. These metrics are about the inner
+workings of the controllers.
-Time spent reconciling:
+Flux resource reconciliation duration metrics:

-```sh
+```
 gotk_reconcile_duration_seconds_bucket{kind, name, namespace, le}
 gotk_reconcile_duration_seconds_sum{kind, name, namespace}
 gotk_reconcile_duration_seconds_count{kind, name, namespace}
 ```

-## Control plane metrics
+Cache event metrics:
+
+```
+gotk_cache_events_total{event_type, name, namespace}
+```

 Controller CPU and memory usage:

-```sh
+```
 process_cpu_seconds_total{namespace, pod}
 container_memory_working_set_bytes{namespace, pod}
 ```

 Kubernetes API usage:

-```shell
+```
 rest_client_requests_total{namespace, pod}
 ```

 Controller runtime:

-```shell
+```
 workqueue_longest_running_processor_seconds{name}
 controller_runtime_reconcile_total{controller, result}
 ```

-## Setup monitoring with kube-prom-stack
+In addition, many other Go runtime and [controller-runtime
+metrics][controller-runtime-metrics] are also exported.

-Flux uses [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack)
-to provide a monitoring stack made out of:
+## Resource metrics

-* **Prometheus Operator** - manages Prometheus clusters atop Kubernetes
-* **Prometheus** - collects metrics from the Flux controllers and Kubernetes API
-* **Grafana** dashboards - displays the Flux control plane resource usage and reconciliation stats
-* **kube-state-metrics** - generates metrics about the state of the Kubernetes objects
+Metrics for the Flux custom resources can be used to monitor the deployment of
+workloads. Since the use cases for these metrics vary, it is hard to decide in
+advance which fields of the resources would be useful to users. Hence, these
+metrics are not exported by the Flux controllers themselves but can be
+collected and exported by other tools that can read the custom resource state
+from the kube-apiserver. One such tool is [kube-state-metrics
+(KSM)][kube-state-metrics]. KSM is also deployed as part of
+[kube-prometheus-stack][kube-prometheus-stack] and is used to export the metrics
+of Kubernetes core resources. It can be configured to also collect custom
+resource metrics. The monitoring setup in
+[flux2-monitoring-example][monitoring-example-repo] uses KSM to collect and
+export Flux custom resource metrics.

-### Alert manager examples
+In the [example monitoring setup][monitoring-example-repo], the metric
+`gotk_resource_info` provides information about the current state of Flux
+resources.

-## Flux Grafana dashboards
-
-### Grafana annotations
+```
+gotk_resource_info{customresource_group, customresource_kind, customresource_version, exported_namespace, name, ready, suspended, ...}
+```
+- `customresource_group` is the API group of the resource, for example
+  `source.toolkit.fluxcd.io` for the Flux source API.
+- `customresource_kind` is the kind of the resource, for example a
+  `GitRepository` source.
+- `customresource_version` is the API version of the resource, for example `v1`.
+- `exported_namespace` is the namespace of the resource.
+- `name` is the name of the resource.
+- `ready` shows the readiness of the resource.
+- `suspended` shows if the resource's reconciliation is suspended.
+
+These common labels are present in the metrics for all kinds of resources. In
+addition, there are a few labels that are specific to a resource kind. See the
+following table for the labels associated with each resource kind.
+
+| Resource Kind | Labels |
+| --- | --- |
+| Kustomization | `revision`, `source_name` |
+| HelmRelease | `revision`, `chart_name`, `chart_source_name` |
+| GitRepository | `revision`, `url` |
+| Bucket | `revision`, `endpoint`, `bucket_name` |
+| HelmRepository | `revision`, `url` |
+| HelmChart | `revision`, `chart_name`, `chart_version` |
+| OCIRepository | `revision`, `url` |
+| Receiver | `webhook_path` |
+| ImageRepository | `image` |
+| ImagePolicy | `source_name` |
+| ImageUpdateAutomation | `source_name` |
+
+{{< note >}}
+The above metric may have extra labels after being collected in Prometheus. This
+may be due to the default Prometheus scrape configuration used by
+kube-prometheus-stack. Since they are about the kube-state-metrics service and
+not about Flux itself, they can be ignored.
+{{< /note >}}
+
+`gotk_resource_info` is an example of a metric used to collect information about
+the Flux resources. This metric can be customized to add more labels, and more
+such metrics can be created by changing the kube-state-metrics custom
+resource state configuration. Please see [Flux custom Prometheus
+metrics][custom-metrics] for details.
+
+### ⚠️ Deprecated resource metrics
+
+Prior to Flux v2.1.0, the individual Flux controllers exported metrics for the
+resources that they managed. These metrics have been deprecated in favor of
+custom metrics collected using kube-state-metrics.
+
+Users of the deprecated metrics `gotk_reconcile_condition` and
+`gotk_suspend_status` can find the same information in the new
+`gotk_resource_info` metric exported using kube-state-metrics. If needed, an
+equivalent of `gotk_reconcile_condition` and `gotk_suspend_status` can be
+created as a custom metric using the kube-state-metrics custom resource state
+configuration. Please see [Flux custom Prometheus
+metrics][custom-metrics] for details.
+
+## Monitoring setup
+
+In the [monitoring example repository][monitoring-example-repo], the monitoring
+configurations can be found in the
+[`monitoring/`](https://github.com/fluxcd/flux2-monitoring-example/tree/main/monitoring)
+directory. The `monitoring/controllers/` directory contains the configurations
+for deploying kube-prometheus-stack and loki-stack. We'll discuss
+kube-prometheus-stack below. For Flux log collection using Loki, refer to the
+[Flux logs](/flux/monitoring/logs/) docs.
+
+The configuration in the `monitoring/controllers/kube-prometheus-stack/`
+directory creates a HelmRepository of type OCI for the [prometheus-community
+helm charts](https://github.com/prometheus-community/helm-charts) and a
+HelmRelease to deploy the `kube-prometheus-stack` chart in the `monitoring`
+namespace. This installs all the monitoring components in the `monitoring`
+namespace. Please see the
+[values](https://github.com/fluxcd/flux2-monitoring-example/blob/main/monitoring/controllers/kube-prometheus-stack/release.yaml)
+used for the chart deployment and modify them accordingly.
+
+The chart values used for configuring kube-state-metrics are in the file
+[`kube-state-metrics-config.yaml`](https://github.com/fluxcd/flux2-monitoring-example/blob/main/monitoring/controllers/kube-prometheus-stack/kube-state-metrics-config.yaml).
+As seen in the
+[kustomization.yaml](https://github.com/fluxcd/flux2-monitoring-example/blob/main/monitoring/controllers/kube-prometheus-stack/kustomization.yaml),
+a kustomize ConfigMap generator puts this configuration in a ConfigMap, and the
+HelmRelease reads the chart values from that ConfigMap.
+These values are merged with the inline chart values in the HelmRelease.
+The kube-state-metrics values are kept in a separate file to make it easier to
+customize the metrics it collects; refer to the [Flux custom Prometheus
+metrics][custom-metrics] docs to see how they are used. Once
+deployed with these values, kube-state-metrics starts collecting and
+exporting the Flux resource metrics.
+
+To configure Prometheus to scrape Flux controller metrics, a
+[PodMonitor](https://github.com/fluxcd/flux2-monitoring-example/blob/main/monitoring/configs/podmonitor.yaml)
+is used that selects all the Flux controller Pods and sets the metrics endpoint
+to the `http-prom` port. Once created, the prometheus-operator will
+automatically configure Prometheus to scrape the Flux controller metrics.
+
+### Flux Grafana dashboards
+
+The [example monitoring setup][monitoring-example-repo] provides two example
+Grafana dashboards in
+[`monitoring/configs/dashboards`](https://github.com/fluxcd/flux2-monitoring-example/tree/main/monitoring/configs/dashboards)
+that use the Flux controller and resource metrics. The Flux Cluster Stats
+dashboard shows the overall state of the Flux Sources and Cluster Reconcilers.
+The Flux Control Plane dashboard shows the statistics of the various components
+that constitute the Flux Control Plane and their operational metrics.
+
+[Insert screenshots of the grafana dashboards]
+
+More custom metrics can be created and used in the dashboards for monitoring
+Flux.
+
+
+[kube-state-metrics]: https://github.com/kubernetes/kube-state-metrics
+[prometheus]: https://prometheus.io/
+[monitoring-example-repo]: https://github.com/fluxcd/flux2-monitoring-example
+[kube-prometheus-stack]: https://github.com/prometheus-operator/kube-prometheus
+[controller-runtime-metrics]: https://book.kubebuilder.io/reference/metrics-reference
+[custom-metrics]: /flux/monitoring/custom-metrics/