Prometheus for monitoring and alerting on a K8s cluster
We can use Prometheus as a complete monitoring and alerting system for our K8s cluster. You can find the related K8s scripts at scripts/kubernetes/.

The image above helps in understanding the different components of the alerting and monitoring system. Feel free to modify the configurations as you wish.
Alerts are defined as a set of rules or conditions; when a condition is satisfied, Prometheus fires the alert to Alertmanager. Rules are written in a YAML configuration file that Prometheus loads. A number of clauses are defined in the configuration file to describe the alert triggers, listed below (a sketch of a complete rules file follows this list):
- `groups` holds the different groups of alerts.
- `name` holds the name of a group.
- `rules` holds all the alerts belonging to the group.
- `alert` holds the name of the alert for each rule.
- `annotations` holds the alert description.
- `expr` holds a boolean expression; when it is satisfied, Prometheus fires the associated alert to Alertmanager.
- `for` holds the duration for which the expression must remain true before Prometheus fires the alert.
- `labels` holds the various labels attached to every alert, which Alertmanager uses for customized notifications.
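
A minimal sketch of how these clauses fit together in a rules file; the group name, alert name, `for` duration, and label/annotation values here are hypothetical placeholders, not taken from the repository's scripts:

```yaml
# Hypothetical rules file, loaded by Prometheus via the rule_files setting in prometheus.yml
groups:
  - name: example-group              # name of this group of alerts (placeholder)
    rules:
      - alert: InstanceDown          # name of the alert (placeholder)
        expr: up == 0                # boolean expression; true when a scrape target is down
        for: 5m                      # expression must stay true this long before the alert fires
        labels:
          severity: critical         # label consumed by Alertmanager for routing/notification
        annotations:
          description: "Instance {{ $labels.instance }} has been down for more than 5 minutes."
```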
The `expr` clause holds the expression that Prometheus evaluates to decide whether to fire an alert. We use metric names to fetch the corresponding metric values, apply operations on them, and obtain a boolean result.
For example, to fire an alert when a deployment has no running pods, we write an expression like the one below:
sum(kube_deployment_status_replicas{pod_template_hash=""}) by (deployment,namespace) < 1
Breaking down that expression: `kube_deployment_status_replicas` is a metric provided by the kube-state-metrics exporter that states the number of pods currently running in a deployment. A `kube_deployment_status_replicas` query result looks something like the following:

kube_deployment_status_replicas{deployment="XXXX",instance="XXXX",job="XXXX",namespace="XXXX",scrape_endpoint="XXXX"} 1
We use the aggregation operator `sum` to collapse the data into a single value, and `by (deployment, namespace)` to keep only the required labels; we then check whether the count is less than 1, and if so, the alert fires because there are no running pods under that deployment.
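
Putting the pieces together, here is a sketch of how this expression could appear as a complete alerting rule. The group name, alert name, `for` duration, and label/annotation values are illustrative assumptions, not the repository's actual configuration:

```yaml
groups:
  - name: deployment-alerts                 # hypothetical group name
    rules:
      - alert: DeploymentHasNoRunningPods   # hypothetical alert name
        expr: sum(kube_deployment_status_replicas{pod_template_hash=""}) by (deployment,namespace) < 1
        for: 2m                             # assumed wait duration; tune as needed
        labels:
          severity: critical                # hypothetical label for Alertmanager routing
        annotations:
          description: "Deployment {{ $labels.deployment }} in namespace {{ $labels.namespace }} has no running pods."
```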