Changes on Kubernetes monitors (#62)

* Add Kubernetes monitors
* typoe
* replace default apiserver by heartbeat
* add explanation on apiserver
* Add nginx is down monitor
* add vars on titles
* fix metric
claranet · Feb 3, 2025 · e92d5f5
1 parent 9f38952
commit e92d5f5
Showing 20 changed files with 784 additions and 93 deletions.
diff --git a/caas/kubernetes/cluster/README.md b/caas/kubernetes/cluster/README.md
@@ -17,7 +17,8 @@ module "datadog-monitors-caas-kubernetes-cluster" {
 
 Creates DataDog monitors with the following checks:
 
-- Kubernetes API server does not respond
+- Kubernetes API server does not respond on {{kube_cluster_name}} (disabled by default)
+- Kubernetes cluster heartbeat alert on {{kube_cluster_name}}
 
 <!-- BEGIN_TF_DOCS -->
 ## Requirements
@@ -44,12 +45,13 @@ Creates DataDog monitors with the following checks:
 | Name | Type |
 |------|------|
 | [datadog_monitor.apiserver](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
+| [datadog_monitor.heartbeat](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
 
 ## Inputs
 
 | Name | Description | Type | Default | Required |
 |------|-------------|------|---------|:--------:|
-| <a name="input_apiserver_enabled"></a> [apiserver\_enabled](#input\_apiserver\_enabled) | Flag to enable API server monitor | `string` | `"true"` | no |
+| <a name="input_apiserver_enabled"></a> [apiserver\_enabled](#input\_apiserver\_enabled) | Flag to enable API server monitor (does not work on some clusters, see https://docs.datadoghq.com/containers/kubernetes/control_plane/?tab=datadogoperator#ManagedServices) | `string` | `"false"` | no |
 | <a name="input_apiserver_extra_tags"></a> [apiserver\_extra\_tags](#input\_apiserver\_extra\_tags) | Extra tags for API server monitor | `list(string)` | `[]` | no |
 | <a name="input_apiserver_message"></a> [apiserver\_message](#input\_apiserver\_message) | Custom message for API server monitor | `string` | `""` | no |
 | <a name="input_apiserver_no_data_timeframe"></a> [apiserver\_no\_data\_timeframe](#input\_apiserver\_no\_data\_timeframe) | Number of minutes before reporting no data | `string` | `10` | no |
@@ -60,6 +62,12 @@ Creates DataDog monitors with the following checks:
 | <a name="input_filter_tags_custom_excluded"></a> [filter\_tags\_custom\_excluded](#input\_filter\_tags\_custom\_excluded) | Tags excluded for custom filtering when filter\_tags\_use\_defaults is false | `string` | `""` | no |
 | <a name="input_filter_tags_separator"></a> [filter\_tags\_separator](#input\_filter\_tags\_separator) | Set the filter tags separator (, or AND) | `string` | `","` | no |
 | <a name="input_filter_tags_use_defaults"></a> [filter\_tags\_use\_defaults](#input\_filter\_tags\_use\_defaults) | Use default filter tags convention | `string` | `"true"` | no |
+| <a name="input_heartbeat_enabled"></a> [heartbeat\_enabled](#input\_heartbeat\_enabled) | Flag to enable heartbeat monitor | `string` | `"true"` | no |
+| <a name="input_heartbeat_extra_tags"></a> [heartbeat\_extra\_tags](#input\_heartbeat\_extra\_tags) | Extra tags for heartbeat monitor | `list(string)` | `[]` | no |
+| <a name="input_heartbeat_message"></a> [heartbeat\_message](#input\_heartbeat\_message) | Custom message for heartbeat monitor | `string` | `""` | no |
+| <a name="input_heartbeat_no_data_timeframe"></a> [heartbeat\_no\_data\_timeframe](#input\_heartbeat\_no\_data\_timeframe) | Number of minutes before reporting no data | `string` | `20` | no |
+| <a name="input_heartbeat_time_aggregator"></a> [heartbeat\_time\_aggregator](#input\_heartbeat\_time\_aggregator) | Time aggregator for heartbeat monitor | `string` | `"min"` | no |
+| <a name="input_heartbeat_timeframe"></a> [heartbeat\_timeframe](#input\_heartbeat\_timeframe) | Timeframe for heartbeat monitor | `string` | `"last_30m"` | no |
 | <a name="input_message"></a> [message](#input\_message) | Message sent when a monitor is triggered | `any` | n/a | yes |
 | <a name="input_new_group_delay"></a> [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before monitor new resource | `number` | `300` | no |
 | <a name="input_new_host_delay"></a> [new\_host\_delay](#input\_new\_host\_delay) | Delay in seconds before monitor new resource | `number` | `300` | no |
@@ -74,6 +82,7 @@ Creates DataDog monitors with the following checks:
 | Name | Description |
 |------|-------------|
 | <a name="output_apiserver_id"></a> [apiserver\_id](#output\_apiserver\_id) | id for monitor apiserver |
+| <a name="output_heartbeat_id"></a> [heartbeat\_id](#output\_heartbeat\_id) | id for monitor heartbeat |
 <!-- END_TF_DOCS -->
 ## Related documentation
 

diff --git a/caas/kubernetes/cluster/inputs.tf b/caas/kubernetes/cluster/inputs.tf
@@ -66,11 +66,11 @@ variable "apiserver_no_data_timeframe" {
 }
 
 # Datadog monitors variables
-
+## API server monitor variables
 variable "apiserver_enabled" {
-  description = "Flag to enable API server monitor"
+  description = "Flag to enable API server monitor (does not work on some clusters, see https://docs.datadoghq.com/containers/kubernetes/control_plane/?tab=datadogoperator#ManagedServices)"
   type        = string
-  default     = "true"
+  default     = "false"
 }
 
 variable "apiserver_extra_tags" {
@@ -91,3 +91,39 @@ variable "apiserver_threshold_warning" {
   default     = 3
 }
 
+## Heartbeat monitor variables
+variable "heartbeat_enabled" {
+  description = "Flag to enable heartbeat monitor"
+  type        = string
+  default     = "true"
+}
+
+variable "heartbeat_message" {
+  description = "Custom message for heartbeat monitor"
+  type        = string
+  default     = ""
+}
+
+variable "heartbeat_no_data_timeframe" {
+  description = "Number of minutes before reporting no data"
+  type        = string
+  default     = 20
+}
+
+variable "heartbeat_time_aggregator" {
+  description = "Time aggregator for heartbeat monitor"
+  type        = string
+  default     = "min"
+}
+
+variable "heartbeat_timeframe" {
+  description = "Timeframe for heartbeat monitor"
+  type        = string
+  default     = "last_30m"
+}
+
+variable "heartbeat_extra_tags" {
+  description = "Extra tags for heartbeat monitor"
+  type        = list(string)
+  default     = []
+}
diff --git a/caas/kubernetes/cluster/monitors-k8s-cluster.tf b/caas/kubernetes/cluster/monitors-k8s-cluster.tf
@@ -1,12 +1,12 @@
 resource "datadog_monitor" "apiserver" {
   count   = var.apiserver_enabled == "true" ? 1 : 0
-  name    = "${var.prefix_slug == "" ? "" : "[${var.prefix_slug}]"}[${var.environment}] Kubernetes API server does not respond"
+  name    = "${var.prefix_slug == "" ? "" : "[${var.prefix_slug}]"}[${var.environment}] Kubernetes API server does not respond on {{kube_cluster_name}}"
   message = coalesce(var.apiserver_message, var.message)
 
   type = "service check"
 
   query = <<EOQ
-    "kube_apiserver_controlplane.up"${module.filter-tags.service_check}.last(6).count_by_status()
+    "kube_apiserver_controlplane.up"${module.filter-tags.service_check}.by("kube_cluster_name").last(6).count_by_status()
 EOQ
 
   monitor_thresholds {
@@ -16,7 +16,7 @@ EOQ
 
   new_host_delay      = var.new_host_delay
   new_group_delay     = var.new_group_delay
-  notify_no_data      = var.notify_no_data
+  notify_no_data      = false
   no_data_timeframe   = var.apiserver_no_data_timeframe
   renotify_interval   = 0
   notify_audit        = false
@@ -26,3 +26,30 @@ EOQ
 
   tags = concat(local.common_tags, var.tags, var.apiserver_extra_tags)
 }
+
+resource "datadog_monitor" "heartbeat" {
+  count   = var.heartbeat_enabled == "true" ? 1 : 0
+  name    = "${var.prefix_slug == "" ? "" : "[${var.prefix_slug}]"}[${var.environment}] Kubernetes cluster heartbeat alert on {{kube_cluster_name}}"
+  message = coalesce(var.heartbeat_message, var.message)
+  type    = "metric alert"
+
+  query = <<EOQ
+    ${var.heartbeat_time_aggregator}(${var.heartbeat_timeframe}):
+    sum:kubernetes.pods.running${module.filter-tags.query_alert} by {kube_cluster_name} > 1000000
+EOQ
+
+  monitor_thresholds {
+    critical = 1000000 # high threshold to handle no data only
+  }
+
+  new_group_delay     = var.new_group_delay
+  notify_no_data      = true
+  no_data_timeframe   = var.heartbeat_no_data_timeframe
+  renotify_interval   = 0
+  notify_audit        = false
+  timeout_h           = var.timeout_h
+  include_tags        = true
+  require_full_window = true
+
+  tags = concat(local.common_tags, var.tags, var.heartbeat_extra_tags)
+}
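
Reviewer note: with this change `apiserver_enabled` defaults to `"false"` and the heartbeat monitor is enabled by default, so callers who still want the API server check have to opt in explicitly. A minimal sketch of a caller follows. The `source` path, the `environment` value, the notification handle, and the extra tag are placeholders and not taken from this commit; other inputs not visible in this diff may also be required.

```hcl
# Hypothetical caller configuration for the cluster monitors module.
module "datadog-monitors-caas-kubernetes-cluster" {
  source = "../caas/kubernetes/cluster" # assumed local path, adjust to your setup

  environment = "prod"                  # placeholder value
  message     = "@slack-example-alerts" # placeholder notification handle

  # Disabled by default after this commit; opt in only where the
  # control plane check is actually reported.
  apiserver_enabled = "true"

  # Heartbeat monitor inputs added by this commit (defaults shown in inputs.tf).
  heartbeat_enabled           = "true"
  heartbeat_time_aggregator   = "min"
  heartbeat_timeframe         = "last_30m"
  heartbeat_no_data_timeframe = 20
  heartbeat_extra_tags        = ["team:platform"] # placeholder tag
}
```

Judging by the variable description and the linked Datadog page, leaving `apiserver_enabled` at its default and relying on the heartbeat monitor is probably the intended setup for managed clusters.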
diff --git a/caas/kubernetes/cluster/outputs.tf b/caas/kubernetes/cluster/outputs.tf
@@ -3,3 +3,8 @@ output "apiserver_id" {
   value       = datadog_monitor.apiserver.*.id
 }
 
+output "heartbeat_id" {
+  description = "id for monitor heartbeat"
+  value       = datadog_monitor.heartbeat.*.id
+}
+
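
Note on consuming the new output: because the resource uses `count`, `heartbeat_id` is a list holding zero or one ID. A small sketch of how a caller might unwrap it, assuming Terraform 0.15 or later (for `one()`) and the module label shown in the README. The output name `cluster_heartbeat_monitor_id` is made up for illustration.

```hcl
# Hypothetical root-module output; yields null when heartbeat_enabled is not "true".
output "cluster_heartbeat_monitor_id" {
  description = "ID of the cluster heartbeat monitor, or null if disabled"
  value       = one(module.datadog-monitors-caas-kubernetes-cluster.heartbeat_id)
}
```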
diff --git a/caas/kubernetes/ingress/vts/README.md b/caas/kubernetes/ingress/vts/README.md
@@ -19,6 +19,7 @@ Creates DataDog monitors with the following checks:
 
 - Nginx Ingress 4xx errors
 - Nginx Ingress 5xx errors
+- Nginx Ingress {{kube_replica_set}} is down on {{kube_cluster_name}}
 
 <!-- BEGIN_TF_DOCS -->
 ## Requirements
@@ -46,6 +47,7 @@ Creates DataDog monitors with the following checks:
 
 | Name | Type |
 |------|------|
+| [datadog_monitor.nginx_ingress_is_down](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
 | [datadog_monitor.nginx_ingress_too_many_4xx](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
 | [datadog_monitor.nginx_ingress_too_many_5xx](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
 
@@ -74,6 +76,13 @@ Creates DataDog monitors with the following checks:
 | <a name="input_ingress_5xx_threshold_warning"></a> [ingress\_5xx\_threshold\_warning](#input\_ingress\_5xx\_threshold\_warning) | 5xx warning threshold in percentage | `string` | `"10"` | no |
 | <a name="input_ingress_5xx_time_aggregator"></a> [ingress\_5xx\_time\_aggregator](#input\_ingress\_5xx\_time\_aggregator) | Monitor aggregator for Ingress 5xx errors [available values: min, max or avg] | `string` | `"min"` | no |
 | <a name="input_ingress_5xx_timeframe"></a> [ingress\_5xx\_timeframe](#input\_ingress\_5xx\_timeframe) | Monitor timeframe for Ingress 5xx errors [available values: `last_#m` (1, 5, 10, 15, or 30), `last_#h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
+| <a name="input_ingress_down_enabled"></a> [ingress\_down\_enabled](#input\_ingress\_down\_enabled) | Flag to enable Nginx Ingress is down monitor | `string` | `"true"` | no |
+| <a name="input_ingress_down_extra_tags"></a> [ingress\_down\_extra\_tags](#input\_ingress\_down\_extra\_tags) | Extra tags for Nginx Ingress is down monitor | `list(string)` | `[]` | no |
+| <a name="input_ingress_down_message"></a> [ingress\_down\_message](#input\_ingress\_down\_message) | Message sent when an alert is triggered | `string` | `""` | no |
+| <a name="input_ingress_down_threshold_critical"></a> [ingress\_down\_threshold\_critical](#input\_ingress\_down\_threshold\_critical) | Nginx Ingress is down critical threshold in percentage | `number` | `0.3` | no |
+| <a name="input_ingress_down_threshold_warning"></a> [ingress\_down\_threshold\_warning](#input\_ingress\_down\_threshold\_warning) | Nginx Ingress is down warning threshold in percentage | `number` | `0.7` | no |
+| <a name="input_ingress_down_time_aggregator"></a> [ingress\_down\_time\_aggregator](#input\_ingress\_down\_time\_aggregator) | Monitor aggregator for Nginx Ingress is down [available values: min, max or avg] | `string` | `"avg"` | no |
+| <a name="input_ingress_down_timeframe"></a> [ingress\_down\_timeframe](#input\_ingress\_down\_timeframe) | Monitor timeframe for Nginx Ingress is down [available values: `last_#m` (1, 5, 10, 15, or 30), `last_#h` (1, 2, or 4), or `last_1d`] | `string` | `"last_10m"` | no |
 | <a name="input_message"></a> [message](#input\_message) | Message sent when an alert is triggered | `any` | n/a | yes |
 | <a name="input_new_group_delay"></a> [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before monitor new resource | `number` | `300` | no |
 | <a name="input_new_host_delay"></a> [new\_host\_delay](#input\_new\_host\_delay) | Delay in seconds before monitor new resource | `number` | `300` | no |
@@ -87,6 +96,7 @@ Creates DataDog monitors with the following checks:
 
 | Name | Description |
 |------|-------------|
+| <a name="output_nginx_ingress_is_down_id"></a> [nginx\_ingress\_is\_down\_id](#output\_nginx\_ingress\_is\_down\_id) | id for monitor nginx\_ingress\_is\_down |
 | <a name="output_nginx_ingress_too_many_4xx_id"></a> [nginx\_ingress\_too\_many\_4xx\_id](#output\_nginx\_ingress\_too\_many\_4xx\_id) | id for monitor nginx\_ingress\_too\_many\_4xx |
 | <a name="output_nginx_ingress_too_many_5xx_id"></a> [nginx\_ingress\_too\_many\_5xx\_id](#output\_nginx\_ingress\_too\_many\_5xx\_id) | id for monitor nginx\_ingress\_too\_many\_5xx |
 <!-- END_TF_DOCS -->

diff --git a/caas/kubernetes/ingress/vts/inputs.tf b/caas/kubernetes/ingress/vts/inputs.tf
@@ -59,8 +59,8 @@ variable "filter_tags_separator" {
   default     = ","
 }
 
-#Ingress
-
+# Nginx Ingress
+## Nginx Ingress 5xx errors monitor
 variable "ingress_5xx_enabled" {
   description = "Flag to enable Ingress 5xx errors monitor"
   type        = string
@@ -102,6 +102,7 @@ variable "ingress_5xx_threshold_warning" {
   description = "5xx warning threshold in percentage"
 }
 
+## Nginx Ingress 4xx errors monitor
 variable "ingress_4xx_enabled" {
   description = "Flag to enable Ingress 4xx errors monitor"
   type        = string
@@ -148,3 +149,44 @@ variable "artificial_requests_count" {
   description = "Number of false requests used to mitigate false positives in case of low traffic"
 }
 
+## Nginx Ingress is down monitor
+variable "ingress_down_enabled" {
+  type        = string
+  default     = "true"
+  description = "Flag to enable Nginx Ingress is down monitor"
+}
+
+variable "ingress_down_message" {
+  default     = ""
+  description = "Message sent when an alert is triggered"
+}
+
+variable "ingress_down_time_aggregator" {
+  type        = string
+  default     = "avg"
+  description = "Monitor aggregator for Nginx Ingress is down [available values: min, max or avg]"
+}
+
+variable "ingress_down_timeframe" {
+  type        = string
+  default     = "last_10m"
+  description = "Monitor timeframe for Nginx Ingress is down [available values: `last_#m` (1, 5, 10, 15, or 30), `last_#h` (1, 2, or 4), or `last_1d`]"
+}
+
+variable "ingress_down_threshold_critical" {
+  type        = number
+  default     = 0.3
+  description = "Nginx Ingress is down critical threshold in percentage"
+}
+
+variable "ingress_down_threshold_warning" {
+  type        = number
+  default     = 0.7
+  description = "Nginx Ingress is down warning threshold in percentage"
+}
+
+variable "ingress_down_extra_tags" {
+  type        = list(string)
+  default     = []
+  description = "Extra tags for Nginx Ingress is down monitor"
+}
diff --git a/caas/kubernetes/ingress/vts/monitors-ingress.tf b/caas/kubernetes/ingress/vts/monitors-ingress.tf
@@ -60,3 +60,31 @@ EOQ
   tags = concat(local.common_tags, var.tags, var.ingress_4xx_extra_tags)
 }
 
+resource "datadog_monitor" "nginx_ingress_is_down" {
+  count   = var.ingress_down_enabled == "true" ? 1 : 0
+  name    = "${var.prefix_slug == "" ? "" : "[${var.prefix_slug}]"}[${var.environment}] Nginx Ingress {{kube_replica_set}} is down on {{kube_cluster_name}}"
+  message = coalesce(var.ingress_down_message, var.message)
+  type    = "query alert"
+
+  query = <<EOQ
+    ${var.ingress_down_time_aggregator}(${var.ingress_down_timeframe}):
+      avg:nginx_ingress.nginx_up${module.filter-tags.query_alert} by {kube_replica_set,kube_cluster_name}
+      <= ${var.ingress_down_threshold_critical}
+EOQ
+
+  monitor_thresholds {
+    warning  = var.ingress_down_threshold_warning
+    critical = var.ingress_down_threshold_critical
+  }
+
+  evaluation_delay    = var.evaluation_delay
+  new_group_delay     = var.new_group_delay
+  notify_no_data      = true
+  renotify_interval   = 0
+  notify_audit        = false
+  timeout_h           = var.timeout_h
+  include_tags        = true
+  require_full_window = true
+
+  tags = concat(local.common_tags, var.tags, var.ingress_down_extra_tags)
+}
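
Reviewer note: the query alerts when the aggregated `nginx_ingress.nginx_up` value is `<=` the critical threshold, so the warning threshold has to stay above the critical one (defaults 0.7 and 0.3). A sketch of a caller overriding the new inputs follows. The module label, `source` path, `environment` value, and notification handle are assumptions for illustration and do not appear in this commit.

```hcl
# Hypothetical caller configuration for the Nginx Ingress monitors module.
module "datadog-monitors-caas-kubernetes-ingress-vts" { # assumed label
  source = "../caas/kubernetes/ingress/vts" # assumed local path

  environment = "prod"                  # placeholder value
  message     = "@slack-example-alerts" # placeholder notification handle

  # "Nginx Ingress is down" monitor inputs added by this commit.
  ingress_down_enabled            = "true"
  ingress_down_time_aggregator    = "avg"
  ingress_down_timeframe          = "last_5m"
  ingress_down_threshold_warning  = 0.9 # must stay above the critical value
  ingress_down_threshold_critical = 0.5 # alert when the aggregate is <= 0.5
}
```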
diff --git a/caas/kubernetes/ingress/vts/outputs.tf b/caas/kubernetes/ingress/vts/outputs.tf
@@ -1,3 +1,8 @@
+output "nginx_ingress_is_down_id" {
+  description = "id for monitor nginx_ingress_is_down"
+  value       = datadog_monitor.nginx_ingress_is_down.*.id
+}
+
 output "nginx_ingress_too_many_4xx_id" {
   description = "id for monitor nginx_ingress_too_many_4xx"
   value       = datadog_monitor.nginx_ingress_too_many_4xx.*.id

diff --git a/caas/kubernetes/node/README.md b/caas/kubernetes/node/README.md
@@ -17,16 +17,15 @@ module "datadog-monitors-caas-kubernetes-node" {
 
 Creates DataDog monitors with the following checks:
 
-- Kubernetes Node Disk pressure
-- Kubernetes Node Frequent unregister net device
-- Kubernetes Node Kubelet API does not respond
-- Kubernetes Node Kubelet sync loop that updates containers does not work
-- Kubernetes Node Memory pressure
-- Kubernetes Node not ready
-- Kubernetes Node Out of disk
-- Kubernetes Node unschedulable
-- Kubernetes Node volume inodes usage
-- Kubernetes Node volume space usage
+- Kubernetes Node {{kube_node}} disk pressure on {{kube_cluster_name}}
+- Kubernetes Node {{kube_node}} frequent unregister net device
+- Kubernetes Node {{kube_node}} Kubelet API does not respond on {{kube_cluster_name}}
+- Kubernetes Node {{kube_node}} Kubelet sync loop that updates containers does not work on {{kube_cluster_name}}
+- Kubernetes Node {{kube_node}} memory pressure on {{kube_cluster_name}}
+- Kubernetes Node {{kube_node}} not ready on {{kube_cluster_name}}
+- Kubernetes Node {{kube_node}} unschedulable on {{kube_cluster_name}}
+- Kubernetes Node volume {{persistentvolumeclaim}} inodes usage
+- Kubernetes Node volume {{persistentvolumeclaim}} space usage
 
 <!-- BEGIN_TF_DOCS -->
 ## Requirements
@@ -53,7 +52,6 @@ Creates DataDog monitors with the following checks:
 
 | Name | Type |
 |------|------|
-| [datadog_monitor.disk_out](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
 | [datadog_monitor.disk_pressure](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
 | [datadog_monitor.kubelet_ping](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
 | [datadog_monitor.kubelet_syncloop](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
@@ -68,10 +66,6 @@ Creates DataDog monitors with the following checks:
 
 | Name | Description | Type | Default | Required |
 |------|-------------|------|---------|:--------:|
-| <a name="input_disk_out_enabled"></a> [disk\_out\_enabled](#input\_disk\_out\_enabled) | Flag to enable Out of disk monitor | `string` | `"true"` | no |
-| <a name="input_disk_out_extra_tags"></a> [disk\_out\_extra\_tags](#input\_disk\_out\_extra\_tags) | Extra tags for Out of disk monitor | `list(string)` | `[]` | no |
-| <a name="input_disk_out_message"></a> [disk\_out\_message](#input\_disk\_out\_message) | Custom message for Out of disk monitor | `string` | `""` | no |
-| <a name="input_disk_out_threshold_warning"></a> [disk\_out\_threshold\_warning](#input\_disk\_out\_threshold\_warning) | Out of disk monitor (warning threshold) | `string` | `3` | no |
 | <a name="input_disk_pressure_enabled"></a> [disk\_pressure\_enabled](#input\_disk\_pressure\_enabled) | Flag to enable Disk pressure monitor | `string` | `"true"` | no |
 | <a name="input_disk_pressure_extra_tags"></a> [disk\_pressure\_extra\_tags](#input\_disk\_pressure\_extra\_tags) | Extra tags for Disk pressure monitor | `list(string)` | `[]` | no |
 | <a name="input_disk_pressure_message"></a> [disk\_pressure\_message](#input\_disk\_pressure\_message) | Custom message for Disk pressure monitor | `string` | `""` | no |
@@ -137,7 +131,6 @@ Creates DataDog monitors with the following checks:
 
 | Name | Description |
 |------|-------------|
-| <a name="output_disk_out_id"></a> [disk\_out\_id](#output\_disk\_out\_id) | id for monitor disk\_out |
 | <a name="output_disk_pressure_id"></a> [disk\_pressure\_id](#output\_disk\_pressure\_id) | id for monitor disk\_pressure |
 | <a name="output_kubelet_ping_id"></a> [kubelet\_ping\_id](#output\_kubelet\_ping\_id) | id for monitor kubelet\_ping |
 | <a name="output_kubelet_syncloop_id"></a> [kubelet\_syncloop\_id](#output\_kubelet\_syncloop\_id) | id for monitor kubelet\_syncloop |