Changes on Kubernetes monitors (#62)
* Add Kubernetes monitors

* typoe

* replace default apiserver by heartbeat

* add explanation on apiserver

* Add nginx is down monitor

* add vars on titles

* fix metric
Aohzan authored Feb 3, 2025
1 parent 9f38952 commit e92d5f5
Showing 20 changed files with 784 additions and 93 deletions.
13 changes: 11 additions & 2 deletions caas/kubernetes/cluster/README.md
@@ -17,7 +17,8 @@ module "datadog-monitors-caas-kubernetes-cluster" {

Creates DataDog monitors with the following checks:

- Kubernetes API server does not respond
- Kubernetes API server does not respond on {{kube_cluster_name}} (disabled by default)
- Kubernetes cluster heartbeat alert on {{kube_cluster_name}}

<!-- BEGIN_TF_DOCS -->
## Requirements
@@ -44,12 +45,13 @@ Creates DataDog monitors with the following checks:
| Name | Type |
|------|------|
| [datadog_monitor.apiserver](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
| [datadog_monitor.heartbeat](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_apiserver_enabled"></a> [apiserver\_enabled](#input\_apiserver\_enabled) | Flag to enable API server monitor | `string` | `"true"` | no |
| <a name="input_apiserver_enabled"></a> [apiserver\_enabled](#input\_apiserver\_enabled) | Flag to enable API server monitor (do not work on some clusters, see https://docs.datadoghq.com/containers/kubernetes/control_plane/?tab=datadogoperator#ManagedServices) | `string` | `"false"` | no |
| <a name="input_apiserver_extra_tags"></a> [apiserver\_extra\_tags](#input\_apiserver\_extra\_tags) | Extra tags for API server monitor | `list(string)` | `[]` | no |
| <a name="input_apiserver_message"></a> [apiserver\_message](#input\_apiserver\_message) | Custom message for API server monitor | `string` | `""` | no |
| <a name="input_apiserver_no_data_timeframe"></a> [apiserver\_no\_data\_timeframe](#input\_apiserver\_no\_data\_timeframe) | Number of minutes before reporting no data | `string` | `10` | no |
@@ -60,6 +62,12 @@ Creates DataDog monitors with the following checks:
| <a name="input_filter_tags_custom_excluded"></a> [filter\_tags\_custom\_excluded](#input\_filter\_tags\_custom\_excluded) | Tags excluded for custom filtering when filter\_tags\_use\_defaults is false | `string` | `""` | no |
| <a name="input_filter_tags_separator"></a> [filter\_tags\_separator](#input\_filter\_tags\_separator) | Set the filter tags separator (, or AND) | `string` | `","` | no |
| <a name="input_filter_tags_use_defaults"></a> [filter\_tags\_use\_defaults](#input\_filter\_tags\_use\_defaults) | Use default filter tags convention | `string` | `"true"` | no |
| <a name="input_heartbeat_enabled"></a> [heartbeat\_enabled](#input\_heartbeat\_enabled) | Flag to enable heartbeat monitor | `string` | `"true"` | no |
| <a name="input_heartbeat_extra_tags"></a> [heartbeat\_extra\_tags](#input\_heartbeat\_extra\_tags) | Extra tags for heartbeat monitor | `list(string)` | `[]` | no |
| <a name="input_heartbeat_message"></a> [heartbeat\_message](#input\_heartbeat\_message) | Custom message for heartbeat monitor | `string` | `""` | no |
| <a name="input_heartbeat_no_data_timeframe"></a> [heartbeat\_no\_data\_timeframe](#input\_heartbeat\_no\_data\_timeframe) | Number of minutes before reporting no data | `string` | `20` | no |
| <a name="input_heartbeat_time_aggregator"></a> [heartbeat\_time\_aggregator](#input\_heartbeat\_time\_aggregator) | Time aggregator for heartbeat monitor | `string` | `"min"` | no |
| <a name="input_heartbeat_timeframe"></a> [heartbeat\_timeframe](#input\_heartbeat\_timeframe) | Timeframe for heartbeat monitor | `string` | `"last_30m"` | no |
| <a name="input_message"></a> [message](#input\_message) | Message sent when a monitor is triggered | `any` | n/a | yes |
| <a name="input_new_group_delay"></a> [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before monitor new resource | `number` | `300` | no |
| <a name="input_new_host_delay"></a> [new\_host\_delay](#input\_new\_host\_delay) | Delay in seconds before monitor new resource | `number` | `300` | no |
@@ -74,6 +82,7 @@ Creates DataDog monitors with the following checks:
| Name | Description |
|------|-------------|
| <a name="output_apiserver_id"></a> [apiserver\_id](#output\_apiserver\_id) | id for monitor apiserver |
| <a name="output_heartbeat_id"></a> [heartbeat\_id](#output\_heartbeat\_id) | id for monitor heartbeat |
<!-- END_TF_DOCS -->
## Related documentation

42 changes: 39 additions & 3 deletions caas/kubernetes/cluster/inputs.tf
@@ -66,11 +66,11 @@ variable "apiserver_no_data_timeframe" {
}

# Datadog monitors variables

## API server monitor variables
variable "apiserver_enabled" {
description = "Flag to enable API server monitor"
description = "Flag to enable API server monitor (do not work on some clusters, see https://docs.datadoghq.com/containers/kubernetes/control_plane/?tab=datadogoperator#ManagedServices)"
type = string
default = "true"
default = "false"
}

variable "apiserver_extra_tags" {
@@ -91,3 +91,39 @@ variable "apiserver_threshold_warning" {
default = 3
}

## Heartbeat monitor variables
variable "heartbeat_enabled" {
description = "Flag to enable heartbeat monitor"
type = string
default = "true"
}

variable "heartbeat_message" {
description = "Custom message for heartbeat monitor"
type = string
default = ""
}

variable "heartbeat_no_data_timeframe" {
description = "Number of minutes before reporting no data"
type = string
default = 20
}

variable "heartbeat_time_aggregator" {
description = "Time aggregator for heartbeat monitor"
type = string
default = "min"
}

variable "heartbeat_timeframe" {
description = "Timeframe for heartbeat monitor"
type = string
default = "last_30m"
}

variable "heartbeat_extra_tags" {
description = "Extra tags for heartbeat monitor"
type = list(string)
default = []
}
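
As a usage sketch for the variables above, a caller running a self-managed cluster might re-enable the API server monitor (now disabled by default) and customise the heartbeat monitor. The `source` path and the values given for `environment`, `message`, and the tags are assumptions for illustration; only the variable names come from this diff, and any other required inputs are omitted.

```hcl
module "datadog-monitors-caas-kubernetes-cluster" {
  # Assumption: local path to the module; substitute the real source and version.
  source = "./caas/kubernetes/cluster"

  environment = "production"     # illustrative value
  message     = "@slack-k8s-ops" # illustrative notification handle

  # Off by default since this commit; enable only where the control plane
  # check is actually reported (see the Datadog control plane documentation).
  apiserver_enabled = "true"

  # Heartbeat monitor is on by default; these inputs only customise it.
  heartbeat_message    = "Cluster stopped reporting to Datadog @slack-k8s-ops"
  heartbeat_extra_tags = ["team:platform"]
}
```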
33 changes: 30 additions & 3 deletions caas/kubernetes/cluster/monitors-k8s-cluster.tf
@@ -1,12 +1,12 @@
resource "datadog_monitor" "apiserver" {
count = var.apiserver_enabled == "true" ? 1 : 0
name = "${var.prefix_slug == "" ? "" : "[${var.prefix_slug}]"}[${var.environment}] Kubernetes API server does not respond"
name = "${var.prefix_slug == "" ? "" : "[${var.prefix_slug}]"}[${var.environment}] Kubernetes API server does not respond on {{kube_cluster_name}}"
message = coalesce(var.apiserver_message, var.message)

type = "service check"

query = <<EOQ
"kube_apiserver_controlplane.up"${module.filter-tags.service_check}.last(6).count_by_status()
"kube_apiserver_controlplane.up"${module.filter-tags.service_check}.by("kube_cluster_name").last(6).count_by_status()
EOQ

monitor_thresholds {
@@ -16,7 +16,7 @@ EOQ

new_host_delay = var.new_host_delay
new_group_delay = var.new_group_delay
notify_no_data = var.notify_no_data
notify_no_data = false
no_data_timeframe = var.apiserver_no_data_timeframe
renotify_interval = 0
notify_audit = false
@@ -26,3 +26,30 @@ EOQ

tags = concat(local.common_tags, var.tags, var.apiserver_extra_tags)
}

resource "datadog_monitor" "heartbeat" {
count = var.heartbeat_enabled == "true" ? 1 : 0
name = "${var.prefix_slug == "" ? "" : "[${var.prefix_slug}]"}[${var.environment}] Kubernetes cluster heartbeat alert on {{kube_cluster_name}}"
message = coalesce(var.heartbeat_message, var.message)
type = "metric alert"

query = <<EOQ
${var.heartbeat_time_aggregator}(${var.heartbeat_timeframe}):
sum:kubernetes.pods.running${module.filter-tags.query_alert} by {kube_cluster_name} > 1000000
EOQ

monitor_thresholds {
critical = 1000000 # high threshold to handle no data only
}

new_group_delay = var.new_group_delay
notify_no_data = true
no_data_timeframe = var.heartbeat_no_data_timeframe
renotify_interval = 0
notify_audit = false
timeout_h = var.timeout_h
include_tags = true
require_full_window = true

tags = concat(local.common_tags, var.tags, var.heartbeat_extra_tags)
}
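
To make the heartbeat approach concrete, the sketch below shows roughly what the query heredoc renders to with the default inputs. It assumes `module.filter-tags.query_alert` renders to `{*}`, since that module's real output is not shown in this diff. The threshold is set high so that the value comparison is not expected to fire; the alert comes from `notify_no_data = true` once no data has been received for `heartbeat_no_data_timeframe` (20 by default) minutes.

```hcl
# Sketch only: the heartbeat query with default inputs substituted.
# Assumption: module.filter-tags.query_alert renders to "{*}".
locals {
  heartbeat_query_example = "min(last_30m): sum:kubernetes.pods.running{*} by {kube_cluster_name} > 1000000"
}
```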
5 changes: 5 additions & 0 deletions caas/kubernetes/cluster/outputs.tf
@@ -3,3 +3,8 @@ output "apiserver_id" {
value = datadog_monitor.apiserver.*.id
}

output "heartbeat_id" {
description = "id for monitor heartbeat"
value = datadog_monitor.heartbeat.*.id
}
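
Because the monitor is created with `count`, the splat expression above yields a list with zero or one element. A calling configuration could unwrap it as sketched below; the module instance name is an assumption, and `one()` requires Terraform 0.15 or later.

```hcl
# Assumption: the module is instantiated under this name in the caller.
output "cluster_heartbeat_monitor_id" {
  description = "ID of the heartbeat monitor, or null when the monitor is disabled"
  # one() returns the single element, or null for an empty list.
  value = one(module.datadog-monitors-caas-kubernetes-cluster.heartbeat_id)
}
```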

10 changes: 10 additions & 0 deletions caas/kubernetes/ingress/vts/README.md
@@ -19,6 +19,7 @@ Creates DataDog monitors with the following checks:

- Nginx Ingress 4xx errors
- Nginx Ingress 5xx errors
- Nginx Ingress {{kube_replica_set}} is down on {{kube_cluster_name}}

<!-- BEGIN_TF_DOCS -->
## Requirements
@@ -46,6 +47,7 @@ Creates DataDog monitors with the following checks:

| Name | Type |
|------|------|
| [datadog_monitor.nginx_ingress_is_down](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
| [datadog_monitor.nginx_ingress_too_many_4xx](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
| [datadog_monitor.nginx_ingress_too_many_5xx](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |

@@ -74,6 +76,13 @@ Creates DataDog monitors with the following checks:
| <a name="input_ingress_5xx_threshold_warning"></a> [ingress\_5xx\_threshold\_warning](#input\_ingress\_5xx\_threshold\_warning) | 5xx warning threshold in percentage | `string` | `"10"` | no |
| <a name="input_ingress_5xx_time_aggregator"></a> [ingress\_5xx\_time\_aggregator](#input\_ingress\_5xx\_time\_aggregator) | Monitor aggregator for Ingress 5xx errors [available values: min, max or avg] | `string` | `"min"` | no |
| <a name="input_ingress_5xx_timeframe"></a> [ingress\_5xx\_timeframe](#input\_ingress\_5xx\_timeframe) | Monitor timeframe for Ingress 5xx errors [available values: `last_#m` (1, 5, 10, 15, or 30), `last_#h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| <a name="input_ingress_down_enabled"></a> [ingress\_down\_enabled](#input\_ingress\_down\_enabled) | Flag to enable Nginx Ingress is down monitor | `string` | `"true"` | no |
| <a name="input_ingress_down_extra_tags"></a> [ingress\_down\_extra\_tags](#input\_ingress\_down\_extra\_tags) | Extra tags for Nginx Ingress is down monitor | `list(string)` | `[]` | no |
| <a name="input_ingress_down_message"></a> [ingress\_down\_message](#input\_ingress\_down\_message) | Message sent when an alert is triggered | `string` | `""` | no |
| <a name="input_ingress_down_threshold_critical"></a> [ingress\_down\_threshold\_critical](#input\_ingress\_down\_threshold\_critical) | Nginx Ingress is down critical threshold in percentage | `number` | `0.3` | no |
| <a name="input_ingress_down_threshold_warning"></a> [ingress\_down\_threshold\_warning](#input\_ingress\_down\_threshold\_warning) | Nginx Ingress is down warning threshold in percentage | `number` | `0.7` | no |
| <a name="input_ingress_down_time_aggregator"></a> [ingress\_down\_time\_aggregator](#input\_ingress\_down\_time\_aggregator) | Monitor aggregator for Nginx Ingress is down [available values: min, max or avg] | `string` | `"avg"` | no |
| <a name="input_ingress_down_timeframe"></a> [ingress\_down\_timeframe](#input\_ingress\_down\_timeframe) | Monitor timeframe for Nginx Ingress is down [available values: `last_#m` (1, 5, 10, 15, or 30), `last_#h` (1, 2, or 4), or `last_1d`] | `string` | `"last_10m"` | no |
| <a name="input_message"></a> [message](#input\_message) | Message sent when an alert is triggered | `any` | n/a | yes |
| <a name="input_new_group_delay"></a> [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before monitor new resource | `number` | `300` | no |
| <a name="input_new_host_delay"></a> [new\_host\_delay](#input\_new\_host\_delay) | Delay in seconds before monitor new resource | `number` | `300` | no |
@@ -87,6 +96,7 @@ Creates DataDog monitors with the following checks:

| Name | Description |
|------|-------------|
| <a name="output_nginx_ingress_is_down_id"></a> [nginx\_ingress\_is\_down\_id](#output\_nginx\_ingress\_is\_down\_id) | id for monitor nginx\_ingress\_is\_down |
| <a name="output_nginx_ingress_too_many_4xx_id"></a> [nginx\_ingress\_too\_many\_4xx\_id](#output\_nginx\_ingress\_too\_many\_4xx\_id) | id for monitor nginx\_ingress\_too\_many\_4xx |
| <a name="output_nginx_ingress_too_many_5xx_id"></a> [nginx\_ingress\_too\_many\_5xx\_id](#output\_nginx\_ingress\_too\_many\_5xx\_id) | id for monitor nginx\_ingress\_too\_many\_5xx |
<!-- END_TF_DOCS -->
46 changes: 44 additions & 2 deletions caas/kubernetes/ingress/vts/inputs.tf
@@ -59,8 +59,8 @@ variable "filter_tags_separator" {
default = ","
}

#Ingress

# Nginx Ingress
## Nginx Ingress 5xx errors monitor
variable "ingress_5xx_enabled" {
description = "Flag to enable Ingress 5xx errors monitor"
type = string
@@ -102,6 +102,7 @@ variable "ingress_5xx_threshold_warning" {
description = "5xx warning threshold in percentage"
}

## Nginx Ingress 4xx errors monitor
variable "ingress_4xx_enabled" {
description = "Flag to enable Ingress 4xx errors monitor"
type = string
@@ -148,3 +149,44 @@ variable "artificial_requests_count" {
description = "Number of false requests used to mitigate false positive in case of low trafic"
}

## Nginx Ingress is down monitor
variable "ingress_down_enabled" {
type = string
default = "true"
description = "Flag to enable Nginx Ingress is down monitor"
}

variable "ingress_down_message" {
default = ""
description = "Message sent when an alert is triggered"
}

variable "ingress_down_time_aggregator" {
type = string
default = "avg"
description = "Monitor aggregator for Nginx Ingress is down [available values: min, max or avg]"
}

variable "ingress_down_timeframe" {
type = string
default = "last_10m"
description = "Monitor timeframe for Nginx Ingress is down [available values: `last_#m` (1, 5, 10, 15, or 30), `last_#h` (1, 2, or 4), or `last_1d`]"
}

variable "ingress_down_threshold_critical" {
type = number
default = 0.3
description = "Nginx Ingress is down critical threshold in percentage"
}

variable "ingress_down_threshold_warning" {
type = number
default = 0.7
description = "Nginx Ingress is down warning threshold in percentage"
}

variable "ingress_down_extra_tags" {
type = list(string)
default = []
description = "Extra tags for Nginx Ingress is down monitor"
}
28 changes: 28 additions & 0 deletions caas/kubernetes/ingress/vts/monitors-ingress.tf
@@ -60,3 +60,31 @@ EOQ
tags = concat(local.common_tags, var.tags, var.ingress_4xx_extra_tags)
}

resource "datadog_monitor" "nginx_ingress_is_down" {
count = var.ingress_down_enabled == "true" ? 1 : 0
name = "${var.prefix_slug == "" ? "" : "[${var.prefix_slug}]"}[${var.environment}] Nginx Ingress {{kube_replica_set}} is down on {{kube_cluster_name}}"
message = coalesce(var.ingress_down_message, var.message)
type = "query alert"

query = <<EOQ
${var.ingress_down_time_aggregator}(${var.ingress_down_timeframe}):
avg:nginx_ingress.nginx_up${module.filter-tags.query_alert} by {kube_replica_set,kube_cluster_name}
<= ${var.ingress_down_threshold_critical}
EOQ

monitor_thresholds {
warning = var.ingress_down_threshold_warning
critical = var.ingress_down_threshold_critical
}

evaluation_delay = var.evaluation_delay
new_group_delay = var.new_group_delay
notify_no_data = true
renotify_interval = 0
notify_audit = false
timeout_h = var.timeout_h
include_tags = true
require_full_window = true

tags = concat(local.common_tags, var.tags, var.ingress_down_extra_tags)
}
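
As a usage sketch for the new monitor, a caller could tighten the "is down" thresholds as shown below. The module name, `source` path, and the values for `environment` and `message` are hypothetical; the `ingress_down_*` variable names come from this diff. Since the query compares with `<=`, the warning threshold should stay above the critical one, as it does with the defaults (0.7 and 0.3).

```hcl
module "ingress_vts_monitors" {
  # Assumption: module name and local source path are hypothetical.
  source = "./caas/kubernetes/ingress/vts"

  environment = "production"     # illustrative value
  message     = "@slack-k8s-ops" # illustrative notification handle

  # Evaluate over a shorter window than the default last_10m.
  ingress_down_timeframe = "last_5m"

  # Alert earlier than the defaults; warning must remain greater than critical.
  ingress_down_threshold_warning  = 0.9
  ingress_down_threshold_critical = 0.5
}
```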
5 changes: 5 additions & 0 deletions caas/kubernetes/ingress/vts/outputs.tf
@@ -1,3 +1,8 @@
output "nginx_ingress_is_down_id" {
description = "id for monitor nginx_ingress_is_down"
value = datadog_monitor.nginx_ingress_is_down.*.id
}

output "nginx_ingress_too_many_4xx_id" {
description = "id for monitor nginx_ingress_too_many_4xx"
value = datadog_monitor.nginx_ingress_too_many_4xx.*.id
25 changes: 9 additions & 16 deletions caas/kubernetes/node/README.md
@@ -17,16 +17,15 @@ module "datadog-monitors-caas-kubernetes-node" {

Creates DataDog monitors with the following checks:

- Kubernetes Node Disk pressure
- Kubernetes Node Frequent unregister net device
- Kubernetes Node Kubelet API does not respond
- Kubernetes Node Kubelet sync loop that updates containers does not work
- Kubernetes Node Memory pressure
- Kubernetes Node not ready
- Kubernetes Node Out of disk
- Kubernetes Node unschedulable
- Kubernetes Node volume inodes usage
- Kubernetes Node volume space usage
- Kubernetes Node {{kube_node}} disk pressure on {{kube_cluster_name}}
- Kubernetes Node {{kube_node}} frequent unregister net device
- Kubernetes Node {{kube_node}} Kubelet API does not respond on {{kube_cluster_name}}
- Kubernetes Node {{kube_node}} Kubelet sync loop that updates containers does not work on {{kube_cluster_name}}
- Kubernetes Node {{kube_node}} memory pressure on {{kube_cluster_name}}
- Kubernetes Node {{kube_node}} not ready on {{kube_cluster_name}}
- Kubernetes Node {{kube_node}} unschedulable on {{kube_cluster_name}}
- Kubernetes Node volume {{persistentvolumeclaim}} inodes usage
- Kubernetes Node volume {{persistentvolumeclaim}} space usage

<!-- BEGIN_TF_DOCS -->
## Requirements
@@ -53,7 +52,6 @@ Creates DataDog monitors with the following checks:

| Name | Type |
|------|------|
| [datadog_monitor.disk_out](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
| [datadog_monitor.disk_pressure](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
| [datadog_monitor.kubelet_ping](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
| [datadog_monitor.kubelet_syncloop](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
@@ -68,10 +66,6 @@ Creates DataDog monitors with the following checks:

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_disk_out_enabled"></a> [disk\_out\_enabled](#input\_disk\_out\_enabled) | Flag to enable Out of disk monitor | `string` | `"true"` | no |
| <a name="input_disk_out_extra_tags"></a> [disk\_out\_extra\_tags](#input\_disk\_out\_extra\_tags) | Extra tags for Out of disk monitor | `list(string)` | `[]` | no |
| <a name="input_disk_out_message"></a> [disk\_out\_message](#input\_disk\_out\_message) | Custom message for Out of disk monitor | `string` | `""` | no |
| <a name="input_disk_out_threshold_warning"></a> [disk\_out\_threshold\_warning](#input\_disk\_out\_threshold\_warning) | Out of disk monitor (warning threshold) | `string` | `3` | no |
| <a name="input_disk_pressure_enabled"></a> [disk\_pressure\_enabled](#input\_disk\_pressure\_enabled) | Flag to enable Disk pressure monitor | `string` | `"true"` | no |
| <a name="input_disk_pressure_extra_tags"></a> [disk\_pressure\_extra\_tags](#input\_disk\_pressure\_extra\_tags) | Extra tags for Disk pressure monitor | `list(string)` | `[]` | no |
| <a name="input_disk_pressure_message"></a> [disk\_pressure\_message](#input\_disk\_pressure\_message) | Custom message for Disk pressure monitor | `string` | `""` | no |
@@ -137,7 +131,6 @@ Creates DataDog monitors with the following checks:

| Name | Description |
|------|-------------|
| <a name="output_disk_out_id"></a> [disk\_out\_id](#output\_disk\_out\_id) | id for monitor disk\_out |
| <a name="output_disk_pressure_id"></a> [disk\_pressure\_id](#output\_disk\_pressure\_id) | id for monitor disk\_pressure |
| <a name="output_kubelet_ping_id"></a> [kubelet\_ping\_id](#output\_kubelet\_ping\_id) | id for monitor kubelet\_ping |
| <a name="output_kubelet_syncloop_id"></a> [kubelet\_syncloop\_id](#output\_kubelet\_syncloop\_id) | id for monitor kubelet\_syncloop |