Add obsctl reloader alerting rules (#603)
* Add obsctl-reloader alert rules
* Refactor obsctl-reloader alert rules
* Fix typo
* Prefix alerts for easy eye grepping
1 parent 3bbcaf0, commit 7847d49
Showing 5 changed files with 149 additions and 2 deletions.
67 changes: 67 additions & 0 deletions
...servability/prometheusrules/observatorium-obsctl-reloader-production.prometheusrules.yaml
@@ -0,0 +1,67 @@
---
$schema: /openshift/prometheus-rule-1.yml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: app-sre
    role: alert-rules
  name: obsctl-reloader-production
spec:
  groups:
  - name: obsctl-reloader.rules
    rules:
    - alert: ObsCtlRulesStoreServerError
      annotations:
        dashboard: https://grafana.app-sre.devshift.net/d/no-dashboard/obsctl-reloader.rules?orgId=1&refresh=10s&var-datasource={{$externalLabels.cluster}}-prometheus&var-namespace={{$labels.namespace}}&var-job=All&var-pod=All&var-interval=5m
        description: Failed to send rules from tenant {{ $labels.tenant }} to store {{ $value | humanizePercentage }} of the time with a 5xx or 4xx status code.
        message: Failed to send rules from tenant {{ $labels.tenant }} to store {{ $value | humanizePercentage }} of the time with a 5xx or 4xx status code.
        runbook: https://github.com/rhobs/configuration/blob/main/docs/sop/observatorium.md#obsctlrulesstoreservererror
        summary: Failing to send rules to Observatorium.
      expr: |
        (
          sum_over_time(obsctl_reloader_prom_rules_store_ops_total{status_code=~"5..|4..", job="rules-obsctl-reloader"}[5m])
          /
          sum(sum_over_time(obsctl_reloader_prom_rules_store_ops_total{job="rules-obsctl-reloader"}[5m]))
        ) or vector(0)
        > 0.10
      for: 10m
      labels:
        service: telemeter
        severity: critical
    - alert: ObsCtlRulesSetFailure
      annotations:
        dashboard: https://grafana.app-sre.devshift.net/d/no-dashboard/obsctl-reloader.rules?orgId=1&refresh=10s&var-datasource={{$externalLabels.cluster}}-prometheus&var-namespace={{$labels.namespace}}&var-job=All&var-pod=All&var-interval=5m
        description: obsctl-reloader is failing to set rules for tenant {{ $labels.tenant }} before reaching Observatorium {{ $value | humanizePercentage }} of the time due to {{ $labels.reason }}.
        message: obsctl-reloader is failing to set rules for tenant {{ $labels.tenant }} before reaching Observatorium {{ $value | humanizePercentage }} of the time due to {{ $labels.reason }}.
        runbook: https://github.com/rhobs/configuration/blob/main/docs/sop/observatorium.md#obsctlrulessetfailure
        summary: Failing to set rules due to issue before talking to Observatorium.
      expr: |
        (
          sum_over_time(obsctl_reloader_prom_rule_set_failures_total{reason!="rules_store_error", job="rules-obsctl-reloader"}[5m])
          /
          sum_over_time(obsctl_reloader_prom_rule_set_total{job="rules-obsctl-reloader"}[5m])
        ) or vector(0)
        > 0.10
      for: 10m
      labels:
        service: telemeter
        severity: medium
    - alert: ObsCtlFetchRulesFailed
      annotations:
        dashboard: https://grafana.app-sre.devshift.net/d/no-dashboard/obsctl-reloader.rules?orgId=1&refresh=10s&var-datasource={{$externalLabels.cluster}}-prometheus&var-namespace={{$labels.namespace}}&var-job=All&var-pod=All&var-interval=5m
        description: obsctl-reloader is failing to fetch rules via the PrometheusRule CRD in the local cluster.
        message: obsctl-reloader is failing to fetch rules via the PrometheusRule CRD in the local cluster.
        runbook: https://github.com/rhobs/configuration/blob/main/docs/sop/observatorium.md#obsctlfetchrulesfailed
        summary: Failing to fetch rules from the local cluster.
      expr: |
        (
          sum_over_time(obsctl_reloader_prom_rule_fetch_failures_total{job="rules-obsctl-reloader"}[5m])
          /
          sum_over_time(obsctl_reloader_prom_rule_fetches_total{job="rules-obsctl-reloader"}[5m])
        ) or vector(0)
        > 0.20
      for: 5m
      labels:
        service: telemeter
        severity: critical
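Review note on the ObsCtlRulesStoreServerError expression above: the numerator keeps its per-series labels, while the denominator is wrapped in `sum(...)`, which drops every label. PromQL's default one-to-one vector matching only pairs series whose label sets are identical, so as written the division would be expected to return no series. The fragment below is a minimal sketch of one way to line the two sides up, not the form this commit ships. It assumes the metric is a counter that carries a `tenant` label (the annotation text suggests it does) and that a per-tenant error ratio is what is wanted. It also swaps `sum_over_time` for `rate`, which is the usual choice for counters, and leaves out `or vector(0)` because a value of 0 can never exceed the threshold.

```yaml
# Sketch only, under the assumptions stated above.
# Both sides are aggregated by tenant so their label sets match.
expr: |
  (
    sum by (tenant) (rate(obsctl_reloader_prom_rules_store_ops_total{status_code=~"5..|4..", job="rules-obsctl-reloader"}[5m]))
    /
    sum by (tenant) (rate(obsctl_reloader_prom_rules_store_ops_total{job="rules-obsctl-reloader"}[5m]))
  ) > 0.10
```

Keeping `tenant` on the result also means `{{ $labels.tenant }}` in the annotations has a value to render.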
67 changes: 67 additions & 0 deletions
...es/observability/prometheusrules/observatorium-obsctl-reloader-stage.prometheusrules.yaml
@@ -0,0 +1,67 @@
---
$schema: /openshift/prometheus-rule-1.yml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: app-sre
    role: alert-rules
  name: obsctl-reloader-stage
spec:
  groups:
  - name: obsctl-reloader.rules
    rules:
    - alert: ObsCtlRulesStoreServerError
      annotations:
        dashboard: https://grafana.app-sre.devshift.net/d/no-dashboard/obsctl-reloader.rules?orgId=1&refresh=10s&var-datasource={{$externalLabels.cluster}}-prometheus&var-namespace={{$labels.namespace}}&var-job=All&var-pod=All&var-interval=5m
        description: Failed to send rules from tenant {{ $labels.tenant }} to store {{ $value | humanizePercentage }} of the time with a 5xx or 4xx status code.
        message: Failed to send rules from tenant {{ $labels.tenant }} to store {{ $value | humanizePercentage }} of the time with a 5xx or 4xx status code.
        runbook: https://github.com/rhobs/configuration/blob/main/docs/sop/observatorium.md#obsctlrulesstoreservererror
        summary: Failing to send rules to Observatorium.
      expr: |
        (
          sum_over_time(obsctl_reloader_prom_rules_store_ops_total{status_code=~"5..|4..", job="rules-obsctl-reloader"}[5m])
          /
          sum(sum_over_time(obsctl_reloader_prom_rules_store_ops_total{job="rules-obsctl-reloader"}[5m]))
        ) or vector(0)
        > 0.10
      for: 10m
      labels:
        service: telemeter
        severity: high
    - alert: ObsCtlRulesSetFailure
      annotations:
        dashboard: https://grafana.app-sre.devshift.net/d/no-dashboard/obsctl-reloader.rules?orgId=1&refresh=10s&var-datasource={{$externalLabels.cluster}}-prometheus&var-namespace={{$labels.namespace}}&var-job=All&var-pod=All&var-interval=5m
        description: obsctl-reloader is failing to set rules for tenant {{ $labels.tenant }} before reaching Observatorium {{ $value | humanizePercentage }} of the time due to {{ $labels.reason }}.
        message: obsctl-reloader is failing to set rules for tenant {{ $labels.tenant }} before reaching Observatorium {{ $value | humanizePercentage }} of the time due to {{ $labels.reason }}.
        runbook: https://github.com/rhobs/configuration/blob/main/docs/sop/observatorium.md#obsctlrulessetfailure
        summary: Failing to set rules due to issue before talking to Observatorium.
      expr: |
        (
          sum_over_time(obsctl_reloader_prom_rule_set_failures_total{reason!="rules_store_error", job="rules-obsctl-reloader"}[5m])
          /
          sum_over_time(obsctl_reloader_prom_rule_set_total{job="rules-obsctl-reloader"}[5m])
        ) or vector(0)
        > 0.10
      for: 10m
      labels:
        service: telemeter
        severity: medium
    - alert: ObsCtlFetchRulesFailed
      annotations:
        dashboard: https://grafana.app-sre.devshift.net/d/no-dashboard/obsctl-reloader.rules?orgId=1&refresh=10s&var-datasource={{$externalLabels.cluster}}-prometheus&var-namespace={{$labels.namespace}}&var-job=All&var-pod=All&var-interval=5m
        description: obsctl-reloader is failing to fetch rules via the PrometheusRule CRD in the local cluster.
        message: obsctl-reloader is failing to fetch rules via the PrometheusRule CRD in the local cluster.
        runbook: https://github.com/rhobs/configuration/blob/main/docs/sop/observatorium.md#obsctlfetchrulesfailed
        summary: Failing to fetch rules from the local cluster.
      expr: |
        (
          sum_over_time(obsctl_reloader_prom_rule_fetch_failures_total{job="rules-obsctl-reloader"}[5m])
          /
          sum_over_time(obsctl_reloader_prom_rule_fetches_total{job="rules-obsctl-reloader"}[5m])
        ) or vector(0)
        > 0.20
      for: 5m
      labels:
        service: telemeter
        severity: high
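Review note on the threshold comparison used by the rules in both files: in PromQL, comparison operators bind more tightly than `or`, so an expression that ends in `) or vector(0)` followed by `> 0.20` parses as `(ratio) or (vector(0) > 0.20)`. The right-hand side is always empty, which means the threshold is never applied to the ratio itself, and the condition holds whenever the ratio returns any series. Below is a minimal sketch with explicit grouping, using ObsCtlFetchRulesFailed as the example. It keeps the original functions and selectors and only moves the comparison outside the parentheses; `or vector(0)` is dropped on the assumption that its only purpose was to supply a 0 when there is no data, which could never exceed the threshold anyway.

```yaml
# Sketch only: the threshold now applies to the failure ratio.
expr: |
  (
    sum_over_time(obsctl_reloader_prom_rule_fetch_failures_total{job="rules-obsctl-reloader"}[5m])
    /
    sum_over_time(obsctl_reloader_prom_rule_fetches_total{job="rules-obsctl-reloader"}[5m])
  ) > 0.20
```

The same regrouping would apply to ObsCtlRulesStoreServerError and ObsCtlRulesSetFailure with their `0.10` thresholds.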