Add SOPs for Obsctl Reloader alerts (#601)

* Add SOPs for Obsctl Reloader alerts

Signed-off-by: Douglas Camata <[email protected]>

* Address review comments

Signed-off-by: Douglas Camata <[email protected]>

* Update docs/sop/observatorium.md

Co-authored-by: Philip Gough <[email protected]>

---------

Signed-off-by: Douglas Camata <[email protected]>
Co-authored-by: Philip Gough <[email protected]>
rhobs · Sep 22, 2023 · 56ad08b
1 parent 7847d49
commit 56ad08b
Showing 1 changed file with 85 additions and 0 deletions.
diff --git a/docs/sop/observatorium.md b/docs/sop/observatorium.md
@@ -33,6 +33,10 @@
   * [ObservatoriumPersistentVolumeUsageCritical](#observatoriumpersistentvolumeusagecritical)
 * [Observatorium Gubernator Alerts](#observatorium-gubernator-alerts)
   * [GubernatorIsDown](#gubernatorisdown)
+* [Observatorium Obsctl Reloader Alerts](#observatorium-obsctl-reloader-alerts)
+  * [ObsCtlRulesStoreServerError](#obsctlrulesstoreservererror)
+  * [ObsCtlFetchRulesFailed](#obsctlfetchrulesfailed)
+  * [ObsCtlRulesSetFailure](#obsctlrulessetfailure)
 * [Observatorium Thanos Alerts](#observatorium-thanos-alerts)
   * [MandatoryThanosComponentIsDown](#mandatorythanoscomponentisdown)
   * [ThanosCompactIsDown](#thanoscompactisdown)
@@ -892,6 +896,87 @@ Observatorium rate-limiting service is not working.
 - Inspect logs and events of failing jobs, using [OpenShift console](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/k8s/ns/telemeter-production/deployments/observatorium-gubernator).
 - Reach out to Observability Team ([email protected]), [`#forum-observatorium`](https://slack.com/app_redirect?channel=forum-observatorium) at CoreOS Slack, to get help in the investigation.
 
+# Observatorium Obsctl Reloader Alerts
+
+## ObsCtlRulesStoreServerError
+
+### Impact
+
+Tenant's rules are not being pushed to Observatorium, so they might be stale.
+
+### Summary
+
+Obsctl Reloader is not able to push rules to Observatorium. Potential causes include:
+
+- Failing tenant authentication due to bad credentials or issues with SSO.
+- Internal server error in Observatorium API.
+
+### Severity
+
+`critical`
+
+### Access Required
+
+- Console access to the production clusters (this system isn't used in staging) that run Observatorium (currently [telemeter-prod-01 OSD](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/k8s/cluster/projects/observatorium-mst-production) and [rhobsp02ue1 OSD](https://console-openshift-console.apps.rhobsp02ue1.y9ya.p1.openshiftapps.com/)).
+- Edit access to the Observatorium namespaces:
+  - `observatorium-mst-production`
+
+### Steps
+
+- If the error is a 403, check the tenant credentials in the Vault path indicated in [App Interface](https://gitlab.cee.redhat.com/service/app-interface/-/blob/master/data/services/rhobs/observatorium-mst/namespaces/telemeter-prod-01/observatorium-mst-production.yml#L68). Verify that they are valid and can authenticate the tenant. You can do this by running obsctl-reloader locally; see the [RHOBS Tenant Test & Verification document](https://docs.google.com/document/d/1iDUh-U7d2luwRBDl8ZkRancsMCePt2pu2NFSf63j10Q/edit#heading=h.bupciudrwmna) for details. If the credentials are invalid, identify the tenant and notify them in Slack.
+- For any other status code, check the logs of the Observatorium API and obsctl-reloader pods in the namespace indicated in the alert. The logs should contain more details about the error.
+- As a last resort, inspect the tenant's rules by checking the PrometheusRule CRs in the namespace indicated in the alert.
+
+## ObsCtlFetchRulesFailed
+
+### Impact
+
+Obsctl Reloader cannot fetch the tenant's rules from the local cluster for processing, so the rules in Observatorium might be stale.
+
+### Summary
+
+Obsctl Reloader is not able to fetch PrometheusRule CRs from the local cluster.
+
+### Severity
+
+`critical`
+
+### Access Required
+
+- Console access to the production clusters (this system isn't used in staging) that run Observatorium (currently [telemeter-prod-01 OSD](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/k8s/cluster/projects/observatorium-mst-production) and [rhobsp02ue1 OSD](https://console-openshift-console.apps.rhobsp02ue1.y9ya.p1.openshiftapps.com/)).
+- Edit access to the Observatorium namespaces:
+  - `observatorium-mst-production`
+
+### Steps
+
+- Check the logs of the Obsctl Reloader pods in the namespace indicated in the alert. The logs should contain more details about the error.
+- Ensure that the Obsctl Reloader deployment has a service account that is allowed to `get`, `list`, and `watch` PrometheusRules.
+
+## ObsCtlRulesSetFailure
+
+### Impact
+
+Obsctl Reloader is unable to set the tenant's rules in Observatorium, so they might be stale. The failure occurs before any request reaches the Observatorium API.
+
+### Summary
+
+Obsctl Reloader is not able to set PrometheusRule CRs in Observatorium because of a problem that occurs **before** the request is sent.
+
+### Severity
+
+`warning`
+
+### Access Required
+
+- Console access to the production clusters (this system isn't used in staging) that run Observatorium (currently [telemeter-prod-01 OSD](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/k8s/cluster/projects/observatorium-mst-production) and [rhobsp02ue1 OSD](https://console-openshift-console.apps.rhobsp02ue1.y9ya.p1.openshiftapps.com/)).
+- Edit access to the Observatorium namespaces:
+  - `observatorium-mst-production`
+
+### Steps
+
+- Check the logs of the Observatorium API and obsctl-reloader pods in the namespace indicated in the alert. The logs should contain more details about the error.
+- As a last resort, inspect the tenant's rules by checking the PrometheusRule CRs in the namespace indicated in the alert.
+
 # Observatorium Thanos Alerts
 
 ## MandatoryThanosComponentIsDown
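
As a companion to the RBAC step in the ObsCtlFetchRulesFailed SOP in the diff above, here is a minimal Python sketch (not part of the commit) that checks whether a Role or ClusterRole manifest grants `get`, `list`, and `watch` on PrometheusRules. It assumes the manifest has already been exported as JSON, for example with `oc get role <name> -o json`. The function name `missing_verbs` and the sample role are hypothetical. The check ignores `resourceNames` restrictions and does not resolve RoleBindings, so treat its result as a hint rather than a verdict.

```python
# Sketch only: inspects the "rules" list of a Role/ClusterRole manifest.
# It does not talk to a cluster and does not follow RoleBindings.

REQUIRED_VERBS = {"get", "list", "watch"}


def missing_verbs(role, api_group="monitoring.coreos.com",
                  resource="prometheusrules"):
    """Return the required verbs the role does not grant on the resource."""
    granted = set()
    for rule in role.get("rules") or []:
        groups = rule.get("apiGroups") or []
        resources = rule.get("resources") or []
        verbs = rule.get("verbs") or []
        # Skip rules that target a different API group or resource.
        if api_group not in groups and "*" not in groups:
            continue
        if resource not in resources and "*" not in resources:
            continue
        # A wildcard verb grants everything that is required.
        if "*" in verbs:
            return set()
        granted.update(verbs)
    return REQUIRED_VERBS - granted


if __name__ == "__main__":
    # Hypothetical role that forgot to grant "watch".
    example_role = {
        "kind": "Role",
        "rules": [
            {
                "apiGroups": ["monitoring.coreos.com"],
                "resources": ["prometheusrules"],
                "verbs": ["get", "list"],
            }
        ],
    }
    print(sorted(missing_verbs(example_role)))  # → ['watch']
```

On a live cluster, `oc auth can-i` with `--as=system:serviceaccount:<namespace>:<service-account>` answers the same question directly and also takes bindings into account.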