Add SOPs for Obsctl Reloader alerts (#601)
* Add SOPs for Obsctl Reloader alerts

Signed-off-by: Douglas Camata <[email protected]>

* Address review comments

Signed-off-by: Douglas Camata <[email protected]>

* Update docs/sop/observatorium.md

Co-authored-by: Philip Gough <[email protected]>

---------

Signed-off-by: Douglas Camata <[email protected]>
Co-authored-by: Philip Gough <[email protected]>
douglascamata and philipgough authored Sep 22, 2023
1 parent 7847d49 commit 56ad08b
Showing 1 changed file with 85 additions and 0 deletions.
85 changes: 85 additions & 0 deletions docs/sop/observatorium.md
@@ -33,6 +33,10 @@
* [ObservatoriumPersistentVolumeUsageCritical](#observatoriumpersistentvolumeusagecritical)
* [Observatorium Gubernator Alerts](#observatorium-gubernator-alerts)
* [GubernatorIsDown](#gubernatorisdown)
* [Observatorium Obsctl Reloader Alerts](#observatorium-obsctl-reloader-alerts)
* [ObsCtlRulesStoreServerError](#obsctlrulesstoreservererror)
* [ObsCtlFetchRulesFailed](#obsctlfetchrulesfailed)
* [ObsCtlRulesSetFailure](#obsctlrulessetfailure)
* [Observatorium Thanos Alerts](#observatorium-thanos-alerts)
* [MandatoryThanosComponentIsDown](#mandatorythanoscomponentisdown)
* [ThanosCompactIsDown](#thanoscompactisdown)
@@ -892,6 +896,87 @@ Observatorium rate-limiting service is not working.
- Inspect logs and events of failing jobs, using [OpenShift console](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/k8s/ns/telemeter-production/deployments/observatorium-gubernator).
- Reach out to Observability Team ([email protected]), [`#forum-observatorium`](https://slack.com/app_redirect?channel=forum-observatorium) at CoreOS Slack, to get help in the investigation.

# Observatorium Obsctl Reloader Alerts

## ObsCtlRulesStoreServerError

### Impact

Tenant's rules are not being pushed to Observatorium, so they might be stale.

### Summary

Obsctl Reloader is not able to push rules to Observatorium. Possible causes:

- Failing tenant authentication due to bad credentials or issues with SSO.
- Internal server error in Observatorium API.

### Severity

`critical`

### Access Required

- Console access to the production clusters (this system isn't used in staging) that run Observatorium (currently [telemeter-prod-01 OSD](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/k8s/cluster/projects/observatorium-mst-production) and [rhobsp02ue1 OSD](https://console-openshift-console.apps.rhobsp02ue1.y9ya.p1.openshiftapps.com/)).
- Edit access to the Observatorium namespaces:
  - `observatorium-mst-production`

### Steps

- If the error is a 403, check the tenant credentials in the Vault path indicated in [App Interface](https://gitlab.cee.redhat.com/service/app-interface/-/blob/master/data/services/rhobs/observatorium-mst/namespaces/telemeter-prod-01/observatorium-mst-production.yml#L68). Verify that they are valid and can authenticate the tenant properly. This can be done by running obsctl-reloader locally; details can be found in the [RHOBS Tenant Test & Verification document](https://docs.google.com/document/d/1iDUh-U7d2luwRBDl8ZkRancsMCePt2pu2NFSf63j10Q/edit#heading=h.bupciudrwmna). If the credentials are invalid, identify the tenant and notify them in Slack.
- For any other status code, check the logs of the Observatorium API and obsctl-reloader pods in the namespace indicated in the alert. The logs should contain more details about the error.
- Finally, inspect the tenant's rules by checking the PrometheusRule CRs in the namespace indicated in the alert.
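
The branching in the steps above can be summarized in a small sketch. This is only an illustration of the decision described in this SOP, not part of obsctl-reloader; the function name `triage_step` and the returned labels are hypothetical.

```python
from http import HTTPStatus


def triage_step(status_code: int) -> str:
    """Map the HTTP status code reported by the alert to the first SOP step.

    The labels returned here are hypothetical names for the steps in this
    SOP; they do not come from obsctl-reloader itself.
    """
    if status_code == HTTPStatus.FORBIDDEN:  # 403: likely bad tenant credentials
        return "check-tenant-credentials"
    # Any other status code: start from the Observatorium API and
    # obsctl-reloader pod logs in the namespace indicated in the alert.
    return "check-api-and-reloader-logs"
```

For example, `triage_step(403)` points to the credentials check, while `triage_step(500)` points to the pod logs.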

## ObsCtlFetchRulesFailed

### Impact

Obsctl Reloader cannot fetch the tenant's rules from the local cluster for processing, so the rules in Observatorium might be stale.

### Summary

Obsctl Reloader is not able to fetch PrometheusRule CRs from the local cluster.

### Severity

`critical`

### Access Required

- Console access to the production clusters (this system isn't used in staging) that run Observatorium (currently [telemeter-prod-01 OSD](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/k8s/cluster/projects/observatorium-mst-production) and [rhobsp02ue1 OSD](https://console-openshift-console.apps.rhobsp02ue1.y9ya.p1.openshiftapps.com/)).
- Edit access to the Observatorium namespaces:
  - `observatorium-mst-production`

### Steps

- Check the logs of the Obsctl Reloader pods in the namespace indicated in the alert. The logs should contain more details about the error.
- Ensure that the Obsctl Reloader deployment has a service account that is allowed to `get`, `list`, and `watch` PrometheusRules.
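
One way to verify the service account permissions is `oc auth can-i` with impersonation. The sketch below only builds the commands so they can be reviewed before being run. The service account name `obsctl-reloader` in the usage example is an assumption; take the real one from the deployment spec. Impersonating with `--as` also requires that your own user is allowed to impersonate service accounts.

```python
import shlex

# Verbs the SOP says the service account needs on PrometheusRule CRs.
VERBS = ("get", "list", "watch")
# Fully qualified resource name of the Prometheus Operator PrometheusRule CRD.
RESOURCE = "prometheusrules.monitoring.coreos.com"


def can_i_commands(namespace: str, service_account: str) -> list[list[str]]:
    """Build one `oc auth can-i` command per required verb."""
    subject = f"system:serviceaccount:{namespace}:{service_account}"
    return [
        ["oc", "auth", "can-i", verb, RESOURCE, f"--as={subject}", "-n", namespace]
        for verb in VERBS
    ]


if __name__ == "__main__":
    # "obsctl-reloader" is an assumed service account name, not confirmed here.
    for cmd in can_i_commands("observatorium-mst-production", "obsctl-reloader"):
        print(shlex.join(cmd))
```

Each command answers `yes` or `no`; a `no` for any of the three verbs is consistent with this alert firing.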

## ObsCtlRulesSetFailure

### Impact

Obsctl Reloader is unable to set the tenant's rules in Observatorium, so they might be stale. The failure happens before any request is sent to the Observatorium API.

### Summary

Obsctl Reloader is not able to set PrometheusRule CRs in Observatorium due to a problem happening **before** sending the request.

### Severity

`warning`

### Access Required

- Console access to the production clusters (this system isn't used in staging) that run Observatorium (currently [telemeter-prod-01 OSD](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/k8s/cluster/projects/observatorium-mst-production) and [rhobsp02ue1 OSD](https://console-openshift-console.apps.rhobsp02ue1.y9ya.p1.openshiftapps.com/)).
- Edit access to the Observatorium namespaces:
  - `observatorium-mst-production`

### Steps

- Check the logs of the Observatorium API and obsctl-reloader pods in the namespace indicated in the alert. The logs should contain more details about the error.
- Finally, inspect the tenant's rules by checking the PrometheusRule CRs in the namespace indicated in the alert.
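
The two steps above map to two `oc` invocations, sketched here as a command builder. The deployment name `obsctl-reloader` and the one-hour log window are assumptions for illustration; substitute the values from the cluster and the alert.

```python
import shlex


def investigation_commands(
    namespace: str,
    deployment: str = "obsctl-reloader",  # assumed deployment name
    since: str = "1h",  # assumed log window
) -> list[list[str]]:
    """Build the commands for reading reloader logs and listing PrometheusRule CRs."""
    return [
        # Recent logs from the reloader deployment.
        ["oc", "logs", f"deployment/{deployment}", "-n", namespace, f"--since={since}"],
        # All PrometheusRule CRs in the namespace, as YAML for inspection.
        ["oc", "get", "prometheusrules.monitoring.coreos.com", "-n", namespace, "-o", "yaml"],
    ]


if __name__ == "__main__":
    for cmd in investigation_commands("observatorium-mst-production"):
        print(shlex.join(cmd))
```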

# Observatorium Thanos Alerts

## MandatoryThanosComponentIsDown