Add SOPs for Obsctl Reloader alerts (#601)
* Add SOPs for Obsctl Reloader alerts

Signed-off-by: Douglas Camata <[email protected]>

* Address review comments

Signed-off-by: Douglas Camata <[email protected]>

* Update docs/sop/observatorium.md

Co-authored-by: Philip Gough <[email protected]>

---------

Signed-off-by: Douglas Camata <[email protected]>
Co-authored-by: Philip Gough <[email protected]>
1 parent 7847d49, commit 56ad08b
Showing 1 changed file with 85 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -33,6 +33,10 @@ | |
* [ObservatoriumPersistentVolumeUsageCritical](#observatoriumpersistentvolumeusagecritical) | ||
* [Observatorium Gubernator Alerts](#observatorium-gubernator-alerts) | ||
* [GubernatorIsDown](#gubernatorisdown) | ||
* [Observatorium Obsctl Reloader Alerts](#observatorium-obsctl-reloader-alerts) | ||
* [ObsCtlRulesStoreServerError](#obsctlrulesstoreservererror) | ||
* [ObsCtlFetchRulesFailed](#obsctlfetchrulesfailed) | ||
* [ObsCtlRulesSetFailure](#obsctlrulessetfailure) | ||
* [Observatorium Thanos Alerts](#observatorium-thanos-alerts) | ||
* [MandatoryThanosComponentIsDown](#mandatorythanoscomponentisdown) | ||
* [ThanosCompactIsDown](#thanoscompactisdown) | ||
|
@@ -892,6 +896,87 @@ Observatorium rate-limiting service is not working. | |
- Inspect logs and events of failing jobs, using [OpenShift console](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/k8s/ns/telemeter-production/deployments/observatorium-gubernator). | ||
- Reach out to Observability Team ([email protected]), [`#forum-observatorium`](https://slack.com/app_redirect?channel=forum-observatorium) at CoreOS Slack, to get help in the investigation. | ||
|
||
# Observatorium Obsctl Reloader Alerts | ||
|
||
## ObsCtlRulesStoreServerError | ||
|
||
### Impact | ||
|
||
Tenant's rules are not being pushed to Observatorium, so they might be stale. | ||
|
||
### Summary | ||
|
||
Obsctl Reloader is not able to push rules to Observatorium. Possible causes: | |
|
||
- Failing tenant authentication due to bad credentials or issues with SSO. | ||
- Internal server error in Observatorium API. | ||
|
||
### Severity | ||
|
||
`critical` | ||
|
||
### Access Required | ||
|
||
- Console access to the production clusters (this system isn't used in staging) that run Observatorium (currently [telemeter-prod-01 OSD](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/k8s/cluster/projects/observatorium-mst-production) and [rhobsp02ue1 OSD](https://console-openshift-console.apps.rhobsp02ue1.y9ya.p1.openshiftapps.com/)). | |
- Edit access to the Observatorium namespaces: | ||
  - `observatorium-mst-production` | |
|
||
### Steps | ||
|
||
- If the error is a 403, check the tenant credentials in the Vault path indicated in [App Interface](https://gitlab.cee.redhat.com/service/app-interface/-/blob/master/data/services/rhobs/observatorium-mst/namespaces/telemeter-prod-01/observatorium-mst-production.yml#L68). Verify that they are valid and can authenticate the tenant. You can do this by running obsctl-reloader locally; details are in the [RHOBS Tenant Test & Verification document](https://docs.google.com/document/d/1iDUh-U7d2luwRBDl8ZkRancsMCePt2pu2NFSf63j10Q/edit#heading=h.bupciudrwmna). If the credentials are invalid, identify the tenant and notify them in Slack. | |
- For any other status code, check the logs of the Observatorium API and obsctl-reloader pods in the namespace indicated in the alert. The logs should contain more details about the error. | ||
- As a last resort, inspect the tenant's rules directly in the PrometheusRule CRs in the namespace indicated in the alert. | |
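
The steps above amount to a small decision on the HTTP status code. A minimal sketch of that triage, written in Python purely for illustration: the function name `triage_rules_store_error` is hypothetical and is not part of obsctl-reloader; it only restates the steps of this SOP.

```python
def triage_rules_store_error(status_code: int, namespace: str) -> list[str]:
    """Return the SOP steps that apply to the status code reported by the alert.

    Hypothetical helper: it mirrors the written steps and calls no real API.
    """
    if status_code == 403:
        # A 403 points at tenant authentication, so start with the credentials.
        steps = [
            "Check the tenant credentials in the Vault path referenced in App Interface.",
            "Verify the credentials by running obsctl-reloader locally.",
            "If the credentials are invalid, identify the tenant and notify them in Slack.",
        ]
    else:
        # Any other status code: the pod logs should carry the details.
        steps = [
            f"Check the Observatorium API pod logs in namespace {namespace}.",
            f"Check the obsctl-reloader pod logs in namespace {namespace}.",
        ]
    # The final step applies regardless of the status code.
    steps.append(f"Inspect the PrometheusRule CRs in namespace {namespace}.")
    return steps


if __name__ == "__main__":
    for step in triage_rules_store_error(403, "observatorium-mst-production"):
        print("-", step)
```

The namespace is passed in rather than hard-coded because the alert indicates which namespace to look at.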
|
||
## ObsCtlFetchRulesFailed | ||
|
||
### Impact | ||
|
||
Obsctl Reloader cannot fetch the tenant's rules from the local cluster for processing, so the rules in Observatorium might be stale. | |
|
||
### Summary | ||
|
||
Obsctl Reloader is not able to fetch PrometheusRule CRs from the local cluster. | ||
|
||
### Severity | ||
|
||
`critical` | ||
|
||
### Access Required | ||
|
||
- Console access to the production clusters (this system isn't used in staging) that run Observatorium (currently [telemeter-prod-01 OSD](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/k8s/cluster/projects/observatorium-mst-production) and [rhobsp02ue1 OSD](https://console-openshift-console.apps.rhobsp02ue1.y9ya.p1.openshiftapps.com/)). | |
- Edit access to the Observatorium namespaces: | ||
  - `observatorium-mst-production` | |
|
||
### Steps | ||
|
||
- Check the logs of the Obsctl Reloader pods in the namespace indicated in the alert. The logs should contain more details about the error. | |
- Ensure that the Obsctl Reloader deployment has a service account that can `get`, `list`, and `watch` PrometheusRules. | |
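
One way to check the service account permissions is `oc auth can-i` with impersonation. The sketch below only builds the command lines, one per verb, so they can be reviewed before being run. The service account name `obsctl-reloader` is an assumption made for illustration; use the name from the actual deployment.

```python
import shlex

# Verbs the SOP says the service account needs on PrometheusRules.
VERBS = ("get", "list", "watch")


def can_i_commands(namespace: str, service_account: str) -> list[str]:
    """Build one `oc auth can-i` command per required verb.

    The commands impersonate the service account via --as, so whoever runs
    them needs permission to impersonate service accounts in the cluster.
    """
    subject = f"system:serviceaccount:{namespace}:{service_account}"
    return [
        shlex.join([
            "oc", "auth", "can-i", verb,
            "prometheusrules.monitoring.coreos.com",
            "--namespace", namespace,
            "--as", subject,
        ])
        for verb in VERBS
    ]


if __name__ == "__main__":
    # "obsctl-reloader" is a placeholder service account name (assumption).
    for cmd in can_i_commands("observatorium-mst-production", "obsctl-reloader"):
        print(cmd)
```

Each command answers `yes` or `no`; a `no` for any of the three verbs would be consistent with this alert firing.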
|
||
## ObsCtlRulesSetFailure | ||
|
||
### Impact | ||
|
||
Obsctl Reloader cannot set the tenant's rules in Observatorium, so they might be stale. The failure occurs before any request is sent to the Observatorium API. | |
|
||
### Summary | ||
|
||
Obsctl Reloader is not able to set PrometheusRule CRs in Observatorium because of a problem that occurs **before** the request is sent. | |
|
||
### Severity | ||
|
||
`warning` | ||
|
||
### Access Required | ||
|
||
- Console access to the production clusters (this system isn't used in staging) that run Observatorium (currently [telemeter-prod-01 OSD](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/k8s/cluster/projects/observatorium-mst-production) and [rhobsp02ue1 OSD](https://console-openshift-console.apps.rhobsp02ue1.y9ya.p1.openshiftapps.com/)). | |
- Edit access to the Observatorium namespaces: | ||
  - `observatorium-mst-production` | |
|
||
### Steps | ||
|
||
- Check the logs of the Observatorium API and obsctl-reloader pods in the namespace indicated in the alert. The logs should contain more details about the error. | |
- As a last resort, inspect the tenant's rules directly in the PrometheusRule CRs in the namespace indicated in the alert. | |
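
The two steps above can also be carried out from the command line instead of the console. The sketch below builds the `oc` commands for recent logs and for dumping the PrometheusRule CRs. The deployment names passed in the example are placeholders, not names confirmed by this SOP; look up the real ones in the namespace first.

```python
import shlex


def inspection_commands(namespace: str, deployments: list[str]) -> list[str]:
    """Build `oc` commands to read recent logs and list PrometheusRule CRs."""
    # One `oc logs` command per deployment, limited to the last hour.
    cmds = [
        shlex.join([
            "oc", "logs", f"deployment/{name}",
            "--namespace", namespace,
            "--since=1h",
        ])
        for name in deployments
    ]
    # Dump the tenant's PrometheusRule CRs for manual inspection.
    cmds.append(shlex.join([
        "oc", "get", "prometheusrules.monitoring.coreos.com",
        "--namespace", namespace,
        "--output=yaml",
    ]))
    return cmds


if __name__ == "__main__":
    # Both deployment names below are hypothetical placeholders.
    for cmd in inspection_commands(
        "observatorium-mst-production",
        ["observatorium-api", "obsctl-reloader"],
    ):
        print(cmd)
```

Printing the commands rather than executing them keeps the sketch side-effect free and lets the on-call engineer adjust the time window or names before running anything.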
|
||
# Observatorium Thanos Alerts | ||
|
||
## MandatoryThanosComponentIsDown | ||
|