Add SOPs for Obsctl Reloader alerts #601

Merged
merged 3 commits into from
Sep 22, 2023
88 changes: 88 additions & 0 deletions docs/sop/observatorium.md
@@ -33,6 +33,10 @@
  * [ObservatoriumPersistentVolumeUsageCritical](#observatoriumpersistentvolumeusagecritical)
* [Observatorium Gubernator Alerts](#observatorium-gubernator-alerts)
  * [GubernatorIsDown](#gubernatorisdown)
* [Observatorium Obsctl Reloader Alerts](#observatorium-obsctl-reloader-alerts)
  * [ObsCtlRulesStoreServerError](#obsctlrulesstoreservererror)
  * [ObsCtlFetchRulesFailed](#obsctlfetchrulesfailed)
  * [ObsCtlRulesSetFailure](#obsctlrulessetfailure)
* [Observatorium Thanos Alerts](#observatorium-thanos-alerts)
  * [MandatoryThanosComponentIsDown](#mandatorythanoscomponentisdown)
  * [ThanosCompactIsDown](#thanoscompactisdown)
@@ -892,6 +896,90 @@ Observatorium rate-limiting service is not working.
- Inspect logs and events of failing jobs, using [OpenShift console](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/k8s/ns/telemeter-production/deployments/observatorium-gubernator).
- Reach out to Observability Team ([email protected]), [`#forum-observatorium`](https://slack.com/app_redirect?channel=forum-observatorium) at CoreOS Slack, to get help in the investigation.

# Observatorium Obsctl Reloader Alerts

## ObsCtlRulesStoreServerError

### Impact

Tenant's rules are not being pushed to Observatorium, so they might be stale.

### Summary

Obsctl Reloader is not able to push rules to Observatorium. Potential causes include:

- Failing tenant authentication due to bad credentials or issues with SSO.
- Internal server error in Observatorium API.

### Severity

`critical`

### Access Required

- Console access to the cluster that runs Observatorium (currently [telemeter-prod-01 OSD](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/k8s/cluster/projects/observatorium-mst-production) and [app-sre-stage-0 OSD](https://console-openshift-console.apps.app-sre-stage-0.k3s7.p1.openshiftapps.com/k8s/cluster/projects/observatorium-mst-stage)).
- Edit access to the Observatorium namespaces:
  - `observatorium-mst-stage`
  - `observatorium-mst-production`

### Steps

- If the error is a 403, check the tenant credentials in the Vault path indicated in [App Interface](https://gitlab.cee.redhat.com/service/app-interface/-/blob/master/data/services/rhobs/observatorium-mst/namespaces/telemeter-prod-01/observatorium-mst-production.yml#L68). Verify that they are valid and can authenticate the tenant. You can do this by running obsctl-reloader locally; details are in the RHOBS Tenant Test & Verification document. If the credentials are invalid, identify the tenant and notify them in Slack.
- For any other status code, check the logs of the Observatorium API and obsctl-reloader pods in the namespace indicated in the alert. The logs should contain more details about the error.
- As a last step, inspect the tenant's rules by checking the PrometheusRule CRs in the namespace indicated in the alert.
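
The branching in the steps above can be sketched as a small helper. This is only an illustration under assumptions: the function names `next_step` and `logs_commands` and the deployment names in the usage example are hypothetical and not part of obsctl-reloader; confirm the real deployment names in the namespace indicated in the alert. The `oc logs` invocation uses only the standard `-n` and `--since` flags.

```python
from __future__ import annotations

import shlex


def next_step(status_code: int) -> str:
    """Return which SOP step applies for the status code reported with the alert."""
    if status_code == 403:
        # 403: validate the tenant credentials stored in Vault.
        return "check-tenant-credentials"
    # Any other status code: read the Observatorium API and obsctl-reloader logs.
    return "check-logs"


def logs_commands(namespace: str, deployments: list[str], since: str = "1h") -> list[list[str]]:
    """Build one `oc logs` invocation per deployment, limited to recent entries."""
    return [
        ["oc", "logs", f"deployment/{name}", "-n", namespace, f"--since={since}"]
        for name in deployments
    ]


if __name__ == "__main__":
    # Deployment names below are placeholders (assumptions), not confirmed names.
    print(next_step(500))
    for cmd in logs_commands("observatorium-mst-stage", ["observatorium-api", "obsctl-reloader"]):
        print(shlex.join(cmd))
```

The script only prints the commands, so they can be reviewed before being run against the cluster.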

## ObsCtlFetchRulesFailed

### Impact

Obsctl Reloader cannot fetch the tenant's rules from the local cluster for processing, so the rules in Observatorium might be stale.

### Summary

Obsctl Reloader is not able to fetch PrometheusRule CRs from the local cluster.

### Severity

`critical`

### Access Required

- Console access to the cluster that runs Observatorium (currently [telemeter-prod-01 OSD](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/k8s/cluster/projects/observatorium-mst-production) and [app-sre-stage-0 OSD](https://console-openshift-console.apps.app-sre-stage-0.k3s7.p1.openshiftapps.com/k8s/cluster/projects/observatorium-mst-stage)).
- Edit access to the Observatorium namespaces:
  - `observatorium-mst-stage`
  - `observatorium-mst-production`

### Steps

- Check the logs of the Obsctl Reloader pods in the namespace indicated in the alert. The logs should contain more details about the error.
- Ensure that the Obsctl Reloader deployment uses a service account that is allowed to `get`, `list`, and `watch` PrometheusRules.
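
One way to verify the service account permissions is `oc auth can-i` with impersonation. The sketch below only builds the commands, one per verb. The service account name in the usage example is a placeholder (an assumption), the helper name `can_i_commands` is hypothetical, and running the commands requires that your own user is allowed to impersonate service accounts.

```python
from __future__ import annotations

import shlex

# The three verbs the SOP expects the service account to have on PrometheusRules.
VERBS = ("get", "list", "watch")

# Fully qualified resource name of the Prometheus Operator PrometheusRule CRD.
RESOURCE = "prometheusrules.monitoring.coreos.com"


def can_i_commands(namespace: str, service_account: str) -> list[list[str]]:
    """Build one `oc auth can-i` invocation per verb, impersonating the service account."""
    subject = f"system:serviceaccount:{namespace}:{service_account}"
    return [
        ["oc", "auth", "can-i", verb, RESOURCE, "-n", namespace, f"--as={subject}"]
        for verb in VERBS
    ]


if __name__ == "__main__":
    # "obsctl-reloader" is a placeholder service account name (assumption).
    for cmd in can_i_commands("observatorium-mst-stage", "obsctl-reloader"):
        print(shlex.join(cmd))
```

If any of the three checks does not answer `yes`, review the Role and RoleBinding attached to the service account.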

## ObsCtlRulesSetFailure

### Impact

The tenant's rules cannot be set in Observatorium, so they might be stale. The failure happens before any request is sent to the Observatorium API.

### Summary

Obsctl Reloader is not able to set the rules from PrometheusRule CRs in Observatorium because of a problem that occurs **before** the request is sent.

### Severity

`warning`

### Access Required

- Console access to the cluster that runs Observatorium (currently [telemeter-prod-01 OSD](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/k8s/cluster/projects/observatorium-mst-production) and [app-sre-stage-0 OSD](https://console-openshift-console.apps.app-sre-stage-0.k3s7.p1.openshiftapps.com/k8s/cluster/projects/observatorium-mst-stage)).
Review comment (Contributor): Not needed here, but I wonder whether we should start pointing at the Loki datasource here instead. Something for a follow-up regardless.

- Edit access to the Observatorium namespaces:
  - `observatorium-mst-stage`
  - `observatorium-mst-production`

### Steps

- Check the logs of the Observatorium API and obsctl-reloader pods in the namespace indicated in the alert. The logs should contain more details about the error.
- As a last step, inspect the tenant's rules by checking the PrometheusRule CRs in the namespace indicated in the alert.
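
When inspecting the PrometheusRule CRs, a quick overview of which CR defines which rule groups can help spot a malformed or unexpected entry. The sketch below assumes the input is the JSON produced by `oc get prometheusrules -n <namespace> -o json`. The helper name `summarize_rules` and the sample data are hypothetical and made up for illustration.

```python
from __future__ import annotations

import json


def summarize_rules(prometheusrule_list_json: str) -> dict[str, dict[str, int]]:
    """Map each PrometheusRule CR name to its rule groups and their rule counts."""
    doc = json.loads(prometheusrule_list_json)
    summary: dict[str, dict[str, int]] = {}
    for item in doc.get("items", []):
        name = item["metadata"]["name"]
        groups = item.get("spec", {}).get("groups", [])
        # Count the rules in each group; a group without rules counts as 0.
        summary[name] = {g["name"]: len(g.get("rules", [])) for g in groups}
    return summary


# Made-up sample shaped like the output of `oc get prometheusrules -o json`.
SAMPLE = json.dumps({
    "kind": "List",
    "items": [{
        "kind": "PrometheusRule",
        "metadata": {"name": "example-tenant-rules", "namespace": "observatorium-mst-stage"},
        "spec": {"groups": [{
            "name": "example-group",
            "rules": [
                {"alert": "ExampleAlert", "expr": "vector(1)"},
                {"record": "example:metric:sum", "expr": "sum(up)"},
            ],
        }]},
    }],
})

if __name__ == "__main__":
    for cr_name, groups in summarize_rules(SAMPLE).items():
        print(cr_name, groups)
```

In practice, replace `SAMPLE` with the captured command output, for example by reading it from standard input.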

# Observatorium Thanos Alerts

## MandatoryThanosComponentIsDown