Update alerting.html.md.erb #911

Merged 1 commit on Jul 3, 2024
8 changes: 6 additions & 2 deletions source/standards/alerting.html.md.erb
@@ -1,12 +1,12 @@
---
title: How to manage alerts
-last_reviewed_on: 2023-06-08
+last_reviewed_on: 2024-06-27
review_in: 6 months
---

# <%= current_page.data.title %>

-Your service should have a system in place to send automated alerts if its monitoring system detects a problem. Sending alerts help services meet service level agreements (SLAs).
+Your service should have a system in place to send automated alerts if its monitoring system(s) detects a problem. Sending alerts help services meet service level agreements (SLAs), and provide awareness of suspicious activity to enable incident response.

## Sending alerts

@@ -15,6 +15,7 @@ Your service should send an alert when your [service monitoring][] detects an is
* affects service users
* requires action to fix
* lasts for a sustained period of time
+* indicates compromise or suspicious activity (such as multiple failed login attempts or unrecognised escalation of privilege)

You should only send an alert for things that need action. Alert text should be specific and [include actionable information][]. You should not include sensitive material.
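
As an illustration of the criteria above, here is a minimal Python sketch of two checks: one that fires only when a problem "lasts for a sustained period of time", and one that flags "multiple failed login attempts". The class and function names, the 5-minute window and the threshold of 5 are assumptions made for this example; they are not part of the standard or of any monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class SustainedCondition:
    """Fires only when a check has been failing continuously for `min_duration`."""

    min_duration: timedelta
    failing_since: Optional[datetime] = None

    def observe(self, failing: bool, now: datetime) -> bool:
        """Record one check result; return True if an alert should be sent."""
        if not failing:
            self.failing_since = None  # recovery resets the timer
            return False
        if self.failing_since is None:
            self.failing_since = now  # first failure in this run
        return now - self.failing_since >= self.min_duration


def too_many_failed_logins(attempt_times, now,
                           window=timedelta(minutes=5), threshold=5):
    """Return True if at least `threshold` failed logins fall inside `window`.

    The window and threshold defaults are hypothetical values.
    """
    recent = [t for t in attempt_times if timedelta(0) <= now - t <= window]
    return len(recent) >= threshold
```

Resetting the timer on recovery means a brief blip does not produce an alert, which fits the guidance to alert only on things that need action.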

@@ -41,6 +42,7 @@ You must prioritise alerts based on whether they need an immediate fix. It can h

* interrupting - need immediate investigation and resolution
* non-interrupting - do not need immediate resolution
+* security-related - may indicate compromise of the system

The [Google Site Reliability Engineering (SRE)][site reliability engineering] handbook classifies “interrupting” issues as “pages”, and “non-interrupting” issues as “tickets”. Put non-interrupting alerts into a ticket queue for your support team to solve. Keep the ticket queue and team backlog separate to avoid confusion. You should specify an SLA for how long both types of alert take to resolve.
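
One way to make the per-type SLA concrete is to compute a resolution deadline from the alert priority. The sketch below assumes hypothetical targets (4 hours and 5 days) and a hypothetical `resolve_by` helper. The text above only asks for an SLA on interrupting and non-interrupting alerts, so the security-related type is deliberately left without one here.

```python
from datetime import datetime, timedelta
from enum import Enum


class Priority(Enum):
    INTERRUPTING = "interrupting"          # a "page" in SRE terms
    NON_INTERRUPTING = "non-interrupting"  # a "ticket" in SRE terms
    SECURITY = "security-related"


# Hypothetical resolution targets; agree real values with your team.
RESOLUTION_SLA = {
    Priority.INTERRUPTING: timedelta(hours=4),
    Priority.NON_INTERRUPTING: timedelta(days=5),
}


def resolve_by(priority, raised_at):
    """Return the resolution deadline, or None if no SLA is defined."""
    sla = RESOLUTION_SLA.get(priority)
    return raised_at + sla if sla is not None else None
```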

@@ -55,6 +57,7 @@ Recommended tools are:

- [PagerDuty][] to send high-priority / interrupting alerts
- [Zendesk][] to manage non-interrupting alerts as tickets
+- [Splunk][] to manage security-related alerts

You can also configure these tools to send alert notifications using email or Slack. However, you should only use email and Slack as additions to your primary alerting tool. If alerts only go to email or Slack, people may ignore, overlook, filter them out, or treat them like spam.
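
The routing rule above (a primary tool per alert type, with email and Slack only as additions) could be sketched as a small lookup. This is an illustrative mapping only: the destination names are plain labels, the `destinations` function is hypothetical, and no real PagerDuty, Zendesk or Splunk API is called.

```python
# Primary tool per alert type, following the recommended tools list.
PRIMARY_TOOL = {
    "interrupting": "pagerduty",
    "non-interrupting": "zendesk",
    "security-related": "splunk",
}

# Email and Slack are only ever added after the primary tool.
SECONDARY_CHANNELS = ("email", "slack")


def destinations(priority, notify_secondary=True):
    """Return destinations in order: primary tool first, then optional extras."""
    if priority not in PRIMARY_TOOL:
        raise ValueError(f"unknown alert priority: {priority!r}")
    result = [PRIMARY_TOOL[priority]]
    if notify_secondary:
        result.extend(SECONDARY_CHANNELS)
    return result
```

Because the primary tool is always the first entry, an alert can never be routed to email or Slack alone.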

@@ -71,6 +74,7 @@ For more information refer to the:
[service monitoring]: /standards/monitoring.html
[PagerDuty]: https://www.pagerduty.com
[Zendesk]: https://www.zendesk.com
+[Splunk]: https://splunk.com
[Smashing]: https://github.com/Smashing/smashing
[BlinkenJS]: https://github.com/alphagov/blinkenjs
[information about monitoring]: /standards/monitoring.html