MLPAB-1709 - How to get uptime stats (#986)
* create dash and document how to check them
andrewpearce-digital authored Jan 25, 2024
1 parent a223c87 commit 01d59d5
Showing 3 changed files with 84 additions and 27 deletions.
21 changes: 21 additions & 0 deletions docs/runbooks/checking_service_uptime.md
@@ -0,0 +1,21 @@
# Checking service uptime

## Overview

This runbook describes how to check the uptime of the service and of its dependencies.

Health checks are defined in [ADR-007](https://docs.opg.service.justice.gov.uk/documentation/adrs/adr-007.html).

We have metrics for the `/health-check/service` endpoint and the `/health-check/dependencies` endpoint.

Each endpoint is monitored by a Route53 health check that runs every 30 seconds. For key environments such as Production, the checks are configured to notify the team via Slack when an endpoint is down.

The [Route53 Health checks](https://us-east-1.console.aws.amazon.com/route53/healthchecks/home?region=us-east-1#/) are in the AWS us-east-1 region, and check from locations in the US, EU and Asia.
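
As a rough illustration of the setup described above, a Route53 health check for the service endpoint might be declared along these lines. This is a sketch under stated assumptions, not the repository's actual configuration: the resource name `example_service`, the hostname, the port and the failure threshold are hypothetical, and the three checker regions are just one selection that covers the US, EU and Asia.

```hcl
# Hypothetical sketch: names and values are assumptions for illustration only.
resource "aws_route53_health_check" "example_service" {
  fqdn              = "example.com" # hypothetical hostname
  type              = "HTTPS"
  port              = 443
  resource_path     = "/health-check/service"
  request_interval  = 30 # seconds between requests from each health checker
  failure_threshold = 3  # assumed value: consecutive failures before the check is unhealthy

  # One checker region each in the US, EU and Asia (Route53 requires at least three).
  regions = ["us-east-1", "eu-west-1", "ap-southeast-1"]
}
```

A check for the dependencies endpoint would presumably look the same, with `resource_path` set to `/health-check/dependencies`.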

## Checking the uptime of the service

Each environment has a CloudWatch dashboard that shows the uptime of the service and its dependencies, named `health-checks-<environment-name>-environment`.

After logging in and assuming a role in the relevant AWS account, you can access the dashboards here:

- [CloudWatch Dashboards](https://eu-west-1.console.aws.amazon.com/cloudwatch/home?region=eu-west-1#dashboards)
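
To make the dashboard easier to find, its name could be exposed as a Terraform output. The following is a minimal sketch that assumes the `aws_cloudwatch_dashboard.health_checks` resource added in this change; the output name is hypothetical and no such output exists in the commit.

```hcl
# Hypothetical output: not part of this commit.
output "health_checks_dashboard_name" {
  description = "Name of the CloudWatch dashboard showing health check uptime"
  value       = aws_cloudwatch_dashboard.health_checks.dashboard_name
}
```

Running `terraform output health_checks_dashboard_name` in the module that declares it would then print the name to search for in the CloudWatch console.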
42 changes: 15 additions & 27 deletions terraform/environment/.terraform.lock.hcl

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

48 changes: 48 additions & 0 deletions terraform/environment/region/dashboard.tf
@@ -0,0 +1,48 @@
resource "aws_cloudwatch_dashboard" "health_checks" {
  provider = aws.region
  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric"
        x      = 0
        y      = 0
        width  = 12
        height = 6

        properties = {
          sparkline = true,
          view      = "singleValue",
          metrics = [
            ["AWS/Route53", "HealthCheckPercentageHealthy", "HealthCheckId", aws_route53_health_check.service_health_check.id, { region = "us-east-1" }]
          ],
          region = "us-east-1",
          start  = "-PT8640H",
          end    = "P0D",
          period = 300,
          title  = "service health-check - average uptime of the service over 12 month window"
        }
      },
      {
        type   = "metric"
        x      = 0
        y      = 6
        width  = 12
        height = 5

        properties = {
          sparkline = true,
          view      = "singleValue",
          metrics = [
            ["AWS/Route53", "HealthCheckPercentageHealthy", "HealthCheckId", aws_route53_health_check.dependency_health_check.id, { region = "us-east-1" }]
          ],
          region = "us-east-1",
          start  = "-PT8640H",
          end    = "P0D",
          period = 300,
          title  = "dependency health-check - average availability of service dependencies over 12 month window"
        }
      }
    ]
  })
  dashboard_name = "health-checks-${data.aws_default_tags.current.tags.environment-name}-environment"
}
