sabakan-state-setter: detect machine failures using Prometheus alerts
Signed-off-by: morimoto-cybozu <[email protected]>
morimoto-cybozu committed Feb 7, 2025
1 parent 1c079b9 commit 1258bc4
Showing 10 changed files with 623 additions and 72 deletions.
51 changes: 38 additions & 13 deletions docs/sabakan-state-setter.md
@@ -1,12 +1,12 @@
sabakan-state-setter
====================

sabakan-state-setter changes the state of machines. It has the following three functions.
`sabakan-state-setter` changes the states of machines. It has the following three functions.

1. Health check
Decide sabakan machine states according to [serf][] status and [monitor-hw][] metrics. And update the states.
The target machines are whose current sabakan machine state is `Uninitialized`, `Healthy`, `Unhealthy`, or `Unreachable`.
Health check is just update sabakan machine state. There is no any side effect.
Decide sabakan machine states according to [serf][] statuses, [monitor-hw][] metrics, and [Prometheus][] alerts, and then update the states.
Not all machines are checked by `sabakan-state-setter`. It observes `Uninitialized`, `Healthy`, `Unhealthy`, and `Unreachable` machines.
Health check just updates sabakan machine states. There are no side effects.

2. Retirement
`sabakan-state-setter` lets retiring machines retire.
@@ -32,14 +32,18 @@ Health check
- serf status is `alive`.
- serf tags `systemd-units-failed` is set and has no errors.
- All of later mentioned machine peripherals are healthy.
- There are no active Prometheus alerts that match any of `trigger-alerts` in the configuration file.
- This condition includes the case when Prometheus Alertmanager is down.
- Judge as `Unreachable`
- serf status is `failed` or `left`, or the machine is not yet a serf member. This is equivalent to `sabakan-state-setter` being unable to access monitor-hw metrics.
- There is an active Prometheus alert which indicates that the machine is unreachable.
- Judge as `Unhealthy`
- serf status is `alive`.
- At least one of them matches:
- serf tags `systemd-units-failed` is not set or has errors.
- `sabakan-state-setter` can not retrieve monitor-hw metrics.
- At least one of later mentioned machine peripherals is unhealthy.
- There is an active Prometheus alert which indicates that the machine is unhealthy.
- Nothing to judge machine state
- `sabakan-state-setter` cannot access `serf.service` of the same boot server.

@@ -51,6 +55,9 @@ sabakan-state-setter waits a grace period before updating a machine's state to `
sabakan-state-setter updates the machine state
if and only if it judges the machine's state as `unhealthy` for the time specified in this value.

Note that this grace period is not applied when an active Prometheus alert is the source of the state change.
We can configure Prometheus alerts to wait a sufficient amount of time before becoming active.

### Target machine peripherals

You can define the metrics used for health checking in the configuration file.
@@ -88,15 +95,15 @@ Usage
sabakan-state-setter [OPTIONS]
```

| Option | Default value | Description |
| ------------------- | ------------------------ | --------------------------------------------------------------------------------- |
| `-config-file` | `''` | Path of config file. |
| `-etcd-session-ttl` | `1m` | TTL of etcd session. This value is interpreted as a [duration string][]. |
| `-interval` | `1m` | Interval of scraping metrics. This value is interpreted as a [duration string][]. |
| `-parallel` | `30` | The number of parallel execution of getting machines metrics. |
| `-sabakan-url` | `http://localhost:10080` | sabakan HTTP Server URL. |
| `-sabakan-url-https`| `https://localhost:10443`| sabakan HTTPS Server URL. |
| `-serf-address` | `127.0.0.1:7373` | serf address. |
| Option | Default value | Description |
| -------------------- | ------------------------- | --------------------------------------------------------------------------------- |
| `-config-file` | `''` | Path of config file. |
| `-etcd-session-ttl` | `1m` | TTL of etcd session. This value is interpreted as a [duration string][]. |
| `-interval` | `1m` | Interval of scraping metrics. This value is interpreted as a [duration string][]. |
| `-parallel` | `30` | The number of parallel execution of getting machines metrics. |
| `-sabakan-url` | `http://localhost:10080` | sabakan HTTP Server URL. |
| `-sabakan-url-https` | `https://localhost:10443` | sabakan HTTPS Server URL. |
| `-serf-address` | `127.0.0.1:7373` | serf address. |

Config file
-----------
@@ -105,6 +112,7 @@ Config file
| ------------------------------------------------- | ------------- | ---------------------------------------------------------------------------------------------------------------- |
| `shutdown-schedule` string | `""` | Schedule in Cron format for retired machines shutdown. If this field is omitted, shutdown will not be performed. |
| `machine-types` [MachineType](#MachineType) array | `nil` | Machine types is a list of `MachineType`. You should list all machine types used in your data center. |
| `alert-monitor` \*[AlertMonitor](#AlertMonitor) | `nil` | Configurations to monitor Prometheus alerts. |

### `MachineType`

@@ -138,8 +146,25 @@ https://github.com/cybozu-go/setup-hw/blob/master/docs/rule.md
`labels` and `label-prefix` are AND condition,
i.e. a metric is selected if and only if all of the conditions are satisfied.

### `AlertMonitor`

| Field | Default value | Description |
| ---------------------------------------------------- | ------------- | ------------------------------------------------------------------------------- |
| `alertmanager-endpoint` string | `''` | URL of Alertmanager API V2 endpoint (e.g., `http://alertmanager:9093/api/v2/`). |
| `trigger-alerts` [TriggerAlert](#TriggerAlert) array | `nil` | Prometheus alerts that are recognized as indications of non-healthy machines. |

### `TriggerAlert`

| Field | Default value | Description |
| ---------------------------- | ------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `name` string | `''` | Name of a Prometheus alert to be monitored. |
| `labels` `map[string]string` | `nil` | Filtering labels and their values of a Prometheus alert. If a suspicious alert does not contain these labels, the alert is ignored. |
| `address-label` string | `''` | Label name of a Prometheus alert that denotes the IP address of a non-healthy machine. Exactly one of `address-label` and `serial-label` is required. |
| `serial-label` string | `''` | Label name of a Prometheus alert that denotes the serial number of a non-healthy machine. Exactly one of `address-label` and `serial-label` is required. |
| `state` string | `''` | Candidate of the next state of a non-healthy machine. Currently `unreachable` and `unhealthy` are supported. |
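
The constraints in the `TriggerAlert` table can be sketched in Go as follows. This is an illustration under stated assumptions, not the implementation from this commit: the struct, `Validate`, `Matches`, and `Target` are hypothetical, alerts are modeled as plain label maps, and the alert name is assumed to be carried in the `alertname` label as Prometheus conventionally does. The alert name `MachineDown` and the label values are made up.

```go
package main

import (
	"errors"
	"fmt"
)

// TriggerAlert mirrors the documented fields. The Go type is a hypothetical
// stand-in and may differ from the project's actual definition.
type TriggerAlert struct {
	Name         string            // `name`
	Labels       map[string]string // `labels`
	AddressLabel string            // `address-label`
	SerialLabel  string            // `serial-label`
	State        string            // `state`
}

// Validate checks the two constraints stated in the table.
func (t TriggerAlert) Validate() error {
	// Exactly one of address-label and serial-label is required.
	if (t.AddressLabel == "") == (t.SerialLabel == "") {
		return errors.New("exactly one of address-label and serial-label is required")
	}
	switch t.State {
	case "unreachable", "unhealthy":
		return nil
	default:
		return fmt.Errorf("unsupported state %q", t.State)
	}
}

// Matches reports whether an alert (given as its label set) has the trigger's
// name and contains every filtering label with the expected value.
func (t TriggerAlert) Matches(alertLabels map[string]string) bool {
	if alertLabels["alertname"] != t.Name {
		return false
	}
	for k, want := range t.Labels {
		if got, ok := alertLabels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// Target returns the kind of identifier ("address" or "serial") and its
// value, read from the label that the trigger names.
func (t TriggerAlert) Target(alertLabels map[string]string) (kind, value string, ok bool) {
	if t.AddressLabel != "" {
		value, ok = alertLabels[t.AddressLabel]
		return "address", value, ok
	}
	value, ok = alertLabels[t.SerialLabel]
	return "serial", value, ok
}

func main() {
	trigger := TriggerAlert{
		Name:         "MachineDown", // hypothetical alert name
		Labels:       map[string]string{"severity": "critical"},
		AddressLabel: "address",
		State:        "unreachable",
	}
	if err := trigger.Validate(); err != nil {
		fmt.Println("invalid trigger:", err)
		return
	}
	alert := map[string]string{"alertname": "MachineDown", "severity": "critical", "address": "10.0.0.1"}
	if trigger.Matches(alert) {
		kind, value, _ := trigger.Target(alert)
		fmt.Printf("machine with %s %s: candidate state %s\n", kind, value, trigger.State)
	}
}
```

Modeling an alert as a label map keeps the sketch independent of any Alertmanager client library; real code would read the labels from the Alertmanager API response.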

[Dell BOSS]: https://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/Dell-PowerEdge-Boot-Optimized-Storage-Solution.pdf
[duration string]: https://golang.org/pkg/time/#ParseDuration
[monitor-hw]: https://github.com/cybozu-go/setup-hw/blob/master/docs/monitor-hw.md
[serf]: https://www.serf.io/
[Prometheus]: https://prometheus.io/
24 changes: 20 additions & 4 deletions go.mod
@@ -14,6 +14,7 @@ require (
github.com/cybozu-go/sabakan/v3 v3.1.2
github.com/cybozu-go/well v1.11.2
github.com/flatcar/ignition v0.36.2
github.com/go-openapi/runtime v0.27.1
github.com/google/go-cmp v0.6.0
github.com/google/go-containerregistry v0.20.2
github.com/google/go-github/v50 v50.2.0
@@ -23,6 +24,7 @@ require (
github.com/mattn/go-isatty v0.0.20
github.com/onsi/ginkgo/v2 v2.21.0
github.com/onsi/gomega v1.35.1
github.com/prometheus/alertmanager v0.27.0
github.com/prometheus/client_golang v1.20.5
github.com/prometheus/client_model v0.6.1
github.com/prometheus/common v0.60.1
@@ -50,6 +52,7 @@ require (
github.com/99designs/gqlgen v0.17.50 // indirect
github.com/ProtonMail/go-crypto v0.0.0-20230217124315-7d5c6f04bbb8 // indirect
github.com/armon/go-metrics v0.4.1 // indirect
github.com/asaskevich/govalidator v0.0.0-20230301143203-a9d515a09cc2 // indirect
github.com/beorn7/perks v1.0.1 // indirect
github.com/blang/semver/v4 v4.0.0 // indirect
github.com/cenkalti/backoff/v4 v4.3.0 // indirect
@@ -69,9 +72,16 @@ require (
github.com/fxamacker/cbor/v2 v2.7.0 // indirect
github.com/go-jose/go-jose/v4 v4.0.1 // indirect
github.com/go-logr/logr v1.4.2 // indirect
github.com/go-openapi/jsonpointer v0.19.6 // indirect
github.com/go-openapi/jsonreference v0.20.2 // indirect
github.com/go-openapi/swag v0.22.4 // indirect
github.com/go-logr/stdr v1.2.2 // indirect
github.com/go-openapi/analysis v0.22.2 // indirect
github.com/go-openapi/errors v0.21.0 // indirect
github.com/go-openapi/jsonpointer v0.20.2 // indirect
github.com/go-openapi/jsonreference v0.20.4 // indirect
github.com/go-openapi/loads v0.21.5 // indirect
github.com/go-openapi/spec v0.20.14 // indirect
github.com/go-openapi/strfmt v0.22.0 // indirect
github.com/go-openapi/swag v0.22.9 // indirect
github.com/go-openapi/validate v0.23.0 // indirect
github.com/go-task/slim-sprig/v3 v3.0.0 // indirect
github.com/gogo/protobuf v1.3.2 // indirect
github.com/golang/protobuf v1.5.4 // indirect
@@ -90,7 +100,7 @@ require (
github.com/hashicorp/go-rootcerts v1.0.2 // indirect
github.com/hashicorp/go-secure-stdlib/parseutil v0.1.7 // indirect
github.com/hashicorp/go-secure-stdlib/strutil v0.1.2 // indirect
github.com/hashicorp/go-sockaddr v1.0.2 // indirect
github.com/hashicorp/go-sockaddr v1.0.6 // indirect
github.com/hashicorp/golang-lru v0.5.4 // indirect
github.com/hashicorp/hcl v1.0.0 // indirect
github.com/hashicorp/logutils v1.0.0 // indirect
@@ -109,9 +119,11 @@ require (
github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect
github.com/modern-go/reflect2 v1.0.2 // indirect
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect
github.com/oklog/ulid v1.3.1 // indirect
github.com/opencontainers/go-digest v1.0.0 // indirect
github.com/opencontainers/image-spec v1.1.0-rc3 // indirect
github.com/opencontainers/selinux v1.11.0 // indirect
github.com/opentracing/opentracing-go v1.2.0 // indirect
github.com/pelletier/go-toml/v2 v2.2.3 // indirect
github.com/pkg/errors v0.9.1 // indirect
github.com/prometheus/procfs v0.15.1 // indirect
@@ -131,6 +143,10 @@ require (
github.com/vishvananda/netns v0.0.5 // indirect
github.com/x448/float16 v0.8.4 // indirect
go.etcd.io/etcd/client/pkg/v3 v3.5.17 // indirect
go.mongodb.org/mongo-driver v1.13.1 // indirect
go.opentelemetry.io/otel v1.24.0 // indirect
go.opentelemetry.io/otel/metric v1.24.0 // indirect
go.opentelemetry.io/otel/trace v1.24.0 // indirect
go.uber.org/multierr v1.11.0 // indirect
go.uber.org/zap v1.27.0 // indirect
golang.org/x/exp v0.0.0-20240719175910-8a7402abbf56 // indirect