
feat: perform health checks via changes and tasks #409

Merged: 34 commits into canonical:master from health-checks-via-changes, Apr 24, 2024

Conversation

@benhoyt (Contributor) commented Apr 11, 2024

Per spec JU073, this reimplements the health check manager to use Changes and Tasks to drive the check operations. There are two change kinds, each of which has a single task of the same kind (a toy sketch of this lifecycle follows the list):

  • perform-check: used for driving the check while it's "up". The change (and task) finish when the number of failures hits the threshold, at which point it goes into Error status. Each check failure records a task log.
  • recover-check: used for driving the check while it's "down". The change (and task) finish when the check starts succeeding again, at which point it goes into Done status. Again, each check failure records a task log.
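
To make the two-kind lifecycle concrete, here is a self-contained toy sketch of the behaviour described above (plain functions and hypothetical names only; the real implementation lives in internals/overlord/checkstate/handlers.go and runs inside Pebble's state engine, honouring the check's period and the task's tomb):

package main

import (
	"errors"
	"fmt"
)

// performCheck mimics the perform-check task: run the check until the
// failure count reaches the threshold, logging each failure (the real
// task writes task logs), then return an error so the change goes into
// Error status. The real task also sleeps for the check's period
// between runs; that is omitted here.
func performCheck(run func() error, threshold int) error {
	failures := 0
	for {
		if err := run(); err != nil {
			failures++
			fmt.Printf("perform-check: failure %d/%d: %v\n", failures, threshold, err)
			if failures >= threshold {
				return fmt.Errorf("check failed %d times", failures)
			}
			continue
		}
		failures = 0 // a success resets the count while the check is "up"
	}
}

// recoverCheck mimics the recover-check task: run the check until it
// succeeds again, then finish cleanly so the change goes into Done status.
func recoverCheck(run func() error) {
	for run() != nil {
	}
	fmt.Println("recover-check: check succeeding again")
}

func main() {
	attempts := 0
	flaky := func() error { // fails four times, then recovers
		attempts++
		if attempts <= 4 {
			return errors.New("connection refused")
		}
		return nil
	}
	if err := performCheck(flaky, 3); err != nil {
		recoverCheck(flaky)
		// ...at which point a fresh perform-check change would be spawned.
	}
}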

We also add a new change-id field to the /v1/checks responses, allowing a client and the CLI to look up the change for each check (primarily to get the task logs for failing checks). The pebble checks command displays this as follows:

Check  Level  Status  Failures  Change
chk1   -      up      0/3       1
chk2   -      down    1/1       2 (cannot perform check: blah blah error)
chk3   alive  down    42/3      3 (this is a long truncated error messag... run "pebble tasks 3" for more)
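
For client authors, the new field might be consumed roughly like this. This is a minimal sketch: the struct below is an abbreviated, hypothetical subset of the real /v1/checks result entry, and the payload is hand-written rather than captured from a real server; only the change-id field is what this PR adds.

package main

import (
	"encoding/json"
	"fmt"
)

// checkInfo is an abbreviated sketch of one /v1/checks result entry.
type checkInfo struct {
	Name     string `json:"name"`
	Status   string `json:"status"`
	Failures int    `json:"failures"`
	ChangeID string `json:"change-id"` // new in this PR
}

func main() {
	body := []byte(`[{"name":"chk2","status":"down","failures":1,"change-id":"2"}]`)
	var checks []checkInfo
	if err := json.Unmarshal(body, &checks); err != nil {
		panic(err)
	}
	for _, c := range checks {
		// ChangeID lets a client fetch the change (and its task logs)
		// for a failing check via the existing changes API.
		fmt.Printf("%s: %s, failures %d (change %s)\n",
			c.Name, c.Status, c.Failures, c.ChangeID)
	}
}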

We (@hpidcock, @jameinel, and I) debated whether it should switch from perform-check to recover-check after the first failure or only when it hits the threshold (which is how I had the code originally). We decided in the end that it's better and more obvious to switch when it hits the threshold, as otherwise the check status is out of sync with the change kind, and you need a third state. We figured in a Juju context, for example, your alive/ready checks would normally be separate from the change-update checks you want the charm to be notified about. And if you do want them to be the same check, you can easily add a similar check with a different threshold.

Here's an example of that flow (taken from the spec):

initial state:
  change 1 - Doing
    task 1: perform-check - Doing

on first error:
  change 1 - Error
    task 1: perform-check - Error (contains first failure log(s) - up to 10)
  change 2 - Doing
    task 2: recover-check - Doing

on second (or subsequent) errors:
  change 1 - Error
    task 1: perform-check - Error
  change 2 - Doing
    task 2: recover-check - Doing (contains last failure log(s) - up to 10)

now on success:
  change 1 - Error
    task 1: perform-check - Error
  change 2 - Done
    task 2: recover-check - Done (keeps last failure log(s) - up to 10)
  change 3 - Doing
    task 3: perform-check - Doing

As far as the checks system is concerned, this is an implementation detail. However, the use of changes and tasks means that the status of a check operation and its failures can be introspected via the existing changes API.

Fixes #103.

@flotter (Contributor) commented Apr 12, 2024

@benhoyt still busy reviewing ... will continue asap.

@flotter (Contributor) left a comment

Looks like a feasible approach to me, Ben. I've added some questions and a couple of crazy ideas as counter-spins on the solution, but that's mostly just dumping thoughts that went through my mind. I am also taking some code from you after this, thanks!

@@ -252,3 +253,121 @@ func (s *CheckersSuite) TestExec(c *C) {
	c.Assert(ok, Equals, true)
	c.Assert(detailsErr.Details(), Equals, currentUser.Username)
}

func (s *CheckersSuite) TestNewChecker(c *C) {
@benhoyt (Contributor, Author) replied:

This was simply moved from manager_test.go.

Resolved review threads: internals/overlord/checkstate/handlers.go, internals/overlord/checkstate/manager.go.
Comment on lines 278 to 282
case performCheckKind, recoverCheckKind:
	if change.IsReady() {
		// Skip check changes that have finished already.
		continue
	}
@flotter (Contributor) commented Apr 18, 2024:

If I read this correctly, we are saying that the running changes represent the active checks, and each running change holds the latest truth about its check's state. I actually like this - it was easier for me to follow the flow of data, and ownership in general. Just a quick question: is there an opportunity for a query (API call) to happen exactly at a point in time between two changes, where one change has become ready (Error or Done) and the next change has not yet been committed? I think it boils down to where in the state engine changeStatusChanged is called from. If the change transitions to Error/Done while the state engine has its lock, and the change callback is called while still holding the lock, it feels like everything works perfectly to me.

@benhoyt (Contributor, Author) replied:

Yeah, I liked having all the data in the change/task too. I traced that through the state engine and yes, changeStatusChanged is called with the state locked throughout all this: TaskRunner.run calls SetStatus, which updates the status from Doing to Done (or Error), and SetStatus calls the status-changed handler.
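
A minimal sketch of that locking property, with entirely hypothetical names (this is not Pebble's actual state API): the status-changed callback fires while the state lock is still held, so a reader that also takes the lock can never observe the gap between one check change finishing and the next being spawned.

package main

import (
	"fmt"
	"sync"
)

type Status int

const (
	Doing Status = iota
	Done
	Error
)

// State stands in for the state engine: one lock, one callback.
type State struct {
	mu            sync.Mutex
	statusChanged func(old, now Status)
}

// Task stands in for a state task.
type Task struct {
	state  *State
	status Status
}

// SetStatus updates the status and invokes the callback under the same
// lock, mirroring the behaviour described above.
func (t *Task) SetStatus(s Status) {
	t.state.mu.Lock()
	defer t.state.mu.Unlock()
	old := t.status
	t.status = s
	if old != s && t.state.statusChanged != nil {
		t.state.statusChanged(old, s)
	}
}

func main() {
	st := &State{}
	st.statusChanged = func(old, now Status) {
		// Spawning the follow-up change here is atomic with respect to
		// any reader that takes st.mu.
		fmt.Printf("status changed: %v -> %v\n", old, now)
	}
	t := &Task{state: st}
	t.SetStatus(Error)
}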

However, I had to bypass this "all the data in the change/task" due to the GET /v1/health endpoint requiring that the state lock isn't held. Saving modified state is relatively slow (due to the JSON serialisation and fsync), and in certain deployments when the machine was under load it would take too long to respond (see details). We have a task on next cycle's roadmap to improve the speed/operation of state saving, but that's longer term.

In the meantime, I had to store a side map, CheckManager.health, with a mutex around it for the Health endpoint. I'm beginning to think this is a mistake: I should remove the new parallel Health method and just have a CheckManager.checks map again, but as a map[string]CheckInfo instead of that funky checkData struct. All the operational logic and plan updates would still be done via changes. So I think I'll make that change next; I think it'll be simpler.

@benhoyt (Contributor, Author) replied:

I've now reverted the new Health() method and am using CheckManager.checks map[string]CheckInfo as described above.
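
A sketch of what that reverted design might look like (hypothetical names and fields, not the PR's exact code): a value map guarded by its own mutex, so checks/health reads never need the state lock, while all operational logic still goes through changes.

package main

import (
	"fmt"
	"sync"
)

// CheckInfo is a hypothetical, abbreviated version of the real type.
type CheckInfo struct {
	Name     string
	Status   string
	Failures int
}

type CheckManager struct {
	mu     sync.Mutex
	checks map[string]CheckInfo
}

// updateCheck would be called from the change/task handlers.
func (m *CheckManager) updateCheck(info CheckInfo) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.checks[info.Name] = info
}

// Checks returns a snapshot of values (not pointers), so there is no
// aliasing and the mutex is never held while a response is serialised.
func (m *CheckManager) Checks() []CheckInfo {
	m.mu.Lock()
	defer m.mu.Unlock()
	infos := make([]CheckInfo, 0, len(m.checks))
	for _, info := range m.checks {
		infos = append(infos, info)
	}
	return infos
}

func main() {
	m := &CheckManager{checks: make(map[string]CheckInfo)}
	m.updateCheck(CheckInfo{Name: "chk1", Status: "up"})
	fmt.Println(m.Checks())
}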

@flotter (Contributor) left a comment

Hi Ben. Your PR looks good to me; compared with the previous design, I feel it is simpler. I left a couple of minor comments. My only question is whether it would be feasible to not expose the tasks to plan changes at all, and instead dispatch each new change with a snapshot of what it needs from the plan, since you already have a top-level change lifecycle manager associated with PlanChanged.

@benhoyt (Contributor, Author) replied:

The Health API was kind of a clone of that, and was not tested. It also had a nasty bug (the Go for loop issue):

	for _, h := range m.health {
		infos = append(infos, &h)
	}

This is definitely simpler.
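
For readers who haven't hit the "Go for loop issue" mentioned above: under Go versions before 1.22, the range variable is a single variable reused on every iteration, so taking its address makes every appended pointer alias the same element. A small demonstration (hypothetical CheckInfo, not the real type):

package main

import "fmt"

type CheckInfo struct{ Name string }

func main() {
	health := map[string]CheckInfo{"chk1": {"chk1"}, "chk2": {"chk2"}}

	var buggy []*CheckInfo
	for _, h := range health {
		buggy = append(buggy, &h) // pre-Go 1.22: &h aliases one variable
	}

	var fixed []*CheckInfo
	for _, h := range health {
		h := h // copy into a fresh per-iteration variable
		fixed = append(fixed, &h)
	}

	// Before Go 1.22 both buggy entries print the same name; the fixed
	// slice always has distinct elements. (Go 1.22 changed loop variable
	// scoping, which fixes the buggy form too.)
	fmt.Println(*buggy[0], *buggy[1])
	fmt.Println(*fixed[0], *fixed[1])
}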
Resolved review threads (some outdated): internals/cli/cmd_checks.go, internals/overlord/checkstate/handlers.go (several), internals/overlord/checkstate/manager.go.
@@ -202,3 +204,7 @@ type detailsError struct {
func (e *detailsError) Details() string {
	return e.details
}

func (e *detailsError) Unwrap() error {
@benhoyt (Contributor, Author) commented:

This isn't actually needed, but it was useful during debugging and I realised this error-wrapper type should probably have an Unwrap method, so I kept it.

@hpidcock (Member) left a comment

Awesome thank-you. I assume @flotter might want one final pass too.

Resolved review threads: internals/overlord/checkstate/manager_test.go, internals/overlord/checkstate/handlers.go.
benhoyt added a commit to benhoyt/operator that referenced this pull request Apr 24, 2024
This is new with health checks being implemented as changes and tasks
(canonical/pebble#409). We shouldn't merge this
before that Pebble PR is merged, just in case anything changes.
benhoyt merged commit 4e93e7b into canonical:master on Apr 24, 2024 (15 checks passed)
benhoyt deleted the health-checks-via-changes branch on Apr 24, 2024 at 03:38
benhoyt added a commit to canonical/operator that referenced this pull request Apr 24, 2024
This is new with health checks being implemented as changes and tasks
(canonical/pebble#409).

Ops/charms will likely not use this new field. Even once we have the new
`change-updated` event, that has a `.get_change()` method to get the
change directly. But we're adding it to the Pebble client/types for
completeness.
jujubot pushed a commit to juju/juju that referenced this pull request Apr 26, 2024
This includes the recent work to implement health checks using Changes
and Tasks:
- Pebble PR: canonical/pebble#409
- Spec: JU073
jujubot added a commit to juju/juju that referenced this pull request Apr 26, 2024
#17288

This includes the recent work to implement health checks using Changes and Tasks (for the 3.6 branch):

- Pebble PR: canonical/pebble#409
- Spec: [JU073](https://docs.google.com/document/d/1VbdRtcoU0igd64YBLW5jwDQXmA6B_0kepVtzvdxw7Cw/edit)
Linked issue closed by this PR: Add a way to detect and introspect check failures (#103)