
feat: perform health checks via changes and tasks #409

Merged: 34 commits into canonical:master from health-checks-via-changes, Apr 24, 2024

Conversation

@benhoyt (Contributor) commented Apr 11, 2024

Per spec JU073, this reimplements the health check manager to use Changes and Tasks to drive the check operations. There are two change kinds, each of which has a single task of the same kind (a toy sketch of this lifecycle follows the list):

  • perform-check: used for driving the check while it's "up". The change (and task) finish when the number of failures hits the threshold, at which point it goes into Error status. Each check failure records a task log.
  • recover-check: used for driving the check while it's "down". The change (and task) finish when the check starts succeeding again, at which point it goes into Done status. Again, each check failure records a task log.
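
To make the two-kind lifecycle concrete, here is a self-contained toy sketch of the behaviour described above (plain functions and hypothetical names only; the real implementation lives in internals/overlord/checkstate/handlers.go and runs inside Pebble's state engine, honouring the check's period and the task's tomb):

package main

import (
	"errors"
	"fmt"
)

// performCheck mimics the perform-check task: run the check until the
// failure count reaches the threshold, logging each failure (the real
// task writes task logs), then return an error so the change goes into
// Error status. The real task also sleeps for the check's period
// between runs; that is omitted here.
func performCheck(run func() error, threshold int) error {
	failures := 0
	for {
		if err := run(); err != nil {
			failures++
			fmt.Printf("perform-check: failure %d/%d: %v\n", failures, threshold, err)
			if failures >= threshold {
				return fmt.Errorf("check failed %d times", failures)
			}
			continue
		}
		failures = 0 // a success resets the count while the check is "up"
	}
}

// recoverCheck mimics the recover-check task: run the check until it
// succeeds again, then finish cleanly so the change goes into Done status.
func recoverCheck(run func() error) {
	for run() != nil {
	}
	fmt.Println("recover-check: check succeeding again")
}

func main() {
	attempts := 0
	flaky := func() error { // fails four times, then recovers
		attempts++
		if attempts <= 4 {
			return errors.New("connection refused")
		}
		return nil
	}
	if err := performCheck(flaky, 3); err != nil {
		recoverCheck(flaky)
		// ...at which point a fresh perform-check change would be spawned.
	}
}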

We also add a new change-id field to the /v1/checks responses, allowing a client and the CLI to look up the change for each check (primarily to get the task logs for failing checks). The pebble checks command displays this as follows:

Check  Level  Status  Failures  Change
chk1   -      up      0/3       1
chk2   -      down    1/1       2 (cannot perform check: blah blah error)
chk3   alive  down    42/3      3 (this is a long truncated error messag... run "pebble tasks 3" for more)
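
For client authors, the new field might be consumed roughly like this. This is a minimal sketch: the struct below is an abbreviated, hypothetical subset of the real /v1/checks result entry, and the payload is hand-written rather than captured from a real server; only the change-id field is what this PR adds.

package main

import (
	"encoding/json"
	"fmt"
)

// checkInfo is an abbreviated sketch of one /v1/checks result entry.
type checkInfo struct {
	Name     string `json:"name"`
	Status   string `json:"status"`
	Failures int    `json:"failures"`
	ChangeID string `json:"change-id"` // new in this PR
}

func main() {
	body := []byte(`[{"name":"chk2","status":"down","failures":1,"change-id":"2"}]`)
	var checks []checkInfo
	if err := json.Unmarshal(body, &checks); err != nil {
		panic(err)
	}
	for _, c := range checks {
		// ChangeID lets a client fetch the change (and its task logs)
		// for a failing check via the existing changes API.
		fmt.Printf("%s: %s, failures %d (change %s)\n",
			c.Name, c.Status, c.Failures, c.ChangeID)
	}
}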

We (@hpidcock, @jameinel, and I) debated whether it should switch from perform-check to recover-check after the first failure or only when it hits the threshold (which is how I had the code originally). We decided in the end that it's better and more obvious to switch when it hits the threshold, as otherwise the check status is out of sync with the change kind, and you need a third state. We figured in a Juju context, for example, your alive/ready checks would normally be separate from the change-update checks you want the charm to be notified about. And if you do want them to be the same check, you can easily add a similar check with a different threshold.

Here's an example of that flow (taken from the spec):

initial state:
  change 1 - Doing
    task 1: perform-check - Doing

on first error:
  change 1 - Error
    task 1: perform-check - Error (contains first failure log(s) - up to 10)
  change 2 - Doing
    task 2: recover-check - Doing

on second (or subsequent) errors:
  change 1 - Error
    task 1: perform-check - Error
  change 2 - Doing
    task 2: recover-check - Doing (contains last failure log(s) - up to 10)

now on success:
  change 1 - Error
    task 1: perform-check - Error
  change 2 - Done
    task 2: recover-check - Done (keeps last failure log(s) - up to 10)
  change 3 - Doing
    task 3: perform-check - Doing

As far as the checks system is concerned, this is an implementation detail. However, the use of changes and tasks means that the status of a check operation and its failures can be introspected via the existing changes API.

Fixes #103.

@flotter (Contributor) commented Apr 12, 2024

@benhoyt still busy reviewing ... will continue asap.

@flotter (Contributor) left a comment

Looks like a feasible approach to me, Ben. I've added some questions and a couple of crazy ideas as counter-spins on the solution, but that's mostly just dumping thoughts that went through my mind. I am also taking some code from you after this, thanks!

@@ -252,3 +253,121 @@ func (s *CheckersSuite) TestExec(c *C) {
	c.Assert(ok, Equals, true)
	c.Assert(detailsErr.Details(), Equals, currentUser.Username)
}

func (s *CheckersSuite) TestNewChecker(c *C) {
@benhoyt (Contributor, Author) replied:

This was simply moved from manager_test.go.

Resolved review threads: internals/overlord/checkstate/handlers.go, internals/overlord/checkstate/manager.go.
Comment on lines 278 to 282
case performCheckKind, recoverCheckKind:
	if change.IsReady() {
		// Skip check changes that have finished already.
		continue
	}
@flotter (Contributor) commented Apr 18, 2024:

If I read this correctly, we are saying that the running changes represent the active checks, and each running change holds the latest truth about its check's state. I actually like this - it was easier for me to follow the flow of data, and ownership in general. Just a quick question: is there an opportunity for a query (API call) to happen exactly at a point in time between two changes, where one change has become ready (Error or Done) and the next change has not yet been committed? I think it boils down to where in the state engine changeStatusChanged is called from. If the change transitions to Error/Done while the state engine has its lock, and the change callback is called while still holding the lock, it feels like everything works perfectly to me.

@benhoyt (Contributor, Author) replied:

Yeah, I liked having all the data in the change/task too. I traced that through the state engine and yes, changeStatusChanged is called with the state locked throughout all this: TaskRunner.run calls SetStatus, which updates the status from Doing to Done (or Error), and SetStatus calls the status-changed handler.
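
A minimal sketch of that locking property, with entirely hypothetical names (this is not Pebble's actual state API): the status-changed callback fires while the state lock is still held, so a reader that also takes the lock can never observe the gap between one check change finishing and the next being spawned.

package main

import (
	"fmt"
	"sync"
)

type Status int

const (
	Doing Status = iota
	Done
	Error
)

// State stands in for the state engine: one lock, one callback.
type State struct {
	mu            sync.Mutex
	statusChanged func(old, now Status)
}

// Task stands in for a state task.
type Task struct {
	state  *State
	status Status
}

// SetStatus updates the status and invokes the callback under the same
// lock, mirroring the behaviour described above.
func (t *Task) SetStatus(s Status) {
	t.state.mu.Lock()
	defer t.state.mu.Unlock()
	old := t.status
	t.status = s
	if old != s && t.state.statusChanged != nil {
		t.state.statusChanged(old, s)
	}
}

func main() {
	st := &State{}
	st.statusChanged = func(old, now Status) {
		// Spawning the follow-up change here is atomic with respect to
		// any reader that takes st.mu.
		fmt.Printf("status changed: %v -> %v\n", old, now)
	}
	t := &Task{state: st}
	t.SetStatus(Error)
}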

However, I had to bypass this "all the data in the change/task" due to the GET /v1/health endpoint requiring that the state lock isn't held. Saving modified state is relatively slow (due to the JSON serialisation and fsync), and in certain deployments when the machine was under load it would take too long to respond (see details). We have a task on next cycle's roadmap to improve the speed/operation of state saving, but that's longer term.

In the meantime, I had to store a side map, CheckManager.health, with a mutex around it for the Health endpoint. I'm beginning to think this is a mistake: I should remove the new parallel Health method and just have a CheckManager.checks map again, but as a map[string]CheckInfo instead of that funky checkData struct. All the operational logic and plan updates would still be done via changes. So I think I'll make that change next; I think it'll be simpler.

@benhoyt (Contributor, Author) replied:

I've now reverted the new Health() method and am using CheckManager.checks map[string]CheckInfo as described above.
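
A sketch of what that reverted design might look like (hypothetical names and fields, not the PR's exact code): a value map guarded by its own mutex, so checks/health reads never need the state lock, while all operational logic still goes through changes.

package main

import (
	"fmt"
	"sync"
)

// CheckInfo is a hypothetical, abbreviated version of the real type.
type CheckInfo struct {
	Name     string
	Status   string
	Failures int
}

type CheckManager struct {
	mu     sync.Mutex
	checks map[string]CheckInfo
}

// updateCheck would be called from the change/task handlers.
func (m *CheckManager) updateCheck(info CheckInfo) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.checks[info.Name] = info
}

// Checks returns a snapshot of values (not pointers), so there is no
// aliasing and the mutex is never held while a response is serialised.
func (m *CheckManager) Checks() []CheckInfo {
	m.mu.Lock()
	defer m.mu.Unlock()
	infos := make([]CheckInfo, 0, len(m.checks))
	for _, info := range m.checks {
		infos = append(infos, info)
	}
	return infos
}

func main() {
	m := &CheckManager{checks: make(map[string]CheckInfo)}
	m.updateCheck(CheckInfo{Name: "chk1", Status: "up"})
	fmt.Println(m.Checks())
}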

@flotter (Contributor) left a comment

Hi Ben. Your PR looks good to me; compared with the previous design, I feel it is simpler. I left a couple of minor comments. My only question is whether it would be feasible to not expose the tasks to plan changes at all, and instead dispatch each new change with a snapshot of what it needs from the plan, since you already have a top-level change lifecycle manager associated with PlanChanged.

@benhoyt (Contributor, Author) replied:

The Health API was kind of a clone of that, and was not tested. It also had a nasty bug (the Go for loop issue):

	for _, h := range m.health {
		infos = append(infos, &h)
	}

This is definitely simpler.
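
For readers who haven't hit the "Go for loop issue" mentioned above: under Go versions before 1.22, the range variable is a single variable reused on every iteration, so taking its address makes every appended pointer alias the same element. A small demonstration (hypothetical CheckInfo, not the real type):

package main

import "fmt"

type CheckInfo struct{ Name string }

func main() {
	health := map[string]CheckInfo{"chk1": {"chk1"}, "chk2": {"chk2"}}

	var buggy []*CheckInfo
	for _, h := range health {
		buggy = append(buggy, &h) // pre-Go 1.22: &h aliases one variable
	}

	var fixed []*CheckInfo
	for _, h := range health {
		h := h // copy into a fresh per-iteration variable
		fixed = append(fixed, &h)
	}

	// Before Go 1.22 both buggy entries print the same name; the fixed
	// slice always has distinct elements. (Go 1.22 changed loop variable
	// scoping, which fixes the buggy form too.)
	fmt.Println(*buggy[0], *buggy[1])
	fmt.Println(*fixed[0], *fixed[1])
}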
Resolved review threads (some outdated): internals/cli/cmd_checks.go, internals/overlord/checkstate/handlers.go (several), internals/overlord/checkstate/manager.go.
@@ -202,3 +204,7 @@ type detailsError struct {
func (e *detailsError) Details() string {
	return e.details
}

func (e *detailsError) Unwrap() error {
@benhoyt (Contributor, Author) commented:

This isn't actually needed, but it was useful during debugging and I realised this error-wrapper type should probably have an Unwrap method, so I kept it.

@hpidcock (Member) left a comment

Awesome thank-you. I assume @flotter might want one final pass too.

Resolved review threads: internals/overlord/checkstate/manager_test.go, internals/overlord/checkstate/handlers.go.
benhoyt added a commit to benhoyt/operator that referenced this pull request Apr 24, 2024
This is new with health checks being implemented as changes and tasks
(canonical/pebble#409). We shouldn't merge this
before that Pebble PR is merged, just in case anything changes.
benhoyt merged commit 4e93e7b into canonical:master on Apr 24, 2024 (15 checks passed)
benhoyt deleted the health-checks-via-changes branch on Apr 24, 2024 at 03:38
benhoyt added a commit to canonical/operator that referenced this pull request Apr 24, 2024
This is new with health checks being implemented as changes and tasks
(canonical/pebble#409).

Ops/charms will likely not use this new field. Even once we have the new
`change-updated` event, that has a `.get_change()` method to get the
change directly. But we're adding it to the Pebble client/types for
completeness.
jujubot pushed a commit to juju/juju that referenced this pull request Apr 26, 2024
This includes the recent work to implement health checks using Changes
and Tasks:
- Pebble PR: canonical/pebble#409
- Spec: JU073
jujubot added a commit to juju/juju that referenced this pull request Apr 26, 2024
#17288

This includes the recent work to implement health checks using Changes and Tasks (for the 3.6 branch):

- Pebble PR: canonical/pebble#409
- Spec: [JU073](https://docs.google.com/document/d/1VbdRtcoU0igd64YBLW5jwDQXmA6B_0kepVtzvdxw7Cw/edit)
Linked issue closed by this PR: Add a way to detect and introspect check failures (#103)