feat: perform health checks via changes and tasks #409
Conversation
- Start and stop overlord in checks tests as it's needed for EnsureBefore
- Wait for checks to settle in TestChecksGet
- Make sure Ensure is called either by PlanChanged or by the first Overlord.Loop ensure
Also increase abortWait from 7d to 14d. This is because check changes may be long-lived.
@benhoyt still busy reviewing ... will continue asap.
Looks like a feasible approach to me, Ben. I've added some questions and a couple of crazy ideas as counter spins on the solution, but that's mostly just dumping thoughts that went through my mind. I am also taking some code from you after this, thanks!
This reverts commit 6e35c31.
```diff
@@ -252,3 +253,121 @@ func (s *CheckersSuite) TestExec(c *C) {
 	c.Assert(ok, Equals, true)
 	c.Assert(detailsErr.Details(), Equals, currentUser.Username)
 }
+
+func (s *CheckersSuite) TestNewChecker(c *C) {
```
This was simply moved from manager_test.go.
```go
case performCheckKind, recoverCheckKind:
	if change.IsReady() {
		// Skip check changes that have finished already.
		continue
	}
```
If I read this correctly, we are saying that running changes represent the active checks, and each running check holds the latest truth about its check state. I actually like this - it was easier for me to follow the flow of data, and ownership in general. Just a quick question now ... Is there an opportunity for a query (API) to happen exactly at a point in time between two changes (where the one change became ready - `Error` or `Done`), and the next change has not yet been committed? I think it boils down to where in the state engine `changeStatusChanged` is called from. If, while the state engine has its lock, the change transitions to `Error`/`Done` and, while still holding the lock, calls the change callback, it feels like everything works perfectly for me.
Yeah, I liked having all the data in the change/task too. I traced that through the state engine and yes, `changeStatusChanged` is called with the state locked throughout all this. `TaskRunner.run` calls `SetStatus`, which updates the status from Doing to Done (or Error), and `SetStatus` calls the status-changed handler.
However, I had to bypass this "all the data in the change/task" approach due to the `GET /v1/health` endpoint requiring that the state lock isn't held. Saving modified state is relatively slow (due to the JSON serialisation and fsync), and in certain deployments when the machine was under load it would take too long to respond (see details). We have a task on next cycle's roadmap to improve the speed/operation of state saving, but that's longer term.

In the meantime, I had to store a side map `CheckManager.health` with a mutex around it for the Health endpoint. I'm beginning to think this is a mistake, and I should remove the new parallel `Health` method and just have a `CheckManager.checks` map again, but it would be a `map[string]CheckInfo` instead of that funky `checkData` struct. All the operational logic and plan updates would still be done via changes. So I think I'll make that change next, and I think it'll be simpler.
I've now reverted the new `Health()` method and am using `CheckManager.checks map[string]CheckInfo` as described above.
Hi Ben. Your PR looks good to me. Compared with the previous design, I feel it is simpler. I left a couple of minor comments. My only question is whether it could be feasible to not expose the tasks to plan changes at all, and only dispatch a new change with a snapshot of what it needs from the plan, since you already have a top level change lifecycle manager associated with PlanChanged.
The Health API was kind of a clone of that, and was not tested. It also had a nasty bug (the Go for-loop variable issue):

```go
for _, h := range m.health {
	infos = append(infos, &h)
}
```

This is definitely simpler.
```diff
@@ -202,3 +204,7 @@ type detailsError struct {
 func (e *detailsError) Details() string {
 	return e.details
 }
+
+func (e *detailsError) Unwrap() error {
```
This isn't actually needed, but it was useful during debugging, and I realised this error-wrapper type should probably have an Unwrap method, so I kept it.
Awesome, thank you. I assume @flotter might want one final pass too.
This is new with health checks being implemented as changes and tasks (canonical/pebble#409). We shouldn't merge this before that Pebble PR is merged, just in case anything changes.
This is new with health checks being implemented as changes and tasks (canonical/pebble#409). Ops/charms will likely not use this new field. Even once we have the new `change-updated` event, that has a `.get_change()` method to get the change directly. But we're adding it to the Pebble client/types for completeness.
This includes the recent work to implement health checks using Changes and Tasks: - Pebble PR: canonical/pebble#409 - Spec: JU073
#17288 This includes the recent work to implement health checks using Changes and Tasks (for the 3.6 branch): - Pebble PR: canonical/pebble#409 - Spec: [JU073](https://docs.google.com/document/d/1VbdRtcoU0igd64YBLW5jwDQXmA6B_0kepVtzvdxw7Cw/edit)
Per spec JU073, this reimplements the health check manager to use Changes and Tasks to drive the check operations. There are two change kinds, each of which has a single task of the same kind:

- `perform-check`: used for driving the check while it's "up". The change (and task) finish when the number of failures hits the threshold, at which point it goes into Error status. Each check failure records a task log.
- `recover-check`: used for driving the check while it's "down". The change (and task) finish when the check starts succeeding again, at which point it goes into Done status. Again, each check failure records a task log.

We also add a new `change-id` field to the /v1/checks responses, allowing a client and the CLI to look up the change for each check (primarily to get the task logs for failing checks). The `pebble checks` command displays this as follows:

We (@hpidcock, @jameinel, and I) debated whether it should switch from perform-check to recover-check after the first failure or only when it hits the threshold (which is how I had the code originally). We decided in the end that it's better and more obvious to switch when it hits the threshold, as otherwise the check status is out of sync with the change kind, and you need a third state. We figured in a Juju context, for example, your alive/ready checks would normally be separate from the change-update checks you want the charm to be notified about. And if you do want them to be the same check, you can easily add a similar check with a different threshold.
Here's an example of that flow (taken from the spec):
As far as the checks system is concerned, this is an implementation detail. However, the use of changes and tasks means that the status of a check operation and check failures can be introspected (via the existing changes API).
Fixes #103.