Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(planstate): make plan propagation safer #474

Closed
wants to merge 2 commits into from

Conversation

flotter
Copy link
Contributor

@flotter flotter commented Aug 5, 2024

The original Pebble plan was part of the service manager, only concerned with services. The plan was lazy loaded, so the first service manager public method called (internally or over the HTTP API) would result in plan layers getting loaded, combined and propagated. This scheme was later extended to support additional managers, which also added the PlanChanged subscription mechanism. Subscribed managers would receive updates of the plan on the initial load and subsequent runtime changes, due to layers added using the HTTP API.

The plan manager decoupled the Pebble Plan from the service manager. The first version of the plan manager simply loaded the plan from disk as early as possible, before the HTTP API endpoints were activated. However, since the load also triggered a PlanChanged update to all subscribed managers, these managers received an update of the plan before the State Engine was running. This resulted in extra code required in checkstate to defer calling Ensure() inside the PlanChanged callback.

The following changes are proposed:

  1. Adapt the planstate load and plan changed notifications to only use standard State Engine events StartUp() and Ensure() during startup. Note that because we now guarantee a call to Ensure before the HTTP API is enabled, it is safe to do the initial load and propagate lock-less (care has to be taken so external callers do to call us before the state engine is running).

  2. Make a change to the overlord to only enable the HTTP API endpoints once both StartUp and at least one Ensure() pass was completed. Note that this does not impact startup performance of managers, but would ensure external HTTP API calls, and self-calls (i.e. autostart) would only happen after StartUp and at least one Ensure() completed, giving managers a guarantee on startup behaviour.

Performing a simple startup duration test from the application entrypoint until HTTP API is enabled resulted in no measurable difference.

if err != nil {
return err
}
if err := d.SetServiceArgs(mappedArgs); err != nil {
Copy link
Contributor Author

@flotter flotter Aug 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SetServiceArgs requires the service manager plan so we have to move it down slightly. Care must be taken to not make direct Plan Manager calls before the state engine is running, as the lock-less load happens during this time. We could add the lock for the initial load / propagate if really needed.

@@ -379,15 +371,34 @@ func (o *Overlord) Loop() {
}
}
})
o.ensureWaitRun()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also discussing this proposal with @pedronis in a separate thread, to make sure this is compatible with snapd.

@@ -89,7 +88,7 @@ type Overlord struct {
ensureLock sync.Mutex
ensureTimer *time.Timer
ensureNext time.Time
ensureRun int32
ensureRun chan struct{}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am extending an existing feature here to also allow for waiting, not only checking if ensure ran already.

@flotter
Copy link
Contributor Author

flotter commented Aug 5, 2024

@benhoyt and @hpidcock please could you give me some feedback on this proposal when you next have some time, thank you!

@@ -38,8 +37,7 @@ const (

// CheckManager starts and manages the health checks.
type CheckManager struct {
state *state.State
ensureDone atomic.Bool
Copy link
Contributor Author

@flotter flotter Aug 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic, which would also have been needed in derived project managers, can now safely be removed. The race is prevented in the overlord.

Copy link
Contributor

@benhoyt benhoyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Just a couple of style comments.

internals/overlord/overlord.go Outdated Show resolved Hide resolved
internals/overlord/overlord.go Outdated Show resolved Hide resolved
@flotter flotter requested a review from anpep August 6, 2024 06:33
@flotter
Copy link
Contributor Author

flotter commented Aug 6, 2024

Following super helpful conversations with @pedronis and @niemeyer I realised I am solving this the wrong way. Closing this PR - will create a new PR shortly.

@flotter flotter closed this Aug 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants