Restart LXD on MicroCloud start to refresh symlinks #136

masnax · 2023-07-12T10:12:18Z

Adds a hook that restarts LXD on uninitialized MicroCloud systems when MicroCloud starts.

I'm not sure if immediately calling GetServer like I'm doing here will cause issues on slower machines, it seems to work fine on my machine.

This still has the caveat of requiring a snap restart microcloud to force LXD to restart. I'm open to also putting this restart on run on an execution of microcloud init although I'm concerned that have some odd effects on slower machines again.

tomponline · 2023-07-12T10:13:30Z

Please can you update the PR description to explain what prompted this change, what does it fix? Thanks

masnax · 2023-07-12T11:38:40Z

@tomponline

The main issue is that when MicroCloud directs LXD to set up an OVN network with MicroOVN, LXD uses some symlinks in its snap environment to execute the commands. These symlinks are generated by the LXD snap's daemon start hook, so if LXD happens to start before MicroOVN is installed, LXD won't have the right symlinks, and setting up the network will fail (Resulting in the error in the main post of #107).

To fix this, LXD needs to be restarted when MicroCloud runs, so it has the right symlinks.

The problems here are:

MicroCloud actually needs LXD to restart before a call to microcloud init, restarting on daemon start might not be enough if MicroOVN is installed after MicroCloud.
LXD restarting may not solve the problem in the microcloud add case because if the local system needs to restart its LXD, then a node in the cluster has to go down. Additionally, the node being added might need to be signalled to restart its LXD as well.

The other solution (as we briefly discussed on IRC) would be to use the proper paths dynamically in LXD. This would solve the problems above, but would add split the central location where we track the paths of services that LXD works with, and integrate it into the code.

markylaing · 2023-07-13T08:15:57Z

@tomponline

The main issue is that when MicroCloud directs LXD to set up an OVN network with MicroOVN, LXD uses some symlinks in its snap environment to execute the commands. These symlinks are generated by the LXD snap's daemon start hook, so if LXD happens to start before MicroOVN is installed, LXD won't have the right symlinks, and setting up the network will fail (Resulting in the error in the main post of #107).

To fix this, LXD needs to be restarted when MicroCloud runs, so it has the right symlinks.

The problems here are:

MicroCloud actually needs LXD to restart before a call to microcloud init, restarting on daemon start might not be enough if MicroOVN is installed after MicroCloud.

LXD restarting may not solve the problem in the microcloud add case because if the local system needs to restart its LXD, then a node in the cluster has to go down. Additionally, the node being added might need to be signalled to restart its LXD as well.

The other solution (as we briefly discussed on IRC) would be to use the proper paths dynamically in LXD. This would solve the problems above, but would add split the central location where we track the paths of services that LXD works with, and integrate it into the code.

I just want to double check I'm understanding this correctly:

If microOVN is installed before LXD, LXD can detect that and will generate symlinks that are required to configure OVN? But if it is installed after LXD the symlinks are incorrect and we need to restart LXD to re-generate them?
We can't restart LXD during microcloud init. Why might this not be enough?
We can't restart LXD during microcloud add. Why do we need to restart the local LXD when adding a new member? Can we not check if we have an OVN service and if so restart LXD in the new member?

Also where are these hooks defined, and what happens if microOVN is installed after the daemon is started?

markylaing · 2023-07-13T08:19:49Z

Also a side note - can we add an internal endpoint to LXD to ask it to refresh it's symlinks without requiring a restart?

masnax · 2023-07-13T08:50:20Z

I just want to double check I'm understanding this correctly:

If microOVN is installed after LXD, the LXD snap (but presumably not LXD itself?) can detect that and will generate symlinks (to what?) that are required to configure OVN?

The LXD snap creates the symlinks only when the LXD snap starts. If microovn is installed after LXD, these symlinks will only be set the next time the LXD snap starts.

We can't restart LXD during microcloud init. Why might this not be enough?

This would solve the problem for the init case, but it would cause a consistent delay when a user initially runs microcloud init. Not sure how bad this would be on slower systems.

We can't restart LXD during microcloud add. Why do we need to restart the local LXD when adding a new member? Can we not check if we have an OVN service and if so restart LXD in the new member?

So when joining, we perform some actions both on the joining system that sends the join request and the clustered system that eventually handles it (dqlite leader, I believe). I believe currently the symlinks we care about are just for setting up local services for ovn, but technically both the cluster and the joiner might potentially want to use some of these paths.

Also where are these hooks defined, and what happens if microOVN is installed after the daemon is started?

Where it pertains to microovn, these symlinks are mainly for some certs and state information passed to ovn commands. They're set up here and here

If microovn is installed after LXD is started, absolutely nothing happens. LXD would not have set up the symlinks in a way that would work with microovn.

Also a side note - can we add an internal endpoint to LXD to ask it to refresh it's symlinks without requiring a restart?

That endpoint would have to have some way of executing either the daemon.start snap hook or some other snap action that can also be run when the daemon starts. I'm not familiar enough with how snaps work to know if this is possible.

markylaing · 2023-07-14T07:24:28Z

We can't restart LXD during microcloud init. Why might this not be enough?

This would solve the problem for the init case, but it would cause a consistent delay when a user initially runs microcloud init. Not sure how bad this would be on slower systems.

I don't think it's too much of a concern to have a delay during microcloud init. The init command is interactive and takes a while anyway. We can indicate to the user what we are doing.

We can't restart LXD during microcloud add. Why do we need to restart the local LXD when adding a new member? Can we not check if we have an OVN service and if so restart LXD in the new member?

So when joining, we perform some actions both on the joining system that sends the join request and the clustered system that eventually handles it (dqlite leader, I believe). I believe currently the symlinks we care about are just for setting up local services for ovn, but technically both the cluster and the joiner might potentially want to use some of these paths.

I'm still confused about this, why would the existing member use symlinks on another machine?

Also a side note - can we add an internal endpoint to LXD to ask it to refresh it's symlinks without requiring a restart?

That endpoint would have to have some way of executing either the daemon.start snap hook or some other snap action that can also be run when the daemon starts. I'm not familiar enough with how snaps work to know if this is possible.

Presumably the daemon.start snap hook is executed in the same environment as LXD itself, otherwise it would be a security concern. Is that the case @simondeziel @tomponline?

If this is true then the symlinks created in the daemon.start hook could be created on the fly by LXD itself. We could add a /internal/detect-integrations (or similar) endpoint to LXD and call it from microcloud on all nodes during init, and on the joining node during add. Then there would be no reason to restart LXD.

Edit: Also since it would always be called by microcloud, we can remove it from LXDs daemon.start hook.

masnax · 2023-07-14T08:19:53Z

I'm still confused about this, why would the existing member use symlinks on another machine?

Not from another machine. It's a hypothetical edge case: A node with a symlink to something that's utilized once the node is already clustered, and is handling a join request. This node would've joined the cluster with incorrectly set up symlinks (because LXD started before the relevant service), and never happened to restart. It wouldn't have errored out when it joined because the hypothetical command is run on existing cluster members, so the invalid state on the joiner is irrelevant. Once the node is clustered, its state would be silently invalid until this node becomes dqlite leader and is forwarded a request to add a new cluster member.

If this is true then the symlinks created in the daemon.start hook could be created on the fly by LXD itself. We could add a /internal/detect-integrations (or similar) endpoint to LXD and call it from microcloud on all nodes during init, and on the joining node during add. Then there would be no reason to restart LXD.

Right, personally I think this is the most foolproof solution, but involves migrating all of this state logic over to go, unless there's some way to leave it in the snap configuration and have the endpoint execute a snap action which runs exactly what's currently in daemon.start.

tomponline · 2023-07-14T08:27:09Z

I think this gets to the heart of the issue. Should LXD (beyond its own packaging) have knowledge of and interact with (even just passively) MicroCloud, or should we treat it as just another external client.

I would strongly caution against using /internal from anything except lxd itself, as otherwise it fundamentally changes what the /internal endpoints are for.

I think we should discuss this in our next microcloud meeting so we can all get on the same page, and consider both microovn and microceph at the same time to see if we can come up with a general solution.

simondeziel · 2023-07-14T09:31:33Z

Those snap to snap interactions should probably be best discussed with the snap people (https://chat.canonical.com/canonical/channels/starcraft). It sounds to me that we would need some more interfaces automatically connected to let them interact.

tomponline · 2023-07-14T09:32:52Z

Lets chat about it in our next meeting as it maybe we can avoid needing to do it in the snap at all.

Signed-off-by: Max Asnaashari <[email protected]>

masnax · 2023-07-24T16:27:02Z

As discussed in our meeting, I've updated this PR to restart LXD in more places:

When MicroCloud starts
When microcloud init is called
When a joining system receives its join token from the system that called microcloud init

This should suffice and avoid the nasty MicroOVN error until we can discuss a more permanent and resilient solution that doesn't rely on restarting LXD completely.

microcloud/service/service_handler.go

microcloud/service/lxd.go

Sets LXD to restart at three points: 1) When MicroCloud first starts (to avoid a potential lag on all peers when fetching resources as LXD sets up for the first time) 2) When the user invokes `microcloud init`, to ensure LXD is set up properly with the correct symlinks for microcloud 3) When a MicroCloud daemon receives a join request for LXD, to ensure the above for all joining nodes. Signed-off-by: Max Asnaashari <[email protected]>

Signed-off-by: Max Asnaashari <[email protected]>

masnax force-pushed the restart branch from b9aca50 to a305f26 Compare July 12, 2023 10:13

tomponline requested review from tomponline and markylaing July 12, 2023 12:59

masnax force-pushed the restart branch from a305f26 to 10f29bf Compare July 24, 2023 16:14

microcloud/service/lxd: Add Restart helper to LXD service

55c2a8f

Signed-off-by: Max Asnaashari <[email protected]>

masnax force-pushed the restart branch from 10f29bf to b58bf50 Compare July 24, 2023 16:26

tomponline reviewed Jul 24, 2023

View reviewed changes

microcloud/service/service_handler.go Outdated Show resolved Hide resolved

tomponline reviewed Jul 24, 2023

View reviewed changes

microcloud/service/lxd.go Outdated Show resolved Hide resolved

tomponline reviewed Jul 24, 2023

View reviewed changes

microcloud/service/lxd.go Outdated Show resolved Hide resolved

tomponline reviewed Jul 24, 2023

View reviewed changes

microcloud/service/lxd.go Show resolved Hide resolved

masnax added 2 commits July 24, 2023 17:34

microcloud/service/lxd: Check if LXD is initialized before restarting

80f4f03

Signed-off-by: Max Asnaashari <[email protected]>

masnax force-pushed the restart branch from b58bf50 to 80f4f03 Compare July 24, 2023 17:35

tomponline approved these changes Jul 24, 2023

View reviewed changes

tomponline merged commit 8dff5af into canonical:main Jul 24, 2023
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Restart LXD on MicroCloud start to refresh symlinks #136

Restart LXD on MicroCloud start to refresh symlinks #136

masnax commented Jul 12, 2023

tomponline commented Jul 12, 2023

masnax commented Jul 12, 2023 •

edited

Loading

markylaing commented Jul 13, 2023 •

edited

Loading

markylaing commented Jul 13, 2023

masnax commented Jul 13, 2023 •

edited

Loading

markylaing commented Jul 14, 2023 •

edited

Loading

masnax commented Jul 14, 2023

tomponline commented Jul 14, 2023

simondeziel commented Jul 14, 2023

tomponline commented Jul 14, 2023

masnax commented Jul 24, 2023

Restart LXD on MicroCloud start to refresh symlinks #136

Restart LXD on MicroCloud start to refresh symlinks #136

Conversation

masnax commented Jul 12, 2023

tomponline commented Jul 12, 2023

masnax commented Jul 12, 2023 • edited Loading

markylaing commented Jul 13, 2023 • edited Loading

markylaing commented Jul 13, 2023

masnax commented Jul 13, 2023 • edited Loading

markylaing commented Jul 14, 2023 • edited Loading

masnax commented Jul 14, 2023

tomponline commented Jul 14, 2023

simondeziel commented Jul 14, 2023

tomponline commented Jul 14, 2023

masnax commented Jul 24, 2023

masnax commented Jul 12, 2023 •

edited

Loading

markylaing commented Jul 13, 2023 •

edited

Loading

masnax commented Jul 13, 2023 •

edited

Loading

markylaing commented Jul 14, 2023 •

edited

Loading