Restart LXD on MicroCloud start to refresh symlinks #136
Please can you update the PR description to explain what prompted this change and what it fixes? Thanks |
The main issue is that when MicroCloud directs LXD to set up an OVN network with MicroOVN, LXD uses some symlinks in its snap environment to execute the commands. These symlinks are generated by the LXD snap's daemon start hook, so if LXD happens to start before MicroOVN is installed, LXD won't have the right symlinks, and setting up the network will fail (resulting in the error in the main post of #107). To fix this, LXD needs to be restarted when MicroCloud runs, so it has the right symlinks. The problems here are:
The other solution (as we briefly discussed on IRC) would be to use the proper paths dynamically in LXD. This would solve the problems above, but would split up the central location where we track the paths of services that LXD works with, and integrate it into the code. |
I just want to double check I'm understanding this correctly:
Also where are these hooks defined, and what happens if microOVN is installed after the daemon is started? |
Also a side note - can we add an internal endpoint to LXD to ask it to refresh its symlinks without requiring a restart? |
The LXD snap creates the symlinks only when the LXD snap starts. If microovn is installed after LXD, these symlinks will only be set the next time the LXD snap starts.
This would solve the problem for the init case, but it would cause a consistent delay when a user initially runs
So when joining, we perform some actions both on the joining system that sends the join request and the clustered system that eventually handles it (dqlite leader, I believe). I believe currently the symlinks we care about are just for setting up local services for ovn, but technically both the cluster and the joiner might potentially want to use some of these paths.
Where it pertains to microovn, these symlinks are mainly for some certs and state information passed to ovn commands. They're set up here and here. If microovn is installed after LXD is started, absolutely nothing happens. LXD would not have set up the symlinks in a way that would work with microovn.
That endpoint would have to have some way of executing either the |
I don't think it's too much of a concern to have a delay during microcloud init. The init command is interactive and takes a while anyway. We can indicate to the user what we are doing.
I'm still confused about this, why would the existing member use symlinks on another machine?
Presumably the If this is true then the symlinks created in the daemon.start hook could be created on the fly by LXD itself. We could add a Edit: Also since it would always be called by microcloud, we can remove it from LXD's daemon.start hook. |
Not from another machine. It's a hypothetical edge case: A node with a symlink to something that's utilized once the node is already clustered, and is handling a join request. This node would've joined the cluster with incorrectly set up symlinks (because LXD started before the relevant service), and never happened to restart. It wouldn't have errored out when it joined because the hypothetical command is run on existing cluster members, so the invalid state on the joiner is irrelevant. Once the node is clustered, its state would be silently invalid until this node becomes dqlite leader and is forwarded a request to add a new cluster member.
Right, personally I think this is the most foolproof solution, but involves migrating all of this state logic over to |
I think this gets to the heart of the issue. Should LXD (beyond its own packaging) have knowledge of and interact with (even just passively) MicroCloud, or should we treat it as just another external client? I would strongly caution against using I think we should discuss this in our next microcloud meeting so we can all get on the same page, and consider both microovn and microceph at the same time to see if we can come up with a general solution. |
Those snap-to-snap interactions are probably best discussed with the snap people (https://chat.canonical.com/canonical/channels/starcraft). It sounds to me like we would need some more interfaces automatically connected to let them interact. |
Let's chat about it in our next meeting, as maybe we can avoid needing to do it in the snap at all. |
Signed-off-by: Max Asnaashari <[email protected]>
As discussed in our meeting, I've updated this PR to restart LXD in more places:
This should suffice and avoid the nasty MicroOVN error until we can discuss a more permanent and resilient solution that doesn't rely on restarting LXD completely. |
Sets LXD to restart at three points:
1) When MicroCloud first starts (to avoid a potential lag on all peers when fetching resources as LXD sets up for the first time)
2) When the user invokes `microcloud init`, to ensure LXD is set up properly with the correct symlinks for microcloud
3) When a MicroCloud daemon receives a join request for LXD, to ensure the above for all joining nodes.

Signed-off-by: Max Asnaashari <[email protected]>
Signed-off-by: Max Asnaashari <[email protected]>
Adds a hook that restarts LXD on uninitialized MicroCloud systems when MicroCloud starts.
I'm not sure if immediately calling `GetServer` like I'm doing here will cause issues on slower machines; it seems to work fine on my machine. This still has the caveat of requiring a `snap restart microcloud` to force LXD to restart. I'm open to also running this restart on an execution of `microcloud init`, although I'm concerned that it could have some odd effects on slower machines again.