Reuse existing MicroCeph and MicroOVN clusters #259

masnax · 2024-01-26T04:05:17Z

Closes #145

This PR allows initializing a new MicroCloud with some nodes that have already set up MicroCeph or MicroOVN in the past. This can be useful if, for example, you already have a MicroCeph cluster that you want to "upgrade" into a MicroCloud.

Just after confirming the list of systems to use for the MicroCloud, we will try to grab a list of cluster members from each system. The underlying microcluster package will either return the list or report that the database is not initialized. We will record the first list that we obtain for each of MicroOVN and MicroCeph, and if every other system either reports no list or reports an identical one, we can try to reuse those clusters.
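
The comparison described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `findExistingCluster`, `sameMembers`, and the plain `map[string][]string` input are hypothetical names and shapes, and an empty member list stands in for "database not initialized".

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// ErrConflictingClusters is returned when two systems report different
// member lists for the same service (hypothetical error for this sketch).
var ErrConflictingClusters = errors.New("systems belong to different clusters")

// sameMembers reports whether two member lists hold the same names,
// ignoring order.
func sameMembers(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	x := append([]string(nil), a...)
	y := append([]string(nil), b...)
	sort.Strings(x)
	sort.Strings(y)
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

// findExistingCluster takes, for a single service, the member list reported
// by each system. An empty list means the service is not initialized there.
// It returns the first system that reported a cluster and that cluster's
// members, or an empty name if no system is clustered.
func findExistingCluster(reported map[string][]string) (string, []string, error) {
	// Visit systems in sorted order so the result is deterministic.
	names := make([]string, 0, len(reported))
	for name := range reported {
		names = append(names, name)
	}
	sort.Strings(names)

	var owner string
	var first []string
	for _, name := range names {
		members := reported[name]
		if len(members) == 0 {
			continue // not initialized on this system
		}
		if owner == "" {
			owner, first = name, members // record the first list we obtain
			continue
		}
		if !sameMembers(first, members) {
			return "", nil, fmt.Errorf("%w: %s and %s", ErrConflictingClusters, owner, name)
		}
	}
	return owner, first, nil
}

func main() {
	reported := map[string][]string{
		"micro01": {"micro01", "micro02"},
		"micro02": {"micro02", "micro01"},
		"micro03": nil,
	}
	owner, members, err := findExistingCluster(reported)
	fmt.Println(owner, members, err)
}
```

Running this once per service (MicroCeph, MicroOVN) gives either a cluster to reuse, nothing to reuse, or an error for the conflicting case.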

The reuse process basically amounts to asking the system that has an existing cluster to generate join tokens and send them to the node orchestrating the initialization. Normally you have to be trusted to make such a request, so instead we can use the auth secret that we got from finding the systems over mDNS to send a request using the MicroCloud proxy, which will initiate the request from the unix socket on the clustered system.
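
As a rough illustration of the request shape (not the actual implementation), the sketch below builds an HTTP request that carries the auth secret in the `X-MicroCloud-Auth` header. The URL path, the `name` query parameter, and the `newProxiedTokenRequest` helper are all made up for this example; only the header name comes from the discussion in this PR.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// newProxiedTokenRequest builds a request asking an already clustered peer
// to issue a join token for joinerName. The path below is hypothetical and
// only stands in for whatever the MicroCloud proxy actually exposes.
func newProxiedTokenRequest(peerAddr, service, joinerName, authSecret string) (*http.Request, error) {
	u := url.URL{
		Scheme: "https",
		Host:   peerAddr,
		Path:   "/hypothetical/services/" + service + "/tokens", // assumption, not a real endpoint
	}
	q := u.Query()
	q.Set("name", joinerName)
	u.RawQuery = q.Encode()

	req, err := http.NewRequest(http.MethodPost, u.String(), nil)
	if err != nil {
		return nil, err
	}

	// The secret obtained during mDNS discovery authenticates the caller,
	// since it is not yet trusted by the peer's cluster.
	req.Header.Set("X-MicroCloud-Auth", authSecret)
	return req, nil
}

func main() {
	req, err := newProxiedTokenRequest("10.0.0.2:9443", "microceph", "micro03", "example-secret")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

The point of the design is that the receiving daemon, not the caller, talks to the local unix socket, so the token is issued from a context the existing cluster already trusts.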

The behaviour is as follows:

* If two systems belong to different clusters for the same service, we will error out as they wouldn't be able to join each other.
* If a cluster contains more nodes than what will end up in the MicroCloud, we will proceed as normal and ignore those nodes. Perhaps a user might want a larger MicroCeph cluster than their LXD cluster, for example.
* If --auto is specified, we will strictly require no clustered systems.
* The preseed file will have a new key reuse_existing_clusters which can be set to true or false. If true, we will reuse any clusters we find, and if false, we will skip that service entirely and set up as normal.
* In interactive mode, we will prompt the user for each service whether they want to reuse the existing cluster or set up MicroCloud without that service.
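
The per-service decision in the list above could be summarized with something like the following sketch. The `Mode` and `Action` types and the `decide` function are hypothetical names for illustration, and it assumes that "skip that service entirely" means the rest of MicroCloud is set up without that service.

```go
package main

import (
	"errors"
	"fmt"
)

// Mode is how `microcloud init` was invoked (hypothetical type).
type Mode int

const (
	ModeInteractive Mode = iota
	ModeAuto
	ModePreseed
)

// Action is what to do with one service (hypothetical type).
type Action string

const (
	SetUpNew Action = "set up a new cluster"
	Reuse    Action = "reuse the existing cluster"
	Skip     Action = "skip the service"
	AskUser  Action = "prompt the user"
)

// ErrClusteredWithAuto mirrors the rule that --auto requires no clustered systems.
var ErrClusteredWithAuto = errors.New("--auto requires that no system is already clustered")

// decide returns the action for a single service. hasExisting says whether
// an existing cluster was found; reuseKey is the preseed value of
// reuse_existing_clusters and is only consulted in preseed mode.
func decide(mode Mode, hasExisting, reuseKey bool) (Action, error) {
	if !hasExisting {
		return SetUpNew, nil
	}
	switch mode {
	case ModeAuto:
		return "", ErrClusteredWithAuto
	case ModePreseed:
		if reuseKey {
			return Reuse, nil
		}
		return Skip, nil // assumption: skip the service, set up the rest as normal
	default:
		return AskUser, nil
	}
}

func main() {
	action, err := decide(ModePreseed, true, true)
	fmt.Println(action, err)

	_, err = decide(ModeAuto, true, false)
	fmt.Println(err)
}
```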

microcloud/service/microcloud.go

microcloud/service/microceph.go

microcloud/service/microovn.go

microcloud/service/microcloud.go

masnax · 2024-04-09T16:12:08Z

@markylaing @roosterfish @tomponline

This one should be ready for review now.

tomponline · 2024-04-09T18:05:41Z

Thanks @masnax

Once @markylaing and @roosterfish have approved I'll do a final pass. Ta

roosterfish

Looks good to me, only a few smaller suggestions.

microcloud/service/microceph.go

microcloud/cmd/microcloud/main_init.go

roosterfish

LGTM!

masnax · 2024-04-11T17:41:10Z

@markylaing @tomponline Do either of you want to take a look at this before merging?

Also @ru-fu Looks like there's something going on with the doc checks here. Got any ideas?

markylaing · 2024-04-11T19:48:48Z

@markylaing @tomponline Do either of you want to take a look at this before merging?

Also @ru-fu Looks like there's something going on with the doc checks here. Got any ideas?

Yes I'd like to take a look. Will get to it first thing tomorrow.

ru-fu · 2024-04-12T06:59:34Z

This sounds like it needs doc updates.

markylaing · 2024-04-12T08:31:50Z

The preseed file will have a new key reuse_existing_clusters which can be set to true or false. If true, we will reuse any clusters we find, and if false, we will skip that service entirely and set up as normal.

If this key is false, does it error out if we find existing Ceph or OVN clusters as it does with the --auto flag? I'm presuming here that we can't set up a new Ceph or OVN cluster on top of one that pre-exists.

markylaing

I think the structure of it looks good but some parts could be made a bit clearer.

Also, one concern I have with this is the reliance on mDNS for issuing the tokens. Will it be possible to add existing services if the join process changes such that explicit verification is required on both ends? I believe @roosterfish is working on this currently so I think it's worth a discussion around that.

microcloud/service/microcloud.go

microcloud/service/lxd.go

microcloud/service/microceph.go

microcloud/cmd/microcloud/ask.go

microcloud/cmd/microcloud/main_init.go

markylaing · 2024-04-12T09:27:58Z

microcloud/service/microcloud.go

-		},
+				return shared.ProxyFromEnvironment(r)
+			},
+		}


Now that canonical/microcluster#83 is merged does this need to be updated? Same for the other cases where this is being set.

Instead of filling this PR with all of those changes, I've added a new PR that moves all this implementation into a helper so we can just call that instead: #287

Ok so #287 has to come after this one then?

microcloud/cmd/microcloud/main_init_preseed.go

microcloud/test/includes/microcloud.sh

microcloud/cmd/microcloud/main_init.go

tomponline · 2024-04-12T09:41:31Z

@markylaing @tomponline Do either of you want to take a look at this before merging?

I'll look over once approved by the others.

roosterfish · 2024-04-12T09:57:01Z

Also, one concern I have with this is the reliance on mDNS for issuing the tokens. Will it be possible to add existing services if the join process changes such that explicit verification is required on both ends?

Good point, I have also thought about this the other day when reviewing the PR. Essentially what we will probably not have anymore is the secret used for the X-MicroCloud-Auth header to talk to the other MicroCloud daemon.
So it has to be an action invoked by the administrator on the target system.
I have it on my list of things for the spec so we can discuss it there :)

tomponline · 2024-04-12T09:58:40Z

Also, one concern I have with this is the reliance on mDNS for issuing the tokens. Will it be possible to add existing services if the join process changes such that explicit verification is required on both ends? I believe @roosterfish is working on this currently so I think it's worth a discussion around that.

It will almost certainly have to change as mDNS isn't going to be used as much. But that shouldn't necessarily block this PR - although it will likely have to be reworked as part of @roosterfish's planned changes.

masnax · 2024-04-12T13:55:08Z

Also, one concern I have with this is the reliance on mDNS for issuing the tokens. Will it be possible to add existing services if the join process changes such that explicit verification is required on both ends? I believe @roosterfish is working on this currently so I think it's worth a discussion around that.

At the moment, pretty much everything MicroCloud does requires mDNS to work because we can't be trusted on the other systems until we've clustered with them. We need mDNS for selecting disks, selecting networks, and selecting the nodes themselves, and now issuing tokens from existing clusters with this PR.

I'd assume the verification process would establish a long-lived trust for the duration of the join process similar to the mDNS auth secret, allowing us to remotely issue tokens, view available disks and network interfaces, and request a node to join the clusters.

roosterfish · 2024-04-12T14:18:49Z

I'd assume the verification process would establish a long-lived trust for the duration of the join process similar to the mDNS auth secret, allowing us to remotely issue tokens, view available disks and network interfaces, and request a node to join the clusters.

An option I am trying to validate for the spec is moving the forming of the MicroCloud MicroCluster right after the discovery of new peers. This establishes trust between all the peers and allows them to talk to each other using mutual TLS to perform the actions we are currently performing using the auth secret.

With this approach we no longer require any type of auth secret besides a join token for each of the peers that want to join the MicroCloud.

masnax · 2024-04-12T14:24:51Z

I'd assume the verification process would establish a long-lived trust for the duration of the join process similar to the mDNS auth secret, allowing us to remotely issue tokens, view available disks and network interfaces, and request a node to join the clusters.

An option I am trying to validate for the spec is moving the forming of the MicroCloud MicroCluster right after the discovery of new peers. This establishes trust between all the peers and allows them to talk to each other using mutual TLS to perform the actions we are currently performing using the auth secret.

With this approach we no longer require any type of auth secret besides a join token for each of the peers that want to join the MicroCloud.

For some historical context, we actually had a similar process in place very early on because I was hesitant to rely on the mDNS component so much.

Problem was mainly user experience. 90% of errors will happen around the configuration so a mistake would be costly as we would have to tear down the clusters. Plus it puts a significant pause (could be minutes for larger clusters) before we even get to the bulk of the user interactions.

roosterfish · 2024-04-15T07:59:43Z

90% of errors will happen around the configuration so a mistake would be costly as we would have to tear down the clusters. Plus it puts a significant pause (could be minutes for larger clusters) before we even get to the bulk of the user interactions.

What do you mean by we have to tear down the clusters?

If I understood you right you are talking about the pause which will be an effect of joining in the cluster members manually? I guess this is something we have to accept as part of making the overall concept more secure. But it's worth evaluating options to make this as fast as possible for the end user.

masnax · 2024-04-15T08:15:18Z

What do you mean by we have to tear down the clusters?

The user runs microcloud init, but mistakenly hasn't prepared disks/network interfaces appropriately yet. If the systems are clustered before we ask the storage/network questions, the user only realizes the mistake after the nodes have been clustered. Now aborting the init process so the user can fix the mistake becomes much more expensive because we have already changed state on every system, before the user has made any choices.

doc/how-to/initialise.rst

ru-fu

Thanks! Docs look good now. :)

Changes addressed

masnax · 2024-04-30T14:45:09Z

@roosterfish @tomponline Could one of you please look this over one last time before merging? Thanks!

(That 1 test failure is an issue with the test suite being too hard on the runners, it passes consistently when run locally).

tomponline · 2024-04-30T14:46:49Z

(That 1 test failure is an issue with the test suite being too hard on the runners, it passes consistently when run locally).

@masnax does this mean if we merge this we will get that test failing going forward then?

If so then that's a merge blocker to me.

masnax · 2024-04-30T14:54:33Z

(That 1 test failure is an issue with the test suite being too hard on the runners, it passes consistently when run locally).

@masnax does this mean if we merge this we will get that test failing going forward then?

If so then that's a merge blocker to me.

The test will be just as flaky on any PR regardless of whether this one is merged. The test was not affected by this PR; the issue is that we just can't consistently run 4 VMs on the GitHub runners without them running into issues like LXD being unable to start up.

tomponline · 2024-04-30T14:56:17Z

I see @masnax thanks for clarifying that. It sounded like this PR was introducing the tests that were flaky.

I'll leave the final review to @roosterfish

roosterfish

Just a few more small comments, rest still looking good to me.

microcloud/api/types/services.go

microcloud/cmd/microcloud/main_init.go

roosterfish · 2024-05-02T16:20:38Z

microcloud/service/microcloud.go

-		},
+				return shared.ProxyFromEnvironment(r)
+			},
+		}


Ok so #287 has to come after this one then?

microcloud/test/suites/basic.sh

During init, to handle the case where another system is already clustered on a particular service, we need to be able to request this node to issue a token for us from an as-yet untrusted system. This endpoint is untrusted by the cluster, but authenticated with a secret generated during mDNS lookup, so we can use it as a proxy to the unix socket on the remote system, where we will be trusted and can issue a token. Signed-off-by: Max Asnaashari <[email protected]>

…ustered The cluster size delta per service will become uneven if some nodes are already clustered, and we can't guarantee that the local node isn't already participating in some of those clusters, so we need to handle each service more explicitly by carrying a map around the cluster join process. Signed-off-by: Max Asnaashari <[email protected]>
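
To make the idea of a per-service map concrete, here is a small sketch under assumed names (`joinersPerService` and the `map[string][]string` shapes are hypothetical, not the types used in the commit). For each service it works out which of the selected systems still have to join, ignoring existing cluster members that are not part of the MicroCloud.

```go
package main

import (
	"fmt"
	"sort"
)

// joinersPerService returns, for each service, the selected systems that
// still need to join that service's cluster. existing maps a service name
// to the members of the cluster found for it (empty if none was found).
func joinersPerService(systems []string, existing map[string][]string) map[string][]string {
	joiners := make(map[string][]string, len(existing))
	for service, members := range existing {
		clustered := make(map[string]bool, len(members))
		for _, m := range members {
			clustered[m] = true
		}

		joiners[service] = []string{}
		for _, s := range systems {
			if !clustered[s] {
				joiners[service] = append(joiners[service], s)
			}
		}
	}
	return joiners
}

func main() {
	systems := []string{"micro01", "micro02", "micro03"}
	existing := map[string][]string{
		// "ceph-only" is in the MicroCeph cluster but not in the MicroCloud.
		"MicroCeph": {"micro01", "micro02", "ceph-only"},
		"MicroOVN":  nil,
	}

	joiners := joinersPerService(systems, existing)

	services := make([]string, 0, len(joiners))
	for service := range joiners {
		services = append(services, service)
	}
	sort.Strings(services)
	for _, service := range services {
		fmt.Printf("%s: %v\n", service, joiners[service])
	}
}
```

With an uneven result like this, the number of nodes to wait for differs per service, which is why a single cluster size delta is not enough.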

Closed the old PR to start fresh.

Adds two new commands:

* `microcloud service list` will list the cluster members for every installed service, or report if it is not initialized. This will effectively be the same as calling all of the following in succession:

  ```
  lxc cluster list
  microcloud cluster list
  microceph cluster list
  microovn cluster list
  ```

  The information shown will be the name, address, dqlite role, and current status of each member.

* `microcloud service add` will try to set up MicroOVN and MicroCeph on all existing MicroCloud cluster members, optionally setting up storage and networks for LXD. This is useful if MicroOVN or MicroCeph was at one point not installed on the systems and skipped during `microcloud init`. LXD and MicroCloud itself are required to already be set up.

Thanks to #259 we can also try to re-use a service that partially covers existing MicroCloud cluster members. So if a MicroCloud is set up without MicroCeph, and then the user manually configures MicroCeph to partially cover the cluster, the user can then use `microcloud service add` to further configure MicroCeph to work with MicroCloud, and set up storage pools for LXD.

masnax requested review from tomponline, roosterfish and markylaing as code owners January 26, 2024 04:05

masnax force-pushed the detect-system branch 5 times, most recently from be24bda to 01ab507 Compare January 31, 2024 02:33

masnax mentioned this pull request Feb 1, 2024

Add service list and service add #260

Closed

masnax force-pushed the detect-system branch from 01ab507 to c904827 Compare April 9, 2024 16:02

github-advanced-security bot found potential problems Apr 9, 2024

roosterfish requested changes Apr 10, 2024

microcloud/service/microceph.go

microcloud/cmd/microcloud/main_init.go

microcloud/cmd/microcloud/main_init.go

masnax force-pushed the detect-system branch from c904827 to a33fbc4 Compare April 10, 2024 16:03

roosterfish previously approved these changes Apr 10, 2024

masnax force-pushed the detect-system branch from a33fbc4 to 978ff81 Compare April 11, 2024 17:55

markylaing requested changes Apr 12, 2024

masnax force-pushed the detect-system branch from 66e8891 to b791c9b Compare April 29, 2024 04:40

ru-fu reviewed Apr 29, 2024

doc/how-to/initialise.rst

doc/how-to/initialise.rst

masnax force-pushed the detect-system branch from 1455809 to 41048c3 Compare April 29, 2024 14:42

ru-fu reviewed Apr 29, 2024

doc/how-to/initialise.rst

masnax force-pushed the detect-system branch from 41048c3 to 2c08b41 Compare April 29, 2024 14:59

ru-fu previously approved these changes Apr 29, 2024

roosterfish requested changes May 2, 2024

masnax added 11 commits May 2, 2024 17:21

microcloud/service: Allow fetching remote cluster members

93e4698

Signed-off-by: Max Asnaashari <[email protected]>

microcloud/cmd/microcloud: Add checkClustered helper

62c8192

Signed-off-by: Max Asnaashari <[email protected]>

microcloud/cmd/microcloud: Consider existing clusters when forming th…

00026ad

…e cluster Signed-off-by: Max Asnaashari <[email protected]>

microcloud/cmd/microcloud: Move token issuing to AddPeers

b6d4c86

Detecting already clustered members will make this block very complex, so separate it out from `waitForJoin` so that the scope of the helper is just to wait for nodes to join Signed-off-by: Max Asnaashari <[email protected]>

microcloud/service: Allow requesting the local node to join a cluster

056ad1d

Signed-off-by: Max Asnaashari <[email protected]>

microcloud/cmd/microcloud: Add ReuseExistingClusters key to Preseed

961c216

Signed-off-by: Max Asnaashari <[email protected]>

doc/how-to: Add reuse and skip documentation

59e4aac

Signed-off-by: Max Asnaashari <[email protected]>

microcloud/test/includes: Add reuse-cluster input to interactive test…

5636937

… helper Signed-off-by: Max Asnaashari <[email protected]>

microcloud/test/suites: Add reuse-cluster tests

3424df4

Signed-off-by: Max Asnaashari <[email protected]>

masnax dismissed ru-fu’s stale review via 3424df4 May 3, 2024 06:08

masnax force-pushed the detect-system branch from 2c08b41 to 3424df4 Compare May 3, 2024 06:08

masnax requested a review from roosterfish May 3, 2024 14:46

roosterfish approved these changes May 3, 2024

masnax merged commit d5a3819 into canonical:main May 3, 2024
15 checks passed

masnax mentioned this pull request Jul 23, 2024

service add and service list #355

Merged
