Reuse existing MicroCeph and MicroOVN clusters #259
Conversation
@markylaing @roosterfish @tomponline This one should be ready for review now.
Thanks @masnax. Once @markylaing and @roosterfish have approved, I'll do a final pass. Ta.
Looks good to me, only a few smaller suggestions.
LGTM!
@markylaing @tomponline Do either of you want to take a look at this before merging? Also @ru-fu, it looks like there's something going on with the doc checks here. Got any ideas?
Yes, I'd like to take a look. Will get to it first thing tomorrow.
This sounds like it needs doc updates.
If this key is false, does it error out if we find existing Ceph or OVN clusters as it does with the
I think the structure of it looks good, but some parts could be made a bit clearer.
Also, one concern I have with this is the reliance on mDNS for issuing the tokens. Will it be possible to add existing services if the join process changes such that explicit verification is required on both ends? I believe @roosterfish is working on this currently, so I think it's worth a discussion around that.
```go
return shared.ProxyFromEnvironment(r)
```
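The fragment above is the diff context under review: the body of a proxy callback. A minimal sketch of the repeated pattern being discussed, assuming LXD's `shared.ProxyFromEnvironment` helper (the import path is an assumption and may differ by LXD version):

```go
package example

import (
	"net/http"
	"net/url"

	"github.com/canonical/lxd/shared"
)

// newTransport shows the pattern repeated at each call site discussed
// in this thread: an HTTP transport whose proxy is resolved from the
// environment via LXD's shared helper.
func newTransport() *http.Transport {
	return &http.Transport{
		Proxy: func(r *http.Request) (*url.URL, error) {
			return shared.ProxyFromEnvironment(r)
		},
	}
}
```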
Now that canonical/microcluster#83 is merged, does this need to be updated? Same for the other cases where this is being set.
Instead of filling this PR with all of those changes, I've added a new PR that moves all this implementation into a helper so we can just call that instead: #287
Ok so #287 has to come after this one then?
I'll look it over once approved by the others.
Good point, I also thought about this the other day when reviewing the PR. Essentially, what we will probably not have anymore is the secret used for the
It will almost certainly have to change, as mDNS isn't going to be used as much. But that shouldn't necessarily block this PR, although it will likely have to be reworked as part of @roosterfish's planned changes.
At the moment, pretty much everything MicroCloud does requires mDNS to work, because we can't be trusted on the other systems until we've clustered with them. We need mDNS for selecting disks, selecting networks, and selecting the nodes themselves, and now, with this PR, for issuing tokens from existing clusters. I'd assume the verification process would establish a long-lived trust for the duration of the join process, similar to the mDNS auth secret, allowing us to remotely issue tokens, view available disks and network interfaces, and request a node to join the clusters.
An option I am trying to validate for the spec is forming the MicroCloud MicroCluster right after the discovery of new peers. This establishes trust between all the peers and allows them to talk to each other using mutual TLS to perform the actions we currently perform using the auth secret. With this approach, we no longer require any type of auth secret besides a join token for each of the peers that wants to join the MicroCloud.
For some historical context, we actually had a similar process in place very early on, because I was hesitant to rely on the mDNS component so much. The problem was mainly user experience. 90% of errors happen around the configuration, so a mistake would be costly as we would have to tear down the clusters. Plus, it puts a significant pause (possibly minutes for larger clusters) before we even get to the bulk of the user interactions.
What do you mean by "we have to tear down the clusters"? If I understood you right, you are talking about the pause that results from joining in the cluster members manually? I guess this is something we have to accept as part of making the overall concept more secure. But it's worth evaluating options to make this as fast as possible for the end user.
The user runs
Thanks! Docs look good now. :)
@roosterfish @tomponline Could one of you please look this over one last time before merging? Thanks! (That one test failure is an issue with the test suite being too hard on the runners; it passes consistently when run locally.)
@masnax does this mean that if we merge this, we will get that test failing going forward? If so, then that's a merge blocker to me.
The test will be just as flaky on any PR regardless of whether this one is merged. The test was not affected by this PR; the issue is that we just can't consistently run 4 VMs on the GitHub runners without them running into issues like LXD being unable to start up.
I see @masnax, thanks for clarifying that. It sounded like this PR was introducing the tests that were flaky. I'll leave the final review to @roosterfish.
Just a few more small comments, rest still looking good to me.
During init, to handle the case where another system is already clustered on a particular service, we need to be able to request this node to issue a token for us from an as-yet untrusted system. This endpoint is untrusted by the cluster, but authenticated with a secret generated during mDNS lookup, so we can use it as a proxy to the unix socket on the remote system, where we will be trusted and can issue a token. Signed-off-by: Max Asnaashari <[email protected]>
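The "proxy to the unix socket" hop described in this commit message can be pictured as an HTTP client that dials a local unix socket instead of a TCP address. A minimal sketch, with a hypothetical socket path (not MicroCloud's actual wiring):

```go
package example

import (
	"context"
	"net"
	"net/http"
)

// unixClient returns an HTTP client whose requests are tunnelled to a
// local unix socket, the hop used to re-issue a proxied request where
// it is trusted. The socket path is illustrative.
func unixClient(socketPath string) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				// Ignore the network/address derived from the request
				// URL and always dial the unix socket instead.
				return (&net.Dialer{}).DialContext(ctx, "unix", socketPath)
			},
		},
	}
}
```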
Detecting already clustered members will make this block very complex, so separate it out from `waitForJoin` so that the scope of the helper is just to wait for nodes to join. Signed-off-by: Max Asnaashari <[email protected]>
…ustered The cluster size delta per service will become uneven if some nodes are already clustered, and we can't guarantee that the local node isn't already participating in some of those clusters, so we need to handle each service more explicitly by carrying a map around the cluster join process. Signed-off-by: Max Asnaashari <[email protected]>
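The map being carried around might look roughly like this; a minimal sketch with illustrative service names and members, not MicroCloud's actual types:

```go
package main

import "fmt"

func main() {
	// Per-service bookkeeping as described in the commit message: for
	// each service, record which of the selected systems are already
	// clustered. All names here are illustrative.
	existingClusters := map[string][]string{
		"MicroCeph": {"node1", "node2"}, // Partially clustered already.
		"MicroOVN":  {},                 // Not initialized on any system.
	}

	// During the join process, each service can then be handled
	// explicitly based on its entry in the map.
	for service, members := range existingClusters {
		fmt.Printf("%s: %d existing members\n", service, len(members))
	}
}
```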
Closed the old PR to start fresh. Adds two new commands:

* `microcloud service list` will list the cluster members for every installed service, or report that it is not initialized. This is effectively the same as calling all of the following in succession:

```
lxc cluster list
microcloud cluster list
microceph cluster list
microovn cluster list
```

The information shown will be the name, address, dqlite role, and current status of each member.

* `microcloud service add` will try to set up MicroOVN and MicroCeph on all existing MicroCloud cluster members, optionally setting up storage and networks for LXD. This is useful if MicroOVN or MicroCeph was at one point not installed on the systems and was skipped during `microcloud init`. LXD and MicroCloud itself are required to already be set up.

Thanks to #259 we can also try to re-use a service that partially covers existing MicroCloud cluster members. So if a MicroCloud is set up without MicroCeph, and the user then manually configures MicroCeph to partially cover the cluster, the user can use `microcloud service add` to further configure MicroCeph to work with MicroCloud and set up storage pools for LXD.
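For illustration, `microcloud service list` output for one service might look something like the following. The layout here is hypothetical; only the columns (name, address, dqlite role, status) come from the description above:

```
NAME   ADDRESS          ROLE   STATUS
node1  10.0.0.101:8443  voter  ONLINE
node2  10.0.0.102:8443  voter  ONLINE
node3  10.0.0.103:8443  spare  ONLINE
```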
Closes #145
This PR allows initializing a new MicroCloud with some nodes that have already set up MicroCeph or MicroOVN in the past. This can be useful if, for example, you already have a MicroCeph cluster that you want to "upgrade" into a MicroCloud.
Just after confirming the list of systems to use for the MicroCloud, we will try to grab a list of cluster members from each system. The underlying microcluster package will either return the list or report that the database is not initialized. We will record the first list that we obtain for each of MicroOVN and MicroCeph, and if the list on every other system is either non-existent or identical, we can try to reuse those clusters.
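That compatibility check can be sketched as follows; a minimal, hypothetical helper (MicroCloud's actual implementation is not shown in this thread):

```go
package example

// compatibleWithRecorded reports whether one system's member list is
// compatible with the first list recorded for a service: either the
// system's database is not initialized (no members), or its member
// list is identical to the recorded one.
func compatibleWithRecorded(recorded, members []string) bool {
	// An uninitialized database conflicts with nothing.
	if len(members) == 0 {
		return true
	}

	if len(members) != len(recorded) {
		return false
	}

	seen := make(map[string]bool, len(recorded))
	for _, name := range recorded {
		seen[name] = true
	}

	for _, name := range members {
		if !seen[name] {
			return false
		}
	}

	return true
}
```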
The reuse process basically amounts to asking the system that has an existing cluster to generate join tokens and send them to the node orchestrating the initialization. Normally you have to be trusted to make such a request, so instead we use the auth secret that we got from finding the systems over mDNS to send the request through the MicroCloud proxy, which initiates the request from the unix socket on the clustered system.
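The orchestrator's half of that flow might look roughly like this; the endpoint path and header name are placeholders, since the real names are not given in this thread:

```go
package example

import (
	"fmt"
	"net/http"
)

// requestJoinToken asks an already clustered system to issue a join
// token on the orchestrator's behalf. The URL path and header name
// below are placeholders, not MicroCloud's actual API.
func requestJoinToken(addr, authSecret, joiner string) (*http.Response, error) {
	url := fmt.Sprintf("https://%s/1.0/token-proxy?name=%s", addr, joiner)

	req, err := http.NewRequest(http.MethodPost, url, nil)
	if err != nil {
		return nil, err
	}

	// The mDNS auth secret authenticates this otherwise untrusted
	// request; the remote end re-issues it over its local unix socket,
	// where it is trusted, and returns the token.
	req.Header.Set("X-MicroCloud-Auth", authSecret)

	return http.DefaultClient.Do(req)
}
```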
The behaviour is as follows:

* If `--auto` is specified, we will strictly require no clustered systems.
* There is a new key, `reuse_existing_clusters`, which can be set to true or false. If true, we will reuse any clusters we find, and if false, we will skip that service entirely and set up as normal.