
Future of Upgrades with CAPI and Gitops #1031

Closed
teemow opened this issue Apr 12, 2022 · 34 comments
Assignees
Labels
kind/cross-team Epics that span across teams needs/refinement Needs refinement in order to be actionable team/planeteers Team Planeteers (Customer Success & Product Management) topic/capi

Comments

@teemow
Member

teemow commented Apr 12, 2022

We have two major things that change the way users upgrade clusters: CAPI and GitOps.

  • with Cluster API we stopped tying the controller versions on the management cluster to the tenant cluster release (already decided and implemented)
  • with Cluster API we would like to move towards a more seamless approach and get rid of tenant cluster releases (future)
  • with GitOps users can't use our CLIs or the web UI to upgrade clusters
  • how do we display potential changes for GitOps users upfront?
  • how do we show status and progress of an upgrade to users?
@cornelius-keller
Contributor

This is becoming a topic in customer and internal communication, so I think we should raise the priority of this one.
For visibility:

Current workflow and discussion with customer:

  • we rolled out new k8s versions in a semi-automated way, with a script creating PRs to the individual cluster definitions
  • the customer merged all PRs. All clusters are on v1.22.9 now.
  • customer feedback is that they would like fewer PRs, for example by upgrading the base in a single step. We need to discuss how we can test / stage this and whether they really want to upgrade all the clusters at the same time.

Internal discussion / questions arising:

  • What is the process of rolling out a new k8s version?
  • How do we push security fixes to the customer? Do we need approval in this case?
  • How do we ensure compatibility and sync of API server flags, kubelet configuration, and (default) apps like CoreDNS, CNI and ingress, now that we don't have k8scloudconfig locking all the versions?
  • How will we do release notes? Can we include them somehow in the PR to the customer?

FYI @gianfranco-l @alex-dabija

@MarcelMue

Here are my 2 cents from the Honey Badger PoV:

  • Kubernetes versions should be defaulted in the individual cluster-provider apps. E.g. cluster-openstack in version 1.2.3 would have Kubernetes version 1.24.25 as its default. Later we provide a new cluster-openstack release with a newer Kubernetes version.
  • Because these provider apps are Giant Swarm Apps, we can utilize the same mechanism we use for automatic app upgrades to roll out upgrades of these apps to the GitOps repo. AKA Flux will create PRs for these new versions when they are available (see the sketch below). We already have some docs on how this works here: https://github.com/giantswarm/gitops-template/blob/main/docs/apps/automatic_updates_appcr.md
  • The only prerequisite for automatic app updates is that the apps (e.g. cluster-openstack) are pushed to an OCI registry (just a flag has to be set in the architect-orb)

-> If a customer wants to override the provider app's default k8s version, then they themselves have to take care of managing the upgrades for those clusters, or eventually remove that overlay to get the default k8s version again.
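
To make the mechanism concrete, here is a minimal sketch of what an automatically updated App CR in the GitOps repo could look like. It loosely follows the pattern from the gitops-template doc linked above, but the catalog, namespace, and image-policy names are placeholders, not the exact values from our setup:

```yaml
# Hypothetical App CR for a workload cluster in the GitOps repo.
# Flux's image automation rewrites the version field in place based on
# the inline marker and commits the change (or pushes a branch that
# becomes a PR).
apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  name: my-cluster
  namespace: org-example
spec:
  catalog: cluster
  name: cluster-openstack
  namespace: org-example
  kubeConfig:
    inCluster: true
  version: 1.2.3 # {"$imagepolicy": "flux-system:cluster-openstack:tag"}
```

The `:tag` suffix in the marker tells Flux to write only the tag of the latest matching artifact into the field, which is what makes this work for a plain version string rather than a full image reference.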

@MarcelMue

MarcelMue commented May 18, 2022

Side note: the automatic app upgrades can also be pushed directly to main, e.g. for dev clusters -> no PR required.

@kopiczko

kopiczko commented May 24, 2022

This is what we came up with during today's refinement (you can download the .svg file and edit it in https://draw.io).

The general idea is to have a customer-facing meta-app giantswarm-openstack (the name is subject to change) which would encapsulate everything else. So changing giantswarm-openstack would change Kubernetes, the OS, and all the underlying apps. Kind of what we have right now, but without all the versions of the controllers running at the same time.

The general idea is that patch versions (maybe minors as well) should be rolled out to GitOps repos automatically by some automation.

The exact versioning rules should be refined as well, if the concept is satisfying.

I hope the picture illustrates that:

[diagram: giantswarm-openstack]
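
To make the encapsulation idea more tangible, here is a rough sketch of the meta-app packaged as a Helm chart with the two existing apps as dependencies. The names, versions, and registry URL are purely illustrative:

```yaml
# Hypothetical Chart.yaml for the giantswarm-openstack meta-app.
# Bumping this single chart version moves Kubernetes, the OS image and
# all default apps together, because the parts are pinned right here.
apiVersion: v2
name: giantswarm-openstack
version: 2.0.0                     # the one version customers interact with
dependencies:
  - name: cluster-openstack        # defines the k8s and OS versions
    version: 1.2.3
    repository: oci://example.azurecr.io/giantswarm
  - name: default-apps-openstack   # CNI, CoreDNS, ingress, ...
    version: 4.5.6
    repository: oci://example.azurecr.io/giantswarm
```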

@kopiczko

Just in case zipped svg:
giantswarm-openstack.svg.zip

@cornelius-keller
Contributor

Just had another chat with the customer:

  • automated commits to the git repository by Flux without a PR are probably okay
  • cluster upgrades in batches are preferred
  • I recommended upgrading staging clusters first and then production, but this is not necessarily a requirement.
    Question for @giantswarm/team-honeybadger: is there another way to group cluster upgrades than having a base per stage/group?
    In the current repository layout we have one base per datacenter or region. So clusters are grouped by this, not by staging and production.

@MarcelMue

@kopiczko I don't understand why this added app is needed. I am still missing the motivation / explanation for why efforts here are needed.

@cornelius-keller Yes, bases can be structured in any granularity. We need to have an example, but you can have one base per environment. Therefore we can update them separately and automatically (see the sketch below).
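
For illustration, a possible layout with one base per environment, so the automation can bump versions in staging independently of production; the directory names are invented and not the gitops-template's exact structure:

```
bases/
  environments/
    staging/
      cluster_appcr.yaml    # app version bumped here first by automation
      kustomization.yaml
    production/
      cluster_appcr.yaml    # bumped in a later batch
      kustomization.yaml
management-clusters/
  mc1/
    staging/cluster-a/      # overlay referencing bases/environments/staging
    production/cluster-b/   # overlay referencing bases/environments/production
```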

@kopiczko

During the refinement we worked with these assumptions:

  • Managing all the different app versions is confusing for the customer
  • Compatibility between cluster-openstack and default-apps-openstack is not guaranteed for all versions
  • We want to bake k8s version into the release version
  • We want to bake OS version into the release version (with Ubuntu or any other updateable OS it's tricky)

@kopiczko

Talking with @MarcelMue on Slack, I have a feeling this is about GitOps repo structure again. And it has largely derailed from what @teemow described in the first place.

So currently GitOps repo bases are split by DCs instead of stages because we have different image IDs and external network IDs between them. This can be partially solved by referencing images by name in CAPO 0.6, but I guess there will always be differences between DCs. E.g. we can't force customers to have the same external network name in OpenStack, and I'm sure there will be more to it.

When it comes to GitOps repo structure, I think we need to find a way to structure by both DCs and stages.
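
One conceivable way to cover both dimensions is to keep DC-specific and stage-specific settings in separate bases and compose them per cluster; a rough sketch with invented names:

```
bases/
  dc/
    dc1/            # image and external network IDs specific to this DC
    dc2/
  stages/
    staging/        # app versions that get rolled out here first
    production/
clusters/
  dc1-staging/cluster-a/
    kustomization.yaml   # composes bases/dc/dc1 + bases/stages/staging
```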

This of course doesn't solve the original problems:

  • how do we display potential changes for GitOps users upfront?
  • how do we show status and progress of an upgrade to users?

@MarcelMue

When it comes to GitOps repo structure, I think we need to find a way to structure by both DCs and stages.

100% agreed, we need to find this structure.

+1 on finding something for the original problems.

@Oshratn
Contributor

Oshratn commented May 30, 2022

@piontec this discussion seems to be tangential to the one we had in SIG Product sync today.
Could you please add a disclaimer in the GitOps template repo that indicates the level of maturity that already exists, to provide clarity for the other teams?

@MarcelMue

@cornelius-keller AFAIK there was a meeting with Rocket on Monday. Can you post a summary here if there were any new decisions?

@piontec

piontec commented Jun 2, 2022

@piontec this discussion seems to be tangential to the one we had in SIG Product sync today. Could you please add a disclaimer in the GitOps template repo that indicates the level of maturity that already exists, to provide clarity for the other teams?

I've added the link to issues in progress there, but this discussion of whether or not to package cluster apps into another layer of an app is irrelevant to this.

As for the discussion of upgrades itself, there's an RFC: giantswarm/rfc#36

@piontec

piontec commented Jul 1, 2022

Hey, the RFC is closed now. It explains how an app (any app, including a cluster app bundle) can be auto-upgraded with Flux. Questions about displaying progress need to be handled separately, as Flux only takes care of discovering a new version, doing the upgrade, and storing the change in the repo.
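
For reference, the Flux side of such an auto-upgrade is wired up with the image-automation controllers. A minimal sketch, assuming the app versions are published as tags of an OCI artifact that the image-reflector can list; the repository path, names, and branches are placeholders:

```yaml
# Watches the (placeholder) OCI repository for new tags of the cluster app.
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageRepository
metadata:
  name: cluster-openstack
  namespace: flux-system
spec:
  image: example.azurecr.io/giantswarm/cluster-openstack
  interval: 10m
---
# Decides which of the discovered tags counts as "the newest version".
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImagePolicy
metadata:
  name: cluster-openstack
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: cluster-openstack
  policy:
    semver:
      range: ">=1.0.0"
---
# Writes the new version back into the GitOps repo. Pushing to a separate
# branch enables PR-based review; pushing straight to main skips the PR
# (the dev-cluster case mentioned earlier in this thread).
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageUpdateAutomation
metadata:
  name: app-upgrades
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: gitops-repo
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        name: flux
        email: flux@example.com
      messageTemplate: "automated app upgrade"
    push:
      branch: flux-app-updates
  update:
    path: ./management-clusters
    strategy: Setters
```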

@gianfranco-l

gianfranco-l commented Jul 1, 2022

Created a dedicated roadmap ticket for the feedback part, I guess we can close this ticket?

@piontec

piontec commented Jul 4, 2022

I think we still need some input from KaaS teams about presentation and feedback for upgrades, right?

@gianfranco-l

@piontec we meant for the feedback and presentation part to be tackled separately here, right?

@kopiczko

kopiczko commented Jul 5, 2022

We tried enabling the diff action a while back and we didn't find it very useful. Here's an example:

https://github.com/TheHutGroup/gs-clusters/pull/303#issuecomment-1158673903 (I have to scroll up a bit after clicking on the URL), look for this:

Wrong issue, sorry

@cornelius-keller
Contributor

For the record, also for @MarcelMue: the decisions from the meeting with the customer are tracked here: https://github.com/giantswarm/thg/issues/113

@marians
Member

marians commented Jul 8, 2022

How has the question regarding the design of the app bundles been settled?

Pawel drafted something where a giantswarm-openstack version would define the cluster-openstack version and the default-apps-openstack version.

Is this how it's going to work throughout CAPI providers?

@kopiczko

kopiczko commented Jul 8, 2022

How has the question regarding the design of the app bundles been settled?

I don't think this is something we agreed on.

Is this how it's going to work throughout CAPI providers?

Feels dead simple unless I don't understand the question. giantswarm-openstack, giantswarm-gcp, ...


BTW, yesterday I already had doubts about cluster-openstack and default-apps-openstack compatibility. I'm convinced that if we follow this path of independent cluster-* and default-apps-* app upgrades, it will become a nightmare.

@marians
Member

marians commented Jul 8, 2022

So if this hasn't been decided, I have questions.

  • What's the alternative to the threefold app bundle design (the one described here by Pawel)? Just cluster-<provider> plus default-apps-<provider>, independent of each other?
  • Do we agree that a decision is due, as customers are using OpenStack and GCP already?

Currently in Rainbow this is blocking part of #1192 as it means

  • we cannot display whether an upgrade is available for a cluster
  • we don't know what to display as the defining version(s) of a workload cluster instead of the vintage Release

@kopiczko

kopiczko commented Jul 8, 2022

  • What's the alternative to the threefold app bundle design (the one described here by Pawel)? Just cluster-<provider> plus default-apps-<provider>, independent of each other?

I believe the alternative is to have independent apps.

@marians
Member

marians commented Jul 11, 2022

Adding this to the KaaS sync board for this Thursday.

@gawertm

gawertm commented Nov 28, 2022

@pipo02mix FYI

@teemow teemow added the kind/cross-team Epics that span across teams label Nov 28, 2022
@teemow teemow added this to Roadmap Nov 28, 2022
@pipo02mix
Contributor

As part of this ticket we have enabled customers/us to roll out changes across regions and stages independently with GitOps. At the same time, we have included a default version in the cluster-PROVIDER app here, as Marcel explained. But we found out that the Kubernetes and OS versions are tied together in CAPI, and we could serve a conversion table to enhance the UX (see the illustration below). For that we have created this RFC to align across teams.

In that PR ⬆️ there have been some discussions that I think should be moved here (@puja108 @AverageMarcus) or turned into an RFC to discuss them. How can we progress towards the proposed goals?
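
To illustrate the conversion-table idea, here is a hypothetical shape such a table could take; the version pairs are made up:

```yaml
# Hypothetical k8s-to-OS conversion table. Since the two versions are
# tied together in CAPI, each supported Kubernetes version maps to
# exactly one OS image, and tooling can resolve one from the other.
kubernetesVersions:
  "1.23.14":
    osImage: flatcar-stable-3227.2.2
  "1.24.8":
    osImage: flatcar-stable-3374.2.1
```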

@puja108
Member

puja108 commented Nov 30, 2022

But we found out that the Kubernetes and OS versions are tied together in CAPI and we could serve a conversion table to enhance the UX. For that we have created giantswarm/rfc#54 to align across teams.

My point is rather that we did not want to offer such freedom of choice (at least in the beginning), so we would set those versions via the app/chart (and have them tied to the app version). To me that is already the enhanced UX; what else does the customer need there? Whether we use a conversion table in the background for ourselves is an implementation detail to me. The UX towards the customer is rather that they know that a certain app version has a certain K8s and OS version, just like they currently know that a certain WC release has a certain K8s and OS version.

If customers have a good reason to override our defaults, I'd like us to talk to them and see why that is and whether the use case is valid for our offering; only then would I consider a decision to open up things we did not open up before. I would however like to avoid that, especially in line with this story here: as long as the app version is the only thing that depicts certain defaults we come with, we have a way to test and roll out upgrades automatically, also with GitOps. If we suddenly open up tons of configuration in there, the upgrade process will always require manual intervention and changes to the configuration.

@pipo02mix
Contributor

But customers need to control the k8s version for different reasons, for example API deprecations (PSP now, but I am pretty sure there will be more in the future) or vendor dependencies (we have a customer whose software only runs on certain k8s versions).

We cannot have only a single cluster-PROVIDER release if we support multiple k8s minors, since every release would overwrite the previous version. In GitOps we will control the k8s version using bases (one for every stage/env, like Honey Badger has proposed), so according to customer criteria we can propose (automatically) the new versions.

I can imagine the flow like this (a sketch of the resulting change follows after the list):

  1. There is a new k8s patch version for all supported k8s minor versions (with some CVEs patched)
  2. Our automation builds new OS images with the new versions and triggers CI for testing
  3. The cluster-PROVIDER app is released with the patched images (for the two maintained minor k8s versions)
  4. Our automation creates PRs for all customer dev clusters when the date criteria match
  5. Our automation merges the PRs and the clusters are upgraded

(the same will happen with the prod env when the date window criteria are fulfilled)
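
A sketch of what the automation's PR in step 4 could touch, assuming one base per stage pins the cluster-PROVIDER app version; the path and versions are invented:

```yaml
# Fragment of a hypothetical bases/stages/dev/cluster_appcr.yaml.
# The PR is essentially a one-line version bump in the dev base; the
# prod base keeps the old version until its maintenance window opens.
spec:
  name: cluster-openstack
  version: 1.2.4   # was 1.2.3; bumped by the upgrade automation
```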

The logic about which k8s minor a customer supports (based on customer tests; for example, they don't support 1.25 yet because they still have PSP) lives in the upgrade automation metadata. The same goes for the other criteria (when the maintenance window is, which stages we upgrade first, ...).

We must not use cluster-PROVIDER app releases to coordinate all the cluster upgrade logic (we will need a higher level). More info in the RFC.

@teemow teemow moved this to Near Term (1-3 months) in Roadmap Dec 5, 2022
@puja108
Member

puja108 commented Dec 6, 2022

For now that sounds like an OK solution.

Going forward we need to think more deeply about how we want to version and roll out, and yes, we do need some kind of releases, even within a continuously rolled out product. As Timo suggested in the other thread, we might want to consider going away from offering 2 minor versions as stable and more towards having channels like alpha, beta, RC, stable.

I do currently see the customer value in being able to pin the k8s minor to latest-1 and being able to cherry-pick patches for customers who are not ready to upgrade yet. This would also reduce dependencies for other teams to roll out their patches and updates, as long as they also work on the older k8s version, so we can reduce operative load. We need to see how the models would fit together going forward. I'd just not like us to rush it too much, as we are still a bit in flux on where and how we integrate and create boundaries between current and also future teams.

@gawertm gawertm moved this from Near Term (1-3 months) to Under Consideration in Roadmap Jan 19, 2023
@gawertm gawertm moved this from Under Consideration to Near Term (1-3 months) in Roadmap Jan 19, 2023
@teemow teemow added the needs/refinement Needs refinement in order to be actionable label Feb 2, 2023
@teemow
Member Author

teemow commented Feb 2, 2023

This needs refinement, or we might just close this issue. There are other issues as well: #1791, #1661.

@puja108 @alex-dabija please figure out whether this overall story is needed, and if so, refine it with what needs to be done.

@teemow
Member Author

teemow commented Feb 7, 2023

@puja108
Member

puja108 commented Feb 7, 2023

Trying to clean up here; the main issues that need working on are:

  1. Define what a KaaS "release" is and its interface, on the one side to customers and on the other side to internal teams: https://github.com/giantswarm/giantswarm/issues/25443 (I'd count the process of releasing as part of this)
  2. Based on that, define how an upgrade would work UX-wise (no issue AFAIK, or is that part of the above already?)
  3. Separately define how we roll out MC components behind CAPI clusters, e.g. CAPI controllers (AFAIK no issue, or is that Workload cluster release process (with CAPI) #1791?)
  4. Work towards the ability to auto-upgrade (Automatic Cluster Upgrades #1661)

IMO the first 3 are on the critical path and needed for us to be comfortable with KaaS going forward. The 4th can be iterated on going into the future with some experience from the first upgrades, but needs to be considered in items 1 and 2 so we do not block our way here.

Also related to this will be our interaction processes around GitOps repos: https://github.com/giantswarm/giantswarm/issues/24072

cc @alex-dabija does that sound right to you? Did I forget anything or is there an issue I did not find and link here?

@architectbot architectbot added the team/planeteers Team Planeteers (Customer Success & Product Management) label Feb 9, 2023
@puja108
Member

puja108 commented Feb 13, 2023

Closing this issue as we are tracking the main work now within the release revamp stream; some short-term work around GitOps is handled in the ticket around interaction processes mentioned above. The rest (e.g. UX-related topics that got discussed above) will find its way into the iterations we will do around releases and upgrades, and doesn't need this issue kept open.

@puja108 puja108 closed this as completed Feb 13, 2023
@github-project-automation github-project-automation bot moved this from Near Term (1-3 months) to Released in Roadmap Feb 13, 2023
@alex-dabija

cc @alex-dabija does that sound right to you? Did I forget anything or is there an issue I did not find and link here?

Yes, sounds good.
