
Future of Upgrades with CAPI and Gitops #1031

Closed
teemow opened this issue Apr 12, 2022 · 34 comments
Assignees
Labels
kind/cross-team Epics that span across teams needs/refinement Needs refinement in order to be actionable team/planeteers Team Planeteers (Customer Success & Product Management) topic/capi

Comments

@teemow
Member

teemow commented Apr 12, 2022

We have two major things that change the way users upgrade clusters: CAPI and GitOps.

  • with Cluster API we stopped tying the controller versions on the management cluster to the tenant cluster release (already decided and implemented)
  • with Cluster API we would like to move towards a more seamless approach and get rid of tenant cluster releases (future)
  • with GitOps users can't use our CLIs or the web UI to upgrade clusters
  • how do we display potential changes for GitOps users upfront?
  • how do we show status and progress of an upgrade to users?
@cornelius-keller
Contributor

This is becoming a topic in customer and internal communication, so I think we should raise the priority of this one.
For visibility:

Current workflow and discussion with customer:

  • we rolled out new k8s versions in a semi-automated way, with a script creating PRs to the individual cluster definitions
  • the customer merged all PRs. All clusters are on v1.22.9 now.
  • customer feedback is that they would like fewer PRs, for example by upgrading the base in a single step. We need to discuss how we can test / stage this and whether they really want to upgrade all the clusters at the same time.

Internal discussion / questions arising:

  • What is the process of rolling out a new k8s version?
  • How do we push security fixes to the customer? Do we need approval in this case?
  • How do we ensure compatibility and sync of API server flags, kubelet configuration, and (default) apps like CoreDNS, CNI and ingress, now that we don't have k8scloudconfig locking all the versions?
  • How will we do release notes? Can we include them somehow in the PR to the customer?

FYI @gianfranco-l @alex-dabija

@MarcelMue

Here are my 2 cents from the Honey Badger PoV:

  • Kubernetes versions should be defaulted in the individual cluster-provider apps. E.g. cluster-openstack in version 1.2.3 would have Kubernetes version 1.24.25 as its default. Later we provide a new cluster-openstack release with a newer Kubernetes version.
  • Because these provider apps are Giant Swarm Apps, we can utilize the same mechanism we use for automatic app upgrades to roll out upgrades of these apps to the GitOps repo. AKA Flux will create PRs for these new versions when they are available (see the sketch below). We already have some docs on how this works here: https://github.com/giantswarm/gitops-template/blob/main/docs/apps/automatic_updates_appcr.md
  • The only prerequisite for automatic app updates is that the apps (e.g. cluster-openstack) are pushed to an OCI registry (just a flag has to be set in the architect-orb)

-> If a customer wants to override the provider app's default k8s version, then they themselves have to take care of managing the upgrades for those clusters, or eventually remove that overlay to get the default k8s version again.
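
To make the mechanism concrete, here is a minimal sketch of what an automatically updated App CR in the GitOps repo could look like. It loosely follows the pattern from the gitops-template doc linked above, but the catalog, namespace, and image-policy names are placeholders, not the exact values from our setup:

```yaml
# Hypothetical App CR for a workload cluster in the GitOps repo.
# Flux's image automation rewrites the version field in place based on
# the inline marker and commits the change (or pushes a branch that
# becomes a PR).
apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  name: my-cluster
  namespace: org-example
spec:
  catalog: cluster
  name: cluster-openstack
  namespace: org-example
  kubeConfig:
    inCluster: true
  version: 1.2.3 # {"$imagepolicy": "flux-system:cluster-openstack:tag"}
```

The `:tag` suffix in the marker tells Flux to write only the tag of the latest matching artifact into the field, which is what makes this work for a plain version string rather than a full image reference.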

@MarcelMue

MarcelMue commented May 18, 2022

Side note: the automatic app upgrades can also be pushed directly to main, e.g. for dev clusters -> no PR required.

@kopiczko

kopiczko commented May 24, 2022

This is what we came up with during today's refinement (you can download the .svg file and edit it in https://draw.io).

The general idea is to have a customer-facing meta-app giantswarm-openstack (the name is subject to change) which would encapsulate everything else. So changing giantswarm-openstack would change Kubernetes, the OS, and all the underlying apps. Kind of what we have right now, but without all the versions of the controllers running at the same time.

The general idea is that patch versions (maybe minors as well) should be rolled out to GitOps repos automatically by some automation.

The exact versioning rules should be refined as well, if the concept is satisfying.

I hope the picture illustrates that:

[diagram: giantswarm-openstack]
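
To make the encapsulation idea more tangible, here is a rough sketch of the meta-app packaged as a Helm chart with the two existing apps as dependencies. The names, versions, and registry URL are purely illustrative:

```yaml
# Hypothetical Chart.yaml for the giantswarm-openstack meta-app.
# Bumping this single chart version moves Kubernetes, the OS image and
# all default apps together, because the parts are pinned right here.
apiVersion: v2
name: giantswarm-openstack
version: 2.0.0                     # the one version customers interact with
dependencies:
  - name: cluster-openstack        # defines the k8s and OS versions
    version: 1.2.3
    repository: oci://example.azurecr.io/giantswarm
  - name: default-apps-openstack   # CNI, CoreDNS, ingress, ...
    version: 4.5.6
    repository: oci://example.azurecr.io/giantswarm
```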

@kopiczko

Just in case zipped svg:
giantswarm-openstack.svg.zip

@cornelius-keller
Contributor

Just had another chat with the customer:

  • automated commits to the git repository by Flux without a PR are probably okay
  • cluster upgrades in batches are preferred
  • I recommended upgrading staging clusters first and then production, but this is not necessarily a requirement.
    Question for @giantswarm/team-honeybadger: is there another way to group cluster upgrades than having a base per stage/group?
    In the current repository layout we have one base per datacenter or region. So clusters are grouped by this, not by staging and production.

@MarcelMue

@kopiczko I don't understand why this added app is needed. I am still missing the motivation / explanation for why efforts here are needed.

@cornelius-keller Yes, bases can be structured in any granularity. We need to have an example, but you can have one base per environment. Therefore we can update them separately and automatically (see the sketch below).
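
For illustration, a possible layout with one base per environment, so the automation can bump versions in staging independently of production; the directory names are invented and not the gitops-template's exact structure:

```
bases/
  environments/
    staging/
      cluster_appcr.yaml    # app version bumped here first by automation
      kustomization.yaml
    production/
      cluster_appcr.yaml    # bumped in a later batch
      kustomization.yaml
management-clusters/
  mc1/
    staging/cluster-a/      # overlay referencing bases/environments/staging
    production/cluster-b/   # overlay referencing bases/environments/production
```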

@kopiczko

During the refinement we worked with these assumptions:

  • Managing all the different app versions is confusing for the customer
  • Compatibility between cluster-openstack and default-apps-openstack is not guaranteed for all versions
  • We want to bake k8s version into the release version
  • We want to bake OS version into the release version (with Ubuntu or any other updateable OS it's tricky)

@kopiczko

Talking with @MarcelMue on Slack, I have a feeling this is about GitOps repo structure again. And it has largely derailed from what @teemow described in the first place.

So currently GitOps repo bases are split by DCs instead of stages because we have different image IDs and external network IDs between them. This can be partially solved by referencing images by name in CAPO 0.6, but I guess there will always be differences between DCs. E.g. we can't force customers to have the same external network name in OpenStack, and I'm sure there will be more to it.

When it comes to GitOps repo structure, I think we need to find a way to structure by both DCs and stages.
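
One conceivable way to cover both dimensions is to keep DC-specific and stage-specific settings in separate bases and compose them per cluster; a rough sketch with invented names:

```
bases/
  dc/
    dc1/            # image and external network IDs specific to this DC
    dc2/
  stages/
    staging/        # app versions that get rolled out here first
    production/
clusters/
  dc1-staging/cluster-a/
    kustomization.yaml   # composes bases/dc/dc1 + bases/stages/staging
```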

This of course doesn't solve the original problems:

  • how do we display potential changes for GitOps users upfront?
  • how do we show status and progress of an upgrade to users?

@MarcelMue

When it comes to GitOps repo structure, I think we need to find a way to structure by both DCs and stages.

100% agreed, we need to find this structure.

+1 on finding something for the original problems.

@Oshratn
Contributor

Oshratn commented May 30, 2022

@piontec this discussion seems to be tangential to the one we had in SIG Product sync today.
Could you please add a disclaimer in the GitOps template repo that indicates the level of maturity that already exists, to provide clarity for the other teams?

@MarcelMue

@cornelius-keller AFAIK there was a meeting with Rocket on Monday. Can you post a summary here if there were any new decisions?

@piontec

piontec commented Jun 2, 2022

@piontec this discussion seems to be tangential to the one we had in SIG Product sync today. Could you please add a disclaimer in the GitOps template repo that indicates the level of maturity that already exists, to provide clarity for the other teams?

I've added the link to issues in progress there, but this discussion of whether or not to package cluster apps into another layer of an app is irrelevant to this.

As for the discussion of upgrades itself, there's an RFC: giantswarm/rfc#36

@piontec

piontec commented Jul 1, 2022

Hey, the RFC is closed now. It explains how an app (any app, including a cluster app bundle) can be auto-upgraded with Flux. Questions about displaying progress need to be handled separately, as Flux only takes care of discovering a new version, doing the upgrade, and storing the change in the repo.
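
For reference, the Flux side of such an auto-upgrade is wired up with the image-automation controllers. A minimal sketch, assuming the app versions are published as tags of an OCI artifact that the image-reflector can list; the repository path, names, and branches are placeholders:

```yaml
# Watches the (placeholder) OCI repository for new tags of the cluster app.
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageRepository
metadata:
  name: cluster-openstack
  namespace: flux-system
spec:
  image: example.azurecr.io/giantswarm/cluster-openstack
  interval: 10m
---
# Decides which of the discovered tags counts as "the newest version".
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImagePolicy
metadata:
  name: cluster-openstack
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: cluster-openstack
  policy:
    semver:
      range: ">=1.0.0"
---
# Writes the new version back into the GitOps repo. Pushing to a separate
# branch enables PR-based review; pushing straight to main skips the PR
# (the dev-cluster case mentioned earlier in this thread).
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageUpdateAutomation
metadata:
  name: app-upgrades
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: gitops-repo
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        name: flux
        email: flux@example.com
      messageTemplate: "automated app upgrade"
    push:
      branch: flux-app-updates
  update:
    path: ./management-clusters
    strategy: Setters
```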

@gianfranco-l

gianfranco-l commented Jul 1, 2022

Created a dedicated roadmap ticket for the feedback part, I guess we can close this ticket?

@piontec

piontec commented Jul 4, 2022

I think we still need some input from KaaS teams about presentation and feedback for upgrades, right?

@gianfranco-l

@piontec we meant for the feedback and presentation part to be tackled separately here, right?

@kopiczko

kopiczko commented Jul 5, 2022

We tried enabling the diff action a while back and we didn't find it very useful. Here's an example:

https://github.com/TheHutGroup/gs-clusters/pull/303#issuecomment-1158673903 (I have to scroll up a bit after clicking on the URL), look for this:

Wrong issue, sorry

@cornelius-keller
Contributor

For the record, also for @MarcelMue: the decisions from the meeting with the customer are tracked here: https://github.com/giantswarm/thg/issues/113

@marians
Member

marians commented Jul 8, 2022

How has the question regarding the design of the app bundles been settled?

Pawel drafted something where a giantswarm-openstack version would define the cluster-openstack version and the default-apps-openstack version.

Is this how it's going to work throughout CAPI providers?

@kopiczko

kopiczko commented Jul 8, 2022

How has the question regarding the design of the app bundles been settled?

I don't think this is something we agreed on.

Is this how it's going to work throughout CAPI providers?

Feels dead simple unless I don't understand the question. giantswarm-openstack, giantswarm-gcp, ...


BTW, yesterday I already had doubts about cluster-openstack and default-apps-openstack compatibility. I'm convinced that if we follow this path of independent cluster-* and default-apps-* app upgrades, it will become a nightmare.

@marians
Member

marians commented Jul 8, 2022

So if this hasn't been decided, I have questions.

  • What's the alternative to the threefold app bundle design (the one described here by Pawel)? Just cluster-<provider> plus default-apps-<provider>, independent of each other?
  • Do we agree that a decision is due, as customers are using OpenStack and GCP already?

Currently in Rainbow this is blocking part of #1192 as it means

  • we cannot display whether an upgrade is available for a cluster
  • we don't know what to display as the defining version(s) of a workload cluster instead of the vintage Release

@kopiczko

kopiczko commented Jul 8, 2022

  • What's the alternative to the threefold app bundle design (the one described here by Pawel)? Just cluster-<provider> plus default-apps-<provider>, independent of each other?

I believe the alternative is to have independent apps.

@marians
Member

marians commented Jul 11, 2022

Adding this to the KaaS sync board for this Thursday.

@gawertm

gawertm commented Nov 28, 2022

@pipo02mix FYI

@teemow teemow added the kind/cross-team Epics that span across teams label Nov 28, 2022
@teemow teemow added this to Roadmap Nov 28, 2022
@pipo02mix
Contributor

As part of this ticket we have enabled customers/us to roll out changes across regions and stages independently with GitOps. At the same time, we have included a default version in the cluster-PROVIDER app here, as Marcel explained. But we found out that the Kubernetes and OS versions are tied together in CAPI, and we could serve a conversion table to enhance the UX (see the illustration below). For that we have created this RFC to align across teams.

In that PR ⬆️ there have been some discussions that I think should be moved here (@puja108 @AverageMarcus) or turned into an RFC to discuss them. How can we progress towards the proposed goals?
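
To illustrate the conversion-table idea, here is a hypothetical shape such a table could take; the version pairs are made up:

```yaml
# Hypothetical k8s-to-OS conversion table. Since the two versions are
# tied together in CAPI, each supported Kubernetes version maps to
# exactly one OS image, and tooling can resolve one from the other.
kubernetesVersions:
  "1.23.14":
    osImage: flatcar-stable-3227.2.2
  "1.24.8":
    osImage: flatcar-stable-3374.2.1
```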

@puja108
Member

puja108 commented Nov 30, 2022

But we found out that the Kubernetes and OS versions are tied together in CAPI and we could serve a conversion table to enhance the UX. For that we have created giantswarm/rfc#54 to align across teams.

My point is rather that we did not want to offer such freedom of choice (at least in the beginning), so we would set those versions via the app/chart (and have them tied to the app version). To me that is already the enhanced UX; what else does the customer need there? Whether we use a conversion table in the background for ourselves is an implementation detail to me. The UX towards the customer is rather that they know that a certain app version has a certain K8s and OS version, just like they currently know that a certain WC release has a certain K8s and OS version.

If customers have a good reason to override our defaults, I'd like us to talk to them and see why that is and whether the use case is valid for our offering; only then would I consider a decision to open up things we did not open up before. I would however like to avoid that, especially in line with this story here: as long as the app version is the only thing that depicts certain defaults we come with, we have a way to test and roll out upgrades automatically, also with GitOps. If we suddenly open up tons of configuration in there, the upgrade process will always require manual intervention and changes to the configuration.

@pipo02mix
Contributor

But customers need to control the k8s version for different reasons, for example API deprecations (PSP now, but I am pretty sure there will be more in the future) or vendor dependencies (we have a customer whose software only runs on certain k8s versions).

We cannot have only a single cluster-PROVIDER release if we support multiple k8s minors, since every release would overwrite the previous version. In GitOps we will control the k8s version using bases (one for every stage/env, like Honey Badger has proposed), so according to customer criteria we can propose (automatically) the new versions.

I can imagine the flow like this (a sketch of the resulting change follows after the list):

  1. There is a new k8s patch version for all supported k8s minor versions (with some CVEs patched)
  2. Our automation builds new OS images with the new versions and triggers CI for testing
  3. The cluster-PROVIDER app is released with the patched images (for the two maintained minor k8s versions)
  4. Our automation creates PRs for all customer dev clusters when the date criteria match
  5. Our automation merges the PRs and the clusters are upgraded

(the same will happen with the prod env when the date window criteria are fulfilled)
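
A sketch of what the automation's PR in step 4 could touch, assuming one base per stage pins the cluster-PROVIDER app version; the path and versions are invented:

```yaml
# Fragment of a hypothetical bases/stages/dev/cluster_appcr.yaml.
# The PR is essentially a one-line version bump in the dev base; the
# prod base keeps the old version until its maintenance window opens.
spec:
  name: cluster-openstack
  version: 1.2.4   # was 1.2.3; bumped by the upgrade automation
```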

The logic about which k8s minor a customer supports (based on customer tests; for example, they don't support 1.25 yet because they still have PSP) lives in the upgrade automation metadata. The same goes for the other criteria (when the maintenance window is, which stages we upgrade first, ...).

We must not use cluster-PROVIDER app releases to coordinate all the cluster upgrade logic (we will need a higher level). More info in the RFC.

@teemow teemow moved this to Near Term (1-3 months) in Roadmap Dec 5, 2022
@puja108
Member

puja108 commented Dec 6, 2022

For now that sounds like an OK solution.

Going forward we need to think more deeply about how we want to version and roll out, and yes, we do need some kind of releases, even within a continuously rolled out product. As Timo suggested in the other thread, we might want to consider going away from offering 2 minor versions as stable and more towards having channels like alpha, beta, RC, stable.

I do currently see the customer value in being able to pin the k8s minor to latest-1 and being able to cherry-pick patches for customers who are not ready to upgrade yet. This would also reduce dependencies for other teams to roll out their patches and updates, as long as they also work on the older k8s version, so we can reduce operative load. We need to see how the models would fit together going forward. I'd just not like us to rush it too much, as we are still a bit in flux on where and how we integrate and create boundaries between current and also future teams.

@gawertm gawertm moved this from Near Term (1-3 months) to Under Consideration in Roadmap Jan 19, 2023
@gawertm gawertm moved this from Under Consideration to Near Term (1-3 months) in Roadmap Jan 19, 2023
@teemow teemow added the needs/refinement Needs refinement in order to be actionable label Feb 2, 2023
@teemow
Member Author

teemow commented Feb 2, 2023

This needs refinement, or we might just close this issue. There are other issues as well: #1791, #1661.

@puja108 @alex-dabija please figure out whether this overall story is needed, and if so, refine it with what needs to be done.

@teemow
Member Author

teemow commented Feb 7, 2023

@puja108
Member

puja108 commented Feb 7, 2023

Trying to clean up here; the main issues that need working on are:

  1. Define what a KaaS "release" is and its interface, on the one side to customers and on the other side to internal teams: https://github.com/giantswarm/giantswarm/issues/25443 (I'd count the process of releasing as part of this)
  2. Based on that, define how an upgrade would work UX-wise (no issue AFAIK, or is that part of the above already?)
  3. Separately define how we roll out MC components behind CAPI clusters, e.g. CAPI controllers (AFAIK no issue, or is that Workload cluster release process (with CAPI) #1791?)
  4. Work towards the ability to auto-upgrade (Automatic Cluster Upgrades #1661)

IMO the first 3 are on the critical path and needed for us to be comfortable with KaaS going forward. The 4th can be iterated on going into the future with some experience from the first upgrades, but needs to be considered in items 1 and 2 so we do not block our way here.

Also related to this will be our interaction processes around GitOps repos: https://github.com/giantswarm/giantswarm/issues/24072

cc @alex-dabija does that sound right to you? Did I forget anything or is there an issue I did not find and link here?

@architectbot architectbot added the team/planeteers Team Planeteers (Customer Success & Product Management) label Feb 9, 2023
@puja108
Member

puja108 commented Feb 13, 2023

Closing this issue as we are tracking the main work now within the release revamp stream; some short-term work around GitOps is handled in the ticket around interaction processes mentioned above. The rest (e.g. UX-related topics that got discussed above) will find its way into the iterations we will do around releases and upgrades, and doesn't need this issue kept open.

@puja108 puja108 closed this as completed Feb 13, 2023
@github-project-automation github-project-automation bot moved this from Near Term (1-3 months) to Released in Roadmap Feb 13, 2023
@alex-dabija

cc @alex-dabija does that sound right to you? Did I forget anything or is there an issue I did not find and link here?

Yes, sounds good.
