Gate after commit and before deployment #870

SnoopyCoder · 2021-02-05T18:44:04Z

SnoopyCoder
Feb 5, 2021

I wanted to understand the best way to extend the GitOps toolkit with a mechanism for configuring a manual approval step (event-based) or a maintenance time window (e.g. only at night between 2 and 5 am), so that a cluster owner can be involved before an update is deployed.

My use case: a SaaS provider (vendor company) has a marketplace where clients (other companies) can buy apps/services which shall be operated by the vendor in a K8s cluster on the client's premises. Once the client buys an offer from the marketplace, an automated agent in the cloud updates the deployment spec in a client-specific Git repo. The GitOps system within the client's K8s cluster recognizes the changed spec, retrieves the necessary artifacts and applies/deploys the changes. If the offer addresses a regulated domain (like a classified medical product), the update must not be automatic. The client needs to know upfront and have control over what is changed and when it is changed in order to keep the system up and running during critical time periods and prevent unexpected disruptions.

The commit/merge is either done by a human or by an automated business process (e.g. after a client has selected a marketplace offer to be installed on his cluster).

Ideally it would also be configurable whether the approval is requested before pulling the image from the registry or after. The latter is useful in case the images are large and should not be downloaded until the approval has been given.

One idea of @stefanprodan was to extract Flagger's manual gating feature into a dedicated controller so that any toolkit component could be gated in such a fashion. For example, when source-controller detects a new commit, instead of creating an artifact, it will call the gate hook and wait until the gate is opened. Once the gate is opened, it will generate the artifact, then kustomize/helm controllers will reconcile it. The same applies to image automation: the controller will not push to upstream until a human opens the gate.

I'm not deep into the inner workings of Flux and cannot really judge (yet) whether this idea would be the right approach for my use case. I hope that this discussion arouses broad interest in such a capability (which would probably also be useful in many other scenarios) and that a simple solution will be found.

stefanprodan · 2021-03-02T09:46:14Z

stefanprodan
Mar 2, 2021
Maintainer

[RFC] Manual Gating

Motivation

Flux watches sources (e.g. GitRepositories, HelmRepositories, S3-compatible Buckets, ImageRepositories) and automatically reconciles the changes onto clusters as described with Flux Kustomizations and HelmReleases.
The teams involved in the delivery process (e.g. dev, qa, sre) can decide when changes are delivered to production by reviewing and approving the proposed changes in a collaborative manner with pull requests.
Once a pull request is merged onto a branch that defines the desired state of the production system, Flux kicks off the reconciliation process.

There are situations when users want to have a gating mechanism after the cluster state changes are merged in Git:

Manual approval of container image updates (e.g. classified medical products)
Manual approval of infrastructure upgrades (e.g. Flux cli reconcile force if suspended #959)
Maintenance window (e.g. Maintenance Window for Helm Controller Upgrades #1004)
Planned releases
No Deploy Friday

Proposed solution

In order to support manual gating, the GitOps Toolkit could be extended with a dedicated API and controller that would allow users to define Gate objects and perform operations like open and close.

A Gate object could be referenced in sources (Buckets, Git, Helm, Image Repositories) and syncs (Kustomizations, HelmReleases, ImageUpdateAutomation) to block the reconciliation until the gate is opened.

A Gate can be opened or closed by annotating the object with a timestamp or by calling a specific webhook receiver exposed by notification-controller.

A Gate can be configured to automatically close or open based on a time window defined in the Gate spec.

The Gate API would replace Flagger's current manual gating mechanism.

Example

Define a gate that automatically closes after 1h from the time it has been opened:

apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
spec:
  interval: 30s
  default: closed
  window: 1h

When the gate is created in-cluster, the gating-controller uses spec.default to set the Opened condition:

apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  conditions:
    - lastTransitionTime: "2021-03-26T10:09:26Z"
      message: "Gate closed by default"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened
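
As a rough illustration of this step, the initial condition could be derived from spec.default as sketched below. This is not controller code: the function name is hypothetical, and the message wording for the "opened" case is an assumption, since the proposal only shows the "closed" variant.

```python
from datetime import datetime, timezone


def initial_condition(default: str, now: datetime) -> dict:
    """Build the initial Opened condition from spec.default (sketch)."""
    if default not in ("opened", "closed"):
        raise ValueError(f"unsupported default: {default!r}")
    timestamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "type": "Opened",
        "status": "True" if default == "opened" else "False",
        "reason": "ReconciliationSucceeded",
        # Wording follows the "closed" example; the "opened" text is assumed.
        "message": f"Gate {default} by default",
        "lastTransitionTime": timestamp,
    }


created_at = datetime(2021, 3, 26, 10, 9, 26, tzinfo=timezone.utc)
condition = initial_condition("closed", created_at)
print(condition["status"], "-", condition["message"])
```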

While the gate is closed, all the objects that reference it will wait for an approval:

apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  gates:
    - name: sre-approval
    - name: qa-approval
status:
  conditions:
    - lastTransitionTime: "2021-03-26T10:09:26Z"
      message: "Reconciliation is waiting approval, gate 'flux-system/sre-approval' is closed."
      reason: GateClosed
      status: "False"
      type: Approved
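
The check behind this Approved condition might look like the sketch below. It assumes that every referenced gate has to be open before reconciliation proceeds, which the example suggests but the proposal does not spell out. The function name, the "GatesOpened" reason and the treatment of unknown gates as closed are all hypothetical.

```python
def approval_condition(gate_names, open_gates, namespace="flux-system"):
    """Return an Approved condition; all referenced gates must be open."""
    for name in gate_names:
        # A gate missing from open_gates is treated as closed (assumption).
        if name not in open_gates:
            return {
                "type": "Approved",
                "status": "False",
                "reason": "GateClosed",
                "message": (
                    "Reconciliation is waiting approval, "
                    f"gate '{namespace}/{name}' is closed."
                ),
            }
    return {
        "type": "Approved",
        "status": "True",
        "reason": "GatesOpened",  # hypothetical reason, not in the proposal
        "message": "All referenced gates are open.",
    }


waiting = approval_condition(["sre-approval", "qa-approval"], open_gates={"qa-approval"})
print(waiting["message"])
```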

The SRE team can open the gate either by annotating the gate or by calling the notification-controller webhook:

kubectl -n flux-system annotate --overwrite gate/sre-approval \
open.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"

The gating-controller extracts the ISO8601 date from the open.gate annotation value, sets the requestedAt & resetToDefaultAt, and opens the gate for the specified window:

apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:00:00Z"
  resetToDefaultAt: "2021-03-26T11:00:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:00:00Z"
      message: "Gate scheduled for closing at 2021-03-26T11:00:00Z"
      reason: ReconciliationSucceeded
      status: "True"
      type: Opened
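
The arithmetic behind requestedAt and resetToDefaultAt can be sketched as follows. The helper names are hypothetical, and the window parser only accepts a single unit (s, m or h), which is a simplification of the duration strings a real controller would likely support.

```python
import re
from datetime import datetime, timedelta, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_window(value: str) -> timedelta:
    """Parse a single-unit duration such as '30s', '15m' or '24h'."""
    match = re.fullmatch(r"(\d+)([smh])", value)
    if match is None:
        raise ValueError(f"unsupported window: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{UNITS[unit]: int(amount)})


def open_gate(requested_at: str, window: str) -> dict:
    """Compute the status fields that follow an open request (sketch)."""
    start = datetime.strptime(requested_at, TIME_FORMAT).replace(tzinfo=timezone.utc)
    reset = (start + parse_window(window)).strftime(TIME_FORMAT)
    return {
        "requestedAt": requested_at,
        "resetToDefaultAt": reset,
        "message": f"Gate scheduled for closing at {reset}",
    }


print(open_gate("2021-03-26T10:00:00Z", "1h")["resetToDefaultAt"])
```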

While the gate is opened, all the objects that reference it are approved to reconcile at their configured interval.

The SRE can decide to close the gate ahead of its schedule with:

kubectl -n flux-system annotate --overwrite gate/sre-approval \
close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"

The gating-controller extracts the ISO8601 date from the close.gate annotation value, compares it with the open.gate & requestedAt date and closes the gate:

apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:10:00Z"
  resetToDefaultAt: "2021-03-26T10:10:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:10:00Z"
      message: "Gate close requested"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened
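
One way to read the comparison of the two annotations is "the most recent request wins". The sketch below assumes exactly that, plus a tie-break in favour of closing; neither rule is stated in the proposal, and the function name is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def _parse(value: str) -> datetime:
    return datetime.strptime(value, TIME_FORMAT).replace(tzinfo=timezone.utc)


def requested_state(open_at: Optional[str], close_at: Optional[str]) -> Optional[str]:
    """Return 'opened', 'closed' or None, based on the newest annotation."""
    if open_at is None and close_at is None:
        return None  # no request yet, spec.default applies
    if close_at is None:
        return "opened"
    if open_at is None:
        return "closed"
    # On a tie the close request wins (assumption: closing is the safer choice).
    return "closed" if _parse(close_at) >= _parse(open_at) else "opened"


print(requested_state("2021-03-26T10:00:00Z", "2021-03-26T10:10:00Z"))
```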

Objects that reference this gate will finish their ongoing reconciliation (if any), then pause.

To enforce a maintenance window of 24 hours, you can define a Gate that's opened by default:

apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: maintenance
  namespace: flux-system
spec:
  interval: 30s
  default: opened
  window: 24h

To start the maintenance window you can annotate the gate with:

kubectl -n flux-system annotate --overwrite gate/maintenance \
close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"

The gating-controller extracts the ISO8601 date from the close.gate annotation value and closes the gate for the specified window:

apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: maintenance
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:00:00Z"
  resetToDefaultAt: "2021-03-27T10:00:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:00:00Z"
      message: "Gate scheduled for opening at 2021-03-27T10:00:00Z"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened

You could also schedule "No Deploy Fridays" with a CronJob that closes the maintenance gate at 0 0 * * FRI.
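
To make the effect of that schedule concrete: a close request at Friday 00:00 combined with the 24h window keeps the gate closed until Saturday 00:00. The sketch below computes this, assuming the CronJob schedule is evaluated in UTC; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

FRIDAY = 4  # datetime.weekday(): Monday is 0


def next_friday_midnight(now: datetime) -> datetime:
    """Next firing time of '0 0 * * FRI' strictly after `now`."""
    days_ahead = (FRIDAY - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


def in_no_deploy_window(now: datetime, window: timedelta = timedelta(hours=24)) -> bool:
    """True while the gate closed at the last Friday 00:00 is still closed."""
    days_since = (now.weekday() - FRIDAY) % 7
    start = (now - timedelta(days=days_since)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return start <= now < start + window


wednesday = datetime(2021, 3, 24, 15, 0, tzinfo=timezone.utc)
print(next_friday_midnight(wednesday).isoformat())
```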

15 replies

mikesir87 Oct 16, 2021

Just a quick question about the spec... what does the interval specify in this context, seeing that the other objects would be referencing the Gate?

stefanprodan Oct 17, 2021
Maintainer

As with all other Flux custom resources, the interval sets the control loop timer at which a resource is reconciled. In this case, the interval determines how often the current gate status is evaluated, e.g. if the window has expired, the gate is closed, and so on.
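
A minimal sketch of what each such evaluation might compute, under a simplified model: a request flips the gate away from its default for the length of the window, after which it falls back to the default. The function name and the model itself are assumptions, not the controller's actual logic.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_open(default: str, requested_at: Optional[datetime],
            window: timedelta, now: datetime) -> bool:
    """Evaluate the gate at `now`, as one tick of the control loop would."""
    default_open = default == "opened"
    if requested_at is None:
        return default_open
    if requested_at <= now < requested_at + window:
        return not default_open  # still inside the requested window
    return default_open  # window expired (or not started): back to default


closed_at = datetime(2021, 3, 26, 10, 0, tzinfo=timezone.utc)
for hours in (0, 12, 24):
    tick = closed_at + timedelta(hours=hours)
    print(tick.isoformat(), is_open("opened", closed_at, timedelta(hours=24), tick))
```

With interval: 30s, an expired window would be noticed at most about 30 seconds late under this model.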

jjcallis Nov 25, 2021

Has this got a proposed date for implementation/release :)?

HerrmannHinz May 12, 2022

yes please. upvote.

SnoopyCoder Jul 7, 2022
Author

Don't you think that this mechanism contradicts the GitOps concept? Everything from git has to be reconciled to a cluster. Git is a source of truth, right? So if a manifest (e.g. helm release) pushed to the cluster I expect it will be deployed to the cluster as soon as possible. However, this gate creates a config drift between git and cluster state.

This is true, the config drift is not a good side effect. But in regulated contexts this is better than not being able to use K8s with GitOps at all. The gating mechanism is only meant to delay a part of the reconciliation temporarily until the customer has formally approved it. In the rare case that approval is not given within a defined time window, dropping the unapproved part of the Git config needs to be considered.

I like the selector approach because it would allow gating only certain workloads in the Git config (e.g. the ones under medical product regulation) while letting other parts reconcile without gating. In my understanding, the Git config would be pulled completely, and only the gated and not yet approved parts would be held back until approval.

bburky · 2021-03-04T15:08:39Z

bburky
Mar 4, 2021

I would like to see release gating implemented in conjunction with #820. I would like to be able to see a diff of what is about to be applied before it happens, with a manual trigger to apply it.

0 replies

mbrancato · 2021-08-16T11:50:18Z

mbrancato
Aug 16, 2021

I want to add a couple use-cases here where I think having an approval / gate would be helpful. These are just some ideas how it could work.

First, for multiple environments and using Kustomize. If there is a single set of manifests and kustomize is used to set values by environment (dev/prod/etc) then it would be reasonable to deploy the changes to dev and gate the deploy to prod. Since it would only be a single set of manifests, they couldn't otherwise be gated / versioned. This could be configured at the cluster / Flux level by having some flag set that says whether that implementation should auto deploy on changes to the git repo or wait for approval.

The other use-case is rolling updates across multiple environments. If we have 15 environments, all sharing the same manifests, this lets us roll out to 1, 2, 4, 8, etc environments instead of all at once.
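
The doubling rollout can be sketched as a simple batching function. The function and environment names are hypothetical, and how each batch would be gated is left open.

```python
def rollout_batches(environments):
    """Split environments into batches of size 1, 2, 4, 8, ..."""
    batches = []
    size = 1
    index = 0
    while index < len(environments):
        batches.append(environments[index:index + size])
        index += size
        size *= 2
    return batches


envs = [f"env-{n:02d}" for n in range(1, 16)]  # 15 environments
for number, batch in enumerate(rollout_batches(envs), start=1):
    print(f"wave {number}: {len(batch)} environment(s)")
```

With 15 environments this yields four waves (1, 2, 4 and 8), so each wave could open the gate for only the clusters in its batch.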

I have concerns about how to track this approval in git. If the Gate resource defines what should happen (and likely to which type or specific resources), I would think on reconciliation, that Flux would need to create a new GateRequest resource for each instance and push these to git. Someone would then need to edit these to open the gate, and merge back to git.

So the overall process might look like where a PR is opened, and new code is merged to the default git branch. The development environment has no deployment Gate defined (or it is designed to auto-open), and goes ahead and deploys (deploys, does image update automation, etc). The production environment has a Gate defined that triggers a GateRequest object to be committed back to the git repo. A new PR is opened to update the GateRequest object to set the status to open. Once merged, the deploy is triggered. Flux is smart enough to not create a GateRequest when there is no update to resources needed or for commits that only touch GateRequest objects.
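
The last rule could be approximated by looking at the paths a commit touches. This is purely illustrative: the function name and the gate-requests/ directory are hypothetical, and nothing in Flux defines where GateRequest objects would live.

```python
def needs_gate_request(changed_paths, gate_request_dir="gate-requests/"):
    """Decide whether a commit should trigger a new GateRequest (sketch).

    Returns False when nothing changed or when the commit only touches
    GateRequest objects, which are assumed to live under gate_request_dir.
    """
    relevant = [path for path in changed_paths if not path.startswith(gate_request_dir)]
    return bool(relevant)


print(needs_gate_request(["apps/my-app/deployment.yaml"]))
print(needs_gate_request(["gate-requests/prod.yaml"]))
```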

1 reply

hassenius Jul 8, 2022

The multi-cluster use case is the one that is relevant for me too.
In my case I'd like an external system to control the rollout using arbitrary rules as to how it is rolled out. This would mean that a branch structure is not useful, as this is too static (sometimes I want rolling updates, sometimes one side of a pair at a time, sometimes a whole pair, etc -- all of which is trivial to manage in an external system).
The various gate objects discussed in this thread seem like a good interface for the external system.

Also, I think it's appropriate to track the approvals in the external system rather than in git, as git holds the desired state, but not necessarily the current state, which is already the case when using Rolling Updates with Deployment controller, and similar "controlled rollout" mechanisms within a cluster, such as Flagger. I'm just looking to achieve the same between multiple clusters.

jjcallis · 2021-10-15T15:47:06Z

jjcallis
Oct 15, 2021

That is a really good solution! great connotation to explain it to stakeholders too. Gate closed no deploys, Gate opened code deployed 😛 when is this going in?

0 replies

jonathan-innis · 2022-01-13T18:52:45Z

jonathan-innis
Jan 13, 2022

I am thinking about the Gating scenario from the perspective of the admin persona. If I am an admin, I may want to gate my app developers from deploying during a particular timeframe, so I want to add a Gate to the Kustomizations that I want to prevent from deploying within this particular timeframe. However, how do I prevent my app developer from coming in and overriding the Gate? I am assuming in this model, app developers do not have access to their own Kustomizations, otherwise they are free to change which gates are managing them.

My main point is that this model doesn't give control to app developers over their Kustomizations with the gating mechanism, otherwise they can simply circumvent it.

This model only works where an admin persona is defining Kustomizations within an elevated namespace (say flux-system) and the app developers can only control what exists in their GitHub/directory.

2 replies

SnoopyCoder Jul 7, 2022
Author

From my point of view this is not a problem. The Git repo always shows who has made changes (like an audit trail). The important aspect of the regulated scenario I have described is that the change is not made by a human being working for the software provider (vendor). Instead, a formally reviewed part of the system automatically requests approval from a human being working for the software consumer (client) and applies the change based on the client's reaction. The ultimate decision on whether the regulated system is changed at all, and at what point in time, needs to be made by the client, and this needs to be properly documented, i.e. visible in the history of the Git repo and/or an additional audit trail (e.g. HIPAA).

Technically, the developers of the vendor might be able to circumvent or override the gate, but the point is that this is not allowed and will be visible in the Git repo history (and/or separate audit trail). It can have severe legal consequences and the software provider has to ensure that developers are aware (trained) and behave accordingly.

Whether a gate is required for a certain service (e.g. a particular namespace) should be a configuration aspect specific to this service (owned by the DevOps team of this service, i.e. the owners of this namespace) and not an admin config which the team cannot influence. The admins might not even know the characteristics of each workload, e.g. whether it is classified as a medical product or not. This would require the involvement of unnecessary people without creating value and could create a bottleneck.

pjbgf Sep 29, 2022
Maintainer

I think that being able to enforce a given Gate over different Kustomization objects, or across an entire Flux instance, could be quite important, especially in highly regulated environments or in multi-tenancy deployments. But given the flexibility of the approach proposed above, I think that instead of handling this type of scenario within Flux, we could handle it via Admission Controllers (e.g. Kyverno), which is generally what is used for that level of compliance enforcement.

uberspot · 2022-06-21T13:38:06Z

uberspot
Jun 21, 2022

This proposal seems great. Should the implementation be tracked in a ticket? Any appetite for this?

1 reply

SnoopyCoder Jul 7, 2022
Author

We don't have a solution/workaround in place yet and really want this feature :-) So, yes, from my point of view it would be great to track the progress of this item, evolve the good solution proposal a bit further into the final concept considering also the input from the latest comments, and then start implementation.

saintskeeper · 2022-07-09T11:33:49Z

saintskeeper
Jul 9, 2022

Would be nice if this gate could use cosign to validate the Signing keys on the gates.

conditions:
   - gpgCheck: true 
   - trustedKeys: trusted-keys-secret

3 replies

AndreasM009 Sep 8, 2022

Is someone working on that?

saintskeeper Sep 8, 2022

I know that Kyverno can now use Sigstore for images, and you can write your custom policies. I ended up using our automation engine with gitsign and Rekor to solve this in the meantime.

AndreasM009 Sep 8, 2022

Do you know if someone is working on the Gate topic?

mayurigupta13 · 2022-09-18T22:04:04Z

mayurigupta13
Sep 18, 2022

I have a use case coming in from a customer. This customer is an ISV in a regulated medical environment and has many client clusters. The applications on these client clusters are deployed and maintained by the ISV using Flux. They are looking for a gating mechanism to control the roll-out of these applications after the PR is merged in the Git repo. There are two reasons for this:

Ensuring that the critical applications are not disrupted and that there are maintenance time windows
The regulated medical environment demands explicit approval for updates.

0 replies

LutzLange · 2022-09-21T14:17:35Z

LutzLange
Sep 21, 2022

I do have a big German Telco that wants this feature as well. They are constructing a platform as a service provider for tenants. Those tenants need to consume platform updates when they are ready to do so. This would help to make that happen.

0 replies

stefanprodan · 2022-09-22T14:07:08Z

stefanprodan
Sep 22, 2022
Maintainer

Hey everyone 👋

We're currently focusing on Flux v2 GA release, there are still some things on the roadmap that we need to finish.

After the GA release, I plan to create an official RFC based on #870 (comment). Given that manual gating is one of the most requested features in Flux, we'll try to prioritise this work after GA.

Thanks to everyone who commented here, I'll do my best to incorporate your feedback in the final proposal, and once that's posted I will ask you to review and comment on the RFC.

2 replies

antonmatsiuk Aug 2, 2023

@stefanprodan what are the plans on the RFC for the "gating" feature?

tropnikovvl Dec 13, 2024

Hi, is there any news?

Galileo1 · 2022-12-17T22:32:03Z

Galileo1
Dec 17, 2022

+1
What's the final say on this feature? Gating is in some cases essential to fulfil audit and compliance requirements as well.

1 reply

saintskeeper Dec 18, 2022

@stefanprodan stated there would be an RFC after GA, on these changes.

MahrRah · 2023-01-13T12:11:58Z

MahrRah
Jan 13, 2023

I ran into the same issue on my last project. We needed a type of maintenance/reconciliation window for the flux reconciliations. So, I came up with a workaround using the flux suspension feature and K8s CronJobs.
If you are interested in the details, I have written a blog post on it here: How to enable reconciliation windows using Flux and K8s native components
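
For readers who want the gist without the post: a workaround of this kind boils down to two scheduled jobs, one running flux resume at the start of the window and one running flux suspend at the end. The sketch below only builds those command lines. The schedules (a 02:00 to 05:00 window), the my-app name and the helper are assumptions and are not taken from the blog post.

```python
def flux_command(action: str, name: str, namespace: str = "flux-system") -> list:
    """Build the flux CLI call a CronJob container could run (sketch)."""
    if action not in ("suspend", "resume"):
        raise ValueError(f"unsupported action: {action!r}")
    return ["flux", action, "kustomization", name, "--namespace", namespace]


# Hypothetical reconciliation window: open at 02:00, close at 05:00.
WINDOW_JOBS = {
    "0 2 * * *": flux_command("resume", "my-app"),
    "0 5 * * *": flux_command("suspend", "my-app"),
}

for schedule, command in WINDOW_JOBS.items():
    print(schedule, "->", " ".join(command))
```

The service account running such jobs would need permission to patch the Kustomization, which is part of what a dedicated Gate API could avoid.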

2 replies

jjcallis Jan 13, 2023

Very nice interim, super clear to understand as well :)

EliiseS Jan 16, 2023

Awesome! Great write up 😄

Replies: 12 comments · 27 replies
