
Improve Prow cluster management #824

Open · NymanRobin opened this issue Jul 25, 2024 · 13 comments
Labels
• help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
• kind/feature: Categorizes issue or PR as related to a new feature.
• lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
• triage/accepted: Indicates an issue is ready to be actively worked on.

Comments

@NymanRobin (Member)

Current Situation

Currently there are no clear instructions on when or how to update the Prow cluster (besides a small note in the prow README: "Apply the changes and then create a PR with the changes."). This can lead to the configuration in the repository and the live cluster diverging, for example when two people work on the cluster at the same time and overwrite each other's work. We also saw this recently with image bumps, where the lack of a clear process left one PR hanging and main diverged from the live cluster:

  1. PR was merged without applying: Update k8s-prow images as needed #777
  2. PR was on hold waiting for someone to apply: Update k8s-prow images as needed #802

Potential Solution

What would be beneficial is a single process through which all updates are handled, plus some automation to support it.
One idea is to apply changes automatically, though this carries the risk of a bad change breaking the automation itself. Another is to diff the live cluster against a PR and only allow merging once the PR's changes are present in the cluster, or to run a periodic job that alerts whenever main and the live cluster differ (a sketch of such a job follows below).
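As a rough illustration of the periodic-alert idea, a Prow periodic job could diff the manifests on main against the live cluster; the job name, image, and kustomize directory below are assumptions, not the repository's actual layout:

```yaml
periodics:
  - name: periodic-prow-config-drift   # hypothetical name
    interval: 24h
    decorate: true
    extra_refs:
      - org: metal3-io
        repo: project-infra
        base_ref: main
    spec:
      containers:
        - image: bitnami/kubectl:latest  # any image with kubectl works
          command:
            - /bin/sh
            - -c
            # kubectl diff exits non-zero when live objects differ from
            # the manifests on main, failing the job and surfacing the drift
            - kubectl diff -k prow/manifests/
```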

@metal3-io-bot added the needs-triage label Jul 25, 2024
@NymanRobin (Member, Author)

There already seems to be some kind of check-prow-config job:
https://prow.apps.test.metal3.io/view/s3/prow-logs/pr-logs/pull/metal3-io_project-infra/821/check-prow-config/1815332764979826688

Maybe this can be used to block PRs until the config is correct, but it needs to be double-checked whether this works as expected 🤔
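For reference, such jobs typically wrap Prow's own checkconfig tool, which statically validates the config and plugin files. A minimal presubmit sketch, where the image tag and config paths are assumptions:

```yaml
presubmits:
  metal3-io/project-infra:
    - name: check-prow-config
      always_run: true
      decorate: true
      spec:
        containers:
          - image: gcr.io/k8s-prow/checkconfig:latest   # upstream checkconfig image
            command:
              - checkconfig
            args:
              - --config-path=prow/config.yaml     # assumed path
              - --plugin-config=prow/plugins.yaml  # assumed path
              - --strict                           # also fail on warnings
```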

@tuminoid (Member) commented Aug 5, 2024

check-prow-config just validates that the config is syntactically correct and won't blow up Prow when deployed. It does nothing (or very little) to validate the config otherwise.

I wholeheartedly agree that PR merging -> config deployment should be automated rather than two independent operations. We may not need a test cluster to deploy to: if the automation is done properly, we can just revert the config and manually merge that to restore the cluster. Whether we need a canary cluster is up for discussion.

@Rozzii (Member) commented Aug 7, 2024

/triage accepted

@metal3-io-bot added the triage/accepted label and removed the needs-triage label Aug 7, 2024
@metal3-io-bot (Collaborator)

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot added the lifecycle/stale label Nov 5, 2024
@Rozzii (Member) commented Nov 6, 2024

/remove-lifecycle stale
/lifecycle frozen
/kind feature
/help

@metal3-io-bot (Collaborator)

@Rozzii:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/remove-lifecycle stale
/lifecycle frozen
/kind feature
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@metal3-io-bot added the kind/feature, help wanted, and lifecycle/frozen labels and removed the lifecycle/stale label Nov 6, 2024
@lentzi90 (Member) commented Nov 8, 2024

/assign
There are a couple of actions to get this implemented:

  • Separate out the credentials from the kustomizations. (We cannot have the automated tool use the kustomizations in the repo when they reference credentials that are not committed in git.) For this we can use External Secrets together with the OpenStack integration. This is also how the k/k prow instance does it (except not on top of OpenStack).
  • Add a postsubmit job that applies the manifests. Since the secrets are separated out, this becomes straightforward: the job only needs in-cluster credentials to do the deployment. (Both pieces are sketched below.)
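
Two hedged sketches of these pieces. First, an ExternalSecret as defined by the External Secrets Operator, pulling a Prow credential from an external store; the store name, remote key, and target secret name below are assumptions, not the actual setup:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: github-token          # assumed target secret name
  namespace: prow
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: openstack-backend   # assumed SecretStore backed by the OpenStack integration
    kind: ClusterSecretStore
  target:
    name: github-token        # Secret the kustomizations can then reference
  data:
    - secretKey: token
      remoteRef:
        key: prow/github-token   # assumed key in the external store
```

Second, a postsubmit that applies the manifests on merge to main; the job name, path filter, service account, and kustomize directory are likewise assumptions:

```yaml
postsubmits:
  metal3-io/project-infra:
    - name: post-project-infra-deploy-prow   # assumed name
      branches:
        - ^main$
      run_if_changed: '^prow/'               # assumed path filter
      decorate: true
      spec:
        serviceAccountName: prow-deployer    # in-cluster credentials via RBAC
        containers:
          - image: bitnami/kubectl:latest
            command:
              - kubectl
            args:
              - apply
              - -k
              - prow/manifests/              # assumed kustomize directory
```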

@Rozzii moved this from Backlog to MISC WIP in Metal3 - Roadmap Nov 8, 2024
@tuminoid (Member) commented Nov 8, 2024

> There are a couple of actions to get this implemented:
>
> • Separate out the credentials from the kustomizations. (We cannot have the automated tool use the kustomizations in the repo when they reference credentials that are not committed in git.) For this we can use External Secrets together with the OpenStack integration. This is also how the k/k prow instance does it (except not on top of OpenStack).
> • Add a postsubmit job that applies the manifests. Since the secrets are separated, this is now straightforward. The job only needs in-cluster credentials for doing the deployment.

We also need some mechanism to verify that the configs actually work. Today this is done by the person applying the config and monitoring the outcome. If we auto-deploy untested but technically correct-looking config, we can brick the cluster.

@lentzi90 (Member) commented Nov 8, 2024

True! I think the k/k prow has alerting configured and a team that is responsible for checking those.
But I will check that as well 🙂

One thing I forgot:

  • Declarative management of CAPI/CAPO and cert-manager. They are currently handled imperatively with clusterctl. (One possible approach is sketched below.)
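
One declarative option (not settled in this thread) is the Cluster API Operator, which manages providers through CRDs instead of clusterctl commands; a minimal sketch, assuming the operator is installed, with an illustrative version pin:

```yaml
# Declares CAPO declaratively; the operator reconciles the installation.
apiVersion: operator.cluster.x-k8s.io/v1alpha2
kind: InfrastructureProvider
metadata:
  name: openstack          # CAPO's provider name
  namespace: capo-system   # assumed namespace
spec:
  version: v0.10.0         # illustrative version pin
```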

@tuminoid (Member) commented Nov 8, 2024

> True! I think the k/k prow has alerting configured and a team that is responsible for checking those. But I will check that as well 🙂

Well, we also have an open issue for implementing the missing monitoring, and we probably need an issue for alerting. :)

I'm just wondering whether we can do some automated checking beforehand to catch at least the low-hanging failures; then we can handle the rest with monitoring and alerting.

@tuminoid (Member) commented Nov 8, 2024

Currently, we only figure out that something is wrong once it has been failing long enough. For example, when pod scheduling timeouts become common, we know CAPO has gone belly up, or when the bot doesn't respond to keywords, we know the tokens have failed. I think we need the monitoring/alerting part even more than the automatic config applying, even though the latter's absence has annoyed me for the longest time.
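
Until proper monitoring exists, even a crude periodic probe could surface such failures earlier. A minimal sketch, assuming a Prow periodic job; the job name and probe command are hypothetical:

```yaml
periodics:
  - name: periodic-prow-health-probe   # hypothetical name
    interval: 1h
    decorate: true
    spec:
      containers:
        - image: curlimages/curl:latest
          command:
            - /bin/sh
            - -c
            # A non-200 response fails the job, which shows up on the Prow
            # dashboard (and could page once alerting exists).
            - curl --fail --silent --show-error --max-time 30 https://prow.apps.test.metal3.io/
```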

@lentzi90 (Member) commented Nov 8, 2024

Issue created: #896
We also have an internal ticket for it from before.

@lentzi90 (Member)

Summary of sub-tasks so far:
