Improve Prow cluster management #824
Comments
There already seems to be some kind of check-prow-config job. Maybe this can be used to block PRs until the config is correct, but it needs to be double-checked whether this works as expected 🤔
Check prow config just validates that the config is syntactically correct and won't explode Prow when deployed. It does nothing (or very little) to address the config otherwise. I do agree wholeheartedly that PR merging -> config deployment should be automated rather than being independent operations. We may not need a test cluster to deploy to, as, if properly automated, we can just revert the config and manually merge that to restore the cluster, but whether we need a canary cluster is up for discussion.
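As an illustration, here is a minimal sketch of wiring checkconfig up as a pre-merge gate. It assumes the checkconfig binary is available in the job image and that the repo lays out its Prow config as config.yaml, plugins.yaml and jobs/; those paths and flags are assumptions, not verified against this repo:

```python
#!/usr/bin/env python3
"""Run Prow's checkconfig against the files in a PR checkout.

Sketch only: assumes the checkconfig binary is on PATH and that the
config files are named config.yaml, plugins.yaml and jobs/ (assumed
paths). A non-zero exit lets a presubmit block merging a broken config.
"""
import subprocess
import sys

CMD = [
    "checkconfig",
    "--config-path=config.yaml",     # main Prow config (assumed path)
    "--plugin-config=plugins.yaml",  # plugin config (assumed path)
    "--job-config-path=jobs",        # job definitions (assumed path)
    "--strict",                      # treat warnings as failures
]

def main() -> int:
    # checkconfig prints its findings itself; we only propagate the exit code.
    return subprocess.run(CMD).returncode

if __name__ == "__main__":
    sys.exit(main())
```

As noted above, this only confirms that Prow will parse the config; it says nothing about whether the config does what we want once deployed.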
/triage accepted
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /lifecycle stale
/remove-lifecycle stale
@Rozzii: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/assign
We also need some mechanism to verify that the configs are actually working. This is currently done by the person applying the config and monitoring the outcome. If we auto-deploy untested but technically correct-looking config, we can brick the cluster.
True! I think the k/k prow has alerting configured and a team that is responsible for checking it. One thing I forgot:
Well, we also have an open issue for implementing the missing monitoring, and we probably need an issue for alerting. :) I'm just thinking whether we can do some automated checking beforehand to catch at least the low-hanging failures; then we can handle the rest with monitoring and alerting.
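One cheap "checking beforehand" step of that kind could be a server-side dry run, which catches schema and admission errors without touching the live objects. A rough sketch, assuming kubectl access to the Prow cluster and a manifests/ directory (both placeholders, not this repo's actual layout):

```python
#!/usr/bin/env python3
"""Validate manifests against the live API server without applying them.

Sketch only: the manifests/ path is a placeholder for wherever the
cluster manifests live in the repo.
"""
import subprocess
import sys

def dry_run(path: str = "manifests/") -> int:
    # --dry-run=server sends the objects through API-server validation
    # and admission, but does not persist any changes to the cluster.
    cmd = ["kubectl", "apply", "--dry-run=server", "-R", "-f", path]
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    sys.exit(dry_run())
```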
Currently we only figure out that something is wrong once it has been failing long enough. For example, when pod scheduling timeouts become common, we know CAPO has gone belly up, or if the bot doesn't respond to keywords, we know the tokens have failed. I think we need this monitoring/alerting part even more than the automatic config applying, even though the latter being missing has annoyed me for the longest time.
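Until proper monitoring/alerting lands, even a crude periodic check for the symptoms mentioned above (e.g. Pods stuck in Pending because scheduling is failing) would help. A sketch, assuming kubectl access and that the job Pods run in a namespace named test-pods; the namespace and threshold are assumptions:

```python
#!/usr/bin/env python3
"""Fail if Pods have been stuck in Pending for too long.

Sketch only: the namespace and the 15 minute threshold are assumptions,
and "alerting" here is just a non-zero exit that a periodic job can
surface as a failure.
"""
import json
import subprocess
import sys
from datetime import datetime, timedelta, timezone

NAMESPACE = "test-pods"          # assumed namespace for ProwJob pods
MAX_PENDING = timedelta(minutes=15)

def stuck_pods() -> list[str]:
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", NAMESPACE, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    now = datetime.now(timezone.utc)
    stuck = []
    for pod in json.loads(out)["items"]:
        if pod["status"].get("phase") != "Pending":
            continue
        created = datetime.fromisoformat(
            pod["metadata"]["creationTimestamp"].replace("Z", "+00:00"))
        if now - created > MAX_PENDING:
            stuck.append(pod["metadata"]["name"])
    return stuck

if __name__ == "__main__":
    pods = stuck_pods()
    if pods:
        print("Pods pending too long:", ", ".join(pods))
        sys.exit(1)
```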
Issue created: #896
Summary of sub-tasks so far:
Current Situation
Currently there are no clear instructions on when or how to update the Prow cluster (besides a small note in the prow README: "Apply the changes and then create a PR with the changes."). This can lead to the actual configuration in the repository and the live cluster diverging, for example when two people work on the cluster at the same time and overwrite each other's work. We also recently saw a case with image bumps where there was no clear process, leaving one PR hanging and main diverged from the live cluster.
Potential Solution
What would be beneficial is a process so that all updates are handled in one consistent way, plus some automation to support it.
One idea for the automation would be to apply changes automatically; this of course carries the risk of a bad change breaking the automation itself. Another approach would be to simply check the diff between the live cluster and a PR and only allow the merge when the PR's changes can be found in the cluster, or to have a periodic job that alerts whenever there is a diff between main and the live cluster.
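As a sketch of the "diff main against the live cluster" idea, the check below assumes the Prow config lives in a ConfigMap named config in the prow namespace under the key config.yaml, with the same file checked into main; all of those names are assumptions to be adjusted to the actual deployment:

```python
#!/usr/bin/env python3
"""Fail if the Prow config in the live cluster differs from main.

Sketch only: the ConfigMap name, namespace, data key and local file
path are assumptions about the deployment layout. Could run either as
a periodic alerting job or as a PR gate.
"""
import subprocess
import sys

CONFIGMAP = "config"        # assumed ConfigMap holding the Prow config
NAMESPACE = "prow"          # assumed namespace
LOCAL_PATH = "config.yaml"  # config file checked into main (assumed)

def live_config() -> str:
    # '\.' escapes the dot in the data key 'config.yaml' for jsonpath.
    cmd = ["kubectl", "get", "configmap", CONFIGMAP, "-n", NAMESPACE,
           "-o", r"jsonpath={.data.config\.yaml}"]
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout

if __name__ == "__main__":
    with open(LOCAL_PATH) as f:
        local = f.read()
    if live_config().strip() != local.strip():
        print("Live Prow config has drifted from main")
        sys.exit(1)
    print("Live Prow config matches main")
```

Run periodically, a non-zero exit would surface drift as a failing job; run as a presubmit, it would block merging a PR whose changes have not yet been applied to the cluster.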