A Prometheus server runs in the CI cluster and is configured to create alerts on top of prow metrics. By clicking on the expr
field of any alert, we can view the query that is set up for alerting. For more information on alerts, see the Prometheus docs.
Possible reactions to some of these alerts:
These should not be a problem in general, but if any of them persists for more than a couple of hours, max_goroutines
can be increased to allow more parallelism in the operators (note that the same option applies to both operators).
It may also be that the operators are lagging due to slow responses from Jenkins. We can figure out whether prow requests to Jenkins are slow by looking at the following metrics:
This is the apdex score for GET request latencies from prow to Jenkins where we assume that most requests will have 1s RTT and tolerate up to 2.5s of RTT.
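As a rough illustration of how such a score behaves, the sketch below computes an apdex value from a list of raw latencies using the 1s and 2.5s thresholds mentioned above. The function name and the sample latencies are hypothetical, and the real alert is evaluated by Prometheus over histogram buckets rather than raw samples, so treat this only as a way to build intuition about the number.

```python
from typing import Iterable

# Thresholds taken from the text: requests up to 1s count as satisfied,
# requests between 1s and 2.5s count as tolerated.
SATISFIED_SECONDS = 1.0
TOLERATED_SECONDS = 2.5


def apdex(latencies: Iterable[float],
          satisfied: float = SATISFIED_SECONDS,
          tolerated: float = TOLERATED_SECONDS) -> float:
    """Return (satisfied + tolerated / 2) / total for the given latencies."""
    samples = list(latencies)
    if not samples:
        raise ValueError("need at least one latency sample")
    satisfied_count = sum(1 for s in samples if s <= satisfied)
    tolerated_count = sum(1 for s in samples if satisfied < s <= tolerated)
    return (satisfied_count + tolerated_count / 2) / len(samples)


# Hypothetical sample: two fast requests, one tolerated, one too slow.
score = apdex([0.2, 0.8, 1.5, 3.0])
print(score)
```

A score close to 1 means nearly all requests finish within 1s; a score that keeps dropping suggests Jenkins is responding slowly.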
Another possible mitigation for slow syncs is to shard the operators further by spinning up a new deployment of jenkins_operator
and tweaking its label selector to handle some of the load of the operator that experiences slow syncs. We would also need to change the label selector of the slow operator and add the appropriate labels to some of the jobs it handles.
Today, we use the following mappings between Jenkins operators and masters:
- jenkins-operator manages https://ci.openshift.redhat.com/jenkins/ via master=ci.openshift.redhat.com labels.
- jenkins-dev-operator manages https://ci.dev.openshift.redhat.com/jenkins/ via master=ci.dev.openshift.redhat.com labels.
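To make the sharding idea concrete, the following sketch models the mapping above as plain equality-based label selectors and checks which operator would pick up a job with a given set of labels. The dictionary layout and the function name are hypothetical and only cover simple key=value matching; they are not how prow itself evaluates selectors.

```python
from typing import Dict, List

# Mapping from the text: each operator handles jobs carrying its master label.
OPERATOR_SELECTORS: Dict[str, Dict[str, str]] = {
    "jenkins-operator": {"master": "ci.openshift.redhat.com"},
    "jenkins-dev-operator": {"master": "ci.dev.openshift.redhat.com"},
}


def operators_for(job_labels: Dict[str, str],
                  selectors: Dict[str, Dict[str, str]] = OPERATOR_SELECTORS) -> List[str]:
    """Return the operators whose selector is fully matched by the job labels."""
    return [
        name
        for name, required in selectors.items()
        if all(job_labels.get(key) == value for key, value in required.items())
    ]


# A job labeled for the dev master is handled only by jenkins-dev-operator.
print(operators_for({"master": "ci.dev.openshift.redhat.com"}))
```

When sharding, a quick check like this helps confirm that every job still matches exactly one operator: an empty result means a job would be orphaned, and more than one match means two operators would compete for it.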
- Errors in tests managed by jenkins-dev-operator
- Errors in tests managed by jenkins-operator
- Failed Jenkins requests from jenkins-operator
- Failed Jenkins requests from jenkins-dev-operator
Errors in tests mean that either an underlying infrastructure failure is blocking tests from executing correctly, or the tests are executing correctly but a problem in the infrastructure prevents the operators from picking up the results. More often than not, this is an issue with Jenkins.
Failed requests to Jenkins are usually a problem with Jenkins and less often a misconfiguration in prow (e.g. wrong Jenkins credentials). It is also possible that Jenkins is overwhelmed by the number of jobs it is running. In that case max_concurrency
can be decreased to free up capacity in Jenkins.
- Failures in postsubmit tests managed by jenkins-operator
- Failures in postsubmit tests managed by jenkins-dev-operator
- Failures in batch tests managed by jenkins-operator
- Failures in batch tests managed by jenkins-dev-operator
These alerts are usually triggered by flaky tests, but keep in mind that they may also come from infrastructure failures. The only thing that can be done in this case is to triage the failures, open issues in their respective repositories, and nag people to fix them. We need to be especially cautious about failures in batch tests: consecutive failures in batch tests mean we are not merging at a satisfactory rate.
Use the following links to triage these alerts:
https://prow.svc.ci.openshift.org/?type=postsubmit
https://prow.svc.ci.openshift.org/?type=batch
TODO: Forward alerts via e-mail.