✨ Performance Alerting #2081


Open · dtfranz wants to merge 1 commit into main from metrics-alerting
Conversation

Contributor

@dtfranz dtfranz commented Jul 8, 2025

Description

Introduces a series of early-warning Prometheus alerts to catch performance issues at an early stage of development.

As the e2e tests run, the installed Prometheus instance scrapes metrics from catalogd and operator-controller and fires alerts based on the rules introduced in this PR. Since these tests run on GitHub runners, which do not have consistent performance, the alerts must be based on platform-independent metrics and are therefore limited. Any other ideas for metrics to check on this PR are appreciated!

Once the e2e tests finish, Prometheus is queried for active alerts. Any alert found in the pending state results in a warning on the e2e workflow; any alert in the firing state produces an error. These errors do not (at the moment) fail the run, but they are visible in the workflow details.
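
In practice, the check at the end of the run amounts to hitting the Prometheus HTTP API and translating alert states into workflow annotations. A minimal sketch of the idea (the address and jq filters here are illustrative assumptions, not the PR's actual script):

# Sketch: surface pending/firing Prometheus alerts as GitHub Actions
# annotations. Assumes Prometheus is reachable at localhost:9090.
alerts=$(curl -s "http://localhost:9090/api/v1/alerts")

# Pending alerts become workflow warnings...
echo "${alerts}" | jq -r '.data.alerts[] | select(.state == "pending")
  | "::warning::\(.labels.alertname): \(.annotations.description)"'

# ...and firing alerts become errors (which do not yet fail the run).
echo "${alerts}" | jq -r '.data.alerts[] | select(.state == "firing")
  | "::error::\(.labels.alertname): \(.annotations.description)"'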

For instance:

Prometheus Alert Pending
operator-controller-memory-growth: operator-controller pod memory usage growing at a high rate for 5 minutes: 72.86kB/sec

I am not making this a required check until we have a reasonably good idea of an approximate baseline.

Potential Enhancements:

  • Additional alerts, if any
  • Fine-tune the alerts and fail runs when they fire
  • Remove yaml from script and organize into an additional kustomization component (done)
  • Output metrics as a mermaid XY plot in the workflow summary

Closes #1904
Closes #1905

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@dtfranz dtfranz requested a review from a team as a code owner July 8, 2025 14:50
@openshift-ci openshift-ci bot requested review from perdasilva and trgeiger July 8, 2025 14:50
@dtfranz dtfranz force-pushed the metrics-alerting branch 2 times, most recently from 97a268f to cb81424 Compare July 8, 2025 14:53

netlify bot commented Jul 8, 2025

Deploy Preview for olmv1 ready!

🔨 Latest commit: a1bc7c2
🔍 Latest deploy log: https://app.netlify.com/projects/olmv1/deploys/6874ac51ae93740008e98f8f
😎 Deploy Preview: https://deploy-preview-2081--olmv1.netlify.app

@dtfranz dtfranz force-pushed the metrics-alerting branch from cb81424 to bb8a597 Compare July 9, 2025 02:41

codecov bot commented Jul 9, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.48%. Comparing base (1333f7b) to head (a1bc7c2).
Report is 12 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2081      +/-   ##
==========================================
+ Coverage   73.35%   73.48%   +0.12%     
==========================================
  Files          77       78       +1     
  Lines        7056     7240     +184     
==========================================
+ Hits         5176     5320     +144     
- Misses       1540     1568      +28     
- Partials      340      352      +12     
Flag               Coverage Δ
e2e                43.80% <ø> (-1.18%) ⬇️
experimental-e2e   50.18% <ø> (-1.16%) ⬇️
unit               58.75% <ø> (+0.48%) ⬆️

Flags with carried forward coverage won't be shown.


@dtfranz dtfranz force-pushed the metrics-alerting branch 2 times, most recently from 3339f47 to 7cd03f1 Compare July 10, 2025 08:08
@trgeiger
Contributor

I think it looks good, but does it really close #1905? I would think that issue is more specific to a future iteration of this feature, where we do make the job fail if it hits certain thresholds.

@trgeiger
Contributor

One other thing is I don't have context for the thresholds you chose in the alerts. I see you mention that you don't want this to be required until we have a firmer idea of good baselines--are the current ones based on some previous work or did you just pick some decent-seeming thresholds for all the checks based on your experience? Do we need to queue up additional work to fine-tune these?

@dtfranz
Contributor Author

dtfranz commented Jul 11, 2025

Thanks for taking a look @trgeiger !

For your first point, I agree that the issue definitely indicates that we should fail the CI, but I'm hesitant to do that at the moment without larger group buy-in. I'm happy to keep the issue open and close it after we turn on CI blocking, or to close it with this PR and track a follow-up issue. As long as it's tracked, I'm happy either way.

On your second point, these values are based on my experience running the workflow many times over and checking the metrics. Up to this point, nobody (to my knowledge) has run a more thorough study of v1 performance, not counting @jianzhangbjz and his work on the downstream version of this. These changes will enable us to quickly get a better understanding and make any necessary adjustments.

@jianzhangbjz

Yeah, I haven’t collected the data for the OLMv1 performance baseline yet. I’m planning to reuse https://github.com/cloud-bulldozer/orion to help identify performance issues. The current metrics are being discussed on Slack: link, and progress is being tracked here: https://issues.redhat.com/browse/OCPQE-28161

@trgeiger
Contributor

Cool, that's exactly what I wanted to know re: thresholds. And as for the issue tracking, either solution works--I just wanted to make sure the next iteration was tracked, as you stated. Keeping that issue or opening a new one both sound good to me.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 11, 2025

openshift-ci bot commented Jul 11, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: trgeiger
Once this PR has been reviewed and has the lgtm label, please assign joelanford for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

annotations:
description: "container {{ $labels.container }} of pod {{ $labels.pod }} experienced OOM event(s); count={{ $value }}"
- alert: operator-controller-memory-growth
expr: deriv(sum(container_memory_working_set_bytes{pod=~"operator-controller.*",container="manager"})[5m:]) > 50_000
Contributor

@camilamacedo86 camilamacedo86 Jul 11, 2025


@dtfranz so we are manually defining the thresholds here?
Could we document how it works in https://github.com/operator-framework/operator-controller/blob/main/docs/contribute/developer.md? WDYT?
Not a blocker for this one for sure

Contributor Author


Are you talking about adding notes to explain these specific queries? I'm happy to do that somewhere if so, but if you mean explaining how to adjust/create rules I'd prefer to link to the official prometheus docs.
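
For what it's worth, a threshold like the memory-growth one can be sanity-checked by hand against the e2e cluster's Prometheus. A sketch, with the namespace and service name assumed (prometheus-operator creates a prometheus-operated service by default):

# Evaluate the memory-growth expression manually; values are bytes/sec.
kubectl -n olmv1-system port-forward svc/prometheus-operated 9090:9090 &
curl -s --get "http://localhost:9090/api/v1/query" \
  --data-urlencode 'query=deriv(sum(container_memory_working_set_bytes{pod=~"operator-controller.*",container="manager"})[5m:])' \
  | jq '.data.result[].value'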

$(KUSTOMIZE) build config/prometheus | CATALOGD_SERVICE_CERT=$(shell kubectl get certificate -n olmv1-system catalogd-service-cert -o jsonpath={.spec.secretName}) envsubst '$$CATALOGD_SERVICE_CERT' | kubectl apply -f -
kubectl wait --for=condition=Ready pods -n $(PROMETHEUS_NAMESPACE) -l app.kubernetes.io/name=prometheus-operator --timeout=60s
kubectl wait --for=create pods -n $(PROMETHEUS_NAMESPACE) prometheus-prometheus-0 --timeout=60s
kubectl wait --for=condition=Ready pods -n $(PROMETHEUS_NAMESPACE) prometheus-prometheus-0 --timeout=120s
Contributor

@camilamacedo86 camilamacedo86 Jul 11, 2025


Wouldn't it be better to centralise the Prometheus installation and related configurations in the hack directory? It might help keep things more organised and easier to understand.

Contributor


Actually, I would prefer it to be part of the existing e2e manifests, since this is something we are planning to do for our e2e's.

Contributor Author


The script was getting way too big, and only had one or two operations that justified it as a script. The prometheus yaml is way more readable and maintainable in a kustomization manifest, IMO.

@tmshort I initially wanted to add these manifests to the e2e collection as you mentioned, but the catalogd certificate generated by certmanager is named catalogd-service-cert-v1.3.0-68-g7cd03f1-dirty (at least, for me), which doesn't give me confidence that I can just hard-code it and hope it keeps working. Unless you know what exactly that name comes from?

Contributor Author


Also @tmshort, if this sufficiently explains why I can't do as you mentioned here, I'd appreciate it if we could drop the hold, but if you can think of a way around the issue I'd be more than happy to give it a try!

Contributor Author


Figured out the secret name stuff but ran into other issues; see here
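
For reference, the runtime lookup performed by the Makefile line quoted above boils down to a single kubectl call; shown standalone here as a sketch (resource names taken from that excerpt):

# Resolve the cert-manager-generated Secret name at runtime instead of
# hard-coding it.
secret_name=$(kubectl get certificate -n olmv1-system catalogd-service-cert \
  -o jsonpath='{.spec.secretName}')
echo "catalogd serving-cert secret: ${secret_name}"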

Contributor

@camilamacedo86 camilamacedo86 left a comment


I'm generally okay with the approach here, and we can continue improving it step by step through follow-ups. (I added just one nit). Otherwise, LGTM

Honestly, I prefer this incremental method — it also makes it easier for others to contribute along the way. I think it would be nice if we could get a review from @tmshort as well.

@tmshort
Contributor

tmshort commented Jul 11, 2025

/hold
I'm going to ask you to move config/prometheus into config/overlays/prometheus or to make it a component that is built as part of the e2e manifests. This later option means moving the files into config/components/e2e/prometheus and then including into the config/components/e2e/kustomization.yaml file. However, this may be tricky to do until #2088 is done.
My preference would be for this to all be included as part of the e2e manifests, rather than something done separately, but again, it might have to wait until #2088 is done.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 11, 2025

Makefile Outdated
curl -s "https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/refs/tags/$(PROMETHEUS_VERSION)/bundle.yaml" > "$(TMPDIR)/bundle.yaml"; \
(cd $(TMPDIR) && $(KUSTOMIZE) edit set namespace $(PROMETHEUS_NAMESPACE)) && kubectl create -k "$(TMPDIR)"
kubectl wait --for=condition=Ready pods -n $(PROMETHEUS_NAMESPACE) -l app.kubernetes.io/name=prometheus-operator
$(KUSTOMIZE) build config/prometheus | CATALOGD_SERVICE_CERT=$(shell kubectl get certificate -n olmv1-system catalogd-service-cert -o jsonpath={.spec.secretName}) envsubst '$$CATALOGD_SERVICE_CERT' | kubectl apply -f -
Contributor


The name of this secret ought to be fixed, so you shouldn't have to extract it?

Contributor Author

@dtfranz dtfranz Jul 14, 2025


This secret is generated by certmanager at runtime, after installation, so it can't be predetermined (unless you know of a way). EDIT: I've noticed the name does seem to stay as catalogd-service-cert-v1.3.0-68-g7cd03f1-dirty for me, but without knowing how exactly that name is generated I'd rather not set it as such here.

Contributor Author


NVM, I found it, I'll try and work that in.

Contributor Author


OK so it's possible to do the same sed "s/cert-git-version/cert-$(VERSION)/g" for these manifests, but the next issue we face putting the manifests in config/components/e2e is that it's not possible to put the prometheus CRDs and CRs into the same manifest that gets installed by install.tpl.sh, otherwise you get no matches for kind "Prometheus" errors. The prometheus-operator install and metrics yaml need to happen at different steps. That probably means adding additional content to the install.tpl.sh script.

Unless you know a better way, I'm thinking I could add this to install.tpl.sh:

if [[ -n "${PROMETHEUS_VERSION}" ]]; then
  TMPDIR=$(mktemp -d)
  trap 'echo "Cleaning up ${TMPDIR}"; rm -rf "${TMPDIR}"' EXIT
  curl -s "https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/refs/tags/${PROMETHEUS_VERSION}/kustomization.yaml" > "${TMPDIR}/kustomization.yaml"
  curl -s "https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/refs/tags/${PROMETHEUS_VERSION}/bundle.yaml" > "${TMPDIR}/bundle.yaml"
  (cd "${TMPDIR}" && ${KUSTOMIZE} edit set namespace "${PROMETHEUS_NAMESPACE}") && kubectl create -k "${TMPDIR}"
  kubectl wait --for=condition=Ready pods -n "${PROMETHEUS_NAMESPACE}" -l app.kubernetes.io/name=prometheus-operator
fi

...then have the Makefile set the variable in test-e2e here:

test-e2e: PROMETHEUS_VERSION := v0.83.0

I kind of hesitate to add anything to the install script that's for test only, though 🫤
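
Incidentally, the no matches for kind "Prometheus" race mentioned above can also be sidestepped by waiting for the CRDs to become established before applying any CRs; a sketch, using the standard prometheus-operator CRD name (an assumption, not something this PR adds):

# Block until the Prometheus CRD is registered, so a subsequent apply of
# Prometheus CRs doesn't race the CRD installation.
kubectl wait --for=condition=Established \
  crd/prometheuses.monitoring.coreos.com --timeout=60s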

@dtfranz dtfranz force-pushed the metrics-alerting branch from 7cd03f1 to a611452 Compare July 14, 2025 06:13
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 14, 2025

openshift-ci bot commented Jul 14, 2025

New changes are detected. LGTM label has been removed.

Introduces an early-warning series of prometheus alerts to attempt to catch issues with performance at an early stage in development.

Signed-off-by: Daniel Franz <[email protected]>
@dtfranz dtfranz force-pushed the metrics-alerting branch from a611452 to a1bc7c2 Compare July 14, 2025 07:05
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Select detection criteria for CI failure of e2e metrics job
Create/Modify upstream CI job
5 participants