✨ Performance Alerting #2081
base: main
@@ -0,0 +1,8 @@
apiVersion: v1
kind: Secret
type: kubernetes.io/service-account-token
metadata:
  name: prometheus-metrics-token
  namespace: system
  annotations:
    kubernetes.io/service-account.name: prometheus
@@ -0,0 +1,34 @@
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: catalogd-controller-manager-metrics-monitor
  namespace: system
spec:
  endpoints:
    - path: /metrics
      port: metrics
      interval: 10s
      scheme: https
      authorization:
        credentials:
          name: prometheus-metrics-token
          key: token
      tlsConfig:
        # NAMESPACE_PLACEHOLDER replaced by replacements in kustomization.yaml
        serverName: catalogd-service.NAMESPACE_PLACEHOLDER.svc
        insecureSkipVerify: false
        ca:
          secret:
            # CATALOGD_SERVICE_CERT must be replaced by envsubst
            name: catalogd-service-cert-git-version
            key: ca.crt
        cert:
          secret:
            name: catalogd-service-cert-git-version
            key: tls.crt
        keySecret:
          name: catalogd-service-cert-git-version
          key: tls.key
  selector:
    matchLabels:
      app.kubernetes.io/name: catalogd
@@ -0,0 +1,40 @@
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubelet
  namespace: system
  labels:
    k8s-app: kubelet
spec:
  jobLabel: k8s-app
  endpoints:
    - port: https-metrics
      scheme: https
      path: /metrics
      interval: 10s
      honorLabels: true
      tlsConfig:
        insecureSkipVerify: true
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      metricRelabelings:
        - action: keep
          sourceLabels: [pod,container]
          regex: (operator-controller|catalogd).*;manager
    - port: https-metrics
      scheme: https
      path: /metrics/cadvisor
      interval: 10s
      honorLabels: true
      tlsConfig:
        insecureSkipVerify: true
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      metricRelabelings:
        - action: keep
          sourceLabels: [pod,container]
          regex: (operator-controller|catalogd).*;manager
  selector:
    matchLabels:
      k8s-app: kubelet
  namespaceSelector:
    matchNames:
      - kube-system
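Note on the `keep` rule in the kubelet ServiceMonitor: the Python sketch below roughly models how that filter decides which samples survive. It assumes Prometheus's documented relabeling behaviour (source label values joined with the default `;` separator, a missing label treated as an empty string, and the regex required to match the whole joined string). The function name `keep_sample` and the example pod names are hypothetical; this is an illustration, not Prometheus code.

```python
import re

# Same pattern as the `regex` field in the ServiceMonitor above.
KEEP_PATTERN = re.compile(r"(operator-controller|catalogd).*;manager")


def keep_sample(labels, source_labels=("pod", "container"), separator=";"):
    """Return True if a sample with these labels would pass the keep rule."""
    # Missing labels contribute an empty string to the joined value.
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # Relabel regexes are anchored at both ends, hence fullmatch.
    return KEEP_PATTERN.fullmatch(joined) is not None


if __name__ == "__main__":
    examples = [
        {"pod": "catalogd-controller-manager-abc12", "container": "manager"},
        {"pod": "operator-controller-controller-manager-xyz", "container": "manager"},
        {"pod": "operator-controller-controller-manager-xyz", "container": "sidecar"},
        {"pod": "coredns-5d78c9869d-abcde", "container": "coredns"},
    ]
    for labels in examples:
        print(labels["pod"], labels["container"], keep_sample(labels))
```

Only the `manager` container of pods whose names start with `operator-controller` or `catalogd` passes, which keeps the kubelet and cAdvisor series limited to the two controllers.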
@@ -0,0 +1,35 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: olmv1-system
resources:
  - prometheus.yaml
  - catalogd_service_monitor.yaml
  - kubelet_service_monitor.yaml
  - operator_controller_service_monitor.yaml
  - prometheus_rule.yaml
  - auth_token.yaml
  - network_policy.yaml
  - service.yaml
  - rbac
replacements:
  - source:
      kind: ServiceMonitor
      name: catalogd-controller-manager-metrics-monitor
      fieldPath: metadata.namespace
    targets:
      - select:
          kind: ServiceMonitor
          name: catalogd-controller-manager-metrics-monitor
        fieldPaths:
          - spec.endpoints.0.tlsConfig.serverName
        options:
          delimiter: '.'
          index: 1
      - select:
          kind: ServiceMonitor
          name: operator-controller-controller-manager-metrics-monitor
        fieldPaths:
          - spec.endpoints.0.tlsConfig.serverName
        options:
          delimiter: '.'
          index: 1
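Note on the `replacements` block: the `delimiter` and `index` options tell kustomize to overwrite a single dot-separated segment of `serverName` rather than the whole field. The Python sketch below models only that in-range case, and it assumes the source `metadata.namespace` has already been set to `olmv1-system` by the kustomization's `namespace` field. The helper name `replace_segment` is hypothetical and is not part of kustomize.

```python
def replace_segment(target, value, delimiter=".", index=1):
    """Replace one delimiter-separated segment of `target` with `value`."""
    parts = target.split(delimiter)
    if not 0 <= index < len(parts):
        # kustomize treats out-of-range indexes specially; not modelled here.
        raise IndexError(f"index {index} is out of range for {target!r}")
    parts[index] = value
    return delimiter.join(parts)


if __name__ == "__main__":
    namespace = "olmv1-system"  # assumed value of the source metadata.namespace
    for server_name in (
        "catalogd-service.NAMESPACE_PLACEHOLDER.svc",
        "operator-controller-service.NAMESPACE_PLACEHOLDER.svc",
    ):
        print(replace_segment(server_name, namespace))
```

With `index: 1`, only the `NAMESPACE_PLACEHOLDER` segment changes, so the service name and the `.svc` suffix in each `serverName` are left untouched.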
@@ -0,0 +1,16 @@
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus
  namespace: system
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus
  policyTypes:
    - Egress
    - Ingress
  egress:
    - {} # Allows all egress traffic for metrics requests
  ingress:
    - {} # Allows us to query prometheus
@@ -0,0 +1,33 @@
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: operator-controller-controller-manager-metrics-monitor
  namespace: system
spec:
  endpoints:
    - path: /metrics
      interval: 10s
      port: https
      scheme: https
      authorization:
        credentials:
          name: prometheus-metrics-token
          key: token
      tlsConfig:
        # NAMESPACE_PLACEHOLDER replaced by replacements in kustomization.yaml
        serverName: operator-controller-service.NAMESPACE_PLACEHOLDER.svc
        insecureSkipVerify: false
        ca:
          secret:
            name: olmv1-cert
            key: ca.crt
        cert:
          secret:
            name: olmv1-cert
            key: tls.crt
        keySecret:
          name: olmv1-cert
          key: tls.key
  selector:
    matchLabels:
      control-plane: operator-controller-controller-manager
@@ -0,0 +1,18 @@
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: system
spec:
  logLevel: debug
  serviceAccountName: prometheus
  scrapeTimeout: 30s
  scrapeInterval: 1m
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    seccompProfile:
      type: RuntimeDefault
  ruleSelector: {}
  serviceDiscoveryRole: EndpointSlice
  serviceMonitorSelector: {}
@@ -0,0 +1,59 @@
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: controller-alerts
  namespace: system
spec:
  groups:
    - name: controller-panic
      rules:
        - alert: reconciler-panic
          expr: controller_runtime_reconcile_panics_total{} > 0
          annotations:
            description: "controller of pod {{ $labels.pod }} experienced panic(s); count={{ $value }}"
        - alert: webhook-panic
          expr: controller_runtime_webhook_panics_total{} > 0
          annotations:
            description: "controller webhook of pod {{ $labels.pod }} experienced panic(s); count={{ $value }}"
    - name: resource-usage
      rules:
        - alert: oom-events
          expr: container_oom_events_total > 0
          annotations:
            description: "container {{ $labels.container }} of pod {{ $labels.pod }} experienced OOM event(s); count={{ $value }}"
        - alert: operator-controller-memory-growth
          expr: deriv(sum(container_memory_working_set_bytes{pod=~"operator-controller.*",container="manager"})[5m:]) > 50_000
@dtfranz so we are manually defining the thresholds here?
Are you talking about adding notes to explain these specific queries? I'm happy to do that somewhere if so, but if you mean explaining how to adjust/create rules, I'd prefer to link to the official Prometheus docs.
          for: 5m
          keep_firing_for: 1d
          annotations:
            description: "operator-controller pod memory usage growing at a high rate for 5 minutes: {{ $value | humanize }}B/sec"
        - alert: catalogd-memory-growth
          expr: deriv(sum(container_memory_working_set_bytes{pod=~"catalogd.*",container="manager"})[5m:]) > 50_000
          for: 5m
          keep_firing_for: 1d
          annotations:
            description: "catalogd pod memory usage growing at a high rate for 5 minutes: {{ $value | humanize }}B/sec"
        - alert: operator-controller-memory-usage
          expr: sum(container_memory_working_set_bytes{pod=~"operator-controller.*",container="manager"}) > 100_000_000
          for: 5m
          keep_firing_for: 1d
          annotations:
            description: "operator-controller pod using high memory resources for the last 5 minutes: {{ $value | humanize }}B"
        - alert: catalogd-memory-usage
          expr: sum(container_memory_working_set_bytes{pod=~"catalogd.*",container="manager"}) > 75_000_000
          for: 5m
          keep_firing_for: 1d
          annotations:
            description: "catalogd pod using high memory resources for the last 5 minutes: {{ $value | humanize }}B"
        - alert: operator-controller-cpu-usage
          expr: rate(container_cpu_usage_seconds_total{pod=~"operator-controller.*",container="manager"}[5m]) * 100 > 20
          for: 5m
          keep_firing_for: 1d
          annotations:
            description: "operator-controller using high cpu resource for 5 minutes: {{ $value | printf \"%.2f\" }}%"
        - alert: catalogd-cpu-usage
          expr: rate(container_cpu_usage_seconds_total{pod=~"catalogd.*",container="manager"}[5m]) * 100 > 20
          for: 5m
          keep_firing_for: 1d
          annotations:
            description: "catalogd using high cpu resources for 5 minutes: {{ $value | printf \"%.2f\" }}%"
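Note on the memory-growth thresholds: PromQL's `deriv()` is documented as a per-second derivative computed by simple linear regression, so the `> 50_000` comparison is in bytes per second. The Python sketch below shows that calculation on made-up samples. The function name `regression_slope`, the sample values, and the one-minute spacing are assumptions for illustration only; the real subquery resolution depends on the Prometheus evaluation interval, and the `for: 5m` clause is not modelled.

```python
GROWTH_THRESHOLD_BYTES_PER_SEC = 50_000  # same value as the alert expressions


def regression_slope(samples):
    """Least-squares slope of (timestamp_seconds, value) pairs, per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("all samples share the same timestamp")
    return covariance / variance


if __name__ == "__main__":
    # Hypothetical working-set samples: 80 MB growing by 6 MB each minute.
    growing = [(60 * i, 80_000_000 + 6_000_000 * i) for i in range(6)]
    slope = regression_slope(growing)
    print(f"slope: {slope:.0f} B/s, exceeds threshold: "
          f"{slope > GROWTH_THRESHOLD_BYTES_PER_SEC}")
```

Growth of 6 MB per minute works out to 100 kB/s, twice the threshold, so a pod that sustained this rate would trip the memory-growth alerts.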
@@ -0,0 +1,4 @@
resources:
  - prometheus_service_account.yaml
  - prometheus_cluster_role.yaml
  - prometheus_cluster_rolebinding.yaml
@@ -0,0 +1,29 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources:
      - configmaps
    verbs: ["get"]
  - apiGroups:
      - discovery.k8s.io
    resources:
      - endpointslices
    verbs: ["get", "list", "watch"]
  - apiGroups:
      - networking.k8s.io
    resources:
      - ingresses
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
@@ -0,0 +1,12 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: system
@@ -0,0 +1,5 @@
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: system
@@ -0,0 +1,15 @@
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
  namespace: system
spec:
  type: NodePort
  ports:
    - name: web
      nodePort: 30900
      port: 9090
      protocol: TCP
      targetPort: web
  selector:
    prometheus: prometheus
Wouldn't it be better to centralise the Prometheus installation and related configurations in the hack directory? It might help keep things more organised and easier to understand.
Actually, I would prefer it to be part of the existing e2e manifests, since this is something we are planning to do for our e2e's.
The script was getting way too big, and only had one or two operations that justified it as a script. The prometheus yaml is way more readable and maintainable in a kustomization manifest, IMO.
@tmshort I initially wanted to add these manifests to the e2e collection as you mentioned, but the catalogd certificate generated by cert-manager is named catalogd-service-cert-v1.3.0-68-g7cd03f1-dirty (at least, for me), which doesn't give me confidence that I can just hard-code it and hope it keeps working. Unless you know exactly where that name comes from?
Also @tmshort, if this sufficiently explains why I can't do as you mentioned here, I'd appreciate it if we could drop the hold, but if you can think of a way around the issue I'd be more than happy to give it a try!
Figured out the secret name stuff but ran into other issues; see here