
Service monitors for cloud-controller-manager-app #3538

Closed · 2 tasks
Rotfuks opened this issue Jun 27, 2024 · 16 comments

@Rotfuks
Contributor

Rotfuks commented Jun 27, 2024

Motivation

There are still a few leftover apps that need a service monitor. Without this, we will not be able to tear down our Prometheus stack.

To easily find out what is not monitored via service monitors, you can connect to an MC or WC Prometheus using opsctl open -i -a prometheus --workload-cluster=<cluster_id> and check out the targets page. We already identified and resolved most of the missing service monitors, but identified one turtles app that's not done yet.

TODO

Service monitors here are only relevant for CAPI

Outcome

  • All missing apps have a service monitor
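For illustration, a minimal ServiceMonitor for such an app could look roughly like this (a sketch; the Service label and port name are assumptions, not taken from an actual chart):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: aws-cloud-controller-manager
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: aws-cloud-controller-manager   # assumed Service label
  endpoints:
    - port: metrics                       # assumed port name on the Service
      path: /metrics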
Rotfuks added the team/phoenix (Team Phoenix) and team/turtles (Team Turtles) labels Jun 27, 2024
@fiunchinho
Member

@Rotfuks the cloud controller manager is a component that is deployed very early in the process, probably before the ServiceMonitor CRD is present. Should we deploy it as a new app, similar to what's done with cilium-servicemonitors-app? Or will the CRDs be present so we can deploy it with the app?

@Rotfuks
Contributor Author

Rotfuks commented Jul 24, 2024

I believe someone else from @giantswarm/team-atlas might be of more help when it comes to that question :)

@QuentinBisson

It seems that this app is required in the bootstrap of a cluster, so unless you can create the cluster and have it depend on prometheus-operator-crds, you should consider making it another app, yes :(
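For illustration, if it does become its own app, the dependency could be declared on the App CR. A minimal sketch, assuming the app-platform depends-on annotation applies here (all names are hypothetical):

apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  name: cloud-controller-manager-servicemonitors   # hypothetical app name
  namespace: org-example                           # hypothetical org namespace
  annotations:
    # assumed annotation for app-platform dependency ordering
    app-operator.giantswarm.io/depends-on: prometheus-operator-crds
spec:
  catalog: default                                 # hypothetical catalog
  name: cloud-controller-manager-servicemonitors
  namespace: kube-system
  version: 0.1.0                                   # hypothetical version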

fiunchinho self-assigned this Jul 31, 2024
@fiunchinho
Member

@giantswarm/team-atlas how were these components being monitored up until now? I can't find a metrics endpoint for them.

@QuantumEnigmaa

From what I see, these weren't monitored at all:

[screenshot: searching the Prometheus targets returns no results]

No results, whereas the pods are running:

kube-system           aws-cloud-controller-manager-fvrn4                                  1/1     Running     1 (42d ago)     42d                                                                           
kube-system           aws-cloud-controller-manager-fxrbp                                  1/1     Running     1 (42d ago)     42d
kube-system           aws-cloud-controller-manager-wd8sm                                  1/1     Running     1 (42d ago)     42d

@QuentinBisson

QuentinBisson commented Aug 7, 2024

Could it be because bind-address is not set as an argument, so it can only be scraped via localhost?

@fiunchinho
Member

These apps are not exposing any metrics port AFAIK. So I'm not sure what needs to be done.

@QuantumEnigmaa

Then I would say that the first thing to do would be to expose the metrics endpoints in those apps.

@fiunchinho
Member

I don't think they have one, that's what I meant.

@QuentinBisson

I'm confused because the main.go file of the aws cloud controller manager does register metrics 🤔 I can check in 2 weeks :)

@fiunchinho
Member

fiunchinho commented Aug 7, 2024

Yes, it does, but I don't see any HTTP handler to expose those metrics.

@fiunchinho
Member

fiunchinho commented Aug 8, 2024

I decided to investigate aws-cloud-controller-manager more in detail.

Reading the source code, I could see how the controller collects metrics: https://github.com/search?q=repo%3Akubernetes%2Fcloud-provider-aws+%2Fcloudprovider_aws%2F&type=code

But in the code I couldn't see any HTTP endpoint being exposed for the metrics. I searched GitHub, forums, and asked on the Kubernetes Slack, but I did not get any answers.

I checked the parameters that we currently pass to the aws-cloud-controller-manager container, and one of them is --secure-port=10267. So I port-forwarded that port and hit the /metrics endpoint.
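The equivalent commands are roughly (a sketch; the pod name is illustrative, taken from the listing above, and -k skips the self-signed cert):

kubectl -n kube-system port-forward pod/aws-cloud-controller-manager-fvrn4 10267:10267
curl -k https://localhost:10267/metrics

The response: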
Internal Server Error: "/metrics": subjectaccessreviews.authorization.k8s.io is forbidden: User "system:serviceaccount:kube-system:aws-cloud-controller-manager" cannot create resource "subjectaccessreviews" in API group "authorization.k8s.io" at the cluster scope

Apparently this is a k8s resource used to determine whether a given user or service account has permission to perform a specific action on a resource. So I tried updating the ClusterRole for aws-cloud-controller-manager, adding the SAR resource.
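The added rule would look roughly like this (a sketch of the ClusterRole addition):

- apiGroups: ["authorization.k8s.io"]
  resources: ["subjectaccessreviews"]
  verbs: ["create"]

Then, hitting the endpoint again: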

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/metrics\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}

I need to pass a service account token. So I created a new token for the aws-cloud-controller-manager SA with kubectl create token aws-cloud-controller-manager -n kube-system --duration 10m and used it in my request.
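Putting it together (a sketch):

TOKEN=$(kubectl create token aws-cloud-controller-manager -n kube-system --duration 10m)
curl -k -H "Authorization: Bearer $TOKEN" https://localhost:10267/metrics

The response: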

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/metrics\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}

Let's check the logs in the controller:

E0808 14:51:04.851806       1 authentication.go:73] "Unable to authenticate the request" err="[invalid bearer token, tokenreviews.authentication.k8s.io is forbidden: User \"system:serviceaccount:kube-system:aws-cloud-controller-manager\" cannot create resource \"tokenreviews\" in API group \"authentication.k8s.io\" at the cluster scope]"

The tokenreviews API is used by Kubernetes components to verify the validity of bearer tokens, which are often used for authentication. So I added that to the aws-cloud-controller-manager ClusterRole as well.
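Again as a sketch, the ClusterRole addition:

- apiGroups: ["authentication.k8s.io"]
  resources: ["tokenreviews"]
  verbs: ["create"]

Trying again: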

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Unauthorized",
  "reason": "Unauthorized",
  "code": 401
}

I then added this new permission to the aws-cloud-controller-manager ClusterRole:

rules:
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]

Hitting the metrics endpoint again:

# HELP apiserver_audit_event_total [ALPHA] Counter of audit events generated and sent to the audit backend.
# TYPE apiserver_audit_event_total counter
apiserver_audit_event_total 0
# HELP apiserver_audit_requests_rejected_total [ALPHA] Counter of apiserver requests rejected due to an error in audit logging backend.
# TYPE apiserver_audit_requests_rejected_total counter
apiserver_audit_requests_rejected_total 0
# HELP apiserver_client_certificate_expiration_seconds [ALPHA] Distribution of the remaining lifetime on the certificate used to authenticate a request.
# TYPE apiserver_client_certificate_expiration_seconds histogram
apiserver_client_certificate_expiration_seconds_bucket{le="0"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="1800"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="3600"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="7200"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="21600"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="43200"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="86400"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="172800"} 0
...

I finally got the metrics.

I guess we need to:

  • give the component that scrapes the metrics the nonResourceURLs: ["/metrics"] permission on its role/clusterrole
  • add the subjectaccessreviews and tokenreviews permissions to aws-cloud-controller-manager

@QuentinBisson

That's a really thorough investigation. Kudos to you, Jose!

@fiunchinho
Member

After discussing it internally, we've decided that scraping these metrics would add a lot of new time series to our monitoring platform for little gain. We could monitor the state of the DaemonSet instead of relying on the up metric exposed by the application itself. All the other metrics exposed by the application seem too "low level", and I don't think we would use them for our alerting.
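For example, such a check could be built on kube-state-metrics data rather than application metrics. A minimal sketch, assuming a PrometheusRule with an illustrative alert name and threshold (not an actual rule of ours):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cloud-controller-manager-rules   # hypothetical name
  namespace: kube-system
spec:
  groups:
    - name: cloud-controller-manager
      rules:
        - alert: CloudControllerManagerDaemonSetUnavailable   # hypothetical alert
          # fires when any desired pod of the DaemonSet is unavailable
          expr: kube_daemonset_status_number_unavailable{daemonset="aws-cloud-controller-manager"} > 0
          for: 15m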

What do you think?

@QuentinBisson

It's up to you. If you don't think the metrics are relevant, then so be it ;)

@fiunchinho
Member

It seems that these components are already covered by these rules, so we should get alerted when they are not available. We can close this one.
