
Service monitors for cloud-controller-manager-app #3538

Closed · 2 tasks
Rotfuks opened this issue Jun 27, 2024 · 16 comments

@Rotfuks
Contributor

Rotfuks commented Jun 27, 2024

Motivation

There are still a few leftover apps that need a service monitor. Without this, we will not be able to tear down our Prometheus stack.

To easily find out what is not monitored via service monitors, you can connect to an MC or WC Prometheus using opsctl open -i -a prometheus --workload-cluster=<cluster_id> and check out the targets page. We already identified and resolved most of the missing service monitors, but identified one turtles app that's not done yet.

TODO

Service monitors here are only relevant for CAPI

Outcome

  • All missing apps have a service monitor
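For illustration, a minimal ServiceMonitor for such an app could look roughly like this (a sketch; the Service label and port name are assumptions, not taken from an actual chart):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: aws-cloud-controller-manager
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: aws-cloud-controller-manager   # assumed Service label
  endpoints:
    - port: metrics                       # assumed port name on the Service
      path: /metrics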
Rotfuks added the team/phoenix (Team Phoenix) and team/turtles (Team Turtles) labels Jun 27, 2024
@fiunchinho
Member

@Rotfuks the cloud controller manager is a component that is deployed very early in the process, probably before the ServiceMonitor CRD is present. Should we deploy it as a new app, similar to what's done with cilium-servicemonitors-app? Or will the CRDs be present so we can deploy it with the app?

@Rotfuks
Contributor Author

Rotfuks commented Jul 24, 2024

I believe someone else from @giantswarm/team-atlas might be of more help when it comes to that question :)

@QuentinBisson

It seems that this app is required in the bootstrap of a cluster, so unless you can create the cluster and have it depend on prometheus-operator-crds, you should consider making it another app, yes :(
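For illustration, if it does become its own app, the dependency could be declared on the App CR. A minimal sketch, assuming the app-platform depends-on annotation applies here (all names are hypothetical):

apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  name: cloud-controller-manager-servicemonitors   # hypothetical app name
  namespace: org-example                           # hypothetical org namespace
  annotations:
    # assumed annotation for app-platform dependency ordering
    app-operator.giantswarm.io/depends-on: prometheus-operator-crds
spec:
  catalog: default                                 # hypothetical catalog
  name: cloud-controller-manager-servicemonitors
  namespace: kube-system
  version: 0.1.0                                   # hypothetical version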

fiunchinho self-assigned this Jul 31, 2024
@fiunchinho
Member

@giantswarm/team-atlas how were these components being monitored up until now? I can't find a metrics endpoint for them.

@QuantumEnigmaa

From what I see, these weren't monitored at all:

[screenshot: searching the Prometheus targets returns no results]

No results, whereas the pods are running:

kube-system           aws-cloud-controller-manager-fvrn4                                  1/1     Running     1 (42d ago)     42d                                                                           
kube-system           aws-cloud-controller-manager-fxrbp                                  1/1     Running     1 (42d ago)     42d
kube-system           aws-cloud-controller-manager-wd8sm                                  1/1     Running     1 (42d ago)     42d

@QuentinBisson

QuentinBisson commented Aug 7, 2024

Could it be because bind-address is not set as an argument, so it can only be scraped via localhost?

@fiunchinho
Member

These apps are not exposing any metrics port AFAIK. So I'm not sure what needs to be done.

@QuantumEnigmaa

Then I would say that the first thing to do would be to expose the metrics endpoints in those apps.

@fiunchinho
Member

I don't think they have one, that's what I meant.

@QuentinBisson

I'm confused because the main.go file of the aws cloud controller manager does register metrics 🤔 I can check in 2 weeks :)

@fiunchinho
Member

fiunchinho commented Aug 7, 2024

Yes, it does, but I don't see any HTTP handler to expose those metrics.

@fiunchinho
Member

fiunchinho commented Aug 8, 2024

I decided to investigate aws-cloud-controller-manager more in detail.

Reading the source code, I could see how the controller collects metrics: https://github.com/search?q=repo%3Akubernetes%2Fcloud-provider-aws+%2Fcloudprovider_aws%2F&type=code

But in the code I couldn't see any HTTP endpoint being exposed for the metrics. I searched GitHub, forums, and asked on the Kubernetes Slack, but I did not get any answers.

I checked the parameters that we currently pass to the aws-cloud-controller-manager container, and one of them is --secure-port=10267. So I port-forwarded that port and hit the /metrics endpoint.
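The equivalent commands are roughly (a sketch; the pod name is illustrative, taken from the listing above, and -k skips the self-signed cert):

kubectl -n kube-system port-forward pod/aws-cloud-controller-manager-fvrn4 10267:10267
curl -k https://localhost:10267/metrics

The response: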
Internal Server Error: "/metrics": subjectaccessreviews.authorization.k8s.io is forbidden: User "system:serviceaccount:kube-system:aws-cloud-controller-manager" cannot create resource "subjectaccessreviews" in API group "authorization.k8s.io" at the cluster scope

Apparently this is a k8s resource used to determine whether a given user or service account has permission to perform a specific action on a resource. So I tried updating the ClusterRole for aws-cloud-controller-manager, adding the SAR resource.
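The added rule would look roughly like this (a sketch of the ClusterRole addition):

- apiGroups: ["authorization.k8s.io"]
  resources: ["subjectaccessreviews"]
  verbs: ["create"]

Then, hitting the endpoint again: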

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/metrics\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}

I need to pass a service account token. So I created a new token for the aws-cloud-controller-manager SA with kubectl create token aws-cloud-controller-manager -n kube-system --duration 10m and used it in my request.
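Putting it together (a sketch):

TOKEN=$(kubectl create token aws-cloud-controller-manager -n kube-system --duration 10m)
curl -k -H "Authorization: Bearer $TOKEN" https://localhost:10267/metrics

The response: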

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/metrics\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}

Let's check the logs in the controller:

E0808 14:51:04.851806       1 authentication.go:73] "Unable to authenticate the request" err="[invalid bearer token, tokenreviews.authentication.k8s.io is forbidden: User \"system:serviceaccount:kube-system:aws-cloud-controller-manager\" cannot create resource \"tokenreviews\" in API group \"authentication.k8s.io\" at the cluster scope]"

The tokenreviews API is used by Kubernetes components to verify the validity of bearer tokens, which are often used for authentication. So I added that to the aws-cloud-controller-manager ClusterRole as well.
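Again as a sketch, the ClusterRole addition:

- apiGroups: ["authentication.k8s.io"]
  resources: ["tokenreviews"]
  verbs: ["create"]

Trying again: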

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Unauthorized",
  "reason": "Unauthorized",
  "code": 401
}

I then added this new permission to the aws-cloud-controller-manager ClusterRole:

rules:
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]

Hitting the metrics endpoint again:

# HELP apiserver_audit_event_total [ALPHA] Counter of audit events generated and sent to the audit backend.
# TYPE apiserver_audit_event_total counter
apiserver_audit_event_total 0
# HELP apiserver_audit_requests_rejected_total [ALPHA] Counter of apiserver requests rejected due to an error in audit logging backend.
# TYPE apiserver_audit_requests_rejected_total counter
apiserver_audit_requests_rejected_total 0
# HELP apiserver_client_certificate_expiration_seconds [ALPHA] Distribution of the remaining lifetime on the certificate used to authenticate a request.
# TYPE apiserver_client_certificate_expiration_seconds histogram
apiserver_client_certificate_expiration_seconds_bucket{le="0"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="1800"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="3600"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="7200"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="21600"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="43200"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="86400"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="172800"} 0
...

I finally got the metrics.

I guess we need to:

  • give the component that scrapes the metrics the nonResourceURLs: ["/metrics"] permission on its role/clusterrole
  • add the subjectaccessreviews and tokenreviews permissions to aws-cloud-controller-manager

@QuentinBisson

That's a really thorough investigation. Kudos to you, Jose!

@fiunchinho
Member

After discussing it internally, we've decided that scraping these metrics would add a lot of new time series to our monitoring platform for little gain. We could monitor the state of the DaemonSet instead of relying on the up metric exposed by the application itself. All the other metrics exposed by the application seem too "low level", and I don't think we would use them for our alerting.
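For example, such a check could be built on kube-state-metrics data rather than application metrics. A minimal sketch, assuming a PrometheusRule with an illustrative alert name and threshold (not an actual rule of ours):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cloud-controller-manager-rules   # hypothetical name
  namespace: kube-system
spec:
  groups:
    - name: cloud-controller-manager
      rules:
        - alert: CloudControllerManagerDaemonSetUnavailable   # hypothetical alert
          # fires when any desired pod of the DaemonSet is unavailable
          expr: kube_daemonset_status_number_unavailable{daemonset="aws-cloud-controller-manager"} > 0
          for: 15m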

What do you think?

@QuentinBisson

It's up to you. If you don't think the metrics are relevant, then so be it ;)

@fiunchinho
Member

It seems that these components are already covered by these rules, so we should get alerted when they are not available. We can close this one.
