
prometheus is missing container metrics from certain nodes #970

Open
noahpb opened this issue Oct 30, 2024 · 3 comments
noahpb commented Oct 30, 2024

Environment

Device and OS: darwin arm64
App version: v0.29.1-unicorn
Kubernetes distro being used: k3d with two nodes

Steps to reproduce

  1. Create a k3d cluster with additional nodes
$ kubectl get node
NAME               STATUS   ROLES                  AGE   VERSION
k3d-agent1-0       Ready    <none>                 23m   v1.30.4+k3s1
k3d-uds-server-0   Ready    control-plane,master   25m   v1.30.4+k3s1
  2. Deploy uds-core with monitoring

Expected result

Container metrics such as CPU and memory utilization should be queryable for pods on all nodes.

Actual Result

Prometheus only returns metrics for pods scheduled on the control-plane node; pods on the agent node are missing.

Visual Proof (screenshots, videos, text, etc)

Metrics returned for container_cpu_usage_seconds:

[screenshot]

No metrics returned when filtering out the control-plane node:

[screenshot]

Severity/Priority

Moderate

Additional Context

Removing all NetworkPolicies in the monitoring namespace allows Prometheus to pick up metrics from the missing nodes.

@noahpb noahpb added the possible-bug Something may not be working label Oct 30, 2024
noahpb commented Nov 1, 2024

Thanks to @rjferguson21's suggestion, we've been able to confirm that the allow-prometheus-stack-egress-metrics-scraping NetworkPolicy generated by the operator needs to be adjusted. The remoteNamespace: "" specification is not permissive enough to allow egress traffic to the prometheus-node-exporter daemonset pods. Manually adjusting the egress specification of the NetworkPolicy to the CIDR range of the nodes worked in my local testing.
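For illustration, the manual adjustment was an egress rule along these lines. This is only a sketch: the CIDR shown is a typical k3d Docker-network range, and the node-exporter port is assumed to be the default 9100 — substitute the actual node CIDR and port for your cluster.

```yaml
# Sketch of a widened egress rule for the generated
# allow-prometheus-stack-egress-metrics-scraping policy.
# The cidr value below is a placeholder for the cluster's node range.
egress:
  - to:
      - ipBlock:
          cidr: 172.18.0.0/16   # placeholder: k3d node network
    ports:
      - port: 9100              # assumed prometheus-node-exporter port
        protocol: TCP
```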

mjnagel commented Nov 5, 2024

To resolve this, I would suggest building an AllNodes generated target. We should be able to build that list of IPs using a Pepr watch on the nodes, similar to our KubeAPI target. This would also be helpful for metrics-server, which currently has an Anywhere rule with a TODO comment to switch it to an all-nodes target.

Code links for current kubeapi logic:

Once this is added as a generated target, we can add it to Prometheus and verify that the traffic works as expected.
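The core of an AllNodes target would be collecting each node's InternalIP into a CIDR list that the operator can render into NetworkPolicy `ipBlock` entries. A minimal sketch of that piece is below; in uds-core this would be fed by a Pepr Watch on Nodes, but the function name and the trimmed-down `Node` shape here are illustrative, not the project's actual code:

```typescript
// Trimmed-down shape of the Kubernetes Node fields we need
// (a small subset of the real Node type; illustrative only).
interface NodeAddress {
  type: string;    // e.g. "InternalIP", "Hostname"
  address: string;
}

interface Node {
  metadata: { name: string };
  status?: { addresses?: NodeAddress[] };
}

// Collect the InternalIP of every node, deduplicated and sorted,
// as /32 CIDRs suitable for a NetworkPolicy ipBlock list.
function nodeInternalCIDRs(nodes: Node[]): string[] {
  const ips = new Set<string>();
  for (const node of nodes) {
    for (const addr of node.status?.addresses ?? []) {
      if (addr.type === "InternalIP") {
        ips.add(`${addr.address}/32`);
      }
    }
  }
  return [...ips].sort();
}
```

A watch handler would call this on every node add/update/delete and regenerate the affected policies, the same way the KubeAPI target is kept current.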

@mjnagel mjnagel added bug Something isn't working and removed possible-bug Something may not be working labels Nov 6, 2024