@pree-dew (Contributor) commented Oct 9, 2025

Telemetry to cover failure modes that are not captured by container logs, plus metrics for finding resource constraints.

Motivation and Context

We should be notified whenever there is an issue with the registry container.

How Has This Been Tested?

  • Local setup

Breaking Changes

  • No

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

  • No additional exporter is used; this takes advantage of the OpenTelemetry Collector.

  • Covers metrics related to resource constraints, currently limited to the default namespace.

  • Captures Kubernetes events as logs. These events are the source for diagnosing any problem with the service, covering scenarios where a pod cannot start yet and would otherwise be missed because no container logs exist in such cases. Limited to the default namespace.

  • Takes care of the DaemonSet deployment, i.e. deploying the OTel Collector as an agent, by using correct per-node filtering (see the DaemonSet excerpt after this list).

  • The only cardinality-contributing factors are pod IDs (this needs more observation); node IDs will not increase cardinality, since scaling up results in a limited number of nodes.

  • Resource metrics are shipped every 60s; the list of metrics that will be emitted is at https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kubeletstatsreceiver/metadata.yaml (see the collector configuration sketch after this list).

  • Container errors (screenshot: Screenshot 2025-10-10 at 1 21 14 AM)

  • Resource metrics (screenshot: Screenshot 2025-10-10 at 1 23 51 AM)
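
For reference, a minimal OpenTelemetry Collector configuration sketch matching the bullets above. The exporter endpoint, pipeline layout, and exact filter condition are illustrative assumptions, not taken from this PR; the kubeletstats and k8s_events receivers, the 60s interval, and the default-namespace scoping follow from the description.

```yaml
# Sketch only: otel-backend.example:4317 and the filter condition are
# assumptions for illustration, not this PR's actual configuration.
receivers:
  kubeletstats:
    collection_interval: 60s                        # resource metrics shipped every 60s
    auth_type: serviceAccount
    endpoint: "https://${env:K8S_NODE_NAME}:10250"  # local kubelet only (see DaemonSet excerpt below)
    metric_groups: [node, pod, container]
  k8s_events:
    auth_type: serviceAccount
    namespaces: [default]                           # Kubernetes events as logs, default namespace only

processors:
  filter/default-namespace:
    error_mode: ignore
    metrics:
      datapoint:
        # Drop datapoints from namespaces other than default; the nil check
        # keeps node-level metrics, which carry no namespace attribute.
        - 'resource.attributes["k8s.namespace.name"] != nil and resource.attributes["k8s.namespace.name"] != "default"'

exporters:
  otlp:
    endpoint: otel-backend.example:4317             # placeholder backend endpoint

service:
  pipelines:
    metrics:
      receivers: [kubeletstats]
      processors: [filter/default-namespace]
      exporters: [otlp]
    logs:
      receivers: [k8s_events]
      exporters: [otlp]
```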
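And a hypothetical DaemonSet pod-spec excerpt showing the per-node filtering: the downward API injects the node name that the kubeletstats endpoint above references, so each agent scrapes only its own node's kubelet.

```yaml
# Illustrative DaemonSet container env, not copied from this PR.
env:
  - name: K8S_NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName   # each agent learns the node it runs on
```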

@pree-dew (Author) commented Oct 9, 2025

Issue #509

@pree-dew (Author) commented Oct 9, 2025

@rdimitrov @domdomegg @tadasant Is there a possibility of running a deployment on staging for some time before pushing to production? I wanted to check the cardinality numbers for this release before it goes to production, something like in the screenshot below; here the number is high because of the many deployments I have done, which won't be the case in production. I have done thorough testing around this, but still wanted to see if this is an option.

(screenshot: Screenshot 2025-10-10 at 1 27 54 AM)
