@pree-dew (Contributor) commented Oct 9, 2025

Telemetry to cover failure modes that are not captured by container logs, plus metrics for finding resource constraints.

Motivation and Context

We should be notified whenever there is an issue with the registry container.

How Has This Been Tested?

  • Local setup

Breaking Changes

  • No

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

  • No additional exporter is used; this takes advantage of the OpenTelemetry Collector.

  • Covers metrics related to resource constraints, currently limited to the default namespace.

  • Captures Kubernetes events as logs. These events are the source for diagnosing any problem with the service, covering scenarios where a pod cannot start yet and would otherwise be missed because no container logs exist in such cases. Limited to the default namespace.

  • Takes care of the DaemonSet deployment, i.e. deploying the OTel Collector as an agent, by using correct per-node filtering (see the DaemonSet excerpt after this list).

  • The only cardinality-contributing factors are pod IDs (this needs more observation); node IDs will not increase cardinality, since scaling up results in a limited number of nodes.

  • Resource metrics are shipped every 60s; the list of metrics that will be emitted is at https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kubeletstatsreceiver/metadata.yaml (see the collector configuration sketch after this list).

  • Container errors (screenshot: Screenshot 2025-10-10 at 1 21 14 AM)

  • Resource metrics (screenshot: Screenshot 2025-10-10 at 1 23 51 AM)
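
For reference, a minimal OpenTelemetry Collector configuration sketch matching the bullets above. The exporter endpoint, pipeline layout, and exact filter condition are illustrative assumptions, not taken from this PR; the kubeletstats and k8s_events receivers, the 60s interval, and the default-namespace scoping follow from the description.

```yaml
# Sketch only: otel-backend.example:4317 and the filter condition are
# assumptions for illustration, not this PR's actual configuration.
receivers:
  kubeletstats:
    collection_interval: 60s                        # resource metrics shipped every 60s
    auth_type: serviceAccount
    endpoint: "https://${env:K8S_NODE_NAME}:10250"  # local kubelet only (see DaemonSet excerpt below)
    metric_groups: [node, pod, container]
  k8s_events:
    auth_type: serviceAccount
    namespaces: [default]                           # Kubernetes events as logs, default namespace only

processors:
  filter/default-namespace:
    error_mode: ignore
    metrics:
      datapoint:
        # Drop datapoints from namespaces other than default; the nil check
        # keeps node-level metrics, which carry no namespace attribute.
        - 'resource.attributes["k8s.namespace.name"] != nil and resource.attributes["k8s.namespace.name"] != "default"'

exporters:
  otlp:
    endpoint: otel-backend.example:4317             # placeholder backend endpoint

service:
  pipelines:
    metrics:
      receivers: [kubeletstats]
      processors: [filter/default-namespace]
      exporters: [otlp]
    logs:
      receivers: [k8s_events]
      exporters: [otlp]
```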
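And a hypothetical DaemonSet pod-spec excerpt showing the per-node filtering: the downward API injects the node name that the kubeletstats endpoint above references, so each agent scrapes only its own node's kubelet.

```yaml
# Illustrative DaemonSet container env, not copied from this PR.
env:
  - name: K8S_NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName   # each agent learns the node it runs on
```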

@pree-dew (Author) commented Oct 9, 2025

Issue #509

@pree-dew (Author) commented Oct 9, 2025

@rdimitrov @domdomegg @tadasant Is there a possibility of running a deployment on staging for some time before pushing to production? I wanted to check the cardinality numbers for this release before it goes to production, something like in the screenshot below; here the number is high because of the many deployments I have done, which won't be the case in production. I have done thorough testing around this, but still wanted to see if this is an option.

(screenshot: Screenshot 2025-10-10 at 1 27 54 AM)
