feat: refactor cni telemetry #3149

Open: wants to merge 22 commits into master

@QxBytes QxBytes commented Nov 14, 2024

Reason for Change:

Currently, the telemetry that CNI sends is insufficient to debug CNI issues. This PR refactors the CNI telemetry to send more and better-quality logs.

  • Moves telemetry into a package-level variable so that it is accessible everywhere
  • Removes sending certain metrics as they are not used
  • Sets the subcontext to the container id. The container id is kept consistent throughout CNI calls for the same pod, meaning an ADD and DEL call (and all related logs) for the same pod will have the same subcontext/container id. The container id is also what is stored in stateless mode as one of the keys.
  • Sets the operation id before any telemetry events are sent. The operation id is used for sampling should we end up enabling it.
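A minimal sketch of the pattern described in the list above, assuming a simplified client. The name AIClient appears later in this conversation; the Client type, its fields, the SetContext/SendEvent/Events methods, the handler functions, and the operation id values are hypothetical and are not taken from the PR. It shows the container id acting as a shared subcontext for ADD and DEL, with the operation id set before the first event is sent.

```go
package main

import "fmt"

// Client is a hypothetical stand-in for the telemetry client; the real
// type and its fields in the PR may look different.
type Client struct {
	subcontext  string   // container id, identical for ADD and DEL of one pod
	operationID string   // set before any event is sent (could drive sampling)
	events      []string // in-memory sink standing in for the telemetry service
}

// AIClient is a package-level instance, so any function in the package can
// send telemetry without the client being passed around.
var AIClient = &Client{}

// SetContext records the container id and operation id for later events.
func (c *Client) SetContext(containerID, operationID string) {
	c.subcontext = containerID
	c.operationID = operationID
}

// SendEvent tags the message with the current subcontext and operation id.
func (c *Client) SendEvent(msg string) {
	c.events = append(c.events, fmt.Sprintf("[%s/%s] %s", c.subcontext, c.operationID, msg))
}

// Events returns a copy of everything sent so far.
func (c *Client) Events() []string {
	return append([]string(nil), c.events...)
}

// handleAdd and handleDel are hypothetical CNI handlers. Both set the
// context first, so every later event carries the same container id.
func handleAdd(containerID string) {
	AIClient.SetContext(containerID, "op-add")
	AIClient.SendEvent("CNI ADD started")
}

func handleDel(containerID string) {
	AIClient.SetContext(containerID, "op-del")
	AIClient.SendEvent("CNI DEL started")
}

func main() {
	handleAdd("container-abc")
	handleDel("container-abc")
	for _, e := range AIClient.Events() {
		fmt.Println(e)
	}
}
```

Because both handlers pass the same container id, events from the ADD and DEL calls for one pod can be correlated by filtering on the subcontext. The sketch is single-goroutine and leaves out locking.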

Examples of logged information (will be added in a separate PR; this PR is focused on refactoring)

  • CNI add network configuration, arguments
  • CNI add completion with endpoint info struct information (contains hns endpoint id and hns network id), interface results from the ipam invoker, and any error that occurred
  • CNI del network configuration, arguments
  • CNI del completion with error that occurred
  • HNS Endpoint struct before creation / HNS Endpoint Id during deletion
  • HNS Network struct before creation / HNS Network Id during deletion
  • Deletion/Release of each IP (even if it does not exist)
  • Mapping sent to CNS in stateless CNI mode during Update Endpoint State
  • Exact CNS response from CNS ipam invoker
  • Exact CNS response from multitenancy ipam invoker
  • Transparent vlan creating/deleting vlan veth interface

Potential additions:

  • endpoint and network structs saved to azure-vnet.json statefile

Issue Fixed:

Requirements:

Notes:
Pipeline run to prove logs sent to kusto: https://msazure.visualstudio.com/One/_build/results?buildId=108208651&view=results
Passing run: https://msazure.visualstudio.com/One/_build/results?buildId=108563465&view=results

@QxBytes QxBytes changed the title from "ci: refactor cni telemetry" to "feat: refactor cni telemetry" Nov 14, 2024
@QxBytes QxBytes self-assigned this Nov 14, 2024
@QxBytes QxBytes added the cni (Related to CNI), ci (Infra or tooling), telemetry, and logging labels Nov 14, 2024
@QxBytes QxBytes force-pushed the alew/refactor-telemetry branch from 23e8b82 to 0613803 Compare November 14, 2024 20:18
@QxBytes QxBytes marked this pull request as ready for review November 14, 2024 23:56
@QxBytes QxBytes requested review from a team as code owners November 14, 2024 23:56
@QxBytes QxBytes requested a review from jpayne3506 November 14, 2024 23:56

QxBytes commented Nov 15, 2024

/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).

@QxBytes QxBytes force-pushed the alew/refactor-telemetry branch 2 times, most recently from b956ec4 to dd9ca83 Compare November 15, 2024 21:31
timraymond (Member) commented

LGTM on @ramiro-gamarra's approval

@ramiro-gamarra ramiro-gamarra left a comment

I may still be missing some details about the purpose of this refactor, but it seems to me that logs are getting duplicated and the abstractions introduced are not yet cleaning up the code much.

behzad-mir previously approved these changes Dec 3, 2024

@behzad-mir behzad-mir left a comment

lgtm

QxBytes commented Dec 5, 2024

/azp run Azure Container Networking PR

QxBytes added 19 commits May 12, 2025 11:05
we will split this part of the pr into its own pr
a telemetry event was added back which was previously removed
undo this pr to add those telemetry statements back
remove reflect
remove duplicated telemetry and telemetry buffer
remove unused fields in report manager
force access to telemetry client fields through methods
move telemetry start/connect code closer to start of plugin execution
we use SendError where we would have previously called reportPluginError (no log emitted)
we don't set error message in cni report because the error message and event message fields both end up in the Message field in the cni telemetry service
tested and none panic:
  • telemetry service running, lock acquired
  • telemetry service not running, lock acquired
  • telemetry service running, lock not acquired
  • telemetry service not running, lock not acquired
  • stateless if telemetry service not running
  • stateless if telemetry service is running
@QxBytes QxBytes force-pushed the alew/refactor-telemetry branch from 482e86f to 68d7a0a Compare May 12, 2025 18:05
@QxBytes QxBytes requested a review from Copilot May 19, 2025 16:42

@Copilot Copilot AI left a comment

Pull Request Overview

This PR refactors the telemetry logging functionality across the Azure Container Networking codebase, improving consistency and simplifying telemetry integration by using a package-level telemetry client. Key changes include migrating telemetry state to a global variable (AIClient), updating telemetry error/event logging calls, and removing obsolete metrics and report fields.

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.

File | Description
telemetry/telemetrybuffer.go | Changed logging level from error to warn when failing to kill the telemetry process
telemetry/telemetry_client.go & telemetry.go | Introduced and utilized the package-level telemetry client (AIClient)
network/* and cni/network/* | Updated telemetry calls and removed unused telemetry fields; added generic helpers
Various test files | Removed references to TelemetryBuffer and CNIReport from tests in favor of AIClient

Comments suppressed due to low confidence (3)

telemetry/telemetrybuffer.go:311

  • The log level for a failed process kill was changed from error to warn. Confirm that downgrading the severity here is intentional and does not obscure critical errors in production.
tb.logger.Warn("Failed to kill process by", zap.String("TelemetryServiceProcessName", TelemetryServiceProcessName), zap.Error(err))

network/endpoint.go:158

  • Ensure that the 'strings' package is imported in this file since the function uses strings.Builder, otherwise this will result in a compilation error.
func FormatSliceOfPointersToString[T any](slice []*T) string {
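For context, a generic helper with this signature might look like the sketch below. Only the signature comes from the review comment above; the body, the nil handling, the output format, and the ipInfo type are assumptions for illustration. A helper like this is useful because formatting a slice of pointers directly with %v prints the pointer addresses instead of the values they point to.

```go
package main

import (
	"fmt"
	"strings"
)

// FormatSliceOfPointersToString renders the value behind each pointer on its
// own line. The signature matches the review comment; the body is an assumed
// implementation, not the code from the PR.
func FormatSliceOfPointersToString[T any](slice []*T) string {
	var sb strings.Builder
	for _, ptr := range slice {
		if ptr == nil {
			continue // skip nil entries instead of dereferencing them
		}
		fmt.Fprintf(&sb, "%+v\n", *ptr)
	}
	return sb.String()
}

// ipInfo is a hypothetical element type used only for this example.
type ipInfo struct {
	Address string
	Prefix  int
}

func main() {
	items := []*ipInfo{
		{Address: "10.0.0.4", Prefix: 24},
		nil,
		{Address: "10.0.0.5", Prefix: 24},
	}
	fmt.Print(FormatSliceOfPointersToString(items))
}
```

The imports of fmt and strings are what the review comment asks the author to confirm; without the strings import the use of strings.Builder does not compile.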

cni/network/network.go:297

  • The package-level telemetry client (telemetryClient) is being modified directly without synchronization. Consider adding concurrency controls or ensuring that these updates are performed safely if accessed from multiple goroutines.
func (plugin *NetPlugin) setCNIReportDetails(containerID, opType, msg string) {
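One way to address the synchronization concern is to keep the mutable fields private and reach them only through methods that take a lock. The sketch below assumes a simplified report with three fields; the safeClient and report types and the method names are hypothetical, and the PR may handle this differently or may not need locking if the client is only touched from one goroutine.

```go
package main

import (
	"fmt"
	"sync"
)

// report is a hypothetical stand-in for the CNI report fields being updated.
type report struct {
	ContainerID string
	OpType      string
	Message     string
}

// safeClient keeps its report private and guards it with a mutex, so callers
// must go through methods instead of writing fields directly.
type safeClient struct {
	mu  sync.RWMutex
	rep report
}

// telemetryClient is a package-level instance shared by all callers.
var telemetryClient = &safeClient{}

// SetReportDetails replaces the report under the write lock.
func (c *safeClient) SetReportDetails(containerID, opType, msg string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.rep = report{ContainerID: containerID, OpType: opType, Message: msg}
}

// Report returns a copy of the report under the read lock.
func (c *safeClient) Report() report {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.rep
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			// Concurrent writers are serialized by the mutex.
			telemetryClient.SetReportDetails(fmt.Sprintf("container-%d", n), "ADD", "ok")
		}(i)
	}
	wg.Wait()
	fmt.Printf("%+v\n", telemetryClient.Report())
}
```

Running a program like this with the Go race detector (go run -race) is a quick way to check that no unsynchronized access remains.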

github-actions bot commented Jun 3, 2025

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Jun 3, 2025
@QxBytes QxBytes removed the stale Stale due to inactivity. label Jun 3, 2025

vipul-21 commented Jun 3, 2025

The changes LGTM apart from a couple of comments.
@tamilmani1989 Can you please take a look as well?

Labels: ci (Infra or tooling), cni (Related to CNI), logging, telemetry
6 participants