feat: refactor cni telemetry #3149
base: master
23e8b82 to 0613803
/azp run Azure Container Networking PR
Azure Pipelines successfully started running 1 pipeline(s).
b956ec4 to dd9ca83
LGTM on @ramiro-gamarra's approval
I may still be missing some details about the purpose of this refactor, but it seems to me that logs are getting duplicated and the abstractions introduced are not cleaning up the code much yet.
lgtm
8c83456 to 94a7991
/azp run Azure Container Networking PR
- we will split this part of the pr into its own pr
- a telemetry event was added back which was previously removed
- undo this pr to add those telemetry statements back

- remove reflect
- remove duplicated telemetry and telemetry buffer
- remove unused fields in report manager
- force access to telemetry client fields through methods
- move telemetry start/connect code closer to start of plugin execution

- we use SendError where we would have previously called reportPluginError (no log emitted)
- we don't set error message in cni report because the error message and event message fields both end up in the Message field in the cni telemetry service

tested and none panic:
- telemetry service running, lock acquired
- telemetry service not running, lock acquired
- telemetry service running, lock not acquired
- telemetry service not running, lock not acquired
- stateless if telemetry service not running
- stateless if telemetry service is running
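The "none panic" cases above suggest that telemetry calls are expected to degrade quietly when the telemetry service is unavailable. A minimal sketch of that idea follows. The `sender` interface, the `client` type, and its `SendEvent` method are hypothetical names chosen for illustration and are not taken from this PR.

```go
package main

import "fmt"

// sender is a hypothetical stand-in for the connection to the telemetry service.
type sender interface {
	Send(msg string) error
}

// client is a hypothetical telemetry client. conn stays nil when the
// telemetry service is not running or could not be reached.
type client struct {
	conn sender
}

// SendEvent reports whether the event was handed to the service.
// It returns false instead of panicking when there is no connection.
func (c *client) SendEvent(msg string) bool {
	if c == nil || c.conn == nil {
		return false
	}
	return c.conn.Send(msg) == nil
}

// fakeConn records messages so the example runs without a real service.
type fakeConn struct {
	sent []string
}

func (f *fakeConn) Send(msg string) error {
	f.sent = append(f.sent, msg)
	return nil
}

func main() {
	offline := &client{} // telemetry service not running
	fmt.Println(offline.SendEvent("ADD succeeded"))

	online := &client{conn: &fakeConn{}} // telemetry service running
	fmt.Println(online.SendEvent("ADD succeeded"))
}
```

With this shape, the plugin code path is the same whether or not the service is up, which is one way to keep every case in the list above from panicking.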
482e86f to 68d7a0a
Pull Request Overview
This PR refactors the telemetry logging functionality across the Azure Container Networking codebase, improving consistency and simplifying telemetry integration by using a package-level telemetry client. Key changes include migrating telemetry state to a global variable (AIClient), updating telemetry error/event logging calls, and removing obsolete metrics and report fields.
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.
File | Description |
---|---|
telemetry/telemetrybuffer.go | Changed logging level from error to warn when failing to kill the telemetry process |
telemetry/telemetry_client.go & telemetry.go | Introduced and utilized the package-level telemetry client (AIClient) |
network/* and cni/network/* | Updated telemetry calls and removed unused telemetry fields; added generic helpers |
Various test files | Removed references to TelemetryBuffer and CNIReport from tests in favor of AIClient |
Comments suppressed due to low confidence (3)
telemetry/telemetrybuffer.go:311
- The log level for a failed process kill was changed from error to warn. Confirm that downgrading the severity here is intentional and does not obscure critical errors in production.
tb.logger.Warn("Failed to kill process by", zap.String("TelemetryServiceProcessName", TelemetryServiceProcessName), zap.Error(err))
network/endpoint.go:158
- Ensure that the 'strings' package is imported in this file since the function uses strings.Builder, otherwise this will result in a compilation error.
func FormatSliceOfPointersToString[T any](slice []*T) string {
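For context on the `strings` import remark, here is one way a helper with this signature could be written. Only the signature and the use of `strings.Builder` come from the review comment; the body, the nil handling, and the one-item-per-line output format are assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// FormatSliceOfPointersToString dereferences each non-nil pointer and
// writes its value on its own line. The body is an assumed implementation;
// the PR's actual formatting may differ.
func FormatSliceOfPointersToString[T any](slice []*T) string {
	var sb strings.Builder
	for _, ptr := range slice {
		if ptr == nil {
			continue // skip nil entries rather than dereferencing them
		}
		fmt.Fprintf(&sb, "%+v\n", *ptr)
	}
	return sb.String()
}

func main() {
	a, b := 1, 2
	fmt.Print(FormatSliceOfPointersToString([]*int{&a, nil, &b}))
}
```

Without the `strings` import, the `strings.Builder` declaration would not compile, which is the concern the comment raises.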
cni/network/network.go:297
- The package-level telemetry client (telemetryClient) is being modified directly without synchronization. Consider adding concurrency controls or ensuring that these updates are performed safely if accessed from multiple goroutines.
func (plugin *NetPlugin) setCNIReportDetails(containerID, opType, msg string) {
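If the package-level client could be reached from more than one goroutine, one common option is to guard its fields with a mutex and expose them only through methods. The sketch below illustrates that pattern. The `client` and `report` types, their field names, and the `SetReportDetails` and `Report` methods are hypothetical and do not reflect the PR's actual types.

```go
package main

import (
	"fmt"
	"sync"
)

// report is a hypothetical, simplified stand-in for the CNI report.
type report struct {
	ContainerID string
	OpType      string
	Message     string
}

// client guards its report with a mutex so fields are only
// reachable through methods.
type client struct {
	mu     sync.Mutex
	report report
}

// SetReportDetails replaces the report fields under the lock.
func (c *client) SetReportDetails(containerID, opType, msg string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.report = report{ContainerID: containerID, OpType: opType, Message: msg}
}

// Report returns a copy of the current report under the lock.
func (c *client) Report() report {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.report
}

// telemetryClient is a package-level instance, as in the review comment.
var telemetryClient = &client{}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			telemetryClient.SetReportDetails(fmt.Sprintf("c%d", n), "ADD", "ok")
		}(i)
	}
	wg.Wait()
	fmt.Println(telemetryClient.Report().OpType)
}
```

Whether this is needed depends on whether the plugin actually touches the client from multiple goroutines, which the conversation does not settle.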
This pull request is stale because it has been open for 2 weeks with no activity. Remove the stale label or comment, or this will be closed in 7 days.
The changes LGTM apart from a couple of comments.
Reason for Change:
Currently, the telemetry that CNI sends is insufficient to debug CNI issues. This PR refactors the CNI telemetry so that it can send more and better-quality logs.
Examples of logged information (will be added in a separate PR; this PR is focused on refactoring)
Potential additions:
Issue Fixed:
Requirements:
Notes:
Pipeline run to prove logs are sent to Kusto: https://msazure.visualstudio.com/One/_build/results?buildId=108208651&view=results
Passing run: https://msazure.visualstudio.com/One/_build/results?buildId=108563465&view=results