fix: Enhance GPU metrics collection and error handling in vGPU monitor #827

Merged 1 commit on Feb 8, 2025

@haitwang-cloud (Contributor) commented Jan 22, 2025

What type of PR is this?

/kind flake
This pull request restructures the vGPUmonitor application. The main changes are the addition of context and signal handling, the split of the metrics collection process into smaller functions, and the refactoring of the watchAndFeedback function to support graceful shutdown.

Context and Signal Handling:

  • cmd/vGPUmonitor/main.go: Added context and signal handling to enable graceful shutdown of the application. This includes capturing system signals and using a context to manage the lifecycle of goroutines.

Metrics Collection:

  • cmd/vGPUmonitor/metrics.go: Refactored the metrics collection process by splitting it into multiple functions (collectGPUInfo, collectPodAndContainerInfo, collectContainerMetrics, etc.) to improve readability and maintainability.
  • cmd/vGPUmonitor/metrics.go: Introduced the sendMetric helper function to streamline sending metrics to Prometheus.
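
The shape of that refactoring might look roughly like the sketch below. It reuses the function names from the description, but every signature, type, and metric name here is an assumption made for illustration: the real code works with Prometheus metric types, while this sketch uses a plain `sample` struct so it runs with the standard library alone.

```go
package main

import (
	"errors"
	"fmt"
)

// sample is a hypothetical stand-in for a Prometheus metric.
type sample struct {
	name   string
	value  float64
	labels []string
}

// device is a hypothetical description of one GPU.
type device struct {
	uuid     string
	memBytes float64
}

// sendMetric is an illustrative helper: one place that validates and emits
// a sample, so the collectors do not repeat that logic.
func sendMetric(ch chan<- sample, name string, value float64, labels ...string) error {
	if name == "" {
		return errors.New("metric name must not be empty")
	}
	ch <- sample{name: name, value: value, labels: labels}
	return nil
}

// collectGPUInfo emits one sample per device.
func collectGPUInfo(ch chan<- sample, devices []device) error {
	for _, d := range devices {
		if err := sendMetric(ch, "device_memory_bytes", d.memBytes, d.uuid); err != nil {
			return fmt.Errorf("device %s: %w", d.uuid, err)
		}
	}
	return nil
}

// collectContainerMetrics emits usage for a single container.
func collectContainerMetrics(ch chan<- sample, pod, container string, usedBytes float64) error {
	return sendMetric(ch, "container_memory_used_bytes", usedBytes, pod, container)
}

// collect runs each step in turn and gathers the emitted samples.
func collect(devices []device) []sample {
	ch := make(chan sample, len(devices)+1) // buffered so the sends never block
	if err := collectGPUInfo(ch, devices); err != nil {
		fmt.Println("collecting GPU info failed:", err)
	}
	if err := collectContainerMetrics(ch, "ns/pod", "main", 512); err != nil {
		fmt.Println("collecting container metrics failed:", err)
	}
	close(ch)
	var out []sample
	for s := range ch {
		out = append(out, s)
	}
	return out
}

func main() {
	for _, s := range collect([]device{{uuid: "GPU-0", memBytes: 1024}}) {
		fmt.Println(s.name, s.value, s.labels)
	}
}
```

Each step returns an error instead of logging and continuing silently, which lets the caller decide whether a failed step should abort the whole collection.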

Refactoring watchAndFeedback:

  • cmd/vGPUmonitor/feedback.go: Refactored the watchAndFeedback function to support context-based cancellation, improving the application's ability to shut down gracefully.

Code Cleanup:

These changes collectively enhance the robustness and maintainability of the vGPUmonitor application.
What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:
No

@haitwang-cloud (Contributor, Author)

Appending the metrics log produced after this PR:

I0207 07:59:18.078782 1970225 metrics.go:320] Processing Pod ism/deploy-ism-embedding-5c66779bb7-tst68
I0207 07:59:18.078816 1970225 metrics.go:327] Processing Container istio-proxy in Pod ism/deploy-ism-embedding-5c66779bb7-tst68
I0207 07:59:18.082868 1970225 metrics.go:412] Successfully collected metrics for Pod ism/deploy-ism-embedding-5c66779bb7-tst68, Container istio-proxy
I0207 07:59:18.082893 1970225 metrics.go:320] Processing Pod sir-service/sir-core-service-79c956d9fd-xns5t
I0207 07:59:18.082911 1970225 metrics.go:327] Processing Container istio-proxy in Pod sir-service/sir-core-service-79c956d9fd-xns5t
I0207 07:59:18.086465 1970225 metrics.go:412] Successfully collected metrics for Pod sir-service/sir-core-service-79c956d9fd-xns5t, Container istio-proxy
I0207 07:59:18.086478 1970225 metrics.go:320] Processing Pod ism/dev-deploy-ism-embedding-596759cbb4-jbflb
I0207 07:59:18.086486 1970225 metrics.go:327] Processing Container istio-proxy in Pod ism/dev-deploy-ism-embedding-596759cbb4-jbflb
I0207 07:59:18.089185 1970225 metrics.go:412] Successfully collected metrics for Pod ism/dev-deploy-ism-embedding-596759cbb4-jbflb, Container istio-proxy
I0207 07:59:18.089199 1970225 metrics.go:337] Finished collecting metrics for 3 pods
I0207 07:59:18.089206 1970225 metrics.go:192] Finished collecting metrics for vGPUMonitor

@haitwang-cloud (Contributor, Author)

Changed the default log level to 5 from 4.

@haitwang-cloud (Contributor, Author) commented Feb 7, 2025

Appending the latest log after changing the log level to 4:

I0207 09:15:11.175377  745617 metrics.go:337] Finished collecting metrics for 4 pods
I0207 09:15:11.175392  745617 metrics.go:192] Finished collecting metrics for vGPUMonitor

@archlitchi (Collaborator)

/lgtm

@archlitchi merged commit 89368bd into Project-HAMi:master on Feb 8, 2025
13 checks passed
2 participants