[Grafana] Use PodMonitor instead of ServiceMonitor for the Head Node to avoid metric duplication #2689

Merged: 1 commit, Dec 27, 2024

@win5923 win5923 commented Dec 26, 2024

Why are these changes needed?

Addresses: #2502 (comment)

The duplication comes from RayService. The Head Node is scraped through a ServiceMonitor, and RayService creates two Services for the head Pod:

  • One managed by RayService (e.g., rayservice-sample-head-svc)
  • One managed by RayCluster (e.g., rayservice-sample-raycluster-6mj28-head-svc)

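Both Services point at the same head Pod, so the ServiceMonitor discovers it twice and the Head Node's metrics are duplicated. A PodMonitor selects the Pod directly, so it is scraped only once, which resolves the issue.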
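To illustrate why two Services lead to duplicated metrics, here is a minimal sketch in Python. It is a toy model, not KubeRay or Prometheus Operator code: the `Pod` and `Service` classes, the helper functions, the Pod names, and the simplified single-label selectors are all assumptions made for illustration. It only models the idea that a Service-based monitor yields one target per (Service, Pod) pair, while a Pod-based monitor yields one target per Pod.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    labels: dict


@dataclass
class Service:
    name: str
    selector: dict  # labels a Pod must carry to back this Service


def matches(selector, pod):
    """Return True when every selector label is present on the Pod."""
    return all(pod.labels.get(key) == value for key, value in selector.items())


def service_monitor_targets(services, pods):
    """Toy model: one scrape target per (Service, backing Pod) pair."""
    return [(svc.name, pod.name)
            for svc in services
            for pod in pods
            if matches(svc.selector, pod)]


def pod_monitor_targets(selector, pods):
    """Toy model: one scrape target per matching Pod, independent of Services."""
    return [pod.name for pod in pods if matches(selector, pod)]


# Hypothetical Pod names and simplified selectors (assumptions for illustration).
head_selector = {"ray.io/node-type": "head"}
pods = [
    Pod("example-head-pod", {"ray.io/node-type": "head"}),
    Pod("example-worker-pod", {"ray.io/node-type": "worker"}),
]
# The two Service names come from the PR description above.
services = [
    Service("rayservice-sample-head-svc", head_selector),
    Service("rayservice-sample-raycluster-6mj28-head-svc", head_selector),
]

via_services = service_monitor_targets(services, pods)
via_pods = pod_monitor_targets(head_selector, pods)

print(len(via_services))  # the head Pod appears once per Service
print(len(via_pods))      # the head Pod appears once
```

In this model the head Pod shows up twice when discovered through Services and once when discovered as a Pod, which mirrors the duplication described above.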

TODO:

[image]

Before: [image]

After: [image]

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

win5923 commented Dec 26, 2024

@kevin85421 PTAL when you are free.

@kevin85421 kevin85421 self-assigned this Dec 27, 2024

@kevin85421 kevin85421 left a comment

Nice! Have you checked whether the Autoscaler/Dashboard metrics are being collected as expected?

@kevin85421 kevin85421 left a comment

I manually tested it, and the Autoscaler/Dashboard metrics are as expected.

Screenshot 2024-12-27 at 12 07 05 PM Screenshot 2024-12-27 at 12 06 50 PM

@kevin85421 kevin85421 merged commit 3425b4b into ray-project:master Dec 27, 2024
24 checks passed

win5923 commented Dec 28, 2024

> I manually tested it, and the Autoscaler/Dashboard metrics are as expected.
>
> Screenshot 2024-12-27 at 12 07 05 PM Screenshot 2024-12-27 at 12 06 50 PM

Thanks!

@win5923 win5923 deleted the grafana/x2metrics branch December 28, 2024 00:02
rynewang pushed a commit to ray-project/ray that referenced this pull request Jan 7, 2025
Since KubeRay has changed the collection of Head Node metrics from
`ServiceMonitor` to `PodMonitor`, this PR will update the Ray doc to
reflect the current usage.

Ref: ray-project/kuberay#2689

---------

Signed-off-by: win5923
Signed-off-by: Blocka
Co-authored-by: Kai-Hsun Chen
Co-authored-by: angelinalg
roshankathawate pushed a commit to roshankathawate/ray that referenced this pull request Jan 7, 2025
roshankathawate pushed a commit to roshankathawate/ray that referenced this pull request Jan 9, 2025