Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics: add os-level cpu stats to pd #4639

Merged
merged 2 commits into from
Jun 20, 2024
Merged

metrics: add os-level cpu stats to pd #4639

merged 2 commits into from
Jun 20, 2024

Conversation

conorsch
Copy link
Contributor

@conorsch conorsch commented Jun 17, 2024

Describe your changes

Pulls in the metrics-process crate [0] to bolt on CPU usage and other OS-level info to the metrics emitted by pd. Doing so will support a better first-run experience for node operators who lack a comprehensive pre-existing setup for monitoring node health, so they can get a sense of how much load pd is under.

[0] https://docs.rs/metrics-process/2.0.0/metrics_process/index.html

Issue ticket number and link

None, was discussed briefly during sprint-planning recently.

Checklist before requesting a review

  • If this code contains consensus-breaking changes, I have added the "consensus-breaking" label. Otherwise, I declare my belief that there are not consensus-breaking changes, for the following reason:

    metrics-only, doesn't change app logic

@conorsch
Copy link
Contributor Author

Still need to cobble together some graphs to display this new info, should be straightforward. Draft in the meantime.

@conorsch conorsch added the A-telemetry Area: Metrics, logging, and other observability-related features label Jun 18, 2024
Pulls in the `metrics-process` crate [0] to bolt on CPU usage and other OS-level info
to the metrics emitted by pd. Doing so will support a better first-run
experience for node operators who lack a comprehensive pre-existing setup for monitoring node health,
so they can get a sense of how much load pd is under.

Adds a cursory first-pass on a new dashboard, but will need to follow up
on that once the new metrics actually land on hosts that are receiving
load. Also included a reference to metrics addd in #4581, as a treat.

[0] https://docs.rs/metrics-process/2.0.0/metrics_process/index.html
Fixes a hardcoded path in the compose setup for local metrics. Updates
the docs, as well, to clarify that the first-run experience via docker
compose is only relevant for Linux hosts, due to use of host-networking.
Will work on a macos-compatible dev env later, but for now focusing on
making sure what's in the repo is clear and runnable for actual testnet
node operators, which means Linux boxes.

Closes #4565.
@conorsch
Copy link
Contributor Author

Ready for review. Added a new dashboard to focus on pd performance:

new-charts

I'm not happy about how to translate the pd_process_cpu_seconds_total metric into a meaningful value, but that needn't block merge: we'll do that in grafana after we have actual data to work with.

@conorsch conorsch marked this pull request as ready for review June 19, 2024 02:48
Copy link
Contributor

@cratelyn cratelyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✨ very nice!

@conorsch conorsch merged commit 757d224 into main Jun 20, 2024
13 checks passed
@conorsch conorsch deleted the pd-metrics-cpu branch June 20, 2024 15:17
@conorsch conorsch mentioned this pull request Jun 20, 2024
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-telemetry Area: Metrics, logging, and other observability-related features
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants