Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose NVMe Volume Metrics #2216

Merged
merged 3 commits into from
Nov 7, 2024

Conversation

torredil
Copy link
Member

@torredil torredil commented Nov 7, 2024

Is this a bug fix or adding new feature?

New feature.

What is this PR about? / Why do we need it?

This PR adds support for collecting detailed performance statistics for CSI managed devices. The log page for each device is retrieved at gather time, transformed into a prometheus-compatible format, and exposed at a /metrics HTTP endpoint.

What testing is done?

make verify && make test

ginkgo run --focus='should create a stateful pod and read NVMe metrics from node plugin' -v
Test Suite Passed
curl 127.0.0.1:3302/metrics

...
# HELP volume_queue_length Current volume queue length
# TYPE volume_queue_length gauge
volume_queue_length{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e"} 0
# HELP write_io_latency_histogram Histogram of write I/O latencies (in microseconds)
# TYPE write_io_latency_histogram histogram
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="1"} 0
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="2"} 0
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="4"} 0
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="8"} 0
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="16"} 0
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="32"} 0
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="64"} 0
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="128"} 0
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="256"} 58
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="512"} 14260
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="1024"} 83115
...

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 7, 2024
Copy link

github-actions bot commented Nov 7, 2024

Code Coverage Diff

File Old Coverage New Coverage Delta
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/metrics/metrics.go 66.7% 63.8% -2.9
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/metrics/nvme.go Does not exist 21.7%

tests/e2e/nvme_metrics.go Show resolved Hide resolved
tests/e2e/nvme_metrics.go Outdated Show resolved Hide resolved
@torredil torredil force-pushed the nvme-log-page-metrics branch from 0fb096f to 428fb1d Compare November 7, 2024 15:14
Copy link
Contributor

@ConnorJC3 ConnorJC3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2024
@torredil torredil force-pushed the nvme-log-page-metrics branch from 428fb1d to 5927a72 Compare November 7, 2024 15:50
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2024
Copy link
Contributor

@AndrewSirenko AndrewSirenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, left a few non-blocking unit test comments/questions.

pkg/metrics/nvme.go Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 7, 2024
@AndrewSirenko
Copy link
Contributor

/hold

For @ElijahQuinones's review

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 7, 2024
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2024
Copy link
Contributor

@ConnorJC3 ConnorJC3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2024
@torredil
Copy link
Member Author

torredil commented Nov 7, 2024

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Nov 7, 2024
@AndrewSirenko
Copy link
Contributor

/lgtm

Copy link
Member

@ElijahQuinones ElijahQuinones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/unhold
/approve

LGTM thank you for also adding the service in this PR :) !

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 7, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AndrewSirenko, ElijahQuinones

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [AndrewSirenko,ElijahQuinones]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 966da33 into kubernetes-sigs:master Nov 7, 2024
18 checks passed
@SuperQ
Copy link

SuperQ commented Nov 13, 2024

Hey, thanks for adding these metrics.

Unfortunately, there are a few things about this change that don't follow Prometheus best practices.

https://prometheus.io/docs/practices/naming/

Would you be willing to accept changes to adapt the data to meet best practices?

@torredil
Copy link
Member Author

Hi @SuperQ, thank you for reaching out about this and engaging with the community on Prometheus best practices.

We are aware of that documentation and its one of the things our team carefully evaluated when implementing this feature. While we generally strive to follow best practices, in this specific case we had a requirement and intentionally decided to maintain consistency with the underlying EBS metrics output to allow for direct mapping with the AWS documentation and CloudWatch metrics.

That said, could you share an example of the change you'd propose? I'd like to discuss this with the team if there is an opportunity to incorporate best practices for naming conventions while maintaining the consistency requirement.

@SuperQ
Copy link

SuperQ commented Nov 13, 2024

There's basically no way to make the underlying metric match best practices without breaking changes.

My changes would be:

  • Prefix the metrics with aws_ebs_csi... to namespace the metrics as specific to this driver.
  • Map the unit from microseconds to seconds to match unit best practices.
  • Rename the metric to match best practices.

I also reviewed the code and found that the upstream data is missing the histogram cumulative sum data. ☹️

@torredil
Copy link
Member Author

That makes sense @SuperQ, I'll bring up this feedback for discussion during our next weekly sync and provide a response here shortly after. We also need to consider existing metrics exported by the controller plugin when adding/changing prefixes if the team decides to move forward with the proposal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants