Expose NVMe Volume Metrics #2216

torredil · 2024-11-07T14:16:03Z

Is this a bug fix or adding new feature?

New feature.

What is this PR about? / Why do we need it?

This PR adds support for collecting detailed performance statistics for CSI managed devices. The log page for each device is retrieved at gather time, transformed into a prometheus-compatible format, and exposed at a /metrics HTTP endpoint.

What testing is done?

make verify && make test

ginkgo run --focus='should create a stateful pod and read NVMe metrics from node plugin' -v
Test Suite Passed

curl 127.0.0.1:3302/metrics

...
# HELP volume_queue_length Current volume queue length
# TYPE volume_queue_length gauge
volume_queue_length{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e"} 0
# HELP write_io_latency_histogram Histogram of write I/O latencies (in microseconds)
# TYPE write_io_latency_histogram histogram
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="1"} 0
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="2"} 0
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="4"} 0
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="8"} 0
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="16"} 0
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="32"} 0
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="64"} 0
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="128"} 0
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="256"} 58
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="512"} 14260
write_io_latency_histogram_bucket{instance_id="i-0725b88fb342cd1c9",volume_id="vol-0aa37802a7c970c8e",le="1024"} 83115
...

Signed-off-by: torredil <[email protected]>

github-actions · 2024-11-07T14:18:05Z

Code Coverage Diff

File	Old Coverage	New Coverage	Delta
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/metrics/metrics.go	66.7%	63.8%	-2.9
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/metrics/nvme.go	Does not exist	21.7%

tests/e2e/nvme_metrics.go

ConnorJC3

/lgtm

pkg/metrics/nvme_test.go

pkg/driver/options.go

Signed-off-by: torredil <[email protected]>

AndrewSirenko

Looks great, left a few non-blocking unit test comments/questions.

pkg/metrics/nvme.go

AndrewSirenko · 2024-11-07T16:47:03Z

/hold

For @ElijahQuinones's review

Signed-off-by: torredil <[email protected]>

ConnorJC3

/lgtm

torredil · 2024-11-07T18:24:00Z

/label tide/merge-method-squash

AndrewSirenko · 2024-11-07T18:46:15Z

/lgtm

ElijahQuinones

/unhold
/approve

LGTM thank you for also adding the service in this PR :) !

k8s-ci-robot · 2024-11-07T19:48:49Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AndrewSirenko, ElijahQuinones

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [AndrewSirenko,ElijahQuinones]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

SuperQ · 2024-11-13T12:59:42Z

Hey, thanks for adding these metrics.

Unfortunately, there are a few things about this change that don't follow Prometheus best practices.

https://prometheus.io/docs/practices/naming/

Would you be willing to accept changes to adapt the data to meet best practices?

torredil · 2024-11-13T14:43:19Z

Hi @SuperQ, thank you for reaching out about this and engaging with the community on Prometheus best practices.

We are aware of that documentation and its one of the things our team carefully evaluated when implementing this feature. While we generally strive to follow best practices, in this specific case we had a requirement and intentionally decided to maintain consistency with the underlying EBS metrics output to allow for direct mapping with the AWS documentation and CloudWatch metrics.

That said, could you share an example of the change you'd propose? I'd like to discuss this with the team if there is an opportunity to incorporate best practices for naming conventions while maintaining the consistency requirement.

SuperQ · 2024-11-13T15:13:44Z

There's basically no way to make the underlying metric match best practices without breaking changes.

My changes would be:

Prefix the metrics with aws_ebs_csi... to namespace the metrics as specific to this driver.
Map the unit from microseconds to seconds to match unit best practices.
Rename the metric to match best practices.

I also reviewed the code and found that the upstream data is missing the histogram cumulative sum data. ☹️

torredil · 2024-11-13T16:51:52Z

That makes sense @SuperQ, I'll bring up this feedback for discussion during our next weekly sync and provide a response here shortly after. We also need to consider existing metrics exported by the controller plugin when adding/changing prefixes if the team decides to move forward with the proposal.

Expose NVMe volume metrics

248c13c

Signed-off-by: torredil <[email protected]>

k8s-ci-robot requested review from AndrewSirenko and ConnorJC3 November 7, 2024 14:16

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 7, 2024

ConnorJC3 reviewed Nov 7, 2024

View reviewed changes

tests/e2e/nvme_metrics.go Show resolved Hide resolved

tests/e2e/nvme_metrics.go Outdated Show resolved Hide resolved

torredil force-pushed the nvme-log-page-metrics branch from 0fb096f to 428fb1d Compare November 7, 2024 15:14

ConnorJC3 approved these changes Nov 7, 2024

View reviewed changes

k8s-ci-robot assigned ConnorJC3 Nov 7, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2024

AndrewSirenko reviewed Nov 7, 2024

View reviewed changes

pkg/metrics/nvme_test.go Show resolved Hide resolved

AndrewSirenko reviewed Nov 7, 2024

View reviewed changes

pkg/driver/options.go Show resolved Hide resolved

Add unit/e2e tests for NVMe log page metrics

5927a72

Signed-off-by: torredil <[email protected]>

torredil force-pushed the nvme-log-page-metrics branch from 428fb1d to 5927a72 Compare November 7, 2024 15:50

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2024

AndrewSirenko approved these changes Nov 7, 2024

View reviewed changes

pkg/metrics/nvme.go Show resolved Hide resolved

k8s-ci-robot assigned AndrewSirenko Nov 7, 2024

k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 7, 2024

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 7, 2024

Add Service for ebs-csi-node

ecc7525

Signed-off-by: torredil <[email protected]>

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2024

ConnorJC3 reviewed Nov 7, 2024

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2024

k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Nov 7, 2024

ElijahQuinones approved these changes Nov 7, 2024

View reviewed changes

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 7, 2024

k8s-ci-robot assigned ElijahQuinones Nov 7, 2024

k8s-ci-robot merged commit 966da33 into kubernetes-sigs:master Nov 7, 2024
18 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expose NVMe Volume Metrics #2216

Expose NVMe Volume Metrics #2216

torredil commented Nov 7, 2024 •

edited

Loading

github-actions bot commented Nov 7, 2024

ConnorJC3 left a comment

AndrewSirenko left a comment

AndrewSirenko commented Nov 7, 2024

ConnorJC3 left a comment

torredil commented Nov 7, 2024

AndrewSirenko commented Nov 7, 2024

ElijahQuinones left a comment

k8s-ci-robot commented Nov 7, 2024

SuperQ commented Nov 13, 2024

torredil commented Nov 13, 2024

SuperQ commented Nov 13, 2024 •

edited

Loading

torredil commented Nov 13, 2024

Expose NVMe Volume Metrics #2216

Expose NVMe Volume Metrics #2216

Conversation

torredil commented Nov 7, 2024 • edited Loading

github-actions bot commented Nov 7, 2024

Code Coverage Diff

ConnorJC3 left a comment

Choose a reason for hiding this comment

AndrewSirenko left a comment

Choose a reason for hiding this comment

AndrewSirenko commented Nov 7, 2024

ConnorJC3 left a comment

Choose a reason for hiding this comment

torredil commented Nov 7, 2024

AndrewSirenko commented Nov 7, 2024

ElijahQuinones left a comment

Choose a reason for hiding this comment

k8s-ci-robot commented Nov 7, 2024

SuperQ commented Nov 13, 2024

torredil commented Nov 13, 2024

SuperQ commented Nov 13, 2024 • edited Loading

torredil commented Nov 13, 2024

torredil commented Nov 7, 2024 •

edited

Loading

SuperQ commented Nov 13, 2024 •

edited

Loading