-
Notifications
You must be signed in to change notification settings - Fork 807
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metric Instrumentation Framework #1767
Conversation
Code Coverage Diff
|
/test pull-aws-ebs-csi-driver-test-helm-chart |
Can we add a PR Description? Maybe something like: Is this a bug fix or adding new feature? What is this PR about? / Why do we need it? However, currently we only expose AWS API Call latency, not driver operation latency numbers. This PR will let the driver expose metrics for CSI RPC Call Latency (i.e. time between initiation of the ControllerPublishVolume RPC call (which triggers the attachment) to the moment the volume's "attached" state is confirmed). Furthermore, this PR decouples the metrics implementation from the cloud package, instead instantiating a metricsRecorder singleton for the life of the driver. What testing is done? |
4356a23
to
9b92246
Compare
9b92246
to
ceccd6b
Compare
ceccd6b
to
0afaf1f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Top level comment: this current iteration feels weird to me.
It essentially enforces an "operation" style recording of metrics, and tries to stuff everything into that.
Does it really make sense to have one metric for every possible "duration" that could occur within the EBS CSI Driver?
This is just me spitballing, but I wonder if it would make sense for the caller to supply their own name of what the metric should be.
We could have a metric for csi_rpc_duration
, for k8s_api_duration
, for mkfs_duration
, etc - stuffing all those into a generic duration_seconds
bucket feels very wrong and is antiethical to standard Prometheus design.
0afaf1f
to
d65c3eb
Compare
/test pull-aws-ebs-csi-driver-test-e2e-external-eks-windows |
/lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
but consider the suggestions below:
d65c3eb
to
ae2798b
Compare
Signed-off-by: Eddie Torres <[email protected]>
ae2798b
to
0f8c191
Compare
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: AndrewSirenko The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/lgtm |
/test pull-aws-ebs-csi-driver-test-e2e-external-eks-windows |
/retest |
What is this PR about? / Why do we need it?
This PR implements a framework that can be utilized to instrument driver internals and AWS API calls. Collected metrics are exposed via a Prometheus endpoint.
Changes:
Metrics package: For clearer separation of concerns and encapsulating metric-related functionality, metrics collection has been moved from
cloud
to a dedicatedmetrics
package.Singleton Metric Recorder: A singleton
metricRecorder
has been implemented to provide an organized way of recording and managing metrics across the driver. The recorder takes care of registering the metrics only once throughout the driver's lifetime and provides exported methods to record operation durations, errors, throttle events, and values.The singleton is initialized lazily when
Recorder()
is called for the first time, (which would be during driver startup if metrics are enabled via theenableMetrics
parameter in our Helm chart):Any subsequent calls to
Recorder()
simply return the already initialized instance.As an alternative solution, DI was also considered but ultimately having to explictly pass the metric recorder instsance everywhere needed limits the scope of what can be instrumented. We could technically inject the recorder into the controller and node service but -- as a theoretical example -- if future maintainers / users were to introduce a batching framework or interested in instrumenting deeply integrated code in the driver, they'd run into a wall. The singleton approach allows for simply importing the metrics package wherever needed and gives a consistent interface for metric recording throughout the driver. Also removes the possibility of registering metrics multiple times and other discrepancies.
HTTP Handler Initialization: metrics package now takes care of setting up the HTTP server for metrics via
InitializeHttpHandler
. This abstracts the process of starting the metric server away from main and lets us remove the cloud import there as well.What testing is done?
make test