ratelimit: Overhaul metrics for our existing rate limits #7054
705d1a5 to d2ef470
d2ef470 to e080c1a
}, []string{"limit", "result"})
stats.MustRegister(rateLimitCounter)
rlCheckLatency := prometheus.NewHistogramVec(prometheus.HistogramOpts{
Name: "ratelimitsv1_check_latency_seconds",
I think it's intended that rate limit names will be identical between the v1 `ratelimit` and v2 `ratelimits` packages. You could add a label `version="v1"` or `version="v2"` rather than have two separate metrics.
I had precisely this idea myself, however:
- My understanding is that once a time series has been collected by Prometheus with a specific label set, you cannot change its labels retroactively. So if we use a label like 'version', which has a pretty short lifetime, we're stuck with it unless we change the name of the time series.
- The key-value version of this histogram is very likely to have much smaller buckets.
If I'm wrong on either of these points, please let me know. Generally, though, this choice was very much intentional.
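The second point can be illustrated with a toy model. In client_golang, every child of a `HistogramVec` shares the `Buckets` set in its `HistogramOpts`, so v1 and v2 could not use different bucket layouts under one metric. The `histogramVec` type and the bucket bounds below are hypothetical, and counts are kept per bucket rather than cumulatively as Prometheus exposes them.

```go
package main

import (
	"fmt"
	"sort"
)

// histogramVec is a toy labeled histogram: every child (one per label value)
// shares the same bucket upper bounds.
type histogramVec struct {
	upperBounds []float64           // sorted ascending, in seconds
	children    map[string][]uint64 // label value -> per-bucket counts; last slot is +Inf
}

func newHistogramVec(upperBounds []float64) *histogramVec {
	return &histogramVec{upperBounds: upperBounds, children: map[string][]uint64{}}
}

// observe counts v in the first bucket whose upper bound is >= v (the "le"
// semantics Prometheus uses), or in the +Inf slot if v exceeds every bound.
func (h *histogramVec) observe(label string, v float64) {
	counts, ok := h.children[label]
	if !ok {
		counts = make([]uint64, len(h.upperBounds)+1)
		h.children[label] = counts
	}
	counts[sort.SearchFloat64s(h.upperBounds, v)]++
}

func main() {
	// Hypothetical coarse bounds sized for v1 latencies, shared by every child.
	h := newHistogramVec([]float64{0.01, 0.1, 1})
	h.observe("v1", 0.25)
	h.observe("v2", 0.0005) // 0.5ms
	h.observe("v2", 0.009)  // 9ms lands in the same bucket as 0.5ms
	for _, label := range []string{"v1", "v2"} {
		fmt.Println(label, h.children[label])
	}
}
```

With shared coarse bounds, sub-10ms v2 latencies collapse into one bucket; separate metrics let each version pick bounds suited to its own latency range.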
Got it. The Prometheus docs state that to change the old metrics we'd need to do relabeling, and at that point we might as well just make a new metric.
Something about this conclusion strikes me as unlikely or surprising. IIRC it's totally possible to set metric data points without specifying values for every label they have. And old data points which have the "version" label set will age out of the Grafana database at the same speed as old data points from this whole metric would.
Most of what Grafana/Prometheus are talking about when they talk about "relabeling" is transforming the labels on one metric to match the labels on another metric so that you can query both of them at the same time. That's definitely a pain, but I don't think we'd need to do that here. We'd simply remove the "version" label from this code, stop exporting it, and the time series database will catch up when the old data points eventually fall out.
I think? Maybe I'm totally wrong.
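The label-removal argument can be sketched the same way. In the Prometheus data model, a label with an empty value is treated as if the label were absent. In client_golang, `WithLabelValues` still needs one value per declared label (it panics on a count mismatch), so "not specifying" a label from Go amounts to passing `""`. Dropping the `version` label then sends new samples to a different series, while the old `version="v1"` series simply stops receiving samples and ages out under normal retention. The `canonicalKey` helper and all names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// canonicalKey builds a series identifier from a metric name and label set,
// dropping labels whose value is empty, mirroring the Prometheus rule that an
// empty label value is equivalent to the label not being present.
func canonicalKey(name string, labels map[string]string) string {
	names := make([]string, 0, len(labels))
	for k, v := range labels {
		if v != "" {
			names = append(names, k)
		}
	}
	sort.Strings(names)
	pairs := make([]string, len(names))
	for i, k := range names {
		pairs[i] = fmt.Sprintf("%s=%q", k, labels[k])
	}
	return name + "{" + strings.Join(pairs, ",") + "}"
}

func main() {
	const name = "ratelimits_check_latency_seconds" // hypothetical
	before := canonicalKey(name, map[string]string{"limit": "x", "version": "v1"})
	emptied := canonicalKey(name, map[string]string{"limit": "x", "version": ""})
	dropped := canonicalKey(name, map[string]string{"limit": "x"})
	fmt.Println(before)             // old series: stops receiving samples
	fmt.Println(emptied == dropped) // empty value and absent label are the same series
}
```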
I'd prefer to leave this alone rather than juggle histogram buckets and labels for v1 and v2 in the same time series. I could also omit v2 when I add the corresponding time-series to key-value rate limits.
return err
}
ra.rlCheckLatency.WithLabelValues(ratelimit.CertificatesPerFQDNSetFast, ratelimits.Allowed).Observe(elapsed.Seconds())
}

fqdnLimits := ra.rlPolicies.CertificatesPerFQDNSet()
if fqdnLimits.Enabled() {
Not for this change, but can't we get rid of the non-fast version of the CertsPerFQDNSet limit by now?
We're still setting a limit and an override for this in our integration and unit tests, so I'm not sure that statement is true. I could certainly dig into it more though.
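The latency observation in the diff hunk above, `ra.rlCheckLatency.WithLabelValues(ratelimit.CertificatesPerFQDNSetFast, ratelimits.Allowed).Observe(elapsed.Seconds())`, follows a common pattern: time the check, return early on error, and record the duration under limit and result labels. The sketch below assumes that shape; `recorder` stands in for a `HistogramVec`, and `timedCheck`, `latencyKey`, `errUnavailable`, and the `"allowed"` string are hypothetical, not Boulder code. The hunk only shows the allowed path, so how denials are timed is not modeled here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUnavailable is a hypothetical non-rate-limit failure used in the example.
var errUnavailable = errors.New("storage unavailable")

// latencyKey identifies one child of the metric, mirroring the
// {"limit", "result"} labels in the diff.
type latencyKey struct {
	limit, result string
}

// recorder is a stand-in for a Prometheus HistogramVec: it keeps the raw
// observed durations, in seconds, per label pair.
type recorder struct {
	observations map[latencyKey][]float64
}

func (r *recorder) observe(limit, result string, seconds float64) {
	k := latencyKey{limit, result}
	r.observations[k] = append(r.observations[k], seconds)
}

// timedCheck runs check and measures how long it took. On error it returns
// immediately without observing anything; otherwise it records the latency
// under result "allowed".
func timedCheck(r *recorder, limit string, check func() error) error {
	start := time.Now()
	err := check()
	elapsed := time.Since(start)
	if err != nil {
		return err
	}
	r.observe(limit, "allowed", elapsed.Seconds())
	return nil
}

func main() {
	r := &recorder{observations: map[latencyKey][]float64{}}
	_ = timedCheck(r, "certificatesPerFQDNSetFast", func() error { return nil })
	err := timedCheck(r, "certificatesPerFQDNSetFast", func() error { return errUnavailable })
	fmt.Println(len(r.observations[latencyKey{"certificatesPerFQDNSetFast", "allowed"}]), err)
}
```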
`.Enabled()` outside of each limit check RA method

Note for reviewers: Prior to this PR, some of the errors emitted by rate limit check methods were being counted as denials in the metrics and logs. Please ensure that these changes are desirable.
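The note for reviewers is about telling denials apart from other failures. A minimal sketch, assuming a sentinel error marks a rate-limit denial: `errors.Is` separates denials from unrelated errors, which then get their own label value instead of being counted as denials. `errRateLimited`, `errLookupFailed`, `classify`, and the `"error"` result value are all hypothetical; this excerpt does not show which error type Boulder's check methods return or how the PR labels non-denial errors.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	// errRateLimited is a hypothetical sentinel signalling that a request
	// exceeded a limit.
	errRateLimited = errors.New("rate limit exceeded")
	// errLookupFailed is a hypothetical unrelated failure, e.g. a storage error.
	errLookupFailed = errors.New("database lookup failed")
)

// classify maps a check's outcome to a "result" label value. Only errors
// wrapping errRateLimited count as denials; any other error is reported
// separately so it does not inflate the denial count.
func classify(err error) string {
	switch {
	case err == nil:
		return "allowed"
	case errors.Is(err, errRateLimited):
		return "denied"
	default:
		return "error"
	}
}

func main() {
	wrapped := fmt.Errorf("certificatesPerName: %w", errRateLimited)
	for _, err := range []error{nil, wrapped, errLookupFailed} {
		fmt.Println(classify(err))
	}
}
```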
Part of #5545
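The description item about calling `.Enabled()` outside of each limit check RA method (the hunk's `if fqdnLimits.Enabled() {` appears to show the caller-side form) can be sketched as a caller that skips disabled policies before invoking any check, so the check methods need no guard of their own. `limitPolicy` and `checkLimits` are hypothetical stand-ins for the RA's policy objects and check methods.

```go
package main

import "fmt"

// limitPolicy is a hypothetical stand-in for a configurable rate limit
// policy, like the value returned by ra.rlPolicies.CertificatesPerFQDNSet().
type limitPolicy struct {
	name    string
	enabled bool
}

func (p limitPolicy) Enabled() bool { return p.enabled }

// checkLimits tests Enabled() once per policy in the caller and invokes check
// only for enabled policies. It stops at the first error and returns the
// names of the policies that passed so far.
func checkLimits(policies []limitPolicy, check func(limitPolicy) error) ([]string, error) {
	var passed []string
	for _, p := range policies {
		if !p.Enabled() {
			continue // disabled: skip the check entirely
		}
		if err := check(p); err != nil {
			return passed, err
		}
		passed = append(passed, p.name)
	}
	return passed, nil
}

func main() {
	policies := []limitPolicy{
		{name: "certificatesPerName", enabled: true},
		{name: "certificatesPerFQDNSet", enabled: false},
	}
	passed, err := checkLimits(policies, func(p limitPolicy) error {
		fmt.Println("checking", p.name)
		return nil
	})
	fmt.Println(passed, err)
}
```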