Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ratelimit: Overhaul metrics for the our existing rate limits #7054

Merged
merged 3 commits into from
Sep 11, 2023

Conversation

beautifulentropy
Copy link
Member

@beautifulentropy beautifulentropy commented Aug 25, 2023

  • Use constants for each rate limit name to ensure consistency when labeling metrics
  • Consistently check .Enabled() outside of each limit check RA method
  • Replace the existing checks counter with a latency histogram

Note for reviewers: Previous to this PR some of errors emitted by rate limit check methods were being counted as denials in the metrics and logs. Please ensure that these changes are desirable.

Part of #5545

@beautifulentropy beautifulentropy force-pushed the limits-v1-metric-overhaul branch 5 times, most recently from 705d1a5 to d2ef470 Compare August 28, 2023 16:50
@beautifulentropy beautifulentropy force-pushed the limits-v1-metric-overhaul branch from d2ef470 to e080c1a Compare August 28, 2023 16:53
@beautifulentropy beautifulentropy marked this pull request as ready for review August 28, 2023 17:02
@beautifulentropy beautifulentropy requested a review from a team as a code owner August 28, 2023 17:02
}, []string{"limit", "result"})
stats.MustRegister(rateLimitCounter)
rlCheckLatency := prometheus.NewHistogramVec(prometheus.HistogramOpts{
Name: "ratelimitsv1_check_latency_seconds",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's intended that rate limits names between v1 ratelimit and v2 ratelimits packages will be identical. You could add a label version="v1" or version="v2" rather than have two separate metrics.

Copy link
Member Author

@beautifulentropy beautifulentropy Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had precisely this idea myself, however:

  1. My understanding is that once a time series is collected by Prometheus with a specific label set, you cannot change its labels retroactively. So if we use a label like 'version', which has a pretty short lifetime, we're stuck with it unless we change the name of the time-series.
  2. The key-value version of this histogram is very-likely to have much smaller buckets.

If I'm wrong on either of these points, please let me know. Generally though, this choice was very-much intentional.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, reading the prometheus docs states that to change the old metrics we'd need to do relabeling and at that point just make a new metric.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something about this conclusion strikes me as unlikely or surprising. IIRC it's totally possible to set metric datapoints without specifying values for every label they have. And old data points which have the "version" label set will age out of the grafana database at the same speed as old datapoints from this whole metric would.

Most of what grafana/prometheus are talking about when they talk about "relabeling" is transforming the labels on one metric to match the labels on another metric so that you can query both of them at the same time. That's definitely a pain, but I don't think we'd need to do that here. We'd simply remove the "version" label from this code, stop exporting it, and the timeseries database will catch up when the old datapoints eventually fall out.

I think? Maybe I'm totally wrong.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to leave this alone rather than juggle histogram buckets and labels for v1 and v2 in the same time series. I could also omit v2 when I add the corresponding time-series to key-value rate limits.

ratelimit/rate-limits.go Show resolved Hide resolved
ra/ra.go Outdated Show resolved Hide resolved
@beautifulentropy beautifulentropy requested review from pgporada and a team August 30, 2023 16:37
ra/ra.go Outdated Show resolved Hide resolved
ra/ra.go Show resolved Hide resolved
ra/ra.go Show resolved Hide resolved
ra/ra.go Show resolved Hide resolved
ra/ra.go Show resolved Hide resolved
ra/ra.go Show resolved Hide resolved
ra/ra.go Show resolved Hide resolved
ra/ra.go Show resolved Hide resolved
@pgporada pgporada requested a review from a team August 30, 2023 20:00
pgporada
pgporada previously approved these changes Aug 30, 2023
ra/ra.go Outdated Show resolved Hide resolved
}, []string{"limit", "result"})
stats.MustRegister(rateLimitCounter)
rlCheckLatency := prometheus.NewHistogramVec(prometheus.HistogramOpts{
Name: "ratelimitsv1_check_latency_seconds",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something about this conclusion strikes me as unlikely or surprising. IIRC it's totally possible to set metric datapoints without specifying values for every label they have. And old data points which have the "version" label set will age out of the grafana database at the same speed as old datapoints from this whole metric would.

Most of what grafana/prometheus are talking about when they talk about "relabeling" is transforming the labels on one metric to match the labels on another metric so that you can query both of them at the same time. That's definitely a pain, but I don't think we'd need to do that here. We'd simply remove the "version" label from this code, stop exporting it, and the timeseries database will catch up when the old datapoints eventually fall out.

I think? Maybe I'm totally wrong.

return err
}
ra.rlCheckLatency.WithLabelValues(ratelimit.CertificatesPerFQDNSetFast, ratelimits.Allowed).Observe(elapsed.Seconds())
}

fqdnLimits := ra.rlPolicies.CertificatesPerFQDNSet()
if fqdnLimits.Enabled() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not the change for it, but can't we get rid of the not-fast version of the CertsPerFQDNSet limit by now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're still setting a limit and an override for this in our integration and unit tests, so I'm not sure that statement is true. I could certainly dig into it more though.

@beautifulentropy beautifulentropy merged commit 636d30f into main Sep 11, 2023
@beautifulentropy beautifulentropy deleted the limits-v1-metric-overhaul branch September 11, 2023 19:06
@beautifulentropy beautifulentropy mentioned this pull request May 27, 2024
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants