Hi, great job on Sloth!

I think it would be very helpful if there was a clearer explanation of how to set up latency SLOs.

Using specific le buckets

The first and easiest way that I think works is to use specific 'le' buckets (using Istio in this case, but there should be a similar metric elsewhere). The downside of this method is that we are limited by the pre-defined 'le' buckets, and we also need to tweak the SLO objective to get closer to our latency limits, given those pre-defined buckets.
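For reference, a minimal sketch of how such an le-bucket SLO can be wired up, assuming Sloth's prometheus/v1 events SLI and Istio's standard metric names (the service name and the le="1000" bucket are placeholders to adapt):

```yaml
# Hypothetical sketch, not taken from the Sloth docs.
# Errors = requests that did NOT complete within the pre-defined
# le="1000" (1000ms) histogram bucket.
version: "prometheus/v1"
service: "some-service"
slos:
  - name: "requests-latency"
    objective: 98
    sli:
      events:
        # all requests minus requests faster than 1000ms = slow requests
        errorQuery: |
          sum(rate(istio_request_duration_milliseconds_count{reporter="source",destination_service_name="some-service"}[{{.window}}]))
          -
          sum(rate(istio_request_duration_milliseconds_bucket{reporter="source",destination_service_name="some-service",le="1000"}[{{.window}}]))
        totalQuery: |
          sum(rate(istio_request_duration_milliseconds_count{reporter="source",destination_service_name="some-service"}[{{.window}}]))
```

Note how the 1000ms limit must coincide with an existing bucket boundary, which is exactly the limitation described above.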
Using histogram_quantile
In this case, we can set whatever latency threshold we want, but the complexity of the query is overwhelming for me, and the results are not as good as with method 1.
errorQuery: |
  sum_over_time(
    (count(
      histogram_quantile(0.98,
        sum by (le) (rate(istio_request_duration_milliseconds_bucket{reporter="source",destination_service_name="<some-service>"}[{{.window}}]))
      ) > 1000
    ))[{{.window}}:]
  ) OR on() vector(0)
I noticed that in this case we need to define the SLO target both inside the errorQuery and in the 'objective' field.
Comparison
I also noticed that this methodology does not graph well. I have a comparison of the same latency SLO, 98% of requests under 1000ms.
I am using the queries mentioned above: for the 'le' method the totalQuery is the 'istio_request_duration_milliseconds_count' one, and for the histogram method it is the same histogram query without the > 1000. Naturally, the 'histogram_quantile' SLO is the one whose name is suffixed with -hist.
I also want to show how the 'histogram_quantile' method looks for 1200ms and 99% this time. It is very spiky: either at 100% or dipping low (the actual latency does spike, but I don't understand the flat 100%). The 'le' method does not show a flat 100%.
Question
So, can you please explain which is the right method, and what I could improve in my approach? I spent quite a lot of time on this, but my statistics still fail me.
One more "bonus" question: why do we declare the 'ticket' alerts on 1 day and 3 days, but both with the "same" burn rate? I mean, burn rate 3 over 1 day is the same budget consumption rate as burn rate 1 over 3 days, right? Is it possible to violate burn rate 3 in one day but not violate burn rate 1 over 3 days? Isn't this redundant and a waste of recording rules (I see there is a significant toll on Prometheus)?
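For the arithmetic behind that bonus question: budget consumed equals burn rate times window length divided by SLO period. A quick sketch, assuming a 30-day SLO period:

```python
def budget_consumed(burn_rate: float, window_days: float, period_days: float = 30) -> float:
    """Fraction of the total error budget consumed by burning at
    `burn_rate` for `window_days` of a `period_days` SLO period."""
    return burn_rate * window_days / period_days

# Burn rate 3 sustained for 1 day consumes 3 * 1/30 = 10% of the budget...
print(budget_consumed(3, 1))  # 0.1
# ...exactly the same as burn rate 1 sustained for 3 days: 1 * 3/30 = 10%.
print(budget_consumed(1, 3))  # 0.1
```

The consumption rates match, but the windows behave differently: a steady burn at, say, rate 1.5 trips the 3-day rate-1 condition while never reaching rate 3 in any single day, whereas a one-day spike trips the 1-day condition much sooner, which is presumably why both windows exist.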
For your histogram_quantile approach, it seems to calculate the error rate as the number of time slices whose 98th-percentile delay is over 1s. It might help smooth the graphs to make the time slices as small as possible, something like
errorQuery: |
  sum_over_time(
    count(
      histogram_quantile(0.98,
        sum by (le) (rate(<expr>[30s]))
      ) > 1000
    )[{{.window}}:]
  )
Still, the drawbacks I see with such an approach are:

- it rates each time period as equally important (not taking traffic into account)
- I wouldn't trust histogram_quantile(..) > 1000 to report accurately unless I also have a le="1000" bucket
- it seems like an expensive query to execute over large windows
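The bucket-boundary caveat can be illustrated by re-implementing histogram_quantile's linear interpolation in plain Python. This is a simplified sketch with made-up cumulative bucket counts, not Prometheus' actual implementation:

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Approximate Prometheus histogram_quantile: `buckets` are
    (upper_bound, cumulative_count) pairs in increasing order.
    The quantile is linearly interpolated inside its bucket."""
    total = buckets[-1][1]
    rank = q * total
    lower, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            # linear interpolation between the bucket's bounds
            return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)
        lower, prev_count = upper, count
    return buckets[-1][0]

# With only le="500" and le="2500" buckets, any p98 between 500ms and
# 2500ms is just a point on a straight line between the two bounds,
# so comparing it against 1000 is guesswork:
print(histogram_quantile(0.98, [(500, 10), (2500, 100)]))  # ~2455.6

# A le="1000" boundary pins the estimate to the correct side of 1000ms,
# because we then know exactly how many observations fell below it:
print(histogram_quantile(0.98, [(500, 10), (1000, 95), (2500, 100)]))  # 1900.0
```

In the first call the true p98 could be anywhere between 500ms and 2500ms; the interpolated estimate lands above 1000 regardless, which is why the le="1000" bucket matters for the > 1000 comparison.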