Measuring query performance SLI's in Cortex #4563

moadz · 2021-11-24T13:07:50Z

TL;DR: measuring http_request_duration_seconds on the query path is a bad proxy for query latency, as it does not account for data distribution or the number of samples/series touched (both of which significantly affect the performance of a query). We would like a more granular metric that describes the query shape as well as its duration.

I'm exploring more granular performance metrics for fan-out queries in Thanos-store (inspired by this discussion from Ian Billet) and wanted to reach out to the Cortex community to better understand how users of Cortex measure and track query performance for SLIs in a multi-tenant environment (if this is done at all).

My current thinking is to create a new metric that captures these additional dimensions, to better understand and quantify query performance SLIs with respect to the number of samples/series touched before a query is executed. This approach is sub-optimal for a few reasons:

- Introducing a new high-cardinality metric on the write path
- Determining the appropriate bucket sizes for each dimension of the query (series/samples touched, request duration) is difficult as no single scale will be relevant for every topology/data set
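
To make the trade-off above concrete, here is a minimal sketch of what such a metric could look like: a duration histogram keyed by size-class labels for series and samples touched. It is plain Python with no metrics library, and the bucket boundaries, the `size_class` helper, and the `QueryShapeHistogram` class are all hypothetical names and values chosen for illustration, not anything taken from Cortex or Thanos.

```python
import bisect
from collections import defaultdict

# Hypothetical bucket boundaries. Real values would depend on topology and data set.
SERIES_BOUNDS = [10, 1_000, 100_000]
SAMPLES_BOUNDS = [1_000, 100_000, 10_000_000]
DURATION_BOUNDS = [0.1, 0.5, 1.0, 5.0, 30.0]  # seconds


def size_class(value, bounds):
    """Return the upper bound ("le") of the first bucket that contains value."""
    i = bisect.bisect_left(bounds, value)
    return str(bounds[i]) if i < len(bounds) else "+Inf"


class QueryShapeHistogram:
    """Cumulative duration histogram, one per (series_le, samples_le) pair."""

    def __init__(self):
        # The last slot of each counts list is the +Inf bucket (total count).
        self.buckets = defaultdict(lambda: [0] * (len(DURATION_BOUNDS) + 1))

    def observe(self, series, samples, duration_s):
        key = (size_class(series, SERIES_BOUNDS), size_class(samples, SAMPLES_BOUNDS))
        counts = self.buckets[key]
        first = bisect.bisect_left(DURATION_BOUNDS, duration_s)
        # Cumulative buckets: every bucket at or above the duration is incremented.
        for i in range(first, len(counts)):
            counts[i] += 1

    def fraction_within(self, key, threshold_s):
        """Share of queries of a given shape that finished within threshold_s."""
        counts = self.buckets.get(key)
        if not counts or counts[-1] == 0:
            return None
        return counts[DURATION_BOUNDS.index(threshold_s)] / counts[-1]

    @staticmethod
    def max_series():
        """Worst-case number of time series this metric can produce per tenant."""
        return (
            (len(SERIES_BOUNDS) + 1)
            * (len(SAMPLES_BOUNDS) + 1)
            * (len(DURATION_BOUNDS) + 1)
        )


h = QueryShapeHistogram()
h.observe(series=5, samples=500, duration_s=0.05)
h.observe(series=5, samples=500, duration_s=2.0)
h.observe(series=50_000, samples=5_000_000, duration_s=2.0)
```

Even with only three boundaries per shape dimension and five duration buckets, `max_series()` comes to 4 × 4 × 6 = 96 bucket series per tenant, which illustrates the cardinality concern. The fixed boundaries also show the second problem: a scale that separates small and large queries well for one data set may put nearly every query in a single bucket for another.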

See the original proposal on Thanos for how we might do this with histograms/labels.


Replies: 0 comments
