TL;DR: measuring http_request_duration_seconds on the query path is a poor proxy for query latency because it does not account for data distribution or the number of samples/series touched, both of which have a significant impact on query performance. We would like a more granular metric that describes the query shape as well as its duration.
I'm exploring more granular performance metrics for fan-out queries in Thanos-store (inspired by this discussion from Ian Billet) and wanted to reach out to the Cortex community to better understand how Cortex users measure and track query performance SLIs in a multi-tenant environment (if this is done at all).
My current thinking is to create a new metric that captures these additional dimensions, so that query performance SLIs can be understood and quantified with respect to the number of samples/series touched before a query is executed. This is suboptimal for a few reasons:
- It introduces a new high-cardinality metric on the write path.
- Choosing appropriate bucket sizes for each dimension of the query (series/samples touched, request duration) is difficult, as no single scale will be relevant for every topology or data set.
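To make the trade-off concrete, here is a minimal sketch of such a metric, written in plain Python rather than against a real Prometheus client library. It assumes a duration histogram partitioned by two labels, a tenant and a coarse "size class" derived from the number of samples touched. The size-class thresholds, the duration buckets, the label scheme and the helpers size_class, QueryShapeHistogram and exposed_series are all hypothetical, chosen only for illustration. The sketch shows how an SLI can then be computed per size class, and how the extra label multiplies the number of exposed series.

```python
import bisect
import math
from collections import defaultdict

# Hypothetical size classes, keyed on samples touched. Illustrative values only:
# no single scale fits every topology or data set.
SIZE_CLASS_BOUNDS = [1e4, 1e6, 1e8]  # inclusive upper bounds on samples touched
SIZE_CLASS_NAMES = ["small", "medium", "large", "xlarge"]

# Prometheus-style cumulative "le" buckets for query duration, in seconds.
DURATION_BOUNDS = [0.1, 0.5, 1.0, 5.0, 10.0, 30.0, math.inf]


def size_class(samples_touched: int) -> str:
    """Map a sample count to a coarse size-class label (hypothetical scheme)."""
    # bisect_left finds the first bound >= samples_touched, so a query sitting
    # exactly on a bound falls into that bound's class.
    return SIZE_CLASS_NAMES[bisect.bisect_left(SIZE_CLASS_BOUNDS, samples_touched)]


class QueryShapeHistogram:
    """Duration histogram partitioned by (tenant, size_class) label pairs."""

    def __init__(self) -> None:
        # One list of cumulative bucket counts per label pair, plus _sum and _count.
        self.buckets = defaultdict(lambda: [0] * len(DURATION_BOUNDS))
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def observe(self, tenant: str, samples_touched: int, duration_s: float) -> None:
        key = (tenant, size_class(samples_touched))
        # Cumulative buckets: every bucket whose upper bound >= duration is incremented.
        for i, bound in enumerate(DURATION_BOUNDS):
            if duration_s <= bound:
                self.buckets[key][i] += 1
        self.sums[key] += duration_s
        self.counts[key] += 1

    def sli(self, klass: str, threshold_s: float) -> float:
        """Fraction of queries in one size class (all tenants) finishing within threshold_s."""
        i = DURATION_BOUNDS.index(threshold_s)  # threshold must be an existing bucket bound
        good = sum(b[i] for (_, k), b in self.buckets.items() if k == klass)
        total = sum(c for (_, k), c in self.counts.items() if k == klass)
        return good / total if total else float("nan")


def exposed_series(num_tenants: int) -> int:
    """Worst-case series count: one per duration bucket plus _sum and _count, per label pair."""
    return num_tenants * len(SIZE_CLASS_NAMES) * (len(DURATION_BOUNDS) + 2)


if __name__ == "__main__":
    h = QueryShapeHistogram()
    h.observe("tenant-a", 2_000, 0.05)       # small query, fast
    h.observe("tenant-b", 3_000, 0.8)        # small query, slow
    h.observe("tenant-a", 50_000_000, 4.0)   # large query
    print(h.sli("small", 0.5))   # 0.5
    print(h.sli("large", 5.0))   # 1.0
    print(exposed_series(100))   # 3600
```

With this split, a 4-second query over 50 million samples is judged against a different objective than a 0.8-second query over a few thousand samples. The cost is visible in exposed_series: with 100 tenants, the size-class label raises the worst case from 900 series (one duration histogram per tenant) to 3600, and every extra size class or duration bucket multiplies it further.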
Original proposal on the Thanos side for how we might do this with histograms/labels.