Cortex kv panel improvements #371

Draft
wants to merge 2 commits into
base: main
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -23,6 +23,7 @@
 * [CHANGE] Increased `CortexIngesterReachingSeriesLimit` critical alert threshold from 80% to 85%. #363
 * [CHANGE] Decreased `-server.grpc-max-concurrent-streams` from 100k to 10k. #369
 * [CHANGE] Decreased blocks storage ingesters graceful termination period from 80m to 20m. #369
+* [ENHANCEMENT] Writes dashboard: fix HA-tracker KV panels; add elections panel and ingester state panel. #371
 * [ENHANCEMENT] cortex-mixin: Make `cluster_namespace_deployment:kube_pod_container_resource_requests_{cpu_cores,memory_bytes}:sum` backwards compatible with `kube-state-metrics` v2.0.0. #317
 * [ENHANCEMENT] Cortex-mixin: Include `cortex-gw-internal` naming variation in default `gateway` job names. #328
 * [ENHANCEMENT] Ruler dashboard: added object storage metrics. #354
27 changes: 23 additions & 4 deletions cortex-mixin/dashboards/writes.libsonnet
@@ -104,11 +104,22 @@ local utils = import 'mixin-utils/utils.libsonnet';
 $.row('Key-value store for high-availability (HA) deduplication')
 .addPanel(
 $.panel('Requests / sec') +
-$.qpsPanel('cortex_kv_request_duration_seconds_count{%s}' % $.jobMatcher($._config.job_names.distributor))
+$.qpsPanel('cortex_kv_request_duration_seconds_count{%s,kv_name="distributor-hatracker"}' % $.jobMatcher($._config.job_names.distributor))
 )
 .addPanel(
 $.panel('Latency') +
-utils.latencyRecordingRulePanel('cortex_kv_request_duration_seconds', $.jobSelector($._config.job_names.distributor))
+utils.latencyRecordingRulePanel('cortex_kv_request_duration_seconds', $.jobSelector($._config.job_names.distributor) + [utils.selector.eq('kv_name', 'distributor-hatracker')])
 )
+.addPanel(
+$.panel('Elected replica changes / min') +
+$.queryPanel([
+'max by(exported_cluster, user)(increase(cortex_ha_tracker_elected_replica_changes_total{%s}[1m])) >0' % $.jobMatcher($._config.job_names.distributor),
Collaborator commented:

This can be a pretty high-cardinality query. For example, I just tested it in a cluster with a few thousand tenants, and running it over "Last 12h" took 25s on cold caches. What if we add a recording rule for it? I would expect the `> 0` filter to narrow down the cardinality a lot (on top of the `max by` already reducing it by a factor equal to the number of distributor replicas).
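
For illustration only, a minimal sketch of what such a recording rule might look like in the mixin. The group name and the recorded series name are hypothetical, and the `prometheusRules+::` / `groups+:` layout is an assumption about how `recording_rules.libsonnet` is organised. The expression is the panel query from this PR, with the job matcher dropped and `cluster` and `job` added to the grouping labels so that a dashboard could still filter on them.

```jsonnet
{
  prometheusRules+:: {
    groups+: [
      {
        // Hypothetical group name, not taken from the PR.
        name: 'cortex_ha_tracker',
        rules: [
          {
            // Hypothetical recorded series name, not taken from the PR.
            record: 'cluster_job_exported_cluster_user:cortex_ha_tracker_elected_replica_changes:increase1m',
            // Same aggregation as the panel query; the `> 0` filter means only
            // tenant/cluster pairs with an actual election change are stored.
            expr: |||
              max by (cluster, job, exported_cluster, user) (
                increase(cortex_ha_tracker_elected_replica_changes_total[1m])
              ) > 0
            |||,
          },
        ],
      },
    ],
  },
}
```

Under these assumptions, the panel would select the recorded series with the usual job matcher instead of evaluating `increase()` across every tenant at render time.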

+], [
+'{{user}}/{{exported_cluster}}',
+]) +
+$.stack + {
+yaxes: $.yaxes('cpm'),
+},
+)
 )
 .addRow(
@@ -133,11 +144,19 @@ local utils = import 'mixin-utils/utils.libsonnet';
 $.row('Key-value store for the ingesters ring')
 .addPanel(
 $.panel('Requests / sec') +
-$.qpsPanel('cortex_kv_request_duration_seconds_count{%s}' % $.jobMatcher($._config.job_names.ingester))
+$.qpsPanel('cortex_kv_request_duration_seconds_count{%s,kv_name="ingester-lifecycler"}' % $.jobMatcher($._config.job_names.ingester))
 )
 .addPanel(
 $.panel('Latency') +
-utils.latencyRecordingRulePanel('cortex_kv_request_duration_seconds', $.jobSelector($._config.job_names.ingester))
+utils.latencyRecordingRulePanel('cortex_kv_request_duration_seconds', $.jobSelector($._config.job_names.ingester) + [utils.selector.eq('kv_name', 'ingester-lifecycler')])
 )
+.addPanel(
+$.panel('Ingester status') +
+$.queryPanel([
+'max by (state)(cortex_ring_members{%s}) >0' % $.jobMatcher($._config.job_names.distributor),
+], [
+'{{state}}',
+])
+)
 )
 .addRowIf(
2 changes: 1 addition & 1 deletion cortex-mixin/recording_rules.libsonnet
@@ -42,7 +42,7 @@ local utils = import 'mixin-utils/utils.libsonnet';
 utils.histogramRules('cortex_chunk_store_chunks_per_query', ['cluster', 'job']) +
 utils.histogramRules('cortex_database_request_duration_seconds', ['cluster', 'job', 'method']) +
 utils.histogramRules('cortex_gcs_request_duration_seconds', ['cluster', 'job', 'operation']) +
-utils.histogramRules('cortex_kv_request_duration_seconds', ['cluster', 'job']),
+utils.histogramRules('cortex_kv_request_duration_seconds', ['cluster', 'job', 'kv_name']),
 },
 {
 name: 'cortex_queries',