mimir-2.3.0
Grafana Mimir version 2.3 release notes
Grafana Labs is excited to announce version 2.3 of Grafana Mimir, the most scalable, most performant open source time series database in the world.
The highlights that follow include the top features, enhancements, and bugfixes in this release. For the complete list of changes, see the changelog.
Note: If you are upgrading from Grafana Mimir 2.2, review the list of important changes that follow.
This release contains 370 PRs from 39 authors. Thank you!
Features and enhancements
- Ingest metrics in OpenTelemetry format: This release of Grafana Mimir introduces experimental support for ingesting metrics from the OpenTelemetry Collector's `otlphttp` exporter. This adds a second ingestion option for users of the OTel Collector; Mimir was already compatible with the `prometheusremotewrite` exporter. For more information, please see Configure OTel Collector.
- Tenant federation for metadata queries: Users with tenant federation enabled could already issue instant queries, range queries, and exemplar queries to multiple tenants at once and receive a single aggregated result. With Grafana Mimir 2.3, we've added tenant federation support to the `/api/v1/metadata` endpoint as well.
- Simpler object storage configuration: Users can now configure block, alertmanager, and ruler storage all at once with the `common` YAML config option key (or `-common.storage.*` CLI flags). By centralizing your object storage configuration in one place, this enhancement makes configuration faster and less error prone. Users may still individually configure storage for each of these components if they desire. For more information, see the Common Configurations.
- .deb and .rpm packages for Mimir: Starting with version 2.3, we're publishing .deb and .rpm files for Grafana Mimir, which will make installing and running it on Debian or Red Hat-based Linux systems much easier. Thank you to community contributor wilfriedroset for your work to implement this!
- Import historic data: Users can now backfill time series data from their existing Prometheus or Cortex installation into Mimir using `mimirtool`, making it possible to migrate to Grafana Mimir without losing your existing metrics data. This support is still considered experimental and does not yet work for data stored in Thanos. To learn more about this feature, please see `mimirtool backfill` and Configure TSDB block upload.
- Increased instant query performance: Grafana Mimir now supports splitting instant queries by time. This allows it to better parallelize execution of instant queries and therefore return results faster. At present, splitting is only supported for a subset of instant queries, which means not all instant queries will see a speedup. This feature is currently experimental and is disabled by default. It can be enabled with the `split_instant_queries_by_interval` YAML config option in the `limits` section (or the CLI flag `-query-frontend.split-instant-queries-by-interval`).
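As a rough illustration of the tenant federation item above, the following Python sketch builds (but does not send) a metadata request that spans several tenants. It assumes that federated requests list the tenant IDs in the `X-Scope-OrgID` header separated by `|`, and that the base URL already includes the Prometheus HTTP prefix. The host name, tenant names, and the helper `federated_metadata_request` are hypothetical and not part of Mimir.

```python
from typing import Dict, Sequence, Tuple


def federated_metadata_request(base_url: str, tenants: Sequence[str]) -> Tuple[str, Dict[str, str]]:
    """Return the URL and headers for a metadata query spanning several tenants."""
    if not tenants:
        raise ValueError("at least one tenant ID is required")
    # Assumption: tenant IDs are joined with "|" in the X-Scope-OrgID header.
    headers = {"X-Scope-OrgID": "|".join(tenants)}
    # base_url is expected to already include the Prometheus HTTP prefix.
    url = base_url.rstrip("/") + "/api/v1/metadata"
    return url, headers


# Hypothetical host and tenant names, for illustration only.
url, headers = federated_metadata_request(
    "http://mimir.example:8080/prometheus", ["team-a", "team-b"]
)
```

The returned URL and headers can be passed to any HTTP client; tenant federation must be enabled on the Mimir side for such a request to be accepted.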
Helm chart improvements
The Mimir Helm chart is the best way to install Mimir on Kubernetes. As part of the Mimir 2.3 release, we’re also releasing version 3.1 of the Mimir Helm chart.
Notable enhancements follow. For the full list of changes, see the Helm chart changelog.
- We've upgraded the MinIO subchart dependency from a deprecated chart to the supported one. This creates a breaking change in how the administrator password is set. However, as the built-in MinIO is not a recommended object store for production use cases, this change did not warrant a new major version of the Mimir Helm chart.
- Query sharding is now enabled by default, which should give you better performance on high-cardinality metrics queries.
- To compensate for the increased number of queries generated by query sharding, the query scheduler component is now enabled by default.
- The backfill API endpoints for importing historic time series data are now exposed on the Nginx gateway.
- Nginx now sets the value of the `X-Scope-OrgID` header equal to the value of Mimir's `no_auth_tenant` parameter by default. The previous release had set the value of `X-Scope-OrgID` to `anonymous` by default, which complicated the process of migrating to Mimir.
- Memberlist now uses DNS service-discovery by default, which decreases startup time for large Mimir clusters.
Important changes
In Grafana Mimir 2.3 we have removed the following previously deprecated configuration options:
- The `extend_writes` parameter in the distributor YAML configuration and the `-distributor.extend-writes` CLI flag have been removed.
- The `active_series_custom_trackers` parameter has been removed from the YAML configuration. It had already been moved to the runtime configuration. See #1188 for details.
- The `blocks-storage.tsdb.isolation-enabled` parameter in the YAML configuration and the `-blocks-storage.tsdb.isolation-enabled` CLI flag have been removed.
With Grafana Mimir 2.3 we have also updated the default value for the CLI flag `-distributor.ha-tracker.max-clusters` to `100` to provide Denial-of-Service protection. Previously, `-distributor.ha-tracker.max-clusters` was unlimited by default, which could allow a tenant with HA Dedupe enabled to overload the HA tracker with `__cluster__` label values that could cause the HA Dedupe database to fail.
Also, as noted above, the administrator password for Helm chart deployments using the built-in MinIO is now set differently.
Bug fixes
- PR 2447: Fix incorrect mapping of HTTP status codes `429` to `500` when the request queue is full in the query-frontend. This corrects behavior in the query-frontend where a retryable `429 "Too Many Outstanding Requests"` error from a querier was incorrectly returned as an unretryable `500` system error.
- PR 2505: The Memberlist key-value (KV) store now tries to "fast-join" the cluster to avoid serving an empty KV store. This fix addresses the confusing "empty ring" error response and the error log message "ring doesn't exist in KV store yet" emitted by services when there are other members present in the ring when a service starts. Those using other key-value store options (e.g., consul, etcd) are not impacted by this bug.
- PR 2289: The "List Prometheus rules" API endpoint of the Mimir Ruler component is no longer blocked while rules are being synced. This means users can now list rules while syncing larger rule sets.
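To show why the `429` versus `500` distinction fixed in PR 2447 matters to clients, here is a minimal Python sketch of a client-side retry loop that retries only on `429` and gives up on any other status. The helper `send_with_retry` and its parameters are hypothetical; real clients (including Grafana and Prometheus) have their own retry policies.

```python
import time
from typing import Callable


def send_with_retry(send: Callable[[], int], max_attempts: int = 3, delay_seconds: float = 0.0) -> int:
    """Call send() until it returns a status other than 429 or attempts run out.

    send is any callable that performs one request and returns the HTTP status code.
    """
    status = send()
    for _ in range(max_attempts - 1):
        if status != 429:
            # Anything other than 429 (success or a hard error such as 500) is final.
            break
        time.sleep(delay_seconds)
        status = send()
    return status
```

With the pre-fix behavior, a full request queue surfaced as `500`, so a loop like this would stop after the first attempt even though waiting and retrying would have succeeded.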
Changelog
2.3.0
Grafana Mimir
- [CHANGE] Ingester: Added user label to ingester metric `cortex_ingester_tsdb_out_of_order_samples_appended_total`. On multitenant clusters this helps us find the rate of appended out-of-order samples for a specific tenant. #2493
- [CHANGE] Compactor: delete source and output blocks from local disk when compaction fails, to reduce the likelihood that subsequent compactions fail because of no space left on disk. #2261
- [CHANGE] Ruler: Remove unused CLI flags `-ruler.search-pending-for` and `-ruler.flush-period` (and their respective YAML config options). #2288
- [CHANGE] Successful gRPC requests are no longer logged (only affects internal API calls). #2309
- [CHANGE] Add new `-*.consul.cas-retry-delay` flags. They have a default value of `1s`, while previously there was no delay between retries. #2309
- [CHANGE] Store-gateway: Remove the experimental ability to run requests in a dedicated OS thread pool and associated CLI flag `-store-gateway.thread-pool-size`. #2423
- [CHANGE] Memberlist: disabled TCP-based ping fallback, because Mimir already uses a custom transport based on TCP. #2456
- [CHANGE] Change default value for `-distributor.ha-tracker.max-clusters` to `100` to provide DoS protection. #2465
- [CHANGE] Experimental block upload API exposed by compactor has changed: Previous `/api/v1/upload/block/{block}` endpoint for starting block upload is now `/api/v1/upload/block/{block}/start`, and previous endpoint `/api/v1/upload/block/{block}?uploadComplete=true` for finishing block upload is now `/api/v1/upload/block/{block}/finish`. New API endpoint has been added: `/api/v1/upload/block/{block}/check`. #2486 #2548
- [CHANGE] Compactor: changed `-compactor.max-compaction-time` default from `0s` (disabled) to `1h`. When compacting blocks for a tenant, the compactor will move to compact blocks of another tenant or re-plan blocks to compact at least every 1h. #2514
- [CHANGE] Distributor: removed previously deprecated `extend_writes` (see #1856) YAML key and `-distributor.extend-writes` CLI flag from the distributor config. #2551
- [CHANGE] Ingester: removed previously deprecated `active_series_custom_trackers` (see #1188) YAML key from the ingester config. #2552
- [CHANGE] The tenant ID `__mimir_cluster` is reserved by Mimir and not allowed to store metrics. #2643
- [CHANGE] Purger: removed the purger component and moved its API endpoints `/purger/delete_tenant` and `/purger/delete_tenant_status` to the compactor at `/compactor/delete_tenant` and `/compactor/delete_tenant_status`. The new endpoints on the compactor are stable. #2644
- [CHANGE] Memberlist: Change the leave timeout duration (`-memberlist.leave-timeout`) from 5s to 20s and connection timeout (`-memberlist.packet-dial-timeout`) from 5s to 2s. This makes the leave timeout 10x the connection timeout, so that we can communicate the leave to at least 1 node, even if the first 9 we try to contact time out. #2669
- [CHANGE] Alertmanager: return status code `412 Precondition Failed` and log info message when alertmanager isn't configured for a tenant. #2635
- [CHANGE] Distributor: if forwarding rules are used to forward samples, exemplars are now removed from the request. #2710, #2725
- [CHANGE] Limits: change the default value of `max_global_series_per_metric` limit to `0` (disabled). Setting this limit by default does not provide much benefit because series are sharded by all labels. #2714
- [CHANGE] Ingester: experimental `-blocks-storage.tsdb.new-chunk-disk-mapper` has been removed, new chunk disk mapper is now always used, and is no longer marked experimental. Default value of `-blocks-storage.tsdb.head-chunks-write-queue-size` has changed to 1000000, this enables async chunk queue by default, which leads to improved latency on the write path when new chunks are created in ingesters. #2762
- [CHANGE] Ingester: removed deprecated `-blocks-storage.tsdb.isolation-enabled` option. TSDB-level isolation is now always disabled in Mimir. #2782
- [CHANGE] Compactor: `-compactor.partial-block-deletion-delay` must either be set to 0 (to disable partial blocks deletion) or a value higher than `4h`. #2787
- [CHANGE] Query-frontend: CLI flag `-query-frontend.align-querier-with-step` has been deprecated. Please use `-query-frontend.align-queries-with-step` instead. #2840
- [FEATURE] Compactor: Adds the ability to delete partial blocks after a configurable delay. This option can be configured per tenant. #2285
  - `-compactor.partial-block-deletion-delay`, as a duration string, allows you to set the delay since a partial block has been modified before marking it for deletion. A value of `0`, the default, disables this feature.
  - The metric `cortex_compactor_blocks_marked_for_deletion_total` has a new value for the `reason` label `reason="partial"`, when a block deletion marker is triggered by the partial block deletion delay.
- [FEATURE] Querier: enabled support for queries with negative offsets, which are not cached in the query results cache. #2429
- [FEATURE] EXPERIMENTAL: OpenTelemetry Metrics ingestion path on `/otlp/v1/metrics`. #695 #2436 #2461
- [FEATURE] Querier: Added support for tenant federation to metric metadata endpoint. #2467
- [FEATURE] Query-frontend: introduced experimental support to split instant queries by time. The instant query splitting can be enabled by setting `-query-frontend.split-instant-queries-by-interval`. #2469 #2564 #2565 #2570 #2571 #2572 #2573 #2574 #2575 #2576 #2581 #2582 #2601 #2632 #2633 #2634 #2641 #2642 #2766
- [FEATURE] Introduced an experimental anonymous usage statistics tracking (disabled by default), to help Mimir maintainers make better decisions to support the open source community. The tracking system anonymously collects non-sensitive, non-personally identifiable information about the running Mimir cluster. #2643 #2662 #2685 #2732 #2733 #2735
- [FEATURE] Introduced an experimental deployment mode called read-write, which runs a fully featured Mimir cluster with three components: write, read and backend. The read-write deployment mode is a trade-off between the monolithic mode (only one component, no isolation) and the microservices mode (many components, high isolation). #2754 #2838
- [ENHANCEMENT] Distributor: Decreased distributor tests execution time. #2562
- [ENHANCEMENT] Alertmanager: Allow the HTTP `proxy_url` configuration option in the receiver's configuration. #2317
- [ENHANCEMENT] ring: optimize shuffle-shard computation when lookback is used, and all instances have registered timestamp within the lookback window. In that case we can immediately return the original ring, because we would select all instances anyway. #2309
- [ENHANCEMENT] Memberlist: added experimental memberlist cluster label support via `-memberlist.cluster-label` and `-memberlist.cluster-label-verification-disabled` CLI flags (and their respective YAML config options). #2354
- [ENHANCEMENT] Object storage can now be configured for all components using the `common` YAML config option key (or `-common.storage.*` CLI flags). #2330 #2347
- [ENHANCEMENT] Go: updated to go 1.18.4. #2400
- [ENHANCEMENT] Store-gateway, listblocks: list of blocks now includes stats from `meta.json` file: number of series, samples and chunks. #2425
- [ENHANCEMENT] Added more buckets to `cortex_ingester_client_request_duration_seconds` histogram metric, to correctly track requests taking longer than 1s (up until 16s). #2445
- [ENHANCEMENT] Azure client: Improve memory usage for large object storage downloads. #2408
- [ENHANCEMENT] Distributor: Add `-distributor.instance-limits.max-inflight-push-requests-bytes`. This limit protects the distributor against multiple large requests that together may cause an OOM, but are only a few, so do not trigger the `max-inflight-push-requests` limit. #2413
- [ENHANCEMENT] Distributor: Drop exemplars in distributor for tenants where exemplars are disabled. #2504
- [ENHANCEMENT] Runtime Config: Allow operator to specify multiple comma-separated YAML files in `-runtime-config.file` that will be merged in left to right order. #2583
- [ENHANCEMENT] Query sharding: shard binary operations only if it doesn't lead to non-shardable vector selectors in one of the operands. #2696
- [ENHANCEMENT] Add packaging for both Debian-based deb file and Red Hat-based rpm file using FPM. #1803
- [ENHANCEMENT] Distributor: Add `cortex_distributor_query_ingester_chunks_deduped_total` and `cortex_distributor_query_ingester_chunks_total` metrics for determining how effective ingester chunk deduplication at query time is. #2713
- [ENHANCEMENT] Upgrade Docker base images to `alpine:3.16.2`. #2729
- [ENHANCEMENT] Ruler: Add `<prometheus-http-prefix>/api/v1/status/buildinfo` endpoint. #2724
- [ENHANCEMENT] Querier: Ensure all queries pulled from query-frontend or query-scheduler are immediately executed. The maximum workers concurrency in each querier is configured by `-querier.max-concurrent`. #2598
- [ENHANCEMENT] Distributor: Add `cortex_distributor_received_requests_total` and `cortex_distributor_requests_in_total` metrics to provide visibility into appropriate per-tenant request limits. #2770
- [ENHANCEMENT] Distributor: Add single forwarding remote-write endpoint for a tenant (`forwarding_endpoint`), instead of using per-rule endpoints. This takes precedence over per-rule endpoints. #2801
- [ENHANCEMENT] Added `err-mimir-distributor-max-write-message-size` to the errors catalog. #2470
- [ENHANCEMENT] Add sanity check at startup to ensure the configured filesystem directories don't overlap for different components. #2828
- [BUGFIX] TSDB: Fixed a bug on the experimental out-of-order implementation that led to wrong query results. #2701
- [BUGFIX] Compactor: log the actual error on compaction failed. #2261
- [BUGFIX] Alertmanager: restore state from storage even when running a single replica. #2293
- [BUGFIX] Ruler: do not block "List Prometheus rules" API endpoint while syncing rules. #2289
- [BUGFIX] Ruler: return proper `*status.Status` error when running in remote operational mode. #2417
- [BUGFIX] Alertmanager: ensure the configured `-alertmanager.web.external-url` is either a path starting with `/`, or a full URL including the scheme and hostname. #2381 #2542
- [BUGFIX] Memberlist: fix problem with loss of some packets, typically ring updates when instances were removed from the ring during shutdown. #2418
- [BUGFIX] Ingester: fix misfiring `MimirIngesterHasUnshippedBlocks` and stale `cortex_ingester_oldest_unshipped_block_timestamp_seconds` when some block uploads fail. #2435
- [BUGFIX] Query-frontend: fix incorrect mapping of http status codes 429 to 500 when request queue is full. #2447
- [BUGFIX] Memberlist: Fix problem with ring being empty right after startup. Memberlist KV store now tries to "fast-join" the cluster to avoid serving empty KV store. #2505
- [BUGFIX] Compactor: Fix bug when using `-compactor.partial-block-deletion-delay`: compactor didn't correctly check for modification time of all block files. #2559
- [BUGFIX] Query-frontend: fix wrong query sharding results for queries with boolean result like `1 < bool 0`. #2558
- [BUGFIX] Fixed error messages related to per-instance limits incorrectly reporting they can be set on a per-tenant basis. #2610
- [BUGFIX] Perform HA-deduplication before forwarding samples according to forwarding rules in the distributor. #2603 #2709
- [BUGFIX] Fix reporting of tracing spans from PromQL engine. #2707
- [BUGFIX] Apply relabel and drop_label rules before forwarding rules in the distributor. #2703
- [BUGFIX] Distributor: Register `cortex_discarded_requests_total` metric, which previously was not registered and therefore not exported. #2712
- [BUGFIX] Ruler: fix not restoring alerts' state at startup. #2648
- [BUGFIX] Ingester: Fix disk filling up after restarting ingesters with out-of-order support disabled while it was enabled before. #2799
- [BUGFIX] Memberlist: retry joining memberlist cluster on startup when no nodes are resolved. #2837
- [BUGFIX] Query-frontend: fix incorrect mapping of http status codes 413 to 500 when request is too large. #2819
- [BUGFIX] Alertmanager: revert upstream Alertmanager to v0.24.0 to fix panic when unmarshalling email headers. #2924 #2925
- [BUGFIX] Fix sanity check done on configured filesystem directories when running Alertmanager in microservices mode. #2947
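For scripts that call the experimental block upload API directly, the endpoint change listed above can be sketched as a small path translation. This is an illustration only: the function `migrate_upload_path` is hypothetical, the block ID used in any call is a placeholder, and the sketch assumes the paths appear exactly as written in the changelog entry, with no additional URL prefix.

```python
from urllib.parse import parse_qs, urlsplit

UPLOAD_PREFIX = "/api/v1/upload/block/"


def migrate_upload_path(old_path: str) -> str:
    """Translate a pre-2.3 block upload path into its 2.3 equivalent."""
    parts = urlsplit(old_path)
    if not parts.path.startswith(UPLOAD_PREFIX):
        raise ValueError("not a block upload path: " + old_path)
    block_id = parts.path[len(UPLOAD_PREFIX):]
    if not block_id or "/" in block_id:
        raise ValueError("expected a single block ID after the prefix: " + old_path)
    # The old API finished an upload with ?uploadComplete=true; 2.3 uses /finish.
    finishing = parse_qs(parts.query).get("uploadComplete") == ["true"]
    return UPLOAD_PREFIX + block_id + ("/finish" if finishing else "/start")
```

The new `/api/v1/upload/block/{block}/check` endpoint has no pre-2.3 counterpart, so it is not produced by this translation.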
Mixin
- [CHANGE] Dashboards: "Slow Queries" dashboard no longer works with versions older than Grafana 9.0. #2223
- [CHANGE] Alerts: use RSS memory instead of working set memory in the `MimirAllocatingTooMuchMemory` alert for ingesters. #2480
- [CHANGE] Dashboards: remove the "Cache - Latency (old)" panel from the "Mimir / Queries" dashboard. #2796
- [FEATURE] Dashboards: added support to experimental read-write deployment mode. #2780
- [ENHANCEMENT] Dashboards: added missed rule evaluations to the "Evaluations per second" panel in the "Mimir / Ruler" dashboard. #2314
- [ENHANCEMENT] Dashboards: add k8s resource requests to CPU and memory panels. #2346
- [ENHANCEMENT] Dashboards: add RSS memory utilization panel for ingesters, store-gateways and compactors. #2479
- [ENHANCEMENT] Dashboards: allow to configure graph tooltip. #2647
- [ENHANCEMENT] Alerts: MimirFrontendQueriesStuck and MimirSchedulerQueriesStuck alerts are more reliable now as they consider all the intermediate samples in the minute prior to the evaluation. #2630
- [ENHANCEMENT] Alerts: added `RolloutOperatorNotReconciling` alert, firing if the optional rollout-operator is not successfully reconciling. #2700
- [ENHANCEMENT] Dashboards: added support to query-tee in front of ruler-query-frontend in the "Remote ruler reads" dashboard. #2761
- [ENHANCEMENT] Dashboards: Introduce support for baremetal deployment, setting `deployment_type: 'baremetal'` in the mixin `_config`. #2657
- [ENHANCEMENT] Dashboards: use timeseries panel to show exemplars. #2800
- [BUGFIX] Dashboards: fixed unit of latency panels in the "Mimir / Ruler" dashboard. #2312
- [BUGFIX] Dashboards: fixed "Intervals per query" panel in the "Mimir / Queries" dashboard. #2308
- [BUGFIX] Dashboards: Make "Slow Queries" dashboard work with Grafana 9.0. #2223
- [BUGFIX] Dashboards: add missing API routes to Ruler dashboard. #2412
- [BUGFIX] Dashboards: stop setting 'interval' in dashboards; it should be set on your datasource. #2802
Jsonnet
- [CHANGE] query-scheduler is enabled by default. We advise to deploy the query-scheduler to improve the scalability of the query-frontend. #2431
- [CHANGE] Replaced anti-affinity rules with pod topology spread constraints for distributor, query-frontend, querier and ruler. #2517
  - The following configuration options have been removed:
    - `distributor_allow_multiple_replicas_on_same_node`
    - `query_frontend_allow_multiple_replicas_on_same_node`
    - `querier_allow_multiple_replicas_on_same_node`
    - `ruler_allow_multiple_replicas_on_same_node`
  - The following configuration options have been added:
    - `distributor_topology_spread_max_skew`
    - `query_frontend_topology_spread_max_skew`
    - `querier_topology_spread_max_skew`
    - `ruler_topology_spread_max_skew`
- [CHANGE] Change `max_global_series_per_metric` to 0 in all plans, and as a default value. #2669
- [FEATURE] Memberlist: added support for experimental memberlist cluster label, through the jsonnet configuration options `memberlist_cluster_label` and `memberlist_cluster_label_verification_disabled`. #2349
- [FEATURE] Added ruler-querier autoscaling support. It requires KEDA installed in the Kubernetes cluster. Ruler-querier autoscaler can be enabled and configured through the following options in the jsonnet config: #2545
  - `autoscaling_ruler_querier_enabled`: `true` to enable autoscaling.
  - `autoscaling_ruler_querier_min_replicas`: minimum number of ruler-querier replicas.
  - `autoscaling_ruler_querier_max_replicas`: maximum number of ruler-querier replicas.
  - `autoscaling_prometheus_url`: Prometheus base URL from which to scrape Mimir metrics (e.g. `http://prometheus.default:9090/prometheus`).
- [ENHANCEMENT] Memberlist now uses DNS service-discovery by default. #2549
- [ENHANCEMENT] Upgrade memcached image tag to `memcached:1.6.16-alpine`. #2740
- [ENHANCEMENT] Added `$._config.configmaps` and `$._config.runtime_config_files` to make it easy to add new configmaps or runtime config file to all components. #2748
Mimirtool
- [ENHANCEMENT] Added `mimirtool backfill` command to upload Prometheus blocks using API available in the compactor. #1822
- [ENHANCEMENT] mimirtool bucket-validation: Verify existing objects can be overwritten by subsequent uploads. #2491
- [ENHANCEMENT] mimirtool config convert: Now supports migrating to the current version of Mimir. #2629
- [BUGFIX] mimirtool analyze: Fix dashboard JSON unmarshalling errors by using custom parsing. #2386
- [BUGFIX] Version checking no longer prompts for updating when already on latest version. #2723
Mimir Continuous Test
- [ENHANCEMENT] Added basic authentication and bearer token support for when Mimir is behind a gateway authenticating the calls. #2717
Query-tee
- [CHANGE] Renamed CLI flag `-server.service-port` to `-server.http-service-port`. #2683
- [CHANGE] Renamed metric `cortex_querytee_request_duration_seconds` to `cortex_querytee_backend_request_duration_seconds`. Metric `cortex_querytee_request_duration_seconds` is now reported without label `backend`. #2683
- [ENHANCEMENT] Added HTTP over gRPC support to `query-tee` to allow testing gRPC requests to Mimir instances. #2683
Documentation
- [ENHANCEMENT] Referenced `mimirtool` commands in the HTTP API documentation. #2516
- [ENHANCEMENT] Improved DNS service discovery documentation. #2513
Tools
- [ENHANCEMENT] `markblocks` now processes multiple blocks concurrently. #2677
New Contributors
- @ese made their first contribution in #2196
- @micborens made their first contribution in #2321
- @marctc made their first contribution in #2518
- @ravilushqa made their first contribution in #2562
- @BrandonDalton made their first contribution in #2515
- @nervo made their first contribution in #2569
- @lamida made their first contribution in #2427
- @LeviHarrison made their first contribution in #2644
- @sysedwinistrator made their first contribution in #2087
Full Changelog: mimir-2.2.0...mimir-2.3.0