
Alloy Test and Investigation for Metrics #3522

Closed
1 of 4 tasks
Tracked by #3520
Rotfuks opened this issue Jun 24, 2024 · 13 comments

Rotfuks (Contributor) commented Jun 24, 2024

Motivation

We want to unify all of our agents on the new OpenTelemetry-based agent from Grafana Labs: Alloy. As a first step we need to verify that Alloy can deliver exactly the same capabilities as the Prometheus/Grafana agents when collecting metrics.

Todo

  • Deploy Alloy for one test installation (like gazelle)
    • Create a feature flag with which we can select which agents are active - for this we still need to deploy both
  • Check that Mimir still receives the same metrics we expect
  • Compare Alloy's resource consumption with that of the previous agents: is it lower, higher, or about the same?

Outcome

  • We have gained experience with Alloy and are confident enough to roll it out as the metrics-collection agent everywhere
TheoBrigitte (Member) commented Aug 8, 2024

I managed to use Alloy to send metrics to Mimir.

Settings kept:

  • affinity for Karpenter and Alloy itself
  • external labels
  • service and pod monitor selectors
  • remote write
  • scrape interval
  • priority class

Settings abandoned:

Decision made:

Here is the values file I used to deploy Alloy as a metrics ingester: values.yaml.gz

helm install alloy giantswarm-test/alloy --version 0.3.1-c58378e71cbb5e9da677957500cf43b951d870a1 --values values.yaml
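
For context, here is a minimal sketch of the kind of Alloy configuration those kept settings boil down to. The component and block names are taken from the upstream Alloy documentation as I understand it, and the endpoint URL and label values are placeholders; the attached values.yaml.gz stays the authoritative source.

// Scrape targets discovered via ServiceMonitors and PodMonitors,
// mirroring the Prometheus agent's monitor selectors (selector blocks
// and the scrape-interval override are omitted for brevity; affinity
// and priority class live in the Helm values).
prometheus.operator.servicemonitors "default" {
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.operator.podmonitors "default" {
  forward_to = [prometheus.remote_write.mimir.receiver]
}

// Ship samples to Mimir, keeping the external labels the Prometheus
// agent used to attach.
prometheus.remote_write "mimir" {
  external_labels = {
    cluster_id   = "<cluster>",
    installation = "<installation>",
  }
  endpoint {
    url = "https://<mimir-gateway>/api/v1/push"
  }
}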

The amount of metrics sent to Mimir stays the same as with the Prometheus agent:

  • left side is the Prometheus agent sending metrics
  • middle is just me experimenting with Alloy; the spike is from running 2 Alloy replicas without clustering, which nearly doubled the req./s (see the clustering sketch below)
  • right side is Alloy sending metrics

[screenshot: remote-write requests/s to Mimir (left: Prometheus agent, middle: testing, right: Alloy)]
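
Side note on that spike: with clustering enabled, the replicas coordinate and split the discovered targets between them instead of each scraping everything. A rough sketch, assuming the upstream chart's clustering toggle and the clustering block of the operator components:

// With clustering turned on in the Helm chart (the upstream chart
// exposes a clustering.enabled toggle), replicas form a cluster and
// each instance scrapes only its share of the discovered targets, so
// adding a replica does not duplicate the remote-write traffic.
prometheus.operator.servicemonitors "default" {
  forward_to = [prometheus.remote_write.mimir.receiver]

  clustering {
    enabled = true
  }
}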

TheoBrigitte (Member) commented Aug 15, 2024

Here are the results of running the Prometheus agent and Alloy as the metrics agent. All tests were run on the same installation (golem), and each test ran for 1 hour.

I used 4 different test cases:

  • 1 Prometheus agent replica/shard
  • 1 Alloy replica
  • 2 Prometheus agent replicas/shards
  • 2 Alloy replicas

Agents

Agent              Replicas   CPU      Memory
Alloy              1          < 0.1    < 3 GiB
Prometheus agent   1          > 0.1    > 4 GiB
Alloy              2          < 0.05   < 3 GiB
Prometheus agent   2          < 0.1    < 3 GiB

Mimir

The amount of metrics, network and resource load on Mimir stayed approximately the same across all tests. Some Mimir ingesters restarted and had some impact on the values shown in the graphs, but the values are mostly within the same range.

Summary

These tests showed that Alloy tends to consume about the same amount of resources as the Prometheus agent, or less, and that the load on Mimir stayed the same across all tests.

Here are the results as Grafana dashboard screenshots: prometheus-agent_vs_alloy.tar.gz

TheoBrigitte (Member) commented:

Most of the work and testing was done in giantswarm/observability-operator#66

I decided to stick with our current custom autoscaling solution, as anything else would differ too much from what we have today, and it is also more complex to find a fit for every installation size.

  • alloy-app v0.4.0 was released with support for secret values
  • observability-bundle v1.6.0 was released with the new alloy-metrics app at v0.4.0
  • observability-operator v0.4.0 was released with support for Alloy as the monitoring agent

Deployment to an installation is currently blocked, as this feature is only supported on CAPI installations and we need a new release to ship the new observability-bundle.

TheoBrigitte (Member) commented:

v29.1.0 is on its way; once it is released to a CAPA installation we can proceed with our live testing of Alloy as the monitoring agent. We would then only need to toggle the monitoring-agent flag for the observability-operator (example: https://github.com/giantswarm/giantswarm-configs/pull/135/files).

QuentinBisson commented:

As an FYI, the release was merged :)

TheoBrigitte (Member) commented Aug 29, 2024

Now we need to get it deployed to the MCs: https://github.com/giantswarm/giantswarm-management-clusters/pull/749

QuentinBisson commented:

We can try it on a WC, right?

QuentinBisson commented:

Oh wait, no, we cannot, because of this: https://github.com/giantswarm/observability-operator/blob/09ddfe046e6a81cc6b874ac537941be9a495bc18/internal/controller/cluster_monitoring_controller.go#L181

Maybe the services should be created on each reconciliation then, so the agent is always injected? Or passed as a function parameter.

QuentinBisson commented:

Yes, we can test it out on the gazelle/cicddev cluster as it is running 29.1.0 :)

TheoBrigitte (Member) commented:

There were actually a few issues preventing this from being rolled out:

  • incorrect catalog name in observability-bundle for alloyMetrics
  • invalid Alloy configuration due to a missing comma in the external labels map (see the fragment below)
  • broken release pipeline in observability-operator

Those are all fixed now, but we need to wait for an upgrade of observability-bundle to v1.6.2, most likely in CAPA v30.0.0 > giantswarm/releases#1357 (review)
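
For illustration, the configuration problem was only syntax: in Alloy's configuration language, entries in a map like external_labels have to be comma-separated. A fragment of the relevant piece of config, with placeholder label names and values:

external_labels = {
  cluster_id   = "<cluster>",      // a comma is required after every entry;
  installation = "<installation>", // one missing comma here was enough to
}                                  // make the whole generated config invalid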

TheoBrigitte (Member) commented:

This is running on golem now and would be available from:

Reminder: make sure we make an announcement to customers before releasing alloy-metrics.

QuentinBisson commented:

@TheoBrigitte as this is an investigation story and not the rollout, should this be put in tracking or closed?

Rotfuks (Contributor, Author) commented Sep 30, 2024

Done on our side for now.

Rotfuks closed this as completed on Sep 30, 2024
The github-project-automation bot moved this from Inbox 📥 to Done ✅ in Roadmap on Sep 30, 2024