
alloy metrics pod crashes with alloy v1.2.1 when using prometheus.operator.servicemonitors config #1349

Closed
stefanhemeier opened this issue Jul 24, 2024 · 14 comments
Labels
bug, frozen-due-to-age

Comments

@stefanhemeier

What's wrong?

We are using prometheus.operator.servicemonitors in the config of our Alloy metrics pods in Kubernetes, and after updating to version 1.2.1, newly deployed pods crash immediately. The pods do not crash if an existing pod is updated from Alloy 1.1.1 or 1.2.0 to 1.2.1.

Maybe this has to do with this recent code change

Steps to reproduce

Deploy a new pod rather than updating an existing one; the issue does not occur when an existing pod is updated.

System information

GKE 1.27.11-gke.1062001

Software version

Grafana Alloy 1.2.1

Configuration

// Discover and scrape ServiceMonitor resources
prometheus.operator.servicemonitors "default" {
  namespaces = ["monitoring", "grafana"]
  clustering {
    enabled = true
  }
  forward_to = [prometheus.remote_write.mimir.receiver]
}

Logs

alloy ts=2024-07-24T09:03:17.830097937Z level=info msg="scrape manager stopped" component_path=/ component_id=prometheus.operator.servicemonitors.default
alloy panic: runtime error: index out of range [0] with length 0
alloy
alloy goroutine 96 [running]:
alloy github.com/grafana/alloy/internal/component/prometheus/operator/common.filterTargets(0xc00221ecc0, {0x7aac3c48b238, 0xc002f93f70})
alloy     /src/alloy/internal/component/prometheus/operator/common/crdmanager.go:203 +0x5f1
alloy github.com/grafana/alloy/internal/component/prometheus/operator/common.(*crdManager).Run(0xc00290f7a0, {0x92bead0, 0xc001d64af0})
alloy     /src/alloy/internal/component/prometheus/operator/common/crdmanager.go:158 +0x925
alloy github.com/grafana/alloy/internal/component/prometheus/operator/common.(*Component).Run.func2()
alloy     /src/alloy/internal/component/prometheus/operator/common/component.go:95 +0x38
alloy created by github.com/grafana/alloy/internal/component/prometheus/operator/common.(*Component).Run in goroutine 460
alloy     /src/alloy/internal/component/prometheus/operator/common/component.go:94 +0x38b
@stefanhemeier added the bug label on Jul 24, 2024
@loafoe

loafoe commented Jul 24, 2024

Running into the same issue; downgrading to 1.2.0 fixes this for us.

@raphaelfan

raphaelfan commented Jul 24, 2024

We encountered the same issue and can confirm that it runs fine when updating from 1.1.1 or 1.2.0 to 1.2.1. We are running Alloy chart 0.5.1 with Alloy 1.2.1 as a StatefulSet with clustering enabled to scrape metrics.

In addition, with 1.2.1, when we removed the operator configurations from the ConfigMap and restarted the StatefulSet, Alloy came up with clustering. However, the way the cluster formed was a bit strange: we have 3 pods, but only 2 joined the cluster, and the first pod that restarted needed another restart before it joined. Once the cluster came up, we added the operator configurations back and it worked fine.

@wildum
Contributor

wildum commented Jul 25, 2024

Hey, thanks for reporting this bug :)

Maybe this has to do with this recent code change

That change is actually the fix for this bug; it's not yet in v1.2.1. (The panic above indicates that the code tries to access the first value of a slice that is empty; the fix checks the slice length to avoid this.)
This should be solved in the next release (currently planned for the 1st of August).
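
If it helps, here is a minimal sketch of the guard pattern the fix applies. This is illustrative only, not the actual crdmanager code; the function and slice names are made up:

package main

import "fmt"

// firstTarget is a hypothetical helper showing the pattern: guard the
// slice access with a length check instead of indexing unconditionally.
func firstTarget(targets []string) (string, bool) {
    // Indexing targets[0] on an empty slice panics with
    // "index out of range [0] with length 0", as in the stack trace above.
    if len(targets) == 0 {
        return "", false
    }
    return targets[0], true
}

func main() {
    var none []string
    if _, ok := firstTarget(none); !ok {
        fmt.Println("no targets discovered yet, skipping")
    }
}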

@stefanhemeier
Author

Oh, I somehow missed that it is not yet in 1.2.1.
But it's great that it will be fixed in the next release. Thank you very much for the update. :)

@raphaelfan

Hi @wildum, just wanted to follow up on the bug: do you think the fix is on track to be released this week? Thanks.

@wildum
Contributor

wildum commented Aug 1, 2024

Hey, the release is slightly delayed; we are aiming for next Monday (5th of August).

@raphaelfan

Thanks @wildum, does v1.3.0-rc.0 have it though? I was thinking about giving it a try.

@wildum
Contributor

wildum commented Aug 1, 2024

It does, yes :) Feel free to try it, but keep in mind that this is a release candidate that may contain bugs. We test it internally before we release a stable version.

@raphaelfan

I tested v1.3.0-rc.0. Just to provide a data point: it no longer gives the index out of range error. However, it still fails to form the cluster; it shows the same behavior as the dev image.

{"ts":"2024-08-01T17:41:57.250315897Z","level":"info","msg":"Using pod service account via in-cluster config","component_path":"/","component_id":"prometheus.operator.podmonitors.unified_relay"}
{"ts":"2024-08-01T17:41:57.25651462Z","level":"warn","msg":"failed to resolve SRV records","service":"cluster","addr":"unified-relay-metrics-cluster","err":"lookup unified-relay-metrics-cluster on 10.96.5.5:53: no such host"}
{"ts":"2024-08-01T17:41:57.256553615Z","level":"error","msg":"fatal error: failed to get peers to join at startup - this is likely a configuration error","service":"cluster","err":"static peer discovery: failed to find any valid join addresses: failed to extract host and port: address unified-relay-metrics-cluster: missing port in address\nfailed to resolve SRV records: lookup unified-relay-metrics-cluster on 10.96.5.5:53: no such host"}

@wildum
Contributor

wildum commented Aug 2, 2024

@thampiotr do you know why this happens?

@thampiotr
Contributor

It happens because Alloy fails to resolve the provided cluster join peer addresses at startup:

failed to resolve SRV records: lookup unified-relay-metrics-cluster on 10.96.5.5:53: no such host

In previous versions, there was another bug where a failure to resolve the cluster join peers at startup would not result in a failure and errors would not be logged. This is undesired behaviour, because such a misconfiguration can result in all instances starting their own cluster and trying to take on all the work, leading to duplication and likely resource exhaustion. We want to fail fast and fail hard in this scenario, hence you get this error. This behaviour was changed here.

Try to verify that the name unified-relay-metrics-cluster can be resolved (e.g. that it matches your Kubernetes Service). Also, if you have a large number of instances in the cluster and are using the Helm chart, you may be running into this issue. We're working on a fix for it, but a workaround for now could be to use fewer instances (e.g. 40) that are slightly larger.
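
For illustration, here is a rough standalone Go sketch of the two resolution steps your error message describes. This is not Alloy's actual discovery code; the address is taken from the logs above, and it would need to run from a pod that uses the cluster DNS:

package main

import (
    "fmt"
    "net"
)

func main() {
    // Join address as it appears in the logs in this thread.
    addr := "unified-relay-metrics-cluster"

    // First error path: the value is not a host:port pair.
    if _, _, err := net.SplitHostPort(addr); err != nil {
        fmt.Println("host:port parse failed:", err) // "missing port in address"
    }

    // Second error path: the SRV lookup of the bare name fails when no
    // matching records are published for the Service.
    _, srvs, err := net.LookupSRV("", "", addr)
    if err != nil {
        fmt.Println("SRV lookup failed:", err)
        return
    }
    for _, s := range srvs {
        fmt.Printf("peer: %s:%d\n", s.Target, s.Port)
    }
}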

@raphaelfan

raphaelfan commented Aug 3, 2024

Thanks @thampiotr. Looking back, in 1.2.1 I had already hit the DNS resolution issue; it was just obscured by the index out of range issue. In 1.3.0-rc-1, 1) the index out of range issue is fixed, and 2) the pods fail hard when they cannot resolve the cluster join peers, which is why I see the failed to resolve SRV records messages.

I found a way to fix the DNS issue by adding this to the service-cluster.yaml:
publishNotReadyAddresses: true
This allows the StatefulSet's headless Service to publish the SRV records of the pods so that the cluster can form and the pods can become ready.

@stefanhemeier
Author

This is now working with version 1.3.0 in our environment.

@github-actions bot locked as resolved and limited conversation to collaborators on Sep 8, 2024
@thampiotr
Contributor

Version 1.3.1 addresses this: https://github.com/grafana/alloy/releases/tag/v1.3.1
