
alloy metrics pod crashes with alloy v1.2.1 when using prometheus.operator.servicemonitors config #1349

Closed
stefanhemeier opened this issue Jul 24, 2024 · 14 comments
Labels
bug, frozen-due-to-age

Comments

@stefanhemeier

What's wrong?

We are using prometheus.operator.servicemonitors in the config of our Alloy metrics pods in Kubernetes, and after updating to version 1.2.1, newly deployed pods crash immediately. The pods do not crash if an existing pod is updated from Alloy 1.1.1 or 1.2.0 to 1.2.1.

Maybe this has to do with this recent code change

Steps to reproduce

Deploy a new pod rather than updating an existing one; the issue does not occur when an existing pod is updated.

System information

GKE 1.27.11-gke.1062001

Software version

Grafana Alloy 1.2.1

Configuration

// Discover and scrape ServiceMonitor resources
prometheus.operator.servicemonitors "default" {
  namespaces = ["monitoring", "grafana"]
  clustering {
    enabled = true
  }
  forward_to = [prometheus.remote_write.mimir.receiver]
}

Logs

alloy ts=2024-07-24T09:03:17.830097937Z level=info msg="scrape manager stopped" component_path=/ component_id=prometheus.operator.servicemonitors.default
alloy panic: runtime error: index out of range [0] with length 0
alloy
alloy goroutine 96 [running]:
alloy github.com/grafana/alloy/internal/component/prometheus/operator/common.filterTargets(0xc00221ecc0, {0x7aac3c48b238, 0xc002f93f70})
alloy     /src/alloy/internal/component/prometheus/operator/common/crdmanager.go:203 +0x5f1
alloy github.com/grafana/alloy/internal/component/prometheus/operator/common.(*crdManager).Run(0xc00290f7a0, {0x92bead0, 0xc001d64af0})
alloy     /src/alloy/internal/component/prometheus/operator/common/crdmanager.go:158 +0x925
alloy github.com/grafana/alloy/internal/component/prometheus/operator/common.(*Component).Run.func2()
alloy     /src/alloy/internal/component/prometheus/operator/common/component.go:95 +0x38
alloy created by github.com/grafana/alloy/internal/component/prometheus/operator/common.(*Component).Run in goroutine 460
alloy     /src/alloy/internal/component/prometheus/operator/common/component.go:94 +0x38b
@stefanhemeier added the bug label on Jul 24, 2024
@loafoe

loafoe commented Jul 24, 2024

Running into the same issue; downgrading to 1.2.0 fixes this for us.

@raphaelfan

raphaelfan commented Jul 24, 2024

We encountered the same issue and can confirm that it runs fine when updating from 1.1.1 or 1.2.0 to 1.2.1. We are running Alloy chart 0.5.1 with Alloy 1.2.1 as a StatefulSet with clustering enabled to scrape metrics.

In addition, with 1.2.1, when we removed the operator configurations from the ConfigMap and restarted the StatefulSet, Alloy came up with clustering. However, the way the cluster formed was a bit strange: we have 3 pods, but only 2 joined the cluster, and the first pod that restarted needed another restart before it joined. Once the cluster came up, we added the operator configurations back and it worked fine.

@wildum
Contributor

wildum commented Jul 25, 2024

Hey, thanks for reporting this bug :)

Maybe this has to do with this recent code change

That change is actually the fix for this bug; it's not yet in v1.2.1. (The panic above indicates that the code tries to access the first value of a slice that is empty; the fix checks the slice length to avoid this.)
This should be solved in the next release (currently planned for the 1st of August).
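
If it helps, here is a minimal sketch of the guard pattern the fix applies. This is illustrative only, not the actual crdmanager code; the function and slice names are made up:

package main

import "fmt"

// firstTarget is a hypothetical helper showing the pattern: guard the
// slice access with a length check instead of indexing unconditionally.
func firstTarget(targets []string) (string, bool) {
    // Indexing targets[0] on an empty slice panics with
    // "index out of range [0] with length 0", as in the stack trace above.
    if len(targets) == 0 {
        return "", false
    }
    return targets[0], true
}

func main() {
    var none []string
    if _, ok := firstTarget(none); !ok {
        fmt.Println("no targets discovered yet, skipping")
    }
}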

@stefanhemeier
Author

Oh, I somehow missed that it is not yet in 1.2.1.
But it's great that it will be fixed in the next release. Thank you very much for the update. :)

@raphaelfan

Hi @wildum, just wanted to follow up on the bug: do you think the fix is on track to be released this week? Thanks.

@wildum
Contributor

wildum commented Aug 1, 2024

Hey, the release is slightly delayed; we are aiming for next Monday (5th of August).

@raphaelfan

Thanks @wildum, does v1.3.0-rc.0 have it though? I was thinking about giving it a try.

@wildum
Contributor

wildum commented Aug 1, 2024

It does, yes :) Feel free to try it, but keep in mind that this is a release candidate that may contain bugs. We test it internally before we release a stable version.

@raphaelfan

I tested v1.3.0-rc.0. Just to provide a data point: it no longer gives the index out of range error. However, it still fails to form the cluster; it shows the same behavior as the dev image.

{"ts":"2024-08-01T17:41:57.250315897Z","level":"info","msg":"Using pod service account via in-cluster config","component_path":"/","component_id":"prometheus.operator.podmonitors.unified_relay"}
{"ts":"2024-08-01T17:41:57.25651462Z","level":"warn","msg":"failed to resolve SRV records","service":"cluster","addr":"unified-relay-metrics-cluster","err":"lookup unified-relay-metrics-cluster on 10.96.5.5:53: no such host"}
{"ts":"2024-08-01T17:41:57.256553615Z","level":"error","msg":"fatal error: failed to get peers to join at startup - this is likely a configuration error","service":"cluster","err":"static peer discovery: failed to find any valid join addresses: failed to extract host and port: address unified-relay-metrics-cluster: missing port in address\nfailed to resolve SRV records: lookup unified-relay-metrics-cluster on 10.96.5.5:53: no such host"}

@wildum
Contributor

wildum commented Aug 2, 2024

@thampiotr do you know why this happens?

@thampiotr
Contributor

It happens because Alloy fails to resolve the provided cluster join peer addresses at startup:

failed to resolve SRV records: lookup unified-relay-metrics-cluster on 10.96.5.5:53: no such host

In previous versions, there was another bug where a failure to resolve the cluster join peers at startup would not result in a failure and errors would not be logged. This is undesired behaviour, because such a misconfiguration can result in all instances starting their own cluster and trying to take on all the work, leading to duplication and likely resource exhaustion. We want to fail fast and fail hard in this scenario, hence you get this error. This behaviour was changed here.

Try to verify that the name unified-relay-metrics-cluster can be resolved (e.g. that it matches your Kubernetes Service). Also, if you have a large number of instances in the cluster and are using the Helm chart, you may be running into this issue. We're working on a fix for it, but a workaround for now could be to use fewer instances (e.g. 40) that are slightly larger.
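
For illustration, here is a rough standalone Go sketch of the two resolution steps your error message describes. This is not Alloy's actual discovery code; the address is taken from the logs above, and it would need to run from a pod that uses the cluster DNS:

package main

import (
    "fmt"
    "net"
)

func main() {
    // Join address as it appears in the logs in this thread.
    addr := "unified-relay-metrics-cluster"

    // First error path: the value is not a host:port pair.
    if _, _, err := net.SplitHostPort(addr); err != nil {
        fmt.Println("host:port parse failed:", err) // "missing port in address"
    }

    // Second error path: the SRV lookup of the bare name fails when no
    // matching records are published for the Service.
    _, srvs, err := net.LookupSRV("", "", addr)
    if err != nil {
        fmt.Println("SRV lookup failed:", err)
        return
    }
    for _, s := range srvs {
        fmt.Printf("peer: %s:%d\n", s.Target, s.Port)
    }
}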

@raphaelfan

raphaelfan commented Aug 3, 2024

Thanks @thampiotr. Looking back, in 1.2.1 I had already hit the DNS resolution issue; it was just obscured by the index out of range issue. In 1.3.0-rc-1, 1) the index out of range issue is fixed, and 2) the pods fail hard when they cannot resolve the cluster join peers, which is why I see the failed to resolve SRV records messages.

I found a way to fix the DNS issue by adding this to the service-cluster.yaml:
publishNotReadyAddresses: true
This allows the StatefulSet's headless Service to publish the SRV records of the pods so that the cluster can form and the pods can become ready.

@stefanhemeier
Author

This is now working with version 1.3.0 in our environment.

@github-actions bot locked as resolved and limited conversation to collaborators on Sep 8, 2024
@thampiotr
Contributor

Version 1.3.1 addresses this: https://github.com/grafana/alloy/releases/tag/v1.3.1
