fix(prometheus.operator): retry `GetInformer` when running prometheus operator components #6415

onematchfox · 2024-02-21T14:13:51Z

PR Description

This PR adds a retry when running prometheus.operator components, which ensures that the components do not completely fail to start up if there is any delay in the pod being able to communicate with the Kubernetes API.

At present, the prometheus.operator flow components fail to run on our GKE clusters when a new pod starts up. Logs as follows:

{"ts":"2024-02-21T11:24:44.279996707Z","level":"error","msg":"error running crd manager","component":"prometheus.operator.podmonitors.pods","err":"could not create RESTMapper from config: Get \"https://<redacted>:443/api\": dial tcp <redacted>:443: connect: connection refused"}
{"ts":"2024-02-21T11:24:44.280125067Z","level":"error","msg":"error running crd manager","component":"prometheus.operator.servicemonitors.services","err":"could not create RESTMapper from config: Get \"https://<redacted>:443/api\": dial tcp <redacted>:443: connect: connection refused"}

However, if we force reload the config (remove the component and then add it back in again) then everything works as expected.

With these changes in place, the components successfully start after a retry or two. Logs as follows:

{"ts":"2024-02-21T13:39:44.328729137Z","level":"warn","msg":"failed to get informer - will retry","component":"prometheus.operator.servicemonitors.services","err":"failed to get restmapping: failed to get server groups: Get \"https://<redacted>:443/api\": dial tcp <redacted>:443: connect: connection refused"}
{"ts":"2024-02-21T13:39:44.328801047Z","level":"warn","msg":"failed to get informer - will retry","component":"prometheus.operator.podmonitors.pods","err":"failed to get restmapping: failed to get server groups: Get \"https://<redacted>:443/api\": dial tcp <redacted>:443: connect: connection refused"}
{"ts":"2024-02-21T13:39:44.479158994Z","level":"warn","msg":"failed to get informer - will retry","component":"prometheus.operator.podmonitors.pods","err":"failed to get restmapping: failed to get server groups: Get \"https://<redacted>:443/api\": dial tcp <redacted>:443: connect: connection refused"}
{"ts":"2024-02-21T13:39:44.521300633Z","level":"warn","msg":"failed to get informer - will retry","component":"prometheus.operator.servicemonitors.services","err":"failed to get restmapping: failed to get server groups: Get \"https://<redacted>:443/api\": dial tcp <redacted>:443: connect: connection refused"}
{"ts":"2024-02-21T13:39:44.977446245Z","level":"info","msg":"informers started","component":"prometheus.operator.podmonitors.pods"}
{"ts":"2024-02-21T13:39:45.008419174Z","level":"info","msg":"informers started","component":"prometheus.operator.servicemonitors.services"}

Which issue(s) this PR fixes

Notes to the Reviewer

PR Checklist

CHANGELOG.md updated
Documentation added
Tests updated
Config converters updated

CLAassistant · 2024-02-21T14:13:59Z

All committers have signed the CLA.

onematchfox · 2024-02-21T14:16:31Z

component/prometheus/operator/common/crdmanager.go

+			return err
+		}
+
+		opts.Mapper, err = apiutil.NewDynamicRESTMapper(restConfig, opts.HTTPClient)


The change of mapper is required because the dynamic REST mapper lazy-loads mappings. Hence, we can retry calls to `GetInformer` below rather than failing outright when calling `cache.New` here. The dynamic mapper is also the default going forward, so rather than having to redo this work or risk introducing a regression (if I were to retry here instead), I have implemented this so that it will continue to work when upgrading controller-runtime.
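
To illustrate why lazy loading is what makes the retry possible, here is a small stdlib-only sketch contrasting an eager and a lazy mapper. Everything in it (`mapper`, `newEagerMapper`, `newLazyMapper`, `flakyDiscovery`, the kind-to-resource map) is hypothetical and only models the idea; it does not reproduce the controller-runtime implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// discover is a hypothetical stand-in for an API discovery request.
type discover func() (map[string]string, error)

// mapper resolves a kind to a resource name using discovered mappings.
type mapper struct {
	discover discover
	mappings map[string]string
}

// newEagerMapper runs discovery in the constructor, so an unreachable
// API server makes construction itself fail and nothing can be retried later.
func newEagerMapper(d discover) (*mapper, error) {
	m, err := d()
	if err != nil {
		return nil, fmt.Errorf("could not create mapper: %w", err)
	}
	return &mapper{discover: d, mappings: m}, nil
}

// newLazyMapper defers discovery to the first lookup, so construction
// always succeeds and a failed lookup can simply be called again.
func newLazyMapper(d discover) *mapper {
	return &mapper{discover: d}
}

// lookup loads mappings on first use and keeps them once discovery succeeds.
func (m *mapper) lookup(kind string) (string, error) {
	if m.mappings == nil {
		loaded, err := m.discover()
		if err != nil {
			return "", fmt.Errorf("failed to get mapping: %w", err)
		}
		m.mappings = loaded
	}
	res, ok := m.mappings[kind]
	if !ok {
		return "", fmt.Errorf("no mapping for kind %q", kind)
	}
	return res, nil
}

// flakyDiscovery fails for the first `failures` calls, then succeeds.
func flakyDiscovery(failures int) discover {
	calls := 0
	return func() (map[string]string, error) {
		calls++
		if calls <= failures {
			return nil, errors.New("connect: connection refused")
		}
		return map[string]string{"PodMonitor": "podmonitors", "ServiceMonitor": "servicemonitors"}, nil
	}
}

func main() {
	if _, err := newEagerMapper(flakyDiscovery(1)); err != nil {
		fmt.Println("eager:", err)
	}
	lazy := newLazyMapper(flakyDiscovery(1))
	for i := 0; i < 2; i++ {
		res, err := lazy.lookup("PodMonitor")
		fmt.Println("lazy:", res, err)
	}
}
```

With the eager variant the failure surfaces at construction time; with the lazy variant it surfaces at the first lookup, which is a call site that can be wrapped in a retry.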

onematchfox · 2024-02-21T14:16:58Z

component/prometheus/operator/common/crdmanager.go

@@ -266,6 +275,20 @@ func (c *crdManager) runInformers(restConfig *rest.Config, ctx context.Context)
 		if ls != labels.Nothing() {
 			opts.DefaultLabelSelector = ls
 		}
+
+		// TODO: Remove custom opts.Mapper when sigs.k8s.io/controller-runtime >= 0.17.0 as `NewDynamicRESTMapper` is the default in that version


See kubernetes-sigs/controller-runtime#2611 and https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.17.0. FWIW, I had a quick go (sorry, not sorry) at upgrading this but ran into all sorts of dependency issues due to the overrides in this repo (e.g. here).

hainenber · 2024-02-21T17:13:40Z

go.mod

@@ -232,7 +232,7 @@ require (
 	k8s.io/component-base v0.28.1
 	k8s.io/klog/v2 v2.100.1
 	k8s.io/utils v0.0.0-20230726121419-3b25d923346b
-	sigs.k8s.io/controller-runtime v0.16.2
+	sigs.k8s.io/controller-runtime v0.16.2 // TODO: Remove custom rest mapper from component/prometheus/operator/common/crdmanager.go when upgrading past v0.17.0


Is the comment meant to be in the go.mod? 👀

Yes, I intentionally put it here so that it will be visible in the diff when someone does upgrade this dependency

Cool, thanks for clarifying!

hainenber · 2024-02-21T17:17:48Z

component/prometheus/operator/common/crdmanager.go

+
+		// TODO: Remove custom opts.Mapper when sigs.k8s.io/controller-runtime >= 0.17.0 as `NewDynamicRESTMapper` is the default in that version
+		var err error
+		opts.HTTPClient, err = rest.HTTPClientFor(restConfig)


I think this assignment is a bit premature: if there is an error, the opts attribute still keeps the erroneous client.

The code returns on error just below so opts will be discarded in that scenario regardless.

Ah, good catch. LGTM then

onematchfox · 2024-03-04T13:40:52Z

Rebased to address conflicts introduced by #6552.

github-actions · 2024-04-06T00:09:46Z

This PR has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If you do not have enough time to follow up on this PR or you think it's no longer relevant, consider closing it.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your PR will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!

github-actions · 2024-05-13T00:11:03Z

This PR has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If you do not have enough time to follow up on this PR or you think it's no longer relevant, consider closing it.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your PR will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!

onematchfox · 2024-05-13T07:15:32Z

Waiting on maintainers

github-actions · 2024-06-15T00:10:48Z

This PR has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If you do not have enough time to follow up on this PR or you think it's no longer relevant, consider closing it.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your PR will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!

onematchfox added 2 commits March 4, 2024 14:36

fix(prometheus.operator): retry GetInformer when running prometheus…

6261d6e

… operator components

docs: add fix to CHANGELOG.md

349eb39

onematchfox force-pushed the fix/prometheus-crd-manager branch from de324cb to 349eb39 Compare March 4, 2024 13:40

github-actions bot added the needs-attention An issue or PR has been sitting around and needs attention. label Apr 6, 2024

rfratto added variant/flow Relatd to Grafana Agent Flow. enhancement New feature or request labels Apr 9, 2024

github-actions bot removed the needs-attention An issue or PR has been sitting around and needs attention. label Apr 11, 2024

github-actions bot added the needs-attention An issue or PR has been sitting around and needs attention. label May 13, 2024

github-actions bot removed the needs-attention An issue or PR has been sitting around and needs attention. label May 14, 2024

github-actions bot added the needs-attention An issue or PR has been sitting around and needs attention. label Jun 15, 2024
