
fix(prometheus.operator): retry GetInformer when running prometheus operator components #6415

Open: onematchfox wants to merge 2 commits into main from fix/prometheus-crd-manager

Conversation

onematchfox

PR Description

This PR adds a retry when attempting to run prometheus.operator components, which ensures that the components do not completely fail to start up if there is any delay before the pod is able to communicate with the Kubernetes API.

At present, the prometheus.operator flow components fail to run on our GKE clusters when a new pod starts up. Logs as follows:

{"ts":"2024-02-21T11:24:44.279996707Z","level":"error","msg":"error running crd manager","component":"prometheus.operator.podmonitors.pods","err":"could not create RESTMapper from config: Get \"https://<redacted>:443/api\": dial tcp <redacted>:443: connect: connection refused"}
{"ts":"2024-02-21T11:24:44.280125067Z","level":"error","msg":"error running crd manager","component":"prometheus.operator.servicemonitors.services","err":"could not create RESTMapper from config: Get \"https://<redacted>:443/api\": dial tcp <redacted>:443: connect: connection refused"}

However, if we force reload the config (remove the component and then add it back in again) then everything works as expected.

With these changes in place, the components successfully start after a retry or two. Logs as follows:

{"ts":"2024-02-21T13:39:44.328729137Z","level":"warn","msg":"failed to get informer - will retry","component":"prometheus.operator.servicemonitors.services","err":"failed to get restmapping: failed to get server groups: Get \"https://<redacted>:443/api\": dial tcp <redacted>:443: connect: connection refused"}
{"ts":"2024-02-21T13:39:44.328801047Z","level":"warn","msg":"failed to get informer - will retry","component":"prometheus.operator.podmonitors.pods","err":"failed to get restmapping: failed to get server groups: Get \"https://<redacted>:443/api\": dial tcp <redacted>:443: connect: connection refused"}
{"ts":"2024-02-21T13:39:44.479158994Z","level":"warn","msg":"failed to get informer - will retry","component":"prometheus.operator.podmonitors.pods","err":"failed to get restmapping: failed to get server groups: Get \"https://<redacted>:443/api\": dial tcp <redacted>:443: connect: connection refused"}
{"ts":"2024-02-21T13:39:44.521300633Z","level":"warn","msg":"failed to get informer - will retry","component":"prometheus.operator.servicemonitors.services","err":"failed to get restmapping: failed to get server groups: Get \"https://<redacted>:443/api\": dial tcp <redacted>:443: connect: connection refused"}
{"ts":"2024-02-21T13:39:44.977446245Z","level":"info","msg":"informers started","component":"prometheus.operator.podmonitors.pods"}
{"ts":"2024-02-21T13:39:45.008419174Z","level":"info","msg":"informers started","component":"prometheus.operator.servicemonitors.services"}

Which issue(s) this PR fixes

Notes to the Reviewer

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated


CLAassistant commented Feb 21, 2024

CLA assistant check
All committers have signed the CLA.

return err
}

opts.Mapper, err = apiutil.NewDynamicRESTMapper(restConfig, opts.HTTPClient)
@onematchfox (Author) commented Feb 21, 2024

The change in mapper is required because the dynamic REST mapper lazily loads mappings. Hence, we can retry calls to GetInformer below rather than simply failing when calling cache.New here. This is also the default going forward, so rather than having to redo this work or risk introducing a regression (if I were to retry here instead), I have implemented it this way so that it will continue to work when controller-runtime is upgraded.
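
A minimal sketch of that wiring (the newLazyCache helper name is assumed; this is not the PR verbatim): the HTTP client and the lazy dynamic REST mapper are set on cache.Options before cache.New, so the first discovery request only happens inside GetInformer, where it can be retried.

package sketch

import (
	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client/apiutil"
)

// newLazyCache builds a controller-runtime cache whose REST mappings are
// resolved lazily, so constructing the cache does not require the API server
// to be reachable yet.
func newLazyCache(restConfig *rest.Config) (cache.Cache, error) {
	opts := cache.Options{}

	var err error
	opts.HTTPClient, err = rest.HTTPClientFor(restConfig)
	if err != nil {
		return nil, err
	}

	// With controller-runtime v0.16's default (eager) mapper, discovery runs
	// inside cache.New, and a transient "connection refused" fails the
	// component outright ("could not create RESTMapper from config").
	// The dynamic mapper defers discovery and is the default from v0.17.
	opts.Mapper, err = apiutil.NewDynamicRESTMapper(restConfig, opts.HTTPClient)
	if err != nil {
		return nil, err
	}

	return cache.New(restConfig, opts)
}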

@@ -266,6 +275,20 @@ func (c *crdManager) runInformers(restConfig *rest.Config, ctx context.Context)
if ls != labels.Nothing() {
opts.DefaultLabelSelector = ls
}

// TODO: Remove custom opts.Mapper when sigs.k8s.io/controller-runtime >= 0.17.0 as `NewDynamicRESTMapper` is the default in that version
@onematchfox (Author) commented Feb 21, 2024

See kubernetes-sigs/controller-runtime#2611 and https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.17.0. FWIW, I had a quick go (sorry, not sorry) at upgrading this but ran into all sorts of dependency issues due to the overrides in this repo (e.g. here).

@@ -232,7 +232,7 @@ require (
k8s.io/component-base v0.28.1
k8s.io/klog/v2 v2.100.1
k8s.io/utils v0.0.0-20230726121419-3b25d923346b
- sigs.k8s.io/controller-runtime v0.16.2
+ sigs.k8s.io/controller-runtime v0.16.2 // TODO: Remove custom rest mapper from component/prometheus/operator/common/crdmanager.go when upgrading past v0.17.0
Contributor:

Is the comment meant to be in the go.mod? 👀

Author (@onematchfox):

Yes, I intentionally put it here so that it will be visible in the diff when someone does upgrade this dependency.

Contributor:

Cool, thanks for clarifying!


// TODO: Remove custom opts.Mapper when sigs.k8s.io/controller-runtime >= 0.17.0 as `NewDynamicRESTMapper` is the default in that version
var err error
opts.HTTPClient, err = rest.HTTPClientFor(restConfig)
Contributor:

I think this assignment is a bit premature since, if there is an error, the opts attribute still keeps the erroneous client.

Author (@onematchfox):

The code returns on error just below so opts will be discarded in that scenario regardless.

Contributor:

Ah, good catch. LGTM then

@onematchfox force-pushed the fix/prometheus-crd-manager branch from de324cb to 349eb39 on March 4, 2024, 13:40.
@onematchfox (Author) commented:

Rebased to address conflicts introduced by #6552.


github-actions bot commented Apr 6, 2024

This PR has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If you do not have enough time to follow up on this PR or you think it's no longer relevant, consider closing it.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your PR will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!

github-actions bot added the needs-attention label on Apr 6, 2024.
rfratto added the variant/flow (Related to Grafana Agent Flow) and enhancement (New feature or request) labels on Apr 9, 2024.
github-actions bot removed the needs-attention label on Apr 11, 2024.
github-actions bot repeated the inactivity notice and added the needs-attention label on May 13, 2024.
@onematchfox (Author) commented:

Waiting on maintainers

github-actions bot removed the needs-attention label on May 14, 2024.
github-actions bot repeated the inactivity notice and added the needs-attention label on Jun 15, 2024.
Labels
enhancement (New feature or request), needs-attention (An issue or PR has been sitting around and needs attention), variant/flow (Related to Grafana Agent Flow)
4 participants