Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry setting Out Of Service flags #235

Merged
merged 2 commits into from
Jul 18, 2024

Conversation

mshitrit
Copy link
Member

Why we need this PR

During release testing of SNR we've noticed one case were SNR template with Out Of Service taint strategy could not be created.
The reason for that was that because when SNR was initialized, it failed to fetch the k8s version therefore SNR couldn't verify support for Out Of Service taint.
We couldn't reproduce this, but we suspect this occurred due to a temporary network issue.
In order to minimize similar occurrences in the future a retry mechanism is added.

Adding links to our slack discussion on that subject. [1] [2]

Changes made

In case Out Of Service flags can't be set at the first attempt (due to failing getting the k8s version or any other reason) , several retries will be made before deciding Out Of Service flags isn't supported.

Which issue(s) this PR fixes

Test plan

Copy link
Contributor

openshift-ci bot commented Jul 14, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@mshitrit
Copy link
Member Author

/test 4.16-openshift-e2e

@beekhof
Copy link

beekhof commented Jul 15, 2024

How much longer is an OCP/k8s version without the taint still supported for?
Arguably we should restrict the retry to just this error case, but perhaps we can already kill the whole check?

Unrelated, the handling of err in the else if block of InitOutOfServiceTaintFlags seems likely to overwrite err if already set.

@mshitrit
Copy link
Member Author

How much longer is an OCP/k8s version without the taint still supported for?

We use the taint as default since ocp 4.14 (we currently support 4.12+)

Arguably we should restrict the retry to just this error case, but perhaps we can already kill the whole check?

The main purpose of the retry was to overcome a network issue which prevents fetching the k8s version, I think it would be tricky to assume that we can get the version needed to decide whether to apply the retry logic when there are network issues that would potentially trigger the retry.

Unrelated, the handling of err in the else if block of InitOutOfServiceTaintFlags seems likely to overwrite err if already set.

In case the err is already set it'll be returned in the previous if block. I think it's a fairly common practice to reuse err var (under the assumption that it will only be overridden when error did not occur in the previous statement)

@mshitrit mshitrit changed the title [WIP] Retry setting Out Of Service flags Retry setting Out Of Service flags Jul 15, 2024
main.go Outdated Show resolved Hide resolved
@clobrano
Copy link
Contributor

/lgtm
/hold

to get more reviews and close the other threads

pkg/peerhealth/peerhealth.pb.go Outdated Show resolved Hide resolved
main.go Outdated Show resolved Hide resolved
main.go Outdated Show resolved Hide resolved
…emporary network issues.

Signed-off-by: Michael Shitrit <[email protected]>
@slintes slintes self-requested a review July 17, 2024 14:15
@openshift-ci openshift-ci bot added the lgtm label Jul 17, 2024
Copy link
Contributor

openshift-ci bot commented Jul 17, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mshitrit, slintes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@slintes
Copy link
Member

slintes commented Jul 17, 2024

/hold

in case you want to address the nit

@openshift-ci openshift-ci bot removed the lgtm label Jul 18, 2024
@slintes
Copy link
Member

slintes commented Jul 18, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jul 18, 2024
@mshitrit mshitrit marked this pull request as ready for review July 18, 2024 08:53
@mshitrit
Copy link
Member Author

/unhold

if err := utils.InitOutOfServiceTaintFlags(mgr.GetConfig()); err != nil {
if err := utils.InitOutOfServiceTaintFlagsWithRetry(context.Background(), mgr.GetConfig()); err != nil {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement

@mshitrit
Copy link
Member Author

/retest

@openshift-merge-bot openshift-merge-bot bot merged commit 6c2442d into medik8s:main Jul 18, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants