Applier manager improvements #5062

Merged · 7 commits into k0sproject:main · Oct 29, 2024

Conversation

@twz123 (Member) commented on Oct 2, 2024

Description

  • The stacks don't need to be stored in the manager struct
    The map is only ever used in the loop to create and remove stacks, so it doesn't need to be stored in the struct. This ensures that there can't be any racy concurrent accesses to it. (A minimal sketch of the resulting pattern follows this list.)

  • Don't check for closed watch channels
    The only reason these channels get closed is if the watcher itself gets closed. This happens only when the method returns, which in turn only happens when the context is done. In this case, the loop has already exited without a select on a potentially closed channel. So the branches that checked for closed channels were effectively unreachable during runtime.

  • Wait for goroutines to exit
    Rename cancelWatcher to stop and wait until the newly added stopped channel is closed. Also, add a stopped channel to each stack to do the same for each stack-specific goroutine.

  • Restart watch loop on errors
    Exit the loop on error and restart it after a one-minute delay to allow it to recover in a new run. Also replace the bespoke retry loop for stacks with the Kubernetes client's wait package.

  • Improve logging
    Cancel the contexts with a cause. Add this cause to the log statements when exiting loops. Rename bundlePath to bundleDir to reflect the fact that it is a directory, not a file.

  • Remove unused applier field
    Seems to be a remnant from the past.
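A minimal Go sketch of the resulting pattern, for illustration only: it is not the actual k0s code. The names Manager, Start, Stop and the stack-handling details are assumptions; runWatchers, stop, stopped, the cancellation cause, and the one-minute restart delay follow the bullets above.

```go
// Minimal sketch, not the actual k0s implementation.
package applier

import (
	"context"
	"errors"
	"log"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

type Manager struct {
	stop    context.CancelCauseFunc // cancels the watcher context with a cause
	stopped chan struct{}           // closed once the watcher goroutine has exited
}

func (m *Manager) Start(ctx context.Context) {
	ctx, cancel := context.WithCancelCause(ctx)
	m.stop, m.stopped = cancel, make(chan struct{})

	go func() {
		defer close(m.stopped)
		// Run the watch loop; whenever it returns (e.g. after an error),
		// restart it after a one-minute delay so it can recover in a new run.
		wait.UntilWithContext(ctx, m.runWatchers, 1*time.Minute)
	}()
}

func (m *Manager) Stop() {
	m.stop(errors.New("applier manager is stopping"))
	<-m.stopped // wait for the watcher goroutine to exit before returning
}

func (m *Manager) runWatchers(ctx context.Context) {
	// The stacks map is local to this function: it is only used inside the
	// loop to create and remove stacks, so no other goroutine can touch it.
	stacks := make(map[string]chan struct{})

	// ... set up a watch on the bundle directory here ...

	for {
		select {
		case <-ctx.Done():
			// Log the cancellation cause when exiting the loop.
			log.Printf("stopping watch loop (%d stacks): %v", len(stacks), context.Cause(ctx))
			return
			// ... further cases would react to watch events and
			// create or remove stacks in the map ...
		}
	}
}
```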

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

How Has This Been Tested?

  • Manual test
  • Auto test added

Checklist:

  • My code follows the style guidelines of this project
  • My commit messages are signed-off
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

@twz123 twz123 marked this pull request as ready for review October 2, 2024 09:43
@twz123 twz123 requested a review from a team as a code owner October 2, 2024 09:43
@emosbaugh (Contributor) commented

I've created a pull request into the fork with a test: twz123#134

@twz123 (Member, Author) commented on Oct 21, 2024

Thx @emosbaugh! I added the commit here, but it's not signed-off. Can you maybe sign it and do a force push? You should be able to do it directly on the branch in my fork, as "allow edits by maintainers" is checked.

@emosbaugh (Contributor) commented

> Thx @emosbaugh! I added the commit here, but it's not signed-off. Can you maybe sign it and do a force push? You should be able to do it directly on the branch in my fork, as "allow edits by maintainers" is checked.

Done. Sorry about that.

@emosbaugh (Contributor) commented

@twz123 is the intention to backport this fix, and to what version? Thanks!

@twz123 (Member, Author) commented on Oct 21, 2024

We can check if it's easy to backport. If yes, all good. If not, we can maybe re-target your patch to the release-1.31 branch and backport that to 1.30 - 1.28 instead. Or we do it the other way round: merge your patch into main, and I'll rebase this one on top of yours.

I'll check tomorrow in detail...

@twz123 twz123 added the backport/release-1.31 PR that needs to be backported/cherrypicked to the release-1.31 branch label Oct 22, 2024
@emosbaugh (Contributor) commented

> We can check if it's easy to backport. If yes, all good. If not, we can maybe re-target your patch to the release-1.31 branch and backport that to 1.30 - 1.28 instead. Or we do it the other way round: merge your patch into main, and I'll rebase this one on top of yours.
>
> I'll check tomorrow in detail...

Thanks!

@twz123 (Member, Author) commented on Oct 23, 2024

Alright, the code changes themselves can be backported with just a few small merge conflicts that are straightforward to resolve. The test case, on the other hand, doesn't work at all in 1.28-1.30, presumably due to some non-trivial improvements that have been made to the fake clients quite recently.

We can do a backport excluding the tests, or we could try to backport the fake client improvements as well, which might be quite a bit of work.

@emosbaugh (Contributor) commented

> Alright, the code changes themselves can be backported with just a few small merge conflicts that are straightforward to resolve. The test case, on the other hand, doesn't work at all in 1.28-1.30, presumably due to some non-trivial improvements that have been made to the fake clients quite recently.
>
> We can do a backport excluding the tests, or we could try to backport the fake client improvements as well, which might be quite a bit of work.

The code change is what is most important. Thanks.

@jnummelin (Member) left a comment

Left one minor Q on the timeout used.

```go
go func() {
	_ = m.runWatchers(watcherCtx)
	defer close(stopped)
	wait.UntilWithContext(ctx, m.runWatchers, 1*time.Minute)
```
@jnummelin (Member) commented

1 minute seems a bit abstract here; any reasoning why that time is used?

@jnummelin (Member) commented

oh, didn't realize auto-merge was set. oh well, it was minor anyways 😂

@twz123 (Member, Author) commented

It's a trade-off between a busy loop with log spam and a reasonable self-healing delay. I think anything between, say, 10 seconds and a couple of minutes would be fine here, so one minute was just the value I came up with when writing that code 🙈
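For anyone unfamiliar with the wait package: wait.UntilWithContext (from k8s.io/apimachinery/pkg/util/wait) calls the given function, sleeps for the period after each return, and repeats until the context is cancelled, which is what gives the watch loop its roughly once-per-minute restart after a failure. An illustrative standalone example, not code from this PR:

```go
package main

import (
	"context"
	"log"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Stop retrying after five minutes, just to keep the example finite.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	// Each invocation is one "run" of the loop; after it returns, the next
	// run starts one minute later, until ctx is done.
	wait.UntilWithContext(ctx, func(ctx context.Context) {
		log.Println("watch-loop run")
	}, 1*time.Minute)
}
```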

@twz123 twz123 merged commit c07d3c1 into k0sproject:main Oct 29, 2024
90 checks passed
@twz123 twz123 deleted the applier-manager-improvements branch October 29, 2024 21:27
@k0s-bot commented on Oct 29, 2024

Successfully created backport PR for release-1.31:

Labels
area/controlplane · backport/release-1.31 (PR that needs to be backported/cherrypicked to the release-1.31 branch) · chore