🐛 priority queue: Fix panic within spin #3058

Open
wants to merge 2 commits into
base: main

sbueringer
Member

Signed-off-by: Stefan Büringer [email protected]

I did some scale testing with the priority queue and Cluster API and hit a panic within spin:

panic: runtime error: index out of range [1] with length 1
goroutine 249 [running]:
github.com/google/btree.(*node[...]).iterate(0x2ce6820, 0x1, {0x0, 0x0}, {0x0, 0x0}, 0x0, 0x1, 0x4000baabe0)
        /Users/buringerst/code/pkg/mod/github.com/google/[email protected]/btree_generic.go:522 +0x5b4
github.com/google/btree.(*BTreeG[...]).Ascend(0x2ce6560, 0x4000baabe0)
        /Users/buringerst/code/pkg/mod/github.com/google/[email protected]/btree_generic.go:779 +0xb4
sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue.(*priorityqueue[...]).spin.func1(0x40007d8200, 0x2ce11a0)
        /Users/buringerst/code/pkg/mod/sigs.k8s.io/[email protected]/pkg/controller/priorityqueue/priorityqueue.go:198 +0x218
sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue.(*priorityqueue[...]).spin(0x2ce11a0)
        /Users/buringerst/code/pkg/mod/sigs.k8s.io/[email protected]/pkg/controller/priorityqueue/priorityqueue.go:227 +0x198
created by sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue.New[...] in goroutine 247
        /Users/buringerst/code/pkg/mod/sigs.k8s.io/[email protected]/pkg/controller/priorityqueue/priorityqueue.go:76 +0x66c

The problem is that deleting an item from the queue inside the Ascend callback can cause Ascend itself to panic with "runtime error: index out of range [1] with length 1".

I haven't yet been able to reproduce this with a unit test (it seems to occur only under specific circumstances).

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 3, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 3, 2025
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 3, 2025
@sbueringer sbueringer changed the title [WIP] 🐛 priority queue: Fix panic within spin 🐛 priority queue: Fix panic within spin Jan 3, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 3, 2025
@sbueringer
Member Author

/assign @vincepri @alvaroaleman

@@ -283,6 +284,41 @@ var _ = Describe("Controllerworkqueue", func() {
Expect(metrics.depth["test"]).To(Equal(0))
Expect(metrics.adds["test"]).To(Equal(2))
})

It("returns many items", func() {
Member Author

@sbueringer sbueringer Jan 3, 2025

Not sure how to verify that the spin goroutine didn't panic.

Locally with IntelliJ, I hit a panic breakpoint, and after continuing, the test was shown as successful (even with the panic). I'm not sure whether the same happens in CI.

Member

Maybe the breakpoint caused go test to not recognize the panic properly?

@k8s-ci-robot
Contributor

@sbueringer: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-controller-runtime-test | 8aa1eb4 | link | true | /test pull-controller-runtime-test |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


// Return false because continuing with Ascend after deleting an item
// can lead to panics within Ascend.
continueLoop = true
return false
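
For readers following the thread, here is a minimal sketch of the pattern the snippet above suggests: stop the current Ascend pass as soon as an item is deleted, and let an outer loop start a fresh pass. The `sortedItems` type and `popReady` helper are hypothetical stand-ins invented for illustration. They mimic the Ascend/Delete shape but are neither the google/btree API nor the actual priorityqueue code.

```go
package main

import "fmt"

// sortedItems is a hypothetical stand-in for the queue's B-tree: a sorted
// slice exposing Ascend and Delete. It is NOT the google/btree implementation.
type sortedItems struct{ vals []int }

// Ascend visits values in ascending order until fn returns false.
func (s *sortedItems) Ascend(fn func(int) bool) {
	for _, v := range s.vals {
		if !fn(v) {
			return
		}
	}
}

// Delete removes the first occurrence of v, if any.
func (s *sortedItems) Delete(v int) {
	for i, x := range s.vals {
		if x == v {
			s.vals = append(s.vals[:i], s.vals[i+1:]...)
			return
		}
	}
}

// popReady removes every ready item. After each delete it aborts the current
// Ascend pass and restarts, so no pass continues over a modified structure.
func popReady(s *sortedItems, ready func(int) bool) []int {
	var out []int
	continueLoop := true
	for continueLoop {
		continueLoop = false
		s.Ascend(func(v int) bool {
			if !ready(v) {
				return true // keep scanning
			}
			out = append(out, v)
			s.Delete(v)
			continueLoop = true
			return false // stop this pass; the outer loop starts a new one
		})
	}
	return out
}

func main() {
	s := &sortedItems{vals: []int{1, 2, 3, 4, 5}}
	fmt.Println(popReady(s, func(v int) bool { return v > 3 }), s.vals)
}
```

The trade-off in this sketch is that every delete restarts the scan from the beginning, so removing many items costs one extra pass per removed item.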
Member

Not a super strong opinion, but how do you feel about instead appending to a toDelete slice in Ascend and then calling Delete after Ascend is done? It should be safe because we are holding the lock, so a concurrent routine cannot see an item that is supposed to be deleted.

The reason is that I don't think manipulating the tree while iterating is an expected usage. Even if it seems to work now, there could be more bugs, or it could stop working in a future version of the lib.
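
A minimal sketch of the toDelete idea described above, assuming a simple sorted-slice stand-in for the tree. `orderedSet` and `drainReady` are hypothetical names invented for illustration and are not part of google/btree or controller-runtime. The real queue would hold its lock across both phases.

```go
package main

import (
	"fmt"
	"sort"
)

// orderedSet is a hypothetical stand-in for the queue's B-tree: a sorted,
// duplicate-free slice with Ascend and Delete. It is NOT the google/btree API.
type orderedSet struct{ items []int }

// Insert adds v, keeping items sorted and unique.
func (s *orderedSet) Insert(v int) {
	i := sort.SearchInts(s.items, v)
	if i < len(s.items) && s.items[i] == v {
		return
	}
	s.items = append(s.items, 0)
	copy(s.items[i+1:], s.items[i:])
	s.items[i] = v
}

// Delete removes v if present.
func (s *orderedSet) Delete(v int) {
	i := sort.SearchInts(s.items, v)
	if i < len(s.items) && s.items[i] == v {
		s.items = append(s.items[:i], s.items[i+1:]...)
	}
}

// Ascend visits items in ascending order until fn returns false.
func (s *orderedSet) Ascend(fn func(int) bool) {
	for _, v := range s.items {
		if !fn(v) {
			return
		}
	}
}

// drainReady collects ready items during a read-only Ascend pass and deletes
// them only after the iteration has finished.
func drainReady(s *orderedSet, ready func(int) bool) []int {
	var toDelete []int
	s.Ascend(func(v int) bool {
		if ready(v) {
			toDelete = append(toDelete, v)
		}
		return true // never mutate the set inside the callback
	})
	for _, v := range toDelete {
		s.Delete(v)
	}
	return toDelete
}

func main() {
	s := &orderedSet{}
	for _, v := range []int{5, 1, 4, 2, 3} {
		s.Insert(v)
	}
	removed := drainReady(s, func(v int) bool { return v%2 == 0 })
	fmt.Println(removed, s.items)
}
```

Compared with restarting the iteration after every delete, this needs a single pass plus one Delete call per collected item, at the cost of a temporary slice.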

Member

I ended up using that approach plus your test case in #3060, as this issue made CI fail there. Hope that is okay.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
4 participants