🐛 priority queue: Fix panic within spin #3058

sbueringer · 2025-01-03T17:04:19Z

Signed-off-by: Stefan Büringer [email protected]

I did some scale testing with the priority queue and Cluster API and hit a panic within spin:

panic: runtime error: index out of range [1] with length 1
goroutine 249 [running]:
github.com/google/btree.(*node[...]).iterate(0x2ce6820, 0x1, {0x0, 0x0}, {0x0, 0x0}, 0x0, 0x1, 0x4000baabe0)
        /Users/buringerst/code/pkg/mod/github.com/google/[email protected]/btree_generic.go:522 +0x5b4
github.com/google/btree.(*BTreeG[...]).Ascend(0x2ce6560, 0x4000baabe0)
        /Users/buringerst/code/pkg/mod/github.com/google/[email protected]/btree_generic.go:779 +0xb4
sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue.(*priorityqueue[...]).spin.func1(0x40007d8200, 0x2ce11a0)
        /Users/buringerst/code/pkg/mod/sigs.k8s.io/[email protected]/pkg/controller/priorityqueue/priorityqueue.go:198 +0x218
sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue.(*priorityqueue[...]).spin(0x2ce11a0)
        /Users/buringerst/code/pkg/mod/sigs.k8s.io/[email protected]/pkg/controller/priorityqueue/priorityqueue.go:227 +0x198
created by sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue.New[...] in goroutine 247
        /Users/buringerst/code/pkg/mod/sigs.k8s.io/[email protected]/pkg/controller/priorityqueue/priorityqueue.go:76 +0x66c

The problem is that if we delete an item from the queue within Ascend, Ascend can panic with: panic: runtime error: index out of range [1] with length 1

I haven't yet been able to reproduce this with a unit test (it seems to occur only under specific circumstances).
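
For illustration, here is a minimal sketch of the "stop iterating after a delete, then start over" pattern that the `continueLoop = true` / `return false` lines in this PR's diff point to. It does not use the real btree API: `sortedItems`, `popReady`, and `limit` are hypothetical names, and the slice-backed container only mimics the shape of Ascend and Delete.

```go
package main

import "fmt"

// sortedItems is a hypothetical stand-in for an ordered container.
// It is not the google/btree type; it only mimics Ascend and Delete.
type sortedItems struct {
	items []int // assumed to be sorted ascending
}

// Ascend visits items in ascending order until fn returns false.
func (s *sortedItems) Ascend(fn func(v int) bool) {
	for i := 0; i < len(s.items); i++ {
		if !fn(s.items[i]) {
			return
		}
	}
}

// Delete removes the first occurrence of v, if present.
func (s *sortedItems) Delete(v int) {
	for i, item := range s.items {
		if item == v {
			s.items = append(s.items[:i], s.items[i+1:]...)
			return
		}
	}
}

// popReady removes and returns all items <= limit.
// It deletes at most one item per Ascend call and then restarts the
// iteration, so Ascend never continues over a modified container.
func popReady(s *sortedItems, limit int) []int {
	var ready []int
	for {
		continueLoop := false
		s.Ascend(func(v int) bool {
			if v > limit {
				return false // sorted input: nothing further is ready
			}
			ready = append(ready, v)
			s.Delete(v)
			// The container was modified, so stop this Ascend
			// and let the outer loop start a fresh one.
			continueLoop = true
			return false
		})
		if !continueLoop {
			return ready
		}
	}
}

func main() {
	s := &sortedItems{items: []int{1, 2, 3, 4, 5}}
	fmt.Println(popReady(s, 3), s.items)
}
```

The trade-off in this sketch is that every deletion restarts the iteration from the smallest item, which costs extra traversal work when many items are ready at once.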

k8s-ci-robot · 2025-01-03T17:04:25Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [sbueringer]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sbueringer · 2025-01-03T19:32:01Z

/assign @vincepri @alvaroaleman

sbueringer · 2025-01-03T19:34:25Z

pkg/controller/priorityqueue/priorityqueue_test.go

@@ -283,6 +284,41 @@ var _ = Describe("Controllerworkqueue", func() {
 		Expect(metrics.depth["test"]).To(Equal(0))
 		Expect(metrics.adds["test"]).To(Equal(2))
 	})
+
+	It("returns many items", func() {


Not sure how to verify that the spin goroutine didn't panic.

Locally with IntelliJ, I hit a panic breakpoint, and after continuing, the test was shown as successful (even with the panic). I'm not sure if the same happens in CI.

Maybe the breakpoint caused go test to not recognize the panic properly?

k8s-ci-robot · 2025-01-03T19:37:32Z

@sbueringer: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-controller-runtime-test | `8aa1eb4` | link | true | `/test pull-controller-runtime-test` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

alvaroaleman · 2025-01-07T00:58:10Z

pkg/controller/priorityqueue/priorityqueue.go

+
+					// Return false because continuing with Ascend after deleting an item
+					// can lead to panics within Ascend.
+					continueLoop = true
 					return false


Not a super strong opinion, but how do you feel about instead appending to a toDelete slice in Ascend and then calling Delete after being done with Ascend? It should be safe because we are holding the lock, so it shouldn't be possible for a concurrent routine to see an item that is supposed to be deleted.

The reason is that I don't think manipulating the tree while iterating is an expected usage. Even if it seems to work now, there could be more bugs, or it could stop working in a future version of the lib.

I ended up using that approach + your testcase in #3060, as this issue made CI fail there. Hope that is okay.
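
A minimal sketch of the toDelete idea described above: collect the items during Ascend and delete them only once the iteration is done. It is not the code from #3060 and it does not use the real btree API. `orderedSet`, `drainReady`, and `limit` are hypothetical names, and the sketch is single-goroutine, whereas the real queue would do this while holding its lock.

```go
package main

import (
	"fmt"
	"sort"
)

// orderedSet is a hypothetical stand-in for an ordered container such as
// a B-tree. It is not the google/btree API.
type orderedSet struct {
	items []int // kept sorted ascending, no duplicates
}

// Insert adds v at its sorted position; duplicates are ignored.
func (s *orderedSet) Insert(v int) {
	i := sort.SearchInts(s.items, v)
	if i < len(s.items) && s.items[i] == v {
		return
	}
	s.items = append(s.items, 0)
	copy(s.items[i+1:], s.items[i:])
	s.items[i] = v
}

// Delete removes v if it is present.
func (s *orderedSet) Delete(v int) {
	i := sort.SearchInts(s.items, v)
	if i < len(s.items) && s.items[i] == v {
		s.items = append(s.items[:i], s.items[i+1:]...)
	}
}

// Ascend calls fn for every item in ascending order until fn returns false.
func (s *orderedSet) Ascend(fn func(v int) bool) {
	for _, v := range s.items {
		if !fn(v) {
			return
		}
	}
}

// drainReady returns every item <= limit and removes it from the set.
// The set is only read during Ascend; all deletions happen afterwards.
func drainReady(s *orderedSet, limit int) []int {
	var toDelete []int
	s.Ascend(func(v int) bool {
		if v > limit {
			return false // items are sorted, nothing further is ready
		}
		toDelete = append(toDelete, v)
		return true
	})
	for _, v := range toDelete {
		s.Delete(v)
	}
	return toDelete
}

func main() {
	s := &orderedSet{}
	for _, v := range []int{5, 1, 4, 2, 3} {
		s.Insert(v)
	}
	fmt.Println(drainReady(s, 3), s.items)
}
```

Compared with restarting Ascend after every deletion, this variant walks the container once per pass and never mutates it mid-iteration, at the cost of a temporary slice.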

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 3, 2025

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 3, 2025

k8s-ci-robot requested review from FillZpp and varshaprasad96 January 3, 2025 17:04

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 3, 2025

sbueringer force-pushed the pr-fix-pq-panic branch from 75fd8e7 to e7adeec Compare January 3, 2025 17:37

priority queue: Fix panic within spin

2c34bd6

Signed-off-by: Stefan Büringer [email protected]

sbueringer force-pushed the pr-fix-pq-panic branch from e7adeec to 2c34bd6 Compare January 3, 2025 17:59

sbueringer changed the title ~~[WIP] 🐛 priority queue: Fix panic within spin~~ 🐛 priority queue: Fix panic within spin Jan 3, 2025

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 3, 2025

k8s-ci-robot assigned alvaroaleman and vincepri Jan 3, 2025

priority queue: add unit test to reproduce panic

8aa1eb4

sbueringer force-pushed the pr-fix-pq-panic branch from 241602f to 8aa1eb4 Compare January 3, 2025 19:33

alvaroaleman reviewed Jan 7, 2025
