🐛 priority queue: Fix panic within spin #3058
base: main
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: sbueringer. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from 75fd8e7 to e7adeec
Signed-off-by: Stefan Büringer [email protected]
Force-pushed from e7adeec to 2c34bd6
/assign @vincepri @alvaroaleman
Force-pushed from 241602f to 8aa1eb4
@@ -283,6 +284,41 @@ var _ = Describe("Controllerworkqueue", func() {
		Expect(metrics.depth["test"]).To(Equal(0))
		Expect(metrics.adds["test"]).To(Equal(2))
	})

	It("returns many items", func() {
Not sure how to verify that the spin goroutine didn't panic.
Locally with IntelliJ, I hit a panic breakpoint, and after continuing, the test was still shown as successful (even with the panic). I'm not sure whether the same happens in CI.
Maybe the fact that there was a breakpoint caused go test to not recognize the panic properly?
@sbueringer: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
// Return false because continuing with Ascend after deleting an item
// can lead to panics within Ascend.
continueLoop = true
return false
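For context, this is roughly how the `return false` plus `continueLoop` flag could fit together: a minimal sketch of the stop-and-restart pattern, not the actual controller-runtime code, assuming the github.com/google/btree generic API and a made-up `shouldRemove` predicate.

```go
package main

import (
	"fmt"

	"github.com/google/btree"
)

func main() {
	tree := btree.NewG[int](32, func(a, b int) bool { return a < b })
	for i := 0; i < 10; i++ {
		tree.ReplaceOrInsert(i)
	}
	// shouldRemove is a hypothetical stand-in for "this item should be taken out of the queue".
	shouldRemove := func(i int) bool { return i%2 == 0 }

	for {
		continueLoop := false
		tree.Ascend(func(item int) bool {
			if shouldRemove(item) {
				tree.Delete(item)
				// Return false because continuing with Ascend after deleting
				// an item can lead to panics within Ascend. The outer loop
				// restarts the iteration on the now-consistent tree.
				continueLoop = true
				return false
			}
			return true
		})
		if !continueLoop {
			break
		}
	}
	fmt.Println("items left:", tree.Len()) // the odd numbers remain
}
```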
Not a super strong opinion, but how do you feel about instead appending to a `toDelete` slice in `Ascend` and then calling `Delete` after being done with `Ascend`? It should be safe because we are holding the lock, so a concurrent routine seeing the item when it is supposed to be deleted shouldn't be possible.
The reason is that even if it works this way now, I don't think manipulating the tree while iterating is an expected usage. Even if it seems to work now, there could be more bugs, or it could stop working in a future version of the lib.
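A minimal sketch of this collect-then-delete idea, assuming the github.com/google/btree generic API; the even-number check just stands in for whatever decides that an item should be removed.

```go
package main

import (
	"fmt"

	"github.com/google/btree"
)

func main() {
	tree := btree.NewG[int](32, func(a, b int) bool { return a < b })
	for i := 0; i < 10; i++ {
		tree.ReplaceOrInsert(i)
	}

	// Collect the items to remove while iterating; the tree is not touched here.
	var toDelete []int
	tree.Ascend(func(item int) bool {
		if item%2 == 0 { // hypothetical "should be removed" condition
			toDelete = append(toDelete, item)
		}
		return true
	})

	// Delete only after Ascend has returned, so the tree is never mutated
	// while it is being iterated. With the queue lock held, no other
	// goroutine can observe the items in between.
	for _, item := range toDelete {
		tree.Delete(item)
	}

	fmt.Println("items left:", tree.Len()) // the odd numbers remain
}
```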
I ended up using that approach plus your test case in #3060, as this issue made CI fail there. Hope that is okay.
Signed-off-by: Stefan Büringer [email protected]
I did some scale testing with the priority queue and Cluster API and hit a panic within spin.
The problem is that if we delete an item from the queue within Ascend, it can happen that Ascend panics with:
panic: runtime error: index out of range [1] with length 1
I wasn't yet able to reproduce this with a unit test (it seems to occur only under specific circumstances).
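As a simplified illustration of the pattern described above (not a reproducer, since the panic only occurs under specific circumstances), this is the shape of code that mutates the tree while Ascend is still iterating, assuming the github.com/google/btree generic API:

```go
package main

import "github.com/google/btree"

func main() {
	tree := btree.NewG[int](32, func(a, b int) bool { return a < b })
	for i := 0; i < 1000; i++ {
		tree.ReplaceOrInsert(i)
	}

	tree.Ascend(func(item int) bool {
		if item%3 == 0 {
			// Deleting mid-iteration can leave Ascend's internal node/index
			// bookkeeping stale, which is what can surface as an
			// "index out of range" panic inside Ascend.
			tree.Delete(item)
		}
		// Continuing the iteration after the delete is the risky part.
		return true
	})
}
```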