-
Notifications
You must be signed in to change notification settings - Fork 141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YUNIKORN-2629] Adding a node can result in a deadlock #859
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pbacsko thanks for this patch
Based on my understanding. I'm wondering if there have any racing when informer events(Pods/Nodes) comes during below go routine running:
For the InitializeState(), I'm not quite confident to answer below questions: Is it still safe? |
That should not be a problem.
The informers are first synced without event handling on. That gives us the base list of pods and nodes. All these listed objects are then processed. The event handler are turned off until we have done all that work. The step 5 in the init turns them on. At that point we no longer have to worry about the init code. The cache lock will prevent the simultaneous changes. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #859 +/- ##
==========================================
- Coverage 68.10% 68.07% -0.03%
==========================================
Files 70 70
Lines 7634 7600 -34
==========================================
- Hits 5199 5174 -25
+ Misses 2218 2210 -8
+ Partials 217 216 -1 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1
The horrible fix needed in (ctx *Context) RemoveApplication(appID string)
will disappear completely when we fix YUNIKORN-2782 as the function is not used
What is this PR for?
Fix deadlock problem by modifying locking in
cache.Context
.Lock/unlock calls were removed where it doesn't seem to be necessary. In most cases, the state of the context is not modified at all. Only the scheduler cache is affected which has its own lock.
Testing done:
make test
multiple timesBenchmarkSchedulingThroughPut
with deadlock detectorBenchmarkSchedulingThroughPut
with race detectorWhat type of PR is it?
Todos
What is the Jira issue?
https://issues.apache.org/jira/browse/YUNIKORN-2629
How should this be tested?
Screenshots (if appropriate)
Questions: