[YUNIKORN-2629] Adding a node can result in a deadlock #859

Closed
wants to merge 3 commits into from

Contributor

@pbacsko pbacsko commented Jun 21, 2024

What is this PR for?

Fix a deadlock by modifying the locking in cache.Context.

Lock/unlock calls were removed where they do not seem to be necessary. In most of these cases the state of the context is not modified at all; only the scheduler cache is affected, and it has its own lock.
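
To illustrate the idea, here is a minimal sketch and not the actual shim code. The names `nodeCache`, `schedulerContext`, `updateNode`, `setNamespace` and `runConcurrentUpdates` are hypothetical stand-ins. It assumes the cache guards its own data with its own lock, so a context method that only touches the cache can skip the context lock, while a method that changes context state still takes it.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeCache is a hypothetical stand-in for the scheduler cache.
// It protects its own map with its own lock.
type nodeCache struct {
	mu    sync.RWMutex
	nodes map[string]string
}

func (c *nodeCache) UpdateNode(name, state string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nodes[name] = state
}

func (c *nodeCache) NodeCount() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.nodes)
}

// schedulerContext is a hypothetical stand-in for the context.
// Its lock only guards the context's own fields (here: namespace).
type schedulerContext struct {
	mu        sync.RWMutex
	namespace string
	cache     *nodeCache // set once at construction, never reassigned
}

func newSchedulerContext() *schedulerContext {
	return &schedulerContext{cache: &nodeCache{nodes: make(map[string]string)}}
}

// updateNode only touches the cache, so it does not take ctx.mu.
// The cache lock alone serializes concurrent writers.
func (ctx *schedulerContext) updateNode(name, state string) {
	ctx.cache.UpdateNode(name, state)
}

// setNamespace modifies context state, so it still takes ctx.mu.
func (ctx *schedulerContext) setNamespace(ns string) {
	ctx.mu.Lock()
	defer ctx.mu.Unlock()
	ctx.namespace = ns
}

// runConcurrentUpdates simulates n event handlers updating distinct nodes
// at the same time and returns the number of nodes stored in the cache.
func runConcurrentUpdates(ctx *schedulerContext, n int) int {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			ctx.updateNode(fmt.Sprintf("node-%d", i), "ready")
		}(i)
	}
	wg.Wait()
	return ctx.cache.NodeCount()
}

func main() {
	ctx := newSchedulerContext()
	ctx.setNamespace("default")
	fmt.Println("nodes in cache:", runConcurrentUpdates(ctx, 50))
}
```

Because `updateNode` never holds the context lock while waiting on the cache lock, this path cannot take part in a lock-ordering cycle between the two locks. Running the sketch with `go run -race` is one way to check that the cache lock alone is sufficient for this pattern.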

Testing done:

  • make test multiple times
  • BenchmarkSchedulingThroughPut with deadlock detector
  • BenchmarkSchedulingThroughPut with race detector

What type of PR is it?

  • - Bug Fix
  • - Improvement
  • - Feature
  • - Documentation
  • - Hot Fix
  • - Refactoring

Todos

  • - Task

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-2629

How should this be tested?

Screenshots (if appropriate)

Questions:

  • - The licenses files need update.
  • - There is breaking changes for older versions.
  • - It needs documentation.

Member

@chia7712 chia7712 left a comment

@pbacsko thanks for this patch

@chenyulin0719
Contributor

chenyulin0719 commented Jun 24, 2024

Based on my understanding, one result of this PR is that the informer event handlers no longer grab the context lock when updating nodes or pods (NodeInformer/PodInformer goroutines).

I'm wondering whether there is any race when informer events (pods/nodes) arrive while the goroutines below are running:

  1. (main goroutine) InitializeState() -> I'm not quite confident.
  2. (shim scheduler goroutine) schedule() -> Should be safe.
  3. (dispatcher goroutine) -> Should be safe.

For InitializeState(), I'm not confident I can answer the questions below:
Q: What happens if the NodeInformer updates NodeA while InitializeState() is registering NodeA?
Q: What happens if the PodInformer updates PodA while InitializeState() is registering PodA?

Is it still safe?

@wilfred-s
Contributor

  1. (main goroutine) InitializeState() -> I'm not quite confident.

That should not be a problem.

For InitializeState(), I'm not confident I can answer the questions below: Q: What happens if the NodeInformer updates NodeA while InitializeState() is registering NodeA? Q: What happens if the PodInformer updates PodA while InitializeState() is registering PodA?

Is it still safe?

The informers are first synced with event handling turned off. That gives us the base list of pods and nodes. All of these listed objects are then processed. The event handlers stay off until all of that work is done. Step 5 of the init turns them on. At that point we no longer have to worry about the init code: the cache lock prevents simultaneous changes.
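
The ordering described above can be sketched roughly as follows. This is a simplified model, not the shim's real initialization code: `objectCache`, `event` and `runInit` are hypothetical names, and the two handler goroutines only stand in for the pod and node informers.

```go
package main

import (
	"fmt"
	"sync"
)

// event is a hypothetical update for an object identified by key.
type event struct {
	key     string
	version int
}

// objectCache is a hypothetical cache guarded by its own lock.
type objectCache struct {
	mu      sync.Mutex
	objects map[string]int
}

func (c *objectCache) apply(e event) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.objects[e.key] = e.version
}

func (c *objectCache) snapshot() map[string]int {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[string]int, len(c.objects))
	for k, v := range c.objects {
		out[k] = v
	}
	return out
}

// runInit models the init sequence: the listed objects are processed while
// no event handler is running, and only then are the handlers started.
func runInit(listed, podEvents, nodeEvents []event) map[string]int {
	cache := &objectCache{objects: make(map[string]int)}

	// Phase 1: process the base list. No handler goroutine exists yet,
	// so nothing can race with the init code.
	for _, e := range listed {
		cache.apply(e)
	}

	// Phase 2: turn the handlers on. Each handler delivers its own events
	// in order; the two handlers run concurrently and the cache lock
	// serializes their writes.
	var wg sync.WaitGroup
	for _, stream := range [][]event{podEvents, nodeEvents} {
		wg.Add(1)
		go func(stream []event) {
			defer wg.Done()
			for _, e := range stream {
				cache.apply(e)
			}
		}(stream)
	}
	wg.Wait()
	return cache.snapshot()
}

func main() {
	state := runInit(
		[]event{{"pod/a", 1}, {"node/a", 1}},
		[]event{{"pod/a", 2}, {"pod/b", 1}},
		[]event{{"node/a", 2}, {"node/a", 3}},
	)
	fmt.Println("pod/a:", state["pod/a"], "pod/b:", state["pod/b"], "node/a:", state["node/a"])
}
```

In this model an update for an object can only arrive after the init phase has finished registering it, which is the property the questions above were asking about.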

codecov bot commented Jul 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.07%. Comparing base (138d53a) to head (70c749f).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #859      +/-   ##
==========================================
- Coverage   68.10%   68.07%   -0.03%     
==========================================
  Files          70       70              
  Lines        7634     7600      -34     
==========================================
- Hits         5199     5174      -25     
+ Misses       2218     2210       -8     
+ Partials      217      216       -1     


Contributor

@craigcondit craigcondit left a comment

+1 LGTM.

Contributor

@wilfred-s wilfred-s left a comment

LGTM +1
The horrible fix needed in (ctx *Context) RemoveApplication(appID string) will disappear completely once we fix YUNIKORN-2782, as the function is not used.
