Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock detection #152

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

Pranav-Agarwal0612
Copy link
Member

Deadlock Module
Created a dependency graph between Objects.
Tested with mutex cycle detection.

Lock Module
Changed scan_mutex_lock to return its findings for graph creating purpose

	Created a dependency graph between Objects.
	Tested with mutex cycle detection.

Lock Module
	Changed scan_mutex_lock to return its findings for graph creating purpose
@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Feb 7, 2025
@@ -0,0 +1,109 @@
import argparse
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new file needs an Oracle copyright header. See the .header.txt file, but make sure to use 2025 :)
(This should have been automatically caught by our pre-commit hooks. Please install them and use them for the future, see the developer workflow doc.)

Comment on lines 8 to 9

class DependencyGraph:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: there should be two blank lines separating the "import" statements from the rest of the top-level code.

This should be automatically detected, and fixed, by our pre-commit hooks. I won't add any more comments on coding style, since it's not really interesting and the pre-commit hooks should fix everything.

Comment on lines 12 to 13
def __init__(self, object: Object) :
self.name: str = object.type_.type_name() # TypeName of Object inside Node
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to allow creating a Node object that doesn't represent a drgn Object. For instance, a CPU or IRQ number are things we may want to use in deadlock detection, and those don't necessarily have an object associated with them. In that case, I think you'd want to allow something like:

node = Node("cpu", 5)
# but also:
node = Node(mutex_object)

self.name: str = object.type_.type_name() # TypeName of Object inside Node
self.address: int = object.value_() # Memory address of Object in host RAM
self.object: Object = object # The object itself
self.depends_on: list[__class__] = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Type annotations for collections (List, Tuple, Dict, etc) should use the classes from typing (e.g. from typing import List). This is because we need to maintain compatibility with Python 3.6. The default objects (list, tuple, dict) cannot be used in annotations until Python 3.9.
  • Type annotations for the class itself should be a string with the class name. This is because the class name ("Node") does not exist in the namespace yet. The postponed evaluation of annotations feature would allow us to use the class name, and the typing.Self class would also be helpful here, but unfortunately neither are available for Python 3.6. Thus, you need to just use a string :(
Suggested change
self.depends_on: list[__class__] = []
self.depends_on: List["Node"] = []

class Node:
def __init__(self, object: Object) :
self.name: str = object.type_.type_name() # TypeName of Object inside Node
self.address: int = object.value_() # Memory address of Object in host RAM
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be slightly more intelligent. Drgn objects, like C objects, can either be a pointer, or a value. So you can have a drgn struct mutex *, or a drgn struct mutex. These are different. The object.value_() method will return an int for the pointer, but for the actual struct, it would return a dict of the contents. So it's possible to have two different behaviors:

  1. A user passes a struct mutex * (the expected behavior). They get a node with name = "struct mutex *" and the correct value.
  2. A user passes a struct mutex. They get a node with name = "struct mutex *" (no star), and the value is a dictionary, which is not hashable.

Many common mutexes would be accessed in a way that the object you get is actually a struct mutex, not a struct mutex *. But other access mechanisms may result in a pointer. So this needs to handle both. I'd recommend something like this:

Suggested change
self.address: int = object.value_() # Memory address of Object in host RAM
type_ = object.type_
if type_.kind == TypeKind.POINTER:
object = object[0]
type_ = object.type_
if type_.kind != TypeKind.STRUCT or object.address_ is None:
raise ValueError("A reference object of type struct is expected")
self.name = type_.type_name()
self.address = object.address_
self.address: int = object.value_() # Memory address of Object in host RAM

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Can there be objects of type struct <something> **, or deeper?

self.NodeSet: dict[tuple, self.Node] = dict()


def insert(self, dependency: tuple) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Using a tuple for an argument here is a bit unintuitive. The first argument is an object, and the second argument is a list of things it depends on. I think it would make more sense to make this function take two arguments, rather than a single tuple.
  2. That said, I don't think this is the right API. We shouldn't expect that the caller has every single dependency ahead of time. Rather, I think it would be nice to have a simple API: add_edge(src, dst). This would create a node for both src and dst if necessary, and then it would add dst to the depends_on list of src. This makes it easier for the caller: they don't need to form a list of all the dependencies ahead of time. They can just record each dependency as they discover them.
  3. Another note on the type annotations... I won't highlight all of them, you get the idea. Again, this should use Tuple rather than tuple, and it should include annotations of the elements of the tuple as well.
  4. Finally, can you have a docstring that describes the arguments? In particular, document which node depends on which other node. The order of the arguments is important :)

Comment on lines 78 to 79
for neighbor in node.depends_on:
dfs(neighbor)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think recursive DFS will be a good idea. The Python stack has a limited number of frames, and while I would hope that there isn't a cycle composed of 1000 elements, I think it would be prudent to avoid a known error case like that.



def __init__(self):
self.NodeSet: dict[tuple, self.Node] = dict()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For class variables, please try to use lower_case_with_underscores. EG node_set.

Though in this case, it's not really accurate, because it's a dictionary. It's the actual node mapping. I would just call it nodes or something.

(Also again, note on using the typing versions of classes for annotations here)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

if not blocking_object:
return

key = (blocking_object.type_.type_name(), blocking_object.value_())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some comments above on getting the correct type name and value. Since you're repeating this logic, it would be good to have it in a helper function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed




def detect_cycle(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a type annotation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed




def detect_cycle(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a type annotation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I take the detect_cycle function out of the class?

Comment on lines 108 to 109
for cycle in cycles:
print(cycle)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cycle, as far as I can tell, is just a list of nodes. This will not be nicely formatted or very intuitive for the user. We need to think more carefully about how we format this for the user, so that it is understandable and useful. We can make this a TODO if the problem is a bit larger than we can fit into a single sprint but I think it's important.

if node not in visited: # Only visit unvisited nodes
dfs(node)

return cycles
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem we want to solve is larger than simply running DFS on each node. Suppose we have the following graph:

A -> D
B -> D
C -> D
D -> C
E -> F

There are two things we care about.

  1. What is the minimal cycle present? In this case, it is C <-> D.
  2. What is the maximal set of of elements whose dependencies include the cycle? In this case, it is A, B, C, D.

I think we want to return both pieces of information, and we don't want any duplication. That way, we would be able to report to the user the root cause of the deadlock (that's the first thing), and we'd also be able to go through the second item and report all the tasks or CPUs or whatever are impacted by the cycle, without needing to spell out the entire path for each of them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can easily be solved by changing A depends_on B with B blocks A. So when we find the cycle, we can also print out the objects/nodes blocked on the nodes present in the cycle. There might be future benefits of keeping both depends_on and blocks, but I am not sure right now.

) -> None:
) -> List[Tuple[Object, List[Object]]]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of tracking and returning the dependencies, maybe just include an optional argument which is the graph. If the argument is present, then the function would add edges to the graph as it encounters them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of a graph argument, we should have a separate get_mutex_lock_info function from the scan_mutex_lock function. Because even on passing a graph argument, the lock info will get printed a second time on both lock and deadlock modules. I am thinking of a way to prevent duplicate code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OCA Verified All contributors have signed the Oracle Contributor Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants