-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deadlock detection #152
base: main
Are you sure you want to change the base?
Deadlock detection #152
Conversation
Created a dependency graph between Objects. Tested with mutex cycle detection. Lock Module Changed scan_mutex_lock to return its findings for graph creating purpose
@@ -0,0 +1,109 @@ | |||
import argparse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new file needs an Oracle copyright header. See the .header.txt
file, but make sure to use 2025 :)
(This should have been automatically caught by our pre-commit hooks. Please install them and use them for the future, see the developer workflow doc.)
drgn_tools/deadlock.py
Outdated
|
||
class DependencyGraph: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: there should be two blank lines separating the "import" statements from the rest of the top-level code.
This should be automatically detected, and fixed, by our pre-commit hooks. I won't add any more comments on coding style, since it's not really interesting and the pre-commit hooks should fix everything.
drgn_tools/deadlock.py
Outdated
def __init__(self, object: Object) : | ||
self.name: str = object.type_.type_name() # TypeName of Object inside Node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to allow creating a Node object that doesn't represent a drgn Object. For instance, a CPU or IRQ number are things we may want to use in deadlock detection, and those don't necessarily have an object associated with them. In that case, I think you'd want to allow something like:
node = Node("cpu", 5)
# but also:
node = Node(mutex_object)
drgn_tools/deadlock.py
Outdated
self.name: str = object.type_.type_name() # TypeName of Object inside Node | ||
self.address: int = object.value_() # Memory address of Object in host RAM | ||
self.object: Object = object # The object itself | ||
self.depends_on: list[__class__] = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Type annotations for collections (List, Tuple, Dict, etc) should use the classes from
typing
(e.g.from typing import List
). This is because we need to maintain compatibility with Python 3.6. The default objects (list
,tuple
,dict
) cannot be used in annotations until Python 3.9. - Type annotations for the class itself should be a string with the class name. This is because the class name ("Node") does not exist in the namespace yet. The postponed evaluation of annotations feature would allow us to use the class name, and the typing.Self class would also be helpful here, but unfortunately neither are available for Python 3.6. Thus, you need to just use a string :(
self.depends_on: list[__class__] = [] | |
self.depends_on: List["Node"] = [] |
drgn_tools/deadlock.py
Outdated
class Node: | ||
def __init__(self, object: Object) : | ||
self.name: str = object.type_.type_name() # TypeName of Object inside Node | ||
self.address: int = object.value_() # Memory address of Object in host RAM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be slightly more intelligent. Drgn objects, like C objects, can either be a pointer, or a value. So you can have a drgn struct mutex *
, or a drgn struct mutex
. These are different. The object.value_()
method will return an int for the pointer, but for the actual struct, it would return a dict of the contents. So it's possible to have two different behaviors:
- A user passes a
struct mutex *
(the expected behavior). They get a node withname = "struct mutex *"
and the correct value. - A user passes a
struct mutex
. They get a node withname = "struct mutex *"
(no star), and the value is a dictionary, which is not hashable.
Many common mutexes would be accessed in a way that the object you get is actually a struct mutex
, not a struct mutex *
. But other access mechanisms may result in a pointer. So this needs to handle both. I'd recommend something like this:
self.address: int = object.value_() # Memory address of Object in host RAM | |
type_ = object.type_ | |
if type_.kind == TypeKind.POINTER: | |
object = object[0] | |
type_ = object.type_ | |
if type_.kind != TypeKind.STRUCT or object.address_ is None: | |
raise ValueError("A reference object of type struct is expected") | |
self.name = type_.type_name() | |
self.address = object.address_ | |
self.address: int = object.value_() # Memory address of Object in host RAM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Can there be objects of type struct <something> **
, or deeper?
drgn_tools/deadlock.py
Outdated
self.NodeSet: dict[tuple, self.Node] = dict() | ||
|
||
|
||
def insert(self, dependency: tuple) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Using a tuple for an argument here is a bit unintuitive. The first argument is an object, and the second argument is a list of things it depends on. I think it would make more sense to make this function take two arguments, rather than a single tuple.
- That said, I don't think this is the right API. We shouldn't expect that the caller has every single dependency ahead of time. Rather, I think it would be nice to have a simple API:
add_edge(src, dst)
. This would create a node for bothsrc
anddst
if necessary, and then it would adddst
to thedepends_on
list ofsrc
. This makes it easier for the caller: they don't need to form a list of all the dependencies ahead of time. They can just record each dependency as they discover them. - Another note on the type annotations... I won't highlight all of them, you get the idea. Again, this should use
Tuple
rather thantuple
, and it should include annotations of the elements of the tuple as well. - Finally, can you have a docstring that describes the arguments? In particular, document which node depends on which other node. The order of the arguments is important :)
drgn_tools/deadlock.py
Outdated
for neighbor in node.depends_on: | ||
dfs(neighbor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think recursive DFS will be a good idea. The Python stack has a limited number of frames, and while I would hope that there isn't a cycle composed of 1000 elements, I think it would be prudent to avoid a known error case like that.
drgn_tools/deadlock.py
Outdated
|
||
|
||
def __init__(self): | ||
self.NodeSet: dict[tuple, self.Node] = dict() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For class variables, please try to use lower_case_with_underscores
. EG node_set
.
Though in this case, it's not really accurate, because it's a dictionary. It's the actual node mapping. I would just call it nodes
or something.
(Also again, note on using the typing
versions of classes for annotations here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
drgn_tools/deadlock.py
Outdated
if not blocking_object: | ||
return | ||
|
||
key = (blocking_object.type_.type_name(), blocking_object.value_()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had some comments above on getting the correct type name and value. Since you're repeating this logic, it would be good to have it in a helper function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
drgn_tools/deadlock.py
Outdated
|
||
|
||
|
||
def detect_cycle(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a type annotation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
drgn_tools/deadlock.py
Outdated
|
||
|
||
|
||
def detect_cycle(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a type annotation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I take the detect_cycle function out of the class?
drgn_tools/deadlock.py
Outdated
for cycle in cycles: | ||
print(cycle) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cycle, as far as I can tell, is just a list of nodes. This will not be nicely formatted or very intuitive for the user. We need to think more carefully about how we format this for the user, so that it is understandable and useful. We can make this a TODO if the problem is a bit larger than we can fit into a single sprint but I think it's important.
if node not in visited: # Only visit unvisited nodes | ||
dfs(node) | ||
|
||
return cycles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem we want to solve is larger than simply running DFS on each node. Suppose we have the following graph:
A -> D
B -> D
C -> D
D -> C
E -> F
There are two things we care about.
- What is the minimal cycle present? In this case, it is
C <-> D
. - What is the maximal set of of elements whose dependencies include the cycle? In this case, it is A, B, C, D.
I think we want to return both pieces of information, and we don't want any duplication. That way, we would be able to report to the user the root cause of the deadlock (that's the first thing), and we'd also be able to go through the second item and report all the tasks or CPUs or whatever are impacted by the cycle, without needing to spell out the entire path for each of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can easily be solved by changing A depends_on
B with B blocks
A. So when we find the cycle, we can also print out the objects/nodes blocked on the nodes present in the cycle. There might be future benefits of keeping both depends_on
and blocks
, but I am not sure right now.
drgn_tools/lock.py
Outdated
) -> None: | ||
) -> List[Tuple[Object, List[Object]]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of tracking and returning the dependencies, maybe just include an optional argument which is the graph. If the argument is present, then the function would add edges to the graph as it encounters them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead of a graph argument, we should have a separate get_mutex_lock_info function from the scan_mutex_lock function. Because even on passing a graph argument, the lock info will get printed a second time on both lock and deadlock modules. I am thinking of a way to prevent duplicate code.
Deadlock Module
Created a dependency graph between Objects.
Tested with mutex cycle detection.
Lock Module
Changed scan_mutex_lock to return its findings for graph creating purpose