Deadlocks in C programs are a pain to debug. But if you know your way around GDB, you may get more out of a core dump than you think.
For the following example, we'll create a core dump of a program that intentionally deadlocks: examples/deadlock.
That example application spawns eight threads, of which five use locks; two of those lock each other because of a lock-order inversion.
The application has an alarm set, so after a few seconds it aborts with SIGABRT and dumps a core file. That's nice, because now we have something to debug.
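To get a feel for what that bug looks like in code, here is a minimal sketch of such a lock-order inversion, including the alarm-and-abort trick to force a core dump. This is a simplified illustration, not the actual deadlock.c; only the function names normal and inverted are borrowed from the backtraces we'll see below.

#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void *normal(void *arg) {
    pthread_mutex_lock(&lock_a);        /* A, then B */
    sleep(1);                           /* widen the race window */
    pthread_mutex_lock(&lock_b);        /* blocks forever */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return arg;
}

static void *inverted(void *arg) {
    pthread_mutex_lock(&lock_b);        /* B, then A: the inversion */
    sleep(1);
    pthread_mutex_lock(&lock_a);        /* blocks forever */
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return arg;
}

static void on_alarm(int sig) {
    (void)sig;
    abort();                            /* SIGABRT, so we get a core dump */
}

int main(void) {
    pthread_t t1, t2;
    signal(SIGALRM, on_alarm);
    alarm(5);                           /* give the deadlock time to happen */
    pthread_create(&t1, NULL, normal, NULL);
    pthread_create(&t2, NULL, inverted, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Compile it with cc -Wall -g -O0 -pthread, make sure core dumps are enabled (ulimit -c unlimited), and you get the same kind of core file to play with.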
Go into the examples/ dir and type make deadlock.core. It will self-test the deadlock application and then create the file deadlock.core using gdb.
$ cd examples/
$ make deadlock.core
cc -Wall -g -O0 -c -o deadlock.o deadlock.c
cc -Wall -g -O0 -pthread deadlock.o -o deadlock
./deadlock nolock
...
If all went well, you now have a deadlock.core file:
$ ls -l deadlock.core
-rw-rw-r-- 1 walter walter 17448904 sep 1 21:04 deadlock.core
If you open the core file with GDB, you should be able to manually find out which threads block each other, but it's tedious work.
For example:
$ gdb ./deadlock deadlock.core
(gdb) info threads
Id Target Id Frame
9 Thread 0x2aaaaaaef040 (LWP 27444) 0x00002aaaaaf24cc9 in __GI_raise
(sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
8 Thread 0x2aaaab4b3700 (LWP 27448) __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
7 Thread 0x2aaaab6b4700 (LWP 27449) __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
6 Thread 0x2aaaab8b5700 (LWP 27450) 0x00002aaaaafaef3d in nanosleep
() at ../sysdeps/unix/syscall-template.S:81
5 Thread 0x2aaaabab6700 (LWP 27451) __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
4 Thread 0x2aaaabcb7700 (LWP 27452) __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
3 Thread 0x2aaaabeb8700 (LWP 27453) 0x00002aaaaafaef3d in nanosleep
() at ../sysdeps/unix/syscall-template.S:81
2 Thread 0x2aaaac0b9700 (LWP 27454) 0x00002aaaaafaef3d in nanosleep
() at ../sysdeps/unix/syscall-template.S:81
* 1 Thread 0x2aaaac2ba700 (LWP 27455) __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
Okay, 9 threads. And now?
- Check only the threads that are waiting for a lock: those in __lll_lock_wait.
- For those threads, check which locks they're waiting for, and who is holding those locks.
Somewhat like this:
(gdb) thread 8
[Switching to thread 8 (Thread 0x2aaaab4b3700 (LWP 27448))]
#0 __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135 in ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
(gdb) frame 2
#2 0x00002aaaaacda480 in __GI___pthread_mutex_lock (
mutex=0x602120 <speciallock>)
at ../nptl/pthread_mutex_lock.c:79
79 ../nptl/pthread_mutex_lock.c: No such file or directory.
(gdb) print *(pthread_mutex_t*)mutex
$1 = {__data = {__lock = 2, __count = 0, __owner = 27452,
__nusers = 1, __kind = 0, __spins = 0, __elision = 0,
__list = {__prev = 0x0, __next = 0x0}},
__size = "\002...", '\000' <repeats 26 times>, __align = 2}
There, the __owner field shows that the mutex thread 8 is waiting for is held by light-weight process (LWP) 27452, which corresponds to thread 4.
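Instead of eyeballing the thread list for that LWP, you can let gdb look it up; it should answer with something like this:

(gdb) thread find 27452
Thread 4 has target id 'Thread 0x2aaaabcb7700 (LWP 27452)'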
Repeat this step for all threads that are waiting for a lock.
Tedious, huh?
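You can take some of the repetition out of this with a user-defined command plus thread apply. A sketch, assuming every waiting thread has __GI___pthread_mutex_lock in frame 2, like in the listing above:

(gdb) define mutex_owner
Type commands for definition of "mutex_owner".
End with a line saying just "end".
>frame 2
>print mutex->__data.__owner
>end
(gdb) thread apply 1 4 5 7 8 mutex_owner

That prints the owning LWP for each waiting thread in one go, but you still have to map those LWPs back to thread numbers yourself.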
In the same example directory, now type make debug-deadlock. It will call GDB from the pycoredump library instead.
$ make debug-deadlock
PYTHONPATH=`pwd`/../ python debug-deadlock.py \
./deadlock ./deadlock.core
-- all waiting threads --
<GdbThread(thno=8, thid=0x2aaaab4b3700, procid=27448, func=__lll_lock_wait)>
waits for <GdbThread(thno=4, thid=0x2aaaabcb7700, procid=27452, func=__lll_lock_wait)>
<GdbThread(thno=7, thid=0x2aaaab6b4700, procid=27449, func=__lll_lock_wait)>
waits for <GdbThread(thno=8, thid=0x2aaaab4b3700, procid=27448, func=__lll_lock_wait)>
<GdbThread(thno=5, thid=0x2aaaabab6700, procid=27451, func=__lll_lock_wait)>
waits for <GdbThread(thno=8, thid=0x2aaaab4b3700, procid=27448, func=__lll_lock_wait)>
<GdbThread(thno=4, thid=0x2aaaabcb7700, procid=27452, func=__lll_lock_wait)>
waits for <GdbThread(thno=8, thid=0x2aaaab4b3700, procid=27448, func=__lll_lock_wait)>
<GdbThread(thno=1, thid=0x2aaaac2ba700, procid=27455, func=__lll_lock_wait)>
waits for <GdbThread(thno=8, thid=0x2aaaab4b3700, procid=27448, func=__lll_lock_wait)>
Nice: only the threads that are waiting for a lock, instead of a listing of all the threads. And as a bonus, a direct reference to the thread each of them is waiting for.
And it gets better: they can be filtered down to the relevant threads only: thread 8 and thread 4. And because we get the backtrace info directly, we can see that one of those is indeed the culprit: the one with func=inverted in frame 3.
-- relevant threads --
<GdbThread(thno=8, thid=0x2aaaab4b3700, procid=27448, func=__lll_lock_wait)>
<GdbBacktrace(
<GdbFrame(no=0, func=__lll_lock_wait, file=../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135>
<GdbFrame(no=1, func=_L_lock_909, file=/lib/x86_64-linux-gnu/libpthread.so.0>
<GdbFrame(no=2, func=__GI___pthread_mutex_lock, file=../nptl/pthread_mutex_lock.c:79>
<GdbFrame(no=3, func=normal, file=deadlock.c:24>
<GdbFrame(no=4, func=start_thread, file=pthread_create.c:312>
<GdbFrame(no=5, func=clone, file=../sysdeps/unix/sysv/linux/x86_64/clone.S:111>
)>
waits for <GdbThread(thno=4, thid=0x2aaaabcb7700, procid=27452, func=__lll_lock_wait)>
<GdbThread(thno=4, thid=0x2aaaabcb7700, procid=27452, func=__lll_lock_wait)>
<GdbBacktrace(
<GdbFrame(no=0, func=__lll_lock_wait, file=../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135>
<GdbFrame(no=1, func=_L_lock_909, file=/lib/x86_64-linux-gnu/libpthread.so.0>
<GdbFrame(no=2, func=__GI___pthread_mutex_lock, file=../nptl/pthread_mutex_lock.c:79>
<GdbFrame(no=3, func=inverted, file=deadlock.c:36>
<GdbFrame(no=4, func=start_thread, file=pthread_create.c:312>
<GdbFrame(no=5, func=clone, file=../sysdeps/unix/sysv/linux/x86_64/clone.S:111>
)>
waits for <GdbThread(thno=8, thid=0x2aaaab4b3700, procid=27448, func=__lll_lock_wait)>
Better, yes?
Here it clearly shows that they're waiting on each other: a classic example of a deadlock.
That's generated by just the 30-odd lines of debug-deadlock.py, using the pycoredump library.
Or, for another sample invocation, try this:
$ sudo env PYTHONPATH=$(pwd) python examples/debug-deadlock.py \
    /usr/sbin/asterisk \
    /var/spool/asterisk/core.ua4-ams-2018-11-29T16:09:42+0100
Obviously the pycoredump library could use a lot of extra features and cleanup, but this simple example demonstrates how it can make your deadlock-debugging life easier.
I've used the manual technique to find the cause of a deadlock in the Asterisk PBX software. That software supports a thread-debugging mode, but that's so slow that it's unusable in production systems. Luckily, much of the needed info is there in the regular core dump; you just have to find it.
And now, finding it may have become a little bit simpler.