Concurrency is hard but can improve speed of programs. For that it must be well designed. As for all programs it must be with the less bugs as possible, therefore we use and need tools to guide and correct us.
There are many technics to do concurrent programs and at some point we have to synchronize our threads to make them communicate. These synchronizations can be effectively made with locks (eg: mutexes) or with lock-free technics allowing our threads to cooperate enabling maximum concurrency (some thread makes progress every step). This is also more robust, there is less dependencies with other threads (no locks).
Unfortunately lock-free programming has drawbacks and is more complicated than lock based counterpart. For instance to design lock-free algorithms you must be sure to not break invariants all the time as you can't exclude other threads when updating things. You must be aware of the order in which the data will be visible to other threads. And use atomics to avoid undefined behavior. I will discuss about these problems in the next sections.
If you want an introduction to lock-free programming I would recommend you to read this article by Jeff Preshing. And an in-depth book about concurrency: C++ Concurrency in action by Anthony Williams See references for more.
Table of Contents:
- Lock free debug tools
A data race is a situation in which at least two threads access a shared variable at the same time (unsynchronized), and at least one thread tries to modify the variable.
See an example.
A race condition is a situation in which the result of an operation depends on the interleaving of certain individual operations.
In red the data races, in green the race conditions (code):
Memory reordering can lead to a completely different program from the source code. This is due to compiler optimizations but also during runtime the processor may reorder stores and loads.
For example, can you guess possible outputs of this program ? (Suppose operations on x and y atomic):
int main() {
int x = 0;
int y = 0;
int r1;
int r2;
std::thread t1([&] {
y = 1; // #1
r1 = x; // #2
});
std::thread t2([&] {
x = 1; // #3
r2 = y; // #4
});
t1.join();
t2.join();
std::cout << "r1: " << r1 << ", r2: " << r2 << "\n";
return 0;
}
/* it could be:
r1: 0, r2: 1
r1: 1, r2: 0
r1: 1, r2: 1
right ? but this could also be:
r1: 0, r2: 0
it can arise if #1 #2 or #3 #4 are reordered (store/load reorder)
to prevent memory reordering use barrier or std::atomic with memory_order
(by default it is sequentially consistent)
*/
Here is the reference of C++ memory ordering it explain with example different type of memory ordering.
I recommend you to read memory ordering at compile time by Jeff Preshing. And also memory barriers are like source control operations.
Livelock, like deadlock, is when one thread is waiting for another, which is in turn waiting for the former. The difference is the wait is non blocking but active such as spin-lock.
Starvation is when some thread(s) access shared part frequently preventing other thread(s) to make progress
Lock-free program does not necessarily prevent starvation.
False sharing is when 2 variable accessed frequently be 2 different threads are on the same cache line, thus the threads must acquire exclusivity of that cache line preventing real parallelism.
ABA can appear in lock-free program when using CAS (compare and swap), example or example trying to provoc ABA
Multiple solutions exist:
- Double length CAS with modification counter
- Reference counter (shared_ptr with atomic, not necessarily lock-free but portable C++11, simplified in C++20
std::atomic<shared_ptr<>>
) - Hazard pointers
- lazy garbadge collection
- garbage collector
- Not freeing memory
note lock-free programs are not necessarily starvation free
These tools have been tested on:
- Machine 1(M1): ArchLinux
Linux 5.1.15-arch1-1-ARCH #1 SMP PREEMPT Tue Jun 25 04:49:39 UTC 2019 x86_64 GNU/Linux
(Intel i7-6700HQ (8 cores) 3.5GHz)- gcc version 9.1.0-2 (GCC)
- clang version 8.0.0-4 (tags/RELEASE_800/final)
Note: Outputs may differs depending on compilers and options:
- Compilers:
- gcc
- clang
- Standard library:
- libstdc++ (gnu)
- libc++ (llvm)
- Options:
-g
: debug info (file name, symbols, line:column number)
The tools will be tested with different options and differences will be reported.
The tools will be tested with small C++ programs (multiple times for dynamic tools), focused on lock-free.\
Each test with error will be run on a loop, for the first run it will not loop. Then if the tool does not detect the error, we will try to loop 10 times and try again with 100 times and 1000 times.
Each correct test will be run on a loop, we will do 4 run: no loop, loop 10 times, loop 100 times, loop 1000 times. For static tools we won't use loops of course
Tests list:
- Programs with bugs:
- Data races
- Simple data race
- Simple data race CppMem
- Data race on std::string
- Data race trying to notify through boolean flag
- Data race trying to notify through boolean flag CppMem
- Data race on complex type std::map
- Data race and race condition
- Data race on object destruction
- Data race on small std::string destruction
- Data race on std::string destruction
- ABA':
- Wrongly synchronized producer consumer:
- Data races
- Correct programs:
- Atomic:
- Producer consumer:
- ABA' fixed
- Memory ordering
1: This program is just here to tries to provoc the ABA' problem.
It won't be tested with tools because the tools instruments malloc and other things that prevent the tweak to work.
Compilation, it will make 3 build (gcc, clang and clang with libc++):
./build_all.sh
Valgrind: DRD
M1 Version: | valgrind-3.14.0 |
Type: | dynamic on-the-fly |
Plateform: | Linux/MacOS/Android 32/64bits |
DRD can detect data races, improper use of POSIX threads, false sharing,
deadlock and monitor lock contention. But can't detect wrong lock order.
Also DRD support detached threads.
It is faster than Helgring but can be less precise.
By default DRD does not check for local variable (stack).
- Compile normally (with
-g
for debug info) - Run:
valgrind --tool=drd ./program
note: you can colorize the output: pip install colour-valgrind
--check-stack-var=<yes|no> [default: no]
Controls whether DRD detects data races on stack variables.
--segment-merging=<yes|no> [default: yes]
Segment merging is an algorithm to limit memory usage.
Disabling segment merging may improve the accuracy displayed in race reports but can also trigger an out of memory error.
--suppressions=<suppressions-file>
Specify user suppressions file, example with atomics. doc
--gen-suppressions=all
Generate suppression for detected problem. Useful to rapidly suppress problem.
You can also annotate your code
to help DRD understand happens-before relation example with std::atomic<>
.
Unfortunately this can silence errors if you get wrong with these, be sure to really understand you code before.
It may be possible to wrap annotations and check for ordering(std::memory_order) in user defined atomic
class.
=> Check other options
Run command:
valgrind --gen-suppressions=all --suppressions=valgrind.supp --check-stack-var=yes --tool=drd ./prog
The suppressions file if you want.
Sample with errors | Gcc | Clang | Clang libc++ | Details |
---|---|---|---|---|
Simple data race | ✔ | ✔ | ✔ | more |
Data race on std::string | ✔ | ✔ | ✔ | more |
Data race notify | ✔ | ✔ | ✔ | more |
Data race on std::map | ✔ | ✔ | ✔ | more |
Data race and race condition | ✔ | ✔ | ✔ | more |
Data race on object destruction | ✘ | ✘ | ✘ | more |
Data race on small string destruction | ✘ | ✘ | ✘ | more |
Data race on string destruction | ✘ | ✘ | ✘ | more |
ABA' problem in a stack DS | ✘ | ✘ | ✘ | more |
ABA' problem in a stack DS Sync | ✔10 | ✔10 | ✔10 | more |
Notification load relaxed | ~ | ~ | ~ | more |
Notification load relaxed in loop | ~ | ~ | ~ | more |
Notification load/store relaxed | ~ | ~ | ~ | more |
Correct sample | Gcc | Clang | Clang libc++ | Details |
---|---|---|---|---|
Simple data race fix | ~ | ✔ | ✔ | more |
Simple data race fix relaxed | ~ | ~ | ~ | more |
Notification sequentially consistent | ~ | ~ | ~ | more |
Notification acquire release | ~ | ~ | ~ | more |
ABA' fixed | ✔ | ✔ | ✔ | more |
store/load sequentially consistent | ✔ | ✔ | ✔ | DRD does nothing on that, not the purpose of it |
store/load acquire release | ✔ | ✔ | ✔ | DRD does nothing on that, not the purpose of it |
store/load relaxed | ✔ | ✔ | ✔ | DRD does nothing on that, not the purpose of it |
- ✔: The tool has correctly detected the error or correctly reported no error
- ?: The tool has reported an error even though there was no error
- ✘: The tool has not reported the error
- ~: Out of scope
- !: The tool has crashed
- <n> if a number is specified it means that the error is manifesting when looping n times.
You can see output samples.
Valgrind: Helgrind
M1 Version: | valgrind-3.14.0 |
Type: | dynamic on-the-fly |
Plateform: | Linux/MacOS/Android 32/64bits |
Helgrind can detect data races (concerned read/write position showed), improper use of POSIX threads, deadlock linked to lock ordering problems.
It is more precise than DRD but slower.
By default Helgrind checks for local variable (stack).
- Compile normally (with
-g
for debug info) - Run:
valgrind --tool=helgrind ./program
note: you can colorize the output: pip install colour-valgrind
--check-stack-refs=no|yes [default: yes]
This flag enables you to skip checking for accesses to thread stacks (local variables). This can improve performance, but comes at the cost of missing races on stack-allocated data.
--track-lockorders=no|yes [default: yes]
Helgrind performs lock order consistency checking. If you're only interested in race errors, you may want to disable lock order checking.
--history-level=none|approx|full [default: full]
\
-
full
: causes Helgrind collects enough information about "old" accesses that it can produce two stack traces in a race report: both the stack trace for the current access, and the trace for the older, conflicting access.
Collecting such information is expensive in both speed and memory. You may not need it in situations where you just want to check for the presence or absence of races, for example, when doing regression testing of a previously race-free program. -
none
: is the opposite. It causes Helgrind not to collect any information about previous accesses. -
approx
: provides a compromise between these two. It causes Helgrind to show a full trace for the later access, and approximate information regarding the earlier access. This approximate information consists of two stacks, and the earlier access is guaranteed to have occurred somewhere between program points denoted by the two stacks. This is not as useful as showing the exact stack for the previous access but it is almost as fast asnone
.
You can also annotate your code, like DRD,
to help Helgrind understand happens-before relation example with std::atomic<>
.
Unfortunately this can silence errors if you get wrong with these, be sure to really understand you code before.
It may be possible to wrap annotations and check for ordering(std::memory_order) in user defined atomic
class.
=> Check other options
Run command:
valgrind --gen-suppressions=all --suppressions=valgrind.supp --tool=helgrind ./prog
The suppressions file if you want.
Sample with errors | Gcc | Clang | Clang libc++ | Details |
---|---|---|---|---|
Simple data race | ✔ | ✔ | ✔ | more |
Data race on std::string | ✔ | ✔ | ✔ | more |
Data race notify | ✔ | ✔ | ✔ | more |
Data race on std::map | ✔ | ✔ | ✔ | more |
Data race and race condition | ✔ | ✔ | ✔ | more |
Data race on object destruction | ✘ | ✘ | ✘ | more |
Data race on small string destruction | ✘ | ✘ | ✘ | more |
Data race on string destruction | ✘ | ✘ | ✘ | more |
ABA' problem in a stack DS | ✘ | ✘ | ✘ | more |
ABA' problem in a stack DS Sync | ✘ | ✘ | ✔10 | more |
Notification load relaxed | ~ | ~ | ~ | more |
Notification load relaxed in loop | ~ | ~ | ~ | more |
Notification load/store relaxed | ~ | ~ | ~ | more |
Correct sample | Gcc | Clang | Clang libc++ | Details |
---|---|---|---|---|
Simple data race fix | ~ | ✔ | ✔ | more |
Simple data race fix relaxed | ~ | ~ | ~ | more |
Notification sequentially consistent | ~ | ~ | ~ | more |
Notification acquire release | ~ | ~ | ~ | more |
ABA' fixed | ✔ | ✔ | ✔ | more |
store/load sequentially consistent | ✔ | ✔ | ✔ | Helgrind does nothing on that, not the purpose of it though |
store/load acquire release | ✔ | ✔ | ✔ | Helgrind does nothing on that, not the purpose of it though |
store/load relaxed | ✔ | ✔ | ✔ | Helgrind does nothing on that, not the purpose of it though |
- ✔: The tool has correctly detected the error or correctly reported no error
- ?: The tool has reported an error even though there was no error
- ✘: The tool has not reported the error
- ~: Out of scope
- !: The tool has crashed
- <n> if a number is specified it means that the error is manifesting when looping n times.
You can see output samples.
Type: | dynamic on-the-fly |
Plateform: | Linux/MacOS/Android 64bits only |
ThreadSanitizer can detect data races (concerned read/write position showed), improper use of POSIX threads,
deadlock linked to lock ordering problems. see more.
It also support std::atomic<>
ThreadSanitizer as for other sanitizers is integrated in Gcc and Clang
- Compile with
-fsanitize=thread
(and with-g
for debug info) - Run normally:
./program
, you could use ./path_colorizer.sh to colorize paths to source. - you can pass options at runtime by setting TSAN_OPTIONS variable
There is compile time options
and options that can be passed at runtime avoiding recompilation.
And also common flags with other sanitizer.
It is possible to annotate code or suppress warnings.
Run command:
PATH_COLORIZER_SEARCH=$HOME ./path_colorizer.sh ./program
Sample with errors | Gcc | Clang | Clang libc++ | Details |
---|---|---|---|---|
Simple data race | ✔ | ✔ | ✔ | more |
Data race on std::string | ✔ | ✔ | ✔ | more |
Data race notify | ✔ | ✔ | ✔ | more |
Data race on std::map | ✔ | ✔ | ✔ | more |
Data race and race condition | ✔ | ✔ | ✔ | more |
Data race on object destruction | ✘ | ✘ | ✘ | more |
Data race on small string destruction | ✘ | ✘ | ✘ | more |
Data race on string destruction | ✔ | ✘ | ✔ | more |
ABA' problem in a stack DS | ✘ | ✘ | ✘ | more |
ABA' problem in a stack DS Sync | ✔10 | ✔10 | ✔1000 | more |
Notification load relaxed | ✔ | ✔ | ✔ | more |
Notification load relaxed in loop | ✔ | ✔ | ✔ | more |
Notification load/store relaxed | ✔ | ✔ | ✔ | more |
Correct sample | Gcc | Clang | Clang libc++ | Details |
---|---|---|---|---|
Simple data race fix | ✔ | ✔ | ✔ | more |
Simple data race fix relaxed | ✔ | ✔ | ✔ | more |
Notification sequentially consistent | ✔ | ✔ | ✔ | more |
Notification acquire release | ✔ | ✔ | ✔ | more |
ABA' fixed | ✔ | ✔ | ✔ | more |
store/load sequentially consistent | ✔ | ✔ | ✔ | Tsan does nothing on that, not the purpose of it though |
store/load acquire release | ✔ | ✔ | ✔ | Tsan does nothing on that, not the purpose of it though |
store/load relaxed | ✔ | ✔ | ✔ | Tsan does nothing on that, not the purpose of it though |
- ✔: The tool has correctly detected the error or correctly reported no error
- ?: The tool has reported an error even though there was no error
- ✘: The tool has not reported the error
- !: The tool has crashed
- <n> if a number is specified it means that the error is manifesting when looping n times.
You can see output samples.
Intel Inspector (Free version)
M1 Version: | 2019.3.199-3 |
Type: | dynamic on-the-fly instrumentation, post-mortem analysis |
Plateform: | Linux/MacOS/Windows 32/64bits |
Intel Inspector can detect data race, deadlock, it partially support atomics (MSVC/Intel(icc)/Clang).
Here is a more detailed list.
But it has some limitations, especially C++11 Atomics on Clang/Gcc aren't supported.
- Compile normally (and with
-g
for debug info) - Run intel inspector or inspxe-cl command line
- Create/open a project
- Tell it where your binary is
- Launch analyses
When creating a new analysis you can specify options by clicking on Edit button:
It is pretty straightforward, by default data race on stack is disabled, you may want to enable it.
using the cli, I written a script.
inspxe-cl -collect=ti-3 -- ./prog; inspxe-cl -report=problem
# or
./lauch_inspxe.sh
Sample with errors | Gcc | Clang | Clang libc++ | Details |
---|---|---|---|---|
Simple data race | ✔ | ✔ | ✔ | more |
Data race on std::string | ✔ | ✔ | ✔ | more |
Data race notify | ✔ | ✔ | ✔ | more |
Data race on std::map | ✔ | ✔ | ✔ | more |
Data race and race condition | ✔ | ✔ | ✔ | more |
Data race on object destruction | ✘ | ✘ | ✘ | more |
Data race on small string destruction | ✘ | ✘ | ✘ | more |
Data race on string destruction | ✘ | ✘ | ✘ | more |
ABA' problem in a stack DS | ✘ | ✘ | ✘ | more |
ABA' problem in a stack DS Sync | ✘ | ✔1000 | ✔1000 | more |
Notification load relaxed | ~ | ~ | ~ | more |
Notification load relaxed in loop | ~ | ~ | ~ | more |
Notification load/store relaxed | ~ | ~ | ~ | more |
Correct sample | Gcc | Clang | Clang libc++ | Details |
---|---|---|---|---|
Simple data race fix | ~ | ✔ | ✔ | more |
Simple data race fix relaxed | ~ | ~ | ~ | more |
Notification sequentially consistent | ~ | ~ | ~ | more |
Notification acquire release | ~ | ~ | ~ | more |
ABA' fixed | ✔ | ✔ | ✔ | more |
- ✔: The tool has correctly detected the error or correctly reported no error
- ?: The tool has reported an error even though there was no error
- ✘: The tool has not reported the error
- ~: Out of scope
- !: The tool has crashed
- <n> if a number is specified it means that the error is manifesting when looping n times.
You can see output samples.
Type: | static |
Plateform: | On the web |
CppMem has been designed to explore the C/C++ memory model. It support a tiny subset of pseudo C/C++ (no struct/classes)
- Go to http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
- Write/enter your "program"
- Run
- See different consistent executions
You can use std::thread
or you could use this syntax:
{{{
{
// thread1
}
|||
{
// thread2
}
}}}
You could rapidly see on top global information about the current possible execution (races, locks, ...).
Then the graph describing that execution.
- W: write
- R: read
- (R/W)na: non atomic R/W
- (R/W)(sc|rlx): atomic R/W (seq_cst|relaxed)
- Racq: atomic read acquire
- Wrel: atomic write release
- sb: sequenced before
- mo: modification order
- sw: synchronized-with
- hb: happens-before
- rf: read from
- dr: data race
- sc: sequentially consistent
More details on their help page.
We have to convert the C++ into CppMem syntax, some test aren't there because CppMem does not support features used in it.
You can see output samples.
There are some other tools not covered here with given reason:
- Other sanitizers: these must be useful to catch other bugs not related to concurrency, check it out.
- Debuggers (gdb, ...) of course, use it to control interleaving of threads or to explore the execution near detected errors.
- Relacy race detector:
The code must be instrumented (unit test) and compiled C++03 or C++11 but
std::*
must be changed byrl::*
see example - CDSChecker: poor support of C++, old.
- IFRit: too old, based on old llvm/clang
- To initialize something in concurrency (double checked locking) think about using static local variable
- Do not use volatile for synchronization!
Most tools have poor support of lock-free programming.
CppMem can shows us what is going on in lock free but has serious limitations (no struct, tiny subset of pseudo C supported).
ThreadSanitizer can work well in lock-free context but has not been seriously tested on that specific area[7] and there are no paper describing the new Thread Sanitizer (v2).
Even though we hadn't benchmark the time, during the tests Tsan appears to be the fastest.
Intel Inspector seems to be really slow, actually the analysis take a long time.
DRD even faster than Helgrind seems both to be significantly slower than Tsan.
We had ideas for further testing, but due to lack of time we don't have tested:
- programs with more than 2 threads
- real examples
- boost lock-free
- TBB
- PPLP (program develop at Verimag)
- More benchmark (on real examples)
- Execution time
- Memory overhead
- Other hardware:
- ARM64
- Verimag PC
Finally it would be interesting to:
- Do further testing on real lock free programs with TSan, as this haven't been done, maybe by stressing the tool with generated correct code with corner cases.
In order to potentially complete TSan specifically for lock-free programs. - See to make an optimizer that would transform seq_cst into more relaxed barrier.
That will simplify coder task to produce less error prone program, without losing performance due to relative cost of full memory barriers. - Explore how we could detect the ABA' problem as no tools could simply detect it.
- CppReference C++ reference close to the standard but more digest.
- Book C++ concurrency in action by Anthony Williams good book about concurrency and lock-free
- Conference CppCon 2014 Herb Sutter "Lock-Free Programming (or, Juggling Razor Blades)"
- Papers
- Memory Barriers: a Hardware View for Software Hackers, Paul E. Mckenney, 10.1.1.170.3279
- ThreadSanitizer – data race detection in practice, Konstantin Serebryany and Timur Iskhodzhanov, 10.1.1.308.3963
- Blogs
- Awesome lock-free resources
- TSan supports [...] C++
<atomic>
operations are supported with llvm libc++ (not very thoroughly tested, though).