chaos: add futex delays trait #2280
base: main
Putting this up as a draft so I can work on it in the open now. It seems to have a stall bug even without the DSQ searching/re-queuing, and that path never gets hit because the searching isn't correct.
3406cfe to 5144a4e
Things are working substantially better with the recent chaos/p2dq fixes! Current issue is we attempt to call
Triggers the crash case well; general use doesn't hit the contended use case very often on my quiet machine.
chaos_stat_inc(CHAOS_STAT_TRAIT_FUTEX_DELAYS);
scx_bpf_dsq_insert_vtime(p, get_cpu_delay_dsq(cpu), 0, now + futex_uncontended_delay_ns, enq_flags);

// critical sections can't call kfuncs which makes this very complicated.
Nice explanation!
c576237 added random support to scx_chaos with less bias (very very nearly 0). It was reverted because it broke random delays. It turns out the random implementation was fine but the callsite was wrong. Instead of adding the random delay to the current time it was setting the target time to the random delay, which was always in the past, and hence scheduling things immediately. With that fixed, this appears to work.

Test plan:
- CI

```
# wakeup latencies with no flags
56.166µs |▁▄▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████████████ 132.125µs ███▇▇▇▇▆▆▆▃| 141.834µs
63.75µs |▁▄▄▄▄▅▅▅▅▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇██████ 74.767µs ███████▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▅▅▅▅▅▄▄▄▄▄▄▄▄▄▃▃| 88.6µs
65.835µs |▁▆▇▇ 84.235µs █████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▅▅▅▅▅▅▅▄▄▄▄▄▄▄▄▄▄▃▃| 446.137µs
59.289µs |▁▅▆▆▆▆▆▇▇▇▇▇██ 117.979µs ████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▅▅▅▄▄▄▄▄▄▄▄▃▃▂▁| 407.666µs
64.987µs |▁▄▅▅▆▆▆▆▇▇▇▇▇██ 148.62µs █████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▄▄▄▃▃▃▃▂▂▂▂▁▁| 536.936µs
52.78µs |▁▂▂▃▃▄▄▅▆▆▆▆▆▆▇▇▇▇▇▇██ 146.406µs ██████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▄▄▃▁| 415.672µs
58.819µs |▁▃▄▄▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇████ 174.184µs █████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▄▄▁| 437.196µs
62.938µs |▁▆▇ 127.197µs ██▇▇▇▇▇▆▆▆▅▄▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁| 2.105205ms
59.026µs |▁▂▄▄▅▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇██████ 227.526µs ██████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▄▄▃▂▁| 456.578µs
Benchmarking sleep_wakeup_histogram/wakeup_latency: Collecting 10 samples in estimated 5.0415 s (490 iterations)
64.466µs |▁▃▄▄▄▅▅▅▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇██████████ 312.808µs ██████▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▅▅▄▄▁| 431.229µs
48.977µs |▁▂▄▄▅▅▆▆▆▆▆▇▇▇▇▇██ 125.064µs ██████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▄▄▄▁| 410.671µs
58.827µs |▁▃▄▅▆▆▆▆▆▇▇▇▇▇██ 127.664µs ██████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▄▄▄▁| 421.334µs
63.12µs |▁▃▄▄▅▅▆▆▆▆▇▇▇▇▇███ 135.895µs ██████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▄▁| 424.716µs
37.173µs |▁▁▂▂▃▃▄▄▅▆▆▆▆▇▇▇▇██ 132.831µs ███████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▂▂▁| 470.139µs
60.392µs |▁▂▂▃▃▄▄▄▅▅▅▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇██████████████ 410.584µs ██▇▇▆▆▅▅▄▄▁| 457.748µs
58.813µs |▁▃▄▅▆▆▆▆▇▇▇▇▇█ 124.729µs ███████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▄▄▁| 453.846µs
55.203µs |▁▃▄▄▅▅▆▆▆▆▇▇▇▇██ 130.237µs ██████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▄▄▄▁| 466.242µs
62.011µs |▁▄▄▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇███ 186.162µs ████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▄▄▃▃▃▂▂▂▁▁| 499.317µs
44.48µs |▁▃▄▄▄▄▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇██ 151.382µs █████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▄▄▄▁| 450.172µs

# wakeup latencies with --random-delay-frequency 1.0 --random-delay-min-us 100000 --random-delay-max-us 200000
67.984µs | 103.273µs ▁████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▅▅▅▅▅▅▅▄▄▄▄▄▄▄▄▄▄▄▃▃| 149.695558ms
69.85µs |▁▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇██████ 73.03691ms ████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▄▃| 197.940289ms
66.854µs |▁▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████ 79.471527ms ██████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▄▄▄▃| 200.842494ms
72.268µs |▁▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████ 79.757623ms ██████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▄▄▁| 195.335728ms
45.997µs |▁▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇█████ 59.430492ms ███████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▄▄▄▁| 201.135123ms
Benchmarking sleep_wakeup_histogram/wakeup_latency: Collecting 10 samples in estimated 6.1568 s (40 iterations)
72.645µs |▁▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████ 83.468765ms ████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▄▃▃| 198.469905ms
77.531µs |▁▆▆▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇████████ 92.318916ms █████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▄▃| 194.16041ms
67.939µs |▁▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇██████ 75.667921ms ██████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▄▄▃| 188.527733ms
66.176µs |▁▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇██████ 75.595375ms ██████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▅▄▄▄▄▃| 193.914143ms
67.455µs |▁▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇██████ 75.774696ms ██████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▅▄▄▄▄▃| 199.696546ms
61.654µs |▁▆▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████ 86.050604ms █████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▄▄▃| 186.936313ms
65.914µs |▁▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇██████ 76.277914ms ██████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▄▄▄▃| 196.387308ms
91.47µs |▁▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇██████ 76.357756ms ██████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▄▄▄▄▃| 200.616152ms
79.914µs |▁▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████ 83.185347ms ████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▄▃| 198.712172ms
77.44µs |▁▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇██████ 72.513261ms ████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▄▄▃| 199.976316ms
```
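The call-site bug described in this commit message can be illustrated with a small user-space sketch. The helper names (`buggy_target_ns`, `fixed_target_ns`, `is_due`, `check_example`) are hypothetical and are not taken from scx_chaos; the real code runs as BPF and reads the clock from the kernel, so treat this as an illustration of the arithmetic only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: the buggy call site used the random delay itself as the
 * absolute release time, so the current time never entered the result. */
static uint64_t buggy_target_ns(uint64_t now_ns, uint64_t delay_ns)
{
	(void)now_ns;
	return delay_ns;
}

/* Hypothetical: the fix adds the random delay to the current time. */
static uint64_t fixed_target_ns(uint64_t now_ns, uint64_t delay_ns)
{
	return now_ns + delay_ns;
}

/* A delayed task is released once the clock reaches its target time. */
static bool is_due(uint64_t now_ns, uint64_t target_ns)
{
	return now_ns >= target_ns;
}

/* Worked example: 5 s on the clock, 150 ms requested delay. */
static void check_example(void)
{
	const uint64_t now = 5000000000ULL;
	const uint64_t delay = 150000000ULL;

	/* Buggy target lies in the past, so the task runs immediately. */
	assert(is_due(now, buggy_target_ns(now, delay)));
	/* Fixed target is in the future and is reached after the delay. */
	assert(!is_due(now, fixed_target_ns(now, delay)));
	assert(is_due(now + delay, fixed_target_ns(now, delay)));
}
```

Under this reading, the buggy target is in the past whenever the clock value exceeds the requested delay, which matches the "scheduling things immediately" behaviour reported above.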
Add futex delays to chaos. To best reproduce deadlocks and other futex issues we need to affect locking.

The approach here:
- Delays a waiter when a lock has contention up to futex_uncontended_delay_ns.
- Swaps out the existing delayed waiter when another waiter comes along.
- Delays the previous waiter by a random delay between futex_contended_delay_ns and futex_uncontended_delay_ns.

This approach is chosen over random delays to flip futex conditions with minimal performance impact on a machine/process. If we had a futex and a pair of threads that have many idle seconds after a short period of contention, we would need huge random delays to affect their ordering at all, on every task that touches the futex. Instead we can limit the delays to a solo waiter at any point, and have a much smaller delay when we know the mutex is already under contention. We'll see how this works in practice.

This is the most complicated chaos trait in terms of data structures by far. Currently we use a BPF hash map and a built-in DSQ to maintain the data. The hash map maps a specific futex (well, close, a tgid/uaddr pair) to an entry in a CPU's delay DSQ. The delay DSQ holds the task until its timeout, and the map stores how to find that entry in the DSQ to re-queue it with the uncontended timeout. As commented in the code, the complexity of a search in a native DSQ is hideous: it's O(n). We can change the implementation in the future while keeping the logic the same.

Test plan:
- Lightly tested. Futex is attached to and sees many entries. Slow futex waiters are delayed. The hand-off between an old delayed waiter and a new delayed waiter is not reliable and likely has a bug.
- This change is a no-op unless you provide new command line flags.
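The bookkeeping described in this commit message can be sketched in plain user-space C. This is one reading of the three bullets, not the PR's implementation: a fixed-size array stands in for the BPF hash map, no DSQ is modelled, and every name here (`futex_key`, `futex_slot`, `futex_table`, `find_slot`, `futex_wait`) is hypothetical. The random number is passed in by the caller so the behaviour is deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FUTEXES 8

/* A futex is identified (approximately) by a tgid/uaddr pair. */
struct futex_key {
	uint32_t tgid;
	uint64_t uaddr;
};

/* The single delayed waiter tracked for one futex. */
struct futex_slot {
	bool used;
	struct futex_key key;
	int32_t pid;          /* currently delayed waiter */
	uint64_t deadline_ns; /* when it would leave the delay queue */
};

struct futex_table {
	struct futex_slot slots[MAX_FUTEXES];
	uint64_t uncontended_delay_ns;
	uint64_t contended_delay_ns;
};

/* Linear lookup; a stand-in for the hash map lookup. */
static struct futex_slot *find_slot(struct futex_table *t, struct futex_key k)
{
	for (int i = 0; i < MAX_FUTEXES; i++)
		if (t->slots[i].used && t->slots[i].key.tgid == k.tgid &&
		    t->slots[i].key.uaddr == k.uaddr)
			return &t->slots[i];
	return NULL;
}

/*
 * Record `pid` as the delayed waiter on futex `k`. If another waiter was
 * already tracked, it is swapped out: its pid is returned and its new,
 * shorter deadline is written to *prev_deadline. Returns -1 when nobody
 * was swapped out (or when the table is full and no delay is recorded).
 */
static int32_t futex_wait(struct futex_table *t, struct futex_key k,
			  int32_t pid, uint64_t now_ns, uint64_t rnd,
			  uint64_t *prev_deadline)
{
	struct futex_slot *s = find_slot(t, k);
	int32_t prev = -1;

	assert(t->contended_delay_ns <= t->uncontended_delay_ns);

	if (s) {
		/* Contended: previous waiter gets a random delay in
		 * [contended_delay_ns, uncontended_delay_ns]. */
		uint64_t span = t->uncontended_delay_ns - t->contended_delay_ns;

		prev = s->pid;
		*prev_deadline = now_ns + t->contended_delay_ns + rnd % (span + 1);
	} else {
		for (int i = 0; i < MAX_FUTEXES && !s; i++)
			if (!t->slots[i].used)
				s = &t->slots[i];
		if (!s)
			return -1;
	}

	/* The newest waiter becomes the solo delayed waiter. */
	s->used = true;
	s->key = k;
	s->pid = pid;
	s->deadline_ns = now_ns + t->uncontended_delay_ns;
	return prev;
}
```

Keeping only one tracked waiter per futex is what bounds the cost in this sketch: at most one task per futex carries the long uncontended delay at any time, and a task that is swapped out falls back to the shorter range.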
5144a4e to 69434ac