Skip to content

Conversation

JakeHillion
Copy link
Contributor

Add futex delays to chaos. To best reproduce deadlocks and other futex issues we need to affect locking.

The approach here:

  • Delays a waiter when a lock has contention up to futex_uncontended_delay_ns.
  • Swaps out the existing delayed waiter when another waiter comes along.
  • Delays the previous waiter by a random delay between futex_contended_delay_ns and futex_uncontended_delay_ns.

This approach is chosen over random delays to flip futex conditions with minimal performance impact on a machine/process. If we had a futex and pair of threads that have many idle seconds after a short period of contention we would need huge random delays to affect their ordering at all, on every task that touches the futex. Instead we can limit the delays to a solo waiter at any point, and have a much smaller delay when we know the mutex is already under contention. We'll see how this works in practice.

This is the most complicated chaos trait in terms of data structures by far. Currently we use a BPF hash map and a built in DSQ to maintain the data. The hash map maps a specific futex (well, close, a tgid/uaddr pair) to an entry in a CPU's delay DSQ. The delay DSQ holds the task until its timeout, and the map stores how to find that entry in the DSQ to re-queue it with the uncontended timeout. As commented in the code, the complexity of a search in a native DSQ is hideous - it's O(n). We can change the implementation in the future while keeping the logic the same.

Test plan:

  • Lightly tested. Futex is attached to and sees many entries. Slow futex waiters are delayed. The hand off between an old delayed waiter and a new delayed waiter are not reliable and likely have a bug.
  • This change is a no-op unless you provide new command line flags.

@JakeHillion
Copy link
Contributor Author

Putting this up as a draft so I can work on it in the open now. It seems to have a stall bug even without the DSQ searching/re-queuing, and that path never gets hit because the searching isn't correct.

@JakeHillion JakeHillion force-pushed the jakehillion/chaos-futex branch from 3406cfe to 5144a4e Compare July 10, 2025 16:57
@JakeHillion
Copy link
Contributor Author

Things are working substantially better with the recent chaos/p2dq fixes! Current issue is we attempt to call scx_bpf_dsq_move_vtime from deep inside chaos_enqueue and this isn't allowed. Evaluating the workarounds at the minute.

$ nix run nixpkgs#stress-ng -- --futex 8

Triggers the crash case well, general use doesn't hit the contended use case very often on my quiet machine.

chaos_stat_inc(CHAOS_STAT_TRAIT_FUTEX_DELAYS);
scx_bpf_dsq_insert_vtime(p, get_cpu_delay_dsq(cpu), 0, now + futex_uncontended_delay_ns, enq_flags);

// critical sections can't call kfuncs which makes this very complicated.
Copy link
Contributor

@hodgesds hodgesds Jul 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice explanation!

c576237 added random support to scx_chaos with less bias (very very
nearly 0). It was reverted because it broke random delays.

It turns out the random implementation was fine but the callsite was
wrong. Instead of adding the random delay to the current time it was
setting the target time to the random delay, which was always in the
past, and hence scheduling things immediately. With that fixed, this
appears to work.

Test plan:
- CI

```
 # wakeup latencies with no flags
 56.166µs |▁▄▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████████████ 132.125µs ███▇▇▇▇▆▆▆▃| 141.834µs
63.75µs |▁▄▄▄▄▅▅▅▅▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇██████ 74.767µs ███████▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▅▅▅▅▅▄▄▄▄▄▄▄▄▄▃▃| 88.6µs
65.835µs |▁▆▇▇ 84.235µs █████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▅▅▅▅▅▅▅▄▄▄▄▄▄▄▄▄▄▃▃| 446.137µs
59.289µs |▁▅▆▆▆▆▆▇▇▇▇▇██ 117.979µs ████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▅▅▅▄▄▄▄▄▄▄▄▃▃▂▁| 407.666µs
64.987µs |▁▄▅▅▆▆▆▆▇▇▇▇▇██ 148.62µs █████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▄▄▄▃▃▃▃▂▂▂▂▁▁| 536.936µs
52.78µs |▁▂▂▃▃▄▄▅▆▆▆▆▆▆▇▇▇▇▇▇██ 146.406µs ██████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▄▄▃▁| 415.672µs
58.819µs |▁▃▄▄▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇████ 174.184µs █████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▄▄▁| 437.196µs
62.938µs |▁▆▇ 127.197µs ██▇▇▇▇▇▆▆▆▅▄▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁| 2.105205ms
59.026µs |▁▂▄▄▅▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇██████ 227.526µs ██████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▄▄▃▂▁| 456.578µs
Benchmarking sleep_wakeup_histogram/wakeup_latency: Collecting 10 samples in estimated 5.0415 s (490 iterations)
64.466µs |▁▃▄▄▄▅▅▅▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇██████████ 312.808µs ██████▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▅▅▄▄▁| 431.229µs
48.977µs |▁▂▄▄▅▅▆▆▆▆▆▇▇▇▇▇██ 125.064µs ██████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▄▄▄▁| 410.671µs
58.827µs |▁▃▄▅▆▆▆▆▆▇▇▇▇▇██ 127.664µs ██████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▄▄▄▁| 421.334µs
63.12µs |▁▃▄▄▅▅▆▆▆▆▇▇▇▇▇███ 135.895µs ██████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▄▁| 424.716µs
37.173µs |▁▁▂▂▃▃▄▄▅▆▆▆▆▇▇▇▇██ 132.831µs ███████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▂▂▁| 470.139µs
60.392µs |▁▂▂▃▃▄▄▄▅▅▅▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇██████████████ 410.584µs ██▇▇▆▆▅▅▄▄▁| 457.748µs
58.813µs |▁▃▄▅▆▆▆▆▇▇▇▇▇█ 124.729µs ███████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▄▄▁| 453.846µs
55.203µs |▁▃▄▄▅▅▆▆▆▆▇▇▇▇██ 130.237µs ██████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▄▄▄▁| 466.242µs
62.011µs |▁▄▄▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇███ 186.162µs ████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▄▄▃▃▃▂▂▂▁▁| 499.317µs
44.48µs |▁▃▄▄▄▄▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇██ 151.382µs █████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▄▄▄▁| 450.172µs

 # wakeup latencies with --random-delay-frequency 1.0 --random-delay-min-us 100000 --random-delay-max-us 200000
67.984µs | 103.273µs ▁████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▅▅▅▅▅▅▅▄▄▄▄▄▄▄▄▄▄▄▃▃| 149.695558ms
69.85µs |▁▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇██████ 73.03691ms ████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▄▃| 197.940289ms
66.854µs |▁▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████ 79.471527ms ██████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▄▄▄▃| 200.842494ms
72.268µs |▁▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████ 79.757623ms ██████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▄▄▁| 195.335728ms
45.997µs |▁▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇█████ 59.430492ms ███████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▄▄▄▁| 201.135123ms
Benchmarking sleep_wakeup_histogram/wakeup_latency: Collecting 10 samples in estimated 6.1568 s (40 iterations)
72.645µs |▁▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████ 83.468765ms ████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▄▃▃| 198.469905ms
77.531µs |▁▆▆▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇████████ 92.318916ms █████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▄▃| 194.16041ms
67.939µs |▁▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇██████ 75.667921ms ██████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▄▄▃| 188.527733ms
66.176µs |▁▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇██████ 75.595375ms ██████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▅▄▄▄▄▃| 193.914143ms
67.455µs |▁▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇██████ 75.774696ms ██████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▅▄▄▄▄▃| 199.696546ms
61.654µs |▁▆▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████ 86.050604ms █████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▄▄▃| 186.936313ms
65.914µs |▁▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇██████ 76.277914ms ██████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▄▄▄▃| 196.387308ms
91.47µs |▁▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇██████ 76.357756ms ██████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▄▄▄▄▃| 200.616152ms
79.914µs |▁▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████ 83.185347ms ████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▄▃| 198.712172ms
77.44µs |▁▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇██████ 72.513261ms ████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▄▄▃| 199.976316ms
```
Add futex delays to chaos. To best reproduce deadlocks and other futex
issues we need to affect locking.

The approach here:
- Delays a waiter when a lock has contention up to
  futex_uncontended_delay_ns.
- Swaps out the existing delayed waiter when another waiter comes along.
- Delays the previous waiter by a random delay between
  futex_contended_delay_ns and futex_uncontended_delay_ns.

This approach is chosen over random delays to flip futex conditions with
minimal performance impact on a machine/process. If we had a futex and
pair of threads that have many idle seconds after a short period of
contention we would need huge random delays to affect their ordering at
all, on every task that touches the futex. Instead we can limit the
delays to a solo waiter at any point, and have a much smaller delay when
we know the mutex is already under contention. We'll see how this works
in practice.

This is the most complicated chaos trait in terms of data structures
by far. Currently we use a BPF hash map and a built in DSQ to maintain
the data. The hash map maps a specific futex (well, close, a tgid/uaddr
pair) to an entry in a CPU's delay DSQ. The delay DSQ holds the task
until its timeout, and the map stores how to find that entry in the DSQ
to re-queue it with the uncontended timeout. As commented in the code,
the complexity of a search in a native DSQ is hideous - it's O(n). We
can change the implementation in the future while keeping the logic the
same.

Test plan:
- Lightly tested. Futex is attached to and sees many entries. Slow futex
  waiters are delayed. The hand off between an old delayed waiter and a
  new delayed waiter are not reliable and likely have a bug.
- This change is a no-op unless you provide new command line flags.
@JakeHillion JakeHillion force-pushed the jakehillion/chaos-futex branch from 5144a4e to 69434ac Compare October 10, 2025 11:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants