
Conversation

Contributor

@tommy-u tommy-u commented Sep 11, 2025

Work in progress...

This PR introduces L3 cache domain awareness into scx_mitosis. Cell tasks are now affinitized to a single L3 within that cell's cpuset. When a CPU executes dispatch(), it preferentially pulls tasks affinitized to its L3. When a CPU is about to go idle, it first attempts to steal work from the cell's other L3 domains. Work stealing can be enabled or disabled at compile time; because it can paper over a number of scheduler issues, we preserve the option to disable it for debugging purposes.

Performance testing showed that <5% of scheduling decisions resulted in steals (a task staying on its core does not count as a scheduling decision).
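The dispatch/idle flow described above can be sketched as a toy model. This is plain C, not the PR's BPF code; the function name, the queue-length array, and the flat L3 numbering are all illustrative:

```c
#include <stdbool.h>

#define MAX_L3S 8

/* queued[l3] models the number of tasks waiting in this cell's
 * per-cell-per-L3 DSQ for that L3. Returns the L3 whose DSQ the CPU
 * should pull from, or -1 to go idle. */
static int pick_l3_to_run(const int queued[MAX_L3S], int nr_l3, int my_l3,
			  bool steal_enabled)
{
	/* Prefer work already affinitized to this CPU's L3. */
	if (queued[my_l3] > 0)
		return my_l3;

	if (!steal_enabled)
		return -1; /* go idle */

	/* Otherwise, before idling, check the cell's other L3s. */
	for (int off = 1; off < nr_l3; off++) {
		int l3 = (my_l3 + off) % nr_l3;
		if (queued[l3] > 0)
			return l3; /* steal from this DSQ */
	}
	return -1; /* nothing anywhere in the cell: go idle */
}
```

With stealing compiled out, the function only ever returns the CPU's own L3 or -1, which is what makes the disable knob useful for debugging.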

Unchanged:

  • A cell's cpuset is still dictated by the cpuset of the cgroup it corresponds to
  • There are still per CPU DSQs for use by pinned system threads

BPF Side Changes:

  • Cell DSQs are gone; they are replaced by per-cell-per-L3 DSQs.
    -- There are MAX_CELLS x MAX_L3S of these in the scheduler.
  • Cell-wide vtime is no longer tracked; vtime is instead tracked per per-cell-per-L3 DSQ.
  • Cell tasks are randomly affinitized to an L3, weighted by the number of the cell's CPUs in that L3.
    -- The chance of being affinitized to L3_i is (#cell cpus in L3_i / #cell cpus).
  • The dispatch() path preferentially pulls work from cell tasks within its L3.
  • Work stealing: in our model, a given CPU is associated with exactly one cell. Before a CPU goes idle (empty local, per-CPU, and per-cell-per-L3 DSQs), it checks the cell's other L3s for queued work. If it finds any, it steals it from that per-cell-per-L3 DSQ and begins execution.
  • Steal counts and timestamps are collected for future rate limiting.
  • Formalized 32-bit DSQ layout with typed unions + helpers, replacing ad-hoc bit math.
  • Organization: code factored into multiple headers: dsq.bpf.h, l3_aware.bpf.h, mitosis.bpf.h.
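The "typed unions + helpers" idea for the 32-bit DSQ layout can be illustrated like this. Field names, widths, and the type tag are hypothetical; the point is that encode/decode goes through one union instead of scattered shift-and-mask arithmetic:

```c
#include <stdint.h>

/* Illustrative layout of a 32-bit DSQ id (not the PR's exact fields). */
union dsq_id {
	uint32_t raw;
	struct {
		uint32_t l3   : 12; /* L3 index                          */
		uint32_t cell : 12; /* cell index                        */
		uint32_t type : 8;  /* distinguishes per-cell-per-L3,
				     * per-CPU, etc. DSQ namespaces     */
	};
};

static inline uint32_t make_cell_l3_dsq(uint32_t cell, uint32_t l3)
{
	union dsq_id id = { .raw = 0 };

	id.cell = cell;
	id.l3 = l3;
	id.type = 1; /* hypothetical tag for cell+L3 DSQs */
	return id.raw;
}

static inline uint32_t dsq_cell(uint32_t raw)
{
	union dsq_id id = { .raw = raw };
	return id.cell;
}

static inline uint32_t dsq_l3(uint32_t raw)
{
	union dsq_id id = { .raw = raw };
	return id.l3;
}
```

Because every encode and decode goes through the same union, a layout change touches one definition rather than every call site that did its own bit math.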

Rust Side Changes:

  • Populate maps with CPU -> L3 mappings and the inverse
  • More, and more configurable, debugging counters
  • Organization: created mitosis_topology_utils.rs

TODOs:

  • Clean up all TODOs
  • Performance testing
  • Run formatters
  • Use READ_ONCE and WRITE_ONCE on all shared variables
  • Tighten up sloppy races in work stealing

@tommy-u tommy-u marked this pull request as draft September 11, 2025 02:36
@tommy-u tommy-u changed the title Scx mitosis l3 aware scx_mitosis: add l3 awareness Sep 11, 2025
@tommy-u tommy-u changed the title scx_mitosis: add l3 awareness scx_mitosis: add l3 awareness and work stealing Sep 11, 2025
@tommy-u tommy-u requested a review from kkdwivedi September 11, 2025 05:15
@tommy-u tommy-u force-pushed the scx_mitosis_l3_aware branch from a84b9b1 to 0d92efb Compare September 11, 2025 16:17
@tommy-u tommy-u force-pushed the scx_mitosis_l3_aware branch from 475b560 to 0dd6be6 Compare September 11, 2025 23:53
Contributor

@dschatzberg dschatzberg left a comment


I haven't reviewed the Rust side changes yet, but I went through all the BPF and header code. This mostly looks like what I'd expect, so at a high level everything is good. Just a bunch of detailed comments inline.

One high-level suggestion: maybe we should term this LLC (last-level cache) awareness instead of L3, to be a bit more general across all sorts of CPUs.

#include <scx/ravg.bpf.h>
#endif

/* ---- Work stealing config (compile-time) ------------------------------- */
Contributor:

I think it might be best to have this as a runtime option, e.g. a flag passed to the userspace binary that writes to a global static variable in the BPF code before running it.

// TODO: This math is kinda dumb and confusing.
u32 start = ((u32)l3 + 1) % nr_l3;
u32 off;
// TODO: This might try a bunch of L3s outside of the cell
Contributor:

Yeah, I think this should still stay within the cell

Contributor (author):

Ya, this is unnecessarily confusing. It does not try any DSQs outside of the cell. But it does try the cell's DSQs outside its cpuset, which we know are empty. I'll clean it up.
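The cleaned-up scan being discussed could look like the following sketch: rotate through L3 indices starting after our own, but skip L3s in which the cell has no CPUs, so we never probe DSQs outside the cell's cpuset that are known to be empty. The `l3_cpu_cnt` array and function name are hypothetical:

```c
#include <stdint.h>

#define MAX_L3S 8

/* l3_cpu_cnt[l3] models how many of the cell's CPUs sit in that L3.
 * Returns the next L3 worth probing for a steal, or -1 if the cell has
 * no CPUs in any other L3. */
static int next_steal_candidate(const uint32_t l3_cpu_cnt[MAX_L3S],
				uint32_t nr_l3, uint32_t my_l3)
{
	for (uint32_t off = 1; off < nr_l3; off++) {
		uint32_t l3 = (my_l3 + off) % nr_l3;

		if (l3_cpu_cnt[l3] == 0)
			continue; /* cell has no CPUs in this L3: skip */
		return (int)l3;
	}
	return -1;
}
```

Starting at `(my_l3 + 1) % nr_l3` keeps the same rotation as the excerpt above while the `l3_cpu_cnt` check keeps the scan within the cell.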

// NOTE: This could get expensive, but I'm not
// anticipating that many steals. Percpu if we care.
if (count)
__sync_fetch_and_add(count, 1);
Contributor:

I think if you make this all stay within the cell, you can just treat this as another per-cell counter like we do with CSTAT_LOCAL, etc.
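A minimal sketch of that suggestion, folding the steal count into a per-cell stat array. `CSTAT_LOCAL` is mentioned in the thread; `CSTAT_STEAL`, the array shape, and the helper name are hypothetical:

```c
#include <stdint.h>

#define MAX_CELLS 16

enum cell_stat_idx {
	CSTAT_LOCAL,  /* mentioned in the review thread   */
	CSTAT_STEAL,  /* hypothetical new slot for steals */
	NR_CSTATS,
};

static uint64_t cell_stats[MAX_CELLS][NR_CSTATS];

static void cstat_inc(uint32_t cell, enum cell_stat_idx idx)
{
	if (cell >= MAX_CELLS || idx >= NR_CSTATS)
		return;
	/* In BPF this would be a __sync_fetch_and_add on a per-cell map
	 * value; a plain increment suffices for this single-threaded
	 * userspace sketch. */
	cell_stats[cell][idx]++;
}
```

Keeping steals alongside the other per-cell counters avoids a separate atomic on a global and makes the stat show up wherever the existing CSTAT_* values are reported.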

bpf_rcu_read_unlock();

cell->l3_present_cnt = present;
cell->cpu_cnt = total_cpus;
Contributor:

This logic runs off tick() and is concurrent with select_cpu, task_init, etc. It's not obvious to me that all the concurrency here is safe. You might even need to protect all of this with a per-cell spinlock or rwsem. At the very least, explain how you reason about the safety of these writes vs. the reads below.
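One lock-free way to make a pair of writes like `l3_present_cnt`/`cpu_cnt` safe against concurrent readers is a seqcount: the writer bumps a generation counter around the update, and readers retry if they observe an odd or changed count. This is a userspace C11 sketch of the pattern, not the PR's code; field and function names are illustrative:

```c
#include <stdatomic.h>
#include <stdint.h>

struct cell_topo {
	atomic_uint seq;        /* odd while a write is in progress */
	uint32_t l3_present_cnt;
	uint32_t cpu_cnt;
};

static void topo_write(struct cell_topo *c, uint32_t present, uint32_t cpus)
{
	atomic_fetch_add_explicit(&c->seq, 1, memory_order_release); /* odd */
	c->l3_present_cnt = present;
	c->cpu_cnt = cpus;
	atomic_fetch_add_explicit(&c->seq, 1, memory_order_release); /* even */
}

static void topo_read(struct cell_topo *c, uint32_t *present, uint32_t *cpus)
{
	unsigned int s;

	do {
		s = atomic_load_explicit(&c->seq, memory_order_acquire);
		*present = c->l3_present_cnt;
		*cpus = c->cpu_cnt;
		/* Retry if a write was in flight or completed meanwhile. */
	} while ((s & 1) ||
		 s != atomic_load_explicit(&c->seq, memory_order_acquire));
}
```

Whether this (or the suggested per-cell spinlock) is the right tool depends on how often the tick()-path writer runs relative to the readers; a seqcount favors cheap reads at the cost of reader retries during updates.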

cell->l3_vtime_now[tctx->l3] +=
used * DEFAULT_WEIGHT_MULTIPLIER /
p->scx.weight;
}
Contributor:

Do you really need this? It seems like advancing vtime at running() should be sufficient.

@tommy-u tommy-u force-pushed the scx_mitosis_l3_aware branch from 686fabc to 8523b9d Compare September 22, 2025 22:10