
Conversation

Contributor

@tommy-u tommy-u commented Sep 11, 2025

Work in progress...

This PR introduces L3 cache domain awareness into scx_mitosis. Cell tasks are now affinitized to a single L3 within that cell's cpuset. When a CPU executes dispatch(), it preferentially pulls tasks affinitized to its L3. When a CPU is about to go idle, it first attempts to steal work from the cell's other L3 domains. Work stealing can be enabled or disabled at compile time; because it can paper over a number of scheduler issues, we preserve the option to disable it for debugging purposes.

Performance testing showed that <5% of scheduling decisions resulted in steals (a task staying on its core does not count as a scheduling decision).
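The dispatch/idle flow described above can be sketched as a toy model. This is plain C, not the PR's BPF code; the function name, the queue-length array, and the flat L3 numbering are all illustrative:

```c
#include <stdbool.h>

#define MAX_L3S 8

/* queued[l3] models the number of tasks waiting in this cell's
 * per-cell-per-L3 DSQ for that L3. Returns the L3 whose DSQ the CPU
 * should pull from, or -1 to go idle. */
static int pick_l3_to_run(const int queued[MAX_L3S], int nr_l3, int my_l3,
			  bool steal_enabled)
{
	/* Prefer work already affinitized to this CPU's L3. */
	if (queued[my_l3] > 0)
		return my_l3;

	if (!steal_enabled)
		return -1; /* go idle */

	/* Otherwise, before idling, check the cell's other L3s. */
	for (int off = 1; off < nr_l3; off++) {
		int l3 = (my_l3 + off) % nr_l3;
		if (queued[l3] > 0)
			return l3; /* steal from this DSQ */
	}
	return -1; /* nothing anywhere in the cell: go idle */
}
```

With stealing compiled out, the function only ever returns the CPU's own L3 or -1, which is what makes the disable knob useful for debugging.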

Unchanged:

  • A cell's cpuset is still dictated by the cpuset of the cgroup it corresponds to
  • There are still per CPU DSQs for use by pinned system threads

BPF Side Changes:

  • Cell DSQs are gone; they are replaced by per-cell-per-L3 DSQs.
    -- There are MAX_CELLS x MAX_L3S of these in the scheduler.
  • Cell-wide vtime is no longer tracked; vtime is instead tracked per per-cell-per-L3 DSQ.
  • Cell tasks are randomly affinitized to an L3, weighted by the number of the cell's CPUs in that L3.
    -- The chance of being affinitized to L3_i is (#cell cpus in L3_i / #cell cpus).
  • The dispatch() path preferentially pulls work from cell tasks within its L3.
  • Work stealing: in our model, a given CPU is associated with exactly one cell. Before a CPU goes idle (empty local, per-CPU, and per-cell-per-L3 DSQs), it checks the cell's other L3s for queued work. If it finds any, it steals it from that per-cell-per-L3 DSQ and begins execution.
  • Steal counts and timestamps are collected for future rate limiting.
  • Formalized 32-bit DSQ layout with typed unions + helpers, replacing ad-hoc bit math.
  • Organization: code factored into multiple headers: dsq.bpf.h, l3_aware.bpf.h, mitosis.bpf.h.
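The "typed unions + helpers" idea for the 32-bit DSQ layout can be illustrated like this. Field names, widths, and the type tag are hypothetical; the point is that encode/decode goes through one union instead of scattered shift-and-mask arithmetic:

```c
#include <stdint.h>

/* Illustrative layout of a 32-bit DSQ id (not the PR's exact fields). */
union dsq_id {
	uint32_t raw;
	struct {
		uint32_t l3   : 12; /* L3 index                          */
		uint32_t cell : 12; /* cell index                        */
		uint32_t type : 8;  /* distinguishes per-cell-per-L3,
				     * per-CPU, etc. DSQ namespaces     */
	};
};

static inline uint32_t make_cell_l3_dsq(uint32_t cell, uint32_t l3)
{
	union dsq_id id = { .raw = 0 };

	id.cell = cell;
	id.l3 = l3;
	id.type = 1; /* hypothetical tag for cell+L3 DSQs */
	return id.raw;
}

static inline uint32_t dsq_cell(uint32_t raw)
{
	union dsq_id id = { .raw = raw };
	return id.cell;
}

static inline uint32_t dsq_l3(uint32_t raw)
{
	union dsq_id id = { .raw = raw };
	return id.l3;
}
```

Because every encode and decode goes through the same union, a layout change touches one definition rather than every call site that did its own bit math.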

Rust Side Changes:

  • Populate maps with CPU -> L3 mappings and the inverse
  • More, and more configurable, debugging counters
  • Organization: created mitosis_topology_utils.rs

TODOs:

  • Clean up all TODOs
  • Performance testing
  • Run formatters
  • Use READ_ONCE and WRITE_ONCE on all shared variables
  • Tighten up sloppy races in work stealing

@tommy-u tommy-u marked this pull request as draft September 11, 2025 02:36
@tommy-u tommy-u changed the title Scx mitosis l3 aware scx_mitosis: add l3 awareness Sep 11, 2025
@tommy-u tommy-u changed the title scx_mitosis: add l3 awareness scx_mitosis: add l3 awareness and work stealing Sep 11, 2025
@tommy-u tommy-u requested a review from kkdwivedi September 11, 2025 05:15
@tommy-u tommy-u force-pushed the scx_mitosis_l3_aware branch from a84b9b1 to 0d92efb Compare September 11, 2025 16:17
@tommy-u tommy-u force-pushed the scx_mitosis_l3_aware branch from 475b560 to 0dd6be6 Compare September 11, 2025 23:53
Contributor

@dschatzberg dschatzberg left a comment


I haven't reviewed the Rust side changes yet, but I went through all the BPF and header code. This mostly looks like what I'd expect, so at a high level everything is good. Just a bunch of detailed comments inline.

One high-level suggestion: maybe we should term this LLC (last-level cache) awareness instead of L3, to be a bit more general across all sorts of CPUs.

#include <scx/ravg.bpf.h>
#endif

/* ---- Work stealing config (compile-time) ------------------------------- */
Contributor:

I think it might be best to have this as a runtime option, e.g. a flag passed to the userspace binary that writes to a global static variable in the BPF code before running it.

// TODO: This math is kinda dumb and confusing.
u32 start = ((u32)l3 + 1) % nr_l3;
u32 off;
// TODO: This might try a bunch of L3s outside of the cell
Contributor:

Yeah, I think this should still stay within the cell

Contributor (author):

Ya, this is unnecessarily confusing. It does not try any DSQs outside of the cell. But it does try the cell's DSQs outside its cpuset, which we know are empty. I'll clean it up.
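The cleaned-up scan being discussed could look like the following sketch: rotate through L3 indices starting after our own, but skip L3s in which the cell has no CPUs, so we never probe DSQs outside the cell's cpuset that are known to be empty. The `l3_cpu_cnt` array and function name are hypothetical:

```c
#include <stdint.h>

#define MAX_L3S 8

/* l3_cpu_cnt[l3] models how many of the cell's CPUs sit in that L3.
 * Returns the next L3 worth probing for a steal, or -1 if the cell has
 * no CPUs in any other L3. */
static int next_steal_candidate(const uint32_t l3_cpu_cnt[MAX_L3S],
				uint32_t nr_l3, uint32_t my_l3)
{
	for (uint32_t off = 1; off < nr_l3; off++) {
		uint32_t l3 = (my_l3 + off) % nr_l3;

		if (l3_cpu_cnt[l3] == 0)
			continue; /* cell has no CPUs in this L3: skip */
		return (int)l3;
	}
	return -1;
}
```

Starting at `(my_l3 + 1) % nr_l3` keeps the same rotation as the excerpt above while the `l3_cpu_cnt` check keeps the scan within the cell.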

// NOTE: This could get expensive, but I'm not
// anticipating that many steals. Percpu if we care.
if (count)
__sync_fetch_and_add(count, 1);
Contributor:

I think if you make this all stay within the cell, you can just treat this as another per-cell counter like we do with CSTAT_LOCAL, etc.
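A minimal sketch of that suggestion, folding the steal count into a per-cell stat array. `CSTAT_LOCAL` is mentioned in the thread; `CSTAT_STEAL`, the array shape, and the helper name are hypothetical:

```c
#include <stdint.h>

#define MAX_CELLS 16

enum cell_stat_idx {
	CSTAT_LOCAL,  /* mentioned in the review thread   */
	CSTAT_STEAL,  /* hypothetical new slot for steals */
	NR_CSTATS,
};

static uint64_t cell_stats[MAX_CELLS][NR_CSTATS];

static void cstat_inc(uint32_t cell, enum cell_stat_idx idx)
{
	if (cell >= MAX_CELLS || idx >= NR_CSTATS)
		return;
	/* In BPF this would be a __sync_fetch_and_add on a per-cell map
	 * value; a plain increment suffices for this single-threaded
	 * userspace sketch. */
	cell_stats[cell][idx]++;
}
```

Keeping steals alongside the other per-cell counters avoids a separate atomic on a global and makes the stat show up wherever the existing CSTAT_* values are reported.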

bpf_rcu_read_unlock();

cell->l3_present_cnt = present;
cell->cpu_cnt = total_cpus;
Contributor:

This logic runs off tick() and is concurrent with select_cpu, task_init, etc. It's not obvious to me that all the concurrency here is safe. You might even need to protect all of this with a per-cell spinlock or rwsem. At the very least, explain how you reason about the safety of these writes vs. the reads below.
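One lock-free way to make a pair of writes like `l3_present_cnt`/`cpu_cnt` safe against concurrent readers is a seqcount: the writer bumps a generation counter around the update, and readers retry if they observe an odd or changed count. This is a userspace C11 sketch of the pattern, not the PR's code; field and function names are illustrative:

```c
#include <stdatomic.h>
#include <stdint.h>

struct cell_topo {
	atomic_uint seq;        /* odd while a write is in progress */
	uint32_t l3_present_cnt;
	uint32_t cpu_cnt;
};

static void topo_write(struct cell_topo *c, uint32_t present, uint32_t cpus)
{
	atomic_fetch_add_explicit(&c->seq, 1, memory_order_release); /* odd */
	c->l3_present_cnt = present;
	c->cpu_cnt = cpus;
	atomic_fetch_add_explicit(&c->seq, 1, memory_order_release); /* even */
}

static void topo_read(struct cell_topo *c, uint32_t *present, uint32_t *cpus)
{
	unsigned int s;

	do {
		s = atomic_load_explicit(&c->seq, memory_order_acquire);
		*present = c->l3_present_cnt;
		*cpus = c->cpu_cnt;
		/* Retry if a write was in flight or completed meanwhile. */
	} while ((s & 1) ||
		 s != atomic_load_explicit(&c->seq, memory_order_acquire));
}
```

Whether this (or the suggested per-cell spinlock) is the right tool depends on how often the tick()-path writer runs relative to the readers; a seqcount favors cheap reads at the cost of reader retries during updates.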

cell->l3_vtime_now[tctx->l3] +=
used * DEFAULT_WEIGHT_MULTIPLIER /
p->scx.weight;
}
Contributor:

Do you really need this? It seems like advancing vtime at running() should be sufficient.

@tommy-u tommy-u force-pushed the scx_mitosis_l3_aware branch from 686fabc to 8523b9d Compare September 22, 2025 22:10