scx_mitosis: add l3 awareness and work stealing #2761
Conversation
Force-pushed a84b9b1 → 0d92efb
Force-pushed 475b560 → 0dd6be6
I haven't reviewed the Rust side changes yet, but went through all the BPF and header code. This mostly looks like what I'd expect, so high-level everything is good. Just a bunch of detailed comments inline.
One high-level suggestion: maybe we should term this LLC (last level cache) awareness instead of L3, to be a bit more general across all sorts of CPUs.
#include <scx/ravg.bpf.h>
#endif

/* ---- Work stealing config (compile-time) ------------------------------- */
I think it might be best to have this as a runtime option, e.g. a flag passed to the userspace binary that writes to a global static variable in the BPF code before running it.
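A minimal sketch of that pattern, assuming a hypothetical enable_work_stealing knob (sched_ext schedulers commonly expose such options as a const volatile global that the userspace loader writes through the open skeleton's rodata before load):

/* BPF side: read-only after load; the userspace binary sets this
 * through the skeleton before attaching. Name is hypothetical. */
const volatile bool enable_work_stealing = false;

/* In the going-idle path, gate the steal attempt on the knob: */
if (enable_work_stealing)
        try_stealing_work(cell, l3);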
// TODO: This math is kinda dumb and confusing.
u32 start = ((u32)l3 + 1) % nr_l3;
u32 off;
// TODO: This might try a bunch of L3s outside of the cell
Yeah, I think this should still stay within the cell
Ya, this is unnecessarily confusing. It does not try any DSQs outside of the cell. But it does try the cell's DSQs outside its cpuset which we know are empty. I'll clean it up
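A hedged sketch of what the cleaned-up scan could look like, only touching DSQs that can be non-empty (l3_cpu_cnt, get_cell_dsq_id, and try_stealing_work are assumed names, and scx_bpf_dsq_move_to_local() stands in for whatever consume kfunc the tree uses):

/* Hypothetical try_stealing_work(): round-robin over the cell's other
 * L3s, skipping L3s with no CPUs in this cell (their DSQs are empty). */
static bool try_stealing_work(struct cell *cell, u32 cell_idx, u32 l3)
{
        u32 start = (l3 + 1) % nr_l3;
        u32 off, cand;

        bpf_for(off, 0, nr_l3) {
                cand = (start + off) % nr_l3;
                if (cand == l3 || !cell->l3_cpu_cnt[cand])
                        continue;
                if (scx_bpf_dsq_move_to_local(get_cell_dsq_id(cell_idx, cand)))
                        return true;
        }
        return false;
}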
// NOTE: This could get expensive, but I'm not
// anticipating that many steals. Percpu if we care.
if (count)
        __sync_fetch_and_add(count, 1);
I think if you make this all stay within the cell, you can just treat this as another per-cell counter like we do with CSTAT_LOCAL, etc.
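Something like this, assuming the existing per-cell stat enum and cstat_inc() helper in scx_mitosis (CSTAT_STEAL is a new, hypothetical entry):

enum cell_stat_idx {
        CSTAT_LOCAL,
        /* ... existing counters ... */
        CSTAT_STEAL,    /* successful steal from a sibling L3 in the cell */
        NR_CSTATS,
};

/* At the steal site, replacing the bare __sync_fetch_and_add(): */
cstat_inc(CSTAT_STEAL, tctx->cell, cctx);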
bpf_rcu_read_unlock();

cell->l3_present_cnt = present;
cell->cpu_cnt = total_cpus;
This logic runs off tick() and it is concurrent with select_cpu, task_init, etc. It's not obvious to me that all the concurrency here is safe. You might even need to protect all of this with a per-cell spinlock or rwsem. At the very least explain how you reason about the safety of these writes vs the below reads
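For reference, a minimal sketch of the suggested per-cell spinlock (note that a bpf_spin_lock must be embedded in a map value, and no helper calls are allowed while it is held; field names follow the snippet above):

struct cell {
        struct bpf_spin_lock lock;
        u32 l3_present_cnt;
        u32 cpu_cnt;
        /* ... */
};

/* tick() side: publish both counts atomically w.r.t. readers */
bpf_spin_lock(&cell->lock);
cell->l3_present_cnt = present;
cell->cpu_cnt = total_cpus;
bpf_spin_unlock(&cell->lock);

/* select_cpu() / task_init() readers take the same lock, so they never
 * observe l3_present_cnt and cpu_cnt from two different updates. */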
cell->l3_vtime_now[tctx->l3] +=
        used * DEFAULT_WEIGHT_MULTIPLIER /
        p->scx.weight;
}
Do you really need this? It seems like advancing vtime at running() should be sufficient.
Force-pushed 686fabc → 8523b9d
Work in progress...
This PR introduces L3 cache domain awareness into scx_mitosis. Now, cell tasks are affinitized to a single L3 within that cell's cpuset. When a CPU executes dispatch(), it preferentially executes tasks affinitized to that L3. When a CPU is about to go idle, it first attempts to steal work from other L3 domains associated with the cell. Work stealing can be enabled or disabled at compile time. Because work stealing can paper over a number of scheduler issues, we preserve the option to disable it for debugging purposes.

Performance testing showed <5% of scheduling decisions resulted in steals (and a task staying on a core does not count as a scheduling decision).
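For illustration, a hedged sketch of the proportional L3 assignment described in the BPF changes below (pick_l3, l3_cpu_cnt, and cpu_cnt are assumed names, not the PR's actual code):

/* Pick an L3 with probability (#cell cpus in L3_i / #cell cpus). */
static u32 pick_l3(struct cell *cell)
{
        u32 target, l3, seen = 0;

        if (!cell->cpu_cnt)
                return 0;
        target = bpf_get_prandom_u32() % cell->cpu_cnt;
        bpf_for(l3, 0, nr_l3) {
                seen += cell->l3_cpu_cnt[l3];
                if (target < seen)
                        return l3;
        }
        return 0; /* only reached if the counts are stale */
}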
Unchanged:

BPF Side Changes:
-- Per-(cell, L3) DSQs: there are MAX_CELLS x MAX_L3S of these in the scheduler.
-- The chance of being affinitized to L3_i is (#cell cpus in L3_i / #cell cpus).
-- The dispatch() path preferentially pulls work from cell tasks within its L3.

Rust Side Changes:
TODOs:
- Clean up all TODOs
- Performance testing
- Run formatters
- READ_ONCE and WRITE_ONCE on all shared variables
- Tighten up sloppy races in work stealing