Conversation

@inwardvessel
Contributor

pre-submission run

rgushchin and others added 24 commits October 24, 2025 11:05
Signed-off-by: Roman Gushchin <[email protected]>
Move struct bpf_struct_ops_link's definition into bpf.h,
where other custom bpf link definitions are.

It's necessary to access its members from outside of the generic
bpf_struct_ops implementation, which will be done by the following
patches in the series.

Signed-off-by: Roman Gushchin <[email protected]>
When a struct ops is being attached and a bpf link is created,
allow passing a cgroup fd via bpf attr, so that the struct ops
can be attached to a cgroup instead of globally.

The attached struct ops doesn't hold a reference to the cgroup;
it only preserves the cgroup id.

Signed-off-by: Roman Gushchin <[email protected]>
Struct oom_control is used to describe the OOM context.
Its memcg field defines the scope of the OOM: it's NULL for global
OOMs and a valid memcg pointer for memcg-scoped OOMs.
Teach the bpf verifier to recognize it as a trusted or NULL pointer.
This provides the bpf OOM handler with a trusted memcg pointer,
which is required, for example, for iterating over the memcg's subtree.

Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Kumar Kartikeya Dwivedi <[email protected]>
mem_cgroup_get_from_ino() can be reused by the BPF OOM implementation,
but currently depends on CONFIG_SHRINKER_DEBUG. Remove this dependency.

Signed-off-by: Roman Gushchin <[email protected]>
To use memcg_page_state_output() in bpf_memcontrol.c, move the
declaration from the v1-specific memcontrol-v1.h to memcontrol.h.

Signed-off-by: Roman Gushchin <[email protected]>
Introduce a bpf struct ops for implementing custom OOM handling
policies.

It's possible to load one bpf_oom_ops for the system and one
bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
cgroup tree is traversed from the OOM'ing memcg up to the root and
corresponding BPF OOM handlers are executed until some memory is
freed. If no memory is freed, the kernel OOM killer is invoked.

The struct ops provides the bpf_handle_out_of_memory() callback,
which is expected to return 1 if it was able to free some memory
and 0 otherwise. If 1 is returned, the kernel also checks the
bpf_memory_freed field of the oom_control structure, which is
expected to be set by kfuncs suitable for releasing memory. If both
are set, the OOM is considered handled; otherwise the next OOM
handler in the chain (e.g. a BPF OOM handler attached to the parent
cgroup or the in-kernel OOM killer) is executed.
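The handler chain described above can be sketched as a plain userspace C model (struct names like oc_model and memcg_node and the sample handlers are illustrative stand-ins, not kernel structures or API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of the OOM handler chain: walk from the OOM'ing memcg up to
 * the root; the OOM counts as handled only if a handler returns 1
 * AND set bpf_memory_freed. Otherwise fall through to the kernel
 * OOM killer. */
struct oc_model {
	bool bpf_memory_freed;	/* set by memory-releasing kfuncs */
};

struct memcg_node {
	struct memcg_node *parent;
	/* returns 1 if the attached handler claims it freed memory */
	int (*handle_oom)(struct oc_model *oc);
};

static bool bpf_handle_oom_model(struct memcg_node *memcg,
				 struct oc_model *oc)
{
	for (struct memcg_node *m = memcg; m; m = m->parent) {
		if (!m->handle_oom)
			continue;
		oc->bpf_memory_freed = false;
		if (m->handle_oom(oc) == 1 && oc->bpf_memory_freed)
			return true;	/* handled: skip kernel OOM killer */
	}
	return false;	/* fall back to the kernel OOM killer */
}

/* Example handlers: one claims success without freeing, one frees. */
static int handler_lies(struct oc_model *oc)
{
	(void)oc;
	return 1;	/* returns 1 but bpf_memory_freed stays false */
}

static int handler_frees(struct oc_model *oc)
{
	oc->bpf_memory_freed = true;	/* as a releasing kfunc would */
	return 1;
}
```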

The bpf_handle_out_of_memory() callback program is sleepable to enable
using iterators, e.g. cgroup iterators. The callback receives struct
oom_control as an argument, so it can determine the scope of the OOM
event: if this is a memcg-wide or system-wide OOM.

The callback is executed just before the kernel victim task selection
algorithm, so all heuristics and sysctls like panic on oom and
sysctl_oom_kill_allocating_task are respected.

The BPF OOM struct ops provides the handle_cgroup_offline() callback,
which is useful for releasing the struct ops when the corresponding
cgroup is gone.

The struct ops also has a name field, which allows defining a
custom name for the implemented policy. It's printed in the OOM report
in the oom_policy=<policy> format. "default" is printed if bpf is not
used or no policy name is specified.

[  112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
               oom_policy=bpf_test_policy
[  112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc kernel-patches#102 PREEMPT(full)
[  112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
[  112.698167] Call Trace:
[  112.698177]  <TASK>
[  112.698182]  dump_stack_lvl+0x4d/0x70
[  112.698192]  dump_header+0x59/0x1c6
[  112.698199]  oom_kill_process.cold+0x8/0xef
[  112.698206]  bpf_oom_kill_process+0x59/0xb0
[  112.698216]  bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
[  112.698229]  bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
[  112.698236]  ? srso_alias_return_thunk+0x5/0xfbef5
[  112.698240]  bpf_handle_oom+0x11a/0x1e0
[  112.698250]  out_of_memory+0xab/0x5c0
[  112.698258]  mem_cgroup_out_of_memory+0xbc/0x110
[  112.698274]  try_charge_memcg+0x4b5/0x7e0
[  112.698288]  charge_memcg+0x2f/0xc0
[  112.698293]  __mem_cgroup_charge+0x30/0xc0
[  112.698299]  do_anonymous_page+0x40f/0xa50
[  112.698311]  __handle_mm_fault+0xbba/0x1140
[  112.698317]  ? srso_alias_return_thunk+0x5/0xfbef5
[  112.698335]  handle_mm_fault+0xe6/0x370
[  112.698343]  do_user_addr_fault+0x211/0x6a0
[  112.698354]  exc_page_fault+0x75/0x1d0
[  112.698363]  asm_exc_page_fault+0x26/0x30
[  112.698366] RIP: 0033:0x7fa97236db00

Signed-off-by: Roman Gushchin <[email protected]>
Introduce the bpf_oom_kill_process() bpf kfunc, which is meant
to be used by BPF OOM programs. It allows killing a process
in exactly the same way the OOM killer does: using the OOM reaper,
bumping the corresponding memcg and global statistics, respecting
memory.oom.group etc.

On success, it sets oom_control's bpf_memory_freed field to true,
enabling the bpf program to bypass the kernel OOM killer.

Signed-off-by: Roman Gushchin <[email protected]>
To effectively operate with memory cgroups in BPF, there is a need
to convert css pointers to memcg pointers. The simple container_of
cast which is used in kernel code can't be used in BPF because
from the verifier's point of view it's an out-of-bounds memory access.

Introduce helper get/put kfuncs which can be used to get
a refcounted memcg pointer from the css pointer:
  - bpf_get_mem_cgroup,
  - bpf_put_mem_cgroup.

bpf_get_mem_cgroup() can take both a memcg's css and the corresponding
cgroup's "self" css. This allows it to be used with the existing cgroup
iterator, which iterates over the cgroup tree, not the memcg tree.
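The cast in question is trivial in plain kernel C; a minimal userspace sketch of it (struct layouts here are illustrative, not the real kernel definitions):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace sketch of the container_of cast the kernel uses for the
 * css -> memcg conversion. The verifier rejects this pointer
 * arithmetic in BPF, hence the dedicated kfuncs. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct cgroup_subsys_state { int dummy; };

struct mem_cgroup {
	long id;
	struct cgroup_subsys_state css;	/* embedded css */
};

static struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
{
	return container_of(css, struct mem_cgroup, css);
}
```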

Signed-off-by: Roman Gushchin <[email protected]>
Introduce a BPF kfunc to get a trusted pointer to the root memory
cgroup. It's very handy to traverse the full memcg tree, e.g.
for handling a system-wide OOM.

It's possible to obtain this pointer by traversing the memcg tree
up from any known memcg, but it's sub-optimal and makes BPF programs
more complex and less efficient.

bpf_get_root_mem_cgroup() has KF_ACQUIRE | KF_RET_NULL semantics;
however, in reality it's not necessary to bump the corresponding
reference counter: the root memory cgroup is immortal and reference
counting is skipped, see css_get(). Once set, root_mem_cgroup is always
a valid memcg pointer. It's safe to call bpf_put_mem_cgroup() on a
pointer obtained with bpf_get_root_mem_cgroup(); it's effectively a no-op.
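The "immortal root" semantics can be modeled in userspace C (memcg_ref and the get/put helpers are illustrative, not the kernel implementation):

```c
#include <assert.h>
#include <stdbool.h>

/* Model of get/put semantics where refcounting is skipped for the
 * root object: the root is immortal, so get/put are no-ops on it,
 * while non-root objects are refcounted normally. */
struct memcg_ref {
	bool is_root;
	int refcnt;
};

static struct memcg_ref *memcg_get(struct memcg_ref *m)
{
	if (!m->is_root)
		m->refcnt++;	/* css_get() skips this for the root */
	return m;
}

static void memcg_put(struct memcg_ref *m)
{
	if (!m->is_root)
		m->refcnt--;	/* put on the root is a no-op */
}
```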

Signed-off-by: Roman Gushchin <[email protected]>
Introduce BPF kfuncs to conveniently access memcg data:
  - bpf_mem_cgroup_vm_events(),
  - bpf_mem_cgroup_usage(),
  - bpf_mem_cgroup_page_state(),
  - bpf_mem_cgroup_flush_stats().

These functions are useful for implementing BPF OOM policies, but
can also be used to accelerate access to memcg data. Reading it
through cgroupfs is much more expensive, roughly 5x, mostly
because of the need to convert the data to text and back.

Signed-off-by: Roman Gushchin <[email protected]>
Co-developed-by: JP Kobryn <[email protected]>
Signed-off-by: JP Kobryn <[email protected]>
Introduce BPF kfunc to access memory events, e.g.:
MEMCG_LOW, MEMCG_MAX, MEMCG_OOM, MEMCG_OOM_KILL etc.

Signed-off-by: JP Kobryn <[email protected]>
Add test coverage for the kfuncs that fetch memcg stats. Using some common
stats, test scenarios ensuring that the given stat increases by some
arbitrary amount. The stats selected cover the three categories represented
by the enums: node_stat_item, memcg_stat_item, vm_event_item.

Since only a subset of all stats are queried, use a static struct made up
of fields for each stat. Write to the struct with the fetched values when
the bpf program is invoked and read the fields in the user mode program for
verification.
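The static-struct pattern described above can be sketched like this (field names are illustrative, chosen to represent one stat per category):

```c
#include <assert.h>

/* One field per queried stat, written on the BPF side when the
 * program is invoked and read by the user-mode program. */
struct memcg_stats_snapshot {
	unsigned long anon;	/* a node_stat_item example */
	unsigned long slab;	/* a memcg_stat_item example */
	unsigned long pgfault;	/* a vm_event_item example */
};

static struct memcg_stats_snapshot snap;

/* Stands in for the BPF program filling in the fetched values. */
static void record_stats(unsigned long anon, unsigned long slab,
			 unsigned long pgfault)
{
	snap.anon = anon;
	snap.slab = slab;
	snap.pgfault = pgfault;
}
```

The user-mode side then compares snapshots taken before and after the workload, checking that each stat increased.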

Signed-off-by: JP Kobryn <[email protected]>
Introduce the bpf_out_of_memory() bpf kfunc, which allows declaring
an out-of-memory event and triggering the corresponding kernel OOM
handling mechanism.

It takes a trusted memcg pointer (or NULL for system-wide OOMs)
as an argument, as well as the page order.

If the BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK flag is not set, only one OOM
can be declared and handled in the system at once, so if the function
is called in parallel to another OOM handling, it bails out with -EBUSY.
This mode is suited for global OOMs: any concurrent OOM will likely
do the job and release some memory. In blocking mode (which is
suited for memcg OOMs) the execution waits on the oom_lock mutex.
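The two modes can be sketched as a single-threaded userspace C model (the bool stands in for the kernel's oom_lock mutex; the real blocking mode sleeps on the mutex rather than proceeding):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK 0x1

static bool oom_lock_held;

static int declare_oom_model(unsigned int flags)
{
	if (oom_lock_held && !(flags & BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK))
		return -EBUSY;	/* another OOM is already being handled */
	/* blocking mode: the real code waits on oom_lock here */
	oom_lock_held = true;
	/* ... OOM handling would run here ... */
	oom_lock_held = false;
	return 0;
}
```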

The function is declared sleepable, which guarantees that it won't
be called from an atomic context. This is required by the OOM handling
code, which shouldn't be called from a non-blocking context.

Handling a memcg OOM almost always requires taking the
css_set_lock spinlock. The fact that bpf_out_of_memory() is sleepable
also guarantees that it can't be called with css_set_lock held,
so the kernel can't deadlock on it.

Please note that this function is inaccessible as of now.
Calling bpf_out_of_memory() from a random context is dangerous
because, for example, it's easy to deadlock the system on oom_lock.
The following commit in the series provides one safe context
where this kfunc can be used.

Signed-off-by: Roman Gushchin <[email protected]>
Currently there is a hard-coded list of possible OOM constraints:
NONE, CPUSET, MEMORY_POLICY & MEMCG. Add a new one: CONSTRAINT_BPF.
Also add the ability to specify a custom constraint name
when calling bpf_out_of_memory(). If an empty string is passed
as an argument, CONSTRAINT_BPF is displayed.

The resulting output in dmesg will look like this:

[  315.224875] kworker/u17:0 invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=0
               oom_policy=default
[  315.226532] CPU: 1 UID: 0 PID: 74 Comm: kworker/u17:0 Not tainted 6.16.0-00015-gf09eb0d6badc kernel-patches#102 PREEMPT(full)
[  315.226534] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
[  315.226536] Workqueue: bpf_psi_wq bpf_psi_handle_event_fn
[  315.226542] Call Trace:
[  315.226545]  <TASK>
[  315.226548]  dump_stack_lvl+0x4d/0x70
[  315.226555]  dump_header+0x59/0x1c6
[  315.226561]  oom_kill_process.cold+0x8/0xef
[  315.226565]  out_of_memory+0x111/0x5c0
[  315.226577]  bpf_out_of_memory+0x6f/0xd0
[  315.226580]  ? srso_alias_return_thunk+0x5/0xfbef5
[  315.226589]  bpf_prog_3018b0cf55d2c6bb_handle_psi_event+0x5d/0x76
[  315.226594]  bpf__bpf_psi_ops_handle_psi_event+0x47/0xa7
[  315.226599]  bpf_psi_handle_event_fn+0x63/0xb0
[  315.226604]  process_one_work+0x1fc/0x580
[  315.226616]  ? srso_alias_return_thunk+0x5/0xfbef5
[  315.226624]  worker_thread+0x1d9/0x3b0
[  315.226629]  ? __pfx_worker_thread+0x10/0x10
[  315.226632]  kthread+0x128/0x270
[  315.226637]  ? lock_release+0xd4/0x2d0
[  315.226645]  ? __pfx_kthread+0x10/0x10
[  315.226649]  ret_from_fork+0x81/0xd0
[  315.226652]  ? __pfx_kthread+0x10/0x10
[  315.226655]  ret_from_fork_asm+0x1a/0x30
[  315.226667]  </TASK>
[  315.239745] memory: usage 42240kB, limit 9007199254740988kB, failcnt 0
[  315.240231] swap: usage 0kB, limit 0kB, failcnt 0
[  315.240585] Memory cgroup stats for /cgroup-test-work-dir673/oom_test/cg2:
[  315.240603] anon 42897408
[  315.241317] file 0
[  315.241493] kernel 98304
...
[  315.255946] Tasks state (memory values in pages):
[  315.256292] [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[  315.257107] [    675]     0   675   162013    10969    10712      257         0   155648        0             0 test_progs
[  315.257927] oom-kill:constraint=CONSTRAINT_BPF_PSI_MEM,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/cgroup-test-work-dir673/oom_test/cg2,task_memcg=/cgroup-test-work-dir673/oom_test/cg2,task=test_progs,pid=675,uid=0
[  315.259371] Memory cgroup out of memory: Killed process 675 (test_progs) total-vm:648052kB, anon-rss:42848kB, file-rss:1028kB, shmem-rss:0kB, UID:0 pgtables:152kB oom_score_adj:0

Signed-off-by: Roman Gushchin <[email protected]>
Export the tsk_is_oom_victim() helper as a BPF kfunc.
It's very useful for avoiding redundant oom kills.

Signed-off-by: Roman Gushchin <[email protected]>
Introduce bpf_map__attach_struct_ops_opts(), an extended version of
bpf_map__attach_struct_ops(), which takes an additional struct
bpf_struct_ops_opts argument.

struct bpf_struct_ops_opts has a relative_fd member, which allows
passing an additional file descriptor argument. It can be used to
attach struct ops maps to cgroups.

Signed-off-by: Roman Gushchin <[email protected]>
Implement read_cgroup_file() helper to read from cgroup control files,
e.g. statistics.

Signed-off-by: Roman Gushchin <[email protected]>
Implement a pseudo-realistic test for the OOM handling
functionality.

The OOM handling policy implemented in bpf is to kill all tasks
belonging to the biggest leaf cgroup that doesn't contain
unkillable tasks (tasks with oom_score_adj set to -1000).
Pagecache size is excluded from the accounting.
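The selection policy can be sketched as a small userspace C model (leaf_cg and pick_victim are illustrative, not the test's actual BPF code):

```c
#include <assert.h>
#include <stdbool.h>

/* Among leaf cgroups without unkillable tasks, pick the one with the
 * biggest usage, with pagecache excluded from the accounting. */
struct leaf_cg {
	unsigned long usage;
	unsigned long pagecache;
	bool has_unkillable;	/* any task with oom_score_adj == -1000 */
};

static int pick_victim(const struct leaf_cg *cg, int n)
{
	int best = -1;
	unsigned long best_sz = 0;

	for (int i = 0; i < n; i++) {
		unsigned long sz;

		if (cg[i].has_unkillable)
			continue;	/* skip cgroups we may not kill */
		sz = cg[i].usage - cg[i].pagecache;
		if (best < 0 || sz > best_sz) {
			best = i;
			best_sz = sz;
		}
	}
	return best;	/* -1 if every leaf has unkillable tasks */
}
```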

The test creates a hierarchy of memory cgroups, causes an
OOM at the top level, checks that the expected process is
killed and checks the memcg's oom statistics.

Signed-off-by: Roman Gushchin <[email protected]>
Currently psi_trigger_create() does a lot of things:
it parses the user's text input, allocates and initializes
the psi_trigger structure and turns on the trigger.
It does this slightly differently for the two existing types
of psi triggers: system-wide and cgroup-wide.

In order to support a new type of PSI trigger, which
will be owned by a BPF program and won't have a user's
text description, let's refactor psi_trigger_create().

1. Introduce psi_trigger_type enum:
   currently PSI_SYSTEM and PSI_CGROUP are valid values.
2. Introduce psi_trigger_params structure to avoid passing
   a large number of parameters to psi_trigger_create().
3. Move out the user's input parsing into the new
   psi_trigger_parse() helper.
4. Move out the capabilities check into the new
   psi_file_privileged() helper.
5. Stop relying on t->of for detecting trigger type.

This commit is a pure refactoring and doesn't bring any
functional changes.
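The refactored interfaces can be sketched in userspace C (struct fields and the parse helper are illustrative; the accepted "some/full <stall us> <window us>" format follows the documented PSI trigger syntax):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Sketch of the refactoring: a trigger-type enum, a params struct
 * bundling psi_trigger_create()'s arguments, and a parse helper for
 * the user's text input. */
enum psi_trigger_type { PSI_SYSTEM, PSI_CGROUP };

struct psi_trigger_params {
	enum psi_trigger_type type;
	bool full;			/* "full" vs "some" */
	unsigned long threshold_us;	/* stall threshold */
	unsigned long window_us;	/* tracking window */
	bool privileged;		/* from psi_file_privileged() */
};

/* Parse "some <stall us> <window us>" / "full <stall us> <window us>". */
static int psi_trigger_parse(struct psi_trigger_params *p, const char *buf)
{
	char kind[5];

	if (sscanf(buf, "%4s %lu %lu", kind, &p->threshold_us,
		   &p->window_us) != 3)
		return -1;
	if (!strcmp(kind, "some"))
		p->full = false;
	else if (!strcmp(kind, "full"))
		p->full = true;
	else
		return -1;
	return 0;
}
```

A BPF-owned trigger would then fill in struct psi_trigger_params directly, skipping the text-parsing step entirely.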

Signed-off-by: Roman Gushchin <[email protected]>
This patch implements a BPF struct ops-based mechanism to create
PSI triggers, attach them to cgroups or system wide and handle
PSI events in BPF.

The struct ops provides 4 callbacks:
  - init() called once at load, handy for creating PSI triggers
  - handle_psi_event() called every time a PSI trigger fires
  - handle_cgroup_online() called when a new cgroup is created
  - handle_cgroup_offline() called if a cgroup with an attached
    trigger is deleted

A single struct ops can create a number of PSI triggers, both
cgroup-scoped and system-wide.

All 4 struct ops callbacks can be sleepable. handle_psi_event()
handlers are executed using a separate workqueue, so they won't
affect the latency of other PSI triggers.

Signed-off-by: Roman Gushchin <[email protected]>
Implement a new bpf_psi_create_trigger() BPF kfunc, which allows
creating new PSI triggers and attaching them to a cgroup or
making them system-wide.

Created triggers exist as long as the struct ops is loaded and,
if they are attached to a cgroup, as long as the cgroup exists.

Due to the limitation of 5 kfunc arguments, the resource type and
the "full" bit are squeezed into a single u32.
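The packing trick can be sketched as follows (the exact bit layout is illustrative, not the kernel ABI):

```c
#include <assert.h>

/* Pack the PSI resource type and the "full" bit into one u32 to
 * stay within the 5-argument kfunc limit. */
#define PSI_FULL_BIT	0x80000000u

static unsigned int psi_pack(unsigned int res, int full)
{
	return res | (full ? PSI_FULL_BIT : 0);
}

static unsigned int psi_res(unsigned int packed)
{
	return packed & ~PSI_FULL_BIT;	/* low bits: resource type */
}

static int psi_full(unsigned int packed)
{
	return !!(packed & PSI_FULL_BIT);	/* high bit: full vs some */
}
```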

Signed-off-by: Roman Gushchin <[email protected]>
Add a PSI struct ops test.

The test creates a cgroup with two child sub-cgroups, sets up
memory.high for one of them and puts a memory-hungry
process there (initially frozen).

Then it creates 2 PSI triggers from within the init() BPF callback and
attaches them to these cgroups.  Then it deletes the first cgroup and
runs the memory-hungry task.  The task creates high memory
pressure, which triggers a PSI event. The PSI BPF handler declares
a memcg oom in the corresponding cgroup.  Finally, the test checks that
both handle_cgroup_free() and handle_psi_event() handlers were executed,
the correct process was killed and the oom counters were updated.

Signed-off-by: Roman Gushchin <[email protected]>
Include CONFIG_PSI to allow dependent tests to build.

Signed-off-by: JP Kobryn <[email protected]>
@kernel-patches-daemon-bpf kernel-patches-daemon-bpf bot force-pushed the bpf-next_base branch 5 times, most recently from 6d6792d to 4481a85 Compare October 28, 2025 20:29