Bpf task work #9403
base: bpf-next_base
Conversation
Force-pushed from 09fdfb7 to 9b0503b
Force-pushed from 601ea2d to 7390c2c
This patch adds the necessary plumbing in the verifier, syscall and maps to support the new kfunc bpf_task_work_schedule and the kernel structure bpf_task_work. The approach mirrors how bpf_wq and bpf_timer are already handled. The verifier changes validate calls to bpf_task_work_schedule to make sure they are safe and the expected invariants hold. The BTF part detects the bpf_task_work structure inside a map value and stores its offset, which the next patch uses to calculate the key and value addresses. The arraymap and hashtab changes handle freeing of bpf_task_work: they run the code needed to deinitialize it, for example cancelling the task_work callback if possible. The actual use of bpf_task_work and the full kfunc implementation are introduced in the next patch. Signed-off-by: Mykyta Yatsenko <[email protected]>
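As a rough illustration of what this plumbing enables (not taken from the patch itself; the map name, value layout and libbpf-style section macros are assumptions), a map value would embed the new struct the same way it embeds bpf_timer today:

struct elem {
	int data;
	struct bpf_task_work tw;	/* found by the BTF changes, its offset recorded */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 64);
	__type(key, int);
	__type(value, struct elem);
} tw_map SEC(".maps");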
Calculating the BPF map key from a pointer to a map value is already duplicated in a couple of places in helpers, and the next patch introduces another use case. This patch extracts that functionality into a separate function. Signed-off-by: Mykyta Yatsenko <[email protected]>
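A minimal sketch of the idea, following the key recovery that the existing bpf_timer/bpf_wq callbacks already do; the helper name and exact signature here are illustrative, not necessarily the patch's:

static void *bpf_map_key_from_value(struct bpf_map *map, void *value, u32 *arr_idx)
{
	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
		struct bpf_array *array = container_of(map, struct bpf_array, map);

		/* array maps: the key is the element index */
		*arr_idx = ((char *)value - array->value) / array->elem_size;
		return arr_idx;
	}
	/* hash maps: the key is stored right before the 8-byte-aligned value */
	return value - round_up(map->key_size, 8);
}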
Force-pushed from dbe62d1 to 7089387
Implementation of the bpf_task_work_schedule kfuncs.

Main components:
* struct bpf_task_work_context – Metadata and state management per task work.
* enum bpf_task_work_state – A state machine to serialize work scheduling and execution.
* bpf_task_work_schedule() – The central helper that initiates scheduling.
* bpf_task_work_callback() – Invoked when the actual task_work runs.
* bpf_task_work_irq() – An intermediate step (runs in softirq context) to enqueue the task work.
* bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.

Flow of task work scheduling:
1) bpf_task_work_schedule_*() is called from BPF code.
2) The state transitions from STANDBY to PENDING.
3) irq_work_queue() schedules bpf_task_work_irq().
4) The state transitions from PENDING to SCHEDULING.
5) bpf_task_work_irq() attempts task_work_add(). If successful, the state transitions to SCHEDULED.
6) The task work runs bpf_task_work_callback(), which transitions the state to RUNNING.
7) The BPF callback is executed.
8) The context is cleaned up, refcounts are released and the state is set back to STANDBY.

Map value deletion:
If a map value that contains a bpf_task_work_context is deleted, the BPF map implementation calls bpf_task_work_cancel_and_free(). Deletion is handled by atomically setting the state to FREED and either releasing references or letting the scheduler do so, depending on the last state before the deletion:
* SCHEDULING: release references in bpf_task_work_cancel_and_free(), expect bpf_task_work_irq() to cancel the task work.
* SCHEDULED: release references and try to cancel the task work in bpf_task_work_cancel_and_free().
* other states: one of bpf_task_work_irq(), bpf_task_work_schedule() or bpf_task_work_callback() cleans up upon detecting the state switching to FREED.

The state transitions are controlled with atomic_cmpxchg(), ensuring:
* only one thread can successfully enqueue work;
* proper handling of concurrent deletes (BPF_TW_FREED);
* safe rollback if task_work_add() fails.

Signed-off-by: Mykyta Yatsenko <[email protected]>
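A compressed sketch of the scheduling path described above. The exact enum spelling of the PENDING state, the irq_work field name and the error code on a lost race are guesses inferred from this description; the other state names appear in the description and the diffs below:

enum bpf_task_work_state {
	BPF_TW_STANDBY = 0,	/* free to be scheduled */
	BPF_TW_PENDING,		/* claimed by a scheduler, irq work queued */
	BPF_TW_SCHEDULING,	/* irq work is adding the task work */
	BPF_TW_SCHEDULED,	/* task_work_add() succeeded */
	BPF_TW_RUNNING,		/* BPF callback is executing */
	BPF_TW_FREED,		/* map value deleted, context must be torn down */
};

/* Step 2: only one scheduler may win the STANDBY -> PENDING race. */
if (cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY)
	return -EBUSY;
/* Step 3: defer the actual task_work_add() to irq work context. */
irq_work_queue(&ctx->irq_work);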
Introduce selftests that exercise the BPF task work scheduling mechanism and validate that the verifier rejects incorrect calls to the bpf_task_work_schedule kfuncs. Signed-off-by: Mykyta Yatsenko <[email protected]>
@@ -7418,6 +7418,10 @@ struct bpf_timer {
	__u64 __opaque[2];
} __attribute__((aligned(8)));

struct bpf_task_work {
	__u64 ctx;
why not follow the __opaque pattern?
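For reference, the __opaque pattern the reviewer points at would look roughly like this, mirroring struct bpf_timer above (this is the suggestion, not the patch as posted):

struct bpf_task_work {
	__u64 __opaque;
} __attribute__((aligned(8)));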
@@ -3703,8 +3703,53 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign)
	return bpf_strnstr(s1__ign, s2__ign, XATTR_SIZE_MAX);
}

typedef void (*bpf_task_work_callback_t)(struct bpf_map *, void *, void *);
keep argument names?
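One way to resolve the nit; the roles of the two void pointers (map key and map value) are an assumption based on the series description, not spelled out in the patch:

typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *value);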
 {
-	return func_id == BPF_FUNC_timer_set_callback;
+	return func_id == BPF_FUNC_timer_set_callback || is_task_work_add_kfunc(func_id);
I believe I pointed this out before, maybe I'm forgetting the answer, but this looks wrong to me. This check is meant for old-style helpers, not kfuncs.
@@ -12751,7 +12844,8 @@ static bool is_sync_callback_calling_kfunc(u32 btf_id)

 static bool is_async_callback_calling_kfunc(u32 btf_id)
 {
-	return btf_id == special_kfunc_list[KF_bpf_wq_set_callback_impl];
+	return btf_id == special_kfunc_list[KF_bpf_wq_set_callback_impl] ||
nit: this should probably be a call to is_bpf_wq_set_callback_impl_kfunc()?
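A sketch of what the nit suggests (assuming is_task_work_add_kfunc() also takes a BTF id; helper names beyond those visible in the diff are guesses):

static bool is_async_callback_calling_kfunc(u32 btf_id)
{
	return is_bpf_wq_set_callback_impl_kfunc(btf_id) ||
	       is_task_work_add_kfunc(btf_id);
}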
 * Otherwise it is safe to access map key value under the rcu_read_lock
 */
rcu_read_lock_trace();
state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING);
I think a comment is warranted here explaining why we try the SCHEDULING -> RUNNING transition at all, and why it has to be done before SCHEDULED -> RUNNING; it's kind of tricky and non-obvious.
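If my reading of the flow is right, the requested comment could say something like the following (an interpretation, not the author's wording):

/* The task work may start running while bpf_task_work_irq() is still in the
 * SCHEDULING state, i.e. before it has advanced the state to SCHEDULED, so
 * try SCHEDULING -> RUNNING first; only if the irq path has already moved on
 * do we attempt SCHEDULED -> RUNNING.
 */
state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING);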
	err = -EPERM;
	goto release_prog;
}
ctx = bpf_task_work_aquire_ctx(tw, map);
typo: acquire
	return ERR_PTR(-ENOMEM);
memset(ctx, 0, sizeof(*ctx));
if (atomic_long_cmpxchg(ctx_ptr, 0, (unsigned long)ctx) != 0) {
	kfree(ctx);
I have this nagging feeling that we can't just do kfree(ctx), this ctx has to be RCU protected and (maybe) refcounted, otherwise we run the risk of races between kfree and another CPU still accessing ctx fields
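One hedged direction for this concern, for the kmalloc-based version shown above: since the read side in this series uses rcu_read_lock_trace(), a plain kfree_rcu() would not be enough and the tasks-trace variant would be needed. This assumes struct bpf_task_work_context gains a 'struct rcu_head rcu' member:

static void bpf_task_work_ctx_free_rcu(struct rcu_head *rcu)
{
	kfree(container_of(rcu, struct bpf_task_work_context, rcu));
}

/* instead of kfree(ctx): */
call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_ctx_free_rcu);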
ctx = (void *)atomic_long_read(ctx_ptr);
if (!ctx) {
	ctx = bpf_map_kmalloc_node(map,
discussed offline, this has to be bpf_mem_alloc()
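For reference, the switch discussed above might look roughly like this; whether the series ends up using the global BPF allocator (bpf_global_ma) or a map-local one is an assumption:

ctx = bpf_mem_alloc(&bpf_global_ma, sizeof(*ctx));
if (!ctx)
	return ERR_PTR(-ENOMEM);
/* ... */
bpf_mem_free_rcu(&bpf_global_ma, ctx);	/* instead of kfree(ctx); defers reuse past RCU grace periods */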
	bpf_reset_task_work_context(ctx);
	fallthrough;
case BPF_TW_STANDBY:
	kfree(ctx);
as I mentioned somewhere else, we can't just kfree(ctx) here, that ctx can still be referenced from one of irq/task work callbacks
In this particular case it should not be referenced in task work or irq work, because the state we found it in was STANDBY, although it's still possible that bpf_task_work_schedule has a pointer to the ctx
sure, so you agree with me that we can't just kfree it? it has to be RCU protected, right?
memset(ctx, 0, sizeof(*ctx));
if (atomic_long_cmpxchg(ctx_ptr, 0, (unsigned long)ctx) != 0) {
	kfree(ctx);
	return ERR_PTR(-EBUSY);
shouldn't we retry here? Someone might have allocated memory, scheduled and executed the callback, and returned to STANDBY, so we could reuse that context here (very unlikely, but with NMI involvement this can theoretically happen)
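A hedged sketch of the suggested retry; the allocation/free calls are placeholders, the point is re-reading ctx_ptr instead of returning -EBUSY when the cmpxchg loses:

for (;;) {
	ctx = (void *)atomic_long_read(ctx_ptr);
	if (ctx)
		return ctx;	/* someone else installed a context; reuse it */

	new_ctx = alloc_ctx();	/* placeholder for the real allocation */
	if (!new_ctx)
		return ERR_PTR(-ENOMEM);
	memset(new_ctx, 0, sizeof(*new_ctx));
	if (atomic_long_cmpxchg(ctx_ptr, 0, (unsigned long)new_ctx) == 0)
		return new_ctx;
	free_ctx(new_ctx);	/* lost the race; retry and pick up the winner's context */
}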
Force-pushed from 715d6cb to 506c27a
Force-pushed from 76c716d to f3b4b37