Investigate kernel blocking for sock calls #13

Open
dberliner opened this issue Jan 14, 2020 · 4 comments
Comments

@dberliner

The Linux kernel may handle certain blocking ops for us. In particular, look into the usage of sk_wake_async and wake_up_interruptible_all in the default read/write space functions (sock_def_readable and sock_def_write_space).
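For reference, the default readable callback in net/core/sock.c has roughly this shape (paraphrased from memory; the exact wake_up_interruptible_* variant differs between the default callbacks and across kernel versions, and sock_def_write_space is the analogous write-side one):

/* Paraphrased shape of the default sk_data_ready callback; not a verbatim copy. */
static void sock_def_readable(struct sock *sk)
{
	struct socket_wq *wq;

	rcu_read_lock();
	wq = rcu_dereference(sk->sk_wq);
	if (skwq_has_sleeper(wq))
		/* wake anyone blocked in recv()/poll() on this socket */
		wake_up_interruptible_sync_poll(&wq->wait, EPOLLIN | EPOLLPRI |
						EPOLLRDNORM | EPOLLRDBAND);
	/* notify async (fasync/SIGIO) users */
	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
	rcu_read_unlock();
}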

@dberliner (Author)

Generic sock code calls the protocol's function, so an INET sock may call the TCP or UDP recvmsg procedure. We do not make this abstraction and simply handle the procedure without appealing to a struct proto abstraction. This is fine for the moment since all calls are currently TCP, but I probably can't ignore this forever.

In terms of waiting, TCP's send/recv functions both take the socket lock (lock_sock) no matter what. I used these in my prototype functions and they were woken up by the wake_up_interruptible_all calls.

TCP's recvmsg (tcp_recvmsg) does the following on entry:

	/* If this sock can busy loop, the queue is empty and we have an active connection */
	if (sk_can_busy_loop(sk) && skb_queue_empty_lockless(&sk->sk_receive_queue) &&
	    (sk->sk_state == TCP_ESTABLISHED))
		sk_busy_loop(sk, nonblock);

	lock_sock(sk);

lock_sock is smart enough to know when to sleep or spin, and it looks like all inet connects are blocking and happen with this lock held. Connect is extremely convoluted, but I believe its typical path calls inet_wait_for_connect, which uses a wait queue provided in the sock. It looks like the same one we wake up in our custom writable/readable functions.
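For reference, inet_wait_for_connect() in net/ipv4/af_inet.c is roughly the following (paraphrased; error handling trimmed):

static long inet_wait_for_connect(struct sock *sk, long timeo, int writebias)
{
	DEFINE_WAIT_FUNC(wait, woken_wake_function);

	add_wait_queue(sk_sleep(sk), &wait);
	sk->sk_write_pending += writebias;
	/* sleep on sk_sleep(sk) until the handshake finishes, a signal
	 * arrives, or the timeout expires */
	while ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
		release_sock(sk);
		timeo = wait_woken(&wait, TASK_INTERRUPTIBLE, timeo);
		lock_sock(sk);
		if (signal_pending(current) || !timeo)
			break;
	}
	remove_wait_queue(sk_sleep(sk), &wait);
	sk->sk_write_pending -= writebias;
	return timeo;
}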

So in summary, the kernel does not do any of this for us, but it does provide standard tools for blocking that should be used if at all possible.
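A minimal sketch of what using those tools could look like on our recvmsg path, modeled on the kernel's sk_wait_data() helper (my_wait_for_data is a hypothetical name; this assumes the caller holds the socket lock, as tcp_recvmsg does):

/* Sketch only: sleep on sk_sleep(sk) until the receive queue is non-empty,
 * the timeout expires, or a signal arrives.  sk_wait_event() drops and
 * re-takes the socket lock around the actual sleep. */
static int my_wait_for_data(struct sock *sk, long *timeo)
{
	DEFINE_WAIT_FUNC(wait, woken_wake_function);
	int rc;

	add_wait_queue(sk_sleep(sk), &wait);
	sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
	rc = sk_wait_event(sk, timeo,
			   !skb_queue_empty(&sk->sk_receive_queue), &wait);
	sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
	remove_wait_queue(sk_sleep(sk), &wait);
	return rc;
}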

@dberliner (Author) commented Jan 15, 2020

sk_busy_loop seems to be called routinely (although it is technically dependent on CONFIG_NET_RX_BUSY_POLL and one other internal value, I think they are nearly always true). It has different paths depending on whether it can block (I don't think it actually does block, though), but its main waiting method is just calling a bunch of NOPs. The ARM implementation is

#define cpu_relax()						\
	do {							\
		smp_mb();					\
		__asm__ __volatile__("nop; nop; nop; nop; nop; nop; nop; nop; nop; nop;");	\
	} while (0)

This is all complex because of the way Linux mitigates interrupts using NAPI (https://en.wikipedia.org/wiki/New_API).

On a somewhat pointless side note, x86 actually has a specific opcode for no-op'ing in a busy loop (PAUSE, encoded as rep; nop). Gotta love CISC.
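For comparison, the x86 cpu_relax() boils down to that PAUSE instruction, roughly (paraphrased from arch/x86/include/asm/processor.h):

/* Paraphrased: the real code splits this into rep_nop() + cpu_relax(). */
static __always_inline void cpu_relax(void)
{
	asm volatile("rep; nop" ::: "memory");	/* PAUSE */
}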

@dberliner (Author)

sk_busy_loop isn't usable for us and has almost no external usage in the kernel due to its specific nature. Bluetooth just waits via wait queues and sleeping (for 0 jiffies if nonblocking) until the wait queue wakes us up. It's a bit different from how I've been using wait queues so far, but it could be useful.
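Roughly the pattern Bluetooth uses (paraphrased from bt_sock_wait_state() in net/bluetooth/af_bluetooth.c, not a verbatim copy; condition stands in for the protocol-specific predicate):

DECLARE_WAITQUEUE(wait, current);

add_wait_queue(sk_sleep(sk), &wait);
while (!condition) {			/* e.g. sk->sk_state != desired state */
	set_current_state(TASK_INTERRUPTIBLE);
	if (!timeo || signal_pending(current))
		break;			/* non-blocking or interrupted */
	release_sock(sk);
	timeo = schedule_timeout(timeo);/* actually sleep; woken by the sk callbacks */
	lock_sock(sk);
}
__set_current_state(TASK_RUNNING);
remove_wait_queue(sk_sleep(sk), &wait);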

@drbild commented Jan 15, 2020

> sk_busy_loop seems to be called routinely (although it is technically dependent on CONFIG_NET_RX_BUSY_POLL and one other internal value, I think they are nearly always true).

I don't think we need to support busy looping for now. That seems to be aimed at reducing NIC-to-application latency compared to the traditional interrupt-driven model, and the traditional model should be fine for us for now. See https://netdevconf.info/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf

> Generic sock code calls the protocol's function, so an INET sock may call the TCP or UDP recvmsg procedure. We do not make this abstraction and simply handle the procedure without appealing to a struct proto abstraction. This is fine for the moment since all calls are currently TCP, but I probably can't ignore this forever.

> ...

> So in summary, the kernel does not do any of this for us, but it does provide standard tools for blocking that should be used if at all possible.

Check out pages 21 and 24 of this overview of the UDP rx path. It shows how the kernel facilities are used for waking blocked (recv/read) threads.

And this for how epoll is implemented.

I'm a bit confused as to why bluetooth and others (e.g., tls) sometimes operate directly on the wait queue. I assume that TCP and UDP are doing the same thing somehow --- abstracted away by the busy polling stuff --- I just haven't tracked it down.

Regardless, it seems like we should be using the sk_sleep(sk) wait queue in sendmsg and recvmsg, in a similar fashion to the bluetooth or TLS sockets.

Similarly, from the other side, we should be using the default sock_def_readable, sk_wake_async, etc. functions to wake/notify those sleepers when data is received or (in a future version) when space frees up in the write buffer.
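On the receive side that mostly amounts to queuing the skb and calling sk->sk_data_ready(sk), which points at sock_def_readable unless we override it. A minimal sketch (my_proto_rcv is a hypothetical name for our receive handler, and real code would also need socket memory accounting):

/* Sketch only: deliver an skb to the socket and wake any blocked readers. */
static void my_proto_rcv(struct sock *sk, struct sk_buff *skb)
{
	skb_queue_tail(&sk->sk_receive_queue, skb);
	sk->sk_data_ready(sk);	/* -> sock_def_readable() by default */
}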

zanebeckwith pushed a commit that referenced this issue Aug 5, 2020
When experimenting with bpf_send_signal() helper in our production
environment (5.2 based), we experienced a deadlock in NMI mode:
   #5 [ffffc9002219f770] queued_spin_lock_slowpath at ffffffff8110be24
   #6 [ffffc9002219f770] _raw_spin_lock_irqsave at ffffffff81a43012
   #7 [ffffc9002219f780] try_to_wake_up at ffffffff810e7ecd
   #8 [ffffc9002219f7e0] signal_wake_up_state at ffffffff810c7b55
   #9 [ffffc9002219f7f0] __send_signal at ffffffff810c8602
  #10 [ffffc9002219f830] do_send_sig_info at ffffffff810ca31a
  #11 [ffffc9002219f868] bpf_send_signal at ffffffff8119d227
  #12 [ffffc9002219f988] bpf_overflow_handler at ffffffff811d4140
  #13 [ffffc9002219f9e0] __perf_event_overflow at ffffffff811d68cf
  #14 [ffffc9002219fa10] perf_swevent_overflow at ffffffff811d6a09
  #15 [ffffc9002219fa38] ___perf_sw_event at ffffffff811e0f47
  #16 [ffffc9002219fc30] __schedule at ffffffff81a3e04d
  #17 [ffffc9002219fc90] schedule at ffffffff81a3e219
  #18 [ffffc9002219fca0] futex_wait_queue_me at ffffffff8113d1b9
  #19 [ffffc9002219fcd8] futex_wait at ffffffff8113e529
  #20 [ffffc9002219fdf0] do_futex at ffffffff8113ffbc
  #21 [ffffc9002219fec0] __x64_sys_futex at ffffffff81140d1c
  #22 [ffffc9002219ff38] do_syscall_64 at ffffffff81002602
  #23 [ffffc9002219ff50] entry_SYSCALL_64_after_hwframe at ffffffff81c00068

The above call stack is actually very similar to an issue
reported by Commit eac9153 ("bpf/stackmap: Fix deadlock with
rq_lock in bpf_get_stack()") by Song Liu. The only difference is
bpf_send_signal() helper instead of bpf_get_stack() helper.

The above deadlock is triggered with a perf_sw_event.
Similar to Commit eac9153, the below almost identical reproducer
used the tracepoint sched/sched_switch so the issue can be easily caught.
  /* stress_test.c */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <pthread.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <time.h>

  #define THREAD_COUNT 1000
  char *filename;
  void *worker(void *p)
  {
        void *ptr;
        int fd;
        char *pptr;

        fd = open(filename, O_RDONLY);
        if (fd < 0)
                return NULL;
        while (1) {
                struct timespec ts = {0, 1000 + rand() % 2000};

                ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
                usleep(1);
                if (ptr == MAP_FAILED) {
                        printf("failed to mmap\n");
                        break;
                }
                munmap(ptr, 4096 * 64);
                usleep(1);
                pptr = malloc(1);
                usleep(1);
                pptr[0] = 1;
                usleep(1);
                free(pptr);
                usleep(1);
                nanosleep(&ts, NULL);
        }
        close(fd);
        return NULL;
  }

  int main(int argc, char *argv[])
  {
        void *ptr;
        int i;
        pthread_t threads[THREAD_COUNT];

        if (argc < 2)
                return 0;

        filename = argv[1];

        for (i = 0; i < THREAD_COUNT; i++) {
                if (pthread_create(threads + i, NULL, worker, NULL)) {
                        fprintf(stderr, "Error creating thread\n");
                        return 0;
                }
        }

        for (i = 0; i < THREAD_COUNT; i++)
                pthread_join(threads[i], NULL);
        return 0;
  }
and the following steps:
  1. run `stress_test /bin/ls` in one window
  2. hack bcc trace.py with the following change:
     --- a/tools/trace.py
     +++ b/tools/trace.py
     @@ -513,6 +513,7 @@ BPF_PERF_OUTPUT(%s);
              __data.tgid = __tgid;
              __data.pid = __pid;
              bpf_get_current_comm(&__data.comm, sizeof(__data.comm));
     +        bpf_send_signal(10);
      %s
      %s
              %s.perf_submit(%s, &__data, sizeof(__data));
  3. in a different window run
     ./trace.py -p $(pidof stress_test) t:sched:sched_switch

The deadlock can be reproduced in our production system.

Similar to Song's fix, the fix is to delay sending the signal if
irqs are disabled, to avoid deadlocks involving rq_lock.
With this change, my above stress test in our production system
won't cause a deadlock any more.

I also implemented a scaled-down version of the reproducer in the
selftests (a subsequent commit). With the latest bpf-next,
it complains about the following potential deadlock.
  [   32.832450] -> #1 (&p->pi_lock){-.-.}:
  [   32.833100]        _raw_spin_lock_irqsave+0x44/0x80
  [   32.833696]        task_rq_lock+0x2c/0xa0
  [   32.834182]        task_sched_runtime+0x59/0xd0
  [   32.834721]        thread_group_cputime+0x250/0x270
  [   32.835304]        thread_group_cputime_adjusted+0x2e/0x70
  [   32.835959]        do_task_stat+0x8a7/0xb80
  [   32.836461]        proc_single_show+0x51/0xb0
  ...
  [   32.839512] -> #0 (&(&sighand->siglock)->rlock){....}:
  [   32.840275]        __lock_acquire+0x1358/0x1a20
  [   32.840826]        lock_acquire+0xc7/0x1d0
  [   32.841309]        _raw_spin_lock_irqsave+0x44/0x80
  [   32.841916]        __lock_task_sighand+0x79/0x160
  [   32.842465]        do_send_sig_info+0x35/0x90
  [   32.842977]        bpf_send_signal+0xa/0x10
  [   32.843464]        bpf_prog_bc13ed9e4d3163e3_send_signal_tp_sched+0x465/0x1000
  [   32.844301]        trace_call_bpf+0x115/0x270
  [   32.844809]        perf_trace_run_bpf_submit+0x4a/0xc0
  [   32.845411]        perf_trace_sched_switch+0x10f/0x180
  [   32.846014]        __schedule+0x45d/0x880
  [   32.846483]        schedule+0x5f/0xd0
  ...

  [   32.853148] Chain exists of:
  [   32.853148]   &(&sighand->siglock)->rlock --> &p->pi_lock --> &rq->lock
  [   32.853148]
  [   32.854451]  Possible unsafe locking scenario:
  [   32.854451]
  [   32.855173]        CPU0                    CPU1
  [   32.855745]        ----                    ----
  [   32.856278]   lock(&rq->lock);
  [   32.856671]                                lock(&p->pi_lock);
  [   32.857332]                                lock(&rq->lock);
  [   32.857999]   lock(&(&sighand->siglock)->rlock);

  Deadlock happens on CPU0 when it tries to acquire &sighand->siglock
  but it has been held by CPU1 and CPU1 tries to grab &rq->lock
  and cannot get it.

  This is not exactly the call stack in our production environment,
  but the symptom is similar: both locks are acquired with
  spin_lock_irqsave(), and both involve rq_lock. The fix to delay
  sending the signal when irqs are disabled also fixed this issue.

Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Cc: Song Liu <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]