Investigate kernel blocking for sock calls #13

Open
dberliner opened this issue Jan 14, 2020 · 4 comments
Comments

@dberliner

The Linux kernel may handle certain blocking ops for us. In particular, look into the usage of sk_wake_async and wake_up_interruptible_all in the default read/write space functions (sock_def_readable and sock_def_write_space).
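For reference, the default readable callback in net/core/sock.c has roughly this shape (paraphrased from memory; the exact wake_up_interruptible_* variant differs between the default callbacks and across kernel versions, and sock_def_write_space is the analogous write-side one):

/* Paraphrased shape of the default sk_data_ready callback; not a verbatim copy. */
static void sock_def_readable(struct sock *sk)
{
	struct socket_wq *wq;

	rcu_read_lock();
	wq = rcu_dereference(sk->sk_wq);
	if (skwq_has_sleeper(wq))
		/* wake anyone blocked in recv()/poll() on this socket */
		wake_up_interruptible_sync_poll(&wq->wait, EPOLLIN | EPOLLPRI |
						EPOLLRDNORM | EPOLLRDBAND);
	/* notify async (fasync/SIGIO) users */
	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
	rcu_read_unlock();
}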

@dberliner (Author)

Generic sock code calls the protocol's function, so an INET sock may call the TCP or UDP recvmsg procedure. We do not make this abstraction and simply handle the procedure without appealing to a struct proto abstraction. This is fine for the moment since all calls are currently TCP, but I probably can't ignore this forever.

In terms of waiting, TCP's send/recv functions both take the socket lock (lock_sock) no matter what. I used these in my prototype functions and they were woken up by the wake_up_interruptible_all calls.

TCP's recvmsg (tcp_recvmsg) does the following on entry:

	/* If this sock can busy loop, the queue is empty and we have an active connection */
	if (sk_can_busy_loop(sk) && skb_queue_empty_lockless(&sk->sk_receive_queue) &&
	    (sk->sk_state == TCP_ESTABLISHED))
		sk_busy_loop(sk, nonblock);

	lock_sock(sk);

lock_sock is smart enough to know when to sleep or spin, and it looks like all inet connects are blocking and happen with this lock held. Connect is extremely convoluted, but I believe its typical path calls inet_wait_for_connect, which uses a wait queue provided in the sock. It looks like the same one we wake up in our custom writable/readable functions.
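For reference, inet_wait_for_connect() in net/ipv4/af_inet.c is roughly the following (paraphrased; error handling trimmed):

static long inet_wait_for_connect(struct sock *sk, long timeo, int writebias)
{
	DEFINE_WAIT_FUNC(wait, woken_wake_function);

	add_wait_queue(sk_sleep(sk), &wait);
	sk->sk_write_pending += writebias;
	/* sleep on sk_sleep(sk) until the handshake finishes, a signal
	 * arrives, or the timeout expires */
	while ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
		release_sock(sk);
		timeo = wait_woken(&wait, TASK_INTERRUPTIBLE, timeo);
		lock_sock(sk);
		if (signal_pending(current) || !timeo)
			break;
	}
	remove_wait_queue(sk_sleep(sk), &wait);
	sk->sk_write_pending -= writebias;
	return timeo;
}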

So in summary, the kernel does not do any of this for us, but it does provide standard tools for blocking that should be used if at all possible.
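A minimal sketch of what using those tools could look like on our recvmsg path, modeled on the kernel's sk_wait_data() helper (my_wait_for_data is a hypothetical name; this assumes the caller holds the socket lock, as tcp_recvmsg does):

/* Sketch only: sleep on sk_sleep(sk) until the receive queue is non-empty,
 * the timeout expires, or a signal arrives.  sk_wait_event() drops and
 * re-takes the socket lock around the actual sleep. */
static int my_wait_for_data(struct sock *sk, long *timeo)
{
	DEFINE_WAIT_FUNC(wait, woken_wake_function);
	int rc;

	add_wait_queue(sk_sleep(sk), &wait);
	sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
	rc = sk_wait_event(sk, timeo,
			   !skb_queue_empty(&sk->sk_receive_queue), &wait);
	sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
	remove_wait_queue(sk_sleep(sk), &wait);
	return rc;
}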

@dberliner (Author) commented Jan 15, 2020

sk_busy_loop seems to be called routinely (although it is technically dependent on CONFIG_NET_RX_BUSY_POLL and one other internal value, I think they are nearly always true). It has different paths depending on whether it can block (I don't think it actually does block, though), but its main waiting method is just calling a bunch of NOPs. The ARM implementation is

#define cpu_relax()						\
	do {							\
		smp_mb();					\
		__asm__ __volatile__("nop; nop; nop; nop; nop; nop; nop; nop; nop; nop;");	\
	} while (0)

This is all complex because of the way Linux mitigates interrupts using NAPI (https://en.wikipedia.org/wiki/New_API).

On a somewhat pointless side note, x86 actually has a specific opcode for no-op'ing in a busy loop (PAUSE, encoded as rep; nop). Gotta love CISC.
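For comparison, the x86 cpu_relax() boils down to that PAUSE instruction, roughly (paraphrased from arch/x86/include/asm/processor.h):

/* Paraphrased: the real code splits this into rep_nop() + cpu_relax(). */
static __always_inline void cpu_relax(void)
{
	asm volatile("rep; nop" ::: "memory");	/* PAUSE */
}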

@dberliner (Author)

sk_busy_loop isn't usable for us and has almost no external usage in the kernel due to its specific nature. Bluetooth just waits via wait queues and sleeping (for 0 jiffies if nonblocking) until the wait queue wakes us up. It's a bit different from how I've been using wait queues so far, but it could be useful.
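Roughly the pattern Bluetooth uses (paraphrased from bt_sock_wait_state() in net/bluetooth/af_bluetooth.c, not a verbatim copy; condition stands in for the protocol-specific predicate):

DECLARE_WAITQUEUE(wait, current);

add_wait_queue(sk_sleep(sk), &wait);
while (!condition) {			/* e.g. sk->sk_state != desired state */
	set_current_state(TASK_INTERRUPTIBLE);
	if (!timeo || signal_pending(current))
		break;			/* non-blocking or interrupted */
	release_sock(sk);
	timeo = schedule_timeout(timeo);/* actually sleep; woken by the sk callbacks */
	lock_sock(sk);
}
__set_current_state(TASK_RUNNING);
remove_wait_queue(sk_sleep(sk), &wait);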

@drbild commented Jan 15, 2020

> sk_busy_loop seems to be called routinely (although it is technically dependent on CONFIG_NET_RX_BUSY_POLL and one other internal value, I think they are nearly always true).

I don't think we need to support busy looping for now. That seems to be aimed at reducing NIC-to-application latency compared to the traditional interrupt-driven model, and the traditional model should be fine for us for now. See https://netdevconf.info/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf

> Generic sock code calls the protocol's function, so an INET sock may call the TCP or UDP recvmsg procedure. We do not make this abstraction and simply handle the procedure without appealing to a struct proto abstraction. This is fine for the moment since all calls are currently TCP, but I probably can't ignore this forever.

> ...

> So in summary, the kernel does not do any of this for us, but it does provide standard tools for blocking that should be used if at all possible.

Check out pages 21 and 24 of this overview of the UDP rx path. It shows how the kernel facilities are used for waking blocked (recv/read) threads.

And this for how epoll is implemented.

I'm a bit confused as to why bluetooth and others (e.g., tls) sometimes operate directly on the wait queue. I assume that TCP and UDP are doing the same thing somehow --- abstracted away by the busy polling stuff --- I just haven't tracked it down.

Regardless, it seems like we should be using the sk_sleep(sk) wait queue in sendmsg and recvmsg, in a similar fashion to the bluetooth or TLS sockets.

Similarly, from the other side, we should be using the default sock_def_readable, sk_wake_async, etc. functions to wake/notify those sleepers when data is received or (in a future version) when space frees up in the write buffer.
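On the receive side that mostly amounts to queuing the skb and calling sk->sk_data_ready(sk), which points at sock_def_readable unless we override it. A minimal sketch (my_proto_rcv is a hypothetical name for our receive handler, and real code would also need socket memory accounting):

/* Sketch only: deliver an skb to the socket and wake any blocked readers. */
static void my_proto_rcv(struct sock *sk, struct sk_buff *skb)
{
	skb_queue_tail(&sk->sk_receive_queue, skb);
	sk->sk_data_ready(sk);	/* -> sock_def_readable() by default */
}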

zanebeckwith pushed a commit that referenced this issue Aug 5, 2020
When experimenting with bpf_send_signal() helper in our production
environment (5.2 based), we experienced a deadlock in NMI mode:
   #5 [ffffc9002219f770] queued_spin_lock_slowpath at ffffffff8110be24
   #6 [ffffc9002219f770] _raw_spin_lock_irqsave at ffffffff81a43012
   #7 [ffffc9002219f780] try_to_wake_up at ffffffff810e7ecd
   #8 [ffffc9002219f7e0] signal_wake_up_state at ffffffff810c7b55
   #9 [ffffc9002219f7f0] __send_signal at ffffffff810c8602
  #10 [ffffc9002219f830] do_send_sig_info at ffffffff810ca31a
  #11 [ffffc9002219f868] bpf_send_signal at ffffffff8119d227
  #12 [ffffc9002219f988] bpf_overflow_handler at ffffffff811d4140
  #13 [ffffc9002219f9e0] __perf_event_overflow at ffffffff811d68cf
  #14 [ffffc9002219fa10] perf_swevent_overflow at ffffffff811d6a09
  #15 [ffffc9002219fa38] ___perf_sw_event at ffffffff811e0f47
  #16 [ffffc9002219fc30] __schedule at ffffffff81a3e04d
  #17 [ffffc9002219fc90] schedule at ffffffff81a3e219
  #18 [ffffc9002219fca0] futex_wait_queue_me at ffffffff8113d1b9
  #19 [ffffc9002219fcd8] futex_wait at ffffffff8113e529
  #20 [ffffc9002219fdf0] do_futex at ffffffff8113ffbc
  #21 [ffffc9002219fec0] __x64_sys_futex at ffffffff81140d1c
  #22 [ffffc9002219ff38] do_syscall_64 at ffffffff81002602
  #23 [ffffc9002219ff50] entry_SYSCALL_64_after_hwframe at ffffffff81c00068

The above call stack is actually very similar to an issue
reported by Commit eac9153 ("bpf/stackmap: Fix deadlock with
rq_lock in bpf_get_stack()") by Song Liu. The only difference is
bpf_send_signal() helper instead of bpf_get_stack() helper.

The above deadlock is triggered with a perf_sw_event.
Similar to Commit eac9153, the below almost identical reproducer
used the tracepoint sched/sched_switch so the issue can be easily caught.
  /* stress_test.c */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <pthread.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <time.h>

  #define THREAD_COUNT 1000
  char *filename;
  void *worker(void *p)
  {
        void *ptr;
        int fd;
        char *pptr;

        fd = open(filename, O_RDONLY);
        if (fd < 0)
                return NULL;
        while (1) {
                struct timespec ts = {0, 1000 + rand() % 2000};

                ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
                usleep(1);
                if (ptr == MAP_FAILED) {
                        printf("failed to mmap\n");
                        break;
                }
                munmap(ptr, 4096 * 64);
                usleep(1);
                pptr = malloc(1);
                usleep(1);
                pptr[0] = 1;
                usleep(1);
                free(pptr);
                usleep(1);
                nanosleep(&ts, NULL);
        }
        close(fd);
        return NULL;
  }

  int main(int argc, char *argv[])
  {
        void *ptr;
        int i;
        pthread_t threads[THREAD_COUNT];

        if (argc < 2)
                return 0;

        filename = argv[1];

        for (i = 0; i < THREAD_COUNT; i++) {
                if (pthread_create(threads + i, NULL, worker, NULL)) {
                        fprintf(stderr, "Error creating thread\n");
                        return 0;
                }
        }

        for (i = 0; i < THREAD_COUNT; i++)
                pthread_join(threads[i], NULL);
        return 0;
  }
and the following steps:
  1. run `stress_test /bin/ls` in one window
  2. hack bcc trace.py with the following change:
     --- a/tools/trace.py
     +++ b/tools/trace.py
     @@ -513,6 +513,7 @@ BPF_PERF_OUTPUT(%s);
              __data.tgid = __tgid;
              __data.pid = __pid;
              bpf_get_current_comm(&__data.comm, sizeof(__data.comm));
     +        bpf_send_signal(10);
      %s
      %s
              %s.perf_submit(%s, &__data, sizeof(__data));
  3. in a different window run
     ./trace.py -p $(pidof stress_test) t:sched:sched_switch

The deadlock can be reproduced in our production system.

Similar to Song's fix, the fix is to delay sending the signal if
irqs are disabled, to avoid deadlocks involving rq_lock.
With this change, my above stress test in our production system
won't cause a deadlock any more.

I also implemented a scaled-down version of the reproducer in the
selftests (a subsequent commit). With the latest bpf-next,
it complains about the following potential deadlock.
  [   32.832450] -> #1 (&p->pi_lock){-.-.}:
  [   32.833100]        _raw_spin_lock_irqsave+0x44/0x80
  [   32.833696]        task_rq_lock+0x2c/0xa0
  [   32.834182]        task_sched_runtime+0x59/0xd0
  [   32.834721]        thread_group_cputime+0x250/0x270
  [   32.835304]        thread_group_cputime_adjusted+0x2e/0x70
  [   32.835959]        do_task_stat+0x8a7/0xb80
  [   32.836461]        proc_single_show+0x51/0xb0
  ...
  [   32.839512] -> #0 (&(&sighand->siglock)->rlock){....}:
  [   32.840275]        __lock_acquire+0x1358/0x1a20
  [   32.840826]        lock_acquire+0xc7/0x1d0
  [   32.841309]        _raw_spin_lock_irqsave+0x44/0x80
  [   32.841916]        __lock_task_sighand+0x79/0x160
  [   32.842465]        do_send_sig_info+0x35/0x90
  [   32.842977]        bpf_send_signal+0xa/0x10
  [   32.843464]        bpf_prog_bc13ed9e4d3163e3_send_signal_tp_sched+0x465/0x1000
  [   32.844301]        trace_call_bpf+0x115/0x270
  [   32.844809]        perf_trace_run_bpf_submit+0x4a/0xc0
  [   32.845411]        perf_trace_sched_switch+0x10f/0x180
  [   32.846014]        __schedule+0x45d/0x880
  [   32.846483]        schedule+0x5f/0xd0
  ...

  [   32.853148] Chain exists of:
  [   32.853148]   &(&sighand->siglock)->rlock --> &p->pi_lock --> &rq->lock
  [   32.853148]
  [   32.854451]  Possible unsafe locking scenario:
  [   32.854451]
  [   32.855173]        CPU0                    CPU1
  [   32.855745]        ----                    ----
  [   32.856278]   lock(&rq->lock);
  [   32.856671]                                lock(&p->pi_lock);
  [   32.857332]                                lock(&rq->lock);
  [   32.857999]   lock(&(&sighand->siglock)->rlock);

  Deadlock happens on CPU0 when it tries to acquire &sighand->siglock
  but it has been held by CPU1 and CPU1 tries to grab &rq->lock
  and cannot get it.

  This is not exactly the call stack in our production environment,
  but the symptom is similar: both locks are acquired with
  spin_lock_irqsave(), and both involve rq_lock. The fix to delay
  sending the signal when irqs are disabled also fixed this issue.

Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Cc: Song Liu <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]