-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stall-detector: Try hard not to crash while collecting backtrace #2420
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -32,6 +32,7 @@ module; | |
#include <fstream> | ||
#include <regex> | ||
#include <thread> | ||
#include <setjmp.h> | ||
|
||
#include <spawn.h> | ||
#include <sys/syscall.h> | ||
|
@@ -842,7 +843,23 @@ class backtrace_buffer { | |
} | ||
}; | ||
|
||
static thread_local jmp_buf stall_detector_env; | ||
static thread_local bool in_stall_detector = false; | ||
static thread_local bool crash_collecting_backtrace = false; | ||
|
||
inline void maybe_crash_for_test() noexcept { | ||
if (crash_collecting_backtrace) [[unlikely]] { | ||
*(volatile int *)nullptr = 0; | ||
} | ||
} | ||
|
||
static void print_with_backtrace(backtrace_buffer& buf, bool oneline) noexcept { | ||
if (sigsetjmp(stall_detector_env, 0)) { | ||
buf.append(" ¯\\_(ツ)_/¯\n"); | ||
goto out; | ||
} | ||
in_stall_detector = true; | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To be technically correct, we need an std::atomic_signal_fence(std::memory_order_relaxed). This prevents a magical compiler from delaying the write to memory because no one reads it. |
||
if (local_engine) { | ||
buf.append(" on shard "); | ||
buf.append_decimal(this_shard_id()); | ||
|
@@ -853,12 +870,17 @@ static void print_with_backtrace(backtrace_buffer& buf, bool oneline) noexcept { | |
|
||
if (!oneline) { | ||
buf.append(".\nBacktrace:\n"); | ||
maybe_crash_for_test(); | ||
buf.append_backtrace(); | ||
} else { | ||
buf.append(". Backtrace:"); | ||
maybe_crash_for_test(); | ||
buf.append_backtrace_oneline(); | ||
buf.append("\n"); | ||
} | ||
|
||
out: | ||
in_stall_detector = false; | ||
buf.flush(); | ||
} | ||
|
||
|
@@ -1502,6 +1524,10 @@ reactor::test::set_stall_detector_report_function(std::function<void ()> report) | |
r._cpu_stall_detector->reset_suppression_state(reactor::now()); | ||
} | ||
|
||
void reactor::test::set_stall_detector_crash_collecting_backtrace() { | ||
crash_collecting_backtrace = true; | ||
} | ||
|
||
std::function<void ()> | ||
reactor::test::get_stall_detector_report_function() { | ||
return engine()._cpu_stall_detector->get_config().report; | ||
|
@@ -3930,6 +3956,9 @@ void install_oneshot_signal_handler() { | |
|
||
struct sigaction sa; | ||
sa.sa_sigaction = [](int sig, siginfo_t *info, void *p) { | ||
if (sig == SIGSEGV && in_stall_detector) { | ||
siglongjmp(stall_detector_env, 1); | ||
} | ||
std::lock_guard<util::spinlock> g(lock); | ||
if (!handled) { | ||
handled = true; | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -175,6 +175,11 @@ SEASTAR_THREAD_TEST_CASE(spin_in_kernel) { | |
test_spin_with_body("kernel", [] { mmap_populate(128 * 1024); }); | ||
} | ||
|
||
SEASTAR_THREAD_TEST_CASE(crash_collecting_backtrace) { | ||
reactor::test::set_stall_detector_crash_collecting_backtrace(); | ||
engine().update_blocked_reactor_notify_ms(100ms); | ||
spin(500ms); | ||
} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Did you also reproduce the crash during unwinding? It's not given that siglongjmp is a safe way to unwind. If the unwinder takes a lock, it will leak it (though I'm guessing it doesn't). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
In labs -- unfortunately, no :(
Yes, sure, at this point the situation is already screwed up, and it's questionable whether these tricks are making things even worse or not There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Perhaps we can override __cxa_throw and whatever function it uses to exist unwinding (but maybe there isn't one), and call them via RTLD_NEXT. Then we can set flags when unwinding is in progress, and just avoid going into the stall detector again (or perhaps: ask the stall detector to run on the exit path of __cxa_throw). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think it will work. Also, tracing exception throwers is important. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There isn't one. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe have a blacklist of functions that are known to crash. Every time we see a crash, add the triggering function to the blacklist. In a few short years we'll have a robust filter. |
||
|
||
#else | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1