stall-detector: Try hard not to crash while collecting backtrace #2420

Open
wants to merge 1 commit into
base: master
from

xemul
Contributor

@xemul xemul commented Sep 4, 2024

Sometimes the stall-detector signal arrives in the middle of exception handling. If a stall is detected, stack unwinding starts in order to collect the stalled backtrace. Since exception handling also unwinds the stack, the two unwinders need to cooperate carefully, which is not guaranteed (spoiler: they don't cooperate carefully). In the unlucky case a segmentation fault happens and the app is killed with SEGV.

This patch lets the stall detector bail out if SEGV arrives while it is collecting the backtrace, and still emit a minimal yet sufficiently detailed stall report.
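As a rough illustration of the bail-out idea described above, here is a minimal standalone sketch. It is not the patch itself: `report_buffer`, `collect_backtrace`, `print_report`, `install_segv_handler`, and `collecting_backtrace` are hypothetical names, and `raise(SIGSEGV)` stands in for the unwinder crashing. The sketch also passes `1` as the second argument to `sigsetjmp`, so that the signal mask is restored after the jump; that is an assumption made for this standalone example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <setjmp.h>
#include <signal.h>

// Hypothetical names; only the sigsetjmp/siglongjmp idea is taken from the PR.
static sigjmp_buf bailout_env;
static volatile sig_atomic_t collecting_backtrace = 0;

// Fixed-size buffer so that appending never allocates.
struct report_buffer {
    char data[128] = {};
    std::size_t len = 0;
    void append(const char* s) noexcept {
        std::size_t n = std::strlen(s);
        if (len + n < sizeof(data)) {   // always leaves a trailing NUL
            std::memcpy(data + len, s, n);
            len += n;
        }
    }
};

static void on_segv(int sig) {
    if (collecting_backtrace) {
        collecting_backtrace = 0;
        siglongjmp(bailout_env, 1);     // abandon the unwinder, resume in print_report()
    }
    // A SEGV outside backtrace collection is a real bug: restore the default action.
    signal(sig, SIG_DFL);
    raise(sig);
}

void install_segv_handler() {
    struct sigaction sa = {};
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, nullptr);
}

// Stand-in for the real unwinder; `crash` simulates it tripping over a
// stack that exception handling is unwinding at the same time.
static void collect_backtrace(report_buffer& buf, bool crash) {
    buf.append(" frame#0");
    if (crash) {
        raise(SIGSEGV);
    }
    buf.append(" frame#1");
}

void print_report(report_buffer& buf, bool crash) noexcept {
    buf.append("stall detected:");
    if (sigsetjmp(bailout_env, 1)) {    // returns non-zero after siglongjmp
        buf.append(" <backtrace unavailable>");
        return;
    }
    collecting_backtrace = 1;
    collect_backtrace(buf, crash);
    collecting_backtrace = 0;
}
```

Calling `install_segv_handler()` once and then `print_report(buf, true)` leaves a truncated but usable report in `buf` instead of killing the process. The buffer is owned by the caller on purpose: objects local to the function that calls `sigsetjmp` and modified before the jump have indeterminate values afterwards.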

@xemul xemul requested review from avikivity and michoecho September 4, 2024 11:05
static void print_with_backtrace(backtrace_buffer& buf, bool oneline) noexcept {
    if (sigsetjmp(stall_detector_env, 0)) {
        buf.append(" ¯\\_(ツ)_/¯\n");
Member

+1

Signed-off-by: Pavel Emelyanov <[email protected]>
@xemul xemul force-pushed the br-avoid-segv-in-stall-detector branch from 6b368ce to ce84a03 Compare September 4, 2024 11:17
@michoecho
Contributor

Doesn't solve the problem entirely, since SIGSEGV isn't the only possible symptom (you could get an infinite loop for example, why not), but I guess it prevents a crash in the cases it's enough (which is probably a great majority of cases), and doesn't hurt in the others, so why not.

        goto out;
    }
    in_stall_detector = true;

Member

To be technically correct, we need an std::atomic_signal_fence(std::memory_order_relaxed). This prevents a magical compiler from delaying the write to memory because no one reads it.
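A small sketch of the fence suggestion, under stated assumptions: the flag is modelled here as a `std::atomic<bool>` (its real type is not visible in the snippet), `on_signal`, `handler_saw_flag`, and `flag_visible_to_handler` are hypothetical names, and `raise(SIGUSR1)` stands in for a signal arriving during backtrace collection. The sketch uses `memory_order_seq_cst` rather than a relaxed fence, because the C++ standard specifies that a fence with `memory_order_relaxed` has no effect.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>

// Hypothetical stand-ins for the flag in the patch and for what a handler observes.
static std::atomic<bool> in_stall_detector{false};
static std::atomic<bool> handler_saw_flag{false};

static void on_signal(int) {
    // Pairs with the fence in the interrupted code on the same thread.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    handler_saw_flag.store(in_stall_detector.load(std::memory_order_relaxed),
                           std::memory_order_relaxed);
}

bool flag_visible_to_handler() {
    std::signal(SIGUSR1, on_signal);

    in_stall_detector.store(true, std::memory_order_relaxed);
    // Compiler-only barrier: the store above may not be moved past this point,
    // so a handler interrupting the code below sees the flag set.
    std::atomic_signal_fence(std::memory_order_seq_cst);

    std::raise(SIGUSR1);   // simulated signal during "backtrace collection"

    std::atomic_signal_fence(std::memory_order_seq_cst);
    in_stall_detector.store(false, std::memory_order_relaxed);
    return handler_saw_flag.load(std::memory_order_relaxed);
}
```

`std::atomic_signal_fence` emits no CPU barrier instruction. It only constrains compiler reordering between a thread and a signal handler running on that same thread, which is the situation the comment describes.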

    reactor::test::set_stall_detector_crash_collecting_backtrace();
    engine().update_blocked_reactor_notify_ms(100ms);
    spin(500ms);
}
Member

Did you also reproduce the crash during unwinding? It's not given that siglongjmp is a safe way to unwind. If the unwinder takes a lock, it will leak it (though I'm guessing it doesn't).

Contributor Author

> Did you also reproduce the crash during unwinding?

In the lab, unfortunately, no :(

> It's not given that siglongjmp is a safe way to unwind.

Yes, sure. At this point the situation is already screwed up, and it's questionable whether these tricks make things even worse or not.

Member

Perhaps we can override __cxa_throw and whatever function it uses to exit unwinding (but maybe there isn't one), and call them via RTLD_NEXT. Then we can set a flag while unwinding is in progress, and just avoid going into the stall detector again (or perhaps: ask the stall detector to run on the exit path of __cxa_throw).

Member

I don't think it will work.

Also, tracing exception throwers is important.

Contributor

> Perhaps we can override __cxa_throw and whatever function it uses to exit unwinding (but maybe there isn't one)

There isn't one.

Member

Maybe have a blacklist of functions that are known to crash. Every time we see a crash, add the triggering function to the blacklist. In a few short years we'll have a robust filter.
