-
Notifications
You must be signed in to change notification settings - Fork 932
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test: Add panic checker #14346
Test: Add panic checker #14346
Conversation
Just a question for my own understanding: So that means we can't look for panics that happen outside request handlers? Even if this is the case, I think there are very few cases where we would get a panic outside a request handler on the tests so this should be fine. |
test/main.sh
Outdated
@@ -211,6 +211,11 @@ run_test() { | |||
fi | |||
fi | |||
|
|||
# Run the panic checker after every test. This ensures that the test fails if any panics occur (as long as DEBUG is set). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So now setting LXD_VERBOSE
or LXD_DEBUG
can change the outcome of the tests, I don't think this is ideal but I also can't come up with an alternative.
Would you mind chipping in on this? @simondeziel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I think it's possible to implement a recovery middleware for mux that can catch the panic before the net/http library does. This would allow us to log panics at error level. I didn't look into this in too much detail as I'm mainly adding this for #14324. The CI tests are all running with LXD_VERBOSE
.
test/deps/panic-checker
Outdated
lastline = "" | ||
continue | ||
|
||
if re.search("(INFO|DEBUG|TRACE|WARNING|ERROR)", line): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be safe, maybe also check here if we reached EOF in case the panic is the last log message in the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is just when the for line in file
loop exits. I'll add a check if there is any output at that point and print it.
19deef3
to
c0a24e6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice idea!
test/main.sh
Outdated
cd "${cwd}" | ||
|
||
# Run the panic checker after every test. This ensures that the test fails if any panics occur (as long as DEBUG is set). | ||
# It is possible for a suite to succeed with panics in cases where a command is expected to fail (e.g. `! <cmd> || false`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, such a nice idea!
# If we didn't find a panic, exit 0. | ||
exit(0) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't help but test my feedback on a local copy. Since it's a small script I figured it was quicker to just provide it here:
#!/usr/bin/env python3
import sys
import re
# Looks for a panic by grepping for stacktraces in a log file.
# If a panic is found, print the last log message plus the stacktrace, exit 1
# If no panics are found, exit 0
#
# Only the first panic is returned.
#
# NOTE: When a panic occurs in LXD at runtime via a mux handler, it is logged
# at info level because the net/http library has a built-in recover. We are not
# handling panic recovery manually. Because it is logged at info level, this
# checker will only find panics if the test suite is run with LXD_VERBOSE=1 or
# LXD_DEBUG=1.
with open(sys.argv[1]) as file:
found = False
lastline = ""
stacktrace_regex = re.compile(r'^goroutine\s+\d+\s+\[running\]:')
standard_log_regex = re.compile(r'(INFO|DEBUG|TRACE|WARNING|ERROR)')
for line in file:
if not found and not stacktrace_regex.search(line):
# Nothing found yet but lets retain the last log line
lastline = line
continue
# Stacktrace detected, print the last log line that preceeded it
if not found:
sys.stderr.write(lastline)
found = True
# The first standard log message indicates the end of the stacktrace
if standard_log_regex.search(line):
break
# Print the line as it is part of the stacktrace
sys.stderr.write(line)
if found:
# Panic found, failure
exit(1)
# No panic found, success
exit(0)
It's a bit of a rewrite (sorry) but I hope it reads better and is a tad simpler?
Some of the changes:
- Compile the regexps once, outside the loop
- Use
r""
string for regexps - Don't accumulate output, just print it as it comes
- Use
stderr
instead ofstdout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very neat improvements!
Also helps with the case where the panic log message is the last on the file.
@markylaing can this be closed? |
This can still be useful in cases where a test passes but LXD actually panicked e.g. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the idea and think we should integrate it.
To save time, I provided a slight rewrite in a comment some time ago, have you had the chance to look at it @markylaing?
c0a24e6
to
dfba0a8
Compare
@simondeziel thanks. I've just updated it with your suggestions :) |
Signed-off-by: Mark Laing <[email protected]>
dfba0a8
to
e2f94b6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ta!
@markylaing @simondeziel looking at recently merged PRs, the btrfs cluster test took about 11.5minutes, but is taking 12minutes on this PR. Could this be a slow down caused by parsing the logs for every test? |
Feels withing noise margin but let's time it maybe? |
test/main.sh
Outdated
# Run the panic checker after every test. This ensures that the test fails if any panics occur (as long as DEBUG is set). | ||
# It is possible for a suite to succeed with panics in cases where a command is expected to fail (e.g. `! <cmd> || false`) | ||
# but panics are never acceptable at runtime. | ||
panic_checker "${TEST_DIR}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TEST_DIR is a directory but the script expects a file, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
panic_checker
is a test util which lists all LXD daemon directories and performs the check on all of them. There are definitely some things to improve here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duh, yeah, I missed/forgot about the wrapper script around the python tool
@markylaing since the script runs on every test, would it make sense to clean the log ( |
I'm not sure it's a good idea to truncate the log files, we might want them for something else. I think to avoid wasting CI minutes here I've got a couple of improvements:
|
Even better, yeah!
Only bits of the |
244e90f
e2f94b6
to
244e90f
Compare
@tomponline @simondeziel I've updated this to only scan logs on a call to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@markylaing nicely done and it takes roughly just ~1s:
2024-11-19T10:22:16.3251557Z + cleanup
2024-11-19T10:22:16.3252057Z + panic_checker /home/runner/work/lxd/lxd/test/tmp.oFt
2024-11-19T10:22:16.3252757Z + '[' -z --verbose ']'
2024-11-19T10:22:16.3253207Z + local test_dir daemon_dir
2024-11-19T10:22:16.3254966Z + test_dir=/home/runner/work/lxd/lxd/test/tmp.oFt
2024-11-19T10:22:16.3255606Z + sleep 1
2024-11-19T10:22:17.3269614Z + read -r daemon_dir
2024-11-19T10:22:17.3270858Z + deps/panic-checker /home/runner/work/lxd/lxd/test/tmp.oFt/I5n/lxd.log
2024-11-19T10:22:17.3461700Z + read -r daemon_dir
2024-11-19T10:22:17.3462304Z + set +ex
My only question is why that sleep 1
?
🤔 I have no idea 🤣 I'll remove it. |
This runs the panic checker against all currently running LXD daemons. Signed-off-by: Mark Laing <[email protected]>
This commit reverts any changes made to the current directory in any test suites. Signed-off-by: Mark Laing <[email protected]>
Signed-off-by: Mark Laing <[email protected]>
Signed-off-by: Mark Laing <[email protected]>
9465252
244e90f
to
9465252
Compare
@simondeziel @tomponline I've removed the superfluous sleep (sorry I can't remember why I added that). So this should be ready to go. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the performance aspect has been dealt with ;)
2024-11-19T15:23:40.2896690Z + cleanup
2024-11-19T15:23:40.2897255Z + panic_checker /home/runner/work/lxd/lxd/test/tmp.gkI
2024-11-19T15:23:40.2898009Z + '[' -z --verbose ']'
2024-11-19T15:23:40.2898482Z + local test_dir daemon_dir
2024-11-19T15:23:40.2899039Z + test_dir=/home/runner/work/lxd/lxd/test/tmp.gkI
2024-11-19T15:23:40.2899718Z + read -r daemon_dir
2024-11-19T15:23:40.2900520Z + deps/panic-checker /home/runner/work/lxd/lxd/test/tmp.gkI/kxh/lxd.log
2024-11-19T15:23:40.3207519Z + read -r daemon_dir
2024-11-19T15:23:40.3208585Z + deps/panic-checker /home/runner/work/lxd/lxd/test/tmp.gkI/q4o/lxd.log
2024-11-19T15:23:40.3397961Z + read -r daemon_dir
2024-11-19T15:23:40.3399031Z + deps/panic-checker /home/runner/work/lxd/lxd/test/tmp.gkI/J19/lxd.log
2024-11-19T15:23:40.3587744Z + read -r daemon_dir
2024-11-19T15:23:40.3588228Z + set +ex
I like "free" improvements like that! Thanks
A suspected panic occurred in #14324. This happened twice but is intermittent and does not occur locally. The test suite has now passed for that PR 6/8 times.
To surface panics more cleanly when investigating the issue, I added a python script to search LXD logs for a panic. Panics are logged at info level because the net/http package has a default
recover()
when a handler is run.The script is run after each test suite to ensure the test fails if it panics. The script is only run when
LXD_VERBOSE
orLXD_DEBUG
are set however. This is because the panic log entry is at info level, and is not written if--verbose
or--debug
are unset.