Test: Add panic checker #14346

markylaing · 2024-10-25T10:27:11Z

A suspected panic occurred in #14324. This happened twice but is intermittent and does not occur locally. The test suite has now passed for that PR 6/8 times.

To surface panics more cleanly when investigating the issue, I added a Python script that searches LXD logs for a panic. Panics are logged at info level because the net/http package has a default recover() when a handler is run.

The script runs after each test suite so that the test fails if LXD panics. However, it only runs when LXD_VERBOSE or LXD_DEBUG is set, because the panic log entry is at info level and is not written unless --verbose or --debug is passed.
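The gating described above amounts to a check like the one below. This is only a sketch: `panic_check_enabled` is a hypothetical name, and treating any non-empty value as "set" is an assumption rather than something the test suite is known to do.

```python
import os


def panic_check_enabled(env=None):
    """Return True when LXD_VERBOSE or LXD_DEBUG holds a non-empty value.

    Assumption: an empty string counts as unset. `env` defaults to the
    process environment and can be replaced with a plain dict for testing.
    """
    if env is None:
        env = os.environ
    return bool(env.get("LXD_VERBOSE") or env.get("LXD_DEBUG"))
```

Without one of the two variables the panic entry never reaches the log, so a checker run in that mode would pass without having checked anything.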

hamistao · 2024-10-25T11:52:45Z

> Panics are logged at info level because the net/http package has a default recover() when a handler is run.

Just a question for my own understanding: So that means we can't look for panics that happen outside request handlers?

Even if this is the case, I think there are very few cases where we would get a panic outside a request handler in the tests, so this should be fine.

hamistao · 2024-10-25T11:54:53Z

test/main.sh

@@ -211,6 +211,11 @@ run_test() {
    fi
  fi

+  # Run the panic checker after every test. This ensures that the test fails if any panics occur (as long as DEBUG is set).


So now setting LXD_VERBOSE or LXD_DEBUG can change the outcome of the tests. I don't think this is ideal, but I also can't come up with an alternative.

Would you mind chipping in on this? @simondeziel

I agree. I think it's possible to implement a recovery middleware for mux that can catch the panic before the net/http library does. This would allow us to log panics at error level. I didn't look into this in too much detail as I'm mainly adding this for #14324. The CI tests are all running with LXD_VERBOSE.

hamistao · 2024-10-25T12:04:59Z

test/deps/panic-checker

+            lastline = ""
+            continue
+
+        if re.search("(INFO|DEBUG|TRACE|WARNING|ERROR)", line):


Just to be safe, maybe also check here if we reached EOF in case the panic is the last log message in the file.

I guess this is just when the for line in file loop exits. I'll add a check if there is any output at that point and print it.

simondeziel

Very nice idea!

simondeziel · 2024-10-25T16:44:05Z

test/main.sh

+  cd "${cwd}"
+
+  # Run the panic checker after every test. This ensures that the test fails if any panics occur (as long as DEBUG is set).
+  # It is possible for a suite to succeed with panics in cases where a command is expected to fail (e.g. `! <cmd> || false`)


Wow, such a nice idea!

simondeziel · 2024-10-25T16:58:36Z

test/deps/panic-checker

+# If we didn't find a panic, exit 0.
+exit(0)
+
+


I couldn't help but test my feedback on a local copy. Since it's a small script I figured it was quicker to just provide it here:

#!/usr/bin/env python3
import sys
import re

# Looks for a panic by grepping for stacktraces in a log file.
# If a panic is found, print the last log message plus the stacktrace, exit 1
# If no panics are found, exit 0
#
# Only the first panic is returned.
#
# NOTE: When a panic occurs in LXD at runtime via a mux handler, it is logged
# at info level because the net/http library has a built-in recover. We are not
# handling panic recovery manually. Because it is logged at info level, this
# checker will only find panics if the test suite is run with LXD_VERBOSE=1 or
# LXD_DEBUG=1.

with open(sys.argv[1]) as file:
    found = False
    lastline = ""
    stacktrace_regex = re.compile(r'^goroutine\s+\d+\s+\[running\]:')
    standard_log_regex = re.compile(r'(INFO|DEBUG|TRACE|WARNING|ERROR)')

    for line in file:
        if not found and not stacktrace_regex.search(line):
            # Nothing found yet but let's retain the last log line
            lastline = line
            continue

        # Stacktrace detected, print the last log line that preceded it
        if not found:
            sys.stderr.write(lastline)
            found = True

        # The first standard log message indicates the end of the stacktrace
        if standard_log_regex.search(line):
            break

        # Print the line as it is part of the stacktrace
        sys.stderr.write(line)

if found:
    # Panic found, failure
    exit(1)

# No panic found, success
exit(0)

It's a bit of a rewrite (sorry) but I hope it reads better and is a tad simpler?

Some of the changes:

- Compile the regexps once, outside the loop
- Use r"" strings for regexps
- Don't accumulate output, just print it as it comes
- Use stderr instead of stdout

Very neat improvements!

It also helps with the case where the panic log message is the last one in the file.
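The end-of-file case raised earlier can be checked with a function-style variant of the script. This is only a sketch: `find_panic` is a hypothetical name, and the two patterns are copied from the script suggested above. It returns the lines instead of printing them so that the behaviour is easy to test.

```python
import re

# Both patterns are taken from the script suggested in this thread.
STACKTRACE_RE = re.compile(r'^goroutine\s+\d+\s+\[running\]:')
STANDARD_LOG_RE = re.compile(r'(INFO|DEBUG|TRACE|WARNING|ERROR)')


def find_panic(lines):
    """Return the first panic (preceding log line plus stack trace), or []."""
    report = []
    lastline = ""
    found = False
    for line in lines:
        if not found:
            if not STACKTRACE_RE.search(line):
                # No stack trace yet: remember the most recent log line.
                lastline = line
                continue
            # A stack trace starts here: keep the log line that preceded it.
            found = True
            report.append(lastline)
        elif STANDARD_LOG_RE.search(line):
            # The next regular log message marks the end of the stack trace.
            break
        report.append(line)
    return report
```

Because lines are collected as they are read, a stack trace that runs to the end of the input is returned in full; no separate end-of-file check is needed.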

tomponline · 2024-11-05T09:59:38Z

@markylaing can this be closed?

markylaing · 2024-11-05T10:06:21Z

> @markylaing can this be closed?

This can still be useful in cases where a test passes but LXD actually panicked, e.g. ! lxc profile apply foo bar || false. However, I think these cases are likely to be rare. Any panics should show up in the logs and will most likely cause test failures.

simondeziel

I really like the idea and think we should integrate it.

To save time, I provided a slight rewrite in a comment some time ago, have you had the chance to look at it @markylaing?

markylaing · 2024-11-18T15:12:08Z

@simondeziel thanks. I've just updated it with your suggestions :)

simondeziel

LGTM, thanks

tomponline

Ta!

tomponline · 2024-11-18T16:30:39Z

@markylaing @simondeziel looking at recently merged PRs, the btrfs cluster test took about 11.5 minutes, but is taking 12 minutes on this PR. Could this be a slowdown caused by parsing the logs for every test?

simondeziel · 2024-11-18T16:32:27Z

> @markylaing @simondeziel looking at recently merged PRs, the btrfs cluster test took about 11.5 minutes, but is taking 12 minutes on this PR. Could this be a slowdown caused by parsing the logs for every test?

Feels within the noise margin, but let's time it maybe?
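One way to time it, sketched in Python under the assumption that the cost of interest is a single pass over one log file. `time_scan` is a hypothetical helper, and the pattern is the stack trace regex from the suggested script.

```python
import re
import time

STACKTRACE_RE = re.compile(r'^goroutine\s+\d+\s+\[running\]:')


def time_scan(log_path):
    """Scan log_path until the first stack trace or EOF.

    Returns (panic_found, elapsed_seconds).
    """
    start = time.perf_counter()
    with open(log_path) as log:
        found = any(STACKTRACE_RE.search(line) for line in log)
    return found, time.perf_counter() - start
```

Comparing this figure with the total suite duration would show whether the scan is large enough to matter or is lost in run-to-run variance.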

simondeziel · 2024-11-18T16:39:11Z

test/main.sh

+  # Run the panic checker after every test. This ensures that the test fails if any panics occur (as long as DEBUG is set).
+  # It is possible for a suite to succeed with panics in cases where a command is expected to fail (e.g. `! <cmd> || false`)
+  # but panics are never acceptable at runtime.
+  panic_checker "${TEST_DIR}"


TEST_DIR is a directory but the script expects a file, no?

panic_checker is a test util which lists all LXD daemon directories and performs the check on all of them. There are definitely some things to improve here.

Duh, yeah, I missed/forgot about the wrapper script around the python tool
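As a rough picture of what such a wrapper does, the sketch below walks a test directory and checks each daemon's lxd.log. The `<test_dir>/<daemon_dir>/lxd.log` layout matches the CI traces posted in this thread; the function names and the simplified `has_panic` check are hypothetical stand-ins for the real shell helper and deps/panic-checker.

```python
import re
from pathlib import Path

STACKTRACE_RE = re.compile(r'^goroutine\s+\d+\s+\[running\]:')


def has_panic(log_path):
    """Simplified stand-in for deps/panic-checker: True if a stack trace starts in the log."""
    with open(log_path) as log:
        return any(STACKTRACE_RE.search(line) for line in log)


def logs_with_panics(test_dir):
    """Return the lxd.log files (one per daemon directory) that contain a panic."""
    return [
        log
        for log in sorted(Path(test_dir).glob("*/lxd.log"))
        if has_panic(log)
    ]
```

The directory argument and the file argument therefore belong to two different layers: the wrapper takes the directory, and the per-file checker only ever sees individual log paths.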

simondeziel · 2024-11-18T16:40:23Z

@markylaing since the script runs on every test, would it make sense to clean the log (> .../lxd.log) when no panic is detected? This would avoid scanning a log file that keeps growing.

markylaing · 2024-11-18T17:31:52Z

> @markylaing since the script runs on every test, would it make sense to clean the log (> .../lxd.log) when no panic is detected? This would avoid scanning a log file that keeps growing.

I'm not sure it's a good idea to truncate the log files, we might want them for something else.

I think to avoid wasting CI minutes here I've got a couple of improvements:

1. Don't scan after every test. This will mean that we won't know exactly which test is broken but it should be obvious where the problem is because the whole point of this is to get a stacktrace.
2. Scan log files when tearing down the daemon. This makes sense for the clustering tests because many daemons are set up and removed within the suite itself, so scanning after the test has completed would miss those. We'll need to be careful though because the clean up is skipped in GHA.

simondeziel · 2024-11-18T17:44:20Z

> I think to avoid wasting CI minutes here I've got a couple of improvements:
>
> 1. Don't scan after every test. This will mean that we won't know exactly which test is broken but it should be obvious where the problem is because the whole point of this is to get a stacktrace.

Even better, yeah!

> 2. Scan log files when tearing down the daemon. This makes sense for the clustering tests because many daemons are set up and removed within the suite itself, so scanning after the test has completed would miss those. We'll need to be careful though because the clean up is skipped in GHA.

Only bits of the cleanup() are skipped when GHA is detected so it should be feasible to do the log scanning there.

markylaing · 2024-11-19T10:03:23Z

@tomponline @simondeziel I've updated this to only scan logs on a call to kill_lxd, or on cleanup before +e is set. This should mean that each daemon's logs, including those of daemons set up and torn down within a suite, are checked for panics, and that each log file is checked once.
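The "each log file is checked once" property can be pictured with a toy model. This is hypothetical Python, not the shell code from this PR: `SuiteModel` and its methods are illustrative names, and the assumption is that cleanup only needs to scan daemons that are still running because killed daemons were already scanned.

```python
class SuiteModel:
    """Toy model of a test run in which every daemon log is scanned once."""

    def __init__(self, check):
        self.check = check    # callable(log_path) -> bool, True if a panic is found
        self.running = []     # logs of daemons that are still running
        self.scanned = []     # logs scanned so far, in scan order

    def spawn(self, log_path):
        self.running.append(log_path)

    def _scan(self, log_path):
        self.scanned.append(log_path)
        if self.check(log_path):
            raise RuntimeError(f"panic found in {log_path}")

    def kill_lxd(self, log_path):
        # Scan when a daemon is torn down, including daemons stopped mid-suite.
        self.running.remove(log_path)
        self._scan(log_path)

    def cleanup(self):
        # Scan whatever is still running at the end of the run.
        while self.running:
            self._scan(self.running.pop(0))
```

In this model a daemon killed inside a suite leaves the running list before cleanup starts, so its log is never scanned a second time.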

simondeziel

@markylaing nicely done and it takes roughly just ~1s:

2024-11-19T10:22:16.3251557Z + cleanup
2024-11-19T10:22:16.3252057Z + panic_checker /home/runner/work/lxd/lxd/test/tmp.oFt
2024-11-19T10:22:16.3252757Z + '[' -z --verbose ']'
2024-11-19T10:22:16.3253207Z + local test_dir daemon_dir
2024-11-19T10:22:16.3254966Z + test_dir=/home/runner/work/lxd/lxd/test/tmp.oFt
2024-11-19T10:22:16.3255606Z + sleep 1
2024-11-19T10:22:17.3269614Z + read -r daemon_dir
2024-11-19T10:22:17.3270858Z + deps/panic-checker /home/runner/work/lxd/lxd/test/tmp.oFt/I5n/lxd.log
2024-11-19T10:22:17.3461700Z + read -r daemon_dir
2024-11-19T10:22:17.3462304Z + set +ex

My only question is why that sleep 1?

markylaing · 2024-11-19T14:28:03Z

> @markylaing nicely done and it takes roughly just ~1s:
>
> 2024-11-19T10:22:16.3251557Z + cleanup
> 2024-11-19T10:22:16.3252057Z + panic_checker /home/runner/work/lxd/lxd/test/tmp.oFt
> 2024-11-19T10:22:16.3252757Z + '[' -z --verbose ']'
> 2024-11-19T10:22:16.3253207Z + local test_dir daemon_dir
> 2024-11-19T10:22:16.3254966Z + test_dir=/home/runner/work/lxd/lxd/test/tmp.oFt
> 2024-11-19T10:22:16.3255606Z + sleep 1
> 2024-11-19T10:22:17.3269614Z + read -r daemon_dir
> 2024-11-19T10:22:17.3270858Z + deps/panic-checker /home/runner/work/lxd/lxd/test/tmp.oFt/I5n/lxd.log
> 2024-11-19T10:22:17.3461700Z + read -r daemon_dir
> 2024-11-19T10:22:17.3462304Z + set +ex
>
> My only question is why that sleep 1?

🤔 I have no idea 🤣 I'll remove it.

markylaing · 2024-11-19T15:00:04Z

@simondeziel @tomponline I've removed the superfluous sleep (sorry I can't remember why I added that). So this should be ready to go.

simondeziel

I think the performance aspect has been dealt with ;)

2024-11-19T15:23:40.2896690Z + cleanup
2024-11-19T15:23:40.2897255Z + panic_checker /home/runner/work/lxd/lxd/test/tmp.gkI
2024-11-19T15:23:40.2898009Z + '[' -z --verbose ']'
2024-11-19T15:23:40.2898482Z + local test_dir daemon_dir
2024-11-19T15:23:40.2899039Z + test_dir=/home/runner/work/lxd/lxd/test/tmp.gkI
2024-11-19T15:23:40.2899718Z + read -r daemon_dir
2024-11-19T15:23:40.2900520Z + deps/panic-checker /home/runner/work/lxd/lxd/test/tmp.gkI/kxh/lxd.log
2024-11-19T15:23:40.3207519Z + read -r daemon_dir
2024-11-19T15:23:40.3208585Z + deps/panic-checker /home/runner/work/lxd/lxd/test/tmp.gkI/q4o/lxd.log
2024-11-19T15:23:40.3397961Z + read -r daemon_dir
2024-11-19T15:23:40.3399031Z + deps/panic-checker /home/runner/work/lxd/lxd/test/tmp.gkI/J19/lxd.log
2024-11-19T15:23:40.3587744Z + read -r daemon_dir
2024-11-19T15:23:40.3588228Z + set +ex

I like "free" improvements like that! Thanks

markylaing self-assigned this Oct 25, 2024

markylaing mentioned this pull request Oct 25, 2024

Auth: Fix missing snapshots and backups from storage pool used-by URLs #14324

Merged

1 task

markylaing requested review from tomponline, simondeziel and hamistao October 25, 2024 10:57

hamistao reviewed Oct 25, 2024

markylaing force-pushed the panic-checker branch 2 times, most recently from 19deef3 to c0a24e6 Compare October 25, 2024 12:35

simondeziel reviewed Oct 25, 2024

tomponline requested a review from simondeziel November 18, 2024 11:26

simondeziel reviewed Nov 18, 2024

markylaing force-pushed the panic-checker branch from c0a24e6 to dfba0a8 Compare November 18, 2024 15:11

simondeziel previously approved these changes Nov 18, 2024

test/deps: Add python script to search for panics in LXD logs.

7e64afd

Signed-off-by: Mark Laing <[email protected]>

markylaing dismissed simondeziel’s stale review via e2f94b6 November 18, 2024 15:46

markylaing force-pushed the panic-checker branch from dfba0a8 to e2f94b6 Compare November 18, 2024 15:46

simondeziel previously approved these changes Nov 18, 2024

tomponline previously approved these changes Nov 18, 2024

simondeziel reviewed Nov 18, 2024

markylaing marked this pull request as draft November 18, 2024 17:26

markylaing dismissed stale reviews from tomponline and simondeziel via 244e90f November 19, 2024 09:58

markylaing force-pushed the panic-checker branch from e2f94b6 to 244e90f Compare November 19, 2024 09:58

markylaing marked this pull request as ready for review November 19, 2024 10:03

tomponline previously approved these changes Nov 19, 2024

simondeziel previously approved these changes Nov 19, 2024

markylaing added 4 commits November 19, 2024 14:29

test/includes: Add panic checker helper function.

e63cf22

This runs the panic checker against all currently running LXD daemons. Signed-off-by: Mark Laing <[email protected]>

test: All tests should be executed from TEST_DIR.

1e72b0e

This commit reverts any changes made to the current directory in any test suites. Signed-off-by: Mark Laing <[email protected]>

test/includes: Run the panic checker killing any LXD daemon.

484e2ae

Signed-off-by: Mark Laing <[email protected]>

test: Run the panic checker on cleanup before setting +e.

9465252

Signed-off-by: Mark Laing <[email protected]>

markylaing dismissed stale reviews from simondeziel and tomponline via 9465252 November 19, 2024 14:29

markylaing force-pushed the panic-checker branch from 244e90f to 9465252 Compare November 19, 2024 14:29

simondeziel approved these changes Nov 19, 2024

tomponline merged commit f1922a0 into canonical:main Nov 19, 2024
24 checks passed
