Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test: Add panic checker #14346

Merged
merged 5 commits into from
Nov 19, 2024
Merged

Test: Add panic checker #14346

merged 5 commits into from
Nov 19, 2024

Conversation

markylaing
Copy link
Contributor

A suspected panic occurred in #14324. This happened twice but is intermittent and does not occur locally. The test suite has now passed for that PR 6/8 times.

To surface panics more cleanly when investigating the issue, I added a python script to search LXD logs for a panic. Panics are logged at info level because the net/http package has a default recover() when a handler is run.

The script is run after each test suite to ensure the test fails if it panics. The script is only run when LXD_VERBOSE or LXD_DEBUG are set however. This is because the panic log entry is at info level, and is not written if --verbose or --debug are unset.

@hamistao
Copy link
Contributor

Panics are logged at info level because the net/http package has a default recover() when a handler is run.

Just a question for my own understanding: So that means we can't look for panics that happen outside request handlers?

Even if this is the case, I think there are very few cases where we would get a panic outside a request handler on the tests so this should be fine.

test/main.sh Outdated
@@ -211,6 +211,11 @@ run_test() {
fi
fi

# Run the panic checker after every test. This ensures that the test fails if any panics occur (as long as DEBUG is set).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So now setting LXD_VERBOSE or LXD_DEBUG can change the outcome of the tests, I don't think this is ideal but I also can't come up with an alternative.

Would you mind chipping in on this? @simondeziel

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I think it's possible to implement a recovery middleware for mux that can catch the panic before the net/http library does. This would allow us to log panics at error level. I didn't look into this in too much detail as I'm mainly adding this for #14324. The CI tests are all running with LXD_VERBOSE.

lastline = ""
continue

if re.search("(INFO|DEBUG|TRACE|WARNING|ERROR)", line):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be safe, maybe also check here if we reached EOF in case the panic is the last log message in the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is just when the for line in file loop exits. I'll add a check if there is any output at that point and print it.

@markylaing markylaing force-pushed the panic-checker branch 2 times, most recently from 19deef3 to c0a24e6 Compare October 25, 2024 12:35
Copy link
Member

@simondeziel simondeziel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice idea!

test/main.sh Outdated
cd "${cwd}"

# Run the panic checker after every test. This ensures that the test fails if any panics occur (as long as DEBUG is set).
# It is possible for a suite to succeed with panics in cases where a command is expected to fail (e.g. `! <cmd> || false`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, such a nice idea!

# If we didn't find a panic, exit 0.
exit(0)


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't help but test my feedback on a local copy. Since it's a small script I figured it was quicker to just provide it here:

#!/usr/bin/env python3

import sys
import re

# Looks for a panic by grepping for stacktraces in a log file.
# If a panic is found, print the last log message plus the stacktrace, exit 1
# If no panics are found, exit 0
#
# Only the first panic is returned.
#
# NOTE: When a panic occurs in LXD at runtime via a mux handler, it is logged
# at info level because the net/http library has a built-in recover. We are not
# handling panic recovery manually. Because it is logged at info level, this
# checker will only find panics if the test suite is run with LXD_VERBOSE=1 or
# LXD_DEBUG=1.
with open(sys.argv[1]) as file:
    found = False
    lastline = ""
    stacktrace_regex = re.compile(r'^goroutine\s+\d+\s+\[running\]:')
    standard_log_regex = re.compile(r'(INFO|DEBUG|TRACE|WARNING|ERROR)')

    for line in file:
        if not found and not stacktrace_regex.search(line):
            # Nothing found yet but lets retain the last log line
            lastline = line
            continue

        # Stacktrace detected, print the last log line that preceeded it
        if not found:
            sys.stderr.write(lastline)
            found = True

        # The first standard log message indicates the end of the stacktrace
        if standard_log_regex.search(line):
            break

        # Print the line as it is part of the stacktrace
        sys.stderr.write(line)

    if found:
        # Panic found, failure
        exit(1)

# No panic found, success
exit(0)

It's a bit of a rewrite (sorry) but I hope it reads better and is a tad simpler?

Some of the changes:

  • Compile the regexps once, outside the loop
  • Use r"" string for regexps
  • Don't accumulate output, just print it as it comes
  • Use stderr instead of stdout

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very neat improvements!

Also helps with the case where the panic log message is the last on the file.

@tomponline
Copy link
Member

@markylaing can this be closed?

@markylaing
Copy link
Contributor Author

@markylaing can this be closed?

This can still be useful in cases where a test passes but LXD actually panicked e.g. ! lxc profile apply foo bar || false. However I think these cases are likely to be rare. Any panics should show up in the logs in will most likely cause test failures.

Copy link
Member

@simondeziel simondeziel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the idea and think we should integrate it.

To save time, I provided a slight rewrite in a comment some time ago, have you had the chance to look at it @markylaing?

@markylaing
Copy link
Contributor Author

@simondeziel thanks. I've just updated it with your suggestions :)

simondeziel
simondeziel previously approved these changes Nov 18, 2024
test/deps/panic-checker Outdated Show resolved Hide resolved
simondeziel
simondeziel previously approved these changes Nov 18, 2024
Copy link
Member

@simondeziel simondeziel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

tomponline
tomponline previously approved these changes Nov 18, 2024
Copy link
Member

@tomponline tomponline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ta!

@tomponline
Copy link
Member

@markylaing @simondeziel looking at recently merged PRs, the btrfs cluster test took about 11.5minutes, but is taking 12minutes on this PR. Could this be a slow down caused by parsing the logs for every test?

@simondeziel
Copy link
Member

@markylaing @simondeziel looking at recently merged PRs, the btrfs cluster test took about 11.5minutes, but is taking 12minutes on this PR. Could this be a slow down caused by parsing the logs for every test?

Feels withing noise margin but let's time it maybe?

test/main.sh Outdated
# Run the panic checker after every test. This ensures that the test fails if any panics occur (as long as DEBUG is set).
# It is possible for a suite to succeed with panics in cases where a command is expected to fail (e.g. `! <cmd> || false`)
# but panics are never acceptable at runtime.
panic_checker "${TEST_DIR}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TEST_DIR is a directory but the script expects a file, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

panic_checker is a test util which lists all LXD daemon directories and performs the check on all of them. There are definitely some things to improve here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duh, yeah, I missed/forgot about the wrapper script around the python tool

@simondeziel
Copy link
Member

@markylaing since the script runs on every test, would it make sense to clean the log (> .../lxd.log) when no panic is detected? This would avoid scanning a log file that keeps growing.

@markylaing markylaing marked this pull request as draft November 18, 2024 17:26
@markylaing
Copy link
Contributor Author

@markylaing since the script runs on every test, would it make sense to clean the log (> .../lxd.log) when no panic is detected? This would avoid scanning a log file that keeps growing.

I'm not sure it's a good idea to truncate the log files, we might want them for something else.

I think to avoid wasting CI minutes here I've got a couple of improvements:

  1. Don't scan after every test. This will mean that we won't know exactly which test is broken but it should be obvious where the problem is because the whole point of this is to get a stacktrace.
  2. Scan log files when tearing down the daemon. This makes sense for the clustering tests because many daemons are set up and removed within the suite itself, so scanning after the test has completed would miss those. We'll need to be careful though because the clean up is skipped in GHA.

@simondeziel
Copy link
Member

I think to avoid wasting CI minutes here I've got a couple of improvements:

1. Don't scan after every test. This will mean that we won't know exactly which test is broken but it should be obvious where the problem is because the whole point of this is to get a stacktrace.

Even better, yeah!

2. Scan log files when tearing down the daemon. This makes sense for the clustering tests because many daemons are set up and removed within the suite itself, so scanning after the test has completed would miss those. We'll need to be careful though because the clean up is skipped in GHA.

Only bits of the cleanup() are skipped when GHA is detected so it should be feasible to do the log scanning there.

@markylaing
Copy link
Contributor Author

@tomponline @simondeziel I've updated this to only scan logs on a call to kill_lxd, or on cleanup before +e is set. This should mean each daemons logs, including those set up and torn down within a suite, are checked for panics and each log file will be checked once.

@markylaing markylaing marked this pull request as ready for review November 19, 2024 10:03
tomponline
tomponline previously approved these changes Nov 19, 2024
simondeziel
simondeziel previously approved these changes Nov 19, 2024
Copy link
Member

@simondeziel simondeziel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markylaing nicely done and it takes roughly just ~1s:

2024-11-19T10:22:16.3251557Z + cleanup
2024-11-19T10:22:16.3252057Z + panic_checker /home/runner/work/lxd/lxd/test/tmp.oFt
2024-11-19T10:22:16.3252757Z + '[' -z --verbose ']'
2024-11-19T10:22:16.3253207Z + local test_dir daemon_dir
2024-11-19T10:22:16.3254966Z + test_dir=/home/runner/work/lxd/lxd/test/tmp.oFt
2024-11-19T10:22:16.3255606Z + sleep 1
2024-11-19T10:22:17.3269614Z + read -r daemon_dir
2024-11-19T10:22:17.3270858Z + deps/panic-checker /home/runner/work/lxd/lxd/test/tmp.oFt/I5n/lxd.log
2024-11-19T10:22:17.3461700Z + read -r daemon_dir
2024-11-19T10:22:17.3462304Z + set +ex

My only question is why that sleep 1?

@markylaing
Copy link
Contributor Author

@markylaing nicely done and it takes roughly just ~1s:

2024-11-19T10:22:16.3251557Z + cleanup
2024-11-19T10:22:16.3252057Z + panic_checker /home/runner/work/lxd/lxd/test/tmp.oFt
2024-11-19T10:22:16.3252757Z + '[' -z --verbose ']'
2024-11-19T10:22:16.3253207Z + local test_dir daemon_dir
2024-11-19T10:22:16.3254966Z + test_dir=/home/runner/work/lxd/lxd/test/tmp.oFt
2024-11-19T10:22:16.3255606Z + sleep 1
2024-11-19T10:22:17.3269614Z + read -r daemon_dir
2024-11-19T10:22:17.3270858Z + deps/panic-checker /home/runner/work/lxd/lxd/test/tmp.oFt/I5n/lxd.log
2024-11-19T10:22:17.3461700Z + read -r daemon_dir
2024-11-19T10:22:17.3462304Z + set +ex

My only question is why that sleep 1?

🤔 I have no idea 🤣 I'll remove it.

This runs the panic checker against all currently running LXD daemons.

Signed-off-by: Mark Laing <[email protected]>
This commit reverts any changes made to the current directory in
any test suites.

Signed-off-by: Mark Laing <[email protected]>
@markylaing
Copy link
Contributor Author

@simondeziel @tomponline I've removed the superfluous sleep (sorry I can't remember why I added that). So this should be ready to go.

Copy link
Member

@simondeziel simondeziel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the performance aspect has been dealt with ;)

2024-11-19T15:23:40.2896690Z + cleanup
2024-11-19T15:23:40.2897255Z + panic_checker /home/runner/work/lxd/lxd/test/tmp.gkI
2024-11-19T15:23:40.2898009Z + '[' -z --verbose ']'
2024-11-19T15:23:40.2898482Z + local test_dir daemon_dir
2024-11-19T15:23:40.2899039Z + test_dir=/home/runner/work/lxd/lxd/test/tmp.gkI
2024-11-19T15:23:40.2899718Z + read -r daemon_dir
2024-11-19T15:23:40.2900520Z + deps/panic-checker /home/runner/work/lxd/lxd/test/tmp.gkI/kxh/lxd.log
2024-11-19T15:23:40.3207519Z + read -r daemon_dir
2024-11-19T15:23:40.3208585Z + deps/panic-checker /home/runner/work/lxd/lxd/test/tmp.gkI/q4o/lxd.log
2024-11-19T15:23:40.3397961Z + read -r daemon_dir
2024-11-19T15:23:40.3399031Z + deps/panic-checker /home/runner/work/lxd/lxd/test/tmp.gkI/J19/lxd.log
2024-11-19T15:23:40.3587744Z + read -r daemon_dir
2024-11-19T15:23:40.3588228Z + set +ex

I like "free" improvements like that! Thanks

@tomponline tomponline merged commit f1922a0 into canonical:main Nov 19, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants