send Honeybadger alert on nonzero script exit #78
Conversation
Force-pushed from 98b8265 to cf449c3
Just curious how this connects to the
Force-pushed from cf449c3 to 6d27f28
I basically added a wrapper script that sends an alert for any non-zero exit by the script it's running. This PR just inserts the wrapper so that the wrapper is what the Docker container uses to invoke the script. This is the line that does that: https://github.com/sul-dlss/speech-to-text/pull/78/files#diff-dd2c0eb6ea5cfc6c4bd4eac30934e2d5746747af48fef6da689e85b752f39557R23

The invocation will change a bit when it's rebased on the WIP for dependency management via uv: #81
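The wrapper pattern described above can be sketched roughly as follows. This is a hypothetical, self-contained sketch, not the project's actual error_reporting_wrapper.py: the real wrapper reports to Honeybadger, while this version just prints to stderr in place of the alerting call, and all names here are illustrative.

```python
#!/usr/bin/env python3
"""Sketch of an error-reporting wrapper: run the wrapped command, and
if it exits nonzero (including death by signal), report the failure
before propagating the exit code."""

import subprocess
import sys


def report_error(message: str) -> None:
    # Stand-in for a real alerting call (e.g. to Honeybadger);
    # printing keeps this sketch self-contained.
    print(f"ALERT: {message}", file=sys.stderr)


def run_wrapped(argv: list[str]) -> int:
    result = subprocess.run(argv)
    if result.returncode != 0:
        # A negative return code means the child died from a signal,
        # e.g. -11 for SIGSEGV.
        report_error(f"Command {argv!r} exited with code {result.returncode}")
    return result.returncode


if __name__ == "__main__":
    sys.exit(run_wrapped(sys.argv[1:]))
```

The key design point is that the wrapper always propagates the child's exit code, so the container's own success/failure semantics are unchanged; the alert is a side effect.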
Approving this based on results of testing in AWS!
Force-pushed from 6b6a13b to 9bf0d6a
Since this passed testing, and since @edsu approved, I'm going to go ahead and merge.

I will note that I added a commit since Ed reviewed: the last commit of this PR sends a deployment notification to Honeybadger. I added this because I wanted some extra confirmation that the test I was running was using the code I expected; I couldn't find an easy way in the AWS console to match a particular Batch job run with the source code used to build the container that ran it. Plus, deploy notifications in HB are something we try to have in all our projects, and something I'd have ticketed if it weren't easy and useful to implement in the course of testing this.

Some screenshots showing my test results:

Failed job entry in the AWS Batch history for the run on the #83 (expected to segfault) test container:

Container logs for the failed job, showing the druid being run at the time of failure:
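For reference, Honeybadger offers a deployment tracking endpoint for notifications like the one mentioned above. The URL and bracketed parameter names below follow my reading of the Honeybadger docs and should be double-checked against them; this sketch only builds the request URL rather than sending anything.

```python
from urllib.parse import urlencode

# Assumed endpoint for Honeybadger's deployment tracking API; verify
# against the current Honeybadger docs before relying on it.
HB_DEPLOYS_URL = "https://api.honeybadger.io/v1/deploys"


def deploy_notification_url(api_key: str, environment: str, revision: str) -> str:
    """Build the query-string URL announcing a deploy (not sent here)."""
    params = {
        "api_key": api_key,
        "deploy[environment]": environment,
        "deploy[revision]": revision,
    }
    return f"{HB_DEPLOYS_URL}?{urlencode(params)}"
```

Including the git revision in the notification is what makes it possible to match a running container back to the source code that built it, which was the gap in the AWS console noted above.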
@@ -1,3 +1,6 @@
[run]
cov-report = term,html
omit = tests/*
concurrency = multiprocessing
parallel = true
sigterm = true
Suggested by the docs: https://pytest-cov.readthedocs.io/en/latest/subprocess-support.html

But the tests still didn't seem to detect that error_reporting_wrapper.py's __main__ code did in fact get invoked.
if __name__ == "__main__":
    configure()
I think this complaint is spurious, and that the tests do actually cover this?
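One way to make in-process coverage observe a module's `__main__` guard, without relying on subprocess coverage support, is to execute the module with `runpy` from inside a test. Below is a hedged sketch using a hypothetical stand-in module written to a temp file, not the real error_reporting_wrapper.py.

```python
import pathlib
import runpy
import tempfile
import textwrap

# Hypothetical stand-in for a module with a __main__ guard, written to
# a temp file so the example is self-contained.
SOURCE = textwrap.dedent("""
    CONFIGURED = False

    def configure():
        global CONFIGURED
        CONFIGURED = True

    if __name__ == "__main__":
        configure()
""")


def run_module_as_main() -> dict:
    """Execute the module file with __name__ == '__main__' in-process."""
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / "wrapper_example.py"
        path.write_text(SOURCE)
        # runpy.run_path runs the file under the given run_name, so the
        # __main__ guard fires inside this same process, where an
        # in-process coverage tool can see it (unlike a subprocess run).
        return runpy.run_path(str(path), run_name="__main__")
```

Because the guarded block runs in the test process itself, coverage measurement doesn't depend on the multiprocessing/parallel/sigterm settings needed for subprocess support.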
closes #61
connects with #62
This is intended to catch total failures of the speech_to_text.py script that might otherwise go uncaught by its exception handling, and unreported to Honeybadger (e.g. segfaults or other memory errors).

Besides the unit tests, I deployed two versions of this when I originally wrote it in December: the branch in this PR, and this other branch: https://github.com/sul-dlss/speech-to-text/compare/honeybadger-alert-on-nonzero-script-exit_on-old-ubuntu-base-that-segfaults?expand=1
Using each branch, I ran the speech-to-text workflow on the druid mentioned in #62 as reliably causing the segfault. The branch in this PR, already based on a commit that fixed #62 a few days earlier, successfully ran on the druid without error (as expected). The other branch (honeybadger-alert-on-nonzero-script-exit_on-old-ubuntu-base-that-segfaults), still based on a Docker image that had reliably segfaulted on that druid before #68, did indeed segfault, and the error was reported to Honeybadger.

You can see the HB alert from it catching the segfault here: https://app.honeybadger.io/projects/129304/faults/115578169
More useful info will probably be in the STDOUT and STDERR of the crashed container's logs, which should be available from CloudWatch. But an HB alert will also send us a heads-up via our most heavily monitored error reporting medium.
Note if you compare the two test branches that error_reporting_wrapper.py is the same in each. The only differences are the base commit and the addition of unit tests in this branch.

Still, since we've significantly changed the way the speech-to-text container is run, it'd be good to re-test this. I'm inclined to: make the Dockerfile more like the problematic one in the other test branch from before "Use a nvidia/cuda base image" #68, based on plain Ubuntu instead of Nvidia's CUDA variant; then run xx631hs6770 through. The expectation is that the container segfaults, but reports an error to HB:

Command '['python3.11', 'speech_to_text.py', '--job', '{"id":"xx631hs6770-v14","druid":"druid:xx631hs6770","media":[{"name":"xx631hs6770-v14/bh106kg8488_01_sl_m4a.m4a"},...snip...,{"name":"xx631hs6770-v14/bh106kg8488_10_sl_m4a.m4a"},{"name":"xx631hs6770-v14/bh106kg8488_11_sl_m4a.m4a"}],"options":{"model":"large","word_timestamps":true,"condition_on_previous_text":false,"writer":{"max_line_width":42,"max_line_count":1}}}']' died with <Signals.SIGSEGV: 11>.