send Honeybadger alert on nonzero script exit #78

jmartin-sul · 2025-01-30T08:52:32Z

closes #61
connects with #62

This is intended to catch total failures of the speech_to_text.py script that might otherwise go uncaught by its exception handling, and unreported to Honeybadger (e.g. segfaults or other memory errors).

Besides the unit tests, I deployed two versions of this when I originally wrote it in December: the branch in this PR, and this other branch: https://github.com/sul-dlss/speech-to-text/compare/honeybadger-alert-on-nonzero-script-exit_on-old-ubuntu-base-that-segfaults?expand=1

Using each branch, I ran speech-to-text workflow on the druid mentioned in #62 as reliably causing the segfault. The branch in this PR, already based on a commit that fixed #62 a few days earlier, successfully ran on the druid without error (as expected). The other branch (honeybadger-alert-on-nonzero-script-exit_on-old-ubuntu-base-that-segfaults), still based on a Docker image that had reliably segfaulted on that druid before #68, did indeed segfault, and the error was reported to Honeybadger.

You can see the HB alert from it catching the segfault here: https://app.honeybadger.io/projects/129304/faults/115578169

Command '['python3.11', 'speech_to_text.py']' died with <Signals.SIGSEGV: 11>.

More useful info will probably be in the STDOUT and STDERR of the crashed container's logs, which should be available from cloudwatch. But an HB alert will also send us a heads up via our most heavily monitored error reporting medium.

Note if you compare the two test branches that error_reporting_wrapper.py is the same in each. The only differences are the base commit and the addition of unit tests in this branch.

Still, since we've significantly changed the way the speech-to-text container is run, it'd be good to re-test this. I'm inclined to:

deploy this PR's branch and run it with xx631hs6770, which reliably crashed the container before Use a nvidia/cuda base image #68. the expectation is that it processes the druid without error.
create a test branch based on this PR's branch, but using a Dockerfile more like the problematic one in the other test branch, from before Use a nvidia/cuda base image #68, based on plain Ubuntu, instead of Nvidia's CUDA variant. run xx631hs6770 through. the expectation is that the container segfaults, but reports an error to HB.
- see [DONOTMERGE] testing - looking for this to segfault when run in current AWS batch environment, and to report the error to honeybadger #83
- segfault generated and caught. see HB alert here.
- alert message: Command '['python3.11', 'speech_to_text.py', '--job', '{"id":"xx631hs6770-v14","druid":"druid:xx631hs6770","media":[{"name":"xx631hs6770-v14/bh106kg8488_01_sl_m4a.m4a"},...snip...,{"name":"xx631hs6770-v14/bh106kg8488_10_sl_m4a.m4a"},{"name":"xx631hs6770-v14/bh106kg8488_11_sl_m4a.m4a"}],"options":{"model":"large","word_timestamps":true,"condition_on_previous_text":false,"writer":{"max_line_width":42,"max_line_count":1}}}']' died with <Signals.SIGSEGV: 11>.

peetucket · 2025-01-30T21:50:38Z

Just curious how this connects to the speech_to_text.py file? Like how does it know to send a HB alert if an exception occurs in that file?

…f it exits with an error

…_to_text.py

jmartin-sul · 2025-01-31T19:21:20Z

Just curious how this connects to the speech_to_text.py file? Like how does it know to send a HB alert if an exception occurs in that file?

i basically added a wrapper script, that sends an alert for any non-zero exit by the script it's running. so this just inserts the wrapper script so that the wrapper is what the docker container uses to invoke speech_to_text.py, so that an alert is sent if speech_to_text.py exits with an error.

this is the line that does that: https://github.com/sul-dlss/speech-to-text/pull/78/files#diff-dd2c0eb6ea5cfc6c4bd4eac30934e2d5746747af48fef6da689e85b752f39557R23

the invocation will change a bit when it's rebased on the WIP for dep mgmt via uv: #81

see https://pytest-cov.readthedocs.io/en/latest/subprocess-support.html

edsu

Approving this based on results of testing in AWS!

jmartin-sul · 2025-02-01T05:58:19Z

since this passed testing, and since @edsu approved, i'm going to go ahead and merge.

i will note, i added a commit since ed reviewed -- the last commit of this PR sends a deployment notification to honeybadger. i added this because i wanted some extra confirmation that the test i was running was using the code i expected -- i couldn't find an easy way in the AWS console to match a particular batch job run with the source code used to build the container that ran it. plus, deploy notifications in HB are something we try to have in all our projects, and something i'd have ticketed if it wasn't easy and useful to implement in the course of testing this.

some screenshots showing my test results...

failed job entry in the aws batch history for the run on the #83 (expected to segfault) test container:

container logs for the failed job, showing the druid being run at the time of failure:

deployment notifications in HB:

jmartin-sul · 2025-02-01T06:00:51Z

.coveragerc

@@ -1,3 +1,6 @@
 [run]
 cov-report = term,html
 omit = tests/*
+concurrency = multiprocessing
+parallel = true
+sigterm = true


suggested by the docs: https://pytest-cov.readthedocs.io/en/latest/subprocess-support.html

but the tests still didn't seem to detect that error_reporting_wrapper.py's __main__ code did in fact get invoked.

jmartin-sul · 2025-02-01T06:01:43Z

error_reporting_wrapper.py

+
+if __name__ == "__main__":
+    configure()
+


i think this complaint is spurious, and that the tests do actually cover this?

jmartin-sul force-pushed the honeybadger-alert-on-nonzero-script-exit branch from 98b8265 to cf449c3 Compare January 30, 2025 19:11

jmartin-sul added 3 commits January 31, 2025 11:17

add a wrapper that invokes a command, and sends a honeybadger alert i…

867ee19

…f it exits with an error

docker image entrypoint: use error_reporting_wrapper.py to run speech…

6834321

…_to_text.py

tests for error_reporting_wrapper.py

6d27f28

jmartin-sul force-pushed the honeybadger-alert-on-nonzero-script-exit branch from cf449c3 to 6d27f28 Compare January 31, 2025 19:17

add subprocess support to code coverage

b70ee41

see https://pytest-cov.readthedocs.io/en/latest/subprocess-support.html

edsu approved these changes Jan 31, 2025

View reviewed changes

notify honeybadger upon deployment

9bf0d6a

jmartin-sul force-pushed the honeybadger-alert-on-nonzero-script-exit branch from 6b6a13b to 9bf0d6a Compare February 1, 2025 05:00

jmartin-sul mentioned this pull request Feb 1, 2025

[DONOTMERGE] testing - looking for this to segfault when run in current AWS batch environment, and to report the error to honeybadger #83

Closed

jmartin-sul commented Feb 1, 2025

View reviewed changes

jmartin-sul changed the title ~~[HOLD] send Honeybadger alert on nonzero script exit~~ send Honeybadger alert on nonzero script exit Feb 1, 2025

jmartin-sul merged commit f74d9c2 into main Feb 1, 2025
1 of 3 checks passed

jmartin-sul deleted the honeybadger-alert-on-nonzero-script-exit branch February 1, 2025 06:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

send Honeybadger alert on nonzero script exit #78

send Honeybadger alert on nonzero script exit #78

jmartin-sul commented Jan 30, 2025 •

edited

Loading

peetucket commented Jan 30, 2025

jmartin-sul commented Jan 31, 2025

edsu left a comment

jmartin-sul commented Feb 1, 2025 •

edited

Loading

jmartin-sul Feb 1, 2025

jmartin-sul Feb 1, 2025

send Honeybadger alert on nonzero script exit #78

send Honeybadger alert on nonzero script exit #78

Conversation

jmartin-sul commented Jan 30, 2025 • edited Loading

peetucket commented Jan 30, 2025

jmartin-sul commented Jan 31, 2025

edsu left a comment

Choose a reason for hiding this comment

jmartin-sul commented Feb 1, 2025 • edited Loading

jmartin-sul Feb 1, 2025

Choose a reason for hiding this comment

jmartin-sul Feb 1, 2025

Choose a reason for hiding this comment

jmartin-sul commented Jan 30, 2025 •

edited

Loading

jmartin-sul commented Feb 1, 2025 •

edited

Loading