Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

send Honeybadger alert on nonzero script exit #78

Merged
merged 5 commits into from
Feb 1, 2025

Conversation

jmartin-sul
Copy link
Member

@jmartin-sul jmartin-sul commented Jan 30, 2025

closes #61
connects with #62

This is intended to catch total failures of the speech_to_text.py script that might otherwise go uncaught by its exception handling, and unreported to Honeybadger (e.g. segfaults or other memory errors).

Besides the unit tests, I deployed two versions of this when I originally wrote it in December: the branch in this PR, and this other branch: https://github.com/sul-dlss/speech-to-text/compare/honeybadger-alert-on-nonzero-script-exit_on-old-ubuntu-base-that-segfaults?expand=1

Using each branch, I ran speech-to-text workflow on the druid mentioned in #62 as reliably causing the segfault. The branch in this PR, already based on a commit that fixed #62 a few days earlier, successfully ran on the druid without error (as expected). The other branch (honeybadger-alert-on-nonzero-script-exit_on-old-ubuntu-base-that-segfaults), still based on a Docker image that had reliably segfaulted on that druid before #68, did indeed segfault, and the error was reported to Honeybadger.

You can see the HB alert from it catching the segfault here: https://app.honeybadger.io/projects/129304/faults/115578169

Command '['python3.11', 'speech_to_text.py']' died with <Signals.SIGSEGV: 11>.

More useful info will probably be in the STDOUT and STDERR of the crashed container's logs, which should be available from cloudwatch. But an HB alert will also send us a heads up via our most heavily monitored error reporting medium.

Note if you compare the two test branches that error_reporting_wrapper.py is the same in each. The only differences are the base commit and the addition of unit tests in this branch.

Still, since we've significantly changed the way the speech-to-text container is run, it'd be good to re-test this. I'm inclined to:

  • deploy this PR's branch and run it with xx631hs6770, which reliably crashed the container before Use a nvidia/cuda base image #68. the expectation is that it processes the druid without error.
  • create a test branch based on this PR's branch, but using a Dockerfile more like the problematic one in the other test branch, from before Use a nvidia/cuda base image #68, based on plain Ubuntu, instead of Nvidia's CUDA variant. run xx631hs6770 through. the expectation is that the container segfaults, but reports an error to HB.

@jmartin-sul jmartin-sul force-pushed the honeybadger-alert-on-nonzero-script-exit branch from 98b8265 to cf449c3 Compare January 30, 2025 19:11
@peetucket
Copy link
Member

Just curious how this connects to the speech_to_text.py file? Like how does it know to send a HB alert if an exception occurs in that file?

@jmartin-sul jmartin-sul force-pushed the honeybadger-alert-on-nonzero-script-exit branch from cf449c3 to 6d27f28 Compare January 31, 2025 19:17
@jmartin-sul
Copy link
Member Author

Just curious how this connects to the speech_to_text.py file? Like how does it know to send a HB alert if an exception occurs in that file?

i basically added a wrapper script, that sends an alert for any non-zero exit by the script it's running. so this just inserts the wrapper script so that the wrapper is what the docker container uses to invoke speech_to_text.py, so that an alert is sent if speech_to_text.py exits with an error.

this is the line that does that: https://github.com/sul-dlss/speech-to-text/pull/78/files#diff-dd2c0eb6ea5cfc6c4bd4eac30934e2d5746747af48fef6da689e85b752f39557R23

the invocation will change a bit when it's rebased on the WIP for dep mgmt via uv: #81

Copy link
Contributor

@edsu edsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving this based on results of testing in AWS!

@jmartin-sul
Copy link
Member Author

jmartin-sul commented Feb 1, 2025

since this passed testing, and since @edsu approved, i'm going to go ahead and merge.

i will note, i added a commit since ed reviewed -- the last commit of this PR sends a deployment notification to honeybadger. i added this because i wanted some extra confirmation that the test i was running was using the code i expected -- i couldn't find an easy way in the AWS console to match a particular batch job run with the source code used to build the container that ran it. plus, deploy notifications in HB are something we try to have in all our projects, and something i'd have ticketed if it wasn't easy and useful to implement in the course of testing this.

some screenshots showing my test results...

failed job entry in the aws batch history for the run on the #83 (expected to segfault) test container:
Screenshot 2025-01-31 at 9 49 50 PM

container logs for the failed job, showing the druid being run at the time of failure:
Screenshot 2025-01-31 at 9 50 53 PM

deployment notifications in HB:
Screenshot 2025-01-31 at 9 51 56 PM

@@ -1,3 +1,6 @@
[run]
cov-report = term,html
omit = tests/*
concurrency = multiprocessing
parallel = true
sigterm = true
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggested by the docs: https://pytest-cov.readthedocs.io/en/latest/subprocess-support.html

but the tests still didn't seem to detect that error_reporting_wrapper.py's __main__ code did in fact get invoked.


if __name__ == "__main__":
configure()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this complaint is spurious, and that the tests do actually cover this?

@jmartin-sul jmartin-sul changed the title [HOLD] send Honeybadger alert on nonzero script exit send Honeybadger alert on nonzero script exit Feb 1, 2025
@jmartin-sul jmartin-sul merged commit f74d9c2 into main Feb 1, 2025
1 of 3 checks passed
@jmartin-sul jmartin-sul deleted the honeybadger-alert-on-nonzero-script-exit branch February 1, 2025 06:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants