Treat OSErrors as exceptions, not payload failures #35461

gherceg · 2024-12-02T17:29:38Z

Product Description

Technical Summary

https://dimagi.atlassian.net/browse/SAAS-16323

Occasionally, stale Celery workers hit an OSError when they try to access code that no longer exists because the release they are running on has been cleaned up. See this stack trace for example:

Traceback (most recent call last):
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/motech/repeaters/tasks.py", line 192, in _process_repeat_record
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/motech/repeaters/models.py", line 1196, in fire
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/motech/repeaters/models.py", line 453, in fire_for_record
  File "/home/cchq/www/production/releases/2024-11-21_08.58/python_env-3.9/lib/python3.9/site-packages/memoized.py", line 20, in _memoized
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/motech/repeaters/expression/repeaters.py", line 93, in get_payload
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/motech/repeaters/expression/repeater_generators.py", line 19, in get_payload
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/motech/repeaters/expression/repeater_generators.py", line 71, in _generate_payload
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/apps/userreports/expressions/specs.py", line 453, in __call__
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/apps/userreports/expressions/specs.py", line 901, in __call__
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/apps/userreports/expressions/specs.py", line 850, in __call__
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/apps/userreports/expressions/extension_expressions.py", line 61, in __call__
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/apps/userreports/expressions/specs.py", line 605, in __call__
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/apps/userreports/expressions/specs.py", line 626, in get_value
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/apps/userreports/decorators.py", line 50, in _inner
  File "/usr/lib/python3.9/inspect.py", line 1024, in getsource
    lines, lnum = getsourcelines(object)
  File "/usr/lib/python3.9/inspect.py", line 1006, in getsourcelines
    lines, lnum = findsource(object)
  File "/usr/lib/python3.9/inspect.py", line 835, in findsource
    raise OSError('could not get source code')
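
The failure mode in the traceback can be reproduced in isolation. The sketch below is not CommCare HQ code: the module name `stale_release_module` and the helper `load_then_delete_source` are hypothetical, and a temporary directory stands in for a release directory. It imports a module, deletes its source from disk, and then calls `inspect.getsource`, which is the call that fails at the bottom of the traceback.

```python
import importlib.util
import inspect
import os
import shutil
import sys
import tempfile


def load_then_delete_source():
    """Import a module from a temp directory, then delete that directory.

    This mimics a long-running worker whose release directory has been
    cleaned up while the already-imported code is still held in memory.
    """
    release_dir = tempfile.mkdtemp()  # stand-in for a release directory
    path = os.path.join(release_dir, "stale_release_module.py")
    with open(path, "w") as f:
        f.write("def expression():\n    return 42\n")

    spec = importlib.util.spec_from_file_location("stale_release_module", path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[spec.name] = module
    spec.loader.exec_module(module)

    shutil.rmtree(release_dir)  # the "release" is cleaned up
    return module


module = load_then_delete_source()

# The already-loaded function still runs without its source file...
result = module.expression()

# ...but anything that needs the source text on disk now fails.
try:
    inspect.getsource(module.expression)
except OSError as err:
    source_error = err
else:
    source_error = None
```

The point of the sketch is that the error comes from the worker's environment rather than from the data being processed: the same call would succeed on a worker running from a release that still exists.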

Because this happened while fetching the payload, it was treated as a payload error and therefore not retried. However, this is an issue on our end. This isn't the perfect solution (ideally we wouldn't get into this state in the first place), but we should at least treat OSErrors as non-payload failures that will be retried, since the next attempt may succeed.
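
As a rough illustration of the distinction described above, here is a minimal sketch of the classification. It is not the code in this PR: `Outcome`, `fire_for_record`, `get_payload`, and `send` are hypothetical stand-ins, and the real repeater code records failures through its own models. The only idea carried over is that an `OSError` raised during payload generation is routed to the retryable path instead of being recorded as a payload failure.

```python
from enum import Enum


class Outcome(Enum):
    SENT = "sent"
    PAYLOAD_ERROR = "payload_error"  # problem with the data: not retried
    RETRY = "retry"                  # problem on our side: retried later


def fire_for_record(get_payload, send):
    """Build a payload and send it, classifying any failure.

    `get_payload` and `send` are hypothetical callables used only to
    keep this sketch self-contained.
    """
    try:
        payload = get_payload()
    except OSError:
        # Environment problem (for example, source files removed from
        # under a stale worker). Not the payload's fault, so retry.
        return Outcome.RETRY
    except Exception:
        # Anything else raised while building the payload is treated
        # as a payload failure and is not retried.
        return Outcome.PAYLOAD_ERROR
    send(payload)
    return Outcome.SENT


def _stale_worker_payload():
    raise OSError("could not get source code")


def _bad_payload():
    raise ValueError("unexpected value in payload expression")


sent = []
stale_outcome = fire_for_record(_stale_worker_payload, sent.append)
bad_outcome = fire_for_record(_bad_payload, sent.append)
ok_outcome = fire_for_record(lambda: {"case_id": "abc"}, sent.append)
```

The `except OSError` clause has to come before the broad `except Exception` clause, because `OSError` is a subclass of `Exception` and would otherwise be caught as a payload failure.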

Feature Flag

Safety Assurance

Safety story

Automated test coverage

QA Plan

No

Rollback instructions

This PR can be reverted after deploy with no further considerations

Labels & Review

Risk label is set correctly
The set of people pinged as reviewers is appropriate for the level of risk of the change

AmitPhulera

Looks good.
One question: this will notify us about the issues we faced with repeat records. Can this situation happen with other queues as well?

Treat OSErrors as exceptions, not payload failures

977baa1

gherceg marked this pull request as ready for review December 2, 2024 17:32

gherceg requested a review from kaapstorm as a code owner December 2, 2024 17:32

gherceg requested review from millerdev and AmitPhulera December 2, 2024 17:32

millerdev approved these changes Dec 2, 2024

kaapstorm approved these changes Dec 2, 2024

AmitPhulera approved these changes Dec 3, 2024

gherceg merged commit 140d56c into master Dec 3, 2024
13 checks passed

gherceg deleted the gh/repeat-records/handle-os-error branch December 3, 2024 19:27
