Fixes #5508.

Several of our Celery tasks have `retry_backoff=3600`, which activates Celery's exponential backoff feature for retries (see <https://docs.celeryq.dev/en/stable/userguide/tasks.html#Task.retry_backoff>). The intention was that:

1. The first retry of a task would be scheduled for `3600` secs (one hour) after the failure of the initial attempt
2. Following the rules of exponential backoff, the second retry would be scheduled for `2 * 3600` secs (two hours) after the failure of the first retry

When accounting for Celery's enabled-by-default `retry_jitter` setting (<https://docs.celeryq.dev/en/stable/userguide/tasks.html#Task.retry_jitter>), the delays should actually be randomly chosen times between `0` and `3600` secs and between `0` and `7200` secs.

These long delays were chosen to prevent a snowballing problem: if tasks start failing because a service is overloaded, you don't want to further load the service by quickly retrying those tasks. So instead wait an hour or two before retrying. Long delays were also chosen to maximize the chances of the retries succeeding. For example, if a task is failing because a service (such as Mailchimp) is down, then retrying the task right away is likely to fail again, but after an hour or two the service is likely to be back up.

Unfortunately, [all the retries seem to be scheduled within ten minutes, not hours](https://hypothes-is.slack.com/archives/C074BUPEG/p1687344232668929?thread_ts=1687325831.540209&cid=C074BUPEG). This turns out to be because of Celery's `retry_backoff_max` setting (<https://docs.celeryq.dev/en/stable/userguide/tasks.html#Task.retry_backoff_max>), which by default puts a cap of ten minutes on `retry_backoff`'s delays.

Fix this by increasing the `retry_backoff_max` setting to match the maximum intended delay of `7200` secs.
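The interaction between these settings can be sketched with a simplified, standalone reimplementation of Celery's backoff delay calculation (a sketch only — the `backoff_delay` helper below is illustrative, not part of Celery's API; the real logic lives inside Celery's retry machinery):

```python
import random

def backoff_delay(retry_backoff, retries, retry_backoff_max=600, retry_jitter=True):
    """Simplified sketch of Celery-style exponential backoff.

    `retries` is the number of retries already attempted
    (0 when scheduling the first retry). The default
    `retry_backoff_max` of 600 secs caps every delay at ten minutes.
    """
    countdown = min(retry_backoff_max, retry_backoff * (2 ** retries))
    if retry_jitter:
        # Jitter replaces the delay with a random value in [0, countdown].
        countdown = random.randrange(countdown + 1)
    return countdown

# With the default cap of 600 secs, both retries hit the ten-minute ceiling:
assert backoff_delay(3600, 0, retry_jitter=False) == 600
assert backoff_delay(3600, 1, retry_jitter=False) == 600

# Raising retry_backoff_max to 7200 restores the intended delays:
assert backoff_delay(3600, 0, retry_backoff_max=7200, retry_jitter=False) == 3600
assert backoff_delay(3600, 1, retry_backoff_max=7200, retry_jitter=False) == 7200
```

With jitter left enabled (the default), each delay becomes a random value between `0` and the computed cap, which is why the fix raises the cap rather than disabling jitter.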