Fixes #5508.

Several of our Celery tasks have `retry_backoff=3600`, which activates Celery's exponential backoff feature for retries (see <https://docs.celeryq.dev/en/stable/userguide/tasks.html#Task.retry_backoff>). The intention was that:

1. The first retry of a task would be scheduled for `3600` secs (one hour) after the failure of the initial attempt
2. Following the rules of exponential backoff, the second retry would be scheduled for `2 * 3600` secs (two hours) after the failure of the first retry

When accounting for Celery's enabled-by-default `retry_jitter` setting (<https://docs.celeryq.dev/en/stable/userguide/tasks.html#Task.retry_jitter>), the delays should actually be randomly chosen times between `0` and `3600` secs and between `0` and `7200` secs.

These long delays were chosen to prevent a snowballing problem: if tasks start failing because a service is overloaded, you don't want to further load the service by quickly retrying those tasks. So instead wait an hour or two before retrying. Long delays were also chosen to maximize the chances of the retries succeeding. For example, if a task is failing because a service (such as Mailchimp) is down, then retrying the task right away is likely to fail again, but after an hour or two the service is likely to be back up.

Unfortunately, [all the retries seem to be scheduled within ten minutes, not hours](https://hypothes-is.slack.com/archives/C074BUPEG/p1687344232668929?thread_ts=1687325831.540209&cid=C074BUPEG). This turns out to be because of Celery's `retry_backoff_max` setting (<https://docs.celeryq.dev/en/stable/userguide/tasks.html#Task.retry_backoff_max>), which by default puts a cap of ten minutes on `retry_backoff`'s delays.

Fix this by increasing the `retry_backoff_max` setting to match the maximum intended delay of `7200` secs.
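The interaction between these settings can be sketched with a simplified, standalone reimplementation of Celery's backoff delay calculation (a sketch only — the `backoff_delay` helper below is illustrative, not part of Celery's API; the real logic lives inside Celery's retry machinery):

```python
import random

def backoff_delay(retry_backoff, retries, retry_backoff_max=600, retry_jitter=True):
    """Simplified sketch of Celery-style exponential backoff.

    `retries` is the number of retries already attempted
    (0 when scheduling the first retry). The default
    `retry_backoff_max` of 600 secs caps every delay at ten minutes.
    """
    countdown = min(retry_backoff_max, retry_backoff * (2 ** retries))
    if retry_jitter:
        # Jitter replaces the delay with a random value in [0, countdown].
        countdown = random.randrange(countdown + 1)
    return countdown

# With the default cap of 600 secs, both retries hit the ten-minute ceiling:
assert backoff_delay(3600, 0, retry_jitter=False) == 600
assert backoff_delay(3600, 1, retry_jitter=False) == 600

# Raising retry_backoff_max to 7200 restores the intended delays:
assert backoff_delay(3600, 0, retry_backoff_max=7200, retry_jitter=False) == 3600
assert backoff_delay(3600, 1, retry_backoff_max=7200, retry_jitter=False) == 7200
```

With jitter left enabled (the default), each delay becomes a random value between `0` and the computed cap, which is why the fix raises the cap rather than disabling jitter.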