Branding: improve IO resilience #3182
Closes #2808
This PR aims to improve the resilience of the Branding manager cog. It is a rework of #2869, which I've closed and re-implemented in this branch, following the reviewer's request to use the tenacity lib.
Context
The Branding manager automatically discovers events once per day or when instructed to. During each such discovery, it makes multiple HTTP requests to the GitHub API. First, it fetches a list of all events in the Branding repository, and then it fetches those events one-by-one.
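For illustration, a discovery pass has roughly this shape; the repository method names below are assumptions made for the sketch, not necessarily the actual API:

```python
# Rough shape of a discovery pass: one request for the event listing,
# then one request per event. Method names are illustrative only.
async def discover_events(repository) -> list:
    event_directories = await repository.fetch_directory("events")
    events = []
    for directory in event_directories:
        events.append(await repository.fetch_event(directory))
    return events
```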
Currently, if the bot fails to fetch an event, it simply skips it. This seemed useful initially, as we didn't want the bot to stop the discovery iteration if just one event is badly configured. In practice, however, this causes undesired behaviour: if the currently active event is skipped due to a 5xx error, the bot may reset the guild's branding to the fallback event, despite the event still being in progress.
This PR makes adjustments to the error handling methodology. Now, if any of the requests fail, the whole discovery iteration is aborted and retried the following day (or manually via the `sync` command). The last valid state remains in the Redis cache, so the cog will continue to work just fine.

Additionally, 5xx errors are now retried up to 5 times, with exponential backoff. This prevents discovery iterations from being cancelled due to stray server errors.
Implementation
The solution is fairly simple. The repository class no longer swallows IO/deserialization errors and instead lets them propagate to the cog. The cog mostly already handles exceptions coming from the repo; I just added error handling where it was missing.
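Schematically, the cog-level handling now looks something like this (a minimal sketch; `daemon_loop` and `synchronise` are assumed names here, and the real methods may differ):

```python
# Minimal sketch: a failed discovery aborts the whole iteration and keeps
# the cached state, rather than partially applying results.
import logging

log = logging.getLogger(__name__)


class Branding:
    async def synchronise(self) -> None:
        """Fetch the event listing and each event; any failure propagates."""
        ...

    async def daemon_loop(self) -> None:
        """Runs once per day; a failed discovery leaves the cache untouched."""
        try:
            await self.synchronise()
        except Exception:
            # The last valid state stays in Redis, so branding keeps working;
            # the next scheduled run (or a manual sync) retries discovery.
            log.exception("Branding discovery failed; keeping cached state")
```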
The calendar refresh can now fail (well, it could always fail, but the error was hidden). I added an embed response to inform the user of the result.
Because of this, the command no longer automatically invokes the calendar view after the refresh, as that seemed noisy. But feel free to propose other solutions.
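For reference, the result embed could be built along these lines (a hypothetical sketch; the actual command's wording and colours may differ):

```python
# Hypothetical sketch of reporting the refresh result via an embed.
import discord
from discord.ext import commands


async def send_refresh_result(ctx: commands.Context, success: bool) -> None:
    """Reply with a green embed on success, red on failure."""
    embed = discord.Embed(
        description="Calendar refreshed." if success else "Failed to refresh the calendar.",
        colour=discord.Colour.green() if success else discord.Colour.red(),
    )
    await ctx.send(embed=embed)
```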
I've used the Apache 2.0-licensed tenacity library for the retry with backoff.
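For reference, a retry along these lines covers the behaviour described above: up to 5 attempts on 5xx responses, with exponential backoff. This is a sketch assuming aiohttp is used for the requests; the exact configuration in the cog may differ.

```python
# Sketch of a tenacity retry for 5xx responses (assumes aiohttp is used
# for the GitHub requests; function names are illustrative).
import aiohttp
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential


def _is_server_error(exc: BaseException) -> bool:
    """Retry only on 5xx responses; 4xx errors are treated as permanent."""
    return isinstance(exc, aiohttp.ClientResponseError) and exc.status >= 500


@retry(
    retry=retry_if_exception(_is_server_error),
    stop=stop_after_attempt(5),  # up to 5 attempts in total
    wait=wait_exponential(),     # exponentially increasing wait between attempts
    reraise=True,                # surface the final error instead of a RetryError
)
async def fetch_json(session: aiohttp.ClientSession, url: str) -> dict:
    """Fetch a GitHub API resource, raising on non-2xx responses."""
    async with session.get(url) as response:
        response.raise_for_status()
        return await response.json()
```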