Branding: improve IO resilience #3182
Closes #2808
This PR aims to improve the resilience of the Branding manager cog. It is a rework of #2869, which I've closed and re-implemented in this branch, following the reviewer's request to use the tenacity lib.
Context
The Branding manager automatically discovers events once per day or when instructed to. During each such discovery, it makes multiple HTTP requests to the GitHub API. First, it fetches a list of all events in the Branding repository, and then it fetches those events one-by-one.
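For illustration, a discovery pass has roughly this shape; the repository method names below are assumptions made for the sketch, not necessarily the actual API:

```python
# Rough shape of a discovery pass: one request for the event listing,
# then one request per event. Method names are illustrative only.
async def discover_events(repository) -> list:
    event_directories = await repository.fetch_directory("events")
    events = []
    for directory in event_directories:
        events.append(await repository.fetch_event(directory))
    return events
```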
Currently, if the bot fails to fetch an event, it simply skips it. This seemed useful initially, as we didn't want the bot to stop the discovery iteration if just one event is badly configured. In practice, however, this causes undesired behaviour: if the currently active event is skipped due to a 5xx error, the bot may reset the guild's branding to the fallback event, despite the event still being in progress.
This PR makes adjustments to the error handling methodology. Now, if any of the requests fail, the whole discovery iteration is aborted and retried the following day (or manually via the `sync` command). The last valid state remains in the Redis cache, so the cog will continue to work just fine.

Additionally, 5xx errors are now retried up to 5 times, with exponential backoff. This prevents discovery iterations from being cancelled due to stray server errors.
Implementation
The solution is fairly simple. The repository class no longer swallows IO/deserialization errors and instead lets them propagate to the cog. The cog mostly already handles exceptions coming from the repo; I just added error handling where it was missing.
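Schematically, the cog-level handling now looks something like this (a minimal sketch; `daemon_loop` and `synchronise` are assumed names here, and the real methods may differ):

```python
# Minimal sketch: a failed discovery aborts the whole iteration and keeps
# the cached state, rather than partially applying results.
import logging

log = logging.getLogger(__name__)


class Branding:
    async def synchronise(self) -> None:
        """Fetch the event listing and each event; any failure propagates."""
        ...

    async def daemon_loop(self) -> None:
        """Runs once per day; a failed discovery leaves the cache untouched."""
        try:
            await self.synchronise()
        except Exception:
            # The last valid state stays in Redis, so branding keeps working;
            # the next scheduled run (or a manual sync) retries discovery.
            log.exception("Branding discovery failed; keeping cached state")
```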
The calendar refresh can now fail (well, it could always fail, but the error was hidden). I added an embed response to inform the user of the result.
Because of this, the command no longer automatically invokes the calendar view after the refresh, as that seemed noisy. But feel free to propose other solutions.
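For reference, the result embed could be built along these lines (a hypothetical sketch; the actual command's wording and colours may differ):

```python
# Hypothetical sketch of reporting the refresh result via an embed.
import discord
from discord.ext import commands


async def send_refresh_result(ctx: commands.Context, success: bool) -> None:
    """Reply with a green embed on success, red on failure."""
    embed = discord.Embed(
        description="Calendar refreshed." if success else "Failed to refresh the calendar.",
        colour=discord.Colour.green() if success else discord.Colour.red(),
    )
    await ctx.send(embed=embed)
```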
I've used the Apache 2.0-licensed tenacity library for the retry with backoff.
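For reference, a retry along these lines covers the behaviour described above: up to 5 attempts on 5xx responses, with exponential backoff. This is a sketch assuming aiohttp is used for the requests; the exact configuration in the cog may differ.

```python
# Sketch of a tenacity retry for 5xx responses (assumes aiohttp is used
# for the GitHub requests; function names are illustrative).
import aiohttp
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential


def _is_server_error(exc: BaseException) -> bool:
    """Retry only on 5xx responses; 4xx errors are treated as permanent."""
    return isinstance(exc, aiohttp.ClientResponseError) and exc.status >= 500


@retry(
    retry=retry_if_exception(_is_server_error),
    stop=stop_after_attempt(5),  # up to 5 attempts in total
    wait=wait_exponential(),     # exponentially increasing wait between attempts
    reraise=True,                # surface the final error instead of a RetryError
)
async def fetch_json(session: aiohttp.ClientSession, url: str) -> dict:
    """Fetch a GitHub API resource, raising on non-2xx responses."""
    async with session.get(url) as response:
        response.raise_for_status()
        return await response.json()
```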