2 changes: 1 addition & 1 deletion adrs/0029-scheduled-tasks.md
@@ -2,7 +2,7 @@

## Status

Superseded
ACCEPTED
Collaborator commented:

Teeny detail, but I don't think any of the other ADRs are using ALL CAPS for status


Date of decision: 2025-04-16

207 changes: 119 additions & 88 deletions adrs/0033-scheduled-tasks-revisited.md
@@ -1,88 +1,119 @@
# ADR-0033: Scheduled tasks revisited

## Status

Accepted

Date of decision: 2025-10-17

## Context and Problem Statement

We need several jobs that will run periodically, e.g. cleaning up old partial property registrations, sending
out various reminders to users, etc. We previously decided to run these as ephemeral copies of the WebApp container
triggered by EventBridge Scheduler ([ADR-0029](https://github.com/communitiesuk/prsdb-webapp/blob/main/adrs/0029-scheduled-tasks.md)).
However, this approach has some limitations:

* There is a limit of 10 instances of each task definition at once, so if we have a large number of scheduled tasks that all run at the same
time, some of them may fail to start.
* The startup time of the container is long and requires a lot of resources, making it wasteful to spin up a new container for each task
run.
* If a task fails partway through, there is no easy way to retry it.

## Considered Options

* Long running ECS task with webserver and private endpoints
* Separate ECS service with SQS queue and custom scaling rule
* Switching to Lambda functions

## Decision Outcome

Separate ECS service with SQS queue and custom scaling rule, because it allows easy reuse of existing code while only using the resources it
requires.

## Pros and Cons of the Options

### Long running ECS task with webserver and private endpoints

We could run a mirror of our existing ECS service, with a private load balancer that is only accessible from within our VPC and endpoints
that could be targeted by EventBridge Scheduler.

* Good, because it addresses the limit of 10 instances of each task definition.
* Good, because it avoids the long startup time of the container.
* Good, because it would allow very easy reuse and sharing of code and patterns between the WebApp and asynchronous tasks.
* Good, because we could use normal scaling rules for the ECS service based on the load of the containers.
* Good, because it would be easy to trigger asynchronous tasks, either via EventBridge Scheduler or by sending a request from the WebApp.
* Bad, because we would have to handle retry logic ourselves if a task failed partway through, and there would be some cases where
tasks could be lost if the container was stopped or restarted while processing a task.
* Bad, because it would require a full copy of the WebApp to be running all the time even when there are no asynchronous tasks, which is
wasteful in terms of resources.

### Separate ECS service with SQS queue and custom scaling rule

We could extend our current approach of having an ephemeral non-webserver version of the WebApp container, to instead have a long-running
ECS service that reads tasks from an SQS queue. EventBridge Scheduler would then send messages to the SQS queue to trigger tasks. We could
set up a custom scaling rule for the ECS service based on the average number of messages in the SQS queue per task, so that it would scale
up when there are tasks to process and scale down to zero when there are no tasks.

* Good, because it would allow relatively easy reuse of our existing code.
* Good, because during high volumes of tasks it would avoid the long start-up time of the webapp container.
* Good, because it would scale down to zero when there are no tasks, avoiding wasteful resource usage.
* Good, because it addresses the limit of 10 instances of each task definition.
* Good, because it would allow us to implement retry logic using the built-in features of SQS, and would avoid losing tasks if a container
was stopped or restarted while processing a task.
* Good, because it would be easy to trigger asynchronous tasks, either via EventBridge Scheduler or by adding a message to the queue from
the WebApp.
* Bad, because it would require some changes to our existing code to poll the queue for messages instead of receiving them via an
environment variable.
* Bad, because it would require creating a custom scaling rule for the ECS service, which is more complex than using the built-in scaling
rules.

### Switching to Lambda functions

We could split out each scheduled task into a separate Lambda function, which would be triggered by EventBridge Scheduler.

* Good, because we wouldn't need to worry about scaling rules.
* Good, because we would only be using resources when a task is actually running.
* Good, because it would avoid the long start-up time of the full webapp container.
* Good, because it would address the limit of 10 instances of each task definition.
* Neutral, because while we wouldn't get retry logic for free, it would be relatively trivial to set up an SQS queue per lambda to allow for
this.
* Neutral, because while we wouldn't get the ability to trigger asynchronous tasks for free, it would be relatively trivial to set up an SQS
queue per lambda to allow for this.
* Bad, because we would need to significantly refactor our existing code to split out common functionality into a separate package that
could be used by the Lambda functions.
* Bad, because we would need to ensure no tasks take longer than 15 minutes to run, which may not be possible for some tasks.

## More Information

Custom ECS scaling rules: https://aws.amazon.com/blogs/containers/amazon-elastic-container-service-ecs-auto-scaling-using-custom-metrics/
# ADR-0033: Asynchronous Tasks

## Status

DRAFT

Date of decision: TBC

## Context and Problem Statement

We will have a number of asynchronous tasks that need to be run on demand, triggered directly or indirectly by a user's actions in the
WebApp. Examples of this include handling virus scan results, processing bulk uploads of property data, and potentially others in future
such as generating reports and sending bulk emails.

## Considered Options

* Separate long-running ECS task with webserver and private endpoints
* Separate ECS service with SQS queue
* Separate ECS service with SQS queue and custom scaling rule
* SQS queue with the main WebApp listening to it
* Switching to Lambda functions

## Decision Outcome

Separate ECS service with SQS queue, because it lets us implement asynchronous tasks relatively quickly and efficiently, allows the capacity
for ad-hoc async tasks to be scaled independently of the main WebApp, and leaves us a pathway to add custom scaling rules later to reduce
the cost of running a permanent additional container, if the savings justify the additional effort.
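
To make the shape of the chosen option concrete, the following is a minimal sketch of how the WebApp (or an EventBridge Scheduler target) might put a task message onto the queue, using the AWS SDK for Java v2 from Kotlin. It is illustrative only: the queue URL, message format, and task name are assumptions, not existing code.

```kotlin
import software.amazon.awssdk.services.sqs.SqsClient
import software.amazon.awssdk.services.sqs.model.SendMessageRequest

// Hypothetical enqueue helper used by the WebApp to request an asynchronous task.
// The message body format and task names are illustrative assumptions.
fun enqueueTask(sqs: SqsClient, queueUrl: String, taskName: String, payload: String) {
    sqs.sendMessage(
        SendMessageRequest.builder()
            .queueUrl(queueUrl)
            .messageBody("""{"task":"$taskName","payload":"$payload"}""")
            .build(),
    )
}

fun main() {
    val sqs = SqsClient.create()
    enqueueTask(
        sqs,
        queueUrl = "https://sqs.eu-west-2.amazonaws.com/123456789012/async-tasks", // placeholder URL
        taskName = "PROCESS_BULK_UPLOAD",
        payload = "upload-id-123",
    )
}
```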

## Pros and Cons of the Options

### Separate long-running ECS task with webserver and private endpoints

We could run a mirror of our existing ECS service, with a private load balancer that is only accessible from within our VPC and endpoints
that could be targeted by EventBridge Scheduler.
Collaborator commented:

Given this ADR is about on-demand tasks, should these summaries be talking about EventBridge Scheduler? Rather than just vanilla EventBridge or direct queue interaction from the web app? (Although I guess it is true that EventBridge Scheduler could put messages on the queue, too!)


* Good, because it avoids the long startup time of the container.
* Good, because it would allow very easy reuse and sharing of code and patterns between the WebApp and asynchronous tasks.
* Good, because we could use normal scaling rules for the ECS service based on the load of the containers.
* Good, because it would be easy to trigger asynchronous tasks, either via EventBridge Scheduler or by sending a request from the WebApp.
* Bad, because we would have to handle retry logic ourselves if a task failed partway through, and there could be some cases where
tasks could be lost if the container was stopped or restarted while processing a task.
* Bad, because it would require a full copy of the WebApp to be running all the time even when there are no asynchronous tasks, which is
wasteful in terms of resources.
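
For illustration, the "private endpoints" in this option would just be ordinary controller endpoints that are only reachable through the internal load balancer. A minimal sketch, assuming a Spring MVC controller like those in the WebApp; the path, payload, and controller name are hypothetical.

```kotlin
import org.springframework.http.ResponseEntity
import org.springframework.web.bind.annotation.PostMapping
import org.springframework.web.bind.annotation.RequestBody
import org.springframework.web.bind.annotation.RequestMapping
import org.springframework.web.bind.annotation.RestController

// Hypothetical internal-only controller. Network isolation comes from the private
// load balancer and security groups, not from the controller itself.
@RestController
@RequestMapping("/internal/tasks")
class InternalTaskController {
    @PostMapping("/process-virus-scan-result")
    fun processVirusScanResult(@RequestBody payload: String): ResponseEntity<Void> {
        // Would call into the same service-layer code the WebApp already uses,
        // which is why code reuse is easy with this option.
        return ResponseEntity.accepted().build()
    }
}
```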

### Separate ECS service with SQS queue

We could have a long-running ECS service that reads tasks from an SQS queue. EventBridge Scheduler would then send messages to the SQS queue
to trigger tasks.

* Good, because it would allow relatively easy reuse of our existing code.
* Good, because during high volumes of tasks it would avoid the long start-up time of the webapp container.
* Good, because it would allow us to implement retry logic using the built-in features of SQS, and would avoid losing tasks if a container
was stopped or restarted while processing a task.
* Good, because it would be easy to trigger asynchronous tasks, either via EventBridge Scheduler or by adding a message to the queue from
the WebApp.
* Good, because it would not place any additional load on the main WebApp containers.
* Good, because it would not require creating a custom scaling rule for the ECS service.
* Bad, because having a second long-running container would be wasteful during periods of low or no asynchronous task activity.
* Bad, because it would require some changes to our existing code to poll the queue for messages instead of receiving them via an
environment variable.
* Bad, because it would require creating a custom scaling rule for the ECS service, including a lambda function to calculate the metrics,
which is more complex than using the built-in scaling rules.
Comment on lines +62 to +63
Collaborator commented:

Don't think this one applies here
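
To illustrate what the task-runner side of this option could look like, here is a minimal sketch of a long-polling loop using the AWS SDK for Java v2 from Kotlin. The queue URL and the `runTask` dispatch function are hypothetical; in practice the loop would hand off to the existing task classes in the WebApp codebase.

```kotlin
import software.amazon.awssdk.services.sqs.SqsClient
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest

// Hypothetical long-polling loop for the task-runner container.
fun pollTaskQueue(sqs: SqsClient, queueUrl: String) {
    while (true) {
        val messages = sqs.receiveMessage(
            ReceiveMessageRequest.builder()
                .queueUrl(queueUrl)
                .maxNumberOfMessages(10)
                .waitTimeSeconds(20) // long polling keeps the idle loop cheap
                .build(),
        ).messages()

        for (message in messages) {
            // If runTask throws, the message is not deleted and becomes visible again
            // after the visibility timeout, so SQS's redrive policy gives us retries
            // (and a dead-letter queue after maxReceiveCount).
            runTask(message.body())
            sqs.deleteMessage(
                DeleteMessageRequest.builder()
                    .queueUrl(queueUrl)
                    .receiptHandle(message.receiptHandle())
                    .build(),
            )
        }
    }
}

// Placeholder dispatch: in practice this would map the message body to one of
// the existing task classes in the WebApp codebase.
fun runTask(body: String) {
    println("Running task: $body")
}
```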


### Separate ECS service with SQS queue and custom scaling rule

We could have a long-running ECS service that reads tasks from an SQS queue. EventBridge Scheduler would then send messages to the SQS queue
to trigger tasks. We could set up a custom scaling rule for the ECS service based on the average number of messages in the SQS queue per
task, so that it would scale up when there are tasks to process and scale down to zero when there are no tasks.

* Good, because it would allow relatively easy reuse of our existing code.
* Good, because during high volumes of tasks it would avoid the long start-up time of the webapp container.
* Good, because it would scale down to zero when there are no tasks, avoiding wasteful resource usage.
* Good, because it would allow us to implement retry logic using the built-in features of SQS, and would avoid losing tasks if a container
was stopped or restarted while processing a task.
* Good, because it would be easy to trigger asynchronous tasks, either via EventBridge Scheduler or by adding a message to the queue from
the WebApp.
* Good, because it would not place any additional load on the main WebApp containers.
* Bad, because it would require some changes to our existing code to poll the queue for messages instead of receiving them via an
environment variable.
* Bad, because it would require creating a custom scaling rule for the ECS service, including a Lambda function to calculate the metrics (sketched below),
which is more complex than using the built-in scaling rules.
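
As a rough illustration of the custom metric this option relies on, the following Kotlin sketch (AWS SDK for Java v2) publishes a "messages per running task" figure that a target-tracking scaling policy could act on. The CloudWatch namespace, metric name, and the source of the running task count are assumptions, not existing configuration.

```kotlin
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient
import software.amazon.awssdk.services.cloudwatch.model.MetricDatum
import software.amazon.awssdk.services.cloudwatch.model.PutMetricDataRequest
import software.amazon.awssdk.services.cloudwatch.model.StandardUnit
import software.amazon.awssdk.services.sqs.SqsClient
import software.amazon.awssdk.services.sqs.model.GetQueueAttributesRequest
import software.amazon.awssdk.services.sqs.model.QueueAttributeName

// Illustrative only: publishes "queue messages per running task" so that a
// target-tracking policy on the ECS service can scale towards a desired ratio.
fun publishBacklogPerTask(
    sqs: SqsClient,
    cloudWatch: CloudWatchClient,
    queueUrl: String,
    runningTaskCount: Int, // e.g. from ecs:DescribeServices; assumed as an input here
) {
    val attributes = sqs.getQueueAttributes(
        GetQueueAttributesRequest.builder()
            .queueUrl(queueUrl)
            .attributeNames(QueueAttributeName.APPROXIMATE_NUMBER_OF_MESSAGES)
            .build(),
    ).attributes()

    val backlog = attributes[QueueAttributeName.APPROXIMATE_NUMBER_OF_MESSAGES]?.toDouble() ?: 0.0
    val backlogPerTask = if (runningTaskCount > 0) backlog / runningTaskCount else backlog

    cloudWatch.putMetricData(
        PutMetricDataRequest.builder()
            .namespace("Prsdb/AsyncTasks") // hypothetical namespace
            .metricData(
                MetricDatum.builder()
                    .metricName("BacklogPerTask") // hypothetical metric name
                    .value(backlogPerTask)
                    .unit(StandardUnit.COUNT)
                    .build(),
            )
            .build(),
    )
}
```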

### Switching to Lambda functions

We could split out each asynchronous task into a separate Lambda function, which would be triggered by EventBridge Scheduler.

* Good, because we wouldn't need to worry about scaling rules.
* Good, because we would only be using resources when a task is actually running.
* Good, because it would avoid the long start-up time of the full webapp container.
* Neutral, because while we wouldn't get retry logic for free, it would be relatively trivial to set up an SQS queue per Lambda to allow for
this.
* Neutral, because while we wouldn't get the ability to trigger asynchronous tasks for free, it would be relatively trivial to set up an SQS
queue per Lambda to allow for this.
* Bad, because we would need to significantly refactor our existing code to split out common functionality into a separate package that
could be used by the Lambda functions.
* Bad, because we would need to ensure no tasks take longer than 15 minutes to run, which may not be possible for some tasks.
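
If we took this route, each task would become a small handler roughly like the sketch below (Kotlin, using aws-lambda-java-core and aws-lambda-java-events). The handler name, the SQS trigger, and the shared `cleanUpPartialRegistrations` function are hypothetical; the main point is that the shared logic would have to live in a package split out of the WebApp.

```kotlin
import com.amazonaws.services.lambda.runtime.Context
import com.amazonaws.services.lambda.runtime.RequestHandler
import com.amazonaws.services.lambda.runtime.events.SQSEvent

// Hypothetical handler for one task type, triggered by EventBridge Scheduler via
// an SQS queue. Each invocation is bounded by Lambda's 15-minute execution limit.
class CleanUpPartialRegistrationsHandler : RequestHandler<SQSEvent, Unit> {
    override fun handleRequest(event: SQSEvent, context: Context) {
        for (record in event.records) {
            // This shared logic would need to come from a common package split out
            // of the WebApp, which is the main refactoring cost of this option.
            cleanUpPartialRegistrations(record.body)
        }
    }
}

// Placeholder for logic that currently lives inside the WebApp.
fun cleanUpPartialRegistrations(payload: String) {
    println("Cleaning up partial registrations for: $payload")
}
```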

### SQS queue with the main WebApp listening to it

We could have the main WebApp containers read tasks from an SQS queue. EventBridge Scheduler would then send messages to the SQS queue to
trigger tasks.

* Good, because it would allow us to implement asynchronous tasks relatively quickly with minimal changes to our existing code.
* Good, because it would avoid the long start-up time of the webapp container.
* Good, because it would allow us to implement retry logic using the built-in features of SQS, and would avoid losing tasks if a container
was stopped or restarted while processing a task.
* Good, because it would be easy to trigger asynchronous tasks, either via EventBridge Scheduler or by adding a message to the queue from
the WebApp itself.
* Good, because it would avoid the need to create a custom scaling rule for a separate ECS service.
* Neutral, because while the volume of ad-hoc tasks should broadly scale with the load on the WebApp itself, there may be times when this
coupling is not ideal.
* Bad, because it would place additional load on the main WebApp containers, which could impact performance during periods of high load.
* Bad, because it would require some changes to our existing code to poll the queue for messages instead of receiving them via an
environment variable.
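
As a sketch of the listening side, if we used Spring Cloud AWS's SQS support (an assumption; the plain AWS SDK with a polling loop, as sketched earlier, would also work) the main WebApp could consume the queue with an annotated listener. The queue name and dispatch are illustrative.

```kotlin
import io.awspring.cloud.sqs.annotation.SqsListener
import org.springframework.stereotype.Component

// Hypothetical listener running inside the existing WebApp containers. The
// framework manages polling and deletes the message when the method returns
// normally; a thrown exception leaves it on the queue for redelivery.
@Component
class AsyncTaskListener {
    @SqsListener("async-tasks") // placeholder queue name
    fun handleTask(body: String) {
        // Would dispatch to existing service-layer code based on the message body,
        // sharing the WebApp's resources (hence the extra load noted above).
        println("Received async task message: $body")
    }
}
```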

## More Information

Custom ECS scaling rules: https://aws.amazon.com/blogs/containers/amazon-elastic-container-service-ecs-auto-scaling-using-custom-metrics/