diff --git a/adrs/0029-scheduled-tasks.md b/adrs/0029-scheduled-tasks.md index 36abc392d..5aca6b3ad 100644 --- a/adrs/0029-scheduled-tasks.md +++ b/adrs/0029-scheduled-tasks.md @@ -2,7 +2,7 @@ ## Status -Superseded +ACCEPTED Date of decision: 2025-04-16 diff --git a/adrs/0033-scheduled-tasks-revisited.md b/adrs/0033-scheduled-tasks-revisited.md index 9ae8348bd..d741649c4 100644 --- a/adrs/0033-scheduled-tasks-revisited.md +++ b/adrs/0033-scheduled-tasks-revisited.md @@ -1,88 +1,119 @@ -# ADR-0033: Scheduled tasks revisited - -## Status - -Accepted - -Date of decision: 2025-10-17 - -## Context and Problem Statement - -We need multiple different jobs that will run periodically, e.g. cleaning up old partial property registrations, sending -out various different reminders to users etc. We previously decided to run these as ephemeral copies of the WebApp container -triggered by Eventbridge Scheduler [ADR-0029](https://github.com/communitiesuk/prsdb-webapp/blob/main/adrs/0029-scheduled-tasks.md). -However, this approach has some limitations: - -* There is a limit of 10 instances of each task definition at once, so if we have a large number of scheduled tasks that all run at the same - time, some of them may fail to start. -* The startup time of the container is long and requires a lot of resources, making it wasteful to spin up a new container for each task - run. -* If a task fails partway through, there is no easy way to retry it. - -## Considered Options - -* Long running ECS task with webserver and private endpoints -* Separate ECS service with SQS queue and custom scaling rule -* Switching to Lambda functions - -## Decision Outcome - -Separate ECS service with SQS queue and custom scaling rule, because it allows easy reuse of existing code while only using the resources it -requires. - -## Pros and Cons of the Options - -### Long running ECS task with webserver and private endpoints - -We could run a mirror of our existing ECS service, with a private loadbalancer that is only accessible from within our VPC and endpoints -that could be targeted by Eventbridge Scheduler. - -* Good, because it addresses the limit of 10 instances of each task definition. -* Good, because it avoids the long startup time of the container. -* Good, because it would allow very easy reuse and sharing of code and patterns between the WebApp and asynchronous tasks. -* Good, because we could use normal scaling rules for the ECS service based on the load of the containers. -* Good, because it would be easy to trigger asynchronous tasks using either Eventbridge Scheduler or by sending a request from the WebApp. -* Bad, because we would have to handle retry logic ourselves if a task failed partway through, and there would be some cases where - tasks could be lost if the container was stopped or restarted while processing a task. -* Bad, because it would require a full copy of the WebApp to be running all the time even when there are no asynchronous tasks, which is - wasteful in terms of resources. - -### Separate ECS service with SQS queue and custom scaling rule - -We could extend our current approach of having an ephemeral non-webserver version of the WebApp container, to instead have a long-running -ECS service that reads tasks from an SQS queue. Eventbridge Scheduler would then send messages to the SQS queue to trigger tasks. We could -set up a custom scaling rule for the ECS service based on the average number of messages in the SQS queue per task, so that it would scale -up when there are tasks to process and scale down to zero when there are no tasks. - -* Good, because would allow relatively easy reuse of our existing code. -* Good, because during high volumes of tasks it would avoid the long start-up time of the webapp container. -* Good, because it would scale down to zero when there are no tasks, avoiding wasteful resource usage. -* Good, because it addresses the limit of 10 instances of each task definition. -* Good, because it would allow us to implement retry logic using the built-in features of SQS, and would avoid losing tasks if a container - was stopped or restarted while processing a task. -* Good, because it would be easy to trigger asynchronous tasks using either Eventbridge Scheduler or by adding a message to the queue from - the WebApp. -* Bad, because it would require some changes to our existing code to poll the queue for messages instead of receiving them via an - environment variable. -* Bad, because it would require creating a custom scaling rule for the ECS service, which is more complex than using the built-in scaling - rules. - -### Switching to Lambda functions - -We could split out each scheduled task into a separate Lambda function, which would be triggered by Eventbridge Schedulers. - -* Good, because we wouldn't need to worry about scaling rules. -* Good, because we would only be using resources when a task is actually running. -* Good, because it would avoid the long start-up time of the full webapp container. -* Good, because it would address the limit of 10 instances of each task definition. -* Neutral, because while we wouldn't get retry logic for free, it would be relatively trivial to set up an SQS queue per lambda to allow for - this. -* Neutral, because while we wouldn't get the ability to trigger asynchronous tasks for free, it would be relatively trivial to set up an SQS - queue per lambda to allow for this. -* Bad, because we would need to significantly refactor our existing code to split out common functionality into a separate package that - could be used by the Lambda functions. -* Bad, because we would need to ensure no tasks take longer than 15 minutes to run, which may not be possible for some tasks. - -## More Information - -Custom ECS scaling rules: https://aws.amazon.com/blogs/containers/amazon-elastic-container-service-ecs-auto-scaling-using-custom-metrics/ +# ADR-0033: Asynchronous Tasks + +## Status + +DRAFT + +Date of decision: TBC + +## Context and Problem Statement + +We will have a number of asynchronous tasks that need to be run on demand, triggered directly or indirectly by a user's actions in the +WebApp. Examples of this include handling virus scan results, processing bulk uploads of property data, and potentially others in future +such as generating reports and sending bulk emails. + +## Considered Options + +* Separate long-running ECS task with webserver and private endpoints +* Separate ECS service with SQS queue +* Separate ECS service with SQS queue and custom scaling rule +* SQS queue with the main webapp listening to it +* Switching to Lambda functions + +## Decision Outcome + +Separate ECS service with SQS queue, as it allows us to implement asynchronous tasks relatively quickly and efficiently, allows the capacity +for ad-hoc async tasks to be scaled independently of the main webapp, and allows us a pathway to add custom scaling rules to reduce the +costs of having a permanent additional container running if the cost savings justify the additional effort. + +## Pros and Cons of the Options + +### Separate long-running ECS task with webserver and private endpoints + +We could run a mirror of our existing ECS service, with a private loadbalancer that is only accessible from within our VPC and endpoints +that could be targeted by Eventbridge Scheduler. + +* Good, because it avoids the long startup time of the container. +* Good, because it would allow very easy reuse and sharing of code and patterns between the WebApp and asynchronous tasks. +* Good, because we could use normal scaling rules for the ECS service based on the load of the containers. +* Good, because it would be easy to trigger asynchronous tasks using either Eventbridge Scheduler or by sending a request from the WebApp. +* Bad, because we would have to handle retry logic ourselves if a task failed partway through, and there could be some cases where + tasks could be lost if the container was stopped or restarted while processing a task. +* Bad, because it would require a full copy of the WebApp to be running all the time even when there are no asynchronous tasks, which is + wasteful in terms of resources. +* + +### Separate long running ECS service with SQS queue + +We could have a long-running ECS service that reads tasks from an SQS queue. Eventbridge Scheduler would then send messages to the SQS queue +to trigger tasks. + +* Good, because would allow relatively easy reuse of our existing code. +* Good, because during high volumes of tasks it would avoid the long start-up time of the webapp container. +* Good, because it would allow us to implement retry logic using the built-in features of SQS, and would avoid losing tasks if a container + was stopped or restarted while processing a task. +* Good, because it would be easy to trigger asynchronous tasks using either Eventbridge Scheduler or by adding a message to the queue from + the WebApp. +* Good, because it would not place any additional load on the main WebApp containers. +* Good, because it would not require creating a custom scaling rule for the ECS service. +* Bad, because having a second long-running container would be wasteful during periods of low or no asynchronous task activity. +* Bad, because it would require some changes to our existing code to poll the queue for messages instead of receiving them via an + environment variable. +* Bad, because it would require creating a custom scaling rule for the ECS service, including a lambda function to calculate the metrics, + which is more complex than using the built-in scaling rules. + +### Separate ECS service with SQS queue and custom scaling rule + +We could have a long-running ECS service that reads tasks from an SQS queue. Eventbridge Scheduler would then send messages to the SQS queue +to trigger tasks. We could set up a custom scaling rule for the ECS service based on the average number of messages in the SQS queue per +task, so that it would scale up when there are tasks to process and scale down to zero when there are no tasks. + +* Good, because would allow relatively easy reuse of our existing code. +* Good, because during high volumes of tasks it would avoid the long start-up time of the webapp container. +* Good, because it would scale down to zero when there are no tasks, avoiding wasteful resource usage. +* Good, because it would allow us to implement retry logic using the built-in features of SQS, and would avoid losing tasks if a container + was stopped or restarted while processing a task. +* Good, because it would be easy to trigger asynchronous tasks using either Eventbridge Scheduler or by adding a message to the queue from + the WebApp. +* Good, because it would not place any additional load on the main WebApp containers. +* Bad, because it would require some changes to our existing code to poll the queue for messages instead of receiving them via an + environment variable. +* Bad, because it would require creating a custom scaling rule for the ECS service, including a lambda function to calculate the metrics, + which is more complex than using the built-in scaling rules. + +### Switching to Lambda functions + +We could split out each scheduled task into a separate Lambda function, which would be triggered by Eventbridge Schedulers. + +* Good, because we wouldn't need to worry about scaling rules. +* Good, because we would only be using resources when a task is actually running. +* Good, because it would avoid the long start-up time of the full webapp container. +* Neutral, because while we wouldn't get retry logic for free, it would be relatively trivial to set up an SQS queue per lambda to allow for + this. +* Neutral, because while we wouldn't get the ability to trigger asynchronous tasks for free, it would be relatively trivial to set up an SQS + queue per lambda to allow for this. +* Bad, because we would need to significantly refactor our existing code to split out common functionality into a separate package that + could be used by the Lambda functions. +* Bad, because we would need to ensure no tasks take longer than 15 minutes to run, which may not be possible for some tasks. + +### SQS queue with the main webapp listening to it + +We could have the main WebApp containers read tasks from an SQS queue. Eventbridge Scheduler would then send messages to the SQS queue to +trigger tasks. + +* Good, because it would allow us to implement asynchronous tasks relatively quickly with minimal changes to our existing code. +* Good, because it would avoid the long start-up time of the webapp container. +* Good, because it would allow us to implement retry logic using the built-in features of SQS, and would avoid losing tasks if a container + was stopped or restarted while processing a task. +* Good, because it would be easy to trigger asynchronous tasks using either Eventbridge Scheduler or by adding a message to the queue from + the WebApp itself. +* Good, because it would avoid the need to create a custom scaling rule for a separate ECS service. +* Neutral, because while the load on ad-hoc tasks should scale with the load on the webapp itself, there may be times when this is not + ideal. +* Bad, because it would place additional load on the main WebApp containers, which could impact performance during periods of high load. +* Bad, because it would require some changes to our existing code to poll the queue for messages instead of receiving them via an + environment variable. + +## More Information + +Custom ECS scaling rules: https://aws.amazon.com/blogs/containers/amazon-elastic-container-service-ecs-auto-scaling-using-custom-metrics/