PRSD-1605: Revise async tasks ADR to focus on ad-hoc tasks only #837
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2,7 +2,7 @@ | |
|
|
||
| ## Status | ||
|
|
||
| Superseded | ||
| ACCEPTED | ||
|
|
||
| Date of decision: 2025-04-16 | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,88 +1,119 @@ | ||
| # ADR-0033: Scheduled tasks revisited | ||
|
|
||
| ## Status | ||
|
|
||
| Accepted | ||
|
|
||
| Date of decision: 2025-10-17 | ||
|
|
||
| ## Context and Problem Statement | ||
|
|
||
| We need multiple different jobs that will run periodically, e.g. cleaning up old partial property registrations, sending | ||
| out various different reminders to users etc. We previously decided to run these as ephemeral copies of the WebApp container | ||
| triggered by Eventbridge Scheduler [ADR-0029](https://github.com/communitiesuk/prsdb-webapp/blob/main/adrs/0029-scheduled-tasks.md). | ||
| However, this approach has some limitations: | ||
|
|
||
| * There is a limit of 10 instances of each task definition at once, so if we have a large number of scheduled tasks that all run at the same | ||
| time, some of them may fail to start. | ||
| * The startup time of the container is long and requires a lot of resources, making it wasteful to spin up a new container for each task | ||
| run. | ||
| * If a task fails partway through, there is no easy way to retry it. | ||
|
|
||
| ## Considered Options | ||
|
|
||
| * Long running ECS task with webserver and private endpoints | ||
| * Separate ECS service with SQS queue and custom scaling rule | ||
| * Switching to Lambda functions | ||
|
|
||
| ## Decision Outcome | ||
|
|
||
| Separate ECS service with SQS queue and custom scaling rule, because it allows easy reuse of existing code while only using the resources it | ||
| requires. | ||
|
|
||
| ## Pros and Cons of the Options | ||
|
|
||
| ### Long running ECS task with webserver and private endpoints | ||
|
|
||
| We could run a mirror of our existing ECS service, with a private loadbalancer that is only accessible from within our VPC and endpoints | ||
| that could be targeted by Eventbridge Scheduler. | ||
|
|
||
| * Good, because it addresses the limit of 10 instances of each task definition. | ||
| * Good, because it avoids the long startup time of the container. | ||
| * Good, because it would allow very easy reuse and sharing of code and patterns between the WebApp and asynchronous tasks. | ||
| * Good, because we could use normal scaling rules for the ECS service based on the load of the containers. | ||
| * Good, because it would be easy to trigger asynchronous tasks using either Eventbridge Scheduler or by sending a request from the WebApp. | ||
| * Bad, because we would have to handle retry logic ourselves if a task failed partway through, and there would be some cases where | ||
| tasks could be lost if the container was stopped or restarted while processing a task. | ||
| * Bad, because it would require a full copy of the WebApp to be running all the time even when there are no asynchronous tasks, which is | ||
| wasteful in terms of resources. | ||
|
|
||
| ### Separate ECS service with SQS queue and custom scaling rule | ||
|
|
||
| We could extend our current approach of having an ephemeral non-webserver version of the WebApp container, to instead have a long-running | ||
| ECS service that reads tasks from an SQS queue. Eventbridge Scheduler would then send messages to the SQS queue to trigger tasks. We could | ||
| set up a custom scaling rule for the ECS service based on the average number of messages in the SQS queue per task, so that it would scale | ||
| up when there are tasks to process and scale down to zero when there are no tasks. | ||
|
|
||
| * Good, because it would allow relatively easy reuse of our existing code. | ||
| * Good, because during high volumes of tasks it would avoid the long start-up time of the webapp container. | ||
| * Good, because it would scale down to zero when there are no tasks, avoiding wasteful resource usage. | ||
| * Good, because it addresses the limit of 10 instances of each task definition. | ||
| * Good, because it would allow us to implement retry logic using the built-in features of SQS, and would avoid losing tasks if a container | ||
| was stopped or restarted while processing a task. | ||
| * Good, because it would be easy to trigger asynchronous tasks using either Eventbridge Scheduler or by adding a message to the queue from | ||
| the WebApp. | ||
| * Bad, because it would require some changes to our existing code to poll the queue for messages instead of receiving them via an | ||
| environment variable. | ||
| * Bad, because it would require creating a custom scaling rule for the ECS service, which is more complex than using the built-in scaling | ||
| rules. | ||
|
|
||
| ### Switching to Lambda functions | ||
|
|
||
| We could split out each scheduled task into a separate Lambda function, which would be triggered by Eventbridge Schedulers. | ||
|
|
||
| * Good, because we wouldn't need to worry about scaling rules. | ||
| * Good, because we would only be using resources when a task is actually running. | ||
| * Good, because it would avoid the long start-up time of the full webapp container. | ||
| * Good, because it would address the limit of 10 instances of each task definition. | ||
| * Neutral, because while we wouldn't get retry logic for free, it would be relatively trivial to set up an SQS queue per lambda to allow for | ||
| this. | ||
| * Neutral, because while we wouldn't get the ability to trigger asynchronous tasks for free, it would be relatively trivial to set up an SQS | ||
| queue per lambda to allow for this. | ||
| * Bad, because we would need to significantly refactor our existing code to split out common functionality into a separate package that | ||
| could be used by the Lambda functions. | ||
| * Bad, because we would need to ensure no tasks take longer than 15 minutes to run, which may not be possible for some tasks. | ||
|
|
||
| ## More Information | ||
|
|
||
| Custom ECS scaling rules: https://aws.amazon.com/blogs/containers/amazon-elastic-container-service-ecs-auto-scaling-using-custom-metrics/ | ||
| # ADR-0033: Asynchronous Tasks | ||
|
|
||
| ## Status | ||
|
|
||
| DRAFT | ||
|
|
||
| Date of decision: TBC | ||
|
|
||
| ## Context and Problem Statement | ||
|
|
||
| We will have a number of asynchronous tasks that need to be run on demand, triggered directly or indirectly by a user's actions in the | ||
| WebApp. Examples of this include handling virus scan results, processing bulk uploads of property data, and potentially others in future | ||
| such as generating reports and sending bulk emails. | ||
|
|
||
| ## Considered Options | ||
|
|
||
| * Separate long-running ECS task with webserver and private endpoints | ||
| * Separate ECS service with SQS queue | ||
| * Separate ECS service with SQS queue and custom scaling rule | ||
| * SQS queue with the main webapp listening to it | ||
| * Switching to Lambda functions | ||
|
|
||
| ## Decision Outcome | ||
|
|
||
| Separate ECS service with SQS queue, as it allows us to implement asynchronous tasks relatively quickly and efficiently, allows the capacity | ||
| for ad-hoc async tasks to be scaled independently of the main webapp, and allows us a pathway to add custom scaling rules to reduce the | ||
| costs of having a permanent additional container running if the cost savings justify the additional effort. | ||
|
|
||
| ## Pros and Cons of the Options | ||
|
|
||
| ### Separate long-running ECS task with webserver and private endpoints | ||
|
|
||
| We could run a mirror of our existing ECS service, with a private loadbalancer that is only accessible from within our VPC and endpoints | ||
| that could be targeted by Eventbridge Scheduler. | ||
|
Given this ADR is about on-demand tasks, should these summaries be talking about EventBridge Scheduler, rather than just vanilla EventBridge or direct queue interaction from the web app? (Although I guess it is true that EventBridge Scheduler could put messages on the queue, too!)
||
|
|
||
| * Good, because it avoids the long startup time of the container. | ||
| * Good, because it would allow very easy reuse and sharing of code and patterns between the WebApp and asynchronous tasks. | ||
| * Good, because we could use normal scaling rules for the ECS service based on the load of the containers. | ||
| * Good, because it would be easy to trigger asynchronous tasks using either Eventbridge Scheduler or by sending a request from the WebApp. | ||
| * Bad, because we would have to handle retry logic ourselves if a task failed partway through, and tasks could be lost if the | ||
| container was stopped or restarted while processing them. | ||
| * Bad, because it would require a full copy of the WebApp to be running all the time even when there are no asynchronous tasks, which is | ||
| wasteful in terms of resources. | ||
| * | ||
|
|
||
| ### Separate long running ECS service with SQS queue | ||
|
|
||
| We could have a long-running ECS service that reads tasks from an SQS queue. Eventbridge Scheduler would then send messages to the SQS queue | ||
| to trigger tasks. | ||
|
|
||
| * Good, because it would allow relatively easy reuse of our existing code. | ||
| * Good, because during high volumes of tasks it would avoid the long start-up time of the webapp container. | ||
| * Good, because it would allow us to implement retry logic using the built-in features of SQS, and would avoid losing tasks if a container | ||
| was stopped or restarted while processing a task. | ||
| * Good, because it would be easy to trigger asynchronous tasks using either Eventbridge Scheduler or by adding a message to the queue from | ||
| the WebApp. | ||
| * Good, because it would not place any additional load on the main WebApp containers. | ||
| * Good, because it would not require creating a custom scaling rule for the ECS service. | ||
| * Bad, because having a second long-running container would be wasteful during periods of low or no asynchronous task activity. | ||
| * Bad, because it would require some changes to our existing code to poll the queue for messages instead of receiving them via an | ||
| environment variable. | ||
| * Bad, because it would require creating a custom scaling rule for the ECS service, including a lambda function to calculate the metrics, | ||
| which is more complex than using the built-in scaling rules. | ||
|
Comment on lines +62 to +63: Don't think this one applies here
||
|
|
||
| ### Separate ECS service with SQS queue and custom scaling rule | ||
|
|
||
| We could have a long-running ECS service that reads tasks from an SQS queue. Eventbridge Scheduler would then send messages to the SQS queue | ||
| to trigger tasks. We could set up a custom scaling rule for the ECS service based on the average number of messages in the SQS queue per | ||
| task, so that it would scale up when there are tasks to process and scale down to zero when there are no tasks. | ||
|
|
||
| * Good, because it would allow relatively easy reuse of our existing code. | ||
| * Good, because during high volumes of tasks it would avoid the long start-up time of the webapp container. | ||
| * Good, because it would scale down to zero when there are no tasks, avoiding wasteful resource usage. | ||
| * Good, because it would allow us to implement retry logic using the built-in features of SQS, and would avoid losing tasks if a container | ||
| was stopped or restarted while processing a task. | ||
| * Good, because it would be easy to trigger asynchronous tasks using either Eventbridge Scheduler or by adding a message to the queue from | ||
| the WebApp. | ||
| * Good, because it would not place any additional load on the main WebApp containers. | ||
| * Bad, because it would require some changes to our existing code to poll the queue for messages instead of receiving them via an | ||
| environment variable. | ||
| * Bad, because it would require creating a custom scaling rule for the ECS service, including a lambda function to calculate the metrics, | ||
| which is more complex than using the built-in scaling rules. | ||
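
To make the "average number of messages in the SQS queue per task" idea concrete, the arithmetic behind such a rule might look roughly like the sketch below. This is an illustration only, not the rule proposed in this ADR: the function name, the parameter names and the example values are all hypothetical, and a real rule would be expressed through ECS auto scaling and a custom metric rather than in application code.

```python
import math


def desired_task_count(visible_messages: int,
                       target_backlog_per_task: int,
                       max_tasks: int) -> int:
    """Return how many worker tasks to run for the current queue depth.

    All names here are hypothetical and chosen for illustration.
    """
    if target_backlog_per_task <= 0:
        raise ValueError("target_backlog_per_task must be positive")
    if visible_messages <= 0:
        # Empty queue: scale the service down to zero tasks.
        return 0
    # Run enough tasks that each one has at most the target backlog.
    needed = math.ceil(visible_messages / target_backlog_per_task)
    # Never exceed the configured ceiling, however deep the queue is.
    return min(needed, max_tasks)


if __name__ == "__main__":
    for depth in (0, 1, 25, 500):
        print(depth, desired_task_count(depth, target_backlog_per_task=10, max_tasks=5))
```

The special case for an empty queue is what gives scale-to-zero. It is also the awkward part of a per-task metric, since dividing by a running task count of zero is undefined, which is one plausible reason the metric would need to be calculated separately.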
|
|
||
| ### Switching to Lambda functions | ||
|
|
||
| We could split out each scheduled task into a separate Lambda function, which would be triggered by Eventbridge Schedulers. | ||
|
|
||
| * Good, because we wouldn't need to worry about scaling rules. | ||
| * Good, because we would only be using resources when a task is actually running. | ||
| * Good, because it would avoid the long start-up time of the full webapp container. | ||
| * Neutral, because while we wouldn't get retry logic for free, it would be relatively trivial to set up an SQS queue per lambda to allow for | ||
| this. | ||
| * Neutral, because while we wouldn't get the ability to trigger asynchronous tasks for free, it would be relatively trivial to set up an SQS | ||
| queue per lambda to allow for this. | ||
| * Bad, because we would need to significantly refactor our existing code to split out common functionality into a separate package that | ||
| could be used by the Lambda functions. | ||
| * Bad, because we would need to ensure no tasks take longer than 15 minutes to run, which may not be possible for some tasks. | ||
|
|
||
| ### SQS queue with the main webapp listening to it | ||
|
|
||
| We could have the main WebApp containers read tasks from an SQS queue. Eventbridge Scheduler would then send messages to the SQS queue to | ||
| trigger tasks. | ||
|
|
||
| * Good, because it would allow us to implement asynchronous tasks relatively quickly with minimal changes to our existing code. | ||
| * Good, because it would avoid the long start-up time of the webapp container. | ||
| * Good, because it would allow us to implement retry logic using the built-in features of SQS, and would avoid losing tasks if a container | ||
| was stopped or restarted while processing a task. | ||
| * Good, because it would be easy to trigger asynchronous tasks using either Eventbridge Scheduler or by adding a message to the queue from | ||
| the WebApp itself. | ||
| * Good, because it would avoid the need to create a custom scaling rule for a separate ECS service. | ||
| * Neutral, because while the load on ad-hoc tasks should scale with the load on the webapp itself, there may be times when this is not | ||
| ideal. | ||
| * Bad, because it would place additional load on the main WebApp containers, which could impact performance during periods of high load. | ||
| * Bad, because it would require some changes to our existing code to poll the queue for messages instead of receiving them via an | ||
| environment variable. | ||
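
Several options above rely on polling a queue and on "the built-in features of SQS" for retries. The sketch below is a toy in-memory model of that behaviour, assuming a queue where an unacknowledged message becomes visible again and is moved to a dead-letter list after a maximum number of receives. `ToyQueue`, `drain` and their methods are hypothetical names, not the SQS API, and the model simplifies the timing of a real visibility timeout and redrive policy.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    body: str
    receive_count: int = 0


class ToyQueue:
    """In-memory stand-in for a queue with a redrive policy (not the SQS API)."""

    def __init__(self, max_receive_count: int):
        self.max_receive_count = max_receive_count
        self.visible = deque()
        self.dead_letter = []

    def send(self, body: str) -> None:
        self.visible.append(Message(body))

    def receive(self):
        """Hand out the next message, or None if nothing is visible."""
        if not self.visible:
            return None
        message = self.visible.popleft()
        message.receive_count += 1
        return message

    def delete(self, message: Message) -> None:
        """Acknowledge success; receive() already took the message off the queue."""

    def release(self, message: Message) -> None:
        """Model the visibility timeout expiring on an unacknowledged message."""
        if message.receive_count >= self.max_receive_count:
            self.dead_letter.append(message)
        else:
            self.visible.append(message)


def drain(queue: ToyQueue, handler) -> list:
    """Process messages until none are visible; return the bodies that succeeded."""
    done = []
    while (message := queue.receive()) is not None:
        try:
            handler(message.body)
        except Exception:
            # No delete, so the message comes back for another attempt.
            queue.release(message)
        else:
            queue.delete(message)
            done.append(message.body)
    return done
```

The point of the model is that the worker only acknowledges a message after the task succeeds, so a task interrupted partway through is retried rather than lost, and a task that keeps failing is set aside instead of blocking the queue.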
|
|
||
| ## More Information | ||
|
|
||
| Custom ECS scaling rules: https://aws.amazon.com/blogs/containers/amazon-elastic-container-service-ecs-auto-scaling-using-custom-metrics/ | ||
Teeny detail, but I don't think any of the other ADRs are using ALL CAPS for status