-
Notifications
You must be signed in to change notification settings - Fork 217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementation plan: Ingestion server removal #4026
Conversation
Full-stack documentation: https://docs.openverse.org/_preview/4026 Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub pages deployment action list to see the current status of the deployments. New files ➕: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is soooo exciting! I left a very long comment with a suggestion for an important difference we should ideally take to simplify deployments and maintenance, as well as potentially simplify the DAGs (depending on whether boto3 in a plain Python operator is considered "simpler" than the AWS Airflow operators).
I also took the time to write out the differences in hourly cost for EC2 and ECS, which you can include in the document as an appendix, if you'd like to record evidence for the cost difference. Based on the resource consumption numbers you mentioned, I also looked to see if we could reduce the allocated resources in either approach and compare the costs. EC2 would not work for us here, because anything with less vCPU cuts the RAM in half, with the relevant EC2 instances, and there's nothing smaller than the m5.xlarge that would keep enough RAM to be comfortably above the 52% of 16 GB consumption we saw.
ECS is more flexible, and we can scale down the amount of RAM allocated to each task, but because we still need at least 50% of
EC2 m5.xlarge
0.192/hour
4 vCPU
16 GiB
ECS hourly costs:
- 0.04048/hour/vCPU
- 0.004445/hour/GB
ECS direct comparison:
4 vCPU = 0.16192/hour
16 GB = 0.07112/hour
total = 0.23304/hour
ECS lower RAM:
4 vCPU = 0.16192/hour
10 GB = 0.04445/hour
total = 0.20637/hour
ECS lower RAM and CPU:
2 vCPU = 0.08096/hour
10 GB = 0.04445/hour
total = 0.12541/hour
So there is potentially a way to make ECS less expensive, simply by scaling down the resources we allocate for each task to try to achieve as high utilisation as possible from each one. However, we have the opportunity for much more significant cost savings if we re-write the indexer worker in a language that has tighter memory requirements than Python, by leveraging disk instead of RAM for task storage on the worker, or potentially even just by refactoring the existing Python code to be more consciencious of memory usage. If we can cut memory reliably below 50% consumption, a m4.large with 2 vCPU and 8 GB costs a mere 0.10/hour, cutting costs nearly in half from the m5.xlarge.
Even if ECS could be less expensive now, optimising the indexer workers to maximise resource utilisation with EC2 will always result in cheaper running costs than ECS, vCPU-per-vCPU and GB-per-GB. Planning this project with EC2 sets us up well for the opportunity of long-term savings through code optimisation. With the ASG approach I've outlined, I think it also brings virtually every single benefit we've come to enjoy from ECS to the EC2 based workers.
.../proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
Outdated
Show resolved
Hide resolved
- Use Airflow's | ||
[EC2Hook#describe_instances](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/hooks/ec2/index.html#airflow.providers.amazon.aws.hooks.ec2.EC2Hook.describe_instances) | ||
to get a list of internal URLs bound to the indexer workers for the | ||
appropriate environment. In the local environment, this will instead return | ||
the URL for the catalog-indexer-worker Docker container. | ||
- Use | ||
[EC2Hook#start_instances](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/hooks/ec2/index.html#airflow.providers.amazon.aws.hooks.ec2.EC2Hook.start_instances) | ||
to start the appropriate EC2 instances. This step should skip in a local | ||
environment. | ||
- Use dynamic task mapping to generate a Sensor for each of the expected indexer | ||
workers, and ping its `healthcheck` endpoint until it passes. | ||
- Use dynamic task mapping to distribute reindexing across the indexer workers | ||
by first calculating `start` and `end` indices that will split the records in | ||
the media table into even portions, depending on the number of workers | ||
available in the given environment. Then: | ||
- POST to each worker's `reindexing_task` endpoint the `start_index` and | ||
`end_index` it should handle | ||
- Use a Sensor to ping the worker's `task/{task_id}` endpoint until the task | ||
is complete, logging the progress as it goes | ||
- Use the | ||
[EC2Hook#stop_instances](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/hooks/ec2/index.html#airflow.providers.amazon.aws.hooks.ec2.EC2Hook.stop_instances) | ||
to shut the instance down. This step should skip in a local environment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mention using ec2-service
later on, and in thinking further about the implications of that compared to creating EC2 instances directly, I think we have an opportunity to leverage the functionality of an auto-scaling group to trivially get somewhat automatically updating indexer workers, along with simplifying the DAGs set of tasks by removing most of the need for mapped tasks.
When managing EC2 instances directly (what we do currently), we start the instances once when we provision them, and then immediately shut them down. When the instances start again, a cron job is responsible for restarting the docker compose services1. All of this is configured once in the user data script at instance creation. User data scripts run only on instance creation and never again (at least not automatically, I'm sure you could locate the script and run it manually if you wanted to).
With ASGs, however, we have some interesting capabilities that would make deploying even a bumped version tag (if we did not use a new tag to point to the "release" we want). The critical thing to know about ASGs is that they do not rely on stopped instances, but rather terminated instances. Meaning the instances literally get created and destroyed as needed, rather than started and stopped. This is important because it means the user data script runs every time an ASG instance spins up. The user data script is defined in the launch template, which in turn means that a "deployment" for a correctly leveraged ASG should mean just updating the launch template and then replacing each running instance one-by-one (or in whatever pattern makes sense for the given service). If we ran the API or frontend in an ASG again, that's how we would do it, rather than needing to create an entirely new ASG each time.
For the indexer workers, what does this mean? It means we can actually basically ignore Ansible for them, and instead completely leverage Terraform and only Terraform, significantly simplifying their management. Rather than running an Ansible playbook that would one-by-one start and then SSH into each indexer worker, we would just push an update to the ASG launch template. The next time any instance starts in that ASG, it will use the new launch template, and therefore have whatever updates we pushed to it.
This somewhat blends our old approach with the new approach, and basically sets up the best part of ECS (the need to only define a task-definition) with the ASG (by using a launch template, instead of task definition).
Then, in Airflow, rather than starting and stopping individual EC2 instances, we just do the following:
- Query for the relevant ASG for the environment we need workers for
- Issue a
set-desired-capacity
command to the ASG to set it to the capacity used for the data refresh - Query the ASG using
describe-auto-scaling-group
, which returns the ASG health and a list of all the instances with health status, and wait until instances are available to accept tasks (some or all are healthy). - Continue with the non-ASG approach as described in the IP, just use the ASG's instances list instead of querying for the EC2 instances matching the tag (and of course skip starting the instances).
- As each instance finishes everything the DAG needs it to do, tell the ASG to spin it down using manual specific instance termination which also allows us to decrement the desired capacity at the same time. This bascially tells the ASG: terminate this instance, and don't replace it.
This has some significant benefits over the EC2 approach:
- Deployments will take literally seconds (updating a launch template is nearly instant)
- Instances do not need to know how to terminate themselves, the DAG will do that for us
- We can take a completely "hands-free" approach to the instances, thinking only in terms of launch templates and ASGs for the most part, eliminating the need for any new Ansible playbooks.
- Additionally, we can use SSM parameters for selecting the AMI at launch time rather than using a static AMI ID, meaning we have one less place to worry about updating system dependencies (which is the main headache long-running EC2 instances carry, in my opinion).
In the future, we could even employ a target group with the internal load balancer and use target group stickiness configured with a custom app cookie, where each task responsible for one of the indexer workers generates a unique cookie value for the lifetime of the DAG run. Then the target group will automatically route requests to the same instance without the task needing to maintain a reference to the instance or us ever needing to even list the individual instances. That's definitely out of scope here, but wanted to share an example of what ASG's get us.
To summarise, rather than directly interacting with EC2, we can interact with the ASG. I don't think we can use AWS operators for this, there don't appear to be ones for ASGs, but boto3 supports them, so we just need to write our own interactions (like we do for s3). This eliminates the need for the Ansible playbook, like I said, so it will meaningfully reduce the amount of work done on the infrastructure side, in addition to reducing the number of mapped tasks in the DAG.
Footnotes
-
Which, by the way, really is not necessary if we just used the docker service as intended, with an appropriate restart policy, but that's neither here nor there ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh also, I forgot about this difference with ECS. We already have a process for automatically updating the ECS task definition when we cut a new release. We do not have that for launch templates, which are roughly the equivalent for EC2 instances (and the ASG approach). But, as you've said in the plan, we do not often update the code for the workers, so that isn't a big concern. Additionally, it would be good to keep updates to the launch templates manual (e.g., running tf apply
) so that we make sure that we only do it when a data refresh is not running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sarayourfriend this is incredible 😍 I'm going to take some time to do a bit more background reading on this myself, but this is really exciting.
One possible correction is that while the indexer workers currently shut themselves down, in my proposed implementation the DAG would send the stop_instances
request. (However as you've noted that's only in the context of a data refresh; they need to shut themselves down after deploying. I also had not realized that we're currently using a cron job to restart the stack when the instances start 😮)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1-ing Staci's note - thank you for this @sarayourfriend! I hadn't thought about leveraging ASGs here for this, but it feels so right. Definitely on board with this approach!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, thanks for the clarification there, Staci.
they need to shut themselves down after deploying
That would no longer be necessary using the ASG, just to clarify 😁
I also had not realized that we're currently using a cron job to restart the stack when the instances start 😮
Me neither!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sarayourfriend perhaps to your point above, during a data refresh we could set the ASG to use a fixed version of the template rather than latest
, and then switch the config back to latest
after the refresh is complete. That should be possible with boto3.
Edit: or perhaps it would be simplest to always use a fixed version and bump it whenever we deploy a new launch template definition 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
during a data refresh we could set the ASG to use a fixed version of the template rather than latest, and then switch the config back to latest after the refresh is complete.
Pinning the versions in Airflow is a really interesting idea 🤔 I think we could do something like this to pin/unpin the launch template version, but if the user_data script is always pulling the latest
tag on the Docker image then code changes will be able to come through (for new instances) regardless, right? Although maybe those changes are okay to make mid-refresh.
One thing to note is that if an indexer worker fails for any reason, the Airflow wait_for_reindexing
task will fail and won't automatically retry with a new worker instance. If the worker failed due to some issue with the reindexing, Airflow will terminate the instance to ensure it’s not left running (and prevent the ASG from making a new one). If the instance itself crashes for some reason and the ASG tears it down, then the Airflow wait_for_reindexing
task will still fail because it will try to poll a terminated instance. But the ASG will spin up a new one and the Airflow task will not know how to connect to resume, or to terminate the new instance gracefully. I think we’d need an extra Airflow task at the end to account for this 😓 But in either case, if a worker fails the DAG won’t be able to retry on a new instance automatically.
This discussion has made me realize that if we use an ASG instead of the more naive EC2 approach, we lose the ability to retry a single failed reindexer worker, because when a worker is terminated and a new one is spun up, there is no way to clear a single trigger_reindex
task (because it will be trying to connect to the terminated instance). The best thing I have come up with is making distributed_reindex
a reusable TaskGroup that accepts a list of worker params, and then having an entirely separate DAG that accepts worker params as a conf option and can be used to manually retry parts of a failed data refresh. This feels very convoluted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although as we pointed out recently, if a single reindexing task fails and we're forced to retry all of them (in this case by clearing the DAG at the task that sets the ASG desired capacity), it's not any slower than retrying a single indexer worker because they run concurrently. The loss there is in the literal cost of unnecessarily running the other workers for however many hours the reindexing takes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to post back-to-back 😅 but -- the inability to retry a single reindexing task has an advantage in that it actually fixes the problem of ensuring that all the indexer workers are running the exact same code and have the exact same configuration. The trade-off is just the cost of running the successful workers in scenarios where the configuration hadn't changed, which is perhaps fair.
This would allow us to use latest
for both the launch template version and docker tag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should generate AMIs for the indexer workers with the Docker image they should run bundled into them: https://github.com/WordPress/openverse-infrastructure/issues/843#issuecomment-2058094706
The launch template would always point to the latest AMI, and a new AMI would only be generated for production manually (we can also make it manual for staging if we want). We can configure the ASG to replace instances using the launch template they were running before, rather than using the latest launch template version (or pin the ASG to a specific launch template version rather than $Latest
, and manually bump it when data refresh isn't running for that environment if a new AMI exists, maybe even just using a DAG that runs and checks if data refresh is running before bumping the launch template to the new AMI version).
The AMI approach has a lot of benefits over what I suggested before:
- Determinism: we know exactly what dependencies and Docker image will run
- Those are all pre-installed/downloaded to the AMI, meaning the instance has zero work to do at boot, just start the service
- No dependencies on GHCR or other external services
To avoid blocking this project, I want to spend some time looking at the AMI approach I shared above and tinker with the existing indexer worker code as the proof of concept. We need a proof of concept anyway for Airflow/Kibana/other EC2 based services, and the indexer workers are a very simple baseline to use.
Co-authored-by: sarayourfriend <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an incredibly thorough implementation plan, thank you for taking the time to get this all "on paper"! I have a few questions and notes for clarification, and I'm really excited about Sara's suggestion for how to manage the EC2 instances in a way that will be more hands-off 🚀
.../proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
Outdated
Show resolved
Hide resolved
.../proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
Outdated
Show resolved
Hide resolved
.../proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
Show resolved
Hide resolved
.../proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
Outdated
Show resolved
Hide resolved
The ECS approach allows us to remove 100% of the server management code, and | ||
eliminates the need for deployments in essentially every case (unless the task | ||
definition itself changes). However it is more expensive, more difficult to | ||
implement, and requires a special secondary implementation for running locally | ||
that may be prone to cross-platform issues. | ||
|
||
The EC2 approach on the other hand is cheaper, quicker to implement, and gets us | ||
the vast majority of what we want in terms of removing complexity and reducing | ||
the need for deployments. Consequently I argue here that the EC2 approach is the | ||
one we should pursue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to say, I still would like to pursue the ECS approach, even with the disadvantages you mention 😅 But, that's not a blocking concern, merely a personal desire to get rid of the last vestiges of the seemingly-unnecessary webserver aspects of the data refresh process 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the inability to realistically test the data refresh locally, needing an entirely separate process locally, is pretty significant.
seemingly-unnecessary webserver aspects
In what sense are they unnecessary? ECS still runs a server, it eliminates a small amount of deployment complexity, but it isn't complexity we don't have elsewhere and can solve to make as or almost as simple as our ECS deployments (which would be good for making Airflow, Kibana, Elasticsearch, etc all simpler and easier to understand and maintain).
The difference in long-term cost is significant, with a bit of up-front infrastructure work that will build off of and benefit the work we've done for other services. Why avoid EC2 if we can and regardless need to make it simple to use for the sake of our other services? We have a low level of understanding of EC2, but we need to improve that, and as we have done, everything about the process improves and eliminates the relative complexity of it compared to ECS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seemingly-unnecessary webserver aspects
The indexer workers run their own webserver in order to accept tasks and issue progress on them. That code could be completely obviated if the containers which were to run on ECS received the range they needed to index from Airflow and then ran directly on that chunk. The containers could spin up, use the range supplied via CLI to run, and then complete and exit. This would remove the need for maintaining any webserver aspects for the indexer workers, since the containers would only be running as long as it takes to index everything.
To be clear, I'm comfortable with the current approach proposed here nonetheless.
- Use Airflow's | ||
[EC2Hook#describe_instances](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/hooks/ec2/index.html#airflow.providers.amazon.aws.hooks.ec2.EC2Hook.describe_instances) | ||
to get a list of internal URLs bound to the indexer workers for the | ||
appropriate environment. In the local environment, this will instead return | ||
the URL for the catalog-indexer-worker Docker container. | ||
- Use | ||
[EC2Hook#start_instances](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/hooks/ec2/index.html#airflow.providers.amazon.aws.hooks.ec2.EC2Hook.start_instances) | ||
to start the appropriate EC2 instances. This step should skip in a local | ||
environment. | ||
- Use dynamic task mapping to generate a Sensor for each of the expected indexer | ||
workers, and ping its `healthcheck` endpoint until it passes. | ||
- Use dynamic task mapping to distribute reindexing across the indexer workers | ||
by first calculating `start` and `end` indices that will split the records in | ||
the media table into even portions, depending on the number of workers | ||
available in the given environment. Then: | ||
- POST to each worker's `reindexing_task` endpoint the `start_index` and | ||
`end_index` it should handle | ||
- Use a Sensor to ping the worker's `task/{task_id}` endpoint until the task | ||
is complete, logging the progress as it goes | ||
- Use the | ||
[EC2Hook#stop_instances](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/hooks/ec2/index.html#airflow.providers.amazon.aws.hooks.ec2.EC2Hook.stop_instances) | ||
to shut the instance down. This step should skip in a local environment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1-ing Staci's note - thank you for this @sarayourfriend! I hadn't thought about leveraging ASGs here for this, but it feels so right. Definitely on board with this approach!
.../proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
Outdated
Show resolved
Hide resolved
.../proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
Outdated
Show resolved
Hide resolved
.../proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
Outdated
Show resolved
Hide resolved
.../proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
Show resolved
Hide resolved
Drafting to address comments :) |
Updates are up and I've re-requested reviews. Particularly interested in folks' thoughts on the discussion toward the end of this thread and the conclusions I came to here :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the revisions, I think all my concerns are assuaged! I'm so stoked for this 🤩
.../proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
Show resolved
Hide resolved
.../proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
Outdated
Show resolved
Hide resolved
.../proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
Outdated
Show resolved
Hide resolved
Co-authored-by: Madison Swain-Bowden <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM.
I've left a couple comments describing how the AMI approach I talk about in this comment in the infrastructure repository would solve some of the edge cases in this PR introduced by using the "always pull latest" docker image approach. None of that blocks this plan though, we can move forward with it as is, and the AMI approach would supplant it when eventually implemented. Of course, it could be worth noting the AMI approach at least in general, so that if it is available before the implementation actually reaches the step of creating the ASG, that it should be used instead.
.../proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
Outdated
Show resolved
Hide resolved
.../proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
Show resolved
Hide resolved
.../proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
Outdated
Show resolved
Hide resolved
- The `user_data` script should be updated to pull the Docker image with the | ||
`latest` tag. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a nit to note that this should change if we use the AMI approach, which I think we should, and I aim to have resolved so as not to block the infrastructure work of this IP. It's a conceptual/foundational piece of our ASG-based deployment puzzle that needs to be solved one way or another. It doesn't need to block this project, so you don't need to change from this described approach here using docker to pull the latest image, and the AMI approach builds on that. But just in case we manage to sort it out before we get to this point of the plan, it's worth noting the difference.
Fixes
Fixes #3981 by @stacimc
Description
This PR adds the implementation plan for removing the ingestion server, and moving the data refresh into Airflow.
Reviewers
I have requested reviews from @sarayourfriend and @AetherUnbound. @sarayourfriend's deep knowledge of the infrastructure, and in particular the ongoing work to the catalog deployments, would be extremely beneficial. @AetherUnbound also has expertise in our infrastructure and catalog, and in orchestrating similar distributed tasks using Airflow.
Testing Instructions
Please read the IP following the decision-making process defined below.
Decision-making process
This discussion is following the Openverse decision-making process. Information about this process can be found on the Openverse documentation site. Requested reviewers or participants will be following this process. If you are being asked to give input on a specific detail, you do not need to familiarise yourself with the process and follow it.
Current round
This discussion is currently in the Decision round. The deadline for review of this round is April 15 2024. Note the deadline would typically be April 16, but is extended to account for my own AFK.
Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin