From 4ddb65ddf5877d9a9c90128444ff0c4a3b2e237f Mon Sep 17 00:00:00 2001 From: Staci Cooper Date: Wed, 3 Apr 2024 11:22:57 -0700 Subject: [PATCH 01/15] Add initial plan --- ...mentation_plan_ingestion_server_removal.md | 586 ++++++++++++++++++ 1 file changed, 586 insertions(+) create mode 100644 documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md diff --git a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md new file mode 100644 index 00000000000..f0d748a917b --- /dev/null +++ b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md @@ -0,0 +1,586 @@ +# 2024-03-28 Implementation Plan: Ingestion Server Removal + +**Author**: @stacimc + + + + +## Reviewers + + + +- [ ] @sarayourfriend +- [ ] @AetherUnbound + +## Project links + + + +- [Project Thread](https://github.com/WordPress/openverse/issues/3925) +- [Project Proposal](/projects/proposals/ingestion_server_removal/20240319-project_proposal.md) + +## Overview + + + +The critical data refresh process is orchestrated by Airflow via the +`_data_refresh` DAGs, but the bulk of the work is merely triggered +by Airflow on a remote ingestion server. Due to the reasons laid out in the +Project Proposal, we will be moving all operational pieces of the data refresh +into Airflow itself, and removing the ingestion server entirely. + +The work which must be moved into Airflow can be split into these abstract +steps, which will be discussed separately in this IP: + +- **Copy Data**: A FDW extension is used to connect the API database to the + upstream (catalog) database. The entire contents of the upstream media table + are copied into a new temp table in the API database. This temp table will + later replace the main media table in the API. +- **Clean Data**: Data clean up which includes removing denylisted tags and + cleaning URLs. This step will become unnecessary when the + [Catalog Data Cleaning project](https://github.com/WordPress/openverse/issues/430) + is completed, which moves all data cleaning steps into ingestion. +- **Create Index**: Create a new Elasticsearch index, matching the configuration + of the existing media index. +- **Distributed Reindex**: Convert each record from the new temp table to the + format required by an Elasticsearch document, and then reindex them into the + newly created index. Because this step is by far the most time consuming, the + reindex is distributed across multiple **indexer workers**, which are + themselves running on EC2 instances (6 in production, 2 in staging). Each + indexer worker serves a small API which can be used to trigger a reindexing + task for an equal portion of the work. The workers are started by the + ingestion-server when needed, and automatically spun down when complete. +- **Create and Populate Filtered Index**: Create a new Elasticsearch index + matching the configuration of the existing _filtered_ index, and then reindex + documents into it from the new media index, applying appropriate filtering for + sensitive terms. +- **Reapply Constraints**: Recreate indices and constraints from the original + API table on the new temp table. 
+- **Promote Table**: Drop the old media table in the API and rename the temp + table and its indices, which has the effect of promoting them/replacing the + old table. +- **Promote Index**: Promote the new Elasticsearch index by unlinking the given + alias from the existing index and moving it to the new one. (Used for both the + main and filtered indices.) +- **Delete Index**: Delete the old Elasticsearch index. (Used for both the main + and filtered indices.) + +In addition to the data refresh DAGs, some combination of these steps on the +ingestion-server are also used by: + +- `recreate_full_staging_index` DAG +- `create_filtered__index` DAGs +- Our `load_sample_data` scripts, which use the ingestion server to load sample + data from the catalog into the API and Elasticsearch + +This IP includes details for moving all steps into Airflow. The bulk of the plan +deals with the `Distributed Reindex` step, which is the only step that is not +straightforward to move. + +## Expected Outcomes + + + +When this work is completed we will be able to: + +- Manage data refreshes entirely in Airflow, without the use of the ingestion + server +- Completely remove the staging and production ingestion servers, as well as all + related code in the monorepo and infrastructure repos +- Remove the `recreate_full_staging_index` DAG + +## Approach to the Distributed Reindex + +The majority of the data refresh process can be implemented in Airflow with only +slight refactoring, using our existing +[Postgres hooks/operators](https://github.com/WordPress/openverse/blob/05ff48d05f2163104151c5589cf352a156bc6a97/catalog/dags/common/sql.py) +and reusable +[Elasticsearch tasks](https://github.com/WordPress/openverse/blob/05ff48d05f2163104151c5589cf352a156bc6a97/catalog/dags/common/elasticsearch.py). +The exception is the _distributed reindex_. + +The simplest approach would be to remove the indexer workers entirely in favor +of doing the reindexing in parallelized dynamic tasks in Airflow. However, +currently the indexer workers and catalog EC2 instances are the same size +(m5.xlarge), and each of the six production workers were observed to use up to +~25% CPU and ~52% memory utilization during reindexing. The expense of +permanently increasing resources on the catalog instance to support reindexing +(which may happen concurrently with popularity refreshes, ingestion, and all +other processes that occur on the catalog) would significantly exceed the cost +of maintaining remote workers that can be run only when needed. + +Two other options were evaluated for this IP: + +- Simply move the 8 total existing EC2 worker instances into the catalog, and + connect to them directly from Airflow. +- Remove the EC2 instances entirely in favor of a new ECS task definition using + the `indexer-worker` image, which would remove all API code and contain only + the reindexing script. Airflow would spin up the appropriate number of ECS + tasks when needed. + +### Comparison of EC2 and ECS + +The following areas of comparison were considered when choosing between the two +approaches: + +#### Cost + +Maintaining the +[EC2 worker instances](https://calculator.aws/#/createCalculator/ec2-enhancement), +provided we continue to automatically stop the instances when not in use, would +be cheaper than using [ECS](https://calculator.aws/#/createCalculator/Fargate). 
+Notably both solutions are _significantly_ cheaper than the current
+implementation, because the bulk of the cost comes from constantly running the
+two m5.xlarge EC2 instances for the ingestion servers.
+
+#### Ease of infrastructure development
+
+EC2 is also likely the easier solution to implement, because it requires so
+few changes to the indexer-workers as they are currently configured. For ECS,
+less total infrastructure scaffolding is required, but it is a greater departure
+from the current setup.
+
+#### Code maintainability
+
+The ECS approach requires slightly less code, because it would allow us to
+completely remove all of the server logic needed in the EC2 instances. The
+Docker image need only contain the script for reindexing.
+
+For the EC2 instances we would continue to serve a minimal API with the
+following endpoints:
+
+- `healthcheck`
+- `reindexing_task` - Trigger the reindexing task for the given parameters
+  (e.g., run the script)
+- `task/{task_id}` - Get the status of the running task
+
+That being said, the API needed for the indexer worker is simple and small. The
+vast majority of the complexity in the data refresh is in the _ingestion
+server's_ API, all of which would still be removed.
+
+#### Supporting local testing & development
+
+This is the most significant area in which EC2 is preferable to the ECS
+approach. With the EC2 approach, assigning work to the workers requires starting
+the worker instances and then triggering the reindexing task on each worker by
+POSTing to its `reindexing_task` endpoint. Locally, we would simply POST to our
+local Docker container rather than an EC2 instance. Minimal work is needed to
+adapt the process to work in a local environment, and the DAG when run locally
+is as close as possible to its production counterpart.
+
+For the ECS approach, in production we must spin up ECS instances using
+Airflow's
+[EcsRunTaskOperator](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/operators/ecs.html#run-a-task-definition).
+Simulating this in a local development environment without actually connecting
+to AWS is less straightforward, and requires supporting a separate workflow for
+local environments. For example we could have a `branch` operator that checks
+the environment and proceeds to the EcsRunTaskOperator in production to spin up
+ECS tasks, or a
+[DockerOperator](https://airflow.apache.org/docs/apache-airflow-providers-docker/stable/_api/airflow/providers/docker/operators/docker/index.html)
+in local development to spin up a worker as a local Docker container.
+
+This is still relatively complex, however, because the worker Docker containers
+must be launched from Airflow, which is itself dockerized. This requires either
+using a Docker-in-Docker approach, which should typically be avoided, or
+modifying the configuration to allow Airflow to launch sibling containers[^1].
+Requirements differ across operating systems, yet this **must** be tested and
+maintained across platforms because the data refresh is used as a critical step
+in setting up our local development environment.
+
+#### Support for progress tracking
+
+With the EC2 approach it is trivial to calculate the `percentage_complete` for
+the reindexing task, and report to Airflow via the task status endpoint.
+Airflow's ECS Operators do not have out-of-the-box support for streaming the
+CloudWatch logs from the task while it is running, so we would have less insight
+into the status of the task from Airflow.
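+
+As a rough sketch (not the final implementation) of how that polling could
+look from the DAG; the port, endpoint shape, and response fields shown here
+are illustrative assumptions:
+
+```python
+import requests
+from airflow.sensors.python import PythonSensor
+
+
+def _reindex_complete(worker_url: str, task_id: str) -> bool:
+    # Ask the worker for the status of its running reindexing task.
+    status = requests.get(f"{worker_url}/task/{task_id}", timeout=10).json()
+    # Surface the worker-reported progress in the Airflow task logs.
+    print(f"Reindex {status.get('percentage_complete', 0)}% complete")
+    return not status.get("active", True)
+
+
+# Defined inside the data refresh DAG; pokes until the worker reports that
+# the task is no longer active.
+wait_for_reindex = PythonSensor(
+    task_id="wait_for_reindex",
+    python_callable=_reindex_complete,
+    op_kwargs={"worker_url": "http://indexer-worker:8003", "task_id": "example"},
+    poke_interval=60,
+    timeout=60 * 60 * 24,
+)
+```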
+
+#### Impact on Deployments
+
+This is the greatest strength of the ECS approach. Each time an ECS task is spun
+up on Fargate, it will pull the Docker image configured in the task definition.
+By using the `latest` tag in this configuration we can ensure that the latest
+indexer-worker image is pulled each time, which means that any changes to the
+reindexing code will become live in production as soon as the new docker image
+is published after merging on `main`.
+
+For the EC2 approach, any changes to the reindexing script or its dependencies
+would require deploying the catalog (and by extension the indexer workers).
+However:
+
+- No deployment would be required for changes to the vast majority of the data
+  refresh implementation, including the data copy, filtered index creation,
+  alias management, and all aspects of orchestrating the indexer workers.
+- _If_ deploying the indexer-worker becomes a frequent issue in the future, we
+  could iterate on the data refresh DAG to automatically update the workers. For
+  example we could use an Airflow
+  [SSHOperator](https://docs.aws.amazon.com/mwaa/latest/userguide/samples-ssh.html)
+  to connect to the instances and pull the latest image before triggering the
+  task. I think this is unlikely enough to be a problem that this does not need
+  to be a requirement of the initial implementation.
+
+### Conclusion
+
+The ECS approach allows us to remove 100% of the server management code, and
+eliminates the need for deployments in essentially every case (unless the task
+definition itself changes). However it is more expensive, more difficult to
+implement, and requires a special secondary implementation for running locally
+that may be prone to cross-platform issues.
+
+The EC2 approach on the other hand is cheaper, quicker to implement, and gets us
+the vast majority of what we want in terms of removing complexity and reducing
+the need for deployments. Consequently I argue here that the EC2 approach is the
+one we should pursue.
+
+## Step-by-step plan
+
+
+
+Because the data refresh process is critical, we will not remove the ingestion
+server or the existing data refresh DAGs until the new process has been
+performed successfully in staging and production. The new data refresh will be
+developed as separate DAGs alongside the current ones.
+
+1. Create a new data refresh factory to generate new data refresh DAGs for each
+   media type for staging and production, with the
+   `{environment}_{media_type}_data_refresh` DAG id. For simplicity of code
+   review, these initial DAGs should only perform the `Copy Data` and
+   `Create Index` steps, which we will perform entirely in Airflow.
+1. Create a new `catalog-indexer-worker` in the Catalog, and build the new
+   indexer worker image locally.
+1. Add the distributed reindexing step to the DAG (excludes infrastructure
+   work).
+1. Add the `catalog-indexer-worker` to the Terraform configuration for the
+   catalog, and add steps to deploy them in the related Ansible playbooks.
+1. Add all remaining steps to the data refresh DAGs:
+   `Create and Populated Filtered Index`, `Reapply Constraints`,
+   `Promote Table`, `Promote Index`, `Delete Index`.
+1. Deploy the catalog and the new indexer-workers configuration.
+1. Run the staging audio data refresh, followed by the staging image data
+   refresh.
+1. Once successful, run the production audio data refresh and image data
+   refresh.
+1. Update the `create_filtered_{media_type}_index` DAGs to remove use of the
+   ingestion server.
+1. 
Drop the `recreate_full_staging_index` DAG, which can be removed entirely in + favor of the staging data refresh. +1. Update the `load_sample_data.sh` script to run the DAG instead of using the + ingestion server. +1. Fully remove the ingestion server and related infrastructure. + +There are few opportunities for multiple streams of work. Updating the +`create_filtered__index` DAGs can happen at any time. + +## Step details + + + +While this may look daunting, it should be noted that with very few exceptions +the work described below is refactoring of existing logic. Links to the source +files are included for convenience. + +### Create the new data refresh DAG factory and move initial steps into Airflow + +In this step, we'll create a new data refresh DAG factory to generate data +refresh DAGs for each existing media_type and environment. Currently these four +will be generated: + +- staging_audio_data_refresh +- staging_image_data_refresh +- production_audio_data_refresh +- production_image_data_refresh + +Because the `environment` is added as a suffix, there will be no collision with +the existing DAG ids. In this initial step, we we will add only a small portion +of the logic in order to make the PR easier to review. The first steps are +already implemented in the current data refresh and can simply be copied: + +- Get the current record count from the target API table; this must be modified + to take the `environment` as an argument +- Perform concurrency checks on the other data refreshes and conflicting DAGs; + this must be modified to include the now larger list of data refresh DAG ids +- Get the name of the Elasticsearch index currently mapped to the `target_alias` +- Generate the new index suffix + +We will include new tasks to perform the initial few steps of the ingestion +server's work: + +- Copy Data: this should be a TaskGroup that will have multiple tasks for + creating the FDW from the upstream DB to the downstream DB, running the + copy_data query, and so on. It should fully replace the implementation of + [`refresh_api_table`](https://github.com/WordPress/openverse/blob/05ff48d05f2163104151c5589cf352a156bc6a97/ingestion_server/ingestion_server/ingest.py#L248) + in the ingestion server. All steps in this section are SQL queries that can be + implemented using the existing + [PostgresHook and PGExecuteQueryOperator](https://github.com/WordPress/openverse/blob/05ff48d05f2163104151c5589cf352a156bc6a97/catalog/dags/common/sql.py). +- Create Index: we can use our existing + [Elasticsearch tasks](https://github.com/WordPress/openverse/blob/05ff48d05f2163104151c5589cf352a156bc6a97/catalog/dags/common/elasticsearch.py#L82) + to create the new elasticsearch index with the index suffix generated in the + previous task. + +### Implement new catalog indexer worker + +In this step we will create the new catalog-indexer-worker Docker image, and add +distributed reindexing to the new DAGs. This step does not include the +infrastructure work to actually create the new EC2 instances. + +First we will create a new `indexer-worker` directory under +`catalog/dags/data_refresh`, which will contain the contents of the new indexer +worker. **This implementation already exists in the ingestion server. 
The
+relevant pieces can be pulled out and refactored slightly to fit the new, much
+smaller image.** Broadly, this is the mapping of existing files to new files
+needed:
+
+- `api.py` will define the API for the worker, and is refactored from the
+  existing
+  [`indexer_worker.py`](https://github.com/WordPress/openverse/blob/05ff48d05f2163104151c5589cf352a156bc6a97/ingestion_server/ingestion_server/indexer_worker.py).
+  It must be refactored to add task state and a `task_status` endpoint, which
+  takes a `task_id` and returns the status and progress of the given task.
+- `indexer.py` will contain the logic for the actual indexing task. It will be
+  refactored from the existing
+  [`indexer.py`](https://github.com/WordPress/openverse/blob/05ff48d05f2163104151c5589cf352a156bc6a97/ingestion_server/ingestion_server/indexer.py);
+  specifically all we need is the
+  [`replicate`](https://github.com/WordPress/openverse/blob/05ff48d05f2163104151c5589cf352a156bc6a97/ingestion_server/ingestion_server/indexer.py#L157)
+  function.
+- `elasticsearch_models.py`, pulled from the file of the
+  [same name](https://github.com/WordPress/openverse/blob/05ff48d05f2163104151c5589cf352a156bc6a97/ingestion_server/ingestion_server/elasticsearch_models.py)
+  in the ingestion server. Defines a mapping from a database record to an
+  Elasticsearch document.
+- Utility files for helper functions for connecting to Elasticsearch and
+  Postgres (e.g.
+  [`es_helpers.py`](https://github.com/WordPress/openverse/blob/main/ingestion_server/ingestion_server/es_helpers.py)).
+
+The
+[`Dockerfile`](https://github.com/WordPress/openverse/blob/main/ingestion_server/Dockerfile)
+can be copied from the existing ingestion server. It should be updated to
+reference the new file structure, and to expose only a single port, which should
+be distinguished from the ports currently in use by the ingestion server (8001
+and 8002). Other necessary files, including `env.docker`, `.dockerignore`,
+`Pipfile`, and `gunicorn.conf.py` can all be copied in from the existing
+ingestion server as well.
+
+Finally we will update the monorepo's root
+[`docker-compose.yml`](https://github.com/WordPress/openverse/blob/05ff48d05f2163104151c5589cf352a156bc6a97/docker-compose.yml)
+to add a new `catalog-indexer-worker` service. Its build context should point to
+the nested `data_refresh/indexer_worker` directory, and it should map the
+exposed port to enable the API to be reached by the catalog.
+
+When this work is complete, it should be possible to run `just catalog/ipython`
+and curl the new indexer worker. The existing ingestion-server and
+indexer-worker services are unaffected (it is still possible to run legacy data
+refreshes locally and in production).
+
+### Implement distributed reindexing locally
+
+In this step we will add tasks to the data refresh DAGs to orchestrate the
+distributed reindex. At the end of this step, it will be possible to run a
+distributed reindex locally. Because the infrastructure work to create the EC2
+instances is not complete, it can not be run on production yet. The following
+code can all be refactored from
+[`distributed_reindex_scheduler.py`](https://github.com/WordPress/openverse/blob/main/ingestion_server/ingestion_server/distributed_reindex_scheduler.py).
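+
+As a rough sketch of the fan-out that the following list describes (all task
+names, URLs, and response fields here are illustrative assumptions rather than
+the final code):
+
+```python
+import requests
+from airflow.decorators import task
+
+
+@task
+def calculate_ranges(record_count: int, worker_count: int) -> list[dict]:
+    # Split the media table into contiguous, roughly equal ranges.
+    chunk_size = -(-record_count // worker_count)  # ceiling division
+    return [
+        {
+            "start_index": i * chunk_size,
+            "end_index": min((i + 1) * chunk_size, record_count),
+        }
+        for i in range(worker_count)
+    ]
+
+
+@task
+def trigger_reindex(worker_url: str, work_range: dict) -> str:
+    # POST one range to one worker; a downstream sensor then polls the
+    # returned task_id until the reindex completes.
+    response = requests.post(
+        f"{worker_url}/reindexing_task", json=work_range, timeout=10
+    )
+    response.raise_for_status()
+    return response.json()["task_id"]
+
+
+# Dynamic task mapping fans this out, one mapped task per worker, e.g. with
+# trigger_reindex.expand_kwargs(...) over zipped worker URLs and ranges.
+```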
+ +- Use Airflow's + [EC2Hook#describe_instances](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/hooks/ec2/index.html#airflow.providers.amazon.aws.hooks.ec2.EC2Hook.describe_instances) + to get a list of internal URLs bound to the indexer workers for the + appropriate environment. In the local environment, this will instead return + the URL for the catalog-indexer-worker Docker container. +- Use + [EC2Hook#start_instances](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/hooks/ec2/index.html#airflow.providers.amazon.aws.hooks.ec2.EC2Hook.start_instances) + to start the appropriate EC2 instances. This step should skip in a local + environment. +- Use dynamic task mapping to generate a Sensor for each of the expected indexer + workers, and ping its `healthcheck` endpoint until it passes. +- Use dynamic task mapping to distribute reindexing across the indexer workers + by first calculating `start` and `end` indices that will split the records in + the media table into even portions, depending on the number of workers + available in the given environment. Then: + - POST to each worker's `reindexing_task` endpoint the `start_index` and + `end_index` it should handle + - Use a Sensor to ping the worker's `task/{task_id}` endpoint until the task + is complete, logging the progress as it goes + - Use the + [EC2Hook#stop_instances](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/hooks/ec2/index.html#airflow.providers.amazon.aws.hooks.ec2.EC2Hook.stop_instances) + to shut the instance down. This step should skip in a local environment. + +### Create the Terraform and Ansible resources needed to deploy the new indexer workers + +In this step we will add the resources needed to actually deploy the new indexer +workers. At the time that this IP is being written, work to +[deploy Airflow with Ansible](https://github.com/WordPress/openverse-infrastructure/pull/829) +is actively underway. This IP builds off of this implementation and is blocked +by its success in production. + +The goals are to: + +- Add the Terraform resources for the indexer worker to + [`next/production/airflow.tf`](https://github.com/WordPress/openverse-infrastructure/blob/7a8fdd02dc448cfb9f2991f7b4ece1da7349209c/next/production/airflow.tf), + using the generic ec2-service module. The configuration can be copied from the + [current configuration](https://github.com/WordPress/openverse-infrastructure/tree/main/modules/services/ingestion-server) + in the ingestion-server. + [This PR](https://github.com/WordPress/openverse-infrastructure/pull/829) can + be used as a guide. +- Update the Airflow + [Ansible playbooks](https://github.com/WordPress/openverse-infrastructure/blob/7a8fdd02dc448cfb9f2991f7b4ece1da7349209c/ansible/roles/airflow/tasks/airflow.yml) + to deploy the indexer workers alongside the catalog. + +The existing logic for deploying Airflow can be followed closely when adding the +catalog-indexer-worker. The majority of this is already implemented and the +existing conventions should be followed, including planning by default, checking +whether a data refresh is currently in progress, skipping if no differences +exist, and checking for positive confirmation before deployment. 
It is most
+useful to note what differs about the catalog-indexer-workers:
+
+- **Important**: when the compose stack is restarted for the indexer workers, it
+  is critical that the indexer workers must shut themselves down as soon as
+  setup is complete. They will be spun up automatically by the DAG when needed
+  and should not be left running.
+- Currently, the staging and production workers are split into separate
+  environments (i.e., a staging deploy is used to deploy the 2 staging workers
+  separately from the 6 production workers). It is more accurate to view all 8
+  workers as production instances (i.e., part of the production catalog
+  deployment), which merely _operate_ on different environments. As such all 8
+  should be part of the production catalog deploy, but each instance should have
+  an environment `tag` applied which indicates which environment it is intended
+  to be used for.
+- The playbooks must be updated to check if any of the four _new_ data refresh
+  DAGs are running before deploying, as well.
+
+### Add remaining steps to the Data Refresh DAGs
+
+The final steps can now be added to the DAGs:
+
+- `Create and Populated Filtered Index`: this should be implemented as a
+  reusable TaskGroup. This work can already be implemented using our existing
+  Elasticsearch tasks and replaces
+  [this function](https://github.com/WordPress/openverse/blob/226aed0890a19ace1e0b54c1e784c86e9f26b4cb/ingestion_server/ingestion_server/indexer.py#L437).
+- `Reapply Constraints` and `Promote Table`: these SQL queries can be performed
+  with the PostgresHook, and replace
+  [this function](https://github.com/WordPress/openverse/blob/226aed0890a19ace1e0b54c1e784c86e9f26b4cb/ingestion_server/ingestion_server/ingest.py#L340).
+- `Promote Index` and `Delete Index`: these can be implemented using our
+  existing Elasticsearch tasks, and replace
+  [these](https://github.com/WordPress/openverse/blob/226aed0890a19ace1e0b54c1e784c86e9f26b4cb/ingestion_server/ingestion_server/indexer.py#L311)
+  [functions](https://github.com/WordPress/openverse/blob/226aed0890a19ace1e0b54c1e784c86e9f26b4cb/ingestion_server/ingestion_server/indexer.py#L377).
+
+### Deploy the catalog and indexer workers
+
+In this step we'll actually deploy the new workers using the Ansible playbooks.
+The existing ingestion server and indexer workers remain untouched, so the
+legacy data refresh can continue to run.
+
+### Run the data refreshes
+
+The next step is to run the `staging_{media_type}` data refreshes, which act on
+the staging API DB and Elasticsearch cluster and are therefore lower risk. This
+will be the first test of the new process. Once successful, we can run the
+`production_{media_type}` refreshes.
+
+### Update `create_filtered_{media_type}_index` DAGs
+
+These DAGs currently use the ingestion server to perform the
+`create_filtered_index` steps in isolation. We can update these to use the
+reusable TaskGroup implemented in an earlier step.
+
+### Remove the `recreate_full_staging_index` DAG
+
+This DAG performs the second half of a data refresh (the distributed reindex and
+promotion) in staging. Once the data refresh has been moved to Airflow, we can
+add a
+[DAG param](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/params.html)
+to the data refresh that allows skipping the initial `Copy Data` steps. This DAG
+will no longer be necessary and can then be deleted.
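+
+A minimal sketch of how that param could look (the param and task names here
+are assumptions, not the final implementation):
+
+```python
+from airflow.decorators import task
+from airflow.models.param import Param
+
+# Passed to the DAG via `DAG(..., params=params)`.
+params = {
+    "skip_copy_data": Param(
+        default=False,
+        type="boolean",
+        description="Reuse the existing API table instead of copying data.",
+    ),
+}
+
+
+@task.short_circuit
+def should_copy_data(params=None):
+    # Returning False short-circuits (skips) the downstream copy-data tasks.
+    return not params["skip_copy_data"]
+```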
+ +### Update the load_sample_data scripts + +Our `load_sample_data.sh` scripts currently use `just` commands to ]run parts of +the data +refresh](https://github.com/WordPress/openverse/blob/226aed0890a19ace1e0b54c1e784c86e9f26b4cb/load_sample_data.sh#L98) +on the local ingestion server, as part of setting up the local development +environment. We should update the scripts to instead use the Airflow CLI to +unpause the two data refresh DAGs and await their completion. + +### Remove the ingestion server + +Once the new data refresh has been run successfully in production, we can +finally remove all ingestion-server code from the monorepo and the +infrastructure repo, and completely remove the associated EC2 instances. + +### Infrastructure + + + +Infrastructure is a large part of this project. We will be adding 8 new EC2 +instances but deprovisioning ten, as described above. There will be significant +cost benefits to removing the two EC2 instances which are constantly running. + +### Tools & packages + + + +No new tools or packages are required. + +### Other projects or work + + + +- [Deploying Airflow with Ansible](https://github.com/WordPress/openverse-infrastructure/pull/829), + part of the + [Move Airflow to openverse.org](https://github.com/WordPress/openverse-infrastructure/milestone/6) + Project. +- [Catalog Data Cleaning project](https://github.com/WordPress/openverse/issues/430), + which removes the Clean Data steps from the data refresh + +## Alternatives + + + +The alternative options of using an ECS approach or performing the reindex +entirely in Airflow are discussed at length in the [Approach to the Distributed +Reindex][#approach-to-the-distributed-reindex] section. + +## Blockers + + + +Development can start immediately, but the infrastructure components of this +project are blocked by completion of the effort to move Airflow to +openverse.org, specifically +[this issue](https://github.com/WordPress/openverse-infrastructure/issues/776) +to run a (legacy) data refresh on the new instance. + +## Rollback + + + +We will not remove the ingestion server or the existing data refresh DAGs until +the new DAGs have been run successfully against both staging and production. We +can rollback at any time by simply removing the new instances and DAGs. + +## Risks + + + +There is minimal risk, as we will be able to test against production data in +staging before running against production. However the data refresh does +completely replace the production media tables and Elasticsearch indices, so +there is always inherent risk that production data could be lost or malformed if +something goes wrong. We will ensure production backups are available and +monitor the first production data refreshes closely. 
+ +## Prior art + + + +[^1] +https://towardsdatascience.com/using-apache-airflow-dockeroperator-with-docker-compose-57d0217c8219 From f3115e50f81c01b9b1663eb4f6c306ab60b26725 Mon Sep 17 00:00:00 2001 From: Staci Mullins <63313398+stacimc@users.noreply.github.com> Date: Thu, 4 Apr 2024 16:02:47 -0700 Subject: [PATCH 02/15] Correct wording Co-authored-by: sarayourfriend <24264157+sarayourfriend@users.noreply.github.com> --- .../20240328-implementation_plan_ingestion_server_removal.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md index f0d748a917b..b0f89aeeae0 100644 --- a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md +++ b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md @@ -303,7 +303,7 @@ will be generated: - production_audio_data_refresh - production_image_data_refresh -Because the `environment` is added as a suffix, there will be no collision with +Because the `environment` is added as a prefix, there will be no collision with the existing DAG ids. In this initial step, we we will add only a small portion of the logic in order to make the PR easier to review. The first steps are already implemented in the current data refresh and can simply be copied: From 84c1ce5e1a013d736069e06aff085760264ab0dc Mon Sep 17 00:00:00 2001 From: Staci Cooper Date: Wed, 10 Apr 2024 14:34:29 -0700 Subject: [PATCH 03/15] Add clarifications and formatting fixes --- ...mentation_plan_ingestion_server_removal.md | 51 +++++++++++-------- 1 file changed, 30 insertions(+), 21 deletions(-) diff --git a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md index b0f89aeeae0..d5c7292c132 100644 --- a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md +++ b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md @@ -32,24 +32,28 @@ into Airflow itself, and removing the ingestion server entirely. The work which must be moved into Airflow can be split into these abstract steps, which will be discussed separately in this IP: -- **Copy Data**: A FDW extension is used to connect the API database to the - upstream (catalog) database. The entire contents of the upstream media table - are copied into a new temp table in the API database. This temp table will - later replace the main media table in the API. +- **Copy Data**: An + [FDW extension](https://www.postgresql.org/docs/current/postgres-fdw.html) is + used to connect the API database to the upstream (catalog) database. The + entire contents of the upstream media table are copied into a new temp table + in the API database. This temp table will later replace the main media table + in the API. - **Clean Data**: Data clean up which includes removing denylisted tags and cleaning URLs. This step will become unnecessary when the [Catalog Data Cleaning project](https://github.com/WordPress/openverse/issues/430) - is completed, which moves all data cleaning steps into ingestion. 
+ is completed, which moves all data cleaning steps upstream into the initial + provider ingestion process. - **Create Index**: Create a new Elasticsearch index, matching the configuration of the existing media index. - **Distributed Reindex**: Convert each record from the new temp table to the format required by an Elasticsearch document, and then reindex them into the newly created index. Because this step is by far the most time consuming, the - reindex is distributed across multiple **indexer workers**, which are - themselves running on EC2 instances (6 in production, 2 in staging). Each - indexer worker serves a small API which can be used to trigger a reindexing - task for an equal portion of the work. The workers are started by the - ingestion-server when needed, and automatically spun down when complete. + reindex is distributed across multiple **indexer workers**. In the current + implementation, the indexer workers are themselves running on EC2 instances (6 + in production, 2 in staging). Each indexer worker serves a small API which can + be used to trigger a reindexing task for an equal portion of the work. The + workers are started by the ingestion-server when needed, and automatically + spun down when complete. - **Create and Populate Filtered Index**: Create a new Elasticsearch index matching the configuration of the existing _filtered_ index, and then reindex documents into it from the new media index, applying appropriate filtering for @@ -377,7 +381,7 @@ to add a new `catalog-indexer-worker` service. Its build context should point to the nested `data_refresh/indexer_worker` directory, and it should map the exposed port to enable the API to be reached by the catalog. -When this work is complete, it should be possible to run `just catalog/ipython` +When this work is complete, it should be possible to run `just catalog/shell` and curl the new indexer worker. The existing ingestion-server and indexer-worker services are unaffected (it is still possible to run legacy data refreshes locally and in production). @@ -392,12 +396,12 @@ code can all be refactored from [`distributed_reindex_scheduler.py`](https://github.com/WordPress/openverse/blob/main/ingestion_server/ingestion_server/distributed_reindex_scheduler.py). - Use Airflow's - [EC2Hook#describe_instances](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/hooks/ec2/index.html#airflow.providers.amazon.aws.hooks.ec2.EC2Hook.describe_instances) + [EC2Hook::describe_instances](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/hooks/ec2/index.html#airflow.providers.amazon.aws.hooks.ec2.EC2Hook.describe_instances) to get a list of internal URLs bound to the indexer workers for the appropriate environment. In the local environment, this will instead return the URL for the catalog-indexer-worker Docker container. - Use - [EC2Hook#start_instances](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/hooks/ec2/index.html#airflow.providers.amazon.aws.hooks.ec2.EC2Hook.start_instances) + [EC2Hook::start_instances](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/hooks/ec2/index.html#airflow.providers.amazon.aws.hooks.ec2.EC2Hook.start_instances) to start the appropriate EC2 instances. This step should skip in a local environment. 
- Use dynamic task mapping to generate a Sensor for each of the expected indexer @@ -411,8 +415,11 @@ code can all be refactored from - Use a Sensor to ping the worker's `task/{task_id}` endpoint until the task is complete, logging the progress as it goes - Use the - [EC2Hook#stop_instances](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/hooks/ec2/index.html#airflow.providers.amazon.aws.hooks.ec2.EC2Hook.stop_instances) - to shut the instance down. This step should skip in a local environment. + [EC2Hook::stop_instances](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/hooks/ec2/index.html#airflow.providers.amazon.aws.hooks.ec2.EC2Hook.stop_instances) + to shut the instance down. It should use the + [`NONE_FAILED` TriggerRule](https://github.com/WordPress/openverse/issues/286) + to ensure that the instances are shut down, even if there are upstream + failures. This step should skip in a local environment. ### Create the Terraform and Ansible resources needed to deploy the new indexer workers @@ -503,9 +510,8 @@ will no longer be necessary and can then be deleted. ### Update the load_sample_data scripts -Our `load_sample_data.sh` scripts currently use `just` commands to ]run parts of -the data -refresh](https://github.com/WordPress/openverse/blob/226aed0890a19ace1e0b54c1e784c86e9f26b4cb/load_sample_data.sh#L98) +Our `load_sample_data.sh` scripts currently use `just` commands to +[run parts of the data refresh](https://github.com/WordPress/openverse/blob/226aed0890a19ace1e0b54c1e784c86e9f26b4cb/load_sample_data.sh#L98) on the local ingestion server, as part of setting up the local development environment. We should update the scripts to instead use the Airflow CLI to unpause the two data refresh DAGs and await their completion. @@ -540,14 +546,16 @@ No new tools or packages are required. Project. - [Catalog Data Cleaning project](https://github.com/WordPress/openverse/issues/430), which removes the Clean Data steps from the data refresh +- [Switch Python Package Management Away from Pipenv](https://github.com/WordPress/openverse/issues/286) ## Alternatives The alternative options of using an ECS approach or performing the reindex -entirely in Airflow are discussed at length in the [Approach to the Distributed -Reindex][#approach-to-the-distributed-reindex] section. +entirely in Airflow are discussed at length in the +[Approach to the Distributed Reindex](#approach-to-the-distributed-reindex) +section. ## Blockers @@ -582,5 +590,6 @@ monitor the first production data refreshes closely. 
-[^1] +[^1]: + https://towardsdatascience.com/using-apache-airflow-dockeroperator-with-docker-compose-57d0217c8219 From e0fc4c065c6ed3798c863f91ea93bd009dce4bc6 Mon Sep 17 00:00:00 2001 From: Staci Cooper Date: Thu, 11 Apr 2024 14:18:45 -0700 Subject: [PATCH 04/15] Adjust approach using ASG --- ...mentation_plan_ingestion_server_removal.md | 178 +++++++++--------- 1 file changed, 89 insertions(+), 89 deletions(-) diff --git a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md index d5c7292c132..4181cd7e403 100644 --- a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md +++ b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md @@ -114,8 +114,12 @@ of maintaining remote workers that can be run only when needed. Two other options were evaluated for this IP: -- Simply move the 8 total existing EC2 worker instances into the catalog, and - connect to them directly from Airflow. +- Keep the EC2 instances as they are now, but connect to them directly from + Airflow. Rather than managing the 8 total EC2 instances directly, we will + instead set up an EC2 Auto Scaling group for each environment, with an initial + desired capacity of 0. The data refresh DAGs will spin up instances by + increasing this desired capacity. (More details follow later in this + document.) - Remove the EC2 instances entirely in favor of a new ECS task definition using the `indexer-worker` image, which would remove all API code and contain only the reindexing script. Airflow would spin up the appropriate number of ECS @@ -200,40 +204,38 @@ into the status of the task from Airflow. #### Impact on Deployments -This is the greatest strength of the ECS approach. Each time an ECS task is spun -up on Fargate, it will pull the Docker image configured in the task definition. -By using the `latest` tag in this configuration we can ensure that the latest -indexer-worker image is pulled each time, which means that any changes to the -reindexing code will become live in production as soon as the new docker image -is published after merging on `main`. - -For the EC2 approach, any changes to the reindexing script or its dependencies -would require deploying the catalog (and by extension the indexer workers). -However: - -- No deployment would be required for changes to the vast majority of the data - refresh implementation, including the data copy, filtered index creation, - alias management, and all aspects of orchestrating the indexer workers. -- _If_ deploying the indexer-worker becomes a frequent issue in the future, we - could iterate on the data refresh DAG to automatically update the workers. For - example we could use an Airflow - [SSHOperator](https://docs.aws.amazon.com/mwaa/latest/userguide/samples-ssh.html) - to connect to the instances and pull the latest image before triggering the - task. I think this is unlikely enough to be a problem that this does not need - to be a requirement of the initial implementation. +This is a great strength of both approaches, each of which would eliminate the +need for deployments when changes are made to the data refresh code (including +changes to the indexer workers). + +Each time an ECS task is spun up on Fargate, it will pull the Docker image +configured in the task definition. 
By using the `latest` tag in this +configuration we can ensure that the latest indexer-worker image is pulled each +time, which means that any changes to the reindexing code will become live in +production as soon as the new docker image is published after merging on `main`. + +The EC2 approach uses an ASG to achieve a similar result. Because the ASG will +actually terminate (rather than stop) the instances when their work is complete +and start _new_ instances for each data refresh, we can pull the Docker image +with the `latest` tag in the `user_data` script so that the latest Docker image +is pulled each time a new data refresh starts. By using AWS Systems Manager +parameters instead of AMI IDs in the launch template, we can even use the ASG to +automatically use new AMI IDs without needing to deploy a new launch template +each time a system dependency is updated. ### Conclusion -The ECS approach allows us to remove 100% of the server management code, and -eliminates the need for deployments in essentially every case (unless the task -definition itself changes). However it is more expensive, more difficult to -implement, and requires a special secondary implementation for running locally -that may be prone to cross-platform issues. +Both approaches allow us to eliminate the need for deployments in almost all +cases, except for when changes are made to the task definition and launch +template respectively. -The EC2 approach on the other hand is cheaper, quicker to implement, and gets us -the vast majority of what we want in terms of removing complexity and reducing -the need for deployments. Consequently I argue here that the EC2 approach is the -one we should pursue. +The main advantage of the ECS approach is that it allows us to remove 100% of +the server management code. However, it is more difficult to implement, likely +more expensive, and requires a special secondary implementation for running +locally that may be prone to cross-platform issues. The EC2 approach on the +other hand is cheaper, quicker to implement, and gets us the vast majority of +what we want in terms of removing complexity. Consequently I argue here that the +EC2 approach is the one we should pursue. ## Step-by-step plan @@ -262,8 +264,8 @@ developed as separate DAGs alongside the current ones. indexer worker image locally. 1. Add the distributed reindexing step to the DAG (excludes infrastructure work). -1. Add the `catalog-indexer-worker` to the Terraform configuration for the - catalog, and add steps to deploy them in the related Ansible playbooks. +1. Set up the necessary resources for the ASGs for staging and production in the + catalog Terraform configuration. 1. Add all remaining steps to the data refresh DAGs: `Create and Populated Filtered Index`, `Reapply Constraints`, `Promote Table`, `Promote Index`, `Delete Index`. @@ -336,9 +338,9 @@ server's work: ### Implement new catalog indexer worker -In this step we will create the new catalog-indexer-worker Docker image, and add -distributed reindexing to the new DAGs. This step does not include the -infrastructure work to actually create the new EC2 instances. +In this step we will create the new catalog-indexer-worker Docker image. This +step does not include adding the orchestration steps to the DAG, or the +infrastructure work to actually create the ASGs. First we will create a new `indexer-worker` directory under `catalog/dags/data_refresh`, which will contain the contents of the new indexer @@ -390,22 +392,26 @@ refreshes locally and in production). 
In this step we will add tasks to the data refresh DAGs to orchestrate the distributed reindex. At the end of this step, it will be possible to run a -distributed reindex locally. Because the infrastructure work to create the EC2 -instances is not complete, it can not be run on production yet. The following -code can all be refactored from +distributed reindex _locally_, but because the infrastructure work to create the +ASGs is not complete, it can not be run on production yet. The following code +can all be refactored from [`distributed_reindex_scheduler.py`](https://github.com/WordPress/openverse/blob/main/ingestion_server/ingestion_server/distributed_reindex_scheduler.py). -- Use Airflow's - [EC2Hook::describe_instances](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/hooks/ec2/index.html#airflow.providers.amazon.aws.hooks.ec2.EC2Hook.describe_instances) - to get a list of internal URLs bound to the indexer workers for the - appropriate environment. In the local environment, this will instead return - the URL for the catalog-indexer-worker Docker container. - Use - [EC2Hook::start_instances](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/hooks/ec2/index.html#airflow.providers.amazon.aws.hooks.ec2.EC2Hook.start_instances) - to start the appropriate EC2 instances. This step should skip in a local - environment. -- Use dynamic task mapping to generate a Sensor for each of the expected indexer - workers, and ping its `healthcheck` endpoint until it passes. + [`describe_auto_scaling_groups`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/autoscaling/client/describe_auto_scaling_groups.html) + and filter by tags to select the appropriate ASG for the desired environment. + (Skips in local env.) +- Use + [`set_desired_capacity`](https://boto3.amazonaws.com/v1/documentation/api/1.26.86/reference/services/autoscaling/client/set_desired_capacity.html) + to increase the desired capacity of the ASG to the desired number of workers, + depending on the environment. This will cause the ASG to begin spinning up + instances. (Skips in local env.) +- Use + [`describe_auto_scaling_groups`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/autoscaling/client/describe_auto_scaling_groups.html) + to poll the ASG until all instances have been started, and get the EC2 + instance IDs. (Skips in local env.) +- Use dynamic task mapping to generate a Sensor for each of the instance IDs, + and ping its `healthcheck` endpoint until it passes. - Use dynamic task mapping to distribute reindexing across the indexer workers by first calculating `start` and `end` indices that will split the records in the media table into even portions, depending on the number of workers @@ -414,55 +420,44 @@ code can all be refactored from `end_index` it should handle - Use a Sensor to ping the worker's `task/{task_id}` endpoint until the task is complete, logging the progress as it goes - - Use the - [EC2Hook::stop_instances](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/hooks/ec2/index.html#airflow.providers.amazon.aws.hooks.ec2.EC2Hook.stop_instances) - to shut the instance down. It should use the - [`NONE_FAILED` TriggerRule](https://github.com/WordPress/openverse/issues/286) - to ensure that the instances are shut down, even if there are upstream - failures. This step should skip in a local environment. 
+- Use + [`terminate_instance_in_auto_scaling_group`](https://boto3.amazonaws.com/v1/documentation/api/1.26.86/reference/services/autoscaling/client/terminate_instance_in_auto_scaling_group.html) + to terminate the instance. Make sure to set `ShouldDecrementDesiredCapacity` + to `True` to ensure that the ASG does not try to replace the instance. This + task should use the + [`NONE_FAILED` TriggerRule](https://github.com/WordPress/openverse/issues/286) + to ensure that the instances are terminated, even if there are upstream + failures. (Skips in local env.) +- Finally, after all tasks have finished (regardless of success/failure), we + should have a cleanup task that calls `set_desired_capacity` to 0. Generally + this should be a no-op, but if an instance crashes during reindexing (rather + than simply failing during reindexing) the ASG will spin up a replacement and + Airflow will not automatically clean it up. This task ensures that any + dangling instances are terminated. + +```{note} +It is not possible to retry a single indexer worker with this set up, because once a worker fails the instance is actually terminated (rather than simply stopped). If a task that triggers a reindex is cleared after an instance has been terminated, it will simply fail. The entire reindex must be restarted from the first step in this task group. + +However, there is a valuable tradoff to this approach: it ensures that all of the indexer workers in a data refresh are identical, while still allowing us to avoid manual deployments every time the indexer logic changes. For example, imagine some changes to the reindexing logic are merged to `main` while a data refresh is actively underway, and a new Docker image is published. If one indexer worker failed, and it were possible to retry **just** that indexer worker, it would use the new Docker image -- leading to inconsistency in the behavior of different workers within a single data refresh. +``` ### Create the Terraform and Ansible resources needed to deploy the new indexer workers -In this step we will add the resources needed to actually deploy the new indexer -workers. At the time that this IP is being written, work to -[deploy Airflow with Ansible](https://github.com/WordPress/openverse-infrastructure/pull/829) -is actively underway. This IP builds off of this implementation and is blocked -by its success in production. - -The goals are to: - -- Add the Terraform resources for the indexer worker to - [`next/production/airflow.tf`](https://github.com/WordPress/openverse-infrastructure/blob/7a8fdd02dc448cfb9f2991f7b4ece1da7349209c/next/production/airflow.tf), - using the generic ec2-service module. The configuration can be copied from the - [current configuration](https://github.com/WordPress/openverse-infrastructure/tree/main/modules/services/ingestion-server) - in the ingestion-server. - [This PR](https://github.com/WordPress/openverse-infrastructure/pull/829) can - be used as a guide. -- Update the Airflow - [Ansible playbooks](https://github.com/WordPress/openverse-infrastructure/blob/7a8fdd02dc448cfb9f2991f7b4ece1da7349209c/ansible/roles/airflow/tasks/airflow.yml) - to deploy the indexer workers alongside the catalog. - -The existing logic for deploying Airflow can be followed closely when adding the -catalog-indexer-worker. 
The majority of this is already implemented and the -existing conventions should be followed, including planning by default, checking -whether a data refresh is currently in progress, skipping if no differences -exist, and checking for positive confirmation before deployment. It is most -useful to note what differs about the catalog-indexer-workers: - -- **Important**: when the compose stack is restarted for the indexer workers, it - is critical that the indexer workers must shut themselves down as soon as - setup is complete. They will be spun up automatically by the DAG when needed - and should not be left running. +In this step we will add the resources needed to actually configure the ASGs. + +Some important notes: + - Currently, the staging and production workers are split into separate environments (i.e., a staging deploy is used to deploy the 2 staging workers separately from the 6 production workers). It is more accurate to view all 8 workers as production instances (i.e., part of the production catalog deployment), which merely _operate_ on different environments. As such all 8 - should be part of the production catalog deploy, but each instance should have - an environment `tag` applied which indicates which environment it is intended - to be used for. + should be part of the production deployment, but two separate ASGs which are + tagged indicating their intended environment. - The playbooks must be updated to check if any of the four _new_ data refresh DAGs are running before deploying, as well. +- The `user_data` script should be updated to pull the Docker image with the + `latest` tag. ### Add remaining steps to the Data Refresh DAGs @@ -557,6 +552,11 @@ entirely in Airflow are discussed at length in the [Approach to the Distributed Reindex](#approach-to-the-distributed-reindex) section. +It is also possible to use EC2 instances but manage them directly in Airflow +using EC2 operators to start and stop the instances as needed. However, more +infrastructure work is required in this approach, and we would require +deployments whenever there are code changes in the indexer workers. + ## Blockers From f40f657fb31232c62e534969094cadd39da5536c Mon Sep 17 00:00:00 2001 From: Staci Cooper Date: Thu, 11 Apr 2024 14:34:32 -0700 Subject: [PATCH 05/15] Add more detail about cost analysis and link to breakdown --- ...mentation_plan_ingestion_server_removal.md | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md index 4181cd7e403..4cf32d0bf47 100644 --- a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md +++ b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md @@ -135,7 +135,17 @@ approaches: Maintaining the [EC2 worker instances](https://calculator.aws/#/createCalculator/ec2-enhancement), provided we continue to automatically stop the instances when not in use, would -be cheaper than using [ECS](https://calculator.aws/#/createCalculator/Fargate). +be cheaper than using +[ECS](https://calculator.aws/#/createCalculator/Fargate)[^1], given equivalent +resources. + +Since our existing indexer workers do not use all of the resources available, we +can also consider reducing resources for the workers. 
For EC2, we are already
+using the cheapest possible offering that will comfortably accommodate our
+memory consumption. Since ECS is more flexible and allows scaling vCPU and RAM
+independently, it is possible that ECS could be cheaper than EC2 with careful
+adjustment of resources[^1].
+
Notably both solutions are _significantly_ cheaper than the current
implementation, because the bulk of the cost comes from constantly running the
two m5.xlarge EC2 instances for the ingestion servers.

@@ -189,7 +199,7 @@ in local development to spin up a worker as a local Docker container.

This is still relatively complex, however, because the worker Docker containers
must be launched from Airflow, which is itself dockerized. This requires either
using a Docker-in-Docker approach, which should typically be avoided, or
-modifying the configuration to allow Airflow to launch sibling containers[^1].
+modifying the configuration to allow Airflow to launch sibling containers[^2].
Requirements differ across operating systems, yet this **must** be tested and
maintained across platforms because the data refresh is used as a critical step
in setting up our local development environment.
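As a rough sketch of the sibling-container approach (not a committed design),
Airflow's `DockerOperator` can launch a worker container on the host's Docker
daemon when the host socket is bind-mounted into the Airflow container. The
DAG id, image tag, command, and network mode below are hypothetical
placeholders:

```python
# Hedged sketch of the "sibling containers" approach: the Airflow container has
# the host's /var/run/docker.sock bind-mounted, so containers launched via this
# operator run as siblings on the host daemon rather than nested
# Docker-in-Docker. All names here are illustrative assumptions.
import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="local_data_refresh_sketch",  # hypothetical DAG id
    start_date=datetime.datetime(2024, 1, 1),
    schedule=None,
):
    trigger_reindex_worker = DockerOperator(
        task_id="trigger_reindex_worker",
        image="catalog-indexer-worker:latest",  # hypothetical local image tag
        command=["python", "reindex.py", "--start", "0", "--end", "500000"],
        docker_url="unix://var/run/docker.sock",  # socket mounted from the host
        auto_remove="success",  # remove the sibling container once it succeeds
        network_mode="host",  # assumption: worker must reach local ES/Postgres
    )
```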
@@ -591,5 +601,10 @@ monitor the first production data refreshes closely.

[^1]:
+    See
+    https://github.com/WordPress/openverse/pull/4026#pullrequestreview-1978477921
+    for specific cost assessment, courtesy of @sarayourfriend
+
+[^2]:
    https://towardsdatascience.com/using-apache-airflow-dockeroperator-with-docker-compose-57d0217c8219

From c18dfd786bb43efb6debd7eb12cbf27e5f7704e8 Mon Sep 17 00:00:00 2001
From: Staci Cooper
Date: Thu, 11 Apr 2024 14:39:05 -0700
Subject: [PATCH 06/15] Fix link

---
 .../20240328-implementation_plan_ingestion_server_removal.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
index 4cf32d0bf47..372748f08b0 100644
--- a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
+++ b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
@@ -435,7 +435,7 @@ can all be refactored from
  to terminate the instance. Make sure to set `ShouldDecrementDesiredCapacity`
  to `True` to ensure that the ASG does not try to replace the instance. This
  task should use the
-  [`NONE_FAILED` TriggerRule](https://github.com/WordPress/openverse/issues/286)
+  [`NONE_FAILED` TriggerRule](https://airflow.apache.org/docs/apache-airflow/1.10.9/concepts.html#trigger-rules)
  to ensure that the instances are terminated, even if there are upstream
  failures. (Skips in local env.)
- Finally, after all tasks have finished (regardless of success/failure), we

From 78c19598adb57de15c2b604167599a9343d7a1 Mon Sep 17 00:00:00 2001
From: Staci Cooper
Date: Thu, 11 Apr 2024 14:42:31 -0700
Subject: [PATCH 07/15] Add prior art

---
 .../20240328-implementation_plan_ingestion_server_removal.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
index 372748f08b0..13601bf2bf4 100644
--- a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
+++ b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
@@ -600,6 +600,8 @@

+- [Search relevancy sandbox](https://github.com/WordPress/openverse/issues/392)
+
[^1]:
    See
    https://github.com/WordPress/openverse/pull/4026#pullrequestreview-1978477921
    for specific cost assessment, courtesy of @sarayourfriend

From cbd976e2be28e605ab29c78cae1d5a86ff4958c9 Mon Sep 17 00:00:00 2001
From: Staci Cooper
Date: Thu, 11 Apr 2024 15:17:25 -0700
Subject: [PATCH 08/15] Fix footnote

---
 ...328-implementation_plan_ingestion_server_removal.md | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
index 13601bf2bf4..4718af3c5ea 100644
--- a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
+++ b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
@@ -135,9 +135,8 @@ approaches:
Maintaining the
[EC2 worker instances](https://calculator.aws/#/createCalculator/ec2-enhancement),
provided we continue to automatically stop the instances when not in use, would
-be cheaper than using
-[ECS](https://calculator.aws/#/createCalculator/Fargate)[^1], given equivalent
-resources.
+be cheaper than using [ECS](https://calculator.aws/#/createCalculator/Fargate),
+given equivalent resources.

Since our existing indexer workers do not use all of the resources available, we
can also consider reducing resources for the workers. For EC2, we are already
@@ -604,9 +603,8 @@ monitor the first production data refreshes closely.
[^1]:
    See
-    https://github.com/WordPress/openverse/pull/4026#pullrequestreview-1978477921
+    [comment](https://github.com/WordPress/openverse/pull/4026#pullrequestreview-1978477921)
    for specific cost assessment, courtesy of @sarayourfriend

[^2]:
-
-https://towardsdatascience.com/using-apache-airflow-dockeroperator-with-docker-compose-57d0217c8219
+    https://towardsdatascience.com/using-apache-airflow-dockeroperator-with-docker-compose-57d0217c8219

From bfc7bd1c4e877ae4bf94c59566b844bf79124dcf Mon Sep 17 00:00:00 2001
From: Staci Cooper
Date: Mon, 15 Apr 2024 13:51:26 -0700
Subject: [PATCH 09/15] Update link and trigger rule

---
 .../20240328-implementation_plan_ingestion_server_removal.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
index 4718af3c5ea..8a4d96cd3f8 100644
--- a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
+++ b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
@@ -160,7 +160,7 @@ from the current setup.

The ECS approach requires slightly less code, because it would allow us to
completely remove all of the server logic needed in the EC2 instances. The
-docker image need only contain the script for reindexing.
+Docker image need only contain the script for reindexing.

For the EC2 instances we would continue to serve a minimal API with the
following endpoints:
@@ -434,7 +434,7 @@ can all be refactored from
  to terminate the instance. Make sure to set `ShouldDecrementDesiredCapacity`
  to `True` to ensure that the ASG does not try to replace the instance. This
  task should use the
-  [`NONE_FAILED` TriggerRule](https://airflow.apache.org/docs/apache-airflow/1.10.9/concepts.html#trigger-rules)
+  [`NONE_SKIPPED` TriggerRule](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html#trigger-rules)
  to ensure that the instances are terminated, even if there are upstream
  failures. (Skips in local env.)
- Finally, after all tasks have finished (regardless of success/failure), we
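For context on the trigger-rule change in the hunk above, a hedged sketch of
the intended behavior follows; the task name is a hypothetical placeholder:

```python
# Why NONE_SKIPPED fits the termination task better than NONE_FAILED:
# - none_failed: runs as long as no upstream task failed, so it would still
#   run in a local environment where the AWS-only upstream tasks were skipped.
# - none_skipped: runs as long as no upstream task was skipped, so it still
#   fires after upstream *failures* (instances get terminated) but is itself
#   skipped locally when the AWS-only upstream tasks skip.
from airflow.decorators import task
from airflow.utils.trigger_rule import TriggerRule


@task(trigger_rule=TriggerRule.NONE_SKIPPED)
def terminate_indexer_worker(instance_id: str) -> None:
    ...  # terminate the instance via boto3, as sketched earlier
```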
From 4ff01fbd406776fedff1fb271a3c3a8f7fd229c0 Mon Sep 17 00:00:00 2001
From: Staci Mullins <63313398+stacimc@users.noreply.github.com>
Date: Mon, 15 Apr 2024 13:53:08 -0700
Subject: [PATCH 10/15] Accept aetherunbound approval

Co-authored-by: Madison Swain-Bowden
---
 .../20240328-implementation_plan_ingestion_server_removal.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
index 8a4d96cd3f8..b35f8fa84b6 100644
--- a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
+++ b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
@@ -10,7 +10,7 @@

- [ ] @sarayourfriend
-- [ ] @AetherUnbound
+- [X] @AetherUnbound

## Project links

From 661971d6ec3c143d9754abedada6981a9fd7f403 Mon Sep 17 00:00:00 2001
From: Staci Cooper
Date: Mon, 15 Apr 2024 14:03:36 -0700
Subject: [PATCH 11/15] Lint approval

---
 .../20240328-implementation_plan_ingestion_server_removal.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
index b35f8fa84b6..bca552d8b0d 100644
--- a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
+++ b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
@@ -10,7 +10,7 @@

- [ ] @sarayourfriend
-- [X] @AetherUnbound
+- [x] @AetherUnbound

## Project links

From 50b9b1299326cda4ab4587073f126b7737af5e1e Mon Sep 17 00:00:00 2001
From: Staci Cooper
Date: Tue, 16 Apr 2024 17:15:53 -0700
Subject: [PATCH 12/15] Remove unnecessary step

---
 .../20240328-implementation_plan_ingestion_server_removal.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
index bca552d8b0d..5243f67cfe8 100644
--- a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
+++ b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
@@ -419,8 +419,6 @@ can all be refactored from
  [`describe_auto_scaling_groups`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/autoscaling/client/describe_auto_scaling_groups.html)
  to poll the ASG until all instances have been started, and get the EC2
  instance IDs. (Skips in local env.)
-- Use dynamic task mapping to generate a Sensor for each of the instance IDs,
-  and ping its `healthcheck` endpoint until it passes.
- Use dynamic task mapping to distribute reindexing across the indexer workers
  by first calculating `start` and `end` indices that will split the records in
  the media table into even portions, depending on the number of workers
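The polling and work-splitting steps referenced in the hunk above can be
sketched roughly as follows; the ASG name, sensor behavior, and task structure
are assumptions for illustration:

```python
# Rough sketch of "wait for instances, then split the work". Error handling
# and the sensor/retry settings of the real DAG are omitted for brevity.
import boto3
from airflow.decorators import task


@task
def get_worker_instance_ids(asg_name: str, expected_count: int) -> list[str]:
    """Return instance IDs once all expected instances are in service.

    In the real DAG this would poll/re-poke until the ASG is fully up; here
    it simply raises if the pool is not ready yet.
    """
    asg = boto3.client("autoscaling")
    groups = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    instance_ids = [
        instance["InstanceId"]
        for instance in groups["AutoScalingGroups"][0]["Instances"]
        if instance["LifecycleState"] == "InService"
    ]
    if len(instance_ids) < expected_count:
        raise ValueError("ASG instances are not all in service yet")
    return instance_ids


@task
def calculate_worker_ranges(record_count: int, worker_count: int) -> list[dict]:
    """Split [0, record_count) into one contiguous start/end slice per worker."""
    chunk = record_count // worker_count
    return [
        {
            "start": i * chunk,
            # The final worker picks up any remainder from integer division.
            "end": record_count if i == worker_count - 1 else (i + 1) * chunk,
        }
        for i in range(worker_count)
    ]
```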
From 2fa6852600e186515d83e4a3e455906b8aabe8f4 Mon Sep 17 00:00:00 2001
From: Staci Cooper
Date: Tue, 16 Apr 2024 17:21:23 -0700
Subject: [PATCH 13/15] Clarify use of tags for ASGs, add -pool suffix

---
 .../20240328-implementation_plan_ingestion_server_removal.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
index 5243f67cfe8..30a3768cdf5 100644
--- a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
+++ b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
@@ -460,7 +460,8 @@ Some important notes:
  workers as production instances (i.e., part of the production catalog
  deployment), which merely _operate_ on different environments. As such all 8
  should be part of the production deployment, but two separate ASGs which are
-  tagged indicating their intended environment.
+  given a "staging-indexer-worker-pool" and "production-indexer-worker-pool"
+  tag, respectively, to indicate their intended environment.
- The playbooks must be updated to check if any of the four _new_ data refresh
  DAGs are running before deploying, as well.
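A hedged sketch of how a DAG might resolve its environment's worker pool from
the tag introduced in this patch; the exact tag key/value scheme and lookup
pattern are assumptions, not a final design:

```python
# Hypothetical sketch: find the indexer-worker ASG for the current environment
# using the "<environment>-indexer-worker-pool" tag described above. Whether
# the pool name is a tag key or value is an assumption for illustration.
import boto3


def get_worker_asg_name(environment: str) -> str:
    asg = boto3.client("autoscaling")
    response = asg.describe_auto_scaling_groups(
        Filters=[
            {"Name": "tag-key", "Values": [f"{environment}-indexer-worker-pool"]}
        ]
    )
    groups = response["AutoScalingGroups"]
    if len(groups) != 1:
        raise ValueError(f"Expected exactly one indexer ASG for {environment}")
    return groups[0]["AutoScalingGroupName"]
```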
From 898570405db4494d4c2427607568cabe6c4c29e2 Mon Sep 17 00:00:00 2001
From: Staci Cooper
Date: Tue, 16 Apr 2024 17:23:26 -0700
Subject: [PATCH 14/15] Reference opportunity for future iteration

---
 .../20240328-implementation_plan_ingestion_server_removal.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
index 30a3768cdf5..ddf0cf157cf 100644
--- a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
+++ b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
@@ -446,6 +446,8 @@ can all be refactored from
It is not possible to retry a single indexer worker with this setup, because once a worker fails the instance is actually terminated (rather than simply stopped). If a task that triggers a reindex is cleared after an instance has been terminated, it will simply fail. The entire reindex must be restarted from the first step in this task group.

However, there is a valuable tradeoff to this approach: it ensures that all of the indexer workers in a data refresh are identical, while still allowing us to avoid manual deployments every time the indexer logic changes. For example, imagine some changes to the reindexing logic are merged to `main` while a data refresh is actively underway, and a new Docker image is published. If one indexer worker failed, and it were possible to retry **just** that indexer worker, it would use the new Docker image -- leading to inconsistency in the behavior of different workers within a single data refresh.
+
+In future iterations this may also be solved by using AMIs with the Docker image baked in, and then preventing launch template version bumps while a data refresh is running.
```

### Create the Terraform and Ansible resources needed to deploy the new indexer workers

From 26480d47f7afd01d4a66b41d3d81f5ec5922aa8f Mon Sep 17 00:00:00 2001
From: Staci Cooper
Date: Tue, 16 Apr 2024 17:24:42 -0700
Subject: [PATCH 15/15] Record sarayourfriend approval

---
 .../20240328-implementation_plan_ingestion_server_removal.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
index ddf0cf157cf..cf3aa8553b9 100644
--- a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
+++ b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
@@ -9,7 +9,7 @@

-- [ ] @sarayourfriend
+- [x] @sarayourfriend
- [x] @AetherUnbound

## Project links