# 2023-04-06 Implementation Plan: Update Staging Database

<!-- See the implementation plan guide for more information: https://github.com/WordPress/openverse/tree/19791f51c063d0979112f4b9f4eeace04c8cf5ff/docs/projects#implementation-plans-status-in-rfc -->
<!-- This template is exhaustive and may include sections which aren't relevant to your project. Feel free to remove any sections which would not be useful to have. -->

## Reviewers

<!-- Choose two people at your discretion who make sense to review this based on their existing expertise. Check in to make sure folks aren't currently reviewing more than one other proposal or RFC. -->

- [x] @stacimc
- [x] @krysal

## Project links

<!-- Enumerate any references to other documents/pages, including milestones and other plans -->

- [Project Thread](https://github.com/WordPress/openverse/issues/392)
- [Project Proposal](https://docs.openverse.org/projects/proposals/search_relevancy_sandbox/20230331-project_proposal_search_relevancy_sandbox.html)

## Overview

<!-- A brief one or two sentence overview of the implementation being described. -->

This document describes how we will implement a mechanism for updating the
staging database with the latest data from the production database.

## Outlined Steps

<!-- Describe the implementation step necessary for completion. -->

There are two primary products of this plan. The first will be a DAG (scheduled
for `@monthly`) which will recreate the staging database from the most recent
snapshot of the production API database. The second will be modifications to the
Django Admin UI notifying maintainers of the next scheduled staging database
update.

### DAG

The DAG will accomplish this through the following steps (illustrative code
sketches for several of them appear after the list):

1. Use a new `SKIP_STAGING_DATABASE_RESTORE`
[Airflow Variable](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/variables.html)
to control whether the DAG should run or not. This will allow us to skip the
DAG in the case where we have made diverging changes to the staging database
for testing a new feature that we do not want overwritten. If this
Variable is set to `True`, the DAG should issue a Slack message notifying the
maintainers of the skipped run and raise an `AirflowSkipException`.
2. Determine the most recent automated **production** snapshot using boto3's
[`describe_db_snapshots`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/rds/client/describe_db_snapshots.html)
function (an Airflow operator does not exist for this operation). This step
should also check the `Status` value of the response for the most recent
snapshot to ensure that it is `available`.
3. In the case where the snapshot is not yet available, wait for the status of
the most recent snapshot using the
[`RdsSnapshotExistenceSensor`](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/sensors/rds/index.html#airflow.providers.amazon.aws.sensors.rds.RdsSnapshotExistenceSensor).
4. In parallel, gather the attributes of the **staging** database using boto3's
[`describe_db_instances`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/rds/client/describe_db_instances.html)
function (an Airflow operator does not exist for this operation). Namely, the
following attributes should be gathered and set to ensure they match:
- Availability should be a single DB instance (not
[multi-AZ](https://docs.aws.amazon.com/fsx/latest/WindowsGuide/high-availability-multiAZ.html))
- It should be attached to our default Virtual Private Cloud (VPC)
- It should not be publicly accessible
- It will need access to the following VPC security groups: `default`,
`openverse-vpc-db-access`, `staging-dev-rds-sg`
- It should match the staging database's instance size (currently
`m5.xlarge`)
- It should have 3000 GB of allocated gp2 storage
- It should use password authentication (this may not need to be changed from
the default, as the snapshot should contain the same connection
information)
5. Create a new database from this snapshot using boto3's
[`restore_db_instance_from_db_snapshot`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/rds/client/restore_db_instance_from_db_snapshot.html)
function (an operator does not exist for this operation). This database will
be named in a way that does not conflict with the existing staging database
name, e.g. `dev-next-openverse-db`. The database configuration information
from the previous step will be used to ensure the new database matches the
old database's configuration exactly.
6. Wait for the new database to be ready using the
[`RdsDbSensor`](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/sensors/rds/index.html#airflow.providers.amazon.aws.sensors.rds.RdsDbSensor)
([example](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/operators/rds.html#howto-sensor-rdsdbsensor)).
7. Rename the old staging database to `dev-old-openverse-db` using boto3's
[`modify_db_instance`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/rds/client/modify_db_instance.html)
function (an operator does not exist for this operation). We must set the
`ApplyImmediately` option here to `True` to ensure this operation happens
immediately rather than waiting for the next maintenance window.
8. Wait for the old database rename to be complete with the `RdsDbSensor` (we
may need retries on this step, since the database may not be initially
available/named when the sensor first starts). _**Note**: this will cause a
temporary outage of the staging API; see
[the alternatives section](#alternatives) for why this is the case._ A Slack
notification should be sent out at the start of this step to alert the team
of the outage, and again once the outage is resolved.
9. Rename the new database to `dev-openverse-db` using `modify_db_instance`.
(Noting that `ApplyImmediately` should be set to `True` here as well.)
10. Wait for the new database rename to be complete with the `RdsDbSensor` (we
may need retries on this step, since the database may not be initially
available/named when the sensor first starts).
11. If the previous steps fail, rename `dev-old-openverse-db` back to
`dev-openverse-db`. Otherwise, `dev-old-openverse-db` can be deleted using
the
[`RdsDeleteDbInstanceOperator`](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/operators/rds/index.html#airflow.providers.amazon.aws.operators.rds.RdsDeleteDbInstanceOperator).
We should use the following configuration options:
- `wait_for_completion` should be left as `True` here so this step will hang
until the database removal is complete.
- `SkipFinalSnapshot` should be set to `True` to avoid creating a final
snapshot of the database before deletion.
12. Once the database rename is complete, the following actions will also need
to occur (they can occur simultaneously in the DAG):
1. The staging API must be deployed to apply any necessary migrations to the
new copy of the database. In order to do this, we must augment the existing
[GitHub API class](https://github.com/WordPress/openverse/blob/aedc9c16ce5ed11709e2b6f0b42bad77ea1eb19b/catalog/dags/common/github.py#L4)
with additional functions for **getting the latest GHCR image tag** and
**triggering the deployment workflow**. Once those functions exist, the
steps for this piece would be:
1. Get the latest image tag for the
[`openverse-api` package](https://github.com/WordPress/openverse/pkgs/container/openverse-api).
2. Trigger the deployment workflow
[in a similar manner to the existing CI/CD workflow](https://github.com/WordPress/openverse/blob/1f7b8e670a0f7812494375570d076a7c33142062/.github/workflows/ci_cd.yml#L833-L846)
using the tag from the previous step and the `openverse-bot` actor (see the
GitHub API sketch following this list).
2. Update the Elasticsearch indices corresponding to each media type. This
can be done by using the
[`UPDATE_INDEX` action](https://github.com/WordPress/openverse/blob/7427bbd4a8178d05a27e6fef07d70905ec7ef16b/ingestion_server/ingestion_server/indexer.py#L314)
on the ingestion server. The steps for this piece would be:
1. Get the date for the last successful run of the update DAG (see the
[`_month_check` function](https://github.com/WordPress/openverse/blob/aedc9c16ce5ed11709e2b6f0b42bad77ea1eb19b/catalog/dags/data_refresh/dag_factory.py#L85)
from the data refresh DAG for a similar example).
2. For each media type:
1. Get the current index (see the
[`get_current_index` task](https://github.com/WordPress/openverse/blob/aedc9c16ce5ed11709e2b6f0b42bad77ea1eb19b/catalog/dags/data_refresh/data_refresh_task_factory.py#L167-L175)
on the data refresh DAG)
2. Initiate the
[`UPDATE_INDEX` action](https://github.com/WordPress/openverse/blob/7427bbd4a8178d05a27e6fef07d70905ec7ef16b/ingestion_server/ingestion_server/api.py#L107-L113)
using the date and index suffix retrieved above
3. Wait for the index update to complete
3. Truncate the OAuth tables to prevent production API applications from
working in staging. Each of these truncate operations can be run
simultaneously using
`TRUNCATE TABLE <table_name> RESTART IDENTITY CASCADE;` in a
[`PGExecuteQueryOperator`](https://github.com/WordPress/openverse/blob/aedc9c16ce5ed11709e2b6f0b42bad77ea1eb19b/catalog/dags/common/sql.py#L137)
(a truncation sketch follows this list). The tables that need to be truncated
are:
- `api_throttledapplication`
- `api_oauth2registration`
- `api_oauth2verification`
- `oauth2_provider_accesstoken`
- `oauth2_provider_grant`
- `oauth2_provider_idtoken`
- `oauth2_provider_refreshtoken`
13. Report the success or failure of the DAG run to Slack.
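
To make the RDS portion of this flow more concrete, below is a minimal sketch of
steps 1, 2, 5, 7, and 9, written with the Airflow TaskFlow API and plain `boto3`
calls. The identifier constants, the security group placeholder, and the
instance class literal are illustrative assumptions; the real DAG would pull the
staging database's actual configuration via `describe_db_instances` (step 4) and
use our existing AWS connection handling.

```python
import boto3
from airflow.decorators import task
from airflow.exceptions import AirflowSkipException
from airflow.models import Variable

PROD_DB_IDENTIFIER = "prod-openverse-db"     # snapshot source (read-only)
NEW_DB_IDENTIFIER = "dev-next-openverse-db"  # temporary name for the restored copy


@task
def skip_if_configured() -> None:
    # Step 1: bail out when maintainers have flagged diverging staging changes.
    if Variable.get("SKIP_STAGING_DATABASE_RESTORE", default_var="false").lower() == "true":
        # (A Slack notification about the skipped run would also be sent here.)
        raise AirflowSkipException("Staging database restore is disabled.")


@task
def get_latest_prod_snapshot() -> str:
    # Step 2: find the most recent automated snapshot of the production database.
    rds = boto3.client("rds")
    snapshots = rds.describe_db_snapshots(
        DBInstanceIdentifier=PROD_DB_IDENTIFIER, SnapshotType="automated"
    )["DBSnapshots"]
    latest = max(snapshots, key=lambda s: s["SnapshotCreateTime"])
    if latest["Status"] != "available":
        # The real DAG waits with the RdsSnapshotExistenceSensor instead (step 3).
        raise ValueError(f"Snapshot {latest['DBSnapshotIdentifier']} is not available yet")
    return latest["DBSnapshotIdentifier"]


@task
def restore_staging_from_snapshot(snapshot_id: str) -> None:
    # Step 5: restore the snapshot under a non-conflicting name, using the
    # configuration gathered from the existing staging instance in step 4.
    boto3.client("rds").restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=NEW_DB_IDENTIFIER,
        DBSnapshotIdentifier=snapshot_id,
        DBInstanceClass="db.m5.xlarge",     # match the staging instance size
        MultiAZ=False,
        PubliclyAccessible=False,
        VpcSecurityGroupIds=["sg-000000"],  # looked up from staging in practice
    )


@task
def rename_database(current_id: str, new_id: str) -> None:
    # Steps 7 and 9: apply the rename immediately, not at the next maintenance window.
    boto3.client("rds").modify_db_instance(
        DBInstanceIdentifier=current_id,
        NewDBInstanceIdentifier=new_id,
        ApplyImmediately=True,
    )
```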
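
The GitHub API sketch referenced in step 12 shows roughly what the two new
methods could look like against the public GitHub REST API. The class shape,
method names, and session handling below are assumptions for illustration, not
the final implementation in `common/github.py`.

```python
import requests


class GitHubAPI:
    def __init__(self, pat: str):
        self.session = requests.Session()
        self.session.headers.update(
            {"Authorization": f"token {pat}", "Accept": "application/vnd.github+json"}
        )

    def get_latest_package_version_tag(self, org: str, package: str) -> str:
        # GET /orgs/{org}/packages/container/{package}/versions lists versions
        # newest first; return the first tag of the newest tagged version.
        resp = self.session.get(
            f"https://api.github.com/orgs/{org}/packages/container/{package}/versions"
        )
        resp.raise_for_status()
        for version in resp.json():
            tags = version["metadata"]["container"]["tags"]
            if tags:
                return tags[0]
        raise ValueError(f"No tagged versions found for {org}/{package}")

    def trigger_workflow(self, repo: str, workflow_file: str, inputs: dict) -> None:
        # POST .../actions/workflows/{workflow_file}/dispatches starts a
        # workflow_dispatch run on the given ref with the given inputs.
        resp = self.session.post(
            f"https://api.github.com/repos/{repo}/actions/workflows/{workflow_file}/dispatches",
            json={"ref": "main", "inputs": inputs},
        )
        resp.raise_for_status()
```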
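
Finally, the truncation sketch: the OAuth table truncations can be expressed as
one independent task per table so they run in parallel. The generic
`SQLExecuteQueryOperator` and the connection id used here are stand-ins for
illustration; the actual DAG would use the catalog's `PGExecuteQueryOperator`
and its existing staging API Postgres connection.

```python
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

OAUTH_TABLES = [
    "api_throttledapplication",
    "api_oauth2registration",
    "api_oauth2verification",
    "oauth2_provider_accesstoken",
    "oauth2_provider_grant",
    "oauth2_provider_idtoken",
    "oauth2_provider_refreshtoken",
]

# Defined inside the DAG body; each task issues one TRUNCATE and they have no
# dependencies on one another, so Airflow can schedule them simultaneously.
truncate_tasks = [
    SQLExecuteQueryOperator(
        task_id=f"truncate_{table}",
        conn_id="postgres_openledger_api_staging",  # assumed connection id
        sql=f"TRUNCATE TABLE {table} RESTART IDENTITY CASCADE;",
    )
    for table in OAUTH_TABLES
]
```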

### Django Admin UI changes

The above section describes the process for performing the staging update. A
side-effect of this method of updating is that the data in staging will be
destroyed in favor of the data from the production database. Maintainers can use
the `SKIP_STAGING_DATABASE_RESTORE` Airflow Variable described above to prevent
the deletion of the staging database. We will also add a notice to the top of
the Django Admin UI denoting when the next scheduled database recreation will occur.
This can be done by
[overriding the base Django Admin template](https://docs.djangoproject.com/en/4.2/howto/overriding-templates/#extending-an-overridden-template)
and adding a notice to the top of the page. Since the DAG is intended to be run
`@monthly`, a countdown to the next run can be calculated and displayed for
maintainers. We will also want to provide instructions or a link to instructions
for how to disable the staging database recreation using the aforementioned
Airflow Variable.
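
As a rough illustration of the countdown calculation, the snippet below computes
the next `@monthly` run date and exposes it to templates via a context
processor. The function names, the context processor registration, and the
assumption that the next run falls on the first day of the next month are all
illustrative rather than prescriptive.

```python
from datetime import date
from typing import Optional


def next_monthly_run(today: Optional[date] = None) -> date:
    """Approximate the next @monthly DAG run as the first day of the next month."""
    today = today or date.today()
    if today.month == 12:
        return date(today.year + 1, 1, 1)
    return date(today.year, today.month + 1, 1)


def staging_restore_notice(request):
    # Registered in TEMPLATES["OPTIONS"]["context_processors"] so the overridden
    # admin template can render the date alongside a link to the skip instructions.
    return {"next_staging_restore": next_monthly_run()}
```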

## Dependencies

### Tools & packages

<!-- Describe any tools or packages which this work might be dependent on. If multiple options are available, try to list as many as are reasonable with your own recommendation. -->

We already have the `amazon` provider (and thus `boto3`)
[installed on Airflow](https://github.com/WordPress/openverse-catalog/blob/46281fc9fda60ab2df0df6d85043565dfc51b12d/requirements_prod.txt#L5).
We should not need any additional dependencies.

### Other projects or work

<!-- Note any projects this plan is dependent on. -->

This does not depend on any existing projects.

### Infrastructure

<!-- Note any infrastructure this plan is dependent on. -->

This project directly affects our infrastructure, but should not require any
additional dependencies or changes. After this DAG is run, the infrastructure
should mirror what existed before (albeit with an updated staging database).

## Alternatives

### Database cut-over

The plan described above will incur a short (<10 minute) outage of the staging
API. This is because the database name is changed in the middle of the process.
If we want to avoid this outage, we could instead create a new database with a
different name, and then switch the staging API and any other dependent services
(such as the ingestion server) over to the new database. This would eliminate
the outage, but would instead require that we deploy all staging services which
depend on the database as part of the DAG workflow. We do not yet have an
automated way for Airflow to trigger a deployment itself, so this would require
additional work to get Airflow integrated with GitHub Actions. Furthermore,
changes to the services' environment variables would require changes to the
infrastructure repository and an application of the updated Terraform
configuration.

With this in mind, it seems much easier to handle the outage for staging rather
than try and avoid it.

### Per-table update policy

We could also update the database in a more granular fashion, updating each
table individually. This would allow us to avoid the outage, would enable
further granularity with respect to which tables we update when, and could be
used to reduce the delay between when the production database is updated and
when the staging database receives the same data. This was discussed heavily
[in the implementation plan PR](https://github.com/WordPress/openverse/pull/1154),
and several potential mechanisms for this method of updating were discussed,
namely:

1. Foreign Data Wrapper to point staging media tables to production
2. Postgres' "logical replication" feature to keep production and staging media
tables in sync
3. Foreign Data Wrapper to insert all records from production media tables into
staging

We opted to defer the pursuit of this alternate approach for the time being for
the following reasons:

- The FDW/replication approach(es) described may have numerous pitfalls,
unknowns, or drawbacks which the maintainers are not privy to due to
unfamiliarity with the functionality.
- The proposed alternative solution essentially describes an ETL framework
for production -> staging replication. While such a system could be
significantly more flexible, it would by necessity also be more complex, and
would warrant its own project process to flesh out.
- At the time of writing this is the first of
[three implementation plans this project requires](https://docs.openverse.org/projects/proposals/search_relevancy_sandbox/20230331-project_proposal_search_relevancy_sandbox.html#required-implementation-plans),
the other two being "Rapid iteration on ingestion server index configuration"
and "Staging Elasticsearch reindex DAGs for both potential index types". Any
elongations to the project timeline at this step could also affect the
timeline for drafting and implementing those plans as well.
- The DAG described here does not present any **permanent, lasting, and
irreversible** changes; it can be disabled and replaced at any time in favor
of a more thorough update approach down the line.

This alternative is intended to be explored further in
[another project in the future](https://github.com/WordPress/openverse/issues/1874).

## Accessibility

<!-- Are there specific accessibility concerns relevant to this plan? Do you expect new UI elements that would need particular care to ensure they're implemented in an accessible way? Consider also low-spec device and slow internet accessibility, if relevant. -->

This DAG will need to have clear instructions for how to run it for those
unfamiliar with Airflow.

## Parallelizable streams

<!-- What, if any, work within this plan can be parallelized? -->

Based on the above steps, the following work can be parallelized:

1. Alterations to the
[GitHub API class](https://github.com/WordPress/openverse/blob/aedc9c16ce5ed11709e2b6f0b42bad77ea1eb19b/catalog/dags/common/github.py#L4)
to include new methods for pulling the latest GHCR image tag and for
triggering a workflow.
2. The addition of the notification banner to the Django Admin UI.

After the above work is complete, the primary DAG can be written and tested.

## Rollback

<!-- How do we roll back this solution in the event of failure? Are there any steps that can not easily be rolled back? -->

A rollback of this work would only require deleting the created DAG.

## Risks

<!-- What risks are we taking with this solution? Are there risks that once taken can’t be undone?-->

Because this DAG modifies our existing infrastructure directly, there is risk
that we may adversely affect production databases with this DAG. Therefore, care
should be taken in asserting that _only_ staging is affected. One way to
accomplish this would be to provide a wrapper around all the functions we will
need for interacting with `boto3` that checks that `prod` is not in any of the
references made to database assets (save for the initial snapshot acquisition
step). If `prod` is found in any of the references, the DAG should fail.
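
A minimal sketch of such a guard is shown below. The helper name, the allow-list
of operations, and the idea of routing every RDS call through one function are
assumptions about how the DAG might be structured, not an existing utility in
the catalog.

```python
import boto3

# Operations that are allowed to reference production resources at all.
ALLOWED_PROD_OPERATIONS = {"describe_db_snapshots"}


def safe_rds_call(operation: str, **kwargs):
    """Invoke an RDS client method, refusing to touch anything named `prod`."""
    if operation not in ALLOWED_PROD_OPERATIONS:
        references = " ".join(str(value) for value in kwargs.values())
        if "prod" in references.lower():
            raise ValueError(
                f"Refusing to run {operation} against a production resource: {kwargs}"
            )
    return getattr(boto3.client("rds"), operation)(**kwargs)


# Allowed: the snapshot lookup must read from production.
# safe_rds_call("describe_db_snapshots", DBInstanceIdentifier="prod-openverse-db")
# Raises: any mutating call that references production fails loudly.
# safe_rds_call("delete_db_instance", DBInstanceIdentifier="prod-openverse-db")
```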

Additionally, I have enabled deletion protection on the `prod-openverse-db`.
This does not affect rename operations and thus does not fully mitigate a
production outage if the database was renamed, but it does prevent a full
deletion of the database.

## Prior art

<!-- Include links to documents and resources that you used when coming up with your solution. Credit people who have contributed to the solution that you wish to acknowledge. -->

- [Airflow documentation](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/operators/rds.html)
- Internal discussions regarding restoring from snapshots in the past (these
links are not publicly available but may serve as useful references for
maintainers):
- [Catalog RDS instance expansion plan](https://teamopenverse.wordpress.com/2023/03/27/catalog-rds-instance-storage-expansion-plan/)
- [Discussion around snapshot restores as part of the API ECS migration](https://teamopenverse.wordpress.com/2022/07/26/api-ecs-deployment/#comment-3404)
- [Discussion around snapshot restores as part of an API outage recovery after-incident review](https://teamopenverse.wordpress.com/2022/05/18/openverse-api-outage-2022-05-18-debrief/#comment-1066)