
Implementation Plan: Update staging database #1154

Merged · 12 commits · May 2, 2023

Conversation

@AetherUnbound (Collaborator) commented Apr 7, 2023

Due date:

2023-05-03

Assigned reviewers

Description

This is an implementation plan for the first portion of #1107.

Current round

This discussion is following the Openverse decision-making process. Information about this process can be found on the Openverse documentation site. Requested reviewers or participants will be following this process. If you are being asked to give input on a specific detail, you do not need to familiarise yourself with the process and follow it.

This discussion is currently in the Decision round.

The deadline for review for this round is May 3rd

@AetherUnbound AetherUnbound requested a review from a team as a code owner April 7, 2023 22:29
@AetherUnbound AetherUnbound removed the request for review from a team April 7, 2023 22:29
@AetherUnbound AetherUnbound added 🟧 priority: high Stalls work on the project or its dependents 🌟 goal: addition Addition of new feature labels Apr 7, 2023
@AetherUnbound AetherUnbound added 📄 aspect: text Concerns the textual material in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs skip-changelog labels Apr 7, 2023
@AetherUnbound AetherUnbound requested review from stacimc and krysal and removed request for sarayourfriend April 7, 2023 22:29
@sarayourfriend (Collaborator) left a comment

I'm not an assigned reviewer but had some small notes for things that could be clarified to avoid easy mistakes during implementation.

Also: the implementation plan guide requests identifying the atomic blocks of work that can be split into individual PRs. This plan looks like it would probably go into a single PR, despite its complexity. Does that sound right? If so, can it be noted explicitly?

@AetherUnbound AetherUnbound self-assigned this Apr 11, 2023
@AetherUnbound AetherUnbound added the 🧭 project: implementation plan An implementation plan for a project label Apr 11, 2023
@AetherUnbound AetherUnbound mentioned this pull request Apr 12, 2023
@stacimc (Contributor) left a comment

Looks great, very clear! And nice to see how many existing operators we'll be able to take advantage of.

the
[`RdsDeleteDbInstanceOperator`](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/operators/rds/index.html#airflow.providers.amazon.aws.operators.rds.RdsDeleteDbInstanceOperator)
(the `wait_for_completion` should be left as `True` here so we don't need a
follow-up sensor).
Contributor

So the last couple steps are:

  • Trigger renaming new database
  • Wait for the renaming to complete (either success or fail)
  • Branch on the status of the renaming:
    • If successful, delete the old db
    • If failed, rename the old db back to dev-openverse-db

I don't understand why those last operations don't require a follow-up sensor to wait for their completion 🤔

Collaborator Author

I think my wording may be ambiguous here - the wait_for_completion step means that we avoid the need for a sensor on the delete operation itself. The wait_for_completion parameter is a feature of the Airflow operators themselves; unfortunately, anything we do with the RDS client directly will require a follow-up sensor, since I don't think there's a similar mechanism available with the boto3 API alone. I'll reword this!
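
To illustrate the distinction (a minimal sketch, not the plan's actual DAG; task ids, instance identifiers, and retry settings below are hypothetical): the delete can rely on the operator's built-in `wait_for_completion`, whereas a rename issued through the boto3 client returns immediately and needs its own polling step or sensor.

```python
# Sketch only: contrasts the operator's built-in waiting with a boto3 call
# that needs a follow-up polling task. All identifiers are hypothetical.
from datetime import timedelta

from airflow.decorators import task
from airflow.providers.amazon.aws.hooks.rds import RdsHook
from airflow.providers.amazon.aws.operators.rds import RdsDeleteDbInstanceOperator

# Deleting the old instance: the operator polls RDS until the instance is
# gone, so no follow-up sensor is needed.
delete_old_db = RdsDeleteDbInstanceOperator(
    task_id="delete_old_staging_db",
    db_instance_identifier="dev-openverse-db-old",
    rds_kwargs={"SkipFinalSnapshot": True},
    wait_for_completion=True,
)


# Renaming via the RDS client directly: boto3 returns as soon as the request
# is accepted, so a separate task (standing in for a proper sensor here) has
# to poll until the renamed instance reports "available".
@task
def rename_db_instance(current_id: str, new_id: str) -> None:
    rds = RdsHook().conn  # the underlying boto3 "rds" client
    rds.modify_db_instance(
        DBInstanceIdentifier=current_id,
        NewDBInstanceIdentifier=new_id,
        ApplyImmediately=True,
    )


@task(retries=20, retry_delay=timedelta(minutes=1))
def await_db_available(db_identifier: str) -> None:
    rds = RdsHook().conn
    response = rds.describe_db_instances(DBInstanceIdentifier=db_identifier)
    status = response["DBInstances"][0]["DBInstanceStatus"]
    if status != "available":
        raise RuntimeError(f"{db_identifier} not ready yet (status: {status})")
```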

Contributor

OH got it, that did totally confuse me -- I was thinking of wait_for_completion as the task id for a sensor. Got it!

configuration.

With this in mind, it seems much easier to handle the outage for staging rather
than try and avoid it.
Contributor

👍 +1 for this reasoning

accomplish this would be to provide a wrapper around all the functions we will
need for interacting with `boto3` that checks that `prod` is not in any of the
references made to database assets (save for the initial snapshot acquisition
step). If `prod` is found in any of the references, the DAG should fail.
Contributor

Very cool idea!
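
To make the guard described in the quoted passage concrete, here is a minimal sketch of one way it might look; the decorator, function names, and the simple substring check are illustrative assumptions, not the plan's actual implementation.

```python
# Hypothetical sketch of a "no prod references" guard around boto3 RDS calls:
# any string argument containing "prod" causes the wrapped call (and thus the
# DAG task) to fail before touching AWS.
import functools

import boto3

PROTECTED_SUBSTRING = "prod"


def ensure_no_prod_references(func):
    """Fail fast if any argument references a production database asset."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for value in [*args, *kwargs.values()]:
            if isinstance(value, str) and PROTECTED_SUBSTRING in value.lower():
                raise ValueError(
                    f"Refusing to run {func.__name__!r}: argument {value!r} "
                    "appears to reference a production database asset."
                )
        return func(*args, **kwargs)

    return wrapper


@ensure_no_prod_references
def rename_db_instance(current_id: str, new_id: str):
    rds = boto3.client("rds")
    return rds.modify_db_instance(
        DBInstanceIdentifier=current_id,
        NewDBInstanceIdentifier=new_id,
        ApplyImmediately=True,
    )
```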


The final product of this plan will be a DAG (scheduled for `@monthly`) which
will recreate the staging database from the most recent snapshot of the
production API database. This will be accomplished by:
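
For readers skimming the thread, a very rough skeleton of what such a DAG could look like (the DAG id, task names, and start date are placeholders, not what the plan specifies):

```python
# Skeleton only: a monthly DAG that restores the latest production snapshot
# into a fresh staging instance. The real task logic lives in the plan itself.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@monthly", start_date=datetime(2023, 5, 1), catchup=False)
def recreate_staging_db_from_prod_snapshot():
    @task
    def get_latest_prod_snapshot() -> str:
        """Look up the most recent automated snapshot of the production API DB."""
        ...

    @task
    def restore_snapshot_to_new_staging_db(snapshot_id: str) -> str:
        """Restore that snapshot into a brand-new staging instance."""
        ...

    @task
    def swap_staging_instances(new_db_id: str) -> None:
        """Rename the old staging DB out of the way and promote the new one."""
        ...

    swap_staging_instances(
        restore_snapshot_to_new_staging_db(get_latest_prod_snapshot())
    )


recreate_staging_db_from_prod_snapshot()
```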
Contributor

I was under the impression that the staging DB was meant to be a subset of the production data, although using the full snapshot would of course be great. I think I may have confused the api DB with the elasticsearch index, which will be a subset. Is that right?

Collaborator Author

Yes, that's correct! The database itself will have all data, but some of the indices we use may have a subset of that data (e.g. the proportional-by-provider index described in #1107). We have also made the staging database a subset of the production database in the past, but I don't think that's necessary to do going forward.

@AetherUnbound (Collaborator, Author)

Thank you both for your feedback & questions! I will be answering the questions and revising the document over the next few days.

@AetherUnbound AetherUnbound force-pushed the project/search-relevancy-sandbox branch from d044dfd to 7edf9e8 Compare April 14, 2023 18:25
@AetherUnbound AetherUnbound force-pushed the project/update-database-implementation-plan branch from 9cebcbe to cf5510f Compare April 14, 2023 18:34
@github-actions (bot)

Full-stack documentation: https://docs.openverse.org/_preview/1154

Please note that GitHub Pages takes a little time to deploy newly pushed code; if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

@AetherUnbound (Collaborator, Author)

> Also: the implementation plan guide requests identifying the atomic blocks of work that can be split into individual PRs. This plan looks like it would probably go into a single PR, despite its complexity. Does that sound right? If so, can it be noted explicitly?

Yes, that's correct - my thought was that this would be a single DAG. I'll make this explicit.

Base automatically changed from project/search-relevancy-sandbox to main April 14, 2023 19:19
@AetherUnbound AetherUnbound force-pushed the project/update-database-implementation-plan branch from cf5510f to 23a00bb Compare April 14, 2023 19:30
@AetherUnbound (Collaborator, Author)

@sarayourfriend and @stacimc - I believe I've applied all of the feedback & clarification from your notes, please take another look when you have a moment!

<!-- Enumerate any references to other documents/pages, including milestones and other plans -->

- [Project Thread](https://github.com/WordPress/openverse/issues/392)
- [Project Proposal]() _TBD_
Member

Suggested change
- [Project Proposal]() _TBD_
- [Project Proposal](https://github.com/WordPress/openverse/pull/1107)


@zackkrida (Member)

Hey, just jumping in here to share (and elaborate on) an alternative approach I mentioned in the project proposal comments. This isn't meant to discredit the work here or suggest that it isn't a suitable solution. In considering this alternative, though, I feel that it avoids some of the pitfalls of the proposed solution, provided that I understand the motivation of this project correctly.

If the staging database is just a copy of the production database, why do we even need a staging database? I came up with the following reasons, trying to ignore anything that wasn't DB specific:

  1. To test Django admin views and operations which impact Django DB tables
  2. To test new tables for new media types

These to me suggest that we'd want the Django-specific tables to be persistent, and not wiped out weekly by restoring from production backup. What we really want is to only sync the media tables. Additionally, I believe access to these media tables only needs to be read-only, as API consumers and the Django admin dashboard do not actually make changes to the media tables.

I can think of three alternative approaches to reading the media from the production database in the staging database. These approaches have some benefits compared to the proposal here, along with their own tradeoffs:

| Name | Pros | Cons |
| --- | --- | --- |
| 1. Foreign Data Wrapper to point staging media tables to production | Access to latest data in real-time; minimal code required; can downsize staging DB; simple to secure read-only connection | Increase in reads to production DB |
| 2. Postgres' "logical replication" feature to keep production and staging media tables in sync | Near-realtime updates automatically; staging DB has its own media records which can be deleted and modified | Limited experience with feature; increase in load on production DB |
| 3. Foreign Data Wrapper to insert all records from production media tables into staging | Staging DB has its own media records which can be deleted and modified | Slower process; scheduled updates; may duplicate data unnecessarily |
  1. Use a foreign data wrapper to point staging media tables to their production counterparts. This means we read all media data directly from the production DB without storing it in the staging DB. It also means that staging always has access to the latest data in realtime. This also doesn't require us to write much code at all. From a security perspective, it's also quite simple to make sure this is a read-only connection that can't modify the production DB. The only thing to consider here is that we'd be increasing reads to the production DB, but I think that's a minor concern. We can also dramatically downsize the staging DB with this approach. (A rough sketch of this setup follows below.)
  2. Use Postgres' "logical replication" feature to keep the production and staging media tables in sync. I don't have experience with this feature, but it allows Postgres to automatically sync changes from a table in one DB to another using a pub/sub model. Basically, in the production DB we CREATE PUBLICATION pub_name FOR TABLE images; and are able to subscribe to this in the staging DB. The benefits here are that we get near-realtime updates automatically and the staging DB has its own media records which can be deleted and modified.
  3. We could use a foreign data wrapper to insert all records from the production media tables into the staging media tables. We could do this incrementally, based on the created_at date of the newest records in the staging media tables, or we could do something more like the data refresh where we copy over all records. This would be slower, and scheduled, but the staging DB has its own media records which can be deleted and modified.

I'm curious what you think about these ideas! I only fell into this rabbit hole while reviewing the project proposal but I'm excited to look at this problem space in a way I hadn't before.
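
To make option 1 concrete, here is a rough sketch of the `postgres_fdw` setup, expressed as SQL executed from Python; hostnames, database names, credentials, and the exact media table names are all hypothetical.

```python
# Sketch of option 1: point staging's media tables at production via
# postgres_fdw. All connection details below are made up.
import psycopg2

FDW_SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER IF NOT EXISTS prod_api_db
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'prod-openverse-db.example.internal', dbname 'openledger', port '5432');

-- The mapped user should only have SELECT on the media tables, which is what
-- keeps this connection effectively read-only.
CREATE USER MAPPING IF NOT EXISTS FOR CURRENT_USER
    SERVER prod_api_db
    OPTIONS (user 'staging_readonly', password 'REDACTED');

-- Expose the production media tables as foreign tables in staging.
IMPORT FOREIGN SCHEMA public
    LIMIT TO (image, audio)
    FROM SERVER prod_api_db
    INTO public;
"""

with psycopg2.connect("dbname=openledger host=staging-db.example.internal") as conn:
    with conn.cursor() as cur:
        cur.execute(FDW_SETUP_SQL)
```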

@AetherUnbound (Collaborator, Author)

I believe I have addressed all the feedback from folks! I'm going to be moving this back into the Decision Round, but since I'm going to be gone most of this week and we have a few folks AFK, we'll extend the timeline for response - please respond to this by May 3rd.

@stacimc (Contributor) left a comment

Looks good! The updates look to capture the feedback well, and thanks for creating #1874. Thanks for the very interesting discussion on this one.

[`get_current_index` task](https://github.com/WordPress/openverse/blob/aedc9c16ce5ed11709e2b6f0b42bad77ea1eb19b/catalog/dags/data_refresh/data_refresh_task_factory.py#L167-L175)
on the data refresh DAG)
2. Initiate the
[`UPDATE_INDEX` action](https://github.com/WordPress/openverse/blob/7427bbd4a8178d05a27e6fef07d70905ec7ef16b/ingestion_server/ingestion_server/api.py#L107-L113)
Contributor

Cool to be able to use this 👍

@krysal (Member) left a comment

The plan is perfectly reasonable and appropriate in my view! Thanks for writing it and adding the suggestions, @AetherUnbound.

I would also like folks who intend to give feedback to explicitly say whether they're okay with data created in staging being destroyed on a regular basis, even with the option to prevent it at any time being available.

To me, staging was always a laboratory for trying changes, and I never saw it as a place where data persisted, so I'm okay with it being wiped out by the workflow described here. Having data persist in staging would be a new requirement, and I still don't see it as necessary.

Comment on lines +110 to +111
1. The staging API must be deployed to apply any necessary migrations to the
new copy of the database. In order to do this we must augment the existing
Member

If the requirement is only to run migrations, it would be easier and faster to just run the Django command `python manage.py migrate` in the staging instance. Are there additional reasons that warrant a deployment?

Collaborator

To run a command in staging we'd need to enable AWS SSM on the AWS client installed in Airflow. ECS Exec only works if SSM is enabled on the client. I don't know how much work it would be to set that up for the catalog. Staging API deployments take <=3 minutes. My 2 cents is that it probably isn't worth the hassle for this particular use-case.
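
For context, a hypothetical sketch of what the ECS Exec route would involve (the cluster, task, and container names are made up); note that the call only starts the command, and reading its output requires the SSM session plumbing discussed above, which is exactly the extra setup being weighed against a quick deployment.

```python
# Hypothetical sketch: run `python manage.py migrate` on a staging API task
# via ECS Exec. This starts the command but does not stream its output; that
# requires an SSM (Session Manager) session, hence the setup cost noted above.
import boto3

ecs = boto3.client("ecs")

response = ecs.execute_command(
    cluster="staging-api-cluster",          # hypothetical cluster name
    task="arn:aws:ecs:us-east-1:123456789012:task/staging-api/EXAMPLE",
    container="django",                     # hypothetical container name
    interactive=True,                       # the ECS Exec API requires this
    command="python manage.py migrate",
)

# The response holds an SSM session descriptor, not the command's output.
print(response["session"]["sessionId"])
```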

Collaborator Author

Yea agreed - I think a deployment would be the easiest thing to kick off given what we have.

@AetherUnbound AetherUnbound force-pushed the project/update-database-implementation-plan branch from c7986e1 to 7ac9ad0 Compare May 2, 2023 22:34
@AetherUnbound AetherUnbound dismissed sarayourfriend’s stale review May 2, 2023 22:46

Review requests addressed

@AetherUnbound AetherUnbound merged commit df8b72f into main May 2, 2023
@AetherUnbound AetherUnbound deleted the project/update-database-implementation-plan branch May 2, 2023 22:52
@AetherUnbound (Collaborator, Author)

Coming back to this, I actually don't think that performing an UPDATE_INDEX on the staging primary index is the appropriate move at this time. This is mainly because that action assumes that all data in the database should be in the index, which is not how we're planning on setting staging up at the moment (see #1987):

```python
log.info(f"Updating index {destination_index} with changes since {since_date}.")
deleted, mature = get_existence_queries(model_name)
query = SQL(
    "SELECT *, {deleted}, {mature} "
    "FROM {model_name} "
    "WHERE updated_on >= {since_date};"
).format(
    deleted=deleted,
    mature=mature,
    model_name=Identifier(model_name),
    since_date=Literal(since_date),
)
self.replicate(model_name, model_name, destination_index, query)
self.refresh(destination_index)
```

After #1987 is implemented we could add steps to initiate a full refresh of the indices this implementation plan describes once the restore is complete. If we wanted to use UPDATE_INDEX here, we'd need to add some logic to that operation to run the same selections as are planned for #1987 and then compare the dates after that.
