
Implementation Plan: Update staging database #1154

Merged · 12 commits · May 2, 2023

Conversation

@AetherUnbound (Collaborator) commented Apr 7, 2023

Due date:

2023-05-03

Assigned reviewers

Description

This is an implementation plan for the first portion of #1107.

Current round

This discussion is following the Openverse decision-making process. Information about this process can be found on the Openverse documentation site. Requested reviewers or participants will be following this process. If you are being asked to give input on a specific detail, you do not need to familiarise yourself with the process and follow it.

This discussion is currently in the Decision round.

The deadline for review for this round is May 3rd

@AetherUnbound AetherUnbound requested a review from a team as a code owner April 7, 2023 22:29
@AetherUnbound AetherUnbound removed the request for review from a team April 7, 2023 22:29
@AetherUnbound AetherUnbound added 🟧 priority: high Stalls work on the project or its dependents 🌟 goal: addition Addition of new feature labels Apr 7, 2023
@AetherUnbound AetherUnbound added 📄 aspect: text Concerns the textual material in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs skip-changelog labels Apr 7, 2023
@AetherUnbound AetherUnbound requested review from stacimc and krysal and removed request for sarayourfriend April 7, 2023 22:29
@sarayourfriend (Collaborator) left a comment

I'm not an assigned reviewer but had some small notes for things that could be clarified to avoid easy mistakes during implementation.

Also: the implementation plan guide requests identifying the atomic blocks of work that can be split into individual PRs. This plan looks like it would probably go into a single PR, despite its complexity. Does that sound right? If so, can it be noted explicitly?

@AetherUnbound AetherUnbound self-assigned this Apr 11, 2023
@AetherUnbound AetherUnbound added the 🧭 project: implementation plan An implementation plan for a project label Apr 11, 2023
@AetherUnbound AetherUnbound mentioned this pull request Apr 12, 2023
@stacimc (Contributor) left a comment

Looks great, very clear! And nice to see how many existing operators we'll be able to take advantage of.

the
[`RdsDeleteDbInstanceOperator`](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/operators/rds/index.html#airflow.providers.amazon.aws.operators.rds.RdsDeleteDbInstanceOperator)
(the `wait_for_completion` should be left as `True` here so we don't need a
follow-up sensor).
Contributor

So the last couple steps are:

  • Trigger renaming new database
  • Wait for the renaming to complete (either success or fail)
  • Branch on the status of the renaming:
    • If successful, delete the old db
    • If failed, rename the old db back to dev-openverse-db

I don't understand why those last operations don't require a follow-up sensor to wait for their completion 🤔

Collaborator Author

I think my wording may be ambiguous here - the wait_for_completion step means that we avoid the need for a sensor on the delete operation itself. The wait_for_completion parameter is a feature of the Airflow operators themselves; unfortunately, anything we do with the RDS client directly will require a follow-up sensor, since I don't think there's a similar mechanism available with the boto3 API alone. I'll reword this!
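
To illustrate the distinction (a minimal sketch, not the plan's actual DAG; task ids, instance identifiers, and retry settings below are hypothetical): the delete can rely on the operator's built-in `wait_for_completion`, whereas a rename issued through the boto3 client returns immediately and needs its own polling step or sensor.

```python
# Sketch only: contrasts the operator's built-in waiting with a boto3 call
# that needs a follow-up polling task. All identifiers are hypothetical.
from datetime import timedelta

from airflow.decorators import task
from airflow.providers.amazon.aws.hooks.rds import RdsHook
from airflow.providers.amazon.aws.operators.rds import RdsDeleteDbInstanceOperator

# Deleting the old instance: the operator polls RDS until the instance is
# gone, so no follow-up sensor is needed.
delete_old_db = RdsDeleteDbInstanceOperator(
    task_id="delete_old_staging_db",
    db_instance_identifier="dev-openverse-db-old",
    rds_kwargs={"SkipFinalSnapshot": True},
    wait_for_completion=True,
)


# Renaming via the RDS client directly: boto3 returns as soon as the request
# is accepted, so a separate task (standing in for a proper sensor here) has
# to poll until the renamed instance reports "available".
@task
def rename_db_instance(current_id: str, new_id: str) -> None:
    rds = RdsHook().conn  # the underlying boto3 "rds" client
    rds.modify_db_instance(
        DBInstanceIdentifier=current_id,
        NewDBInstanceIdentifier=new_id,
        ApplyImmediately=True,
    )


@task(retries=20, retry_delay=timedelta(minutes=1))
def await_db_available(db_identifier: str) -> None:
    rds = RdsHook().conn
    response = rds.describe_db_instances(DBInstanceIdentifier=db_identifier)
    status = response["DBInstances"][0]["DBInstanceStatus"]
    if status != "available":
        raise RuntimeError(f"{db_identifier} not ready yet (status: {status})")
```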

Contributor

OH got it, that did totally confuse me -- I was thinking of wait_for_completion as the task id for a sensor. Got it!

configuration.

With this in mind, it seems much easier to handle the outage for staging rather
than try and avoid it.
Contributor

👍 +1 for this reasoning

accomplish this would be to provide a wrapper around all the functions we will
need for interacting with `boto3` that checks that `prod` is not in any of the
references made to database assets (save for the initial snapshot acquisition
step). If `prod` is found in any of the references, the DAG should fail.
Contributor

Very cool idea!
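
To make the guard described in the quoted passage concrete, here is a minimal sketch of one way it might look; the decorator, function names, and the simple substring check are illustrative assumptions, not the plan's actual implementation.

```python
# Hypothetical sketch of a "no prod references" guard around boto3 RDS calls:
# any string argument containing "prod" causes the wrapped call (and thus the
# DAG task) to fail before touching AWS.
import functools

import boto3

PROTECTED_SUBSTRING = "prod"


def ensure_no_prod_references(func):
    """Fail fast if any argument references a production database asset."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for value in [*args, *kwargs.values()]:
            if isinstance(value, str) and PROTECTED_SUBSTRING in value.lower():
                raise ValueError(
                    f"Refusing to run {func.__name__!r}: argument {value!r} "
                    "appears to reference a production database asset."
                )
        return func(*args, **kwargs)

    return wrapper


@ensure_no_prod_references
def rename_db_instance(current_id: str, new_id: str):
    rds = boto3.client("rds")
    return rds.modify_db_instance(
        DBInstanceIdentifier=current_id,
        NewDBInstanceIdentifier=new_id,
        ApplyImmediately=True,
    )
```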


The final product of this plan will be a DAG (scheduled for `@monthly`) which
will recreate the staging database from the most recent snapshot of the
production API database. This will be accomplished by:
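
For readers skimming the thread, a very rough skeleton of what such a DAG could look like (the DAG id, task names, and start date are placeholders, not what the plan specifies):

```python
# Skeleton only: a monthly DAG that restores the latest production snapshot
# into a fresh staging instance. The real task logic lives in the plan itself.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@monthly", start_date=datetime(2023, 5, 1), catchup=False)
def recreate_staging_db_from_prod_snapshot():
    @task
    def get_latest_prod_snapshot() -> str:
        """Look up the most recent automated snapshot of the production API DB."""
        ...

    @task
    def restore_snapshot_to_new_staging_db(snapshot_id: str) -> str:
        """Restore that snapshot into a brand-new staging instance."""
        ...

    @task
    def swap_staging_instances(new_db_id: str) -> None:
        """Rename the old staging DB out of the way and promote the new one."""
        ...

    swap_staging_instances(
        restore_snapshot_to_new_staging_db(get_latest_prod_snapshot())
    )


recreate_staging_db_from_prod_snapshot()
```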
Contributor

I was under the impression that the staging DB was meant to be a subset of the production data, although using the full snapshot would of course be great. I think I may have confused the api DB with the elasticsearch index, which will be a subset. Is that right?

Collaborator Author

Yes, that's correct! The database itself will have all data, but some of the indices we use may have a subset of that data (e.g. the proportional-by-provider index described in #1107). We have also made the staging database a subset of the production database in the past, but I don't think that's necessary to do going forward.

@AetherUnbound (Collaborator, Author)

Thank you both for your feedback & questions! I will be answering the questions and revising the document over the next few days.

@AetherUnbound AetherUnbound force-pushed the project/search-relevancy-sandbox branch from d044dfd to 7edf9e8 Compare April 14, 2023 18:25
@AetherUnbound AetherUnbound force-pushed the project/update-database-implementation-plan branch from 9cebcbe to cf5510f Compare April 14, 2023 18:34
@github-actions (bot)

Full-stack documentation: https://docs.openverse.org/_preview/1154

Please note that GitHub Pages takes a little time to deploy newly pushed code; if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

@AetherUnbound (Collaborator, Author)

> Also: the implementation plan guide requests identifying the atomic blocks of work that can be split into individual PRs. This plan looks like it would probably go into a single PR, despite its complexity. Does that sound right? If so, can it be noted explicitly?

Yes, that's correct - my thought was that this would be a single DAG. I'll make this explicit.

Base automatically changed from project/search-relevancy-sandbox to main April 14, 2023 19:19
@AetherUnbound AetherUnbound force-pushed the project/update-database-implementation-plan branch from cf5510f to 23a00bb Compare April 14, 2023 19:30
@AetherUnbound (Collaborator, Author)

@sarayourfriend and @stacimc - I believe I've applied all of the feedback & clarification from your notes, please take another look when you have a moment!

<!-- Enumerate any references to other documents/pages, including milestones and other plans -->

- [Project Thread](https://github.com/WordPress/openverse/issues/392)
- [Project Proposal]() _TBD_
Member

Suggested change
- [Project Proposal]() _TBD_
- [Project Proposal](https://github.com/WordPress/openverse/pull/1107)


@zackkrida (Member)

Hey, just jumping in here to share (and elaborate on) an alternative approach I mentioned in the project proposal comments. This isn't meant to discredit the work here or suggest that it isn't a suitable solution. In considering this alternative, though, I feel that it avoids some of the pitfalls of the proposed solution, provided that I understand the motivation of this project correctly.

If the staging database is just a copy of the production database, why do we even need a staging database? I came up with the following reasons, trying to ignore anything that wasn't DB specific:

  1. To test Django admin views and operations which impact Django DB tables
  2. To test new tables for new media types

These to me suggest that we'd want the Django-specific tables to be persistent, and not wiped out weekly by restoring from production backup. What we really want is to only sync the media tables. Additionally, I believe access to these media tables only needs to be read-only, as API consumers and the Django admin dashboard do not actually make changes to the media tables.

I can think of three alternative approaches to reading the media from the production database in the staging database. These approaches have some benefits compared to the proposal here, along with their own tradeoffs:

| Name | Pros | Cons |
| --- | --- | --- |
| 1. Foreign Data Wrapper to point staging media tables to production | Access to latest data in real-time; minimal code required; can downsize staging DB; simple to secure read-only connection | Increase in reads to production DB |
| 2. Postgres' "logical replication" feature to keep production and staging media tables in sync | Near-realtime updates automatically; staging DB has its own media records which can be deleted and modified | Limited experience with feature; increase in load on production DB |
| 3. Foreign Data Wrapper to insert all records from production media tables into staging | Staging DB has its own media records which can be deleted and modified | Slower process; scheduled updates; may duplicate data unnecessarily |
  1. Use a foreign data wrapper to point staging media tables to their production counterparts. This means we read all media data directly from the production DB without storing it in the staging DB. It also means that staging always has access to the latest data in realtime. This also doesn't require us to write much code at all. From a security perspective, it's also quite simple to make sure this is a read-only connection that can't modify the production DB. The only thing to consider here is that we'd be increasing reads to the production DB, but I think that's a minor concern. We can also dramatically downsize the staging DB with this approach. (A rough sketch of this setup follows below.)
  2. Use Postgres' "logical replication" feature to keep the production and staging media tables in sync. I don't have experience with this feature, but it allows Postgres to automatically sync changes from a table in one DB to another using a pub/sub model. Basically, in the production DB we CREATE PUBLICATION pub_name FOR TABLE images; and are able to subscribe to this in the staging DB. The benefits here are that we get near-realtime updates automatically and the staging DB has its own media records which can be deleted and modified.
  3. We could use a foreign data wrapper to insert all records from the production media tables into the staging media tables. We could do this incrementally, based on the created_at date of the newest records in the staging media tables, or we could do something more like the data refresh where we copy over all records. This would be slower, and scheduled, but the staging DB has its own media records which can be deleted and modified.

I'm curious what you think about these ideas! I only fell into this rabbit hole while reviewing the project proposal but I'm excited to look at this problem space in a way I hadn't before.
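
To make option 1 concrete, here is a rough sketch of the `postgres_fdw` setup, expressed as SQL executed from Python; hostnames, database names, credentials, and the exact media table names are all hypothetical.

```python
# Sketch of option 1: point staging's media tables at production via
# postgres_fdw. All connection details below are made up.
import psycopg2

FDW_SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER IF NOT EXISTS prod_api_db
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'prod-openverse-db.example.internal', dbname 'openledger', port '5432');

-- The mapped user should only have SELECT on the media tables, which is what
-- keeps this connection effectively read-only.
CREATE USER MAPPING IF NOT EXISTS FOR CURRENT_USER
    SERVER prod_api_db
    OPTIONS (user 'staging_readonly', password 'REDACTED');

-- Expose the production media tables as foreign tables in staging.
IMPORT FOREIGN SCHEMA public
    LIMIT TO (image, audio)
    FROM SERVER prod_api_db
    INTO public;
"""

with psycopg2.connect("dbname=openledger host=staging-db.example.internal") as conn:
    with conn.cursor() as cur:
        cur.execute(FDW_SETUP_SQL)
```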

@AetherUnbound (Collaborator, Author)

I believe I have addressed all the feedback from folks! I'm going to be moving this back into the Decision Round, but since I'm going to be gone most of this week and we have a few folks AFK, we'll extend the timeline for response - please respond to this by May 3rd.

@stacimc (Contributor) left a comment

Looks good! The updates look to capture the feedback well, and thanks for creating #1874. Thanks for the very interesting discussion on this one.

[`get_current_index` task](https://github.com/WordPress/openverse/blob/aedc9c16ce5ed11709e2b6f0b42bad77ea1eb19b/catalog/dags/data_refresh/data_refresh_task_factory.py#L167-L175)
on the data refresh DAG)
2. Initiate the
[`UPDATE_INDEX` action](https://github.com/WordPress/openverse/blob/7427bbd4a8178d05a27e6fef07d70905ec7ef16b/ingestion_server/ingestion_server/api.py#L107-L113)
Contributor

Cool to be able to use this 👍

@krysal (Member) left a comment

The plan is perfectly reasonable and appropriate in my view! Thanks for writing it and adding the suggestions, @AetherUnbound.

I would also like folks who intend to give feedback to explicitly say whether they're okay with data created in staging being destroyed on a regular basis, even with the option to prevent it at any time being available.

To me, staging was always a laboratory for trying changes, and I never saw it as a place where data persisted, so I'm okay with it being wiped out by the workflow described here. Having data persist in staging would be a new requirement, and I still don't see it as necessary.

Comment on lines +110 to +111
1. The staging API must be deployed to apply any necessary migrations to the
new copy of the database. In order to do this we must augment the existing
Member

If the requirement is only to run migrations, it would be easier and faster to just run the Django command `python manage.py migrate` in the staging instance. Are there additional reasons that warrant a deployment?

Collaborator

To run a command in staging we'd need to enable AWS SSM on the AWS client installed in Airflow. ECS Exec only works if SSM is enabled on the client. I don't know how much work it would be to set that up for the catalog. Staging API deployments take <=3 minutes. My 2 cents is that it probably isn't worth the hassle for this particular use-case.
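
For context, a hypothetical sketch of what the ECS Exec route would involve (the cluster, task, and container names are made up); note that the call only starts the command, and reading its output requires the SSM session plumbing discussed above, which is exactly the extra setup being weighed against a quick deployment.

```python
# Hypothetical sketch: run `python manage.py migrate` on a staging API task
# via ECS Exec. This starts the command but does not stream its output; that
# requires an SSM (Session Manager) session, hence the setup cost noted above.
import boto3

ecs = boto3.client("ecs")

response = ecs.execute_command(
    cluster="staging-api-cluster",          # hypothetical cluster name
    task="arn:aws:ecs:us-east-1:123456789012:task/staging-api/EXAMPLE",
    container="django",                     # hypothetical container name
    interactive=True,                       # the ECS Exec API requires this
    command="python manage.py migrate",
)

# The response holds an SSM session descriptor, not the command's output.
print(response["session"]["sessionId"])
```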

Collaborator Author

Yea agreed - I think a deployment would be the easiest thing to kick off given what we have.

@AetherUnbound AetherUnbound force-pushed the project/update-database-implementation-plan branch from c7986e1 to 7ac9ad0 Compare May 2, 2023 22:34
@AetherUnbound AetherUnbound dismissed sarayourfriend’s stale review May 2, 2023 22:46

Review requests addressed

@AetherUnbound AetherUnbound merged commit df8b72f into main May 2, 2023
@AetherUnbound AetherUnbound deleted the project/update-database-implementation-plan branch May 2, 2023 22:52
@AetherUnbound (Collaborator, Author)

Coming back to this, I actually don't think that performing an UPDATE_INDEX on the staging primary index is the appropriate move at this time. This is mainly because that action assumes that all data in the database should be in the index, which is not how we're planning on setting staging up at the moment (see #1987):

```python
log.info(f"Updating index {destination_index} with changes since {since_date}.")
deleted, mature = get_existence_queries(model_name)
query = SQL(
    "SELECT *, {deleted}, {mature} "
    "FROM {model_name} "
    "WHERE updated_on >= {since_date};"
).format(
    deleted=deleted,
    mature=mature,
    model_name=Identifier(model_name),
    since_date=Literal(since_date),
)
self.replicate(model_name, model_name, destination_index, query)
self.refresh(destination_index)
```

After #1987 is implemented we could add steps to initiate a full refresh of the indices this implementation plan describes once the restore is complete. If we wanted to use UPDATE_INDEX here, we'd need to add some logic to that operation to run the same selections as are planned for #1987 and then compare the dates after that.
