Search relevancy sandbox #392

obulat · 2023-02-10T14:56:05Z

Start Date	ETA	Project Lead	Actual Ship Date
2023-04-01	2024-04-31	@AetherUnbound	TBD

Description

Modify the API staging environment to include a proportional subset of media from each provider. Increase the frequency of data refreshes.

This project does not address metrics for measuring and tracking relevancy.

To implement this project, we want to read the production /stats/ endpoints to get the media totals for each provider, then scale these numbers down to set the ingestion limit per provider in staging.
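As a rough illustration of the scaling step described above, here is a minimal sketch. It assumes the stats response can be reduced to a list of objects with `source_name` and `media_count` fields (an assumption, not confirmed in this thread); the `staging_limits` helper, the sample numbers, and the one-record minimum are all hypothetical.

```python
def staging_limits(provider_totals, percent, minimum=1):
    """Scale production per-provider totals down to staging ingestion limits.

    `percent` is an integer percentage of production (1-100). Integer
    arithmetic is used so the rounding up is exact.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    limits = {}
    for provider, total in provider_totals.items():
        if total <= 0:
            limits[provider] = 0
            continue
        # Ceiling division: keep at least `minimum` records per provider
        # so small providers are still represented in staging.
        scaled = (total * percent + 99) // 100
        limits[provider] = max(minimum, scaled)
    return limits


# Hypothetical data shaped like a reduced stats response (field names assumed).
sample_stats = [
    {"source_name": "flickr", "media_count": 1_000_000},
    {"source_name": "wikimedia", "media_count": 250_000},
    {"source_name": "tiny_provider", "media_count": 12},
]
totals = {row["source_name"]: row["media_count"] for row in sample_stats}
limits = staging_limits(totals, percent=1)
print(limits)
```

Rounding up with a floor of one record is only one possible choice; it keeps every provider visible in staging at the cost of slightly over-representing the smallest ones.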

This project could also include the update of the Elasticsearch cluster (or setting up a new cluster and moving the staging there).

Documents

Project homepage

Issues

Milestones

Rapid iteration on Elasticsearch indices: https://github.com/WordPress/openverse/milestone/16

Prior Art


AetherUnbound · 2023-04-12T21:23:35Z

Update 2023-04-12

Done

Blockers

AetherUnbound · 2023-05-02T23:10:04Z

Update 2023-05-02

Done

Implementation Plan: Update staging database #1154 was merged

Next

Create the issues associated with the work described in Implementation Plan: Update staging database #1154
Implement the issues created above
Implementation plan for the rapid iteration on ingestion server index configuration
Implementation plan for the Staging Elasticsearch reindex DAGs for both potential index types

AetherUnbound · 2023-06-02T23:54:07Z

Both the rapid iteration IP (#1985) and the staging database recreation DAG (#1989) are underway. Most of the implementation for the DAG is complete, with #2207 and #2211 being the final two pieces before that can be shipped. We'll also need to complete #1990 before we turn the DAG on for regular use.

AetherUnbound · 2023-06-09T09:12:11Z

The staging database restore DAG is complete! 🥳 I will enable it for the first time once I'm back from WCEU next week.

AetherUnbound · 2023-06-09T10:44:45Z

The implementation plan for the rapid iteration (#2133) has been merged and the associated issues (#2370, #2371, #2372) have been created. This work must be performed serially, but it can begin immediately!

The initial implementation plan for the 3rd portion of this project has also been published by @krysal in #2358

openverse-bot · 2023-06-24T00:27:41Z

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

AetherUnbound · 2023-06-30T23:05:26Z

The staging database restore DAG work was completed and was enabled for the first time this week! While it completed successfully, @sarayourfriend encountered some trouble with the Terraform state which was referencing the RDS instance (in that the instance appeared to Terraform to be removed, even though a new instance was spun up). Sara corrected the state file and we believe this issue is resolved, but we'll need to monitor the next run to see if we encounter the same issue and adjust either Terraform or the DAG accordingly.

@krysal opened the final IP for this work in #2358, and we've begun discussing it there.

sarayourfriend · 2023-07-02T23:00:34Z

> but we'll need to monitor the next run to see if we encounter the same issue and adjust either Terraform or the DAG accordingly.

The subsequent runs confirmed that the Terraform changes work as expected; no further work is needed here.

AetherUnbound · 2023-07-10T18:50:26Z

Per the priorities meeting discussion that happened around this project, we'll be putting this project on hold for now. We'll continue the review process to merge #2358, but hold off on any ongoing development efforts in favor of other ongoing projects.

openverse-bot · 2023-10-05T00:20:57Z

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

AetherUnbound · 2023-10-05T02:10:58Z

We've just taken this project off hold. The Elasticsearch rapid iteration milestone issues can be worked on once we address the API stability!

openverse-bot · 2023-10-20T00:20:57Z

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

AetherUnbound · 2023-10-31T22:43:14Z

The work on this project has been slower due to our current effort to focus on #3197. With that work slowing down, we should be able to start working on this project again this cycle.

openverse-bot · 2023-11-15T00:21:48Z

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

AetherUnbound · 2023-11-15T02:51:19Z

The only update is that we now have a PR open for creating the staging indices: #3232

stacimc · 2023-11-17T18:45:52Z

While working on #3232, I went back and read the implementation plans for this project and came away with a lot of questions. It's possible some of these are answered in comment threads on the IP PRs that didn't make it into the final text, so apologies if I'm rehashing anything!

Here's a really brief summary of my understanding of the three IPs involved in this project, and the DAGs they describe:

Update staging DB
- staging_database_restore DAG
  - this updates the staging DB using the most recent production db snapshot. It does not affect es indices.
Rapid iteration of ingestion server index configuration
- create_new_es_index_staging
  - TL;DR: creates a brand new es index, with lots of options for configuration, but does not actually promote the index/apply any aliases. (Question about this below)
    - Longer explanation: For the given media_type, creates a new staging elasticsearch index with the given index-suffix, derived from the given source_index, optionally filtering documents from the source using a given query. It also takes an index-config which is merged with the ES configuration of the source index (or totally overwrites it, if the override_config param is enabled), allowing you to change the shape of the index on the fly.
- create_new_es_index_production
  - The exact same thing as above, but for production
Staging Elasticsearch Reindex DAGs
- recreate_full_staging_index
  - Creates a new es index in staging based on the <media_type> index, and points the <media-type>-full alias to it. If the point_alias param is enabled, it will instead point it to the main <media_type> alias. It also has a param that’s used to decide whether to delete the index being replaced.
- create_proportional_by_provider_staging_index
  - Creates a new es index in staging based on the given source_index, but with a proportional subset of records for each provider, given by the percentage_of_prod param.
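The config handling summarized above for create_new_es_index_{environment} (merge `index-config` into the source index's configuration, or replace it entirely when `override_config` is enabled) could look roughly like the sketch below. This is only an illustration: the thread does not say whether the merge is recursive, so the recursive behaviour, the `merge_index_config` helper, and the sample settings are assumptions.

```python
from copy import deepcopy


def _deep_update(base, updates):
    """Recursively apply `updates` onto `base` in place."""
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            _deep_update(base[key], value)
        else:
            base[key] = deepcopy(value)


def merge_index_config(source_config, index_config, override_config=False):
    """Return the configuration to use for the new index.

    With override_config=True the supplied config replaces the source
    config outright; otherwise it is merged on top of it.
    """
    if override_config:
        return deepcopy(index_config)
    merged = deepcopy(source_config)
    _deep_update(merged, index_config)
    return merged


# Hypothetical source configuration and overrides.
source = {"settings": {"index": {"number_of_shards": 18, "number_of_replicas": 0}}}
changes = {"settings": {"index": {"number_of_replicas": 1}}}

merged = merge_index_config(source, changes)
replaced = merge_index_config(source, changes, override_config=True)
```

In this sketch the merged result keeps `number_of_shards` from the source and picks up the new replica count, while the override result contains only what was passed in.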

So here are my questions:

1. We just updated the data refresh to create the filtered index before promoting the new media index, because we observed performance issues during filtered index creation (since it was running a reindex on a live, actively used index). How will the create_new_es_index_production DAGs account for this? They currently create indices from the main es index.
2. The create_new_es_index_{environment} DAGs create the indices, but don’t promote them/point any aliases. Shouldn't they? What is the plan for how to promote those indices — is that done manually, or in a separate DAG not yet planned? Could we not have a couple params:
   1. target_alias: target alias to apply on new index. If empty, no alias is pointed
   2. delete_old_index: whether to delete an index that is being replaced by this one, if applicable (i.e., the index previously pointed to by the target alias)
3. What is the purpose of the recreate_full_staging_index DAGs? The create_new_es_index_staging does the same thing but also has many more configuration options. The exception is the option to promote/delete old indices, which I argue should be added to that DAG anyway. This seems like duplicated code to have to maintain.
4. I do think it's a good idea to separate out the create_proportional_by_provider_staging_index DAG (rather than add the percentage_of_prod param to the create_new_es_index_{environment} DAGs), especially since that param doesn't make sense in prod. But it seems like we could intentionally share a lot of logic around creating the index and the promotion steps (only changing the logic for the actual reindexing).
5. Should we add an issue to create a staging_data_refresh DAG to run data refreshes on the staging ingestion server/against the staging API? That seems like it would be a really helpful final piece for testing (and easy to implement).
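To make the proposed `target_alias` and `delete_old_index` params concrete, here is a hedged sketch of how they might translate into the `actions` list accepted by Elasticsearch's update-aliases API (`add`, `remove`, and `remove_index` are real action types). The `build_alias_actions` helper and the index names are hypothetical, and this is not necessarily how the DAG would implement promotion.

```python
def build_alias_actions(new_index, target_alias, current_index=None,
                        delete_old_index=False):
    """Build the `actions` list for an Elasticsearch update-aliases request.

    Returns an empty list when no target alias is given, meaning the new
    index is created but not promoted.
    """
    if not target_alias:
        return []
    actions = []
    if current_index and current_index != new_index:
        if delete_old_index:
            # Deleting the old index also drops its aliases.
            actions.append({"remove_index": {"index": current_index}})
        else:
            actions.append(
                {"remove": {"index": current_index, "alias": target_alias}}
            )
    actions.append({"add": {"index": new_index, "alias": target_alias}})
    return actions


# Hypothetical index names, for illustration only.
promote = build_alias_actions("image-new", "image", current_index="image-old")
promote_and_delete = build_alias_actions(
    "image-new", "image", current_index="image-old", delete_old_index=True
)
no_promotion = build_alias_actions("image-new", target_alias="")
```

Sending all actions in a single request lets the alias move from the old index to the new one atomically, which is the main reason to build them together rather than issuing separate calls.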

AetherUnbound · 2023-11-21T19:02:06Z

I'll try my best to answer the above, @krysal may also be able to provide some additional context!

> We just updated the data refresh to create the filtered index before promoting the new media index, because we observed performance issues during filtered index creation (since it was running a reindex on a live, actively used index). How will the create_new_es_index_production DAGs account for this? They currently create indices from the main es index.

Part of that move was prompted by the API response time investigation project, and the motivation was, as you mention, to reduce pressure on the live indices while the data refresh is happening. Since we were able to reduce response times through a combination of related media query simplification and API ASGI implementation, I think we should be safe to target live indices for this DAG in the future. With the frequency of the data refresh (and the ease with which we could change the order of operations for it!), it made sense to make that change. For the more on-the-fly index generation that this new DAG was intended to enable, I'm not sure how we'd get around using the live indices (unless we decided to create a whole new index in order to create the index from that, in which case we should just create the index from the database with the new settings).

All that to say: given how infrequently this particular DAG is likely to run and how much our Elasticsearch response time has improved, I think this should be a safe thing to do now 🙂

> The create_new_es_index_{environment} dags create the indices, but don’t promote them/point any aliases. Shouldn't they? What is the plan for how to promote those indices — is that done manually, or in a separate dag not yet planned?

That's a great point, I like the idea of adding those new configuration options!

> What is the purpose of the recreate_full_staging_index DAGs? The create_new_es_index_staging does the same thing but also has many more configuration options. The exception is the option to promote/delete old indices, which I argue should be added to that DAG anyway. This seems like duplicated code to have to maintain.

As I understand it, create_new_es_index_* only uses an existing index, whereas recreate_full_staging_index should pull records from the database and insert them into a new index, similar to the data refresh. I believe the same holds for the proportional DAG. The former is derived while the latter two are created fresh. With that in mind, I think it makes sense to keep them separate.

> I do think it's a good idea to separate out the create_proportional_by_provider_staging_index DAG (rather than add the percentage_of_prod param to the create_new_es_index_{environment} DAGs), especially since that param doesn't make sense in prod. But it seems like we could intentionally share a lot of logic around creating the index and the promotion steps (only changing the logic for the actual reindexing).

@sarayourfriend also mentioned this in the third IP. I also support having separate DAGs for these but as we're developing them we should keep reuse in mind!

> Should we add an issue to create a staging_data_refresh DAG to run data refreshes on the staging ingestion server/against the staging api? That seems like it would be a really helpful final piece for testing (and easy to implement).

That could be useful! Especially for testing changes to the data refresh process. I think one thing we'd want to add on an infrastructure level is an official DNS name for the staging data refresh server, because currently we only reference it by IP, which changes with every deployment.

Hopefully that helps! I'd love some other folks to weigh in, but I think it might make sense to take some of these changes back to the IPs and update them appropriately once we've decided 😄

krysal · 2023-11-21T23:14:41Z

@stacimc You summarized the parts of these plans very well. The questions are quite valid, and I pretty much agree with the answers that @AetherUnbound has already given, I'll just add a few comments.

> What is the purpose of the recreate_full_staging_index DAGs? The create_new_es_index_staging does the same thing but also has many more configuration options. The exception is the option to promote/delete old indices, which I argue should be added to that DAG anyway. This seems like duplicated code to have to maintain.

> As I understand it, create_new_es_index_* only uses an existing index, whereas recreate_full_staging_index should pull records from the database and insert them into a new index, similar to the data refresh. I believe the same holds for the proportional DAG. The former is derived while the latter two are created fresh. With that in mind, I think it makes sense to keep them separate.

Exactly! The recreate_full_staging_index is planned mainly to decouple the full index creation from the Data Refresh process, and to use the resulting index as the source for the create_proportional_by_provider_staging_index and the create_new_es_index_{environment} DAGs.

Given the complexity of this project at the beginning, it was decided to split the different forms of index creation (by type, environment, and size, with custom configurations) into different IPs and, therefore, different DAGs; it was more understandable and manageable this way. Now that the requirements are clearer and there is more familiarity with ES, we can see opportunities to merge the DAGs. I'd wait to build at least some of them before doing so, though, because the many moving parts are still there and it will be easier to test them separately. That's just my opinion! Maybe others see it differently. I also believe the same DAGs should probably apply to both environments at the end of the day.

stacimc · 2023-11-28T01:35:20Z

Thanks @krysal and @AetherUnbound, that additional context makes a ton of sense :)

> The recreate_full_staging_index is planned mainly to decouple the full index creation from the Data Refresh process, and to use the resulting index as the source for the create_proportional_by_provider_staging_index and the create_new_es_index_{environment} DAGs.

That's a great summary, totally clicked for me. I think the underlying IPs are great, but I was stuck on a higher-level picture of how these DAGs are practically going to be used and relate to one another. That's actually sort of split out into a separate, future project 😅 For now, it makes a lot of sense to just be really flexible in the implementation, as already described :)

stacimc · 2024-02-07T00:48:35Z

Update:

I've opened a PR for the proportional index DAG which is almost ready, mostly needing a final test and then description/testing instructions.

This PR also happens to tackle much of the work needed for this issue to add a DAG for pointing aliases; once it's merged that issue can be resolved very quickly.

While implementing the proportional DAG I did notice a problem for which I filed a bug (#3761). In looking at that now I actually think it might've been easier to implement the fixed version than the original idea, so I may update the proportional index DAG PR to include that fix as well.

Summary: there are only a few issues left in the milestone, the largest of which is almost fully implemented. The remaining issues should be tackled in the next week or so.

stacimc · 2024-03-06T17:05:23Z

The proportional index DAG was merged after a while in code review, including the fix for the #3761 bug that was filed while working on it! All that remains is the point alias DAG, which is in progress and a much smaller issue.

stacimc · 2024-03-11T16:38:35Z

A PR is open for the point alias DAG.

While working on it I noticed that a few of the concurrency dependencies between all the related elasticsearch DAGs had not been set up properly, or noticed in review. That is a huge pain point of having all these DAGs and very easy to miss. I prototyped an idea for handling that in a more automated way and created an issue (#3891).

At minimum the dependencies need to be fixed; ideally, that issue is implemented and this problem is solved going forward. My plan is to timebox the larger issue and fall back to simply fixing the dependencies manually if there are any issues with the more complex approach.
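One way to picture the problem: if the Elasticsearch DAGs that touch the same cluster are declared once in a shared grouping, each DAG's list of "do not run concurrently with" peers can be derived rather than maintained by hand. The sketch below is purely illustrative and is not the approach proposed in #3891; the `conflicting_dags` helper and the group name are hypothetical, while the DAG ids are the ones mentioned earlier in this thread.

```python
def conflicting_dags(dag_id, groups):
    """Return the other DAG ids that share any group with `dag_id`.

    `groups` maps a resource name (for example an Elasticsearch
    environment) to the DAG ids that must not run against it concurrently.
    """
    conflicts = set()
    for members in groups.values():
        if dag_id in members:
            conflicts.update(members)
    conflicts.discard(dag_id)
    return sorted(conflicts)


# Hypothetical grouping; the group name is made up for illustration.
es_groups = {
    "staging_elasticsearch": [
        "create_new_es_index_staging",
        "recreate_full_staging_index",
        "create_proportional_by_provider_staging_index",
    ],
}

peers = conflicting_dags("recreate_full_staging_index", es_groups)
```

With a single declaration like this, adding a new DAG to a group automatically updates every other member's list of peers, which is the kind of omission that is easy to miss in review.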

stacimc · 2024-03-26T22:02:13Z

The final PR for this milestone has been merged! Moving this project to Shipped.

stacimc · 2024-03-26T22:24:23Z

Per the project's stated success criteria:

> This project can be considered a success once any maintainer can make adjustments to the staging API database & Elasticsearch index based on the requirements described above.

I think this could be considered a success. The process docs note that a project-specific retro could be held at this stage as well, but feedback related to this project has already been discussed at the two most recent retros.

I would like to propose that this project be moved to Success and the project thread closed, given this. @WordPress/openverse-maintainers, what do you think (and maybe specifically @AetherUnbound as the author of the proposal)? I can also wait to raise this question at the community meeting if preferred!

AetherUnbound · 2024-03-26T22:48:16Z

I would be comfortable moving this to Success!

stacimc · 2024-03-27T22:55:21Z

Excellent! Moving to Success 🎉

obulat added the 🧭 project: thread An issue used to track a project and its progress label Feb 10, 2023

obulat assigned AetherUnbound Mar 8, 2023

AetherUnbound mentioned this issue Mar 31, 2023

Project Proposal: Search Relevancy Sandbox #1107

Merged

3 tasks

This was referenced May 3, 2023

Staging database recreation DAG #1989

Closed

Add notice to staging Django Admin UI of next scheduled data wipe #1990

Closed

zackkrida mentioned this issue May 31, 2023

Project thread reminders #2251

Merged

12 tasks

This was referenced Jun 9, 2023

Add Elasticsearch Airflow Provider #2370

Closed

Add Airflow Connections for Elasticsearch clusters #2371

Closed

Build ES index creation DAG #2372

Closed

This was referenced Oct 23, 2023

Move filtered index creation totally to Airflow #3240

Open

Build the ES full index recreation DAG #3246

Closed

krysal mentioned this issue Nov 24, 2023

Clean up all previous indexes after successfully switching to a new one during data refresh #1481

Closed

1 task

krysal added this to the Rapid iteration on Elasticsearch indices milestone Dec 6, 2023

krysal mentioned this issue Dec 8, 2023

Build the proportional-by-provider ES index creation DAG #3498

Closed

AetherUnbound assigned stacimc and unassigned AetherUnbound Dec 19, 2023

AetherUnbound mentioned this issue Dec 19, 2023

Relevancy Experimentation Framework #421

Open

2 tasks

stacimc closed this as completed Mar 27, 2024
