
Project Proposal: Search Relevancy Sandbox #1107

Merged 7 commits into main from project/search-relevancy-sandbox on Apr 14, 2023

Conversation

@AetherUnbound (Collaborator) commented Mar 31, 2023

Due date:

2023-04-14

Assigned reviewers

@dhruvkb - I chose you because you've worked on the ingestion server and Elasticsearch in the past and may be familiar (albeit distantly!) with some of these pieces. I know that you're involved in a number of other projects at this moment, so please feel free to opt out if you're currently at capacity.

@sarayourfriend - I chose you because you're familiar with many of the systems at play and have been exploring modifications to Elasticsearch, so your insight into the "user-facing" side of this project would be invaluable.

Description

This PR is the kick-off for the search relevancy sandbox project: #392

@AetherUnbound requested a review from a team as a code owner March 31, 2023 17:59
@AetherUnbound added labels Mar 31, 2023: 🟨 priority: medium (Not blocking but should be addressed soon), 🌟 goal: addition (Addition of new feature), 📄 aspect: text (Concerns the textual material in the repository), 🧱 stack: api (Related to the Django API), 🧱 stack: ingestion server (Related to the ingestion/data refresh server), 🧱 stack: catalog (Related to the catalog and Airflow DAGs), and skip-changelog
@github-actions commented:

Full-stack documentation: https://docs.openverse.org/_preview/1107

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

@AetherUnbound mentioned this pull request Mar 31, 2023
@sarayourfriend (Collaborator) left a comment:

Requesting clarification for a few things.

Overall I think this proposal works, but I think we should mention some specific approaches that we could take that would be way easier than trying to run top-to-bottom data refreshes in staging. Easier both in reduced complexity and in time. We can entirely avoid pushing data from Postgres into Elasticsearch if all we're testing are changes to index configuration: that is worth noting specifically for the implementation plan, especially as it allows us to cut out moving the ingestion server to ECS, a big win in terms of the scale of the downstream work, I would think.

I also wonder if we can bypass the data refresh for the proportional-by-provider stuff as well. We could avoid making any changes to the data refresh process or ingestion server code altogether to support the proportional index. For example, a DAG that uses random_score to pull random documents for a specific provider, paginating with search_after until we have the right number for that provider, would accomplish this without needing to copy data from Postgres or rely on the ingestion server. Especially if we want to eventually move the data refresh process out of the ingestion server project and into Airflow, this might be a good opportunity to start wiring up the ES connection directly from Airflow.
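The sampling approach described above could be sketched roughly like this. This is a hypothetical helper, not Openverse code: the `provider` field, the `identifier` tiebreaker field, and the page size are all assumptions.

```python
def build_random_sample_query(provider, seed, page_size=500, search_after=None):
    """Build an ES request body for one page of seeded-random results.

    Seeding random_score keeps the ordering stable between requests, which
    search_after pagination requires; `identifier` (assumed here to be a
    unique keyword field) acts as the tiebreaker in the sort.
    """
    body = {
        "size": page_size,
        "query": {
            "function_score": {
                "query": {"term": {"provider": provider}},
                # Deterministic "randomness": same seed -> same ordering.
                "functions": [
                    {"random_score": {"seed": seed, "field": "identifier"}}
                ],
                "boost_mode": "replace",
            }
        },
        "sort": [{"_score": "desc"}, {"identifier": "asc"}],
    }
    if search_after is not None:
        body["search_after"] = search_after
    return body
```

A DAG task would then call the search endpoint in a loop, feeding each response's last `sort` values back in as `search_after` until it has collected the target count for that provider.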

Comment on lines 134 to 135
- This plan will describe how rapid iteration of the Elasticsearch index
configuration can happen.
@sarayourfriend (Collaborator) commented:

What does "the Elasticsearch index" mean? Specifically, the "the" isn't clear to me 🤔 ES can have any number of indexes, each with its own independent configuration. Rapidly iterating on index configuration could be as easy as changing the request body to the create index endpoint, which someone could either issue directly in Elasticvue (or similar) or give as input to an Airflow DAG, for example.

I also think we should leverage the re-index API for this. I could see an Airflow DAG meant for iterating on index configuration that works something like this:

  1. Trigger the DAG with a new index configuration body.
    • The existing configuration can be generated using the index_settings function from the ingestion server project. This function runs fine in a Python shell with zero setup inside the ingestion_server project.
  2. The DAG creates the index via the create index endpoint, either using a name provided when the DAG was triggered or by generating a UUID or date-based suffix. This should happen by hitting ES directly, not through the ingestion server.
  3. The DAG then calls the reindex API to copy the data from the existing (periodically refreshed?) "production data" index into the new index. If we want to incorporate "proportional data" or have a subset of data, we could also have the DAG take in a query. If we just need to test a stemmer change, for example, we can query for documents that include the terms in question. For broader text analysis changes we'd want to use the full production data (see my comment above). If we do use the proportional-by-provider data, we can allow changing the origin index between the full production data and the proportional-by-provider index.
  4. After reindex is done, refresh the new index.

Done! We've just completed one index configuration iteration.

This bypasses the complexity of the ingestion server and wouldn't require any changes to the ingestion server at all.
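The four steps above might look something like the following when Airflow talks to ES directly. This is a sketch assuming an elasticsearch-py client; the index names and the date-suffix convention are illustrative only.

```python
from datetime import datetime, timezone


def make_index_name(prefix="image"):
    """Generate a date-suffixed name for the throwaway iteration index."""
    return f"{prefix}-{datetime.now(timezone.utc):%Y%m%d%H%M%S}"


def iterate_index_config(es, source_index, settings, mappings, query=None):
    """One configuration iteration: create -> reindex -> refresh.

    `es` is assumed to be an elasticsearch-py client; the ingestion
    server is not involved at any point.
    """
    dest = make_index_name()
    # 2. Create the new index with the candidate configuration.
    es.indices.create(index=dest, settings=settings, mappings=mappings)
    # 3. Copy data server-side with the reindex API; an optional query
    #    restricts the copy to a subset (e.g. docs containing test terms).
    source = {"index": source_index}
    if query is not None:
        source["query"] = query
    es.reindex(source=source, dest={"index": dest}, wait_for_completion=True)
    # 4. Refresh so the new index is immediately searchable.
    es.indices.refresh(index=dest)
    return dest
```

Step 1's configuration body could be generated with the ingestion server's `index_settings` function in a Python shell, then tweaked and passed in as `settings`/`mappings` when triggering the DAG.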

@AetherUnbound (Collaborator, Author) replied:

I love the idea of leveraging Airflow the way you're describing!

By "the" index I mean the primary index that will eventually serve as the index behind the image|audio alias. In this case/example that's not what would be iterated on, though; I'll try to reword this to be more generic as it relates to indices. The specifics you describe will go into the eventual implementation plan for this step, so I won't include them here.

@sarayourfriend (Collaborator) replied:

Sounds good. I think it is worth recording the general approach in this document, otherwise the person doing the implementation plan may not know about this being a viable approach and go down a different rabbit hole. A small note in the implementation plan section that we can use the reindex API directly from Airflow to ES rather than passing through the ingestion server would be sufficient. Right now the suggested approach feels heavily skewed towards needing to iterate on the ingestion server code itself and maybe even moving it to ECS, which is a lot more work that we could avoid. I think it would at least triple the scope of the implementation plan if it included moving the ingestion server to ECS vs just hitting ES directly from Airflow and using the reindex API.

@AetherUnbound self-assigned this Apr 3, 2023
@AetherUnbound (Collaborator, Author) left a comment:

Thanks for your feedback @sarayourfriend! I'll address a few of the questions you bring up, but I'll wait for your (and others') thoughts on the proportionality before I change that aspect of the proposal.


@sarayourfriend (Collaborator) commented:

Just wanted to confirm that I'm AFK until Tuesday, but I will respond here ASAP on my Tuesday morning/your Monday afternoon @AetherUnbound.

@AetherUnbound force-pushed the project/search-relevancy-sandbox branch from 65be166 to aa0ff85 on April 6, 2023 18:31
@AetherUnbound added the 🧭 project: proposal (A proposal for a project) label Apr 11, 2023
@AetherUnbound (Collaborator, Author) commented:

@sarayourfriend I've added some clarity via d044dfd around the motivation behind the proportionality index and suggestions for implementation plans. I think that covers all of your points of clarification, but please let me know if I've missed anything or if the new revisions are unclear.

@sarayourfriend (Collaborator) left a comment:

LGTM! Excited for this 🚀

@zackkrida (Member) commented:

@AetherUnbound I'll review this now; in the meantime I just wanted to point out that the structure will need to be updated to meet the changes in #1162.

@zackkrida (Member) left a comment:

This looks great, and I had some thoughts that I hope do not derail this work but that I am curious to hear your thoughts on! The last bullet of the last implementation plan:

  • It will also describe the mechanism by which maintainers can rapidly switch index aliases.

Feels quite important to me and like it may be a bit under-explained by this proposal. As written, the proposal seems to suggest that the DAGs would be used to allow for creating a new staging index that replaces the default, "live" index the staging API is pointed at. Is that correct?

Alternatively, I can see a lot of value in having DAGs that allow for creating and updating staging indices with different qualities, but having a mechanism that allows developers to switch between them much more rapidly, perhaps even a query param-based solution (which would allow, for example, to switch indices via a checkbox on the staging.openverse.org/preferences page).

To put this all another way, I'm wondering if requirement #5 should be a bit more specific in terms of what "easily" means.
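For illustration only, the query-param idea could be as small as an allowlisted resolver in the API. Everything here is hypothetical: the parameter name, the allowlist contents, and the default alias are all made up.

```python
# Hypothetical staging-only indices a maintainer might have created.
ALLOWED_STAGING_INDICES = {"image", "image-proportional", "image-new-stemmer"}


def resolve_index(params, default="image"):
    """Pick the ES index for a request from its query-param dict.

    Unknown or missing values fall back to the default alias; user input
    is never passed through directly as an index name.
    """
    requested = params.get("index", default)
    return requested if requested in ALLOWED_STAGING_INDICES else default
```

A checkbox on the staging.openverse.org/preferences page would then just set that parameter on subsequent search requests.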

One more thought: it occurs to me that the staging API database is always just a subset of the production database, except in situations where we're developing column changes to the media tables. In that sense, it seems wasteful and costly to even have image and audio matviews in staging. Instead, I wonder if we could figure something out with foreign data wrappers in Postgres to create the staging matviews from the production database. It's pretty straightforward to create a read-only user for the prod db, connect the staging db with it, and then read the media from production. I imagine that would save considerable time in the staging data refresh process when doing a production-data-volume-sized index, and it would also save considerable cost and resources. Production CPU usage pretty reliably hovers around 4%, so I don't think the extra reads from staging would incur much of a cost at all.
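The foreign data wrapper idea above could be sketched as a setup script along these lines, assuming postgres_fdw. The host, database, user, table names, and credentials are all placeholders, not actual Openverse infrastructure.

```python
# Statements a hypothetical staging setup script might run so that
# staging reads the media tables straight from production through a
# read-only user, via postgres_fdw.
FDW_SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Register the production database as a foreign server.
CREATE SERVER prod_media FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'prod-db.example.internal', dbname 'openverse');

-- Map the staging application role to a read-only production user.
CREATE USER MAPPING FOR staging_app SERVER prod_media
    OPTIONS (user 'prod_readonly', password 'redacted');

-- Expose the production media tables as local foreign tables.
IMPORT FOREIGN SCHEMA public LIMIT TO (image, audio)
    FROM SERVER prod_media INTO public;
"""
```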

@krysal (Member) left a comment:

Thanks for writing this proposal. It clarified many of my questions for this project.

@AetherUnbound force-pushed the project/search-relevancy-sandbox branch from d044dfd to 7edf9e8 on April 14, 2023 18:25
@AetherUnbound (Collaborator, Author) commented Apr 14, 2023:

This looks great, and I had some thoughts that I hope do not derail this work but that I am curious to hear your thoughts on! The last bullet of the last implementation plan:

  • It will also describe the mechanism by which maintainers can rapidly switch index aliases.
    Feels quite important to me and like it may be a bit under-explained by this proposal. As written, the proposal seems to suggest that the DAGs would be used to allow for creating a new staging index that replaces the default, "live" index the staging API is pointed at. Is that correct?
    Alternatively, I can see a lot of value in having DAGs that allow for creating and updating staging indices with different qualities, but having a mechanism that allows developers to switch between them much more rapidly, perhaps even a query param-based solution (which would allow, for example, to switch indices via a checkbox on the staging.openverse.org/preferences page).
    To put this all another way, I'm wondering if requirement #5 should be a bit more specific in terms of what "easily" means.

That's a great point, but I actually think this would be a good thing to determine in the implementation plan! It was my intention to have a single alias (like production, at least at the current moment) and allow maintainers to swap between these aliases. However, I do quite like the idea of an API setting which might allow dynamically changing which index is used at query time! That would bring increased complexity to the plan and the project that I wasn't initially considering. But if we scope it out as part of the plan, potentially starting with the rapid swapping but laying out a plan for what changing the index at query time might look like, we could even add that capability down the line, outside the bounds of this specific effort!
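For reference, the alias swap itself is a single atomic call in Elasticsearch. A sketch with an assumed elasticsearch-py client follows; the alias and index names are made up.

```python
def point_alias_at(es, alias, new_index, old_index=None):
    """Atomically repoint `alias` from `old_index` to `new_index`.

    Both actions go in one update_aliases call, so searchers never see
    a moment where the alias resolves to nothing.
    """
    actions = []
    if old_index is not None:
        actions.append({"remove": {"index": old_index, "alias": alias}})
    actions.append({"add": {"index": new_index, "alias": alias}})
    es.indices.update_aliases(actions=actions)
```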

One more thought: It occurs to me that the staging API database is always just a subset of the production database, with the exception of situations where we're developing column changes to the media tables. In that sense, it seems wasteful and costly to even have image and audio mat views in staging. Instead, I wonder if we could figure something out with foreign data wrappers in Postgres to create the staging matviews from the production database. It's pretty straightforward to create a read-only user for the prod db that we would connect the staging db with, and then read the media from production. I imagine that would also save considerable time in the data refresh process as well for staging when doing a production-data-volume sized index. It would also allow us to save considerable cost and resources. Production CPU usage pretty reliably hovers around 4% so I don't think the extra reads from staging would incur much of a cost at all.

I think a few things might be conflated here: the API databases don't have their own matviews; only the catalog has a matview. The implementation plan in #1154 describes how we could replace the staging API database wholesale without having to worry about the matview refresh aspect, since that will have already been completed in order to create the prod database (on which the snapshot is based). The data refresh steps described in this document are only those which involve indexing data from the database into Elasticsearch; I'll try to make that clearer in the document!

@zackkrida (Member) commented:

Oops, I accidentally referred to the API DB media tables as matviews. I'll check out #1154! My concern was about writing and storing a subset of all Openverse media in the staging API db when changing the indices, and it sounds like that won't be a concern.

@AetherUnbound merged commit c98680e into main Apr 14, 2023
@AetherUnbound deleted the project/search-relevancy-sandbox branch April 14, 2023 19:19
dhruvkb pushed a commit that referenced this pull request Apr 14, 2023
* Project Proposal: Search Relevancy Sandbox

* data refresh server -> ingestion server

* Clarifications about size and which index is referenced

* Clarify motivation for proportional index and implementation plans

* Move & re-title document

* Add clarification from @zackkrida and @krysal

* Add an index file
@sarayourfriend (Collaborator) commented:

laying out a plan for what changing the index at query time might look like

FWIW, if anyone is looking at this later during implementation planning, this probably doesn't need to be more complicated than a query parameter that determines the index, but we'd need to be careful to juggle things in light of https://docs.openverse.org/projects/proposals/detecting_sensitive_textual_content/20230308-implementation_plan_filtering_and_designating_results_with_sensitive_textual_content.html

Projects
Status: Accepted