
Project Proposal: Search Relevancy Sandbox #1107

Merged 7 commits into main from project/search-relevancy-sandbox on Apr 14, 2023

Conversation

@AetherUnbound (Collaborator) commented Mar 31, 2023

Due date:

2023-04-14

Assigned reviewers

@dhruvkb - I chose you because you've worked on the ingestion server and Elasticsearch in the past and may be familiar (albeit distantly!) with some of these pieces. I know that you're involved in a number of other projects at this moment, so please feel free to opt out if you're currently at capacity.

@sarayourfriend - I chose you because you're familiar with many of the systems at play and have been exploring modifications to Elasticsearch, so your insight into the "user-facing" side of this project would be invaluable.

Description

This PR is the kick-off for the search relevancy sandbox project: #392

@AetherUnbound requested a review from a team as a code owner March 31, 2023 17:59
@AetherUnbound added labels Mar 31, 2023: 🟨 priority: medium (Not blocking but should be addressed soon), 🌟 goal: addition (Addition of new feature), 📄 aspect: text (Concerns the textual material in the repository), 🧱 stack: api (Related to the Django API), 🧱 stack: ingestion server (Related to the ingestion/data refresh server), 🧱 stack: catalog (Related to the catalog and Airflow DAGs), and skip-changelog
@github-actions commented:

Full-stack documentation: https://docs.openverse.org/_preview/1107

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

@AetherUnbound mentioned this pull request Mar 31, 2023
@sarayourfriend (Collaborator) left a comment:

Requesting clarification for a few things.

Overall I think this proposal works, but I think we should mention some specific approaches that we could take that would be way easier than trying to run top-to-bottom data refreshes in staging. Easier both in reduced complexity and in time. We can entirely avoid pushing data from Postgres into Elasticsearch if all we're testing are changes to index configuration: that is worth noting specifically for the implementation plan, especially as it allows us to cut out moving the ingestion server to ECS, a big win in terms of the scale of the downstream work, I would think.

I also wonder if we can bypass the data refresh for the proportional-by-provider stuff as well. We could avoid making any changes to the data refresh process or ingestion server code altogether to support the proportional index. For example, a DAG that uses random_score to pull random documents for a specific provider, paginating with search_after until we have the right number for that provider, would accomplish this without needing to copy data from Postgres or rely on the ingestion server. Especially if we want to eventually move the data refresh process out of the ingestion server project and into Airflow, this might be a good opportunity to start wiring up the ES connection directly from Airflow.
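The sampling approach described above could be sketched roughly like this. This is a hypothetical helper, not Openverse code: the `provider` field, the `identifier` tiebreaker field, and the page size are all assumptions.

```python
def build_random_sample_query(provider, seed, page_size=500, search_after=None):
    """Build an ES request body for one page of seeded-random results.

    Seeding random_score keeps the ordering stable between requests, which
    search_after pagination requires; `identifier` (assumed here to be a
    unique keyword field) acts as the tiebreaker in the sort.
    """
    body = {
        "size": page_size,
        "query": {
            "function_score": {
                "query": {"term": {"provider": provider}},
                # Deterministic "randomness": same seed -> same ordering.
                "functions": [
                    {"random_score": {"seed": seed, "field": "identifier"}}
                ],
                "boost_mode": "replace",
            }
        },
        "sort": [{"_score": "desc"}, {"identifier": "asc"}],
    }
    if search_after is not None:
        body["search_after"] = search_after
    return body
```

A DAG task would then call the search endpoint in a loop, feeding each response's last `sort` values back in as `search_after` until it has collected the target count for that provider.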

Comment on lines 134 to 135
- This plan will describe how rapid iteration of the Elasticsearch index
configuration can happen.
@sarayourfriend (Collaborator) commented:

What does "the Elasticsearch index" mean? Specifically, the "the" isn't clear to me 🤔 ES can have any number of indexes, each with its own independent configuration. Rapidly iterating on index configuration could be as easy as changing the request body to the create index endpoint, which someone could either issue directly in Elasticvue (or similar) or give as input to an Airflow DAG, for example.

I also think we should leverage the re-index API for this. I could see an Airflow DAG meant for iterating on index configuration that works something like this:

  1. Trigger the DAG with a new index configuration body.
    • The existing configuration can be generated using the index_settings function from the ingestion server project. This function runs fine in a Python shell with zero setup inside the ingestion_server project.
  2. The DAG creates the index via the create index endpoint, either using a name provided when the DAG was triggered or by generating a UUID or date-based suffix. This should happen by hitting ES directly, not through the ingestion server.
  3. The DAG then calls the reindex API to copy the data from the existing (periodically refreshed?) "production data" index into the new index. If we want to incorporate "proportional data" or have a subset of data, we could also have the DAG take in a query. If we just need to test a stemmer change, for example, we can query for documents that include the terms in question. For broader text analysis changes we'd want to use the full production data (see my comment above). If we do use the proportional-by-provider data, we can allow changing the origin index between the full production data and the proportional-by-provider index.
  4. After reindex is done, refresh the new index.

Done! We've just completed one index configuration iteration.

This bypasses the complexity of the ingestion server and wouldn't require any changes to the ingestion server at all.
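The four steps above might look something like the following when Airflow talks to ES directly. This is a sketch assuming an elasticsearch-py client; the index names and the date-suffix convention are illustrative only.

```python
from datetime import datetime, timezone


def make_index_name(prefix="image"):
    """Generate a date-suffixed name for the throwaway iteration index."""
    return f"{prefix}-{datetime.now(timezone.utc):%Y%m%d%H%M%S}"


def iterate_index_config(es, source_index, settings, mappings, query=None):
    """One configuration iteration: create -> reindex -> refresh.

    `es` is assumed to be an elasticsearch-py client; the ingestion
    server is not involved at any point.
    """
    dest = make_index_name()
    # 2. Create the new index with the candidate configuration.
    es.indices.create(index=dest, settings=settings, mappings=mappings)
    # 3. Copy data server-side with the reindex API; an optional query
    #    restricts the copy to a subset (e.g. docs containing test terms).
    source = {"index": source_index}
    if query is not None:
        source["query"] = query
    es.reindex(source=source, dest={"index": dest}, wait_for_completion=True)
    # 4. Refresh so the new index is immediately searchable.
    es.indices.refresh(index=dest)
    return dest
```

Step 1's configuration body could be generated with the ingestion server's `index_settings` function in a Python shell, then tweaked and passed in as `settings`/`mappings` when triggering the DAG.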

@AetherUnbound (Collaborator, Author) replied:

I love the idea of leveraging Airflow the way you're describing!

By "the" index I mean the primary index that will eventually serve as the index behind the image|audio alias. In this case/example that's not what would be iterated on, though; I'll try to reword this to be more generic as it relates to indices. The specifics you describe will go into the eventual implementation plan for this step, so I won't include them here.

@sarayourfriend (Collaborator) replied:

Sounds good. I think it is worth recording the general approach in this document, otherwise the person doing the implementation plan may not know about this being a viable approach and go down a different rabbit hole. A small note in the implementation plan section that we can use the reindex API directly from Airflow to ES rather than passing through the ingestion server would be sufficient. Right now the suggested approach feels heavily skewed towards needing to iterate on the ingestion server code itself and maybe even moving it to ECS, which is a lot more work that we could avoid. I think it would at least triple the scope of the implementation plan if it included moving the ingestion server to ECS vs just hitting ES directly from Airflow and using the reindex API.

@AetherUnbound self-assigned this Apr 3, 2023
@AetherUnbound (Collaborator, Author) left a comment:

Thanks for your feedback @sarayourfriend! I'll address a few of the questions you bring up, but I'll wait for your (and others') thoughts on the proportionality before I change that aspect of the proposal.


@sarayourfriend (Collaborator) commented:

Just wanted to confirm that I'm AFK until Tuesday, but I will respond here ASAP on my Tuesday morning/your Monday afternoon @AetherUnbound.

@AetherUnbound force-pushed the project/search-relevancy-sandbox branch from 65be166 to aa0ff85 on April 6, 2023 18:31
@AetherUnbound added the 🧭 project: proposal (A proposal for a project) label Apr 11, 2023
@AetherUnbound (Collaborator, Author) commented:

@sarayourfriend I've added some clarity via d044dfd around the motivation behind the proportionality index and suggestions for implementation plans. I think that covers all of your points of clarification, but please let me know if I've missed anything or if the new revisions are unclear.

@sarayourfriend (Collaborator) left a comment:

LGTM! Excited for this 🚀

@zackkrida (Member) commented:

@AetherUnbound I'll review this now; in the meantime I just wanted to point out that the structure will need to be updated to meet the changes in #1162.

@zackkrida (Member) left a comment:

This looks great, and I had some thoughts that I hope do not derail this work but that I am curious to hear your thoughts on! The last bullet of the last implementation plan:

  • It will also describe the mechanism by which maintainers can rapidly switch index aliases.

Feels quite important to me and like it may be a bit under-explained by this proposal. As written, the proposal seems to suggest that the DAGs would be used to allow for creating a new staging index that replaces the default, "live" index the staging API is pointed at. Is that correct?

Alternatively, I can see a lot of value in having DAGs that allow for creating and updating staging indices with different qualities, but having a mechanism that allows developers to switch between them much more rapidly, perhaps even a query param-based solution (which would allow, for example, to switch indices via a checkbox on the staging.openverse.org/preferences page).

To put this all another way, I'm wondering if requirement #5 should be a bit more specific in terms of what "easily" means.
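For illustration only, the query-param idea could be as small as an allowlisted resolver in the API. Everything here is hypothetical: the parameter name, the allowlist contents, and the default alias are all made up.

```python
# Hypothetical staging-only indices a maintainer might have created.
ALLOWED_STAGING_INDICES = {"image", "image-proportional", "image-new-stemmer"}


def resolve_index(params, default="image"):
    """Pick the ES index for a request from its query-param dict.

    Unknown or missing values fall back to the default alias; user input
    is never passed through directly as an index name.
    """
    requested = params.get("index", default)
    return requested if requested in ALLOWED_STAGING_INDICES else default
```

A checkbox on the staging.openverse.org/preferences page would then just set that parameter on subsequent search requests.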

One more thought: it occurs to me that the staging API database is always just a subset of the production database, except in situations where we're developing column changes to the media tables. In that sense, it seems wasteful and costly to even have image and audio matviews in staging. Instead, I wonder if we could figure something out with foreign data wrappers in Postgres to create the staging matviews from the production database. It's pretty straightforward to create a read-only user for the prod db, connect the staging db with it, and then read the media from production. I imagine that would save considerable time in the staging data refresh process when doing a production-data-volume-sized index, and it would also save considerable cost and resources. Production CPU usage pretty reliably hovers around 4%, so I don't think the extra reads from staging would incur much of a cost at all.
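The foreign data wrapper idea above could be sketched as a setup script along these lines, assuming postgres_fdw. The host, database, user, table names, and credentials are all placeholders, not actual Openverse infrastructure.

```python
# Statements a hypothetical staging setup script might run so that
# staging reads the media tables straight from production through a
# read-only user, via postgres_fdw.
FDW_SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Register the production database as a foreign server.
CREATE SERVER prod_media FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'prod-db.example.internal', dbname 'openverse');

-- Map the staging application role to a read-only production user.
CREATE USER MAPPING FOR staging_app SERVER prod_media
    OPTIONS (user 'prod_readonly', password 'redacted');

-- Expose the production media tables as local foreign tables.
IMPORT FOREIGN SCHEMA public LIMIT TO (image, audio)
    FROM SERVER prod_media INTO public;
"""
```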

@krysal (Member) left a comment:

Thanks for writing this proposal. It clarified many of my questions for this project.

@AetherUnbound force-pushed the project/search-relevancy-sandbox branch from d044dfd to 7edf9e8 on April 14, 2023 18:25
@AetherUnbound (Collaborator, Author) commented Apr 14, 2023:

This looks great, and I had some thoughts that I hope do not derail this work but that I am curious to hear your thoughts on! The last bullet of the last implementation plan:

  • It will also describe the mechanism by which maintainers can rapidly switch index aliases.
    Feels quite important to me and like it may be a bit under-explained by this proposal. As written, the proposal seems to suggest that the DAGs would be used to allow for creating a new staging index that replaces the default, "live" index the staging API is pointed at. Is that correct?
    Alternatively, I can see a lot of value in having DAGs that allow for creating and updating staging indices with different qualities, but having a mechanism that allows developers to switch between them much more rapidly, perhaps even a query param-based solution (which would allow, for example, to switch indices via a checkbox on the staging.openverse.org/preferences page).
    To put this all another way, I'm wondering if requirement #5 should be a bit more specific in terms of what "easily" means.

That's a great point, but I actually think this would be a good thing to determine in the implementation plan! It was my intention to have a single alias (like production, at least at the current moment) and allow maintainers to swap between these aliases. However, I do quite like the idea of an API setting which might allow dynamically changing which index is used at query time! That would bring increased complexity to the plan and the project that I wasn't initially considering. But if we scope it out as part of the plan, potentially starting with the rapid swapping but laying out a plan for what changing the index at query time might look like, we could even add that capability down the line, outside the bounds of this specific effort!
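For reference, the alias swap itself is a single atomic call in Elasticsearch. A sketch with an assumed elasticsearch-py client follows; the alias and index names are made up.

```python
def point_alias_at(es, alias, new_index, old_index=None):
    """Atomically repoint `alias` from `old_index` to `new_index`.

    Both actions go in one update_aliases call, so searchers never see
    a moment where the alias resolves to nothing.
    """
    actions = []
    if old_index is not None:
        actions.append({"remove": {"index": old_index, "alias": alias}})
    actions.append({"add": {"index": new_index, "alias": alias}})
    es.indices.update_aliases(actions=actions)
```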

One more thought: It occurs to me that the staging API database is always just a subset of the production database, with the exception of situations where we're developing column changes to the media tables. In that sense, it seems wasteful and costly to even have image and audio mat views in staging. Instead, I wonder if we could figure something out with foreign data wrappers in Postgres to create the staging matviews from the production database. It's pretty straightforward to create a read-only user for the prod db that we would connect the staging db with, and then read the media from production. I imagine that would also save considerable time in the data refresh process as well for staging when doing a production-data-volume sized index. It would also allow us to save considerable cost and resources. Production CPU usage pretty reliably hovers around 4% so I don't think the extra reads from staging would incur much of a cost at all.

I think a few things might be conflated here: the API databases don't have their own matviews; only the catalog has a matview. The implementation plan in #1154 describes how we could replace the staging API database wholesale without having to worry about the matview refresh aspect, since that will have already been completed in order to create the prod database (on which the snapshot is based). The data refresh steps described in this document are only those which involve indexing data from the database into Elasticsearch; I'll try to make that clearer in the document!

@zackkrida (Member) commented:

Oops, I accidentally referred to the API DB media tables as matviews. I'll check out #1154! My concern was about writing and storing a subset of all Openverse media in the staging API db when changing the indices, and it sounds like that won't be a concern.

@AetherUnbound merged commit c98680e into main Apr 14, 2023
@AetherUnbound deleted the project/search-relevancy-sandbox branch April 14, 2023 19:19
dhruvkb pushed a commit that referenced this pull request Apr 14, 2023
* Project Proposal: Search Relevancy Sandbox

* data refresh server -> ingestion server

* Clarifications about size and which index is referenced

* Clarify motivation for proportional index and implementation plans

* Move & re-title document

* Add clarification from @zackkrida and @krysal

* Add an index file
@sarayourfriend (Collaborator) commented:

laying out a plan for what changing the index at query time might look like

FWIW, if anyone is looking at this later during implementation planning, this probably doesn't need to be more complicated than a query parameter that determines the index, but we'd need to be careful to juggle things in light of https://docs.openverse.org/projects/proposals/detecting_sensitive_textual_content/20230308-implementation_plan_filtering_and_designating_results_with_sensitive_textual_content.html

Projects
Status: Accepted