diff --git a/documentation/projects/proposals/search_relevancy_sandbox/20230331-project_proposal_search_relevancy_sandbox.md b/documentation/projects/proposals/search_relevancy_sandbox/20230331-project_proposal_search_relevancy_sandbox.md new file mode 100644 index 00000000000..f76e54033b4 --- /dev/null +++ b/documentation/projects/proposals/search_relevancy_sandbox/20230331-project_proposal_search_relevancy_sandbox.md @@ -0,0 +1,161 @@ +# Project Proposal - 2023-03-31 + +## Reviewers + + + +- [x] sarayourfriend +- [x] zackkrida +- [x] krysal + +## Project summary + + + +Develop mechanisms and workflows for sustainably creating & updating our staging +API database & Elasticsearch cluster, in order to enable testing changes to +Elasticsearch indices. The staging setup will reflect production in two separate +ways using two different indices: proportionally by provider and with +production-level volumes. + +## Goals + + + +Yearly goal: **Result Relevancy** + +## Requirements + + + +The following should be available once this project is complete: + +1. A mechanism by which maintainers can easily update the staging API database + with recent data. +2. A mechanism by which maintainers can create/update a small but + proportional-to-production-by-provider index. +3. A mechanism by which maintainers can create/update a production-data-volume + sized index. +4. A mechanism by which maintainers can easily deploy & iterate on new + Elasticsearch indexing configurations +5. A mechanism by which maintainers can easily point the staging Elasticsearch + index aliases (currently `image` and `audio`) to one of the above. + +Many of these mechanisms will likely be built in Airflow. + +### Proportional by provider + +The proportional-by-provider index will be a subset of the production data, +constructed so that the number of records by provider is roughly proportional to +the percentages per provider that exist in production. This index will allow us +to iterate rapidly on the following: + +- Changes affecting Elasticsearch index configuration (e.g. shard sizes, + slicing, etc.) +- Changes to the ingestion server's configuration which require running a data + refresh end-to-end +- Measuring index performance +- Integration testing dead link & thumbnail checks across all providers + +This could be done by: + +1. Querying the `/stats` endpoint of the production API (e.g. + https://api.openverse.engineering/v1/images/stats/). +2. Calculating percentage-per-provider. +3. Computing the number of results per provider based on these percentages based + on the condition that the smallest provider has a minimum `N` results (e.g. a + minimum of 10 results). +4. Pulling a number of results based on the previous step per-provider either + into a Postgres table or an Elasticsearch index (the specifics of this step + will be left for the implementation plan). + +This index will allow us to assess how changes to Elasticsearch index or the API +will affect results using a smaller total result count. It will also allow us to +test API results from all currently available providers. + +### Production data volume + +The production-data-volume index will be an index of similar size (i.e. on the +same order of magnitude in document count) to production. This is how the data +refresh is currently set up and does not require any additional work to set up. + +This index will allow us to test how queries and indexing operations perform at +a data volume consistent with production. + +## Success + + + +This project can be considered a success once any maintainer can make +adjustments to the staging API database & Elasticsearch index based on the +requirements described above. + +## Participants and stakeholders + + + +- Lead: @AetherUnbound +- Implementation: + - @AetherUnbound + - TBD +- Stakeholders: + - Openverse Team + +## Infrastructure + + + +This project may require infrastructure work necessary to assist with the rapid +iteration of the ingestion server. Further details will be determined in the +implementation plan. + +The project will also likely involve the creation or modification of Airflow +DAGs related to the data refresh and infrastructure modification. This will +require interfacing Airflow with the staging ingestion server, API database, and +Elasticsearch cluster. + +## Accessibility + + + +Not every member of the maintainer team is intimately familiar with Airflow - we +will need to provide clear instructions for maintainers on how to use Airflow to +run any of the above mechanisms. + +## Marketing + + + +No marketing is required as this is an improvement internal to the team. + +## Required implementation plans + + + +In the order they should be completed: + +1. Staging API DB update procedure + - This plan will describe a DAG which can be triggered to update the staging + API database. +2. Rapid iteration on ingestion server index configuration + - This plan will describe how rapid iteration of Elasticsearch index + configurations can happen. This can be done via Airflow interacting + directly with Elasticsearch rather than having to process requests through + the ingestion server. Airflow would only need to be aware of the + [existing index settings](https://github.com/WordPress/openverse/blob/0a5f4ab2ce5d80a48bd1c57d2a2dbcca14fcbedc/ingestion_server/ingestion_server/es_mapping.py) + and how to augment/adjust them for the new index. + ([See this comment for further inspiration](https://github.com/WordPress/openverse/pull/1107#discussion_r1155399508)) + - It should be assessed as part of this implementation plan whether it would + be easier to convert the ingestion server to ECS or provide a mechanism on + the existing EC2 infrastructure to update the Elasticsearch index + configuration without issuing a deployment (similar to the + [dag-sync script for the catalog](https://github.com/WordPress/openverse-catalog/blob/10857e3ee94ae686853984c54d504b152082d4c2/dag-sync.sh)). +3. Staging Elasticsearch reindex DAGs for both potential index types (these will + be subsets of the full data refresh) + - This plan will describe the DAG or DAGs which will be used to create/update + both the proportional-by-provider and production-data-volume indices. + - It will also describe the mechanism by which maintainers can rapidly switch + index the staging API uses. This could be done in two separate ways: a DAG + which allows changing the primary index alias or a set of changes to the + API which would allow queries to specify which index they use. The + implementation plan should explore and describe both options. diff --git a/documentation/projects/proposals/search_relevancy_sandbox/index.md b/documentation/projects/proposals/search_relevancy_sandbox/index.md new file mode 100644 index 00000000000..f894844ca1d --- /dev/null +++ b/documentation/projects/proposals/search_relevancy_sandbox/index.md @@ -0,0 +1,8 @@ +# Search Relevancy Sandbox + +```{toctree} +:titlesonly: +:glob: + +* +```