Project Proposal: Search Relevancy Sandbox #1107

Merged on Apr 14, 2023 (7 commits)
@@ -0,0 +1,161 @@
# Project Proposal - 2023-03-31

## Reviewers

<!-- Choose two people at your discretion who make sense to review this based on their existing expertise. Check in to make sure folks aren't currently reviewing more than one other proposal or RFC. -->

- [x] sarayourfriend
- [x] zackkrida
- [x] krysal

## Project summary

<!-- A brief one or two sentence summary of the project's features -->

Develop mechanisms and workflows for sustainably creating & updating our staging
API database & Elasticsearch cluster, in order to enable testing changes to
Elasticsearch indices. The staging setup will reflect production in two separate
ways using two different indices: proportionally by provider and with
production-level volumes.

## Goals

<!-- Which yearly goal does this project advance? -->

Yearly goal: **Result Relevancy**

## Requirements

<!-- Detailed descriptions of the features required for the project. Include user stories if you feel they'd be helpful, but focus on describing a specification for how the feature would work with an eye towards edge cases. -->

The following should be available once this project is complete:

1. A mechanism by which maintainers can easily update the staging API database
with recent data.
2. A mechanism by which maintainers can create/update a small but
proportional-to-production-by-provider index.
3. A mechanism by which maintainers can create/update a production-data-volume
sized index.
4. A mechanism by which maintainers can easily deploy & iterate on new
Elasticsearch indexing configurations.
5. A mechanism by which maintainers can easily point the staging Elasticsearch
index aliases (currently `image` and `audio`) to one of the above.

Many of these mechanisms will likely be built in Airflow.
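
As an illustration of requirement 5, the sketch below builds the request body for Elasticsearch's `POST /_aliases` endpoint, which applies all listed actions atomically. The helper name `build_alias_swap` and the index names `image-full` and `image-proportional` are hypothetical and only serve the example; how the request is actually sent (for instance from an Airflow task) is left to the implementation plans.

```python
def build_alias_swap(alias, new_index, current_indices):
    """Build a body for Elasticsearch's `POST /_aliases` endpoint.

    The alias is removed from every index that currently holds it and added
    to `new_index`. Elasticsearch applies the actions of one request
    atomically, so the alias never points at nothing in between.
    """
    actions = [
        {"remove": {"index": index, "alias": alias}}
        for index in current_indices
        if index != new_index
    ]
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}


# Hypothetical index names, used only for illustration.
body = build_alias_swap(
    alias="image",
    new_index="image-proportional",
    current_indices=["image-full"],
)
```

The resulting `body` could then be sent with any HTTP or Elasticsearch client available in the environment.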

### Proportional by provider

The proportional-by-provider index will be a subset of the production data,
constructed so that the number of records by provider is roughly proportional to
the percentages per provider that exist in production. This index will allow us
to iterate rapidly on the following:

- Changes affecting Elasticsearch index configuration (e.g. shard sizes,
slicing, etc.)
- Changes to the ingestion server's configuration which require running a data
refresh end-to-end
- Measuring index performance
- Integration testing dead link & thumbnail checks across all providers

This could be done by:

1. Querying the `/stats` endpoint of the production API (e.g.
https://api.openverse.engineering/v1/images/stats/).
2. Calculating percentage-per-provider.
3. Computing the number of results per provider from these percentages, under
the condition that the smallest provider has a minimum of `N` results (e.g. a
minimum of 10 results).
4. Pulling a number of results based on the previous step per-provider either
into a Postgres table or an Elasticsearch index (the specifics of this step
will be left for the implementation plan).
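
Steps 2 and 3 above could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: it assumes the stats payload is a list of objects carrying `source_name` and `media_count` fields (which should be verified against the real response), and the function names and sample counts are hypothetical. Fetching the payload and pulling the records (steps 1 and 4) are intentionally left out.

```python
def counts_from_stats(stats_payload):
    """Map provider name to record count.

    Assumption: each entry exposes `source_name` and `media_count`.
    """
    return {entry["source_name"]: entry["media_count"] for entry in stats_payload}


def proportional_sample_sizes(counts, min_results=10):
    """Scale per-provider counts so the smallest provider keeps `min_results`.

    Providers with no records are skipped, and a sample size never exceeds
    the number of records the provider actually has.
    """
    populated = {name: count for name, count in counts.items() if count > 0}
    if not populated:
        return {}
    smallest = min(populated.values())
    sizes = {}
    for name, count in populated.items():
        # Integer ceiling division avoids floating point rounding surprises.
        scaled = (count * min_results + smallest - 1) // smallest
        sizes[name] = min(count, scaled)
    return sizes


# Made-up counts for three hypothetical providers.
sample_stats = [
    {"source_name": "large_provider", "media_count": 1000},
    {"source_name": "medium_provider", "media_count": 100},
    {"source_name": "small_provider", "media_count": 50},
]
sizes = proportional_sample_sizes(counts_from_stats(sample_stats), min_results=10)
```

With these made-up counts the smallest provider keeps 10 records and the others are scaled by the same factor, so the relative proportions match the input.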

This index will allow us to assess how changes to the Elasticsearch index or the
API will affect results using a smaller total result count. It will also allow
us to test API results from all currently available providers.

### Production data volume

The production-data-volume index will be an index of similar size (i.e. on the
same order of magnitude in document count) to production. The data refresh
already produces an index of this size, so no additional setup work is required.

This index will allow us to test how queries and indexing operations perform at
a data volume consistent with production.

## Success

<!-- How do we measure the success of the project? How do we know our ideas worked? -->

This project can be considered a success once any maintainer can make
adjustments to the staging API database & Elasticsearch index based on the
requirements described above.

## Participants and stakeholders

<!-- Who is working on the project and who are the external stakeholders, if any? Consider the lead, implementers, designers, and other stakeholders who have a say in how the project goes. -->

- Lead: @AetherUnbound
- Implementation:
  - @AetherUnbound
  - TBD
- Stakeholders:
  - Openverse Team

## Infrastructure

<!-- What infrastructural considerations need to be made for this project? If there are none, say so explicitly rather than deleting the section. -->

This project may require infrastructure work to support rapid iteration on the
ingestion server. Further details will be determined in the implementation
plan.

The project will also likely involve the creation or modification of Airflow
DAGs related to the data refresh and infrastructure modification. This will
require interfacing Airflow with the staging ingestion server, API database, and
Elasticsearch cluster.

## Accessibility

<!-- Are there specific accessibility concerns relevant to this project? Do you expect new UI elements that would need particular care to ensure they're implemented in an accessible way? Consider also low-spec device and slow internet accessibility, if relevant. -->

Not every member of the maintainer team is intimately familiar with Airflow, so
we will need to provide clear instructions for maintainers on how to use Airflow
to run any of the above mechanisms.

## Marketing

<!-- Are there potential marketing opportunities that we'd need to coordinate with the community to accomplish? If there are none, say so explicitly rather than deleting the section. -->

No marketing is required as this is an improvement internal to the team.

## Required implementation plans

<!-- What are the required implementation plans? Consider if they should be split per level of the stack or per feature. -->

In the order they should be completed:

1. Staging API DB update procedure
   - This plan will describe a DAG which can be triggered to update the staging
     API database.
2. Rapid iteration on ingestion server index configuration
   - This plan will describe how rapid iteration of Elasticsearch index
     configurations can happen. This can be done via Airflow interacting
     directly with Elasticsearch rather than having to process requests through
     the ingestion server. Airflow would only need to be aware of the
     [existing index settings](https://github.com/WordPress/openverse/blob/0a5f4ab2ce5d80a48bd1c57d2a2dbcca14fcbedc/ingestion_server/ingestion_server/es_mapping.py)
     and how to augment/adjust them for the new index.
     ([See this comment for further inspiration](https://github.com/WordPress/openverse/pull/1107#discussion_r1155399508))
   - It should be assessed as part of this implementation plan whether it would
     be easier to convert the ingestion server to ECS or provide a mechanism on
     the existing EC2 infrastructure to update the Elasticsearch index
     configuration without issuing a deployment (similar to the
     [dag-sync script for the catalog](https://github.com/WordPress/openverse-catalog/blob/10857e3ee94ae686853984c54d504b152082d4c2/dag-sync.sh)).
3. Staging Elasticsearch reindex DAGs for both potential index types (these will
   be subsets of the full data refresh)
   - This plan will describe the DAG or DAGs which will be used to create/update
     both the proportional-by-provider and production-data-volume indices.
   - It will also describe the mechanism by which maintainers can rapidly switch
     which index the staging API uses. This could be done in two separate ways:
     a DAG which allows changing the primary index alias or a set of changes to
     the API which would allow queries to specify which index they use. The
     implementation plan should explore and describe both options.
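
To make the "augment/adjust" idea from plan 2 more concrete, here is a minimal sketch of layering overrides on top of an existing index configuration without mutating it. The function name `merge_index_config` is hypothetical, and the base configuration shown is a placeholder, not the actual contents of `es_mapping.py`.

```python
import copy


def merge_index_config(base, overrides):
    """Return a copy of `base` with `overrides` merged in recursively.

    Nested dictionaries are merged key by key; any other value in
    `overrides` replaces the value in `base`. Neither input is modified.
    """
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_index_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Placeholder configuration; the values are illustrative only.
base_config = {
    "settings": {"index": {"number_of_shards": 18, "number_of_replicas": 0}},
    "mappings": {"properties": {"title": {"type": "text"}}},
}
# An experimental change to try on a new staging index.
experiment = {"settings": {"index": {"number_of_shards": 6}}}

new_config = merge_index_config(base_config, experiment)
```

Keeping the base configuration untouched means several experimental variants can be derived from the same starting point and compared side by side.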
@@ -0,0 +1,8 @@
# Search Relevancy Sandbox

```{toctree}
:titlesonly:
:glob:

*
```