diff --git a/documentation/projects/proposals/trust_and_safety/detecting_sensitive_textual_content/20240903-implementation_plan_undo_split_filtered_indices.md b/documentation/projects/proposals/trust_and_safety/detecting_sensitive_textual_content/20240903-implementation_plan_undo_split_filtered_indices.md
new file mode 100644
index 00000000000..cfd71c46b11
--- /dev/null
+++ b/documentation/projects/proposals/trust_and_safety/detecting_sensitive_textual_content/20240903-implementation_plan_undo_split_filtered_indices.md
@@ -0,0 +1,830 @@
+# 2024-09-03 Implementation Plan: Undo the split indices for filtered textual sensitivity
+
+**Author**: @sarayourfriend
+
+## Reviewers
+
+- [ ] @dhruvkb (API familiarity, especially in code that interacts with ES
+ during moderation)
+- [ ] @stacimc (Data refresh and ingestion worker expertise)
+
+## Project links
+
+This implementation exists as a one-off modification to our textual sensitivity
+detection, and does not have a corresponding project proposal.
+[Refer to this GitHub issue requesting the implementation plan for discussion of the motivation for this change](https://github.com/WordPress/openverse/issues/3336).
+
+**Before reading this plan**, please review
+[the original implementation plan for the split-index approach](./20230330-implementation_plan_filtering_and_designating_results_with_sensitive_textual_content.md).
+This document assumes familiarity with the details of the original plan and does
+not review them in detail.
+
+When this plan uses the phrase "the sensitive terms" or "sensitive terms list",
+it refers to
+[the newline-delimited list of sensitive terms described in this related IP](./20230309-implementation_plan_sensitive_terms_list.md).
+**This plan does not mention any specific sensitive terms**; it is not necessary
+to know the specific terms list to review this plan. However, it may be helpful
+to know that the list is nearly 2000 lines long and is UTF-8 encoded.
+
+Finally, this plan assumes that by the time it is relevant to consider it, the
+[project to remove the ingestion server](https://github.com/WordPress/openverse/issues/3925)
+will be complete. Therefore, this plan gives no consideration to the ingestion
+server. It only considers the indexer worker and the new fully
+Airflow-controlled data refresh implemented by the linked project.
+
+## Overview
+
+This plan proposes changing our sensitive text detection strategy to accommodate
+a single index per media type. The proposed method involves searching for
+sensitive textual content in the indexer worker, rather than using the
+Elasticsearch reindex API and Elasticsearch text queries.
+
+The indexer worker will create
+[a new `sensitivity` field for each Elasticsearch document](https://github.com/WordPress/openverse/blob/df57ab3eb6586502995d6fa70cf417352ec68402/ingestion_server/ingestion_server/elasticsearch_models.py#L78-L128).
+The field will be a keyword-list field, to match the `unstable__sensitivity`
+field in the API. The valid keywords
+[will match those used by the API](https://github.com/WordPress/openverse/blob/46a42f7e2c2409d7a8377ce188f4fafb96d5fdec/api/api/constants/sensitivity.py#L1-L12):
+
+- `sensitive_text`: included for a document when the title, tags, or description
+ of the work include any of the sensitive terms.
+- `user_reported_sensitivity`: included for a document when there is a
+  confirmed sensitive content report for the work.
+
+Details of the programming logic for each are covered in the
+[step details section](#step-details).
+
+```{admonition} Rationale for a new index field
+:class: hint
+
+While the filtered index currently exists to represent works free of sensitive
+textual content, we can somewhat simplify the overall handling of sensitive
+works in search by approaching this problem with both modes of sensitivity in
+mind. The current approach to searching non-sensitive works is to search the
+filtered index _and_ exclude works[^must-not] where `mature=true`. By switching
+to a single field populated with the sensitivity designations, we can simplify
+the search for non-sensitive works by filtering for when the `sensitivity` field
+is empty.
+```
+
+[^must-not]:
+ We use the boolean `must_not` with a match query to exclude works.
+ Elasticsearch turns the match query into a Lucene `TermQuery`. Elasticsearch
+ prefers “positive” queries, so this is faster than a `filter` for
+ `mature:false`. We will use `must_not` with an exists query to exclude based
+ on the presence of `sensitivity` entries.
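+
+To make the difference concrete, here is a minimal sketch of the two exclusion
+clauses as Python dictionaries, in the shape the API's query builder would
+produce. This is illustrative only; the field names follow this plan rather
+than any final implementation:
+
+```py
+# Illustrative bool-query fragments only, not the API's actual query builder.
+
+# Current approach: filtered index plus exclusion of confirmed-mature works.
+current_exclusion = {"must_not": [{"term": {"mature": True}}]}
+
+# Proposed approach: single index, excluding any work that has any
+# sensitivity designation at all.
+proposed_exclusion = {"must_not": [{"exists": {"field": "sensitivity"}}]}
+```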
+
+Once the `sensitivity` field is available in the index, we will update the API
+to query based on that rather than switching indices and using the `mature`
+approach. The API will use a feature flag to determine which querying approach
+to use. This makes it easy for us to switch to the old querying approach in case
+of failure, without needing to roll back or revert code, which could impact
+other ongoing projects.
+
+Finally, once we have confidence the new approach is working, we begin the phase
+of cleaning up the multi-index approach. There are two major parts to this:
+removing the filtered index management from the data refresh, and removing the
+API code that utilises it. Neither of these appear to have much complexity. Both
+are described in detail in the [step-by-step section](#step-by-step-plan).
+
+### Evaluating the performance of the new approach
+
+We should make an effort to compare the new approach to the old. There are two
+methods we could use: one seems intuitive but I believe will not work; the
+other I believe will work, and I propose we follow it.
+
+To attempt a simultaneous comparison of the two strategies, we could do the
+following: retain support for both strategies in the API, and direct a
+portion of search requests to the new method. Using a new, temporary response
+header, expose the search method used for the request in the Nginx logs. That
+will enable us to query Logs Insights and determine whether there is a
+meaningful difference in the average, median, p95, and p99 search timings for
+the two methods over the same time period. This is an "ad-hoc" approach to
+compensate
+[for the lack of an existing A/B experimentation framework](https://github.com/WordPress/openverse/issues/421).
+
+However, this presents a problem: it does not compare the final outcome of this
+plan with the current state of things. The motivation for removing the filtered
+index is predicated on the idea that moving to a single index will improve
+Elasticsearch's ability to utilise its query cache and reduce disk operations.
+If we are simultaneously using the single and multi-index approaches, the
+behaviour of the cluster and application will not reflect the reduction in disk
+operations or increase in cache utilisation. That is because the thing causing
+the current behaviour will still be operating. In fact, trying both
+simultaneously would probably make things worse for all queries across the
+board. We would be sending rather different types of queries, still to multiple
+indices, thereby reducing Elasticsearch's opportunity for cache utilisation and
+thus increasing its disk operations.
+
+To actually implement a simultaneous comparison, I hypothesise that we would
+need two identically resourced clusters, one for executing the new type of
+query, and one for the old type. Because the performance question in this plan
+has to do with reducing disk operations and increasing cache utilisation, I
+believe these metrics are distinct from relevance optimisations, which we
+would be able to measure side by side.
+
+While possible in theory, it is too complex!
+
+Instead, I believe we should use a regular feature-flag approach as described
+above, and compare the performance of the new approach in isolation. Any
+significant negative impact will be immediately evident. Provided we don't see
+an immediate issue, we can evaluate the performance of the new querying approach
+for as long as we like (I propose at least two weeks), before we remove the
+filtered index and the API code that uses it.
+
+The original issue makes clear that we can see the improvement that using a
+single index makes without needing to remove the additional indices, because the
+results described by the issue were collected while both indices were available
+on the cluster. That fact makes this a viable approach. It means we can reliably
+see if we've achieved the goal of reduced disk operations and increased cache
+utilisation, without entering a situation where rollback is extremely difficult.
+
+### Initial query performance analysis
+
+Based on the requirement to not negatively affect overall search performance, I
+wanted to make sure there weren't any obvious or glaring problems with using the
+field-exists approach.
+
+Whereas querying the `mature` field for a boolean value is a `match` field
+query, querying the existence of a field uses `exists`. We do not have anything
+directly analogous to test against in our existing data set. However, we do have
+works with no tags. Therefore, I decided to attempt a comparison of queries for
+works with no tags and `mature=false` works.
+
+Compare these two queries.
+
+**The proposed approach: `must_not` with an `exists` query (simulated here with
+`tags.name` standing in for `sensitivity`, which does not exist yet)**
+
+```json
+// POST image/_search
+{
+ "from": 0,
+ "highlight": {
+ "fields": {
+ "description": {},
+ "tags.name": {},
+ "title": {}
+ },
+ "order": "score"
+ },
+ "query": {
+ "bool": {
+ "must": [
+ {
+ "simple_query_string": {
+ "default_operator": "AND",
+ "fields": ["title", "description", "tags.name"],
+ "flags": "AND|NOT|PHRASE|WHITESPACE",
+ "query": "water"
+ }
+ }
+ ],
+ "must_not": [
+ {
+ "exists": {
+ "field": "tags.name"
+ }
+ }
+ ],
+ "should": [
+ {
+ "simple_query_string": {
+ "boost": 10000,
+ "fields": ["title"],
+ "flags": "AND|NOT|PHRASE|WHITESPACE",
+ "query": "water"
+ }
+ },
+ {
+ "rank_feature": {
+ "boost": 10000,
+ "field": "standardized_popularity"
+ }
+ }
+ ]
+ }
+ },
+ "size": 20
+}
+```
+
+**Our current mature-filter query: `must_not` with a `term` query on
+`mature: true`**
+
+```json
+// POST image/_search
+{
+ "from": 0,
+ "highlight": {
+ "fields": {
+ "description": {},
+ "tags.name": {},
+ "title": {}
+ },
+ "order": "score"
+ },
+ "query": {
+ "bool": {
+ "must": [
+ {
+ "simple_query_string": {
+ "default_operator": "AND",
+ "fields": ["title", "description", "tags.name"],
+ "flags": "AND|NOT|PHRASE|WHITESPACE",
+ "query": "water"
+ }
+ }
+ ],
+ "must_not": [
+ {
+ "term": {
+ "mature": true
+ }
+ }
+ ],
+ "should": [
+ {
+ "simple_query_string": {
+ "boost": 10000,
+ "fields": ["title"],
+ "flags": "AND|NOT|PHRASE|WHITESPACE",
+ "query": "water"
+ }
+ },
+ {
+ "rank_feature": {
+ "boost": 10000,
+ "field": "standardized_popularity"
+ }
+ }
+ ]
+ }
+ },
+ "size": 20
+}
+```
+
+I was worried that a boolean query would perform much better. In fact, if you
+remove all other aspects of the query (the actual text-search parts), an exists
+query does indeed appear to be much slower than the boolean match
+query. Crucially, however, if you use a real query from our application as a
+starting point, **there appears to be zero performance difference between the
+two**.
+
+It's hard to give hard numbers, because I didn't write a script to analyse this
+and instead executed the queries and looked at the results by hand. But
+executing both queries against the staging cluster resulted in similar "took"
+results. Regardless of the approach, "took" was usually between 39 and 51
+milliseconds. Even
+when I changed the search term to try to avoid the query cache, both appeared to
+perform identically.
+
+However, there were instances where the tags-exists-based query was faster, by
+around 4 times. That, I believe, is due to the difference in the output of the
+two queries. The tags-exists-based query was naturally spending less time
+fetching the list of tags for each document, as none of the results had tags.
+The mature-boolean-based query, however, had tags in almost every result in the
+first 20 which were requested, and so did have to spend time fetching tags for
+each document.
+
+The differences between the `tags` field and the proposed `sensitivity` field
+make it hard to do an ahead-of-time comparison. However, as I've said above, I
+think this broad (and admittedly naïve!) performance analysis is sufficient to
+allow us to move forward with implementing the list approach to begin with. If
+we encounter a performance issue, we can easily collapse the list into a
+`has_sensitivity` boolean for when we just want to exclude anything with a
+sensitivity designation.
+
+```{note}
+[Please see the note in the linked code above regarding provider
+supplied sensitivity](https://github.com/WordPress/openverse/blob/46a42f7e2c2409d7a8377ce188f4fafb96d5fdec/api/api/constants/sensitivity.py#L4-L7).
+
+This plan makes no explicit consideration for provider supplied sensitivity.
+However, I believe the approach described in this plan increases, or at
+least maintains, our flexibility in the event it becomes relevant (i.e.,
+we start intentionally ingesting works explicitly designated as sensitive
+or mature by the source).
+```
+
+## Expected Outcomes
+
+I've copied these lists of requirements/not-requirements from
+[this comment of mine in the issue for this IP](https://github.com/WordPress/openverse/issues/3336#issuecomment-2288432417).
+
+Requirements:
+
+- **A single index rather than a filtered and unfiltered index**. This is the
+ main requirement.
+- Arbitrary changes to the list of sensitive terms continue to be reflected in
+ search within one data refresh cycle
+  - Keeping it "within one data refresh cycle" means the responsiveness to
+ changes does not go down, but doesn't preclude it somehow becoming faster
+ than one data refresh cycle, if for some reason that's possible.
+- No ambiguity between the existing `mature` field and however we indicate
+ textual sensitivity in the index.
+- No negative change in search performance (i.e., cannot make search worse off
+ after this).
+- Including or excluding results with sensitive terms must not affect ranking.
+
+Not requirements:
+
+- That textual sensitivity designation exists in the catalogue, as opposed to
+  only in Elasticsearch (this is not currently the case, and making it a
+ requirement would, strictly speaking, expand the scope of the project).
+- Excluding a subset of "safe" providers from sensitive text
+ scanning[^exclude-source-subset].
+- Avoiding unnecessary re-evaluation, e.g., by skipping evaluation for works
+  whose metadata has not changed since the last time they were evaluated against
+ the current sensitive terms list[^avoid-unnecessary-reevaluation].
+
+[^exclude-source-subset]:
+ This IP should make this an easy change in the future, if we decide to do
+ it. We would just skip the sensitive text detection if the `source` field
+ was in the list of sources to ignore.
+
+[^avoid-unnecessary-reevaluation]:
+ This IP should also make _this_ an easy change. We could cache the sensitive
+ text scan based on a combined hash of
+ `{title}{description}{"".join(tags)}{sensitive_terms_hash}`. This could be
+ even simpler if we know that `updated_at` on the work is reliable. If it is,
+ then we could just cache based on that and a hash of the terms list, rather
+ than needing to hash the potentially large text blob of title, description,
+ and tags.
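+
+As a purely illustrative sketch of the re-evaluation cache key described in the
+footnotes above (the function name and hash choice are assumptions, not part of
+this plan):
+
+```py
+import hashlib
+
+
+def sensitivity_cache_key(title, description, tags, sensitive_terms_hash):
+    # Combined hash of the evaluated text and the terms-list hash, per the
+    # footnote above; any stable hash function would do.
+    blob = f"{title}{description}{''.join(tags)}{sensitive_terms_hash}"
+    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
+```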
+
+Not being a requirement does not exclude the possibility of implementing these,
+if doing so turns out to work in our favour. I just want to make sure the scope
+of _requirements_ is clear. The last two would be nice-to-have improvements, and
+may or may not be easy to implement; if they do not end up part of this
+project, I will definitely create issues for each of them afterwards.
+
+## Step-by-step plan
+
+These steps must be completed in sequence. Within a step, multiple PRs may be
+worked on simultaneously. Overall, this plan should result in about 5 main pull
+requests.
+
+1. Generate a usable `sensitivity` field for Elasticsearch documents (1 PR)
+1. Update API with new query method (behind feature flag) (2 PRs, one for
+ search, one for moderation tools)
+1. Turn on feature flag in production and observe (1 minor infrastructure PR)
+1. Clean up (2 PRs)
+
+## Step details
+
+### Step 1: Generate usable `sensitivity` field for Elasticsearch documents
+
+Complete the following in a single PR:
+
+- Indexer worker field generation:
+ - Retrieve the sensitive terms list from the GitHub URL (see data refresh for
+ reference) and make the list available to the indexer processes without
+ requiring subsequent outbound requests for the list.
+ - Create a new Python module in the indexer worker, `sensitive_terms.py`. It
+    will export the following (a sketch of this module follows this list):
+ - A constant `SENSITIVE_TERMS_LOC` set to
+ `Path(__file__).parent / "sensitive_terms.txt"`. Indexer workers are
+ ephemeral, so it is safe to write to disk without risking accidental
+ re-use across runs.
+ - A function `retrieve_sensitive_terms` which retrieves the sensitive
+ terms list from the network location and saves it to disk at
+ `SENSITIVE_TERMS_LOC`. This function should check an environment
+ variable for the network location of the sensitive terms list. If that
+ variable is undefined, it should simply write the mock sensitive terms
+ ("water", "running", and "bird") to the disk location as a
+ newline-delimited list.
+ - A function `get_sensitive_terms`, which will be wrapped in
+ `@functools.cache`. The function will read `SENSITIVE_TERMS_LOC` and
+ split the result on newline and return a tuple of compiled sensitive
+ term regexes. The regexes will target sensitive terms that are
+ surrounded by word boundaries.
+ [Refer to this demonstration on Regexer using the mock term "water" and the expression `\bwater\b`](https://regexr.com/85gtv).
+ - Using a regex is unfortunately necessary to ensure we do not have
+ false-positives on non-sensitive terms in sub-portions of words, as
+ with location names like
+ [Shitterton in Dorset County, UK](https://en.wikipedia.org/wiki/Shitterton).
+ This is a minimum consideration we should give in light of our
+ acknowledgement that text-based matching
+ [is not perfect](./20230309-implementation_plan_sensitive_terms_list.md#this-will-not-be-perfect).
+ - The use of compiled regexes makes caching the result especially
+ important so that regex compilation is not unnecessarily repeated
+ within a single indexer worker subprocess.
+ - We should start with a single regex per term. In the future, we can
+ consider combining slices of the sensitive terms list into larger
+ regex statements, and evaluate whether this performs better or worse
+ for our corpus. For now, we will keep it as simple as possible, with a
+ single regex per term. It does not change the downstream code to do
+ so, and introduces fewer complications with respect to regex
+ construction.
+ - The indexer worker will call `retrieve_sensitive_terms` on service
+    startup. Add this to `api.create_api`. This writes the terms file to disk.
+ - Because each indexer worker spawns multiple multiprocessing processes,
+    each of which has an independent Python interpreter, we rely on
+    `functools.cache` to ensure each subprocess only reads and compiles the
+    terms in `get_sensitive_terms` once, regardless of the number of documents it
+ processes. This also ensures the list is available to `get_instance_attrs`
+ and related functions without needing to pass it around as a new argument
+ from somewhere "below" `launch_reindex` in the call stack for document
+ processing.
+ - The primary logic will live as an extension to the indexer worker's static
+ `elasticsearch_models.Model::get_instance_attrs`.
+ - Add a new static method
+ `elasticsearch_models.Model::get_text_sensitivity`.
+ - This method will use `sensitive_terms.get_sensitive_terms` to retrieve
+ the sensitive terms tuple.
+    - Iterate through the sensitive term regexes and test the title,
+      description, and each tag against each regex. If a term matches any
+      field, immediately return true. We only need to test terms until we find
+      one matching term, so it is safe to break the loop early once we find
+      one.
+ - Update `get_instance_attrs` to create a new `sensitivity` list field. If
+ the call to `Model::get_text_sensitivity` is true, include
+ `sensitive_text` in the list. Additionally, if `row[schema["mature"]]` is
+ true, add `user_reported_sensitivity` to the list.
+- Update the index mappings to add the new `sensitivity` field set to
+ `{"type": "keyword"}`.
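+
+Below is a minimal sketch of `sensitive_terms.py` and the related model logic
+described in the list above. The environment variable name
+(`SENSITIVE_TERMS_URL`), the case-insensitive matching, and the exact function
+signatures are illustrative assumptions, not final interfaces:
+
+```py
+# sensitive_terms.py (hypothetical sketch)
+import functools
+import os
+import re
+from pathlib import Path
+
+import requests
+
+SENSITIVE_TERMS_LOC = Path(__file__).parent / "sensitive_terms.txt"
+
+# Mock terms written when no network location is configured (e.g., locally).
+MOCK_SENSITIVE_TERMS = ("water", "running", "bird")
+
+
+def retrieve_sensitive_terms():
+    """Fetch the terms list at service startup and persist it to disk."""
+    terms_url = os.environ.get("SENSITIVE_TERMS_URL")  # assumed variable name
+    if not terms_url:
+        SENSITIVE_TERMS_LOC.write_text("\n".join(MOCK_SENSITIVE_TERMS))
+        return
+    response = requests.get(terms_url)
+    response.raise_for_status()
+    SENSITIVE_TERMS_LOC.write_text(response.text)
+
+
+@functools.cache
+def get_sensitive_terms() -> tuple[re.Pattern, ...]:
+    """Compile one word-boundary regex per term, once per subprocess."""
+    terms = SENSITIVE_TERMS_LOC.read_text().splitlines()
+    return tuple(
+        re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
+        for term in terms
+        if term.strip()
+    )
+
+
+# The following would live on elasticsearch_models.Model; tag_names stands in
+# for the list of tag name strings extracted from the row's tags.
+def get_text_sensitivity(title, description, tag_names) -> bool:
+    fields = [title or "", description or "", *(tag_names or [])]
+    for regex in get_sensitive_terms():
+        # One matching term is sufficient, so return as soon as one is found.
+        if any(regex.search(field) for field in fields):
+            return True
+    return False
+
+
+def get_sensitivity(row, schema, title, description, tag_names) -> list[str]:
+    """Build the keyword list stored in the new ``sensitivity`` field."""
+    sensitivity = []
+    if get_text_sensitivity(title, description, tag_names):
+        sensitivity.append("sensitive_text")
+    if row[schema["mature"]]:
+        sensitivity.append("user_reported_sensitivity")
+    return sensitivity
+```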
+
+After merging the PR, run the staging data refresh and confirm the new field
+appears in new indices and is queryable (manually run queries against it). It is
+not necessary to manually run the production data refresh, and it is highly
+unlikely we will complete, review, and merge the API work before a production
+data refresh runs.
+
+### Step 2: Update the API to use the new field (behind a feature flag)
+
+Complete the following in two PRs.
+
+- First PR: search
+ - Add a new feature-flag environment variable `USE_INDEX_BASED_SENSITIVITY`
+ that defaults to false.
+ - Update search for the new query approach:
+ - Modify `search_controller.get_index` to check the feature flag and return
+ the unfiltered index when it is true. Otherwise, it should follow its
+ existing logic.
+ - Update `search_controller.build_search_query` to also check the feature
+ flag. When it is false, use the existing `mature` based strategy for
+ handling `include_sensitive_results`. When it is true and
+ `include_sensitive_results` is false, add a new `must_not` entry with an
+      exists query on the `sensitivity` field (see the sketch after this
+      list).
+ - Finally, update `search_controller.query_media` to check the feature flag,
+ and when true, bypass the `SearchContext.build` call and create
+ `SearchContext` directly, as described below.
+ - Retaining the `result_ids` list, create the `SearchContext` with
+ `SearchContext(result_ids, {result.identifier for result in results if result.sensitivity})`.
+ - While this change to a single index and building the sensitivity list
+ upstream of the API allows us to eventually remove `SearchContext`, we
+ will retain it for now and mock its usage. This allows us to avoid
+ needing to add the same feature flag checks in `MediaView` and
+ `MediaSerializer`. Instead, we can leave those alone for now, and the
+ same `sensitivity` list will continue to be populated by the serializer.
+ Once we confirm this feature and enter the clean-up phase, we will
+ remove `SearchContext` altogether. Details for this are in the
+ [clean-up section](#step-4-clean-up).
+- Second PR: moderation
+  - Update `AbstractSensitiveMedia._bulk_update_es` to pass a script update
+    that appends `user_reported_sensitivity` to the `sensitivity` list, or
+    creates the list if needed.
+ - Refer to
+ [the Elasticsearch documentation on updates for examples of scripted updates](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#update-api-example).
+ - [Refer to the following section for an example script to work from](#bulk-update-painless-script).
+  - This should happen in addition to the `mature` update that already exists,
+    but **must not be conditional on the `USE_INDEX_BASED_SENSITIVITY` feature
+    flag**. It must be unconditional so that the `sensitivity` list stays
+    accurate whichever way the flag is set. If we toggle the flag on/off/on
+    before a data refresh runs, the `sensitivity` list must still get updated
+    with user-reported sensitivity. Otherwise, when the feature goes back on,
+    the `sensitivity` list would not reflect the fact that those works had been
+    reviewed and confirmed as sensitive. In that case, they would start
+    reappearing in default search results (non-sensitive search).
+  - Update `MediaListAdmin.has_sensitive_text` to check the `sensitivity` field
+    of the Elasticsearch document if `USE_INDEX_BASED_SENSITIVITY` is true (see
+    the sketch after this list).
+    - The method currently uses a term query to get the result with the
+      identifier.
+      [We should refactor this to use a single document `get` query by `id`](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/examples.html#ex-get).
+      Elasticsearch document `id` matches the Django model's `pk`. Doing so
+      should improve retrieval speed and reduce the result to a single
+      document, which makes reading the `sensitivity` field much easier.
+ - When `USE_INDEX_BASED_SENSITIVITY` is true, query the "full" index instead
+ of the filtered index.
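+
+As referenced above, here are rough sketches of both changes. How the feature
+flag is exposed (shown here as a Django setting) and the exact function
+boundaries are assumptions for illustration:
+
+```py
+from django.conf import settings
+from elasticsearch_dsl import Search
+
+
+def exclude_sensitive(search: Search) -> Search:
+    """Applied only when include_sensitive_results is false."""
+    if settings.USE_INDEX_BASED_SENSITIVITY:
+        # Single-index approach: drop anything with a sensitivity designation.
+        return search.exclude("exists", field="sensitivity")
+    # Existing approach: filtered index plus mature-flag exclusion.
+    return search.exclude("term", mature=True)
+
+
+def has_sensitive_text(es, origin_index: str, pk: int) -> bool:
+    """Sketch of the refactored MediaListAdmin.has_sensitive_text check."""
+    # A single-document get by id replaces the current term query; the
+    # Elasticsearch document id matches the Django model's pk.
+    document = es.get(index=origin_index, id=pk)
+    return "sensitive_text" in document["_source"].get("sensitivity", [])
+```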
+
+#### Bulk update Painless script
+
+The following script works to safely add a designation to the sensitivity list:
+
+```json
+{
+ "script": {
+ "source": "if (Objects.isNull(ctx._source.sensitivity)) { ctx._source.sensitivity = new String[] {params.designation} } else if (!ctx._source.sensitivity.contains(params.designation)) { ctx._source.sensitivity.add(params.designation) }",
+ "lang": "painless",
+ "params": {
+ "designation": "user_reported_sensitivity"
+ }
+ }
+}
+```
+
+We will store the script as a Python multi-line string to aid with readability
+and maintenance. The Elasticsearch REST API will accept this string as-is, and
+we do not need to modify it to remove newlines. Add the following to
+`constants.sensitivity`:
+
+```py
+ADD_SENSITIVITY_DESIGNATION_PAINLESS = """
+if (Objects.isNull(ctx._source.sensitivity)) {
+ ctx._source.sensitivity = new String[] {params.designation}
+} else if (!ctx._source.sensitivity.contains(params.designation)) {
+ ctx._source.sensitivity.add(params.designation)
+}
+"""
+
+def add_sensitivity_designation_script_query(designation):
+ return {
+ "script": {
+ "source": ADD_SENSITIVITY_DESIGNATION_PAINLESS,
+ "lang": "painless",
+ "params": {
+ "designation": designation,
+ },
+ },
+ }
+```
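+
+For illustration, a bulk update could reuse that helper roughly as follows.
+The function name and surrounding plumbing are assumptions; the real
+`_bulk_update_es` implementation may be shaped differently:
+
+```py
+from elasticsearch import helpers
+
+
+def bulk_add_sensitivity_designation(es, index, document_ids, designation):
+    actions = (
+        {
+            "_op_type": "update",
+            "_index": index,
+            "_id": document_id,
+            # Unpacks to the {"script": {...}} update body defined above.
+            **add_sensitivity_designation_script_query(designation),
+        }
+        for document_id in document_ids
+    )
+    helpers.bulk(es, actions)
+```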
+
+The API unit tests run against the real Elasticsearch cluster, so the result of
+running the script as part of the sensitive-media document update should be
+"painless"[^😛] to test.
+
+[^😛]: 😛
+
+### Step 3: Turn on feature flag in live environments and observe
+
+Set `USE_INDEX_BASED_SENSITIVITY` to `True` in staging and redeploy. Confirm
+search against the new indices created with the staging data refresh works,
+including sensitive and non-sensitive queries. Also test and verify the
+moderation features by confirming a sensitivity report in staging and making
+sure the work is updated with `user_reported_sensitivity`.
+
+If the production data refresh has not yet run, wait for it to run on its
+regular cadence, and confirm the new `sensitivity` field as we did in staging
+in step 1.
+
+After confirming the production indices, set `USE_INDEX_BASED_SENSITIVITY` to
+`True` in production and redeploy. Monitor production API search times as well
+as resource usage. Monitor the production Elasticsearch cluster health in
+Kibana, and check I/O operations for the data nodes. You may also check overall
+disk operations for the entire cluster in CloudWatch.
+
+### Step 4: Clean up
+
+1. Data refresh
+
+   - This essentially involves undoing
+     [the filtered-index creation aspects added to the new data refresh in this PR](https://github.com/WordPress/openverse/pull/4833).
+   - Note that that PR includes fixes for other parts of the new data refresh,
+     and so cannot merely be reverted to accomplish this task. Luckily, it
+     still shouldn't be too much trouble.
+   - Delete `create_and_populate_filtered_index` and its usage.
+   - Remove filtered index promotion (task group ID `promote_filtered_index`).
+   - Update the DAG flow accordingly.
+   - This should ideally be implemented in a single PR to make it as easy as
+     possible to revert, should the need arise.
+   - We can, at this point, also delete the `mature` field from the
+     Elasticsearch mappings and remove the `raw` versions of `title`,
+     `description`, and `tags.name`.
+
+1. API code
+
+   - Remove `search_controller.get_index` and its usage.
+   - Remove mature content exclusion based on the `mature` boolean from
+     `build_search_query`.
+   - Update `MediaView::get_db_results` to merge `sensitivity` from the
+     Elasticsearch `Hit` onto the Django model for each result.
+   - Update `MediaSerializer::unstable__sensitivity` to merely return
+     `sensitivity` from the object (see the sketch after this list):
+     - Change the field to a `serializers.ListField` and delete
+       `get_unstable__sensitivity`.
+   - Remove the feature flag.
+   - As with the data refresh changes, implement these in a single PR if
+     possible, to ensure the easiest possible revert if needed.
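+
+As noted in the list above, a minimal sketch of the view and serializer
+changes; class and attribute names follow this plan's descriptions rather than
+the final code:
+
+```py
+from rest_framework import serializers
+
+
+def merge_sensitivity(results, hits_by_identifier):
+    """Sketch of merging ``sensitivity`` from each ES Hit onto the model."""
+    for result in results:
+        hit = hits_by_identifier[result.identifier]
+        result.sensitivity = list(getattr(hit, "sensitivity", []) or [])
+    return results
+
+
+class MediaSerializer(serializers.Serializer):
+    # The list is now populated upstream and merged onto the model, so the
+    # SerializerMethodField and get_unstable__sensitivity are no longer needed.
+    unstable__sensitivity = serializers.ListField(
+        source="sensitivity",
+        child=serializers.CharField(),
+        read_only=True,
+    )
+```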
+
+## Dependencies
+
+### Feature flags
+
+One new API environment variable will act as the feature flag:
+`USE_INDEX_BASED_SENSITIVITY`
+
+### Infrastructure
+
+There are no infrastructure changes required for this, aside from updating
+environment variables.
+
+The intentional increase in indexer worker resource usage should not increase
+costs directly. However, the complexity of sensitive text designation exists no
+matter where it goes in our compute pipeline. In the indexer workers, it will
+naturally extend the time each worker spends on any given document. As such, the
+overall amount of time the indexer workers are running will increase, and so
+will their cost to us. Elsewhere, for example, in the Airflow local executor, we
+would pay for it in decreased resources available for other tasks, extending the
+time of those tasks, and potentially decreasing the scheduler's stability (if
+resource usage increases a lot).
+
+### Tools & packages
+
+We do not require any new tools or packages to implement this change.
+
+### Other projects or work
+
+As mentioned in [the project links](#project-links), this plan assumes
+[the removal of the ingestion server](https://github.com/WordPress/openverse/issues/3925)
+is finished by the time this plan makes modifications to the data refresh and
+indexer worker.
+
+## Alternatives
+
+### Elasticsearch ingest pipeline
+
+I spent a few days experimenting with and trying to see whether an ingest
+pipeline would be a viable way to approach the creation of the sensitivity list
+for this project. I believe it is generally viable, and I was able to make it
+work locally with our local sample data and the full sensitive terms list. I
+wrote about my evaluation of the ingest pipeline approach in these three
+comments on the implementation plan issue:
+
+1. [Describing the approach in general and initial testing](https://github.com/WordPress/openverse/issues/3336#issuecomment-2316894501)
+1. [Further local testing with the complete sensitive terms list](https://github.com/WordPress/openverse/issues/3336#issuecomment-2325659270)
+1. [Testing the ingest pipeline in staging](https://github.com/WordPress/openverse/issues/3336#issuecomment-2325684604)
+
+Ultimately, despite ingest pipelines working fine to create the sensitivity
+designations list, I don't think it's the right approach for us. As noted in the
+final comment, whilst dedicated ingest nodes are not required to work with
+ingest pipelines, a cluster still needs at least one node with the ingest role
+to run ingest pipelines (it just isn't required to be those nodes' only role).
+In order to service that requirement, we would need to introduce new
+complexity into our Elasticsearch cluster. We do not have the capacity (time and
+knowledge) to fulfil this requirement. Therefore, we should not use
+Elasticsearch ingest pipelines to generate the sensitivity designation.
+
+### Move sensitive text detection earlier in the data refresh pipeline
+
+Rather than detect sensitive text in the indexer worker when creating the
+Elasticsearch documents, we could attempt to insert the sensitivity list into
+the API database when we create the API tables from the catalogue tables. I
+chose not to recommend this because running nearly two thousand regular
+expressions in a loop will be resource intensive. We currently use the local
+executor for Airflow, and such intense processing would likely be problematic
+for our Airflow instance as it tries to complete other tasks as well. Placing
+the regex execution in the indexer worker allows us to take advantage of the
+indexer workers' provisioned CPU resources (which were under-utilised overall
+at only 25% when last evaluated in March of this year, as discussed in the
+[ingestion server removal IP](/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md#approach-to-the-distributed-reindex)).
+
+Ideally, the list of specific sensitivity designations would exist only in the
+API database and Elasticsearch would only be required to track a boolean
+indicating whether any sensitivity designation exists at all. This would
+simplify the following aspects of the plan:
+
+- We would not need to use a script update to add to the sensitivity designation
+ list in the Elasticsearch document. We would be able to use a simple `doc`
+ update to set a `sensitive` boolean as needed. We could even skip the update
+ for documents when moderating user reports if the document already has
+ `sensitive=true` due to sensitive text.
+- We would not need to query Elasticsearch in the media report admin to
+ determine whether a work has sensitive text, as it would already be on the
+ work.
+- Search could use a simple boolean query against the `sensitive` field instead of
+ the slightly more complex negated exists against the `sensitivity` keyword
+ list field.
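+
+For contrast with the Painless script above, here is a sketch of the simpler
+update this alternative would have allowed (not adopted by this plan):
+
+```py
+# A plain partial-document update would replace the scripted list update.
+es.update(index="image", id=pk, doc={"sensitive": True})
+```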
+
+I tried to determine whether there was a method of adding the `sensitivity` list
+to the API database from the indexer worker, so that we could benefit from its
+significant compute and memory resources without interrupting other Airflow
+work, while still gaining these benefits. I did not see any existing mechanism
+for accomplishing this kind of Postgres update in the indexer workers, but I'm
+sure [the reviewers](#reviewers) are well equipped to correct me on this.
+Crucially, even if the capability does not currently exist, if we can add it
+without significantly increasing this project's complexity, it would be worth
+considering.
+
+Additionally, if we were able to use remote compute resources to execute
+individual tasks without the complexity that comes with using remote operators
+(i.e., if we were using a remote executor), we could also consider doing this in
+the data refresh tasks that build the new API tables. The road to that solution
+is much more complex than this, though. It also remains an available option
+should our Airflow infrastructure gain this capability in the future (for
+example, if we were able to use the Kubernetes executor). I'm resistant to using
+a remote operator for this single task, rather than rolling it into our existing
+pseudo-remote operator of the indexer worker, because of the cost changes we
+would incur by doing so.
+
+## Design
+
+There are no interface changes for this plan.
+
+## Blockers
+
+See [the section on other projects](#other-projects-or-work) regarding this
+plan's relationship to the removal of the ingestion server.
+
+## API version changes
+
+No API version changes.
+
+## Accessibility
+
+None.
+
+## Rollback
+
+Before we get to the clean-up step, everything should either have no effect on
+existing functionality (new Elasticsearch document field) or can be toggled by
+the [feature flag](#feature-flags).
+
+Once we get to the clean-up step, we will need to revert code or perform full
+rollbacks if required. The plan should be safe enough that we won't start
+the clean-up step without having verified that the new approach works. Clean-up
+will (mostly) be removing the old unused code and simplifying some downstream
+effects of the old approach based on the new approach.
+
+## Privacy
+
+There are no changes to privacy.
+
+## Localisation
+
+We do not require any localisation for this project.
+
+## Risks
+
+The primary risk is the indexer workers' performance when executing the
+sensitive term regexes. Following that, there is a risk that the `exists` query
+will degrade search performance.
+
+## Prior art
+
+[I did a bit of exploring of different approaches before being assigned this implementation plan, and wrote about them in this GitHub comment](https://github.com/WordPress/openverse/issues/3336#issuecomment-2288432417).
+
+All the other implementation plans and project proposals in the
+[trust and safety category](../index.md) are significant in considering how to
+approach this and set precedents that need to be followed or considered for a
+comprehensive plan.