
IP: Undo split indices for sensitive text detection #4904

Merged
merged 6 commits into main from add/undo-split-filtered-index on Oct 17, 2024

Conversation

sarayourfriend
Collaborator

@sarayourfriend sarayourfriend commented Sep 10, 2024

Fixes

Part of #3336 by @AetherUnbound

Description

This discussion is following the Openverse decision-making process. Information about this process can be found on the Openverse documentation site. Requested reviewers or participants will be following this process. If you are being asked to give input on a specific detail, you do not need to familiarise yourself with the process and follow it.

Current round

This discussion is currently in the Decision round.

The deadline for review of this round is 2024-09-25.

Checklist

  • My pull request has a descriptive title (not a vague title like `Update index.md`).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • [N/A] I added or updated tests for the changes I made (if applicable).
  • [N/A] I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • [N/A] I ran the DAG documentation generator (ov just catalog/generate-docs for catalog
    PRs) or the media properties generator (ov just catalog/generate-docs media-props
    for the catalog or ov just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@sarayourfriend sarayourfriend added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 📄 aspect: text Concerns the textual material in the repository 🧱 stack: documentation Related to Sphinx documentation 🧭 project: implementation plan An implementation plan for a project labels Sep 10, 2024
@sarayourfriend sarayourfriend requested a review from a team as a code owner September 10, 2024 06:07
@sarayourfriend sarayourfriend requested review from krysal, stacimc and dhruvkb and removed request for a team and krysal September 10, 2024 06:07

Full-stack documentation: https://docs.openverse.org/_preview/4904

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.


Member

@dhruvkb dhruvkb left a comment


The plan looks good to me. The steps are logical, the changes to the API look correct and the approximate analysis of the performance impact also makes sense.

Member

@zackkrida zackkrida left a comment


@sarayourfriend this looks excellent but I'd like to suggest one addition: Could you define specific prerequisites for the "cleanup" steps? My thinking is that in the past on some projects, Nuxt 3 being a recent example, we have jumped into cleanup work somewhat hastily and perhaps without sufficient assurance that our changes were stable.

@sarayourfriend
Collaborator Author

> Could you define specific prerequisites for the "cleanup" steps?

Sure thing, good call out. When I get to revision (after Staci reviews for clarification round), I'll add something like the following:

Clean-up should occur only after 2 weeks of running the new approach in production, including two full production data refreshes. This is to ensure we sufficiently exercise the new approach both during the data refresh and at query time before starting to take actions that will make rolling back much more cumbersome.

Does that sound alright?

Contributor

@stacimc stacimc left a comment


This looks great to me, @sarayourfriend -- I had a question about the indexer worker in the local dev environment, but that should be easily handled. I'm curious about your thoughts on the ingestion approach, but I think this approach will work well and I see the tradeoffs.

@sarayourfriend
Collaborator Author

I've also been thinking about this IP for the last week and regretting my recommendation of the sensitivity list. I think instead an object of boolean properties like sensitivity: { text: boolean, user_reported: boolean } would be better. It could also have an any: boolean field as a denormalised roll-up of all the booleans in the object, which we could query against for the simpler non-sensitive queries that make up the predominant share of our traffic. Regardless of the version we go with, I want to change the IP to use this approach. It has the following advantages:

  1. It does not require using a Painless script to update the document in the index (way simpler, less fiddly, easier to test and maintain).
  2. It produces an identical type of must_not term query on the boolean property/ies, matching our current query. Therefore, we can be confident it will have identical performance characteristics to our current query.
  3. It is potentially more flexible long term, as an object can much more naturally grow in properties to include things like which fields had sensitive text, which sensitive terms were detected, and so forth. Things that may be valuable or even essential information to improving how our sensitive text detection works in the future.

The last advantage particularly applies in the context of a catalogue-based approach like the one Staci asked about in this comment.
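
For illustration, here is a minimal sketch of the proposed `sensitivity` object as it might appear in the Elasticsearch mapping and in an indexed document. The field names follow the comment above; the exact mapping is defined in the implementation plan itself, so treat this as an assumption-laden example rather than the final schema.

```python
# Illustrative sketch only: field names follow the proposal above
# (sensitivity.text, sensitivity.user_reported, sensitivity.any); the
# authoritative mapping lives in the implementation plan.

# Mapping fragment for the proposed object of boolean sub-fields.
SENSITIVITY_MAPPING = {
    "properties": {
        "sensitivity": {
            "properties": {
                "text": {"type": "boolean"},           # sensitive text detected
                "user_reported": {"type": "boolean"},  # confirmed user report
                "any": {"type": "boolean"},            # denormalised: true if any flag is true
            }
        }
    }
}

# What a document would carry at index time under this mapping.
example_document = {
    "identifier": "…",  # work identifier
    "sensitivity": {"text": True, "user_reported": False, "any": True},
}
```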

@sarayourfriend
Collaborator Author

@zackkrida I've added details for a cool-off period in the plan in this commit: 13c2beb

@stacimc I've added details about the discussion we had yesterday re: moving the check into Airflow in this commit: 27b3781

That second commit also includes the update to use a sensitivity object with boolean properties and a denormalised any.

I am waiting on @stacimc to clarify one last question I sent in Slack regarding how the indexer workers are used, and then I will be able to make a small change (it will be small either way) to address the clarification Staci mentioned regarding the ephemerality of the indexer workers.

@sarayourfriend
Collaborator Author

@dhruvkb and @stacimc this is ready for y'all to take another look and make a decision or raise blockers. @dhruvkb I know you left an approval before, but I just wanted to wait until the decision round to lock it in, so feel free to change your mind, of course! 🙂

Contributor

@stacimc stacimc left a comment


Fantastic! I love the addition of the denormalized any as well 👍

Thanks for indulging my questions about the approach -- and for so clearly documenting what ended up being a very complex discussion! 😄 Especially as we talk about the many different places data transformation is happening in our pipelines, it's so nice to really nail down our priorities and consider the options thoroughly. I feel very confident with the approach you've outlined here, cheers!

Co-authored-by: Staci Mullins <[email protected]>
@sarayourfriend
Collaborator Author

This is past the deadline, and I've pinged in Slack with no response, so I'm going to merge based on Dhruv's previous PR review with the approval.

Thanks for the reviews, y'all.

Member

@krysal krysal left a comment


I finally got to reply to this. Fantastic write-up, @sarayourfriend! Everything seems evaluated and well explained. Compared to the alternatives, the selected approach sounds wonderfully simple (not necessarily easy). I'm eager to try it, but I want to refrain from interfering since you have expressed the intention of continuing with it.

I left minor comments that don't block anything. Would you like me to merge it as is? Do you have the list of issues for a milestone?

Comment on lines +61 to +63
> Additionally, a denormalised field `sensitivity.any` will be added to simplify
> our current most-common query case, where we query for works that have no known
> sensitivity designations.
Member


Well thought out, this simplifies the case a lot!
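
To make the simplification concrete, here is a sketch of the filter shapes being compared, assuming the `sensitivity.*` boolean fields discussed earlier; the query bodies are illustrative, not copied from the plan.

```python
# Assumed query shapes, for illustration only.

# Without the denormalised field, each sensitivity flag must be excluded
# individually in the common "no known sensitivity designations" case.
query_without_any = {
    "bool": {
        "must_not": [
            {"term": {"sensitivity.text": True}},
            {"term": {"sensitivity.user_reported": True}},
        ]
    }
}

# With `sensitivity.any`, a single must_not term covers that case.
query_with_any = {
    "bool": {
        "must_not": [
            {"term": {"sensitivity.any": True}},
        ]
    }
}
```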

Comment on lines +302 to +304
> `Path(__file__).parent / f"sensitive_terms-{target_index}.txt"`. This
> function should check an environment variable for the network location of
> the sensitive terms list. If that variable is undefined, it should simply
Member


Is this environment variable SENSITIVE_TERMS_LOC? It's mentioned below, but where it comes from needs to be clarified.
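
For reference, a minimal sketch of the behaviour the quoted lines describe, assuming the variable is indeed named `SENSITIVE_TERMS_LOC` (as asked above) and that the undefined-variable fallback is the bundled file path; the helper name and the use of `requests` are illustrative, not the plan's actual implementation.

```python
import os
from pathlib import Path

import requests  # assumed available; illustrative only


def get_sensitive_terms(target_index: str) -> list[str]:
    """Load the sensitive terms list for ``target_index``.

    Checks an environment variable (assumed here to be SENSITIVE_TERMS_LOC)
    for the network location of the list; if it is undefined, falls back to
    a file shipped alongside this module, mirroring the path quoted above.
    """
    location = os.environ.get("SENSITIVE_TERMS_LOC")
    if location:
        response = requests.get(location, timeout=10)
        response.raise_for_status()
        raw = response.text
    else:
        raw = (Path(__file__).parent / f"sensitive_terms-{target_index}.txt").read_text()
    return [line.strip() for line in raw.splitlines() if line.strip()]
```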

Comment on lines +73 to +79
> [^provider-supplied-sensitivity]:
>     [Please see the note in the linked code above regarding provider supplied sensitivity](https://github.com/WordPress/openverse/blob/46a42f7e2c2409d7a8377ce188f4fafb96d5fdec/api/api/constants/sensitivity.py#L4-L7).
>     This plan makes no explicit consideration for provider supplied sensitivity.
>     However, I believe the approach described in this plan increases, or at
>     least maintains, our flexibility in the event it becomes relevant (i.e., we
>     start intentionally ingesting works explicitly designated as sensitive or
>     mature by the source).
Member


I wondered about this when reading the rendered version in the preview link. I wrote a comment, but I removed it after getting to the footnotes, which answered my question :) I think it's worth leaving this part as a paragraph rather than a footnote.

@sarayourfriend
Collaborator Author

@krysal please feel free to make any edits you'd like directly to the IP and merge it as you'd prefer. I can assist in the work if you'd like, otherwise, please go ahead and implement it however you see fit 👍

@krysal krysal merged commit cd0ee7c into main Oct 17, 2024
44 checks passed
@krysal krysal deleted the add/undo-split-filtered-index branch October 17, 2024 13:59