problem of wasting resources, both time (which continues to increase) and the
machines (CPU) it uses, which could easily be avoided by making the changes
permanent and saving them in the upstream database.

[img-data-refresh]: ./../../../catalog/reference/DAGs.md#image_data_refresh

## Expected Outcomes


## Step-by-step plan

The cleaning functions that the Ingestion Server applies (see the
[cleanup][ing_server_cleanup] file) are already implemented in the Catalog in
the `MediaStore` class: see its [`_tag_blacklisted` method][tag_blacklisted]
(which probably should be renamed) and the [url utilities][url_utils] file. The
only part that isn't there and can't be ported is the filtering of
low-confidence tags, since provider scripts don't save an "accuracy" per tag.
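
To make the overlap concrete, here is a minimal sketch of the kind of tag
filtering involved; the denylist entries, threshold value, and function name
are illustrative assumptions, not the actual catalog or Ingestion Server code.

```python
from typing import Any

TAG_DENYLIST = {"cc0", "flickr"}  # hypothetical denylist entries
MINIMUM_ACCURACY = 0.9  # hypothetical confidence threshold


def clean_tags(tags: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop denylisted and, when an accuracy is present, low-confidence tags."""
    cleaned = []
    for tag in tags:
        name = (tag.get("name") or "").strip().lower()
        if not name or name in TAG_DENYLIST:
            continue  # denylisted: the part already portable to `MediaStore`
        accuracy = tag.get("accuracy")
        if accuracy is not None and accuracy < MINIMUM_ACCURACY:
            continue  # low confidence: the part that can't be ported as-is
        cleaned.append(tag)
    return cleaned
```
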
With this, the plan starts in the Ingestion Server with the following steps:

1. [Save TSV files of cleaned data to AWS S3](#save-tsv-files-of-cleaned-data-to-aws-s3)
1. [Make and run a batched update DAG for one-time cleanup](#make-and-run-a-batched-update-dag-for-one-time-cleanup)
1. [Run an image Data Refresh to confirm cleaning time is reduced](#run-an-image-data-refresh-to-confirm-cleaning-time-is-reduced)

[ing_server_cleanup]:
https://github.com/WordPress/openverse/blob/f8971fdbea36fe0eaf5b7d022b56e4edfc03bebd/ingestion_server/ingestion_server/cleanup.py#L79-L168
[tag_blacklisted]:
https://github.com/WordPress/openverse/blob/f8971fdbea36fe0eaf5b7d022b56e4edfc03bebd/catalog/dags/common/storage/media.py#L245-L259
[url_utils]:
https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/catalog/dags/common/urls.py

## Step details

### Save TSV files of cleaned data to AWS S3

In a previous exploration, the Ingestion Server was set to [store TSV files of
the cleaned data][pr-saving-tsv] in the form of `<identifier> <cleaned_field>`,
which can later be used to perform the updates efficiently in the catalog DB,
where only the `identifier` field is indexed. These files are saved to the disk
of the Ingestion Server EC2 instances. This worked fine for files with URL
corrections, since these fields are relatively short, but became a problem when
trying to save tags: the file grew too large and filled up the disk, causing
problems for the data refresh execution.
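
For reference, a minimal sketch of producing a TSV in that
`<identifier> <cleaned_field>` shape (the file name and sample values are
hypothetical):

```python
import csv

# Hypothetical (identifier, cleaned_field) pairs produced by the cleanup step.
cleaned_rows = [
    ("00000000-0000-0000-0000-000000000001", "https://example.com/fixed-1"),
    ("00000000-0000-0000-0000-000000000002", "https://example.com/fixed-2"),
]

# One "<identifier>\t<cleaned_field>" line per record, as described above.
with open("url.tsv", "w", newline="") as tsv_file:
    csv.writer(tsv_file, delimiter="\t").writerows(cleaned_rows)
```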

[aws_mpu]:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
shows the number of records cleaned by field for the last runs at the moment of
writing this IP, except for tags, for which we don't have accurate counts since
file saving was disabled.

The alternative is to upload TSV files to the Amazon Simple Storage Service
(S3), creating a new bucket or using `openverse-catalog` with a subfolder. The
benefit of using S3 buckets is that they have streaming capabilities and will
allow us to read the files in chunks later if necessary for performance. The
downside is that objects in S3 don't allow appending, so it may require
uploading files with different part numbers, or evaluating whether the
[multipart upload process][aws_mpu] will serve us here.
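
If the multipart route turns out to be needed, a rough `boto3` sketch could
look like the following. The bucket, key, and chunk size are assumptions, not
decisions, and S3 requires every part except the last to be at least 5 MiB.

```python
import boto3

BUCKET = "openverse-catalog"  # or a new dedicated bucket
KEY = "cleaned-data/tags.tsv"  # hypothetical subfolder and key

s3 = boto3.client("s3")
mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)

# Upload the file in numbered parts, collecting the ETags S3 returns,
# which are required to complete the multipart upload at the end.
parts = []
with open("tags.tsv", "rb") as f:
    part_number = 1
    while chunk := f.read(8 * 1024 * 1024):  # 8 MiB parts
        response = s3.upload_part(
            Bucket=BUCKET,
            Key=KEY,
            PartNumber=part_number,
            UploadId=mpu["UploadId"],
            Body=chunk,
        )
        parts.append({"ETag": response["ETag"], "PartNumber": part_number})
        part_number += 1

s3.complete_multipart_upload(
    Bucket=BUCKET,
    Key=KEY,
    UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)
```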

### Make and run a batched update DAG for one-time cleanup

A batched catalog cleaner DAG (or potentially a `batched_update_from_file`)
and ideally can't be singly blocked by any DAG.
A [proof of concept PR](https://github.com/WordPress/openverse/pull/3601)
consisted of uploading each file to a temporary `UNLOGGED` DB table (which
provides huge gains in writing performance, while the disadvantages are not
relevant to us since the tables won't be permanent), and including a `row_id`
serial number used later to query it in batches. Adding an index on this last
column after filling up the table could improve the query performance. An
adaptation will be needed to handle the column type of tags (`jsonb`).
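
A condensed sketch of that approach follows; the PoC PR is the authoritative
version, while the table and column names here are illustrative and `psycopg2`
stands in for whatever DB access the DAG would actually use.

```python
import psycopg2

BATCH_SIZE = 10_000

conn = psycopg2.connect("dbname=openledger")  # hypothetical connection string
cur = conn.cursor()

# UNLOGGED skips the write-ahead log: much faster to fill, and acceptable here
# because the table is disposable. Tags would need a jsonb column instead.
cur.execute(
    """
    CREATE UNLOGGED TABLE temp_cleaned_url (
        row_id serial,
        identifier uuid,
        url text
    );
    """
)
with open("url.tsv") as tsv_file:
    # COPY's text format is tab-delimited by default, matching the TSV files.
    cur.copy_expert(
        "COPY temp_cleaned_url (identifier, url) FROM STDIN", tsv_file
    )

# Index row_id only after the bulk load so the load itself stays fast.
cur.execute("CREATE INDEX ON temp_cleaned_url (row_id);")
conn.commit()

cur.execute("SELECT max(row_id) FROM temp_cleaned_url;")
max_row_id = cur.fetchone()[0] or 0
for start in range(0, max_row_id, BATCH_SIZE):
    cur.execute(
        """
        UPDATE image
        SET url = tmp.url
        FROM temp_cleaned_url AS tmp
        WHERE image.identifier = tmp.identifier
          AND tmp.row_id > %s AND tmp.row_id <= %s;
        """,
        (start, start + BATCH_SIZE),
    )
    conn.commit()  # commit per batch so progress survives a failure
```
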
What risks are we taking with this solution? Are there risks that once taken
can't be undone?

- Previous attempt from cc-archive: [Clean preexisting data using ImageStore
#517][mathemancer_pr]
- @obulat's PR to [add logging and save cleaned up data in the Ingestion
Server][pr-saving-tsv]

[pr-saving-tsv]: https://github.com/WordPress/openverse/pull/904
[mathemancer_pr]: https://github.com/cc-archive/cccatalog/pull/517
