Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete unacceptable thumbnails from catalog DB after the image data refresh is finished #1816

Closed
krysal opened this issue Apr 18, 2023 · 9 comments · Fixed by #2302
Closed
Assignees
Labels
🕹 aspect: interface Concerns end-users' experience with the software 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@krysal
Copy link
Member

krysal commented Apr 18, 2023

Description

In #1811 the need arose to delete the thumbnails of the following providers: 'rijksmuseum', 'sketchfab', 'sciencemuseum', 'thingiverse' and 'flickr'.

Description

Make a backup (just in case something goes wrong) and apply the update to nullify those providers' thumbnails:

UPDATE image SET thumbnail = NULL WHERE provider IN 
	('rijksmuseum', 'sketchfab', 'sciencemuseum', 'thingiverse')

-- Flickr will presumably take much longer so it's suggested to apply it separately
UPDATE image SET thumbnail = NULL WHERE provider='flickr';

Additional context

Related to #675.

@krysal krysal added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 🕹 aspect: interface Concerns end-users' experience with the software ⛔ status: blocked Blocked & therefore, not ready for work 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Apr 18, 2023
@krysal krysal self-assigned this Apr 18, 2023
@krysal krysal removed the ⛔ status: blocked Blocked & therefore, not ready for work label May 5, 2023
@krysal
Copy link
Member Author

krysal commented May 5, 2023

Since the data refresh process is now finished, I will proceed with this issue. First backing up the thumbnail data, just in case.

CREATE TEMPORARY TABLE small_image_thumbnails AS (
    SELECT identifier, provider, thumbnail FROM IMAGE
    WHERE thumbnail IS NOT NULL AND provider IN 
        ('rijksmuseum', 'sketchfab', 'sciencemuseum', 'thingiverse')
);

\copy small_image_thumbnails TO '/tmp/small_image_thumbnails_2023_04_05.tsv' DELIMITER E'\t' CSV HEADER;

Uploaded to s3://openverse-catalog/image/small_image_thumbnails_2023_04_05.tsv

The next step is to apply the update.

@krysal
Copy link
Member Author

krysal commented May 5, 2023

The changes for the small providers were applied quite fast:

UPDATE image SET thumbnail = NULL WHERE provider IN 
     ('rijksmuseum', 'sketchfab', 'sciencemuseum', 'thingiverse');

UPDATE 198754
Time: 150.083s (2 minutes 30 seconds), executed in: 150.083s (2 minutes 30 seconds)

I tried to do the same for Flickr separately since this provider is considerably larger, and as expected, it took much longer to create the temporary table.

-- Backup
CREATE TEMPORARY TABLE flickr_thumbnails_deleted_2023_04_05 AS (
     SELECT identifier, thumbnail FROM image WHERE provider='flickr' AND thumbnail IS NOT NULL
 );
SELECT 497009314
Time: 1984.467s (33 minutes 4 seconds), executed in: 1984.466s (33 minutes 4 seconds)

And since the resulting table is huge

SELECT pg_size_pretty(pg_relation_size('flickr_thumbnails_deleted_2023_04_05'));
+----------------+
| pg_size_pretty |
|----------------|
| 51 GB          |
+----------------+

I tried to upload the tsv directly to s3 but it seems that our postgres version doesn't support exports to Amazon S3 🫤

aws rds describe-db-engine-versions --region us-east-1 --engine postgres --engine-version 13.2 | grep s3Export
# returns nothing

So we have two options before applying the update for Flickr:

  1. Skip saving Flickr's thumbnails (we're not using them anyway)
  2. Save them to a normal table for a while
CREATE TABLE flickr_thumbnails_deleted_2023_04_05 AS (
    identifier  uuid, # not sure is this would be allowed since in theory uuid are unique across the DB
    thumbnail   varchar(3000)
);

-- this would be the backup
INSERT INTO flickr_thumbnails_deleted_2023_04_05
SELECT identifier, thumbnail FROM image
WHERE provider='flickr' AND thumbnail IS NOT NULL;

I slightly prefer the first option as it is the simplest, in case we really need these thumbnails we could probably get them from one of the whole DB backup, although it will be more cumbersome. @WordPress/openverse What do you think?

@zackkrida zackkrida changed the title Delete unacceptably thumbnails from catalog DB after the image data refresh is finished Delete unacceptable thumbnails from catalog DB after the image data refresh is finished May 5, 2023
@AetherUnbound
Copy link
Collaborator

Dang, 51 GB for just two values 🥲 I think we have no shortage of space on the catalog at this point, so my preference would be to save it to a non-temporary table!

@sarayourfriend
Copy link
Collaborator

Ditto, unless there's a reason to save those 51GB, keeping it is probably going to be easier to recover from than needing to extract them from a backup.

@krysal
Copy link
Member Author

krysal commented May 11, 2023

Done! Flickr thumbnails are backed up to the persistent flickr_thumbnails_deleted_2023_05_08 table and in the process of being deleted from image.

It's been 4 hours and counting...

@krysal
Copy link
Member Author

krysal commented Jun 2, 2023

The previous manual update query was terminated on May 16th before it was completed since it was not known when it would end and was blocking other work. I created a simple DAG to do this progressively.

@AetherUnbound
Copy link
Collaborator

@krysal Should we reopen this and wait to close it until the thumbnails themselves are deleted?

@krysal
Copy link
Member Author

krysal commented Jun 7, 2023

@AetherUnbound that's right.

@krysal krysal reopened this Jun 7, 2023
@krysal
Copy link
Member Author

krysal commented Jun 20, 2023

This is completed.

@krysal krysal closed this as completed Jun 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🕹 aspect: interface Concerns end-users' experience with the software 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants