Script for hard removal of unused histories #17725

Closed
mvdbeek opened this issue Mar 14, 2024 · 15 comments
@mvdbeek
Member

mvdbeek commented Mar 14, 2024

https://gist.github.com/mvdbeek/fc352f0f21a5eca14b8f7416f34e95e5 seems to "work" on usegalaxy.org, but I've rolled it back for now.
@jdavcs mentioned that we need to exclude some additional items:

galaxy_main=> SELECT * FROM history_tag_association WHERE history_id IN (SELECT id FROM history WHERE history.user_id IS NULL AND history.hid_counter = 1 ORDER BY history.update_time);
  id   | history_id | tag_id | user_tname | value | user_value | user_id 
-------+------------+--------+------------+-------+------------+---------
 40231 |    2522236 |   1951 | exome      |       |            |   67901
 40232 |    2522236 |   2160 | cancer     |       |            |   67901
 40233 |    2522236 |   4724 | colorectal |       |            |   67901
 41783 |    2544120 |   1951 | exome      |       |            |   67901
 41784 |    2544120 |   2160 | cancer     |       |            |   67901
 41785 |    2544120 |   4724 | colorectal |       |            |   67901
 35779 |    2444466 |   1951 | exome      |       |            |   67901
 35780 |    2444466 |   2160 | cancer     |       |            |   67901
 35781 |    2444466 |   4724 | colorectal |       |            |   67901
 42530 |    2554242 |   1951 | exome      |       |            |   67901
 42531 |    2554242 |   2160 | cancer     |       |            |   67901
 42532 |    2554242 |   4724 | colorectal |       |            |   67901
 36145 |    2452043 |   1951 | exome      |       |            |   67901
 36146 |    2452043 |   2160 | cancer     |       |            |   67901
 36147 |    2452043 |   4724 | colorectal |       |            |   67901
 38063 |    2491679 |   1951 | exome      |       |            |   67901
 38064 |    2491679 |   2160 | cancer     |       |            |   67901
 38065 |    2491679 |   4724 | colorectal |       |            |   67901
 39213 |    2506616 |   1951 | exome      |       |            |   67901
 39214 |    2506616 |   2160 | cancer     |       |            |   67901
 39215 |    2506616 |   4724 | colorectal |       |            |   67901
 39279 |    2507459 |   1951 | exome      |       |            |   67901
 39280 |    2507459 |   2160 | cancer     |       |            |   67901
 39281 |    2507459 |   4724 | colorectal |       |            |   67901
 42391 |    2552392 |   1951 | exome      |       |            |   67901
 42392 |    2552392 |   2160 | cancer     |       |            |   67901
 42393 |    2552392 |   4724 | colorectal |       |            |   67901
 42566 |    2556597 |   1951 | exome      |       |            |   67901
 42567 |    2556597 |   2160 | cancer     |       |            |   67901
 42568 |    2556597 |   4724 | colorectal |       |            |   67901
 36670 |    2464460 |   1951 | exome      |       |            |   67901
 36671 |    2464460 |   2160 | cancer     |       |            |   67901
 36672 |    2464460 |   4724 | colorectal |       |            |   67901
(33 rows)
@jdavcs
Member

jdavcs commented Mar 14, 2024

@mvdbeek So do you think we should add this as an executable script to /scripts so admins can run it on their instances?

@mvdbeek
Member Author

mvdbeek commented Mar 14, 2024

I think this is likely only required for large, public instances that allow anonymous access. The first step should be to improve it with fixes for what you've already found, run it against test and main, and then maybe include it in gxadmin or pgcleanup.py.

@jdavcs
Member

jdavcs commented Mar 14, 2024

Here's my proposal.

There are multiple tables that reference the history table. We should ensure we don't leave them in an inconsistent state. Some of them are very unlikely to have a reference to an anonymous history, but that's not impossible (as we've discovered with history tag associations), whether due to a bug or an oversight in application logic, now or in the future. So I suggest handling all of them regardless.

Among those tables, there are some where we don't want to keep records without a corresponding history record (like a history tag association), and others where we do (e.g. job, hda, etc.). For the former, we delete the row; for the latter, we set the column referencing history to NULL. For that to be possible, we need to modify the database schema and alter the column definitions in six tables so they accept NULL values:

job.history_id
history_dataset_association.history_id
history_dataset_collection_association.history_id
workflow_invocation.history_id
job_export_history_archive.history_id
job_import_history_archive.history_id

That would require a db migration.
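As a hedged sketch of that schema change: in Galaxy this would ship as a regular database migration rather than hand-run DDL, but the underlying statements would look roughly like the following. (Dropping NOT NULL is a catalog-only change in Postgres and does not rewrite the table, though each ALTER briefly takes an exclusive lock.)

```sql
-- Sketch only: make history_id nullable in the six referencing tables.
ALTER TABLE job ALTER COLUMN history_id DROP NOT NULL;
ALTER TABLE history_dataset_association ALTER COLUMN history_id DROP NOT NULL;
ALTER TABLE history_dataset_collection_association ALTER COLUMN history_id DROP NOT NULL;
ALTER TABLE workflow_invocation ALTER COLUMN history_id DROP NOT NULL;
ALTER TABLE job_export_history_archive ALTER COLUMN history_id DROP NOT NULL;
ALTER TABLE job_import_history_archive ALTER COLUMN history_id DROP NOT NULL;
```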

Next, we'd run this script:

CREATE TEMPORARY TABLE tmp_unused_history(id INT PRIMARY KEY);
INSERT INTO tmp_unused_history 
SELECT id FROM history WHERE user_id IS NULL AND hid_counter = 1 AND update_time < '01-01-2015';

BEGIN TRANSACTION;

-- delete rows we don't need to keep
DELETE FROM event 
WHERE history_id IN (SELECT id FROM tmp_unused_history);
 
DELETE FROM history_audit 
WHERE history_id IN (SELECT id FROM tmp_unused_history);
 
DELETE FROM history_tag_association 
WHERE history_id IN (SELECT id FROM tmp_unused_history);
 
DELETE FROM history_annotation_association 
WHERE history_id IN (SELECT id FROM tmp_unused_history);
 
DELETE FROM history_rating_association 
WHERE history_id IN (SELECT id FROM tmp_unused_history);
 
DELETE FROM history_user_share_association 
WHERE history_id IN (SELECT id FROM tmp_unused_history);

DELETE FROM default_history_permissions 
WHERE history_id IN (SELECT id FROM tmp_unused_history);
 
DELETE FROM data_manager_history_association 
WHERE history_id IN (SELECT id FROM tmp_unused_history);
 
DELETE FROM cleanup_event_history_association 
WHERE history_id IN (SELECT id FROM tmp_unused_history);
 
DELETE FROM galaxy_session_to_history 
WHERE history_id IN (SELECT id FROM tmp_unused_history);

-- set history id to NULL in rows we want to keep 
UPDATE job 
SET history_id = NULL 
WHERE history_id in (SELECT id FROM tmp_unused_history);

UPDATE history_dataset_association
SET history_id = NULL 
WHERE history_id in (SELECT id FROM tmp_unused_history);

UPDATE history_dataset_collection_association
SET history_id = NULL
WHERE history_id in (SELECT id FROM tmp_unused_history);

UPDATE workflow_invocation
SET history_id = NULL
WHERE history_id in (SELECT id FROM tmp_unused_history);

UPDATE job_export_history_archive
SET history_id = NULL
WHERE history_id in (SELECT id FROM tmp_unused_history);

UPDATE job_import_history_archive
SET history_id = NULL
WHERE history_id in (SELECT id FROM tmp_unused_history);

UPDATE galaxy_session
SET current_history_id = NULL
WHERE current_history_id in (SELECT id FROM tmp_unused_history);

-- delete history rows
DELETE FROM history WHERE id IN (SELECT id FROM tmp_unused_history);

COMMIT TRANSACTION;

Questions:

  1. What date is a safe cutoff so that we don't delete currently used histories?
  2. I am not sure we should delete galaxy_session rows, so I've moved that into the "set to NULL" group of statements. If we do want to delete those rows, I'll update the statement to select the correct rows to delete.
  3. I think this has the potential to run for a long time. I don't think it must be one huge transaction: there's no harm done if we delete the rows from one table but not the next in the list. So maybe it's better to wrap each statement in its own transaction? That way, if something fails, we won't have to rerun everything.
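To illustrate point 3, the per-statement transactions could be driven from a small wrapper script. A minimal sketch, using sqlite3 as a stand-in for the real Postgres connection, with only two of the association tables and purely illustrative data:

```python
import sqlite3

# Stand-in schema: a history table plus two of the association tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE history (id INTEGER PRIMARY KEY, user_id INTEGER, hid_counter INTEGER);
    CREATE TABLE history_tag_association (id INTEGER PRIMARY KEY, history_id INTEGER);
    CREATE TABLE galaxy_session_to_history (id INTEGER PRIMARY KEY, history_id INTEGER);
    INSERT INTO history VALUES (1, NULL, 1), (2, 42, 5), (3, NULL, 1);
    INSERT INTO history_tag_association (history_id) VALUES (1), (3);
    INSERT INTO galaxy_session_to_history (history_id) VALUES (1);
""")

# Candidate anonymous, empty histories.
unused = [row[0] for row in conn.execute(
    "SELECT id FROM history WHERE user_id IS NULL AND hid_counter = 1")]

# One transaction per statement: if a later delete fails, the earlier
# deletes stay committed and the script can simply be rerun.
statements = [
    "DELETE FROM history_tag_association WHERE history_id IN (%s)",
    "DELETE FROM galaxy_session_to_history WHERE history_id IN (%s)",
    "DELETE FROM history WHERE id IN (%s)",
]
placeholders = ",".join("?" for _ in unused)
for stmt in statements:
    with conn:  # each `with` block commits as its own transaction
        conn.execute(stmt % placeholders, unused)

print(conn.execute("SELECT count(*) FROM history").fetchone()[0])  # → 1
```

The shape is the same with a real Postgres driver: one commit per statement (or per batch), so a failure partway through leaves earlier work committed.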

Does this make sense?

Also, some numbers to help select the cutoff date:

galaxy_main=> SELECT count(*) FROM history WHERE user_id IS NULL AND hid_counter = 1 AND update_time < '01-01-2015';
  count  
---------
 1406895
(1 row)

galaxy_main=> SELECT count(*) FROM history WHERE user_id IS NULL AND hid_counter = 1 AND update_time < '01-01-2023';
  count  
---------
 3614539
(1 row)

galaxy_main=> SELECT count(*) FROM history WHERE user_id IS NULL AND hid_counter = 1 AND update_time < '01-01-2024';
  count  
---------
 4216140
(1 row)

galaxy_main=> SELECT count(*) FROM history WHERE user_id IS NULL AND hid_counter = 1;
  count  
---------
 5088343
(1 row)

@mvdbeek
Member Author

mvdbeek commented Mar 15, 2024

job.history_id
history_dataset_association.history_id
history_dataset_collection_association.history_id
workflow_invocation.history_id
job_export_history_archive.history_id
job_import_history_archive.history_id

Can we instead exclude histories that are referenced in these tables? This is going to be a very small fraction of the histories with no datasets.

@hexylena
Member

For that to be possible, we need to modify the database schema and alter the column definitions in 6 tables: so they accept NULL values:

Just accepting NULLs, or also changing a default? (Older Postgres versions rewrote the table when a default changed.) It might be worth testing this migration on a replica of a big database, just so we can document whether the migration will take a long time.

What date is a safe cutoff so we don't delete currently used histories?

Anything older than about a month should be fine.

I am not sure we should delete galaxy_session rows

I can't think of a situation where that wouldn't be safe: the user id is NULL, so these can only be anonymous sessions.

I think this has the potential to run for a long time. I don't think it must be one huge transaction: there's no harm done if we delete the rows from one table, but not the next in the list. So, maybe better wrap each statement in a transaction? That way if something happens, we won't have to rerun everything.

Yes, this could run for ages. Folks will also want to consider vacuuming afterwards, since this will leave a lot of dead tuples in the middle of the database.

Did you consider batching, rather than nested transactions or multiple small ones? Something like this would pull the oldest empty histories to clean up, and admins could set the batch size to whatever they're comfortable with:

SELECT id FROM history WHERE user_id IS NULL AND hid_counter = 1 AND update_time < '01-01-2015' ORDER BY update_time ASC LIMIT 1000

You might also drop the temp table afterwards; otherwise subsequent runs are going to be an issue.

@mvdbeek
Member Author

mvdbeek commented Mar 15, 2024

It doesn't seem like a good idea to add migrations that make columns nullable where we don't want nullable columns. That seems antithetical to the clean data model that was brought up in the discussion yesterday.

👍 to batches, that's how I tested https://gist.github.com/mvdbeek/fc352f0f21a5eca14b8f7416f34e95e5

I am not able to think of a situation where that wouldn't be safe, the user id is null, these can only be for anonymous sessions.

IIRC we've done that in the past already

@jdavcs
Member

jdavcs commented Mar 15, 2024

Thanks for the detailed feedback! Posting draft here one more time before wrapping in a script.

Edits:

  • Added ON COMMIT DROP to explicitly drop the temp table within the transaction.
  • Added exclusion clauses to the temp-table SELECT statement instead of nullifying the history_id columns (yes, good point about a clean data model!).

I've considered deleting galaxy_session rows; however, that has strings attached: the table is referenced from event, user_action, and job. At the same time, the current_history_id column in galaxy_session is nullable. I think setting that value to NULL is a more reasonable approach than deleting only the sessions that are not referenced from those tables and then setting current_history_id to NULL on the excluded rows.

Batching seems to be the way to go, especially given that it gives other admins the flexibility to set the batch size in addition to the cut-off date.

-- update time and size of limit will be configurable
BEGIN TRANSACTION;

-- setup and populate temporary table
CREATE TEMPORARY TABLE tmp_unused_history(id INT PRIMARY KEY) ON COMMIT DROP;
INSERT INTO tmp_unused_history
SELECT id FROM history
WHERE
    user_id IS NULL
    AND hid_counter = 1
    AND update_time < '01-01-2015'
    AND id NOT IN (SELECT history_id FROM job)
    AND id NOT IN (SELECT history_id FROM history_dataset_association)
    AND id NOT IN (SELECT history_id FROM history_dataset_collection_association)
    AND id NOT IN (SELECT history_id FROM workflow_invocation)
    AND id NOT IN (SELECT history_id FROM job_export_history_archive)
    AND id NOT IN (SELECT history_id FROM job_import_history_archive)
LIMIT 10;

-- delete rows we don't need to keep
DELETE FROM event
WHERE history_id IN (SELECT id FROM tmp_unused_history);

DELETE FROM history_audit
WHERE history_id IN (SELECT id FROM tmp_unused_history);

DELETE FROM history_tag_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);

DELETE FROM history_annotation_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);

DELETE FROM history_rating_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);

DELETE FROM history_user_share_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);

DELETE FROM default_history_permissions
WHERE history_id IN (SELECT id FROM tmp_unused_history);

DELETE FROM data_manager_history_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);

DELETE FROM cleanup_event_history_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);

DELETE FROM galaxy_session_to_history
WHERE history_id IN (SELECT id FROM tmp_unused_history);

-- set history id to NULL in rows in galaxy_session
UPDATE galaxy_session
SET current_history_id = NULL
WHERE current_history_id in (SELECT id FROM tmp_unused_history);

-- delete history rows
DELETE FROM history WHERE id IN (SELECT id FROM tmp_unused_history);

COMMIT TRANSACTION;

@hexylena
Member

yeah this is starting to look fantastic!

    AND id NOT IN (SELECT history_id FROM job)

Oh, this will be very slow. Hmm. You can make it DISTINCT, but that won't improve the speed at all. I'm not sure there is a way to improve efficiency there; you need the full set of IDs at some point.

You could stuff the six queries into a temp table via UNION, since they're all the same shape, and do a SELECT DISTINCT from it, but again, that probably doesn't improve performance much.

@jdavcs
Member

jdavcs commented Mar 19, 2024

Yes, and not just that one: hda is just as bad. DISTINCT doesn't help with speed. What does help, surprisingly, is adding a WHERE clause that caps the max job id (when querying the job table). Based on the query plan, the only difference is an index scan (without the clause) versus a sequential table scan (with it); however, once you actually run it, that changes everything: a simple SELECT history_id FROM job WHERE id < some-max-value runs in 2 minutes, whereas the same query without the WHERE clause runs forever (I cancelled it after an hour).

I don't have a clear solution yet. One approach might be to run the deletes skipping over constraint errors (I don't know how to implement that efficiently yet). Another is to add WHERE clauses, which makes building the tmp table of history ids manageable. However, the big issue is that the db is live, and every minute we get new jobs. So even if we filter out the ids of histories that are not referenced from other tables, a job may be created the next second referencing a history from our list, and that will raise an error when we try to delete it. Locking the table is not an option, of course.
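One way to tolerate that race (a hypothetical sketch, not something the thread settled on) is to re-check for references inside the DELETE itself, so a history that gained a job after the candidate list was built is silently skipped instead of raising a foreign-key error. Illustrated with sqlite3 stand-ins for the real tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE history (id INTEGER PRIMARY KEY, user_id INTEGER, hid_counter INTEGER);
    CREATE TABLE job (id INTEGER PRIMARY KEY,
                      history_id INTEGER REFERENCES history(id));
    INSERT INTO history VALUES (1, NULL, 1), (2, NULL, 1);
""")

# Candidate list built first; both histories look unused at this point.
candidates = [1, 2]

# ...meanwhile, a "live" request attaches a job to history 2.
conn.execute("INSERT INTO job (history_id) VALUES (2)")

# Re-check at delete time: only rows that are still unreferenced are
# removed, so the new job does not cause a foreign-key violation.
conn.execute("""
    DELETE FROM history
    WHERE id IN (%s)
      AND NOT EXISTS (SELECT 1 FROM job WHERE job.history_id = history.id)
""" % ",".join("?" for _ in candidates), candidates)
conn.commit()

remaining = [r[0] for r in conn.execute("SELECT id FROM history ORDER BY id")]
print(remaining)  # history 2 survives because it is now referenced
```

The same NOT EXISTS guard would need to appear in every delete for this to hold up, which is part of why it's only a sketch.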

@hexylena
Member

hexylena commented Mar 19, 2024

@jdavcs yeah, that's a great suggestion. At that point you might even wrap it in a function. What about something like this, where we identify the range of integers we're going to target and then process those? (The extra table isn't really necessary, it's just a list of integers, but it reads a bit cleaner and leaves less chance of missing a left/right bound on a WHERE and ending up in a strange state.)

CREATE OR REPLACE FUNCTION trim_empty_histories(start integer, count integer)
RETURNS void AS $$
BEGIN
-- cutoff date and limit size could also be made configurable;
-- no explicit transaction control here: the function runs inside the
-- caller's transaction, so ON COMMIT DROP cleans up both temp tables

-- find our target history IDs
CREATE TEMPORARY TABLE tmp_target_unused_history(id INT) ON COMMIT DROP;
INSERT INTO tmp_target_unused_history
SELECT id FROM history
  WHERE id > start AND id <= start + count;

-- setup and populate temporary table
CREATE TEMPORARY TABLE tmp_unused_history(id INT PRIMARY KEY) ON COMMIT DROP;
INSERT INTO tmp_unused_history
SELECT id FROM history
WHERE
    user_id IS NULL
    AND hid_counter = 1
    AND id IN (SELECT id FROM tmp_target_unused_history)
    AND id NOT IN (SELECT history_id FROM job WHERE history_id IN (SELECT id FROM tmp_target_unused_history))
    AND id NOT IN (SELECT history_id FROM history_dataset_association WHERE history_id IN (SELECT id FROM tmp_target_unused_history))
    AND id NOT IN (SELECT history_id FROM history_dataset_collection_association WHERE history_id IN (SELECT id FROM tmp_target_unused_history))
    AND id NOT IN (SELECT history_id FROM workflow_invocation WHERE history_id IN (SELECT id FROM tmp_target_unused_history))
    AND id NOT IN (SELECT history_id FROM job_export_history_archive WHERE history_id IN (SELECT id FROM tmp_target_unused_history))
    AND id NOT IN (SELECT history_id FROM job_import_history_archive WHERE history_id IN (SELECT id FROM tmp_target_unused_history))
LIMIT 10;

-- delete rows we don't need to keep
DELETE FROM event
WHERE history_id IN (SELECT id FROM tmp_unused_history);

DELETE FROM history_audit
WHERE history_id IN (SELECT id FROM tmp_unused_history);

DELETE FROM history_tag_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);

DELETE FROM history_annotation_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);

DELETE FROM history_rating_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);

DELETE FROM history_user_share_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);

DELETE FROM default_history_permissions
WHERE history_id IN (SELECT id FROM tmp_unused_history);

DELETE FROM data_manager_history_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);

DELETE FROM cleanup_event_history_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);

DELETE FROM galaxy_session_to_history
WHERE history_id IN (SELECT id FROM tmp_unused_history);

-- set history id to NULL in rows in galaxy_session
UPDATE galaxy_session
SET current_history_id = NULL
WHERE current_history_id IN (SELECT id FROM tmp_unused_history);

-- delete history rows
DELETE FROM history WHERE id IN (SELECT id FROM tmp_unused_history);
END;
$$ LANGUAGE plpgsql;

@jdavcs
Member

jdavcs commented Mar 19, 2024

Good idea about wrapping in a function, thanks!

I think the first tmp table is not necessary. If we want to work on ranges of history IDs, it'll be much faster to use a simple WHERE clause when constructing the tmp_unused_history table: WHERE id >= low AND id < high (the column is an int, and it's indexed).

I think the main problem is that the db is live: no matter what table we pre-populate with history ids, a job may be created with a history from that list after we build it but before we try to delete that particular history. That'll raise an error.

I think I'll try something like this to handle this problem:

  1. Identify a set of unused histories (by whichever process is fastest)
  2. NEW: Mark those histories as deleted and purged to render them unusable
  3. Delete association tables we don't care about (currently, all the delete statements in the draft script)
  4. Update galaxy_session (set history_id to null for affected rows)
  5. Delete histories
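The steps above could be sketched as two separate transactions, so that the "render unusable" commit becomes visible to the rest of the application before the physical cleanup starts. A minimal illustration with sqlite3 stand-ins (the real history table has more columns, and only one association table is shown):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE history (id INTEGER PRIMARY KEY, user_id INTEGER,
                          hid_counter INTEGER,
                          deleted INTEGER DEFAULT 0, purged INTEGER DEFAULT 0);
    CREATE TABLE history_tag_association (id INTEGER PRIMARY KEY, history_id INTEGER);
    CREATE TABLE galaxy_session (id INTEGER PRIMARY KEY, current_history_id INTEGER);
    INSERT INTO history (id, user_id, hid_counter) VALUES (1, NULL, 1), (2, 7, 3);
    INSERT INTO history_tag_association (history_id) VALUES (1);
    INSERT INTO galaxy_session (current_history_id) VALUES (1);
""")

# Step 1: identify unused histories.
ids = [r[0] for r in conn.execute(
    "SELECT id FROM history WHERE user_id IS NULL AND hid_counter = 1")]
marks = ",".join("?" for _ in ids)

# Step 2 (first transaction): mark deleted + purged so the rows are
# unusable even if the physical cleanup below rolls back.
with conn:
    conn.execute(
        "UPDATE history SET deleted = 1, purged = 1 WHERE id IN (%s)" % marks, ids)

# Steps 3-5 (second transaction): delete association rows, null out
# galaxy_session references, then delete the history rows themselves.
with conn:
    conn.execute(
        "DELETE FROM history_tag_association WHERE history_id IN (%s)" % marks, ids)
    conn.execute(
        "UPDATE galaxy_session SET current_history_id = NULL "
        "WHERE current_history_id IN (%s)" % marks, ids)
    conn.execute("DELETE FROM history WHERE id IN (%s)" % marks, ids)

print(conn.execute("SELECT count(*) FROM history").fetchone()[0])  # → 1
```

Because the mark-as-purged commit stands on its own, a rollback in the second transaction leaves the histories inert but intact, and the cleanup can simply be retried.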

@hexylena
Member

I think the first tmp table is not necessary

No, definitely not; it was just to simplify the WHERE and be more sure that no bound was missing or comparing in the wrong direction.

I think the main problem is that the db is live: no matter what table we pre-populate with history ids, there may be a job created with a history from that list after we do that, but before we try to delete that particular history. That'll raise an error.

But the jobs we'll be working on are ancient, so surely that's a low risk, right? Assuming the admin goes by history ID starting from low numbers.

NEW: Mark those histories as deleted and purged to render them unusable

If this is happening in a transaction, it won't be visible to anything creating jobs, and you'll have the same problem, unless you do that in a separate transaction.

@jdavcs
Member

jdavcs commented Mar 19, 2024

if this is happening in a transaction it won't be visible to anything creating jobs, you'll have the same problem. Unless you do that in a separate transaction.

Oh yes, a separate transaction for sure. I think we can treat these as separate steps: (1) render unused histories unusable; (2) physical cleanup. So I think it's perfectly OK if the first commits but the second rolls back.

but the jobs we'll be working on are ancient so, surely that's a low risk, right? if we're assuming the admin goes by history ID starting from low numbers.

Yes, of course! But once we get close enough (within a month?), it's not impossible. This extra step would be a "due diligence" kind of thing, and it's not expensive.

@hexylena
Member

Not suggesting you skip the due diligence, of course; I just wanted to be sure whether writing those measures is worth it versus simply letting a transaction roll back and trying again.

@jdavcs
Member

jdavcs commented May 29, 2024

Implemented in #18079 (also see #18058 for details)

@jdavcs jdavcs closed this as completed May 29, 2024