Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Script for deleting userless histories from database + testing + drop unused model testing code #18079

Merged
merged 11 commits into from
May 14, 2024

Conversation

jdavcs
Copy link
Member

@jdavcs jdavcs commented May 1, 2024

Replaces #18058.

For description + initial review, please refer to #18058. This branch addresses code review comments + targets the dev branch (I think there's too much new functionality here to justify treating this as a bug fix and targeting an existing release).

I'm not sure if this is the correct way to make a script installable? Also, do we need to modify sys.path?

(tests and fixtures are mostly extracted from #17662 and replace the dropped testing code)

TODO:

  • Add minimal documentation

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@jdavcs jdavcs added kind/feature area/admin area/database Galaxy's database or data access layer area/scripts labels May 1, 2024
@jdavcs jdavcs added this to the 24.1 milestone May 1, 2024
@jdavcs jdavcs requested a review from mvdbeek May 1, 2024 22:18
@jdavcs
Copy link
Member Author

jdavcs commented May 2, 2024

Packages test failure relevant, due to failed import of pytest fixtures. Will fix.

@jdavcs jdavcs force-pushed the dev_userless_history_script_and_tests branch from d038a75 to 018ecfe Compare May 2, 2024 18:04
@jdavcs jdavcs force-pushed the dev_userless_history_script_and_tests branch from 018ecfe to 2d9f78f Compare May 2, 2024 18:12
@jdavcs
Copy link
Member Author

jdavcs commented May 2, 2024

I've moved fixtures into tests.unit.data.model.conftest.py (not one level below, so that they are also available to tests in model). Importing fixtures from lib via pytest_plugins doesn't work for package tests: plugins have to be defined at a top-level conftest.py. One other option is to add a symlink to packages/data/tests/conftest.py pointing to the conftest.py where plugins are defined, but I thought just defining them in contest is both simple and straightforward.

@mvdbeek mvdbeek requested a review from natefoo May 7, 2024 14:04
Copy link
Member

@natefoo natefoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks! Should be a lot for us:

galaxy_main=> SELECT count(*) FROM history WHERE user_id IS NULL AND hid_counter = 1 AND create_time < now() - interval '30 days';
  count
---------
 5255712
(1 row)

galaxy_main=> SELECT count(*) FROM history;
  count
---------
 8914573
(1 row)

lib/galaxy/model/scripts/history_table_pruner.py Outdated Show resolved Hide resolved
lib/galaxy/model/scripts/prune_history_table.py Outdated Show resolved Hide resolved
doc/source/admin/useful_scripts.rst Outdated Show resolved Hide resolved
Deleting unused histories
-------------------------

Galaxy accommodates anonymous usage by creating a default history. Often, such histories will remain unused, as a result of which the database may contain a considerable number of anonymous histories along with associated records, which serve no purpose. Deleting such records will declutter the database and free up space. However, given that a row in the history table may be referenced from multiple other tables, manually deleting such data may leave the database in an inconsistent state. Furthermore, whereas some types of data associated with such histories are clearly obsolete and can be safely deleted, others may require preservation for a variety of reasons.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is unused defined as never-used? Or does it also mean "was used, but no user associated and all datasets are purged"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Unused" here is defined as a history having hid_counter = 1 and no associated user. If, for whatever reason, the history has an associated job or dataset, that history will not be deleted. Even if the associated datasets were purged, those rows remain in the database and have to refer to a history row; deleting those histories would require deleting the associated rows, and even that is possible, that's the kind of data admins may want to preserve.

For reference, here's the output from a sample run of the script:

INFO:galaxy.model.scripts.history_table_pruner:Running batch 2 of 3: history id range 50-79:
INFO:galaxy.model.scripts.history_table_pruner: Marking histories as deleted and purged
INFO:galaxy.model.scripts.history_table_pruner: Collecting history ids
INFO:galaxy.model.scripts.history_table_pruner: Collecting ids of histories to exclude based on 6 associated tables:
INFO:galaxy.model.scripts.history_table_pruner:  Collecting history_id from job_import_history_archive
INFO:galaxy.model.scripts.history_table_pruner:  Collecting history_id from job_export_history_archive
INFO:galaxy.model.scripts.history_table_pruner:  Collecting history_id from workflow_invocation
INFO:galaxy.model.scripts.history_table_pruner:  Collecting history_id from history_dataset_collection_association
INFO:galaxy.model.scripts.history_table_pruner:  Collecting history_id from job
INFO:galaxy.model.scripts.history_table_pruner:  Collecting history_id from history_dataset_association
INFO:galaxy.model.scripts.history_table_pruner: Populating temporary table
INFO:galaxy.model.scripts.history_table_pruner: Deleting associated records from event
INFO:galaxy.model.scripts.history_table_pruner: Deleting associated records from history_audit
INFO:galaxy.model.scripts.history_table_pruner: Deleting associated records from history_tag_association
INFO:galaxy.model.scripts.history_table_pruner: Deleting associated records from history_annotation_association
INFO:galaxy.model.scripts.history_table_pruner: Deleting associated records from history_rating_association
INFO:galaxy.model.scripts.history_table_pruner: Deleting associated records from history_user_share_association
INFO:galaxy.model.scripts.history_table_pruner: Deleting associated records from default_history_permissions
INFO:galaxy.model.scripts.history_table_pruner: Deleting associated records from data_manager_history_association
INFO:galaxy.model.scripts.history_table_pruner: Deleting associated records from cleanup_event_history_association
INFO:galaxy.model.scripts.history_table_pruner: Deleting associated records from galaxy_session_to_history
INFO:galaxy.model.scripts.history_table_pruner: Set history_id to null in galaxy_session
INFO:galaxy.model.scripts.history_table_pruner: Deleting histories in id range 50-79

@jdavcs jdavcs merged commit 076f298 into galaxyproject:dev May 14, 2024
43 of 44 checks passed
@nsoranzo
Copy link
Member

This broke py312-lint tests, the following will fix it:

diff --git a/lib/galaxy/model/scripts/history_table_pruner.py b/lib/galaxy/model/scripts/history_table_pruner.py
index c23813cbd7..ebd60ba063 100644
--- a/lib/galaxy/model/scripts/history_table_pruner.py
+++ b/lib/galaxy/model/scripts/history_table_pruner.py
@@ -64,7 +64,7 @@ class HistoryTablePruner:
         high = get_high(low)
         batch_counter = 1
         while low <= self.max_id:
-            log.info(f"Running batch {batch_counter} of {self.batches}: history id range {low}-{high-1}:")
+            log.info(f"Running batch {batch_counter} of {self.batches}: history id range {low}-{high - 1}:")
             self._run_batch(low, high)
             low = high
             high = get_high(high)
@@ -76,7 +76,7 @@ class HistoryTablePruner:
         return today.replace(month=today.month - 1)
 
     def _run_batch(self, low, high):
-        empty_batch_msg = f" No histories to delete in id range {low}-{high-1}"
+        empty_batch_msg = f" No histories to delete in id range {low}-{high - 1}"
         self._mark_histories_as_deleted_and_purged(low, high)
         histories = self._get_histories(low, high)
         if not histories:
@@ -200,7 +200,7 @@ class HistoryTablePruner:
 
     def _delete_histories(self, low, high):
         """Last step: delete histories that are safe to delete."""
-        log.info(f" Deleting histories in id range {low}-{high-1}")
+        log.info(f" Deleting histories in id range {low}-{high - 1}")
         stmt = text(f"DELETE FROM history WHERE id IN (SELECT id FROM {TMP_TABLE})")
         with self.engine.begin() as conn:
             conn.execute(stmt)

@nsoranzo
Copy link
Member

This broke py312-lint tests, the following will fix it:

Added commit to fix this to PR #18152 .

@jdavcs
Copy link
Member Author

jdavcs commented May 15, 2024

Thank you for the fix, @nsoranzo! Sorry I missed that!

@jdavcs jdavcs added the highlight/admin Included in admin/dev release notes label May 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/admin area/database Galaxy's database or data access layer area/scripts highlight/admin Included in admin/dev release notes kind/feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants