Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEV-1086 - monthly reports #125

Merged
merged 4 commits into from
May 7, 2024
Merged

DEV-1086 - monthly reports #125

merged 4 commits into from
May 7, 2024

Conversation

aelkiss
Copy link
Member

@aelkiss aelkiss commented Apr 23, 2024

Overview

Data sets created for the HathiTrust research center consist of directory tree of zip files containing only the OCR text files for each volume. There are four subsets with various conditions for inclusion based around the rights attributes of the item. Each night, items are added and removed from the subsets each day based on newly-ingested items and updated rights.

Currently, researchers who use these data sets get an email each day with items removed from the data sets. This is rather noisy, especially since in certain scenarios items can change rights frequently -- note this is also undesired behavior, but even if it was fixed, it is not necessary to send nightly deletion logs. Instead, we want to report on items deleted from the data sets once a month.

Implementation

  • Staying the same: https://github.com/hathitrust/datasets/blob/main/bin/grep_logs.sh extracts deletion events from the worker logs.
  • New: DedupeDeleteLog reads these deletion logs, compares against what is in the data set at the end of the month, and outputs only those that remain deleted.
  • Changed: bin/notify.rb is moved to lib/datasets/notify.rb and integrated with the rest of the CLI

Testing

  • Try the functionality with:
    • cloning the repository and checking out the DEV-1086-monthly-reports branch
    • running docker compose build and docker compose run test bundle install
    • copying a month of deletion logs from ictc-ht-datasets-000:/htprep/datasets/logs/delete_notifications_sent to a convenient location under this repository (let's say "deletelogs")
    • running docker compose run processor bin/datasets.rb notify --dry-run deletelogs/*
  • Also see included tests; clone, bundle install, and run docker compose run test to run those tests yourself

You should see the text of the emails output.

Review questions

I think in general I'm pretty happy with this; I don't think there are any specific areas I'm looking for feedback on.

  • Do the changes here make sense?
  • Is this a reasonable way to achieve the goal as stated on the ticket?

@coveralls
Copy link

coveralls commented Apr 23, 2024

Coverage Status

coverage: 98.808% (-0.04%) from 98.849%
when pulling 5429c53 on DEV-1086-monthly-reports
into 3009ed4 on main.

- dedupes output
- omits anything that is in the target dataset at time of report
@aelkiss aelkiss force-pushed the DEV-1086-monthly-reports branch 2 times, most recently from 19aa9ec to 13735ed Compare April 30, 2024 21:04
@aelkiss aelkiss force-pushed the DEV-1086-monthly-reports branch from 13735ed to 6902c6d Compare May 1, 2024 17:44
@aelkiss
Copy link
Member Author

aelkiss commented May 1, 2024

Note that the missed coverage is I think only on a branch that actually sends email, which we don't really want to do (or try to do), and has previously been working in production

@aelkiss
Copy link
Member Author

aelkiss commented May 1, 2024

Also note that grep_logs.sh does need to be updated to run only monthly (and to process all unprocessed worker logs) as well, and we should do that on this PR before merging.

@aelkiss aelkiss marked this pull request as ready for review May 1, 2024 18:24
@aelkiss aelkiss requested a review from moseshll May 1, 2024 19:58
Copy link
Contributor

@moseshll moseshll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

APPROVE (as mwarin would say)
I even did some mean things to the CLI and it reacted in entirely unsurprising ways.
The only thing I don't like is the e-mail text in a heredoc. There would be a complexity cost associated with doing it otherwise (in a file, as a const) so set it aside as a matter of aesthetics.

@aelkiss
Copy link
Member Author

aelkiss commented May 7, 2024

I moved the part of grep_logs.sh that does the sending to a separate script that can be run on a separate schedule.

While grep_logs.sh has its own issues (namely, the date dependency), I'm not inclined to address that right now - that would probably need to be done with a bigger change like what we did with post-zephir processing & hathifiles.

@aelkiss aelkiss merged commit a17f379 into main May 7, 2024
1 check passed
@aelkiss
Copy link
Member Author

aelkiss commented May 7, 2024

I will go ahead and deploy this; the daily grep_logs.sh won't send any reports now, so we will need to add a separate monthly cron job to mail reports.

@aelkiss aelkiss deleted the DEV-1086-monthly-reports branch May 7, 2024 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants