Batch Issue Remover

Removes incorrect or damaged issues from chronam / ONI batches:

make
./remove-issues /path/to/batch_xxx_yyyyyyy_ver01/ \
  /path/to/batch_xxx_yyyyyyy_ver02 \
  sn12345678/2020-01-01_01 sn12345678/2020020101 sn12345678/2020030101

This would remove the first edition of each of the January 1st, February 1st, and March 1st, 2020 issues of the title identified by LCCN "sn12345678". The changes are written to the destination directory rather than made in place, so the original batch is preserved.

Dashes and underscores are stripped from the issue keys before matching, which allows for more readable input: 2020-01-01_01 and 2020010101 refer to the same issue.
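As a rough illustration of that normalization, here is a minimal Python sketch. It is not the tool's own code, and the function name `normalize_issue_key` is hypothetical; it only shows the stripping behavior described above.

```python
def normalize_issue_key(key: str) -> str:
    """Strip dashes and underscores so readable and compact keys compare equal."""
    return key.replace("-", "").replace("_", "")


keys = ["sn12345678/2020-01-01_01", "sn12345678/2020020101"]
print([normalize_issue_key(k) for k in keys])
# prints ['sn12345678/2020010101', 'sn12345678/2020020101']
```

Normalizing once, up front, means every later comparison can work on the compact form only.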

The source directory should be either the pristine dark archive or a copy thereof (the TIFF files don't need to be present, as they aren't copied to the destination). The destination will be immediately ingestible.

The tool performs the following actions:

  • The source batch.xml is scanned for the issues in question. If any of the given issue keys aren't found, an error is reported and no processing occurs.
  • Most files in the source are synced to the destination location:
    • TIFF images are skipped as they're not necessary for ONI and take a long time to copy.
    • Validated XML files (e.g., *_1.xml) are skipped as they aren't necessary for ONI and imply something no longer true (that the batch was run through LC's DVV tool after it was built).
    • batch.xml is rewritten in transit to remove relevant <issue> elements.
    • Any issue directory matching the given issue key(s) is skipped.
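The steps in the list above can be sketched roughly as follows. This is an illustrative Python sketch, not the tool's own code. It assumes that each `<issue>` element is a direct child of the batch.xml root and carries `lccn`, `issueDate`, and `editionOrder` attributes, and that file paths are given relative to the directory holding batch.xml, with the LCCN as the first component and the date-plus-edition directory somewhere below it. The sample XML, the "print" directory name, and all function names are hypothetical; check them against a real batch before relying on any of this.

```python
from __future__ import annotations

import fnmatch
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed batch.xml used only for illustration.
SAMPLE_BATCH_XML = """<batch name="batch_xxx_yyyyyyy_ver01">
  <issue lccn="sn12345678" issueDate="2020-01-01" editionOrder="01">./sn12345678/print/2020010101/2020010101.xml</issue>
  <issue lccn="sn12345678" issueDate="2020-02-01" editionOrder="01">./sn12345678/print/2020020101/2020020101.xml</issue>
</batch>"""


def local_name(tag: str) -> str:
    """Drop an XML namespace prefix such as '{http://...}issue' if present."""
    return tag.rsplit("}", 1)[-1]


def element_key(issue: ET.Element) -> str:
    """Build a compact key (LCCN/YYYYMMDDEE) from an <issue> element."""
    date = issue.get("issueDate", "").replace("-", "")
    return f"{issue.get('lccn')}/{date}{issue.get('editionOrder')}"


def rewrite_batch_xml(xml_text: str, keys_to_remove: set[str]) -> str:
    """Return batch.xml without the given issues; fail if any key is missing."""
    root = ET.fromstring(xml_text)
    issues = [el for el in root if local_name(el.tag) == "issue"]
    missing = keys_to_remove - {element_key(el) for el in issues}
    if missing:
        # Mirror the "no processing occurs" rule: stop before changing anything.
        raise ValueError(f"issue keys not found in batch.xml: {sorted(missing)}")
    for el in issues:
        if element_key(el) in keys_to_remove:
            root.remove(el)
    return ET.tostring(root, encoding="unicode")


def should_copy(rel_path: str, keys_to_remove: set[str]) -> bool:
    """Decide whether a file is synced to the destination as-is."""
    parts = rel_path.split("/")
    filename = parts[-1].lower()
    if filename.endswith((".tif", ".tiff")):
        return False  # TIFF images are skipped
    if fnmatch.fnmatch(filename, "*_1.xml"):
        return False  # validated XML files are skipped
    lccn, dirs = parts[0], parts[1:-1]
    # Skip anything inside a directory that matches a removed issue key.
    return not any(f"{lccn}/{d}" in keys_to_remove for d in dirs)
```

In this sketch batch.xml itself would be handled by `rewrite_batch_xml` rather than copied, and validating the keys before any file is written keeps a typo in an issue key from producing a half-built destination.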

On most failures, the tool will attempt to retry the job. The error checking is deliberately thorough: this tool needs to be able to correct batches at any time in the future if we have to reload from our archive (rather than re-archiving a second batch and hoping we didn't create new problems).
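A retry of this kind might look something like the following Python sketch. It is not the tool's own code: the function name, the attempt limit, the delay, and the choice to treat only `OSError` as retryable are all assumptions made for illustration, since the text above says only that "most" failures are retried.

```python
import time


def run_with_retries(job, max_attempts=3, delay_seconds=0.0):
    """Call job() until it succeeds or max_attempts is exhausted."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return job()
        except OSError as err:  # assumed retryable, e.g. a transient copy failure
            last_error = err
            time.sleep(delay_seconds)
    # Give up and report the last underlying error to the caller.
    raise RuntimeError(f"job failed after {max_attempts} attempts") from last_error
```

Anything other than `OSError` propagates immediately in this sketch, so a programming error or a failed validation is reported rather than retried.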

Note: If you have a pile of issues you need to remove and aren't sure where they live, NCA has a useful tool to help. Clone NCA, run make, and then use bin/find-issues.

Note also that the code is definitely over-architected. It's basically a heavily modified copy of another tool which already had a job / worker approach that seemed potentially useful for retries and failure reporting.