4 26 2021 Tech Team Report

4-26-2021

Date	Task	Hours (Main)	Hours (EOLS)
19-Apr-2021	Report, pdf library research, update Drupal modules	1	3
20-Apr-2021	AnnoRep - develop test code to view pdf contents, create example with different comment types, find text anchor		4
21-Apr-2021	AnnoRep - try to read full pdf text, add functionality to detect docx or pdf @Dataverse, #7816 investigation, Drupal core sec update to all machines	2	3
22-Apr-2021	#7817 PR to IQSS, restore un/re ingest buttons, address 5.4.1 issues in doc (fix collection placement in search, update stage archive location, investigate archivecopy url), deploy fixes, reindex stage, comment on 7816	5
23-Apr-2021	Change citation to use owning collection name, fix test, deploy to dev/stage, reindex, AnnoRep - add pdf annotation retrieval	1	6

Operations:

Updated Drupal modules and core (security-related) and deployed to dev, stage, and, after testing, prod
Updated the location where stage sends archival files to use the Google dev bucket
Re-indexed dev/stage after updates that affect solr contents

Dataverse:

Investigated and fixed #7817: file replace was not using the original mimetype of tabular files when comparing against the new file. Sent a PR with the fix to IQSS.
Investigated #7816 - Excel ingest not handling dates. Wrote notes in the issue re: how Excel encodes dates and what would be needed to parse info from the xml components of xlsx files to determine which columns have dates.
Restored un/re ingest buttons on file page (lost in the menu redesign that happened in 5.3)
Fixed style issue shifting the collection name to a new line in Dataset lists/search
Changed citation to use parent collection name instead of root collection name
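
For context on the #7816 note above, here is a minimal sketch of how Excel date serial numbers map to calendar dates. It is not the Dataverse ingest code; the class and method names are hypothetical. It assumes the standard 1900 and 1904 date systems, and that a column is flagged as a date via a built-in number format id in the xlsx styles part (custom formats would need their format strings inspected, which is not shown).

```java
import java.time.LocalDate;

// Hypothetical helper, for illustration only.
class ExcelSerialDate {
    // In the 1904 date system, serial 0 is 1904-01-01.
    private static final LocalDate EPOCH_1904 = LocalDate.of(1904, 1, 1);

    // Convert an Excel serial number to a date; any fractional (time-of-day) part is dropped.
    static LocalDate toLocalDate(double serial, boolean date1904) {
        if (serial < 0) {
            throw new IllegalArgumentException("negative serial: " + serial);
        }
        long days = (long) Math.floor(serial);
        if (date1904) {
            return EPOCH_1904.plusDays(days);
        }
        // The 1900 system treats 1900 as a leap year, so serial 60 is a
        // non-existent 1900-02-29 and later serials are shifted by one day.
        if (days == 60) {
            throw new IllegalArgumentException("serial 60 is the non-existent 1900-02-29");
        }
        LocalDate base = days < 60 ? LocalDate.of(1899, 12, 31) : LocalDate.of(1899, 12, 30);
        return base.plusDays(days);
    }

    // Built-in SpreadsheetML number format ids that denote dates or times.
    static boolean isBuiltinDateFormat(int numFmtId) {
        return (numFmtId >= 14 && numFmtId <= 22) || (numFmtId >= 45 && numFmtId <= 47);
    }

    public static void main(String[] args) {
        System.out.println(toLocalDate(44312, false)); // prints 2021-04-26
    }
}
```

The point of the sketch is that a date cell is stored as a plain number, so the cell value alone cannot distinguish a date from any other numeric column; the format information has to come from the style records.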

AnnoRep:

Added a call to Dataverse search API to get file mimetype of doc
Investigated pdf libraries, tried pdfbox and found ways to read text, find 'highlight' comments, get the comment and anchor text
Developed a mechanism to find the location of anchors in the overall text stream. FWIW: since pdf doesn't have an overall sequence of page elements like docx, one has to start from the known anchor text and then look within the overall text stream (computed from positions on the page) to find where the anchors are. I created code that looks ahead for anchors and generates the appropriate anchor start/end events, so the contents can then be handled the same way as with docx files. In both cases this uses a stream rather than reading the whole doc string into memory, though I'm not sure how memory efficient the underlying libraries are.
Created a proof-of-concept pdf annotation extractor mirroring the docx output. (It only looks for comments on page 1 right now).
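
As a rough illustration of the look-ahead idea described above (not the actual AnnoRep code), the sketch below turns page text plus a list of known anchor strings into a sequence of text and anchor start/end events. All names are hypothetical. For brevity it works on an in-memory String rather than a true stream, and it assumes the anchors are given in reading order and do not overlap.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, for illustration only.
class AnchorEventSketch {
    enum Type { TEXT, ANCHOR_START, ANCHOR_END }

    static final class Event {
        final Type type;
        final String value; // text content, or the anchor's index for start/end events

        Event(Type type, String value) {
            this.type = type;
            this.value = value;
        }

        @Override
        public String toString() {
            return type + "(" + value + ")";
        }
    }

    // Scan forward through the text, looking ahead for each anchor in turn.
    static List<Event> toEvents(String text, List<String> anchors) {
        List<Event> events = new ArrayList<>();
        int pos = 0;
        for (int i = 0; i < anchors.size(); i++) {
            String anchor = anchors.get(i);
            int start = text.indexOf(anchor, pos);
            if (start < 0) {
                continue; // anchor not found ahead of the current position; skip it
            }
            if (start > pos) {
                events.add(new Event(Type.TEXT, text.substring(pos, start)));
            }
            events.add(new Event(Type.ANCHOR_START, String.valueOf(i)));
            events.add(new Event(Type.TEXT, anchor));
            events.add(new Event(Type.ANCHOR_END, String.valueOf(i)));
            pos = start + anchor.length();
        }
        if (pos < text.length()) {
            events.add(new Event(Type.TEXT, text.substring(pos)));
        }
        return events;
    }

    public static void main(String[] args) {
        List<String> anchors = new ArrayList<>();
        anchors.add("quick");
        anchors.add("fox");
        for (Event e : toEvents("The quick brown fox jumps", anchors)) {
            System.out.println(e);
        }
    }
}
```

A consumer that already handles start/end events from docx can process this sequence without knowing the source was a pdf; a streaming version would keep only a look-ahead buffer at least as long as the longest anchor.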

Questions:

@skarcher pointed out that the URL currently stored for Google archiving resolves to an odd page (one that assumes the zip file is a folder) rather than one that shows the file details or one that points to the folder the zip file is in. While one can navigate from the odd page to find these, it would be better to point directly to something useful. What's the best choice?
AnnoRep - is processing of a pdf file required w.r.t. creating a cleaned document? I.e., we convert docx to pdf which, at least currently, removes any comments, which are then captured as ~Hypothesis annotations. With pdf, should I create a pdf copy without the highlight comments in it (as an additional aux file associated with the original pdf file, analogous to how the pdf is an aux file associated with the original docx datafile)? Or just create the annotations file?

Plans:

Handle any issues in 5.4.1 discovered in testing, deploy to prod when ready
Investigate circumstances when redirect to user page happens (per discussion)
AnnoRep work
- Make a PR to update aux files. In Dataverse 5.4 they are only allowed on tabular files, multiple copies can be uploaded (which then breaks download), and there is no way to delete or list them. Awaiting IQSS changes to aux file naming.
- Update pdf annotation code to handle multiple pages
- Code cleanup
- Deploy pdf annotation extractor

Still TBD: