Feature/chadwyck healey #157

Merged (17 commits into develop, Feb 18, 2025)
Conversation

WHaverals (Collaborator):

#142

I built a new TML-file parser! For multiple reasons:

  1. Most importantly, these files are weird to say the least -- not well-formed XML, nor SGML, nor HTML...
  2. The plaintext version of the corpus (which we received from Mark Algee-Hewitt) contained a lot of weirdness, e.g.:
  • many files still contain what looks like HTML tags
  • other encoding issues
  • lots of copyright notices
  • extremely weird indentation (which causes some files to be well over 10MB, one is even 335MB -- simply because each newline is a tab, see e.g. Z200474224)
  • far too many blank lines
  3. Mark also sent us a metadata file (POETRY_CORPUS.xlsx), which is a mess too, e.g.:
  • all the translators are omitted, and these are sometimes really important! (this is a big issue for me -- e.g. Guy Fawkes translated Theocritus, and this is not mentioned anywhere in the xlsx file...)
  • encoding errors (e.g. A Sash for Wu Yün, “Fifteen stars fence polar fire ...â€, etc.)

So in short, both the plaintext files and the metadata file need to be cleaned up.
I went back to the original TML files and tried to parse them, to the best of my ability, to create a clean txt corpus.

The relevant files in this pull request are:

  • parse_tml.md (a description of the main logic that was followed and implemented; there are many edge cases)
  • entity_map.py (an attempt to map certain exotic entities -- XML codes, HTML codes, etc. -- that need to be resolved; the most important/frequent ones should be covered, and the remainder are part of the really long tail of singletons; a minimal sketch of how such a map gets applied follows the usage notes below)
  • tml_parser.py (the main script!)

Command to run the script:

python tml_parser.py --input_dir "tml" --output_dir "tml_parsed"

Optional arguments:
--num_files 1000 (leave out to process all files)
--verbose (for verbose output)
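For context, resolving these entities boils down to a lookup table plus a regex substitution, with unknown named entities left untouched so the long-tail singletons can still be spotted and mapped later. A minimal sketch of that idea (the ENTITY_MAP subset and the resolve_entities helper are illustrative names, not necessarily what entity_map.py / tml_parser.py actually use):

```python
import re

# Illustrative subset of the entity lookup table; the real map in the PR is much larger.
ENTITY_MAP = {
    "&wblank;": "\u2003",  # EM SPACE
    "&point;": ".",
    "&supere;": "\u1d49",  # MODIFIER LETTER SMALL E
}

# Match numeric character references (&#959; / &#x3BF;) or named entities (&wblank;)
ENTITY_RE = re.compile(r"&#x?[0-9A-Fa-f]+;|&[A-Za-z][A-Za-z0-9.]*;")

def resolve_entities(text: str) -> str:
    """Replace mapped entities, decode numeric references, keep unknown entities verbatim."""
    def _sub(match: re.Match) -> str:
        ent = match.group(0)
        if ent in ENTITY_MAP:
            return ENTITY_MAP[ent]
        if ent.startswith("&#"):
            codepoint = ent[2:-1]
            base = 16 if codepoint[:1] in ("x", "X") else 10
            try:
                return chr(int(codepoint.lstrip("xX"), base))
            except ValueError:
                return ent
        return ent  # unknown named entity: the "really long tail of singletons"
    return ENTITY_RE.sub(_sub, text)
```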

@WHaverals WHaverals requested a review from laurejt February 10, 2025 18:42

codecov bot commented Feb 10, 2025

Codecov Report

Attention: Patch coverage is 19.78610% with 300 lines in your changes missing coverage. Please review.

Project coverage is 74.70%. Comparing base (7b66adc) to head (a93b480).
Report is 48 commits behind head on develop.

❌ Your patch status has failed because the patch coverage (19.78%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project status has failed because the head coverage (74.70%) is below the target coverage (75.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #157      +/-   ##
===========================================
- Coverage    76.04%   74.70%   -1.35%     
===========================================
  Files           23       29       +6     
  Lines         1929     3139    +1210     
===========================================
+ Hits          1467     2345     +878     
- Misses         462      794     +332     

"*": "*",
"&wblank;": "\u2003", # EM SPACE
"&point;": ".",
"&supere;": "ᵉ", # U+1D49
Contributor:

Is it worth preserving superscripts? It's fine if we do, we'll just need to specifically account for and convert them downstream.

@rlskoeser (Collaborator), Feb 18, 2025:

I vote no - but we could make it optional

Contributor:

Since it's in the code right now, I say we don't do anything. My planned way of dealing with this downstream is to use ftfy with NFKC normalization, which will get rid of these kinds of characters (e.g., ™ --> TM and H₂O --> H2O, and so on).
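For reference, a minimal sketch of that downstream step, assuming plain ftfy plus the standard library's unicodedata for the NFKC pass (the function name is illustrative):

```python
import unicodedata
import ftfy

def normalize_downstream(text: str) -> str:
    """Sketch of the proposed downstream cleanup: ftfy repairs mojibake, then NFKC
    folds compatibility characters such as superscripts ('™' -> 'TM', 'H₂O' -> 'H2O')."""
    return unicodedata.normalize("NFKC", ftfy.fix_text(text))
```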

Contributor:

This is really part of the TML Parser, so I'll move the mapping there as a global variable

print(f"Error type: {type(e).__name__}")
print(f"Error message: {str(e)}")
print(traceback.format_exc())
return None, None
Contributor:

This means we don't actually parse all files! What are we missing in these cases?

Comment on lines +520 to +523
if self.figure_only:
print("\nThe following files contained only a figure (no text):")
for f in self.figure_only:
print(" -", f)
Contributor:

self.figure_only is never updated... is this intentional?

Also moved entity map into the parser script
@laurejt laurejt requested a review from rlskoeser February 14, 2025 16:33
laurejt (Contributor) commented Feb 14, 2025:

@rlskoeser Note that the parsing is quite ad hoc and is likely to miss text that has a different structure. For example, it looks like speaker div tags are ignored, and text that for one reason or another falls outside of tags can be lost, as is the case with Z300491786.tml.

@rlskoeser (Collaborator) left a review:

Trying to keep my comments at a fairly high level, based on my understanding of our priorities for this script:

  • switching from subprocess + file to python-magic would improve efficiency and be a light lift (I can contribute if we decide it's worth investing in it); it does require libmagic to be installed; a better solution would be a one-time option to sanitize (and possibly filter) the input files so it doesn't have to be done every time
  • it seems like it would be useful to add an option for a metadata-only mode, to generate the spreadsheet without doing all the text parsing; the way the code is set up, that should be pretty easy to integrate
  • it would be better to use logging to report on parsing status and errors (see the sketch after this list)
  • in future it might be nice to add an option to transform specific ids/filenames, so if we have a problem with a specific text or set of texts we don't have to regenerate everything; we could also add a check to skip text files that already exist in the output directory unless a flag is specified to overwrite/regenerate
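A minimal sketch of the logging suggestion (names are illustrative; the actual wiring into tml_parser.py would differ):

```python
import logging

logger = logging.getLogger("tml_parser")

def configure_logging(verbose: bool) -> None:
    """Map the existing --verbose flag onto log levels instead of print() calls."""
    logging.basicConfig(
        format="%(levelname)s %(name)s: %(message)s",
        level=logging.DEBUG if verbose else logging.INFO,
    )

# e.g. inside the parsing loop:
#   logger.info("parsed %s", text_file)
#   logger.warning("failed to parse %s: %s", text_file, exc)
```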

I was able to run the script locally; I fixed one variable that didn't work when specifying a limit (I think it just got missed in a refactor). I tried to add files to the figure_only list in two different places, but it didn't work -- those files were reported as failed to parse.

The parsing seems to be very complicated and seems to have some redundancy; I think we can get beautifulsoup to do more of the work for us. My approach would be to start writing unit tests for the different cases we know we need to handle and the expected output, and then start revising the code, but I'm not clear on how much of a priority that is right now.
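For what it's worth, a minimal sketch of what such tests might look like (pytest; the TMLParser.extract_text entry point, the TML fragments, and the expected outputs are all hypothetical stand-ins for whatever tml_parser.py actually exposes):

```python
import pytest
from tml_parser import TMLParser  # hypothetical import; the real entry point may differ

# Each case pairs a small TML-like fragment with the plain text we would expect back.
CASES = [
    ("<div1><l>First line</l><l>Second line</l></div1>", "First line\nSecond line"),
    ("<div1><speaker>HAMLET</speaker><l>To be</l></div1>", "HAMLET\nTo be"),
]

@pytest.mark.parametrize("fragment, expected", CASES)
def test_extract_text(fragment, expected):
    assert TMLParser().extract_text(fragment) == expected
```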


import bs4
import ftfy
from build_poem_corpus import get_poem_subdir
Collaborator:

It looks like build_poem_corpus was removed in refactors; is this code not actually used (would expect the import to error)?

Comment on lines +262 to +275
p = run(["file", text_file], capture_output=True, text=True)
if " ISO-8859 text" in p.stdout:
# Assume ISO-8859-like texts are Latin-1 (hopefully it's not macroman)
return "latin1"
elif " Non-ISO extended-ASCII text" in p.stdout:
# Assume this is Windows-1252
return "cp1252"
elif " UTF-8 text" in p.stdout:
return "utf-8"
elif " ASCII text" in p.stdout:
# Treat ASCII as UTF-8
return "utf-8"
else:
raise ValueError(f"Unknown encoding: {p.stdout}")
Collaborator:

It would be more efficient to use a python interface to libmagic like python-magic rather than calling out to the file command. Either way, there should be an option to specify only the return value we care about so we don't have to do string matching.
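A rough sketch of what the python-magic version could look like; note that libmagic reports encoding names such as "iso-8859-1" rather than the prose strings the file command prints, so the exact branch values below are assumptions that mirror the existing code:

```python
import magic  # python-magic, a wrapper around libmagic (libmagic must be installed)

_detector = magic.Magic(mime_encoding=True)

def detect_encoding(text_file: str) -> str:
    enc = _detector.from_file(text_file)  # e.g. "us-ascii", "utf-8", "iso-8859-1"
    if enc in ("us-ascii", "utf-8"):
        # Treat ASCII as UTF-8
        return "utf-8"
    if enc.startswith("iso-8859"):
        # Assume ISO-8859-like texts are Latin-1
        return "latin1"
    if enc == "unknown-8bit":
        # Assume this is Windows-1252
        return "cp1252"
    raise ValueError(f"Unknown encoding: {enc}")
```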

Collaborator:

Actually, if we have any thought of running the script more than once we should make this a one-time step (opt in via command-line) and update the files to fix the encoding so we don't have to check on future runs.

@laurejt (Contributor), Feb 18, 2025:

I think we should just axe all of this, since all of the files can be opened as latin-1. I had assumed this wasn't the case because of the added logic. But file returns either ASCII or ISO-8859 text for all of the existing Chadwyck-Healey .tml files

if not element:
return ""

# convert italic spans (e.g., <span class="italic">word</span> => word)
Collaborator:

It seems like it would be cleaner and easier to use beautifulsoup's .strings instead of .text, but not sure we want to change that now.
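For illustration, the difference being pointed at, using .stripped_strings (a whitespace-trimming variant of .strings); this is a standalone example, not code from the PR:

```python
from bs4 import BeautifulSoup

html = '<div class="stanza">Line one<br/><span class="italic">Line two</span></div>'
div = BeautifulSoup(html, "html.parser").div

# .text flattens every text node into one string, so line boundaries are lost
print(div.text)                    # "Line oneLine two"

# .stripped_strings yields each text node separately, keeping the structure
print(list(div.stripped_strings))  # ["Line one", "Line two"]
```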

Comment on lines +854 to +859
if anon_author:
metadata["author_lastname"] = "Anon."
metadata["author_firstname"] = anon_author.get("firstname", "")
metadata["author_birth"] = anon_author.get("birth", "")
metadata["author_death"] = anon_author.get("death", "")
metadata["author_period"] = anon_author.get("period", "")
Collaborator:

It doesn't seem like this actually needs to be different logic than orig author
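A sketch of what unifying the two branches might look like; the helper name is hypothetical and the field names just mirror the snippet above, so the real attribute access in the parser may differ:

```python
def author_metadata(author_tag, lastname_default=""):
    # Hypothetical shared helper: the same extraction for named and anonymous authors,
    # with only the lastname fallback differing between the two cases.
    return {
        "author_lastname": author_tag.get("lastname") or lastname_default,
        "author_firstname": author_tag.get("firstname", ""),
        "author_birth": author_tag.get("birth", ""),
        "author_death": author_tag.get("death", ""),
        "author_period": author_tag.get("period", ""),
    }

# anonymous author:  metadata.update(author_metadata(anon_author, lastname_default="Anon."))
# original author:   metadata.update(author_metadata(orig_author))
```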

@rlskoeser (Collaborator):

> @rlskoeser Note that the parsing is quite ad hoc and is likely to miss text that has a different structure. For example, it looks like speaker div tags are ignored, and text that for one reason or another falls outside of tags can be lost, as is the case with Z300491786.tml.

Coming back to the main page and saw this comment again. @laurejt how concerned are you about the parsing, and how much do you think we should invest in improving it? Would it make sense to do this as a second pass after this initial script is merged in?

laurejt (Contributor) commented Feb 18, 2025:

> Coming back to the main page and saw this comment again. @laurejt how concerned are you about the parsing, and how much do you think we should invest in improving it? Would it make sense to do this as a second pass after this initial script is merged in?

I think we just need to deal with this issue later. We need to determine whether we have the bandwidth to triage how badly the parsing is failing and what, if anything, we need to correct. It's just not possible in this step, since there's no infrastructure for testing what is and is not being parsed correctly.

* Add option to check file encodings (default to latin-1)
* Add option to extract metadata only
* Switch to use single parser for extraction (LXML)
@laurejt laurejt requested a review from rlskoeser February 18, 2025 21:08
@rlskoeser (Collaborator) left a comment:

Revisions look reasonable, let's get it merged!

@laurejt laurejt merged commit 5124fe5 into develop Feb 18, 2025
5 of 7 checks passed
@laurejt laurejt deleted the feature/chadwyck-healey branch February 18, 2025 22:21