Feature/chadwyck healey #157

Merged (17 commits into develop, Feb 18, 2025)
Conversation

WHaverals (Collaborator):

#142

I built a new TML-file parser! For multiple reasons:

  1. Most importantly, these files are weird to say the least -- not well-formed XML, nor SGML, nor HTML...
  2. The plaintext version of the corpus (which we received from Mark Algee-Hewitt) contained a lot of weirdness, e.g.:
  • many files still contain what looks like HTML tags
  • other encoding issues
  • lots of copyright notices
  • extremely weird indentation (which causes some files to be well over 10MB, one is even 335MB -- simply because each newline is a tab, see e.g. Z200474224)
  • far too many blank lines
  3. Mark also sent us a metadata file (POETRY_CORPUS.xlsx), which is a mess too, e.g.:
  • all the translators are omitted, and these are sometimes really important! (this is a big issue for me -- e.g. Guy Fawkes translated Theocritus, and this is not mentioned anywhere in the xlsx file...)
  • encoding errors (e.g. A Sash for Wu Yün, “Fifteen stars fence polar fire ...â€, etc.)

So in short, both the plaintext files and the metadata file need to be cleaned up.
I went back to the original TML files and tried to parse them, to the best of my ability, to create a clean txt corpus.

The relevant files in this pull request are:

  • parse_tml.md (a description of the main logic that was followed and implemented; there are many edge cases)
  • entity_map.py (an attempt to map certain exotic entities -- XML codes, HTML codes, etc. -- that need to be resolved; the most important/frequent ones should be covered, and the remainder are part of the really long tail of singletons; a minimal sketch of how such a map gets applied follows the usage notes below)
  • tml_parser.py (the main script!)

Command to run the script:

python tml_parser.py --input_dir "tml" --output_dir "tml_parsed"

Optional arguments:
--num_files 1000 (leave out to process all files)
--verbose (for verbose output)
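For context, resolving these entities boils down to a lookup table plus a regex substitution, with unknown named entities left untouched so the long-tail singletons can still be spotted and mapped later. A minimal sketch of that idea (the ENTITY_MAP subset and the resolve_entities helper are illustrative names, not necessarily what entity_map.py / tml_parser.py actually use):

```python
import re

# Illustrative subset of the entity lookup table; the real map in the PR is much larger.
ENTITY_MAP = {
    "&wblank;": "\u2003",  # EM SPACE
    "&point;": ".",
    "&supere;": "\u1d49",  # MODIFIER LETTER SMALL E
}

# Match numeric character references (&#959; / &#x3BF;) or named entities (&wblank;)
ENTITY_RE = re.compile(r"&#x?[0-9A-Fa-f]+;|&[A-Za-z][A-Za-z0-9.]*;")

def resolve_entities(text: str) -> str:
    """Replace mapped entities, decode numeric references, keep unknown entities verbatim."""
    def _sub(match: re.Match) -> str:
        ent = match.group(0)
        if ent in ENTITY_MAP:
            return ENTITY_MAP[ent]
        if ent.startswith("&#"):
            codepoint = ent[2:-1]
            base = 16 if codepoint[:1] in ("x", "X") else 10
            try:
                return chr(int(codepoint.lstrip("xX"), base))
            except ValueError:
                return ent
        return ent  # unknown named entity: the "really long tail of singletons"
    return ENTITY_RE.sub(_sub, text)
```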

@WHaverals WHaverals requested a review from laurejt February 10, 2025 18:42

codecov bot commented Feb 10, 2025

Codecov Report

Attention: Patch coverage is 19.78610% with 300 lines in your changes missing coverage. Please review.

Project coverage is 74.70%. Comparing base (7b66adc) to head (a93b480).
Report is 48 commits behind head on develop.

❌ Your patch status has failed because the patch coverage (19.78%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project status has failed because the head coverage (74.70%) is below the target coverage (75.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #157      +/-   ##
===========================================
- Coverage    76.04%   74.70%   -1.35%     
===========================================
  Files           23       29       +6     
  Lines         1929     3139    +1210     
===========================================
+ Hits          1467     2345     +878     
- Misses         462      794     +332     

"*": "*",
"&wblank;": "\u2003", # EM SPACE
"&point;": ".",
"&supere;": "ᵉ", # U+1D49
Contributor:

Is it worth preserving superscripts? It's fine if we do, we'll just need to specifically account for and convert them downstream.

@rlskoeser (Collaborator), Feb 18, 2025:

I vote no - but we could make it optional

Contributor:

Since it's in the code right now, I say we don't do anything. My planned way of dealing with this downstream is to use ftfy with NFKC normalization, which will get rid of these kinds of characters (e.g., ™ --> TM and H₂O --> H2O, and so on).
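For reference, a minimal sketch of that downstream step, assuming plain ftfy plus the standard library's unicodedata for the NFKC pass (the function name is illustrative):

```python
import unicodedata
import ftfy

def normalize_downstream(text: str) -> str:
    """Sketch of the proposed downstream cleanup: ftfy repairs mojibake, then NFKC
    folds compatibility characters such as superscripts ('™' -> 'TM', 'H₂O' -> 'H2O')."""
    return unicodedata.normalize("NFKC", ftfy.fix_text(text))
```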

Contributor:

This is really part of the TML Parser, so I'll move the mapping there as a global variable

print(f"Error type: {type(e).__name__}")
print(f"Error message: {str(e)}")
print(traceback.format_exc())
return None, None
Contributor:

This means we don't actually parse all files! What are we missing in these cases?

Comment on lines +520 to +523
if self.figure_only:
print("\nThe following files contained only a figure (no text):")
for f in self.figure_only:
print(" -", f)
Contributor:

self.figure_only is never updated... is this intentional?

Also moved entity map into the parser script
@laurejt laurejt requested a review from rlskoeser February 14, 2025 16:33
laurejt (Contributor) commented Feb 14, 2025:

@rlskoeser Note that the parsing is quite ad hoc and is likely to miss text that has a different structure. For example, it looks like speaker div tags are ignored, and text that for one reason or another falls outside of tags can be lost, as is the case with Z300491786.tml.

@rlskoeser (Collaborator) left a review:

Trying to keep my comments at a fairly high level, based on my understanding of our priorities for this script:

  • switching from subprocess + file to python-magic would improve efficiency and be a light lift (I can contribute if we decide it's worth investing in it); it does require libmagic to be installed; a better solution would be a one-time option to sanitize (and possibly filter) the input files so it doesn't have to be done every time
  • it seems like it would be useful to add an option for a metadata-only mode, to generate the spreadsheet without doing all the text parsing; the way the code is set up, that should be pretty easy to integrate
  • it would be better to use logging to report on parsing status and errors (see the sketch after this list)
  • in future it might be nice to add an option to transform specific ids/filenames, so if we have a problem with a specific text or set of texts we don't have to regenerate everything; we could also add a check to skip text files that already exist in the output directory unless a flag is specified to overwrite/regenerate
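A minimal sketch of the logging suggestion (names are illustrative; the actual wiring into tml_parser.py would differ):

```python
import logging

logger = logging.getLogger("tml_parser")

def configure_logging(verbose: bool) -> None:
    """Map the existing --verbose flag onto log levels instead of print() calls."""
    logging.basicConfig(
        format="%(levelname)s %(name)s: %(message)s",
        level=logging.DEBUG if verbose else logging.INFO,
    )

# e.g. inside the parsing loop:
#   logger.info("parsed %s", text_file)
#   logger.warning("failed to parse %s: %s", text_file, exc)
```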

I was able to run the script locally; I fixed one variable that didn't work when specifying a limit (I think it just got missed in a refactor). I tried to add files to the figure_only list in two different places, but it didn't work -- those files were reported as failed to parse.

The parsing seems to be very complicated and seems to have some redundancy; I think we can get beautifulsoup to do more of the work for us. My approach would be to start writing unit tests for the different cases we know we need to handle and the expected output, and then start revising the code, but I'm not clear on how much of a priority that is right now.
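For what it's worth, a minimal sketch of what such tests might look like (pytest; the TMLParser.extract_text entry point, the TML fragments, and the expected outputs are all hypothetical stand-ins for whatever tml_parser.py actually exposes):

```python
import pytest
from tml_parser import TMLParser  # hypothetical import; the real entry point may differ

# Each case pairs a small TML-like fragment with the plain text we would expect back.
CASES = [
    ("<div1><l>First line</l><l>Second line</l></div1>", "First line\nSecond line"),
    ("<div1><speaker>HAMLET</speaker><l>To be</l></div1>", "HAMLET\nTo be"),
]

@pytest.mark.parametrize("fragment, expected", CASES)
def test_extract_text(fragment, expected):
    assert TMLParser().extract_text(fragment) == expected
```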


import bs4
import ftfy
from build_poem_corpus import get_poem_subdir
Collaborator:

It looks like build_poem_corpus was removed in refactors; is this code not actually used (would expect the import to error)?

Comment on lines +262 to +275
p = run(["file", text_file], capture_output=True, text=True)
if " ISO-8859 text" in p.stdout:
# Assume ISO-8859-like texts are Latin-1 (hopefully it's not macroman)
return "latin1"
elif " Non-ISO extended-ASCII text" in p.stdout:
# Assume this is Windows-1252
return "cp1252"
elif " UTF-8 text" in p.stdout:
return "utf-8"
elif " ASCII text" in p.stdout:
# Treat ASCII as UTF-8
return "utf-8"
else:
raise ValueError(f"Unknown encoding: {p.stdout}")
Collaborator:

It would be more efficient to use a python interface to libmagic like python-magic rather than calling out to the file command. Either way, there should be an option to specify only the return value we care about so we don't have to do string matching.
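A rough sketch of what the python-magic version could look like; note that libmagic reports encoding names such as "iso-8859-1" rather than the prose strings the file command prints, so the exact branch values below are assumptions that mirror the existing code:

```python
import magic  # python-magic, a wrapper around libmagic (libmagic must be installed)

_detector = magic.Magic(mime_encoding=True)

def detect_encoding(text_file: str) -> str:
    enc = _detector.from_file(text_file)  # e.g. "us-ascii", "utf-8", "iso-8859-1"
    if enc in ("us-ascii", "utf-8"):
        # Treat ASCII as UTF-8
        return "utf-8"
    if enc.startswith("iso-8859"):
        # Assume ISO-8859-like texts are Latin-1
        return "latin1"
    if enc == "unknown-8bit":
        # Assume this is Windows-1252
        return "cp1252"
    raise ValueError(f"Unknown encoding: {enc}")
```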

Collaborator:

Actually, if we have any thought of running the script more than once we should make this a one-time step (opt in via command-line) and update the files to fix the encoding so we don't have to check on future runs.

@laurejt (Contributor), Feb 18, 2025:

I think we should just axe all of this, since all of the files can be opened as latin-1. I had assumed this wasn't the case because of the added logic. But file returns either ASCII or ISO-8859 text for all of the existing Chadwyck-Healey .tml files

if not element:
return ""

# convert italic spans (e.g., <span class="italic">word</span> => word)
Collaborator:

It seems like it would be cleaner and easier to use beautifulsoup's .strings instead of .text, but not sure we want to change that now.
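For illustration, the difference being pointed at, using .stripped_strings (a whitespace-trimming variant of .strings); this is a standalone example, not code from the PR:

```python
from bs4 import BeautifulSoup

html = '<div class="stanza">Line one<br/><span class="italic">Line two</span></div>'
div = BeautifulSoup(html, "html.parser").div

# .text flattens every text node into one string, so line boundaries are lost
print(div.text)                    # "Line oneLine two"

# .stripped_strings yields each text node separately, keeping the structure
print(list(div.stripped_strings))  # ["Line one", "Line two"]
```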

Comment on lines +854 to +859
if anon_author:
metadata["author_lastname"] = "Anon."
metadata["author_firstname"] = anon_author.get("firstname", "")
metadata["author_birth"] = anon_author.get("birth", "")
metadata["author_death"] = anon_author.get("death", "")
metadata["author_period"] = anon_author.get("period", "")
Collaborator:

It doesn't seem like this actually needs to be different logic than orig author
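A sketch of what unifying the two branches might look like; the helper name is hypothetical and the field names just mirror the snippet above, so the real attribute access in the parser may differ:

```python
def author_metadata(author_tag, lastname_default=""):
    # Hypothetical shared helper: the same extraction for named and anonymous authors,
    # with only the lastname fallback differing between the two cases.
    return {
        "author_lastname": author_tag.get("lastname") or lastname_default,
        "author_firstname": author_tag.get("firstname", ""),
        "author_birth": author_tag.get("birth", ""),
        "author_death": author_tag.get("death", ""),
        "author_period": author_tag.get("period", ""),
    }

# anonymous author:  metadata.update(author_metadata(anon_author, lastname_default="Anon."))
# original author:   metadata.update(author_metadata(orig_author))
```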

@rlskoeser (Collaborator):

> @rlskoeser Note that the parsing is quite ad hoc and is likely to miss text that has a different structure. For example, it looks like speaker div tags are ignored, and text that for one reason or another falls outside of tags can be lost, as is the case with Z300491786.tml.

Coming back to the main page and saw this comment again. @laurejt how concerned are you about the parsing, and how much do you think we should invest in improving it? Would it make sense to do this as a second pass after this initial script is merged in?

laurejt (Contributor) commented Feb 18, 2025:

> Coming back to the main page and saw this comment again. @laurejt how concerned are you about the parsing, and how much do you think we should invest in improving it? Would it make sense to do this as a second pass after this initial script is merged in?

I think we just need to deal with this issue later. We need to determine whether we have the bandwidth to triage how badly the parsing is failing and what, if anything, we need to correct. It's just not possible in this step, since there's no infrastructure for testing what is and is not being parsed correctly.

* Add option to check file encodings (default to latin-1)
* Add option to extract metadata only
* Switch to use single parser for extraction (LXML)
@laurejt laurejt requested a review from rlskoeser February 18, 2025 21:08
@rlskoeser (Collaborator) left a comment:

Revisions look reasonable, let's get it merged!

@laurejt laurejt merged commit 5124fe5 into develop Feb 18, 2025
5 of 7 checks passed
@laurejt laurejt deleted the feature/chadwyck-healey branch February 18, 2025 22:21