Feature/chadwyck healey #157
Conversation
Codecov Report
Attention: Patch coverage is 19.78%.
❌ Your patch status has failed because the patch coverage (19.78%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files:
@@ Coverage Diff @@
## develop #157 +/- ##
===========================================
- Coverage 76.04% 74.70% -1.35%
===========================================
Files 23 29 +6
Lines 1929 3139 +1210
===========================================
+ Hits 1467 2345 +878
- Misses 462 794 +332
"*": "*", | ||
"&wblank;": "\u2003", # EM SPACE | ||
"&point;": ".", | ||
"&supere;": "ᵉ", # U+1D49 |
Is it worth preserving superscripts? It's fine if we do, we'll just need to specifically account for and convert them downstream.
I vote no - but we could make it optional
Since it's in the code right now I say we don't do anything. My planned way of dealing with this downstream is to use ftfy with NFKC normalization, which will get rid of these kinds of characters (e.g., ™ --> TM, H₂O --> H2O, and so on).
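A minimal sketch of that downstream step (assuming ftfy is installed; the exact call site is not part of this PR):

```python
import ftfy

# NFKC normalization folds compatibility characters into plain equivalents
print(ftfy.fix_text("™ and H₂O and xᵉ", normalization="NFKC"))
# -> "TM and H2O and xe"
```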
src/corppa/poetry_detection/chadwyck_healey/build_poem_corpus.py
This is really part of the TML Parser, so I'll move the mapping there as a global variable
print(f"Error type: {type(e).__name__}") | ||
print(f"Error message: {str(e)}") | ||
print(traceback.format_exc()) | ||
return None, None |
This means we don't actually parse all files! What are we missing in these cases?
```python
if self.figure_only:
    print("\nThe following files contained only a figure (no text):")
    for f in self.figure_only:
        print(" -", f)
```
self.figure_only is never updated... is this intentional?
Also moved entity map into the parser script
@rlskoeser Note that the parsing is quite ad hoc and is likely to miss text that has a different structure. For example, it looks like speaker div tags are ignored, and text that for one reason or another falls outside of tags can be lost, as is the case with Z300491786.tml
Trying to keep my comments at a fairly high level based on my understanding of our priorities for this script:
- switching from subprocess + file to python-magic would improve efficiency and be a light lift (I can contribute if we decide it's worth investing in it); it does require libmagic to be installed; a better solution would be a one-time option to sanitize (and possibly filter) the input files so it doesn't have to be done every time
- it seems like it would be useful to add options for a metadata-only mode, to generate the spreadsheet without doing all the text parsing; the way the code is set up, that should be pretty easy to integrate
- it would be better to use logging to report on parsing status and errors (see the sketch after this list)
- in future it might be nice to add an option to transform specific ids/filenames, so if we have a problem with a specific text or set of texts we don't have to regenerate everything; we could also add a check to skip text files that already exist in the output directory unless a flag is specified to overwrite/regenerate
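A rough sketch of the logging suggestion (hypothetical; the function and messages are mine, not code from this PR):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("tml_parser")

def report(text_file, err=None):
    # hypothetical replacement for the script's print-based status reporting
    if err is None:
        logger.info("Parsed %s", text_file)
    else:
        logger.error("Failed to parse %s: %s", text_file, err)

report("Z200474224.tml")
report("Z300491786.tml", ValueError("unknown encoding"))
```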
I was able to run the script locally; fixed one variable that didn't work when specifying a limit (I think it just got missed in a refactor). I tried to add files to the figure_only list in two different places, but it didn't work - those files were reported as failed to parse.
The parsing seems to be very complicated and seems to have some redundancy; I think we can get beautifulsoup to do more of the work for us. My approach would be to start writing unit tests for the different cases we know we need to handle and the expected output, and then start revising the code, but I'm not clear on how much of a priority that is right now.
```python
import bs4
import ftfy
from build_poem_corpus import get_poem_subdir
```
It looks like build_poem_corpus was removed in refactors; is this code not actually used (would expect the import to error)?
p = run(["file", text_file], capture_output=True, text=True) | ||
if " ISO-8859 text" in p.stdout: | ||
# Assume ISO-8859-like texts are Latin-1 (hopefully it's not macroman) | ||
return "latin1" | ||
elif " Non-ISO extended-ASCII text" in p.stdout: | ||
# Assume this is Windows-1252 | ||
return "cp1252" | ||
elif " UTF-8 text" in p.stdout: | ||
return "utf-8" | ||
elif " ASCII text" in p.stdout: | ||
# Treat ASCII as UTF-8 | ||
return "utf-8" | ||
else: | ||
raise ValueError(f"Unknown encoding: {p.stdout}") |
It would be more efficient to use a python interface to libmagic like python-magic rather than calling out to the file command. Either way, there should be an option to specify only the return value we care about so we don't have to do string matching.
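A minimal sketch of what that might look like (assuming python-magic and libmagic are installed; the function name and the exact libmagic encoding labels are assumptions on my part):

```python
import magic

def detect_encoding(text_file):
    # ask libmagic for just the encoding instead of matching on a description
    detected = magic.Magic(mime_encoding=True).from_file(text_file)
    mapping = {
        "iso-8859-1": "latin1",    # assume ISO-8859-like texts are Latin-1
        "unknown-8bit": "cp1252",  # rough stand-in for "Non-ISO extended-ASCII"
        "us-ascii": "utf-8",       # ASCII is a subset of UTF-8
        "utf-8": "utf-8",
    }
    try:
        return mapping[detected]
    except KeyError:
        raise ValueError(f"Unknown encoding: {detected}")
```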
Actually, if we have any thought of running the script more than once, we should make this a one-time step (opt-in via command line) and update the files to fix the encoding so we don't have to check on future runs.
I think we should just axe all of this, since all of the files can be opened as latin-1. I had assumed this wasn't the case because of the added logic. But file returns either ASCII or ISO-8859 text for all of the existing Chadwyck-Healey .tml files.
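Background on why that is safe (my note, with a hypothetical filename): latin-1 assigns a character to every one of the 256 byte values, so decoding can never fail.

```python
# latin-1 decodes any byte sequence without raising UnicodeDecodeError;
# it can still be semantically wrong (e.g., cp1252 curly quotes come out
# as C1 control characters) but it never errors out
with open("example.tml", "rb") as f:  # hypothetical file
    text = f.read().decode("latin-1")
```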
```python
if not element:
    return ""

# convert italic spans (e.g., <span class="italic">word</span> => word)
```
It seems like it would be cleaner and easier to use BeautifulSoup's .strings instead of .text, but not sure we want to change that now.
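A quick illustration of the difference (my sketch; the markup is made up):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<l>A line with an <span class="italic">italic</span> word</l>',
    "html.parser",
)

# .get_text() flattens everything into one string
print(soup.get_text())      # "A line with an italic word"

# .strings yields each text node separately, preserving structure
print(list(soup.strings))   # ["A line with an ", "italic", " word"]

# unwrap() drops a span tag while keeping its text in place
for span in soup.find_all("span", class_="italic"):
    span.unwrap()
```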
```python
if anon_author:
    metadata["author_lastname"] = "Anon."
    metadata["author_firstname"] = anon_author.get("firstname", "")
    metadata["author_birth"] = anon_author.get("birth", "")
    metadata["author_death"] = anon_author.get("death", "")
    metadata["author_period"] = anon_author.get("period", "")
```
It doesn't seem like this actually needs to be different logic than orig author
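A hypothetical way to share that logic between the two branches (the helper name and the lastname attribute lookup are my assumptions, not code from this PR):

```python
def author_metadata(author_tag, lastname=None):
    # shared extraction for the regular and anonymous author branches;
    # the anonymous branch would pass lastname="Anon."
    if lastname is None:
        lastname = author_tag.get("lastname", "")
    return {
        "author_lastname": lastname,
        "author_firstname": author_tag.get("firstname", ""),
        "author_birth": author_tag.get("birth", ""),
        "author_death": author_tag.get("death", ""),
        "author_period": author_tag.get("period", ""),
    }
```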
Coming back to the main page and saw this comment again. @laurejt how concerned are you about the parsing, and how much do you think we should invest in improving it? Would it make sense to do this as a second pass after this initial script is merged in?
I think we just need to deal with this issue later. We need to determine whether we have the bandwidth to triage how badly the parsing is failing, and what, if anything, we need to correct. It's just not possible in this step, since there's no infrastructure for testing what is / is not being parsed correctly.
* Add option to check file encodings (default to latin-1)
* Add option to extract metadata only
* Switch to use single parser for extraction (LXML)
Revisions look reasonable, let's get it merged!
Co-authored-by: Rebecca Sutton Koeser <[email protected]>
#142
I built a new TML-file parser! For multiple reasons:
- the plaintext version of the corpus (we received from Mark Algee-Hewitt) contained a lot of weirdness (e.g., Z200474224)
- the metadata file (POETRY_CORPUS.xlsx) is a mess too (e.g., A Sash for Wu Yün, “Fifteen stars fence polar fire ...â€ , etc.)

So in short, both the plaintext files and the metadata file need to be cleaned up.
I went back to the original TML files and tried to parse them - to the best of my ability - to create a clean txt corpus.
The relevant files in this pull request are:
- parse_tml.md (a description of the main logic that was followed and implemented; there are many edge cases)
- entity_map.py (an attempt to map certain exotic entities - XML codes, HTML codes, etc. - that need to be resolved; the most important/most frequent ones should be resolved -- the ones remaining are part of the really long tail of singletons)
- tml_parser.py (the main script!)

Command to run the script:
python tml_parser.py --input_dir "tml" --output_dir "tml_parsed"
Optional args:
- --num_files 1000 (leave out to process all files)
- --verbose (for verbose output)