Merge branch 'develop' of github.com:ramanathanlab/pdfwf into develop

ramanathanlab · Sep 4, 2024 · bd7160a
2 parents 38ea242 + a1d94e6

Showing 2 changed files with 10 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -41,6 +41,13 @@ optional arguments:
   --config CONFIG  Path to workflow configuration file
 ```
 
+### Choosing a Parser (April 2024)
+[`Nougat`](https://arxiv.org/abs/2308.13418), [`Marker`](https://github.com/VikParuchuri/marker), and `Oreo` are designed and trained to parse academic papers from any scientific domain. Since there are no canonical metrics for evaluating parser output quality, it is non-trivial to compare accuracy in a meaningful way. Regardless, we conducted a small experiment on $n=380$ multi-disciplinary paragraphs amounting to $33,000$ words.
+
+Nougat appears to be more accurate, as evidenced by slightly higher transcription quality overall (as measured by BLEU score) and, in particular, by its ability to detect rare domain-specific terms. Oreo, on the other hand, is faster by a factor of $4$ and has slightly lower accuracy that is still comparable to Nougat's (lower BLEU but higher METEOR score). However, Oreo struggles to order paragraphs properly on challenging document layouts and does not filter out erroneously repeated words as Nougat does. Marker is dominated by Oreo in terms of inference speed and is inferior to Nougat in terms of transcription quality. These results are not domain-specific. As of now, we suggest Nougat if you parse fewer than 1M papers.
+
+While PDFs of any kind (e.g., blog articles, corporate documents) can be parsed with any of these frameworks, it is unclear how accurate each of them is on such documents. Since scientific PDFs tend to have a complex layout, parsing other PDFs should provide reasonable output as long as their layout is somewhat comparable to that of a scientific paper.
+
 ### Workflow Configuration
 The computing platform, virtual environment, parser settings, and other settings are specified via a YAML configuration file.
 

diff --git a/pdfwf/timer.py b/pdfwf/timer.py
@@ -148,6 +148,9 @@ def parse_logs(self, log_path: PathLike) -> list[TimeStats]:
         # Extracted items from all print statements
         for line in lines:
             match = re.findall(regex_pattern, line)
+            # If the line doesn't contain the timer information, skip it
+            if not match:
+                continue
             time_stats.append(
                 TimeStats(
                     tags=match[1].split(),
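
The hunk above is truncated, so the following is only a minimal standalone sketch of the same guard pattern: skip any log line that carries no timer information before touching the match. The log format, `TIMER_PATTERN`, `TimerRecord`, and `parse_timer_lines` are hypothetical names made up for illustration and are not taken from `pdfwf`; the sketch also uses `re.search` with named groups rather than the `re.findall` call shown in the diff.

```python
import re
from dataclasses import dataclass

# Hypothetical log format, assumed for illustration only:
#   [timer] <space-separated tags> in <seconds> seconds
TIMER_PATTERN = re.compile(
    r"\[timer\] (?P<tags>.+?) in (?P<elapsed>\d+(?:\.\d+)?) seconds"
)


@dataclass
class TimerRecord:
    """One parsed timer line (hypothetical stand-in for a stats record)."""

    tags: list[str]
    elapsed_s: float


def parse_timer_lines(lines: list[str]) -> list[TimerRecord]:
    """Collect timer records, ignoring lines without timer information."""
    records = []
    for line in lines:
        match = TIMER_PATTERN.search(line)
        # If the line doesn't contain the timer information, skip it
        if match is None:
            continue
        records.append(
            TimerRecord(
                tags=match.group("tags").split(),
                elapsed_s=float(match.group("elapsed")),
            )
        )
    return records


if __name__ == "__main__":
    sample = [
        "INFO starting workflow",
        "[timer] parse-pdf nougat in 12.5 seconds",
        "",
    ]
    for record in parse_timer_lines(sample):
        print(record.tags, record.elapsed_s)
```

Without the guard, an unrelated or empty log line would reach the indexing step and raise instead of being ignored, which appears to be the failure this commit addresses.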