Merge branch 'develop' of github.com:ramanathanlab/pdfwf into develop
braceal committed Sep 4, 2024
2 parents 38ea242 + a1d94e6 commit bd7160a
Showing 2 changed files with 10 additions and 0 deletions.
7 changes: 7 additions & 0 deletions README.md
@@ -41,6 +41,13 @@ optional arguments:
--config CONFIG Path to workflow configuration file
```

### Choosing a Parser (April 2024)
[`Nougat`](https://arxiv.org/abs/2308.13418), [`Marker`](https://github.com/VikParuchuri/marker), and `Oreo` are designed and trained to parse academic papers from any scientific domain. Since there are no canonical metrics for evaluating parser output quality, it is non-trivial to compare accuracy in a meaningful way. Regardless, we conducted a small experiment on $n=380$ multi-disciplinary paragraphs amounting to $33,000$ words.

Nougat appears to be more accurate, as evidenced by slightly higher overall transcription quality (as measured by BLEU score) and, in particular, by its ability to detect rare domain-specific terms. Oreo, on the other hand, is faster by a factor of $4$ and has slightly lower but comparable accuracy to Nougat (lower BLEU but higher METEOR score). However, Oreo struggles to order paragraphs correctly on challenging document layouts and, unlike Nougat, does not filter out erroneously repeated words. Marker is dominated by Oreo in terms of inference speed and is inferior to Nougat in terms of transcription quality. These results are not domain-specific. As of now, we suggest Nougat if you parse <1M papers.

While PDFs of any kind (e.g., blog articles, corporate documents) can be parsed with any of these frameworks, it is unclear how accurate each of them is on such documents. Since scientific PDFs tend to have a complex layout, parsing other PDFs should provide reasonable output as long as their layout is somewhat comparable to that of a scientific paper.

### Workflow Configuration
The computing platform, virtual environment, parser settings, and other settings are specified via a YAML configuration file.

3 changes: 3 additions & 0 deletions pdfwf/timer.py
@@ -148,6 +148,9 @@ def parse_logs(self, log_path: PathLike) -> list[TimeStats]:
        # Extracted items from all print statements
        for line in lines:
            match = re.findall(regex_pattern, line)
            # If the line doesn't contain the timer information, skip it
            if not match:
                continue
            time_stats.append(
                TimeStats(
                    tags=match[1].split(),
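
The guard added in this hunk skips log lines that carry no timer information before any field is indexed. A minimal, self-contained sketch of the same pattern follows. The log format, the `TIMER_PATTERN` regex, and the `TimerRecord` class are assumptions made for illustration and are not taken from `pdfwf/timer.py`; the sketch also uses `re.search` with named groups rather than `re.findall`, which is a substitution, not the repository's approach.

```python
import re
from dataclasses import dataclass

# Hypothetical log format, assumed for illustration only:
#   [timer] [tag1 tag2] in [1.50] seconds
TIMER_PATTERN = re.compile(
    r"\[timer\] \[(?P<tags>[^\]]*)\] in \[(?P<elapsed>[0-9]+(?:\.[0-9]+)?)\] seconds"
)


@dataclass
class TimerRecord:
    """Hypothetical stand-in for the repository's TimeStats class."""

    tags: list[str]
    elapsed_s: float


def parse_timer_lines(lines: list[str]) -> list[TimerRecord]:
    """Collect timer records, skipping lines without timer information."""
    records = []
    for line in lines:
        match = TIMER_PATTERN.search(line)
        # Skip lines that do not contain timer information
        if match is None:
            continue
        records.append(
            TimerRecord(
                tags=match["tags"].split(),
                elapsed_s=float(match["elapsed"]),
            )
        )
    return records


sample_lines = [
    "INFO starting workflow",
    "[timer] [nougat parse-pdf] in [1.50] seconds",
    "",
    "[timer] [oreo parse-pdf] in [0.40] seconds",
]
records = parse_timer_lines(sample_lines)
```

Without the guard, an unrelated or empty log line would reach the field access with nothing to index and raise an exception, so one stray line could abort parsing of the whole log.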