This repo contains images and code for benchmarking OCR programs. This benchmark was created to evaluate Tesseract.js and Scribe.js; however, it can be used with any OCR program that exports to a supported format (`.hocr` or Abbyy `.xml`).
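For reference, hOCR is an HTML-based format in which pages, lines, and words are marked up with classes and bounding boxes stored in `title` attributes. A simplified fragment is shown below for illustration only; real files include the full HTML document wrapper and additional container elements, and the coordinates and confidence values here are made up.

```html
<!-- Simplified hOCR fragment: one page, one line, two recognized words. -->
<div class="ocr_page" title="bbox 0 0 2480 3508">
  <span class="ocr_line" title="bbox 210 212 980 264">
    <span class="ocrx_word" title="bbox 210 212 340 264; x_wconf 96">Hello</span>
    <span class="ocrx_word" title="bbox 358 212 520 264; x_wconf 93">world</span>
  </span>
</div>
```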
The corpus of images aims to be small yet diverse, allowing for testing a wide variety of real-world use cases with minimal runtime. The goal is to include a handful of every document type (e.g. book, academic paper, magazine), document layout (e.g. single column, multi-column, table), and scanning condition (e.g. screenshot, high-resolution scan, low-resolution scan). This stands in contrast with many other benchmark corpora, which often include a large number of homogeneous documents that essentially test the same case.
A summary of recent benchmark results comparing Tesseract.js and Scribe.js is below. Note that this is an intentionally difficult benchmark intended to compare OCR programs. The average accuracy statistics should not be compared with numbers reported by other benchmarks, and are not indicative of real-world performance with high-quality document scans. Both Tesseract.js and Scribe.js routinely achieve >97% accuracy on high-quality inputs with simple layouts.
file | Tesseract.js Accuracy | Scribe.js Accuracy | Difference |
---|---|---|---|
chart_01.truth.hocr | 86.71% | 88.61% | 1.90% |
chart_02.truth.hocr | 80.41% | 89.69% | 9.28% |
deposition_01.truth.hocr | 92.63% | 99.65% | 7.02% |
deposition_02.truth.hocr | 81.48% | 98.94% | 17.46% |
filing_01.truth.hocr | 99.45% | 98.91% | -0.55% |
filing_03.truth.hocr | 97.68% | 98.84% | 1.16% |
form_01.truth.hocr | 55.32% | 54.41% | -0.91% |
multicol_01.truth.hocr | 91.50% | 90.97% | -0.53% |
multicol_02.truth.hocr | 91.30% | 95.20% | 3.90% |
multicol_03.truth.hocr | 95.00% | 95.26% | 0.26% |
multicol_04.truth.hocr | 94.31% | 93.56% | -0.74% |
multicol_05.truth.hocr | 97.09% | 97.25% | 0.16% |
multicol_06.truth.hocr | 91.48% | 94.21% | 2.73% |
multicol_07.truth.hocr | 95.03% | 96.18% | 1.15% |
multicol_08.truth.hocr | 97.10% | 97.89% | 0.79% |
multicol_11.truth.hocr | 99.23% | 99.54% | 0.31% |
receipt_03.truth.hocr | 98.36% | 96.72% | -1.64% |
singlecol_01.hocr | 97.65% | 100.00% | 2.35% |
slide_01.truth.hocr | 95.51% | 96.15% | 0.64% |
table_01.truth.hocr | 86.73% | 87.17% | 0.44% |
table_02.truth.hocr | 92.13% | 100.00% | 7.87% |
table_04.truth.hocr | 51.11% | 72.96% | 21.85% |
table_05.truth.hocr | 20.51% | 91.21% | 70.70% |
table_07.truth.hocr | 58.65% | 91.56% | 32.91% |
table_08.truth.hocr | 97.87% | 99.39% | 1.52% |
table_09.truth.hocr | 96.90% | 96.21% | -0.69% |
table_10.truth.hocr | 84.47% | 97.41% | 12.94% |
table_11.truth.hocr | 66.39% | 98.88% | 32.49% |
table_12.truth.hocr | 65.98% | 98.96% | 32.99% |
Average | 84.76% | 93.65% | 8.89% |
Scripts are provided to run OCR on the test corpus and generate `.hocr` files for Tesseract.js and Scribe.js. Running another OCR program will require writing your own script (a rough sketch is included below). Each set of results should be saved in a subdirectory within the `results` folder. For example, `results/scribejs` contains the `.hocr` files produced by Scribe.js.

    node run_scribejs.js
    node run_tesseractjs.js
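For other OCR programs, a run script generally just needs to loop over the corpus images, run recognition, and write one `.hocr` file per image into a results subdirectory. The sketch below is illustrative only: it uses the Tesseract.js API as a stand-in for whatever engine you are wrapping, and the `img/` corpus directory and `results/myocr` output directory are assumed names, not paths defined by this repo.

```js
// Illustrative sketch of a custom run script (not part of this repo).
// Swap the Tesseract.js calls (v5-style API) for whatever OCR engine you are
// benchmarking, as long as it can export hOCR.
const fs = require('fs');
const path = require('path');
const { createWorker } = require('tesseract.js');

const imgDir = 'img';            // assumed location of the corpus images
const outDir = 'results/myocr';  // each result set gets its own subdirectory

(async () => {
  fs.mkdirSync(outDir, { recursive: true });
  const worker = await createWorker('eng');
  for (const file of fs.readdirSync(imgDir)) {
    if (!/\.(png|jpe?g)$/i.test(file)) continue;
    // Request hOCR output from the engine.
    const { data } = await worker.recognize(path.join(imgDir, file), {}, { hocr: true });
    const outFile = file.replace(/\.(png|jpe?g)$/i, '.hocr');
    fs.writeFileSync(path.join(outDir, outFile), data.hocr);
  }
  await worker.terminate();
})();
```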
Directories of `.hocr` data are compared to the ground truth versions by running the `run_eval.js` script.

    node run_eval.js results/scribejs
    node run_eval.js results/tesseractjs_lstm
The results are saved as `.csv` files in the `stats` directory. The `util/summarize.js` script can then be used to summarize these statistics.

    node util/summarize.js stats/eval_stats_tesseractjs_lstm.csv stats/eval_stats_scribejs.csv
The corpus may include any printed document that a user may need to digitize using OCR. This includes most document types: books, forms, receipts, newspapers, etc.
The following types of images are outside the scope of this project. In all cases, these document types are excluded because they present specific challenges that are not broadly applicable to general document digitization use cases.
- Handwritten text
  - Most OCR programs do not work with handwritten text, and do not claim to support it.
- CAPTCHAs
  - CAPTCHAs are deliberately designed to be adversarial to common OCR programs.
- Extremely low-quality images
  - While the corpus intentionally includes many scans of medium and low quality, all documents could plausibly be encountered within a document scanning workflow (and most were taken from real document processing tasks).
  - Document images/scans of such poor quality that they would never be accepted by a bank, a court, or in any other official capacity are outside the scope of this corpus.
  - Examples include crumpled-up receipts, extremely blurry/warped images from cell phones, etc.
- Pictures of non-documents
  - License plates, T-shirts, street signs, etc.
  - Traditional OCR programs are not designed for these types of applications, and are not the best approach.
Some number of "errors" reported by the benchmark are actually minor or entirely subjective differences. For example, what ASCII character gets used to represent a non-ASCII bullet point? Is recognition run on words within embedded images (e.g. logos or graphs)?
This benchmark compares OCR on the word level. If the data being benchmarked contains words in the exact same locations with the exact same content as the ground truth, those words are marked as accurate. This check does not consider word order or page layout, which can be relevant when extracting text from OCR data. For example, it is possible for a 2-column layout to be incorrectly identified as a 1-column layout (therefore ordering words incorrectly), while still getting all individual words correct. If we did add checks for layout/word order, the ground truth would need to be re-evaluated, as it has not yet been reviewed for this type of accuracy.
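As a rough illustration of what a word-level check of this kind looks like (this is not the actual `run_eval.js` logic; the data shapes and the 50% overlap threshold are assumptions):

```js
// Hypothetical word-level comparison: a ground truth word counts as matched when
// some recognized word has identical text and a sufficiently overlapping bounding
// box. Word order and page layout are ignored, as described above.
// The { text, bbox: { x0, y0, x1, y1 } } shape and 0.5 threshold are assumptions.
function overlap(a, b) {
  const w = Math.min(a.x1, b.x1) - Math.max(a.x0, b.x0);
  const h = Math.min(a.y1, b.y1) - Math.max(a.y0, b.y0);
  if (w <= 0 || h <= 0) return 0;
  return (w * h) / ((a.x1 - a.x0) * (a.y1 - a.y0));
}

function wordAccuracy(truthWords, ocrWords) {
  let correct = 0;
  for (const t of truthWords) {
    if (ocrWords.some((o) => o.text === t.text && overlap(t.bbox, o.bbox) > 0.5)) {
      correct++;
    }
  }
  return correct / truthWords.length;
}
```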
Note that adding a benchmark that included layout would require a non-trivial amount of work beyond checking the ground truth data and adding a comparison. This is because the vast majority of differences in how lines are ordered/segmented are not due to multiple columns being improperly combined, but rather due to subjective differences between programs. For example, are table cells grouped by column or by row? Are floating elements with the same baseline (e.g. a page header and page number) combined into the same line or do they get different lines? Etc.
If users encounter an image that is not represented in the corpus, they should feel free to add new images as a PR. For those contributing, please follow these guidelines:
- Confirm your contribution is within the scope of this corpus (see the "Scope" section of the readme).
- Submit only `.png` or `.jpeg` files for images.
- Submit images that are distinct from existing images.
  - "Distinct" could mean a different scan quality, different font, different layout, or any other quality that impacts OCR accuracy.
- Include only 1-2 images per relevant test case.
  - Once a single page from a document is included, adding additional pages that are nearly identical (same font, quality, etc.) provides little additional value, but inflates the size of the benchmark.
- Highly Recommended: Include accurate ground truth data.
  - To significantly improve the odds your contribution is merged, contribute an accurate ground truth file alongside your image.
  - A ground truth file is simply a `.hocr` file that contains no errors. This can be generated by using scribeocr.com to run OCR (or upload existing OCR), correcting all errors, and exporting as `.hocr`.
  - All files require ground truth data to be included in the benchmark, so any image contribution without ground truth would require another user to create this data.