## Overview

Scribe.js was originally built around the use case of processing documents and books. For example, one of the original use cases was converting a scan of a book into a native-text document using scribeocr.com.

Because development has focused on processing large documents with dozens of pages, simple single-page documents currently take longer to process than they need to. Some examples are below.
## Simple Benchmark
Below are basic runtime measurements for the `scribe.recognize` function used with 3 different documents from the test corpus. All units are in milliseconds. The source images are pasted at the bottom of this issue.
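For context, here is a minimal sketch of how such a timing could be captured. The `scribe.js-ocr` module specifier and the exact argument form passed to `scribe.recognize` are assumptions based on the published package, not something verified in this issue:

```js
// Minimal timing sketch. Assumptions: the 'scribe.js-ocr' module specifier
// and passing an array of image paths directly to scribe.recognize.
import scribe from 'scribe.js-ocr';

const start = performance.now();
await scribe.recognize(['example1.png']); // recognize a single test image
console.log(`Total runtime: ${Math.round(performance.now() - start)} ms`);

await scribe.terminate(); // shut down worker threads so the process can exit
```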
### Example 1: Trivial Single-Word Image
- Total runtime: 342
- Recognition: 98
- Font optimization: 238
- OCR Comparison: 2

This includes creating both `Tesseract Combined Temp` and `Tesseract Combined`. No differences were actually compared, as everything matched.
### Example 2: Simple Full-Page Layout, Shorter Recognition
- Total runtime: 3536
- Recognition: 2583
- Font optimization: 859
- OCR Comparison: 84
### Example 3: Complex Full-Page Layout, Longer Recognition
- Total runtime: 10333
- Recognition: 8797
- Font optimization: 1076
- OCR Comparison: 441
## Possible Changes
As the timings above demonstrate, runtime is attributable to different steps depending on the image. For the simplest image, the vast majority of runtime comes from font optimization, while for the most complex document, the vast majority comes from recognition.
### Change 1: Skip Font Optimization Entirely for Small Inputs
In the case of Example 1 above, there is no need to do anything after recognition (at least for `.txt` outputs), as there were no mismatches to judge between. While this is admittedly a special case, a broadly applicable change would be to skip the font optimization step (although not font detection) for small inputs. Font optimization relies on having a certain number of data points; data containing only a single word or sentence is not sufficient for generating custom fonts, even putting aside runtime considerations.
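A minimal sketch of such a guard, assuming a page structure with `lines` and `words` arrays; `shouldOptimizeFonts`, `optimizeFonts`, and the threshold value are all illustrative names rather than actual Scribe.js internals:

```js
// Hypothetical guard: skip font optimization (but not font detection) when
// the recognized text contains too few words to fit custom fonts reliably.
const MIN_WORDS_FOR_OPT = 100; // illustrative threshold, not a tuned value

function shouldOptimizeFonts(pages) {
  let wordCount = 0;
  for (const page of pages) {
    for (const line of page.lines) wordCount += line.words.length;
    if (wordCount >= MIN_WORDS_FOR_OPT) return true; // early exit once enough data exists
  }
  return false;
}

// In the pipeline, after recognition and font detection:
//   if (shouldOptimizeFonts(ocrPages)) await optimizeFonts(ocrPages);
```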
### Change 2: Add Option for Parallelizing Recognition at Job Level
For single-page jobs where runtime is driven by recognition (such as Example 3 above), we can likely produce significant performance gains by adding an option for parallelizing steps within the same job. Note that this would make runtimes for large documents significantly slower, so it should only be an option and/or automatically enabled for single-page jobs. For large, multi-page jobs, the most efficient way to parallelize is at the "coarse-grained" level, where multiple pages are processed in parallel, which is what we do now.
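As a sketch of the fine-grained option, one approach is to recognize independent regions of the page concurrently and stitch the results back together in order. The `recognizeRegion` helper and the strip-splitting scheme below are assumptions for illustration, not existing Scribe.js APIs:

```js
// Hypothetical fine-grained parallelism for a single-page job: split the
// page into horizontal strips and recognize each strip in its own worker.
async function recognizePageParallel(image, workerCount) {
  const stripHeight = Math.ceil(image.height / workerCount);
  const tasks = [];
  for (let i = 0; i < workerCount; i++) {
    const top = i * stripHeight;
    const rect = {
      left: 0,
      top,
      width: image.width,
      height: Math.min(stripHeight, image.height - top),
    };
    // recognizeRegion is assumed to run OCR on one rectangle in one worker.
    tasks.push(recognizeRegion(image, rect));
  }
  const results = await Promise.all(tasks); // strips are recognized concurrently
  return results.flat(); // recombine lines in top-to-bottom order
}
```

Naive strip boundaries can cut through lines of text, so a real implementation would likely need to split on detected block or line boundaries, or run overlapping strips and deduplicate.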
## Example Images

*(The source images for the three examples above were attached here.)*