Future plans for this project and in-browser OCR #87
Comments
Is this engine going to be line-based? If so, do you think a hybrid approach, where your improved ONNX-based layout analysis is paired with Tesseract's robust line-recognition engine, could be feasible?
It will be possible to use the new engine only for detection and then feed the outputs into Tesseract or something else for recognition, although using one engine for the whole pipeline is going to be simpler.
An early preview of the new engine has been published as a Rust crate. A WebAssembly version with an API similar to this repository's will follow in future. After installing Rust and its package manager Cargo, you can install the tool with:
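A hedged sketch of the install step, assuming the CLI is published on crates.io under the crate name `ocrs-cli` (the name used in the linked ocrs repository):

```shell
# Assumption: the CLI lives in the `ocrs-cli` crate on crates.io.
cargo install ocrs-cli
```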
And use it like this:
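A hedged usage sketch, assuming the crate installs a binary named `ocrs` that takes an image path and prints the recognized text to stdout:

```shell
# Assumption: the installed binary is named `ocrs` and accepts an image path.
ocrs image.png
```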
How well this works compared to Tesseract depends a lot on the image. It is very much in an early preview state and not yet ready to replace this library. However, if Tesseract is failing to detect text in your images, it might already be more useful. Development is happening in https://github.com/robertknight/ocrs.
The Tesseract library has provided a lot of value to many projects since it was open sourced. I intend to keep this project updated with new releases of the engine, but significant improvements to in-browser OCR will benefit from a new foundation.
About a year ago I started working on a new OCR engine for use in the browser, other WebAssembly environments and native environments.
The main initial user-facing improvement I'm aiming for is much more robust text detection that doesn't need the image cleanup and pre-processing that Tesseract requires. This will enable it to "just work" with photos, noisy images and the many other kinds of input where Tesseract simply doesn't "see" the text correctly. As an example, here are the results of applying Tesseract to a noisy input document compared to the new engine:
Tesseract: (result image)

New engine: (result image)
The improvements come from using machine learning for text detection, instead of the hand-coded binarization and other pre-processing steps that Tesseract uses. In future I also plan to extend the machine-learning-oriented approach to layout analysis, to produce more reliable reading-order determination across a variety of document layouts.
For developers, the aim is to build this on a modern stack which is much more amenable to ongoing maintenance, re-training and so on: PyTorch for training, ONNX for models, and Rust for the inference engine (which is also capable of running many other ONNX models). Training data will all come from openly licensed and unrestricted sources, such as HierText.
The current status, as of early July 2023, is that I have a working CLI tool that produced the results above. Native performance is reasonable, but significant optimization is still required for the WebAssembly build (which lacks access to CPU features such as FMA and AVX, and may be restricted to a single thread) to be on par with tesseract-wasm. Depending on the image, character recognition can be decent or can lag behind Tesseract by quite a bit. Text detection is much better though, which means this can already "see" a lot of text that Tesseract misses.
I hope to release an initial version of the CLI tool and WebAssembly library in the next few months.