Future plans for this project and in-browser OCR #87

Open
robertknight opened this issue Jul 3, 2023 · 3 comments

Comments

@robertknight
Owner

robertknight commented Jul 3, 2023

The Tesseract library has provided a lot of value to many projects since it was open sourced. I intend to keep this project updated with new releases of the engine, but significant improvements to in-browser OCR will benefit from a new foundation.

About a year ago I started working on a new OCR engine for use in the browser, other WebAssembly environments and native environments.

The main initial user-facing improvement I'm aiming for is much more robust text detection that doesn't require the image cleanup and pre-processing that is required for Tesseract. This will enable it to "just work" with photos, noisy images and the many other kinds of input where Tesseract just doesn't "see" the text correctly. As an example, here are the results of applying Tesseract to a noisy input document compared to the new engine:

Tesseract:


New engine:

The improvements come from using machine learning for text detection, instead of the hand-coded binarization and other pre-processing steps that Tesseract uses. In future I also plan to extend the machine-learning-oriented approach to layout analysis, to determine reading order more reliably across a variety of document layouts.

For developers, the aim is to build this on a modern stack which is much more amenable to ongoing maintenance, retraining and so on: PyTorch for training, ONNX for models, and Rust for the inference engine (which is also capable of running many other ONNX models). All training data will come from openly-licensed and unrestricted sources, such as HierText.

The current status, as of early July 2023, is that I have a working CLI tool that produced the results above. Native performance is reasonable, but significant optimization is still required for the WebAssembly build (which lacks access to CPU features such as FMA and AVX, and which may be restricted to a single thread) to be on par with tesseract-wasm. Depending on the image, character recognition performance can be decent, or can lag behind Tesseract by quite a bit. Text detection is much better though, which means this can already "see" a lot of text that Tesseract misses.

I hope to release an initial version of the CLI tool and WebAssembly library in the next few months.

@robertknight robertknight pinned this issue Jul 3, 2023
@jbaiter

jbaiter commented Jul 24, 2023

Is this engine going to be line-based? If so, do you think a hybrid approach, where your improved ONNX-based layout analysis is paired with Tesseract's robust line-recognition engine, could be feasible?

@robertknight
Owner Author

It will be possible to use the new engine only for detection and then feed the outputs into Tesseract or something else for recognition, although using one engine for the whole pipeline is going to be simpler.
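
As a rough illustration of that split, here is a minimal Rust sketch of a pipeline in which detection and recognition are separate, swappable stages. This is not the API of the new engine or of Tesseract: `GrayImage`, `LineBox`, `TextDetector`, `LineRecognizer`, `run_hybrid_ocr` and the two stub implementations are all hypothetical names invented for this example. A real setup would put the new engine behind the detector trait and Tesseract (or something else) behind the recognizer trait.

```rust
/// Hypothetical 8-bit grayscale image; not a type from any real OCR library.
pub struct GrayImage {
    pub width: u32,
    pub height: u32,
    pub pixels: Vec<u8>,
}

/// Hypothetical axis-aligned box around one detected text line, in pixels.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct LineBox {
    pub left: u32,
    pub top: u32,
    pub width: u32,
    pub height: u32,
}

/// Detection stage: finds text lines in an image.
/// Assumption: the new engine could sit behind an interface like this.
pub trait TextDetector {
    fn detect_lines(&self, image: &GrayImage) -> Vec<LineBox>;
}

/// Recognition stage: turns one detected line into text.
/// Assumption: Tesseract or another recognizer could sit behind this.
pub trait LineRecognizer {
    fn recognize_line(&self, image: &GrayImage, line: LineBox) -> String;
}

/// Runs detection with one engine and recognition with another,
/// keeping the lines in the order the detector returned them.
pub fn run_hybrid_ocr<D: TextDetector, R: LineRecognizer>(
    detector: &D,
    recognizer: &R,
    image: &GrayImage,
) -> Vec<(LineBox, String)> {
    detector
        .detect_lines(image)
        .into_iter()
        .map(|line| (line, recognizer.recognize_line(image, line)))
        .collect()
}

/// Stub detector that returns a fixed list of boxes (for demonstration only).
pub struct FixedDetector(pub Vec<LineBox>);

impl TextDetector for FixedDetector {
    fn detect_lines(&self, _image: &GrayImage) -> Vec<LineBox> {
        self.0.clone()
    }
}

/// Stub recognizer that reports where the line is instead of reading it.
pub struct EchoRecognizer;

impl LineRecognizer for EchoRecognizer {
    fn recognize_line(&self, _image: &GrayImage, line: LineBox) -> String {
        format!("line at y={}", line.top)
    }
}
```

Keeping the two stages behind separate traits is what makes the hybrid setup possible, at the cost of having to pass image data and coordinates between two engines rather than letting one engine handle the whole pipeline.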

@robertknight
Owner Author

robertknight commented Jan 1, 2024

An early preview of the new engine has been published as a Rust crate. A WebAssembly version with an API similar to this repository will be coming in future.

After installing Rust and its package manager Cargo, you can install the tool with

cargo install ocrs-cli

And use it like this:

ocrs image.jpeg

How well this works compared to Tesseract depends a lot on the image. I would say it is very much in an early preview state and not yet ready to replace this library. However, if Tesseract is failing to detect text in your images, it might already be more useful.

Development is happening in https://github.com/robertknight/ocrs, although I will probably split this up into two separate repositories in future (this has now been done).
