Future plans for this project and in-browser OCR #87
Comments
Is this engine going to be line-based? If so, do you think a hybrid approach, where your improved ONNX-based layout analysis is paired with Tesseract's robust line-recognition engine, could be feasible?
It will be possible to use the new engine only for detection and then feed the outputs into Tesseract or something else for recognition, although using one engine for the whole pipeline is going to be simpler.
An early preview of the new engine has been published as a Rust crate. A WebAssembly version with an API similar to this repository's will follow in future. After installing Rust and its package manager Cargo, you can install the tool with:
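A hedged sketch of the install step, assuming the CLI is published on crates.io under the crate name `ocrs-cli` (the name used in the linked ocrs repository):

```shell
# Assumption: the CLI lives in the `ocrs-cli` crate on crates.io.
cargo install ocrs-cli
```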
And use it like this:
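A hedged usage sketch, assuming the crate installs a binary named `ocrs` that takes an image path and prints the recognized text to stdout:

```shell
# Assumption: the installed binary is named `ocrs` and accepts an image path.
ocrs image.png
```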
How well this works compared to Tesseract depends a lot on the image. It is very much in an early preview state and not yet ready to replace this library. However, if Tesseract is failing to detect text in your images, it might already be more useful. Development is happening in https://github.com/robertknight/ocrs.
The Tesseract library has provided a lot of value to many projects since it was open sourced. I intend to keep this project updated with new releases of the engine, but significant improvements to in-browser OCR will benefit from a new foundation.
About a year ago I started working on a new OCR engine for use in the browser, other WebAssembly environments and native environments.
The main initial user-facing improvement I'm aiming for is much more robust text detection that doesn't need the image cleanup and pre-processing that Tesseract requires. This will enable it to "just work" with photos, noisy images and the many other kinds of input where Tesseract simply doesn't "see" the text correctly. As an example, here are the results of applying Tesseract to a noisy input document compared to the new engine:
Tesseract: (result image)

New engine: (result image)
The improvements come from using machine learning for text detection, instead of the hand-coded binarization and other pre-processing steps that Tesseract uses. In future I also plan to extend the machine-learning-oriented approach to layout analysis, to produce more reliable reading-order determination across a variety of document layouts.
For developers, the aim is to build this on a modern stack which is much more amenable to ongoing maintenance, re-training and so on: PyTorch for training, ONNX for models, and Rust for the inference engine (which is also capable of running many other ONNX models). Training data will all come from openly licensed and unrestricted sources, such as HierText.
The current status, as of early July 2023, is that I have a working CLI tool that produced the results above. Native performance is reasonable, but significant optimization is still required for the WebAssembly build (which lacks access to CPU features such as FMA and AVX, and may be restricted to a single thread) to be on par with tesseract-wasm. Depending on the image, character recognition can be decent or can lag behind Tesseract by quite a bit. Text detection is much better though, which means this can already "see" a lot of text that Tesseract misses.
I hope to release an initial version of the CLI tool and WebAssembly library in the next few months.