2. Optical Character Recognition

Saumya Shah edited this page Aug 14, 2018 · 1 revision

Overview

Consider the image mentioned above and the individual entries cropped from it.

OCR

The raw OCR text generated for the GARRETT Peggy entry is:

GARRETT Peggy.
Effects under £600.

22 August. The Will of Peggy Garrett late of 33 Addison-
road-North Notting Hill in the County of Middlesex Widow
who died 30 July 1873 at 33 Addison-road-North was proved

at the Principal Registry by Robert Henry Hoar of
3 Campden-hill-gardens Notting Hill Tobacconist the sole

Executor.

As you can see, much of this text is raw: it is broken across lines and can be distorted because of the quality of the image. This phase runs raw OCR on each cropped entry and then cleans the output so that the subsequent phase can learn from these text entries.

The cleaned OCR text would be:

GARRETT Peggy. Effects under £600. 22 August. The Will of Peggy Garrett late of 33 Addison-road-North Notting Hill in the County of Middlesex Widow who died 30 July 1873 at 33 Addison-road-North was proved at the Principal Registry by Robert Henry Hoar of 3 Campden-hill-gardens Notting Hill Tobacconist the sole Executor.
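The cleaning step for this example can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function name `clean_ocr_text` is hypothetical, and it assumes that cleaning means dropping blank lines, joining a line that ends in a hyphen directly to the next one (so `Addison-` and `road-North` become `Addison-road-North`), and collapsing whitespace.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Merge the lines of a raw OCR entry into a single line of text.

    Hypothetical helper; the rules below are assumptions inferred from
    the sample entry, not the project's confirmed cleaning logic.
    """
    # Strip each line and discard the empty ones.
    lines = [line.strip() for line in raw.splitlines()]
    lines = [line for line in lines if line]

    cleaned = ""
    for line in lines:
        if cleaned.endswith("-"):
            # Word split across lines at a hyphen: join without a space,
            # keeping the hyphen as in the sample output.
            cleaned += line
        elif cleaned:
            cleaned += " " + line
        else:
            cleaned = line

    # Collapse any remaining runs of whitespace.
    return re.sub(r"\s+", " ", cleaned)


raw_entry = """GARRETT Peggy.
Effects under 600.

22 August. The Will of Peggy Garrett late of 33 Addison-
road-North Notting Hill
"""
print(clean_ocr_text(raw_entry))
```

Applied to the full raw text above, these rules reproduce the cleaned entry. A real pipeline would probably need further rules for noisy characters.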

Issues

  1. OCR output depends on the quality of each image. Because preprocessing does not remove all of the noise from an image, OCR may add a few stray characters and misrecognize others.
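One possible heuristic for the stray characters is sketched below. It is an assumption for illustration only, not something this project is known to use: the hypothetical `drop_stray_tokens` removes whitespace-separated tokens that contain no letter or digit, which are often specks of image noise. It cannot repair misrecognized characters inside real words.

```python
def drop_stray_tokens(text: str) -> str:
    """Remove tokens made only of punctuation or symbols.

    Hypothetical heuristic: a token with no alphanumeric character
    (for example "~" or "|") is treated as OCR noise and dropped.
    """
    kept = [tok for tok in text.split() if any(ch.isalnum() for ch in tok)]
    return " ".join(kept)


print(drop_stray_tokens("GARRETT Peggy. ~ Effects | under 600."))
```

Tokens that mix symbols with letters or digits, such as an amount with a currency sign, are kept because they contain at least one alphanumeric character.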

Implementation

For the implementation code, usage, and sample output, click here.
