OCR wiki
OCR, short for Optical Character Recognition, is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo, or subtitle text superimposed on an image. OCR has a long history, dating back to 1913, when Dr. Edmund Fournier d'Albe invented the Optophone to scan text and convert it into sound for visually impaired people. Much later, Google took over Tesseract, an open-source OCR project originally from HP, and developed, maintained, and updated it through the 4.0 major release in 2018.
Our current implementation uses Tesseract, through the Pytesseract library, to run OCR on cropped images. We use an open-source image cropper, Cropper.js (https://github.com/fengyuanchen/cropperjs), to let users crop the question out of their PDF. We then run OCR on the cropped image and pass the resulting text on to be added to the database.
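As a rough illustration of the OCR step described above, here is a minimal sketch that assumes the cropped question has already been saved as an image file. The function names `clean_text` and `ocr_cropped_image` are hypothetical and not taken from our codebase. The only library calls used are `PIL.Image.open` and `pytesseract.image_to_string`.

```python
def clean_text(raw: str) -> str:
    """Drop form feeds, surrounding whitespace, and blank lines from OCR output."""
    lines = [line.strip() for line in raw.replace("\f", "").splitlines()]
    return "\n".join(line for line in lines if line)


def ocr_cropped_image(image_path: str) -> str:
    """Run Tesseract on an already-cropped image and return tidied text.

    Hypothetical helper: the real pipeline may receive the image differently.
    """
    # Imported lazily so clean_text stays usable without Pillow/Tesseract installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return clean_text(pytesseract.image_to_string(image))
```

Tidying the text before it goes to the database is a design choice in this sketch, not something the current implementation is known to do. Raw Tesseract output often contains stray blank lines that are easier to remove at this point than later.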
We started by researching possible OCR solutions. We tried Keras-OCR, OCRmyPDF, EasyOCR, and Calamari-OCR. Each had its own issues, but the main one was that not all of these projects are still maintained or have easy-to-follow documentation. Some of them are also built on top of Tesseract, so we decided to use Tesseract directly. There may be value in exploring the others further if we want to optimize our OCR functionality in the future.
Some of our early issues were simply about getting the right dependencies installed. You need both Tesseract (required by Pytesseract) and Poppler (required by pdf2image) on your system. Now that we have a Docker container, this shouldn't be an issue. Another problem was converting PDFs to images. Tesseract only accepts image files such as ".png", ".jpg", and ".tiff", plus a few less common formats, so it cannot read a PDF directly. The first thing we tried was the pdf2image Python library, which converts the pages of a PDF to images. Since then, the only notable issue has been the limitations of the image cropper.
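The PDF-to-image step can be sketched roughly as follows. It uses `convert_from_path` from pdf2image, which returns one Pillow image per page. The helper names, the output naming scheme, and the 300 DPI default are assumptions made for illustration, not settings from our code.

```python
from pathlib import Path
from typing import List


def page_image_path(pdf_path: str, page_number: int, out_dir: str) -> Path:
    """Build the output path for one page, e.g. out/exam_page_003.png."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_page_{page_number:03d}.png"


def pdf_to_pngs(pdf_path: str, out_dir: str, dpi: int = 300) -> List[Path]:
    """Render every page of pdf_path to a PNG in out_dir and return the paths.

    dpi=300 is an assumed default for this sketch; tune it for your scans.
    """
    # Imported lazily: pdf2image needs the Poppler binaries on the system.
    from pdf2image import convert_from_path

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    saved = []
    for number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        target = page_image_path(pdf_path, number, out_dir)
        page.save(target, "PNG")  # each page is a Pillow image
        saved.append(target)
    return saved
```

For example, `pdf_to_pngs("uploads/exam.pdf", "out")` would write `out/exam_page_001.png`, `out/exam_page_002.png`, and so on, which can then be shown to the user for cropping. The paths here are placeholders.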
One major goal for the future is to either optimize the cropper we are using or develop a new solution. The current cropper doesn't let users change the size of the cropping box, so it can be difficult to fit the whole question in the box. An alternative to a cropper would be to develop a way for users to highlight the text they want and then crop just that text.