Feature request: Add optional input for alternate image to use when sandwiching OCR data #210
Comments
should be read as: OCR.tif is included in ORIGINAL.pdf WITHOUT any modification. ;-)
zdenop: Yes, that is correct. I want to run OCR on image_b (improved for OCR), but include image_a (original) in the resulting PDF. BTW, I just realized that there is a user forum (https://groups.google.com/forum/#!forum/tesseract-ocr). Maybe somebody has asked or answered my question there. My apologies for not looking at that forum earlier.
Please post an example of a cleaned vs. uncleaned image where accuracy improves significantly. Or even better, point to some documentation that has some examples. In the long term, one would hope that OCR could improve such that having a separate cleaned image is unnecessary.
Regarding this feature request, I think it is probably better to use an outside utility that can replace the images in a Tesseract-produced PDF. The caller is already generating a separate set of clean images, and is therefore comfortable with pre/post-processing. This approach lets us keep the design intent and implementation of PDF generation simple ('don't mess with the images'). I don't know if such a tool exists already, but based on my knowledge of the Tesseract PDF it shouldn't be too hard to write. Apologies, but I am not volunteering to write one unless I need it myself for something. The closest existing thing I know about is OverlayPDF from Apache PDFBox, previously mentioned here: https://www.mail-archive.com/[email protected]/msg11853.html
Also, if you really want to hack Tesseract to do what you are asking for, the code is in api/pdfrenderer.cc. You would have to replace both the pix and the filename. I'd just be reluctant to make this a general feature of Tesseract. https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp#L894
Hi jbreiden, I don't think it would be wise to try to add all that clean-up functionality into tesseract, which is why I'm proposing a solution to take an image that has already been processed externally. The exact intent is to "not mess with the images."
An external tool that could replace the image layer would certainly be good, but I haven't found any (suggestions welcome!). I tried to use hocr2pdf to take tesseract's .hocr data from my cleaned image and add it to a PDF with my original image, but ran into a showstopper issue: when searching for a word in the document, the PDF viewers I tried would highlight the wrong part of the document. Maybe there is a bug in hocr2pdf or I am using it incorrectly. Tesseract already knows how to make a PDF, so this reduces the possibility of an external program interpreting the PDF or hOCR specs differently and ruining the output.
Anyway, thanks for the pointer to api/pdfrenderer.cc! I might just add the feature locally to address my needs, or maybe try to start a new program based off of it as you suggest. I don't mind closing this issue if others feel this feature is inappropriate or a better solution is made.
Maybe this tool could help.
Swapping images in a Tesseract PDF is pretty easy for a programmer if the destination images are JPEG or JPEG 2000. It is really just a matter of cutting and pasting the data, then cleaning up the results with qpdf. The hardest part is getting the courage to open up a PDF file and look inside it. Regarding hOCR and bounding boxes, make sure you have image resolution metadata set correctly everywhere. The hocr-pdf program mentioned above works okay, but is limited to Latin character sets and will also struggle with ligatures in English.
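To illustrate why the resolution metadata matters for bounding boxes, here is a minimal sketch (not taken from Tesseract or hocr-pdf) of the usual mapping from an hOCR bbox, given in pixels with the origin at the top left, to PDF user space, given in points (1/72 inch) with the origin at the bottom left. The function name and the sample numbers are made up for illustration. If the DPI assumed for the OCR image differs from the DPI used to place the page image, the same pixel box lands in a different spot on the page, which would be consistent with a viewer highlighting the wrong region.

```python
from typing import Tuple

POINTS_PER_INCH = 72.0  # PDF user space unit is 1/72 inch by default

BBox = Tuple[float, float, float, float]


def bbox_px_to_pdf_pt(bbox: BBox, image_height_px: float, dpi: float) -> BBox:
    """Convert an hOCR bbox (x0, y0, x1, y1 in pixels, origin top-left)
    to PDF points (x0, y0, x1, y1, origin bottom-left).

    Hypothetical helper for illustration only.
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")

    def to_pt(px: float) -> float:
        return px * POINTS_PER_INCH / dpi

    x0, y0, x1, y1 = bbox
    page_height_pt = to_pt(image_height_px)
    # Flip the vertical axis: the bottom edge in pixels (y1) becomes the
    # lower y coordinate in PDF space.
    return (to_pt(x0), page_height_pt - to_pt(y1),
            to_pt(x1), page_height_pt - to_pt(y0))


if __name__ == "__main__":
    word_box = (300, 600, 900, 660)  # made-up word position in pixels
    # Same pixel box, two different DPI assumptions for a 3300 px tall scan.
    print(bbox_px_to_pdf_pt(word_box, 3300, 300))
    print(bbox_px_to_pdf_pt(word_box, 3300, 150))
```

With a 300 DPI scan that is 3300 pixels tall, the page is 11 inches (792 points) high; reading the same file as 150 DPI doubles every coordinate, so the text layer and the visible image no longer line up.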
Fix issue tesseract-ocr#210. This adds an optional command-line argument to set the image which will be used when generating a PDF image. This addresses a niche case where the user wishes to use an optimized image for OCR but maintain the visual appearance of the original image when generating a PDF. Signed-off-by: David Hendricks <[email protected]>
Heh, yeah, I opened up a PDF and saw the stream content for the image and was thinking of how to replace it, but it seemed like there'd be a non-trivial amount of work to get right for a PDF n00b such as myself. As far as I could tell the image resolution metadata was correct, or at least consistent. Unfortunately I couldn't get hocr-pdf working due to some Python module dependency that I couldn't find (I installed PyXML but no dice). In the end I just went ahead and hacked my feature into Tesseract and it works pretty well :-) Here it is if you're interested, though be warned that it's a bit of a kludge in its current state: dhendrix@6cc206f Thanks for the helpful pointers! Feel free to close if this feature is not desired for upstream; it can live on in my GitHub account.
The Python dependency is reportlab.
Add pdf renderer tests. Install pdf font in cmake tool chain. resolves tesseract-ocr#210 resolves tesseract-ocr#3798
…ammatically. Support new rendering_dpi api params. Add pdf renderer tests. Install pdf font in cmake tool chain. resolves tesseract-ocr#210 resolves tesseract-ocr#3798
Hi,
I recently started using tesseract to help unclutter my desk at home, so forgive me if this is a n00b question/request.
I use textcleaner from Fred's ImageMagick Scripts to clean up my scanned images for better OCR accuracy. However, the images that are optimized for OCR do not necessarily look good from a human standpoint, and I would like the final OCR'd PDF to look visually identical to the original scan.
So here's my feature request: Add an optional argument to take a cleaned image. Example invocation: tesseract -l eng -psm 4 --cleaned-image ${SRC}_cleaned.pnm ${SRC}.pnm out pdf
It will use ${SRC}.pnm to generate the final PDF image but layout detection, character recognition, etc. will be done using the --cleaned-image argument for better accuracy. That way the user will be given a final PDF that looks like the original but searches as well as the cleaned-up image.
I'd be surprised if nobody has already thought of this, so maybe work is already underway or maybe it's not possible. Thoughts?
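For readers after the same result without a new Tesseract option, one possible external workflow can be sketched as follows. This is not the requested `--cleaned-image` feature and rests on several assumptions: that the installed Tesseract supports the `textonly_pdf` config variable (it appeared in releases newer than this thread), that the external tools `img2pdf` and `qpdf` (with its `--overlay` option) are installed, and that the cleaned and original images have identical pixel dimensions and resolution so the two PDF pages have the same size. All file names and helper names below are hypothetical. The sketch only builds and prints the command lines; it does not run them.

```python
import shlex
from typing import List


def ocr_text_layer_cmd(cleaned_image: str, out_base: str, lang: str = "eng") -> List[str]:
    """Tesseract call producing <out_base>.pdf with only the invisible text
    layer, recognized from the cleaned image (assumes textonly_pdf exists)."""
    return ["tesseract", cleaned_image, out_base,
            "-l", lang, "-c", "textonly_pdf=1", "pdf"]


def image_to_pdf_cmd(original_image: str, out_pdf: str) -> List[str]:
    """img2pdf call wrapping the untouched original scan in a PDF page."""
    return ["img2pdf", original_image, "-o", out_pdf]


def overlay_cmd(image_pdf: str, text_pdf: str, out_pdf: str) -> List[str]:
    """qpdf call laying the text-only page over the original image page."""
    return ["qpdf", image_pdf, "--overlay", text_pdf, "--", out_pdf]


def sandwich_cmds(original_image: str, cleaned_image: str, out_pdf: str) -> List[List[str]]:
    """The three steps in order; intermediate file names are placeholders."""
    return [
        ocr_text_layer_cmd(cleaned_image, "text_layer"),
        image_to_pdf_cmd(original_image, "original_page.pdf"),
        overlay_cmd("original_page.pdf", "text_layer.pdf", out_pdf),
    ]


if __name__ == "__main__":
    for cmd in sandwich_cmds("scan.png", "scan_cleaned.png", "out.pdf"):
        print(shlex.join(cmd))
```

To actually execute the steps, each list could be passed to `subprocess.run(cmd, check=True)`. If the page sizes of the two intermediate PDFs differ, search highlights may be misplaced, so it is worth checking the resolution metadata of both images first.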