ProcessPage() generates a corrupt file #271

joseavegaa · 2021-09-07T22:53:42Z

I'm trying to OCR a PIL image and create a searchable PDF from that image. According to the documentation, I should use ProcessPage() to generate the PDF files. However, every file that is created is corrupted or damaged.

The code is as follows:

with tesserocr.PyTessBaseAPI() as api:
    api.SetVariable("tessedit_create_pdf", "true")
    api.SetImage(img)
    api.ProcessPage(outputbase=img_name, image=img, page_index=0, filename=img_name)

The PDF is then created, but it says that the file is corrupted.

I've also tried to use ProcessPages() with an image file, but once again the PDF generated is corrupted.

I found issue #167, but it doesn't explain what page_index should be or what filename should be set to. The documentation isn't clear on when to call ProcessPage(): should I call GetUTF8Text first, or can ProcessPage() be called in any order? What is the correct usage?

Also, if I want to store the OCR result as a string variable and also create a searchable PDF, should I call GetUTF8Text and ProcessPage() separately, which would run the OCR twice, or is there a way to do it without the extra processing?

Thanks for the help.

tesserocr = 2.5.1
tesseract = 4.1.1
python = 3.8.10
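
On the last question (text plus PDF without recognizing twice), one option to try is to ask Tesseract for both outputs in a single ProcessPages() run and read the text file back afterwards. This is a minimal sketch and is not confirmed anywhere in this thread: it assumes tesserocr honors the Tesseract variable tessedit_create_txt alongside tessedit_create_pdf, and the helper name ocr_to_pdf_and_text is made up for illustration. It works from an image file on disk because ProcessPages() takes a filename.

```python
from pathlib import Path


def ocr_to_pdf_and_text(api, image_path, outputbase):
    """Run one ProcessPages() pass that writes <outputbase>.pdf and <outputbase>.txt.

    `api` is expected to be an initialized tesserocr.PyTessBaseAPI.
    Returns (pdf_path, recognized_text).
    """
    # Request both outputs up front so the page is only recognized once.
    # Assumption: tesserocr maps tessedit_create_txt to a text renderer.
    api.SetVariable("tessedit_create_pdf", "true")
    api.SetVariable("tessedit_create_txt", "true")

    if not api.ProcessPages(outputbase, str(image_path)):
        raise RuntimeError(f"ProcessPages failed for {image_path}")

    # Read back the plain-text output written next to the PDF.
    text = Path(outputbase + ".txt").read_text(encoding="utf-8")
    return outputbase + ".pdf", text
```

With tesserocr installed, this would be called inside `with tesserocr.PyTessBaseAPI() as api:` as `pdf_path, text = ocr_to_pdf_and_text(api, "5.png", "test1")`.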


zdenop · 2021-09-11T08:30:32Z

IMO tesserocr's ProcessPage is not finished and needs adaptation. At the moment ProcessPage() just stores the "pdf object" as a PDF document, which is wrong.
It uses the tesseract renderer, but the renderer needs to call renderer->BeginDocument(...) and renderer->EndDocument() to create a valid PDF (or hOCR).
However, ProcessPages works OK, so you should use it to create a searchable PDF (maybe with some workaround, depending on your use case).
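
A minimal sketch of the ProcessPages() route for an image that already exists on disk. The helper name make_searchable_pdf is hypothetical, and the file names in main() are placeholders borrowed from later in this thread; it assumes tesserocr's ProcessPages(outputbase, filename) returns a truthy value on success.

```python
def make_searchable_pdf(api, image_path, outputbase):
    """Write <outputbase>.pdf for the image file at `image_path`.

    `api` is expected to be an initialized tesserocr.PyTessBaseAPI.
    """
    api.SetVariable("tessedit_create_pdf", "true")
    # ProcessPages() handles the document header and footer itself,
    # which is the step a bare ProcessPage() call skips.
    if not api.ProcessPages(outputbase, image_path):
        raise RuntimeError(f"ProcessPages failed for {image_path}")
    return outputbase + ".pdf"


def main():
    # Imported here so the helper above has no hard dependency on tesserocr.
    import tesserocr

    with tesserocr.PyTessBaseAPI() as api:
        print(make_searchable_pdf(api, "5.png", "test1"))
```

Note that outputbase is passed without the .pdf extension; the renderer appends it.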

sirfz · 2021-09-14T12:51:46Z

> IMO tesserocr's ProcessPage is not finished and needs adaptation. At the moment ProcessPage() just stores the "pdf object" as a PDF document, which is wrong.
> It uses the tesseract renderer, but the renderer needs to call renderer->BeginDocument(...) and renderer->EndDocument() to create a valid PDF (or hOCR).

Can you submit a PR to fix this? If not I'll try to do it whenever I have some time

zdenop · 2021-09-14T14:09:05Z

I do not have time either, but IMO there should be some discussion about what the expected goal is (input and output)...

Just adding renderer->BeginDocument(...) and renderer->EndDocument() could be easy. This would cover the case of taking an in-memory object (PIL image) and creating a file. But at this stage it would generate 100 PDF files for one TIFF with 100 pages...

If somebody wants to OCR one multipage TIFF and receive one PDF, then the solution is to use ProcessPages(). This could also be used when somebody wants to OCR a list of image files: the files should first be converted to a multipage TIFF, stored to disk, and processed...

If somebody wants to do everything in memory (PIL -> OCR -> PDF), then more development needs to be done (maybe on the tesseract side too).

Another interesting idea would be to play with the hOCR/ALTO(?) output (e.g. create a PDF with Python, add the input image and the hOCR result, maybe highlight areas with low confidence, etc.).
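
As a rough illustration of the "highlight areas with low confidence" idea, the sketch below pulls word boxes and confidences out of an hOCR string such as the one returned by api.GetHOCRText(0). The function name, the threshold, and the regular expression are assumptions: the pattern targets the ocrx_word spans Tesseract writes with default settings and will not cope with nested per-character spans.

```python
import html
import re

# Matches word spans as written in Tesseract's hOCR output, e.g.
# <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>text</span>
_WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+);[^'\"]*?"
    r"x_wconf (\d+(?:\.\d+)?)[^'\"]*['\"][^>]*>"
    r"(.*?)</span>",
    re.DOTALL,
)
_TAG_RE = re.compile(r"<[^>]+>")


def low_confidence_words(hocr, threshold=60.0):
    """Return [(text, confidence, (x0, y0, x1, y1))] for words below `threshold`."""
    found = []
    for match in _WORD_RE.finditer(hocr):
        x0, y0, x1, y1 = (int(v) for v in match.group(1, 2, 3, 4))
        confidence = float(match.group(5))
        # Drop inline markup such as <em>/<strong> and decode HTML entities.
        text = html.unescape(_TAG_RE.sub("", match.group(6))).strip()
        if confidence < threshold:
            found.append((text, confidence, (x0, y0, x1, y1)))
    return found
```

The returned boxes are in image pixel coordinates, so they could be drawn onto the input image or used to place annotations in a PDF built separately.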

sirfz · 2021-09-14T16:53:55Z

Indeed, ProcessPage simply converts the given PIL image to pix and calls tesseract's ProcessPage API so we shouldn't make such changes to the API. Perhaps providing separate API methods to allow what you're suggesting is a better approach. Something like:

renderer = api.GetRenderer(path)
with renderer:  # calls renderer.BeginDocument()
    # do stuff...
# renderer.EndDocument() called

I'm not familiar with the API so this is just a rough draft
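
To make the draft above a little more concrete, here is how the with-block behavior could be expressed in plain Python. It is only a sketch of the proposal: api.GetRenderer() does not exist in tesserocr at this point, so `renderer` stands for a hypothetical wrapper object exposing BeginDocument(title) and EndDocument(), and open_document is an invented name.

```python
from contextlib import contextmanager


@contextmanager
def open_document(renderer, title=""):
    """Bracket page rendering with BeginDocument()/EndDocument().

    `renderer` is a hypothetical wrapper around a Tesseract result renderer;
    it only needs BeginDocument(title) and EndDocument() methods.
    """
    if not renderer.BeginDocument(title):
        raise RuntimeError("renderer.BeginDocument() failed")
    try:
        yield renderer
    finally:
        # Always write the document footer, even if rendering a page raised.
        renderer.EndDocument()
```

Calling EndDocument() in the finally clause is a design choice: it closes the file cleanly after an error, at the cost of possibly leaving a document with fewer pages than intended.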

zdenop · 2021-09-15T12:26:49Z

With PR #277 this should work:

import tesserocr
from PIL import Image

# Placeholder: directory that holds your *.traineddata files.
tessdata_path = "./tessdata"

image_filename = "5.png"
img = Image.open(image_filename)
with tesserocr.PyTessBaseAPI(path=tessdata_path) as api:
    api.SetVariable("tessedit_create_pdf", "true")
    api.SetVariable("tessedit_create_hocr", "true")
    api.SetVariable("tessedit_create_alto", "true")
    # The title argument is part of PR #277.
    api.ProcessPage(outputbase="test1",
                    image=img,
                    page_index=0,
                    filename=image_filename,
                    title="this will be title")

zdenop · 2021-09-15T12:48:27Z

Of course, implementing something that can easily process a multipage TIFF (or a list of filenames) would take more work.
E.g. so that code like this would work:

import tesserocr
from PIL import Image, ImageSequence

filename = "multipage.tif"
title = "My title"
outputbase = "ocr_result"
im = Image.open(filename)
with tesserocr.PyTessBaseAPI(path=tessdata_path) as api:
    api.SetVariable("tessedit_create_pdf", "true")
    renderer = api.GetRenderer()
    renderer.BeginDocument(title)
    for page_index, img in enumerate(ImageSequence.Iterator(im)):
        api.ProcessPage(img,
                        page_index,
                        outputbase,
                        renderer,
                        retry_config=None,
                        timeout=0)
    renderer.EndDocument()

joseavegaa · 2021-09-15T22:35:34Z

> With PR #277 this should work:
>
> image_filename = "5.png"
> img = Image.open(image_filename)
> with tesserocr.PyTessBaseAPI(path=tessdata_path) as api:
>     api.SetVariable("tessedit_create_pdf", "true")
>     api.SetVariable("tessedit_create_hocr", "true")
>     api.SetVariable("tessedit_create_alto", "true")
>     api.ProcessPage(outputbase="test1",
>                     image=img,
>                     page_index=0,
>                     filename=image_filename,
>                     title="this will be title")

Thanks! As soon as it is merged and updated on conda-forge (which is on 2.5.1 atm) I'll try it this way.

bertsky · 2021-09-28T06:27:19Z

> Indeed, ProcessPage simply converts the given PIL image to pix and calls tesseract's ProcessPage API so we shouldn't make such changes to the API. Perhaps providing separate API methods to allow what you're suggesting is a better approach.

I also do not think #277 is a good solution. It makes tesserocr deviate from Tesseract's API unexpectedly:

ProcessPage: low-level function which does not concern itself with the header and footer section of the renderers' document handler
ProcessPagesInternal: mid-level function which covers document header and footer, and delegates to ProcessPage, or ProcessPagesMultipageTiff and ProcessPagesFileList if necessary
ProcessPages: high-level function which does all of the above

Since tesserocr already wraps ProcessPages, we already have everything. On the contrary: you could argue that it was a mistake for the Tesseract API (and therefore also tesserocr) to expose ProcessPagesInternal and ProcessPage at all.

IINM this is merely a documentation issue (but perhaps we should unexpose ProcessPage).
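
For the "several image files, one PDF" case, the ProcessPagesFileList branch mentioned above can be reached through the existing ProcessPages() wrapper by passing a text file that lists one image path per line. The sketch below assumes tesserocr hands the filename to Tesseract unchanged; images_to_single_pdf is a hypothetical helper name.

```python
import os
import tempfile


def images_to_single_pdf(api, image_paths, outputbase):
    """OCR several image files into one <outputbase>.pdf via a file list.

    `api` is expected to be an initialized tesserocr.PyTessBaseAPI.
    """
    api.SetVariable("tessedit_create_pdf", "true")
    # Tesseract treats a non-image input file as a list of image paths,
    # one per line, and renders all of them into a single document.
    fd, list_path = tempfile.mkstemp(suffix=".txt", text=True)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write("\n".join(os.path.abspath(p) for p in image_paths) + "\n")
        ok = api.ProcessPages(outputbase, list_path)
    finally:
        # The list is only needed while ProcessPages() runs.
        os.remove(list_path)
    if not ok:
        raise RuntimeError("ProcessPages failed for the file list")
    return outputbase + ".pdf"
```

This avoids building a multipage TIFF first, but every page still has to exist as a file on disk.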

zdenop · 2021-10-03T17:27:05Z

Yes, this is up for discussion, but for me none of the existing solutions is ideal. E.g. ProcessPages can be used only with files as input. If you have an in-memory object, bad luck: put it on disk. Also, I do not like that ProcessPage(s) writes the result to disk... If you want to avoid disk operations, you have no chance (but this is IMO also a problem of the tesseract API >= 4).
I think tesserocr should be able to take an in-memory object as input and maybe read the output from disk and return it as an in-memory object ;-)
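
Until something like that exists in tesserocr, the same effect can be approximated on the caller's side by hiding the disk round trip in a temporary directory. This is a sketch under assumptions, not a tesserocr feature: image_to_pdf_bytes is an invented helper, `img` is anything with a PIL-style save(path) method, and the PNG file name is arbitrary.

```python
import os
import tempfile


def image_to_pdf_bytes(api, img):
    """OCR an in-memory image and return the searchable PDF as bytes.

    `api` is expected to be an initialized tesserocr.PyTessBaseAPI and
    `img` an object with a PIL-style save(path) method.
    """
    api.SetVariable("tessedit_create_pdf", "true")
    with tempfile.TemporaryDirectory() as workdir:
        image_path = os.path.join(workdir, "page.png")
        outputbase = os.path.join(workdir, "page")
        # ProcessPages() only accepts a filename, so write the image out first.
        img.save(image_path)
        if not api.ProcessPages(outputbase, image_path):
            raise RuntimeError("ProcessPages failed")
        # Read the result back before the temporary directory is removed.
        with open(outputbase + ".pdf", "rb") as handle:
            return handle.read()
```

The disk I/O still happens; the helper only keeps it out of the caller's way and cleans up afterwards.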

bertsky · 2021-10-03T19:43:18Z

@zdenop thanks for your explanation, I had completely overlooked that aspect. Indeed, ProcessPage is the only function that allows us to hand in a Pix instance in-memory. (Tesseract's ProcessPages also does in-memory in case of stdin or http/curl, but that's no use for us.) So your PR does make sense after all. I also agree that this would best be fixed in Tesseract itself, and considering the timeline for 5.0 it should be discussed in Tesseract urgently. (One could for example split ProcessPagesInternal such that renderer->BeginDocument and ProcessPage and renderer->EndDocument become a new function...)

sirfz · 2021-11-09T18:18:04Z

@bertsky @zdenop I think you both raise very important points. I posted a comment in #277 about my concern, which you've already discussed here. Perhaps a separate method should be introduced with the desired behavior (to avoid deviating from tesseract's API)?

sirfz added the bug label Sep 14, 2021

zdenop added a commit to zdenop/tesserocr that referenced this issue Sep 15, 2021

fix ProcessPage()/ fixes sirfz#271

bf4c2a5
