ProcessPage() generates a corrupt file #271

Open
joseavegaa opened this issue Sep 7, 2021 · 11 comments
Comments

@joseavegaa

I'm trying to OCR a PIL image and create a searchable PDF from that image. According to the documentation, I should use ProcessPage() to generate the PDF files. However, every file that is created is corrupted or damaged.

The code is as follows:

with tesserocr.PyTessBaseAPI() as api:
    api.SetVariable("tessedit_create_pdf", "true")
    api.SetImage(img)
    api.ProcessPage(outputbase=img_name, image=img, page_index=0, filename=img_name)

The PDF is then created, but it says that the file is corrupted.

I've also tried to use ProcessPages() with an image file, but once again the PDF generated is corrupted.

I found issue #167, but it doesn't explain what page_index should be or what filename should be set to. The documentation isn't clear on the correct order of calls either: should I call GetUTF8Text first, or can ProcessPage() be called at any point? What is the correct usage?

Also, if I want to store the OCR result in a string variable and also create a searchable PDF, should I call GetUTF8Text and ProcessPage() separately, which would run the OCR twice, or is there a way to do it without the extra processing?

Thanks for the help.

  • tesserocr = 2.5.1
  • tesseract = 4.1.1
  • python = 3.8.10
@zdenop
Contributor

zdenop commented Sep 11, 2021

IMO tesserocr's ProcessPage is not finished and needs adaptation. At the moment ProcessPage() just stores the "pdf object" as a PDF document, which is wrong.
It uses the Tesseract renderer, but the renderer needs to call renderer->BeginDocument(...) and renderer->EndDocument() to create a valid PDF (or hOCR) file.
However, ProcessPages works OK, so you should use it to create a searchable PDF (maybe with some workaround, depending on your use case).
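A minimal sketch of the ProcessPages route, which may also cover the "text plus PDF without running OCR twice" question. It assumes that tesserocr honours the Tesseract variable tessedit_create_txt alongside tessedit_create_pdf when it sets up its renderers, and that the renderers write outputbase + ".pdf" and outputbase + ".txt"; the function names are made up for illustration.

```python
def expected_outputs(outputbase):
    """Paths the PDF and text renderers are assumed to write."""
    return {"pdf": outputbase + ".pdf", "txt": outputbase + ".txt"}


def ocr_file_to_pdf_and_text(image_path, outputbase, tessdata_path=None):
    """Run OCR once on an image file; return (pdf_path, recognized_text)."""
    import tesserocr  # lazy import so expected_outputs() works without it

    # Only pass path= when the caller supplied one.
    kwargs = {"path": tessdata_path} if tessdata_path else {}
    with tesserocr.PyTessBaseAPI(**kwargs) as api:
        api.SetVariable("tessedit_create_pdf", "true")
        # Assumption: this makes ProcessPages also write a plain-text file.
        api.SetVariable("tessedit_create_txt", "true")
        ok = api.ProcessPages(outputbase, image_path)
    if not ok:
        raise RuntimeError("ProcessPages failed for " + image_path)

    paths = expected_outputs(outputbase)
    with open(paths["txt"], encoding="utf-8") as fh:
        return paths["pdf"], fh.read()
```

Usage would look like `pdf_path, text = ocr_file_to_pdf_and_text("5.png", "test1")`. Note that the input has to be a file on disk, not a PIL image.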

@sirfz
Owner

sirfz commented Sep 14, 2021

IMO tesserocr's ProcessPage is not finished and needs adaptation. At the moment ProcessPage() just stores the "pdf object" as a PDF document, which is wrong.
It uses the Tesseract renderer, but the renderer needs to call renderer->BeginDocument(...) and renderer->EndDocument() to create a valid PDF (or hOCR) file.

Can you submit a PR to fix this? If not I'll try to do it whenever I have some time

@sirfz sirfz added the bug label Sep 14, 2021
@zdenop
Contributor

zdenop commented Sep 14, 2021

I do not have time either, but IMO there should be some discussion about what the expected goal is (input and output)...

Just adding renderer->BeginDocument(...) and renderer->EndDocument() could be easy. This would cover the case of taking an in-memory object (PIL image) and creating a file. But at this stage it would generate 100 PDF files for one TIFF with 100 pages...

If somebody wants to OCR one multipage TIFF and receive one PDF, then the solution is to use ProcessPages(). This also covers the case where somebody wants to OCR a list of image files: the files should first be converted to a multipage TIFF, stored on disk, and then processed...
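For a list of image files, a possible alternative to building a multipage TIFF is Tesseract's file-list input: the command-line tool accepts a text file with one image path per line, and ProcessPages is expected to handle such a file the same way. A rough sketch under that assumption (helper names and the default list filename are made up):

```python
def write_file_list(image_paths, list_path):
    """Write one image path per line, the format Tesseract reads as a file list."""
    with open(list_path, "w", encoding="utf-8") as fh:
        for path in image_paths:
            fh.write(path + "\n")
    return list_path


def ocr_file_list_to_pdf(image_paths, outputbase, list_path="pages.txt"):
    """OCR several image files into a single outputbase + '.pdf'."""
    import tesserocr  # lazy import so write_file_list() works without it

    write_file_list(image_paths, list_path)
    with tesserocr.PyTessBaseAPI() as api:
        api.SetVariable("tessedit_create_pdf", "true")
        # Assumption: a non-image text file is treated as a list of images.
        return api.ProcessPages(outputbase, list_path)
```

This still goes through the disk, so it does not help with in-memory images.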

If somebody wants to do everything in memory (PIL -> OCR -> PDF), then more development needs to be done (maybe on the Tesseract side too).

Another interesting idea would be to play with hOCR/ALTO(?) output (e.g. create a PDF with Python, add the input image and the hOCR result, maybe highlight areas with low confidence, etc.)
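As a starting point for the hOCR idea, the sketch below gets the hOCR markup and the low-confidence words from an in-memory PIL image without writing anything to disk. It relies on tesserocr's GetHOCRText and MapWordConfidences; the helper names and the threshold of 60 are arbitrary choices for illustration, and building the PDF itself is left out.

```python
def low_confidence_words(word_confidences, threshold=60):
    """Return the words whose confidence (0-100) is below the threshold."""
    return [word for word, conf in word_confidences if conf < threshold]


def hocr_and_weak_words(img, threshold=60):
    """Return (hocr_markup, low_confidence_words) for a PIL image."""
    import tesserocr  # lazy import so the filter above works without it

    with tesserocr.PyTessBaseAPI() as api:
        api.SetImage(img)
        hocr = api.GetHOCRText(0)  # page number 0
        weak = low_confidence_words(api.MapWordConfidences(), threshold)
    return hocr, weak
```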

@sirfz
Owner

sirfz commented Sep 14, 2021

Indeed, ProcessPage simply converts the given PIL image to pix and calls tesseract's ProcessPage API so we shouldn't make such changes to the API. Perhaps providing separate API methods to allow what you're suggesting is a better approach. Something like:

renderer = api.GetRenderer(path)
with renderer:  # calls renderer.BeginDocument()
    # do stuff...
# renderer.EndDocument() called

I'm not familiar with the API so this is just a rough draft
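To make the draft above a bit more concrete, here is a pure-Python sketch of the context-manager idea. Everything in it is hypothetical: tesserocr does not provide GetRenderer in this thread's context, so a stand-in RecordingRenderer class is used to show the BeginDocument/EndDocument pairing.

```python
from contextlib import contextmanager


@contextmanager
def document(renderer, title):
    """Call BeginDocument on entry and EndDocument on exit, even on errors."""
    if not renderer.BeginDocument(title):
        raise RuntimeError("BeginDocument failed")
    try:
        yield renderer
    finally:
        renderer.EndDocument()


class RecordingRenderer:
    """Stand-in for a wrapped TessResultRenderer; it only records calls."""

    def __init__(self):
        self.calls = []

    def BeginDocument(self, title):
        self.calls.append(("begin", title))
        return True

    def EndDocument(self):
        self.calls.append(("end",))
        return True
```

Pages would be processed inside `with document(renderer, "My title"):`, so the document footer gets written even if one page raises.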

zdenop added a commit to zdenop/tesserocr that referenced this issue Sep 15, 2021
@zdenop
Contributor

zdenop commented Sep 15, 2021

With PR #277 this should work:

import tesserocr
from PIL import Image

# tessdata_path must point to the directory containing the traineddata files
image_filename = "5.png"
img = Image.open(image_filename)
with tesserocr.PyTessBaseAPI(path=tessdata_path) as api:
    api.SetVariable("tessedit_create_pdf", "true")
    api.SetVariable("tessedit_create_hocr", "true")
    api.SetVariable("tessedit_create_alto", "true")
    api.ProcessPage(outputbase="test1",
                    image=img,
                    page_index=0,
                    filename=image_filename,
                    title="this will be title")

@zdenop
Contributor

zdenop commented Sep 15, 2021

Of course, implementing something that can easily process a multipage TIFF (or a list of filenames) would take more work.
E.g. so that code like this would work:

import tesserocr
from PIL import Image, ImageSequence

filename = "multipage.tif"
title = "My title"
outputbase = "ocr_result"
im = Image.open(filename)
with tesserocr.PyTessBaseAPI(path=tessdata_path) as api:
    api.SetVariable("tessedit_create_pdf", "true")
    renderer = api.GetRenderer()
    renderer.BeginDocument(title)
    for page_index, img in enumerate(ImageSequence.Iterator(im)):
        api.ProcessPage(img,
                        page_index,
                        outputbase,
                        renderer,
                        retry_config=None,
                        timeout=0)
    renderer.EndDocument()

@joseavegaa
Author

With PR #277 this should work:

image_filename = "5.png"
img = Image.open(image_filename)
with tesserocr.PyTessBaseAPI(path=tessdata_path) as api:
    api.SetVariable("tessedit_create_pdf", "true")
    api.SetVariable("tessedit_create_hocr", "true")
    api.SetVariable("tessedit_create_alto", "true")
    api.ProcessPage(outputbase="test1",
                    image=img,
                    page_index=0,
                    filename=image_filename,
                    title="this will be title")

Thanks! As soon as it is merged and updated on conda-forge (which is on 2.5.1 atm) I'll try it this way.

@bertsky
Contributor

bertsky commented Sep 28, 2021

Indeed, ProcessPage simply converts the given PIL image to pix and calls tesseract's ProcessPage API so we shouldn't make such changes to the API. Perhaps providing separate API methods to allow what you're suggesting is a better approach.

I also do not think #277 is a good solution. It makes tesserocr deviate from Tesseract's API unexpectedly:

  • ProcessPage: low-level function which does not concern itself with the header and footer section of the renderers' document handler
  • ProcessPagesInternal: mid-level function which covers document header and footer, and delegates to ProcessPage, or ProcessPagesMultipageTiff and ProcessPagesFileList if necessary
  • ProcessPages: high-level function which does all of the above

Since tesserocr already wraps ProcessPages, we already have everything. On the contrary: you could argue that it was a mistake for the Tesseract API (and therefore also tesserocr) to expose ProcessPagesInternal and ProcessPage at all.

IINM this is merely a documentation issue (but perhaps we should unexpose ProcessPage).

@zdenop
Contributor

zdenop commented Oct 3, 2021

Yes, this is up for discussion, but for me none of the existing solutions is ideal. E.g. ProcessPages can be used only with files as input. If you have an in-memory object, bad luck: write it to disk. I also do not like that ProcessPage(s) writes the result to disk... If you want to avoid disk operations, you have no chance (but IMO this is also a problem of the Tesseract API >= 4).
I think tesserocr should be able to take an in-memory object as input and maybe read the output from disk and return it as an in-memory object ;-)
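Until something like that exists, the behaviour can be approximated in user code with a temporary directory. This is only a sketch of the workaround, not a proposal for the tesserocr API: the function names are made up, it still touches the disk, and it assumes the PDF renderer writes outputbase + ".pdf". The OCR step is passed in as a parameter so it can be swapped out.

```python
import os
import tempfile


def _run_tesserocr(input_path, outputbase):
    """Default OCR step: ProcessPages with the PDF renderer enabled."""
    import tesserocr  # lazy import; only needed for the default step

    with tesserocr.PyTessBaseAPI() as api:
        api.SetVariable("tessedit_create_pdf", "true")
        return api.ProcessPages(outputbase, input_path)


def image_to_pdf_bytes(img, run_ocr=_run_tesserocr):
    """Take an object with a PIL-style save(path) and return PDF bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        input_path = os.path.join(tmp, "page.png")
        outputbase = os.path.join(tmp, "page")
        img.save(input_path)  # PIL picks the format from the extension
        if not run_ocr(input_path, outputbase):
            raise RuntimeError("OCR failed")
        with open(outputbase + ".pdf", "rb") as fh:
            return fh.read()
```

The temporary files are removed when the `with` block exits, so the caller only ever sees the in-memory result.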

@bertsky
Contributor

bertsky commented Oct 3, 2021

@zdenop thanks for your explanation, I had completely overlooked that aspect. Indeed, ProcessPage is the only function that allows us to hand in a Pix instance in-memory. (Tesseract's ProcessPages also works in-memory in the case of stdin or http/curl, but that's no use for us.) So your PR does make sense after all. I also agree that this should probably best be fixed in Tesseract itself, and considering the timeline for 5.0 it should be discussed in Tesseract urgently. (One could for example split ProcessPagesInternal such that renderer->BeginDocument and ProcessPage and renderer->EndDocument become a new function...)

@sirfz
Owner

sirfz commented Nov 9, 2021

@bertsky @zdenop I think you both raise very important points. I posted a comment in #277 about my concern, which you've already discussed. Perhaps a separate method should be introduced with the desired behavior (to avoid deviating from Tesseract's API)?


4 participants