[Question] How to do searchable PDF via tesserocr #264

PenthagonHacker · 2021-07-19T19:11:32Z

Hello guys!
I am completely new to tesseract and tesserocr. I need to make a PDF file with a text layer, a.k.a. a searchable PDF.
I found in the tesseract documentation that there is such a thing as TessPDFRenderer. Is there any way I can use it via tesserocr in PyCharm?
I looked through tesserocr.py and haven't found anything even remotely close to that.

Thank you in advance!

ES-Alexander · 2021-08-09T13:21:34Z

Probably better to use OCRmyPDF for this since it’s literally made for that use case.

Tesserocr can help you perform OCR on images, but it doesn’t come with extensive PDF modification utilities built in because that’s outside the scope of the library.
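
For reference, here is a minimal sketch of the OCRmyPDF route, assuming the `ocrmypdf` command-line tool is installed and on the PATH. The helper names `build_ocrmypdf_command` and `make_searchable` and the file names are made up for illustration. Only the basic `ocrmypdf input.pdf output.pdf` invocation and the `-l` language option come from OCRmyPDF itself.

```python
import shutil
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, language="eng"):
    """Return the argv list for a basic OCRmyPDF run (hypothetical helper)."""
    return ["ocrmypdf", "-l", language, input_pdf, output_pdf]


def make_searchable(input_pdf, output_pdf, language="eng"):
    """Run OCRmyPDF if it is installed.

    Returns True when the command ran successfully and False when the
    ocrmypdf executable cannot be found on the PATH.
    """
    if shutil.which("ocrmypdf") is None:
        return False  # tool not installed, nothing was done
    command = build_ocrmypdf_command(input_pdf, output_pdf, language)
    # check=True raises CalledProcessError if OCRmyPDF exits with an error
    subprocess.run(command, check=True)
    return True
```

Something like `make_searchable("scan.pdf", "scan_searchable.pdf")` would then write a copy of the input with a text layer added, provided the input file exists.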

sirfz · 2021-08-09T17:39:28Z

You can use the ProcessPage method, which should be able (if I understand correctly) to output a searchable PDF if you set the tessedit_create_pdf variable to true. See ProcessPages as well.

tritium01 · 2021-08-17T16:54:38Z

Have you tried @sirfz method or found a solution? I am interested in this too

zdenop · 2021-09-15T13:03:53Z

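import tesserocr

tessdata_path = "tessdata"    # directory containing the *.traineddata files
outbase = "my_first_pdf"      # output base name; Tesseract appends ".pdf"
image_filename = "5.png"
with tesserocr.PyTessBaseAPI(path=tessdata_path) as api:
    # Ask Tesseract to render the result as a PDF with a text layer
    api.SetVariable("tessedit_create_pdf", "true")
    api.ProcessPages(outbase, image_filename)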

After applying PR #277, this should work too:

from PIL import Image  # Pillow, needed to load the image passed to ProcessPage

img = Image.open(image_filename)
with tesserocr.PyTessBaseAPI(path=tessdata_path) as api:
    api.SetVariable("tessedit_create_pdf", "true")
    api.ProcessPage(outputbase=outbase,
                    image=img,
                    page_index=0,
                    filename=image_filename,
                    title="this will be title")
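
Either call should leave a file named `my_first_pdf.pdf` in the working directory, since Tesseract's PDF renderer appends `.pdf` to the output base. Below is a small stdlib-only sanity check, sketched under that assumption. The helper names `expected_pdf_path` and `looks_like_pdf` are hypothetical. The check only confirms that a PDF file was written, not that its text layer is correct.

```python
from pathlib import Path


def expected_pdf_path(outbase):
    """Return the path Tesseract is expected to write: '<outbase>.pdf'."""
    return Path(str(outbase) + ".pdf")


def looks_like_pdf(path):
    """Return True if the file exists and starts with the '%PDF-' magic bytes."""
    path = Path(path)
    if not path.is_file():
        return False
    with path.open("rb") as fh:
        return fh.read(5) == b"%PDF-"
```

After running one of the snippets above, `looks_like_pdf(expected_pdf_path("my_first_pdf"))` should return True. Opening the file in a PDF viewer and searching for a word from the image is still the real test of the text layer.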
