[Question] How to do searchable PDF via tesserocr #264

Open
PenthagonHacker opened this issue Jul 19, 2021 · 4 comments

Comments

@PenthagonHacker

Hello guys!
I am completely new to tesseract and tesserocr. I need to make a PDF file with a text layer, a.k.a. a searchable PDF.
I found in the tesseract documentation that there is such a thing as TessPDFRenderer. Is there any way I can use it via tesserocr in PyCharm?
I looked through tesserocr.py and haven't found anything even remotely close to that.

Thanks in advance!

@ES-Alexander

Probably better to use OCRmyPDF for this since it’s literally made for that use case.

Tesserocr can help you perform OCR on images, but it doesn’t come with extensive PDF modification utilities built in because that’s outside the scope of the library.
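
For reference, here is a minimal sketch of the OCRmyPDF route. It assumes the `ocrmypdf` command-line tool is installed and on your PATH, and that the input is a scanned PDF. The helper names `build_ocrmypdf_command` and `make_searchable_pdf` are made up for this example; `-l` is the OCRmyPDF option that selects the Tesseract language.

```python
import subprocess


def build_ocrmypdf_command(src, dst, language="eng"):
    """Build the argument list for the ocrmypdf CLI (hypothetical helper)."""
    return ["ocrmypdf", "-l", language, str(src), str(dst)]


def make_searchable_pdf(src, dst, language="eng"):
    """Run ocrmypdf on src and write the searchable PDF to dst.

    Raises subprocess.CalledProcessError if ocrmypdf exits with an error,
    and FileNotFoundError if the tool is not installed.
    """
    subprocess.run(build_ocrmypdf_command(src, dst, language), check=True)


# Example call (file names are placeholders):
# make_searchable_pdf("scan.pdf", "scan_searchable.pdf")
```

Building the argument list separately keeps the command easy to inspect before anything is run.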

@sirfz
Owner

sirfz commented Aug 9, 2021

You can use the ProcessPage method, which (if I understand correctly) should output a searchable PDF when the tessedit_create_pdf variable is set to true. See ProcessPages as well.

@tritium01

Have you tried @sirfz's method or found a solution? I am interested in this too.

@zdenop
Contributor

zdenop commented Sep 15, 2021

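import tesserocr

tessdata_path = "tessdata"    # directory containing the traineddata files
outbase = "my_first_pdf"      # output name without extension
image_filename = "5.png"      # input image to OCR
with tesserocr.PyTessBaseAPI(path=tessdata_path) as api:
    # Ask tesseract to render the result as a PDF
    api.SetVariable("tessedit_create_pdf", "true")
    api.ProcessPages(outbase, image_filename)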

After applying PR #277, this should work too:

from PIL import Image

# tessdata_path, outbase and image_filename are the same as in the first snippet
img = Image.open(image_filename)
with tesserocr.PyTessBaseAPI(path=tessdata_path) as api:
    api.SetVariable("tessedit_create_pdf", "true")
    api.ProcessPage(outputbase=outbase,
                    image=img,
                    page_index=0,
                    filename=image_filename,
                    title="this will be title")
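
If the ProcessPages snippet above produces no PDF, one thing worth checking is whether the tessdata directory contains `pdf.ttf`, which Tesseract's PDF renderer loads. Below is a rough sketch of a wrapper that checks for it first. The function names `pdf_font_available` and `image_to_searchable_pdf` are made up for this example, and the sketch assumes that ProcessPages returns a truthy value on success and that the output lands at the output base plus `.pdf`.

```python
from pathlib import Path


def pdf_font_available(tessdata_path):
    """Return True if tessdata_path contains pdf.ttf."""
    return (Path(tessdata_path) / "pdf.ttf").is_file()


def image_to_searchable_pdf(image_filename, outbase, tessdata_path="tessdata"):
    """Hypothetical wrapper around the ProcessPages snippet.

    Returns the path where the PDF is expected to be written.
    """
    # Imported here so the font check above can be used without tesserocr
    import tesserocr

    if not pdf_font_available(tessdata_path):
        raise FileNotFoundError(f"pdf.ttf not found in {tessdata_path!r}")

    with tesserocr.PyTessBaseAPI(path=tessdata_path) as api:
        api.SetVariable("tessedit_create_pdf", "true")
        # Assumption: a falsy return value means processing failed
        if not api.ProcessPages(outbase, image_filename):
            raise RuntimeError(f"OCR failed for {image_filename!r}")

    return Path(outbase + ".pdf")


# Example call (file names are placeholders):
# image_to_searchable_pdf("5.png", "my_first_pdf")
```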
