PDF returns no text, tables, or anything #717

toakleyy · 2022-08-24T16:05:25Z

toakleyy
Aug 24, 2022

I have used pdfplumber successfully on a separate PDF, but when I try the same function on this new PDF I get absolutely nothing. I'm simply trying to extract text. The PDF is attached here (I have blocked out personal info in the attachment, but it is not blocked out in the file I actually try to plumb):
RJ PDF markup.pdf

Some code I am using that isn't working:
```python
import pdfplumber

file = "localfilename"

with pdfplumber.open(file) as pdf:
    # Print the extracted text of every page, labelled by page index
    for i, page in enumerate(pdf.pages):
        text = page.extract_text()
        print(f'page {i}')
        print(text)
```

My output looks like this:
page 0

page 1

page 2

So it scans each page, but doesn't actually come up with any text.

Any help would be much appreciated. Thank you.

jsvine · 2022-08-25T16:30:55Z

jsvine
Aug 25, 2022
Maintainer

Hi @toakleyy, and thanks for your interest in this library. From what you've shared, it appears that your PDF is a scanned document, not a born-digital PDF. (A quick way to test this: Can you select / copy / paste any text within the document?) If that's the case, you'll first want to run optical character recognition on the PDF. Then you can extract the OCR-detected text with pdfplumber.
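The select / copy / paste test can also be approximated in code. The following is a minimal sketch, not an official pdfplumber feature: it assumes that a page with zero entries in `page.chars` but at least one entry in `page.images` is probably a scan. The helper names `looks_scanned` and `report` are hypothetical, introduced only for illustration.

```python
def looks_scanned(page):
    """Heuristic (an assumption, not a guarantee): the page has no text
    characters but does contain at least one embedded image."""
    return len(page.chars) == 0 and len(page.images) > 0


def report(path):
    # Imported here so the heuristic above can be used on its own.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            kind = "likely scanned" if looks_scanned(page) else "not obviously scanned"
            print(f"page {number}: {len(page.chars)} chars, "
                  f"{len(page.images)} images -> {kind}")
```

Calling `report("localfilename")` prints one line per page. A page that reports 0 chars and 1 or more images is a good candidate for OCR.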

2 replies

toakleyy Aug 25, 2022
Author

Thanks, I will try this. Can you recommend an OCR library for Python?

jsvine Aug 25, 2022
Maintainer

The most established open-source OCR tool is probably Tesseract, though its quality can be a bit spotty. It's certainly worth trying as a first option. OCRmyPDF is a Tesseract-based Python project that works both via the command line and as a Python library. I've used it with success in the past. Another Python- and Tesseract-based option is the pytesseract library, which I've also found to work well.
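As a rough sketch of the OCR-then-extract workflow with the first of these options: the function below runs the scanned file through `ocrmypdf.ocr` and then reads the resulting text layer with pdfplumber. It assumes both packages and a Tesseract binary are installed. The function name `ocr_then_extract` and both file paths are hypothetical.

```python
def ocr_then_extract(scanned_path, searchable_path):
    """Write an OCR'd copy of scanned_path to searchable_path and
    return the extracted text of each page as a list of strings."""
    import ocrmypdf
    import pdfplumber

    # skip_text=True asks OCRmyPDF to leave pages that already have text alone
    ocrmypdf.ocr(scanned_path, searchable_path, skip_text=True)

    with pdfplumber.open(searchable_path) as pdf:
        # Fall back to "" if a page yields no text
        return [page.extract_text() or "" for page in pdf.pages]
```

A call such as `ocr_then_extract("scan.pdf", "scan_ocr.pdf")` is best placed under an `if __name__ == "__main__":` guard, since OCRmyPDF may start worker processes. OCR quality depends heavily on the scan, so the returned text is worth spot-checking.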
