pdf plumb returning "none" #450

GrantMWu · 2021-06-10T14:27:29Z

I'm trying to use pdfplumber to extract text from a PDF, but I'm getting a return value of "none" for certain pages. For other pages, the code below works fine. I suspect this has something to do with the way the PDF is set up, and I'm wondering if there is an easy workaround. My code is below and a sample PDF is attached.

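import pdfplumber

test_pdf = "test_pdf.pdf"  # path to the attached sample PDF

with pdfplumber.open(test_pdf) as pdf:
    page = pdf.pages[0]          # first page of the document
    text = page.extract_text()   # comes back as "none" on the problem pages

print(text)

test_pdf.pdf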

I'm pretty new to coding, outside of a few Python classes in college.

samkit-jain · 2021-06-10T17:40:54Z

Collaborator

Hi @gmwu843, I appreciate your interest in the library and wish you well on your Python journey. When dealing with text-extraction issues, the first step is to check whether pdfminer.six can extract the text, since pdfplumber relies on pdfminer.six behind the scenes. pdfminer.six also provides a handy command-line tool for text extraction, pdf2txt.py. When you run it on the PDF with python pdf2txt.py test_pdf.pdf, you'll notice that it prints nothing: it is unable to read any text, which also explains why pdfplumber cannot read any either.
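The same check can also be run from Python rather than from the command line. The following is a minimal sketch, assuming pdfminer.six is installed and using its high-level extract_text function; the helper names has_extractable_text and pdfminer_can_read are made up for illustration and are not part of either library.

```python
def has_extractable_text(text):
    """Return True if the extracted text contains any non-whitespace characters."""
    return bool(text and text.strip())


def pdfminer_can_read(path):
    """Check whether pdfminer.six alone can pull any text out of the PDF at `path`."""
    # Imported lazily so the helper above works even without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return has_extractable_text(extract_text(path))


if __name__ == "__main__":
    # "test_pdf.pdf" is the sample file name from the question.
    print(pdfminer_can_read("test_pdf.pdf"))
```

If this prints False, the problem is in the PDF itself (or in pdfminer.six), not in how pdfplumber is being called.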

The reason could be that font information (such as the character mapping, or CMap) is missing from the PDF. When the file is viewed in a PDF reader, the text is still copyable because the reader might be substituting the missing mappings with a default font.
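Since only certain pages are affected, it can help to list them first. This is a rough sketch, assuming pdfplumber is installed; the function names empty_page_numbers and find_empty_pages are hypothetical, and the check treats both None and a blank string as "no text" because the exact return value for an empty page may differ between pdfplumber versions.

```python
def empty_page_numbers(page_texts):
    """Return 1-based numbers of pages whose extracted text is None or blank."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not (text and text.strip())
    ]


def find_empty_pages(path):
    """Open the PDF with pdfplumber and report pages that yield no text."""
    # Imported lazily so empty_page_numbers can be used on its own.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        # The generator is consumed inside the `with` block, while the file is open.
        return empty_page_numbers(page.extract_text() for page in pdf.pages)


if __name__ == "__main__":
    # "test_pdf.pdf" is the sample file name from the question.
    print(find_empty_pages("test_pdf.pdf"))
```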

To extract the text correctly, you can repair the PDF using Ghostscript like so:

gs -o output.pdf -sDEVICE=pdfwrite input.pdf

When using the repaired PDF, you'll be able to extract the text properly. Attaching the repaired PDF here for your reference.
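If you would rather do the repair from a Python script, the Ghostscript command above can be wrapped with subprocess. This is only a sketch: it assumes the Ghostscript executable is on your PATH under the name gs (on Windows it is usually named differently), and ghostscript_command and repair_pdf are illustrative names, not part of pdfplumber.

```python
import shutil
import subprocess


def ghostscript_command(src, dst, gs="gs"):
    """Build the same command shown above: gs -o <dst> -sDEVICE=pdfwrite <src>."""
    return [gs, "-o", dst, "-sDEVICE=pdfwrite", src]


def repair_pdf(src, dst, gs="gs"):
    """Rewrite `src` through Ghostscript's pdfwrite device and return `dst`."""
    if shutil.which(gs) is None:
        raise RuntimeError(f"Ghostscript executable {gs!r} was not found on PATH")
    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(ghostscript_command(src, dst, gs), check=True)
    return dst


if __name__ == "__main__":
    import pdfplumber

    # File names are placeholders matching the command above.
    repaired = repair_pdf("input.pdf", "output.pdf")
    with pdfplumber.open(repaired) as pdf:
        print(pdf.pages[0].extract_text())
```

Writing to a separate output file keeps the original PDF untouched in case the repaired copy turns out worse.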

1 reply

GrantMWu Jun 11, 2021
Author

Thanks! That worked. Appreciate the help.
