Extend the capabilities of extract_text() #1502

erenirmak · 2022-12-14T06:02:56Z

Some PDF files are scanned documents that consist of images instead of text. Others contain images alongside text. In both cases, we lose some information when extracting text.

The extract_text() function of PyPDF2 could be extended to process images automatically, in addition to regular text. That would make our lives easier. I don't know the backend, though. Is this possible to implement?

MartinThoma (Maintainer) · 2022-12-14T07:34:26Z

https://pypdf2.readthedocs.io/en/latest/user/extract-text.html

Sometimes PDFs do not contain the text as it’s displayed, but instead an image. You notice that when you cannot copy the text. Then there are PDF files that contain an image and a text layer in the background. That typically happens when a document was scanned. Although the scanning software (OCR) is pretty good today, it still fails once in a while. PyPDF2 is no OCR software; it will not be able to detect those failures. PyPDF2 will also never be able to extract text from images.
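
Since PyPDF2 does no OCR itself, one workaround is to combine its text layer with an external OCR step in your own code. The following is only a rough sketch of that idea, not anything PyPDF2 provides. It assumes page objects expose `extract_text()` and an `images` attribute whose items carry raw bytes in `.data` (recent PyPDF2 releases appear to offer this, but check your version). The names `extract_text_with_ocr` and `tesseract_ocr` are hypothetical, and the OCR helper assumes the third-party Pillow and pytesseract packages plus a local Tesseract install.

```python
from typing import Callable, Iterable, List


def extract_text_with_ocr(pages: Iterable, ocr: Callable[[bytes], str]) -> List[str]:
    """Return one string per page: the text layer plus OCR output for embedded images.

    `pages` yields objects with an `extract_text()` method and an `images`
    attribute whose items expose raw bytes as `.data` (assumed to match
    PyPDF2 page objects). `ocr` is supplied by the caller; PyPDF2 performs
    no OCR of its own.
    """
    results = []
    for page in pages:
        # Start with whatever text layer the page has (may be empty or None).
        parts = [page.extract_text() or ""]
        # Run the caller's OCR function over every embedded image.
        for image in page.images:
            parts.append(ocr(image.data))
        # Drop empty pieces so pages without text or images stay clean.
        results.append("\n".join(p.strip() for p in parts if p and p.strip()))
    return results


def tesseract_ocr(image_bytes: bytes) -> str:
    """Hypothetical OCR helper; requires Pillow, pytesseract, and Tesseract."""
    import io

    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(io.BytesIO(image_bytes)))


# Possible usage (assumes a local file named "scanned.pdf"):
#
#     from PyPDF2 import PdfReader
#     reader = PdfReader("scanned.pdf")
#     texts = extract_text_with_ocr(reader.pages, tesseract_ocr)
```

Passing the OCR function in as a parameter keeps the PDF handling separate from the OCR engine, so either side can be swapped out. Note that OCR output can be wrong, and a scanned page that already has a text layer may yield the same content twice.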
