This repository has been archived by the owner on Apr 15, 2024. It is now read-only.

Equivalent of pdfminer.high_level.extract_text from pdfminer.six, but using pdfminer package #318

Open
Lucas-C opened this issue Jun 18, 2022 · 0 comments


Lucas-C commented Jun 18, 2022

Hi!

I have moved from using pdfminer.six to using this pdfminer package,
and I needed an equivalent of pdfminer.high_level.extract_text().

I thought it might be useful to others performing the same migration, so here is the extract_text() function I used:

from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

def extract_text(pdf_file, password="", page_numbers=None, maxpages=0, caching=True, laparams=None):
    """
    Equivalent of pdfminer.high_level.extract_text from pdfminer.six, but with pdfminer package.
    Inspired by https://github.com/euske/pdfminer/blob/master/tools/pdf2txt.py
    """
    # pdfminer.six falls back to default layout parameters when none are given
    if laparams is None:
        laparams = LAParams()
    outfp = StringIO()
    rsrcmgr = PDFResourceManager(caching=caching)
    device = TextConverter(rsrcmgr, outfp, laparams=laparams)
    try:
        with open(pdf_file, 'rb') as fp:
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            # page_numbers holds zero-based page indices; None means all pages
            for page in PDFPage.get_pages(fp, page_numbers, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
                interpreter.process_page(page)
    finally:
        # Release the converter even if a page fails to parse
        device.close()
    return outfp.getvalue()
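
In case it helps anyone calling this with human-style page numbers: page_numbers is matched against zero-based page indices, so "page 1" is index 0. Below is a small sketch of a converter. The name to_page_numbers and the "1,3-5" spec syntax are my own assumptions for illustration and are not part of pdfminer or pdfminer.six.

```python
def to_page_numbers(spec):
    """Convert a 1-based page spec such as "1,3-5" into sorted 0-based indices.

    Hypothetical helper for illustration; not part of pdfminer or pdfminer.six.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        first, sep, last = part.partition("-")
        start = int(first)
        end = int(last) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Shift from 1-based page numbers to 0-based page indices
        pages.update(range(start - 1, end))
    return sorted(pages)


# Possible usage with the function above ("report.pdf" is a placeholder path):
# text = extract_text("report.pdf", page_numbers=to_page_numbers("1,3-5"))
```

The result is a plain list, so it can be passed straight through as page_numbers.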