Is there any way of extracting text as blocks? #673

dmlls · 2022-06-28T08:14:19Z

Hi everyone 👋🏻

I have been going through the documentation, issues and discussions but I haven't found an answer to the following question:

Does pdfplumber provide any way of extracting text as blocks? Something along the lines of Page.get_text("blocks") in PyMuPDF.

(Sorry in advance if this question has already been answered elsewhere.)

jsvine (Maintainer) · 2022-06-29T19:01:45Z

Hi @dmlls, and thanks for your interest in this library. I'm not very familiar with PyMuPDF's get-blocks method, but adding something like that to pdfplumber is intriguing. (There isn't currently an equivalent method.) For single-column text, you could do something like this:

blocks = re.split(r"\n\n+", my_page.extract_text(layout=True, ...))

... but multi-column layouts would require a more sophisticated approach.
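For anyone wanting to try this, here is a minimal, runnable sketch of the blank-line split suggested above. The helper names (`split_into_blocks`, `page_blocks`, `two_column_blocks`) are hypothetical and not part of pdfplumber. The regex is slightly loosened to also treat whitespace-only lines as blank, which is an assumption about what layout-mode output may contain. The two-column helper is only a naive illustration: it assumes the columns divide exactly at the horizontal midpoint of the page, which will not hold for many real documents.

```python
import re


def split_into_blocks(text):
    """Split extracted text into blocks separated by one or more blank lines.

    Lines containing only whitespace are treated as blank (an assumption;
    the original suggestion splits on consecutive newlines only).
    """
    chunks = re.split(r"\n\s*\n", text)
    # Drop empty chunks and trim stray newlines / trailing spaces.
    return [chunk.strip("\n").rstrip() for chunk in chunks if chunk.strip()]


def page_blocks(pdf_path, page_index=0):
    """Open a PDF and return the blocks of one page (single-column case)."""
    import pdfplumber  # imported here so the pure helper works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        text = page.extract_text(layout=True) or ""
    return split_into_blocks(text)


def two_column_blocks(page):
    """Naive two-column handling: crop left and right halves, then split each.

    Assumes the gutter sits at the horizontal midpoint of the page.
    """
    x0, top, x1, bottom = page.bbox
    mid = (x0 + x1) / 2
    blocks = []
    for bbox in ((x0, top, mid, bottom), (mid, top, x1, bottom)):
        column_text = page.crop(bbox).extract_text(layout=True) or ""
        blocks.extend(split_into_blocks(column_text))
    return blocks
```

With a real file this could be called as `page_blocks("example.pdf")`, where `example.pdf` is a placeholder path. Unlike PyMuPDF's blocks, the result carries no coordinates, only the text of each block.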
