Extracting text and image from pdf using pymupdf python #1262

souravsingh09 · 2021-09-14T10:08:44Z


I am trying to extract text and images from a PDF using Python, but they are not extracted properly: the position of the image within the text sequence is not maintained. PFA the image of the PDF page I want to scrape.

The output of the code is:

As you can see in the output, the image tag is wrongly placed after "For the latter, assumptions requiring medical". It should have been placed after "type 1 diabetes in-silico subjects".

Please help, as I am stuck on this and can't find a solution. My code is:

    import re
    import fitz  # PyMuPDF

    doc = fitz.open(file_path)
    for i in range(len(doc)):
        page1 = doc.load_page(i)  # loadPage() in older PyMuPDF versions
        page1text = page1.get_text("xhtml")  # getText() in older PyMuPDF versions
        page1text = page1text.strip()  # also removes leading/trailing newlines
        page1text = re.sub(r'\s+', ' ', page1text)  # collapse whitespace runs
        print(page1text)

JorjMcKie · 2021-09-14T12:04:39Z

Maintainer

First of all, the output formats XML and (X)HTML are out of my control. They are thin wrappers around the respective original MuPDF functions.

Second, all bare extraction formats return text in the same sequence in which the document creator specified it: you may, for example, get the footer text before a header. This sequence can be arbitrary, down to single characters. This means that if a page contains text written with n characters, there are up to n! possible ways to "code" / create the page without any visual difference.

I don't know why you are using XHTML output (which strips off position information) rather than HTML; with HTML you would at least be able to detect the actual position and sequence of page elements.
Using get_text("dict") would be better for this.
Even get_text("blocks") outputs text and image blocks, each accompanied by the coordinates of the rectangle it occupies. That format would be more suitable here, because you could sort by (1) vertical, then (2) horizontal coordinates.
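The sorting idea above could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper names `sort_blocks`, `describe` and `page_items` are made up here, and the sketch assumes each entry returned by get_text("blocks") is a tuple of the form (x0, y0, x1, y1, content, block_no, block_type), with block_type 1 marking an image block.

```python
from typing import List, Tuple

# Assumed shape of one entry of page.get_text("blocks"):
# (x0, y0, x1, y1, content, block_no, block_type); block_type 0 = text, 1 = image.
Block = Tuple[float, float, float, float, str, int, int]


def sort_blocks(blocks: List[Block]) -> List[Block]:
    """Order blocks by their top-left corner: vertical first, then horizontal."""
    return sorted(blocks, key=lambda b: (b[1], b[0]))


def describe(block: Block) -> str:
    """Return an image placeholder, or the block's text with whitespace collapsed."""
    if block[6] == 1:
        return "[image]"
    return " ".join(block[4].split())


def page_items(file_path: str) -> List[str]:
    """Hypothetical helper: text and image placeholders of a PDF in visual order."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    items: List[str] = []
    with fitz.open(file_path) as doc:
        for page in doc:
            blocks = sort_blocks(page.get_text("blocks"))
            items.extend(describe(b) for b in blocks)
    return items
```

Calling `page_items(file_path)` would then yield the text blocks with an "[image]" marker wherever an image block sits on the page. Note that a plain top-to-bottom sort like this assumes a single-column layout; multi-column pages would likely need the blocks grouped by column first.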
