Extracting text and image from pdf using pymupdf python #1262
First of all, the output formats XML and (X)HTML are out of my control. They are thin wrappers around the respective original MuPDF functions. Second, all bare extraction formats return text in the same sequence in which the document creator specified it, so you may, for example, get a header after the footer text. This sequence can be arbitrary, down to single characters. This means that if a page contains text written with n characters, there are up to [...]

I don't know why you are using XHTML output (which strips off position information) as opposed to HTML, but with HTML you would at least be able to detect the actual position / sequence of page elements.
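
To illustrate the point about position information, here is a minimal sketch that reorders page elements by their bounding boxes. It is an assumption-laden example, not the approach from the reply: instead of parsing HTML it uses PyMuPDF's "dict" output, which reports each block with a `type` (0 for text, 1 for image) and a `bbox`. The helper names `block_text`, `blocks_in_visual_order` and `print_pdf_in_visual_order` are hypothetical, and the top-to-bottom, left-to-right sort is a simple heuristic that only holds for single-column pages.

```python
def block_text(block):
    """Join the span texts of one text block from a "dict"-style extract."""
    lines = []
    for line in block.get("lines", []):
        lines.append("".join(span["text"] for span in line.get("spans", [])))
    return " ".join(lines).strip()


def blocks_in_visual_order(blocks):
    """Sort blocks top-to-bottom, then left-to-right, by bounding box.

    Heuristic only: assumes a single-column layout.
    """
    ordered = sorted(blocks, key=lambda b: (round(b["bbox"][1]), b["bbox"][0]))
    items = []
    for b in ordered:
        if b["type"] == 1:  # image block: keep only its position here
            items.append(("image", b["bbox"]))
        else:  # type 0: text block
            items.append(("text", block_text(b)))
    return items


def print_pdf_in_visual_order(file_path):
    # PyMuPDF is imported here so the helpers above work without it.
    import fitz

    doc = fitz.open(file_path)
    for page in doc:
        blocks = page.get_text("dict")["blocks"]
        for kind, value in blocks_in_visual_order(blocks):
            print(kind, value)
```

Calling `print_pdf_in_visual_order(file_path)` would then print each page's text and image entries in visual order rather than in the order the PDF creator stored them. For multi-column layouts the sort key would need to take columns into account.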
I am trying to scrape text and images from a PDF using Python. However, the text and images are not extracted properly: the position of the image within the text sequence is not maintained. Please find attached the image of the PDF I want to scrape.
The output of the code is:
As you can see in the output, the image tag is wrongly placed after "For the latter, assumptions requiring medical". It should have been placed after "type 1 diabetes in-silico subjects".
Please help me, as I am stuck on this and can't find a solution. My code is:
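import re
import fitz  # PyMuPDF

doc = fitz.open(file_path)
for i in range(len(doc)):
    # loadPage/getText are the older camelCase names;
    # newer PyMuPDF releases call them load_page/get_text
    page1 = doc.loadPage(i)
    page1text = page1.getText("xhtml")
    page1text = page1text.strip()
    page1text = page1text.strip('\n')
    # collapse every run of whitespace into a single space
    page1text = re.sub(r'\s+', ' ', page1text)
    print(page1text)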