PDF Text Extraction Order Not Matching Visual Layout Despite Correct Coordinates #4124

Phalgun-Santhapuri · 2024-12-09T06:43:38Z

I am extracting text from a PDF with PyMuPDF, but the order of the extracted text does not match the visual reading order of the page.

Details of the Issue:

The PDF's text is correctly positioned according to its coordinates (bounding boxes), but the logical extraction order is incorrect.
For example, on the first page of my PDF:

After extracting line 2, the tool directly jumps to a table at the bottom of the page, skipping intervening text.
Later, it picks up lines 3–20 in an unordered manner.
I have verified that the issue is not related to column or layout misalignment, as the coordinates are accurate.

The document uses multi-column layouts mixed with other elements such as tables, so the overall layout is complex.
I get the bounding-box information from the "dict" output and apply further logic to it afterwards.
However, the text in the "dict" output, and in every other extraction option I tried, already comes back in the incorrect order described above.

JorjMcKie · 2024-12-09T07:32:54Z

Maintainer

All of what you mention is normal for PDFs: in extreme cases, every single character may appear in an arbitrary sequence when extracted. A "natural" reading sequence can only be established by explicitly sorting the output.

So we need an example page before we can say anything else.
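To illustrate what "explicitly sorting" can mean, here is a minimal sketch in plain Python. It assumes blocks shaped like the first five fields of the tuples returned by `page.get_text("blocks")`, that is `(x0, y0, x1, y1, text)`. The helper name `sort_reading_order` and the `y_tolerance` value are hypothetical, chosen for illustration only, and the approach covers single-column pages, not multi-column layouts.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text): same field order as the first five items
# of a PyMuPDF "blocks" tuple. The sample data below is made up.
Block = Tuple[float, float, float, float, str]


def sort_reading_order(blocks: List[Block], y_tolerance: float = 3.0) -> List[Block]:
    """Order blocks top-to-bottom, then left-to-right within a row.

    Blocks whose top edges differ by at most y_tolerance from the first
    block of the current row are treated as belonging to that row.
    """
    by_top = sorted(blocks, key=lambda b: (b[1], b[0]))
    rows: List[List[Block]] = []
    for block in by_top:
        if rows and abs(block[1] - rows[-1][0][1]) <= y_tolerance:
            rows[-1].append(block)  # same visual row
        else:
            rows.append([block])    # start a new row
    # Within each row, read left to right.
    return [b for row in rows for b in sorted(row, key=lambda b: b[0])]


# Blocks in the arbitrary order a PDF might store them.
sample_blocks: List[Block] = [
    (72, 700, 300, 720, "Table row"),
    (72, 100, 200, 112, "Line 2"),
    (72, 80, 200, 92, "Line 1"),
    (300, 101, 400, 113, "Line 2 right"),
]

ordered = sort_reading_order(sample_blocks)
for block in ordered:
    print(block[4])
```

PyMuPDF's `page.get_text()` also accepts `sort=True`, which sorts the output by vertical, then horizontal position; the sketch above only makes that idea explicit and adds a row tolerance.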

3 replies

Phalgun-Santhapuri Dec 10, 2024
Author

@JorjMcKie, okay, sure.
In the attached example PDF, the first line is "test john", but in the extracted text the name appears further down, after the table.
The text order does not match the visual layout.
The extraction needs to preserve the layout information as well.

Pymupdf_sample.pdf

JorjMcKie Dec 10, 2024
Maintainer

This is probably for you:

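import pathlib

import pymupdf
import pymupdf4llm

doc = pymupdf.open("sample.pdf")
# margins=0: consider the full page, with no border area excluded
md = pymupdf4llm.to_markdown(doc, margins=0)
# write UTF-8 explicitly so non-ASCII text is safe on every platform
pathlib.Path(doc.name + ".md").write_text(md, encoding="utf-8")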

Produces this
sample.pdf.md

Phalgun-Santhapuri Dec 10, 2024
Author

@JorjMcKie, this solution works for the current issue. However, when text is closely spaced in a simple layout or a column layout, the text with the smaller y-value is output first, which does not give an accurate reading order. I also noticed that some text is missing from the extracted page content in readable documents when the PDF uses a column layout.
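The y-value problem described above can be sketched in plain Python: a pure top-to-bottom sort interleaves the columns, whereas assigning each block to a column first keeps each column together. This is only an illustration under stated assumptions: the blocks are `(x0, y0, x1, y1, text)` tuples, the columns are of equal width, `page_width` is known, and the helper name `column_order` is hypothetical. It is not how pymupdf4llm detects columns, and it does not handle full-width elements such as titles or tables.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text); the sample coordinates below are made up.
Block = Tuple[float, float, float, float, str]


def column_order(blocks: List[Block], page_width: float, n_columns: int = 2) -> List[Block]:
    """Read column by column, top-to-bottom inside each column.

    Assumes equal-width columns; a block belongs to the column that
    contains its horizontal center.
    """
    col_width = page_width / n_columns

    def sort_key(block: Block) -> Tuple[int, float, float]:
        center_x = (block[0] + block[2]) / 2
        # Clamp so a block touching the right page edge stays in the last column.
        col = min(int(center_x // col_width), n_columns - 1)
        return (col, block[1], block[0])

    return sorted(blocks, key=sort_key)


# Two columns whose lines are slightly offset vertically.
two_column_blocks: List[Block] = [
    (320, 98, 550, 118, "R1"),
    (50, 100, 280, 120, "L1"),
    (320, 128, 550, 148, "R2"),
    (50, 130, 280, 150, "L2"),
]

# Sorting by y alone picks the smaller y-value first and mixes the columns.
naive = [b[4] for b in sorted(two_column_blocks, key=lambda b: (b[1], b[0]))]
# Column-aware ordering keeps each column together.
by_column = [b[4] for b in column_order(two_column_blocks, page_width=600)]

print(naive)
print(by_column)
```

Real documents rarely have perfectly equal columns, so the column boundaries would normally have to be derived from the block coordinates themselves.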
