PDF Text Extraction Order Not Matching Visual Layout Despite Correct Coordinates #4124
Unanswered
Phalgun-Santhapuri asked this question in Looking for help
-
Everything you describe is normal for PDFs: in extreme cases, every single character may appear in an arbitrary sequence when extracted. A "natural" reading sequence can only be established by explicitly sorting the output. So we need an example page before we can say anything else.
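As a rough illustration of what "explicitly sorting for output" can mean, here is a minimal sketch that orders extracted items top to bottom, then left to right, using only their bounding boxes. The function name `sort_reading_order` and the `y_tolerance` parameter are made up for this example, and the items are assumed to be dicts with a `"bbox"` key of the form `(x0, y0, x1, y1)` with y growing downward, which is the shape PyMuPDF's dict output uses for blocks and lines.

```python
def sort_reading_order(items, y_tolerance=3.0):
    """Order items top-to-bottom, then left-to-right.

    Hypothetical helper: each item is a dict with a "bbox" key holding
    (x0, y0, x1, y1), with y growing downward. Items whose top edges lie
    within y_tolerance of the first item in a row are treated as one row.
    """
    # First pass: global sort by top edge, then left edge.
    ordered = sorted(items, key=lambda it: (it["bbox"][1], it["bbox"][0]))

    # Second pass: group items into rows by comparing top edges.
    rows = []
    for it in ordered:
        if rows and abs(it["bbox"][1] - rows[-1][0]["bbox"][1]) <= y_tolerance:
            rows[-1].append(it)
        else:
            rows.append([it])

    # Within each row, read left to right.
    return [it for row in rows for it in sorted(row, key=lambda it: it["bbox"][0])]


# Made-up sample data standing in for extracted text items.
sample = [
    {"bbox": (300, 99, 400, 110), "text": "b"},
    {"bbox": (50, 101, 150, 112), "text": "a"},
    {"bbox": (50, 700, 400, 760), "text": "table"},
    {"bbox": (50, 50, 400, 70), "text": "title"},
]
print([it["text"] for it in sort_reading_order(sample)])
# → ['title', 'a', 'b', 'table']
```

This is only a plain coordinate sort, so it will interleave the columns of a multi-column page; such pages need column detection before sorting. Recent PyMuPDF versions also accept `sort=True` in `page.get_text(...)`, which applies a similar vertical-then-horizontal ordering.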
-
I am working on extracting text from a PDF using PyMuPDF. However, I am encountering an issue where the order of the extracted text does not match the visual layout flow of the PDF.
Details of the Issue:
The PDF's text is correctly positioned according to its coordinates (bounding boxes), but the logical extraction order is incorrect.
For example, on the first page of my PDF:
After extracting line 2, the tool directly jumps to a table at the bottom of the page, skipping intervening text.
Later, it picks up lines 3–20 in an unordered manner.
I have verified that the issue is not related to column or layout misalignment, as the coordinates are accurate.
The document has a complex layout, with multiple columns and mixed elements such as tables.
I get the bounding box information from the "dict" output and then apply further logic to it.
However, I observed that the text in the "dict" output, as well as in every other extraction option, already has the incorrect order described above.
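For illustration, the step of reading bounding boxes from the "dict" output could look roughly like the sketch below. The helper name `lines_with_bboxes` is made up for this example, and the input is assumed to have the nesting that `page.get_text("dict")` returns in PyMuPDF: blocks containing lines containing spans, where text blocks have `type` 0. The lines come back in stored order, not reading order.

```python
def lines_with_bboxes(page_dict):
    """Flatten a dict-style page into (bbox, text) pairs, one per line.

    Hypothetical helper: assumes the blocks -> lines -> spans nesting of
    PyMuPDF's "dict" output. The result keeps the order in which the
    lines are stored, which may differ from the visual reading order.
    """
    result = []
    for block in page_dict.get("blocks", []):
        if block.get("type", 0) != 0:
            continue  # skip non-text (image) blocks
        for line in block.get("lines", []):
            # A line's text is the concatenation of its spans.
            text = "".join(span["text"] for span in line.get("spans", []))
            result.append((tuple(line["bbox"]), text))
    return result


# Hand-built sample in the assumed shape; with PyMuPDF installed, the
# input would instead be page.get_text("dict") for a page of the document.
sample_page = {
    "blocks": [
        {"type": 0, "bbox": (50, 700, 400, 760), "lines": [
            {"bbox": (50, 700, 400, 712),
             "spans": [{"text": "Table "}, {"text": "row 1"}]},
        ]},
        {"type": 1, "bbox": (0, 0, 10, 10)},  # image block, ignored
        {"type": 0, "bbox": (50, 100, 400, 112), "lines": [
            {"bbox": (50, 100, 400, 112), "spans": [{"text": "Line 3"}]},
        ]},
    ]
}
for bbox, text in lines_with_bboxes(sample_page):
    print(bbox, text)
```

In this made-up sample the table line is stored before "Line 3" even though it sits lower on the page, which is the kind of mismatch described above.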