Cannot find image in PDF page where image exist #1470
-
Please check page 13 of 25 in the attached PDF.
Replies: 4 comments 1 reply
-
Well, you are wrong:
-
ok.
-
Or at least I would like to detect that I have this case, so I can run OCR on the full page when it happens.
-
This page has a large number of drawings, which should raise suspicion to start with:

>>> import fitz
>>> doc = fitz.open("PeerJ_2021_May_28_d5de.pdf")
>>> page = doc[12]
>>> paths = page.get_drawings()
>>> len(paths)
502

If you then look at the number of individual draw commands:

>>> draws = 0
>>> for p in paths:
...     draws += len(p["items"])
...
>>> draws
4623

... it becomes clear that something significant is being drawn on the page, which would explain why no embedded image is found there.

I have a script somewhere that helps separate drawings into disjoint subsets. Once you have those, join the associated rectangles and make pixmaps, which can then be OCRed separately.
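This is not the script mentioned above, only a minimal sketch of the same idea: merge the bounding rectangles of the drawings into disjoint regions, then render each region as its own pixmap. The helper names (`cluster_rects`, `render_regions`) and the `gap` tolerance are assumptions made for illustration. The only PyMuPDF calls used are `page.get_drawings()` (each path carries a `"rect"`) and `page.get_pixmap(clip=...)`.

```python
def overlaps(a, b, gap=0.0):
    """True if rectangles a and b (x0, y0, x1, y1) touch or lie within `gap` of each other."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2]
            and a[1] - gap <= b[3] and b[1] - gap <= a[3])


def union(a, b):
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def cluster_rects(rects, gap=0.0):
    """Join rectangles into mutually disjoint bounding boxes.

    Each new rectangle absorbs every existing cluster it touches; because the
    merged box may grow into further clusters, the scan repeats until stable.
    """
    clusters = []
    for r in rects:
        r = tuple(r)
        merged = True
        while merged:
            merged = False
            for i, c in enumerate(clusters):
                if overlaps(r, c, gap):
                    r = union(r, clusters.pop(i))
                    merged = True
                    break
        clusters.append(r)
    return clusters


def render_regions(page, gap=5.0, dpi=300):
    """Hypothetical helper: one pixmap per drawing cluster on a PyMuPDF page.

    Assumes a PyMuPDF version whose get_pixmap() accepts `dpi`;
    on older versions pass a `matrix` instead.
    """
    rects = [tuple(p["rect"]) for p in page.get_drawings()]
    return [page.get_pixmap(clip=box, dpi=dpi) for box in cluster_rects(rects, gap)]
```

With the `page` from the session above, `render_regions(page)` would return a list of pixmaps that can be handed to an OCR engine one at a time. A larger `gap` joins drawings that sit close together (for example, a chart and its axis labels) into a single region.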