Cannot find image in PDF page where image exist #1470

Answered by JorjMcKie
dannydocrxiv asked this question in Q&A

Or at least know I have this case and do OCR on the full page in these cases

This page has a large number of drawings, which should raise suspicion to start with:

>>> doc=fitz.open("PeerJ_2021_May_28_d5de.pdf")
>>> page=doc[12]
>>> paths = page.get_drawings()
>>> len(paths)
502
>>>

If you then look at the number of individual draw commands:

>>> draws = 0
>>> for p in paths:
...     draws += len(p["items"])
...
>>> draws
4623

... it becomes clear that something significant is being done on the page.
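
To flag such pages automatically (for example, to fall back to full-page OCR), the two counts above can be folded into one check. This is only a sketch: the helper names `count_draw_commands` and `is_drawing_heavy` and the threshold of 1000 are assumptions for illustration, not part of PyMuPDF, so tune the threshold on your own documents.

```python
def count_draw_commands(paths):
    """Return (number of paths, total number of draw commands).

    `paths` is the list returned by page.get_drawings(); each entry is a
    dict whose "items" key holds the individual draw commands.
    """
    return len(paths), sum(len(p["items"]) for p in paths)


def is_drawing_heavy(paths, min_commands=1000):
    """Heuristic: True if the page has at least `min_commands` draw commands.

    The default of 1000 is an arbitrary, hypothetical cut-off.
    """
    _, commands = count_draw_commands(paths)
    return commands >= min_commands
```

Called as `is_drawing_heavy(page.get_drawings())`, this would flag the page above, since 4623 is well over the assumed threshold.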

I have a script somewhere that helps separate drawings into disjoint subsets. Once you have those, join the associated rectangles of each subset and make pixmaps from them, which can then be OCRed separately.
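
The script mentioned above is not shown in this thread. As a rough stand-in, one way to get disjoint subsets is to repeatedly join the bounding rectangles of drawings that touch or nearly touch. Everything below is an assumption made for illustration: the function names, the `gap` tolerance, the `zoom` factor and the minimum-size filter are hypothetical, and this is not necessarily how the original script works.

```python
def merge_touching_rects(rects, gap=5.0):
    """Join rectangles (x0, y0, x1, y1) that overlap or lie within `gap`
    points of each other. Returns pairwise non-touching rectangles."""
    clusters = []
    for x0, y0, x1, y1 in rects:
        merged = True
        while merged:  # re-check after growing: new neighbours may now touch
            merged = False
            for i, c in enumerate(clusters):
                if (x0 <= c[2] + gap and c[0] <= x1 + gap
                        and y0 <= c[3] + gap and c[1] <= y1 + gap):
                    # absorb the cluster into the current rectangle
                    x0, y0 = min(x0, c[0]), min(y0, c[1])
                    x1, y1 = max(x1, c[2]), max(y1, c[3])
                    del clusters[i]
                    merged = True
                    break
        clusters.append((x0, y0, x1, y1))
    return clusters


def render_clusters(page, gap=5.0, zoom=2.0):
    """Return one pixmap per joined drawing area of a PyMuPDF page."""
    import fitz  # PyMuPDF; imported here so the helper above has no dependency

    rects = [tuple(p["rect"]) for p in page.get_drawings()]
    mat = fitz.Matrix(zoom, zoom)
    return [
        page.get_pixmap(matrix=mat, clip=fitz.Rect(r))
        for r in merge_touching_rects(rects, gap)
        if r[2] - r[0] > 1 and r[3] - r[1] > 1  # skip degenerate areas
    ]
```

Each pixmap returned by `render_clusters(page)` can then be handed to the OCR tool of your choice. A larger `gap` yields fewer, bigger areas; a smaller one keeps nearby figures apart.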

Answer selected by JorjMcKie

This discussion was converted from issue #1468 on December 18, 2021 08:32.