I have been using this tool to identify images and text. My doubt here is that fitz should have been able to identify at least the middle portion as an image. Since it couldn't, is this a file-specific issue (corrupted content?) or something else? Can someone help me identify the problem? Thanks in advance.
What seems to be text here is really a plethora of drawings:
Example:
>>> paths = page.get_drawings()
>>> len(paths)
11
>>> for p in paths:
...     print(len(p["items"]))  # number of draw commands in this path
...
4
386
188
215
398
103
277
503
200
4
4

So except for 3 paths, the other 8 each contain hundreds of single draw commands.
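To make the same check repeatable across pages, the counts above could be wrapped in a small helper. This is only a sketch: `summarize_drawings` and `dense_threshold` are names made up for illustration, and the cutoff of 50 commands is an arbitrary assumption, not anything PyMuPDF defines. It relies only on the fact that each path dict has an `"items"` list whose entries are tuples starting with a command code such as `"l"`, `"c"`, `"re"` or `"qu"`.

```python
from collections import Counter


def summarize_drawings(paths, dense_threshold=50):
    """Summarize a list of path dicts shaped like page.get_drawings() output.

    dense_threshold is an arbitrary, hypothetical cutoff for "suspiciously
    many draw commands in one path".
    """
    # Number of draw commands in each path, in page order.
    sizes = [len(p["items"]) for p in paths]
    # How often each command code ("l", "c", "re", "qu", ...) occurs overall.
    commands = Counter(item[0] for p in paths for item in p["items"])
    # Indices of paths that hold at least dense_threshold commands.
    dense_paths = [i for i, n in enumerate(sizes) if n >= dense_threshold]
    return {"sizes": sizes, "dense_paths": dense_paths, "commands": dict(commands)}
```

With a real document this would be called as `summarize_drawings(page.get_drawings())`. A page where most paths land in `dense_paths` and curve commands dominate is a hint that glyph shapes were drawn as vector graphics.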
The PDF creator has synthesized every single letter as a drawing: a "D" is drawn as a left-closed right half circle, an "o" is a small circle, and so on.
His motivation? Who knows! Maybe to make things difficult for you and me. If you extract the page's contents via
page.read_contents()
and store them in a file, you will get this: cont.zip
The only way to get that text is via some OCR tool.
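As a rough sketch of what the OCR route might look like inside PyMuPDF itself: the helper names below (`looks_like_drawn_text`, `ocr_page_text`) and the `min_commands` cutoff are assumptions made up for illustration. The OCR part assumes a PyMuPDF version that offers `Page.get_textpage_ocr()` and a Tesseract installation whose language data PyMuPDF can find; any external OCR tool would serve just as well.

```python
def looks_like_drawn_text(text, paths, min_commands=100):
    """Heuristic: the page yields no extractable text but many draw commands.

    text  -- result of page.get_text()
    paths -- result of page.get_drawings()
    min_commands is an arbitrary, hypothetical cutoff.
    """
    total_commands = sum(len(p["items"]) for p in paths)
    return not text.strip() and total_commands >= min_commands


def ocr_page_text(page, dpi=300, language="eng"):
    """OCR the fully rendered page and return the recognized text.

    Assumes Page.get_textpage_ocr() is available and Tesseract is installed.
    """
    # full=True renders the whole page to an image before running OCR,
    # so vector drawings that merely look like letters are included.
    textpage = page.get_textpage_ocr(dpi=dpi, language=language, full=True)
    return page.get_text(textpage=textpage)
```

A possible use would be to call `ocr_page_text(page)` only when `looks_like_drawn_text(page.get_text(), page.get_drawings())` is true, so that pages with real text skip the much slower OCR step.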