Missing image/text from getText('dict') #902

mani2106 · 2021-02-16T07:14:12Z

mani2106
Feb 16, 2021

I have been using this tool (version 1.17.0) to identify images and text from the output of getText('dict'). It works well for most files, but for the attached one it completely misses the content in the middle of the page.

I would have expected fitz to identify the middle portion at least as an image. Since it does not, is this a file-specific issue (corrupted content?) or something else? Can someone help me identify the problem? Thanks in advance.

output of getText('dict')

The pdf

Answered by JorjMcKie

JorjMcKie · 2021-02-16T09:00:49Z

JorjMcKie
Feb 16, 2021
Maintainer

What seems to be text here is really a plethora of drawings:
The PDF creator has synthesized every single letter as a drawing: a "D" is drawn as a left-closed right half circle, an "o" is a small circle, and so on.
Their motivation? Who knows! Maybe to make things difficult for you and me. If you extract the page's contents via page.read_contents() and store it in a file, you will get this:
cont.zip
The only way to get that text is via some OCR tool.
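
To look at the contents stream yourself, something like the following should work. It is a minimal sketch, not part of the original answer: the helper name dump_contents and the file names in the usage comment are made up for illustration, and the only PyMuPDF call it relies on is page.read_contents(), as mentioned above.

```python
from pathlib import Path


def dump_contents(page, out_path):
    """Write the page's raw /Contents stream to out_path; return its size in bytes.

    `page` is assumed to be a PyMuPDF Page, or any object whose
    read_contents() method returns bytes.
    """
    data = page.read_contents()
    Path(out_path).write_bytes(data)
    return len(data)


# Usage with PyMuPDF (both file names are placeholders):
#   import fitz
#   doc = fitz.open("input.pdf")
#   size = dump_contents(doc[0], "page0-contents.txt")
```

Opening the resulting file in a text editor shows the drawing operators directly, and the returned size gives a quick feel for how heavy the page description is.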

2 replies

mani2106 Feb 16, 2021
Author

Interesting. May I know how you came to that conclusion, and is there a way to do this with code?

As for OCR: the problem arises when we try to differentiate searchable from scanned PDFs, to decide which documents need OCR and which can go through fitz (we use the block types from getText('dict') for this). In this case the existing code classifies the file as a searchable PDF, and we end up with nothing but the header and footer. So if there is a way to detect drawings, we can use that to improve the existing code.

Thanks for the swift reply!

JorjMcKie Feb 16, 2021
Maintainer

Interesting. May I know how you came to that conclusion, and is there a way to do this with code?

Hm, probably not via a compelling conclusion, but you can at least become suspicious when you see:

only very little text
no image on the page
but a large page /Contents object (70 KB here).
Then you can search for an inline image, which is wrapped by the keyword pair "BI" / "EI". Inline images are an obvious, typical reason for large contents.
If no inline image turns up either, then this type of text "emulation" is probably present.

You can certainly extract a page's drawings: page.get_drawings(). This returns a list of Python dictionaries, each of which contains all the atomic drawing commands (lines, curves, rectangles) of one path. But it is hard to imagine that you can reach the correct conclusion by looking at these.
But of course: getting such a long list of drawing paths alone is a good indicator ...
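
The checklist above could be sketched in code roughly as follows. This is only an illustration under stated assumptions: the function names and both thresholds (200 characters, 50,000 bytes) are made up, the inline image test is a rough regular expression that can misfire on binary data, and the wrapper assumes a PyMuPDF version that offers the snake_case methods get_text(), get_images() and read_contents() (version 1.17.0 uses camelCase names such as getText).

```python
import re

# Inline images appear in a contents stream as "BI <dict> ID <data> EI".
# Rough pattern only: binary image data may produce false hits.
_INLINE_IMAGE = re.compile(rb"(?:^|\s)BI\s.*?\sID\s.*?\sEI(?:\s|$)", re.DOTALL)


def has_inline_image(contents):
    """Return True if the contents stream seems to hold an inline image."""
    return _INLINE_IMAGE.search(contents) is not None


def looks_like_drawn_text(text, image_count, contents,
                          max_text_chars=200, min_content_bytes=50_000):
    """Apply the suspicion checklist; both thresholds are arbitrary assumptions."""
    if len(text.strip()) > max_text_chars:
        return False  # plenty of real text was extracted
    if image_count > 0:
        return False  # the page references at least one image
    if len(contents) < min_content_bytes:
        return False  # contents too small to hide much
    # Large contents, little text, no image: suspicious unless an
    # inline image explains the size.
    return not has_inline_image(contents)


def page_looks_like_drawn_text(page, **thresholds):
    """Gather the inputs from a PyMuPDF page (or a look-alike) and apply the check."""
    return looks_like_drawn_text(
        page.get_text("text"),
        len(page.get_images()),
        page.read_contents(),
        **thresholds,
    )
```

A True result is a hint, not proof. It only says the page deserves a closer look, for example by sending it to OCR.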

JorjMcKie · 2021-02-16T13:37:39Z

JorjMcKie
Feb 16, 2021
Maintainer

Example:

>>> paths=page.get_drawings()
>>> len(paths)
11
>>> for p in paths:
    print(len(p["items"]))  # number of draw commands in this path

4
386
188
215
398
103
277
503
200
4
4

So apart from 3 paths with only 4 commands each, the other 8 paths each contain more than a hundred individual draw commands.
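
The same count can be condensed into a small summary. This is a hedged sketch: the function name drawing_summary and the cutoff of 100 items per path are assumptions chosen to fit the numbers above, not anything PyMuPDF prescribes. It only relies on each entry of page.get_drawings() having an "items" list, as in the example.

```python
def drawing_summary(paths, min_items=100):
    """Summarize a get_drawings() result.

    min_items is an arbitrary cutoff: a path with at least that many
    draw commands is counted as "dense".
    """
    counts = [len(p["items"]) for p in paths]
    return {
        "paths": len(counts),
        "dense_paths": sum(1 for c in counts if c >= min_items),
        "total_items": sum(counts),
    }


# Usage with PyMuPDF:
#   summary = drawing_summary(page.get_drawings())
```

For the page discussed here this reports 11 paths, 8 of them dense. A page with many dense paths and almost no extractable text is a reasonable candidate for OCR.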

1 reply

mani2106 Feb 17, 2021
Author

Thanks for the tip, I will see what I can do with this.
