'Flattening' Annotation text into the searchable of a document #1181

petertennis · 2021-07-30T15:30:34Z

petertennis
Jul 30, 2021

I am processing a diverse set of architectural documents with the aim of making them more searchable in PDF viewers. Some have native PDF text, some don't, and often they have only partial native text. So I OCR each page and add the text elements that are not already present (I check for overlaps etc. to accomplish this).
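
For readers wondering what such an overlap check might look like, here is a hypothetical sketch. It is not necessarily what the poster's code does. It assumes rectangles are plain (x0, y0, x1, y1) tuples, and the function names and the 0.5 threshold are invented for illustration.

```python
def overlap_ratio(a, b):
    """Return the fraction of rectangle a's area that is covered by rectangle b."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    area = (ax1 - ax0) * (ay1 - ay0)
    if area <= 0:
        return 0.0
    # Width and height of the intersection; non-positive means no overlap.
    w = min(ax1, bx1) - max(ax0, bx0)
    h = min(ay1, by1) - max(ay0, by0)
    if w <= 0 or h <= 0:
        return 0.0
    return (w * h) / area


def new_ocr_words(ocr_words, native_rects, threshold=0.5):
    """Keep only OCR words that no native text rectangle covers by `threshold` or more.

    ocr_words is a list of (rect, text) pairs; native_rects is a list of rects.
    """
    return [
        (rect, text)
        for rect, text in ocr_words
        if all(overlap_ratio(rect, n) < threshold for n in native_rects)
    ]
```

In PyMuPDF the native rectangles could be taken from the first four items of each entry returned by page.get_text("words").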

Now I notice that some of my documents have Annotations that effectively contain the text for a particular part of the page. The quality of this content is often better than my OCR results and the Annotation bounding boxes seem to be in the right place to line up with the visual text.

So... I would like to push this text into the natively searchable layer. Ideally, this would be accomplished without removing the Annotations themselves. Is there any function to do this automatically in PyMuPDF? Or any other thoughts on what I am trying to accomplish?

Thanks for reading!

petertennis · 2021-07-30T15:35:09Z

petertennis
Jul 30, 2021
Author

Here is an example for reference. Notice how the table at the bottom of the document has lots of Annotations, but you cannot easily search the text via Ctrl-F etc.
Sample Tabular Finish Schedule - 5651.pdf

0 replies

JorjMcKie · 2021-07-30T19:18:39Z

JorjMcKie
Jul 30, 2021
Maintainer

Hm, I understand.
You could insert your own invisible text, similar to how an OCR program would do it. There are issues ... as there always are, of course:

I see no way to detect the font and fontsize with which the annot text is displayed.
It is easy to extract the annot's text: annot.info["content"].
But because there seems to be no way to find out its fontsize, some heuristic must be used to determine one such that the text fits into annot.rect when using page.insert_textbox or TextWriter.fill_textbox.
The current version 1.18.15 has a "simulate" parameter for TextWriter.fill_textbox. So one could start with some reasonable fontsize and gradually decrease it until the text fits in the annot rect.
Once this is done, write the textwriter content (= annot text) to the page, making the text hidden.
That text will be searchable, but it will not be positioned exactly underneath the annot, so the cursor selection will not coincide exactly with the visible text.
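
The steps in this reply can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a tested recipe. It uses page.insert_textbox (named above) rather than TextWriter.fill_textbox, and relies on insert_textbox returning a negative number, and writing nothing, when the text does not fit. The helper names shrink_to_fit and flatten_annot_text, the starting size of 12 and the step of 0.5 are assumptions for illustration, and page rotation is ignored here.

```python
def shrink_to_fit(try_insert, start=12.0, minimum=2.0, step=0.5):
    """Call try_insert(fontsize) with decreasing sizes until it reports success.

    try_insert must return a value >= 0 when the text fitted (and was written)
    and a negative value when it did not fit (and nothing was written).
    Returns the fontsize that worked, or None if no size down to `minimum` fitted.
    """
    size = start
    while size >= minimum:
        if try_insert(size) >= 0:
            return size
        size -= step
    return None


def flatten_annot_text(page, fontname="helv"):
    """Write each annotation's text as invisible page text inside annot.rect.

    The annotations themselves are left in place.
    Returns a list of (text, fontsize) pairs; fontsize is None if nothing fitted.
    """
    # Collect rect and text first, so the annotation iterator is finished
    # before the page content is modified.
    todo = [(annot.rect, annot.info.get("content", "")) for annot in page.annots()]
    results = []
    for rect, text in todo:
        if not text.strip():
            continue  # annotation carries no text
        size = shrink_to_fit(
            lambda fs: page.insert_textbox(
                rect,
                text,
                fontname=fontname,  # assumption: the real font is unknown
                fontsize=fs,
                render_mode=3,  # 3 = invisible text
            )
        )
        results.append((text, size))
    return results
```

With PyMuPDF installed, something like doc = fitz.open("input.pdf"), then flatten_annot_text(doc[0]), then doc.save("output.pdf") would apply this to the first page; the file names are placeholders.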

2 replies

petertennis Jul 30, 2021
Author

As always, thanks for the prompt and very thoughtful reply!

Yes, it all makes sense. It is a bit of a chore but definitely doable, and your suggestion should get a decent result. I am already familiar with not getting an exact lining up of the visible and invisible text, and it's something I can live with. Hopefully I will get a chance to implement it soon, and I will drop a comment back here if I do.

Thanks again and have a great weekend!

JorjMcKie Jul 30, 2021
Maintainer

Thank you for the compliments 😎. Have a nice weekend too.
I noticed that the example page is rotated by 270 degrees. As you may know, I am assuming and returning coordinates (rectangles, points, etc.) pertaining to the unrotated page.
So when inserting text, computing the maximum fitting text length and such, remember this fact: compute the acceptable text length based on the annot rect's height (!), and then insert the text with rotation angle 270, starting in the correct (bottom left) corner of the rect.
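
A rough sketch of that rotation advice, under stated assumptions: rectangles are (x0, y0, x1, y1) in unrotated page coordinates, text width grows linearly with fontsize, and passing rotate=page.rotation to page.insert_textbox makes the text run the same way as the visible text (this matches the 270 degree case described here; other rotations are an untested generalization). The function names, the measure callable and the cap of 12 are hypothetical. The sketch lets insert_textbox place the text inside the rect instead of picking a start corner by hand.

```python
def line_length_for_rotation(rect, page_rotation):
    """Return the length available to one line of text inside rect.

    On a page rotated by 90 or 270 degrees the line runs along the rect's
    height (in unrotated coordinates); otherwise it runs along its width.
    """
    x0, y0, x1, y1 = rect
    rotation = page_rotation % 360
    if rotation not in (0, 90, 180, 270):
        raise ValueError("page rotation must be a multiple of 90")
    return abs(y1 - y0) if rotation in (90, 270) else abs(x1 - x0)


def max_fontsize(width_at_size_1, line_length, cap=12.0):
    """Largest fontsize (at most `cap`) at which a single line fits line_length."""
    if width_at_size_1 <= 0:
        return cap
    return min(cap, line_length / width_at_size_1)


def insert_hidden_text_rotated(page, rect, text, measure, fontname="helv"):
    """Insert text invisibly into rect, rotated like the page.

    measure(text) must return the text width at fontsize 1.
    Returns (fontsize, rc); a negative rc means insert_textbox found the text
    did not fit and wrote nothing, so the caller should retry with a smaller size.
    """
    length = line_length_for_rotation(rect, page.rotation)
    size = max_fontsize(measure(text), length)
    rc = page.insert_textbox(
        rect,
        text,
        fontname=fontname,
        fontsize=size,
        rotate=page.rotation,  # assumption: follow the page rotation
        render_mode=3,  # 3 = invisible text
    )
    return size, rc
```

For a Base-14 font such as "helv", measure could be lambda t: fitz.get_text_length(t, fontname="helv", fontsize=1).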
