Question: Inserting unicode any utf-8 without detecting the language with a custom font #690

MerlijnWajer · 2020-10-16T12:44:40Z

MerlijnWajer
Oct 16, 2020

I am hoping to create PDF files from 'hOCR' (output format of OCR engines) and create a (hidden!) text layer on top of a PDF with images. I already have a working proof of concept of this, although it's in very early stages: https://git.archive.org/merlijn/archive-pdf-tools/-/blob/master/hocr2pdf.py
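For readers unfamiliar with hOCR: it is HTML in which each recognized word carries its pixel bounding box in a `title` attribute. The following is a minimal, stdlib-only sketch of pulling words and boxes out of such a file. It is not taken from hocr2pdf.py; the names `WordBoxParser` and `parse_hocr_words` are made up for illustration, and it assumes words are `ocrx_word` spans without nested spans inside them.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates as "bbox x0 y0 x1 y1" inside the title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: no nested <span> inside a word span.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr):
    parser = WordBoxParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written sample fragment, not taken from the files linked in this thread.
sample = (
    "<span class='ocr_line' title='bbox 10 20 300 60'>"
    "<span class='ocrx_word' title='bbox 10 20 120 60; x_wconf 96'>Hello</span> "
    "<span class='ocrx_word' title='bbox 130 20 300 60; x_wconf 91'>world</span>"
    "</span>"
)
print(parse_hocr_words(sample))
```

The boxes are in image pixels with the origin at the top left, so they still have to be scaled and flipped into PDF page coordinates before any text is placed.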

Changing render_mode=0 to render_mode=3 will indeed make the text invisible. But it only supports a very limited set of characters.
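At the PDF level, render mode 3 corresponds to the `3 Tr` operator inside a text object. The sketch below builds such a content-stream fragment by hand to show what "invisible text" means in the file. It is not what PyMuPDF emits internally; `pdf_literal` and `invisible_text_ops` are hypothetical helpers, the font resource name `F1` is an assumption, and treating the string as Latin-1 is a simplification (the real byte-to-glyph mapping depends on the font's encoding).

```python
def pdf_literal(text):
    """Return text as a PDF literal string, escaping \\, ( and )."""
    # Simplification: only characters representable in Latin-1 are accepted.
    text.encode("latin-1")  # raises UnicodeEncodeError for anything else
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def invisible_text_ops(text, x, y, font_name="F1", fontsize=12.0):
    """Build a text object that places `text` at (x, y) without painting it."""
    ops = [
        "BT",                              # begin text object
        "3 Tr",                            # text rendering mode 3: invisible
        f"/{font_name} {fontsize:g} Tf",   # select font resource and size
        f"1 0 0 1 {x:g} {y:g} Tm",         # text matrix: translate to (x, y)
        f"{pdf_literal(text)} Tj",         # show the string
        "ET",                              # end text object
    ]
    return " ".join(ops).encode("latin-1")


print(invisible_text_ops("Hi (x)", 72, 700))
```

A single-byte string like this can only address the 8-bit encoding of a simple font, which is one general reason such text insertion covers a limited character set; whether that is the exact limitation hit here is not confirmed in this thread.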

In any case, it will look something like this with my current code:

I am not using the TextWriter interface since I need to be able to have the text fill the text boxes, with my own morph code.

What I would like to do is use a glyphless font (this one is extracted from Tesseract: https://wizzup.org/glyphless.ttf), but I've had trouble loading the font. I believe such a font will save a lot in size of the PDF, since it is a very small font (572 bytes), and since I don't want to actually see the text, and just make it selectable, that should work fine?

I could not figure out how to load the glyphless.ttf font using MuPDF and render text with it -- any tips?

Thanks!

MerlijnWajer · 2020-10-16T12:46:37Z

MerlijnWajer
Oct 16, 2020
Author

The Tesseract PDF renderer code has some useful info (at least for me) on the way they do it: https://github.com/tesseract-ocr/tesseract/blob/master/src/api/pdfrenderer.cpp#L35

0 replies

MerlijnWajer · 2020-10-16T12:57:39Z

MerlijnWajer
Oct 16, 2020
Author

In case you're looking for hOCR files for my example program:

https://archive.org/download/sim_english-illustrated-magazine_1884-12_2_15/sim_english-illustrated-magazine_1884-12_2_15_tesseract_hocr.html.gz

(You will need to gunzip it, I made my program stop after one page for all my tests)

... Chinese hOCR file will follow momentarily.

0 replies

MerlijnWajer · 2020-10-16T13:35:37Z

MerlijnWajer
Oct 16, 2020
Author

Sorry, I now pushed the latest code. Also a branch that does load the glyphless font, but nothing seems to get added to the PDF.

Here is a similar file with hOCR (but really, one character being added to the PDF with such a glyphless font could be enough):

(You might want to wget it) https://wizzup.org/lanjingdeyanjing0000bing_tesseract_hocr.html

I suppose the glyphless font requires more hacks that Tesseract applies to map all characters to 0, as mentioned in the pdf renderer.

0 replies

JorjMcKie · 2020-10-16T20:47:52Z

JorjMcKie
Oct 16, 2020
Maintainer

I haven't tried inserting text with a glyphless font with PyMuPDF before.
I believe it won't work - at least not when using insertText / insertTextbox.
Might be that TextWriter is better here, but I doubt it - it also has limitations for the fonts it supports.
Why don't you just use invisible text, text rendering mode = 3?

0 replies

JorjMcKie · 2020-10-16T20:50:32Z

JorjMcKie
Oct 16, 2020
Maintainer

There is the repo https://github.com/jbarlow83/OCRmyPDF, which has some overlaps with your work I believe ...

0 replies

JorjMcKie · 2020-10-16T21:08:02Z

JorjMcKie
Oct 16, 2020
Maintainer

Just tried it:
TextWriter does accept the glyphless font as a font using font = fitz.Font(fontfile="glyphless.ttf").
You can then write (append) text to the text writer. But when extracting later, all characters are translated to spaces, except the former spaces, which have been translated to non-breaking spaces.

Also tried insertText with the font: more or less the same. It does not complain about the glyphless font, but text extraction returns spaces.

0 replies

MerlijnWajer · 2020-10-19T16:27:35Z

MerlijnWajer
Oct 19, 2020
Author

For your information, I am studying the Tesseract C++ code some more, and they seem to perform quite a few interesting hacks. Maybe it is not reasonable to assume that these will work with pymupdf. I will get back to you in a few days. Thanks.

0 replies

JorjMcKie · 2020-10-19T19:06:49Z

JorjMcKie
Oct 19, 2020
Maintainer

Interesting to see where this leads to.

FYI: MuPDF v1.18.0 (not PyMuPDF yet) contains native support for OCR-based text extraction via Tesseract.
I haven't looked deeply into it yet, and MuPDF has declared this as experimental for the time being. My main concern, however, is that it is unclear how to bundle or integrate my builds with some possibly existing Tesseract or, alternatively, offer a joint overall build ...

0 replies

MerlijnWajer · 2020-10-19T19:14:25Z

MerlijnWajer
Oct 19, 2020
Author

Understood. I am trying to integrate OCR results into PDFs, not to OCR PDF files. I'll keep you posted. I think I will have some minimal Python code that generates a small PDF (by hand), which I will then manipulate with pymupdf.

0 replies

MerlijnWajer · 2020-10-22T12:00:14Z

MerlijnWajer
Oct 22, 2020
Author

As a follow up... I've ported the tesseract pdfrenderer.cpp to Python here:

https://git.archive.org/merlijn/archive-pdf-tools/-/blob/master/pdfrenderer.py

And then the other file in that repo (recode.py) uses OCR-result files (hOCR) and an input-pdf with images to create a new searchable pdf. In the mode == 2 mode, it will also apply the MRC as I discussed in the other ticket.

Tesseract does a lot of neat hacks/tricks to get the size to be small. If you're interested I can try to work with you on supporting something similar with regard to text insertion in pymupdf, but I'm content with the pdfrenderer.py that I wrote -- it works with all Unicode and the output PDF is really small.
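One ingredient that makes "all Unicode" possible with a single tiny font is writing the text as two-byte codes rather than single bytes. The sketch below shows only that encoding step: turning a string into a PDF hex string of UTF-16BE code units. It is a hedged illustration, not code from pdfrenderer.py or pdfrenderer.cpp; the helper name `utf16be_hex` is made up, and it assumes the font is set up elsewhere as a composite font with a two-byte encoding and a matching ToUnicode map.

```python
def utf16be_hex(text):
    """Encode text as a PDF hex string of UTF-16BE code units.

    Characters outside the Basic Multilingual Plane become surrogate
    pairs, so they take four bytes (eight hex digits) each.
    """
    return "<" + text.encode("utf-16-be").hex().upper() + ">"


print(utf16be_hex("A"))           # one BMP character, two bytes
print(utf16be_hex("\u85cd"))      # a CJK character, still two bytes
print(utf16be_hex("\U0001F600"))  # non-BMP character, surrogate pair
```

In a content stream such a string would be followed by `Tj`, in the same position as a literal string. Since the glyphs are never painted, every code can point at the same empty glyph, which is why the embedded font can stay so small.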

0 replies
