Question: Inserting unicode any utf-8 without detecting the language with a custom font #690
Replies: 10 comments
-
The Tesseract PDF renderer code has some useful info (at least for me) on the way they do it: https://github.com/tesseract-ocr/tesseract/blob/master/src/api/pdfrenderer.cpp#L35
-
In case you're looking for hOCR files for my example program: (You will need to gunzip it. I made my program stop after one page for all my tests.) ... A Chinese hOCR file will follow momentarily.
-
Sorry, I have now pushed the latest code, plus a branch that does load the glyphless font, although nothing seems to get added to the PDF. Here is a similar file with hOCR (but really, one character being added to the PDF with such a glyphless font could be enough): (You might want to ...) I suppose the glyphless font requires more of the hacks that Tesseract applies to map all characters to 0, as mentioned in the PDF renderer.
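To make the "map all characters to 0" idea concrete, here is a minimal sketch of building a CIDToGIDMap stream in which every 2-byte character code points at the same glyph. It is not a port of the Tesseract code. The function name and the default glyph index are assumptions for illustration, and the index Tesseract really uses should be checked against the linked pdfrenderer.cpp. The only fact relied on is the PDF layout of such a map: two bytes per CID, high-order byte first.

```python
import struct
import zlib


def build_cid_to_gid_map(glyph_index=0, num_cids=1 << 16):
    """Return (raw, compressed) bytes for a CIDToGIDMap stream.

    Every CID from 0 to num_cids - 1 is mapped to the same glyph index.
    glyph_index=0 follows the wording in this thread and is an assumption,
    not a value confirmed from the Tesseract source.
    """
    entry = struct.pack(">H", glyph_index)  # 2 bytes per CID, big-endian
    raw = entry * num_cids
    return raw, zlib.compress(raw)


raw_map, packed_map = build_cid_to_gid_map()
print(len(raw_map), len(packed_map))
```

Because the table is one entry repeated 65536 times, it compresses to a tiny stream, which is part of why a glyphless font setup can stay so small.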
-
I haven't tried inserting text with a glyphless font with PyMuPDF before.
-
There is the repo https://github.com/jbarlow83/OCRmyPDF, which has some overlaps with your work I believe ...
-
Just tried it: Also tried insertText with the font: more or less the same. It does not complain about the glyphless font, but text extraction returns only spaces.
-
For your information, I am studying the Tesseract C++ code some more, and they seem to perform quite a few interesting hacks. Maybe it is not reasonable to assume that these will work with pymupdf. I will get back to you in a few days. Thanks.
-
It will be interesting to see where this leads. FYI: MuPDF v1.18.0 (not PyMuPDF yet) contains native support for OCR-based text extraction via Tesseract.
-
Understood. I am trying to integrate OCR results into PDFs, not to OCR PDF files. I'll keep you posted. I think I will write some minimal Python code that generates a small PDF (by hand), which I will then manipulate with pymupdf.
-
As a follow-up: I've ported the Tesseract pdfrenderer.cpp to Python here: https://git.archive.org/merlijn/archive-pdf-tools/-/blob/master/pdfrenderer.py The other file in that repo (recode.py) uses OCR result files (hOCR) and an input PDF with images to create a new searchable PDF. Tesseract does a lot of neat hacks/tricks to keep the size small. If you're interested, I can try to work with you on supporting something similar for text insertion in pymupdf, but I'm content with the pdfrenderer.py that I wrote -- it works with all unicode and the output PDF is really small.
-
I am hoping to create PDF files from 'hOCR' (output format of OCR engines) and create a (hidden!) text layer on top of a PDF with images. I already have a working proof of concept of this, although it's in very early stages: https://git.archive.org/merlijn/archive-pdf-tools/-/blob/master/hocr2pdf.py
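For readers unfamiliar with hOCR, here is a minimal sketch of pulling words and their bounding boxes out of such a file with only the standard library. It is not the code from hocr2pdf.py. The class and function names and the sample markup are made up for illustration, and it assumes each word is a span with class ocrx_word whose title attribute carries a "bbox x0 y0 x1 y1" entry, with no span nested inside a word.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read
        self._text = []     # text fragments of that word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes no span is nested inside a word span.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr_text):
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Made-up sample markup, not taken from the files mentioned in this thread.
SAMPLE_HOCR = """<div class='ocr_page' title='bbox 0 0 600 800'>
<span class='ocr_line' title='bbox 36 92 300 116'>
<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
<span class='ocrx_word' title='bbox 110 92 180 116; x_wconf 91'>世界</span>
</span></div>"""

for word, bbox in parse_hocr_words(SAMPLE_HOCR):
    print(word, bbox)
```

Note that hOCR boxes are in image pixels with the origin at the top left, while PDF coordinates are in points with the origin at the bottom left, so the boxes need scaling and a vertical flip before they can be used to place text.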
Changing render_mode=0 to render_mode=3 will indeed make the text invisible. But it only supports a very limited set of characters. In any case, it will look something like this with my current code:
I am not using the TextWriter interface since I need to be able to have the text fill the text boxes, with my own morph code.
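For context on render_mode=3: at the PDF content-stream level, invisible text is text drawn with text rendering mode 3 (the "3 Tr" operator), which neither fills nor strokes the glyphs but leaves the text selectable. Below is a minimal sketch that builds such a snippet by hand. It is not how PyMuPDF does it internally. The function name, the font resource name F1, and the assumption that F1 is a composite font addressed with 2-byte UTF-16BE codes are all illustrative.

```python
def invisible_text_ops(text, x, y, font_size, font_name="F1"):
    """Return PDF content-stream operators that place invisible text.

    Assumes (hypothetically) that font_name refers to a composite font
    on the page whose 2-byte character codes are UTF-16BE code units.
    """
    hex_codes = text.encode("utf-16-be").hex().upper()
    return (
        "BT\n"                                # begin text object
        "3 Tr\n"                              # rendering mode 3: invisible
        f"/{font_name} {font_size:g} Tf\n"    # select font and size
        f"1 0 0 1 {x:g} {y:g} Tm\n"           # position the text
        f"<{hex_codes}> Tj\n"                 # show the hex-encoded string
        "ET\n"                                # end text object
    )


print(invisible_text_ops("Hi", 72, 700, 12))
```

Stretching a word so that it fills its box would be done by changing the scale factors in the Tm matrix instead of the plain 1 0 0 1 used here.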
What I would like to do is use a glyphless font (this one is extracted from Tesseract: https://wizzup.org/glyphless.ttf), but I've had trouble loading the font. I believe such a font will reduce the size of the PDF considerably, since it is a very small font (572 bytes). And since I don't want to actually see the text, only make it selectable, that should work fine?
I could not figure out how to load the glyphless.ttf font using MuPDF and render text with it -- any tips? Thanks!