Can OCR TextPage results be written to a page? #1453

caerulescens · 2021-12-13T13:17:53Z

caerulescens
Dec 13, 2021

Calling get_textpage_ocr() returns a TextPage object, which contains the full/partial OCR results. What's the best way to layer this information directly into the PDF so that each word underfills or overfills its bounding box as little as possible?

Answered by JorjMcKie

Dec 13, 2021


JorjMcKie · 2021-12-13T13:44:52Z

JorjMcKie
Dec 13, 2021
Maintainer

I'm not sure I fully understand: do you mean how to obtain minimal lineheights / bboxes?

If so, just do fitz.TOOLS.set_small_glyph_heights(True) before any text processing function (including search_for() etc.).
Apart from the OCR process executed as part of making the textpage, everything else behaves just like normal.
Once the textpage is created, it contains all information extracted from the document page, including any coordinates. Every other text function using this textpage will be supplied with information coming from this textpage exclusively ...

The fact that OCR took place when making the textpage is invisible (it can only be noticed when stumbling over the font name GlyphlessFont).

0 replies

caerulescens · 2021-12-13T13:54:27Z

caerulescens
Dec 13, 2021
Author

Sorry for being so vague; my goal is to take that information in the TextPage and write it over the image within the original OCR'd document. I've been using coordinate information from tesseract and TextWriter to accomplish this, but I wanted to see if there was a simpler way of doing this using PyMuPDF?

0 replies

JorjMcKie · 2021-12-13T13:55:05Z

JorjMcKie
Dec 13, 2021
Maintainer

However, if you meant: "How can I create a PDF page with an OCRed textlayer":

Extract the text using page.get_text("dict", textpage=tpocr). Then walk through the text spans and select any with font GlyphlessFont.
For each of these spans, call page.insert_text() as follows:

the insertion point is span["origin"]
choose "cour" (Courier) as the font, because GlyphlessFont also is (or seems to be) monospaced
compute the fontsize so that the inserted text is as wide as span["bbox"]
when done, save the document to a new file and check that text inside images is now selectable in your PDF viewer
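
The steps above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names (ocr_spans, courier_fontsize, add_text_layer) are made up for illustration, and the font size calculation assumes that every Courier glyph is 0.6 em wide instead of measuring the text with PyMuPDF. Passing overlay=False, so that the text ends up underneath the existing page content, is also a choice made for this sketch.

```python
COURIER_ADVANCE = 0.6  # assumption: each Courier glyph advances 0.6 em


def ocr_spans(textdict):
    """Yield the non-empty spans whose font name is GlyphlessFont."""
    for block in textdict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                if span["font"] == "GlyphlessFont" and span["text"].strip():
                    yield span


def courier_fontsize(text, bbox):
    """Font size at which `text`, set in Courier, is as wide as `bbox`."""
    x0, _, x1, _ = bbox
    return (x1 - x0) / (COURIER_ADVANCE * len(text))


def add_text_layer(page, textdict):
    """Insert each OCR span at its origin; return the number of spans written."""
    count = 0
    for span in ocr_spans(textdict):
        page.insert_text(
            span["origin"],
            span["text"],
            fontname="cour",
            fontsize=courier_fontsize(span["text"], span["bbox"]),
            overlay=False,  # place the text underneath existing content
        )
        count += 1
    return count
```

Here textdict would be the result of page.get_text("dict", textpage=tpocr), with tpocr coming from page.get_textpage_ocr(). After calling add_text_layer(page, textdict), save the document to a new file as described above.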

3 replies

caerulescens Dec 13, 2021
Author

Okay, that's what I thought I had to do; you're implying using TextWriter, correct?

JorjMcKie Dec 13, 2021
Maintainer

I suspected this as you have seen.
I forgot to mention that your text should be inserted underneath the images: overlay=False. Or use the TextWriter class, which lets you choose a transparency and/or text rendering mode 3.

caerulescens Dec 13, 2021
Author

Thank you!

caerulescens · 2021-12-17T02:08:59Z

caerulescens
Dec 17, 2021
Author

I'm adding the code below to extend the discussion with an example that uses TextWriter and information from Tesseract-OCR. The information is easier to obtain from Page.get_textpage_ocr(...). See useful references here and here.

...
tw = fitz.TextWriter(page.rect)
text = ...  # the text being written
x_fsize = ...  # take from word's "x_fsize" attribute
word_bbox = fitz.Rect(..., ..., ..., ...)  # take from word's "bbox" attribute
line_bbox = fitz.Rect(..., ..., ..., ...)  # take from line's "bbox" attribute
constant = ... * 72.0 / dpi  # replace ellipses with tesseract "baseline" attribute value (in pixels); this is "b" from "y = mx + b" giving the baseline y-intercept.
origin = fitz.Point(word_bbox.x0, line_bbox.y1 + constant)
font = fitz.Font("cour")
text_len = font.text_length(text=text, fontsize=x_fsize)
fontsize = (x_fsize / text_len) * word_bbox.width
tw.append(pos=origin, text=text, font=font, fontsize=fontsize)
tw.write_text(page, render_mode=3)
...

0 replies

caerulescens · 2021-12-17T03:25:31Z

caerulescens
Dec 17, 2021
Author

@JorjMcKie Do you know if it's possible to use the GlyphLessFont with TextWriter? I tried to do this, but I ran into problems; see below to reproduce.

python: 3.9
pymupdf: 1.19.3
os: debian

Using this PDF file as the starting point....
original.pdf

$ mutool info original.pdf
PDF-1.4
Info object (23 0 R):
...
Pages: 1

Retrieving info from pages 1-1...
Mediaboxes (1):
        1       (1 0 R):        [ 0 0 612 792 ]

Images (1):
        1       (1 0 R):        [ Flate ] 1603x696 8bpc DevRGB (7 0 R)

Use either Ghostscript or Tesseract-OCR to create a PDF which will contain the GlyphLessFont. I think pdf.ttf can also be used and loaded using fitz.Font(fontfile=...) or fitz.Font(fontbuffer=...) to reproduce the stacktrace.

Using gs:

gs -sDEVICE=pdfocr8 -r300 -dNOPAUSE -dBATCH -sOutputFile=out.pdf original.pdf

Using fitz.Font(...)

filename = "./pdf.ttf"

# Using font file
font = fitz.Font(fontfile=filename)

# Using font buffer
with open(filename, "rb") as f:
    font = fitz.Font(fontbuffer=f.read())

Use TextWriter with the GlyphLessFont.

Using pdf.ttf,

doc = fitz.Document()
page = doc.new_page()
tw = fitz.TextWriter(page.rect)
tw.append(
    pos=fitz.Point(100, 100),
    text="Hello, World!",
    font=fitz.Font(fontfile="./pdf.ttf"),
    fontsize=12
)
tw.write_text(page=page, render_mode=3)

Extracting font,

# extracting font from pdf
xref = None
gs_doc = fitz.Document(filename="out.pdf")
for page in gs_doc:
    for f in gs_doc.get_page_fonts(page.number):
        if f[3] == "GlyphLessFont":
            xref = f[0]
            break
buffer = gs_doc.extract_font(xref)[3]

# write using extracted font in new file
doc = fitz.Document()
page = doc.new_page()
tw = fitz.TextWriter(page.rect)
tw.append(
    pos=fitz.Point(100, 100),
    text="Hello, World!",
    font=fitz.Font(fontbuffer=buffer),
    fontsize=12
)
tw.write_text(page=page, render_mode=3)

Both ways come up with the same stack trace:

Traceback (most recent call last):
  File ".../glyphless_font_thing.py", line 23, in <module>
    tw.write_text(page=page, render_mode=3)
  File ".../lib/python3.9/site-packages/fitz/fitz.py", line 8756, in write_text
    repair_mono_font(page, font)
  File ".../lib/python3.9/site-packages/fitz/fitz.py", line 3237, in repair_mono_font
    maxadv = max([font.glyph_advance(cp) for cp in font.valid_codepoints()[:3]])
ValueError: max() arg is an empty sequence

I'm guessing that what's missing is the CMap which maps the font's code points to the same glyph, which is how the GlyphLessFont works. How would I proceed from here?

0 replies

JorjMcKie · 2021-12-17T04:05:34Z

JorjMcKie
Dec 17, 2021
Maintainer

possible to use the GlyphLessFont with TextWriter?

No, it is not. That font does not work with the older insert_text / insert_textbox methods either.
What you could try instead is a monospaced font such as Courier (or a nicer-looking equivalent). MuPDF internally handles GlyphlessFont as monospaced, too.

1 reply

caerulescens Dec 17, 2021
Author

I did the Courier thing you mentioned above; it works well.
