Can OCR TextPage results be written to a page? #1453
-
Calling |
Replies: 6 comments 4 replies
-
I'm not sure I fully understand: do you mean how to obtain minimal line heights / bboxes? If so, just do
The fact that OCR took place when making the textpage is invisible (and can only be seen when stumbling over the font name |
-
Sorry for being so vague; my goal is to take that information in the |
-
However, if you meant: "How can I create a PDF page with an OCRed textlayer": Extract the text using
|
-
I'm adding the below to extend the discussion with a code example using ...
tw = fitz.TextWriter(page.rect)
text = ... # the text being written
x_fsize = ... # take from word's "x_fsize" attribute
word_bbox = fitz.Rect(..., ..., ..., ...) # take from word's "bbox" attribute
line_bbox = fitz.Rect(..., ..., ..., ...) # take from line's "bbox" attribute
constant = ... * 72.0 / dpi # replace ellipses with tesseract "baseline" attribute value (in pixels); this is "b" from "y = mx + b" giving the baseline y-intercept.
origin = fitz.Point(word_bbox.x0, line_bbox.y1 + constant)
font = fitz.Font("cour")
text_len = font.text_length(text=text, fontsize=x_fsize)
fontsize = (x_fsize / text_len) * word_bbox.width
tw.append(pos=origin, text=text, font=font, fontsize=fontsize)
tw.write_text(page, render_mode=3)
... |
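The sizing arithmetic in the snippet above can be checked without PyMuPDF. What follows is only a minimal sketch, not the poster's code: the helper names px_to_pt and fit_fontsize are hypothetical, and the 0.6 em advance width is an assumption that matches standard Courier metrics. For any other font, measure the text (for example with font.text_length()) instead of relying on that constant.

```python
def px_to_pt(value_px: float, dpi: float) -> float:
    """Convert a length in OCR image pixels to PDF points (72 per inch)."""
    return value_px * 72.0 / dpi


def fit_fontsize(text: str, target_width: float, advance_em: float = 0.6) -> float:
    """Font size at which monospaced `text` is `target_width` points wide.

    Assumption: every glyph advances `advance_em` * fontsize (0.6 for Courier).
    """
    if not text:
        raise ValueError("cannot fit an empty string")
    return target_width / (advance_em * len(text))


# Example: a 5-letter word whose OCR box is 60 pt wide.
size = fit_fontsize("Hello", 60.0)

# Example: a baseline offset of -6 px from an image rendered at 300 dpi.
offset = px_to_pt(-6.0, 300.0)
```

The result of fit_fontsize plays the role of the fontsize variable passed to tw.append(), and px_to_pt performs the same pixel-to-point conversion as the "* 72.0 / dpi" factor used for the baseline constant.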
-
@JorjMcKie Do you know if it's possible to use the python:
Using
gs -sDEVICE=pdfocr8 -r300 -dNOPAUSE -dBATCH -sOutputFile=out.pdf original.pdf
Using
filename = "./pdf.ttf"
# Using font file
font = fitz.Font(fontfile=filename)
# Using font buffer
with open(filename, "rb") as f:
    font = fitz.Font(fontbuffer=f.read())
Using
doc = fitz.Document()
page = doc.new_page()
tw = fitz.TextWriter(page.rect)
tw.append(
    pos=fitz.Point(100, 100),
    text="Hello, World!",
    font=fitz.Font(fontfile="./pdf.ttf"),
    fontsize=12
)
tw.write_text(page=page, render_mode=3)
Extracting font,
# extracting font from pdf
xref = None
gs_doc = fitz.Document(filename="out.pdf")
for page in gs_doc:
    for f in gs_doc.get_page_fonts(page.number):
        if f[3] == "GlyphLessFont":
            xref = f[0]
            break
buffer = gs_doc.extract_font(xref)[3]
# write using extracted font in new file
doc = fitz.Document()
page = doc.new_page()
tw = fitz.TextWriter(page.rect)
tw.append(
    pos=fitz.Point(100, 100),
    text="Hello, World!",
    font=fitz.Font(fontbuffer=buffer),
    fontsize=12
)
tw.write_text(page=page, render_mode=3)
Both ways come up with the same stack trace:
I'm guessing that what's missing is the CMap which maps the font's code points to the same glyph, which is how the |
-
No, it is not. Neither does it work using the old |
However, if you meant: "How can I create a PDF page with an OCRed text layer":
Extract the text using page.get_text("dict", textpage=tpocr). Then walk through the text spans and select any with font GlyphlessFont.
With each of these spans do a page.insert_text() in the following way:
span["origin"]
GlyphlessFont also is (seems to be) monospaced
span["bbox"] comes out
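The walk over the text spans described above could be sketched roughly as follows. This is one possible reading, not the author's implementation. The function names ocr_spans and write_hidden_layer are hypothetical. The choice of Courier ("cour") with a 0.6 em advance width, the width-based font size, and render_mode=3 for invisible text are assumptions borrowed from the other code in this thread. The font name is compared case-insensitively because the thread spells it both GlyphlessFont and GlyphLessFont.

```python
def ocr_spans(page_dict, font_name="GlyphLessFont"):
    """Yield the spans of a get_text("dict") result that use the OCR font."""
    wanted = font_name.lower()
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line.get("spans", []):
                is_ocr = span.get("font", "").lower() == wanted
                if is_ocr and span.get("text", "").strip():
                    yield span


def write_hidden_layer(page, page_dict, advance_em=0.6):
    """Re-insert each OCR span as invisible monospaced text; return the count."""
    count = 0
    for span in ocr_spans(page_dict):
        text = span["text"]
        x0, _, x1, _ = span["bbox"]
        if x1 <= x0:
            continue  # skip degenerate boxes, no width to fit
        # Assumption: Courier glyphs advance 0.6 * fontsize each.
        fontsize = (x1 - x0) / (advance_em * len(text))
        page.insert_text(span["origin"], text, fontname="cour",
                         fontsize=fontsize, render_mode=3)
        count += 1
    return count
```

With a real document this would be called as write_hidden_layer(target_page, source_page.get_text("dict", textpage=tpocr)), where tpocr is the OCR textpage from the discussion and target_page is whichever page should receive the hidden text.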