Inaccurate bbox information obtained for 1.18.17 while correct information for 1.18.5 #1427
-
Hi, I have a file for which the bbox information output by 1.18.17 is far too small in height, while the output of version 1.18.5 seems correct. I will send the file via email for privacy reasons.
[Image: 1.18.5 bbox]
Replies: 3 comments 7 replies
-
Uff - you do have a talent for digging out problem PDFs 😉. This is a MuPDF problem with that specific font, SimSun. You can reproduce it using the PDF viewer SumatraPDF (Windows only, or with Wine on Linux). All you can do is add a plausibility check to your text extraction routine:
Example:
>>> import fitz
>>> # doc is assumed to be the already opened problem PDF
>>> fitz.TOOLS.set_small_glyph_heights(True)
True
>>> page = doc[0]
>>> page.clean_contents()
>>> blocks = page.get_text("dict")["blocks"]
>>> for b in blocks:
...     if b["type"] != 0:  # skip non-text (image) blocks
...         continue
...     for l in b["lines"]:
...         for s in l["spans"]:
...             bbox = fitz.Rect(s["bbox"])
...             if bbox.height < s["size"]:  # implausibly small height
...                 bbox.y0 = bbox.y1 - s["size"]  # enforce fontsize height
...             page.draw_rect(bbox, width=0.3, color=(1, 0, 0))
...
Gives you nice results.
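The plausibility check from the loop above can also be pulled out into a small helper that works on plain coordinate tuples, so it can be tried without a PDF at hand. This is only a sketch: the name `fix_span_bbox` is made up for illustration and is not part of PyMuPDF, and it assumes a span dictionary shaped like those returned by `page.get_text("dict")`, with a `bbox` of `(x0, y0, x1, y1)` and a `size` key. The sample spans are invented.

```python
def fix_span_bbox(span):
    """Return the span's bbox as a tuple with at least fontsize height.

    Hypothetical helper, not a PyMuPDF function. `span` is assumed to be a
    dict with "bbox" = (x0, y0, x1, y1) and "size" = fontsize, as in the
    output of page.get_text("dict").
    """
    x0, y0, x1, y1 = span["bbox"]
    size = span["size"]
    if y1 - y0 < size:  # implausibly small height
        y0 = y1 - size  # keep the bottom edge, move the top edge up
    return (x0, y0, x1, y1)


# Made-up spans for illustration: one too flat, one already plausible.
flat = {"bbox": (72.0, 100.0, 150.0, 102.0), "size": 12.0}
normal = {"bbox": (72.0, 120.0, 150.0, 134.0), "size": 12.0}

fixed_flat = fix_span_bbox(flat)      # top edge is moved up
fixed_normal = fix_span_bbox(normal)  # left unchanged
```

The returned tuple can be wrapped in `fitz.Rect(...)` before drawing, as in the loop above.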
-
I have looked at the embedded font SimSun. Its total height (ascender - descender) is far below 1, even below 0.2.
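To see why such metrics lead to tiny boxes, here is a rough calculation. It assumes that, with default settings, the bbox height is approximately fontsize * (ascender - descender). The ascender and descender values below are invented to match the "below 0.2" observation; they are not the real SimSun metrics from the file.

```python
def bbox_height(fontsize, ascender, descender):
    """Approximate text bbox height from font metrics (assumed formula)."""
    return fontsize * (ascender - descender)


# Illustrative values only, not measured from the actual file.
typical = bbox_height(12, ascender=0.9, descender=-0.2)   # about 13.2
broken = bbox_height(12, ascender=0.15, descender=-0.03)  # about 2.16

print(f"typical font: {typical:.2f} pt, problem font: {broken:.2f} pt")
```

Under this assumption, a 12 pt span in the problem font gets a box only about 2 pt high, which is the kind of result a height < fontsize check would catch.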
-
Adobe Acrobat works OK, as does Foxit Reader. Nitro PDF and PDF XChange have the same problem as MuPDF / SumatraPDF.
Use fitz.TOOLS.set_small_glyph_heights(True): this ensures that character bbox height equals fontsize in normal situations.
If a bbox height is still smaller than the fontsize, set y0 = y1 - fontsize and go with the…