Inaccurate bbox information obtained for 1.18.17 while correct information for 1.18.5 #1427

Uff - you do have a talent for digging out problem PDFs 😉.

This is a MuPDF problem with that specific font, SimSun. You can reproduce it using the PDF viewer SumatraPDF (Windows only, or with Wine on Linux).
Adobe Acrobat and Foxit Reader work fine. Nitro PDF and PDF-XChange have the same problem as MuPDF / SumatraPDF.

All you can do is add a plausibility check to your text extraction routine:

  1. Execute fitz.TOOLS.set_small_glyph_heights(True). This ensures that the character bbox height equals the fontsize in normal situations.
  2. In your text extraction, check whether bbox height is smaller than fontsize. If it is, then you have that problem. Set y0 = y1 - fontsize and go with the…
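The two steps above could be sketched roughly as follows. This is only an illustration, not a tested fix for the PDF in question: the helper names fix_bbox and extract_spans and the small tolerance are my own assumptions, and the check is done per span via page.get_text("dict") rather than per character.

```python
def fix_bbox(bbox, fontsize, tol=1e-3):
    """Plausibility check: if the bbox is lower than the fontsize,
    move y0 up so that the height equals the fontsize.

    `tol` is an assumed tolerance to absorb float rounding.
    """
    x0, y0, x1, y1 = bbox
    if (y1 - y0) < fontsize - tol:
        y0 = y1 - fontsize
    return (x0, y0, x1, y1)


def extract_spans(page):
    """Yield (text, corrected bbox) for every text span of a PyMuPDF page.

    Hypothetical helper; PyMuPDF is imported here so that fix_bbox
    can be used and tested without it.
    """
    import fitz  # PyMuPDF

    # Step 1: make bbox height equal fontsize in normal situations.
    fitz.TOOLS.set_small_glyph_heights(True)

    # Step 2: walk the text spans and repair implausible bboxes.
    for block in page.get_text("dict")["blocks"]:
        for line in block.get("lines", []):  # image blocks have no lines
            for span in line["spans"]:
                yield span["text"], fix_bbox(span["bbox"], span["size"])
```

For example, a span with fontsize 12 and bbox (0, 10, 5, 14) is only 4 high, so fix_bbox moves y0 to 14 - 12 = 2. A bbox that is already 12 high is returned unchanged.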

Answer selected by Yichen-fqyd