
words extracted from pdf getting split #1161

This is a badly constructed PDF:
Invisible in a PDF viewer, the file contains space characters that (partly) overlap their preceding character. E.g. "Notifi" ends at some x-coordinate, say 25.5; then a space character follows that starts at 25.0 and ends at 26.0, followed by "cations" ...
I have modified the script textlayout.py so that it detects this situation and ignores those spaces. I have attached it here.
Play with it until you see acceptable results.
This is definitely not a bug in PyMuPDF; the PDF producer created a faulty file.
textlayout.zip
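
For readers who cannot open the attachment, here is a minimal sketch of the idea described above. It is not the modified textlayout.py. The function name `drop_overlapping_spaces` and the `tolerance` parameter are made up for illustration, and the input is assumed to be a list of character dicts with "c" and "bbox" keys.

```python
def drop_overlapping_spaces(chars, tolerance=0.0):
    """Join characters into text, ignoring spaces that overlap their predecessor.

    chars: list of dicts with keys "c" (one character) and "bbox"
           (x0, y0, x1, y1), in reading order on a single line.
    tolerance: hypothetical tuning knob; a space is dropped only if it
               starts more than this far left of the previous right edge.
    """
    kept = []
    prev_x1 = None  # right edge of the last character we kept
    for ch in chars:
        x0, _, x1, _ = ch["bbox"]
        if ch["c"] == " " and prev_x1 is not None and x0 < prev_x1 - tolerance:
            continue  # space starts inside the preceding glyph: ignore it
        kept.append(ch["c"])
        prev_x1 = x1
    return "".join(kept)


# The situation from the post: "i" ends at 25.5, the space spans 25.0 to 26.0.
broken = [
    {"c": "f", "bbox": (20.0, 0.0, 23.0, 10.0)},
    {"c": "i", "bbox": (23.0, 0.0, 25.5, 10.0)},
    {"c": " ", "bbox": (25.0, 0.0, 26.0, 10.0)},
    {"c": "c", "bbox": (25.5, 0.0, 29.0, 10.0)},
]
print(drop_overlapping_spaces(broken))  # prints "fic"
```

With PyMuPDF, character dicts of this shape should be available from `page.get_text("rawdict")`, nested under blocks, lines, spans and then "chars". A small positive `tolerance` may help if genuine spaces in a given file slightly touch the previous glyph; the right value has to be found by experiment.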

Answer selected by JorjMcKie
Labels: not a bug (not a bug / user error / unable to reproduce)

This discussion was converted from issue #1159 on July 22, 2021 17:19.