words extracted from pdf getting split #1161
-
Describe the bug (mandatory)
A few words are getting split in half. For example, the word "beneficiaries" is split into 'benefi' and 'ciaries', although there is no space between these parts in the PDF.

To Reproduce (mandatory)
doc = fitz.open(input_file)

Expected behavior (optional)
Words shouldn't get split.
Replies: 3 comments
-
Please:
-
I tried all the solutions you mentioned, but none of them work. This is a sample PDF file in which the words 'first' and 'notifications' are getting split.
-
This is a badly constructed PDF:
Invisible to the PDF viewer, the file specifies space characters which (partly) overlap their preceding character. E.g. "Notifi" ends at some x-coordinate, say 25.5, then a space character follows which starts at 25.0 and ends at 26.0, followed by "cations" ...
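The overlap described above comes down to a comparison of x-coordinates. Here is a minimal sketch of that check in plain Python. It is not PyMuPDF code: `Char` and `is_overlapping_space` are hypothetical names, the `tol` parameter is an assumption, and the left edge of the "i" is made up for illustration.

```python
from typing import NamedTuple


class Char(NamedTuple):
    """Hypothetical record for one extracted character."""
    c: str     # the character itself
    x0: float  # left edge of its bounding box
    x1: float  # right edge of its bounding box


def is_overlapping_space(prev: Char, cur: Char, tol: float = 0.0) -> bool:
    """Return True if cur is a space that starts left of prev's right edge."""
    return cur.c == " " and cur.x0 < prev.x1 - tol


# Numbers from the example: "Notifi" ends at 25.5, the space spans 25.0 to 26.0.
# The left edge 24.0 of the "i" is an assumed value.
last_i = Char("i", 24.0, 25.5)
space = Char(" ", 25.0, 26.0)
print(is_overlapping_space(last_i, space))
```

A space that begins exactly where the previous character ends is not flagged, so ordinary word gaps are left alone.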
I have modified the script textlayout.py such that it detects this situation and ignores those spaces. I have attached it here. Play with it until you see acceptable results.
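The attached script is not reproduced here. As a rough sketch of the idea only, assume the characters of one text line are available in reading order as `(char, x0, x1)` tuples (in PyMuPDF, such data could probably be derived from the `chars` entries returned by `page.get_text("rawdict")`). Words can then be rebuilt while skipping any space that starts before the previous character ends. The function name `words_ignoring_overlapping_spaces` and the sample coordinates are hypothetical.

```python
def words_ignoring_overlapping_spaces(chars):
    """Join characters into words, dropping spaces that overlap the previous glyph.

    chars: iterable of (char, x0, x1) tuples for one line, in reading order.
    """
    words, current = [], []
    prev_x1 = None  # right edge of the last character that was kept
    for ch, x0, x1 in chars:
        if ch == " ":
            if prev_x1 is not None and x0 < prev_x1:
                # Bogus space: it starts inside the previous character, so skip it.
                continue
            if current:
                # Genuine space: close the current word.
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
        prev_x1 = x1
    if current:
        words.append("".join(current))
    return words


# Made-up coordinates: a real space after "to", a bogus one inside "first".
sample = [
    ("t", 0.0, 1.0), ("o", 1.0, 2.0), (" ", 2.0, 3.0),
    ("f", 3.0, 4.0), ("i", 4.0, 5.5), (" ", 5.0, 6.0),
    ("r", 6.0, 7.0), ("s", 7.0, 8.0), ("t", 8.0, 9.0),
]
print(words_ignoring_overlapping_spaces(sample))  # ['to', 'first']
```

This sketch only looks at horizontal positions within a single line. A real script would also need to handle line breaks and possibly a small tolerance for rounding.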
This is definitely not a bug in PyMuPDF; the PDF maker screwed up the file.
textlayout.zip