words extracted from pdf getting split #1161

nischithac · 2021-07-22T11:25:04Z

Describe the bug (mandatory)

A few words are being split in half. For example, the word beneﬁciaries is split into 'beneﬁ' and 'ciaries', but there is no space between these two parts in the PDF.

To Reproduce (mandatory)

import fitz

doc = fitz.open(input_file)  # input_file: path to the PDF
page = doc[0]
print(page.getText())
Output: The discounts are not available to beneﬁ ciaries of \nMedicare, Medicaid or other federal or state healthcare programs or residents of Massachusetts, Puerto Rico and other US territories.

Expected behavior (optional)

Words shouldn't get split.
Expected Output: The discounts are not available to beneﬁciaries of \nMedicare, Medicaid or other federal or state healthcare programs or residents of Massachusetts, Puerto Rico and other US territories.

Your configuration (mandatory)

Operating system, potentially version and bitness: Ubuntu
Python version, bitness: Python 3.6
PyMuPDF version, installation method (wheel or generated from source): PyMuPDF 1.18.9

JorjMcKie (Maintainer) · 2021-07-22T12:09:13Z

Please:

upgrade to the current version,
provide a minimal example file that reproduces the problem,
read the documentation on the text extraction flag TEXT_INHIBIT_SPACES and friends,
try this layout-preserving script.

nischithac (Author) · 2021-07-22T13:33:09Z

I tried all the solutions you mentioned, but none of them worked. Attached is a sample PDF file in which the words 'first' and 'notifications' are split.
Code:
import fitz

doc = fitz.open(pr_input_file)
page = doc[0]
blocks = page.getText('rawdict', flags=fitz.TEXT_INHIBIT_SPACES)
for block in blocks['blocks']:
    if block['type'] == 0:  # text blocks only
        for line in block['lines']:
            for span in line['spans']:
                # rebuild the span text from its individual characters
                text = ''
                for char in span['chars']:
                    text += char['c']
                print(text)
Output:
Fingersticks are required if your glucose alarms and readings do not match symptoms or when you see Check Blood Glucose symbol during the fi rst 12 hours.
**
PA may be required.
†
Notifi cations will only be received when alarms are turned on and the sensor is within 20 feet of the reading device.
Split_word_issue.pdf

JorjMcKie (Maintainer) · 2021-07-22T15:55:48Z

This is a badly constructed PDF.
Invisible to the PDF viewer, it contains space characters that (partly) overlap their preceding character. For example, "Notifi" ends at some x-coordinate, say 25.5; then a space character follows that starts at 25.0 and ends at 26.0, followed by "cations" ...
I have modified the script textlayout.py so that it detects this situation and ignores those spaces. I have attached it here.
Play with it until you see acceptable results.
This is definitely not a bug in PyMuPDF; the PDF maker screwed up the file.
textlayout.zip
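
For readers without the attachment, the idea of skipping spaces that overlap their preceding character can be sketched roughly as follows. This is not the attached textlayout.py, only a minimal illustration under assumptions: the function name `drop_overlapping_spaces` and the `tolerance` parameter are hypothetical, and the input is assumed to be shaped like the `span['chars']` entries of the 'rawdict' output used earlier in this thread (a 'c' character plus a 'bbox' of x0, y0, x1, y1). The sample coordinates are made up to mirror the 25.0 / 25.5 / 26.0 example.

```python
def drop_overlapping_spaces(chars, tolerance=0.1):
    """Join characters, skipping spaces that overlap the previous character.

    chars: list of dicts shaped like span['chars'] in 'rawdict' output,
           each with 'c' (the character) and 'bbox' (x0, y0, x1, y1).
    tolerance: hypothetical knob; overlaps smaller than this are ignored.
    """
    kept = []
    prev_x1 = None  # right edge of the last character that was kept
    for ch in chars:
        x0, _, x1, _ = ch["bbox"]
        # A space starting left of the previous character's right edge
        # is treated as an invisible, overlapping space and dropped.
        if ch["c"] == " " and prev_x1 is not None and x0 < prev_x1 - tolerance:
            continue
        kept.append(ch["c"])
        prev_x1 = x1
    return "".join(kept)


# Made-up sample: "fi" ends at 25.5, an overlapping space starts at 25.0,
# and a genuine space follows "c" without overlapping it.
sample = [
    {"c": "f", "bbox": (20.0, 0.0, 23.0, 10.0)},
    {"c": "i", "bbox": (23.0, 0.0, 25.5, 10.0)},
    {"c": " ", "bbox": (25.0, 0.0, 26.0, 10.0)},
    {"c": "c", "bbox": (25.5, 0.0, 29.0, 10.0)},
    {"c": " ", "bbox": (29.0, 0.0, 31.0, 10.0)},
    {"c": "a", "bbox": (31.0, 0.0, 34.0, 10.0)},
]
print(drop_overlapping_spaces(sample))  # fic a
```

To try it on a real page, the inner loop of the 'rawdict' code above could pass `span['chars']` to this function instead of concatenating every `char['c']`. This sketch only handles horizontal, left-to-right text, and the tolerance will likely need tuning per document.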
