This repository has been archived by the owner on Feb 16, 2023. It is now read-only.
I ran into the same problem as #1679 when processing PDFs that had already been OCR'ed with Abbyocr: the extracted text has spaces between individual letters.
The issue in my case was pdfminer's default laparams, especially word_margin's default of 0.1:
>>> from pdfminer.high_level import extract_text as pdfminer_extract_text
>>> pdfminer_extract_text("0000131.pdf")
'e S T A D T W E R K E\n\nxx\n\nV e r t r a g s k o n t o - N r . :[...]
Setting word_margin=1 fixed it for me, but I'm not sure whether that value works well universally. (I tried various margin values; 1.0 was the smallest that worked well for me.)
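Since a larger word_margin may not suit every PDF, one option is to apply it only when the default output looks letter-spaced. Below is a minimal sketch of that idea, assuming pdfminer.six is installed. The helper names (`looks_letter_spaced`, `extract_text_with_fallback`) and the 0.6 threshold are hypothetical and chosen for illustration; only `extract_text` and `LAParams` come from pdfminer itself.

```python
def looks_letter_spaced(text, threshold=0.6):
    """Heuristic (an assumption, not part of pdfminer): return True when
    most whitespace-separated tokens are single characters."""
    tokens = text.split()
    if not tokens:
        return False
    single_chars = sum(1 for token in tokens if len(token) == 1)
    return single_chars / len(tokens) >= threshold


def extract_text_with_fallback(path, word_margin=1.0):
    """Extract with pdfminer's defaults first; retry with a larger
    word_margin only if the result looks letter-spaced."""
    # Imported lazily so the heuristic can be used without pdfminer installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    text = extract_text(path)  # default LAParams, word_margin=0.1
    if looks_letter_spaced(text):
        # 1.0 is the value that worked in the report above; tune per corpus.
        text = extract_text(path, laparams=LAParams(word_margin=word_margin))
    return text
```

Calling `extract_text_with_fallback("0000131.pdf")` would then leave PDFs that extract cleanly with the defaults untouched and only re-run the affected ones with the wider margin.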