Replies: 1 comment
This sort of thing is certainly one of the open ends. MuPDF itself has no built-in preference here (as yet). In PyMuPDF, there are a number of scripts / approaches that take steps in that direction and sort the text blocks, lines, or spans by vertical and horizontal coordinates. I once wrote something that sorts every single character of a page. It even checked whether the same character was overlapping itself by 80% or so, and then discarded it ... So I would recommend you develop something that meets your requirements in your target language / script. Check whether or not a "\n" is preceded by a ".", and if not, replace it with chr(32), but only within the same text block, etc.
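To make the last suggestion concrete, here is a minimal sketch of that post-processing step, not a definitive implementation. The helper name `join_soft_breaks` and the `enders` parameter are made up for illustration. It assumes you already have the text of one block as a string, for example the fifth item of each tuple returned by `page.get_text("blocks")` in PyMuPDF.

```python
import re


def join_soft_breaks(block_text: str, enders: str = ".") -> str:
    """Replace line breaks that do not follow a sentence ender with a space.

    Hypothetical helper: `enders` lists the characters treated as
    sentence-ending punctuation (must not be empty). Only "." comes from
    the suggestion above; anything else is your own choice.
    """
    # Block text usually ends with a line break; drop it so that no
    # trailing space or newline is left behind.
    text = block_text.rstrip("\n")
    # Negative lookbehind: match "\n" only if the previous character
    # is not one of the sentence enders.
    pattern = r"(?<![" + re.escape(enders) + r"])\n"
    return re.sub(pattern, chr(32), text)


# Apply the heuristic per block, so that text from different blocks
# is never merged into one line.
sample_blocks = [
    "This is a line\nthat continues.\nNext sentence here\nends now.\n",
    "A heading without period\n",
]
cleaned = [join_soft_breaks(b) for b in sample_blocks]
```

Because each block is processed separately, a heading or a last line without a period stays separate from the following block. A period that ends a line after an abbreviation will still be treated as a sentence end, so you will probably need to tune this for your documents.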
I was wondering if some heuristics could be implemented to detect real ends of lines ("\n"), meaning those that denote the end of a sentence in a paragraph rather than the visual end of a line in a PDF. I don't know if this already exists.
In the meantime, what could I do to implement this myself by postprocessing all paragraphs? I could of course replace every '\n' with a ' ' (space), but then a line that is a complete sentence yet has no period at the end will merge with the sentence below.