Heuristics for distinguishing end of lines for a sentence #1102

gevezex · 2021-06-22T12:14:08Z

I was wondering whether some heuristics could be implemented to distinguish real line ends ("\n") that denote the end of a sentence in a paragraph from the purely visual line ends in a PDF. I don't know whether this already exists.

In the meantime, what could I do to implement this myself by postprocessing all paragraphs? I could replace every '\n' with a ' ' (space), of course, but then a line that has no period at the end (yet is a complete sentence) would be merged with the sentence below it.

JorjMcKie (Maintainer) · 2021-06-22T13:22:06Z

This kind of thing is certainly one of the open ends.
And to be realistic: every "solution" will be preliminary, or at least not failsafe.
First of all, PDFs are polyglot: you can't be sure that another language has a notion of "." at all (Devanagari: Hindi, Nepali, ...), or, if it does, that it uses the same character / glyph, e.g. Chinese and Japanese ("。"). Also think of scripts that run from right to left or from top to bottom.
Second, in a PDF every single character / glyph can be positioned independently. So you may (although this is rare and mostly built like that intentionally) get an arbitrary permutation of a page's characters if you do a naive page.get_text(...).
Third, some document creators shy away from using too many different fonts and prefer to simulate bold text or other text effects (like shadows) by storing the same character multiple times with a tiny variation in coordinates and / or color.

There is no provision for any of this in MuPDF (as yet). In PyMuPDF, a number of scripts / approaches exist that take steps in that direction and sort the text blocks, lines or spans by vertical and horizontal coordinates. I once wrote something that sorts every single character of a page. It even checked whether the same character overlapped itself by 80% or so, and discarded the duplicate if so ...
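A minimal sketch of that idea (not the original script): it works on plain tuples, so it makes no claim about what MuPDF does internally. The tuple layout, the function names, the 0.8 overlap threshold and the 2.0 line tolerance are all assumptions for illustration. With PyMuPDF, such per-character data could be collected from page.get_text("rawdict"), where each character carries a "bbox" and a "c" entry.

```python
from typing import List, Tuple

# Hypothetical input shape: (x0, y0, x1, y1, char), one tuple per glyph.
Char = Tuple[float, float, float, float, str]


def overlap_ratio(a: Char, b: Char) -> float:
    """Intersection area divided by the area of the smaller of the two boxes."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    if width <= 0 or height <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    smaller = min(area_a, area_b)
    return (width * height) / smaller if smaller > 0 else 0.0


def drop_duplicates(chars: List[Char], threshold: float = 0.8) -> List[Char]:
    """Keep the first copy of a character; drop later copies that overlap it heavily."""
    kept: List[Char] = []
    for ch in chars:
        is_copy = any(
            k[4] == ch[4] and overlap_ratio(k, ch) >= threshold for k in kept
        )
        if not is_copy:
            kept.append(ch)
    return kept


def reading_order(chars: List[Char], line_tolerance: float = 2.0) -> List[Char]:
    """Group characters into lines by their top coordinate, then sort each line by x0.

    Assumes a left-to-right, top-to-bottom script.
    """
    lines: List[List[Char]] = []
    for ch in sorted(chars, key=lambda c: c[1]):
        # Same line if the top edge is close to that of the line's first character.
        if lines and abs(ch[1] - lines[-1][0][1]) <= line_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    return [ch for line in lines for ch in sorted(line, key=lambda c: c[0])]
```

The quadratic duplicate check is fine for a single page but would need a spatial index for very dense pages.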

So I would recommend that you develop something that meets your requirements for your target language / script. Check whether or not a "\n" is preceded by a ".", and if not, replace it with chr(32), but do the latter only within the same text block, etc.
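A minimal sketch of that postprocessing step, assuming the text of one block is already available as a string. The function name and the set of terminators are assumptions and should be adapted to the target language / script.

```python
from typing import Iterable

# Assumed sentence terminators; extend or replace for your script.
SENTENCE_END = (".", "!", "?", "。", "।")


def join_block_lines(block_text: str, terminators: Iterable[str] = SENTENCE_END) -> str:
    """Within one text block, keep a "\\n" only if the line before it ends a sentence.

    Every other line break is treated as a visual wrap and replaced by chr(32).
    """
    enders = tuple(terminators)
    # Ignore surrounding whitespace and empty lines inside the block.
    lines = [ln.strip() for ln in block_text.split("\n")]
    lines = [ln for ln in lines if ln]
    parts = []
    for i, line in enumerate(lines):
        parts.append(line)
        if i < len(lines) - 1:
            parts.append("\n" if line.endswith(enders) else chr(32))
    return "".join(parts)
```

Because the function is applied per block, a heading without a final period is not merged with the paragraph below it as long as the extractor reports the two as separate blocks. With PyMuPDF, the block texts could be taken from page.get_text("blocks"), where item 4 of each tuple holds the block's text.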
