Trailing spaces on line 27 of eng.punc #28

juliangilbey · 2019-11-09T23:23:30Z

I've not yet worked out whether eng.punc is used by the LSTM mode of tesseract, but I discovered that there are two trailing spaces on line 27 of this file, which might cause the occasional problem.

stweil · 2019-11-10T07:37:58Z

Which occasional problem are you referring to? If there is a problem, you can create a new traineddata file without those spaces and see whether that fixes the problem.
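A minimal sketch of the first half of that experiment: writing a cleaned copy of the punctuation pattern file. It assumes the file is plain UTF-8 text with one pattern per line and that a run of two or more trailing spaces should collapse to a single space; both are assumptions, not confirmed behaviour. The function names and file paths are hypothetical. Rebuilding the traineddata file from the cleaned list would still need the Tesseract training tools (for example wordlist2dawg and combine_tessdata), which are not shown here.

```python
import re


def collapse_trailing_spaces(lines):
    """Reduce any run of two or more trailing spaces to a single space.

    Assumption: one trailing space is meaningful and extra ones are not.
    Use "" as the replacement instead if the spaces should go entirely.
    """
    return [re.sub(r" {2,}$", " ", line) for line in lines]


def clean_punc_file(src, dst):
    """Write a copy of src to dst with trailing space runs collapsed.

    src and dst are hypothetical paths, e.g. "eng.punc" and "eng.punc.clean".
    """
    with open(src, encoding="utf-8") as f:
        lines = f.read().split("\n")
    with open(dst, "w", encoding="utf-8", newline="\n") as f:
        f.write("\n".join(collapse_trailing_spaces(lines)))
```

Comparing OCR output from the original and the rebuilt traineddata on the same images would then show whether the extra space has any effect.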

stweil · 2019-11-10T07:43:45Z

Link to line 27 in file eng.punc. The trailing spaces are also present in eng.traineddata, where they appear on 17 lines. It looks like other languages have them, too.

stweil · 2019-11-10T07:48:34Z

LSTM and legacy mode use different punc components from the traineddata file, but both have the trailing spaces.

juliangilbey · 2019-11-10T08:32:58Z

AFAICT, the space on each line indicates where "word characters" ("alphanumerics" for lack of a better term right now - non-punctuation symbols) are expected to appear. So line 1 has a single space, indicating a sequence of [A-Z...] with no punctuation, and other lines have a trailing space to indicate initial punctuation followed by word characters. Except for line 27, every line has precisely one space. I hope that makes sense.
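If that reading is right, the "precisely one space per line" rule can be checked mechanically. The sketch below is only an illustration of that idea; the function name is hypothetical, and the rule itself is the guess described above rather than documented Tesseract behaviour.

```python
def lines_breaking_one_space_rule(lines):
    """Return (line_number, space_count) for every line whose number of
    spaces is not exactly one. Line numbers are 1-based."""
    return [
        (number, line.count(" "))
        for number, line in enumerate(lines, start=1)
        if line.count(" ") != 1
    ]
```

Passing it the lines of the file with newlines stripped (for instance `open("eng.punc", encoding="utf-8").read().splitlines()`) lists any line that departs from the pattern.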

I haven't detected an actual problem yet, but any such problem would likely be very subtle.

stweil added the "question" (Further information is requested) label on Nov 10, 2019
