I've not yet worked out whether eng.punc is used by the LSTM mode of tesseract, but I discovered that there are two trailing spaces on line 27 of this file, which might cause the occasional problem.
Which occasional problem are you referring to? If there is a problem, you can create a new traineddata file without those spaces and see whether that fixes the problem.
Link to line 27 in file eng.punc. The trailing spaces are also in eng.traineddata and can be found there in 17 lines. It looks like other languages have them, too.
AFAICT, the space on each line marks where "word characters" are expected to appear (that is, non-punctuation symbols; "alphanumerics" for lack of a better term). So line 1 consists of a single space, indicating a sequence of [A-Z...] with no punctuation, and other lines have a trailing space to indicate initial punctuation followed by word characters. Except for line 27, every line has precisely one space. I hope that makes sense.
I haven't detected an actual problem yet, but any such problem would likely be very subtle.
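To make the "precisely one space per line" observation easy to re-check on other languages, here is a minimal Python sketch. It assumes the punc file is plain UTF-8 text with one pattern per line; the function names `audit_punc_lines` and `audit_punc_file` are made up for illustration and are not part of tesseract.

```python
def audit_punc_lines(lines):
    """Return (line_number, space_count, trailing_space_count) for every
    line that does not contain exactly one space character."""
    findings = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")                      # drop the line ending only
        spaces = line.count(" ")                       # all spaces on the line
        trailing = len(line) - len(line.rstrip(" "))   # spaces at the end of the line
        if spaces != 1:
            findings.append((number, spaces, trailing))
    return findings


def audit_punc_file(path):
    """Run the audit on a file, e.g. a local copy of eng.punc (path is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return audit_punc_lines(handle)
```

Calling `audit_punc_file("eng.punc")` on a local copy should list only line 27 if the observation above holds. It only reports the lines; it does not say whether the extra space matters to tesseract.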