Motivations are as follows:
- A long word can be regarded as a combination of several sub-elements (subwords). This is beneficial for the lexicon-free case, which is more flexible in real life since word composition is not constrained by a fixed lexicon.
- It relieves the memory burden on the LSTM: the LSTM only needs to remember roots and affixes rather than a huge number of whole words (which would require many training samples), and the shorter dependencies are also friendlier to gradient propagation.
Some examples showing the weakness of word-based methods such as CRNN, which suffer from memorizing the lexicon of the training set (raw per-frame CTC output => decoded word; a decoding sketch follows the examples):
aa------u----t--o---l--d----e---n---t--ii-f--y---- => autoldentify
i---d----ee--n---t--ii-f--y---- => identify
m------a--t-h--l--y--p-e---- => mathlype
cc-------a----r---e----y---o-----v----v----e---r---k---d---ll-llo----w-----hh--f-- => careyovverkdllowhf
h------o----o---o----c----o---o----o----c--k----- => hooocooock
0--------0-----0-----0-----0-----0-----0-----0-----0-----0------ => 0000000000
g-----o---o---o---c---o---o---o---o--g---l--e--- => gooocoooogle
l----u--u---u--d---a--g---a--y--u---c--k--- => luuudagayuck
y-----o--u---d----d---a---d--- => youddad
g------i--v--e--l-l---e---f---i--v--e--- => givellefive
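The raw sequences above can be read as per-frame CTC outputs, where `-` is the blank symbol: a greedy decoder drops blanks and collapses consecutive repeated characters. Below is a minimal sketch of that decoding rule; the function name `ctc_greedy_decode` is illustrative and not taken from the original code.

```python
def ctc_greedy_decode(raw: str, blank: str = "-") -> str:
    """Collapse a per-frame CTC output string into the decoded word.

    Drops blank symbols and merges consecutive repeats, e.g.
    'i---d----ee--n---t--ii-f--y----' -> 'identify'.
    """
    decoded = []
    prev = None
    for ch in raw:
        if ch != blank and ch != prev:   # skip blanks, collapse repeats
            decoded.append(ch)
        prev = ch                        # repeats across a blank are kept
    return "".join(decoded)

print(ctc_greedy_decode("i---d----ee--n---t--ii-f--y----"))  # identify
```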
On the one hand, the memory learned by the LSTM helps recognize ambiguous characters (e.g., 'l', 'o') from context; on the other hand, it limits the ability to recognize out-of-vocabulary words whose compositional styles differ from the words in the training set.
So we decide to use a subword-based method, making the RNN attend only to sub-regions of a whole word, to address the problem above. It can probably reduce the amount of training data needed, since in our method a word is composed of subwords, which makes processing of text lines more flexible.
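To make the idea concrete, here is a minimal sketch of subword labeling under the assumption of a hand-picked subword vocabulary and greedy longest-match splitting; the vocabulary, the `segment` function, and the matching rule are illustrative placeholders, not the actual segmentation used by the method.

```python
# Illustrative subword vocabulary (roots and affixes); not the real one.
SUBWORDS = {"auto", "ident", "ify", "math", "type", "give", "me", "five"}

def segment(word: str, vocab=SUBWORDS) -> list[str]:
    """Greedy longest-match segmentation; unknown spans fall back to chars."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):        # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:                                    # no subword matched here
            pieces.append(word[i])               # fall back to a single char
            i += 1
    return pieces

print(segment("autoidentify"))  # -> ['auto', 'ident', 'ify']
```

With such a segmentation, the RNN's target becomes a short sequence of subword labels instead of one whole-word class, so an unseen word can still be covered by subwords that already appear in the training set.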