Better Use of Lemmatization/Stemming

@AdamMeyers released this 20 Sep 22:08
· 79 commits to master since this release

Several changes are detailed in the revision notes. The biggest changes are:

  1. Stemming has been removed from the distributional system and replaced with the lemmatization procedures used to create the .terms files. For statistical purposes, the following forms are therefore mapped to the same lemma: speech recognizer, recognizer of speech, speech recognizers, sr, srs.
  2. The .out_term_list files have a new format: tab-separated values. The first field is the lemma; the remaining fields are the forms of that lemma observed in the input file.
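
As a rough illustration of the new .out_term_list layout described above, the following Python sketch reads tab-separated lines and builds a form-to-lemma lookup. The function names `read_term_list` and `lemma_lookup` and the sample line are hypothetical and not part of the release; the sample is assembled from the forms listed in item 1, and the sketch assumes the lemma itself is not repeated among the variant fields.

```python
def read_term_list(lines):
    """Map each lemma (first field) to its observed forms (remaining fields)."""
    forms_by_lemma = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        lemma, variants = fields[0], fields[1:]
        # Merge repeated lemma lines and drop empty trailing fields.
        forms_by_lemma.setdefault(lemma, []).extend(v for v in variants if v)
    return forms_by_lemma


def lemma_lookup(forms_by_lemma):
    """Invert the table so every observed form points at its lemma."""
    lookup = {}
    for lemma, variants in forms_by_lemma.items():
        lookup[lemma] = lemma
        for variant in variants:
            lookup[variant] = lemma
    return lookup


# Hypothetical line in the new format (assumed, not copied from a real file).
SAMPLE = "speech recognizer\trecognizer of speech\tspeech recognizers\tsr\tsrs\n"

table = read_term_list(SAMPLE.splitlines())
lookup = lemma_lookup(table)
print(lookup["srs"])
```

With a lookup like this, counts for "sr", "srs", and "recognizer of speech" could all be accumulated under the single lemma "speech recognizer", which is the behavior item 1 describes for statistical purposes.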