Much better memory management + word-level training support

@minimaxir released this 30 Apr 03:37 · 105 commits to master since this release
  • Switched to a fit_generator implementation that generates training sequences on the fly instead of loading all sequences into memory. This allows training on large text files (10MB+) without requiring ridiculous amounts of RAM. (See the generator sketch below.)
  • Better word_level support (see the tokenization sketch below):
    • The model will only keep max_words words and discard the rest.
    • The model will not train to predict words that are not in the vocabulary.
    • Every punctuation mark (including smart quotes) is its own token.
    • When generating, newlines/tabs have their surrounding whitespace stripped. (This is not done for other punctuation, as there are too many rules around it.)
  • Training on a single text no longer uses meta tokens to indicate the start/end of the text, and they are no longer used when generating, which results in slightly better output. (See the illustration below.)
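For reference, here is a minimal sketch of the generator approach from the first item. It is not textgenrnn's actual implementation; the function name, random-window sampling, and one-hot targets are illustrative assumptions.

```python
import numpy as np

def sequence_generator(encoded_text, max_length, vocab_size, batch_size):
    """Yield (X, y) batches on the fly instead of materializing every
    training sequence in memory up front. `encoded_text` is a 1-D array
    of token indices (hypothetical input format)."""
    while True:
        X = np.zeros((batch_size, max_length), dtype=np.int32)
        y = np.zeros((batch_size, vocab_size), dtype=np.float32)
        for i in range(batch_size):
            # Sample a random window and predict the token that follows it.
            start = np.random.randint(0, len(encoded_text) - max_length)
            X[i] = encoded_text[start:start + max_length]
            y[i, encoded_text[start + max_length]] = 1.0
        yield X, y

# Usage with a compiled Keras model (fit_generator was the standard API
# at the time of this release):
# model.fit_generator(sequence_generator(encoded, 40, vocab_size, 128),
#                     steps_per_epoch=len(encoded) // 128, epochs=10)
```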
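The word-level rules above can be pictured with a rough tokenization sketch; the regular expression, punctuation set, and helper names are assumptions for illustration, not textgenrnn's own code.

```python
import re
from collections import Counter

# Punctuation treated as standalone tokens (smart quotes included);
# this character set is illustrative, not textgenrnn's exact list.
PUNCT = "!\"#$%&()*+,-./:;<=>?@[\\]^_`{|}~“”‘’"

def word_tokenize(text):
    # Each punctuation mark, newline, or tab becomes its own token;
    # runs of word characters stay together.
    pattern = r"[\w']+|[" + re.escape(PUNCT) + r"]|\n|\t"
    return re.findall(pattern, text)

def build_vocab(tokens, max_words):
    # Keep only the max_words most frequent tokens; everything else is
    # dropped from the vocabulary and never predicted during training.
    return {w for w, _ in Counter(tokens).most_common(max_words)}

def detokenize(tokens):
    # Join tokens with spaces, then strip the whitespace around
    # newlines/tabs only (other punctuation keeps its spaces).
    text = " ".join(tokens)
    return re.sub(r" ?([\n\t]) ?", r"\1", text)

# e.g. detokenize(word_tokenize('“Hi, world!”\nBye.'))
#      -> '“ Hi , world ! ”\nBye .'
```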
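Finally, a small illustration of the meta-token change in the last item; the marker names here are hypothetical, not textgenrnn's actual meta tokens.

```python
# Hypothetical boundary markers; textgenrnn's real meta tokens may differ.
META_START, META_END = "<s>", "</s>"

tokens = ["the", "quick", "brown", "fox"]

# Previous behavior: a single input text was wrapped in meta tokens,
# which the model then also had to emit and strip during generation.
old_sequence = [META_START] + tokens + [META_END]

# New behavior: the text is used as-is, so generated output no longer
# carries the artificial start/end boundaries.
new_sequence = tokens
```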