---
title: Language Models
category: tut
layout: page
order: 10
---
A language model is a probability distribution over sequences of tokens. In most cases, these tokens are words and the sequences are sentences. If we train a language model on some reference corpus, it can then be used to calculate the likelihood of new text with respect to the reference. This has many broad and practical applications in natural language processing.
In MeTA, we have a basic (yet efficient) n-gram language model class. An n-gram language model assumes that the probability of a word depends only on the previous n-1 words; that is, the language model defines a probability distribution over all windows of n words.
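Written out in standard notation (this is the general definition, not anything MeTA-specific), the chain rule factorization and the n-gram approximation are:

{% highlight latex %}
P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
                 \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
{% endhighlight %}

For a 3-gram model like the one built below, each word is conditioned on only the two words preceding it.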
MeTA does not yet support language model inference, which is the process of learning the model parameters. Instead, it reads an already-trained language model from the standardized ARPA file format. We recommend tokenizing data with MeTA and then using a language modeling toolkit such as KenLM to create the `.arpa` file. MeTA reads this file and creates its own binarized version, which can then be used for various tasks. Using KenLM, we can create an `.arpa` file for MeTA with the following command:
{% highlight bash %}
./lmplz --order 3 --text input.txt.tok --arpa output.arpa
{% endhighlight %}
Thus, the file that MeTA's LM uses is `output.arpa`, which is a 3-gram language model.
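For reference, an `.arpa` file is plain text: a `\data\` header lists how many n-grams of each order the model contains, and each `\N-grams:` section lists a base-10 log probability, the n-gram itself, and (for all but the highest order) a backoff weight. A tiny, hypothetical example:

{% highlight text %}
\data\
ngram 1=4
ngram 2=2
ngram 3=1

\1-grams:
-99.0      <s>    -0.30103
-0.90309   </s>
-0.60206   part   -0.30103
-0.60206   time   -0.30103

\2-grams:
-0.30103   part time   -0.30103
-0.47712   time </s>

\3-grams:
-0.17609   part time </s>

\end\
{% endhighlight %}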
To run the language modeling applications bundled with MeTA, you need to configure the `[language-model]` section in the configuration file. Here is an example configuration:
{% highlight toml %}
[language-model]
arpa-file = "../data/english-sentences.arpa"
binary-file-prefix = "english-sentences-"
{% endhighlight %}
The `arpa-file` parameter is the path to the `.arpa` model file. MeTA reads this file and then stores its own binarized version with the prefix `binary-file-prefix`. MeTA uses whichever n-value was used to generate the `.arpa` file.
Once the configuration is in place, run the bundled tool:

{% highlight bash %}
./sentence-likelihood config.toml
{% endhighlight %}
Example output using the provided model might look like this:
{% highlight bash %}
[info] Loading language model from binary files: english-sentences-* (../src/lm/language_model.cpp:32)
[info] Done. (2ms) (../src/lm/language_model.cpp:44)
Input a sentence, (blank) to quit.
I should get a part time job.
Tokenized sentence:
I should get a part time job .
Perplexity per word: 8.29551 (0ms)
Log prob: -9.18843 (0ms)
I should get a part time octopus.
Tokenized sentence:
I should get a part time octopus .
Perplexity per word: 30.0232 (0ms)
Log prob: -14.7746 (0ms)
{% endhighlight %}
A higher perplexity means that the model finds the input sentence less likely than a sentence with a lower perplexity. Log probability works the other way: a higher log probability means that the input sentence is more likely to have been generated by the language model than a sentence with a lower log probability. Note that all log probabilities are negative, so high log probabilities will be close to zero.
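The two numbers are related: if the reported log probability is base 10 and N tokens are scored, perplexity per word is the inverse geometric mean of the per-token probabilities. The token count here is our assumption (ten, counting the eight words, the period, and an end-of-sentence marker), but it is consistent with both examples above:

{% highlight latex %}
\text{perplexity per word} = 10^{-\log_{10} P(\text{sentence}) / N}
% e.g. 10^{9.18843/10} \approx 8.2955 and 10^{14.7746/10} \approx 30.023
{% endhighlight %}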
An important note is that the input sentence should be tokenized in the same way as the reference corpus read by the language model inference algorithm. Otherwise, the vocabularies may not match, and out-of-vocabulary words could unintentionally decrease the likelihood of the sentence.
The file `src/lm/tools/sentence_likelihood.cpp` contains a simple example of using the language model class as demonstrated above.
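For orientation, here is a minimal sketch of what such a tool does. The header paths, the `lm::sentence` type, and the `perplexity_per_word` and `log_prob` method names are assumptions inferred from the tool's output labels, not verified against the exact MeTA version; consult the bundled `sentence_likelihood.cpp` for the authoritative code.

{% highlight cpp %}
#include <iostream>
#include <string>

#include "cpptoml.h"
#include "lm/language_model.h"
#include "lm/sentence.h"

int main(int argc, char* argv[])
{
    if (argc != 2)
    {
        std::cerr << "Usage: " << argv[0] << " config.toml" << std::endl;
        return 1;
    }

    // Read the [language-model] section; this loads (or creates) the
    // binarized files alongside the .arpa model.
    auto config = cpptoml::parse_file(argv[1]);
    meta::lm::language_model model{*config};

    std::cout << "Input a sentence, (blank) to quit." << std::endl;
    std::string line;
    while (std::getline(std::cin, line) && !line.empty())
    {
        meta::lm::sentence sent{line}; // tokenize the raw input
        std::cout << "Perplexity per word: " << model.perplexity_per_word(sent)
                  << "\nLog prob: " << model.log_prob(sent) << std::endl;
    }
}
{% endhighlight %}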