# Word Embeddings
Support for training word embeddings in Mallet is included in the current development release on GitHub. It is not available in the 2.0.8 release.
Unlike word2vec or fasttext, Mallet separates data import from model training.
Word embeddings can be trained from the same format data files as topic models. The main difference is that embeddings typically do not remove high-frequency words, as these can provide information about the syntactic function of words.
bin/mallet import-file --input history.txt --keep-sequence --output history.seq
To train embeddings with default parameters and save vectors to a file called vectors.txt:
bin/mallet run cc.mallet.topics.WordEmbeddings --input history.seq --output vectors.txt
You will first see a few descriptive statistics of the collection and then, as the algorithm proceeds, information about progress. The progress line prints about every five seconds, and shows
- the number of word tokens processed
- the number of milliseconds run so far
- the ratio of these two values (tokens per second)
- the average value of the vector elements, which are initialized to small values and get bigger as we train
- a "step" value that roughly indicates how much we are changing the vectors, which decreases as the learning rate decreases and the quality of the vectors increases
There are several options, most of which are inherited from other implementations. We don't always have good reasons for setting the values of these options, so I've tried to indicate which ones have meaningful effects.
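As a point of reference, here is a rough sketch of a fuller invocation that sets several of the options described below. The option names come from that list, but the exact `--option value` spellings are assumed here, based on the `--input` and `--output` flags in the commands above.

```bash
# Sketch of a fuller training run (not taken from the original docs).
# Assumption: each option listed below is passed as --option-name value,
# matching the --input/--output flags shown in the commands above.
bin/mallet run cc.mallet.topics.WordEmbeddings \
  --input history.seq \
  --output vectors.txt \
  --num-dimensions 100 \
  --window-size 5 \
  --num-iters 3 \
  --num-threads 4 \
  --num-samples 5 \
  --frequency-factor 0.001 \
  --example-word london
```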
- `input`: A Mallet token sequence file.
- `output`: A space-delimited text file with one line per word, with the word string at the beginning and the vector element values afterwards.
- `output-context`: The word2vec algorithm actually trains two sets of embeddings against each other, of which only one is normally kept; this option saves the second (context) set as well. The two sets are functionally identical mirror images of each other. Unlike GloVe, where the word and context vectors end up almost identical, averaging these two sets will produce vectors quite different from either of the originals.
- `output-stats-prefix`: For compatibility with other embedding systems, you can optionally include a line at the top of the file that lists the number of words in the vocabulary and the number of elements in each vector.
- `num-dimensions`: This option controls the number of dimensions for the latent vectors. Since people usually don't actually look at the vectors they train, this value is usually set to a large-ish round number like 100 or 300. The default here is 50, mostly because that's often a good enough number and it's fast to train. With topic models this parameter is all anyone wants to talk about, but for these models no one seems to care.
- `window-size`: This option controls the width of the sliding context window around each word. The default is 5, which means look five tokens to the left and five tokens to the right of the current word. Closer words have more weight. Using a smaller value like 2 will focus the algorithm on the immediate context of each word, so vectors will tend to encode information about the syntactic function of words in their context. For example, a noun might be close to another noun that occurs with similar determiners and prepositions. Using a larger value like 10 will make less distinction between words that are near a word and words that are immediately adjacent to it, so vectors will tend to encode more semantic information. (See the window-size sketch after this list.)
- `num-threads`: If you increase this to, say, 4, then the collection will be partitioned into four equal sections and each thread will start working on its own section. This can of course make things faster, but it can also improve the vectors by adding some randomness.
- `num-iters`: This option sets how many times to sweep through the data. Using more iterations will cause the learning rate to decrease more slowly. Values between 3 and 5 seem good enough.
- `frequency-factor`: In natural language, frequent words ("the") occur exponentially more often than less frequent words ("word"). It appears to be useful to downsample the top words in order to focus the algorithm on more content-bearing words. Values between 0.001 and 0.00001 seem to be good.
- `num-samples`: The SGNS objective wants two words that occur in close proximity to have a word vector and a context vector that are close to each other, but random pairs of words to have vectors that are far away. This variable changes the relative strength of the attraction and repulsion forces. It has a direct effect on running time: more samples, longer running. The default is 5, which seems to be a good number.
- `example-word`: To get a sense of how the algorithm is proceeding, you can specify a query word. Each time the algorithm reports progress it will print the ten words whose vectors are closest (in cosine similarity) to the query word. For example, if the query is `london`, I get:

      1.000000 3365 london
      0.877789 6412 paris
      0.864460 8981 boston
      0.860364 9143 chicago
      0.858702 7044 philadelphia
      0.843321 3584 york
      0.840233 6377 jersey
      0.834708 37473 macmillan
      0.829747 15344 angeles
      0.821647 9142 hall

  In this case the closest word to the query is the query itself, with 1.0 cosine similarity. I'm including this because it's a good check: if the query isn't the only word with 1.0 similarity, something is wrong. The others seem good: major cities with strong connections to London. Note that the query is case sensitive. If I ask for `London` instead of `london`, it can't find the word and silently ignores it. Also be on the lookout for similarities that are "too high". Values of 0.98 or higher usually indicate something is wrong. Good query words tend to be well represented in the corpus and have several similar words.
- `ordering`: Embeddings can be fussy. Especially for words that don't occur very often, or that occur a lot in a few specific parts of the collection, seemingly small changes to the input corpus can have big effects on vector similarity. Artificially adding some variability to the collection can help to surface some of this variability. (See Antoniak and Mimno, "On the stability...", NAACL 2018 for details.) With the default value `linear`, the algorithm reads all the documents in the order they were originally presented, which tends to amplify the impact of early documents. The value `shuffled` selects a random order, so that early documents have less weight. The value `random` implements a "bootstrap" sample: documents are sampled with replacement from the original collection, so they may occur multiple times or not at all. Especially if your collection is smaller than about 10 million word tokens, you should consider running about 20 bootstrap samples (see the bootstrap sketch after this list).
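For the window-size behavior described above, here is a hypothetical pair of runs that differ only in the window width, so the narrow (more syntactic) and wide (more semantic) neighbor lists can be compared for the same query word. The `--window-size` and `--example-word` flag spellings are assumed from the `--input`/`--output` pattern shown earlier.

```bash
# Window-size sketch (assumed flag spellings; not from the original docs).
# A narrow window tends to emphasize syntactic function, a wide window
# tends to emphasize broader semantic similarity.
bin/mallet run cc.mallet.topics.WordEmbeddings --input history.seq \
  --window-size 2 --example-word london --output vectors-narrow.txt

bin/mallet run cc.mallet.topics.WordEmbeddings --input history.seq \
  --window-size 10 --example-word london --output vectors-wide.txt
```

Comparing the nearest-neighbor lists printed for the query word in the two runs is a quick way to see the effect.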
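For the bootstrap runs suggested under `ordering`, a minimal shell sketch might look like this; the `--ordering random` flag spelling is assumed from the same `--input`/`--output` pattern, and each run writes its own vector file.

```bash
# Bootstrap sketch: 20 runs, each sampling documents with replacement.
# Assumes the ordering option is passed as --ordering random, matching
# the --input/--output flag style shown earlier on this page.
for i in $(seq 1 20); do
  bin/mallet run cc.mallet.topics.WordEmbeddings \
    --input history.seq \
    --ordering random \
    --output vectors-bootstrap-$i.txt
done
```

Comparing nearest neighbors for the same query words across the 20 output files gives a rough picture of how stable the similarities are.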