---
layout: page
title: Word Embeddings
category: tut
order: 12
---
Word embeddings are a way of representing the individual words used in natural languages as fixed-length numeric vectors in some vector space. Most useful models for word embeddings find vectors for words where meaning can be captured via (linear) vector composition. For example, one can answer word analogy questions like the following:
- woman is to sister as man is to what? (brother)
- summer is to rain as winter is to what? (snow)
- man is to king as woman is to what? (queen)
- fell is to fallen as ate is to what? (eaten)
We can answer these questions by finding the word vector that is most similar (via some metric like cosine similarity) to the result of some vector math operation. For answering the first question, one might form a query like

$$\mathbf{q} = \mathbf{v}_{\text{sister}} - \mathbf{v}_{\text{woman}} + \mathbf{v}_{\text{man}},$$

where $$\mathbf{v}_w$$ denotes the embedding vector for word $$w$$.
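To make "most similar" concrete, the usual choice is cosine similarity; the formula below is the standard definition, not anything MeTA-specific:

$$\text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}$$

The analogy is then answered by the vocabulary word $$w$$ whose vector $$\mathbf{v}_w$$ scores highest against $$\mathbf{q}$$ (in practice, the words already appearing in the query are usually skipped).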
There are many different models for word embeddings. MeTA implements the learning algorithm from GloVe for learning its word embeddings. This tutorial will walk you through how to use the tools in MeTA for learning and interacting with word embeddings on your own data.
MeTA's GloVe implementation is broken into three steps:
1. Extract a vocabulary from the data for which we would like to construct word embeddings
2. Use that vocabulary to extract the co-occurrence matrix from our data
3. Learn word embeddings for each word in our vocabulary using the co-occurrence matrix we extracted
Steps 1 and 2 are one-time, upfront costs. Step 3 can be repeated as many times as you would like (to, e.g., construct embeddings of different dimensionality) once the vocabulary and co-occurrence matrix have been extracted.
To extract a vocabulary from your data, you will need to add the following section (with parameters adjusted according to your needs) to your configuration file:
{% highlight toml %}
[embeddings]
prefix = "path/to/store/model/files"
filter = [{type = "icu-tokenizer", suppress-tags = "true"}, {type = "lowercase"}]

[embeddings.vocab]
min-count = 10
max-size = 400000
{% endhighlight %}
The `prefix` key indicates the folder where you would like to store the model files. (This path should be created before running the tools.)
The `filter` key is a filter chain to use to extract the token sequences from your data. Feel free to change this however you would like; the chain given above is a reasonable default for learning uncased word vectors.
In the `embeddings.vocab` table, you can specify how to prune your vocabulary. Typically, you will either truncate the vocabulary below a certain frequency count (`min-count`), or you will truncate the vocabulary at a certain maximum size (`max-size`) to keep only the most frequent terms. The less data available for a vocabulary item, the worse its word embedding will be.
Note that even if you limit your vocabulary, the model will always include an `<unk>` vector that will be returned when querying for out-of-vocabulary terms.
To extract the vocabulary, you can now run the `embedding-vocab` tool:
{% highlight bash %}
./embedding-vocab config.toml
{% endhighlight %}
The tool will extract a vocab, prune it, and write the output to `$prefix/vocab.bin`.
Once you've extracted your vocabulary, you are ready for the second pass through the training text that extracts the word co-occurrence statistics. You can configure a few properties for this process with the following (optional) values in the `[embeddings]` section of your configuration file.
{% highlight toml %}
window-size = 15
max-ram = 4096
merge-fanout = 8
num-threads = 4
{% endhighlight %}
The `window-size` key indicates the size of the window in which a word is counted as having co-occurred with another. The window is symmetric, so a `window-size` of 15 counts another word as having co-occurred if it appears within 15 words to the left or 15 words to the right of the current word.
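To make the symmetric window concrete, here is a small, self-contained sketch of windowed co-occurrence counting over a toy token stream. This is not MeTA's implementation (that work is done for you by the `embedding-cooccur` tool described below); the toy sentence and map-based storage are purely illustrative, and the 1/distance weighting in the inner loop mirrors the original GloVe reference implementation, where plain counting would just add 1.0 per pair.

{% highlight cpp %}
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    // a toy token stream, as produced by a filter chain like the one above
    std::vector<std::string> tokens = {"the", "cat", "sat", "on", "the", "mat"};

    // symmetric window: look this many tokens to the left and to the right
    const std::size_t window_size = 2;

    // (target word, context word) -> accumulated co-occurrence weight
    std::map<std::pair<std::string, std::string>, double> cooccur;

    for (std::size_t i = 0; i < tokens.size(); ++i) {
        std::size_t begin = i >= window_size ? i - window_size : 0;
        std::size_t end = std::min(tokens.size(), i + window_size + 1);
        for (std::size_t j = begin; j < end; ++j) {
            if (j == i)
                continue;
            // GloVe's reference implementation down-weights distant pairs
            // by 1/distance; plain counting would use 1.0 here instead
            double distance = static_cast<double>(i > j ? i - j : j - i);
            cooccur[{tokens[i], tokens[j]}] += 1.0 / distance;
        }
    }

    for (const auto& entry : cooccur)
        std::cout << entry.first.first << " " << entry.first.second << " "
                  << entry.second << "\n";
}
{% endhighlight %}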
The `max-ram` key is a heuristic memory limit (in MB). The tool will collect co-occurrence counts until a buffer of this size in RAM is exhausted, at which point the buffer is flushed to disk. Higher values create fewer temporary files and make collection faster, but obviously this should be set to some value that fits within the RAM actually available on your system.
The `merge-fanout` key controls the maximum number of temporary chunk files that can exist on disk before a multi-way merge is conducted to merge them together. The default, if not specified, is 8.
The `num-threads` key controls the number of threads used to extract co-occurrence data. By default, this is set to the number of total system threads. Each thread is given a `max-ram / num-threads` RAM allowance.
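For example, with the values shown above (`max-ram = 4096` and `num-threads = 4`), each thread would work with a buffer of roughly 1024 MB before flushing a chunk to disk.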
To extract the co-occurrence matrix, you can now run the `embedding-cooccur` tool:
{% highlight bash %}
./embedding-cooccur config.toml
{% endhighlight %}
The tool will extract the co-occurrence matrix and write it to the file `$prefix/cooccur.bin`.
Now you are ready to train the embeddings themselves on the global co-occurrence data we extracted in the previous two steps. This process can be configured with the following (optional) values in the `[embeddings]` section of your configuration file.
{% highlight toml %}
max-ram = 4096
vector-size = 50
num-threads = 4
max-iter = 25
learning-rate = 0.05
xmax = 100.0
scale = 0.75
unk-num-avg = 100
{% endhighlight %}
- `max-ram`, as before, is a heuristic memory limit that is used during the first phase of the learning algorithm, which shuffles the data for the SGD-based trainer.
- `vector-size` indicates the desired dimensionality of the generated word embeddings.
- `num-threads` indicates the number of concurrent threads to run during training. Each thread will operate on its own separate subset of the training data, so this should be set low enough to allow concurrent access to separate files for each thread. By default, we use one thread per "core" (including hyperthreading cores).
- `max-iter` indicates the number of iterations to run the algorithm for. More iterations result in better optimization, but this is the major time/quality tradeoff setting.
- `learning-rate` is the initial learning rate. You likely won't need to adjust this unless you are using truly massive corpora.
- `xmax` indicates the co-occurrence count above which the "dampening" applied to rare word pairs stops (see the weighting function sketched after this list). You likely won't need to adjust this.
- `scale` indicates the exponent used in the scaling function. You likely won't need to adjust this.
- `unk-num-avg` indicates the number of rare words to average for constructing the `<unk>` word embedding.
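For reference, `xmax` and `scale` correspond to the $$x_{\max}$$ and $$\alpha$$ parameters of the weighting function from the GloVe paper; the formulation below is the standard one from that paper, shown here only to clarify what the two knobs control:

$$f(x) = \min\!\left(1, \left(\frac{x}{x_{\max}}\right)^{\alpha}\right)$$

Each word pair $$(i, j)$$ contributes $$f(X_{ij})\left(\mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$ to the least-squares objective, so pairs with small co-occurrence counts are down-weighted, while every count at or above `xmax` receives the full weight of 1.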
You can now train your word embeddings using the `glove` tool:
{% highlight bash %}
./glove config.toml
{% endhighlight %}
The output will be written as two vector files: `$prefix/embeddings.target.bin` and `$prefix/embeddings.context.bin`.
Now that you've learned some word embeddings on your data, you can explore your dataset with the `interactive-embeddings` tool.
{% highlight bash %}
./interactive-embeddings config.toml
{% endhighlight %}
This tool will prompt you for vector-space queries and report to you the top 10 most similar words according to cosine distance with your query. For example, to answer the analogy questions given at the beginning of the tutorial, we could use the following queries:
- `sister - woman + man`
- `rain - summer + winter`
- `king - man + woman`
- `fallen - fell + ate`
Any addition or subtraction expression involving at least one word will be accepted.
If you want to use word embeddings in your own application, you can load them into a `word_embeddings` object and query it like so:
{% highlight cpp %}
// load embeddings given the [embeddings] configuration group
auto model = embeddings::load_embeddings(config);

// query the model for a specific word
auto embed = model.at("dog");
embed.tid; // the term id for the vector
embed.v;   // the embedding vector for the term

// query the model to convert a term id to a string_view
auto term = model.term(embed.tid);

// query the model to find the top_k similar embeddings
auto top = model.top_k(embed.v);
top[0].e;     // the embedding, with fields tid and v
top[0].score; // the score that this embedding obtained
{% endhighlight %}
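As a small follow-up sketch building only on the calls shown above, you could print a word's nearest neighbors like this. The loop and printing are illustrative rather than part of MeTA's API; it assumes the result of `top_k` is an iterable container (as the indexing above suggests) and that `<iostream>` is included:

{% highlight cpp %}
// continues the snippet above: list the closest terms to "dog" with scores
auto dog = model.at("dog");
for (const auto& scored : model.top_k(dog.v))
    std::cout << model.term(scored.e.tid) << "\t" << scored.score << "\n";
{% endhighlight %}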