Sources:
Andrej Karpathy blog: The Unreasonable Effectiveness of Recurrent Neural Networks http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Stanford cs231n (spring 2017) lecture 10: Recurrent Neural Networks https://www.youtube.com/watch?v=6niqTuYFZLQ
Chris Olah's Understanding LSTM Networks: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Visualizing and Understanding Recurrent Networks https://arxiv.org/abs/1506.02078
Neural networks like CNNs typically require some fixed-size input, and produce a fixed-size output (see one-to-one in figure below).
RNNs can operate on a sequence item by item, so the length of the input sequence can vary. The output can also vary in length.
- One to many: eg, given an input image, produce a sequence of words that describes it (image captioning)
- Many to one: eg, given an input sequence of words, produce a single label for the text (sentiment classification)
- Many to many: eg, given a sequence of English words, produce a sequence of French words (machine translation); or given a sequence of video frames, produce a sentence describing the scene.
You can also process a fixed-size input sequentially with an RNN, reading it piece by piece over several steps.
Like a static variable in a class that gets updated every time some method is called, the hidden state in an RNN is updated every time a new input is read. The updated hidden state is fed back into the model the next time it reads a new input.
For example: for every word in a sentence, run the word through the RNN function (the input word is a one-hot vector):
At the first step, the initial hidden state $h_0$ (a vector of zeros) is combined with the first word vector $x_1$ to produce the new hidden state: $h_1 = \tanh(W_{hh} h_0 + W_{xh} x_1)$.
The second word $x_2$ is then combined with $h_1$ to produce $h_2$, and so on.
This process is repeated until the end of the sentence.
Both the previous hidden state and the current input contribute to each new hidden state, each through its own weight matrix.
If you want to produce an output at a given step, multiply the hidden state by an output weight matrix: $y_t = W_{hy} h_t$.
In code, it looks like this:
```python
import numpy as np

class RNN:
    '''A single recurrent cell'''
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize the hidden state to zeros
        self.h = np.zeros(hidden_size)
        # Weights for the hidden state, the input, and the output
        self.W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
        self.W_xh = np.random.randn(hidden_size, input_size) * 0.01
        self.W_hy = np.random.randn(output_size, hidden_size) * 0.01

    def step(self, x):
        '''Update the hidden state'''
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        # Optional: compute the output vector
        y = np.dot(self.W_hy, self.h)
        return y

rnn = RNN(input_size=4, hidden_size=8, output_size=4)
```
Reminder: np.dot(a, b) is matrix multiplication (here, a matrix-vector product), not element-wise multiplication. np.tanh is applied element-wise.
Notice that there are three sets of weights! self.W_hh for the hidden state; self.W_xh for the input; and self.W_hy for the output. The hidden state self.h is initialized to a vector of zeros. The np.tanh function implements a non-linearity that squashes the activations to the range [-1, 1].
For each step in a sequence, you'd run the line below to update the hidden state:
rnn.step(x) # x is an input vector
Of course, you can daisy-chain (stack) RNN models so that the output of one cell becomes the input of a downstream cell:
y1 = rnn.step(x)
y2 = rnn2.step(y1)
For each prediction, there's an accompanying loss (usually a softmax/cross-entropy loss) $L_t$.
The final loss is the sum of the individual losses at each step: $L = \sum_t L_t$.
On the backward pass, the loss gradient flows back through each time-step, and each step computes a local gradient for the (shared) weights; these per-step gradients are summed to produce the weight update.
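A minimal sketch of this summed loss, assuming scores is a list of per-step score vectors and targets is the list of correct class indices (both names are illustrative, not from the lecture):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)             # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sequence_loss(scores, targets):
    '''Sum of the per-step softmax (cross-entropy) losses.'''
    total = 0.0
    for z, t in zip(scores, targets):
        p = softmax(z)            # scores -> probability distribution
        total += -np.log(p[t])    # negative log-likelihood of the correct class
    return total
```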
In a many-to-one model, the last hidden state summarizes the entire input sequence and is used to produce the single output.
Whereas for a one-to-many model, the input is used to initialize the hidden state at step 1, and the model then produces an output at every step.
A sequence-to-sequence model (eg, neural machine translation) is basically a many-to-one model placed before a one-to-many model.
It operates in 2 stages:
- the model upfront encodes the input sequence of words to a single summary vector. That vector is the hidden state of the last step of the model.
- the model downstream decodes that vector into a sequence of words in another language.
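A rough sketch of the two stages, reusing the toy RNN class from above; the encode/decode helpers, the toy sizes, and the convention that index 0 is a start token are illustrative assumptions, not from the lecture:

```python
import numpy as np

vocab_size, hidden_size = 4, 8
encoder = RNN(vocab_size, hidden_size, vocab_size)
decoder = RNN(vocab_size, hidden_size, vocab_size)

def encode(source_words):
    '''Run the encoder over the input; its last hidden state is the summary vector.'''
    encoder.h = np.zeros(hidden_size)
    for x in source_words:                      # each x is a one-hot vector
        encoder.step(x)
    return encoder.h

def decode(summary, max_len=20):
    '''Generate output word indices, starting from the summary vector.'''
    decoder.h = summary                         # hand the summary vector over
    x = np.zeros(vocab_size); x[0] = 1.0        # pretend index 0 is a start token
    outputs = []
    for _ in range(max_len):
        scores = decoder.step(x)
        idx = int(np.argmax(scores))            # greedy pick; sampling also works
        outputs.append(idx)
        x = np.zeros(vocab_size); x[idx] = 1.0  # feed the prediction back in
    return outputs
```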
Let's say that we have a character-level model that predicts what the next letter should be given an input letter. Assume the vocabulary consists of only 4 letters: h, e, l, o; and we're training the model to predict the word 'hello'.
The characters are first converted to a 4-element, one-hot vector:
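A minimal sketch of that encoding for this 4-letter vocabulary:

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']

def one_hot(ch):
    '''Encode a character as a 4-element one-hot vector.'''
    v = np.zeros(len(vocab))
    v[vocab.index(ch)] = 1.0
    return v

one_hot('h')   # -> array([1., 0., 0., 0.])
one_hot('e')   # -> array([0., 1., 0., 0.])
```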
The input vector is multiplied by the input weight matrix $W_{xh}$ and combined with the previous hidden state to produce the new hidden state; the output layer then produces a score for each of the 4 letters in the vocabulary.
For the first letter 'h', the correct next letter should be 'e', but the model gave 'o' a higher score. In this case, we use a softmax loss to quantify (as a single scalar) how wrong the prediction is, and that loss's gradient is fed back into the cell during the backward pass. This process repeats during training.
At test time, the output scores at each step are converted to a probability distribution by a softmax function. A prediction is then sampled from that distribution, and that prediction is fed back in as the input to the next time-step.
This process then repeats, one character at a time.
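A sketch of that sampling loop, reusing the toy RNN class and the one_hot/softmax helpers sketched earlier (sample_text and its arguments are illustrative names):

```python
import numpy as np

def sample_text(rnn, seed_char, length=10):
    '''Feed each sampled character back in as the next input.'''
    x = one_hot(seed_char)
    out = [seed_char]
    for _ in range(length):
        scores = rnn.step(x)                         # scores over the 4-letter vocab
        probs = softmax(scores)                      # convert scores to a distribution
        idx = np.random.choice(len(vocab), p=probs)  # sample, don't just argmax
        out.append(vocab[idx])
        x = one_hot(vocab[idx])                      # the prediction becomes the next input
    return ''.join(out)

sample_text(rnn, 'h')   # returns an 11-character string starting with 'h'
```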
Why sample from the distribution? Why not just take the character with the highest score (argmax)?
- Sometimes you do take the argmax. The advantage with sampling is that you get variety in your outputs so that you don't always end up with the same output given the same input. eg, the same image can be captioned in a few different ways by the model.
During test time, when the first prediction is made, can you feed the softmax score into the next round (instead of using the one-hot vector)?
- No, because the softmax scores look very different from what the model saw during training. This can cause bad outputs.
- The other problem is that using a dense vector as an input can be computationally expensive. If your vocabulary size is 10,000, the input becomes a dense vector of 10,000 numbers instead of a mostly-zero one-hot vector.
During the forward pass, you're stepping through time to compute the loss. During the backward pass, you're stepping backwards through time to compute the gradient.
What if the input sequence is very long? Like Wikipedia-sized long? You can't just run through all of Wikipedia forward and backward to produce one gradient update; you'd run out of memory, and convergence would be painfully slow.
Just as you'd make a gradient update after every mini-batch of images for a CNN, you can do something similar in an RNN: run a gradient update once every chunk of the sequence (say, 100 steps). This is known as truncated backpropagation through time.
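A sketch of the chunking idea, where forward_backward and update are hypothetical placeholders for whatever RNN training step and optimizer you use, and sequence is the full token sequence:

```python
import numpy as np

chunk_size = 100                      # backprop through ~100 steps at a time
hidden_size = 8
h = np.zeros(hidden_size)             # hidden state carried across chunks

for start in range(0, len(sequence), chunk_size):
    chunk = sequence[start:start + chunk_size]
    # Forward + backward only within this chunk; the incoming hidden state is
    # treated as a constant, so no gradient flows further back in time.
    loss, grads, h = forward_backward(chunk, h)
    update(grads)                     # one gradient update per chunk
```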
The hidden state visualized
What information is stored in the hidden state? ie, what exactly does it 'remember'?
This paper picks one number from the hidden state vector and looks at which characters cause a spike in its activation when an input sequence is fed into the model.
Most elements from the hidden state vector aren't easily interpretable:
But some elements activate in a more interpretable way:
- Quote detection
- Line breaks
- if statement conditions
The take-away is that even though the model was trained to predict the next character, it also learned useful structural rules of the input data.
Use a CNN to distill an image down to a summary vector
The initial input is a special <START> token.
Previously, the hidden state was calculated like this: $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$.
Now we also have an image vector $v$ (the CNN's summary of the image), which enters through a third weight matrix: $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + W_{ih} v)$.
It's important to note that the image vector is not used as the input $x_t$; it conditions the hidden-state update at every time-step.
For training, the labels must have special tokens marking the <START> and <END> of the sentence. This tells the network to stop generating words whenever it has sampled an <END> token.
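A minimal sketch of the modified recurrence, where the CNN summary vector v enters through its own weight matrix W_ih at every step (sizes and names here are illustrative):

```python
import numpy as np

hidden_size, vocab_size, image_size = 8, 4, 16
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
W_xh = np.random.randn(hidden_size, vocab_size) * 0.01
W_ih = np.random.randn(hidden_size, image_size) * 0.01

def caption_step(h, x, v):
    '''One time-step: the image vector v conditions every update,
    but it is not fed in as the input x.'''
    return np.tanh(np.dot(W_hh, h) + np.dot(W_xh, x) + np.dot(W_ih, v))

v = np.random.randn(image_size)          # stand-in for the CNN's summary vector
h = caption_step(np.zeros(hidden_size), np.eye(vocab_size)[0], v)
```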
Example results:
Bad results:
Instead of outputting a single summary vector, the CNN outputs a grid of vectors. Imagine the image is divided into a grid of 9 squares; the output matrix will then have 9 vectors, each corresponding to its own location in the grid.
Besides sampling from the output vocabulary, the model also produces a distribution over image locations that it wants to look at. This distribution can be seen as the attention of the model (ie, where it is looking).
The attention weights are produced from the current hidden state.
The attention weights multiply the grid of image vectors (a weighted sum) to produce a summary vector, which is fed back in at the next time-step.
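A sketch of that weighted sum: attention weights over the grid locations multiply the grid of image vectors to produce one summary vector (shapes are illustrative):

```python
import numpy as np

grid_vectors = np.random.randn(9, 16)   # 9 locations (a 3x3 grid), 16-dim features
attn_scores = np.random.randn(9)        # in practice, produced from the hidden state

attn_weights = np.exp(attn_scores - attn_scores.max())
attn_weights /= attn_weights.sum()      # softmax over the 9 locations

summary = attn_weights @ grid_vectors   # weighted sum -> one 16-dim summary vector
```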
The attention mechanism for text input is similar to that for images. Instead of outputting a single hidden state at the end, the RNN outputs all hidden states. Each state corresponds to the word that it saw during each time-step.
A separate distribution is also produced to indicate where the model wants to look. This distribution is the attention.
On the backward pass, the gradient is multiplied by the same weight matrix over and over again. This can cause the gradient to grow (explode) dramatically, or diminish (vanish) towards zero.
Imagine one number in the weight matrix. If it is greater than 1 and gets multiplied into the gradient many times over many time-steps, the gradient will explode. If the number is less than 1, the gradient will vanish towards zero.
A hack to fix the exploding gradient problem is to clip the gradient, clamping its norm to a pre-set maximum.
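A minimal sketch of clipping by global norm (the threshold here is an arbitrary choice):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    '''Rescale the gradient if its norm exceeds a pre-set maximum.'''
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```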
To solve the vanishing gradient problem, we need to change the RNN structure.
The LSTM addresses the problem of vanishing/exploding gradients.
An LSTM has one extra state: the cell state.
The cell state is updated by 4 gates:
- $i$: input gate
- $f$: forget gate
- $o$: output gate
- $g$: 'gate' gate
The hidden state from the previous time-step and the current input are multiplied by the weight matrix to calculate the 4 gates. These 4 gates, together with the previous cell state, are used to calculate the current cell state.
The current cell state and the output gate are then used to calculate the new hidden state.
Stack the previous hidden state and the current input, then multiply by the weight matrix to produce a vector. That vector is split into four parts: three are run through a sigmoid to calculate the $i$, $f$, $o$ gates, and one through a tanh to get the $g$ gate.
- i: whether to write to the new cell state
- f: how much to forget from the previous cell state
- o: how much to reveal from the cell state to the new hidden state (which is then passed to the next time step)
- g: how much to write to the new cell state
Stacking vectors: literally stack 1 vector on top of another. for example:
if A = [1 5 3]
and B = [8 6 4]
stack(A, B) -> [1 5 3 ; 8 6 4]
Reminder: the input and the previous step's hidden state are stacked (concatenated) and multiplied by the weights: $W \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix}$
The product of that is run through 3 sigmoids and 1 tanh to produce the 4 gates.
Simplified (bias term is implied):
$i = \sigma(W_i \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix})$, $f = \sigma(W_f \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix})$, $o = \sigma(W_o \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix})$, $g = \tanh(W_g \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix})$
Combined (bias term is implied):
$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix}$
Compute the current cell state (where $\odot$ is element-wise multiplication):
$c_t = f \odot c_{t-1} + i \odot g$
Compute the current hidden state:
$h_t = o \odot \tanh(c_t)$
The forget gate $f$ is produced by a sigmoid, which squashes values to between 0 and 1. If we only consider the extreme values, 0 or 1, it becomes easier to understand what is going on: a 0 will reset (forget) information in the cell state, and a 1 will retain it. The same applies to the input gate $i$, which decides whether new information gets written into the cell state.
For the $g$ gate, the tanh squashes values to between -1 and 1, controlling how much is added to (or subtracted from) the cell state.
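A minimal sketch of one LSTM step following the equations above, with a single combined weight matrix W and the bias omitted (sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    '''One LSTM time-step: 4 gates -> new cell state -> new hidden state.'''
    H = h_prev.shape[0]
    z = np.dot(W, np.concatenate([h_prev, x]))   # stack, then multiply by W
    i = sigmoid(z[0:H])                          # input gate
    f = sigmoid(z[H:2*H])                        # forget gate
    o = sigmoid(z[2*H:3*H])                      # output gate
    g = np.tanh(z[3*H:4*H])                      # 'gate' gate
    c = f * c_prev + i * g                       # new cell state
    h = o * np.tanh(c)                           # new hidden state
    return h, c

H, D = 8, 4                                      # toy hidden and input sizes
W = np.random.randn(4 * H, H + D) * 0.01
h, c = lstm_step(np.random.randn(D), np.zeros(H), np.zeros(H), W)
```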
Google's 2014 sequence to sequence paper says that "... we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM’s performance markedly, ...".
Surprisingly, the LSTM did not suffer on very long sentences, despite the recent experience of other researchers with related architectures [26]. We were able to do well on long sentences because we reversed the order of words in the source sentence but not the target sentences in the training and test set. By doing so, we introduced many short term dependencies that made the optimization problem much simpler (see sec. 2 and 3.3). As a result, SGD could learn LSTMs that had no trouble with long sentences. The simple trick of reversing the words in the source sentence is one of the key technical contributions of this work.
... we found it extremely valuable to reverse the order of the words of the input sentence. So for example, instead of mapping the sentence a, b, c to the sentence α, β, γ, the LSTM is asked to map c, b, a to α, β, γ, where α, β, γ is the translation of a, b, c. This way, a is in close proximity to α, b is fairly close to β, and so on, a fact that makes it easy for SGD to 'establish communication' between the input and the output. We found this simple data transformation to greatly boost the performance of the LSTM.
While we do not have a complete explanation to this phenomenon, we believe that it is caused by the introduction of many short term dependencies to the dataset. Normally, when we concatenate a source sentence with a target sentence, each word in the source sentence is far from its corresponding word in the target sentence. As a result, the problem has a large "minimal time lag" [17]. By reversing the words in the source sentence, the average distance between corresponding words in the source and target language is unchanged. However, the first few words in the source language are now very close to the first few words in the target language, so the problem's minimal time lag is greatly reduced. Thus, backpropagation has an easier time "establishing communication" between the source sentence and the target sentence, which in turn results in substantially improved overall performance.
Initially, we believed that reversing the input sentences would only lead to more confident predictions in the early parts of the target sentence and to less confident predictions in the later parts. However, LSTMs trained on reversed source sentences did much better on long sentences than LSTMs trained on the raw source sentences ..., which suggests that reversing the input sentences results in LSTMs with better memory utilization.
Does the fastai library implement this sentence-reversal technique as input data augmentation?
Yes, you can specify backwards=True when creating a text databunch. The parent class explains what the arg does.
Searched the fastai forum with keywords "reverse words LSTM" and found this thread: DeepLearning-Lec11-Notes. Then searched for the word "reverse" in the post, and found this line: "Bi Directional: Take all your sequences and reverse them and make a 'backwards model' then average the predictions."
This reversal of the input sequence is what the 'bi' in bidirectional LSTM refersns to: the input sequence is looked at both front-to-back and back-to-front. This creates 2 hidden states that are stacked together.
For a many-to-many model, how does it know when to start generating outputs? And when to stop generating outputs?
It knows to start generating outputs when it receives an end-of-sentence <EOS> token, and it stops generating when it samples that same token in its output.
This paper introduced the encoder-decoder architecture that uses 2 modified RNNs (with 2 gates that control how much of the hidden state to forget or update). The encoder turns a variable-length sequence (a sentence of words) into a fixed-length vector representation. The decoder uses that vector to produce either a score or sequences that are fed into a statistical machine translation model to help improve its performance.
This paper from Google used 2 LSTMs to create an encoder-decoder model for neural machine translation. The encoder turns a variable-length sequence (a sentence of words) into a fixed-length vector. That vector is fed into the decoder, which outputs a variable-length sequence (the translation). They discovered that reversing the input sentence improves the translation score markedly.
Read this illustrated article on Visualizing A Neural Machine Translation Model before reading the next paper.
This paper added attention to the encoder-decoder architecture. It also used a bidirectional RNN encoder. The authors argued that squashing all the information of a source sentence, regardless of its length, into a single fixed-length vector creates a bottleneck that can limit the performance of the model (how well it translates a sentence). Instead, the encoder produces a sequence of vectors, and a soft attention mechanism learns which parts of the input sentence matter, choosing a subset of these vectors adaptively while emitting each translated word. The attention directs the decoder to look at specific words in the input when generating an output word, which lets the model cope better with long sentences.
LSTM gates control how hidden states update the cell state by specifying what to forget, and what to update (and how much to update). This reminds me of the book Why We Sleep by neuroscientist Matthew Walker. In it, he mentions the 2 phases of sleep: REM and NREM (non-REM) sleep. We switch between these 2 modes during sleep. One mode removes memory, and the other edits existing memory. This is remarkably similar to what the gating mechanism in an LSTM does.
Hidden states are short-term memories; cell states are long-term memories. The 4 gates in an LSTM control how the new short-term memories should update the long-term memory.