Introduction
============

In this project we will reproduce the results of
`Deep Speech: Scaling up end-to-end speech recognition <http://arxiv.org/abs/1412.5567>`_.
The core of the system is a bidirectional recurrent neural network (BRNN)
trained to ingest speech spectrograms and generate English text transcriptions.

Let a single utterance :math:`x` and label :math:`y` be sampled from a training set
:math:`S = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots\}`.
Each utterance, :math:`x^{(i)}`, is a time-series of length :math:`T^{(i)}`
where every time-slice is a vector of audio features,
:math:`x^{(i)}_t` where :math:`t=1,\ldots,T^{(i)}`.
We use MFCC as our features; so :math:`x^{(i)}_{t,p}` denotes the :math:`p`-th MFCC feature
in the audio frame at time :math:`t`. The goal of our BRNN is to convert an input
sequence :math:`x` into a sequence of character probabilities for the transcription
:math:`y`, with :math:`\hat{y}_t = \mathbb{P}(c_t \mid x)`,
where :math:`c_t \in \{a, b, c, \ldots, z, space, apostrophe, blank\}`.
(The significance of :math:`blank` will be explained below.)
Our BRNN model is composed of :math:`5` layers of hidden units.
For an input :math:`x`, the hidden units at layer :math:`l` are denoted :math:`h^{(l)}` with the
convention that :math:`h^{(0)}` is the input. The first three layers are not recurrent.
For the first layer, at each time :math:`t`, the output depends on the MFCC frame
:math:`x_t` along with a context of :math:`C` frames on each side.
(We typically use :math:`C \in \{5, 7, 9\}` for our experiments.)
The remaining non-recurrent layers operate on independent data for each time step.
Thus, for each time :math:`t`, the first :math:`3` layers are computed by:

.. math::
   h^{(l)}_t = g(W^{(l)} h^{(l-1)}_t + b^{(l)})

where :math:`g(z) = \min\{\max\{0, z\}, 20\}` is a clipped rectified-linear (ReLU)
activation function and :math:`W^{(l)}`, :math:`b^{(l)}` are the weight matrix and bias
parameters for layer :math:`l`. The fourth layer is a bidirectional recurrent
layer `[1] <http://www.di.ufpe.br/~fnj/RNA/bibliografia/BRNN.pdf>`_.
This layer includes two sets of hidden units: a set with forward recurrence,
:math:`h^{(f)}`, and a set with backward recurrence, :math:`h^{(b)}`:

.. math::
   h^{(f)}_t = g(W^{(4)} h^{(3)}_t + W^{(f)}_r h^{(f)}_{t-1} + b^{(4)})

   h^{(b)}_t = g(W^{(4)} h^{(3)}_t + W^{(b)}_r h^{(b)}_{t+1} + b^{(4)})

Note that :math:`h^{(f)}` must be computed sequentially from :math:`t = 1` to :math:`t = T^{(i)}`
for the :math:`i`-th utterance, while the units :math:`h^{(b)}` must be computed
sequentially in reverse, from :math:`t = T^{(i)}` to :math:`t = 1`.
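The clipped activation and the two recurrences can be sketched in plain NumPy. This is an illustrative sketch only, not the project's actual TensorFlow implementation; all names and shapes here are assumptions:

```python
import numpy as np

def clipped_relu(z):
    # g(z) = min(max(0, z), 20), applied element-wise
    return np.minimum(np.maximum(0.0, z), 20.0)

def bidirectional_layer(h3, W4, Wf_r, Wb_r, b4):
    """Compute forward and backward recurrent units for one utterance.

    h3: array of shape (T, n), the layer-3 outputs for each time step.
    Returns h_f and h_b, each of shape (T, d) where d = len(b4).
    """
    T, _ = h3.shape
    d = b4.shape[0]
    h_f = np.zeros((T, d))
    h_b = np.zeros((T, d))
    # Forward recurrence: t = 1 .. T
    for t in range(T):
        prev = h_f[t - 1] if t > 0 else np.zeros(d)
        h_f[t] = clipped_relu(W4 @ h3[t] + Wf_r @ prev + b4)
    # Backward recurrence: t = T .. 1
    for t in reversed(range(T)):
        nxt = h_b[t + 1] if t < T - 1 else np.zeros(d)
        h_b[t] = clipped_relu(W4 @ h3[t] + Wb_r @ nxt + b4)
    return h_f, h_b
```

Note that the forward loop can only start once :math:`h^{(3)}` is available for all :math:`t`, which is why the first three layers are computed for the whole utterance first.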

The fifth (non-recurrent) layer takes both the forward and backward units as inputs:

.. math::
   h^{(5)} = g(W^{(5)} h^{(4)} + b^{(5)})

where :math:`h^{(4)} = h^{(f)} + h^{(b)}`. The output layer consists of standard logits that
correspond to the predicted character probabilities for each time slice :math:`t` and
character :math:`k` in the alphabet:

.. math::
   h^{(6)}_{t,k} = \hat{y}_{t,k} = (W^{(6)} h^{(5)}_t)_k + b^{(6)}_k

Here :math:`b^{(6)}_k` denotes the :math:`k`-th bias and :math:`(W^{(6)} h^{(5)}_t)_k` the :math:`k`-th
element of the matrix product.
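Continuing the NumPy sketch, the fifth layer and the output logits look like this (again an illustration with assumed shapes, not the project's actual code):

```python
import numpy as np

def clipped_relu(z):
    # g(z) = min(max(0, z), 20)
    return np.minimum(np.maximum(0.0, z), 20.0)

def output_layers(h_f, h_b, W5, b5, W6, b6):
    """Layer 5 sums the forward and backward units; layer 6 emits logits.

    h_f, h_b: arrays of shape (T, d); returns logits of shape (T, n_character).
    """
    h4 = h_f + h_b                      # h^(4) = h^(f) + h^(b)
    h5 = clipped_relu(h4 @ W5.T + b5)   # fifth (non-recurrent) layer
    logits = h5 @ W6.T + b6             # one logit per character per time slice
    return logits
```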

Once we have computed a prediction for :math:`\hat{y}_{t,k}`, we compute the CTC loss
`[2] <http://www.cs.toronto.edu/~graves/preprint.pdf>`_ :math:`\mathcal{L}(\hat{y}, y)`
to measure the error in prediction. During training, we can evaluate the gradient
:math:`\nabla \mathcal{L}(\hat{y}, y)` with respect to the network outputs given the
ground-truth character sequence :math:`y`. From this point, computing the gradient
with respect to all of the model parameters may be done via back-propagation
through the rest of the network. We use the Adam method for training
`[3] <http://arxiv.org/abs/1412.6980>`_.
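The Adam update itself is simple to state; here is a minimal single-parameter sketch using the standard formulation and its usual default hyperparameters, independent of any particular framework:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter theta at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

In practice the framework's built-in Adam optimizer performs this update for every weight matrix and bias vector in the network.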

The complete BRNN model is illustrated in the figure below.

.. image:: ../images/rnn_fig-624x548.png
   :alt: DeepSpeech BRNN
Geometric Constants
===================

This section describes several constants related to the geometry of the network.

n_steps
-------
The network views each speech sample as a sequence of time-slices :math:`x^{(i)}_t` of
length :math:`T^{(i)}`. As the speech samples vary in length, we know that :math:`T^{(i)}`
need not equal :math:`T^{(j)}` for :math:`i \ne j`. For each batch, the BRNN in TensorFlow needs
to know ``n_steps``, which is the maximum :math:`T^{(i)}` for the batch.
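For instance, ``n_steps`` for a batch could be computed like this (a sketch with an invented toy batch; the real pipeline derives it from the loaded feature tensors):

```python
import numpy as np

# Hypothetical batch of three utterances with T = 50, 75, and 60 time-slices,
# each time-slice holding 26 MFCC features (sizes are illustrative).
batch = [np.zeros((50, 26)), np.zeros((75, 26)), np.zeros((60, 26))]

# n_steps is the maximum T^(i) over the batch
n_steps = max(utterance.shape[0] for utterance in batch)
```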

n_input
-------
Each of the at most ``n_steps`` vectors is a vector of MFCC features of a
time-slice of the speech sample. We will make the number of MFCC features
dependent upon the sample rate of the data set. Generically, if the sample rate
is 8kHz we use 13 features; if the sample rate is 16kHz we use 26 features.
We capture the dimension of these vectors, equivalently the number of MFCC
features, in the variable ``n_input``.
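A sketch of that rule (the helper name is invented for illustration):

```python
def mfcc_feature_count(sample_rate_hz):
    # Illustrative rule from the text: 13 features at 8 kHz, 26 at 16 kHz.
    return 13 if sample_rate_hz <= 8000 else 26

n_input = mfcc_feature_count(16000)
```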

n_context
---------
As previously mentioned, the BRNN is not simply fed the MFCC features of a given
time-slice. It is fed, in addition, a context of :math:`C \in \{5, 7, 9\}` frames on
either side of the frame in question. The number of frames in this context is
captured in the variable ``n_context``.
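Concretely, each input to the first layer spans ``2 * n_context + 1`` frames. A sketch of assembling such a window with zero-padding at the utterance edges (illustrative, not the project's actual input pipeline):

```python
import numpy as np

def context_window(features, t, n_context):
    """Return frames [t - n_context, t + n_context], zero-padded at the edges.

    features: array of shape (T, n_input).
    """
    T, n_input = features.shape
    window = np.zeros((2 * n_context + 1, n_input))
    for i, frame in enumerate(range(t - n_context, t + n_context + 1)):
        if 0 <= frame < T:
            window[i] = features[frame]
    return window
```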

Next we will introduce constants that specify the geometry of some of the
non-recurrent layers of the network. We do this by simply specifying the number
of units in each of the layers.

n_hidden_1, n_hidden_2, n_hidden_5
----------------------------------
``n_hidden_1`` is the number of units in the first layer, ``n_hidden_2`` the number
of units in the second, and ``n_hidden_5`` the number in the fifth. We haven't
forgotten about the third or sixth layer. We will define their unit count below.

An LSTM BRNN consists of a pair of LSTM RNNs:
one LSTM RNN that works "forward in time":

.. image:: ../images/LSTM3-chain.png
   :alt: Image shows a diagram of a recurrent neural network with LSTM cells, with arrows depicting the flow of data from earlier time steps to later time steps within the RNN.

and a second LSTM RNN that works "backwards in time":

.. image:: ../images/LSTM3-chain-backwards.png
   :alt: Image shows a diagram of a recurrent neural network with LSTM cells, this time with data flowing from later time steps to earlier time steps within the RNN.

The dimension of the cell state, the upper line connecting subsequent LSTM units,
is independent of the input dimension and the same for both the forward and
backward LSTM RNN.

n_cell_dim
----------
Hence, we are free to choose the dimension of this cell state independent of the
input dimension. We capture the cell state dimension in the variable ``n_cell_dim``.

n_hidden_3
----------
The number of units in the third layer, which feeds in to the LSTM, is
determined by ``n_cell_dim`` as follows:

.. code:: python

   n_hidden_3 = 2 * n_cell_dim

n_character
-----------
The variable ``n_character`` will hold the number of characters in the target
language plus one, for the :math:`blank`.
For English it is the cardinality of the set

.. math::
   \{a, b, c, \ldots, z, space, apostrophe, blank\}

we referred to earlier.
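As a quick check, for English that set contains the 26 letters plus space, apostrophe, and the blank symbol:

```python
# 26 letters + space + apostrophe, with +1 for the CTC blank (English case)
alphabet = [chr(c) for c in range(ord('a'), ord('z') + 1)] + [' ', "'"]
n_character = len(alphabet) + 1
```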

n_hidden_6
----------
The number of units in the sixth layer is determined by ``n_character`` as follows:

.. code:: python

   n_hidden_6 = n_character