Merge pull request #26 from dsp-uga/Documentation
Documentation
Showing 5 changed files with 157 additions and 58 deletions.
@@ -0,0 +1,18 @@
Jeremy Shi:
- Preprocessing the audio and transcription data (segmentation, uploading, etc.)
- Google Cloud VM GPU setup
- Swig Decoder (adapted from Baidu's DeepSpeech2 on PaddlePaddle)
- Presentation Notebook (live demo)
- Testing Word Error Rate on DASS
- Documentation
- Paper Writing
- Code Review

Ailing Wang:
- Building stacked-LSTM model in Keras
- Google Cloud training and testing
- Presentation Notebook (images, theories, etc.)
- Testing Word Error Rate on LibriSpeech
- Documentation
- Paper Writing
- Code Review
@@ -1,118 +1,172 @@
# Automatic Speech Recognition on the Digital Archive of Southern Speech

[License: MIT](https://opensource.org/licenses/MIT)

This project was implemented over the course of three weeks as the final project of the CSCI 8360 Data Science Practicum class offered in Spring 2018 at the University of Georgia. For the course webpage, [click here](http://dsp-uga.github.io/sp18/schedule.html). This project has benefited from the repositories of [Udacity's NanoDegree](https://github.com/udacity/AIND-VUI-Capstone), [@robmsmt](https://github.com/robmsmt/KerasDeepSpeech), [Baidu's Bay Area DL School](https://github.com/baidu-research/ba-dls-deepspeech), and [Baidu's PaddlePaddle](https://github.com/PaddlePaddle/DeepSpeech). Huge thanks to the people who made these resources publicly available!

## Prerequisites:

- [Python 3.6](https://www.python.org/downloads/release/python-360/)
- [Anaconda](https://www.anaconda.com/)
- [Tensorflow](http://www.tensorflow.org)
- [Keras](http://keras.io)
- [Jupyter Notebook](http://jupyter.org/) (for presentation purposes)
- [ASR Evaluation](https://github.com/belambert/asr-evaluation) (for evaluating WER)
- [Swig](http://www.swig.org/) (used for decoding the outputs of the network with a language model; see the wiki page for detailed instructions)
- [KenLM](https://kheafield.com/code/kenlm/) (used for decoding the outputs of the network with a language model; see the wiki page for detailed instructions)

For other required libraries, please check the `environment.yml` file.

### Google VM Hardware Specs

We created a virtual machine instance on Google Cloud with 16 vCPUs and 64 GB of RAM. It takes approximately 2 hours to train 1 epoch of the LSTM model on the full dataset (train-960 of LibriSpeech). With 8 vCPUs, 50 GB of RAM, and an NVIDIA Tesla P100 GPU, it takes around 750 seconds to train 1 epoch of Baidu's DeepSpeech2 model on the train-100 set of LibriSpeech.

## Deliverables

Our presentation comes in the form of a Jupyter Notebook. The notebook file is under the [src](./src) folder and is named `demo.ipynb`. To see its HTML version, please navigate to the [presentation](./presentation) folder and check `demo.html` and its dependencies.

## Datasets

### LibriSpeech ASR corpus

We use the LibriSpeech ASR corpus as our major training dataset. LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey, and it is publicly available. To access all the dataset files, [click here](http://openslr.org/12).

You also need to convert all .flac files of LibriSpeech to .wav files. There are many scripts available online for doing so; [here is an example](https://github.com/udacity/AIND-VUI-Capstone/blob/master/flac_to_wav.sh). For instance, to download the dev-clean and test-clean subsets and convert them with that script:

```
wget http://www.openslr.org/resources/12/dev-clean.tar.gz
tar -xzvf dev-clean.tar.gz
wget http://www.openslr.org/resources/12/test-clean.tar.gz
tar -xzvf test-clean.tar.gz
mv flac_to_wav.sh LibriSpeech
cd LibriSpeech
./flac_to_wav.sh
```

### DASS (Digital Archive of Southern Speech) Corpus

DASS is an audio corpus of 64 interviews (3-4 hours each) of Southern speech, featuring dialects in eight Southern states and a mixture of ethnicities, ages, social classes, and education levels. DASS provides fruitful resources for researchers to work on, and it is interesting to see how well a model trained on general North American English corpora performs on these Southern speeches. The audio data is publicly accessible from http://www.lap.uga.edu/ and is also available via the Linguistic Data Consortium (https://catalog.ldc.upenn.edu/LDC2016S05).

## How to install and run

0. Install dependencies

There are several sound-processing tools you need to install, with `libav` being the main one (because the `soundfile` library in Python depends on it). Here is how to install it on Linux:

```
sudo apt-get install libav-tools
```

To install it on a Mac, the easiest way is to use [Brew](https://brew.sh/):

```
brew install libav
```

Also, as mentioned before, `swig` can optionally be used to integrate a language model when generating the recognized text. Detailed instructions on installing it can be found in our [Wiki](https://github.com/dsp-uga/speech-recognition/wiki) (to navigate to the wiki, you can also press `g` `w` on your keyboard).

1. Clone this repository.

```
$ git clone https://github.com/dsp-uga/speech-recognition.git
$ cd speech-recognition
```

2. Create a conda environment based on the `environments.yml` file offered in this repository.

```
$ conda env create -f environments.yml -n <environment_name> python=3.6
$ source activate <environment_name>
```

This will build a conda environment with the name you specify.

3. Download all the audio files into the [data](./data) folder and organize them as instructed there.

4. Generate json dictionaries

Json dictionaries record the paths, lengths, and transcriptions of all the audio files. In order to train your model, two json files are necessary: `train_corpus.json` and `valid_corpus.json`. If you also have a test set to evaluate on, you need a third json file, `test_corpus.json`. (Of course, you can change their names and specify them in your training/testing process.) In our system, the json dictionaries are stored in the [json_dict](./json_dict) folder; check the README file there to see what a json dict should look like in detail. To generate the training and validation json dictionaries:

```
cd src
python create_desc_json.py .././data/LibriSpeech/dev-clean/ .././json_dict/train_corpus.json
python create_desc_json.py .././data/LibriSpeech/test-clean/ .././json_dict/valid_corpus.json
```

Note that these two commands are for the `dev-clean` and `test-clean` datasets in LibriSpeech. It is assumed that you have already downloaded these files into the `data` folder.
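
For reference, each line of such a json dictionary is a small JSON object describing one utterance. A hypothetical entry (assuming the line-delimited format used by ba-dls-deepspeech-style `create_desc_json.py` scripts; the exact keys in this project may differ, so check the `json_dict` README) might look like:

```
{"key": "../data/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.wav", "duration": 5.86, "text": "the lowercase transcription of this utterance"}
```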

To generate the testing json dictionary for DASS:

```
cd src
python create_test_desc_json.py .././data/DASS .././json_dict/test_corpus.json
```

5. Train a model

We offer the following models for users to train on. Here is the command to run the training:

```
cd src
python train.py <-m model_name> <-mfcc False> <-p pickled_path> <-s save_model_path> <-e epochs> <-u units> <-l recurrent layers>
```
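
For example, a hypothetical invocation that trains the CNN-RNN model described below for 20 epochs, relying on the defaults for the remaining arguments (the epoch count here is illustrative, not a tuned setting):

```
cd src
python train.py -m cnn_rnn -e 20
```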

#### CNN-RNN Model
The first model we tried is a CNN-RNN model, which includes 1 Conv1D layer, 1 simple RNN layer, and 1 time-distributed dense layer. To use this model, simply pass `cnn_rnn` as the model argument to `train.py`. A minimal Keras sketch of this architecture is shown after the model descriptions below.

#### Bi-directional RNN (BRNN) Model
A BRNN is an improvement over the vanilla RNN. We can also stack recurrent layers together to make a deep BRNN. Its use in speech recognition was pioneered by Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton; check out their famous paper at https://arxiv.org/abs/1303.5778.

#### LSTM Model
We created an LSTM model which has 3 LSTM layers and 8 time-distributed dense layers. For the full dataset, which contains about 1000 hours of audio, we increased the complexity of the model to 4 LSTM layers and 12 time-distributed layers. The structure of the model is inspired by https://arxiv.org/abs/1801.00059. To use this model, pass `tdnn` or `tdnn_large` as the model argument to `train.py`.

#### Baidu's DeepSpeech 2 Model
The model consists of 1-3 1D or 2D convolutional layers and 7 RNN layers. The original paper can be found at https://arxiv.org/abs/1512.02595, and we use the Keras implementation from [here](https://github.com/robmsmt/KerasDeepSpeech/blob/master/model.py).
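
As an illustration of the simplest of these architectures, here is a minimal Keras sketch of a CNN-RNN acoustic model of the shape described above (one Conv1D layer, one simple RNN layer, and one time-distributed dense layer). The feature dimension, layer sizes, and output alphabet size are illustrative assumptions, not the settings used in this project:

```
from keras.models import Sequential
from keras.layers import Conv1D, SimpleRNN, TimeDistributed, Dense

def cnn_rnn_sketch(input_dim=161, filters=200, kernel_size=11, units=200, output_dim=29):
    """Conv1D -> SimpleRNN -> per-timestep softmax over characters (e.g. for CTC training)."""
    model = Sequential()
    # 1D convolution over the time axis of the spectrogram/MFCC features
    model.add(Conv1D(filters, kernel_size, strides=2, padding='valid',
                     activation='relu', input_shape=(None, input_dim)))
    # a single simple recurrent layer over the convolved frames
    model.add(SimpleRNN(units, return_sequences=True))
    # time-distributed dense layer producing a character distribution per frame
    model.add(TimeDistributed(Dense(output_dim, activation='softmax')))
    return model

model = cnn_rnn_sketch()
model.summary()  # prints layer shapes: (batch, time, features) -> (batch, time, characters)
```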

6. Use a trained model for prediction

To make predictions, use the following command:

```
cd src
python predict.py <-m model_name> <-mfcc False> <-p pickled_path> <-s save_model_path> <-u units> <-l recurrent layers> <-r range> <-lm languageModel> <-part partitionOfDatasets>
```

Although the command seems cumbersome, we have provided many default settings, so users do not need to specify most of the arguments in practice. For detailed help on each argument, check out the source [here](https://github.com/dsp-uga/speech-recognition/blob/Documentation/src/predict.py).
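
For instance, a minimal hypothetical call that relies on those defaults and only names the model might look like this (treat it as a sketch, not a tested command):

```
cd src
python predict.py -m cnn_rnn
```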

## How to evaluate

We use Word Error Rate (WER) to evaluate our system, computed with the open-sourced `asr-evaluation` package.
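
For reference, WER is the word-level edit distance (substitutions + deletions + insertions) between the reference transcript and the model's hypothesis, divided by the number of words in the reference, so it can exceed 100% when the hypothesis contains many errors. A minimal illustrative implementation (not the internals of `asr-evaluation`):

```
# word error rate = word-level Levenshtein distance / number of reference words
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 deletions / 6 words ~ 0.33
```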

Installation (if `asr-evaluation` is not already installed in your customized conda environment):

```
pip install asr-evaluation
```

Command-line usage (in the examples here, we put our results under the `prediction` folder):

```
cd prediction
wer <true_text.txt> <predicted_test.txt>
```

For more detailed information, please refer to https://github.com/belambert/asr-evaluation

## Results

So far, the best WER on LibriSpeech is around 80%, and the best WER on DASS is around 102%. We are still in the process of improving our models.

## TODO

- More improvements to the architectures of these models
- More hyperparameter tuning for the models
- Better language models

## How to Contribute

We welcome any kind of contribution. If you want to contribute, just create a ticket!

## Team Members:

* Yuanming Shi, Institute for Artificial Intelligence, The University of Georgia
* Ailing Wang, Department of Computer Science, The University of Georgia

See [CONTRIBUTORS.md](./CONTRIBUTORS.md) for detailed contributions by each team member.

## License

MIT

## Reference

- https://github.com/udacity/AIND-VUI-Capstone
- https://github.com/baidu-research/ba-dls-deepspeech
- https://kheafield.com/code/kenlm/
- https://github.com/PaddlePaddle/DeepSpeech
- https://github.com/robmsmt/KerasDeepSpeech
- http://www.openslr.org/12/
- http://lap3.libs.uga.edu/u/jstanley/vowelcharts/
- https://catalog.ldc.upenn.edu/LDC2012S03
- http://www.lap.uga.edu/
@@ -0,0 +1,24 @@
channels:
  - defaults
dependencies:
  - future
  - numpy
  - scipy
  - ipython
  - ipyparallel
  - jupyter
  - matplotlib
  - scipy
  - scikit-image
  - scikit-learn
  - h5py
  - cython
  - spyder
  - nose
  - bokeh
  - tqdm
  - pip:
    - click
    - keras
    - tensorflow
    - soundfile
@@ -0,0 +1,3 @@
This folder stores the source file for openfst 1.6.3, which is required for using the Swig Decoder developed by Baidu [here](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/decoders).

For detailed instructions, please navigate to our Wiki and see how to integrate it into the system.
Binary file not shown.