Speech-to-Text-WaveNet2 : End-to-end sentence level English speech recognition using DeepMind's WaveNet

A TensorFlow implementation of speech recognition based on DeepMind's WaveNet: A Generative Model for Raw Audio (hereafter "the Paper").

The architecture is shown in the following figure.

(Some images are cropped from [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499) and [Neural Machine Translation in Linear Time](https://arxiv.org/abs/1610.10099))

Version

Current Version : 2.1.0.0

demo
test
train
train model

Dependencies

tensorflow >= 1.12.0
librosa
glog
nltk

If you have problems with the librosa library, try installing ffmpeg with the following commands (Ubuntu 14.04).


sudo add-apt-repository ppa:mc3man/trusty-media
sudo apt-get update
sudo apt-get dist-upgrade -y
sudo apt-get -y install ffmpeg
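To confirm that the installation worked before running the scripts, a quick check such as the following may help. It is a minimal sketch that only looks for an `ffmpeg` executable on the PATH with the standard library; the function name `ffmpeg_available` is made up for this example and is not part of this repository.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found at", shutil.which("ffmpeg"))
    else:
        print("ffmpeg not found; install it with the commands above")
```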

Dataset

Audio was augmented using the scheme described in Tom Ko et al.'s paper. (Thanks to @migvel for the pointer.)
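The paper by Tom Ko et al. is usually cited for speed perturbation, in which each utterance is resampled at factors such as 0.9, 1.0 and 1.1. This README does not say how the augmentation is applied here, so the following is only an illustrative sketch of the idea and not this repository's code. The function name `speed_perturb` and the use of plain linear interpolation are assumptions made for brevity; a real pipeline would use a proper resampler.

```python
from typing import List, Sequence


def speed_perturb(samples: Sequence[float], factor: float) -> List[float]:
    """Resample `samples` so that playback is `factor` times faster.

    factor > 1 shortens the signal and factor < 1 lengthens it.
    Linear interpolation is used here only to keep the sketch short.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_in = len(samples)
    if n_in == 0:
        return []
    n_out = max(1, int(round(n_in / factor)))
    out = []
    for i in range(n_out):
        pos = i * factor          # fractional position in the input signal
        lo = int(pos)
        if lo >= n_in - 1:        # past the last interval: hold the final sample
            out.append(float(samples[-1]))
            continue
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[lo + 1] * frac)
    return out


if __name__ == "__main__":
    signal = [float(i) for i in range(100)]
    for factor in (0.9, 1.0, 1.1):
        print(factor, len(speed_perturb(signal, factor)))
```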

Usage

Execute

python ***.py --help

to see the command-line options of ***.py.

Create dataset

Download and extract the dataset (only VCTK is supported for now; others will be added later).
Pass the directory of the extracted VCTK dataset as -input_dir (the example below uses /zhzhao/VCTK) and execute

python tools/create_tf_record.py -input_dir='/zhzhao/VCTK'

to create the TFRecord files for training and testing.
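Before building the records, it can be useful to check which utterances in the dataset directory have both audio and a transcript. The sketch below is not taken from tools/create_tf_record.py. It assumes the common VCTK layout of `wav48/<speaker>/<utterance>.wav` and `txt/<speaker>/<utterance>.txt`, and the function name `iter_vctk_pairs` is hypothetical.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_vctk_pairs(root: str) -> Iterator[Tuple[Path, str]]:
    """Yield (wav_path, transcript) for every utterance that has both files.

    Assumes wav48/<speaker>/<id>.wav and txt/<speaker>/<id>.txt under `root`.
    """
    root_path = Path(root)
    for wav_path in sorted((root_path / "wav48").glob("*/*.wav")):
        speaker = wav_path.parent.name
        txt_path = root_path / "txt" / speaker / (wav_path.stem + ".txt")
        if not txt_path.is_file():
            continue  # skip audio that has no matching transcript
        yield wav_path, txt_path.read_text(encoding="utf-8").strip()


if __name__ == "__main__":
    # The path matches the example command above; replace it with your own.
    for wav_path, transcript in iter_vctk_pairs("/zhzhao/VCTK"):
        print(wav_path.name, "->", transcript)
```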

Train

Rename config/config.json.example to config/english-28.json
Execute

python train.py

to train the model.

Test

Execute

python test.py

to evaluate the model.

Demo

1. Download the pretrained model (buriburisuri model) and extract it to the 'release' directory.

2. Execute


python demo.py -input_path

to transcribe a speech wave file into an English sentence. The result is printed to the console.

For example, try the following command.


python demo.py -input_path=data/demo.wav -ckpt_dir=release/buriburisuri
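To transcribe several files in one go, the same command can be issued once per file. The following is a rough sketch that relies only on the two flags shown above (`-input_path` and `-ckpt_dir`); the helper names `build_demo_command` and `transcribe_all` are hypothetical, and the script is assumed to be run from the repository root.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_demo_command(wav_path: str, ckpt_dir: str) -> List[str]:
    """Build the demo.py command line for a single wave file."""
    return [
        sys.executable,
        "demo.py",
        f"-input_path={wav_path}",
        f"-ckpt_dir={ckpt_dir}",
    ]


def transcribe_all(wav_dir: str, ckpt_dir: str) -> None:
    """Run demo.py once for every .wav file in `wav_dir`."""
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        # demo.py prints its result to the console, so nothing is captured here.
        subprocess.run(build_demo_command(str(wav_path), ckpt_dir), check=True)


if __name__ == "__main__":
    transcribe_all("data", "release/buriburisuri")
```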

The result will be as follows:


please scool stella

The ground truth is as follows:


PLEASE SCOOL STELLA

Because there is no language model, the output may have incorrect capitalization or punctuation, and some words may be misspelled.
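Since the output and the ground truth can differ in case, one simple way to compare them is a case-insensitive word error rate. The sketch below is not the metric used by test.py (this README does not say what that is); `edit_distance` and `word_error_rate` are illustrative names, and the word error rate is computed as the word-level Levenshtein distance divided by the number of reference words.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Case-insensitive word error rate of `hypothesis` against `reference`."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    print(word_error_rate("PLEASE SCOOL STELLA", "please scool stella"))
```

With the example above, the two strings differ only in case, so the computed error rate is zero.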

Pretrained models

buriburisuri model: converted from the model at https://github.com/buriburisuri/speech-to-text-wavenet.

Future works

try to tokenize the English labels with nltk
train with all punctuation
add attention layer

Other resources

Namju's other repositories

Citation

If you find this code useful please cite us in your work:


Kim and Park. Speech-to-Text-WaveNet. 2016. GitHub repository. https://github.com/buriburisuri/.

Authors

Namju Kim at KakaoBrain Corp.

Kyubyong Park at KakaoBrain Corp.
