Show-and-Speak

This is the PyTorch implementation of our paper "Show and Speak: Directly Synthesize Spoken Description of Images". More details are available on the project page.

Requirements

python 3.6
pytorch 1.4.0
scipy 1.2.1
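
To compare an existing environment against the versions listed above, a small check along the following lines may help. This is only a sketch: the helper name `report_versions` is hypothetical and not part of this repository, and it assumes each package exposes a `__version__` attribute, which `torch` and `scipy` do.

```python
from importlib import import_module

# Versions taken from the requirements list above.
EXPECTED = {"torch": "1.4.0", "scipy": "1.2.1"}


def report_versions(expected=EXPECTED):
    """Return {package: (expected_version, installed_version)}.

    installed_version is None when the package cannot be imported and
    "unknown" when it has no __version__ attribute.
    """
    report = {}
    for name, wanted in expected.items():
        try:
            found = getattr(import_module(name), "__version__", "unknown")
        except ImportError:
            found = None
        report[name] = (wanted, found)
    return report
```

Calling `report_versions()` and printing the result shows at a glance which packages are missing or differ from the versions above.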

Training

Download database

Download our processed database, Flickr8k_SAS, and unzip it in the root directory of the code. The resulting directory tree should look like this:

├── Data_for_SAS
│   ├── bottom_up_features_36_info
│   ├── images
│   ├── mel_80
│   ├── wavs
│   ├── train
│   │   ├── filenames.pickle
│   ├── val
│   │   ├── filenames.pickle
│   ├── test
│   │   ├── filenames.pickle

Here, "bottom_up_features_36_info" contains the bottom-up features extracted from the images; "images" contains the raw Flickr8k images; "mel_80" contains the mel spectrograms of the audio files; and "wavs" contains the speech synthesized by a TTS system.
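
Before training, it can be useful to confirm that the unzipped data matches the tree above. The following is a minimal sketch, not part of this repository: the helper names `missing_entries` and `load_filenames` are hypothetical, and the code makes no assumption about what each `filenames.pickle` stores beyond it being a valid pickle file.

```python
import pickle
from pathlib import Path

# Directory and split names copied from the tree above.
EXPECTED_DIRS = ["bottom_up_features_36_info", "images", "mel_80", "wavs"]
SPLITS = ["train", "val", "test"]


def missing_entries(root):
    """Return the expected directories and pickle files absent under root."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [
        f"{split}/filenames.pickle"
        for split in SPLITS
        if not (root / split / "filenames.pickle").is_file()
    ]
    return missing


def load_filenames(root, split):
    """Unpickle and return whatever object <root>/<split>/filenames.pickle holds."""
    with open(Path(root) / split / "filenames.pickle", "rb") as f:
        return pickle.load(f)
```

An empty list from `missing_entries("Data_for_SAS")` means every directory and split file named above is in place.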

Train the code

Run:

python train --data_dir Data_for_SAS --save_path outputs 

Inference

Download the pre-trained WaveGlow model and put it in the root directory of this code.

Run:

python train --data_dir Data_for_SAS --save_path outputs --only_val
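
The training and inference commands differ only in the `--only_val` flag, so a script that drives both could assemble them as sketched below. The wrapper functions `build_command` and `run` are hypothetical and not part of this repository; the script name `train` and the flags are copied verbatim from the commands in this README.

```python
import subprocess
import sys


def build_command(data_dir="Data_for_SAS", save_path="outputs", only_val=False):
    """Assemble the command line shown in this README as an argument list."""
    # "train" is the script name exactly as written in the README commands.
    cmd = [sys.executable, "train", "--data_dir", data_dir, "--save_path", save_path]
    if only_val:
        # Inference mode: the same command with --only_val appended.
        cmd.append("--only_val")
    return cmd


def run(only_val=False):
    """Launch training (default) or inference; raises if the process fails."""
    return subprocess.run(build_command(only_val=only_val), check=True)
```

`run()` starts training and `run(only_val=True)` starts inference; both must be called from the root directory of the code.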

Cite

@article{wang2020show,
title={Show and Speak: Directly Synthesize Spoken Description of Images},
author={Xinsheng Wang and Siyuan Feng and Jihua Zhu and Mark Hasegawa-Johnson and Odette Scharenborg},
journal={arXiv preprint arXiv:2010.12267},
year={2020}
}