add ge2e and tacotron2_aishell3 example (#107)
* hacky thing, add tone support for acoustic model

* fix experiments for waveflow and wavenet, only write visual log in rank-0

* use emb add in tacotron2

* 1. remove space from numericalized representation;
2. fix decoder padding mask's unsqueeze dim.

* remove bn in postnet

* refactoring code

* add an option to normalize volume when loading audio.

* add an embedding layer.

* 1. change the default min value of LogMagnitude to 1e-5;
2. remove stop logit prediction from tacotron2 model.

* WIP: baker

* add ge2e

* fix lstm speaker encoder

* fix lstm speaker encoder

* fix speaker encoder and add support for 2 more datasets

* simplify visualization code

* add a simple strategy to support multispeaker for tacotron.

* add vctk example for refactored tacotron

* fix indentation

* fix class name

* fix visualizer

* fix root path

* fix root path

* fix root path

* fix typos

* fix bugs

* fix text log extension name

* add example for baker and aishell3

* update experiment and display

* format code for tacotron_vctk, add plot_waveform to display

* add new trainer

* minor fix

* add global condition support for tacotron2

* add gst layer

* add 2 frontends

* fix fmax for example/waveflow

* update collate function; the data loader now does not convert nested lists into numpy arrays.

* WIP: add hifigan

* WIP:update hifigan

* change stft to use conv1d

* add audio datasets

* change batch_text_id, batch_spec, batch_wav to include valid lengths in the returned value

* change wavenet to use on-the-fly preprocessing

* fix typos

* resolve conflict

* remove imports that are removed

* remove files not included in this release

* remove imports to deleted modules

* move tacotron2_msp

* clean code

* fix argument order

* fix argument name

* clean code for data processing

* WIP: add README

* add more details to the README, fix some preprocess scripts

* add voice cloning notebook

* add an option to alter the loss and model structure of tacotron2, add an alternative config

* add plot_multiple_attentions and update visualization code in transformer_tts

* format code

* remove tacotron2_msp

* update tacotron2 from_pretrained, update setup.py

* update tacotron2

* update tacotron_aishell3's README

* add images for examples/tacotron2_aishell3's README

* update README for examples/ge2e

* add STFT back

* add extra_config keys into the default config of tacotron

* fix typos and docs

* update README and doc

* update docstrings for tacotron

* update doc

* update README

* add links to download pretrained models

* refine READMEs and clean code

* add praatio into requirements for running the experiments

* format code with pre-commit

* simplify text processing code and update notebook
Feiyu Chan authored May 13, 2021
1 parent 0aa7088 commit 4f288a6
Showing 82 changed files with 9,407 additions and 2,464 deletions.
28 changes: 24 additions & 4 deletions README.md
@@ -18,14 +18,14 @@ In order to facilitate exploiting the existing TTS models directly and developin

- Vocoders
- [WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219)
- [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499)

- TTS models
- [Neural Speech Synthesis with Transformer Network (Transformer TTS)](https://arxiv.org/abs/1809.08895)
- [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884)

## Updates

And more will be added in the future.
May-07-2021: Add an example for voice cloning in Chinese. Check [examples/tacotron2_aishell3](./examples/tacotron2_aishell3).


## Setup
@@ -45,7 +45,7 @@ See [install](https://www.paddlepaddle.org.cn/install/quick) for more details. T
pip install -U paddle-parakeet
```

or
```bash
git clone https://github.com/PaddlePaddle/Parakeet
cd Parakeet
@@ -59,9 +59,10 @@ See [install](https://paddle-parakeet.readthedocs.io/en/latest/install.html) for
Entry points to the introduction, training, and synthesis of each example model:

- [>>> WaveFlow](./examples/waveflow)
- [>>> WaveNet](./examples/wavenet)
- [>>> Transformer TTS](./examples/transformer_tts)
- [>>> Tacotron2](./examples/tacotron2)
- [>>> Tacotron2_AISHELL3](./examples/tacotron2_aishell3)
- [>>> GE2E](./examples/ge2e)


## Audio samples
@@ -70,6 +71,25 @@ Entries to the introduction, and the launch of training and synthesis for differe

Check our [website](https://paddle-parakeet.readthedocs.io/en/latest/demo.html) for audio samples.


## Checkpoints

### Tacotron2
1. [tacotron2_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_ckpt_0.3.zip)
2. [tacotron2_ljspeech_ckpt_0.3_alternative.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_ckpt_0.3_alternative.zip)

### Tacotron2_AISHELL3
1. [tacotron2_aishell3_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_aishell3_ckpt_0.3.zip)

### TransformerTTS
1. [transformer_tts_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/transformer_tts_ljspeech_ckpt_0.3.zip)

### WaveFlow
1. [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_ljspeech_ckpt_0.3.zip)

### GE2E
1. [ge2e_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/ge2e_ckpt_0.3.zip)
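
The snippet below is a rough sketch of how a downloaded archive's parameter file could be inspected and loaded; the file name inside the archive is hypothetical, and only `paddle.load` / `set_state_dict` from PaddlePaddle are assumed.

```python
import paddle

# Hypothetical file name; the actual .pdparams name depends on the archive you unzip.
state_dict = paddle.load("tacotron2_ljspeech_ckpt_0.3/step-100000.pdparams")
print(list(state_dict.keys())[:5])    # inspect a few parameter names
# model.set_state_dict(state_dict)    # `model`: an instance of the matching Parakeet model
```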

## Copyright and License

Parakeet is provided under the [Apache-2.0 license](LICENSE).
1 change: 0 additions & 1 deletion docs/source/conf.py
@@ -68,7 +68,6 @@

html_theme = "sphinx_rtd_theme"


# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
44 changes: 44 additions & 0 deletions docs/source/demo.rst
@@ -140,4 +140,48 @@ Vocoder audio samples

Audio samples generated from ground-truth spectrograms with a vocoder.

.. raw:: html

<embed>
<table>
<tr>
<th align="left"> WaveFlow res 128</th>
</tr>
<tr>
<td>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_samples_1.0/step_2000k_sentence_0.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_samples_1.0/step_2000k_sentence_1.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_samples_1.0/step_2000k_sentence_2.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_samples_1.0/step_2000k_sentence_3.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_samples_1.0/step_2000k_sentence_4.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
</tr>
</table>
</embed>

7 changes: 0 additions & 7 deletions docs/source/parakeet.models.rst
@@ -28,13 +28,6 @@ parakeet.models.waveflow module
:undoc-members:
:show-inheritance:

parakeet.models.wavenet module
------------------------------

.. automodule:: parakeet.models.wavenet
:members:
:undoc-members:
:show-inheritance:

Module contents
---------------
129 changes: 129 additions & 0 deletions examples/ge2e/README.md
@@ -0,0 +1,129 @@
# Speaker Encoder

This experiment trains a speaker encoder with speaker verification as its task. It is part of the experiment on transfer learning from speaker verification to multispeaker text-to-speech synthesis, which can be found at [tacotron2_aishell3](../tacotron2_aishell3). The trained speaker encoder is used to extract utterance embeddings from utterances.

## Model

The model used in this experiment is the speaker encoder for the text-independent speaker verification task described in [GENERALIZED END-TO-END LOSS FOR SPEAKER VERIFICATION](https://arxiv.org/pdf/1710.10467.pdf). The GE2E softmax loss is used.
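
For intuition, here is a minimal NumPy sketch of the GE2E softmax loss from the paper above. It is illustrative only and is not the implementation used in this repository (which lives in the `parakeet` package); the shapes and the scale/bias values are assumptions.

```python
import numpy as np

def ge2e_softmax_loss(embeds, w=10.0, b=-5.0):
    """embeds: (n_speakers, n_utterances, dim), assumed L2-normalized per utterance."""
    n_spk, n_utt, _ = embeds.shape
    centroids = embeds.mean(axis=1)
    centroids /= np.linalg.norm(centroids, axis=-1, keepdims=True)

    # Cosine similarity of every utterance embedding to every speaker centroid.
    sim = np.einsum("jid,kd->jik", embeds, centroids)

    # For an utterance's own speaker, exclude that utterance from the centroid.
    for j in range(n_spk):
        for i in range(n_utt):
            excl = (embeds[j].sum(axis=0) - embeds[j, i]) / (n_utt - 1)
            excl /= np.linalg.norm(excl)
            sim[j, i, j] = embeds[j, i] @ excl

    sim = w * sim + b  # w and b are learnable scalars during training

    # Softmax loss: each utterance should be closest to its own speaker's centroid.
    log_softmax = sim - np.log(np.exp(sim).sum(axis=-1, keepdims=True))
    idx_spk = np.arange(n_spk)[:, None]
    idx_utt = np.arange(n_utt)[None, :]
    return float(-log_softmax[idx_spk, idx_utt, idx_spk].mean())

# Toy usage: 4 speakers x 5 utterances x 64-dim random embeddings.
e = np.random.randn(4, 5, 64)
e /= np.linalg.norm(e, axis=-1, keepdims=True)
print(ge2e_softmax_loss(e))
```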

## File Structure

```text
ge2e
├── README.md
├── README_cn.md
├── audio_processor.py
├── config.py
├── dataset_processors.py
├── inference.py
├── preprocess.py
├── random_cycle.py
├── speaker_verification_dataset.py
└── train.py
```

## Download Datasets

Currently supported datasets are Librispeech-other-500, VoxCeleb1, VoxCeleb2, Aidatatang-200zh, and magicdata, each of which can be downloaded from its corresponding webpage.

1. Librispeech/train-other-500

   An English multispeaker dataset, [URL](https://www.openslr.org/resources/12/train-other-500.tar.gz); only the `train-other-500` subset is used.

2. VoxCeleb1

   An English multispeaker dataset, [URL](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html); the Audio Files parts Dev A to Dev D should be downloaded, combined, and extracted.

3. VoxCeleb2

   An English multispeaker dataset, [URL](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html); the Audio Files parts Dev A to Dev H should be downloaded, combined, and extracted.

4. Aidatatang-200zh

   A Mandarin Chinese multispeaker dataset, [URL](https://www.openslr.org/62/).

5. magicdata

   A Mandarin Chinese multispeaker dataset, [URL](https://www.openslr.org/68/).

If you want to use other datasets, you can also download and preprocess them, as long as they meet the requirements described below.

## Preprocess Datasets

Multispeaker datasets are used as training data, though the transcriptions are not used. To enlarge the amount of data used for training, several multispeaker datasets are combined. The preprocessed datasets are organized in the file structure described below. The mel spectrogram of each utterance is saved in `.npy` format. The dataset is 2-stratified (speaker-utterance). Since multiple datasets are combined, the dataset name is prepended to the speaker ids to avoid conflicts.

```text
dataset_root
├── dataset01_speaker01/
│   ├── utterance01.npy
│   ├── utterance02.npy
│   └── utterance03.npy
├── dataset01_speaker02/
│   ├── utterance01.npy
│   ├── utterance02.npy
│   └── utterance03.npy
├── dataset02_speaker01/
│   ├── utterance01.npy
│   ├── utterance02.npy
│   └── utterance03.npy
└── dataset02_speaker02/
   ├── utterance01.npy
   ├── utterance02.npy
   └── utterance03.npy
```

Run the command to preprocess datasets.

```bash
python preprocess.py --datasets_root=<datasets_root> --output_dir=<output_dir> --dataset_names=<dataset_names>
```

Here `--datasets_root` is the directory that contains several extracted datasets; `--output_dir` is the directory to save the preprocessed dataset; `--dataset_names` gives the datasets to preprocess. If there are multiple datasets in `--datasets_root` to preprocess, the names can be joined with commas. Currently supported dataset names are librispeech_other, voxceleb1, voxceleb2, aidatatang_200zh, and magicdata.
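
As a rough illustration of how such a layout can be produced (this is not the repository's preprocessing code, which lives in `preprocess.py` and `audio_processor.py`; the sample rate and mel parameters below are assumptions, the real values come from `config.py`):

```python
from pathlib import Path

import librosa
import numpy as np

def save_utterance_mel(wav_path, out_root, dataset_name, speaker_id,
                       sr=16000, n_fft=400, hop_length=160, n_mels=40):
    """Save one utterance's log-mel spectrogram under <dataset>_<speaker>/."""
    wav_path = Path(wav_path)
    wav, _ = librosa.load(str(wav_path), sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = np.log(np.maximum(mel, 1e-10))
    # Prepend the dataset name to the speaker id to avoid id conflicts.
    speaker_dir = Path(out_root) / f"{dataset_name}_{speaker_id}"
    speaker_dir.mkdir(parents=True, exist_ok=True)
    np.save(speaker_dir / (wav_path.stem + ".npy"), log_mel.T)
```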

## Training

When preprocessing is done, run the command below to train the model.

```bash
python train.py --data=<data_path> --output=<output> --device="gpu" --nprocs=1
```

- `--data` is the path to the preprocessed dataset.
- `--output` is the directory to save results, usually a subdirectory of `runs`. It contains the visualdl log files, text log files, the config file, and a `checkpoints` directory, which holds the parameter file and optimizer state file. If `--output` already contains some training results, the most recent parameter file and optimizer state file are loaded before training.
- `--device` is the device type to run the training, 'cpu' and 'gpu' are supported.
- `--nprocs` is the number of replicas to run in multiprocessing-based parallel training. Currently, multiprocessing-based parallel training is enabled only when using 'gpu' as the device. `CUDA_VISIBLE_DEVICES` can be used to specify which CUDA devices are visible.

Other options are described below.

- `--config` is a `.yaml` config file used to override the default config (defined in `config.py`).
- `--opts` are command-line options that further override the config file. They should be passed last, as multiple KEY VALUE pairs separated by spaces (see the sketch after this list).
- `--checkpoint_path` specifies the checkpoint to load before training, without its extension. A parameter file (`.pdparams`) and an optimizer state file (`.pdopt`) with the same name are used. This option has a higher priority than auto-resuming from the `--output` directory.
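
To make the `--opts` convention concrete, here is a minimal, self-contained sketch of how trailing KEY VALUE pairs can override a default config. It is an assumption-level illustration, not the repository's actual config code (see `config.py`):

```python
import argparse

# Hypothetical flat default config; the project's real defaults live in config.py.
defaults = {"data.batch_size": 64, "training.max_iteration": 1000000}

parser = argparse.ArgumentParser()
parser.add_argument("--opts", nargs=argparse.REMAINDER, default=[],
                    help="trailing KEY VALUE pairs that override the config")
args = parser.parse_args(["--opts", "data.batch_size", "32"])

config = dict(defaults)
for key, value in zip(args.opts[0::2], args.opts[1::2]):
    config[key] = type(defaults[key])(value)  # cast the override to the default's type
print(config)  # {'data.batch_size': 32, 'training.max_iteration': 1000000}
```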

## Pretrained Model

The pretrained model was first trained for 1560k steps on Librispeech-other-500 and voxceleb1, then trained on aidatatang_200zh and magicdata up to 3000k steps.

Download URL [ge2e_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/ge2e_ckpt_0.3.zip).

## Inference

When training is done, run the command below to generate an utterance embedding for each utterance in a dataset.

```bash
python inference.py --input=<input> --output=<output> --checkpoint_path=<checkpoint_path> --device="gpu"
```

`--input` is the path of the dataset used for inference.

`--output` is the directory to save the processed results. It has the same file structure as the input dataset. Each utterance in the dataset has a corresponding utterance embedding file in `*.npy` format (see the sketch below).

`--checkpoint_path` is the path of the checkpoint to use, extension not included.

`--pattern` is the wildcard pattern to filter audio files for inference, defaults to `*.wav`.

`--device` and `--opts` have the same meaning as in the training script.
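
As a sketch of how the generated embeddings might be used downstream (the paths below are hypothetical; only NumPy is assumed), two `*.npy` embeddings can be compared with cosine similarity:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical output paths that mirror the input dataset's structure.
emb_a = np.load("embeds/dataset01_speaker01/utterance01.npy")
emb_b = np.load("embeds/dataset01_speaker02/utterance01.npy")
print(f"similarity: {cosine_similarity(emb_a, emb_b):.3f}")  # higher means more likely the same speaker
```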

## References

1. [Generalized End-to-end Loss for Speaker Verification](https://arxiv.org/pdf/1710.10467.pdf)
2. [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf)
124 changes: 124 additions & 0 deletions examples/ge2e/README_cn.md
@@ -0,0 +1,124 @@
# Speaker Encoder

This experiment trains a speaker encoder on multispeaker datasets, with speaker verification as the task. It is part of the experiment on transfer learning from speaker verification to multispeaker text-to-speech synthesis, which can be found at [tacotron2_aishell3](../tacotron2_aishell3). The trained model is used to extract utterance embeddings from audio.

## Model

The model used in this experiment is the text-independent speaker encoder from [GENERALIZED END-TO-END LOSS FOR SPEAKER VERIFICATION](https://arxiv.org/pdf/1710.10467.pdf), trained with the GE2E softmax loss.

## File Structure

```text
ge2e
├── README_cn.md
├── audio_processor.py
├── config.py
├── dataset_processors.py
├── inference.py
├── preprocess.py
├── random_cycle.py
├── speaker_verification_dataset.py
└── train.py
```

## Download Datasets

This experiment supports the Librispeech-other-500, VoxCeleb1, VoxCeleb2, Aidatatang-200zh, and magicdata datasets, which can be downloaded from their corresponding pages.

1. Librispeech/train-other-500

   An English multispeaker dataset, [download link](https://www.openslr.org/resources/12/train-other-500.tar.gz); only the train-other-500 subset is used in our experiments.

2. VoxCeleb1

   An English multispeaker dataset, [download link](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html); download the four Audio Files archives Dev A to Dev D, then combine and extract them.

3. VoxCeleb2

   An English multispeaker dataset, [download link](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html); download the eight Audio Files archives Dev A to Dev H, then combine and extract them.

4. Aidatatang-200zh

   A Mandarin Chinese multispeaker dataset, [download link](https://www.openslr.org/62/).

5. magicdata

   A Mandarin Chinese multispeaker dataset, [download link](https://www.openslr.org/68/).

If you want to use other datasets, you can also download and preprocess them yourself, as long as they meet the requirements described below.

## Preprocess Datasets

The datasets used for training are multispeaker datasets; the transcriptions are not used. To enlarge the amount of data, multiple datasets can be merged into one during training. The processed files are organized as shown below: the spectrogram of each utterance is stored in `.npy` format, in a two-level speaker-utterance directory structure. Because datasets are merged, the dataset name is prepended to each speaker id to avoid speaker id conflicts.

```text
dataset_root
├── dataset01_speaker01/
│   ├── utterance01.npy
│   ├── utterance02.npy
│   └── utterance03.npy
├── dataset01_speaker02/
│   ├── utterance01.npy
│   ├── utterance02.npy
│   └── utterance03.npy
├── dataset02_speaker01/
│   ├── utterance01.npy
│   ├── utterance02.npy
│   └── utterance03.npy
└── dataset02_speaker02/
   ├── utterance01.npy
   ├── utterance02.npy
   └── utterance03.npy
```

Run the preprocessing script:

```bash
python preprocess.py --datasets_root=<datasets_root> --output_dir=<output_dir> --dataset_names=<dataset_names>
```

`--datasets_root` is the directory that contains the raw datasets; `--output_dir` is the output directory for the merged, processed datasets; `--dataset_names` gives the names of the datasets. Multiple dataset names can be separated by commas, e.g. 'librispeech_other, voxceleb1'. Currently supported datasets are librispeech_other, voxceleb1, voxceleb2, aidatatang_200zh, and magicdata.

## Training

When data processing is done, train the model with the following script:

```bash
python train.py --data=<data_path> --output=<output> --device="gpu" --nprocs=1
```

- `--data` is the path to the processed dataset.
- `--output` is the directory to save training results, usually a subdirectory of `runs`. It contains the visualdl log files, the text logs, a backup of the running config, and a `checkpoints` directory that holds the parameter files and optimizer state files. If the given `--output` path contains results from previous training, the most recent parameter file and optimizer state file are loaded automatically before training.
- `--device` is the device to run on; 'cpu' and 'gpu' are currently supported.
- `--nprocs` is the number of training processes. Multi-process training is currently supported only with 'gpu' as the device. The `CUDA_VISIBLE_DEVICES` environment variable can be used to specify which GPUs are visible.

There are a few more options:

- `--config` is a `.yaml` file that overrides the default config (the defaults can be found in `config.py`).
- `--opts` further overrides the config from the command line. It must be the last option passed, given as multiple space-separated KEY VALUE pairs.
- `--checkpoint_path` specifies the checkpoint to resume from, without its extension. The parameter file (`.pdparams`) and optimizer state file (`.pdopt`) with the same name are loaded to resume training. This option takes priority over auto-resuming from the `--output` directory.

## Pretrained Model

The pretrained model was trained on Librispeech-other-500 and voxceleb1 for 1560k steps, and then on aidatatang_200zh and magicdata up to 3000k steps.

Download link: [ge2e_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/ge2e_ckpt_0.3.zip)

## Inference

Use the trained model to generate an embedding for every utterance in a dataset.

```bash
python inference.py --input=<input> --output=<output> --checkpoint_path=<checkpoint_path> --device="gpu"
```

- `--input` is the path of the dataset to process.
- `--output` is the directory for the results. It keeps the same folder structure as `--input`; for each audio file in the input, there is a `*.npy` file with the same name containing the utterance embedding extracted from that audio.
- `--checkpoint_path` is the path of the parameter file used for inference, without its extension.
- `--pattern` is the wildcard pattern used to select the audio files to process, defaulting to `*.wav`.
- `--device` and `--opts` have the same meaning as in the training script.

## References

1. [GENERALIZED END-TO-END LOSS FOR SPEAKER VERIFICATION](https://arxiv.org/pdf/1710.10467.pdf)
2. [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf)

