add ge2e and tacotron2_aishell3 example (#107)
* hacky thing, add tone support for acoustic model

* fix experiments for waveflow and wavenet, only write visual log in rank-0

* use emb add in tacotron2

* 1. remove space from numericalized representation;
2. fix decoder padding mask's unsqueeze dim.

* remove bn in postnet

* refactoring code

* add an option to normalize volume when loading audio.

* add an embedding layer.

* 1. change the default min value of LogMagnitude to 1e-5;
2. remove stop logit prediction from tacotron2 model.

* WIP: baker

* add ge2e

* fix lstm speaker encoder

* fix lstm speaker encoder

* fix speaker encoder and add support for 2 more datasets

* simplify visualization code

* add a simple strategy to support multispeaker for tacotron.

* add vctk example for refactored tacotron

* fix indentation

* fix class name

* fix visualizer

* fix root path

* fix root path

* fix root path

* fix typos

* fix bugs

* fix text log extension name

* add example for baker and aishell3

* update experiment and display

* format code for tacotron_vctk, add plot_waveform to display

* add new trainer

* minor fix

* add global condition support for tacotron2

* add gst layer

* add 2 frontends

* fix fmax for example/waveflow

* update collate function; the data loader now does not convert nested lists into numpy arrays.

* WIP: add hifigan

* WIP:update hifigan

* change stft to use conv1d

* add audio datasets

* change batch_text_id, batch_spec, batch_wav to include valid lengths in the returned value

* change wavenet to use on-the-fly preprocessing

* fix typos

* resolve conflict

* remove imports that are removed

* remove files not included in this release

* remove imports to deleted modules

* move tacotron2_msp

* clean code

* fix argument order

* fix argument name

* clean code for data processing

* WIP: add README

* add more details to the README, fix some preprocess scripts

* add voice cloning notebook

* add an option to alter the loss and model structure of tacotron2, add an alternative config

* add plot_multiple_attentions and update visualization code in transformer_tts

* format code

* remove tacotron2_msp

* update tacotron2 from_pretrained, update setup.py

* update tacotron2

* update tacotron_aishell3's README

* add images for examples/tacotron2_aishell3's README

* update README for examples/ge2e

* add STFT back

* add extra_config keys into the default config of tacotron

* fix typos and docs

* update README and doc

* update docstrings for tacotron

* update doc

* update README

* add links to download pretrained models

* refine READMEs and clean code

* add praatio into requirements for running the experiments

* format code with pre-commit

* simplify text processing code and update notebook
Feiyu Chan authored May 13, 2021
1 parent 0aa7088 commit 4f288a6
Showing 82 changed files with 9,407 additions and 2,464 deletions.
28 changes: 24 additions & 4 deletions README.md
@@ -18,14 +18,14 @@ In order to facilitate exploiting the existing TTS models directly and developin

- Vocoders
- [WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219)
- [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499)

- TTS models
- [Neural Speech Synthesis with Transformer Network (Transformer TTS)](https://arxiv.org/abs/1809.08895)
- [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884)

## Updates

And more will be added in the future.
May-07-2021: Add an example for voice cloning in Chinese. Check [examples/tacotron2_aishell3](./examples/tacotron2_aishell3).


## Setup
@@ -45,7 +45,7 @@ See [install](https://www.paddlepaddle.org.cn/install/quick) for more details. T
pip install -U paddle-parakeet
```

or
```bash
git clone https://github.com/PaddlePaddle/Parakeet
cd Parakeet
@@ -59,9 +59,10 @@ See [install](https://paddle-parakeet.readthedocs.io/en/latest/install.html) for
Entry points to the introduction, training, and synthesis of each example model:

- [>>> WaveFlow](./examples/waveflow)
- [>>> WaveNet](./examples/wavenet)
- [>>> Transformer TTS](./examples/transformer_tts)
- [>>> Tacotron2](./examples/tacotron2)
- [>>> Tacotron2_AISHELL3](./examples/tacotron2_aishell3)
- [>>> GE2E](./examples/ge2e)


## Audio samples
@@ -70,6 +71,25 @@ Entries to the introduction, and the launch of training and synthesis for differe

Check our [website](https://paddle-parakeet.readthedocs.io/en/latest/demo.html) for audio samples.


## Checkpoints

### Tacotron2
1. [tacotron2_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_ckpt_0.3.zip)
2. [tacotron2_ljspeech_ckpt_0.3_alternative.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_ckpt_0.3_alternative.zip)

### Tacotron2_AISHELL3
1. [tacotron2_aishell3_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_aishell3_ckpt_0.3.zip)

### TransformerTTS
1. [transformer_tts_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/transformer_tts_ljspeech_ckpt_0.3.zip)

### WaveFlow
1. [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_ljspeech_ckpt_0.3.zip)

### GE2E
1. [ge2e_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/ge2e_ckpt_0.3.zip)
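
The snippet below is a rough sketch of how a downloaded archive's parameter file could be inspected and loaded; the file name inside the archive is hypothetical, and only `paddle.load` / `set_state_dict` from PaddlePaddle are assumed.

```python
import paddle

# Hypothetical file name; the actual .pdparams name depends on the archive you unzip.
state_dict = paddle.load("tacotron2_ljspeech_ckpt_0.3/step-100000.pdparams")
print(list(state_dict.keys())[:5])    # inspect a few parameter names
# model.set_state_dict(state_dict)    # `model`: an instance of the matching Parakeet model
```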

## Copyright and License

Parakeet is provided under the [Apache-2.0 license](LICENSE).
1 change: 0 additions & 1 deletion docs/source/conf.py
@@ -68,7 +68,6 @@

html_theme = "sphinx_rtd_theme"


# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
44 changes: 44 additions & 0 deletions docs/source/demo.rst
@@ -140,4 +140,48 @@ Vocoder audio samples

Audio samples generated from ground-truth spectrograms with a vocoder.

.. raw:: html

<embed>
<table>
<tr>
<th align="left"> WaveFlow res 128</th>
</tr>
<tr>
<td>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_samples_1.0/step_2000k_sentence_0.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_samples_1.0/step_2000k_sentence_1.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_samples_1.0/step_2000k_sentence_2.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_samples_1.0/step_2000k_sentence_3.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_samples_1.0/step_2000k_sentence_4.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
</tr>
</table>
</embed>

7 changes: 0 additions & 7 deletions docs/source/parakeet.models.rst
@@ -28,13 +28,6 @@ parakeet.models.waveflow module
:undoc-members:
:show-inheritance:

parakeet.models.wavenet module
------------------------------

.. automodule:: parakeet.models.wavenet
:members:
:undoc-members:
:show-inheritance:

Module contents
---------------
129 changes: 129 additions & 0 deletions examples/ge2e/README.md
@@ -0,0 +1,129 @@
# Speaker Encoder

This experiment trains a speaker encoder with speaker verification as its task. It is part of the experiment on transfer learning from speaker verification to multispeaker text-to-speech synthesis, which can be found at [tacotron2_aishell3](../tacotron2_aishell3). The trained speaker encoder is used to extract utterance embeddings from utterances.

## Model

The model used in this experiment is the speaker encoder for the text-independent speaker verification task described in [GENERALIZED END-TO-END LOSS FOR SPEAKER VERIFICATION](https://arxiv.org/pdf/1710.10467.pdf). The GE2E softmax loss is used.
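
For intuition, here is a minimal NumPy sketch of the GE2E softmax loss from the paper above. It is illustrative only and is not the implementation used in this repository (which lives in the `parakeet` package); the shapes and the scale/bias values are assumptions.

```python
import numpy as np

def ge2e_softmax_loss(embeds, w=10.0, b=-5.0):
    """embeds: (n_speakers, n_utterances, dim), assumed L2-normalized per utterance."""
    n_spk, n_utt, _ = embeds.shape
    centroids = embeds.mean(axis=1)
    centroids /= np.linalg.norm(centroids, axis=-1, keepdims=True)

    # Cosine similarity of every utterance embedding to every speaker centroid.
    sim = np.einsum("jid,kd->jik", embeds, centroids)

    # For an utterance's own speaker, exclude that utterance from the centroid.
    for j in range(n_spk):
        for i in range(n_utt):
            excl = (embeds[j].sum(axis=0) - embeds[j, i]) / (n_utt - 1)
            excl /= np.linalg.norm(excl)
            sim[j, i, j] = embeds[j, i] @ excl

    sim = w * sim + b  # w and b are learnable scalars during training

    # Softmax loss: each utterance should be closest to its own speaker's centroid.
    log_softmax = sim - np.log(np.exp(sim).sum(axis=-1, keepdims=True))
    idx_spk = np.arange(n_spk)[:, None]
    idx_utt = np.arange(n_utt)[None, :]
    return float(-log_softmax[idx_spk, idx_utt, idx_spk].mean())

# Toy usage: 4 speakers x 5 utterances x 64-dim random embeddings.
e = np.random.randn(4, 5, 64)
e /= np.linalg.norm(e, axis=-1, keepdims=True)
print(ge2e_softmax_loss(e))
```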

## File Structure

```text
ge2e
├── README.md
├── README_cn.md
├── audio_processor.py
├── config.py
├── dataset_processors.py
├── inference.py
├── preprocess.py
├── random_cycle.py
├── speaker_verification_dataset.py
└── train.py
```

## Download Datasets

Currently supported datasets are Librispeech-other-500, VoxCeleb1, VoxCeleb2, Aidatatang-200zh, and magicdata, each of which can be downloaded from its corresponding webpage.

1. Librispeech/train-other-500

   An English multispeaker dataset, [URL](https://www.openslr.org/resources/12/train-other-500.tar.gz); only the `train-other-500` subset is used.

2. VoxCeleb1

   An English multispeaker dataset, [URL](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html); the Audio Files parts Dev A to Dev D should be downloaded, combined, and extracted.

3. VoxCeleb2

   An English multispeaker dataset, [URL](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html); the Audio Files parts Dev A to Dev H should be downloaded, combined, and extracted.

4. Aidatatang-200zh

   A Mandarin Chinese multispeaker dataset, [URL](https://www.openslr.org/62/).

5. magicdata

   A Mandarin Chinese multispeaker dataset, [URL](https://www.openslr.org/68/).

If you want to use other datasets, you can also download and preprocess them, as long as they meet the requirements described below.

## Preprocess Datasets

Multispeaker datasets are used as training data, though the transcriptions are not used. To enlarge the amount of data used for training, several multispeaker datasets are combined. The preprocessed datasets are organized in the file structure described below. The mel spectrogram of each utterance is saved in `.npy` format. The dataset is 2-stratified (speaker-utterance). Since multiple datasets are combined, the dataset name is prepended to the speaker ids to avoid conflicts.

```text
dataset_root
├── dataset01_speaker01/
│   ├── utterance01.npy
│   ├── utterance02.npy
│   └── utterance03.npy
├── dataset01_speaker02/
│   ├── utterance01.npy
│   ├── utterance02.npy
│   └── utterance03.npy
├── dataset02_speaker01/
│   ├── utterance01.npy
│   ├── utterance02.npy
│   └── utterance03.npy
└── dataset02_speaker02/
   ├── utterance01.npy
   ├── utterance02.npy
   └── utterance03.npy
```

Run the command to preprocess datasets.

```bash
python preprocess.py --datasets_root=<datasets_root> --output_dir=<output_dir> --dataset_names=<dataset_names>
```

Here `--datasets_root` is the directory that contains several extracted datasets; `--output_dir` is the directory to save the preprocessed dataset; `--dataset_names` gives the datasets to preprocess. If there are multiple datasets in `--datasets_root` to preprocess, the names can be joined with commas. Currently supported dataset names are librispeech_other, voxceleb1, voxceleb2, aidatatang_200zh, and magicdata.
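
As a rough illustration of how such a layout can be produced (this is not the repository's preprocessing code, which lives in `preprocess.py` and `audio_processor.py`; the sample rate and mel parameters below are assumptions, the real values come from `config.py`):

```python
from pathlib import Path

import librosa
import numpy as np

def save_utterance_mel(wav_path, out_root, dataset_name, speaker_id,
                       sr=16000, n_fft=400, hop_length=160, n_mels=40):
    """Save one utterance's log-mel spectrogram under <dataset>_<speaker>/."""
    wav_path = Path(wav_path)
    wav, _ = librosa.load(str(wav_path), sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = np.log(np.maximum(mel, 1e-10))
    # Prepend the dataset name to the speaker id to avoid id conflicts.
    speaker_dir = Path(out_root) / f"{dataset_name}_{speaker_id}"
    speaker_dir.mkdir(parents=True, exist_ok=True)
    np.save(speaker_dir / (wav_path.stem + ".npy"), log_mel.T)
```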

## Training

When preprocessing is done, run the command below to train the model.

```bash
python train.py --data=<data_path> --output=<output> --device="gpu" --nprocs=1
```

- `--data` is the path to the preprocessed dataset.
- `--output` is the directory to save results, usually a subdirectory of `runs`. It contains the visualdl log files, text log files, the config file, and a `checkpoints` directory, which holds the parameter file and optimizer state file. If `--output` already contains some training results, the most recent parameter file and optimizer state file are loaded before training.
- `--device` is the device type to run the training, 'cpu' and 'gpu' are supported.
- `--nprocs` is the number of replicas to run in multiprocessing-based parallel training. Currently, multiprocessing-based parallel training is enabled only when using 'gpu' as the device. `CUDA_VISIBLE_DEVICES` can be used to specify which CUDA devices are visible.

Other options are described below.

- `--config` is a `.yaml` config file used to override the default config (defined in `config.py`).
- `--opts` are command-line options that further override the config file. They should be passed last, as multiple KEY VALUE pairs separated by spaces (see the sketch after this list).
- `--checkpoint_path` specifies the checkpoint to load before training, without its extension. A parameter file (`.pdparams`) and an optimizer state file (`.pdopt`) with the same name are used. This option has a higher priority than auto-resuming from the `--output` directory.
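
To make the `--opts` convention concrete, here is a minimal, self-contained sketch of how trailing KEY VALUE pairs can override a default config. It is an assumption-level illustration, not the repository's actual config code (see `config.py`):

```python
import argparse

# Hypothetical flat default config; the project's real defaults live in config.py.
defaults = {"data.batch_size": 64, "training.max_iteration": 1000000}

parser = argparse.ArgumentParser()
parser.add_argument("--opts", nargs=argparse.REMAINDER, default=[],
                    help="trailing KEY VALUE pairs that override the config")
args = parser.parse_args(["--opts", "data.batch_size", "32"])

config = dict(defaults)
for key, value in zip(args.opts[0::2], args.opts[1::2]):
    config[key] = type(defaults[key])(value)  # cast the override to the default's type
print(config)  # {'data.batch_size': 32, 'training.max_iteration': 1000000}
```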

## Pretrained Model

The pretrained model was first trained for 1560k steps on Librispeech-other-500 and voxceleb1, then trained on aidatatang_200zh and magicdata up to 3000k steps.

Download URL [ge2e_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/ge2e_ckpt_0.3.zip).

## Inference

When training is done, run the command below to generate an utterance embedding for each utterance in a dataset.

```bash
python inference.py --input=<input> --output=<output> --checkpoint_path=<checkpoint_path> --device="gpu"
```

`--input` is the path of the dataset used for inference.

`--output` is the directory to save the processed results. It has the same file structure as the input dataset. Each utterance in the dataset has a corresponding utterance embedding file in `*.npy` format (see the sketch below).

`--checkpoint_path` is the path of the checkpoint to use, extension not included.

`--pattern` is the wildcard pattern to filter audio files for inference, defaults to `*.wav`.

`--device` and `--opts` have the same meaning as in the training script.
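
As a sketch of how the generated embeddings might be used downstream (the paths below are hypothetical; only NumPy is assumed), two `*.npy` embeddings can be compared with cosine similarity:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical output paths that mirror the input dataset's structure.
emb_a = np.load("embeds/dataset01_speaker01/utterance01.npy")
emb_b = np.load("embeds/dataset01_speaker02/utterance01.npy")
print(f"similarity: {cosine_similarity(emb_a, emb_b):.3f}")  # higher means more likely the same speaker
```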

## References

1. [Generalized End-to-end Loss for Speaker Verification](https://arxiv.org/pdf/1710.10467.pdf)
2. [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf)
124 changes: 124 additions & 0 deletions examples/ge2e/README_cn.md
@@ -0,0 +1,124 @@
# Speaker Encoder

This experiment trains a speaker encoder on multispeaker datasets, with speaker verification as the task. It is part of the experiment on transfer learning from speaker verification to multispeaker text-to-speech synthesis, which can be found at [tacotron2_aishell3](../tacotron2_aishell3). The trained model is used to extract utterance embeddings from audio.

## Model

The model used in this experiment is the text-independent speaker encoder from [GENERALIZED END-TO-END LOSS FOR SPEAKER VERIFICATION](https://arxiv.org/pdf/1710.10467.pdf), trained with the GE2E softmax loss.

## File Structure

```text
ge2e
├── README_cn.md
├── audio_processor.py
├── config.py
├── dataset_processors.py
├── inference.py
├── preprocess.py
├── random_cycle.py
├── speaker_verification_dataset.py
└── train.py
```

## Download Datasets

This experiment supports the Librispeech-other-500, VoxCeleb1, VoxCeleb2, Aidatatang-200zh, and magicdata datasets, which can be downloaded from their corresponding pages.

1. Librispeech/train-other-500

   An English multispeaker dataset, [download link](https://www.openslr.org/resources/12/train-other-500.tar.gz); only the train-other-500 subset is used in our experiments.

2. VoxCeleb1

   An English multispeaker dataset, [download link](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html); download the four Audio Files archives Dev A to Dev D, then combine and extract them.

3. VoxCeleb2

   An English multispeaker dataset, [download link](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html); download the eight Audio Files archives Dev A to Dev H, then combine and extract them.

4. Aidatatang-200zh

   A Mandarin Chinese multispeaker dataset, [download link](https://www.openslr.org/62/).

5. magicdata

   A Mandarin Chinese multispeaker dataset, [download link](https://www.openslr.org/68/).

If you want to use other datasets, you can also download and preprocess them yourself, as long as they meet the requirements described below.

## Preprocess Datasets

The datasets used for training are multispeaker datasets; the transcriptions are not used. To enlarge the amount of data, multiple datasets can be merged into one during training. The processed files are organized as shown below: the spectrogram of each utterance is stored in `.npy` format, in a two-level speaker-utterance directory structure. Because datasets are merged, the dataset name is prepended to each speaker id to avoid speaker id conflicts.

```text
dataset_root
├── dataset01_speaker01/
│   ├── utterance01.npy
│   ├── utterance02.npy
│   └── utterance03.npy
├── dataset01_speaker02/
│   ├── utterance01.npy
│   ├── utterance02.npy
│   └── utterance03.npy
├── dataset02_speaker01/
│   ├── utterance01.npy
│   ├── utterance02.npy
│   └── utterance03.npy
└── dataset02_speaker02/
   ├── utterance01.npy
   ├── utterance02.npy
   └── utterance03.npy
```

Run the preprocessing script:

```bash
python preprocess.py --datasets_root=<datasets_root> --output_dir=<output_dir> --dataset_names=<dataset_names>
```

`--datasets_root` is the directory that contains the raw datasets; `--output_dir` is the output directory for the merged, processed datasets; `--dataset_names` gives the names of the datasets. Multiple dataset names can be separated by commas, e.g. 'librispeech_other, voxceleb1'. Currently supported datasets are librispeech_other, voxceleb1, voxceleb2, aidatatang_200zh, and magicdata.

## Training

When data processing is done, train the model with the following script:

```bash
python train.py --data=<data_path> --output=<output> --device="gpu" --nprocs=1
```

- `--data` is the path to the processed dataset.
- `--output` is the directory to save training results, usually a subdirectory of `runs`. It contains the visualdl log files, the text logs, a backup of the running config, and a `checkpoints` directory that holds the parameter files and optimizer state files. If the given `--output` path contains results from previous training, the most recent parameter file and optimizer state file are loaded automatically before training.
- `--device` is the device to run on; 'cpu' and 'gpu' are currently supported.
- `--nprocs` is the number of training processes. Multi-process training is currently supported only with 'gpu' as the device. The `CUDA_VISIBLE_DEVICES` environment variable can be used to specify which GPUs are visible.

There are a few more options:

- `--config` is a `.yaml` file that overrides the default config (the defaults can be found in `config.py`).
- `--opts` further overrides the config from the command line. It must be the last option passed, given as multiple space-separated KEY VALUE pairs.
- `--checkpoint_path` specifies the checkpoint to resume from, without its extension. The parameter file (`.pdparams`) and optimizer state file (`.pdopt`) with the same name are loaded to resume training. This option takes priority over auto-resuming from the `--output` directory.

## Pretrained Model

The pretrained model was trained on Librispeech-other-500 and voxceleb1 for 1560k steps, and then on aidatatang_200zh and magicdata up to 3000k steps.

Download link: [ge2e_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/ge2e_ckpt_0.3.zip)

## Inference

Use the trained model to generate an embedding for every utterance in a dataset.

```bash
python inference.py --input=<input> --output=<output> --checkpoint_path=<checkpoint_path> --device="gpu"
```

- `--input` is the path of the dataset to process.
- `--output` is the directory for the results. It keeps the same folder structure as `--input`; for each audio file in the input, there is a `*.npy` file with the same name containing the utterance embedding extracted from that audio.
- `--checkpoint_path` is the path of the parameter file used for inference, without its extension.
- `--pattern` is the wildcard pattern used to select the audio files to process, defaulting to `*.wav`.
- `--device` and `--opts` have the same meaning as in the training script.

## References

1. [GENERALIZED END-TO-END LOSS FOR SPEAKER VERIFICATION](https://arxiv.org/pdf/1710.10467.pdf)
2. [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf)

