This repository gathers our efforts to evaluate and compare current multi-speaker TTS systems using objective metrics.
We use the UTMOS model to predict the naturalness mean opinion score (nMOS). In the HierSpeech++ paper, the authors used the open-source version of UTMOS, and the reported human nMOS and UTMOS scores are closely aligned. Although UTMOS cannot be considered an absolute evaluation metric, it offers an easy way to compare models in terms of quality.
Following previous works, we evaluate pronunciation accuracy with an ASR model: the Whisper Large v3 model for English text and the Paraformer model for Chinese text. We also remove all punctuation from the text before computing WER/CER.
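As a sketch of this scoring step (not the repository's actual implementation), the WER after punctuation removal can be computed with a plain word-level edit distance:

```python
import string

def edit_distance(ref, hyp):
    # classic dynamic-programming Levenshtein distance over word lists
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    # strip punctuation and lowercase before scoring, as described above
    table = str.maketrans("", "", string.punctuation)
    ref = reference.translate(table).lower().split()
    hyp = hypothesis.translate(table).lower().split()
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

CER works the same way on character lists instead of word lists, which is why it is preferred for Chinese text.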
To measure the similarity between the synthesized voice and the original speaker, we compute the Speaker Encoder Cosine Similarity (SECS). Two speaker encoders are available for computing SECS: ERes2Net-large and WavLM-base-plus-sv.
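Whichever encoder is used, SECS reduces to the cosine similarity between the two embedding vectors it produces. A minimal sketch (the embeddings here stand in for real encoder outputs):

```python
import math

def secs(emb_a, emb_b):
    # cosine similarity between two speaker embeddings:
    # dot(a, b) / (||a|| * ||b||), in [-1, 1]; higher = more similar
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return dot / (norm_a * norm_b)
```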
We compute the mel cepstral distortion (MCD) between the predicted wav and the ground-truth wav as follows,
$$ \operatorname{MCD}\left(\mathbf{c}_p, \mathbf{c}_g\right)=\frac{10}{\ln 10} \sqrt{2 \sum_{k=1}^{M_c}\left[c_p(k)-c_g(k)\right]^2} $$
where $c_p(k)$ and $c_g(k)$ are the $k$-th mel cepstral coefficients of the predicted and ground-truth frames, and $M_c$ is the order of the mel cepstrum.
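A minimal sketch of the per-frame formula above, averaged over aligned frames (real pipelines first extract mel cepstra and time-align them, e.g. with DTW; that part is omitted here):

```python
import math

def mcd(c_pred, c_gt):
    # c_pred, c_gt: aligned frames of mel cepstral coefficients,
    # each frame a list of M_c coefficients (0th energy term excluded)
    total = 0.0
    for cp, cg in zip(c_pred, c_gt):
        dist = sum((p - g) ** 2 for p, g in zip(cp, cg))
        # per-frame MCD in dB: (10 / ln 10) * sqrt(2 * sum of squared diffs)
        total += (10.0 / math.log(10)) * math.sqrt(2.0 * dist)
    return total / len(c_pred)  # mean over frames
```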
We compute the perceptual evaluation of speech quality (PESQ) score via pypesq.
We compute the root mean square error (RMSE) of the F0 estimate as follows,
$$ \operatorname{RMSE}(\mathbf{f0}, \hat{\mathbf{f0}})=\sqrt{\frac{1}{N} \sum_{i=1}^{N}\left(\mathbf{f0}_{i}-\hat{\mathbf{f0}}_{i}\right)^2} $$
where $N$ is the number of frames, $\mathbf{f0}_{i}$ is the ground-truth F0 of the $i$-th frame, and $\hat{\mathbf{f0}}_{i}$ is the predicted F0.
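The formula above is a direct frame-wise RMSE; a minimal sketch (in practice it is typically restricted to frames voiced in both signals):

```python
import math

def f0_rmse(f0_ref, f0_pred):
    # RMSE over N aligned F0 values (Hz), per the equation above
    n = len(f0_ref)
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(f0_ref, f0_pred)) / n)
```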
```bash
pip install -r requirements.txt
bash run.sh
```
- ViSQOL score.
- Voiced/unvoiced (V/UV) errors.
- We borrow some code from Amphion for the computation of some evaluation metrics.