There's a considerable mismatch in dataset characteristics between Constituicao and LJSpeech. The former's audio files are longer (20 s to 40 s), while the latter's rarely go beyond 10 s, and I'm not sure whether this plays nicely with FastSpeech 2's recipe. As a matter of fact, ESPnet's TTS recipe ignores audio longer than 20 s by default.
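To see how much of Constituicao a 20 s cutoff would throw away, a quick duration census helps. The following is a minimal sketch, assuming the corpus is a flat directory of PCM WAV files. The function names and directory layout are hypothetical, and the "strictly longer than" rule is an assumption rather than ESPnet's actual comparison.

```python
import wave
from pathlib import Path
from typing import List, Tuple


def wav_duration(path: Path) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def split_by_duration(
    wav_dir: str, max_seconds: float = 20.0
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Partition the *.wav files in wav_dir into (kept, dropped) lists of (name, seconds).

    Hypothetical rule: a file is dropped if it is strictly longer than
    max_seconds. This is not taken from ESPnet's own filtering code.
    """
    kept, dropped = [], []
    for path in sorted(Path(wav_dir).glob("*.wav")):
        dur = wav_duration(path)
        (kept if dur <= max_seconds else dropped).append((path.name, dur))
    return kept, dropped
```

Calling `split_by_duration("path/to/wavs")` (placeholder path) and comparing `len(dropped)` against the total gives the fraction of the corpus that a 20 s filter would discard.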
A possible way forward would be to re-segment Constituicao so that individual utterances are shorter. MFA has been finding silences (SILs) in the middle of sentences quite often; in fact, the speaker does pause between titles and at the ends of sentences. A voice activity detector (VAD) and a forced aligner (FA) would be of great help with that.
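One way to turn those mid-sentence pauses into segment boundaries is a greedy split at pause midpoints. The following is a minimal sketch under several assumptions. First, the alignment is available as a time-sorted list of `(start_s, end_s, label)` intervals, for example as read from an MFA TextGrid with a parser of your choice. Second, silences carry one of the labels in `sil_labels`; the default set is a guess, so check it against the actual TextGrids. Third, `max_seconds=10.0` (roughly LJSpeech-sized) and `min_pause=0.3` are illustrative values. The function name `resegment` is hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float, str]  # (start_s, end_s, label) from an alignment


def resegment(
    intervals: Sequence[Interval],
    max_seconds: float = 10.0,
    min_pause: float = 0.3,
    sil_labels: Tuple[str, ...] = ("", "sil", "sp"),  # assumed silence labels
) -> List[Tuple[float, float]]:
    """Greedily split one utterance at pauses so segments stay under max_seconds.

    Candidate cuts are the midpoints of interior silences lasting at least
    min_pause. Leading and trailing silences are never cut. A segment can
    still exceed max_seconds when no suitable pause exists inside it.
    """
    if not intervals:
        return []
    utt_start, utt_end = intervals[0][0], intervals[-1][1]
    cuts = [
        (s + e) / 2
        for s, e, label in intervals[1:-1]
        if label in sil_labels and e - s >= min_pause
    ]
    segments: List[Tuple[float, float]] = []
    seg_start = utt_start
    prev_cut = None  # latest candidate cut seen since seg_start
    # utt_end acts as a final sentinel "cut" that closes the last segment.
    for cut in cuts + [utt_end]:
        if cut - seg_start > max_seconds and prev_cut is not None:
            # Extending to this cut would be too long: close at the previous pause.
            segments.append((seg_start, prev_cut))
            seg_start = prev_cut
        prev_cut = cut
    segments.append((seg_start, utt_end))
    return segments
```

Cutting at the middle of a pause leaves a bit of silence on both sides of each new boundary. The resulting time spans would still need their transcripts split at the same word boundaries, which a word-level FA tier provides.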
plot_scripts.zip