An expressive voice conversion model capable of cross-speaker style transfer, improved with self-generated synthetic expressive data.
- Mel-spectrogram-based, for lightweight training and explicit duration control
- BigVGAN V2 generator
- Large-scale training for zero-shot voice conversion
- VITS2 (https://github.com/p0p4k/vits2_pytorch/)
- NVIDIA BigVGAN (https://github.com/NVIDIA/BigVGAN)
- Speaker Normalized Affine Coupling layer (SNAC) (https://github.com/hcy71o/SNAC)
- Feature preparation and cosine-similarity-based speaker GRL (gradient reversal layer) (https://github.com/PlayVoice/whisper-vits-svc)
- F0 estimation with torchcrepe (https://github.com/maxrmorrison/torchcrepe)
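
As a rough illustration of the cosine-similarity-based speaker GRL credited above, here is a minimal PyTorch sketch of a gradient reversal layer combined with a cosine-similarity speaker loss. It is not taken from this repository or from whisper-vits-svc: the names `GradientReversal` and `speaker_grl_loss`, the tensor shapes, and the `scale` parameter are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; negates and scales gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # One gradient per forward input; `scale` is a plain float, so it gets None.
        return -ctx.scale * grad_output, None


def speaker_grl_loss(content, speaker_emb, classifier, scale=1.0):
    """Cosine-similarity loss between predicted and reference speaker embeddings.

    content:     (batch, dim) pooled content features (hypothetical shape)
    speaker_emb: (batch, emb_dim) reference speaker embeddings (hypothetical shape)
    classifier:  module mapping content features to emb_dim
    """
    reversed_content = GradientReversal.apply(content, scale)
    predicted = classifier(reversed_content)
    # The classifier minimizes 1 - cos, while the reversed gradient pushes
    # the upstream content encoder to discard speaker information.
    return (1.0 - F.cosine_similarity(predicted, speaker_emb, dim=-1)).mean()
```

In a setup like this, the loss would typically be added to the main training objective, so that the speaker classifier improves while the content encoder receives the opposite gradient and becomes less speaker-dependent.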