3D-Speaker

3D-Speaker is an open-source toolkit for single- and multi-modal speaker verification, speaker recognition, and speaker diarization. All pretrained models are accessible on ModelScope. Furthermore, we release a large-scale speech corpus, also named 3D-Speaker, to facilitate research on speech representation disentanglement.

3D-Speaker Dataset

Dataset introduction and download address: 3D-Speaker
Paper address: https://arxiv.org/pdf/2306.15354.pdf

Quickstart

Install 3D-Speaker

git clone https://github.com/alibaba-damo-academy/3D-Speaker.git && cd 3D-Speaker
conda create -n 3D-Speaker python=3.8
conda activate 3D-Speaker
pip install -r requirements.txt

Running experiments

# Speaker verification: ERes2Net on 3D-Speaker
cd egs/3dspeaker/sv-eres2net/
bash run.sh
# Speaker verification: CAM++ on 3D-Speaker
cd egs/3dspeaker/sv-cam++/
bash run.sh
# Self-supervised speaker verification: RDINO on 3D-Speaker
cd egs/3dspeaker/sv-rdino/
bash run.sh
# Speaker diarization:
cd egs/3dspeaker/speaker-diarization/
bash run.sh
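
The diarization recipe chains voice activity detection, speech segmentation, speaker embedding extraction, and speaker clustering. As a rough illustration of the final stage only (not the recipe's actual implementation), per-segment embeddings can be grouped with off-the-shelf agglomerative clustering; the file name and threshold below are assumptions:

# Hypothetical sketch of the clustering stage; the recipe itself performs
# VAD, segmentation and embedding extraction before this point.
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # scikit-learn >= 1.2

# One speaker embedding per detected speech segment, shape (num_segments, dim).
embeddings = np.load("segment_embeddings.npy")  # placeholder file name

# Merge segments by cosine distance; distance_threshold governs how many
# speakers emerge when the true count is unknown (0.5 is an assumed value).
labels = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.5,
    metric="cosine",
    linkage="average",
).fit_predict(embeddings)

for segment, speaker in enumerate(labels):
    print(f"segment {segment}: speaker {speaker}")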

Inference using pretrained models from ModelScope

All pretrained models are released on ModelScope.

# Install modelscope
pip install modelscope
# CAM++ trained on VoxCeleb
model_id=damo/speech_campplus_sv_en_voxceleb_16k
# CAM++ trained on 200k labeled speakers
model_id=damo/speech_campplus_sv_zh-cn_16k-common
# ERes2Net trained on VoxCeleb
model_id=damo/speech_eres2net_sv_en_voxceleb_16k
# Run CAM++ or ERes2Net inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
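
Alternatively, the same checkpoints can be driven through ModelScope's generic pipeline API. A minimal sketch, assuming the speaker-verification pipeline is available for the chosen model id (check the model card on ModelScope for exact usage and output format):

from modelscope.pipelines import pipeline

# Model id taken from the list above; the wav paths are placeholders.
sv_pipeline = pipeline(
    task='speaker-verification',
    model='damo/speech_campplus_sv_zh-cn_16k-common',
)
result = sv_pipeline(['speaker1_utt1.wav', 'speaker2_utt1.wav'])
print(result)  # similarity score / same-speaker decision, per the model card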

# RDINO trained on VoxCeleb
model_id=damo/speech_rdino_ecapa_tdnn_sv_en_voxceleb_16k
# Run RDINO inference
python speakerlab/bin/infer_sv_rdino.py --model_id $model_id --wavs $wav_path
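
Either way, verification reduces to comparing two embeddings. Assuming infer_sv.py and infer_sv_rdino.py save one embedding per input wav as a .npy file (check each script's output directory), a minimal scoring step looks like this; the file names and decision threshold are hypothetical:

import numpy as np

# Load two extracted embeddings (placeholder file names).
emb_a = np.load("embedding_a.npy").squeeze()
emb_b = np.load("embedding_b.npy").squeeze()

# Cosine similarity: higher means the two utterances are more likely
# to come from the same speaker.
score = float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
print(f"cosine similarity: {score:.3f}")

# The accept threshold is model- and data-dependent; tune it on a dev set.
same_speaker = score > 0.6  # 0.6 is an assumed starting point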

News

  • [2023.6] 3D-Speaker dataset and its corresponding benchmarks released, including ERes2Net, CAM++ and RDINO.
  • [2023.6] ERes2Net pretrained model released, trained on a Mandarin dataset of 200k labeled speakers.
  • [2023.6] Speaker diarization inference recipe released, comprising voice activity detection, speech segmentation, speaker embedding extraction, and speaker clustering.
  • [2023.5] ERes2Net training recipe on VoxCeleb released.
  • [2023.5] ERes2Net pretrained model released, trained on VoxCeleb.
  • [2023.4] RDINO training recipe on VoxCeleb released.
  • [2023.4] CAM++ pretrained model released, trained on a Mandarin dataset of 200k labeled speakers.
  • [2023.4] CAM++ training recipe on VoxCeleb released.

Coming soon

  • [2023.7] Releasing CAM++, ERes2Net-Base and ERes2Net-Large pretrained models trained on 3D-Speaker.
  • [2023.7] Releasing a Dialogue Detection model and a Semantic Speaker Change Detection model for speaker diarization.
  • [2023.7] Releasing a CAM++ model for language recognition.

Contact

If you have any comments or questions about 3D-Speaker, please contact us by

  • email: {zsq174630, chenyafeng.cyf, tongmu.wh, shuli.cly}@alibaba-inc.com

License

3D-Speaker is released under the Apache License 2.0.

Acknowledgements

3D-Speaker contains third-party components and code modified from several open-source repositories.

Citations

If you are using the 3D-Speaker dataset in your research, please cite:

@article{zheng20233dspeaker,
  title={3D-Speaker: A Large-Scale Multi-Device, Multi-Distance, and Multi-Dialect Corpus for Speech Representation Disentanglement},
  author={Zheng, Siqi and Cheng, Luyao and Chen, Yafeng and Wang, Hui and Chen, Qian},
  journal={arXiv preprint arXiv:2306.15354},
  url={https://arxiv.org/pdf/2306.15354.pdf},
  year={2023}
}

If you are using the RDINO model in your research, please cite:

@inproceedings{chen2023pushing,
  title={Pushing the limits of self-supervised speaker verification using regularized distillation framework},
  author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and Chen, Qian},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}

If you are using the CAM++ model in your research, please cite:

@inproceedings{campplus,
  title={CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking},
  author={Wang, Hui and Zheng, Siqi and Chen, Yafeng and Cheng, Luyao and Chen, Qian},
  booktitle={Interspeech 2023},
  year={2023},
  organization={ISCA}
}

If you are using the ERes2Net model in your research, please cite:

@inproceedings{eres2net,
  title={An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification},
  author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and Chen, Qian and Qi, Jiajun},
  booktitle={Interspeech 2023},
  year={2023},
  organization={ISCA}
}
