MagicData-RAMC Dataset and Baseline

Description

The MagicData-RAMC corpus contains 180 hours of conversational speech data recorded from native speakers of Mandarin Chinese over mobile phones with a sampling rate of 16 kHz. The dialogs in MagicData-RAMC are classified into 15 diversified domains and tagged with topic labels, ranging from science and technology to ordinary life. Accurate transcription and precise speaker voice activity timestamps are manually labeled for each sample. Speakers' detailed information is also provided. As a Mandarin speech dataset designed for dialog scenarios with high quality and rich annotations, MagicData-RAMC enriches the data diversity in the Mandarin speech community and allows extensive research on a series of speech-related tasks, including automatic speech recognition, speaker diarization, topic detection, keyword search, text-to-speech, etc. We also conduct several relevant tasks and provide experimental results to help evaluate the dataset.

Download

The dataset can be downloaded from OpenSLR.

Speaker Diarization Task

For the speaker diarization track, we use VBHMM x-vectors (VBx) trained on VoxCeleb (openslr-49) and the CN-Celeb corpus (openslr-82). X-vector embeddings are extracted with a ResNet; agglomerative hierarchical clustering followed by variational Bayes HMM resegmentation then produces the final result.
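As a rough illustration of the clustering step only, the sketch below runs agglomerative hierarchical clustering on toy 2-D vectors that stand in for x-vectors. The cosine similarity, average linkage, stopping threshold, and toy data are all assumptions made for this example; it is not the VBx implementation, and the variational Bayes HMM resegmentation is omitted.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ahc(embeddings, threshold):
    """Average-linkage AHC: merge the most similar pair of clusters
    until the best average similarity drops below `threshold`."""
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best_pair, best_sim = None, -2.0
        for p, q in combinations(range(len(clusters)), 2):
            sims = [cosine(embeddings[i], embeddings[j])
                    for i in clusters[p] for j in clusters[q]]
            avg = sum(sims) / len(sims)
            if avg > best_sim:
                best_pair, best_sim = (p, q), avg
        if best_sim < threshold:
            break
        p, q = best_pair  # p < q, so deleting q keeps index p valid
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    # Turn the cluster membership lists into one label per embedding.
    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical toy "x-vectors": three point roughly one way, two another.
toy = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9), (1.0, 0.0)]
labels = ahc(toy, threshold=0.5)
print(labels)
```

In a real system each vector would be a high-dimensional embedding of a short speech segment, and the resulting labels would only initialize the resegmentation stage.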

Data Preparation:

Run prepare_magicdata_160h.py in the scripts folder.

Testing & Scoring:

./run.sh

For scoring, the DIHARD Scoring Tools can be used to calculate DER, JER, and related metrics. This repo is already included as a git submodule of our project.

git submodule update --init --recursive
cd sd/dscore
python score.py --collar 0.25 -r ${groundtruth_rttm} -s ${predicted_rttm}
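To make the DER figure concrete, here is a minimal frame-based sketch that reads RTTM `SPEAKER` lines and computes missed speech, false alarm, and speaker confusion against the reference. It is a simplification under stated assumptions, not the scoring tool's implementation: there is no collar, overlapping speech is ignored, and the speaker mapping is found by exhaustive search, which is only practical for a handful of speakers. The function names and the toy RTTM content are hypothetical.

```python
from itertools import permutations


def parse_rttm(lines):
    """Return (onset, offset, speaker) tuples from RTTM SPEAKER lines."""
    segments = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue
        onset, duration, speaker = float(fields[3]), float(fields[4]), fields[7]
        segments.append((onset, onset + duration, speaker))
    return segments


def to_frames(segments, n_frames, step):
    """One speaker label (or None) per frame; on overlap the last segment wins."""
    frames = [None] * n_frames
    for onset, offset, speaker in segments:
        for i in range(round(onset / step), min(round(offset / step), n_frames)):
            frames[i] = speaker
    return frames


def simple_der(ref_lines, hyp_lines, step=0.01):
    """Simplified DER in percent: (miss + false alarm + confusion) / reference speech."""
    ref, hyp = parse_rttm(ref_lines), parse_rttm(hyp_lines)
    n_frames = round(max(offset for _, offset, _ in ref + hyp) / step)
    pairs = list(zip(to_frames(ref, n_frames, step), to_frames(hyp, n_frames, step)))
    total = sum(r is not None for r, _ in pairs)
    miss = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)
    ref_spk = sorted({r for r, _ in pairs if r is not None})
    hyp_spk = sorted({h for _, h in pairs if h is not None})
    # Try every one-to-one mapping of hypothesis speakers onto reference
    # speakers (None = unmapped) and keep the one with the least confusion.
    candidates = ref_spk + [None] * len(hyp_spk)
    confusion = min(
        sum(r is not None and h is not None and mapping[h] != r for r, h in pairs)
        for perm in permutations(candidates, len(hyp_spk))
        for mapping in [dict(zip(hyp_spk, perm))]
    )
    return 100.0 * (miss + false_alarm + confusion) / total


# Hypothetical toy recording: the system places the speaker change 1 s early.
ref_rttm = [
    "SPEAKER rec1 1 0.00 5.00 <NA> <NA> A <NA> <NA>",
    "SPEAKER rec1 1 5.00 5.00 <NA> <NA> B <NA> <NA>",
]
hyp_rttm = [
    "SPEAKER rec1 1 0.00 4.00 <NA> <NA> s1 <NA> <NA>",
    "SPEAKER rec1 1 4.00 6.00 <NA> <NA> s2 <NA> <NA>",
]
der = simple_der(ref_rttm, hyp_rttm)
print(f"DER = {der:.2f}%")
```

Numbers produced this way will not match the official scores exactly, since collar and overlap handling both change the result.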

We formulate CDER (Conversational Diarization Error Rate) to evaluate speaker diarization performance at the sentence level in conversational scenarios. Our CDER-Metric tool can be used to calculate it.

cd sd/CDER-Metric
python score.py -r ${groundtruth_rttm} -s ${predicted_rttm}

Result:

Method	Subset	DER (collar 0.25)	DER (collar 0)	JER	CDER
VBx	MagicData-RAMC Dev	5.57	17.48	45.73	26.9
VBx	MagicData-RAMC Test	7.96	19.90	47.49	28.2

Note that we will provide the CSSD-Test set on Sep 8, 2022. All participants should submit results on the CSSD-Test set before Sep 10, 2022; we will score and rank the submitted results. All papers may use MagicData-RAMC Dev and MagicData-RAMC Test to evaluate proposed methods.

ASR Task

For the ASR track, we use the Conformer implemented in ESPnet for speech recognition. The 160h development set is divided into two parts: 140h of audio recordings are merged with the MAGICDATA Mandarin Chinese Read Speech Corpus (openslr-68) for training, while the remaining 20h are reserved for testing.

Data Preparation:

Run prepare_magicdata_160h.py and prepare_magicdata_750h.py in the scripts folder.

Network Training:

./run.sh

Decoding & Scoring:

For scoring, the sclite tool shipped with ESPnet can be used to obtain the WER.

sclite -r ${ref_path} trn -h ${output_path} trn -i rm -o all stdout > ${result_path}
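The sketch below illustrates the kind of computation behind such a score: an edit-distance error rate over reference/hypothesis pairs, plus a helper that formats a line in the `trn` layout (transcript followed by the utterance ID in parentheses). Scoring at the character level is an assumption here, chosen because it is common for Mandarin; the helper names and example sentences are hypothetical, and this is not a replacement for sclite, which also produces alignments and detailed reports.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def to_trn(utt_id, text):
    """Format one utterance as a trn line with space-separated characters."""
    return " ".join(text.replace(" ", "")) + f" ({utt_id})"


def error_rate(pairs):
    """Error rate in percent over (reference, hypothesis) string pairs."""
    errors = sum(edit_distance(list(r), list(h)) for r, h in pairs)
    total = sum(len(r) for r, _ in pairs)
    return 100.0 * errors / total


# Hypothetical examples: one deleted character and one substituted character.
pairs = [("今天天气很好", "今天天气好"), ("你好吗", "你好吧")]
rate = error_rate(pairs)
print(to_trn("utt1", "你好吗"))
print(f"error rate = {rate:.2f}%")
```

Both files passed to sclite must use the same utterance IDs so that each reference line can be paired with its hypothesis.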

Result:

Method	Subset	Err
Conformer	MagicData-RAMC Dev	16.5
Conformer	MagicData-RAMC Test	19.1

Reference Resource

Citation

If you use the MagicData-RAMC dataset in your research, please consider citing our paper:

@article{yang2022open,
title={Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset},
author={Yang, Zehui and Chen, Yifan and Luo, Lei and Yang, Runyan and Ye, Lingxuan and Cheng, Gaofeng and Xu, Ji and Jin, Yaohui and Zhang, Qingqing and Zhang, Pengyuan and others},
journal={arXiv preprint arXiv:2203.16844},
year={2022}
}

Contact

If you have any questions, please contact us by opening an issue on GitHub or by email.

Acknowledgment

We thank @MG623 for finding label mistakes in CTS-CN-F2F-2019-11-15-1422 (detail). We thank @kli017 for pointing out the problem in the data preparation stage (detail).

Reference

[1] Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N.E.Y., Heymann, J., Wiesner, M., Chen, N. and Renduchintala, A., 2018. Espnet: End-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015.

[2] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P. and Silovsky, J., 2011. The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society.

[3] Gulati, A., Qin, J., Chiu, C.C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y. and Pang, R., 2020. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100.

[4] Watanabe, S., Hori, T., Kim, S., Hershey, J.R. and Hayashi, T., 2017. Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8), pp.1240-1253.

[5] Landini, F., Wang, S., Diez, M., Burget, L., Matějka, P., Žmolíková, K., Mošner, L., Silnova, A., Plchot, O., Novotný, O. and Zeinali, H., 2020, May. BUT system for the second DIHARD speech diarization challenge. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6529-6533). IEEE.

[6] Diez, M., Burget, L., Landini, F. and Černocký, J., 2019. Analysis of speaker diarization based on Bayesian HMM with eigenvoice priors. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, pp.355-368.

[7] Ryant, N., Church, K., Cieri, C., Du, J., Ganapathy, S. and Liberman, M., 2020. Third DIHARD challenge evaluation plan. arXiv preprint arXiv:2006.05815.

[8] Watanabe, S., Mandel, M., Barker, J., Vincent, E., Arora, A., Chang, X., Khudanpur, S., Manohar, V., Povey, D., Raj, D. and Snyder, D., 2020. CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings. arXiv preprint arXiv:2004.09249.

[9] Fu, Y., Cheng, L., Lv, S., Jv, Y., Kong, Y., Chen, Z., Hu, Y., Xie, L., Wu, J., Bu, H. and Xu, X., 2021. AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario. arXiv preprint arXiv:2104.03603.
