The baseline system for the Magichub Code-Switching ASR Challenge is developed with ETEH.
For the model architecture, we use the Transformer. A 2-layer convolutional neural network (CNN) serves as the front-end; each CNN layer has 320 filters with 3x3 kernels and 2x2 stride. The self-attention encoder has 17 layers and the decoder has 6 layers. All sub-layers, as well as the embedding layers, produce outputs of dimension 320. The multi-head attention networks use 8 heads, and the inner dimension of the position-wise feed-forward networks is 2,048. All ASR models are trained with a batch size of 512 using the Adam optimizer, with a gradient clipping norm of 5, 25,000 warm-up steps, and the Noam learning rate decay scheme. We train the ASR model for 30 epochs.
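The hyper-parameters above are summarized in the sketch below. The dictionary keys are illustrative only and do not correspond to the actual ETEH configuration file format.

```python
# Hypothetical summary of the baseline hyper-parameters (key names are
# illustrative, not the real ETEH config keys).
baseline_config = {
    # 2-layer CNN front-end: 320 filters, 3x3 kernels, 2x2 stride per layer
    "frontend": {"layers": 2, "channels": 320, "kernel": (3, 3), "stride": (2, 2)},
    # Transformer encoder/decoder
    "encoder_layers": 17,
    "decoder_layers": 6,
    "attention_dim": 320,      # output dim of all sub-layers and embeddings
    "attention_heads": 8,
    "feedforward_dim": 2048,   # inner dim of position-wise feed-forward nets
    # Optimization
    "batch_size": 512,
    "optimizer": "adam",
    "grad_clip_norm": 5,
    "warmup_steps": 25000,
    "lr_schedule": "noam",
    "epochs": 30,
}
```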
We combine the MagicData-RAMC Train set and the TAL_CSASR Train set as the training data for the baseline ASR model. The input acoustic features are 83-dimensional, consisting of 80-dimensional filter banks and 3-dimensional pitch features, extracted with a 25 ms Hamming window shifted every 10 ms. We DO NOT apply CMVN to the acoustic features.
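The sketch below shows how the 80-dimensional filter-bank part of these features can be reproduced with torchaudio's Kaldi-compatible frontend; it is a minimal illustration, not the exact extraction pipeline of the baseline, and the 3-dimensional Kaldi pitch features are not reproduced here. The wav path is a placeholder.

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# Placeholder input file.
waveform, sample_rate = torchaudio.load("example.wav")

# 80-dim fbank with a 25 ms Hamming window and 10 ms shift.
fbank = kaldi.fbank(
    waveform,
    sample_frequency=sample_rate,
    num_mel_bins=80,
    frame_length=25.0,
    frame_shift=10.0,
    window_type="hamming",
)

# No CMVN is applied. In the baseline, the fbank features are concatenated
# with 3-dim pitch features to form the 83-dim input.
print(fbank.shape)  # (num_frames, 80)
```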
The ASR model predicts subword units obtained with byte pair encoding (BPE) for English and Chinese characters for Mandarin as its output targets. There are 5,276 output targets in total: 1,007 English subword units (including special symbols) and 4,269 Chinese characters.
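A rough sketch of this mixed tokenization scheme is given below: Mandarin text is split into single characters while English segments are passed through a BPE model. The BPE model path and the segmentation rules are assumptions for illustration only, not the actual recipe used to build the 5,276-unit vocabulary.

```python
import re
import sentencepiece as spm

# Hypothetical English BPE model path.
sp = spm.SentencePieceProcessor(model_file="english_bpe.model")

def tokenize(text: str) -> list[str]:
    tokens = []
    # Split the transcript into CJK runs and non-CJK runs.
    for chunk in re.findall(r"[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+", text):
        if re.match(r"[\u4e00-\u9fff]", chunk):
            tokens.extend(list(chunk))  # one token per Chinese character
        elif chunk.strip():
            tokens.extend(sp.encode(chunk.strip(), out_type=str))  # English BPE subwords
    return tokens

print(tokenize("今天我们讨论 machine learning 的应用"))
```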
Dev: https://magichub.com/datasets/dev-set-of-chinese-english-code-mixing-conversational-speech-corpus/
Test: https://magichub.com/datasets/chinese-english-code-mixing-conversational-speech-corpus/
Please refer to "dev_data_preprocess.sh" for details.
Please refer to "dev_scoring_sclite.sh" for details.
Please refer to "test_data_preprocess.sh" for details. When submitting your final hyp file for scoring, please make sure that the utterence ID format is same as the one in "test/ref_example.gb.txt" generated by "test_data_preprocess.sh".
dev MER: 29.2%
test MER: 26.5%
Other scoring details:
```
,-----------------------------------------------------------------------.
|        exp/talcs_magic_160/eteh_baseline/decode/hyp.dev.gb.txt        |
|-----------------------------------------------------------------------|
| SPKR   |  # Snt   # Chr  |  Corr    Sub    Del    Ins    Err   S.Err  |
|--------+-----------------+--------------------------------------------|
| g00    |  4456    34286  |  75.7   21.6    2.7    4.9   29.2   72.2   |
|=======================================================================|
| Sum/Avg|  4456    34286  |  75.7   21.6    2.7    4.9   29.2   72.2   |
|=======================================================================|
| Mean   | 4456.0  34286.0 |  75.7   21.6    2.7    4.9   29.2   72.2   |
| S.D.   |    0.0      0.0 |   0.0    0.0    0.0    0.0    0.0    0.0   |
| Median | 4456.0  34286.0 |  75.7   21.6    2.7    4.9   29.2   72.2   |
`-----------------------------------------------------------------------'

,--------------------------------------------------------------------.
|      exp/talcs_magic_160/eteh_baseline/decode/hyp.test.gb.txt      |
|--------------------------------------------------------------------|
| SPKR   |  # Snt   # Chr  |  Corr    Sub    Del    Ins    Err  S.Err |
|--------+-----------------+-----------------------------------------|
| g00    | 11243    77302  |  77.3   20.2    2.5    3.8   26.5  63.5 |
|====================================================================|
| Sum/Avg| 11243    77302  |  77.3   20.2    2.5    3.8   26.5  63.5 |
|====================================================================|
| Mean   |11243.0  77302.0 |  77.3   20.2    2.5    3.8   26.5  63.5 |
| S.D.   |    0.0      0.0 |   0.0    0.0    0.0    0.0    0.0   0.0 |
| Median |11243.0  77302.0 |  77.3   20.2    2.5    3.8   26.5  63.5 |
`--------------------------------------------------------------------'
```
If you have any questions, please contact us: you can open an issue on GitHub or email us.