Add non-streaming Zipformer recipe for KsponSpeech (#1664)

k2-fsa · Jun 24, 2024 · 6f102d3
1 parent 3059eb4
Showing 36 changed files with 4,212 additions and 4 deletions.
diff --git a/egs/ksponspeech/ASR/README.md b/egs/ksponspeech/ASR/README.md
@@ -27,6 +27,7 @@ There are various folders containing the name `transducer` in this folder. The f
 
 |                                          | Encoder              | Decoder            | Comment                                           |
 | ---------------------------------------- | -------------------- | ------------------ | ------------------------------------------------- |
-| `pruned_transducer_stateless7_streaming`                              | Streaming Zipformer   | Embedding + Conv1d | streaming version of pruned_transducer_stateless7                                 |
+| `pruned_transducer_stateless7_streaming` | Streaming Zipformer  | Embedding + Conv1d | streaming version of pruned_transducer_stateless7 |
+| `zipformer`                              | Upgraded Zipformer   | Embedding + Conv1d | The latest recipe                                 |
 
 The decoder in `transducer_stateless` is modified from the paper [Rnn-Transducer with Stateless Prediction Network](https://ieeexplore.ieee.org/document/9054419/). We place an additional Conv1d layer right after the input embedding layer.
diff --git a/egs/ksponspeech/ASR/RESULTS.md b/egs/ksponspeech/ASR/RESULTS.md
@@ -19,13 +19,13 @@ The CERs are:
 | fast beam search     | 320ms      | 10.21      | 11.04      | --epoch 30 --avg 9  | simulated streaming  |
 | fast beam search     | 320ms      | 10.25      | 11.08      | --epoch 30 --avg 9  | chunk-wise           |
 | modified beam search | 320ms      | 10.13      | 10.88      | --epoch 30 --avg 9  | simulated streaming  |
-| modified beam search | 320ms      | 10.1       | 10.93      | --epoch 30 --avg 9  | chunk-size           |
+| modified beam search | 320ms      | 10.1       | 10.93      | --epoch 30 --avg 9  | chunk-wise           |
 | greedy search        | 640ms      | 9.94       | 10.82      | --epoch 30 --avg 9  | simulated streaming  |
 | greedy search        | 640ms      | 10.04      | 10.85      | --epoch 30 --avg 9  | chunk-wise           |
 | fast beam search     | 640ms      | 10.01      | 10.81      | --epoch 30 --avg 9  | simulated streaming  |
 | fast beam search     | 640ms      | 10.04      | 10.7       | --epoch 30 --avg 9  | chunk-wise           |
 | modified beam search | 640ms      | 9.91       | 10.72      | --epoch 30 --avg 9  | simulated streaming  |
-| modified beam search | 640ms      | 9.92       | 10.72      | --epoch 30 --avg 9  | chunk-size           |
+| modified beam search | 640ms      | 9.92       | 10.72      | --epoch 30 --avg 9  | chunk-wise           |
 
 Note: `simulated streaming` indicates feeding the full utterance during decoding using `decode.py`,
 while `chunk-wise` indicates feeding a certain number of frames at a time using `streaming_decode.py`.
@@ -67,4 +67,50 @@ for m in greedy_search modified_beam_search fast_beam_search; do
     --decode-chunk-len 32 \
     --num-decode-streams 2000
 done
-```
+```
+
+### zipformer (Zipformer + pruned stateless transducer)
+
+#### [zipformer](./zipformer)
+
+Number of model parameters: 74,778,511, i.e., 74.78 M
+
+##### Training on KsponSpeech (with MUSAN)
+
+Model: [johnBamma/icefall-asr-ksponspeech-zipformer-2024-06-24](https://huggingface.co/johnBamma/icefall-asr-ksponspeech-zipformer-2024-06-24)
+
+The CERs are:
+
+| decoding method      | eval_clean | eval_other | comment             |
+|----------------------|------------|------------|---------------------|
+| greedy search        | 10.60      | 11.56      | --epoch 30 --avg 9  |
+| fast beam search     | 10.59      | 11.54      | --epoch 30 --avg 9  |
+| modified beam search | 10.35      | 11.35      | --epoch 30 --avg 9  |
+
+The training command is:
+
+```bash
+./zipformer/train.py \
+    --world-size 4 \
+    --num-epochs 30 \
+    --start-epoch 1 \
+    --use-fp16 1 \
+    --exp-dir zipformer/exp \
+    --max-duration 750 \
+    --enable-musan True \
+    --base-lr 0.035
+```
+
+NOTICE: I decreased `base_lr` from 0.045 (the default) to 0.035 because training failed with `RuntimeError: grad_scale is too small`.
+
+The decoding command is:
+
+```bash
+for m in greedy_search fast_beam_search modified_beam_search; do
+    ./zipformer/decode.py \
+        --epoch 30 \
+        --avg 9 \
+        --exp-dir zipformer/exp \
+        --decoding-method $m
+done
+```
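
Review note: the tables in `RESULTS.md` report CER (character error rate) in percent. As a reading aid, here is a minimal sketch of how such a figure is typically computed: the character-level edit distance between reference and hypothesis, summed over utterances and divided by the total number of reference characters. The function names `edit_distance` and `cer` and the `ignore_spaces` option are hypothetical and are not taken from this commit; the recipe's own scoring code may normalize text differently.

```python
from typing import Iterable


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # deletion
                    curr[j - 1] + 1,         # insertion
                    prev[j - 1] + (r != h),  # substitution, or a free match
                )
            )
        prev = curr
    return prev[-1]


def cer(refs: Iterable[str], hyps: Iterable[str], ignore_spaces: bool = True) -> float:
    """Corpus-level character error rate in percent.

    ignore_spaces is an assumption made for this sketch; whether spaces
    count as characters depends on the scoring convention in use.
    """
    errors = 0
    total = 0
    for ref, hyp in zip(refs, hyps):
        if ignore_spaces:
            ref = ref.replace(" ", "")
            hyp = hyp.replace(" ", "")
        errors += edit_distance(ref, hyp)
        total += len(ref)
    if total == 0:
        raise ValueError("reference text is empty")
    return 100.0 * errors / total
```

For example, `cer(["안녕하세요"], ["안녕하세오"])` counts one substitution against five reference characters. Errors are pooled over the whole test set before dividing, so long utterances weigh more than short ones.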
diff --git a/egs/ksponspeech/ASR/zipformer/README.md b/egs/ksponspeech/ASR/zipformer/README.md
@@ -0,0 +1 @@
+This recipe implements the Zipformer model.
diff --git a/egs/ksponspeech/ASR/zipformer/asr_datamodule.py b/egs/ksponspeech/ASR/zipformer/asr_datamodule.py
@@ -0,0 +1 @@
+../pruned_transducer_stateless7_streaming/asr_datamodule.py
diff --git a/egs/ksponspeech/ASR/zipformer/beam_search.py b/egs/ksponspeech/ASR/zipformer/beam_search.py
@@ -0,0 +1 @@
+../../../librispeech/ASR/zipformer/beam_search.py