From 13daf73468da70f5db49c928969fce6c6edc041a Mon Sep 17 00:00:00 2001
From: Xiaoyu Yang <45973641+marcoyang1998@users.noreply.github.com>
Date: Wed, 21 Feb 2024 18:06:27 +0800
Subject: [PATCH] docs for finetune zipformer (#1509)

---
 .../from_supervised/finetune_zipformer.rst | 140 ++++++++++++++++++
 docs/source/recipes/Finetune/index.rst     |  15 ++
 docs/source/recipes/index.rst              |   1 +
 3 files changed, 156 insertions(+)
 create mode 100644 docs/source/recipes/Finetune/from_supervised/finetune_zipformer.rst
 create mode 100644 docs/source/recipes/Finetune/index.rst

diff --git a/docs/source/recipes/Finetune/from_supervised/finetune_zipformer.rst b/docs/source/recipes/Finetune/from_supervised/finetune_zipformer.rst
new file mode 100644
index 0000000000..7ca4eb811c
--- /dev/null
+++ b/docs/source/recipes/Finetune/from_supervised/finetune_zipformer.rst
@@ -0,0 +1,140 @@
Finetune from a supervised pre-trained Zipformer model
======================================================

This tutorial shows you how to fine-tune a supervised pre-trained **Zipformer**
transducer model on a new dataset.

.. HINT::

   We assume you have read the page :ref:`install icefall` and have set up
   the environment for ``icefall``.

.. HINT::

   We recommend using one or more GPUs to run this recipe.


For illustration purposes, we fine-tune the Zipformer transducer model
pre-trained on `LibriSpeech`_ on the small subset of `GigaSpeech`_. You can use
your own data for fine-tuning if you create a manifest for your new dataset.

Data preparation
----------------

Please follow the instructions in the `GigaSpeech recipe `_
to prepare the fine-tuning data used in this tutorial. Only the small subset of
GigaSpeech is required.


Model preparation
-----------------

We use the Zipformer model trained on the full LibriSpeech dataset (960 hours) as
the initialization. The checkpoint of the model can be downloaded via the
following commands:

.. code-block:: bash

   $ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15
   $ cd icefall-asr-librispeech-zipformer-2023-05-15/exp
   $ git lfs pull --include "pretrained.pt"
   $ ln -s pretrained.pt epoch-99.pt
   $ cd ../data/lang_bpe_500
   $ git lfs pull --include bpe.model
   $ cd ../../..

Before fine-tuning, let's test the model's WER on the new domain. The following
command performs decoding on the GigaSpeech test sets:

.. code-block:: bash

   $ ./zipformer/decode_gigaspeech.py \
       --epoch 99 \
       --avg 1 \
       --exp-dir icefall-asr-librispeech-zipformer-2023-05-15/exp \
       --use-averaged-model 0 \
       --max-duration 1000 \
       --decoding-method greedy_search

You should see the following numbers:

.. code-block:: text

   For dev, WER of different settings are:
   greedy_search 20.06 best for dev

   For test, WER of different settings are:
   greedy_search 19.27 best for test


Fine-tune
---------

Since LibriSpeech and GigaSpeech are both English datasets, we can initialize the
whole Zipformer model with the checkpoint downloaded in the previous step
(otherwise, the stateless decoder and joiner should be initialized from scratch
because the output vocabularies would not match). The following command starts a
fine-tuning experiment:

.. code-block:: bash

   $ use_mux=0
   $ do_finetune=1

   $ ./zipformer/finetune.py \
       --world-size 2 \
       --num-epochs 20 \
       --start-epoch 1 \
       --exp-dir zipformer/exp_giga_finetune${do_finetune}_mux${use_mux} \
       --use-fp16 1 \
       --base-lr 0.0045 \
       --bpe-model data/lang_bpe_500/bpe.model \
       --do-finetune $do_finetune \
       --use-mux $use_mux \
       --master-port 13024 \
       --finetune-ckpt icefall-asr-librispeech-zipformer-2023-05-15/exp/pretrained.pt \
       --max-duration 1000

The following arguments are related to fine-tuning:

- ``--base-lr``

  The learning rate used for fine-tuning. We suggest setting a **small** learning
  rate for fine-tuning; otherwise the model may quickly forget its initialization.
  A reasonable value is around 1/10 of the original learning rate, i.e., 0.0045.

- ``--do-finetune``

  If True, do fine-tuning by initializing the model from a pre-trained checkpoint.
  **Note that if you want to resume your fine-tuning experiment from a certain
  epoch, you need to set this to False.**

- ``--finetune-ckpt``

  The path to the pre-trained checkpoint (used for initialization).

- ``--use-mux``

  If True, mix the fine-tuning data with the original training data using
  `CutSet.mux `_ (see the sketch after this list). This helps maintain the
  model's performance on the original domain if the original training data is
  available. **If you don't have the original training data, please set it to
  False.**
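
As an illustration of what ``CutSet.mux`` does, here is a minimal, hypothetical
sketch that uses `lhotse <https://github.com/lhotse-speech/lhotse>`_ directly.
The manifest paths and the equal weights are placeholders, not values produced
or required by this recipe:

.. code-block:: python

   from itertools import islice

   from lhotse import CutSet

   # Hypothetical manifests standing in for the original training data and the
   # fine-tuning data; substitute the cuts from your own data preparation.
   libri_cuts = CutSet.from_file("data/fbank/librispeech_cuts_train.jsonl.gz")
   giga_cuts = CutSet.from_file("data/fbank/gigaspeech_cuts_S.jsonl.gz")

   # CutSet.mux lazily interleaves cuts from its input sets; the weights control
   # how often each source is sampled, so the model keeps seeing the original
   # domain while it adapts to the new one.
   mixed_cuts = CutSet.mux(libri_cuts, giga_cuts, weights=[0.5, 0.5])

   # Peek at the first few cuts of the mixed stream.
   for cut in islice(mixed_cuts, 5):
       print(cut.id)

When ``--use-mux 1`` is passed, the recipe performs this mixing for you; the
sketch above is only meant to show the behaviour.
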
After fine-tuning, let's test the WERs. You can do this via the following command:

.. code-block:: bash

   $ use_mux=0
   $ do_finetune=1
   $ ./zipformer/decode_gigaspeech.py \
       --epoch 20 \
       --avg 10 \
       --exp-dir zipformer/exp_giga_finetune${do_finetune}_mux${use_mux} \
       --use-averaged-model 1 \
       --max-duration 1000 \
       --decoding-method greedy_search

You should see numbers similar to the ones below:

.. code-block:: text

   For dev, WER of different settings are:
   greedy_search 13.47 best for dev

   For test, WER of different settings are:
   greedy_search 13.66 best for test

Compared to the original checkpoint, the fine-tuned model achieves much lower
WERs on the GigaSpeech test sets.

diff --git a/docs/source/recipes/Finetune/index.rst b/docs/source/recipes/Finetune/index.rst
new file mode 100644
index 0000000000..e62b8980f8
--- /dev/null
+++ b/docs/source/recipes/Finetune/index.rst
@@ -0,0 +1,15 @@
Fine-tune a pre-trained model
=============================

After pre-training on publicly available datasets, the ASR model is already
capable of performing general speech recognition with relatively high accuracy.
However, the accuracy can still be low on domains that differ markedly from the
original training set. In this case, we can fine-tune the model with a small
amount of additional labelled data to improve the performance on new domains.


.. toctree::
   :maxdepth: 2
   :caption: Table of Contents

   from_supervised/finetune_zipformer

diff --git a/docs/source/recipes/index.rst b/docs/source/recipes/index.rst
index 8df61f0d08..52795d4527 100644
--- a/docs/source/recipes/index.rst
+++ b/docs/source/recipes/index.rst
@@ -17,3 +17,4 @@ We may add recipes for other tasks as well in the future.
    Streaming-ASR/index
    RNN-LM/index
    TTS/index
+   Finetune/index