From 13daf73468da70f5db49c928969fce6c6edc041a Mon Sep 17 00:00:00 2001
From: Xiaoyu Yang <45973641+marcoyang1998@users.noreply.github.com>
Date: Wed, 21 Feb 2024 18:06:27 +0800
Subject: [PATCH] docs for finetune zipformer (#1509)

---
 .../from_supervised/finetune_zipformer.rst | 140 ++++++++++++++++++
 docs/source/recipes/Finetune/index.rst     |  15 ++
 docs/source/recipes/index.rst              |   1 +
 3 files changed, 156 insertions(+)
 create mode 100644 docs/source/recipes/Finetune/from_supervised/finetune_zipformer.rst
 create mode 100644 docs/source/recipes/Finetune/index.rst

diff --git a/docs/source/recipes/Finetune/from_supervised/finetune_zipformer.rst b/docs/source/recipes/Finetune/from_supervised/finetune_zipformer.rst
new file mode 100644
index 0000000000..7ca4eb811c
--- /dev/null
+++ b/docs/source/recipes/Finetune/from_supervised/finetune_zipformer.rst
@@ -0,0 +1,140 @@
Finetune from a supervised pre-trained Zipformer model
======================================================

This tutorial shows you how to fine-tune a supervised pre-trained **Zipformer**
transducer model on a new dataset.

.. HINT::

   We assume you have read the page :ref:`install icefall` and have set up
   the environment for ``icefall``.

.. HINT::

   We recommend using one or more GPUs to run this recipe.


For illustration purposes, we fine-tune the Zipformer transducer model
pre-trained on `LibriSpeech`_ on the small subset of `GigaSpeech`_. You can use
your own data for fine-tuning if you create a manifest for your new dataset.

Data preparation
----------------

Please follow the instructions in the `GigaSpeech recipe `_
to prepare the fine-tuning data used in this tutorial. Only the small subset of
GigaSpeech is required.


Model preparation
-----------------

We use the Zipformer model trained on the full LibriSpeech dataset (960 hours) as
the initialization. The checkpoint of the model can be downloaded via the
following commands:

.. code-block:: bash

   $ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15
   $ cd icefall-asr-librispeech-zipformer-2023-05-15/exp
   $ git lfs pull --include "pretrained.pt"
   $ ln -s pretrained.pt epoch-99.pt
   $ cd ../data/lang_bpe_500
   $ git lfs pull --include bpe.model
   $ cd ../../..

Before fine-tuning, let's test the model's WER on the new domain. The following
command performs decoding on the GigaSpeech test sets:

.. code-block:: bash

   $ ./zipformer/decode_gigaspeech.py \
       --epoch 99 \
       --avg 1 \
       --exp-dir icefall-asr-librispeech-zipformer-2023-05-15/exp \
       --use-averaged-model 0 \
       --max-duration 1000 \
       --decoding-method greedy_search

You should see the following numbers:

.. code-block:: text

   For dev, WER of different settings are:
   greedy_search 20.06 best for dev

   For test, WER of different settings are:
   greedy_search 19.27 best for test


Fine-tune
---------

Since LibriSpeech and GigaSpeech are both English datasets, we can initialize the
whole Zipformer model with the checkpoint downloaded in the previous step
(otherwise, the stateless decoder and joiner should be initialized from scratch
because the output vocabularies would not match). The following command starts a
fine-tuning experiment:

.. code-block:: bash

   $ use_mux=0
   $ do_finetune=1

   $ ./zipformer/finetune.py \
       --world-size 2 \
       --num-epochs 20 \
       --start-epoch 1 \
       --exp-dir zipformer/exp_giga_finetune${do_finetune}_mux${use_mux} \
       --use-fp16 1 \
       --base-lr 0.0045 \
       --bpe-model data/lang_bpe_500/bpe.model \
       --do-finetune $do_finetune \
       --use-mux $use_mux \
       --master-port 13024 \
       --finetune-ckpt icefall-asr-librispeech-zipformer-2023-05-15/exp/pretrained.pt \
       --max-duration 1000

The following arguments are related to fine-tuning:

- ``--base-lr``

  The learning rate used for fine-tuning. We suggest setting a **small** learning
  rate for fine-tuning; otherwise the model may quickly forget its initialization.
  A reasonable value is around 1/10 of the original learning rate, i.e., 0.0045.

- ``--do-finetune``

  If True, do fine-tuning by initializing the model from a pre-trained checkpoint.
  **Note that if you want to resume your fine-tuning experiment from a certain
  epoch, you need to set this to False.**

- ``--finetune-ckpt``

  The path to the pre-trained checkpoint (used for initialization).

- ``--use-mux``

  If True, mix the fine-tuning data with the original training data using
  `CutSet.mux `_ (see the sketch after this list). This helps maintain the
  model's performance on the original domain if the original training data is
  available. **If you don't have the original training data, please set it to
  False.**
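
As an illustration of what ``CutSet.mux`` does, here is a minimal, hypothetical
sketch that uses `lhotse <https://github.com/lhotse-speech/lhotse>`_ directly.
The manifest paths and the equal weights are placeholders, not values produced
or required by this recipe:

.. code-block:: python

   from itertools import islice

   from lhotse import CutSet

   # Hypothetical manifests standing in for the original training data and the
   # fine-tuning data; substitute the cuts from your own data preparation.
   libri_cuts = CutSet.from_file("data/fbank/librispeech_cuts_train.jsonl.gz")
   giga_cuts = CutSet.from_file("data/fbank/gigaspeech_cuts_S.jsonl.gz")

   # CutSet.mux lazily interleaves cuts from its input sets; the weights control
   # how often each source is sampled, so the model keeps seeing the original
   # domain while it adapts to the new one.
   mixed_cuts = CutSet.mux(libri_cuts, giga_cuts, weights=[0.5, 0.5])

   # Peek at the first few cuts of the mixed stream.
   for cut in islice(mixed_cuts, 5):
       print(cut.id)

When ``--use-mux 1`` is passed, the recipe performs this mixing for you; the
sketch above is only meant to show the behaviour.
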
After fine-tuning, let's test the WERs. You can do this via the following command:

.. code-block:: bash

   $ use_mux=0
   $ do_finetune=1
   $ ./zipformer/decode_gigaspeech.py \
       --epoch 20 \
       --avg 10 \
       --exp-dir zipformer/exp_giga_finetune${do_finetune}_mux${use_mux} \
       --use-averaged-model 1 \
       --max-duration 1000 \
       --decoding-method greedy_search

You should see numbers similar to the ones below:

.. code-block:: text

   For dev, WER of different settings are:
   greedy_search 13.47 best for dev

   For test, WER of different settings are:
   greedy_search 13.66 best for test

Compared to the original checkpoint, the fine-tuned model achieves much lower
WERs on the GigaSpeech test sets.

diff --git a/docs/source/recipes/Finetune/index.rst b/docs/source/recipes/Finetune/index.rst
new file mode 100644
index 0000000000..e62b8980f8
--- /dev/null
+++ b/docs/source/recipes/Finetune/index.rst
@@ -0,0 +1,15 @@
Fine-tune a pre-trained model
=============================

After pre-training on publicly available datasets, the ASR model is already
capable of performing general speech recognition with relatively high accuracy.
However, the accuracy can still be low on domains that differ markedly from the
original training set. In this case, we can fine-tune the model with a small
amount of additional labelled data to improve the performance on new domains.


.. toctree::
   :maxdepth: 2
   :caption: Table of Contents

   from_supervised/finetune_zipformer

diff --git a/docs/source/recipes/index.rst b/docs/source/recipes/index.rst
index 8df61f0d08..52795d4527 100644
--- a/docs/source/recipes/index.rst
+++ b/docs/source/recipes/index.rst
@@ -17,3 +17,4 @@ We may add recipes for other tasks as well in the future.
    Streaming-ASR/index
    RNN-LM/index
    TTS/index
+   Finetune/index