NVIDIA · krishnacpuvvada · Nov 6, 2024 · nithinraok · Nov 6, 2024 · titu1994
diff --git a/docs/source/asr/asr_all.bib b/docs/source/asr/asr_all.bib
@@ -1040,4 +1040,23 @@ @inproceedings{vaswani2017aayn
   booktitle={Advances in Neural Information Processing Systems},
   pages={6000--6010},
   year={2017}
+}
+
+@inproceedings{radford2023whisper,
+  title={Robust speech recognition via large-scale weak supervision},
+  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
+  booktitle={International conference on machine learning},
+  pages={28492--28518},
+  year={2023},
+  organization={PMLR}
+}
+
+@misc{puvvada2024canary,
+      title={Less is More: Accurate Speech Recognition & Translation without Web-Scale Data}, 
+      author={Krishna C. Puvvada and Piotr Żelasko and He Huang and Oleksii Hrinchuk and Nithin Rao Koluguri and Kunal Dhawan and Somshubra Majumdar and Elena Rastorgueva and Zhehuai Chen and Vitaly Lavrukhin and Jagadeesh Balam and Boris Ginsburg},
+      year={2024},
+      eprint={2406.19674},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2406.19674}, 
 }
diff --git a/docs/source/asr/configs.rst b/docs/source/asr/configs.rst
@@ -1120,3 +1120,160 @@ Depending on the type of model, there may be extra steps that must be performed
 
 * CTC Models - `Examples directory for CTC Models <https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_ctc/README.md>`_
 * RNN Transducer Models - `Examples directory for Transducer Models <https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_transducer/README.md>`_
+
+Multi-task AED Models
+---------------------
+
+The config for multitask AED models (e.g., Canary models that can perform both ASR and translation tasks) is at ``<NeMo_git_root>/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml``. Multi-task AED models are built with an encoder-decoder architecture utilizing FastConformer for the encoder and Transformer for the decoder.
+
+The config of :ref:`Multi-task AED Models <AED_model>` contains the following sections:
+
+* Model initialization (``init_from_nemo_model``)
+* Dataset configs (``train_ds``, ``validation_ds``, and ``test_ds``)
+* Special tokenizer config (``spl_tokens``)
+* Tokenizer config (``tokenizer``)
+* Prompt format (``prompt_format``)
+* Model defaults (``model_defaults``)
+* Audio preprocessor (``preprocessor``)
+* Augmentation (``spec_augment``)
+* FastConformer encoder (``encoder``)
+* Optional intermediate transformer encoder (``transf_encoder``)
+* Transformer decoder (``transf_decoder``)
+* Label prediction head (``head``)
+* Decoding strategy (``decoding``)
+* Loss config (``loss``)
+* Optimizer (``optim``)
+* Training config (``trainer``)
+* Experiment manager (``exp_manager``)
+
+While most of the sections are similar to other ASR models, the multi-task AED model has a few unique sections:
+
+Model Initialization
+~~~~~~~~~~~~~~~~~~~~
+
+For larger models (1B+ params), initialization from a pretrained encoder is recommended for better stability and convergence:
+
+.. code-block:: yaml
+
+    init_from_nemo_model:
+      model0:
+        path: "<path to pretrained model>"
+        include: ["encoder"]
+        exclude: ["encoder.pre_encode.out"]
+
+The ``include`` parameter specifies which components to load from the pretrained model. In this case, only the encoder weights are loaded. The ``exclude`` parameter skips specific sub-components during loading; here, the pre-encoder output layer is excluded to allow for architectural modifications.
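+
+As a rough illustration of how ``include`` and ``exclude`` interact, the sketch below filters parameter names by substring match. The helper ``select_params`` and the sample parameter names are hypothetical, and the substring-matching rule is an assumption made for illustration, not a description of NeMo's loading code:
+
+.. code-block:: python
+
+    def select_params(param_names, include, exclude):
+        """Keep names matching any ``include`` pattern and no ``exclude`` pattern."""
+        selected = []
+        for name in param_names:
+            if not any(pattern in name for pattern in include):
+                continue  # not part of a requested component
+            if any(pattern in name for pattern in exclude):
+                continue  # explicitly skipped sub-component
+            selected.append(name)
+        return selected
+
+
+    # Hypothetical parameter names from a pretrained checkpoint.
+    names = [
+        "encoder.pre_encode.conv.0.weight",
+        "encoder.pre_encode.out.weight",
+        "encoder.layers.0.self_attn.linear_q.weight",
+        "decoder.layers.0.weight",
+    ]
+    loaded = select_params(names, include=["encoder"], exclude=["encoder.pre_encode.out"])
+    print(loaded)
+
+Under these assumptions, only the encoder parameters outside ``encoder.pre_encode.out`` are kept; the excluded layer and the decoder would be initialized from scratch.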
+
+
+Dataset Configurations
+~~~~~~~~~~~~~~~~~~~~~~
+
+The multi-task AED models support only Lhotse-based data loading. Datasets can be configured using either manifest files or Lhotse YAML configs:
+
+.. code-block:: yaml
+
+    train_ds:
+      use_lhotse: true
+      input_cfg: "<path to lhotse config>"
+      batch_duration: 600
+      max_duration: 40
+      use_bucketing: True
+      num_buckets: 30
+      bucket_duration_bins: [3.79, 4.82, 5.688, ..., 35.827]
+      text_field: "answer"
+      lang_field: "target_lang"
+
+For more details about Lhotse-based dataset specification, refer to :ref:`Lhotse Dataloading <lhotse_dataloading>`.
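+
+To give a feel for how ``batch_duration`` and ``bucket_duration_bins`` relate, the sketch below assigns an utterance to a duration bucket and estimates how many utterances of that length fit in one batch. It is a simplified illustration, not Lhotse's sampler: the function names and bin values are hypothetical, and treating each bin as a bucket's upper bound is an assumption:
+
+.. code-block:: python
+
+    import bisect
+
+
+    def bucket_index(duration, bucket_duration_bins):
+        """Index of the first bin that is >= ``duration`` (bins sorted ascending)."""
+        return bisect.bisect_left(bucket_duration_bins, duration)
+
+
+    def approx_batch_size(bucket_upper_bound, batch_duration):
+        """Rough count of utterances of ``bucket_upper_bound`` seconds per batch."""
+        return int(batch_duration // bucket_upper_bound)
+
+
+    # Hypothetical bins in seconds; the last bin matches max_duration: 40.
+    bins = [5.0, 10.0, 20.0, 40.0]
+    idx = bucket_index(7.3, bins)
+    size = approx_batch_size(bins[idx], batch_duration=600)
+    print(idx, size)
+
+In this simplified view, buckets of short utterances yield large batches and buckets of long utterances yield small ones, while the total audio per batch stays near ``batch_duration`` seconds.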
+
+Special Tokens Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The model uses special tokens for task control and language specification. These can be configured during tokenizer creation or loaded from an existing tokenizer:
+
+.. code-block:: yaml
+
+    spl_tokens:
+      model_dir: "<path to tokenizer dir>"
+      tokens: ["translate", "transcribe", "en", "es", "de", "fr"]
+      force_rebuild: False
+
+.. _canary_tokenizer_config:
+
+Tokenizer Configuration
+~~~~~~~~~~~~~~~~~~~~~~~
+
+The model uses an aggregate tokenizer system that combines special tokens with language-specific tokenizers:
+
+.. code-block:: yaml
+
+    tokenizer:
+      dir: null  # Null for aggregate tokenizers
+      type: agg
+      langs:
+        spl_tokens:  # Special tokens model
+          dir: "<path>"
+          type: bpe
+        en:  # Language-specific tokenizers
+          dir: "<path>/tokenizer_en_${model.model_defaults.vocab_size_per_lang}"
+          type: bpe
+        # Additional languages follow same pattern
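+
+One way to picture an aggregate tokenizer is as several vocabularies concatenated into a single ID space, with each language's tokens shifted by an offset. The class below is a toy sketch under that assumption; ``ToyAggregateVocab`` and the token strings are hypothetical and do not reproduce NeMo's tokenizer classes:
+
+.. code-block:: python
+
+    class ToyAggregateVocab:
+        """Concatenate per-language vocabularies into one global ID space."""
+
+        def __init__(self, vocabs):
+            # vocabs: mapping of language key -> token list, in config order.
+            self.vocabs = vocabs
+            self.offsets = {}
+            total = 0
+            for lang, tokens in vocabs.items():
+                self.offsets[lang] = total  # first global ID for this language
+                total += len(tokens)
+            self.vocab_size = total
+
+        def token_to_id(self, token, lang):
+            return self.offsets[lang] + self.vocabs[lang].index(token)
+
+        def id_to_token(self, global_id):
+            for lang, tokens in self.vocabs.items():
+                local_id = global_id - self.offsets[lang]
+                if 0 <= local_id < len(tokens):
+                    return tokens[local_id]
+            raise IndexError(global_id)
+
+
+    vocab = ToyAggregateVocab({
+        "spl_tokens": ["<pad>", "<|transcribe|>", "<|translate|>"],
+        "en": ["hello", "world"],
+        "es": ["hola", "mundo"],
+    })
+    print(vocab.vocab_size, vocab.token_to_id("hola", "es"), vocab.id_to_token(4))
+
+Because the special-token vocabulary comes first in this sketch, its IDs stay fixed when more languages are appended.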
+
+Prompt Format
+~~~~~~~~~~~~~
+
+Currently, AED models support only the Canary prompt format, which uses special tokens for task and language control:
+
+.. code-block:: yaml
+
+    model:
+      prompt_format: "canary"   # Options supported: ["canary"]
+
+In the Canary format, the input sequence to the decoder starts with the special tokens ``start_of_transcript``, ``source_lang``, ``task``, ``target_lang``, and ``pnc``, in that order.
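+
+The prompt layout can be sketched as a small helper that assembles the five slots in order. The function name ``build_canary_prompt`` and the literal token spellings (``<|...|>``) are assumptions for illustration; the actual strings depend on the special-token tokenizer in use:
+
+.. code-block:: python
+
+    def build_canary_prompt(source_lang, task, target_lang, pnc):
+        """Return the decoder prompt as a list of special-token strings."""
+        pnc_token = "<|pnc|>" if pnc else "<|nopnc|>"  # assumed spellings
+        return [
+            "<|startoftranscript|>",
+            f"<|{source_lang}|>",
+            f"<|{task}|>",
+            f"<|{target_lang}|>",
+            pnc_token,
+        ]
+
+
+    # English ASR with punctuation and capitalization.
+    asr_prompt = build_canary_prompt("en", "transcribe", "en", pnc=True)
+    # English-to-German translation without punctuation and capitalization.
+    ast_prompt = build_canary_prompt("en", "translate", "de", pnc=False)
+    print(asr_prompt)
+    print(ast_prompt)
+
+For ASR the source and target languages coincide; for translation they differ.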
+
+Model Defaults
+~~~~~~~~~~~~~~
+
+The ``model_defaults`` section contains shared parameters that define the model's architecture and vocabulary settings:
+
+.. code-block:: yaml
+
+    model_defaults:
+      asr_enc_hidden: 1024    # Hidden size for ASR encoder
+      lm_dec_hidden: 1024     # Hidden size for transformer decoder
+      text_field: "answer"    # Field name for ground truth in manifest
+      lang_field: "target_lang"    # Field name for output language in manifest
+
+These parameters are referenced throughout the config using OmegaConf interpolation (``${...}``) to maintain consistency across different components of the model.
+
+
+Transformer Decoder
+~~~~~~~~~~~~~~~~~~~
+
+The transformer decoder uses a pre-LN architecture with configurable size and attention parameters:
+
+.. code-block:: yaml
+
+    transf_decoder:
+      _target_: nemo.collections.asr.modules.transformer.get_nemo_transformer
+      model_name: null
+      pretrained: false
+      pre_ln_final_layer_norm: true
+      config_dict:
+        max_sequence_length: 2048
+        num_layers: 4
+        hidden_size: ${model.model_defaults.lm_dec_hidden}
+        inner_size: ${multiply:${model.model_defaults.lm_dec_hidden}, 4}
+        num_attention_heads: 8
+        ffn_dropout: 0.1
+        vocab_size: None  # Set at runtime
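+
+The interpolated values above reduce to simple arithmetic. The sketch below works them out for the defaults shown, assuming that the ``multiply`` resolver returns the product of its two arguments and that the hidden size is split evenly across attention heads; the function name is hypothetical:
+
+.. code-block:: python
+
+    def derive_decoder_sizes(lm_dec_hidden, num_attention_heads, ffn_multiplier=4):
+        """Compute the decoder sizes implied by the config values."""
+        if lm_dec_hidden % num_attention_heads != 0:
+            raise ValueError("hidden size must be divisible by the number of heads")
+        return {
+            "hidden_size": lm_dec_hidden,
+            # Mirrors ${multiply:${model.model_defaults.lm_dec_hidden}, 4}
+            "inner_size": lm_dec_hidden * ffn_multiplier,
+            "head_dim": lm_dec_hidden // num_attention_heads,
+        }
+
+
+    sizes = derive_decoder_sizes(lm_dec_hidden=1024, num_attention_heads=8)
+    print(sizes)
+
+Changing ``lm_dec_hidden`` in ``model_defaults`` therefore rescales the feed-forward width without touching the ``transf_decoder`` section.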
+
+Loss Configuration
+~~~~~~~~~~~~~~~~~~
+
+The model uses cross-entropy loss with label smoothing; the smoothing strength is set by ``label_smoothing``:
+
+.. code-block:: yaml
+
+    loss:
+      _target_: nemo.collections.common.losses.smoothed_cross_entropy.SmoothedCrossEntropyLoss
+      label_smoothing: ${model.label_smoothing}
+      pad_id: null
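+
+For intuition, the sketch below computes a label-smoothed cross entropy for a single token in plain Python. It follows the textbook formulation, in which the smoothing mass is spread uniformly over the whole vocabulary; the function name is hypothetical, and NeMo's ``SmoothedCrossEntropyLoss`` may distribute the mass differently and additionally handles padding:
+
+.. code-block:: python
+
+    import math
+
+
+    def smoothed_cross_entropy(log_probs, target, label_smoothing=0.0):
+        """Cross entropy against a one-hot target mixed with a uniform distribution."""
+        vocab_size = len(log_probs)
+        uniform_weight = label_smoothing / vocab_size
+        loss = 0.0
+        for idx, log_prob in enumerate(log_probs):
+            weight = uniform_weight
+            if idx == target:
+                weight += 1.0 - label_smoothing
+            loss -= weight * log_prob
+        return loss
+
+
+    # Toy three-token vocabulary; the model assigns 0.7 to the correct token.
+    log_probs = [math.log(0.7), math.log(0.2), math.log(0.1)]
+    hard = smoothed_cross_entropy(log_probs, target=0, label_smoothing=0.0)
+    smoothed = smoothed_cross_entropy(log_probs, target=0, label_smoothing=0.1)
+    print(hard, smoothed)
+
+With ``label_smoothing`` set to 0 this reduces to the ordinary negative log-likelihood of the target token.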
diff --git a/docs/source/asr/datasets.rst b/docs/source/asr/datasets.rst
@@ -536,6 +536,7 @@ An example using an AIS cluster at ``hostname:port`` with a tarred dataset for t
 
 .. _Hybrid-ASR-TTS_model__Text-Only-Data:
 
+.. _lhotse_dataloading:
 
 Lhotse Dataloading
 ------------------

diff --git a/docs/source/asr/images/aed_model.png b/docs/source/asr/images/aed_model.png
diff --git a/docs/source/asr/models.rst b/docs/source/asr/models.rst
@@ -20,13 +20,17 @@ Spotlight Models
 Canary
 ~~~~~~
 
-Canary-1B is the latest ASR model from NVIDIA NeMo. It sits at the top of the `HuggingFace OpenASR Leaderboard <https://huggingface.co/spaces/hf-audio/open_asr_leaderboard>`__ at time of publishing.
+Canary-1B is the latest multi-lingual, multi-task model from NVIDIA NeMo, supporting both automatic speech-to-text recognition (ASR) and translation. It supports ASR in 4 languages (English, German, French, Spanish) and translation between English and the 3 other supported languages. It sits at the top of the `HuggingFace OpenASR Leaderboard <https://huggingface.co/spaces/hf-audio/open_asr_leaderboard>`__ at time of publishing.
 
-You can `download the checkpoint <https://huggingface.co/nvidia/canary-1b>`__  or try out Canary in action in this `HuggingFace Space <https://huggingface.co/spaces/nvidia/canary-1b>`__.
+It is an attention-based encoder-decoder (AED) model with a :ref:`FastConformer Encoder <Fast-Conformer>` and Transformer Decoder :cite:`asr-models-vaswani2017aayn`.
 
-Canary-1B is an encoder-decoder model with a :ref:`FastConformer Encoder <Fast-Conformer>` and Transformer Decoder :cite:`asr-models-vaswani2017aayn`.
+Model checkpoints:
+
+* `Canary-1B <https://huggingface.co/nvidia/canary-1b>`__  model card
+
+HuggingFace Spaces to try out Canary-1B in your browser:
 
-It is a multi-lingual, multi-task model, supporting automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) as well as translation between English and the 3 other supported languages.
+* `Canary-1B <https://huggingface.co/spaces/nvidia/canary-1b>`__ space
 
 
 Parakeet
@@ -45,6 +49,27 @@ HuggingFace Spaces to try out Parakeet models in your browser:
 * `Parakeet-RNNT-1.1B <https://huggingface.co/spaces/nvidia/parakeet-rnnt-1.1b>`__ space
 * `Parakeet-TDT-1.1B <https://huggingface.co/spaces/nvidia/parakeet-tdt-1.1b>`__ space
 
+.. _AED_model:
+
+AED
+---
+
+Attention-based Encoder-Decoder (AED) models in NeMo use a FastConformer encoder and a Transformer decoder.
+
+Here is the overall architecture of the AED models in NeMo:
+
+    .. image:: images/aed_model.png
+        :align: center
+        :alt: AED Model
+        :scale: 50%
+
+The Multi-task AED model is implemented in the :class:`~nemo.collections.asr.models.EncDecMultiTaskModel` class.
+The model uses an aggregate tokenizer system with special tokens to control the task (ASR or translation) and the language during inference.
+An example config file for the Multi-task AED model is at ``<NeMo_git_root>/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml``.
+For more details about tokenizer configuration, refer to the :ref:`Tokenizer Configuration <canary_tokenizer_config>` section.
+An example launch script for the Multi-task AED model is at ``<NeMo_git_root>/examples/asr/speech_multitask/speech_to_text_aed.py``.
+
+
 .. _Conformer_model:
 
 Conformer