adding canary docs
Signed-off-by: Krishna Puvvada <[email protected]>
Krishna Puvvada committed Nov 6, 2024
1 parent ba2b96d commit 9bdb804
Showing 5 changed files with 206 additions and 4 deletions.
19 changes: 19 additions & 0 deletions docs/source/asr/asr_all.bib
@@ -1040,4 +1040,23 @@ @inproceedings{vaswani2017aayn
booktitle={Advances in Neural Information Processing Systems},
pages={6000--6010},
year={2017}
}

@inproceedings{radford2023whisper,
title={Robust speech recognition via large-scale weak supervision},
author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
booktitle={International conference on machine learning},
pages={28492--28518},
year={2023},
organization={PMLR}
}

@misc{puvvada2024canary,
title={Less is More: Accurate Speech Recognition & Translation without Web-Scale Data},
author={Krishna C. Puvvada and Piotr Żelasko and He Huang and Oleksii Hrinchuk and Nithin Rao Koluguri and Kunal Dhawan and Somshubra Majumdar and Elena Rastorgueva and Zhehuai Chen and Vitaly Lavrukhin and Jagadeesh Balam and Boris Ginsburg},
year={2024},
eprint={2406.19674},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.19674},
}
157 changes: 157 additions & 0 deletions docs/source/asr/configs.rst
@@ -1120,3 +1120,160 @@ Depending on the type of model, there may be extra steps that must be performed
* CTC Models - `Examples directory for CTC Models <https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_ctc/README.md>`_
* RNN Transducer Models - `Examples directory for Transducer Models <https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_transducer/README.md>`_
Multi-task AED Models
---------------------
The config for multitask AED models (e.g., Canary models that can perform both ASR and translation tasks) is at ``<NeMo_git_root>/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml``. Multi-task AED models are built with an encoder-decoder architecture utilizing FastConformer for the encoder and Transformer for the decoder.
The config for :ref:`Multi-task AED Models <AED_model>` contains the following sections:
* Model initialization (``init_from_nemo_model``)
* Dataset configs (``train_ds``, ``validation_ds``, and ``test_ds``)
* Special tokenizer config (``spl_tokens``)
* Tokenizer config (``tokenizer``)
* Prompt format (``prompt_format``)
* Model defaults (``model_defaults``)
* Audio preprocessor (``preprocessor``)
* Augmentation (``spec_augment``)
* FastConformer encoder (``encoder``)
* Optional intermediate transformer encoder (``transf_encoder``)
* Transformer decoder (``transf_decoder``)
* Label prediction head (``head``)
* Decoding strategy (``decoding``)
* Loss config (``loss``)
* Optimizer (``optim``)
* Training config (``trainer``)
* Experiment manager (``exp_manager``)
While most of the sections are similar to other ASR models, the multi-task AED model has a few unique sections:
Model Initialization
~~~~~~~~~~~~~~~~~~~~
For larger models (1B+ params), initialization from a pretrained encoder is recommended for better stability and convergence:
.. code-block:: yaml
init_from_nemo_model:
model0:
path: "<path to pretrained model>"
include: ["encoder"]
exclude: ["encoder.pre_encode.out"]
The ``include`` parameter specifies which components to load from the pretrained model. In this case, only the encoder weights are loaded. The ``exclude`` parameter lets you skip specific sub-components during loading; here, the pre-encoder output layer is excluded to allow for architectural modifications.
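To make the ``include``/``exclude`` semantics concrete, here is a minimal sketch of prefix-based filtering over parameter names. The helper ``select_param_names`` and the parameter names are hypothetical, and the sketch assumes simple prefix matching; it is not NeMo's loading code.

```python
def select_param_names(names, include, exclude):
    """Keep names that match an include prefix and no exclude prefix."""
    selected = []
    for name in names:
        included = any(name.startswith(prefix) for prefix in include)
        excluded = any(name.startswith(prefix) for prefix in exclude)
        if included and not excluded:
            selected.append(name)
    return selected


# Hypothetical parameter names from a pretrained checkpoint.
checkpoint_names = [
    "encoder.pre_encode.conv.0.weight",
    "encoder.pre_encode.out.weight",
    "encoder.layers.0.self_attn.linear_q.weight",
    "decoder.embedding.weight",
]

loaded = select_param_names(
    checkpoint_names,
    include=["encoder"],
    exclude=["encoder.pre_encode.out"],
)
print(loaded)
```

Under these assumptions, only encoder parameters outside ``encoder.pre_encode.out`` are selected; everything else keeps its fresh initialization.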
Dataset Configurations
~~~~~~~~~~~~~~~~~~~~~~
The multi-task AED models support only Lhotse-based data loading. Datasets can be configured using either manifest files or Lhotse YAML configs:
.. code-block:: yaml
train_ds:
use_lhotse: true
input_cfg: "<path to lhotse config>"
batch_duration: 600
max_duration: 40
use_bucketing: True
num_buckets: 30
bucket_duration_bins: [3.79, 4.82, 5.688, ..., 35.827]
text_field: "answer"
lang_field: "target_lang"
For more details about Lhotse-based dataset specification, refer to :ref:`Lhotse Dataloading <lhotse_dataloading>`.
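The interplay of ``max_duration`` and ``bucket_duration_bins`` can be pictured with a small sketch. It assumes the bins are sorted upper bounds in seconds and that utterances longer than the last bin land in a final overflow bucket. The helper names and the bin values are hypothetical; this is not Lhotse's sampler.

```python
from bisect import bisect_left


def bucket_index(duration, bins):
    """Index of the first bin whose upper bound is >= duration."""
    return bisect_left(bins, duration)


def filter_and_bucket(durations, bins, max_duration):
    """Drop utterances above max_duration and group the rest by bucket."""
    buckets = {}
    for duration in durations:
        if duration > max_duration:
            continue  # filtered out, as with max_duration in the config
        buckets.setdefault(bucket_index(duration, bins), []).append(duration)
    return buckets


bins = [5.0, 10.0, 20.0]  # illustrative values, not the real bins
durations = [2.5, 7.0, 10.0, 18.0, 39.0, 45.0]
buckets = filter_and_bucket(durations, bins, max_duration=40)
print(buckets)
```

Grouping utterances of similar length keeps padding low, which is why a duration-based batch size (``batch_duration``) pairs naturally with bucketing.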
Special Tokens Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The model uses special tokens for task control and language specification. These can be configured during tokenizer creation or loaded from an existing tokenizer:
.. code-block:: yaml
spl_tokens:
model_dir: "<path to tokenizer dir>"
tokens: ["translate", "transcribe", "en", "es", "de", "fr"]
force_rebuild: False
.. _canary_tokenizer_config:
Tokenizer Configuration
~~~~~~~~~~~~~~~~~~~~~~~
The model uses an aggregate tokenizer system that combines special tokens with language-specific tokenizers:
.. code-block:: yaml
tokenizer:
dir: null # Null for aggregate tokenizers
type: agg
langs:
spl_tokens: # Special tokens model
dir: "<path>"
type: bpe
en: # Language-specific tokenizers
dir: "<path>/tokenizer_en_${model.model_defaults.vocab_size_per_lang}"
type: bpe
# Additional languages follow same pattern
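One way to picture an aggregate tokenizer is as several sub-vocabularies laid out back to back in a single ID space. The sketch below assumes each sub-tokenizer's IDs are shifted by the combined size of the vocabularies before it. The vocabulary sizes and helper names are hypothetical and are not taken from the NeMo implementation.

```python
def build_offsets(vocab_sizes):
    """Map each sub-tokenizer name to the first global ID it owns."""
    offsets = {}
    running = 0
    for name, size in vocab_sizes.items():
        offsets[name] = running
        running += size
    return offsets, running


def to_global_id(name, local_id, offsets):
    """Convert a sub-tokenizer's local token ID to a global ID."""
    return offsets[name] + local_id


# Hypothetical sizes, listed in the same order as the `langs` section.
vocab_sizes = {"spl_tokens": 32, "en": 1024, "es": 1024}
offsets, total_vocab = build_offsets(vocab_sizes)
print(offsets, total_vocab)
print(to_global_id("es", 5, offsets))
```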
Prompt Format
~~~~~~~~~~~~~
Currently, AED models support only the Canary format, which uses special tokens for task and language control:
.. code-block:: yaml
model:
prompt_format: "canary" # Options supported: ["canary"]
For the Canary format, the decoder input sequence starts with the special tokens [start_of_transcript, source_lang, task, target_lang, pnc], in that order.
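The token order described above can be sketched as a small helper. The ``<|...|>`` spellings and the ``build_canary_prompt`` function are assumptions made for illustration; consult the tokenizer for the actual special-token strings.

```python
VALID_TASKS = {"transcribe", "translate"}


def build_canary_prompt(source_lang, task, target_lang, pnc):
    """Return the decoder prompt tokens in Canary order (assumed spellings)."""
    if task not in VALID_TASKS:
        raise ValueError(f"unknown task: {task!r}")
    pnc_token = "<|pnc|>" if pnc else "<|nopnc|>"
    return [
        "<|startoftranscript|>",
        f"<|{source_lang}|>",
        f"<|{task}|>",
        f"<|{target_lang}|>",
        pnc_token,
    ]


# English speech translated to German, with punctuation and capitalization.
prompt = build_canary_prompt("en", "translate", "de", pnc=True)
print(prompt)
```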
Model Defaults
~~~~~~~~~~~~~~
The ``model_defaults`` section contains shared parameters that define the model's architecture and vocabulary settings:
.. code-block:: yaml
model_defaults:
asr_enc_hidden: 1024 # Hidden size for ASR encoder
lm_dec_hidden: 1024 # Hidden size for transformer decoder
text_field: "answer" # Field name for ground truth in manifest
lang_field: "target_lang" # Field name for output language in manifest
These parameters are referenced throughout the config using OmegaConf interpolation (``${...}``) to maintain consistency across different components of the model.
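To illustrate what interpolation does, here is a simplified stdlib-only stand-in for the ``${a.b.c}`` lookup. It is not OmegaConf itself, and the ``resolve`` helper, the ``tokenizer_dir`` key, and the value of ``vocab_size_per_lang`` are hypothetical.

```python
import re

_REFERENCE = re.compile(r"\$\{([^${}]+)\}")


def lookup(cfg, dotted_key):
    """Walk a nested dict using a dotted key such as 'model.x.y'."""
    node = cfg
    for part in dotted_key.split("."):
        node = node[part]
    return node


def resolve(value, cfg):
    """Resolve ${a.b.c} references inside a config value."""
    if not isinstance(value, str):
        return value
    match = _REFERENCE.fullmatch(value)
    if match:  # the whole value is one reference: keep the referenced type
        return resolve(lookup(cfg, match.group(1)), cfg)
    # Otherwise substitute each reference into the surrounding string.
    return _REFERENCE.sub(
        lambda m: str(resolve(lookup(cfg, m.group(1)), cfg)), value
    )


cfg = {
    "model": {
        "model_defaults": {"lm_dec_hidden": 1024, "vocab_size_per_lang": 1024},
        "transf_decoder": {"hidden_size": "${model.model_defaults.lm_dec_hidden}"},
        "tokenizer_dir": "tokenizer_en_${model.model_defaults.vocab_size_per_lang}",
    }
}

hidden = resolve(cfg["model"]["transf_decoder"]["hidden_size"], cfg)
tokenizer_dir = resolve(cfg["model"]["tokenizer_dir"], cfg)
print(hidden, tokenizer_dir)
```

Changing ``lm_dec_hidden`` in one place then updates every component that references it.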
Transformer Decoder
~~~~~~~~~~~~~~~~~~~
The transformer decoder uses a pre-LN architecture with configurable size and attention parameters:
.. code-block:: yaml
transf_decoder:
_target_: nemo.collections.asr.modules.transformer.get_nemo_transformer
model_name: null
pretrained: false
pre_ln_final_layer_norm: true
config_dict:
max_sequence_length: 2048
num_layers: 4
hidden_size: ${model.model_defaults.lm_dec_hidden}
inner_size: ${multiply:${model.model_defaults.lm_dec_hidden}, 4}
num_attention_heads: 8
ffn_dropout: 0.1
vocab_size: None # Set at runtime
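The sizes implied by this config follow from simple arithmetic. The sketch below assumes ``lm_dec_hidden`` is 1024, as in the Model Defaults example, and that the hidden size is split evenly across attention heads. The ``decoder_dims`` helper is hypothetical.

```python
def decoder_dims(hidden_size, num_attention_heads, ffn_multiplier=4):
    """Derive feed-forward and per-head sizes from the hidden size."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return {
        "hidden_size": hidden_size,
        # Mirrors ${multiply:${model.model_defaults.lm_dec_hidden}, 4}
        "inner_size": hidden_size * ffn_multiplier,
        "head_dim": hidden_size // num_attention_heads,
    }


dims = decoder_dims(hidden_size=1024, num_attention_heads=8)
print(dims)
```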
Loss Configuration
~~~~~~~~~~~~~~~~~~
The model uses cross-entropy loss with optional label smoothing:
.. code-block:: yaml
loss:
_target_: nemo.collections.common.losses.smoothed_cross_entropy.SmoothedCrossEntropyLoss
label_smoothing: ${model.label_smoothing}
pad_id: null
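For intuition, label-smoothed cross entropy can be written out in plain Python. The sketch uses one common formulation, in which the true label receives weight ``1 - label_smoothing`` and the remainder is spread evenly over the other classes, and it skips positions equal to ``pad_id``. It illustrates the idea and is not the ``SmoothedCrossEntropyLoss`` implementation.

```python
import math


def smoothed_cross_entropy(log_probs, targets, label_smoothing=0.0, pad_id=None):
    """Average label-smoothed negative log-likelihood over non-pad tokens.

    log_probs: per-token lists of log-probabilities over the vocabulary.
    targets:   target token IDs, one per token.
    """
    total, count = 0.0, 0
    for token_log_probs, target in zip(log_probs, targets):
        if pad_id is not None and target == pad_id:
            continue  # padded positions do not contribute to the loss
        vocab_size = len(token_log_probs)
        off_weight = label_smoothing / (vocab_size - 1)
        for idx, log_prob in enumerate(token_log_probs):
            weight = (1.0 - label_smoothing) if idx == target else off_weight
            total -= weight * log_prob
        count += 1
    return total / max(count, 1)


uniform = [math.log(0.25)] * 4
confident = [math.log(p) for p in (0.7, 0.1, 0.1, 0.1)]
log_probs = [uniform, confident, uniform]
targets = [2, 0, 3]  # in this toy example, ID 3 plays the role of padding

loss_plain = smoothed_cross_entropy(log_probs, targets, 0.0, pad_id=3)
loss_smooth = smoothed_cross_entropy(log_probs, targets, 0.1, pad_id=3)
print(loss_plain, loss_smooth)
```

With smoothing enabled, the confident prediction is penalized for the low probability it assigns to the other classes, so the smoothed loss is higher here.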
1 change: 1 addition & 0 deletions docs/source/asr/datasets.rst
@@ -536,6 +536,7 @@ An example using an AIS cluster at ``hostname:port`` with a tarred dataset for t
.. _Hybrid-ASR-TTS_model__Text-Only-Data:

.. _lhotse_dataloading:

Lhotse Dataloading
------------------
Binary file added docs/source/asr/images/aed_model.png
33 changes: 29 additions & 4 deletions docs/source/asr/models.rst
@@ -20,13 +20,17 @@ Spotlight Models
Canary
~~~~~~

Canary-1B is the latest ASR model from NVIDIA NeMo. It sits at the top of the `HuggingFace OpenASR Leaderboard <https://huggingface.co/spaces/hf-audio/open_asr_leaderboard>`__ at time of publishing.
Canary-1B is the latest multi-lingual, multi-task model from NVIDIA NeMo, supporting automatic speech-to-text recognition (ASR) as well as translation. It supports ASR in 4 languages (English, German, French, Spanish) and translation between English and the 3 other supported languages. It sits at the top of the `HuggingFace OpenASR Leaderboard <https://huggingface.co/spaces/hf-audio/open_asr_leaderboard>`__ at the time of publishing.

You can `download the checkpoint <https://huggingface.co/nvidia/canary-1b>`__ or try out Canary in action in this `HuggingFace Space <https://huggingface.co/spaces/nvidia/canary-1b>`__.
It is an attention-based encoder-decoder (AED) model with a :ref:`FastConformer Encoder <Fast-Conformer>` and Transformer Decoder :cite:`asr-models-vaswani2017aayn`.

Canary-1B is an encoder-decoder model with a :ref:`FastConformer Encoder <Fast-Conformer>` and Transformer Decoder :cite:`asr-models-vaswani2017aayn`.
Model checkpoints:

* `Canary-1B <https://huggingface.co/nvidia/canary-1b>`__ model card

HuggingFace Spaces to try out Canary-1B in your browser:

It is a multi-lingual, multi-task model, supporting automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) as well as translation between English and the 3 other supported languages.
* `Canary-1B <https://huggingface.co/spaces/nvidia/canary-1b>`__ space


Parakeet
@@ -45,6 +49,27 @@ HuggingFace Spaces to try out Parakeet models in your browser:
* `Parakeet-RNNT-1.1B <https://huggingface.co/spaces/nvidia/parakeet-rnnt-1.1b>`__ space
* `Parakeet-TDT-1.1B <https://huggingface.co/spaces/nvidia/parakeet-tdt-1.1b>`__ space

.. _AED_model:

AED
---

Attention-based Encoder-Decoder (AED) models in NeMo are based on a FastConformer encoder and a Transformer decoder.

Here is the overall architecture of the AED models in NeMo:

.. image:: images/aed_model.png
:align: center
:alt: AED Model
:scale: 50%

The Multi-task AED model is implemented in the :class:`~nemo.collections.asr.models.EncDecMultiTaskModel` class.
The model uses an aggregate tokenizer system with special tokens to control different tasks (ASR and translation) and languages during inference.
You can find the example config file for the Multi-task AED model at ``<NeMo_git_root>/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml``.
For more details about tokenizer configuration, refer to the :ref:`Tokenizer Configuration <canary_tokenizer_config>` section.
An example launch script for the Multi-task AED model can be found at ``<NeMo_git_root>/examples/asr/speech_multitask/speech_to_text_aed.py``.


.. _Conformer_model:

Conformer