adding canary docs #11176

19 changes: 19 additions & 0 deletions docs/source/asr/asr_all.bib
Expand Up @@ -1040,4 +1040,23 @@ @inproceedings{vaswani2017aayn
booktitle={Advances in Neural Information Processing Systems},
pages={6000--6010},
year={2017}
}

@inproceedings{radford2023whisper,
title={Robust speech recognition via large-scale weak supervision},
author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
booktitle={International conference on machine learning},
pages={28492--28518},
year={2023},
organization={PMLR}
}

@misc{puvvada2024canary,
title={Less is More: Accurate Speech Recognition \& Translation without Web-Scale Data},
author={Krishna C. Puvvada and Piotr Żelasko and He Huang and Oleksii Hrinchuk and Nithin Rao Koluguri and Kunal Dhawan and Somshubra Majumdar and Elena Rastorgueva and Zhehuai Chen and Vitaly Lavrukhin and Jagadeesh Balam and Boris Ginsburg},
year={2024},
eprint={2406.19674},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.19674},
}
157 changes: 157 additions & 0 deletions docs/source/asr/configs.rst
Expand Up @@ -1120,3 +1120,160 @@ Depending on the type of model, there may be extra steps that must be performed

* CTC Models - `Examples directory for CTC Models <https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_ctc/README.md>`_
* RNN Transducer Models - `Examples directory for Transducer Models <https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_transducer/README.md>`_

Multi-task AED Models
---------------------

Review comment (Collaborator): Is the following information present in the docs?

  • What the Canary model is
  • How to load a Canary model and transcribe
  • How to fine-tune a Canary model to adapt to new languages
  • Same languages but new data -> tokenizer update
  • Recommended hyperparameters for fine-tuning

The config for multi-task AED models (e.g., Canary models, which perform both ASR and translation) is at ``<NeMo_git_root>/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml``. Multi-task AED models use an encoder-decoder architecture with a FastConformer encoder and a Transformer decoder.

The config for :ref:`Multi-task AED Models <AED_model>` has the following sections:

* Model initialization (``init_from_nemo_model``)
* Dataset configs (``train_ds``, ``validation_ds``, and ``test_ds``)
* Special tokenizer config (``spl_tokens``)
* Tokenizer config (``tokenizer``)
* Prompt format (``prompt_format``)
* Model defaults (``model_defaults``)
* Audio preprocessor (``preprocessor``)
* Augmentation (``spec_augment``)
* FastConformer encoder (``encoder``)
* Optional intermediate transformer encoder (``transf_encoder``)
* Transformer decoder (``transf_decoder``)
* Label prediction head (``head``)
* Decoding strategy (``decoding``)
* Loss config (``loss``)
* Optimizer (``optim``)
* Training config (``trainer``)
* Experiment manager (``exp_manager``)

While most of the sections are similar to other ASR models, the multi-task AED model has a few unique sections:

Model Initialization
~~~~~~~~~~~~~~~~~~~~

For larger models (1B+ params), initialization from a pretrained encoder is recommended for better stability and convergence:

.. code-block:: yaml

  init_from_nemo_model:
    model0:
      path: "<path to pretrained model>"
      include: ["encoder"]
      exclude: ["encoder.pre_encode.out"]

The ``include`` parameter specifies which components to load from the pretrained model. In this case, only the encoder weights are loaded. The ``exclude`` parameter skips specific sub-components during loading; here, the pre-encoder output layer is excluded to allow for architectural modifications.
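
The effect of ``include`` and ``exclude`` can be pictured as prefix filtering over parameter names. The sketch below is an illustration only, assuming prefix matching; it is not NeMo's loading code, and ``select_weights`` and the toy parameter names are hypothetical.

.. code-block:: python

    def select_weights(state_dict, include, exclude=()):
        """Keep entries whose name starts with an include prefix and no exclude prefix."""
        selected = {}
        for name, value in state_dict.items():
            if not any(name.startswith(prefix) for prefix in include):
                continue  # not part of a requested component
            if any(name.startswith(prefix) for prefix in exclude):
                continue  # explicitly skipped sub-component
            selected[name] = value
        return selected


    # Toy checkpoint: the integer values stand in for weight tensors.
    pretrained = {
        "encoder.pre_encode.conv.weight": 1,
        "encoder.pre_encode.out.weight": 2,
        "encoder.layers.0.weight": 3,
        "decoder.embedding.weight": 4,
    }
    loaded = select_weights(
        pretrained, include=["encoder"], exclude=["encoder.pre_encode.out"]
    )
    print(sorted(loaded))

In this toy example only the two encoder entries outside ``encoder.pre_encode.out`` survive the filter; the decoder entry is never considered because it matches no ``include`` prefix.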


Dataset Configurations
~~~~~~~~~~~~~~~~~~~~~~

The multi-task AED models support only Lhotse-based data loading. Datasets can be configured using either manifest files or Lhotse YAML configs:

.. code-block:: yaml

  train_ds:
    use_lhotse: true
    input_cfg: "<path to lhotse config>"
    batch_duration: 600
    max_duration: 40
    use_bucketing: True
    num_buckets: 30
    bucket_duration_bins: [3.79, 4.82, 5.688, ..., 35.827]
    text_field: "answer"
    lang_field: "target_lang"

Review comment (Collaborator): Provide info or a link on how to calculate the bucket duration bins (or just update the sentence below to say that the list of floats can be obtained by following this link).

For more details about Lhotse-based dataset specification, refer to :ref:`Lhotse Dataloading <lhotse_dataloading>`.
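
The ``bucket_duration_bins`` list holds the duration boundaries, in seconds, that separate consecutive buckets. One common way to choose them is to split the sorted utterance durations so that each bucket covers a roughly equal share of the total audio. The sketch below illustrates that idea under stated assumptions: ``estimate_bins`` is a hypothetical helper rather than a NeMo or Lhotse API, it returns ``num_buckets - 1`` boundaries, and the durations are made up.

.. code-block:: python

    def estimate_bins(durations, num_buckets):
        """Return up to num_buckets - 1 boundaries that split total audio time about evenly."""
        ordered = sorted(durations)
        per_bucket = sum(ordered) / num_buckets  # target audio seconds per bucket
        bins, accumulated = [], 0.0
        for duration in ordered:
            if accumulated >= per_bucket and len(bins) < num_buckets - 1:
                bins.append(duration)  # this duration starts the next bucket
                accumulated = 0.0
            accumulated += duration
        return bins


    durations = [1.0, 2.0, 2.0, 3.0, 4.0, 4.0, 8.0]  # seconds, illustrative only
    print(estimate_bins(durations, num_buckets=3))

On real data the durations would come from the training manifests, and the resulting list would be pasted into ``bucket_duration_bins``.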

Special Tokens Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The model uses special tokens for task control and language specification. These can be configured during tokenizer creation or loaded from an existing tokenizer:

Review comment (Collaborator): We need docs on how to construct this special tokenizer (either here or somewhere below, with a link to that section).


.. code-block:: yaml

  spl_tokens:
    model_dir: "<path to tokenizer dir>"
    tokens: ["translate", "transcribe", "en", "es", "de", "fr"]
    force_rebuild: False

.. _canary_tokenizer_config:

Tokenizer Configuration
~~~~~~~~~~~~~~~~~~~~~~~

The model uses an aggregate tokenizer system that combines special tokens with language-specific tokenizers:

.. code-block:: yaml

  tokenizer:
    dir: null  # Null for aggregate tokenizers
    type: agg
    langs:
      spl_tokens:  # Special tokens model
        dir: "<path>"
        type: bpe
      en:  # Language-specific tokenizers
        dir: "<path>/tokenizer_en_${model.model_defaults.vocab_size_per_lang}"
        type: bpe
      # Additional languages follow same pattern
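
One way to picture an aggregate tokenizer is as a set of per-language vocabularies placed side by side in a single ID space, with each vocabulary shifted by an offset so that IDs never collide. The toy class below sketches that idea; it is not NeMo's implementation, and ``ToyAggregateTokenizer`` and its vocabularies are hypothetical.

.. code-block:: python

    class ToyAggregateTokenizer:
        """Map (lang, token) pairs into one shared ID space using per-language offsets."""

        def __init__(self, vocabs):
            self.vocabs = vocabs
            self.offsets = {}
            offset = 0
            for lang, vocab in vocabs.items():  # dicts preserve insertion order
                self.offsets[lang] = offset
                offset += len(vocab)
            self.vocab_size = offset

        def token_to_id(self, token, lang):
            return self.offsets[lang] + self.vocabs[lang].index(token)

        def id_to_token(self, token_id):
            for lang, vocab in self.vocabs.items():
                local_id = token_id - self.offsets[lang]
                if 0 <= local_id < len(vocab):
                    return lang, vocab[local_id]
            raise ValueError(f"ID {token_id} is outside the vocabulary")


    tokenizer = ToyAggregateTokenizer({
        "spl_tokens": ["translate", "transcribe", "en", "es"],
        "en": ["hello", "world"],
        "es": ["hola", "mundo"],
    })
    print(tokenizer.token_to_id("hola", "es"), tokenizer.vocab_size)

Placing ``spl_tokens`` first in this sketch keeps the special-token IDs stable when further languages are appended.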

Prompt Format
~~~~~~~~~~~~~

Currently, AED models support only the Canary format, which uses special tokens for task and language control:

.. code-block:: yaml

  model:
    prompt_format: "canary"  # Options supported: ["canary"]

In the Canary format, the decoder input sequence begins with the following special tokens, in order: [start_of_transcript, source_lang, task, target_lang, pnc].
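
The slot order of the Canary prompt can be sketched as a small helper that assembles the decoder prefix. This is an illustration only: ``build_prompt_slots`` is hypothetical, the slot values are plain strings rather than the actual special-token spellings used by the tokenizer, and the ``nopnc`` value and the rule for choosing the task are assumptions.

.. code-block:: python

    def build_prompt_slots(source_lang, target_lang, pnc=True):
        """Return the decoder prompt slots in Canary order."""
        # Assumption: transcription when both languages match, translation otherwise.
        task = "transcribe" if source_lang == target_lang else "translate"
        return [
            "start_of_transcript",
            source_lang,
            task,
            target_lang,
            "pnc" if pnc else "nopnc",  # punctuation and capitalization switch
        ]


    print(build_prompt_slots("en", "de"))
    print(build_prompt_slots("en", "en", pnc=False))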

Model Defaults
~~~~~~~~~~~~~~

The ``model_defaults`` section contains shared parameters that define the model's architecture and vocabulary settings:

.. code-block:: yaml

  model_defaults:
    asr_enc_hidden: 1024  # Hidden size for ASR encoder
    lm_dec_hidden: 1024  # Hidden size for transformer decoder
    text_field: "answer"  # Field name for ground truth in manifest
    lang_field: "target_lang"  # Field name for output language in manifest

These parameters are referenced throughout the config using OmegaConf interpolation (``${...}``) to keep the model's components consistent.


Transformer Decoder
~~~~~~~~~~~~~~~~~~~


The transformer decoder uses a pre-LN architecture with configurable size and attention parameters:

.. code-block:: yaml

  transf_decoder:
    _target_: nemo.collections.asr.modules.transformer.get_nemo_transformer
    model_name: null
    pretrained: false
    pre_ln_final_layer_norm: true
    config_dict:
      max_sequence_length: 2048
      num_layers: 4
      hidden_size: ${model.model_defaults.lm_dec_hidden}
      inner_size: ${multiply:${model.model_defaults.lm_dec_hidden}, 4}
      num_attention_heads: 8
      ffn_dropout: 0.1
      vocab_size: None  # Set at runtime
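
With the values shown in this section, the interpolations in the decoder config resolve to concrete sizes. The arithmetic below is a worked example using only those values; the assumption that ``hidden_size`` is split evenly across attention heads is the usual multi-head attention convention and is not stated in the config itself.

.. code-block:: python

    lm_dec_hidden = 1024     # model.model_defaults.lm_dec_hidden
    num_attention_heads = 8  # transf_decoder.config_dict.num_attention_heads

    hidden_size = lm_dec_hidden     # ${model.model_defaults.lm_dec_hidden}
    inner_size = lm_dec_hidden * 4  # ${multiply:${model.model_defaults.lm_dec_hidden}, 4}
    head_dim, remainder = divmod(hidden_size, num_attention_heads)

    print(hidden_size, inner_size, head_dim)  # 1024 4096 128

Changing ``lm_dec_hidden`` in ``model_defaults`` therefore changes both the decoder width and its feed-forward size in one place.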

Loss Configuration
~~~~~~~~~~~~~~~~~~

The model uses smoothed cross entropy loss with optional label smoothing:

.. code-block:: yaml

  loss:
    _target_: nemo.collections.common.losses.smoothed_cross_entropy.SmoothedCrossEntropyLoss
    label_smoothing: ${model.label_smoothing}
    pad_id: null
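
Label smoothing replaces the one-hot target with a distribution that keeps most of the probability mass on the reference token and spreads the rest over the other vocabulary entries. The function below is a minimal sketch of that general technique, assuming the smoothing mass is shared uniformly among the non-target tokens and that padded positions are skipped; it is not a copy of ``SmoothedCrossEntropyLoss``, and ``smoothed_cross_entropy`` is a hypothetical name.

.. code-block:: python

    import math


    def smoothed_cross_entropy(log_probs, targets, label_smoothing=0.0, pad_id=None):
        """Average label-smoothed negative log-likelihood over non-padded positions.

        log_probs: one list of per-token log-probabilities for each position.
        targets: the reference token ID for each position.
        """
        total, count = 0.0, 0
        for position_log_probs, target in zip(log_probs, targets):
            if pad_id is not None and target == pad_id:
                continue  # padded positions do not contribute to the loss
            vocab_size = len(position_log_probs)
            off_target = label_smoothing / (vocab_size - 1)
            loss = 0.0
            for token_id, log_prob in enumerate(position_log_probs):
                weight = 1.0 - label_smoothing if token_id == target else off_target
                loss -= weight * log_prob
            total += loss
            count += 1
        return total / max(count, 1)


    uniform = [math.log(0.25)] * 4
    print(smoothed_cross_entropy([uniform, uniform], targets=[1, 0],
                                 label_smoothing=0.1, pad_id=0))

Setting ``label_smoothing`` to ``0.0`` in this sketch reduces it to the ordinary cross-entropy of the reference token.
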
1 change: 1 addition & 0 deletions docs/source/asr/datasets.rst
Expand Up @@ -536,6 +536,7 @@ An example using an AIS cluster at ``hostname:port`` with a tarred dataset for t

.. _Hybrid-ASR-TTS_model__Text-Only-Data:

.. _lhotse_dataloading:

Lhotse Dataloading
------------------
Binary file added docs/source/asr/images/aed_model.png

Review comment (Collaborator): No picture uploaded. Upload the image to the latest NeMo release and update the URL.

33 changes: 29 additions & 4 deletions docs/source/asr/models.rst
Expand Up @@ -20,13 +20,17 @@ Spotlight Models
Canary
~~~~~~

Canary-1B is the latest ASR model from NVIDIA NeMo. It sits at the top of the `HuggingFace OpenASR Leaderboard <https://huggingface.co/spaces/hf-audio/open_asr_leaderboard>`__ at time of publishing.
Canary-1B is the latest multi-lingual, multi-task model supporting automatic speech-to-text recognition (ASR) as well as translation from NVIDIA NeMo. It supports ASR in 4 languages (English, German, French, Spanish) and translation between English and the 3 other supported languages. It sits at the top of the `HuggingFace OpenASR Leaderboard <https://huggingface.co/spaces/hf-audio/open_asr_leaderboard>`__ at time of publishing.

You can `download the checkpoint <https://huggingface.co/nvidia/canary-1b>`__ or try out Canary in action in this `HuggingFace Space <https://huggingface.co/spaces/nvidia/canary-1b>`__.
It is an attention-based encoder-decoder (AED) model with a :ref:`FastConformer Encoder <Fast-Conformer>` and Transformer Decoder :cite:`asr-models-vaswani2017aayn`.

Canary-1B is an encoder-decoder model with a :ref:`FastConformer Encoder <Fast-Conformer>` and Transformer Decoder :cite:`asr-models-vaswani2017aayn`.
Model checkpoints:

* `Canary-1B <https://huggingface.co/nvidia/canary-1b>`__ model card

HuggingFace Spaces to try out Canary-1B in your browser:

It is a multi-lingual, multi-task model, supporting automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) as well as translation between English and the 3 other supported languages.
* `Canary-1B <https://huggingface.co/spaces/nvidia/canary-1b>`__ space


Parakeet
Expand All @@ -45,6 +49,27 @@ HuggingFace Spaces to try out Parakeet models in your browser:
* `Parakeet-RNNT-1.1B <https://huggingface.co/spaces/nvidia/parakeet-rnnt-1.1b>`__ space
* `Parakeet-TDT-1.1B <https://huggingface.co/spaces/nvidia/parakeet-tdt-1.1b>`__ space

.. _AED_model:

AED
---

Attention-based Encoder-Decoder (AED) models in NeMo are built on a FastConformer encoder and a Transformer decoder.

Here is the overall architecture of the AED models in NeMo:

.. image:: images/aed_model.png
    :align: center
    :alt: AED Model
    :scale: 50%

The Multi-task AED model is implemented in the :class:`~nemo.collections.asr.models.EncDecMultiTaskModel` class.
The model uses an aggregate tokenizer system with special tokens to control different tasks (ASR and translation) and languages during inference.
You can find the example config file for the Multi-task AED model at ``<NeMo_git_root>/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml``.
For more details about tokenizer configuration, refer to the :ref:`Tokenizer Configuration <canary_tokenizer_config>` section.
An example launch script for the Multi-task AED model can be found at ``<NeMo_git_root>/examples/asr/speech_multitask/speech_to_text_aed.py``.


.. _Conformer_model:

Conformer