Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic.
- 2023-07-26: We have released our training recipe for real-time AV-ASR, see here.
- 2023-06-16: We have released our training recipe for AutoAVSR, see here.
- 2023-03-27: We have released our AutoAVSR models for LRS3, see here.
This is the repository of Visual Speech Recognition for Multiple Languages, the successor of End-to-End Audio-Visual Speech Recognition with Conformers. Using this repository, you can achieve 19.1%, 1.0%, and 0.9% WER for visual, audio, and audio-visual speech recognition (VSR, ASR, and AV-ASR) on LRS3.
We provide a tutorial showing how to use our Auto-AVSR models to perform speech recognition (ASR, VSR, and AV-ASR), crop mouth ROIs, or extract visual speech features.
Demo clips are provided for English, Mandarin, Spanish, French, Portuguese, and Italian.
- Clone the repository and enter it locally:
git clone https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages
cd Visual_Speech_Recognition_for_Multiple_Languages
- Set up the environment.
conda create -y -n autoavsr python=3.8
conda activate autoavsr
- Install PyTorch, torchvision, and torchaudio by following the instructions here, then install the remaining packages:
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
- Download and extract a pre-trained model and/or language model from the model zoo to:
  - `./benchmarks/${dataset}/models`
  - `./benchmarks/${dataset}/language_models`
- [For VSR and AV-ASR] Install the RetinaFace or MediaPipe tracker (see the example below).
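If you choose the MediaPipe tracker, it is typically available from PyPI; the following is a minimal sketch of the install step (the RetinaFace tracker has its own installation instructions):

```Shell
# Minimal sketch: install the MediaPipe tracker from PyPI.
pip install mediapipe
```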
To evaluate a model on one of the supported benchmarks, run:

python eval.py config_filename=[config_filename] \
               labels_filename=[labels_filename] \
               data_dir=[data_dir] \
               landmarks_dir=[landmarks_dir]

- `[config_filename]` is the model configuration path, located in `./configs`.
- `[labels_filename]` is the labels path, located in `${lipreading_root}/benchmarks/${dataset}/labels`.
- `[data_dir]` and `[landmarks_dir]` are the directories of the original dataset and the corresponding landmarks.
- `gpu_idx=-1` can be added to switch from `cuda:0` to `cpu`.
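For example, an LRS3 visual-only evaluation might look like the command below; the config name, labels file, and directories are placeholders and should be replaced with your own paths:

```Shell
# Hypothetical example: evaluate a pre-trained VSR model on LRS3 on the CPU.
python eval.py config_filename=./configs/LRS3_V_WER19.1.ini \
               labels_filename=./benchmarks/LRS3/labels/test.csv \
               data_dir=/path/to/LRS3 \
               landmarks_dir=/path/to/LRS3_landmarks \
               gpu_idx=-1
```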
To perform speech recognition on a single audio/video file, run:

python infer.py config_filename=[config_filename] data_filename=[data_filename]

- `[data_filename]` is the path to the audio/video file.
- `detector=mediapipe` can be added to switch from the RetinaFace tracker to the MediaPipe tracker.
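For instance, to transcribe a single clip with the MediaPipe tracker (the config and video paths below are placeholders):

```Shell
# Hypothetical example: run speech recognition on one video file.
python infer.py config_filename=./configs/LRS3_V_WER19.1.ini \
                data_filename=/path/to/clip.mp4 \
                detector=mediapipe
```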
To crop the mouth ROI from a video file, run:

python crop_mouth.py data_filename=[data_filename] dst_filename=[dst_filename]

- `[dst_filename]` is the path where the cropped mouth video will be saved.
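A minimal example, with placeholder input and output paths:

```Shell
# Hypothetical example: crop the mouth ROI from a video and save it to a new file.
python crop_mouth.py data_filename=/path/to/clip.mp4 \
                     dst_filename=/path/to/clip_mouth.mp4
```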
We support a number of datasets for speech recognition:
- Lip Reading Sentences 2 (LRS2)
- Lip Reading Sentences 3 (LRS3)
- Chinese Mandarin Lip Reading (CMLR)
- CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS)
- GRID
- Lombard GRID
- TCD-TIMIT
The tables below list the pre-trained models available for download. The first LRS3 table contains the AutoAVSR models; the remaining tables contain the VSR models for multiple languages.

Lip Reading Sentences 3 (LRS3)

| Components | WER | URL | Size (MB) |
|---|---|---|---|
| Visual-only | | | |
| - | 19.1 | GoogleDrive or BaiduDrive (key: dqsy) | 891 |
| Audio-only | | | |
| - | 1.0 | GoogleDrive or BaiduDrive (key: dvf2) | 860 |
| Audio-visual | | | |
| - | 0.9 | GoogleDrive or BaiduDrive (key: sai5) | 1540 |
| Language models | | | |
| - | - | GoogleDrive or BaiduDrive (key: t9ep) | 191 |
| Landmarks | | | |
| - | - | GoogleDrive or BaiduDrive (key: mi3c) | 18577 |
Lip Reading Sentences 2 (LRS2)

| Components | WER | URL | Size (MB) |
|---|---|---|---|
| Visual-only | | | |
| - | 26.1 | GoogleDrive or BaiduDrive (key: 48l1) | 186 |
| Language models | | | |
| - | - | GoogleDrive or BaiduDrive (key: 59u2) | 180 |
| Landmarks | | | |
| - | - | GoogleDrive or BaiduDrive (key: 53rc) | 9358 |
Lip Reading Sentences 3 (LRS3)

| Components | WER | URL | Size (MB) |
|---|---|---|---|
| Visual-only | | | |
| - | 32.3 | GoogleDrive or BaiduDrive (key: 1b1s) | 186 |
| Language models | | | |
| - | - | GoogleDrive or BaiduDrive (key: 59u2) | 180 |
| Landmarks | | | |
| - | - | GoogleDrive or BaiduDrive (key: mi3c) | 18577 |
Chinese Mandarin Lip Reading (CMLR)

| Components | CER | URL | Size (MB) |
|---|---|---|---|
| Visual-only | | | |
| - | 8.0 | GoogleDrive or BaiduDrive (key: 7eq1) | 195 |
| Language models | | | |
| - | - | GoogleDrive or BaiduDrive (key: k8iv) | 187 |
| Landmarks | | | |
| - | - | GoogleDrive or BaiduDrive (key: 1ret) | 3721 |
CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS)

| Components | WER | URL | Size (MB) |
|---|---|---|---|
| Visual-only | | | |
| Spanish | 44.5 | GoogleDrive or BaiduDrive (key: m35h) | 186 |
| Portuguese | 51.4 | GoogleDrive or BaiduDrive (key: wk2h) | 186 |
| French | 58.6 | GoogleDrive or BaiduDrive (key: t1hf) | 186 |
| Language models | | | |
| Spanish | - | GoogleDrive or BaiduDrive (key: 0mii) | 180 |
| Portuguese | - | GoogleDrive or BaiduDrive (key: l6ag) | 179 |
| French | - | GoogleDrive or BaiduDrive (key: 6tan) | 179 |
| Landmarks | | | |
| - | - | GoogleDrive or BaiduDrive (key: vsic) | 3040 |
GRID

| Components | WER | URL | Size (MB) |
|---|---|---|---|
| Visual-only | | | |
| Overlapped | 1.2 | GoogleDrive or BaiduDrive (key: d8d2) | 186 |
| Unseen | 4.8 | GoogleDrive or BaiduDrive (key: ttsh) | 186 |
| Landmarks | | | |
| - | - | GoogleDrive or BaiduDrive (key: 16l9) | 1141 |
You can include `data_ext=.mpg` in your command line to match the video file extension of the GRID dataset (see the example below).
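For example, a GRID evaluation with the extension override might look like the following; all names and paths are placeholders:

```Shell
# Hypothetical example: evaluate on GRID, whose videos use the .mpg extension.
python eval.py config_filename=./configs/GRID_V_WER1.2.ini \
               labels_filename=./benchmarks/GRID/labels/test.csv \
               data_dir=/path/to/GRID \
               landmarks_dir=/path/to/GRID_landmarks \
               data_ext=.mpg
```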
Lombard GRID

| Components | WER | URL | Size (MB) |
|---|---|---|---|
| Visual-only | | | |
| Unseen (Front Plain) | 4.9 | GoogleDrive or BaiduDrive (key: 38ds) | 186 |
| Unseen (Side Plain) | 8.0 | GoogleDrive or BaiduDrive (key: k6m0) | 186 |
| Landmarks | | | |
| - | - | GoogleDrive or BaiduDrive (key: cusv) | 309 |
You can include `data_ext=.mov` in your command line to match the video file extension of the Lombard GRID dataset.
TCD-TIMIT

| Components | WER | URL | Size (MB) |
|---|---|---|---|
| Visual-only | | | |
| Overlapped | 16.9 | GoogleDrive or BaiduDrive (key: jh65) | 186 |
| Unseen | 21.8 | GoogleDrive or BaiduDrive (key: n2gr) | 186 |
| Language models | | | |
| - | - | GoogleDrive or BaiduDrive (key: 59u2) | 180 |
| Landmarks | | | |
| - | - | GoogleDrive or BaiduDrive (key: bnm8) | 930 |
If you use the AutoAVSR models or training code, please consider citing the following paper:
@inproceedings{ma2023auto,
author={Ma, Pingchuan and Haliassos, Alexandros and Fernandez-Lopez, Adriana and Chen, Honglie and Petridis, Stavros and Pantic, Maja},
booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels},
year={2023},
}
If you use the VSR models for multiple languages, please consider citing the following paper:
@article{ma2022visual,
title={{Visual Speech Recognition for Multiple Languages in the Wild}},
author={Ma, Pingchuan and Petridis, Stavros and Pantic, Maja},
journal={{Nature Machine Intelligence}},
volume={4},
pages={930--939},
year={2022},
url={https://doi.org/10.1038/s42256-022-00550-z},
doi={10.1038/s42256-022-00550-z}
}
Note that the code may only be used for comparative or benchmarking purposes, and code supplied under the License may only be used for non-commercial purposes.
[Pingchuan Ma](pingchuan.ma16[at]imperial.ac.uk)