# MONA LISA

This repository contains code for training Multimodal Orofacial Neural Audio (MONA) and Large Language Model (LLM) Integrated Scoring Adjustment (LISA). Together, MONA LISA sets a new state of the art for decoding silent speech, achieving 7.3% WER on open-vocabulary validation data.

See the preprint on arXiv.

## Paper reproduction

First, download the Gaddy 2020 dataset. Then, the following scripts can be modified and run in order on SLURM or a local machine. An individual model trains on one A100 for 24-48 hours depending on the loss functions (supTcon increases train time by ~75%). The full model sweep as done in the paper trains 60 models. For local runs, the sequence is also sketched after the list.

  0. run notebooks/tyler/2023-07-17_cache_dataset_with_attrs_.py

  1. run notebooks/tyler/batch_model_sweep.sh (2024-01-15_icml_models.py)
  2. run notebooks/tyler/2024-01-26_icml_pred.py
  3. run notebooks/tyler/batch_beam_search.sh (2024-01-26_icml_beams.py)
  4. run notebooks/tyler/2024-01-28_icml_figures.py
  5. run notebooks/tyler/2024-01-31_icml_TEST.py
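
On a local machine, the steps above reduce to executing the scripts in sequence. A minimal sketch, assuming the repository root as the working directory and that each script carries its own configuration (the batch_*.sh wrappers submit the same scripts to SLURM):

```python
# Hypothetical local driver; the paper sweep instead submits these via SLURM.
import subprocess

PIPELINE = [
    "notebooks/tyler/2023-07-17_cache_dataset_with_attrs_.py",  # 0) cache dataset
    "notebooks/tyler/2024-01-15_icml_models.py",                # 1) train models
    "notebooks/tyler/2024-01-26_icml_pred.py",                  # 2) predictions
    "notebooks/tyler/2024-01-26_icml_beams.py",                 # 3) beam search
    "notebooks/tyler/2024-01-28_icml_figures.py",               # 4) figures
    "notebooks/tyler/2024-01-31_icml_TEST.py",                  # 5) test-set eval
]

for script in PIPELINE:
    # check=True stops the pipeline at the first failing step
    subprocess.run(["python", script], check=True)
```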

## Brain-to-text '24 reproduction

  1. Train 10 models of the PyTorch NPTL baseline RNN
  2. Run beam search with the 5-gram model. The average validation WER should be approximately 14.6%
  3. run notebooks/tyler/2024-02-13_wiilet_competition.py. The validation WER of LISA should be approximately 13.7% without finetuning, or 10.2% with finetuning.

The final competition WER was 8.9%, which at the time of writing is rank 1.
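
The WER figures above are word error rates: word-level edit distance divided by the number of reference words. A quick sanity check with jiwer (installed in the environment setup below), using made-up strings:

```python
import jiwer

reference = "i am very glad to see you"
hypothesis = "i am glad to see you too"

# wer = (substitutions + deletions + insertions) / reference word count
# here: 1 deletion ("very") + 1 insertion ("too") = 2 / 7 ≈ 28.6%
print(f"{jiwer.wer(reference, hypothesis):.1%}")
```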

## Environment Setup

alternate setup

First, build the conda environment from environment.yml. Then:

> conda install libsndfile -c conda-forge
> 
> pip install jiwer torchaudio matplotlib scipy soundfile absl-py librosa numba unidecode praat-textgrids g2p_en einops opt_einsum hydra-core pytorch_lightning "neptune-client==0.16.18"

## Explanation of model outputs for CTC loss

For each timestep, the network predicts the probability of each of 38 characters ('abcdefghijklmnopqrstuvwxyz0123456789|_'), where | is the word boundary and _ is the "blank" token. The blank token separates repeated letters, like the "ll" in hello: [h,h,e,l,l,_,l,o]. It can optionally be inserted elsewhere too, like __hhhh_eeee_llll_lllooo___
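
A minimal sketch of "best path decoding" over such an output: take the argmax character at each timestep, collapse consecutive repeats, then drop the blanks (character set as above):

```python
import numpy as np

VOCAB = "abcdefghijklmnopqrstuvwxyz0123456789|_"  # '|' = word boundary, '_' = blank
BLANK = VOCAB.index("_")

def best_path_decode(logits: np.ndarray) -> str:
    """Greedy CTC decode: argmax per timestep, collapse repeats, drop blanks."""
    path = logits.argmax(axis=-1)  # (T,) most likely character per timestep
    collapsed = [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
    chars = [VOCAB[p] for p in collapsed if p != BLANK]
    return "".join(chars).replace("|", " ")

# toy check using the alignment from the text: the blank between the
# two l's keeps "hello" from collapsing to "helo"
toy = "hhell_lo"
onehot = np.eye(len(VOCAB))[[VOCAB.index(c) for c in toy]]
assert best_path_decode(onehot) == "hello"
```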

### Example prediction

Target text: after breakfast instead of working i decided to walk down towards the common

Example model prediction (argmax last dim) of shape (1821, 38):

______________________________________________________________a__f___tt__eerr|||b__rr_eaaakk___ff____aa____ss_tt___________________||____a_nd__|_ssttt___eaa_dd_||ooff||ww___o_rr_____kk_____ii___nngg________________________||_____a____t__||_______c______i___d_____eedd__________||tt___o__||_w_____a______l_kkk____________________||______o______w__t______________|||t____oowwwaarrrdddsss____||thhee_|||c_____o___mm__mm___oo_nn___________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

Beam search gives ' after breakfast and stead of working at cided to walk owt towards the common ', which here is the same as the result from "best path decoding" (argmax). In theory the two can differ, since beam search sums the probability of multiple alignments of the same text and is therefore more accurate.
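
A toy illustration of why summing alignments matters, with a two-timestep output over just 'a' and the blank (hypothetical numbers):

```python
import numpy as np

# per-timestep probabilities over ['a', '_']
probs = np.array([[0.4, 0.6],
                  [0.4, 0.6]])

# best path decoding: argmax picks '_' at both steps, decoding to ""
p_empty = probs[0, 1] * probs[1, 1]                 # "__"  -> 0.36

# but three alignments all collapse to "a", and their sum wins
p_a = (probs[0, 0] * probs[1, 1]                    # "a_"
       + probs[0, 1] * probs[1, 0]                  # "_a"
       + probs[0, 0] * probs[1, 0])                 # "aa"  -> 0.64

print(p_empty, p_a)  # 0.36 < 0.64: beam search prefers "a" over the argmax result
```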

## Montreal forced aligner

Instructions for getting phoneme alignments

https://montreal-forced-aligner.readthedocs.io/en/latest/first_steps/index.html#first-steps-align-pretrained

> conda create -n mfa -c conda-forge montreal-forced-aligner
> mfa model download acoustic english_us_arpa
> mfa model download dictionary english_us_arpa
> mfa validate --single_speaker -j 32 /data/data/T12_data/synthetic_audio/TTS english_us_arpa english_us_arpa
> mfa model download g2p english_us_arpa
> mfa g2p --single_speaker /data/data/T12_data/synthetic_audio/TTS english_us_arpa ~/Documents/MFA/TTS/oovs_found_english_us_arpa.txt --dictionary_path english_us_arpa
> mfa model add_words english_us_arpa ~/mfa_data/g2pped_oovs.txt
> mfa adapt --single_speaker -j 32 /data/data/T12_data/synthetic_audio/TTS english_us_arpa english_us_arpa /data/data/T12_data/synthetic_audio/adapted_bark_english_us_arpa
> mfa validate --single_speaker -j 32 /data/data/T12_data/synthetic_audio/TTS english_us_arpa english_us_arpa
# ensure no OOV (I had to manually correct a transcript due to a `{`)
> mfa adapt --single_speaker -j 32 --output_directory /data/data/T12_data/synthetic_audio/TTS /data/data/T12_data/synthetic_audio/TTS english_us_arpa english_us_arpa /data/data/T12_data/synthetic_audio/adapted_bark_english_us_arpa
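
MFA writes one Praat TextGrid per utterance into the output directory; the praat-textgrids package from the environment setup can read the aligned tiers. A minimal sketch with a hypothetical output file:

```python
import textgrids  # from the praat-textgrids package

# hypothetical file produced by the adapt step above
grid = textgrids.TextGrid(
    "/data/data/T12_data/synthetic_audio/TTS/example.TextGrid"
)

# MFA outputs "words" and "phones" interval tiers
for phone in grid["phones"]:
    print(f"{phone.text:8s} {phone.xmin:.3f}-{phone.xmax:.3f} s")
```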

### misc

Fast transfer of cache on Sherlock to local NVMe:

> cd $MAG/librispeech
> find . -type f | parallel -j 16 rsync -avPR {} $LOCAL_SCRATCH/librispeech/

or, to $SCRATCH:

> find . -type f | parallel -j 16 rsync -avPR {} $SCRATCH/librispeech/
