Finetune Whisper model on LibriSpeech #1571

Open · wants to merge 13 commits into master

Conversation

marcoyang1998 (Collaborator) commented on Mar 28, 2024

This recipe fine-tunes a Whisper model on LibriSpeech, following #1466.

  • Update the results
  • Compare full fine-tuning and partial (e.g. encoder-only or decoder-only) fine-tuning
  • Compare AdamW and ScaledAdam
  • Compare compressed fbank features (Lilcom) and uncompressed fbank features (hdf5) as input
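
For reference, the pretrained checkpoints compared in this thread can be loaded with the openai-whisper package. A minimal sketch (the recipe's own checkpoint handling may differ):

```python
import whisper

# Any of the checkpoints discussed in this thread: small, small.en, medium,
# medium.en, large-v2, large-v3.
model = whisper.load_model("small.en")

print(model.dims.n_mels)  # 80 mel bins (128 for large-v3)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```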

marcoyang1998 (Collaborator, Author) commented on Mar 29, 2024

A comparison of decoding the Whisper model with fbank features stored in different formats. LilcomChunkyWriter stores lossily compressed fbank features, which causes a slight mismatch when running inference with the Whisper model. NumpyHdf5Writer stores uncompressed fbank features but requires more storage.

In general, uncompressed features are slightly better than compressed features. The difference is minor, except for large-v2. All WERs are obtained with greedy search.

| Model | Feature type | WER (test-clean/test-other) |
|---|---|---|
| small | Lilcom | 4.59/10.46 |
| small | hdf5 | 4.57/10.11 |
| small.en | Lilcom | 4.83/11.06 |
| small.en | hdf5 | 4.82/11.04 |
| medium | Lilcom | 4.02/7.53 |
| medium | hdf5 | 4.04/7.53 |
| medium.en | Lilcom | 3.72/7.69 |
| medium.en | hdf5 | 3.72/7.65 |
| large-v2 | Lilcom | 4.37/8.03 |
| large-v2 | hdf5 | 4.25/7.68 |
| large-v3 | Lilcom | 3.73/6.1 |
| large-v3 | hdf5 | 3.73/6.1 |
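
The two backends differ only in how the extracted fbank matrices are written to disk. A minimal sketch of the choice using lhotse (the paths and the plain 80-bin Fbank extractor are placeholders; the recipe may use a Whisper-specific extractor):

```python
from lhotse import CutSet, Fbank, FbankConfig, LilcomChunkyWriter, NumpyHdf5Writer

# Placeholder manifest path for illustration.
cuts = CutSet.from_file("data/librispeech_cuts_train-clean-100.jsonl.gz")

# Lossily compressed storage: smaller on disk, slight mismatch at inference time.
cuts_lilcom = cuts.compute_and_store_features(
    extractor=Fbank(FbankConfig(num_mel_bins=80)),
    storage_path="data/fbank_lilcom",
    storage_type=LilcomChunkyWriter,
    num_jobs=4,
)

# Uncompressed storage: exact features, but noticeably more disk space.
cuts_hdf5 = cuts.compute_and_store_features(
    extractor=Fbank(FbankConfig(num_mel_bins=80)),
    storage_path="data/fbank_hdf5",
    storage_type=NumpyHdf5Writer,
    num_jobs=4,
)
```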

marcoyang1998 (Collaborator, Author) commented on Apr 7, 2024

**Effect of freezing different modules**

Setup: 10 epochs, Lilcom-compressed features, fine-tuned on train-clean-100 only.

Fine-tuning small.en, Adam optimizer, lr=1e-5 (without fine-tuning: 4.83/11.06, greedy):

| Frozen modules | Trainable params | test-clean/test-other |
|---|---|---|
| None | 241M | Greedy: 3.35/7.22, Beam search: 3.28/6.63 |
| encoder | 154M | Greedy: 3.67/7.81, Beam search: 3.51/7.17 |
| decoder | 87M | Greedy: 3.14/7.37, Beam search: 3.02/6.98 |

Fine-tuning medium, Adam optimizer, lr=1e-5, 10 epochs, Lilcom-compressed features (without fine-tuning: 4.02/7.53, greedy):

| Frozen modules | Trainable params | test-clean/test-other |
|---|---|---|
| None | 762M | Greedy: 2.82/5.88, Beam search: 2.74/5.56 |
| encoder | 457M | Greedy: 3.2/6.41, Beam search: 3.02/6.0 |
| decoder | 356M | Greedy: 2.81/7.38, Beam search: 2.64/5.85 |
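
A minimal sketch of how such freezing could be done on the openai-whisper model object, i.e. disabling gradients for a submodule before building the optimizer (the recipe's actual fine-tuning script may handle this differently, e.g. via a command-line option):

```python
import whisper

def freeze(module):
    """Disable gradients for every parameter in the module."""
    for p in module.parameters():
        p.requires_grad = False

model = whisper.load_model("small.en")

# Corresponds to the "encoder" row above; use model.decoder for the "decoder" row.
freeze(model.encoder)

num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {num_trainable / 1e6:.0f}M")
```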

marcoyang1998 (Collaborator, Author):

**Effect of different learning rates**

Model: small.en (without fine-tuning: 4.83/11.06)

| Learning rate | test-clean/test-other |
|---|---|
| 1e-4 | 4.77/10.48 |
| 5e-5 | 3.8/8.12 |
| 1e-5 | 3.35/7.22 |
| 5e-6 | 3.24/7.01 |

Model: medium (without fine-tuning: 4.02/7.53)

| Learning rate | test-clean/test-other |
|---|---|
| 5e-5 | 6.81/14.76 |
| 1e-5 | 2.82/5.88 |
| 5e-6 | 2.79/5.74 |
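
A minimal sketch of the optimizer setup implied by the sweeps above, assuming plain AdamW over whatever parameters were left trainable (the recipe may instead use ScaledAdam or add a learning-rate schedule):

```python
import torch
import whisper

model = whisper.load_model("medium")

# Only the parameters left trainable (e.g. after freezing a module, see the
# sketch above) are passed to the optimizer; lr is the value swept in the tables.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=5e-6)
```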
