Model weights can be downloaded at:
Changelog: v0.2 vs v0.3
Overall Comparison
| Phase | Aspect | v0.2 | v0.3 |
|---|---|---|---|
| Pretraining | Data Size | 2.42M | 3.87M |
| | Data Source | parler-tts/mls_eng_10k | facebook/multilingual_librispeech |
| | Data Synthetic Pipeline | Uses WhisperVQ (old checkpoint: whisper-vq-stoks-medium-en+pl.model) to tokenize English-only audio. | Uses the latest checkpoint, whisper-vq-stoks-v3-7lang.model, for audio in 8 languages. |
| | Epoch | 1 | 1 |
| | Global Batch Size | 480 | 480 |
| | Learning Rate | 2e-4 | 2e-4 |
| | Warmup Steps | 80 | 50 |
| | Weight Decay | 0.005 | 0.005 |
| | Max Length | 512 | 512 |
| | Precision | bf16 | bf16 |
| Instruction Phase | Data Size | 929K | 1.89M + 165k (phase 3) |
| | Preprocessing | Rule-based filtering to remove all hard-to-pronounce prompts. | Rule-based filtering to remove hard-to-pronounce prompts, plus rephrasing of certain LLM-generated responses to sound more natural and human-like. |
| | Data Synthetic Pipeline | Uses the old text-to-speech checkpoint t2s-small-yt.model to generate audio, then whisper-vq-stoks-medium-en+pl.model to tokenize it. | Switches the t2s checkpoint to t2s-v1.1-small-en+pl.model and the WhisperVQ checkpoint to whisper-vq-stoks-v3-7lang.model. |
| | Epoch | 5 | 1 |
| | Global Batch Size | 128 | 256 |
| | Gradient Accumulation Steps per Device | 1 | 8 |
| | Learning Rate | 1e-4 | 7e-5, and 1.5e-5 for phase 3 |
| | Warmup Steps | 80 | 73, and 8 for phase 3 |
| | Weight Decay | 0.005 | 0.005 |
| | Max Length | 1024 | 4096 |
| | Precision | bf16 | bf16 |
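
As a rough sanity check on the schedules above, the number of optimizer steps can be estimated from the data size, global batch size, and epoch count. This is only a minimal sketch: it assumes every sample counts as exactly one sequence (no packing or filtering) and treats the rounded data sizes as exact. The helper name `steps_per_epoch` and the `runs` dictionary are illustrative and not part of any released training code.

```python
import math


def steps_per_epoch(num_samples: int, global_batch_size: int) -> int:
    """Approximate optimizer steps for one pass over the data."""
    return math.ceil(num_samples / global_batch_size)


# (samples, global batch size, epochs, warmup steps), copied from the comparison above.
# Assumption: the rounded data sizes are treated as exact sample counts.
runs = {
    "v0.2 pretraining": (2_420_000, 480, 1, 80),
    "v0.3 pretraining": (3_870_000, 480, 1, 50),
    "v0.2 instruction": (929_000, 128, 5, 80),
    "v0.3 instruction": (1_890_000, 256, 1, 73),
    "v0.3 instruction (phase 3)": (165_000, 256, 1, 8),
}

for name, (samples, batch_size, epochs, warmup) in runs.items():
    total_steps = steps_per_epoch(samples, batch_size) * epochs
    # Warmup expressed as a fraction of the estimated total steps.
    print(f"{name}: ~{total_steps} steps, warmup {warmup} ({warmup / total_steps:.2%})")
```

Under these assumptions, the v0.3 instruction-phase warmup values work out to roughly 1% of the estimated steps, but the actual step counts depend on how the training framework batches the data.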
Instruction Phase Data Task Types
| Task Type | v0.2 | v0.3 |
|---|---|---|
| Speech Multiturn | None | 150k samples (mostly 2 turns, with around 10k having >= 4 turns) |
| Speech QA | 679k samples | 1.332M samples |
| Transcription | 250k samples (using a special token to denote a transcription task) | 400k samples (using 6 different prompts) |
| Noise Audio | None | 8k samples (using Qwen2.5-72B to generate diverse synthetic answers for randomly generated sound tokens, with lengths matching the distribution of the Speech QA prompts) |
| Text-only | None | 150k samples, including 100k multiturn + 50k single turn |
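
To see how the instruction data mix shifted between versions, the per-task counts listed above can be tallied directly. The sketch below is illustrative only: it treats "None" as zero samples and uses the rounded figures as given, so the totals are approximate and need not match the Data Size row exactly. The names `task_counts_k`, `total_k`, and `share` are hypothetical helpers.

```python
# Sample counts per task type, in thousands, as listed above.
# Assumption: "None" is treated as 0 samples.
task_counts_k = {
    "Speech Multiturn": {"v0.2": 0, "v0.3": 150},
    "Speech QA": {"v0.2": 679, "v0.3": 1332},
    "Transcription": {"v0.2": 250, "v0.3": 400},
    "Noise Audio": {"v0.2": 0, "v0.3": 8},
    "Text-only": {"v0.2": 0, "v0.3": 150},
}


def total_k(version: str) -> int:
    """Total instruction samples (in thousands) for one version."""
    return sum(counts[version] for counts in task_counts_k.values())


def share(version: str, task: str) -> float:
    """Fraction of a version's instruction data belonging to one task type."""
    return task_counts_k[task][version] / total_k(version)


for version in ("v0.2", "v0.3"):
    print(f"{version}: ~{total_k(version)}k samples in total")
    for task in task_counts_k:
        print(f"  {task}: {share(version, task):.1%}")
```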
Performance