Release First release of 🍓 Ichigo! · janhq/ichigo

Model weights can be downloaded at:

Changelog: v0.2 vs v0.3

Phase	Aspect	v0.2	v0.3
Pretraining	Data Size	2.42M	3.87M
	Data Source	parler-tts/mls_eng_10k	facebook/multilingual_librispeech
	Data Synthetic Pipeline	Used WhisperVQ (old checkpoint: whisper-vq-stoks-medium-en+pl.model) to tokenize English-only audio.	Uses the latest checkpoint, whisper-vq-stoks-v3-7lang.model, to tokenize audio in 8 languages.
	Epoch	1	1
	Global batch size	480	480
	Learning Rate	2e-4	2e-4
	Warmup Steps	80	50
	Weight Decay	0.005	0.005
	Max length	512	512
	Precision	bf16	bf16
Instruction Phase	Data Size	929k	1.89M + 165k (phase 3)
	Preprocessing	Used rule-based filtering to remove all hard-to-pronounce prompts.	Uses rule-based filtering to remove hard-to-pronounce prompts, and rephrases certain LLM-generated responses to sound more natural and human-like.
	Data Synthetic Pipeline	Used the old text-to-speech checkpoint t2s-small-yt.model to generate audio, then whisper-vq-stoks-medium-en+pl.model to tokenize it.	Changed the t2s checkpoint to t2s-v1.1-small-en+pl.model and the WhisperVQ checkpoint to whisper-vq-stoks-v3-7lang.model.
	Epoch	5	1
	Global batch size	128	256
	Gradient Accumulation Steps per device	1	8
	Learning Rate	1e-4	7e-5 (1.5e-5 for phase 3)
	Warmup Steps	80	73 (8 for phase 3)
	Weight Decay	0.005	0.005
	Max length	1024	4096
	Precision	bf16	bf16
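
To put the warmup and batch-size changes in context, the sketch below estimates the number of optimizer steps per run from the data sizes and global batch sizes in the table. It assumes one sample per sequence (no packing), that the rounded data sizes are close to the real ones, and that phase 3 reuses the v0.3 global batch size of 256 for one epoch; none of these is stated in the changelog, so the results are rough estimates only.

```python
import math

# Data size, global batch size, epochs, and warmup steps are copied from the table.
# ASSUMPTION: phase 3 uses the same global batch size (256) and a single epoch.
RUNS = {
    "pretrain v0.2": {"samples": 2_420_000, "global_batch": 480, "epochs": 1, "warmup": 80},
    "pretrain v0.3": {"samples": 3_870_000, "global_batch": 480, "epochs": 1, "warmup": 50},
    "instruct v0.2": {"samples": 929_000, "global_batch": 128, "epochs": 5, "warmup": 80},
    "instruct v0.3": {"samples": 1_890_000, "global_batch": 256, "epochs": 1, "warmup": 73},
    "instruct v0.3 phase 3": {"samples": 165_000, "global_batch": 256, "epochs": 1, "warmup": 8},
}


def total_steps(samples, global_batch, epochs):
    """Estimated optimizer steps, assuming one sample per sequence (no packing)."""
    return math.ceil(samples / global_batch) * epochs


def summarize(run):
    """Return (estimated steps, warmup steps as a fraction of those steps)."""
    steps = total_steps(run["samples"], run["global_batch"], run["epochs"])
    return steps, run["warmup"] / steps


for name, run in RUNS.items():
    steps, frac = summarize(run)
    print(f"{name:22s} steps={steps:6d} warmup={run['warmup']:3d} ({frac:.2%})")

# Global batch = per-device batch x accumulation steps x number of devices,
# so with 8 accumulation steps the product (per-device batch x devices) is:
print("per-device batch x devices (instruct v0.3):", 256 // 8)
```

Under these assumptions, the v0.3 instruction-phase warmup (73 and 8 steps) works out to roughly 1% of the estimated steps in each phase.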

Task Type	v0.2	v0.3
Speech Multiturn	None	150k samples (mostly 2 turns, with around 10k having >= 4 turns)
Speech QA	679k samples	1.332M samples
Transcription	250k samples (using a special token to denote a transcription task)	400k samples (using 6 different prompts)
Noise Audio	None	8k samples (using Qwen2.5-72B to generate diverse synthetic answers for randomly generated sound tokens, with lengths matching the distribution of the Speech QA prompt)
Text-only	None	150k samples including: 100k multiturn + 50k single turn
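
As a quick sanity check on the task table above, the following sketch sums the per-task sample counts and prints each task's share of the instruction data. The counts are copied from the table (in thousands); the 150k multiturn figure is assumed to be a sample count like the other rows. The v0.2 rows sum to exactly 929k, matching the instruction-phase data size, while the v0.3 rows sum to 2.04M, slightly below the stated 1.89M + 165k, so the per-task figures are probably rounded and the shares should be read as approximate.

```python
# Sample counts in thousands, copied from the task table.
V02 = {"Speech QA": 679, "Transcription": 250}
V03 = {
    "Speech Multiturn": 150,
    "Speech QA": 1332,
    "Transcription": 400,
    "Noise Audio": 8,
    "Text-only": 150,
}


def shares(mix):
    """Return each task's fraction of the total sample count."""
    total = sum(mix.values())
    return {task: count / total for task, count in mix.items()}


for label, mix in (("v0.2", V02), ("v0.3", V03)):
    print(f"{label}: total = {sum(mix.values())}k samples")
    for task, share in shares(mix).items():
        print(f"  {task:17s} {share:6.1%}")
```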