Seeking Strategies for Extending Whisper with a New Task (Speaker Diarization) #2045
ReinforcedKnowledge started this conversation in General
Hi!
I tried to extend Whisper to perform speaker diarization. The training worked, but the results are not good. The issue is twofold: there is no way to carry speaker identities from one segment to the next (Whisper only processes 30-second audio windows), and fine-tuning the model with a new special token introduced for this specific task makes the fine-tuning harder (or maybe I just don't have enough data, data of sufficient quality, or the right hyperparameter configuration).
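For context, here is a minimal sketch of what I mean by introducing a new special token, assuming the Hugging Face transformers implementation of Whisper (the token name `<|speakerturn|>` and the checkpoint are just illustrative):

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Register an extra special token that marks a speaker change in the transcript.
num_added = processor.tokenizer.add_tokens(["<|speakerturn|>"], special_tokens=True)

# Grow the decoder's embedding matrix so the new token gets a (randomly
# initialised) embedding that is then learned during fine-tuning.
if num_added > 0:
    model.resize_token_embeddings(len(processor.tokenizer))

# Training targets then interleave the token with the transcript, e.g.:
target = "<|speakerturn|> hello, how can I help? <|speakerturn|> I'd like a table for two."
labels = processor.tokenizer(target).input_ids
```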
But I was wondering whether there are any ideas on how Whisper could be improved in that regard. The constraint is to use the end-to-end Whisper model only, without relying on other tools to achieve a new task that wasn't part of the pre-training.
Or is Whisper's multi-task training format too restrictive for this?
Thanks in advance!