Explore silence detection in speech-to-text #379
Comments
Any progress on this? Compared to Python, the transcripts provided by Bumblebee are pretty bad: lots of repetition of sentences, missing text, etc. We are on the verge of giving up and moving to SaaS for this, unfortunately :(
PRs are definitely welcome.
Jonatan, Valim, so I thought of changing these two files:

I still need to review and test all the logic. But do you think this would be the place to implement this processor?
@tubedude unfortunately it doesn't fit into the usual logits processing approach. We generate the transcription token-by-token, and logits processing applies some transformation to the logits at each iteration.

While looking around, I noticed that huggingface/transformers made significant changes to long-form transcription within the last year. They added support for sequential transcription of long inputs, similar to openai-whisper, for improved transcription quality. The implementation involves several techniques, including the nospeech detection. They do use a logits processor as part of this, however not to alter the logits, but rather to accumulate information in the object state and extract it later, when deciding whether a chunk is silence (the authors actually consider it hacky, but that's what they did to match the openai implementation, ref). This hack doesn't really fit into our functional implementation; but regardless, it is only applicable within the new long-form implementation. The two main PRs with the new changes are huggingface/transformers#27492 and huggingface/transformers#27658.

So taking a step back, huggingface/transformers now has two separate approaches for long-form transcription: (a) "sequential" long-input generation (which does the nonspeech detection among other techniques), and (b) chunked generation with output merging. Our current implementation does (b). Maintaining both, especially with streaming, is most likely too much. Implementing (a) is a lot of work, and I think there are challenges related to serving and streaming, because the input slice points are not known upfront (offsets are adjusted on each iteration).

All that said, I think it may be worth looking at the PRs and the paper mentioned in those PRs, and considering a different implementation for long-form transcription. Given the complexity, I can't really point to anything directly actionable, and it's not something we can prioritize at the moment.
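For reference, a rough sketch of how the sequential long-form path (approach (a) above) is typically invoked in huggingface/transformers. This is not Bumblebee code; the parameter names (`condition_on_prev_tokens`, `no_speech_threshold`, `logprob_threshold`, `compression_ratio_threshold`) are taken from recent transformers releases and may differ between versions:

```python
# Hedged sketch of sequential long-form Whisper transcription in
# huggingface/transformers. Assumes a recent transformers version;
# option names and defaults may vary between releases.
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# `raw_audio` is assumed to be a 1-D float array sampled at 16 kHz,
# possibly much longer than the 30-second Whisper window.
inputs = processor(
    raw_audio,
    sampling_rate=16_000,
    return_tensors="pt",
    truncation=False,           # keep the full input instead of clipping to 30s
    padding="longest",
    return_attention_mask=True,
)

generated = model.generate(
    **inputs,
    return_timestamps=True,                       # required for long-form generation
    condition_on_prev_tokens=True,                # carry context across 30s windows
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),   # temperature fallback on retries
    logprob_threshold=-1.0,                       # retry a window if avg logprob is too low
    no_speech_threshold=0.6,                      # treat a window as silence above this
    compression_ratio_threshold=1.35,             # retry if the output is too repetitive
)

print(processor.batch_decode(generated, skip_special_tokens=True))
```

Note that the slice offsets are decided during generation (each window's start depends on where the previous one ended), which is the serving/streaming difficulty mentioned above.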
Whisper may hallucinate text when an audio chunk is silence or noise (see #377 (comment)). The openai-whisper implementation has `no_speech_threshold` and `logprob_threshold`, which may be related. From a quick search there are a few discussions around Whisper hallucination; it may be worth experimenting to see whether there's something we can incorporate into the current algorithm.