Guidance about settings for realtime STT on GPU #124
Comments
Hi Alex, to get real-time results, use models for engines that support "Intermediate Results". Currently, DeepSpeech/Coqui, Vosk and April all have this capability. I suspect you already know this. In my opinion Vosk provides the best accuracy, but April is not bad either.
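For context, "Intermediate Results" means the engine emits a growing partial hypothesis while audio is still arriving, then a final result per utterance. The sketch below mocks that pattern with plain strings purely for illustration; a real integration would call the engine's own API (for Vosk, for example, `KaldiRecognizer.PartialResult()` and `Result()`), and the function and chunk names here are made up.

```python
# Conceptual mock of an "Intermediate Results" STT stream: partial
# hypotheses while audio arrives, one final result at the end.

def stream_transcribe(chunks):
    """Yield ('partial', text) events as 'audio' arrives, then one final."""
    words = []
    for chunk in chunks:
        words.append(chunk)            # a real engine decodes audio here
        yield ("partial", " ".join(words))
    yield ("final", " ".join(words))

events = list(stream_transcribe(["turn", "on", "captions"]))
```

A caption UI can redraw the current line on every `partial` event instead of waiting for the utterance to end, which is what makes these engines feel real-time.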
Actually, Speech Note supports GPU acceleration only for Whisper and Faster Whisper models. Inference for April, Vosk and DeepSpeech is always done with CPU.
Unfortunately, there is no hidden option to speed up STT right now. Indeed, LiveCaptions uses exactly the same engine and models, so in theory there should be no difference 🤔. Perhaps VAD is the problem: Speech Note runs VAD processing before STT, and this might add extra delay... maybe. I will investigate what can be done to make STT more real-time.
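To illustrate how a VAD stage could add the suspected delay (this is an assumption about the pipeline, not confirmed Speech Note internals, and all names and thresholds below are hypothetical): a typical VAD gate holds frames back until it is confident speech has started, so the STT engine sees audio later than it would if frames were fed straight through.

```python
# Hypothetical VAD gate in front of STT. The 'onset' threshold is a
# made-up number; the point is that frames wait in a buffer until the
# detector is confident, which delays when STT first sees the audio.

def vad_gate(frames, is_speech, onset=3):
    """Release buffered frames only after `onset` consecutive speech
    frames; everything before that waits in the buffer."""
    buffered, streak, released = [], 0, []
    for frame in frames:
        buffered.append(frame)
        streak = streak + 1 if is_speech(frame) else 0
        if streak >= onset:
            released.extend(buffered)   # flush everything held so far
            buffered.clear()
    return released

# Frames: 0 = silence, 1 = speech. The first two speech frames are held
# until the third confirms an utterance has started.
out = vad_gate([0, 1, 1, 1, 1], is_speech=lambda f: f == 1)
```

A pass-through pipeline would forward each speech frame immediately; the gate above trades that latency for not sending stray noise to the recognizer.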
Just tested LiveCaptions and STT in that app is ridiculously fast. Wow, amazing!
Is this true? It's so fast I assumed it was running on the GPU.
That's right, only CPU. April, Vosk and DeepSpeech are fast without a GPU because they use different model architectures (usually older ones with significantly worse accuracy). Whisper is based on GPT, which requires a lot of computing power, so it's slow without GPU acceleration.
I primarily use this as an accessibility tool, and the closer to real-time my output is, the more effective the tool is for me, especially in meetings and events that I attend. I'd like guidance on the options that best fit my use case, as it's not obvious from the interface or from the documentation I've found.
I'm struggling to figure out which models/settings will produce output closest to real-time. https://github.com/abb128/LiveCaptions does a better job than I've been able to get with Speech Note (even with the same AprilASR model), so I'm sure there are better options I can use or that Speech Note can implement, but LiveCaptions doesn't support GPU-accelerated speech recognition as far as I can tell.