Guidance about settings for realtime STT on GPU #124
Comments
Hi Alex, to get real-time results, use models for engines that support "Intermediate Results". Currently, DeepSpeech/Coqui, Vosk and April all have this capability. I suspect you already know this. In my opinion Vosk provides the best accuracy, but April is not bad either.
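For context, "Intermediate Results" means the engine emits a growing partial hypothesis while audio is still arriving, then a final result per utterance. The sketch below mocks that pattern with plain strings purely for illustration; a real integration would call the engine's own API (for Vosk, for example, `KaldiRecognizer.PartialResult()` and `Result()`), and the function and chunk names here are made up.

```python
# Conceptual mock of an "Intermediate Results" STT stream: partial
# hypotheses while audio arrives, one final result at the end.

def stream_transcribe(chunks):
    """Yield ('partial', text) events as 'audio' arrives, then one final."""
    words = []
    for chunk in chunks:
        words.append(chunk)            # a real engine decodes audio here
        yield ("partial", " ".join(words))
    yield ("final", " ".join(words))

events = list(stream_transcribe(["turn", "on", "captions"]))
```

A caption UI can redraw the current line on every `partial` event instead of waiting for the utterance to end, which is what makes these engines feel real-time.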
Actually, Speech Note supports GPU acceleration only for Whisper and Faster Whisper models. Inference for April, Vosk and DeepSpeech is always done with CPU.
Unfortunately, there is no hidden option to speed up STT right now. Indeed, LiveCaptions uses exactly the same engine and models, so in theory there should be no difference 🤔. Perhaps VAD is the problem: Speech Note runs VAD processing before STT, and this might add extra delay... maybe. I will investigate what can be done to make STT more real-time.
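To illustrate how a VAD stage could add the suspected delay (this is an assumption about the pipeline, not confirmed Speech Note internals, and all names and thresholds below are hypothetical): a typical VAD gate holds frames back until it is confident speech has started, so the STT engine sees audio later than it would if frames were fed straight through.

```python
# Hypothetical VAD gate in front of STT. The 'onset' threshold is a
# made-up number; the point is that frames wait in a buffer until the
# detector is confident, which delays when STT first sees the audio.

def vad_gate(frames, is_speech, onset=3):
    """Release buffered frames only after `onset` consecutive speech
    frames; everything before that waits in the buffer."""
    buffered, streak, released = [], 0, []
    for frame in frames:
        buffered.append(frame)
        streak = streak + 1 if is_speech(frame) else 0
        if streak >= onset:
            released.extend(buffered)   # flush everything held so far
            buffered.clear()
    return released

# Frames: 0 = silence, 1 = speech. The first two speech frames are held
# until the third confirms an utterance has started.
out = vad_gate([0, 1, 1, 1, 1], is_speech=lambda f: f == 1)
```

A pass-through pipeline would forward each speech frame immediately; the gate above trades that latency for not sending stray noise to the recognizer.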
Just tested LiveCaptions and STT in that app is ridiculously fast. Wow, amazing!
Is this true? It's so fast I assumed it was running on the GPU.
That's right, only CPU. April, Vosk and DeepSpeech are fast without a GPU because they use different model architectures (usually older ones with significantly worse accuracy). Whisper is based on GPT, which requires a lot of computing power, so it's slow without GPU acceleration.
I primarily use this as an accessibility tool, and the closer to real-time my output is, the more effective the tool is for me, especially in meetings and events that I attend. I'd like guidance on the options that best fit my use case, as it's not obvious from the interface or from the documentation I've found.
I'm struggling to figure out which models/settings will produce output closest to real-time. https://github.com/abb128/LiveCaptions does a better job than I've been able to get with Speech Note (even with the same AprilASR model), so I'm sure there are better options I can use or that Speech Note can implement, but LiveCaptions doesn't support GPU-accelerated speech recognition as far as I can tell.