
Guidance about settings for realtime STT on GPU #124

Open
alexschneider opened this issue Apr 23, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@alexschneider

I primarily use this as an accessibility tool, and the closer to realtime my output is, the more effective the tool is for me, especially in meetings and events that I attend. I'd like guidance on which options are best for my use case, as it's not obvious from the interface or in the documentation I've found.

I'm struggling to figure out which models/settings will produce output closest to realtime. https://github.com/abb128/LiveCaptions does a better job than I've been able to get with Speech Note (even with the same AprilASR model), so I'm sure there are better options I can use or that Speech Note can implement, but LiveCaptions doesn't support GPU-accelerated speech recognition as far as I can tell.

@mkiol mkiol added the enhancement New feature or request label Apr 24, 2024
@mkiol
Owner

mkiol commented Apr 24, 2024

Hi Alex,

To get realtime results, use models for engines that support "Intermediate Results". Currently, the DeepSpeech/Coqui, Vosk and April engines all have this capability. I suspect you already know this. In my opinion, Vosk provides the best accuracy, but April is not bad either.
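To make the "Intermediate Results" point concrete, here is a back-of-the-envelope sketch of perceived latency. All numbers (utterance length, chunk size, decode time) are illustrative assumptions, not measurements of any of these engines:

```python
# Illustrative latency sketch: intermediate (partial) results vs. final-only.
# All numbers below are assumptions for illustration, not measurements.

def perceived_latency(utterance_s, chunk_s, processing_s, intermediate):
    """Seconds from when a word is spoken until *some* text for it appears.

    intermediate=True  -> the engine emits a partial result after every chunk
    intermediate=False -> the engine emits text only after the utterance ends
    """
    if intermediate:
        # Worst case: the word was spoken just after the last chunk boundary.
        return chunk_s + processing_s
    # Final-only: on average a word waits half the utterance, plus decoding.
    return utterance_s / 2 + processing_s

# Assumed values: a 4 s utterance, 0.25 s audio chunks, 0.1 s decode per step.
with_partials = perceived_latency(4.0, 0.25, 0.1, intermediate=True)
final_only = perceived_latency(4.0, 0.25, 0.1, intermediate=False)
print(f"with partials: ~{with_partials:.2f} s, final-only: ~{final_only:.2f} s")
# -> with partials: ~0.35 s, final-only: ~2.10 s
```

This is why engines with intermediate results (Vosk, April, DeepSpeech/Coqui) feel realtime even when total decode time is similar: text appears per chunk instead of per utterance.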

> but LiveCaptions doesn't support GPU accelerated speech recognition as far as I can tell

Actually, Speech Note supports GPU acceleration only for Whisper and Faster Whisper models. Inference for April, Vosk and DeepSpeech is always done on the CPU.

> LiveCaptions does a better job than I've been able to get with speechnote

Unfortunately, there is no hidden option to speed up STT right now. Indeed, LiveCaptions uses exactly the same engine and models, so in theory there should be no difference 🤔. Perhaps VAD is the problem. Speech Note runs VAD processing before STT, and this might add extra delay... maybe.
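The VAD hypothesis can be quantified with a small sketch. Assuming (hypothetically) that the VAD works on fixed-size frames and must see several consecutive speech frames before it starts forwarding audio to the STT engine, the added delay would be roughly:

```python
# Hypothetical sketch of the extra latency a VAD stage can add in front of STT.
# Frame size, trigger threshold and padding are assumptions for illustration,
# not Speech Note's actual values.

def vad_added_latency_ms(frame_ms, trigger_frames, padding_ms=0):
    """Delay before audio starts flowing to the STT engine: the VAD must
    observe `trigger_frames` consecutive speech frames of `frame_ms` each,
    plus any look-back padding it prepends to the utterance."""
    return frame_ms * trigger_frames + padding_ms

# E.g. 30 ms frames, 10 frames to trigger, 300 ms of look-back padding.
delay = vad_added_latency_ms(frame_ms=30, trigger_frames=10, padding_ms=300)
print(f"VAD adds ~{delay} ms before the first partial result can appear")
# -> VAD adds ~600 ms before the first partial result can appear
```

Even a few hundred milliseconds of gating like this would be very noticeable in a live-captioning scenario, which could explain the perceived gap versus LiveCaptions.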

I will investigate what can be done to make STT more real-time.

@mkiol
Owner

mkiol commented Apr 24, 2024

Just tested LiveCaptions and STT in that app is ridiculously fast. Wow, amazing!

@KUKHUA

KUKHUA commented Jun 28, 2024

> Inference for April, Vosk and DeepSpeech is always done with CPU.

Is this true? It's so fast I thought it was running on the GPU.

@mkiol
Owner

mkiol commented Jun 28, 2024

> Is this true? It's so fast I thought it was running on the GPU.

That's right, CPU only. April, Vosk and DeepSpeech are fast without a GPU because they use different model architectures (usually older ones, with significantly lower accuracy). Whisper is based on a GPT-style Transformer, which requires a lot of computing power, so it's slow without GPU acceleration.
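A rough sense of why a GPT-style decoder is so much heavier: a Transformer needs on the order of 2 FLOPs per parameter per generated token (a common rule of thumb). The parameter counts and token rate below are ballpark assumptions for illustration only, not exact figures for either project:

```python
# Back-of-the-envelope compute comparison (all figures are rough assumptions).

def decode_gflops_per_second(params_millions, tokens_per_s):
    """~2 FLOPs per parameter per autoregressive token (rule of thumb)."""
    return 2 * params_millions * 1e6 * tokens_per_s / 1e9

# Assumptions: a Whisper-small-sized model (~244M params) emitting ~15 tok/s,
# vs. a small Vosk/Kaldi-style acoustic model (~10M params), treated with the
# same rule of thumb purely for a size comparison.
whisper_ish = decode_gflops_per_second(244, 15)
vosk_ish = decode_gflops_per_second(10, 15)
print(f"whisper-ish: ~{whisper_ish:.1f} GFLOP/s, vosk-ish: ~{vosk_ish:.1f} GFLOP/s")
# -> whisper-ish: ~7.3 GFLOP/s, vosk-ish: ~0.3 GFLOP/s
```

Under these assumptions the large model needs well over an order of magnitude more compute per second of speech, which is why it only feels realtime with GPU acceleration.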
