Speech to text on short audio file #169

rumbu13 · 2023-04-22T05:07:34Z

I am using faster-whisper to recognize short commands in a Rhasspy setup, so all audio files fed to faster-whisper are 1 to 5 seconds long. Are there any recommended parameters for transcribing such short audio files faster?

Currently, each voice command is transcribed on an Intel Core i5-8400T and takes around 18 s with the large-v2 model, which is too slow for intent recognition. I must use large-v2 because it is the only model that recognizes the intended commands correctly.

Thanks for any ideas.

guillaumekln · 2023-04-22T05:17:34Z

What faster-whisper options do you currently set, if any?

rumbu13 Apr 22, 2023
Author

I set up the model on CPU with the int8 compute type.

Transcription is done mostly with default parameters; the language ("ro") is specified to avoid time spent on detection. I have only played with beam_size: reducing it gives some improvement, but nothing significant (about 1 s).

guillaumekln Apr 22, 2023

It’s possible that your audio triggers the "temperature fallback", which makes the transcription much slower. That is how Whisper tries to recover from bad transcriptions by default.
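For context, the fallback can be pictured roughly as the loop below. This is a simplified sketch of the idea, not the actual Whisper or faster-whisper code: `DecodeResult` and `decode_once` are hypothetical stand-ins for a real decoding pass, the temperature schedule and thresholds are assumed to match the documented defaults, and the no-speech check is left out. The point is that a poor result at one temperature causes the whole decode to be repeated at the next one, so a single hard clip can cost several passes.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class DecodeResult:
    """Hypothetical container for the outcome of one decoding pass."""
    text: str
    avg_logprob: float        # average log-probability of the decoded tokens
    compression_ratio: float  # high values suggest repetitive output


def decode_with_fallback(
    decode_once: Callable[[float], DecodeResult],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: float = 2.4,
    log_prob_threshold: float = -1.0,
) -> Tuple[DecodeResult, int]:
    """Retry decoding at increasing temperatures until a result passes the checks.

    Returns the last result together with the number of decoding passes used.
    """
    if not temperatures:
        raise ValueError("at least one temperature is required")

    attempts = 0
    result = None
    for temperature in temperatures:
        result = decode_once(temperature)
        attempts += 1
        too_repetitive = result.compression_ratio > compression_ratio_threshold
        too_uncertain = result.avg_logprob < log_prob_threshold
        if not (too_repetitive or too_uncertain):
            break  # good enough, no further fallback needed
    return result, attempts
```

With a single temperature, such as `temperatures=(0.0,)`, the loop can only ever run one pass, which is why fixing the temperature to 0 bounds the decoding cost.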

Here are things you can try:

- Your CPU appears to have 6 cores, so add cpu_threads=6 when loading the model
- Use beam_size=1
- Disable the temperature fallback with temperature=0 (this might impact the transcription quality)
- If you don’t care about timestamps, disable them with without_timestamps=True
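Put together, the suggestions above might look like the following minimal sketch. The model size, device, compute type, and language come from this thread; the helper name `transcribe_command` and the file name `command.wav` are made up for illustration. The import is done lazily inside the function so the option dictionaries can be inspected without faster-whisper installed.

```python
# Options passed when loading the model (device and compute type as described in the thread).
MODEL_OPTIONS = {
    "device": "cpu",
    "compute_type": "int8",
    "cpu_threads": 6,  # one thread per physical core on a 6-core CPU
}

# Options passed to each transcribe() call.
TRANSCRIBE_OPTIONS = {
    "language": "ro",            # skip language detection
    "beam_size": 1,              # greedy search instead of the default beam search
    "temperature": 0,            # single temperature, so no fallback passes
    "without_timestamps": True,  # do not predict timestamp tokens
}


def transcribe_command(audio_path, model_size="large-v2"):
    """Transcribe one short audio file and return the text as a single string.

    Hypothetical helper: it loads the model on every call for simplicity.
    """
    from faster_whisper import WhisperModel  # imported here so the module loads without it

    model = WhisperModel(model_size, **MODEL_OPTIONS)
    segments, _info = model.transcribe(audio_path, **TRANSCRIBE_OPTIONS)
    # `segments` is a generator: the transcription actually runs while iterating.
    return " ".join(segment.text.strip() for segment in segments)


# Example usage (assumes a local file named command.wav):
# print(transcribe_command("command.wav"))
```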

Answer selected by rumbu13

rumbu13 Apr 22, 2023
Author

Thank you, there is some difference:

- cpu_threads=6: 2 s faster
- beam_size=1: 1 s faster
- temperature=0: no impact
- without_timestamps=True: 1 s faster

Still 12.6 seconds for 1 second of audio.

If interested, here is the gist: https://gist.github.com/rumbu13/ef85148016853d5f63250e4ccd6b0353
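For comparing options like the ones above, a small timing helper along these lines may be useful. It is only a sketch: `time_call` and `real_time_factor` are hypothetical names, not part of faster-whisper. When timing faster-whisper, the timed function needs to consume the segments generator, because the transcription only runs while the segments are iterated.

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Call func(*args, **kwargs) `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return durations


def real_time_factor(processing_seconds, audio_seconds):
    """Ratio of processing time to audio length; values above 1 are slower than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds
```

With the figures reported here, `real_time_factor(12.6, 1.0)` gives 12.6, meaning the transcription takes 12.6 times as long as the audio itself.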
