Standalone Faster-Whisper-XXL features #231

Purfview · 2024-04-06T16:04:16Z

Purfview
Apr 6, 2024
Maintainer

Includes all Standalone Faster-Whisper features +the additional ones mentioned below.
Includes all needed libs.

Vocal extraction model:

--ff_mdx_kim2: Preprocess audio with MDX23 Kim vocal v2 model (thanks to Kimberley Jensen). [Better than HT Demucs v4 FT]

Alternative VAD (Voice activity detection) methods:

--vad_method choices:

silero_v3 - Generally less accurate than v4, but doesn't have some quirks of v4.
silero_v4 - Same as silero_v4_fw. Runs original Silero's code instead of adapted one.
silero_v5 - Same as silero_v5_fw. Runs original Silero's code instead of adapted one.
silero_v4_fw - Default model. Most accurate Silero version, has some non-fatal quirks.
silero_v5_fw - Bad accuracy. Not a VAD, it's Random Detector of Some Speech :), has various fatal quirks. Avoid!
pyannote_v3 - The best accuracy, supports CUDA.
pyannote_onnx_v3 - Lite version of pyannote_v3. Similar accuracy to Silero v4, maybe a bit better, supports CUDA.
webrtc - Low accuracy, outdated VAD. Takes only 'vad_min_speech_duration_ms' & 'vad_speech_pad_ms'.
auditok - Actually it's not VAD, it's AAD - Audio Activity Detection.

Speaker Diarization:

--diarize choices:

pyannote_v3.0 - Fastest for CPU
pyannote_v3.1 - Same as v3.0 but should be faster with CUDA
reverb_v1 - Allegedly better than pyannote v3
reverb_v2 - The slowest, allegedly the best

For more read and post there -> Speaker Diarization
Legal notice: Reverb models are only for personal non-profit use.

Latest CTranslate2:

Up to ~26% faster on CPU with the int8 quantizations.
Flash attention support, that's CUDA, but the benchmarks shows no effect on the performance.

despairTK · 2024-04-07T03:41:27Z

despairTK
Apr 7, 2024

I really like the new parameter --vad_alt_method. Among them, silero_v3/silero_v4/pyannote_onnx_v3 are much better than the original VAD.

For example, there will be some gaps in the original VAD, and for example, sentences starting with "So" will often have a delayed start of the timeline. These issues are resolved in the silero_v3/silero_v4/pyannote_onnx_v3 parameters.

Finally, let me ask, which of the three parameters of silero_v3/silero_v4/pyannote_onnx_v3 has the best test results? Or what are their characteristics?

3 replies

Purfview Apr 7, 2024
Maintainer Author

First why I've added alternatives, because Silero v4 has problems like false-positive detection on nearly silent parts and other imperfections:
snakers4/silero-vad#396
snakers4/silero-vad#369

...sentences starting with "So" will often have a delayed start of the timeline

Did you evaluate that looking at a transcription?

I really like the new parameter --vad_alt_method. Among them, silero_v3/silero_v4/pyannote_v3 are much better than the original VAD.

Actually the original VAD runs exactly same model as "silero_v4". 😌
"silero_v4" runs original code from silero's repo, when the original VAD runs the same model with adapted code.

Finally, let me ask, which of the three parameters of silero_v3/silero_v4/pyannote_onnx_v3 has the best test results?

Probably neither of them are best, I not tested them much yet.
Did one quick test with "pyannote_onnx_v3", for my disappointment it was less accurate than original, it missed few segments.

despairTK Apr 7, 2024

Well...after replying to you, I transcribed dozens of audios, ranging from 10 minutes to 2 hours, and the audio language is English. After comparing the transcription results, I think silero_v3 is the best choice.

Purfview Apr 7, 2024
Maintainer Author

After comparing the transcription results,

That's not how VAD evaluation works, you are looking at whisper's randomness not at VAD's accuracy.

To evaluate VAD you need to look only at VAD's timestamps, you can get them with --vad_dump, then those can be loaded in SE and looked on the waveform.

And VAD params needs to be adjusted so you can see accuracy better, like vad_speech_pad_ms=0 and vad_min_silence_duration_ms=0.
And the same threshold doesn't convert to a same responsiveness on the different methods.

tristan-mcinnis · 2024-04-09T06:29:53Z

tristan-mcinnis
Apr 9, 2024

any hope of doing something similar for mac in the future?

5 replies

Purfview Apr 9, 2024
Maintainer Author

Probably there won't be macOS builds for this.

dgoryeo Apr 9, 2024

I guess no chance of a python package either?

Purfview Apr 9, 2024
Maintainer Author

I guess no chance of a python package either?

Yeap, it's out of the scope of this repo.

tristan-mcinnis Apr 11, 2024

Sorry, saw a comment here earlier about 'why post' this here but now its gone -- I was trying to keep it in the thread of this 'XXL' version. For example, if there can be an XXL version why not as if there will be a '-MAC' version in the future... who knows - just asking :)

Separately, for others who may find this thread and dismayed at no mac support there is a lot of work being done with 'whisperkit' which also does great work. https://github.com/argmaxinc/WhisperKit

The future of high-quality transcriptions is very bright.

aehlke May 17, 2024

WhisperKit has no plans to add VAD though. I plan to do it using a flutter port I found of silero

Sonnenfleck · 2024-04-21T05:55:01Z

Sonnenfleck
Apr 21, 2024

A little annoyance with --ff_mdx_kim2 in r192.3.3 [XXL], which otherwise works great (well, after I realized I had to change my scripts because it's now faster-whisper-xxl.exe instead of whisper-faster-xxl.exe 😉 ):

I'm running Faster-Whisper-XXL in a Nextcloud folder (with a cronjob checking if new audio files have been synchronized, then running faster-whisper-xxl). So far, this worked fine, but in r192.3.3 with MDX filtering enabled, it seems first the *_mdx.wav file is created and then it's moved to a temp folder (?). This move fails because Nextcloud already tries to sync the mdx file, and this leads to whisper-faster-xxl just quitting with an error that the *_mdx.wav file is already in use.

I now set the Nextcloud rules to just ignore *_mdx.wav files, but would it be possible to create them in a temp folder from the start?

5 replies

Purfview Apr 21, 2024
Maintainer Author

it seems first the *_mdx.wav file is created and then it's moved to a temp folder (?)

It's intermediate file, first it's written to the same folder as the original input file then it's deleted. No "movements" of it.
If Nextcloud app tries to do anything with it then of course deletion fails.

would it be possible to create them in a temp folder

Should be possible.

Sonnenfleck Apr 22, 2024

I'm confused: If the mdx file is deleted before transcription starts, then what file is whisper running on?
At least as I observe it, the problem here is: As soon as mdx filtering is finished, Nextcloud grabs the _mdx.wav to sync it, but before it finished syncing (which would be just a few seconds later), faster-whisper tries to delete the _mdx.wav but can't, because it's in use. Shouldn't the intermediate file be deleted after transcription? Or does VAD create another intermediate file (if yes, then it's not visible in the same folder)?

Purfview Apr 22, 2024
Maintainer Author

I'm confused: If the mdx file is deleted before transcription starts, then what file is whisper running on?

The intermediate file is loaded into variables further and it's not needed in the physical form anymore.
Yesterday I've made a quick patch to use a temp folder but not tested it yet, I'll try to release it today.

Or does VAD create another intermediate file (if yes, then it's not visible in the same folder)?

VAD's "intermediate" is created only in memory.

Sonnenfleck Apr 22, 2024

Ah, I get it, thanks for clarifying and for implementing the patch. :)

Purfview Apr 22, 2024
Maintainer Author

Implemented in r192.3.4

Herzfrequenz21 · 2024-04-27T19:05:07Z

Herzfrequenz21
Apr 27, 2024

Do I need to use some kind of tag to make the recognition against a little noise or soft music better?

14 replies

dgoryeo May 2, 2024

You're right, I thought the memory error was at the point of vocal isolation.

Herzfrequenz21 May 5, 2024

This has nothing to do with the audio length.

How can I make the memory be used from the CPU as before?
This line now specifies CUDA:
Standalone Faster-Whisper-XXL r192.1.1.1 running on: CUDA

Purfview May 6, 2024
Maintainer Author

Use --device cpu

1001ruchka May 6, 2024

This has nothing to do with the audio length.

How can I make the memory be used from the CPU as before? This line now specifies CUDA: Standalone Faster-Whisper-XXL r192.1.1.1 running on: CUDA

Try using CUDA with int8:
--compute_type=int8_float16 or --compute_type=int8
With these parameters I managed to run large_v2 on 4GB of video memory.

for Russian:

-prompt default

faster-whisper-xxl.exe --language ru --model "large-v2" --compute_type=int8_float16 --sentence -prompt default --beep_off --print_progress --vad_alt_method pyannote_v3 --ff_mdx_kim2 --mdx_device cpu "video.mp4"

or

-prompt None

faster-whisper-xxl.exe --language ru --model "large-v2" --compute_type=int8_float16 --sentence -prompt None --beep_off --print_progress --vad_alt_method pyannote_v3 --ff_mdx_kim2 --mdx_device cpu "video.mp4"

or

-prompt auto

faster-whisper-xxl.exe --language ru --model "large-v2" --compute_type=int8_float16 --sentence -prompt auto --beep_off --print_progress --vad_alt_method pyannote_v3 --ff_mdx_kim2 --mdx_device cpu "video.mp4"

Herzfrequenz21 May 21, 2024

Use --device cpu

It's taking me a long time. I've never been able to wait.
I haven't tried the option suggested below yet.

koebbe14 · 2024-05-22T13:36:45Z

koebbe14
May 22, 2024

Such a great tool. Especially for those who aren't very saavy in Python or command line! Thanks for creating!

Is it possible to perform speaker diarization with this standalone version?

9 replies

koebbe14 Jul 15, 2024

I would throw in $50

dgoryeo Jul 15, 2024

You can count on $20 from me as well.

Purfview Jul 15, 2024
Maintainer Author

A new thread would be better place for this -> #281
Make a donation and post there, or I'll update there with the amount donated.

...posting PayPal addresses or other personal information might not be so clever...

If you don't trust me with your email "address" then how you can trust me running the program?! 😉

thbaero Jul 15, 2024

I'll do the rest of it to make the 100£ ;)
Thanks for the others to chip in.

Let's continue on the other thread. #281

thbaero Jul 16, 2024

@dgoryeo - only 10£ will be needed in the end

MahmoudAshraf97 · 2024-07-02T12:15:26Z

MahmoudAshraf97
Jul 2, 2024

Hey @Purfview , I was wondering if you have (or willing to run) any benchmarks that compare pyannote_v3 with silero_v5
Thanks

3 replies

Purfview Dec 10, 2024
Maintainer Author

Sorry for late answer, probably you are aware now that silero_v5 is not good, worst from all Silero versions.

mjamil85 Dec 11, 2024

In my testing with many audio (english or japanese) samples, silero v5 is the worst one.

My final test result, [--vad_method default (silero_v4_fw)], gives better results.

dgoryeo Dec 12, 2024

@mjamil85 , did you get to test silero_v3 as well?

rodnvs · 2024-07-30T15:37:21Z

rodnvs
Jul 30, 2024

Hi @Purfview, I did a test with the --ff_mdx_kim2 feature and it took a long time to complete, about 45min for a 10min video. Is the voice extraction feature processed using the GPU, or CPU?

3 replies

1001ruchka Jul 30, 2024

If you have an NVIDIA graphics card with CUDA support and you didn't specify --mdx_device cpu - CUDA should be used. On my NVIDIA 3060 12GB, processing ff_mdx_kim2 for a 90 minute video takes no more than 3-5 minutes. Please note ff_mdx_kim2 is quite demanding on video memory, my video memory usage is over 8GB during ff_mdx_kim2.

rodnvs Jul 31, 2024

Yep, I forgot about it, my bad.
Here are the results from a new test using the GPU, on a RTX 3050 (laptop):
The UVR5 still very fast compared to the built in extractor. Any thoughts?

>> Transcribe only
Command: faster-whisper-xxl.exe test_video.mp4 --language es --task transcribe --model medium --compute_type auto --device cuda --standard --print_progress --vad_alt_method pyannote_v3 --vad_min_silence_duration_ms 2000
Start Time: 2024-07-31 11:40:32.919825
End Time: 2024-07-31 11:42:43.851822
Duration: 2m10s
Result: Successful
-------------------------------------------------
>> Voice extraction + Transcribe
Command: faster-whisper-xxl.exe test_video.mp4 --language es --task transcribe --model medium --compute_type auto --device cuda --standard --print_progress --vad_alt_method pyannote_v3 --vad_min_silence_duration_ms 2000 --ff_mdx_kim2 --mdx_device cuda
Start Time: 2024-07-31 11:45:37.965749
End Time: 2024-07-31 12:01:47.287552
Duration: 16m09s
Result: Successful
-------------------------------------------------
>> UVR5 Python script voice extraction only (https://github.com/nomadkaraoke/python-audio-separator)
Command: audio-separator test_video.mp4 --model_filename Kim_Vocal_2.onnx --output_format wav --single_stem Vocals
Start Time: 2024-07-31 14:29:06.907680
End Time: 2024-07-31 14:30:15.179309
Duration: 1m08s
Result: Successful

Purfview Jul 31, 2024
Maintainer Author

about 45min for a 10min video

That's too slow for GPU. Try --mdx_device cpu in your benchmark test.
EDIT: I think that's too slow even for CPU.

RTX 3050 (laptop)

How much VRAM?

Is the voice extraction feature processed using the GPU, or CPU?

It can use both.

Xylemm · 2024-08-13T09:41:31Z

Xylemm
Aug 13, 2024

Is there a series of parameters that work best to capture very short audio clips?

My clips with just "Yes" or "Let's go" produce a blank transcription. I've adjusted --vad_min_speech_duration_ms and others, but nothing catches these short clips.

2 replies

dgoryeo Aug 13, 2024

Which vad method do you use? I believe Auditok would perform best for your scenario.
In addition, if your clips are super short (say less than 15 sec), I think you should disable VAD.

Xylemm Aug 14, 2024

I fixed this with --vad_filter=false

Now it perfectly transcribes every file down to about 1 or 2 seconds long.

Thank you!

JustAndreww · 2024-09-21T15:00:53Z

JustAndreww
Sep 21, 2024

Is there any way to make auto dialogs to work?
Quite often I see following lines:

<Time1>: Hello.
<Time2>: Hi.

instead of

<Time1>:- Hello.</br>
- Hi.

Thanks!

0 replies

Ad4mts · 2024-10-22T09:33:35Z

Ad4mts
Oct 22, 2024

Since this faster Whisper model has been modified from the original version, could you please upload the source code so the community can contribute and add new features or im i missing something? Thanks!

8 replies

soenneker Oct 30, 2024

@Purfview How are you compiling the python project into an .exe?

Ad4mts Oct 30, 2024

@Purfview Totally a fair ask, i think, im not looking to contribute directly, i think just it d be helpful to see the source since this standalone has cool tweaks like VAD and --sentence for movies, wich is not implemented in the faster whisper or whisper-ctranslate2 repo. Having it open will make it easier for us to understand and maybe build on it.
it is just a friendly suggestion. at the end it is up to you of course ;)

Purfview Oct 30, 2024
Maintainer Author

@soenneker How are you compiling the python project into an .exe?

https://github.com/pyinstaller/pyinstaller

@Ad4mts im not looking to contribute... ...wich is not implemented in the faster whisper or whisper-ctranslate2 repo.

If you want something implemented somewhere you go post there in corresponding repos. And leave programming things to the programmers they know what to do and where to look for sources, no need to fantasize about things.

Ed1ks Dec 2, 2024

@Purfview
Your standalone faster whisper is great,
Could you please provide the source code or parameters you are using in your cli/wrapper?
If I am using the original faster-whisper, then the output is different. I want to see Parameters, which this standalone is doing.

ClaireCJS Dec 2, 2024

@Purfview Your standalone faster whisper is great, Could you please provide the source code or parameters you are using in your cli/wrapper? If I am using the original faster-whisper, then the output is different. I want to see Parameters, which this standalone is doing.

I end up logging my query and output, and if i run it several times, the log shows all the runs.

Maybe a --logging option

but .log not .txt or lyrics will get overwritten

Standalone Faster-Whisper-XXL features #231

Purfview Apr 6, 2024 Maintainer

Vocal extraction model:

Alternative VAD (Voice activity detection) methods:

Speaker Diarization:

Latest CTranslate2:

Replies: 10 comments · 52 replies

Purfview Apr 7, 2024 Maintainer Author

Purfview Apr 7, 2024 Maintainer Author

Purfview Apr 9, 2024 Maintainer Author

Purfview Apr 9, 2024 Maintainer Author

Purfview Apr 21, 2024 Maintainer Author

Purfview Apr 22, 2024 Maintainer Author

Purfview Apr 22, 2024 Maintainer Author

Purfview May 6, 2024 Maintainer Author

Purfview Jul 15, 2024 Maintainer Author

Purfview Dec 10, 2024 Maintainer Author

Purfview Jul 31, 2024 Maintainer Author

Purfview Oct 30, 2024 Maintainer Author

Purfview
Apr 6, 2024
Maintainer

Replies: 10 comments 52 replies

Purfview Apr 7, 2024
Maintainer Author

Purfview Apr 7, 2024
Maintainer Author

Purfview Apr 9, 2024
Maintainer Author

Purfview Apr 9, 2024
Maintainer Author

Purfview Apr 21, 2024
Maintainer Author

Purfview Apr 22, 2024
Maintainer Author

Purfview Apr 22, 2024
Maintainer Author

Purfview May 6, 2024
Maintainer Author

Purfview Jul 15, 2024
Maintainer Author

Purfview Dec 10, 2024
Maintainer Author

Purfview Jul 31, 2024
Maintainer Author

Purfview Oct 30, 2024
Maintainer Author