Speaker Diarization #322

Purfview · 2024-11-06T19:19:51Z

Purfview
Nov 6, 2024
Maintainer

Speaker Diarization supported since r193.1.

`--diarize` choices:

pyannote_v3.0 - Fastest for CPU
pyannote_v3.1 - Same as v3.0 but should be faster with CUDA
reverb_v1 - Allegedly better than pyannote v3
reverb_v2 - The slowest, allegedly the best

Other diarization options:

--num_speakers - Number of speakers, when known.
--min_speakers - Minimum number of speakers. Has no effect when num_speakers is provided.
--max_speakers - Maximum number of speakers. Has no effect when num_speakers is provided.
--speaker - To replace 'SPEAKER' string with your own word.
--diarize_device - "cuda" or "cpu". Automatic, no need to touch it
--diarize_threads - Threads. Automatic, no need to touch it
--diarize_dump - Dumps diarization output to a file.

Legal notice: Reverb models are only for personal non-profit use.
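
For scripted runs, the options above can be assembled into an argument list. This is a minimal sketch, not part of the tool: the helper name `build_diarize_command` and the file name `interview.mp4` are hypothetical, the executable name is taken from the commands posted later in this thread, and only flags listed above are used.

```python
def build_diarize_command(media_path, method="pyannote_v3.1", num_speakers=None,
                          min_speakers=None, max_speakers=None, speaker_word=None,
                          exe="faster-whisper-xxl.exe"):
    """Build an argv list for a diarized run (hypothetical helper)."""
    cmd = [exe, media_path, "--diarize", method]
    if num_speakers is not None:
        # --min_speakers/--max_speakers have no effect once --num_speakers
        # is given, so they are left out in that case.
        cmd += ["--num_speakers", str(num_speakers)]
    else:
        if min_speakers is not None:
            cmd += ["--min_speakers", str(min_speakers)]
        if max_speakers is not None:
            cmd += ["--max_speakers", str(max_speakers)]
    if speaker_word is not None:
        # Replaces the default 'SPEAKER' string in the labels.
        cmd += ["--speaker", speaker_word]
    return cmd


# "interview.mp4" is a placeholder file name.
print(build_diarize_command("interview.mp4", num_speakers=2, min_speakers=1))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`.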

1001ruchka · 2024-11-06T21:19:07Z

1001ruchka
Nov 6, 2024

I'm trying to adapt this for a bilingual video. Unfortunately the --diarize option doesn't work with --sentence.
Ideally, it would be possible to specify which speaker numbers to transcribe, and they would then be treated as a single speaker with --sentence. Transcription would be done only on the audio of the given speaker, which would avoid Whisper glitches when the language changes.
Right now I'm trying to transcribe a bilingual video. The English speaker was recognized as 2 voices instead of one: [SPEAKER_00] and [SPEAKER_01]. I used this regular expression in EmEditor's find and replace:
^(?!.*\[SPEAKER_00\].*|.*\[SPEAKER_01\].*).*\S.*$
with replace to a line break (\n).
As a result, I get a subtitle file with only speaker [SPEAKER_00] and [SPEAKER_01] left. Again, in my case it is one speaker, just the program identified it as 2 different voices.
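
The same filtering can be sketched in Python instead of an editor regex. This is only an illustration under assumptions: it assumes the subtitle file is SRT with cues separated by blank lines and that labels appear in the text as `[SPEAKER_00]`; the helper name `keep_speakers` is hypothetical. It works per cue rather than per line, so the index and timestamp lines of kept cues survive, and it can optionally rewrite all kept labels to a single one.

```python
import re


def keep_speakers(srt_text, speakers, merged_label=None):
    """Keep only SRT cues that mention one of the given speaker labels.

    If merged_label is set, every kept label is rewritten to it, which
    covers the case where one person was detected as two voices.
    Cue numbers are not renumbered.
    """
    blocks = re.split(r"\n\s*\n", srt_text.strip())
    kept = []
    for block in blocks:
        if any(f"[{s}]" in block for s in speakers):
            if merged_label is not None:
                for s in speakers:
                    block = block.replace(f"[{s}]", f"[{merged_label}]")
            kept.append(block)
    return "\n\n".join(kept) + "\n"


sample = (
    "1\n00:00:01,000 --> 00:00:02,000\n[SPEAKER_00]: Hello\n\n"
    "2\n00:00:02,000 --> 00:00:03,000\n[SPEAKER_02]: Привет\n\n"
    "3\n00:00:03,000 --> 00:00:04,000\n[SPEAKER_01]: World\n"
)
print(keep_speakers(sample, ["SPEAKER_00", "SPEAKER_01"], merged_label="SPEAKER_00"))
```

This does not address the second problem: Whisper still transcribes the audio of all speakers, so language-switch errors inside a kept cue remain.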

So far I see 2 problems with what I got: 1) you have to merge sentences manually, and 2) Whisper transcribes the audio of all speakers, which can cause errors when the language changes.

11 replies

Purfview Nov 7, 2024
Maintainer Author

Edited your post to hide things under the "spoilers".

1001ruchka Nov 7, 2024

So here's the video:
https://www.youtube.com/watch?v=noHSbloSbhk
One speaker speaks Russian; the other, an interpreter, speaks Russian and English.
I make subtitles with diarization:
faster-whisper-xxl.exe --language en --model "large-v2" --compute_type=float16 -prompt auto --beep_off --print_progress --vad_method pyannote_v3 --diarize --output_format all "C:\Users\alexa\Downloads\ART Test Whisper\ARTEM TAINOV AFTER EAST VS WEST 13 INTERVIEW.mp4"
Note the timing 00:00:35,860 - 00:00:54,180 - Whisper combined Russian and English speakers into one subtitle.
ARTEM TAINOV AFTER EAST VS WEST 13 INTERVIEW.zip

1001ruchka Nov 8, 2024

Video:
https://www.youtube.com/watch?v=g2FoXlSEiqA
The presenter of the master class speaks English, also the interpreter and some participants speak English, but most of the listeners speak Russian.

faster-whisper-xxl.exe --language en --model "large-v2" --compute_type=float16 -prompt auto --beep_off --print_progress --vad_method pyannote_v3 --diarize --output_format all "C:\Users\alexa\Downloads\Марио Whisper test\Что такое лимбическое слушание Беседа с Марио Сальвадором.mp4"

The presenter is recognized as [SPEAKER_05] and [SPEAKER_06]. But at timing 00:07:54,050 the interpreter is also misrecognized as [SPEAKER_06]. At timing 00:08:57,650 the presenter is misrecognized as [SPEAKER_01].

00:14:18,450 - interpreter and presenter's speech are merged.
00:14:29,570 - subtitled without speaker's label

00:15:11,356 - interpreter and presenter's speech are merged.
00:15:22 - Speaker's speech not recognized

limbic.zip

Purfview Nov 16, 2024
Maintainer Author

These examples don't show "The English speaker was recognized as 2 voices instead of one..."

In your examples different speakers are already transcribed in one line and that is expected behavior in such cases atm.

Purfview Nov 19, 2024
Maintainer Author

Check v194.1, it works with --sentence, actually it only works with --sentence now and it's activated automatically.

Elviseras · 2024-11-20T08:13:29Z

Elviseras
Nov 20, 2024

Hi, Purfview.. Thanks for last release and quick development!! You are awesome!!
For my case, I am not interested in achieving more transcription speed, but in obtaining maximum precision regardless of time... Could you help me with the best settings for this? Thank you so much

1 reply

Purfview Nov 20, 2024
Maintainer Author

> Could you help me with the best settings for this?

You tell me, as I neither use diarization nor am interested in it.

> For my case, I am not interested in achieving more transcription speed, but in obtaining maximum precision regardless of time...

As far as I understand, the current state of diarization is not super accurate, if you know better models than those added then let us know.

KA-UZs · 2024-11-20T17:51:05Z

KA-UZs
Nov 20, 2024

Hi, r193.1 json contained the speaker, I can't find it in r194.1. Can you fix it? Thx,
],
"speaker": "SPEAKER_01"
},

8 replies

KA-UZs Nov 20, 2024

One more idea, if the same speaker speaks in several consecutive lines, could you write it in one line? Now a sentence breaks into several lines sometimes.
Sentence part 01
sentence part 02
sentence part 03.
instead of
Sentence part 01 sentence part 02 sentence part 03.
for me, the --sentence does not solve it. Especially at the end of the transcript, a sentence is split into several lines.
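
The request above can be approximated as a post-processing step. This is a minimal sketch, not something the tool does: it assumes the output has already been parsed into `(speaker, text)` pairs, and the helper name `merge_consecutive` is hypothetical. Timestamps are ignored here.

```python
def merge_consecutive(lines):
    """Join consecutive lines spoken by the same speaker into one line."""
    merged = []
    for speaker, text in lines:
        if merged and merged[-1][0] == speaker:
            # Same speaker as the previous entry: append to that entry.
            merged[-1] = (speaker, merged[-1][1] + " " + text.strip())
        else:
            merged.append((speaker, text.strip()))
    return merged


parts = [
    ("SPEAKER_00", "Sentence part 01"),
    ("SPEAKER_00", "sentence part 02"),
    ("SPEAKER_00", "sentence part 03."),
    ("SPEAKER_01", "Another speaker."),
]
for speaker, text in merge_consecutive(parts):
    print(f"[{speaker}]: {text}")
```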

Purfview Nov 20, 2024
Maintainer Author

> Hi, I use it to process the text, statistics, e.g. who talked how much
> Can you put it in this version too?

Hi, why would you want less accurate statistics, as in json the segments can be wildly non-split?
I could, but what's the point in having less accurate diarization there?
json contains pure whisper segmentation without alteration.

> If they speak at the same time (overlapping speech), can you easily tell who said the text?

There is no easy way to do that. What makes it worse is that a diarization model can output overlaps even when there are no overlaps at all.

> I send a donate. I tested 194.1, but I couldn't find a speaker in the json. It would be easier than to look it up from _dump.

A donation would be appreciated. You don't need to look at "dump" as speakers are already assigned in the other output formats.

Purfview Nov 20, 2024
Maintainer Author

> One more idea, if the same speaker speaks in several consecutive lines, could you write it in one line? Now a sentence breaks into several lines sometimes.
> Sentence part 01
> sentence part 02
> sentence part 03.
> instead of
> Sentence part 01 sentence part 02 sentence part 03.
> for me, the --sentence does not solve it. Especially at the end of the transcript, a sentence is split into several lines.

Not sure I understand what you mean, --sentence IS to break the sentences to different segments.

KA-UZs Nov 21, 2024

> Hi, why would you want less accurate statistics, as in json the segments can be wildly non-split?
> I could, but what's the point in having less accurate diarization there?
> .json contains pure whisper segmentation without alteration.

The json has the start and end time of each transcript segment; the diarization has the start and end time of each speaker turn. Before r193.1 I matched them by their overlap, but many times it is not clear.
We use the same diarization, but yours is much faster, maybe even more accurate speaker text matching.

The r194.1 json does not include the text under who said it. It was in r193.1 but you removed it.
If it was included, I could check faster where there is a difference. I'm using my current method of speech speaker matching. But if yours is a better switch, it would speed up the comparison. I haven't had time to test the latest version in detail, but I'll try :-)
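
The overlap-based matching described above can be sketched roughly as follows. This is an illustration of the general idea only, not the method used by Faster-Whisper-XXL or by the poster; the function names and the `(start, end, speaker)` tuple layout are assumptions. Each transcript segment gets the speaker whose turns share the most time with it.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speaker(seg_start, seg_end, turns):
    """Pick the speaker with the largest total overlap with a segment.

    turns: iterable of (start, end, speaker) from the diarization.
    Returns None when no turn overlaps the segment at all.
    """
    totals = {}
    for start, end, speaker in turns:
        shared = overlap(seg_start, seg_end, start, end)
        if shared > 0:
            totals[speaker] = totals.get(speaker, 0.0) + shared
    if not totals:
        return None
    return max(totals, key=totals.get)


turns = [(0.0, 2.0, "SPEAKER_00"), (2.0, 5.0, "SPEAKER_01")]
print(assign_speaker(1.5, 4.0, turns))
```

A long, unsplit segment that spans several turns still gets a single label with this approach, which is the ambiguity mentioned above.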

Purfview Nov 21, 2024
Maintainer Author

If you want to test something with the less accurate diarization mix results then it's your personal thing.
Still, I can't make sense of what the point of that is.

Joelson-Forte · 2024-11-25T04:01:59Z

Joelson-Forte
Nov 25, 2024

@Purfview
I'm glad you've resumed this project. We haven't heard from you for a long time. As far as I know, you were taking care of your health. If that's true, I hope you're doing well.

I really liked this option, and although you may not be crazy about this feature, in the world of video editing it's a game-changer that speeds up content production. In seconds, I can have all the dialogue from a specific "speaker", instead of spending hours searching for all the lines from that speaker.

This feature definitely needs improvement, because it's not very accurate in diarization, but this is caused by the component that performs the diarization (pyannote_v3.0, pyannote_v3.1, reverb_v1 and reverb_v2) and not by Faster-Whisper-XXL.

Here's a typical example: In this short video of just 1 minute, there are 5 speakers and the diarization only managed to record 3 speakers. The two women in the video were given the name [Speaker_00] even though they have different voices and [Speaker_02] was given to two men with very different voices as the first is a man in his early 50s and the other is a teenager of approximately 19 years old.

https://huggingface.co/datasets/Joelson-Forte/Auto-Captions-for-Vegas/resolve/main/Diarize%20Test.zip?download=true

Here is the command line used: faster-whisper-xxl.exe "Path/of/file/Diarize Test.mp4" --model small --device cpu --verbose true --max_line_width 40 --max_line_count 2 --diarize pyannote_v3.1 --task transcribe --output_format srt --output_dir source

1 reply

Purfview Nov 27, 2024
Maintainer Author

--diarize pyannote_v3.0 recognized 4 out of 5; it missed only that guy at 01:00:

[ 00:00:00.030 -->  00:00:02.039] A SPEAKER_00
[ 00:00:02.039 -->  00:00:02.528] B SPEAKER_03
[ 00:00:02.528 -->  00:00:02.730] C SPEAKER_00
[ 00:00:02.798 -->  00:00:07.101] D SPEAKER_03
[ 00:00:07.304 -->  00:00:11.995] E SPEAKER_00
[ 00:00:12.062 -->  00:00:13.007] F SPEAKER_03
[ 00:00:17.310 -->  00:00:17.327] G SPEAKER_03
[ 00:00:17.327 -->  00:00:17.378] H SPEAKER_02
[ 00:00:17.378 -->  00:00:17.884] I SPEAKER_03
[ 00:00:17.884 -->  00:00:18.222] J SPEAKER_02
[ 00:00:18.222 -->  00:00:18.677] K SPEAKER_03
[ 00:00:19.757 -->  00:00:34.337] L SPEAKER_02
[ 00:00:34.607 -->  00:00:35.620] M SPEAKER_03
[ 00:00:36.345 -->  00:00:36.362] N SPEAKER_03
[ 00:00:36.362 -->  00:00:36.919] O SPEAKER_01
[ 00:00:36.919 -->  00:00:36.970] P SPEAKER_03
[ 00:00:37.881 -->  00:00:41.205] Q SPEAKER_03
[ 00:00:41.543 -->  00:00:44.800] R SPEAKER_01
[ 00:00:45.323 -->  00:00:46.504] S SPEAKER_03
[ 00:00:47.162 -->  00:00:51.887] T SPEAKER_03
[ 00:00:51.989 -->  00:00:53.474] U SPEAKER_01
[ 00:00:54.149 -->  00:00:55.988] V SPEAKER_01
[ 00:00:56.207 -->  00:01:00.696] W SPEAKER_03
[ 00:01:00.848 -->  00:01:07.851] X SPEAKER_03
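
For the "who talked how much" statistics mentioned earlier in the thread, a listing like the one above can be summed per speaker. This is a minimal sketch that assumes lines in exactly the layout shown here (`[ start --> end] ID SPEAKER_xx`); whether the file written by `--diarize_dump` uses this same layout is an assumption, and the names `talk_time` and `to_seconds` are hypothetical.

```python
import re

# Matches lines such as: [ 00:00:00.030 -->  00:00:02.039] A SPEAKER_00
LINE = re.compile(
    r"\[\s*(\d+):(\d+):(\d+\.\d+)\s*-->\s*(\d+):(\d+):(\d+\.\d+)\]\s+\S+\s+(\S+)"
)


def to_seconds(hours, minutes, seconds):
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def talk_time(dump_text):
    """Return {speaker: total seconds} summed over all matching lines."""
    totals = {}
    for match in LINE.finditer(dump_text):
        h1, m1, s1, h2, m2, s2, speaker = match.groups()
        duration = to_seconds(h2, m2, s2) - to_seconds(h1, m1, s1)
        totals[speaker] = totals.get(speaker, 0.0) + duration
    return totals


sample = """\
[ 00:00:00.030 -->  00:00:02.039] A SPEAKER_00
[ 00:00:02.039 -->  00:00:02.528] B SPEAKER_03
[ 00:00:02.528 -->  00:00:02.730] C SPEAKER_00
"""
for speaker, seconds in sorted(talk_time(sample).items()):
    print(f"{speaker}: {seconds:.3f} s")
```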

EvHanHan · 2024-12-01T21:04:27Z

EvHanHan
Dec 1, 2024

Hi,
I am trying to use speaker detection for text output instead of srt.
Do you know how I can get it?
thanks!

4 replies

Purfview Dec 1, 2024
Maintainer Author

--output_format text or -f text

EvHanHan Dec 2, 2024

With this config:

{
  "extra_args": [
    "--model", "tiny"
  ],
  "srt_args": [
    "--standard",
    "--diarize", "pyannote_v3.0",
    "--num_speakers", "2"
  ]
}

It works: speakers are detected for the srt format.

With this config:

{
  "extra_args": [
    "--model", "tiny"
  ],
  "text_args": [
    "--diarize", "pyannote_v3.0",
    "--num_speakers", "2"
  ]
}

It does not work for detecting speakers in text format.

Any idea?

Purfview Dec 2, 2024
Maintainer Author

> It does not work

How do you know that it doesn't work?

Purfview Dec 2, 2024
Maintainer Author

> I just tested it again with an audio of two speakers. There is no speaker information in the text output.

Apparently it doesn't run with -f text, I've no idea how you are allegedly getting any output.
Add any other format beside it to make it work: -f srt text

steipal · 2024-12-03T16:55:33Z

steipal
Dec 3, 2024

Great addition with the diarization! :-)
We use diarization as a separate independent step in a complex workflow using FFAStrans. But it would be nice to do it without the extra separate pyannote package now that you have implemented it in faster-whisper-xxl. So in that regard I have two feature requests:

1. Fetch speaker embeddings (if available?) and dump.
2. Make it possible to only do diarization without transcribing.

Keep up the great work! :-)

5 replies

Purfview Dec 4, 2024
Maintainer Author

Hi, not sure what you meant with "1".
Anyway, please make a post at ideas so I wouldn't forget it.

steipal Dec 4, 2024

Embeddings are the voice prints as vectors, used to distinguish one voice from another. They are useful when you want to build a database of known voices. In pyannote we enable them with the following Python lines:

diarization, embeddings = pipeline({"waveform": waveform, "sample_rate": sample_rate}, return_embeddings=True)
for s, speaker in enumerate(diarization.labels()):
    print(f"embeddings {embeddings[s]}")

I have no idea if this is feasible in your implementation. Anyway, I will post at "ideas". Thanks! :-)

thbaero Dec 10, 2024

That would be a nice addition indeed.

I'm using the current setup to transcribe live conversations in a business context with roughly the same people (ie my team members and typical conversation partners in my company).

As a user I would be happy to export the "fingerprint" of a voice to a separate text file where I can define a specific alias for each voice "fingerprint". This way the software could recognize the speaker by comparing the voice fingerprints from a new recording with the database of known fingerprints.

If a known voice is found the defined alias is used instead of SPEAKER_##.
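
The lookup described above could be sketched as a cosine-similarity comparison between a new embedding and stored ones. This is a generic illustration, not a feature of Faster-Whisper-XXL: the function names, the toy 3-dimensional vectors, and the `0.7` threshold are all assumptions made for the example, and a usable threshold would have to be tuned on real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_alias(embedding, known, threshold=0.7):
    """Return the alias of the most similar known voice, or None.

    known: dict mapping alias -> reference embedding.
    threshold: hypothetical cut-off below which no alias is assigned.
    """
    best_alias, best_score = None, threshold
    for alias, reference in known.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_alias, best_score = alias, score
    return best_alias


# Toy vectors for illustration; real voice embeddings are much longer.
known_voices = {"Alice": [1.0, 0.0, 0.0], "Bob": [0.0, 1.0, 0.0]}
label = match_alias([0.9, 0.1, 0.0], known_voices) or "SPEAKER_00"
print(label)
```

When no known voice is close enough, the caller falls back to the generic SPEAKER_## label, as in the last lines of the example.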

Jessomadic Dec 12, 2024

this would be an amazing feature. It seems like this repo is able to do it based on a dir of known voices to crossmatch with!
https://github.com/NavodPeiris/speechlib

chrischris616 Dec 12, 2024

Great idea! Absolutely valuable!

Joelson-Forte · 2024-12-04T03:59:07Z

Joelson-Forte
Dec 4, 2024

@Purfview
It’s been a long time since I’ve seen such an interesting and useful feature. I hope it gets better and more accurate over time, as this feature has enormous potential. Thank you for the addition!

0 replies
