Transcription and diarization (speaker identification) #264
53 comments · 134 replies
-
This looks so cool! Unfortunately I couldn't get the notebook to run because of a dependency clash, as reported here. How did you work around that?
-
Is it possible to run pyannote.audio first and then just run whisper on the split-by-speaker chunks? It's a hit to performance, but for offline use it would be worth it for my use case :)
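That order of operations is doable; a minimal sketch with pyannote.audio, whisper, and pydub, where `audio.wav` and `YOUR_HF_TOKEN` are placeholders, not the notebook's exact code:

```python
# Sketch: diarize first with pyannote.audio, then transcribe each speaker turn with whisper.
# Assumes audio.wav exists and YOUR_HF_TOKEN is a valid Hugging Face access token.
import whisper
from pyannote.audio import Pipeline
from pydub import AudioSegment

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token="YOUR_HF_TOKEN")
model = whisper.load_model("medium")

audio = AudioSegment.from_wav("audio.wav")
diarization = pipeline("audio.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    # Cut the speaker turn out of the original file (pydub slices in milliseconds).
    chunk = audio[int(turn.start * 1000):int(turn.end * 1000)]
    chunk.export("chunk.wav", format="wav")
    text = model.transcribe("chunk.wav")["text"]
    print(f"[{turn.start:.1f}s - {turn.end:.1f}s] {speaker}: {text.strip()}")
```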
-
Have you considered training a separate net to isolate a single speaker's audio? If you can separate the speakers into separate channels and then run a diarization classifier on top, the result could be much better.
-
Awesome work @Majdoddin 👏
-
Hi, great work indeed. There is a blind source separation method called ICA (independent component analysis) which can separate sources from each other if there are multiple observations (microphones, i.e. a stereo recording). This could help, at least for two speakers, with the stereo recordings from modern laptops or smartphones, and could be performed as a first processing step to help the following methods. There is a scikit-learn implementation called FastICA: https://scikit-learn.org/stable/auto_examples/decomposition/plot_ica_blind_source_separation.html What do you think?
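A toy sketch of that FastICA idea, assuming a two-channel recording (`stereo.wav`, a placeholder) where both microphones pick up both speakers; real room recordings violate the instantaneous-mixture assumption, so treat it as an illustration only:

```python
# Sketch: blind source separation of a stereo recording with FastICA.
# Assumes stereo.wav has 2 channels and the mixture is roughly instantaneous/linear,
# which real room recordings often are not.
import numpy as np
from scipy.io import wavfile
from sklearn.decomposition import FastICA

rate, data = wavfile.read("stereo.wav")        # data shape: (n_samples, 2)
x = data.astype(np.float64)

ica = FastICA(n_components=2, random_state=0)
sources = ica.fit_transform(x)                 # one column per estimated source

# Normalize and write each estimated source to its own mono file.
for i in range(sources.shape[1]):
    s = sources[:, i]
    s = s / (np.max(np.abs(s)) + 1e-9)
    wavfile.write(f"source_{i}.wav", rate, (s * 32767).astype(np.int16))
```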
-
Just wanted to report back an update - you no longer need the workaround reported by @Majdoddin. Installing pyannote.audio, then whisper, will spit out an error similar to "pyannote-audio 2.0.1 requires huggingface-hub<0.9,>=0.7, but you have huggingface-hub 0.10.1 which is incompatible." Ignore this and try importing the packages. Currently working for me in the same venv using Python 3.10.6 on an M1.
-
So let's talk about a simple situation. I do a podcast where I always record over Cleanfeed. It's either me by myself, or two of us where the recording is on split tracks. Can I use pyannote.audio as part of the process to identify speakers? It seems easy enough if the wave files are separate, and if it can be done, would anyone know how offhand? I have the transcription process running pretty well; I have processed all 78 episodes of my podcast, but names would be great...
-
Another approach is to create 2 files from the stereo recording, transcribe both, and simply order both results by start time.
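A small sketch of that channel-split idea with openai-whisper and pydub, assuming a stereo file `call.wav` (a placeholder) with one speaker per channel; the merge is just a sort on whisper's segment start times:

```python
# Sketch: split a stereo file into its two channels, transcribe each, and merge by start time.
# Assumes one speaker per channel in call.wav.
import whisper
from pydub import AudioSegment

model = whisper.load_model("small")

left, right = AudioSegment.from_wav("call.wav").split_to_mono()
left.export("left.wav", format="wav")
right.export("right.wav", format="wav")

merged = []
for path, speaker in [("left.wav", "SPEAKER_LEFT"), ("right.wav", "SPEAKER_RIGHT")]:
    result = model.transcribe(path)
    for seg in result["segments"]:
        merged.append((seg["start"], speaker, seg["text"].strip()))

for start, speaker, text in sorted(merged):
    print(f"{start:7.1f}s {speaker}: {text}")
```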
-
Hey, I was trying to run your code on Google Colab and I ran into an error with pyannote. https://colab.research.google.com/drive/12W6bR-C6NIEjAML19JubtzHPIlVxdaUq?usp=sharing#scrollTo=RQyROdrfsvk4 When running this line: pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization')
I went to the pyannote GitHub and followed the steps for an authentication token, but I still get this error.
Did you have this issue? Any ideas? Thanks!
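For what it's worth, the gated pyannote pipeline needs the Hugging Face token passed explicitly; a minimal sketch, where "hf_xxx" is a placeholder and it is assumed you have already accepted the model's terms on the Hub:

```python
# Sketch: load the gated pyannote pipeline with an explicit Hugging Face token.
# "hf_xxx" is a placeholder; the model's terms must be accepted on the Hub first.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="hf_xxx",
)
diarization = pipeline("audio.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(turn.start, turn.end, speaker)
```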
-
https://colab.research.google.com/drive/1V-Bt5Hm2kjaDb4P1RyMSswsDKyrzc2-3?usp=sharing I think this code is actually quick and works great. Found it on Twitter: https://twitter.com/dwarkesh_sp/status/1579672641887408129
-
How about this? https://medium.com/saarthi-ai/who-spoke-when-build-your-own-speaker-diarization-module-from-scratch-e7d725ee279
-
Hey! This is an incredible project. Thanks so much for dedicating your time to it and documenting it for all of us :) I want to get in on the fun! However, I'm running into issues with timing. The program seems to catch on the line Peace and Love,
-
See also this one. Diarization and ASR are run independently in the current version. Next, I will try @crazy4pi314's idea.
-
Does this notebook diarize more than 2 speakers? Suppose there is an audio recording of 5+ speakers, will it be able to give a clean result?
-
Hello everyone. First of all, thank you for the great idea and the code to do it. I want to test whether it really works, so I have added all the code shown in the Colab, but when I run it I get a dependency error.
Here is the error:
-
Hey everybody, I have a requirement for speaker diarization: I have a whisper API to translate real-time audio, but I am finding no way to separate the users speaking while I send audio, so I am unable to generate a proper transcription.
-
Forgive my ignorance. Could I use this tool to separate multiple-speaker audio from one audio file, with an output for each person found?
-
Any idea why I'm getting this error on the Google Colab Pyannote_plays_and_Whisper_rhymes_v_2_1? I accepted both terms and created a read API token on HF.
-
Hello, I have received your message and will handle it as soon as possible. Thank you!
-
First of all, thanks to everyone who has put time into this. It's really awesome to have this much done. Is there any way to tweak the diarization being done by pyannote? Is it a model that needs to be trained and, if so, how does one go about that? My test file had two male speakers and it missed the speaker switching probably at least half of the time. I submitted a second test file where I had one speaker on the left channel and one on the right channel and got about the same results. I'm really happy with the quality of the translation, though.

What if I had multi-track audio? I get those from time to time. The words would "just" (in quotes because I realize this isn't a simple feat) need to be timestamped on each track, and the multiple timestamped transcripts stitched together in time order into one transcript. Diarization of a single-channel track is also important, but when I have multitrack, it seems like it could be done with less work. I was thinking to test this by manually creating a whisper transcript for each track and then using PowerShell to mash them together. Anyone know offhand what the call to get the word-level timestamping is? I seem to recall reading that somewhere in the comments.
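On the word-level timestamp question: recent openai-whisper releases accept a `word_timestamps` flag in `transcribe()`; a minimal sketch, with `track1.wav` as a placeholder file name:

```python
# Sketch: word-level timestamps from whisper, usable for stitching multitrack transcripts.
# Requires a recent openai-whisper release that supports word_timestamps.
import whisper

model = whisper.load_model("medium")
result = model.transcribe("track1.wav", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        print(f'{word["start"]:7.2f} {word["end"]:7.2f} {word["word"]}')
```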
-
Hello,
-
I want to use this offline on my laptop. Is there a way to install the Whisper and Pyannote Audio models and run them locally on a Windows PC? My WiFi connection drops from time to time. I spent the whole day on Google Colab running the notebook with Chrome open to transcribe a movie's audio, but after 6 hours I came back and saw that the thing had stopped. I want to run the software on my Windows PC. Could anybody give a guide, please?
-
Hi guys,
-
Do you know how I can implement speaker labeling in Rust? I would like to add diarization to Vibe.
-
The pyannote stage provides sentence-level timestamps, but whisper doesn't use those timestamps and creates its own. Is it possible to feed pyannote's timestamps to whisper, to make whisper transcribe those individual intervals and then concatenate the results?
I'd like to make whisper transcribe each of those original 3 intervals without generating its own timestamps.
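One way to approximate that today is to slice the waveform at pyannote's boundaries and run whisper on each slice, so whisper never has to invent timestamps across segment boundaries; a rough sketch, where the `segments` list is a placeholder standing in for the diarization output:

```python
# Sketch: transcribe each pyannote interval separately so whisper's own segmentation
# never crosses a speaker boundary.
import whisper

# Placeholder intervals in seconds; in practice these come from the pyannote output.
segments = [(0.0, 11.4), (11.4, 23.0), (23.0, 30.2)]

model = whisper.load_model("medium")
audio = whisper.load_audio("audio.wav")          # mono float32 at 16 kHz
SR = whisper.audio.SAMPLE_RATE

texts = []
for start, end in segments:
    clip = audio[int(start * SR):int(end * SR)]
    result = model.transcribe(clip)
    texts.append((start, end, result["text"].strip()))

for start, end, text in texts:
    print(f"{start:.1f}-{end:.1f}: {text}")
```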
-
@Majdoddin Hey, not sure if this can help, but I was able to add a way to trim audio files by specific time intervals, as well as output different sections to the console before the HTML/text file is created. I am also looking to add a way to use recorded audio, so if anyone knows how, it would be a tremendous help! GitHub Link
-
I just added high-quality diarization to the Vibe app in https://github.com/thewh1teagle/vibe
-
WhisperX is Whisper with diarization.
-
Whisper's transcription plus Pyannote's Diarization
Update - @johnwyles added HTML output for audio/video files from Google Drive, along with some fixes.
Using the new word-level timestamping of Whisper, the transcription words are highlighted as the video plays, with optional autoscroll. The display on small screens is also improved.
Moreover, the model is loaded just once, thus the whole thing runs much faster now. You can also hardcode your Huggingface token.
Andrej Karpathy suggested training a classifier on top of openai/whisper model features to identify the speaker, so we can visualize the speaker in the transcript. But, as pointed out by Christian Perone, it seems that features from whisper wouldn't be that great for speaker recognition as its main objective is basically to ignore speaker differences.
In Majdoddin/nlp, I use pyannote-audio, a speaker diarization toolkit by Hervé Bredin, to identify the speakers, and then match it with the transcriptions of Whisper. Check the result here.
Edit: To make it easier to match the transcriptions to diarizations by speaker change, Sarah Kaiser suggested running pyannote.audio first and then just running whisper on the split-by-speaker chunks.
For the sake of performance (and transcription quality?), we attach the audio segments into a single audio file with a silent spacer as a separator, and run whisper on it. Enjoy it!
(For the sake of performance, I also tried attaching the audio segments into a single audio file with a silent -or beep- spacer as a separator and running whisper on it; see it on colab. It works on some audio and fails on some, e.g. Dyson's Interview. The problem is that whisper does not reliably make a timestamp on a spacer. See the discussions #139 and #29.)
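For reference, a bare-bones sketch of the spacer variant described above, using pydub and pyannote; the 2-second spacer length, file names, and token are placeholders, and, as noted, whisper may not put a timestamp exactly at every spacer:

```python
# Sketch: concatenate per-speaker chunks with a silent spacer and run whisper once.
# The spacer length (2 s), file names, and token are placeholders; whisper does not always
# emit a timestamp exactly at a spacer, which is why this variant can fail.
import whisper
from pydub import AudioSegment
from pyannote.audio import Pipeline

spacer = AudioSegment.silent(duration=2000)      # 2 s of silence between chunks
audio = AudioSegment.from_wav("audio.wav")

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token="YOUR_HF_TOKEN")
diarization = pipeline("audio.wav")

combined = AudioSegment.empty()
turns = []
for turn, _, speaker in diarization.itertracks(yield_label=True):
    combined += audio[int(turn.start * 1000):int(turn.end * 1000)] + spacer
    turns.append((turn, speaker))

combined.export("combined.wav", format="wav")
model = whisper.load_model("medium")
result = model.transcribe("combined.wav")
# result["segments"] then has to be mapped back to `turns` using the known
# chunk and spacer lengths, which is the step that is not always reliable.
```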