Transcription and diarization (speaker identification) #264
53 comments · 134 replies
-
This looks so cool! Unfortunately I couldn't get the notebook to run because of a dependency clash, as reported here. How did you work around that?
-
Is it possible to run pyannote.audio first and then just run whisper on the split-by-speaker chunks? It's a hit to performance, but for offline use it would be worth it for my use case :)
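That order of operations is doable; a minimal sketch with pyannote.audio, whisper, and pydub, where `audio.wav` and `YOUR_HF_TOKEN` are placeholders, not the notebook's exact code:

```python
# Sketch: diarize first with pyannote.audio, then transcribe each speaker turn with whisper.
# Assumes audio.wav exists and YOUR_HF_TOKEN is a valid Hugging Face access token.
import whisper
from pyannote.audio import Pipeline
from pydub import AudioSegment

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token="YOUR_HF_TOKEN")
model = whisper.load_model("medium")

audio = AudioSegment.from_wav("audio.wav")
diarization = pipeline("audio.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    # Cut the speaker turn out of the original file (pydub slices in milliseconds).
    chunk = audio[int(turn.start * 1000):int(turn.end * 1000)]
    chunk.export("chunk.wav", format="wav")
    text = model.transcribe("chunk.wav")["text"]
    print(f"[{turn.start:.1f}s - {turn.end:.1f}s] {speaker}: {text.strip()}")
```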
-
Have you considered training a separate net to isolate a single speaker's audio? If you can separate the speakers into separate channels and then run a diarization classifier on top, the result could be much better.
-
Awesome work @Majdoddin 👏
-
Hi, great work indeed. There is a blind source separation method called ICA (independent component analysis) which can separate sources from each other if there are multiple observations (microphones, i.e. a stereo recording). This could help, at least for two speakers, with the stereo recordings from modern laptops or smartphones, and could be performed as a first processing step to help the following methods. There is a scikit-learn implementation called FastICA: https://scikit-learn.org/stable/auto_examples/decomposition/plot_ica_blind_source_separation.html What do you think?
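A toy sketch of that FastICA idea, assuming a two-channel recording (`stereo.wav`, a placeholder) where both microphones pick up both speakers; real room recordings violate the instantaneous-mixture assumption, so treat it as an illustration only:

```python
# Sketch: blind source separation of a stereo recording with FastICA.
# Assumes stereo.wav has 2 channels and the mixture is roughly instantaneous/linear,
# which real room recordings often are not.
import numpy as np
from scipy.io import wavfile
from sklearn.decomposition import FastICA

rate, data = wavfile.read("stereo.wav")        # data shape: (n_samples, 2)
x = data.astype(np.float64)

ica = FastICA(n_components=2, random_state=0)
sources = ica.fit_transform(x)                 # one column per estimated source

# Normalize and write each estimated source to its own mono file.
for i in range(sources.shape[1]):
    s = sources[:, i]
    s = s / (np.max(np.abs(s)) + 1e-9)
    wavfile.write(f"source_{i}.wav", rate, (s * 32767).astype(np.int16))
```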
-
Just wanted to report back an update - you no longer need the workaround reported by @Majdoddin. Installing pyannote.audio, then whisper, will spit out an error similar to "pyannote-audio 2.0.1 requires huggingface-hub<0.9,>=0.7, but you have huggingface-hub 0.10.1 which is incompatible." Ignore this and try importing the packages. Currently working for me in the same venv using Python 3.10.6 on an M1.
-
So let's talk about a simple situation. I do a podcast where I always record over Cleanfeed. It's either me by myself, or two of us where the recording is on split tracks. Can I use pyannote.audio as part of the process to identify speakers? It seems easy enough if the wave files are separate, and if it can be done, would anyone know how offhand? I have the transcription process running pretty well; I have processed all 78 episodes of my podcast, but names would be great...
-
Another approach is to create 2 files from the stereo recording, transcribe both, and simply order both results by start time.
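A small sketch of that channel-split idea with openai-whisper and pydub, assuming a stereo file `call.wav` (a placeholder) with one speaker per channel; the merge is just a sort on whisper's segment start times:

```python
# Sketch: split a stereo file into its two channels, transcribe each, and merge by start time.
# Assumes one speaker per channel in call.wav.
import whisper
from pydub import AudioSegment

model = whisper.load_model("small")

left, right = AudioSegment.from_wav("call.wav").split_to_mono()
left.export("left.wav", format="wav")
right.export("right.wav", format="wav")

merged = []
for path, speaker in [("left.wav", "SPEAKER_LEFT"), ("right.wav", "SPEAKER_RIGHT")]:
    result = model.transcribe(path)
    for seg in result["segments"]:
        merged.append((seg["start"], speaker, seg["text"].strip()))

for start, speaker, text in sorted(merged):
    print(f"{start:7.1f}s {speaker}: {text}")
```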
-
Hey, I was trying to run your code on Google Colab and I ran into an error with pyannote. https://colab.research.google.com/drive/12W6bR-C6NIEjAML19JubtzHPIlVxdaUq?usp=sharing#scrollTo=RQyROdrfsvk4 When running this line: pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization')
I went to the pyannote GitHub and followed the steps for an authentication token, but I still get this error.
Did you have this issue? Any ideas? Thanks!
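For what it's worth, the gated pyannote pipeline needs the Hugging Face token passed explicitly; a minimal sketch, where "hf_xxx" is a placeholder and it is assumed you have already accepted the model's terms on the Hub:

```python
# Sketch: load the gated pyannote pipeline with an explicit Hugging Face token.
# "hf_xxx" is a placeholder; the model's terms must be accepted on the Hub first.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="hf_xxx",
)
diarization = pipeline("audio.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(turn.start, turn.end, speaker)
```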
-
https://colab.research.google.com/drive/1V-Bt5Hm2kjaDb4P1RyMSswsDKyrzc2-3?usp=sharing I think this code is actually quick and works great. Found it on Twitter: https://twitter.com/dwarkesh_sp/status/1579672641887408129
-
How about this? https://medium.com/saarthi-ai/who-spoke-when-build-your-own-speaker-diarization-module-from-scratch-e7d725ee279
-
Hey! This is an incredible project. Thanks so much for dedicating your time to it and documenting it for all of us :) I want to get in on the fun! However, I'm running into issues with timing. The program seems to catch on the line Peace and Love,
-
See also this one. Diarization and ASR are run independently in the current version. Next, I will try @crazy4pi314's idea.
-
Does this notebook diarize more than 2 speakers? Suppose there is an audio recording of 5+ speakers, will it be able to give a clean result?
-
Hello everyone. First of all, thank you for the great idea and the code to do it. I want to test whether it really works, so I have added all the code shown in the Colab, but when I run it I get a dependency error.
Here is the error:
-
Hey everybody, I have a requirement for speaker diarization: I have a whisper API to translate real-time audio, but I am finding no way to separate the users speaking while I send audio, so I am unable to generate a proper transcription.
-
Forgive my ignorance. Could I use this tool to separate multiple-speaker audio from one audio file, with an output for each person found?
-
Any idea why I'm getting this error on the Google Colab Pyannote_plays_and_Whisper_rhymes_v_2_1? I accepted both terms and created a read API token on HF.
-
Hello, I have received your message and will handle it as soon as possible. Thank you!
-
First of all, thanks to everyone who has put time into this. It's really awesome to have this much done. Is there any way to tweak the diarization being done by pyannote? Is it a model that needs to be trained and, if so, how does one go about that? My test file had two male speakers and it missed the speaker switching probably at least half of the time. I submitted a second test file where I had one speaker on the left channel and one on the right channel and got about the same results. I'm really happy with the quality of the translation, though.

What if I had multi-track audio? I get those from time to time. The words would "just" (in quotes because I realize this isn't a simple feat) need to be timestamped on each track, and the multiple timestamped transcripts stitched together in time order into one transcript. Diarization of a single-channel track is also important, but when I have multitrack, it seems like it could be done with less work. I was thinking to test this by manually creating a whisper transcript for each track and then using PowerShell to mash them together. Anyone know offhand what the call to get the word-level timestamping is? I seem to recall reading that somewhere in the comments.
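On the word-level timestamp question: recent openai-whisper releases accept a `word_timestamps` flag in `transcribe()`; a minimal sketch, with `track1.wav` as a placeholder file name:

```python
# Sketch: word-level timestamps from whisper, usable for stitching multitrack transcripts.
# Requires a recent openai-whisper release that supports word_timestamps.
import whisper

model = whisper.load_model("medium")
result = model.transcribe("track1.wav", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        print(f'{word["start"]:7.2f} {word["end"]:7.2f} {word["word"]}')
```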
-
Hello,
-
I want to use this offline on my laptop. Is there a way to install the Whisper and Pyannote Audio models and run them locally on a Windows PC? My WiFi connection drops from time to time. I spent the whole day on Google Colab running the notebook with Chrome open to transcribe a movie's audio, but after 6 hours I came back and saw that the thing had stopped. I want to run the software on my Windows PC. Could anybody give a guide, please?
-
Hi guys,
-
Do you know how I can implement speaker labeling in Rust? I would like to add diarization to Vibe.
-
The pyannote stage provides sentence-level timestamps, but whisper doesn't use those timestamps and creates its own. Is it possible to feed pyannote's timestamps to whisper, to make whisper transcribe those individual intervals and then concatenate the results?
I'd like to make whisper transcribe each of those original 3 intervals without generating its own timestamps.
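One way to approximate that today is to slice the waveform at pyannote's boundaries and run whisper on each slice, so whisper never has to invent timestamps across segment boundaries; a rough sketch, where the `segments` list is a placeholder standing in for the diarization output:

```python
# Sketch: transcribe each pyannote interval separately so whisper's own segmentation
# never crosses a speaker boundary.
import whisper

# Placeholder intervals in seconds; in practice these come from the pyannote output.
segments = [(0.0, 11.4), (11.4, 23.0), (23.0, 30.2)]

model = whisper.load_model("medium")
audio = whisper.load_audio("audio.wav")          # mono float32 at 16 kHz
SR = whisper.audio.SAMPLE_RATE

texts = []
for start, end in segments:
    clip = audio[int(start * SR):int(end * SR)]
    result = model.transcribe(clip)
    texts.append((start, end, result["text"].strip()))

for start, end, text in texts:
    print(f"{start:.1f}-{end:.1f}: {text}")
```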
-
@Majdoddin Hey, not sure if this can help, but I was able to add a way to trim audio files by specific time intervals, as well as output different sections to the console before the HTML/text file is created. I am also looking to add a way to use recorded audio, so if anyone knows how, it would be a tremendous help! GitHub Link
-
I just added high-quality diarization to the Vibe app in https://github.com/thewh1teagle/vibe
-
WhisperX is Whisper with diarization.
-
Whisper's transcription plus Pyannote's Diarization
Update - @johnwyles added HTML output for audio/video files from Google Drive, along with some fixes.
Using the new word-level timestamping of Whisper, the transcription words are highlighted as the video plays, with optional autoscroll. The display on small screens is also improved.
Moreover, the model is loaded just once, thus the whole thing runs much faster now. You can also hardcode your Huggingface token.
Andrej Karpathy suggested training a classifier on top of openai/whisper model features to identify the speaker, so we can visualize the speaker in the transcript. But, as pointed out by Christian Perone, it seems that features from whisper wouldn't be that great for speaker recognition as its main objective is basically to ignore speaker differences.
In Majdoddin/nlp, I use pyannote-audio, a speaker diarization toolkit by Hervé Bredin, to identify the speakers, and then match it with the transcriptions of Whisper. Check the result here.
Edit: To make it easier to match the transcriptions to diarizations by speaker change, Sarah Kaiser suggested running pyannote.audio first and then just running whisper on the split-by-speaker chunks.
For the sake of performance (and transcription quality?), we attach the audio segments into a single audio file with a silent spacer as a separator, and run whisper on it. Enjoy it!
(For the sake of performance, I also tried attaching the audio segments into a single audio file with a silent -or beep- spacer as a separator and running whisper on it; see it on colab. It works on some audio and fails on some, e.g. Dyson's Interview. The problem is that whisper does not reliably make a timestamp on a spacer. See the discussions #139 and #29.)
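For reference, a bare-bones sketch of the spacer variant described above, using pydub and pyannote; the 2-second spacer length, file names, and token are placeholders, and, as noted, whisper may not put a timestamp exactly at every spacer:

```python
# Sketch: concatenate per-speaker chunks with a silent spacer and run whisper once.
# The spacer length (2 s), file names, and token are placeholders; whisper does not always
# emit a timestamp exactly at a spacer, which is why this variant can fail.
import whisper
from pydub import AudioSegment
from pyannote.audio import Pipeline

spacer = AudioSegment.silent(duration=2000)      # 2 s of silence between chunks
audio = AudioSegment.from_wav("audio.wav")

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token="YOUR_HF_TOKEN")
diarization = pipeline("audio.wav")

combined = AudioSegment.empty()
turns = []
for turn, _, speaker in diarization.itertracks(yield_label=True):
    combined += audio[int(turn.start * 1000):int(turn.end * 1000)] + spacer
    turns.append((turn, speaker))

combined.export("combined.wav", format="wav")
model = whisper.load_model("medium")
result = model.transcribe("combined.wav")
# result["segments"] then has to be mapped back to `turns` using the known
# chunk and spacer lengths, which is the step that is not always reliable.
```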