Could transcription natively support stereo audio with unique speakers on each channel? #1026
Replies: 8 comments 3 replies
-
See for example 'stereo → 2 × mono files' here; there are multiple methods offered to split stereo audio. The two subtitle files can then be merged (one possibility is https://github.com/cdown/srt). Timestamp accuracy is a work in progress (word-level timestamps), so the two speakers would then be aligned better.
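As a rough illustration of that split-then-merge workflow, here is a minimal sketch assuming pydub and openai-whisper are installed; the file names and model size are placeholders, not anything from the thread:

```python
import whisper
from pydub import AudioSegment

# Sketch only: "call.wav" and the "base" model are placeholders.
# Split the stereo recording into one mono file per channel.
stereo = AudioSegment.from_file("call.wav")
left, right = stereo.split_to_mono()
left.export("left.wav", format="wav")
right.export("right.wav", format="wav")

# Transcribe each channel separately.
model = whisper.load_model("base")
left_result = model.transcribe("left.wav")
right_result = model.transcribe("right.wav")

# Tag each segment with its channel and interleave by start time.
segments = [dict(s, speaker="left") for s in left_result["segments"]] + \
           [dict(s, speaker="right") for s in right_result["segments"]]
segments.sort(key=lambda s: s["start"])
for s in segments:
    print(f"[{s['speaker']}] {s['start']:.2f}-{s['end']:.2f}: {s['text'].strip()}")
```

Instead of printing, each channel's segments could be written out as SRT and merged with the srt tool linked above.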
-
Just a comment that I would assume that splitting the conversation into two separate files and transcribing them separately would hurt recognition accuracy, as you would lose a lot of information about the conversational context. For example, if you do transcription on the unified, stereo file, and speaker A says "What is your favorite color?", whisper will be primed by the language model to hear a color when speaker B says "Blue." If speaker A then says "What do ghosts say?", the model will be primed to hear speaker B say "Boo." But whisper transcribing just speaker B's mono file may struggle to accurately recognize "Blue. Boo." without that context.
-
@daytonturner, did you make any progress since the latest message? My use case is similar and I'm curious to know if you found a good approach.
-
I have a similar project going on. I had great success developing it using https://perplexity.ai/pro?referral_code=A72TCP3U configured to use the Claude 3 model; I could actually have it develop the tool for me, without many errors. My little hack makes an HTML5 player that highlights the transcription, which is also clickable to navigate to the right position in the player. The phrases from the left and right channels are grouped to the left and to the right. Quite impressive without me ever touching the actual code! Let me know if you're interested in the little hack it produced; it's still a work in progress that I might publish if someone's interested. I'm still working on improving the quality of the transcription itself and also implementing some AI features for analysing the content for the end user.
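For anyone curious, the general shape of such a player is easy to sketch. This is not the poster's actual code, just a hypothetical minimal example: it takes channel-tagged segments (speaker, start, text), and writes a page where clicking a phrase seeks the audio, the currently playing phrase is highlighted, and left-channel phrases sit on the left with right-channel phrases on the right.

```python
import html

# Sketch only: page layout, CSS classes, and segment fields are assumptions.
PAGE = """<!DOCTYPE html>
<html><head><style>
  #transcript p { cursor: pointer; max-width: 60%; padding: 4px; }
  .left  { margin-right: auto; }
  .right { margin-left: auto; text-align: right; }
  .playing { background: #ffe28a; }
</style></head>
<body>
<audio id="player" controls src="__AUDIO__"></audio>
<div id="transcript">__ROWS__</div>
<script>
  const player = document.getElementById("player");
  const phrases = document.querySelectorAll("#transcript p");
  // Click a phrase to jump the player to its start time.
  phrases.forEach(p => p.onclick = () => {
    player.currentTime = +p.dataset.start;
    player.play();
  });
  // Highlight the last phrase that has already started.
  player.ontimeupdate = () => {
    let current = null;
    phrases.forEach(p => {
      p.classList.remove("playing");
      if (+p.dataset.start <= player.currentTime) current = p;
    });
    if (current) current.classList.add("playing");
  };
</script>
</body></html>"""

def build_player(audio_file, segments, out_file="player.html"):
    """segments: list of dicts with 'speaker' ('left'/'right'), 'start', 'text'."""
    rows = "\n".join(
        f'<p class="{s["speaker"]}" data-start="{s["start"]}">'
        f'{html.escape(s["text"].strip())}</p>'
        for s in segments
    )
    with open(out_file, "w", encoding="utf-8") as f:
        f.write(PAGE.replace("__AUDIO__", audio_file).replace("__ROWS__", rows))
```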
-
@daytonturner @jelmers19 did either of you get anywhere with this? I have a similar use case of transcribing stereo phone calls.
-
Hello,
-
I have the same need as described above: audio files where caller/receiver are split onto different channels, and it would be great to be able to use that for Voice ID. Our call-center solution provides that type of transcription, but the quality of its transcripts is lower than Whisper's.
-
I hate to add a "me too" here, but it seems like transcribing recorded calls is a HUGE use case for this. Existing transcription technology outside of Whisper often comes with multi-channel support for exactly this reason. I'd love to see it natively in Whisper as well, especially if it can take context into account between the speakers.
-
In my particular use case (telephone call recordings), I have perfect stereo separation between the two speakers: the caller is on the left channel, the callee on the right. By default, whisper will take the audio and downmix it to mono before performing transcription. The result is quite good, but there's no way to determine which transcription came from which speaker, of course.
I'm wondering if there's a way to have whisper (optionally) process stereo audio and mark each transcription with the channel it was heard on?
There's been some discussion of other methods of how to do this in #585, and so far the closest strategy has been to transcribe the whole file as mono like usual, then do a secondary pass on just the right channel, trying to identify start/stop timestamps for when just that speaker was talking, and then attempt to line those timestamps up with timestamps from the initial transcription to attribute those sentences to the right-channel speaker.
This feels largely like a dead end because the timestamp accuracy is not tight enough, or the initial timestamps often include a lot of silence (e.g. running from 0 sec to the first words rather than noting when the first words actually start, or a segment's start time beginning at the previous sentence's end time, which includes multiple seconds of leading silence). As a result the timestamp comparison fails because the two passes don't actually line up well enough.
I've also had challenges with this method, particularly around timestamp accuracy: many phone calls have quick 1-3 word exchanges back and forth, particularly at the start or end of the call, and this throws the timestamp matching for a loop.
Any thoughts or ideas are appreciated, even if it's just input on whether or not stereo processing would even work.
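For reference, here is a minimal sketch of that two-pass matching idea, assuming pydub and openai-whisper; the file names, model size, and the overlap heuristic are placeholders of my own, not an established recipe. It attributes each full-mix segment to the callee if most of it overlaps speech found in the right-channel pass, which is exactly where the loose timestamps described above cause misattribution:

```python
import whisper
from pydub import AudioSegment

model = whisper.load_model("base")

# Pass 1: transcribe the whole file (Whisper downmixes stereo to mono).
full = model.transcribe("call.wav")

# Pass 2: transcribe only the right channel (the callee in this setup).
_, right = AudioSegment.from_file("call.wav").split_to_mono()
right.export("right.wav", format="wav")
right_only = model.transcribe("right.wav")

def overlap(a, b):
    """Seconds of overlap between two segments with 'start'/'end' keys."""
    return max(0.0, min(a["end"], b["end"]) - max(a["start"], b["start"]))

# Attribute each full-mix segment to whichever side covers more of it.
for seg in full["segments"]:
    callee_time = sum(overlap(seg, r) for r in right_only["segments"])
    speaker = "callee" if callee_time > (seg["end"] - seg["start"]) / 2 else "caller"
    print(f"[{speaker}] {seg['start']:.1f}-{seg['end']:.1f}: {seg['text'].strip()}")
```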