Could transcription natively support stereo audio with unique speakers on each channel? #1026
Replies: 8 comments 3 replies
-
See for example 'stereo → 2 × mono files' here; there are multiple methods offered to split stereo audio. The two subtitle files can then be merged (one possibility is https://github.com/cdown/srt). Timestamp accuracy is a work in progress (word-level timestamps), so the two speakers would then be aligned better.
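As a rough illustration of that split-then-merge workflow, here is a minimal sketch assuming pydub and openai-whisper are installed; the file names and model size are placeholders, not anything from the thread:

```python
import whisper
from pydub import AudioSegment

# Sketch only: "call.wav" and the "base" model are placeholders.
# Split the stereo recording into one mono file per channel.
stereo = AudioSegment.from_file("call.wav")
left, right = stereo.split_to_mono()
left.export("left.wav", format="wav")
right.export("right.wav", format="wav")

# Transcribe each channel separately.
model = whisper.load_model("base")
left_result = model.transcribe("left.wav")
right_result = model.transcribe("right.wav")

# Tag each segment with its channel and interleave by start time.
segments = [dict(s, speaker="left") for s in left_result["segments"]] + \
           [dict(s, speaker="right") for s in right_result["segments"]]
segments.sort(key=lambda s: s["start"])
for s in segments:
    print(f"[{s['speaker']}] {s['start']:.2f}-{s['end']:.2f}: {s['text'].strip()}")
```

Instead of printing, each channel's segments could be written out as SRT and merged with the srt tool linked above.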
-
Just a comment that I would assume that splitting the conversation into two separate files and transcribing them separately would hurt recognition accuracy, as you would lose a lot of information about the conversational context. For example, if you do transcription on the unified, stereo file, and speaker A says "What is your favorite color?", whisper will be primed by the language model to hear a color when speaker B says "Blue." If speaker A then says "What do ghosts say?", the model will be primed to hear speaker B say "Boo." But whisper transcribing just speaker B's mono file may struggle to accurately recognize "Blue. Boo." without that context.
-
@daytonturner, did you make any progress since the latest message? My use case is similar and I'm curious to know if you found a good approach.
-
I have a similar project going on. I had great success developing it using https://perplexity.ai/pro?referral_code=A72TCP3U configured to use the Claude 3 model; I could actually have it develop the tool for me, without many errors. My little hack makes an HTML5 player that highlights the transcription, which is also clickable to navigate to the right position in the player. The phrases from the left and right channels are grouped to the left and to the right. Quite impressive without me ever touching the actual code! Let me know if you're interested in the little hack it produced; it's still a work in progress that I might publish if someone's interested. I'm still working on improving the quality of the transcription itself and also implementing some AI features for analysing the content for the end user.
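For anyone curious, the general shape of such a player is easy to sketch. This is not the poster's actual code, just a hypothetical minimal example: it takes channel-tagged segments (speaker, start, text), and writes a page where clicking a phrase seeks the audio, the currently playing phrase is highlighted, and left-channel phrases sit on the left with right-channel phrases on the right.

```python
import html

# Sketch only: page layout, CSS classes, and segment fields are assumptions.
PAGE = """<!DOCTYPE html>
<html><head><style>
  #transcript p { cursor: pointer; max-width: 60%; padding: 4px; }
  .left  { margin-right: auto; }
  .right { margin-left: auto; text-align: right; }
  .playing { background: #ffe28a; }
</style></head>
<body>
<audio id="player" controls src="__AUDIO__"></audio>
<div id="transcript">__ROWS__</div>
<script>
  const player = document.getElementById("player");
  const phrases = document.querySelectorAll("#transcript p");
  // Click a phrase to jump the player to its start time.
  phrases.forEach(p => p.onclick = () => {
    player.currentTime = +p.dataset.start;
    player.play();
  });
  // Highlight the last phrase that has already started.
  player.ontimeupdate = () => {
    let current = null;
    phrases.forEach(p => {
      p.classList.remove("playing");
      if (+p.dataset.start <= player.currentTime) current = p;
    });
    if (current) current.classList.add("playing");
  };
</script>
</body></html>"""

def build_player(audio_file, segments, out_file="player.html"):
    """segments: list of dicts with 'speaker' ('left'/'right'), 'start', 'text'."""
    rows = "\n".join(
        f'<p class="{s["speaker"]}" data-start="{s["start"]}">'
        f'{html.escape(s["text"].strip())}</p>'
        for s in segments
    )
    with open(out_file, "w", encoding="utf-8") as f:
        f.write(PAGE.replace("__AUDIO__", audio_file).replace("__ROWS__", rows))
```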
-
@daytonturner @jelmers19 did either of you get anywhere with this? I have a similar use case of transcribing stereo phone calls.
-
Hello,
-
I have the same need as described above: audio files where caller/receiver are split onto different channels, and it would be great to be able to use that for Voice ID. Our call-center solution provides that type of transcription, but the quality of its transcripts is lower than Whisper's.
-
I hate to add a "me too" here, but it seems like transcribing recorded calls is a HUGE use case for this. Existing transcription technology outside of Whisper often comes with multi-channel support for exactly this reason. I'd love to see it natively in Whisper as well, especially if it can take context into account between the speakers.
-
In my particular use case (telephone call recordings), I have perfect stereo separation between the two speakers: the caller is on the left channel, the callee on the right. By default, whisper will take the audio and downmix it to mono before performing transcription. The result is quite good, but there's no way to determine which transcription came from which speaker, of course.
I'm wondering if there's a way to have whisper (optionally) process stereo audio and mark each transcription with the channel it was heard on?
There's been some discussion of other methods of how to do this in #585, and so far the closest strategy has been to transcribe the whole file as mono like usual, then do a secondary pass on just the right channel, trying to identify start/stop timestamps for when just that speaker was talking, and then attempt to line those timestamps up with timestamps from the initial transcription to attribute those sentences to the right-channel speaker.
This feels largely like a dead end because the timestamp accuracy is not tight enough, or the initial timestamps often include a lot of silence (e.g. running from 0 sec to the first words rather than noting when the first words actually start, or a segment's start time beginning at the previous sentence's end time, which includes multiple seconds of leading silence). As a result the timestamp comparison fails because the two passes don't actually line up well enough.
I've also had challenges with this method, particularly around timestamp accuracy: many phone calls have quick 1-3 word exchanges back and forth, particularly at the start or end of the call, and this throws the timestamp matching for a loop.
Any thoughts or ideas are appreciated, even if it's just input on whether or not stereo processing would even work.
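For reference, here is a minimal sketch of that two-pass matching idea, assuming pydub and openai-whisper; the file names, model size, and the overlap heuristic are placeholders of my own, not an established recipe. It attributes each full-mix segment to the callee if most of it overlaps speech found in the right-channel pass, which is exactly where the loose timestamps described above cause misattribution:

```python
import whisper
from pydub import AudioSegment

model = whisper.load_model("base")

# Pass 1: transcribe the whole file (Whisper downmixes stereo to mono).
full = model.transcribe("call.wav")

# Pass 2: transcribe only the right channel (the callee in this setup).
_, right = AudioSegment.from_file("call.wav").split_to_mono()
right.export("right.wav", format="wav")
right_only = model.transcribe("right.wav")

def overlap(a, b):
    """Seconds of overlap between two segments with 'start'/'end' keys."""
    return max(0.0, min(a["end"], b["end"]) - max(a["start"], b["start"]))

# Attribute each full-mix segment to whichever side covers more of it.
for seg in full["segments"]:
    callee_time = sum(overlap(seg, r) for r in right_only["segments"])
    speaker = "callee" if callee_time > (seg["end"] - seg["start"]) / 2 else "caller"
    print(f"[{speaker}] {seg['start']:.1f}-{seg['end']:.1f}: {seg['text'].strip()}")
```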